How AI is reshaping agile software development — a two-dimensional framework spanning four modes of AI engagement and five organisational contexts
For two decades, agile software development — and Scrum in particular — has been the dominant paradigm for how software is designed, developed, and delivered. Its core assumptions are human-centred: small cross-functional teams, iterative delivery, continuous feedback, collective ownership of code, and emergent design through collaboration. These assumptions now face their most significant challenge.
AI is not merely adding a new tool to the developer's workbench. It is simultaneously changing what software engineers build, how they build it, the medium through which they build it, and — most disruptively — who (or what) does the building. Each of these four shifts interacts differently with agile principles, and each plays out differently depending on the kind of organisation and software being created.
Most commentary treats "AI and software development" as a single conversation. This study proposes a two-dimensional framework that disentangles it. The first dimension is the mode of AI engagement — four prepositions (of, with, through, to) that describe qualitatively different relationships between software engineering and AI. The second dimension is the organisational context — five categories that capture the radically different constraints, incentives, and cultures within which agile practices operate.
How has AI changed the agile methodology, and how must agile methodologies — especially Scrum — evolve in response across different modes of AI engagement and organisational contexts?
Establishing what AI is disrupting: the assumptions, practices, and roles that Scrum codifies
Scrum has become the de facto standard for organising software delivery work. Its adoption across industries — from startups to government — makes it the right baseline against which to measure AI's impact. But that baseline needs to be examined critically: Scrum was designed for a world where all productive work was done by humans, where code was written line by line, and where the primary bottleneck was coordination between people.
Every core Scrum element — roles, ceremonies, artefacts, and principles — now faces questions that its creators never anticipated. The study will examine each element through the lens of AI disruption.
Roles: Product Owner, Scrum Master, Development Team — designed for human-only teams. What happens when AI is a team member?
Ceremonies: Sprint Planning, Daily Standup, Sprint Review, Retrospective — paced for human velocity. AI-augmented velocity changes the calculus.
Artefacts: Product Backlog, Sprint Backlog, Increment — sized for human estimation. Story points and velocity metrics lose meaning if AI changes throughput non-linearly.
Principles: Self-organisation, cross-functionality, inspect-and-adapt — rooted in human cognition and social dynamics. How do they translate to hybrid human-AI teams?
The study will not take an uncritical view of Scrum. Agile practices were already under strain before AI — from "Scrum theatre" (going through the motions without embracing the principles), over-ceremonialisation, difficulties scaling to large organisations, and tension between agile's preference for emergent architecture and the needs of complex enterprise systems. AI arrives into this already-contested landscape, and in some cases it may resolve longstanding problems rather than create new ones.
An escalating spectrum of entanglement between software engineering and AI, from building AI systems to transferring engineering agency
SE of AI: The engineering discipline required to design, build, test, deploy, and maintain AI-powered software at production scale. Scrum meets stochastic behaviour, data-code entanglement, and ML experimentation culture.
SE with AI: Using AI copilots, code generators, and test assistants to enhance human developers' productivity. Scrum's velocity metrics, estimation, and review practices face recalibration.
SE through AI: Creating software by describing intent to AI systems that generate the implementation. Scrum's assumption that development is the bottleneck gives way to specification and validation as the critical path.
SE to AI: The transfer of software engineering agency to AI systems — from autonomous maintenance through to independent development. Scrum's human-centred model confronts non-human agency.
The same AI capability has radically different implications depending on what kind of organisation is building what kind of software
Corporate IT: Internal applications, systems integration, digital transformation. Scrum used for in-house teams and managed service providers. Often constrained by legacy, governance, and risk aversion.
ISVs: Independent Software Vendors building commercial products. Scrum at the core of product development. Competitive pressure to adopt AI first. Revenue tied to feature velocity.
Enterprise Software: Large-scale ERP, CRM, platform vendors. Massive codebases, complex release trains. SAFe and scaled agile frameworks. Ecosystem of partners and integrators.
Startups & SMEs: Small teams, high agility, resource-constrained. Scrum often informal or adapted. AI adoption driven by existential competitive pressure and the promise of "doing more with less."
Safety-Critical: Aerospace, medical devices, autonomous vehicles, defence. Regulated development processes that sit uneasily with agile. AI introduces novel assurance challenges around non-determinism.
How each AI engagement mode interacts with each organisational context — the core analytical structure of the study
| Context | SE of AI | SE with AI | SE through AI | SE to AI |
|---|---|---|---|---|
| Corporate IT | Building internal AI/ML features; MLOps immaturity; data governance as blocker. Scrum teams lack ML expertise; sprint cycles misaligned with experimentation | Copilot adoption for legacy modernisation and integration work; productivity claims vs. security risk. Velocity metrics disrupted; code review burden shifts | Business analysts generating internal tools via AI; citizen development at scale. Product Owner role transforms; sprint planning becomes specification review | Autonomous maintenance of legacy estate; AI-driven incident response. Team composition questions; what does an "agile team" manage vs. oversee? |
| ISVs | AI-native product development; evaluation and safety as product concerns. Sprint definitions of "done" must include AI-specific quality gates | Competitive pressure to maximise developer velocity; talent strategy implications. Estimation models broken; pair programming redefined as human-AI pairing | Rapid prototyping and MVP generation; AI-first product design. Sprint zero collapses; time-to-first-increment measured in hours not weeks | AI-maintained products; autonomous feature development from usage data. Product Owner as AI supervisor; continuous deployment becomes continuous generation |
| Enterprise Software | AI features in platforms (Salesforce Einstein, SAP Joule); scaling AI across product suites. SAFe meets MLOps; programme-level coordination of AI capabilities | Thousands of developers using AI tools; governance at scale; IP risk. Scaled agile ceremonies strained; cross-team AI-generated code dependencies | Platform-level code generation; AI-driven customisation and configuration. Partner/integrator ecosystem disruption; who configures vs. who builds? | Self-evolving enterprise platforms; autonomous patching and migration. Release trains become autonomous; human governance model for AI-driven evolution |
| Startups & SMEs | AI-native startups; ML as core product; lean teams building complex systems. Scrum informality helps, but testing and evaluation discipline often lacking | Force multiplication for small teams; founder-developers using AI extensively. "10x developer" thesis tested; dependency risk if AI tools change or pricing shifts | Non-technical founders building MVPs; solo developers creating complex products. Scrum may be unnecessary for one-person AI-assisted development; new frameworks needed | AI-maintained products after founding team moves on; autonomous SaaS. Business model implications; software as self-sustaining entity |
| Safety-Critical | AI in medical devices, autonomous vehicles, avionics; certification challenges. Sprint-based delivery in tension with V&V requirements and regulatory cycles | Cautious adoption; AI-generated code in regulated environments; audit trails. Traceability requirements intensify; every AI suggestion must be documented | Largely blocked by regulatory constraints; specification languages more viable than NL. Formal methods revival? AI generating provably correct implementations | Autonomous safety-critical systems: the hardest governance challenge. Accountability frameworks; certification of AI-as-engineer; regulatory evolution |
Each cell in this matrix represents a distinct research territory with its own evidence base, challenges, and implications for agile practice. The study will not treat all 20 cells equally — but it will identify which cells are most consequential and where the evidence is strongest.
Building AI systems — particularly those based on large language models, multimodal architectures, and agentic workflows — demands fundamentally different approaches to testing, deployment, versioning, and quality assurance. Classical software engineering assumes deterministic behaviour: given the same inputs, a well-built system produces the same outputs. AI systems are inherently stochastic. This single fact ripples through every Scrum practice.
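To make the testing implication concrete, a minimal sketch of how a team might adapt: where deterministic code is tested with exact assertions, a stochastic component can instead be sampled repeatedly and gated on an aggregate pass rate. Everything here (`passes_stochastic_gate`, the stubbed model) is illustrative, not drawn from any particular framework.

```python
import random

def passes_stochastic_gate(run_model, test_input, check_output,
                           n_runs=50, min_pass_rate=0.5):
    """Exact-match assertions fail for stochastic systems; instead,
    sample the system repeatedly and gate on an aggregate pass rate."""
    passes = sum(check_output(run_model(test_input)) for _ in range(n_runs))
    return passes / n_runs >= min_pass_rate

# Stubbed stochastic "model" that is usually, but not always, correct.
def fake_model(x):
    return x + random.choice([0, 0, 0, 1])

random.seed(0)  # pin the sampler so this sketch is reproducible
gate_ok = passes_stochastic_gate(fake_model, 2, lambda out: out == 2)
```

The design point is that the sprint artefact becomes a pass-rate threshold agreed with the Product Owner, not a binary green/red test — which is exactly the kind of quality gate a deterministic definition of "done" never had to express.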
The data-code entanglement at the heart of ML systems means the training data is as much a part of the system's behaviour as its source code. Scrum's definition of "done" must expand to include data quality, model evaluation, bias assessment, and safety testing — none of which fit neatly into a two-week sprint.
A critical tension exists between ML research culture (notebooks, experimentation, rapid iteration on model architectures) and production engineering culture (reliability, observability, incident response). Scrum sits uncomfortably between these two cultures, and many organisations have developed parallel processes rather than adapting Scrum itself.
Sprint Planning: Work estimation is fundamentally harder. Model training runs have uncertain duration; evaluation may reveal the approach is a dead end, invalidating sprint commitments. Spike-heavy planning becomes the norm.
Definition of Done: Must expand to include model evaluation benchmarks, bias testing, safety assessment, data quality validation — significantly lengthening the path to "done."
Retrospectives: Become the most valuable ceremony, as experimentation-heavy work generates more learning than delivery per sprint. Retrospectives must capture experimental findings, not just process improvements.
This is the strand receiving the most attention today, driven by explosive adoption of AI coding assistants — GitHub Copilot, Cursor, Claude Code, Amazon CodeWhisperer. The core proposition is augmentation: AI handles the mechanical aspects of development while humans focus on design, architecture, and judgement.
The evidence is substantial but contradictory. Studies show productivity gains of 20–55% on certain task types, but the gains are unevenly distributed. Junior developers often see the largest speedups on boilerplate; senior developers gain most on exploratory work and unfamiliar codebases. Documented risks include automation complacency (accepting generated code without scrutiny) and skill atrophy (eroding foundational understanding).
For Scrum, the "with" strand creates an immediate practical crisis: velocity metrics lose their meaning. If a developer who previously completed 8 story points per sprint now completes 20, has the team's capacity genuinely increased two-and-a-half-fold? Or has the definition of a story point shifted? Sprint planning, backlog grooming, and capacity forecasting all need recalibration — and most teams are improvising rather than working from a principled framework.
The Klarna problem is instructive: early claims of dramatic headcount reduction through AI augmentation were walked back as the organisation discovered the hidden costs of AI-generated technical debt, reduced system comprehension, and institutional knowledge loss.
Sprint Planning: Story point estimation breaks down. Teams need new calibration — perhaps measuring complexity-of-specification rather than implementation effort, or introducing AI-adjusted velocity baselines.
Daily Standup: "What did you do yesterday?" becomes partly "What did you prompt for and validate yesterday?" New norms needed around AI tool usage transparency and shared understanding of AI-generated code.
Code Review / Sprint Review: The review burden shifts dramatically. AI-generated code may be syntactically correct but architecturally incoherent, introducing subtle quality problems that only surface later. Reviews must become more architectural and less syntactic.
Team Composition: The optimal team size and skill mix may change. If AI augmentation makes each developer 2–3× more productive on implementation, do teams shrink? Or do they redirect freed capacity toward design, testing, and user research?
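One way a team might ground the recalibration described above, sketched under the assumption that completed story points per sprint are tracked and the sprint in which AI tooling was adopted is known. The function name and the numbers are hypothetical, not a standard metric.

```python
def ai_adjusted_baseline(points_per_sprint, adoption_sprint, window=3):
    """Compare rolling average velocity before and after AI tool adoption,
    reporting an adjustment factor rather than a single velocity figure."""
    pre = points_per_sprint[:adoption_sprint]
    post = points_per_sprint[adoption_sprint:]
    pre_avg = sum(pre[-window:]) / min(window, len(pre))
    post_avg = sum(post[-window:]) / min(window, len(post))
    return {
        "pre_ai_velocity": pre_avg,
        "post_ai_velocity": post_avg,
        "adjustment_factor": round(post_avg / pre_avg, 2),
    }

history = [8, 9, 8, 10, 18, 21, 20]  # hypothetical story points per sprint
baseline = ai_adjusted_baseline(history, adoption_sprint=4)
```

A large adjustment factor is a signal that the unit of estimation has shifted, not that capacity has simply multiplied: the principled response is to re-baseline what a story point means, rather than to forecast from the inflated number.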
This is the paradigm shift strand. Where "with AI" treats AI as a tool within an essentially unchanged process, "through AI" reconceives the process itself. Software is created not by writing code but by describing intent — through natural language, examples, constraints, and feedback — to AI systems that generate the implementation.
This includes prompt-driven development, agent-orchestrated engineering (AI agents that decompose requirements, generate code, write tests, and iterate autonomously), and specification-first development where formal or semi-formal specifications are implemented by AI.
For Scrum, this strand is the most disruptive because it challenges Scrum's fundamental assumption: that development is the bottleneck. If a working application can be generated from a detailed specification in hours rather than sprints, the entire sprint cadence, estimation model, and team structure require rethinking. The bottleneck shifts to specification quality, validation thoroughness, and architectural coherence across generated components.
The critical open question is ceiling complexity: at what scale does "through AI" break down? Evidence suggests it works well for small to medium applications but struggles with large-scale systems requiring deep domain knowledge, complex state management, and integration across organisational boundaries.
Sprint Cadence: Two-week sprints may be too long when generation is fast and too short when validation is thorough. A dual rhythm may emerge: rapid generation cycles nested within longer validation and integration sprints.
Product Owner Role: Transforms from prioritiser-of-features into specifier-of-intent. The PO becomes the primary "developer" in a meaningful sense — their specifications are the input that produces software. This demands far more technical fluency than traditional PO roles assumed.
Definition of Done: Shifts from "code works and is tested" to "generated system is validated, architecturally sound, maintainable, secure, and aligned with specification." Validation becomes the heavyweight activity, not implementation.
Team Structure: The classic Scrum team of 5–9 developers may give way to smaller teams of specifiers, validators, and AI orchestrators — or even solo practitioners who can generate significant systems independently.
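To illustrate validation becoming the heavyweight activity, a minimal sketch: an executable specification expressed as named properties, run against a generated implementation. The property names and the `generated_sort` stand-in are purely illustrative assumptions, not an established tool.

```python
def validate_against_spec(implementation, spec_properties, cases):
    """Check a generated implementation against an executable spec:
    every named property must hold for every test case."""
    failures = []
    for name, holds in spec_properties.items():
        for case in cases:
            if not holds(implementation, case):
                failures.append((name, case))
    return failures

generated_sort = sorted  # stand-in for an AI-generated implementation

spec = {
    "output_is_ordered": lambda f, xs: all(
        a <= b for a, b in zip(f(xs), f(xs)[1:])),
    "output_is_permutation": lambda f, xs: sorted(f(xs)) == sorted(xs),
    "length_preserved": lambda f, xs: len(f(xs)) == len(xs),
}

failures = validate_against_spec(generated_sort, spec,
                                 cases=[[3, 1, 2], [], [5, 5, 1]])
```

In this paradigm the specification and its property suite, not the generated code, are the durable team artefacts: regenerating the implementation is cheap, so "done" is whatever the validation harness can demonstrate.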
This is the most speculative and most consequential strand. "To AI" captures the transfer of engineering agency from human practitioners to AI systems. It ranges from the near-term (AI systems that autonomously maintain, optimise, and evolve existing codebases) to the long-term (AI as the primary "engineer" with humans in supervisory or governance roles).
The distinction from "through" is agency and autonomy. In the "through" paradigm, humans remain the initiators and decision-makers. In the "to" paradigm, AI systems possess sufficient context, judgement, and capability to act independently across the software lifecycle — identifying what needs to be built, making architectural choices, and managing trade-offs.
For Scrum, this strand poses an existential question: what is the role of a human development process when the development is not done by humans? Scrum was designed to coordinate human effort. If AI performs the engineering, the framework must evolve from a development methodology into a governance and oversight methodology — closer to an audit and assurance function than a delivery framework.
The workforce implications require honest analysis. If even a partial transfer of agency occurs, demand shifts from implementation skills to governance, oversight, domain expertise, and ethical judgement. The profession does not disappear but is fundamentally reshaped.
Fundamental Reframing: Scrum evolves from a delivery methodology to a governance methodology. Sprint Reviews become audit reviews. Retrospectives become governance assessments. The Scrum Master becomes a process assurance role.
Product Owner: Becomes the primary human authority — the person who defines intent, sets boundaries, and accepts or rejects AI-generated evolution. This is a more consequential and demanding role than today's PO.
The Team: "Development Team" gives way to "Oversight Team" or "Governance Team." Membership shifts toward domain experts, quality engineers, security specialists, and ethicists — people who can evaluate AI's work rather than perform it.
Cadence: Sprint cadence may decouple from development cadence entirely. AI may develop continuously while human governance operates on a review cadence — daily, weekly, or event-triggered rather than sprint-aligned.
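One way to picture that decoupling, as a hedged sketch (the risk score, thresholds, and routing labels are invented for illustration): autonomous changes are routed either to immediate human review or batched for the next scheduled governance review, so oversight is event-triggered rather than sprint-aligned.

```python
def route_for_review(change, pending_count,
                     risk_threshold=0.7, batch_size=10):
    """Event-triggered governance: escalate high-risk autonomous changes
    immediately; batch routine ones for the periodic review."""
    if change["risk_score"] >= risk_threshold:
        return "immediate_review"
    if pending_count + 1 >= batch_size:
        return "scheduled_review_now"
    return "queued"

decision = route_for_review({"id": "chg-42", "risk_score": 0.9},
                            pending_count=2)
```

The human cadence here is set by risk and volume, not by the calendar — which is the core of the shift from a delivery rhythm to a governance rhythm.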
Analytical lenses to apply across both dimensions of the framework
Several themes cut across all four strands and all five contexts. These are not separate categories but lenses that must be applied systematically to each cell of the matrix.
A phased approach to producing an authoritative, evidence-based analysis grounded in practitioner reality
Literature review: Systematic review of academic research, industry studies, and practitioner evidence. Priority: separating measured outcomes from vendor claims, particularly on AI productivity metrics.
Expert interviews: Semi-structured interviews spanning: enterprise Scrum Masters, ML engineering leads, ISV CTOs, safety-critical SE practitioners, agile coaches adapting to AI, and CS educators revising curricula.
Case studies: Deep-dive cases across the matrix: 2–3 per organisational context, selected to cover at least two AI engagement strands each. Documented evidence over self-reported narratives.
Practitioner survey: Quantitative survey of software professionals mapping: current AI tool adoption, perceived agile practice impact, workforce concerns, and organisational readiness — segmented by the five contexts.
Phase 1: Validate the two-dimensional framework (4 strands × 5 contexts) with 6–8 expert reviewers. Test the Agile/Scrum baseline assumptions. Refine the matrix — identify which cells are most consequential and where evidence is strongest. Develop interview protocol segmented by organisational context. Design survey instrument with context-aware routing.
Phase 2: Conduct 20–25 expert interviews across organisational contexts and AI strands. Deploy practitioner survey targeting 250+ respondents. Begin case study identification and data collection — prioritising cases where agile practices have been explicitly adapted in response to AI. Build evidence matrix mapping findings to framework cells.
Phase 3: Analyse findings per strand and per organisational context. Develop the "Agile Evolution Model" — a maturity framework for how Scrum practices adapt across the four strands. Draft case studies. Conduct cross-cutting theme analysis. Assess UK policy implications with specific attention to the 9% problem and public sector delivery.
Phase 4: Write the full study report structured around both dimensions. Produce the executive summary and practitioner-oriented recommendations. Develop the "Agile-AI Readiness Assessment" — a self-assessment tool for teams. Submit for expert peer review from both agile and AI communities.
Phase 5: Finalise report incorporating reviewer feedback. Produce derivative outputs: four-strand summary infographic, matrix poster, newsletter series (one per strand + one per context = 9 editions), policy brief, interactive web version, and presentation deck. Launch via Digital Leaders Network, Digital Economy Dispatches, LinkedIn newsletter, and targeted events.
Approximately 30,000–35,000 words of substantive analysis, structured to serve both strand-focused and context-focused readers