AI Coding Agent Capability Maturity Model

Tier Summary Table

Tier	Name	Primary Agent Role	Autonomy Level	Orchestration Scope	Agent Topology	Security Posture	Context Engineering	Agent Memory	Measurement
1	Supervised Helpers	Inline assistant	Step-by-step approval	None	Single helper per developer	Manual review, developer education	Ad-hoc individual prompts	None (stateless)	Informal, self-reported
2	Structured Pair Programming	Collaborative partner	Partially supervised, bounded tasks	Light tool calls	Single agent per session	Automated scanning, dependency verification	Team templates, shared instruction files	Episodic (session-persisted rules)	Team-level quality gates
3	Task-Specific Agents	Delegated task agent	Delegated execution, output reviewed	Per-task pipelines	Parent agent with subagents per task	Supply chain verification, SBOM, secrets management	Org-wide knowledge bases, architectural constraints	Semantic (cross-session knowledge accumulation)	Objective metrics segmented by AI involvement
4	Orchestrated Agent Workflows	Coordinated agents across SDLC	Semi-autonomous within guardrails	Multi-step workflow orchestration	Flat agent teams per workflow	Red teaming, progressive trust, audit trails	Dedicated context engineering function	Procedural (learned workflows, skills ecosystem)	Multi-dimensional ROI, end-to-end delivery metrics
5	Autonomous Multi-Agent Platform	Persistent agent teams	High autonomy with exception handling	Enterprise-wide agent mesh	Hierarchical teams of teams	Hardened isolation, kill switches, continuous compliance	Enterprise-wide context governance	Self-improving (background consolidation, feedback-integrated)	Continuous evaluation lifecycle

Tier 1 — Supervised Helpers

At Tier 1, developers are experimenting with AI coding assistants embedded in their IDE or chat, using them for small, low-risk tasks and approving each action.

Typical usage

Code completion, small refactors, test scaffolding, and documentation stubs generated via inline suggestions or short prompts.
AI is treated as a smarter autocomplete, not a workflow owner; outputs are always reviewed and often heavily edited.

Autonomy and oversight

Step-by-step approval: every change is applied manually or via explicit approval of each suggested edit or command.
No unattended execution; AI cannot run tools, scripts, or external commands on its own.

Integration and tooling

Single-tool integration inside the IDE or web UI only.
No plugins, external APIs, or repository-wide automation invoked by agents.

Agent topology

Single helper agent per developer or per session; no subagents or additional agent roles.
All reasoning and generation occur within one agent context.

Security & supply chain

Manual peer review of all AI-generated code is the primary security control.
Developer education on AI-specific risks (hallucinated dependencies, insecure defaults) is required.
No AI-generated code deployed without secondary human review.
Consider: OWASP LLM Top 10 awareness training, policy prohibiting AI for security-critical code paths (authentication, cryptography, payment processing).

Context engineering

Ad-hoc individual prompting with no shared patterns or persistence.
Context is limited to single-file scope (4–32K effective tokens).
No instruction files, prompt libraries, or shared conventions for AI interaction.

Agent self-improvement & memory

Agents are stateless; each session starts from zero.
No persistent memory, learning, or feedback loops.
Working memory only (the context window itself).

Cost & token economics

Basic awareness of AI tool subscription costs.
No per-developer usage tracking, token monitoring, or budget allocation.
Consider: establishing baseline cost visibility before scaling adoption.

Developer impact & workforce

Usage is individual and ad hoc; no team-wide standards.
No formal assessment of AI's effect on developer skills, trust, or cognitive load.
Developers retain full comprehension of code they ship since AI output is heavily edited.

Measurement & observability

Metrics and governance are minimal or absent; risk is managed primarily by developer diligence.
AI impact is assessed informally through self-report (note: perceived productivity gains often diverge significantly from objective measurements).

Compliance & governance

No formal AI usage policy.
No attribution of AI-generated code in commits.
Minimal awareness of IP and licensing implications.

Risks and constraints

Limited value beyond personal productivity because practices are not standardized.
Risk of inconsistent quality and missed defects if individuals over-trust suggestions without explicit policy or training.
AI-generated code can contain a high rate of OWASP Top 10 vulnerabilities; without systematic review, these propagate silently.

Tier 2 — Structured Pair Programming

At Tier 2, teams adopt AI coding agents as structured pair-programming partners with emerging standards for when and how to use them.

Typical usage

Multi-line suggestions, function-level implementations, and test generation guided by explicit prompt patterns and coding conventions.
Developers routinely ask agents to explain code, propose designs, and refactor modules, then iteratively refine outputs.

Autonomy and oversight

Partially supervised: agents can perform bounded edits but still require human confirmation before committing.
Short-lived tool use (running tests, formatting, static analysis) is allowed via agent-issued commands with human feedback loops.

Integration and tooling

Agents are integrated with local tooling such as test runners, linters, and formatters.
Simple scripts or CI checks may be generated by the agent, but deployment and production changes remain manual.

Agent topology

Single agent per developer session acting as a pair-programming partner; no explicit subagents.
The agent can call tools but does not coordinate with other agents.

Security & supply chain

Automated SAST scanning tuned for LLM-generated patterns on every PR.
SCA dependency verification: all AI-suggested packages validated against public registries before installation.
AI-generated code attributed in commits to enable audit differentiation.
Sandboxing enabled for all agent sessions, at minimum via hardened containers.
Consider: OWASP scanning gates, dependency lockfile enforcement, IDE-level security scanning plugins.

Context engineering

Team-level prompt templates and shared instruction files adopted (e.g., AGENTS.md or tool-specific equivalent).
Context spans multi-file scope (64–200K effective tokens).
Teams document AI usage patterns, prompting styles, and review requirements.
Consider: keeping instruction files under 300 lines, focusing on what the agent would get wrong without the file.

Agent self-improvement & memory

Episodic memory: session learnings persist via human-authored instruction files that capture debugging insights, build commands, and conventions.
The load-execute-write-back pattern begins: context is loaded at session start and updated manually when agents consistently make mistakes.
No automated memory capture or consolidation.

Cost & token economics

Per-developer cost tracking established; team-level budgets for AI tooling.
Model routing for simple vs. complex requests to reduce waste. Most agent tokens tend to be non-generative (file reading, context re-sending, tool output); actual code generation often accounts for only a small fraction of total token usage.
Consider: routing simple requests (formatting, boilerplate) to smaller/cheaper models, establishing per-developer daily cost baselines (industry average: ~$6/developer/day for capable agents).
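
The routing idea above can be sketched as a small dispatcher. This is a minimal illustration under stated assumptions, not a production router: the model names, the set of "simple" task kinds, and the length threshold are all hypothetical placeholders.

```python
from dataclasses import dataclass

# Hypothetical model identifiers; substitute whatever your provider offers.
SMALL_MODEL = "small-cheap-model"
LARGE_MODEL = "large-capable-model"

# Assumed heuristic: these task kinds are treated as "simple".
SIMPLE_TASK_KINDS = {"format", "boilerplate", "rename", "docstring"}


@dataclass(frozen=True)
class Request:
    kind: str    # e.g. "format", "refactor", "design"
    prompt: str


def route(request: Request, max_simple_chars: int = 2000) -> str:
    """Return the model to use for a request.

    Simple, short requests go to the cheaper model; everything else
    goes to the more capable one.
    """
    is_simple = request.kind in SIMPLE_TASK_KINDS
    is_short = len(request.prompt) <= max_simple_chars
    return SMALL_MODEL if (is_simple and is_short) else LARGE_MODEL
```

A real router would likely use richer signals (files touched, past failure rates), but even a crude rule like this makes the cost tradeoff explicit and reviewable.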

Developer impact & workforce

Code review processes explicitly account for AI-generated changes, with expectations for tests, documentation, and rationale.
Teams distinguish between delegation mode (the agent generates code and the developer pastes it in) and explanation mode (conceptual inquiry, follow-up questions); explanation mode is encouraged to preserve skill development.
Consider: monitoring for deskilling signals — developers who primarily delegate to AI may see lower comprehension and debugging ability compared to those who use AI for conceptual inquiry and follow-up questions.

Measurement & observability

Quality gates: defect density per KLOC, code churn rate, and duplication rate tracked.
AI-assisted PRs tracked separately from human-only PRs to enable comparative analysis.
Consider: DX Core 4 framework (Speed, Effectiveness, Quality, Impact) as a starting measurement structure.
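
As a rough sketch of the two tracking practices above, defect density can be computed separately for AI-assisted and human-only PRs. The `PullRequest` fields are hypothetical, and measuring density per thousand changed lines (rather than per thousand lines of the whole codebase) is an assumption made here for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullRequest:
    ai_assisted: bool
    lines_changed: int
    defects: int  # defects later traced back to this PR


def defect_density_per_kloc(prs: list[PullRequest]) -> float:
    """Defects per 1,000 changed lines across a set of PRs."""
    total_lines = sum(pr.lines_changed for pr in prs)
    if total_lines == 0:
        return 0.0
    return 1000 * sum(pr.defects for pr in prs) / total_lines


def segment_by_ai(prs: list[PullRequest]) -> dict[str, float]:
    """Compute defect density separately for AI-assisted and human-only PRs."""
    ai = [pr for pr in prs if pr.ai_assisted]
    human = [pr for pr in prs if not pr.ai_assisted]
    return {
        "ai_assisted": defect_density_per_kloc(ai),
        "human_only": defect_density_per_kloc(human),
    }
```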

Compliance & governance

AI usage policy documented and communicated to all teams.
AI-generated code is attributed in version control metadata.
Basic IP awareness: teams understand that AI-generated code may carry licensing implications.
Consider: evaluating vendor compliance certifications, establishing data residency requirements for AI model interactions.

Risks and constraints

Over-reliance on AI for boilerplate can mask skills gaps if training and mentoring are not adjusted.
Limited support for cross-file or cross-service changes; agents still operate mostly within small scopes.
PR review time tends to increase significantly with AI adoption; review processes must scale accordingly.
Code duplication risk: AI-generated code can trend toward copy/paste rather than refactoring.

Tier 3 — Task-Specific Agents with Tools and Subagents

At Tier 3, organizations introduce task-specific AI agents that can use plugins, tools, or skills to perform targeted, multi-step development tasks end-to-end under supervision, often decomposing work into subagents.

Typical usage

Agents handle defined tasks such as adding a feature flag, updating API clients, migrating endpoints, or improving test coverage.
Work is framed as discrete jobs with clear acceptance criteria; humans focus on scoping tasks and reviewing results.

Autonomy and oversight

Delegated execution with output review: agents can read multiple files, call tools, and propose commits, but final changes are gated through human review and CI.
Unattended execution is allowed only for short-lived, low-risk tasks rather than long-running code modifications.

Integration and tooling

Agents use plugins/skills for repository search, issue trackers, feature flag systems, and basic CI integration.
Simple per-task orchestrations emerge ("analyze → modify → test → summarize").

Agent topology

A single parent agent may spawn subagents for decomposition (e.g., test-writer, migration agent), but these remain local to the task.
From the developer's perspective, this still appears as one logical agent team per task.

Security & supply chain

Automated supply chain verification for AI-generated dependencies: all packages validated against allowlists before inclusion.
SBOM generation for AI-suggested dependencies with private registry mirrors.
AI-specific security training for all developers; model-specific security profiles documented (security characteristics vary widely across models and prompt configurations).
Secrets management via runtime injection or workload identity; no static credentials in agent-accessible environments.
Consider: private package registries, lockfile cryptographic hashing, slopsquatting detection tooling (AI models can hallucinate non-existent package names at significant rates).
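
One possible shape for the allowlist check and slopsquatting detection mentioned above is shown below. It is a sketch only: the allowlist contents and the 0.8 similarity cutoff are assumptions, and a real deployment would load the allowlist from a private registry or policy file.

```python
import difflib

# Hypothetical organization allowlist; in practice this would come from
# a private registry or policy file.
ALLOWLIST = {"requests", "numpy", "pandas", "flask"}


def check_dependency(name: str, allowlist: set[str] = ALLOWLIST) -> str:
    """Classify a proposed package name.

    Returns "allowed", "suspicious: looks like <pkg>", or "unknown".
    """
    normalized = name.strip().lower()
    if normalized in allowlist:
        return "allowed"
    # A near-match to an approved package is a possible typo or
    # hallucinated name, so flag it rather than silently rejecting it.
    close = difflib.get_close_matches(normalized, sorted(allowlist), n=1, cutoff=0.8)
    if close:
        return f"suspicious: looks like {close[0]}"
    return "unknown"
```

Anything other than "allowed" would be routed to a human or blocked, depending on policy.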

Context engineering

Organization-wide context engineering: shared knowledge bases, architectural constraints injected via instruction files, prompt libraries.
Instruction files include four essential sections: project context, code style preferences, exact command strings, and architecture decisions.
Context quality is actively managed — files are updated whenever agents consistently make mistakes.
Consider: AGENTS.md as the universal cross-tool standard (associated with meaningful runtime and token savings); keeping files concise since shorter files outperform longer ones; auto-generated context files score significantly worse than manually refined versions.

Agent self-improvement & memory

Semantic memory: agents accumulate cross-session knowledge through managed instruction files and context that captures domain facts, preferences, and architectural rules.
CI/CD feedback loops at Observer level: an LLM scores builds passively, and failure patterns are logged and fed back to context files.
Skills discovery and composition: agents can find and use reusable expertise packages relevant to the current task.
The load-execute-write-back pattern is formalized: context is loaded at session start, tasks are executed, and learnings are written back before session end.
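
The three phases of this pattern can be sketched in a few lines. Everything here is illustrative: `run_session` and the `execute` callback are hypothetical names standing in for a real agent call, and appending bullet lines to the instruction file is just one possible write-back format.

```python
from pathlib import Path
from typing import Callable


def run_session(
    context_file: Path,
    task: str,
    execute: Callable[[str, str], list[str]],
) -> list[str]:
    """Load context, execute the task, and write learnings back.

    `execute` is a stand-in for an agent call; it receives the loaded
    context and the task, and returns any new learnings as short strings.
    """
    # Load: read the instruction file, or start empty if none exists yet.
    context = context_file.read_text(encoding="utf-8") if context_file.exists() else ""

    # Execute: delegate to the (hypothetical) agent.
    learnings = execute(context, task)

    # Write back: append only learnings not already recorded.
    new_items = [item for item in learnings if item not in context]
    if new_items:
        with context_file.open("a", encoding="utf-8") as fh:
            for item in new_items:
                fh.write(f"- {item}\n")
    return new_items
```

Skipping learnings that are already present keeps the file from growing with duplicates, which matters given the guidance to keep instruction files concise.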

Cost & token economics

Prompt caching deployed for repeated context (up to 90% savings on cached input tokens).
Context compaction: irrelevant context actively trimmed to reduce token waste.
Session hygiene enforced: max-iteration limits (15–25) to prevent runaway loops; scoped prompts preferred over vague instructions.
Consider: context compaction tooling, batch processing for non-urgent tasks (documentation, test writing) at reduced rates.
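
To see how prompt caching changes input cost, the arithmetic can be written out as a small function. This is a simplified estimate: the default discount mirrors the "up to 90%" figure above, the price used in any call is whatever you supply, and cache-write surcharges (which some providers apply) are deliberately not modeled.

```python
def input_cost(
    total_input_tokens: int,
    cached_fraction: float,
    price_per_million: float,
    cache_discount: float = 0.90,
) -> float:
    """Estimate input-token cost when part of the prompt is served from cache.

    cache_discount=0.90 reflects the "up to 90%" figure; actual discounts
    vary by provider and are an assumption here.
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = total_input_tokens * cached_fraction
    uncached = total_input_tokens - cached
    # Cached tokens are billed at (1 - discount) of the normal rate.
    effective_tokens = uncached + cached * (1.0 - cache_discount)
    return effective_tokens * price_per_million / 1_000_000
```

With an illustrative price of 3.00 per million tokens, one million input tokens cost 3.00 uncached and roughly 0.84 when 80% of them hit the cache.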

Developer impact & workforce

Product and engineering teams define standard "agentable" tasks with templates for input context and expected outputs.
Review processes explicitly account for AI-generated PR characteristics: PRs tend to be significantly larger and carry higher issue density than human-authored PRs.
Refactoring rate tracked as a first-class metric to counteract AI's tendency toward code duplication.
Consider: tiered review where automation handles style/security/lint as a first pass, humans focus on architecture and necessity.

Measurement & observability

Objective measurement infrastructure deployed: engineering metrics segmented by AI involvement vs. human-only work.
Quality metrics made co-equal with throughput metrics: defect density, code churn, refactoring rate weighted alongside deployment frequency and lead time.
The perception-reality gap is actively managed: self-reported productivity is supplemented with objective measures.
Consider: DX AI Measurement Framework dimensions (Utilization, Impact, Cost); CodeHealth-type metrics with validated links to speed and quality.

Compliance & governance

Governance introduces approval rules based on task type, environment (dev vs. prod), and risk profile.
Vendor compliance certifications assessed; data residency requirements enforced.
AI-generated code IP implications understood and policy established.

Risks and constraints

Without careful scoping, agents may make broad changes that increase review load and change risk.
Observability of agent and subagent actions is often immature, limiting post-hoc analysis.
End-to-end reliability degrades quickly: per-step success rates multiply across a multi-step workflow, so even high per-step accuracy can produce low end-to-end reliability.
AI code can be functional but systematically lacking in architectural judgment — common anti-patterns (excessive inline commenting, over-specification, ignoring battle-tested libraries) appear frequently without governance.
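
The reliability risk in this list can be made concrete with a little arithmetic. Assuming each step succeeds independently with the same probability (a simplification; real steps are often correlated), end-to-end success is the per-step rate raised to the number of steps.

```python
import math


def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps


def max_steps_for_target(per_step_success: float, target: float) -> int:
    """Largest number of steps that keeps end-to-end success >= target."""
    if not (0.0 < per_step_success < 1.0 and 0.0 < target < 1.0):
        raise ValueError("probabilities must be strictly between 0 and 1")
    return math.floor(math.log(target) / math.log(per_step_success))
```

For example, a 95% per-step success rate over 20 steps gives about 36% end-to-end success, and only 13 steps fit if the workflow must succeed at least half the time.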

Tier 4 — Orchestrated Agent Workflows (Single Teams)

At Tier 4, organizations deploy orchestrated workflows in which multiple agents or agent roles collaborate across stages of the software development lifecycle, forming explicit agent teams per workflow.

Typical usage

End-to-end flows such as "triage bug → localize in code → propose fix → run tests → open PR → generate release notes" are partially automated via agent workflows.
Different agent roles (e.g., Analyzer, Implementer, Reviewer) coordinate via a shared orchestration layer.

Autonomy and oversight

Semi-autonomous execution: workflows can run unattended for moderate durations, pausing at human approval gates (e.g., PR creation, production changes).
Human-in-the-loop patterns are explicitly designed, with clear points for escalation, override, and rollback.

Integration and tooling

Central orchestration coordinates tools such as VCS, CI/CD, ticketing, observability, and feature flag platforms.
Shared components (agent bus, execution controller, evaluation harness) support multi-agent coordination and durable, long-running tasks.

Agent topology

Flat agent teams per workflow: a single orchestrator coordinates several peer agents (e.g., planner, coder, tester), each of which may internally use subagents.
Teams are typically scoped to a value stream or workflow; there is not yet a global "team of teams" mesh across domains.

Security & supply chain

Red teaming of coding agents: adversarial testing of agent workflows for prompt injection, sandbox escape, and cascading failure modes.
Progressive trust model: agents start with tight boundaries that expand as audit data demonstrates reliability.
Audit trails with trace IDs, agent identity/permissions, decision chains, and action outcomes for every invocation; minimum 6-month retention.
Blast radius containment: risk-tiered autonomy (read-only fully autonomous, write requires approval, high-risk blocked with escalation); circuit breakers terminate execution after specified failures.
Consider: dual-LLM architectures separating trusted command processing from untrusted input handling to defend against prompt injection; formal threat modeling for multi-agent interaction patterns.
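
The risk-tiered autonomy and circuit-breaker controls listed above could be sketched as follows. The class and function names are hypothetical, and counting consecutive failures with a default threshold of three is an assumption; the source only says execution terminates "after specified failures".

```python
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read_only"
    WRITE = "write"
    HIGH = "high"


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    BLOCK_AND_ESCALATE = "block_and_escalate"


# Mapping taken from the risk tiers described above.
POLICY = {
    Risk.READ_ONLY: Decision.ALLOW,
    Risk.WRITE: Decision.NEEDS_APPROVAL,
    Risk.HIGH: Decision.BLOCK_AND_ESCALATE,
}


class CircuitBreaker:
    """Stops further agent actions after too many consecutive failures."""

    def __init__(self, max_failures: int = 3) -> None:  # threshold is an assumption
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.max_failures

    def record(self, success: bool) -> None:
        # A success resets the count; a failure increments it.
        self.failures = 0 if success else self.failures + 1


def authorize(risk: Risk, breaker: CircuitBreaker) -> Decision:
    """Decide whether an action may proceed under the current policy."""
    if breaker.open:
        return Decision.BLOCK_AND_ESCALATE
    return POLICY[risk]
```

Checking the breaker before the policy table means that once it trips, even read-only actions stop until someone resets it, which keeps the blast radius small.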

Context engineering

Context engineering is an organizational capability with a dedicated lead or team.
Dynamic context assembly: multiple sources (instruction files, retrieved documents, conversation history, tool outputs) are systematically combined for each agent invocation.
Agent Decision Records (AgDRs) extend Architecture Decision Records with agent metadata, structured decision context, tradeoff documentation, and status tracking.
Consider: multi-tier context systems (hot always-loaded conventions, specialized domain-expert agents, on-demand specification documents); context quality metrics and continuous validation.
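
Dynamic context assembly, as described above, might look like the following sketch: sources are combined in priority order until a token budget is reached. The `ContextSource` type, the four-characters-per-token estimate, and the skip-whole-source rule are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextSource:
    name: str
    text: str
    priority: int  # lower number = more important


def estimate_tokens(text: str) -> int:
    """Rough token estimate (about 4 characters per token); an assumption."""
    return max(1, len(text) // 4)


def assemble_context(sources: list[ContextSource], token_budget: int) -> str:
    """Concatenate sources in priority order until the budget is exhausted.

    Sources that do not fit are skipped whole rather than truncated.
    """
    parts: list[str] = []
    used = 0
    for source in sorted(sources, key=lambda s: s.priority):
        cost = estimate_tokens(source.text)
        if used + cost > token_budget:
            continue
        parts.append(f"## {source.name}\n{source.text}")
        used += cost
    return "\n\n".join(parts)
```

Giving instruction files the highest priority ensures conventions always make it into the prompt, while bulky sources such as conversation history are the first to be dropped.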

Agent self-improvement & memory

Procedural memory: agents learn workflows and action patterns through reusable expertise packages (Skills) with progressive disclosure.
CI/CD feedback loops at Gatekeeper level: an LLM serves as a blocking quality gate, and automated PR-level AI code review is required before merge.
The load-execute-write-back pattern operates at organizational scale: error patterns across teams feed back to shared instruction files and context infrastructure.
Skills ecosystem actively curated: agents self-author skills from observed patterns; cross-platform compatibility enables sharing across tool boundaries.
Consider: implementing the four-type memory taxonomy (working, episodic, semantic, procedural) as an explicit architectural layer; cross-agent knowledge sharing through dedicated memory infrastructure.

Cost & token economics

Multi-dimensional ROI measurement: Utilization (adoption rates, AI-assisted PRs), Impact (time savings, quality metrics, developer satisfaction), Cost (tool spend, token costs, rework burden).
Cost optimization stacks deployed: model routing + context compaction + prompt caching + batch processing (combined savings can be substantial).
Token economics measured at the outcome level (cost per successful task completion), not just input level (cost per token).
Consider: A/B testing agent configurations to optimize cost-quality tradeoffs; establishing cost baselines per workflow type.
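
The difference between input-level and outcome-level cost can be shown with a small calculation. The `RunStats` type is hypothetical; the point is only that failed attempts still cost money, so the denominator is successes rather than attempts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunStats:
    attempts: int
    successes: int
    total_cost_usd: float

    @property
    def cost_per_attempt(self) -> float:
        return self.total_cost_usd / self.attempts

    @property
    def cost_per_success(self) -> float:
        # Failed attempts still cost money, so divide by successes only.
        if self.successes == 0:
            return float("inf")
        return self.total_cost_usd / self.successes
```

With made-up numbers, a cheap configuration at 50 USD for 100 attempts and 40 successes costs 0.50 per attempt but 1.25 per success, while a pricier one at 90 USD with 90 successes costs 0.90 per attempt and 1.00 per success. The configuration that looks cheaper per attempt is the more expensive one per outcome.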

Developer impact & workforce

Tiered review: automation handles style, security, and lint checks; humans focus on architecture, necessity, and system-level implications.
Senior engineer time allocation explicitly managed — reviewing AI-generated code typically takes longer than reviewing human-written code due to larger diffs and higher issue density.
Developer trust actively monitored: familiarity with AI tools can decrease trust rather than increase it (the opposite of typical technology adoption curves).
Junior developer pipeline maintained: AI augments rather than replaces the apprenticeship model.
Consider: redefining performance expectations around review quality and system understanding rather than code volume; structured onboarding for AI workflows (expect 1–2 weeks of initial frustration).

Measurement & observability

End-to-end delivery metrics: cycle time to production, not just coding speed, is the primary throughput measure.
Engineering metrics incorporate DORA framework plus AI-specific dimensions, acknowledging that traditional DORA metrics inflate under AI assistance.
Value Stream Management practices adopted to convert individual AI productivity gains into organizational advantage.
Consider: DORA's seven team archetypes for assessment rather than linear performance tiers; supplementing DORA with quality and experience metrics.

Compliance & governance

Full compliance posture for applicable regulations: data residency, IP provenance, vendor certifications assessed.
Consider: evaluating deployment architectures against jurisdictional data requirements.

Risks and constraints

Poorly aligned governance can either over-constrain agents (limiting value) or under-constrain them (increasing operational risk).
Change management and developer trust become critical: teams must understand when to rely on workflows and when to intervene.
Multi-agent systems introduce cascading failure modes: because natural-language outputs are semantically opaque, errors can pass validation checks and propagate between agents, which is qualitatively different from traditional distributed-system failures.
Silent failure is the most dangerous mode: agents complete tasks with confident, well-formatted output that is wrong, with no error or flag.

Tier 5 — Autonomous Multi-Agent Platform (Teams of Teams)

At Tier 5, AI coding agents operate as a managed, autonomous multi-agent platform with strong governance, observability, and lifecycle management, organized as hierarchical teams of teams.

Typical usage

Persistent agent teams perform continuous activities such as dependency updates, security patching, regression detection, and routine refactors across repositories.
Agent routing and specialization allow the platform to assign tasks dynamically based on skill, context, and risk profile.

Autonomy and oversight

High autonomy within guardrails: long-running unattended executions are allowed, with automated checks, SLOs, and exception workflows for high-risk actions.
Governance defines layered supervision: automatic approval for low-risk changes, mandatory human review for sensitive systems or patterns.

Integration and tooling

A universal orchestration layer ("agent mesh") coordinates heterogeneous agents, tools, and runtimes across the engineering ecosystem.
Comprehensive observability (traces, structured logs, evaluations, cost and performance metrics) underpins continuous improvement of agent behaviors.

Agent topology

Hierarchical multi-agent system: meta-orchestrators oversee domain orchestrators, which coordinate workflow-level teams that spawn worker subagents for specific tasks.
Agent teams can invoke other teams as services (e.g., a feature-delivery team calling a security-hardening team).

Security & supply chain

Hardened isolation as default for all agent workloads — purpose-built sandboxing that provides hardware-level separation, not just OS-level containers (frontier models have demonstrated the ability to escape common container misconfigurations).
Kill switches accessible to operations staff without engineering intervention.
Full supply chain provenance tracking; air-gapped deployment option for sensitive codebases.
Continuous compliance: security posture automatically validated, not just periodically assessed.
Consider: Firecracker microVMs for hardware-virtualized isolation, or gVisor for a user-space kernel boundary that is stronger than plain containers; network-layer proxy injection for secrets management (strongest pattern, neutralizes prompt injection as a credential theft vector).

Context engineering

Enterprise-wide context governance: continuous validation, automated optimization (compaction, caching, routing), measurable context quality metrics.
Context infrastructure is treated as a critical system — organizations at this level may maintain thousands of lines of context infrastructure spanning hot memory, specialized domain agents, and on-demand specification documents.
AI-first workflow design: specification-driven development with authority hierarchies (Specs > Tests > Implementation) enforced by tooling.
Consider: dedicated context engineering team; context infrastructure as a platform service consumed by all agent teams; automated freshness monitoring (stale context is "the silent killer of AI knowledge systems").

Agent self-improvement & memory

Self-improving memory systems: background consolidation processes periodically merge, deduplicate, and resolve contradictions across accumulated knowledge.
CI/CD feedback loops at Healer level: agents with write access analyze failures, generate fixes, create branches, and open PRs if tests pass; human-supervised autonomous contribution.
Full memory taxonomy implemented: working memory (context window), episodic memory (past events with timestamps and decay), semantic memory (facts, domain knowledge), procedural memory (learned workflows, skills).
Continuous evaluation lifecycle: pre-deployment benchmarks → canary deployment → production monitoring → failure-to-test-case feedback loops → regression detection.
Skills ecosystem is self-sustaining: agents author, publish, and consume reusable expertise; compounding improvements emerge over time.
Consider: the load-execute-write-back pattern at enterprise scale; background "dream" consolidation processes for long-running projects; knowledge graph-backed memory for complex cross-project dependencies.
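
A background consolidation pass over the memory taxonomy above could be sketched like this. The types are hypothetical, and "newest value wins" is an assumed contradiction-resolution policy; a real system might weigh confidence, source, or decay instead.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryKind(Enum):
    WORKING = "working"
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    PROCEDURAL = "procedural"


@dataclass(frozen=True)
class MemoryItem:
    kind: MemoryKind
    key: str          # what the memory is about, e.g. "build command"
    value: str
    timestamp: float  # seconds since epoch


def consolidate(items: list[MemoryItem]) -> list[MemoryItem]:
    """Merge duplicates and resolve contradictions by keeping the newest
    value for each (kind, key) pair. "Newest wins" is an assumed policy."""
    latest: dict[tuple[MemoryKind, str], MemoryItem] = {}
    for item in items:
        # Keys are compared case-insensitively and without outer whitespace.
        slot = (item.kind, item.key.strip().lower())
        current = latest.get(slot)
        if current is None or item.timestamp > current.timestamp:
            latest[slot] = item
    return sorted(latest.values(), key=lambda m: m.timestamp)
```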

Cost & token economics

Outcome-centric economics: ROI measured at the organizational delivery level, not individual developer output.
Automated cost optimization: model routing, caching, compaction, and batching are continuously tuned based on workload patterns.
Token economics integrated into platform governance: cost per successful outcome tracked alongside quality and velocity.
Consider: A/B testing at the workflow level to optimize cost-quality tradeoffs; establishing cost SLOs per workflow type.

Developer impact & workforce

Organizational practices assume agents and agent teams as default participants in SDLC processes.
Developer roles have shifted from code writing to reviewing, orchestrating, and governing AI output — deep system understanding becomes more, not less, important.
Active upskilling programs maintain and develop developer capabilities; junior developer mentoring pipeline preserved as a strategic investment.
Consider: the "High-Leverage Engineer" concept — emphasizing that human judgment, architectural vision, and system understanding are the scarce resources; maintaining human juniors for succession planning even as AI handles routine coding tasks.

Measurement & observability

Continuous evaluation lifecycle fully operational: pre-deployment benchmarks, canary deployment, production monitoring, failure-to-test-case feedback loops, regression detection.
Organizational delivery metrics (not individual output) are the primary assessment criteria.
DORA's seven foundational AI capabilities assessed: clear AI stance, healthy data ecosystems, AI-accessible internal data, strong version control, working in small batches, user-centric focus, and quality internal platforms.
Consider: DORA's seven team archetypes for nuanced assessment (from "nothing works well" to "harmonious high-achievers").

Compliance & governance

Dedicated platform and governance roles manage agent lifecycle: design, evaluation, deployment, monitoring, and retirement.
Full regulatory compliance: data residency, IP provenance, and applicable regulatory requirements.
Vendor certifications validated; data processing agreements enforce zero-retention options where required.
Consider: ISO/IEC 42001:2023 (AI management system standard) as a compliance benchmark; regional deployment architectures for jurisdictional requirements; continuous compliance monitoring rather than periodic assessment.

Risks and constraints

Misaligned incentives (e.g., optimizing for volume of changes rather than quality and user impact) can lead to waste or instability.
Regulatory, security, and ethical requirements require continuous review as capabilities and autonomy grow.
The AI Productivity Paradox remains relevant even at scale: individual output metrics improve while organizational delivery can stay flat unless Value Stream Management practices actively convert individual gains into system-level throughput.

Cross-Cutting Themes

These systemic findings span all tiers and must inform assessment at every level.

AI as Amplifier

DORA 2025's central finding: AI magnifies the strengths of high-performing organizations and the dysfunctions of struggling ones. Higher AI adoption correlates with increases in both delivery throughput and delivery instability. Organizations with weak engineering culture, testing, or review practices will see faster accumulation of technical debt and security vulnerabilities at every tier. Maturity requires strong engineering fundamentals, not just more automation.

The AI Productivity Paradox

Individual output metrics improve while organizational delivery stays flat. AI shifts bottlenecks from coding to review, testing, and release. PR review time and PR sizes grow substantially. The review burden — not coding speed — is the binding constraint on organizational throughput. Each tier must address how review scales alongside AI-generated output.

The Perception-Reality Gap

Developers consistently perceive AI makes them faster, but objective measurements often tell a different — sometimes opposite — story. Self-reported productivity is unreliable as an assessment metric. Every tier must emphasize objective measurement over self-report, and higher tiers must deploy infrastructure to surface this gap.

Context Over Model Selection

Changing only the context harness — not the model — can dramatically improve agent performance. Incomplete context is a primary driver of iteration cycles and rework. Most agent failures are context failures, not model failures. Context engineering capability is the primary lever organizations control for improving AI output quality across all other dimensions.

Deskilling Risk

The mode of AI use determines whether it builds or erodes developer capability. Developers who completely delegate to AI tend to show weaker comprehension and debugging skills, while those who use AI for conceptual inquiry and ask follow-up questions maintain or build capability. The critical variable is cognitive engagement, not AI usage volume. Maturity assessment should evaluate how teams use AI, not just how much.

Technical Debt Acceleration

AI-generated code can be functional but systematically lacking in architectural judgment. Refactoring rates can decline while code duplication grows. AI-coauthored PRs can carry more issues than human-authored ones. Without explicit governance, increased velocity produces a larger codebase, which accumulates technical debt, which in turn decreases future velocity. Quality metrics (defect density, code churn, refactoring rate) must be co-equal with throughput metrics at every tier.

Using the Model

Assess current state: Identify the tier that best matches current AI coding practices across all dimensions, recognizing that different teams — and different dimensions — may sit at different tiers simultaneously. Consider using DORA's seven team archetypes rather than collapsing maturity into a single score.
Plan progression: Use the dimensions above to define specific, low-regret steps toward the next tier rather than jumping directly to full autonomy. Prioritize context engineering and measurement infrastructure — these are the highest-leverage investments and prerequisites for safe progression on other dimensions.
Align capability and control: Evolve governance, observability, security, and talent readiness alongside technical capabilities. Higher autonomy without stronger controls amplifies risk, not value. The evidence is consistent: organizations that treat AI agent adoption as purely a tooling decision face compounding risks, while those embedding it within mature engineering practices see genuine transformation.
Measure what matters: Supplement traditional engineering metrics with AI-specific dimensions. Traditional DORA metrics inflate under AI assistance — deployment frequency and lead time improve on paper while hiding increased review overhead and new defect categories. Use multi-dimensional frameworks that capture Speed, Effectiveness, Quality, and Impact together.
Manage the human dimension: Developer trust, skill development, cognitive load, and the junior talent pipeline are not secondary concerns — they determine whether AI adoption is sustainable. Involve teams in tool selection rather than mandating top-down. Encourage explanation mode over delegation mode. Preserve the apprenticeship model for junior developers.

geowa4/ai-coding-agent-capability-maturity-model.md
