"Human teams compensate for incomplete docs with meetings, intuition, and shared mental models. LLM systems cannot. Therefore every artifact is a coordination layer, and every phase is a contract."
Human software teams function as a distributed system with implicit channels: hallway conversations, shared organizational context, years of accumulated intuition, real-time social correction. When a spec is ambiguous, a senior engineer fills the gap from memory and instinct.
LLM agents have none of that.
What they have instead is this: given a bounded context with clear inputs and an explicit goal, frontier models perform remarkably well. That is the design constraint this architecture is built around.
The failure mode of current LLM agent systems is not a lack of intelligence; it is context pollution, objective blending, and unconstrained scope. An agent asked to simultaneously plan, build, test, review, and learn will hallucinate across all five, not because the model is weak, but because the runtime was designed for humans.
AgentSWEOS vNext is designed for what LLMs actually are:
bounded probabilistic transformers that perform best with:
- minimal, scoped context
- explicit input artifacts
- single, stated objectives
- structured output contracts
- verification before state transition
The redesign treats the SWE pipeline not as Agile iteration but as a compiler pipeline: each pass receives a structured intermediate representation, transforms it according to narrow rules, and emits a verified output for the next pass. Errors are caught at phase boundaries, not discovered at ship time.
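The compiler-pipeline framing can be sketched in a few lines, assuming each phase is modeled as a transform plus a gate. `Pass`, `GateFailure`, `run_pipeline`, and the two toy passes are hypothetical names used for illustration only; they are not part of AgentSWEOS.

```python
from dataclasses import dataclass
from typing import Callable

Artifact = dict  # hypothetical stand-in for a structured intermediate representation


class GateFailure(Exception):
    """Raised when a pass emits output that fails its phase gate."""


@dataclass(frozen=True)
class Pass:
    name: str
    transform: Callable[[Artifact], Artifact]  # narrow, single-purpose transformation
    gate: Callable[[Artifact], bool]           # verification before state transition


def run_pipeline(passes: list[Pass], source: Artifact) -> Artifact:
    """Run passes in order and stop at the first gate that rejects an output."""
    current = source
    for p in passes:
        output = p.transform(current)
        if not p.gate(output):
            raise GateFailure(f"{p.name}: output rejected at phase boundary")
        current = output
    return current


# Two toy passes; the field names are illustrative assumptions.
requirements = Pass(
    name="requirements",
    transform=lambda a: {**a, "acceptance_criteria": ["AC-001"]},
    gate=lambda a: bool(a.get("acceptance_criteria")),
)
architecture = Pass(
    name="architecture",
    transform=lambda a: {**a, "interfaces": {"sync_user_batch": "typed"}},
    gate=lambda a: bool(a.get("interfaces")),
)
```

A rejected gate raises at the boundary where the bad artifact was produced, so later passes never run on it.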
| Problem | Mechanism |
|---|---|
| Context drift | Each iteration adds to prompt history; reasoning degrades as window fills |
| Objective blending | "Build and test and review" activates conflicting evaluation criteria simultaneously |
| Error amplification | A flawed assumption in planning contaminates implementation, tests, and reward |
| Self-certification | The same agent that built it cannot reliably review it; it anchors to its own reasoning |
| Unmeasurable learning | What did the loop learn? Which phase caused the failure? |
| Problem | Phase model solution |
|---|---|
| Context drift | Each phase starts fresh from structured artifacts, not accumulated chat |
| Objective blending | One agent, one objective, one output type per phase |
| Error amplification | Phase gates catch failures before they propagate downstream |
| Self-certification | Verification and security phases are structurally independent |
| Unmeasurable learning | Failures are attributed to the phase that produced the bad artifact |
flowchart TB
subgraph Contract[Phase Contract Example: Architecture Agent]
Inputs[Required Inputs<br/>spec.md<br/>acceptance-criteria.md<br/>constraints.md]
Context[Context Scope<br/>requirements/<br/>verified-patterns/<br/>architect-agent/skills/]
Agent[Architect Agent<br/>Persona: Staff Engineer<br/>Objective: Define system topology]
Outputs[Produced Outputs<br/>adr-XXX.md<br/>interfaces.md<br/>risk-analysis.md]
Rules[Validation Rules<br/>- every external system named<br/>- every interface typed<br/>- assumptions explicit]
Forbidden[Forbidden Actions<br/>✗ write to implementation/<br/>✗ write to planning/<br/>✗ invoke code executor]
end
Inputs --> Agent
Context --> Agent
Agent --> Outputs
Outputs --> Rules
Rules -->|pass| Gate[APPROVED.md → next phase]
Rules -->|fail| Block[BLOCKED.md → rollback]
Forbidden -.->|enforced by runtime| Agent
flowchart LR
subgraph Input[Task Input]
Q[Queue] --> R[Requirements Agent]
end
subgraph Pipeline[Eight-Phase Pipeline]
direction LR
R --> A[Architecture Agent]
A --> P[Planning Agent]
P --> I[Implementation Agent]
I --> V[Verification Agent]
V --> S[Security Agent]
S --> Int[Integration Agent]
Int --> Ret[Retrospective Agent]
end
subgraph Gates[Phase Gates]
G1[spec complete?]
G2[interfaces valid?]
G3[plan reviewable?]
G4[lint + unit pass?]
G5[all checks pass?]
G6[no blocking findings?]
G7[no regressions?]
end
R --> G1 --> A
A --> G2 --> P
P --> G3 --> I
I --> G4 --> V
V --> G5 --> S
S --> G6 --> Int
Int --> G7 --> Ret
Ret --> Done[Done ✓]
Block[BLOCKED]
Block -.->|rollback to previous phase| A
Block -.->|escalate to human| Human[Human Review]
Each phase runs a different specialist agent with a different persona, a different context window, and a different output contract. No phase can mutate the outputs of a preceding phase — only signal a rollback.
A phase contract is the machine-readable interface between pipeline stages. It replaces the implicit human coordination that PRDs, standups, and Slack threads provide in human teams.
flowchart LR
subgraph TaskArtifacts[Task Artifacts]
T1[requirements/<br/>spec.md, ACs]
T2[architecture/<br/>ADRs, interfaces]
T3[planning/<br/>work breakdown]
T4[implementation/<br/>unit patches]
T5[verification/<br/>test results]
end
subgraph AgentsWithContext[Agents with Bounded Context]
direction TB
A1[Requirements Agent<br/>~800 tokens<br/>raw task only]
A2[Architect Agent<br/>~2,000 tokens<br/>requirements + patterns]
A3[Builder Agent<br/>~3,000 tokens<br/>1 unit + interfaces]
A4[Verifier Agent<br/>~2,500 tokens<br/>ACs + patches + tool output]
end
T1 -.->|scope: requirements/*| A2
T1 --> A1
T2 -.->|scope: architecture/*| A3
T2 -.->|scope: architecture/*| A4
T1 --> A4
T3 -.->|not visible| A3
T4 --> A4
T5 --> A4
note1[Builder never sees:<br/>- reward history<br/>- other units' patches<br/>- full repository<br/>- retrospective data]
# phases/contracts/architecture.yaml
phase: architecture
version: 1
required_inputs:
  - tasks/{task-id}/requirements/spec.md
  - tasks/{task-id}/requirements/acceptance-criteria.md
  - tasks/{task-id}/requirements/constraints.md
produced_outputs:
  - tasks/{task-id}/architecture/adr-001.md # at minimum one ADR
  - tasks/{task-id}/architecture/interfaces.md # all external/internal APIs typed
  - tasks/{task-id}/architecture/risk-analysis.md
validation_rules:
  - every external system is named and bounded
  - every interface has typed inputs and outputs
  - every assumption is written as an explicit assumption, not embedded prose
  # quoted so the colon is not parsed as a YAML mapping
  - "risk analysis covers: data, auth, third-party dependencies, rollback path"
allowed_mutations:
  - tasks/{task-id}/architecture/*
forbidden_actions:
  - write to tasks/{task-id}/implementation/*
  - write to tasks/{task-id}/planning/*
  - edit tasks/{task-id}/requirements/* # can only signal rejection
  - invoke code executor
  - call external APIs
rollback_signal:
  path: tasks/{task-id}/architecture/BLOCKED.md
  reason: required # why the phase cannot proceed
  missing: required # what spec information is absent
context_scope:
  include:
    - tasks/{task-id}/requirements/
    - agents/architect-agent/skills/
    - agents/architect-agent/persona.md
    - world/verified-patterns/
    - world/anti-patterns/
  exclude:
    - tasks/{task-id}/implementation/
    - tasks/{task-id}/retrospective/
    - agents/*/rewards.md # no reward history leaks into architecture reasoning
The contract is parsed by the runtime before any LLM call is made. If required inputs are missing or malformed, the phase does not start; instead, the runtime emits a structured block signal upstream. The LLM never sees incomplete state.
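A minimal sketch of that preflight step, assuming the contract has already been parsed into a dictionary. `ARCHITECTURE_CONTRACT`, `missing_inputs`, and `preflight` are hypothetical names, and the "absent or empty" test is a simplified stand-in for whatever "malformed" means in the real runtime.

```python
from pathlib import Path

# Hypothetical in-memory form of a parsed phase contract (subset of the YAML fields).
ARCHITECTURE_CONTRACT = {
    "phase": "architecture",
    "required_inputs": [
        "tasks/{task-id}/requirements/spec.md",
        "tasks/{task-id}/requirements/acceptance-criteria.md",
        "tasks/{task-id}/requirements/constraints.md",
    ],
}


def missing_inputs(contract: dict, task_id: str, data_root: Path) -> list[str]:
    """Return required input paths that are absent or empty for this task."""
    missing = []
    for template in contract["required_inputs"]:
        relative = template.replace("{task-id}", task_id)
        path = data_root / relative
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(relative)
    return missing


def preflight(contract: dict, task_id: str, data_root: Path) -> dict:
    """Decide whether the phase may start, before any model call is made."""
    missing = missing_inputs(contract, task_id, data_root)
    status = "blocked" if missing else "ready"
    return {"phase": contract["phase"], "status": status, "missing": missing}
```

The check is plain file-system logic with no model in the loop, which is what makes the block signal deterministic.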
The layout is organized around tasks as units of work and phases as named subtrees within each task. Agent state is strictly separate from task artifacts.
/data/
├── tasks/
│ └── task-2026-001/
│ ├── META.yaml # task id, priority, created, status
│ │
│ ├── requirements/
│ │ ├── spec.md # what and why, not how
│ │ ├── acceptance-criteria.md # machine-checkable conditions for done
│ │ ├── constraints.md # non-negotiable boundaries
│ │ └── APPROVED.md # signed off before architecture starts
│ │
│ ├── architecture/
│ │ ├── adr-001.md # Architecture Decision Record
│ │ ├── interfaces.md # every API typed: inputs, outputs, errors
│ │ ├── dependency-graph.json # machine-readable system graph
│ │ ├── risk-analysis.md
│ │ └── APPROVED.md
│ │
│ ├── planning/
│ │ ├── work-breakdown.md # slices, each independently testable
│ │ ├── dependency-order.json # which units block which
│ │ ├── execution-plan.md # sequenced, time-bounded units
│ │ └── APPROVED.md
│ │
│ ├── implementation/
│ │ ├── units/
│ │ │ ├── unit-001/
│ │ │ │ ├── patch.diff
│ │ │ │ ├── commit-message.md
│ │ │ │ └── self-check.md # builder's pre-verification notes
│ │ │ └── unit-002/
│ │ └── READY_FOR_VERIFICATION.md
│ │
│ ├── verification/
│ │ ├── lint.json # machine output, not prose
│ │ ├── unit-tests.json # pass/fail/coverage per unit
│ │ ├── integration-tests.json
│ │ ├── spec-fidelity.md # does impl match acceptance criteria?
│ │ ├── reviewer-report.md # five-axis review, severity-labeled
│ │ └── PASSED.md / BLOCKED.md
│ │
│ ├── security/
│ │ ├── threat-model.md
│ │ ├── owasp-checklist.json # structured, not prose
│ │ ├── secrets-scan.json # automated scan output
│ │ ├── auth-paths.md # every auth boundary documented
│ │ └── CLEARED.md / HUMAN_REVIEW_REQUIRED.md
│ │
│ ├── integration/
│ │ ├── merge-plan.md
│ │ ├── compatibility-checks.json
│ │ ├── regression-results.json
│ │ ├── deployment-implications.md
│ │ └── MERGED.md / BLOCKED.md
│ │
│ └── retrospective/
│ ├── phase-failures.md # which phase failed and why
│ ├── root-cause.md
│ ├── lessons.md
│ └── reward.yaml # structured, tool-verified score
│
├── phases/
│ └── contracts/
│ ├── requirements.yaml
│ ├── architecture.yaml
│ ├── planning.yaml
│ ├── implementation.yaml
│ ├── verification.yaml
│ ├── security.yaml
│ ├── integration.yaml
│ └── retrospective.yaml
│
├── agents/
│ ├── requirements-agent/
│ │ ├── persona.md
│ │ ├── skills/
│ │ └── rewards.md
│ ├── architect-agent/
│ ├── planner-agent/
│ ├── builder-agent/
│ ├── verifier-agent/
│ ├── security-agent/
│ ├── integration-agent/
│ └── retrospective-agent/
│
├── shared/
│ ├── queue.md # inbound task queue
│ ├── pipeline-state.yaml # current task + phase + status
│ ├── rollback-log.md # all phase rejections, indexed
│ └── human-review-queue.md # blocked items needing human input
│
└── world/
├── verified-patterns/ # promoted after 3+ successful tasks
├── anti-patterns/ # promoted after 2+ verified failures
├── routing-policies/ # learned model tier assignments
└── reward-evidence/ # cross-task pattern evidence accumulator
The key structural rule: agents read from world/ and their own agents/<name>/ directory. They read task artifacts only within their phase's context_scope. They write only to their phase's allowed_mutations paths.
The runtime enforces this. It is not a guideline; it is a mount configuration.
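The write rule itself can be stated as a path check. The source enforces it at the mount level; the sketch below only illustrates the rule in software, under the assumption that `allowed_mutations` entries are glob patterns with a `{task-id}` placeholder. `expand` and `is_write_allowed` are hypothetical helpers.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def expand(patterns: list[str], task_id: str) -> list[str]:
    """Substitute the task id into contract path templates."""
    return [p.replace("{task-id}", task_id) for p in patterns]


def is_write_allowed(path: str, allowed_mutations: list[str], task_id: str) -> bool:
    """True only if the path falls under one of the phase's allowed_mutations."""
    candidate = PurePosixPath(path)
    if candidate.is_absolute() or ".." in candidate.parts:
        return False  # reject absolute paths and directory traversal outright
    normalized = "/".join(candidate.parts)
    # fnmatchcase treats "*" as "any characters", including "/", so
    # "architecture/*" also covers nested files under architecture/.
    return any(
        fnmatchcase(normalized, pattern)
        for pattern in expand(allowed_mutations, task_id)
    )
```

For the architecture contract, a write to `tasks/task-2026-001/architecture/adr-001.md` passes and a write to anything under `implementation/` does not.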
Persona: Technical product manager. Obsessed with ambiguity elimination. Refuses to proceed on vague intent.
Single objective: Transform a task description into a machine-readable requirements package that any downstream agent can execute without asking follow-up questions.
Context window receives:
- The raw task description
- agents/requirements-agent/skills/spec-driven-development.md
- world/anti-patterns/underspecified-requirements.md (if it exists)
- Prior task specs of the same type (for format consistency)
Produces:
# spec.md
## Intent
What problem this solves and for whom. One paragraph. No implementation language.
## Scope
What is explicitly IN scope. What is explicitly OUT of scope.
## Acceptance Criteria
- [ ] AC-001: [machine-verifiable condition]
- [ ] AC-002: [machine-verifiable condition]
...
## Open Questions
Questions that block downstream phases. Each must be answered before APPROVED.md is written.
# acceptance-criteria.md
Each criterion maps to a verification method:
AC-001:
  description: "API returns 429 and Retry-After header on rate limit"
  verification_method: integration_test
  test_file: tests/test_rate_limiting.py::test_429_response
AC-002:
  description: "All database writes are idempotent"
  verification_method: unit_test
  test_file: tests/test_idempotency.py
Rollback signal: If the original task description is too ambiguous to spec, the requirements agent writes tasks/{id}/requirements/BLOCKED.md with specific questions. Execution halts. A human or the task submitter must resolve them.
Cannot: Write code, design systems, suggest implementation approaches. The spec describes what, never how.
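Because each acceptance criterion maps to a verification method, the mapping can be checked mechanically before APPROVED.md is written. This is a sketch under assumptions: `validate_criteria` is a hypothetical helper, and `KNOWN_METHODS` contains only the two methods that appear in the example above.

```python
import re

AC_ID = re.compile(r"^AC-\d{3}$")
# Assumption: only the two methods shown in the example are accepted.
KNOWN_METHODS = {"unit_test", "integration_test"}


def validate_criteria(criteria: dict[str, dict]) -> list[str]:
    """Return a list of problems; an empty list means every AC is checkable."""
    problems = []
    for ac_id, body in criteria.items():
        if not AC_ID.match(ac_id):
            problems.append(f"{ac_id}: id does not match AC-NNN")
        if not body.get("description"):
            problems.append(f"{ac_id}: missing description")
        method = body.get("verification_method")
        if method not in KNOWN_METHODS:
            problems.append(f"{ac_id}: unknown verification_method {method!r}")
        if not body.get("test_file"):
            problems.append(f"{ac_id}: no test_file to run")
    return problems
```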
Persona: Staff engineer with a distributed systems background. Thinks in interfaces, not implementations. Writes ADRs, not opinions.
Single objective: Given a complete spec, define the system topology, interface contracts, and risk landscape such that implementation can proceed without architectural decisions.
Context window receives:
- tasks/{id}/requirements/ (all files)
- agents/architect-agent/skills/
- world/verified-patterns/ (relevant subset by tag)
- world/anti-patterns/ (relevant subset)
Produces:
# interfaces.md
## Internal APIs
### UserSyncService
method: sync_user_batch
input:
  users: List[UserId] # max 100 per call
  as_of: datetime # idempotency key
output:
  synced: List[UserId]
  failed: List[{id: UserId, reason: str}]
errors:
  RateLimitError: retry after Retry-After header
  AuthError: non-retryable, escalate
## External Dependencies
### VendorAPI
base_url: from env VENDOR_API_BASE_URL
auth: Bearer token, rotated every 24h
rate_limit: 100 req/min, token bucket
pagination: cursor-based, field: next_cursor
# adr-001.md
## ADR-001: Cursor-based pagination over offset
### Status: Accepted
### Context
Vendor API uses cursor pagination. Offset pagination would require re-fetching
pages on record insertion, causing duplicate processing.
### Decision
Implement cursor pagination. Store last_cursor per sync job in the database.
### Consequences
- Resumable syncs after failure (positive)
- Cannot seek to arbitrary page (acceptable for this use case)
- Cursor expiry must be handled: if cursor is >24h old, restart from beginning
### Alternatives Rejected
Offset pagination: race condition risk, not supported by vendor API contract.
Rollback signal: If the spec is insufficient to make an architectural decision (for example, it does not state whether the system needs to be stateless), the architecture agent writes BLOCKED.md naming exactly which spec sections are incomplete. It does not guess.
Cannot: Write production code, make planning decisions, assign work units.
Persona: Engineering lead who has run many sprints and learned that large units are risk multipliers.
Single objective: Decompose the approved architecture into independently testable, independently deployable work units with explicit dependency ordering.
Context window receives:
- tasks/{id}/requirements/acceptance-criteria.md
- tasks/{id}/architecture/ (all files)
- agents/planner-agent/skills/planning-and-task-breakdown.md
Produces:
# work-breakdown.md (rendered as structured YAML)
units:
  - id: unit-001
    title: "Database schema: user_sync_cursor table"
    type: schema_migration
    size_estimate: small
    acceptance_criteria_covered: [AC-005]
    depends_on: []
    verification:
      method: migration_runs_clean
      rollback_tested: required
  - id: unit-002
    title: "VendorAPIClient: cursor pagination + rate limit handling"
    type: library_module
    size_estimate: medium
    acceptance_criteria_covered: [AC-001, AC-002, AC-003]
    depends_on: [unit-001]
    verification:
      method: unit_tests
      test_file: tests/test_vendor_client.py
      coverage_threshold: 90
  - id: unit-003
    title: "SyncOrchestrator: batch scheduling + cursor persistence"
    type: service_module
    size_estimate: medium
    acceptance_criteria_covered: [AC-004, AC-006]
    depends_on: [unit-001, unit-002]
    verification:
      method: integration_tests
      fixtures: tests/fixtures/vendor_api_mock.py
Hard rules:
- No unit covers more than three acceptance criteria
- Every unit has a verification method before it is approved
- Units that touch auth, secrets, or PII are flagged for security review
- Size estimates above large must be split before APPROVED.md is written
Cannot: Make architectural decisions, write code, adjust requirements.
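Two of the planner's hard rules, plus the dependency ordering, are mechanical enough to check in code. The sketch below assumes units are loaded as dictionaries shaped like the work-breakdown example; `check_units` and `execution_order` are hypothetical helpers. The ordering uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

MAX_ACS_PER_UNIT = 3  # "No unit covers more than three acceptance criteria"


def check_units(units: list[dict]) -> list[str]:
    """Return violations of the hard rules that can be checked structurally."""
    problems = []
    known_ids = {u["id"] for u in units}
    for u in units:
        if len(u.get("acceptance_criteria_covered", [])) > MAX_ACS_PER_UNIT:
            problems.append(f"{u['id']}: covers more than {MAX_ACS_PER_UNIT} ACs")
        if not u.get("verification", {}).get("method"):
            problems.append(f"{u['id']}: no verification method")
        for dep in u.get("depends_on", []):
            if dep not in known_ids:
                problems.append(f"{u['id']}: depends on unknown unit {dep}")
    return problems


def execution_order(units: list[dict]) -> list[str]:
    """Order units so every unit comes after the units it depends on.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    graph = {u["id"]: set(u.get("depends_on", [])) for u in units}
    return list(TopologicalSorter(graph).static_order())
```

A cycle in `depends_on` surfaces as an exception at planning time rather than as a stalled implementation phase.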
Persona: Staff engineer. Incremental. Test-first. Atomic commits. Never invents requirements.
Single objective: Implement exactly one approved work unit per invocation. No more.
Context window receives:
- The single assigned unit-00N definition
- tasks/{id}/architecture/interfaces.md
- tasks/{id}/architecture/adr-*.md
- The relevant code slice from repos/{project}/ (only the affected modules)
- agents/builder-agent/skills/ (relevant subset based on unit type)
The builder does not receive:
- The full repository
- Reward history
- Previous retrospectives
- World model
- Other units' patches
This is deliberate. The builder's job is narrow code transformation, not system design. If the builder encounters ambiguity that requires a decision not covered by the interfaces or ADRs, the correct action is to emit a BLOCKED.md, not to invent an answer.
Produces per unit:
implementation/units/unit-001/
├── patch.diff # the actual code change
├── commit-message.md # conventional commits format, references AC IDs
└── self-check.md # builder's checklist before handing off
# self-check.md (builder fills before handoff)
Unit: unit-002 VendorAPIClient
Pre-verification checklist:
- [x] Implements cursor pagination as specified in interfaces.md
- [x] Rate limit retry uses Retry-After header, not fixed backoff
- [x] No hardcoded credentials or URLs
- [x] New code has docstrings at module and function level
- [x] No logic added beyond what acceptance criteria require
- [ ] Integration test written (blocked: mock fixture not yet available - unit-003 dependency)
Assumptions made: none. All decisions covered by interfaces.md and ADR-001.
Deviations from plan: none.
The critical constraint: The builder cannot award itself a passing grade. The self-check.md is a handoff document, not a verification gate. Verification is a separate phase, a separate agent, and a separate context window.
Persona: QA engineer with zero tolerance for self-reported test results. Only tool output counts.
Single objective: Verify that implementation matches the acceptance criteria. Produce structured, machine-readable verdicts. Find spec divergence.
Context window receives:
- tasks/{id}/requirements/acceptance-criteria.md
- tasks/{id}/architecture/interfaces.md
- tasks/{id}/planning/work-breakdown.md
- tasks/{id}/implementation/units/ (all patches and self-checks)
- Test runner output (JSON) from the executor container
Produces:
// unit-tests.json
{
  "task_id": "task-2026-001",
  "unit_id": "unit-002",
  "runner": "pytest",
  "exit_code": 0,
  "passed": 14,
  "failed": 0,
  "coverage": 0.91,
  "threshold_met": true,
  "acceptance_criteria_verified": ["AC-001", "AC-002", "AC-003"]
}
# reviewer-report.md
## Five-Axis Review: unit-002 VendorAPIClient
### Correctness
PASS
Rate limit retry reads Retry-After header correctly.
Cursor pagination matches interfaces.md contract.
### Test Quality
PASS — 91% coverage. Edge cases covered: empty page, expired cursor,
missing Retry-After header (defaults to 60s).
### Security
REVIEW — Line 47: token passed as query parameter in one fallback path.
Severity: BLOCKER
Required: move to Authorization header unconditionally.
### Readability
PASS — Functions are <30 lines. Naming is unambiguous.
One nit: `_do_fetch` is too generic, suggest `_fetch_cursor_page`.
### Change Sizing
PASS — 87 lines changed. Within the 100-line guidance.
## Overall: BLOCKED
Reason: Security finding BLOCKER on line 47 must be resolved before
this unit proceeds to security review phase.
Rollback signal: BLOCKED.md is written to tasks/{id}/verification/. It includes the unit ID, finding severity, and exact location. The builder receives only the BLOCKED.md, not the full reviewer context, to avoid anchoring to the reviewer's suggested fix.
Cannot: Modify implementation, write tests on behalf of the builder, approve its own findings.
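The overall verdict in the example report follows from two inputs: the tool output and the severity labels on the review axes. A minimal sketch of that aggregation, assuming any BLOCKER finding or any failing tool result blocks the unit; `AxisFinding` and `overall_verdict` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AxisFinding:
    axis: str                       # e.g. "Correctness", "Security"
    status: str                     # "PASS" or "REVIEW"
    severity: Optional[str] = None  # e.g. "BLOCKER"; None when the axis passes


def overall_verdict(test_report: dict, findings: list[AxisFinding]) -> str:
    """Combine tool output with review findings into PASSED or BLOCKED."""
    tests_clean = (
        test_report["exit_code"] == 0
        and test_report["failed"] == 0
        and test_report["threshold_met"]
    )
    has_blocker = any(f.severity == "BLOCKER" for f in findings)
    return "PASSED" if tests_clean and not has_blocker else "BLOCKED"
```

With the unit-002 data above, the tests are clean but the Security axis carries a BLOCKER, so the unit is BLOCKED.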
Persona: Application security engineer. OWASP-fluent. Threat models everything. Never approves "we'll fix it later."
Single objective: Verify that the implementation does not introduce security vulnerabilities, secrets exposure, or authentication weaknesses.
Context window receives:
- tasks/{id}/architecture/interfaces.md
- tasks/{id}/architecture/risk-analysis.md
- tasks/{id}/implementation/units/ (all patches)
- tasks/{id}/verification/secrets-scan.json (automated scan output)
- agents/security-agent/skills/security-and-hardening.md
- OWASP Top 10 checklist (embedded in agent skills)
Produces:
// owasp-checklist.json
{
  "task_id": "task-2026-001",
  "checked_items": [
    {"id": "A01", "name": "Broken Access Control", "status": "PASS", "notes": "Auth scope verified per ADR-001"},
    {"id": "A02", "name": "Cryptographic Failures", "status": "PASS", "notes": "No sensitive data stored in logs"},
    {"id": "A03", "name": "Injection", "status": "PASS", "notes": "All user input parameterized"},
    {"id": "A07", "name": "Auth Failures", "status": "REVIEW", "notes": "Token rotation interval not enforced in client"}
  ],
  "secrets_scan": "PASS",
  "auth_paths_documented": true,
  "human_review_required": false,
  "blocking_findings": []
}
Human review trigger conditions (any one is sufficient):
- Any finding rated CRITICAL or BLOCKER
- Code touches auth, session management, or credential storage
- New external service dependency introduced
- PII schema change
- First security review for a new integration pattern
When human review is triggered, the agent writes to shared/human-review-queue.md and the pipeline halts. It does not proceed optimistically.
Cannot: Approve findings provisionally, suggest "fix in follow-up," modify implementation.
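The human review triggers are an any-of rule, which can be sketched as a function that returns the reasons that fired. The input field names (`finding_severities`, `touches_auth_or_credentials`, and so on) are hypothetical; the source does not define a schema for them.

```python
BLOCKING_SEVERITIES = {"CRITICAL", "BLOCKER"}


def human_review_reasons(review: dict) -> list[str]:
    """Return every trigger that fired; a non-empty list halts the pipeline."""
    reasons = []
    severities = review.get("finding_severities", [])
    if any(s in BLOCKING_SEVERITIES for s in severities):
        reasons.append("finding rated CRITICAL or BLOCKER")
    if review.get("touches_auth_or_credentials"):
        reasons.append("code touches auth, session management, or credential storage")
    if review.get("new_external_dependency"):
        reasons.append("new external service dependency introduced")
    if review.get("pii_schema_change"):
        reasons.append("PII schema change")
    if review.get("first_review_for_pattern"):
        reasons.append("first security review for a new integration pattern")
    return reasons
```

Returning the reasons rather than a bare boolean gives the entry in shared/human-review-queue.md something concrete to record.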
Persona: Platform engineer. Thinks about the system, not the feature. Owns the merge and what happens after.
Single objective: Validate that the completed implementation integrates safely with the existing system — no regressions, no architectural drift, no deployment surprises.
Context window receives:
- All PASSED.md and CLEARED.md from verification and security
- tasks/{id}/architecture/dependency-graph.json
- tasks/{id}/planning/execution-plan.md
- Regression test results from executor
- Current repos/{project} HEAD state
Produces:
# merge-plan.md
## Integration Assessment: task-2026-001
### Regression Results
All 247 pre-existing tests: PASS
New tests added: 14 (all pass)
Coverage delta: +2.1%
### Architectural Drift
None detected. Implementation follows interface contract in interfaces.md.
No new external dependencies introduced beyond those in architecture/risk-analysis.md.
### Deployment Implications
- Database migration required before code deploy (unit-001)
- Migration is backwards-compatible: old code reads null cursor as "start from beginning"
- No feature flag required
- Rollback procedure: run migration rollback script (included in unit-001 patch)
### Merge Status: APPROVED
Cannot: Re-run verification, override security findings, approve partial implementations.
Persona: Engineering manager who has debugged many postmortems. Interested in systemic causes, not blame.
Single objective: Analyze the completed task pipeline, attribute failures to phases, extract reusable lessons, and update the world model only with evidence meeting promotion criteria.
Context window receives:
- All phase artifacts for the task (read-only)
- shared/rollback-log.md entries for this task
- agents/retrospective-agent/skills/
- world/reward-evidence/ (accumulator for cross-task patterns)
Produces:
# phase-failures.md
## task-2026-001 Phase Failures
| Phase | Status | Cause | Iterations |
|---|---|---|---|
| Requirements | PASS | — | 1 |
| Architecture | PASS | — | 1 |
| Planning | PASS | — | 1 |
| Implementation | BLOCKED (1x) | Security finding: token in query param | 2 |
| Verification | BLOCKED (1x) | Security reviewer caught issue | — |
| Security | PASS | — | 1 |
| Integration | PASS | — | 1 |
Root cause: Builder agent did not apply security-and-hardening skill
to auth token handling. Skill was present in skill library but not
activated because unit type was tagged "library_module" not "auth_module."
Recommendation: Expand security-and-hardening activation trigger to include
any unit whose interfaces.md section references Bearer tokens or credentials.
# reward.yaml
task_id: task-2026-001
composite_score: +1 # majority dimensions positive
reward_decomposition:
  correctness: +1 # all ACs verified by tool
  spec_fidelity: +1 # implementation matches acceptance-criteria.md
  architectural_compliance: +1 # no ADR violations
  verification_cleanliness: 0 # one security rollback required
  phase_discipline: +1 # no phase boundary violations
  context_efficiency: +1 # no unnecessary escalations
  rollback_stability: +1 # integration introduced no regressions
  maintainability: +1 # coverage increased, complexity stable
tool_verified: true
model_tier_used: mid
phases_rolled_back: [implementation]
rollback_count: 1
Pattern promotion rules: A pattern enters world/verified-patterns/ only when:
- It has appeared in 3 or more task retrospectives
- All appearances have composite score +1
- At least one appearance includes tool-verified integration test coverage
- The retrospective agent writes a promotion proposal that the pipeline runner confirms
This prevents single-task superstition from becoming institutional knowledge.
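The four promotion rules reduce to a single predicate. A sketch, assuming each appearance of a pattern is recorded as a dictionary; the field name `integration_coverage_tool_verified` and the `runner_confirmed` flag are hypothetical stand-ins for however the evidence accumulator stores this.

```python
MIN_APPEARANCES = 3


def eligible_for_promotion(appearances: list[dict], runner_confirmed: bool) -> bool:
    """Apply the promotion rules to one candidate pattern."""
    return bool(
        len(appearances) >= MIN_APPEARANCES
        and all(a["composite_score"] == 1 for a in appearances)
        and any(a["integration_coverage_tool_verified"] for a in appearances)
        and runner_confirmed
    )
```

A single appearance with a lower composite score disqualifies the pattern, no matter how many successful appearances surround it.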
The most important operational detail: each agent receives the minimum context required to complete its phase, assembled fresh from structured artifacts, not from conversation history.
Requirements Agent context:
~800 tokens raw task + persona + skill
Architecture Agent context:
~2,000 tokens requirements package + relevant world patterns
Planning Agent context:
~1,500 tokens requirements ACs + architecture + planning skill
Builder Agent context (per unit):
~3,000 tokens single unit definition + interfaces + affected code slice
Verification Agent context:
~2,500 tokens ACs + interfaces + patches + tool output
Security Agent context:
~2,000 tokens patches + risk analysis + OWASP checklist
Integration Agent context:
~2,000 tokens approval signals + regression output + dependency graph
Retrospective Agent context:
~3,500 tokens full task artifact set (read-only)
No phase sees reward history from unrelated tasks. No phase sees other agents' internal reasoning. No phase sees more of the codebase than its work unit requires.
This is aggressive context minimization. It is not a performance optimization — it is a correctness constraint.
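Treating the budget as a correctness constraint suggests failing loudly when a context would exceed it, rather than silently truncating. The sketch below does that under two labeled assumptions: tokens are estimated at roughly four characters each (a crude heuristic, not a real tokenizer), and overflow raises an error. `estimate_tokens` and `assemble_context` are hypothetical helpers.

```python
def estimate_tokens(text: str) -> int:
    """Crude estimate: about four characters per token (assumption)."""
    return (len(text) + 3) // 4


def assemble_context(sections: list[tuple[str, str]], budget: int) -> str:
    """Build a phase context from (title, body) pairs, refusing to exceed budget."""
    parts = []
    used = 0
    for title, body in sections:
        block = f"## {title}\n{body}\n"
        cost = estimate_tokens(block)
        if used + cost > budget:
            raise ValueError(
                f"section {title!r} needs {cost} tokens; "
                f"{budget - used} left of {budget}"
            )
        parts.append(block)
        used += cost
    return "\n".join(parts)
```

Each call starts from an empty list, so nothing carries over from a previous phase or conversation.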
Model tier is assigned per phase, not per task. The routing policy lives in world/routing-policies/ and is updated by the retrospective agent based on empirical failure data.
| Phase | Default Tier | Escalation Trigger |
|---|---|---|
| Requirements | cheap | Ambiguous domain (legal, medical, financial) |
| Architecture | mid | Novel system topology; no matching verified pattern |
| Planning | cheap | Standard decomposition; known patterns |
| Implementation (per unit) | cheap | Unit type: scaffold, boilerplate |
| Implementation (per unit) | mid | Unit type: library_module, service_module |
| Implementation (per unit) | premium | Unit type: auth, security-critical, novel algorithm |
| Verification | mid | All code review passes through mid tier |
| Security | mid→premium | Any OWASP finding escalates to premium |
| Integration | cheap | Structured checks, not creative reasoning |
| Retrospective | mid | Learning and pattern promotion require coherent reasoning |
Routing is logged in every reward entry. After 50 tasks, the retrospective agent can identify whether a given phase-tier combination correlates with rollback frequency. That data updates the routing policy file directly.
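The routing table can be read as a lookup keyed by phase, with the implementation phase keyed by unit type. The sketch below encodes only the rows whose target tier is stated. The other escalation triggers are omitted because the table does not say which tier they escalate to. The `mid` fallback for unlisted unit types and the key spellings are assumptions.

```python
from typing import Optional

PHASE_DEFAULT_TIER = {
    "requirements": "cheap",
    "architecture": "mid",
    "planning": "cheap",
    "verification": "mid",
    "security": "mid",
    "integration": "cheap",
    "retrospective": "mid",
}

UNIT_TYPE_TIER = {
    "scaffold": "cheap",
    "boilerplate": "cheap",
    "library_module": "mid",
    "service_module": "mid",
    "auth": "premium",
    "security-critical": "premium",
    "novel-algorithm": "premium",
}


def select_tier(
    phase: str,
    unit_type: Optional[str] = None,
    owasp_finding: bool = False,
) -> str:
    """Pick a model tier for one phase invocation."""
    if phase == "implementation":
        # Assumption: unit types missing from the table fall back to mid.
        return UNIT_TYPE_TIER.get(unit_type, "mid")
    if phase == "security" and owasp_finding:
        return "premium"  # "Any OWASP finding escalates to premium"
    return PHASE_DEFAULT_TIER[phase]
```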
flowchart TB
subgraph Host[Docker Host]
subgraph Runner[Orchestration Layer]
PR[Pipeline Runner<br/>state machine controller]
QM[Queue Manager]
end
subgraph Agents[Agent Layer - One per phase]
RA[Requirements Agent]
AA[Architect Agent]
PA[Planner Agent]
BA[Builder Agent]
VA[Verifier Agent]
SA[Security Agent]
IA[Integration Agent]
RTA[Retrospective Agent]
end
subgraph Data[Shared Volume: /data]
Tasks[tasks/<br/>task artifacts]
World[world/<br/>verified patterns]
State[pipeline-state.yaml]
Metrics[metrics/<br/>pipeline.jsonl]
end
subgraph Exec[Execution Layer]
E[Executor Container<br/>air-gapped<br/>no network]
end
subgraph External[External]
API[LLM API<br/>]
end
end
PR -->|dispatches to| Agents
Agents -->|read/write| Data
Agents -->|API calls| API
RA -->|block signal| PR
PR -->|enforces contracts| Agents
VA -->|sends test requests| E
E -->|returns JSON results| VA
PR -->|writes metrics| Metrics
PR -->|updates| State
version: "3.9"
services:
  pipeline-runner:
    build: .
    container_name: agentos-pipeline
    environment:
      - AGENT_ROLE=pipeline_runner
      - LLM_API_KEY=${LLM_API_KEY}
      - CHEAP_MODEL=${CHEAP_MODEL}
      - MID_MODEL=${MID_MODEL}
      - PREMIUM_MODEL=${PREMIUM_MODEL}
      - PIPELINE_POLL_INTERVAL=10
    volumes:
      - agent-data:/data
    networks:
      - agent-net
  requirements-agent:
    build: .
    container_name: agentos-requirements
    environment:
      - AGENT_ROLE=requirements
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net
  architect-agent:
    build: .
    container_name: agentos-architect
    environment:
      - AGENT_ROLE=architect
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net
  planner-agent:
    build: .
    container_name: agentos-planner
    environment:
      - AGENT_ROLE=planner
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net
  builder-agent:
    build: .
    container_name: agentos-builder
    environment:
      - AGENT_ROLE=builder
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net
  verifier-agent:
    build: .
    container_name: agentos-verifier
    environment:
      - AGENT_ROLE=verifier
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net
  security-agent:
    build: .
    container_name: agentos-security
    environment:
      - AGENT_ROLE=security
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net
  integration-agent:
    build: .
    container_name: agentos-integration
    environment:
      - AGENT_ROLE=integration
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net
  retrospective-agent:
    build: .
    container_name: agentos-retrospective
    environment:
      - AGENT_ROLE=retrospective
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net
  executor:
    image: python:3.12-slim
    container_name: agentos-executor
    user: "65534:65534"
    read_only: true
    tmpfs:
      - /tmp:size=256m
      - /workspace:size=512m
    volumes:
      - agent-data:/data:ro
    # air-gapped: no network at all. An empty networks list would still
    # attach the default network, so networking is disabled explicitly.
    network_mode: none
    cpus: "2.0"
    mem_limit: 512m
volumes:
  agent-data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: ./data
networks:
  agent-net:
    driver: bridge
    internal: true # blocks outbound traffic; agents that call the LLM API need a separate egress path
The pipeline-runner is the orchestrator. It:
- reads shared/queue.md for new tasks
- reads shared/pipeline-state.yaml to know the current phase
- reads the current phase's contract to validate inputs
- dispatches to the appropriate agent container via shared/inbox/
- validates phase outputs before updating pipeline-state.yaml
- writes shared/rollback-log.md on any phase rejection
No agent container transitions the pipeline. Only the runner does. This means phase transitions are auditable and deterministic — the runner's logic is a simple state machine, not an LLM.
# shared/pipeline-state.yaml
task_id: task-2026-001
current_phase: verification
phase_status: in_progress
phase_started: 2026-05-10T09:14:00Z
rollback_count_this_phase: 0
total_rollbacks_this_task: 1
phase_history:
  - phase: requirements
    status: completed
    iterations: 1
    completed: 2026-05-10T08:30:00Z
  - phase: architecture
    status: completed
    iterations: 1
    completed: 2026-05-10T08:52:00Z
  - phase: planning
    status: completed
    iterations: 1
    completed: 2026-05-10T09:01:00Z
  - phase: implementation
    status: completed
    iterations: 2
    rollbacks: 1
    rollback_reason: "Security finding: token in query param (unit-002, line 47)"
    completed: 2026-05-10T09:13:00Z
human_review_pending: false
blocked: false
The state machine has eight phase states, a terminal DONE state, and two exceptional states:
REQUIREMENTS → ARCHITECTURE → PLANNING → IMPLEMENTATION
→ VERIFICATION → SECURITY → INTEGRATION → RETROSPECTIVE → DONE
Exceptional:
BLOCKED_AWAITING_HUMAN (any phase can enter this)
ROLLED_BACK (any phase can send execution to previous phase)
Rollbacks are bounded: any phase that rolls back three times on the same issue escalates to BLOCKED_AWAITING_HUMAN. The system will not loop indefinitely.
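To make the bounded-rollback rule concrete, here is a minimal sketch of the runner's transition logic. It is not the actual runner: the names `PipelineState`, `advance`, and `reject` are hypothetical, the transient ROLLED_BACK state is collapsed into a direct move to the previous phase, and it assumes the third rollback on the same issue is the one that escalates.

```python
from dataclasses import dataclass, field

PHASES = [
    "REQUIREMENTS", "ARCHITECTURE", "PLANNING", "IMPLEMENTATION",
    "VERIFICATION", "SECURITY", "INTEGRATION", "RETROSPECTIVE",
]
# Assumption: the third rollback on the same issue escalates to a human.
MAX_ROLLBACKS_PER_ISSUE = 3


@dataclass
class PipelineState:
    current: str = "REQUIREMENTS"
    # Maps (phase, issue) to the number of rollbacks seen so far.
    rollbacks: dict = field(default_factory=dict)


def advance(state: PipelineState) -> str:
    """Move to the next phase after a gate passes; RETROSPECTIVE leads to DONE."""
    if state.current not in PHASES:
        raise ValueError(f"cannot advance from {state.current}")
    idx = PHASES.index(state.current)
    state.current = PHASES[idx + 1] if idx + 1 < len(PHASES) else "DONE"
    return state.current


def reject(state: PipelineState, issue: str) -> str:
    """Handle a gate rejection: roll back one phase, or escalate at the bound."""
    if state.current not in PHASES:
        raise ValueError(f"cannot reject from {state.current}")
    key = (state.current, issue)
    state.rollbacks[key] = state.rollbacks.get(key, 0) + 1
    if state.rollbacks[key] >= MAX_ROLLBACKS_PER_ISSUE:
        state.current = "BLOCKED_AWAITING_HUMAN"
    else:
        idx = PHASES.index(state.current)
        state.current = PHASES[max(idx - 1, 0)]
    return state.current
```

Because every transition goes through two small pure functions, the full transition table can be unit-tested without calling a model.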
The original reward model scored task outcomes. The phase-oriented model scores pipeline health — how reliably each phase produced correct outputs, how many rollbacks occurred, which phases caused cascading failures.
# reward.yaml (full schema)
task_id: task-2026-001
timestamp: 2026-05-10T11:00:00Z
composite_score: +1
# Task-level outcome dimensions
reward_decomposition:
  correctness: +1 # tool-verified: all ACs pass
  spec_fidelity: +1 # implementation matches spec exactly
  architectural_compliance: +1 # no ADR violations detected
  verification_cleanliness: 0 # 1 rollback required (acceptable but not clean)
  phase_discipline: +1 # no agent violated phase contract boundaries
  context_efficiency: +1 # no unnecessary premium escalation
  rollback_stability: +1 # integration regression-free
  maintainability: +1 # coverage delta positive
# Phase-level health (feeds routing and skill policies)
phase_health:
  requirements: {iterations: 1, outcome: clean}
  architecture: {iterations: 1, outcome: clean}
  planning: {iterations: 1, outcome: clean}
  implementation: {iterations: 2, outcome: rollback, cause: security_gap}
  verification: {iterations: 1, outcome: clean}
  security: {iterations: 1, outcome: clean}
  integration: {iterations: 1, outcome: clean}
  retrospective: {iterations: 1, outcome: clean}
# Attribution for learning
failure_attribution:
  phase: implementation
  agent: builder-agent
  cause: "security skill not activated for Bearer token handling"
  resolution: "updated routing: auth-adjacent units now activate security skill"
# Evidence accumulation
context_tags: [python, async, api-client, cursor-pagination, rate-limiting]
model_tiers_used:
  requirements: cheap
  architecture: mid
  planning: cheap
  implementation: mid
  verification: mid
  security: mid
  integration: cheap
  retrospective: mid
tool_verified: true
The retrospective agent writes to world/reward-evidence/ rather than directly to world/verified-patterns/. The accumulator tracks:
# world/reward-evidence/cursor-pagination-pattern.yaml
pattern_candidate: cursor-based-api-sync
evidence:
  - task_id: task-2026-001
    composite: +1
    tool_verified: true
    context_tags: [cursor-pagination, api-client]
  - task_id: task-2026-003
    composite: +1
    tool_verified: true
    context_tags: [cursor-pagination, webhook-replay]
promotion_criteria:
  min_successful_tasks: 3 # not yet met: 2/3
  min_tool_verified: true # met
  min_integration_coverage: 1 # met
status: accumulating # will promote on 3rd qualifying task
No agent writes directly to world/verified-patterns/. The pipeline runner promotes candidates when criteria are met. This is not a policy the LLM decides — it is a deterministic check run by the runner after every retrospective completion.
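A rough sketch of what that deterministic check could look like. The `Evidence` class and both function names are hypothetical, and the sketch assumes an entry qualifies when its composite score is positive and it was tool-verified. It leaves out `min_integration_coverage`, since the text does not say how that value is measured.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    task_id: str
    composite: int
    tool_verified: bool


def qualifies(entry: Evidence) -> bool:
    """Assumption: an entry counts only if it succeeded and was tool-verified."""
    return entry.composite > 0 and entry.tool_verified


def promotion_status(evidence: list[Evidence], min_successful_tasks: int = 3) -> str:
    """Return 'promote' once enough qualifying tasks exist, else 'accumulating'."""
    qualifying = sum(1 for entry in evidence if qualifies(entry))
    return "promote" if qualifying >= min_successful_tasks else "accumulating"


# The two entries from the evidence file above: still one short of the threshold.
evidence = [
    Evidence("task-2026-001", composite=1, tool_verified=True),
    Evidence("task-2026-003", composite=1, tool_verified=True),
]
status_now = promotion_status(evidence)

# A made-up third qualifying task (hypothetical id) tips the check over.
evidence.append(Evidence("task-hypothetical", composite=1, tool_verified=True))
status_after = promotion_status(evidence)
```

There is no model call anywhere in this path, which is the point: the runner counts evidence, it does not judge it.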
Traditional agent observability measures output quality. This system measures pipeline reliability.
| Metric | Signal |
|---|---|
| phase_rollback_rate per phase | Which phases are producing unreliable artifacts |
| iterations_per_phase distribution | Where is the most rework happening |
| rollback_cascade_rate | How often a rollback in phase N causes rollback in phase N-1 |
| human_escalation_rate | How often the pipeline cannot self-resolve |
| spec_ambiguity_rate | How often requirements agent blocks on missing information |
| architecture_violation_rate | How often verification catches spec divergence |
| security_finding_rate by severity | BLOCKER findings indicate systemic builder gaps |
| pattern_promotion_rate | Rate of learning solidifying into reusable knowledge |
| context_tokens_per_phase | Is context scope creeping upward |
| model_tier_distribution per phase | Is routing empirically correct |
| composite_reward_trend over 50 tasks | Is the system improving or plateauing |
These are emitted as JSONL to data/metrics/pipeline.jsonl after every phase completion. The runner writes them; agents do not.
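As an illustration, the runner-side emit and one derived metric might look like the sketch below. The record fields (`task_id`, `phase`, `iterations`, `rolled_back`) and both function names are assumptions for the example; the article does not define the JSONL schema.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def emit_phase_metric(path: Path, task_id: str, phase: str,
                      iterations: int, rolled_back: bool) -> dict:
    """Append one JSON record per completed phase (one line per record)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "task_id": task_id,
        "phase": phase,
        "iterations": iterations,
        "rolled_back": rolled_back,
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def phase_rollback_rate(path: Path, phase: str) -> float:
    """Fraction of recorded completions of `phase` that involved a rollback."""
    with path.open(encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    relevant = [r for r in records if r["phase"] == phase]
    if not relevant:
        return 0.0
    return sum(1 for r in relevant if r["rolled_back"]) / len(relevant)
```

Append-only JSONL keeps the write path trivial for the runner, and every metric in the table can be recomputed later from the raw records.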
The original AgentSWEOS optimized for:
task success rate
AgentSWEOS vNext optimizes for:
reliable state transitions between phases
That shift matters because a system that completes 80% of its tasks by cutting corners on verification is not better than one that completes 60% but catches every failure before it reaches production.
The architecture treats LLM agents as what they are — probabilistic transformers that perform well in narrow, structured contexts — and builds reliability through:
- explicit contracts instead of tacit understanding
- structured artifacts instead of conversational state
- independent verification instead of self-certification
- deterministic phase transitions instead of improvised iteration
- bounded context windows instead of accumulating history
- evidence-gated learning instead of autonomous belief updates
The result is a system that degrades gracefully, fails loudly, attributes failures precisely, and improves measurably over time.
That is the engineering standard software should be held to, regardless of whether the engineer is human or probabilistic.
The Minimal Agent Specification (MAS)
One File. Few Tokens. Any Agent.
While working on AgentOS I came across this problem:
You have 20 specialist agents. Your orchestrator needs to know: Who are they? What can they do? Where's their state?
You don't want to load 20 full personas into memory. You don't want to parse 20,000 tokens of backstory just to route a simple task.
You want a business card. A tiny header file that tells you everything you need before you decide to have a conversation.
That's the Minimal Agent Specification.
What It Is
A single file, `.agent`, in every agent's directory. Small enough to scan hundreds in seconds. Rich enough to route tasks intelligently.
That's it. The orchestrator reads this file and immediately knows who the agent is, what it can do, and where its state lives.
Why It Matters
Before MAS: Orchestrator loads every agent's full persona. 10 agents × 2000 tokens = 20,000 tokens before the first task. Slow. Expensive. Fragile.
After MAS: Orchestrator scans `.agent` files. 10 agents × 125 tokens = 1,250 tokens. Then loads only the agents that match the task.
The difference is a 10x to 100x reduction in context loading.
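The arithmetic is easy to play with. The helper below is a back-of-the-envelope sketch, not a measurement: `context_tokens` is a hypothetical name, and it assumes every matched agent still costs one full 2000-token persona on top of the header scan.

```python
def context_tokens(num_agents: int, matched_agents: int,
                   persona_tokens: int = 2000, header_tokens: int = 125):
    """Return (tokens before MAS, tokens for the header scan, total after MAS)."""
    before = num_agents * persona_tokens
    scan = num_agents * header_tokens
    after = scan + matched_agents * persona_tokens
    return before, scan, after


# Ten agents, one of which matches the task.
before, scan, after = context_tokens(num_agents=10, matched_agents=1)
# before = 20000, scan = 1250, after = 3250
```

With these figures, the actual saving depends on how many agents end up being loaded after the scan.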
The Agent Text Size
The MAS is designed for the real world. Every word is a token. No waste.
For comparison:
125 tokens is the sweet spot. Small enough to scan hundreds of agents in one batch. Rich enough to make intelligent routing decisions.
What You Get
For the orchestrator: A registry that requires no database. Just a filesystem and 125 tokens per agent.
For the agent: A stable identity that doesn't change when you learn new skills. Your `.agent` says who you are. Your `skills.md` says what you've learned.
For the operator: One file to edit when an agent's domain or capabilities change. No hunting through prompts.
For the system: The ability to discover, validate, and route to agents without loading their full context. This is how you scale from 5 agents to 500.
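Here is a minimal sketch of filesystem-only discovery. The `.agent` field layout is not shown in this text, so the sketch assumes a made-up one: plain `key: value` lines with a comma-separated `capabilities` field. The function names are hypothetical too.

```python
from pathlib import Path


def parse_agent_header(text: str) -> dict:
    """Parse 'key: value' lines (assumed layout, not the official MAS format)."""
    header = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        header[key.strip()] = value.strip()
    return header


def discover_agents(root: Path) -> dict:
    """Build a registry from every <root>/<agent-dir>/.agent file."""
    registry = {}
    for header_file in sorted(root.glob("*/.agent")):
        text = header_file.read_text(encoding="utf-8")
        registry[header_file.parent.name] = parse_agent_header(text)
    return registry


def route(registry: dict, capability: str) -> list:
    """Return the agents whose header lists the requested capability."""
    matches = []
    for name, header in registry.items():
        caps = [c.strip() for c in header.get("capabilities", "").split(",")]
        if capability in caps:
            matches.append(name)
    return matches
```

Only the agents returned by `route` would then have their full persona loaded; everything else stays on disk.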
The One-Page Specification
The Rule
That's it. A business card for every agent. A discovery layer that needs no database. A scale enabler for multi-agent systems.
Your agents are only as useful as your ability to find them. The Minimal Agent Specification makes sure you always can.