@MuhammadYossry
Last active May 10, 2026 09:28
Draft: A Phase-Oriented Software Engineering Runtime for Probabilistic Agents

AgentSWEOS vNext

A Phase-Oriented Software Engineering Runtime for LLM Coding Agents

"Human teams compensate for incomplete docs with meetings, intuition, and shared mental models. LLM systems cannot. Therefore every artifact is a coordination layer, and every phase is a contract."


The Core Premise

Human software teams function as a distributed system with implicit channels: hallway conversations, shared organizational context, years of accumulated intuition, real-time social correction. When a spec is ambiguous, a senior engineer fills the gap from memory and instinct.

LLM agents have none of that.

What they have instead is this: given a bounded context with clear inputs and an explicit goal, frontier models perform remarkably well. That is the design constraint this architecture is built around.

The failure mode of current LLM agent systems is not a lack of intelligence; it is context pollution, objective blending, and unconstrained scope. An agent asked to simultaneously plan, build, test, review, and learn will hallucinate across all five, not because the model is weak, but because the runtime was designed for humans.

AgentSWEOS vNext is designed for what LLMs actually are:

bounded probabilistic transformers that perform best with:

  • minimal, scoped context
  • explicit input artifacts
  • single, stated objectives
  • structured output contracts
  • verification before state transition

The redesign treats the SWE pipeline not as Agile iteration but as a compiler pipeline: each pass receives a structured intermediate representation, transforms it according to narrow rules, and emits a verified output for the next pass. Errors are caught at phase boundaries, not discovered at ship time.


Why Phases Beat Loops for LLM Agents

The loop model breaks down because:

| Problem | Mechanism |
|---|---|
| Context drift | Each iteration adds to prompt history; reasoning degrades as the window fills |
| Objective blending | "Build and test and review" activates conflicting evaluation criteria simultaneously |
| Error amplification | A flawed assumption in planning contaminates implementation, tests, and reward |
| Self-certification | The same agent that built it cannot reliably review it; it anchors to its own reasoning |
| Unmeasurable learning | What did the loop learn? Which phase caused the failure? |

The phase model addresses each:

| Problem | Phase model solution |
|---|---|
| Context drift | Each phase starts fresh from structured artifacts, not accumulated chat |
| Objective blending | One agent, one objective, one output type per phase |
| Error amplification | Phase gates catch failures before they propagate downstream |
| Self-certification | Verification and security phases are structurally independent |
| Unmeasurable learning | Failures are attributed to the phase that produced the bad artifact |

flowchart TB
    subgraph Contract[Phase Contract Example: Architecture Agent]
        Inputs[Required Inputs<br/>spec.md<br/>acceptance-criteria.md<br/>constraints.md]
        
        Context[Context Scope<br/>requirements/<br/>verified-patterns/<br/>architect-agent/skills/]
        
        Agent[Architect Agent<br/>Persona: Staff Engineer<br/>Objective: Define system topology]
        
        Outputs[Produced Outputs<br/>adr-XXX.md<br/>interfaces.md<br/>risk-analysis.md]
        
        Rules[Validation Rules<br/>- every external system named<br/>- every interface typed<br/>- assumptions explicit]
        
        Forbidden[Forbidden Actions<br/>✗ write to implementation/<br/>✗ write to planning/<br/>✗ invoke code executor]
    end

    Inputs --> Agent
    Context --> Agent
    Agent --> Outputs
    Outputs --> Rules
    
    Rules -->|pass| Gate[APPROVED.md → next phase]
    Rules -->|fail| Block[BLOCKED.md → rollback]
    
    Forbidden -.->|enforced by runtime| Agent

The Eight Phases

flowchart LR
    subgraph Input[Task Input]
        Q[Queue] --> R[Requirements Agent]
    end

    subgraph Pipeline[Eight-Phase Pipeline]
        direction LR
        R --> A[Architecture Agent]
        A --> P[Planning Agent]
        P --> I[Implementation Agent]
        I --> V[Verification Agent]
        V --> S[Security Agent]
        S --> Int[Integration Agent]
        Int --> Ret[Retrospective Agent]
    end

    subgraph Gates[Phase Gates]
        G1[spec complete?] 
        G2[interfaces valid?]
        G3[plan reviewable?]
        G4[lint + unit pass?]
        G5[all checks pass?]
        G6[no blocking findings?]
        G7[no regressions?]
    end

    R --> G1 --> A
    A --> G2 --> P
    P --> G3 --> I
    I --> G4 --> V
    V --> G5 --> S
    S --> G6 --> Int
    Int --> G7 --> Ret
    
    Ret --> Done[Done ✓]
    
    Block[BLOCKED] 
    Block -.->|rollback to previous phase| A
    Block -.->|escalate to human| Human[Human Review]

Each phase runs a different specialist agent with a different persona, a different context window, and a different output contract. No phase can mutate the outputs of a preceding phase — only signal a rollback.
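
The gate-and-rollback flow in the diagram can be sketched as a small transition function. This is a minimal illustration under stated assumptions, not the runtime's actual controller: the `next_phase` function and the `HUMAN_REVIEW` sentinel are hypothetical names, and the sketch simplifies the diagram by always rolling back exactly one phase on a failed gate (the first phase, having no predecessor, escalates instead).

```python
from typing import Optional

# Phase order as listed in the pipeline diagram.
PHASES = [
    "requirements", "architecture", "planning", "implementation",
    "verification", "security", "integration", "retrospective",
]

# Hypothetical sentinel for "escalate to human"; not a name from the source.
HUMAN_REVIEW = "human-review"


def next_phase(current: str, gate_passed: bool) -> Optional[str]:
    """Return the phase to run next, or None when the pipeline is done."""
    index = PHASES.index(current)
    if gate_passed:
        # Advance, or finish after the last phase.
        return PHASES[index + 1] if index + 1 < len(PHASES) else None
    if index == 0:
        # Nothing to roll back to: a human has to resolve the block.
        return HUMAN_REVIEW
    # Failed gate: hand control back to the preceding phase.
    return PHASES[index - 1]
```

Keeping the transition logic in one pure function means the controller never needs an LLM call to decide what runs next; only the gate verdicts are inputs.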


Phase Contracts: The Foundational Primitive

A phase contract is the machine-readable interface between pipeline stages. It replaces the implicit human coordination that PRDs, standups, and Slack threads provide in human teams.

flowchart LR
    subgraph TaskArtifacts[Task Artifacts]
        T1[requirements/<br/>spec.md, ACs]
        T2[architecture/<br/>ADRs, interfaces]
        T3[planning/<br/>work breakdown]
        T4[implementation/<br/>unit patches]
        T5[verification/<br/>test results]
    end

    subgraph AgentsWithContext[Agents with Bounded Context]
        direction TB
        
        A1[Requirements Agent<br/>~800 tokens<br/>raw task only]
        A2[Architect Agent<br/>~2,000 tokens<br/>requirements + patterns]
        A3[Builder Agent<br/>~3,000 tokens<br/>1 unit + interfaces]
        A4[Verifier Agent<br/>~2,500 tokens<br/>ACs + patches + tool output]
    end

    T1 -.->|scope: requirements/*| A2
    T1 --> A1
    
    T2 -.->|scope: architecture/*| A3
    T2 -.->|scope: architecture/*| A4
    
    T1 --> A4
    
    T3 -.->|not visible| A3
    
    T4 --> A4
    
    T5 --> A4

    
    note1[Builder never sees:<br/>- reward history<br/>- other units' patches<br/>- full repository<br/>- retrospective data]
    
# phases/contracts/architecture.yaml

phase: architecture
version: 1

required_inputs:
  - tasks/{task-id}/requirements/spec.md
  - tasks/{task-id}/requirements/acceptance-criteria.md
  - tasks/{task-id}/requirements/constraints.md

produced_outputs:
  - tasks/{task-id}/architecture/adr-001.md      # at minimum one ADR
  - tasks/{task-id}/architecture/interfaces.md   # all external/internal APIs typed
  - tasks/{task-id}/architecture/risk-analysis.md

validation_rules:
  - every external system is named and bounded
  - every interface has typed inputs and outputs
  - every assumption is written as an explicit assumption, not embedded prose
  - risk analysis covers: data, auth, third-party dependencies, rollback path

allowed_mutations:
  - tasks/{task-id}/architecture/*

forbidden_actions:
  - write to tasks/{task-id}/implementation/*
  - write to tasks/{task-id}/planning/*
  - edit tasks/{task-id}/requirements/*           # can only signal rejection
  - invoke code executor
  - call external APIs

rollback_signal:
  path: tasks/{task-id}/architecture/BLOCKED.md
  reason: required              # why the phase cannot proceed
  missing: required             # what spec information is absent

context_scope:
  include:
    - tasks/{task-id}/requirements/
    - agents/architect-agent/skills/
    - agents/architect-agent/persona.md
    - world/verified-patterns/
    - world/anti-patterns/
  exclude:
    - tasks/{task-id}/implementation/
    - tasks/{task-id}/retrospective/
    - agents/*/rewards.md          # no reward history leaks into architecture reasoning

The contract is parsed by the runtime before any LLM call is made. If required inputs are missing or malformed, the phase does not start; instead, the runtime emits a structured block signal upstream. The LLM never sees incomplete state.
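
A pre-flight check of this kind might look like the sketch below. It assumes the contract has already been loaded into a plain dict (for example from the YAML above), and it approximates "malformed" as "empty file", which is a simplification. The function names `missing_inputs` and `can_start` are hypothetical.

```python
from pathlib import Path
from typing import List


def missing_inputs(contract: dict, task_id: str, root: Path) -> List[str]:
    """Return required input paths that are absent or empty under root."""
    missing = []
    for template in contract["required_inputs"]:
        # Contract paths use a {task-id} placeholder, as in architecture.yaml.
        relative = template.replace("{task-id}", task_id)
        path = root / relative
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(relative)
    return missing


def can_start(contract: dict, task_id: str, root: Path) -> bool:
    """A phase may start only when every required input is present."""
    return not missing_inputs(contract, task_id, root)
```

If `missing_inputs` returns a non-empty list, that list is a natural payload for the block signal, since it names exactly what the upstream phase failed to provide.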


Filesystem Layout

The layout is organized around tasks as units of work and phases as named subtrees within each task. Agent state is strictly separate from task artifacts.

/data/
├── tasks/
│   └── task-2026-001/
│       ├── META.yaml                    # task id, priority, created, status
│       │
│       ├── requirements/
│       │   ├── spec.md                  # what and why, not how
│       │   ├── acceptance-criteria.md   # machine-checkable conditions for done
│       │   ├── constraints.md           # non-negotiable boundaries
│       │   └── APPROVED.md              # signed off before architecture starts
│       │
│       ├── architecture/
│       │   ├── adr-001.md               # Architecture Decision Record
│       │   ├── interfaces.md            # every API typed: inputs, outputs, errors
│       │   ├── dependency-graph.json    # machine-readable system graph
│       │   ├── risk-analysis.md
│       │   └── APPROVED.md
│       │
│       ├── planning/
│       │   ├── work-breakdown.md        # slices, each independently testable
│       │   ├── dependency-order.json    # which units block which
│       │   ├── execution-plan.md        # sequenced, time-bounded units
│       │   └── APPROVED.md
│       │
│       ├── implementation/
│       │   ├── units/
│       │   │   ├── unit-001/
│       │   │   │   ├── patch.diff
│       │   │   │   ├── commit-message.md
│       │   │   │   └── self-check.md    # builder's pre-verification notes
│       │   │   └── unit-002/
│       │   └── READY_FOR_VERIFICATION.md
│       │
│       ├── verification/
│       │   ├── lint.json                # machine output, not prose
│       │   ├── unit-tests.json          # pass/fail/coverage per unit
│       │   ├── integration-tests.json
│       │   ├── spec-fidelity.md         # does impl match acceptance criteria?
│       │   ├── reviewer-report.md       # five-axis review, severity-labeled
│       │   └── PASSED.md / BLOCKED.md
│       │
│       ├── security/
│       │   ├── threat-model.md
│       │   ├── owasp-checklist.json     # structured, not prose
│       │   ├── secrets-scan.json        # automated scan output
│       │   ├── auth-paths.md            # every auth boundary documented
│       │   └── CLEARED.md / HUMAN_REVIEW_REQUIRED.md
│       │
│       ├── integration/
│       │   ├── merge-plan.md
│       │   ├── compatibility-checks.json
│       │   ├── regression-results.json
│       │   ├── deployment-implications.md
│       │   └── MERGED.md / BLOCKED.md
│       │
│       └── retrospective/
│           ├── phase-failures.md        # which phase failed and why
│           ├── root-cause.md
│           ├── lessons.md
│           └── reward.yaml              # structured, tool-verified score
│
├── phases/
│   └── contracts/
│       ├── requirements.yaml
│       ├── architecture.yaml
│       ├── planning.yaml
│       ├── implementation.yaml
│       ├── verification.yaml
│       ├── security.yaml
│       ├── integration.yaml
│       └── retrospective.yaml
│
├── agents/
│   ├── requirements-agent/
│   │   ├── persona.md
│   │   ├── skills/
│   │   └── rewards.md
│   ├── architect-agent/
│   ├── planner-agent/
│   ├── builder-agent/
│   ├── verifier-agent/
│   ├── security-agent/
│   ├── integration-agent/
│   └── retrospective-agent/
│
├── shared/
│   ├── queue.md                         # inbound task queue
│   ├── pipeline-state.yaml              # current task + phase + status
│   ├── rollback-log.md                  # all phase rejections, indexed
│   └── human-review-queue.md           # blocked items needing human input
│
└── world/
    ├── verified-patterns/               # promoted after 3+ successful tasks
    ├── anti-patterns/                   # promoted after 2+ verified failures
    ├── routing-policies/                # learned model tier assignments
    └── reward-evidence/                 # cross-task pattern evidence accumulator

The key structural rule: agents read from world/ and their own agents/<name>/ directory. They read task artifacts only within their phase's context_scope. They write only to their phase's allowed_mutations paths.

The runtime enforces this. It is not a guideline; it is a mount configuration.
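
The source places enforcement at the mount level. Purely as an illustration of the same rule, the sketch below expresses the allowed_mutations check as an application-level function. The name `write_allowed` is hypothetical, and the glob semantics (`*` also matching nested paths) are an assumption of this sketch, not something the contract format specifies.

```python
import posixpath
from fnmatch import fnmatchcase
from typing import List


def write_allowed(path: str, allowed_mutations: List[str], task_id: str) -> bool:
    """True if path falls under one of the phase's allowed_mutations globs."""
    # Collapse "..", so a path cannot escape its subtree by traversal.
    normalized = posixpath.normpath(path)
    patterns = [p.replace("{task-id}", task_id) for p in allowed_mutations]
    # fnmatchcase is case-sensitive on every platform.
    return any(fnmatchcase(normalized, pattern) for pattern in patterns)


allowed = ["tasks/{task-id}/architecture/*"]
print(write_allowed("tasks/task-2026-001/architecture/adr-001.md", allowed, "task-2026-001"))
print(write_allowed("tasks/task-2026-001/implementation/units/unit-001/patch.diff", allowed, "task-2026-001"))
```

The first call is permitted and the second is not, mirroring the architecture contract's allowed_mutations and forbidden_actions.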


The Eight Specialist Agents

① Requirements Agent

Persona: Technical product manager. Obsessed with ambiguity elimination. Refuses to proceed on vague intent.

Single objective: Transform a task description into a machine-readable requirements package that any downstream agent can execute without asking follow-up questions.

Context window receives:

  • The raw task description
  • agents/requirements-agent/skills/spec-driven-development.md
  • world/anti-patterns/underspecified-requirements.md (if it exists)
  • Prior task specs of the same type (for format consistency)

Produces:

# spec.md
## Intent
What problem this solves and for whom. One paragraph. No implementation language.

## Scope
What is explicitly IN scope. What is explicitly OUT of scope.

## Acceptance Criteria
- [ ] AC-001: [machine-verifiable condition]
- [ ] AC-002: [machine-verifiable condition]
...

## Open Questions
Questions that block downstream phases. Each must be answered before APPROVED.md is written.

# acceptance-criteria.md
Each criterion maps to a verification method:

AC-001:
  description: "API returns 429 and Retry-After header on rate limit"
  verification_method: integration_test
  test_file: tests/test_rate_limiting.py::test_429_response

AC-002:
  description: "All database writes are idempotent"
  verification_method: unit_test
  test_file: tests/test_idempotency.py

Rollback signal: If the original task description is too ambiguous to spec, the requirements agent writes tasks/{id}/requirements/BLOCKED.md with specific questions. Execution halts until a human or the task submitter resolves them.

Cannot: Write code, design systems, suggest implementation approaches. The spec describes what, never how.
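
Because each criterion in acceptance-criteria.md maps to a verification method, the requirements gate can reject a package mechanically when a criterion lacks one. The sketch below assumes the criteria have been parsed into a dict keyed by AC ID; `criteria_without_method` is a hypothetical helper name.

```python
from typing import Dict, List


def criteria_without_method(criteria: Dict[str, dict]) -> List[str]:
    """Return IDs of acceptance criteria that name no verification_method."""
    return sorted(
        ac_id
        for ac_id, criterion in criteria.items()
        if not criterion.get("verification_method")
    )


# Example data mirroring the two criteria shown above.
criteria = {
    "AC-001": {
        "description": "API returns 429 and Retry-After header on rate limit",
        "verification_method": "integration_test",
        "test_file": "tests/test_rate_limiting.py::test_429_response",
    },
    "AC-002": {
        "description": "All database writes are idempotent",
        "verification_method": "unit_test",
        "test_file": "tests/test_idempotency.py",
    },
}
unverifiable = criteria_without_method(criteria)  # empty for this example
```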


② Architecture Agent

Persona: Staff engineer with a distributed systems background. Thinks in interfaces, not implementations. Writes ADRs, not opinions.

Single objective: Given a complete spec, define the system topology, interface contracts, and risk landscape such that implementation can proceed without architectural decisions.

Context window receives:

  • tasks/{id}/requirements/ (all files)
  • agents/architect-agent/skills/
  • world/verified-patterns/ (relevant subset by tag)
  • world/anti-patterns/ (relevant subset)

Produces:

# interfaces.md
## Internal APIs

### UserSyncService
method: sync_user_batch
input:
  users: List[UserId]           # max 100 per call
  as_of: datetime               # idempotency key
output:
  synced: List[UserId]
  failed: List[{id: UserId, reason: str}]
errors:
  RateLimitError: retry after Retry-After header
  AuthError: non-retryable, escalate

## External Dependencies
### VendorAPI
  base_url: from env VENDOR_API_BASE_URL
  auth: Bearer token, rotated every 24h
  rate_limit: 100 req/min, token bucket
  pagination: cursor-based, field: next_cursor

# adr-001.md
## ADR-001: Cursor-based pagination over offset

### Status: Accepted

### Context
Vendor API uses cursor pagination. Offset pagination would require re-fetching
pages on record insertion, causing duplicate processing.

### Decision
Implement cursor pagination. Store last_cursor per sync job in the database.

### Consequences
- Resumable syncs after failure (positive)
- Cannot seek to arbitrary page (acceptable for this use case)
- Cursor expiry must be handled: if cursor is >24h old, restart from beginning

### Alternatives Rejected
Offset pagination: race condition risk, not supported by vendor API contract.

Rollback signal: If the spec is insufficient to make an architectural decision (for example, it does not state whether the system needs to be stateless), the architecture agent writes BLOCKED.md with exactly which spec sections are incomplete. It does not guess.

Cannot: Write production code, make planning decisions, assign work units.
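
The rollback signal described above carries two required fields in the contract: reason and missing. A minimal sketch of a writer for that signal follows. The markdown layout of BLOCKED.md is an assumption made for illustration (the source does not define it), and `write_blocked` is a hypothetical name.

```python
from pathlib import Path
from typing import List


def write_blocked(task_dir: Path, phase: str, reason: str, missing: List[str]) -> Path:
    """Write <task_dir>/<phase>/BLOCKED.md; both fields are mandatory."""
    # Validate before touching the filesystem: a vague block helps no one.
    if not reason.strip() or not missing:
        raise ValueError("rollback signal needs a reason and at least one missing item")
    phase_dir = task_dir / phase
    phase_dir.mkdir(parents=True, exist_ok=True)
    lines = ["# BLOCKED", "", "## Reason", reason.strip(), "", "## Missing"]
    lines += [f"- {item}" for item in missing]
    out = phase_dir / "BLOCKED.md"
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

Refusing to write a signal without a concrete `missing` list keeps the "it does not guess" rule symmetric: the blocked phase must say exactly what it needs, just as the upstream phase must supply it.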


③ Planning Agent

Persona: Engineering lead who has run many sprints and learned that large units are risk multipliers.

Single objective: Decompose the approved architecture into independently testable, independently deployable work units with explicit dependency ordering.

Context window receives:

  • tasks/{id}/requirements/acceptance-criteria.md
  • tasks/{id}/architecture/ (all files)
  • agents/planner-agent/skills/planning-and-task-breakdown.md

Produces:

# work-breakdown.md rendered as structured YAML

units:
  - id: unit-001
    title: "Database schema: user_sync_cursor table"
    type: schema_migration
    size_estimate: small
    acceptance_criteria_covered: [AC-005]
    depends_on: []
    verification:
      method: migration_runs_clean
      rollback_tested: required

  - id: unit-002
    title: "VendorAPIClient: cursor pagination + rate limit handling"
    type: library_module
    size_estimate: medium
    acceptance_criteria_covered: [AC-001, AC-002, AC-003]
    depends_on: [unit-001]
    verification:
      method: unit_tests
      test_file: tests/test_vendor_client.py
      coverage_threshold: 90

  - id: unit-003
    title: "SyncOrchestrator: batch scheduling + cursor persistence"
    type: service_module
    size_estimate: medium
    acceptance_criteria_covered: [AC-004, AC-006]
    depends_on: [unit-001, unit-002]
    verification:
      method: integration_tests
      fixtures: tests/fixtures/vendor_api_mock.py

Hard rules:

  • No unit covers more than three acceptance criteria
  • Every unit has a verification method before it is approved
  • Units that touch auth, secrets, or PII are flagged for security review
  • Units with a size estimate above large must be split before APPROVED.md is written

Cannot: Make architectural decisions, write code, adjust requirements.
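
Two of the hard rules above (at most three acceptance criteria per unit, and a verification method on every unit) and the depends_on ordering lend themselves to mechanical checks. The sketch below assumes the work breakdown has been parsed into a list of dicts shaped like the YAML example; `check_units` and `execution_order` are hypothetical names. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+), which raises `CycleError` on circular dependencies.

```python
from graphlib import TopologicalSorter
from typing import List


def check_units(units: List[dict]) -> List[str]:
    """Return human-readable violations of two of the planning hard rules."""
    problems = []
    for unit in units:
        if len(unit.get("acceptance_criteria_covered", [])) > 3:
            problems.append(f"{unit['id']}: covers more than three acceptance criteria")
        if not unit.get("verification", {}).get("method"):
            problems.append(f"{unit['id']}: no verification method")
    return problems


def execution_order(units: List[dict]) -> List[str]:
    """Order unit IDs so that every unit comes after its dependencies."""
    graph = {unit["id"]: set(unit.get("depends_on", [])) for unit in units}
    return list(TopologicalSorter(graph).static_order())


# The three units from the example work breakdown, reduced to relevant fields.
units = [
    {"id": "unit-001", "acceptance_criteria_covered": ["AC-005"],
     "depends_on": [], "verification": {"method": "migration_runs_clean"}},
    {"id": "unit-002", "acceptance_criteria_covered": ["AC-001", "AC-002", "AC-003"],
     "depends_on": ["unit-001"], "verification": {"method": "unit_tests"}},
    {"id": "unit-003", "acceptance_criteria_covered": ["AC-004", "AC-006"],
     "depends_on": ["unit-001", "unit-002"], "verification": {"method": "integration_tests"}},
]
```

For the example units, `check_units(units)` reports no violations and `execution_order(units)` yields unit-001, unit-002, unit-003, the only order the dependencies permit.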


④ Builder Agent

Persona: Staff engineer. Incremental. Test-first. Atomic commits. Never invents requirements.

Single objective: Implement exactly one approved work unit per invocation. No more.

Context window receives:

  • The single assigned unit-00N definition
  • tasks/{id}/architecture/interfaces.md
  • tasks/{id}/architecture/adr-*.md
  • The relevant code slice from repos/{project}/ (only the affected modules)
  • agents/builder-agent/skills/ (relevant subset based on unit type)

The builder does not receive:

  • The full repository
  • Reward history
  • Previous retrospectives
  • World model
  • Other units' patches

This is deliberate. The builder's job is narrow code transformation, not system design. If the builder encounters ambiguity that requires a decision not covered by the interfaces or ADRs, the correct action is to emit a BLOCKED.md, not to invent an answer.

Produces per unit:

implementation/units/unit-001/
├── patch.diff          # the actual code change
├── commit-message.md   # conventional commits format, references AC IDs
└── self-check.md       # builder's checklist before handing off

# self-check.md (builder fills before handoff)

Unit: unit-002 VendorAPIClient

Pre-verification checklist:
- [x] Implements cursor pagination as specified in interfaces.md
- [x] Rate limit retry uses Retry-After header, not fixed backoff
- [x] No hardcoded credentials or URLs
- [x] New code has docstrings at module and function level
- [x] No logic added beyond what acceptance criteria require
- [ ] Integration test written (blocked: mock fixture not yet available - unit-003 dependency)

Assumptions made: none. All decisions covered by interfaces.md and ADR-001.
Deviations from plan: none.

The critical constraint: The builder cannot award itself a passing grade. The self-check.md is a handoff document, not a verification gate. Verification is a separate phase, a separate agent, and a separate context window.


⑤ Verification Agent

Persona: QA engineer with zero tolerance for self-reported test results. Only tool output counts.

Single objective: Verify that implementation matches the acceptance criteria. Produce structured, machine-readable verdicts. Find spec divergence.

Context window receives:

  • tasks/{id}/requirements/acceptance-criteria.md
  • tasks/{id}/architecture/interfaces.md
  • tasks/{id}/planning/work-breakdown.md
  • tasks/{id}/implementation/units/ (all patches and self-checks)
  • Test runner output (JSON) from the executor container

Produces:

// unit-tests.json
{
  "task_id": "task-2026-001",
  "unit_id": "unit-002",
  "runner": "pytest",
  "exit_code": 0,
  "passed": 14,
  "failed": 0,
  "coverage": 0.91,
  "threshold_met": true,
  "acceptance_criteria_verified": ["AC-001", "AC-002", "AC-003"]
}

# reviewer-report.md

## Five-Axis Review: unit-002 VendorAPIClient

### Correctness
PASS - Rate limit retry reads Retry-After header correctly.
       Cursor pagination matches interfaces.md contract.

### Test Quality
PASS — 91% coverage. Edge cases covered: empty page, expired cursor,
       missing Retry-After header (defaults to 60s).

### Security
REVIEW — Line 47: token passed as query parameter in one fallback path.
         Severity: BLOCKER
         Required: move to Authorization header unconditionally.

### Readability
PASS — Functions are <30 lines. Naming is unambiguous.
       One nit: `_do_fetch` is too generic, suggest `_fetch_cursor_page`.

### Change Sizing
PASS — 87 lines changed. Within the 100-line guidance.

## Overall: BLOCKED
Reason: Security finding BLOCKER on line 47 must be resolved before
this unit proceeds to security review phase.

Rollback signal: BLOCKED.md written to tasks/{id}/verification/. Includes the unit ID, finding severity, and exact location. Builder receives only the BLOCKED.md — not the full reviewer context — to avoid anchoring to the reviewer's suggested fix.

Cannot: Modify implementation, write tests on behalf of the builder, approve its own findings.
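
Since only tool output counts, the pass/fail decision for a unit can be derived directly from unit-tests.json rather than from prose. The sketch below is one way to do that; `unit_gate_passes` is a hypothetical name. Note one detail from the examples: the work breakdown states coverage_threshold as a percentage (90), while the test report stores coverage as a fraction (0.91), so the sketch converts before comparing.

```python
import json


def unit_gate_passes(report_json: str, coverage_threshold_percent: float) -> bool:
    """Decide the gate from runner output alone: exit code, failures, coverage."""
    report = json.loads(report_json)
    # The plan gives the threshold in percent; the report uses a 0-1 fraction.
    threshold_fraction = coverage_threshold_percent / 100
    return (
        report["exit_code"] == 0
        and report["failed"] == 0
        and report["coverage"] >= threshold_fraction
    )


# The example report for unit-002, trimmed to the fields the gate reads.
example_report = json.dumps({
    "unit_id": "unit-002",
    "exit_code": 0,
    "failed": 0,
    "coverage": 0.91,
})
gate_ok = unit_gate_passes(example_report, coverage_threshold_percent=90)
```

This gate covers only the test-runner axis. In the example above the unit is still BLOCKED overall because of the reviewer's security finding, which this check does not see.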


⑥ Security Review Agent

Persona: Application security engineer. OWASP-fluent. Threat models everything. Never approves "we'll fix it later."

Single objective: Verify that the implementation does not introduce security vulnerabilities, secrets exposure, or authentication weaknesses.

Context window receives:

  • tasks/{id}/architecture/interfaces.md
  • tasks/{id}/architecture/risk-analysis.md
  • tasks/{id}/implementation/units/ (all patches)
  • tasks/{id}/security/secrets-scan.json (automated scan output)
  • agents/security-agent/skills/security-and-hardening.md
  • OWASP Top 10 checklist (embedded in agent skills)

Produces:

// owasp-checklist.json
{
  "task_id": "task-2026-001",
  "checked_items": [
    {"id": "A01", "name": "Broken Access Control", "status": "PASS", "notes": "Auth scope verified per ADR-001"},
    {"id": "A02", "name": "Cryptographic Failures", "status": "PASS", "notes": "No sensitive data stored in logs"},
    {"id": "A03", "name": "Injection", "status": "PASS", "notes": "All user input parameterized"},
    {"id": "A07", "name": "Auth Failures", "status": "REVIEW", "notes": "Token rotation interval not enforced in client"}
  ],
  "secrets_scan": "PASS",
  "auth_paths_documented": true,
  "human_review_required": false,
  "blocking_findings": []
}

Human review trigger conditions (any one is sufficient):

  • Any finding rated CRITICAL or BLOCKER
  • Code touches auth, session management, or credential storage
  • New external service dependency introduced
  • PII schema change
  • First security review for a new integration pattern

When human review is triggered, the agent writes to shared/human-review-queue.md and the pipeline halts. It does not proceed optimistically.
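
The trigger conditions above are a plain disjunction, so they can be evaluated without any model call. The sketch below encodes them as fields on a small record; `SecurityReview` and `human_review_required` are hypothetical names, and how each flag gets populated (static analysis, diff inspection, and so on) is outside this sketch.

```python
from dataclasses import dataclass, field
from typing import List

BLOCKING_SEVERITIES = {"CRITICAL", "BLOCKER"}


@dataclass
class SecurityReview:
    finding_severities: List[str] = field(default_factory=list)
    touches_auth: bool = False              # auth, session management, credential storage
    new_external_dependency: bool = False
    pii_schema_change: bool = False
    first_review_for_pattern: bool = False  # first review of a new integration pattern


def human_review_required(review: SecurityReview) -> bool:
    """Any single trigger condition is sufficient to halt for a human."""
    return (
        any(sev in BLOCKING_SEVERITIES for sev in review.finding_severities)
        or review.touches_auth
        or review.new_external_dependency
        or review.pii_schema_change
        or review.first_review_for_pattern
    )
```

Evaluating the triggers in code rather than in the agent's prompt means the halt cannot be argued away by the model that produced the findings.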

Cannot: Approve findings provisionally, suggest "fix in follow-up," modify implementation.


⑦ Integration Agent

Persona: Platform engineer. Thinks about the system, not the feature. Owns the merge and what happens after.

Single objective: Validate that the completed implementation integrates safely with the existing system — no regressions, no architectural drift, no deployment surprises.

Context window receives:

  • All PASSED.md and CLEARED.md from verification and security
  • tasks/{id}/architecture/dependency-graph.json
  • tasks/{id}/planning/execution-plan.md
  • Regression test results from executor
  • Current repos/{project} HEAD state

Produces:

# merge-plan.md

## Integration Assessment: task-2026-001

### Regression Results
All 247 pre-existing tests: PASS
New tests added: 14 (all pass)
Coverage delta: +2.1%

### Architectural Drift
None detected. Implementation follows interface contract in interfaces.md.
No new external dependencies introduced beyond those in architecture/risk-analysis.md.

### Deployment Implications
- Database migration required before code deploy (unit-001)
- Migration is backwards-compatible: old code reads null cursor as "start from beginning"
- No feature flag required
- Rollback procedure: run migration rollback script (included in unit-001 patch)

### Merge Status: APPROVED

Cannot: Re-run verification, override security findings, approve partial implementations.


⑧ Retrospective Agent

Persona: Engineering manager who has run many postmortems. Interested in systemic causes, not blame.

Single objective: Analyze the completed task pipeline, attribute failures to phases, extract reusable lessons, and update the world model only with evidence meeting promotion criteria.

Context window receives:

  • All phase artifacts for the task (read-only)
  • shared/rollback-log.md entries for this task
  • agents/retrospective-agent/skills/
  • world/reward-evidence/ (accumulator for cross-task patterns)

Produces:

# phase-failures.md

## task-2026-001 Phase Failures

| Phase | Status | Cause | Iterations |
|---|---|---|---|
| Requirements | PASS | - | 1 |
| Architecture | PASS | - | 1 |
| Planning | PASS | - | 1 |
| Implementation | BLOCKED (1x) | Security finding: token in query param | 2 |
| Verification | BLOCKED (1x) | Security reviewer caught issue | - |
| Security | PASS | - | 1 |
| Integration | PASS | - | 1 |

Root cause: Builder agent did not apply security-and-hardening skill
to auth token handling. Skill was present in skill library but not
activated because unit type was tagged "library_module" not "auth_module."

Recommendation: Expand security-and-hardening activation trigger to include
any unit whose interfaces.md section references Bearer tokens or credentials.

# reward.yaml

task_id: task-2026-001
composite_score: +1    # majority dimensions positive

reward_decomposition:
  correctness:              +1    # all ACs verified by tool
  spec_fidelity:            +1    # implementation matches acceptance-criteria.md
  architectural_compliance: +1    # no ADR violations
  verification_cleanliness:  0    # one security rollback required
  phase_discipline:         +1    # no phase boundary violations
  context_efficiency:       +1    # no unnecessary escalations
  rollback_stability:       +1    # integration introduced no regressions
  maintainability:          +1    # coverage increased, complexity stable

tool_verified: true
model_tier_used: mid
phases_rolled_back: [implementation]
rollback_count: 1

Pattern promotion rules: A pattern enters world/verified-patterns/ only when:

  • It has appeared in 3 or more task retrospectives
  • All appearances have composite score +1
  • At least one appearance includes tool-verified integration test coverage
  • The retrospective agent writes a promotion proposal that the pipeline runner confirms

This prevents single-task superstition from becoming institutional knowledge.
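
The scoring and promotion rules can be sketched as two small functions. Both rest on assumptions: the source only says the composite is +1 when the "majority dimensions" are positive, so the symmetric handling of negatives and the fallback to 0 are guesses, and the `tool_verified_integration_coverage` key is a hypothetical field name. The fourth rule (the pipeline runner confirming a promotion proposal) is a process step and is not modeled here.

```python
from typing import Dict, List


def composite_score(dimensions: Dict[str, int]) -> int:
    """+1 if most dimensions are positive, -1 if most are negative, else 0."""
    positive = sum(1 for value in dimensions.values() if value > 0)
    negative = sum(1 for value in dimensions.values() if value < 0)
    half = len(dimensions) / 2
    if positive > half:
        return 1
    if negative > half:
        return -1
    return 0


def eligible_for_promotion(appearances: List[dict]) -> bool:
    """Check the first three promotion rules for one candidate pattern."""
    return (
        len(appearances) >= 3
        and all(a["composite_score"] == 1 for a in appearances)
        and any(a["tool_verified_integration_coverage"] for a in appearances)
    )


# The reward decomposition from the example reward.yaml.
example_dimensions = {
    "correctness": 1, "spec_fidelity": 1, "architectural_compliance": 1,
    "verification_cleanliness": 0, "phase_discipline": 1,
    "context_efficiency": 1, "rollback_stability": 1, "maintainability": 1,
}
example_score = composite_score(example_dimensions)  # 7 of 8 positive
```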


Context Engineering Per Phase

The most important operational detail: each agent receives the minimum context required to complete its phase, assembled fresh from structured artifacts, not from conversation history.

Requirements Agent context:
  ~800 tokens    raw task + persona + skill

Architecture Agent context:
  ~2,000 tokens  requirements package + relevant world patterns

Planning Agent context:
  ~1,500 tokens  requirements ACs + architecture + planning skill

Builder Agent context (per unit):
  ~3,000 tokens  single unit definition + interfaces + affected code slice

Verification Agent context:
  ~2,500 tokens  ACs + interfaces + patches + tool output

Security Agent context:
  ~2,000 tokens  patches + risk analysis + OWASP checklist

Integration Agent context:
  ~2,000 tokens  approval signals + regression output + dependency graph

Retrospective Agent context:
  ~3,500 tokens  full task artifact set (read-only)

No phase sees reward history from unrelated tasks. No phase sees other agents' internal reasoning. No phase sees more of the codebase than its work unit requires.

This is aggressive context minimization. It is not a performance optimization — it is a correctness constraint.
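
One way to treat the budgets above as a correctness constraint rather than a soft target is to fail assembly outright when a phase's artifacts exceed its budget, instead of silently truncating. The sketch below takes that approach. The four-characters-per-token estimate is a crude assumption (a real runtime would use the model's tokenizer), and `estimate_tokens` and `assemble_context` are hypothetical names.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Rough token estimate; assumes about four characters per token."""
    return max(1, len(text) // 4)


def assemble_context(artifacts: List[Tuple[str, str]], budget: int) -> str:
    """Join (name, text) artifacts into one prompt, refusing to exceed budget."""
    parts = []
    used = 0
    for name, text in artifacts:
        cost = estimate_tokens(text)
        if used + cost > budget:
            # Over budget means the phase scope is too wide; do not truncate.
            raise ValueError(f"{name} would exceed the {budget}-token budget")
        parts.append(f"## {name}\n{text}")
        used += cost
    return "\n\n".join(parts)
```

Raising instead of trimming surfaces an oversized unit as a planning problem, which matches the document's stance that scope errors should be caught at phase boundaries.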


Model Routing

Model tier is assigned per phase, not per task. The routing policy lives in world/routing-policies/ and is updated by the retrospective agent based on empirical failure data.

| Phase | Default Tier | Escalation Trigger |
|---|---|---|
| Requirements | cheap | Ambiguous domain (legal, medical, financial) |
| Architecture | mid | Novel system topology; no matching verified pattern |
| Planning | cheap | Standard decomposition; known patterns |
| Implementation (per unit) | cheap | Unit type: scaffold, boilerplate |
| Implementation (per unit) | mid | Unit type: library_module, service_module |
| Implementation (per unit) | premium | Unit type: auth, security-critical, novel algorithm |
| Verification | mid | All code review passes through mid tier |
| Security | mid→premium | Any OWASP finding escalates to premium |
| Integration | cheap | Structured checks, not creative reasoning |
| Retrospective | mid | Learning and pattern promotion require coherent reasoning |

Routing is logged in every reward entry. After 50 tasks, the retrospective agent can identify whether a given phase-tier combination correlates with rollback frequency. That data updates the routing policy file directly.
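
The default tiers in the table can be read as a lookup keyed by phase, with the unit type refining the implementation phase. The sketch below hard-codes the table for illustration; in the design described here the policy would live in world/routing-policies/ instead. The `select_tier` name, the unit-type key spellings, and the fallback to mid for unknown unit types are assumptions. The qualitative escalation triggers (ambiguous domain, novel topology) are not modeled.

```python
from typing import Optional

DEFAULT_TIER = {
    "requirements": "cheap", "architecture": "mid", "planning": "cheap",
    "verification": "mid", "security": "mid",
    "integration": "cheap", "retrospective": "mid",
}

IMPLEMENTATION_TIER = {
    "scaffold": "cheap", "boilerplate": "cheap",
    "library_module": "mid", "service_module": "mid",
    "auth": "premium", "security-critical": "premium", "novel algorithm": "premium",
}


def select_tier(phase: str, unit_type: Optional[str] = None,
                owasp_finding: bool = False) -> str:
    """Pick a model tier for one phase invocation."""
    if phase == "implementation":
        # Assumption: unknown unit types fall back to the mid tier.
        return IMPLEMENTATION_TIER.get(unit_type or "", "mid")
    if phase == "security" and owasp_finding:
        # Any OWASP finding escalates the security phase to premium.
        return "premium"
    return DEFAULT_TIER[phase]
```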


Docker Compose: Pipeline Runtime

flowchart TB
    subgraph Host[Docker Host]
        
        subgraph Runner[Orchestration Layer]
            PR[Pipeline Runner<br/>state machine controller]
            QM[Queue Manager]
        end
        
        subgraph Agents[Agent Layer - One per phase]
            RA[Requirements Agent]
            AA[Architect Agent]
            PA[Planner Agent]
            BA[Builder Agent]
            VA[Verifier Agent]
            SA[Security Agent]
            IA[Integration Agent]
            RTA[Retrospective Agent]
        end
        
        subgraph Data[Shared Volume: /data]
            Tasks[tasks/<br/>task artifacts]
            World[world/<br/>verified patterns]
            State[pipeline-state.yaml]
            Metrics[metrics/<br/>pipeline.jsonl]
        end
        
        subgraph Exec[Execution Layer]
            E[Executor Container<br/>air-gapped<br/>no network]
        end
        
        subgraph External[External]
            API[LLM API]
        end
    end

    PR -->|dispatches to| Agents
    Agents -->|read/write| Data
    Agents -->|API calls| API
    
    RA -->|block signal| PR
    PR -->|enforces contracts| Agents
    
    VA -->|sends test requests| E
    E -->|returns JSON results| VA
    
    PR -->|writes metrics| Metrics
    PR -->|updates| State
version: "3.9"

services:
  pipeline-runner:
    build: .
    container_name: agentos-pipeline
    environment:
      - AGENT_ROLE=pipeline_runner
      - LLM_API_KEY=${LLM_API_KEY}
      - CHEAP_MODEL=${CHEAP_MODEL}
      - MID_MODEL=${MID_MODEL}
      - PREMIUM_MODEL=${PREMIUM_MODEL}
      - PIPELINE_POLL_INTERVAL=10
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  requirements-agent:
    build: .
    container_name: agentos-requirements
    environment:
      - AGENT_ROLE=requirements
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  architect-agent:
    build: .
    container_name: agentos-architect
    environment:
      - AGENT_ROLE=architect
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  planner-agent:
    build: .
    container_name: agentos-planner
    environment:
      - AGENT_ROLE=planner
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  builder-agent:
    build: .
    container_name: agentos-builder
    environment:
      - AGENT_ROLE=builder
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  verifier-agent:
    build: .
    container_name: agentos-verifier
    environment:
      - AGENT_ROLE=verifier
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  security-agent:
    build: .
    container_name: agentos-security
    environment:
      - AGENT_ROLE=security
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  integration-agent:
    build: .
    container_name: agentos-integration
    environment:
      - AGENT_ROLE=integration
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  retrospective-agent:
    build: .
    container_name: agentos-retrospective
    environment:
      - AGENT_ROLE=retrospective
      - LLM_API_KEY=${LLM_API_KEY}
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  executor:
    image: python:3.12-slim
    container_name: agentos-executor
    user: "65534:65534"
    read_only: true
    tmpfs:
      - /tmp:size=256m
      - /workspace:size=512m
    volumes:
      - agent-data:/data:ro
    network_mode: none                # air-gapped: no network at all (an empty networks list would still join the default network)
    cpus: "2.0"
    mem_limit: 512m

volumes:
  agent-data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: ${PWD}/data             # bind device must be an absolute path

networks:
  agent-net:
    driver: bridge
    internal: true                    # no outbound route: agents need a separate egress path to reach the LLM API

The pipeline-runner is the orchestrator. It:

  • reads shared/queue.md for new tasks
  • reads shared/pipeline-state.yaml to know the current phase
  • reads the current phase's contract to validate inputs
  • dispatches to the appropriate agent container via shared/inbox/
  • validates phase outputs before updating pipeline-state.yaml
  • writes shared/rollback-log.md on any phase rejection

No agent container transitions the pipeline. Only the runner does. This means phase transitions are auditable and deterministic — the runner's logic is a simple state machine, not an LLM.
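
To make "a simple state machine, not an LLM" concrete, here is a minimal sketch of a pure transition function. It is an illustration under assumptions: the function names `next_phase` and `advance` and the status values `pending` and `rejected` are hypothetical, and the real runner presumably also validates contracts and writes the rollback log.

```python
PHASES = [
    "requirements", "architecture", "planning", "implementation",
    "verification", "security", "integration", "retrospective",
]


def next_phase(current: str) -> str:
    """Return the phase that follows `current`, or "done" after the last one."""
    idx = PHASES.index(current)
    return PHASES[idx + 1] if idx + 1 < len(PHASES) else "done"


def advance(state: dict, outputs_valid: bool) -> dict:
    """Pure transition: the same inputs always produce the same next state."""
    new_state = dict(state)  # never mutate the caller's state
    if outputs_valid:
        new_state["current_phase"] = next_phase(state["current_phase"])
        new_state["phase_status"] = "pending"   # hypothetical status value
    else:
        # Rejected outputs leave the pipeline where it is.
        new_state["phase_status"] = "rejected"  # hypothetical status value
    return new_state
```

Because `advance` has no side effects and no model call, every transition can be replayed from the logged inputs when auditing a run.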


Pipeline State Machine

# shared/pipeline-state.yaml

task_id: task-2026-001
current_phase: verification
phase_status: in_progress
phase_started: 2026-05-10T09:14:00Z
rollback_count_this_phase: 0
total_rollbacks_this_task: 1

phase_history:
  - phase: requirements
    status: completed
    iterations: 1
    completed: 2026-05-10T08:30:00Z
  - phase: architecture
    status: completed
    iterations: 1
    completed: 2026-05-10T08:52:00Z
  - phase: planning
    status: completed
    iterations: 1
    completed: 2026-05-10T09:01:00Z
  - phase: implementation
    status: completed
    iterations: 2
    rollbacks: 1
    rollback_reason: "Security finding: token in query param (unit-002, line 47)"
    completed: 2026-05-10T09:13:00Z

human_review_pending: false
blocked: false

The state machine has eight phase states, a terminal DONE state, and two exceptional states:

REQUIREMENTS → ARCHITECTURE → PLANNING → IMPLEMENTATION
    → VERIFICATION → SECURITY → INTEGRATION → RETROSPECTIVE → DONE

Exceptional:
  BLOCKED_AWAITING_HUMAN   (any phase can enter this)
  ROLLED_BACK              (any phase can send execution back to the previous phase)

Rollbacks are bounded: any phase that rolls back three times on the same issue escalates to BLOCKED_AWAITING_HUMAN. The system will not loop indefinitely.
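
The rollback bound can be sketched as a counter keyed by phase and issue. This is a hedged illustration: the function name `record_rollback`, the string key format, and the way an "issue" is identified are assumptions, since the source does not specify how two rollbacks are judged to concern the same issue.

```python
MAX_ROLLBACKS_PER_ISSUE = 3  # from the text: the third rollback on one issue escalates


def record_rollback(counts: dict[str, int], phase: str, issue: str) -> str:
    """Count a rollback for (phase, issue) and return the resulting state name."""
    key = f"{phase}:{issue}"  # hypothetical key format
    counts[key] = counts.get(key, 0) + 1
    if counts[key] >= MAX_ROLLBACKS_PER_ISSUE:
        return "BLOCKED_AWAITING_HUMAN"
    return "ROLLED_BACK"
```

Counting per issue rather than per phase means a phase that fails for three different reasons keeps retrying, while a phase stuck on one recurring defect is handed to a human.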


Reward and Learning Architecture

What Changes From the Original

The original reward model scored task outcomes. The phase-oriented model scores pipeline health — how reliably each phase produced correct outputs, how many rollbacks occurred, which phases caused cascading failures.

# reward.yaml (full schema)

task_id: task-2026-001
timestamp: 2026-05-10T11:00:00Z
composite_score: +1

# Task-level outcome dimensions
reward_decomposition:
  correctness:              +1    # tool-verified: all ACs pass
  spec_fidelity:            +1    # implementation matches spec exactly
  architectural_compliance: +1    # no ADR violations detected
  verification_cleanliness:  0    # 1 rollback required (acceptable but not clean)
  phase_discipline:         +1    # no agent violated phase contract boundaries
  context_efficiency:       +1    # no unnecessary premium escalation
  rollback_stability:       +1    # integration regression-free
  maintainability:          +1    # coverage delta positive

# Phase-level health (feeds routing and skill policies)
phase_health:
  requirements:   {iterations: 1, outcome: clean}
  architecture:   {iterations: 1, outcome: clean}
  planning:       {iterations: 1, outcome: clean}
  implementation: {iterations: 2, outcome: rollback, cause: security_gap}
  verification:   {iterations: 1, outcome: clean}
  security:       {iterations: 1, outcome: clean}
  integration:    {iterations: 1, outcome: clean}
  retrospective:  {iterations: 1, outcome: clean}

# Attribution for learning
failure_attribution:
  phase: implementation
  agent: builder-agent
  cause: "security skill not activated for Bearer token handling"
  resolution: "updated routing: auth-adjacent units now activate security skill"

# Evidence accumulation
context_tags: [python, async, api-client, cursor-pagination, rate-limiting]
model_tiers_used:
  requirements: cheap
  architecture: mid
  planning: cheap
  implementation: mid
  verification: mid
  security: mid
  integration: cheap
  retrospective: mid

tool_verified: true
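
Because each reward entry records both `model_tiers_used` and `phase_health`, the phase-tier rollback correlation used for routing updates can be computed with a simple aggregation. The sketch below assumes reward files have already been parsed into dictionaries with the schema above; the function name `rollback_rate_by_phase_tier` is hypothetical.

```python
from collections import defaultdict


def rollback_rate_by_phase_tier(rewards: list[dict]) -> dict[tuple[str, str], float]:
    """Fraction of tasks in which each (phase, tier) pair ended in a rollback."""
    totals: dict[tuple[str, str], int] = defaultdict(int)
    rollbacks: dict[tuple[str, str], int] = defaultdict(int)
    for entry in rewards:
        for phase, tier in entry["model_tiers_used"].items():
            key = (phase, tier)
            totals[key] += 1
            if entry["phase_health"][phase]["outcome"] == "rollback":
                rollbacks[key] += 1
    return {key: rollbacks[key] / totals[key] for key in totals}
```

A rate that stays high for one pair, such as implementation on the mid tier, would be the kind of signal that justifies changing that phase's default tier.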

Pattern Promotion Evidence

The retrospective agent writes to world/reward-evidence/ rather than directly to world/verified-patterns/. The accumulator tracks:

# world/reward-evidence/cursor-pagination-pattern.yaml

pattern_candidate: cursor-based-api-sync
evidence:
  - task_id: task-2026-001
    composite: +1
    tool_verified: true
    context_tags: [cursor-pagination, api-client]
  - task_id: task-2026-003
    composite: +1
    tool_verified: true
    context_tags: [cursor-pagination, webhook-replay]

promotion_criteria:
  min_successful_tasks: 3      # not yet met: 2/3
  min_tool_verified: true      # met
  min_integration_coverage: 1  # met

status: accumulating           # will promote on 3rd qualifying task

No agent writes directly to world/verified-patterns/. The pipeline runner promotes candidates when criteria are met. This is not a policy the LLM decides — it is a deterministic check run by the runner after every retrospective completion.
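
The deterministic promotion check might look like the sketch below. It assumes the evidence file has been parsed into a dictionary with the fields shown above, and it deliberately leaves out `min_integration_coverage`, because the source does not say how that value is measured. The function name `should_promote` is hypothetical.

```python
def should_promote(candidate: dict) -> bool:
    """Deterministic promotion check over one parsed evidence file."""
    criteria = candidate["promotion_criteria"]
    qualifying = [
        entry for entry in candidate["evidence"]
        # A task qualifies if it scored positively and, when required, was tool-verified.
        if entry["composite"] > 0
        and (entry["tool_verified"] or not criteria["min_tool_verified"])
    ]
    return len(qualifying) >= criteria["min_successful_tasks"]
```

With the two evidence entries shown above and `min_successful_tasks: 3`, the check returns `False`, which matches the `accumulating` status.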


Observability: What to Measure

Traditional agent observability measures output quality. This system measures pipeline reliability.

| Metric | Signal |
|---|---|
| phase_rollback_rate per phase | Which phases are producing unreliable artifacts |
| iterations_per_phase distribution | Where is the most rework happening |
| rollback_cascade_rate | How often a rollback in phase N causes rollback in phase N-1 |
| human_escalation_rate | How often the pipeline cannot self-resolve |
| spec_ambiguity_rate | How often requirements agent blocks on missing information |
| architecture_violation_rate | How often verification catches spec divergence |
| security_finding_rate by severity | BLOCKER findings indicate systemic builder gaps |
| pattern_promotion_rate | Rate of learning solidifying into reusable knowledge |
| context_tokens_per_phase | Is context scope creeping upward |
| model_tier_distribution per phase | Is routing empirically correct |
| composite_reward_trend over 50 tasks | Is the system improving or plateauing |

These are emitted as JSONL to data/metrics/pipeline.jsonl after every phase completion. The runner writes them; agents do not.
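
As a rough sketch of the runner-side emission, the code below appends one JSON object per line and derives `phase_rollback_rate` from the file. The record fields (`ts`, `rolled_back`) and both function names are assumptions; the source names the file and the metrics but not the record layout.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def emit_metric(path: Path, task_id: str, phase: str, **fields) -> dict:
    """Append one phase-completion record to a JSONL file and return it."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "task_id": task_id,
        "phase": phase,
        **fields,  # hypothetical extra fields, e.g. rolled_back=True
    }
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def phase_rollback_rate(path: Path) -> dict[str, float]:
    """Share of recorded phase completions per phase that were rolled back."""
    totals: dict[str, int] = {}
    rolled: dict[str, int] = {}
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            totals[rec["phase"]] = totals.get(rec["phase"], 0) + 1
            if rec.get("rolled_back"):
                rolled[rec["phase"]] = rolled.get(rec["phase"], 0) + 1
    return {phase: rolled.get(phase, 0) / n for phase, n in totals.items()}
```

Append-only JSONL keeps every emission independent, so a crash mid-run loses at most the record being written.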


What This Architecture Optimizes For

The original AgentSWEOS optimized for:

task success rate

AgentSWEOS vNext optimizes for:

reliable state transitions between phases

That shift matters because a system that completes 80% of its tasks by cutting corners on verification is not better than one that completes 60% but catches every failure before it reaches production.

The architecture treats LLM agents as what they are — probabilistic transformers that perform well in narrow, structured contexts — and builds reliability through:

  • explicit contracts instead of tacit understanding
  • structured artifacts instead of conversational state
  • independent verification instead of self-certification
  • deterministic phase transitions instead of improvised iteration
  • bounded context windows instead of accumulating history
  • evidence-gated learning instead of autonomous belief updates

The result is a system that degrades gracefully, fails loudly, attributes failures precisely, and improves measurably over time.

That is the engineering standard software should be held to, regardless of whether the engineer is human or probabilistic.


MuhammadYossry commented May 10, 2026

The Minimal Agent Specification (MAS)

One File. Few Tokens. Any Agent.

While working on AgentOS I came across this problem:
You have 20 specialist agents. Your orchestrator needs to know: Who are they? What can they do? Where's their state?

You don't want to load 20 full personas into memory. You don't want to parse 20,000 tokens of backstory just to route a simple task.

You want a business card. A tiny header file that tells you everything you need before you decide to have a conversation.

That's the Minimal Agent Specification.


What It Is

A single file — .agent — in every agent's directory. Small enough to scan hundreds in seconds. Rich enough to route tasks intelligently.

identity: orion
domain: backend-engineering

capabilities:
  - run_sql
  - http_request
  - read_file

state:
  skills: skills.md
  goals: goals.md

budget:
  max_skills: 20

That's it. The orchestrator reads this file and immediately knows:

  • Who you are (identity)
  • What you're good at (domain, capabilities)
  • Where your memory lives (state pointers)
  • How much you can handle (budget)

Why It Matters

Before MAS: Orchestrator loads every agent's full persona. 10 agents × 2000 tokens = 20,000 tokens before the first task. Slow. Expensive. Fragile.

After MAS: Orchestrator scans .agent files. 10 agents × 125 tokens = 1,250 tokens. Then loads only the agents that match the task.

Task: "Run anomaly detection on April events"

Scan phase: iris has "run_sql" capability → load iris
           orion has "run_sql" → also a candidate
           planner has no matching capability → skip

Load phase: Now load full personas for iris and orion only

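In this example the scan costs 1,250 tokens instead of 20,000, a 16x reduction before any persona is loaded; the end-to-end saving depends on how many candidates match and grows with the size of the registry.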
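
The scan-then-load flow can be sketched in a few lines. This assumes the `.agent` files have already been parsed into dictionaries; the function names `scan` and `context_cost` are hypothetical, as are the `planner` capabilities and the extra `iris` capability, since the source only states that `iris` and `orion` have `run_sql` and `planner` does not.

```python
def scan(cards: list[dict], required: set[str]) -> list[str]:
    """Return identities of agents whose capabilities cover every required one."""
    return [
        card["identity"]
        for card in cards
        if required <= set(card.get("capabilities", []))
    ]


def context_cost(n_agents: int, n_loaded: int,
                 card_tokens: int = 125, persona_tokens: int = 2000) -> int:
    """Tokens spent scanning every card plus loading the matching personas."""
    return n_agents * card_tokens + n_loaded * persona_tokens


cards = [
    {"identity": "iris", "capabilities": ["run_sql", "read_file"]},
    {"identity": "orion", "capabilities": ["run_sql", "http_request", "read_file"]},
    {"identity": "planner", "capabilities": ["read_file"]},  # hypothetical capabilities
]
candidates = scan(cards, {"run_sql"})  # ["iris", "orion"]
```

With the figures used in this section, `context_cost(10, 2)` comes to 5,250 tokens, against 20,000 for loading all ten personas up front.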


The Agent Text Size

The MAS is designed for the real world. Every word is a token. No waste.

| Component | Tokens (approx) |
|---|---|
| Identity + domain | 10 |
| Capabilities (5 items) | 35 |
| State pointers (4 items) | 35 |
| Budget (4 items) | 25 |
| Formatting (spacing, dashes, newlines) | 20 |
| Total | ~125 tokens |

For comparison:

  • A tweet is ~35 tokens
  • A typical email is ~200 tokens
  • A full agent persona is 1,000–3,000 tokens

125 tokens is the sweet spot. Small enough to scan hundreds of agents in one batch. Rich enough to make intelligent routing decisions.


What You Get

For the orchestrator: A registry that requires no database. Just a filesystem and 125 tokens per agent.

For the agent: A stable identity that doesn't change when you learn new skills. Your .agent says who you are. Your skills.md says what you've learned.

For the operator: One file to edit when an agent's domain or capabilities change. No hunting through prompts.

For the system: The ability to discover, validate, and route to agents without loading their full context. This is how you scale from 5 agents to 500.


The One-Page Specification

# .agent — place this file in every agent directory

identity: <unique name>              # required
domain: <area of expertise>          # required

capabilities:                        # list what this agent can do
  - read_file                        # max 10 items
  - run_sql
  - http_request

state:                               # where learning lives
  skills: skills.md                  # required
  goals: goals.md
  rewards: rewards.md

budget:                              # hard limits
  max_skills: 20                     # default
  max_goals: 5                       # default
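
A validator for this specification could be sketched as follows. It assumes the `.agent` file has already been parsed into a dictionary and enforces only what the comments above state: the required fields, the 10-item capability limit, and the budget defaults. The function name `validate_card` and the error messages are hypothetical.

```python
DEFAULT_BUDGET = {"max_skills": 20, "max_goals": 5}  # defaults from the spec
MAX_CAPABILITIES = 10                                # "max 10 items"


def validate_card(card: dict) -> dict:
    """Check required fields and limits; return the card with budget defaults filled in."""
    for field in ("identity", "domain"):
        if not card.get(field):
            raise ValueError(f"missing required field: {field}")
    if not card.get("state", {}).get("skills"):
        raise ValueError("missing required field: state.skills")
    capabilities = card.get("capabilities", [])
    if len(capabilities) > MAX_CAPABILITIES:
        raise ValueError(
            f"too many capabilities: {len(capabilities)} > {MAX_CAPABILITIES}"
        )
    # Explicit budget values override the defaults.
    budget = {**DEFAULT_BUDGET, **card.get("budget", {})}
    return {**card, "capabilities": capabilities, "budget": budget}
```

Running this at discovery time would let an orchestrator reject malformed cards before any routing decision depends on them.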

The Rule

No agent is discoverable without a .agent file.
No orchestrator should load a full persona before reading it.
125 tokens is the contract.

That's it. A business card for every agent. A discovery layer that needs no database. A scale enabler for multi-agent systems.

Your agents are only as useful as your ability to find them. The Minimal Agent Specification makes sure you always can.
