@MuhammadYossry
Created May 1, 2026 11:01
Example: AgentSWEOS Blueprint, Production-Grade Software Engineering Agent Runtime

Example: AgentSWEOS Blueprint

Production-Grade Software Engineering Agent Runtime

Design philosophy: AgentSWEOS is not a general-purpose coding assistant. It is a disciplined SWE process modeled as a containerized agent: one that applies the same structured workflows a senior engineer brings to every task, learns what actually works, and forgets what doesn't.


0. Template Adaptation Decisions

Before the design, here is a transparent accounting of what was kept, changed, and dropped from the source prompt template, and why.

| Template Section | Decision | Rationale |
| --- | --- | --- |
| Executive Summary | ✅ Kept | Essential framing |
| Runtime Shape | ✅ Adapted | Orchestrator + specialist workers (not generic single container) |
| Filesystem Layout | ✅ Heavily adapted | SWE-specific: repos/, build/, reviews/, skills/ pre-loaded |
| Memory Model | ✅ Adapted | Pre-seeded with 20 Agent Skills; world model = codebase topology |
| Verification Architecture | ✅ Expanded | CI simulation, linting, test pyramid are first-class citizens |
| Learning Policy | ✅ Adapted | Constrained-active: cannot overwrite pre-seeded skills, can earn new ones |
| Model Routing | ✅ Kept | Critical for cost control in a dev loop |
| Security Policy | ✅ Adapted | Code execution sandbox is the primary risk surface |
| Docker Project Files | ✅ Kept | Core deliverable |
| Observability | ✅ Adapted | SWE-specific metrics: test pass rate, review gate rejections, skill churn |
| Example Tasks | ✅ Kept | 10 realistic SWE scenarios |
| Risks & Failure Modes | ✅ Adapted | Emphasizes code execution risks over data hallucination |
| v1/v2 Scope | ✅ Kept | |
| customers/, clauses/, approvals/ folders | ❌ Dropped | Domain-irrelevant |
| Medical/legal verification gates | ❌ Dropped | Wrong domain |
| Generic world/entities/ open ontology | ⚠️ Scoped down | Replaced with world/codebase/: bounded, tool-verified only |
| Autonomous truth generation | ❌ Dropped | 2026 LLM limits; all world-model writes require tool-backed evidence |
| hypotheses/ as free-form speculation | ⚠️ Constrained | Hypotheses require a minimum of 2 corroborating tool outputs to persist |

1. Executive Summary

AgentSWEOS is a containerized, self-learning software engineering agent that applies structured engineering workflows (spec, plan, build, test, review, ship) to coding tasks. It is pre-seeded with 20 production-grade skills from the Agent Skills framework and earns new ones through task outcomes scored against an honest reward signal.

Unlike a chat-based coding assistant, AgentSWEOS:

  • Enforces process discipline (spec before code, tests before merge)
  • Maintains a persistent, inspectable memory of what it knows and why
  • Routes work to the cheapest model capable of handling it
  • Runs code in a sandboxed executor and verifies outcomes with real tool output
  • Commits every learning event to Git for longitudinal observability

The result is an agent that improves with use, stays narrow by design, and can be debugged with ordinary developer tools.


2. Runtime Shape: Orchestrator + Specialist Workers

Chosen shape: Multi-container orchestrator + specialist workers.

┌──────────────────────────────────────────────────────┐
│                   AgentSWEOS Runtime                 │
│                                                      │
│  ┌─────────────┐     ┌──────────────────────────┐    │
│  │ Orchestrator│────▶│  Task Queue (FIFO file)  │    │
│  │  (planner)  │◀────│  shared/queue.md         │    │
│  └──────┬──────┘     └──────────────────────────┘    │
│         │                                            │
│    ┌────┴────────────────────┐                       │
│    ▼                         ▼                       │
│  ┌──────────┐          ┌───────────┐                 │
│  │  Builder │          │  Reviewer │                 │
│  │  Worker  │          │  Worker   │                 │
│  └────┬─────┘          └─────┬─────┘                 │
│       │                      │                       │
│    ┌──▼──────────────────────▼──┐                    │
│    │   Code Executor (sandbox)  │                    │
│    │   executor/ container      │                    │
│    └────────────────────────────┘                    │
│                                                      │
│  ┌──────────────────────────────┐                    │
│  │  Git Volume (longitudinal    │                    │
│  │  memory + working repos)     │                    │
│  └──────────────────────────────┘                    │
└──────────────────────────────────────────────────────┘

Why not single container? Code execution must be sandboxed away from agent state. A compromised or runaway build process cannot corrupt the agent's memory or the Git history. The executor is an ephemeral container with no network access and a read-only view of the repo under work.

Why not full microservices? Over-engineering for v1. Three logical roles (orchestrator, worker, executor) map cleanly to three services in docker-compose. Specialist personas (code-reviewer, test-engineer, security-auditor) are runtime configurations of the worker, not separate containers.


3. Purpose-Built Filesystem Layout

/data/
├── agents/
│   └── agentos/                    # orchestrator state
│       ├── persona.md              # read-only: "senior SWE, process-first"
│       ├── constraints.md          # read-only: budgets, forbidden tools, PII rules
│       ├── skills.md               # PRE-SEEDED with 20 Agent Skills (budget: 30)
│       ├── goals.md                # recurring gaps discovered at runtime
│       ├── rewards.md              # rolling log, last 50 (SWE tasks are longer)
│       ├── reflections.md          # failure post-mortems, last 20
│       └── skills/                 # deep skill files (20 pre-seeded + earned)
│           ├── spec-driven-development.md
│           ├── test-driven-development.md
│           ├── incremental-implementation.md
│           ├── code-review-and-quality.md
│           ├── security-and-hardening.md
│           ├── ... (all 20 Agent Skills)
│           └── [earned at runtime]
│
├── workers/
│   ├── builder/                    # builder worker state (mirrors agents/agentos/ shape)
│   └── reviewer/                   # reviewer worker state
│
├── repos/                          # working copies of target codebases
│   └── <project>/
│       ├── .git/
│       └── src/
│
├── build/                          # build artifacts, test results, coverage
│   └── <project>/
│       ├── last-test-run.json
│       ├── coverage.json
│       └── lint-report.json
│
├── reviews/                        # structured review outputs (handoff files)
│   └── <task-id>/
│       ├── request.md              # what the orchestrator asked
│       ├── findings.md             # reviewer output (typed, structured)
│       └── resolution.md           # builder response + final decision
│
├── world/
│   └── codebase/                   # BOUNDED: only tool-verified facts
│       ├── index.md                # known projects, languages, entry points
│       ├── patterns/               # confirmed architectural patterns
│       │   └── <pattern>.md        # requires 2+ tool-verified +1 occurrences
│       └── anti-patterns/          # confirmed bad patterns found in target repos
│
├── shared/                         # Unix IPC layer
│   ├── queue.md                    # FIFO task queue
│   ├── inbox/                      # per-agent async messages
│   │   ├── orchestrator.md
│   │   ├── builder.md
│   │   └── reviewer.md
│   ├── outbox/                     # (gitignored: high-frequency ephemeral)
│   ├── locks/                      # semaphore files for shared state
│   └── segment.md                  # shared working context for current task
│
└── .git/                           # longitudinal memory
    └── (hooks/, tags, branches)

Key SWE-specific layout decisions:

  • repos/ is where agents actually touch code. It is a separate subtree from agent state so Git can diff them independently.
  • build/ stores machine-readable test and lint output. The agent reads structured JSON, not terminal output, so it has no room to hallucinate "tests pass".
  • reviews/ uses the handoff file pattern. Builder → Reviewer communication is typed and asynchronous. The reviewer never reads builder private state.
  • world/codebase/ is intentionally narrow. It only stores what a tool (test runner, linter, AST parser) has confirmed. Free-form speculation lives in reflections.md under the agent's own private directory until it earns enough evidence to promote.

4. Memory Model

| Memory Component | Exists | Justification |
| --- | --- | --- |
| skills.md | ✅ Pre-seeded + expandable | 20 Agent Skills are the founding policy. New skills can be earned (budget: 30). Pruning is forced when budget fills. |
| goals.md | ✅ Runtime-earned | Recurring gaps (e.g., "always misses auth scope on pagination endpoints") become explicit goals |
| rewards.md | ✅ | Honest outcome signal. SWE tasks use 5-dimension decomposition (see §5) |
| reflections.md | ✅ | Failure post-mortems. Pattern in 3+ reflections → candidate goal |
| world/codebase/ | ✅ Bounded | Tool-verified codebase facts only. No open-ended ontology. |
| cases/ | ❌ Dropped | Too similar to reflections for v1. Revisit in v2 if pattern recognition proves useful. |
| templates/ | ✅ Embedded in skills | PR templates, commit message formats live inside the relevant skill file, not as a separate memory tree |
| user preferences | ⚠️ In persona.md | Per-project preferences (language, style guide) are injected into persona at task start via agents.yaml, not a mutable memory file |
| policies/ | ✅ As constraints.md | Hard rules (no rm -rf, no network in executor, no PII logging) are in the read-only constraints file |
| hypotheses/ | ⚠️ Constrained | Exist only in reflections.md. Promotion to world/codebase/patterns/ requires 2 tool-verified corroborations. |
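The "pattern in 3+ reflections becomes a candidate goal" rule from the table can be sketched as a simple count. This is only an illustration: the blueprint does not specify how reflections are stored, so the `patterns` key and the function name `candidate_goals` are hypothetical.

```python
from collections import Counter


def candidate_goals(reflections, min_reflections=3):
    """Return pattern labels that recur in at least `min_reflections` post-mortems.

    Each reflection is assumed (hypothetically) to be a dict with a
    "patterns" list of short labels attached during the post-mortem.
    """
    counts = Counter()
    for reflection in reflections:
        # set() so a label repeated inside one reflection counts only once
        counts.update(set(reflection.get("patterns", [])))
    return sorted(label for label, n in counts.items() if n >= min_reflections)
```

Counting distinct reflections rather than raw mentions prevents one verbose post-mortem from promoting a goal on its own.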

Why pre-seed skills instead of bootstrapping from zero?

The AgentOS blueprint's "earn from nothing" approach is elegant for general agents. For a SWE agent, the 20 Agent Skills represent validated engineering judgment that should not be re-discovered empirically. Starting from zero would waste tasks relearning "write tests before merging", a known-good prior.

The skill budget still applies: earned skills compete with pre-seeded ones. If a pre-seeded skill consistently earns 0 or -1 for this agent's actual task distribution, it gets pruned exactly like any other.
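A minimal sketch of how budget-driven pruning could work, assuming per-skill composite scores are kept in memory. The `Skill` class, the "lowest mean composite goes first" tie-break, and the choice to make earned skills always eligible are assumptions for illustration; only the budget of 30 and the 5-failure threshold for pre-seeded skills come from the blueprint.

```python
from __future__ import annotations

from dataclasses import dataclass, field

SKILL_BUDGET = 30               # agents.yaml: skill_budget
MIN_FAILURES_BEFORE_PRUNE = 5   # agents.yaml: min_failures_before_prune


@dataclass
class Skill:
    name: str
    pre_seeded: bool
    composites: list[int] = field(default_factory=list)  # +1 / 0 / -1 per task

    @property
    def non_positive(self) -> int:
        """Number of tasks where this skill's composite was 0 or -1."""
        return sum(1 for c in self.composites if c <= 0)

    @property
    def mean(self) -> float:
        return sum(self.composites) / len(self.composites) if self.composites else 0.0


def prune_candidate(skills: list[Skill]) -> Skill | None:
    """Pick the skill to drop once the budget is full, or None if none qualifies."""
    if len(skills) < SKILL_BUDGET:
        return None
    eligible = [
        s for s in skills
        # Assumption: earned skills are always eligible; pre-seeded ones need
        # at least MIN_FAILURES_BEFORE_PRUNE non-positive composites.
        if not s.pre_seeded or s.non_positive >= MIN_FAILURES_BEFORE_PRUNE
    ]
    return min(eligible, key=lambda s: s.mean, default=None)
```

If no skill is eligible, the function returns None, which in this sketch would mean the new skill candidate is not admitted until evidence against an existing skill accumulates.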


5. Reward Signal Design

Plain +1/0/-1 is too coarse for SWE tasks. Every task is scored on five dimensions, which are then combined into a composite:

# rewards.md entry
- task_id: task-2025-001
  task_type: bug-fix
  timestamp: 2025-01-15T14:32:00Z
  reward_decomposition:
    correctness: +1      # tests pass, bug is actually fixed
    process:     +1      # spec → plan → build → test followed
    efficiency:   0      # took 3 LLM calls where 2 would have sufficed
    test_quality: +1     # regression test added
    review_gate: +1      # reviewer approved, no critical findings
  composite_score: +1    # majority positive = +1
  skills_applied: [test-driven-development, debugging-and-error-recovery]
  skills_candidate: null
  model_used: mid-tier
  context_tags: [python, async, bug-fix, database]
  tool_verified: true    # score is from test runner, not self-assessment

tool_verified: true is the critical constraint. The agent cannot award itself a +1 based on "looks right". Correctness requires:

  • pytest / jest / equivalent: exit code 0
  • Lint: zero new violations
  • Review gate: reviewer worker found no CRITICAL or BLOCKER findings

This guards against the core failure mode of self-scoring systems: reward hacking.
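The scoring rules above can be sketched as two small functions. The dimension names and the "majority positive = +1" rule come from the rewards.md entry; scoring a failed gate as -1 and returning -1 on a negative majority are assumptions, since the blueprint only spells out the positive case.

```python
from __future__ import annotations

BLOCKING = {"CRITICAL", "BLOCKER"}


def correctness_score(test_exit_code: int, new_lint_violations: int,
                      review_severities: list[str]) -> int:
    """+1 only when every tool-backed gate is clean; otherwise -1 (assumed)."""
    clean = (
        test_exit_code == 0
        and new_lint_violations == 0
        and not BLOCKING.intersection(review_severities)
    )
    return 1 if clean else -1


def composite(decomposition: dict[str, int]) -> int:
    """Majority vote over the dimensions.

    +1 if more than half are positive, -1 if more than half are negative
    (assumed symmetric rule), 0 otherwise.
    """
    n = len(decomposition)
    positives = sum(1 for v in decomposition.values() if v > 0)
    negatives = sum(1 for v in decomposition.values() if v < 0)
    if positives * 2 > n:
        return 1
    if negatives * 2 > n:
        return -1
    return 0
```

Applied to the example entry (four dimensions at +1, efficiency at 0), `composite` returns 1, matching the `composite_score: +1` recorded there.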


6. Verification Architecture

Task Attempt
    │
    ├─ Static Analysis ───────────── lint, type-check (tool output → JSON)
    │
    ├─ Unit Tests ────────────────── pytest/jest with coverage threshold
    │
    ├─ Integration Tests ─────────── against fixtures, never live external
    │   (if applicable)
    │
    ├─ Reviewer Worker ───────────── five-axis code review (Agent Skills)
    │   ├── correctness axis
    │   ├── test quality axis
    │   ├── security axis (OWASP Top 10 checklist)
    │   ├── readability axis
    │   └── change sizing axis (~100 lines / PR)
    │
    ├─ Human Review Gate ─────────── triggered when:
    │   ├── security-and-hardening skill is activated
    │   ├── change touches auth, secrets, or PII paths
    │   ├── reviewer finds CRITICAL severity finding
    │   └── composite reward would be first +1 for a new skill candidate
    │
    └─ Reward Scoring ────────────── only after all gates pass

What is NOT in the verification chain:

  • Simulation environments (too fragile for v1; revisit in v2)
  • LLM-as-judge for correctness (self-scoring; tool output is authoritative)
  • Live external API calls from the executor (network disabled in sandbox)
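The four human review triggers in the verification chain could be evaluated with a check like the one below. This is a sketch under assumptions: the function name, argument shapes, and reason strings are hypothetical, and the path pattern is borrowed from the `human_review_triggers` entry in agents.yaml.

```python
import re

# Pattern taken from agents.yaml: "file path matches auth|secret|credential|pii|token"
SENSITIVE_PATH = re.compile(r"auth|secret|credential|pii|token", re.IGNORECASE)


def human_review_reasons(skills_applied, changed_paths, review_severities,
                         first_plus_one_for_candidate):
    """Return the list of triggers that fired; an empty list means no human gate."""
    reasons = []
    if "security-and-hardening" in skills_applied:
        reasons.append("security-and-hardening activated")
    if any(SENSITIVE_PATH.search(path) for path in changed_paths):
        reasons.append("sensitive path touched")
    if "CRITICAL" in review_severities:
        reasons.append("reviewer CRITICAL finding")
    if first_plus_one_for_candidate:
        reasons.append("new skill candidate first validation")
    return reasons
```

Because the pattern is a plain substring match, it will also fire on paths such as `src/authors.py`. That errs on the side of more reviews, which may need tightening if gate fatigue becomes a problem.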

7. Learning Policy: Constrained-Active

Policy: constrained-active

┌─────────────────────────────────────────────────────────┐
│  LEARNING BOUNDARIES                                    │
│                                                         │
│  ✅ CAN learn:                                          │
│     New skills beyond the 20 pre-seeded ones            │
│     Updated reward_evidence for existing skills         │
│     New goals from repeated reflection patterns         │
│     New world/codebase patterns (tool-verified only)    │
│                                                         │
│  ⚠️  CAN update with constraints:                       │
│     Pre-seeded skill reward_evidence (up or down)       │
│     Prune pre-seeded skill only if reward_evidence      │
│     shows 5+ tasks with composite ≤ 0                   │
│                                                         │
│  ❌ CANNOT do:                                          │
│     Rewrite persona.md or constraints.md                │
│     Self-award +1 without tool-verified evidence        │
│     Promote hypothesis to world model without 2         │
│     corroborating tool outputs                          │
│     Earn a new skill on first +1 (requires 2+           │
│     confirmed +1 outcomes across different tasks)       │
└─────────────────────────────────────────────────────────┘

Why require 2+ confirmed +1s for a new skill?
A single success could be task-specific luck. Two independent +1s across different task_ids, each with tool_verified: true, give at least minimal evidence that the skill generalizes. This loosely mirrors the reinforcement learning practice of collecting a minimum number of samples before a policy update.
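The earning rule can be expressed as a short check over reward entries. The field names (`task_id`, `skills_candidate`, `composite_score`, `tool_verified`) follow the rewards.md entry shown earlier; the function name and the assumption that entries are already parsed into dicts are illustrative only.

```python
def can_earn_skill(candidate, reward_entries, min_confirmations=2):
    """True once `candidate` has enough tool-verified +1 outcomes on distinct tasks."""
    confirming_tasks = {
        entry["task_id"]
        for entry in reward_entries
        if entry.get("skills_candidate") == candidate
        and entry.get("composite_score") == 1
        and entry.get("tool_verified") is True
    }
    # A set of task_ids, so repeated entries for the same task count once
    return len(confirming_tasks) >= min_confirmations
```

Collecting task_ids into a set is what enforces "across different tasks": re-scoring the same task twice does not count as a second confirmation.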


8. Model Routing Policy

| Task Class | Trigger | Model Tier | Rationale |
| --- | --- | --- | --- |
| Boilerplate generation | task_type: scaffold, low reasoning_demand | Cheap (Haiku-class) | Deterministic, low-stakes |
| Refactoring, simple bug fixes | context_tags include known patterns | Cheap | Pattern-matched to existing skills |
| Spec writing, API design | skills_applied includes spec-driven-development | Mid (Sonnet-class) | Needs coherent multi-step reasoning |
| Complex debugging, root cause | task_type: debug, high reasoning_demand | Mid | Multi-hypothesis reasoning needed |
| Security review | security-and-hardening activated | Mid → Premium | High-stakes; escalate on any finding |
| Reviewer worker (five-axis) | All code review passes | Mid | Structured output, not creative |
| Architecture decisions | documentation-and-adrs activated | Premium | Long-horizon, high consequence |
| Human review gate prep | Summarizing findings for human | Cheap | Mechanical formatting task |
| First +1 skill candidate | New skill validation pass | Premium | High consequence; verify thoroughly |

Routing is driven by context_tags from reward history, not static rules. After 50+ tasks, the agent has empirical data on which tag combinations actually need which model tier. The routing table above is the prior; task history updates the posterior.
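One way to read "the table is the prior; task history updates the posterior" is sketched below. Everything beyond the three tier names is an assumption made for illustration: the `Router` class, the sample and success thresholds, and the one-step escalation rule are not specified by the blueprint.

```python
from collections import defaultdict

TIERS = ["cheap", "mid", "premium"]  # ordered cheapest first


class Router:
    def __init__(self, min_samples=5, min_success=0.8):
        self.min_samples = min_samples   # assumed evidence threshold
        self.min_success = min_success   # assumed required +1 rate
        # (frozenset of context tags, tier) -> [successes, attempts]
        self.history = defaultdict(lambda: [0, 0])

    def record(self, tags, tier, composite):
        stats = self.history[(frozenset(tags), tier)]
        stats[0] += 1 if composite > 0 else 0
        stats[1] += 1

    def route(self, tags, prior_tier):
        key = frozenset(tags)
        # Prefer the cheapest tier that has proven itself on this tag combination.
        for tier in TIERS:
            wins, n = self.history.get((key, tier), (0, 0))
            if n >= self.min_samples and wins / n >= self.min_success:
                return tier
        # Otherwise use the prior, escalating one step if the prior tier has
        # enough samples and is underperforming.
        wins, n = self.history.get((key, prior_tier), (0, 0))
        if n >= self.min_samples and wins / n < self.min_success:
            return TIERS[min(TIERS.index(prior_tier) + 1, len(TIERS) - 1)]
        return prior_tier
```

With no history the router simply returns the tier from the table, so v1 behavior (a fixed tier) falls out as the empty-history case.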


9. Security Policy

Risk Level: Medium (code execution is the primary attack surface)

┌── Network Policy ──────────────────────────────────────────┐
│  Orchestrator container: outbound to model API only        │
│  Worker containers: no outbound network                    │
│  Executor container: NO network (air-gapped)               │
│  Test fixtures: local only, no live external APIs          │
└────────────────────────────────────────────────────────────┘

┌── Code Execution Sandbox ──────────────────────────────────┐
│  Executor runs as non-root UID 65534                       │
│  Read-only filesystem except /tmp and /workspace           │
│  CPU limit: 2 cores, Memory limit: 512MB                   │
│  Timeout: 120s per execution (hard kill)                   │
│  No shell access from agent: only structured commands      │
└────────────────────────────────────────────────────────────┘

┌── Forbidden Operations (constraints.md, enforced) ─────────┐
│  rm -rf on any path outside /tmp                           │
│  git push to any remote                                    │
│  Writes to repos/ without task_id in active queue          │
│  Reading or logging any file matching *secret*, *.env,     │
│  *credential*, *token* (pattern match before any read)     │
│  Spawning subprocesses with shell=True                     │
└────────────────────────────────────────────────────────────┘

┌── PII Policy ──────────────────────────────────────────────┐
│  No PII in rewards.md or reflections.md                    │
│  No real credentials in task queue entries                 │
│  world/codebase/ entries must be anonymized                │
│  Human review required before any task touching            │
│  user data schemas or auth pathways                        │
└────────────────────────────────────────────────────────────┘

┌── Approval Gates ──────────────────────────────────────────┐
│  Human review before: any file touching auth, PII,         │
│  secrets management, or security-critical paths            │
│  Dual-model review for: new skill candidates,              │
│  architecture decisions, world model promotions            │
└────────────────────────────────────────────────────────────┘
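Two of the forbidden operations lend themselves to a short guard: the filename pattern check before any read, and the ban on `shell=True`. The sketch below uses the four glob patterns from the policy; the function names, the case-insensitive full-path match, and the default timeout wiring are assumptions.

```python
import fnmatch
import subprocess

# Patterns from the Forbidden Operations box
FORBIDDEN_PATTERNS = ("*secret*", "*.env", "*credential*", "*token*")


def is_forbidden(path: str) -> bool:
    """True if the path matches any forbidden pattern (case-insensitive, assumed)."""
    lowered = path.lower()
    return any(fnmatch.fnmatchcase(lowered, pattern) for pattern in FORBIDDEN_PATTERNS)


def run_structured(argv, timeout_s=120):
    """Run a command given as an argument list, never through a shell.

    Raises TypeError for shell-style strings and subprocess.TimeoutExpired
    when the command exceeds `timeout_s` seconds.
    """
    if not isinstance(argv, list) or not all(isinstance(a, str) for a in argv):
        raise TypeError("command must be a list of strings, not a shell string")
    return subprocess.run(
        argv, shell=False, capture_output=True, text=True, timeout=timeout_s
    )
```

Rejecting plain strings at the boundary means a prompt-injected `"pytest; curl ..."` never reaches a shell parser. Note that `*token*` also matches harmless names such as `tokenizer.py`, so the check is deliberately conservative.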

10. Docker Project Files

Dockerfile

FROM python:3.12-slim AS base

# ── System dependencies ──────────────────────────────────────
RUN apt-get update && apt-get install -y --no-install-recommends \
    git curl jq \
    && rm -rf /var/lib/apt/lists/*

# ── Non-root agent user ──────────────────────────────────────
RUN useradd -m -u 1001 agent
WORKDIR /app

# ── Python dependencies ──────────────────────────────────────
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# ── Agent source ─────────────────────────────────────────────
COPY src/ ./src/
COPY agents.yaml .

# ── Git hooks (commit-per-learning-event) ────────────────────
COPY hooks/post-reward.sh /app/hooks/
RUN chmod +x /app/hooks/post-reward.sh

USER agent

# Volume: /data (all persistent state)
VOLUME ["/data"]

ENTRYPOINT ["python", "src/main.py"]

docker-compose.yml

version: "3.9"

services:
  orchestrator:
    build: .
    container_name: agentos-orchestrator
    environment:
      - AGENT_ROLE=orchestrator
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - CHEAP_MODEL=claude-haiku-4-5-20251001
      - MID_MODEL=claude-sonnet-4-6
      - PREMIUM_MODEL=claude-opus-4-6
    volumes:
      - agent-data:/data
    networks:
      - agent-net
    depends_on:
      - executor

  builder:
    build: .
    container_name: agentos-builder
    environment:
      - AGENT_ROLE=builder
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - CHEAP_MODEL=claude-haiku-4-5-20251001
      - MID_MODEL=claude-sonnet-4-6
      - PREMIUM_MODEL=claude-opus-4-6
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  reviewer:
    build: .
    container_name: agentos-reviewer
    environment:
      - AGENT_ROLE=reviewer
      - AGENT_PERSONA=code-reviewer   # uses the code-reviewer Agent Skills persona
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - CHEAP_MODEL=claude-haiku-4-5-20251001
      - MID_MODEL=claude-sonnet-4-6
      - PREMIUM_MODEL=claude-opus-4-6
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  executor:
    image: python:3.12-slim          # minimal, no agent code
    container_name: agentos-executor
    user: "65534:65534"              # nobody
    read_only: true
    tmpfs:
      - /tmp:size=256m
      - /workspace:size=512m
    volumes:
      - agent-data:/data:ro          # read-only view of repos
    network_mode: none               # NO network access (an empty networks list would still join the default network)
    cpus: "2.0"
    mem_limit: 512m
    entrypoint: ["python", "-c", "import time; time.sleep(86400)"]  # kept alive, receives work via IPC

volumes:
  agent-data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: ${PWD}/data            # bind source must be an absolute path that already exists

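networks:
  agent-net:
    driver: bridge
    # internal: true blocks external traffic for every service on this network,
    # including the orchestrator. For the orchestrator to reach the model API,
    # it would also need to be attached to a second, non-internal network.
    internal: true                   # no external internet from workers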

agents.yaml

agents:
  orchestrator:
    persona: "Senior software engineering team lead. Enforce process discipline.
      Spec before code. Tests before merge. Never skip review gates."
    skill_budget: 30
    pre_seeded_skills:
      - spec-driven-development
      - planning-and-task-breakdown
      - incremental-implementation
      - test-driven-development
      - context-engineering
      - source-driven-development
      - frontend-ui-engineering
      - api-and-interface-design
      - browser-testing-with-devtools
      - debugging-and-error-recovery
      - code-review-and-quality
      - code-simplification
      - security-and-hardening
      - performance-optimization
      - git-workflow-and-versioning
      - ci-cd-and-automation
      - deprecation-and-migration
      - documentation-and-adrs
      - shipping-and-launch
      - idea-refine
    constraints:
      max_reward_log: 50
      max_reflection_log: 20
      min_task_composite_for_skill_earn: "+1"
      min_confirmations_for_new_skill: 2
      min_failures_before_prune: 5
      forbidden_tools: ["rm -rf outside /tmp", "git push", "shell=True subprocess"]
      human_review_triggers:
        - security-and-hardening activated
        - file path matches auth|secret|credential|pii|token
        - reviewer CRITICAL finding
        - new skill candidate first validation
      model_routing:
        cheap: ${CHEAP_MODEL}
        mid: ${MID_MODEL}
        premium: ${PREMIUM_MODEL}

  builder:
    persona: "Staff engineer. Build one thin vertical slice at a time.
      Always run tests. Commit atomically."
    inherits: orchestrator.constraints

  reviewer:
    persona: "Senior engineer doing five-axis code review. Severity labels:
      Nit / Optional / FYI / CRITICAL / BLOCKER. Changes > 100 lines get split."
    review_axes:
      - correctness
      - test_quality
      - security
      - readability
      - change_sizing

env.example

# Model API
ANTHROPIC_API_KEY=sk-ant-...

# Model tiers: map to actual model strings
CHEAP_MODEL=claude-haiku-4-5-20251001
MID_MODEL=claude-sonnet-4-6
PREMIUM_MODEL=claude-opus-4-6

# Execution limits
EXECUTOR_TIMEOUT_SECONDS=120
MAX_PARALLEL_WORKERS=2

# Learning controls
SKILL_BUDGET=30
MIN_CONFIRMATIONS_FOR_SKILL=2

# Observability
LOG_LEVEL=INFO
METRICS_FILE=/data/metrics/runtime.jsonl

Startup Flow

docker compose up
    │
    ├─ executor starts (sandboxed, awaiting work)
    │
    ├─ orchestrator starts
    │   ├─ mounts /data volume
    │   ├─ checks /data/agents/agentos/skills.md exists
    │   │   └─ if not: runs seed.py → writes all 20 Agent Skills
    │   ├─ installs Git hooks into /data/.git/hooks/
    │   └─ begins polling shared/queue.md
    │
    ├─ builder starts → polls shared/inbox/builder.md
    └─ reviewer starts → polls shared/inbox/reviewer.md

11. Git as Longitudinal Memory

Commit format

agentos(learn): earn async-pagination-pattern from +1 API task

task: sync paginated records from internal billing service
reward: composite +1 [correctness+1, process+1, efficiency+1, test+1, review+1]
skills-added: async-pagination-pattern
skills-pruned: none
world-model-update: world/codebase/patterns/async-pagination.md
model-used: mid-tier

Git hook: post-reward.sh

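#!/bin/bash
# Runs after every rewards.md write and commits the learning event.
# "post-reward" is not a native Git hook name, so the agent runtime must
# invoke this script itself. Assumes the newest entry sits in the first
# 20 lines of rewards.md.

REWARDS_FILE="/data/agents/agentos/rewards.md"
ENTRY=$(head -n 20 "$REWARDS_FILE")
LAST_SCORE=$(echo "$ENTRY" | grep -m1 "composite_score:" | awk '{print $2}')
# The task_id line starts with "- ", so the id is the last field, not $2.
TASK_ID=$(echo "$ENTRY" | grep -m1 "task_id:" | awk '{print $NF}')
VERIFIED=$(echo "$ENTRY" | grep -m1 "tool_verified:" | awk '{print $2}')

# Refuse to record a reward that is not backed by tool output.
if [ "$VERIFIED" != "true" ]; then
  echo "post-reward: task ${TASK_ID} is not tool-verified; skipping commit" >&2
  exit 1
fi

cd /data || exit 1
git add agents/ world/   # never add repos/ or build/ to agent memory commits
git commit -m "agentos(reward): task ${TASK_ID} scored ${LAST_SCORE}" \
  2>/dev/null || true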

.gitignore

# Ephemeral IPC
shared/outbox/
shared/locks/

# Build artifacts (high-churn, not agent memory)
build/

# Active repo working copies (managed by their own Git)
repos/

# Runtime noise
*.pyc
__pycache__/
.env

Useful Git queries

# Full learning timeline
git log --grep="^agentos" --oneline

# Reward trend (composite scores over time; the sign is optional so a 0 score is kept)
git log --grep="^agentos(reward)" --format="%s" | grep -oP 'scored \K[+-]?\d'

# When was a specific skill earned
git log -S "async-pagination-pattern" --oneline

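# Skill churn: are we maturing or thrashing?
# %B prints the full message, so trailer lines are visible; "none" entries are not counted
git log --grep="skills-added" --format="%B" | grep "^skills-added:" | grep -vc "none$"
git log --grep="skills-pruned" --format="%B" | grep "^skills-pruned:" | grep -vc "none$"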

# Detect reward regression
git log --grep="scored" --format="%s" | head -10   # last 10 outcomes (git log lists newest first)

12. Observability

Metrics emitted to METRICS_FILE as JSONL after every task:

| Metric | Type | What It Tells You |
| --- | --- | --- |
| composite_reward | +1/0/-1 | Is the agent improving? |
| test_pass_rate | float 0-1 | Core correctness signal |
| review_gate_rejections | count | How often reviewer blocks builder |
| skill_churn_rate | earned / pruned | Maturing vs thrashing |
| model_tier_distribution | cheap/mid/premium % | Cost efficiency |
| tasks_per_skill | map[skill → count] | Which skills are actually used |
| human_review_triggers | count by trigger type | How often human is pulled in |
| executor_timeout_rate | % | Are code runs getting too complex? |
| world_model_promotion_rate | count | Hypothesis → confirmed pattern |
| tool_verified_rate | % | Are reward scores honest? |

Log format:

{"ts":"2025-01-15T14:32:00Z","task_id":"task-2025-001","event":"reward","composite":1,"tool_verified":true,"model":"mid","skills_applied":["test-driven-development"]}
{"ts":"2025-01-15T14:33:00Z","task_id":"task-2025-001","event":"commit","skills_added":null,"skills_pruned":null}
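A few of the metrics can be derived directly from the JSONL log format above. The sketch below reads only fields that appear in the sample lines (`event`, `composite`, `tool_verified`, `model`); the function name, the output keys, and wiring in the 20% premium alert from the risk table are assumptions.

```python
import json

PREMIUM_ALERT_THRESHOLD = 0.20  # risk table: alert if premium exceeds 20% of tasks


def summarize(jsonl_text: str) -> dict:
    """Aggregate reward events from runtime.jsonl text into a small summary."""
    events = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    rewards = [e for e in events if e.get("event") == "reward"]
    n = len(rewards)
    if n == 0:
        return {"tasks": 0}
    premium_share = sum(1 for e in rewards if e.get("model") == "premium") / n
    return {
        "tasks": n,
        "mean_composite": sum(e["composite"] for e in rewards) / n,
        "tool_verified_rate": sum(1 for e in rewards if e.get("tool_verified")) / n,
        "premium_share": premium_share,
        "premium_alert": premium_share > PREMIUM_ALERT_THRESHOLD,
    }
```

Non-reward events such as `commit` are skipped, so the summary counts one row per scored task.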

13. Example Tasks for AgentSWEOS

  1. Spec + implement a rate-limited API client: triggers api-and-interface-design, incremental-implementation, test-driven-development
  2. Fix a flaky async test in a Python service: triggers debugging-and-error-recovery, test-driven-development
  3. Refactor a 500-line module that grew beyond the Rule of 500: triggers code-simplification, code-review-and-quality
  4. Add OWASP-compliant input validation to a form endpoint: triggers security-and-hardening → human review gate
  5. Write an Architecture Decision Record for switching from REST to gRPC: triggers documentation-and-adrs, api-and-interface-design, premium model
  6. Set up CI pipeline with lint + test + coverage thresholds: triggers ci-cd-and-automation, git-workflow-and-versioning
  7. Deprecate a legacy auth endpoint with a migration path: triggers deprecation-and-migration, security-and-hardening
  8. Profile and fix a slow database query causing P95 regression: triggers performance-optimization, debugging-and-error-recovery
  9. Build a React component with WCAG 2.1 AA accessibility: triggers frontend-ui-engineering, test-driven-development
  10. Pre-launch checklist for a new microservice: triggers shipping-and-launch, security-and-hardening, documentation-and-adrs

14. Risks & Failure Modes

| Risk | Likelihood | Mitigation |
| --- | --- | --- |
| Executor escape (sandbox breakout) | Low | Non-root, read-only FS, no network, hard timeout |
| Reward hacking (self-awarding +1 without tool evidence) | Medium | tool_verified: true required; Git hook validates before commit |
| Skill ossification (pre-seeded skills never pruned) | Medium | Reward evidence tracked per-skill; prune if 5+ composite ≤ 0 |
| Hallucinated world model facts | Medium | world/codebase/ writes require 2 corroborating tool outputs |
| Model routing drift (always escalating to premium) | Low-Medium | Routing logged in metrics; alert if premium > 20% of tasks |
| Context window overflow on large repos | Medium | context-engineering skill limits context to context_tags matches |
| Review gate fatigue (human ignoring gates) | Medium | Human review triggers must be narrow and specific; log ignored gates |
| Git history bloat from high-frequency tasks | Low | .gitignore excludes build/, outbox/; commits only on learning events |
| Spec skip (builder jumping straight to code) | Medium | spec-driven-development skill enforced by orchestrator before queuing to builder |

15. v1 Scope: Ship in 2 Weeks

Goal: One working end-to-end task loop with honest rewards and observable Git history.

Week 1:
├─ Docker Compose up with orchestrator + executor (no reviewer yet)
├─ Seed 20 Agent Skills into skills.md on first boot
├─ Task queue: single FIFO file, polled every 30s
├─ Builder: spec → code → test (pytest only, no integration tests)
├─ Executor: sandboxed Python runs, structured JSON output
├─ Rewards: 3-dimension (correctness, process, test_quality)
│   all tool-verified via pytest exit code + lint
└─ Git: learning commits on +1 reward only

Week 2:
├─ Add reviewer worker (five-axis, code-review-and-quality skill)
├─ Human review gate (write to stdout + block until ack file exists)
├─ Full 5-dimension reward decomposition
├─ Metrics JSONL emission
└─ Basic `git log` queries working and documented

Explicitly out of scope for v1:

  • Multi-language executor (Python only)
  • world/codebase/ pattern learning (too early; not enough task history)
  • Model routing (use mid-tier for everything in v1; routing in v2)
  • Horizontal scaling (single orchestrator)

16. v2 Expansion Plan

After 200+ real tasks with honest reward data:

  1. Model routing activation: use task history to set routing thresholds empirically
  2. world/codebase/ pattern learning: enough task history to validate promotion criteria
  3. Multi-language executor: add Node.js runner with jest support
  4. Horizontal worker scaling: multiple builder workers on shared queue (semaphore locks already in place)
  5. Benchmark dashboard: visualize skill accumulation rate, pruning efficiency, reward trend over time
  6. Security auditor persona: third worker using the security-auditor Agent Skills persona, activated on gate triggers
  7. Integration test fixtures: local service stubs for API integration testing in executor
  8. Spec quality scoring: measure downstream task success rate by spec author to improve spec-driven-development skill reward evidence

AgentSWEOS: disciplined engineering process as a runtime. Debuggable with git log. Improvable through honest failure.
