Design philosophy: AgentSWEOS is not a general-purpose coding assistant. It is a disciplined SWE process modeled as a containerized agent, one that applies the same structured workflows a senior engineer brings to every task, learns what actually works, and forgets what doesn't.
Before the design, here is a transparent accounting of what was kept, changed, and dropped from the source prompt template, and why.
| Template Section | Decision | Rationale |
|---|---|---|
| Executive Summary | Kept | Essential framing |
| Runtime Shape | Adapted | Orchestrator + specialist workers (not generic single container) |
| Filesystem Layout | Heavily adapted | SWE-specific: repos/, build/, reviews/, skills/ pre-loaded |
| Memory Model | Adapted | Pre-seeded with 20 Agent Skills; world model = codebase topology |
| Verification Architecture | Expanded | CI simulation, linting, test pyramid are first-class citizens |
| Learning Policy | Adapted | Constrained-active: cannot overwrite pre-seeded skills, can earn new ones |
| Model Routing | Kept | Critical for cost control in a dev loop |
| Security Policy | Adapted | Code execution sandbox is the primary risk surface |
| Docker Project Files | Kept | Core deliverable |
| Observability | Adapted | SWE-specific metrics: test pass rate, review gate rejections, skill churn |
| Example Tasks | Kept | 10 realistic SWE scenarios |
| Risks & Failure Modes | Adapted | Emphasizes code execution risks over data hallucination |
| v1/v2 Scope | Kept | |
| customers/, clauses/, approvals/ folders | Dropped | Domain-irrelevant |
| Medical/legal verification gates | Dropped | Wrong domain |
| Generic world/entities/ open ontology | Replaced | Replaced with world/codebase/: bounded, tool-verified only |
| Autonomous truth generation | Dropped | 2026 LLM limits; all world-model writes require tool-backed evidence |
| hypotheses/ as free-form speculation | Changed | Hypotheses require a minimum of 2 corroborating tool outputs to persist |
AgentSWEOS is a containerized, self-learning software engineering agent that applies structured engineering workflows (spec, plan, build, test, review, ship) to coding tasks. It is pre-seeded with 20 production-grade skills from the Agent Skills framework and earns new ones through task outcomes scored against an honest reward signal.
Unlike a chat-based coding assistant, AgentSWEOS:
- Enforces process discipline (spec before code, tests before merge)
- Maintains a persistent, inspectable memory of what it knows and why
- Routes work to the cheapest model capable of handling it
- Runs code in a sandboxed executor and verifies outcomes with real tool output
- Commits every learning event to Git for longitudinal observability
The result is an agent that improves with use, stays narrow by design, and can be debugged with ordinary developer tools.
Chosen shape: Multi-container orchestrator + specialist workers.
+------------------------------------------------------+
|                  AgentSWEOS Runtime                  |
|                                                      |
|  +--------------+     +--------------------------+   |
|  | Orchestrator |---->| Task Queue (FIFO file)   |   |
|  |  (planner)   |<----| shared/queue.md          |   |
|  +------+-------+     +--------------------------+   |
|         |                                            |
|       +-+--------------------+                       |
|       v                      v                       |
|  +----------+          +------------+                |
|  | Builder  |          | Reviewer   |                |
|  | Worker   |          | Worker     |                |
|  +----+-----+          +-----+------+                |
|       |                      |                       |
|  +----v----------------------v------+                |
|  | Code Executor (sandbox)          |                |
|  | executor/ container              |                |
|  +----------------------------------+                |
|                                                      |
|  +----------------------------------+                |
|  | Git Volume (longitudinal         |                |
|  | memory + working repos)          |                |
|  +----------------------------------+                |
+------------------------------------------------------+
Why not single container? Code execution must be sandboxed away from agent state. A compromised or runaway build process cannot corrupt the agent's memory or the Git history. The executor is an ephemeral container with no network access and a read-only view of the repo under work.
Why not full microservices? Over-engineering for v1. Three logical roles (orchestrator, worker, executor) map cleanly to three services in docker-compose. Specialist personas (code-reviewer, test-engineer, security-auditor) are runtime configurations of the worker, not separate containers.
/data/
|-- agents/
|   `-- agentos/                  # orchestrator state
|       |-- persona.md            # read-only: "senior SWE, process-first"
|       |-- constraints.md        # read-only: budgets, forbidden tools, PII rules
|       |-- skills.md             # PRE-SEEDED with 20 Agent Skills (budget: 30)
|       |-- goals.md              # recurring gaps discovered at runtime
|       |-- rewards.md            # rolling log, last 50 (SWE tasks are longer)
|       |-- reflections.md        # failure post-mortems, last 20
|       `-- skills/               # deep skill files (20 pre-seeded + earned)
|           |-- spec-driven-development.md
|           |-- test-driven-development.md
|           |-- incremental-implementation.md
|           |-- code-review-and-quality.md
|           |-- security-and-hardening.md
|           |-- ... (all 20 Agent Skills)
|           `-- [earned at runtime]
|
|-- workers/
|   |-- builder/                  # builder worker state (mirrors agent/ shape)
|   `-- reviewer/                 # reviewer worker state
|
|-- repos/                        # working copies of target codebases
|   `-- <project>/
|       |-- .git/
|       `-- src/
|
|-- build/                        # build artifacts, test results, coverage
|   `-- <project>/
|       |-- last-test-run.json
|       |-- coverage.json
|       `-- lint-report.json
|
|-- reviews/                      # structured review outputs (handoff files)
|   `-- <task-id>/
|       |-- request.md            # what the orchestrator asked
|       |-- findings.md           # reviewer output (typed, structured)
|       `-- resolution.md         # builder response + final decision
|
|-- world/
|   `-- codebase/                 # BOUNDED: only tool-verified facts
|       |-- index.md              # known projects, languages, entry points
|       |-- patterns/             # confirmed architectural patterns
|       |   `-- <pattern>.md      # requires 2+ tool-verified +1 occurrences
|       `-- anti-patterns/        # confirmed bad patterns found in target repos
|
|-- shared/                       # Unix IPC layer
|   |-- queue.md                  # FIFO task queue
|   |-- inbox/                    # per-agent async messages
|   |   |-- orchestrator.md
|   |   |-- builder.md
|   |   `-- reviewer.md
|   |-- outbox/                   # (gitignored: high-frequency ephemeral)
|   |-- locks/                    # semaphore files for shared state
|   `-- segment.md                # shared working context for current task
|
`-- .git/                         # longitudinal memory
    `-- (hooks/, tags, branches)
Key SWE-specific layout decisions:
- repos/ is where agents actually touch code. It is a separate subtree from agent state so Git can diff them independently.
- build/ stores machine-readable test and lint output. The agent reads structured JSON, not terminal output, so there is no hallucinated "tests pass".
- reviews/ uses the handoff file pattern. Communication between Builder and Reviewer is typed and asynchronous. The reviewer never reads builder private state.
- world/codebase/ is intentionally narrow. It only stores what a tool (test runner, linter, AST parser) has confirmed. Free-form speculation lives in reflections.md under the agent's own private directory until it earns enough evidence to promote.
| Memory Component | Exists | Justification |
|---|---|---|
| skills.md | Pre-seeded + expandable | 20 Agent Skills are the founding policy. New skills can be earned (budget: 30). Pruning is forced when budget fills. |
| goals.md | Runtime-earned | Recurring gaps (e.g., "always misses auth scope on pagination endpoints") become explicit goals |
| rewards.md | Yes | Honest outcome signal. SWE tasks use 5-dimension decomposition (see §5) |
| reflections.md | Yes | Failure post-mortems. Pattern in 3+ reflections → candidate goal |
| world/codebase/ | Bounded | Tool-verified codebase facts only. No open-ended ontology. |
| cases/ | Dropped | Too similar to reflections for v1. Revisit in v2 if pattern recognition proves useful. |
| templates/ | Embedded in skills | PR templates, commit message formats live inside the relevant skill file, not as a separate memory tree |
| user preferences | Not as a separate store | Per-project preferences (language, style guide) are injected into persona at task start via agents.yaml, not a mutable memory file |
| policies/ | As constraints.md | Hard rules (no rm -rf, no network in executor, no PII logging) are in the read-only constraints file |
| hypotheses/ | Not as a separate store | Exist only in reflections.md. Promotion to world/codebase/patterns/ requires 2 tool-verified corroborations. |
Why pre-seed skills instead of bootstrapping from zero?
The AgentOS blueprint's "earn from nothing" approach is elegant for general agents. For a SWE agent, the 20 Agent Skills represent validated engineering judgment that should not be re-discovered empirically. Starting from zero would waste tasks relearning "write tests before merging", a known-good prior.
The skill budget still applies: earned skills compete with pre-seeded ones.
If a pre-seeded skill consistently earns 0 or -1 for this agent's actual
task distribution, it gets pruned exactly like any other.
Plain +1/0/-1 is too coarse for SWE tasks. Every task is scored across five dimensions:

# rewards.md entry
- task_id: task-2025-001
  task_type: bug-fix
  timestamp: 2025-01-15T14:32:00Z
  reward_decomposition:
    correctness: +1    # tests pass, bug is actually fixed
    process: +1        # spec -> plan -> build -> test followed
    efficiency: 0      # took 3 LLM calls where 2 would have sufficed
    test_quality: +1   # regression test added
    review_gate: +1    # reviewer approved, no critical findings
  composite_score: +1  # majority positive = +1
  skills_applied: [test-driven-development, debugging-and-error-recovery]
  skills_candidate: null
  model_used: mid-tier
  context_tags: [python, async, bug-fix, database]
  tool_verified: true  # score is from test runner, not self-assessment

tool_verified: true is the critical constraint. The agent cannot award itself a +1 based on "looks right". Correctness requires:
- pytest/jest/equivalent: exit code 0
- Lint: zero new violations
- Review gate: reviewer worker found no CRITICAL or BLOCKER findings
This closes off the most direct route to reward hacking, the core failure mode of self-scoring systems.
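The scoring rule can be sketched in a few lines of Python. This is a minimal illustration, not the agent's actual scorer: the document only states that a positive majority yields +1, so the treatment of a negative majority (-1), of mixed results (0), and the cap at 0 when the entry is not tool-verified are assumptions. The names `DIMENSIONS` and `composite_score` are hypothetical.

```python
from typing import Dict

# The five dimensions from the rewards.md entry.
DIMENSIONS = ("correctness", "process", "efficiency", "test_quality", "review_gate")


def composite_score(decomposition: Dict[str, int], tool_verified: bool) -> int:
    """Collapse per-dimension scores (+1/0/-1) into one composite score.

    Assumed rule: a strict majority of +1 gives +1, a strict majority of -1
    gives -1, anything else gives 0. A +1 is only awarded when the entry is
    tool-verified.
    """
    missing = [d for d in DIMENSIONS if d not in decomposition]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")

    values = [decomposition[d] for d in DIMENSIONS]
    if any(v not in (-1, 0, 1) for v in values):
        raise ValueError("each dimension must be -1, 0, or +1")

    majority = len(values) // 2 + 1  # 3 of 5
    positives = sum(1 for v in values if v == 1)
    negatives = sum(1 for v in values if v == -1)

    if negatives >= majority:
        return -1
    if positives >= majority and tool_verified:
        return 1
    return 0


example = {
    "correctness": 1,
    "process": 1,
    "efficiency": 0,
    "test_quality": 1,
    "review_gate": 1,
}
print(composite_score(example, tool_verified=True))   # 1
print(composite_score(example, tool_verified=False))  # 0
```

Keeping the `tool_verified` check inside the scorer means a self-assessed result can never produce a +1, whatever the individual dimensions say.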
Task Attempt
|
|-- Static Analysis ------- lint, type-check (tool output -> JSON)
|
|-- Unit Tests ------------ pytest/jest with coverage threshold
|
|-- Integration Tests ----- against fixtures, never live external
|                           (if applicable)
|
|-- Reviewer Worker ------- five-axis code review (Agent Skills)
|   |-- correctness axis
|   |-- test quality axis
|   |-- security axis (OWASP Top 10 checklist)
|   |-- readability axis
|   `-- change sizing axis (~100 lines / PR)
|
|-- Human Review Gate ----- triggered when:
|   |-- security-and-hardening skill is activated
|   |-- change touches auth, secrets, or PII paths
|   |-- reviewer finds CRITICAL severity finding
|   `-- composite reward would be first +1 for a new skill candidate
|
`-- Reward Scoring -------- only after all gates pass
What is NOT in the verification chain:
- Simulation environments (too fragile for v1; revisit in v2)
- LLM-as-judge for correctness (self-scoring; tool output is authoritative)
- Live external API calls from the executor (network disabled in sandbox)
Policy: constrained-active
+-- LEARNING BOUNDARIES ---------------------------------------
|
|  CAN learn:
|    New skills beyond the 20 pre-seeded ones
|    Updated reward_evidence for existing skills
|    New goals from repeated reflection patterns
|    New world/codebase patterns (tool-verified only)
|
|  CAN update with constraints:
|    Pre-seeded skill reward_evidence (up or down)
|    Prune pre-seeded skill only if reward_evidence
|    shows 5+ tasks with composite <= 0
|
|  CANNOT do:
|    Rewrite persona.md or constraints.md
|    Self-award +1 without tool-verified evidence
|    Promote hypothesis to world model without 2
|    corroborating tool outputs
|    Earn a new skill on first +1 (requires 2+
|    confirmed +1 outcomes across different tasks)
+--------------------------------------------------------------
Why require 2+ confirmed +1s for a new skill?
A single success could be task-specific luck. Two independent +1s across
different task_ids with tool_verified: true demonstrates generalizability.
This is loosely analogous to requiring a minimum number of samples before a policy update in reinforcement learning.
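The earn and prune thresholds above could be checked roughly as follows. This is a sketch under assumptions: `RewardEntry`, `can_earn_skill`, and `should_prune` are hypothetical names, and the entry carries only the rewards.md fields needed here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class RewardEntry:
    """Subset of a rewards.md entry (hypothetical in-memory shape)."""
    task_id: str
    composite_score: int
    tool_verified: bool
    skills_candidate: Optional[str] = None


def can_earn_skill(candidate: str,
                   rewards: Iterable[RewardEntry],
                   min_confirmations: int = 2) -> bool:
    """True once the candidate has tool-verified +1 outcomes on enough
    distinct task_ids. Repeats of the same task count once."""
    confirming_tasks = {
        r.task_id
        for r in rewards
        if r.skills_candidate == candidate
        and r.composite_score == 1
        and r.tool_verified
    }
    return len(confirming_tasks) >= min_confirmations


def should_prune(composites: Sequence[int], min_failures: int = 5) -> bool:
    """True when a skill's reward evidence shows enough tasks with
    composite <= 0."""
    return sum(1 for c in composites if c <= 0) >= min_failures


log = [
    RewardEntry("task-1", 1, True, "async-pagination-pattern"),
    RewardEntry("task-2", 1, True, "async-pagination-pattern"),
]
print(can_earn_skill("async-pagination-pattern", log))      # True
print(can_earn_skill("async-pagination-pattern", log[:1]))  # False
```

Collecting distinct `task_id` values in a set is what enforces "across different tasks": a retried task that succeeds twice still counts as a single confirmation.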
| Task Class | Trigger | Model Tier | Rationale |
|---|---|---|---|
| Boilerplate generation | task_type: scaffold, low reasoning_demand | Cheap (Haiku-class) | Deterministic, low-stakes |
| Refactoring, simple bug fixes | context_tags include known patterns | Cheap | Pattern-matched to existing skills |
| Spec writing, API design | skills_applied includes spec-driven-development | Mid (Sonnet-class) | Needs coherent multi-step reasoning |
| Complex debugging, root cause | task_type: debug, high reasoning_demand | Mid | Multi-hypothesis reasoning needed |
| Security review | security-and-hardening activated | Mid → Premium | High-stakes; escalate on any finding |
| Reviewer worker (five-axis) | All code review passes | Mid | Structured output, not creative |
| Architecture decisions | documentation-and-adrs activated | Premium | Long-horizon, high consequence |
| Human review gate prep | Summarizing findings for human | Cheap | Mechanical formatting task |
| First +1 skill candidate | New skill validation pass | Premium | High consequence; verify thoroughly |
Routing is driven by context_tags from reward history, not static rules.
After 50+ tasks, the agent has empirical data on which tag combinations
actually need which model tier. The routing table above is the prior; task
history updates the posterior.
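The prior encoded in the routing table can be written as a plain lookup function. The sketch below is illustrative only: the function name, the `"review"` task type, the `known_pattern_tags` argument, the rule order (highest-stakes rules first), and the mid-tier default are assumptions. The "human review gate prep" row is omitted because the table gives no machine-readable trigger for it.

```python
from typing import Iterable


def route_model_tier(task_type: str,
                     reasoning_demand: str = "low",
                     skills_applied: Iterable[str] = (),
                     context_tags: Iterable[str] = (),
                     known_pattern_tags: Iterable[str] = (),
                     first_skill_validation: bool = False,
                     security_finding: bool = False) -> str:
    """Return "cheap", "mid", or "premium" for a task (static prior only)."""
    skills = set(skills_applied)

    # Premium: new-skill validation and architecture decisions.
    if first_skill_validation or "documentation-and-adrs" in skills:
        return "premium"

    # Security review starts at mid and escalates on any finding.
    if "security-and-hardening" in skills:
        return "premium" if security_finding else "mid"

    # Mid: code review, hard debugging, spec and API design work.
    if task_type == "review":
        return "mid"
    if task_type == "debug" and reasoning_demand == "high":
        return "mid"
    if "spec-driven-development" in skills:
        return "mid"

    # Cheap: boilerplate, or tasks whose tags match known patterns.
    if task_type == "scaffold" and reasoning_demand == "low":
        return "cheap"
    if set(context_tags) & set(known_pattern_tags):
        return "cheap"

    return "mid"  # assumed default when no rule matches


print(route_model_tier("scaffold"))                                        # cheap
print(route_model_tier("adr", skills_applied=["documentation-and-adrs"]))  # premium
```

Once enough reward history exists, the hard-coded rules would be replaced or reweighted by what the `context_tags` data shows, which this static sketch does not attempt.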
Risk Level: Medium (code execution is the primary attack surface)
+-- Network Policy --------------------------------------------
|  Orchestrator container: outbound to model API only
|  Worker containers: no outbound network
|  Executor container: NO network (air-gapped)
|  Test fixtures: local only, no live external APIs
+--------------------------------------------------------------
+-- Code Execution Sandbox ------------------------------------
|  Executor runs as non-root UID 65534
|  Read-only filesystem except /tmp and /workspace
|  CPU limit: 2 cores, Memory limit: 512MB
|  Timeout: 120s per execution (hard kill)
|  No shell access from agent: only structured commands
+--------------------------------------------------------------
+-- Forbidden Operations (constraints.md, enforced) -----------
|  rm -rf on any path outside /tmp
|  git push to any remote
|  Writes to repos/ without task_id in active queue
|  Reading or logging any file matching *secret*, *.env,
|  *credential*, *token* (pattern match before any read)
|  Spawning subprocesses with shell=True
+--------------------------------------------------------------
+-- PII Policy ------------------------------------------------
|  No PII in rewards.md or reflections.md
|  No real credentials in task queue entries
|  world/codebase/ entries must be anonymized
|  Human review required before any task touching
|  user data schemas or auth pathways
+--------------------------------------------------------------
+-- Approval Gates --------------------------------------------
|  Human review before: any file touching auth, PII,
|  secrets management, or security-critical paths
|  Dual-model review for: new skill candidates,
|  architecture decisions, world model promotions
+--------------------------------------------------------------
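The "pattern match before any read" rule for sensitive files might look like the sketch below. It uses the standard-library `fnmatch` module and the four patterns listed in the forbidden operations. Matching every path component (not only the file name) and ignoring case are assumptions, chosen to err on the side of blocking. `is_forbidden_path` is a hypothetical helper name.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Patterns taken from the forbidden-operations list.
FORBIDDEN_PATTERNS = ("*secret*", "*.env", "*credential*", "*token*")


def is_forbidden_path(path: str) -> bool:
    """Return True if any component of the path matches a forbidden pattern.

    Components are lowercased first, so "Credentials.json" is caught too.
    Checking directories as well as the file name means everything under a
    "secrets/" directory is blocked.
    """
    parts = [part.lower() for part in PurePosixPath(path).parts]
    return any(
        fnmatch(part, pattern)
        for part in parts
        for pattern in FORBIDDEN_PATTERNS
    )


print(is_forbidden_path("repos/app/.env"))             # True
print(is_forbidden_path("repos/app/secrets/db.yaml"))  # True
print(is_forbidden_path("repos/app/src/main.py"))      # False
```

Substring patterns such as `*token*` also block harmless files like `tokenizer.py`. For a rule whose purpose is to keep credentials out of logs, that false positive is probably the safer failure direction, but it is worth logging so the pattern list can be tuned.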
FROM python:3.12-slim AS base

# -- System dependencies --
RUN apt-get update && apt-get install -y --no-install-recommends \
    git curl jq \
    && rm -rf /var/lib/apt/lists/*

# -- Non-root agent user --
RUN useradd -m -u 1001 agent
WORKDIR /app

# -- Python dependencies --
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# -- Agent source --
COPY src/ ./src/
COPY agents.yaml .

# -- Git hooks (commit-per-learning-event) --
COPY hooks/post-reward.sh /app/hooks/
RUN chmod +x /app/hooks/post-reward.sh

USER agent

# Volume: /data (all persistent state)
VOLUME ["/data"]

ENTRYPOINT ["python", "src/main.py"]

version: "3.9"
services:
  orchestrator:
    build: .
    container_name: agentos-orchestrator
    environment:
      - AGENT_ROLE=orchestrator
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - CHEAP_MODEL=claude-haiku-4-5-20251001
      - MID_MODEL=claude-sonnet-4-6
      - PREMIUM_MODEL=claude-opus-4-6
    volumes:
      - agent-data:/data
    networks:
      # NOTE: agent-net is internal, so on its own it also blocks the model API;
      # the orchestrator needs an additional non-internal network to reach it.
      - agent-net
    depends_on:
      - executor

  builder:
    build: .
    container_name: agentos-builder
    environment:
      - AGENT_ROLE=builder
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - CHEAP_MODEL=claude-haiku-4-5-20251001
      - MID_MODEL=claude-sonnet-4-6
      - PREMIUM_MODEL=claude-opus-4-6
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  reviewer:
    build: .
    container_name: agentos-reviewer
    environment:
      - AGENT_ROLE=reviewer
      - AGENT_PERSONA=code-reviewer  # uses the code-reviewer Agent Skills persona
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - CHEAP_MODEL=claude-haiku-4-5-20251001
      - MID_MODEL=claude-sonnet-4-6
      - PREMIUM_MODEL=claude-opus-4-6
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  executor:
    image: python:3.12-slim          # minimal, no agent code
    container_name: agentos-executor
    user: "65534:65534"              # nobody
    read_only: true
    tmpfs:
      - /tmp:size=256m
      - /workspace:size=512m
    volumes:
      - agent-data:/data:ro          # read-only view of repos
    # NO network access. An empty `networks: []` list would still attach the
    # service to the default network, so networking is disabled explicitly.
    network_mode: "none"
    cpus: "2.0"
    mem_limit: 512m
    entrypoint: ["python", "-c", "import time; time.sleep(86400)"]  # kept alive, receives work via IPC

volumes:
  agent-data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: ./data

networks:
  agent-net:
    driver: bridge
    internal: true  # no external internet from workers

agents:
  orchestrator:
    persona: "Senior software engineering team lead. Enforce process discipline.
      Spec before code. Tests before merge. Never skip review gates."
    skill_budget: 30
    pre_seeded_skills:
      - spec-driven-development
      - planning-and-task-breakdown
      - incremental-implementation
      - test-driven-development
      - context-engineering
      - source-driven-development
      - frontend-ui-engineering
      - api-and-interface-design
      - browser-testing-with-devtools
      - debugging-and-error-recovery
      - code-review-and-quality
      - code-simplification
      - security-and-hardening
      - performance-optimization
      - git-workflow-and-versioning
      - ci-cd-and-automation
      - deprecation-and-migration
      - documentation-and-adrs
      - shipping-and-launch
      - idea-refine
    constraints:
      max_reward_log: 50
      max_reflection_log: 20
      min_task_composite_for_skill_earn: "+1"
      min_confirmations_for_new_skill: 2
      min_failures_before_prune: 5
      forbidden_tools: ["rm -rf outside /tmp", "git push", "shell=True subprocess"]
    human_review_triggers:
      - security-and-hardening activated
      - file path matches auth|secret|credential|pii|token
      - reviewer CRITICAL finding
      - new skill candidate first validation
    model_routing:
      cheap: ${CHEAP_MODEL}
      mid: ${MID_MODEL}
      premium: ${PREMIUM_MODEL}

  builder:
    persona: "Staff engineer. Build one thin vertical slice at a time.
      Always run tests. Commit atomically."
    inherits: orchestrator.constraints

  reviewer:
    persona: "Senior engineer doing five-axis code review. Severity labels:
      Nit / Optional / FYI / CRITICAL / BLOCKER. Changes > 100 lines get split."
    review_axes:
      - correctness
      - test_quality
      - security
      - readability
      - change_sizing

# Model API
ANTHROPIC_API_KEY=sk-ant-...
# Model tiers: map to actual model strings
CHEAP_MODEL=claude-haiku-4-5-20251001
MID_MODEL=claude-sonnet-4-6
PREMIUM_MODEL=claude-opus-4-6
# Execution limits
EXECUTOR_TIMEOUT_SECONDS=120
MAX_PARALLEL_WORKERS=2
# Learning controls
SKILL_BUDGET=30
MIN_CONFIRMATIONS_FOR_SKILL=2
# Observability
LOG_LEVEL=INFO
METRICS_FILE=/data/metrics/runtime.jsonl

docker compose up
|
|-- executor starts (sandboxed, awaiting work)
|
|-- orchestrator starts
|   |-- mounts /data volume
|   |-- checks /data/agents/agentos/skills.md exists
|   |   `-- if not: runs seed.py -> writes all 20 Agent Skills
|   |-- installs Git hooks into /data/.git/hooks/
|   `-- begins polling shared/queue.md
|
|-- builder starts -> polls shared/inbox/builder.md
`-- reviewer starts -> polls shared/inbox/reviewer.md
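The first-boot check in the sequence above (seed only if skills.md is missing) is an idempotent write. A rough sketch of what seed.py might do is shown here. The function name and the one-bullet-per-skill file format are assumptions, since the document does not specify the layout of skills.md.

```python
from pathlib import Path
from typing import Sequence


def seed_skills_if_missing(skills_file: Path, skill_names: Sequence[str]) -> bool:
    """Write the pre-seeded skill list only when skills.md does not exist.

    Returns True if the file was created and False if it was already there,
    so earned skills are never overwritten on a restart.
    """
    if skills_file.exists():
        return False
    skills_file.parent.mkdir(parents=True, exist_ok=True)
    lines = [f"- {name}" for name in skill_names]  # assumed file format
    skills_file.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return True
```

A call such as `seed_skills_if_missing(Path("/data/agents/agentos/skills.md"), pre_seeded_skills)` would run once at orchestrator start. Because `/data` is a persistent volume, every later boot takes the early return.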
agentos(learn): earn async-pagination-pattern from +1 API task
task: sync paginated records from internal billing service
reward: composite +1 [correctness+1, process+1, efficiency+1, test+1, review+1]
skills-added: async-pagination-pattern
skills-pruned: none
world-model-update: world/codebase/patterns/async-pagination.md
model-used: mid-tier
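#!/bin/bash
# Runs after every rewards.md write.
# Extracts the composite score and commits the learning event.
# Assumes the newest entry sits at the top of rewards.md.
REWARDS_FILE="/data/agents/agentos/rewards.md"
ENTRY=$(head -n 20 "$REWARDS_FILE")
LAST_SCORE=$(echo "$ENTRY" | grep "composite_score:" | head -1 | awk '{print $2}')
# The task_id line starts with "- ", so take the last field rather than the second.
TASK_ID=$(echo "$ENTRY" | grep "task_id:" | head -1 | awk '{print $NF}')
# Validate before commit: refuse rewards that are not backed by tool output.
if ! echo "$ENTRY" | grep -q "tool_verified: true"; then
  echo "post-reward: ${TASK_ID} is not tool-verified; skipping commit" >&2
  exit 1
fi
cd /data || exit 1
git add -- agents/ world/  # never add repos/ or build/ to agent memory commits
git commit -m "agentos(reward): task ${TASK_ID} scored ${LAST_SCORE}" 2>/dev/null || true

# Ephemeral IPC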
shared/outbox/
shared/locks/
# Build artifacts (high-churn, not agent memory)
build/
# Active repo working copies (managed by their own Git)
repos/
# Runtime noise
*.pyc
__pycache__/
.env
# Full learning timeline
git log --grep="^agentos" --oneline
# Reward trend (composite scores over time); the sign is optional so a score of 0 matches
git log --grep="^agentos(reward)" --format="%s" | grep -oP 'scored \K[+-]?\d'
# When was a specific skill earned
git log -S "async-pagination-pattern" --oneline
# Skill churn: are we maturing or thrashing?
# skills-added / skills-pruned live in the commit body, so print %b rather than %s
git log --grep="skills-added:" --format="%b" | grep "skills-added:" | grep -vc "none"
git log --grep="skills-pruned:" --format="%b" | grep "skills-pruned:" | grep -vc "none"
# Detect reward regression (git log lists newest first, so limit with -10)
git log -10 --grep="scored" --format="%s"  # last 10 outcomes

Metrics emitted to METRICS_FILE as JSONL after every task:
| Metric | Type | What It Tells You |
|---|---|---|
| composite_reward | +1/0/-1 | Is the agent improving? |
| test_pass_rate | float 0-1 | Core correctness signal |
| review_gate_rejections | count | How often reviewer blocks builder |
| skill_churn_rate | earned / pruned | Maturing vs thrashing |
| model_tier_distribution | cheap/mid/premium % | Cost efficiency |
| tasks_per_skill | map[skill → count] | Which skills are actually used |
| human_review_triggers | count by trigger type | How often human is pulled in |
| executor_timeout_rate | % | Are code runs getting too complex? |
| world_model_promotion_rate | count | Hypothesis → confirmed pattern |
| tool_verified_rate | % | Are reward scores honest? |
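Appending one JSON object per line is enough to produce the metrics file. The helper below is a minimal sketch with a hypothetical name (`emit_metric`). It assumes UTC timestamps with second precision and a caller that passes the path configured in METRICS_FILE.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def emit_metric(metrics_file: Path, event: str, task_id: str, **fields) -> dict:
    """Append one JSONL record and return it.

    Extra keyword arguments (composite, tool_verified, skills_added, ...)
    are written as top-level keys of the record.
    """
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "task_id": task_id,
        "event": event,
        **fields,
    }
    metrics_file.parent.mkdir(parents=True, exist_ok=True)
    # Append mode keeps earlier records; one compact JSON object per line.
    with metrics_file.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, separators=(",", ":")) + "\n")
    return record
```

For example, `emit_metric(Path("/data/metrics/runtime.jsonl"), "reward", "task-2025-001", composite=1, tool_verified=True)` would append a single reward record. Because each line is independent JSON, the file can be tailed or filtered with `jq`, which the Dockerfile already installs.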
Log format:
{"ts":"2025-01-15T14:32:00Z","task_id":"task-2025-001","event":"reward","composite":1,"tool_verified":true,"model":"mid","skills_applied":["test-driven-development"]}
{"ts":"2025-01-15T14:33:00Z","task_id":"task-2025-001","event":"commit","skills_added":null,"skills_pruned":null}

- Spec + implement a rate-limited API client → triggers api-and-interface-design, incremental-implementation, test-driven-development
- Fix a flaky async test in a Python service → triggers debugging-and-error-recovery, test-driven-development
- Refactor a 500-line module that grew beyond the Rule of 500 → triggers code-simplification, code-review-and-quality
- Add OWASP-compliant input validation to a form endpoint → triggers security-and-hardening → human review gate
- Write an Architecture Decision Record for switching from REST to gRPC → triggers documentation-and-adrs, api-and-interface-design, premium model
- Set up CI pipeline with lint + test + coverage thresholds → triggers ci-cd-and-automation, git-workflow-and-versioning
- Deprecate a legacy auth endpoint with a migration path → triggers deprecation-and-migration, security-and-hardening
- Profile and fix a slow database query causing P95 regression → triggers performance-optimization, debugging-and-error-recovery
- Build a React component with WCAG 2.1 AA accessibility → triggers frontend-ui-engineering, test-driven-development
- Pre-launch checklist for a new microservice → triggers shipping-and-launch, security-and-hardening, documentation-and-adrs
| Risk | Likelihood | Mitigation |
|---|---|---|
| Executor escape (sandbox breakout) | Low | Non-root, read-only FS, no network, hard timeout |
| Reward hacking (self-awarding +1 without tool evidence) | Medium | tool_verified: true required; Git hook validates before commit |
| Skill ossification (pre-seeded skills never pruned) | Medium | Reward evidence tracked per-skill; prune if 5+ tasks score composite ≤ 0 |
| Hallucinated world model facts | Medium | world/codebase/ writes require 2 corroborating tool outputs |
| Model routing drift (always escalating to premium) | Low-Medium | Routing logged in metrics; alert if premium % > 20 of tasks |
| Context window overflow on large repos | Medium | context-engineering skill limits context to context_tags matches |
| Review gate fatigue (human ignoring gates) | Medium | Human review triggers must be narrow and specific; log ignored gates |
| Git history bloat from high-frequency tasks | Low | .gitignore excludes build/, outbox/; commits only on learning events |
| Spec skip (builder jumping straight to code) | Medium | spec-driven-development skill enforced by orchestrator before queuing to builder |
Goal: One working end-to-end task loop with honest rewards and observable Git history.
Week 1:
|-- Docker Compose up with orchestrator + executor (no reviewer yet)
|-- Seed 20 Agent Skills into skills.md on first boot
|-- Task queue: single FIFO file, polled every 30s
|-- Builder: spec -> code -> test (pytest only, no integration tests)
|-- Executor: sandboxed Python runs, structured JSON output
|-- Rewards: 3-dimension (correctness, process, test_quality)
|   all tool-verified via pytest exit code + lint
`-- Git: learning commits on +1 reward only
Week 2:
|-- Add reviewer worker (five-axis, code-review-and-quality skill)
|-- Human review gate (write to stdout + block until ack file exists)
|-- Full 5-dimension reward decomposition
|-- Metrics JSONL emission
`-- Basic `git log` queries working and documented
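The v1 human review gate ("write to stdout + block until ack file exists") is simple enough to sketch directly. The function name, the `<task-id>.ack` file naming, the polling interval, and the optional timeout are all assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import Optional


def wait_for_human_ack(task_id: str,
                       reason: str,
                       ack_dir: Path,
                       poll_seconds: float = 5.0,
                       timeout_seconds: Optional[float] = None) -> bool:
    """Announce the gate on stdout, then block until the ack file appears.

    Returns True once <ack_dir>/<task_id>.ack exists, or False if the
    optional timeout elapses first. With no timeout it blocks indefinitely.
    """
    ack_file = ack_dir / f"{task_id}.ack"
    print(
        f"[HUMAN REVIEW REQUIRED] task={task_id} reason={reason}; "
        f"create {ack_file} to approve",
        flush=True,
    )
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while not ack_file.exists():
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True
```

A reviewer approves by creating the file (for example with `touch`). Because the ack is a plain file, the gate state is inspectable with ordinary tools, and an expired or ignored gate can be logged by the caller when the function returns False.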
Explicitly out of scope for v1:
- Multi-language executor (Python only)
- world/codebase/ pattern learning (too early; not enough task history)
- Model routing (use mid-tier for everything in v1; routing in v2)
- Horizontal scaling (single orchestrator)
After 200+ real tasks with honest reward data:
- Model routing activation: use task history to set routing thresholds empirically
- world/codebase/ pattern learning: enough task history to validate promotion criteria
- Multi-language executor: add Node.js runner with jest support
- Horizontal worker scaling: multiple builder workers on shared queue (semaphore locks already in place)
- Benchmark dashboard: visualize skill accumulation rate, pruning efficiency, reward trend over time
- Security auditor persona: third worker using the security-auditor Agent Skills persona, activated on gate triggers
- Integration test fixtures: local service stubs for API integration testing in executor
- Spec quality scoring: measure downstream task success rate by spec author to improve spec-driven-development skill reward evidence
AgentSWEOS: disciplined engineering process as a runtime. Debuggable with git log. Improvable through honest failure.