Example: AgentSWEOS Blueprint

Production-Grade Software Engineering Agent Runtime

Design philosophy: AgentSWEOS is not a general-purpose coding assistant. It is a disciplined SWE process modeled as a containerized agent — one that applies the same structured workflows a senior engineer brings to every task, learns what actually works, and forgets what doesn't.

0. Template Adaptation Decisions

Before the design, here is a transparent accounting of what was kept, changed, and dropped from the source prompt template — and why.

Template Section	Decision	Rationale
Executive Summary	✅ Kept	Essential framing
Runtime Shape	✅ Adapted	Orchestrator + specialist workers (not generic single container)
Filesystem Layout	✅ Heavily adapted	SWE-specific: repos/, build/, reviews/, skills/ pre-loaded
Memory Model	✅ Adapted	Pre-seeded with 20 Agent Skills; world model = codebase topology
Verification Architecture	✅ Expanded	CI simulation, linting, test pyramid are first-class citizens
Learning Policy	✅ Adapted	Constrained-active: cannot overwrite pre-seeded skills, can earn new ones
Model Routing	✅ Kept	Critical for cost control in a dev loop
Security Policy	✅ Adapted	Code execution sandbox is the primary risk surface
Docker Project Files	✅ Kept	Core deliverable
Observability	✅ Adapted	SWE-specific metrics: test pass rate, review gate rejections, skill churn
Example Tasks	✅ Kept	10 realistic SWE scenarios
Risks & Failure Modes	✅ Adapted	Emphasizes code execution risks over data hallucination
v1/v2 Scope	✅ Kept
`customers/`, `clauses/`, `approvals/` folders	❌ Dropped	Domain-irrelevant
Medical/legal verification gates	❌ Dropped	Wrong domain
Generic `world/entities/` open ontology	⚠️ Scoped down	Replaced with `world/codebase/` — bounded, tool-verified only
Autonomous truth generation	❌ Dropped	2026 LLM limits; all world-model writes require tool-backed evidence
`hypotheses/` as free-form speculation	⚠️ Constrained	Hypotheses require a minimum of 2 corroborating tool outputs to persist

1. Executive Summary

AgentSWEOS is a containerized, self-learning software engineering agent that applies structured engineering workflows — spec, plan, build, test, review, ship — to coding tasks. It is pre-seeded with 20 production-grade skills from the Agent Skills framework and earns new ones through task outcomes scored against an honest reward signal.

Unlike a chat-based coding assistant, AgentSWEOS:

Enforces process discipline (spec before code, tests before merge)
Maintains a persistent, inspectable memory of what it knows and why
Routes work to the cheapest model capable of handling it
Runs code in a sandboxed executor and verifies outcomes with real tool output
Commits every learning event to Git for longitudinal observability

The result is an agent that improves with use, stays narrow by design, and can be debugged with ordinary developer tools.

2. Runtime Shape: Orchestrator + Specialist Workers

Chosen shape: Multi-container orchestrator + specialist workers.

┌──────────────────────────────────────────────────────┐
│                   AgentSWEOS Runtime                 │
│                                                      │
│  ┌─────────────┐     ┌──────────────────────────┐   │
│  │ Orchestrator│────▶│  Task Queue (FIFO file)  │   │
│  │  (planner)  │◀────│  shared/queue.md         │   │
│  └──────┬──────┘     └──────────────────────────┘   │
│         │                                            │
│    ┌────┴────────────────────┐                       │
│    ▼                         ▼                       │
│  ┌──────────┐          ┌───────────┐                 │
│  │  Builder │          │  Reviewer │                 │
│  │  Worker  │          │  Worker   │                 │
│  └────┬─────┘          └─────┬─────┘                 │
│       │                      │                       │
│    ┌──▼──────────────────────▼──┐                    │
│    │   Code Executor (sandbox)  │                    │
│    │   executor/ container      │                    │
│    └────────────────────────────┘                    │
│                                                      │
│  ┌──────────────────────────────┐                    │
│  │  Git Volume (longitudinal    │                    │
│  │  memory + working repos)     │                    │
│  └──────────────────────────────┘                    │
└──────────────────────────────────────────────────────┘

Why not single container? Code execution must be sandboxed away from agent state. A compromised or runaway build process cannot corrupt the agent's memory or the Git history. The executor is an ephemeral container with no network access and a read-only view of the repo under work.

Why not full microservices? Over-engineering for v1. Three logical roles (orchestrator, worker, executor) map cleanly to three services in docker-compose. Specialist personas (code-reviewer, test-engineer, security-auditor) are runtime configurations of the worker, not separate containers.

3. Purpose-Built Filesystem Layout

/data/
├── agents/
│   └── agentos/                    # orchestrator state
│       ├── persona.md              # read-only: "senior SWE, process-first"
│       ├── constraints.md          # read-only: budgets, forbidden tools, PII rules
│       ├── skills.md               # PRE-SEEDED with 20 Agent Skills (budget: 30)
│       ├── goals.md                # recurring gaps discovered at runtime
│       ├── rewards.md              # rolling log, last 50 (SWE tasks are longer)
│       ├── reflections.md          # failure post-mortems, last 20
│       └── skills/                 # deep skill files (20 pre-seeded + earned)
│           ├── spec-driven-development.md
│           ├── test-driven-development.md
│           ├── incremental-implementation.md
│           ├── code-review-and-quality.md
│           ├── security-and-hardening.md
│           ├── ... (all 20 Agent Skills)
│           └── [earned at runtime]
│
├── workers/
│   ├── builder/                    # builder worker state (mirrors agents/agentos/ shape)
│   └── reviewer/                   # reviewer worker state
│
├── repos/                          # working copies of target codebases
│   └── <project>/
│       ├── .git/
│       └── src/
│
├── build/                          # build artifacts, test results, coverage
│   └── <project>/
│       ├── last-test-run.json
│       ├── coverage.json
│       └── lint-report.json
│
├── reviews/                        # structured review outputs (handoff files)
│   └── <task-id>/
│       ├── request.md              # what the orchestrator asked
│       ├── findings.md             # reviewer output (typed, structured)
│       └── resolution.md          # builder response + final decision
│
├── world/
│   └── codebase/                   # BOUNDED: only tool-verified facts
│       ├── index.md                # known projects, languages, entry points
│       ├── patterns/               # confirmed architectural patterns
│       │   └── <pattern>.md        # requires 2+ tool-verified +1 occurrences
│       └── anti-patterns/          # confirmed bad patterns found in target repos
│
├── shared/                         # Unix IPC layer
│   ├── queue.md                    # FIFO task queue
│   ├── inbox/                      # per-agent async messages
│   │   ├── orchestrator.md
│   │   ├── builder.md
│   │   └── reviewer.md
│   ├── outbox/                     # (gitignored: high-frequency ephemeral)
│   ├── locks/                      # semaphore files for shared state
│   └── segment.md                  # shared working context for current task
│
└── .git/                           # longitudinal memory
    └── (hooks/, tags, branches)

Key SWE-specific layout decisions:

repos/ is where agents actually touch code. It is a separate subtree from agent state so Git can diff them independently.
build/ stores machine-readable test and lint output. The agent reads structured JSON, not terminal output — no hallucination of "tests pass".
reviews/ uses the handoff file pattern. Builder → Reviewer communication is typed and asynchronous. The reviewer never reads builder private state.
world/codebase/ is intentionally narrow. It only stores what a tool (test runner, linter, AST parser) has confirmed. Free-form speculation lives in reflections.md under the agent's own private directory until it earns enough evidence to promote.

4. Memory Model

Memory Component	Exists	Justification
`skills.md`	✅ Pre-seeded + expandable	20 Agent Skills are the founding policy. New skills can be earned (budget: 30). Pruning is forced when budget fills.
`goals.md`	✅ Runtime-earned	Recurring gaps (e.g., "always misses auth scope on pagination endpoints") become explicit goals
`rewards.md`	✅	Honest outcome signal. SWE tasks use 5-dimension decomposition (see §5)
`reflections.md`	✅	Failure post-mortems. Pattern in 3+ reflections → candidate goal
`world/codebase/`	✅ Bounded	Tool-verified codebase facts only. No open-ended ontology.
`cases/`	❌ Dropped	Too similar to reflections for v1. Revisit in v2 if pattern recognition proves useful.
`templates/`	✅ Embedded in skills	PR templates, commit message formats live inside the relevant skill file, not as a separate memory tree
`user preferences`	⚠️ In persona.md	Per-project preferences (language, style guide) are injected into persona at task start via `agents.yaml`, not a mutable memory file
`policies/`	✅ As `constraints.md`	Hard rules (no `rm -rf`, no network in executor, no PII logging) are in the read-only constraints file
`hypotheses/`	⚠️ Constrained	Exist only in `reflections.md`. Promotion to `world/codebase/patterns/` requires 2 tool-verified corroborations.

Why pre-seed skills instead of bootstrapping from zero?

The AgentOS blueprint's "earn from nothing" approach is elegant for general agents. For a SWE agent, the 20 Agent Skills represent validated engineering judgment that should not be re-discovered empirically. Starting from zero would waste tasks relearning "write tests before merging" — a known-good prior.

The skill budget still applies: earned skills compete with pre-seeded ones. If a pre-seeded skill consistently earns 0 or -1 for this agent's actual task distribution, it gets pruned exactly like any other.

5. Reward Signal Design

Plain +1/0/-1 is too coarse for SWE tasks. Every task is scored with a five-dimension decomposition:

# rewards.md entry
- task_id: task-2025-001
  task_type: bug-fix
  timestamp: 2025-01-15T14:32:00Z
  reward_decomposition:
    correctness: +1      # tests pass, bug is actually fixed
    process:     +1      # spec → plan → build → test followed
    efficiency:   0      # took 3 LLM calls where 2 would have sufficed
    test_quality: +1     # regression test added
    review_gate: +1      # reviewer approved, no critical findings
  composite_score: +1    # majority positive = +1
  skills_applied: [test-driven-development, debugging-and-error-recovery]
  skills_candidate: null
  model_used: mid-tier
  context_tags: [python, async, bug-fix, database]
  tool_verified: true    # score is from test runner, not self-assessment

tool_verified: true is the critical constraint. The agent cannot award itself a +1 based on "looks right". Correctness requires:

pytest / jest / equivalent: exit code 0
Lint: zero new violations
Review gate: reviewer worker found no CRITICAL or BLOCKER findings

This targets the core failure mode of self-scoring systems, reward hacking, by making tool output rather than the agent's own judgment the source of every positive score.
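
The composite rule in the rewards.md entry ("majority positive = +1") could be computed as follows. This is a sketch, not the author's implementation: the source only defines the positive case, so the symmetric rule for -1 and the clamping of unverified positives to 0 are assumptions, and the names `composite_score` and `final_score` are hypothetical.

```python
from typing import Mapping


def composite_score(decomposition: Mapping[str, int]) -> int:
    """Collapse per-dimension scores (+1/0/-1) into one composite value."""
    values = list(decomposition.values())
    if any(v not in (-1, 0, 1) for v in values):
        raise ValueError("each dimension must be -1, 0, or +1")
    positives = sum(1 for v in values if v == 1)
    negatives = sum(1 for v in values if v == -1)
    if positives > len(values) / 2:
        return 1   # majority positive, as stated in the source
    if negatives > len(values) / 2:
        return -1  # assumption: mirror image of the positive rule
    return 0


def final_score(decomposition: Mapping[str, int], tool_verified: bool) -> int:
    """Assumption: a positive composite without tool evidence is clamped to 0."""
    score = composite_score(decomposition)
    if score > 0 and not tool_verified:
        return 0
    return score


example = {
    "correctness": 1,
    "process": 1,
    "efficiency": 0,
    "test_quality": 1,
    "review_gate": 1,
}
print(final_score(example, tool_verified=True))
```

With the example entry, four of five dimensions are positive, so the composite is +1 only when `tool_verified` is true.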

6. Verification Architecture

Task Attempt
    │
    ├─ Static Analysis ──────────── lint, type-check (tool output → JSON)
    │
    ├─ Unit Tests ────────────────── pytest/jest with coverage threshold
    │
    ├─ Integration Tests ─────────── against fixtures, never live external
    │   (if applicable)
    │
    ├─ Reviewer Worker ───────────── five-axis code review (Agent Skills)
    │   ├── correctness axis
    │   ├── test quality axis
    │   ├── security axis (OWASP Top 10 checklist)
    │   ├── readability axis
    │   └── change sizing axis (~100 lines / PR)
    │
    ├─ Human Review Gate ─────────── triggered when:
    │   ├── security-and-hardening skill is activated
    │   ├── change touches auth, secrets, or PII paths
    │   ├── reviewer finds CRITICAL severity finding
    │   └── composite reward would be first +1 for a new skill candidate
    │
    └─ Reward Scoring ────────────── only after all gates pass

What is NOT in the verification chain:

Simulation environments (too fragile for v1; revisit in v2)
LLM-as-judge for correctness (self-scoring; tool output is authoritative)
Live external API calls from the executor (network disabled in sandbox)
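
The four human review triggers in the chain above can be expressed as one pure function. The sketch below is illustrative: the `TaskAttempt` shape and its field names are hypothetical, and the path pattern is borrowed from the `human_review_triggers` entry in agents.yaml. A substring pattern this broad also matches harmless names such as `author.py`, which is one reason the gate may need tuning.

```python
import re
from dataclasses import dataclass, field

# Pattern taken from agents.yaml; substring matching is an assumption.
SENSITIVE_PATH = re.compile(r"auth|secret|credential|pii|token", re.IGNORECASE)


@dataclass
class TaskAttempt:
    """Hypothetical summary of one task attempt, for illustration only."""
    skills_applied: list = field(default_factory=list)
    changed_paths: list = field(default_factory=list)
    finding_severities: list = field(default_factory=list)
    first_plus_one_for_candidate: bool = False


def human_review_reasons(attempt: TaskAttempt) -> list:
    """Return every trigger that fired; an empty list means no human gate."""
    reasons = []
    if "security-and-hardening" in attempt.skills_applied:
        reasons.append("security-and-hardening activated")
    sensitive = [p for p in attempt.changed_paths if SENSITIVE_PATH.search(p)]
    if sensitive:
        reasons.append("sensitive path: " + ", ".join(sensitive))
    if "CRITICAL" in attempt.finding_severities:
        reasons.append("reviewer CRITICAL finding")
    if attempt.first_plus_one_for_candidate:
        reasons.append("new skill candidate first validation")
    return reasons
```

Returning the list of reasons rather than a boolean lets the orchestrator log which trigger pulled a human in, which feeds the `human_review_triggers` metric.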

7. Learning Policy: Constrained-Active

Policy: constrained-active

┌─────────────────────────────────────────────────────────┐
│  LEARNING BOUNDARIES                                    │
│                                                         │
│  ✅ CAN learn:                                          │
│     New skills beyond the 20 pre-seeded ones           │
│     Updated reward_evidence for existing skills        │
│     New goals from repeated reflection patterns        │
│     New world/codebase patterns (tool-verified only)   │
│                                                         │
│  ⚠️  CAN update with constraints:                       │
│     Pre-seeded skill reward_evidence (up or down)      │
│     Prune pre-seeded skill only if reward_evidence     │
│     shows 5+ tasks with composite ≤ 0                  │
│                                                         │
│  ❌ CANNOT do:                                          │
│     Rewrite persona.md or constraints.md               │
│     Self-award +1 without tool-verified evidence       │
│     Promote hypothesis to world model without 2        │
│     corroborating tool outputs                         │
│     Earn a new skill on first +1 (requires 2+          │
│     confirmed +1 outcomes across different tasks)      │
└─────────────────────────────────────────────────────────┘

Why require 2+ confirmed +1s for a new skill?
A single success could be task-specific luck. Two independent +1s across different task_ids, each with tool_verified: true, give a minimal check that the skill generalizes. This is loosely analogous to requiring a minimum number of samples before a policy update in reinforcement learning.
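
The earn and prune thresholds above can be checked directly against rewards.md entries. This is a minimal sketch that assumes each entry has already been parsed into a dict with the field names shown in §5; the function names and default thresholds mirror `min_confirmations_for_new_skill` and `min_failures_before_prune` but are otherwise hypothetical.

```python
from typing import Iterable, Mapping


def can_earn_skill(outcomes: Iterable[Mapping], candidate: str,
                   min_confirmations: int = 2) -> bool:
    """True once the candidate has tool-verified +1s on enough distinct tasks."""
    confirmed_tasks = {
        o["task_id"]
        for o in outcomes
        if o.get("skills_candidate") == candidate
        and o.get("composite_score") == 1
        and o.get("tool_verified") is True
    }
    # A set is used so that two entries for the same task_id count once.
    return len(confirmed_tasks) >= min_confirmations


def can_prune_skill(outcomes: Iterable[Mapping], skill: str,
                    min_failures: int = 5) -> bool:
    """True once the skill was applied on enough tasks with composite <= 0."""
    failures = sum(
        1
        for o in outcomes
        if skill in o.get("skills_applied", [])
        and o.get("composite_score", 0) <= 0
    )
    return failures >= min_failures
```

Counting distinct task_ids rather than raw entries is what enforces "across different tasks": a retried task cannot confirm a skill twice.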

8. Model Routing Policy

Task Class	Trigger	Model Tier	Rationale
Boilerplate generation	`task_type: scaffold`, low `reasoning_demand`	Cheap (Haiku-class)	Deterministic, low-stakes
Refactoring, simple bug fixes	`context_tags` include known patterns	Cheap	Pattern-matched to existing skills
Spec writing, API design	`skills_applied` includes `spec-driven-development`	Mid (Sonnet-class)	Needs coherent multi-step reasoning
Complex debugging, root cause	`task_type: debug`, high `reasoning_demand`	Mid	Multi-hypothesis reasoning needed
Security review	`security-and-hardening` activated	Mid → Premium	High-stakes; escalate on any finding
Reviewer worker (five-axis)	All code review passes	Mid	Structured output, not creative
Architecture decisions	`documentation-and-adrs` activated	Premium	Long-horizon, high consequence
Human review gate prep	Summarizing findings for human	Cheap	Mechanical formatting task
First +1 skill candidate	New skill validation pass	Premium	High consequence; verify thoroughly

Routing is driven by context_tags from reward history rather than by static rules alone. After 50+ tasks, the agent has empirical data on which tag combinations actually need which model tier. The routing table above is the prior; task history updates the posterior.
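
Before any task history exists, the routing table can serve as a static prior. The sketch below encodes a subset of its rows; the function name, the parameter names, the precedence between rows, and the mid-tier default for unmatched tasks are all assumptions, since the table does not say which row wins when several apply.

```python
from typing import Iterable


def route_prior(task_type: str,
                skills_applied: Iterable[str] = (),
                reasoning_demand: str = "low",
                has_security_finding: bool = False,
                known_pattern: bool = False) -> str:
    """Return "cheap", "mid", or "premium" from the static routing table."""
    skills = set(skills_applied)
    # Assumption: higher-stakes rows are checked first.
    if "documentation-and-adrs" in skills:
        return "premium"
    if "security-and-hardening" in skills:
        return "premium" if has_security_finding else "mid"
    if "spec-driven-development" in skills:
        return "mid"
    if task_type == "debug" and reasoning_demand == "high":
        return "mid"
    if task_type == "scaffold" and reasoning_demand == "low":
        return "cheap"
    if known_pattern:
        return "cheap"
    return "mid"  # assumption: unmatched tasks fall back to mid-tier
```

A learned router would replace or reweight these branches once reward history shows which `context_tags` combinations succeed on cheaper tiers.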

9. Security Policy

Risk Level: Medium (code execution is the primary attack surface)

┌── Network Policy ──────────────────────────────────────────┐
│  Orchestrator container: outbound to model API only        │
│  Worker containers: no outbound network                    │
│  Executor container: NO network (air-gapped)              │
│  Test fixtures: local only, no live external APIs         │
└────────────────────────────────────────────────────────────┘

┌── Code Execution Sandbox ──────────────────────────────────┐
│  Executor runs as non-root UID 65534                      │
│  Read-only filesystem except /tmp and /workspace          │
│  CPU limit: 2 cores, Memory limit: 512MB                 │
│  Timeout: 120s per execution (hard kill)                  │
│  No shell access from agent — only structured commands    │
└────────────────────────────────────────────────────────────┘
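
The "structured commands only" and hard-timeout rules in the sandbox box could look like this on the calling side. This is a sketch using Python's standard `subprocess` module; the wrapper name `run_structured` and the shape of the returned dict are hypothetical, and the 120-second default mirrors `EXECUTOR_TIMEOUT_SECONDS`.

```python
import subprocess
import sys
from typing import Sequence

EXECUTOR_TIMEOUT_SECONDS = 120  # mirrors the per-execution limit above


def run_structured(argv: Sequence[str],
                   timeout: float = EXECUTOR_TIMEOUT_SECONDS) -> dict:
    """Run one command as an argument list, never through a shell."""
    if isinstance(argv, str) or not argv:
        raise TypeError("argv must be a non-empty list of strings, not a shell string")
    try:
        proc = subprocess.run(
            list(argv),
            capture_output=True,
            text=True,
            timeout=timeout,  # the child is killed when this expires
            shell=False,
        )
    except subprocess.TimeoutExpired:
        return {"exit_code": None, "timed_out": True, "stdout": "", "stderr": ""}
    return {
        "exit_code": proc.returncode,
        "timed_out": False,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


if __name__ == "__main__":
    print(run_structured([sys.executable, "-c", "print('ok')"]))
```

Rejecting plain strings at the boundary makes it hard to reintroduce shell interpolation by accident.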

┌── Forbidden Operations (constraints.md, enforced) ─────────┐
│  rm -rf on any path outside /tmp                          │
│  git push to any remote                                   │
│  Writes to repos/ without task_id in active queue        │
│  Reading or logging any file matching *secret*, *.env,   │
│  *credential*, *token* (pattern match before any read)   │
│  Spawning subprocesses with shell=True                    │
└────────────────────────────────────────────────────────────┘
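
The "pattern match before any read" rule can be sketched with the standard `fnmatch` module. The pattern list is copied from the box above; the function names are hypothetical, and matching against the lowercased full path (rather than the file name only) is an assumption. Note that `*token*` also blocks files such as `tokenizer.py`, so the rule trades false positives for safety.

```python
from fnmatch import fnmatch

# Patterns copied from the forbidden-operations list.
FORBIDDEN_PATTERNS = ("*secret*", "*.env", "*credential*", "*token*")


def is_forbidden_path(path: str) -> bool:
    """Check a path against the forbidden patterns, case-insensitively."""
    lowered = path.lower()
    return any(fnmatch(lowered, pattern) for pattern in FORBIDDEN_PATTERNS)


def guarded_read(path: str) -> str:
    """Refuse to open a forbidden file; the check runs before any read."""
    if is_forbidden_path(path):
        raise PermissionError("read blocked by constraints.md pattern: " + path)
    with open(path, encoding="utf-8") as handle:
        return handle.read()
```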

┌── PII Policy ──────────────────────────────────────────────┐
│  No PII in rewards.md or reflections.md                   │
│  No real credentials in task queue entries                │
│  world/codebase/ entries must be anonymized               │
│  Human review required before any task touching           │
│  user data schemas or auth pathways                       │
└────────────────────────────────────────────────────────────┘

┌── Approval Gates ──────────────────────────────────────────┐
│  Human review before: any file touching auth, PII,        │
│  secrets management, or security-critical paths           │
│  Dual-model review for: new skill candidates,             │
│  architecture decisions, world model promotions           │
└────────────────────────────────────────────────────────────┘

10. Docker Project Files

Dockerfile

FROM python:3.12-slim AS base

# ── System dependencies ──────────────────────────────────────
RUN apt-get update && apt-get install -y --no-install-recommends \
    git curl jq \
    && rm -rf /var/lib/apt/lists/*

# ── Non-root agent user ──────────────────────────────────────
RUN useradd -m -u 1001 agent
WORKDIR /app

# ── Python dependencies ──────────────────────────────────────
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# ── Agent source ─────────────────────────────────────────────
COPY src/ ./src/
COPY agents.yaml .

# ── Git hooks (commit-per-learning-event) ────────────────────
COPY hooks/post-reward.sh /app/hooks/
RUN chmod +x /app/hooks/post-reward.sh

USER agent

# Volume: /data (all persistent state)
VOLUME ["/data"]

ENTRYPOINT ["python", "src/main.py"]

docker-compose.yml

services:
  orchestrator:
    build: .
    container_name: agentos-orchestrator
    environment:
      - AGENT_ROLE=orchestrator
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - CHEAP_MODEL=claude-haiku-4-5-20251001
      - MID_MODEL=claude-sonnet-4-6
      - PREMIUM_MODEL=claude-opus-4-6
    volumes:
      - agent-data:/data
    networks:
      - agent-net
    depends_on:
      - executor

  builder:
    build: .
    container_name: agentos-builder
    environment:
      - AGENT_ROLE=builder
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - CHEAP_MODEL=claude-haiku-4-5-20251001
      - MID_MODEL=claude-sonnet-4-6
      - PREMIUM_MODEL=claude-opus-4-6
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  reviewer:
    build: .
    container_name: agentos-reviewer
    environment:
      - AGENT_ROLE=reviewer
      - AGENT_PERSONA=code-reviewer   # uses the code-reviewer Agent Skills persona
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - CHEAP_MODEL=claude-haiku-4-5-20251001
      - MID_MODEL=claude-sonnet-4-6
      - PREMIUM_MODEL=claude-opus-4-6
    volumes:
      - agent-data:/data
    networks:
      - agent-net

  executor:
    image: python:3.12-slim          # minimal, no agent code
    container_name: agentos-executor
    user: "65534:65534"              # nobody
    read_only: true
    tmpfs:
      - /tmp:size=256m
      - /workspace:size=512m
    volumes:
      - agent-data:/data:ro          # read-only view of all of /data, including repos/
    network_mode: none               # NO network access (an empty networks list would still join the default network)
    cpus: "2.0"
    mem_limit: 512m
    entrypoint: ["python", "-c", "import time; time.sleep(86400)"]  # kept alive, receives work via IPC

volumes:
  agent-data:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: ${PWD}/data            # bind mounts via driver_opts need an absolute path

networks:
  agent-net:
    driver: bridge
    internal: true                   # no external internet for any attached service;
                                     # the orchestrator also needs a non-internal
                                     # network to reach the model API

agents.yaml

agents:
  orchestrator:
    persona: "Senior software engineering team lead. Enforce process discipline.
      Spec before code. Tests before merge. Never skip review gates."
    skill_budget: 30
    pre_seeded_skills:
      - spec-driven-development
      - planning-and-task-breakdown
      - incremental-implementation
      - test-driven-development
      - context-engineering
      - source-driven-development
      - frontend-ui-engineering
      - api-and-interface-design
      - browser-testing-with-devtools
      - debugging-and-error-recovery
      - code-review-and-quality
      - code-simplification
      - security-and-hardening
      - performance-optimization
      - git-workflow-and-versioning
      - ci-cd-and-automation
      - deprecation-and-migration
      - documentation-and-adrs
      - shipping-and-launch
      - idea-refine
    constraints:
      max_reward_log: 50
      max_reflection_log: 20
      min_task_composite_for_skill_earn: "+1"
      min_confirmations_for_new_skill: 2
      min_failures_before_prune: 5
      forbidden_tools: ["rm -rf outside /tmp", "git push", "shell=True subprocess"]
      human_review_triggers:
        - security-and-hardening activated
        - file path matches auth|secret|credential|pii|token
        - reviewer CRITICAL finding
        - new skill candidate first validation
      model_routing:
        cheap: ${CHEAP_MODEL}
        mid: ${MID_MODEL}
        premium: ${PREMIUM_MODEL}

  builder:
    persona: "Staff engineer. Build one thin vertical slice at a time.
      Always run tests. Commit atomically."
    inherits: orchestrator.constraints

  reviewer:
    persona: "Senior engineer doing five-axis code review. Severity labels:
      Nit / Optional / FYI / CRITICAL / BLOCKER. Changes > 100 lines get split."
    review_axes:
      - correctness
      - test_quality
      - security
      - readability
      - change_sizing

env.example

# Model API
ANTHROPIC_API_KEY=sk-ant-...

# Model tiers — map to actual model strings
CHEAP_MODEL=claude-haiku-4-5-20251001
MID_MODEL=claude-sonnet-4-6
PREMIUM_MODEL=claude-opus-4-6

# Execution limits
EXECUTOR_TIMEOUT_SECONDS=120
MAX_PARALLEL_WORKERS=2

# Learning controls
SKILL_BUDGET=30
MIN_CONFIRMATIONS_FOR_SKILL=2

# Observability
LOG_LEVEL=INFO
METRICS_FILE=/data/metrics/runtime.jsonl

Startup Flow

docker compose up
    │
    ├─ executor starts (sandboxed, awaiting work)
    │
    ├─ orchestrator starts
    │   ├─ mounts /data volume
    │   ├─ checks /data/agents/agentos/skills.md exists
    │   │   └─ if not: runs seed.py → writes all 20 Agent Skills
    │   ├─ installs Git hooks into /data/.git/hooks/
    │   └─ begins polling shared/queue.md
    │
    ├─ builder starts → polls shared/inbox/builder.md
    └─ reviewer starts → polls shared/inbox/reviewer.md

11. Git as Longitudinal Memory

Commit format

agentos(learn): earn async-pagination-pattern from +1 API task

task: sync paginated records from internal billing service
reward: composite +1 [correctness+1, process+1, efficiency+1, test+1, review+1]
skills-added: async-pagination-pattern
skills-pruned: none
world-model-update: world/codebase/patterns/async-pagination.md
model-used: mid-tier

Git hook: post-reward.sh

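#!/bin/bash
# Invoked by the agent runtime after every rewards.md write. "post-reward" is
# not a native Git hook name, so Git will not trigger this script by itself.
# Extracts the composite score and commits the learning event.
# Assumes the newest entry sits at the top of rewards.md.

REWARDS_FILE="/data/agents/agentos/rewards.md"
LAST_SCORE=$(head -n 20 "$REWARDS_FILE" | grep "composite_score:" | head -1 | awk '{print $2}')
# The task_id line starts with "- ", so the ID is the last field, not the second
TASK_ID=$(head -n 20 "$REWARDS_FILE" | grep "task_id:" | head -1 | awk '{print $NF}')

cd /data || exit 1
# Stage agent memory only; never add repos/ or build/ to agent memory commits
git add -- agents/ world/
# "Nothing to commit" is not an error for the caller
git commit -m "agentos(reward): task ${TASK_ID} scored ${LAST_SCORE}" 2>/dev/null || true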

.gitignore

# Ephemeral IPC
shared/outbox/
shared/locks/

# Build artifacts (high-churn, not agent memory)
build/

# Active repo working copies (managed by their own Git)
repos/

# Runtime noise
*.pyc
__pycache__/
.env

Useful Git queries

# Full learning timeline
git log --grep="^agentos" --oneline

# Reward trend (composite scores over time)
git log --grep="^agentos(reward)" --format="%s" | grep -oP 'scored \K[+-]?\d'   # sign is optional so a score of 0 is included

# When was a specific skill earned
git log -S "async-pagination-pattern" --oneline

# Skill churn: are we maturing or thrashing?
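# %B prints the full message so the body trailers are visible; "none" entries are not counted
git log --grep="skills-added" --format="%B" | grep "^skills-added:" | grep -vc "none$"
git log --grep="skills-pruned" --format="%B" | grep "^skills-pruned:" | grep -vc "none$"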

# Detect reward regression
git log --grep="scored" --format="%s" | head -10   # 10 most recent outcomes (git log lists newest first)
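
For trend analysis beyond one-liners, the reward subjects can be parsed in Python. This sketch assumes the subject format written by post-reward.sh (`agentos(reward): task <id> scored <score>`) and takes the lines as input rather than calling Git itself; the function names are hypothetical.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches the commit subject produced by post-reward.sh.
REWARD_SUBJECT = re.compile(r"^agentos\(reward\): task (\S+) scored ([+-]?\d)$")


def parse_reward_subjects(subjects: Iterable[str]) -> List[Tuple[str, int]]:
    """Turn `git log --format=%s` lines into (task_id, score) pairs."""
    parsed = []
    for line in subjects:
        match = REWARD_SUBJECT.match(line.strip())
        if match:
            parsed.append((match.group(1), int(match.group(2))))
    return parsed


def recent_mean(pairs: List[Tuple[str, int]], window: int = 10) -> Optional[float]:
    """Mean score of the first `window` pairs (git log lists newest first)."""
    recent = [score for _, score in pairs[:window]]
    if not recent:
        return None
    return sum(recent) / len(recent)
```

Comparing `recent_mean` over the latest window with the same figure over an older window gives a simple regression signal.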

12. Observability

Metrics emitted to METRICS_FILE as JSONL after every task:

Metric	Type	What It Tells You
`composite_reward`	`+1/0/-1`	Is the agent improving?
`test_pass_rate`	`float 0-1`	Core correctness signal
`review_gate_rejections`	`count`	How often reviewer blocks builder
`skill_churn_rate`	`earned / pruned`	Maturing vs thrashing
`model_tier_distribution`	`cheap/mid/premium %`	Cost efficiency
`tasks_per_skill`	`map[skill → count]`	Which skills are actually used
`human_review_triggers`	`count by trigger type`	How often human is pulled in
`executor_timeout_rate`	`%`	Are code runs getting too complex?
`world_model_promotion_rate`	`count`	Hypothesis → confirmed pattern
`tool_verified_rate`	`%`	Are reward scores honest?

Log format:

{"ts":"2025-01-15T14:32:00Z","task_id":"task-2025-001","event":"reward","composite":1,"tool_verified":true,"model":"mid","skills_applied":["test-driven-development"]}
{"ts":"2025-01-15T14:33:00Z","task_id":"task-2025-001","event":"commit","skills_added":null,"skills_pruned":null}
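
Aggregate metrics such as `tool_verified_rate` can be derived from these JSONL lines after the fact. The sketch below relies only on the fields shown in the log format above; the function name `summarize_metrics` and the shape of its return value are hypothetical.

```python
import json
from typing import Iterable


def summarize_metrics(jsonl_lines: Iterable[str]) -> dict:
    """Summarize reward events from METRICS_FILE-style JSONL lines."""
    reward_events = 0
    verified = 0
    composite_total = 0
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        event = json.loads(raw)
        if event.get("event") != "reward":
            continue  # commit events carry no score
        reward_events += 1
        composite_total += event.get("composite", 0)
        if event.get("tool_verified") is True:
            verified += 1
    if reward_events == 0:
        return {"reward_events": 0, "tool_verified_rate": None, "mean_composite": None}
    return {
        "reward_events": reward_events,
        "tool_verified_rate": verified / reward_events,
        "mean_composite": composite_total / reward_events,
    }
```

A `tool_verified_rate` below 1.0 would indicate that some scores entered the log without tool evidence, which the learning policy is meant to prevent.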

13. Example Tasks for AgentSWEOS

Spec + implement a rate-limited API client — triggers api-and-interface-design, incremental-implementation, test-driven-development
Fix a flaky async test in a Python service — triggers debugging-and-error-recovery, test-driven-development
Refactor a 500-line module that grew beyond the Rule of 500 — triggers code-simplification, code-review-and-quality
Add OWASP-compliant input validation to a form endpoint — triggers security-and-hardening → human review gate
Write an Architecture Decision Record for switching from REST to gRPC — triggers documentation-and-adrs, api-and-interface-design, premium model
Set up CI pipeline with lint + test + coverage thresholds — triggers ci-cd-and-automation, git-workflow-and-versioning
Deprecate a legacy auth endpoint with a migration path — triggers deprecation-and-migration, security-and-hardening
Profile and fix a slow database query causing P95 regression — triggers performance-optimization, debugging-and-error-recovery
Build a React component with WCAG 2.1 AA accessibility — triggers frontend-ui-engineering, test-driven-development
Pre-launch checklist for a new microservice — triggers shipping-and-launch, security-and-hardening, documentation-and-adrs

14. Risks & Failure Modes

Risk	Likelihood	Mitigation
Executor escape (sandbox breakout)	Low	Non-root, read-only FS, no network, hard timeout
Reward hacking (self-awarding +1 without tool evidence)	Medium	`tool_verified: true` required; Git hook validates before commit
Skill ossification (pre-seeded skills never pruned)	Medium	Reward evidence tracked per-skill; prune if 5+ composite ≤ 0
Hallucinated world model facts	Medium	`world/codebase/` writes require 2 corroborating tool outputs
Model routing drift (always escalating to premium)	Low-Medium	Routing logged in metrics; alert if the premium tier handles more than 20% of tasks
Context window overflow on large repos	Medium	`context-engineering` skill limits context to `context_tags` matches
Review gate fatigue (human ignoring gates)	Medium	Human review triggers must be narrow and specific; log ignored gates
Git history bloat from high-frequency tasks	Low	.gitignore excludes build/, outbox/; commits only on learning events
Spec skip (builder jumping straight to code)	Medium	`spec-driven-development` skill enforced by orchestrator before queuing to builder

15. v1 Scope — Ship in 2 Weeks

Goal: One working end-to-end task loop with honest rewards and observable Git history.

Week 1:
├─ Docker Compose up with orchestrator + executor (no reviewer yet)
├─ Seed 20 Agent Skills into skills.md on first boot
├─ Task queue: single FIFO file, polled every 30s
├─ Builder: spec → code → test (pytest only, no integration tests)
├─ Executor: sandboxed Python runs, structured JSON output
├─ Rewards: 3-dimension (correctness, process, test_quality)
│   all tool-verified via pytest exit code + lint
└─ Git: learning commits on +1 reward only

Week 2:
├─ Add reviewer worker (five-axis, code-review-and-quality skill)
├─ Human review gate (write to stdout + block until ack file exists)
├─ Full 5-dimension reward decomposition
├─ Metrics JSONL emission
└─ Basic `git log` queries working and documented
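
The Week 2 human review gate ("write to stdout + block until ack file exists") can be sketched with a polling loop. The ack file location, the function names, and the optional timeout are assumptions for illustration; the plan itself only specifies blocking on the file's existence.

```python
import time
from pathlib import Path
from typing import Optional


def wait_for_ack(ack_path: str,
                 poll_seconds: float = 1.0,
                 timeout_seconds: Optional[float] = None) -> bool:
    """Block until the ack file exists; return False if the timeout expires."""
    path = Path(ack_path)
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while True:
        if path.exists():
            return True
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


def request_human_review(task_id: str, reasons: list, ack_path: str,
                         timeout_seconds: Optional[float] = None) -> bool:
    """Announce the gate on stdout, then wait for a human to create the file."""
    print("HUMAN REVIEW REQUIRED for " + task_id + ": " + "; ".join(reasons))
    print("Create " + ack_path + " to approve.")
    return wait_for_ack(ack_path, timeout_seconds=timeout_seconds)
```

With `timeout_seconds=None` the gate blocks indefinitely, which matches the plan; a finite timeout is offered only so an ignored gate can be logged rather than hanging forever.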

Explicitly out of scope for v1:

Multi-language executor (Python only)
world/codebase/ pattern learning (too early; not enough task history)
Model routing (use mid-tier for everything in v1; routing in v2)
Horizontal scaling (single orchestrator)

16. v2 Expansion Plan

After 200+ real tasks with honest reward data:

Model routing activation — use task history to set routing thresholds empirically
world/codebase/ pattern learning — enough task history to validate promotion criteria
Multi-language executor — add Node.js runner with jest support
Horizontal worker scaling — multiple builder workers on shared queue (semaphore locks already in place)
Benchmark dashboard — visualize skill accumulation rate, pruning efficiency, reward trend over time
Security auditor persona — third worker using the security-auditor Agent Skills persona, activated on gate triggers
Integration test fixtures — local service stubs for API integration testing in executor
Spec quality scoring — measure downstream task success rate by spec author to improve spec-driven-development skill reward evidence

AgentSWEOS: disciplined engineering process as a runtime. Debuggable with git log. Improvable through honest failure.

MuhammadYossry/AgentSWEOS_blueprint.md
