Draft version (Revised Architecture)
"Make it reproducible, or it never happened."
Prerequisites:
- Part 0: Agent Bootstrapping Kit
- Part 1: Agent Filesystem
- Part 2: Inter‑Agent Communication
- Part 3: Git as Agent Memory
In Parts 0–3, AgentOS was introduced as a filesystem-native operating environment for agents that earn skills, prune memory, and preserve history through Git.
That foundation still stands.
But there is a deeper truth:
Design-time knowledge is only a prior. Runtime experience must become the posterior—and the posterior must win.
That changes everything.
The earlier model assumed the structure of knowledge was mostly fixed:
- skills are skills
- goals are goals
- rewards are outcomes
Now we move to a stronger model:
An agent should be able to learn not only how to act, but how to understand the world itself.
That means:
- discovering new concepts at runtime
- inventing new categories not anticipated by the designer
- updating beliefs when reality contradicts assumptions
- choosing cheaper or stronger models dynamically
- negotiating shared truths with other agents
This article updates the Dockerized AgentOS runtime to support that shift.
By the end, you'll have a containerized agent runtime where:
- the filesystem stores not just memory, but a living world model
- tasks continuously improve future context assembly
- runtime discoveries reshape future reasoning
- model choice adapts to task complexity
- multi-agent consensus creates shared knowledge
Docker was already useful for reproducibility.
Now it becomes essential because the runtime itself evolves.
| Concern | Why It Matters |
|---|---|
| Determinism | Runtime learning should happen in controlled environments |
| Portability | Same learning loop across laptop, server, cloud |
| Isolation | World models and experiments stay contained |
| Persistence | Learned files survive image upgrades |
| Benchmarking | Compare model policies across identical runtimes |
| Observability | Measure learning quality over time |
AgentOS treats state as files. Docker treats execution as infrastructure-as-code.
Together they create a reproducible learning organism.
Old view:
Filesystem = memory store
New view:
Filesystem = evolving model of reality
The container now hosts two learning layers:
The procedural layer: how to solve tasks.
- skills
- workflows
- tool usage
The conceptual layer: what the world appears to be.
- entities
- relations
- hypotheses
- confidence updates
- contradictions
That second layer is new.
my-agent-project/
├── Dockerfile
├── agents.yaml
├── agent.py
├── bootstrap.py
├── main.py
└── requirements.txt
Runtime data volume:
/data/
├── agents/
│   └── orion/
│       ├── persona.md
│       ├── constraints.md
│       ├── skills.md
│       ├── goals.md
│       ├── rewards.md
│       ├── reflections.md
│       ├── context.md        # ephemeral
│       ├── world/            # NEW
│       │   ├── index.md
│       │   ├── entities/
│       │   ├── relations/
│       │   └── hypotheses/
│       ├── iac/
│       └── .git/
│
└── shared/
    └── iac/
        ├── proposals/
        └── segments/
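As a minimal sketch of how a bootstrap step might create this layout: the article does not show `bootstrap.py`, so the helper name `init_agent_dirs` and its behavior are assumptions for illustration only.

```python
from pathlib import Path

# Hypothetical helper; not the actual bootstrap.py from this series.
WORLD_SUBDIRS = ("entities", "relations", "hypotheses")
SHARED_SUBDIRS = ("proposals", "segments")


def init_agent_dirs(data_dir: Path, agent_name: str) -> Path:
    """Create the per-agent and shared directories from the tree above."""
    agent_dir = data_dir / "agents" / agent_name
    for sub in WORLD_SUBDIRS:
        (agent_dir / "world" / sub).mkdir(parents=True, exist_ok=True)
    (agent_dir / "iac").mkdir(parents=True, exist_ok=True)
    for sub in SHARED_SUBDIRS:
        (data_dir / "shared" / "iac" / sub).mkdir(parents=True, exist_ok=True)
    # Only create index.md when it is missing, so learned content is kept.
    index = agent_dir / "world" / "index.md"
    if not index.exists():
        index.write_text("# World model index\n", encoding="utf-8")
    return agent_dir
```

Because every call uses `exist_ok=True` and never overwrites `index.md`, rerunning it after an image upgrade should leave learned files untouched.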
Earlier versions stored reusable behaviors.
Now agents also store discovered truths.
world/
├── entities/
│   └── rate-limiter.md
├── relations/
│   └── rate-limit-causes-backoff.md
└── hypotheses/
    └── cursor-pagination-required.md
Entities: things the agent believes exist.
Examples:
- API endpoint types
- customer cohorts
- failure modes
- legal clause categories
Relations: how things connect.
Examples:
- rate limiting causes retries
- missing indexes cause latency
- vague indemnity clauses increase risk
Hypotheses: patterns with weak evidence.
Example:
APIs using cursor pagination usually require auth scopes.
Hypotheses become confirmed entities/relations later.
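The article does not specify when a hypothesis is promoted. One possible rule, sketched below, counts supporting and contradicting observations and promotes once a smoothed confidence crosses a threshold. The `Hypothesis` class, the 0.8 threshold, and the minimum of 5 observations are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Hypothetical in-memory view of one file under world/hypotheses/."""
    claim: str
    supporting: int = 0
    contradicting: int = 0

    def observe(self, supports: bool) -> None:
        """Record one task outcome as evidence for or against the claim."""
        if supports:
            self.supporting += 1
        else:
            self.contradicting += 1

    @property
    def confidence(self) -> float:
        # Posterior mean of a Beta(1, 1) prior updated with the counts
        # (equivalent to Laplace smoothing).
        total = self.supporting + self.contradicting
        return (self.supporting + 1) / (total + 2)


def should_promote(h: Hypothesis, threshold: float = 0.8,
                   min_observations: int = 5) -> bool:
    """Assumed rule: enough evidence and high enough confidence."""
    total = h.supporting + h.contradicting
    return total >= min_observations and h.confidence >= threshold
```

With no evidence the confidence is 0.5, so a designer hint starts as a neutral prior and only runtime observations move it.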
A single +1 / 0 / -1 is too compressed.
Rewards now record several dimensions per task.
- task: paginated sync
  outcome: success
  reward: +1
  reward_decomposition:
    correctness: +1
    efficiency: 0
    novelty: +1
    generality: +1
  context_tags:
    - api
    - pagination
    - retry
    - auth
  reasoning_demand: moderate
  model_used: mid-tier
  model_appropriate: yes
  world_model_update: world/entities/rate-limiter.md
The agent no longer learns only “worked / failed.”
It learns:
- what kinds of work it handled well
- when cost was excessive
- when new abstractions emerged
- whether the chosen model tier was appropriate
That enables compute policy learning.
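To illustrate how decomposed rewards could feed compute policy learning, the sketch below groups reward records by `reasoning_demand` and `model_used` and reports how often the chosen tier was judged appropriate. It assumes the records are already loaded as plain dicts with the fields shown above; the function name is hypothetical.

```python
from collections import defaultdict


def model_fit_by_demand(records: list[dict]) -> dict[tuple[str, str], float]:
    """Share of tasks where the tier was appropriate,
    keyed by (reasoning_demand, model_used)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [appropriate, total]
    for r in records:
        key = (r["reasoning_demand"], r["model_used"])
        counts[key][1] += 1
        # Accept both forms: PyYAML loads an unquoted `yes` as boolean True.
        if r["model_appropriate"] in (True, "yes"):
            counts[key][0] += 1
    return {key: ok / total for key, (ok, total) in counts.items()}
```

A low share for `("novel", "cheap-fast")` would suggest underpowered calls; a high share for routine tasks on the cheap tier would suggest the cheap route is safe there.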
Loading every file for every task does not scale.
Instead, AgentOS now builds a task-scoped context.
Task: Sync paginated vendor API
Included:
- persona.md
- constraints.md
- skills/retry-with-backoff.md
- skills/request-signing.md
- world/entities/rate-limiter.md
- world/hypotheses/cursor-pagination-required.md
- recent rewards (5)
Excluded:
- writing-style skills
- stale goals
- unrelated world entities
context.md is rebuilt for each task and is not committed.
The context_tags learned during past tasks become retrieval keys for future tasks.
Every task improves future prompts automatically.
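Tag-based retrieval of this kind can be sketched as a ranking by tag overlap. The sketch assumes each candidate file already has a set of tags available (the article does not say where they are stored), and that `persona.md` and `constraints.md` are always included, as in the example above. The function name and the `limit` parameter are hypothetical.

```python
ALWAYS_INCLUDE = ("persona.md", "constraints.md")


def select_context(task_tags: set[str], file_tags: dict[str, set[str]],
                   limit: int = 5) -> list[str]:
    """Pick the files whose tags overlap most with the task's tags."""
    scored = []
    for path, tags in file_tags.items():
        overlap = len(task_tags & tags)
        if overlap > 0:  # files with no shared tag are excluded
            scored.append((overlap, path))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return list(ALWAYS_INCLUDE) + [path for _, path in scored[:limit]]
```

Deterministic tie-breaking matters here: two runs in identical containers should assemble identical context.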
Many systems hardcode a single expensive model.
AgentOS instead routes by task difficulty.
| reasoning_demand | route |
|------------------|------|
| routine | cheap-fast |
| moderate | balanced |
| novel | strongest |
| consensus | strongest multi-agent |
Routine: a known skill exists and the entities are known. Use the cheap model.
Moderate: some ambiguity. Use the mid-tier model.
Novel: no skill match, or new territory. Use the strongest model.
Consensus: requires agreement across agents. Use the strongest reasoning path.
If cheap models fail novel tasks, reward logs expose it.
If expensive models are overused on routine tasks, waste becomes visible.
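The routing table can be sketched as a lookup plus a small classifier. The route names come from the table; how "some ambiguity" is detected is not defined in the article, so the classifier below treats "skill matches but entities are unfamiliar" as moderate, which is an assumption. Both function names are hypothetical.

```python
ROUTES = {
    "routine": "cheap-fast",
    "moderate": "balanced",
    "novel": "strongest",
    "consensus": "strongest multi-agent",
}


def classify_demand(has_skill_match: bool, entities_known: bool,
                    needs_consensus: bool = False) -> str:
    """Map simple task signals to a reasoning_demand label (assumed rules)."""
    if needs_consensus:
        return "consensus"
    if not has_skill_match:
        return "novel"
    if entities_known:
        return "routine"
    return "moderate"


def choose_route(reasoning_demand: str) -> str:
    # Unknown labels fall back to the strongest route; an assumed safe default.
    return ROUTES.get(reasoning_demand, "strongest")
```

For example, `choose_route(classify_demand(True, True))` selects `cheap-fast`, while a task with no matching skill is sent to `strongest`.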
agents:
  - name: orion
    persona: "Backend systems engineer. Precise, terse, evidence-first."
    domain: "Distributed systems"
    goals:
      - "Understand vendor API quirks"
      - "Learn robust sync strategies"
    hints:
      - "Tasks often involve retries, auth, pagination"
Hints are priors only.
Reality updates the posterior.
FROM python:3.12-slim
RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*
RUN useradd -ms /bin/bash agentos
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
RUN mkdir -p /data && chown -R agentos:agentos /data /app
USER agentos
ENV DATA_DIR=/data
ENV POLL_INTERVAL=5
ENTRYPOINT ["python", "-u", "main.py"]
docker build -t agentos:latest .
mkdir -p my-agent-data
cp agents.yaml my-agent-data/
docker run -d \
  --name agentos \
  -e LLM_API_KEY="..." \
  -e LLM_BASE_URL="https://api.openai.com/v1" \
  -e LLM_MODEL="gpt-4o" \
  -v $(pwd)/my-agent-data:/data \
  agentos:latest
cat >> my-agent-data/agents/orion/queue.md << 'EOF'
## sync-vendor-api
task: Build a Python sync process for a paginated API with rate limits.
priority: high
created: 2026-04-30T10:00:00Z
EOF
Receive task
→ Assemble context
→ Choose model tier
→ Solve task
→ Score reward dimensions
→ Update skills if procedural
→ Update world/ if conceptual
→ Commit to Git
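The loop above can be sketched as a single function with injected steps. The article does not show `agent.py`, so the step names, the `procedural` and `conceptual` flags on the result, and the return shape are all assumptions chosen to mirror the arrows one to one.

```python
from typing import Any, Callable


def run_task(task: str, steps: dict[str, Callable[..., Any]]) -> dict[str, Any]:
    """Run one pass of the loop; every step is a caller-supplied callable."""
    context = steps["assemble_context"](task)
    tier = steps["choose_tier"](task, context)
    result = steps["solve"](task, context, tier)
    reward = steps["score"](task, result)
    # A single task may yield both procedural and conceptual learnings.
    if result.get("procedural"):
        steps["update_skills"](result)
    if result.get("conceptual"):
        steps["update_world"](result)
    steps["commit"](f"task: {task}")
    return {"tier": tier, "reward": reward}
```

Injecting the steps keeps the loop testable without a model or a Git repository: each step can be replaced by a stub.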
After repeated tasks:
world/entities/
├── token-bucket-rate-limiter.md
└── fixed-window-rate-limiter.md
world/relations/
├── token-bucket-responds-to-jitter.md
└── fixed-window-prefers-boundary-wait.md
No operator wrote those files.
The agent discovered them.
Single agents may hold private beliefs.
Shared truths require agreement.
shared/iac/proposals/prop-001.md
Subject: API rate limiter taxonomy
Proposer: orion
Claim:
Rate limiters commonly fall into:
1. token bucket
2. fixed window
Evidence:
3 successful sync tasks
Votes:
- orion: yes
- iris: yes
When quorum is reached, shared/iac/segments/api-knowledge.md is updated.
All agents benefit.
This is collaborative epistemology.
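The article does not define the quorum rule. As a sketch, the code below parses the `Votes:` section of a proposal file like the one above and treats quorum as a strict majority of eligible agents voting yes. The majority rule, the `eligible` count, and both function names are assumptions.

```python
def parse_votes(proposal_text: str) -> dict[str, str]:
    """Extract '- agent: vote' lines that follow the 'Votes:' header."""
    votes: dict[str, str] = {}
    in_votes = False
    for line in proposal_text.splitlines():
        stripped = line.strip()
        if stripped == "Votes:":
            in_votes = True
        elif in_votes and stripped.startswith("- ") and ":" in stripped:
            agent, _, vote = stripped[2:].partition(":")
            votes[agent.strip()] = vote.strip().lower()
        elif in_votes and stripped:
            in_votes = False  # any other non-empty line ends the section
    return votes


def quorum_reached(votes: dict[str, str], eligible: int,
                   threshold: float = 0.5) -> bool:
    """Assumed rule: 'yes' votes must exceed `threshold` of eligible agents."""
    if eligible <= 0:
        return False
    yes = sum(1 for v in votes.values() if v == "yes")
    return yes / eligible > threshold
```

Counting against eligible agents rather than cast votes means two agents cannot ratify a shared truth in a five-agent deployment.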
cat skills.md
find world -type f
cat rewards.md
git log --oneline
ls shared/iac/proposals
ls shared/iac/segments
Earlier AgentOS optimized memory.
This version optimizes understanding.
Earlier versions stored reusable tactics.
This version also stores discovered structure.
Earlier versions selected one model.
This version learns compute allocation.
Earlier versions shared files.
This version negotiates truth.
The real bottleneck in agent systems is not raw model power.
It is:
- poor context selection
- forgotten lessons
- inability to revise abstractions
- wasting expensive inference
- weak collaboration semantics
This architecture addresses all five.
Before continuing the series, benchmark:
- task success over time
- context efficiency
- reward trend
- skill reuse rate
- hypothesis promotion rate
- contradiction correction speed
- duplicate concept collapse rate
- cost per successful task
- expensive-call reduction
- underpowered-call failure rate
- proposal acceptance rate
- time-to-consensus
- shared segment usefulness
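Two of these metrics can be computed directly from task records. The sketch below assumes each record carries an `outcome` field as in the reward example and a `cost_usd` field, which is hypothetical and not part of the reward schema shown earlier.

```python
from typing import Optional


def cost_per_successful_task(records: list[dict]) -> Optional[float]:
    """Total spend divided by the number of successful tasks.

    Failed tasks still count toward cost, so wasted calls raise the figure.
    Returns None when there are no successes.
    """
    successes = sum(1 for r in records if r["outcome"] == "success")
    if successes == 0:
        return None
    total_cost = sum(r["cost_usd"] for r in records)  # cost_usd: assumed field
    return total_cost / successes


def hypothesis_promotion_rate(promoted: int, proposed: int) -> float:
    """Fraction of recorded hypotheses later promoted to entities or relations."""
    return promoted / proposed if proposed else 0.0
```

Running the same computation against identical containers with different routing policies gives a like-for-like comparison.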
A containerized agent should not just persist memory.
It should evolve a model of reality.
The filesystem is no longer only storage.
It becomes:
- a belief graph
- a curriculum
- a negotiation layer
- a compute policy memory
- a historical record of understanding
That is where compounding begins.
Docker gives the runtime a stable body.
Files give it memory.
Git gives it time.
Experience gives it truth.
A Generated Example:
Prompt Template — Generate a Custom AgentOS Dockerized Runtime for My Specific Agent Goal
You are a senior AI systems architect, staff software engineer, and production DevOps designer.
Your task is to generate a fully customized AgentOS containerized runtime project based on the user's intended agent purpose.
Do not generate a generic boilerplate container.
Instead, design a purpose-built runtime optimized for the exact domain, workflows, risks, tools, observability needs, and learning boundaries of the requested agent.
USER INPUT VARIABLES
Use the following placeholders exactly as provided by the user:
<AGENT_NAME>
<PRIMARY_PURPOSE>
<DOMAIN>
<TARGET_USERS>
<TASK_EXAMPLES>
<RISK_LEVEL> (low / medium / high / critical)
<AUTONOMY_LEVEL> (assistant / semi-autonomous / autonomous)
<TOOLS_ALLOWED>
<TOOLS_FORBIDDEN>
<DATA_SOURCES>
<SUCCESS_METRICS>
<DEPLOYMENT_ENV> (local laptop / server / cloud / enterprise / airgapped)
<BUDGET_PRIORITY> (cheap / balanced / premium)
<PRIVACY_REQUIREMENTS>
<HUMAN_REVIEW_POINTS>
<MULTI_AGENT_REQUIRED> (yes / no)
<LONG_TERM_MEMORY> (yes / no)
<LEARNING_ALLOWED> (none / constrained / active)
<OUTPUT_STYLE>
IMPORTANT 2026 REALITY CONSTRAINTS
You MUST design for current LLM limitations:
World Modeling Limits
LLMs are weak at:
Therefore:
Epistemic Reliability Risks
LLMs may:
Therefore:
Operational Guidance
Prefer systems that are:
YOUR OUTPUT MUST INCLUDE
Generate a complete custom project design containing:
1. Executive Summary
Explain what <AGENT_NAME> does and why this architecture fits.
2. Recommended Runtime Shape
Choose one:
Explain why.
3. Purpose-Built Filesystem Layout
Generate only files relevant to this use case.
Examples:
Avoid generic unnecessary folders.
4. Custom Agent Memory Model
Choose which should exist:
And justify each.
5. Verification Architecture
Design domain-appropriate checks such as:
6. Learning Policy
Based on <LEARNING_ALLOWED> choose:
Define strict boundaries.
7. Model Routing Policy
Based on <BUDGET_PRIORITY> define:
8. Security Policy
Use <RISK_LEVEL> and <PRIVACY_REQUIREMENTS>.
Include:
9. Docker Project Files
Generate:
All customized for this purpose.
10. Observability
Metrics and logs specific to the mission.
Examples:
11. Example Tasks
Give 10 realistic tasks for <AGENT_NAME>.
12. Risks & Failure Modes
Specific to the domain.
13. Recommended v1 Scope
What to ship first in 2 weeks.
14. v2 Expansion Plan
What to add after real usage data.
STYLE RULES
- If <PRIMARY_PURPOSE> is broad, narrow it into a viable first version.
- If <PRIMARY_PURPOSE> is too vague, infer the best practical scope.
OUTPUT FORMAT
Use clean markdown with sections, code blocks, tables, and operational details.
USER VALUES
AGENT_NAME: <AGENT_NAME>
PRIMARY_PURPOSE: <PRIMARY_PURPOSE>
DOMAIN: <DOMAIN>
TARGET_USERS: <TARGET_USERS>
TASK_EXAMPLES: <TASK_EXAMPLES>
RISK_LEVEL: <RISK_LEVEL>
AUTONOMY_LEVEL: <AUTONOMY_LEVEL>
TOOLS_ALLOWED: <TOOLS_ALLOWED>
TOOLS_FORBIDDEN: <TOOLS_FORBIDDEN>
DATA_SOURCES: <DATA_SOURCES>
SUCCESS_METRICS: <SUCCESS_METRICS>
DEPLOYMENT_ENV: <DEPLOYMENT_ENV>
BUDGET_PRIORITY: <BUDGET_PRIORITY>
PRIVACY_REQUIREMENTS: <PRIVACY_REQUIREMENTS>
HUMAN_REVIEW_POINTS: <HUMAN_REVIEW_POINTS>
MULTI_AGENT_REQUIRED: <MULTI_AGENT_REQUIRED>
LONG_TERM_MEMORY: <LONG_TERM_MEMORY>
LEARNING_ALLOWED: <LEARNING_ALLOWED>
OUTPUT_STYLE: <OUTPUT_STYLE>
Now generate the best custom AgentOS runtime design.