@MuhammadYossry
Last active May 1, 2026 13:55
Part 4: Containerizing AgentOS — Build, Customize, and Deploy a Self-Learning World-Modeling Agent

Draft version (Revised Architecture)

"Make it reproducible, or it never happened."

Prerequisites:


Introduction

In Parts 0–3, AgentOS was introduced as a filesystem-native operating environment for agents that earn skills, prune memory, and preserve history through Git.

That foundation still stands.

But there is a deeper truth:

Design-time knowledge is only a prior. Runtime experience must become the posterior—and the posterior must win.

That changes everything.

The earlier model assumed the structure of knowledge was mostly fixed:

  • skills are skills
  • goals are goals
  • rewards are outcomes

Now we move to a stronger model:

An agent should be able to learn not only how to act, but how to understand the world itself.

That means:

  • discovering new concepts at runtime
  • inventing new categories not anticipated by the designer
  • updating beliefs when reality contradicts assumptions
  • choosing cheaper or stronger models dynamically
  • negotiating shared truths with other agents

This article updates the Dockerized AgentOS runtime to support that shift.

By the end, you'll have a containerized agent runtime where:

  • the filesystem stores not just memory, but a living world model
  • tasks continuously improve future context assembly
  • runtime discoveries reshape future reasoning
  • model choice adapts to task complexity
  • multi-agent consensus creates shared knowledge

Why Docker Matters Even More Now

Docker was already useful for reproducibility.

Now it becomes essential because the runtime itself evolves.

| Concern | Why It Matters |
|---------|----------------|
| Determinism | Runtime learning should happen in controlled environments |
| Portability | Same learning loop across laptop, server, cloud |
| Isolation | World models and experiments stay contained |
| Persistence | Learned files survive image upgrades |
| Benchmarking | Compare model policies across identical runtimes |
| Observability | Measure learning quality over time |

AgentOS treats state as files. Docker treats execution as infrastructure-as-code.

Together they create a reproducible learning organism.


The Core Architectural Shift

Old view:

Filesystem = memory store

New view:

Filesystem = evolving model of reality

The container now hosts two learning layers:

Procedural Layer

How to solve tasks.

  • skills
  • workflows
  • tool usage

Epistemic Layer

What the world appears to be.

  • entities
  • relations
  • hypotheses
  • confidence updates
  • contradictions

That second layer is new.


Project Layout

my-agent-project/
├── Dockerfile
├── agents.yaml
├── agent.py
├── bootstrap.py
├── main.py
└── requirements.txt

Runtime data volume:

/data/
├── agents/
│   └── orion/
│       ├── persona.md
│       ├── constraints.md
│       ├── skills.md
│       ├── goals.md
│       ├── rewards.md
│       ├── reflections.md
│       ├── context.md          # ephemeral
│       ├── world/             # NEW
│       │   ├── index.md
│       │   ├── entities/
│       │   ├── relations/
│       │   └── hypotheses/
│       ├── iac/
│       └── .git/
│
└── shared/
    └── iac/
        ├── proposals/
        └── segments/
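A minimal sketch of how a bootstrap step might create this layout. The file and directory names come from the tree above; the function name `bootstrap_agent` and the `.gitignore` entry for `context.md` are assumptions for illustration, not the project's actual `bootstrap.py`.

```python
from pathlib import Path

# Names taken from the layout above; the helper itself is hypothetical.
AGENT_FILES = ["persona.md", "constraints.md", "skills.md", "goals.md",
               "rewards.md", "reflections.md"]
WORLD_DIRS = ["entities", "relations", "hypotheses"]


def bootstrap_agent(data_dir: str, name: str) -> Path:
    """Create the per-agent tree and the shared consensus tree if missing."""
    root = Path(data_dir)
    agent = root / "agents" / name
    for sub in WORLD_DIRS:
        (agent / "world" / sub).mkdir(parents=True, exist_ok=True)
    (agent / "iac").mkdir(exist_ok=True)
    for filename in AGENT_FILES + ["world/index.md"]:
        (agent / filename).touch(exist_ok=True)
    # Assumption: context.md is kept out of history via .gitignore,
    # since it is rebuilt for every task.
    gitignore = agent / ".gitignore"
    if not gitignore.exists():
        gitignore.write_text("context.md\n")
    for sub in ("proposals", "segments"):
        (root / "shared" / "iac" / sub).mkdir(parents=True, exist_ok=True)
    return agent
```

The function is idempotent, so it can run on every container start without touching files the agent has already learned.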

The New world/ Directory

Earlier versions stored reusable behaviors.

Now agents also store discovered truths.

Example

world/
├── entities/
│   └── rate-limiter.md
├── relations/
│   └── rate-limit-causes-backoff.md
└── hypotheses/
    └── cursor-pagination-required.md

Meaning

Entities

Things the agent believes exist.

Examples:

  • API endpoint types
  • customer cohorts
  • failure modes
  • legal clause categories

Relations

How things connect.

Examples:

  • rate limiting causes retries
  • missing indexes cause latency
  • vague indemnity clauses increase risk

Hypotheses

Patterns with weak evidence.

Example:

APIs using cursor pagination usually require auth scopes.

Once enough evidence accumulates, a hypothesis is promoted to a confirmed entity or relation.
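One way to make promotion concrete is an evidence threshold. The sketch below is an assumption, not the AgentOS rule: the `Hypothesis` fields, the thresholds (`min_evidence=3`, `min_confidence=0.8`), and the three status names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    claim: str
    supporting: int = 0      # tasks whose outcome agreed with the claim
    contradicting: int = 0   # tasks whose outcome disagreed with it


def next_status(h: Hypothesis, min_evidence: int = 3,
                min_confidence: float = 0.8) -> str:
    """Return 'promote', 'retire', or 'keep' under the assumed thresholds."""
    total = h.supporting + h.contradicting
    if total < min_evidence:
        return "keep"        # too little evidence either way
    if h.supporting / total >= min_confidence:
        return "promote"     # move from hypotheses/ to entities/ or relations/
    if h.contradicting / total >= min_confidence:
        return "retire"      # reality contradicted the belief
    return "keep"
```

For example, a hypothesis backed by four of five observations would be promoted, while one observed only once stays in `hypotheses/`.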


Reward Signals Must Explain Why

A single +1 / 0 / -1 is too compressed.

Each reward entry now decomposes the outcome into dimensions such as correctness, efficiency, novelty, and generality.

Example rewards.md

- task: paginated sync
  outcome: success
  reward: +1

  reward_decomposition:
    correctness: +1
    efficiency: 0
    novelty: +1
    generality: +1

  context_tags:
    - api
    - pagination
    - retry
    - auth

  reasoning_demand: moderate
  model_used: mid-tier
  model_appropriate: yes

  world_model_update: world/entities/rate-limiter.md

Why This Matters

The agent no longer learns only “worked / failed.”

It learns:

  • what kinds of work it handled well
  • when cost was excessive
  • when new abstractions emerged
  • whether the chosen model tier was appropriate

That enables compute policy learning.
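As a rough illustration of compute policy learning, reward entries can be aggregated per model tier. The sketch assumes the entries have already been loaded into dictionaries with the `model_used`, `model_appropriate`, and `reward` fields shown above; the function name `routing_report` is hypothetical.

```python
from collections import defaultdict


def routing_report(entries: list[dict]) -> dict[str, dict[str, float]]:
    """Summarise, per model tier, how often the tier was judged appropriate."""
    totals = defaultdict(lambda: {"tasks": 0, "appropriate": 0, "reward": 0})
    for entry in entries:
        row = totals[entry["model_used"]]
        row["tasks"] += 1
        row["appropriate"] += 1 if entry["model_appropriate"] else 0
        row["reward"] += entry["reward"]
    return {
        tier: {
            "tasks": row["tasks"],
            "appropriate_rate": row["appropriate"] / row["tasks"],
            "mean_reward": row["reward"] / row["tasks"],
        }
        for tier, row in totals.items()
    }
```

A tier with a low `appropriate_rate` is a candidate for rerouting: either the tasks sent to it were too hard, or it was too expensive for what they needed.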


Context Engineering Becomes First-Class

Loading every file for every task does not scale.

Instead, AgentOS now builds a task-scoped context.

Example context.md

Task: Sync paginated vendor API

Included:
- persona.md
- constraints.md
- skills/retry-with-backoff.md
- skills/request-signing.md
- world/entities/rate-limiter.md
- world/hypotheses/cursor-pagination-required.md
- recent rewards (5)

Excluded:
- writing-style skills
- stale goals
- unrelated world entities

This file is rebuilt for each task and is not committed to Git.

Why This Compounds

The context_tags learned during past tasks become retrieval keys for future tasks.

Every task improves future prompts automatically.
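A minimal sketch of tag-based retrieval, assuming each candidate file has a set of tags learned from earlier `context_tags`. The `catalog` mapping, the `limit` parameter, and the function name are hypothetical; the real assembly logic may weigh recency or reward as well.

```python
def select_context(task_tags: set[str], catalog: dict[str, set[str]],
                   always: tuple[str, ...] = ("persona.md", "constraints.md"),
                   limit: int = 4) -> list[str]:
    """Pick always-on files plus the files sharing the most tags with the task."""
    scored = [
        (len(task_tags & tags), path)
        for path, tags in catalog.items()
        if task_tags & tags          # files with no overlap are excluded
    ]
    # Highest overlap first; ties broken by path for a stable result.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return list(always) + [path for _, path in scored[:limit]]
```

With the tags from the rewards example (`api`, `pagination`, `retry`, `auth`), a writing-style skill tagged only `prose` would never enter the context.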


Runtime-Adaptive Model Selection

Most systems hardcode one expensive model.

AgentOS instead routes by task difficulty.

In constraints.md

| reasoning_demand | route |
|------------------|------|
| routine          | cheap-fast |
| moderate         | balanced |
| novel            | strongest |
| consensus        | strongest multi-agent |

Decision Logic

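Routine

A known skill matches and the entities involved are already known.

Use the cheap-fast route.

Moderate

The task involves some ambiguity.

Use the balanced (mid-tier) route.

Novel

No skill matches, or the task covers new territory.

Use the strongest model.

Consensus

The result requires agreement across agents.

Use the strongest multi-agent route.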

Feedback Loop

If cheap models fail novel tasks, reward logs expose it.

If expensive models are overused on routine tasks, waste becomes visible.
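The routing table and decision logic above could be sketched as follows. The route names come from the `constraints.md` table; the classification inputs (`skill_match`, `unknown_entities`, `needs_consensus`) and the rule that unknown entities signal "moderate" are assumptions made for illustration.

```python
# Route names mirror the constraints.md table.
ROUTES = {
    "routine": "cheap-fast",
    "moderate": "balanced",
    "novel": "strongest",
    "consensus": "strongest multi-agent",
}


def classify_demand(skill_match: bool, unknown_entities: int,
                    needs_consensus: bool = False) -> str:
    """Map simple task signals onto a reasoning_demand label (assumed rules)."""
    if needs_consensus:
        return "consensus"
    if not skill_match:
        return "novel"        # no skill match means new territory
    if unknown_entities > 0:
        return "moderate"     # known skill, but some ambiguity remains
    return "routine"


def choose_route(skill_match: bool, unknown_entities: int,
                 needs_consensus: bool = False) -> str:
    return ROUTES[classify_demand(skill_match, unknown_entities, needs_consensus)]
```

Keeping classification separate from the table means the table can be edited in `constraints.md` without touching the logic.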


Step 1 — Define Agents in agents.yaml

agents:
  - name: orion
    persona: "Backend systems engineer. Precise, terse, evidence-first."
    domain: "Distributed systems"

    goals:
      - "Understand vendor API quirks"
      - "Learn robust sync strategies"

    hints:
      - "Tasks often involve retries, auth, pagination"

Hints are priors only.

Reality updates the posterior.


Step 2 — Dockerfile

FROM python:3.12-slim

RUN apt-get update && apt-get install -y --no-install-recommends git && rm -rf /var/lib/apt/lists/*

RUN useradd -ms /bin/bash agentos

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

RUN mkdir -p /data && chown -R agentos:agentos /data /app

USER agentos

ENV DATA_DIR=/data
ENV POLL_INTERVAL=5

ENTRYPOINT ["python", "-u", "main.py"]

Step 3 — Build Locally

docker build -t agentos:latest .

Step 4 — Prepare Persistent Data

mkdir -p my-agent-data
cp agents.yaml my-agent-data/

Step 5 — Run Container

docker run -d \
  --name agentos \
  -e LLM_API_KEY="..." \
  -e LLM_BASE_URL="https://api.openai.com/v1" \
  -e LLM_MODEL="gpt-4o" \
  -v "$(pwd)/my-agent-data:/data" \
  agentos:latest

Step 6 — Feed Real Tasks

cat >> my-agent-data/agents/orion/queue.md << 'EOF'

## sync-vendor-api
task: Build a Python sync process for a paginated API with rate limits.
priority: high
created: 2026-04-30T10:00:00Z
EOF
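The queue entry format above is simple enough to parse with the standard library. This is a sketch under the assumption that every task starts with a `## <id>` heading followed by `key: value` lines; the function name `parse_queue` is hypothetical and the real runtime may parse the file differently.

```python
def parse_queue(text: str) -> list[dict[str, str]]:
    """Split queue.md into one dict per '## <id>' section."""
    tasks: list[dict[str, str]] = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("## "):
            tasks.append({"id": line[3:].strip()})
        elif tasks and ":" in line:
            # Split on the first colon only, so timestamps stay intact.
            key, value = line.split(":", 1)
            tasks[-1][key.strip()] = value.strip()
    return tasks
```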

What Happens Internally Now

Receive task
→ Assemble context
→ Choose model tier
→ Solve task
→ Score reward dimensions
→ Update skills if procedural
→ Update world/ if conceptual
→ Commit to Git
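The flow above can be sketched as a single function with each stage injected as a callable. Everything here is illustrative: the stage signatures and the `new_skill` / `new_concept` result keys are assumptions, and the Git commit is represented only by the list of paths that would be staged.

```python
from typing import Any, Callable


def run_task(
    task: dict[str, Any],
    assemble: Callable[[dict], list[str]],
    route: Callable[[dict], str],
    solve: Callable[[dict, list[str], str], dict],
    score: Callable[[dict], dict],
) -> dict[str, Any]:
    """Run one task through the stages and report which stores need updating."""
    context = assemble(task)                 # Assemble context
    model = route(task)                      # Choose model tier
    result = solve(task, context, model)     # Solve task
    reward = score(result)                   # Score reward dimensions
    updates = []
    if result.get("new_skill"):
        updates.append("skills.md")          # procedural learning
    if result.get("new_concept"):
        updates.append("world/")             # conceptual learning
    updates.append("rewards.md")             # every task logs a reward
    return {"model": model, "reward": reward, "commit": updates}
```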

Example Runtime Learning

After repeated tasks:

world/entities/
├── token-bucket-rate-limiter.md
├── fixed-window-rate-limiter.md

world/relations/
├── token-bucket-responds-to-jitter.md
├── fixed-window-prefers-boundary-wait.md

No operator wrote those files.

The agent discovered them.


Multi-Agent Consensus

Single agents may hold private beliefs.

Shared truths require agreement.

Proposal Example

shared/iac/proposals/prop-001.md
Subject: API rate limiter taxonomy

Proposer: orion

Claim:
Rate limiters commonly fall into:
1. token bucket
2. fixed window

Evidence:
3 successful sync tasks

Votes:
- orion: yes
- iris: yes

When quorum is reached:

shared/iac/segments/api-knowledge.md

is updated.

All agents benefit.

This is collaborative epistemology.
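The article does not define quorum, so the following is only one possible reading: a proposal passes when strictly more than half of all known agents vote yes. The function name and the `threshold` parameter are hypothetical.

```python
def quorum_reached(votes: dict[str, str], agents: list[str],
                   threshold: float = 0.5) -> bool:
    """True when strictly more than `threshold` of all known agents voted yes."""
    if not agents:
        return False
    # Agents that have not voted count against the proposal.
    yes = sum(1 for agent in agents if votes.get(agent) == "yes")
    return yes / len(agents) > threshold
```

Counting against the full agent list rather than only the votes cast prevents a single early voter from pushing a claim into the shared segment.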


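Observing Growth

Run the per-agent commands from my-agent-data/agents/orion/ and the shared commands from my-agent-data/.

Skills

cat skills.md

World Model

find world -type f

Reward Quality

cat rewards.md

Learning Timeline

git log --oneline

Shared Consensus (from my-agent-data/)

ls shared/iac/proposals
ls shared/iac/segments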

What Changed From Earlier Versions

Earlier AgentOS optimized memory.

This version optimizes understanding.

Earlier versions stored reusable tactics.

This version also stores discovered structure.

Earlier versions selected one model.

This version learns compute allocation.

Earlier versions shared files.

This version negotiates truth.


Why This Matters

The real bottleneck in agent systems is not raw model power.

It is:

  • poor context selection
  • forgotten lessons
  • inability to revise abstractions
  • wasting expensive inference
  • weak collaboration semantics

This architecture addresses all five.


Benchmarking Directions

Before continuing the series, benchmark:

Single-Agent

  • task success over time
  • context efficiency
  • reward trend
  • skill reuse rate

World Model Quality

  • hypothesis promotion rate
  • contradiction correction speed
  • duplicate concept collapse rate

Model Routing

  • cost per successful task
  • expensive-call reduction
  • underpowered-call failure rate
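Two of the routing metrics above can be computed from a simple run log. The record shape (`tier`, `cost`, `success`) is an assumption, as is treating the `cheap-fast` tier as the "underpowered" candidate.

```python
def routing_metrics(runs: list[dict]) -> dict[str, float]:
    """Compute cost per successful task and the cheap-tier failure rate."""
    successes = [r for r in runs if r["success"]]
    cheap = [r for r in runs if r["tier"] == "cheap-fast"]
    cheap_failures = [r for r in cheap if not r["success"]]
    total_cost = sum(r["cost"] for r in runs)
    return {
        # Failed runs still cost money, so they stay in the numerator.
        "cost_per_successful_task": (
            total_cost / len(successes) if successes else float("inf")
        ),
        "underpowered_failure_rate": (
            len(cheap_failures) / len(cheap) if cheap else 0.0
        ),
    }
```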

Multi-Agent

  • proposal acceptance rate
  • time-to-consensus
  • shared segment usefulness

Closing Thoughts

A containerized agent should not just persist memory.

It should evolve a model of reality.

The filesystem is no longer only storage.

It becomes:

  • a belief graph
  • a curriculum
  • a negotiation layer
  • a compute policy memory
  • a historical record of understanding

That is where compounding begins.

Docker gives the runtime a stable body.

Files give it memory.

Git gives it time.

Experience gives it truth.


A Generated Example:

Prompt Template — Generate a Custom AgentOS Dockerized Runtime for My Specific Agent Goal

You are a senior AI systems architect, staff software engineer, and production DevOps designer.

Your task is to generate a fully customized AgentOS containerized runtime project based on the user's intended agent purpose.

Do not generate a generic boilerplate container.

Instead, design a purpose-built runtime optimized for the exact domain, workflows, risks, tools, observability needs, and learning boundaries of the requested agent.


USER INPUT VARIABLES

Use the following placeholders exactly as provided by the user:

  • <AGENT_NAME>
  • <PRIMARY_PURPOSE>
  • <DOMAIN>
  • <TARGET_USERS>
  • <TASK_EXAMPLES>
  • <RISK_LEVEL> (low / medium / high / critical)
  • <AUTONOMY_LEVEL> (assistant / semi-autonomous / autonomous)
  • <TOOLS_ALLOWED>
  • <TOOLS_FORBIDDEN>
  • <DATA_SOURCES>
  • <SUCCESS_METRICS>
  • <DEPLOYMENT_ENV> (local laptop / server / cloud / enterprise / airgapped)
  • <BUDGET_PRIORITY> (cheap / balanced / premium)
  • <PRIVACY_REQUIREMENTS>
  • <HUMAN_REVIEW_POINTS>
  • <MULTI_AGENT_REQUIRED> (yes / no)
  • <LONG_TERM_MEMORY> (yes / no)
  • <LEARNING_ALLOWED> (none / constrained / active)
  • <OUTPUT_STYLE>

IMPORTANT 2026 REALITY CONSTRAINTS

You MUST design for current LLM limitations:

World Modeling Limits

LLMs are weak at:

  • causal reasoning consistency
  • long-horizon planning
  • stable beliefs over time
  • recursive reasoning
  • grounded truth without tools

Therefore:

  • narrow domain scope when possible
  • external verification required
  • avoid open-ended autonomous truth generation
  • prefer tool-backed evidence

Epistemic Reliability Risks

LLMs may:

  • hallucinate entities
  • reinforce false assumptions
  • misclassify patterns
  • overgeneralize from little data
  • fail contradiction detection

Therefore:

  • all world-model updates require confidence levels
  • hypotheses need evidence thresholds
  • contradiction checks mandatory
  • human review gates for sensitive domains

Operational Guidance

Prefer systems that are:

  • useful over impressive
  • inspectable over magical
  • narrow over vague
  • testable over aspirational
  • reversible over opaque

YOUR OUTPUT MUST INCLUDE

Generate a complete custom project design containing:

1. Executive Summary

Explain what <AGENT_NAME> does and why this architecture fits.

2. Recommended Runtime Shape

Choose one:

  • single container
  • multi-container
  • orchestrator + workers
  • local-only runtime
  • enterprise isolated runtime

Explain why.

3. Purpose-Built Filesystem Layout

Generate only files relevant to this use case.

Examples:

  • compliance agents need audit/
  • coding agents need repos/
  • research agents need sources/
  • support agents need customers/
  • legal agents need clauses/
  • medical agents need approvals/

Avoid generic unnecessary folders.

4. Custom Agent Memory Model

Choose which should exist:

  • skills
  • procedures
  • world model
  • cases
  • templates
  • verified facts
  • hypotheses
  • user preferences
  • policies

And justify each.

5. Verification Architecture

Design domain-appropriate checks such as:

  • tests
  • linting
  • simulation
  • API validation
  • human approval
  • dual model review
  • citations required
  • deterministic tool checks

6. Learning Policy

Based on <LEARNING_ALLOWED> choose:

  • no learning
  • reward logs only
  • constrained skill learning
  • active skill + world model learning

Define strict boundaries.

7. Model Routing Policy

Based on <BUDGET_PRIORITY> define:

  • cheap fast model for routine tasks
  • stronger model for difficult tasks
  • premium model only for high stakes

8. Security Policy

Use <RISK_LEVEL> and <PRIVACY_REQUIREMENTS>.

Include:

  • network access rules
  • secret handling
  • logging restrictions
  • PII policy
  • sandboxing
  • approval gates

9. Docker Project Files

Generate:

  • Dockerfile
  • docker-compose.yml (if useful)
  • agents.yaml
  • env.example
  • startup flow

All customized for this purpose.

10. Observability

Metrics and logs specific to the mission.

Examples:

  • code fix success rate
  • legal clause precision
  • customer satisfaction proxy
  • false positive rate
  • hallucination incidents
  • approval rejection rate

11. Example Tasks

Give 10 realistic tasks for <AGENT_NAME>.

12. Risks & Failure Modes

Specific to the domain.

13. Recommended v1 Scope

What to ship first in 2 weeks.

14. v2 Expansion Plan

What to add after real usage data.


STYLE RULES

  • Be concrete.
  • Prefer practical over theoretical.
  • Avoid hype.
  • Assume 2026 LLM limitations are real.
  • If the requested design is dangerous or unrealistic, scale it down safely.
  • If <PRIMARY_PURPOSE> is broad, narrow it into a viable first version.
  • If <PRIMARY_PURPOSE> is too vague, infer best practical scope.

OUTPUT FORMAT

Use clean markdown with sections, code blocks, tables, and operational details.


USER VALUES

AGENT_NAME: <AGENT_NAME>
PRIMARY_PURPOSE: <PRIMARY_PURPOSE>
DOMAIN: <DOMAIN>
TARGET_USERS: <TARGET_USERS>
TASK_EXAMPLES: <TASK_EXAMPLES>
RISK_LEVEL: <RISK_LEVEL>
AUTONOMY_LEVEL: <AUTONOMY_LEVEL>
TOOLS_ALLOWED: <TOOLS_ALLOWED>
TOOLS_FORBIDDEN: <TOOLS_FORBIDDEN>
DATA_SOURCES: <DATA_SOURCES>
SUCCESS_METRICS: <SUCCESS_METRICS>
DEPLOYMENT_ENV: <DEPLOYMENT_ENV>
BUDGET_PRIORITY: <BUDGET_PRIORITY>
PRIVACY_REQUIREMENTS: <PRIVACY_REQUIREMENTS>
HUMAN_REVIEW_POINTS: <HUMAN_REVIEW_POINTS>
MULTI_AGENT_REQUIRED: <MULTI_AGENT_REQUIRED>
LONG_TERM_MEMORY: <LONG_TERM_MEMORY>
LEARNING_ALLOWED: <LEARNING_ALLOWED>
OUTPUT_STYLE: <OUTPUT_STYLE>

Now generate the best custom AgentOS runtime design.
