Prompt template instantiation for: Agent-Native Research Artifact (ARA) system
Paradigm: ARA-first, paper-as-compiled-view, failure-trace as first-class memory
Key constraints: Docker-sandboxed execution · User-supplied model · Zero documentation burden
Before the design: the template variables as resolved, with explicit justifications for any adaptation or drop.
| Variable | Value | Notes |
|---|---|---|
| `AGENT_NAME` | ResearchOS | Orchestrator + sandbox pool collective |
| `PRIMARY_PURPOSE` | Distill research sessions into executable ARA artifacts | Narrowed from "do research" to the Live Research Manager pattern |
| `DOMAIN` | Computational research & ML experimentation | Bounded to code-producing research; avoids wet-lab or clinical scope |
| `TARGET_USERS` | Researchers + coding agents that consume ARAs as baselines | Dual audience: human authors + downstream agent readers |
| `TASK_EXAMPLES` | See §11 | |
| `RISK_LEVEL` | medium | Results influence downstream agents; hallucinated claims propagate |
| `AUTONOMY_LEVEL` | semi-autonomous | Human gates on world-model commits and claim promotion |
| `TOOLS_ALLOWED` | Python/shell in sandbox, git, file I/O, citations API | Sandboxed; no network from worker containers |
| `TOOLS_FORBIDDEN` | Direct internet from workers, external DB writes, model self-modification | |
| `DATA_SOURCES` | Local repos, ARA artifacts, PDF corpus (via ARA Compiler) | |
| `SUCCESS_METRICS` | Reproduction rate, claim coverage, trace completeness, peer seal pass rate | |
| `DEPLOYMENT_ENV` | Local laptop → cloud-portable | Single `docker compose up` launch |
| `BUDGET_PRIORITY` | balanced | Routing defined by user; no model hardcoded |
| `PRIVACY_REQUIREMENTS` | Researcher controls what leaves the container | All execution airgapped inside sandbox |
| `HUMAN_REVIEW_POINTS` | Claim promotion, world-model merge, ARA Seal submission | |
| `MULTI_AGENT_REQUIRED` | yes | Orchestrator + LRM distiller + sandbox workers |
| `LONG_TERM_MEMORY` | yes | Git as longitudinal memory; ARA artifact store |
| `LEARNING_ALLOWED` | constrained | Skills + failure traces; no open-ended world-model generation |
| `OUTPUT_STYLE` | Executable ARA artifact + compiled human-readable view | |
Template sections adapted or dropped:
- §4 Custom Agent Memory Model: `user preferences` dropped (irrelevant); `cases` renamed to `failure traces` (the core ARA insight); `world model` scoped to a `research landscape` (known results, retracted claims, active hypotheses per domain).
- §7 Model Routing: all model names replaced with env-var references; no default hardcoded.
- §8 Security: expanded with a sandbox network policy table; container-level isolation is the primary security primitive, not prompt-level rules.
ResearchOS recasts research session outputs from narrative PDFs into Agent-Native Research Artifacts (ARAs): structured, executable knowledge packages with four interlocking layers — claims, code, failure traces, and raw evidence. A Live Research Manager (LRM) agent sits on top of any coding session and distills the conversation into the ARA in the background. A Compiler agent ingests legacy PDFs/repos into the same format. A Seal agent runs automated verification before human review.
The architecture is not a replacement for scientific judgment. It is a substrate that makes judgment auditable, reproducible, and composable — science that compounds like software.
Why this architecture fits:
- Failure traces are the primary differentiator. The filesystem keeps dead ends as ranked, attributed evidence — not narrative prose.
- Every code execution runs in an isolated Docker sandbox. Reproducibility is structural, not aspirational.
- The model is a user-supplied variable. ResearchOS routes by task difficulty, not by product preference.
- Git is the longitudinal memory. Every claim promotion, trace update, and world-model merge is a semantic commit with a machine-readable header.
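The "semantic commit with a machine-readable header" idea can be illustrated with a small sketch. The source does not define the header format, so the `ResearchOS-Meta:` trailer, the function names, and the field names below are all assumptions chosen for illustration.

```python
import json

META_PREFIX = "ResearchOS-Meta: "  # hypothetical trailer key, not defined by the source


def semantic_commit_message(event: str, ara_slug: str, summary: str, **fields: str) -> str:
    """Build a commit message: a readable subject line plus one JSON trailer line."""
    header = {"event": event, "ara": ara_slug, **fields}
    subject = f"{event}({ara_slug}): {summary}"
    trailer = META_PREFIX + json.dumps(header, sort_keys=True)
    return f"{subject}\n\n{trailer}\n"


def parse_commit_meta(message: str) -> dict:
    """Recover the machine-readable header from a commit message, or {} if absent."""
    for line in message.splitlines():
        if line.startswith(META_PREFIX):
            return json.loads(line[len(META_PREFIX):])
    return {}


if __name__ == "__main__":
    print(semantic_commit_message(
        "claim-promotion", "example-ara", "promote claim c-001", claim_id="c-001"
    ))
```

Keeping the header as a single JSON line means downstream agents could filter `git log` output with an ordinary line scan rather than a custom parser.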
Choice: Orchestrator + worker pool (multi-container)
┌─────────────────────────────────────────────────────┐
│ docker compose │
│ │
│ ┌─────────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ orchestrator│ │ lrm │ │ compiler │ │
│ │ (main.py) │◄──│ distiller│ │ (pdf→ARA) │ │
│ └──────┬──────┘ └──────────┘ └──────────────┘ │
│ │ spawns │
│ ┌──────▼──────────────────────────────────────┐ │
│ │ sandbox worker pool (ephemeral containers) │ │
│ │ sandbox-1 sandbox-2 sandbox-3 ... │ │
│ │ each: python + git + no network │ │
│ └──────────────────────────────────────────────┘ │
│ │
│ shared volume: /data (ARA store + git repo) │
└─────────────────────────────────────────────────────┘
Why multi-container:
- Code execution must be isolated. Each sandbox is ephemeral and network-gapped; a crashed or infinite experiment cannot affect the orchestrator or the ARA store.
- The LRM and Compiler are independent concerns. They read from the session queue and write to the ARA store; they should be independently restartable.
- Horizontal scaling is trivial: increase `SANDBOX_POOL_SIZE` in `.env`.
Only files that earn their place in a research system are included. Generic agent directories (customers/, inbox/outbox/) are dropped.
/data/
├── .git/ # longitudinal memory — ALL learning events
│
├── ara/ # primary artifact store
│ ├── index.md # registry of all ARAs with status + claim counts
│ ├── <paper-slug>/
│ │ ├── ara.yaml # ARA manifest (version, authors, claim list, seal status)
│ │ ├── claims/
│ │ │ ├── <claim-id>.yaml # structured claim: text, evidence refs, confidence
│ │ │ └── ...
│ │ ├── code/
│ │ │ ├── reproduce.sh # single-command reproduction entry point
│ │ │ ├── environment.yaml # conda/pip exact pin
│ │ │ └── src/ # research code (may be symlink to repo)
│ │ ├── traces/
│ │ │ ├── failures/ # dead ends, ranked by hours spent + reason failed
│ │ │ ├── decisions/ # judgment calls with rationale
│ │ │ └── checkpoints/ # intermediate results (not in paper)
│ │ ├── evidence/
│ │ │ ├── raw/ # raw outputs: logs, CSVs, model checkpoints
│ │ │ └── figures/ # generated figures with provenance
│ │ └── views/
│ │ ├── paper.md # compiled human-readable narrative (generated)
│ │ └── review-packet.md # seal submission view
│
├── agents/
│ ├── orchestrator/
│ │ ├── persona.md # read-only
│ │ ├── constraints.md # read-only: budget caps, sandboxing rules
│ │ ├── skills.md # earned orchestration patterns (max 20)
│ │ ├── goals.md # recurring gaps
│ │ ├── rewards.md # rolling log (last 30)
│ │ └── reflections.md # failure patterns (last 15)
│ ├── lrm/ # Live Research Manager
│ │ ├── persona.md
│ │ ├── constraints.md
│ │ ├── skills.md # distillation patterns
│ │ └── session-queue.md # FIFO of sessions to distill
│ └── compiler/
│ ├── persona.md
│ ├── constraints.md
│ ├── skills.md # PDF→ARA conversion patterns
│ └── compile-queue.md
│
├── research-landscape/ # scoped world model — NOT open-ended
│ ├── index.md
│ ├── known-results/ # verified claims from imported ARAs
│ ├── retracted/ # flagged claims with reason
│ └── hypotheses/ # confidence-gated; require tool-backed evidence
│
├── shared/
│ ├── locks/ # semaphore files for concurrent writes
│ ├── sandbox-results/ # sandbox workers write here; orchestrator reads
│ └── proposals/ # world-model update proposals awaiting human gate
│
└── system/
├── task-queue.md # incoming research tasks
├── routing-policy.yaml # model routing rules (no model names hardcoded)
└── seal-policy.yaml # automated check thresholds
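As a rough illustration of the per-ARA layout above, the following sketch creates an empty artifact skeleton. The sub-directory names come from the tree; the function name and the placeholder manifest fields are assumptions, since the `ara.yaml` schema is not given here.

```python
from pathlib import Path

# Sub-directories taken from the tree above.
ARA_SUBDIRS = [
    "claims",
    "code/src",
    "traces/failures",
    "traces/decisions",
    "traces/checkpoints",
    "evidence/raw",
    "evidence/figures",
    "views",
]


def scaffold_ara(store: Path, slug: str) -> Path:
    """Create the directory skeleton for one ARA and return its root (idempotent)."""
    root = store / slug
    for sub in ARA_SUBDIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    manifest = root / "ara.yaml"
    if not manifest.exists():
        # Placeholder fields only; the real manifest schema is an assumption here.
        manifest.write_text(f"slug: {slug}\nclaims: []\nseal_status: unsealed\n")
    return root
```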
Five memory types are kept. The template's other three are handled differently: one is dropped, one is renamed, and one is merged into the artifact itself.
Skills — earned research patterns, not just code tricks. Examples: "structured ablation before full run", "cache intermediate tensors to evidence/raw/", "always pin random seeds before logging a claim". Budget: 20 entries. Evidence threshold: ≥2 reproduced +1 outcomes.
Verified Facts — promoted claims from the research landscape. Must have: a tool-backed evidence file reference, a confidence score ≥0.85, and no active contradiction in retracted/. These are the epistemic spine of the ARA.
Failure Traces — the core ARA differentiator. Every dead end is a first-class memory object: what was tried, why it failed, how long it took, and a discount_after timestamp (because a failed approach in 2022 may be worth retrying with a 2026 model). The trace is the "ranked menu of what to try and what not to" mentioned in the ARA paper. Traces are never pruned — they are timestamped and discounted, not deleted.
Hypotheses — patterns with weak evidence. Confidence-gated: cannot become a verified fact without tool-backed confirmation. Human review required before world-model merge.
Procedures — reusable experimental protocols (data split strategy, evaluation harness, baseline selection). Separate from skills because they are domain-specific sequences, not heuristics.
User Preferences (dropped): ResearchOS serves researchers and downstream agents equally. Personalisation at the memory layer adds drift risk with no reproducibility benefit.
Cases (renamed): now Failure Traces (above). The original "cases" framing implied positive examples; the ARA insight is that negative cases are what compound.
Templates (merged): folded into `views/` within the ARA artifact. A compiled paper view is a generated output, not a persistent memory object.
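The "timestamped and discounted, not deleted" rule for failure traces can be sketched as a weighting function. The source does not specify the discount curve, so the exponential half-life, the `FailureTrace` fields, and the ranking score (weight times hours spent) are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FailureTrace:
    trace_id: str
    what_was_tried: str
    why_it_failed: str
    hours_spent: float
    discount_after: datetime


def trace_weight(trace: FailureTrace, now: datetime, half_life_days: float = 180.0) -> float:
    """Full weight until discount_after, then halve every half_life_days (assumed curve)."""
    if now <= trace.discount_after:
        return 1.0
    overdue_days = (now - trace.discount_after).total_seconds() / 86400
    return 0.5 ** (overdue_days / half_life_days)


def rank_traces(traces: list[FailureTrace], now: datetime) -> list[FailureTrace]:
    """Order traces so that recent, expensive dead ends come first."""
    return sorted(traces, key=lambda t: trace_weight(t, now) * t.hours_spent, reverse=True)
```

A trace is never removed by this scheme; an old dead end simply sinks in the ranking, which leaves room to retry it with a stronger model.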
Three levels; the Seal runs all three before human review.
Level 1 — Structural Integrity
├── ara.yaml schema validation (jsonschema)
├── all claim IDs referenced in code comments
├── evidence files exist and are non-empty
├── reproduce.sh is executable and environment.yaml is pinned
└── no missing cross-references between claims and traces
Level 2 — Argumentative Rigor
├── dual-model review: two model calls (routing policy decides tiers)
│ each produces: claim-support verdict + confidence + contradiction flags
├── hypothesis-to-verified-fact gate: confidence ≥ 0.85 + tool evidence
├── retraction cross-check: no promoted claim contradicts retracted/
└── logical consistency scan (claim → evidence → figure chain)
Level 3 — Execution Reproducibility
├── sandbox run: `reproduce.sh` executed in isolated container
├── output diffed against stored evidence/raw/ (numeric tolerance configurable)
├── runtime logged and compared to claimed compute budget
└── random seed audit: all seeds must be logged before first model call
Human review gates (not automated):
- Claim promotion from hypothesis to verified fact in research-landscape
- ARA Seal submission
- World-model merge (proposals/ → known-results/)
- Any trace `discount_after` date extension
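The Level 3 step that diffs reproduced output against stored evidence can be sketched as below. This assumes results have already been parsed into flat name-to-number mappings and that the configured tolerance is relative; neither detail is stated in the source, and `diff_outputs` is a hypothetical name.

```python
import math


def diff_outputs(
    stored: dict[str, float],
    reproduced: dict[str, float],
    rel_tol: float = 1e-4,
) -> list[str]:
    """Return human-readable mismatches; an empty list means the run reproduces."""
    problems = []
    for key in sorted(stored.keys() | reproduced.keys()):
        if key not in reproduced:
            problems.append(f"{key}: missing from reproduction")
        elif key not in stored:
            problems.append(f"{key}: not present in stored evidence")
        elif not math.isclose(stored[key], reproduced[key], rel_tol=rel_tol):
            problems.append(f"{key}: stored={stored[key]} reproduced={reproduced[key]}")
    return problems
```

Returning the full list of mismatches, rather than a single boolean, gives the human reviewer something concrete to inspect in the review packet.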
Constrained skill + failure-trace learning. No open-ended world-model generation.
Rules:
learning_policy:
  skills:
    allowed: true
    budget: 20
    evidence_threshold: 2        # minimum +1 outcomes before a skill is written
    pruning: lowest reward_evidence when budget full
    owner: orchestrator + lrm (each have separate budgets)
  failure_traces:
    allowed: true
    budget: unlimited            # traces are never pruned, only discounted
    discount_after: 18months     # agent weights recent traces higher
    attribution: required        # who tried it, when, with what model tier
  verified_facts:
    allowed: true
    gate: human_review + tool_evidence
    confidence_floor: 0.85
  hypotheses:
    allowed: true
    max_age_without_evidence: 90days   # auto-expire to retracted/ if no evidence
  world_model_autonomous_generation:
    allowed: false               # agents propose; humans merge
    reason: "LLMs hallucinate entities and overgeneralize from little data"
  self_modification:
    allowed: false
    reason: "persona.md and constraints.md are read-only"

No model names are hardcoded anywhere in the codebase. All routing goes through system/routing-policy.yaml, which is populated from environment variables at startup.
# system/routing-policy.yaml (generated from env at startup)
routing:
  tiers:
    fast:
      env_var: RESEARCHOS_MODEL_FAST
      use_for:
        - structural integrity checks (Level 1 Seal)
        - session distillation (routine claims)
        - failure trace formatting
        - skill pruning decisions
    balanced:
      env_var: RESEARCHOS_MODEL_BALANCED
      use_for:
        - argumentative rigor review (Level 2 Seal, first pass)
        - hypothesis confidence scoring
        - ARA compiled paper view generation
        - procedure extraction from session logs
    strong:
      env_var: RESEARCHOS_MODEL_STRONG
      use_for:
        - dual-model review second pass
        - novel claim evaluation (novelty score > threshold)
        - contradiction detection across research landscape
        - world-model merge proposals
  novelty_threshold: 0.7   # above this → always use strong tier
  budget_override: false   # never downgrade strong-tier tasks for cost
  fallback: balanced       # if a tier env_var is unset

This means the same ResearchOS image runs on any provider's model or on local inference: the user sets three env vars, and routing follows the declared policy.
Risk level: medium. Primary threat: a research agent that promotes hallucinated claims into the shared research landscape, which downstream agents treat as ground truth.
| Container | Outbound network | Notes |
|---|---|---|
| orchestrator | LLM API only | via egress proxy; no raw internet |
| lrm | LLM API only | |
| compiler | LLM API + citations | citations API is the only external data source |
| sandbox-worker | NONE | fully airgapped; reads/writes via /data volume only |
Worker containers are launched with --network none. All inputs are written to the shared volume before launch; all outputs are read back after exit.
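A minimal sketch of how the orchestrator could assemble that launch command follows. The flags used (`--rm`, `--network`, `--memory`, `--cpus`, `-v`) are standard `docker run` options and the defaults mirror values that appear elsewhere in this document; the function name is hypothetical, and the command is only built here, not executed.

```python
def sandbox_argv(
    run_dir: str,
    image: str = "researchos-sandbox:latest",
    mem_limit: str = "8g",
    cpus: str = "4.0",
) -> list[str]:
    """Assemble the `docker run` argument list for one ephemeral, network-less worker."""
    return [
        "docker", "run", "--rm",
        "--network", "none",               # no outbound or inbound networking
        "--memory", mem_limit,
        "--cpus", cpus,
        "-v", "researchos-data:/data:ro",  # shared store, mounted read-only
        "-v", f"{run_dir}:/workspace",     # per-run scratch space for outputs
        image,
    ]


if __name__ == "__main__":
    print(" ".join(sandbox_argv("/tmp/run-001")))
```

Building the command as a list rather than a shell string avoids quoting problems if it is later handed to `subprocess.run`; enforcing the run timeout is deliberately left out of this sketch.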
.env (never committed)
RESEARCHOS_MODEL_FAST=...
RESEARCHOS_MODEL_BALANCED=...
RESEARCHOS_MODEL_STRONG=...
LLM_API_KEY=...
CITATIONS_API_KEY=...
Secrets injected at runtime via env; never written to /data or .git
Pre-commit hook rejects commits containing key patterns (regex scan)
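The regex scan behind such a pre-commit hook could be as simple as the sketch below. The two patterns are illustrative assumptions (the key names from `.env`, plus a generic private-key marker); a real hook would need a pattern list tuned to the secrets actually in use.

```python
import re

# Illustrative patterns only; extend for the providers actually in use.
SECRET_PATTERNS = [
    re.compile(r"\b(LLM_API_KEY|CITATIONS_API_KEY)\s*=\s*\S+"),
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]


def find_secret_lines(text: str) -> list[int]:
    """Return the 1-based line numbers in text that match any secret pattern."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append(lineno)
    return hits
```

Empty assignments such as `LLM_API_KEY=` do not match, so a template file with blank values would still pass the scan.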
- Researcher names are stored only in `ara.yaml` author fields, not in logs.
- Sandbox stdout/stderr is captured to `evidence/raw/<run-id>.log`, not streamed to external services.
- `rewards.md` and `reflections.md` contain no user-identifying information.
| Action | Gate |
|---|---|
| hypothesis → verified fact | human review required |
| proposals/ → known-results/ | human review required |
| ARA Seal submission | human sign-off on Level 2 + 3 results |
| sandbox `reproduce.sh` execution | automatic (sandboxed) |
| failure trace `discount_after` extension | human review |
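The hypothesis-to-verified-fact gate in the first row can be sketched as a function that lists everything still blocking promotion. The 0.85 floor and the tool-evidence and retraction requirements come from this document; the `Hypothesis` fields are hypothetical, and matching retractions by claim id is a simplification of a real contradiction check.

```python
from dataclasses import dataclass, field

CONFIDENCE_FLOOR = 0.85  # matches confidence_floor in the learning policy


@dataclass
class Hypothesis:
    claim_id: str
    confidence: float
    evidence_files: list[str] = field(default_factory=list)


def promotion_blockers(
    hypothesis: Hypothesis,
    retracted_ids: set[str],
    human_approved: bool,
) -> list[str]:
    """Return the reasons a hypothesis cannot be promoted yet; empty means eligible."""
    blockers = []
    if hypothesis.confidence < CONFIDENCE_FLOOR:
        blockers.append("confidence below floor")
    if not hypothesis.evidence_files:
        blockers.append("no tool-backed evidence file")
    if hypothesis.claim_id in retracted_ids:
        blockers.append("conflicts with retracted claim")
    if not human_approved:
        blockers.append("awaiting human review")
    return blockers
```

Even a hypothesis that passes every automated check still reports "awaiting human review" until a person signs off, which keeps the human gate structural rather than advisory.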
Dockerfile:

FROM python:3.12-slim

# System deps
RUN apt-get update && apt-get install -y --no-install-recommends \
    git curl jq \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Python deps
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# App code
COPY src/ ./src/
COPY agents.yaml .
COPY system/ ./system/

# Git config for semantic commits
RUN git config --global user.email "researchos@local" \
    && git config --global user.name "ResearchOS"

# Init /data if not already a git repo (handled at entrypoint)
COPY entrypoint.sh .
RUN chmod +x entrypoint.sh
ENTRYPOINT ["./entrypoint.sh"]

entrypoint.sh:

#!/bin/bash
set -e

# Initialize /data git repo on first run
if [ ! -d /data/.git ]; then
    git -C /data init
    git -C /data commit --allow-empty -m "init: ResearchOS data store initialized"
fi

# Generate routing-policy.yaml from env
python src/generate_routing_policy.py

# Start main task loop
exec python src/main.py

Compose file:

services:
  orchestrator:
    build: .
    image: researchos:latest
    volumes:
      - researchos-data:/data
      - /var/run/docker.sock:/var/run/docker.sock  # spawn sandbox containers
    env_file: .env
    environment:
      - AGENT_ROLE=orchestrator
      - SANDBOX_IMAGE=researchos-sandbox:latest
      - SANDBOX_POOL_SIZE=3
  lrm:
    image: researchos:latest
    volumes:
      - researchos-data:/data
    env_file: .env
    environment:
      - AGENT_ROLE=lrm
    depends_on:
      - orchestrator
  compiler:
    image: researchos:latest
    volumes:
      - researchos-data:/data
    env_file: .env
    environment:
      - AGENT_ROLE=compiler
    depends_on:
      - orchestrator
volumes:
  researchos-data:
    # Fixed name: without it Compose prefixes the project name, and the
    # `docker run -v researchos-data:...` used for sandboxes would mount a different volume.
    name: researchos-data
    driver: local

Dockerfile.sandbox:

# Sandbox worker: airgapped, ephemeral, no LLM calls
FROM python:3.12-slim
RUN apt-get update && apt-get install -y --no-install-recommends \
    git \
    && rm -rf /var/lib/apt/lists/*
# The runner lives outside /workspace: the run directory is bind-mounted over
# /workspace at launch and would otherwise hide it.
COPY sandbox_runner.py /opt/sandbox/sandbox_runner.py
WORKDIR /workspace
# Sandbox gets a copy of the ARA's environment.yaml at runtime
ENTRYPOINT ["python", "/opt/sandbox/sandbox_runner.py"]
# Launched by orchestrator with:
#   docker run --rm --network none \
#     -v researchos-data:/data:ro \
#     -v <run-dir>:/workspace \
#     researchos-sandbox:latest

agents.yaml:

agents:
  orchestrator:
    persona: agents/orchestrator/persona.md
    constraints: agents/orchestrator/constraints.md
    skills_budget: 20
    reward_window: 30
    reflection_window: 15
    task_queue: system/task-queue.md
  lrm:
    persona: agents/lrm/persona.md
    constraints: agents/lrm/constraints.md
    skills_budget: 20
    session_queue: agents/lrm/session-queue.md
    distillation_mode: background   # zero burden on researcher
  compiler:
    persona: agents/compiler/persona.md
    constraints: agents/compiler/constraints.md
    skills_budget: 20
    compile_queue: agents/compiler/compile-queue.md
    output_format: ara_v1
  sandbox:
    image: researchos-sandbox:latest
    network: none
    mem_limit: 8g
    cpu_limit: 4.0
    timeout_seconds: 3600
    results_path: shared/sandbox-results/

env.example:

# Model routing: fill in your own model identifiers; no defaults enforced
RESEARCHOS_MODEL_FAST=
RESEARCHOS_MODEL_BALANCED=
RESEARCHOS_MODEL_STRONG=
# API keys
LLM_API_KEY=
CITATIONS_API_KEY=
# Sandbox pool
SANDBOX_POOL_SIZE=3
SANDBOX_MEM_LIMIT=8g
# ARA store
ARA_STORE_PATH=/data/ara
RESEARCH_LANDSCAPE_PATH=/data/research-landscape
# Seal thresholds
SEAL_L1_REQUIRED=true
SEAL_L2_CONFIDENCE_FLOOR=0.85
SEAL_L3_NUMERIC_TOLERANCE=1e-4
# Git identity for semantic commits
GIT_AUTHOR_EMAIL=researchos@local
GIT_AUTHOR_NAME=ResearchOS

Quick start:

# Build both images
docker build -t researchos:latest .
docker build -f Dockerfile.sandbox -t researchos-sandbox:latest .
# Copy and fill env
cp env.example .env
# ... edit .env with your model identifiers and API keys ...
# Launch
docker compose up
# Queue a task (from host); printf is used because plain echo in bash does not expand \n
printf '%s\n' "- id: task-001" "  type: compile" "  source: /data/pdfs/attention-is-all-you-need.pdf" \
    >> /path/to/researchos-data/system/task-queue.md

Metrics are research-specific. Generic "task success rate" is replaced with signals that reveal whether the agent is actually improving science throughput.
| Metric | Source | Target |
|---|---|---|
| Claim coverage rate | claims/ count vs. paper section count | > 85% |
| Trace completeness | failure entries per ARA | ≥ 3 per major experiment |
| Reproduction pass rate | Seal Level 3 auto-run | > 90% |
| Claim promotion lag | hypothesis created → verified fact | < 7 days median |
| Hallucination incidents | Level 2 dual-model disagreement rate | < 5% |
| Seal rejection rate | L1/L2/L3 auto-fail before human review | Track trend; rising = distiller regression |
| Skill accumulation curve | skills.md entries over time | Plateau = healthy; churn = policy problem |
| Failure trace discount rate | traces past `discount_after` date | Flag for human review |
| Sandbox crash rate | non-zero exit from sandbox workers | < 2% |
| Model tier usage ratio | fast : balanced : strong call counts | Track cost vs. claim quality correlation |
All metrics are derived from files already in /data — no external observability stack required in v1. A simple python src/metrics.py command generates a system/metrics-report.md on demand.
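Because every metric is derived from files in the store, a first version of the report can be plain directory counting. The sketch below covers only claim and failure-trace counts and assumes the per-ARA layout shown earlier; the function names are hypothetical and not taken from an existing `metrics.py`.

```python
from pathlib import Path


def count_files(root: Path, pattern: str) -> int:
    """Count regular files under root that match a glob pattern."""
    return sum(1 for p in root.glob(pattern) if p.is_file())


def ara_metrics(ara_dir: Path) -> dict[str, int]:
    """Per-ARA counts used for claim coverage and trace completeness."""
    return {
        "claims": count_files(ara_dir, "claims/*.yaml"),
        "failure_traces": count_files(ara_dir, "traces/failures/*"),
    }


def render_report(store: Path) -> str:
    """Render a markdown table with one row per ARA directory in the store."""
    lines = ["| ARA | Claims | Failure traces |", "|---|---|---|"]
    for ara_dir in sorted(p for p in store.iterdir() if p.is_dir()):
        m = ara_metrics(ara_dir)
        lines.append(f"| {ara_dir.name} | {m['claims']} | {m['failure_traces']} |")
    return "\n".join(lines) + "\n"
```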
- Distill a 4-hour coding session into a new ARA for a gradient checkpointing experiment: the LRM extracts 3 claims, 2 dead ends, and 1 confirmed performance gain.
- Compile a legacy PDF ("Attention Is All You Need") into an ARA: extract structured claims, map figures to evidence files, flag missing reproduction code.
- Run a Seal check on an existing ARA: Level 3 catches that `reproduce.sh` produces output that diverges from `evidence/raw/` by 2.3% (outside tolerance).
- Rank failure traces for a new researcher starting on transformer quantization: retrieve all traces tagged `quantization`, sorted by recency and `discount_after` status.
- Propose a world-model update: after three independent `+1` reproductions of a new result, draft a `proposals/` entry for human gate review.
- Extend an existing ARA with a new ablation result: add the claim, attach sandbox output as evidence, update `reproduce.sh`, commit with a semantic header.
- Detect a contradiction: a new claim conflicts with a `known-results/` entry; flag both, write a `hypotheses/` entry for the disagreement, queue for dual-model review.
- Generate a compiled paper view from an ARA: produce `views/paper.md` as a narrative document from structured claims, evidence, and trace summaries.
- Expire a stale hypothesis: a `hypotheses/` entry with no new evidence in 90 days is auto-moved to `retracted/` with reason `evidence_timeout`.
- Horizontal scale test: spin up 5 sandbox workers simultaneously, each reproducing a different ARA; results are written back to `shared/sandbox-results/` without collision (lock files enforced).
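The stale-hypothesis example can be sketched as a pure selection step followed by a retraction record. The 90-day window and the `evidence_timeout` reason come from this document; treating day 90 itself as expired, the function names, and the record fields are assumptions.

```python
from datetime import date, timedelta

MAX_AGE_WITHOUT_EVIDENCE = timedelta(days=90)


def stale_hypotheses(last_evidence: dict[str, date], today: date) -> list[str]:
    """Return ids of hypotheses whose newest evidence is at least 90 days old."""
    return sorted(
        hypothesis_id
        for hypothesis_id, seen in last_evidence.items()
        if today - seen >= MAX_AGE_WITHOUT_EVIDENCE
    )


def retraction_note(hypothesis_id: str, today: date) -> dict[str, str]:
    """Build the record that would accompany a hypothesis moved to retracted/."""
    return {
        "id": hypothesis_id,
        "reason": "evidence_timeout",
        "retracted_on": today.isoformat(),
    }
```

Separating selection from the file move keeps the expiry rule testable without touching the store.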
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| LRM hallucinates a claim not in the session | Medium | High — propagates to research landscape | Dual-model Level 2 check; human gate on promotion |
| Sandbox reproduce.sh hangs indefinitely | Medium | Low — orchestrator unblocked | timeout_seconds in agents.yaml; container killed on expiry |
| Git history grows unbounded | Low | Medium — slow clones | git gc cron in orchestrator; large binary artifacts stored via Git LFS |
| Failure trace `discount_after` not updated | Medium | Medium — agent avoids valid approaches | Human review queue for expiring traces |
| Strong-tier model unavailable (API outage) | Low | High — Seal Level 2 blocked | Fallback to balanced + flag for delayed human review |
| Retracted claim re-promoted by compiler | Low | High | Cross-check against retracted/ before any promotion |
| Sandbox worker escapes network isolation | Very low | Critical | --network none is Docker-enforced, not policy-enforced |
| Two agents write to same ARA simultaneously | Medium | Medium | Lock files in shared/locks/ before any ARA write |
| Skill budget full with low-quality skills | Low | Low | Pruning on lowest reward_evidence; periodic human audit |
| Trace `discount_after` becomes a ceiling for strong agents | Medium | Low in v1 | Add provenance tags to traces; successors can selectively discount |
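The lock-file mitigation for concurrent ARA writes could be implemented with atomic exclusive file creation, as in this sketch. The `ara_lock` name and the one-lock-per-slug naming are assumptions; the sketch also does not handle stale locks left behind by a crashed process.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def ara_lock(lock_dir: Path, slug: str):
    """Hold an exclusive lock file for one ARA; raises FileExistsError if already held."""
    lock_dir.mkdir(parents=True, exist_ok=True)
    lock_path = lock_dir / f"{slug}.lock"
    # O_CREAT | O_EXCL makes creation atomic: the call fails if the file already exists.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        os.write(fd, str(os.getpid()).encode())  # record the holder for debugging
        yield lock_path
    finally:
        os.close(fd)
        lock_path.unlink()
```

A caller would wrap each write in `with ara_lock(Path("/data/shared/locks"), slug):` and retry or queue the write when `FileExistsError` is raised.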
Ship exactly the items listed below, nothing else:
Week 1
- Docker images build and `docker compose up` succeeds with a test ARA.
- LRM distiller: takes a session transcript (plain text), produces a valid `ara.yaml` + at least one claim file + at least one failure trace. No world-model writes yet.
- Sandbox worker: runs `reproduce.sh` from an existing ARA, writes output to `shared/sandbox-results/`, exits cleanly.
Week 2
- Seal Level 1 (structural integrity) runs automatically after every LRM distillation.
- Semantic git commits: every claim addition and trace write produces a correctly formatted commit.
- `metrics.py` generates a `system/metrics-report.md` covering claim count, trace count, and sandbox pass rate.
- `env.example` fully documented; first-run experience requires only filling in API keys and model names.
Explicitly out of scope for v1:
- ARA Compiler (PDF ingestion) — add in v2
- Seal Level 2 and 3 — add in v2
- Research landscape / world model — add in v2
- Horizontal sandbox scaling — add in v2
Ordered by value delivered per effort, based on the ARA paper's empirical findings.
High value, low risk
- Seal Level 3 (sandbox reproduction as gating check before any human review).
- ARA Compiler: PDF → ARA for legacy literature. Start with arXiv HTML format; PDF is harder.
- Failure trace discount scoring: weight traces by age × model-tier-at-time-of-failure. This is the mechanism that prevents the trace from becoming a ceiling for strong agents.
Medium value, medium risk
- Research landscape (known-results + hypotheses): start read-only (populated by compiler), add write path only after Level 2 Seal is stable.
- Horizontal sandbox scaling: `SANDBOX_POOL_SIZE > 1` with lock-file coordination.
- Seal Level 2 dual-model review: add once you have enough ARAs to calibrate the confidence floor.
Longer term
- ARA-native peer review protocol: reviewers attach signed review packets to artifacts, not to compiled paper views.
- Forking: fork an ARA repository as a starting point for a new experiment, with provenance preserved.
- Provenance tags on failure traces: let downstream agents query "which traces were generated under model-tier X and base compute Y" and selectively discount.
- (Human+AI)² Research Network: multiple ResearchOS instances sharing a common research landscape via a git remote, with merge-request style human gates on world-model updates.
Architecture follows the AgentOS blueprint: filesystem as state, rewards as signal, Git as institutional memory, Docker as reproducible runtime. The ARA is not a document format — it is the primary research object. The paper is a view.