@kanhua
Created May 17, 2026 14:27
# Knowledge system for AI-run scientific experiments
## Core idea
This note proposes a knowledge system for long-running scientific projects where AI agents can execute experiments: machine learning, computational physics, computational biology, and similar domains with automated evaluation loops. The pattern is drawn from a year of operating an Obsidian-vault-backed lab notebook alongside a Claude Code agent on an event-vision ML project, but the structure generalizes.
Auto-research systems such as [Sakana AI's AI Scientist](https://sakana.ai/ai-scientist/) and [Hugging Face's ml-intern](https://github.com/huggingface/ml-intern) already show how agents can generate ideas and execute individual experiments. For projects that run for months, the harder problem is keeping accumulated knowledge organized so both the agent and the human can return later and pick up the thread.
The system below treats the project notebook as a layered, queryable knowledge base. Each layer is defined by who writes it, who reads it, and how long it should remain useful. Agents and humans both add and retrieve knowledge, while a maintenance loop keeps high-level summaries coherent as raw experiment records accumulate.
## Structure of the knowledge system
The knowledge system has four content layers plus one human-in-the-loop layer.
- Schema layer — `CLAUDE.md`, `AGENTS.md`, or `README.md` at the vault/repo root. Declares the directory layout, naming conventions, frontmatter schema, citation conventions, language preferences (e.g. "status reports written in Japanese"), and any project-wide guardrails. The agent reads this on every session start; the human reads it once when joining the project. Concretely it specifies things like "experiment notes live under `labbook/experiments/{YYYY-MM-DDTHH.MM.SS}_{session_name}/`" and "all notes use YAML frontmatter with `tags`, `date`, `status` fields where `status ∈ {active, completed, stale, backlog}`".
- Wiki layer — curated index pages and discussion papers. These are the "entry points" of the system: a reader who lands here should be able to find every note relevant to a topic within one or two hops. Three sub-types:
  - Topic index notes: one per research subtopic, listing every related experiment, plan, and supporting note as wikilinks, with one-line descriptions and a status badge.
  - Discussion papers: long-form, self-contained writeups that summarize the approach, the architecture, the results progression, the bottlenecks, and the candidate next directions for a given workstream. Suitable as a standalone document or as a prompt to seed a new agent session.
  - Algorithm/method overview notes: stable references for "how does method X work", with parameter tables and component diagrams.
- Intermediate layer — experiment plans and research notes written *during* the work. This is where most of the daily writing happens. Three artifacts:
  - Plan: written before execution, describing the hypothesis, the change to make, the eval to run, and the expected outcome. For agent-run experiments, the plan is a filled template that the agent will execute autonomously.
  - Research note: the running log during the experiment. Records what was tried, what was observed, key metrics, and pointers down to the raw layer (W&B URLs, artifact paths, git SHAs).
  - Report: written once the experiment ends, summarizing the key finding and the next decision.
- Raw layer — the ground-truth artifacts. The notebook only *points* at this layer; the data itself lives elsewhere:
  - Code repository (registered as an additional directory in the agent's settings so it can read it without leaving the vault).
  - Experiment-tracking platform runs (W&B, MLflow, etc.), referenced by canonical run path.
  - Output artifacts on disk (checkpoints, predictions, evaluation CSVs).
  - Paper corpus, meeting minutes, slide decks, email threads, GitHub issues exported as notes.
  - Project requirements and stakeholder questionnaires.
Human-in-the-loop layer:
- Task notes — the "GitHub issues" of the system. Each task is one investigation or implementation thread ("try architecture X on dataset Y", "diagnose why metric M collapsed"). Tasks are how the human navigates the project: the human's mental model of the project is organized around tasks, not files. A task note contains an objective, a chronological log of attempts, links to the plans/research notes/reports under it, and a current status. New tasks are typically opened by the human; the agent appends progress under them.
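The frontmatter convention quoted in the schema layer (`tags`, `date`, `status`, with four allowed status values) lends itself to a mechanical check. The sketch below is one possible way to do that with the standard library only; the function names are hypothetical, and the parser is deliberately minimal (flat `key: value` lines), so a real vault with nested YAML would need a proper YAML library instead.

```python
import re

# Field names and status values are the ones quoted in the schema layer.
REQUIRED_FIELDS = ("tags", "date", "status")
ALLOWED_STATUS = {"active", "completed", "stale", "backlog"}


def parse_frontmatter(text: str) -> dict[str, str]:
    """Return the flat `key: value` pairs of a leading frontmatter block.

    Assumption: fields are single-line scalars or inline lists. Nested YAML
    is ignored rather than parsed.
    """
    match = re.match(r"\A---\n(.*?)\n---[ \t]*(?:\n|\Z)", text, flags=re.DOTALL)
    if match is None:
        return {}
    fields: dict[str, str] = {}
    for line in match.group(1).splitlines():
        if ":" not in line or line.startswith((" ", "#")):
            continue
        key, _, value = line.partition(":")
        # Drop trailing comments such as `status: active # active | completed`.
        fields[key.strip()] = value.split(" #", 1)[0].strip()
    return fields


def frontmatter_problems(text: str) -> list[str]:
    """List schema violations for one note; an empty list means it passes."""
    fields = parse_frontmatter(text)
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in fields]
    status = fields.get("status")
    if status is not None and status not in ALLOWED_STATUS:
        problems.append(f"invalid status: {status}")
    return problems
```

Running `frontmatter_problems` over every note at session start would give the agent a short list of notes to repair before it relies on `status` for anything.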
## Growth and maintenance of the knowledge system
The knowledge base grows through three operations, each of which can be driven by an agent or a human:
- Ingest: the experiment agent generates plans and research notes; the human files meeting minutes, ingests reference papers, and adds raw artifacts. Ingestion is append-only by default: automation never overwrites the body of an intermediate-layer note, and the lint operation below changes only its `status` frontmatter.
- Query — the human (often together with the agent) asks questions over the corpus: "what did we try with method X?", "which experiments are still active?", "what changed between v3 and v4 of the dataset?". Good questions and good answers are themselves valuable artifacts, and may produce new wiki entries or even a new task note. Without a query loop, the knowledge base silts up: it grows but stops being usable.
- Lint — a maintenance agent reads the lower layers and updates the upper layers. Concretely:
  - Reclassifies the `status` frontmatter of intermediate-layer notes based on inferred activity (e.g. "no edits in 7 days and has open checkboxes" → `stale`).
  - Refreshes topic index notes when new experiments land underneath them.
  - Regenerates a project dashboard (`_project_state.md`) that lists active work, recent experiments, and stale notes; this dashboard is the entry point for the next agent session.
  - Flags notes that have results but no key-finding section.
The lint loop is the most underappreciated part of the system. Without it, the wiki layer drifts out of sync within a few weeks and stops being trustworthy.
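As an illustration of the reclassification rule quoted above ("no edits in 7 days and has open checkboxes" → `stale`), here is a minimal sketch. The function name, its parameters, and the choice to leave `completed` and `backlog` notes alone are assumptions made for the example, not part of the described system.

```python
import re
from datetime import date, timedelta

# A markdown task item that is still unchecked, e.g. "- [ ] Group rare titles".
OPEN_CHECKBOX = re.compile(r"^\s*[-*] \[ \]", flags=re.MULTILINE)


def infer_status(body: str, last_edit: date, today: date, current: str,
                 stale_after_days: int = 7) -> str:
    """Return the status a lint pass would assign to one note."""
    # Assumption: lint never reopens finished notes or un-parks backlog items.
    if current in {"completed", "backlog"}:
        return current
    idle = today - last_edit
    has_open_items = OPEN_CHECKBOX.search(body) is not None
    if idle >= timedelta(days=stale_after_days) and has_open_items:
        return "stale"
    return current
```

Keeping the rule in one small pure function makes the threshold easy to review and to change per project.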
## Examples and practices
The examples below are reference patterns, not fixed templates. They show the *shape* of each artifact rather than any specific project's content; users should customize the details, fields, and conventions to fit their own project needs.
#### Research notes and plans
Every intermediate-layer note should let a reader (human or future agent) reconstruct *exactly* what happened. At minimum it should track:
1. Timestamp and a session/task slug in the filename (e.g. `2026-05-15T14.30.00_titanic_feature_baseline_note.md`). Colocating multiple files for one experiment in a `…_session_name/` folder keeps plan, note, report, and a results TSV together.
2. The git commit SHA(s) the experiment was run from. Without this you cannot tell whether the eval ran against the code you think it did.
3. Artifact directories (processed tables, predictions, evaluation outputs) as absolute paths, exactly as they exist on the compute host. Relative paths break as soon as the note moves or a command is run from a different working directory.
4. Run-tracker URLs or dataset IDs with the canonical identifier. For a Kaggle-style project, record the competition name, dataset version, and any notebook or MLflow run ID rather than only a human-readable title.
5. The exact command line that produced the results, including all flags. This is the single most useful thing for replication. Paste it inside a fenced code block.
Simulated example of a research-note metadata table:
```
| Field | Value |
| ----------------- | ------------------------------------------------------ |
| Session | 2026-05-15T14.30.00_titanic_feature_baseline |
| Branch | feature/titanic-baseline-features |
| Commit | a3f9c1d |
| Source dataset | kaggle/titanic@2026-05-01 |
| Processed data | /data/projects/titanic/features/20260515/baseline |
| Prediction file | /data/projects/titanic/predictions/20260515/submission.csv |
| MLflow run | titanic-survival/runs/abc12def |
| Eval command | uv run python scripts/train_titanic.py --config configs/logreg.yaml |
```
The body then has dated subsections — one per attempt — with a small results table per attempt and a `## Key Finding` section at the bottom that promotes the takeaway to the wiki-layer index notes.
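The session name and the metadata table can be generated rather than typed by hand. The following is a sketch under assumptions: the helper names are hypothetical, and `current_commit` simply shells out to `git rev-parse --short HEAD`, so it only works when called from inside the code repository.

```python
import subprocess
from datetime import datetime


def session_name(started: datetime, slug: str) -> str:
    """Build `{YYYY-MM-DDTHH.MM.SS}_{session_name}` for an experiment folder."""
    return f"{started.strftime('%Y-%m-%dT%H.%M.%S')}_{slug}"


def current_commit() -> str:
    """Short SHA of HEAD; must be called from inside the code repository."""
    out = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def metadata_table(fields: dict[str, str]) -> str:
    """Render a two-column markdown table, padded so it reads well as plain text."""
    key_w = max([len("Field"), *map(len, fields)])
    val_w = max([len("Value"), *map(len, fields.values())])
    rows = [
        f"| {'Field'.ljust(key_w)} | {'Value'.ljust(val_w)} |",
        f"| {'-' * key_w} | {'-' * val_w} |",
    ]
    rows += [f"| {k.ljust(key_w)} | {v.ljust(val_w)} |" for k, v in fields.items()]
    return "\n".join(rows)
```

An agent would typically call `metadata_table({"Session": ..., "Commit": current_commit(), ...})` once at the top of a new research note, so the commit recorded is the one actually checked out at run time.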
#### Auto-research plans
When the agent is going to run a multi-step experiment autonomously, the plan it executes is itself a versioned artifact. A good auto-research plan template includes:
- Goal in one sentence (so the agent knows when it has succeeded).
- Primary metric and improvement direction (e.g. "cross-validated accuracy, higher is better"). The agent uses this to rank trials.
- Baseline command — the exact shell command for one evaluation run. Everything the agent does is a variation on this.
- Context files to read before starting (the prior plan, the relevant wiki page, the experiment README from the previous run). Without this list the agent will rediscover context every session.
- Can-do / cannot-do lists — explicit guardrails. Examples: can edit `src/features/titanic.py`, cannot use the test labels or manually tune against the public leaderboard.
- Ordered hypotheses in 3–4 phases. Phase 1 targets the most-likely root cause; later phases are fallbacks. Each hypothesis names specific parameter values to try.
- Output format — the columns of the results TSV that the agent will append to after every trial. At minimum: `commit`, `description`, the primary metric, `status`. The TSV is what makes the plan reviewable in tabular form afterward.
- Loop pseudocode — the agent's inner loop, including the run-name template that ties each trial back to the plan (e.g. `{experiment-name}-trial-{n}`).
- Debugging hints — short, project-specific (e.g. "if a feature creates missing values, add an explicit imputation rule before training").
Crucially, the plan must be reviewed and signed off by a human *before* the agent executes it. A plan is cheap to revise; an autonomous run that went in the wrong direction is expensive to roll back.
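The results TSV and the run-name template from the list above can be sketched in a few lines. This is an illustration, not a prescribed implementation: `cv_accuracy` stands in for whatever primary metric the plan names, and the helper names are hypothetical.

```python
import csv
from pathlib import Path

# Minimum columns named in the plan template; `cv_accuracy` is a placeholder
# for the project's primary metric.
COLUMNS = ["commit", "description", "cv_accuracy", "status"]


def run_name(experiment: str, trial: int) -> str:
    """Tie a trial back to its plan: `{experiment-name}-trial-{n}`."""
    return f"{experiment}-trial-{trial}"


def append_trial(tsv_path: Path, row: dict[str, str]) -> None:
    """Append one trial to the results TSV, writing the header on first use."""
    missing = [c for c in COLUMNS if c not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    is_new = not tsv_path.exists() or tsv_path.stat().st_size == 0
    with tsv_path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                                extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def best_trial(tsv_path: Path, metric: str = "cv_accuracy") -> dict[str, str]:
    """Rank trials by the primary metric, assuming higher is better."""
    with tsv_path.open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    return max(rows, key=lambda r: float(r[metric]))
```

Because every trial is appended rather than rewritten, the TSV doubles as an audit trail: a reviewer can see discarded trials as well as the winning one.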
#### Index notes
Index notes are how the wiki layer stays navigable. One per research subtopic. The construction recipe:
1. Start from a small set of seed wikilinks under a `Sources:` heading. These can be hand-picked or pulled from a recent discussion paper.
2. Crawl the seeds' forward links and backlinks. Prefer link-graph traversal over keyword search — filename keywords produce false positives (unrelated notes whose titles happen to match) and false negatives (relevant notes whose titles don't contain the expected keyword).
3. Use `#experiment`-tagged notes as the anchors. Each anchor becomes a numbered section in the index, ordered chronologically by frontmatter `date`.
4. Nest supporting notes (plans, lab notes, eval results, brainstorms) under their parent anchor.
5. Append a flat metrics table — one row per evaluated configuration that produced numbers — so a reader can compare results across experiments without opening each note.
6. Append an "Open Threads" section listing concrete next actions found in the crawl. This is what makes the index a *living* document instead of a dead bibliography.
Simulated structure for one section:
```
### 3. Titanic passenger-title feature experiments (2026-05-12 → 2026-05-15)
- [[2026-05-15_titanic_title_features_note]] #experiment (status: completed)
  - Compared a plain logistic-regression baseline with variants that extract passenger titles from `Name`.
  - Key result: grouping rare titles improved 5-fold validation accuracy from 0.781 to 0.803.
  - Supporting notes:
    - [[2026-05-15_titanic_title_features_plan]]: pre-experiment plan
    - [[titanic_feature_dictionary]]: definitions for `Title`, `FamilySize`, and `IsAlone`
    - `labbook/experiments/2026-05-15T14.30.00_titanic_feature_baseline/`
```
The index note's frontmatter carries `tags: [index]`, `status: active`, and a `last_updated` field that the lint agent refreshes.
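Steps 1 and 2 of the construction recipe (start from seed wikilinks, then follow forward links and backlinks) amount to a bounded breadth-first walk over the link graph. A minimal sketch follows; it assumes the vault has already been loaded into a dict mapping note name to note text, and the function names and the `max_hops` limit are hypothetical.

```python
import re
from collections import deque

# Capture the target of [[target]], [[target|alias]] or [[target#heading]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def forward_links(text: str) -> set[str]:
    """Names of the notes that this text links to."""
    return {target.strip() for target in WIKILINK.findall(text)}


def crawl(notes: dict[str, str], seeds: list[str], max_hops: int = 2) -> set[str]:
    """Breadth-first walk over forward links and backlinks, starting from the seeds."""
    known = set(notes)
    forward = {name: forward_links(text) & known for name, text in notes.items()}
    backward: dict[str, set[str]] = {name: set() for name in notes}
    for source, targets in forward.items():
        for target in targets:
            backward[target].add(source)

    seen = {seed for seed in seeds if seed in notes}
    queue = deque((seed, 0) for seed in sorted(seen))
    while queue:
        name, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in forward[name] | backward[name]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return seen
```

Note that a note whose title merely contains the topic keyword is never picked up unless something actually links to or from it, which is the false-positive case the recipe warns about.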
#### Task notes
Task notes are the human's index into the project. A task note should answer, at a glance: what are we trying to do, why, what has been tried, where do I look for the latest state. Suggested structure:
```
---
tags: [task/experiment] # or task/investigation, task/impl
date: 2026-05-10
status: active # active | completed | stale | backlog
parent: [[Topic Index Note]]
---
## Objective
One paragraph. Why this task exists. What "done" looks like.
## Background
Pointers to prior notes the reader needs to make sense of this task.
- [[Prior experiment note]] — why we're following up
- [[Wiki page]] — method background
## Plan
A short list of the steps. The list can change; it's a working plan, not a contract.
## Log
2026-05-11 10:14 — Set up the training script; baseline logistic regression reaches 0.781 CV accuracy.
2026-05-12 16:02 — Added `FamilySize` and `IsAlone`. Results in [[2026-05-12T16.00.00_family_features_note]].
Small improvement; next, try extracting titles from passenger names.
2026-05-13 09:30 — Title extraction improved validation accuracy but introduced sparse rare-title categories. Filed [[task: group rare Titanic titles]].
## Linked artifacts
- [[2026-05-12T16.00.00_family_features_plan]]
- [[2026-05-12T16.00.00_family_features_note]]
## Next Steps
- [ ] Group rare passenger titles into a single `Rare` category
- [ ] Update the Titanic feature-engineering index after the title experiment is complete
```
The append-only `## Log` is the heart of the task note: it preserves the *order* in which things were tried, which is what you'll want to reconstruct in a status report three weeks later. The agent appends to the log every time it lands a related research note; the human appends every time they make a decision.
Naming convention worth borrowing: prefix task notes by type — e.g. `[EXP]`, `[IMPL]`, `[INV]` — so the directory listing telegraphs what each task is about.
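The append-only discipline of the `## Log` section is easy to enforce if agents write to it through one small helper instead of editing the note freely. The sketch below is an assumption about how that could look: the function name is hypothetical, and the entry format copies the `YYYY-MM-DD HH:MM` prefix and dash separator of the simulated log above.

```python
from datetime import datetime


def append_log_entry(note_text: str, when: datetime, message: str) -> str:
    """Insert a timestamped line at the end of `## Log`, leaving the rest untouched."""
    lines = note_text.splitlines()
    try:
        start = lines.index("## Log")
    except ValueError:
        raise ValueError("task note has no '## Log' section") from None
    # The section ends at the next level-2 heading, or at end of file.
    end = next(
        (i for i in range(start + 1, len(lines)) if lines[i].startswith("## ")),
        len(lines),
    )
    # Step back over trailing blank lines so the entry follows the last one.
    while end > start + 1 and not lines[end - 1].strip():
        end -= 1
    # \u2014 is the dash separator used by the existing log entries.
    entry = f"{when.strftime('%Y-%m-%d %H:%M')} \u2014 {message}"
    lines.insert(end, entry)
    return "\n".join(lines) + "\n"
```

Existing entries are never reordered or rewritten, so the chronological order the task note depends on is preserved by construction.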
## Paper corpus
The raw layer usually includes a literature pile. Treating it as just a folder of PDFs is a dead end — PDFs are not greppable, abstracts are not colocated with code, and an agent given a 50-paper folder will silently fail to find the right reference. The pattern that works:
- Convert every PDF to markdown once, and store the markdown next to the PDF. A short LLM enhancement pipeline — error detection → table repair → text cleanup → figure description — turns the messy `pdf-to-markdown` output into something an agent can actually read.
- Each paper gets a structured `meta.toml` (title, first author, year, venue, DOI, arXiv ID, canonical URL, plus a one-word `abbreviated_name`). Metadata is filled by a chain of enrichment scripts that hit arXiv / OpenReview / Semantic Scholar APIs in turn — manual entry is reserved for the fields the APIs don't return.
- Extracted figures are saved alongside the markdown and referenced inline (e.g. `images/source.pdf-<page>-<offset>.png`), so an agent reading the markdown can also see the figures.
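The "chain of enrichment scripts" idea can be sketched as a list of fetchers tried in order, where each fetcher may only fill fields that are still empty. Everything here is illustrative: the key names other than `abbreviated_name` are assumed spellings of the fields listed above, and the fetchers are plain callables standing in for real arXiv, OpenReview, or Semantic Scholar clients, which this sketch does not implement.

```python
from typing import Callable, Optional

Metadata = dict[str, Optional[str]]
Fetcher = Callable[[Metadata], Metadata]

# Assumed key names for the fields of `meta.toml`.
FIELDS = ("title", "first_author", "year", "venue", "doi",
          "arxiv_id", "url", "abbreviated_name")


def enrich(meta: Metadata, fetchers: list[Fetcher]) -> tuple[Metadata, list[str]]:
    """Run fetchers in order and return (metadata, fields left for manual entry)."""
    result: Metadata = {name: meta.get(name) for name in FIELDS}
    for fetch in fetchers:
        if all(result.values()):
            break  # nothing left to look up
        # Each fetcher sees a copy, so it cannot overwrite earlier answers.
        for name, value in fetch(dict(result)).items():
            if name in result and not result[name] and value:
                result[name] = value
    still_missing = [name for name in FIELDS if not result[name]]
    return result, still_missing
```

The second return value is the list of fields no source could supply, which is exactly the part reserved for manual entry.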
## Tooling that closes the loop
A few small agent skills make the system self-sustaining. Each is a thin recipe over the conventions above; together they remove most of the maintenance overhead:
- A *setup-experiment* skill that takes a one-line goal and produces a filled plan, a research-note skeleton, and a results-TSV with the correct header — all under the canonical experiment directory. Removes the friction of starting a new experiment.
- A *note-triage* skill that scans the vault, infers the right `status` for each intermediate-layer note from activity signals (last timestamp, open checkboxes, presence of a key-finding section), and updates the frontmatter. Run weekly.
- A *project-state-generator* skill that produces a single `_project_state.md` dashboard at the vault root, listing active work, recent experiments, stale notes, and the project glossary. This file is what a fresh agent session reads first.
- A *build-topic-index / update-topic-index* pair that creates and maintains wiki-layer index notes from the seeds-and-crawl recipe above.
- A *daily-worklog* skill that scans for timestamps matching a given date and assembles a daily summary, so the human doesn't have to copy-paste links into the daily note.
- A *capture-workflow* skill that periodically snapshots how the research→experiment→reporting loop is actually being run, so the conventions themselves stay documented and reviewable.
The pattern across all of them: the human writes notes their natural way (append-only, timestamped, wikilinked); the skills do the bookkeeping that would otherwise rot.
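To make the bookkeeping concrete, here is a rough sketch of what the core of a *daily-worklog* skill might do: scan notes for lines that begin with a given date and group the hits by source note. It assumes notes are already loaded as a name-to-text dict and that log lines start with `YYYY-MM-DD HH:MM` (or the `T`/dot variant used in filenames); the function name and output layout are hypothetical.

```python
import re
from datetime import date


def daily_worklog(notes: dict[str, str], day: date) -> str:
    """Collect every line that starts with the given date, grouped by source note."""
    stamp = re.compile(rf"^{day.isoformat()}[ T]\d{{2}}[:.]\d{{2}}")
    sections = []
    for name in sorted(notes):
        hits = [
            line.strip()
            for line in notes[name].splitlines()
            if stamp.match(line.strip())
        ]
        if hits:
            # Link back to the source note so the summary stays navigable.
            sections.append(f"- [[{name}]]\n" + "\n".join(f"  - {hit}" for hit in hits))
    return "\n".join([f"## Worklog {day.isoformat()}", *sections])
```

The output is itself a wikilinked markdown fragment, so it can be pasted into the daily note without further editing.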
## References
- [LLM Wiki by Karpathy](https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f)
- [Autoresearch by Karpathy](https://github.com/karpathy/autoresearch)
- [Sakana AI Scientist](https://sakana.ai/ai-scientist/)
- [Hugging Face ml-intern](https://github.com/huggingface/ml-intern)