Self-review of 2026-04-20 flagged four architectural concerns that don't lend themselves to a one-line bugfix. Each is captured below with context, options, and a recommendation. None of these are implemented yet; this is the design conversation, checked in so we don't lose it.
`call_agent_structured` with tools runs two phases:

- `generateText` with tools — agent reads files and writes research text.
- `generateObject` — agent emits structured JSON.
The naive phase-2 input was `[...original_context, research_text, "now emit JSON"]`. The original context already paid tokens in phase 1 AND is reshipped for phase 2, doubling input tokens per agent turn.
Phase 1 is a fixed-cost floor (tool round-trips); phase 2 is the one we can shrink.
Phase 2 sees only the last user message plus the research synthesis. The system prompt still carries invariants/role/draft because `system` is reused. This covers ~80% of the saving — the bulk of the duplication was the jouster/specialist context block appended to messages.
The research synthesis itself can be large (10-30KB for a thorough tool-use loop). That's a one-sided doubling — phase 1 paid for the tool-result bytes, phase 2 pays for the agent's summary of them. For a 200KB tool-use trace with a 20KB summary, the extra 20KB is acceptable. For smaller traces it's noise.
- A. Single-phase `generateObject` + tools — the AI SDK's `generateObject` doesn't support tools. `generateText` with `experimental_output` does support both simultaneously. We already depend on that path shape; swapping `generateObject` for `generateText({ experimental_output: Output.object({ schema }), tools })` would collapse the two phases into one. Risk: the model is asked to tool-use AND emit JSON in one loop, which historically produced worse compliance than a dedicated structured-output pass. Worth a spike.
- B. Server-side caching (Anthropic prompt caching) — the `system` block is stable across phase 1 and phase 2. With Anthropic's `cache_control` marker, phase 2 reads the system block from cache at ~10% of input cost. Zero code-complexity change beyond adding the header. Does nothing for Gemini.
- C. Incremental summarization — rather than feed the whole research text into phase 2, summarize tool results inline as they arrive, keep a running digest, and feed that. Cheap in theory, brittle in practice (each extra summarization pass introduces drift).
Do B (prompt caching) next; it's a one-line addition for Anthropic agents and doubles as a latency win. Explore A once we have a smoke test for mixed tool-use + structured output compliance.
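What B amounts to in code, sketched as a helper that annotates the shared system message. The `providerOptions.anthropic.cacheControl` shape is an assumption about the AI SDK Anthropic provider's cache marker; verify against the `@ai-sdk/anthropic` docs before relying on it:

```ts
// Mark the stable system block as cacheable so phase 2 re-reads it at the
// discounted cache rate. The option shape below is an assumption; check the
// @ai-sdk/anthropic documentation for the current spelling.
type SystemMessage = {
  role: "system";
  content: string;
  providerOptions?: { anthropic: { cacheControl: { type: "ephemeral" } } };
};

function withPromptCache(system: string): SystemMessage {
  return {
    role: "system",
    content: system,
    providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
  };
}
```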
When a specialist's mutation violates an invariant, the old code committed the rejection and moved on. Jousters get `max_retries` attempts with violation feedback; specialists got zero. A specialist's first draft doesn't have to be perfect — they should have the same chance to revise as a jouster does.

Specialists now run in a retry loop that mirrors the jouster loop: compile context, call, lint; if invalid, feed back "YOUR PREVIOUS ATTEMPT WAS REJECTED" plus the violations and retry up to `max_retries`. The summoning jouster is still logged as the originator of the work.
The specialist retry loop lives inline inside the jouster `for` loop. It's a copy-paste of the jouster retry loop. P2-19 (test coverage) would land cleanly if this were extracted into `execute_specialist_mutation()` and `execute_jouster_mutation()` — both become testable in isolation.
- A. Extract to `src/mutate.ts` — a single `run_mutation_with_retry` function parameterized over (agent, role, context-compile fn). Both jouster and specialist call sites collapse into one-liners. Reduces the `run.ts` loop from ~650 lines to ~400 and makes it testable.
- B. Shared retry policy via config — a `specialist_max_retries` separate from `max_retries`. Specialists tend to be more narrowly scoped and may want a tighter retry budget (e.g., 2 instead of 3).
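Option A's shape, sketched with injected dependencies so both call sites (and tests) can fake the agent call and the linter. All names here are hypothetical, not the current `run.ts` API:

```ts
type LintResult = { valid: boolean; violations: string[] };

interface MutationDeps {
  // Call sites differ only in how they compile context, so inject it;
  // prior violations are threaded back in for retry attempts.
  compileContext: (attempt: number, priorViolations: string[]) => string;
  callAgent: (context: string) => Promise<string>;
  lint: (draft: string) => Promise<LintResult>;
}

// One initial attempt plus up to maxRetries retries with violation feedback.
async function runMutationWithRetry(
  deps: MutationDeps,
  maxRetries: number,
): Promise<{ draft: string; valid: boolean; violations: string[] }> {
  let violations: string[] = [];
  let draft = "";
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    draft = await deps.callAgent(deps.compileContext(attempt, violations));
    const result = await deps.lint(draft);
    if (result.valid) return { draft, valid: true, violations: [] };
    violations = result.violations; // fed back as "YOUR PREVIOUS ATTEMPT WAS REJECTED"
  }
  return { draft, valid: false, violations };
}
```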
Do A. The retry-loop duplication is a symptom of `run.ts` being too monolithic; pulling the mutation pipeline out is the path forward for coverage. Defer B until we have usage data.
`MAX_SUMMONS_PER_ROUND = 1` caps specialist invocations per round. When a jouster summoned a specialist after the cap was reached, the summon was silently dropped. The panel's own judgment that "we need X here" was discarded.

A FIFO `deferred_summons: { specialist, ask, requested_by }[]` queue at run scope. Rejected summons append to it; each round start logs the pending queue so the operator sees what's backlogged. It still doesn't execute them — execution is blocked on the mutation-pipeline extraction from item 2.
- A. Consume one queued summon per round — at round start, pop one deferred summon and execute it against the current snowball. Counts against `MAX_SUMMONS_PER_ROUND`, so the backlog drains at one per round. Requires item 2's extraction. This is the natural completion of the hotfix.
- B. Dedupe the queue — a given specialist may be summoned multiple times with the same ask. Collapse duplicates so a high-traffic queue doesn't stall on redundant work.
- C. Priority — summons come with an implicit priority (the panel asked for security before asking for UX). Score and sort rather than FIFO. Risks over-engineering; FIFO is fine until it isn't.
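The hotfix plus options A and B together, as a run-scoped queue sketch. The class name is hypothetical and dedupe is keyed on (specialist, ask), ignoring the requester:

```ts
type DeferredSummon = { specialist: string; ask: string; requested_by: string };

// FIFO queue of summons rejected by MAX_SUMMONS_PER_ROUND. push() dedupes on
// (specialist, ask); popOne() drains one summon per round; pending() is what
// round-start logging would print.
class DeferredSummonQueue {
  private items: DeferredSummon[] = [];

  push(summon: DeferredSummon): void {
    const dup = this.items.some(
      (s) => s.specialist === summon.specialist && s.ask === summon.ask,
    );
    if (!dup) this.items.push(summon);
  }

  popOne(): DeferredSummon | undefined {
    return this.items.shift();
  }

  pending(): readonly DeferredSummon[] {
    return this.items;
  }
}
```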
Do A after item 2 lands. B when dedupe volume matters (probably never for typical 1-3-round runs). Skip C entirely.
`src/lint.ts` asks the main agent to judge whether a mutation violates invariants. Invariants are written as natural-language MUST / SHOULD / MUST_NOT rules. The model decides — which means two runs over the same (draft, invariants) pair can diverge. A well-intentioned retry that was "valid" on attempt N gets rejected on attempt N+1 for the same content.
This is exactly the problem tools-in-loop was meant to avoid for research, applied again to judgment. We're using the stochastic instrument as if it were deterministic.
Many invariants are structural: "draft MUST contain a section heading `## Risks`" is a `/^## Risks/m` test. Others are semantic: "draft MUST NOT assume SQL Server when context is PostgreSQL" — impossible to regex; it requires judgment.
A pure-deterministic linter misses the semantic invariants; a pure-LLM linter is a coin flip on the deterministic ones.
- Invariant syntax upgrade — extend `invariants.MUST` from a `string[]` to a tagged union: `{ kind: "pattern", rule: "must contain ## Risks", pattern: "^## Risks" }` or `{ kind: "semantic", rule: "must not assume SQL Server", hint?: "..." }`. Bootstrap writes `kind: "semantic"` by default (status quo preserved); the operator can promote rules to `pattern` for determinism.
- Two-stage lint pipeline:
  - Run all `kind: "pattern"` rules as regexes. Fail fast on any miss. No LLM call; deterministic and instant.
  - For the remaining `kind: "semantic"` rules, call the main agent with only those rules in scope. Smaller surface = more stable judgment.
- Judgment stabilization — for the semantic pass:
  - Temperature forced to 0 (we already default to 0.2, but lint is the one place where determinism matters more than variety).
  - An optional "two-model quorum": two different provider models must both mark a rule violated for the verdict to stick. Doubles lint cost but eliminates single-model idiosyncrasies.
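The deterministic stage is small enough to show whole. A sketch assuming the `kind: "pattern"` rule shape described above (function name hypothetical):

```ts
type PatternRule = { kind: "pattern"; rule: string; pattern: string };

// Stage 1 of the lint pipeline: evaluate every pattern rule as a multiline
// regex against the draft. No LLM call; a miss returns the rule's
// human-readable text as the violation message.
function lintPatterns(draft: string, rules: PatternRule[]): string[] {
  return rules
    .filter((r) => !new RegExp(r.pattern, "m").test(draft))
    .map((r) => r.rule);
}
```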
- Invariants are already fiddly to write well; adding a `kind` field raises the floor. Bootstrap has to do the categorization.
- Regex rules are themselves a source of false positives ("must contain
Ship a `kind: "pattern"` path first, with the pattern authored by bootstrap when it's obvious from the rule text (e.g., "must contain the phrase X" → regex on X). Leave semantic rules as-is. Quorum is premature until we have data showing single-model lint is actually the bottleneck.
Joust started as an RFC tool, so the schema hardcodes `invariants: { MUST, SHOULD, MUST_NOT }` and the lint prompt reads those rules back as normative assertions. That shape works for RFC-like drafts, where "the system must X" is the native grammar.
It's a poor fit for other work joust is actually good at:
- discovery — "what are we even deciding?" has no MUSTs yet, only open questions. A rule-shaped anchor forces the bootstrap to invent constraints before the domain is understood.
- style/voice polish — "this reads like a brochure; make it read like an engineer wrote it" is a similarity target, not a rule list.
- optimization — "get p99 under 40ms" is a goal with constraints, not a set of MUSTs. Expressing it as `MUST: p99 < 40ms` loses the optimization character (is 50ms a hard failure or just progress lost?).
- comparison — "pick the better option between A and B" has no rules, only a scoring rubric.
The invariant shape isn't wrong — it's one kind of anchor. The job of the anchor is to be the durable thing a mutation is judged against. Other strategies fit the same slot.
| Shape | What it is | Lint check | Fits |
|---|---|---|---|
| invariants | MUST / SHOULD / MUST_NOT (current) | rule compliance | RFCs, specs, contracts |
| acceptance | "done when [list of questions answered]" | coverage of each item | discovery, scoping |
| exemplars | positive + negative examples of good/bad | similarity to positive, distance from negative | style, voice, tone |
| assertions | literal tests: "draft contains X", claim(Y) | run the tests | code drafts, factual docs |
| rubric | weighted dimensions, each 0-5, with rationale | score each dim; reject if any regresses | long polish, editorial |
| goal+constraints | one north-star + flat constraint list | goal-progress score; constraint violations | optimization, trade studies |
Rename invariants (the field, not everything it touches) to `anchor` and make it a tagged union:

```ts
type Anchor =
  | { kind: "invariants"; MUST: string[]; SHOULD: string[]; MUST_NOT: string[] }
  | { kind: "acceptance"; items: { question: string; done_when: string }[] }
  | { kind: "exemplars"; positive: string[]; negative: string[] }
  | { kind: "assertions"; checks: { description: string; predicate: string }[] }
  | { kind: "rubric"; dims: { name: string; weight: number; scale: string }[] }
  | { kind: "goal"; north_star: string; constraints: string[] };
```

Bootstrap picks the kind from the prompt. Heuristics:
- Prompt contains "RFC", "spec", "design doc", "requirements" → `invariants`.
- Prompt contains "explore", "figure out", "what are we" → `acceptance`.
- Prompt contains "rewrite", "polish", "sounds like" + reference text → `exemplars`.
- Prompt contains "optimize", "reduce", "minimize", "under N" → `goal`.
- Default when unclear → `acceptance` (weakest, least wrong).
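The heuristics above, as a first-match keyword table. This is a sketch only: the real bootstrap would classify via the model, and `classifyAnchorKind` is a hypothetical name covering the four keyword-matched kinds:

```ts
type AnchorKind = "invariants" | "acceptance" | "exemplars" | "goal";

// First matching keyword set wins; acceptance is the least-wrong default.
function classifyAnchorKind(prompt: string): AnchorKind {
  const p = prompt.toLowerCase();
  const table: [AnchorKind, string[]][] = [
    ["invariants", ["rfc", "spec", "design doc", "requirements"]],
    ["acceptance", ["explore", "figure out", "what are we"]],
    ["exemplars", ["rewrite", "polish", "sounds like"]],
    ["goal", ["optimize", "reduce", "minimize", "under "]],
  ];
  for (const [kind, keys] of table) {
    if (keys.some((k) => p.includes(k))) return kind;
  }
  return "acceptance";
}
```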
The operator can override with `joust /init --anchor=rubric "..."`.
`lint.ts` becomes a thin dispatcher:

```ts
switch (snowball.anchor.kind) {
  case "invariants": return lint_invariants(main, snowball, draft, ...);
  case "acceptance": return lint_acceptance(main, snowball, draft, ...);
  case "exemplars":  return lint_exemplars(main, snowball, draft, ...);
  case "assertions": return lint_assertions(main, snowball, draft, ...); // deterministic where possible
  case "rubric":     return lint_rubric(main, snowball, draft, ...);
  case "goal":       return lint_goal(main, snowball, draft, ...);
}
```

Each variant returns the same `LintResult` shape — `{ valid, violations, should_violations? }` — so the run loop, retry feedback, and history format don't change.
`kind: "assertions"` is the deterministic path from item 4, generalized. A pattern rule ("draft must contain `## Risks`") is just an assertion with a regex predicate. The hybrid design for invariants falls out as: invariants with a pattern hint get lifted into an assertions sidecar and run deterministically; the rest go through the semantic LLM pass. Both items converge on the same mechanism.
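How that lift could look, assuming item 4's tagged invariant shape. `splitInvariants` and the field names are illustrative, not committed API:

```ts
type Invariant =
  | { kind: "pattern"; rule: string; pattern: string }
  | { kind: "semantic"; rule: string; hint?: string };

type AssertionCheck = { description: string; predicate: string };

// Split one invariant list into a deterministic assertions sidecar and the
// residue that still needs the semantic LLM pass.
function splitInvariants(invariants: Invariant[]): {
  assertions: AssertionCheck[];
  semantic: Invariant[];
} {
  const assertions = invariants
    .filter((i): i is Extract<Invariant, { kind: "pattern" }> => i.kind === "pattern")
    .map((i) => ({ description: i.rule, predicate: i.pattern }));
  const semantic = invariants.filter((i) => i.kind === "semantic");
  return { assertions, semantic };
}
```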
The `critique_trail` and `resolved_decisions` stay the same — they're anchor-agnostic (they're a log of what happened, not what's required). Compaction stays the same. The `human_directives` channel stays the same.
The only schema change is `invariants` → `anchor`. That's a hard-cutover rename (we've done two of those this week).
- Bootstrap picks wrong. An RFC prompt classified as `acceptance` produces a weak lint that accepts loose drafts. Mitigation: make the operator's `--anchor=` override visible in `status`, log the choice loudly at init, and let the operator re-anchor mid-run (new first-class command: `joust /reanchor kind=invariants`).
- Anchor drift. Some tasks legitimately shift shape — discovery turns into spec work once the scope is pinned. A reanchor command handles this; the history entry records the switch.
- Specialist anchors. A security specialist summoned during `kind: "goal"` work still wants rule-shaped output ("MUST validate X"). Solution: specialists always emit `kind: "invariants"` critique, and those get merged into the critique_trail as usual — the specialist's findings don't have to match the drafting anchor.
Land in three phases:
- Phase 1 — rename `invariants` → `anchor`, wrap the existing field as `{ kind: "invariants", ... }`. Zero behavior change. One commit, trivially reviewable.
- Phase 2 — add `kind: "acceptance"` end-to-end (bootstrap classification, context compile, lint). Ship it as an opt-in via `--anchor=acceptance`. Run real drafts through it for a week.
- Phase 3 — add the remaining four shapes based on actual demand, not speculation. `exemplars` and `goal` are the next most likely; `assertions` arrives for free with item 4's hybrid linter; `rubric` is the last mile for editorial-heavy work.
This is the single most leveraged change on the list — it doesn't fix a bug, it expands what joust can be pointed at.
- P3-23 CLI flags — `--dry-run`, `--verbose`, shell completions. Nice-to-haves, not blocking any user. Pick up during a UX sweep.