Self-review of 2026-04-20 flagged four architectural concerns that don't lend themselves to a one-line bugfix. Each is captured below with context, options, and a recommendation. None of these are implemented yet; this is the design conversation, checked in so we don't lose it.
`call_agent_structured` with tools runs two phases:

- `generateText` with tools — agent reads files and writes research text.
- `generateObject` — agent emits structured JSON.
The naive phase-2 input was `[...original_context, research_text, "now emit JSON"]`. The original context already paid tokens in phase 1 AND is reshipped for phase 2, doubling input tokens per agent turn.
Phase 1 is a fixed-cost floor (tool round-trips); phase 2 is the one we can shrink.
Phase 2 sees only the last user message plus the research synthesis. The system prompt still carries invariants/role/draft because `system` is reused. This covers ~80% of the saving — the bulk of the duplication was the jouster/specialist context block appended to messages.
The research synthesis itself can be large (10-30KB for a thorough tool-use loop). That's a one-sided doubling — phase 1 paid for the tool-result bytes, phase 2 pays for the agent's summary of them. For a 200KB tool-use trace with a 20KB summary, the extra 20KB is acceptable. For smaller traces it's noise.
- A. Single-phase `generateObject` + tools — the AI SDK's `generateObject` doesn't support tools. `generateText` with `experimental_output` does support both simultaneously. We already depend on that path shape; swapping `generateObject` for `generateText({ experimental_output: Output.object({ schema }), tools })` would collapse the two phases into one. Risk: the model is asked to tool-use AND emit JSON in one loop, which historically produced worse compliance than a dedicated structured-output pass. Worth a spike.
- B. Server-side caching (Anthropic prompt caching) — the `system` block is stable across phase 1 and phase 2. With Anthropic's `cache_control` marker, phase 2 reads the system block from cache at ~10% of input cost. Zero code-complexity change beyond adding the header. Does nothing for Gemini.
- C. Incremental summarization — rather than feed the whole research text into phase 2, summarize tool results inline as they arrive, keep a running digest, and feed that. Cheap in theory, brittle in practice (each extra summarization pass introduces drift).
Do B (prompt caching) next; it's a one-line addition for Anthropic agents and doubles as a latency win. Explore A once we have a smoke test for mixed tool-use + structured output compliance.
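What B amounts to in code, sketched as a helper that annotates the shared system message. The `providerOptions.anthropic.cacheControl` shape is an assumption about the AI SDK Anthropic provider's cache marker; verify against the `@ai-sdk/anthropic` docs before relying on it:

```ts
// Mark the stable system block as cacheable so phase 2 re-reads it at the
// discounted cache rate. The option shape below is an assumption; check the
// @ai-sdk/anthropic documentation for the current spelling.
type SystemMessage = {
  role: "system";
  content: string;
  providerOptions?: { anthropic: { cacheControl: { type: "ephemeral" } } };
};

function withPromptCache(system: string): SystemMessage {
  return {
    role: "system",
    content: system,
    providerOptions: { anthropic: { cacheControl: { type: "ephemeral" } } },
  };
}
```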
When a specialist's mutation violates an invariant, the old code committed the rejection and moved on. Jousters get `max_retries` attempts with violation feedback; specialists got zero. A specialist's first draft doesn't have to be perfect — they should have the same chance to revise as a jouster does.

Specialists now run in a retry loop that mirrors the jouster loop: compile context, call, lint; if invalid, feed back "YOUR PREVIOUS ATTEMPT WAS REJECTED" plus the violations and retry up to `max_retries`. The summoning jouster is still logged as the originator of the work.
The specialist retry loop lives inline inside the jouster `for` loop. It's a copy-paste of the jouster retry loop. P2-19 (test coverage) would land cleanly if this were extracted into `execute_specialist_mutation()` and `execute_jouster_mutation()` — both become testable in isolation.
- A. Extract to `src/mutate.ts` — a single `run_mutation_with_retry` function parameterized over (agent, role, context-compile fn). Both jouster and specialist call sites collapse into one-liners. Reduces the `run.ts` loop from ~650 lines to ~400 and makes it testable.
- B. Shared retry policy via config — a `specialist_max_retries` separate from `max_retries`. Specialists tend to be more narrowly scoped and may want a tighter retry budget (e.g., 2 instead of 3).
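Option A's shape, sketched with injected dependencies so both call sites (and tests) can fake the agent call and the linter. All names here are hypothetical, not the current `run.ts` API:

```ts
type LintResult = { valid: boolean; violations: string[] };

interface MutationDeps {
  // Call sites differ only in how they compile context, so inject it;
  // prior violations are threaded back in for retry attempts.
  compileContext: (attempt: number, priorViolations: string[]) => string;
  callAgent: (context: string) => Promise<string>;
  lint: (draft: string) => Promise<LintResult>;
}

// One initial attempt plus up to maxRetries retries with violation feedback.
async function runMutationWithRetry(
  deps: MutationDeps,
  maxRetries: number,
): Promise<{ draft: string; valid: boolean; violations: string[] }> {
  let violations: string[] = [];
  let draft = "";
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    draft = await deps.callAgent(deps.compileContext(attempt, violations));
    const result = await deps.lint(draft);
    if (result.valid) return { draft, valid: true, violations: [] };
    violations = result.violations; // fed back as "YOUR PREVIOUS ATTEMPT WAS REJECTED"
  }
  return { draft, valid: false, violations };
}
```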
Do A. The retry-loop duplication is a symptom of `run.ts` being too monolithic; pulling the mutation pipeline out is the path forward for coverage. Defer B until we have usage data.
`MAX_SUMMONS_PER_ROUND = 1` caps specialist invocations per round. When a jouster summoned a specialist after the cap was reached, the summon was silently dropped. The panel's own judgment that "we need X here" was discarded.

A FIFO `deferred_summons: { specialist, ask, requested_by }[]` queue at run scope. Rejected summons append to it; each round start logs the pending queue so the operator sees what's backlogged. It still doesn't execute them — execution is blocked on the mutation-pipeline extraction from item 2.
- A. Consume one queued summon per round — at round start, pop one deferred summon and execute it against the current snowball. Counts against `MAX_SUMMONS_PER_ROUND`, so the backlog drains at one per round. Requires item 2's extraction. This is the natural completion of the hotfix.
- B. Dedupe the queue — a given specialist may be summoned multiple times with the same ask. Collapse duplicates so a high-traffic queue doesn't stall on redundant work.
- C. Priority — summons come with an implicit priority (the panel asked for security before asking for UX). Score and sort rather than FIFO. Risks over-engineering; FIFO is fine until it isn't.
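The hotfix plus options A and B together, as a run-scoped queue sketch. The class name is hypothetical and dedupe is keyed on (specialist, ask), ignoring the requester:

```ts
type DeferredSummon = { specialist: string; ask: string; requested_by: string };

// FIFO queue of summons rejected by MAX_SUMMONS_PER_ROUND. push() dedupes on
// (specialist, ask); popOne() drains one summon per round; pending() is what
// round-start logging would print.
class DeferredSummonQueue {
  private items: DeferredSummon[] = [];

  push(summon: DeferredSummon): void {
    const dup = this.items.some(
      (s) => s.specialist === summon.specialist && s.ask === summon.ask,
    );
    if (!dup) this.items.push(summon);
  }

  popOne(): DeferredSummon | undefined {
    return this.items.shift();
  }

  pending(): readonly DeferredSummon[] {
    return this.items;
  }
}
```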
Do A after item 2 lands. B when dedupe volume matters (probably never for typical 1-3-round runs). Skip C entirely.
`src/lint.ts` asks the main agent to judge whether a mutation violates invariants. Invariants are written as natural-language MUST / SHOULD / MUST_NOT rules. The model decides — which means two runs over the same (draft, invariants) pair can diverge. A well-intentioned retry that was "valid" on attempt N gets rejected on attempt N+1 for the same content.
This is exactly the problem tools-in-loop was meant to avoid for research, applied again to judgment. We're using the stochastic instrument as if it were deterministic.
Many invariants are structural: "draft MUST contain a section heading `## Risks`" is a `/^## Risks/m` test. Others are semantic: "draft MUST NOT assume SQL Server when context is PostgreSQL" — impossible to regex; it requires judgment.
A pure-deterministic linter misses the semantic invariants; a pure-LLM linter is a coin flip on the deterministic ones.
- Invariant syntax upgrade — extend `invariants.MUST` from a `string[]` to a tagged union: `{ kind: "pattern", rule: "must contain ## Risks", pattern: "^## Risks" }` or `{ kind: "semantic", rule: "must not assume SQL Server", hint?: "..." }`. Bootstrap writes `kind: "semantic"` by default (status quo preserved); the operator can promote rules to `pattern` for determinism.
- Two-stage lint pipeline:
  - Run all `kind: "pattern"` rules as regexes. Fail fast on any miss. No LLM call; deterministic and instant.
  - For the remaining `kind: "semantic"` rules, call the main agent with only those rules in scope. Smaller surface = more stable judgment.
- Judgment stabilization — for the semantic pass:
  - Temperature forced to 0 (we already default to 0.2, but lint is the one place where determinism matters more than variety).
  - An optional "two-model quorum": two different provider models must both mark a rule violated for the verdict to stick. Doubles lint cost but eliminates single-model idiosyncrasies.
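The deterministic stage is small enough to show whole. A sketch assuming the `kind: "pattern"` rule shape described above (function name hypothetical):

```ts
type PatternRule = { kind: "pattern"; rule: string; pattern: string };

// Stage 1 of the lint pipeline: evaluate every pattern rule as a multiline
// regex against the draft. No LLM call; a miss returns the rule's
// human-readable text as the violation message.
function lintPatterns(draft: string, rules: PatternRule[]): string[] {
  return rules
    .filter((r) => !new RegExp(r.pattern, "m").test(draft))
    .map((r) => r.rule);
}
```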
- Invariants are already fiddly to write well; adding a `kind` field raises the floor. Bootstrap has to do the categorization.
- Regex rules are themselves a source of false positives ("must contain
Ship a `kind: "pattern"` path first, with the pattern authored by bootstrap when it's obvious from the rule text (e.g., "must contain the phrase X" → regex on X). Leave semantic rules as-is. Quorum is premature until we have data showing single-model lint is actually the bottleneck.
Joust started as an RFC tool, so the schema hardcodes `invariants: { MUST, SHOULD, MUST_NOT }` and the lint prompt reads those rules back as normative assertions. That shape works for RFC-like drafts, where "the system must X" is the native grammar.
It's a poor fit for other work joust is actually good at:
- discovery — "what are we even deciding?" has no MUSTs yet, only open questions. A rule-shaped anchor forces the bootstrap to invent constraints before the domain is understood.
- style/voice polish — "this reads like a brochure; make it read like an engineer wrote it" is a similarity target, not a rule list.
- optimization — "get p99 under 40ms" is a goal with constraints, not a set of MUSTs. Expressing it as `MUST: p99 < 40ms` loses the optimization character (is 50ms a hard failure or just progress lost?).
- comparison — "pick the better option between A and B" has no rules, only a scoring rubric.
The invariant shape isn't wrong — it's one kind of anchor. The job of the anchor is to be the durable thing a mutation is judged against. Other strategies fit the same slot.
| Shape | What it is | Lint check | Fits |
|---|---|---|---|
| invariants | MUST / SHOULD / MUST_NOT (current) | rule compliance | RFCs, specs, contracts |
| acceptance | "done when [list of questions answered]" | coverage of each item | discovery, scoping |
| exemplars | positive + negative examples of good/bad | similarity to positive, distance from negative | style, voice, tone |
| assertions | literal tests: "draft contains X", claim(Y) | run the tests | code drafts, factual docs |
| rubric | weighted dimensions, each 0-5, with rationale | score each dim; reject if any regresses | long polish, editorial |
| goal+constraints | one north-star + flat constraint list | goal-progress score; constraint violations | optimization, trade studies |
Rename invariants (the field, not everything it touches) to `anchor` and make it a tagged union:

```ts
type Anchor =
  | { kind: "invariants"; MUST: string[]; SHOULD: string[]; MUST_NOT: string[] }
  | { kind: "acceptance"; items: { question: string; done_when: string }[] }
  | { kind: "exemplars"; positive: string[]; negative: string[] }
  | { kind: "assertions"; checks: { description: string; predicate: string }[] }
  | { kind: "rubric"; dims: { name: string; weight: number; scale: string }[] }
  | { kind: "goal"; north_star: string; constraints: string[] };
```

Bootstrap picks the kind from the prompt. Heuristics:
- Prompt contains "RFC", "spec", "design doc", "requirements" → `invariants`.
- Prompt contains "explore", "figure out", "what are we" → `acceptance`.
- Prompt contains "rewrite", "polish", "sounds like" + reference text → `exemplars`.
- Prompt contains "optimize", "reduce", "minimize", "under N" → `goal`.
- Default when unclear → `acceptance` (weakest, least wrong).
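The heuristics above, as a first-match keyword table. This is a sketch only: the real bootstrap would classify via the model, and `classifyAnchorKind` is a hypothetical name covering the four keyword-matched kinds:

```ts
type AnchorKind = "invariants" | "acceptance" | "exemplars" | "goal";

// First matching keyword set wins; acceptance is the least-wrong default.
function classifyAnchorKind(prompt: string): AnchorKind {
  const p = prompt.toLowerCase();
  const table: [AnchorKind, string[]][] = [
    ["invariants", ["rfc", "spec", "design doc", "requirements"]],
    ["acceptance", ["explore", "figure out", "what are we"]],
    ["exemplars", ["rewrite", "polish", "sounds like"]],
    ["goal", ["optimize", "reduce", "minimize", "under "]],
  ];
  for (const [kind, keys] of table) {
    if (keys.some((k) => p.includes(k))) return kind;
  }
  return "acceptance";
}
```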
The operator can override with `joust /init --anchor=rubric "..."`.
`lint.ts` becomes a thin dispatcher:

```ts
switch (snowball.anchor.kind) {
  case "invariants": return lint_invariants(main, snowball, draft, ...);
  case "acceptance": return lint_acceptance(main, snowball, draft, ...);
  case "exemplars":  return lint_exemplars(main, snowball, draft, ...);
  case "assertions": return lint_assertions(main, snowball, draft, ...); // deterministic where possible
  case "rubric":     return lint_rubric(main, snowball, draft, ...);
  case "goal":       return lint_goal(main, snowball, draft, ...);
}
```

Each variant returns the same `LintResult` shape — `{ valid, violations, should_violations? }` — so the run loop, retry feedback, and history format don't change.
`kind: "assertions"` is the deterministic path from item 4, generalized. A pattern rule ("draft must contain `## Risks`") is just an assertion with a regex predicate. The hybrid design for invariants falls out as: invariants with a pattern hint get lifted into an assertions sidecar and run deterministically; the rest go through the semantic LLM pass. Both items converge on the same mechanism.
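How that lift could look, assuming item 4's tagged invariant shape. `splitInvariants` and the field names are illustrative, not committed API:

```ts
type Invariant =
  | { kind: "pattern"; rule: string; pattern: string }
  | { kind: "semantic"; rule: string; hint?: string };

type AssertionCheck = { description: string; predicate: string };

// Split one invariant list into a deterministic assertions sidecar and the
// residue that still needs the semantic LLM pass.
function splitInvariants(invariants: Invariant[]): {
  assertions: AssertionCheck[];
  semantic: Invariant[];
} {
  const assertions = invariants
    .filter((i): i is Extract<Invariant, { kind: "pattern" }> => i.kind === "pattern")
    .map((i) => ({ description: i.rule, predicate: i.pattern }));
  const semantic = invariants.filter((i) => i.kind === "semantic");
  return { assertions, semantic };
}
```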
The `critique_trail` and `resolved_decisions` stay the same — they're anchor-agnostic (they're a log of what happened, not what's required). Compaction stays the same. The `human_directives` channel stays the same.
The only schema change is `invariants` → `anchor`. That's a hard-cutover rename (we've done two of those this week).
- Bootstrap picks wrong. An RFC prompt classified as `acceptance` produces a weak lint that accepts loose drafts. Mitigation: make the operator's `--anchor=` override visible in `status`, log the choice loudly at init, and let the operator re-anchor mid-run (new first-class command: `joust /reanchor kind=invariants`).
- Anchor drift. Some tasks legitimately shift shape — discovery turns into spec work once the scope is pinned. A reanchor command handles this; the history entry records the switch.
- Specialist anchors. A security specialist summoned during `kind: "goal"` work still wants rule-shaped output ("MUST validate X"). Solution: specialists always emit `kind: "invariants"` critique, and those get merged into the critique_trail as usual — the specialist's findings don't have to match the drafting anchor.
Land in three phases:
- Phase 1 — rename `invariants` → `anchor`, wrap the existing field as `{ kind: "invariants", ... }`. Zero behavior change. One commit, trivially reviewable.
- Phase 2 — add `kind: "acceptance"` end-to-end (bootstrap classification, context compile, lint). Ship it as an opt-in via `--anchor=acceptance`. Run real drafts through it for a week.
- Phase 3 — add the remaining four shapes based on actual demand, not speculation. `exemplars` and `goal` are the next most likely; `assertions` arrives for free with item 4's hybrid linter; `rubric` is the last mile for editorial-heavy work.
This is the single most leveraged change on the list — it doesn't fix a bug, it expands what joust can be pointed at.
- P3-23 CLI flags — `--dry-run`, `--verbose`, shell completions. Nice-to-haves, not blocking any user. Pick up during a UX sweep.