joust — architecture critique (self-review 2026-04-20)

Architecture Critique — Proposed Designs

Self-review of 2026-04-20 flagged five architectural concerns that don't lend themselves to a one-line bugfix. Each is captured below with context, options, and a recommendation. None of the longer-term designs are implemented yet; this is the design conversation, checked in so we don't lose it.


1. Two-Phase Token Doubling

Problem

call_agent_structured with tools runs two phases:

  1. generateText with tools — agent reads files and writes research text.
  2. generateObject — agent emits structured JSON.

The naive phase-2 input was [...original_context, research_text, "now emit JSON"]. The original context already paid tokens in phase 1 AND is reshipped for phase 2, doubling input tokens per agent turn.

Phase 1 is a fixed-cost floor (tool round-trips); phase 2 is the one we can shrink.

Hotfix (shipped in P2 commit)

Phase 2 sees only the last user message plus the research synthesis. The system prompt still carries invariants/role/draft because system is reused. This covers ~80% of the saving — the bulk of the duplication was the jouster/specialist context block appended to messages.

Remaining concern

The research synthesis itself can be large (10-30KB for a thorough tool-use loop). That's a one-sided doubling — phase 1 paid for the tool-result bytes, phase 2 pays for the agent's summary of them. For a 200KB tool-use trace with a 20KB summary, the extra 20KB is acceptable. For smaller traces it's noise.

Longer-term options

  • A. Single-phase with generateObject + tools — the AI SDK's generateObject doesn't support tools. generateText with experimental_output does support both simultaneously. We already depend on that path shape; swapping generateObject for generateText({ experimental_output: Output.object({ schema }), tools }) would collapse the two phases into one (a sketch follows this list). Risk: the model is asked to tool-use AND emit JSON in one loop, which historically produced worse compliance than a dedicated structured-output pass. Worth a spike.

  • B. Server-side caching (Anthropic prompt caching) — the system block is stable across phase 1 and phase 2. With Anthropic's cache_control marker, phase 2 reads the system block from cache at ~10% of input cost. Zero code complexity change beyond adding the header. Does nothing for Gemini.

  • C. Incremental summarization — rather than feed the whole research text into phase 2, summarize tool-results inline as they come in, keep a running digest, and feed that. Cheap in theory, brittle in practice (each extra summarization pass introduces drift).
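
A minimal sketch of what option A could look like, assuming the AI SDK's experimental Output API behaves as documented; the model, tool set, and maxSteps value are placeholders, not our actual config:

import { generateText, Output } from "ai";

// One call replaces both phases: the model can run the tool loop and the final
// answer is parsed against the schema. Everything below is illustrative.
const result = await generateText({
  model,                                           // whichever provider the agent uses
  tools,                                           // same read/search tools as phase 1
  maxSteps: 8,                                     // placeholder tool-loop budget
  experimental_output: Output.object({ schema }),
  messages,
});
const structured = result.experimental_output;     // the schema-parsed object (assumption: accessed like this in our SDK version)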

Recommendation

Do B (prompt caching) next; it's a one-line addition for Anthropic agents and doubles as a latency win. Explore A once we have a smoke test for mixed tool-use + structured output compliance.
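
For option B, a sketch of the Anthropic cache marker through the AI SDK — hedged, because the exact option key has moved between SDK versions (providerOptions in recent releases, experimental_providerMetadata in older ones) and should be checked against whatever we have pinned; the model id and message variables are placeholders:

import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

const result = await generateText({
  model: anthropic("claude-sonnet-4-20250514"),   // placeholder model id
  messages: [
    {
      role: "system",
      content: system_block,                      // stable across phase 1 and phase 2
      providerOptions: {
        // marks this block as cacheable; phase 2 then reads it at the reduced rate
        anthropic: { cacheControl: { type: "ephemeral" } },
      },
    },
    ...phase_messages,                            // whatever the phase actually sends
  ],
});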


2. Specialist Retry Mechanism

Problem

When a specialist's mutation violates an invariant, the old code committed the rejection and moved on. Jousters got max_retries attempts with violation feedback; specialists got zero. A specialist's first draft doesn't have to be perfect — they should have the same chance to revise as a jouster does.

Hotfix (shipped in P2 commit)

Specialists now run in a retry loop that mirrors the jouster loop: compile context, call, lint; if invalid, feed back YOUR PREVIOUS ATTEMPT WAS REJECTED + the violations and retry, up to max_retries. The summoning jouster is still logged as the originator of the work.

Remaining concern

The specialist retry loop lives inline inside the jouster for-loop and is a copy-paste of the jouster retry loop. P2-19 (test coverage) would land cleanly if this were extracted into execute_specialist_mutation() and execute_jouster_mutation() — both become testable in isolation.

Longer-term options

  • A. Extract to src/mutate.ts — a single run_mutation_with_retry function parameterized over (agent, role, context-compile fn). Both jouster and specialist call sites collapse into one-liners. Reduces the run.ts loop from ~650 lines to ~400 and makes it testable.

  • B. Shared retry policy via config — a specialist_max_retries separate from max_retries. Specialists tend to be more narrowly scoped and may want a tighter retry budget (e.g., 2 instead of 3).

Recommendation

Do A. The retry-loop duplication is a symptom of run.ts being too monolithic; pulling the mutation-pipeline out is the path forward for coverage. Defer B until we have usage data.
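
A rough shape for option A's extraction — a sketch only, with the module paths, the Agent/Snowball/Mutation/Message types, and the call_agent_structured/lint signatures standing in for whatever run.ts actually defines:

import type { Agent, Snowball, Mutation, Message } from "./types"; // hypothetical module layout
import { call_agent_structured } from "./agent";                   // existing call path per above
import { lint, type LintResult } from "./lint";                    // src/lint.ts

type CompileContext = (agent: Agent, snowball: Snowball, feedback?: string) => Message[];

// One retry loop, parameterized over (agent, role, context-compile fn),
// shared by the jouster and specialist call sites.
async function run_mutation_with_retry(opts: {
  agent: Agent;
  role: "jouster" | "specialist";
  snowball: Snowball;
  compile_context: CompileContext;
  max_retries: number;
}): Promise<{ ok: true; mutation: Mutation; lint: LintResult } | { ok: false; lint: LintResult }> {
  let feedback: string | undefined;
  let last: LintResult | undefined;
  for (let attempt = 0; attempt <= opts.max_retries; attempt++) {
    const messages = opts.compile_context(opts.agent, opts.snowball, feedback);
    const mutation = await call_agent_structured(opts.agent, messages);
    last = await lint(opts.snowball, mutation);
    if (last.valid) return { ok: true, mutation, lint: last };
    feedback = `YOUR PREVIOUS ATTEMPT WAS REJECTED\n${last.violations.join("\n")}`;
  }
  return { ok: false, lint: last! };
}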


3. Deferred Summon Queue

Problem

MAX_SUMMONS_PER_ROUND = 1 caps specialist invocations per round. When a jouster summoned a specialist after the cap was reached, the summon was silently dropped. The panel's own judgment that "we need X here" was discarded.

Hotfix (shipped in P3 commit)

A FIFO deferred_summons: { specialist, ask, requested_by }[] queue at run scope. Summons rejected by the cap are appended to it; each round start logs the pending queue so the operator sees what's backlogged. It still doesn't execute them — execution is blocked on the mutation-pipeline extraction from item 2.

Longer-term options

  • A. Consume one queued summon per round — at round start, pop one deferred summon and execute it against the current snowball. Counts against MAX_SUMMONS_PER_ROUND so backlog drains at one-per-round. Requires item 2's extraction. This is the natural completion of the hotfix.

  • B. Dedupe the queue — a given specialist may be summoned multiple times with the same ask. Collapse duplicates so a high-traffic queue doesn't stall on redundant work.

  • C. Priority — summons come with an implicit priority (the panel asked for security before asking for UX). Score and sort rather than FIFO. Risks over-engineering; FIFO is fine until it isn't.

Recommendation

Do A after item 2 lands. B when dedup volume matters (probably never for typical 1-3-round runs). Skip C entirely.
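
A sketch of the round-start drain in option A. deferred_summons, MAX_SUMMONS_PER_ROUND, and execute_specialist_mutation are the names used above; summons_this_round, log, and snowball are illustrative:

// At round start: pop at most one deferred summon and run it against the current
// snowball, so the backlog drains at one per round without exceeding the cap.
if (deferred_summons.length > 0 && summons_this_round < MAX_SUMMONS_PER_ROUND) {
  const queued = deferred_summons.shift()!;        // FIFO: oldest ask first
  log(`deferred summon: ${queued.specialist} (requested by ${queued.requested_by})`);
  await execute_specialist_mutation(queued.specialist, queued.ask, snowball);
  summons_this_round += 1;                         // counts against the per-round cap
}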


4. LLM-Linter Non-Determinism

Problem

src/lint.ts asks the main agent to judge whether a mutation violates invariants. Invariants are written as natural-language MUST / SHOULD / MUST_NOT rules. The model decides — which means two runs over the same (draft, invariants) pair can diverge. A well-intentioned retry that was "valid" on attempt N gets rejected on attempt N+1 for the same content.

This is exactly the problem tools-in-loop was meant to avoid for research, applied again to judgment. We're using the stochastic instrument as if it were deterministic.

Why not just use regex?

Many invariants are structural: "draft MUST contain a section heading ## Risks" is a /^## Risks/m test. Others are semantic: "draft MUST NOT assume SQL Server when context is PostgreSQL" — impossible to regex, requires judgment.

A pure-deterministic linter misses the semantic invariants; a pure-LLM linter is a coin flip on the deterministic ones.

Hybrid design

  • Invariant syntax upgrade — extend invariants.MUST from a string[] to a tagged union:

    { kind: "pattern", rule: "must contain # Risks", pattern: "^## Risks" }
    { kind: "semantic", rule: "must not assume SQL Server", hint?: "..." }
    

    Bootstrap writes kind: "semantic" by default (status quo preserved); the operator can promote rules to pattern for determinism.

  • Two-stage lint pipeline:

    1. Run all kind: "pattern" rules as regexes. Fail-fast on any miss. No LLM call, deterministic, instant.
    2. For the remaining kind: "semantic" rules, call the main agent with only those rules in scope. Smaller surface = more stable judgment.
  • Judgment stabilization — for the semantic pass:

    • temperature forced to 0 (we already default to 0.2, but lint is the one place where determinism matters more than variety)
    • optional "two-model quorum" — two different provider models must both mark a rule violated for the verdict to stick. Doubles lint cost but eliminates single-model idiosyncrasies.

Risks

  • Invariants are already fiddly to write well; adding a kind field raises the floor. Bootstrap has to do the categorization.
  • Regex rules are themselves a source of false positives ("must contain ## Risks" matches ## Risks and Mitigations but also # Risks Blocked).

Recommendation

Ship a kind: "pattern" path first with pattern authored by bootstrap when it's obvious from the rule text (e.g. "must contain the phrase X" → regex on X). Leave semantic rules as-is. Quorum is premature until we have data showing single-model lint is actually the bottleneck.
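
A sketch of the deterministic pattern path, reusing the tagged-union rule shape from the hybrid design above; the function name and violation wording are illustrative:

type InvariantRule =
  | { kind: "pattern"; rule: string; pattern: string }
  | { kind: "semantic"; rule: string; hint?: string };

// Stage 1 of the hybrid lint: evaluate pattern rules as regexes, no LLM call.
// Semantic rules are skipped here and handed to the model in stage 2.
function lint_pattern_rules(draft: string, rules: InvariantRule[]): string[] {
  const violations: string[] = [];
  for (const r of rules) {
    if (r.kind !== "pattern") continue;
    if (!new RegExp(r.pattern, "m").test(draft)) {
      violations.push(`pattern rule failed: ${r.rule}`);
    }
  }
  return violations; // caller fails fast if non-empty
}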


5. Pluggable Anchor (beyond MUST/SHOULD/MUST_NOT)

Problem

Joust started as an RFC tool, so the schema hardcodes invariants: { MUST, SHOULD, MUST_NOT } and the lint prompt reads those rules back as normative assertions. That shape works for RFC-like drafts, where "the system must X" is the native grammar.

It's a poor fit for other work joust is actually good at:

  • discovery — "what are we even deciding?" has no MUSTs yet, only open questions. A rule-shaped anchor forces the bootstrap to invent constraints before the domain is understood.
  • style/voice polish — "this reads like a brochure, make it read like an engineer wrote it" is a similarity target, not a rule list.
  • optimization — "get p99 under 40ms" is a goal with constraints, not a set of MUSTs. Expressing it as MUST: p99 < 40ms loses the optimization character (is 50ms a hard failure, or just progress that gets thrown away?).
  • comparison — "pick the better option between A and B" has no rules, only a scoring rubric.

The invariant shape isn't wrong — it's one kind of anchor. The job of the anchor is to be the durable thing a mutation is judged against. Other strategies fit the same slot.

Anchor shapes

| Shape | What it is | Lint check | Fits |
| --- | --- | --- | --- |
| invariants | MUST / SHOULD / MUST_NOT (current) | rule compliance | RFCs, specs, contracts |
| acceptance | "done when [list of questions answered]" | coverage of each item | discovery, scoping |
| exemplars | positive + negative examples of good/bad | similarity to positive, distance from negative | style, voice, tone |
| assertions | literal tests: "draft contains X", claim(Y) | run the tests | code drafts, factual docs |
| rubric | weighted dimensions, each 0-5, with rationale | score each dim; reject if any regresses | long polish, editorial |
| goal+constraints | one north-star + flat constraint list | goal-progress score; constraint violations | optimization, trade studies |

Design

Rename invariants (the field, not everything it touches) to anchor and make it a tagged union:

type Anchor =
  | { kind: "invariants"; MUST: string[]; SHOULD: string[]; MUST_NOT: string[] }
  | { kind: "acceptance"; items: { question: string; done_when: string }[] }
  | { kind: "exemplars"; positive: string[]; negative: string[] }
  | { kind: "assertions"; checks: { description: string; predicate: string }[] }
  | { kind: "rubric"; dims: { name: string; weight: number; scale: string }[] }
  | { kind: "goal"; north_star: string; constraints: string[] };

Bootstrap picks the kind from the prompt. Heuristics:

  • Prompt contains "RFC", "spec", "design doc", "requirements" → invariants.
  • Prompt contains "explore", "figure out", "what are we" → acceptance.
  • Prompt contains "rewrite", "polish", "sounds like" + reference text → exemplars.
  • Prompt contains "optimize", "reduce", "minimize", "under N" → goal.
  • Default when unclear → acceptance (weakest, least-wrong).

The operator can override with joust /init --anchor=rubric "...".
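
The heuristics above, sketched as code; the function name and the exact keyword regexes are illustrative, not a spec:

// Maps an operator prompt to an anchor kind using the keyword heuristics above.
function classify_anchor_kind(prompt: string): Anchor["kind"] {
  const p = prompt.toLowerCase();
  if (/\b(rfc|spec|design doc|requirements)\b/.test(p)) return "invariants";
  if (/\b(explore|figure out|what are we)\b/.test(p)) return "acceptance";
  if (/(rewrite|polish|sounds like)/.test(p)) return "exemplars";
  if (/(optimize|reduce|minimize|under \d)/.test(p)) return "goal";
  return "acceptance"; // weakest, least-wrong default when unclear
}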

Lint dispatch

lint.ts becomes a thin dispatcher:

switch (snowball.anchor.kind) {
  case "invariants":   return lint_invariants(main, snowball, draft, ...);
  case "acceptance":   return lint_acceptance(main, snowball, draft, ...);
  case "exemplars":    return lint_exemplars(main, snowball, draft, ...);
  case "assertions":   return lint_assertions(main, snowball, draft, ...); // deterministic where possible
  case "rubric":       return lint_rubric(main, snowball, draft, ...);
  case "goal":         return lint_goal(main, snowball, draft, ...);
}

Each variant returns the same LintResult shape — { valid, violations, should_violations? } — so the run loop, retry feedback, and history format don't change.
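
For reference, that shared shape as a type — the field names come from above, the field types are an assumption:

// Every lint_* variant returns this; the run loop and retry feedback stay unchanged.
type LintResult = {
  valid: boolean;
  violations: string[];          // hard failures: MUST-level rules, failed assertions, regressed dims
  should_violations?: string[];  // soft findings that don't block the mutation (assumption)
};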

Interaction with item 4 (hybrid linter)

kind: "assertions" is the deterministic path from item 4, generalized. A pattern rule ("draft must contain ## Risks") is just an assertion with a regex predicate. The hybrid design for invariants falls out as: invariants with a pattern hint get lifted into an assertions sidecar and run deterministically; the rest go through the semantic LLM pass. Both items converge on the same mechanism.

Snowball implications

The critique_trail and resolved_decisions stay the same — they're anchor-agnostic (they're a log of what happened, not what's required). Compaction stays the same. The human_directives channel stays the same.

The only schema change is invariants → anchor. That's a hard-cutover rename (we've done two of those this week).

Risks

  • Bootstrap picks wrong. An RFC prompt classified as acceptance produces a weak lint that accepts loose drafts. Mitigation: make the operator's --anchor= override visible in status, log the choice loudly at init, and let the operator re-anchor mid-run (new first-class command: joust /reanchor kind=invariants).
  • Anchor drift. Some tasks legitimately shift shape — discovery turns into spec work once the scope is pinned. A reanchor command handles this; the history entry records the switch.
  • Specialist anchors. A security specialist summoned during kind: "goal" work still wants rule-shaped output ("MUST validate X"). Solution: specialists always emit kind: "invariants" critique, and those get merged into the critique_trail as usual — the specialist's findings don't have to match the drafting anchor.

Recommendation

Land in three phases:

  1. Phase 1 — rename invariants → anchor, wrap the existing field as { kind: "invariants", ... }. Zero behavior change. One commit, trivially reviewable.
  2. Phase 2 — add kind: "acceptance" end-to-end (bootstrap classification, context compile, lint). Ship it as an opt-in via --anchor=acceptance. Run real drafts through it for a week.
  3. Phase 3 — add the remaining four shapes based on actual demand, not speculation. exemplars and goal are the next most likely; assertions arrives for free with item 4's hybrid linter; rubric is the last mile for editorial-heavy work.

This is the single most leveraged change on the list — it doesn't fix a bug, it expands what joust can be pointed at.


Deferred (not a bug)

  • P3-23 CLI flags — --dry-run, --verbose, shell completions. Nice-to-haves, not blocking any user. Pick up during a UX sweep.