# ChatGPT5 Claude Workflow Prompt
**PROMPT FOR LLM (copy+paste below, then append your plan)**
You are converting a **generic plan of action** into a **production-ready research/engineering `todo.md`** that is directly executable by a capable coding agent. Your output must be a single Markdown document with a fenced **XML `<workflows>`** block. Do **not** include any extra commentary, explanations, or chat—**output only the document**.
### Transformation requirements
* Be **specific** and **operational**. Replace vague goals with concrete steps, commands, checklists, acceptance gates, and explicit assumptions.
* Prefer **compact, high-signal prose**. No filler. Use short paragraphs and terse bullets.
* If information is missing, make **minimal, clearly labeled assumptions** (e.g., “Assumption: …”). Do **not** ask questions.
* Compile the spec into **executable oracles**: pre/postconditions, invariants, consumer/provider API contracts, and **metamorphic properties**. Generate both tests and **runtime guards**.
* Enforce **hermetic spin-up**: pinned toolchains/lockfiles, reproducible container image, seed data, migrations, health/readiness probes, **golden smoke flows**, and a signed **boot transcript** artifact.
* Beyond unit/integration/e2e: include **property-based testing**, **metamorphic testing**, **mutation testing** (score target), **grammar/coverage-guided fuzzing**, **concolic/symbolic execution** for critical paths, **differential tests** vs last known-good, **contract tests** for external deps, and **runtime invariant checks** with shadow traffic or replay.
* Add **static/semantic gates**: strict typing/linters, SAST/taint, API surface diffs, complexity deltas, license/OSS policy.
* Implement a **risk score** and **Gatekeeper** to decide: **AGENT_REFINE** (auto-iterate) vs **MANUAL_QA** (human exploration) vs **PROMOTE** (stage/ship).
* Use **relative improvements** (percent) when comparing methods; keep budget parity rules explicit (e.g., “±5% params/FLOPs”).
* Include **reproducibility and guardrails** (seeds, SHAs, data/index hashes, environment pins).
* Treat statistics rigor as first-class (paired bootstrap CIs, multiple-comparison control) unless the domain makes this irrelevant.
* Keep the plan **tool-agnostic** but actionable (shell/Python placeholders OK).
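To make the oracle requirements above concrete, here is a minimal Python sketch of a spec clause compiled into runtime guards (pre/postconditions) plus one metamorphic property. The function `normalize_scores` and the scale-invariance relation are hypothetical illustrations, not part of the required output.

```python
# Sketch: an executable oracle — runtime guards plus a metamorphic property.
# `normalize_scores` is a hypothetical unit under test.
import random

def normalize_scores(xs):
    """Scale scores so they sum to 1."""
    total = sum(xs)
    # Runtime guard (precondition): reject zero/negative mass inputs.
    assert total > 0, "precondition: scores must have positive mass"
    out = [x / total for x in xs]
    # Runtime guard (postcondition): output is a probability vector.
    assert abs(sum(out) - 1.0) < 1e-9, "postcondition: output sums to 1"
    return out

def check_metamorphic_scale_invariance(trials=1000, seed=0):
    """Metamorphic relation: scaling every input by k > 0 must not change the output."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(0.1, 10.0) for _ in range(rng.randint(1, 8))]
        k = rng.uniform(0.5, 5.0)
        a = normalize_scores(xs)
        b = normalize_scores([k * x for x in xs])
        assert all(abs(p - q) < 1e-9 for p, q in zip(a, b)), "metamorphic relation violated"
    return trials
```

The same guards injected as asserts here would run as monitors in staging/shadow traffic, while the metamorphic check feeds the property suite.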
### Required document structure (exact section order)
1. **Title** — `` # {{PROJECT_NAME}} — `todo.md` ``
2. **TL;DR** — one line.
3. **Invariants (do not change)** — non-negotiable constraints.
4. **Assumptions & Scope** — what you’re assuming (label uncertain items).
5. **Objectives** — 3–5 measurable goals.
6. **Risks & Mitigations** — top risks with a single mitigation each.
7. **Method Outline (idea → mechanism → trade-offs → go/no-go)** — turn high-level ideas into actionable variants/workstreams.
8. **Run Matrix** — table of variants with budgets and promotion criteria.
9. **Implementation Notes** — terse details a coder needs (APIs, attach points, precision, cache policies, etc.).
10. **Acceptance Gates** — pass/fail thresholds tied to Objectives.
11. **“Make-sure-you” Checklist** — must-do guardrails for the agent.
12. **File/Layout Plan** — directories and key files to create.
13. **Fenced XML Workflows** — **mandatory**: `building`, `running`, `tracking`, `evaluating`, `refinement`.
* Each `<workflow>` contains ordered `<commands>` and a **`<make_sure>`** checklist.
* Use explicit IDs (e.g., `id="R1"`).
* Commands may be placeholders but must be realistic and sequenced.
14. **Minimal Pseudocode (optional)** — only if it clarifies tricky parts.
15. **Next Actions (strict order)** — 3–6 concrete steps the agent executes next.
### Statistical & evaluation defaults (apply unless the plan dictates otherwise)
* Report paired metrics with **10k bootstrap**, **BCa 95% CIs**; mark significance only if CI lower bound > 0.
* Control family-wise errors (e.g., **FDR** within metric families).
* Maintain **budget parity** (params & FLOPs within **±5%**) across variants unless a “decoding-only” or “systems” budget is declared separately.
* Always show **two slices** if applicable (e.g., “Focused” vs “Full”); never hide slices.
* Include **latency p50/p95**, throughput, and memory/VRAM when performance matters.
* **Verification defaults:** Hermetic spin-up must pass. **Mutation score ≥ 0.80**; **property/metamorphic coverage ≥ 0.70**; **0 high/critical SAST**; **flakiness < 1%** over 100 reruns; **runtime invariants** hold over N=10k shadow requests; **API contracts** green.
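The paired-bootstrap default above can be sketched as follows. Note this is a simplified percentile interval rather than the BCa interval the defaults specify (the bias/acceleration correction is omitted for brevity), and the `deltas` values are made up for illustration.

```python
# Sketch: paired bootstrap CI on per-example metric deltas (variant minus
# baseline). Percentile interval only; BCa correction omitted for brevity.
import random

def paired_bootstrap_ci(deltas, n_boot=10_000, alpha=0.05, seed=0):
    """Return (lo, hi) bounds of a (1 - alpha) percentile CI on the mean delta."""
    rng = random.Random(seed)
    n = len(deltas)
    means = []
    for _ in range(n_boot):
        # Resample paired deltas with replacement (pairing is preserved
        # because each delta is already a per-example difference).
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Significance per the default: mark a win only if the CI lower bound > 0.
deltas = [0.02, 0.05, -0.01, 0.03, 0.04, 0.01, 0.02, 0.00, 0.03, 0.02]  # illustrative
lo, hi = paired_bootstrap_ci(deltas, n_boot=2000)
significant = lo > 0
```

With multiple metric families, feed the resulting p-values/CIs through the FDR control step before starring any result.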
### Language & style constraints
* Crisp, technical, neutral tone.
* No self-references (“As an AI…”), no questions.
* Use code fences for XML and pseudocode.
* Use placeholders like `{{MODEL_NAME}}`, `{{DATASET}}`, `{{RANK_SCHEDULE}}` when the plan lacks specifics; **label them** in Assumptions.
---
### OUTPUT TEMPLATE (fill every section)
# {{PROJECT_NAME}} — `todo.md`
**TL;DR:** {{one-sentence summary of the execution plan}}
## Invariants (do not change)
* {{Constraint 1}}
* {{Constraint 2}}
* {{Oracles are source-of-truth; contracts & properties enforced at runtime}}
* {{Hermetic spin-up required; boot transcript must verify}}
* {{…}}
## Assumptions & Scope
* **Assumption:** {{ explicit assumption }}
* **Assumption:** {{ thresholds if unspecified: T_mut=0.80, T_prop=0.70, SAST_high=0 }}
* **Scope:** {{ what’s in/out }}
## Objectives
1. {{Objective with measurable target}}
2. **Verification:** Achieve mutation ≥ {{T_mut}} and property/metamorphic coverage ≥ {{T_prop}}.
3. **Reliability:** Zero high/critical SAST; flakiness < 1% on reruns; all API contracts pass.
4. **Spin-up:** One-shot hermetic boot from clean checkout with golden smoke flows; produce signed boot transcript.
5. {{Domain KPI improvement with CI-backed threshold}}
## Risks & Mitigations
* {{Risk}} → **Mitigation:** {{one-line fix}}
* External API drift → **Mitigation:** Consumer/provider contracts + service virtualization.
* Env non-determinism → **Mitigation:** Containerized toolchain, pinned lockfiles, deterministic seeds.
* Unknown-unknown logic gaps → **Mitigation:** Metamorphic + fuzzing + runtime invariants with shadow traffic.
## Method Outline (idea → mechanism → trade-offs → go/no-go)
### Workstream/Variant A — Spec→Oracles Pipeline
* **Idea:** Compile spec to contracts/properties/metamorphic tests + runtime guards.
* **Mechanism:** DSL → codegen (pre/post/invariants), property-based suites, metamorphic relations; inject assert/monitor hooks.
* **Trade-offs:** Upfront authoring cost; stricter gates surface more refactors.
* **Go/No-Go Gate:** All generated oracles pass; property coverage ≥ {{T_prop}}.
### Workstream/Variant B — Adversarial Verification
* **Idea:** Kill mutants, explore paths, and stress inputs.
* **Mechanism:** Mutation testing; grammar/coverage-guided fuzzing; concolic on parsers/auth/finance.
* **Trade-offs:** Longer CI wall time; infra complexity.
* **Go/No-Go Gate:** Mutation ≥ {{T_mut}}; no exploitable paths found at severity ≥ medium.
### Workstream/Variant C — Differential & Contract Testing
* **Idea:** Compare against last known-good/reference; lock external behavior.
* **Mechanism:** Golden snapshots, differential tests, Pact-like contracts validated in CI; API surface diffs.
* **Trade-offs:** Snapshot churn; contract maintenance.
* **Go/No-Go Gate:** No incompatible diffs; contracts green.
## Run Matrix
| ID | Method/Variant | Budget | Inputs | Expected Gain | Promote if… |
| -- | ---------------------------- | ----------------------------- | ------------------------ | ------------------- | -------------------------------------------- |
| V1 | Spec→Oracles | ±5% parity / baseline compute | Spec DSL, current code | Fewer escapes | Property ≥ {{T_prop}} & all invariants hold |
| V2 | Mutation+Fuzz+Concolic | Separate verification budget | Corpus/grammars | Kill weak tests | Mutation ≥ {{T_mut}}; 0 high/crit SAST |
| V3 | Differential+Contracts | ±5% parity | Golden outputs, provider | Regression defense | Zero incompatible diffs; contracts green |
| V4 | Runtime Invariant Monitoring | Staging/shadow only | Shadow traffic/replays | Prod-parity signals | 0 invariant breaks over N=10k requests |
## Implementation Notes
* **APIs/Attach points:** {{paths/interfaces for injecting contracts & monitors}}
* **Precision/Quantization:** {{fp16/fp8/int8 policy}}
* **Caching/State:** {{cache windows, stickiness, invalidation}}
* **Telemetry:** Log mutation score, property coverage, fuzz/crash repros, SAST severity counts, API diffs, visual diffs, flakiness, boot transcript hash.
* **Repro:** Seeds, SHAs, container digest, dataset/index hashes, contract versions.
## Acceptance Gates
* **Spin-up:** Clean checkout → container build → migrate/seed → readiness OK → golden smokes pass → **boot transcript signed**.
* **Static:** 0 high/critical SAST; typecheck clean; license policy OK; API surface diffs acknowledged.
* **Dynamic:** Mutation ≥ {{T_mut}}; property/metamorphic coverage ≥ {{T_prop}}; fuzzing runtime ≥ {{X}} mins with 0 new medium+ crashes.
* **Differential/Contracts:** No incompatible diffs; contracts green.
* **Runtime (staging/shadow):** 0 invariant breaks over N=10k requests, error budget respected.
* **Domain KPI:** CI lower bound > 0 on {{primary_metric}} within budget parity.
## “Make-sure-you” Checklist
* Pin toolchain & deps; record env manifest and container digest.
* Generate contracts/properties from spec; commit artifacts.
* Save **boot transcript** and artifact hashes.
* Record seeds; rerun flaky tests 100×; fail on flakiness.
* Quarantine network; stub external deps unless in contract tests.
* Export metrics JSONL; persist logs/artifacts under `artifacts/`.
## File/Layout Plan
```
{{repo_root}}/
  spec/                  # DSL + compiled contracts/properties
  contracts/             # consumer/provider specs
  src/
  tests/
    properties/
    metamorphic/
    mutation/
    differential/
    e2e/
  scripts/
    gatekeeper.py
    spinup_smoke.sh
    compute_risk.py
  artifacts/
    boot_transcript.json
    metrics/
  infra/
  analysis/
  logs/
```
## Workflows (required)
```xml
<workflows project="{{PROJECT_SLUG}}" version="1.0">
  <!-- =============================== -->
  <!-- BUILDING: env, assets, guards -->
  <!-- =============================== -->
  <workflow name="building">
    <env id="B0">
      <desc>Set up environment and pin versions</desc>
      <commands>
        <cmd>{{create_venv}}</cmd>
        <cmd>{{install_packages}}</cmd>
        <cmd>{{container_build_with_lockfiles}}</cmd>
        <cmd>{{record_env_manifest}}</cmd>
      </commands>
      <make_sure>
        <item>{{GPU/CPU visibility test}}</item>
        <item>{{lockfile / hashes saved}}</item>
      </make_sure>
    </env>
    <assets id="B1">
      <desc>Fetch models/data/indexes or domain assets</desc>
      <commands>
        <cmd>{{download_or_prepare_assets}}</cmd>
        <cmd>{{verify_licenses_and_hashes}}</cmd>
      </commands>
      <make_sure>
        <item>{{asset SHAs recorded}}</item>
      </make_sure>
    </assets>
    <contracts id="B2">
      <desc>Compile spec to contracts/properties/metamorphic tests</desc>
      <commands>
        <cmd>{{compile_spec_to_contracts}}</cmd>
        <cmd>{{generate_property_tests}}</cmd>
        <cmd>{{inject_runtime_guards}}</cmd>
      </commands>
      <make_sure>
        <item>{{oracles generated and versioned}}</item>
      </make_sure>
    </contracts>
    <static id="B3">
      <desc>Enable static/semantic guardrails</desc>
      <commands>
        <cmd>{{run_typecheck_linters}}</cmd>
        <cmd>{{run_SAST_taint}}</cmd>
        <cmd>{{api_surface_diff}}</cmd>
        <cmd>{{complexity_delta_check}}</cmd>
      </commands>
      <make_sure>
        <item>{{abort_on_high/critical findings}}</item>
      </make_sure>
    </static>
    <spinup id="B4">
      <desc>Hermetic boot; produce boot transcript</desc>
      <commands>
        <cmd>{{container_run_clean_checkout}}</cmd>
        <cmd>{{apply_migrations_and_seed}}</cmd>
        <cmd>{{readiness_and_health_checks}}</cmd>
        <cmd>{{run_golden_smokes}}</cmd>
        <cmd>{{write_boot_transcript_json}}</cmd>
      </commands>
      <make_sure>
        <item>{{transcript signed with env digest}}</item>
      </make_sure>
    </spinup>
  </workflow>
  <!-- =============================== -->
  <!-- RUNNING: verification battery -->
  <!-- =============================== -->
  <workflow name="running">
    <baseline id="R0">
      <desc>Run baseline under parity</desc>
      <commands>
        <cmd>{{train_or_build_baseline}}</cmd>
        <cmd>{{evaluate_baseline}}</cmd>
      </commands>
      <make_sure>
        <item>{{same attach points / budgets}}</item>
      </make_sure>
    </baseline>
    <contracts id="R1">
      <desc>API consumer/provider contracts</desc>
      <commands>
        <cmd>{{start_service_virtualization}}</cmd>
        <cmd>{{run_contract_tests}}</cmd>
      </commands>
      <make_sure>
        <item>{{no contract breaks}}</item>
      </make_sure>
    </contracts>
    <properties id="R2">
      <desc>Property & metamorphic tests</desc>
      <commands>
        <cmd>{{run_property_tests_with_seeds}}</cmd>
        <cmd>{{run_metamorphic_suites}}</cmd>
      </commands>
      <make_sure>
        <item>{{report property coverage}}</item>
      </make_sure>
    </properties>
    <fuzz_symbolic id="R3">
      <desc>Grammar-guided fuzzing + concolic on critical paths</desc>
      <commands>
        <cmd>{{fuzz_parsers_and_endpoints}}</cmd>
        <cmd>{{concolic_on_auth_and_money}}</cmd>
      </commands>
      <make_sure>
        <item>{{crashes minimized; repros archived}}</item>
      </make_sure>
    </fuzz_symbolic>
    <mutation id="R4">
      <desc>Mutation testing for adequacy</desc>
      <commands>
        <cmd>{{generate_mutants}}</cmd>
        <cmd>{{run_mutation_suite}}</cmd>
        <cmd>{{compute_mutation_score}}</cmd>
      </commands>
      <make_sure>
        <item>{{mutation score ≥ {{T_mut}}}}</item>
      </make_sure>
    </mutation>
    <differential id="R5">
      <desc>Golden snapshots & diffs vs last known-good</desc>
      <commands>
        <cmd>{{capture_golden_outputs}}</cmd>
        <cmd>{{run_differential_tests}}</cmd>
        <cmd>{{visual_diff_if_UI}}</cmd>
      </commands>
      <make_sure>
        <item>{{no incompatible diffs}}</item>
      </make_sure>
    </differential>
    <runtime id="R6">
      <desc>Staging with shadow traffic; runtime invariants</desc>
      <commands>
        <cmd>{{deploy_to_staging}}</cmd>
        <cmd>{{mirror_or_replay_traffic}}</cmd>
        <cmd>{{monitor_invariant_breaks}}</cmd>
      </commands>
      <make_sure>
        <item>{{0 invariant breaks over N requests}}</item>
      </make_sure>
    </runtime>
  </workflow>
  <!-- =============================== -->
  <!-- TRACKING: collect & compute -->
  <!-- =============================== -->
  <workflow name="tracking">
    <harvest id="T1">
      <desc>Consolidate metrics/artifacts; compute statistics</desc>
      <commands>
        <cmd>{{collect_logs_to_jsonl}}</cmd>
        <cmd>{{score_outputs}}</cmd>
        <cmd>{{paired_bootstrap_and_FDR}}</cmd>
        <cmd>{{summarize_SAST_and_static}}</cmd>
        <cmd>{{summarize_mutation_property_coverage}}</cmd>
        <cmd>{{summarize_flakiness_visual_diffs}}</cmd>
        <cmd>{{hash_and_store_boot_transcript}}</cmd>
      </commands>
      <make_sure>
        <item>{{CI policy applied; stars only when CI>0}}</item>
      </make_sure>
    </harvest>
    <risk id="T2">
      <desc>Compute risk score R and decision features</desc>
      <commands>
        <cmd>{{compute_risk.py --delta_loc --novelty --ext_dep_delta --one_minus_mutation --flakiness --static_severity}}</cmd>
        <cmd>{{emit_decision_features_json}}</cmd>
      </commands>
      <make_sure>
        <item>{{features normalized to [0,1]}}</item>
      </make_sure>
    </risk>
  </workflow>
  <!-- =============================== -->
  <!-- EVALUATING: promotion rules -->
  <!-- =============================== -->
  <workflow name="evaluating">
    <gatekeeper id="E1">
      <desc>Apply gates and decide: AGENT_REFINE vs MANUAL_QA vs PROMOTE</desc>
      <commands>
        <cmd>{{apply_acceptance_gates}}</cmd>
        <cmd>{{compute_R_and_compare_thresholds}}</cmd>
        <cmd>{{route_decision}}</cmd>
        <cmd>{{generate_tables_and_figures}}</cmd>
      </commands>
      <make_sure>
        <item>{{no promotion without CI-backed wins and gates met}}</item>
      </make_sure>
    </gatekeeper>
  </workflow>
  <!-- =============================== -->
  <!-- REFINEMENT: next iteration -->
  <!-- =============================== -->
  <workflow name="refinement">
    <agent_refine id="N1">
      <desc>Auto-iterate with obligation-driven prompts (if routed)</desc>
      <commands>
        <cmd>{{create_actionable_prompt_from_failures}}</cmd>
        <cmd>{{schedule_agent_build_and_verify}}</cmd>
      </commands>
      <make_sure>
        <item>{{prompts include concrete obligations & thresholds}}</item>
      </make_sure>
    </agent_refine>
    <manual_qa id="N2">
      <desc>Human exploration handoff (if routed)</desc>
      <commands>
        <cmd>{{open_tracking_dashboard}}</cmd>
        <cmd>{{attach_repros_boot_transcript_contracts}}</cmd>
      </commands>
      <make_sure>
        <item>{{clear owner; rollback/kill-switch ready}}</item>
      </make_sure>
    </manual_qa>
  </workflow>
</workflows>
```
## Minimal Pseudocode (optional)
```python
# Gatekeeper decision (feature weights are configurable)
def decide(metrics, T_mut=0.80, T_prop=0.70, T_manual=0.50,
           weights=(0.2, 0.2, 0.15, 0.2, 0.1, 0.15)):
    # Hard gates first: any failure routes straight to agent refinement.
    if not metrics["hermetic_spinup_pass"]:
        return "AGENT_REFINE: fix spin-up"
    if metrics["sast_high_critical"] > 0:
        return "AGENT_REFINE: resolve SAST"
    if metrics["mutation"] < T_mut:
        return "AGENT_REFINE: raise mutation"
    if metrics["prop_cov"] < T_prop:
        return "AGENT_REFINE: add properties"
    # Risk score R: weighted sum of normalized [0,1] decision features.
    w1, w2, w3, w4, w5, w6 = weights
    R = (w1 * metrics["delta_loc"] + w2 * metrics["novelty"] +
         w3 * metrics["ext_dep_delta"] + w4 * (1 - metrics["mutation"]) +
         w5 * metrics["flakiness"] + w6 * metrics["static_severity"])
    return "MANUAL_QA" if R >= T_manual else "PROMOTE"
```
## Next Actions (strict order)
1. Normalize plan into Objectives and Acceptance Gates; set T_mut/T_prop thresholds.
2. Define or import spec DSL; generate contracts/properties/metamorphic tests.
3. Implement spin-up script and golden smokes; emit boot transcript.
4. Wire mutation/fuzz/concolic and differential test harnesses; add metrics logging.
5. Add Gatekeeper and risk computation; connect to CI promotion step.
---
### HOW TO USE
* Paste your rough plan after this line: `=== PLAN START ===` … `=== PLAN END ===`.
* The model must map plan elements into the template above, filling placeholders, and inventing **only** minimal, labeled assumptions where needed.
* The model must output **only** the final Markdown document (no extra commentary).
**END OF PROMPT**