Verified Spec-Driven Development (VSDD) is a unified software engineering methodology that fuses three proven paradigms into a single AI-orchestrated pipeline:
- Spec-Driven Development (SDD): Define the contract before writing a single line of implementation. Specs are the source of truth.
- Test-Driven Development (TDD): Tests are written before code. Red → Green → Refactor. No code exists without a failing test that demanded it.
- Verification-Driven Development (VDD): Subject all surviving code to adversarial refinement until a hyper-critical reviewer is forced to hallucinate flaws.
VSDD treats these not as competing philosophies but as sequential gates in a single pipeline. Specs define what. Tests enforce how. Adversarial verification ensures nothing was missed. AI models orchestrate every phase, with the human developer serving as the strategic decision-maker and final authority.
| Role | Entity | Function |
|---|---|---|
| The Architect | Human Developer | Strategic vision, domain expertise, acceptance authority. Signs off on specs, arbitrates disputes between Builder and Adversary. |
| The Builder | Claude (or similar) | Spec authorship, test generation, code implementation, and refactoring. Operates under strict TDD constraints. |
| The Tracker | Chainlink | Hierarchical issue decomposition — Epics → Issues → Sub-issues ("beads"). Every spec, test, and implementation maps to a bead. |
| The Adversary | Sarcasmotron (Gemini Gem or equivalent) | Hyper-critical reviewer with zero patience. Reviews specs, tests, and implementation. Fresh context on every pass. |
Nothing gets built until the contract is airtight.
The human developer describes the feature intent to the Builder. The Builder then produces a formal specification document for each unit of work, containing:
- Behavioral Contract: What the module/function/endpoint must do, expressed as preconditions, postconditions, and invariants.
- Interface Definition: Input types, output types, error types. No ambiguity. If it's an API, this is the OpenAPI/GraphQL schema. If it's a module, this is the type signature and doc contract.
- Edge Case Catalog: Explicitly enumerated boundary conditions, degenerate inputs, and failure modes. The Builder is prompted to be exhaustive here — "What happens when the input is null? Empty? Maximum size? Negative? Unicode? Concurrent?"
- Non-Functional Requirements: Performance bounds, memory constraints, security considerations baked into the spec itself.
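For a module-level unit, the Interface Definition and Behavioral Contract can live directly in the type signature and doc contract. Below is a minimal sketch using a hypothetical currency parser; the function name, error type, and contract details are illustrative, not prescribed by VSDD:

```python
class AmountError(ValueError):
    """The single error type the contract permits for bad input."""


def parse_amount(raw: str) -> int:
    """Parse a currency string of the form "D.CC" into integer cents.

    Preconditions:
        - raw is a non-empty string matching \\d+\\.\\d{2}
    Postconditions:
        - result == dollars * 100 + cents
        - result >= 0
    Edge cases (from the catalog):
        - empty string, negative sign, missing cents, non-ASCII digits
          -> AmountError
    Non-functional:
        - O(len(raw)) time, no allocation beyond the result
    Errors:
        - raises AmountError, and nothing else, on any precondition violation
    """
    raise NotImplementedError  # Phase 1 stops here: contract only, no code yet
```

The stub body is deliberate: at this phase the artifact under review is the contract, not an implementation.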
Spec Review Gate: The spec is reviewed by both the human and the Adversary before any tests are written. Sarcasmotron tears into the spec looking for:
- Ambiguous language that could be interpreted multiple ways
- Missing edge cases
- Implicit assumptions that aren't stated
- Contradictions between different parts of the spec
The spec is iterated until the Adversary can't find legitimate holes.
Chainlink Integration: Each spec maps to a Chainlink Issue. Sub-issues are generated for each behavioral contract item, edge case, and non-functional requirement.
Red → Green → Refactor, enforced by AI.
With an airtight spec in hand, the Builder now writes tests — and only tests. No implementation code yet.
Step 2a: Test Suite Generation
The Builder translates the spec directly into executable tests:
- Unit Tests: One or more tests per behavioral contract item. Every postcondition becomes an assertion. Every precondition violation becomes a test that expects a specific error.
- Edge Case Tests: Every item in the Edge Case Catalog becomes a test. These are the tests that catch the bugs that "never happen in production" (until they do).
- Integration Tests: Tests that verify the module works correctly within the larger system context defined in the spec.
- Property-Based Tests: Where applicable, the Builder generates property-based tests (e.g., using Hypothesis, fast-check, or proptest) that assert invariants hold across randomized inputs.
The Red Gate: All tests must fail before any implementation begins. If a test passes without implementation, the test is suspect — it's either testing the wrong thing or the spec was wrong. The Builder flags this for human review.
Step 2b: Minimal Implementation
The Builder writes the minimum code necessary to make each test pass, one at a time. This is classic TDD discipline:
- Pick the next failing test.
- Write the smallest implementation that makes it pass.
- Run the full suite — nothing else should break.
- Repeat.
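Continuing the hypothetical currency-parser example, a Green-phase implementation is only as large as the tests demand; anything fancier (locale handling, thousands separators) waits for a spec requirement and a failing test of its own:

```python
import re


class AmountError(ValueError):
    """Spec-named error for any precondition violation."""


# Anchored pattern taken straight from the spec's precondition ("D.CC").
# re.ASCII so the Unicode-digit edge case from the catalog is rejected,
# not silently parsed (\d matches Unicode digits by default in Python).
_AMOUNT_RE = re.compile(r"^(\d+)\.(\d{2})$", re.ASCII)


def parse_amount(raw: str) -> int:
    """Parse a "D.CC" currency string into integer cents."""
    match = _AMOUNT_RE.match(raw)
    if match is None:
        raise AmountError(f"malformed amount: {raw!r}")
    dollars, cents = match.groups()
    return int(dollars) * 100 + int(cents)
```

Note how the edge-case catalog shows up as a one-flag decision (`re.ASCII`) rather than a pile of special cases; that is the payoff of enumerating degenerate inputs before writing code.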
Step 2c: Refactor
After all tests are green, the Builder refactors for clarity, performance, and adherence to the non-functional requirements in the spec. The test suite acts as the safety net — if refactoring breaks something, the tests catch it immediately.
Human Checkpoint: The developer reviews the test suite and implementation for alignment with the "spirit" of the spec. AI can miss intent even when it nails the letter of the contract.
The code survived testing. Now it faces the gauntlet.
The test-passing codebase — along with the spec and test suite — is presented to Sarcasmotron in a fresh context window.
What the Adversary reviews:
- Spec Fidelity: Does the implementation actually satisfy the spec, or did the tests inadvertently encode a misunderstanding?
- Test Quality: Are the tests actually testing what they claim? Are there tests that would pass even if the implementation were subtly wrong? (Tautological tests, tests that mock too aggressively, tests that assert on implementation details rather than behavior.)
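The tautological-test smell is easy to miss and easy to demonstrate. In the sketch below (a hypothetical discount function, unrelated to any spec above), the first test derives its expected value from the code under test, so it passes no matter what the code does; the second asserts an expectation derived independently from the spec:

```python
def apply_discount(price: float, rate: float) -> float:
    """Return price reduced by rate (e.g. 0.10 for 10% off)."""
    return price * (1 - rate)


def test_discount_tautological():
    # Smell: "expected" is computed by the unit under test itself,
    # so this assertion can never fail, even if the formula is wrong.
    expected = apply_discount(100.0, 0.10)
    assert apply_discount(100.0, 0.10) == expected


def test_discount_behavioral():
    # Expectation derived by hand from the spec: 10% off 100.00 is 90.00.
    assert abs(apply_discount(100.0, 0.10) - 90.0) < 1e-9
```

Swap in an implementation that forgets the discount entirely and the tautological test still passes; only the behavioral test goes red. Over-aggressive mocking produces the same failure mode by a different route.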
- Code Quality: The classic VDD roast — placeholder comments, generic error handling, inefficient patterns, hidden coupling, missing resource cleanup, race conditions.
- Security Surface: Input validation gaps, injection vectors, authentication/authorization assumptions.
- Spec Gaps Revealed by Implementation: Sometimes writing the code reveals that the spec was incomplete. The Adversary looks for implemented behavior that isn't covered by the spec.
Negative Prompting: Sarcasmotron is prompted for zero tolerance. No "overall this looks good, but..." preamble. Every piece of feedback is a concrete flaw with a specific location and a proposed fix or question.
Context Reset: Fresh context window on every adversarial pass. No relationship drift. No accumulated goodwill.
The Adversary's critique feeds back through the entire pipeline:
- Spec-level flaws → Return to Phase 1. Update the spec, re-review.
- Test-level flaws → Return to Phase 2a. Fix or add tests, verify they fail against the current implementation (or a deliberately broken version), then fix implementation if needed.
- Implementation-level flaws → Return to Phase 2c. Refactor, ensure all tests still pass.
- New edge cases discovered → Add to spec's Edge Case Catalog, write new failing tests, implement fixes.
This loop continues until convergence (see Phase 6).
As the adversarial loop tightens, the methodology escalates to mathematical verification:
- Model Checking & Symbolic Execution: Tools like Kani (Rust) and CBMC (C/C++), or verification-aware languages like Dafny, are integrated to prove properties the test suite can only suggest. Memory safety, arithmetic correctness, and state machine invariants are mathematically guaranteed.
- Fuzz Testing: Structured fuzzing (AFL++, libFuzzer, cargo-fuzz) is layered on top of property-based tests to find inputs that no human or AI anticipated.
- Security Hardening: Test suites like Wycheproof (cryptographic edge cases) and static analyzers like Semgrep are run as CI/CD gates.
- Mutation Testing: Tools like mutmut or Stryker mutate the code to verify the test suite actually catches real bugs. If a mutation survives, the test suite has a gap.
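What mutation testing checks can be sketched by hand. A tool like mutmut generates and runs mutants automatically; here one is written out explicitly (a hypothetical example, not any tool's actual output):

```python
# Unit under test: a withdrawal is allowed up to the exact balance.
def can_withdraw(balance: int, amount: int) -> bool:
    return 0 < amount <= balance


# A mutant a tool like mutmut might generate: "<=" weakened to "<".
def can_withdraw_mutant(balance: int, amount: int) -> bool:
    return 0 < amount < balance


def suite_kills(fn) -> bool:
    """Run the test suite against fn; True means a test failed (mutant killed)."""
    try:
        assert fn(100, 100) is True   # boundary test: withdraw the exact balance
        assert fn(100, 101) is False  # overdraw rejected
        assert fn(100, 0) is False    # zero rejected
        return False                  # every test passed: the mutant survived
    except AssertionError:
        return True
```

Delete the boundary test and the mutant survives, which is precisely the gap report a mutation run would hand back to Phase 2a.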
All formal verification and fuzzing results feed back into Phase 4 if issues are found.
VSDD inherits VDD's hallucination-based termination, extended across all three dimensions:
| Dimension | Convergence Signal |
|---|---|
| Spec | The Adversary's spec critiques are nitpicks about wording, not about missing behavior or ambiguity. |
| Tests | The Adversary can't identify a meaningful untested scenario. Mutation testing confirms high kill rate. |
| Implementation | The Adversary is forced to invent problems that don't exist in the code. Formal verifiers pass. Fuzzers find nothing. |
Maximum Viable Refinement is reached when all three dimensions have converged. The software is considered Zero-Slop — every line of code traces to a spec requirement, is covered by a test, and has survived adversarial scrutiny.
One of VSDD's defining properties is full traceability. Every artifact links back:
Spec Requirement → Chainlink Bead → Test Case → Implementation → Adversarial Review → Formal Proof
At any point, you can ask: "Why does this line of code exist?" and trace it all the way back to a specific spec requirement, through the test that demanded it, the adversarial review that hardened it, and (optionally) the formal proof that guarantees it.
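One lightweight way to make that trace machine-checkable is to annotate each test with the spec-requirement and bead IDs it covers. The decorator and IDs below are hypothetical; a real project might use pytest markers or commit-message conventions instead:

```python
def traces_to(*ids):
    """Attach spec-requirement / Chainlink-bead IDs to a test function."""
    def wrap(fn):
        fn.trace_ids = ids
        return fn
    return wrap


@traces_to("SPEC-AMT-3", "BEAD-142")  # hypothetical IDs
def test_empty_input_rejected():
    """Covers the 'empty input' edge case from the spec's catalog."""


# A CI step can then collect trace_ids across the suite, diff them against
# the spec's requirement IDs, and fail the build on any untraced requirement.
```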
- Spec Supremacy: The spec is the highest authority below the human developer. Tests serve the spec. Code serves the tests. Nothing exists without a reason traced to the spec.
- Red Before Green: No implementation code is written until a failing test demands it. AI models are explicitly constrained to follow TDD discipline — no "let me just write the whole thing and add tests after."
- Anti-Slop Bias: The first "correct" version is assumed to contain hidden debt. Trust is earned through adversarial survival, not initial appearance.
- Forced Negativity: Adversarial pressure bypasses the politeness filters of standard LLM interactions. The Adversary doesn't care about your feelings — it cares about your invariants.
- Linear Accountability: Chainlink beads ensure every spec item, test, and line of code has a corresponding tracked unit of work. Nothing slips through the cracks.
- Entropy Resistance: Context resets on every adversarial pass prevent the natural degradation of long-running AI conversations.
- Three-Dimensional Convergence: The system isn't done until specs, tests, and implementation have all independently survived adversarial review.
VSDD is explicitly designed for multi-model AI workflows:
- The Builder benefits from large context windows and strong code generation (Claude, GPT-4, etc.). It needs to hold the full spec, test suite, and implementation simultaneously.
- The Adversary benefits from a different model or configuration to avoid shared blind spots. Using a different model family (e.g., Gemini as Adversary when Claude is Builder) introduces genuine cognitive diversity.
- The Human is not a bottleneck — they're the strategic layer. They approve specs, resolve disputes, and make judgment calls that AI can't. The human's role is elevated, not diminished, by the AI orchestration.
Prompt Engineering for TDD Discipline: The Builder must be explicitly instructed: "You are operating under strict TDD. Write tests FIRST. Do NOT write implementation code until I confirm all tests fail. When implementing, write the MINIMUM code to pass each test." Without this constraint, AI models will naturally try to write implementation and tests simultaneously.
VSDD is high-ceremony by design. It's worth the overhead when:
- Correctness is non-negotiable (financial systems, medical software, infrastructure)
- The codebase will be maintained long-term and must resist entropy
- Multiple AI models are available and the team wants maximum quality extraction
- Security is a primary concern, not an afterthought
- The project complexity justifies formal spec work
For rapid prototyping or throwaway scripts, use the parts that make sense — TDD discipline and a quick adversarial pass can still catch a lot of slop even without the full ceremony.
"VSDD doesn't just generate code — it generates code that can prove why it exists, demonstrate that it works, and survive an adversary that wants it dead."