@ruvnet
Last active May 9, 2026 03:25

Agentic Validation System — three-layer regression protection for AI-built codebases (smoke harness + cryptographic witness + temporal history)

Agentic Validation System

A three-layer regression-protection stack for AI-built codebases that ship fast across many small fixes. Designed for projects where:

  • Releases are frequent (daily alphas, not monthly majors)
  • Fixes are small (one-line CLI parser swaps, dependency moves) but many in number
  • Multiple agents touch overlapping code in the same release window
  • A regression that ships affects every user immediately, not just one customer

The stack catches three distinct regression classes that traditional CI misses, then provides forensic tools to answer "when did this break and what changed?"


Why traditional CI is not enough

A standard CI pipeline runs unit tests, a typecheck, and maybe a build. Each regression we shipped on 2026-05-08 passed all three checks on the commit that broke it. Specifically:

| Regression | What broke | Why CI passed | What user saw |
|---|---|---|---|
| #1867 | @claude-flow/memory had better-sqlite3 as a hard dep + static import | CI ran on Node 20 where prebuilds existed, so the static import evaluated fine | npm install ruflo@latest failed on Node 26 with node-gyp errors |
| #1862 | ruflo-core plugin's hooks.json called --format true (not a real flag) | No CI test invoked the plugin's hooks.json against the CLI with realistic stdin | Every Write/Edit tool use printed [ERROR] Invalid value for --format: true |
| #1859 | CLI parser preferred stray positionals over named flags (14 sites) | Unit tests passed flags individually, never a --flag + boolean-shaped value combo | post-edit --file X --success true recorded "true" as the file path |

Three different regression classes, three different reasons unit tests missed them. The root cause is the same: unit tests verify code paths, not user-visible failure modes. When the bug lives in the integration boundary — install resolution, subprocess flag parsing, plugin/CLI version drift — unit tests don't see it.
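
To make the #1859 class concrete, here is a minimal sketch of the failure. The parser below is hypothetical and far simpler than the real CLI's; it only assumes that a boolean flag does not consume the next token, so the "true" in --success true is left behind as a stray positional. The two lookup orders mirror the swap recorded later as the #1859 marker.

```javascript
// Hypothetical minimal parser, for illustration only.
// Boolean flags never consume the following token.
function parseArgv(argv, booleanFlags = new Set(['success'])) {
  const flags = {};
  const args = [];
  for (let i = 0; i < argv.length; i++) {
    const token = argv[i];
    if (!token.startsWith('--')) {
      args.push(token); // stray positional
      continue;
    }
    const name = token.slice(2);
    if (booleanFlags.has(name)) flags[name] = true;
    else flags[name] = argv[++i]; // value flag consumes the next token
  }
  return { flags, args };
}

const ctx = parseArgv(['--file', 'X', '--success', 'true']);

const buggyFilePath = ctx.args[0] || ctx.flags.file; // positional wins
const fixedFilePath = ctx.flags.file || ctx.args[0]; // named flag wins

console.log({ buggyFilePath, fixedFilePath });
// prints { buggyFilePath: 'true', fixedFilePath: 'X' }
```

A unit test that passes --file X alone never produces the stray positional, so both orderings return 'X' and the bug stays invisible.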

The agentic validation system adds three layers that each test a user-visible failure mode against a real artifact.


Architecture

                 ┌─────────────────────────────────────────────────┐
                 │  Layer 1: Behavioral smoke tests                │
                 │  ─────────────────────────────                  │
                 │  Fresh `npm install` on real Node versions      │
                 │  Real subprocess invocation with real JSON      │
                 │  Asserts user-visible signal, not code path     │
                 └─────────────────────────────────────────────────┘
                                       ↓
                 ┌─────────────────────────────────────────────────┐
                 │  Layer 2: Cryptographic witness manifest        │
                 │  ─────────────────────────────                  │
                 │  SHA-256 + marker substring per fix             │
                 │  Ed25519-signed with deterministic seed         │
                 │  Anyone can re-derive the public key            │
                 └─────────────────────────────────────────────────┘
                                       ↓
                 ┌─────────────────────────────────────────────────┐
                 │  Layer 3: Append-only temporal history (JSONL)  │
                 │  ─────────────────────────────                  │
                 │  One snapshot per regen                         │
                 │  Per-fix status timeline                        │
                 │  Regression-introduction commit identification  │
                 └─────────────────────────────────────────────────┘

Each layer is independently useful and independently adoptable. Together they form a stack where:

  • Layer 1 catches the regression as a user would experience it.
  • Layer 2 confirms every documented fix is still in the code, even if Layer 1 has no specific test for it.
  • Layer 3 answers when the regression was introduced, so triage doesn't require manual git bisect.

Layer 1 — Behavioral smoke tests

The pattern: build the artifact under test in CI, drive it through the user-visible failure path with a real subprocess, assert on the user-visible signal.

Concrete instances

Install smoke (smoke-install-no-bsqlite.mjs) — for npm install failures:

# Pack the package as it would be published
cd v3/@claude-flow/memory && npm pack
# Install into a fresh directory with --omit=optional
# (simulates "native build failed" on platforms without prebuilds)
mkdir -p /tmp/smoke && cd /tmp/smoke && npm init -y && npm install /path/to/tarball --omit=optional
# Assert: package loads, runtime auto-falls-back, round-trip works
# (--input-type=module makes top-level await valid in the -e script)
node --input-type=module -e "
  const m = await import('@claude-flow/memory');
  const db = await m.createDatabase('/tmp/x.db', { provider: 'auto' });
  await db.initialize();
  // … round-trip a value …
"

This catches any form of "install fails when the optional native dep can't build" regardless of which dep it is. If a developer adds a new static import of an optional dependency, this fails immediately.

Hook smoke (test-hooks.mjs) — for plugin/CLI version drift:

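import { readFileSync } from 'node:fs';
import { spawnSync } from 'node:child_process';
// Read each PostToolUse hook from hooks.json (assumes a top-level "hooks" key; adjust the path)
const { hooks } = JSON.parse(readFileSync('hooks.json', 'utf8'));
const cmd = hooks.PostToolUse.find(h => h.matcher === 'Bash').hooks[0].command;
// Pipe synthetic Claude-Code-style JSON to it and capture the result
const { status, stdout } = spawnSync('bash', ['-c', cmd], {
  input: JSON.stringify({ tool_input: { command: 'echo hi' } }),
  encoding: 'utf8',
});
// Assert exit code 0, output contains the actual command not "true" (expect comes from the test runner)
expect(status).toBe(0);
expect(stdout).toContain('echo hi');
expect(stdout).not.toContain('Recording outcome for: true');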

The negative assertion (not.toContain('Recording outcome for: true')) is critical. A naive contains: 'true' test would pass against the broken code because the recorded value happened to be "true". Catching the wrong-value bug requires asserting the right value.
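
The difference between the two assertion styles can be shown without spawning anything. This is a small sketch: the output strings are illustrative stand-ins for real hook output (only the line "Recording outcome for: true" comes from the bug described above), and naiveCheck/strictCheck are hypothetical helper names.

```javascript
// Illustrative hook outputs; the "success: true" line is an assumption.
const brokenOutput = 'Recording outcome for: true\nsuccess: true';
const fixedOutput = 'Recording outcome for: echo hi\nsuccess: true';

// Naive: only checks that the word "true" shows up somewhere.
const naiveCheck = (out) => out.includes('true');

// Strict: requires the right value AND rejects the known-wrong value.
const strictCheck = (out) =>
  out.includes('echo hi') && !out.includes('Recording outcome for: true');

console.log(naiveCheck(brokenOutput), naiveCheck(fixedOutput));   // true true
console.log(strictCheck(brokenOutput), strictCheck(fixedOutput)); // false true
```

The naive check cannot tell the two builds apart, while the strict check fails on the broken output and passes on the fixed one.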

CI integration

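smoke-install-no-bsqlite:
  runs-on: ubuntu-latest
  strategy:
    matrix:
      node: ['22', '24']  # versions where prebuilds may be missing
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-node@v4
      with:
        node-version: ${{ matrix.node }}
    - run: |
        cd v3/@claude-flow/memory
        # npm pack prints the tarball name; make it absolute before changing directory
        TARBALL="$PWD/$(npm pack | tail -n 1)"
        mkdir -p /tmp/smoke && cd /tmp/smoke
        npm init -y && npm install "$TARBALL" --omit=optional
        # Assumes the smoke script lives in scripts/ at the repository root
        cp "$GITHUB_WORKSPACE/scripts/smoke-no-bsqlite.mjs" ./
        node smoke-no-bsqlite.mjs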

A failing job here means: a user running npm install <pkg> on a platform without prebuilds will hit this error. No reproduction required from the bug report.


Layer 2 — Cryptographic witness manifest

The pattern: every documented fix gets an entry containing the file path, a SHA-256 of that file at issuance, and a marker substring that must remain in the file while the fix is present. The whole manifest is hashed (SHA-256) and signed (Ed25519) using a deterministic seed derived from the git commit, so the public key can be re-derived without a committed private key.

Manifest shape (excerpt)

{
  "manifest": {
    "schema": "ruflo-witness/v1",
    "issuedAt": "2026-05-09T01:00:47.879Z",
    "gitCommit": "54c706f56...",
    "fixes": [
      {
        "id": "#1867",
        "desc": "Node 26 install: better-sqlite3 dynamic import + optionalDependencies",
        "file": "v3/@claude-flow/memory/dist/sqlite-backend.js",
        "sha256": "<64-char hex>",
        "marker": "(await import('better-sqlite3')).default",
        "markerVerified": true
      }
    ],
    "summary": { "totalFixes": 81, "verified": 81, "missing": 0 }
  },
  "integrity": {
    "manifestHashAlgo": "sha256",
    "manifestHash": "<64-char hex of canonical manifest>",
    "signatureAlgo": "ed25519",
    "publicKey": "<32-byte hex>",
    "signature": "<64-byte hex>",
    "seedDerivation": "sha256(gitCommit + ':ruflo-witness/v1')"
  }
}

How verification works

# Anyone with the same git commit can re-derive the public key
GITSHA=$(jq -r '.manifest.gitCommit' verification.md.json)
SEED=$(echo -n "$GITSHA:ruflo-witness/v1" | sha256sum | head -c 64)
# Then check Ed25519 signature against manifestHash with that key
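
The same derivation and check can be sketched in Node using only the built-in crypto module. This is not the shipped verify.mjs (which uses @noble/ed25519); it is a minimal sketch under two assumptions that the manifest does not spell out: that the signature covers the raw bytes of the hex-decoded manifestHash, and that the 32-byte seed is used directly as the Ed25519 private key seed. The helper names are hypothetical.

```javascript
const crypto = require('node:crypto');

// DER prefix that wraps a raw 32-byte Ed25519 seed as a PKCS#8 key (RFC 8410).
const PKCS8_ED25519_PREFIX = Buffer.from('302e020100300506032b657004220420', 'hex');

// Mirrors the documented seedDerivation: sha256(gitCommit + ':ruflo-witness/v1').
function deriveSeed(gitCommit) {
  return crypto.createHash('sha256').update(`${gitCommit}:ruflo-witness/v1`).digest();
}

function keyPairFromSeed(seed) {
  const privateKey = crypto.createPrivateKey({
    key: Buffer.concat([PKCS8_ED25519_PREFIX, seed]),
    format: 'der',
    type: 'pkcs8',
  });
  const publicKey = crypto.createPublicKey(privateKey);
  // The raw 32-byte public key is the tail of the SPKI encoding.
  const publicKeyHex = publicKey
    .export({ format: 'der', type: 'spki' })
    .subarray(-32)
    .toString('hex');
  return { privateKey, publicKey, publicKeyHex };
}

// Assumption: the signed message is the hex-decoded manifestHash.
function verifyManifest({ gitCommit, manifestHash, signature, publicKey }) {
  const derived = keyPairFromSeed(deriveSeed(gitCommit));
  if (derived.publicKeyHex !== publicKey) return false;
  return crypto.verify(
    null, // Ed25519 takes no separate digest algorithm
    Buffer.from(manifestHash, 'hex'),
    derived.publicKey,
    Buffer.from(signature, 'hex'),
  );
}

// Self-check with a made-up commit id and payload (not from a real manifest).
const gitCommit = '0123456789abcdef0123456789abcdef01234567';
const { privateKey, publicKeyHex } = keyPairFromSeed(deriveSeed(gitCommit));
const manifestHash = crypto.createHash('sha256').update('{"demo":true}').digest('hex');
const signature = crypto
  .sign(null, Buffer.from(manifestHash, 'hex'), privateKey)
  .toString('hex');

console.log(verifyManifest({ gitCommit, manifestHash, signature, publicKey: publicKeyHex }));
// prints true
```

One consequence of this design is worth keeping in mind: because the seed comes from public data, anyone who can re-derive the public key can also re-derive the private key. The signature therefore ties a manifest to a commit and detects accidental corruption or mismatch; it does not prove who produced the manifest.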

For each fix entry, the verifier computes:

  • Pass — file's SHA-256 matches manifest entry exactly (no drift since issuance)
  • Drift — file SHA-256 changed but the marker is still present (acceptable — codebase advanced)
  • Regressed — the marker is missing from the file (real regression — fix has been removed or refactored away)
  • Missing — the cited file no longer exists (rebuild needed, or fix retired)

CI gates publish on regressed === 0 && signatureValid.
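
The four statuses and the publish gate can be sketched as a small classifier. This is an illustration rather than the shipped verifier: classifyFix and publishAllowed are hypothetical names, and the check order (missing, then marker, then hash) is an assumption that gives the marker priority over the hash.

```javascript
const crypto = require('node:crypto');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Classify one manifest entry against the file currently on disk.
function classifyFix(entry, rootDir = '.') {
  const filePath = path.join(rootDir, entry.file);
  if (!fs.existsSync(filePath)) return 'missing';
  const bytes = fs.readFileSync(filePath);
  if (!bytes.toString('utf8').includes(entry.marker)) return 'regressed';
  const sha256 = crypto.createHash('sha256').update(bytes).digest('hex');
  return sha256 === entry.sha256 ? 'pass' : 'drift';
}

// The gate from the text: regressed === 0 && signatureValid.
function publishAllowed(statuses, signatureValid) {
  return signatureValid && statuses.every((s) => s !== 'regressed');
}

// Demo in a throwaway directory with a made-up file.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'witness-demo-'));
const file = path.join(root, 'backend.js');
const source = "const Database = (await import('better-sqlite3')).default;\n";
fs.writeFileSync(file, source);

const entry = {
  file: 'backend.js',
  marker: "(await import('better-sqlite3')).default",
  sha256: crypto.createHash('sha256').update(source).digest('hex'),
};

const atIssuance = classifyFix(entry, root);            // 'pass'
fs.appendFileSync(file, '// unrelated refactor\n');
const afterRefactor = classifyFix(entry, root);         // 'drift'
fs.writeFileSync(file, "import Database from 'better-sqlite3';\n");
const afterRemoval = classifyFix(entry, root);          // 'regressed'
fs.rmSync(file);
const afterDelete = classifyFix(entry, root);           // 'missing'

console.log(atIssuance, afterRefactor, afterRemoval, afterDelete);
```

Under the gate as stated, drift and missing do not block a publish on their own; only a regressed fix or an invalid signature does.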

Why marker substrings, not just SHA-256

A SHA-256-only check would flag every benign whitespace change as a regression. The marker is the semantic invariant — "the fix is the presence of this specific substring." If a developer refactors the file but preserves the fix, marker stays present, drift is recorded, no false alarm. If a developer deletes the fix, marker disappears, regression is caught.

Choosing markers is the load-bearing skill. Bad markers:

  • 'function': too generic; it is still present after the fix is removed, so the check keeps passing when it should fail
  • 'TODO': likely to flap as TODOs come and go

Good markers:

  • (await import('better-sqlite3')).default — distinctive and specific to the fix mechanism
  • (ctx.flags.file as string) || ctx.args[0] — the actual swap that fixed #1859
  • import * as bcrypt from 'bcryptjs' — proves the migration from bcrypt to bcryptjs is in dist
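
One rough way to screen a candidate marker is to count how often it occurs in the target file. The heuristic below is an assumption of this sketch, not part of the shipped toolkit: a marker that appears more than once is probably too generic, and one that does not appear at all cannot be registered. It says nothing about brittleness, which only shows up over time in the history.

```javascript
// Hypothetical helper: count non-overlapping occurrences of marker in source.
function countOccurrences(source, marker) {
  if (marker.length === 0) return 0;
  return source.split(marker).length - 1;
}

// Hypothetical heuristic for screening a marker before registering it.
function assessMarker(source, marker) {
  const count = countOccurrences(source, marker);
  if (count === 0) return 'absent';
  if (count > 1) return 'too generic';
  return 'unique';
}

// Made-up file content for the demo.
const source = [
  'async function openDatabase() {',
  "  const Database = (await import('better-sqlite3')).default;",
  "  return new Database(':memory:');",
  '}',
  'function close(db) { db.close(); }',
].join('\n');

console.log(assessMarker(source, 'function'));                                  // too generic
console.log(assessMarker(source, "(await import('better-sqlite3')).default")); // unique
console.log(assessMarker(source, 'bcryptjs'));                                  // absent
```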

Layer 3 — Append-only temporal history

The pattern: every regen of the witness appends one line to a JSONL file recording the snapshot. The history file is committed alongside the manifest. Queries against the history answer:

  • When was a regression introduced (which commit window)
  • What fixes have flapped between pass and regressed (likely a brittle marker)
  • Which fixes are persistently drifting (probably an unstable file)

Entry shape

{
  "v": 1,
  "commit": "54c706f56138...",
  "issuedAt": "2026-05-09T01:00:47.879Z",
  "branch": "main",
  "manifestHash": "<64-char hex>",
  "summary": { "totalFixes": 81, "verified": 81, "missing": 0 },
  "fixes": {
    "#1867": { "sha256": "...", "markerVerified": true },
    "F1":    { "sha256": "...", "markerVerified": true },
    /* ... one entry per fix, keyed by id ... */
  }
}
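
The append-only behaviour is simple to sketch: each regen serializes one entry onto one line and appends it. The helper names below are hypothetical, and the demo manifest holds placeholder values; only the field names follow the entry shape above.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Build one history entry from a manifest, keyed by fix id.
function toHistoryEntry(manifest, manifestHash, branch) {
  const fixes = {};
  for (const fix of manifest.fixes) {
    fixes[fix.id] = { sha256: fix.sha256, markerVerified: fix.markerVerified };
  }
  return {
    v: 1,
    commit: manifest.gitCommit,
    issuedAt: manifest.issuedAt,
    branch,
    manifestHash,
    summary: manifest.summary,
    fixes,
  };
}

// One JSON document per line; existing lines are never rewritten.
function appendSnapshot(historyPath, entry) {
  fs.appendFileSync(historyPath, JSON.stringify(entry) + '\n');
}

function readHistory(historyPath) {
  return fs
    .readFileSync(historyPath, 'utf8')
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Demo with placeholder data in a throwaway directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'witness-history-'));
const historyPath = path.join(dir, 'verification-history.jsonl');
const manifest = {
  gitCommit: 'abc123',
  issuedAt: '2026-05-09T01:00:47.879Z',
  summary: { totalFixes: 1, verified: 1, missing: 0 },
  fixes: [{ id: '#1867', sha256: 'placeholder', markerVerified: true }],
};

appendSnapshot(historyPath, toHistoryEntry(manifest, 'placeholder-hash', 'main'));
appendSnapshot(historyPath, toHistoryEntry({ ...manifest, gitCommit: 'def456' }, 'placeholder-hash', 'main'));

const history = readHistory(historyPath);
console.log(history.map((e) => e.commit)); // [ 'abc123', 'def456' ]
```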

Regression-introduction queries

# For each currently-regressed fix, find the commit that introduced the regression
node history.mjs --history verification-history.jsonl regressions
# Output:
#   F12
#     last pass:    a1b2c3d4e5f6  2026-05-07T14:23:11.000Z
#     regressed at: 9f8e7d6c5b4a  2026-05-08T09:14:55.000Z

# Now triage with git
git log a1b2c3d4..9f8e7d6c -- path/to/file

This collapses regression triage from "git bisect across 50 commits" to "read the diff for the 3 commits in this 18-hour window."
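
The query itself is a backwards walk over the snapshots. The sketch below is not history.mjs; it assumes a fix counts as regressed when its markerVerified is false, and the function name, commit ids, and timestamps are made up for the demo.

```javascript
// history: parsed JSONL entries, oldest first.
function findRegressionIntroductions(history) {
  const latest = history[history.length - 1];
  if (!latest) return [];
  const pick = (e) => ({ commit: e.commit, issuedAt: e.issuedAt });
  const report = [];
  for (const [id, state] of Object.entries(latest.fixes)) {
    if (state.markerVerified) continue; // not currently regressed
    let regressedAt = latest;
    let lastPass = null;
    // Walk back to the start of the current run of failures.
    for (let i = history.length - 2; i >= 0; i--) {
      const past = history[i].fixes[id];
      if (!past) break; // fix was not registered yet
      if (past.markerVerified) {
        lastPass = history[i];
        break;
      }
      regressedAt = history[i];
    }
    report.push({
      id,
      lastPass: lastPass ? pick(lastPass) : null,
      regressedAt: pick(regressedAt),
    });
  }
  return report;
}

// Made-up history: F12 passes twice, then regresses and stays regressed.
const snapshot = (commit, issuedAt, f12ok) => ({
  commit,
  issuedAt,
  fixes: { F12: { markerVerified: f12ok }, F1: { markerVerified: true } },
});
const history = [
  snapshot('abc123', '2026-05-06T10:00:00.000Z', true),
  snapshot('def456', '2026-05-07T10:00:00.000Z', true),
  snapshot('789abc', '2026-05-08T10:00:00.000Z', false),
  snapshot('012def', '2026-05-09T10:00:00.000Z', false),
];

const report = findRegressionIntroductions(history);
console.log(JSON.stringify(report, null, 2));
```

The two commits in each report entry are the endpoints to pass to git log, as in the triage command above.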

Status timeline for a single fix

node history.mjs --history verification-history.jsonl timeline --id F12
# Output:
#   2026-05-06T...  abc123  pass
#   2026-05-07T...  def456  pass
#   2026-05-08T...  789abc  regressed
#   2026-05-09T...  012def  regressed

A fix that flaps pass → regressed → pass → regressed is signalling that its marker is too brittle. A fix that's drift-only for 30 snapshots is signalling that its file is undergoing constant refactor and its SHA-256 baseline is meaningless — either accept perpetual drift or update the baseline.
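
Flapping can be detected mechanically by counting status changes in a fix's timeline. In this sketch the status is derived from markerVerified alone, and the threshold of three transitions is an arbitrary assumption; neither is taken from history.mjs.

```javascript
// Build the status timeline for one fix from parsed history entries (oldest first).
function statusTimeline(history, id) {
  return history
    .filter((entry) => entry.fixes[id])
    .map((entry) => ({
      commit: entry.commit,
      status: entry.fixes[id].markerVerified ? 'pass' : 'regressed',
    }));
}

// Count how many times the status changes between consecutive snapshots.
function countTransitions(timeline) {
  let transitions = 0;
  for (let i = 1; i < timeline.length; i++) {
    if (timeline[i].status !== timeline[i - 1].status) transitions++;
  }
  return transitions;
}

// Hypothetical threshold: three or more changes suggests a brittle marker.
function isFlapping(timeline, threshold = 3) {
  return countTransitions(timeline) >= threshold;
}

// Made-up histories for two fixes.
const mk = (commit, f12, f7) => ({
  commit,
  fixes: { F12: { markerVerified: f12 }, F7: { markerVerified: f7 } },
});
const history = [
  mk('c1', true, true),
  mk('c2', true, false),
  mk('c3', false, true),
  mk('c4', false, false),
];

const steady = statusTimeline(history, 'F12'); // pass, pass, regressed, regressed
const flappy = statusTimeline(history, 'F7');  // pass, regressed, pass, regressed

console.log(isFlapping(steady), isFlapping(flappy)); // false true
```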


Usage

Simple — adopt the pattern in your project (5 minutes)

# Bootstrap empty manifest + history + fixes template
node plugins/ruflo-core/scripts/witness/init.mjs

# Edit witness-fixes.json to register your fixes:
#   { "fixes": [ { "id": "MY-001", "desc": "...", "file": "src/foo.ts", "marker": "..." } ] }

# Install the only runtime dep
npm i @noble/ed25519

# Generate the signed manifest + first history entry
node plugins/ruflo-core/scripts/witness/regen.mjs \
  --manifest verification.md.json \
  --history  verification-history.jsonl \
  --fixes    witness-fixes.json

# Commit verification.md.json + verification-history.jsonl + witness-fixes.json together
git add verification.md.json verification-history.jsonl witness-fixes.json
git commit -m "feat: bootstrap witness manifest"

Simple — verify the witness in your CI

- name: Witness verify
  run: |
    node plugins/ruflo-core/scripts/witness/verify.mjs \
      --manifest verification.md.json \
      --json > /tmp/witness.json
    node -e "
      const r = require('/tmp/witness.json');
      if (!r.ok) { console.error('signature or fix regressed'); process.exit(1); }
    "

Intermediate — register a fix when shipping a release

When you ship a fix:

# 1. Identify a distinctive marker substring that will be present
#    while the fix is in the file. Use a unique pattern from the diff,
#    not generic words like "function" or "import".

# 2. Append to witness-fixes.json:
{
  "id": "#234",
  "desc": "Fix race condition in token refresh",
  "file": "dist/auth.js",
  "marker": "if (this._refreshing) return this._refreshing;"
}

# 3. Dry-run to confirm verified=N/N before writing:
node plugins/ruflo-core/scripts/witness/regen.mjs \
  --manifest verification.md.json \
  --history  verification-history.jsonl \
  --fixes    witness-fixes.json \
  --dry-run

# 4. Real run if dry-run looks good
node plugins/ruflo-core/scripts/witness/regen.mjs \
  --manifest verification.md.json \
  --history  verification-history.jsonl \
  --fixes    witness-fixes.json

# 5. Commit the trio
git add verification.md.json verification-history.jsonl witness-fixes.json
git commit -m "chore: register fix #234 in witness manifest"

Advanced — investigate a regression

CI reports F12 regressed. To find when it broke:

node plugins/ruflo-core/scripts/witness/history.mjs \
  --history verification-history.jsonl regressions

# Output:
#   F12
#     last pass:    a1b2c3d4  2026-05-07T14:23:11.000Z
#     regressed at: 9f8e7d6c  2026-05-08T09:14:55.000Z

# Read the diff for the 3 commits in that window
git log a1b2c3d4..9f8e7d6c -- $(jq -r '.manifest.fixes[] | select(.id == "F12") | .file' verification.md.json)

Advanced — combining smoke and witness layers in CI

jobs:
  smoke-install:
    name: Smoke install / Node ${{ matrix.node }}
    strategy:
      matrix: { node: ['22', '24'] }
    steps:
      - run: scripts/test-fresh-install.sh

  plugin-hooks-smoke:
    strategy:
      matrix: { node: ['20', '22'] }
    steps:
      - run: node scripts/test-hooks.mjs "node $PWD/bin/cli.js"

  witness-verify:
    needs: [smoke-install, plugin-hooks-smoke]   # both behavioral layers must pass first
    steps:
      - run: |
          node scripts/witness/verify.mjs --manifest verification.md.json --json > /tmp/r.json
          node -e "if (!require('/tmp/r.json').ok) process.exit(1)"
      - run: |
          node scripts/witness/history.mjs --history verification-history.jsonl summary
          # Soft signal: prints "newly regressed" fixes if any

  publish:
    needs: [smoke-install, plugin-hooks-smoke, witness-verify]
    if: github.ref == 'refs/heads/main'
    steps:
      - run: npm publish

The publish step gates on all three layers green. Behavioral smoke catches user-experience regressions. Witness catches presence regressions. History surfaces the introduction commit. Together they provide both prevention and forensics.


CI Integration Pitfalls

These are the specific traps I hit while wiring this into ruflo's GitHub Actions, and adopters are likely to hit them too. The fixes are small once you know to look for them; the failure modes are subtle when you don't.

1. pnpm isolated linker hides @noble/ed25519

verify.mjs loads @noble/ed25519 via createRequire. With pnpm's default isolated node-linker, transitive deps don't hoist to the workspace root unless a workspace member declares them directly. Locally you might have a flat copy at <root>/node_modules from an earlier npm install and never notice. In CI, a fresh pnpm-only install has no such copy, so the probe fails silently and verification reports signatureValid: false.

Fix: the probes array in verify.mjs and lib.mjs should include the workspace packages that do declare @noble/ed25519 directly:

const probes = [
  repoRoot,
  join(repoRoot, 'v3'),
  join(repoRoot, 'v3/@claude-flow/cli'),                 // declares ed25519
  join(repoRoot, 'v3/@claude-flow/plugin-agent-federation'), // declares ed25519
];

Adapt the inner package paths to your repo layout. The shipped script is pre-configured for ruflo's monorepo; in other projects, edit the array to match wherever @noble/ed25519 is a direct dep.
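
How such a probe list might be consumed is sketched below. This is an assumption about the mechanism, not the code in verify.mjs: resolveFromProbes is a hypothetical helper that tries Node's createRequire from each directory in turn and returns null when nothing resolves, so the caller can fail loudly instead of silently. The demo uses a made-up package name in a throwaway directory.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');
const { createRequire } = require('node:module');

// Try to resolve packageName as if it were required from each probe directory.
// Probe paths must be absolute.
function resolveFromProbes(packageName, probes) {
  for (const dir of probes) {
    try {
      // The filename only anchors the lookup; the file need not exist.
      return createRequire(path.join(dir, 'noop.js')).resolve(packageName);
    } catch {
      // Not resolvable from this directory; try the next probe.
    }
  }
  return null;
}

// Demo: pkg-a declares the dependency, pkg-b does not.
const root = fs.mkdtempSync(path.join(os.tmpdir(), 'probe-demo-'));
const declaring = path.join(root, 'pkg-a');
const bare = path.join(root, 'pkg-b');
const moduleDir = path.join(declaring, 'node_modules', 'witness-probe-demo');
fs.mkdirSync(moduleDir, { recursive: true });
fs.mkdirSync(bare, { recursive: true });
fs.writeFileSync(
  path.join(moduleDir, 'package.json'),
  JSON.stringify({ name: 'witness-probe-demo', version: '1.0.0', main: 'index.js' }),
);
fs.writeFileSync(path.join(moduleDir, 'index.js'), 'module.exports = {};\n');

const fromBareOnly = resolveFromProbes('witness-probe-demo', [bare]);
const fromBoth = resolveFromProbes('witness-probe-demo', [bare, declaring]);

console.log(fromBareOnly === null, fromBoth !== null);
```

In a verifier, a null result is a good place to exit with an explicit "could not load @noble/ed25519" error rather than continuing to a false signature result.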

2. Don't dogfood the CLI in CI's witness-verify step

There are two ways to invoke the verifier: the bundled CLI subcommand (ruflo verify) and the standalone plugin script (plugins/ruflo-core/scripts/witness/verify.mjs). They produce identical output.

Use the standalone in CI. The CLI binary may transitively load native modules (e.g. sharp for image processing, onnxruntime-node for embeddings). Depending on its version and configuration, pnpm may not run dependency postinstall scripts (pnpm 10 blocks them by default unless the package is allow-listed), so the prebuilds aren't fetched and the CLI fails on first import, long before reaching the verify code. The standalone has zero deps beyond @noble/ed25519.

# ✗ Don't do this in CI — pulls in CLI's native deps
- run: node bin/cli.js verify --manifest verification.md.json

# ✓ Do this — pure-JS, only @noble/ed25519
- run: node plugins/ruflo-core/scripts/witness/verify.mjs --manifest verification.md.json --json

3. npm pack chokes on workspace:* deps

If the smoke job packs a workspace package with npm pack (e.g. packing the memory package, then installing the tarball with --omit=optional to simulate a Node version without prebuilds), the workspace:* entries stay in the tarball's package.json unchanged, and the later npm install rejects them with EUNSUPPORTEDPROTOCOL.

Fix: use pnpm pack instead — it rewrites workspace:* to resolved versions before tarballing, producing a tarball that plain npm install can consume.

- name: Install workspace + build memory
  working-directory: v3
  run: |
    pnpm install --frozen-lockfile
    pnpm --filter @claude-flow/memory... run build

- name: Pack memory tarball (pnpm rewrites workspace:* → versions)
  id: pack
  working-directory: v3/@claude-flow/memory
  run: |
    TARBALL=$(pnpm pack --pack-destination /tmp 2>&1 | grep -E "\.tgz$" | head -1)
    echo "tarball=$TARBALL" >> "$GITHUB_OUTPUT"

4. Always print the verify output, never trust silent exit codes

set -e (the GitHub Actions default for run: blocks) kills the bash script the instant verify.mjs returns non-zero — before any diagnostic node block runs. Result: a 65ms job failure with no log output, and you have no idea which fix regressed or whether the signature even loaded.

Fix: always wrap the verify call in set +e ... set -e, capture both streams, and analyze unconditionally:

- name: Verify witness manifest
  run: |
    set +e
    node plugins/ruflo-core/scripts/witness/verify.mjs \
      --manifest verification.md.json \
      --json > /tmp/witness-result.json 2> /tmp/witness-result.err
    VERIFY_EXIT=$?
    set -e
    echo "--- verify.mjs exit code: $VERIFY_EXIT ---"
    echo "--- stderr ---"
    cat /tmp/witness-result.err || true
    echo "--- summary ---"
    node -e "
      const fs = require('fs');
      const raw = fs.readFileSync('/tmp/witness-result.json', 'utf8');
      if (!raw.trim()) { console.error('verify.mjs produced no JSON output'); process.exit(1); }
      const r = JSON.parse(raw);
      console.log(JSON.stringify({signature: r.signature, summary: r.summary}, null, 2));
      const failures = (r.results || []).filter(x => x.status !== 'pass' && x.status !== 'drift');
      if (failures.length) {
        console.error('non-pass fixes:');
        for (const f of failures) console.error('  ' + f.status + ': ' + f.id + ' (' + f.file + ')');
      }
      if (!r.ok) { console.error('witness verify FAILED'); process.exit(1); }
      if (r.summary.regressed > 0) { console.error('regressed fixes:', r.summary.regressed); process.exit(1); }
      console.log('witness verify ok:', r.summary.pass, 'pass,', r.summary.drift, 'drift');
    "

This costs nothing on the green path and gives you a concrete failure cause on the red path. The pattern generalizes — any CI step that gates on a signed/cryptographic check should surface why the check failed, not just that it failed.


Capabilities matrix

| Failure class | Layer that catches it | Example |
|---|---|---|
| Install fails on platform without prebuilds | 1 (install smoke) | npm install errors out during native build |
| Wrong CLI flag handling, parser ambiguity | 1 (subprocess smoke) | --flag value records the wrong value |
| Plugin calls flag the CLI doesn't have | 1 (subprocess smoke) | Hook prints Invalid value for --format: true |
| Documented fix silently removed | 2 (witness markers) | Refactor deletes the load-bearing line, code still compiles |
| Fix regressed: which commit? | 3 (history) | git bisect reduced to 3 commits in 18-hour window |
| Marker too brittle, flaps pass↔regressed | 3 (history) | Status timeline shows oscillation |

Adoption notes

  • No CLI required for adopters. The standalone scripts depend only on @noble/ed25519 (~15KB minified). Copy plugins/ruflo-core/scripts/witness/ into your project, install one package, run.
  • JSONL is committed, not gitignored. Without committed history, you lose Layer 3 entirely.
  • Markers are the load-bearing skill. Generic markers keep matching after the fix is gone, so regressions slip through; brittle markers flap. Aim for unique patterns specific to the fix mechanism.
  • The smoke and witness layers complement each other. Behavioral smoke catches things you wrote a test for. Witness catches things you didn't. Don't pick one.

Files

| Path | Role |
|---|---|
| plugins/ruflo-core/scripts/witness/lib.mjs | Shared regen + history primitives |
| plugins/ruflo-core/scripts/witness/init.mjs | Bootstrap into a fresh project |
| plugins/ruflo-core/scripts/witness/regen.mjs | Sign manifest + append history |
| plugins/ruflo-core/scripts/witness/verify.mjs | Validate signature + markers |
| plugins/ruflo-core/scripts/witness/history.mjs | Query temporal log |
| plugins/ruflo-core/skills/witness/SKILL.md | Workflow + anti-patterns |
| plugins/ruflo-core/agents/witness-curator.md | Agent for adding fixes / interpreting regressions |
| verification.md.json | The signed manifest itself |
| verification-history.jsonl | The append-only temporal log |
| witness-fixes.json | Project-specific fix list (input to regen) |

ADRs

  • v3/docs/adr/ADR-102-plugin-hook-cli-flag-regression-ci-guard.md — smoke harness pattern + flag-priority CLI convention
  • v3/docs/adr/ADR-103-witness-temporal-history.md — JSONL history layer + plugin-distributed toolkit

Related

  • ruvnet/ruflo verification.md — original witness manifest documentation
  • ~/.claude/.../project_verification_process.md — pre-toolkit inline regen process (superseded)