intent-eval-lab — one-pager + operator audit + record of change

intent-eval-lab

The methodology and specs home of the Intent Eval Platform — where spec authority is kept provably in sync with upstream truth.

A research umbrella for vendor-neutral evaluation methodology around AI plugins, agents, MCP servers, and skill-discovery systems. The repo ships no application code: its outputs are versioned spec modules, normative blueprints, a decision-record governance corpus, and a continuous spec-compliance watcher that monitors 16 upstream spec surfaces daily and feeds the canonical contracts kernel (@intentsolutions/core).

License: Apache 2.0

One-Pager

Problem

Spec authority drifts silently from upstream truth. The contracts that define agent-native artifacts (SKILL.md frontmatter, MCP server config, hooks, plugin manifests, marketplace catalogs) are spread across upstream surfaces (open standards, vendor reference pages, changelogs, release feeds) that change without notice. Any validator or schema kernel that encodes those contracts goes stale the moment an upstream field is added, removed, or has its required flag flipped, and nothing tells you. A watcher that merely polls pages has its own failure mode: a surface that starts returning 404 or is restructured keeps reading as "no drift, all green" indefinitely while the contract silently decays.

Solution

intent-eval-lab is the methodology layer that closes both gaps. It hosts the normative documents that govern the platform (Blueprints A/B/C, the canonical glossary, and the decision records, within a corpus of 50+ numbered docs), and it runs a three-layer deterministic spec-drift watcher:

  1. Byte-hash drift (scripts/spec-drift-check.sh) — "did the page change?" — 16 upstream surfaces, each compared against a committed snapshot hash.
  2. Field-level projection diff (scripts/spec-projection-diff.py) — "did the normative shape change?" — extracts a per-surface normative projection from a vendored snapshot and diffs it field-by-field against a committed projection.v1.json. The byte-hash is demoted to a cheap "should we re-extract?" tripwire.
  3. Liveness guards (scripts/watcher-liveness.py) — fetch-error streak counters and a dead-man heartbeat, so the watcher cannot fail silent-green.
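
The first two layers can be sketched as follows. This is a minimal illustration, not the code in scripts/spec-drift-check.sh or scripts/spec-projection-diff.py: the projection shape (`{field: {"required": bool}}`), the function names, and the sample field names are all assumptions made for the example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex digest used as the committed snapshot hash (assumed format)."""
    return hashlib.sha256(data).hexdigest()


def bytes_changed(snapshot: bytes, committed_hash: str) -> bool:
    """Layer 1: cheap tripwire that only answers 'did the page change?'."""
    return sha256_hex(snapshot) != committed_hash


def diff_projection(committed: dict, extracted: dict) -> list[str]:
    """Layer 2: field-by-field diff of two hypothetical projections."""
    findings = []
    for name in sorted(committed.keys() | extracted.keys()):
        old, new = committed.get(name), extracted.get(name)
        if old is None:
            findings.append(f"field-added: {name}")
        elif new is None:
            findings.append(f"field-removed: {name}")
        elif old.get("required") != new.get("required"):
            findings.append(f"required-flipped: {name}")
    return findings


# Hypothetical projections: one required flag flipped, one field added.
committed = {"name": {"required": True}, "description": {"required": True}}
extracted = {
    "name": {"required": True},
    "description": {"required": False},
    "license": {"required": False},
}

cosmetic = diff_projection(committed, committed)  # page noise only: no findings
drift = diff_projection(committed, extracted)
print(drift)
```

A byte-hash mismatch with an empty projection diff corresponds to cosmetic page noise; only a non-empty diff would be treated as normative drift.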

Confirmed drift is reconciled by a human into the kernel single source of truth (@intentsolutions/core schemas/authoring/v1); the differ never authors a kernel edit and never closes its own signal.

W5

Who: Jeremy Longshore / Intent Solutions. Single-maintainer; public so the methodology can be reviewed.
What: Methodology + specs umbrella, comprising versioned spec modules, normative blueprints and glossary, ISEDC decision records, and the continuous spec-compliance watcher infrastructure.
When: Active since 2026-05. Drift watch runs on a daily cron (09:00 UTC) plus PR/push self-tests. 51 numbered docs as of 2026-06-11.
Where: github.com/jeremylongshore/intent-eval-lab, one of the repos in the Intent Eval Platform that converge via a shared Evidence Bundle schema. Feeds the kernel at @intentsolutions/core.
Why: Vendor-neutral measurement of agent artifacts requires a spec authority that demonstrably tracks upstream truth, plus a governance trail explaining every architectural decision.

Stack

| Layer | Technology |
|---|---|
| Normative docs | Markdown under 000-docs/, numbered + category-coded filing standard (NNN-CC-ABCD-title-date.md) |
| Spec modules | Versioned SPEC.md + JSON Schema under specs/<module>/v0.1.0-draft/; schema files are kernel-redirect stubs |
| Surface registry | specs/upstream-surface-registry.v1.json: 16 surfaces with authority tiers, contracts, waves, extractors |
| Watcher scripts | Bash (curl + sha256sum) and Python 3.12, stdlib-only, offline-deterministic differs with --self-test |
| CI | GitHub Actions: 9 workflows, including the daily drift cron, kernel-canonical schema gate, and harness-hash verify |
| Governance | ISEDC 7-seat adversarial decision records (DR-002..DR-049) with verbatim seat positions preserved |
| License | Apache 2.0 |

Differentiators

  • 16-surface, field-level drift detection. Not just "the page changed" — a normative projection diff that distinguishes a field addition, a required-flag flip, or a deprecation from cosmetic page noise. Surfaces are ranked by authority tier (machine-readable official spec → official spec → vendor doc → reference page → release feed → changelog).
  • Deterministic-first, byte-hash tripwires. No LLM anywhere in the detection backbone. The differs are stdlib-only, operate offline on vendored snapshots, and carry a --self-test mode that proves every drift class is detected with no false positives — run as a real CI gate on every event.
  • Cannot fail silent-green. Fetch-error streaks (three or more consecutive errors fail loud, because the surface is effectively unmonitored) plus a dead-man heartbeat (external ping + retrospective gap check) close the asymmetric failure where a dead watcher looks like a healthy one.
  • Decision-record governance. Architectural choices are adjudicated by a 7-seat adversarial council; dissent is preserved verbatim, and every binding (schema authority, naming discipline, phase gates) cites its decision record by number.
  • Kernel-canonical schema authority, CI-enforced. The lab may host redirect stubs for discoverability but cannot host normative schema content — a structural gate, not a convention.
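
The liveness guards described above can be sketched as two small checks. This is an illustration under stated assumptions, not the contents of scripts/watcher-liveness.py: the streak limit of 3 comes from the text, while the 36-hour heartbeat tolerance and all function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

STREAK_LIMIT = 3                # from the text: >=3 consecutive errors fails loud
MAX_GAP = timedelta(hours=36)   # assumption: tolerance around a daily run


def update_streak(streak: int, fetch_ok: bool) -> int:
    """A successful fetch resets the counter; a failure extends it."""
    return 0 if fetch_ok else streak + 1


def surface_unmonitored(streak: int) -> bool:
    """A surface that keeps erroring is treated as unmonitored, not green."""
    return streak >= STREAK_LIMIT


def heartbeat_stale(last_run: datetime, now: datetime) -> bool:
    """Retrospective gap check: did the watcher itself stop running?"""
    return now - last_run > MAX_GAP


# One hypothetical surface: two errors, a recovery, then three errors in a row.
streak = 0
for fetch_ok in [False, False, True, False, False, False]:
    streak = update_streak(streak, fetch_ok)

now = datetime(2026, 6, 11, 9, 0, tzinfo=timezone.utc)
print(surface_unmonitored(streak))                     # streak is 3 here
print(heartbeat_stale(now - timedelta(days=3), now))   # three days of silence
```

Both checks turn the absence of a signal into an explicit failure, which is the property the bullet calls "cannot fail silent-green".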

Operator-Grade System Analysis

Architecture

The repo is docs-as-normative-artifacts plus watcher infrastructure. There is no build system — no package.json, no compiled output. What ships is markdown, JSON, and scripts.

intent-eval-lab/
├── 000-docs/        ← 51 numbered docs: blueprints, glossary, decision records, plans, AARs
├── specs/           ← versioned spec modules + the upstream-surface registry
│   ├── evidence-bundle/v0.1.0-draft/        (schema = kernel-redirect stub)
│   ├── mcp-plugin-observability/v0.1.0-draft/
│   ├── prompt-evaluation/v0.1.0-draft/
│   ├── upstream-surface-registry.v1.json    (16 monitored surfaces)
│   └── snapshots/.sha/                      (committed byte-hash baselines)
├── scripts/         ← watcher + governance enforcement (bash + Python, stdlib-only)
├── _vendor/upstream/ ← firewall snapshots the field-level differ operates on
├── research/        ← literature surveys, Phase A.0 baseline experiment code
├── sandboxes/       ← per-experiment dated dirs, isolated state
└── .github/workflows/ ← 9 CI workflows

Key structural decisions:

  • Normative content lives in numbered docs, not wikis. Blueprint A (ecosystem constitution, 12 binding principles), Blueprint B (runtime architecture, 13-entity domain model, the normative gate-result/v1 predicate spec), Blueprint C (per-repo blueprint template), and the canonical glossary are the precedence chain every downstream repo cites.
  • The lab does not own schemas. Per the schema-authority decision (DR-018 § 6.4, "Option α-minus"), the kernel @intentsolutions/core owns every canonical JSON Schema. Lab schema files must carry a top-level x-redirect field and must not declare canonical predicate fields — enforced by schema-drift.yml on every PR.
  • The surface registry and the executable SOURCES list are duplicated today, a known DRY violation: the registry JSON documents the surfaces, while the watcher script executes its own array. scripts/check-surface-registry.py is the consistency gate that asserts the two lists match by surface name and that every registered extractor function exists in the script. Wiring the script to read the JSON directly is a tracked follow-up.
  • Drift handling is human-in-the-loop by design. On a field-level drift, CI opens a reconciliation issue (auto-deduped by date) and pushes a notification; a human re-vendors the snapshot, refreshes the projection, and promotes the change into the kernel.
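
The registry consistency gate mentioned in the list above might look roughly like this. It is a sketch, not scripts/check-surface-registry.py itself: the registry layout (`surfaces` entries with `name` and `extractor` keys) and every surface and extractor name are assumptions for illustration.

```python
def check_registry(registry: dict, sources: list[str], extractors: set[str]) -> list[str]:
    """Compare a registry document against the watcher's own surface list.

    registry   -- parsed registry JSON (layout assumed for this sketch)
    sources    -- surface names the watcher script actually executes
    extractors -- extractor function names defined in the script
    """
    errors = []
    registered = {surface["name"] for surface in registry["surfaces"]}
    executed = set(sources)

    for name in sorted(registered - executed):
        errors.append(f"registered but not executed: {name}")
    for name in sorted(executed - registered):
        errors.append(f"executed but not registered: {name}")
    for surface in registry["surfaces"]:
        fn = surface.get("extractor")
        if fn and fn not in extractors:
            errors.append(f"missing extractor: {fn} ({surface['name']})")
    return errors


# Hypothetical data with one mismatch of each kind.
registry = {
    "surfaces": [
        {"name": "skill-frontmatter", "extractor": "extract_skill"},
        {"name": "mcp-spec", "extractor": "extract_mcp"},
    ]
}
sources = ["skill-frontmatter", "hooks-reference"]
extractors = {"extract_skill"}

errors = check_registry(registry, sources, extractors)
for line in errors:
    print(line)
```

An empty error list would let the gate pass; any entry would fail it until the two lists are brought back in sync.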

Operational reference

CI gates (all on github.com/jeremylongshore/intent-eval-lab):

| Workflow | Trigger | Gate |
|---|---|---|
| spec-drift-watch.yml | Daily cron 09:00 UTC; manual; PR/push self-test | Byte-hash check over 16 surfaces + field-level projection diff + liveness guards. Offline-deterministic gates (differ self-test, snapshot↔projection consistency) are enforced on every event; live drift/liveness gates can fail only the scheduled run. On trip: red X + push notification + auto-opened GH issue with diff summary. |
| schema-drift.yml | PR/push touching specs/evidence-bundle/** | Every lab *.schema.json must be an x-redirect kernel-redirect stub; declaring any of 14 canonical predicate field names fails. |
| partner-name-guard.yml | PR/push touching public docs | Case-insensitive grep across 000-docs/, specs/, README.md, CLAUDE.md, KNOWN-ISSUES.md for a partner-name pattern (maintained privately; an inline backstop copy lives in the workflow). Zero hits required. |
| harness-hash-verify.yml | Every PR + push to main | scripts/audit-harness verify against the .harness-hash manifest; exit 2 (HARNESS_TAMPERED) blocks the PR. |
| ci.yml | PR + push | JSON Schema syntax validation across specs/**/schema/*.json. |
| doc-quality.yml | PR/push touching markdown | markdownlint + Vale + link checking (lychee) + prettier. |
| python-tests.yml | PR/push touching Python research code | Tests for the Phase A.0 baseline code. |
| sign-evidence-bundle.yml | Manual/release path | Keyless cosign blob-signing of Phase A.0 result artifacts against production sigstore (Fulcio + Rekor). Attests file integrity + identity + time; deliberately declares no predicate URI (predicate declaration is gated on separate normative + DNS preconditions). |
| release.yml | Tag push | Cuts a GitHub Release and emits + cosign-signs a kernel gate-result/v1 report-manifest attached to it; a downstream reports hub re-verifies OIDC subject, Rekor inclusion, DSSE, and kernel schema at ingest. |
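
As an illustration of the schema-drift.yml gate, the redirect-stub rule could be checked along these lines. The sketch assumes the gate inspects JSON Schema `properties` keys; the 14 canonical predicate field names are not listed in this document, so the two names below are hypothetical placeholders, as is the `x-redirect` value.

```python
# Hypothetical placeholders: the real list of 14 names is not given here.
CANONICAL_FIELDS = frozenset({"gate_id", "verdict"})


def declared_properties(node) -> set[str]:
    """Collect every key declared under any 'properties' object, recursively."""
    found: set[str] = set()
    if isinstance(node, dict):
        props = node.get("properties")
        if isinstance(props, dict):
            found.update(props)          # adds the property names (dict keys)
        for value in node.values():
            found |= declared_properties(value)
    elif isinstance(node, list):
        for item in node:
            found |= declared_properties(item)
    return found


def stub_violations(schema: dict) -> list[str]:
    """Return the reasons a lab schema file is not a pure redirect stub."""
    problems = []
    if "x-redirect" not in schema:
        problems.append("missing top-level x-redirect")
    for name in sorted(declared_properties(schema) & CANONICAL_FIELDS):
        problems.append(f"declares canonical field: {name}")
    return problems


good_stub = {"x-redirect": "<kernel schema location>"}
bad_schema = {"type": "object", "properties": {"verdict": {"type": "string"}}}

print(stub_violations(good_stub))
print(stub_violations(bad_schema))
```

Checking for the absence of normative content, rather than only the presence of the redirect, is what would make the rule structural instead of a convention.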

Local verification (repo root):

scripts/spec-drift-check.sh                       # check all 16 surfaces; exit 1 on drift
scripts/spec-drift-check.sh --json                # machine-readable report (CI consumption)
scripts/spec-drift-check.sh --init                # seed missing baselines
python3 scripts/spec-projection-diff.py --self-test   # prove the differ detects every drift class
python3 scripts/spec-projection-diff.py --check       # vendored snapshot ↔ committed projection
python3 scripts/watcher-liveness.py --show            # current streak / heartbeat state
python3 scripts/check-surface-registry.py             # registry JSON ↔ watcher SOURCES consistency
scripts/audit-harness verify                          # hash-pinned policy verification

Governance enforcement scripts:

  • scripts/bd-claim-precheck.sh — machine-enforced plan-ratification gate: blocks claiming refiner-labeled work items unless the plan-audit STATUS.md reads RATIFIED (or RATIFIED-WITH-DELTAS with the item explicitly authorized). Replaces an honor-system rule with exit codes.
  • scripts/validate-trilink.sh — verifies cross-references between work items, docs, and GitHub issues for refiner-labeled artifacts.
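
The plan-ratification rule enforced by scripts/bd-claim-precheck.sh reduces to a small decision function. The sketch below is in Python rather than bash and is only an assumption about the logic: how STATUS.md is parsed and how authorized items are listed is not specified here, so the status string and item IDs are passed in directly and are hypothetical.

```python
def may_claim(status: str, item_id: str, authorized: frozenset[str] = frozenset()) -> bool:
    """Decide whether a refiner-labeled work item may be claimed.

    status     -- the ratification status read from the plan-audit STATUS.md
    item_id    -- the work item being claimed (hypothetical ID format)
    authorized -- items explicitly authorized under RATIFIED-WITH-DELTAS
    """
    status = status.strip()
    if status == "RATIFIED":
        return True
    if status == "RATIFIED-WITH-DELTAS":
        return item_id in authorized
    return False


def exit_code(status: str, item_id: str, authorized: frozenset[str] = frozenset()) -> int:
    """Map the decision to a process exit code: 0 allows, 1 blocks."""
    return 0 if may_claim(status, item_id, authorized) else 1


print(exit_code("RATIFIED", "item-1"))
print(exit_code("RATIFIED-WITH-DELTAS", "item-2", frozenset({"item-1"})))
print(exit_code("DRAFT", "item-1"))
```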

Security posture

  • Tamper-evident CI policy. The vendored audit-harness pins workflow definitions, the schema redirect stub, and its own scripts in a .harness-hash manifest; any unreviewed mutation fails harness-hash-verify with a hard block. Legitimate changes require an explicit re-init committed alongside the edit.
  • Least-privilege workflows. Workflows default to contents: read; only the drift watcher adds issues: write (for auto-opening drift issues).
  • Partner-name discipline. A standing decision-record binding requires zero partner-name hits in any public artifact; the CI grep gate enforces it on every PR and push, case-insensitively, with the canonical pattern held outside the public repo.
  • Deterministic, offline gates where it matters. The projection differ and liveness checker take no network input in CI gating mode — they operate on vendored snapshots and committed state, so a compromised or flaky upstream cannot flip a gate.
  • Signing is scoped and honest. The evidence-bundle signing workflow attests artifact digests, identity, and time via production sigstore — and its comments explicitly enumerate what it does not attest (no predicate claim, no statistical-correctness claim) until separate normative preconditions clear.
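
The tamper-evidence idea behind the .harness-hash manifest can be illustrated with a hash-pinning check. This is a sketch under assumptions: the real manifest format and the behavior of scripts/audit-harness are not shown in this document, so the manifest is modeled as a plain path-to-SHA-256 mapping and the pinned file name is hypothetical. Only the exit code 2 (HARNESS_TAMPERED) is taken from the text.

```python
import hashlib
import tempfile
from pathlib import Path

HARNESS_TAMPERED = 2  # exit code named in the CI gate description


def file_sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify(root: Path, manifest: dict[str, str]) -> int:
    """Return 0 if every pinned file matches its hash, else HARNESS_TAMPERED."""
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or file_sha256(target) != expected:
            return HARNESS_TAMPERED
    return 0


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    pinned = root / "policy.yml"                       # hypothetical pinned file
    pinned.write_text("permissions:\n  contents: read\n")
    manifest = {"policy.yml": file_sha256(pinned)}     # the "init" step

    clean = verify(root, manifest)
    pinned.write_text("permissions:\n  contents: write\n")  # unreviewed mutation
    tampered = verify(root, manifest)

print(clean, tampered)
```

A legitimate edit would regenerate the manifest in the same commit, which is the explicit re-init the first bullet describes.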

Current state (2026-06-11)

  • 51 numbered docs in 000-docs/; Phase A foundation (Blueprints A/B/C + canonical glossary) merged and NORMATIVE.
  • 16 upstream surfaces monitored across three rollout waves (skill frontmatter, MCP spec + machine-readable schema, hooks/settings/slash-commands references, plugins/sub-agents/marketplaces references, version + release feeds, engineering blog); one candidate surface documented as deliberately unmonitored pending a sampleable raw file.
  • Spec modules: evidence-bundle v0.1.0-draft (schema redirected to the kernel), mcp-plugin-observability v0.1.0-draft, prompt-evaluation v0.1.0-draft; placeholder modules reserve slots for validator-contract-reliability, forecasting-drift-detection, and decentralized-crypto-evaluation.
  • The Spec Authority Kernel charter is ratified (DR-049); the surface registry is NORMATIVE (doc 050); the umbrella-wide review-and-fix wave is recorded in the 051 after-action report.
  • Known seams, tracked: registry↔script duplication (a DRY violation, consistency-gated), liveness state cached rather than committed between runs (committed seed survives eviction), external heartbeat ping optional, enabled only when its secret is configured.

Changelog

The root CHANGELOG.md is release-grained (Keep a Changelog format; latest entry 0.2.0, 2026-05-26, packaging the Phase A foundation). Day-to-day change in this repo is recorded in the numbered 000-docs/ series — every decision, plan, spec, and after-action report lands as an immutable numbered doc (NNN-CC-ABCD-title-date.md), so the doc index itself is the running record of change. The ten most recent:

| Doc | Date | What it records |
|---|---|---|
| 051-AA-AACR-umbrella-review-and-fix-wave | 2026-06-11 | After-action record of the umbrella-wide review-and-fix wave (companion machine-readable data at 051a-...-data.json). |
| 050-AT-SPEC-upstream-surface-registry | 2026-06-11 | NORMATIVE registry of the 16 upstream spec surfaces the drift watcher monitors, with authority-tier precedence. |
| 049-AT-DECR-isedc-class-1-charter-ratification | 2026-06-10 | Class-1 charter ratification: all 6 Spec-Authority-Kernel decisions RATIFIED-WITH-CONSTRAINTS; flips SAK work items to claimable. |
| 048-PP-PLAN-skill-refiner-sak-amendment-v8 | 2026-06-10 | Plan amendment closing the 4 blocking findings from the v7 incremental re-audit. |
| 047-AT-ARCH-repo-blueprint | 2026-06-10 | This repo's own blueprint (instantiating the Blueprint C template). |
| 046-AT-STND-sak-governance-owners | 2026-06-10 | Seat-bound decision-authority taxonomy for Spec Authority Kernel governance. |
| 045-RR-LAND-single-source-of-truth-and-continuous-spec-compliance | 2026-06-09 | The SSoT rationale: why the kernel is the single source of authoring-artifact validity and how the drift watch keeps it continuously adherent to upstream. |
| 044-AT-DECR-isedc-council-session-8-sak-charter | 2026-06-09 | 7-seat council decision record on the Spec Authority Kernel charter. |
| 043-DR-RFC-intent-eval-target-generalization | 2026-06-06 | RFC on generalizing evaluation targets beyond SKILL.md artifacts. |
| 042-RR-LAND-prompt-and-context-eval-landscape | 2026-06-06 | Research landscape survey of prompt- and context-evaluation tooling. |

Convention: RR = research/recon, PP = plan, AA = after-action, AT-DECR = decision record, AT-ARCH/AT-SPEC = architecture/spec, DR-GLOS = glossary. Decision records run DR-002 through DR-049; superseded docs are marked in place rather than deleted, so the audit trail stays intact.
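
Because the doc index is the running record of change, the identifiers in it are easy to parse mechanically. The sketch below splits a doc identifier as shown in the table above into number, category code, type code, and title. The regular expression is an assumption inferred from those ten identifiers (two-letter category, three- or four-letter type), not the repo's filing-standard validator, and it does not handle the date suffix or variants such as 051a.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape, inferred from the identifiers listed in the changelog table.
DOC_ID = re.compile(r"^(\d{3})-([A-Z]{2})-([A-Z]{3,4})-(.+)$")


class DocId(NamedTuple):
    number: int
    category: str
    doc_type: str
    title: str


def parse_doc_id(doc_id: str) -> Optional[DocId]:
    """Return the parsed parts, or None if the identifier does not match."""
    match = DOC_ID.match(doc_id)
    if match is None:
        return None
    number, category, doc_type, title = match.groups()
    return DocId(int(number), category, doc_type, title)


parsed = parse_doc_id("050-AT-SPEC-upstream-surface-registry")
print(parsed)
print(parse_doc_id("043-DR-RFC-intent-eval-target-generalization"))
print(parse_doc_id("README"))
```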
