The methodology and specs home of the Intent Eval Platform — where spec authority is kept provably in sync with upstream truth.
A research umbrella for vendor-neutral evaluation methodology around AI plugins, agents, MCP servers, and skill-discovery systems. The repo ships no application code: its outputs are versioned spec modules, normative blueprints, a decision-record governance corpus, and a continuous spec-compliance watcher that monitors 16 upstream spec surfaces daily and feeds the canonical contracts kernel (@intentsolutions/core).
Spec authority drifts silently from upstream truth. The contracts that define agent-native artifacts (SKILL.md frontmatter, MCP server config, hooks, plugin manifests, marketplace catalogs) are spread across upstream surfaces (open standards, vendor reference pages, changelogs, release feeds) that change without notice. Any validator or schema kernel that encodes those contracts goes stale the moment an upstream field is added or removed, or has its required flag flipped, and nothing tells you. A watcher that merely polls pages has its own failure mode: a surface that 404s or restructures can read as "no drift, all green" indefinitely while the contract silently decays.
intent-eval-lab is the methodology layer that closes both gaps. It hosts the normative documents (Blueprints A/B/C, the canonical glossary, 50+ numbered decision records) that govern the platform, and it runs a three-layer deterministic spec-drift watcher:
- Byte-hash drift (`scripts/spec-drift-check.sh`): "did the page change?" Covers 16 upstream surfaces, each compared against a committed snapshot hash.
- Field-level projection diff (`scripts/spec-projection-diff.py`): "did the normative shape change?" Extracts a per-surface normative projection from a vendored snapshot and diffs it field-by-field against a committed `projection.v1.json`. The byte-hash is demoted to a cheap "should we re-extract?" tripwire.
- Liveness guards (`scripts/watcher-liveness.py`): fetch-error streak counters and a dead-man heartbeat, so the watcher cannot fail silent-green.
Confirmed drift is reconciled by a human into the kernel single source of truth (@intentsolutions/core schemas/authoring/v1); the differ never authors a kernel edit and never closes its own signal.
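The byte-hash layer can be illustrated in a few lines. This is a minimal Python sketch of the idea only, not the repo's `spec-drift-check.sh` (which is Bash with curl + sha256sum); the function names and the sample surface names are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a fetched page body."""
    return hashlib.sha256(data).hexdigest()


def byte_hash_drift(fetched: dict[str, bytes], baselines: dict[str, str]) -> list[str]:
    """Return the names of surfaces whose bytes no longer match the baseline.

    A surface with no committed baseline is reported too, so a missing
    baseline can never read as "no drift".
    """
    drifted = []
    for name, body in sorted(fetched.items()):
        if baselines.get(name) != sha256_hex(body):
            drifted.append(name)
    return drifted


# Hypothetical surface names and bodies, for illustration only.
baselines = {"skill-frontmatter": sha256_hex(b"name: required\n")}
fetched = {
    "skill-frontmatter": b"name: required\nlicense: optional\n",  # body changed
    "mcp-schema": b"{}",                                          # no baseline yet
}
print(byte_hash_drift(fetched, baselines))
```

A hit here only says that bytes changed. Whether the normative shape changed is the job of the projection diff.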
| | |
|---|---|
| Who | Jeremy Longshore / Intent Solutions. Single-maintainer; public so the methodology can be reviewed. |
| What | Methodology + specs umbrella: versioned spec modules, normative blueprints and glossary, ISEDC decision records, and the continuous spec-compliance watcher infrastructure. |
| When | Active since 2026-05. Drift watch runs on a daily cron (09:00 UTC) plus PR/push self-tests. 51 numbered docs as of 2026-06-11. |
| Where | github.com/jeremylongshore/intent-eval-lab, one of the repos in the Intent Eval Platform that converge via a shared Evidence Bundle schema. Feeds the kernel at @intentsolutions/core. |
| Why | Vendor-neutral measurement of agent artifacts requires a spec authority that demonstrably tracks upstream truth — and a governance trail explaining every architectural decision. |
| Layer | Technology |
|---|---|
| Normative docs | Markdown under 000-docs/, numbered + category-coded filing standard (NNN-CC-ABCD-title-date.md) |
| Spec modules | Versioned SPEC.md + JSON Schema under specs/<module>/v0.1.0-draft/; schema files are kernel-redirect stubs |
| Surface registry | specs/upstream-surface-registry.v1.json — 16 surfaces with authority tiers, contracts, waves, extractors |
| Watcher scripts | Bash (curl + sha256sum) and Python 3.12, stdlib-only, offline-deterministic differs with --self-test |
| CI | GitHub Actions — 9 workflows including the daily drift cron, kernel-canonical schema gate, and harness-hash verify |
| Governance | ISEDC 7-seat adversarial decision records (DR-002..DR-049) with verbatim seat positions preserved |
| License | Apache 2.0 |
- 16-surface, field-level drift detection. Not just "the page changed" — a normative projection diff that distinguishes a field addition, a required-flag flip, or a deprecation from cosmetic page noise. Surfaces are ranked by authority tier (machine-readable official spec → official spec → vendor doc → reference page → release feed → changelog).
- Deterministic-first, byte-hash tripwires. No LLM anywhere in the detection backbone. The differs are stdlib-only, operate offline on vendored snapshots, and carry a `--self-test` mode that proves every drift class is detected with no false positives; it runs as a real CI gate on every event.
- Cannot fail silent-green. Fetch-error streaks (≥3 consecutive errors fails loud, because the surface is effectively unmonitored) plus a dead-man heartbeat (external ping + retrospective gap check) close the asymmetric failure where a dead watcher looks like a healthy one.
- Decision-record governance. Architectural choices are adjudicated by a 7-seat adversarial council; dissent is preserved verbatim, and every binding (schema authority, naming discipline, phase gates) cites its decision record by number.
- Kernel-canonical schema authority, CI-enforced. The lab may host redirect stubs for discoverability but cannot host normative schema content — a structural gate, not a convention.
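To make the drift classes above concrete, here is a minimal sketch of a field-level projection diff. The projection shape (field name mapped to `required` / `deprecated` attributes), the drift-class labels, and the sample fields are assumptions for illustration; the actual `projection.v1.json` format is not shown in this README.

```python
def diff_projection(old: dict, new: dict) -> list[tuple[str, str]]:
    """Compare two projections and return (drift_class, field) pairs.

    Each projection maps a field name to its normative attributes, e.g.
    {"name": {"required": True, "deprecated": False}}. This shape is an
    assumption made for the sketch.
    """
    findings = []
    for field in sorted(old.keys() | new.keys()):
        if field not in old:
            findings.append(("field-added", field))
        elif field not in new:
            findings.append(("field-removed", field))
        else:
            before, after = old[field], new[field]
            if before.get("required", False) != after.get("required", False):
                findings.append(("required-flipped", field))
            if not before.get("deprecated", False) and after.get("deprecated", False):
                findings.append(("deprecated", field))
    return findings


# Hypothetical committed projection vs. a freshly extracted one.
committed = {
    "name": {"required": True},
    "description": {"required": False},
    "model": {"required": False},
}
extracted = {
    "name": {"required": True},
    "description": {"required": True},                  # required flag flipped
    "model": {"required": False, "deprecated": True},   # newly deprecated
    "license": {"required": False},                     # new field
}
for drift_class, field in diff_projection(committed, extracted):
    print(f"{drift_class}: {field}")
```

Because the diff works on extracted fields rather than raw bytes, a cosmetic page change that leaves the projection identical yields an empty result.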
The repo is docs-as-normative-artifacts plus watcher infrastructure. There is no build system — no package.json, no compiled output. What ships is markdown, JSON, and scripts.
intent-eval-lab/
├── 000-docs/ ← 51 numbered docs: blueprints, glossary, decision records, plans, AARs
├── specs/ ← versioned spec modules + the upstream-surface registry
│ ├── evidence-bundle/v0.1.0-draft/ (schema = kernel-redirect stub)
│ ├── mcp-plugin-observability/v0.1.0-draft/
│ ├── prompt-evaluation/v0.1.0-draft/
│ ├── upstream-surface-registry.v1.json (16 monitored surfaces)
│ └── snapshots/.sha/ (committed byte-hash baselines)
├── scripts/ ← watcher + governance enforcement (bash + Python, stdlib-only)
├── _vendor/upstream/ ← firewall snapshots the field-level differ operates on
├── research/ ← literature surveys, Phase A.0 baseline experiment code
├── sandboxes/ ← per-experiment dated dirs, isolated state
└── .github/workflows/ ← 9 CI workflows
Key structural decisions:
- Normative content lives in numbered docs, not wikis. Blueprint A (ecosystem constitution, 12 binding principles), Blueprint B (runtime architecture, 13-entity domain model, the normative `gate-result/v1` predicate spec), Blueprint C (per-repo blueprint template), and the canonical glossary are the precedence chain every downstream repo cites.
- The lab does not own schemas. Per the schema-authority decision (DR-018 § 6.4, "Option α-minus"), the kernel `@intentsolutions/core` owns every canonical JSON Schema. Lab schema files must carry a top-level `x-redirect` field and must not declare canonical predicate fields; this is enforced by `schema-drift.yml` on every PR.
- The surface registry and the executable SOURCES list are duplicated today, a known DRY violation (the registry JSON documents; the watcher script executes its own array). `scripts/check-surface-registry.py` is the consistency gate that asserts they are identical by surface name and that every registered extractor function exists in the script. Wiring the script to read the JSON directly is a tracked follow-up.
- Drift handling is human-in-the-loop by design. On a field-level drift, CI opens a reconciliation issue (auto-deduped by date) and pushes a notification; a human re-vendors the snapshot, refreshes the projection, and promotes the change into the kernel.
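The registry-to-script consistency gate described above can be sketched as a set comparison. This is an illustration only, not the contents of `scripts/check-surface-registry.py`; the registry keys (`surfaces`, `name`, `extractor`) and all sample values are assumed.

```python
def check_registry(registry: dict, sources: list[str], extractor_names: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means consistent.

    `registry` mimics a parsed registry JSON, `sources` the watcher's own
    array, and `extractor_names` the functions defined in the script. All
    three shapes are assumptions for this sketch.
    """
    problems = []
    registered = {surface["name"] for surface in registry["surfaces"]}
    executed = set(sources)
    for name in sorted(registered - executed):
        problems.append(f"registered but not executed: {name}")
    for name in sorted(executed - registered):
        problems.append(f"executed but not registered: {name}")
    for surface in registry["surfaces"]:
        extractor = surface.get("extractor")
        if extractor is not None and extractor not in extractor_names:
            problems.append(f"missing extractor: {extractor}")
    return problems


# Hypothetical data showing all three failure kinds.
registry = {
    "surfaces": [
        {"name": "skill-frontmatter", "extractor": "extract_frontmatter"},
        {"name": "mcp-schema", "extractor": "extract_json_schema"},
    ]
}
sources = ["skill-frontmatter", "hooks-reference"]
extractor_names = {"extract_frontmatter"}
for problem in check_registry(registry, sources, extractor_names):
    print(problem)
```

A gate like this keeps the duplication safe until the script reads the JSON directly, at which point the check becomes unnecessary.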
CI gates (all on github.com/jeremylongshore/intent-eval-lab):
| Workflow | Trigger | Gate |
|---|---|---|
| `spec-drift-watch.yml` | Daily cron 09:00 UTC; manual; PR/push self-test | Byte-hash check over 16 surfaces + field-level projection diff + liveness guards. Offline-deterministic gates (differ self-test, snapshot↔projection consistency) are enforced on every event; live drift/liveness gates are enforced only on the scheduled run. On trip: red X + push notification + auto-opened GH issue with diff summary. |
| `schema-drift.yml` | PR/push touching `specs/evidence-bundle/**` | Every lab `*.schema.json` must be an x-redirect kernel-redirect stub; declaring any of 14 canonical predicate field names fails. |
| `partner-name-guard.yml` | PR/push touching public docs | Case-insensitive grep across 000-docs/, specs/, README.md, CLAUDE.md, KNOWN-ISSUES.md for a partner-name pattern (maintained privately; an inline backstop copy lives in the workflow). Zero hits required. |
| `harness-hash-verify.yml` | Every PR + push to main | scripts/audit-harness verify against the .harness-hash manifest; exit 2 (HARNESS_TAMPERED) blocks the PR. |
| `ci.yml` | PR + push | JSON Schema syntax validation across `specs/**/schema/*.json`. |
| `doc-quality.yml` | PR/push touching markdown | markdownlint + Vale + link checking (lychee) + prettier. |
| `python-tests.yml` | PR/push touching Python research code | Tests for the Phase A.0 baseline code. |
| `sign-evidence-bundle.yml` | Manual/release path | Keyless cosign blob-signing of Phase A.0 result artifacts against production sigstore (Fulcio + Rekor). Attests file integrity + identity + time; deliberately declares no predicate URI (predicate declaration is gated on separate normative + DNS preconditions). |
| `release.yml` | Tag push | Cuts a GitHub Release and emits + cosign-signs a kernel gate-result/v1 report-manifest attached to it; a downstream reports hub re-verifies OIDC subject, Rekor inclusion, DSSE, and kernel schema at ingest. |
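As an illustration of the `schema-drift.yml` rule, the sketch below checks that a schema file is a redirect stub. It is not the workflow's actual implementation. The field names in `CANONICAL_PREDICATE_FIELDS` are placeholders (the real list of 14 is not given here), and treating "declares" as "appears under top-level `properties`" is an assumption.

```python
import json

# Placeholder names: the real list of 14 canonical predicate fields is not
# given in this README.
CANONICAL_PREDICATE_FIELDS = {"example_predicate_field_a", "example_predicate_field_b"}


def stub_violations(schema_text: str) -> list[str]:
    """Return the reasons a lab schema file fails the redirect-stub rule."""
    schema = json.loads(schema_text)
    violations = []
    if "x-redirect" not in schema:
        violations.append("missing top-level x-redirect")
    # Assumption: "declaring" a field means listing it under top-level properties.
    declared = set(schema.get("properties", {})) & CANONICAL_PREDICATE_FIELDS
    for name in sorted(declared):
        violations.append(f"declares canonical predicate field: {name}")
    return violations


# Hypothetical stub and a hypothetical offending schema.
stub = '{"x-redirect": "kernel", "title": "evidence-bundle"}'
bad = '{"properties": {"example_predicate_field_a": {"type": "string"}}}'
print(stub_violations(stub))
print(stub_violations(bad))
```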
Local verification (repo root):
scripts/spec-drift-check.sh # check all 16 surfaces; exit 1 on drift
scripts/spec-drift-check.sh --json # machine-readable report (CI consumption)
scripts/spec-drift-check.sh --init # seed missing baselines
python3 scripts/spec-projection-diff.py --self-test # prove the differ detects every drift class
python3 scripts/spec-projection-diff.py --check # vendored snapshot ↔ committed projection
python3 scripts/watcher-liveness.py --show # current streak / heartbeat state
python3 scripts/check-surface-registry.py # registry JSON ↔ watcher SOURCES consistency
scripts/audit-harness verify # hash-pinned policy verification

Governance enforcement scripts:
- `scripts/bd-claim-precheck.sh`: machine-enforced plan-ratification gate that blocks claiming refiner-labeled work items unless the plan-audit `STATUS.md` reads RATIFIED (or RATIFIED-WITH-DELTAS with the item explicitly authorized). Replaces an honor-system rule with exit codes.
- `scripts/validate-trilink.sh`: verifies cross-references between work items, docs, and GitHub issues for refiner-labeled artifacts.
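The ratification rule reduces to a small decision function. The sketch below is a Python illustration of that rule, not the Bash script itself. How the real script reads `STATUS.md` is not shown here, so the status arrives as a plain string; the item ids and the 0/1 exit-code values are assumptions.

```python
def may_claim(status: str, item_id: str, authorized_items: frozenset = frozenset()) -> bool:
    """Decide whether a work item may be claimed under the ratification rule."""
    normalized = status.strip().upper()
    if normalized == "RATIFIED":
        return True
    if normalized == "RATIFIED-WITH-DELTAS":
        # Only items explicitly authorized by the deltas are claimable.
        return item_id in authorized_items
    return False


def precheck_exit_code(status: str, item_id: str, authorized_items: frozenset = frozenset()) -> int:
    """Map the decision to an exit code (0 = allowed, 1 = blocked; assumed values)."""
    return 0 if may_claim(status, item_id, authorized_items) else 1


# Hypothetical item ids.
print(precheck_exit_code("RATIFIED", "item-1"))
print(precheck_exit_code("DRAFT", "item-1"))
print(precheck_exit_code("RATIFIED-WITH-DELTAS", "item-2", frozenset({"item-1"})))
```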
- Tamper-evident CI policy. The vendored audit-harness pins workflow definitions, the schema redirect stub, and its own scripts in a `.harness-hash` manifest; any unreviewed mutation fails `harness-hash-verify` with a hard block. Legitimate changes require an explicit re-init committed alongside the edit.
- Least-privilege workflows. Workflows default to `contents: read`; only the drift watcher adds `issues: write` (for auto-opening drift issues).
- Partner-name discipline. A standing decision-record binding requires zero partner-name hits in any public artifact; the CI grep gate enforces it on every PR and push, case-insensitively, with the canonical pattern held outside the public repo.
- Deterministic, offline gates where it matters. The projection differ and liveness checker take no network input in CI gating mode — they operate on vendored snapshots and committed state, so a compromised or flaky upstream cannot flip a gate.
- Signing is scoped and honest. The evidence-bundle signing workflow attests artifact digests, identity, and time via production sigstore — and its comments explicitly enumerate what it does not attest (no predicate claim, no statistical-correctness claim) until separate normative preconditions clear.
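The tamper-evidence idea behind the `.harness-hash` manifest can be sketched as a hash-pinned verify step. This is an illustration under assumptions, not the audit-harness implementation: the manifest format (relative path mapped to hex digest) and the pinned file name are hypothetical. Only the exit code 2 (HARNESS_TAMPERED) comes from the CI table.

```python
import hashlib
import tempfile
from pathlib import Path

HARNESS_TAMPERED = 2  # exit code cited for a failed verify


def file_sha256(path: Path) -> str:
    """Hex SHA-256 of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def verify(root: Path, manifest: dict[str, str]) -> int:
    """Return 0 if every pinned file matches its digest, else HARNESS_TAMPERED."""
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or file_sha256(target) != expected:
            return HARNESS_TAMPERED
    return 0


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    pinned = root / "workflow.yml"  # hypothetical pinned file
    pinned.write_text("permissions:\n  contents: read\n")
    manifest = {"workflow.yml": file_sha256(pinned)}  # the "re-init" step
    clean = verify(root, manifest)
    pinned.write_text("permissions:\n  contents: write\n")  # unreviewed edit
    tampered = verify(root, manifest)

print(clean, tampered)
```

Regenerating the manifest in the same commit as the edit is what makes a legitimate change pass review while a silent one fails.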
- 51 numbered docs in `000-docs/`; Phase A foundation (Blueprints A/B/C + canonical glossary) merged and NORMATIVE.
- 16 upstream surfaces monitored across three rollout waves (skill frontmatter, MCP spec + machine-readable schema, hooks/settings/slash-commands references, plugins/sub-agents/marketplaces references, version + release feeds, engineering blog); one candidate surface documented as deliberately unmonitored pending a sampleable raw file.
- Spec modules: `evidence-bundle` v0.1.0-draft (schema redirected to the kernel), `mcp-plugin-observability` v0.1.0-draft, `prompt-evaluation` v0.1.0-draft; placeholder modules reserve slots for validator-contract-reliability, forecasting-drift-detection, and decentralized-crypto-evaluation.
- The Spec Authority Kernel charter is ratified (DR-049); the surface registry is NORMATIVE (doc 050); the umbrella-wide review-and-fix wave is recorded in the 051 after-action report.
- Known seams, tracked: registry↔script DRY duplication (consistency-gated), liveness state cached rather than committed between runs (committed seed survives eviction), external heartbeat ping optional on a secret.
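The liveness state mentioned above boils down to two checks: per-surface fetch-error streaks and a heartbeat gap. The sketch below illustrates both. It is not `scripts/watcher-liveness.py`; the 36-hour gap tolerance, the function names, and the surface names are assumptions. The threshold of 3 consecutive errors comes from the feature list.

```python
from datetime import datetime, timedelta, timezone

STREAK_LIMIT = 3                         # >=3 consecutive fetch errors fails loud
MAX_HEARTBEAT_GAP = timedelta(hours=36)  # assumption: tolerance around a daily cron


def update_streak(streak: int, fetch_ok: bool) -> int:
    """A successful fetch resets the streak; a failed one extends it."""
    return 0 if fetch_ok else streak + 1


def liveness_failures(streaks: dict[str, int], last_heartbeat: datetime, now: datetime) -> list[str]:
    """Return the reasons the watcher should fail loud; empty means healthy."""
    failures = [
        f"{name}: {count} consecutive fetch errors"
        for name, count in sorted(streaks.items())
        if count >= STREAK_LIMIT
    ]
    if now - last_heartbeat > MAX_HEARTBEAT_GAP:
        failures.append("heartbeat gap exceeded")
    return failures


# Hypothetical run: one surface fails three times and the heartbeat is stale.
streak = 0
for ok in (False, False, False):
    streak = update_streak(streak, ok)
now = datetime(2026, 6, 11, 9, 0, tzinfo=timezone.utc)
stale = now - timedelta(days=3)
failures = liveness_failures({"mcp-schema": streak, "skill-frontmatter": 0}, stale, now)
print(failures)
```

The retrospective gap check matters because a watcher that never runs cannot report its own absence; only a later run, or an external ping, can notice the hole.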
The root CHANGELOG.md is release-grained (Keep a Changelog format; latest entry 0.2.0, 2026-05-26, packaging the Phase A foundation). Day-to-day change in this repo is recorded in the numbered 000-docs/ series — every decision, plan, spec, and after-action report lands as an immutable numbered doc (NNN-CC-ABCD-title-date.md), so the doc index itself is the running record of change. The ten most recent:
| Doc | Date | What it records |
|---|---|---|
| `051-AA-AACR-umbrella-review-and-fix-wave` | 2026-06-11 | After-action record of the umbrella-wide review-and-fix wave (companion machine-readable data at 051a-...-data.json). |
| `050-AT-SPEC-upstream-surface-registry` | 2026-06-11 | NORMATIVE registry of the 16 upstream spec surfaces the drift watcher monitors, with authority-tier precedence. |
| `049-AT-DECR-isedc-class-1-charter-ratification` | 2026-06-10 | Class-1 charter ratification: all 6 Spec-Authority-Kernel decisions RATIFIED-WITH-CONSTRAINTS; flips SAK work items to claimable. |
| `048-PP-PLAN-skill-refiner-sak-amendment-v8` | 2026-06-10 | Plan amendment closing the 4 blocking findings from the v7 incremental re-audit. |
| `047-AT-ARCH-repo-blueprint` | 2026-06-10 | This repo's own blueprint (instantiating the Blueprint C template). |
| `046-AT-STND-sak-governance-owners` | 2026-06-10 | Seat-bound decision-authority taxonomy for Spec Authority Kernel governance. |
| `045-RR-LAND-single-source-of-truth-and-continuous-spec-compliance` | 2026-06-09 | The SSoT rationale: why the kernel is the single source of authoring-artifact validity and how the drift watch keeps it continuously adherent to upstream. |
| `044-AT-DECR-isedc-council-session-8-sak-charter` | 2026-06-09 | 7-seat council decision record on the Spec Authority Kernel charter. |
| `043-DR-RFC-intent-eval-target-generalization` | 2026-06-06 | RFC on generalizing evaluation targets beyond SKILL.md artifacts. |
| `042-RR-LAND-prompt-and-context-eval-landscape` | 2026-06-06 | Research landscape survey of prompt- and context-evaluation tooling. |
Convention: RR = research/recon, PP = plan, AA = after-action, AT-DECR = decision record, AT-ARCH/AT-SPEC = architecture/spec, DR-GLOS = glossary. Decision records run DR-002 through DR-049; superseded docs are marked in place rather than deleted, so the audit trail stays intact.
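The filing standard lends itself to mechanical parsing. The sketch below splits a doc name into number, category code, type code, and title. It handles names as they appear in the table above (no date suffix, no `.md`), because the date format of the full `NNN-CC-ABCD-title-date.md` pattern is not shown here. The regex and the function name are assumptions, not part of the repo.

```python
import re
from typing import Optional

# Three-digit number, two-letter category, three- or four-letter type, slug title.
DOC_NAME = re.compile(
    r"^(?P<number>\d{3})-(?P<category>[A-Z]{2})-(?P<doc_type>[A-Z]{3,4})-(?P<title>[a-z0-9-]+)$"
)


def parse_doc_name(stem: str) -> Optional[dict]:
    """Split a numbered doc name into its parts, or return None if it does not match."""
    match = DOC_NAME.match(stem)
    return match.groupdict() if match else None


parsed = parse_doc_name("043-DR-RFC-intent-eval-target-generalization")
print(parsed)
print(parse_doc_name("README"))
```

A parser like this could back a lint step that rejects misnamed docs, though whether the repo enforces the standard that way is not stated.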