jeremylongshore/0-intent-eval-lab-one-pager-and-operator-audit.md

Created June 12, 2026 14:54

intent-eval-lab — one-pager + operator audit + record of change

intent-eval-lab

The methodology and specs home of the Intent Eval Platform — where spec authority is kept provably in sync with upstream truth.

A research umbrella for vendor-neutral evaluation methodology around AI plugins, agents, MCP servers, and skill-discovery systems. The repo ships no application code: its outputs are versioned spec modules, normative blueprints, a decision-record governance corpus, and a continuous spec-compliance watcher that monitors 16 upstream spec surfaces daily and feeds the canonical contracts kernel (@intentsolutions/core).

Links:

GitHub: https://github.com/jeremylongshore/intent-eval-lab

One-Pager

Problem

Spec authority drifts silently from upstream truth. The contracts that define agent-native artifacts (SKILL.md frontmatter, MCP server config, hooks, plugin manifests, marketplace catalogs) are spread across upstream surfaces (open standards, vendor reference pages, changelogs, release feeds) that change without notice. Any validator or schema kernel that encodes those contracts goes stale the moment an upstream field is added, removed, or has a required flag flipped, and nothing tells you. A watcher that merely polls pages has its own failure mode: a surface that starts returning 404 or gets restructured can read as "no drift, all green" indefinitely while the contract silently decays.

Solution

intent-eval-lab is the methodology layer that closes both gaps. It hosts the normative documents that govern the platform (Blueprints A/B/C, the canonical glossary, and the decision records, all part of a corpus of 50+ numbered docs), and it runs a three-layer deterministic spec-drift watcher:

Byte-hash drift (scripts/spec-drift-check.sh) — "did the page change?" — 16 upstream surfaces, each compared against a committed snapshot hash.
Field-level projection diff (scripts/spec-projection-diff.py) — "did the normative shape change?" — extracts a per-surface normative projection from a vendored snapshot and diffs it field-by-field against a committed projection.v1.json. The byte-hash is demoted to a cheap "should we re-extract?" tripwire.
Liveness guards (scripts/watcher-liveness.py) — fetch-error streak counters and a dead-man heartbeat, so the watcher cannot fail silent-green.

Confirmed drift is reconciled by a human into the kernel single source of truth (@intentsolutions/core schemas/authoring/v1); the differ never authors a kernel edit and never closes its own signal.
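
The field-level layer can be pictured with a minimal sketch. It assumes a projection is a flat mapping from field name to attributes such as `required`; that shape, the `diff_projection` name, and the sample fields are illustrative assumptions, not the format of `projection.v1.json` or the behaviour of `scripts/spec-projection-diff.py`.

```python
from typing import Any

# Assumed shape (hypothetical): {field_name: {"required": bool, ...}}
Projection = dict[str, dict[str, Any]]


def diff_projection(committed: Projection, extracted: Projection) -> list[str]:
    """Classify differences between a committed and a freshly extracted projection."""
    findings: list[str] = []
    for name in sorted(extracted.keys() - committed.keys()):
        findings.append(f"field-added: {name}")
    for name in sorted(committed.keys() - extracted.keys()):
        findings.append(f"field-removed: {name}")
    for name in sorted(committed.keys() & extracted.keys()):
        old, new = committed[name], extracted[name]
        # Compare every attribute either side knows about (e.g. a required-flag flip).
        for attr in sorted(old.keys() | new.keys()):
            if old.get(attr) != new.get(attr):
                findings.append(
                    f"{attr}-changed: {name} ({old.get(attr)!r} -> {new.get(attr)!r})"
                )
    return findings


# Illustrative data only; these are not fields taken from any real surface.
committed = {"name": {"required": True}, "description": {"required": False}}
extracted = {
    "name": {"required": True},
    "description": {"required": True},
    "license": {"required": False},
}
for finding in diff_projection(committed, extracted):
    print(finding)
# field-added: license
# required-changed: description (False -> True)
```

A report like this is only a signal. As stated above, promoting a confirmed change into the kernel stays a human step.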

W5


Who	Jeremy Longshore / Intent Solutions. Single-maintainer; public so the methodology can be reviewed.
What	Methodology + specs umbrella: versioned spec modules, normative blueprints and glossary, ISEDC decision records, and the continuous spec-compliance watcher infrastructure.
When	Active since 2026-05. Drift watch runs on a daily cron (09:00 UTC) plus PR/push self-tests. 51 numbered docs as of 2026-06-11.
Where	`github.com/jeremylongshore/intent-eval-lab`, one of the repos in the Intent Eval Platform that converge via a shared Evidence Bundle schema. Feeds the kernel at `@intentsolutions/core`.
Why	Vendor-neutral measurement of agent artifacts requires a spec authority that demonstrably tracks upstream truth — and a governance trail explaining every architectural decision.

Stack

Layer	Technology
Normative docs	Markdown under `000-docs/`, numbered + category-coded filing standard (`NNN-CC-ABCD-title-date.md`)
Spec modules	Versioned `SPEC.md` + JSON Schema under `specs/<module>/v0.1.0-draft/`; schema files are kernel-redirect stubs
Surface registry	`specs/upstream-surface-registry.v1.json` — 16 surfaces with authority tiers, contracts, waves, extractors
Watcher scripts	Bash (`curl` + `sha256sum`) and Python 3.12, stdlib-only, offline-deterministic differs with `--self-test`
CI	GitHub Actions — 9 workflows including the daily drift cron, kernel-canonical schema gate, and harness-hash verify
Governance	ISEDC 7-seat adversarial decision records (DR-002..DR-049) with verbatim seat positions preserved
License	Apache 2.0

Differentiators

16-surface, field-level drift detection. Not just "the page changed" — a normative projection diff that distinguishes a field addition, a required-flag flip, or a deprecation from cosmetic page noise. Surfaces are ranked by authority tier (machine-readable official spec → official spec → vendor doc → reference page → release feed → changelog).
Deterministic-first, byte-hash tripwires. No LLM anywhere in the detection backbone. The differs are stdlib-only, operate offline on vendored snapshots, and carry a --self-test mode that proves every drift class is detected with no false positives — run as a real CI gate on every event.
Cannot fail silent-green. Fetch-error streaks (≥3 consecutive errors fail loud, because the surface is effectively unmonitored) plus a dead-man heartbeat (external ping + retrospective gap check) close the asymmetric failure where a dead watcher looks like a healthy one.
Decision-record governance. Architectural choices are adjudicated by a 7-seat adversarial council; dissent is preserved verbatim, and every binding (schema authority, naming discipline, phase gates) cites its decision record by number.
Kernel-canonical schema authority, CI-enforced. The lab may host redirect stubs for discoverability but cannot host normative schema content — a structural gate, not a convention.
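
The "cannot fail silent-green" guard above can be sketched roughly as follows. The streak threshold of 3 comes from the text; the 36-hour heartbeat tolerance, the function names, and the surface names are assumptions made for illustration and are not taken from `scripts/watcher-liveness.py`.

```python
from datetime import datetime, timedelta, timezone

STREAK_LIMIT = 3                         # from the text: >=3 consecutive fetch errors fail loud
MAX_HEARTBEAT_GAP = timedelta(hours=36)  # assumption: daily cron plus some slack


def update_streak(streak: int, fetch_ok: bool) -> int:
    """Reset the counter on a successful fetch, otherwise extend the streak."""
    return 0 if fetch_ok else streak + 1


def liveness_failures(
    streaks: dict[str, int], last_heartbeat: datetime, now: datetime
) -> list[str]:
    """Return the reasons the watcher should fail loud (empty list = healthy)."""
    failures = [
        f"{surface}: {count} consecutive fetch errors"
        for surface, count in sorted(streaks.items())
        if count >= STREAK_LIMIT
    ]
    gap = now - last_heartbeat
    if gap > MAX_HEARTBEAT_GAP:
        # Retrospective gap check: the watcher itself has not run recently.
        failures.append(f"heartbeat gap of {gap} exceeds {MAX_HEARTBEAT_GAP}")
    return failures


now = datetime(2026, 6, 11, 9, 0, tzinfo=timezone.utc)
streaks = {"surface-a": 0, "surface-b": 2}  # hypothetical surface names
streaks["surface-b"] = update_streak(streaks["surface-b"], fetch_ok=False)  # third miss
for reason in liveness_failures(streaks, last_heartbeat=now - timedelta(days=3), now=now):
    print(reason)
```

The point of the sketch is the asymmetry: a fetch error is counted rather than treated as "no drift", and the absence of a recent run is itself a failure.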

Operator-Grade System Analysis

Architecture

The repo is docs-as-normative-artifacts plus watcher infrastructure. There is no build system — no package.json, no compiled output. What ships is markdown, JSON, and scripts.

intent-eval-lab/
├── 000-docs/        ← 51 numbered docs: blueprints, glossary, decision records, plans, AARs
├── specs/           ← versioned spec modules + the upstream-surface registry
│   ├── evidence-bundle/v0.1.0-draft/        (schema = kernel-redirect stub)
│   ├── mcp-plugin-observability/v0.1.0-draft/
│   ├── prompt-evaluation/v0.1.0-draft/
│   ├── upstream-surface-registry.v1.json    (16 monitored surfaces)
│   └── snapshots/.sha/                      (committed byte-hash baselines)
├── scripts/         ← watcher + governance enforcement (bash + Python, stdlib-only)
├── _vendor/upstream/ ← firewall snapshots the field-level differ operates on
├── research/        ← literature surveys, Phase A.0 baseline experiment code
├── sandboxes/       ← per-experiment dated dirs, isolated state
└── .github/workflows/ ← 9 CI workflows

Key structural decisions:

Normative content lives in numbered docs, not wikis. Blueprint A (ecosystem constitution, 12 binding principles), Blueprint B (runtime architecture, 13-entity domain model, the normative gate-result/v1 predicate spec), Blueprint C (per-repo blueprint template), and the canonical glossary are the precedence chain every downstream repo cites.
The lab does not own schemas. Per the schema-authority decision (DR-018 § 6.4, "Option α-minus"), the kernel @intentsolutions/core owns every canonical JSON Schema. Lab schema files must carry a top-level x-redirect field and must not declare canonical predicate fields — enforced by schema-drift.yml on every PR.
The surface registry and the executable SOURCES list are duplicated today, a known DRY violation: the registry JSON documents the surfaces, while the watcher script executes its own array. scripts/check-surface-registry.py is the consistency gate: it asserts that the two lists match by surface name and that every registered extractor function exists in the script. Wiring the script to read the JSON directly is a tracked follow-up.
Drift handling is human-in-the-loop by design. On a field-level drift, CI opens a reconciliation issue (auto-deduped by date) and pushes a notification; a human re-vendors the snapshot, refreshes the projection, and promotes the change into the kernel.
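
The registry consistency gate described above might look something like this sketch. The registry layout (a top-level `surfaces` list with `name` and `extractor` keys), the function name, and the sample values are assumptions for illustration; the actual checks in `scripts/check-surface-registry.py` may differ.

```python
import json


def registry_mismatches(
    registry_text: str, script_sources: list[str], script_functions: set[str]
) -> list[str]:
    """Compare registry surface names with the script's list and check extractors exist."""
    surfaces = json.loads(registry_text)["surfaces"]  # assumed top-level key
    registered = {surface["name"] for surface in surfaces}
    executed = set(script_sources)

    problems: list[str] = []
    for name in sorted(registered - executed):
        problems.append(f"in registry but not in SOURCES: {name}")
    for name in sorted(executed - registered):
        problems.append(f"in SOURCES but not in registry: {name}")
    for surface in surfaces:
        extractor = surface.get("extractor")
        if extractor and extractor not in script_functions:
            problems.append(f"missing extractor for {surface['name']}: {extractor}")
    return problems


# Hypothetical registry content and script state.
registry_text = json.dumps(
    {
        "surfaces": [
            {"name": "surface-a", "extractor": "extract_a"},
            {"name": "surface-b", "extractor": "extract_b"},
        ]
    }
)
problems = registry_mismatches(
    registry_text,
    script_sources=["surface-a", "surface-c"],
    script_functions={"extract_a"},
)
for problem in problems:
    print(problem)
```

A gate built this way would exit non-zero whenever the returned list is non-empty, which keeps the duplication honest until the script reads the JSON directly.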

Operational reference

CI gates (all on github.com/jeremylongshore/intent-eval-lab):

Workflow	Trigger	Gate
`spec-drift-watch.yml`	Daily cron 09:00 UTC; manual; PR/push self-test	Byte-hash check over 16 surfaces + field-level projection diff + liveness guards. Offline-deterministic gates (differ self-test, snapshot↔projection consistency) are enforced on every event; live drift/liveness gates can fail only the scheduled run. On trip: red X + push notification + auto-opened GH issue with diff summary.
`schema-drift.yml`	PR/push touching `specs/evidence-bundle/**`	Every lab `*.schema.json` must be an `x-redirect` kernel-redirect stub; declaring any of 14 canonical predicate field names fails.
`partner-name-guard.yml`	PR/push touching public docs	Case-insensitive grep across `000-docs/`, `specs/`, `README.md`, `CLAUDE.md`, `KNOWN-ISSUES.md` for a partner-name pattern (maintained privately; an inline backstop copy lives in the workflow). Zero hits required.
`harness-hash-verify.yml`	Every PR + push to main	`scripts/audit-harness verify` against the `.harness-hash` manifest; exit 2 (`HARNESS_TAMPERED`) blocks the PR.
`ci.yml`	PR + push	JSON Schema syntax validation across `specs/*/schema/.json`.
`doc-quality.yml`	PR/push touching markdown	markdownlint + Vale + link checking (lychee) + prettier.
`python-tests.yml`	PR/push touching Python research code	Tests for the Phase A.0 baseline code.
`sign-evidence-bundle.yml`	Manual/release path	Keyless cosign blob-signing of Phase A.0 result artifacts against production sigstore (Fulcio + Rekor). Attests file integrity + identity + time; deliberately declares no predicate URI (predicate declaration is gated on separate normative + DNS preconditions).
`release.yml`	Tag push	Cuts a GitHub Release and emits + cosign-signs a kernel `gate-result/v1` report-manifest attached to it; a downstream reports hub re-verifies OIDC subject, Rekor inclusion, DSSE, and kernel schema at ingest.
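
As an illustration of the `schema-drift.yml` rule in the table above, here is a minimal sketch of a redirect-stub check. The document does not list the 14 canonical predicate field names, so the set below holds a placeholder; treating "declaring" a field as listing it under `properties` is also an assumption, as are the function name and the sample schemas.

```python
import json

# Placeholder only: the real gate checks 14 canonical names not listed in this document.
CANONICAL_PREDICATE_FIELDS = frozenset({"example_predicate_field"})


def stub_violations(schema_text: str) -> list[str]:
    """Return reasons a lab schema file is not a valid kernel-redirect stub."""
    schema = json.loads(schema_text)
    violations: list[str] = []
    if "x-redirect" not in schema:
        violations.append("missing top-level x-redirect")
    # Assumption: a field counts as declared when it appears under "properties".
    declared = set(schema.get("properties", {})) & CANONICAL_PREDICATE_FIELDS
    for name in sorted(declared):
        violations.append(f"declares canonical predicate field: {name}")
    return violations


# The redirect target is a made-up value for the example.
good_stub = json.dumps({"x-redirect": "hypothetical-kernel-schema-location"})
bad_schema = json.dumps({"properties": {"example_predicate_field": {"type": "string"}}})

print(stub_violations(good_stub))
print(stub_violations(bad_schema))
```

Because the check is structural, a lab file either passes as a stub or fails with a named reason; there is no judgment call for a reviewer to make.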

Local verification (repo root):

scripts/spec-drift-check.sh                       # check all 16 surfaces; exit 1 on drift
scripts/spec-drift-check.sh --json                # machine-readable report (CI consumption)
scripts/spec-drift-check.sh --init                # seed missing baselines
python3 scripts/spec-projection-diff.py --self-test   # prove the differ detects every drift class
python3 scripts/spec-projection-diff.py --check       # vendored snapshot ↔ committed projection
python3 scripts/watcher-liveness.py --show            # current streak / heartbeat state
python3 scripts/check-surface-registry.py             # registry JSON ↔ watcher SOURCES consistency
scripts/audit-harness verify                          # hash-pinned policy verification
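
For readers who want the byte-hash layer in miniature, the following Python sketch mirrors the idea behind `scripts/spec-drift-check.sh` (which the Stack table describes as Bash with `curl` + `sha256sum`). It performs no fetching; the function names, the result labels, and the choice to report a missing baseline without failing are assumptions for illustration only.

```python
import hashlib


def sha256_hex(body: bytes) -> str:
    """Hex digest of the fetched bytes, comparable to sha256sum output."""
    return hashlib.sha256(body).hexdigest()


def check_surface(name: str, body: bytes, baselines: dict[str, str]) -> str:
    """Compare one surface's current bytes against its committed baseline hash."""
    baseline = baselines.get(name)
    if baseline is None:
        return "no-baseline"  # assumption: reported, and seeded separately
    return "ok" if sha256_hex(body) == baseline else "drift"


def exit_code(results: dict[str, str]) -> int:
    """Exit 1 on any drift, matching the documented behaviour of the check."""
    return 1 if "drift" in results.values() else 0


# Hypothetical surfaces and page bodies.
baselines = {"surface-a": sha256_hex(b"version 1"), "surface-b": sha256_hex(b"stable")}
fetched = {"surface-a": b"version 2", "surface-b": b"stable", "surface-c": b"new"}
results = {name: check_surface(name, body, baselines) for name, body in fetched.items()}
print(results)
print(exit_code(results))
```

A changed hash here only says that the bytes differ. Deciding whether the normative shape changed is the job of the projection differ.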

Governance enforcement scripts:

scripts/bd-claim-precheck.sh — machine-enforced plan-ratification gate: blocks claiming refiner-labeled work items unless the plan-audit STATUS.md reads RATIFIED (or RATIFIED-WITH-DELTAS with the item explicitly authorized). Replaces an honor-system rule with exit codes.
scripts/validate-trilink.sh — verifies cross-references between work items, docs, and GitHub issues for refiner-labeled artifacts.

Security posture

Tamper-evident CI policy. The vendored audit-harness pins workflow definitions, the schema redirect stub, and its own scripts in a .harness-hash manifest; any unreviewed mutation fails harness-hash-verify with a hard block. Legitimate changes require an explicit re-init committed alongside the edit.
Least-privilege workflows. Workflows default to contents: read; only the drift watcher adds issues: write (for auto-opening drift issues).
Partner-name discipline. A standing decision-record binding requires zero partner-name hits in any public artifact; the CI grep gate enforces it on every PR and push, case-insensitively, with the canonical pattern held outside the public repo.
Deterministic, offline gates where it matters. The projection differ and liveness checker take no network input in CI gating mode — they operate on vendored snapshots and committed state, so a compromised or flaky upstream cannot flip a gate.
Signing is scoped and honest. The evidence-bundle signing workflow attests artifact digests, identity, and time via production sigstore — and its comments explicitly enumerate what it does not attest (no predicate claim, no statistical-correctness claim) until separate normative preconditions clear.
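
The tamper-evidence idea behind the `.harness-hash` manifest can be sketched as a hash-pinning check. The manifest representation (a mapping from relative path to SHA-256 digest), the function names, and the file names are assumptions; only the exit code 2 for a tampered harness comes from the document.

```python
import hashlib
import tempfile
from pathlib import Path

HARNESS_TAMPERED = 2  # documented exit code for a failed verification


def build_manifest(root: Path, relative_paths: list[str]) -> dict[str, str]:
    """Pin each file to its current SHA-256 digest (the explicit re-init step)."""
    return {
        rel: hashlib.sha256((root / rel).read_bytes()).hexdigest()
        for rel in relative_paths
    }


def verify_manifest(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the pinned paths whose content is missing or no longer matches."""
    tampered: list[str] = []
    for rel, expected in sorted(manifest.items()):
        path = root / rel
        actual = hashlib.sha256(path.read_bytes()).hexdigest() if path.is_file() else None
        if actual != expected:
            tampered.append(rel)
    return tampered


def exit_status(tampered: list[str]) -> int:
    return HARNESS_TAMPERED if tampered else 0


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "workflow.yml").write_text("on: pull_request\n")  # hypothetical pinned file
    manifest = build_manifest(root, ["workflow.yml"])
    clean = verify_manifest(root, manifest)
    (root / "workflow.yml").write_text("on: push\n")  # unreviewed mutation
    dirty = verify_manifest(root, manifest)

print(clean, exit_status(clean))
print(dirty, exit_status(dirty))
```

A legitimate edit would rebuild the manifest in the same commit, so the diff shows both the change and the re-pinning.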

Current state (2026-06-11)

51 numbered docs in 000-docs/; Phase A foundation (Blueprints A/B/C + canonical glossary) merged and NORMATIVE.
16 upstream surfaces monitored across three rollout waves (skill frontmatter, MCP spec + machine-readable schema, hooks/settings/slash-commands references, plugins/sub-agents/marketplaces references, version + release feeds, engineering blog); one candidate surface documented as deliberately unmonitored pending a sampleable raw file.
Spec modules: evidence-bundle v0.1.0-draft (schema redirected to the kernel), mcp-plugin-observability v0.1.0-draft, prompt-evaluation v0.1.0-draft; placeholder modules reserve slots for validator-contract-reliability, forecasting-drift-detection, and decentralized-crypto-evaluation.
The Spec Authority Kernel charter is ratified (DR-049); the surface registry is NORMATIVE (doc 050); the umbrella-wide review-and-fix wave is recorded in the 051 after-action report.
Known seams, tracked: registry↔script DRY duplication (consistency-gated); liveness state is cached rather than committed between runs (a committed seed survives cache eviction); the external heartbeat ping is optional and depends on a secret being configured.

Changelog

The root CHANGELOG.md is release-grained (Keep a Changelog format; latest entry 0.2.0, 2026-05-26, packaging the Phase A foundation). Day-to-day change in this repo is recorded in the numbered 000-docs/ series — every decision, plan, spec, and after-action report lands as an immutable numbered doc (NNN-CC-ABCD-title-date.md), so the doc index itself is the running record of change. The ten most recent:

Doc	Date	What it records
`051-AA-AACR-umbrella-review-and-fix-wave`	2026-06-11	After-action record of the umbrella-wide review-and-fix wave (companion machine-readable data at `051a-...-data.json`).
`050-AT-SPEC-upstream-surface-registry`	2026-06-11	NORMATIVE registry of the 16 upstream spec surfaces the drift watcher monitors, with authority-tier precedence.
`049-AT-DECR-isedc-class-1-charter-ratification`	2026-06-10	Class-1 charter ratification: all 6 Spec-Authority-Kernel decisions RATIFIED-WITH-CONSTRAINTS; flips SAK work items to claimable.
`048-PP-PLAN-skill-refiner-sak-amendment-v8`	2026-06-10	Plan amendment closing the 4 blocking findings from the v7 incremental re-audit.
`047-AT-ARCH-repo-blueprint`	2026-06-10	This repo's own blueprint (instantiating the Blueprint C template).
`046-AT-STND-sak-governance-owners`	2026-06-10	Seat-bound decision-authority taxonomy for Spec Authority Kernel governance.
`045-RR-LAND-single-source-of-truth-and-continuous-spec-compliance`	2026-06-09	The SSoT rationale: why the kernel is the single source of authoring-artifact validity and how the drift watch keeps it continuously adherent to upstream.
`044-AT-DECR-isedc-council-session-8-sak-charter`	2026-06-09	7-seat council decision record on the Spec Authority Kernel charter.
`043-DR-RFC-intent-eval-target-generalization`	2026-06-06	RFC on generalizing evaluation targets beyond SKILL.md artifacts.
`042-RR-LAND-prompt-and-context-eval-landscape`	2026-06-06	Research landscape survey of prompt- and context-evaluation tooling.

Convention: RR = research/recon, PP = plan, AA = after-action, AT-DECR = decision record, AT-ARCH/AT-SPEC = architecture/spec, DR-GLOS = glossary. Decision records run DR-002 through DR-049; superseded docs are marked in place rather than deleted, so the audit trail stays intact.
