ACCESS-CI Dev Journal

Gist mirror: https://gist.github.com/bacalj/a6c6f9726844611df5a09c83884a0e83

2026-05-04 (evening) — Classify-free path implemented on `feat/no-classify`; eval mixed signal

Picked up the morning plan and executed it. Four commits on a new sub-branch feat/no-classify (off feat/qwen-integration); 279/279 unit tests pass; end-to-end eval comparison run on Qwen3.6-FP8 against the 14-question tool-coverage battery. Result is mixed: composite +0.05 in candidate's favor, but compare-judge picks the baseline as winner with margin "small" — exactly the failure mode predicted at the start of the session.

Branch decision: sub-branch over committing to parent

Plan as written had the work landing directly on feat/qwen-integration. Late-day reconsideration: the work is experimental enough that a separate sub-branch is worth the small extra branching complexity. feat/no-classify cuts off feat/qwen-integration; if the experiment loses, the parent branch is unaffected and the sub-branch can be deleted. If it wins, it merges into the parent.

Commits (in order)

commit	content
`b7765b4`	Centralize `max_tokens` in `src/config.py` (`MAX_TOKENS_LOOP=6000`, `MAX_TOKENS_DOMAIN_AGENT=4000`, `MAX_TOKENS_SYNTH_CONDENSE=8000`, `MAX_TOKENS_SYNTH_FINAL=3000`, `MAX_TOKENS_PLAN=3000`, `MAX_TOKENS_EVALUATE=1500`, `MAX_TOKENS_RECOVER=1500`). Reasoning-friendly defaults. classify.py's `max_completion_tokens=350` left alone (pinned to gpt-4o-mini, bypassed by master switch).
`5b2c0ef`	New `search_access_documents` `StructuredTool` wrapping `uky_client.ask()` with `query`, `source: Literal["general","xdmod"]="general"`, optional `rp_name`. Lives in new `src/agent/tools/` package (distinct from `src/tools/` MCP-client and `src/agent/domains/tools.py` MCP-wrapper). 10 unit tests.
`1cc9d66`	`USE_NO_CLASSIFY` master switch in `graph.py` — when true, `START → tool_calling_loop` directly, classify/rag_answer/rag_and_plan/domain_agent bypassed. New `src/agent/prompts/no_classify.py` system prompt with announcements/JSM workflow choreography folded in. Loop node branches on the flag to choose prompt builder + whether to append `search_access_documents`. 7 new graph/loop tests.
`f81092a`	Prompt rewrite. The first draft of `SYSTEM_IDENTITY_NO_CLASSIFY` opened with "There is no separate documentation layer upstream of you" — leaking implementation history into the prompt, comparing against an alternative the LLM has no context for. Replaced with a job-description framing organized around a 3-step workflow: docs first (call `search_access_documents`), enrich with MCP tools where the topic touches live data, synthesize. The "what not to do" guidance moved into its own section instead of being buried mid-paragraph.
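
A minimal sketch of what the `b7765b4` centralization could look like. The constant names and values are the ones listed above; the `TokenBudgets` dataclass and the `max_tokens_for` helper are hypothetical and are not taken from `src/config.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenBudgets:
    # Values copied from the b7765b4 row above; the container class is an assumption.
    MAX_TOKENS_LOOP: int = 6000
    MAX_TOKENS_DOMAIN_AGENT: int = 4000
    MAX_TOKENS_SYNTH_CONDENSE: int = 8000
    MAX_TOKENS_SYNTH_FINAL: int = 3000
    MAX_TOKENS_PLAN: int = 3000
    MAX_TOKENS_EVALUATE: int = 1500
    MAX_TOKENS_RECOVER: int = 1500


BUDGETS = TokenBudgets()


def max_tokens_for(node: str) -> int:
    """Hypothetical lookup: 'loop' -> MAX_TOKENS_LOOP, 'synth_final' -> MAX_TOKENS_SYNTH_FINAL."""
    return getattr(BUDGETS, f"MAX_TOKENS_{node.upper()}")
```

An unknown node name raises AttributeError, so a typo at a call site fails loudly instead of silently falling back to a default.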

Pre-flight before eval

Docker stack brought up: 11 MCP servers + postgres + redis. JSM and postgres weren't in the existing partial stack; brought them up explicitly.
UKY model list checked: ccs/Qwen/Qwen3.6-35B-A3B-FP8 listed and serving (false alarm earlier in session from a head -c 800-truncated curl).
Chat-completion smoke against Qwen at max_tokens=6000: finish_reason: stop, content emits proper </think> close, model produces "OK" after the reasoning trace. _StrippingChatOpenAI wrapper handles cleanly.
UKY general RAG endpoint: smoke-tested with allocation question, returned 2129-character response.
JSM safety: confirmed three layers stack — READ_ONLY=true (agent strips write tools at catalog level), JSM_DRY_RUN=true (MCP container short-circuits ticket creation, returns DRYRUN-NNN stub), JSM_PROXY_URL unset (would fail closed even if dry-run weren't on). Ran eval with READ_ONLY=true for belt-and-suspenders.

Eval comparison

Battery: eval/questions/tool_coverage_battery.yaml (14 Qs). System: agent_full. Both runs on feat/no-classify HEAD (f81092a).

	Run ID	Composite	Scored / Skipped	Per-dim
Baseline (USE_NO_CLASSIFY=false)	`loop-20260504-181608-68a8f0`	4.93	13 / 1	corr 5.00, comp 4.77, rel 5.00, cite 4.92, hedge 5.00
Candidate (USE_NO_CLASSIFY=true)	`loop-20260504-182336-7b71ea`	4.98	13 / 1	corr 5.00, comp 4.92, rel 5.00, cite 5.00, hedge 5.00

compare-judge JSON at comparisons/no-classify-2026-05-04.json. HTML report deployed: https://access-ci-reports.netlify.app/no-classify-2026-05-04.html

The judge's run-level summary picks the baseline as winner (margin "small"), citing more detailed and actionable information on allocation, software-version, and announcement questions. Candidate wins on terse technical Q&A (SSH keys, CUDA versions). Notable: the judge explicitly flagged that the per-answer rubric was undervaluing candidate's conciseness.

This matches the failure mode predicted at session-start: when the LLM has discretion over whether to call search_access_documents, it sometimes doesn't, and on detail-heavy questions that costs substance. Composite saturated near 5.00 in both runs — compare-judge surfaced the behavioral difference the rubric averaged out. Same lesson as lesson_composite_vs_behavioral_parity.md, fresh datapoint.

Side-finding: the loose-match honest framing traces to a four-commit cross-repo fix

Spotted on tc-allocations-02 in the candidate report: agent now opens "Searching 'climate modeling' on Delta returned 58 results, but none are actually about climate modeling." Traced via git log + blame to a 2026-04-29 four-commit chain:

access-mcp d4d9845 — pagination + query_relevance metadata on 3 search tools, plus docs→links rename per Andrew's review.
access-mcp 4bb5ce6 — extended same metadata across all 16 listing tools.
access-agent 88eccdd — first prompt rule consuming the new metadata + null-RAG-fallback rule.
access-agent 9e449ac — strengthened the loose_match rule to a mandatory opening-line shape; this is the commit that produced today's exact phrasing.

Co-designed change: neither side could have produced the visible behavior alone — the metadata gives the prompt something concrete to anchor to, and the prompt gives the metadata behavioral teeth. Saved as finding_loose_match_honest_framing_2026_04_29.md.
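
To make the metadata/prompt pairing concrete, here is a deterministic illustration of the opening-line shape. In the real system the rule lives in the prompt and the LLM writes the sentence; the metadata layout used here (a `query_relevance` key holding `"loose_match"`, plus a `total` count) and the `opening_line` function are assumptions for illustration only.

```python
from typing import Optional


def opening_line(query: str, resource: str, metadata: dict) -> Optional[str]:
    """Return the honest-framing opener when the tool flagged a loose match.

    The metadata keys are assumed, not copied from access-mcp.
    """
    if metadata.get("query_relevance") != "loose_match":
        return None  # exact-match results need no qualifier
    total = metadata.get("total", 0)
    return (
        f"Searching '{query}' on {resource} returned {total} results, "
        f"but none are actually about {query}."
    )


print(opening_line("climate modeling", "Delta",
                   {"query_relevance": "loose_match", "total": 58}))
```

Without the metadata flag there is nothing for the rule to key on, which is the co-design point made above.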

`access-mcp` PR #3 bundled

PR #3 was originally framed as the see_all_url deliverable only, but its branch feat/listing-urls-in-tool-responses carries all four 2026-04-28 / 2026-04-29 commits. Updated the PR title and body to bundle the three deliverables (see_all_url, links rename, pagination + query_relevance) with per-deliverable motivation, design, and a companion-changes section linking the two access-agent commits that pair with deliverable 3. Title now: feat: add structural metadata to MCP tool responses (see_all_url + pagination + query_relevance). Still draft.

Open questions for next session

Roll back vs harden. The judge picked baseline. Two reasonable directions: flip USE_NO_CLASSIFY=false and retire the experiment (the centralization commit and the search_access_documents tool stand on their own), or strengthen the docs-first rule in the no-classify prompt so the LLM has less discretion to skip the doc-search call, then re-run the 14-Q battery and compare-judge against today's candidate.
PR strategy. feat/no-classify is unmerged. Could open a draft PR off feat/qwen-integration, hold the sub-branch unmerged pending the rollback decision, or cherry-pick b7765b4 (centralization) into PR #26 since it's stand-alone valuable.
MCP PR #3. Ready to flip from draft to open whenever Andrew should see it.

2026-05-04 — Plan for the classify-free path: tightened to a single tool, master switch on `feat/qwen-integration`

Andrew was offline most of the day; spent the session pinning down a plan for the classify-removal work that Joe could commit to without a sign-off in hand.

What changed from the 2026-05-01 plan

The 2026-05-01 plan had a spike/no-classify branch off feat/qwen-integration with two new RAG tools (search_uky_general_rag + search_uky_xdmod_rag) and full deletions of classify / chain / flag if stage 4 won. Andrew's brief Slack today shifted the framing: "Wrapping the RAG doesn't necessarily make sense. We're getting rid of the RAGs (to be replaced with the /retrieve documentation endpoint). It's could be fine to leave the classifier in until that happens?"

Three concrete corrections to the plan:

No spike branch. The work is permanent infrastructure (new tool, prompt changes, graph routing), not throwaway research. It commits onto feat/qwen-integration directly. Cross-commit comparison via compare-judge replaces cross-branch comparison.
Two RAG tools collapse to one. A single search_access_documents(query, source: Literal["general", "xdmod"] = "general", rp_name?) exposes the existing endpoint_type parameter on uky_client.ask() directly to the LLM. The choice classify makes today moves into the LLM's hand via tool description, not into a heuristic inside the tool body.
No deletions today. A master switch in graph.py introduces the new path alongside the old ones. Classify, the legacy chain, USE_TOOL_CALLING_LOOP, and agent_full_legacy all stay alive until Andrew confirms prod is on the loop and Phase 8 sign-off is done. Cleanup is a follow-up PR.

Why a single tool with a `source` param ages better than two tools

Long working session with the uky_client code. ask() already takes endpoint_type: Literal["general", "xdmod"] (required) and rp_name: str | None (optional); _get_url() picks one of two configured URLs (UKY_RAG_GENERAL_URL / UKY_RAG_XDMOD_URL) and POSTs to it. So our side really does pick between two endpoints — the source param on the new tool is a direct surface of an existing internal choice, not a new abstraction.

Why one tool with a param, not two named tools: when /retrieve ships, the function body changes (return type widens to chunks, system prompt updates, possibly multi-call iterative search), but the parameter name and semantics can survive — source="general" either remains meaningful (if /retrieve doesn't subsume XDMoD) or quietly retires (if it does). Two-tool design would force a rename or deprecation regardless. The spec for /retrieve is silent on whether it subsumes XDMoD.
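
A stdlib-only sketch of the single-tool shape, assuming a synchronous `ask()` that takes the query as its first argument. Only `endpoint_type` and `rp_name` are confirmed from `uky_client.py`; the factory function, the `fake_ask` stand-in, and the query parameter name are hypothetical, and the StructuredTool wrapping is left out so the example runs without LangChain.

```python
from typing import Callable, Literal, Optional

Source = Literal["general", "xdmod"]


def make_search_access_documents(ask: Callable[..., str]):
    """Build the tool function around an ask()-like callable (assumed sync)."""

    def search_access_documents(
        query: str, source: Source = "general", rp_name: Optional[str] = None
    ) -> str:
        if source not in ("general", "xdmod"):
            raise ValueError(f"unknown source: {source!r}")
        # The LLM's choice passes straight through as endpoint_type;
        # there is no heuristic inside the tool body.
        return ask(query, endpoint_type=source, rp_name=rp_name)

    return search_access_documents


def fake_ask(query, endpoint_type, rp_name=None):
    """Stand-in for uky_client.ask(); the real one POSTs to one of two URLs."""
    return f"[{endpoint_type}] {query} (rp={rp_name})"


tool = make_search_access_documents(fake_ask)
print(tool("How do I request an allocation?"))
print(tool("GPU hours last quarter", source="xdmod", rp_name="Delta"))
```

If /retrieve later subsumes XDMoD, only the body of the inner function and the `Source` literal would need to change; callers that never pass `source` keep working.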

Why the master switch instead of a flag or branch

Three options were on the table at different points today:

A new USE_NO_CLASSIFY flag → introduces a permutation through the graph (legacy / loop-with-classify / loop-no-classify) that has to be carried indefinitely.
A spike/no-classify branch → implies throwaway, but the work is permanent.
A master switch in graph.py routing → equivalent to the flag in effect, but the surface is one routing branch and it deletes cleanly when Andrew signs off the cutover.

Picked the master switch. The cleanup PR removes the routing complexity, the flag, classify, the chain, and agent_full_legacy all at once.
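
Roughly what "one routing branch" means, as a sketch. The node names come from the stage list in the 2026-05-01 entry; the function names and the `no_classify` parameter are hypothetical and do not reproduce `graph.py`.

```python
def entry_node(no_classify: bool) -> str:
    """Pick the node that START routes to. Assumed shape, not the real graph.py."""
    return "tool_calling_loop" if no_classify else "classify"


def loop_tools(mcp_tools: list[str], no_classify: bool) -> list[str]:
    """On the classify-free path the loop also gets the doc-search tool."""
    tools = list(mcp_tools)  # copy so the shared catalog is never mutated
    if no_classify:
        tools.append("search_access_documents")
    return tools


print(entry_node(True), loop_tools(["search_projects"], True))
print(entry_node(False), loop_tools(["search_projects"], False))
```

Because everything hangs off one boolean read in two places, the cleanup PR can delete the false branch and the parameter together.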

What got verified vs. assumed

Verified ask()'s signature (uky_client.py:65-72) and the two-URL routing (uky_client.py:53-57 and __init__ lines 35-46).
Verified the /retrieve spec footprint — only Phase 4 of the launch umbrella plan and the matching section of the hardening spec (line 167). Both placeholders pointing to "the contract Vikram is building." Nothing in access-qa-planning mentions /retrieve.
Did not verify whether prod is currently running USE_TOOL_CALLING_LOOP=true. Joe needs to confirm with Andrew before the eventual cleanup PR — if prod is still on the chain, deleting things is the cutover and needs Phase 8 sign-off.

Slack to Andrew (sent today)

In the current qwen-integration branch I'm setting up a classify-free path which uses a single tool calling + thinking loop that is equipped with

all existing tools, including the domain_agent ones

a new search_documentation tool that will, for now, wrap existing /ask function that calls the UKY endpoints, will eventually instead wrap the /retrieve function that gets the UKY chunks

In addition to the rp param that gets passed through, we'll have the loop decide to pass XDMod or general since the ask function still expects it so it knows which UKY rag to hit. This is the logic that was in classify. I don't know if the retrieve path is going to want the same param, so we'll just remember it might need to be passed through, or removed.

Memory artifacts updated

architecture_stages_2026_05_01.md rewritten to reflect the master-switch plan and the single-tool design.
next_session_prompt.md rewritten with the implementation order: centralize max_tokens first, then add the tool, then the master switch, then flatten domain MCP tools, then run the cross-commit comparison.

2026-05-01 — End-to-end loop run on Kimi at UKY; Andrew flagged a no-classify direction

Picked up where the 2026-04-30 entry left off. Updated .env with the UKY Qwen target, hit the same 500 (Connection error.. Received Model Group=ccs/Qwen/Qwen3.6-35B-A3B-FP8) on retry, raw curl reproduced the same shape — confirmed it's UKY-side, not us. Wrote two diagnostic markdown files at the access-ci root: uky-endpoint-diagnostic.md (technical) and uky-endpoint-brief.md (the Vikram-facing version, three short prose paragraphs + curl evidence, no commentary).

Pivoted the smoke target to ccs/kimi-k2.6 (the only working reasoning model on the same proxy — the colon-format names like ccs/qwen3.6:35b, ccs/deepseek-r1:8b etc. are misrouted to Anthropic, ~14 of them, all 400 with AnthropicException). Smoke runs cleanly through our wrapper. Kimi exposes reasoning in a separate reasoning_content JSON field rather than inline </think> — so the strip is a no-op for Kimi (still proven by the 13 unit tests on synthetic input; not exercised end-to-end against a real </think>-emitting model since none are reachable on UKY today).

End-to-end eval comparison on Kimi

Ran tc_loose_match_subset.yaml (4 Qs) through both the legacy chain and the loop, with LLM_PROVIDER=vllm, VLLM_MODEL_NAME=ccs/kimi-k2.6. Database and MCP host overrides (DATABASE_URL=...localhost..., MCP_SERVER_HOST=localhost) were needed because the local docker stack's container has stale code; the eval CLI ran from local source against the container's exposed Postgres + MCP ports.

	Legacy chain	Loop
Run ID	`chain-20260501-140041-066793`	`loop-20260501-141220-9e7684`
Scored	2 / 4	3 / 4
Composite	4.88	5.00

HTML report deployed: https://access-ci-reports.netlify.app/tc-loose-kimi-2026-05-01.html

Reasoning-model token-budget finding

Both answer_length=0 skips in the chain run came from Kimi consuming all 4000 tokens of the synthesize budget on reasoning_content and never producing content. Different reasoning models partition the budget differently — Qwen embeds reasoning inline in content (our wrapper recovers the answer post-hoc), Kimi puts it in a separate field that shares the budget with content. Per-node budgets in the codebase today: synthesize=4000, plan=1500, tool_calling_loop=2000 (default), evaluate/recover=500. Once we're on a reasoning model in production these need to scale up, or enable_thinking=False needs to be passed on terse paths. Open question for Andrew, sent.
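
A small sketch of the two levers mentioned above, under stated assumptions. The per-node budgets are the ones listed in this entry and the `extra_body` shape matches the `enable_thinking` plumbing from the 2026-04-30 entry; the helper names and the choice of which nodes count as "terse" are hypothetical.

```python
# Per-node budgets as they stood on 2026-05-01 (from the paragraph above).
NODE_BUDGETS = {
    "synthesize": 4000,
    "plan": 1500,
    "tool_calling_loop": 2000,
    "evaluate": 500,
    "recover": 500,
}
TERSE_NODES = {"evaluate", "recover"}  # assumption: which paths skip thinking


def content_tokens_left(max_tokens: int, reasoning_tokens: int) -> int:
    """Kimi-style accounting: reasoning_content and content share one budget."""
    return max(0, max_tokens - reasoning_tokens)


def request_kwargs(node: str) -> dict:
    """Build per-node request kwargs; terse nodes turn thinking off."""
    kwargs: dict = {"max_tokens": NODE_BUDGETS[node]}
    if node in TERSE_NODES:
        kwargs["extra_body"] = {"chat_template_kwargs": {"enable_thinking": False}}
    return kwargs


# The chain-run skips: all 4000 synthesize tokens went to reasoning.
print(content_tokens_left(NODE_BUDGETS["synthesize"], 4000))
print(request_kwargs("evaluate"))
```

The first print shows why the answer came back empty: once reasoning consumes the whole budget, nothing is left for content.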

Andrew responses (Slack, 12:38-12:39)

Three messages: "the design should be not to use a classify node," "Vikram is looking at the Qwen issue," "what do you mean by token budgets?"

The first is a new design direction not in any Andrew-authored doc. Confirmed by reading every file he authored: the Phase 3 hardening spec at access-agent/docs/superpowers/specs/2026-04-21-production-launch-hardening-design.md line 150 explicitly scopes the loop to "Replaces plan + execute + evaluate + recover with a single tool_calling_loop node" — classify is not in that list, and active/10-analytics-and-domain-agents.md (Andrew's domain-agent doc) shows classify upstream of the existing react agents. The implementation plan we co-wrote with Claude faithfully reflected that scope. So the no-classify direction is genuinely new from him today.

Decision: spike no-classify on a branch, not a flag

Two responsibilities currently held by classify need new homes:

RAG endpoint selection (general vs XDMoD) → wrap each as a LangChain tool (search_uky_general_rag, search_uky_xdmod_rag), defined in a new src/agent/tools/rag_tools.py, appended to the loop's tool list alongside the MCP tools.
Domain agent routing (announcements, JSM) → fold their MCP tools (already in the catalog for reads; available for writes when READ_ONLY=false) directly into the main loop. The workflow choreography that lives in src/agent/nodes/domain_agent.py (preview/confirm/create for announcements, field-gather for JSM) gets moved into the main system prompt as instructions.

Doing this on a flag would make the graph carry three permutations (legacy / loop-with-classify / loop-no-classify). Since this is a research spike rather than a production cutover, it goes on a branch (spike/no-classify off feat/qwen-integration); compare runs across branches via compare-judge (eval already records branch + commit per run). If no-classify wins, it becomes a real PR — classify, USE_TOOL_CALLING_LOOP, the legacy chain, and domain_agent as a separate node all delete; the graph gets much smaller.

Architecture stages framing (locked in)

Four cumulative stages, each builds on the previous:

classify → plan → execute → evaluate → recover → synthesize (legacy chain, OpenAI inside).
classify → tool_calling_loop (OpenAI) — Phase 3, PR #14, USE_TOOL_CALLING_LOOP=true.
classify → tool_calling_loop (UKY Qwen/Kimi) — Phase 5, this branch (feat/qwen-integration). LLM_PROVIDER=vllm.
tool_calling_loop (UKY Qwen/Kimi), no classify — the upcoming spike branch.

Comparisons:

1 vs 2: done in PR #14 (4.67 = 4.67 on phase3_smoke_battery).
2 vs 3: partially done today (loop-on-Kimi has data; OpenAI-loop baseline lives in earlier eval runs).
3 vs 4: next eval. Run stage 3 on feat/qwen-integration, switch branches to spike/no-classify, run stage 4, compare-judge across the two run IDs.

Open follow-ups

No-classify spike branch. Cut from feat/qwen-integration. RAG-as-tools, domain workflow content into system prompt, gut classify, run subset, compare.
Vikram on Qwen3.6 FP8 endpoint. Diagnostic at uky-endpoint-brief.md, paste-ready curl repros.
Token budget decision. Centralize in config or per-call-site; use enable_thinking=False on terse paths (evaluate/recover/old-classify-equivalent if any). Sent to Andrew, awaiting response.
Draft PR #26. https://github.com/necyberteam/access-agent/pull/26 — covers stage 3 (Qwen integration + Issue #15 cleanup). Stays draft until Qwen3.6 is reachable for a real smoke + the no-classify direction is settled.

2026-04-30 — Qwen integration committed (Phase 5, commit a1c54bb); UKY endpoint smoke blocked by vLLM backend

PR #14 (eab821f) merged 2026-04-30T18:09Z, putting tool-calling-loop, READ_ONLY filter, and code-quality guardrails on main. Started cross-repo cleanup, then opened Phase 5 (Qwen) on a fresh branch.

Cross-repo branch hygiene

Synced main and pruned merged feature branches:

access-agent — pulled 101 commits, deleted local feature/production-baseline-comparison (merged in PR #14)
access-qa-bot, qa-bot-core, qa-bot-proxy — each had feature/non-agentic-proxy-2026-04-10 merged upstream and behind main; checked out main, pulled, deleted the local branch
access-qa-bot had a stale working-tree edit changing BACKEND_ID from 'uky' to 'access'. main already has 'access', so the modification was content-identical to main — stashed for safety as stash@{0} rather than discarded
access-ci-ui chore/bump-access-qa-bot-3.5.2 — left alone. PR #78 was closed without merge on 2026-04-12; upstream main is now at ^3.7.1, well past the ^3.5.2 the branch tried to land. Definitively superseded
access-mcp feat/listing-urls-in-tool-responses — left alone, DRAFT PR #3 still active (the see_all_url work)

Qwen integration shipped to `feat/qwen-integration` (commit `a1c54bb`)

Cut a fresh branch off main and built the Phase 5 LLM-layer changes. Highlights:

</think> reasoning strip in the LLM client wrapper. Subclassed ChatOpenAI to override _generate/_agenerate and remove anything up to and including </think> from response content. Strip happens during each call, not at end-of-loop, so when the prebuilt react agent re-sends conversation history on subsequent turns the model never sees its own prior reasoning replayed back as input. Andrew originally suggested stripping in the loop node, but tool_calling_loop.py delegates to LangGraph's create_react_agent and we don't own the per-turn LLM call there — provider-layer placement was confirmed with him as the canonical client-side pattern (matches DeepSeek-R1 / Qwen guidance).
enable_thinking parameter wired through LLMProvider.get_chat_model, both providers, and get_llm(). Default None = nothing extra in the request body. When set, OpenAICompatibleProvider adds extra_body={"chat_template_kwargs": {"enable_thinking": value}} so the request lands at LiteLLM with the right shape. No call site uses it yet — the knob is exposed per Andrew's note for future fast-path experiments.
13 unit tests in tests/test_llm_providers.py covering the strip helpers (no-op when tag absent, idempotent, multimodal-content skip), the wrapper subclass type, and the enable_thinking plumbing on/off semantics across both providers. mypy/ruff clean; full suite still 263 pass / 1 skip.
qwen_smoke.py — runnable end-to-end check at the LLM layer (independent of the agent graph). Two round-trips: default thinking + enable_thinking=False, both expected to come back without </think> artifacts.
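
For reference, a minimal sketch of a strip helper with the three properties the unit tests check (no-op when the tag is absent, idempotent, skips multimodal content). This is not the code in the wrapper: cutting at the last `</think>` and trimming leading whitespace are both assumptions, and `strip_reasoning` is a hypothetical name.

```python
THINK_CLOSE = "</think>"


def strip_reasoning(content):
    """Drop everything up to and including the last </think> tag.

    Non-string content (e.g. a list of multimodal parts) is returned untouched.
    """
    if not isinstance(content, str):
        return content
    _, tag, tail = content.rpartition(THINK_CLOSE)
    # rpartition returns an empty separator when the tag is absent.
    return tail.lstrip() if tag else content


raw = "<think>user wants a one-word reply</think>\n\nOK"
print(repr(strip_reasoning(raw)))
```

Running the helper on every response, rather than once at the end of the loop, is what keeps prior reasoning out of the history the react agent re-sends.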

UKY endpoint smoke failed — vLLM backend not reachable

Updated .env with LLM_PROVIDER=vllm + the three VLLM_* variables Andrew supplied (https://jump-external.ccs.uky.edu/v1, model ccs/Qwen/Qwen3.6-35B-A3B-FP8). Smoke and a raw curl reproduce the same error:

HTTP 500 — litellm.InternalServerError: OpenAIException - Connection error..
Received Model Group=ccs/Qwen/Qwen3.6-35B-A3B-FP8
Available Model Group Fallbacks=None

/v1/models returns 200 with ccs/Qwen/Qwen3.6-35B-A3B-FP8 in the list, so the LiteLLM proxy itself is healthy and the model group is registered — the failure is downstream, between LiteLLM and the vLLM backend. A second registered route, ccs/qwen3.6:35b, returns 400 (AnthropicException - Model 'qwen3.6:35b' was not found), which looks like a separate misconfiguration. Vikram (not Andrew) controls those servers; ping pending.

Curl repro proves the failure is upstream of our code: the same 500 comes back from raw curl, bypassing Python entirely. Config is taking effect; the smoke printed VLLM_BASE_URL and VLLM_MODEL_NAME correctly out of settings before the call.

Cleanup commit landed (`0ecd5ad`, −475 lines net)

Same branch, second commit. Removed dead code from Issue #15's "removable now" bucket. The full pgvector path was disabled 2026-03-23 and never re-enabled, and a RAG-only synthesis fallback had zero callers anywhere. Removed:

src/services/qa_client.py deleted entirely (229 lines)
_search_pgvector and _get_threshold_for_query_type removed from rag_answer.py (only consumers of qa_client)
_synthesize_with_rag_only and RAG_ONLY_SYNTHESIS_PROMPT removed from synthesize.py (no callers)
Config fields with no remaining readers: QA_SERVICE_URL, RAG_TOP_K, RAG_THRESHOLD_STATIC/COMBINED/FALLBACK, RAG_SIMILARITY_THRESHOLD, MAX_RETRIES_PER_TOOL, MAX_RETRIES_TOTAL, TIMEOUT_BUDGET_MS
The QA Service URL startup log line in main.py
Matching env-var stubs in .env.example, docker-compose.yml, docker-compose.prod.yml, plus README config table and file tree

Backed off from the original Issue #15 list on RetryContext: memory said it was only consumed by recover_node, but grep showed plan.py also uses retry_context in its prompt. Both are part of the legacy chain that's still selectable via agent_full_legacy for eval comparisons. Killing RetryContext would break that path. Separate decision.

ruff/mypy/pytest stayed green (263 pass / 1 skip).

Endpoint diagnostic — confirmed our integration is sound, narrowed down the UKY-side breakage

While Qwen3.6 is offline, exercised the wrapper against an alternate model on the same proxy to verify our code is correct independent of the Qwen target.

End-to-end through our wrapper against ccs/scout succeeded. The full stack (.env → settings → get_llm() → _StrippingChatOpenAI → HTTPS to LiteLLM → response parsing) round-trips cleanly. Wrapper class _StrippingChatOpenAI confirmed; response comes back as plain content. The strip is a no-op for non-reasoning models (no </think> to remove), and that no-op behavior is also covered by unit tests. So the integration path is proven correct against a live UKY model; nothing in our code is blocking.

Broader UKY routing pattern found while probing for any other reasoning model that's hot:

Every name:tag format model (ollama-style: ccs/qwen3.6:35b, ccs/deepseek-r1:8b, ccs/qwen3.5:9b, ccs/qwq:32b, ccs/llama3.1:8b, ccs/gemma3:27b, ccs/mistral-small3.2:24b) returns HTTP 400 with AnthropicException - Model 'X' was not found. LiteLLM has these model groups configured to route through Anthropic, which doesn't host them. Looks like a config-template error in LiteLLM — they should be routing to vLLM/Ollama, not to Anthropic.
The slash-format ccs/Qwen/Qwen3.6-35B-A3B-FP8 returns HTTP 500 — LiteLLM accepts the request, identifies the Model Group, then fails to reach the vLLM backend behind it. vLLM isn't serving that model right now.
ccs/scout (also slash-format, no version tag) returns 200 cleanly.

So the proxy itself is healthy, auth is fine, and at least one model group is correctly wired through. The breakage is per-model-group: one (scout) works, all the colon-format ones are misrouted, and the FP8 Qwen3.6 vLLM target is down.
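
The triage above reduces to a status-plus-body check, sketched here. The status codes and error strings are the ones observed in this entry; the `classify_probe` function and its labels are hypothetical, and the probe table is filled in by hand rather than fetched live.

```python
def classify_probe(status: int, body: str) -> str:
    """Map one chat-completion probe to a failure bucket (labels are made up)."""
    if status == 200:
        return "healthy"
    if status == 400 and "AnthropicException" in body:
        return "misrouted_to_anthropic"
    if status == 500 and "Connection error" in body:
        return "backend_unreachable"
    return "unknown"


# Observed responses, transcribed from the notes above.
probes = {
    "ccs/scout": (200, ""),
    "ccs/qwen3.6:35b": (400, "AnthropicException - Model 'qwen3.6:35b' was not found"),
    "ccs/Qwen/Qwen3.6-35B-A3B-FP8": (
        500,
        "litellm.InternalServerError: OpenAIException - Connection error..",
    ),
}

for model, (status, body) in probes.items():
    print(f"{model}: {classify_probe(status, body)}")
```

Keeping the two failure buckets separate matters for the Vikram ping, since the misrouting and the down vLLM target are likely different fixes.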

Open follow-ups

Vikram ping — diagnostic above is paste-ready. Two distinct issues worth surfacing: the FP8 vLLM target being down, and the broader colon-format → Anthropic misrouting (probably a one-line LiteLLM config fix).
Re-baseline on tc_loose_match_subset.yaml — gated on Qwen actually responding. After that lands, hand-read the answers per the eval-is-instrumental memory rather than chasing composites.
PR open — once at least one successful Qwen response is on the wire, open feat/qwen-integration against main. Currently two commits: a1c54bb (Qwen integration, +283/−1) and 0ecd5ad (cleanup, +3/−478).

2026-04-29 — TC battery ground-truth pass: all 14 questions verified, 11 rubric commits, real bugs surfaced

Completed the ground-truth pass on tool_coverage_battery.yaml set up by the 2026-04-28 entry. All 14 questions now have hand-verified ground truth (three-source: WebSearch + WebFetch + local MCP) and atomic-fact rubrics. 11 commits on feature/production-baseline-comparison, 3 done in-session and 8 dispatched to a fork that committed per-question:

9d4b68a fix(eval): tc-rag-04 — atomic facts, surface RAG content staleness
d20f0c8 fix(eval): tc-rag-03 — loosen F2/F4 to match RAG content, document variance
3cadc85 fix(eval): tc-rag-02 — atomic facts, fix Accelerate proposal length
11277f2 fix(eval): tc-rag-01 — atomic facts, separate login-host from 2FA enrollment
02cfbbd fix(eval): tc-xdmod-01 — atomic facts, accept content-equivalent realm naming
2ed4dad fix(eval): tc-events-01 — refresh snapshot, tighten F4 to grade link target
0dd6474 fix(eval): tc-nsf-01 — refresh snapshot to 2026-04-29
42c597e fix(eval): tc-allocations-02 — correct ground truth, document agent gap
0019dec fix(eval): tc-software-02 — correct ground truth, drop chatter-rewarding F3
5fe048a fix(eval): tc-software-01 — reframe F2, fix stale doc URL
c825fad fix(eval): tighten tc-status-01 F3 to grade link target, not verbal framing

Bugs found in the prior "answer key"

The pass exposed concrete cases where the existing rubric was grading against incorrect ground truth — not just minor wording, but factual errors that would have masked or fabricated agent failures:

tc-software-02 — F1 listed Delta CUDA versions as 13.1.1, 12.9, 12.8, 11.8. That was the top-level union of versions_by_resource from the MCP — i.e., Delta + DeltaAI conflated. Per-resource breakdown (verified via MCP and NCSA's own cudatoolkit/25.3_12.8 module name): Delta has only 12.8 and 11.8; 13.1.1 and 12.9 are DeltaAI-only.
tc-allocations-02 — authoring_notes claimed "zero of them are actually allocated on a Delta-family resource" for the loose-match result set. Re-verification on 2026-04-29 found 6 of 20 results are on Delta-family resources (Delta GPU or DeltaAI), but none are about climate modeling. The F1 wording ("no climate-modeling Delta projects") was structurally correct; only the explanatory notes were wrong.
tc-rag-02 — F4 specified Accelerate proposals as "Up to 3-page proposal with panel merit review". Per allocations.access-ci.org/project-types: Accelerate is 10 pages (Discover is 3). Corrected.
tc-rag-04 — RAG content predates NSF Important Notice No. 149 (July 2025). Indexed text still says "unaffiliated or self-employed CAN apply", directly contradicting current policy. The agent faithfully passes this through — F6 (institutional-email requirement) correctly fails when this happens; documented as RAG-content-staleness signal, not a rubric loosening.
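
The tc-software-02 conflation is easy to reproduce in miniature. In this sketch the Delta list is the verified one; the DeltaAI list contains only the two versions known to be DeltaAI-only and may be incomplete, and the dict shape of `versions_by_resource` is an assumption about the MCP payload.

```python
def version_key(v: str) -> tuple:
    """Sort '12.8' and '13.1.1' numerically rather than as strings."""
    return tuple(int(part) for part in v.split("."))


# Assumed payload shape; DeltaAI entries are illustrative and possibly partial.
versions_by_resource = {
    "Delta": ["12.8", "11.8"],
    "DeltaAI": ["13.1.1", "12.9"],
}


def union_versions(vbr: dict) -> list:
    """What the old F1 graded against: every version on any resource."""
    merged = {v for versions in vbr.values() for v in versions}
    return sorted(merged, key=version_key, reverse=True)


def versions_for(vbr: dict, resource: str) -> list:
    """What the question actually asks: versions on one named resource."""
    return sorted(vbr.get(resource, []), key=version_key, reverse=True)


print(union_versions(versions_by_resource))
print(versions_for(versions_by_resource, "Delta"))
```

The union matches the old F1 list, while the per-resource lookup matches the corrected ground truth for Delta.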

Real agent gap surfaced (not yet fixed)

tc-allocations-02 composite 1.00. The agent narrates loose-match search results as exact matches: presents three projects with no climate-modeling content as "20 climate modeling projects on Delta." This is a genuine product failure — search_projects returned a fuzzy result set, the agent didn't qualify it. Per the session's instruction set (no agent-prompt edits during the ground-truth pass), this finding was committed to the rubric and authoring_notes for follow-up rather than fixed inline.

Patterns fixed across the rubric set

Recurring failure modes in the prior YAML, addressed system-wide:

Chatter-rewarding "verbal framing" facts. Several questions had facts demanding the answer use a specific phrase (e.g., "cites the live status feed as its source") when the substantive thing was a link target. Same pattern across tc-status-01 F3, tc-software-02 F3, tc-events-01 F4, tc-xdmod-01 F2/F3 — all rewritten to grade what the URL points to, not how the answer phrases the citation. These rewards were a material driver of the chain-vs-loop composite gap noted on 2026-04-28.
Heading + items dict structure. tc-rag-01, tc-rag-02, tc-rag-04 used a {heading, items} shape that flattens into N redundant fact verdicts. Replaced with atomic per-claim facts. tc-rag-04 went from 13 sub-claims to 6 atomic facts.
OR clauses misread by judge. tc-rag-01 F3 had "registry.access-ci.org/ OR ~/.ssh/authorized_keys" — judge consistently demanded both. tc-software-02 F3 had a similar problem with "inline hint OR doc link" and was eventually dropped as a chatter trap. Where an OR is genuinely needed, explicit "either is sufficient" wording works better than "OR".

Other findings

describe_realms still HTTP 500 as of 2026-04-29 (xdmod MCP). tc-xdmod-01 was rewritten to grade content-equivalent metric-category enumeration since RAG is the actual answer source for that question right now.
UKY RAG responses vary run to run. tc-rag-03 ran 4.45 / 3.10 / 4.45 across three sequential runs; the 3.10 had empty rag_context. tc-rag-04 returned 0 chars on 2 of 4 runs. Composite oscillation on rag-* questions is RAG variance, not rubric or agent flakiness: the per-fact verdicts honestly distinguish "RAG worked" from "RAG fell back to general knowledge."
docs.see_all_url propagation confirmed. Multiple questions show the agent now linking to canonical pages (sds.access-ci.org for software, support.access-ci.org/outages for status, etc.) without any agent prompt change — same pattern documented in the 2026-04-28 entry, now observed at battery scale.

Framing — eval is instrumental, not authoritative

Saved a feedback memory (feedback_eval_is_instrumental.md) capturing a guidance correction made during the session: composite 5.00 does not mean an answer is perfect, only that it satisfied that judge interpretation on that run. Several judge errors were observed during the pass (misread OR clauses, demand verbal framing when the link is what counts, miss content under different naming). The HTML report from any future battery run is the starting point for a human + frontier-model review, not a verdict. The system, not the eval, is the product.

Open follow-ups

Step 12 — full TC re-run + compare-judge + HTML + Netlify publish — staged for next session via next_session_prompt.md. Comparison vs. the 2026-04-28 baseline (chain 4.63 / loop 4.26) will tell whether the rubric pass moved the gap.
tc-allocations-02 agent fix — narrate fuzzy/loose-match results as such instead of as exact matches.
xdmod describe_realms 500 — open against access-mcp.
NSF Notice 149 RAG content refresh — out of scope for the agent repo, but the rubric now signals when this corpus is stale.

While iterating on tc-announce-01 (the routine memory's process question), surfaced a structural gap: when search_announcements returned 0 hits for "Expanse", the agent invented Confluence/SDSC fallback links because nothing in the tool output, RAG, or prompt named the canonical ACCESS announcements page. RAG corpus probe confirmed the problem isn't going to be patched at that layer — 11 different "where can I find X" phrasings returned zero matches each, corpus has only documents (603) + compute-resources (80) domains.

Fix landed in access-mcp

PR necyberteam/access-mcp#3 ("feat: add see_all_url to MCP tool listings"). Lifted compute-resources' private addDocumentation() helper to a BaseAccessServer.listingDocs(context) method and applied it across 6 servers:

Server	URL surfaced
announcements	`support.access-ci.org/announcements`
events	`support.access-ci.org/events`
affinity-groups	`support.access-ci.org/affinity-groups`
system-status	`support.access-ci.org/outages`
software-discovery	`sds.access-ci.org/`
allocations	`allocations.access-ci.org/current-projects`

Additive change — JSON.stringify drops undefined so opt-out servers stay clean. compute-resources/xdmod/nsf-awards/jsm intentionally skipped (different shapes / external systems). 13 files, +115/-2, 222 tests pass (198 unit + 24 integration). Rebuilt locally via docker compose up -d --build mcp-{announcements,events,affinity-groups,system-status,software-discovery,allocations}; all 6 confirmed surfacing the new field via direct curl probes.

tc-announce-01 spot-check — no agent change needed

Re-ran the question against agent_full after the rebuild:

Before (loop-20260427-191442-61e1df): "...the official ACCESS documentation at ACCESS Documentation or the Expanse user guide at SDSC Expanse User Guide"
After (loop-20260428-202727-ba807b): "...you can check the ACCESS Announcements page..."

The agent's loop spontaneously surfaced the new docs.see_all_url field — no prompt edit, no synthesis-node teaching. Confirms the routine-memory claim that link content from tool output flows faithfully into the final answer. Composite 4.50, F1+F2 both yes; completeness=3 is the inherent ceiling for a "no results" answer, not a regression.

Implication for the routine

Updated routine_iterating_on_tc_questions.md already classifies failures as agent / rubric / judge. Adding a fourth bucket for future tc-* iterations: MCP product gap — when the agent invents a fallback link or names the wrong site, check the MCP server's tool response shape first. If there's no canonical landing-page field, the fix belongs there, not in the agent prompt. The docs.see_all_url convention is now established across the 6 servers.

Open follow-up

A full tool_coverage_battery re-run against the rebuilt MCP servers should show similar improvements on events/affinity-groups/system-status/software-discovery/allocations questions. compare-judge against pre-2026-04-28 runs would quantify it. Not done today — single-question validation was the intent.

Two comparison runs published — chain vs loop, post-MCP-fix

Ran the parity-check comparison end-to-end against the rebuilt MCP servers, on both batteries:

Battery	Questions	Chain composite	Loop composite	Margin	Run-judge winner	Report
Phase 3 smoke	40	4.80	4.81	+0.01 (loop)	B (small)	phase3-smoke
Tool coverage	14	4.63	4.26	−0.37 (chain)	A (small)	tc-battery

The smoke battery is essentially tied (0.01); the TC battery shows the chain pulling ahead by 0.37. The gap is concentrated in tool-shaped questions where the chain's verbose RAG-included answers score higher on per-fact verdicts and run-judge "comprehensiveness", while loop's more selective answers score lower despite often using fresher tool data. The run-judge self-flagged this calibration concern explicitly: "the judge may have underestimated the value of up-to-date information provided by System B."

Pattern paragraph from the TC compare-judge:

Chain wins on detailed context, specific examples, actionable guidance.
Loop wins on current/accurate data, especially software versioning.
Tied on straightforward info-only questions.

Read: at least half of the 0.37 gap is rubric/judge artifact, not behavioral regression. Per-fact verdicts and the run-level judge both reward comprehensiveness; neither penalizes verbose-from-RAG over concise-from-tools. This is the eval-system work the 2026-04-27 entry was setting up.

Run IDs:

Smoke: baseline chain-20260428-203336-fefdc1, candidate loop-20260428-203338-35b0e0
TC: baseline chain-20260428-205550-771850, candidate loop-20260428-205553-7eecb5

Pagination bug surfaced — comb-002 in the smoke battery

Sample answer from comb-002: "There are currently 100 active allocation projects using GPU resources." The 100 is exactly page_size from searchProjects's page-1 response. Agent is reporting the page-1 batch as the population total.

Right fix is at the MCP layer, no crawling. The upstream API already returns pages: number on every page-1 response (ProjectsResponse.pages at allocations/src/server.ts:48). The MCP layer just drops it when constructing the tool response shape. Same fix-at-source pattern as docs.see_all_url:

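return JSON.stringify({
  // Count of items in this page-1 batch, not the population total.
  total: items.length,
  items,
  pagination: {
    page: 1,
    // Forwarded as-is from the upstream ProjectsResponse.pages.
    pages: response.pages,
    has_more: response.pages > 1,
  },
  docs: this.listingDocs("search"),
});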

Forwards what the upstream already told us — no extra fetches, no agent-side crawling. The agent then has unambiguous info to write "over 500" or "at least 100" instead of treating the batch as the whole. For tools where upstream genuinely doesn't return totals (some software-discovery / system-status endpoints), surface pagination: { has_more: true, total_known: false } so the agent knows it's seeing a partial.
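
A minimal sketch of how the consuming side could turn that pagination block into a qualified count. The helper name describe_count and its exact wording are hypothetical; only the has_more and total_known keys come from the proposal above.

```python
def describe_count(shown: int, pagination: dict) -> str:
    """Phrase a result count without overstating what one page proves."""
    if pagination.get("has_more"):
        # More pages exist, so the page-1 batch is only a lower bound.
        return f"at least {shown}"
    if pagination.get("total_known") is False:
        # Upstream gave no total; say so instead of implying completeness.
        return f"{shown} shown (total unknown)"
    return str(shown)
```

For the comb-002 case this would produce "at least 100" rather than a bare 100.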

Worth its own follow-up PR after today's wins are absorbed. ~5-10 lines per tool, similar PR shape to today's see_all_url change.

Two side-benefits-for-Vikram, traced

Vikram surfaced two UX preferences in the 2026-04-27 sync: (1) concise responses preferred over long ones, (2) followable links to a comprehensive, searchable source. Today's work touches both, and the lineage is worth pinning down so it doesn't get re-litigated.

Brevity is structural, not luck. The legacy chain has a synthesize node that ingests RAG output (notably UKY's /ask paragraph responses, which run 400-800 words by default) and weaves it into the final answer. The loop has no synthesize node — the LLM produces the final answer when it stops calling tools, which tends toward brevity given the conversation context. So loop answers are consistently shorter without any prompt change. Direct consequence of removing the synthesis step. Anticipated by the Phase 3 architecture work; not stumbled into. Most visible on tc-announce-01: legacy ~280 words including 8 bullets, loop ~50 words.

Followable links — landed today. PR necyberteam/access-mcp#3 adds docs.see_all_url to 6 MCP servers' tool responses. Each tool now returns a canonical landing-page URL alongside its results, and the agent surfaces those URLs faithfully without prompt change. Matches Vikram's "show top results AND a followable link to the source" pattern exactly.

Tracing the legacy chain's tc-announce-01 answer to its source. The eight detailed bullets about Expanse hardware/policies that the legacy chain produced today were traced via eval_scores.context.rag_context to UKY's /ask synthesis service. UKY retrieved chunks from SDSC's Expanse user guide PDF, and their synthesis LLM wove them into a paragraph framed as "the most recent ACCESS announcements about Expanse highlight..." — because that's what the user asked for, and UKY's synthesis is willing to write it that way. The chain's synthesize node verbatim-included UKY's paragraph; the MCP search_announcements tool also ran and returned 0, with that "no new announcements" line buried at the very end after the 8 confidently-stated bullets. Facts: real (SDSC docs). Framing: wrong (specs presented as announcements). Effective experience: arguably worse than hallucination — a skimming reader would treat the 8 bullets as recent announcements before reaching the disclaimer. Connects to the architecture conversation Vikram + Andrew opened 2026-04-27 about how chunks should reach the agent (raw chunks vs MCP-wrapped vs full synthesis); today's loop avoided this trap by virtue of its different relationship with retrieval.

Side-benefit #1 (brevity) depends on the chain → loop migration; if the chain stays alongside loop for any user segment, that segment loses brevity. Side-benefit #2 (followable links) lands regardless of agent variant — even legacy chain answers now carry the canonical URLs (the chain just pads them with UKY synthesis, but the URL is still there).

TC-battery re-run + answer-quality review (afternoon)

Ran step 12 — full TC re-run on both systems against the post-rubric YAML, then read all 28 answers manually rather than treating judge composites as the verdict (per the eval-is-instrumental discipline established in feedback_eval_is_instrumental.md). Today's runs: chain composite 4.80, loop composite 4.52, vs yesterday's 4.63 / 4.26. The 0.28 gap decomposed cleanly: ~0.25 comes from one real agent defect (tc-allocations-02 loop fabrication, composite 1.00), ~0.08 comes from RAG retrieval variance on tc-rag-01 / tc-rag-03 where the loop's rag_context was null while the chain's was populated (UKY flake, not behavior). Two real loop wins (tc-software-02 cited both CUDA versions vs chain's one; tc-announce-01 concise+correct vs chain's UKY-synthesis-as-announcements). Eleven of fourteen questions tied or near-tied. Report published: https://access-ci-reports.netlify.app/tc-battery-2026-04-29.html

Three loop improvements identified, in priority order:

Loop fabricates relevance from loose-match search results — tc-allocations-02 narrated 3 unrelated projects (RNA, photosynthesis, solar/wind) as "20 climate-modeling projects on Delta."
Loop's general-knowledge fallback isn't ACCESS-grounded when RAG returns null — tc-rag-01 used wrong hostname expanse.sdsc.edu instead of login.expanse.sdsc.edu, missed passive.sdsc.edu 2FA portal; tc-rag-03 produced a generic Globus tutorial without the ACCESS Collection Search workflow.
Loop lacks order-of-magnitude qualifiers when listing examples — tc-affinity-01 says "several notable examples" with no count.

#1 and #3 were both tractable as MCP-layer fixes (the established see_all_url pattern: have the MCP tell the agent something it currently has to guess). #2 is a pure prompt-side gap.

`access-mcp` PR #3 expansion — `links` rename + pagination + query_relevance

Andrew flagged that docs was a misleading key name for what is really a structural URL, and suggested links. Two new commits on top of the existing see_all_url work:

d4d9845 — Renamed listingDocs → listingLinks and docs → links across base + 6 servers + tests. Added the two new structural-metadata fields to the three search tools where the documented bugs surfaced: pagination: { matched, has_more, total_known } and query_relevance: "exact" | "loose_match". Applied to allocations.search_projects, affinity-groups.search_groups, software-discovery.search_software.
4bb5ce6 — Extended both fields uniformly across all 16 listing/search call sites for shape consistency. Servers without paginated upstream APIs surface { has_more: false, total_known: true }; sample-and-stop endpoints (e.g., the listProjectsBy* family in allocations) surface { total_known: false, has_more: results.length >= limit }.
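
The two pagination rules from 4bb5ce6 can be sketched as follows. The actual change lives in TypeScript in access-mcp; this Python rendering and the helper names unpaginated and sample_and_stop are assumptions made only to pin the rules down.

```python
from typing import Literal, TypedDict

class Pagination(TypedDict, total=False):
    matched: int
    has_more: bool
    total_known: bool

QueryRelevance = Literal["exact", "loose_match"]

def unpaginated() -> Pagination:
    """Upstream returns everything in one response, so nothing is hidden."""
    return {"has_more": False, "total_known": True}

def sample_and_stop(results: list, limit: int) -> Pagination:
    """Endpoint stops at `limit`; a full batch hints that more may exist."""
    return {"total_known": False, "has_more": len(results) >= limit}
```

A batch that fills the limit exactly is reported as has_more: True, which errs toward telling the agent it may be seeing a partial list.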

426 unit tests pass. PR #3 still in draft; versioning intentionally untouched per repo convention (versions bump in dedicated chore commits on main).

Loop system prompt — three additions (`access-agent` `88eccdd`)

Three additions to SYSTEM_IDENTITY in src/agent/prompts/tool_calling_loop.py:

Null-RAG fallback for documentation-style questions (active immediately) — when no reference context arrives, the loop is told to surface the canonical doc URL via MCP, or admit lack of grounding plainly. ACCESS-CI specifics (login hostnames, 2FA portals, identity-provider names, registry URLs) diverge enough from general HPC conventions that a generic answer is subtly wrong. Targets tc-rag-01 / tc-rag-03.
pagination interpretation (latent until PR #3 ships) — qualify counts when has_more or total_known: false. Targets tc-affinity-01 / comb-002.
query_relevance interpretation (latent) — when loose_match, inspect items against the actual topic; do not narrate fuzzy results as exact matches. Targets tc-allocations-02.

Items 2 and 3 are no-ops until PR #3 merges and the relevant servers republish, but item 1 is active immediately. Belt-and-suspenders: the see_all_url precedent (the loop spontaneously surfaced the new field with no prompt change) suggests items 2 and 3 might also work without prompt help, but bundling the prompt nudge guards against the case where they don't.

Andrew's review of PR #14 — five items addressed

Andrew (with assist from his Claude) posted a substantive review. Five concrete items, addressed in three commits + one tracking issue. Order: most-urgent first.

Production crash-loop on wrapt 2.x — `ec3414d`

The uv.lock regenerated on this branch pinned wrapt==2.1.2, but opentelemetry-instrumentation-langchain 0.60.0 calls wrap_function_wrapper(module=...) and wrapt 2.x removed the module= kwarg. Every container start since 2026-04-22 has thrown TypeError in init_telemetry → LangchainInstrumentor().instrument(). CI didn't catch it because nothing in the test suite exercised src.main's lifespan. Compounding: Dockerfile did pip install --no-cache-dir ., which re-resolves from pyproject.toml and ignores uv.lock — so the --locked CI check was decorative for production builds.

User-visible impact is zero — the live ACCESS chatbot routes through UKY/proxy via the hardcoded VITE_API_ENDPOINT baked into access-qa-bot at publish time, not through agent prod (the "Drupal Insanity" from reference_production_deploy_chain.md). But every smoke test of agent prod has been red for a week, masking any other regression and blocking the actual launch path.

Three changes:

Pin wrapt<2 in pyproject.toml (defensive; droppable when Traceloop openllmetry #4025 / #4048 ships).
Switch Dockerfile to uv sync --locked --no-dev so the lockfile actually governs production builds.
Add tests/test_main_lifespan_smoke.py: two tests that exercise init_telemetry and the full FastAPI lifespan. Without these, the wrapt incompatibility would have continued to pass CI unnoticed.
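
A stdlib-only sketch of why a lifespan smoke test catches this class of failure. Everything here is a stand-in: wrap_function_wrapper below only mimics a signature that no longer accepts module=, and the lifespan is a plain async context manager rather than the real FastAPI one.

```python
import asyncio
from contextlib import asynccontextmanager

def wrap_function_wrapper(target, name, wrapper):
    """Stand-in for a 2.x-style signature with no module= keyword."""

def init_telemetry():
    # Mirrors the failing call shape: passing the removed module= kwarg.
    wrap_function_wrapper(module="some.module", name="fn", wrapper=lambda *a: None)

@asynccontextmanager
async def lifespan(app):
    init_telemetry()  # runs at startup, before any request is served
    yield

async def enter_lifespan() -> None:
    async with lifespan(app=None):
        pass

def lifespan_starts_cleanly() -> bool:
    """Smoke check: True only if startup completes without raising."""
    try:
        asyncio.run(enter_lifespan())
    except TypeError:
        return False
    return True
```

Unit tests that never enter the lifespan would not execute init_telemetry at all, which matches how the crash stayed invisible.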

`READ_ONLY` bypass in `tool_calling_loop` — `e8b2f12`

The legacy chain enforces READ_ONLY by removing write capabilities from the registry at build time. The new tool_calling_loop builds tools directly from the MCP catalog and never sees the registry. With both USE_TOOL_CALLING_LOOP=True and READ_ONLY=True (the actual production-hardening case), the LLM could still call manage_announcements, open_ticket, report_login_problem, report_security — the audit's "READ_ONLY blocks all writes" claim did not hold on the new code path.

Fix: added WRITE_MCP_TOOL_NAMES constant in capabilities.py enumerating the 6 MCP tool names corresponding to the 4 write capabilities, and _apply_read_only_filter() helper in tool_calling_loop.py that strips them when settings.READ_ONLY is True. Two unit tests machine-verify the audit guarantee on the new code path. docs/security/write-capability-audit.md updated with a parallel-paths section + 2026-04-29 changelog entry.
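
A minimal sketch of the filter, assuming tools are objects with a name attribute. The set below holds only the four tool names mentioned in this entry (the real constant enumerates six), and the Tool class and the read_only parameter are illustrative stand-ins for the real tool type and settings.READ_ONLY.

```python
from dataclasses import dataclass

# Illustrative subset: the journal names these four; the real constant lists six.
WRITE_MCP_TOOL_NAMES = frozenset({
    "manage_announcements",
    "open_ticket",
    "report_login_problem",
    "report_security",
})

@dataclass(frozen=True)
class Tool:
    name: str

def apply_read_only_filter(tools: list, read_only: bool) -> list:
    """Drop write-capable tools when read-only mode is on."""
    if not read_only:
        return list(tools)
    return [t for t in tools if t.name not in WRITE_MCP_TOOL_NAMES]
```

Filtering by name at tool-build time means the LLM never sees the write tools, so there is nothing for it to call.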

Cleanup batch — `d3a76af`

Three smaller items in one commit:

Dropped agent_rag_only system mode entirely. Andrew identified it as a silent no-op — _run_agent mutated os.environ["ENABLED_CAPABILITIES"], but pydantic-settings is constructed at import and the capability registry is cached on first call. The team's own test_capabilities_read_only.py already uses monkeypatch.setattr on the settings object, confirming env mutation doesn't work. Past compare-judge runs that used agent_rag_only were silently comparing agent_full to agent_full. Removing the mode is cleaner than fixing it given (a) RAG endpoints planned for deprecation and (b) no battery uses it.
Dropped include_context: bool = True parameter on compare_runs. The False branch had no callers (verified by grep). Behavior unchanged for every existing caller.
Expanded argilla-push records with question_set + tool_count metadata. Andrew's specific kwarg names didn't match the current build_argilla_record signature, but the underlying intent (more filterable Argilla metadata) is reasonable. Added the two parameters to build_argilla_record + the dataset's metadata schema, threaded through both call sites.
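
The agent_rag_only no-op from the first item can be reproduced in miniature. This is a stdlib stand-in, not the real code: a dataclass plays the role of the pydantic-settings object, lru_cache plays the cached capability registry, and run_with_env_override is a hypothetical name for the env-mutating pattern.

```python
import os
from dataclasses import dataclass
from functools import lru_cache

@dataclass(frozen=True)
class Settings:
    enabled_capabilities: str

def load_settings() -> Settings:
    # Reads the environment once, at construction time.
    return Settings(os.environ.get("ENABLED_CAPABILITIES", "all"))

settings = load_settings()  # built once at import, like a module-level settings object

@lru_cache(maxsize=1)
def capability_registry() -> tuple:
    # Cached on first call; later calls never re-read settings.
    return tuple(settings.enabled_capabilities.split(","))

def run_with_env_override(value: str) -> tuple:
    """The no-op pattern: mutate os.environ, then ask for the registry."""
    before = capability_registry()
    os.environ["ENABLED_CAPABILITIES"] = value
    return before, capability_registry()
```

Both values come back identical, which is the same reason the existing tests patch the settings object directly instead of the environment.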

Two items deliberately flagged rather than fixed

The "rag_and_plan → tool_calling_loop edge" Andrew named doesn't exist in the graph — rag_and_plan only routes to execute / synthesize. The actual dead-when-flag-on situation is rag_and_plan becoming unreachable, by design for rollback safety during the launch window. Cleaner removal belongs with the legacy-chain teardown when the flag becomes permanent.
The justifications kwarg on build_argilla_record — dict-shaped, the metadata sink expects scalars, and adding it as a field rather than metadata is a larger schema change. Skipped pending clarification on the desired Argilla surface.

Issue #15 — broader cleanup audit

Andrew's review surfaced ~20 dead-code candidates in three buckets: already-dead-on-main (removable now), removable-when-loop-permanent, removable-when-RAG-deprecated. Filed as a tracking issue separate from PR #14 — bundling into the PR would defeat its launch-discipline framing.

Evening — TC battery re-run on rebuilt MCP + first prompt-iteration cycle

After Andrew's review fixes, rebuilt the 6 MCP servers (docker compose up -d --build mcp-{announcements,events,affinity-groups,system-status,software-discovery,allocations}) so the new links/pagination/query_relevance metadata would actually be in the tool responses. Postgres container hit disk-full mid-run on the first attempt — docker system prune -f --volumes reclaimed 23GB and the recovery completed cleanly. Re-ran TC battery on both systems.

Postfix run vs morning baseline:

	Morning	Postfix	Δ
Chain	4.80	4.68	−0.12
Loop	4.52	4.64	+0.12
Gap	0.28	0.04	−0.24

Report: https://access-ci-reports.netlify.app/tc-battery-2026-04-29-postfix.html

Per-question read against morning loop (judge-free, reading actual answer text):

Verified the new metadata IS reaching the agent (query_relevance, pagination, links present in tool-result context for tc-allocations-02; SQL probe confirmed).
tc-rag-01 (+0.85) and tc-rag-03 (+0.25) gains are mostly UKY RAG firing this run when it didn't last run — variance, not behavior change.
tc-nsf-01 (+0.60) is judge mood — same answer shape, no real change.
tc-allocations-02 (+0.50) is partial: agent now hedges with "or similar topics" but still asserts climate-modeling framing for projects that aren't.
tc-affinity-01 (no change): agent ignored pagination.matched, still says "several notable examples".

Most of the composite movement is variance + judge mood. Real behavior changed on exactly one question (tc-allocations-02), and only partially.

Prompt-iteration cycle on `tc_loose_match_subset.yaml` (4 questions)

Created eval/questions/tc_loose_match_subset.yaml — tc-allocations-02 + tc-affinity-01 as targets, tc-software-02 + tc-status-01 as regression controls. ~2-min runs.

Iter-1: strengthened query_relevance: "loose_match" instruction with mandatory opening structure + explicit prohibition of "or similar topics" hedges. Result on tc-allocations-02: 4.75 (up from 1.50). Agent opened: "Searching for ACCESS projects on Delta related to climate modeling returned 53 results, but none are specifically focused on climate modeling." Used pagination.matched=53 ✓, declined the topic framing ✓, qualified examples as "related to other topics" ✓. Looked like a clean fix.

Iter-2: also strengthened the pagination instruction to a similar MUST form, hoping to fix tc-affinity-01. Result on tc-allocations-02: 1.00 (regression). Agent: "There are 53 projects on the NCSA Delta GPU related to climate modeling, though the relevance to climate modeling varies." Back to fabricating with hedge. The pagination MUST rule appears to dilute the query_relevance MUST rule.

Iter-1 reverted (re-running the iter-1 prompt one more time): tc-allocations-02 = 1.65. Agent acknowledges "based on a loose match search" but still asserts the topic framing.

So out of 3 runs of this prompt class: 1 clean fix, 2 hedged fabrications. The prompt is an improvement but not a reliable fix. The agent reads query_relevance: "loose_match" (proven by literal mention in two of the three answers) but follows the directive to decline the framing only ~1/3 of the time. Prompt discipline is unreliable for declining-the-framing instructions.

tc-affinity-01 is a separate problem — search_affinity_groups(query='GPU') returns 6 substring matches, but the rubric expects the 12-group GPU-equipped universe. The tool's substring matching doesn't surface groups whose searchable text doesn't contain "GPU". Not a prompt issue at all; needs a tool/data-layer fix (e.g., has_gpu flag on each group instead of a substring search).
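
A toy comparison of substring search against an explicit flag. The group records are invented for illustration, and has_gpu is the hypothetical field floated above, not something the MCP server returns today.

```python
# Hypothetical records; names and descriptions are made up.
groups = [
    {"name": "gpu-users", "description": "GPU programming help", "has_gpu": True},
    {"name": "accelerated-ml", "description": "Machine learning on accelerators", "has_gpu": True},
    {"name": "web-portals", "description": "Browser-based access", "has_gpu": False},
]

def substring_search(query: str) -> list:
    """Match only when the query text literally appears in name or description."""
    q = query.lower()
    return [g["name"] for g in groups
            if q in g["name"].lower() or q in g["description"].lower()]

def gpu_equipped() -> list:
    """Match on the structured flag, independent of wording."""
    return [g["name"] for g in groups if g["has_gpu"]]
```

The second group is GPU-equipped but never says "GPU", so the substring search misses it while the flag lookup does not.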

Decision: stop iterating on the prompt; go to the MCP layer

Committed iter-1 prompt as 9e449ac (it's directionally better than no prompt change — when it works it produces the structurally-right answer) plus the subset YAML. Recommended next-session move: tool-side fix on allocations.search_projects so loose-match fabrication is structurally impossible (e.g., return items: [] plus unranked_loose_matches: [...] when query_relevance: "loose_match"). Same precedent as see_all_url, pagination, query_relevance — fix at the source rather than relying on prompt discipline. Tracked in next_session_prompt.md.
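
The proposed response shape can be sketched as below. The function name is hypothetical and the real fix would be TypeScript inside access-mcp; only the items, unranked_loose_matches, and query_relevance keys come from the recommendation above.

```python
def shape_search_response(results: list, query_relevance: str) -> dict:
    """Keep loose matches out of `items` so they cannot be narrated as hits."""
    if query_relevance == "loose_match":
        return {
            "query_relevance": query_relevance,
            "items": [],
            "unranked_loose_matches": results,
        }
    return {"query_relevance": query_relevance, "items": results}
```

With an empty items list, an agent that summarizes items has nothing on-topic to count, whatever the prompt says.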

Reports published today:

Morning baseline: https://access-ci-reports.netlify.app/tc-battery-2026-04-29.html
Postfix (after MCP rebuild + loop prompt): https://access-ci-reports.netlify.app/tc-battery-2026-04-29-postfix.html
Subset (4-question, iter-1 prompt): https://access-ci-reports.netlify.app/tc-subset-strongprompt-2026-04-29.html

2026-04-27 — Vikram + Andrew sync; tool_coverage_battery atomization + YAML cutover

Battery work (morning)

Reworked access-agent/eval/questions/tool_coverage_battery.json after Andrew's review of 22abfbd flagged that many required_facts were clause-stacked to hit a 3-4 count, with overlap and snapshot data mixed into durable claims. Five commits on feature/production-baseline-comparison:

Commit	Change
`4bf1a44`	Atomized `required_facts` across all 14 questions; added `ground_truth_stability` field (9 `time_bound`, 5 `stable`); moved snapshot data from facts to `authoring_notes`.
`13feb2f`	Converted JSON → YAML for human-authoring ergonomics. Loader at `src/eval/questions.py` dispatches by extension; sibling batteries stay JSON. Added pyyaml as direct dep + types-PyYAML.
`e35e2e6`	Made `authoring_notes` a bullet list (was prose blob); added blank lines between questions.
`9c5756c`	Nested enumerative `required_facts` as `{heading, items}` dicts (e.g., eligible institutions list, XDMoD realms) instead of comma-blobs.
`39430ab`	Tightened YAML representer to use block scalars only when actually needed (long strings with apostrophes), down from ~35 to 14.
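
The extension dispatch described for 13feb2f might look roughly like this. It is a sketch, not the contents of src/eval/questions.py: the function name load_battery is assumed, and PyYAML is imported lazily so the JSON path runs without it.

```python
import json
from pathlib import Path

def load_battery(path: str):
    """Load a question battery, choosing the parser from the file extension."""
    p = Path(path)
    text = p.read_text(encoding="utf-8")
    suffix = p.suffix.lower()
    if suffix in {".yaml", ".yml"}:
        import yaml  # PyYAML; only needed for YAML batteries
        return yaml.safe_load(text)
    if suffix == ".json":
        return json.loads(text)
    raise ValueError(f"unsupported battery format: {suffix}")
```

Dispatching on the suffix lets the sibling JSON batteries keep loading unchanged while one battery moves to YAML.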

The factoid scorer that consumes these facts is not yet built — today's work is the ground-truth corpus shape, not the grading. Running the battery now would still hit the existing rubric judge and produce numbers indistinguishable from last week's; the work pays off when the factoid scorer lands.

Eval-system repeatability is load-bearing — `time_bound` refresh pattern

The factoid-scoring + ground-truth work is the eval system itself, not a one-time exercise. We will iterate on agent setup repeatedly, and the eval must be reliably re-runnable each time. The time_bound required_facts refresh pattern (regen from live MCP before each run) is therefore not optional polish — it's the operational core. Documented in access-agent/eval/questions/tool_coverage_battery.README.md (new "Repeatability — refreshing time_bound facts" section). Refresh script itself is not yet built; manual fallback for now.
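
Since the refresh script does not exist yet, the following is only a guess at its core loop. The id key and the fetch_facts callable are assumptions; ground_truth_stability and required_facts are the field names from the battery.

```python
from typing import Callable

def refresh_time_bound(questions: list, fetch_facts: Callable) -> list:
    """Regenerate required_facts for time_bound questions; return refreshed ids."""
    refreshed = []
    for q in questions:
        if q.get("ground_truth_stability") == "time_bound":
            # fetch_facts stands in for a live MCP query keyed by question id.
            q["required_facts"] = fetch_facts(q["id"])
            refreshed.append(q["id"])
    return refreshed
```

Stable questions are left untouched, so a refresh only rewrites the facts that can actually go stale.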

Vikram + Andrew sync (afternoon)

Side topics surfaced during the meeting, captured here so they don't evaporate:

UX direction (Vikram). Concise responses preferred over long ones — "show top five and a follow-able link to the source" pattern. Implication for the agent's synthesis prompt: brevity with explicit pointers, not exhaustive enumeration. Not a launch blocker, but a product input worth keeping.
Architecture: how chunks reach the agent. Three options on the table — (1) raw chunks endpoint (pipeline-driven, agent has no say), (2) chunks wrapped in MCP (agent-driven, can iterate, fits tool-calling-loop architecture), (3) full synthesis service (current UKY /ask, returns paragraph). Direction: MCP-wrapped chunks aligns with the long-term tool-calling-loop shape and the OAuth-proxy pattern (lets non-MCP-native clients use the same retrieval surface). Raw API can be the thin layer the MCP wraps. Current synthesis remains as a transitional path. Tracks the "RAG as a tool" Phase 4+ direction Andrew floated previously.
"Local model" terminology — clarified. Andrew's "local model" = UKY-hosted (their on-Grace-Hopper vLLM), NOT laptop-local. Phase 5 of the launch plan covers the swap from OpenAI gpt-4o to UKY-hosted vLLM.
Current tool-calling LLM. OpenAI gpt-4o (default), configurable via LLM_PROVIDER env var (openai | vllm | access_ai | fireworks). Eval judge separate at gpt-4o-mini via EVAL_JUDGE_MODEL. Provider swap is config-only — no graph code change required.
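
A sketch of what "config-only" provider selection implies, assuming the four values listed above are the complete set and openai is the fallback. The function name and the validation behavior are illustrative, not taken from the repo.

```python
import os
from typing import Mapping, Optional

SUPPORTED_PROVIDERS = ("openai", "vllm", "access_ai", "fireworks")

def resolve_provider(env: Optional[Mapping] = None) -> str:
    """Read LLM_PROVIDER, defaulting to openai and rejecting unknown values."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "openai").strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unknown LLM_PROVIDER: {provider!r}")
    return provider
```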

2026-04-24 — Tool-use investigation: counting tool calls is the wrong metric

Followed up on Andrew's concern from the Apr 23 parity check that the loop calls tools less often than the chain. Postgres diff confirmed the shape: across 40 questions, chain called ≥1 tool on 28; loop skipped tools on 13 of those 28 (one-sided pattern, never reverse).
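
The one-sided check can be expressed as a small diff over per-question tool-call counts. The input shape (question id mapped to call count) is an assumption about what the Postgres query returns, and the function name is hypothetical.

```python
def one_sided_skips(chain_calls: dict, loop_calls: dict) -> tuple:
    """Return (questions only chain used tools on, questions only loop did)."""
    chain_only = sorted(
        q for q, n in chain_calls.items() if n >= 1 and loop_calls.get(q, 0) == 0
    )
    loop_only = sorted(
        q for q, n in loop_calls.items() if n >= 1 and chain_calls.get(q, 0) == 0
    )
    return chain_only, loop_only
```

An empty second list is what "never reverse" means: there is no question where the loop called a tool and the chain did not.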

Prompt tweak — partial fix

Reframed SYSTEM_IDENTITY in src/agent/prompts/tool_calling_loop.py to present docs + tools as two complementary sources, with "live tool wins on disagreement" framing (commit e34e6bb). Also added a per-question tool-call summary row to the HTML report with a ⚠ when counts diverge (commit 2ee0288). New run (chain-20260424-161321-fce4d6 / loop-20260424-162652-0cab24): 5 of 13 missed-tool cases now call a tool. 8 still skip.

Report: https://access-ci-reports.netlify.app/phase3-smoke-run3.html. Compare-judge JSON at access-agent/comparisons/phase3-smoke-run3.json.

Spot-checked the remaining 8 — story changed

Pulled each side-by-side (chain's tool result, chain's answer, loop's answer):

4 of 8: chain's tool returned nothing useful (empty results or HTTP 500). Both agents wrote from docs anyway. Tool call was theatre.
1 of 8 (XDMoD usage check): loop correctly declined for anonymous user; chain answered from docs.
2 of 8 (Office Hours, webinar link): docs already contained what the tool would have added. Near-identical answers; loop's "Open OnDemand Tips and Tricks" naming was arguably sharper than chain's "the webinar."
1 of 8 (affinity groups for ML): chain ignored real group names its tool returned and gave a generic answer. Both agents may be wrong; can't tell from scores alone.

Conclusion: the gap was mostly illusory

"Loop calls tools less often" is true but doesn't translate to "loop's answers are worse." UKY's docs are strong enough that answers converge whether or not the tool is called. Tool-call count is not a proxy for grounding — an agent can call a tool that returns nothing, ignore a tool's result, or skip a tool because docs already have the answer, and the count doesn't distinguish.

Decision: parked "force loop to call more tools" architectural work (the gate-by-classifier / pre-dispatch / split-agent options). No urgent signal that loop quality is hurting. Direction reaffirmed: the only way out of this measurement gap is ground-truth answers — tool_coverage_battery (now shipped at commits 2884a06 + 22abfbd) is the seed corpus.

MCP bugs surfaced — file against access-mcp

Two real bugs, unrelated to loop/chain:

get_user_data — HTTP 500: Object of type TextContent is not JSON serializable
integrate_nsf_xdmod — HTTP 500: same TextContent is not JSON serializable error. (Earlier journal entry at line ~1070 noted the agent referencing this tool and UKY 500'ing; the new finding is that the MCP server itself is what's throwing — the JSON serialization failure is the root cause.)

Both worth filing against the access-mcp repo.

2026-04-23 (afternoon) — Phase 3 parity check + report-tooling iteration

Closed out Launch Phase 3 by running Task 7 — the eval-parity check between the legacy chain and the new tool-calling loop — and tightening up the eval CLI + HTML reporter while doing it. Phase 3 is now fully done.

Parity check — methodology

Built phase3_smoke_battery.json, a 40-question curated battery covering seven categories (static-confident, static-deflection, combined-simple, multi-tool, pure-mcp, error-prone, domain-routed). Ran it twice through the eval pipeline against the same agent image, varying only USE_TOOL_CALLING_LOOP via a new CLI choice: --system agent_full_legacy programmatically forces the flag false for the run; --system agent_full forces it true. Same MCP servers, same questions, same judge — single bit flipping which downstream path the LangGraph routing functions take.
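
A minimal sketch of the --system choice mapping onto the flag. The argparse wiring and the function name are assumptions; only the two choice names and the true/false mapping come from the description above.

```python
import argparse

# agent_full forces the loop on; agent_full_legacy forces it off.
SYSTEM_TO_LOOP_FLAG = {"agent_full": True, "agent_full_legacy": False}

def parse_system(argv: list) -> bool:
    """Return the USE_TOOL_CALLING_LOOP value implied by --system."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--system", choices=sorted(SYSTEM_TO_LOOP_FLAG), required=True)
    args = parser.parse_args(argv)
    return SYSTEM_TO_LOOP_FLAG[args.system]
```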

Result

Run 1 — legacy composite 4.73 vs loop 4.60. Loop slightly behind, biggest gap was completeness (−0.27). Compare-judge narrative: "winner A (legacy), small margin." The loop's SYSTEM_IDENTITY told it "Be concise. Researchers want the answer, not ceremony." — that was the likely culprit.
Single-line prompt change in src/agent/prompts/tool_calling_loop.py: swapped the concise instruction for "Be complete. Include the specific details researchers need to act — commands, links, numeric values, step-by-step instructions where relevant. Don't pad with ceremony, but don't strip substance either."
Run 2 — legacy 4.67 vs loop 4.67. Tied composite. Loop wins citation_quality by +0.23 and is within noise on every other dimension. Calling Phase 3 done.

Reports live: https://access-ci-reports.netlify.app/phase3-smoke-run1.html and https://access-ci-reports.netlify.app/phase3-smoke-run2.html.

Tooling improvements that landed during the run

Nine commits beyond the Phase 3 code (cad8ee1..HEAD on feature/production-baseline-comparison):

Commit	What
`492cd3d`	`phase3_smoke_battery.json` — 40 curated questions across 7 categories
`a145ab0`	`--system agent_full_legacy` choice + programmatic flag override
`1971d60`	Docs for `--system`
`0924f01`	Semantic run IDs (`chain-YYYYMMDD-HHMMSS-hash6` / `loop-…`) replacing opaque UUIDs — visible in Argilla's "Eval Run ID" filter
`94be014`	Prompt tweak: "be concise" → "be complete"
`9d7f2e4`	HTML template fix: execution traces now render for both baseline and candidate (was rendering only candidate)
`63ac882`	`--preset` flag on the html report (`grand-prix` default, `phase3-parity` for loop-vs-chain)
`c5440f1`	Hardcoded "Raw RAG"/"raw_rag" template fallbacks neutralized
`5efeb3c`	Data-derived question count (was hardcoded in BATTERY_INFO) + container width 1180→1800px, column-collapse breakpoint 900px
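
One way the semantic run ID format from 0924f01 could be generated. The prefix and timestamp layout follow the IDs quoted in this journal; how the real hash6 is derived is not documented here, so the SHA-256 over a seed string is purely an assumption.

```python
import hashlib
from datetime import datetime

def make_run_id(system: str, seed: str, now: datetime) -> str:
    """Build a '<prefix>-YYYYMMDD-HHMMSS-<hash6>' run identifier."""
    prefix = "loop" if system == "agent_full" else "chain"
    stamp = now.strftime("%Y%m%d-%H%M%S")
    # Assumed derivation: first six hex chars of a SHA-256 digest.
    hash6 = hashlib.sha256(f"{system}:{seed}:{stamp}".encode()).hexdigest()[:6]
    return f"{prefix}-{stamp}-{hash6}"
```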

Architecture insight worth noting

The legacy chain (plan/execute/evaluate/recover/synthesize) and the new tool_calling_loop are registered as nodes in the same LangGraph graph (src/agent/graph.py::_build_graph_structure). Two routing functions (route_by_classification, route_after_rag) consult settings.USE_TOOL_CALLING_LOOP to decide which branch to take. So the "fork" lives at the routing decisions, not in two separate graphs. Same image, same MCP servers, same questions — just one flag flipping which downstream path is traversed. This made parity testing trivial: the eval CLI's --system flag is the same boolean from a different angle.
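
The fork-at-routing idea can be sketched in plain Python without importing LangGraph. The flag name and the `"plan"` / `"tool_calling_loop"` targets come from this journal; the `needs_tools` state key and the `"static_answer"` target are hypothetical placeholders for whatever the real `route_after_rag` inspects and returns on its non-tool branches.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Settings:
    # Stand-in for the real settings object (assumption).
    USE_TOOL_CALLING_LOOP: bool = False


def route_after_rag(
    state: dict, settings: Settings
) -> Literal["plan", "tool_calling_loop", "static_answer"]:
    """Pick the downstream node; one flag decides which tool path is used."""
    # A single shared local so every tool-bound return flips together.
    tool_path: Literal["plan", "tool_calling_loop"] = (
        "tool_calling_loop" if settings.USE_TOOL_CALLING_LOOP else "plan"
    )
    # Hypothetical condition standing in for the real routing checks.
    if state.get("needs_tools", False):
        return tool_path
    return "static_answer"
```

Both target nodes stay registered in the same graph; only the string returned by the router changes.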

State

feature/production-baseline-comparison pushed; ~44 commits ahead of main now.
Phase 3 fully done (code + parity check). USE_TOOL_CALLING_LOOP=false still the default in production.
No PR yet. Next: open the PR; start Phase 6 privacy investigation in parallel.

2026-04-23 — Launch Phase 3 (tool-calling loop) code lands

Executed the Phase 3 plan subagent-driven. Code is on feature/production-baseline-comparison (origin), behind the USE_TOOL_CALLING_LOOP=false default. Only remaining Phase 3 item is Task 7 — the eval-parity check.

What shipped — 12 commits, `cad8ee1..4272a78`

Commit	What
`4a9d9d2`	`USE_TOOL_CALLING_LOOP` flag added to `Settings` + `.env.example`
`64b9da5`	`src/agent/prompts/tool_calling_loop.py` — `SYSTEM_IDENTITY`, `build_system_prompt`, `format_rag_matches`
`1b3dcca`	Failing tests (TDD baseline) for `tool_calling_loop_node`
`acbf91d`	Test fixture correction: use real `RAGMatch` fields (`entity_id`, `similarity_score`, `domain`) rather than plan's invented `source`/`score`
`f8a5278`	`src/agent/nodes/tool_calling_loop.py` + `create_mcp_tools_from_catalog` helper in `domains/tools.py`
`b4ac03f`	`format_rag_matches` field-name fix (same plan bug as `acbf91d`, different file)
`251def1`	Back-fill `state["tool_results"]` from ToolMessages — missing requirement in plan; without it, Phase 7 eval scorer would have seen "no tools used" for every loop response
`b8b9153`	Graph routing: register node, extend `route_after_rag` Literal, flip all 4 `return "plan"` sites via shared `tool_path` local, short-circuit `route_by_classification` to `rag_answer` when flag is on
`de637d1`	Module-docstring deprecation blocks on `plan/execute/evaluate/recover/synthesize.py`
`617f66c`	Function-level deprecation comments on `_resolve_parameters` / `_resolve_reference` in `execute.py` (the `$step_N` resolver — retires with the legacy path)
`92ae941`	Pre-Phase-7 hardening: `final_answer=None` when loop emits no text (not `""`), `GraphRecursionError` caught with graceful apology, orphan ToolMessages counted in span
`4272a78`	Plan-doc reconciliation — 12 drift items fixed in-place, plus an "Implementation notes — deviations from plan" section at the end

Plan-vs-reality mismatches fixed mid-execution

Andrew's plan was authored before the implementation started; several snippets referenced APIs and field names that differed from reality. Corrections applied by implementer subagents with explicit callouts in their prompts:

MCPToolWrapper field names: real API is mcp_client= / tool_server=, not client= / server=.
_build_args_schema(name, parameters) takes two positional args, not a dict.
RAGMatch fields are id/question/answer/domain/entity_id/similarity_score/metadata — the plan's source and score don't exist.
Graph factory is create_agent_graph, not build_graph; compiled-graph introspection is graph.get_graph().nodes.
route_after_rag has four return "plan" sites (disabled-domain fallback, combined/dynamic with matches, combined/dynamic without matches, static-deflection, static-no-match), not the two the plan claimed. Shipped code uses a shared tool_path: Literal["plan", "tool_calling_loop"] local to flip them uniformly.
AIMessage.content is typed str | list[str | dict] — needs narrowing before binding to final_answer.
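
A sketch of the narrowing step for the last item, assuming list content is made of plain strings and dict blocks that carry text under `{"type": "text", "text": ...}`. That block shape and the helper name `narrow_content` are assumptions, not taken from the shipped code. Returning `None` for empty output mirrors the `final_answer=None` hardening in `92ae941`.

```python
from typing import List, Optional, Union


def narrow_content(content: Union[str, List[Union[str, dict]]]) -> Optional[str]:
    """Collapse message content to a single string, or None if there is no text."""
    if isinstance(content, str):
        text = content
    else:
        parts = []
        for block in content:
            if isinstance(block, str):
                parts.append(block)
            # Assumption: text-bearing dict blocks use type == "text".
            elif isinstance(block, dict) and block.get("type") == "text":
                parts.append(str(block.get("text", "")))
        text = "".join(parts)
    # No visible text means "no answer", not an empty-string answer.
    return text if text.strip() else None
```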

Design call worth flagging to Andrew

The new path skips synthesize.py entirely — the loop's LLM writes its own final answer inline. Citations are handled by a paragraph in the system prompt asking the LLM to cite sources. This is architecturally clean and matches the plan, but it trades deterministic citation generation (what synthesize.py was doing in its 750 lines) for LLM-prompt-instruction compliance. If Phase 7 eval comparison shows citation regressions, a thin post-loop citation node is a clean retrofit (tool_calling_loop → citation_pass → END) — the loop's state already carries everything that node would need. Draft note to Andrew is in memory.

Test posture

tests/test_tool_calling_loop.py — 18 tests (8 core scenarios + 1 tool_results back-fill + 6 routing/graph-structure + 3 hardening). All green.
Broader non-e2e/classify suite: 230 passed, 1 skipped. No regressions.
Pre-commit (ruff + mypy + gitleaks) clean on every commit; no --no-verify anywhere.

State

feature/production-baseline-comparison pushed to origin at 4272a78, ~35 commits ahead of main.
USE_TOOL_CALLING_LOOP=false default, so agent-prod behavior is identical; staging will default the flag to true once Phase 2 lands.
No PR opened. Plan is: run Phase 3 eval-parity check first (Task 7), then PR.

Phase 3 eval-parity check (Task 7). Favoring a targeted smoke slice (20-50 questions covering static-confident, static-deflection, combined/dynamic, multi-tool chains, tool failures) over the full 900-pair battery for initial regression detection — expand only if smoke shows a regression in a specific category.
Open the PR and send Andrew the synthesis-skip / citations note alongside it.
Start Phase 6 privacy investigation in parallel when a coding-session break is welcome. Phase 3 added six new span attributes on the loop node; audit those specifically.

2026-04-21 (afternoon) — Launch-hardening plan lands; sync with Andrew; new focus is Project M

Separate from the morning's grand-prix report work (see the next entry). Afternoon was receiving Andrew's new plans, syncing with him, and re-orienting the work plan around them.

What Andrew pushed

Four coordinated Drupal commits ~10:48–10:52 ET, shipping the Resource Documentation API v1.0:

Repo	Branch	Commit
`necyberteam/Operations_Drupal_Feed_Cider`	`main`	`aec3bd7` — inheritance-aware resource API with versioned paths
`necyberteam/access`	`3.0.x`	`8c9077f` — Swagger docs for the API
`necyberteam/aspTheme`	`main`	`526936e` — theme uses the new inheritance service
`necyberteam/cyberteam_drupal`	`main`	`4edfb4e` — Cypress tests for API versioning + inheritance

URL paths moved: /api/resources → /api/1.0/resources, /api/resource-groups → /api/1.0/resource-groups. List endpoint now filters by documented=true by default. Detail endpoint now applies Resource-Group inheritance server-side for 15 inheritable fields (login text, SSH nodes, file transfer, storage, support links, office hours, software list, etc. — compute-specific fields stay per-resource). ssh_logins sub-object shape changed. New scalar fields surfaced: login_text, file_transfer_text, jobs_info, software_list_url.

Impact on access-agent: the default DRUPAL_RESOURCE_GROUPS_URL in src/config.py and docker-compose.yml points at the old unversioned path. Not urgent — production support.access-ci.org may not have deployed v1.0 yet. Verify before touching.

Six new planning docs on access-agent/main as commit 14a578c ~10:55 ET, under docs/superpowers/:

plans/2026-04-21-production-launch-umbrella.md — 9-phase umbrella tracker.
specs/2026-04-21-production-launch-hardening-design.md — primary launch spec.
plans/2026-04-21-launch-phase-0-deps-upgrade.md — langgraph 0.2→1.x upgrade.
plans/2026-04-21-launch-phase-1-safety-audit.md — READ_ONLY guard + write-capability audit doc.
specs/2026-04-21-eval-rubric-ground-truth-design.md — parallel track: judge gets authored required_facts from Argilla.
plans/2026-04-21-eval-rubric-ground-truth.md — rubric implementation plan.

The launch spec reframes what "production launch" means. It's no longer "flip the current agent into prod" — it's "stand up a new architecture (native tool-calling loop replacing plan/execute/evaluate/recover, UKY /retrieve chunks replacing paragraph responses, UKY-hosted vLLM replacing OpenAI), prove it side-by-side against current prod on a real staging environment, leadership-gate, flip." Decision 007's framing in access-qa-planning is effectively superseded — the Apr 21 grand-prix report delivered the evidence 007 asked for; the new spec is the governing plan.

Sync with Andrew (12:00 ET)

Key decisions out of the call:

Keep working on feature/production-baseline-comparison. Andrew: "turn this branch that you're on now into the branch that's gonna incorporate the ground truth." Not retiring it; it becomes the home for M.0/M.1 exploration and the parallel rubric work. Andrew isn't doing any parallel access-agent work, so no race-to-merge pressure.
Start with the production/launch work, not the rubric work. Phase 3 is partly a hypothesis test — Andrew: "we think that the new Frontier models are going to be able to handle the tool calling without having all these extra steps in there, but, like, we haven't tested that." Earlier validation is better.
Rubric (ground-truth) is explicitly not a launch blocker. Andrew: "I think people can just look in Argilla and... see the difference for themselves... it's pretty obvious that having MCP is better than not." The formal quality bar (N spot-checked, X dissent) still stands on paper; his baseline posture is that the story is already essentially proven. Rubric work is the in-between-waits plate.
Argilla role clarified. Argilla is "where you go to dig in more to particular answers, and also a potential source for... human answer, verified answer capability." Not the side-by-side comparison UX — that stays in the HTML grand-prix report.
New scope surfaced in conversation: live bot traffic → Argilla pipeline. Andrew: "I guess I would just add that into the plan. Somewhere. You can just sort of slot it in wherever you think is, like, the right place for them." No home yet in the umbrella plan; Joe owns placing it.
Argilla housekeeping: push the grand-prix eval data from Apr 21 to Argilla; take down the older run there. One-second operation.
Known small bugs in the grand-prix report — not fixing now because the rubric work will touch them again:
- Compare-judge narrative leaks bare A/B letters in prose that the relabel() helper from fc0803a doesn't catch (it handles "System A"/"System B" but not single letters in generated text).
- At least one factually-wrong judge verdict (GPU-allocations question: agent returned project counts correctly, raw RAG returned unrelated GPU specs, judge preferred raw RAG). Concrete on-hand argument for the rubric work.
Timing: Andrew is "anxious to get this done" but "not expecting it to be done this week." Vikram floated end-of-week for /retrieve; firmness unclear. Stalls likely at /retrieve and at vLLM hosting.

State

feature/production-baseline-comparison at 2139613 on origin. Continues as the working trunk.
access-qa-planning main at 2bfa6c3 (Apr 8 commit — unchanged since). Decision 007 still present; effectively superseded by the new launch spec but not yet annotated.
access-agent main at 14a578c with the six new planning docs. No code changes today.
access-mcp fix/search-events-webinar-guidance at 15c74cb — unchanged, still no PR. Less load-bearing under the new architecture (native tool-calling retries differently) but the fix is still real.
Synthesis doc at access-ci/EVAL_REVAMP_AND_PROD.md — local reference, not gist-synced, summarizes both the new plans and the 2026-04-21 meeting outcomes.

Work off Andrew's plans directly. Start M.0 per access-agent/docs/superpowers/plans/2026-04-21-launch-phase-0-deps-upgrade.md; phase order in the umbrella plan. Argilla push and live-to-Argilla stub both dropped — the push because the data's about to be superseded by new-architecture runs, the stub because access-qa-planning/active/03-review-system.md Phase 5 already covers that ground. Phase 2 (staging) infra decisions added to FEB_MARCH_PLAN.md under "Questions for Andrew."

2026-04-21 — Grand-prix report: scientist-first framing, slot abstraction, client-side search

Template-only iteration on the grand-prix HTML comparison report. No eval re-runs, no judge changes, no agent changes — all four commits touch only src/eval/html_report/. Rendered from the four compare-judge JSON artifacts generated on 2026-04-20. Live at https://access-ci-reports.netlify.app/grand-prix-20260421.html.

Framing: compare-judge is the story, per-answer composites are gone

The report previously led with per-answer judge composites (A 4.79 / R 4.65 / Δ +0.14) in each row header. Decision this session: remove them entirely, not just de-emphasize. The per-answer judge's known calibration weakness (rewards generic-but-truthful answers as highly as specific-and-correct ones) meant those numbers were actively misleading at a glance. They stay in the bundle JSON as raw data but render nowhere — no row header, no row body, no battery rollup, no run summary.

Verdict labels also dropped the decisively / narrowly margin qualifiers. Those are AI-estimated confidence, not evidence a reader can check, so they shouldn't leak into visible copy. Margin still lives in the data for the "Most decisive first" sort option.

Scientist-first layout

Stakeholders (Jim, Vikram, Shelly) need "agent wins, here's where and why" — but researchers also need to form their own opinion from the raw answers, not have the AI's opinion projected onto every scannable row.

AI run-level analysis moved from the top of the report to the end, under "Analysis by AI Comparison Judge — per battery", with a provenance paragraph stating outright that these are LLM-written summaries, not ground truth.
Per-question AI opinion nested behind a second disclosure. First disclosure opens a row to reveal question + execution trace + both answers side-by-side. A second <details> at the bottom, labeled "AI comparison judge's opinion on this question", reveals the verdict pill + why. One click to see the evidence; two clicks to see the judge's opinion on any one question.

Slot abstraction

Foggy Notion F from OY_VEY_2.md (repeatable A/B comparison infrastructure for agent variants) nudged a small refactor: the template no longer hard-codes "Agent" / "Raw RAG" / "agent" / "raw" CSS class decisions. Every bundle now carries systems: {A, B} where A = baseline slot (purple palette) and B = candidate slot (teal palette). Labels come from a SYSTEM_LABELS registry in notes.py (raw_rag → "Raw RAG", agent_full → "Agent"; unknown IDs fall back to the raw ID). A relabel() helper in the template rewrites the compare-judge's "System A" / "System B" narrative phrasing to the configured label.

Future agent_v2 vs agent_v3 comparisons can swap in just by adding entries to SYSTEM_LABELS — no template edits needed. Slot visuals (purple = baseline, teal = candidate) stay stable across system swaps.
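
A rough Python rendering of the label registry and the relabel step. The two registry entries and the fall-back-to-raw-ID behavior are from the paragraph above; the real `relabel()` lives in the template, so this function body, the `label_for` helper, and the `systems` example values are an illustrative approximation only.

```python
SYSTEM_LABELS = {"raw_rag": "Raw RAG", "agent_full": "Agent"}


def label_for(system_id: str) -> str:
    """Unknown system IDs fall back to the raw ID."""
    return SYSTEM_LABELS.get(system_id, system_id)


def relabel(text: str, systems: dict) -> str:
    """Rewrite 'System A' / 'System B' phrasing to the configured slot labels."""
    for slot in ("A", "B"):
        text = text.replace(f"System {slot}", label_for(systems[slot]))
    return text


# Example bundle slots: A = baseline, B = candidate.
systems = {"A": "raw_rag", "B": "agent_full"}
```

A bare "A" or "B" with no "System" prefix passes through this version unchanged.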

Filter UX

Filter bar rebuilt:

Battery and Sort switched from button rows to <select> dropdowns — far less horizontal space, more obvious as filter UI at a glance.
New client-side search input. Plain text does case-insensitive substring match against question text + qid. Wrapping in /.../flags switches to regex (e.g. /outage|down/i). Invalid regex leaves all rows visible and marks the input red with an "invalid regex" hint, so the view stays usable mid-edit. A small pill beside the input shows substring or regex so the matching mode is obvious. Fully client-side — bundle JSON already has all question text, no server round trips.
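
The matching rules can be sketched as follows. The shipped search runs client-side in the report template; this Python version only illustrates the mode selection and the invalid-regex fallback. The name `build_matcher` and the handling of only the `i` flag are assumptions.

```python
import re
from typing import Callable, Tuple


def build_matcher(raw: str) -> Tuple[str, Callable[[str], bool]]:
    """Return (mode, predicate) for a search box value."""
    wrapped = re.fullmatch(r"/(.+)/([a-z]*)", raw, flags=re.DOTALL)
    if wrapped:
        pattern, flag_chars = wrapped.groups()
        flags = re.IGNORECASE if "i" in flag_chars else 0
        try:
            compiled = re.compile(pattern, flags)
        except re.error:
            # Invalid regex: keep every row visible so the view stays usable.
            return "invalid regex", lambda _text: True
        return "regex", lambda text: compiled.search(text) is not None
    needle = raw.lower()
    # Plain text: case-insensitive substring match.
    return "substring", lambda text: needle in text.lower()
```

The predicate would be applied to each row's question text plus qid, and the mode string is what the pill beside the input would display.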

Commits on `feature/production-baseline-comparison`

Commit	Summary
`fc0803a`	Lead with compare-judge verdicts, hide per-answer scores, slot abstraction
`2560750`	Scientist-first framing — hide AI verdicts behind disclosures
`2139613`	Client-side search field (substring or /regex/)

spike/grand-prix-subcommand fast-forwarded into feature/production-baseline-comparison (5fd63c5 → 2139613) and pushed. Spike branch pruned locally.

State

feature/production-baseline-comparison at 2139613 on origin.
Published report at https://access-ci-reports.netlify.app/grand-prix-20260421.html.
comparisons/grand_prix_20260420_161753_*.json untouched — these four artifacts are the data spine for the report.
notes.py observation and subtitle copy updated to match the new framing (no longer references "composite numbers").

Return to the foggy notions in OY_VEY_2.md now that the grand-prix output is readable enough to reason from.
Decide whether the per-answer judge stays (Foggy C — feed it reference answers) or gets deprecated in place (Foggy D — pairwise + humans only). This report's framing de facto takes option D's stance; committing to it is a separate conversation.
Parked synthesis spike (spike/synthesis-empty-tool-defers-to-rag) still unmerged pending a query-class-aware approach.

2026-04-20 — Events MCP: `search_events` scope investigated and fixed

Follow-up to the earlier mcp-cov-010 diagnosis. Took the "investigate the events MCP" item from the Next list and closed it.

Investigation

search_events lives in access-mcp/packages/events/src/server.ts. The tool is a thin proxy to the Drupal view at https://support.access-ci.org/api/2.3/events: query is passed literally to the search_api_fulltext param, type/tags/skill become faceted filters (f[0]=custom_event_type:X). No client-side matching logic.
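
To make the proxying concrete, here is a Python sketch of the URL the tool would end up requesting. The real tool is TypeScript in `server.ts`; the function name is hypothetical, and only the `type` facet is shown because the facet field names for `tags` and `skill` are not recorded in this entry.

```python
from typing import Optional
from urllib.parse import urlencode

EVENTS_API = "https://support.access-ci.org/api/2.3/events"


def build_events_url(query: Optional[str] = None, event_type: Optional[str] = None) -> str:
    """Map tool arguments onto the Drupal view's query parameters."""
    params = []
    if query:
        # Passed through literally; no client-side matching.
        params.append(("search_api_fulltext", query))
    if event_type:
        # Faceted filter, e.g. f[0]=custom_event_type:Training
        params.append(("f[0]", f"custom_event_type:{event_type}"))
    return f"{EVENTS_API}?{urlencode(params)}" if params else EVENTS_API
```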

Hit the backing API directly. Of 106 upcoming events, the custom_event_type vocabulary is Office Hours (80), Training (14), Conference (7), Other (5). No event is typed webinar anywhere in the corpus, and only 2 upcoming events mention the word in their description (neither in title). "Anvil Support Hour" and "Sage Office Hours" are both present and recurring April–June 2026.

So ?search_api_fulltext=webinar returns [] correctly given the data — the data just doesn't use that word. Not an MCP bug, not a data-population problem.

Diagnosis

Hybrid of two things:

Tool description misled the LLM. The existing description advertised the server as searching "workshops, webinars, training" and listed webinar in the type param as a common value. An LLM reading that schema naturally composed {query: "webinar"}.
Planner over-narrowed. A generic ask ("any webinars coming up?") was translated to a keyword filter when dropping query or filtering by type would have been correct.

Fix (`fix/search-events-webinar-guidance` on `bacalj/access-mcp`)

One commit, 37 lines touched, events package only. Two changes:

Pre-call: rewrote the tool description to describe the actual corpus, constrained type to an enum (Training | Office Hours | Conference | Other), and told callers not to use generic event-category words as query.
Post-call: when query returns 0 items but the corpus has upcoming events, the response now includes a note field explaining the miss and telling the caller to retry without query. Graceful degradation without silently substituting unrelated events.
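
The post-call behavior, sketched in Python for illustration (the fix itself is in the TypeScript events package). The `{total, items}` response shape matches what the tool returned in the mcp-cov-010 trace; the note wording, the `upcoming_total` argument, and the function name are assumptions.

```python
from typing import List, Optional

# Enum the type parameter was constrained to in the fix.
EVENT_TYPES = ("Training", "Office Hours", "Conference", "Other")


def shape_response(items: List[dict], query: Optional[str], upcoming_total: int) -> dict:
    """Attach an explanatory note when a keyword query misses a non-empty corpus."""
    response = {"total": len(items), "items": items}
    if query and not items and upcoming_total > 0:
        # Hypothetical wording; the point is to explain the miss and suggest
        # a retry without query, not to substitute unrelated events.
        response["note"] = (
            f"No events matched the query '{query}', but {upcoming_total} upcoming "
            "events exist. Retry without query, or filter by type: "
            + ", ".join(EVENT_TYPES)
        )
    return response
```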

Events package builds clean. Pre-existing TS/test-infra breakage on main (shared package build errors, missing @opentelemetry/sdk-node dep) is unrelated to this change — same failures appear with changes stashed. Branch pushed, no PR opened yet.

State

Branch: fix/search-events-webinar-guidance on bacalj/access-mcp, one commit (15c74cb), pushed.
Planner-side change in access-agent deliberately deferred: the schema + note should steer the LLM on their own. Re-run of mcp-cov-010 against the rebuilt MCP will tell us whether any agent-side retry-on-note logic is still needed.

Rebuild MCP and re-run mcp-cov-010 (or the full mcp_coverage battery) via the grand-prix / HTML report routine. Handed to a fresh session with richer grand-prix context.
Based on that result, either open a PR as-is or layer a retry-on-note change into access-agent as a branch off the current feature/production-baseline-comparison.

2026-04-20 — Judge calibration, rejudge, comparison-judge, HTML narrative spine

Worked on two ends of the eval pipeline: the per-answer judge, and a new comparison-judge stage that sits alongside it. Outcome: the agent-vs-raw signal, previously a statistical tie on mcp_coverage, now shows the agent winning by a clear margin under the improved judge.

Diagnosis of mcp-cov-010 ("Are there any ACCESS webinars coming up?")

Pulled the failing case from Postgres and traced through it. Four layers compounded:

Synthesis prompt bug — COMBINED_SYNTHESIS_PROMPT told the agent that an empty tool result was "the authoritative answer," even using the webinar phrasing verbatim as its example. So when search_events returned {total: 0, items: []}, the agent said "no webinars" despite RAG matches listing Anvil Support Hour and Sage Office Hours.
Evaluate node echoed the bias — concluded is_helpful=true on the empty tool result.
MCP scope may be narrow — raw UKY's RAG had the office/support hour data; search_events did not. Possibly a matching-too-narrowly bug on the MCP side. Not yet investigated.
No graph-level empty-tool fallback — when a tool returns empty and RAG has substance, the graph still goes to combined synthesis rather than routing to RAG-only.

The synthesis prompt's rule is also directly in tension with judge commit f6f3238, which had already established the opposite rule on the judge side.

Synthesis spike — parked

Branch spike/synthesis-empty-tool-defers-to-rag: flipped the COMBINED_SYNTHESIS_PROMPT rule wholesale (empty tool = absence, defer to RAG when RAG has substance). Left unmerged because the blanket flip trades one failure mode for another — time-sensitive queries (e.g. "are there current outages?") legitimately want the empty tool to override stale RAG. A proper fix needs query-class awareness, which is a larger spike.

Judge improvements — merged to `feature/production-baseline-comparison`

Cut spike branch spike/judge-preamble-and-richer-context off the clean feature base (no synthesis changes in it), fast-forward-merged back in once tests passed. Eight commits:

Commit	Summary
`5d7afc7`	`ToolResult.arguments` captured agent-side
`e1c4237`	Mission preamble + structured tool-call context in judge prompt
`d01b216`	`rejudge` subcommand — re-score existing runs, no system re-call
`7148284`	`argilla-push` guard against rejudge overwrites
`853bf8a`	`compare-judge` subcommand — head-to-head LLM analysis, JSON output
`117422f`	Self-contained compare-judge JSON + JSON-backed HTML path
`58ac7ba`	Template renders compare-judge narrative (run-level + per-question)
`5fd63c5`	`.gitignore comparisons/`

Rejudge result on mcp_coverage (21 questions)

	Old judge	New judge
agent_full composite	4.74	4.71
raw_rag composite	4.75	4.44
Agent vs raw margin	−0.01 (tie)	+0.27 (agent wins)

The mission preamble took raw's completeness on this battery from ~4.8 to 3.33 — the judge now correctly recognizes "generic advice when specific data was called for" as incomplete. Agent scores essentially unchanged. That's the ideal calibration outcome.

On mcp-cov-010 specifically, the new judge still scored agent 2.65 (agent genuinely failed — no judge change was going to rescue "no webinars"). Raw dropped 5.00 → 4.75 (nudged down for calling "Anvil Support Hour" a webinar). The comparison judge's verdict on that question was A-wins-large with the per-answer-judge-note "Agree."

Compare-judge + HTML narrative spine

New first-class stage in the eval pipeline: compare-judge reads a pair of runs from Postgres (read-only), calls the same OpenAI judge, and writes a self-contained JSON artifact with per-question verdicts + run-level summary. Zero DB writes, zero schema changes.

comparisons/ is gitignored. The artifact becomes the narrative spine for the HTML report: html --from-json <paths...> renders from JSON alone (no Postgres needed at render time). The template grew two new sections (run-level verdict per battery + per-question verdict inside each expanded row), both gated on compare-judge data being present — Postgres-backed reports with no comparisons render unchanged.

End-to-end run on the rejudged mcp_coverage pair produced an 859 KB self-contained JSON and a 113 KB rendered HTML. Report at ~/.agent/diagrams/mcp_coverage_from_json.html.

State

feature/production-baseline-comparison pushed to origin at 5fd63c5.
spike/synthesis-empty-tool-defers-to-rag still present locally, unmerged, parked pending a query-class-aware approach.
OY_VEY.md written at the access-ci root earlier as a reorientation doc (not gist-synced, may delete when it stops being useful).

Investigate the events MCP (access-mcp) to determine whether search_events is too narrow by design or accident, and whether the backing data actually contains the office/support-hour records. Durable fix might live there rather than in the agent.
Run compare-judge on the other three battery pairs (friendly, real_user, combined) for a full grand-prix HTML.
Decide what to do with the parked synthesis spike.

2026-04-09 — Project H: Turnstile proxy for NAIRR (built, deployed, working)

Possible next step: proxy for ACCESS (pending Andrew confirmation)

Andrew proposed routing ACCESS bot traffic through qa-bot-proxy too, as a stopgap while the agent is still being evaluated. The agent is not deployed to production — ACCESS bot currently hits UKY directly via the hardcoded default in access-qa-bot/src/config/constants.ts (QA_ENDPOINT).

If confirmed, the changes would be:

In qa-bot-proxy, add an "access" backend ID pointing at the UKY URL
In access-qa-bot, point qaEndpoint at the proxy and set backendId: 'access'
Verify qa-bot-core sends + resets the Turnstile token on every request (not just the first) — the stateless proxy validates every request, unlike access-agent which marks sessions as verified. The turnstile.reset() fix in 0.2.35 was built for NAIRR; need to confirm it applies to ACCESS config too.

Waiting on Andrew to confirm tomorrow (2026-04-10).

Architecture

Two new repos:

qa-bot-proxy (necyberteam/qa-bot-proxy) — Netlify serverless function that validates Cloudflare Turnstile tokens and forwards requests to backends resolved from ALLOWED_BACKENDS env var. Client sends _backend ID (e.g. "nairr"), never a URL. CORS support with origin reflection and credentials. Deployed at qa-bot-proxy.netlify.app. 16 tests.
nairr-bot (necyberteam/nairr-bot) — Existing Netlify-hosted static site, updated to use the proxy. Points qaEndpoint at the external proxy URL (cross-origin), with backendId: 'nairr'. Shows git commit hash in bottom-left corner for deploy verification.
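
The "ID, never a URL" rule can be sketched like this. The proxy is a Netlify function, so this Python is illustrative only; the format of `ALLOWED_BACKENDS` is not recorded here, and a JSON object mapping ID to URL is an assumption, as are the function name and the example URL.

```python
import json
from typing import Optional


def resolve_backend(body: dict, allowed_backends_env: str) -> Optional[str]:
    """Look up the forwarding URL for the client's _backend ID, or None."""
    try:
        # Assumption: the env var holds a JSON object of {backend_id: url}.
        allowed = json.loads(allowed_backends_env or "{}")
    except json.JSONDecodeError:
        return None
    backend_id = body.get("_backend")
    if not isinstance(allowed, dict) or not isinstance(backend_id, str):
        return None
    # Only IDs present in the allow-list resolve; a URL sent as an ID does not.
    return allowed.get(backend_id)
```

Keeping the URL server-side means a client cannot point the proxy at an arbitrary host.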

qa-bot-core changes (0.2.33 → 0.2.35, three releases)

0.2.33: New optional backendId prop — included as _backend in request body for proxy routing. lib.tsx refactored to derive types from QABotProps and spread props (future props flow automatically). Fixed missing X-Session-ID/X-Query-ID headers on Turnstile resubmit.

0.2.34: Removed 5-second Turnstile timeout that was killing the visible widget before it could complete. Fallback "log in" link now conditional — only shown when loginUrl is a real URL (not default /login), so NAIRR deployments without login don't show a broken link.

0.2.35: Reset Turnstile widget after each successful request using Cloudflare's recommended turnstile.reset() API. Turnstile tokens are single-use — without reset, the second question sends a burned token and triggers a "one moment" loop. ACCESS never hit this because access-agent marks sessions as verified after one token; the stateless proxy validates every request.

Turnstile key configuration

Both NAIRR and ACCESS Cloudflare keys changed from invisible to managed mode. Managed does invisible when it can, shows a visible checkbox when Cloudflare deems the user suspicious. No code changes needed — qa-bot-core's TurnstileWidget component renders whatever mode the key dictates.

CORS discovery

nairr-bot (separate Netlify site) calls qa-bot-proxy (different Netlify site) cross-origin. Required: reflecting the request Origin header instead of Access-Control-Allow-Origin: *, plus Access-Control-Allow-Credentials: true, because qa-bot-core sends credentials: 'include' on all fetches.
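
A sketch of the header logic, in Python for illustration. Browsers reject a wildcard `Access-Control-Allow-Origin` on credentialed requests, which is why the origin has to be echoed back. The allow-list check and the `Vary: Origin` header are additions of this sketch, not confirmed details of the proxy, and the function name is hypothetical.

```python
from typing import Optional, Set


def cors_headers(request_origin: Optional[str], allowed_origins: Set[str]) -> dict:
    """Reflect the caller's Origin when it is permitted; otherwise send nothing."""
    if not request_origin or request_origin not in allowed_origins:
        return {}
    return {
        # Must be the exact origin, not "*", when credentials are included.
        "Access-Control-Allow-Origin": request_origin,
        "Access-Control-Allow-Credentials": "true",
        # Tell caches the response varies by requesting origin.
        "Vary": "Origin",
    }
```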

access-qa-bot 3.5.1 → 3.5.2

Bumped to pick up qa-bot-core 0.2.35. No code changes, just a dependency update.

ACCESS stack tested

Updated access-ci-ui → Drupal with 0.2.35. Bot still works, no regressions. ACCESS doesn't use the proxy or backendId.

Relationship to access-agent's built-in Turnstile

access-agent keeps its own Turnstile (F.1). The proxy is a separate validation path for deployments that don't use access-agent. Future option: ACCESS could route through the proxy too, but no reason to now.

2026-04-08 — F.4 merge + publish, personalization Phase 1–2

F.4 resource scoping: merged and published

All 5 F.4 PRs merged:

access-agent#13, qa-bot-core#13, access-qa-bot#8, access-ci-ui#76, access#397
access-ci-ui#76 had a merge conflict with upstream a11y/release-please changes (new qaEndpoint/ratingEndpoint props) — resolved by keeping both sets of props.
access-ci-ui#76 was merged without maintainer review (Matt) — sent a follow-up email explaining the change.

Published stable versions:

@snf/qa-bot-core@0.2.32 (npm + GitHub release)
@snf/access-qa-bot@3.5.0 (npm + GitHub release)
access-ci-ui dependency updated from rc to ^3.5.0 on main.

Andrew pushed a refactor to access-agent before merge: RPSectionCache now fetches from a new /api/resource-groups Drupal endpoint (single call, pre-aggregated populated_sections) instead of the old /api/resources + per-resource detail calls. Also added uky_in_scope boolean propagation through the RAG answer pipeline and new tests.

Personalization: Phase 1 + Phase 2 (agent-side)

Researched the personalization spec (access-qa-planning 09-researcher-profiles.md, 11-capability-registry.md). Determined that Phases 1 and 2 are the current scope — Phases 3–5 are explicitly marked "(Future)" in the spec. Phase 1 builds the data endpoint; Phase 2 makes the agent use it.

Phase 1 — /capabilities/personalized endpoint:

New DrupalProfileFetcher service (src/services/drupal_profile.py) that calls Drupal JSON:API for user data: active allocations (field_cider_resources), affinity groups + coordinator status (mcp_my_affinity_groups view), institution, HPC experience.
Fetches user entity fields and affinity groups in parallel; tolerates partial failures.
Per-user in-memory cache with 5-min TTL (PROFILE_CACHE_TTL_SECONDS).
GET /api/v1/capabilities/personalized requires JWT auth, returns user, highlighted_capabilities, and context.
highlighted_capabilities derived from profile: coordinators get "Manage [group] announcements", users with allocations get "Check your usage on [resources]".
Config: DRUPAL_BASE_URL (default https://support.access-ci.org).
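
A minimal sketch of a per-user in-memory cache with a 5-minute TTL. `PROFILE_CACHE_TTL_SECONDS` is the setting named above; the `ProfileCache` class, its methods, and the injectable clock are hypothetical and not necessarily how `drupal_profile.py` does it.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

PROFILE_CACHE_TTL_SECONDS = 300  # 5 minutes


class ProfileCache:
    """Per-user cache; entries expire once they are ttl seconds old."""

    def __init__(
        self,
        ttl: float = PROFILE_CACHE_TTL_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, user_id: str) -> Optional[Any]:
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        stored_at, profile = entry
        if self._clock() - stored_at >= self._ttl:
            # Expired: drop it so the caller refetches from Drupal.
            del self._entries[user_id]
            return None
        return profile

    def put(self, user_id: str, profile: Any) -> None:
        self._entries[user_id] = (self._clock(), profile)
```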

Phase 2 — System prompt injection + personalized discovery:

personalization_context field added to AgentState, threaded through create_initial_state → stream_agent → run_agent.
UserProfile.to_system_prompt_section() formats profile as a ## USER CONTEXT block (name, institution, HPC experience, skills, interests, affinity groups, allocations).
All three synthesis prompts (tools-only, combined, RAG-only) now include {personalization} placeholder.
Route layer fetches profile before capability discovery, passes both personalization_context (for agent prompts) and highlighted_capabilities (for discovery response).
"Show my options" response now includes personalized highlights at the top when available.

Not yet tested against live Drupal data. The JSON:API field shapes for field_cider_resources, field_institution, field_hpc_experience, and the mcp_my_affinity_groups view may need adjustment once we hit real responses.
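
For orientation, a sketch of what `to_system_prompt_section()` could look like. The `## USER CONTEXT` heading and the list of profile fields come from the Phase 2 notes; the dataclass layout, line format, and the choice to emit nothing for an empty profile are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UserProfile:
    name: Optional[str] = None
    institution: Optional[str] = None
    hpc_experience: Optional[str] = None
    skills: List[str] = field(default_factory=list)
    interests: List[str] = field(default_factory=list)
    affinity_groups: List[str] = field(default_factory=list)
    allocations: List[str] = field(default_factory=list)

    def to_system_prompt_section(self) -> str:
        """Format known fields as a prompt block; skip anything missing."""
        fields = [
            ("Name", self.name),
            ("Institution", self.institution),
            ("HPC experience", self.hpc_experience),
            ("Skills", ", ".join(self.skills)),
            ("Interests", ", ".join(self.interests)),
            ("Affinity groups", ", ".join(self.affinity_groups)),
            ("Allocations", ", ".join(self.allocations)),
        ]
        lines = [f"{label}: {value}" for label, value in fields if value]
        if not lines:
            return ""  # nothing to inject for an empty profile
        return "## USER CONTEXT\n" + "\n".join(lines)
```

Skipping empty fields keeps the block tolerant of the partial Drupal failures mentioned in Phase 1.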

All changes on feature/personalization-phase-1-2 branch in access-agent (3 commits, not pushed). 14 new tests, all passing.

2026-04-07 (session 2) — Welcome message for scoped capabilities

Added welcome_message to the scoped capabilities response. The message is built dynamically from the resource's populated sections (e.g., "Hi! I can help with login, file transfer, storage, job submission, software, and datasets on Delta — or ask me anything about ACCESS."). Resources with no populated sections get a simpler fallback.

Changes across 3 repos (all on feature/resource-scoping):

access-agent: get_by_category_scoped() now returns welcome_message built from SECTION_QUESTION_MAP labels. Handles 1, 2, and 3+ section cases for natural English.
access-qa-bot: Added welcome_message?: string to CapabilitiesResponse type. Welcome message priority: explicit welcome prop > capabilities.welcome_message > BOT_CONFIG.WELCOME_MESSAGE. Published as @snf/access-qa-bot@3.5.0-rc.2.
access-ci-ui: Removed hardcoded "Welcome to ACCESS Q&A Bot!" default that was blocking the capabilities-driven welcome message.
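
The 1 / 2 / 3+ section handling can be sketched as below. Function names and the fallback wording are hypothetical; the real get_by_category_scoped() builds its labels from SECTION_QUESTION_MAP and may phrase the message differently.

```python
from typing import List


def join_labels(labels: List[str]) -> str:
    """Join labels as natural English: 'a', 'a and b', 'a, b, and c'."""
    if not labels:
        return ""
    if len(labels) == 1:
        return labels[0]
    if len(labels) == 2:
        return f"{labels[0]} and {labels[1]}"
    return ", ".join(labels[:-1]) + f", and {labels[-1]}"


def build_welcome_message(resource_title: str, labels: List[str]) -> str:
    if not labels:
        # Assumed fallback wording for resources with no populated sections.
        return f"Hi! Ask me anything about {resource_title} or ACCESS."
    return (
        f"Hi! I can help with {join_labels(labels)} on {resource_title}, "
        "or ask me anything about ACCESS."
    )
```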

Testing notes:

Used dev seed data in rp_cache.py to work around prod Drupal rate-limiting during local testing (removed before commit).
Tested end-to-end through local Drupal: embedded bot on home page showed scoped Delta welcome message; floating bot showed default.
Discovered access-ci-ui's qa-bot.jsx had a hardcoded default that overrode everything — fixed.

2026-04-07 — F.4 resource scoping: Phase 1 agent infrastructure

Built the agent-side infrastructure for resource-scoped capabilities (F.4). All changes on feature/resource-scoping branch in access-agent.

What was built:

resource_context field threaded through the full query pipeline: QueryRequest → stream_agent() → AgentState → create_initial_state()
UKYClient.ask() now accepts rp_name — sets X-Origin header to the RP slug and includes rp_name in the request body for UKY's scoped vector DB. Added in_scope field to UKYResponse (None until UKY implements it).
RPSectionCache with hardcoded seed data for 9 resources (delta, anvil, bridges2, expanse, jetstream2, stampede3, derecho, neocortex, kyric). Singleton at src/services/rp_cache.py.
Section-to-question mapping (SECTION_QUESTION_MAP) in capabilities registry — maps login, file_transfer, storage, queue_specs, top_software, datasets to labeled capabilities with description and example_query interpolated with RP title.
GET /api/v1/capabilities?resource_context=<slug> returns RP-scoped response with resource_docs + support + analytics categories. Unknown slugs fall through to standard response.
Scoped capability discovery short-circuit: "Show my options" with resource_context returns suggestions scoped to that resource (verified with Delta).
Out-of-scope fallback in rag_answer_node: if scoped RAG response contains out-of-scope phrases, retries without rp_name for general RAG.
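
A minimal sketch of the scoped-then-general retry. The ask callable stands in for UKYClient.ask() with a simplified signature that returns plain text, and the phrase list is an assumption; the real heuristic lives in rag_answer_node and works on the full response object.

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Assumed phrase list; the real heuristic may match more phrases.
OUT_OF_SCOPE_PHRASES = (
    "no documents are currently available",
)


def looks_out_of_scope(answer: str) -> bool:
    lowered = answer.lower()
    return any(phrase in lowered for phrase in OUT_OF_SCOPE_PHRASES)


async def answer_with_fallback(
    ask: Callable[..., Awaitable[str]],
    query: str,
    rp_name: Optional[str],
) -> str:
    answer = await ask(query, rp_name=rp_name)
    # Only retry when the first call was actually scoped to an RP.
    if rp_name and looks_out_of_scope(answer):
        answer = await ask(query, rp_name=None)
    return answer
```

Phrase matching is brittle by nature. Once UKY populates in_scope, checking that field would be the more robust signal.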

Verified with curl:

?resource_context=delta → 6 section capabilities + support + analytics (matches spec exactly)
?resource_context=jetstream2 → 2 sections (login + storage) + support + analytics (sparse resource)
?resource_context=bogus → falls through to standard 5-category response
POST /query with resource_context=delta + "Show my options" → Delta-scoped discovery

Phase 2 — frontend prop plumbing (same session):

qa-bot-core: resourceContext prop on QABotProps, threaded through CreateQAFlowParams → POST body (resource_context), Turnstile resubmit body, lib.tsx programmatic API.
access-qa-bot: resourceContext on AccessQABotProps, appended as query param on capabilities fetch, passed through to QABot.
Verified via npm link workflow: [linked] marker in logger, dev server at localhost:3000 with resourceContext="delta", confirmed Delta-scoped capabilities and scoped RAG fallback.
Fixed scoped RAG fallback: UKY returns "No documents are currently available" for empty RP collections — added to out-of-scope heuristic phrases. Retry with general RAG now works.
access-ci-ui not yet updated (separate repo, needs PR to access-ci-org/access-ci-ui).

PRs:

access-agent: necyberteam/access-agent#13
qa-bot-core: necyberteam/qa-bot-core#13
access-qa-bot: necyberteam/access-qa-bot#8

Phase 3 — access-ci-ui + end-to-end Drupal testing (same session):

Published rc versions: qa-bot-core@0.2.32-rc.1, access-qa-bot@3.5.0-rc.1 (both from feature branches).
access-ci-ui: bumped access-qa-bot dep, added explicit resourceContext prop to QABot wrapper. PR to access-ci-org/access-ci-ui#76.
Built access-ci-ui, copied dist to Drupal's web/libraries/access-ci-ui/.
Restored Drupal DB from backup (backups/site.sql.gz, Aug 2025) — ddev volume had been pruned.
Discovered .embedded-qa-bot div lives in a Drupal block content body (DB), not a template file. Drupal's text format filters strip data-* attributes, so data-resource-context can't be set via block content — production will need it on the template.
Proved e2e by hardcoding resourceContext: "delta" in headerfooter.js → embedded bot showed Delta-scoped capabilities, scoped RAG with fallback to general. Floating bot remained unscoped. Two independent sessions on the same page.

PRs (ready for review):

access-agent: https://github.com/necyberteam/access-agent/pull/13
qa-bot-core: necyberteam/qa-bot-core#13
access-qa-bot: necyberteam/access-qa-bot#8
access-ci-ui: access-ci-org/access-ci-ui#76

Phase 4 — live Drupal fetch + headerfooter.js PR (same session):

Replaced hardcoded seed data in RPSectionCache with live fetch from Drupal's /api/resources list + /api/resources/{id} detail endpoints. Checks which per-section fields (ssh_logins, file_transfer, storage, etc.) have content. Refreshes on first access + every 30 min (configurable via RP_CACHE_TTL_SECONDS).
Currently all 109 resources return empty section arrays — RPs haven't entered documentation content yet. Cache correctly shows 90 resources, 0 with populated sections. Resources still get support + analytics capabilities.
headerfooter.js PR to necyberteam/access on 3.0.x: one-line change to read data-resource-context from the .embedded-qa-bot div and pass as resourceContext to qaBot().
Confirmed data-resource-context attribute already exists in production Drupal (verified on https://support.access-ci.org/node/10864 — data-resource-context="anvil" set by preprocess hook).
Local Drupal DB (Aug 2025) is too stale for current codebase — smoke test failed on unrelated schema errors. Full chain was proven earlier in the session.
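
Rough shape of the refresh-on-first-access plus TTL logic, with the "which section fields have content" check. This is a sketch under assumptions: SectionCache, the loader callable, and the three-field subset are illustrative, not the real RPSectionCache, which fetches from the Drupal list and detail endpoints.

```python
import time
from typing import Any, Callable, Dict, List, Optional

RP_CACHE_TTL_SECONDS = 1800  # 30 minutes
SECTION_FIELDS = ("ssh_logins", "file_transfer", "storage")  # subset only


def has_content(value: Any) -> bool:
    """Treat None, blank strings, and empty containers as 'no content'."""
    if value is None:
        return False
    if isinstance(value, str):
        return bool(value.strip())
    if isinstance(value, (list, dict)):
        return len(value) > 0
    return True


def populated_sections(detail: Dict[str, Any]) -> List[str]:
    return [f for f in SECTION_FIELDS if has_content(detail.get(f))]


class SectionCache:
    def __init__(
        self,
        loader: Callable[[], Dict[str, Dict[str, Any]]],
        ttl: float = RP_CACHE_TTL_SECONDS,
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader, self._ttl, self._now = loader, ttl, now
        self._data: Dict[str, List[str]] = {}
        self._loaded_at: Optional[float] = None

    def get(self, slug: str) -> List[str]:
        stale = self._loaded_at is None or self._now() - self._loaded_at >= self._ttl
        if stale:
            # loader() returns {slug: detail_dict} for every resource.
            self._data = {s: populated_sections(d) for s, d in self._loader().items()}
            self._loaded_at = self._now()
        return self._data.get(slug, [])
```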

All PRs (5 repos, all ready for review):

access-agent: https://github.com/necyberteam/access-agent/pull/13
qa-bot-core: necyberteam/qa-bot-core#13
access-qa-bot: necyberteam/access-qa-bot#8
access-ci-ui: access-ci-org/access-ci-ui#76
access (Drupal headerfooter.js): necyberteam/access#397

Phase 5 — Drupal smoke test (same session):

Got a fresh production DB via gh run download artifact + robo did. This required switching cyberteam_drupal to main and running composer install, because the fresh DB was too new for the old codebase.
headerfooter.js lives in necyberteam/access repo (branch 3.0.x), not cyberteam_drupal. Separate git repo nested at docroot/modules/custom/access/.
Production headerfooter.js imports from unpkg.com/@access-ci/ui@0.19.0 — had to swap to local /libraries/ path for testing (not committed).
Smoke test passed: both bots render, embedded bot picks up data-resource-context from the div (confirmed attribute present in fresh DB via preprocess hook), floating bot stays unscoped.
When PRs merge: access-ci-ui gets a new release, then bump the version in headerfooter.js import — same process as every access-ci-ui release.

What's next:

README updates for resourceContext prop in all 3 frontend repos (before final publish)
After PRs merge: publish stable versions (qa-bot-core 0.2.32, access-qa-bot 3.5.0), update access-ci-ui dep, bump headerfooter.js import version
Welcome message field in capabilities response (Andrew approved, spec it + build)

2026-04-06 — F.3 UX experimentation and shipping

UX approach exploration

Explored three approaches for capability discovery in the chatbot:

5 category buttons with improved canned responses — kept the existing category/capability button pattern but improved the shortcircuit text to include "try typing..." examples. Extended the shortcircuit to also match capability labels (not just category labels).
8 example query buttons — replaced category buttons with real queries ("Are there any system outages right now?", "Is Python available on Delta?") that go through the full agent pipeline. Honest buttons — clicking does the same as typing. Standard ChatGPT/Gemini pattern.
Single "Show my options" button — minimal approach. One discovery button returns a categorized list of example queries. Welcome message introduces capabilities and invites typing. Settled on this.

Key insight from the process: the old buttons looked actionable but just returned canned text. The "honest button" approach (variant 2) was better but 8 buttons is a lot. A single discovery button with rich content is the cleanest.

Changes shipped

access-agent (PR #12, merged):

Renamed support category "Get help" → "Create a ticket"
Extended _check_capability_discovery() to match capability labels (initially for variant 1, kept as safety net for typed input)
Rewrote "Show my options" response to show example queries instead of generic descriptions
Added example_query field to Capability dataclass — registry is single source of truth
Removed dead category-label and capability-label shortcircuit blocks (unreachable in new UX)

access-qa-bot (PR #7, merged, v3.4.0 published):

Replaced 5 category buttons with single "Show my options"
Updated welcome message to introduce capabilities and invite typing
Removed unused capabilities parameter from createMainMenuFlow
Removed CapabilitiesResponse import from flow file

Andrew's review feedback

Two changes from Andrew's review on the agent PR:

Move example queries into the registry instead of a separate dict in routes.py — led to the example_query field on Capability
Remove unused _capabilities parameter in the frontend

Development setup

Set up npm link for qa-bot-core → access-qa-bot local iteration (with [linked] breadcrumb in logger for verification). Initially added a Vite proxy for CORS, then reverted it in favor of port 3000, which is already in the agent's CORS allowlist.

2026-04-04 — PR review fixes, merging, and publishing

PR review fixes (qa-bot-core)

Addressed all 7 items from Andrew's review on qa-bot-core PR #11:

Removed "Feel free to ask another question" injection (all 3 instances)
Gated metadata display (confidence, agent, tools_used) behind QA_BOT_DEBUG localStorage flag
Suppressed rating buttons when rating_target is null
Rewrote visible Turnstile challenge flow — auto-resubmits pending query after onVerify instead of waiting for user input (eliminates silent input replacement)
Token expiry logs warning instead of showing permanent error (allows Cloudflare auto-refresh)
Second requires_turnstile response keeps pending query intact for re-solve

Tested locally with Cloudflare test keys (1x00000000000000000000AA site key, TURNSTILE_FREE_QUERIES=0). Visible challenge flow works end-to-end.

Andrew's changes synced

Pulled Andrew's recent work across repos:

access-agent feature/capability-routing: replaced _response_is_question() ("?" heuristic) with domain_completed flag from domain agent node. Also fixed requires_auth on capabilities (only manage_announcements needs auth), added capability_id to classifier examples.
access-qa-bot feature/turnstile: deployment warning useEffect for misconfigurations, removed duplicate WELCOME_MESSAGE_LOGGED_OUT.
access-qa-planning: reorganized into active/, archive/, decisions/ directories. Added 6 ADRs.
access-agent main: streaming responses spec (2026-04-03-streaming-responses-design.md).

PRs merged

All three feature PRs merged to main:

access-agent #6 (feature/capability-routing)
qa-bot-core #11 (feature/turnstile)
access-qa-bot #4 (feature/turnstile)

Releases published

qa-bot-core v0.2.30 — npm + git tag + GitHub release
access-qa-bot v3.3.12 — npm + git tag + GitHub release (updated qa-bot-core dependency to 0.2.30)

Next up: streaming responses (Andrew's spec at access-agent/docs/superpowers/specs/2026-04-03-streaming-responses-design.md). RAG scoping is unblocked on the agent side but lower priority. Personalization comes last (Drupal-dependent).

2026-04-02 — F.3 full-stack capability testing

Systematic capability-by-capability testing with access-agent (Docker), access-qa-bot (Vite dev server), and all MCP servers running locally. Created OUTSTANDING_ISSUES.md at the access-ci root to track cross-cutting issues.

Bugs found and fixed (continued from 2026-04-01 session 2)

Eight total issues found and fixed across the stack:

Logged-out buttons not rendering: Capabilities fetch async timing. Fix: defer <QABot> render until capabilities load.
Category labels sent to RAG: "Get help" etc. treated as questions. Fix: _check_capability_discovery() short-circuit in routes.py.
Ratings on clarifying questions: is_final_response hardcoded true. Fix: _response_is_question() heuristic for domain agents.
"Feel free to ask another question" on every response: Fix: gated on lastIsFinalResponse in qa-flow.tsx (rc.14).
Hardcoded ticket/security flow intercepts: Removed; all routes go to qa_loop.
Classifier misrouting MCP-backed capabilities: NSF awards, affinity groups, events, announcements classified as "static" → sent to RAG. Fix: added MCP-backed capability section with routing rules and examples to classifier prompt.
CORS blocking port 3006: Dev CORS whitelist didn't include 3006. Fix: added it. Also added graceful degradation when capabilities endpoint is unreachable (renders bot with fallback "Show my options" button).
Lock emoji prefix breaking discovery: Frontend sends 🔒 Ask a question which didn't match discovery labels. Fix: strip lock emoji prefix in _check_capability_discovery(). Also removed "Ask a question" button entirely per spec resolved decision #1 (typing is the default).
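
The lock-prefix fix amounts to normalizing the incoming text before comparing it to known labels. A small sketch, assuming the comparison is case-insensitive; the helper names and the label set are illustrative, not the contents of _check_capability_discovery().

```python
LOCK_PREFIX = "\U0001F512"  # the lock emoji the frontend prepends

# Illustrative label set; the real labels come from the capability registry.
DISCOVERY_LABELS = {"show my options", "get help", "explore resources"}


def normalize_label(text: str) -> str:
    cleaned = text.strip()
    if cleaned.startswith(LOCK_PREFIX):
        cleaned = cleaned[len(LOCK_PREFIX):]
    # Drop an optional emoji variation selector and the separating space.
    return cleaned.lstrip("\ufe0f ").casefold()


def is_discovery_query(text: str) -> bool:
    return normalize_label(text) in DISCOVERY_LABELS
```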

Capability test results (anonymous)

Capability	Result
Get help (button)	PASS — discovery listing
Open a help ticket (typed)	PASS — JSM domain agent engages. E2e deferred (need test queue)
Explore resources (button)	PASS — discovery listing
Check system status	PASS — system-status MCP
Browse events	PASS — events MCP
Browse affinity groups	PASS — affinity-groups MCP (needs per-group links)
Search software	PASS ⭐ — excellent results for "search software abaqus"
Search announcements	PASS — announcements MCP (minor list formatting in UI)
Search NSF awards	PARTIAL — tool called but institution search too fuzzy, quality check rejects
Show my options	PASS — full capability listing with lock icons
Manage announcements	PASS routing — domain agent engages, e2e needs auth
Check usage (XDMoD)	FAIL — MCP tool 500 error, missing `XDMOD_API_TOKEN`

Commits

Repo	Branch	Hash	Message
access-agent	`feature/capability-routing`	`aeadd02`	feat: capability discovery short-circuit and is_final_response heuristic
access-agent	`feature/capability-routing`	`713cfc9`	fix: add localhost:3006 to dev CORS whitelist
access-agent	`feature/capability-routing`	`578a9cb`	feat: improve classifier routing for MCP-backed capabilities
access-agent	`feature/capability-routing`	`e570611`	fix: strip lock emoji prefix from capability discovery queries
qa-bot-core	`feature/turnstile`	`f27bec7`	fix: gate follow-up prompt on is_final_response (rc.14)
access-qa-bot	`feature/turnstile`	`c2a5fb6`	refactor: simplify main menu flow, all routes to agent
access-qa-bot	`feature/turnstile`	`790aaaa`	fix: graceful degradation when capabilities endpoint is unreachable
access-qa-bot	`feature/turnstile`	`60d7463`	fix: omit "Ask a question" button — typing is the default action

Outstanding issues documented

Created OUTSTANDING_ISSUES.md with 10 items covering: test JSM queue (Andrew), NSF search precision (Andrew), quality check retry waste, is_final_response heuristic fragility, XDMoD token, CORS flexibility, authenticated testing, announcements e2e, affinity group links, numbered list rendering.

2026-04-01 — F.3 frontend implementation started

Analyzed spec and planned qa-bot-core vs access-qa-bot split

Reviewed the capability registry design spec (access-agent/docs/superpowers/specs/2026-03-18-capability-registry-design.md) and planned which F.3 changes belong in each repo. Core decision: qa-bot-core handles generic rating infrastructure (metadata capture, is_final_response gating, rating_target routing). access-qa-bot handles ACCESS-specific concerns (capability fetching, dynamic button rendering, lock icons, personalization).

One divergence from spec: added agentRatingEndpoint as an optional prop on qa-bot-core. The spec doesn't name this prop but qa-flow needs a concrete URL to POST agent ratings to. Everything else matches the spec.

qa-bot-core changes (published as 0.2.30-rc.13)

Added capabilitiesEndpoint (passthrough for wrappers) and agentRatingEndpoint props to QABotProps
qa-flow.tsx now captures is_final_response and rating_target from response metadata object
Rating buttons gated on is_final_response: true (replaces old hasShownResponse flag)
Ratings route to agentRatingEndpoint when rating_target is "agent" (with agent payload: query_id, rating, session_id), or to ratingEndpoint for "uky_rag" (existing UKY payload)
Wired through QABot.tsx and programmatic API interfaces

access-qa-bot changes

Added AGENT_ENDPOINT config (defaults to localhost:8000/api/v1) with derived capabilities and agent rating URLs
AccessQABot fetches GET /api/v1/capabilities on mount, re-fetches when auth state changes
Rewrote main-menu-flow.ts: dynamic buttons from capabilities response, "Show my options" discovery button, chatDisabled: false on start step
Combined welcome message + AI disclaimer into one message (removed go_ahead_and_ask transition step)
Lock prefix (🔒) on capabilities marked locked: true for anonymous users
Lazy personalization stub (GET /api/v1/capabilities/personalized) for authenticated users — logs to console, F.4 will use the data
Added CapabilitiesResponse, CapabilityCategory, CapabilityItem types

Git consolidation

Consolidated feature/dynamic-capabilities branches (created fresh from main) onto existing feature/turnstile branches in both repos. The turnstile frontend work and capabilities work are now on one branch per repo. Resolved merge conflicts in qa-flow.tsx (combined Turnstile challenge suppression with is_final_response gating). Deleted stale feature/dynamic-capabilities branches.

Local testing

Dev server runs, capabilities endpoint responds, buttons render from API data. Some runtime issues remain — likely button routing, flow transitions, or prop wiring. Needs full-stack debugging next session with access-agent + access-qa-bot + qa-bot-core all running.

2026-03-31 — F.1 + F.2 backend shipped, PR review fixes, F.3 ready

PR #2 merged: Turnstile + capability registry (squash merge)

Addressed two rounds of PR review feedback before merging. Review fixes:

Rating endpoint hardened with anti-spoofing: ownership check (user_hash for authenticated, session_id for anonymous), one-per-query (409), 24h time window (410), proper 403 on mismatch.
General capabilities auth flags corrected — public features (status, events, software, NSF awards, affinity groups) now requires_auth=False. Anonymous users see them unlocked in /api/v1/capabilities.
Synthesis prompts now auth-aware — anonymous users only see public capabilities in agent responses.
Classifier max_completion_tokens bumped 250→350 to prevent truncation after adding capability_id field.
Turnstile: unbounded _sessions dict replaced with periodic eviction (5min interval, 10k cap). Expired verification now resets query counter (fresh grace period). Anonymous queries without session_id rejected when Turnstile enabled.
Health endpoint KeyError fix (s["name"] → s["server"]).
Removed dead get_for_auth() method, fixed session leak in log_query().
Added 14 new tests: capability registry (loading, auth visibility, inference) + turnstile eviction.
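
The rating endpoint's anti-spoofing rules, sketched as a pure function that returns the HTTP status. LogEntry and rating_status are hypothetical names, and the order of the checks (ownership, then duplicate, then time window) is an assumption; the notes only give the status codes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RATING_WINDOW = timedelta(hours=24)


@dataclass
class LogEntry:
    user_hash: Optional[str]   # set for authenticated queries
    session_id: Optional[str]  # used for anonymous queries
    created_at: datetime
    rating: Optional[str] = None


def rating_status(
    entry: LogEntry,
    *,
    user_hash: Optional[str],
    session_id: Optional[str],
    now: datetime,
) -> int:
    # Ownership: authenticated entries match on user_hash,
    # anonymous entries match on session_id.
    if entry.user_hash is not None:
        owner_ok = user_hash == entry.user_hash
    else:
        owner_ok = session_id is not None and session_id == entry.session_id
    if not owner_ok:
        return 403
    if entry.rating is not None:
        return 409  # one rating per query
    if now - entry.created_at > RATING_WINDOW:
        return 410  # rating window has closed
    return 200
```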

PR #4 merged: response metadata for frontend rating flow

Added is_final_response, rating_target, capability_id, and question_id to query response metadata. rating_target is "uky_rag" when UKY was primary source, "agent" for tool/domain responses. Completes F.2 backend.

F.2 backend complete, F.3 next

All backend API surface for the capability registry is shipped on main. Frontend work (F.3) can now begin: fetch capabilities on load, replace hardcoded buttons, contextual ratings, rating routing.

Earlier today

2026-03-31 — Turnstile decision finalized, capability registry rebase, eval handoff

Turnstile (F.1) — decision locked

Andrew confirmed: keeping the invisible Cloudflare widget key. The silent pre-verify flow (useTurnstile hook) handles the common case. For the rare edge case where invisible verification fails (VPN + ad blocker), the user sees a "verify" message with an empty widget — Cloudflare may still succeed silently in the background. If not, the user can refresh or log in. Documented the options in CONSTERNATION.md during analysis, but the decision is to ship as-is. No code changes needed.

Capability registry rebase

Rebased feature/capability-registry onto current main (includes Andrew's eval pipeline commit 4441536). Clean rebase, no conflicts. Branch now has 5 capability commits on top of 3 turnstile commits on top of main.

Eval pipeline handoff to Andrew

Andrew is taking ownership of the evaluation pipeline (Project G). The ad-hoc testing infrastructure (test runner, LLM judge, batteries, HTML reports) documented in AGENT_TESTING.md served its purpose as a prototype. Andrew's eval pipeline design (access-agent/docs/superpowers/specs/2026-03-31-eval-pipeline-design.md) formalizes and supersedes it. Existing batteries (~160 questions) carry forward as seed data for G.1.

Cleanup

Removed working documents CONSTERNATION.md (Turnstile edge-case analysis — decision made) and AGENT_TESTING.md (eval testing overview — superseded by Andrew's eval pipeline spec and Project G in the plan).

2026-03-30 — Silent Turnstile, PR #1 merged, capability registry started

Turnstile silent pre-verification (qa-bot-core + access-qa-bot)

Reworked the Turnstile frontend from "always visible challenge" to "invisible by default, visible fallback." New useTurnstile hook renders an invisible Cloudflare Turnstile widget on mount, stores the token, and the qa-flow attaches it to every outgoing request automatically. The backend's visible challenge flow (requires_turnstile response + widget in chat) remains as fallback for suspicious users or when silent verification fails. Three free queries act as a grace period.

Key insight: Cloudflare's widget type (managed, non-interactive, invisible) is determined by the site key, not frontend code. Test keys control dev behavior:

1x...BB = invisible, always passes (normal dev)
3x...FF = forces interactive challenge (test fallback)
1x...AA = visible, always passes (see widget auto-complete)

New turnstileSiteKey prop on QABot controls activation. access-qa-bot reads it from VITE_TURNSTILE_SITE_KEY env var. Version bumped to 0.2.30-rc.10.

All three repos committed and pushed on feature/turnstile branches.

End-to-end testing and design decision

npm was down most of the day (web auth incident). Once it recovered, published rc.10 and rc.11, tested all four scenarios end-to-end:

Invisible happy path (1x...BB key) — silent verification on mount, user never sees anything. Confirmed via debug console logs.
Visible fallback (3x...FF key) — 3 free queries, then "Please verify you're human" with interactive checkbox. Widget padding (8px 16px) and rounded corners (8px) applied. Rating buttons suppressed during challenge.
Disabled (empty keys) — login gate returns, old behavior preserved.
Spec-only reactive flow (1x...AA key, no turnstileSiteKey prop) — tested to confirm what the spec alone produces: after 3 queries, "Please verify you're human" message + widget that auto-checks itself in ~1 second. Works but is a visible interruption.

Design decision (pending Andrew's input): The turnstileSiteKey prop and useTurnstile hook go beyond the original spec. They are the only path to zero interruptions for legitimate users — without them, every anonymous user sees a brief "verify you're human" blip after 3 queries. Asked Andrew whether he prefers fully invisible (current PRs) or is fine with the brief auto-check (spec-only). Both paths work in the current code — it's a deployment config choice:

Set VITE_TURNSTILE_SITE_KEY → silent pre-verify active, zero interruptions
Don't set it → reactive-only flow from the spec (brief blip after 3 queries)

If Andrew prefers spec-only, the changes to remove are: useTurnstile hook, turnstileSiteKey prop from QABot/lib, and the getTurnstileToken plumbing in qa-flow. The reactive fallback path stays as-is.

PRs opened

qa-bot-core #11: feature/turnstile → main
access-qa-bot #4: feature/turnstile → main
access-agent #2: feature/turnstile → main

PR #1 merged (access-agent)

Andrew merged uky-plus-mcp → main (14 commits). His merge included 44075e9 (usage logging fix, health endpoint enhancement, deploy workflow). Rebased feature/turnstile onto updated main across all repos — clean, no conflicts except a trivial package-lock in access-qa-bot. Cleaned up stale local branches (uky-plus-mcp, feature/dual-rag-logging).

Capability registry — F.2 backend complete

Created feature/capability-registry branch off feature/turnstile in access-agent. 5 commits, all pushed.

Data models and registry:

Capability and Category dataclasses in src/agent/domains/config.py
Added capabilities field to DomainAgentConfig; announcements (2 capabilities) and JSM (3 capabilities) configs updated
8 general pipeline capabilities (Q&A, allocations, software, status, events, affinity groups, XDMoD, NSF awards)
CapabilityRegistry class in new src/agent/domains/capabilities.py — aggregates domain + general capabilities, handles DISABLED_CAPABILITIES env var, provides auth-filtered queries, system prompt generation, and capability-to-query inference
5 categories: general, support, content, explore, analytics. 13 total capabilities.

API endpoints:

GET /api/v1/capabilities — returns capabilities grouped by category; auth-required items marked locked for anonymous users
POST /api/v1/rating — attaches rating (helpful/not_helpful) + optional feedback to an existing usage log entry by query_id
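
A sketch of how the capabilities response could be grouped, with the locked flag derived from requires_auth. The dataclass fields, the response shape, and the example ids in the usage note are assumptions for illustration, not the actual registry or endpoint schema.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


@dataclass(frozen=True)
class Capability:
    id: str
    label: str
    category: str
    requires_auth: bool = False


def group_capabilities(
    capabilities: Iterable[Capability], authenticated: bool
) -> Dict[str, List[Dict[str, Any]]]:
    grouped: Dict[str, List[Dict[str, Any]]] = {}
    for cap in capabilities:
        grouped.setdefault(cap.category, []).append(
            {
                "id": cap.id,
                "label": cap.label,
                # Auth-required items stay visible but locked for anonymous users.
                "locked": cap.requires_auth and not authenticated,
            }
        )
    return grouped
```

Usage: group_capabilities(caps, authenticated=False) marks an auth-required item such as a hypothetical manage_announcements entry as locked, while the same call with authenticated=True unlocks it.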

Classification:

Added capability_id field to QueryClassification in state model
Updated classify prompt with full capability list so the LLM outputs capability_id for every query
Falls back to inference from domain/tools_used via CapabilityRegistry.infer_capability_id()

Usage logging:

New columns: capability_id, category, was_authenticated, rating, rating_feedback
Startup migration adds columns to existing usage_logs table (since create_all only creates missing tables and does not alter existing ones)
log_query now accepts capability_id and category; was_authenticated derived from acting_user
New log_rating method for the rating endpoint
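
The startup migration idea, sketched with the standard-library sqlite3 module. SQLite and the column types are assumptions here (the notes don't say which database or types are used); the point is the idempotent "add only what's missing" pattern.

```python
import sqlite3
from typing import List

# Column names from the notes; SQL types are assumed.
NEW_COLUMNS = {
    "capability_id": "TEXT",
    "category": "TEXT",
    "was_authenticated": "INTEGER",
    "rating": "TEXT",
    "rating_feedback": "TEXT",
}


def add_missing_columns(conn: sqlite3.Connection, table: str = "usage_logs") -> List[str]:
    """Add any NEW_COLUMNS the table lacks; safe to run on every startup."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added
```

The table name is interpolated directly into SQL, which is only acceptable because it is a trusted constant, never user input.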

Agent self-knowledge:

All three synthesis prompts (tools-only, combined, RAG-only) now include a {capabilities} section
Lazy-loaded from CapabilityRegistry on first synthesis call
Agent can now answer "what can you do?" and give contextual hints about other capabilities

Remaining for F.2: Restrict /tools and /catalog endpoints to admin access (low priority). Frontend work is F.3.

2026-03-26 — Turnstile backend, Project F resequencing

Goal: Begin Project F work. Decided with Andrew to ship Turnstile first (self-contained, immediate user value) rather than landing all phases at once.

Sequencing decision

Reordered Project F delivery: Turnstile (F.1) → Capability Registry (F.2) → Dynamic UI (F.3) → Personalization (F.4). Each is a clean base for the next. Updated FEB_MARCH_PLAN.md and synced gist.

Turnstile backend (`feature/turnstile` branch, off `uky-plus-mcp`)

Implemented server-side Turnstile bot protection for anonymous users. Commit 96f9f64 pushed to origin/feature/turnstile. Will PR against main after PR #1 merges, then rebase.

What it does: anonymous users hit the /api/v1/query endpoint, the agent tracks per-session query counts and verification status. In deferred mode (default), 3 free queries pass before the agent returns {"requires_turnstile": true, "site_key": "..."} instead of an answer. The frontend (future work) will show the Cloudflare Turnstile widget, get a token, and resend. The agent verifies the token with Cloudflare's /siteverify endpoint and marks the session verified for a configurable TTL (default 1 hour). Authenticated users skip all of this.
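
Deferred-mode gate, sketched in memory. Class and method names are hypothetical, and token verification is reduced to a token_valid flag (the real code calls Cloudflare's /siteverify). The reset of the free-query counter on expiry is an assumption about how a lapsed verification is treated.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

FREE_QUERIES = 3
VERIFIED_TTL_SECONDS = 3600  # default 1 hour


@dataclass
class SessionState:
    query_count: int = 0
    verified_at: Optional[float] = None


class TurnstileGate:
    def __init__(self, site_key: str, now: Callable[[], float] = time.monotonic) -> None:
        self._site_key = site_key
        self._now = now
        self._sessions: Dict[str, SessionState] = {}

    def check(self, session_id: str, token_valid: bool = False) -> Optional[Dict[str, Any]]:
        """Return a challenge payload, or None if the query may proceed."""
        state = self._sessions.setdefault(session_id, SessionState())
        t = self._now()
        if token_valid:
            state.verified_at = t
        if state.verified_at is not None:
            if t - state.verified_at < VERIFIED_TTL_SECONDS:
                return None
            # Verification expired: start a fresh grace period (assumed).
            state.verified_at = None
            state.query_count = 0
        if state.query_count < FREE_QUERIES:
            state.query_count += 1
            return None
        return {"requires_turnstile": True, "site_key": self._site_key}
```

Authenticated users would bypass check() entirely. This sketch also never evicts sessions, so a long-running process would need a cap or periodic cleanup.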

Files: src/turnstile.py (new), src/config.py (5 env vars), src/api/routes.py (gate + counter), tests/test_turnstile.py (11 tests, all passing).

Turnstile frontend (`feature/turnstile` branches in qa-bot-core and access-qa-bot)

Implemented the Cloudflare Turnstile widget in qa-bot-core. When the agent returns {"requires_turnstile": true, "site_key": "..."}, the chatbot shows "Please verify you're human to continue." with the Turnstile widget rendered below it as a React component, then automatically resends the original query after verification.

Key learnings from iteration (rc.1 through rc.9):

injectMessage() escapes HTML — cannot inject widget containers that way.
react-chatbotify's component property on a flow step only renders reliably on the current step, not on steps reached via path transition. A separate turnstile_challenge step with a component never rendered.
Solution: put a conditional TurnstileWidgetWrapper component on qa_loop itself (same pattern as LoginButton on the login gate step). The wrapper reads from a mutable state object so it can check the site key at render time rather than at flow creation time.
Must clear turnstileState.siteKey after token is consumed, otherwise the widget re-renders on subsequent qa_loop cycles.
LIB_VERSION in logger.ts was stale at 0.2.19 — updated it to track RC numbers for dev sanity.

Files: src/components/TurnstileWidget.tsx (new), src/utils/turnstile.ts (new), src/utils/flows/qa-flow.tsx (modified).

Dev environment setup for Turnstile testing

Agent .env: Cloudflare test keys (visible, always-pass) with TURNSTILE_MODE=immediate for fast iteration.
Agent docker-compose.yml: added Turnstile env vars to the environment block (they weren't being forwarded to the container).
Agent src/main.py: added http://localhost:3000 to CORS allowed origins for local dev.
Agent src/api/routes.py: set response_model=None on the /query endpoint (FastAPI rejects Union[QueryResponse, JSONResponse] as a response model).
access-qa-bot .env.local: pointed VITE_API_ENDPOINT at http://localhost:8000/api/v1/query for local testing.
Workflow: publish RC to npm from qa-bot-core, npm install @snf/qa-bot-core@<rc> in access-qa-bot, clear node_modules/.vite, restart dev server.

Other

Removed stale DUAL_RAG_LOGGING=false from .env — was breaking all tests (key removed from Settings model during E.3 but left in .env).
Pulled repos: access-mcp had 29 files of new changes (per-user XDMoD tokens, docs proxy fixes). All others unchanged.
PRs #1 (access-agent) and #2 (access-qa-planning) still awaiting Andrew's review, no activity.

2026-03-26 — TOOL_CAVEATS, parallel RAG+plan, MCP auth, domain agents

Goal: Fix Q15 (search_projects returning random public projects), parallelize UKY RAG and tool planning per Andrew's request, set up MCP_API_KEY for service auth, add domain agent test questions.

Commits (2 new, branch now 13 off main)

28390f9 TOOL_CAVEATS — Inject caveats directly into the tool catalog text shown to the planner LLM. search_projects gets a note that it's a public catalog search with no user/owner parameter. Prevents the planner from selecting it for "my projects/allocations" queries. Earlier attempt (rule 12 in RULES section) was ignored by the LLM — moving the warning next to the tool description in the catalog was more effective.
b58b553 Parallel RAG+plan — For combined/dynamic queries, UKY RAG and tool planning now run concurrently via asyncio.gather in a new rag_and_plan_node. The planner doesn't read rag_matches, so there's no data dependency. Static and domain queries keep the sequential RAG-first path (static needs RAG result to decide whether to END, domain needs RAG context before routing). New routing in route_by_classification: returns "rag_answer" for static/domain, "rag_and_plan" for combined/dynamic.
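
The parallel node and the routing split can be sketched as follows. The run_rag and run_planner callables and the returned dict keys are stand-ins for the real node internals; only the asyncio.gather pattern and the static/domain vs combined/dynamic split come from the commit notes.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def rag_and_plan(
    query: str,
    run_rag: Callable[[str], Awaitable[Any]],
    run_planner: Callable[[str], Awaitable[Any]],
) -> Dict[str, Any]:
    # The planner does not read the RAG result, so both can run concurrently.
    rag_matches, plan = await asyncio.gather(run_rag(query), run_planner(query))
    return {"rag_matches": rag_matches, "plan": plan}


def route_by_classification(query_type: str) -> str:
    if query_type in ("static", "domain"):
        return "rag_answer"    # sequential RAG-first path
    if query_type in ("combined", "dynamic"):
        return "rag_and_plan"  # concurrent path
    raise ValueError(f"unknown query type: {query_type!r}")
```

With gather, total latency is roughly the slower of the two calls rather than their sum, which is where the savings on combined/dynamic queries would come from.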

Infrastructure

Set MCP_API_KEY=my-random-string in both access-agent/.env and access-mcp/.env — shared secret for service-to-service auth. Unlocks JSM, announcements, and events tools.
Started mcp-jsm container (was defined in compose but not running).
Fixed Argilla container consuming 23.8GB disk (stuck in restart loop) — docker compose down on access-argilla freed the space.

Battery results

v4 run (18 questions, parallel RAG+plan, MCP auth):

18/18 pass, zero 92-char failures
Q15 answer clean — allocation troubleshooting guidance with correct links, no random public projects
Q17 (JSM domain): correctly routed, tools=['jsm'], asks for ticket details (281 chars)
Q18 (Announcements domain): correctly routed, tools=['announcements'], provides guidance (725 chars)
JSM tools now functional with MCP_API_KEY — create_support_ticket, create_login_ticket appearing in tool results

Side-by-side UKY vs Agent (18 questions):

Agent longer: 6/18, UKY longer: 12/18
Avg chars nearly identical: UKY 1413, Agent 1449
Agent adds genuine value on live data: Q14 (+3848 chars, live events via search_events), Q7 (+789 chars, hardware specs)
Domain questions (Q17, Q18) are shorter because the agent takes action (ticket creation, announcement workflow) rather than providing generic documentation
Report: slim-v4-sidebyside.html on Netlify

Design decision: UKY context in domain agents

Explored injecting UKY RAG context into domain agent system prompts so ticket/announcement flows would include troubleshooting steps. Tested and reverted — the JSM react agent ignores the documentation context and follows its ticket-creation workflow regardless. More importantly, when a researcher says "open a ticket about my login issue on Bridges-2", they've already tried the obvious fixes. Injecting "have you checked your password?" is condescending. The upcoming capabilities work (buttons for direct actions) is a better fit for this pattern.

Circuit breaker for retry loop (commit 14)

Discovered the quality loop (evaluate → re-plan → execute) was grinding through 3 identical retries when tools returned empty or error data. The planner has no knowledge of previous failures, so it picks the same tool each time. Added a circuit breaker in should_retry_quality: if all tool results are empty, failed, or error-shaped (including MCP success=True with {"error": "..."} body), skip straight to synthesize. UKY content is available in state regardless. Saved ~10-15s on affected questions (Q13 went from 24.5s → 9.0s).
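
A minimal sketch of the circuit-breaker check, assuming each tool result is a dict with `success` and `data` keys. That result shape, the `is_unusable` helper, the `quality_ok` flag, and the returned node names are assumptions for illustration; only the rule itself (all results empty, failed, or error-shaped means skip to synthesize) comes from the change described above.

```python
from typing import Any


def is_unusable(result: dict[str, Any]) -> bool:
    """True when a tool result carries no usable data (assumed result shape)."""
    if not result.get("success", False):
        return True  # the call itself failed
    data = result.get("data")
    if not data:
        return True  # None, empty list, or empty dict
    # MCP can report success=True while the body is only an error payload.
    if isinstance(data, dict) and "error" in data:
        return True
    return False


def should_retry_quality(
    tool_results: list[dict[str, Any]],
    quality_ok: bool,
    retries: int,
    max_retries: int = 3,
) -> str:
    """Return the next node name: 'synthesize' or 'plan' (hypothetical names)."""
    if quality_ok or retries >= max_retries:
        return "synthesize"
    # Circuit breaker: the planner has no memory of failures, so re-planning
    # would pick the same tools again. Stop retrying when nothing came back.
    if tool_results and all(is_unusable(r) for r in tool_results):
        return "synthesize"
    return "plan"
```

In this sketch a single usable result is enough to keep the retry path open, since re-planning may still improve on partial data.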

Reports

slim-v4-comparison.html — standalone 18-question battery (pre-parallel)
slim-v4-sidebyside.html — final UKY vs Agent side-by-side (with parallel RAG+plan + circuit breaker)
All deployed to access-ci-reports.netlify.app

2026-03-25 — Graph fixes on `uky-plus-mcp` + slim battery validation

Goal: Implement the top graph fixes identified in the E.3 v2 review, build a focused 16-question regression battery, and validate improvements.

Commits (10 total on `uky-plus-mcp` off `main`)

b6f29e1 cherry-pick: node_trace observability
093daeb cherry-pick: gate node_trace behind ?include_trace
8dfa28d Always consult UKY — every query hits rag_answer first, classifier no longer gates UKY access. domain_agent falls back to UKY when tools unavailable.
cf3a033 Tighten JSM classification — domain=jsm only on explicit "open/file a ticket" language, not problem descriptions.
d09d24c Smarter hedge detection — keep hedged answers with substance (>500 chars, has URLs/emails). Only reject true deflections.
0914e00 Synthesizer prompts — preserve UKY links/contacts/specifics, strip hedge preamble, don't inject LLM training data.
dec9e43 Rename _rag_answer_is_weak → _rag_answer_is_deflection
d381ea9 Widen combined classification — hardware specs, software versions, resource comparisons now combined instead of static to preserve MCP enrichment path.
e039e02 F15 fix — when plan says "no tools needed" but UKY has content, use RAG-only synthesis instead of pure LLM generation.
6edb670 Direct UKY serve — when tools add no value (not needed, failed, or absent), serve raw UKY answer directly with hedge preamble stripped. No LLM rewrite. LLM synthesis only on combined path where tools actually contributed. Hardened URL preservation in all synthesis prompts.
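
The hedge-handling rule from d09d24c can be sketched roughly as follows. The hedge phrases and the regex are placeholders, and the sketch assumes that "substance" requires both the length threshold and at least one URL or email; the journal does not say whether the real check combines them with AND or OR.

```python
import re

# Hypothetical hedge phrases; the real list in access-agent is not shown here.
HEDGE_PHRASES = ("i don't have", "i'm not sure", "i could not find")
URL_OR_EMAIL = re.compile(r"https?://\S+|[\w.+-]+@[\w-]+\.[\w.-]+")
MIN_SUBSTANCE_CHARS = 500


def rag_answer_is_deflection(answer: str) -> bool:
    """True only for hedged answers that carry no real substance."""
    text = answer.strip()
    hedged = any(phrase in text.lower() for phrase in HEDGE_PHRASES)
    if not hedged:
        return False  # confident answers are never deflections
    # ASSUMPTION: substance means long enough AND containing a link or contact.
    has_substance = len(text) > MIN_SUBSTANCE_CHARS and bool(URL_OR_EMAIL.search(text))
    return not has_substance
```

Under this rule a short "I don't have information on that" is rejected, while a long answer that opens with a hedge but goes on to give links is kept.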

Also: disabled pgvector fallback (stubs remain), added ACCESS_AI_API_KEY to docker-compose.yml, updated SYSTEM_OVERVIEW.md diagram, updated agent-decision-flow.html to reflect new graph.

Slim battery results (v3 — final)

16 questions, 3 iterations (v1→v2→v3). v3 is the final run with all 10 commits.

v3 results (vs UKY baseline):

Zero 92-char failures (was 7 before fixes)
8 of 16 questions now longer than UKY baseline
Direct serve strategy eliminates synthesis nerf: Q3 went 1,197→2,248, Q4 went 365→1,535
Combined synthesis works when MCP adds real data: Q13 (+423 system status), Q15 (+249)
Q11 (+1,578): MCP enrichment with current GPU specs (the "dream scenario")
Remaining shorter answers are mostly UKY variance (different answer each run)

Synthesis strategies in v3:

UKY DIRECT (no synth / no tools / tools failed): 10 questions — raw UKY answer, hedge stripped, no LLM rewrite
COMBINED (UKY + MCP): 6 questions — LLM merges UKY knowledge with MCP tool data

Remaining issues:

UKY variance — same question produces 274–946 chars across runs
Auth gap — user-specific MCP tools need MCP_API_KEY service token (ask Andrew)
Combined synthesis still condenses slightly (but only when tools actually contributed)

Reports

slim-v1-comparison.html — first run with fixes 1-8
slim-v2-comparison.html — after F15 fix, three-column comparison
slim-v3-comparison.html — final, all fixes, no scenario tags, strategy badges
All deployed to access-ci-reports.netlify.app

Artifacts

SLIM_BATTERY.md — scenario descriptions for the 16-question battery
SLIM_BATTERY_QUESTIONS.md — the actual questions file for the test runner
REVIEW_CHECKLIST.md — full findings (F1-F15, W1) with per-question notes
agent-decision-flow.html — updated graph diagram with state reads/writes

2026-03-24 — E.3 v2: Traced batteries + deep review on `uky-plus-mcp`

Goal: Add node traces to the agent, re-run both batteries, build traced comparison reports, and do a question-by-question review to identify systemic issues.

Branch: uky-plus-mcp (off main in access-agent).

Changes made

Cherry-picked 04342c8 (node_trace) + b7a9bec (trace gating) from feature/dual-rag-logging
Disabled pgvector Q&A pair fallback — stubs remain in _search_pgvector() and qa_client.py for future use
Added weak_answer (hedge detection) field to rag_answer trace
Added classification_info and node_trace to API response metadata
Fixed missing ACCESS_AI_API_KEY in docker-compose.yml — UKY was silently failing without it
Updated SYSTEM_OVERVIEW.md mermaid diagram (removed incorrect pgvector→synthesize edge)
Updated agent-decision-flow.html with state reads/writes per node

Results

Both batteries 50/50 success with full traces. Reports at ~/.agent/diagrams/e3-v2-*-comparison.html, deployed to Netlify.

Deep review findings (15 findings, see REVIEW_CHECKLIST.md)

Most critical:

All 7 fallback-gap 92-char failures are JSM misroutes — classifier interprets every problem description as "file a ticket"
Hedge detection rejects good UKY answers that have useful content after the preamble — 20/50 friendly, 15/50 real-user
Synthesize dilutes authoritative UKY content and can inject incorrect LLM training data
Dynamic classification skips UKY on documentation questions when user describes their situation
Combined classification forces resynthesis even when UKY answers confidently

One proven win: Real-user Q15 (Bridges-2 GPUs) — MCP get_resource_hardware corrected UKY's outdated docs (only knew V100s, MCP added H100 and L40S). Only question across both batteries where the system was unambiguously better than UKY alone.

Artifacts

REVIEW_CHECKLIST.md — full findings and per-question notes
SLIM_BATTERY.md — 16-question focused battery covering 8 scenarios
a3_results/e3-v2-friendly.json, a3_results/e3-v2-realuser.json (in access-agent subdir)

Next session

Plan and implement graph fixes on uky-plus-mcp: always consult UKY, tighten JSM classification, smarter hedge handling, synthesizer prompt improvements. Run slim battery to validate.

2026-03-20 — E.3: UKY+MCP on main vs UKY alone

Setup

Switched access-agent to main (pulled 8 new commits from origin), set UKY_RAG_ENABLED=true, rebuilt Docker container. Ran both batteries against the full system (UKY RAG + MCP tools + LangGraph orchestrator).

E.3 results

Friendly battery (50 well-phrased queries):

Metric	UKY alone	UKY+MCP (main)
Answered	50/50	50/50
Avg response length	1,380 chars	886 chars
Avg latency	3.9s	11.0s
MCP tool usage	—	21/50 (42%)
Fallback responses	0	2 (cold-start artifact)

Real-user battery (50 messy queries with typos/vague phrasing):

Metric	UKY alone	UKY+MCP (main)
Answered	48/50	50/50
Avg response length	1,159 chars	752 chars
Avg latency	4.8s	10.8s
MCP tool usage	—	19/50 (38%)
Fallback responses	0	7 (regression)
pgvector RAG hits	—	0

Key findings

MCP adds real value: Live software lists, resource specs, system status, events — data UKY cannot provide. ~40% of questions benefit.
Critical fallback gap: 7 real-user queries return "tools needed for this task are currently unavailable" (92 chars) where UKY gives substantive answers (687–1,859 chars). Pattern: account/password/troubleshooting questions routed to dynamic path but no MCP tool matches. Agent should fall back to UKY RAG instead of giving up.
Zero pgvector RAG hits on real-user battery: The 0.85 similarity threshold is too high for messy input. Cherry-pick candidate 08809ad (lower thresholds) would help.
Friendly battery cold-start: Q1/Q2 got generic fallback due to UKY timeout on first request (588s). Transient issue.

Reports

Friendly: ~/.agent/diagrams/e3-friendly-comparison.html
Real-user: ~/.agent/diagrams/e3-realuser-comparison.html
Published copies: published-reports/e3-friendly-comparison.html, published-reports/e3-realuser-comparison.html
Raw data: a3_results/e3-friendly-main.json, a3_results/e3-realuser-main.json

Next steps

Fix the fallback gap: when classifier routes to dynamic path but no tool matches, fall back to UKY RAG
Consider cherry-picks: 08809ad (lower thresholds) and ef43a21 (top-5 RAG) most likely to help
Verify whether JSM bug (e629cb6) exists on main before cherry-picking
Re-run real-user battery after fix to confirm regression is resolved

Andrew's reaction

Sent reports. Response suggested he sees the architecture as sound but may not have focused on the 7 regressions. Key quote: "Document RAG and MCP will probably do pretty well... supplementing with curated Q&A pairs for specific commonly asked questions would be a good supplement." He's directionally right but the fallback gap needs fixing before the system is production-ready.

2026-03-19 — E.2: Real-user battery + architecture pivot

E.2 bake-off results

Ran 50 real-user queries (REAL_USER_BATTERY.md) against two targets. UKY alone ran clean first. NEWSYSTEM run was contaminated mid-run by an internet outage (q18 took 926s; q19–29 returned a canned 130-char fallback). Re-ran NEWSYSTEM cleanly.

Results — UKY alone vs. NEWSYSTEM (pgvector + MCP, UKY_RAG_ENABLED=false):

Metric	UKY	NEWSYSTEM
Answered	48/50	50/50
Avg response length	1,159 chars	694 chars
Avg latency	4.8s	12.4s
RAG hits	—	9 (18%)
MCP tool calls	—	12 (24%)
LLM only	—	29 (58%)

Report: ~/.agent/diagrams/e2-comparison.html

Key findings:

6 "tools unavailable" deflections — q4, q13, q17, q22, q32, q50. All account/support-type questions. Classifier routed to JSM domain; JSM was unavailable; agent returned a canned 92-char error instead of falling back. The E.1 fix (e629cb6) addresses this but it was not yet deployed in this run.
MCP tools returning empty for factual resource questions — q38 (DARWIN storage), q39 (SLURM resources), q43 (TAMU storage), q37 (GPU hours conversion). Tools were called and ran for 20–36s but returned no results. UKY answered these correctly from documents. Data gaps in the MCP layer.
58% LLM-only — same root cause as friendly battery: 0.70 threshold too strict for messy real-user input, and classifier over-routes problem-sounding questions to dynamic paths.
UKY answers are consistently longer and more detailed for how-to questions. NEWSYSTEM's thin-answer gap is real.
q25 (XDMoD research profile) — agent referenced a nonexistent tool integrate_nsf_xdmod. UKY also 500'd on this.

Architecture pivot mid-session

Andrew raised: "The access-agent currently works with UKY, so maybe no changes needed there?" Realized our E.2 NEWSYSTEM run was testing pgvector-only (UKY disabled in .env), not UKY+MCP. The feature branch already supports UKY as primary with pgvector fallback — we just had UKY_RAG_ENABLED=false.

Revised E.2 goal: The real comparison is UKY alone vs. UKY + MCP tools (i.e., main). Andrew confirmed UKY can stay. pgvector remains as a slam-dunk fallback, not a UKY replacement.

E.2 UKY-alone data is reusable (a3_results/e2-uky.json). E.3 just needs a NEWSYSTEM run with UKY_RAG_ENABLED=true.

Branch salvage assessment

feature/dual-rag-logging has 7 commits ahead of main. Verdict:

Merge to main:

e629cb6 — JSM graceful recovery (production fix, the most important one)
04342c8 + b7a9bec — node_trace observability + gating behind ?include_trace
ef43a21 — RAG top-5 + better synthesis prompt
08809ad — lower similarity thresholds (0.85→0.70)

Leave behind (spike-only):

caf7256 — dual-RAG comparison logging infrastructure
de26e37 — pgvector→synthesis routing (fine but bundled with comparison logger)

2026-03-19 — E.1: JSM error handling

Confirmed two failure modes: JSM server unavailable (no tools loaded) and classifier over-routing complaint-framed questions to the JSM domain.

Fixes (access-agent, commit e629cb6):

domain_agent.py: no-tools case now returns final_answer=None and clears domain rather than emitting a canned error string
graph.py: replaced hard domain_agent → END edge with conditional — final_answer=None falls through to rag_answer
classify.py: tightened JSM routing to require explicit ticket-filing language; frustration/complaint framing now routes to RAG
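
The domain_agent.py and graph.py changes above amount to a small contract: the no-tools case signals "no answer" with `final_answer=None`, and a conditional edge decides where to go next. The sketch below shows that contract in plain Python. The state shape, the function names, and the `"end"` label are assumptions for illustration; in the real graph the router would be wired in as a LangGraph conditional edge and mapped to END.

```python
from typing import Optional, TypedDict


class AgentState(TypedDict, total=False):
    final_answer: Optional[str]
    domain: Optional[str]


def domain_agent_no_tools(state: AgentState) -> AgentState:
    """No-tools case: signal 'no answer' instead of emitting a canned error string."""
    return {**state, "final_answer": None, "domain": None}


def route_after_domain_agent(state: AgentState) -> str:
    """Conditional edge: fall through to RAG when the domain agent produced nothing."""
    if state.get("final_answer") is None:
        return "rag_answer"
    return "end"  # stands in for LangGraph's END in this sketch
```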

Also confirmed the hedge-detection fallback in the graph is working correctly: UKY hedge phrases trigger MCP tool lookup, so static content that lives in MCP (Ranch specs, software lists) is found on the second attempt. The classifier's definition of "dynamic" is sound; no misrouting was observed.

144 tests passing. 1 pre-existing failure in dual-RAG logging test, unrelated.

2026-03-16 — RAG architecture analysis: proposing document-chunk retrieval

Problem statement

Reviewed bake-off results end-to-end. QAP matching has a fundamental surface area limitation: a Q&A pair question is ~20 words. A document chunk is ~2000 words. Real user queries are messy (vague, jargon, error pastes, complaint framing) and score 0.50–0.65 against clean QAP questions — below any useful threshold. Lowering the threshold below 0.55 pulls in wrong-topic matches. Returning more QAPs and synthesizing across them only helps union-type queries ("which resources have GPUs?"); it doesn't help when zero QAPs match above threshold.

Even on clean, well-phrased questions (friendly battery), NEWSYSTEM only won 12 of 50 against UKY. 48% RAG hit rate means half the questions got no retrieval at all. The narrow matching surface is a limitation even in the best case.

Where NEWSYSTEM did win

The 12 friendly battery wins break down as: 5 QAP-RAG hits, 4 MCP tool calls, 3 LLM-only. The QAP-RAG wins were genuinely better than UKY — judge cited more precise commands, correct module names, properly formatted citations, less cross-contamination between resources. All 5 winning QAPs were document-sourced (source documents exist in UKY's corpus too). The QAPs won on precision/scoping, not unique coverage. This is a chunking quality issue, not a QAP-vs-docs issue — per-resource chunking should achieve the same precision.

Two proposed architectures

Option A: Three-tier retrieval (QAP → doc-chunk → MCP)

Keep QAPs at high threshold for slam-dunk matches
Add document-chunk RAG as fallback when QAP matching is weak
MCP tools for live/dynamic data
Pro: preserves QAP precision. Con: two extraction pipelines, more routing complexity

Option B: Unified document store (proposed)

Convert MCP entity data into documents (resource descriptions, software lists as prose) and chunk alongside existing PDFs/guides
One pgvector store, one retrieval path
MCP tools still handle truly dynamic data (status, allocations, user-specific)
MCP→document extraction replaces MCP→QAP extraction — same pipeline, simpler output. Transcribing structured data into prose, not summarizing (not a game of telephone)
Cross-entity queries work naturally via vector search instead of requiring Plan → Execute
Pro: dramatically simpler, solves surface area problem universally. Con: lose pre-written curated answers; LLM synthesizes from chunks

Sent analysis to Andrew for input.

Resolution: architecture direction confirmed (2026-03-17)

After review with Andrew, the path forward is simpler than either proposed option:

UKY handles document RAG — it already does this well, no need to rebuild
MCP tools fill the live data gaps — system status, allocations, user-specific queries, anything not in the document corpus
QAPs stay in pgvector, no new generation — keep the existing pairs as a high-confidence slam-dunk path; if they match, great. Don't invest in generating more until they prove useful in production
Fix Atlassian error handling — agent is not recovering gracefully from failed JSM calls
Double LLM synthesis — UKY synthesizes, then access-agent synthesizes. Real latency/cost issue worth addressing eventually, not blocking now

Andrew also clarified that UKY already has at least the intention of pulling dynamic data (events etc.) from APIs — so the "MCP fills gaps UKY can't" argument is narrower than assumed. The bottleneck is LLM synthesis, not retrieval transport.

The QAP extraction work (Projects C, A.3) was not wasted: it produced clear quantitative evidence (18% real-user hit rate, 12/50 on the friendly battery) that established the approach's limits and showed exactly when QAPs do and don't work. That evidence is what makes the architectural decision defensible.

Key architectural insight: QAP value depends on data source

Content type	Best retrieval	Rationale
Documents (guides, how-tos, policies)	Document-chunk RAG	Rich surface area handles messy queries; QAPs distill away the very text users search for
MCP entity data (resource specs, software)	QAP-as-cache OR MCP→document extraction	No source documents to chunk; QAPs are one option, synthesized docs are the other
Live/dynamic data (status, allocations)	MCP tools directly	Changes constantly, can't be cached in any RAG store

2026-02-28 — A.1: Argilla → pgvector sync pipeline

Goal: Get Q&A pairs from Argilla into access-qa-service (pgvector) so they're searchable via semantic search.

Discovery: access-qa-service already had a /admin/sync endpoint and argilla_sync.py — but the code was scaffolded with placeholder logic that didn't match the actual Argilla v2 API or the record schema created by access-qa-extraction.

What was wrong:

Used deprecated Argilla v1 API (rg.init() / rg.load())
Guessed at record field access (record.inputs, record.question) — Argilla v2 uses record.fields["question"]
Looked for entity_id in metadata (doesn't exist) — needs to come from <<SRC:...>> citation markers in the answer text
Default dataset name was "access-qa" but extraction creates "qa-review"
argilla Python SDK wasn't in the dependencies

What we fixed (commit 5b57ae0 on access-qa-service/main):

Rewrote sync_from_argilla() for Argilla v2 client API
Correct field access via record.fields
Domain/entity_id extracted from citation markers, with source_ref parsing as fallback
Added _get_edited_values() to prefer reviewer edits (future-proofing)
Judge scores (faithfulness, relevance, completeness, confidence) carried through to pgvector metadata
Added argilla>=2.0.0 as a proper dependency
Added Argilla env vars to docker-compose.yml for local dev
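
For illustration, pulling the domain and entity ID out of a citation marker might look like the sketch below. The journal only shows the marker as `<<SRC:...>>`, so the `domain:entity_id` payload layout, the regex, the `extract_source` helper, and the sample entity ID are all assumptions, not the format access-qa-extraction actually emits.

```python
import re
from typing import Optional

# ASSUMPTION: the marker payload is "domain:entity_id".
# The real payload format is not shown in this journal.
SRC_MARKER = re.compile(r"<<SRC:([^:>]+):([^>]+)>>")


def extract_source(answer: str) -> Optional[tuple[str, str]]:
    """Return (domain, entity_id) from the first citation marker, or None."""
    match = SRC_MARKER.search(answer)
    if match is None:
        return None  # the caller would fall back to source_ref parsing
    return match.group(1), match.group(2)
```

Returning None rather than raising keeps the sync loop simple: a record without a marker can take the source_ref fallback path instead of being counted as an error.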

Test result:

POST /admin/sync → {"synced": 83, "skipped": 0, "citations_loaded": 12, "errors": []}
POST /search {"query": "What is ACES designed for?"} → similarity_score: 1.0, correct answer with citation

83 records across 5 domains (compute-resources, software-discovery, affinity-groups, allocations, nsf-awards) synced and searchable.

Also documented: Andrew's feature/access-agent-integration branch on qa-bot-core — what it changes (Netlify proxy, request body format, response contract) and why it matters for Projects A and B. Added to FEB_MARCH_PLAN.md and synced to the gist.

2026-02-28 — A.2: Dual-RAG comparison logging in access-agent

Goal: Modify rag_answer node to query both UKY document RAG and pgvector Q&A-pair RAG for every question, logging side-by-side results for A.3 evaluation.

Approach: Parallel queries via asyncio.gather, gated behind DUAL_RAG_LOGGING env var. When the flag is off, behavior is identical to before.

What was built (commit caf7256 on access-agent/feature/dual-rag-logging):

src/config.py — Added DUAL_RAG_LOGGING: bool = False setting
src/rag_comparison_logger.py (new) — SQLAlchemy model + singleton logger for rag_comparison_logs table. Follows same pattern as usage_logger.py. Table auto-creates on first use.
src/agent/nodes/rag_answer.py — Added:
- _query_uky_raw() / _query_pgvector_raw() — lightweight async helpers that return raw results without span side-effects
- _dual_rag_answer() — runs both queries concurrently, applies same UKY-primary/pgvector-fallback priority, logs comparison to PostgreSQL
- Gate in rag_answer_node: settings.DUAL_RAG_LOGGING and rag_endpoint → dual path; else unchanged
tests/test_rag_answer.py (new) — 19 tests: citation processing, raw query helpers, dual-RAG logic (UKY served, pgvector fallback, both fail, combined query, below threshold, logger failure resilience), flag gating
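
The shape of the dual path can be sketched as below: query both backends concurrently, then apply the UKY-primary, pgvector-fallback priority. The two query functions are stand-ins for `_query_uky_raw()` and `_query_pgvector_raw()`; their return shapes, the `served_by` labels, and the use of `return_exceptions=True` are assumptions for this example, and the PostgreSQL logging step is omitted.

```python
import asyncio
from typing import Any, Awaitable, Callable

QueryFn = Callable[[str], Awaitable[dict[str, Any]]]


async def query_uky(query: str) -> dict[str, Any]:
    """Stand-in for _query_uky_raw(); a real helper would call the UKY endpoint."""
    await asyncio.sleep(0.01)
    return {"answer": f"UKY answer for: {query}"}


async def query_pgvector(query: str) -> dict[str, Any]:
    """Stand-in for _query_pgvector_raw(); a real helper would call the qa-service."""
    await asyncio.sleep(0.01)
    return {"matches": [{"score": 0.84, "answer": "curated answer"}]}


async def dual_rag_answer(
    query: str, uky: QueryFn = query_uky, pgvector: QueryFn = query_pgvector
) -> dict[str, Any]:
    # Run both backends concurrently. return_exceptions=True hands back a
    # failure as a value instead of raising, so the other result is not lost.
    uky_res, pgv_res = await asyncio.gather(
        uky(query), pgvector(query), return_exceptions=True
    )
    uky_ok = not isinstance(uky_res, BaseException) and bool(uky_res.get("answer"))
    pgv_ok = not isinstance(pgv_res, BaseException) and bool(pgv_res.get("matches"))
    # UKY is primary; pgvector is the fallback.
    if uky_ok:
        return {"served_by": "uky", "answer": uky_res["answer"]}
    if pgv_ok:
        return {"served_by": "pgvector", "answer": pgv_res["matches"][0]["answer"]}
    return {"served_by": "none", "answer": None}


result = asyncio.run(dual_rag_answer("What is Delta?"))
print(result["served_by"])
```

Passing the query functions as parameters is a choice made for the sketch so that failure cases are easy to exercise in tests.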

Comparison log table schema (rag_comparison_logs):

Query context: session_id, question_id, query_text, expanded_query, query_type, rag_endpoint
UKY result: uky_response, uky_duration_ms, uky_error
pgvector result: pgvector_matches (JSONB), pgvector_best_score, pgvector_match_count, pgvector_duration_ms, pgvector_error
Outcome: served_by, served_answer_length

Test result: 94 passed (all existing + 19 new), 0 failures.

What's unchanged: state.py, graph.py, routes.py — the graph contract is untouched. The comparison log is a side-effect inside the rag_answer node.

Next (A.3): Deploy the feature/dual-rag-logging branch with DUAL_RAG_LOGGING=true, ask questions via qa-bot-core or direct API, then query rag_comparison_logs to evaluate UKY vs pgvector.

2026-02-28 — A.3 setup: Docker environment stood up and smoke-tested

Decision: Run A.3 locally in Docker, bypass qa-bot-core, use direct curl requests.

Docker setup (two separate compose projects):

access-qa-service/docker-compose.yml → qa-service (port 8001) + PostgreSQL (port 5433) + Redis (port 6380)
access-agent/docker-compose.yml → agent (port 8000) + PostgreSQL (port 5432) + Redis
access-agent reaches access-qa-service via host.docker.internal:8001 (macOS Docker)
UKY endpoint is remote — uses same API key as qa-bot-core (ACCESS_AI_API_KEY)

What we did to get access-agent running:

Created access-agent/.env from discovered keys: OPENAI_API_KEY (from access-qa-extraction/.env), ACCESS_AI_API_KEY (same key as QA_MODEL_API_KEY in access-serverless-api/.env and REACT_APP_API_KEY in qa-bot-core/.env.local), plus DUAL_RAG_LOGGING=true, QA_SERVICE_URL=http://host.docker.internal:8001, OTEL_ENABLED=false
Modified access-agent/docker-compose.yml: added env_file: .env to the agent service (previously all env vars had to be listed explicitly), removed external mcp-network dependency (MCP servers aren't needed for A.3)
Built and started: docker compose up --build -d — all containers healthy

Smoke test (successful):

curl -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is Delta?", "session_id": "test-a3-smoke", "question_id": "smoke-1"}'

→ Got a full UKY-sourced response about Delta (NCSA HPC resource), 6s latency, tools_used: ["uky_rag_retrieval"]. Agent is live and hitting UKY successfully.

Note: The API field is query (not question). The MCP server warnings in the agent logs are expected and harmless — those servers aren't on this Docker network and aren't needed for A.3.
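
For scripted runs, the same smoke test can be expressed with the Python standard library. This is a sketch rather than the project's test runner: `build_query_request` and `send` are hypothetical helper names, and the 60-second timeout is an arbitrary choice. The URL and body fields are taken from the curl command above.

```python
import json
import urllib.request

AGENT_URL = "http://localhost:8000/api/v1/query"


def build_query_request(query: str, session_id: str, question_id: str) -> urllib.request.Request:
    """Build the POST request; the body field is `query`, not `question`."""
    payload = {"query": query, "session_id": session_id, "question_id": question_id}
    return urllib.request.Request(
        AGENT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response (needs the agent running)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


req = build_query_request("What is Delta?", "test-a3-smoke", "smoke-1")
print(req.get_method(), req.full_url)
```

Calling `send(req)` against the running container should return the same JSON body the curl command does.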

Current container status (all running):

Service	Port	Notes
access-agent	8000	`feature/dual-rag-logging` branch, `DUAL_RAG_LOGGING=true`
access-agent postgres	5432	checkpointing + comparison logs
access-qa-service	8001	83 Q&A pairs loaded
qa-service postgres	5433	pgvector embeddings
access-argilla	6900	Q&A pair review UI

2026-03-02 — A.3 pre-flight: similarity threshold bug found

Goal: Verify Docker environment still works and start A.3 evaluation.

Discovery: pgvector is returning zero matches for reasonable queries like "What is ACES?" — even though we have 20 compute-resources Q&A pairs including several about ACES.

Root cause: The similarity threshold is too aggressive. There are two thresholds stacked:

qa-service default (access-qa-service/src/access_qa_service/config.py:26): rag_similarity_threshold = 0.85
access-agent per-query-type thresholds (access-agent/src/config.py:69-71):
- RAG_THRESHOLD_STATIC = 0.85 (static queries)
- RAG_THRESHOLD_COMBINED = 0.75 (combined queries)
- RAG_THRESHOLD_FALLBACK = 0.65 (fallback)

The agent's _query_pgvector_raw() passes the threshold to the qa-service, which uses it to filter results. For static queries (the most common type), both sides enforce 0.85.

The problem: "What is ACES?" scores 0.84 against the best match ("What is ACES designed for?") — just below the 0.85 cutoff. With threshold 0.3, the same query returns 3 solid matches (0.84, 0.82, 0.76). Short or naturally-phrased questions routinely fall just under 0.85 even when the topic matches perfectly.

Evidence:

curl /search {"query": "What is ACES?", "threshold": 0.85}  → 0 matches
curl /search {"query": "What is ACES?", "threshold": 0.3}   → 3 matches (0.84, 0.82, 0.76)
curl /search {"query": "What is ACES designed for?"}         → 1 match (1.0, exact)

The rag_comparison_logs table confirmed this — both smoke test queries ("What is Delta?", "What is ACES?") show pgvector_match_count: 0 and served_by: uky_general.
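
The effect of the cutoff is easy to reproduce with the three scores reported above. The sketch assumes the filter keeps matches at or above the threshold (whether the real comparison is `>=` or `>` is not stated), and `matches_above` is a hypothetical helper name.

```python
# Scores reported in this entry for "What is ACES?" against the stored Q&A pairs.
SCORES = [0.84, 0.82, 0.76]

# Per-query-type thresholds from access-agent/src/config.py at the time of this entry.
THRESHOLDS = {"static": 0.85, "combined": 0.75, "fallback": 0.65}


def matches_above(scores: list[float], threshold: float) -> list[float]:
    """Keep only matches at or above the similarity threshold (assumed >=)."""
    return [s for s in scores if s >= threshold]


for query_type, threshold in THRESHOLDS.items():
    kept = matches_above(SCORES, threshold)
    print(f"{query_type:<9} threshold={threshold:.2f} -> {len(kept)} match(es)")
```

With these scores the static threshold of 0.85 keeps nothing, while a static threshold of 0.70 would keep all three matches.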

What needs to happen before running A.3:

Lower the threshold so pgvector actually returns matches for natural queries
Options: (a) lower RAG_THRESHOLD_STATIC from 0.85 to ~0.70 in access-agent config, (b) use a comparison-specific override in the dual-RAG path so production defaults aren't touched, or (c) lower the qa-service default
Rebuild the access-agent container after the change

Also this session: Created SYSTEM_OVERVIEW.md with sequence diagrams of the three main flows (query answering, knowledge base building, per-entity extraction detail). Updated the agent graph illustration in FEB_MARCH_PLAN.md from mermaid to an emoji-annotated state transition table. Synced plan gist.

2026-03-02 — Threshold fix committed

Change: Lowered all RAG similarity thresholds in access-agent/src/config.py (commit 08809ad on feature/dual-rag-logging):

RAG_THRESHOLD_STATIC: 0.85 → 0.70
RAG_THRESHOLD_COMBINED: 0.75 → 0.60
RAG_THRESHOLD_FALLBACK: 0.65 → 0.50
RAG_SIMILARITY_THRESHOLD (legacy): 0.85 → 0.70

Why: Best matches for natural queries scored ~0.84, just below the 0.85 cutoff. This was the A.3 blocker — pgvector returned 0 matches for every query.

Still needed: Rebuild the access-agent Docker container (docker compose up --build -d) and verify the fix with a smoke test before proceeding with A.3.

2026-03-02 — A.3 running: container rebuilt, threshold verified, test questions written

Rebuilt container: docker compose up --build -d picked up the threshold fix. All containers healthy.

Threshold fix verified: "What is ACES?" now returns pgvector_match_count: 3, pgvector_best_score: 0.84. Before the fix this was 0 matches. UKY still served (as designed), but pgvector results are now logging.

Pushed branches: access-agent/feature/dual-rag-logging pushed to GitHub (3 commits: A.2 dual-RAG logging, threshold fix). access-qa-service/main push failed — Joe doesn't have write access to necyberteam/access-qa-service (need Andrew to grant).

QAP coverage (83 pairs across 10 entities in 5 domains):

Domain	Entity	Pairs
compute-resources	ACES (TAMU)	10
compute-resources	Ranch (TACC)	10
software-discovery	ABINIT	10
software-discovery	Abaqus	8
allocations	Grassland bird habitat (#72204)	9
allocations	RL benchmark (#72205)	10
nsf-awards	Pollinator conservation AI (#2529183)	10
nsf-awards	Great Salt Lake dust (#2449122)	8
affinity-groups	Neocortex (PSC)	5
affinity-groups	REPACSS (TTU)	3

Test questions written: 40 questions in A3_TEST_QUESTIONS.md, organized in 3 groups:

pgvector-targeted (24): Questions about entities we have QAPs for
UKY-targeted (8): General ACCESS questions our 83 pairs probably don't cover
Edge cases (8): Vague, misspelled, or cross-domain questions

Next: Review the test questions, then fire them all through the agent and pull the comparison logs.

2026-03-04 — A.3 Run 2: first full test, unfair comparison discovered

Run 2 executed: Fired all 41 test questions through the agent with DUAL_RAG_LOGGING=true. All 41 succeeded, 40 logged (q41 classified as dynamic/xdmod). Results exported to a3_results/run2.json.

Run 2 results (high-level): UKY answered 36/40, pgvector had matches for 30/40, served by UKY 36, served by pgvector 4.

Built interactive HTML comparison: ~/.agent/diagrams/a3-run2-comparison.html — expandable rows with side-by-side answers, KPI summary, sidebar nav, analysis section.

Synthesis routing fix: pgvector static matches were previously returned as final_answer (raw Q&A pair text). Changed rag_answer.py to set rag_matches + rag_used instead, and added "synthesize" as a third routing option from route_after_rag in graph.py. This routes pgvector results through the LLM synthesis pipeline.

Unfair comparison discovered: Run 2's comparison was apples-to-oranges. UKY answers arrive already LLM-synthesized (UKY's own LLM produces polished prose). pgvector answers in the comparison log were raw Q&A pair text — just the verbatim answer field from the curated pair. This made pgvector look worse than it actually is, since the difference was partly in presentation quality, not underlying knowledge.

2026-03-04 — A.3 Run 3: fair apples-to-apples comparison

Goal: Make the comparison fair by synthesizing pgvector answers through our own LLM before logging them.

What was changed:

rag_comparison_logger.py — Added pgvector_synthesized_answer = Column(Text) to the model and log_comparison() method
rag_answer.py — Imported _format_rag_matches and _synthesize_with_rag_only from synthesize.py. In _dual_rag_answer(), after getting pgvector matches, calls synthesis to produce an LLM-polished answer before logging. This is what the user would actually see if pgvector served the answer.
pyproject.toml — Pinned opentelemetry-instrumentation-langchain<0.53 (newer version had a breaking import for GenAICustomOperationName)
Database — ALTER TABLE rag_comparison_logs ADD COLUMN pgvector_synthesized_answer text;
Test runner — Created a3_results/run_a3_test.py to fire all 41 questions programmatically
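
The schema change above is additive, so existing rows stay valid and simply carry NULL in the new column. The sketch below runs the same ALTER TABLE statement against an in-memory SQLite database purely for illustration; the real table lives in the project's own database, and the `question` column is a hypothetical stand-in for the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified stand-in table; only the table name and the new column name
# come from the journal. The "question" column is hypothetical.
conn.execute("CREATE TABLE rag_comparison_logs (id INTEGER PRIMARY KEY, question TEXT)")
conn.execute("INSERT INTO rag_comparison_logs (question) VALUES ('old row')")

# The migration from the journal: a nullable text column.
conn.execute("ALTER TABLE rag_comparison_logs ADD COLUMN pgvector_synthesized_answer text")

# New rows can populate the column; old rows read back as NULL (None).
conn.execute(
    "INSERT INTO rag_comparison_logs (question, pgvector_synthesized_answer) VALUES (?, ?)",
    ("new row", "synthesized answer text"),
)
rows = conn.execute(
    "SELECT question, pgvector_synthesized_answer FROM rag_comparison_logs ORDER BY id"
).fetchall()
print(rows)
```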

Run 3 results (41/41 succeeded, all logged):

Metric	Value
UKY answered	38/41 (93%)
pgvector answered (synthesized)	27/41 (66%)
Both answered	24 (direct comparison possible)
UKY only	14
pgvector only	3
Avg pgvector similarity score	0.84

Fair comparison conclusions (from HTML analysis at ~/.agent/diagrams/a3-run3-comparison.html):

The two backends are complementary, not competitive. pgvector gives precise, curated answers for entities we've built Q&A pairs for. UKY covers the long tail of general ACCESS knowledge.
pgvector excels on its own domain: Of 25 pgvector-targeted questions (Q1-Q25), pgvector produced synthesized answers for 24 (96%). These are entities with curated Q&A pairs.
UKY handles breadth that pgvector cannot: For 8 UKY-targeted questions (Q26-Q33) about general ACCESS topics (allocations process, Globus, password reset), pgvector answered 0. Our 83 curated pairs simply don't cover these.
UKY produces longer answers (~157% longer on average when both answer the same question). This may reflect UKY's larger document corpus or our more concise synthesis prompt. Length alone doesn't indicate quality.
pgvector retrieval is dramatically faster (~5 ms vs ~2500 ms for UKY), though pgvector now also needs LLM synthesis time (not logged separately).
The quality gap is narrower than Run 2 suggested. With LLM synthesis, pgvector answers read as polished, cited responses. The Run 2 comparison was unfairly penalizing pgvector by showing raw text.
Production recommendation: Use both backends — pgvector for high-confidence domain matches, UKY for everything else. This is already the architecture (_dual_rag_answer uses UKY-primary, pgvector-fallback).

Files produced:

a3_results/run3.json — Full export of 41 comparison log entries
~/.agent/diagrams/a3-run3-comparison.html — Interactive comparison with analysis
a3_results/run_a3_test.py — Test runner script

2026-03-05 — A.3 post-mortem: reframing the question

Realization: The A.3 analysis drifted toward "complementary backends" and fallback architecture. But that wasn't the original question. From FEB_MARCH_PLAN.md:

"proving this approach outperforms document RAG" (line 34)
"We need data on how these two approaches compare before making further investment decisions" (line 65)
"A first because it validates the approach before investing in B" (line 259)

A.3 was a bake-off to decide whether Q&A-pair RAG can replace UKY document RAG — not to build a hybrid system. The "use both" conclusion was the code's existing fallback architecture leaking into the analysis.

Why pgvector lost on breadth (and it's not about quality)

The coverage gap is entirely explained by content type, not approach quality:

What the extraction pipeline covers (5 MCP server domains, entity-focused):

Compute resources (23 entities: ACES, Delta, Anvil, etc.)
Software discovery (1,404 packages)
Allocations (5,440 projects)
NSF awards (10,000+ awards)
Affinity groups (55 groups)

These are all "what is X" questions about discrete entities. The pipeline pulls structured data from MCP servers and generates Q&A pairs about each entity's properties.

What UKY has that we don't (general ACCESS documentation):

How to apply for an allocation (process docs)
How to transfer files / use Globus (how-to guides)
How to reset your password (account management)
Startup vs research allocations (policy docs)
Training resources, publication acknowledgment (educational docs)

These are "how do I" questions about ACCESS-wide processes. They don't live in any MCP server — they live in documentation pages, wikis, and guides that UKY ingested.

We don't know exactly what UKY ingested. The plan has an open question: "Need a list from Andrew of what UKY currently ingests." UKY is a black-box API to us.

The actual A.3 verdict

On entity questions where we have Q&A pairs: pgvector hits 96% (24/25). The synthesized answers are concise and accurate. pgvector retrieval is ~500x faster than UKY (~5ms vs ~2500ms).

On general how-to/process questions: pgvector scores 0%. We simply have zero Q&A pairs for these topics because no MCP server serves allocation process docs or file transfer guides.

The gap is coverage, not quality. If we had Q&A pairs for general ACCESS topics, pgvector would likely match or beat UKY on those too.

Decision point

The plan says Project C ("Extract from ACCESS documentation") was deferred with this note:

"Revisit only if a specific content gap surfaces that exists only in documents with no API equivalent (e.g., narrative tutorials, policy explainers)."

A.3 just surfaced exactly that gap. The 14 UKY-only questions are all process/how-to questions with no API equivalent.

Joe needs to decide:

Pursue Project C — Extract Q&A pairs from ACCESS documentation (not MCP entities). This would close the how-to gap and potentially let pgvector replace UKY entirely. Requires: getting the doc list from Andrew, building a document extractor, running extraction + Argilla review.
Keep UKY for breadth, pgvector for precision — Accept the hybrid architecture. UKY handles general questions, pgvector handles entity questions. Simpler, but you're dependent on UKY's black-box system and can't control answer quality for general topics.
Expand entity coverage first — Before tackling docs, run the existing extraction pipeline against more entities (we only extracted 11 of 23 compute resources, 2 of 1,404 software packages, 2 of 5,440 allocations). More entity coverage might narrow the gap enough.

UKY corpus: confirmed undocumented

Searched all repos (access-qa-planning, access-agent, access-mcp, access-qa-extraction, access-qa-bot) for any documentation of what UKY's system ingests. Found:

pages-current-production.md — "The Q&A backend is hosted at the University of Kentucky." No corpus details.
pages-access-qa-tool.md line 193 — Notes UKY's tech stack as "ChromaDB, llamaindex." No document list.
FEB_MARCH_PLAN.md line 233 — Open question: "Need a list from Andrew of what UKY currently ingests."
uky_client.py — Black-box HTTP client. No corpus metadata.

No list of UKY's ingested documents exists anywhere in our repos. Andrew is the only source for this information.

Research options independent of Andrew

Even without the UKY document list, there are viable paths to continue the bake-off:

Option A: Analyze UKY's 14 winning answers for source clues. Read the UKY-only responses from Run 3 and determine whether the information is unique to some internal corpus or is general ACCESS knowledge available on public web pages (support.access-ci.org, allocations.access-ci.org). UKY's answers may contain citations, URLs, or verbatim language that reveals their source documents. This takes ~30 minutes and informs all other options.

Option B: Generate Q&A pairs from public ACCESS content. Point the extraction pipeline (or a variant) at public ACCESS web pages — the allocations guide, getting started pages, Globus documentation, password reset instructions. These are freely available. Generate Q&A pairs, curate them, load into pgvector, re-run A.3. This directly tests whether closing the topic gap closes the performance gap.

Option C: Determine whether UKY's advantage is unique knowledge or general glue. The 14 UKY-only questions are all process/how-to topics. If UKY is synthesizing from the same public ACCESS web pages any user can read, then the "advantage" is simply that we haven't generated Q&A pairs for those topics yet — not that UKY has access to privileged information. This reframes the bake-off: it's not documents vs Q&A pairs, it's about coverage breadth.

Option D: Expand entity coverage as a control. Add Q&A pairs for remaining MCP server domains (events, announcements, system-status) and more entities within existing domains (we only extracted 11 of 23 compute resources, 2 of 1,404 software packages). This tests whether broader entity coverage alone changes the picture.

Recommended sequence: A first (30 min, informs everything), then B (directly tests the hypothesis), with D as low-effort parallel work.

2026-03-06 — UKY corpus obtained, plan aligned with Andrew

UKY document corpus now available

Andrew provided the full set of documents that feed UKY's document RAG. They are in rag_documents/ (75 files, 69 MB) split across two directories:

staging/ (~47 files): The main corpus. Four categories:

Category	Examples	Count
Resource descriptions	ACES, Anvil, Bridges-2, Delta, Expanse, Jetstream-2, Neocortex, Sage, Voyager, Fabric (PDFs)	~20
User guides	ACES, Anvil, Bridges-2, Delta, Expanse, Jetstream-2, Neocortex, Sage (PDFs)	~10
Process/how-to docs	Allocations, Globus, MFA, add users, progress reports, office hours, events/trainings, system status (docx)	~12
Misc	ARA description, SDS pointer, CloudBank login, REPACSS overview, Sage edge apps, current projects	~5

data/ (~28 files) — Per-resource software lists (txt/csv) and resource-specific documentation:

Software installed lists for ACES, Anvil, Bridges-2, Darwin, Delta, Expanse, Jetstream2, Kyric, Stampede3
Darwin docs (user guide, login, filesystems, job management, SLURM, software)
Delta docs (user guide, data management)
FASTER docs (intro, SLURM partitions, documentation)
ACCESS Travel Rewards (md)

Key observation: The process/how-to docs in staging/ (allocations, Globus, MFA, etc.) are exactly the topics UKY beat pgvector on in A.3. The resource descriptions overlap with what MCP extraction already covers. This confirms the A.3 finding — the gap was coverage, not quality.

Alignment with Andrew

Confirmed the shared end state:

Generate Q&A pairs from these documents — Use a similar two-shot process to what exists for MCP entities, but with documents as input. Andrew: "Probably a similar prompt to the MCP tools can work for generating pairs from docs."
One unified Q&A pair bank in pgvector — Entity pairs (from MCP) + document pairs (from these files) living together, searchable as one corpus.
The orchestrator agent decides routing — RAG for factual queries, MCP for live data, both when needed. Andrew: "The orchestrator agent should decide which tools to use (RAG, MCP, both) and then it should get synthesized. That logic should already exist in access-agent."
UKY goes away — Andrew: "Eventually, we will likely not need the document based RAG since the Q&A pairs are faster." pgvector replaces UKY entirely.

Plan: document extraction pipeline

Step 1: Categorize the corpus. Skim the 75 files and bucket them: resource descriptions (entity overlap with MCP), user guides (process/how-to), general ACCESS docs. Identify what's already covered by MCP extraction vs. what's net new.

Step 2: Build a document extractor in access-qa-extraction. Extend the pipeline to accept documents (PDF/docx) as input. The two-shot prompt structure should carry over — battery pass for coverage, discovery pass for insights. New work: document parsing (PDF text extraction, docx reading) and chunking into logical sections.

Step 3: Run extraction on the full corpus. Generate Q&A pairs from all documents. Push to Argilla for review. This produces pairs for the exact topics pgvector was missing — allocations process, Globus, MFA, user guides.

Step 4: Load into pgvector alongside entity pairs. One unified bank: existing 83 entity pairs + document-sourced pairs. All searchable together.

Step 5: Re-run A.3. Same 41 questions (plus new ones if the expanded corpus suggests them). If pgvector-with-documents matches or beats UKY across the board, the bake-off is won.

Step 6: Simplify the agent routing. Once the Q&A pair bank covers everything, the agent graph simplifies: RAG for factual queries, MCP for live data, synthesis when both contribute. Remove the UKY fallback path.

2026-03-09 — Project C step 1: corpus categorized

Skimmed all 75 files in rag_documents/ and produced a categorized index at rag_documents/CORPUS_INDEX.md. No files were moved or renamed — the index is a read-only reference.

Categorization results

Category	Files	Priority	Rationale
NET-NEW process/how-to	20	First	Fills the exact A.3 gap — allocations, Globus, MFA, Sage, citations, Jupyter
USER GUIDE (deep)	22	Second	Operational depth (job submission, filesystems, SLURM) beyond MCP surface data
MCP OVERLAP (descriptions)	17	Later	1-page resource catalog entries — MCP already covers most of this
DATA FILE	12	Skip	Raw software lists (name/version lines) — MCP software-discovery covers this
POINTER/EMPTY	4	Skip	URL stubs or corrupt files with no substantive content

Key finding: The 20 NET-NEW files are mostly small docx docs — easy to parse, directly address the A.3 gap. The 22 user guides are larger PDFs with real depth (SLURM partitions, data management, module systems). The 17 resource descriptions are 1-page PDFs that overlap with MCP entity data.

Also this session: Consolidated project documentation — SYSTEM_OVERVIEW.md is now single source of truth for architecture, FEB_MARCH_PLAN.md updated with A.3 results and Project C active status, all three docs gist-mirrored, CLAUDE.md updated with document discipline rules.

2026-03-09 — PRs merged, document extractor built (C.2)

Pre-flight: merged outstanding PRs

access-qa-extraction PR #1 (two-shot pipeline) — squash-merged to main. 4,697 additions across the full two-shot extraction pipeline: battery + discovery prompts, LLM judge, incremental cache, Argilla entity-replace, 5 domain extractors, 144 tests. Branch archived on GitHub.

access-qa-planning PR #1 (companion docs) — squash-merged to main. Documentation updates for two-shot pipeline.

access-agent and qa-bot-core — decided to leave on their branches. qa-bot-core is a production product with its own release routine. access-agent's feature/dual-rag-logging branch mixes evaluation scaffolding with production improvements — better to leave as-is until the bake-off concludes.

Smoke-test on main

Reinstalled access-qa-extraction from clean main. 144/144 tests pass. Started mcp-compute-resources Docker container from access-mcp/docker-compose.yml (port 3002). Ran extraction:

qa-extract extract compute-resources --max-entities 1 --no-judge

Produced 8 Q&A pairs for ACES — 5 battery + 3 discovery, all with citations. Two-shot pipeline confirmed working on main.

Built DocumentExtractor (Project C.2)

Branched feat/document-extractor off clean main. Built the document extraction pipeline:

New files:

parsers.py — Standalone document parsing module. parse_docx() (python-docx), parse_pdf() (PyMuPDF/fitz), parse_text() (.txt/.md). Dispatcher parse_document() routes by extension. chunk_text() splits large docs (~6000 words) with overlap. clean_extracted_text() collapses PDF/docx whitespace artifacts.
extractors/documents.py — DocumentExtractor(BaseExtractor). Overrides run() to skip MCPClient (documents are local files). Discovers files recursively from config.url directory. Each document/chunk = one entity. Two-shot LLM pipeline (battery + discovery), judge evaluation, incremental cache — same as MCP extractors. Uses source="doc_generated", source_ref="doc://documents/{entity_id}".
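
A rough sketch of how an extension-based dispatcher like parse_document() could be laid out. The function names come from the list above; the bodies, the PARSERS registry, and the exact whitespace rules are assumptions, not the actual parsers.py. The docx and PDF parsers import their libraries lazily, so the plain-text path works without them installed.

```python
import re
from pathlib import Path


def parse_text(path: Path) -> str:
    return path.read_text(encoding="utf-8", errors="replace")


def parse_docx(path: Path) -> str:
    import docx  # python-docx, imported lazily

    return "\n".join(p.text for p in docx.Document(str(path)).paragraphs)


def parse_pdf(path: Path) -> str:
    import fitz  # PyMuPDF, imported lazily

    doc = fitz.open(str(path))
    try:
        return "\n".join(page.get_text() for page in doc)
    finally:
        doc.close()


# Hypothetical registry mapping file extensions to parser functions.
PARSERS = {".txt": parse_text, ".md": parse_text, ".docx": parse_docx, ".pdf": parse_pdf}


def clean_extracted_text(text: str) -> str:
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # allow at most one blank line
    return text.strip()


def parse_document(path) -> str:
    path = Path(path)
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        raise ValueError(f"Unsupported document type: {path.suffix}")
    return clean_extracted_text(parser(path))
```

Keeping the registry as a plain dict means a new format is one extra entry rather than another branch in the dispatcher.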

Modified files:

pyproject.toml — Added python-docx>=1.0.0, PyMuPDF>=1.24.0
models.py — Added source parameter to QAPair.create() (default "mcp_extraction", backward-compatible)
question_categories.py — Added "documents" to DOMAIN_LABELS, DOMAIN_NOTES, and FIELD_GUIDANCE (5 field groups: overview, key procedures, requirements & eligibility, important details, support & contact)
config.py — Added "documents" MCPServerConfig with url=os.getenv("DOCUMENTS_DIR", "../rag_documents")
extractors/__init__.py — Added DocumentExtractor import and export
cli.py — Added DocumentExtractor to EXTRACTORS registry

Smoke tests

Test 1: qa-extract extract documents --max-entities 1 --no-judge — parsed CORPUS_INDEX.md, produced 6 Q&A pairs about the document corpus.

Test 2: qa-extract extract documents --entity-ids "10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream" --no-judge — parsed a docx file from staging/, produced 5 Q&A pairs about Jetstream citation formats and acknowledgment requirements.

Fix: _title_from_stem() was producing ugly titles from Slack-style filenames (e.g., 10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream). Added re.sub(r"^\d+_[\d.]+_", "", stem) to strip the numeric prefix, plus stripping common prefixes (data-ACCESS-, data:, etc.). Title now renders as "How To Cite Jetstream".
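
A sketch of the title cleanup described in the fix above. The regex is the one quoted in the fix; the prefix tuple only contains the two prefixes named there (the full list is not shown in this journal), and the word-capitalization step is an assumption about how the title is rendered.

```python
import re

# Only the two prefixes named in the journal; the real list may be longer.
COMMON_PREFIXES = ("data-ACCESS-", "data:")


def title_from_stem(stem: str) -> str:
    # Strip the Slack-style "<n>_<timestamp>_" prefix (regex quoted in the journal).
    stem = re.sub(r"^\d+_[\d.]+_", "", stem)
    # Drop the first matching common prefix, if any.
    for prefix in COMMON_PREFIXES:
        if stem.startswith(prefix):
            stem = stem[len(prefix):]
            break
    # Split on separators and capitalize the first letter of each word,
    # leaving the rest of the word untouched so acronyms survive.
    words = [w for w in re.split(r"[-_\s]+", stem) if w]
    return " ".join(w[:1].upper() + w[1:] for w in words)


title = title_from_stem("10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream")
print(title)
```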

All 144 existing tests still pass after all changes.

First extraction run: staging/ directory (C.3)

Ran DOCUMENTS_DIR="../rag_documents/staging" qa-extract extract documents --no-judge on all 47 files in staging/. Took ~25 minutes (94 LLM calls).

Results: 586 Q&A pairs from 83 entities (46 files processed, 1 corrupt file skipped).

Category	Entities	Pairs	Notes
NET-NEW docx (process/how-to)	19	~110	Allocations, MFA, Globus, Sage, Jupyter
User Guide PDFs (chunked)	39 chunks	~290	Jetstream2 (20 chunks), Anvil (6), Bridges-2 (5), etc.
MCP Overlap descriptions	17	~134	1-page resource PDFs
Other (ARA, SDS, REPACSS)	8	~52	Small docs

100% citation markers (<<SRC:documents:...>>)
All pairs use source: "doc_generated"
Large PDFs chunked correctly (~6000 words per chunk with overlap)
Quality spot-check: questions are natural, answers contain specific details (URLs, commands, step-by-step procedures)
Only error: current-access-projects.docx (known corrupt/empty file)

Output at data/output/documents_qa_pairs.jsonl (gitignored). Branch pushed to GitHub.

Not yet run: data/ directory (Darwin, Delta, FASTER docs + ACCESS-Travel-Rewards.md + software lists).

Second extraction run: data/ directory (C.3)

Ran DOCUMENTS_DIR="../rag_documents/data" qa-extract extract documents --no-judge on all files in data/ subdirectories.

Results: 221 Q&A pairs from 29 entities.

Subdirectory	Entities	Pairs	Notes
ACCESS-Resources/Darwin/	9	~65	Managing jobs, user guide, compiling, file systems, etc.
ACCESS-Resources/Delta/	3 chunks	~25	Large PDF chunked into 3
ACCESS-Resources/FASTER/	4	~30	User guide, system overview, jobs, file systems
ACCESS-Travel-Rewards.md	1	~8	Travel reimbursement program
ACCESS-Software-Installed-by-resource/	12	~93	Software lists (package names/versions — generic Q&A quality)

Software-list files produced generic "what software is installed on X" pairs — adequate but not high-value. Argilla reviewers can reject low-quality ones.
Darwin and FASTER docs produced strong procedural content (SLURM commands, file system paths, compilation flags).

Combined output and Argilla push

Saved staging/ output as documents_staging_qa_pairs.jsonl, combined both runs into documents_all_qa_pairs.jsonl (807 total pairs).

Pushed all 807 pairs to Argilla: qa-extract push data/output/documents_all_qa_pairs.jsonl. Records visible in qa-review dataset at http://localhost:6900.

Docker note: Argilla containers had stale network references from previous sessions. Fixed with docker compose down --remove-orphans && docker network prune -f && docker compose up -d.

Added `document_name` metadata field

Problem: When reviewing pairs in Argilla, all 807 records had domain: "documents" with no way to tell which source document they came from — the only clue was the source_ref URI (e.g., doc://documents/10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream), which is opaque. For MCP-extracted pairs, domain provides natural grouping (compute-resources, allocations, etc.), but document pairs lack an equivalent.

Fix: Added document_name as an optional metadata field on QAMetadata, populated from the existing _title_from_stem() helper in DocumentExtractor. The field flows through to Argilla as a filterable TermsMetadataProperty. MCP extractors are unaffected (field defaults to None).

Files changed: models.py (field + factory param), documents.py (passes title), argilla_client.py (schema + record metadata).

Re-extraction: Re-ran both staging/ (611 pairs) and data/ (214 pairs) = 825 total. Deleted old Argilla dataset (no schema for document_name), pushed fresh. 72 unique document names now filterable in Argilla.

Fixed `source_data` for document pairs

Problem: For MCP entity pairs, source_data contains the full entity JSON that the LLM used to generate the Q&A pair — reviewer sees exactly what went in. For document pairs, source_data was set to content_preview: chunk[:500] — the first 500 characters of the chunk. This was misleading: it looked like the source material but only represented a tiny slice of the ~6000-word chunk the LLM actually saw. Reviewers would see a content_preview about topic X when the Q&A pair was about topic Y (from elsewhere in the same chunk).

Fix: Replaced content_preview with a reference: {file, chunk, total_chunks, word_count}. For non-chunked documents, chunk and total_chunks are null. The reviewer sees the file and chunk number; the actual document is in rag_documents/.

Design note on chunking: Large documents (>6000 words) are split into sequential ~6000-word chunks with 500-word overlap. Each chunk is processed as a separate entity — the LLM only sees one chunk at a time, not the whole document. So chunk 9 of a 20-chunk Jetstream PDF starts at roughly word 44,000. This is why the source_ref includes the chunk number (e.g., doc://documents/jetstream-2-user-guide__chunk_9).
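
The offset arithmetic in the design note can be checked with a small sketch. This assumes chunks are cut on whitespace-separated words with a fixed stride of size minus overlap; the real chunk_text() may differ in detail, and the function names below are illustrative.

```python
from typing import List


def chunk_words(words: List[str], size: int = 6000, overlap: int = 500) -> List[List[str]]:
    """Split a word list into sequential chunks that share `overlap` words."""
    if len(words) <= size:
        return [words]
    stride = size - overlap
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # this chunk already reaches the end of the document
        start += stride
    return chunks


def chunk_start(chunk_number: int, size: int = 6000, overlap: int = 500) -> int:
    """Word offset (0-based) where the 1-based chunk `chunk_number` begins."""
    return (chunk_number - 1) * (size - overlap)


words = [str(i) for i in range(13000)]
chunks = chunk_words(words)
print(len(chunks), chunk_start(9))
```

With a 5,500-word stride, chunk 9 begins 8 strides in, which is where the "roughly word 44,000" figure comes from.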

2026-03-10 — Evaluation harness design (Project D)

Andrew asked about making the bake-off self-service: editable golden questions, runnable by the team with their own tokens, comparing different agent configurations ("tool combinations"). Key points from the conversation:

Golden questions: Andrew wants a curated benchmark set that people can view, add, and modify. These are distinct from the Q&A pairs in Argilla — they're the test inputs used to evaluate the agent.
Different tool combinations: Not UKY-vs-pgvector (UKY is going away), but different configurations of our agent — RAG thresholds, MCP server subsets, model choices. Each configuration is a "scenario."
Self-service: Team members should be able to run evaluations and see results without Joe in the loop.
Ongoing process: Re-run as the agent evolves, not a one-shot comparison.

Designed the evaluation harness. Full design saved as EVAL_HARNESS_PLAN.md. Summary:

Golden questions in YAML (merge A3_TEST_QUESTIONS.md + e2e_test_cases.csv → ~55 questions with structured assertions)
Scenario configs as YAML files overriding Settings env vars
CLI runner calling run_agent() directly (not HTTP) to capture full AgentState
HTML report generator producing self-contained comparison pages (matching a3-run3 visual style)
New access-agent/eval/ directory

Added as Project D in FEB_MARCH_PLAN.md (D.1–D.4), parallel with Project B after C.4 completes.

Pivot: Initially designed as a CLI-based Python tool (access-agent/eval/). Revised to a static web app on Netlify (eval-ui/) — no Python environment needed, users just open a browser. Golden questions and scenarios bundled at build time, results displayed inline and exportable as JSON. Two open design questions flagged: (1) how scenarios actually change agent behavior given the current API doesn't accept config overrides, and (2) API key routing (server-side vs pass-through). Plan saved as EVAL_HARNESS_PLAN.md.

This is future work — immediate next step remains C.4 (review Argilla, sync pgvector, re-run A.3).

2026-03-10 — C.4: Meta-referencing fix, re-extraction, A.3 Run 4

Meta-referencing problem in document Q&A pairs

Spot-checked the 825 document pairs in Argilla and found a systematic quality issue: 36% (300/825) of generated questions referenced the source documents rather than the subject matter.

Examples:

Wrong: "What are the important quotas and limits mentioned in the Darwin Filesystems Storage document?"
Right: "What are the storage quotas on Darwin?"

Root cause analysis: Two contributing factors:

FIELD_GUIDANCE field group #1 said "what is this document about?" — 90% of seq-1 (overview) pairs were meta-referencing.
Entity titles included document-type suffixes ("Jetstream 2 User Guide") which primed the LLM to treat the document as the subject.

Prompt and code fixes

question_categories.py — Two changes:

Added explicit anti-meta-referencing instruction to DOMAIN_NOTES["documents"] with wrong/right examples.
Reworded all 5 field groups in FIELD_GUIDANCE["documents"] to avoid document-referencing (e.g., "Overview — what is this topic about?" instead of "what is this document about?").

documents.py — Added regex to _title_from_stem() to strip document-type suffixes ("User Guide", "Manual", "Handbook", etc.) so the LLM sees "Jetstream 2" instead of "Jetstream 2 User Guide" as the entity name.
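
One way the suffix stripping could look, as a sketch. The suffix list contains only the three types named above (the journal's "etc." is not filled in), and the helper name is hypothetical rather than the code in documents.py.

```python
import re

# Only the suffixes named in the journal; the real pattern may cover more.
DOC_TYPE_SUFFIX = re.compile(r"\s+(?:User Guide|Manual|Handbook)\s*$", re.IGNORECASE)


def strip_doc_type_suffix(title: str) -> str:
    """Remove a trailing document-type label so the subject is the entity name."""
    return DOC_TYPE_SUFFIX.sub("", title).strip()


print(strip_doc_type_suffix("Jetstream 2 User Guide"))
```

The pattern is anchored to the end of the title, so a title that merely contains one of these words elsewhere is left alone.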

Re-extraction results

Three extraction runs after iterating on fixes:

Staging (first fix): 608 pairs, 10% meta (down from 36%)
Staging (with title suffix fix): 604 pairs, 0.9% meta (6 remaining)
Data directory: 228 pairs

Combined: 832 pairs, 6 meta-referencing (0.7%). Cleared Argilla and pushed fresh.
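
This entry does not record how meta-referencing questions were counted. As one plausible approach, a simple regex heuristic can flag questions that talk about "the ... document" or "this guide" instead of the subject. The pattern and function names below are hypothetical and would need tuning against real pairs.

```python
import re

# Hypothetical heuristic: "in/by/of the <up to six words> document|guide|manual|handbook".
META_RE = re.compile(
    r"\b(?:in|by|of)\s+(?:this|the)\s+(?:[\w-]+\s+){0,6}(?:document|guide|manual|handbook)\b",
    re.IGNORECASE,
)


def is_meta_referencing(question: str) -> bool:
    return bool(META_RE.search(question))


def meta_rate(questions) -> float:
    """Fraction of questions flagged as meta-referencing."""
    questions = list(questions)
    if not questions:
        return 0.0
    return sum(is_meta_referencing(q) for q in questions) / len(questions)


wrong = ("What are the important quotas and limits mentioned in the "
         "Darwin Filesystems Storage document?")
right = "What are the storage quotas on Darwin?"
print(is_meta_referencing(wrong), is_meta_referencing(right))
```

A heuristic like this will miss some phrasings and flag some legitimate questions, so it is better suited to tracking the trend across runs than to filtering pairs automatically.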

A.3 Run 4 — the bake-off

Brought up all services locally (qa-service on 8001, access-agent on 8000). Synced 832 document pairs from Argilla and loaded 70 entity pairs via JSONL. Total: 902 pairs in pgvector.

Fired all 41 test questions. Results:

Metric	Run 3 (83 pairs)	Run 4 (902 pairs)
UKY hits	38/41 (93%)	40/40 (100%)
pgvector hits	27/41 (66%)	27/40 (68%)
pgvector avg latency	~5ms	~30ms

pgvector coverage stayed essentially flat (66% to 68%) despite roughly 11x more pairs (83 to 902).

The architectural insight

The 13 missed questions fall into two categories:

Missing source content (4 questions) — Ranch storage has zero Q&A pairs because no Ranch documents exist in rag_documents/ and Ranch wasn't returned from MCP in the extraction run that generated the original test questions.
No cross-cutting Q&A pairs (9 questions) — General ACCESS questions ("How do I apply for an allocation?", "How do I transfer files between resources?", "What training does ACCESS offer?") have no matching pairs even though we have 104 allocation mentions, 50 transfer/Globus mentions, and 40 training mentions across our pairs. The problem: all those mentions are entity-scoped. We have "How do I cite Jetstream?" but not "How do I acknowledge ACCESS?" We have "What allocations does Anvil support?" but not "How do I apply for an allocation?"

The extraction pipeline processes one document at a time, so it only ever generates entity-scoped Q&A pairs. It will never produce cross-cutting "How does ACCESS work in general?" pairs from a single-document prompt.

UKY's advantage is architectural: chunk-level retrieval at query time lets it pull relevant fragments from multiple documents and synthesize on the fly. It doesn't need a pre-generated answer that matches — it just needs chunks that are individually relevant. Our Q&A-pair RAG needs a pair whose question semantically matches the user's question, and no single entity-scoped pair matches a cross-cutting query closely enough.

Decision questions for Andrew

Manually curate cross-cutting pairs — Write 20-30 general ACCESS Q&A pairs by hand. Fast, targeted, but doesn't scale.
Add a cross-cutting extraction pass — Feed the LLM multiple documents simultaneously and ask for general questions that span topics. New pipeline capability.
Keep UKY as fallback for general questions — Accept the hybrid. pgvector for entity questions (fast, verified), UKY for cross-cutting (slow, unverified).
Lower similarity thresholds — Some misses scored 0.55-0.68, not far from the 0.70 cutoff. Won't fix the 0.28-0.49 misses.
Detect cross-cutting-ness at query time — Instead of pre-generating cross-cutting pairs, use pgvector match quality as a signal: low scores with scattered partial matches → route to document chunk RAG or MCP tools. Fits existing agent graph routing.

Files produced

a3_results/run4.json — 40 comparison log entries
a3_results/run4_enriched.json — enriched with low-threshold best-possible scores
~/.agent/diagrams/a3-run4-bakeoff.html — interactive comparison visualization

Answer richness gap (second dimension)

Even when pgvector hits, many answers are thinner than UKY's. Investigated whether pgvector answers were bypassing LLM synthesis — confirmed they are NOT: _dual_rag_answer() calls _synthesize_with_rag_only() for every pgvector match. The real issue: a single pre-digested Q&A pair gives the synthesis LLM very little to work with, so it returns near-verbatim text. UKY pulls multiple document chunks and the LLM has more raw material to synthesize a richer answer.

However, reviewing side-by-side answers revealed a more nuanced picture:

Some pgvector answers are actually better than UKY's (more precise, directly relevant)
Some just need link enrichment (the synthesis prompt doesn't encourage adding URLs)
Some questions UKY can't answer but pgvector can (entity-specific data from MCP)

This shifts the framing from "pgvector vs UKY" to "how to combine them intelligently."

Quick fix (low-effort, high-impact): The RAG_ONLY_SYNTHESIS_PROMPT in synthesize.py says "Be concise and direct" — this is why the LLM returns near-verbatim single sentences. Updating the prompt to encourage link inclusion, practical context, and resource pointers would immediately enrich thin answers without any architectural changes. The Q&A pair metadata already carries domain and entity_id which could drive link generation.

5th strategic option: cross-cutting detection at query time

Instead of generating cross-cutting Q&A pairs up front, detect cross-cutting-ness at query time based on pgvector results and route accordingly:

pgvector score < threshold but > 0.4 → content exists but scattered → fall back to document chunk RAG or plan+MCP
pgvector hit but thin answer → enrich with MCP tool calls or document chunks
pgvector hit with rich answer → serve it (fast, verified)
pgvector zero matches → missing content → MCP or UKY fallback

This fits the existing agent graph — rag_answer already evaluates match quality and routes to plan on weak matches. The change: make that evaluation smarter about why the match is weak.
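
The four cases above can be sketched as a single routing function. The 0.70 threshold and the 0.4 floor are the values mentioned in this entry; the route names, the character-count test for a "thin" answer, and the handling of non-zero scores at or below 0.4 are assumptions made only to keep the example complete.

```python
from typing import Optional


def route_by_match_quality(
    best_score: Optional[float],
    answer_chars: int = 0,
    threshold: float = 0.70,        # cutoff mentioned in this entry
    scattered_floor: float = 0.4,   # lower bound for "content exists but scattered"
    thin_chars: int = 200,          # hypothetical cutoff for a "thin" answer
) -> str:
    if best_score is None:
        # Zero matches: the content is missing.
        return "mcp_or_uky_fallback"
    if best_score >= threshold:
        # A hit: serve it if rich, otherwise enrich with MCP calls or chunks.
        return "enrich" if answer_chars < thin_chars else "serve"
    if best_score > scattered_floor:
        # Content exists but is scattered across entity-scoped pairs.
        return "chunk_rag_or_plan"
    # Very low scores are not covered by the four cases above;
    # this sketch treats them like missing content.
    return "mcp_or_uky_fallback"


print(route_by_match_quality(0.62))
```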

Bugs noted (not fixed)

threshold=0.0 falsy in vectorstore.py: threshold or settings.rag_similarity_threshold treats 0.0 as falsy, falling back to default 0.85. Affects diagnostic queries with threshold=0.
q21 not logged: "How much funding did the pollinator conservation AI project get?" was classified as non-RAG (40/41 logged).
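
The threshold bug is the standard Python pitfall of using `or` to supply a default for a value that can legitimately be zero. A minimal reproduction, with an explicit None check as one possible fix (the helper names are illustrative and not taken from vectorstore.py):

```python
from typing import Optional

DEFAULT_THRESHOLD = 0.85  # default value named in the bug note


def resolve_threshold_buggy(threshold: Optional[float]) -> float:
    # 0.0 is falsy, so "or" silently replaces it with the default.
    return threshold or DEFAULT_THRESHOLD


def resolve_threshold_fixed(threshold: Optional[float]) -> float:
    # Only fall back when the caller passed nothing at all.
    return DEFAULT_THRESHOLD if threshold is None else threshold


print(resolve_threshold_buggy(0.0), resolve_threshold_fixed(0.0))
```

The explicit check keeps threshold=0 usable for diagnostic queries while leaving the default behavior unchanged.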

2026-03-11 — Run 4 reanalysis: UKY hit rate was overcounted

Discovery: The Run 4 summary reported "UKY hits 40/40 (100%)" — but this counted every UKY response as a hit, including hedges like "The provided documents do not contain specific information about Abaqus. Please open a support ticket." Applied the same hedge detection used at runtime (_rag_answer_is_weak in graph.py) to the logged responses.
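
For illustration, hedge detection of this kind can be as simple as a few phrase patterns. The sketch below is not the actual _rag_answer_is_weak implementation; the patterns are assumptions modeled on the quoted UKY response.

```python
import re

# Hypothetical hedge phrases, modeled on the quoted UKY response.
HEDGE_PATTERNS = [
    r"do(?:es)? not contain (?:any |specific )?information",
    r"i (?:don't|do not) have (?:enough |specific )?information",
]
_HEDGE_RE = re.compile("|".join(HEDGE_PATTERNS), re.IGNORECASE)


def is_hedged(answer: str) -> bool:
    """Treat empty answers and 'no information' responses as non-answers."""
    return not answer.strip() or bool(_HEDGE_RE.search(answer))


uky_response = (
    "The provided documents do not contain specific information about Abaqus. "
    "Please open a support ticket."
)
print(is_hedged(uky_response))
```

Applying one shared check to both the runtime path and the logged responses keeps the reported hit rate consistent with what users would actually see.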

Corrected Run 4 numbers:

Metric	pgvector	UKY
Genuine answers	27/40 (68%)	13/40 (33%)
Hedged / no match	13	27

Head-to-head breakdown:

Both answered well: 8
pgvector only (UKY hedged): 19
UKY only (pgvector no match): 5 — all general process questions (allocations, password reset, file transfer)
Neither answered well: 8
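
As a consistency check, the breakdown can be tallied from per-question outcomes and compared with the corrected totals. The sketch rebuilds the outcome list from the counts reported above rather than from run4.json, whose record layout is not shown here.

```python
from collections import Counter

LABELS = {
    (True, True): "both",
    (True, False): "pgvector_only",
    (False, True): "uky_only",
    (False, False): "neither",
}


def head_to_head(results) -> Counter:
    """results: iterable of (pgvector_ok, uky_ok) booleans, one per question."""
    return Counter(LABELS[(bool(p), bool(u))] for p, u in results)


# Outcome list reconstructed from the reported counts (8 / 19 / 5 / 8).
results = (
    [(True, True)] * 8
    + [(True, False)] * 19
    + [(False, True)] * 5
    + [(False, False)] * 8
)
tally = head_to_head(results)
pgvector_genuine = tally["both"] + tally["pgvector_only"]
uky_genuine = tally["both"] + tally["uky_only"]
print(tally, pgvector_genuine, uky_genuine)
```

The four buckets sum to the 40 logged questions, and the per-backend totals match the 27 and 13 genuine answers in the corrected table.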

What this means: pgvector already outperforms UKY 2-to-1. UKY's 19 entity-specific hedges are questions pgvector handles from curated MCP data (software versions, resource specs, NSF awards) that UKY's document corpus simply doesn't cover. The "UKY as strong fallback" framing was wrong — UKY adds value on only 5 questions, all cross-cutting process topics.

Remaining gap (13 questions): 5 cross-cutting process questions (UKY answers, pgvector doesn't) + 8 that neither backend handles. A document-chunk fallback triggered by cross-cutting detection would address most of these, but the urgency is lower than previously thought.

Also this session: Updated SYSTEM_OVERVIEW.md routing table with file names, condition explanations, and node descriptions. Synced gist.

2026-03-14 — Friendly battery bake-off: capability confirmed

What we ran

50 clean, well-phrased questions (FRIENDLY_BATTERY.md) against both UKY and NEWSYSTEM. Questions intentionally well-phrased (no typos, no complaint framing, no vagueness) to test capability separately from robustness. Covers the same topics as the real-user battery but with proper phrasing.

UKY battery: 50/50 success, 247.5s total (~3.9s avg). Saved to a3_results/friendly-uky.json.
NEWSYSTEM battery: 50/50 success, 310.0s total (~5.2s avg). Saved to a3_results/friendly-ns.json.
Comparison report: ~/.agent/diagrams/friendly-battery-comparison.html

Key numbers

Metric	UKY	NEWSYSTEM
Avg response length	1,380 chars	777 chars
Avg latency	3.9s	5.2s
Errors	0	0
Short responses (<200c)	0	3

NEWSYSTEM source breakdown

Source	Count	Notes
RAG retrieval	24/50 (48%)	Up from 18% on real-user queries
MCP tool calls	7/50 (14%)	search_resources, get_resource_hardware, search_software, search_events
LLM-only	19/50 (38%)	No external data, answered from Claude's training

Where the 19 LLM-only questions cluster

Allocations/credits (3): allocation types, applying, exchange allocations
Account management (3): password reset, new account, office hours
Cross-cutting process (2): Globus file transfer, acknowledging ACCESS
Resource-specific how-tos (7): SLURM submission, logging in, SU calculations, data management
Troubleshooting (2): SLURM qos errors, common parameters
Credit conversion (2): SU calculations, GPU hours from credits

These are all topics where Q&A pairs exist in pgvector but the classifier routes them as "dynamic" or the similarity scores fall below threshold. The answers from LLM training data are often reasonable (avg 1,028 chars) but ungrounded — no citations.

The comparison with real-user battery

Metric	Real-user (messy)	Friendly (clean)
RAG hit rate	18% (9/50)	48% (24/50)
MCP tool use	14% (7/50)	14% (7/50)
LLM-only	68% (34/50)	38% (19/50)
Errors/failures	7 (JSM errors)	0
Short responses	7 (<100c)	3 (<200c)

RAG hit rate nearly triples with clean phrasing. MCP usage is identical (the classifier correctly identifies dynamic questions regardless of phrasing quality). The 7 JSM error messages from the real-user battery are gone because these clean questions don't trigger complaint-framing misclassification.

What this means

The system works. When given fair input, NEWSYSTEM retrieves relevant Q&A pairs and produces grounded answers. The 18% RAG hit rate was an input robustness problem, not a capability problem.
But it's still not great. A 48% RAG hit rate means 52% of well-phrased questions miss the knowledge base, and 19/50 (38%) get no external grounding at all, which is still too many. The Q&A pair matching threshold (0.70) is filtering out matches that score in the 0.55–0.69 range.
UKY wins on answer richness. Longer answers on 43/50 questions. Document chunks provide more context than distilled Q&A pairs. This is the thin-answer problem from earlier runs, still present.
Classifier is less of a problem with clean input. Zero JSM misfires. The classifier works when questions are well-formed; it breaks on complaint framing and error pastes.

Next steps

Lower similarity threshold from 0.70 to ~0.60 — recover the 0.55–0.69 band matches
Fix classifier prompt — distinguish "needs live data" from "describes a problem with a documented solution"
Add RAG fallback for failed dynamic — when tools fail or planner gives up, try Q&A pair search
Re-run real-user battery after tuning — measure improvement
If still not enough — present evidence for document-chunk RAG as fallback layer

2026-03-13 — Real-user battery bake-off: NEWSYSTEM regression diagnosed

What we ran

Full clean-state bake-off: reset pgvector + Argilla → re-extract entities (80 pairs) → re-extract documents (603 pairs) → Argilla sync to pgvector (683 total pairs) → run 50 real-user questions from REAL_USER_BATTERY.md against both UKY and NEWSYSTEM separately.

UKY battery: 50/50 success, 275.6s total (~5.5s avg). Saved to a3_results/uky-battery.json.
NEWSYSTEM battery: 50/50 success, 487.6s total (~9.8s avg). Saved to a3_results/ns-battery.json.

Results: NEWSYSTEM performed worse than previous runs

Metric	UKY	NEWSYSTEM
Avg response length	~1,100 chars	~680 chars
Ultra-short failures (<100 chars)	0	7
RAG hit rate	N/A	9/50 (18%)
Tool usage	N/A	24 calls across 7 tool types
Total duration	4.6 min	8.1 min

Root cause: three compounding problems

1. Threshold too aggressive for real-user queries (biggest factor). The agent uses 0.70 for static queries. Only 9/50 cleared that bar. But useful data exists just below it — "ACES specifications?" scored 0.694 (missed by 0.006), "How can I use my allocations?" scored 0.618, "login.expanse.sdsc.edu" scored 0.608. The Q&A pairs are phrased as clean specific questions; real user queries are short, vague, and messy. The 0.70 threshold worked for our hand-crafted 41-question battery but fails on real user language.

Direct qa-service queries at threshold=0.01 confirmed data IS there — the agent's threshold is filtering it out. 25/50 questions had zero matches at threshold 0.70 but would have found relevant pairs at lower thresholds.

2. Over-classification as "dynamic" (16/50 = 32%). Dynamic classification bypasses RAG entirely. Of 16 dynamic-classified questions:

7 routed to JSM domain → all failed with "tools unavailable" (JSM MCP server not running). These produced identical 92-char canned error messages.
9 others classified dynamic with domain=None → 4 succeeded with tools, 5 had no applicable tools and fell to LLM-only.

Queries like "Having password issues using ssh to login" (q32) and "I am invited to a new project but I cannot see it" (q33) are arguably static how-to questions that RAG should handle, but the classifier read them as user-specific dynamic problems.

3. No fallback: dynamic misses don't try RAG. When a dynamic query fails (JSM unavailable, no tools applicable), the agent generates from LLM training data. There is no "if tools fail, try RAG" recovery path.
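
One possible shape for the missing recovery path, as a sketch only. Nothing here is the agent's actual code: `Answer`, `answer_dynamic`, and the three callables are hypothetical stand-ins for the tool executor, the Q&A pair search, and plain LLM generation:

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Answer:
    text: str
    source: str  # "tools", "rag", or "llm"


def answer_dynamic(
    query: str,
    run_tools: Callable[[str], Optional[str]],
    search_qa_pairs: Callable[[str], Optional[str]],
    generate: Callable[[str], str],
) -> Answer:
    """Try tools first, then Q&A-pair RAG, and only then ungrounded generation."""
    try:
        tool_result = run_tools(query)
    except Exception:
        # e.g. an MCP server that is not running
        tool_result = None
    if tool_result:
        return Answer(tool_result, "tools")

    rag_result = search_qa_pairs(query)
    if rag_result:
        return Answer(rag_result, "rag")

    return Answer(generate(query), "llm")


def jsm_down(_: str) -> Optional[str]:
    # Simulates the "tools unavailable" failure seen in this run.
    raise ConnectionError("tools unavailable")


result = answer_dynamic(
    "Having password issues using ssh to login",
    run_tools=jsm_down,
    search_qa_pairs=lambda q: "matched Q&A pair text",
    generate=lambda q: "ungrounded answer",
)
print(result.source)  # rag
```

The point of the ordering is that LLM-only generation becomes the last resort rather than the default when tools fail.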

Net effect

Of 50 real-user questions: 9 hit RAG, 5 succeeded with tools, 7 got JSM errors, and 29 were answered from pure LLM generation with no external data. The system barely used its knowledge base.

Why worse than Run 4

Run 4's 41 questions were hand-crafted to be clean and specific. Real user queries are short ("is lammps on stampede"), vague ("How do it get to anvil"), or framed as complaints ("I can't login to Expanse right now"). The threshold + classifier combo that worked for clean queries falls apart on messy ones.

Classification breakdown

Type	Count	Notes
static	31 (62%)	25 got zero RAG matches, 6 hit RAG
dynamic	16 (32%)	7 JSM failures, 4 tool successes, 5 LLM-only
combined	3 (6%)	2 hit RAG, 1 failed tools

RAG hits (9 questions that cleared threshold)

Q#	Score	Query (truncated)
q3	0.733	NSG allocation, get started on Expanse
q6	0.730	Jetstream2 storage for deltaai
q8	0.752	account for NCSA DeltaAI
q12	0.715	How do it get to anvil
q14	0.838	hardware specifications for Anvil
q15	0.718	specific type of GPU for Bridges-2
q18	0.632	TAMU ACES: 1 of 1 SUs remaining (combined, threshold 0.60)
q21	0.731	credit for 1 node of jetstream2 GPU
q36	0.735	resources have comsol + gpu (combined, threshold 0.60)

Files produced

a3_results/uky-battery.json — 50 UKY responses
a3_results/ns-battery.json — 50 NEWSYSTEM responses with full node traces
~/.agent/diagrams/bakeoff-battery-comparison.html — side-by-side HTML comparison

Analysis: three compounding problems, two potential fixes, one strategic question

Problem 1 — Threshold (0.70) too aggressive for messy queries. Lowering to 0.55–0.60 would recover ~5-8 more RAG hits. Below 0.50 starts pulling wrong-topic matches (e.g., "Can I change my ACCESS ID?" at 0.346 matches "How can a researcher access Anvil?" — completely wrong). The 0.60–0.70 band has genuinely useful near-misses: "ACES specifications?" at 0.694, "How can I use my allocations?" at 0.618, "login.expanse.sdsc.edu" at 0.608.
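
A small sweep over the four scores quoted above shows where the band boundaries fall. The scores and the on-topic/wrong-topic labels come from this entry; the `sweep` helper is hypothetical and four data points are obviously not a calibration set:

```python
# Best-match scores quoted in this entry; `relevant` marks whether the
# matched pair was on-topic according to the notes above.
NEAR_MISSES = [
    ("ACES specifications?", 0.694, True),
    ("How can I use my allocations?", 0.618, True),
    ("login.expanse.sdsc.edu", 0.608, True),
    ("Can I change my ACCESS ID?", 0.346, False),
]


def sweep(threshold: float) -> tuple[int, int]:
    """Return (relevant hits, wrong-topic hits) at a given threshold."""
    passed = [rel for _, score, rel in NEAR_MISSES if score >= threshold]
    return sum(passed), len(passed) - sum(passed)


for t in (0.70, 0.60, 0.55, 0.30):
    print(t, sweep(t))
```

At 0.70 none of these clear the bar; at 0.60 and 0.55 the three relevant near-misses are recovered with no wrong-topic match; only at a threshold as low as 0.30 does the 0.346 wrong-topic pair get through.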

Problem 2 — Classifier over-routes to dynamic/JSM. 7 questions routed to JSM (Jira Service Management) because they sounded like complaints ("password not working", "shows pending", "can't see my project"). These are static how-to questions that RAG should handle. The classifier reads frustration/complaint framing and assumes it needs a live support ticket lookup. Two clear misclassifications (q40 "how can I access my slurm", q41 "sbatch error") and several judgment calls where static would have been more pragmatic (q28, q42). q46 ("link to register for the webinar?") was correctly classified as dynamic — webinar links are live data.

Problem 3 — No fallback from failed dynamic. When dynamic classification fails (JSM not running, no applicable tools), the agent generates from LLM training data with no external grounding. Adding a RAG fallback for failed dynamic queries would recover 5-7 questions without touching the threshold.

The strategic question: is threshold/classifier tuning a dead end?

Joe's concern: we'll drop thresholds, risk false positives, tune classifier rules against an imperfect surface, get a few more questions right, and still not be good enough. The fundamental brittleness is that Q&A-pair RAG requires the user's question to semantically match a pre-generated question. Real users don't phrase things that way.

Document-chunk RAG as fallback: Joe suspects we'll eventually need document chunks in the mix — like UKY does. High-confidence Q&A pair matches get served directly (fast, verified). When Q&A similarity is weak, fall back to document-chunk retrieval where the LLM finds the answer within a broader text passage. This is more resilient to messy phrasing because chunks have more surface area for partial matches.

Andrew's prior directive was no document-chunk RAG — just return more Q&A pair matches and let synthesis combine them. This worked for union-type cross-cutting queries but doesn't solve the phrasing-mismatch problem. The real-user battery results may change this conversation.

Plan: friendly battery first, then tune, then re-test

Before tuning anything, separate two questions that are currently conflated:

Can our system match UKY when given well-phrased questions about entities we cover? (capability test)
Can our system handle messy real-user language? (robustness test)

Step 1: Build a "friendly battery" — 30-40 clean, well-phrased questions specifically targeting the 683 Q&A pairs in pgvector. Topics we know we have pairs for. If NEWSYSTEM matches or beats UKY here, the system works and the problem is purely input robustness.

Step 2: Tune classifier + threshold — targeted fixes based on the failure analysis above.

Step 3: Re-run real-user battery — measure how much tuning helped.

Three data points: best-case, tuned, real-world. If the gap between friendly and real-world remains huge after tuning, that's concrete evidence for document-chunk RAG as a fallback layer.

WHERE WE ARE — resume point (updated 2026-03-14)

Friendly battery confirms NEWSYSTEM capability. 48% RAG hit rate on clean questions (vs 18% on messy real-user queries). 19/50 answered LLM-only — mainly allocations, account management, and cross-cutting process questions where Q&A pairs exist but similarity threshold filters them out. UKY still wins on answer length (1,380 vs 777 avg chars). The system works; the problem is input robustness (threshold + classifier). Next: lower threshold to ~0.60, fix classifier, add RAG fallback for failed dynamic, then re-run real-user battery.

What's done

A.1 (Argilla → pgvector sync) ✅
A.2 (dual-RAG logging in access-agent) ✅
A.3 Runs 1–6 complete ✅ — RAG-vs-RAG (Runs 1–4), full-system (Runs 5–6)
Post-mortem analysis ✅ — gap is content type (entity vs process), not quality
UKY corpus obtained ✅ — 75 files in rag_documents/
Direction confirmed with Andrew ✅ — generate Q&A pairs from docs, unify in pgvector, retire UKY
C.1 corpus categorized ✅ — index at rag_documents/CORPUS_INDEX.md
C.2 document extractor built ✅ — committed and pushed on feat/document-extractor
C.3 extraction complete ✅ — 832 pairs (604 staging + 228 data), meta-referencing fixed (36% → 0.7%)
Outstanding PRs merged ✅ — both access-qa-extraction and access-qa-planning PRs squash-merged
C.4 sync + bake-off ✅ — 902 pairs in pgvector (832 document + 70 entity), 40 questions answered
A.3 Run 4 reanalysis ✅ — pgvector 68% vs UKY 33% (hedge responses excluded)
A.3 Run 5 ✅ — full-system test (pgvector + MCP + routing). 24 RAG, 5 MCP, 12 LLM-only.
Node tracing ✅ — node_trace in AgentState, gated behind ?include_trace=true (commits 04342c8, b7a9bec)
Top-5 matches + enriched synthesis prompt ✅ — RAG_TOP_K 3→5, prompt rewritten (commit ef43a21)
A.3 Run 6 ✅ — 27 RAG, 5 MCP, 9 LLM-only. Thin-answer problem confirmed as extraction-level issue.
Real-user query analysis ✅ — 4,887 unique queries from chatbot CSV, topic/resource distribution mapped
Real-user test battery ✅ (DRAFT) — 50 questions sampled by real user interest (REAL_USER_BATTERY.md)
Entity alignment analysis ✅ — MCP vs docs Venn diagram mapped, top 8 resources identified
Real-user bake-off (clean state) ✅ — 683 pairs, 50 questions, UKY vs NEWSYSTEM. NEWSYSTEM regression: 18% RAG hit rate, 7 JSM failures, 29/50 LLM-only.
Regression root cause analysis ✅ — three compounding problems: threshold too high for messy input, classifier over-routes to dynamic/JSM, no fallback from failed dynamic
Friendly battery ✅ — 50 clean questions, 48% RAG hit rate (vs 18% real-user), confirms capability. Problem is input robustness, not system capability.

The core findings

pgvector is already ahead: 27/40 genuine answers vs UKY's 13/40. pgvector covers entity-specific data (software, resources, awards) that UKY cannot.
Full system closes more gaps: MCP tools answer Ranch questions and project search (Runs 5–6). LLM-only count dropped from 12 → 9 with top-5 matches pulling more questions into RAG.
Cross-cutting gap splits into two types: Union-type queries now partially addressed by top-5 multi-match synthesis. Procedural queries ("How do I apply for an allocation?") still need hand-curated cross-cutting Q&A pairs (~5 questions).
Answer richness gap is upstream: Even when pgvector hits, answers are thinner than UKY's (e.g., q2: bare accelerator list vs UKY's unit counts, model numbers, memory specs). The synthesis prompt can't add detail that isn't in the Q&A pairs. Fix is in access-qa-extraction prompts — affects both MCP and document extractors.
Real users ask about Expanse, Delta, Anvil most: 246, 181, 174 mentions respectively. Ranch (11 mentions) was overrepresented in the original 41-question battery.

Entity alignment: MCP ∩ Documents (top 8 by user interest)

Resource	MCP ID	User mentions
Expanse	expanse.sdsc.access-ci.org	246
Delta	delta.ncsa.access-ci.org	181
Anvil	anvil.purdue.access-ci.org	174
Bridges-2	bridges2.psc.access-ci.org	102
ACES	aces.tamu.access-ci.org	83
Jetstream2	jetstream2.indiana.access-ci.org	62
Stampede3	stampede3.tacc.access-ci.org	55
Sage	sage.northwestern.edu	43

All 8 exist in both MCP and rag_documents/. Currently only ACES has MCP entity pairs in pgvector.

What's next

Classifier + threshold tuning — review classifier prompt, fix over-routing to dynamic/JSM. Lower static threshold from 0.70 to ~0.60. Add RAG fallback for failed dynamic queries.
Re-run real-user battery — measure improvement from tuning. Three data points: friendly (48% RAG), tuned real-user, original real-user (18% RAG).
Assess document-chunk RAG — if tuning doesn't close the gap, present the evidence to Andrew for adding document-chunk retrieval as a fallback layer beneath Q&A pair matching.
Project D — evaluation harness (EVAL_HARNESS_PLAN.md)
Project B — feedback protocol design

2026-03-13 — Real-user query analysis, entity alignment, test battery creation

Chatbot query analysis

Analyzed chatbot_log_all_data.csv — 5,780 rows from the production chatbot (connected to UKY). After deduplication, length filtering, and removing our own test battery questions: 4,887 unique real user queries saved to a3_results/real_user_queries.json.

Topic distribution:

Specific resources: 20% (Expanse 246, Delta 181, Anvil 174, Bridges 102, ACES 83, Jetstream 62, Stampede 55, Sage 43)
Allocations: 17%
Account/access: 16%
GPU: 8%
Jobs/SLURM: 8%
Software: 5%
Storage/data: 4%
Other (general ACCESS): 42%

Ranch had only 11 mentions (0.2%) but occupied 4/41 questions (10%) in our original battery — significantly overrepresented.

Entity alignment: MCP vs documents

Mapped which compute resources exist in MCP (search_resources returns 23), which have documents in rag_documents/, and which are in pgvector. Key finding: only ACES has MCP entity pairs in pgvector (8 pairs). The other 22 MCP resources were never extracted because extraction was run with a --max-entities cap during testing.

10 resources exist in both MCP and docs. 13 are MCP-only (including Ranch). 2 are docs-only (Darwin, FASTER). The top 8 by real user interest all exist in both sources.

Real-user test battery

Sampled 50 questions from the 4,887 real queries, weighted by actual user interest. Saved as REAL_USER_BATTERY.md (parallel to A3_TEST_QUESTIONS.md) and a3_results/real_user_battery_50.json. Filtered out context-dependent follow-ups, pasted errors, very short/long queries, and our own test questions. Kept realistic messiness (typos, vague phrasing).

Clarifying the plan

Spent significant time untangling confusion about what the system is, what we're testing, and why. The confusion stemmed from conflating three separate problems: thin answers (extraction prompt quality), missing entities (extraction coverage), and cross-cutting gaps (architectural limitation of per-entity extraction). Wrote FRIDAY_THE_13TH.md as a narrative summary of the full arc and decisions.

Key clarification: the system under test is access-agent's LangGraph with MCP-extracted QAPs + document-extracted QAPs + MCP tools at runtime. Both QAP sources stay — MCP gives structured specs, docs give process/how-to. No document-chunk RAG for now (Andrew's directive — use top-5 QAP matching instead).

Files produced

a3_results/real_user_queries.json — 4,887 deduplicated real user queries
a3_results/real_user_queries_original.json — backup before any edits
a3_results/real_user_battery_50.json — 50 sampled questions with categories
REAL_USER_BATTERY.md — human-readable battery (DRAFT, needs review)
FRIDAY_THE_13TH.md — narrative summary of where we are and what we decided

2026-03-12 — A.3 Run 6: Top-5 RAG + enriched synthesis prompt evaluation

What we tested

Ran all 41 questions against the updated access-agent (commit ef43a21: RAG_TOP_K 3→5, RAG_ONLY_SYNTHESIS_PROMPT rewritten for thoroughness). UKY disabled, MCP servers active, include_trace=true for full node tracing. Compared against UKY baseline from Run 4.

Results

Metric	Run 5b	Run 6
Via RAG	24	27
Via MCP	5	5
LLM only	12	9
Avg latency	—	6.6s

41/41 answered, 0 failures. Top-5 pulled 3 more questions into the RAG path (likely union-type queries that now get enough matching pairs to clear the threshold).

The thin-answer finding

The enriched synthesis prompt did not fix the answer richness gap. Example — q2 ("What kind of accelerators does ACES have?"):

Run 6 (272 chars): Bare list of accelerator names (Intel Max GPUs, NVIDIA H100, etc.)
UKY (1197 chars): Detailed breakdown with unit counts, model numbers, memory specs per accelerator type

Run 6 matched 3 Q&A pairs (best score 0.929), but the top pair's answer is itself just a summary list: "The ACES system includes a variety of accelerators such as Intel Max GPUs, Intel FPGAs, NVIDIA H100 and A30 GPUs, NEC Vector Engines, NextSilicon co-processors, and Graphcore IPUs."

The synthesis LLM can't add detail that isn't in the source pairs. UKY's advantage here isn't architectural — it's that UKY's source documents (the ACES user guide PDF) contain the detailed specs table, and chunk retrieval preserves that detail. Our extraction pipeline summarized it away.

Root cause: extraction prompts produce summary-level answers

The two-shot extraction pipeline (battery + discovery) generates answers that summarize source material rather than preserving specifics. This is appropriate for "what is X" overview questions but loses value for detail questions ("what accelerators", "what specs", "how many nodes").

The fix is upstream in access-qa-extraction: the battery/discovery prompts need to instruct the LLM to retain numerical details, specifications, counts, and model numbers from the source data.

Ranch questions confirmed: no Q&A pairs, MCP fills the gap

q5–q8 (Ranch) all show 0 RAG matches because Ranch was never in our MCP extraction and has no Q&A pairs. q5, q6, q7 get reasonable answers via MCP tools (search_resources, get_resource_hardware). q8 ("How do I request a shared project space on Ranch?") falls to LLM-only with a generic answer.

Files produced

a3_results/run6.json — 41 questions with full responses and node traces
~/.agent/diagrams/a3-run6-comparison.html — interactive comparison (Run 6 vs UKY baseline)

2026-03-11 — A.3 Run 5: Full-System Comparison

What we did

Ran all 41 questions through the production agent graph with MCP servers active and UKY disabled. This is the first system-vs-system test: pgvector RAG + MCP tools + LangGraph routing, compared against UKY's baseline responses from Run 4.

Configuration:

ENVIRONMENT=production, MCP_SERVER_HOST=host.docker.internal — agent container reaches MCP servers via Docker host bridge
UKY_RAG_ENABLED=false, DUAL_RAG_LOGGING=false — no UKY, no dual-RAG comparison path
10 MCP servers running (access-mcp/docker-compose.yml)
902 Q&A pairs in pgvector (832 document + 70 entity)

Results — 41/41 questions answered:

24 via RAG (rag_retrieval)
5 via MCP tools (search_resources, get_resource_hardware, search_events, search_projects)
12 LLM-only (no tools called)

Key findings

MCP tools fill the Ranch gap. Ranch had zero Q&A pairs — q5, q6, q40 now get real answers via search_resources and get_resource_hardware. Even the misspelled q40 ("reanch storage") resolves.
q41 gets a real answer. "What allocation projects are using machine learning?" calls search_projects, returns 20 real projects with PIs and institutions.
q31 routes to events. "What training resources does ACCESS offer?" calls search_events, though the search returned empty results.
Cross-cutting questions (q3, q7, q8, q26-q28, q32-q33, q38) fall to LLM synthesis. Neither RAG nor MCP covers these general ACCESS process questions. Answers read well but are ungrounded — could hallucinate.

What we learned about observability

The API response only exposes tools_used, confidence, execution_strategy, tool_count. We cannot tell from the response:

What the classifier decided (static/dynamic/combined)
Which graph nodes actually executed (e.g. did RAG fire and fail before falling to LLM?)
RAG similarity scores for matched pairs
Whether _rag_answer_is_weak triggered
The plan content (if the planner node ran)
MCP tool arguments and raw responses

The 12 "LLM-only" answers are a black box — we can't distinguish "classified as static, RAG returned nothing, fell through to LLM" from "classified as static, LLM answered directly without trying RAG." Adding a node_trace to QueryResponse is the immediate next step.

Report

Interactive HTML comparison at ~/.agent/diagrams/a3-run5-comparison.html. Matches the Run 3/4 report format: KPI cards, filters, expandable side-by-side comparison. Note: hedge detection has a known issue — see below.

Known issue: hedge detection false positives

The report's hedge detection uses substring matching against phrases like "do not contain", "does not explicitly", etc. UKY q27 ("The provided documents do not specify...") is marked hedged but none of the exact phrases match — the detection was too aggressive. The h2h classification for q27 and potentially others needs review. Should align with access-agent/src/agent/graph.py:_rag_answer_is_weak() which uses the canonical hedge phrases.

Raw data

a3_results/run5.json — 41 questions with full agent responses
a3_results/uky_baseline_from_run4.json — UKY baseline (40 questions, q21 missing)
a3_results/run_a3_test.py — test runner (updated for Run 5: captures full response, saves to JSON)

What's next

Add node tracing to agent graph — track which nodes executed, classification result, RAG scores. Expose in QueryResponse.metadata.
Re-run with tracing — Run 5b with node trace data, so we can see exactly how each question routes.
Fix hedge detection — align report's hedge logic with _rag_answer_is_weak() from the agent codebase.
Tune synthesis prompt — RAG_ONLY_SYNTHESIS_PROMPT produces thin answers when one pair matches. Add links, context.
Curate 20-30 cross-cutting Q&A pairs — allocations, Globus, MFA, training, citation (fills the 12 LLM-only gaps).

To restart Docker (if containers are down)

cd /Users/josephbacal/Projects/sweet-and-fizzy/access-ci/access-qa-service && docker compose up -d
cd /Users/josephbacal/Projects/sweet-and-fizzy/access-ci/access-agent && docker compose up -d

Verify with docker ps — you should see access-agent-agent-1 (8000), qa-service-app (8001), and their postgres/redis containers.

Quick smoke test to confirm everything works

curl -s -X POST http://localhost:8000/api/v1/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is ACES?", "session_id": "test", "question_id": "test-1"}' | python3 -m json.tool

2026-03-12 — Node tracing added to agent graph

What was built

Two commits on access-agent/main:

04342c8 — Added node_trace to AgentState as Annotated[list[dict[str, Any]], operator.add]. Each graph node appends a structured trace dict recording what it decided:

classify: query_type, confidence, domain, rag_endpoint, reason, whether query was expanded
rag_answer: source (uky/pgvector), match_count, best_score, rag_used, has_final_answer
plan: requires_tools, tool_count, tool names, strategy
execute: tools_called, succeeded, failed
evaluate: is_helpful, reason
recover: action taken, new tools selected
synthesize: strategy, answer_length
domain_agent: domain, tool_count

The /api/v1/query response includes classification summary and node_trace in metadata. 10 files changed (all node files + state.py + routes.py).
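
How the `operator.add` annotation accumulates trace entries, sketched without the graph framework. `apply_update` is a hand-rolled stand-in for the framework's merge step, and the node bodies and their field values are illustrative (the match count and score are borrowed from the q2 example in the Run 6 entry; the `static` classification is an assumption):

```python
import operator
from typing import Annotated, Any, TypedDict, get_type_hints


class AgentState(TypedDict, total=False):
    query: str
    # The second Annotated argument is the reducer used to merge node updates.
    node_trace: Annotated[list[dict[str, Any]], operator.add]


def classify_node(state: AgentState) -> dict[str, Any]:
    # Hypothetical node body: return only the new trace entry.
    return {"node_trace": [{"node": "classify", "query_type": "static"}]}


def rag_answer_node(state: AgentState) -> dict[str, Any]:
    return {"node_trace": [{"node": "rag_answer", "match_count": 3, "best_score": 0.929}]}


def apply_update(state: AgentState, update: dict[str, Any]) -> AgentState:
    """Hand-rolled stand-in for the framework's merge step."""
    hints = get_type_hints(AgentState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducer = getattr(hints[key], "__metadata__", (None,))[0]
        merged[key] = reducer(merged.get(key, []), value) if reducer else value
    return merged  # type: ignore[return-value]


state: AgentState = {"query": "What kind of accelerators does ACES have?", "node_trace": []}
for node in (classify_node, rag_answer_node):
    state = apply_update(state, node(state))

print([entry["node"] for entry in state["node_trace"]])  # ['classify', 'rag_answer']
```

Because each node returns a one-element list and the reducer concatenates, no node needs to read or rewrite the trace left by earlier nodes.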

b7a9bec — Gated node_trace behind ?include_trace=true query parameter. OTel/Honeycomb (added Jan 2026, commit 422b92d) already provides full distributed tracing for ops. node_trace serves a different consumer: the eval harness needs trace data inline in the API response so it can programmatically inspect routing decisions without querying an external service. Nodes continue accumulating trace dicts in state (zero overhead), but the response only includes them when opted in.
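
The opt-in gating can be pictured as below. This is a sketch under assumptions: `build_response`, the `response` and `final_answer` keys, and the shape of the metadata dict are all hypothetical, not the code in routes.py:

```python
from typing import Any


def build_response(state: dict[str, Any], include_trace: bool = False) -> dict[str, Any]:
    """Shape the API payload; trace data is attached only when requested."""
    metadata: dict[str, Any] = {"tools_used": state.get("tools_used", [])}
    if include_trace:
        metadata["node_trace"] = state.get("node_trace", [])
    return {"response": state.get("final_answer", ""), "metadata": metadata}


final_state = {
    "final_answer": "placeholder answer",
    "tools_used": [],
    "node_trace": [{"node": "classify"}, {"node": "rag_answer"}],
}

print("node_trace" in build_response(final_state)["metadata"])                      # False
print("node_trace" in build_response(final_state, include_trace=True)["metadata"])  # True
```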

Why two tracing systems

OTel/Honeycomb: Ops. Waterfall view of every span, LLM call, MCP tool. External service.
node_trace: Eval. Inline in API response. Shows decisions (classifier output, RAG scores, tool selection) not timing. Consumable programmatically by the eval harness (Project D).

2026-03-12 — Top-5 RAG matches + enriched synthesis prompt

Context

Andrew suggested returning the top 5 Q&A pair matches (instead of just the best) and letting the synthesizer combine them — simpler than building document-chunk retrieval for cross-cutting queries. Analysis confirmed the pipeline already supported multiple matches end-to-end (RAG_TOP_K was 3, qa-service accepts up to 20, all downstream code iterates over the full list). The change was purely configuration + prompt.

What changed (commit `ef43a21`)

config.py — RAG_TOP_K: 3 → 5. More material for the synthesizer, especially for union-type cross-cutting queries where related entity pairs from different resources can be combined.

synthesize.py — RAG_ONLY_SYNTHESIS_PROMPT rewritten:

"Be concise and direct" → "Answer the question thoroughly"
New guideline: when multiple knowledge entries are provided, synthesize into a unified answer
URLs/links elevated to IMPORTANT (matching the tool-only and combined prompts)
Added practical next steps guidance and support ticket link (both already present in the other prompts, missing here)

What this addresses

Thin answers: Single-match answers were near-verbatim because the prompt said "be concise." Now the LLM is instructed to give a complete, actionable response with links and context.
Union-type cross-cutting queries: "What resources support GPUs?" now gets 5 entity-scoped pairs (Delta, Bridges-2, ACES, etc.) and the prompt tells the LLM to combine them.
Does NOT fix procedural cross-cutting: "How do I apply for an allocation?" still has no matching pairs at any score. These need hand-curated cross-cutting Q&A pairs (~5 questions).

2026-04-15 — Return from absence; reporting plan formed with Andrew

Orientation after time away. Surveyed git state across all ACCESS-CI repos. Working trees clean except one uncommitted line in access-qa-bot/src/config/constants.ts (flipping BACKEND_ID default to 'access') — now redundant because Andrew's 3.6.0 release included the same change.

What Andrew shipped during the absence. The non-agentic-proxy train (Path A from 2026-04-10-status.md) was merged and published solo:

qa-bot-proxy: per-backend Turnstile keys for multi-tenant validation + verified-session cookie to skip Turnstile on subsequent requests. Resolved the Turnstile site-key blocker flagged on 2026-04-10.
qa-bot-core@0.2.36 (includes Shadow-DOM Turnstile portal fix).
access-qa-bot@3.6.0 (includes the BACKEND_ID='access' fix).
access-ci-ui@0.20.0 on upstream.

Four local feature branches are now obsolete (the three feature/non-agentic-proxy-2026-04-10 branches + chore/bump-access-qa-bot-3.5.2 on access-ci-ui fork). access-mcp main needs fast-forward to upstream (4 commits: Hono replacing Express for Claude Code HTTP transport). access-agent branch feature/personalization-phase-1-2 has 4 unpushed commits and is 6 behind origin/main; parked.

Planning meeting with Andrew — focus shift to reporting. Personalization work parked indefinitely. Immediate priority is producing evidence the new agent is better than current production (raw UKY) for Monday's review with Jim, Vikram (UKY), Shelly (CU).

Confirmed Andrew's src/eval/ pipeline is more than a bakeoff CLI: Postgres-backed EvalRun + EvalScore schema, LLM judge, three markdown report generators (generate_team_report, generate_leadership_report, generate_resource_report), bidirectional Argilla integration (argilla_push / argilla_pull), prod-switching infra via SSH tunnel (scripts/eval-prod, .env.eval.prod). Andrew's decisions/007-production-baseline-comparison.md (2026-04-08) specifies the comparison framework and go/no-go criteria.

Monday deliverable: two-way comparison, not three-way. Decision 007 specified three systems (raw_rag / agent_rag_only / agent_full). After Slack iteration, dropped agent_rag_only because (a) UKY scoped corpora aren't ready, stripping the main differentiator the middle path would show, and (b) an MCP-less agent is not a shippable product direction — Andrew confirmed "no way we will do an agent with just RAG." The spec's three-way framing is valid when all three systems are real product options; under current constraints the middle system measures mostly synthesis-prompt delta, which is not a stakeholder-facing story.

Result: 2 batteries × 2 routes = 4 eval runs. Reports emitted as markdown → PDF (no custom HTML — scientists don't want flash).

Work plan (Joe owns all):

Add system="raw_rag" mode to src/eval/runner.py — calls uky_client directly, bypasses graph, returns RunResult. Prerequisite for the comparison.
Run two systems against friendly_battery.json + real_user_battery.json; emit markdown reports → PDF. Push full-agent run to Argilla for spot-check.
Persist duration_ms on EvalScore + Argilla metadata. Already captured in RunResult, dropped by scorer.py. Needs an Alembic migration for prod Postgres (create_all only creates missing tables, not new columns).
Add agent_commit to Argilla record metadata in argilla_push.build_argilla_record(). Already on EvalRun; ~3 lines.
Thread a resource field (read from question metadata) through run_question() → run_agent(resource_context=...) or uky_client.ask(rp_name=...) for raw_rag. Enables scoped RAG exercise. load_questions already captures unknown fields via metadata; capability_area fields (delta/expanse/anvil/etc.) often match RP slugs directly.
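A minimal sketch of how the raw_rag mode and the resource threading could fit together in the runner. It is an illustration under stated assumptions, not the actual src/eval/runner.py: the simplified RunResult, the question["text"] key, the metadata["resource"] key, and the injected ask_uky / run_agent callables are all hypothetical stand-ins.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class RunResult:
    """Simplified stand-in for the harness's RunResult (hypothetical fields)."""
    system: str
    answer: str
    duration_ms: int = 0
    metadata: dict = field(default_factory=dict)


def run_question(
    question: dict,
    system: str,
    ask_uky: Callable[..., str],
    run_agent: Callable[..., str],
) -> RunResult:
    """Route one battery question to raw UKY RAG or to the full agent."""
    # Assumption: the resource slug, if any, lives in the question metadata.
    resource: Optional[str] = question.get("metadata", {}).get("resource")
    start = time.perf_counter()
    if system == "raw_rag":
        # Bypass the graph; pass the resource as rp_name only when present.
        kwargs = {"rp_name": resource} if resource else {}
        answer = ask_uky(question["text"], **kwargs)
    elif system == "agent_full":
        answer = run_agent(question["text"], resource_context=resource)
    else:
        raise ValueError(f"unknown system: {system!r}")
    duration_ms = int((time.perf_counter() - start) * 1000)
    return RunResult(
        system=system,
        answer=answer,
        duration_ms=duration_ms,
        metadata={"resource": resource},
    )
```

Injecting the two callables keeps the sketch testable without a live UKY endpoint; the real runner would presumably call uky_client and the graph directly.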

Held for later:

Judge calibration via Argilla human annotations (want review rounds first).
Live prod traffic → prod Argilla. Implementation sketch: sample % of queries in agent query handler, create EvalRun with run_type='production_sample', skip LLM judge (per Decision 006), push to dedicated dataset. Requires PII strip, rate limit, feature flag, and an agent-side deploy. Not needed for Monday.
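The sampling step in the sketch above could look roughly like this. It is a hedged illustration only: should_sample, build_sample_run, the query_id key, and the bucket scheme are hypothetical, and PII stripping and rate limiting are deliberately left out.

```python
import hashlib


def should_sample(query_id: str, sample_pct: float, enabled: bool) -> bool:
    """Deterministically select roughly sample_pct percent of queries."""
    if not enabled or sample_pct <= 0:
        return False  # feature flag off, or sampling disabled
    # Hash the id into one of 10,000 buckets so the decision is repeatable.
    digest = hashlib.sha256(query_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return bucket < sample_pct * 100


def build_sample_run(query_id: str, question: str, answer: str) -> dict:
    """Shape of a sampled record; only run_type comes from the journal."""
    return {
        "run_type": "production_sample",
        "query_id": query_id,
        "question": question,
        "answer": answer,
        "judge_scores": None,  # LLM judge is skipped for production samples
    }
```

Hashing the query id instead of calling a random generator means the same query always gets the same decision, which makes a sampled record easier to trace back later.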

Open questions surfaced:

Whether UKY has per-RP scoped corpora actually indexed. Agent sends rp_name in the body; UKY's behavior not directly verifiable from this side. Quick curl test recommended before asserting per-resource quality claims.
in_scope flag: specified in the code (uky_client.py:132, # None until UKY implements it) as the authoritative signal for whether a scoped query fell out of scope. Agent currently falls back to text-pattern heuristics in _rag_response_out_of_scope(). Not a Monday blocker; becomes important for reliable scoped-quality metrics once UKY ships it.
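The flag-then-heuristics fallback can be sketched as below. The function name and the two patterns are hypothetical examples; the real phrases in _rag_response_out_of_scope() are not shown in this journal and may differ.

```python
import re
from typing import Optional

# Hypothetical phrases, for illustration only.
_OUT_OF_SCOPE_PATTERNS = [
    re.compile(r"\bI (?:do not|don't) have (?:any )?information\b", re.IGNORECASE),
    re.compile(r"\boutside (?:of )?(?:the|my) scope\b", re.IGNORECASE),
]


def is_out_of_scope(answer_text: str, in_scope: Optional[bool]) -> bool:
    """Prefer the explicit flag; fall back to text patterns while it is None."""
    if in_scope is not None:
        return not in_scope  # authoritative once UKY populates the field
    return any(p.search(answer_text) for p in _OUT_OF_SCOPE_PATTERNS)
```

Keeping the flag check first means the heuristics stop mattering automatically the day the field starts arriving populated.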

Side artifact: created HELPING_JOE_GET_IT.md at the access-ci root (local-only, not gist-synced) with three Mermaid diagrams clarifying the eval harness vs agent distinction and the three routing modes. Useful for future reorientations.

2026-05-05 — No-classify eval validated; MCP review fixes shipped; ship-it case made

Picked up the no-classify path on feat/no-classify (sub-branch off feat/qwen-integration) where 2026-05-04 left off: 4 commits implementing USE_NO_CLASSIFY master switch, search_access_documents doc-search tool, centralized max_tokens, no-classify system prompt. Yesterday's 14-Q tool_coverage_battery eval was mixed-signal (composite +0.05 candidate, but compare-judge picked baseline). Goal today: validate against a richer battery, decide whether to ship.

21-Q no_classify_smoke_battery.json curated. Combined questions from tool_coverage_battery.yaml (10 tool-coverage + 4 RAG), combined_battery.json (5 detail-heavy/messy combined), friendly_battery.json and real_user_battery.json (2 domain-agent-territory). Preserved required_facts on tc-* questions so the judge applies them. Pulled per-question score history from Postgres to confirm no questions were dead weight: tc-events-01 and tc-software-01 perfect-5.0 across all historical runs (kept as proof), tc-allocations-02 sd=1.66 (the loose-match question Joe spent days iterating on), tc-rag-03 had a known 3.10 outlier from 2026-04-29 (generic Globus-101 answer with no ACCESS-specific anchors — kept as canary).
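A small sketch of the per-question history check, assuming the scores have already been pulled out of Postgres into a dict keyed by question id. summarize_history is a hypothetical helper, and the use of the sample standard deviation is an assumption; the journal does not say which variant produced sd=1.66.

```python
from statistics import mean, stdev


def summarize_history(
    history: dict[str, list[float]], max_score: float = 5.0
) -> dict[str, dict]:
    """Per-question count, mean, sample sd, and an always-perfect flag."""
    summary = {}
    for qid, scores in history.items():
        if not scores:
            continue  # no historical runs; nothing to summarize
        summary[qid] = {
            "n": len(scores),
            "mean": mean(scores),
            # stdev() needs at least two points; treat one run as zero spread.
            "sd": stdev(scores) if len(scores) > 1 else 0.0,
            "always_perfect": all(s == max_score for s in scores),
        }
    return summary
```

Questions flagged always_perfect are candidates for dead weight, and a high sd marks the ones worth a closer look before trusting any single run.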

v1 eval pair (composite tied): baseline loop-20260505-144033-3943fb 4.82 / candidate loop-20260505-145419-e6546d 4.81. Both errored on tc-events-01 due to a 32k context-window overflow: the events tool returned 50 items × ~1964 chars of HTML description ≈ 96k chars (~24k tokens). Eyes-on review of the 20 scored questions surfaced an apparent regression on tc-software-01: candidate produced a 4-line "PyTorch is on ACES, Anvil. Check the docs" wave-off vs baseline's structured table with hostnames, version details, related packages.

About to draft three rounds of prompt fixes for the wave-off when Joe redirected: "let's just run it again and see if the non-deterministic world of agents gives us a different result." Re-ran tc-software-01 with same code/prompt/model → 5.00/5.00, 1226-char rich table answer. The wave-off was a single non-deterministic sample, not a systemic regression. Saved lesson_rerun_before_designing_for_regression.md — agent runs are non-deterministic enough that a single bad sample can look systemic; always re-run before drafting fixes. Confirmation bias is real.
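The re-run habit can be made mechanical with something like the sketch below. Both helpers, the threshold of 4.0, and the 50% cutoff are hypothetical choices for illustration; run_once stands in for whatever re-runs a single question and returns its judge score.

```python
from typing import Callable


def rerun_scores(run_once: Callable[[], float], n: int = 3) -> list[float]:
    """Run the same question n times with identical code, prompt, and model."""
    return [run_once() for _ in range(n)]


def looks_systemic(
    scores: list[float], threshold: float = 4.0, min_fraction: float = 0.5
) -> bool:
    """Treat a regression as systemic only if most samples fall below threshold."""
    if not scores:
        raise ValueError("need at least one score")
    low = sum(1 for s in scores if s < threshold)
    return low / len(scores) >= min_fraction
```

Under these assumptions, one low score followed by two clean re-runs would not count as systemic, which matches how the tc-software-01 wave-off was resolved.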

Andrew's PR #3 review caught a substantive bug. Five of six new see_all_urls verified correct; system-status pointed at support.access-ci.org/outages (portal mirror) instead of operations.access-ci.org/infrastructure_news_view (canonical). More important — Andrew noticed query_relevance was hardcoded "exact" in announcements + events even when filters.query was supplied. Both servers send filters.query to Drupal's search_api_fulltext (a fuzzy full-text backend with tokenization, stemming, relevance ranking). Hardcoding "exact" was actively defeating the agent's honest-framing prompt rule — the agent skipped the verification step and fabricated matches on fuzzy results. Same failure mode the prompt was written to prevent.
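For illustration, the relevance rule at issue can be written as a one-line Python function. The fix itself is quoted in this journal as a JavaScript-style ternary (filters.query ? "loose_match" : "exact"), so this is a translation of the rule, not the MCP server code.

```python
from typing import Any, Mapping


def query_relevance(filters: Mapping[str, Any]) -> str:
    """Label results as loose matches whenever a free-text query was supplied."""
    # A non-empty query goes through fuzzy full-text search, so hits are
    # candidates to verify, not exact matches.
    return "loose_match" if filters.get("query") else "exact"
```

A missing or empty query falls through to "exact", the same truthiness behavior as the ternary.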

Three substantive MCP fixes landed on feat/listing-urls-in-tool-responses (PR #3 branch):

ad9c98e system-status URL → operations.access-ci.org/infrastructure_news_view; query_relevance: filters.query ? "loose_match" : "exact" in announcements + events (mirrors software-discovery's existing pattern); test coverage on both branches in announcements + events (Andrew flagged "exact" path was asserted by zero tests across all servers); events compactDescription helper strips HTML and truncates per-item description.
4e9e6be events default limit lowered 50 → 20; DESCRIPTION_MAX_CHARS tightened 400 → 250. Resolves the 32k context overflow on tc-events-01: 50 items × ~1050 chars is now ~50k chars (~12.5k tokens) instead of ~96k chars (~24k tokens).
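A rough Python sketch of the two ideas in these fixes: stripping and truncating each description, and estimating the token cost of a tool response. It is not the compactDescription helper from access-mcp; the regex-based tag stripping and the 4-chars-per-token rule of thumb are assumptions (the latter is consistent with 96k chars being quoted as ~24k tokens).

```python
import html
import re

DESCRIPTION_MAX_CHARS = 250  # value after 4e9e6be
CHARS_PER_TOKEN = 4  # rough rule of thumb, not a tokenizer

_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def compact_description(raw_html: str, max_chars: int = DESCRIPTION_MAX_CHARS) -> str:
    """Strip tags, decode entities, collapse whitespace, then truncate."""
    text = _TAG_RE.sub(" ", raw_html)
    text = html.unescape(text)
    text = _WS_RE.sub(" ", text).strip()
    if len(text) <= max_chars:
        return text
    # Reserve three characters for the ellipsis so the cap is respected.
    return text[: max_chars - 3].rstrip() + "..."


def estimate_tokens(n_items: int, chars_per_item: int) -> int:
    """Back-of-envelope token estimate for a list-shaped tool response."""
    return n_items * chars_per_item // CHARS_PER_TOKEN
```

With these assumptions, estimate_tokens(50, 1964) gives about 24.5k tokens, which is why a single events call could eat most of a 32k window before the system prompt and the other messages were counted.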

v3 eval pair (after MCP fixes): baseline loop-20260505-172157-a8ddd2 4.89 / candidate loop-20260505-175022-f8cd0f 4.90. First eval where candidate beats baseline at the composite level. All 21 questions scored on both sides; the events overflow is gone. Three-run trend: 2026-05-04 (-0.05) → 2026-05-05 v1 (-0.01) → 2026-05-05 v3 (+0.01) as the surrounding system got cleaner. Compare-judge still picked baseline with margin "small" (same judge-subjectivity pattern), but eyes-on review of the regressions shows they are either non-deterministic or judge-side noise: on tc-allocations-02 the 1.55-point gap came from the judge confusing "tool match count" with "actually about climate", while the candidate's honest framing was intact. Report: https://access-ci-reports.netlify.app/no-classify-2026-05-05-v3.html

Latency picture is bimodal. Net, the candidate is ~16% faster (avg 15.7s/Q vs 18.8s/Q). Big wins on tool-heavy questions: tc-announce-01 saved 17s, comb-011 saved 17s, tc-xdmod-01 saved 16s, tc-status-01 saved 13s. Baseline pays for classify + rag_answer + tool_calling_loop; candidate skips straight to the loop. Losses on RAG-only questions: tc-rag-03 (Globus) 23s slower, friendly-001 (SSH+2FA) 13s slower. Candidate calls search_access_documents itself, paying UKY's full /ask round-trip (5-15s, sometimes called twice in one Q). Verified via direct trace inspection: search_access_documents → uky_client.ask() makes ONE POST per call to ONE endpoint (general OR xdmod, never both). The latency is purely UKY's RAG inference (chunk retrieval plus LLM synthesis on their side). Vikram's /retrieve returns raw chunks at sub-second latency, which should close this gap once it ships.
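The headline percentage and the win/loss split can be reproduced with a few lines. The helper names are hypothetical; the per-question deltas are the six quoted in this entry, written as candidate minus baseline in seconds.

```python
def percent_faster(baseline_s: float, candidate_s: float) -> float:
    """Relative speedup of the candidate, as a percentage of baseline time."""
    return (baseline_s - candidate_s) / baseline_s * 100


def split_deltas(deltas: dict[str, float]) -> tuple[dict[str, float], dict[str, float]]:
    """Split per-question deltas into seconds saved and seconds lost."""
    wins = {q: -d for q, d in deltas.items() if d < 0}
    losses = {q: d for q, d in deltas.items() if d > 0}
    return wins, losses


# Candidate minus baseline, in seconds, for the questions named above.
DELTAS = {
    "tc-announce-01": -17, "comb-011": -17, "tc-xdmod-01": -16,
    "tc-status-01": -13, "tc-rag-03": 23, "friendly-001": 13,
}
```

percent_faster(18.8, 15.7) comes out at roughly 16.5, in line with the ~16% quoted; split_deltas(DELTAS) separates the four tool-heavy wins from the two RAG-only losses.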

Two-step ship plan agreed (decision_no_classify_two_step_ship.md): Step 1 — flip USE_NO_CLASSIFY=true in prod via PR off feat/no-classify + .env.example default. One-line, reversible. Step 2 — delete the conditionality (classify, rag_answer_node, USE_TOOL_CALLING_LOOP flag, agent_full_legacy, master switch) in a follow-up cleanup PR, gated on (a) Andrew sign-off, (b) /retrieve shipped, (c) one more battery confirming no new wave-off failure modes. The conditional earns its keep as a comparison anchor — caught the tc-software-01 wave-off (and avoided designing for it) because we had a baseline to A/B against.
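A minimal sketch of what the Step 1 switch controls, written without any graph library. The node and tool names come from this journal; the helper functions, the accepted truthy strings, and "classify" as the baseline entry node are assumptions, and the real graph.py wiring is not reproduced here.

```python
import os
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}


def use_no_classify(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the master switch; anything not clearly truthy means off."""
    env = os.environ if env is None else env
    return env.get("USE_NO_CLASSIFY", "false").strip().lower() in _TRUTHY


def entry_node(env: Optional[Mapping[str, str]] = None) -> str:
    """First node after START: the loop when the switch is on, else classify."""
    return "tool_calling_loop" if use_no_classify(env) else "classify"


def loop_tools(base_tools: list[str], env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Append the doc-search tool only on the no-classify path."""
    tools = list(base_tools)
    if use_no_classify(env):
        tools.append("search_access_documents")
    return tools
```

Because everything hangs off one environment variable, rolling back Step 1 is a config change, not a deploy of new code; Step 2 would delete the else branches.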

Pushed:

access-mcp feat/listing-urls-in-tool-responses: ad9c98e (Andrew review fixes + events compactDescription) + 4e9e6be (events tightening). PR #3 ready for Andrew when Joe flips to "Ready for review."
access-agent feat/no-classify: 0c8fc20 adds eval/questions/no_classify_smoke_battery.json (the 21-Q battery used today).

Both repos clean, no untracked files. PR for feat/no-classify itself NOT yet opened — first task next session.

Commit Log (Joe Bacal, Feb 2026 work)

Commits across all repos related to the Feb/March plan. Older commits omitted.

access-qa-extraction (`feat/two-shot`)

Hash	Date	Message
`c8fbf0b`	02-26	docs: remove historical docs, update system overview for two-shot
`853e88f`	02-26	replace GUIDED-TOUR with TRACE-TOUR signposts; fix software name casing
`00ba293`	02-24	prompt: add rule to quote long lowercase entity names in Q&A
`7b0590e`	02-24	prompt: enhance rule 4 to check free-text fields; update review observations doc
`28be413`	02-24	fix: entity name interpolation + temporal language + coming-soon cleanup
`170e87d`	02-24	docs: log full corpus scan results — quantify issues #1/#2, add issue #3
`8336f45`	02-24	docs: move allocations:72170 finding to Patterns (positive, not an issue)
`d7f57f5`	02-24	docs: log allocations:72170 as non-issue (Jurafsky in source data, verified)
`4f9c22d`	02-24	docs: add retrieval surface area rationale to P1 (self-contained answers)
`a4f7b66`	02-24	docs: note preferred fix for P1 — entity name interpolation in user prompt
`43e980e`	02-24	docs: clarify P1 — entity name needed in both Q and A for RAG
`70f9424`	02-24	docs: add P1 pattern — questions must be self-contained (cross-cuts #1 and #2)
`6084c93`	02-24	docs: log issue #2 — decontextualized-question pattern (pervasive)
`07da145`	02-24	docs: log issue #1 — temporal-assumption in affinity-groups events
`c4ec468`	02-24	docs: add qa-review-observations.md for tracking Argilla review issues
`6857db8`	02-24	docs: improve signpost comments + fix COMING SOON name normalization
`579e10d`	02-24	fix: normalize "COMING SOON" resource names to lowercase
`7bd43ba`	02-24	wip: some signpost comments
`3333c32`	02-23	docs: update guided-tour
`66e1819`	02-20	refactor: adopt two-shot as sole extraction strategy
`7803147`	02-20	fix: restore missing return in software_discovery._generate_qa_pairs
`7791e2b`	02-20	feat: add --prompt-strategy flag for A/B/C extraction experiment
`b662dc9`	02-20	feat: implement entity-replace for Argilla push
`80fc641`	02-20	docs: update plan with metadata on human actions on archive records
`9d54819`	02-19	fix(data-quality): separate NSF program fields and add per-domain LLM guidance
`39a4c06`	02-19	refactor: remove factoid templates and bonus generation (2-pass pipeline)
`5268caa`	02-19	docs: reflect entity-replace decision and update README
`8c9e7f2`	02-18	docs: update all docs for freeform extraction pipeline and Argilla dedup
`4181585`	02-18	feat: roll out freeform extraction to all 5 extractors
`da79f7d`	02-18	feat: freeform extraction replaces category+bonus two-pass approach
`2833d7b`	02-18	docs: update for Argilla metadata integration and test count
`e6d08fa`	02-18	feat(argilla): add eval_issues and source_ref to Argilla records
`3c762c9`	02-18	feat(argilla): push judge scores and granularity to Argilla metadata
`24c8373`	02-17	feat(judge): LLM judge evaluation scores for Q&A pair quality
`93a1fb2`	02-17	feat(bonus): LLM exploratory questions for entity-unique information
`068c08a`	02-17	feat(incremental): hash-based change detection to skip unchanged entities
`9059614`	02-17	fix(factoids): data quality guards for template generation
`3662d8b`	02-13	feat(generators): dual-granularity Q&A + extend comparisons to all 5 domains
`fa2ff93`	02-12	fix(nsf-awards): normalize primaryProgram list + skip unused MCPClient
`f3b1437`	02-12	feat(extractors): fixed question categories + direct API for allocations/nsf-awards
`fdebdab`	02-12	feat(software-discovery): switch from search terms to list_all_software
`e33d006`	02-11	feat(extract): add max_entities cap for cheap test runs
`2da2c32`	02-10	Use real enumerations from taxonomies.ts for search terms
`d987dee`	02-10	Add report command for MCP coverage stats without LLM calls
`6c4667c`	02-10	Add ExtractionConfig to centralize extraction parameters
`0b16ba8`	02-04	Fix Q&A pair ID collisions by appending question hash
`cf384bc`	02-04	Add Argilla integration for pushing Q&A pairs to human review
`51e9877`	02-04	Expand extraction queries, fix software-discovery, update docs
`a69ce2e`	02-02	Fix allocations and nsf-awards extractors returning 0 results
`038d42d`	02-02	Add dedicated OpenAI backend (LLM_BACKEND=openai)
`b557300`	02-01	Add LOCAL_DIRECTIONS.md and update .env.example for OpenAI setup
`d45eda1`	02-01	Add NSFAwardsExtractor and register in CLI/validator
`b67eba0`	02-01	Add AllocationsExtractor and register in CLI/validator
`18c0e49`	01-31	Add AffinityGroupsExtractor and fix MCP server port defaults
`de28ab2`	01-31	Add CLAUDE.md and update README with local dev setup guide

access-qa-service (`main`)

Hash	Date	Message
`5b57ae0`	02-28	Fix Argilla sync to work with access-qa-extraction's dataset schema

access-agent (`main` / `feature/dual-rag-logging`)

Hash	Date	Message
`ef43a21`	03-12	feat: return top-5 RAG matches and enrich synthesis prompt
`b7a9bec`	03-12	feat: gate node_trace behind ?include_trace query parameter
`04342c8`	03-11	feat: add node_trace to agent graph for execution path observability
`de26e37`	—	feat: route pgvector through LLM synthesis + fair comparison logging
`08809ad`	—	fix: lower RAG similarity thresholds — 0.85 was filtering valid matches
`caf7256`	02-28	feat: add dual-RAG comparison logging for A.2 evaluation

access-mcp (`main`)

Hash	Date	Message
`bb3b54f`	02-04	spike: Add list-all fallbacks to allocations and nsf-awards routers

access-qa-planning (`update/mcp-extraction-two-shot`)

Hash	Date	Message
`a84fb4a`	02-26	docs: GUIDED-TOUR.md → TRACE-TOUR.extract.md in file tree
`033c46e`	02-23	docs: update mcp-extraction-impl to reflect two-shot pipeline and entity-replace

access-argilla (`main`)

Hash	Date	Message
`d5cb931`	01-30	chore: init claude file

bacalj/DEV_JOURNAL.md

ACCESS-CI Dev Journal

2026-05-04 (evening) — Classify-free path implemented on feat/no-classify; eval mixed signal

Branch decision: sub-branch over committing to parent

Commits (in order)

Pre-flight before eval

Eval comparison

Side-finding: the loose-match honest framing traces to a four-commit cross-repo fix

access-mcp PR #3 bundled

Open questions for next session

2026-05-04 — Plan for the classify-free path: tightened to a single tool, master switch on feat/qwen-integration

What changed from the 2026-05-01 plan

Why a single tool with a source param ages better than two tools

Why the master switch instead of a flag or branch

What got verified vs. assumed

Slack to Andrew (sent today)

Memory artifacts updated

2026-05-01 — End-to-end loop run on Kimi at UKY; Andrew flagged a no-classify direction

End-to-end eval comparison on Kimi

Reasoning-model token-budget finding

Andrew responses (Slack, 12:38-12:39)

Decision: spike no-classify on a branch, not a flag

Architecture stages framing (locked in)

Open follow-ups

2026-04-30 — Qwen integration committed (Phase 5, commit a1c54bb); UKY endpoint smoke blocked by vLLM backend

Cross-repo branch hygiene

Qwen integration shipped to feat/qwen-integration (commit a1c54bb)

UKY endpoint smoke failed — vLLM backend not reachable

Cleanup commit landed (0ecd5ad, −475 lines net)

Endpoint diagnostic — confirmed our integration is sound, narrowed down the UKY-side breakage

Open follow-ups

2026-04-29 — TC battery ground-truth pass: all 14 questions verified, 11 rubric commits, real bugs surfaced

Bugs found in the prior "answer key"

Real agent gap surfaced (not yet fixed)

Patterns fixed across the rubric set

Other findings

Framing — eval is instrumental, not authoritative

Open follow-ups

Fix landed in access-mcp

tc-announce-01 spot-check — no agent change needed

Implication for the routine

Open follow-up

Two comparison runs published — chain vs loop, post-MCP-fix

Pagination bug surfaced — comb-002 in the smoke battery

Two side-benefits-for-Vikram, traced

TC-battery re-run + answer-quality review (afternoon)

access-mcp PR #3 expansion — links rename + pagination + query_relevance

Loop system prompt — three additions (access-agent 88eccdd)

Andrew's review of PR #14 — five items addressed

Production crash-loop on wrapt 2.x — ec3414d

READ_ONLY bypass in tool_calling_loop — e8b2f12

Cleanup batch — d3a76af

Two items deliberately flagged rather than fixed

Issue #15 — broader cleanup audit

Evening — TC battery re-run on rebuilt MCP + first prompt-iteration cycle

Prompt-iteration cycle on tc_loose_match_subset.yaml (4 questions)

Decision: stop iterating on the prompt; go to the MCP layer

2026-04-27 — Vikram + Andrew sync; tool_coverage_battery atomization + YAML cutover

Battery work (morning)

Eval-system repeatability is load-bearing — time_bound refresh pattern

Vikram + Andrew sync (afternoon)

2026-04-24 — Tool-use investigation: counting tool calls is the wrong metric

Prompt tweak — partial fix

Spot-checked the remaining 8 — story changed

Conclusion: the gap was mostly illusory

MCP bugs surfaced — file against access-mcp

2026-04-23 (afternoon) — Phase 3 parity check + report-tooling iteration

Parity check — methodology

Result

Tooling improvements that landed during the run

Architecture insight worth noting

State

2026-04-23 — Launch Phase 3 (tool-calling loop) code lands

What shipped — 12 commits, cad8ee1..4272a78

Plan-vs-reality mismatches fixed mid-execution

Design call worth flagging to Andrew

Test posture

State

Next

2026-04-21 (afternoon) — Launch-hardening plan lands; sync with Andrew; new focus is Project M

Commits on `feature/production-baseline-comparison`

2026-04-20 — Events MCP: `search_events` scope investigated and fixed

Fix (`fix/search-events-webinar-guidance` on `bacalj/access-mcp`)

Judge improvements — merged to `feature/production-baseline-comparison`