Gist mirror: https://gist.github.com/bacalj/a6c6f9726844611df5a09c83884a0e83
Picked up the morning plan and executed it. Four commits on a new sub-branch feat/no-classify (off feat/qwen-integration); 279/279 unit tests pass; end-to-end eval comparison run on Qwen3.6-FP8 against the 14-question tool-coverage battery. Result is mixed: composite +0.05 in candidate's favor, but compare-judge picks the baseline as winner with margin "small" — exactly the failure mode predicted at the start of the session.
Plan as written had the work landing directly on feat/qwen-integration. Late-day reconsideration: the work is experimental enough that a separate sub-branch is worth the small extra branching complexity. feat/no-classify cuts off feat/qwen-integration; if the experiment loses, the parent branch is unaffected and the sub-branch can be deleted. If it wins, it merges into the parent.
| commit | content |
|---|---|
| b7765b4 | Centralize max_tokens in src/config.py (MAX_TOKENS_LOOP=6000, MAX_TOKENS_DOMAIN_AGENT=4000, MAX_TOKENS_SYNTH_CONDENSE=8000, MAX_TOKENS_SYNTH_FINAL=3000, MAX_TOKENS_PLAN=3000, MAX_TOKENS_EVALUATE=1500, MAX_TOKENS_RECOVER=1500). Reasoning-friendly defaults. classify.py's max_completion_tokens=350 left alone (pinned to gpt-4o-mini, bypassed by master switch). |
| 5b2c0ef | New search_access_documents StructuredTool wrapping uky_client.ask() with query, source: Literal["general","xdmod"]="general", optional rp_name. Lives in new src/agent/tools/ package (distinct from src/tools/ MCP-client and src/agent/domains/tools.py MCP-wrapper). 10 unit tests. |
| 1cc9d66 | USE_NO_CLASSIFY master switch in graph.py: when true, START → tool_calling_loop directly, classify/rag_answer/rag_and_plan/domain_agent bypassed. New src/agent/prompts/no_classify.py system prompt with announcements/JSM workflow choreography folded in. Loop node branches on the flag to choose prompt builder + whether to append search_access_documents. 7 new graph/loop tests. |
| f81092a | Prompt rewrite. The first draft of SYSTEM_IDENTITY_NO_CLASSIFY opened with "There is no separate documentation layer upstream of you", leaking implementation history into the prompt and comparing against an alternative the LLM has no context for. Replaced with a job-description framing organized around a 3-step workflow: docs first (call search_access_documents), enrich with MCP tools where the topic touches live data, synthesize. The "what not to do" guidance moved into its own section instead of being buried mid-paragraph. |
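A minimal sketch of what the centralized budgets from b7765b4 could look like. The constant names and values are the ones listed in the table; the lookup helper and its node keys are hypothetical and only illustrate having a single place to read budgets from.

```python
# Sketch only: constants mirror the b7765b4 row; the helper is an assumption.
MAX_TOKENS_LOOP = 6000
MAX_TOKENS_DOMAIN_AGENT = 4000
MAX_TOKENS_SYNTH_CONDENSE = 8000
MAX_TOKENS_SYNTH_FINAL = 3000
MAX_TOKENS_PLAN = 3000
MAX_TOKENS_EVALUATE = 1500
MAX_TOKENS_RECOVER = 1500

# Hypothetical node keys; the real call sites may import the constants directly.
_BUDGETS = {
    "loop": MAX_TOKENS_LOOP,
    "domain_agent": MAX_TOKENS_DOMAIN_AGENT,
    "synth_condense": MAX_TOKENS_SYNTH_CONDENSE,
    "synth_final": MAX_TOKENS_SYNTH_FINAL,
    "plan": MAX_TOKENS_PLAN,
    "evaluate": MAX_TOKENS_EVALUATE,
    "recover": MAX_TOKENS_RECOVER,
}


def max_tokens_for(node: str) -> int:
    """Return the budget for a node, failing loudly on an unknown name."""
    try:
        return _BUDGETS[node]
    except KeyError:
        raise ValueError(f"no token budget configured for node {node!r}") from None
```

Failing on an unknown key rather than falling back to a default keeps a typo from silently reintroducing a small budget on a reasoning model.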
- Docker stack brought up: 11 MCP servers + postgres + redis. JSM and postgres weren't in the existing partial stack; brought them up explicitly.
- UKY model list checked: ccs/Qwen/Qwen3.6-35B-A3B-FP8 listed and serving (false alarm earlier in session from a head -c 800-truncated curl).
- Chat-completion smoke against Qwen at max_tokens=6000: finish_reason: stop, content emits proper </think> close, model produces "OK" after the reasoning trace. _StrippingChatOpenAI wrapper handles cleanly.
- UKY general RAG endpoint: smoke-tested with allocation question, returned 2129-character response.
- JSM safety: confirmed three layers stack: READ_ONLY=true (agent strips write tools at catalog level), JSM_DRY_RUN=true (MCP container short-circuits ticket creation, returns DRYRUN-NNN stub), JSM_PROXY_URL unset (would fail closed even if dry-run weren't on). Ran eval with READ_ONLY=true for belt-and-suspenders.
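The three JSM safety layers can be read as an ordered series of checks. The sketch below is only a model of that ordering: the function name and return strings are hypothetical, and in the real stack the checks live in different processes (agent catalog, MCP container, proxy config) rather than in one function.

```python
from typing import Optional


def ticket_write_outcome(read_only: bool, dry_run: bool, proxy_url: Optional[str]) -> str:
    """Model which layer stops a ticket-creating call (illustrative only)."""
    if read_only:
        # Layer 1: READ_ONLY=true, the agent never sees the write tools.
        return "blocked: write tools stripped from catalog"
    if dry_run:
        # Layer 2: JSM_DRY_RUN=true, the MCP container returns a stub.
        return "stubbed: DRYRUN-NNN returned, no ticket created"
    if not proxy_url:
        # Layer 3: JSM_PROXY_URL unset, the call has nowhere to go.
        return "failed closed: JSM_PROXY_URL unset"
    return "live: ticket would be created"
```

With today's settings every layer would have stopped a write on its own, which is what "belt-and-suspenders" refers to.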
Battery: eval/questions/tool_coverage_battery.yaml (14 Qs). System: agent_full. Both runs on feat/no-classify HEAD (f81092a).
| | Run ID | Composite | Scored / Skipped | Per-dim |
|---|---|---|---|---|
| Baseline (USE_NO_CLASSIFY=false) | loop-20260504-181608-68a8f0 | 4.93 | 13 / 1 | corr 5.00, comp 4.77, rel 5.00, cite 4.92, hedge 5.00 |
| Candidate (USE_NO_CLASSIFY=true) | loop-20260504-182336-7b71ea | 4.98 | 13 / 1 | corr 5.00, comp 4.92, rel 5.00, cite 5.00, hedge 5.00 |
compare-judge JSON at comparisons/no-classify-2026-05-04.json. HTML report deployed: https://access-ci-reports.netlify.app/no-classify-2026-05-04.html
The judge's run-level summary picks the baseline as winner (margin "small"), citing more detailed and actionable information on allocation, software-version, and announcement questions. Candidate wins on terse technical Q&A (SSH keys, CUDA versions). Notable: the judge explicitly flagged that the per-answer rubric was undervaluing candidate's conciseness.
This matches the failure mode predicted at session-start: when the LLM has discretion over whether to call search_access_documents, it sometimes doesn't, and on detail-heavy questions that costs substance. Composite saturated near 5.00 in both runs — compare-judge surfaced the behavioral difference the rubric averaged out. Same lesson as lesson_composite_vs_behavioral_parity.md, fresh datapoint.
Spotted on tc-allocations-02 in the candidate report: agent now opens "Searching 'climate modeling' on Delta returned 58 results, but none are actually about climate modeling." Traced via git log + blame to a 2026-04-29 four-commit chain:
1. access-mcp d4d9845: pagination + query_relevance metadata on 3 search tools, plus docs → links rename per Andrew's review.
2. access-mcp 4bb5ce6: extended same metadata across all 16 listing tools.
3. access-agent 88eccdd: first prompt rule consuming the new metadata + null-RAG-fallback rule.
4. access-agent 9e449ac: strengthened the loose_match rule to a mandatory opening-line shape; this is the commit that produced today's exact phrasing.
Co-designed change: neither side could have produced the visible behavior alone — the metadata gives the prompt something concrete to anchor to, and the prompt gives the metadata behavioral teeth. Saved as finding_loose_match_honest_framing_2026_04_29.md.
PR #3 was originally framed as the see_all_url deliverable only, but its branch feat/listing-urls-in-tool-responses carries all four 2026-04-28 / 2026-04-29 commits. Updated the PR title and body to bundle the three deliverables (see_all_url, links rename, pagination + query_relevance) with per-deliverable motivation, design, and a companion-changes section linking the two access-agent commits that pair with deliverable 3. Title now: feat: add structural metadata to MCP tool responses (see_all_url + pagination + query_relevance). Still draft.
- Roll back vs harden. The judge picked baseline. Two reasonable directions: flip USE_NO_CLASSIFY=false and retire the experiment (the centralization commit and the search_access_documents tool stand on their own), or strengthen the docs-first rule in the no-classify prompt so the LLM has less discretion to skip the doc-search call, then re-run the 14-Q battery and compare-judge against today's candidate.
- PR strategy. feat/no-classify is unmerged. Could open a draft PR off feat/qwen-integration, hold the sub-branch unmerged pending the rollback decision, or cherry-pick b7765b4 (centralization) into PR #26 since it's stand-alone valuable.
- MCP PR #3. Ready to flip from draft to open whenever Andrew should see it.
2026-05-04 — Plan for the classify-free path: tightened to a single tool, master switch on feat/qwen-integration
Andrew was offline most of the day; spent the session pinning down a plan for the classify-removal work that Joe could commit to without a sign-off in hand.
The 2026-05-01 plan had a spike/no-classify branch off feat/qwen-integration with two new RAG tools (search_uky_general_rag + search_uky_xdmod_rag) and full deletions of classify / chain / flag if .4 won. Andrew's brief Slack today shifted the framing: "Wrapping the RAG doesn't necessarily make sense. We're getting rid of the RAGs (to be replaced with the /retrieve documentation endpoint). It's could be fine to leave the classifier in until that happens?"
Three concrete corrections to the plan:
- No spike branch. The work is permanent infrastructure (new tool, prompt changes, graph routing), not throwaway research. It commits onto feat/qwen-integration directly. Cross-commit comparison via compare-judge replaces cross-branch comparison.
- Two RAG tools collapse to one. A single search_access_documents(query, source: Literal["general", "xdmod"] = "general", rp_name?) exposes the existing endpoint_type parameter on uky_client.ask() directly to the LLM. The choice classify makes today moves into the LLM's hand via tool description, not into a heuristic inside the tool body.
- No deletions today. A master switch in graph.py introduces the new path alongside the old ones. Classify, the legacy chain, USE_TOOL_CALLING_LOOP, and agent_full_legacy all stay alive until Andrew confirms prod is on the loop and Phase 8 sign-off is done. Cleanup is a follow-up PR.
Long working session with the uky_client code. ask() already takes endpoint_type: Literal["general", "xdmod"] (required) and rp_name: str | None (optional); _get_url() picks one of two configured URLs (UKY_RAG_GENERAL_URL / UKY_RAG_XDMOD_URL) and POSTs to it. So our side really does pick between two endpoints — the source param on the new tool is a direct surface of an existing internal choice, not a new abstraction.
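A minimal sketch of how the tool's source parameter surfaces the existing endpoint_type choice. The ask() and get_url() bodies here are stand-ins, not the real uky_client code: the URLs are placeholders for UKY_RAG_GENERAL_URL / UKY_RAG_XDMOD_URL, the payload keys are hypothetical, and nothing is actually POSTed.

```python
from typing import Dict, Literal, Optional

Source = Literal["general", "xdmod"]

# Placeholder URLs standing in for UKY_RAG_GENERAL_URL / UKY_RAG_XDMOD_URL.
_URLS: Dict[str, str] = {
    "general": "https://example.invalid/general",
    "xdmod": "https://example.invalid/xdmod",
}


def get_url(endpoint_type: Source) -> str:
    """Pick one of the two configured endpoints."""
    return _URLS[endpoint_type]


def ask(query: str, endpoint_type: Source, rp_name: Optional[str] = None) -> dict:
    """Stand-in for uky_client.ask(): returns the request instead of sending it."""
    payload = {"query": query}  # payload shape is an assumption
    if rp_name is not None:
        payload["rp_name"] = rp_name
    return {"url": get_url(endpoint_type), "payload": payload}


def search_access_documents(
    query: str, source: Source = "general", rp_name: Optional[str] = None
) -> dict:
    """The tool body is a pass-through; the LLM makes the general/xdmod choice."""
    return ask(query, endpoint_type=source, rp_name=rp_name)
```

The tool adds no routing heuristic of its own, so swapping the body for a /retrieve call later would leave the parameter surface untouched.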
Why one tool with a param, not two named tools: when /retrieve ships, the function body changes (return type widens to chunks, system prompt updates, possibly multi-call iterative search), but the parameter name and semantics can survive — source="general" either remains meaningful (if /retrieve doesn't subsume XDMoD) or quietly retires (if it does). Two-tool design would force a rename or deprecation regardless. The spec for /retrieve is silent on whether it subsumes XDMoD.
Three options were on the table at different points today:
- A new USE_NO_CLASSIFY flag → introduces a permutation through the graph (legacy / loop-with-classify / loop-no-classify) that has to be carried indefinitely.
- A spike/no-classify branch → implies throwaway, but the work is permanent.
- A master switch in graph.py routing → equivalent to the flag in effect, but the surface is one routing branch and it deletes cleanly when Andrew signs off the cutover.
Picked the master switch. The cleanup PR removes the routing complexity, the flag, classify, the chain, and agent_full_legacy all at once.
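A sketch of what "one routing branch" means in practice, assuming the switch is a single boolean read in graph.py. The function names are hypothetical and the node names are plain strings here; the real graph wires LangGraph nodes.

```python
from typing import List


def entry_node(no_classify: bool) -> str:
    """Where START routes: straight to the loop, or through classify as before."""
    return "tool_calling_loop" if no_classify else "classify"


def loop_tools(mcp_tools: List[str], no_classify: bool) -> List[str]:
    """The doc-search tool is only offered to the loop on the classify-free path."""
    tools = list(mcp_tools)  # copy so the caller's catalog is not mutated
    if no_classify:
        tools.append("search_access_documents")
    return tools
```

Deleting the switch later means collapsing each function to its no_classify=True branch, which is the "deletes cleanly" property the plan relies on.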
- Verified ask()'s signature (uky_client.py:65-72) and the two-URL routing (uky_client.py:53-57 and __init__ lines 35-46).
- Verified the /retrieve spec footprint: only Phase 4 of the launch umbrella plan and the matching section of the hardening spec (line 167). Both are placeholders pointing to "the contract Vikram is building." Nothing in access-qa-planning mentions /retrieve.
- Did not verify whether prod is currently running USE_TOOL_CALLING_LOOP=true. Joe needs to confirm with Andrew before the eventual cleanup PR: if prod is still on the chain, deleting things is the cutover and needs Phase 8 sign-off.
In the current qwen-integration branch I'm setting up a classify-free path which uses a single tool calling + thinking loop that is equipped with
- all existing tools, including the domain_agent ones
- a new search_documentation tool that will, for now, wrap existing /ask function that calls the UKY endpoints, will eventually instead wrap the /retrieve function that gets the UKY chunks
In addition to the rp param that gets passed through, we'll have the loop decide to pass XDMod or general since the ask function still expects it so it knows which UKY rag to hit. This is the logic that was in classify. I don't know if the retrieve path is going to want the same param, so we'll just remember it might need to be passed through, or removed.
- architecture_stages_2026_05_01.md rewritten to reflect the master-switch plan and the single-tool design.
- next_session_prompt.md rewritten with the implementation order: centralize max_tokens first, then add the tool, then the master switch, then flatten domain MCP tools, then run the cross-commit comparison.
Picked up where the 2026-04-30 entry left off. Updated .env with the UKY Qwen target, hit the same 500 (Connection error.. Received Model Group=ccs/Qwen/Qwen3.6-35B-A3B-FP8) on retry, raw curl reproduced the same shape — confirmed it's UKY-side, not us. Wrote two diagnostic markdown files at the access-ci root: uky-endpoint-diagnostic.md (technical) and uky-endpoint-brief.md (the Vikram-facing version, three short prose paragraphs + curl evidence, no commentary).
Pivoted the smoke target to ccs/kimi-k2.6 (the only working reasoning model on the same proxy — the colon-format names like ccs/qwen3.6:35b, ccs/deepseek-r1:8b etc. are misrouted to Anthropic, ~14 of them, all 400 with AnthropicException). Smoke runs cleanly through our wrapper. Kimi exposes reasoning in a separate reasoning_content JSON field rather than inline </think> — so the strip is a no-op for Kimi (still proven by the 13 unit tests on synthetic input; not exercised end-to-end against a real </think>-emitting model since none are reachable on UKY today).
Ran tc_loose_match_subset.yaml (4 Qs) through both the legacy chain and the loop, with LLM_PROVIDER=vllm, VLLM_MODEL_NAME=ccs/kimi-k2.6. Database overrides (DATABASE_URL=...localhost..., MCP_SERVER_HOST=localhost) needed because the local docker stack's container has stale code; ran the eval CLI from local source against the container's exposed Postgres + MCP ports.
| | Legacy chain | Loop |
|---|---|---|
| Run ID | chain-20260501-140041-066793 | loop-20260501-141220-9e7684 |
| Scored | 2 / 4 | 3 / 4 |
| Composite | 4.88 | 5.00 |
HTML report deployed: https://access-ci-reports.netlify.app/tc-loose-kimi-2026-05-01.html
Both answer_length=0 skips in the chain run came from Kimi consuming all 4000 tokens of the synthesize budget on reasoning_content and never producing content. Different reasoning models partition the budget differently — Qwen embeds reasoning inline in content (our wrapper recovers the answer post-hoc), Kimi puts it in a separate field that shares the budget with content. Per-node budgets in the codebase today: synthesize=4000, plan=1500, tool_calling_loop=2000 (default), evaluate/recover=500. Once we're on a reasoning model in production these need to scale up, or enable_thinking=False needs to be passed on terse paths. Open question for Andrew, sent.
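The Kimi failure reduces to simple arithmetic once reasoning and content are treated as drawing on one budget. The sketch below is a simplified model under that assumption (real token accounting is done server-side and the function name is hypothetical).

```python
def content_tokens_available(max_tokens: int, reasoning_tokens: int) -> int:
    """Tokens left for the visible answer when reasoning shares the same budget."""
    return max(0, max_tokens - reasoning_tokens)


# Synthesize budget of 4000, all of it spent on reasoning: nothing left for content,
# which shows up in the eval as answer_length=0.
kimi_chain_run = content_tokens_available(max_tokens=4000, reasoning_tokens=4000)

# Same reasoning cost against a larger budget leaves room for an answer.
with_headroom = content_tokens_available(max_tokens=8000, reasoning_tokens=4000)
```

Either raising max_tokens or suppressing reasoning with enable_thinking=False moves the result off zero; which one fits a given node is the open question.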
Three messages: "the design should be not to use a classify node," "Vikram is looking at the Qwen issue," "what do you mean by token budgets?"
The first is a new design direction not in any Andrew-authored doc. Confirmed by reading every file he authored: the Phase 3 hardening spec at access-agent/docs/superpowers/specs/2026-04-21-production-launch-hardening-design.md line 150 explicitly scopes the loop to "Replaces plan + execute + evaluate + recover with a single tool_calling_loop node" — classify is not in that list, and active/10-analytics-and-domain-agents.md (Andrew's domain-agent doc) shows classify upstream of the existing react agents. The implementation plan we co-wrote with Claude faithfully reflected that scope. So the no-classify direction is genuinely new from him today.
Two responsibilities currently held by classify need new homes:
- RAG endpoint selection (general vs XDMoD) → wrap each as a LangChain tool (search_uky_general_rag, search_uky_xdmod_rag), defined in a new src/agent/tools/rag_tools.py, appended to the loop's tool list alongside the MCP tools.
- Domain agent routing (announcements, JSM) → fold their MCP tools (already in the catalog for reads; available for writes when READ_ONLY=false) directly into the main loop. The workflow choreography that lives in src/agent/nodes/domain_agent.py (preview/confirm/create for announcements, field-gather for JSM) gets moved into the main system prompt as instructions.
Doing this on a flag would make the graph carry three permutations (legacy / loop-with-classify / loop-no-classify). Since this is a research spike rather than a production cutover, it goes on a branch (spike/no-classify off feat/qwen-integration); compare runs across branches via compare-judge (eval already records branch + commit per run). If no-classify wins, it becomes a real PR — classify, USE_TOOL_CALLING_LOOP, the legacy chain, and domain_agent as a separate node all delete; the graph gets much smaller.
Four cumulative stages, each builds on the previous:
1. classify → plan → execute → evaluate → recover → synthesize (legacy chain, OpenAI inside).
2. classify → tool_calling_loop (OpenAI): Phase 3, PR #14, USE_TOOL_CALLING_LOOP=true.
3. classify → tool_calling_loop (UKY Qwen/Kimi): Phase 5, this branch (feat/qwen-integration). LLM_PROVIDER=vllm.
4. tool_calling_loop (UKY Qwen/Kimi), no classify: the upcoming spike branch.
Comparisons:
- 1 vs 2: done in PR #14 (4.67 = 4.67 on phase3_smoke_battery).
- 2 vs 3: partially done today (loop-on-Kimi has data; OpenAI-loop baseline lives in earlier eval runs).
- 3 vs 4: next eval. Run .3 on feat/qwen-integration, switch branches to spike/no-classify, run .4, compare-judge across the two run IDs.
- No-classify spike branch. Cut from feat/qwen-integration. RAG-as-tools, domain workflow content into system prompt, gut classify, run subset, compare.
- Vikram on Qwen3.6 FP8 endpoint. Diagnostic at uky-endpoint-brief.md, paste-ready curl repros.
- Token budget decision. Centralize in config or per-call-site; use enable_thinking=False on terse paths (evaluate/recover/old-classify-equivalent if any). Sent to Andrew, awaiting response.
- Draft PR #26. https://github.com/necyberteam/access-agent/pull/26 covers stage 3 (Qwen integration + Issue #15 cleanup). Stays draft until Qwen3.6 is reachable for a real smoke + the no-classify direction is settled.
2026-04-30 — Qwen integration committed (Phase 5, commit a1c54bb); UKY endpoint smoke blocked by vLLM backend
PR #14 (eab821f) merged 2026-04-30T18:09Z, putting tool-calling-loop, READ_ONLY filter, and code-quality guardrails on main. Started cross-repo cleanup, then opened Phase 5 (Qwen) on a fresh branch.
Synced main and pruned merged feature branches:
- access-agent: pulled 101 commits, deleted local feature/production-baseline-comparison (merged in PR #14)
- access-qa-bot, qa-bot-core, qa-bot-proxy: each had feature/non-agentic-proxy-2026-04-10 merged upstream and behind main; checked out main, pulled, deleted the local branch
- access-qa-bot had a stale working-tree edit changing BACKEND_ID from 'uky' to 'access'. main already has 'access', so the modification was content-identical to main; stashed for safety as stash@{0} rather than discarded
- access-ci-ui chore/bump-access-qa-bot-3.5.2: left alone. PR #78 was closed without merge on 2026-04-12; upstream main is now at ^3.7.1, well past the ^3.5.2 the branch tried to land. Definitively superseded
- access-mcp feat/listing-urls-in-tool-responses: left alone, DRAFT PR #3 still active (the see_all_url work)
Cut a fresh branch off main and built the Phase 5 LLM-layer changes. Highlights:
- </think> reasoning strip in the LLM client wrapper. Subclassed ChatOpenAI to override _generate/_agenerate and remove anything up to and including </think> from response content. Strip happens during each call, not at end-of-loop, so when the prebuilt react agent re-sends conversation history on subsequent turns the model never sees its own prior reasoning replayed back as input. Andrew originally suggested stripping in the loop node, but tool_calling_loop.py delegates to LangGraph's create_react_agent and we don't own the per-turn LLM call there; provider-layer placement was confirmed with him as the canonical client-side pattern (matches DeepSeek-R1 / Qwen guidance).
- enable_thinking parameter wired through LLMProvider.get_chat_model, both providers, and get_llm(). Default None = nothing extra in the request body. When set, OpenAICompatibleProvider adds extra_body={"chat_template_kwargs": {"enable_thinking": value}} so the request lands at LiteLLM with the right shape. No call site uses it yet; the knob is exposed per Andrew's note for future fast-path experiments.
- 13 unit tests in tests/test_llm_providers.py covering the strip helpers (no-op when tag absent, idempotent, multimodal-content skip), the wrapper subclass type, and the enable_thinking plumbing on/off semantics across both providers. mypy/ruff clean; full suite still 263 pass / 1 skip.
- qwen_smoke.py: runnable end-to-end check at the LLM layer (independent of the agent graph). Two round-trips: default thinking + enable_thinking=False, both expected to come back without </think> artifacts.
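A standalone sketch of the two helpers described above, without the ChatOpenAI subclass around them. Function names are hypothetical. Two details are assumptions rather than confirmed behavior: the cut is made at the last </think> in the string, and leading whitespace after the tag is trimmed.

```python
from typing import Any, Optional

THINK_CLOSE = "</think>"


def strip_reasoning(content: Any) -> Any:
    """Drop everything up to and including </think>; leave other content alone."""
    if not isinstance(content, str):
        return content  # multimodal (list-shaped) content is skipped
    _, tag, answer = content.rpartition(THINK_CLOSE)
    if not tag:
        return content  # no tag present: no-op
    return answer.lstrip()  # whitespace trim is an assumption


def thinking_kwargs(enable_thinking: Optional[bool]) -> dict:
    """Extra request kwargs; None means nothing is added to the request body."""
    if enable_thinking is None:
        return {}
    return {"extra_body": {"chat_template_kwargs": {"enable_thinking": enable_thinking}}}
```

Because the output of strip_reasoning never contains the tag, applying it twice gives the same result as applying it once, which is the idempotence property the unit tests check.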
Updated .env with LLM_PROVIDER=vllm + the three VLLM_* variables Andrew supplied (https://jump-external.ccs.uky.edu/v1, model ccs/Qwen/Qwen3.6-35B-A3B-FP8). Smoke and a raw curl reproduce the same error:
HTTP 500 — litellm.InternalServerError: OpenAIException - Connection error..
Received Model Group=ccs/Qwen/Qwen3.6-35B-A3B-FP8
Available Model Group Fallbacks=None
/v1/models returns 200 with ccs/Qwen/Qwen3.6-35B-A3B-FP8 in the list, so the LiteLLM proxy itself is healthy and the model group is registered — the failure is downstream, between LiteLLM and the vLLM backend. A second registered route, ccs/qwen3.6:35b, returns 400 (AnthropicException - Model 'qwen3.6:35b' was not found), which looks like a separate misconfiguration. Vikram (not Andrew) controls those servers; ping pending.
Curl repro proves the failure is upstream of our code — same 500 from raw curl bypassing Python entirely. Config is taking; the smoke printed VLLM_BASE_URL and VLLM_MODEL_NAME correctly out of settings before the call.
Same branch, second commit. Removed dead code from Issue #15's "removable now" bucket: the full pgvector path, which was disabled 2026-03-23 and never re-enabled, plus a RAG-only synthesis fallback that had zero callers anywhere:
- src/services/qa_client.py deleted entirely (229 lines)
- _search_pgvector and _get_threshold_for_query_type removed from rag_answer.py (only consumers of qa_client)
- _synthesize_with_rag_only and RAG_ONLY_SYNTHESIS_PROMPT removed from synthesize.py (no callers)
- Config fields with no remaining readers: QA_SERVICE_URL, RAG_TOP_K, RAG_THRESHOLD_STATIC/COMBINED/FALLBACK, RAG_SIMILARITY_THRESHOLD, MAX_RETRIES_PER_TOOL, MAX_RETRIES_TOTAL, TIMEOUT_BUDGET_MS
- The QA Service URL startup log line in main.py
- Matching env-var stubs in .env.example, docker-compose.yml, docker-compose.prod.yml, plus README config table and file tree
Backed off from the original Issue #15 list on RetryContext: memory said it was only consumed by recover_node, but grep showed plan.py also uses retry_context in its prompt. Both are part of the legacy chain that's still selectable via agent_full_legacy for eval comparisons. Killing RetryContext would break that path. Separate decision.
ruff/mypy/pytest stayed green (263 pass / 1 skip).
While Qwen3.6 is offline, exercised the wrapper against an alternate model on the same proxy to verify our code is correct independent of the Qwen target.
End-to-end through our wrapper against ccs/scout succeeded. Full stack — .env → settings → get_llm() → _StrippingChatOpenAI → HTTPS to LiteLLM → response parsing — round-trips cleanly. Wrapper class _StrippingChatOpenAI confirmed; response comes back as plain content. The strip is a no-op for non-reasoning models (no </think> to remove), and that no-op behavior is also covered by unit tests. So integration path is proven correct against a live UKY model; nothing in our code is blocking.
Broader UKY routing pattern found while probing for any other reasoning model that's hot:
- Every name:tag format model (ollama-style: ccs/qwen3.6:35b, ccs/deepseek-r1:8b, ccs/qwen3.5:9b, ccs/qwq:32b, ccs/llama3.1:8b, ccs/gemma3:27b, ccs/mistral-small3.2:24b) returns HTTP 400 with AnthropicException - Model 'X' was not found. LiteLLM has these model groups configured to route through Anthropic, which doesn't host them. Looks like a config-template error in LiteLLM: they should be routing to vLLM/Ollama, not to Anthropic.
- The slash-format ccs/Qwen/Qwen3.6-35B-A3B-FP8 returns HTTP 500: LiteLLM accepts the request, identifies the Model Group, then fails to reach the vLLM backend behind it. vLLM isn't serving that model right now.
- ccs/scout (also slash-format, no version tag) returns 200 cleanly.
So the proxy itself is healthy, auth is fine, and at least one model group is correctly wired through. The breakage is per-model-group: one (scout) works, all the colon-format ones are misrouted, and the FP8 Qwen3.6 vLLM target is down.
- Vikram ping: diagnostic above is paste-ready. Two distinct issues worth surfacing: the FP8 vLLM target being down, and the broader colon-format → Anthropic misrouting (probably a one-line LiteLLM config fix).
- Re-baseline on tc_loose_match_subset.yaml: gated on Qwen actually responding. After that lands, hand-read the answers per the eval-is-instrumental memory rather than chasing composites.
- PR open: once at least one successful Qwen response is on the wire, open feat/qwen-integration against main. Currently two commits: a1c54bb (Qwen integration, +283/−1) and 0ecd5ad (cleanup, +3/−478).
2026-04-29 — TC battery ground-truth pass: all 14 questions verified, 11 rubric commits, real bugs surfaced
Completed the ground-truth pass on tool_coverage_battery.yaml set up by the 2026-04-28 entry. All 14 questions now have hand-verified ground truth (three-source: WebSearch + WebFetch + local MCP) and atomic-fact rubrics. 11 commits on feature/production-baseline-comparison, 3 done in-session and 8 dispatched to a fork that committed per-question:
9d4b68a fix(eval): tc-rag-04 — atomic facts, surface RAG content staleness
d20f0c8 fix(eval): tc-rag-03 — loosen F2/F4 to match RAG content, document variance
3cadc85 fix(eval): tc-rag-02 — atomic facts, fix Accelerate proposal length
11277f2 fix(eval): tc-rag-01 — atomic facts, separate login-host from 2FA enrollment
02cfbbd fix(eval): tc-xdmod-01 — atomic facts, accept content-equivalent realm naming
2ed4dad fix(eval): tc-events-01 — refresh snapshot, tighten F4 to grade link target
0dd6474 fix(eval): tc-nsf-01 — refresh snapshot to 2026-04-29
42c597e fix(eval): tc-allocations-02 — correct ground truth, document agent gap
0019dec fix(eval): tc-software-02 — correct ground truth, drop chatter-rewarding F3
5fe048a fix(eval): tc-software-01 — reframe F2, fix stale doc URL
c825fad fix(eval): tighten tc-status-01 F3 to grade link target, not verbal framing
The pass exposed concrete cases where the existing rubric was grading against incorrect ground truth — not just minor wording, but factual errors that would have masked or fabricated agent failures:
- tc-software-02: F1 listed Delta CUDA versions as 13.1.1, 12.9, 12.8, 11.8. That was the top-level union of versions_by_resource from the MCP, i.e., Delta + DeltaAI conflated. Per-resource breakdown (verified via MCP and NCSA's own cudatoolkit/25.3_12.8 module name): Delta has only 12.8 and 11.8; 13.1.1 and 12.9 are DeltaAI-only.
- tc-allocations-02: authoring_notes claimed "zero of them are actually allocated on a Delta-family resource" for the loose-match result set. Re-verification on 2026-04-29 found 6 of 20 results are on Delta-family resources (Delta GPU or DeltaAI), but none are about climate modeling. The F1 wording ("no climate-modeling Delta projects") was structurally correct; only the explanatory notes were wrong.
- tc-rag-02: F4 specified Accelerate proposals as "Up to 3-page proposal with panel merit review". Per allocations.access-ci.org/project-types: Accelerate is 10 pages (Discover is 3). Corrected.
- tc-rag-04: RAG content predates NSF Important Notice No. 149 (July 2025). Indexed text still says "unaffiliated or self-employed CAN apply", directly contradicting current policy. The agent faithfully passes this through. F6 (institutional-email requirement) correctly fails when this happens; documented as RAG-content-staleness signal, not a rubric loosening.
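The tc-software-02 error is easy to reproduce in miniature. The sketch below assumes versions_by_resource is a mapping from resource name to a version list; that shape, and the helper names, are assumptions. The DeltaAI entry holds only the two versions the entry above calls DeltaAI-only, not a full DeltaAI inventory.

```python
from typing import Dict, List

# Assumed shape; values limited to what the ground-truth pass verified.
versions_by_resource: Dict[str, List[str]] = {
    "Delta": ["12.8", "11.8"],
    "DeltaAI": ["13.1.1", "12.9"],
}


def top_level_union(by_resource: Dict[str, List[str]]) -> List[str]:
    """What the old rubric graded against: every version on any resource."""
    seen: List[str] = []
    for versions in by_resource.values():
        for version in versions:
            if version not in seen:
                seen.append(version)
    return seen


def versions_for(resource: str, by_resource: Dict[str, List[str]]) -> List[str]:
    """What the question actually asks: versions on one named resource."""
    return by_resource.get(resource, [])
```

Grading a Delta question against top_level_union would credit 13.1.1 and 12.9 as correct Delta answers, which is the conflation the corrected F1 removes.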
tc-allocations-02 composite 1.00. The agent narrates loose-match search results as exact matches: presents three projects with no climate-modeling content as "20 climate modeling projects on Delta." This is a genuine product failure — search_projects returned a fuzzy result set, the agent didn't qualify it. Per the session's instruction set (no agent-prompt edits during the ground-truth pass), this finding was committed to the rubric and authoring_notes for follow-up rather than fixed inline.
Recurring failure modes in the prior YAML, addressed system-wide:
- Chatter-rewarding "verbal framing" facts. Several questions had facts demanding the answer use a specific phrase (e.g., "cites the live status feed as its source") when the substantive thing was a link target. Same pattern across tc-status-01 F3, tc-software-02 F3, tc-events-01 F4, tc-xdmod-01 F2/F3 — all rewritten to grade what the URL points to, not how the answer phrases the citation. These rewards were a material driver of the chain-vs-loop composite gap noted on 2026-04-28.
- Heading + items dict structure. tc-rag-01, tc-rag-02, tc-rag-04 used a {heading, items} shape that flattens into N redundant fact verdicts. Replaced with atomic per-claim facts. tc-rag-04 went from 13 sub-claims to 6 atomic facts.
- OR clauses misread by judge. tc-rag-01 F3 had "registry.access-ci.org/ OR ~/.ssh/authorized_keys" and the judge consistently demanded both. tc-software-02 F3 had a similar problem with "inline hint OR doc link" and was eventually dropped as a chatter trap. Where an OR is genuinely needed, explicit "either is sufficient" wording works better than "OR".
- describe_realms still HTTP 500 as of 2026-04-29 (xdmod MCP). tc-xdmod-01 was rewritten to grade content-equivalent metric-category enumeration since RAG is the actual answer source for that question right now.
- UKY RAG returns variably. tc-rag-03 ran 4.45 / 3.10 / 4.45 across three sequential runs; the 3.10 had empty rag_context. tc-rag-04 returned 0 chars on 2 of 4 runs. Composite oscillation on rag-* questions is RAG variance, not rubric or agent flakiness: the per-fact verdicts honestly distinguish "RAG worked" from "RAG fell back to general knowledge."
- docs.see_all_url propagation confirmed. Multiple questions show the agent now linking to canonical pages (sds.access-ci.org for software, support.access-ci.org/outages for status, etc.) without any agent prompt change. Same pattern documented in the 2026-04-28 entry, now observed at battery scale.
Saved a feedback memory (feedback_eval_is_instrumental.md) capturing a guidance correction made during the session: composite 5.00 does not mean an answer is perfect, only that it satisfied that judge interpretation on that run. Several judge errors were observed during the pass (misread OR clauses, demand verbal framing when the link is what counts, miss content under different naming). The HTML report from any future battery run is the starting point for a human + frontier-model review, not a verdict. The system, not the eval, is the product.
- Step 12 (full TC re-run + compare-judge + HTML + Netlify publish): staged for next session via next_session_prompt.md. Comparison vs. the 2026-04-28 baseline (chain 4.63 / loop 4.26) will tell whether the rubric pass moved the gap.
- tc-allocations-02 agent fix: narrate fuzzy/loose-match results as such instead of as exact matches.
- xdmod describe_realms 500: open against access-mcp.
- NSF Notice 149 RAG content refresh: out of scope for the agent repo, but the rubric now signals when this corpus is stale.
While iterating on tc-announce-01 (the routine memory's process question), surfaced a structural gap: when search_announcements returned 0 hits for "Expanse", the agent invented Confluence/SDSC fallback links because nothing in the tool output, RAG, or prompt named the canonical ACCESS announcements page. RAG corpus probe confirmed the problem isn't going to be patched at that layer — 11 different "where can I find X" phrasings returned zero matches each, corpus has only documents (603) + compute-resources (80) domains.
PR necyberteam/access-mcp#3 ("feat: add see_all_url to MCP tool listings"). Lifted compute-resources' private addDocumentation() helper to a BaseAccessServer.listingDocs(context) method and applied it across 6 servers:
| Server | URL surfaced |
|---|---|
| announcements | support.access-ci.org/announcements |
| events | support.access-ci.org/events |
| affinity-groups | support.access-ci.org/affinity-groups |
| system-status | support.access-ci.org/outages |
| software-discovery | sds.access-ci.org/ |
| allocations | allocations.access-ci.org/current-projects |
Additive change — JSON.stringify drops undefined so opt-out servers stay clean. compute-resources/xdmod/nsf-awards/jsm intentionally skipped (different shapes / external systems). 13 files, +115/-2, 222 tests pass (198 unit + 24 integration). Rebuilt locally via docker compose up -d --build mcp-{announcements,events,affinity-groups,system-status,software-discovery,allocations}; all 6 confirmed surfacing the new field via direct curl probes.
Re-ran the question against agent_full after the rebuild:
- Before (loop-20260427-191442-61e1df): "...the official ACCESS documentation at ACCESS Documentation or the Expanse user guide at SDSC Expanse User Guide"
- After (loop-20260428-202727-ba807b): "...you can check the ACCESS Announcements page..."
The agent's loop spontaneously surfaced the new docs.see_all_url field — no prompt edit, no synthesis-node teaching. Confirms the routine-memory claim that link content from tool output flows faithfully into the final answer. Composite 4.50, F1+F2 both yes; completeness=3 is the inherent ceiling for a "no results" answer, not a regression.
routine_iterating_on_tc_questions.md already classified failures as agent / rubric / judge. Updated it with a fourth bucket for future tc-* iterations: MCP product gap. When the agent invents a fallback link or names the wrong site, check the MCP server's tool response shape first. If there's no canonical landing-page field, the fix belongs there, not in the agent prompt. The docs.see_all_url convention is now established across the 6 servers.
A full tool_coverage_battery re-run against the rebuilt MCP servers should show similar improvements on events/affinity-groups/system-status/software-discovery/allocations questions. compare-judge against pre-2026-04-28 runs would quantify it. Not done today — single-question validation was the intent.
Ran the parity-check comparison end-to-end against the rebuilt MCP servers, on both batteries:
| Battery | Questions | Chain composite | Loop composite | Margin | Run-judge winner | Report |
|---|---|---|---|---|---|---|
| Phase 3 smoke | 40 | 4.80 | 4.81 | +0.01 (loop) | B (small) | phase3-smoke |
| Tool coverage | 14 | 4.63 | 4.26 | −0.37 (chain) | A (small) | tc-battery |
The smoke battery is essentially tied (0.01); the TC battery shows the chain pulling ahead by 0.37. The gap is concentrated in tool-shaped questions where the chain's verbose RAG-included answers score higher on per-fact verdicts and run-judge "comprehensiveness", while loop's more selective answers score lower despite often using fresher tool data. The run-judge self-flagged this calibration concern explicitly: "the judge may have underestimated the value of up-to-date information provided by System B."
Pattern paragraph from the TC compare-judge:
- Chain wins on detailed context, specific examples, actionable guidance.
- Loop wins on current/accurate data, especially software versioning.
- Tied on straightforward info-only questions.
Read: at least half of the 0.37 gap is rubric/judge artifact, not behavioral regression. Per-fact verdicts and the run-level judge both reward comprehensiveness; neither penalizes verbose-from-RAG over concise-from-tools. This is the eval-system work the 2026-04-27 entry was setting up.
Run IDs:
- Smoke: baseline chain-20260428-203336-fefdc1, candidate loop-20260428-203338-35b0e0
- TC: baseline chain-20260428-205550-771850, candidate loop-20260428-205553-7eecb5
Sample answer from comb-002: "There are currently 100 active allocation projects using GPU resources." The 100 is exactly page_size from searchProjects's page-1 response. Agent is reporting the page-1 batch as the population total.
Right fix is at the MCP layer, no crawling. The upstream API already returns pages: number on every page-1 response (ProjectsResponse.pages at allocations/src/server.ts:48). The MCP layer just drops it when constructing the tool response shape. Same fix-at-source pattern as docs.see_all_url:
```typescript
return JSON.stringify({
  total: items.length,
  items,
  pagination: {
    page: 1,
    pages: response.pages,
    has_more: response.pages > 1,
  },
  docs: this.listingDocs("search"),
})
```

Forwards what the upstream already told us: no extra fetches, no agent-side crawling. The agent then has unambiguous info to write "over 500" or "at least 100" instead of treating the batch as the whole. For tools where upstream genuinely doesn't return totals (some software-discovery / system-status endpoints), surface pagination: { has_more: true, total_known: false } so the agent knows it's seeing a partial.
Worth its own follow-up PR after today's wins are absorbed. ~5-10 lines per tool, similar PR shape to today's see_all_url change.
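To make the intended reading of that metadata concrete, here is a minimal Python sketch of how a consumer of the tool response might phrase a count. The helper name describe_count is hypothetical and nothing like it exists in the agent today (the LLM does this interpretation from the prompt); the field names has_more and total_known follow the shapes discussed above.

```python
from typing import Any, Dict


def describe_count(shown: int, pagination: Dict[str, Any]) -> str:
    """Phrase a result count without overstating what one batch proves.

    Hypothetical helper: `shown` is the number of items in this response,
    `pagination` is the metadata block the MCP tool returned.
    """
    partial = bool(pagination.get("has_more")) or pagination.get("total_known") is False
    if partial:
        # The batch is a lower bound, not the population total.
        return f"at least {shown}"
    return str(shown)


# A page-1 batch of 100 with more pages behind it is a floor, not a total.
print(describe_count(100, {"page": 1, "pages": 6, "has_more": True}))
# A complete listing can be reported as an exact count.
print(describe_count(12, {"has_more": False, "total_known": True}))
```

The point of the sketch is only that the qualifier becomes a mechanical consequence of the metadata rather than something the agent has to guess.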
Vikram surfaced two UX preferences in the 2026-04-27 sync: (1) concise responses preferred over long ones, (2) followable links to comprehensive, searchable sources. Today's work touches both, and the lineage is worth pinning down so it doesn't get re-litigated.
Brevity is structural, not luck. The legacy chain has a synthesize node that ingests RAG output (notably UKY's /ask paragraph responses, which run 400-800 words by default) and weaves it into the final answer. The loop has no synthesize node — the LLM produces the final answer when it stops calling tools, which tends toward brevity given the conversation context. So loop answers are consistently shorter without any prompt change. Direct consequence of removing the synthesis step. Anticipated by the Phase 3 architecture work; not stumbled into. Most visible on tc-announce-01: legacy ~280 words including 8 bullets, loop ~50 words.
Followable links — landed today. PR necyberteam/access-mcp#3 adds docs.see_all_url to 6 MCP servers' tool responses. Each tool now returns a canonical landing-page URL alongside its results, and the agent surfaces those URLs faithfully without prompt change. Matches Vikram's "show top results AND a followable link to the source" pattern exactly.
Tracing the legacy chain's tc-announce-01 answer to its source. The eight detailed bullets about Expanse hardware/policies that the legacy chain produced today were traced via eval_scores.context.rag_context to UKY's /ask synthesis service. UKY retrieved chunks from SDSC's Expanse user guide PDF, and their synthesis LLM wove them into a paragraph framed as "the most recent ACCESS announcements about Expanse highlight..." — because that's what the user asked for, and UKY's synthesis is willing to write it that way. The chain's synthesize node verbatim-included UKY's paragraph; the MCP search_announcements tool also ran and returned 0, with that "no new announcements" line buried at the very end after the 8 confidently-stated bullets. Facts: real (SDSC docs). Framing: wrong (specs presented as announcements). Effective experience: arguably worse than hallucination — a skimming reader would treat the 8 bullets as recent announcements before reaching the disclaimer. Connects to the architecture conversation Vikram + Andrew opened 2026-04-27 about how chunks should reach the agent (raw chunks vs MCP-wrapped vs full synthesis); today's loop avoided this trap by virtue of its different relationship with retrieval.
Side-benefit #1 (brevity) depends on the chain → loop migration; if the chain stays alongside loop for any user segment, that segment loses brevity. Side-benefit #2 (followable links) lands regardless of agent variant — even legacy chain answers now carry the canonical URLs (the chain just pads them with UKY synthesis, but the URL is still there).
Ran step 12 — full TC re-run on both systems against the post-rubric YAML, then read all 28 answers manually rather than treating judge composites as the verdict (per the eval-is-instrumental discipline established in feedback_eval_is_instrumental.md). Today's runs: chain composite 4.80, loop composite 4.52, vs yesterday's 4.63 / 4.26. The 0.28 gap decomposed cleanly: ~0.25 comes from one real agent defect (tc-allocations-02 loop fabrication, composite 1.00), ~0.08 comes from RAG retrieval variance on tc-rag-01 / tc-rag-03 where the loop's rag_context was null while the chain's was populated (UKY flake, not behavior). Two real loop wins (tc-software-02 cited both CUDA versions vs chain's one; tc-announce-01 concise+correct vs chain's UKY-synthesis-as-announcements). Eleven of fourteen questions tied or near-tied. Report published: https://access-ci-reports.netlify.app/tc-battery-2026-04-29.html
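For a sense of why one bad answer can account for most of the gap, a small worked sketch. It assumes the battery composite is an unweighted mean over the 14 questions (an assumption, not confirmed above), and the chain's per-question score of 4.50 is a hypothetical value chosen for illustration; only the loop's 1.00 comes from the run.

```python
def gap_contribution(chain_score: float, loop_score: float, n_questions: int) -> float:
    """Share of the battery-level composite gap owed to a single question,
    assuming the composite is a plain mean of per-question composites."""
    return (chain_score - loop_score) / n_questions


# Hypothetical chain score of 4.50 against the loop's observed 1.00 on one of 14 questions.
contribution = gap_contribution(4.50, 1.00, 14)
print(round(contribution, 2))  # 0.25
```

Under those assumptions a single 3.5-point miss moves the 14-question composite by 0.25, which is the order of magnitude attributed to tc-allocations-02.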
Three loop improvements identified, in priority order:
- Loop fabricates relevance from loose-match search results: tc-allocations-02 narrated 3 unrelated projects (RNA, photosynthesis, solar/wind) as "20 climate-modeling projects on Delta."
- Loop's general-knowledge fallback isn't ACCESS-grounded when RAG returns null: tc-rag-01 used wrong hostname expanse.sdsc.edu instead of login.expanse.sdsc.edu, missed passive.sdsc.edu 2FA portal; tc-rag-03 produced a generic Globus tutorial without the ACCESS Collection Search workflow.
- Loop lacks order-of-magnitude qualifiers when listing examples: tc-affinity-01 says "several notable examples" with no count.
#1 and #3 were both tractable as MCP-layer fixes (the established see_all_url pattern: have the MCP tell the agent something it currently has to guess). #2 is a pure prompt-side gap.
Andrew flagged that docs was a misleading key name for what is really a structural URL, and suggested links. Two new commits on top of the existing see_all_url work:
- d4d9845: Renamed listingDocs → listingLinks and docs → links across base + 6 servers + tests. Added the two new structural-metadata fields to the three search tools where the documented bugs surfaced: pagination: { matched, has_more, total_known } and query_relevance: "exact" | "loose_match". Applied to allocations.search_projects, affinity-groups.search_groups, software-discovery.search_software.
- 4bb5ce6: Extended both fields uniformly across all 16 listing/search call sites for shape consistency. Servers without paginated upstream APIs surface { has_more: false, total_known: true }; sample-and-stop endpoints (e.g., the listProjectsBy* family in allocations) surface { total_known: false, has_more: results.length >= limit }.
426 unit tests pass. PR #3 still in draft; versioning intentionally untouched per repo convention (versions bump in dedicated chore commits on main).
Three additions to SYSTEM_IDENTITY in src/agent/prompts/tool_calling_loop.py:
- Null-RAG fallback for documentation-style questions (active immediately): when no reference context arrives, the loop is told to surface the canonical doc URL via MCP, or admit lack of grounding plainly. ACCESS-CI specifics (login hostnames, 2FA portals, identity-provider names, registry URLs) diverge enough from general HPC conventions that a generic answer is subtly wrong. Targets tc-rag-01 / tc-rag-03.
- pagination interpretation (latent until PR #3 ships): qualify counts when has_more or total_known: false. Targets tc-affinity-01 / comb-002.
- query_relevance interpretation (latent): when loose_match, inspect items against the actual topic; do not narrate fuzzy results as exact matches. Targets tc-allocations-02.
Items 2 and 3 are no-ops until PR #3 merges and the relevant servers republish, but item 1 is active immediately. Belt-and-suspenders: the see_all_url precedent (the loop spontaneously surfaced the new field with no prompt change) suggests items 2 and 3 might also work without prompt help, but bundling the prompt nudge insures against the case where it doesn't.
Andrew (with assist from his Claude) posted a substantive review. Five concrete items, addressed in three commits + one tracking issue. Order: most-urgent first.
The uv.lock regenerated on this branch pinned wrapt==2.1.2, but opentelemetry-instrumentation-langchain 0.60.0 calls wrap_function_wrapper(module=...) and wrapt 2.x removed the module= kwarg. Every container start since 2026-04-22 has thrown TypeError in init_telemetry → LangchainInstrumentor().instrument(). CI didn't catch it because nothing in the test suite exercised src.main's lifespan. Compounding: Dockerfile did pip install --no-cache-dir ., which re-resolves from pyproject.toml and ignores uv.lock — so the --locked CI check was decorative for production builds.
User-visible impact is zero — the live ACCESS chatbot routes through UKY/proxy via the hardcoded VITE_API_ENDPOINT baked into access-qa-bot at publish time, not through agent prod (the "Drupal Insanity" from reference_production_deploy_chain.md). But every smoke test of agent prod has been red for a week, masking any other regression and blocking the actual launch path.
Three changes:
- Pin wrapt<2 in pyproject.toml (defensive; droppable when Traceloop openllmetry #4025 / #4048 ships).
- Switch Dockerfile to uv sync --locked --no-dev so the lockfile actually governs production builds.
- Add tests/test_main_lifespan_smoke.py: two tests that exercise init_telemetry and the full FastAPI lifespan. Without these, the wrapt incompat would have stayed green in CI.
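The shape of such a smoke test, sketched with the standard library only. This is not the contents of tests/test_main_lifespan_smoke.py: init_telemetry and lifespan here are stand-ins defined inside the example, and the real test would import them from src.main instead.

```python
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator, List

startup_calls: List[str] = []


def init_telemetry() -> None:
    """Stand-in for the real telemetry setup, which instruments LangChain."""
    startup_calls.append("telemetry")


@asynccontextmanager
async def lifespan(app: object) -> AsyncIterator[None]:
    init_telemetry()  # any exception raised here propagates into the test
    yield
    startup_calls.append("shutdown")


async def run_lifespan_once() -> None:
    # Entering and leaving the context runs startup and shutdown exactly once.
    async with lifespan(app=None):
        pass


def test_lifespan_starts_and_stops() -> None:
    startup_calls.clear()
    asyncio.run(run_lifespan_once())
    assert startup_calls == ["telemetry", "shutdown"]


test_lifespan_starts_and_stops()
```

With FastAPI itself, using TestClient(app) as a context manager triggers the same startup path, so a broken instrumentor call fails the suite instead of the container.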
The legacy chain enforces READ_ONLY by removing write capabilities from the registry at build time. The new tool_calling_loop builds tools directly from the MCP catalog and never sees the registry. With both USE_TOOL_CALLING_LOOP=True and READ_ONLY=True (the actual production-hardening case), the LLM could still call manage_announcements, open_ticket, report_login_problem, report_security — the audit's "READ_ONLY blocks all writes" claim did not hold on the new code path.
Fix: added WRITE_MCP_TOOL_NAMES constant in capabilities.py enumerating the 6 MCP tool names corresponding to the 4 write capabilities, and _apply_read_only_filter() helper in tool_calling_loop.py that strips them when settings.READ_ONLY is True. Two unit tests machine-verify the audit guarantee on the new code path. docs/security/write-capability-audit.md updated with a parallel-paths section + 2026-04-29 changelog entry.
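A minimal sketch of the filter's logic, under stated assumptions: the Tool dataclass is a stand-in for whatever tool object the loop actually builds, the function takes read_only as an argument rather than reading settings, and the name set below lists only the four tools named in this entry (the real constant enumerates six).

```python
from dataclasses import dataclass
from typing import Iterable, List

# Illustrative subset only: the real WRITE_MCP_TOOL_NAMES has six entries.
WRITE_MCP_TOOL_NAMES = frozenset(
    {"manage_announcements", "open_ticket", "report_login_problem", "report_security"}
)


@dataclass(frozen=True)
class Tool:
    """Stand-in for an MCP-backed tool; only the name matters for filtering."""
    name: str


def apply_read_only_filter(tools: Iterable[Tool], read_only: bool) -> List[Tool]:
    """Drop write-capable tools from the catalog when read-only mode is on."""
    if not read_only:
        return list(tools)
    return [tool for tool in tools if tool.name not in WRITE_MCP_TOOL_NAMES]


catalog = [Tool("search_announcements"), Tool("open_ticket"), Tool("search_projects")]
print([t.name for t in apply_read_only_filter(catalog, read_only=True)])
```

Filtering the catalog before tools are bound means the LLM never sees the write tools at all, which is a stronger guarantee than asking it not to call them.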
Three smaller items in one commit:
- Dropped agent_rag_only system mode entirely. Andrew identified it as a silent no-op: _run_agent mutated os.environ["ENABLED_CAPABILITIES"], but pydantic-settings is constructed at import and the capability registry is cached on first call. The team's own test_capabilities_read_only.py already uses monkeypatch.setattr on the settings object, confirming env mutation doesn't work. Past compare-judge runs that used agent_rag_only were silently comparing agent_full to agent_full. Removing the mode is cleaner than fixing it given (a) RAG endpoints planned for deprecation and (b) no battery uses it.
- Dropped include_context: bool = True parameter on compare_runs. The False branch had no callers (verified by grep). Behavior unchanged for every existing caller.
- Expanded argilla-push records with question_set + tool_count metadata. Andrew's specific kwarg names didn't match the current build_argilla_record signature, but the underlying intent (more filterable Argilla metadata) is reasonable. Added the two parameters to build_argilla_record + the dataset's metadata schema, threaded through both call sites.
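The agent_rag_only no-op is easy to reproduce in miniature. The sketch below uses a plain dataclass as a stand-in for the pydantic-settings object (an assumption made to keep it dependency-free), and the capability names "rag" and "tools" are hypothetical.

```python
import os
from dataclasses import dataclass, field
from typing import Tuple


def _read_capabilities() -> Tuple[str, ...]:
    """Parse the comma-separated env var once, at construction time."""
    raw = os.environ.get("ENABLED_CAPABILITIES", "")
    return tuple(part for part in raw.split(",") if part)


@dataclass
class Settings:
    enabled_capabilities: Tuple[str, ...] = field(default_factory=_read_capabilities)


os.environ["ENABLED_CAPABILITIES"] = "rag,tools"    # hypothetical capability names
settings = Settings()                               # snapshot, as at import time

os.environ["ENABLED_CAPABILITIES"] = "rag"          # late mutation, as _run_agent did
after_env_mutation = settings.enabled_capabilities  # unchanged by the env write

settings.enabled_capabilities = ("rag",)            # patching the object does work
after_object_patch = settings.enabled_capabilities

print(after_env_mutation, after_object_patch)
```

This is why the existing test patches the settings object with monkeypatch.setattr: the environment is only consulted when the object is built.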
- The "rag_and_plan → tool_calling_loop edge" Andrew named doesn't exist in the graph: rag_and_plan only routes to execute/synthesize. The actual dead-when-flag-on situation is rag_and_plan becoming unreachable, by design for rollback safety during the launch window. Cleaner removal belongs with the legacy-chain teardown when the flag becomes permanent.
- The justifications kwarg on build_argilla_record: dict-shaped, the metadata sink expects scalars, and adding it as a field rather than metadata is a larger schema change. Skipped pending clarification on the desired Argilla surface.
Andrew's review surfaced ~20 dead-code candidates in three buckets: already-dead-on-main (removable now), removable-when-loop-permanent, removable-when-RAG-deprecated. Filed as a tracking issue separate from PR #14 — bundling into the PR would defeat its launch-discipline framing.
After Andrew's review fixes, rebuilt the 6 MCP servers (docker compose up -d --build mcp-{announcements,events,affinity-groups,system-status,software-discovery,allocations}) so the new links/pagination/query_relevance metadata would actually be in the tool responses. Postgres container hit disk-full mid-run on the first attempt — docker system prune -f --volumes reclaimed 23GB and the recovery completed cleanly. Re-ran TC battery on both systems.
Postfix run vs morning baseline:
| | Morning | Postfix | Δ |
|---|---|---|---|
| Chain | 4.80 | 4.68 | −0.12 |
| Loop | 4.52 | 4.64 | +0.12 |
| Gap | 0.28 | 0.04 | — |
Report: https://access-ci-reports.netlify.app/tc-battery-2026-04-29-postfix.html
Per-question read against morning loop (judge-free, reading actual answer text):
- Verified the new metadata IS reaching the agent (query_relevance, pagination, links present in tool-result context for tc-allocations-02; SQL probe confirmed).
- tc-rag-01 (+0.85) and tc-rag-03 (+0.25) gains are mostly UKY RAG firing this run when it didn't last run: variance, not behavior change.
- tc-nsf-01 (+0.60) is judge mood: same answer shape, no real change.
- tc-allocations-02 (+0.50) is partial: agent now hedges with "or similar topics" but still asserts climate-modeling framing for projects that aren't.
- tc-affinity-01 (no change): agent ignored pagination.matched, still says "several notable examples".
Most of the composite movement is variance + judge mood. Real behavior changed on exactly one question (tc-allocations-02), and only partially.
Created eval/questions/tc_loose_match_subset.yaml — tc-allocations-02 + tc-affinity-01 as targets, tc-software-02 + tc-status-01 as regression controls. ~2-min runs.
Iter-1: strengthened query_relevance: "loose_match" instruction with mandatory opening structure + explicit prohibition of "or similar topics" hedges. Result on tc-allocations-02: 4.75 (up from 1.50). Agent opened: "Searching for ACCESS projects on Delta related to climate modeling returned 53 results, but none are specifically focused on climate modeling." Used pagination.matched=53 ✓, declined the topic framing ✓, qualified examples as "related to other topics" ✓. Looked like a clean fix.
Iter-2: also strengthened the pagination instruction to a similar MUST form, hoping to fix tc-affinity-01. Result on tc-allocations-02: 1.00 (regression). Agent: "There are 53 projects on the NCSA Delta GPU related to climate modeling, though the relevance to climate modeling varies." Back to fabricating with hedge. The pagination MUST rule appears to dilute the query_relevance MUST rule.
Iter-1 reverted (re-running the iter-1 prompt one more time): tc-allocations-02 = 1.65. Agent acknowledges "based on a loose match search" but still asserts the topic framing.
So out of 3 runs of this prompt class: 1 clean fix, 2 hedged fabrications. The prompt is an improvement but not a reliable fix. The agent reads query_relevance: "loose_match" (proven by literal mention in two of the three answers) but follows the directive to decline the framing only ~1/3 of the time. Prompt discipline is unreliable for declining-the-framing instructions.
tc-affinity-01 is a separate problem — search_affinity_groups(query='GPU') returns 6 substring matches, but the rubric expects the 12-group GPU-equipped universe. The tool's substring matching doesn't surface groups whose searchable text doesn't contain "GPU". Not a prompt issue at all; needs a tool/data-layer fix (e.g., has_gpu flag on each group instead of a substring search).
Committed iter-1 prompt as 9e449ac (it's directionally better than no prompt change — when it works it produces the structurally-right answer) plus the subset YAML. Recommended next-session move: tool-side fix on allocations.search_projects so loose-match fabrication is structurally impossible (e.g., return items: [] plus unranked_loose_matches: [...] when query_relevance: "loose_match"). Same precedent as see_all_url, pagination, query_relevance — fix at the source rather than relying on prompt discipline. Tracked in next_session_prompt.md.
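A rough sketch of the proposed response shape, written in Python for illustration only: the real server is TypeScript, and shape_search_response is a hypothetical name. The field names items, unranked_loose_matches, and query_relevance are the ones suggested above.

```python
from typing import Any, Dict, List


def shape_search_response(
    results: List[Dict[str, Any]], query_relevance: str
) -> Dict[str, Any]:
    """Keep loose matches out of `items` so they cannot be narrated as hits."""
    if query_relevance == "loose_match":
        return {
            "query_relevance": "loose_match",
            "items": [],                        # nothing matched the topic exactly
            "unranked_loose_matches": results,  # still available, clearly labeled
        }
    return {
        "query_relevance": query_relevance,
        "items": results,
        "unranked_loose_matches": [],
    }


loose = shape_search_response([{"title": "RNA folding"}], "loose_match")
print(loose["items"], len(loose["unranked_loose_matches"]))
```

With this shape an agent that reads only items has nothing to mislabel, so the correct answer no longer depends on it following a prompt instruction.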
Reports published today:
- Morning baseline: https://access-ci-reports.netlify.app/tc-battery-2026-04-29.html
- Postfix (after MCP rebuild + loop prompt): https://access-ci-reports.netlify.app/tc-battery-2026-04-29-postfix.html
- Subset (4-question, iter-1 prompt): https://access-ci-reports.netlify.app/tc-subset-strongprompt-2026-04-29.html
Reworked access-agent/eval/questions/tool_coverage_battery.json after Andrew's review of 22abfbd flagged that many required_facts were clause-stacked to hit a 3-4 count, with overlap and snapshot data mixed into durable claims. Five commits on feature/production-baseline-comparison:
| Commit | Change |
|---|---|
| 4bf1a44 | Atomicized required_facts across all 14 questions; added ground_truth_stability field (9 time_bound, 5 stable); moved snapshot data from facts to authoring_notes. |
| 13feb2f | Converted JSON → YAML for human-authoring ergonomics. Loader at src/eval/questions.py dispatches by extension; sibling batteries stay JSON. Added pyyaml as direct dep + types-PyYAML. |
| e35e2e6 | Made authoring_notes a bullet list (was prose blob); added blank lines between questions. |
| 9c5756c | Nested enumerative required_facts as {heading, items} dicts (e.g., eligible institutions list, XDMoD realms) instead of comma-blobs. |
| 39430ab | Tightened YAML representer to use block scalars only when actually needed (long strings with apostrophes), down from ~35 to 14. |
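The extension-based dispatch mentioned for 13feb2f might look roughly like the sketch below. This is not the code in src/eval/questions.py; load_questions is a hypothetical name, and PyYAML is imported lazily so the JSON path has no extra dependency.

```python
import json
from pathlib import Path
from typing import Any


def load_questions(path: Path) -> Any:
    """Load a question battery, choosing the parser from the file extension."""
    suffix = path.suffix.lower()
    text = path.read_text(encoding="utf-8")
    if suffix == ".json":
        return json.loads(text)
    if suffix in {".yaml", ".yml"}:
        import yaml  # PyYAML; only needed for YAML batteries

        return yaml.safe_load(text)
    raise ValueError(f"Unsupported question file extension: {suffix!r}")
```

Dispatching on the suffix is what lets the sibling batteries stay JSON while tool_coverage_battery moves to YAML.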
The factoid scorer that consumes these facts is not yet built — today's work is the ground-truth corpus shape, not the grading. Running the battery now would still hit the existing rubric judge and produce numbers indistinguishable from last week's; the work pays off when the factoid scorer lands.
The factoid-scoring + ground-truth work is the eval system itself, not a one-time exercise. We will iterate on agent setup repeatedly, and the eval must be reliably re-runnable each time. The time_bound required_facts refresh pattern (regen from live MCP before each run) is therefore not optional polish — it's the operational core. Documented in access-agent/eval/questions/tool_coverage_battery.README.md (new "Repeatability — refreshing time_bound facts" section). Refresh script itself is not yet built; manual fallback for now.
Side topics surfaced during the meeting, captured here so they don't evaporate:
- UX direction (Vikram). Concise responses preferred over long ones — "show top five and a follow-able link to the source" pattern. Implication for the agent's synthesis prompt: brevity with explicit pointers, not exhaustive enumeration. Not a launch blocker, but a product input worth keeping.
- Architecture: how chunks reach the agent. Three options on the table: (1) raw chunks endpoint (pipeline-driven, agent has no say), (2) chunks wrapped in MCP (agent-driven, can iterate, fits tool-calling-loop architecture), (3) full synthesis service (current UKY /ask, returns paragraph). Direction: MCP-wrapped chunks aligns with the long-term tool-calling-loop shape and the OAuth-proxy pattern (lets non-MCP-native clients use the same retrieval surface). Raw API can be the thin layer the MCP wraps. Current synthesis remains as a transitional path. Tracks the "RAG as a tool" Phase 4+ direction Andrew floated previously.
- "Local model" terminology, clarified. Andrew's "local model" = UKY-hosted (their on-Grace-Hopper vLLM), NOT laptop-local. Phase 5 of the launch plan covers the swap from OpenAI gpt-4o to UKY-hosted vLLM.
- Current tool-calling LLM. OpenAI gpt-4o (default), configurable via LLM_PROVIDER env var (openai | vllm | access_ai | fireworks). Eval judge separate at gpt-4o-mini via EVAL_JUDGE_MODEL. Provider swap is config-only; no graph code change required.
Followed up on Andrew's concern from the Apr 23 parity check that the loop calls tools less often than the chain. Postgres diff confirmed the shape: across 40 questions, chain called ≥1 tool on 28; loop skipped tools on 13 of those 28 (one-sided pattern, never reverse).
Reframed SYSTEM_IDENTITY in src/agent/prompts/tool_calling_loop.py to present docs + tools as two complementary sources, with "live tool wins on disagreement" framing (commit e34e6bb). Also added a per-question tool-call summary row to the HTML report with a ⚠ when counts diverge (commit 2ee0288). New run (chain-20260424-161321-fce4d6 / loop-20260424-162652-0cab24): 5 of 13 missed-tool cases now call a tool. 8 still skip.
Report: https://access-ci-reports.netlify.app/phase3-smoke-run3.html. Compare-judge JSON at access-agent/comparisons/phase3-smoke-run3.json.
Pulled each side-by-side (chain's tool result, chain's answer, loop's answer):
- 4 of 8: chain's tool returned nothing useful (empty results or HTTP 500). Both agents wrote from docs anyway. Tool call was theatre.
- 1 of 8 (XDMoD usage check): loop correctly declined for anonymous user; chain answered from docs.
- 2 of 8 (Office Hours, webinar link): docs already contained what the tool would have added. Near-identical answers; loop's "Open OnDemand Tips and Tricks" naming was arguably sharper than chain's "the webinar."
- 1 of 8 (affinity groups for ML): chain ignored real group names its tool returned and gave a generic answer. Both agents may be wrong; can't tell from scores alone.
"Loop calls tools less often" is true but doesn't translate to "loop's answers are worse." UKY's docs are strong enough that answers converge whether or not the tool is called. Tool-call count is not a proxy for grounding — an agent can call a tool that returns nothing, ignore a tool's result, or skip a tool because docs already have the answer, and the count doesn't distinguish.
Decision: parked "force loop to call more tools" architectural work (the gate-by-classifier / pre-dispatch / split-agent options). No urgent signal that loop quality is hurting. Direction reaffirmed: the only way out of this measurement gap is ground-truth answers — tool_coverage_battery (now shipped at commits 2884a06 + 22abfbd) is the seed corpus.
Two real bugs, unrelated to loop/chain:
- get_user_data, HTTP 500: Object of type TextContent is not JSON serializable
- integrate_nsf_xdmod, HTTP 500: same TextContent is not JSON serializable error. (Earlier journal entry at line ~1070 noted the agent referencing this tool and UKY 500'ing; the new finding is that the MCP server itself is what's throwing, and the JSON serialization failure is the root cause.)
Both worth filing against the access-mcp repo.
Closed out Launch Phase 3 by running Task 7 — the eval-parity check between the legacy chain and the new tool-calling loop — and tightening up the eval CLI + HTML reporter while doing it. Phase 3 is now fully done.
Built phase3_smoke_battery.json, a 40-question curated battery covering seven categories (static-confident, static-deflection, combined-simple, multi-tool, pure-mcp, error-prone, domain-routed). Ran it twice through the eval pipeline against the same agent image, varying only USE_TOOL_CALLING_LOOP via a new CLI choice: --system agent_full_legacy programmatically forces the flag false for the run; --system agent_full forces it true. Same MCP servers, same questions, same judge — single bit flipping which downstream path the LangGraph routing functions take.
- Run 1: legacy composite 4.73 vs loop 4.60. Loop slightly behind, biggest gap was completeness (−0.27). Compare-judge narrative: "winner A (legacy), small margin." The loop's SYSTEM_IDENTITY told it "Be concise. Researchers want the answer, not ceremony." That was the likely culprit.
- Single-line prompt change in src/agent/prompts/tool_calling_loop.py: swapped the concise instruction for "Be complete. Include the specific details researchers need to act — commands, links, numeric values, step-by-step instructions where relevant. Don't pad with ceremony, but don't strip substance either."
- Run 2: legacy 4.67 vs loop 4.67. Tied composite. Loop wins citation_quality by +0.23 and is within noise on every other dimension. Calling Phase 3 done.
Reports live: https://access-ci-reports.netlify.app/phase3-smoke-run1.html and https://access-ci-reports.netlify.app/phase3-smoke-run2.html.
Nine commits beyond the Phase 3 code (cad8ee1..HEAD on feature/production-baseline-comparison):
| Commit | What |
|---|---|
| 492cd3d | phase3_smoke_battery.json: 40 curated questions across 7 categories |
| a145ab0 | --system agent_full_legacy choice + programmatic flag override |
| 1971d60 | Docs for --system |
| 0924f01 | Semantic run IDs (chain-YYYYMMDD-HHMMSS-hash6 / loop-…) replacing opaque UUIDs, visible in Argilla's "Eval Run ID" filter |
| 94be014 | Prompt tweak: "be concise" → "be complete" |
| 9d7f2e4 | HTML template fix: execution traces now render for both baseline and candidate (was rendering only candidate) |
| 63ac882 | --preset flag on the html report (grand-prix default, phase3-parity for loop-vs-chain) |
| c5440f1 | Hardcoded "Raw RAG"/"raw_rag" template fallbacks neutralized |
| 5efeb3c | Data-derived question count (was hardcoded in BATTERY_INFO) + container width 1180→1800px, column-collapse breakpoint 900px |
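For reference, a sketch of how a semantic run ID in the chain-YYYYMMDD-HHMMSS-hash6 format could be produced. The real generator from 0924f01 isn't reproduced here: make_run_id is a hypothetical name, and both the random hex suffix and the UTC timestamp are assumptions. The prefix mapping follows the --system choices described above.

```python
import uuid
from datetime import datetime, timezone
from typing import Optional

# --system value -> run-ID prefix, per the two choices described above.
SYSTEM_PREFIX = {"agent_full_legacy": "chain", "agent_full": "loop"}


def make_run_id(system: str, now: Optional[datetime] = None) -> str:
    """Build an ID such as chain-20260428-203336-fefdc1."""
    if system not in SYSTEM_PREFIX:
        raise ValueError(f"Unknown system: {system!r}")
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%d-%H%M%S")
    suffix = uuid.uuid4().hex[:6]  # assumed: 6 random hex chars for uniqueness
    return f"{SYSTEM_PREFIX[system]}-{stamp}-{suffix}"


print(make_run_id("agent_full_legacy"))
```

The readable prefix and timestamp are what make the IDs usable as a filter in Argilla without looking anything up.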
The legacy chain (plan/execute/evaluate/recover/synthesize) and the new tool_calling_loop are registered as nodes in the same LangGraph graph (src/agent/graph.py::_build_graph_structure). Two routing functions (route_by_classification, route_after_rag) consult settings.USE_TOOL_CALLING_LOOP to decide which branch to take. So the "fork" lives at the routing decisions, not in two separate graphs. Same image, same MCP servers, same questions — just one flag flipping which downstream path is traversed. This made parity testing trivial: the eval CLI's --system flag is the same boolean from a different angle.
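A stripped-down sketch of that routing decision. The real route_after_rag has several branches and reads the project's Settings object; here the Settings dataclass is a stand-in and the function collapses every former return "plan" site into one, which is an intentional simplification.

```python
from dataclasses import dataclass
from typing import Any, Dict, Literal


@dataclass
class Settings:
    """Stand-in for the real settings object; only the one flag is modeled."""
    USE_TOOL_CALLING_LOOP: bool = False


settings = Settings()


def route_after_rag(state: Dict[str, Any]) -> Literal["plan", "tool_calling_loop"]:
    """Pick the downstream node; the flag is read on every call."""
    tool_path: Literal["plan", "tool_calling_loop"] = (
        "tool_calling_loop" if settings.USE_TOOL_CALLING_LOOP else "plan"
    )
    return tool_path


print(route_after_rag({}))               # legacy chain while the flag is off
settings.USE_TOOL_CALLING_LOOP = True
print(route_after_rag({}))               # loop once the flag is on
```

Because both paths are nodes in one graph, flipping the flag changes only which edge is taken, which is what the --system choice exploits.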
- feature/production-baseline-comparison pushed; ~44 commits ahead of main now.
- Phase 3 fully done (code + parity check). USE_TOOL_CALLING_LOOP=false still the default in production.
- No PR yet. Next: open the PR; start Phase 6 privacy investigation in parallel.
Executed the Phase 3 plan subagent-driven. Code is on feature/production-baseline-comparison (origin), behind the USE_TOOL_CALLING_LOOP=false default. Only remaining Phase 3 item is Task 7 — the eval-parity check.
| Commit | What |
|---|---|
| 4a9d9d2 | USE_TOOL_CALLING_LOOP flag added to Settings + .env.example |
| 64b9da5 | src/agent/prompts/tool_calling_loop.py: SYSTEM_IDENTITY, build_system_prompt, format_rag_matches |
| 1b3dcca | Failing tests (TDD baseline) for tool_calling_loop_node |
| acbf91d | Test fixture correction: use real RAGMatch fields (entity_id, similarity_score, domain) rather than plan's invented source/score |
| f8a5278 | src/agent/nodes/tool_calling_loop.py + create_mcp_tools_from_catalog helper in domains/tools.py |
| b4ac03f | format_rag_matches field-name fix (same plan bug as acbf91d, different file) |
| 251def1 | Back-fill state["tool_results"] from ToolMessages (missing requirement in plan); without it, Phase 7 eval scorer would have seen "no tools used" for every loop response |
| b8b9153 | Graph routing: register node, extend route_after_rag Literal, flip all 4 return "plan" sites via shared tool_path local, short-circuit route_by_classification to rag_answer when flag is on |
| de637d1 | Module-docstring deprecation blocks on plan/execute/evaluate/recover/synthesize.py |
| 617f66c | Function-level deprecation comments on _resolve_parameters / _resolve_reference in execute.py (the $step_N resolver, which retires with the legacy path) |
| 92ae941 | Pre-Phase-7 hardening: final_answer=None when loop emits no text (not ""), GraphRecursionError caught with graceful apology, orphan ToolMessages counted in span |
| 4272a78 | Plan-doc reconciliation: 12 drift items fixed in-place, plus an "Implementation notes — deviations from plan" section at the end |
Andrew's plan was authored before the implementation started; several snippets referenced APIs and field names that differed from reality. Corrections applied by implementer subagents with explicit callouts in their prompts:
- `MCPToolWrapper` field names: real API is `mcp_client=` / `tool_server=`, not `client=` / `server=`.
- `_build_args_schema(name, parameters)` takes two positional args, not a dict.
- `RAGMatch` fields are `id` / `question` / `answer` / `domain` / `entity_id` / `similarity_score` / `metadata`; the plan's `source` and `score` don't exist.
- Graph factory is `create_agent_graph`, not `build_graph`; compiled-graph introspection is `graph.get_graph().nodes`.
- `route_after_rag` has four `return "plan"` sites (disabled-domain fallback, combined/dynamic with matches, combined/dynamic without matches, static-deflection, static-no-match), not the two the plan claimed. Shipped code uses a shared `tool_path: Literal["plan", "tool_calling_loop"]` local to flip them uniformly.
- `AIMessage.content` is typed `str | list[str | dict]` and needs narrowing before binding to `final_answer`.
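The content-narrowing point can be illustrated with a minimal sketch. This is not the shipped code: `narrow_content` is a hypothetical helper name, and the assumption that text blocks arrive as dicts shaped like `{"type": "text", "text": ...}` is mine. It returns `None` rather than `""` when no text was produced, in line with the hardening noted for 92ae941.

```python
from typing import Optional, Union

# Mirrors the `str | list[str | dict]` typing described above.
Content = Union[str, list[Union[str, dict]]]


def narrow_content(content: Content) -> Optional[str]:
    """Collapse message content to plain text, or None when there is no text."""
    if isinstance(content, str):
        text = content
    else:
        parts = []
        for block in content:
            if isinstance(block, str):
                parts.append(block)
            elif block.get("type") == "text":
                # Assumed block shape; non-text blocks are skipped.
                parts.append(str(block.get("text", "")))
        text = "".join(parts)
    # None (not "") signals that the loop emitted no text.
    return text if text.strip() else None
```

A caller would bind `final_answer = narrow_content(message.content)` and treat `None` as the no-answer case.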
The new path skips synthesize.py entirely — the loop's LLM writes its own final answer inline. Citations are handled by a paragraph in the system prompt asking the LLM to cite sources. This is architecturally clean and matches the plan, but it trades deterministic citation generation (what synthesize.py was doing in its 750 lines) for LLM-prompt-instruction compliance. If Phase 7 eval comparison shows citation regressions, a thin post-loop citation node is a clean retrofit (tool_calling_loop → citation_pass → END) — the loop's state already carries everything that node would need. Draft note to Andrew is in memory.
- `tests/test_tool_calling_loop.py`: 18 tests (8 core scenarios + 1 tool_results back-fill + 6 routing/graph-structure + 3 hardening). All green.
- Broader non-e2e/classify suite: 230 passed, 1 skipped. No regressions.
- Pre-commit (ruff + mypy + gitleaks) clean on every commit; no `--no-verify` anywhere.
- `feature/production-baseline-comparison` pushed to origin at `4272a78`, ~35 commits ahead of `main`.
- `USE_TOOL_CALLING_LOOP=false` default, so agent-prod behavior is identical; staging will default the flag to `true` once Phase 2 lands.
- No PR opened. Plan is: run Phase 3 eval-parity check first (Task 7), then PR.
- Phase 3 eval-parity check (Task 7). Favoring a targeted smoke slice (20-50 questions covering static-confident, static-deflection, combined/dynamic, multi-tool chains, tool failures) over the full 900-pair battery for initial regression detection — expand only if smoke shows a regression in a specific category.
- Open the PR and send Andrew the synthesis-skip / citations note alongside it.
- Start Phase 6 privacy investigation in parallel when a coding-session break is welcome. Phase 3 added six new span attributes on the loop node; audit those specifically.
Separate from the morning's grand-prix report work (see the next entry). Afternoon was receiving Andrew's new plans, syncing with him, and re-orienting the work plan around them.
Four coordinated Drupal commits ~10:48–10:52 ET, shipping the Resource Documentation API v1.0:
| Repo | Branch | Commit |
|---|---|---|
| necyberteam/Operations_Drupal_Feed_Cider | main | aec3bd7: inheritance-aware resource API with versioned paths |
| necyberteam/access | 3.0.x | 8c9077f: Swagger docs for the API |
| necyberteam/aspTheme | main | 526936e: theme uses the new inheritance service |
| necyberteam/cyberteam_drupal | main | 4edfb4e: Cypress tests for API versioning + inheritance |
URL paths moved: /api/resources → /api/1.0/resources, /api/resource-groups → /api/1.0/resource-groups. List endpoint now filters by documented=true by default. Detail endpoint now applies Resource-Group inheritance server-side for 15 inheritable fields (login text, SSH nodes, file transfer, storage, support links, office hours, software list, etc. — compute-specific fields stay per-resource). ssh_logins sub-object shape changed. New scalar fields surfaced: login_text, file_transfer_text, jobs_info, software_list_url.
Impact on access-agent: the default DRUPAL_RESOURCE_GROUPS_URL in src/config.py and docker-compose.yml points at the old unversioned path. Not urgent — production support.access-ci.org may not have deployed v1.0 yet. Verify before touching.
Six new planning docs on access-agent/main as commit 14a578c ~10:55 ET, under docs/superpowers/:
- `plans/2026-04-21-production-launch-umbrella.md`: 9-phase umbrella tracker.
- `specs/2026-04-21-production-launch-hardening-design.md`: primary launch spec.
- `plans/2026-04-21-launch-phase-0-deps-upgrade.md`: langgraph 0.2→1.x upgrade.
- `plans/2026-04-21-launch-phase-1-safety-audit.md`: `READ_ONLY` guard + write-capability audit doc.
- `specs/2026-04-21-eval-rubric-ground-truth-design.md`: parallel track, in which the judge gets authored `required_facts` from Argilla.
- `plans/2026-04-21-eval-rubric-ground-truth.md`: rubric implementation plan.
The launch spec reframes what "production launch" means. It's no longer "flip the current agent into prod" — it's "stand up a new architecture (native tool-calling loop replacing plan/execute/evaluate/recover, UKY /retrieve chunks replacing paragraph responses, UKY-hosted vLLM replacing OpenAI), prove it side-by-side against current prod on a real staging environment, leadership-gate, flip." Decision 007's framing in access-qa-planning is effectively superseded — the Apr 21 grand-prix report delivered the evidence 007 asked for; the new spec is the governing plan.
Key decisions out of the call:
- Keep working on `feature/production-baseline-comparison`. Andrew: "turn this branch that you're on now into the branch that's gonna incorporate the ground truth." Not retiring it; it becomes the home for M.0/M.1 exploration and the parallel rubric work. Andrew isn't doing any parallel `access-agent` work, so no race-to-merge pressure.
- Start with the production/launch work, not the rubric work. Phase 3 is partly a hypothesis test. Andrew: "we think that the new Frontier models are going to be able to handle the tool calling without having all these extra steps in there, but, like, we haven't tested that." Earlier validation is better.
- Rubric (ground-truth) is explicitly not a launch blocker. Andrew: "I think people can just look in Argilla and... see the difference for themselves... it's pretty obvious that having MCP is better than not." The formal quality bar (N spot-checked, X dissent) still stands on paper; his baseline posture is the story is already essentially proven. Rubric work is the in-between-waits plate.
- Argilla role clarified. Argilla is "where you go to dig in more to particular answers, and also a potential source for... human answer, verified answer capability." Not the side-by-side comparison UX — that stays in the HTML grand-prix report.
- New scope surfaced in conversation: live bot traffic → Argilla pipeline. Andrew: "I guess I would just add that into the plan. Somewhere. You can just sort of slot it in wherever you think is, like, the right place for them." No home yet in the umbrella plan; Joe owns placing it.
- Argilla housekeeping: push the grand-prix eval data from Apr 21 to Argilla; take down the older run there. One-second operation.
- Known small bugs in the grand-prix report (not fixing now because the rubric work will touch them again):
  - Compare-judge narrative leaks bare `A`/`B` letters in prose that the `relabel()` helper from `fc0803a` doesn't catch (it handles "System A"/"System B" but not single letters in generated text).
  - At least one factually-wrong judge verdict (GPU-allocations question: agent returned project counts correctly, raw RAG returned unrelated GPU specs, judge preferred raw RAG). Concrete on-hand argument for the rubric work.
- Timing: Andrew is "anxious to get this done" but "not expecting it to be done this week." Vikram floated end-of-week for `/retrieve`; firmness unclear. Stalls likely at `/retrieve` and at vLLM hosting.
- `feature/production-baseline-comparison` at `2139613` on origin. Continues as the working trunk.
- `access-qa-planning` `main` at `2bfa6c3` (Apr 8 commit, unchanged since). Decision 007 still present; effectively superseded by the new launch spec but not yet annotated.
- `access-agent` `main` at `14a578c` with the six new planning docs. No code changes today.
- `access-mcp` `fix/search-events-webinar-guidance` at `15c74cb`: unchanged, still no PR. Less load-bearing under the new architecture (native tool-calling retries differently) but the fix is still real.
- Synthesis doc at `access-ci/EVAL_REVAMP_AND_PROD.md`: local reference, not gist-synced, summarizes both the new plans and the 2026-04-21 meeting outcomes.
Work off Andrew's plans directly. Start M.0 per access-agent/docs/superpowers/plans/2026-04-21-launch-phase-0-deps-upgrade.md; phase order in the umbrella plan. Argilla push and live-to-Argilla stub both dropped — the push because the data's about to be superseded by new-architecture runs, the stub because access-qa-planning/active/03-review-system.md Phase 5 already covers that ground. Phase 2 (staging) infra decisions added to FEB_MARCH_PLAN.md under "Questions for Andrew."
Template-only iteration on the grand-prix HTML comparison report. No eval re-runs, no judge changes, no agent changes — all four commits touch only src/eval/html_report/. Rendered from the four compare-judge JSON artifacts generated on 2026-04-20. Live at https://access-ci-reports.netlify.app/grand-prix-20260421.html.
The report previously led with per-answer judge composites (A 4.79 / R 4.65 / Δ +0.14) in each row header. Decision this session: remove them entirely, not just de-emphasize. The per-answer judge's known calibration weakness (rewards generic-but-truthful answers as highly as specific-and-correct ones) meant those numbers were actively misleading at a glance. They stay in the bundle JSON as raw data but render nowhere — no row header, no row body, no battery rollup, no run summary.
Verdict labels also dropped the decisively / narrowly margin qualifiers. Those are AI-estimated confidence, not evidence a reader can check, so they shouldn't leak into visible copy. Margin still lives in the data for the "Most decisive first" sort option.
Stakeholders (Jim, Vikram, Shelly) need "agent wins, here's where and why" — but researchers also need to form their own opinion from the raw answers, not have the AI's opinion projected onto every scannable row.
- AI run-level analysis moved from the top of the report to the end, under "Analysis by AI Comparison Judge — per battery", with a provenance paragraph stating outright that these are LLM-written summaries, not ground truth.
- Per-question AI opinion nested behind a second disclosure. First disclosure opens a row to reveal question + execution trace + both answers side-by-side. A second `<details>` at the bottom, labeled "AI comparison judge's opinion on this question", reveals the verdict pill + why. Two clicks to see the judge's opinion on any one question; zero clicks to see the evidence.
Foggy Notion F from OY_VEY_2.md (repeatable A/B comparison infrastructure for agent variants) nudged a small refactor: the template no longer hard-codes "Agent" / "Raw RAG" / "agent" / "raw" CSS class decisions. Every bundle now carries systems: {A, B} where A = baseline slot (purple palette) and B = candidate slot (teal palette). Labels come from a SYSTEM_LABELS registry in notes.py (raw_rag → "Raw RAG", agent_full → "Agent"; unknown IDs fall back to the raw ID). A relabel() helper in the template rewrites the compare-judge's "System A" / "System B" narrative phrasing to the configured label.
Future agent_v2 vs agent_v3 comparisons can swap in just by adding entries to SYSTEM_LABELS — no template edits needed. Slot visuals (purple = baseline, teal = candidate) stay stable across system swaps.
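The registry-plus-fallback idea can be sketched in a few lines. This is an illustration, not the code in `notes.py` or the template: `label_for` is a hypothetical helper name, and the shape of the `systems` mapping (slot letter to system ID) is an assumption based on the `systems: {A, B}` description above.

```python
import re

# Registry entries as described above; unknown IDs fall back to the raw ID.
SYSTEM_LABELS = {"raw_rag": "Raw RAG", "agent_full": "Agent"}


def label_for(system_id: str) -> str:
    """Look up a display label, falling back to the raw system ID."""
    return SYSTEM_LABELS.get(system_id, system_id)


def relabel(text: str, systems: dict) -> str:
    """Rewrite "System A" / "System B" phrasing to the configured labels.

    `systems` maps slot letters to system IDs, e.g. {"A": "raw_rag", "B": "agent_full"}.
    """
    def swap(match: re.Match) -> str:
        return label_for(systems[match.group(1)])

    # Only the full "System X" phrase is rewritten; bare letters are left alone.
    return re.sub(r"\bSystem ([AB])\b", swap, text)
```

With `systems = {"A": "raw_rag", "B": "agent_full"}`, the sentence "System A beat System B." becomes "Raw RAG beat Agent.", while a bare "A wins" passes through unchanged, which is the limitation noted in the known-bugs list.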
Filter bar rebuilt:
- Battery and Sort switched from button rows to `<select>` dropdowns: far less horizontal space, more obvious as filter UI at a glance.
- New client-side search input. Plain text does case-insensitive substring match against question text + qid. Wrapping in `/.../flags` switches to regex (e.g. `/outage|down/i`). Invalid regex leaves all rows visible and marks the input red with an "invalid regex" hint, so the view stays usable mid-edit. A small pill beside the input shows `substring` or `regex` so the matching mode is obvious. Fully client-side: bundle JSON already has all question text, no server round trips.
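The search-mode rules above can be modeled as follows. The real implementation runs in the browser; this Python sketch only mirrors the logic, and `build_matcher`, `visible_rows`, and the supported flag set (`i`, `m`, `s`) are assumptions made for illustration.

```python
import re
from typing import Callable, Optional, Tuple

# Assumed flag subset; anything else is treated as an invalid pattern here.
_FLAGS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}


def build_matcher(raw: str) -> Tuple[str, Optional[Callable[[str], bool]]]:
    """Return (mode, predicate); predicate is None when the regex is invalid."""
    m = re.fullmatch(r"/(.+)/([a-z]*)", raw, flags=re.DOTALL)
    if m is None:
        # Plain text: case-insensitive substring match.
        needle = raw.lower()
        return "substring", lambda text: needle in text.lower()
    flags = 0
    for ch in m.group(2):
        if ch not in _FLAGS:
            return "regex", None
        flags |= _FLAGS[ch]
    try:
        pattern = re.compile(m.group(1), flags)
    except re.error:
        return "regex", None
    return "regex", lambda text: pattern.search(text) is not None


def visible_rows(rows: list, raw: str) -> list:
    """Filter rows by the search input; an invalid regex keeps every row visible."""
    _mode, predicate = build_matcher(raw)
    if predicate is None:
        return list(rows)
    return [row for row in rows if predicate(row)]
```

Returning the mode alongside the predicate is what would drive the `substring` / `regex` pill, and the `None` predicate is the signal for the "invalid regex" hint.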
| Commit | Summary |
|---|---|
| fc0803a | Lead with compare-judge verdicts, hide per-answer scores, slot abstraction |
| 2560750 | Scientist-first framing: hide AI verdicts behind disclosures |
| 2139613 | Client-side search field (substring or /regex/) |
spike/grand-prix-subcommand fast-forwarded into feature/production-baseline-comparison (5fd63c5 → 2139613) and pushed. Spike branch pruned locally.
- `feature/production-baseline-comparison` at `2139613` on origin.
- Published report at https://access-ci-reports.netlify.app/grand-prix-20260421.html.
- `comparisons/grand_prix_20260420_161753_*.json` untouched; these four artifacts are the data spine for the report.
- `notes.py` observation and subtitle copy updated to match the new framing (no longer references "composite numbers").
- Return to the foggy notions in `OY_VEY_2.md` now that the grand-prix output is readable enough to reason from.
- Decide whether the per-answer judge stays (Foggy C: feed it reference answers) or gets deprecated in place (Foggy D: pairwise + humans only). This report's framing de facto takes option D's stance; committing to it is a separate conversation.
- Parked synthesis spike (`spike/synthesis-empty-tool-defers-to-rag`) still unmerged pending a query-class-aware approach.
Follow-up to the earlier mcp-cov-010 diagnosis. Took the "investigate the events MCP" item from the Next list and closed it.
search_events lives in access-mcp/packages/events/src/server.ts. The tool is a thin proxy to the Drupal view at https://support.access-ci.org/api/2.3/events: query is passed literally to the search_api_fulltext param, type/tags/skill become faceted filters (f[0]=custom_event_type:X). No client-side matching logic.
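To make the proxying concrete, here is a rough Python model of how the request URL would be composed. The actual tool is TypeScript in `server.ts`; `build_events_url` is a hypothetical name, and only the `query` and `type` parameters are modeled because the facet names for `tags` and `skill` are not given here.

```python
from typing import Optional
from urllib.parse import urlencode

EVENTS_URL = "https://support.access-ci.org/api/2.3/events"


def build_events_url(query: Optional[str] = None, event_type: Optional[str] = None) -> str:
    """Compose the Drupal view URL the way the description above implies."""
    params = []
    if query:
        # Passed through literally; no client-side matching logic.
        params.append(("search_api_fulltext", query))
    if event_type:
        # Faceted filter of the form f[0]=custom_event_type:X.
        params.append(("f[0]", f"custom_event_type:{event_type}"))
    return f"{EVENTS_URL}?{urlencode(params)}" if params else EVENTS_URL
```

The key point the sketch shows is that whatever string the caller supplies as `query` reaches the full-text index unchanged, so a generic category word only matches events whose text contains it.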
Hit the backing API directly. Of 106 upcoming events, the custom_event_type vocabulary is Office Hours (80), Training (14), Conference (7), Other (5). No event is typed webinar anywhere in the corpus, and only 2 upcoming events mention the word in their description (neither in title). "Anvil Support Hour" and "Sage Office Hours" are both present and recurring April–June 2026.
So ?search_api_fulltext=webinar returns [] correctly given the data — the data just doesn't use that word. Not an MCP bug, not a data-population problem.
Hybrid of two things:
- Tool description misled the LLM. The existing description advertised the server as searching "workshops, webinars, training" and listed `webinar` in the `type` param as a common value. An LLM reading that schema naturally composed `{query: "webinar"}`.
- Planner over-narrowed. A generic ask ("any webinars coming up?") was translated to a keyword filter when dropping `query` or filtering by `type` would have been correct.
One commit, 37 lines touched, events package only. Two changes:
- Pre-call: rewrote the tool description to describe the actual corpus, constrained `type` to an enum (`Training | Office Hours | Conference | Other`), and told callers not to use generic event-category words as `query`.
- Post-call: when `query` returns 0 items but the corpus has upcoming events, the response now includes a `note` field explaining the miss and telling the caller to retry without `query`. Graceful degradation without silently substituting unrelated events.
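The post-call behavior could look roughly like the sketch below. It is a Python illustration of a TypeScript change: the function name, the note wording, and the way the upcoming-event count is obtained (`upcoming_total`) are all assumptions; only the `note` field and the `{total, items}` response shape come from this log.

```python
from typing import Optional


def with_empty_query_note(items: list, query: Optional[str], upcoming_total: int) -> dict:
    """Attach an explanatory note when a keyword query misses a non-empty corpus."""
    response = {"total": len(items), "items": items}
    if query and not items and upcoming_total > 0:
        # Explain the miss instead of substituting unrelated events.
        response["note"] = (
            f"No events matched query '{query}', but {upcoming_total} upcoming "
            "events exist. Retry without `query`, or filter by `type`."
        )
    return response
```

The note is only added when all three conditions hold, so a genuinely empty corpus or an unfiltered listing still returns the plain response.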
Events package builds clean. Pre-existing TS/test-infra breakage on main (shared package build errors, missing @opentelemetry/sdk-node dep) is unrelated to this change — same failures appear with changes stashed. Branch pushed, no PR opened yet.
- Branch: `fix/search-events-webinar-guidance` on `bacalj/access-mcp`, one commit (`15c74cb`), pushed.
- Planner-side change in access-agent deliberately deferred: the schema + note should steer the LLM on their own. Re-run of mcp-cov-010 against the rebuilt MCP will tell us whether any agent-side retry-on-note logic is still needed.
- Rebuild MCP and re-run mcp-cov-010 (or the full mcp_coverage battery) via the grand-prix / HTML report routine. Handed to a fresh session with richer grand-prix context.
- Based on that result, either open a PR as-is or layer a retry-on-note change into access-agent as a branch off the current `feature/production-baseline-comparison`.
Worked on two ends of the eval pipeline: the per-answer judge, and a new comparison-judge stage that sits alongside it. Outcome: the agent-vs-raw signal, previously a statistical tie on mcp_coverage, now shows the agent winning by a clear margin under the improved judge.
Pulled the failing case from Postgres and traced through it. Four layers compounded:
- Synthesis prompt bug: `COMBINED_SYNTHESIS_PROMPT` told the agent that an empty tool result was "the authoritative answer," even using the webinar phrasing verbatim as its example. So when `search_events` returned `{total: 0, items: []}`, the agent said "no webinars" despite RAG matches listing Anvil Support Hour and Sage Office Hours.
- Evaluate node echoed the bias: concluded `is_helpful=true` on the empty tool result.
- MCP scope may be narrow: raw UKY's RAG had the office/support hour data; `search_events` did not. Possibly a matching-too-narrowly bug on the MCP side. Not yet investigated.
- No graph-level empty-tool fallback: when a tool returns empty and RAG has substance, the graph still goes to combined synthesis rather than routing to RAG-only.
Also directly in tension with judge commit f6f3238 which had already established the opposite rule for the judge side.
Branch spike/synthesis-empty-tool-defers-to-rag: flipped the COMBINED_SYNTHESIS_PROMPT rule wholesale (empty tool = absence, defer to RAG when RAG has substance). Left unmerged because the blanket flip trades one failure mode for another — time-sensitive queries (e.g. "are there current outages?") legitimately want the empty tool to override stale RAG. A proper fix needs query-class awareness, which is a larger spike.
Cut spike branch spike/judge-preamble-and-richer-context off the clean feature base (no synthesis changes in it), fast-forward-merged back in once tests passed. Eight commits:
| Commit | Summary |
|---|---|
| 5d7afc7 | ToolResult.arguments captured agent-side |
| e1c4237 | Mission preamble + structured tool-call context in judge prompt |
| d01b216 | rejudge subcommand: re-score existing runs, no system re-call |
| 7148284 | argilla-push guard against rejudge overwrites |
| 853bf8a | compare-judge subcommand: head-to-head LLM analysis, JSON output |
| 117422f | Self-contained compare-judge JSON + JSON-backed HTML path |
| 58ac7ba | Template renders compare-judge narrative (run-level + per-question) |
| 5fd63c5 | .gitignore comparisons/ |
| | Old judge | New judge |
|---|---|---|
| agent_full composite | 4.74 | 4.71 |
| raw_rag composite | 4.75 | 4.44 |
| Agent vs raw margin | −0.01 (tie) | +0.27 (agent wins) |
The mission preamble took raw's completeness on this battery from ~4.8 to 3.33 — the judge now correctly recognizes "generic advice when specific data was called for" as incomplete. Agent scores essentially unchanged. That's the ideal calibration outcome.
On mcp-cov-010 specifically, the new judge still scored agent 2.65 (agent genuinely failed — no judge change was going to rescue "no webinars"). Raw dropped 5.00 → 4.75 (nudged down for calling "Anvil Support Hour" a webinar). The comparison judge's verdict on that question was A-wins-large with the per-answer-judge-note "Agree."
New first-class stage in the eval pipeline: compare-judge reads a pair of runs from Postgres (read-only), calls the same OpenAI judge, and writes a self-contained JSON artifact with per-question verdicts + run-level summary. Zero DB writes, zero schema changes.
comparisons/ is gitignored. The artifact becomes the narrative spine for the HTML report: html --from-json <paths...> renders from JSON alone (no Postgres needed at render time). The template grew two new sections (run-level verdict per battery + per-question verdict inside each expanded row), both gated on compare-judge data being present — Postgres-backed reports with no comparisons render unchanged.
End-to-end run on the rejudged mcp_coverage pair produced an 859 KB self-contained JSON and a 113 KB rendered HTML. Report at ~/.agent/diagrams/mcp_coverage_from_json.html.
- `feature/production-baseline-comparison` pushed to origin at `5fd63c5`.
- `spike/synthesis-empty-tool-defers-to-rag` still present locally, unmerged, parked pending a query-class-aware approach.
- `OY_VEY.md` written at the access-ci root earlier as a reorientation doc (not gist-synced, may delete when it stops being useful).
- Investigate the events MCP (`access-mcp`) to determine whether `search_events` is too narrow by design or accident, and whether the backing data actually contains the office/support-hour records. Durable fix might live there rather than in the agent.
- Run compare-judge on the other three battery pairs (friendly, real_user, combined) for a full grand-prix HTML.
- Decide what to do with the parked synthesis spike.
Andrew proposed routing ACCESS bot traffic through qa-bot-proxy too, as a stopgap while the agent is still being evaluated. The agent is not deployed to production — ACCESS bot currently hits UKY directly via the hardcoded default in access-qa-bot/src/config/constants.ts (QA_ENDPOINT).
If confirmed, the changes would be:
- In `qa-bot-proxy`, add an `"access"` backend ID pointing at the UKY URL
- In `access-qa-bot`, point `qaEndpoint` at the proxy and set `backendId: 'access'`
- Verify qa-bot-core sends + resets the Turnstile token on every request (not just the first): the stateless proxy validates every request, unlike access-agent which marks sessions as verified. The `turnstile.reset()` fix in 0.2.35 was built for NAIRR; need to confirm it applies to ACCESS config too.
Waiting on Andrew to confirm tomorrow (2026-04-10).
Two new repos:
- `qa-bot-proxy` (necyberteam/qa-bot-proxy): Netlify serverless function that validates Cloudflare Turnstile tokens and forwards requests to backends resolved from the `ALLOWED_BACKENDS` env var. Client sends a `_backend` ID (e.g. `"nairr"`), never a URL. CORS support with origin reflection and credentials. Deployed at qa-bot-proxy.netlify.app. 16 tests.
- `nairr-bot` (necyberteam/nairr-bot): Existing Netlify-hosted static site, updated to use the proxy. Points `qaEndpoint` at the external proxy URL (cross-origin), with `backendId: 'nairr'`. Shows git commit hash in bottom-left corner for deploy verification.
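The "ID, never a URL" rule can be sketched as a lookup against an allow-list. The real proxy is a Netlify function, not Python; `resolve_backend` is a hypothetical name, and the assumption that `ALLOWED_BACKENDS` holds a JSON object mapping IDs to URLs is mine, since the env var format is not described here. The example URL is a placeholder.

```python
import json
from typing import Mapping, Optional


def resolve_backend(body: Mapping, env: Mapping) -> Optional[str]:
    """Map the client-supplied `_backend` ID to a configured URL, or None."""
    # Assumed format: {"nairr": "https://backend.example/nairr", ...}
    allowed = json.loads(env.get("ALLOWED_BACKENDS", "{}"))
    backend_id = body.get("_backend")
    if not isinstance(backend_id, str):
        return None
    # Anything not in the allow-list (including a raw URL) resolves to None.
    return allowed.get(backend_id)
```

Because the client value is only ever used as a dictionary key, a caller cannot point the proxy at an arbitrary host.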
0.2.33: New optional backendId prop — included as _backend in request body for proxy routing. lib.tsx refactored to derive types from QABotProps and spread props (future props flow automatically). Fixed missing X-Session-ID/X-Query-ID headers on Turnstile resubmit.
0.2.34: Removed 5-second Turnstile timeout that was killing the visible widget before it could complete. Fallback "log in" link now conditional — only shown when loginUrl is a real URL (not default /login), so NAIRR deployments without login don't show a broken link.
0.2.35: Reset Turnstile widget after each successful request using Cloudflare's recommended turnstile.reset() API. Turnstile tokens are single-use — without reset, the second question sends a burned token and triggers a "one moment" loop. ACCESS never hit this because access-agent marks sessions as verified after one token; the stateless proxy validates every request.
Both NAIRR and ACCESS Cloudflare keys changed from invisible to managed mode. Managed does invisible when it can, shows a visible checkbox when Cloudflare deems the user suspicious. No code changes needed — qa-bot-core's TurnstileWidget component renders whatever mode the key dictates.
nairr-bot (separate Netlify site) calls qa-bot-proxy (different Netlify site) cross-origin. Required: reflecting the request Origin header instead of Access-Control-Allow-Origin: *, plus Access-Control-Allow-Credentials: true, because qa-bot-core sends credentials: 'include' on all fetches.
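Browsers reject a wildcard `Access-Control-Allow-Origin` on credentialed requests, which is why the origin has to be echoed back. A minimal sketch of that header logic follows; `cors_headers` is a hypothetical name, and both the origin allow-list and the `Vary: Origin` header are my additions rather than something this log confirms the proxy does.

```python
from typing import Optional


def cors_headers(request_origin: Optional[str], allowed_origins: set) -> dict:
    """Reflect a permitted Origin so credentialed cross-origin fetches succeed."""
    if request_origin is None or request_origin not in allowed_origins:
        # Unknown origin: emit no CORS headers, so the browser blocks the response.
        return {}
    return {
        "Access-Control-Allow-Origin": request_origin,  # echoed back, never "*"
        "Access-Control-Allow-Credentials": "true",
        "Vary": "Origin",  # keeps caches from reusing one origin's response for another
    }
```

For example, `cors_headers("https://nairr-bot.example", {"https://nairr-bot.example"})` returns the reflected origin plus the credentials flag, while an unlisted origin gets an empty header set.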
Bumped to pick up qa-bot-core 0.2.35. No code changes — just dependency update.
Updated access-ci-ui → Drupal with 0.2.35. Bot still works, no regressions. ACCESS doesn't use the proxy or backendId.
access-agent keeps its own Turnstile (F.1). The proxy is a separate validation path for deployments that don't use access-agent. Future option: ACCESS could route through the proxy too, but no reason to now.
All 5 F.4 PRs merged:
- access-agent#13, qa-bot-core#13, access-qa-bot#8, access-ci-ui#76, access#397
- access-ci-ui#76 had a merge conflict with upstream a11y/release-please changes (new `qaEndpoint` / `ratingEndpoint` props), resolved by keeping both sets of props.
- access-ci-ui#76 was merged without maintainer review (Matt); sent a follow-up email explaining the change.
Published stable versions:
- `@snf/qa-bot-core@0.2.32` (npm + GitHub release)
- `@snf/access-qa-bot@3.5.0` (npm + GitHub release)
- access-ci-ui dependency updated from rc to `^3.5.0` on main.
Andrew pushed a refactor to access-agent before merge: RPSectionCache now fetches from a new /api/resource-groups Drupal endpoint (single call, pre-aggregated populated_sections) instead of the old /api/resources + per-resource detail calls. Also added uky_in_scope boolean propagation through the RAG answer pipeline and new tests.
Researched the personalization spec (access-qa-planning 09-researcher-profiles.md, 11-capability-registry.md). Determined that Phases 1 and 2 are the current scope — Phases 3–5 are explicitly marked "(Future)" in the spec. Phase 1 builds the data endpoint; Phase 2 makes the agent use it.
Phase 1 — /capabilities/personalized endpoint:
- New `DrupalProfileFetcher` service (`src/services/drupal_profile.py`) that calls Drupal JSON:API for user data: active allocations (`field_cider_resources`), affinity groups + coordinator status (`mcp_my_affinity_groups` view), institution, HPC experience.
- Fetches user entity fields and affinity groups in parallel; tolerates partial failures.
- Per-user in-memory cache with 5-min TTL (`PROFILE_CACHE_TTL_SECONDS`).
- `GET /api/v1/capabilities/personalized` requires JWT auth, returns `user`, `highlighted_capabilities`, and `context`.
- `highlighted_capabilities` derived from profile: coordinators get "Manage [group] announcements", users with allocations get "Check your usage on [resources]".
- Config: `DRUPAL_BASE_URL` (default `https://support.access-ci.org`).
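The per-user TTL cache mentioned above might be structured along these lines. This is a sketch under assumptions, not the code in `drupal_profile.py`: the `ProfileCache` class and its method names are hypothetical, and the injectable clock exists only to make expiry easy to test.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

PROFILE_CACHE_TTL_SECONDS = 300  # the 5-minute TTL described above


class ProfileCache:
    """In-memory cache keyed by user ID, with lazy expiry on read."""

    def __init__(self, ttl: float = PROFILE_CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, user_id: str) -> Optional[Any]:
        entry = self._entries.get(user_id)
        if entry is None:
            return None
        stored_at, profile = entry
        if self._clock() - stored_at >= self._ttl:
            # Expired: drop the entry so the caller refetches from Drupal.
            del self._entries[user_id]
            return None
        return profile

    def put(self, user_id: str, profile: Any) -> None:
        self._entries[user_id] = (self._clock(), profile)
```

A route handler would call `get` first and only hit Drupal (then `put`) on a miss, which bounds profile staleness to the TTL.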
Phase 2 — System prompt injection + personalized discovery:
- `personalization_context` field added to `AgentState`, threaded through `create_initial_state` → `stream_agent` → `run_agent`.
- `UserProfile.to_system_prompt_section()` formats profile as a `## USER CONTEXT` block (name, institution, HPC experience, skills, interests, affinity groups, allocations).
- All three synthesis prompts (tools-only, combined, RAG-only) now include a `{personalization}` placeholder.
- Route layer fetches profile before capability discovery, passes both `personalization_context` (for agent prompts) and `highlighted_capabilities` (for discovery response).
- "Show my options" response now includes personalized highlights at the top when available.
Not yet tested against live Drupal data. The JSON:API field shapes for field_cider_resources, field_institution, field_hpc_experience, and the mcp_my_affinity_groups view may need adjustment once we hit real responses.
All changes on feature/personalization-phase-1-2 branch in access-agent (3 commits, not pushed). 14 new tests, all passing.
Added welcome_message to the scoped capabilities response. The message is built dynamically from the resource's populated sections (e.g., "Hi! I can help with login, file transfer, storage, job submission, software, and datasets on Delta — or ask me anything about ACCESS."). Resources with no populated sections get a simpler fallback.
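Building that sentence from a variable number of section labels comes down to joining them in natural English. A minimal sketch, assuming hypothetical helper names (`join_labels`, `build_welcome`) and an invented fallback string; the exact wording and punctuation of the shipped message may differ.

```python
from typing import Sequence


def join_labels(labels: Sequence[str]) -> str:
    """Join labels as 'a', 'a and b', or 'a, b, and c'."""
    if not labels:
        return ""
    if len(labels) == 1:
        return labels[0]
    if len(labels) == 2:
        return f"{labels[0]} and {labels[1]}"
    return ", ".join(labels[:-1]) + f", and {labels[-1]}"


def build_welcome(labels: Sequence[str], rp_title: str) -> str:
    """Build a scoped welcome message, or a simpler fallback with no sections."""
    if not labels:
        # Hypothetical fallback text for resources with no populated sections.
        return "Hi! Ask me anything about ACCESS."
    return (
        f"Hi! I can help with {join_labels(labels)} on {rp_title}, "
        "or ask me anything about ACCESS."
    )
```

Handling the one-, two-, and three-plus-label cases separately avoids outputs like "login, and storage" for sparse resources.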
Changes across 3 repos (all on feature/resource-scoping):
- access-agent: `get_by_category_scoped()` now returns `welcome_message` built from `SECTION_QUESTION_MAP` labels. Handles 1, 2, and 3+ section cases for natural English.
- access-qa-bot: Added `welcome_message?: string` to `CapabilitiesResponse` type. Welcome message priority: explicit `welcome` prop > `capabilities.welcome_message` > `BOT_CONFIG.WELCOME_MESSAGE`. Published as `@snf/access-qa-bot@3.5.0-rc.2`.
- access-ci-ui: Removed hardcoded `"Welcome to ACCESS Q&A Bot!"` default that was blocking the capabilities-driven welcome message.
Testing notes:
- Used dev seed data in `rp_cache.py` to work around prod Drupal rate-limiting during local testing (removed before commit).
- Tested end-to-end through local Drupal: embedded bot on home page showed scoped Delta welcome message; floating bot showed default.
- Discovered access-ci-ui's `qa-bot.jsx` had a hardcoded default that overrode everything; fixed.
Built the agent-side infrastructure for resource-scoped capabilities (F.4). All changes on feature/resource-scoping branch in access-agent.
What was built:
- `resource_context` field threaded through the full query pipeline: `QueryRequest` → `stream_agent()` → `AgentState` → `create_initial_state()`
- `UKYClient.ask()` now accepts `rp_name`: sets the `X-Origin` header to the RP slug and includes `rp_name` in the request body for UKY's scoped vector DB. Added `in_scope` field to `UKYResponse` (None until UKY implements it).
- `RPSectionCache` with hardcoded seed data for 9 resources (delta, anvil, bridges2, expanse, jetstream2, stampede3, derecho, neocortex, kyric). Singleton at `src/services/rp_cache.py`.
- Section-to-question mapping (`SECTION_QUESTION_MAP`) in capabilities registry: maps `login`, `file_transfer`, `storage`, `queue_specs`, `top_software`, `datasets` to labeled capabilities with `description` and `example_query` interpolated with RP title.
- `GET /api/v1/capabilities?resource_context=<slug>` returns RP-scoped response with `resource_docs` + `support` + `analytics` categories. Unknown slugs fall through to standard response.
- Scoped capability discovery short-circuit: "Show my options" with `resource_context` returns Delta-scoped suggestions.
- Out-of-scope fallback in `rag_answer_node`: if scoped RAG response contains out-of-scope phrases, retries without `rp_name` for general RAG.
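The out-of-scope fallback amounts to a phrase check followed by one unscoped retry. The sketch below is illustrative only: `ask` stands in for the real client call (whose signature is not shown here), the helper names are hypothetical, and the single phrase in the tuple is the one this log reports UKY returning for empty RP collections; the actual heuristic list is presumably longer.

```python
from typing import Callable, Optional

# One known phrase from this log; the real heuristic list is an unknown here.
OUT_OF_SCOPE_PHRASES = ("no documents are currently available",)


def looks_out_of_scope(answer: str) -> bool:
    """Case-insensitive check for any out-of-scope marker phrase."""
    lowered = answer.lower()
    return any(phrase in lowered for phrase in OUT_OF_SCOPE_PHRASES)


def ask_with_fallback(ask: Callable[[str, Optional[str]], str],
                      query: str, rp_name: Optional[str]) -> str:
    """Try the RP-scoped collection first, then retry once against general RAG."""
    answer = ask(query, rp_name)
    if rp_name is not None and looks_out_of_scope(answer):
        # Scoped collection had nothing useful: drop rp_name and ask again.
        return ask(query, None)
    return answer
```

The retry is skipped when the first call was already unscoped, so the worst case is two upstream calls per question.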
Verified with curl:
- `?resource_context=delta` → 6 section capabilities + support + analytics (matches spec exactly)
- `?resource_context=jetstream2` → 2 sections (login + storage) + support + analytics (sparse resource)
- `?resource_context=bogus` → falls through to standard 5-category response
- `POST /query` with `resource_context=delta` + "Show my options" → Delta-scoped discovery
Phase 2 — frontend prop plumbing (same session):
- qa-bot-core: `resourceContext` prop on `QABotProps`, threaded through `CreateQAFlowParams` → POST body (`resource_context`), Turnstile resubmit body, `lib.tsx` programmatic API.
- access-qa-bot: `resourceContext` on `AccessQABotProps`, appended as query param on capabilities fetch, passed through to `QABot`.
- Verified via npm link workflow: `[linked]` marker in logger, dev server at localhost:3000 with `resourceContext="delta"`, confirmed Delta-scoped capabilities and scoped RAG fallback.
- Fixed scoped RAG fallback: UKY returns "No documents are currently available" for empty RP collections; added to out-of-scope heuristic phrases. Retry with general RAG now works.
- access-ci-ui not yet updated (separate repo, needs PR to access-ci-org/access-ci-ui).
PRs:
- access-agent: necyberteam/access-agent#13
- qa-bot-core: necyberteam/qa-bot-core#13
- access-qa-bot: necyberteam/access-qa-bot#8
Phase 3 — access-ci-ui + end-to-end Drupal testing (same session):
- Published rc versions: qa-bot-core@0.2.32-rc.1, access-qa-bot@3.5.0-rc.1 (both from feature branches).
- access-ci-ui: bumped access-qa-bot dep, added explicit `resourceContext` prop to QABot wrapper. PR to access-ci-org/access-ci-ui#76.
- Built access-ci-ui, copied dist to Drupal's `web/libraries/access-ci-ui/`.
- Restored Drupal DB from backup (`backups/site.sql.gz`, Aug 2025); ddev volume had been pruned.
- Discovered `.embedded-qa-bot` div lives in a Drupal block content body (DB), not a template file. Drupal's text format filters strip `data-*` attributes, so `data-resource-context` can't be set via block content; production will need it on the template.
- Proved e2e by hardcoding `resourceContext: "delta"` in headerfooter.js → embedded bot showed Delta-scoped capabilities, scoped RAG with fallback to general. Floating bot remained unscoped. Two independent sessions on the same page.
PRs (ready for review):
- access-agent: https://github.com/necyberteam/access-agent/pull/13
- qa-bot-core: necyberteam/qa-bot-core#13
- access-qa-bot: necyberteam/access-qa-bot#8
- access-ci-ui: access-ci-org/access-ci-ui#76
Phase 4 — live Drupal fetch + headerfooter.js PR (same session):
- Replaced hardcoded seed data in `RPSectionCache` with live fetch from Drupal's `/api/resources` list + `/api/resources/{id}` detail endpoints. Checks which per-section fields (ssh_logins, file_transfer, storage, etc.) have content. Refreshes on first access + every 30 min (configurable via `RP_CACHE_TTL_SECONDS`).
- Currently all 109 resources return empty section arrays; RPs haven't entered documentation content yet. Cache correctly shows 90 resources, 0 with populated sections. Resources still get support + analytics capabilities.
- `headerfooter.js` PR to `necyberteam/access` on `3.0.x`: one-line change to read `data-resource-context` from the `.embedded-qa-bot` div and pass as `resourceContext` to `qaBot()`.
- Confirmed `data-resource-context` attribute already exists in production Drupal (verified on https://support.access-ci.org/node/10864, where `data-resource-context="anvil"` is set by a preprocess hook).
- Local Drupal DB (Aug 2025) is too stale for current codebase; smoke test failed on unrelated schema errors. Full chain was proven earlier in the session.
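The refresh behavior described for `RPSectionCache` (load on first access, reload once the TTL has elapsed) can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the constructor signature, the injected `fetch` callable (standing in for the Drupal list + detail calls), the `clock` parameter, and the `sections_for` method are all assumptions. Only the class name, the three section field names, and `RP_CACHE_TTL_SECONDS` come from the notes.

```python
import os
import time
from typing import Callable, Dict, List, Optional

# Per-section field names mentioned in the notes; the real list is longer.
SECTION_FIELDS = ("ssh_logins", "file_transfer", "storage")


def populated_sections(detail: dict) -> List[str]:
    """Return the section fields that actually contain content."""
    return [f for f in SECTION_FIELDS if detail.get(f)]


class RPSectionCache:
    """Hypothetical TTL cache: loads on first access, reloads after the TTL."""

    def __init__(self, fetch: Callable[[], Dict[str, dict]],
                 ttl_seconds: Optional[float] = None,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch  # stands in for the Drupal list + detail requests
        self._ttl = ttl_seconds if ttl_seconds is not None else float(
            os.environ.get("RP_CACHE_TTL_SECONDS", "1800"))  # 30 min default
        self._clock = clock
        self._loaded_at: Optional[float] = None
        self._sections: Dict[str, List[str]] = {}

    def _refresh_if_stale(self) -> None:
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            details = self._fetch()
            self._sections = {rid: populated_sections(d)
                              for rid, d in details.items()}
            self._loaded_at = now

    def sections_for(self, resource_id: str) -> List[str]:
        """Populated sections; empty list for unknown or empty resources."""
        self._refresh_if_stale()
        return self._sections.get(resource_id, [])
```

With this shape, a resource whose section arrays are all empty returns an empty list, which matches the current state where resources still get support + analytics capabilities but no section capabilities.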
All PRs (5 repos, all ready for review):
- access-agent: https://github.com/necyberteam/access-agent/pull/13
- qa-bot-core: necyberteam/qa-bot-core#13
- access-qa-bot: necyberteam/access-qa-bot#8
- access-ci-ui: access-ci-org/access-ci-ui#76
- access (Drupal headerfooter.js): necyberteam/access#397
Phase 5 — Drupal smoke test (same session):
- Got fresh production DB via `gh run download` artifact + `robo did`. Required switching `cyberteam_drupal` to `main` and running `composer install`; the Aug 2025 DB was too stale for the old codebase.
- `headerfooter.js` lives in `necyberteam/access` repo (branch `3.0.x`), not `cyberteam_drupal`. Separate git repo nested at `docroot/modules/custom/access/`.
- Production `headerfooter.js` imports from `unpkg.com/@access-ci/ui@0.19.0`; had to swap to local `/libraries/` path for testing (not committed).
- Smoke test passed: both bots render, embedded bot picks up `data-resource-context` from the div (confirmed attribute present in fresh DB via preprocess hook), floating bot stays unscoped.
- When PRs merge: access-ci-ui gets a new release, then bump the version in `headerfooter.js` import, the same process as every access-ci-ui release.
All PRs (5 repos, all ready for review):
- access-agent: https://github.com/necyberteam/access-agent/pull/13
- qa-bot-core: necyberteam/qa-bot-core#13
- access-qa-bot: necyberteam/access-qa-bot#8
- access-ci-ui: access-ci-org/access-ci-ui#76
- access (Drupal headerfooter.js): necyberteam/access#397
What's next:
- README updates for resourceContext prop in all 3 frontend repos (before final publish)
- After PRs merge: publish stable versions (qa-bot-core 0.2.32, access-qa-bot 3.5.0), update access-ci-ui dep, bump headerfooter.js import version
- Welcome message field in capabilities response (Andrew approved, spec it + build)
Explored three approaches for capability discovery in the chatbot:
- 5 category buttons with improved canned responses: kept the existing category/capability button pattern but improved the shortcircuit text to include "try typing..." examples. Extended the shortcircuit to also match capability labels (not just category labels).
- 8 example query buttons: replaced category buttons with real queries ("Are there any system outages right now?", "Is Python available on Delta?") that go through the full agent pipeline. Honest buttons: clicking does the same as typing. Standard ChatGPT/Gemini pattern.
- Single "Show my options" button: minimal approach. One discovery button returns a categorized list of example queries. Welcome message introduces capabilities and invites typing. Settled on this.
Key insight from the process: the old buttons looked actionable but just returned canned text. The "honest button" approach (variant 2) was better but 8 buttons is a lot. A single discovery button with rich content is the cleanest.
access-agent (PR #12, merged):
- Renamed support category "Get help" → "Create a ticket"
- Extended `_check_capability_discovery()` to match capability labels (initially for variant 1, kept as safety net for typed input)
- Rewrote "Show my options" response to show example queries instead of generic descriptions
- Added `example_query` field to `Capability` dataclass; registry is single source of truth
- Removed dead category-label and capability-label shortcircuit blocks (unreachable in new UX)
access-qa-bot (PR #7, merged, v3.4.0 published):
- Replaced 5 category buttons with single "Show my options"
- Updated welcome message to introduce capabilities and invite typing
- Removed unused `capabilities` parameter from `createMainMenuFlow`
- Removed `CapabilitiesResponse` import from flow file
Two changes from Andrew's review on the agent PR:
- Move example queries into the registry instead of a separate dict in routes.py; this led to the `example_query` field on `Capability`
- Remove unused `_capabilities` parameter in the frontend
Set up npm link for qa-bot-core → access-qa-bot local iteration (with [linked] breadcrumb in logger for verification). Added vite proxy initially for CORS, reverted in favor of using port 3000 which is already in the agent's CORS allowlist.
Addressed all 7 items from Andrew's review on qa-bot-core PR #11:
- Removed "Feel free to ask another question" injection (all 3 instances)
- Gated metadata display (confidence, agent, tools_used) behind `QA_BOT_DEBUG` localStorage flag
- Suppressed rating buttons when `rating_target` is null
- Rewrote visible Turnstile challenge flow: auto-resubmits pending query after `onVerify` instead of waiting for user input (eliminates silent input replacement)
- Token expiry logs warning instead of showing permanent error (allows Cloudflare auto-refresh)
- Second `requires_turnstile` response keeps pending query intact for re-solve
Tested locally with Cloudflare test keys (1x00000000000000000000AA site key, TURNSTILE_FREE_QUERIES=0). Visible challenge flow works end-to-end.
Pulled Andrew's recent work across repos:
- access-agent `feature/capability-routing`: replaced `_response_is_question()` ("?" heuristic) with `domain_completed` flag from domain agent node. Also fixed `requires_auth` on capabilities (only `manage_announcements` needs auth), added `capability_id` to classifier examples.
- access-qa-bot `feature/turnstile`: deployment warning `useEffect` for misconfigurations, removed duplicate `WELCOME_MESSAGE_LOGGED_OUT`.
- access-qa-planning: reorganized into `active/`, `archive/`, `decisions/` directories. Added 6 ADRs.
- access-agent `main`: streaming responses spec (`2026-04-03-streaming-responses-design.md`).
All three feature PRs merged to main:
- access-agent #6 (`feature/capability-routing`)
- qa-bot-core #11 (`feature/turnstile`)
- access-qa-bot #4 (`feature/turnstile`)
- qa-bot-core v0.2.30 — npm + git tag + GitHub release
- access-qa-bot v3.3.12 — npm + git tag + GitHub release (updated qa-bot-core dependency to 0.2.30)
Streaming responses (Andrew's spec at access-agent/docs/superpowers/specs/2026-04-03-streaming-responses-design.md). RAG scoping is unblocked on the agent side but lower priority. Personalization last (Drupal-dependent).
Systematic capability-by-capability testing with access-agent (Docker), access-qa-bot (Vite dev server), and all MCP servers running locally. Created OUTSTANDING_ISSUES.md at the access-ci root to track cross-cutting issues.
Eight total issues found and fixed across the stack:
- Logged-out buttons not rendering: Capabilities fetch async timing. Fix: defer `<QABot>` render until capabilities load.
- Category labels sent to RAG: "Get help" etc. treated as questions. Fix: `_check_capability_discovery()` short-circuit in routes.py.
- Ratings on clarifying questions: `is_final_response` hardcoded `true`. Fix: `_response_is_question()` heuristic for domain agents.
- "Feel free to ask another question" on every response: Fix: gated on `lastIsFinalResponse` in qa-flow.tsx (rc.14).
- Hardcoded ticket/security flow intercepts: Removed; all routes go to `qa_loop`.
- Classifier misrouting MCP-backed capabilities: NSF awards, affinity groups, events, announcements classified as "static" → sent to RAG. Fix: added MCP-backed capability section with routing rules and examples to classifier prompt.
- CORS blocking port 3006: Dev CORS whitelist didn't include 3006. Fix: added it. Also added graceful degradation when capabilities endpoint is unreachable (renders bot with fallback "Show my options" button).
- Lock emoji prefix breaking discovery: Frontend sends `🔒 Ask a question` which didn't match discovery labels. Fix: strip lock emoji prefix in `_check_capability_discovery()`. Also removed "Ask a question" button entirely per spec resolved decision #1 (typing is the default).
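The discovery short-circuit and the lock-prefix fix amount to normalizing the incoming text before comparing it to known button labels. A minimal sketch, assuming a simple label-to-response map; the map contents, the helper names, and the response strings are hypothetical, and the real `_check_capability_discovery()` in routes.py may be structured differently:

```python
from typing import Optional

LOCK_PREFIX = "\U0001F512"  # lock emoji the frontend prepends to locked items

# Hypothetical label -> canned listing map; real labels come from the registry.
DISCOVERY_RESPONSES = {
    "show my options": "Here are some things you can ask...",
    "create a ticket": 'Try typing: "Open a help ticket about my login issue"',
}


def normalize_label(query: str) -> str:
    """Strip the lock emoji prefix and whitespace, then lowercase."""
    text = query.strip()
    if text.startswith(LOCK_PREFIX):
        text = text[len(LOCK_PREFIX):]
    return text.strip().lower()


def check_capability_discovery(query: str) -> Optional[str]:
    """Return a canned discovery response for a button label, else None."""
    return DISCOVERY_RESPONSES.get(normalize_label(query))
```

Returning `None` for anything that is not a known label lets ordinary typed questions continue into the agent pipeline untouched.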
| Capability | Result |
|---|---|
| Get help (button) | PASS — discovery listing |
| Open a help ticket (typed) | PASS — JSM domain agent engages. E2e deferred (need test queue) |
| Explore resources (button) | PASS — discovery listing |
| Check system status | PASS — system-status MCP |
| Browse events | PASS — events MCP |
| Browse affinity groups | PASS — affinity-groups MCP (needs per-group links) |
| Search software | PASS ⭐ — excellent results for "search software abaqus" |
| Search announcements | PASS — announcements MCP (minor list formatting in UI) |
| Search NSF awards | PARTIAL — tool called but institution search too fuzzy, quality check rejects |
| Show my options | PASS — full capability listing with lock icons |
| Manage announcements | PASS routing — domain agent engages, e2e needs auth |
| Check usage (XDMoD) | FAIL — MCP tool 500 error, missing XDMOD_API_TOKEN |
| Repo | Branch | Hash | Message |
|---|---|---|---|
| access-agent | feature/capability-routing | `aeadd02` | feat: capability discovery short-circuit and is_final_response heuristic |
| access-agent | feature/capability-routing | `713cfc9` | fix: add localhost:3006 to dev CORS whitelist |
| access-agent | feature/capability-routing | `578a9cb` | feat: improve classifier routing for MCP-backed capabilities |
| access-agent | feature/capability-routing | `e570611` | fix: strip lock emoji prefix from capability discovery queries |
| qa-bot-core | feature/turnstile | `f27bec7` | fix: gate follow-up prompt on is_final_response (rc.14) |
| access-qa-bot | feature/turnstile | `c2a5fb6` | refactor: simplify main menu flow, all routes to agent |
| access-qa-bot | feature/turnstile | `790aaaa` | fix: graceful degradation when capabilities endpoint is unreachable |
| access-qa-bot | feature/turnstile | `60d7463` | fix: omit "Ask a question" button — typing is the default action |
Created OUTSTANDING_ISSUES.md with 10 items covering: test JSM queue (Andrew), NSF search precision (Andrew), quality check retry waste, is_final_response heuristic fragility, XDMoD token, CORS flexibility, authenticated testing, announcements e2e, affinity group links, numbered list rendering.
Reviewed the capability registry design spec (access-agent/docs/superpowers/specs/2026-03-18-capability-registry-design.md) and planned which F.3 changes belong in each repo. Core decision: qa-bot-core handles generic rating infrastructure (metadata capture, is_final_response gating, rating_target routing). access-qa-bot handles ACCESS-specific concerns (capability fetching, dynamic button rendering, lock icons, personalization).
One divergence from spec: added agentRatingEndpoint as an optional prop on qa-bot-core. The spec doesn't name this prop but qa-flow needs a concrete URL to POST agent ratings to. Everything else matches the spec.
- Added `capabilitiesEndpoint` (passthrough for wrappers) and `agentRatingEndpoint` props to `QABotProps`
- qa-flow.tsx now captures `is_final_response` and `rating_target` from response `metadata` object
- Rating buttons gated on `is_final_response: true` (replaces old `hasShownResponse` flag)
- Ratings route to `agentRatingEndpoint` when `rating_target` is `"agent"` (with agent payload: `query_id`, `rating`, `session_id`), or to `ratingEndpoint` for `"uky_rag"` (existing UKY payload)
- Wired through QABot.tsx and programmatic API interfaces
- Added `AGENT_ENDPOINT` config (defaults to `localhost:8000/api/v1`) with derived capabilities and agent rating URLs
- AccessQABot fetches `GET /api/v1/capabilities` on mount, re-fetches when auth state changes
- Rewrote `main-menu-flow.ts`: dynamic buttons from capabilities response, "Show my options" discovery button, `chatDisabled: false` on start step
- Combined welcome message + AI disclaimer into one message (removed `go_ahead_and_ask` transition step)
- Lock prefix (🔒) on capabilities marked `locked: true` for anonymous users
- Lazy personalization stub (`GET /api/v1/capabilities/personalized`) for authenticated users; logs to console, F.4 will use the data
- Added `CapabilitiesResponse`, `CapabilityCategory`, `CapabilityItem` types
Consolidated feature/dynamic-capabilities branches (created fresh from main) onto existing feature/turnstile branches in both repos. The turnstile frontend work and capabilities work are now on one branch per repo. Resolved merge conflicts in qa-flow.tsx (combined Turnstile challenge suppression with is_final_response gating). Deleted stale feature/dynamic-capabilities branches.
Dev server runs, capabilities endpoint responds, buttons render from API data. Some runtime issues remain — likely button routing, flow transitions, or prop wiring. Needs full-stack debugging next session with access-agent + access-qa-bot + qa-bot-core all running.
Addressed two rounds of PR review feedback before merging. Review fixes:
- Rating endpoint hardened with anti-spoofing: ownership check (user_hash for authenticated, session_id for anonymous), one-per-query (409), 24h time window (410), proper 403 on mismatch.
- General capabilities auth flags corrected: public features (status, events, software, NSF awards, affinity groups) now `requires_auth=False`. Anonymous users see them unlocked in `/api/v1/capabilities`.
- Synthesis prompts now auth-aware: anonymous users only see public capabilities in agent responses.
- Classifier `max_completion_tokens` bumped 250→350 to prevent truncation after adding `capability_id` field.
- Turnstile: unbounded `_sessions` dict replaced with periodic eviction (5min interval, 10k cap). Expired verification now resets query counter (fresh grace period). Anonymous queries without `session_id` rejected when Turnstile enabled.
- Health endpoint `KeyError` fix (`s["name"]` → `s["server"]`).
- Removed dead `get_for_auth()` method, fixed session leak in `log_query()`.
- Added 14 new tests: capability registry (loading, auth visibility, inference) + turnstile eviction.
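The rating endpoint's anti-spoofing rules can be read as an ordered series of checks on the logged query. The sketch below is one way to express them as a pure function. The `LoggedQuery` shape, the function name, and the order of the checks are assumptions; only the status codes and the rules themselves (ownership, one rating per query, 24h window) come from the notes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

RATING_WINDOW = timedelta(hours=24)


@dataclass
class LoggedQuery:
    """Hypothetical usage-log row, reduced to the fields the checks need."""
    created_at: datetime
    user_hash: Optional[str] = None   # set for authenticated queries
    session_id: Optional[str] = None  # set for anonymous queries
    rating: Optional[str] = None


def rating_status(entry: LoggedQuery, *, user_hash: Optional[str],
                  session_id: Optional[str], now: datetime) -> int:
    """Return the HTTP status the rating endpoint would answer with."""
    # Ownership: authenticated queries match on user_hash, anonymous on session_id.
    if entry.user_hash is not None:
        owner_ok = user_hash == entry.user_hash
    else:
        owner_ok = session_id is not None and session_id == entry.session_id
    if not owner_ok:
        return 403
    if entry.rating is not None:
        return 409  # one rating per query
    if now - entry.created_at > RATING_WINDOW:
        return 410  # rating window has closed
    return 200
```

Keeping the decision in a function that takes `now` as an argument makes the 24h boundary straightforward to unit test without patching the clock.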
Added is_final_response, rating_target, capability_id, and question_id to query response metadata. rating_target is "uky_rag" when UKY was primary source, "agent" for tool/domain responses. Completes F.2 backend.
All backend API surface for the capability registry is shipped on main. Frontend work (F.3) can now begin: fetch capabilities on load, replace hardcoded buttons, contextual ratings, rating routing.
Andrew confirmed: keeping the invisible Cloudflare widget key. The silent pre-verify flow (useTurnstile hook) handles the common case. For the rare edge case where invisible verification fails (VPN + ad blocker), the user sees a "verify" message with an empty widget — Cloudflare may still succeed silently in the background. If not, the user can refresh or log in. Documented the options in CONSTERNATION.md during analysis, but the decision is to ship as-is. No code changes needed.
Rebased feature/capability-registry onto current main (includes Andrew's eval pipeline commit 4441536). Clean rebase, no conflicts. Branch now has 5 capability commits on top of 3 turnstile commits on top of main.
Andrew is taking ownership of the evaluation pipeline (Project G). The ad-hoc testing infrastructure (test runner, LLM judge, batteries, HTML reports) documented in AGENT_TESTING.md served its purpose as a prototype. Andrew's eval pipeline design (access-agent/docs/superpowers/specs/2026-03-31-eval-pipeline-design.md) formalizes and supersedes it. Existing batteries (~160 questions) carry forward as seed data for G.1.
Removed working documents CONSTERNATION.md (Turnstile edge-case analysis — decision made) and AGENT_TESTING.md (eval testing overview — superseded by Andrew's eval pipeline spec and Project G in the plan).
Reworked the Turnstile frontend from "always visible challenge" to "invisible by default, visible fallback." New useTurnstile hook renders an invisible Cloudflare Turnstile widget on mount, stores the token, and the qa-flow attaches it to every outgoing request automatically. The backend's visible challenge flow (requires_turnstile response + widget in chat) remains as fallback for suspicious users or when silent verification fails. Three free queries act as a grace period.
Key insight: Cloudflare's widget type (managed, non-interactive, invisible) is determined by the site key, not frontend code. Test keys control dev behavior:
- `1x...BB` = invisible, always passes (normal dev)
- `3x...FF` = forces interactive challenge (test fallback)
- `1x...AA` = visible, always passes (see widget auto-complete)
New turnstileSiteKey prop on QABot controls activation. access-qa-bot reads it from VITE_TURNSTILE_SITE_KEY env var. Version bumped to 0.2.30-rc.10.
All three repos committed and pushed on feature/turnstile branches.
npm was down most of the day (web auth incident). Once it recovered, published rc.10 and rc.11, tested all three scenarios end-to-end:
- Invisible happy path (`1x...BB` key): silent verification on mount, user never sees anything. Confirmed via debug console logs.
- Visible fallback (`3x...FF` key): 3 free queries, then "Please verify you're human" with interactive checkbox. Widget padding (8px 16px) and rounded corners (8px) applied. Rating buttons suppressed during challenge.
- Disabled (empty keys): login gate returns, old behavior preserved.
- Spec-only reactive flow (`1x...AA` key, no `turnstileSiteKey` prop): tested to confirm what the spec alone produces: after 3 queries, "Please verify you're human" message + widget that auto-checks itself in ~1 second. Works but is a visible interruption.
Design decision (pending Andrew's input): The turnstileSiteKey prop and useTurnstile hook go beyond the original spec. They are the only path to zero interruptions for legitimate users — without them, every anonymous user sees a brief "verify you're human" blip after 3 queries. Asked Andrew whether he prefers fully invisible (current PRs) or is fine with the brief auto-check (spec-only). Both paths work in the current code — it's a deployment config choice:
- Set `VITE_TURNSTILE_SITE_KEY` → silent pre-verify active, zero interruptions
- Don't set it → reactive-only flow from the spec (brief blip after 3 queries)
If Andrew prefers spec-only, the changes to remove are: useTurnstile hook, turnstileSiteKey prop from QABot/lib, and the getTurnstileToken plumbing in qa-flow. The reactive fallback path stays as-is.
- qa-bot-core #11: `feature/turnstile` → `main`
- access-qa-bot #4: `feature/turnstile` → `main`
- access-agent #2: `feature/turnstile` → `main`
Andrew merged uky-plus-mcp → main (14 commits). His merge included 44075e9 (usage logging fix, health endpoint enhancement, deploy workflow). Rebased feature/turnstile onto updated main across all repos — clean, no conflicts except a trivial package-lock in access-qa-bot. Cleaned up stale local branches (uky-plus-mcp, feature/dual-rag-logging).
Created feature/capability-registry branch off feature/turnstile in access-agent. 5 commits, all pushed.
Data models and registry:
- `Capability` and `Category` dataclasses in `src/agent/domains/config.py`
- Added `capabilities` field to `DomainAgentConfig`; announcements (2 capabilities) and JSM (3 capabilities) configs updated
- 8 general pipeline capabilities (Q&A, allocations, software, status, events, affinity groups, XDMoD, NSF awards)
- `CapabilityRegistry` class in new `src/agent/domains/capabilities.py`: aggregates domain + general capabilities, handles `DISABLED_CAPABILITIES` env var, provides auth-filtered queries, system prompt generation, and capability-to-query inference
- 5 categories: general, support, content, explore, analytics. 13 total capabilities.
API endpoints:
- `GET /api/v1/capabilities`: returns capabilities grouped by category; auth-required items marked `locked` for anonymous users
- `POST /api/v1/rating`: attaches rating (helpful/not_helpful) + optional feedback to an existing usage log entry by query_id
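A rough sketch of how the registry and the capabilities endpoint fit together: the registry drops anything named in `DISABLED_CAPABILITIES`, and the grouped view marks auth-required items as locked when the caller is anonymous. The field set on `Capability`, the `grouped` method, and the comma-separated format of the env var are assumptions made for illustration; the real classes in `src/agent/domains/` carry more (example queries, prompt generation, inference).

```python
import os
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Capability:
    id: str
    label: str
    category: str
    requires_auth: bool = False


class CapabilityRegistry:
    """Hypothetical aggregate of domain + general capabilities."""

    def __init__(self, capabilities: Iterable[Capability]):
        # Assumed format: comma-separated capability ids.
        raw = os.environ.get("DISABLED_CAPABILITIES", "")
        disabled = {c.strip() for c in raw.split(",") if c.strip()}
        self._caps = [c for c in capabilities if c.id not in disabled]

    def grouped(self, authenticated: bool) -> Dict[str, List[dict]]:
        """Group by category; auth-required items are locked for anonymous users."""
        out: Dict[str, List[dict]] = {}
        for cap in self._caps:
            out.setdefault(cap.category, []).append({
                "id": cap.id,
                "label": cap.label,
                "locked": cap.requires_auth and not authenticated,
            })
        return out
```

Because locking is computed per request from `requires_auth`, the same registry instance can serve both anonymous and authenticated callers.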
Classification:
- Added `capability_id` field to `QueryClassification` in state model
- Updated classify prompt with full capability list so the LLM outputs `capability_id` for every query
- Falls back to inference from domain/tools_used via `CapabilityRegistry.infer_capability_id()`
Usage logging:
- New columns: `capability_id`, `category`, `was_authenticated`, `rating`, `rating_feedback`
- Startup migration adds columns to existing `usage_logs` table (since `create_all` only creates tables)
- `log_query` now accepts `capability_id` and `category`; `was_authenticated` derived from `acting_user`
- New `log_rating` method for the rating endpoint
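The startup migration exists because `create_all` creates missing tables but does not alter tables that already exist. An idempotent "add whatever is missing" pass could look like the sketch below. It uses the standard-library `sqlite3` module purely as a stand-in: the actual database, the column types, and the function name are assumptions, and only the table and column names come from the notes.

```python
import sqlite3

# Column names come from the notes; the SQL types are assumptions.
NEW_COLUMNS = {
    "capability_id": "TEXT",
    "category": "TEXT",
    "was_authenticated": "INTEGER",
    "rating": "TEXT",
    "rating_feedback": "TEXT",
}


def add_missing_columns(conn: sqlite3.Connection,
                        table: str = "usage_logs") -> list:
    """Add any NEW_COLUMNS the table lacks; return the names that were added."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added
```

Running it a second time is a no-op, which is what makes it safe to call on every startup.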
Agent self-knowledge:
- All three synthesis prompts (tools-only, combined, RAG-only) now include a `{capabilities}` section
- Lazy-loaded from CapabilityRegistry on first synthesis call
- Agent can now answer "what can you do?" and give contextual hints about other capabilities
Remaining for F.2: Restrict /tools and /catalog endpoints to admin access (low priority). Frontend work is F.3.
Goal: Begin Project F work. Decided with Andrew to ship Turnstile first (self-contained, immediate user value) rather than landing all phases at once.
Reordered Project F delivery: Turnstile (F.1) → Capability Registry (F.2) → Dynamic UI (F.3) → Personalization (F.4). Each is a clean base for the next. Updated FEB_MARCH_PLAN.md and synced gist.
Implemented server-side Turnstile bot protection for anonymous users. Commit 96f9f64 pushed to origin/feature/turnstile. Will PR against main after PR #1 merges, then rebase.
What it does: anonymous users hit the /api/v1/query endpoint, the agent tracks per-session query counts and verification status. In deferred mode (default), 3 free queries pass before the agent returns {"requires_turnstile": true, "site_key": "..."} instead of an answer. The frontend (future work) will show the Cloudflare Turnstile widget, get a token, and resend. The agent verifies the token with Cloudflare's /siteverify endpoint and marks the session verified for a configurable TTL (default 1 hour). Authenticated users skip all of this.
Files: src/turnstile.py (new), src/config.py (5 env vars), src/api/routes.py (gate + counter), tests/test_turnstile.py (11 tests, all passing).
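The deferred-mode behavior described above (three free queries, then a challenge payload, then a verified window) can be sketched as a small per-session gate. This is an illustration under stated assumptions, not the contents of `src/turnstile.py`: the class and method names are hypothetical, and the `verify_token` callable stands in for the request to Cloudflare's `/siteverify` endpoint.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

FREE_QUERIES = 3       # deferred-mode grace period from the notes
VERIFIED_TTL = 3600.0  # default verification TTL of 1 hour


@dataclass
class SessionState:
    query_count: int = 0
    verified_until: float = 0.0


class TurnstileGate:
    """Hypothetical per-session gate; token verification is injected."""

    def __init__(self, verify_token: Callable[[str], bool], site_key: str,
                 clock: Callable[[], float] = time.time):
        self._verify = verify_token  # stands in for the /siteverify call
        self._site_key = site_key
        self._clock = clock
        self._sessions: Dict[str, SessionState] = {}

    def check(self, session_id: str, token: Optional[str] = None) -> Optional[dict]:
        """Return None to let the query through, or the challenge payload."""
        state = self._sessions.setdefault(session_id, SessionState())
        now = self._clock()
        if state.verified_until > now:
            return None  # still inside the verified window
        if token and self._verify(token):
            state.verified_until = now + VERIFIED_TTL
            return None
        if state.query_count < FREE_QUERIES:
            state.query_count += 1
            return None
        return {"requires_turnstile": True, "site_key": self._site_key}
```

Authenticated users would bypass the gate entirely before `check` is ever called. The plain dict of sessions here is unbounded; a production version needs eviction.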
Implemented the Cloudflare Turnstile widget in qa-bot-core. When the agent returns {"requires_turnstile": true, "site_key": "..."}, the chatbot shows "Please verify you're human to continue." with the Turnstile widget rendered below it as a React component, then automatically resends the original query after verification.
Key learnings from iteration (rc.1 through rc.9):
- `injectMessage()` escapes HTML, so widget containers cannot be injected that way.
- react-chatbotify's `component` property on a flow step only renders reliably on the current step, not on steps reached via `path` transition. A separate `turnstile_challenge` step with a component never rendered.
- Solution: put a conditional `TurnstileWidgetWrapper` component on `qa_loop` itself (same pattern as LoginButton on the login gate step). The wrapper reads from a mutable state object so it can check the site key at render time rather than at flow creation time.
- Must clear `turnstileState.siteKey` after token is consumed, otherwise the widget re-renders on subsequent qa_loop cycles.
- `LIB_VERSION` in logger.ts was stale at `0.2.19`; updated it to track RC numbers for dev sanity.
Files: src/components/TurnstileWidget.tsx (new), src/utils/turnstile.ts (new), src/utils/flows/qa-flow.tsx (modified).
- Agent `.env`: Cloudflare test keys (visible, always-pass) with `TURNSTILE_MODE=immediate` for fast iteration.
- Agent `docker-compose.yml`: added Turnstile env vars to the environment block (they weren't being forwarded to the container).
- Agent `src/main.py`: added `http://localhost:3000` to CORS allowed origins for local dev.
- Agent `src/api/routes.py`: set `response_model=None` on the `/query` endpoint (FastAPI rejects `Union[QueryResponse, JSONResponse]` as a response model).
- access-qa-bot `.env.local`: pointed `VITE_API_ENDPOINT` at `http://localhost:8000/api/v1/query` for local testing.
- Workflow: publish RC to npm from qa-bot-core, `npm install @snf/qa-bot-core@<rc>` in access-qa-bot, clear `node_modules/.vite`, restart dev server.
- Removed stale `DUAL_RAG_LOGGING=false` from `.env`; it was breaking all tests (key removed from Settings model during E.3 but left in `.env`).
- Pulled repos: `access-mcp` had 29 files of new changes (per-user XDMoD tokens, docs proxy fixes). All others unchanged.
- PRs #1 (access-agent) and #2 (access-qa-planning) still awaiting Andrew's review, no activity.
Goal: Fix Q15 (search_projects returning random public projects), parallelize UKY RAG and tool planning per Andrew's request, set up MCP_API_KEY for service auth, add domain agent test questions.
- `28390f9` TOOL_CAVEATS: Inject caveats directly into the tool catalog text shown to the planner LLM. `search_projects` gets a note that it's a public catalog search with no user/owner parameter. Prevents the planner from selecting it for "my projects/allocations" queries. Earlier attempt (rule 12 in RULES section) was ignored by the LLM; moving the warning next to the tool description in the catalog was more effective.
- `b58b553` Parallel RAG+plan: For combined/dynamic queries, UKY RAG and tool planning now run concurrently via `asyncio.gather` in a new `rag_and_plan_node`. The planner doesn't read `rag_matches`, so there's no data dependency. Static and domain queries keep the sequential RAG-first path (static needs RAG result to decide whether to END, domain needs RAG context before routing). New routing in `route_by_classification`: returns `"rag_answer"` for static/domain, `"rag_and_plan"` for combined/dynamic.
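The parallel path works because the two coroutines share an input (the query) but neither reads the other's output. A minimal sketch of the node and the routing function, with placeholder coroutines in place of the real UKY and planner calls; the state keys (`query`, `query_type`, `plan`) and the placeholder bodies are assumptions:

```python
import asyncio
from typing import Any, Dict


async def fetch_rag(query: str) -> list:
    """Placeholder for the UKY RAG call."""
    await asyncio.sleep(0.01)
    return [f"doc match for: {query}"]


async def plan_tools(query: str) -> list:
    """Placeholder for the planner LLM call; it does not read rag_matches."""
    await asyncio.sleep(0.01)
    return ["search_events"]


async def rag_and_plan_node(state: Dict[str, Any]) -> Dict[str, Any]:
    """Run RAG and planning concurrently; neither depends on the other."""
    rag_matches, plan = await asyncio.gather(
        fetch_rag(state["query"]), plan_tools(state["query"]))
    return {"rag_matches": rag_matches, "plan": plan}


def route_by_classification(state: Dict[str, Any]) -> str:
    """static/domain stay sequential (RAG first); combined/dynamic go parallel."""
    if state["query_type"] in ("static", "domain"):
        return "rag_answer"
    return "rag_and_plan"
```

`asyncio.gather` returns results in argument order, so the unpacking is safe regardless of which call finishes first.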
- Set `MCP_API_KEY=my-random-string` in both `access-agent/.env` and `access-mcp/.env` as the shared secret for service-to-service auth. Unlocks JSM, announcements, and events tools.
- Started `mcp-jsm` container (was defined in compose but not running).
- Fixed Argilla container consuming 23.8GB disk (stuck in restart loop): `docker compose down` on access-argilla freed the space.
v4 run (18 questions, parallel RAG+plan, MCP auth):
- 18/18 pass, zero 92-char failures
- Q15 answer clean — allocation troubleshooting guidance with correct links, no random public projects
- Q17 (JSM domain): correctly routed, `tools=['jsm']`, asks for ticket details (281 chars)
- Q18 (Announcements domain): correctly routed, `tools=['announcements']`, provides guidance (725 chars)
- JSM tools now functional with MCP_API_KEY: `create_support_ticket`, `create_login_ticket` appearing in tool results
Side-by-side UKY vs Agent (18 questions):
- Agent longer: 6/18, UKY longer: 12/18
- Avg chars nearly identical: UKY 1413, Agent 1449
- Agent adds genuine value on live data: Q14 (+3848 chars, live events via search_events), Q7 (+789 chars, hardware specs)
- Domain questions (Q17, Q18) are shorter because the agent takes action (ticket creation, announcement workflow) rather than providing generic documentation
- Report: `slim-v4-sidebyside.html` on Netlify
Explored injecting UKY RAG context into domain agent system prompts so ticket/announcement flows would include troubleshooting steps. Tested and reverted — the JSM react agent ignores the documentation context and follows its ticket-creation workflow regardless. More importantly, when a researcher says "open a ticket about my login issue on Bridges-2", they've already tried the obvious fixes. Injecting "have you checked your password?" is condescending. The upcoming capabilities work (buttons for direct actions) is a better fit for this pattern.
Discovered the quality loop (evaluate → re-plan → execute) was grinding through 3 identical retries when tools returned empty or error data. The planner has no knowledge of previous failures, so it picks the same tool each time. Added a circuit breaker in should_retry_quality: if all tool results are empty, failed, or error-shaped (including MCP success=True with {"error": "..."} body), skip straight to synthesize. UKY content is available in state regardless. Saved ~10-15s on affected questions (Q13 went from 24.5s → 9.0s).
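The circuit breaker reduces to one question: did any tool return something usable? If not, retrying with a planner that has no memory of the failure is wasted time. A sketch under assumptions: the result shape (`success` and `data` keys), the `retries`/`max_retries` parameters, and the node names returned are all hypothetical. The real `should_retry_quality` reads from graph state.

```python
from typing import Any, List, Mapping


def is_unusable(result: Mapping[str, Any]) -> bool:
    """True when a tool result is failed, empty, or an error in a success wrapper."""
    if not result.get("success", False):
        return True
    data = result.get("data")
    if not data:
        return True  # None, {}, [] or ""
    if isinstance(data, Mapping) and "error" in data:
        return True  # MCP success=True with an {"error": "..."} body
    return False


def should_retry_quality(tool_results: List[Mapping[str, Any]],
                         retries: int, max_retries: int = 3) -> str:
    """Return the next node: "replan" to retry, "synthesize" to stop."""
    # Circuit breaker: nothing usable came back, so re-planning cannot help.
    # Assumption: an empty result list is treated the same way.
    if all(is_unusable(r) for r in tool_results):
        return "synthesize"
    return "replan" if retries < max_retries else "synthesize"
```

Skipping to synthesis is safe here because, as noted above, the UKY content is already in state whether or not the tools produced anything.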
- `slim-v4-comparison.html`: standalone 18-question battery (pre-parallel)
- `slim-v4-sidebyside.html`: final UKY vs Agent side-by-side (with parallel RAG+plan + circuit breaker)
- All deployed to `access-ci-reports.netlify.app`
Goal: Implement the top graph fixes identified in the E.3 v2 review, build a focused 16-question regression battery, and validate improvements.
- `b6f29e1` cherry-pick: node_trace observability
- `093daeb` cherry-pick: gate node_trace behind `?include_trace`
- `8dfa28d` Always consult UKY: every query hits rag_answer first, classifier no longer gates UKY access. domain_agent falls back to UKY when tools unavailable.
- `cf3a033` Tighten JSM classification: domain=jsm only on explicit "open/file a ticket" language, not problem descriptions.
- `d09d24c` Smarter hedge detection: keep hedged answers with substance (>500 chars, has URLs/emails). Only reject true deflections.
- `0914e00` Synthesizer prompts: preserve UKY links/contacts/specifics, strip hedge preamble, don't inject LLM training data.
- `dec9e43` Rename `_rag_answer_is_weak` → `_rag_answer_is_deflection`
- `d381ea9` Widen combined classification: hardware specs, software versions, resource comparisons now `combined` instead of `static` to preserve MCP enrichment path.
- `e039e02` F15 fix: when plan says "no tools needed" but UKY has content, use RAG-only synthesis instead of pure LLM generation.
- `6edb670` Direct UKY serve: when tools add no value (not needed, failed, or absent), serve raw UKY answer directly with hedge preamble stripped. No LLM rewrite. LLM synthesis only on `combined` path where tools actually contributed. Hardened URL preservation in all synthesis prompts.
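The "smarter hedge detection" change separates answers that merely open with a hedge from answers that are pure deflection. One possible reading, sketched below: the opener phrases are hypothetical (the real list is not in these notes), and treating length and URL/email presence as alternatives (OR) rather than joint requirements is an assumption about how the ">500 chars, has URLs/emails" rule is applied.

```python
import re

# Hypothetical deflection openers; the real phrase list is not in the notes.
HEDGE_OPENERS = ("i don't have", "i'm not sure", "i could not find")
URL_OR_EMAIL = re.compile(r"https?://\S+|[\w.+-]+@[\w-]+\.[\w.-]+")
MIN_SUBSTANCE_CHARS = 500


def is_hedged(answer: str) -> bool:
    """True when the answer opens with one of the hedge phrases."""
    return answer.strip().lower().startswith(HEDGE_OPENERS)


def rag_answer_is_deflection(answer: str) -> bool:
    """Reject only hedged answers that carry no substance."""
    if not is_hedged(answer):
        return False
    # Assumption: either enough length or a concrete pointer counts as substance.
    has_substance = (len(answer) > MIN_SUBSTANCE_CHARS
                     or URL_OR_EMAIL.search(answer) is not None)
    return not has_substance
```

Under this reading, a short hedge that still points the user at a URL or a contact address is kept, which is the case the earlier detector was wrongly rejecting.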
Also: disabled pgvector fallback (stubs remain), added ACCESS_AI_API_KEY to docker-compose.yml, updated SYSTEM_OVERVIEW.md diagram, updated agent-decision-flow.html to reflect new graph.
16 questions, 3 iterations (v1→v2→v3). v3 is the final run with all 10 commits.
v3 results (vs UKY baseline):
- Zero 92-char failures (was 7 before fixes)
- 8 of 16 questions now longer than UKY baseline
- Direct serve strategy eliminates synthesis nerf: Q3 went 1,197→2,248, Q4 went 365→1,535
- Combined synthesis works when MCP adds real data: Q13 (+423 system status), Q15 (+249)
- Q11 (+1,578): MCP enrichment with current GPU specs (the "dream scenario")
- Remaining shorter answers are mostly UKY variance (different answer each run)
Synthesis strategies in v3:
- `UKY DIRECT` (no synth / no tools / tools failed): 10 questions. Raw UKY answer, hedge stripped, no LLM rewrite.
- `COMBINED (UKY + MCP)`: 6 questions. LLM merges UKY knowledge with MCP tool data.
Remaining issues:
- UKY variance — same question produces 274–946 chars across runs
- Auth gap — user-specific MCP tools need MCP_API_KEY service token (ask Andrew)
- Combined synthesis still condenses slightly (but only when tools actually contributed)
- `slim-v1-comparison.html`: first run with fixes 1-8
- `slim-v2-comparison.html`: after F15 fix, three-column comparison
- `slim-v3-comparison.html`: final, all fixes, no scenario tags, strategy badges
- All deployed to `access-ci-reports.netlify.app`
- `SLIM_BATTERY.md`: scenario descriptions for the 16-question battery
- `SLIM_BATTERY_QUESTIONS.md`: the actual questions file for the test runner
- `REVIEW_CHECKLIST.md`: full findings (F1-F15, W1) with per-question notes
- `agent-decision-flow.html`: updated graph diagram with state reads/writes
Goal: Add node traces to the agent, re-run both batteries, build traced comparison reports, and do a question-by-question review to identify systemic issues.
Branch: uky-plus-mcp (off main in access-agent).
- Cherry-picked 04342c8 (node_trace) + b7a9bec (trace gating) from feature/dual-rag-logging
- Disabled pgvector Q&A pair fallback; stubs remain in _search_pgvector() and qa_client.py for future use
- Added weak_answer (hedge detection) field to rag_answer trace
- Added classification_info and node_trace to API response metadata
- Fixed missing ACCESS_AI_API_KEY in docker-compose.yml; UKY was silently failing without it
- Updated SYSTEM_OVERVIEW.md mermaid diagram (removed incorrect pgvector→synthesize edge)
- Updated agent-decision-flow.html with state reads/writes per node
Both batteries 50/50 success with full traces. Reports at ~/.agent/diagrams/e3-v2-*-comparison.html, deployed to Netlify.
Most critical:
- All 7 fallback-gap 92-char failures are JSM misroutes — classifier interprets every problem description as "file a ticket"
- Hedge detection rejects good UKY answers that have useful content after the preamble — 20/50 friendly, 15/50 real-user
- Synthesize dilutes authoritative UKY content and can inject incorrect LLM training data
- Dynamic classification skips UKY on documentation questions when user describes their situation
- Combined classification forces resynthesis even when UKY answers confidently
One proven win: Real-user Q15 (Bridges-2 GPUs) — MCP get_resource_hardware corrected UKY's outdated docs (only knew V100s, MCP added H100 and L40S). Only question across both batteries where the system was unambiguously better than UKY alone.
- REVIEW_CHECKLIST.md: full findings and per-question notes
- SLIM_BATTERY.md: 16-question focused battery covering 8 scenarios
- a3_results/e3-v2-friendly.json, a3_results/e3-v2-realuser.json (in access-agent subdir)
Plan and implement graph fixes on uky-plus-mcp: always consult UKY, tighten JSM classification, smarter hedge handling, synthesizer prompt improvements. Run slim battery to validate.
Switched access-agent to main (pulled 8 new commits from origin), set UKY_RAG_ENABLED=true, rebuilt Docker container. Ran both batteries against the full system (UKY RAG + MCP tools + LangGraph orchestrator).
Friendly battery (50 well-phrased queries):
| | UKY alone | UKY+MCP (main) |
|---|---|---|
| Answered | 50/50 | 50/50 |
| Avg response length | 1,380 chars | 886 chars |
| Avg latency | 3.9s | 11.0s |
| MCP tool usage | — | 21/50 (42%) |
| Fallback responses | 0 | 2 (cold-start artifact) |
Real-user battery (50 messy queries with typos/vague phrasing):
| | UKY alone | UKY+MCP (main) |
|---|---|---|
| Answered | 48/50 | 50/50 |
| Avg response length | 1,159 chars | 752 chars |
| Avg latency | 4.8s | 10.8s |
| MCP tool usage | — | 19/50 (38%) |
| Fallback responses | 0 | 7 (regression) |
| pgvector RAG hits | — | 0 |
- MCP adds real value: Live software lists, resource specs, system status, events. This is data UKY cannot provide. ~40% of questions benefit.
- Critical fallback gap: 7 real-user queries return "tools needed for this task are currently unavailable" (92 chars) where UKY gives substantive answers (687–1,859 chars). Pattern: account/password/troubleshooting questions routed to dynamic path but no MCP tool matches. Agent should fall back to UKY RAG instead of giving up.
- Zero pgvector RAG hits on real-user battery: The 0.85 similarity threshold is too high for messy input. Cherry-pick candidate 08809ad (lower thresholds) would help.
- Friendly battery cold-start: Q1/Q2 got generic fallback due to UKY timeout on first request (588s). Transient issue.
- Friendly: ~/.agent/diagrams/e3-friendly-comparison.html
- Real-user: ~/.agent/diagrams/e3-realuser-comparison.html
- Published copies: published-reports/e3-friendly-comparison.html, published-reports/e3-realuser-comparison.html
- Raw data: a3_results/e3-friendly-main.json, a3_results/e3-realuser-main.json
- Fix the fallback gap: when classifier routes to dynamic path but no tool matches, fall back to UKY RAG
- Consider cherry-picks: 08809ad (lower thresholds) and ef43a21 (top-5 RAG) most likely to help
- Verify whether JSM bug (e629cb6) exists on main before cherry-picking
- Re-run real-user battery after fix to confirm regression is resolved
Sent reports. Response suggested he sees the architecture as sound but may not have focused on the 7 regressions. Key quote: "Document RAG and MCP will probably do pretty well... supplementing with curated Q&A pairs for specific commonly asked questions would be a good supplement." He's directionally right but the fallback gap needs fixing before the system is production-ready.
Ran 50 real-user queries (REAL_USER_BATTERY.md) against two targets. UKY alone ran clean first. NEWSYSTEM run was contaminated mid-run by an internet outage (q18 took 926s; q19–29 returned a canned 130-char fallback). Re-ran NEWSYSTEM cleanly.
Results — UKY alone vs. NEWSYSTEM (pgvector + MCP, UKY_RAG_ENABLED=false):
| | UKY | NEWSYSTEM |
|---|---|---|
| Answered | 48/50 | 50/50 |
| Avg response length | 1,159 chars | 694 chars |
| Avg latency | 4.8s | 12.4s |
| RAG hits | — | 9 (18%) |
| MCP tool calls | — | 12 (24%) |
| LLM only | — | 29 (58%) |
Report: ~/.agent/diagrams/e2-comparison.html
Key findings:
- 6 "tools unavailable" deflections: q4, q13, q17, q22, q32, q50. All account/support-type questions. Classifier routed to JSM domain; JSM was unavailable; agent returned a canned 92-char error instead of falling back. The E.1 fix (e629cb6) addresses this but it was not yet deployed in this run.
- MCP tools returning empty for factual resource questions: q38 (DARWIN storage), q39 (SLURM resources), q43 (TAMU storage), q37 (GPU hours conversion). Tools were called and ran for 20–36s but returned no results. UKY answered these correctly from documents. Data gaps in the MCP layer.
- 58% LLM-only: same root cause as friendly battery. The 0.70 threshold is too strict for messy real-user input, and classifier over-routes problem-sounding questions to dynamic paths.
- UKY answers are consistently longer and more detailed for how-to questions. NEWSYSTEM's thin-answer gap is real.
- q25 (XDMoD research profile): agent referenced a nonexistent tool integrate_nsf_xdmod. UKY also 500'd on this.
Andrew raised: "The access-agent currently works with UKY, so maybe no changes needed there?" Realized our E.2 NEWSYSTEM run was testing pgvector-only (UKY disabled in .env), not UKY+MCP. The feature branch already supports UKY as primary with pgvector fallback — we just had UKY_RAG_ENABLED=false.
Revised E.2 goal: The real comparison is UKY alone vs. UKY + MCP tools (i.e., main). Andrew confirmed UKY can stay. pgvector remains as a slam-dunk fallback, not a UKY replacement.
E.2 UKY-alone data is reusable (a3_results/e2-uky.json). E.3 just needs a NEWSYSTEM run with UKY_RAG_ENABLED=true.
feature/dual-rag-logging has 7 commits ahead of main. Verdict:
Merge to main:
- e629cb6: JSM graceful recovery (production fix, the most important one)
- 04342c8 + b7a9bec: node_trace observability + gating behind ?include_trace
- ef43a21: RAG top-5 + better synthesis prompt
- 08809ad: lower similarity thresholds (0.85→0.70)
Leave behind (spike-only):
- caf7256: dual-RAG comparison logging infrastructure
- de26e37: pgvector→synthesis routing (fine but bundled with comparison logger)
Two failure modes identified: JSM server unavailable (no tools loaded) and classifier over-routing complaint-framed questions to JSM domain.
Fixes (access-agent):
- domain_agent.py: no-tools case now returns final_answer=None and clears domain rather than a canned error string
- graph.py: replaced hard domain_agent → END edge with a conditional. final_answer=None falls through to rag_answer, otherwise END
- classify.py: tightened JSM routing to require explicit ticket-filing language; frustration/complaint framing now routes to RAG
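The domain_agent and graph.py changes amount to a router function of the kind LangGraph accepts for conditional edges. A minimal sketch, assuming state is a plain dict with final_answer and domain keys; the function names and the local END stand-in are illustrative and are not taken from the repo:

```python
from typing import Optional, TypedDict

# Local stand-in for the graph's END sentinel so the sketch runs without LangGraph.
END = "__end__"


class AgentState(TypedDict, total=False):
    final_answer: Optional[str]
    domain: Optional[str]


def domain_agent_no_tools(state: AgentState) -> dict:
    """No-tools case: clear the answer and domain instead of emitting a canned error."""
    return {**state, "final_answer": None, "domain": None}


def route_after_domain_agent(state: AgentState) -> str:
    """Fall through to rag_answer when the domain agent produced no answer."""
    if state.get("final_answer") is None:
        return "rag_answer"
    return END
```

In a LangGraph build, a router like this would be registered with add_conditional_edges on the domain_agent node in place of the hard edge to END.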
144 tests passing, 1 pre-existing failure in dual-RAG logging test unrelated to these changes.
Confirmed two failure modes: JSM server unavailable (no tools loaded) and classifier over-routing complaint-framed questions to the JSM domain.
Fixes (access-agent, commit e629cb6):
- domain_agent.py: no-tools case now returns final_answer=None and clears domain rather than emitting a canned error string
- graph.py: replaced hard domain_agent → END edge with conditional. final_answer=None falls through to rag_answer
- classify.py: tightened JSM routing to require explicit ticket-filing language; frustration/complaint framing now routes to RAG
Also confirmed the hedge-detection fallback in the graph is working correctly: UKY hedge phrases trigger MCP tool lookup, so static content that lives in MCP (Ranch specs, software lists) is found on the second attempt. Classifier dynamic definition is sound — no misrouting observed.
144 tests passing. 1 pre-existing failure in dual-RAG logging test, unrelated.
Reviewed bake-off results end-to-end. QAP matching has a fundamental surface area limitation: a Q&A pair question is ~20 words. A document chunk is ~2000 words. Real user queries are messy (vague, jargon, error pastes, complaint framing) and score 0.50–0.65 against clean QAP questions — below any useful threshold. Lowering the threshold below 0.55 pulls in wrong-topic matches. Returning more QAPs and synthesizing across them only helps union-type queries ("which resources have GPUs?"); it doesn't help when zero QAPs match above threshold.
Even on clean, well-phrased questions (friendly battery), NEWSYSTEM only won 12 of 50 against UKY. 48% RAG hit rate means half the questions got no retrieval at all. The narrow matching surface is a limitation even in the best case.
The 12 friendly battery wins break down as: 5 QAP-RAG hits, 4 MCP tool calls, 3 LLM-only. The QAP-RAG wins were genuinely better than UKY — judge cited more precise commands, correct module names, properly formatted citations, less cross-contamination between resources. All 5 winning QAPs were document-sourced (source documents exist in UKY's corpus too). The QAPs won on precision/scoping, not unique coverage. This is a chunking quality issue, not a QAP-vs-docs issue — per-resource chunking should achieve the same precision.
Option A: Three-tier retrieval (QAP → doc-chunk → MCP)
- Keep QAPs at high threshold for slam-dunk matches
- Add document-chunk RAG as fallback when QAP matching is weak
- MCP tools for live/dynamic data
- Pro: preserves QAP precision. Con: two extraction pipelines, more routing complexity
Option B: Unified document store (proposed)
- Convert MCP entity data into documents (resource descriptions, software lists as prose) and chunk alongside existing PDFs/guides
- One pgvector store, one retrieval path
- MCP tools still handle truly dynamic data (status, allocations, user-specific)
- MCP→document extraction replaces MCP→QAP extraction — same pipeline, simpler output. Transcribing structured data into prose, not summarizing (not a game of telephone)
- Cross-entity queries work naturally via vector search instead of requiring Plan → Execute
- Pro: dramatically simpler, solves surface area problem universally. Con: lose pre-written curated answers; LLM synthesizes from chunks
Sent analysis to Andrew for input.
After review with Andrew, the path forward is simpler than either proposed option:
- UKY handles document RAG — it already does this well, no need to rebuild
- MCP tools fill the live data gaps — system status, allocations, user-specific queries, anything not in the document corpus
- QAPs stay in pgvector, no new generation — keep the existing pairs as a high-confidence slam-dunk path; if they match, great. Don't invest in generating more until they prove useful in production
- Fix Atlassian error handling — agent is not recovering gracefully from failed JSM calls
- Double LLM synthesis — UKY synthesizes, then access-agent synthesizes. Real latency/cost issue worth addressing eventually, not blocking now
Andrew also clarified that UKY already has at least the intention of pulling dynamic data (events etc.) from APIs — so the "MCP fills gaps UKY can't" argument is narrower than assumed. The bottleneck is LLM synthesis, not retrieval transport.
The QAP extraction work (Projects C, A.3) was not wasted — it produced clear quantitative evidence (18% real-user hit rate, 12/50 on friendly battery) that validated the approach's limits and identified exactly when QAPs do and don't work. That's a defensible architectural decision.
| Content type | Best retrieval | Rationale |
|---|---|---|
| Documents (guides, how-tos, policies) | Document-chunk RAG | Rich surface area handles messy queries; QAPs distill away the very text users search for |
| MCP entity data (resource specs, software) | QAP-as-cache OR MCP→document extraction | No source documents to chunk; QAPs are one option, synthesized docs are the other |
| Live/dynamic data (status, allocations) | MCP tools directly | Changes constantly, can't be cached in any RAG store |
Goal: Get Q&A pairs from Argilla into access-qa-service (pgvector) so they're searchable via semantic search.
Discovery: access-qa-service already had a /admin/sync endpoint and argilla_sync.py — but the code was scaffolded with placeholder logic that didn't match the actual Argilla v2 API or the record schema created by access-qa-extraction.
What was wrong:
- Used deprecated Argilla v1 API (rg.init() / rg.load())
- Guessed at record field access (record.inputs, record.question); Argilla v2 uses record.fields["question"]
- Looked for entity_id in metadata (doesn't exist); it needs to come from <<SRC:...>> citation markers in the answer text
- Default dataset name was "access-qa" but extraction creates "qa-review"
- argilla Python SDK wasn't in the dependencies
What we fixed (commit 5b57ae0 on access-qa-service/main):
- Rewrote sync_from_argilla() for Argilla v2 client API
- Correct field access via record.fields
- Domain/entity_id extracted from citation markers, with source_ref parsing as fallback
- Added _get_edited_values() to prefer reviewer edits (future-proofing)
- Judge scores (faithfulness, relevance, completeness, confidence) carried through to pgvector metadata
- Added argilla>=2.0.0 as a proper dependency
- Added Argilla env vars to docker-compose.yml for local dev
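Pulling references out of <<SRC:...>> markers might look like the sketch below. The log does not show what goes inside a marker, so the code returns the raw payload and does not try to split it into domain and entity_id. The helper name and the example payload are hypothetical.

```python
import re

# Non-greedy capture of whatever sits between "<<SRC:" and ">>".
SRC_MARKER = re.compile(r"<<SRC:(.*?)>>")


def extract_source_refs(answer: str) -> list[str]:
    """Return the raw payload of every citation marker, in order of appearance."""
    return SRC_MARKER.findall(answer)


# Hypothetical answer text; "example-ref" is a made-up payload.
sample = "ACES is a composable system. <<SRC:example-ref>>"
refs = extract_source_refs(sample)
```

An answer with no markers yields an empty list, which is where a fallback such as source_ref parsing would take over.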
Test result:
POST /admin/sync → {"synced": 83, "skipped": 0, "citations_loaded": 12, "errors": []}
POST /search {"query": "What is ACES designed for?"} → similarity_score: 1.0, correct answer with citation
83 records across 5 domains (compute-resources, software-discovery, affinity-groups, allocations, nsf-awards) synced and searchable.
Also documented: Andrew's feature/access-agent-integration branch on qa-bot-core — what it changes (Netlify proxy, request body format, response contract) and why it matters for Projects A and B. Added to FEB_MARCH_PLAN.md and synced to the gist.
Goal: Modify rag_answer node to query both UKY document RAG and pgvector Q&A-pair RAG for every question, logging side-by-side results for A.3 evaluation.
Approach: Parallel queries via asyncio.gather, gated behind DUAL_RAG_LOGGING env var. When the flag is off, behavior is identical to before.
What was built (commit caf7256 on access-agent/feature/dual-rag-logging):
- src/config.py: Added DUAL_RAG_LOGGING: bool = False setting
- src/rag_comparison_logger.py (new): SQLAlchemy model + singleton logger for rag_comparison_logs table. Follows same pattern as usage_logger.py. Table auto-creates on first use.
- src/agent/nodes/rag_answer.py: Added:
  - _query_uky_raw() / _query_pgvector_raw(): lightweight async helpers that return raw results without span side-effects
  - _dual_rag_answer(): runs both queries concurrently, applies same UKY-primary/pgvector-fallback priority, logs comparison to PostgreSQL
  - Gate in rag_answer_node: settings.DUAL_RAG_LOGGING and rag_endpoint → dual path; else unchanged
- tests/test_rag_answer.py (new): 19 tests covering citation processing, raw query helpers, dual-RAG logic (UKY served, pgvector fallback, both fail, combined query, below threshold, logger failure resilience), flag gating
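The concurrent-query shape described above can be sketched with asyncio.gather. The two query coroutines below are stand-ins that return canned data, and the served_by labels are illustrative; the real helpers call UKY and the qa-service and also write a comparison row, which is omitted here.

```python
import asyncio
from typing import Optional


async def query_uky(query: str) -> Optional[str]:
    """Stand-in for the UKY call; returns an answer or None."""
    await asyncio.sleep(0.01)
    return f"UKY answer for: {query}"


async def query_pgvector(query: str) -> list[dict]:
    """Stand-in for the pgvector call; returns scored matches."""
    await asyncio.sleep(0.01)
    return [{"answer": "QAP answer", "score": 0.84}]


async def dual_rag_answer(query: str) -> dict:
    # return_exceptions=True keeps one backend's failure from losing the other's result.
    uky, pg = await asyncio.gather(
        query_uky(query), query_pgvector(query), return_exceptions=True
    )
    uky_answer = None if isinstance(uky, BaseException) else uky
    pg_matches = [] if isinstance(pg, BaseException) else pg

    # UKY-primary, pgvector-fallback priority.
    if uky_answer:
        served_by, answer = "uky", uky_answer
    elif pg_matches:
        served_by, answer = "pgvector", pg_matches[0]["answer"]
    else:
        served_by, answer = "none", None

    return {
        "served_by": served_by,
        "answer": answer,
        "pgvector_match_count": len(pg_matches),
    }


result = asyncio.run(dual_rag_answer("What is Delta?"))
```

Both backends are always queried, so the comparison data exists even when UKY ends up serving the answer.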
Comparison log table schema (rag_comparison_logs):
- Query context: session_id, question_id, query_text, expanded_query, query_type, rag_endpoint
- UKY result: uky_response, uky_duration_ms, uky_error
- pgvector result: pgvector_matches (JSONB), pgvector_best_score, pgvector_match_count, pgvector_duration_ms, pgvector_error
- Outcome: served_by, served_answer_length
Test result: 94 passed (all existing + 19 new), 0 failures.
What's unchanged: state.py, graph.py, routes.py — the graph contract is untouched. The comparison log is a side-effect inside the rag_answer node.
Next (A.3): Deploy the feature/dual-rag-logging branch with DUAL_RAG_LOGGING=true, ask questions via qa-bot-core or direct API, then query rag_comparison_logs to evaluate UKY vs pgvector.
Decision: Run A.3 locally in Docker, bypass qa-bot-core, use direct curl requests.
Docker setup (two separate compose projects):
- access-qa-service/docker-compose.yml → qa-service (port 8001) + PostgreSQL (port 5433) + Redis (port 6380)
- access-agent/docker-compose.yml → agent (port 8000) + PostgreSQL (port 5432) + Redis
- access-agent reaches access-qa-service via host.docker.internal:8001 (macOS Docker)
- UKY endpoint is remote; uses same API key as qa-bot-core (ACCESS_AI_API_KEY)
What we did to get access-agent running:
- Created access-agent/.env from discovered keys: OPENAI_API_KEY (from access-qa-extraction/.env), ACCESS_AI_API_KEY (same key as QA_MODEL_API_KEY in access-serverless-api/.env and REACT_APP_API_KEY in qa-bot-core/.env.local), plus DUAL_RAG_LOGGING=true, QA_SERVICE_URL=http://host.docker.internal:8001, OTEL_ENABLED=false
- Modified access-agent/docker-compose.yml: added env_file: .env to the agent service (previously all env vars had to be listed explicitly), removed external mcp-network dependency (MCP servers aren't needed for A.3)
- Built and started: docker compose up --build -d. All containers healthy
Smoke test (successful):
curl -X POST http://localhost:8000/api/v1/query \
-H "Content-Type: application/json" \
-d '{"query": "What is Delta?", "session_id": "test-a3-smoke", "question_id": "smoke-1"}'
→ Got a full UKY-sourced response about Delta (NCSA HPC resource), 6s latency, tools_used: ["uky_rag_retrieval"]. Agent is live and hitting UKY successfully.
Note: The API field is query (not question). The MCP server warnings in the agent logs are expected and harmless — those servers aren't on this Docker network and aren't needed for A.3.
Current container status (all running):
| Service | Port | Notes |
|---|---|---|
| access-agent | 8000 | feature/dual-rag-logging branch, DUAL_RAG_LOGGING=true |
| access-agent postgres | 5432 | checkpointing + comparison logs |
| access-qa-service | 8001 | 83 Q&A pairs loaded |
| qa-service postgres | 5433 | pgvector embeddings |
| access-argilla | 6900 | Q&A pair review UI |
Goal: Verify Docker environment still works and start A.3 evaluation.
Discovery: pgvector is returning zero matches for reasonable queries like "What is ACES?" — even though we have 20 compute-resources Q&A pairs including several about ACES.
Root cause: The similarity threshold is too aggressive. There are two thresholds stacked:
- qa-service default (access-qa-service/src/access_qa_service/config.py:26): rag_similarity_threshold = 0.85
- access-agent per-query-type thresholds (access-agent/src/config.py:69-71):
  - RAG_THRESHOLD_STATIC = 0.85 (static queries)
  - RAG_THRESHOLD_COMBINED = 0.75 (combined queries)
  - RAG_THRESHOLD_FALLBACK = 0.65 (fallback)
The agent's _query_pgvector_raw() passes the threshold to the qa-service, which uses it to filter results. For static queries (the most common type), both sides enforce 0.85.
The problem: "What is ACES?" scores 0.84 against the best match ("What is ACES designed for?") — just below the 0.85 cutoff. With threshold 0.3, the same query returns 3 solid matches (0.84, 0.82, 0.76). Short or naturally-phrased questions routinely fall just under 0.85 even when the topic matches perfectly.
Evidence:
curl /search {"query": "What is ACES?", "threshold": 0.85} → 0 matches
curl /search {"query": "What is ACES?", "threshold": 0.3} → 3 matches (0.84, 0.82, 0.76)
curl /search {"query": "What is ACES designed for?"} → 1 match (1.0, exact)
The rag_comparison_logs table confirmed this — both smoke test queries ("What is Delta?", "What is ACES?") show pgvector_match_count: 0 and served_by: uky_general.
What needs to happen before running A.3:
- Lower the threshold so pgvector actually returns matches for natural queries
- Options: (a) lower RAG_THRESHOLD_STATIC from 0.85 to ~0.70 in access-agent config, (b) use a comparison-specific override in the dual-RAG path so production defaults aren't touched, or (c) lower the qa-service default
- Rebuild the access-agent container after the change
Also this session: Created SYSTEM_OVERVIEW.md with sequence diagrams of the three main flows (query answering, knowledge base building, per-entity extraction detail). Updated the agent graph illustration in FEB_MARCH_PLAN.md from mermaid to an emoji-annotated state transition table. Synced plan gist.
Change: Lowered all RAG similarity thresholds in access-agent/src/config.py (commit 08809ad on feature/dual-rag-logging):
- RAG_THRESHOLD_STATIC: 0.85 → 0.70
- RAG_THRESHOLD_COMBINED: 0.75 → 0.60
- RAG_THRESHOLD_FALLBACK: 0.65 → 0.50
- RAG_SIMILARITY_THRESHOLD (legacy): 0.85 → 0.70
Why: Best matches for natural queries scored ~0.84, just below the 0.85 cutoff. This was the A.3 blocker — pgvector returned 0 matches for every query.
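The effect of the change on the "What is ACES?" example can be reproduced with the scores already recorded in this log (0.84, 0.82, 0.76). The dictionaries and helper below are a sketch, not the config module itself, and whether the real filter uses >= or > is an assumption that does not matter for these values.

```python
# Per-query-type thresholds before and after commit 08809ad.
OLD_THRESHOLDS = {"static": 0.85, "combined": 0.75, "fallback": 0.65}
NEW_THRESHOLDS = {"static": 0.70, "combined": 0.60, "fallback": 0.50}

# Scores reported for "What is ACES?" when queried with threshold 0.3.
ACES_SCORES = [0.84, 0.82, 0.76]


def matches_above(scores: list[float], threshold: float) -> list[float]:
    """Keep only scores at or above the threshold."""
    return [s for s in scores if s >= threshold]


before = matches_above(ACES_SCORES, OLD_THRESHOLDS["static"])
after = matches_above(ACES_SCORES, NEW_THRESHOLDS["static"])
```

With the old static threshold nothing survives the filter; with the new one all three matches do.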
Still needed: Rebuild the access-agent Docker container (docker compose up --build -d) and verify the fix with a smoke test before proceeding with A.3.
Rebuilt container: docker compose up --build -d picked up the threshold fix. All containers healthy.
Threshold fix verified: "What is ACES?" now returns pgvector_match_count: 3, pgvector_best_score: 0.84. Before the fix this was 0 matches. UKY still served (as designed), but pgvector results are now logging.
Pushed branches: access-agent/feature/dual-rag-logging pushed to GitHub (3 commits: A.2 dual-RAG logging, threshold fix). access-qa-service/main push failed — Joe doesn't have write access to necyberteam/access-qa-service (need Andrew to grant).
QAP coverage (83 pairs across 11 entities in 5 domains):
| Domain | Entity | Pairs |
|---|---|---|
| compute-resources | ACES (TAMU) | 10 |
| compute-resources | Ranch (TACC) | 10 |
| software-discovery | ABINIT | 10 |
| software-discovery | Abaqus | 8 |
| allocations | Grassland bird habitat (#72204) | 9 |
| allocations | RL benchmark (#72205) | 10 |
| nsf-awards | Pollinator conservation AI (#2529183) | 10 |
| nsf-awards | Great Salt Lake dust (#2449122) | 8 |
| affinity-groups | Neocortex (PSC) | 5 |
| affinity-groups | REPACSS (TTU) | 3 |
Test questions written: 40 questions in A3_TEST_QUESTIONS.md, organized in 3 groups:
- pgvector-targeted (24): Questions about entities we have QAPs for
- UKY-targeted (8): General ACCESS questions our 83 pairs probably don't cover
- Edge cases (8): Vague, misspelled, or cross-domain questions
Next: Review the test questions, then fire them all through the agent and pull the comparison logs.
Run 2 executed: Fired all 41 test questions through the agent with DUAL_RAG_LOGGING=true. All 41 succeeded, 40 logged (q41 classified as dynamic/xdmod). Results exported to a3_results/run2.json.
Run 2 results (high-level): UKY answered 36/40, pgvector had matches for 30/40, served by UKY 36, served by pgvector 4.
Built interactive HTML comparison: ~/.agent/diagrams/a3-run2-comparison.html — expandable rows with side-by-side answers, KPI summary, sidebar nav, analysis section.
Synthesis routing fix: pgvector static matches were previously returned as final_answer (raw Q&A pair text). Changed rag_answer.py to set rag_matches + rag_used instead, and added "synthesize" as a third routing option from route_after_rag in graph.py. This routes pgvector results through the LLM synthesis pipeline.
Unfair comparison discovered: Run 2's comparison was apples-to-oranges. UKY answers arrive already LLM-synthesized (UKY's own LLM produces polished prose). pgvector answers in the comparison log were raw Q&A pair text — just the verbatim answer field from the curated pair. This made pgvector look worse than it actually is, since the difference was partly in presentation quality, not underlying knowledge.
Goal: Make the comparison fair by synthesizing pgvector answers through our own LLM before logging them.
What was changed:
- rag_comparison_logger.py: Added pgvector_synthesized_answer = Column(Text) to the model and log_comparison() method
- rag_answer.py: Imported _format_rag_matches and _synthesize_with_rag_only from synthesize.py. In _dual_rag_answer(), after getting pgvector matches, calls synthesis to produce an LLM-polished answer before logging. This is what the user would actually see if pgvector served the answer.
- pyproject.toml: Pinned opentelemetry-instrumentation-langchain<0.53 (newer version had a breaking import for GenAICustomOperationName)
- Database: ALTER TABLE rag_comparison_logs ADD COLUMN pgvector_synthesized_answer text;
- Test runner: Created a3_results/run_a3_test.py to fire all 41 questions programmatically
Run 3 results (41/41 succeeded, all logged):
| Metric | Value |
|---|---|
| UKY answered | 38/41 (93%) |
| pgvector answered (synthesized) | 27/41 (66%) |
| Both answered | 24 (direct comparison possible) |
| UKY only | 14 |
| pgvector only | 3 |
| Avg pgvector similarity score | 0.84 |
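The rows of the Run 3 table reconcile with each other, which can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are the ones in the table.

```python
total = 41
both, uky_only, pgvector_only = 24, 14, 3

uky_answered = both + uky_only            # questions UKY answered
pgvector_answered = both + pgvector_only  # questions pgvector answered
neither = total - (both + uky_only + pgvector_only)

uky_pct = round(uky_answered / total * 100)
pgvector_pct = round(pgvector_answered / total * 100)
```

The sums give 38 and 27, matching the first two rows, and neither comes out to 0, so every question was answered by at least one backend.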
Fair comparison conclusions (from HTML analysis at ~/.agent/diagrams/a3-run3-comparison.html):
- The two backends are complementary, not competitive. pgvector gives precise, curated answers for entities we've built Q&A pairs for. UKY covers the long tail of general ACCESS knowledge.
- pgvector excels on its own domain: Of 25 pgvector-targeted questions (Q1-Q25), pgvector produced synthesized answers for 24 (96%). These are entities with curated Q&A pairs.
- UKY handles breadth that pgvector cannot: For 8 UKY-targeted questions (Q26-Q33) about general ACCESS topics (allocations process, Globus, password reset), pgvector answered 0. Our 83 curated pairs simply don't cover these.
- UKY produces longer answers (~157% longer on average when both answer the same question). This may reflect UKY's larger document corpus or that our synthesis prompt is more concise. Length alone doesn't indicate quality.
- pgvector retrieval is dramatically faster (~5 ms vs ~2500 ms for UKY), though pgvector now also needs LLM synthesis time (not logged separately).
- The quality gap is narrower than Run 2 suggested. With LLM synthesis, pgvector answers read as polished, cited responses. The Run 2 comparison was unfairly penalizing pgvector by showing raw text.
- Production recommendation: Use both backends, with pgvector for high-confidence domain matches and UKY for everything else. This is already the architecture (_dual_rag_answer uses UKY-primary, pgvector-fallback).
Files produced:
- a3_results/run3.json: Full export of 41 comparison log entries
- ~/.agent/diagrams/a3-run3-comparison.html: Interactive comparison with analysis
- a3_results/run_a3_test.py: Test runner script
Realization: The A.3 analysis drifted toward "complementary backends" and fallback architecture. But that wasn't the original question. From FEB_MARCH_PLAN.md:
"proving this approach outperforms document RAG" (line 34) "We need data on how these two approaches compare before making further investment decisions" (line 65) "A first because it validates the approach before investing in B" (line 259)
A.3 was a bake-off to decide whether Q&A-pair RAG can replace UKY document RAG — not to build a hybrid system. The "use both" conclusion was the code's existing fallback architecture leaking into the analysis.
The coverage gap is entirely explained by content type, not approach quality:
What the extraction pipeline covers (5 MCP server domains, entity-focused):
- Compute resources (23 entities: ACES, Delta, Anvil, etc.)
- Software discovery (1,404 packages)
- Allocations (5,440 projects)
- NSF awards (10,000+ awards)
- Affinity groups (55 groups)
These are all "what is X" questions about discrete entities. The pipeline pulls structured data from MCP servers and generates Q&A pairs about each entity's properties.
What UKY has that we don't (general ACCESS documentation):
- How to apply for an allocation (process docs)
- How to transfer files / use Globus (how-to guides)
- How to reset your password (account management)
- Startup vs research allocations (policy docs)
- Training resources, publication acknowledgment (educational docs)
These are "how do I" questions about ACCESS-wide processes. They don't live in any MCP server — they live in documentation pages, wikis, and guides that UKY ingested.
We don't know exactly what UKY ingested. The plan has an open question: "Need a list from Andrew of what UKY currently ingests." UKY is a black-box API to us.
On entity questions where we have Q&A pairs: pgvector hits 96% (24/25). The synthesized answers are concise and accurate. pgvector retrieval is ~500x faster than UKY (~5ms vs ~2500ms).
On general how-to/process questions: pgvector scores 0%. We simply have zero Q&A pairs for these topics because no MCP server serves allocation process docs or file transfer guides.
The gap is coverage, not quality. If we had Q&A pairs for general ACCESS topics, pgvector would likely match or beat UKY on those too.
The plan says Project C ("Extract from ACCESS documentation") was deferred with this note:
"Revisit only if a specific content gap surfaces that exists only in documents with no API equivalent (e.g., narrative tutorials, policy explainers)."
A.3 just surfaced exactly that gap. The 14 UKY-only questions are all process/how-to questions with no API equivalent.
Joe needs to decide:
- Pursue Project C: Extract Q&A pairs from ACCESS documentation (not MCP entities). This would close the how-to gap and potentially let pgvector replace UKY entirely. Requires: getting the doc list from Andrew, building a document extractor, running extraction + Argilla review.
- Keep UKY for breadth, pgvector for precision: Accept the hybrid architecture. UKY handles general questions, pgvector handles entity questions. Simpler, but you're dependent on UKY's black-box system and can't control answer quality for general topics.
- Expand entity coverage first: Before tackling docs, run the existing extraction pipeline against more entities (we only extracted 11 of 23 compute resources, 2 of 1,404 software packages, 2 of 5,440 allocations). More entity coverage might narrow the gap enough.
Searched all repos (access-qa-planning, access-agent, access-mcp, access-qa-extraction, access-qa-bot) for any documentation of what UKY's system ingests. Found:
- pages-current-production.md: "The Q&A backend is hosted at the University of Kentucky." No corpus details.
- pages-access-qa-tool.md line 193: Notes UKY's tech stack as "ChromaDB, llamaindex." No document list.
- FEB_MARCH_PLAN.md line 233: Open question: "Need a list from Andrew of what UKY currently ingests."
- uky_client.py: Black-box HTTP client. No corpus metadata.
No list of UKY's ingested documents exists anywhere in our repos. Andrew is the only source for this information.
Even without the UKY document list, there are viable paths to continue the bake-off:
Option A: Analyze UKY's 14 winning answers for source clues. Read the UKY-only responses from Run 3 and determine whether the information is unique to some internal corpus or is general ACCESS knowledge available on public web pages (support.access-ci.org, allocations.access-ci.org). UKY's answers may contain citations, URLs, or verbatim language that reveals their source documents. This takes ~30 minutes and informs all other options.
Option B: Generate Q&A pairs from public ACCESS content. Point the extraction pipeline (or a variant) at public ACCESS web pages — the allocations guide, getting started pages, Globus documentation, password reset instructions. These are freely available. Generate Q&A pairs, curate them, load into pgvector, re-run A.3. This directly tests whether closing the topic gap closes the performance gap.
Option C: Determine whether UKY's advantage is unique knowledge or general glue. The 14 UKY-only questions are all process/how-to topics. If UKY is synthesizing from the same public ACCESS web pages any user can read, then the "advantage" is simply that we haven't generated Q&A pairs for those topics yet — not that UKY has access to privileged information. This reframes the bake-off: it's not documents vs Q&A pairs, it's about coverage breadth.
Option D: Expand entity coverage as a control. Add Q&A pairs for remaining MCP server domains (events, announcements, system-status) and more entities within existing domains (we only extracted 11 of 23 compute resources, 2 of 1,404 software packages). This tests whether broader entity coverage alone changes the picture.
Recommended sequence: A first (30 min, informs everything), then B (directly tests the hypothesis), with D as low-effort parallel work.
Andrew provided the full set of documents that feed UKY's document RAG. They are in rag_documents/ (75 files, 69 MB) split across two directories:
staging/ (~47 files): The main corpus. Four categories:
| Category | Examples | Count |
|---|---|---|
| Resource descriptions | ACES, Anvil, Bridges-2, Delta, Expanse, Jetstream-2, Neocortex, Sage, Voyager, Fabric (PDFs) | ~20 |
| User guides | ACES, Anvil, Bridges-2, Delta, Expanse, Jetstream-2, Neocortex, Sage (PDFs) | ~10 |
| Process/how-to docs | Allocations, Globus, MFA, add users, progress reports, office hours, events/trainings, system status (docx) | ~12 |
| Misc | ARA description, SDS pointer, CloudBank login, REPACSS overview, Sage edge apps, current projects | ~5 |
data/ (~28 files) — Per-resource software lists (txt/csv) and resource-specific documentation:
- Software installed lists for ACES, Anvil, Bridges-2, Darwin, Delta, Expanse, Jetstream2, Kyric, Stampede3
- Darwin docs (user guide, login, filesystems, job management, SLURM, software)
- Delta docs (user guide, data management)
- FASTER docs (intro, SLURM partitions, documentation)
- ACCESS Travel Rewards (md)
Key observation: The process/how-to docs in staging/ (allocations, Globus, MFA, etc.) are exactly the topics UKY beat pgvector on in A.3. The resource descriptions overlap with what MCP extraction already covers. This confirms the A.3 finding — the gap was coverage, not quality.
Confirmed the shared end state:
- Generate Q&A pairs from these documents: Use a similar two-shot process to what exists for MCP entities, but with documents as input. Andrew: "Probably a similar prompt to the MCP tools can work for generating pairs from docs."
- One unified Q&A pair bank in pgvector: Entity pairs (from MCP) + document pairs (from these files) living together, searchable as one corpus.
- The orchestrator agent decides routing: RAG for factual queries, MCP for live data, both when needed. Andrew: "The orchestrator agent should decide which tools to use (RAG, MCP, both) and then it should get synthesized. That logic should already exist in access-agent."
- UKY goes away: Andrew: "Eventually, we will likely not need the document based RAG since the Q&A pairs are faster." pgvector replaces UKY entirely.
Step 1: Categorize the corpus. Skim the 75 files and bucket them: resource descriptions (entity overlap with MCP), user guides (process/how-to), general ACCESS docs. Identify what's already covered by MCP extraction vs. what's net new.
Step 2: Build a document extractor in access-qa-extraction. Extend the pipeline to accept documents (PDF/docx) as input. The two-shot prompt structure should carry over — battery pass for coverage, discovery pass for insights. New work: document parsing (PDF text extraction, docx reading) and chunking into logical sections.
Step 3: Run extraction on the full corpus. Generate Q&A pairs from all documents. Push to Argilla for review. This produces pairs for the exact topics pgvector was missing — allocations process, Globus, MFA, user guides.
Step 4: Load into pgvector alongside entity pairs. One unified bank: existing 83 entity pairs + document-sourced pairs. All searchable together.
Step 5: Re-run A.3. Same 41 questions (plus new ones if the expanded corpus suggests them). If pgvector-with-documents matches or beats UKY across the board, the bake-off is won.
Step 6: Simplify the agent routing. Once the Q&A pair bank covers everything, the agent graph simplifies: RAG for factual queries, MCP for live data, synthesis when both contribute. Remove the UKY fallback path.
Skimmed all 75 files in rag_documents/ and produced a categorized index at rag_documents/CORPUS_INDEX.md. No files were moved or renamed — the index is a read-only reference.
| Category | Files | Priority | Rationale |
|---|---|---|---|
| NET-NEW process/how-to | 20 | First | Fills the exact A.3 gap — allocations, Globus, MFA, Sage, citations, Jupyter |
| USER GUIDE (deep) | 22 | Second | Operational depth (job submission, filesystems, SLURM) beyond MCP surface data |
| MCP OVERLAP (descriptions) | 17 | Later | 1-page resource catalog entries — MCP already covers most of this |
| DATA FILE | 12 | Skip | Raw software lists (name/version lines) — MCP software-discovery covers this |
| POINTER/EMPTY | 4 | Skip | URL stubs or corrupt files with no substantive content |
Key finding: The 20 NET-NEW files are mostly small docx docs — easy to parse, directly address the A.3 gap. The 22 user guides are larger PDFs with real depth (SLURM partitions, data management, module systems). The 17 resource descriptions are 1-page PDFs that overlap with MCP entity data.
Also this session: Consolidated project documentation — SYSTEM_OVERVIEW.md is now single source of truth for architecture, FEB_MARCH_PLAN.md updated with A.3 results and Project C active status, all three docs gist-mirrored, CLAUDE.md updated with document discipline rules.
access-qa-extraction PR #1 (two-shot pipeline) — squash-merged to main. 4,697 additions across the full two-shot extraction pipeline: battery + discovery prompts, LLM judge, incremental cache, Argilla entity-replace, 5 domain extractors, 144 tests. Branch archived on GitHub.
access-qa-planning PR #1 (companion docs) — squash-merged to main. Documentation updates for two-shot pipeline.
access-agent and qa-bot-core — decided to leave on their branches. qa-bot-core is a production product with its own release routine. access-agent's feature/dual-rag-logging branch mixes evaluation scaffolding with production improvements — better to leave as-is until the bake-off concludes.
Reinstalled access-qa-extraction from clean main. 144/144 tests pass. Started mcp-compute-resources Docker container from access-mcp/docker-compose.yml (port 3002). Ran extraction:
qa-extract extract compute-resources --max-entities 1 --no-judge
Produced 8 Q&A pairs for ACES — 5 battery + 3 discovery, all with citations. Two-shot pipeline confirmed working on main.
Branched feat/document-extractor off clean main. Built the document extraction pipeline:
New files:
- parsers.py: Standalone document parsing module. parse_docx() (python-docx), parse_pdf() (PyMuPDF/fitz), parse_text() (.txt/.md). Dispatcher parse_document() routes by extension. chunk_text() splits large docs (~6000 words) with overlap. clean_extracted_text() collapses PDF/docx whitespace artifacts.
- extractors/documents.py: DocumentExtractor(BaseExtractor). Overrides run() to skip MCPClient (documents are local files). Discovers files recursively from the config.url directory. Each document/chunk = one entity. Two-shot LLM pipeline (battery + discovery), judge evaluation, incremental cache, same as the MCP extractors. Uses source="doc_generated", source_ref="doc://documents/{entity_id}".
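A minimal sketch of the extension-based dispatch described for parsers.py. This is an illustration under assumptions, not the module itself: the PARSERS table and the exact whitespace rules are hypothetical, and only the function names come from the notes above. The python-docx and PyMuPDF imports are deferred so the plain-text path runs without them installed.

```python
import re
from pathlib import Path


def parse_text(path: Path) -> str:
    """Read .txt/.md files as UTF-8 text."""
    return path.read_text(encoding="utf-8", errors="replace")


def parse_docx(path: Path) -> str:
    """Join paragraph text from a .docx file (requires python-docx)."""
    import docx  # lazy import: only needed for .docx input

    return "\n".join(p.text for p in docx.Document(str(path)).paragraphs)


def parse_pdf(path: Path) -> str:
    """Concatenate page text from a PDF (requires PyMuPDF)."""
    import fitz  # PyMuPDF, lazy import: only needed for .pdf input

    with fitz.open(str(path)) as pdf:
        return "\n".join(page.get_text() for page in pdf)


def clean_extracted_text(text: str) -> str:
    """Collapse runs of spaces/tabs and squeeze repeated blank lines."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()


# Hypothetical routing table: extension -> parser function.
PARSERS = {
    ".docx": parse_docx,
    ".pdf": parse_pdf,
    ".txt": parse_text,
    ".md": parse_text,
}


def parse_document(path) -> str:
    """Route a file to a parser by extension and clean the result."""
    path = Path(path)
    parser = PARSERS.get(path.suffix.lower())
    if parser is None:
        raise ValueError(f"Unsupported document type: {path.suffix}")
    return clean_extracted_text(parser(path))
```

Raising on unknown extensions (rather than returning an empty string) is a design choice for the sketch; the real extractor may skip such files silently.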
Modified files:
- pyproject.toml: Added python-docx>=1.0.0, PyMuPDF>=1.24.0
- models.py: Added source parameter to QAPair.create() (default "mcp_extraction", backward-compatible)
- question_categories.py: Added "documents" to DOMAIN_LABELS, DOMAIN_NOTES, and FIELD_GUIDANCE (5 field groups: overview, key procedures, requirements & eligibility, important details, support & contact)
- config.py: Added "documents" MCPServerConfig with url=os.getenv("DOCUMENTS_DIR", "../rag_documents")
- extractors/__init__.py: Added DocumentExtractor import and export
- cli.py: Added DocumentExtractor to EXTRACTORS registry
Test 1: qa-extract extract documents --max-entities 1 --no-judge — parsed CORPUS_INDEX.md, produced 6 Q&A pairs about the document corpus.
Test 2: qa-extract extract documents --entity-ids "10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream" --no-judge — parsed a docx file from staging/, produced 5 Q&A pairs about Jetstream citation formats and acknowledgment requirements.
Fix: _title_from_stem() was producing ugly titles from Slack-style filenames (e.g., 10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream). Added re.sub(r"^\d+_[\d.]+_", "", stem) to strip the numeric prefix, plus stripping common prefixes (data-ACCESS-, data:, etc.). Title now renders as "How To Cite Jetstream".
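A sketch of the title cleanup described above. The re.sub pattern is the one quoted in the fix; the prefix tuple and the word-splitting rule are assumptions (the notes list data-ACCESS- and data: and then "etc."), so treat this as an approximation of _title_from_stem(), not its actual body.

```python
import re

# Assumed prefix list; the notes name these two and then say "etc."
COMMON_PREFIXES = ("data-ACCESS-", "data:")


def title_from_stem(stem: str) -> str:
    """Turn a Slack-style filename stem into a readable title."""
    # Strip the "<index>_<timestamp>_" prefix, e.g. "10_1758119706.911465_".
    stem = re.sub(r"^\d+_[\d.]+_", "", stem)
    # Drop at most one known leading prefix.
    for prefix in COMMON_PREFIXES:
        if stem.startswith(prefix):
            stem = stem[len(prefix):]
            break
    words = re.split(r"[-_\s]+", stem)
    # Uppercase only the first letter so acronyms keep their casing.
    return " ".join(w[:1].upper() + w[1:] for w in words if w)
```

Applied to the stem from Test 2, this yields "How To Cite Jetstream", matching the rendered title reported in the fix.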
All 144 existing tests still pass after all changes.
Ran DOCUMENTS_DIR="../rag_documents/staging" qa-extract extract documents --no-judge on all 47 files in staging/. Took ~25 minutes (94 LLM calls).
Results: 586 Q&A pairs from 83 entities (46 files processed, 1 corrupt file skipped).
| Category | Entities | Pairs | Notes |
|---|---|---|---|
| NET-NEW docx (process/how-to) | 19 | ~110 | Allocations, MFA, Globus, Sage, Jupyter |
| User Guide PDFs (chunked) | 39 chunks | ~290 | Jetstream2 (20 chunks), Anvil (6), Bridges-2 (5), etc. |
| MCP Overlap descriptions | 17 | ~134 | 1-page resource PDFs |
| Other (ARA, SDS, REPACSS) | 8 | ~52 | Small docs |
- 100% citation markers (<<SRC:documents:...>>)
- All pairs use source: "doc_generated"
- Large PDFs chunked correctly (~6000 words per chunk with overlap)
- Quality spot-check: questions are natural, answers contain specific details (URLs, commands, step-by-step procedures)
- Only error: current-access-projects.docx (known corrupt/empty file)
Output at data/output/documents_qa_pairs.jsonl (gitignored). Branch pushed to GitHub.
Not yet run: data/ directory (Darwin, Delta, FASTER docs + ACCESS-Travel-Rewards.md + software lists).
Ran DOCUMENTS_DIR="../rag_documents/data" qa-extract extract documents --no-judge on all files in data/ subdirectories.
Results: 221 Q&A pairs from 29 entities.
| Subdirectory | Entities | Pairs | Notes |
|---|---|---|---|
| ACCESS-Resources/Darwin/ | 9 | ~65 | Managing jobs, user guide, compiling, file systems, etc. |
| ACCESS-Resources/Delta/ | 3 chunks | ~25 | Large PDF chunked into 3 |
| ACCESS-Resources/FASTER/ | 4 | ~30 | User guide, system overview, jobs, file systems |
| ACCESS-Travel-Rewards.md | 1 | ~8 | Travel reimbursement program |
| ACCESS-Software-Installed-by-resource/ | 12 | ~93 | Software lists (package names/versions — generic Q&A quality) |
- Software-list files produced generic "what software is installed on X" pairs — adequate but not high-value. Argilla reviewers can reject low-quality ones.
- Darwin and FASTER docs produced strong procedural content (SLURM commands, file system paths, compilation flags).
Saved staging/ output as documents_staging_qa_pairs.jsonl, combined both runs into documents_all_qa_pairs.jsonl (807 total pairs).
Pushed all 807 pairs to Argilla: qa-extract push data/output/documents_all_qa_pairs.jsonl. Records visible in qa-review dataset at http://localhost:6900.
Docker note: Argilla containers had stale network references from previous sessions. Fixed with docker compose down --remove-orphans && docker network prune -f && docker compose up -d.
Problem: When reviewing pairs in Argilla, all 807 records had domain: "documents" with no way to tell which source document they came from — the only clue was the source_ref URI (e.g., doc://documents/10_1758119706.911465_data-ACCESS-how-to-cite-Jetstream), which is opaque. For MCP-extracted pairs, domain provides natural grouping (compute-resources, allocations, etc.), but document pairs lack an equivalent.
Fix: Added document_name as an optional metadata field on QAMetadata, populated from the existing _title_from_stem() helper in DocumentExtractor. The field flows through to Argilla as a filterable TermsMetadataProperty. MCP extractors are unaffected (field defaults to None).
Files changed: models.py (field + factory param), documents.py (passes title), argilla_client.py (schema + record metadata).
Re-extraction: Re-ran both staging/ (611 pairs) and data/ (214 pairs) = 825 total. Deleted old Argilla dataset (no schema for document_name), pushed fresh. 72 unique document names now filterable in Argilla.
Problem: For MCP entity pairs, source_data contains the full entity JSON that the LLM used to generate the Q&A pair — reviewer sees exactly what went in. For document pairs, source_data was set to content_preview: chunk[:500] — the first 500 characters of the chunk. This was misleading: it looked like the source material but only represented a tiny slice of the ~6000-word chunk the LLM actually saw. Reviewers would see a content_preview about topic X when the Q&A pair was about topic Y (from elsewhere in the same chunk).
Fix: Replaced content_preview with a reference: {file, chunk, total_chunks, word_count}. For non-chunked documents, chunk and total_chunks are null. The reviewer sees the file and chunk number; the actual document is in rag_documents/.
Design note on chunking: Large documents (>6000 words) are split into sequential ~6000-word chunks with 500-word overlap. Each chunk is processed as a separate entity — the LLM only sees one chunk at a time, not the whole document. So chunk 9 of a 20-chunk Jetstream PDF starts at roughly word 44,000. This is why the source_ref includes the chunk number (e.g., doc://documents/jetstream-2-user-guide__chunk_9).
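The offset arithmetic in the design note can be sketched as follows. This is not the real chunk_text() from parsers.py; it assumes a fixed stride of 5,500 words (6,000 minus the 500-word overlap) and 1-based chunk numbers, which is what reproduces the 44,000 figure: (9 - 1) × 5,500 = 44,000.

```python
def chunk_words(words, size=6000, overlap=500):
    """Split a word list into sequential chunks that share `overlap` words."""
    if len(words) <= size:
        return [words]  # small documents are not chunked
    stride = size - overlap
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # this chunk already reaches the end of the document
        start += stride
    return chunks


def chunk_start(chunk_number, size=6000, overlap=500):
    """Word offset where a 1-based chunk number begins."""
    return (chunk_number - 1) * (size - overlap)
```

Because each chunk is extracted independently, a pair generated from chunk 9 can only draw on the 6,000 words beginning at that offset.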
Andrew asked about making the bake-off self-service: editable golden questions, runnable by the team with their own tokens, comparing different agent configurations ("tool combinations"). Key points from the conversation:
- Golden questions: Andrew wants a curated benchmark set that people can view, add, and modify. These are distinct from the Q&A pairs in Argilla — they're the test inputs used to evaluate the agent.
- Different tool combinations: Not UKY-vs-pgvector (UKY is going away), but different configurations of our agent — RAG thresholds, MCP server subsets, model choices. Each configuration is a "scenario."
- Self-service: Team members should be able to run evaluations and see results without Joe in the loop.
- Ongoing process: Re-run as the agent evolves, not a one-shot comparison.
Designed the evaluation harness. Full design saved as EVAL_HARNESS_PLAN.md. Summary:
- Golden questions in YAML (merge A3_TEST_QUESTIONS.md + e2e_test_cases.csv → ~55 questions with structured assertions)
- Scenario configs as YAML files overriding Settings env vars
- CLI runner calling run_agent() directly (not HTTP) to capture full AgentState
- HTML report generator producing self-contained comparison pages (matching a3-run3 visual style)
- New access-agent/eval/ directory
Added as Project D in FEB_MARCH_PLAN.md (D.1–D.4), parallel with Project B after C.4 completes.
Pivot: Initially designed as a CLI-based Python tool (access-agent/eval/). Revised to a static web app on Netlify (eval-ui/) — no Python environment needed, users just open a browser. Golden questions and scenarios bundled at build time, results displayed inline and exportable as JSON. Two open design questions flagged: (1) how scenarios actually change agent behavior given the current API doesn't accept config overrides, and (2) API key routing (server-side vs pass-through). Plan saved as EVAL_HARNESS_PLAN.md.
This is future work — immediate next step remains C.4 (review Argilla, sync pgvector, re-run A.3).
Spot-checked the 825 document pairs in Argilla and found a systematic quality issue: 36% (300/825) of generated questions referenced the source documents rather than the subject matter.
Examples:
- Wrong: "What are the important quotas and limits mentioned in the Darwin Filesystems Storage document?"
- Right: "What are the storage quotas on Darwin?"
Root cause analysis: Two contributing factors:
- FIELD_GUIDANCE field group #1 said "what is this document about?" — 90% of seq-1 (overview) pairs were meta-referencing.
- Entity titles included document-type suffixes ("Jetstream 2 User Guide") which primed the LLM to treat the document as the subject.
question_categories.py — Two changes:
- Added explicit anti-meta-referencing instruction to DOMAIN_NOTES["documents"] with wrong/right examples.
- Reworded all 5 field groups in FIELD_GUIDANCE["documents"] to avoid document-referencing (e.g., "Overview — what is this topic about?" instead of "what is this document about?").
documents.py — Added regex to _title_from_stem() to strip document-type suffixes ("User Guide", "Manual", "Handbook", etc.) so the LLM sees "Jetstream 2" instead of "Jetstream 2 User Guide" as the entity name.
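A sketch of the suffix-stripping step. The suffix list is an assumption limited to the three examples named above (the notes end with "etc."), and the function name is hypothetical; the real regex in _title_from_stem() may cover more cases.

```python
import re

# Assumed suffix list: only the three examples named in the notes.
DOC_TYPE_SUFFIX = re.compile(r"\s+(User Guide|Manual|Handbook)\s*$", re.IGNORECASE)


def strip_doc_type_suffix(title: str) -> str:
    """Remove a trailing document-type label so the entity name is the subject."""
    return DOC_TYPE_SUFFIX.sub("", title).strip()
```

Anchoring the pattern at the end of the string leaves titles such as "How To Cite Jetstream" untouched.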
Three extraction runs after iterating on fixes:
- Staging (first fix): 608 pairs, 10% meta (down from 36%)
- Staging (with title suffix fix): 604 pairs, 0.9% meta (6 remaining)
- Data directory: 228 pairs
Combined: 832 pairs, 6 meta-referencing (0.7%). Cleared Argilla and pushed fresh.
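The notes do not say how meta-referencing was counted. One crude way to approximate it is a keyword heuristic like the sketch below; the word list and function names are hypothetical, and a legitimate question that happens to mention a guide would be flagged too, so counts from it are a rough screen rather than a measurement.

```python
import re

# Hypothetical heuristic: flag questions that name a document type.
META_WORDS = re.compile(r"\b(document|guide|manual|handbook)\b", re.IGNORECASE)


def is_meta_referencing(question: str) -> bool:
    """True when the question talks about a source document, not the subject."""
    return bool(META_WORDS.search(question))


def meta_rate(questions) -> float:
    """Fraction of questions flagged as meta-referencing."""
    questions = list(questions)
    if not questions:
        return 0.0
    return sum(is_meta_referencing(q) for q in questions) / len(questions)
```

On the wrong/right example pair above, the heuristic flags the "Darwin Filesystems Storage document" question and passes "What are the storage quotas on Darwin?".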
Brought up all services locally (qa-service on 8001, access-agent on 8000). Synced 832 document pairs from Argilla and loaded 70 entity pairs via JSONL. Total: 902 pairs in pgvector.
Fired all 41 test questions. Results:
| Metric | Run 3 (83 pairs) | Run 4 (902 pairs) |
|---|---|---|
| UKY hits | 38/41 (93%) | 40/40 (100%) |
| pgvector hits | 27/41 (66%) | 27/40 (67%) |
| pgvector avg latency | ~5ms | ~30ms |
pgvector coverage stayed essentially flat (27 hits in both runs, 66% → 67%) despite 10x more pairs.
The 13 missed questions fall into two categories:
- Missing source content (4 questions): Ranch storage has zero Q&A pairs because no Ranch documents exist in rag_documents/ and Ranch wasn't returned from MCP in the extraction run that generated the original test questions.
- No cross-cutting Q&A pairs (9 questions): General ACCESS questions ("How do I apply for an allocation?", "How do I transfer files between resources?", "What training does ACCESS offer?") have no matching pairs even though we have 104 allocation mentions, 50 transfer/Globus mentions, and 40 training mentions across our pairs. The problem: all those mentions are entity-scoped. We have "How do I cite Jetstream?" but not "How do I acknowledge ACCESS?" We have "What allocations does Anvil support?" but not "How do I apply for an allocation?"
The extraction pipeline processes one document at a time, so it only ever generates entity-scoped Q&A pairs. It will never produce cross-cutting "How does ACCESS work in general?" pairs from a single-document prompt.
UKY's advantage is architectural: chunk-level retrieval at query time lets it pull relevant fragments from multiple documents and synthesize on the fly. It doesn't need a pre-generated answer that matches — it just needs chunks that are individually relevant. Our Q&A-pair RAG needs a pair whose question semantically matches the user's question, and no single entity-scoped pair matches a cross-cutting query closely enough.
- Manually curate cross-cutting pairs — Write 20-30 general ACCESS Q&A pairs by hand. Fast, targeted, but doesn't scale.
- Add a cross-cutting extraction pass — Feed the LLM multiple documents simultaneously and ask for general questions that span topics. New pipeline capability.
- Keep UKY as fallback for general questions — Accept the hybrid. pgvector for entity questions (fast, verified), UKY for cross-cutting (slow, unverified).
- Lower similarity thresholds — Some misses scored 0.55-0.68, not far from the 0.70 cutoff. Won't fix the 0.28-0.49 misses.
- Detect cross-cutting-ness at query time — Instead of pre-generating cross-cutting pairs, use pgvector match quality as a signal: low scores with scattered partial matches → route to document chunk RAG or MCP tools. Fits existing agent graph routing.
- a3_results/run4.json: 40 comparison log entries
- a3_results/run4_enriched.json: enriched with low-threshold best-possible scores
- ~/.agent/diagrams/a3-run4-bakeoff.html: interactive comparison visualization
Even when pgvector hits, many answers are thinner than UKY's. Investigated whether pgvector answers were bypassing LLM synthesis — confirmed they are NOT: _dual_rag_answer() calls _synthesize_with_rag_only() for every pgvector match. The real issue: a single pre-digested Q&A pair gives the synthesis LLM very little to work with, so it returns near-verbatim text. UKY pulls multiple document chunks and the LLM has more raw material to synthesize a richer answer.
However, reviewing side-by-side answers revealed a more nuanced picture:
- Some pgvector answers are actually better than UKY's (more precise, directly relevant)
- Some just need link enrichment (the synthesis prompt doesn't encourage adding URLs)
- Some questions UKY can't answer but pgvector can (entity-specific data from MCP)
This shifts the framing from "pgvector vs UKY" to "how to combine them intelligently."
Quick fix (low-effort, high-impact): The RAG_ONLY_SYNTHESIS_PROMPT in synthesize.py says "Be concise and direct" — this is why the LLM returns near-verbatim single sentences. Updating the prompt to encourage link inclusion, practical context, and resource pointers would immediately enrich thin answers without any architectural changes. The Q&A pair metadata already carries domain and entity_id which could drive link generation.
Instead of generating cross-cutting Q&A pairs up front, detect cross-cutting-ness at query time based on pgvector results and route accordingly:
- pgvector score < threshold but > 0.4 → content exists but scattered → fall back to document chunk RAG or plan+MCP
- pgvector hit but thin answer → enrich with MCP tool calls or document chunks
- pgvector hit with rich answer → serve it (fast, verified)
- pgvector zero matches → missing content → MCP or UKY fallback
This fits the existing agent graph — rag_answer already evaluates match quality and routes to plan on weak matches. The change: make that evaluation smarter about why the match is weak.
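The four routing rules can be sketched as a single decision function. Everything here is illustrative: the Match type, the route labels, and the 200-character "thin answer" cutoff are assumptions, and the notes do not say what happens to a non-zero match scoring 0.4 or below, so the sketch treats it as missing content.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Match:
    score: float   # best pgvector similarity for the query
    answer: str    # answer text of the best-matching Q&A pair


def route(best: Optional[Match], threshold: float = 0.70,
          floor: float = 0.40, thin_chars: int = 200) -> str:
    """Pick a path from pgvector match quality (labels are hypothetical)."""
    if best is None:
        return "mcp_or_uky_fallback"      # zero matches: missing content
    if best.score >= threshold:
        if len(best.answer) < thin_chars:
            return "enrich"               # hit but thin: add MCP data or chunks
        return "serve"                    # hit with a rich answer
    if best.score > floor:
        return "chunk_rag_or_plan"        # content exists but is scattered
    # Assumption: scores at or below the floor are treated like no match.
    return "mcp_or_uky_fallback"
```

Keeping the decision in one pure function would make the thresholds easy to vary per evaluation scenario.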
- threshold=0.0 falsy in vectorstore.py: threshold or settings.rag_similarity_threshold treats 0.0 as falsy, falling back to default 0.85. Affects diagnostic queries with threshold=0.
- q21 not logged: "How much funding did the pollinator conservation AI project get?" was classified as non-RAG (40/41 logged).
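The falsy-zero issue is a general Python pitfall and can be reproduced in isolation. In this sketch DEFAULT_THRESHOLD stands in for settings.rag_similarity_threshold, and both function names are hypothetical; the usual remedy is an explicit None check.

```python
from typing import Optional

DEFAULT_THRESHOLD = 0.85  # stand-in for settings.rag_similarity_threshold


def resolve_threshold_buggy(threshold: Optional[float]) -> float:
    """`or` falls through on any falsy value, including a deliberate 0.0."""
    return threshold or DEFAULT_THRESHOLD


def resolve_threshold_fixed(threshold: Optional[float]) -> float:
    """Only fall back to the default when no threshold was passed at all."""
    return DEFAULT_THRESHOLD if threshold is None else threshold
```

With the None check, a diagnostic query passing threshold=0 keeps its zero threshold instead of silently running at the default.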
Discovery: The Run 4 summary reported "UKY hits 40/40 (100%)" — but this counted every UKY response as a hit, including hedges like "The provided documents do not contain specific information about Abaqus. Please open a support ticket." Applied the same hedge detection used at runtime (_rag_answer_is_weak in graph.py) to the logged responses.
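For illustration only, a phrase-based hedge check might look like the sketch below. It is not the _rag_answer_is_weak implementation in graph.py, whose logic is not shown in these notes; the phrase list and function names are assumptions chosen to catch the Abaqus example.

```python
# Hypothetical phrase list; the real runtime check may use different signals.
HEDGE_PHRASES = (
    "do not contain",
    "does not contain",
    "no specific information",
)


def looks_like_hedge(response: str) -> bool:
    """True when a response admits the corpus lacks the answer."""
    lowered = response.lower()
    return any(phrase in lowered for phrase in HEDGE_PHRASES)


def count_genuine(responses) -> int:
    """Count non-empty responses that are not hedges."""
    return sum(1 for r in responses if r.strip() and not looks_like_hedge(r))
```

Re-scoring logged responses with the same predicate used at runtime keeps the offline hit counts comparable with what the agent would actually serve.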
Corrected Run 4 numbers:
| Metric | pgvector | UKY |
|---|---|---|
| Genuine answers | 27/40 (68%) | 13/40 (33%) |
| Hedged / no match | 13 | 27 |
Head-to-head breakdown:
- Both answered well: 8
- pgvector only (UKY hedged): 19
- UKY only (pgvector no match): 5 — all general process questions (allocations, password reset, file transfer)
- Neither answered well: 8
What this means: pgvector already outperforms UKY 2-to-1. UKY's 19 entity-specific hedges are questions pgvector handles from curated MCP data (software versions, resource specs, NSF awards) that UKY's document corpus simply doesn't cover. The "UKY as strong fallback" framing was wrong — UKY adds value on only 5 questions, all cross-cutting process topics.
Remaining gap (13 questions): 5 cross-cutting process questions (UKY answers, pgvector doesn't) + 8 neither backend handles. A document-chunk fallback for cross-cutting detection would address most of these, but the urgency is lower than previously thought.
Also this session: Updated SYSTEM_OVERVIEW.md routing table with file names, condition explanations, and node descriptions. Synced gist.
Ran 50 clean, well-phrased questions (FRIENDLY_BATTERY.md) against both UKY and NEWSYSTEM. The questions are intentionally well-phrased (no typos, no complaint framing, no vagueness) to test capability separately from robustness. They cover the same topics as the real-user battery but with proper phrasing.
- UKY battery: 50/50 success, 247.5s total (~3.9s avg). Saved to a3_results/friendly-uky.json.
- NEWSYSTEM battery: 50/50 success, 310.0s total (~5.2s avg). Saved to a3_results/friendly-ns.json.
- Comparison report: ~/.agent/diagrams/friendly-battery-comparison.html
| Metric | UKY | NEWSYSTEM |
|---|---|---|
| Avg response length | 1,380 chars | 777 chars |
| Avg latency | 3.9s | 5.2s |
| Errors | 0 | 0 |
| Short responses (<200c) | 0 | 3 |
| Source | Count | Notes |
|---|---|---|
| RAG retrieval | 24/50 (48%) | Up from 18% on real-user queries |
| MCP tool calls | 7/50 (14%) | search_resources, get_resource_hardware, search_software, search_events |
| LLM-only | 19/50 (38%) | No external data, answered from Claude's training |
- Allocations/credits (3): allocation types, applying, exchange allocations
- Account management (3): password reset, new account, office hours
- Cross-cutting process (2): Globus file transfer, acknowledging ACCESS
- Resource-specific how-tos (7): SLURM submission, logging in, SU calculations, data management
- Troubleshooting (2): SLURM qos errors, common parameters
- Credit conversion (2): SU calculations, GPU hours from credits
These are all topics where Q&A pairs exist in pgvector but the classifier routes them as "dynamic" or the similarity scores fall below threshold. The answers from LLM training data are often reasonable (avg 1,028 chars) but ungrounded — no citations.
| Metric | Real-user (messy) | Friendly (clean) |
|---|---|---|
| RAG hit rate | 18% (9/50) | 48% (24/50) |
| MCP tool use | 14% (7/50) | 14% (7/50) |
| LLM-only | 68% (34/50) | 38% (19/50) |
| Errors/failures | 7 (JSM errors) | 0 |
| Short responses | 7 (<100c) | 3 (<200c) |
RAG hit rate nearly triples with clean phrasing. MCP usage is identical (the classifier correctly identifies dynamic questions regardless of phrasing quality). The 7 JSM error messages from the real-user battery are gone because these clean questions don't trigger complaint-framing misclassification.
- The system works. When given fair input, NEWSYSTEM retrieves relevant Q&A pairs and produces grounded answers. The 18% RAG hit rate was an input robustness problem, not a capability problem.
- But it's still not great. 48% RAG hit rate means 52% of well-phrased questions don't get grounded answers. 19 LLM-only is still too many. The Q&A pair matching threshold (0.70) is filtering out matches that would score in the 0.55–0.69 range.
- UKY wins on answer richness. Longer answers on 43/50 questions. Document chunks provide more context than distilled Q&A pairs. This is the thin-answer problem from earlier runs, still present.
- Classifier is less of a problem with clean input. Zero JSM misfires. The classifier works when questions are well-formed; it breaks on complaint framing and error pastes.
- Lower similarity threshold from 0.70 to ~0.60 — recover the 0.55–0.69 band matches
- Fix classifier prompt — distinguish "needs live data" from "describes a problem with a documented solution"
- Add RAG fallback for failed dynamic — when tools fail or planner gives up, try Q&A pair search
- Re-run real-user battery after tuning — measure improvement
- If still not enough — present evidence for document-chunk RAG as fallback layer
Full clean-state bake-off: reset pgvector + Argilla → re-extract entities (80 pairs) → re-extract documents (603 pairs) → Argilla sync to pgvector (683 total pairs) → run 50 real-user questions from REAL_USER_BATTERY.md against both UKY and NEWSYSTEM separately.
- UKY battery: 50/50 success, 275.6s total (~5.5s avg). Saved to a3_results/uky-battery.json.
- NEWSYSTEM battery: 50/50 success, 487.6s total (~9.8s avg). Saved to a3_results/ns-battery.json.
| Metric | UKY | NEWSYSTEM |
|---|---|---|
| Avg response length | ~1,100 chars | ~680 chars |
| Ultra-short failures (<100 chars) | 0 | 7 |
| RAG hit rate | N/A | 9/50 (18%) |
| Tool usage | N/A | 24 calls across 7 tool types |
| Total duration | 4.6 min | 8.1 min |
1. Threshold too aggressive for real-user queries (biggest factor). The agent uses 0.70 for static queries. Only 9/50 cleared that bar. But useful data exists just below it — "ACES specifications?" scored 0.694 (missed by 0.006), "How can I use my allocations?" scored 0.618, "login.expanse.sdsc.edu" scored 0.608. The Q&A pairs are phrased as clean specific questions; real user queries are short, vague, and messy. The 0.70 threshold worked for our hand-crafted 41-question battery but fails on real user language.
Direct qa-service queries at threshold=0.01 confirmed data IS there — the agent's threshold is filtering it out. 25/50 questions had zero matches at threshold 0.70 but would have found relevant pairs at lower thresholds.
2. Over-classification as "dynamic" (16/50 = 32%). Dynamic classification bypasses RAG entirely. Of 16 dynamic-classified questions:
- 7 routed to JSM domain → all failed with "tools unavailable" (JSM MCP server not running). These produced identical 92-char canned error messages.
- 9 others classified dynamic with domain=None → 4 succeeded with tools, 5 had no applicable tools and fell to LLM-only.
Queries like "Having password issues using ssh to login" (q32) and "I am invited to a new project but I cannot see it" (q33) are arguably static how-to questions that RAG should handle, but the classifier read them as user-specific dynamic problems.
3. No fallback: dynamic misses don't try RAG. When a dynamic query fails (JSM unavailable, no tools applicable), the agent generates from LLM training data. There is no "if tools fail, try RAG" recovery path.
Of 50 real-user questions: 9 hit RAG, 5 succeeded with tools, 7 got JSM errors, and 29 were answered from pure LLM generation with no external data. The system barely used its knowledge base.
Run 4's 41 questions were hand-crafted to be clean and specific. Real user queries are short ("is lammps on stampede"), vague ("How do it get to anvil"), or framed as complaints ("I can't login to Expanse right now"). The threshold + classifier combo that worked for clean queries falls apart on messy ones.
| Type | Count | Notes |
|---|---|---|
| static | 31 (62%) | 25 got zero RAG matches, 6 hit RAG |
| dynamic | 16 (32%) | 7 JSM failures, 4 tool successes, 5 LLM-only |
| combined | 3 (6%) | 2 hit RAG, 1 failed tools |
| Q# | Score | Query (truncated) |
|---|---|---|
| q3 | 0.733 | NSG allocation, get started on Expanse |
| q6 | 0.730 | Jetstream2 storage for deltaai |
| q8 | 0.752 | account for NCSA DeltaAI |
| q12 | 0.715 | How do it get to anvil |
| q14 | 0.838 | hardware specifications for Anvil |
| q15 | 0.718 | specific type of GPU for Bridges-2 |
| q18 | 0.632 | TAMU ACES: 1 of 1 SUs remaining (combined, threshold 0.60) |
| q21 | 0.731 | credit for 1 node of jetstream2 GPU |
| q36 | 0.735 | resources have comsol + gpu (combined, threshold 0.60) |
- a3_results/uky-battery.json: 50 UKY responses
- a3_results/ns-battery.json: 50 NEWSYSTEM responses with full node traces
- ~/.agent/diagrams/bakeoff-battery-comparison.html: side-by-side HTML comparison
Problem 1 — Threshold (0.70) too aggressive for messy queries. Lowering to 0.55–0.60 would recover ~5-8 more RAG hits. Below 0.50 starts pulling wrong-topic matches (e.g., "Can I change my ACCESS ID?" at 0.346 matches "How can a researcher access Anvil?" — completely wrong). The 0.60–0.70 band has genuinely useful near-misses: "ACES specifications?" at 0.694, "How can I use my allocations?" at 0.618, "login.expanse.sdsc.edu" at 0.608.
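The threshold trade-off can be checked directly against the four scores quoted in Problem 1. The helper name is hypothetical; the queries and scores are the ones reported above.

```python
# Best-match similarity scores quoted in Problem 1.
SCORES = {
    "ACES specifications?": 0.694,
    "How can I use my allocations?": 0.618,
    "login.expanse.sdsc.edu": 0.608,
    "Can I change my ACCESS ID?": 0.346,  # wrong-topic match
}


def hits_at(scores, threshold):
    """Queries whose best match clears the given similarity threshold."""
    return [query for query, score in scores.items() if score >= threshold]


for cutoff in (0.70, 0.60, 0.50):
    print(cutoff, hits_at(SCORES, cutoff))
```

At 0.70 none of the four clear the bar, at 0.60 the three useful near-misses do, and the wrong-topic 0.346 match stays excluded even at 0.50.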
Problem 2 — Classifier over-routes to dynamic/JSM. 7 questions routed to JSM (Jira Service Management) because they sounded like complaints ("password not working", "shows pending", "can't see my project"). These are static how-to questions that RAG should handle. The classifier reads frustration/complaint framing and assumes it needs a live support ticket lookup. Two clear misclassifications (q40 "how can I access my slurm", q41 "sbatch error") and several judgment calls where static would have been more pragmatic (q28, q42). q46 ("link to register for the webinar?") was correctly classified as dynamic — webinar links are live data.
Problem 3 — No fallback from failed dynamic. When dynamic classification fails (JSM not running, no applicable tools), the agent generates from LLM training data with no external grounding. Adding a RAG fallback for failed dynamic queries would recover 5-7 questions without touching the threshold.
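The proposed recovery path could be shaped roughly like this. The three callables are hypothetical stand-ins for the agent's tool execution, Q&A pair search, and ungrounded generation; the sketch shows only the ordering (tools, then RAG, then LLM-only), not how the agent graph would wire it.

```python
from typing import Callable, Optional, Tuple


def answer_dynamic(
    query: str,
    run_tools: Callable[[str], Optional[str]],
    search_pairs: Callable[[str], Optional[str]],
    generate: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (source, answer), trying RAG before ungrounded generation."""
    try:
        tool_answer = run_tools(query)
    except Exception:
        tool_answer = None  # e.g. an MCP server that is not running
    if tool_answer:
        return "tools", tool_answer
    rag_answer = search_pairs(query)
    if rag_answer:
        return "rag_fallback", rag_answer
    return "llm_only", generate(query)
```

Returning the source label alongside the answer would let the battery reports count how many questions the fallback actually recovers.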
Joe's concern: we'll drop thresholds, risk false positives, tune classifier rules against an imperfect surface, get a few more questions right, and still not be good enough. The fundamental brittleness is that Q&A-pair RAG requires the user's question to semantically match a pre-generated question. Real users don't phrase things that way.
Document-chunk RAG as fallback: Joe suspects we'll eventually need document chunks in the mix — like UKY does. High-confidence Q&A pair matches get served directly (fast, verified). When Q&A similarity is weak, fall back to document-chunk retrieval where the LLM finds the answer within a broader text passage. This is more resilient to messy phrasing because chunks have more surface area for partial matches.
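The tiered idea could be sketched roughly as follows. Everything here is hypothetical: `search_qa_pairs` and `search_chunks` stand in for retrievers that do not exist under those names, and the route labels are made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Match:
    answer: str
    score: float


def answer_with_fallback(
    query: str,
    search_qa_pairs: Callable[[str], Optional[Match]],  # hypothetical Q&A-pair lookup
    search_chunks: Callable[[str], list[str]],          # hypothetical chunk retriever
    threshold: float = 0.70,
) -> dict:
    """Serve a strong Q&A-pair match directly; otherwise hand chunks to the LLM."""
    best = search_qa_pairs(query)
    if best is not None and best.score >= threshold:
        return {"route": "qa_pair", "context": [best.answer]}
    chunks = search_chunks(query)
    if chunks:
        return {"route": "chunks", "context": chunks}
    # Nothing retrieved: the caller falls back to an ungrounded LLM answer.
    return {"route": "llm_only", "context": []}
```

Keeping the Q&A-pair tier first preserves the fast, verified path; the chunk tier is only consulted when similarity is weak.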
Andrew's prior directive was no document-chunk RAG — just return more Q&A pair matches and let synthesis combine them. This worked for union-type cross-cutting queries but doesn't solve the phrasing-mismatch problem. The real-user battery results may change this conversation.
Before tuning anything, separate two questions that are currently conflated:
- Can our system match UKY when given well-phrased questions about entities we cover? (capability test)
- Can our system handle messy real-user language? (robustness test)
Step 1: Build a "friendly battery" — 30-40 clean, well-phrased questions specifically targeting the 683 Q&A pairs in pgvector. Topics we know we have pairs for. If NEWSYSTEM matches or beats UKY here, the system works and the problem is purely input robustness.
Step 2: Tune classifier + threshold — targeted fixes based on the failure analysis above.
Step 3: Re-run real-user battery — measure how much tuning helped.
Three data points: best-case, tuned, real-world. If the gap between friendly and real-world remains huge after tuning, that's concrete evidence for document-chunk RAG as a fallback layer.
Friendly battery confirms NEWSYSTEM capability. 48% RAG hit rate on clean questions (vs 18% on messy real-user queries). 19/50 answered LLM-only — mainly allocations, account management, and cross-cutting process questions where Q&A pairs exist but similarity threshold filters them out. UKY still wins on answer length (1,380 vs 777 avg chars). The system works; the problem is input robustness (threshold + classifier). Next: lower threshold to ~0.60, fix classifier, add RAG fallback for failed dynamic, then re-run real-user battery.
- A.1 (Argilla → pgvector sync) ✅
- A.2 (dual-RAG logging in access-agent) ✅
- A.3 Runs 1–6 complete ✅ — RAG-vs-RAG (Runs 1–4), full-system (Runs 5–6)
- Post-mortem analysis ✅ — gap is content type (entity vs process), not quality
- UKY corpus obtained ✅ — 75 files in rag_documents/
- Direction confirmed with Andrew ✅ — generate Q&A pairs from docs, unify in pgvector, retire UKY
- C.1 corpus categorized ✅ — index at rag_documents/CORPUS_INDEX.md
- C.2 document extractor built ✅ — committed and pushed on feat/document-extractor
- C.3 extraction complete ✅ — 832 pairs (604 staging + 228 data), meta-referencing fixed (36% → 0.7%)
- Outstanding PRs merged ✅ — both access-qa-extraction and access-qa-planning PRs squash-merged
- C.4 sync + bake-off ✅ — 902 pairs in pgvector (832 document + 70 entity), 40 questions answered
- A.3 Run 4 reanalysis ✅ — pgvector 68% vs UKY 33% (hedge responses excluded)
- A.3 Run 5 ✅ — full-system test (pgvector + MCP + routing). 24 RAG, 5 MCP, 12 LLM-only.
- Node tracing ✅ — node_trace in AgentState, gated behind ?include_trace=true (commits 04342c8, b7a9bec)
- Top-5 matches + enriched synthesis prompt ✅ — RAG_TOP_K 3→5, prompt rewritten (commit ef43a21)
- A.3 Run 6 ✅ — 27 RAG, 5 MCP, 9 LLM-only. Thin-answer problem confirmed as extraction-level issue.
- Real-user query analysis ✅ — 4,887 unique queries from chatbot CSV, topic/resource distribution mapped
- Real-user test battery ✅ (DRAFT) — 50 questions sampled by real user interest (REAL_USER_BATTERY.md)
- Entity alignment analysis ✅ — MCP vs docs Venn diagram mapped, top 8 resources identified
- Real-user bake-off (clean state) ✅ — 683 pairs, 50 questions, UKY vs NEWSYSTEM. NEWSYSTEM regression: 18% RAG hit rate, 7 JSM failures, 29/50 LLM-only.
- Regression root cause analysis ✅ — three compounding problems: threshold too high for messy input, classifier over-routes to dynamic/JSM, no fallback from failed dynamic
- Friendly battery ✅ — 50 clean questions, 48% RAG hit rate (vs 18% real-user), confirms capability. Problem is input robustness, not system capability.
- pgvector is already ahead: 27/40 genuine answers vs UKY's 13/40. pgvector covers entity-specific data (software, resources, awards) that UKY cannot.
- Full system closes more gaps: MCP tools answer Ranch questions and project search (Runs 5–6). LLM-only count dropped from 12 → 9 with top-5 matches pulling more questions into RAG.
- Cross-cutting gap splits into two types: Union-type queries now partially addressed by top-5 multi-match synthesis. Procedural queries ("How do I apply for an allocation?") still need hand-curated cross-cutting Q&A pairs (~5 questions).
- Answer richness gap is upstream: Even when pgvector hits, answers are thinner than UKY's (e.g., q2: bare accelerator list vs UKY's unit counts, model numbers, memory specs). The synthesis prompt can't add detail that isn't in the Q&A pairs. Fix is in access-qa-extraction prompts — affects both MCP and document extractors.
- Real users ask about Expanse, Delta, Anvil most: 246, 181, 174 mentions respectively. Ranch (11 mentions) was overrepresented in the original 41-question battery.
| Resource | MCP ID | User mentions |
|---|---|---|
| Expanse | expanse.sdsc.access-ci.org | 246 |
| Delta | delta.ncsa.access-ci.org | 181 |
| Anvil | anvil.purdue.access-ci.org | 174 |
| Bridges-2 | bridges2.psc.access-ci.org | 102 |
| ACES | aces.tamu.access-ci.org | 83 |
| Jetstream2 | jetstream2.indiana.access-ci.org | 62 |
| Stampede3 | stampede3.tacc.access-ci.org | 55 |
| Sage | sage.northwestern.edu | 43 |
All 8 exist in both MCP and rag_documents/. Currently only ACES has MCP entity pairs in pgvector.
- Classifier + threshold tuning — review classifier prompt, fix over-routing to dynamic/JSM. Lower static threshold from 0.70 to ~0.60. Add RAG fallback for failed dynamic queries.
- Re-run real-user battery — measure improvement from tuning. Three data points: friendly (48% RAG), tuned real-user, original real-user (18% RAG).
- Assess document-chunk RAG — if tuning doesn't close the gap, present the evidence to Andrew for adding document-chunk retrieval as a fallback layer beneath Q&A pair matching.
- Project D — evaluation harness (EVAL_HARNESS_PLAN.md)
- Project B — feedback protocol design
Analyzed chatbot_log_all_data.csv — 5,780 rows from the production chatbot (connected to UKY). After deduplication, length filtering, and removing our own test battery questions: 4,887 unique real user queries saved to a3_results/real_user_queries.json.
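The cleaning pass described above might look something like this. It is a sketch only: the length bounds and the normalization rule are assumptions, since the actual cutoffs used for the 4,887-query set are not recorded here.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different duplicates compare equal."""
    return " ".join(text.lower().split())


def clean_queries(rows, battery_questions, min_len=10, max_len=500):
    """Deduplicate, length-filter, and drop our own test-battery questions.

    min_len / max_len are placeholder bounds, not the values used in the real run.
    """
    known = {normalize(q) for q in battery_questions}
    seen = set()
    kept = []
    for raw in rows:
        key = normalize(raw)
        if not (min_len <= len(key) <= max_len):
            continue  # too short or too long to be a useful query
        if key in seen or key in known:
            continue  # duplicate, or one of our own battery questions
        seen.add(key)
        kept.append(" ".join(raw.split()))
    return kept
```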
Topic distribution:
- Specific resources: 20% (Expanse 246, Delta 181, Anvil 174, Bridges 102, ACES 83, Jetstream 62, Stampede 55, Sage 43)
- Allocations: 17%
- Account/access: 16%
- GPU: 8%
- Jobs/SLURM: 8%
- Software: 5%
- Storage/data: 4%
- Other (general ACCESS): 42%
Ranch had only 11 mentions (0.2%) but occupied 4/41 questions (10%) in our original battery — significantly overrepresented.
Mapped which compute resources exist in MCP (search_resources returns 23), which have documents in rag_documents/, and which are in pgvector. Key finding: only ACES has MCP entity pairs in pgvector (8 pairs). The other 22 MCP resources were never extracted because extraction was run with the --max-entities cap during testing.
10 resources exist in both MCP and docs. 13 are MCP-only (including Ranch). 2 are docs-only (Darwin, FASTER). The top 8 by real user interest all exist in both sources.
Sampled 50 questions from the 4,887 real queries, weighted by actual user interest. Saved as REAL_USER_BATTERY.md (parallel to A3_TEST_QUESTIONS.md) and a3_results/real_user_battery_50.json. Filtered out context-dependent follow-ups, pasted errors, very short/long queries, and our own test questions. Kept realistic messiness (typos, vague phrasing).
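One way to sample in proportion to user interest is a per-category quota with largest-remainder rounding. This is a sketch under assumptions: each query is taken to carry a single category label, and the function name is hypothetical rather than the script that produced the battery.

```python
import random
from collections import defaultdict


def sample_battery(queries, n, seed=0):
    """Sample n (category, text) pairs, giving each category slots proportional to its size."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for category, text in queries:
        by_cat[category].append(text)

    total = len(queries)
    # Largest-remainder allocation so the quotas sum to exactly n.
    exact = {c: n * len(items) / total for c, items in by_cat.items()}
    quota = {c: int(x) for c, x in exact.items()}
    leftover = n - sum(quota.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c], reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1

    battery = []
    for c, items in by_cat.items():
        picked = rng.sample(items, min(quota[c], len(items)))
        battery.extend((c, t) for t in picked)
    return battery
```

A fixed seed keeps the draw reproducible, which matters when the battery is re-sampled after edits to the query pool.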
Spent significant time untangling confusion about what the system is, what we're testing, and why. The confusion stemmed from conflating three separate problems: thin answers (extraction prompt quality), missing entities (extraction coverage), and cross-cutting gaps (architectural limitation of per-entity extraction). Wrote FRIDAY_THE_13TH.md as a narrative summary of the full arc and decisions.
Key clarification: the system under test is access-agent's LangGraph with MCP-extracted QAPs + document-extracted QAPs + MCP tools at runtime. Both QAP sources stay — MCP gives structured specs, docs give process/how-to. No document-chunk RAG for now (Andrew's directive — use top-5 QAP matching instead).
- a3_results/real_user_queries.json — 4,887 deduplicated real user queries
- a3_results/real_user_queries_original.json — backup before any edits
- a3_results/real_user_battery_50.json — 50 sampled questions with categories
- REAL_USER_BATTERY.md — human-readable battery (DRAFT, needs review)
- FRIDAY_THE_13TH.md — narrative summary of where we are and what we decided
Ran all 41 questions against the updated access-agent (commit ef43a21: RAG_TOP_K 3→5, RAG_ONLY_SYNTHESIS_PROMPT rewritten for thoroughness). UKY disabled, MCP servers active, include_trace=true for full node tracing. Compared against UKY baseline from Run 4.
| Metric | Run 5b | Run 6 |
|---|---|---|
| Via RAG | 24 | 27 |
| Via MCP | 5 | 5 |
| LLM only | 12 | 9 |
| Avg latency | — | 6.6s |
41/41 answered, 0 failures. Top-5 pulled 3 more questions into the RAG path (likely union-type queries that now get enough matching pairs to clear the threshold).
The enriched synthesis prompt did not fix the answer richness gap. Example — q2 ("What kind of accelerators does ACES have?"):
- Run 6 (272 chars): Bare list of accelerator names (Intel Max GPUs, NVIDIA H100, etc.)
- UKY (1197 chars): Detailed breakdown with unit counts, model numbers, memory specs per accelerator type
Run 6 matched 3 Q&A pairs (best score 0.929), but the top pair's answer is itself just a summary list: "The ACES system includes a variety of accelerators such as Intel Max GPUs, Intel FPGAs, NVIDIA H100 and A30 GPUs, NEC Vector Engines, NextSilicon co-processors, and Graphcore IPUs."
The synthesis LLM can't add detail that isn't in the source pairs. UKY's advantage here isn't architectural — it's that UKY's source documents (the ACES user guide PDF) contain the detailed specs table, and chunk retrieval preserves that detail. Our extraction pipeline summarized it away.
The two-shot extraction pipeline (battery + discovery) generates answers that summarize source material rather than preserving specifics. This is appropriate for "what is X" overview questions but loses value for detail questions ("what accelerators", "what specs", "how many nodes").
The fix is upstream in access-qa-extraction: the battery/discovery prompts need to instruct the LLM to retain numerical details, specifications, counts, and model numbers from the source data.
q5–q8 (Ranch) all show 0 RAG matches because Ranch was never in our MCP extraction and has no Q&A pairs. q5, q6, q7 get reasonable answers via MCP tools (search_resources, get_resource_hardware). q8 ("How do I request a shared project space on Ranch?") falls to LLM-only with a generic answer.
- a3_results/run6.json — 41 questions with full responses and node traces
- ~/.agent/diagrams/a3-run6-comparison.html — interactive comparison (Run 6 vs UKY baseline)
Ran all 41 questions through the production agent graph with MCP servers active and UKY disabled. This is the first system-vs-system test: pgvector RAG + MCP tools + LangGraph routing, compared against UKY's baseline responses from Run 4.
Configuration:
- ENVIRONMENT=production, MCP_SERVER_HOST=host.docker.internal — agent container reaches MCP servers via Docker host bridge
- UKY_RAG_ENABLED=false, DUAL_RAG_LOGGING=false — no UKY, no dual-RAG comparison path
- 10 MCP servers running (access-mcp/docker-compose.yml)
- 902 Q&A pairs in pgvector (832 document + 70 entity)
Results — 41/41 questions answered:
- 24 via RAG (rag_retrieval)
- 5 via MCP tools (search_resources, get_resource_hardware, search_events, search_projects)
- 12 LLM-only (no tools called)
- MCP tools fill the Ranch gap. Ranch had zero Q&A pairs — q5, q6, q40 now get real answers via search_resources and get_resource_hardware. Even the misspelled q40 ("reanch storage") resolves.
- q41 gets a real answer. "What allocation projects are using machine learning?" calls search_projects, returns 20 real projects with PIs and institutions.
- q31 ("What training resources does ACCESS offer?") routes to events and calls search_events, though the search returned empty results.
- Cross-cutting questions (q3, q7, q8, q26-q28, q32-q33, q38) fall to LLM synthesis. Neither RAG nor MCP covers these general ACCESS process questions. Answers read well but are ungrounded — could hallucinate.
The API response only exposes tools_used, confidence, execution_strategy, tool_count. We cannot tell from the response:
- What the classifier decided (static/dynamic/combined)
- Which graph nodes actually executed (e.g. did RAG fire and fail before falling to LLM?)
- RAG similarity scores for matched pairs
- Whether _rag_answer_is_weak triggered
- The plan content (if the planner node ran)
- MCP tool arguments and raw responses
The 12 "LLM-only" answers are a black box — we can't distinguish "classified as static, RAG returned nothing, fell through to LLM" from "classified as static, LLM answered directly without trying RAG." Adding a node_trace to QueryResponse is the immediate next step.
Interactive HTML comparison at ~/.agent/diagrams/a3-run5-comparison.html. Matches the Run 3/4 report format: KPI cards, filters, expandable side-by-side comparison. Note: hedge detection has a known issue — see below.
The report's hedge detection uses substring matching against phrases like "do not contain", "does not explicitly", etc. UKY q27 ("The provided documents do not specify...") is marked hedged but none of the exact phrases match — the detection was too aggressive. The h2h classification for q27 and potentially others needs review. Should align with access-agent/src/agent/graph.py:_rag_answer_is_weak() which uses the canonical hedge phrases.
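A substring check that also reports which phrase fired makes this kind of mismatch easy to audit. The sketch below uses only the two phrases quoted above; the canonical list lives in _rag_answer_is_weak() and is not reproduced here, and the function names are hypothetical.

```python
# Only the two phrases quoted in this entry; the real list is longer.
HEDGE_PHRASES = ("do not contain", "does not explicitly")


def matched_hedges(answer: str, phrases=HEDGE_PHRASES) -> list[str]:
    """Return the hedge phrases found in the answer (case-insensitive substring match)."""
    lowered = answer.lower()
    return [p for p in phrases if p in lowered]


def is_hedged(answer: str, phrases=HEDGE_PHRASES) -> bool:
    return bool(matched_hedges(answer, phrases))
```

With these two phrases, the q27 opening "The provided documents do not specify..." is not flagged, so whatever marked it as hedged in the report must be matching on something else.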
- a3_results/run5.json — 41 questions with full agent responses
- a3_results/uky_baseline_from_run4.json — UKY baseline (40 questions, q21 missing)
- a3_results/run_a3_test.py — test runner (updated for Run 5: captures full response, saves to JSON)
- Add node tracing to agent graph — track which nodes executed, classification result, RAG scores. Expose in QueryResponse.metadata.
- Re-run with tracing — Run 5b with node trace data, so we can see exactly how each question routes.
- Fix hedge detection — align report's hedge logic with _rag_answer_is_weak() from the agent codebase.
- Tune synthesis prompt — RAG_ONLY_SYNTHESIS_PROMPT produces thin answers when one pair matches. Add links, context.
- Curate 20-30 cross-cutting Q&A pairs — allocations, Globus, MFA, training, citation (fills the 12 LLM-only gaps).
cd /Users/josephbacal/Projects/sweet-and-fizzy/access-ci/access-qa-service && docker compose up -d
cd /Users/josephbacal/Projects/sweet-and-fizzy/access-ci/access-agent && docker compose up -d
Verify with docker ps — you should see access-agent-agent-1 (8000), qa-service-app (8001), and their postgres/redis containers.
curl -s -X POST http://localhost:8000/api/v1/query \
-H "Content-Type: application/json" \
-d '{"query": "What is ACES?", "session_id": "test", "question_id": "test-1"}' | python3 -m json.tool
Two commits on access-agent/main:
04342c8 — Added node_trace to AgentState as Annotated[list[dict[str, Any]], operator.add]. Each graph node appends a structured trace dict recording what it decided:
- classify: query_type, confidence, domain, rag_endpoint, reason, whether query was expanded
- rag_answer: source (uky/pgvector), match_count, best_score, rag_used, has_final_answer
- plan: requires_tools, tool_count, tool names, strategy
- execute: tools_called, succeeded, failed
- evaluate: is_helpful, reason
- recover: action taken, new tools selected
- synthesize: strategy, answer_length
- domain_agent: domain, tool_count
The /api/v1/query response includes classification summary and node_trace in metadata. 10 files changed (all node files + state.py + routes.py).
b7a9bec — Gated node_trace behind ?include_trace=true query parameter. OTel/Honeycomb (added Jan 2026, commit 422b92d) already provides full distributed tracing for ops. node_trace serves a different consumer: the eval harness needs trace data inline in the API response so it can programmatically inspect routing decisions without querying an external service. Nodes continue accumulating trace dicts in state (zero overhead), but the response only includes them when opted in.
- OTel/Honeycomb: Ops. Waterfall view of every span, LLM call, MCP tool. External service.
- node_trace: Eval. Inline in API response. Shows decisions (classifier output, RAG scores, tool selection) not timing. Consumable programmatically by the eval harness (Project D).
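The accumulate-then-gate pattern can be illustrated without the graph runtime. In this sketch `apply_update` is a stand-in for what the runtime does with an annotated reducer, and the node bodies and `build_response` are simplified, hypothetical versions of the real ones.

```python
import operator
from typing import Annotated, Any, TypedDict, get_type_hints


class AgentState(TypedDict, total=False):
    query: str
    # The reducer in the annotation says: concatenate each node's list onto
    # the existing one instead of overwriting it.
    node_trace: Annotated[list[dict[str, Any]], operator.add]


def classify(state: AgentState) -> dict:
    # Hypothetical node body: returns only its own trace entry.
    return {"node_trace": [{"node": "classify", "query_type": "static"}]}


def rag_answer(state: AgentState) -> dict:
    return {"node_trace": [{"node": "rag_answer", "match_count": 3}]}


def apply_update(state: AgentState, update: dict) -> AgentState:
    """Stand-in for the graph runtime: merge an update using annotated reducers."""
    hints = get_type_hints(AgentState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        reducer = getattr(hints.get(key), "__metadata__", (None,))[0]
        merged[key] = reducer(merged.get(key, []), value) if reducer else value
    return merged


def build_response(state: AgentState, include_trace: bool) -> dict:
    """Expose the trace only when the caller opted in with include_trace."""
    metadata = {"node_trace": state.get("node_trace", [])} if include_trace else {}
    return {"metadata": metadata}
```

Because each node returns only its own entry, nodes never need to read or copy the existing trace; the reducer handles the merge.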
Andrew suggested returning the top 5 Q&A pair matches (instead of just the best) and letting the synthesizer combine them — simpler than building document-chunk retrieval for cross-cutting queries. Analysis confirmed the pipeline already supported multiple matches end-to-end (RAG_TOP_K was 3, qa-service accepts up to 20, all downstream code iterates over the full list). The change was purely configuration + prompt.
config.py — RAG_TOP_K: 3 → 5. More material for the synthesizer, especially for union-type cross-cutting queries where related entity pairs from different resources can be combined.
synthesize.py — RAG_ONLY_SYNTHESIS_PROMPT rewritten:
- "Be concise and direct" → "Answer the question thoroughly"
- New guideline: when multiple knowledge entries are provided, synthesize into a unified answer
- URLs/links elevated to IMPORTANT (matching the tool-only and combined prompts)
- Added practical next steps guidance and support ticket link (both already present in the other prompts, missing here)
- Thin answers: Single-match answers were near-verbatim because the prompt said "be concise." Now the LLM is instructed to give a complete, actionable response with links and context.
- Union-type cross-cutting queries: "What resources support GPUs?" now gets 5 entity-scoped pairs (Delta, Bridges-2, ACES, etc.) and the prompt tells the LLM to combine them.
- Does NOT fix procedural cross-cutting: "How do I apply for an allocation?" still has no matching pairs at any score. These need hand-curated cross-cutting Q&A pairs (~5 questions).
Orientation after time away. Surveyed git state across all ACCESS-CI repos. Working trees clean except one uncommitted line in access-qa-bot/src/config/constants.ts (flipping BACKEND_ID default to 'access') — now redundant because Andrew's 3.6.0 release included the same change.
What Andrew shipped during the absence. The non-agentic-proxy train (Path A from 2026-04-10-status.md) was merged and published solo:
- qa-bot-proxy: per-backend Turnstile keys for multi-tenant validation + verified-session cookie to skip Turnstile on subsequent requests. Resolved the Turnstile site-key blocker flagged on 2026-04-10.
- qa-bot-core@0.2.36 (includes Shadow-DOM Turnstile portal fix).
- access-qa-bot@3.6.0 (includes the BACKEND_ID='access' fix).
- access-ci-ui@0.20.0 on upstream.
Four local feature branches are now obsolete (the three feature/non-agentic-proxy-2026-04-10 branches + chore/bump-access-qa-bot-3.5.2 on access-ci-ui fork). access-mcp main needs fast-forward to upstream (4 commits: Hono replacing Express for Claude Code HTTP transport). access-agent branch feature/personalization-phase-1-2 has 4 unpushed commits and is 6 behind origin/main; parked.
Planning meeting with Andrew — focus shift to reporting. Personalization work parked indefinitely. Immediate priority is producing evidence the new agent is better than current production (raw UKY) for Monday's review with Jim, Vikram (UKY), Shelly (CU).
Confirmed Andrew's src/eval/ pipeline is more than a bakeoff CLI: Postgres-backed EvalRun + EvalScore schema, LLM judge, three markdown report generators (generate_team_report, generate_leadership_report, generate_resource_report), bidirectional Argilla integration (argilla_push / argilla_pull), prod-switching infra via SSH tunnel (scripts/eval-prod, .env.eval.prod). Andrew's decisions/007-production-baseline-comparison.md (2026-04-08) specifies the comparison framework and go/no-go criteria.
Monday deliverable: two-way comparison, not three-way. Decision 007 specified three systems (raw_rag / agent_rag_only / agent_full). After Slack iteration, dropped agent_rag_only because (a) UKY scoped corpora aren't ready, stripping the main differentiator the middle path would show, and (b) an MCP-less agent is not a shippable product direction — Andrew confirmed "no way we will do an agent with just RAG." The spec's three-way framing is valid when all three systems are real product options; under current constraints the middle system measures mostly synthesis-prompt delta, which is not a stakeholder-facing story.
Result: 2 batteries × 2 routes = 4 eval runs. Reports emitted as markdown → PDF (no custom HTML — scientists don't want flash).
Work plan (Joe owns all):
- Add system="raw_rag" mode to src/eval/runner.py — calls uky_client directly, bypasses graph, returns RunResult. Prerequisite for the comparison.
- Run two systems against friendly_battery.json + real_user_battery.json; emit markdown reports → PDF. Push full-agent run to Argilla for spot-check.
- Persist duration_ms on EvalScore + Argilla metadata. Already captured in RunResult, dropped by scorer.py. Needs an Alembic migration for prod Postgres (create_all only creates missing tables, not new columns).
- Add agent_commit to Argilla record metadata in argilla_push.build_argilla_record(). Already on EvalRun; ~3 lines.
- Thread a resource field (read from question metadata) through run_question() → run_agent(resource_context=...) or uky_client.ask(rp_name=...) for raw_rag. Enables scoped RAG exercise. load_questions already captures unknown fields via metadata; capability_area fields (delta/expanse/anvil/etc.) often match RP slugs directly.
Held for later:
- Judge calibration via Argilla human annotations (want review rounds first).
- Live prod traffic → prod Argilla. Implementation sketch: sample % of queries in agent query handler, create EvalRun with run_type='production_sample', skip LLM judge (per Decision 006), push to dedicated dataset. Requires PII strip, rate limit, feature flag, and an agent-side deploy. Not needed for Monday.
Open questions surfaced:
- Whether UKY has per-RP scoped corpora actually indexed. Agent sends rp_name in the body; UKY's behavior not directly verifiable from this side. Quick curl test recommended before asserting per-resource quality claims.
- in_scope flag: specified in the code (uky_client.py:132, # None until UKY implements it) as an authoritative scoped-out-of-scope signal. Agent currently falls back to text-pattern heuristics in _rag_response_out_of_scope(). Not a Monday blocker; becomes important for reliable scoped-quality metrics once UKY ships it.
Side artifact: created HELPING_JOE_GET_IT.md at the access-ci root (local-only, not gist-synced) with three Mermaid diagrams clarifying the eval harness vs agent distinction and the three routing modes. Useful for future reorientations.
Picked up the no-classify path on feat/no-classify (sub-branch off feat/qwen-integration) where 2026-05-04 left off: 4 commits implementing USE_NO_CLASSIFY master switch, search_access_documents doc-search tool, centralized max_tokens, no-classify system prompt. Yesterday's 14-Q tool_coverage_battery eval was mixed-signal (composite +0.05 candidate, but compare-judge picked baseline). Goal today: validate against a richer battery, decide whether to ship.
21-Q no_classify_smoke_battery.json curated. Combined questions from tool_coverage_battery.yaml (10 tool-coverage + 4 RAG), combined_battery.json (5 detail-heavy/messy combined), friendly_battery.json and real_user_battery.json (2 domain-agent-territory). Preserved required_facts on tc-* questions so the judge applies them. Pulled per-question score history from Postgres to confirm no questions were dead weight: tc-events-01 and tc-software-01 perfect-5.0 across all historical runs (kept as proof), tc-allocations-02 sd=1.66 (the loose-match question Joe spent days iterating on), tc-rag-03 had a known 3.10 outlier from 2026-04-29 (generic Globus-101 answer with no ACCESS-specific anchors — kept as canary).
v1 eval pair (composite tied): baseline loop-20260505-144033-3943fb 4.82 / candidate loop-20260505-145419-e6546d 4.81. Both errored on tc-events-01 due to a 32k context-window overflow — events tool returning 50 items × ~1964 chars HTML descriptions = 96k chars (~24k tokens). Eyes-on review of the 20 scored questions surfaced an apparent regression on tc-software-01: candidate produced a 4-line "PyTorch is on ACES, Anvil. Check the docs" wave-off vs baseline's structured table with hostnames, version details, related packages.
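The payload arithmetic above can be turned into a rough budget check. This is a sketch: the 4-chars-per-token ratio is the one implied by the figures in this entry (96k chars ≈ 24k tokens), the window is taken as a round 32,000 tokens, and `reserved_tokens` is a hypothetical parameter for whatever completion budget the serving stack counts against the window.

```python
CHARS_PER_TOKEN = 4      # ratio implied by the figures above, not a tokenizer measurement
CONTEXT_WINDOW = 32_000  # tokens; the "32k" window treated as a round number


def estimate_tokens(num_chars: int) -> int:
    """Crude character-count estimate of token usage."""
    return num_chars // CHARS_PER_TOKEN


def headroom(tool_payload_chars: int, reserved_tokens: int = 0) -> int:
    """Tokens left for prompts and history after the tool payload and any reserved completion budget."""
    return CONTEXT_WINDOW - estimate_tokens(tool_payload_chars) - reserved_tokens
```

headroom(96_000) leaves 8,000 tokens for the system prompt, tool schemas, and conversation history. Reserving a 6,000-token completion budget (the MAX_TOKENS_LOOP value from the config commit, assuming it counts against the window) leaves 2,000.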
About to draft three rounds of prompt fixes for the wave-off when Joe redirected: "let's just run it again and see if the non-deterministic world of agents gives us a different result." Re-ran tc-software-01 with same code/prompt/model → 5.00/5.00, 1226-char rich table answer. The wave-off was a single non-deterministic sample, not a systemic regression. Saved lesson_rerun_before_designing_for_regression.md — agent runs are non-deterministic enough that a single bad sample can look systemic; always re-run before drafting fixes. Confirmation bias is real.
Andrew's PR #3 review caught a substantive bug. Five of six new see_all_urls verified correct; system-status pointed at support.access-ci.org/outages (portal mirror) instead of operations.access-ci.org/infrastructure_news_view (canonical). More important — Andrew noticed query_relevance was hardcoded "exact" in announcements + events even when filters.query was supplied. Both servers send filters.query to Drupal's search_api_fulltext (a fuzzy full-text backend with tokenization, stemming, relevance ranking). Hardcoding "exact" was actively defeating the agent's honest-framing prompt rule — the agent skipped the verification step and fabricated matches on fuzzy results. Same failure mode the prompt was written to prevent.
Three substantive MCP fixes landed on feat/listing-urls-in-tool-responses (PR #3 branch):
- ad9c98e system-status URL → operations.access-ci.org/infrastructure_news_view; query_relevance: filters.query ? "loose_match" : "exact" in announcements + events (mirrors software-discovery's existing pattern); test coverage on both branches in announcements + events (Andrew flagged "exact" path was asserted by zero tests across all servers); events compactDescription helper strips HTML and truncates per-item description.
- 4e9e6be events default limit lowered 50 → 20; DESCRIPTION_MAX_CHARS tightened 400 → 250. Resolves the 32k context overflow on tc-events-01: 50 items × ~1050 chars now ~50k chars (~12.5k tokens) instead of ~24k.
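The strip-and-truncate idea behind compactDescription can be sketched in Python. This is an illustrative re-rendering using only the standard library, not the helper that shipped in the MCP server; the function name and the "..." suffix are assumptions, and 250 is the DESCRIPTION_MAX_CHARS value after the second commit.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def compact_description(html: str, max_chars: int = 250) -> str:
    """Strip tags, collapse whitespace, and truncate to at most max_chars characters."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so text from adjacent block elements does not run together.
    text = " ".join(" ".join(parser.parts).split())
    if len(text) <= max_chars:
        return text
    return text[: max_chars - 3].rstrip() + "..."
```

Capping each item puts a hard bound on payload size: the item limit times the per-item cap, plus the other fields.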
v3 eval pair (after MCP fixes): baseline loop-20260505-172157-a8ddd2 4.89 / candidate loop-20260505-175022-f8cd0f 4.90. First eval where candidate beats baseline at the composite level. All 21 questions scored on both sides — events overflow gone. Three-day trend: 2026-05-04 (-0.05) → 2026-05-05 v1 (-0.01) → 2026-05-05 v3 (+0.01) as the surrounding system got cleaner. Compare-judge still picked baseline "small" — same judge subjectivity pattern, but eyes-on review of regressions shows they're either non-deterministic or judge-side noise (tc-allocations-02 1.55-point gap was the judge confusing "tool match count" with "actually about climate" — candidate's honest framing was intact). Report: https://access-ci-reports.netlify.app/no-classify-2026-05-05-v3.html
Latency picture is bimodal. Net candidate is ~16% faster (avg 15.7s/Q vs 18.8s/Q). Big wins on tool-heavy questions: tc-announce-01 saved 17s, comb-011 saved 17s, tc-xdmod-01 saved 16s, tc-status-01 saved 13s — baseline pays for classify + rag_answer + tool_calling_loop; candidate skips straight to the loop. Losses on RAG-only questions: tc-rag-03 (Globus) +23s slower, friendly-001 (SSH+2FA) +13s slower. Candidate calls search_access_documents itself, paying UKY's full /ask round-trip (5-15s, sometimes called twice in one Q). Verified via direct trace inspection: search_access_documents → uky_client.ask() makes ONE POST per call to ONE endpoint (general OR xdmod, never both). Latency is purely UKY's RAG inference (chunk retrieval + their LLM synthesis on their side). Vikram's /retrieve returns raw chunks at sub-second latency — collapses this gap entirely.
Two-step ship plan agreed (decision_no_classify_two_step_ship.md): Step 1 — flip USE_NO_CLASSIFY=true in prod via PR off feat/no-classify + .env.example default. One-line, reversible. Step 2 — delete the conditionality (classify, rag_answer_node, USE_TOOL_CALLING_LOOP flag, agent_full_legacy, master switch) in a follow-up cleanup PR, gated on (a) Andrew sign-off, (b) /retrieve shipped, (c) one more battery confirming no new wave-off failure modes. The conditional earns its keep as a comparison anchor — caught the tc-software-01 wave-off (and avoided designing for it) because we had a baseline to A/B against.
Pushed:
- access-mcp feat/listing-urls-in-tool-responses: ad9c98e (Andrew review fixes + events compactDescription) + 4e9e6be (events tightening). PR #3 ready for Andrew when Joe flips to "Ready for review."
- access-agent feat/no-classify: 0c8fc20 adds eval/questions/no_classify_smoke_battery.json (the 21-Q battery used today).
Both repos clean, no untracked files. PR for feat/no-classify itself NOT yet opened — first task next session.
Commits across all repos related to the Feb/March plan. Older commits omitted.
| Hash | Date | Message |
|---|---|---|
| c8fbf0b | 02-26 | docs: remove historical docs, update system overview for two-shot |
| 853e88f | 02-26 | replace GUIDED-TOUR with TRACE-TOUR signposts; fix software name casing |
| 00ba293 | 02-24 | prompt: add rule to quote long lowercase entity names in Q&A |
| 7b0590e | 02-24 | prompt: enhance rule 4 to check free-text fields; update review observations doc |
| 28be413 | 02-24 | fix: entity name interpolation + temporal language + coming-soon cleanup |
| 170e87d | 02-24 | docs: log full corpus scan results — quantify issues #1/#2, add issue #3 |
| 8336f45 | 02-24 | docs: move allocations:72170 finding to Patterns (positive, not an issue) |
| d7f57f5 | 02-24 | docs: log allocations:72170 as non-issue (Jurafsky in source data, verified) |
| 4f9c22d | 02-24 | docs: add retrieval surface area rationale to P1 (self-contained answers) |
| a4f7b66 | 02-24 | docs: note preferred fix for P1 — entity name interpolation in user prompt |
| 43e980e | 02-24 | docs: clarify P1 — entity name needed in both Q and A for RAG |
| 70f9424 | 02-24 | docs: add P1 pattern — questions must be self-contained (cross-cuts #1 and #2) |
| 6084c93 | 02-24 | docs: log issue #2 — decontextualized-question pattern (pervasive) |
| 07da145 | 02-24 | docs: log issue #1 — temporal-assumption in affinity-groups events |
| c4ec468 | 02-24 | docs: add qa-review-observations.md for tracking Argilla review issues |
| 6857db8 | 02-24 | docs: improve signpost comments + fix COMING SOON name normalization |
| 579e10d | 02-24 | fix: normalize "COMING SOON" resource names to lowercase |
| 7bd43ba | 02-24 | wip: some signpost comments |
| 3333c32 | 02-23 | docs: update guided-tour |
| 66e1819 | 02-20 | refactor: adopt two-shot as sole extraction strategy |
| 7803147 | 02-20 | fix: restore missing return in software_discovery._generate_qa_pairs |
| 7791e2b | 02-20 | feat: add --prompt-strategy flag for A/B/C extraction experiment |
| b662dc9 | 02-20 | feat: implement entity-replace for Argilla push |
| 80fc641 | 02-20 | docs: update plan with metadata on human actions on archive records |
| 9d54819 | 02-19 | fix(data-quality): separate NSF program fields and add per-domain LLM guidance |
| 39a4c06 | 02-19 | refactor: remove factoid templates and bonus generation (2-pass pipeline) |
| 5268caa | 02-19 | docs: reflect entity-replace decision and update README |
| 8c9e7f2 | 02-18 | docs: update all docs for freeform extraction pipeline and Argilla dedup |
| 4181585 | 02-18 | feat: roll out freeform extraction to all 5 extractors |
| da79f7d | 02-18 | feat: freeform extraction replaces category+bonus two-pass approach |
| 2833d7b | 02-18 | docs: update for Argilla metadata integration and test count |
| e6d08fa | 02-18 | feat(argilla): add eval_issues and source_ref to Argilla records |
| 3c762c9 | 02-18 | feat(argilla): push judge scores and granularity to Argilla metadata |
| 24c8373 | 02-17 | feat(judge): LLM judge evaluation scores for Q&A pair quality |
| 93a1fb2 | 02-17 | feat(bonus): LLM exploratory questions for entity-unique information |
| 068c08a | 02-17 | feat(incremental): hash-based change detection to skip unchanged entities |
| 9059614 | 02-17 | fix(factoids): data quality guards for template generation |
| 3662d8b | 02-13 | feat(generators): dual-granularity Q&A + extend comparisons to all 5 domains |
| fa2ff93 | 02-12 | fix(nsf-awards): normalize primaryProgram list + skip unused MCPClient |
| f3b1437 | 02-12 | feat(extractors): fixed question categories + direct API for allocations/nsf-awards |
| fdebdab | 02-12 | feat(software-discovery): switch from search terms to list_all_software |
| e33d006 | 02-11 | feat(extract): add max_entities cap for cheap test runs |
| 2da2c32 | 02-10 | Use real enumerations from taxonomies.ts for search terms |
| d987dee | 02-10 | Add report command for MCP coverage stats without LLM calls |
| 6c4667c | 02-10 | Add ExtractionConfig to centralize extraction parameters |
| 0b16ba8 | 02-04 | Fix Q&A pair ID collisions by appending question hash |
| cf384bc | 02-04 | Add Argilla integration for pushing Q&A pairs to human review |
| 51e9877 | 02-04 | Expand extraction queries, fix software-discovery, update docs |
| a69ce2e | 02-02 | Fix allocations and nsf-awards extractors returning 0 results |
| 038d42d | 02-02 | Add dedicated OpenAI backend (LLM_BACKEND=openai) |
| b557300 | 02-01 | Add LOCAL_DIRECTIONS.md and update .env.example for OpenAI setup |
| d45eda1 | 02-01 | Add NSFAwardsExtractor and register in CLI/validator |
| b67eba0 | 02-01 | Add AllocationsExtractor and register in CLI/validator |
| 18c0e49 | 01-31 | Add AffinityGroupsExtractor and fix MCP server port defaults |
| de28ab2 | 01-31 | Add CLAUDE.md and update README with local dev setup guide |

| Hash | Date | Message |
|---|---|---|
| 5b57ae0 | 02-28 | Fix Argilla sync to work with access-qa-extraction's dataset schema |

| Hash | Date | Message |
|---|---|---|
| ef43a21 | 03-12 | feat: return top-5 RAG matches and enrich synthesis prompt |
| b7a9bec | 03-12 | feat: gate node_trace behind ?include_trace query parameter |
| 04342c8 | 03-11 | feat: add node_trace to agent graph for execution path observability |
| de26e37 | — | feat: route pgvector through LLM synthesis + fair comparison logging |
| 08809ad | — | fix: lower RAG similarity thresholds — 0.85 was filtering valid matches |
| caf7256 | 02-28 | feat: add dual-RAG comparison logging for A.2 evaluation |

| Hash | Date | Message |
|---|---|---|
| bb3b54f | 02-04 | spike: Add list-all fallbacks to allocations and nsf-awards routers |

| Hash | Date | Message |
|---|---|---|
| a84fb4a | 02-26 | docs: GUIDED-TOUR.md → TRACE-TOUR.extract.md in file tree |
| 033c46e | 02-23 | docs: update mcp-extraction-impl to reflect two-shot pipeline and entity-replace |

| Hash | Date | Message |
|---|---|---|
| d5cb931 | 01-30 | chore: init claude file |