- Title: Trace-Level Analysis of Information Contamination in Multi-Agent Systems
- Authors: Anna Mazhar, Huzaifa Suri, Sainyam Galhotra
- Venue: ACM Conference on AI and Agentic Systems (ACM CAIS '26), May 26–29, 2026, San Jose, CA
- DOI: 10.1145/3786335.3813147
- arXiv: 2604.27586
- Repository: https://github.com/anna-mazhar/trace-level-contamination-mas
The paper studies how uncertainty in heterogeneous input artifacts (PDFs, spreadsheets, slide decks) propagates through multi-agent workflows. Key contribution: outcome-only evaluation systematically misses contamination-induced failures visible only at the trace level.
Across 614 paired runs on 32 GAIA tasks with three LLMs (GPT-5-mini, LLaMA-3.1-70B, Qwen3-235B), they find structural divergence and outcome corruption are decoupled:
| Manifestation Type | Prevalence | Meaning |
|---|---|---|
| Silent semantic corruption | 15.3% | Trace looks normal, output is wrong |
| Behavioral detours with recovery | 40.3% | Trace diverges substantially, output is correct |
| Combined structural disruption | 39.9% | Both trace and output corrupted |
| No observable effect | 4.5% | Perturbation effectively ignored |
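
The four rows follow from two observations per paired run: whether the trace diverged from its clean baseline, and whether the final output stayed correct. A minimal sketch of that mapping, assuming a hypothetical scalar `trace_divergence` score in [0, 1] and an illustrative `threshold` (the paper's actual divergence metric and cutoffs are not reproduced here):

```python
from enum import Enum


class Manifestation(Enum):
    SILENT_SEMANTIC_CORRUPTION = "silent semantic corruption"
    DETOUR_WITH_RECOVERY = "behavioral detour with recovery"
    COMBINED_DISRUPTION = "combined structural disruption"
    NO_OBSERVABLE_EFFECT = "no observable effect"


def classify_run(trace_divergence: float, output_correct: bool,
                 threshold: float = 0.2) -> Manifestation:
    """Map (trace divergence, output correctness) to a row of the table.

    `threshold` is an illustrative assumption, not a value from the paper.
    """
    diverged = trace_divergence >= threshold
    if diverged and output_correct:
        return Manifestation.DETOUR_WITH_RECOVERY
    if diverged:
        return Manifestation.COMBINED_DISRUPTION
    if output_correct:
        return Manifestation.NO_OBSERVABLE_EFFECT
    # Stable trace, wrong output: nothing structural to alert on.
    return Manifestation.SILENT_SEMANTIC_CORRUPTION
```

The last branch is the one that matters for monitoring: it is reachable only when correctness is checked independently of trace shape.
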
- Outcome-only metrics miss contamination. 40.3% of runs show major structural divergence yet recover correct answers. 15.3% maintain stable traces while producing wrong outputs.
- Cost is a poor indicator of correctness. 76.2% of low-cost runs produce incorrect answers. Only 16.3% of high-cost runs succeed. Silent semantic corruption disguises errors with baseline costs.
- Modality-specific failure signatures. Rerouting dominates tabular/document perturbations (80.6%). Audio uniquely favors early termination. Tabular perturbations trigger extended execution (24.4%).
- First divergence timing reveals failure mechanism. Early divergence (t* < 0.1T) signals extraction failures. Late divergence (t* > 0.3T) reveals reasoning-stage sensitivity.
- LLM backend shapes contamination response. GPT-5-mini: 48.6% behavioral detours with recovery. LLaMA-3.1-70B: 35.4%. Same perturbations trigger different strategies: model choice is a robustness lever.
- Current guardrails fail. They assume contamination manifests as control-flow disruption. Silent semantic corruption (15.3%) preserves nominal execution while producing wrong outputs, evading all structural guardrails.
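
The first-divergence finding above can be made concrete with a small sketch. It assumes a trace is a sequence of step labels, that t* is the index of the first step where the perturbed trace differs from its baseline, and that T is the baseline trace length; the function names and the "intermediate" bucket are illustrative, not taken from the paper.

```python
from typing import Optional, Sequence


def first_divergence(baseline: Sequence[str],
                     perturbed: Sequence[str]) -> Optional[int]:
    """Return the first step index where the traces differ, or None."""
    for i, (a, b) in enumerate(zip(baseline, perturbed)):
        if a != b:
            return i
    # One trace is a strict prefix of the other: they diverge where it ends.
    if len(baseline) != len(perturbed):
        return min(len(baseline), len(perturbed))
    return None


def divergence_stage(t_star: Optional[int], total_steps: int) -> str:
    """Bucket t* using the 0.1T and 0.3T cutoffs quoted above."""
    if t_star is None:
        return "no divergence"
    ratio = t_star / total_steps
    if ratio < 0.1:
        return "early (extraction-stage)"
    if ratio > 0.3:
        return "late (reasoning-stage)"
    return "intermediate"
```

Note that a run with no divergence is not necessarily a clean run: silent semantic corruption falls in exactly that bucket.
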
Our pipeline (docling-pdf → update adversarial-review knowledge → review Flang compiler code) is exactly the "heterogeneous artifact → multi-agent reasoning" pattern the paper studies.
When docling extracts a compiler correctness PDF, subtle errors (off-by-one in formulas, dropped negation in a precondition, misaligned table columns for instruction semantics) can propagate without triggering any structural change in the adversarial review agent's behavior. The agent executes its normal review pattern but reaches wrong conclusions about code correctness.
This is "silent semantic corruption" — the most dangerous category because it is invisible to both cost monitoring and structural trace analysis.
- Dual extraction (docling + pdftotext fallback): Provides the redundancy that enables "behavioral detours with recovery."
- Cross-model review (Claude Opus + GPT-5.4): Functions as contamination detection through behavioral divergence. If both models reach the same conclusion from the same extracted artifact, silent corruption is less likely. Disagreement is itself a contamination signal.
- Margin line number workarounds in docling: Addresses exactly the class of extraction error that propagates silently.
- No extraction validation at the docling→agent boundary. The paper's "first divergence point" analysis shows early validation catches the most damaging errors. We should validate extracted formulas/definitions against known terms before feeding downstream.
- No "quote your sources" requirement. When the adversarial review agent cites a standard clause or correctness property from extracted literature, it should quote the exact extracted text. This surfaces garbled extractions in citations rather than letting them propagate into wrong review conclusions.
- No semantic checkpoints for dual extraction. We extract with both docling and pdftotext but don't systematically compare outputs at semantic checkpoints (definitions, theorems, code listings) before feeding downstream.
- No extraction confidence signals. The paper shows OCR noise (1.4× overhead, 21.7% recovery) is less dangerous than watermarks (2.1× overhead, 7.0% recovery). Different extraction failure modes warrant different responses.
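
The missing semantic-checkpoint comparison could start as a token-level diff between the two extractions. The sketch below assumes each checkpoint (a definition, theorem, or listing) has already been located in both outputs and is passed in as a pair of strings; the function names and the checkpoint dictionary shape are hypothetical. It uses a token diff rather than a similarity ratio because a dropped negation changes only one token and would barely move a ratio.

```python
import difflib
from typing import Dict, List, Tuple

Diff = Tuple[str, str, str]  # (opcode tag, docling tokens, pdftotext tokens)


def token_diff(docling_text: str, pdftotext_text: str) -> List[Diff]:
    """Return token-level differences, ignoring whitespace and line breaks."""
    ta, tb = docling_text.split(), pdftotext_text.split()
    matcher = difflib.SequenceMatcher(None, ta, tb, autojunk=False)
    return [(tag, " ".join(ta[i1:i2]), " ".join(tb[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def disputed_checkpoints(
        checkpoints: Dict[str, Tuple[str, str]]) -> Dict[str, List[Diff]]:
    """Map checkpoint name -> differences, for checkpoints that disagree."""
    disputed = {}
    for name, (docling_text, pdftotext_text) in checkpoints.items():
        diffs = token_diff(docling_text, pdftotext_text)
        if diffs:
            disputed[name] = diffs
    return disputed
```

Agreement between the two extractors does not prove the text is right, since both can fail the same way on the same PDF, but any disputed checkpoint is a cheap signal to hold back from the review agent until checked.
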
The following section should be added to ~/.copilot/agents/adversarial-review.agent.md to address the "silent semantic corruption" threat identified in the paper.
## Artifact Provenance and Contamination Defense
When your review relies on knowledge derived from external literature (compiler correctness papers, language standards, MLIR documentation), apply these contamination-aware practices:
### Quote Your Sources
When citing a correctness property, standard clause, or semantic rule from extracted literature:
- Reproduce the exact text you are relying on (as extracted)
- If you cannot locate the exact wording, state that your claim is based on parametric knowledge rather than extracted evidence, and flag the confidence accordingly
- Never paraphrase a formal property without also providing the literal extracted form — paraphrasing can mask extraction errors
### Cross-Reference Extraction Artifacts
When the invoker provides context derived from PDF extraction (e.g., via docling):
- Treat extracted formulas, type signatures, and code listings as potentially corrupted inputs
- If a formal property seems surprising or contradicts your parametric knowledge, flag the discrepancy explicitly rather than deferring to the extracted text
- Check for common extraction failure modes: off-by-one in array indices, dropped negation operators, misaligned table columns, merged header cells, margin line numbers contaminating body text
### Contamination-Aware Confidence
Adjust finding confidence downward when:
- The finding depends on a single extracted artifact with no cross-reference
- The extracted text contains formatting artifacts (stray numbers, broken Unicode, suspicious whitespace)
- A formal property is only available in paraphrased form
Adjust finding confidence upward when:
- The property is confirmed by multiple independent sources (e.g., both extracted text and your parametric knowledge agree)
- The property is verified against executable test cases or compiler behavior
- Cross-model review independently reached the same conclusion from the same evidence

This addresses the paper's central finding: "locally valid but globally corrupting" data passes local syntactic checks while distorting downstream computation. In our context:
- A garbled extraction of a Fortran standard clause could cause the review agent to flag correct code as buggy (false positive) or miss a real semantic violation (false negative)
- The "quote your sources" requirement makes silent semantic corruption visible at the review output layer
- The cross-reference requirement implements the paper's recommendation for "schema-level invariants at interface boundaries"
- The confidence adjustment implements risk-proportional validation without adding cost to clean runs
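
The "quote your sources" requirement can also be checked mechanically on the review output. The sketch below assumes a hypothetical convention in which the agent writes each literal quote as a single-line Markdown blockquote (`> ...`); it reports quotes that do not appear in the extracted text once whitespace and Unicode compatibility forms are normalized. It confirms only that a quote is literal, not that the extraction itself is correct.

```python
import re
import unicodedata
from typing import List


def _normalize(text: str) -> str:
    """NFKC-normalize and collapse whitespace; wording is left untouched."""
    return re.sub(r"\s+", " ", unicodedata.normalize("NFKC", text)).strip()


def unverified_quotes(review: str, extracted: str) -> List[str]:
    """Return blockquote lines in `review` not found verbatim in `extracted`."""
    haystack = _normalize(extracted)
    # Assumed convention: one quote per line, introduced by "> ".
    quotes = re.findall(r"^>[ \t]?(.+)$", review, flags=re.MULTILINE)
    return [q for q in quotes if _normalize(q) not in haystack]
```

A non-empty result means the agent paraphrased or relied on parametric knowledge while presenting it as a quote, which is the case the confidence rules above ask it to flag.
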