Trace-Level Analysis of Information Contamination in Multi-Agent Systems

Paper Reference

Title: Trace-Level Analysis of Information Contamination in Multi-Agent Systems
Authors: Anna Mazhar, Huzaifa Suri, Sainyam Galhotra
Venue: ACM Conference on AI and Agentic Systems (ACM CAIS '26), May 26–29, 2026, San Jose, CA
DOI: 10.1145/3786335.3813147
arXiv: 2604.27586
Repository: https://github.com/anna-mazhar/trace-level-contamination-mas

Summary

The paper studies how uncertainty in heterogeneous input artifacts (PDFs, spreadsheets, slide decks) propagates through multi-agent workflows. Key contribution: outcome-only evaluation systematically misses contamination-induced failures visible only at the trace level.

Across 614 paired runs on 32 GAIA tasks with three LLMs (GPT-5-mini, LLaMA-3.1-70B, Qwen3-235B), the authors find that structural divergence and outcome corruption are decoupled:

| Manifestation Type | Prevalence | Meaning |
|---|---|---|
| Silent semantic corruption | 15.3% | Trace looks normal, output is wrong |
| Behavioral detours with recovery | 40.3% | Trace diverges substantially, output is correct |
| Combined structural disruption | 39.9% | Both trace and output corrupted |
| No observable effect | 4.5% | Perturbation effectively ignored |
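
The four categories can be read as a two-by-two grid over "did the trace diverge?" and "is the output correct?". The sketch below encodes that reading. It assumes each run has already been reduced to two booleans; the paper's actual divergence thresholds are not reproduced here, and the names `Manifestation` and `classify_run` are hypothetical.

```python
from enum import Enum


class Manifestation(Enum):
    SILENT_SEMANTIC_CORRUPTION = "silent semantic corruption"
    DETOUR_WITH_RECOVERY = "behavioral detour with recovery"
    COMBINED_DISRUPTION = "combined structural disruption"
    NO_OBSERVABLE_EFFECT = "no observable effect"


def classify_run(trace_diverged: bool, output_correct: bool) -> Manifestation:
    """Map a paired run onto the four manifestation types.

    Assumption: `trace_diverged` and `output_correct` were computed upstream
    by comparing the perturbed run against its clean baseline.
    """
    if trace_diverged and output_correct:
        return Manifestation.DETOUR_WITH_RECOVERY
    if trace_diverged:
        return Manifestation.COMBINED_DISRUPTION
    if output_correct:
        return Manifestation.NO_OBSERVABLE_EFFECT
    # Stable trace, wrong output: the case outcome-blind trace checks miss.
    return Manifestation.SILENT_SEMANTIC_CORRUPTION
```

An outcome-only metric collapses this grid to its `output_correct` column, and a structure-only guardrail collapses it to `trace_diverged`. Neither one alone separates all four cells.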

Key Findings

Outcome-only metrics miss contamination. 40.3% of runs show major structural divergence yet recover correct answers. 15.3% maintain stable traces while producing wrong outputs.
Cost is a poor indicator of correctness. 76.2% of low-cost runs produce incorrect answers, and only 16.3% of high-cost runs succeed. Silent semantic corruption hides errors behind baseline-level costs.
Modality-specific failure signatures. Rerouting dominates tabular/document perturbations (80.6%). Audio uniquely favors early termination. Tabular perturbations trigger extended execution (24.4%).
First divergence timing reveals failure mechanism. Early divergence (t* < 0.1T) signals extraction failures. Late divergence (t* > 0.3T) reveals reasoning-stage sensitivity.
LLM backend shapes contamination response. GPT-5-mini shows behavioral detours with recovery in 48.6% of runs, LLaMA-3.1-70B in 35.4%. The same perturbations trigger different strategies, so model choice is a robustness lever.
Current guardrails fail. They assume contamination manifests as control-flow disruption. Silent semantic corruption (15.3%) preserves nominal execution while producing wrong outputs, evading all structural guardrails.
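
The first-divergence finding can be illustrated with a small sketch. It assumes a trace is a list of step labels, that t* is the index of the first step where the perturbed trace differs from the baseline, and that T is the baseline trace length. These representation choices, and the function names, are assumptions for illustration rather than the paper's implementation. Only the 0.1T and 0.3T cut-offs come from the summary above.

```python
from typing import Optional, Sequence


def first_divergence(baseline: Sequence[str],
                     perturbed: Sequence[str]) -> Optional[int]:
    """Return the index t* of the first differing step, or None if identical."""
    for i, (a, b) in enumerate(zip(baseline, perturbed)):
        if a != b:
            return i
    if len(baseline) != len(perturbed):
        # One trace is a strict prefix of the other: they diverge where
        # the shorter one ends.
        return min(len(baseline), len(perturbed))
    return None


def divergence_stage(t_star: Optional[int], total_steps: int) -> str:
    """Bucket t* relative to T using the thresholds quoted above."""
    if t_star is None:
        return "none"
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    fraction = t_star / total_steps
    if fraction < 0.1:
        return "early"          # points toward an extraction failure
    if fraction > 0.3:
        return "late"           # points toward reasoning-stage sensitivity
    return "intermediate"       # the summary does not label this band


baseline = [f"step{i}" for i in range(10)]
perturbed = baseline[:5] + ["reroute"] + baseline[6:]
t_star = first_divergence(baseline, perturbed)
stage = divergence_stage(t_star, len(baseline))
```

Note that a run with silent semantic corruption would return `None` here, which is exactly why timing analysis alone cannot catch it.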

Relevance to Our Workflow

Our pipeline (docling-pdf → update adversarial-review knowledge → review Flang compiler code) follows the same "heterogeneous artifact → multi-agent reasoning" pattern that the paper studies.

Threat Model

When docling extracts a compiler correctness PDF, subtle errors (off-by-one in formulas, dropped negation in a precondition, misaligned table columns for instruction semantics) can propagate without triggering any structural change in the adversarial review agent's behavior. The agent executes its normal review pattern but reaches wrong conclusions about code correctness.

This is "silent semantic corruption" — the most dangerous category because it is invisible to both cost monitoring and structural trace analysis.

What We Already Do Right

Dual extraction (docling + pdftotext fallback): Provides the redundancy that enables "behavioral detours with recovery."
Cross-model review (Claude Opus + GPT-5.4): Functions as contamination detection through behavioral divergence. If both models reach the same conclusion from the same extracted artifact, a model-specific misreading is less likely, although agreement cannot rule out corruption already present in the shared artifact. Disagreement is itself a contamination signal.
Margin line number workarounds in docling: Addresses exactly the class of extraction error that propagates silently.

Gaps to Address

No extraction validation at the docling→agent boundary. The paper's "first divergence point" analysis shows early validation catches the most damaging errors. We should validate extracted formulas/definitions against known terms before feeding downstream.
No "quote your sources" requirement. When the adversarial review agent cites a standard clause or correctness property from extracted literature, it should quote the exact extracted text. This surfaces garbled extractions in citations rather than letting them propagate into wrong review conclusions.
No semantic checkpoints for dual extraction. We extract with both docling and pdftotext but don't systematically compare outputs at semantic checkpoints (definitions, theorems, code listings) before feeding downstream.
No extraction confidence signals. The paper shows OCR noise (1.4× overhead, 21.7% recovery) is less dangerous than watermarks (2.1× overhead, 7.0% recovery). Different extraction failure modes warrant different responses.
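
As one possible shape for the missing semantic checkpoint, the sketch below compares the docling and pdftotext renderings of the same passage (a definition, a theorem statement) before it reaches the review agent. It uses only the standard-library `difflib`. The function name `compare_checkpoint`, the 0.9 similarity threshold, and the small negation word list are assumptions chosen for illustration, not tuned values.

```python
import difflib
import re

# Hypothetical word list; dropped negations are one of the failure modes
# named in the threat model.
NEGATIONS = {"not", "no", "never", "unless"}


def normalize(text: str) -> str:
    """Collapse whitespace and case so pure layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def negation_words(text: str) -> list:
    """Sorted list of negation words found in already-normalized text."""
    return sorted(w for w in re.findall(r"[a-z]+", text) if w in NEGATIONS)


def compare_checkpoint(docling_text: str, pdftotext_text: str,
                       threshold: float = 0.9) -> dict:
    """Flag a checkpoint when the two extractions disagree materially."""
    a, b = normalize(docling_text), normalize(pdftotext_text)
    similarity = difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()
    negation_mismatch = negation_words(a) != negation_words(b)
    return {
        "similarity": similarity,
        "negation_mismatch": negation_mismatch,
        "needs_review": similarity < threshold or negation_mismatch,
    }


# A single dropped "not" keeps character similarity high, so the explicit
# negation check is what catches it.
report = compare_checkpoint("The index must not exceed N.",
                            "The index must exceed N.")
```

Lowercasing is a deliberate simplification. For code listings, where case can be significant, a stricter normalization would be more appropriate.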

Proposed Diff: Adversarial Review Agent

The following section should be added to ~/.copilot/agents/adversarial-review.agent.md to address the "silent semantic corruption" threat identified in the paper.

Addition: After "## Grounding Rules" section (line 119), add:

## Artifact Provenance and Contamination Defense

When your review relies on knowledge derived from external literature (compiler correctness papers, language standards, MLIR documentation), apply these contamination-aware practices:

### Quote Your Sources

When citing a correctness property, standard clause, or semantic rule from extracted literature:
- Reproduce the exact text you are relying on (as extracted)
- If you cannot locate the exact wording, state that your claim is based on parametric knowledge rather than extracted evidence, and flag the confidence accordingly
- Never paraphrase a formal property without also providing the literal extracted form — paraphrasing can mask extraction errors

### Cross-Reference Extraction Artifacts

When the invoker provides context derived from PDF extraction (e.g., via docling):
- Treat extracted formulas, type signatures, and code listings as potentially corrupted inputs
- If a formal property seems surprising or contradicts your parametric knowledge, flag the discrepancy explicitly rather than deferring to the extracted text
- Check for common extraction failure modes: off-by-one in array indices, dropped negation operators, misaligned table columns, merged header cells, margin line numbers contaminating body text

### Contamination-Aware Confidence

Adjust finding confidence downward when:
- The finding depends on a single extracted artifact with no cross-reference
- The extracted text contains formatting artifacts (stray numbers, broken Unicode, suspicious whitespace)
- A formal property is only available in paraphrased form

Adjust finding confidence upward when:
- The property is confirmed by multiple independent sources (e.g., both extracted text and your parametric knowledge agree)
- The property is verified against executable test cases or compiler behavior
- Cross-model review independently reached the same conclusion from the same evidence

Rationale

This addresses the paper's central finding: "locally valid but globally corrupting" data passes local syntactic checks while distorting downstream computation. In our context:

A garbled extraction of a Fortran standard clause could cause the review agent to flag correct code as buggy (false positive) or miss a real semantic violation (false negative).
The "quote your sources" requirement makes silent semantic corruption visible at the review output layer.
The cross-reference requirement implements the paper's recommendation for "schema-level invariants at interface boundaries".
The confidence adjustment implements risk-proportional validation without adding cost to clean runs.
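
The "quote your sources" rule also lends itself to a mechanical check on the invoker side: every quote in a review finding should appear verbatim, up to whitespace, in the extracted text the agent was given. The sketch below is a minimal version under that assumption. The helper name `unverified_quotes` and the sample strings are hypothetical, and the sample clause is not real standard text.

```python
import re


def _squash(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not break matching."""
    return re.sub(r"\s+", " ", text).strip()


def unverified_quotes(review_quotes, extracted_text):
    """Return the quotes that cannot be found verbatim in the extraction.

    An empty quote is treated as unverified, because it carries no evidence.
    """
    haystack = _squash(extracted_text)
    missing = []
    for quote in review_quotes:
        needle = _squash(quote)
        if not needle or needle not in haystack:
            missing.append(quote)
    return missing


# Hypothetical extracted passage and two quotes taken from a review finding.
extracted = "Clause X: the loop index\n  shall not exceed N."
quotes = [
    "the loop index shall not exceed N.",   # matches after whitespace squash
    "the loop index shall exceed N.",       # paraphrase that lost the negation
]
missing = unverified_quotes(quotes, extracted)
```

This only confirms that a quote matches the extraction, not that the extraction is faithful to the PDF. Its value is that the literal extracted wording ends up in front of a human reviewer, where a garbled clause is easier to spot.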
