@cgwalters
Last active May 28, 2026 18:16
Concept for a generic agentic LLM tool to trim tool output

Agent-Directed Tool Result Trimming: Design and Caching Trade-offs

(AI: Significant AI usage went into generating and writing this)

The Problem

In long agentic sessions, tool outputs (build logs, file reads, search results) accumulate in the context window. Most content is consumed once — the agent reads it, acts on it, and never references it again (for example, a successful build may output a lot of text when the agent only needs to know "the build worked"). But it stays in context, consuming tokens on every subsequent LLM call and pushing the session toward compaction or context overflow.

Various agent frameworks like pi and goose impose hard output caps on their built-in tools (e.g. 50KB / 2000 lines for shell/bash), but that's not consistent across all tools. MCP servers may have their own limits, or may not. A JIRA keyword dump, an API response with hundreds of unrelated records, or a raw HTML page can burn enormous context that is useless for the rest of the task.

The Idea

Give the agent a tool to replace the most recent tool result with a short summary. The original content is preserved in session storage; only the LLM-facing context is modified. The agent decides what to trim and writes the summary itself.

The recommended implementation is to only allow trimming the last tool result to avoid cache invalidation problems with trimming older results.
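A minimal sketch of what such a tool could look like, assuming a simple in-memory session. The `Message` and `Session` types, the `originals` side store, and the `[trimmed by agent]` marker are hypothetical names chosen for illustration and are not taken from any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str   # the text the LLM sees for this message


@dataclass
class Session:
    messages: list[Message] = field(default_factory=list)
    # Hypothetical side storage: message index -> original, untrimmed output.
    originals: dict[int, str] = field(default_factory=dict)


def trim_tool_result(session: Session, summary: str) -> str:
    """Replace the most recent tool result with an agent-written summary."""
    # Only the last tool result is eligible, so older context is never rewritten.
    for index in range(len(session.messages) - 1, -1, -1):
        if session.messages[index].role == "tool":
            break
    else:
        return "error: no tool result to trim"

    # Preserve the original outside the LLM-facing context.
    session.originals[index] = session.messages[index].content
    session.messages[index].content = f"[trimmed by agent] {summary}"
    return f"trimmed message {index}"
```

Because the full output remains in `originals`, a framework could still show it to the user or restore it later; only the text sent to the model changes.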

Sample implementations


2. Other Background and Related Things

Other Related Ideas/Projects

  • rtk tries to rewrite well-known command invocations. On the positive side, this prevents the LLM from seeing the tokens at all. On the negative side, it hardcodes the output schema for the commands it does know (which is fragile because these tools do not promise that their output won't change), does not handle unknown commands at all, may unintentionally mask genuinely unusual output from a command, and crucially doesn't handle any non-shell tools (especially MCP servers).
  • https://openai.github.io/openai-agents-python/ref/extensions/tool_output_trimmer/ - but that one is not agent-driven
  • NousResearch/hermes-agent#415

Safe Prefix Cache Interactions across Model Servers

A common concern with retroactive context mutation is that it breaks LLM prefix caching (which matches byte-for-byte from the start of the prompt). However, when trimming is limited to the most recent tool result, the cached prefix before the edit stays valid on the major provider backends and self-hosted inference engines described below, so only the edited tail has to be recomputed.

Anthropic (Explicit Breakpoints): In Anthropic's caching architecture, the API checks for the longest matching prefix by walking backward up to 20 "positions" (content blocks) from the current request's breakpoints.

  • If an agent generates a 50,000-token tool output, the framework drops a cache breakpoint on that turn.
  • If the agent subsequently calls trim_tool_result to summarize that block down to 100 tokens, it breaks the cache hash for that specific block and everything after it.
  • However, on the next request, the API will walk backward, miss the mutated block, but hit the previous unmodified breakpoint.
  • The system successfully recovers the cache for the entire conversation history leading up to the edit. The 50,000 read tokens are durably eliminated from the context window, and you only pay a minimal cache-write cost for the new summary block.
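The recovery described above can be illustrated with a toy model. This is a simplified simulation under stated assumptions, not Anthropic's implementation: the cache is modeled as a set of prefix hashes written at earlier breakpoints, and `prefix_hash`, `cached_prefix_len`, and `LOOKBACK` are hypothetical names.

```python
import hashlib

LOOKBACK = 20  # lookback window in content blocks, as described in the text


def prefix_hash(blocks: list[str], n: int) -> str:
    """Hash of the first n content blocks, standing in for a cache key."""
    h = hashlib.sha256()
    for block in blocks[:n]:
        data = block.encode()
        h.update(len(data).to_bytes(8, "big"))  # length prefix avoids ambiguous joins
        h.update(data)
    return h.hexdigest()


def cached_prefix_len(blocks: list[str], bp: int, cache: set[str]) -> int:
    """Walk backward from breakpoint bp (a block count); return the cached prefix length."""
    lowest = max(1, bp - LOOKBACK + 1)
    for n in range(bp, lowest - 1, -1):
        if prefix_hash(blocks, n) in cache:
            return n
    return 0


history = ["system", "user: build", "assistant: call bash"]
big = history + ["tool: " + "log line\n" * 5000]
# Toy assumption: earlier requests wrote cache entries at these two breakpoints.
cache = {prefix_hash(history, 3), prefix_hash(big, 4)}

trimmed = history + ["tool: [trimmed] build succeeded"]
print(cached_prefix_len(trimmed, 4, cache))  # 3: the edited block misses, the prior prefix hits
```

In this model the first three blocks are still served from cache after the edit, and only the short summary block has to be written.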

OpenAI, Google Gemini, and vLLM (Implicit Prefix Alignment): OpenAI, Gemini (2.5+), and self-hosted engines like vLLM do not use explicit breakpoints. Instead, their backends automatically hash request prefixes into token-aligned blocks/chunks (typically ~1024 tokens for APIs, or smaller block sizes like 16 tokens in vLLM's Automatic Prefix Caching).

  • If you mutate a tool result at the end of a 100-turn conversation, only the chunk or KV-blocks containing that specific mutation are invalidated.
  • The cache for the preceding 99 turns remains completely intact because the prefix is byte-for-byte identical up to that block boundary.
  • On the next turn, the backend simply matches the hash of the shared prefix, reuses the cached KV states, and only computes attention for the new summarized tail.
  • Note: To hit these chunks more reliably on OpenAI, it helps to maintain shard locality by passing a consistent prompt_cache_key across requests.
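The block-aligned behavior can be sketched with chained block hashes, where each block's hash also covers its parent's hash. This is an illustrative model only; `block_hashes` and `reusable_tokens` are hypothetical helpers, tokens are plain integers, and real engines differ in hashing details.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per block; the text cites 16 as a vLLM-style block size


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block together with its parent hash, so a hash identifies a prefix."""
    hashes = []
    parent = ""
    full = len(tokens) - len(tokens) % block_size  # ignore the trailing partial block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def reusable_tokens(old: list[int], new: list[int], block_size: int = BLOCK_SIZE) -> int:
    """Number of leading tokens whose cached blocks can be reused for the new prompt."""
    shared = 0
    for a, b in zip(block_hashes(old, block_size), block_hashes(new, block_size)):
        if a != b:
            break
        shared += 1
    return shared * block_size


history = list(range(1600))        # 100 full blocks of earlier conversation
old = history + [7] * 800          # prompt with the large tool output at the end
new = history + [9] * 10           # same history, output replaced by a short summary
print(reusable_tokens(old, new))   # 1600: every block before the edit is reusable
```

Only blocks at or after the mutation change their hashes, which is why trimming the tail leaves the earlier turns cached.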

Google Gemini (Explicit CachedContent API): (Exception) Gemini also offers a separate, out-of-band REST API where users can upload context to get a cache handle. This is an all-or-nothing system designed for large static corpora (32,768+ tokens). Retroactive mutation is not supported here; mutating a turn would require deleting and re-uploading the entire cached object.

Prior Art in Existing Tools

Aider: Multi-Model Background Summarization and Binary-Search Repo-Mapping

Aider implements two distinct, highly optimized context-management paradigms:

  1. Retroactive Asynchronous Chat Summarization (ChatSummary): Dynamically calculates a soft token limit for chat history. When the limit is crossed, it launches a background thread, without blocking the session, to summarize older conversation turns (the head) from the first-person perspective of the user (e.g., "I spoke to you previously about X...") while keeping recent turns (the tail) as-is.
  2. Proactive Binary-Search Repo-Mapping (get_ranked_tags_map): To fit global repo symbols within a rigid token budget, Aider ranks code symbols via tree-sitter. It performs a binary search on the subset of ranked tags, counting tokens on each iteration until it lands within a 15% error tolerance of the target budget.
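The budget-fitting search in item 2 can be sketched as follows. This is not Aider's code: `fit_ranked_tags` and `count_tokens` are hypothetical, the tokenizer is a whitespace word count used as a stand-in, and unlike the description this simplified version never exceeds the budget.

```python
def count_tokens(text: str) -> int:
    """Stand-in tokenizer: whitespace-separated words (an assumption, not a real tokenizer)."""
    return len(text.split())


def fit_ranked_tags(ranked_tags: list[str], budget: int, tolerance: float = 0.15) -> list[str]:
    """Binary-search how many top-ranked tags fit within the token budget."""
    lo, hi = 0, len(ranked_tags)
    best: list[str] = []
    best_tokens = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        candidate = ranked_tags[:mid]
        tokens = count_tokens("\n".join(candidate))
        if tokens <= budget:
            if tokens >= best_tokens:
                best, best_tokens = candidate, tokens
            if budget - tokens <= tolerance * budget:
                break  # within the error tolerance of the target; stop early
            lo = mid + 1  # room to spare: try including more tags
        else:
            hi = mid - 1  # over budget: try fewer tags
    return best


# Each hypothetical tag line below counts as 3 tokens under the stand-in tokenizer.
tags = [f"def func_{i}(arg): ..." for i in range(100)]
selected = fit_ranked_tags(tags, budget=100)
print(len(selected), count_tokens("\n".join(selected)))  # 30 90
```

Stopping inside the tolerance band trades a slightly under-filled map for fewer token-counting passes, which matters when counting is the expensive step.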

opencode: Automated Token-Based Pruning (superseded)

OpenCode previously implemented a rule-based retroactive pruning mechanism:

  • Heuristics & Triggering: Monitors completed tool outputs. When their collective token size exceeds PRUNE_PROTECT (40,000 tokens), it walks backwards through turn history to find and flag older completed tool parts as compacted.
  • Content Replacement: During model prompt assembly, any flagged tool block has its entire raw output string deleted and replaced with a static text placeholder, "[Old tool result content cleared]", and all associated media attachments are stripped.
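A rough sketch of the two steps above, under stated assumptions: `ToolPart`, `flag_old_tool_parts`, and `render` are hypothetical names, and the choice to leave the part that crosses the threshold unflagged is a guess rather than documented OpenCode behavior.

```python
from dataclasses import dataclass

PRUNE_PROTECT = 40_000  # token threshold named in the text
PLACEHOLDER = "[Old tool result content cleared]"


@dataclass
class ToolPart:
    output: str
    tokens: int
    compacted: bool = False


def flag_old_tool_parts(parts: list[ToolPart], protect: int = PRUNE_PROTECT) -> int:
    """Walk newest-to-oldest; once `protect` tokens have been seen, flag older parts."""
    seen = 0
    flagged = 0
    for part in reversed(parts):
        if part.compacted:
            continue  # already cleared on an earlier pass
        if seen >= protect:
            part.compacted = True
            flagged += 1
        seen += part.tokens
    return flagged


def render(part: ToolPart) -> str:
    """Text used during prompt assembly: the placeholder for flagged parts."""
    return PLACEHOLDER if part.compacted else part.output
```

For example, with three 30,000-token outputs the two newest stay intact and the oldest is rendered as the placeholder.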

This was replaced by agent-directed trimming (trim_tool_result), and the background pruner is now off by default. The pruner breaks the KV cache across the whole session and removes content without the model's knowledge or consent.

Anthropic Messages API: Native Context Compaction Beta

Anthropic has introduced a first-party, native Messages API beta feature (compact-2026-01-12) that offloads conversation compaction directly onto the server-side model provider:

  • API Halt & State Block: When input tokens cross the threshold, the API immediately halts token generation and returns with stop_reason: "compaction". The very first content block in the returned array is a model-generated summary block (compaction_block). The client discards intermediate older messages, preserves the most recent turns, and resubmits the request to continue.

Claude Code: Progressive Auto-Compaction, Trident Pipeline, and Boundary-Safe Compaction

Claude Code is not open source, but based on observed behavior it features a comprehensive, multi-tiered session compaction strategy:

  1. Deterministic Standard Compaction: Compiles older messages locally and deterministically (counting message roles and unique tool invocations, and parsing file extensions) into a timeline where each line is truncated to 160 characters. It also includes boundary guards to ensure orphaned tool results don't cause API compilation errors.
  2. The "Trident" Pre-Compaction Pipeline:
    • Stage 1 (Supersede) — Zero-Cost Factual Pruning: If a file is edited, earlier read/write operations for that path are completely deleted from history.
    • Stage 2 & 3: Groups consecutive short messages and semantic clusters, replacing them with single system notifications.
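The Stage 1 (Supersede) rule can be illustrated with a small sketch. Since Claude Code's internals are not public, everything here is an assumption: `FileOp` and `supersede` are hypothetical, and the rule is modeled simply as dropping reads and writes that precede a later edit of the same path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileOp:
    kind: str  # "read", "write", or "edit"
    path: str


def supersede(ops: list[FileOp]) -> list[FileOp]:
    """Drop read/write operations that are followed by a later edit of the same path."""
    edited_later: set[str] = set()
    kept: list[FileOp] = []
    for op in reversed(ops):  # newest first, so later edits are seen before older ops
        if op.kind in ("read", "write") and op.path in edited_later:
            continue  # superseded by a later edit of this path
        if op.kind == "edit":
            edited_later.add(op.path)
        kept.append(op)
    kept.reverse()  # restore chronological order
    return kept
```

Operations that come after the edit are kept, since they reflect the file's current state.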

Proprietary Editors: Cursor, Windsurf, and Copilot Workspace

  • Cursor (Semantic RAG and Sliding History): Uses interactive semantic retrieval (embeddings-based RAG) to dynamically slice and send only relevant code chunks instead of whole file buffers.
  • Windsurf (Selective Output Truncation): Aggressively prunes raw, heavy terminal outputs and tool results while keeping active conversational history intact.
  • GitHub Copilot Workspace: Decouples chat history from the target objective, keeping structured "Plan", "Specification", and "Files to modify" in dedicated immutable prompt sections.