Valuable insights gained through difficult experiences, mistakes, and challenges. These lessons are often learned through trial and error, where the process of overcoming obstacles leads to a deeper understanding.
Here are some examples of hard-won lessons.
Blog post: Context Engineering for AI Agents: Lessons from Building Manus by Manus AI (Jul 18, 2025)
Tweet: https://x.com/ManusAI_HQ/status/1946291647849144516
After four overhauls and millions of real-world sessions, here are the lessons we learned about context engineering for AI agents
A behind-the-scenes look at the complexities of context engineering. The blog post shares practical, empirically derived patterns that operationalize agent context management. These include:
- Engineer for high KV-cache hit rates: keep the context stable and append-only so cached prefixes stay valid, which pays off in latency and cost (sketch below)
- Mask, don't remove, tool choices: instead of using RAG to select available tools, mask logits so the model cannot generate undesirable tool calls; this keeps the agent loop stable and preserves the cache (sketch below)
- Treat the file system not just as storage but as the ultimate context (structured, externalized memory): instead of compressing context, offload to files for effectively unlimited context length, and keep file paths in context so everything stays reachable by the agent (sketch below)
- Actively manipulate LLM attention via recitation: recitation reduces goal misalignment. Manus creates a todo.md file and updates it step by step as the task progresses, a deliberate mechanism to manipulate attention
- Leave error traces in context for adaptation: error recovery is one of the clearest indicators of true agentic behavior
- Introduce diversity to avoid overfitting to few-shot prompts: don't few-shot yourself into a rut; the more uniform your context, the more brittle your agent becomes
In short:
- KV-cache optimization as a first-class pattern
- Logit masking instead of context mutation
- File system as structured agent memory
- Active recitation for attention manipulation
- Error retention as learning substrate
- Anti-few-shot drift via pattern randomization
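To make the first pattern concrete, here is a minimal sketch of a KV-cache-friendly, append-only context. It assumes a generic chat-completion-style message list; the class and helper names are my own, not Manus's actual code.

```python
import json

SYSTEM_PROMPT = (
    "You are an agent that solves tasks step by step."
    # No timestamps or per-request values here: anything that changes at the
    # start of the prompt invalidates the cached prefix on every turn.
)

class AgentContext:
    def __init__(self, tools: list[dict]):
        # Tool definitions are serialized once, deterministically, and never
        # reordered or removed mid-task (that would break the prefix cache).
        self.messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "system", "content": json.dumps(tools, sort_keys=True)},
        ]

    def append(self, role: str, content: str) -> None:
        # Append-only: earlier messages are never mutated or compacted, so
        # every turn shares the longest possible cached prefix.
        self.messages.append({"role": role, "content": content})

ctx = AgentContext(tools=[{"name": "browser", "description": "Fetch a URL"}])
ctx.append("user", "Summarize https://example.com")
ctx.append("assistant", '{"tool": "browser", "args": {"url": "https://example.com"}}')
ctx.append("tool", "...page text...")
# ctx.messages is what gets sent to the model each turn.
```

The point is simply that nothing before the tail ever changes: no timestamps in the system prompt, no reordering of tool definitions, no in-place compaction, so every turn reuses the longest possible cached prefix.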
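For the "mask, don't remove" pattern, a toy sketch of logit masking over tool choices. Real constrained decoding happens inside the inference stack; the array below is just a stand-in for token-level logits, and the tool names are hypothetical.

```python
import numpy as np

TOOLS = ["browser_open", "browser_click", "shell_exec", "file_write"]

def mask_tool_logits(logits: np.ndarray, allowed: set[str]) -> np.ndarray:
    # Disallowed tools get -inf so they can never be chosen; their definitions
    # stay untouched in the (cached) prompt prefix.
    masked = logits.astype(float).copy()
    for i, name in enumerate(TOOLS):
        if name not in allowed:
            masked[i] = -np.inf
    return masked

def pick_tool(logits: np.ndarray, allowed: set[str]) -> str:
    return TOOLS[int(np.argmax(mask_tool_logits(logits, allowed)))]

# Example: while the agent is in a browsing state, only browser tools may fire,
# even if the raw logits favour shell_exec.
raw = np.array([1.2, 0.4, 2.9, 0.1])
print(pick_tool(raw, allowed={"browser_open", "browser_click"}))  # -> browser_open
```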
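And for the file-system-as-memory and recitation patterns together, a sketch with a hypothetical file layout and helpers; the only thing taken from the post is the idea of keeping paths (not payloads) in context and rewriting todo.md each step.

```python
from pathlib import Path

WORKDIR = Path("agent_workspace")
WORKDIR.mkdir(exist_ok=True)

def offload(name: str, content: str) -> str:
    # Store bulky content on disk; the context only carries the path, which is
    # enough for the agent to re-read it later through a file tool.
    path = WORKDIR / name
    path.write_text(content)
    return f"[stored at {path}, {len(content)} chars]"

def recite_todo(done: list[str], pending: list[str]) -> str:
    # Rewrite todo.md each step and append it to the context, so the current
    # plan is restated near the end of the prompt instead of drifting out of focus.
    todo = "# todo\n"
    todo += "".join(f"- [x] {t}\n" for t in done)
    todo += "".join(f"- [ ] {t}\n" for t in pending)
    (WORKDIR / "todo.md").write_text(todo)
    return todo

context_note = offload("page_001.html", "<html>... a very long page ...</html>")
recitation = recite_todo(done=["fetch page"], pending=["extract prices", "write report"])
# Only `context_note` and `recitation` get appended to the message list;
# the raw page contents never do.
```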
Interesting research direction, quoting the blog post (emphasis mine):
While developing this feature, I found myself imagining what it would take for a State Space Model (SSM) to work effectively in an agentic setting. Unlike Transformers, SSMs lack full attention and struggle with long-range backward dependencies. But if they could master file-based memory—externalizing long-term state instead of holding it in context—then their speed and efficiency might unlock a new class of agents. Agentic SSMs could be the real successors to Neural Turing Machines.
- Does Manus's context-engineering work lead to token-efficiency savings for end users?
Blog post: How Long Contexts Fail by Drew Breunig (Jun 22, 2025)
- In reality, longer contexts do not generate better responses
Overloading your context can cause your agents and applications to fail in surprising ways. Contexts can become poisoned, distracting, confusing, or conflicting.
- Ways contexts can get out of hand:
The Pokémon-playing Gemini agent demonstrated these problems. (I read the Gemini 2.5 Pro tech report and tweeted about some of these before. Love to see these examples again.)
- Context Poisoning is when a hallucination or other error makes it into the context, where it is repeatedly referenced.
- Context Distraction is when a context grows so long that the model over-focuses on the context, neglecting what it learned during training.
If models start to misbehave long before their context windows are filled, what’s the point of super large context windows? In a nutshell: summarization and fact retrieval. If you’re not doing either of those, be wary of your chosen model’s distraction ceiling.
- Context Confusion is when superfluous content in the context is used by the model to generate a low-quality response.
- Be careful: MCP can be a curse. In Claude Code I only connect to two MCP servers, Puppeteer and Context7. People ask me why I don't use ten other MCP clients/servers (mitsuhiko makes a good point about this in his video). Young blood don't understand why "less is more" (they say the future is abundance! lol).
- Context Clash is when you accrue new information and tools in your context that conflict with other information in the context.
As we've seen, bigger contexts create new failure modes. These failures hit agents hardest because agents operate in exactly the scenarios where contexts balloon.
The article is good because its claims are backed by data from research papers and reports.
Blog post: How to Fix Your Context by Drew Breunig (Jun 22, 2025)
This is the follow-up to the earlier post, "How Long Contexts Fail".
The ways we can mitigate or avoid these context failures:
- RAG is the act of selectively adding relevant information to help the LLM generate a better response.
- Tool Loadout is the act of selecting only relevant tool definitions to add to your context (sketch below).
- Context Quarantine is the act of isolating contexts in their own dedicated threads, each used separately by one or more LLMs.
- Context Pruning is the act of removing irrelevant or otherwise unneeded information from the context.
- Context Summarization is the act of boiling down an accrued context into a condensed summary.
- Context Offloading is the act of storing information outside the LLM's context, usually via a tool that stores and manages the data.
Again, Anthropic has a good write-up of this technique, detailing their “think” tool, which is basically a scratchpad: ...
Having a space to log notes and progress works. Anthropic shows that pairing the “think” tool with a domain-specific prompt (which you’d do anyway in an agent) yields significant gains, up to a 54% improvement against a benchmark for specialized agents.
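A minimal sketch of that scratchpad pattern, with my own tool spec and class names (Anthropic's write-up defines the actual "think" tool; this only shows the shape of the idea):

```python
from dataclasses import dataclass, field

# Tool spec the model would see (schema shape only; not Anthropic's exact text).
THINK_TOOL_SPEC = {
    "name": "think",
    "description": "Jot down reasoning or a plan before acting. "
                   "Does not fetch new information or change anything.",
    "parameters": {
        "type": "object",
        "properties": {"thought": {"type": "string"}},
        "required": ["thought"],
    },
}

@dataclass
class Scratchpad:
    notes: list[str] = field(default_factory=list)

    def think(self, thought: str) -> str:
        # The tool has no side effects beyond logging the note; its value is
        # giving the model a sanctioned place to pause and offload reasoning.
        self.notes.append(thought)
        return "Noted."

pad = Scratchpad()
pad.think("Refunds over $100 need manager approval; check the order total first.")
print(pad.notes)
```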
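And for the Tool Loadout tactic above, a dependency-free sketch that scores tool descriptions against the request and loads only the top matches. Production systems would typically use embeddings over tool descriptions rather than word overlap, and the catalog here is made up.

```python
TOOL_CATALOG = {
    "browser_open": "Open a web page and return its text",
    "shell_exec": "Run a shell command in a sandbox",
    "calendar_create_event": "Create a calendar event with a title and time",
    "image_resize": "Resize an image file to the given dimensions",
}

def select_tools(request: str, k: int = 2) -> list[str]:
    # Score each tool description by word overlap with the request and keep
    # only the top-k definitions for the prompt.
    req_words = set(request.lower().split())
    def score(name: str) -> int:
        return len(req_words & set(TOOL_CATALOG[name].lower().split()))
    return sorted(TOOL_CATALOG, key=score, reverse=True)[:k]

# Only the selected tools' definitions get added to the context.
print(select_tools("resize this image to 800x600"))  # -> ['image_resize', 'browser_open']
```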
The key insight across all of the above tactics is that context is not free. (The context window is the "CPU cache": the OS memory hierarchy teaches us that L1/L2/L3 caches are precious, fast but small. It's a limited resource.)
I like how this article puts proper names to these context-mitigation tactics. Now, during a discussion, we can be sure we're talking about the same thing!
Chroma technical report: Context Rot: How Increasing Input Tokens Impacts LLM Performance (Jul 14, 2025)
Insights into how increasing input tokens impacts the performance of top LLMs. My notes:
- The research evaluates how foundation models perform as input context length increases, challenging the common assumption that longer contexts are uniformly handled.
- Simple tasks reveal degradation: even on simple tasks, performance drops as input length grows.
- Distractors and structure matter
The presence of plausible distractors significantly reduces accuracy, with different distractors affecting models to varying degrees. Surprisingly, models often perform better on shuffled (structureless) haystacks than logically coherent ones, suggesting that attention is disrupted by narrative flow.
- Similarity and position effects
Lower semantic similarity between a query and its answer leads to faster degradation with context length. I guess this is one of the reasons why query augmentation techniques are effective. Models also show a preference for needles appearing early in context and struggle with information retrieval when the needle blends into the haystack thematically.
- Repetition and refusal behaviors
In repeated-word tasks, autoregressive degradation appears in long outputs, with models hallucinating, refusing to generate, or inserting random tokens. Performance varies even within model families, with conservative models like Claude Opus 4 often abstaining, while GPT-4.1 more frequently hallucinates.
- Thinking Mode
Thinking mode (with full reasoning capabilities) improves performance, but a performance gap remains between the focused and full-length prompts. This trend holds across model families.
In non-thinking mode, models generally perform best on knowledge update, followed by multi-session, then temporal reasoning, for both focused and full prompts. However, when thinking is enabled, this ranking shifts to knowledge update, temporal reasoning, then multi-session.
This report suggests the need for smaller, task-specific agents with a narrow focus: individual agents would do their task and report back with only the most relevant information.