Attention thrashing, part 2: Forgetting is a feature
A production service once went down for six hours because a cache never evicted. Every request added entries, nothing removed them, and eventually the JVM ran out of heap. The fix was trivial: add a TTL. The lesson is permanent: systems that only accumulate eventually drown in their own state.
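The shape of that fix, as a minimal sketch rather than the original service's code: every entry carries a timestamp, and anything older than the TTL is treated as gone.

```python
import time

class TTLCache:
    """In-process cache where every entry expires ttl_seconds after insertion."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, inserted_at)

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, inserted_at = entry
        if time.monotonic() - inserted_at > self.ttl:
            del self._store[key]  # lazily evict stale entries on read
            return None
        return value
```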
The AI industry is building that cache. Magic.dev announced 100 million token context windows. Anthropic and Google keep pushing toward longer contexts. The implicit assumption is that more memory equals more capability. But anyone who's maintained a production system knows the truth: memory management isn't about capacity. It's about garbage collection. Knowing what to forget is harder than knowing what to remember.
Lost in the middle
The problem isn't theoretical. Stanford researchers tested language models on multi-document question answering, varying where the relevant information appeared in the context. The accuracy curve was U-shaped: over 80% when the answer was in the first or last documents, under 40% when it was in the middle. This held across all tested models, including those explicitly designed for long contexts.
The mechanism is straightforward. Attention scores compete across tokens. As context grows, each token's share of attention shrinks. Information in the middle gets squeezed between the primacy of the beginning and the recency of the end. Chroma's research on "context rot" documents the same pattern: performance degrades as context accumulates, not because the information is gone, but because it's drowned in noise.
| Information Position | Retrieval Accuracy | Relative Performance |
|---|---|---|
| Beginning (first 10%) | 80%+ | Baseline |
| Middle (40-60%) | <40% | -50% relative |
| End (last 10%) | 80%+ | Baseline |
More context doesn't help if the model can't attend to it properly. A 100K token window with 40% middle-retrieval accuracy is worse than a 10K window with 80% accuracy for many practical tasks.
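To see the squeeze in isolation, here is a toy softmax calculation, not any particular model's attention, showing how the weight on a single relevant token collapses as filler accumulates around it:

```python
import math

def attention_weight_on_target(target_score: float,
                               filler_score: float,
                               num_filler: int) -> float:
    """Softmax weight a query assigns to one relevant key
    when it competes with num_filler lower-scoring keys."""
    target = math.exp(target_score)
    noise = num_filler * math.exp(filler_score)
    return target / (target + noise)

# One relevant key scoring 4.0 against filler keys scoring 1.0 each.
for n in (100, 1_000, 10_000, 100_000):
    w = attention_weight_on_target(4.0, 1.0, n)
    print(f"{n:>7} filler tokens -> weight on target {w:.4f}")
```

With the scores held fixed, the target's share falls roughly as 1/n: from about 17% with a hundred filler tokens to about 0.02% with a hundred thousand.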
The infinite context illusion
A 100M token context window sounds revolutionary until you do the math. At roughly 4 characters per token, that's 400MB of text, or about 750 novels. A typical engineering conversation over a week generates maybe 50K tokens. The context window could hold 2,000 weeks of conversation. Four decades. Why would you ever need to forget?
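The back-of-the-envelope arithmetic, with the rough conversion factors (4 characters per token, ~540K characters per novel, 50K tokens per week) made explicit so you can argue with them:

```python
CONTEXT_TOKENS = 100_000_000
CHARS_PER_TOKEN = 4           # rough average for English prose
CHARS_PER_NOVEL = 540_000     # ~90K words at ~6 characters per word
TOKENS_PER_WEEK = 50_000      # a busy week of engineering conversation

chars = CONTEXT_TOKENS * CHARS_PER_TOKEN      # 400,000,000 chars, ~400 MB of text
novels = chars / CHARS_PER_NOVEL              # ~740 novels
weeks = CONTEXT_TOKENS / TOKENS_PER_WEEK      # 2,000 weeks
decades = weeks / 52 / 10                     # ~3.8, call it four decades

print(f"~{chars / 1e6:.0f} MB, ~{novels:.0f} novels, "
      f"{weeks:.0f} weeks (~{decades:.1f} decades) of conversation")
```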
Because relevance decays faster than tokens accumulate. The architectural decision I made on Monday is superseded by Tuesday's refactor. The bug I described on Wednesday was a misdiagnosis; the real issue emerged Thursday. The variable name I mentioned in passing is now renamed. Every hour, some fraction of context becomes not just irrelevant but actively misleading.
Retrieval-augmented systems attempt to solve this with embedding search: RAG retrieves "relevant" context based on semantic similarity. But semantic similarity isn't temporal validity. A discussion from three months ago might be highly similar to today's question while being completely obsolete. RAG retrieves based on what matches, not what's current.
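Here is that failure mode in miniature, a sketch with fabricated three-dimensional "embeddings" and a hypothetical half-life discount, not any real RAG stack:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy memories: (text, embedding, age_in_days). Vectors are made up for illustration.
memories = [
    ("API endpoint is /v1/users",   [0.9, 0.1, 0.0], 90),  # stale but on-topic
    ("We migrated to /v2/users",    [0.8, 0.2, 0.1],  1),  # current
    ("Lunch is at noon on Fridays", [0.0, 0.1, 0.9], 30),  # irrelevant
]
query = [0.92, 0.12, 0.02]  # "what is the users endpoint?"

# Pure similarity ranking: the 90-day-old fact wins outright.
by_similarity = max(memories, key=lambda m: cosine(query, m[1]))

# One possible patch: discount similarity by an exponential age decay.
HALF_LIFE_DAYS = 14
def decayed_score(m):
    return cosine(query, m[1]) * 0.5 ** (m[2] / HALF_LIFE_DAYS)

by_decay = max(memories, key=decayed_score)
print("similarity only:", by_similarity[0])
print("decay-weighted: ", by_decay[0])
```

Pure similarity ranks the 90-day-old /v1 fact first; discounting by age surfaces the migration instead. The discount is one possible patch, not a standard RAG feature.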
Compression as partial solution
DeepSeek's Multi-head Latent Attention (MLA) takes a different approach: compress the KV cache itself. Instead of storing full key and value tensors for each attention head, MLA projects them into a lower-dimensional latent space before caching. At inference time, the compressed tensors are projected back to full size.
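The core idea fits in a few lines of PyTorch, as a sketch only: real MLA also handles rotary position embeddings and per-head structure, and its projections are learned during training rather than random.

```python
import torch

d_model, d_latent, seq_len = 4096, 512, 8   # made-up dimensions

# One shared down-projection applied before caching, separate
# up-projections applied when keys and values are actually needed.
W_down   = torch.randn(d_model, d_latent) / d_model ** 0.5
W_up_key = torch.randn(d_latent, d_model) / d_latent ** 0.5
W_up_val = torch.randn(d_latent, d_model) / d_latent ** 0.5

hidden = torch.randn(seq_len, d_model)       # per-token activations

# Standard attention would cache full K and V: 2 * d_model = 8192 floats/token.
# Here we cache one shared latent: d_latent = 512 floats/token, ~16x smaller.
latent_cache = hidden @ W_down               # (seq_len, d_latent)

# At attention time, approximate K and V are reconstructed from the latent.
keys   = latent_cache @ W_up_key             # (seq_len, d_model)
values = latent_cache @ W_up_val             # (seq_len, d_model)
```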
The results are significant. DeepSeek-V2 with MLA reduces KV cache size by 93.3% compared to standard multi-head attention, while achieving 5.76x higher generation throughput. Follow-up research showed that existing models can be converted to MLA with minimal fine-tuning: Llama2-7B's KV cache was reduced by 92% with only a 0.5% performance drop on long-context benchmarks.
But compression doesn't solve the relevance problem. A 93% smaller cache still contains stale information. You're storing outdated facts more efficiently, not deciding which facts to keep.
Strategic forgetting
MemGPT treats forgetting not as a failure but as an essential feature. The system uses the LLM itself as a memory manager, deciding what to store, what to summarize, and what to discard through tool-calling. The key insight: "strategic forgetting through summarization and targeted deletion" is how biological memory works. We don't store everything; we compress, abstract, and prune.
The architecture implements two mechanisms: summarization (condensing detailed exchanges into compressed representations) and targeted deletion (removing information that's been superseded or is no longer relevant). Research on long-running agents confirms the pattern: naive "accumulate everything" strategies show sustained performance decline from memory inflation, while systems with intelligent decay mechanisms maintain or improve performance over time.
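A sketch of those two mechanisms with the model call stubbed out; the class and function names here are illustrative, not MemGPT's actual tool interface:

```python
from dataclasses import dataclass, field

def summarize_with_llm(texts: list[str]) -> str:
    # Placeholder: a real system would make a model call here.
    return f"{len(texts)} older exchanges condensed into one note"

@dataclass
class MemoryEntry:
    text: str
    superseded: bool = False

@dataclass
class MemoryManager:
    entries: list[MemoryEntry] = field(default_factory=list)
    max_entries: int = 50

    def remember(self, text: str) -> None:
        self.entries.append(MemoryEntry(text))
        if len(self.entries) > self.max_entries:
            self.summarize_oldest(batch=10)

    def summarize_oldest(self, batch: int) -> None:
        """Summarization: condense the oldest entries into one compressed note."""
        oldest, self.entries = self.entries[:batch], self.entries[batch:]
        summary = summarize_with_llm([e.text for e in oldest])
        self.entries.insert(0, MemoryEntry(f"[summary] {summary}"))

    def supersede(self, stale_fragment: str, replacement: str) -> None:
        """Targeted deletion: mark stale facts and record what replaced them."""
        for entry in self.entries:
            if stale_fragment in entry.text:
                entry.superseded = True
        self.entries.append(MemoryEntry(replacement))

    def active_context(self) -> str:
        return "\n".join(e.text for e in self.entries if not e.superseded)
```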
See the strain: Context Decay Simulator
The interactive below simulates a month-long coding session. Watch how different memory strategies handle accumulating context. The "accumulate all" baseline drowns in stale state while decay-weighted and summarization strategies maintain retrieval accuracy.
The architectural escape
Magic.dev's approach to 100M token context is instructive. Their LTM-2-mini model does not use standard transformer attention at all. They built "an entire training and inference stack from scratch (no torch autograd, lots of custom CUDA)" specifically to avoid the O(N^2) attention bottleneck. That attacks the right cost problem: quadratic attention is the reason long context gets expensive in the first place.
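The arithmetic behind that bottleneck is blunt:

```python
N = 100_000_000                     # tokens in the window

# Full self-attention over an N-token prompt scores every query against
# every key: N * N entries per head per layer.
scores = N * N                      # 1e16 pairwise scores
flops = 2 * scores                  # ~one multiply-add per score

print(f"{scores:.0e} scores, ~{flops:.0e} FLOPs per head per layer, "
      f"before multiplying by heads, layers, or the memory to hold any of it")
```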
A non-quadratic path is real progress, and it makes very long context feasible. But it does not solve relevance by itself. You still need a policy for what to keep, update, and forget; otherwise you just get more stale tokens, held more cheaply.
Active state management
The deeper problem isn't eviction policy. It's that context windows are passive stores. They accumulate what's said without tracking what's changed. A true memory system needs active state management: the ability to update, invalidate, and version information as the world evolves.
Consider a simple scenario: I tell the model "the API endpoint is /v1/users" on day one. On day five, I mention "we migrated to /v2/users." A passive context window now contains both facts. Which one is true? The model has no mechanism to mark the first statement as superseded. It will retrieve whichever embedding happens to be closer to the current query, regardless of temporal validity.
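What active state management could look like at its simplest, as a sketch of a fact store that records supersession explicitly (nothing here is an existing product's API):

```python
from dataclasses import dataclass

@dataclass
class Fact:
    key: str
    value: str
    asserted_on_day: int
    superseded_by: "Fact | None" = None

class FactStore:
    """Keeps the latest assertion per key and marks older ones superseded."""

    def __init__(self):
        self._latest: dict[str, Fact] = {}

    def assert_fact(self, key: str, value: str, day: int) -> Fact:
        fact = Fact(key, value, day)
        previous = self._latest.get(key)
        if previous is not None:
            previous.superseded_by = fact  # invalidate, don't silently keep both
        self._latest[key] = fact
        return fact

    def current(self, key: str) -> str | None:
        fact = self._latest.get(key)
        return fact.value if fact else None

store = FactStore()
store.assert_fact("users_endpoint", "/v1/users", day=1)
store.assert_fact("users_endpoint", "/v2/users", day=5)
print(store.current("users_endpoint"))  # /v2/users, regardless of embedding distance
```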
Human memory isn't just versioned; it's reconstructive. We don't retrieve facts verbatim; we reconstruct them in context of current understanding. An AI system with true long-term memory would need to do the same: not just store tokens, but continuously reinterpret them as surrounding context evolves.
The observability gap
When a Linux kernel hangs, you don't guess. You attach an eBPF probe. You trace the syscalls in real time without stopping the kernel. When an LLM hallucinates, we stare at the output and guess. We are trying to debug a non-deterministic state machine with print() statements. We need a kernel probe for the neural net.
You can't manage context you can't observe. Current systems provide token counts and maybe attention visualizations, but nothing like the tracing infrastructure we expect from production systems. Where is the flame graph showing which concepts consumed the attention budget? Where is the semantic stack trace that explains why the model retrieved a three-month-old API endpoint instead of yesterday's migration?
Active memory management requires active observability. Until we can trace what the model is actually attending to, not just what sits in the context window but what is being retrieved, weighted, and acted upon, we are optimizing blind. You cannot set a TTL on tokens you cannot even see aging.
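Even without model internals, retrieval is one layer we could trace today. A sketch of the minimum viable version, with an invented trace schema:

```python
import json
import time

def traced_retrieval(query: str, candidates: list[dict], score_fn, top_k: int = 3):
    """Score candidates, return the winners, and emit a structured trace
    so the selection can be reconstructed later, like a span in a request trace."""
    scored = sorted(
        ((score_fn(query, c), c) for c in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    winners = scored[:top_k]
    trace = {
        "ts": time.time(),
        "query": query,
        "selected": [
            {"text": c["text"], "score": round(s, 4), "age_days": c["age_days"]}
            for s, c in winners
        ],
        "rejected_count": len(scored) - len(winners),
    }
    print(json.dumps(trace))  # in production: ship to your tracing backend
    return [c for _, c in winners]

docs = [
    {"text": "API endpoint is /v1/users", "age_days": 90},
    {"text": "We migrated to /v2/users", "age_days": 1},
]
traced_retrieval("users endpoint?", docs,
                 score_fn=lambda q, c: 1.0 / (1 + c["age_days"]), top_k=1)
```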
The hygiene principle
That cache failure was not fixed with more RAM. It was fixed with a TTL, one line of configuration that said "entries older than X get evicted." The breakthrough was policy, not capacity. Context windows need the same shift: not "how many tokens can we store" but "which tokens should we evict."
The industry is racing toward infinite context because it's a tractable engineering problem: just add more memory, longer attention spans, better compression. Active state management is harder because it requires the model to understand not just what was said, but what that statement means in context of everything that's happened since. That's a semantic problem, not a capacity problem.
A clean context window is worth 100 million dirty tokens.