Attention thrashing, part 2: forgetting is a feature

A production service cratered when a simple in-memory HashMap cache grew without limits. Entries never expired, so the heap climbed, GC pauses spiked, and the JVM eventually hit OOM. The fix was not more capacity. It was policy: bounds and TTL. The AI industry is building that cache. Magic.dev announced 100 million token context windows, and Anthropic and Google keep pushing toward longer contexts, while the harder problem is deciding what to forget.
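The policy fix is small. Here is a Python analogue of the bounded, TTL-evicting cache (sizes and TTL are illustrative):

```python
import time
from collections import OrderedDict

class BoundedTTLCache:
    """LRU-bounded cache whose entries expire after ttl seconds."""

    def __init__(self, max_entries=10_000, ttl=300.0):
        self.max_entries = max_entries
        self.ttl = ttl
        self._store = OrderedDict()  # key -> (value, inserted_at)

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())
        self._store.move_to_end(key)
        while len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used

    def get(self, key):
        item = self._store.get(key)
        if item is None:
            return None
        value, inserted_at = item
        if time.monotonic() - inserted_at > self.ttl:
            del self._store[key]  # expired: forget it
            return None
        self._store.move_to_end(key)  # refresh recency
        return value
```

Two policies, a dozen lines: a bound decides how much you keep, a TTL decides how long you trust it.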

Context decay

Stanford researchers tested multi-document question answering ("Lost in the Middle", Liu et al., 2023) and found a U-shaped curve: accuracy was high when the answer sat at the start or end of the context, low when it sat in the middle. Chroma's research on "context rot" shows why: as context grows, attention spreads thin and the middle gets drowned, so a 100K window can underperform a 10K window.

More context helps only if the model can attend to it. A larger window with weak retrieval can underperform a smaller one that keeps relevant tokens closer to the query.

Without versioning or TTLs, contradictions accumulate and the model has no native way to resolve them.

The drift is subtle at first and compounds silently; by the time it shows up in model output, the context is already poisoned.

Relevance decays faster than tokens accumulate. A refactor or bug report can be superseded within days, so old context becomes misleading. Retrieval-augmented systems attempt to solve this with embedding search, but RAG retrieves by semantic similarity, not temporal validity, so stale facts can win.
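One common mitigation, not specific to any particular RAG stack, is to blend semantic similarity with a freshness term so that recency can outvote a stale-but-similar match. A sketch, with made-up weights and a one-week half-life:

```python
def retrieval_score(similarity, age_seconds,
                    half_life=7 * 24 * 3600, freshness_weight=0.3):
    """Blend cosine similarity with exponential time decay.

    half_life: age at which the freshness term halves (here: one week).
    freshness_weight: how much recency counts against similarity.
    """
    freshness = 0.5 ** (age_seconds / half_life)
    return (1 - freshness_weight) * similarity + freshness_weight * freshness

# A stale-but-similar document vs a fresh, slightly-less-similar one:
stale = retrieval_score(similarity=0.92, age_seconds=30 * 24 * 3600)  # a month old
fresh = retrieval_score(similarity=0.85, age_seconds=1 * 24 * 3600)   # a day old
```

With these weights the day-old document outscores the month-old one despite the lower similarity, which is exactly the inversion pure embedding search cannot produce.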

Compression vs forgetting

DeepSeek's Multi-head Latent Attention (MLA) compresses the KV cache: DeepSeek-V2 reduced KV cache size by over 90% while improving throughput, and follow-up work showed existing models could be converted to MLA with minimal quality loss. But compression keeps stale facts intact instead of deciding which ones actually matter.
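The distinction is easy to make concrete with a toy low-rank sketch. This is not MLA itself, just the underlying idea: project cached vectors into a smaller latent space and reconstruct on read. Every token survives, only approximately. Sizes and projections here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_latent = 512, 256, 32  # toy dimensions

kv = rng.standard_normal((seq_len, d_model))  # stand-in for a KV cache

# In a real model these projections are learned; random here.
w_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
w_up = np.linalg.pinv(w_down)  # crude reconstruction map for the sketch

latent = kv @ w_down   # cache this: 8x fewer floats per token
approx = latent @ w_up  # reconstruct at attention time

compression = d_model / d_latent
```

Note what did not happen: no row of `kv` was dropped. Stale tokens cost 8x less, but they are all still there, still competing for attention.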

MemGPT treats forgetting as a feature, and research on long-running agents shows decay strategies outperform naive accumulation.

Simulating a month-long session and comparing memory strategies makes the gap visible: the "accumulate all" baseline drowns in stale state, while decay-weighted and summarization strategies maintain retrieval accuracy.
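That comparison fits in a toy simulation. Facts about a key get superseded over time; a query is correct only if memory returns the latest version. All parameters are illustrative:

```python
import random

random.seed(0)

def simulate(strategy, days=30, keys=20):
    """Toy memory: facts are (key, version, day); a query is correct
    only if the returned fact is the latest version of its key."""
    memory = []   # everything ever written
    latest = {}   # key -> (version, day) ground truth
    correct = total = 0
    for day in range(days):
        for _ in range(3):  # a few facts change each day
            k = random.randrange(keys)
            v = latest.get(k, (0, 0))[0] + 1
            latest[k] = (v, day)
            memory.append((k, v, day))
        for _ in range(5):  # then some queries arrive
            k = random.randrange(keys)
            hits = [m for m in memory if m[0] == k]
            if not hits:
                continue
            if strategy == "accumulate":
                pick = random.choice(hits)  # stale facts surface freely
            else:  # "decay": prefer the most recent write
                pick = max(hits, key=lambda m: (m[2], m[1]))
            total += 1
            correct += (pick[1] == latest[k][0])
    return correct / total

accumulate_acc = simulate("accumulate")
decay_acc = simulate("decay")
```

The decay strategy is perfect by construction here; the point is that the accumulate baseline degrades purely because old versions keep competing with new ones.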

Long-context architectures try to avoid O(N^2) costs by skipping attention pairs or rethinking attention entirely. Magic.dev's LTM-2-mini model avoids standard transformer attention, but cheaper attention still does not solve relevance.

Active memory

The deeper problem isn't eviction policy. Context windows are passive stores that do not track what changed. If you tell the model "the API endpoint is /v1/users" and later say "we migrated to /v2/users," both facts persist with no versioning signal.
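A minimal sketch of the missing versioning signal (class and method names hypothetical): keep every assertion, but make reads resolve to the latest version, so "/v2/users" supersedes rather than coexists with "/v1/users".

```python
import time

class VersionedFactStore:
    """Keeps every assertion; reads resolve to the latest version."""

    def __init__(self):
        self._history = {}  # key -> list of (version, value, timestamp)

    def assert_fact(self, key, value):
        versions = self._history.setdefault(key, [])
        versions.append((len(versions) + 1, value, time.time()))

    def current(self, key):
        versions = self._history.get(key)
        return versions[-1][1] if versions else None

    def superseded(self, key):
        """Older values a context manager could evict or down-weight."""
        versions = self._history.get(key, [])
        return [value for _, value, _ in versions[:-1]]

store = VersionedFactStore()
store.assert_fact("api_endpoint", "/v1/users")
store.assert_fact("api_endpoint", "/v2/users")  # migration supersedes v1
```

A context window has no equivalent of `superseded()`: both endpoint strings sit in the token stream with equal standing.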

Human memory is reconstructive. An AI system with true long-term memory would need to continuously reinterpret stored facts as context evolves.

That is a semantic problem, not a capacity problem. It requires the model to understand what a statement means after everything that has happened since.

When a Linux kernel hangs, you attach an eBPF probe and trace syscalls in real time. LLMs need similar observability; you cannot manage context you cannot see aging.
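You cannot evict what you cannot measure. A minimal sketch of that observability (all names hypothetical): tag each context segment with its insertion time and report how many tokens fall in each age bucket.

```python
import time
from collections import Counter

class ContextTracer:
    """Tag context segments with insertion time; report an age profile."""

    def __init__(self):
        self.segments = []  # (label, tokens, inserted_at)

    def add(self, label, tokens):
        self.segments.append((label, tokens, time.monotonic()))

    def age_report(self, buckets=(60, 3600, 86400)):
        """Count tokens per age bucket in seconds; inf = older than all."""
        now = time.monotonic()
        report = Counter()
        for _label, tokens, inserted_at in self.segments:
            age = now - inserted_at
            bucket = next((b for b in buckets if age < b), float("inf"))
            report[bucket] += tokens
        return dict(report)
```

Even this crude histogram answers a question no context window exposes today: how many of my tokens are a day old?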

Context windows need the same shift: not "how many tokens can we store" but "which tokens should we evict." The industry can scale capacity faster than it can invalidate stale state. A clean context window often delivers better results than 100 million tokens full of stale contradictions.

The industry keeps pushing longer context because it is a tractable engineering problem. Active state management is harder because it requires the model to update, invalidate, and version facts as the world changes. That forces semantics, not just storage.