Attention thrashing, part 2: forgetting is a feature
A production service cratered when a simple in-memory HashMap cache grew without limits. Entries never expired, so the heap climbed, GC pauses spiked, and the JVM eventually ran out of memory. The repair was a bounded cache with TTLs, not a larger heap. AI companies are building the same kind of cache. Magic.dev announced 100 million token context windows, and Anthropic and Google keep pushing toward longer contexts. The harder problem is deciding what to forget.
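To make the repair concrete, here is a minimal sketch of a bounded cache with TTLs. It is written in Python rather than for the JVM, and the class name, parameters, and least-recently-used eviction rule are illustrative assumptions, not the code from the incident.

```python
import time
from collections import OrderedDict


class BoundedTTLCache:
    """Illustrative cache: at most max_entries items, each valid for ttl_seconds."""

    def __init__(self, max_entries, ttl_seconds, clock=time.monotonic):
        self.max_entries = max_entries
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self._data = OrderedDict()  # key -> (value, expires_at), least recently used first

    def put(self, key, value):
        self._data.pop(key, None)  # re-inserting moves the key to the newest slot
        self._data[key] = (value, self.clock() + self.ttl_seconds)
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict the least recently used entry

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        value, expires_at = item
        if self.clock() >= expires_at:
            del self._data[key]  # expired: forget it instead of serving stale data
            return default
        self._data.move_to_end(key)  # mark as recently used
        return value

    def __len__(self):
        return len(self._data)
```

The two limits cover different failures: the size bound keeps memory flat no matter how many keys arrive, and the TTL keeps an entry from being served long after it stopped being true.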
Context decay
Stanford researchers tested multi-document question answering and found a U-shaped curve: accuracy was high when the relevant document sat at the start or end of the context and low when it sat in the middle. Chroma's research on "context rot" suggests why: as context grows, attention spreads thin and information in the middle gets drowned out, so a 100K window can underperform a 10K window.
More context helps only if the model can attend to it. A larger window with weak retrieval can underperform a smaller one that keeps relevant tokens closer to the query.
Without versioning or TTLs, contradictions accumulate and the model has no native way to resolve them.
At first the drift is easy to miss. At scale it piles up until the failures are obvious.
Relevance decays faster than tokens accumulate. A refactor or bug report can be superseded within days, so old context becomes misleading. Retrieval-augmented systems attempt to solve this with embedding search, but RAG retrieves by semantic similarity, not temporal validity, so stale facts can win.
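One way to let temporal validity compete with semantic similarity is to discount each retrieved chunk by its age. The sketch below is an assumption-laden illustration: the `Chunk` type, the similarity values, and the seven-day half-life are all hypothetical, and a real system would take similarity from its own embedding search.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    similarity: float  # assumed output of an embedding search, in [0, 1]
    age_days: float    # time since the chunk was written or last confirmed


def decayed_score(chunk, half_life_days=7.0):
    """Similarity discounted by exponential decay: it halves every half_life_days."""
    return chunk.similarity * 0.5 ** (chunk.age_days / half_life_days)


def rank(chunks, half_life_days=7.0):
    """Order chunks from most to least useful under the decayed score."""
    return sorted(chunks, key=lambda c: decayed_score(c, half_life_days), reverse=True)


# Hypothetical retrieval results: the older chunk matches the query slightly better.
chunks = [
    Chunk("Bug report: login fails on Safari", similarity=0.92, age_days=30),
    Chunk("Refactor note: login flow rewritten, Safari bug fixed", similarity=0.85, age_days=2),
]
by_similarity = max(chunks, key=lambda c: c.similarity)  # picks the stale bug report
by_decay = rank(chunks)[0]                               # picks the recent refactor note
```

The half-life is a tuning knob, not a constant of nature. Fast-moving state such as open bugs probably wants a short one, while stable reference material may want none at all.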
Compression vs forgetting
DeepSeek's Multi-head Latent Attention (MLA) compresses the KV cache. DeepSeek-V2 with MLA reduced KV cache size by over 90% while improving throughput, and follow-up work showed that existing models could be converted to MLA with minimal quality loss. But compression keeps stale facts intact; it does not decide which ones still matter.
MemGPT treats forgetting as a feature, and research on long-running agents shows decay strategies outperform naive accumulation.
Consider a simulated month-long session that compares memory strategies. The "accumulate all" baseline drowns in stale state, while decay-weighted and summarization strategies maintain retrieval accuracy.
Long-context architectures try to avoid O(N^2) attention costs by skipping attention pairs or rethinking attention entirely. Magic.dev's LTM-2-mini model avoids standard transformer attention, but cheaper attention still does not solve relevance.
Active memory
Eviction policy is only the visible part. Context windows store old and new statements side by side without recording which one supersedes the other. If you tell the model "the API endpoint is /v1/users" and later say "we migrated to /v2/users," both facts persist with no versioning signal.
Useful long-term memory needs more bookkeeping than a longer prompt: time, source, scope, and replacement rules. When context changes, old entries need to be rewritten, demoted, or invalidated.
More token capacity does not provide that bookkeeping. The system needs a way to decide whether a stored statement still applies after later evidence arrives.
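As a rough illustration of that bookkeeping, the sketch below keeps an append-only log in which a later statement about the same key marks the earlier one as superseded. The `Fact` and `FactLog` names, the key string, and the sources are hypothetical. The sketch also assumes that the most recently asserted statement always wins, which ignores scope and out-of-order evidence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fact:
    key: str            # what the statement is about, e.g. "users_endpoint"
    value: str
    source: str
    recorded_at: int    # any monotonically increasing timestamp
    superseded_by: Optional[int] = None  # index of the replacing fact, if any


class FactLog:
    """Append-only log: nothing is deleted, but old versions are marked invalid."""

    def __init__(self):
        self.facts = []
        self._current = {}  # key -> index of the currently valid fact

    def assert_fact(self, key, value, source, recorded_at):
        index = len(self.facts)
        previous = self._current.get(key)
        if previous is not None:
            self.facts[previous].superseded_by = index  # invalidate the old version
        self.facts.append(Fact(key, value, source, recorded_at))
        self._current[key] = index
        return index

    def current(self, key):
        index = self._current.get(key)
        return None if index is None else self.facts[index]

    def history(self, key):
        return [f for f in self.facts if f.key == key]


# The endpoint example from above, with hypothetical sources.
log = FactLog()
log.assert_fact("users_endpoint", "/v1/users", source="onboarding doc", recorded_at=1)
log.assert_fact("users_endpoint", "/v2/users", source="migration notice", recorded_at=2)
valid_now = log.current("users_endpoint").value  # "/v2/users"
```

Keeping the superseded entry instead of deleting it preserves the audit trail: the system can still explain that /v1/users was once correct and which statement replaced it.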
When a Linux kernel hangs, you attach an eBPF probe and trace syscalls in real time. LLMs need similar observability; you cannot manage context you cannot see aging.
Context windows need explicit eviction rules: which tokens expire, which facts supersede earlier facts, and which evidence stays pinned. The industry can scale capacity faster than it can invalidate stale state. A smaller, cleaner context window can outperform a huge one packed with stale contradictions.
The industry keeps pushing longer context because it is a tractable engineering problem. Active state management is harder because it requires the model to update, invalidate, and version facts as the world changes. That means deciding which facts are still valid, not just storing more text.