
Attention thrashing in long-context models

Server thrashing in traditional infrastructure often has a blunt fix: add more RAM. When my team ran into it at Google, we weighed two options: fix the root cause at significant engineering cost, or restart the jobs with more memory, which was cheap at our scale. We took the cheaper path and moved on.

LLMs do not get an equivalent simple fix. More memory can hold a larger KV cache and support longer prompts, but it does not decide which tokens deserve attention. In standard full-attention transformers, long prompts create quadratic prefill work; during decode, each new token commonly reads a KV cache that grows with the context. The overhead can dominate the useful signal.

Scaling pressure

In standard full attention, each token can attend to every other token (in causal decoder models, every earlier token). For N input tokens, prefill therefore does O(N²) attention work. Implementations such as FlashAttention avoid materializing the full N×N attention matrix, but the pairwise dependence still creates scaling pressure. During decode, the KV cache grows linearly with context, and each new token typically reads all prior keys and values. At long contexts, memory bandwidth rather than arithmetic often dominates decode time.
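A rough back-of-the-envelope sketch of the two growth rates described above. The model shape (32 layers, 32 query heads, 8 KV heads, head dimension 128, 2-byte values) is a hypothetical configuration chosen for illustration, and the FLOP count covers only the two attention matrix products, ignoring causal-mask savings, projections, and MLP layers.

```python
def attention_prefill_flops(n_tokens, n_layers, n_heads, head_dim):
    """Approximate FLOPs for QK^T and (weights @ V) during prefill.

    Each product costs about 2 * N * N * head_dim FLOPs per head.
    Projections, MLP layers, and causal-mask savings are ignored.
    """
    per_head = 2 * (2 * n_tokens * n_tokens * head_dim)
    return n_layers * n_heads * per_head


def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to keep one K and one V vector per token, per layer, per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical model shape, not taken from any specific model.
LAYERS, HEADS, KV_HEADS, HEAD_DIM = 32, 32, 8, 128

for n in (4_096, 32_768, 131_072):
    flops = attention_prefill_flops(n, LAYERS, HEADS, HEAD_DIM)
    cache = kv_cache_bytes(n, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"N={n:>7,}  attention prefill ~{flops / 1e12:,.1f} TFLOPs  "
          f"KV cache ~{cache / 2**30:.1f} GiB")
```

Under these assumptions, every doubling of N doubles the cache estimate and quadruples the attention estimate, which is the asymmetry the rest of this post is about.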

To avoid recomputing prior context during generation, transformers maintain a Key-Value (KV) cache that stores the key and value vectors already computed for earlier tokens. Sparse, sliding-window, paged, and compressed-cache implementations can change constants or visibility, but the basic pressure remains: longer retained context increases work and retrieval burden. Latency rises, and quality can degrade too. In Needle-in-a-Haystack tests, models often show the "Lost in the Middle" phenomenon: they can retrieve information from the beginning and end of the context but miss what is in the middle.
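To make the KV cache concrete, here is a toy single-head sketch in plain Python. It is not how any production framework implements caching; `KVCache`, `attend`, and the random vectors are illustrative names and data. It shows two things: cached decoding produces the same outputs as causal attention over the full prefix, and the number of cached rows read per step grows with the context.

```python
import math
import random


def attend(query, keys, values):
    """Single-head softmax attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[j] for e, v in zip(exps, values)) / total for j in range(dim)]


def causal_attention_reference(queries, keys, values):
    """Reference: position t attends over the prefix 0..t, with no cache object."""
    return [attend(queries[t], keys[: t + 1], values[: t + 1])
            for t in range(len(queries))]


class KVCache:
    """Toy cache: append this step's key/value, then read every cached row."""

    def __init__(self):
        self.keys = []
        self.values = []
        self.reads = 0  # total cached key/value rows read across all steps

    def decode_step(self, query, key, value):
        self.keys.append(key)
        self.values.append(value)
        self.reads += len(self.keys)  # each step reads the whole cache
        return attend(query, self.keys, self.values)


rng = random.Random(0)
n_tokens, dim = 6, 4
queries = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]
keys = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]
values = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_tokens)]

cache = KVCache()
cached_outputs = [cache.decode_step(q, k, v) for q, k, v in zip(queries, keys, values)]
reference_outputs = causal_attention_reference(queries, keys, values)
```

After n decode steps the toy cache has read n(n+1)/2 rows in total, so the cache saves recomputation but does not remove the per-token read of everything retained.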

A needle-in-a-haystack test makes this concrete. At 25K+ tokens, models often miss quotes from middle positions despite "seeing" the entire context.
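A minimal sketch of how such a test can be set up. Everything here is illustrative: `build_haystack_prompt` and `run_depth_sweep` are hypothetical helpers, and `ask_model` stands for whatever function you use to call a model (it takes a prompt string and returns the answer string). The scoring is a plain substring match, which is cruder than what published benchmarks use.

```python
def build_haystack_prompt(filler_sentences, needle, depth, question):
    """Insert `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = int(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    context = " ".join(sentences)
    return f"{context}\n\nQuestion: {question}\nAnswer:"


def run_depth_sweep(ask_model, filler_sentences, needle, question, expected,
                    depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Return {depth: True/False} for whether the expected answer was retrieved.

    `ask_model` is an assumed callable: prompt string in, answer string out.
    """
    results = {}
    for depth in depths:
        prompt = build_haystack_prompt(filler_sentences, needle, depth, question)
        answer = ask_model(prompt)
        results[depth] = expected.lower() in answer.lower()
    return results
```

Running the sweep at several context lengths and plotting hit rate against depth is one way to see whether misses cluster in the middle positions.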

Thrashing mechanics

By analogy with OS memory thrashing, attention thrashing happens when the overhead of processing context exceeds the benefit of having it. At long context lengths, research shows models can spend significant compute on irrelevant tokens while mid-context retrieval fails, and prefill latency can degrade from sub-second at short contexts to 20-60 seconds at 100K+ tokens on H100 GPUs. This is a limit of standard long-context attention, not merely a lack of RAM. Long contexts increase overhead and can bury signal in noise.

Researchers are pursuing several ways to mitigate attention's quadratic cost, each with its own trade-offs.

Push sequence length from 256 to 4K tokens, a 16x increase, and baseline attention's pairwise work and score-matrix memory grow by roughly 256x. FlashAttention-style tiling and block-sparse approximations each relieve a different part of that growth.

Mitigations

Sparse Attention: restricts each token to local windows or fixed patterns, which cuts the number of attended pairs below O(N²).

FlashAttention: computes exact attention in tiles kept in on-chip memory, which reduces memory traffic without changing the O(N²) arithmetic.

Retrieval-Augmented Generation (RAG): retrieves relevant passages before the model runs, which shrinks the context and focuses attention.
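A small plain-Python sketch of the first two ideas. The pair-counting functions show how a sliding window caps the number of attended pairs. `attend_tiled` illustrates the running-softmax accumulation that FlashAttention-style tiling relies on: it processes keys and values one block at a time and never holds the full score row. It is a numerical illustration only, not the fused GPU kernel, and the sample vectors are made up.

```python
import math


def causal_pairs(n_tokens):
    """Attended (query, key) pairs under full causal attention."""
    return n_tokens * (n_tokens + 1) // 2


def sliding_window_pairs(n_tokens, window):
    """Pairs when each token sees itself plus at most window - 1 earlier tokens."""
    return sum(min(t + 1, window) for t in range(n_tokens))


def attend_reference(query, keys, values):
    """Ordinary softmax attention for one query, holding all scores at once."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / total
            for j in range(len(values[0]))]


def attend_tiled(query, keys, values, block_size=4):
    """Same result, computed block by block with a running max and running sum."""
    scale = 1.0 / math.sqrt(len(query))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        block_keys = keys[start:start + block_size]
        block_values = values[start:start + block_size]
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in block_keys]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old max (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [a * correction + sum(e * v[j] for e, v in zip(exps, block_values))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


# Made-up deterministic sample data: 10 keys/values of dimension 4.
sample_query = [0.3, -0.1, 0.8, 0.5]
sample_keys = [[math.sin(i + j) for j in range(4)] for i in range(10)]
sample_values = [[math.cos(i * j + 1) for j in range(4)] for i in range(10)]
```

The tiled version does the same arithmetic as the reference, which mirrors the point above: tiling changes where the data lives, while windowing changes how many pairs are computed at all.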

These optimizations reduce the cost of carrying context, but they do not decide what should remain in context. Sparse attention can skip pairs, FlashAttention can reduce memory traffic, and RAG can move selection upstream. None of those mechanisms, by itself, versions facts, expires stale evidence, or decides that a newer instruction supersedes an older one. Research into hierarchical memory systems like HMT and alternatives like Mamba is promising, but most production transformer systems still pay a substantial cost for long retained context. Research confirms "model attention becomes increasingly unreliable as input length grows," which is why longer windows need selection and memory policy, not only more capacity.
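To illustrate what a memory policy might look like, here is a deliberately small sketch. `ContextMemory` and `MemoryItem` are hypothetical names, not part of any library, and the policy is the simplest one that covers the three cases named above: a new write to the same key supersedes the old one, each write bumps a version number, and items with a time-to-live drop out once they are stale.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemoryItem:
    key: str
    text: str
    version: int
    created_at: float
    ttl_seconds: Optional[float] = None  # None means the item never expires


class ContextMemory:
    """Hypothetical policy layer that decides what is rendered into the prompt."""

    def __init__(self):
        self._items = {}

    def write(self, key, text, now, ttl_seconds=None):
        """Store text under key; a later write to the same key supersedes it."""
        previous = self._items.get(key)
        version = previous.version + 1 if previous else 1
        self._items[key] = MemoryItem(key, text, version, now, ttl_seconds)
        return version

    def select(self, now):
        """Return live items, oldest first, skipping anything past its TTL."""
        live = [item for item in self._items.values()
                if item.ttl_seconds is None or now - item.created_at < item.ttl_seconds]
        return sorted(live, key=lambda item: item.created_at)

    def render(self, now):
        """Build the context block that would be placed in the prompt."""
        return "\n".join(f"[{item.key} v{item.version}] {item.text}"
                         for item in self.select(now))
```

For example, writing "Be formal" and later "Be casual" under the same key leaves only the second in the rendered context, and a price quote written with a 60-second TTL disappears from renders after that. A real system would need more than this, but the decision happens before attention ever runs.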