Attention thrashing: ADHD in artificial minds
Server thrashing in traditional infrastructure often has a blunt fix: add more RAM. When my team ran into it at Google, we weighed two options: fix the root cause at significant engineering cost, or restart the jobs with more memory, which was cheap at our scale. We took the cheaper path and moved on.
LLMs don't have the same luxury. More RAM can buy longer context windows, but it won't solve the many-to-many token "attention" thrashing. Self-attention imposes quadratic overhead in compute and memory traffic as context grows from thousands to millions of tokens, and that overhead is baked into the mechanism itself. Less effective capacity remains for generating quality output. Is it ADHD?
Scaling pressure
Transformers process all tokens simultaneously via self-attention, computing attention scores between every token pair. For N tokens, this creates an N×N matrix with O(N²) complexity. That quadratic work shows up as both tensor operations and memory traffic. In practice the bottleneck is often memory bandwidth rather than compute, because even fast GPUs still read the full KV cache for each token at long contexts.
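To make the N×N term concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention that materializes the full score matrix; the shapes and the 40 GB figure in the comment are illustrative, not drawn from any particular model.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Single-head scaled dot-product attention, materializing the full N x N score matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (N, N): the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # (N, d)

N, d = 1024, 64
x = np.random.randn(N, d).astype(np.float32)
out = naive_attention(x, x, x)
# Doubling N quadruples the score matrix: at N = 100_000 the float32
# scores alone would need ~40 GB (100_000**2 * 4 bytes) per head.
```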
Processing the initial prompt requires full attention across all input tokens, so prefill latency grows quadratically with input length. To avoid recomputing everything during generation, transformers maintain a Key-Value (KV) cache that stores the keys and values of previously processed tokens. Even though the KV cache grows linearly, generating each new token requires reading the entire cache from HBM. Performance slows down, and quality can degrade too. In Needle-in-a-Haystack tests, models show the "Lost in the Middle" phenomenon: they can retrieve information from the beginning and end of context but miss what's in the middle.
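A back-of-the-envelope calculation makes the decode bottleneck concrete; the layer count, KV head count, head dimension, and precision below are assumptions for illustration, not any specific model's configuration.

```python
# Rough KV cache size for an assumed config: 32 layers, 8 KV heads,
# head_dim 128, fp16 activations. Every generated token must stream this
# entire cache from HBM, which is why long-context decode is bandwidth-bound.
n_layers, n_kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2                       # fp16 / bf16
for tokens in (2_000, 20_000, 100_000):
    # Each token stores one key and one value vector per layer per KV head.
    kv_bytes = tokens * n_layers * n_kv_heads * head_dim * 2 * bytes_per_elem
    print(f"{tokens:>7} tokens -> {kv_bytes / 1e9:5.1f} GB of KV cache per sequence")
# 100_000 tokens -> ~13.1 GB read from HBM for every single output token.
```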
Try this Needle-in-a-Haystack test. At 25K+ tokens, models often miss quotes from middle positions despite "seeing" the entire context.
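Here is a minimal sketch of how such a test can be built; build_haystack_prompt, the filler sentence, and the question wording are hypothetical, included only to show the shape of the evaluation.

```python
def build_haystack_prompt(needle: str, filler: str, n_filler: int, depth: float) -> str:
    """Bury a 'needle' sentence at a relative depth (0.0 = start, 1.0 = end)
    inside repeated filler text, then ask the model to recall it."""
    sentences = [filler] * n_filler
    sentences.insert(int(depth * n_filler), needle)
    context = " ".join(sentences)
    return f"{context}\n\nWhat is the secret number mentioned above? Answer with just the number."

prompt = build_haystack_prompt(
    needle="The secret number is 7421.",
    filler="The sky was a pale shade of grey over the quiet harbor.",
    n_filler=2_000,   # roughly 25K+ tokens of padding
    depth=0.5,        # middle of the context, where retrieval tends to fail
)
```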
Thrashing mechanics
Analogous to OS memory thrashing, attention thrashing happens when the overhead of processing context outweighs the benefit it provides. At long context lengths, research shows models can spend significant compute on irrelevant tokens while mid-context retrieval fails, and prefill latency degrades from sub-second at short contexts to 20-60 seconds at 100K+ tokens on H100 GPUs. This is largely an architectural limit: longer contexts add overhead and can bury signal in noise.
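To see why prefill latency balloons, here is a rough count of the attention matmul FLOPs alone, under an assumed 32-layer, 4096-dimension configuration; the projection and MLP matmuls, which grow only linearly in N, are ignored.

```python
# Attention matmul FLOPs during prefill: QK^T and scores @ V are each
# about 2 * N^2 * d_model per layer, so the total grows with N^2.
n_layers, d_model = 32, 4096
for n in (2_000, 20_000, 100_000):
    attn_flops = n_layers * 2 * (2 * n * n * d_model)
    print(f"{n:>7} tokens: {attn_flops / 1e12:8.1f} TFLOPs in attention score math")
# 50x more tokens -> 2,500x more attention FLOPs, before any memory stalls.
```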
Researchers are actively working to mitigate attention's quadratic cost; each approach trades away something different.
See the strain: Attention Scaling Lab
The interactive below benchmarks self-attention kernels directly in your browser. Push sequence length from 256 to 4K tokens and watch latency and memory demands climb quadratically. Toggle between baseline attention, FlashAttention-style tiling, and block-sparse approximations to see how much relief each mitigation actually buys. The O(N²) curve is not just theory; you can watch it climb.
Mitigations
- Sparse Attention: restricts each token's attention to local windows or fixed patterns, cutting the O(N²) cost (see the sketch after this list).
- FlashAttention: tiles the computation so intermediate scores stay in on-chip SRAM, reducing memory traffic without changing the O(N²) math.
- Retrieval-Augmented Generation (RAG): retrieves only the relevant passages before the model processes them, shrinking the context and focusing attention.
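As a toy sketch of the sparse-attention idea (a plain sliding window, not FlashAttention or any production kernel), each token below attends only to its recent neighbors:

```python
import numpy as np

def sliding_window_attention(Q, K, V, window: int = 256):
    """Causal local attention: token i attends only to the previous `window` tokens,
    so total work is O(N * window) instead of O(N^2)."""
    N, d = Q.shape
    out = np.empty_like(Q)
    for i in range(N):
        lo = max(0, i - window + 1)
        scores = Q[i] @ K[lo:i + 1].T / np.sqrt(d)   # at most `window` scores per token
        scores -= scores.max()
        weights = np.exp(scores)
        weights /= weights.sum()
        out[i] = weights @ V[lo:i + 1]
    return out

N, d = 4096, 64
x = np.random.randn(N, d).astype(np.float32)
out = sliding_window_attention(x, x, x, window=256)  # ~16x fewer scores than full attention at N=4096
```

The trade-off is the one the list implies: tokens outside the window are never compared at all, so global dependencies have to be recovered another way, via dilated patterns, global tokens, or retrieval.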
These optimizations make things faster, but they do not change the fundamental problem: attention computes statistical correlations, not semantic understanding. The model treats token 1 and token 10,000 as separate entities. A good summary requires selecting key moments, yet attention still computes over every token pair. Research into hierarchical memory systems like HMT and alternatives like Mamba is promising, but production systems still rely on quadratic attention for long contexts.
Research confirms that "model attention becomes increasingly unreliable as input length grows," yet every context extension multiplies inference costs across millions of queries. Provisioning more hardware helps, but it does not remove the architectural limit.