
Signal Processing

Attention thrashing: ADHD in artificial minds

Server thrashing in traditional infrastructure often has a blunt fix: add more RAM. When I ran into it at Google, we weighed two options:
1. fix the root cause at significant engineering cost, or
2. restart the jobs with more memory, which was cheap at our scale.
We took the cheaper path and moved on.

LLMs don't have the same luxury. Adding more RAM can allow longer context windows, but it won't solve token-to-token "attention" thrashing. Self-attention computes interactions between every pair of tokens, forcing quadratic overhead in compute and memory as context grows from thousands to millions of tokens, and less capacity remains for generating quality outputs. It's ADHD for artificial minds: too much context and not enough focus. Attention thrashing.

Transformer context scaling

Transformers process all tokens simultaneously via self-attention, computing attention scores between every token pair. For N tokens, this creates an N×N matrix with O(N²) complexity. That quadratic work shows up as both tensor operations and memory traffic: doubling context length quadruples the tensor math and quadruples the data that must move through the memory hierarchy. Some models push toward million-token windows, and the scaling cost remains a constraint.
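To make the quadratic term concrete, here's a toy single-head attention in NumPy. The shapes are illustrative, not any particular model; the point is that the scores matrix is N×N, so doubling N quadruples both the entries computed and the entries stored.

```python
import numpy as np

def naive_self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over N token embeddings.

    The scores matrix is N x N: every token attends to every other
    token, which is the O(N^2) cost discussed above.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v           # each (N, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])       # (N, N) -- quadratic in N
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                            # (N, d)

rng = np.random.default_rng(0)
d = 64
for n in (1_000, 2_000):  # doubling N quadruples the scores matrix
    x = rng.standard_normal((n, d))
    w = rng.standard_normal((d, d)) / np.sqrt(d)
    out = naive_self_attention(x, w, w, w)
    print(f"N={n:>5}: output {out.shape}, scores matrix holds {n*n:,} entries")
```

At N=1,000 the scores matrix holds a million entries; at N=2,000, four million. Extrapolate that to a million-token window and the problem is obvious.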

In practice the bottleneck is often memory bandwidth rather than compute. Even fast GPUs still have to read the full KV cache for each token, which dominates at long contexts.

Performance degradation

Processing the initial prompt requires computing full attention across all input tokens, so prefill latency grows quadratically with input length. To avoid recomputing everything during generation, transformers maintain a Key-Value (KV) cache that stores prior keys and values. Even though the KV cache grows only linearly, O(N), generating each new token requires reading the entire cache from HBM. As context expands from 2K to 128K tokens, the bytes streamed per generated token grow proportionally, and decode becomes bandwidth-bound: the arithmetic per token stays cheap while the memory traffic balloons.
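A back-of-the-envelope sketch makes the bandwidth floor concrete. The model shape below is an assumption (roughly an 8B-class config with grouped-query attention), and the bandwidth figure is an approximate H100 SXM peak, not a measured number:

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """KV cache size: 2 tensors (K and V) per layer, per token, in fp16/bf16.
    The config is illustrative, not tied to any specific deployment."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens

HBM_BANDWIDTH = 3.35e12  # bytes/s, approx H100 SXM peak -- an assumption

for n in (2_000, 128_000):
    cache = kv_cache_bytes(n)
    # every decoded token must stream the whole cache from HBM at least once
    floor_ms = cache / HBM_BANDWIDTH * 1e3
    print(f"{n:>7,} tokens: cache {cache / 1e9:.2f} GB, "
          f">= {floor_ms:.2f} ms/token just to read it")
```

Under these assumptions the cache grows from roughly a quarter of a gigabyte at 2K tokens to about 17 GB at 128K, and the read time per decoded token scales with it. No amount of extra FLOPs helps; the GPU is waiting on memory.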

Performance slows down, and quality can degrade too. In Needle-in-a-Haystack tests, models show the "Lost in the Middle" phenomenon: they can retrieve information from the beginning and end of context but miss what's in the middle. More context means attention gets spread thinner across irrelevant tokens. Larger windows can reduce accuracy, not improve it. The model sees more but may focus less.

Try this needle-in-haystack test. At 25K+ tokens, models often miss quotes from middle positions despite "seeing" the entire context.

Attention thrashing

Analogous to OS memory thrashing (where your computer spends more time swapping to disk than working), attention thrashing happens when context-processing overhead exceeds its benefits. At long context lengths, research shows models can spend significant compute on irrelevant tokens, while mid-context information retrieval can fail at rates between 7% and 50% depending on position depth. Prefill latency grows from sub-second at short contexts to 20-60 seconds at 100K+ tokens on H100 GPUs, with attention overhead dominating.

This is largely an architectural limit, not an intelligence issue. Long contexts increase computational overhead and can bury signal in noise.

Researchers are actively working to mitigate attention's quadratic cost, with each approach trading off different constraints.

See the strain: Attention Scaling Lab

The interactive below benchmarks self-attention kernels directly in your browser. Push sequence length from 256 to 4K tokens and watch latency and memory demands climb quadratically. Toggle between baseline attention, FlashAttention-style tiling, and block-sparse approximations to see how much relief each mitigation actually buys. The O(N²) curve is not just theory: you can watch it.

Mitigation strategies

Sparse Attention: Restricts tokens to attend only to local windows or patterns. Reduces O(N²) complexity but loses long-range dependencies.
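A minimal sketch of the sparsity idea in NumPy. The sliding-window pattern is one of several used in practice (real systems combine it with global tokens or rolling KV caches); the window size here is arbitrary:

```python
import numpy as np

def sliding_window_mask(n, window):
    """Boolean mask: token i may attend only to the `window` most recent
    tokens (causal, local attention). Allowed pairs grow O(N * window)
    instead of O(N^2)."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(8, 3)
print(mask.sum(), "allowed pairs vs", 8 * 9 // 2, "for full causal attention")
# -> 21 allowed pairs vs 36 for full causal attention
```

Scores outside the mask are simply never computed, which is where the savings come from; the cost is that a token can no longer directly see anything beyond its window.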

FlashAttention: Keeps data on chip to reduce memory traffic, which makes longer contexts more practical.
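The core trick is the online softmax: process keys and values block by block with a running max and normalizer, so the full N×N matrix is never materialized. Here's a simplified single-query sketch in NumPy; the real kernel also tiles queries and runs the blocks in on-chip SRAM:

```python
import numpy as np

def streaming_attention_row(q, K, V, block=64):
    """Attention output for one query, consuming K/V in blocks with a
    running max (m) and softmax denominator (l) -- the online-softmax
    recurrence at the heart of FlashAttention, minus all the hardware."""
    m = -np.inf                      # running max of scores seen so far
    l = 0.0                          # running softmax denominator
    acc = np.zeros(V.shape[1])       # unnormalized output accumulator
    scale = 1.0 / np.sqrt(q.shape[0])
    for start in range(0, K.shape[0], block):
        s = (K[start:start + block] @ q) * scale  # scores for this block only
        m_new = max(m, s.max())
        # rescale previous partial results to the new running max
        acc *= np.exp(m - m_new)
        l *= np.exp(m - m_new)
        p = np.exp(s - m_new)
        acc += p @ V[start:start + block]
        l += p.sum()
        m = m_new
    return acc / l
```

The output is mathematically identical to full attention; only the memory access pattern changes, which is why FlashAttention is an exact method rather than an approximation.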

Retrieval-Augmented Generation (RAG): Retrieves relevant passages before model processing to shrink the context and focus attention.
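A toy sketch of the retrieval step, assuming precomputed embeddings. The random vectors below are stand-ins for a real embedding model; only the top-k passages would then be placed in the model's context:

```python
import numpy as np

def retrieve_top_k(query_vec, passage_vecs, k=3):
    """Cosine-similarity retrieval: keep only the k most relevant passages
    so the model's context stays short and focused."""
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    sims = p @ q                       # cosine similarity per passage
    return np.argsort(sims)[::-1][:k]  # indices, most similar first

rng = np.random.default_rng(0)
passages = rng.standard_normal((1_000, 384))           # 1,000 candidate passages
query = passages[42] + 0.1 * rng.standard_normal(384)  # query near passage 42
print(retrieve_top_k(query, passages))                 # passage 42 should rank first
```

Instead of asking attention to sift 1,000 passages, retrieval hands it three. The quadratic cost still applies, but over a context small enough that it doesn't hurt.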

Fundamental limitations

Optimizations make things faster but do not change the fundamental problem. Attention just computes statistical correlations rather than semantic understanding. The model treats token 1 and token 10,000 as completely separate entities every time. Long-range structure has to emerge indirectly rather than through built-in hierarchy.

Good summaries require selecting key moments, while attention still computes over every token pair. Intelligence is not about seeing everything. It's about knowing what to look at.

These mitigations help, but they do not change the core cost of long-range attention. FlashAttention reduces memory traffic, sparse attention prunes tokens, and RAG filters externally. Research into hierarchical memory systems like HMT and alternatives like Mamba is promising, but production systems still rely on quadratic attention for long contexts.

The server thrashing problem I saw at Google had a cheap fix because the architecture allowed it. RAM was a provisionable resource, and our workload could restart without state loss. Transformers face different economics. Research confirms "model attention becomes increasingly unreliable as input length grows," yet every context extension multiplies inference costs across millions of queries. Provisioning helps, but it does not remove the architectural limits.