
Compiled Thoughts

Attention Thrashing: Interactive Test

This Needle-in-a-Haystack test demonstrates the "Lost in the Middle" phenomenon. We embed specific quotes (needles) at the 40%, 50%, and 60% positions of the generated text. At short context lengths the model retrieves them perfectly; at long lengths it fails, despite technically "seeing" everything.

1) ✅ REAL (40%): The future has not been written yet.
2) ❌ FAKE: The destiny we create has not been written.
3) ✅ REAL (50%): No fate but what we make for ourselves.
4) ❌ FAKE: No future except what we make for ourselves.
5) ✅ REAL (60%): There is no destiny except the one we create.

Real needles (✅) are embedded at exactly 40%, 50%, and 60%. Fake needles (❌) use similar words but are guaranteed NOT to appear anywhere in the generated text.
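
For readers who want to reproduce the setup, here is a minimal sketch of how such a haystack can be built. The filler word pool and function name are illustrative assumptions rather than the page's actual generator; the only requirements are that real needles land at fixed fractional positions and that the filler vocabulary cannot accidentally spell out the fake phrases.

```python
import random

# Real needles keyed by their target fractional position in the text.
REAL_NEEDLES = {
    0.40: "The future has not been written yet.",
    0.50: "No fate but what we make for ourselves.",
    0.60: "There is no destiny except the one we create.",
}

# Hypothetical filler vocabulary. It deliberately excludes the words used by
# the fake needles, so those phrases cannot appear by chance.
FILLER_WORDS = (
    "system context token model attention memory signal noise "
    "cache layer vector query value position window budget"
).split()

def build_haystack(total_words: int, seed: int = 0) -> str:
    """Generate filler text and splice each real needle in at its
    target fractional position (40%, 50%, 60% of the word count)."""
    rng = random.Random(seed)
    words = [rng.choice(FILLER_WORDS) for _ in range(total_words)]
    # Insert from the last position backwards so earlier insertions
    # do not shift the later target indices.
    for frac in sorted(REAL_NEEDLES, reverse=True):
        words.insert(int(total_words * frac), REAL_NEEDLES[frac])
    return " ".join(words)
```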

Control Test: 100 Tokens

Proves the model can retrieve the needles when context is short. Expect 100% accuracy.


Thrashing Test: 25K Tokens

Same task, but attention is overwhelmed by the extra tokens. Accuracy collapses.

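Using the sketch above, the two tests differ only in how much filler is generated. The word counts below assume a rough 0.75 words-per-token ratio for English, which is a rule of thumb rather than a measurement from this page.

```python
# ~100-token control vs ~25K-token thrashing haystack (0.75 words/token is approximate).
control_text = build_haystack(total_words=75)
thrashing_text = build_haystack(total_words=19_000)

print(len(control_text.split()), "words in control")
print(len(thrashing_text.split()), "words in thrashing")
```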

How to Test:

  1. Generate a test → Copy → Paste into ai.dev (or any LLM)
  2. Ask this question:
    "Does this text contain the following five phrases? For each phrase, respond with YES and the approximate percentage position (e.g., 'YES at ~40%') or NO if not found: (1) 'The future has not been written yet.', (2) 'The destiny we create has not been written.', (3) 'No fate but what we make for ourselves.', (4) 'No future except what we make for ourselves.', (5) 'There is no destiny except the one we create.'"
  3. Expected (a scoring sketch for checking answers follows this list):
    • Control: Perfect accuracy: 1) YES ~40% | 2) NO | 3) YES ~50% | 4) NO | 5) YES ~60%
    • Thrashing: MAY miss needles, report wrong positions, or falsely detect fakes (#2, #4)
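
If you would rather score responses programmatically than by eye, a checker along these lines works. The regular expressions encode assumptions about how the model formats its answer (numbered items, "YES"/"NO", a percentage), so treat it as a sketch rather than a robust parser.

```python
import re

# Ground truth: phrase number -> (should be found, approximate position in %).
GROUND_TRUTH = {
    1: (True, 40),    # "The future has not been written yet."
    2: (False, None),
    3: (True, 50),    # "No fate but what we make for ourselves."
    4: (False, None),
    5: (True, 60),    # "There is no destiny except the one we create."
}

def score_response(response: str) -> dict:
    """Parse answers like '(1) YES at ~40%' or '2) NO' and mark each
    phrase correct or incorrect against the ground truth above."""
    results = {}
    for num, (expected_found, expected_pos) in GROUND_TRUTH.items():
        # Capture the text after this phrase number, up to the next numbered item.
        m = re.search(rf"\(?{num}\)\s*(.*?)(?=\(?[1-9]\)|\Z)", response, flags=re.S)
        answer = m.group(1).upper() if m else ""
        said_yes = "YES" in answer
        correct = said_yes == expected_found
        if correct and expected_found:
            pos = re.search(r"(\d+)\s*%", answer)
            # Allow a loose +/-15-point tolerance on the reported position.
            correct = bool(pos) and abs(int(pos.group(1)) - expected_pos) <= 15
        results[num] = correct
    return results
```

In the control run every entry should come back True; in the thrashing run, the entries that flip to False are exactly the misses, wrong positions, and false detections described above.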

💡 The old-fashioned way: Click "Show Preview" and use Ctrl+F to search for any needle in the generated text.

What You're Testing

As context grows, transformers exhibit attention thrashing: wasting compute on irrelevant tokens while losing mid-context retrieval accuracy. Models "see" everything but focus on nothing.

Notice: long-context responses are also markedly slower. Prefill latency grows quadratically with input length: processing 128K tokens takes multiple seconds versus sub-second at short contexts.
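
A quick back-of-envelope illustrates the quadratic term. The layer count and model width below are illustrative assumptions, not any particular model's configuration; only the n² scaling matters.

```python
def attention_flops(seq_len: int, d_model: int = 4096, n_layers: int = 32) -> float:
    """Approximate FLOPs for the two quadratic matmuls in self-attention
    (Q·K^T and softmax(Q·K^T)·V), ignoring projections and MLP blocks."""
    return 2 * 2 * seq_len ** 2 * d_model * n_layers  # 2 matmuls, 2 FLOPs per multiply-add

for n in (1_000, 25_000, 128_000):
    print(f"{n:>7} tokens: {attention_flops(n):.2e} attention FLOPs")
```

Going from 1K to 128K tokens multiplies that quadratic term by roughly 16,000x, which is why prefill dominates latency at long contexts even before any retrieval errors appear.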

Read the full article →