KV Cache Stress Test

Explore how autoregressive decode throughput falls as the context window grows. Adjust model size, context length, batch size, and the mitigations to see the effect on memory bandwidth and latency.

Model size

Context length: 8,192 tokens

Batch size: 1 (interactive)

Mitigations: FlashAttention / tiling; retrieval pruning (RAG top-4)

Decode throughput (tokens/s): awaiting input

HBM bandwidth (GB/s): awaiting input

KV cache footprint (GB): awaiting input
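
As a rough guide to what the KV cache footprint readout represents, here is a minimal Python sketch of the usual sizing rule for a decoder-only transformer: one key tensor and one value tensor per layer, per token, per sequence. The function name and the example configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not the lab's actual model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes (hypothetical helper).

    The leading factor of 2 counts the key tensor and the value tensor
    stored for every layer, token, and sequence in the batch.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch_size * bytes_per_elem)


# Assumed 7B-class configuration at the page's default settings:
# 8,192-token context, batch size 1, 16-bit cache entries.
footprint = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                           context_len=8192, batch_size=1)
print(f"{footprint / 1e9:.2f} GB")  # decimal gigabytes, as on the readout
```

Under these assumptions the cache grows linearly in both context length and batch size, so doubling either one doubles the footprint.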

Note: This lab uses simplified models for illustrative purposes. RAG reduction is a conceptual approximation.
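
The sketch below illustrates why decode throughput drops with context length under a simple bandwidth-bound assumption: each decode step streams the model weights once and reads every cached key and value, so step time is roughly bytes moved divided by HBM bandwidth. All numbers are hypothetical (14 GB of weights, 0.5 MiB of KV cache per token, 2 TB/s of bandwidth), retrieval pruning is modeled only as a shorter effective context, and this is not necessarily the formula the lab itself uses.

```python
def decode_tokens_per_s(weight_bytes, kv_bytes_per_token, context_len,
                        batch_size, hbm_bytes_per_s):
    """Bandwidth-bound decode estimate (hypothetical helper).

    Weights are read once per step and shared across the batch;
    the KV cache is read separately for every sequence.
    """
    bytes_per_step = weight_bytes + kv_bytes_per_token * context_len * batch_size
    step_time_s = bytes_per_step / hbm_bytes_per_s
    return batch_size / step_time_s  # one new token per sequence per step


# Assumed values, chosen for illustration only.
WEIGHTS = 14e9            # 7B parameters at 16 bits each
KV_PER_TOKEN = 524_288    # 0.5 MiB per token
BANDWIDTH = 2e12          # 2 TB/s

full = decode_tokens_per_s(WEIGHTS, KV_PER_TOKEN, 8192, 1, BANDWIDTH)
# Pruning modeled as keeping a quarter of the context (an assumption).
pruned = decode_tokens_per_s(WEIGHTS, KV_PER_TOKEN, 2048, 1, BANDWIDTH)
print(f"full context: {full:.1f} tokens/s, pruned: {pruned:.1f} tokens/s")
```

In this model larger batches raise total tokens per second because the weight reads are amortized, while longer contexts lower it because the KV cache reads scale with every sequence.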