Whose Bits are Wiser, GPU | TPU?
In June 2018 I lucked into a live question during Google's TGIF (internal all-hands Q&A) in Mountain View, a few weeks after they'd hyped TPU v3 at I/O. I asked Sundar Pichai something I'd been wondering: "Outside Google, who's actually using Cloud TPUs right now?" He said it was a good question, admitted he wasn't sure, and promised to follow up because no one on stage had an answer either.
A few days later, to my surprise (even though I had emailed "TGIF customer support" for a follow-up), an email arrived. It contained roughly two dozen customer names. I honestly can't recall a single one now, and I'm pretty sure I didn't recognize any of them at the time. That was when it started to click: TPUs weren't a straight upgrade over GPUs, they were a bet on specialization. The decision comes down to choosing between GPU flexibility and TPU efficiency.
GPU architecture: parallel flexibility
Modern GPUs organize thousands of cores into Streaming Multiprocessors (SMs). Each SM executes hundreds of threads concurrently using SIMT (Single Instruction, Multiple Thread); 32 threads in a "warp" execute identical instructions in lockstep.
The memory hierarchy includes on-chip registers and shared memory for fast access, backed by High Bandwidth Memory (HBM) for capacity. Starting with the Volta generation, NVIDIA added Tensor Cores, dedicated matrix-multiplication units for deep learning.
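Those specs turn into headline FLOPS figures in a straightforward way. Here's a back-of-envelope sketch: the SM count matches the H100 SXM part, but the per-Tensor-Core rate and sustained clock are assumptions chosen to illustrate the arithmetic, not official specifications.

```python
# Rough peak-throughput estimate for an H100-class GPU.
# The per-Tensor-Core rate and clock below are illustrative assumptions.
sms = 132                 # streaming multiprocessors enabled on the SXM part
tensor_cores_per_sm = 4   # 4th-gen Tensor Cores per SM
fma_per_tc_per_clk = 512  # FP16 multiply-accumulates per Tensor Core per clock (assumed)
clock_hz = 1.83e9         # sustained clock while running tensor math (assumed)

# Each fused multiply-add counts as 2 floating-point operations.
peak_fp16_flops = sms * tensor_cores_per_sm * fma_per_tc_per_clk * 2 * clock_hz
print(f"Estimated peak FP16 tensor throughput: {peak_fp16_flops / 1e12:.0f} TFLOPS")
# -> roughly 989 TFLOPS, in line with the dense FP16 figure quoted for H100
```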
MLPerf Training v3.0 shows a 3,584-GPU H100 cluster completing the GPT-3 175B training benchmark in 10.9 minutes. Subsequent rounds demonstrated near-linear scaling to larger clusters.
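That wall-clock figure is easier to reason about as GPU-hours. A quick sketch of the arithmetic, with the near-linear extrapolation treated as an idealization rather than a measured result:

```python
# Convert the MLPerf GPT-3 result into GPU-hours and extrapolate
# under an assumed perfectly linear scaling model.
minutes = 10.9
gpus = 3584

gpu_hours = minutes / 60 * gpus
print(f"Benchmark run: {gpu_hours:,.0f} GPU-hours")  # ~651 GPU-hours

# If scaling were perfectly linear (an idealization), doubling the
# cluster would roughly halve the wall-clock time.
print(f"Idealized time on {gpus * 2} GPUs: {minutes / 2:.1f} minutes")
```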
*Interactive demo: select a workload to see how NVIDIA's SMs and Google's systolic arrays light up. Utilization, memory pressure, and FLOPs/W figures derive from Hopper and TPU documentation plus MLPerf counter traces.*
Google designed TPUs to attack what GPUs struggle with: the memory bandwidth wall. Turning matrix multiplication into a systolic assembly line lets each operand be reused many times on-chip instead of being fetched from memory again and again.
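One way to see why matrix multiplication sidesteps the bandwidth wall is arithmetic intensity: useful FLOPs per byte moved from memory. A sketch, assuming FP16 operands, ideal on-chip reuse, and the H100 peak and bandwidth figures from the comparison table below:

```python
# Arithmetic intensity of an N x N matrix multiply vs. an elementwise op.
# Assumes FP16 operands (2 bytes each) and ideal on-chip reuse.
def matmul_intensity(n, bytes_per_elem=2):
    flops = 2 * n**3                         # n^3 multiply-adds
    bytes_moved = 3 * n**2 * bytes_per_elem  # read A and B, write C
    return flops / bytes_moved

peak_flops = 989e12   # H100 FP16 tensor peak (from the table below)
bandwidth = 3.35e12   # H100 HBM bandwidth in bytes/s
machine_balance = peak_flops / bandwidth  # FLOPs per byte needed to stay compute-bound
print(f"Machine balance: {machine_balance:.0f} FLOPs/byte")

for n in (256, 1024, 8192):
    print(f"N={n:5d}: {matmul_intensity(n):7.1f} FLOPs/byte")

# Elementwise ops move ~6 bytes per FLOP in FP16 (two reads, one write),
# i.e. well under 1 FLOP/byte, so they stay bandwidth-bound no matter what;
# large matmuls clear the balance point easily.
```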
TPU architecture: specialized efficiency
*Animation: a systolic array performing matrix multiplication, the core operation of a TPU, with data flowing rhythmically through a grid of Processing Elements.*
TPUs are ASICs built for matrix multiplication. The first-generation TPU featured a 256×256 systolic array of 8-bit MACs (65,536 in all), delivering 92 TOPS of peak throughput.
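To make the data flow concrete, here is a small NumPy simulation of an output-stationary systolic array: values from A stream in from the left edge, values from B stream in from the top, and every Processing Element multiplies whatever passes through it and accumulates locally. It's a simplified sketch of the idea, not Google's actual microarchitecture.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an output-stationary systolic array.

    A is (n, k), B is (k, m); the PE grid is (n, m). Row i of A enters the
    left edge delayed by i cycles; column j of B enters the top edge delayed
    by j cycles. Values advance one PE per cycle, and each PE accumulates
    the product of the pair currently passing through it.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    a_reg = np.zeros((n, m))  # A values latched in each PE (moving right)
    b_reg = np.zeros((n, m))  # B values latched in each PE (moving down)
    C = np.zeros((n, m))      # stationary accumulators, one per PE

    for t in range(n + m + k - 2):
        new_a = np.zeros_like(a_reg)
        new_b = np.zeros_like(b_reg)
        # Shift last cycle's values one PE to the right / down.
        new_a[:, 1:] = a_reg[:, :-1]
        new_b[1:, :] = b_reg[:-1, :]
        # Feed fresh values at the edges, skewed so matching operands meet.
        for i in range(n):
            new_a[i, 0] = A[i, t - i] if 0 <= t - i < k else 0.0
        for j in range(m):
            new_b[0, j] = B[t - j, j] if 0 <= t - j < k else 0.0
        # Every PE multiplies and accumulates in the same cycle.
        C += new_a * new_b
        a_reg, b_reg = new_a, new_b
    return C

A = np.random.rand(4, 6)
B = np.random.rand(6, 5)
print(np.allclose(systolic_matmul(A, B), A @ B))  # True
```

The headline number also falls out of the geometry: 65,536 MACs × 2 operations each at the chip's roughly 700 MHz clock works out to about 92 × 10¹² operations per second.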
Beyond raw compute, TPUs optimize for low-precision arithmetic (INT8, bfloat16). Neural networks tolerate reduced precision, trading a small amount of accuracy for speed and power efficiency. Google scales TPUs into "Pods" using custom torus interconnects. Specialization limits flexibility, but Google's original TPU paper reported 30-80x better TOPS/Watt than contemporary CPUs and GPUs for inference workloads.
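To see what bfloat16 gives up, here is a sketch that emulates it by truncating float32 mantissa bits (real hardware rounds to nearest even rather than truncating) and measures the damage on a dot product:

```python
import numpy as np

def to_bfloat16(x):
    """Emulate bfloat16 by keeping only the top 16 bits of each float32.

    bfloat16 keeps float32's 8 exponent bits (same dynamic range) but only
    7 mantissa bits (~2-3 decimal digits of precision). Real hardware rounds
    to nearest even; plain truncation is close enough for illustration.
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

rng = np.random.default_rng(0)
a = rng.random(4096, dtype=np.float32)
b = rng.random(4096, dtype=np.float32)

exact = np.dot(a.astype(np.float64), b.astype(np.float64))
low_prec = np.dot(to_bfloat16(a).astype(np.float64), to_bfloat16(b).astype(np.float64))

rel_err = abs(low_prec - exact) / abs(exact)
print(f"Relative error from bfloat16 inputs: {rel_err:.2%}")
# Typically under one percent -- tolerable for neural-network workloads,
# while halving memory traffic and enabling much denser MAC hardware.
```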
Performance comparison
| Metric | NVIDIA H100 | Google TPU v5e | AMD MI300X |
|---|---|---|---|
| Peak FP16 Performance | 989 TFLOPS | 197 TFLOPS (BF16; 394 TOPS INT8) | 1,300 TFLOPS |
| Memory Bandwidth | 3.35 TB/s | 1.6 TB/s | 5.3 TB/s |
| TDP | 700W | 250W | 750W |
| Architecture | General Purpose | Matrix-Optimized | General Purpose |
| Software Ecosystem | CUDA (Dominant) | JAX/TensorFlow | ROCm (Growing) |
*Interactive calculator: adjust utilization, electricity price, cooling overhead, and amortization to compare annual total cost for H100, TPU v5e, MI300X, and Groq clusters built from the same node count.*
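Here is a minimal sketch of that calculation for the three accelerators in the table above. The TDPs come from the table; every price, electricity rate, and utilization figure below is an illustrative assumption, not a quote.

```python
def annual_cost(num_chips, tdp_watts, price_per_chip, *,
                utilization=0.6, electricity_per_kwh=0.10,
                pue=1.3, amortization_years=4):
    """Very rough annual total cost of ownership for an accelerator cluster.

    Energy: chips draw TDP * utilization on average, inflated by the data
    center's PUE to account for cooling and overhead. Capex: purchase price
    spread evenly over the amortization window. Ignores networking, host
    servers, and staff entirely.
    """
    hours_per_year = 24 * 365
    energy_kwh = num_chips * tdp_watts / 1000 * utilization * pue * hours_per_year
    opex = energy_kwh * electricity_per_kwh
    capex = num_chips * price_per_chip / amortization_years
    return opex + capex

# TDPs from the table; per-chip prices are placeholder assumptions.
cluster = 256  # chips
for name, tdp, price in [("H100", 700, 30_000),
                         ("TPU v5e", 250, 5_000),
                         ("MI300X", 750, 20_000)]:
    print(f"{name:8s} ~${annual_cost(cluster, tdp, price):,.0f} / year")
```

With these placeholder numbers, amortized capex dwarfs the power bill, which is why utilization and the amortization window move the answer more than the electricity price does.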
The converging landscape
Heterogeneous computing is already here, but the trajectories are converging: GPUs keep clawing back efficiency with tensor cores and smarter memory hierarchies, while TPUs keep layering on abstractions that make them feel less exotic. AMD's MI300X, Cerebras, and Groq show the spectrum, yet the core question stays the same. Can you bend GPU flexibility enough to meet your power budget, or stretch TPU efficiency far enough to cover your use case? Both camps are closing the gap, but physics keeps the trade-off intact.