Whose bits are wiser, GPU | TPU?
In June 2018 I got a live question in at Google's TGIF in Mountain View, a few weeks after TPU v3 was announced at I/O. I asked Sundar Pichai who was using Cloud TPUs outside Google. He said he did not know and promised to follow up.
A few days later, an email arrived with about two dozen customer names. I did not recognize most of them. I took it as a sign that TPUs were still specialized and not yet mainstream.
Architectural differences
Modern GPUs group cores into Streaming Multiprocessors (SMs) that schedule threads in warps of 32, each warp executing the same instruction in lockstep.
On-chip registers and shared memory keep hot data close, with High Bandwidth Memory (HBM) for capacity. NVIDIA added Tensor Cores starting with Volta to accelerate matrix math.
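In practice, most code reaches that matrix hardware through a framework rather than hand-written kernels. A minimal sketch, assuming PyTorch and a CUDA-capable GPU (not code from this post), of letting mixed precision route a matmul down the path Tensor Cores accelerate:

```python
# Minimal sketch: mixed-precision matmul on a GPU (assumes PyTorch + CUDA).
# Under autocast the framework runs the matmul in FP16, the format
# Tensor Cores are built to accelerate.
import torch

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    c = a @ b  # dispatched as an FP16 matmul, eligible for Tensor Cores

print(c.dtype)  # torch.float16
```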
MLPerf Training v3.0 includes H100 results on GPT-3-scale workloads run across clusters of several thousand GPUs.
GPUs win on flexibility. Divergent control flow slows warps, but the programming model lets you handle irregular ops, custom kernels, and mixed workloads in one place. That generality is why GPUs dominate open ecosystems.
It also makes GPUs the default target for new research ideas.
That inertia compounds over time.
TPUs are built around systolic arrays to keep data moving and reduce memory traffic for matrix math.
See the stream: Systolic Array Demo
The interactive below shows the dataflow. Matrix A enters from the left, Matrix B from the top, and each processing element (PE) does a multiply-accumulate in lockstep. The path is predictable, which is one reason TPUs can be efficient at matrix math. They trade flexibility for simpler data movement.
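To make the lockstep dataflow concrete, here is a small cycle-level sketch in Python (not the interactive itself, just the same idea): each PE owns one output and performs one multiply-accumulate per cycle, on operands that arrive skewed by their distance from the array edges.

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level sketch of an output-stationary systolic array.

    A (M x K) streams in from the left, B (K x N) from the top.
    PE (i, j) holds the running sum for C[i, j]; operands take one
    cycle per hop, so at cycle t it sees A[i, k] and B[k, j] with
    k = t - i - j.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for t in range(K + M + N - 2):        # enough cycles to drain the array
        for i in range(M):
            for j in range(N):
                k = t - i - j             # skewed arrival time
                if 0 <= k < K:
                    C[i, j] += A[i, k] * B[k, j]   # one MAC per PE per cycle
    return C

A = np.arange(6, dtype=np.int32).reshape(2, 3)
B = np.arange(12, dtype=np.int32).reshape(3, 4)
assert np.array_equal(systolic_matmul(A, B), A @ B)
```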
TPUs are ASICs built for matrix multiplication. The first-generation TPU featured a 256x256 systolic array of 65,536 8-bit MAC units, delivering 92 TOPS of peak throughput.
Beyond raw compute, TPUs optimize for low-precision arithmetic (INT8, bfloat16). Many workloads tolerate reduced precision, trading some accuracy for speed and power efficiency. Google scales TPUs into "Pods" using custom torus interconnects. Specialization limits flexibility but can deliver roughly 30-80x better TOPS/Watt for inference workloads.
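The 92 TOPS figure falls straight out of the array size and clock. A quick back-of-the-envelope check, taking the 700 MHz clock from the published TPU v1 paper:

```python
# Back-of-the-envelope check of the TPU v1 peak figure.
macs = 256 * 256          # 65,536 8-bit MAC units in the systolic array
clock_hz = 700e6          # 700 MHz clock (TPU v1, per the ISCA 2017 paper)
ops_per_mac = 2           # one multiply + one add
peak_tops = macs * ops_per_mac * clock_hz / 1e12
print(f"{peak_tops:.1f} TOPS")   # ~91.8 TOPS, quoted as 92 TOPS
```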
TPUs trade control flow for dataflow. Systolic arrays keep operands moving across the grid, which is efficient for dense matmul, but irregular ops need to be reshaped or offloaded. The specialization is the advantage and the constraint.
From a developer standpoint, GPUs handle the full pipeline, from data preprocessing to custom ops and control-heavy models. TPUs excel at dense linear algebra, but workloads often need to be reshaped to fit their dataflow model. That makes them powerful for large training runs, but less forgiving when the model has irregular components.
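To show what "reshaped" often means in practice, here is a minimal Python sketch using JAX (the function and bucket size are illustrative, not from any particular codebase): variable-length inputs are padded to one fixed bucket and masked, so the compiler sees a single dense, fixed-shape program of the kind TPU-style hardware prefers, instead of branching on the true length.

```python
# Sketch: reshaping an irregular workload for an XLA/TPU-style compiler (assumes JAX).
# Variable-length sequences are padded to one static bucket size so the
# compiler compiles a single fixed-shape program instead of one per length.
import jax
import jax.numpy as jnp
import numpy as np

BUCKET = 128  # hypothetical fixed sequence length

@jax.jit
def masked_mean(x, mask):
    # Dense math over the padded tensor; the mask removes the padding's
    # contribution instead of branching on the true length.
    return (x * mask).sum() / mask.sum()

def pad_to_bucket(seq):
    x = np.zeros(BUCKET, dtype=np.float32)
    m = np.zeros(BUCKET, dtype=np.float32)
    x[: len(seq)] = seq
    m[: len(seq)] = 1.0
    return jnp.asarray(x), jnp.asarray(m)

for seq in ([1.0, 2.0, 3.0], [4.0] * 10):
    x, m = pad_to_bucket(seq)
    print(masked_mean(x, m))   # 2.0, then 4.0
```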
Performance tradeoffs
| Metric | NVIDIA H100 | Google TPUv5e | AMD MI300X |
|---|---|---|---|
| Peak FP16 Performance | 989 TFLOPS | 393 TFLOPS | 1,300 TFLOPS |
| Memory Bandwidth | 3.35 TB/s | 1.6 TB/s | 5.3 TB/s |
| TDP | 700W | 250W | 750W |
| Architecture | General Purpose | Matrix-Optimized | General Purpose |
| Software Ecosystem | CUDA (Dominant) | JAX/TensorFlow | ROCm (Growing) |
Peak numbers are not the full story. Memory bandwidth, interconnect, and software ecosystem determine throughput on real models. The table is a useful shorthand, but workload fit decides the winner. Training at scale tends to be limited by interconnect and software stack, while inference is often constrained by power efficiency and cost per token.
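One way to read the table is through a rough roofline lens: dividing peak compute by memory bandwidth gives the arithmetic intensity a kernel needs before the chip is compute-bound rather than bandwidth-bound. A quick sketch using only the table's peak numbers:

```python
# Rough roofline break-even points from the table above: the arithmetic
# intensity (FLOPs per byte moved from memory) a kernel needs before peak
# compute, rather than memory bandwidth, becomes the limit.
chips = {
    "H100":    (989e12, 3.35e12),   # (peak FP16 FLOP/s, memory bytes/s)
    "TPU v5e": (393e12, 1.6e12),
    "MI300X":  (1300e12, 5.3e12),
}
for name, (flops, bw) in chips.items():
    print(f"{name}: ~{flops / bw:.0f} FLOPs/byte to be compute-bound")
```

All three land in the same ballpark of a few hundred FLOPs per byte, which is why dense matmul shines on every one of them and memory-bound ops flatter none of them.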
Converging landscape
Heterogeneous computing is already here, and the trajectories are converging. GPUs keep adding matrix hardware and memory tricks, while TPUs keep adding software layers. AMD's MI300X, Cerebras, and Groq show the range, but the core trade-off remains: flexibility vs efficiency. The gap is narrowing, but the trade-off does not disappear.