
Signal Processing

Whose bits are wiser, GPU | TPU?

In June 2018, a few weeks after TPU v3 was announced at I/O, I got to ask a live question at Google's TGIF in Mountain View. I asked Sundar Pichai who was using Cloud TPUs outside Google. He said he did not know and promised to follow up.

A few days later, an email arrived with about two dozen customer names. I did not recognize most of them. I took it as a sign that TPUs were still specialized and not yet mainstream.

GPU architecture: parallel flexibility

Modern GPUs group cores into Streaming Multiprocessors (SMs) that run many threads in lockstep. Threads are scheduled in warps of 32 that execute the same instruction on different data.
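To make the lockstep idea concrete, here is a toy sketch (plain Python, purely illustrative; real GPUs do this in hardware) of one instruction stream driving 32 data lanes:

```python
# Toy SIMT sketch: one instruction stream, many data lanes.
# A "warp" applies the same operation to 32 lanes in lockstep.

WARP_SIZE = 32

def warp_execute(op, lanes):
    """Apply the same instruction (op) to every lane of a warp."""
    assert len(lanes) == WARP_SIZE
    return [op(x) for x in lanes]  # lockstep: same op, per-lane data

# Example: all 32 threads run "multiply by 2" on their own element.
data = list(range(WARP_SIZE))
result = warp_execute(lambda x: x * 2, data)
```

The per-lane data differs, but the instruction is shared: that is what lets the hardware amortize one fetch/decode across 32 threads.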

On-chip registers and shared memory keep hot data close to the cores, with High Bandwidth Memory (HBM) providing capacity. NVIDIA added Tensor Cores starting with the Volta architecture to accelerate matrix math.

MLPerf Training v3.0 reports H100 performance on GPT-3 scale workloads at multi-thousand GPU scale.

TPU architecture: specialized efficiency

TPUs are built around systolic arrays that keep data moving through the chip and reduce memory traffic for matrix math.

See the strain: Systolic Array Demo

The interactive below shows the dataflow: Matrix A enters from the left, Matrix B from the top, and each processing element (PE) performs a multiply-accumulate in lockstep. The data path is entirely predictable, which is one reason TPUs are efficient at matrix math: they trade flexibility for simpler data movement.
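If you prefer code to animation, the same dataflow can be simulated cycle by cycle. This is a minimal sketch of an output-stationary systolic array (one common variant; real TPU dataflow details differ): inputs are skewed so row i of A and column j of B arrive at PE(i, j) at exactly the right cycle, and every PE does one multiply-accumulate per cycle.

```python
def systolic_matmul(A, B):
    """Cycle-by-cycle simulation of an output-stationary systolic array.

    A is n x k, B is k x m; PE(i, j) accumulates one element of C.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    # a_reg[i][j] / b_reg[i][j]: values currently held by PE(i, j)
    a_reg = [[0] * m for _ in range(n)]
    b_reg = [[0] * m for _ in range(n)]
    for t in range(n + m + k - 1):  # enough cycles to drain the array
        # A values move one PE to the right each cycle; row i is
        # injected with a delay of i cycles so wavefronts line up.
        for i in range(n):
            for j in range(m - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            a_reg[i][0] = A[i][t - i] if 0 <= t - i < k else 0
        # B values move one PE down each cycle; column j is delayed by j.
        for j in range(m):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            b_reg[0][j] = B[t - j][j] if 0 <= t - j < k else 0
        # Every PE performs one multiply-accumulate in lockstep.
        for i in range(n):
            for j in range(m):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C
```

At cycle t, PE(i, j) sees A[i][t-i-j] and B[t-i-j][j], so the k indices always match; each operand is fetched from memory once and then flows between neighboring PEs, which is exactly the memory-traffic saving the text describes.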

TPUs are ASICs built around matrix multiplication. The first-generation TPU featured a 256x256 systolic array of 65,536 8-bit MACs, delivering 92 TOPS of peak throughput.

Beyond raw compute, TPUs are optimized for low-precision arithmetic (INT8, bfloat16): many workloads tolerate reduced precision, trading a little accuracy for speed and power efficiency. Google scales TPUs into "Pods" using custom torus interconnects. The specialization limits flexibility, but it can deliver 30-80x better TOPS/Watt for inference workloads.
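The low-precision trade-off is easy to see in miniature. Here is a sketch of symmetric INT8 quantization of a dot product (scales and rounding are illustrative; hardware implementations vary): values are mapped to 8-bit integers, multiplied cheaply, accumulated in a wider integer, then scaled back.

```python
# Sketch of the low-precision trade-off: symmetric INT8 quantization.

def quantize_int8(xs):
    """Map floats to int8 range [-127, 127] with a per-tensor scale."""
    scale = max(abs(x) for x in xs) / 127 or 1.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def int8_dot(a, b):
    """Dot product in 8-bit, accumulated in a wider integer (as hardware does)."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # int32-style accumulator
    return acc * sa * sb  # dequantize back to float

a = [0.12, -0.53, 0.98, 0.31]
b = [0.77, 0.25, -0.44, 0.09]
exact = sum(x * y for x, y in zip(a, b))
approx = int8_dot(a, b)
```

The result is close to the exact float answer, but each multiply now needs only 8-bit operands, which is where the area and power savings come from.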

Performance comparison

Metric                 | NVIDIA H100       | Google TPUv5e    | AMD MI300X
Peak FP16 performance  | 989 TFLOPS        | 393 TFLOPS       | 1,300 TFLOPS
Memory bandwidth       | 3.35 TB/s         | 1.6 TB/s         | 5.3 TB/s
TDP                    | 700 W             | 250 W            | 750 W
Architecture           | General purpose   | Matrix-optimized | General purpose
Software ecosystem     | CUDA (dominant)   | JAX/TensorFlow   | ROCm (growing)
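A quick back-of-envelope from the table's own numbers puts the efficiency story in perspective. This uses peak specs only; delivered TFLOPS/W depends heavily on utilization and workload.

```python
# Back-of-envelope: peak FP16 TFLOPS per watt, from the table above.
# Peak specs only; real efficiency depends on utilization and workload.
chips = {
    "NVIDIA H100":   (989, 700),    # (peak FP16 TFLOPS, TDP in watts)
    "Google TPUv5e": (393, 250),
    "AMD MI300X":    (1300, 750),
}
for name, (tflops, watts) in chips.items():
    print(f"{name}: {tflops / watts:.2f} peak TFLOPS/W")
```

On paper the three chips land within about 20% of each other in peak TFLOPS/W, which is why the ecosystem and workload fit, not raw specs, usually decide the choice.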

The converging landscape

Heterogeneous computing is already here, and the trajectories are converging. GPUs keep adding matrix hardware and memory tricks, while TPUs keep adding software layers. AMD's MI300X, Cerebras, and Groq show the range, but the core trade-off remains: flexibility vs efficiency. The gap is narrowing, but the trade-off does not disappear.