
Whose bits are wiser, GPU | TPU?

In June 2018 I got a live question in at Google's TGIF in Mountain View, a few weeks after TPU v3 was announced at I/O. I asked Sundar Pichai who was using Cloud TPUs outside Google. He said he did not know and promised to follow up.

A few days later, an email arrived with about two dozen customer names. I did not recognize most of them. I took it as a sign that TPUs were still specialized and not yet a general-purpose default. In production terms, that usually means a strong fit for a narrow class of workloads rather than the default for a whole stack.

Architectural differences

Modern GPUs group cores into Streaming Multiprocessors (SMs) that run many threads in lockstep. On NVIDIA hardware, threads are scheduled in warps of 32, and every thread in a warp executes the same instruction.

On-chip registers and shared memory keep hot data close, with High Bandwidth Memory (HBM) for capacity. NVIDIA added Tensor Cores starting with Volta to accelerate matrix math.

MLPerf Training v3.0 reports H100 performance on GPT-3 scale workloads at multi-thousand GPU scale.

When the workload keeps changing, flexibility usually wins. Divergent control flow can hurt warp efficiency, but the programming model still lets you handle irregular ops, custom kernels, and changing model architecture in one stack. That practical reach is why GPUs dominate open ecosystems and remain the default choice for teams moving a model into production for the first time.
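To make the divergence cost concrete, here is a minimal sketch in plain Python. It is a toy model, not how any real GPU scheduler works: it assumes a warp serializes one pass per distinct branch taken and that every lane is occupied for each pass. The names `simd_efficiency` and `branch_cost`, and the cycle counts, are hypothetical.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_passes(branch_ids):
    """Serialized passes for one warp: one per distinct branch taken."""
    return len(set(branch_ids))


def simd_efficiency(branch_ids, branch_cost):
    """Useful thread-cycles divided by issued thread-cycles for one warp.

    branch_ids  -- the branch each thread takes (one entry per thread)
    branch_cost -- hypothetical cycle count for each branch body
    """
    taken = set(branch_ids)
    # Every lane is held for the full length of every branch that any thread takes.
    issued = sum(branch_cost[b] for b in taken) * len(branch_ids)
    # Each thread only does useful work during its own branch.
    useful = sum(branch_cost[b] for b in branch_ids)
    return useful / issued


cost = {0: 10, 1: 10}                         # assumed cycles per branch body
uniform = [0] * WARP_SIZE                     # all threads take branch 0
split = [i % 2 for i in range(WARP_SIZE)]     # threads alternate branches

print(simd_efficiency(uniform, cost))  # 1.0
print(simd_efficiency(split, cost))    # 0.5
```

In this model a two-way split halves the useful work per issued cycle, yet the code still runs, which is the flexibility argument in a nutshell.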

TPUs are built around systolic arrays to keep data moving and reduce memory traffic for matrix math.

Matrix A enters from the left. Matrix B enters from the top. Each processing element (PE) performs a multiply-accumulate in lockstep with its neighbors. The path is predictable, which is one reason TPUs can be efficient at matrix math. They trade flexibility for simpler data movement.
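The dataflow described above can be sketched as a small cycle-by-cycle simulation. This is the textbook layout, where each PE keeps its own accumulator and operands are fed in with a one-cycle skew per row and column. It is an illustration under those assumptions, not a model of actual TPU hardware, which differs in detail (the first-generation TPU, for example, holds weights in place). The function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m grid of PEs.

    Returns (C, cycles). A values move right, B values move down,
    and each PE adds the product of what it currently holds.
    """
    n, k_dim, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per PE
    a_reg = [[0] * m for _ in range(n)]  # A operand currently held by each PE
    b_reg = [[0] * m for _ in range(n)]  # B operand currently held by each PE
    cycles = n + m + k_dim - 2           # time for the last operands to reach the far corner

    for t in range(cycles):
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:
                    # Left edge: row i of A enters, delayed by i cycles.
                    k = t - i
                    new_a[i][j] = A[i][k] if 0 <= k < k_dim else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]  # pass along from the left neighbor
                if i == 0:
                    # Top edge: column j of B enters, delayed by j cycles.
                    k = t - j
                    new_b[i][j] = B[k][j] if 0 <= k < k_dim else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]  # pass along from the neighbor above
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]  # multiply-accumulate
    return acc, cycles


C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C)       # [[19, 22], [43, 50]]
print(cycles)  # 4
```

Each operand is read from memory once and then reused as it passes from PE to PE, which is where the reduced memory traffic comes from.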

TPUs are ASICs built for matrix multiplication. The first generation TPU featured a 65,536 8-bit MAC systolic array delivering 92 TOPS peak throughput.
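The 92 TOPS figure can be checked with back-of-envelope arithmetic. The sketch below assumes a 256 x 256 array, a 700 MHz clock, and two operations (one multiply, one add) per MAC per cycle. The clock and array shape are not stated in this article; they come from Google's published description of the first-generation TPU.

```python
MACS = 256 * 256     # 65,536 multiply-accumulate units (assumed 256 x 256 layout)
OPS_PER_MAC = 2      # one multiply plus one add per cycle
CLOCK_HZ = 700e6     # assumption: 700 MHz clock, not stated in this article

peak_tops = MACS * OPS_PER_MAC * CLOCK_HZ / 1e12
print(round(peak_tops, 2))  # 91.75, which rounds to the quoted 92 TOPS
```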

Beyond raw compute, TPUs optimize for low-precision arithmetic (INT8, bfloat16). Many workloads tolerate reduced precision, trading some accuracy for speed and power efficiency. Google scales TPUs into "Pods" using custom torus interconnects. Specialization limits flexibility, but it pays off in efficiency: in the first-generation inference setting, the TPU delivered roughly 30-80x higher TOPS/Watt than contemporary CPUs and GPUs on the evaluated neural-network inference workloads.
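To see what reduced precision costs in practice, here is a small sketch that rounds a value to bfloat16 using only the standard library. bfloat16 keeps the sign and 8-bit exponent of float32 and shortens the mantissa to 7 bits, so conversion amounts to keeping the top 16 bits. The helper names are hypothetical, the rounding mode (round-to-nearest-even) is an assumption about the conversion, and NaN handling is left out.

```python
import struct


def to_bfloat16_bits(x):
    """Return the 16-bit bfloat16 pattern for x (round-to-nearest-even, no NaN handling)."""
    bits32 = struct.unpack("<I", struct.pack("<f", x))[0]  # float32 bit pattern
    # Add just under half a unit of the kept precision, plus one if the kept LSB is odd.
    rounding_bias = 0x7FFF + ((bits32 >> 16) & 1)
    return ((bits32 + rounding_bias) >> 16) & 0xFFFF


def from_bfloat16_bits(b):
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]


def bf16(x):
    """Round-trip x through bfloat16."""
    return from_bfloat16_bits(to_bfloat16_bits(x))


print(bf16(1.0))      # 1.0
print(bf16(3.14159))  # 3.140625
```

The range matches float32, but only about two to three decimal digits survive, which is the accuracy being traded for throughput and power.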

TPUs trade control flow flexibility for efficient dataflow, keeping operands moving across a grid where they work best for dense matrix operations. The trade-off is real: if your stack is stable and matrix heavy, TPU can be efficient; if your model keeps introducing control-flow exceptions or shape-shifting operators, TPU mapping gets painful fast.

From a developer standpoint, GPUs are usually easier for mixed stacks: data preprocessing, custom ops, control-heavy models, and kernels that change often. TPUs excel when the workload maps cleanly through XLA/JAX onto dense linear algebra. That makes them powerful for large training runs, but less forgiving when the model has irregular components.

Performance tradeoffs

| Metric | NVIDIA H100 | Google TPUv5e | AMD MI300X |
| --- | --- | --- | --- |
| Vendor peak matrix compute | 1,979 TFLOPS FP16/BF16 Tensor Core with sparsity | 197 TFLOPS BF16 per chip | 1.3 PFLOPS FP16/BF16 dense |
| HBM capacity / bandwidth | 80 GB HBM3 / 3.35 TB/s | 16 GB HBM2 / 819 GB/s | 192 GB HBM3 / 5.3 TB/s |
| Power disclosure | up to 700W (configurable) | not publicly listed | 750W peak TBP |
| Architecture | General Purpose | Matrix-Optimized | General Purpose |
| Software ecosystem | CUDA (Dominant) | JAX/TensorFlow | ROCm (Growing) |

Peak numbers miss most of the production story. They also mix vendor conventions: NVIDIA's H100 figure above is the Tensor Core peak with sparsity, Google's TPUv5e figure is per-chip BF16, and AMD quotes dense plus separate sparsity peaks. Memory bandwidth, interconnect, software ecosystem, and model shape determine throughput on real workloads. The table is a useful shorthand, not a benchmark.
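One way to see why bandwidth matters as much as peak compute is a roofline-style estimate: attainable throughput is the smaller of peak compute and bandwidth times arithmetic intensity (FLOPs per byte moved). The sketch below plugs in the vendor peaks quoted in this section, mixed conventions and all, so treat the outputs as illustrative rather than comparable. The function names are hypothetical.

```python
# (peak FLOP/s as quoted above, HBM bandwidth in bytes/s)
CHIPS = {
    "H100 (with sparsity)": (1979e12, 3.35e12),
    "TPU v5e": (197e12, 819e9),
    "MI300X (dense)": (1.3e15, 5.3e12),
}


def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOP/byte) above which a kernel becomes compute-bound."""
    return peak_flops / bandwidth


def attainable_flops(peak_flops, bandwidth, intensity):
    """Roofline bound: limited by memory below the ridge point, by compute above it."""
    return min(peak_flops, bandwidth * intensity)


for name, (peak, bw) in CHIPS.items():
    print(name, round(ridge_point(peak, bw), 1), "FLOP/byte at the ridge")

# A kernel doing 100 FLOPs per byte on the TPU v5e figures is bandwidth-bound:
peak, bw = CHIPS["TPU v5e"]
print(attainable_flops(peak, bw, 100) / 1e12, "TFLOPS attainable")
```

With these inputs, a kernel needs a few hundred FLOPs per byte before the peak number becomes the limit on any of the three chips, and many real operators sit well below that.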

Converging landscape

Mixed fleets are already normal, and the lines are getting blurrier. GPUs keep adding matrix hardware and memory tricks, while TPUs keep adding software layers. AMD's MI300X, Cerebras, and Groq show the range. The useful question is not which accelerator is wiser in the abstract. It is whether the workload, compiler stack, interconnect, memory footprint, and fleet economics line up.