Thinking fast and MoE

Sparse models decouple total capacity from active compute. DeepSeek-V3 has 671 billion parameters with 37 billion active per token. DeepSeek's technical report frames this as training capacity diverging from inference cost.

Across recent releases, total capacity keeps growing while active compute stays much smaller. Dense models are still getting bigger, but many new releases are using sparse designs instead.

MoE routing basics

Mixture of Experts replaces each dense feedforward layer with a set of specialist expert networks plus a small router that selects a few experts per token. Google's Switch Transformer took this to the extreme by routing each token to a single expert, which cut training time while maintaining quality.
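To make the routing step concrete, here is a minimal NumPy sketch of a top-k MoE layer. It is an illustration under stated assumptions, not any particular model's implementation: the names `moe_layer`, `router_w`, and `experts` are hypothetical, the experts are plain two-matrix ReLU feedforward blocks, and the gate weights are renormalized over the chosen experts, which is one common choice among several.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, router_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    tokens:   (n_tokens, d_model)
    router_w: (d_model, n_experts) hypothetical linear router
    experts:  list of (w_in, w_out) pairs, one feedforward block per expert
    """
    probs = softmax(tokens @ router_w)             # (n_tokens, n_experts)
    top_k = np.argsort(-probs, axis=-1)[:, :k]     # indices of the k highest-scoring experts
    out = np.zeros_like(tokens)
    for t in range(tokens.shape[0]):
        gate = probs[t, top_k[t]]
        gate = gate / gate.sum()                   # renormalize over the chosen experts (assumption)
        for g, e in zip(gate, top_k[t]):
            w_in, w_out = experts[e]
            hidden = np.maximum(tokens[t] @ w_in, 0.0)   # ReLU feedforward
            out[t] += g * (hidden @ w_out)
    return out, top_k

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 16, 4
tokens = rng.normal(size=(5, d_model))
router_w = rng.normal(size=(d_model, n_experts))
experts = [
    (rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
    for _ in range(n_experts)
]

out, chosen = moe_layer(tokens, router_w, experts, k=2)
print(chosen)   # which two experts each token was sent to
```

Setting `k=1` gives Switch-style single-expert routing. Only the selected experts' matrices are multiplied for a given token, which is where the compute saving comes from.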

The ratio of active to total parameters matters because per-token inference compute tracks active parameters, not total capacity. That is why MoE models can look enormous on paper while staying practical at inference time.
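A quick back-of-the-envelope calculation shows how small the active slice is for the DeepSeek-V3 figures quoted earlier. The FLOPs estimate uses the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass, which ignores attention cost and is an assumption here, not a number from any report.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of the model's parameters that run for a single token."""
    return active_params_b / total_params_b

def approx_forward_flops(params_b):
    """Rule-of-thumb forward cost: ~2 FLOPs per parameter per token (assumption)."""
    return 2 * params_b * 1e9

frac = active_fraction(671, 37)
print(f"active fraction: {frac:.1%}")
print(f"approx FLOPs/token with 37B active: {approx_forward_flops(37):.2e}")
print(f"approx FLOPs/token if all 671B ran: {approx_forward_flops(671):.2e}")
```

About 5.5% of the parameters do the work for any given token, so the per-token compute resembles that of a model roughly one eighteenth the size.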

The cost is routing overhead: the model has to score experts for every token and then bring tokens and expert weights together, often across devices, and that overhead can eat into the savings.

Economics and balance

DeepSeek-V3 trained on 2,788,000 H800 GPU hours at an estimated cost of $5.6 million. Llama 3.1 405B trained for 30,840,000 GPU hours. Those are not controlled comparisons, but they show why sparse architectures changed the cost discussion for frontier-scale training. NVIDIA's analysis reports up to 10x faster throughput and about one tenth the cost in those comparisons, with workload and implementation caveats.
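The arithmetic behind those figures is simple enough to check directly. This sketch uses only the numbers quoted above; the implied hourly rate is derived from them and is not a published price.

```python
deepseek_v3_gpu_hours = 2_788_000
deepseek_v3_cost_usd = 5_600_000
llama_405b_gpu_hours = 30_840_000

# Implied rental rate behind the DeepSeek-V3 cost estimate
implied_rate = deepseek_v3_cost_usd / deepseek_v3_gpu_hours

# How many times more GPU hours the dense 405B run used
gpu_hour_ratio = llama_405b_gpu_hours / deepseek_v3_gpu_hours

print(f"implied cost per GPU hour: ${implied_rate:.2f}")
print(f"GPU-hour ratio, Llama 3.1 405B vs DeepSeek-V3: {gpu_hour_ratio:.1f}x")
```

The cost estimate works out to about $2 per GPU hour, and the dense run used roughly 11 times the GPU hours. As noted above, the two runs are not a controlled comparison, so the ratio is indicative only.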

Mixtral 8x22B made the case concrete. It uses eight expert networks, with 2 experts activated per token. Despite the name, the total is 141B parameters rather than 8 × 22B = 176B, because attention and other non-expert weights are shared across experts. Active parameters: 39B. The result was faster than dense 70B baselines while staying competitive among open-weight models at the time of release.
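The published totals are enough to estimate how Mixtral 8x22B's parameters split between shared weights and experts. The sketch below assumes a simple two-term model (total = shared + 8 × per-expert, active = shared + 2 × per-expert); that decomposition is an approximation made for illustration, not an official breakdown.

```python
def split_shared_and_expert(total_b, active_b, n_experts, k):
    """Solve total = shared + n_experts * e and active = shared + k * e for (shared, e)."""
    per_expert = (total_b - active_b) / (n_experts - k)
    shared = active_b - k * per_expert
    return shared, per_expert

shared_b, per_expert_b = split_shared_and_expert(total_b=141, active_b=39, n_experts=8, k=2)
print(f"shared (attention, embeddings, etc.): ~{shared_b:.0f}B")
print(f"per expert: ~{per_expert_b:.0f}B")
```

Under that assumption each expert carries about 17B parameters and about 5B are shared, which is why the headline total sits well below 8 × 22B.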

Sparse activation makes compute scale closer to the active parameter count, with the router acting as a selector that decides which experts handle each token. Memory footprint, routing overhead, and serving complexity still depend on the full expert set.

A significant MoE failure mode is expert collapse: if the router keeps sending most tokens to a few experts while others idle, you end up with a sparse model that behaves like a much smaller dense one while still paying the memory cost of the idle experts. Traditional fixes add auxiliary losses that can fight the main objective, and DeepSeek-V3 reported an auxiliary-loss-free strategy for load balancing.
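For reference, here is a sketch of the traditional fix in the style of the Switch Transformer auxiliary loss: the number of experts times the sum, over experts, of the fraction of tokens routed to each expert multiplied by its mean router probability, scaled by a small coefficient. This is not DeepSeek-V3's auxiliary-loss-free method, and the value of `alpha` is an illustrative assumption.

```python
import numpy as np

def load_balancing_loss(router_probs, alpha=0.01):
    """Switch-style auxiliary loss for top-1 routing.

    router_probs: (n_tokens, n_experts) softmax outputs of the router.
    Returns alpha * n_experts * sum_i f_i * P_i, where f_i is the fraction of
    tokens whose top choice is expert i and P_i is the mean probability
    the router assigns to expert i.
    """
    n_tokens, n_experts = router_probs.shape
    assignments = router_probs.argmax(axis=-1)
    f = np.bincount(assignments, minlength=n_experts) / n_tokens
    p = router_probs.mean(axis=0)
    return alpha * n_experts * float(np.sum(f * p))

# Four tokens, each preferring a different expert
balanced = np.full((4, 4), 0.1) + 0.6 * np.eye(4)
# Four tokens, all preferring expert 0
collapsed = np.tile([0.7, 0.1, 0.1, 0.1], (4, 1))

print(load_balancing_loss(balanced))
print(load_balancing_loss(collapsed))
```

The loss reaches its minimum of `alpha` under a uniform spread and grows as routing concentrates on fewer experts. Because it is added to the training objective, a large `alpha` can pull the router away from the choices that would be best for the main task, which is the tension described above.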

Tracing token routing through an MoE layer makes this visible. Different tokens light up different expert combinations. When routing skews, utilization drops and effective capacity shrinks. Router quality decides whether you realize the efficiency or drift toward expert collapse.
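One way to trace utilization is to count how often each expert is selected and summarize the spread. The sketch below reports per-expert load and an "effective number of experts", computed as the exponential of the entropy of the load distribution. That summary metric is a choice made here for illustration, not one taken from the reports discussed in this article.

```python
import numpy as np

def expert_utilization(assignments, n_experts):
    """Return (load fractions, effective number of experts).

    assignments: array of expert indices, shape (n_tokens, k) or (n_tokens,).
    """
    counts = np.bincount(np.asarray(assignments).ravel(), minlength=n_experts)
    load = counts / counts.sum()
    nonzero = load[load > 0]                       # skip idle experts to avoid log(0)
    entropy = -np.sum(nonzero * np.log(nonzero))
    return load, float(np.exp(entropy))

# Hypothetical top-2 routing decisions for four tokens and four experts
balanced_routes = [[0, 1], [2, 3], [0, 2], [1, 3]]
skewed_routes = [[0, 1], [0, 1], [0, 1], [0, 2]]

for name, routes in [("balanced", balanced_routes), ("skewed", skewed_routes)]:
    load, effective = expert_utilization(routes, n_experts=4)
    print(name, load, round(effective, 2))
```

With even routing the effective count equals the real expert count. In the skewed case one expert is never used and the effective count falls below three, even though all four experts still occupy memory.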

Inference and distillation

Sparse models buy training efficiency at the price of more complex inference. An MoE model has to fetch different expert weights per token, which keeps memory bandwidth pressure high even if compute scales with the active slice.

DeepSeek-V3 tackles this in part through Multi-head Latent Attention (MLA), compressing the KV cache by over 90% and enabling efficient handling of 128,000-token context windows. That reduces attention-cache pressure; it does not remove the need to keep expert weights available for routing.

Memory footprint stays large because every expert still has to be resident even if only a few run per token. For serving workloads where memory is available but per-token compute matters, MoE can be a good trade.
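A rough sizing exercise shows the gap between what must be resident and what actually runs. The sketch uses the DeepSeek-V3 parameter counts quoted earlier; the bytes-per-parameter values are assumptions about serving precision, and the estimate covers weights only, ignoring KV cache and activations.

```python
def weight_memory_gb(params_billion, bytes_per_param):
    """Weight storage in decimal GB: 1e9 params * bytes each / 1e9 bytes per GB."""
    return params_billion * bytes_per_param

TOTAL_B, ACTIVE_B = 671, 37

for label, bytes_per_param in [("8-bit weights (assumed)", 1), ("16-bit weights (assumed)", 2)]:
    resident = weight_memory_gb(TOTAL_B, bytes_per_param)
    touched = weight_memory_gb(ACTIVE_B, bytes_per_param)
    print(f"{label}: {resident:.0f} GB resident, {touched:.0f} GB read per token")
```

Even at one byte per parameter the full expert set needs hundreds of gigabytes of accelerator memory, while each token reads only a small fraction of it. That is the trade: pay in memory, save in per-token compute.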

If inference complexity is prohibitive, MoE models offer a fallback: distillation. Train a sparse teacher with high capacity, then compress it into a dense student for deployment. Switch Transformer research showed that 30-40% of the sparse teacher's quality gains could be preserved when distilling back to a dense student, so you pay for routing during training but ship a simpler model.
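The distillation objective itself can be sketched in a few lines. This is a generic soft-target formulation under stated assumptions, not the exact recipe from any paper: the student is trained on a weighted mix of cross-entropy against the teacher's output distribution and cross-entropy against the ground-truth labels, and both `teacher_weight` and `temperature` are illustrative values.

```python
import numpy as np

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, labels,
                      teacher_weight=0.25, temperature=1.0):
    """Mix of soft-target loss (match the teacher) and hard-label loss."""
    student_soft = log_softmax(student_logits / temperature)
    teacher_probs = np.exp(log_softmax(teacher_logits / temperature))
    soft = -np.sum(teacher_probs * student_soft, axis=-1).mean()

    student_hard = log_softmax(student_logits)
    hard = -student_hard[np.arange(len(labels)), labels].mean()

    return teacher_weight * soft + (1.0 - teacher_weight) * hard

labels = np.array([0, 2, 1])
teacher_logits = 5.0 * np.eye(3)[labels]    # stand-in for a sparse teacher's outputs
good_student = 4.0 * np.eye(3)[labels]      # dense student that agrees with the teacher
poor_student = np.zeros((3, 3))             # dense student with uniform predictions

print(distillation_loss(good_student, teacher_logits, labels))
print(distillation_loss(poor_student, teacher_logits, labels))
```

In a real setup the teacher logits would come from a forward pass through the sparse model, and only the dense student would be updated and shipped.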

Architecture shift

Recent frontier releases increasingly use MoE or related conditional computation. Conditional computation is no longer just a lab idea. Sparse scaling adds capacity while limiting per-token compute, but the serving trade-offs are real: routing strategy, expert count, memory residency, and batching all decide whether the theoretical savings show up in production.