Thinking fast and MoE

Sparse models decouple total capacity from active compute. DeepSeek-V3 has 671 billion parameters with 37 billion active per token. DeepSeek's technical report frames this as training capacity diverging from inference cost.
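A quick back-of-the-envelope check with the numbers above (the figures come from the report; the script is just arithmetic):

```python
# How much of DeepSeek-V3 is actually "on" for any given token.
total_params = 671e9    # total parameters
active_params = 37e9    # parameters activated per token

print(f"active fraction: {active_params / total_params:.1%}")   # ~5.5%
```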

Across recent releases, total capacity keeps growing while active compute stays much smaller. Dense scaling is slowing, and sparse architectures now account for many of the headline launches.

MoE routing basics

Mixture of Experts replaces the dense feedforward layer with a set of expert networks, and a learned router selects a few of them for each token. Google's Switch Transformer pushed this to the extreme, routing each token to a single expert and cutting training time while maintaining quality.
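Here is a minimal sketch of that routing step in PyTorch. The layer sizes, expert count, and k are illustrative rather than taken from any particular model; Switch Transformer's router is the k=1 special case.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Toy MoE feedforward layer: a learned router picks top-k experts per token."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)   # produces a gating score per expert
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                     # x: (tokens, d_model)
        scores = self.router(x)                               # (tokens, n_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)   # keep only k experts per token
        weights = F.softmax(topk_scores, dim=-1)              # renormalize over the chosen experts

        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]                           # which expert fills this slot
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():                                # only the selected tokens reach expert e
                    w = weights[:, slot][mask].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

tokens = torch.randn(16, 64)
print(TinyMoELayer()(tokens).shape)   # torch.Size([16, 64])
```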

The ratio of active to total parameters matters because per-token inference compute tracks active parameters, not total capacity, which is why MoE models can look enormous on paper while staying practical at inference time.

The tradeoff is routing complexity: the gating computation, load balancing, and the communication needed to move tokens to their experts all add overhead that sparse designs must manage carefully.

Economics and balance

DeepSeek-V3 trained on 2,788,000 H800 GPU hours at an estimated cost of $5.6 million. Llama 3.1 405B trained for 30,840,000 GPU hours, about 11x more compute for a dense model in the same ballpark. NVIDIA's analysis reports up to 10x faster throughput and about one tenth the cost in those comparisons, which is why architecture choices can matter as much as cluster size.
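Those ratios fall straight out of the published GPU-hour figures; the only assumption below is the roughly $2-per-H800-hour rental rate the DeepSeek report uses for its cost estimate.

```python
# Reproducing the headline ratios from the published GPU-hour figures.
deepseek_v3_gpu_hours = 2_788_000    # H800 GPU hours (DeepSeek-V3 report)
llama_405b_gpu_hours = 30_840_000    # GPU hours (Llama 3.1 405B model card)

print(f"compute ratio: {llama_405b_gpu_hours / deepseek_v3_gpu_hours:.1f}x")   # ~11.1x
print(f"estimated cost: ${deepseek_v3_gpu_hours * 2 / 1e6:.2f}M")              # ~$5.58M at ~$2/hour
```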

Mixtral 8x22B made the case concrete. Eight expert networks, 22 billion parameters each, 2 experts activated per token. Total parameters: 141B. Active parameters: 39B. The result was faster than dense 70B baselines while staying competitive among open-weight models at the time of release.
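Working backwards from those published totals gives a rough sense of how the budget splits between shared layers and per-expert FFNs, and why "8x22B" does not mean 176B. This is a back-of-the-envelope decomposition, not Mistral's published breakdown:

```python
# Rough decomposition of Mixtral 8x22B from the published totals:
#   total  = shared + 8 * expert_ffn
#   active = shared + 2 * expert_ffn
total_b, active_b = 141, 39      # billions of parameters
n_experts, k = 8, 2

expert_ffn_b = (total_b - active_b) / (n_experts - k)   # ~17B per expert FFN
shared_b = total_b - n_experts * expert_ffn_b           # ~5B shared (attention, embeddings, ...)
print(f"per-expert FFN ~ {expert_ffn_b:.0f}B, shared ~ {shared_b:.0f}B, "
      f"active ~ {shared_b + k * expert_ffn_b:.0f}B")
```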

Sparse activation keeps per-token compute proportional to the active parameters, with the router acting as a selector that decides which experts handle each token.

A significant MoE failure mode is expert collapse: if the router keeps sending most tokens to a few experts while the rest idle, you pay for a sparse model's capacity but effectively get a much smaller dense one. Traditional fixes add auxiliary load-balancing losses that can fight the main training objective; DeepSeek-V3 reported an auxiliary-loss-free strategy for load balancing instead.
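A sketch of the bias-based idea behind auxiliary-loss-free balancing, in the spirit of what the DeepSeek-V3 report describes (the sizes, update rule, and step size here are illustrative): each expert carries a routing bias that is nudged up when it is underloaded and down when it is overloaded, so the correction steers selection without ever entering the training loss.

```python
import torch

def route_with_bias(scores, bias, k=2):
    """Pick top-k experts per token using score + bias; the bias steers selection only."""
    _, idx = (scores + bias).topk(k, dim=-1)                 # biased selection
    weights = torch.softmax(scores.gather(-1, idx), dim=-1)  # gate weights still use raw scores
    return idx, weights

def update_bias(bias, idx, n_experts, gamma=0.001):
    """Nudge each expert's bias toward balanced load -- no gradients, no auxiliary loss."""
    load = torch.bincount(idx.flatten(), minlength=n_experts).float()
    return bias + gamma * torch.sign(load.mean() - load)     # underloaded up, overloaded down

n_experts, bias = 8, torch.zeros(8)
scores = torch.randn(32, n_experts)           # router scores for a batch of 32 tokens
idx, weights = route_with_bias(scores, bias)
bias = update_bias(bias, idx, n_experts)
```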

Trace token routing through an MoE layer. Different tokens light up different expert combinations. When routing skews, utilization drops and effective capacity shrinks. Router quality decides whether you realize the efficiency or drift toward expert collapse.
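One way to make that visible is to reduce a batch of routing decisions to an "effective number of experts" (exponentiated entropy of the routing histogram). This is an illustrative diagnostic, not a metric from any of the papers above:

```python
import torch

def effective_experts(topk_idx, n_experts):
    """Exponentiated entropy of the routing histogram: how many experts are really in use."""
    counts = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    p = counts / counts.sum()
    return (-(p * p.clamp_min(1e-9).log()).sum()).exp().item()

balanced = torch.randint(0, 8, (4096, 2))    # routing spread across all 8 experts
collapsed = torch.randint(0, 2, (4096, 2))   # router stuck on 2 experts
print(effective_experts(balanced, 8), effective_experts(collapsed, 8))   # ~8 vs ~2
```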

Inference and distillation

Sparse models buy training efficiency at the price of more complex inference. An MoE model has to fetch different expert weights for different tokens, which keeps memory-bandwidth pressure high even though compute scales with the active slice.

DeepSeek-V3 tackles this in part through Multi-head Latent Attention (MLA), compressing the KV cache by over 90% and enabling efficient handling of 128,000-token context windows. Memory still scales with total experts, while compute scales with active parameters.
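To see why cache compression matters at 128K context, here is a rough sizing comparison between a standard multi-head KV cache and a compressed per-token latent. The dimensions are illustrative, not DeepSeek-V3's actual configuration:

```python
# Rough KV-cache sizing at 128K context (illustrative dimensions, fp16/bf16 cache).
seq_len, n_layers, bytes_per_elem = 128_000, 60, 2

n_heads, head_dim = 128, 128                 # standard multi-head attention: cache K and V
standard_gb = seq_len * n_layers * 2 * n_heads * head_dim * bytes_per_elem / 1e9

latent_dim = 512                             # MLA-style compressed latent per token
latent_gb = seq_len * n_layers * latent_dim * bytes_per_elem / 1e9

print(f"standard ~ {standard_gb:.0f} GB, latent ~ {latent_gb:.0f} GB "
      f"({1 - latent_gb / standard_gb:.0%} smaller)")
```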

Memory footprint stays large because every expert still has to be resident even if only a few run per token. For serving workloads where memory is available but per-token compute matters, MoE can be a good trade.
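The asymmetry is easy to put in rough numbers with the DeepSeek-V3 figures; the weight precision and the 2-FLOPs-per-parameter rule of thumb are assumptions, not values from the report.

```python
# Memory scales with total parameters; per-token compute scales with active ones.
total_params, active_params = 671e9, 37e9

weight_bytes = 1                          # assume fp8 weights for the resident copy
resident_gb = total_params * weight_bytes / 1e9
flops_per_token = 2 * active_params       # ~2 FLOPs per active parameter per token

print(f"resident weights ~ {resident_gb:.0f} GB, per-token compute ~ {flops_per_token / 1e9:.0f} GFLOPs")
```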

If inference complexity is prohibitive, MoE models offer a fallback: distillation. Train a sparse teacher with high capacity, then compress into a dense student for deployment. Switch Transformer research showed that 30-40% of the sparsity gains could be preserved when distilling back to a dense student, so you pay for routing during training but ship a simpler model.
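A minimal distillation training step looks like standard soft-label KL distillation; the teacher and student models, temperature, and mixing weight below are placeholders, not Switch Transformer's exact recipe.

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, tokens, labels, T=2.0, alpha=0.5):
    """One step of soft-label distillation from a frozen sparse teacher into a dense student."""
    with torch.no_grad():
        teacher_logits = teacher(tokens)               # sparse MoE teacher, frozen
    student_logits = student(tokens)                   # dense student being trained

    soft = F.kl_div(                                   # match the softened teacher distribution
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)     # plus the ordinary hard-label loss

    loss = alpha * soft + (1 - alpha) * hard
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```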

Architecture shift

Many frontier releases in 2025 use MoE or related conditional computation. Conditional computation moved from research curiosity to production default. Dense scaling shows diminishing returns, while sparse scaling adds capacity without doubling inference cost, even as routing strategy and expert count remain open questions.