Thinking Fast and MoE
DeepSeek-V3 has 671 billion parameters. It activates 37 billion per token. Mixtral 8x22B has 141 billion parameters. It activates 39 billion per token. The pattern is consistent across every frontier model released in 2025: total parameter counts in the hundreds of billions, active parameters during inference closer to the tens of billions. The era of dense scaling is over. Sparse models won.
The logic is simple. Training a model gives it knowledge. Running inference retrieves that knowledge. These don't have to cost the same. DeepSeek's technical report puts it directly: the massive discrepancy between total capacity (knowledge storage) and active computation (inference cost) represents a fundamental shift. It is possible to decouple the accumulation of knowledge from the cost of retrieving it.
The routing mechanism
Mixture of Experts replaces dense feedforward layers with a collection of specialized "expert" networks. A routing mechanism examines each token and selects which experts should process it. Only the selected experts run. The others sit idle. A layer with 8 experts that activates 2 per token holds roughly 8x the feedforward parameters but pays only about 2x the compute of a single expert.
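A minimal sketch of that mechanism in PyTorch, assuming a linear softmax router and top-2 selection; the layer sizes, class name, and expert count are illustrative, not any particular model's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy mixture-of-experts feedforward layer with top-k routing (illustrative sizes)."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router is a small linear layer that scores every expert for each token.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is an independent feedforward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                   # x: (n_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)           # routing probability per expert
        weights, idx = probs.topk(self.top_k, dim=-1)       # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize gate weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                       # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

# Example: 16 tokens pass through the layer; each touches only 2 of the 8 experts.
y = MoELayer()(torch.randn(16, 512))
```

With top_k=2 and n_experts=8, each token's forward pass touches only two of the eight feedforward blocks, which is exactly the 2-of-8 compute-versus-capacity split described above.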
Google's Switch Transformer simplified this in 2021 by routing to just one expert per token. The Switch-Base 64-expert model trained in one-seventh the time of the equivalent dense T5-Base to reach similar perplexity. Despite T5-Large applying 3.5x more FLOPs per token, Switch-Base was still more sample efficient.
| Model | Total Params | Active Params | Total:Active |
|---|---|---|---|
| Llama 3.1 405B (dense) | 405B | 405B | 1:1 |
| Mixtral 8x22B | 141B | 39B | 3.6:1 |
| DeepSeek-V3 | 671B | 37B | 18:1 |
The ratio matters. DeepSeek-V3 stores 18x more parameters than it activates for any single token. That's 18x the knowledge capacity at roughly the inference cost of a 37B dense model.
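The ratios in the table are just total parameters divided by active parameters; a quick check:

```python
models = {
    "Llama 3.1 405B (dense)": (405e9, 405e9),
    "Mixtral 8x22B":          (141e9, 39e9),
    "DeepSeek-V3":            (671e9, 37e9),
}
for name, (total, active) in models.items():
    # Stored capacity divided by the parameters a single token actually uses.
    print(f"{name}: {total / active:.1f}x total-to-active")
# DeepSeek-V3 works out to roughly 18.1x, Mixtral 8x22B to about 3.6x.
```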
Training economics
DeepSeek-V3 trained on 2,788,000 H800 GPU hours at an estimated cost of $5.6 million. Llama 3.1 405B trained for 30,840,000 GPU hours: roughly 11x the GPU hours for a dense model that benchmarks slightly worse on many tasks. NVIDIA's analysis confirms the trend: MoE models run 10x faster and deliver tokens at 1/10th the cost of equivalently capable dense models.
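The cost figure is essentially GPU hours multiplied by an assumed rental rate; a back-of-envelope check, with the $2/GPU-hour rate treated as an assumption for illustration:

```python
deepseek_hours = 2_788_000    # H800 GPU hours reported for DeepSeek-V3
llama_hours    = 30_840_000   # GPU hours reported for Llama 3.1 405B
rate_per_hour  = 2.0          # assumed $/GPU-hour rental price

print(f"DeepSeek-V3 training cost: ~${deepseek_hours * rate_per_hour / 1e6:.1f}M")  # ~$5.6M
print(f"GPU-hour ratio vs Llama:   {llama_hours / deepseek_hours:.1f}x")            # ~11.1x
```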
The economics are stark. If a research lab can achieve frontier performance using efficiency optimizations on restricted hardware, the future of AI will be defined not by who has the biggest cluster, but by who has the smartest architecture. Dense scaling is a brute-force strategy. Sparse scaling is an engineering strategy.
Mixtral's demonstration
Mixtral 8x22B made the case concrete. Eight experts per layer, 2 activated per token, with attention layers and embeddings shared across experts, which is why the total is 141B rather than 8 x 22B. Total parameters: 141B. Active parameters: 39B. Result: faster than any dense 70B model while being more capable than any open-weight model at the time of release.
The sparse activation pattern means inference is cheaper than the parameter count suggests. You're not running 141B parameters. You're running 39B parameters that happen to have access to 141B parameters' worth of specialized knowledge. The routing mechanism acts like a learned dispatcher: examine the token, pick the relevant experts, ignore the rest.
Load balancing challenges
MoE architectures have a failure mode: expert collapse. If the router consistently sends most tokens to a few experts while the others idle, you've paid for a sparse model's parameters but effectively trained a much smaller dense one. Worse, the idle experts never improve, creating a feedback loop in which the router has even less reason to use them.
Traditional solutions add auxiliary losses that penalize unbalanced routing, but those losses can fight the primary training objective. DeepSeek-V3 pioneered an auxiliary-loss-free strategy instead: each expert gets a bias term added to its routing score, nudged up when the expert is underloaded and down when it is overloaded, so routing stays balanced without a competing gradient. The result was more stable training and better expert utilization.
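For contrast, here is a sketch of the traditional approach the paragraph describes: a Switch-Transformer-style auxiliary loss that multiplies each expert's share of routed tokens by its mean routing probability, so the penalty is smallest when both are uniform. This illustrates the competing-objective problem; it is not DeepSeek-V3's method.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, top1_idx, n_experts):
    """Switch-style auxiliary loss: n_experts * sum_i f_i * P_i, where f_i is the
    fraction of tokens dispatched to expert i and P_i is its mean routing probability."""
    probs = F.softmax(router_logits, dim=-1)                 # (n_tokens, n_experts)
    # f_i: observed share of tokens whose top choice is expert i.
    f = torch.bincount(top1_idx, minlength=n_experts).float() / top1_idx.numel()
    # P_i: average router confidence in expert i across all tokens.
    p = probs.mean(dim=0)
    return n_experts * torch.sum(f * p)                      # equals 1.0 when perfectly balanced

# Example: 1000 tokens routed over 8 experts.
logits = torch.randn(1000, 8)
loss = load_balancing_loss(logits, logits.argmax(dim=-1), n_experts=8)
```

Because this term is added to the language-modeling loss, its gradient can pull the router away from the routing the task actually prefers, which is the tension the auxiliary-loss-free approach avoids.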
See the strain: Expert Routing Visualizer
The interactive below traces token routing through an MoE layer. Watch how different tokens activate different expert combinations. Observe load balancing in action: when routing becomes uneven, utilization drops and effective capacity shrinks. The routing decision determines whether you get 18:1 efficiency or expert collapse.
Inference complexity
Sparse models buy their training efficiency at the cost of inference complexity. A dense model loads one set of weights and runs it for every token. An MoE model must load different expert weights depending on routing decisions. This creates memory bandwidth challenges: the full model must fit in memory even though only a fraction of it runs per token.
DeepSeek-V3 addresses this partly through Multi-head Latent Attention (MLA), compressing the KV cache by over 90% and enabling efficient handling of 128,000-token context windows. The combination of sparse experts and compressed attention makes the model practical to deploy despite its 671B total parameters.
The memory footprint remains large. You need to store all experts even if you only use a few per token. But inference compute scales with active parameters, not total parameters. For serving workloads where memory is available but compute per token matters, MoE wins.
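A rough back-of-envelope sketch of that trade-off, assuming 1 byte per parameter (FP8 weights) and the usual ~2 FLOPs per active parameter per token estimate; KV cache and serving-system overheads are ignored:

```python
total_params  = 671e9   # DeepSeek-V3 total parameters (all must sit in memory)
active_params = 37e9    # parameters actually used per token

bytes_per_param = 1.0                  # assumption: FP8 weights, 1 byte each
flops_per_token = 2 * active_params    # rule of thumb: ~2 FLOPs per active param per token

print(f"Weight memory:      ~{total_params * bytes_per_param / 1e9:.0f} GB")   # ~671 GB
print(f"Compute per token:  ~{flops_per_token / 1e9:.0f} GFLOPs")              # ~74 GFLOPs
# Memory scales with all 671B parameters; per-token compute scales with only 37B.
```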
Distillation escape hatch
If inference complexity is prohibitive, MoE models offer an escape hatch: distillation. Train a sparse model with massive capacity, then distill its outputs into a smaller dense model for deployment. Research on Switch Transformers showed that 30-40% of the sparsity gains could be preserved when distilling back to a dense student.
This creates a training pipeline: train a massive sparse teacher with 10x the capacity at 1x the compute, then compress it to a deployable dense student. The student inherits the teacher's knowledge without the expert routing overhead. You pay for sparsity during training but ship dense for inference.
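A minimal sketch of the distillation step, assuming the standard temperature-scaled soft-target objective; this is the generic recipe, not the exact setup from the Switch Transformer paper:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend the usual next-token cross-entropy with a KL term that pulls the
    dense student's output distribution toward the sparse teacher's."""
    # Soft targets: match the teacher's temperature-smoothed distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard language-modeling loss on the real tokens.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard

# Example shapes: a batch of 32 positions over a 50k-token vocabulary.
s, t = torch.randn(32, 50_000), torch.randn(32, 50_000)
loss = distillation_loss(s, t, targets=torch.randint(0, 50_000, (32,)))
```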
The architecture shift
Every major frontier model released in 2025 uses MoE or a close variant. Conditional computation went from research curiosity to production default. The question isn't whether to use sparse models. It's how many experts, what routing strategy, and how to balance training efficiency against inference complexity.
Dense scaling hit diminishing returns: doubling parameters doubles both training and inference compute, for increasingly marginal gains in capability. Sparse scaling breaks that relationship. Doubling the expert count adds capacity without doubling inference cost. The same compute budget that trained one frontier dense model can now train a sparse model with roughly 10x the stored knowledge.
Brute-force dense scaling is ending. The era of the agile, sparse expert has begun.