Those Who Can, Teach Students to Teach
In 2023, training a model to reason required three things: a reward model to score outputs, a critic model to estimate value functions, and enough compute to run Proximal Policy Optimization without it diverging. The practical result was that reasoning research happened at a handful of labs with the budget to absorb failed experiments. Everyone else waited for API access.
Two years later, the landscape looks different. MIT researchers showed that orchestrated small models can match or exceed frontier reasoning systems at a fraction of the cost. DeepSeek published GRPO, a reinforcement learning algorithm simple enough that grad students can implement it over a weekend. And Karpathy's 2025 review noted that reinforcement learning from verifiable rewards has become "the de facto new major stage in LLM training." The frontier tax is shrinking.
The PPO problem
PPO works, but it's complicated. You need a policy model (the LLM you're training), a reference model (to prevent drift), a reward model (to score outputs), and a value model (to estimate expected future rewards). Four models, synchronized updates, and hyperparameters that require extensive tuning. Sebastian Raschka's survey describes it as requiring "substantial domain knowledge and extensive compute resources." Most of the research stayed confined to top labs.
The value model is the bottleneck. It needs to predict expected cumulative reward for any partial response, which means training a second large model alongside your policy. Memory doubles. Training complexity increases. And if your value estimates are noisy, the whole process becomes unstable.
GRPO: dropping the critic
Group Relative Policy Optimization takes a different approach. Instead of training a separate value model to estimate advantages, GRPO samples multiple responses from the policy itself and uses their relative quality as the baseline. Generate eight answers, score them with your reward function, subtract the mean score from each, and you have advantages without a critic.
The math is straightforward. For a prompt, sample a group of completions. Compute a reward for each (pass/fail on a math problem, unit tests for code). The advantage for completion i is r_i minus the group mean; DeepSeek's formulation also divides by the group standard deviation. No learned value function, no second model, no instability from noisy estimates.
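To make that concrete, here is a minimal sketch of the group-relative advantage computation. The reward values are assumed to come from whatever verifier or reward function you use; the `normalize_std` flag reflects the choice of whether to divide by the group standard deviation.

```python
import statistics

def group_relative_advantages(rewards, normalize_std=True):
    """GRPO-style advantages for one prompt's group of sampled completions.

    rewards: one scalar per completion (e.g. 1.0 for a verified-correct answer).
    Subtracting the group mean turns raw rewards into relative scores;
    DeepSeek's formulation also divides by the group standard deviation.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) if normalize_std else 1.0
    denom = std if std > 1e-8 else 1.0  # guard against a zero-variance group
    return [(r - mean) / denom for r in rewards]

# Eight completions for one math prompt, scored pass/fail by a verifier.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]))
```

Completions that beat their siblings get positive advantages and are reinforced; the rest are pushed down, with no critic anywhere in the loop.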
| Method | Models Required | Memory Overhead | Primary Use (2025) |
|---|---|---|---|
| PPO + RLHF | 4 (policy, ref, reward, value) | High | Preference alignment |
| DPO | 2 (policy, ref) | Medium | Preference alignment |
| GRPO | 2 (policy, ref) | Medium | Reasoning tasks |
| GRPO + RLVR | 1 (policy only) | Low | Math/code reasoning |
The combination of GRPO with verifiable rewards (RLVR) is where things get interesting. For math problems, you can check if the answer is correct. For code, you can run unit tests. No reward model needed at all. DeepSeek-R1 used this approach to train reasoning capabilities at what they described as "a fraction of RLHF budgets."
Verifiable rewards change the economics
RLVR works because some tasks have ground truth. GSM8K math problems have correct answers. LeetCode problems have test suites. If you can automatically verify whether a response is right or wrong, you don't need humans to label preferences or a model to predict rewards. Binary signal, infinite scale.
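As an illustration, here is what a verifier-based reward can look like in practice. This is a hedged sketch, not DeepSeek's actual reward code: the regex-based answer extraction and the unsandboxed subprocess test runner are simplifying assumptions.

```python
import re
import subprocess
import tempfile

def math_reward(response: str, gold_answer: str) -> float:
    """1.0 if the final number in the response matches the gold answer, else 0.0.

    Assumes the model states its final answer as the last number in the text.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if numbers and numbers[-1] == gold_answer else 0.0

def code_reward(candidate_code: str, test_code: str, timeout_s: int = 10) -> float:
    """1.0 if the candidate passes the appended unit tests, else 0.0.

    Real pipelines run this in a sandbox; a bare subprocess is shown for brevity.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```

Either function slots directly into the group-relative advantage computation above: no human labels, no reward model, just a check against ground truth.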
The results are strong. Models trained with RLVR show "super-human pass@1 on GSM8K and LeetCode," as multiple reports put it. The reasoning behavior emerges from the training signal: break problems into steps, check intermediate results, backtrack when stuck. Not because anyone told the model to reason that way, but because that strategy maximizes verified correctness.
Open researchers can now replicate this. The Awesome-RL-for-LRMs repository tracks dozens of papers from 2025 alone, most from academic groups rather than frontier labs. GRPO's simplicity lowered the barrier enough that the research has democratized.
The ensemble alternative
Training your own reasoning model is one path. Running multiple existing models and aggregating their outputs is another. The MIT CSAIL DisCIPL framework takes this approach: a large model plans the strategy, small models execute subtasks, and disagreements surface uncertainty.
The numbers are notable. DisCIPL with small Llama models achieved "40.1 percent shorter reasoning and 80.2 percent cost savings over o1" while remaining competitive on accuracy. The small models are 1,000 to 10,000 times cheaper per token than frontier reasoning systems. You trade single-model capability for orchestrated collaboration.
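The paper's own orchestration is more sophisticated, but the planner/executor split can be sketched generically. The `call_model` callable and the model names below are placeholders, not DisCIPL's API.

```python
from typing import Callable, Sequence

# Hypothetical client: (model_name, prompt) -> completion text.
CallModel = Callable[[str, str], str]

def plan_and_execute(task: str, call_model: CallModel,
                     planner: str = "large-planner",
                     workers: Sequence[str] = ("small-llama-a", "small-llama-b")) -> str:
    """Generic planner/executor split: a large model decomposes the task,
    cheap small models solve the pieces, and the planner stitches them together."""
    plan = call_model(planner, f"Break this task into numbered subtasks:\n{task}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
    partials = []
    for i, sub in enumerate(subtasks):
        worker = workers[i % len(workers)]  # round-robin over the cheap models
        partials.append(call_model(worker, f"Solve this subtask:\n{sub}"))
    merge_prompt = "Combine these partial results into one final answer:\n" + "\n".join(partials)
    return call_model(planner, merge_prompt)
```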
See the strain: Ensemble Cost Calculator
The interactive below compares reasoning approaches by cost and accuracy. Adjust model sizes, ensemble counts, and task difficulty to see where single large models win versus where ensembles dominate. The crossover point depends on your accuracy requirements and budget constraints.
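For readers of the text-only version, the crossover arithmetic behind a calculator like this is simple. The per-million-token prices below are placeholders, not quoted rates.

```python
def cost_per_query(tokens_per_call: int, n_calls: int, price_per_mtok: float) -> float:
    """Dollar cost of one query: calls x tokens per call x price per million tokens."""
    return n_calls * tokens_per_call * price_per_mtok / 1e6

# Hypothetical numbers: one frontier reasoning call emitting 8,000 tokens at
# $60 per million, versus five small-model calls of 2,000 tokens at $0.20 per million.
frontier = cost_per_query(8_000, 1, 60.0)
ensemble = cost_per_query(2_000, 5, 0.20)
print(f"frontier ${frontier:.4f}  ensemble ${ensemble:.4f}  ratio {frontier / ensemble:.0f}x")
```

Under those assumed prices the ensemble is two orders of magnitude cheaper per query; whether it wins overall depends on how much accuracy the extra calls actually buy, which is what the interactive lets you explore.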
Consensus as signal
Iterative Consensus Ensemble (ICE) pushes this further. Three LLMs generate responses, critique each other's answers, and iterate until they converge or agree to disagree. On GPQA-diamond, a PhD-level reasoning benchmark, ICE improved accuracy from 46.9% to 68.2%. That's a 45% relative gain with no fine-tuning, no gradient updates, just inference-time deliberation.
The mechanism matters. When models disagree, that disagreement contains information. High-confidence agreement suggests reliable answers. Persistent disagreement flags genuine ambiguity or difficulty. A single model can't distinguish between "I'm confident" and "I'm confidently wrong." Multiple models disagreeing surfaces the difference.
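A deliberation loop in the spirit of ICE (not its exact protocol) can be sketched in a few lines. `call_model` is again a placeholder client, and the exact-match voting and agreement threshold are simplifying assumptions; real answers usually need normalization before comparison.

```python
from collections import Counter
from typing import Callable, Sequence

def deliberate(models: Sequence[str], prompt: str,
               call_model: Callable[[str, str], str],
               max_rounds: int = 3, threshold: float = 2 / 3):
    """Models answer, see each other's answers, critique, and revise until
    agreement crosses a threshold. Persistent disagreement is flagged."""
    answers = [call_model(m, prompt) for m in models]
    for _ in range(max_rounds):
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= threshold:
            return top, count / len(answers)  # high agreement: treat as reliable
        revision_prompt = (prompt + "\nOther models answered:\n" + "\n".join(answers)
                           + "\nCritique these answers and give your revised final answer.")
        answers = [call_model(m, revision_prompt) for m in models]
    top, count = Counter(answers).most_common(1)[0]
    return None, count / len(answers)  # persistent disagreement: flag for review
```

The returned agreement rate is the uncertainty signal described above; a single model's confidence score cannot play that role, because it can be high while the answer is wrong.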
Research on ensemble voting for content categorization showed similar patterns: the best single model achieved F1 of 0.55, a two-model ensemble hit 0.73, and larger ensembles of five to ten models exceeded 0.80. Diversity in the ensemble matters more than individual model quality past a certain point.
Distillation closes the loop
Ensembles are expensive at inference time. You're running multiple forward passes per query. But you can distill ensemble behavior back into a single model. TinyLLM demonstrated multi-teacher knowledge distillation where a small student learns from multiple large teachers. The student doesn't just mimic answers; it learns the reasoning patterns that led to those answers.
The results are counterintuitive: "TinyLLM can outperform large teacher LLMs significantly, despite a considerably smaller model size." The student benefits from seeing how different teachers approach the same problem. Disagreements between teachers become learning signal. The student model ends up with capabilities none of its teachers had individually.
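The mechanics can be sketched with a generic multi-teacher distillation loss. This is not TinyLLM's exact objective (which also distills teacher rationales, not just output distributions); it is the standard softened-logits formulation extended to several teachers.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          temperature=2.0, alpha=0.5):
    """Student matches the average of the teachers' softened distributions
    while still training on the hard labels.

    student_logits: (batch, num_classes)
    teacher_logits_list: one (batch, num_classes) tensor per teacher
    """
    # Average of the teachers' softened predictions; when teachers disagree,
    # the target gets flatter, and that spread is part of the learning signal.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  teacher_probs, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```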
The new stack
These techniques compose. You can train a base model with GRPO and verifiable rewards. Deploy multiple variants (different seeds, different fine-tuning) as an ensemble. Use consensus voting for high-stakes queries. Distill the ensemble's collective behavior back into a single model for efficiency. Then repeat.
The practical implication is that reasoning capability is no longer gated by access to frontier models. A well-orchestrated system of smaller models can match or exceed single large models on many tasks, at substantially lower cost. The frontier still exists, but the gap is closing from below.
| Approach | When to Use | Cost Profile |
|---|---|---|
| Single frontier model | Novel tasks, no verification possible | High per-query |
| GRPO-trained reasoning model | Verifiable domains (math, code) | Training cost, low inference |
| Small model ensemble | Accuracy-critical, latency-tolerant | Medium inference, no training |
| Distilled ensemble student | Production deployment at scale | One-time distillation, low inference |
The synthetic ceiling
The Kessler Syndrome describes a scenario where low-Earth orbit becomes so crowded with debris that a single collision triggers a chain reaction of further collisions, rendering whole orbital bands unusable. Code is approaching an analogous threshold. As agents generate billions of lines of "good enough" boilerplate, human-written ground truth becomes a statistically insignificant slice of the training data. We are poisoning the well we drink from.
Distillation assumes the teacher has something worth teaching. If the teacher was trained on synthetic data from a previous generation, and that generation was trained on synthetic data from the one before, each iteration amplifies errors and flattens diversity. The technique works until the training distribution collapses into a narrow band of "average" outputs. Data hygiene isn't a nice-to-have. It's the prerequisite for everything else working.
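A toy simulation makes the collapse mechanism concrete. This is a deliberate caricature, not a claim about any real training pipeline: each generation here trains on the most "typical" samples from the previous generation's outputs, which is roughly what filtering synthetic data for average quality does.

```python
import random
import statistics

random.seed(0)
mean, std = 0.0, 1.0  # generation 0: stands in for the human data distribution
for gen in range(1, 6):
    samples = [random.gauss(mean, std) for _ in range(1000)]
    samples.sort(key=lambda x: abs(x - mean))  # keep the most typical half
    kept = samples[:500]
    mean, std = statistics.mean(kept), statistics.pstdev(kept)
    print(f"generation {gen}: spread of the training distribution = {std:.3f}")
# The spread shrinks every generation: rare, diverse data vanishes first,
# and the distribution narrows toward "average" outputs.
```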
What remains expensive
This doesn't solve everything. RLVR requires verifiable tasks. Many real-world problems don't have automatic correctness checks. Ensembles help with uncertainty quantification but still need individual models that are competent enough to contribute signal. And truly novel capabilities still seem to require scale that only frontier labs can afford to develop.
But the direction is clear. Two years ago, reasoning was a capability you rented from frontier providers. Now it's something you can build, train, and deploy with modest resources. The techniques are published, the code is open, and the results are reproducible. The frontier tax hasn't disappeared, but it's no longer the only option.