Those Who Can, Teach Students to Teach
In 2023, training a model to reason required three things: a reward model to score outputs, a critic model to estimate value functions, and enough compute to run Proximal Policy Optimization without it diverging. The practical result was that reasoning research happened at a handful of labs with the budget to absorb failed experiments. Everyone else waited for API access.
Two years later, the landscape looks different. MIT researchers showed that orchestrated small models can match or exceed frontier reasoning systems at a fraction of the cost. DeepSeek published GRPO, a reinforcement learning algorithm simple enough that grad students can implement it over a weekend. And Karpathy's 2025 review noted that reinforcement learning from verifiable rewards has become "the de facto new major stage in LLM training." The frontier tax is shrinking.
The PPO problem
PPO works, but it's complicated. You need a policy model (the LLM you're training), a reference model (to prevent drift), a reward model (to score outputs), and a value model (to estimate expected future rewards). Four models, synchronized updates, and hyperparameters that require extensive tuning. Sebastian Raschka's survey describes it as requiring "substantial domain knowledge and extensive compute resources." Most of the research stayed confined to top labs.
The value model is the bottleneck. It needs to predict expected cumulative reward for any partial response, which means training a second large model alongside your policy. Memory doubles. Training complexity increases. And if your value estimates are noisy, the whole process becomes unstable.
GRPO: dropping the critic
Group Relative Policy Optimization takes a different approach. Instead of training a separate value model to estimate advantages, GRPO samples multiple responses from the policy itself and uses their relative quality as the baseline. Generate eight answers, score them with your reward function, subtract the mean score from each, and you have advantages without a critic.
The math is straightforward. For a prompt, sample a group of completions. Compute a reward for each (pass/fail on a math problem, unit tests for code). The advantage for completion i is r_i minus the group mean; DeepSeek's formulation also divides by the group standard deviation. No learned value function, no second model, no instability from noisy estimates.
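To make that concrete, here is a minimal sketch of the group-relative advantage computation. The reward values are assumed to come from whatever verifier or reward function you use; the `normalize_std` flag reflects the choice of whether to divide by the group standard deviation.

```python
import statistics

def group_relative_advantages(rewards, normalize_std=True):
    """GRPO-style advantages for one prompt's group of sampled completions.

    rewards: one scalar per completion (e.g. 1.0 for a verified-correct answer).
    Subtracting the group mean turns raw rewards into relative scores;
    DeepSeek's formulation also divides by the group standard deviation.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) if normalize_std else 1.0
    denom = std if std > 1e-8 else 1.0  # guard against a zero-variance group
    return [(r - mean) / denom for r in rewards]

# Eight completions for one math prompt, scored pass/fail by a verifier.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]))
```

Completions that beat their siblings get positive advantages and are reinforced; the rest are pushed down, with no critic anywhere in the loop.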
| Method | Models Required | Memory Overhead | Primary Use (2025) |
|---|---|---|---|
| PPO + RLHF | 4 (policy, ref, reward, value) | High | Preference alignment |
| DPO | 2 (policy, ref) | Medium | Preference alignment |
| GRPO | 2 (policy, ref) | Medium | Reasoning tasks |
| GRPO + RLVR | 1 (policy only) | Low | Math/code reasoning |
The combination of GRPO with verifiable rewards (RLVR) is where things get interesting. For math problems, you can check if the answer is correct. For code, you can run unit tests. No reward model needed at all. DeepSeek-R1 used this approach to train reasoning capabilities at what they described as "a fraction of RLHF budgets."
Verifiable rewards change the economics
RLVR works because some tasks have ground truth. GSM8K math problems have correct answers. LeetCode problems have test suites. If you can automatically verify whether a response is right or wrong, you don't need humans to label preferences or a model to predict rewards. Binary signal, infinite scale.
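As an illustration, here is what a verifier-based reward can look like in practice. This is a hedged sketch, not DeepSeek's actual reward code: the regex-based answer extraction and the unsandboxed subprocess test runner are simplifying assumptions.

```python
import re
import subprocess
import tempfile

def math_reward(response: str, gold_answer: str) -> float:
    """1.0 if the final number in the response matches the gold answer, else 0.0.

    Assumes the model states its final answer as the last number in the text.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    return 1.0 if numbers and numbers[-1] == gold_answer else 0.0

def code_reward(candidate_code: str, test_code: str, timeout_s: int = 10) -> float:
    """1.0 if the candidate passes the appended unit tests, else 0.0.

    Real pipelines run this in a sandbox; a bare subprocess is shown for brevity.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0
```

Either function slots directly into the group-relative advantage computation above: no human labels, no reward model, just a check against ground truth.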
The results are strong. Models trained with RLVR show "super-human pass@1 on GSM8K and LeetCode," as multiple reports put it. The reasoning behavior emerges from the training signal: break problems into steps, check intermediate results, backtrack when stuck. Not because anyone told the model to reason that way, but because that strategy maximizes verified correctness.
Open researchers can now replicate this. The Awesome-RL-for-LRMs repository tracks dozens of papers from 2025 alone, most from academic groups rather than frontier labs. GRPO's simplicity lowered the barrier enough that the research has democratized.
The ensemble alternative
Training your own reasoning model is one path. Running multiple existing models and aggregating their outputs is another. The MIT CSAIL DisCIPL framework takes this approach: a large model plans the strategy, small models execute subtasks, and disagreements surface uncertainty.
The numbers are notable. DisCIPL with small Llama models achieved "40.1 percent shorter reasoning and 80.2 percent cost savings over o1" while remaining competitive on accuracy. The small models are 1,000 to 10,000 times cheaper per token than frontier reasoning systems. You trade single-model capability for orchestrated collaboration.
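The paper's own orchestration is more sophisticated, but the planner/executor split can be sketched generically. The `call_model` callable and the model names below are placeholders, not DisCIPL's API.

```python
from typing import Callable, Sequence

# Hypothetical client: (model_name, prompt) -> completion text.
CallModel = Callable[[str, str], str]

def plan_and_execute(task: str, call_model: CallModel,
                     planner: str = "large-planner",
                     workers: Sequence[str] = ("small-llama-a", "small-llama-b")) -> str:
    """Generic planner/executor split: a large model decomposes the task,
    cheap small models solve the pieces, and the planner stitches them together."""
    plan = call_model(planner, f"Break this task into numbered subtasks:\n{task}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]
    partials = []
    for i, sub in enumerate(subtasks):
        worker = workers[i % len(workers)]  # round-robin over the cheap models
        partials.append(call_model(worker, f"Solve this subtask:\n{sub}"))
    merge_prompt = "Combine these partial results into one final answer:\n" + "\n".join(partials)
    return call_model(planner, merge_prompt)
```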
See the strain: Ensemble Cost Calculator
The interactive below compares reasoning approaches by cost and accuracy. Adjust model sizes, ensemble counts, and task difficulty to see where single large models win versus where ensembles dominate. The crossover point depends on your accuracy requirements and budget constraints.
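For readers of the text-only version, the crossover arithmetic behind a calculator like this is simple. The per-million-token prices below are placeholders, not quoted rates.

```python
def cost_per_query(tokens_per_call: int, n_calls: int, price_per_mtok: float) -> float:
    """Dollar cost of one query: calls x tokens per call x price per million tokens."""
    return n_calls * tokens_per_call * price_per_mtok / 1e6

# Hypothetical numbers: one frontier reasoning call emitting 8,000 tokens at
# $60 per million, versus five small-model calls of 2,000 tokens at $0.20 per million.
frontier = cost_per_query(8_000, 1, 60.0)
ensemble = cost_per_query(2_000, 5, 0.20)
print(f"frontier ${frontier:.4f}  ensemble ${ensemble:.4f}  ratio {frontier / ensemble:.0f}x")
```

Under those assumed prices the ensemble is two orders of magnitude cheaper per query; whether it wins overall depends on how much accuracy the extra calls actually buy, which is what the interactive lets you explore.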
Consensus as signal
Iterative Consensus Ensemble (ICE) pushes this further. Three LLMs generate responses, critique each other's answers, and iterate until they converge or agree to disagree. On GPQA-diamond, a PhD-level reasoning benchmark, ICE improved accuracy from 46.9% to 68.2%. That's a 45% relative gain with no fine-tuning, no gradient updates, just inference-time deliberation.
The mechanism matters. When models disagree, that disagreement contains information. High-confidence agreement suggests reliable answers. Persistent disagreement flags genuine ambiguity or difficulty. A single model can't distinguish between "I'm confident" and "I'm confidently wrong." Multiple models disagreeing surfaces the difference.
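A deliberation loop in the spirit of ICE (not its exact protocol) can be sketched in a few lines. `call_model` is again a placeholder client, and the exact-match voting and agreement threshold are simplifying assumptions; real answers usually need normalization before comparison.

```python
from collections import Counter
from typing import Callable, Sequence

def deliberate(models: Sequence[str], prompt: str,
               call_model: Callable[[str, str], str],
               max_rounds: int = 3, threshold: float = 2 / 3):
    """Models answer, see each other's answers, critique, and revise until
    agreement crosses a threshold. Persistent disagreement is flagged."""
    answers = [call_model(m, prompt) for m in models]
    for _ in range(max_rounds):
        top, count = Counter(answers).most_common(1)[0]
        if count / len(answers) >= threshold:
            return top, count / len(answers)  # high agreement: treat as reliable
        revision_prompt = (prompt + "\nOther models answered:\n" + "\n".join(answers)
                           + "\nCritique these answers and give your revised final answer.")
        answers = [call_model(m, revision_prompt) for m in models]
    top, count = Counter(answers).most_common(1)[0]
    return None, count / len(answers)  # persistent disagreement: flag for review
```

The returned agreement rate is the uncertainty signal described above; a single model's confidence score cannot play that role, because it can be high while the answer is wrong.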
Research on ensemble voting for content categorization showed similar patterns: the best single model achieved F1 of 0.55, a two-model ensemble hit 0.73, and larger ensembles of five to ten models exceeded 0.80. Diversity in the ensemble matters more than individual model quality past a certain point.
Distillation closes the loop
Ensembles are expensive at inference time. You're running multiple forward passes per query. But you can distill ensemble behavior back into a single model. TinyLLM demonstrated multi-teacher knowledge distillation where a small student learns from multiple large teachers. The student doesn't just mimic answers; it learns the reasoning patterns that led to those answers.
The results are counterintuitive: "TinyLLM can outperform large teacher LLMs significantly, despite a considerably smaller model size." The student benefits from seeing how different teachers approach the same problem. Disagreements between teachers become learning signal. The student model ends up with capabilities none of its teachers had individually.
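The mechanics can be sketched with a generic multi-teacher distillation loss. This is not TinyLLM's exact objective (which also distills teacher rationales, not just output distributions); it is the standard softened-logits formulation extended to several teachers.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                          temperature=2.0, alpha=0.5):
    """Student matches the average of the teachers' softened distributions
    while still training on the hard labels.

    student_logits: (batch, num_classes)
    teacher_logits_list: one (batch, num_classes) tensor per teacher
    """
    # Average of the teachers' softened predictions; when teachers disagree,
    # the target gets flatter, and that spread is part of the learning signal.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)
    kd = F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                  teacher_probs, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```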
The new stack
These techniques compose. You can train a base model with GRPO and verifiable rewards. Deploy multiple variants (different seeds, different fine-tuning) as an ensemble. Use consensus voting for high-stakes queries. Distill the ensemble's collective behavior back into a single model for efficiency. Then repeat.
The practical implication is that reasoning capability is no longer gated by access to frontier models. A well-orchestrated system of smaller models can match or exceed single large models on many tasks, at substantially lower cost. The frontier still exists, but the gap is closing from below.
| Approach | When to Use | Cost Profile |
|---|---|---|
| Single frontier model | Novel tasks, no verification possible | High per-query |
| GRPO-trained reasoning model | Verifiable domains (math, code) | Training cost, low inference |
| Small model ensemble | Accuracy-critical, latency-tolerant | Medium inference, no training |
| Distilled ensemble student | Production deployment at scale | One-time distillation, low inference |
The synthetic ceiling
The Kessler Syndrome describes a scenario where low-Earth orbit becomes so crowded with debris that a single collision triggers a chain reaction of further collisions, rendering whole orbital bands unusable. Code is approaching an analogous threshold. As agents generate billions of lines of "good enough" boilerplate, human-written ground truth becomes a statistically insignificant slice of the training data. We are poisoning the well we drink from.
Distillation assumes the teacher has something worth teaching. If the teacher was trained on synthetic data from a previous generation, and that generation was trained on synthetic data from the one before, each iteration amplifies errors and flattens diversity. The technique works until the training distribution collapses into a narrow band of "average" outputs. Data hygiene isn't a nice-to-have. It's the prerequisite for everything else working.
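A toy simulation makes the collapse mechanism concrete. This is a deliberate caricature, not a claim about any real training pipeline: each generation here trains on the most "typical" samples from the previous generation's outputs, which is roughly what filtering synthetic data for average quality does.

```python
import random
import statistics

random.seed(0)
mean, std = 0.0, 1.0  # generation 0: stands in for the human data distribution
for gen in range(1, 6):
    samples = [random.gauss(mean, std) for _ in range(1000)]
    samples.sort(key=lambda x: abs(x - mean))  # keep the most typical half
    kept = samples[:500]
    mean, std = statistics.mean(kept), statistics.pstdev(kept)
    print(f"generation {gen}: spread of the training distribution = {std:.3f}")
# The spread shrinks every generation: rare, diverse data vanishes first,
# and the distribution narrows toward "average" outputs.
```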
What remains expensive
This doesn't solve everything. RLVR requires verifiable tasks. Many real-world problems don't have automatic correctness checks. Ensembles help with uncertainty quantification but still need individual models that are competent enough to contribute signal. And truly novel capabilities still seem to require scale that only frontier labs can afford to develop.
But the direction is clear. Two years ago, reasoning was a capability you rented from frontier providers. Now it's something you can build, train, and deploy with modest resources. The techniques are published, the code is open, and the results are reproducible. The frontier tax hasn't disappeared, but it's no longer the only option.