Distilling the Leviathan: TranslateGemma for local deployment
TranslateGemma-4B is the teacher. Gamma's 1B student is the shipped artifact. Distillation turns the larger model into training infrastructure instead of putting it on every request.
Retained capability is the measure: how close the student stays to the teacher after compression on the same translation task.
Teacher and student
In a practical distillation stack, the larger model generates targets, corrections, and edge cases. The smaller model learns the translation slice from that curated data instead of relearning the whole internet. The teacher absorbs the breadth first; the student then carries the deployable behavior.
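The teacher-to-student data flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Gamma's actual pipeline: `build_distillation_set`, `teacher_translate`, and the length-ratio filter are hypothetical names and choices introduced here, and the teacher is a stand-in callable rather than a real TranslateGemma-4B call.

```python
from typing import Callable, Iterable


def build_distillation_set(
    sources: Iterable[str],
    teacher_translate: Callable[[str], str],  # hypothetical teacher interface
    max_length_ratio: float = 3.0,            # assumed curation threshold
) -> list[dict[str, str]]:
    """Collect (source, teacher target) pairs for student fine-tuning."""
    pairs = []
    for src in sources:
        src = src.strip()
        if not src:
            continue  # skip empty inputs
        tgt = teacher_translate(src).strip()
        if not tgt:
            continue  # teacher produced nothing usable
        # Crude curation step: drop pairs whose lengths diverge wildly.
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_length_ratio:
            continue
        pairs.append({"source": src, "target": tgt})
    return pairs


def fake_teacher(text: str) -> str:
    """Stand-in for the teacher model, used only to make the sketch runnable."""
    lookup = {"Hello.": "Hola.", "Thank you.": "Gracias."}
    return lookup.get(text, "")


examples = build_distillation_set(
    ["Hello.", "   ", "Thank you.", "Unknown sentence"], fake_teacher
)
print(examples)
```

The point of the design is that the teacher is called once per training example, offline. The student then trains on `examples` and is the only model on the serving path.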
Local execution changes the operating constraints. Privacy stays closer to the data, repeated calls are cheaper to serve, and latency stops depending on a remote request path.
Why local deployment matters
A local specialist matters only if it fits on the machine already doing the work. Then privacy risk drops, latency becomes part of the product, and the remote service stops being the default path.
Thesis
Measure retained capability, not teacher size.
TranslateGemma as the concrete case
Gamma is bounded enough to evaluate. Its translation distillation line trains a 1B Gemma student from a TranslateGemma-4B teacher and publishes evaluation bundles for both. The comparison uses the same task and metrics, with the size reduction and remaining gap visible.
- Gamma translation results summary
- Gamma external and in-domain leaderboard
- The current best 1B student in that bundle reaches 33.3780 BLEU / 58.8324 chrF on WMT13 EN/ES 128, versus 34.0474 / 61.0088 for the TranslateGemma-4B teacher.
- Clocksmith on Hugging Face for the published rdrr model repos.
What this artifact actually proves
The evidence supports one claim: on WMT13 EN/ES 128, the 1B student stays close to the 4B teacher on BLEU and chrF. That makes the larger model useful as a training source even when the shipped model is much smaller.
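The bundle's published scores are 33.3780 BLEU / 58.8324 chrF for the student and 34.0474 / 61.0088 for the teacher. One way to turn "stays close" into numbers is the absolute gap and the student-to-teacher ratio per metric. The ratio is an assumption made here for illustration, not a metric the Gamma bundle defines, and `retained_fraction` is a hypothetical helper.

```python
def retained_fraction(student: float, teacher: float) -> float:
    """Share of the teacher's score that the student keeps (assumed definition)."""
    if teacher <= 0:
        raise ValueError("teacher score must be positive")
    return student / teacher


# Scores reported for WMT13 EN/ES 128 in the Gamma bundle.
scores = {
    "BLEU": {"student": 33.3780, "teacher": 34.0474},
    "chrF": {"student": 58.8324, "teacher": 61.0088},
}

report = {}
for metric, s in scores.items():
    gap = s["teacher"] - s["student"]
    retained = retained_fraction(s["student"], s["teacher"])
    report[metric] = {"gap": gap, "retained": retained}
    print(f"{metric}: gap {gap:.4f}, retained {retained:.1%}")
```

Under that definition the student keeps roughly 98% of the teacher's BLEU and 96% of its chrF on this test set. A score ratio is not a quality guarantee, so it only summarizes this one benchmark.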
The teacher still matters. Its job moves upstream: create and sharpen capability before serving time.
Where the claim should stop
The Gamma bundle supports a translation distillation story. It does not prove that laptop-scale models are broadly ready for coding and reasoning. It also does not show a full outcome-based reinforcement learning pipeline.
Distillation plus evaluation is enough here. The bundle shows that a compact local model can preserve much of a larger teacher's utility on a real task. The claim should stop there.
Failure mode
The stopping point is where the evidence stops: translation distillation, measured against the published bundle.
The large model is useful when it produces a smaller specialist that runs near the work. In this case, the frontier model is training infrastructure, and the shipped value is the local translator.