← Back

Don't hate the TC, hate the game

Quit playing games (with my heart)

Tech interviews are notoriously stressful, and many people see them as unreliable. Difficulty matters, but expectation mismatch is often the bigger driver of unreliability. The most common mismatch, in my experience (having been on both sides of the table), is between what the assignment allows and what the interviewer rewards, especially when amplified by cognitive bias.

Live coding further amplifies this: candidates end up reading the room instead of the code. This can be favorable for room readers, at least before the pandemic, but room readers now have to become Zoom readers (or Google Meet readers).

If you want the short version, it is this: the Anthropic take-home is a vivid example of a broader interview problem. The stated task was kernel optimization, but the actual surface area included a deterministic harness, a mutable debug_info() hook, and a system-level shortcut that could bypass much of the kernel work. The speedrun comparison makes the same point from another angle: public communities eventually separate glitchless runs from exploit runs, while hiring loops usually pretend those categories are obvious even when they are not. If you want the full case study and video comparison, open the section below. Otherwise, skip to the conclusion.

Skip this section by default. Open for the full Anthropic take-home walkthrough and the speedrun comparison.

Jump in the line

You are a tech candidate (TC), who receives a take-home challenge that says:

"Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator."

"Validate your results using python tests/submission_tests.py without modifying anything in the tests/ folder."

You ask an LLM to explain the problem and learn about the memory layout, fused multiply-add ops, and the need to focus on careful scheduling. You think deeply about the random initial state of the machine and the deterministic computation: can you precompute some of this to reduce the number of cycles?

You have the LLM write a kernel and, a few million tokens later, you produce a vectorized kernel that runs in 1524 cycles (not nearly enough to top the already 4-day-old leaderboard). You ask for a formal proof to find the lower bound. Is it 1372, 1084, 1024? You try a manual kernel first, assuming rounds 0,1,2,3 and 11,12,13,14 are select while the rest are load. You are not a kernel engineer, so you ask the LLM to write a grid search script that sweeps a tunable vectorized kernel. It finds 1396 after 30 min of runs on an M3. You look at the leaderboard and cry.

Then you notice a comment left by the take-home designer: the harness uses a non-seeded pseudo-random number generator (PRNG). Is this a clue to unwind the RNG, or a hint not to bother? You also remember seeing an optional debug_info() hook.

You look a few lines below at how Machine is initialized. In the harness, it looks like Machine(mem, kb.instrs, kb.debug_info(), ...). You notice it receives your static instructions (kb.instrs) and then it also receives a call to debug_info() after that.

You wonder if the lru_cache on KernelBuilder and kernel_builder() are red herrings. Why are they there? They do not change the fact that debug_info() is called in the same process. At that point, it is hard not to see the exploit.

You notice a subtle detail that makes this work. Python evaluates arguments left to right. Because debug_info() can mutate the list in place via slice assignment (kb.instrs[:] = ...), that earlier reference now points at rewritten instructions by the time Machine(mem, kb.instrs, kb.debug_info(), ...) is called.

Now the PRNG comment and the debug_info() hook look less like debugging helpers and more like part of the real surface area. A system-level approach can precompute results and largely bypass the kernel work.

You may begin to wonder: are you being tested on kernel technique, or something else?

In Anthropic's writeup, they note that “many candidates worked past the 4-hour limit because they were enjoying themselves.” You are having fun.

In the same article, they also say, “But each new Claude model has forced us to redesign the test.” You begin to wonder if this is a test to redesign the test.

You wonder if Tristan (or Anthropic) are even listening. You wonder if you should write a test hardening patch in your solution. You even start questioning who makes stuffing, Kraft or Stouffers.

You realize you had fun playing, but once you see the game is broken, it stops being fun. It becomes more fun to see how it broke. You even have fun finding more ways to break it. You can push the cycles down, but not indefinitely, there’s still a floor. You know it is lower, but you decide to land on a special number, 1001.

You realize that once you put 1001 on the leaderboard, people will get suspicious, and it's only a matter of time before others find the exploit or the leaderboard bans your solution, hardens the test, etc. You also realize others could already be aware of the same exploit, and are sandbagging, like you did. Before submitting the hack, you decide to write a blog post about it.

Too late?

The exact same result can be judged in at least three different ways:

  • Company A: wanted manual vectorization. Meta-solution looks like dodging. Rejected.
  • Company B: wanted the bottleneck gone. Meta-solution wins. Hired.
  • Company C: values both paths and the judgment call. Strong hire signal.

The assignment asked for kernel optimization, but outcomes depend less on how you resolve ambiguity than on whether expectations align between you and the evaluator.

"Don't modify the tests folder" is not a real boundary. If the environment allows a system-level shortcut, using it is technically "optimal".

The safest move is to reassert your assumptions and state your intentions. This is not foolproof, but here is an example:

I optimized the kernel for a measurable speedup. I also noticed the harness is deterministic and exposes state, which enables a system-level precompute approach. I pursued that path because I assumed the goal was to recognize that the state machine is deterministic despite the PRNG.

Speedrunning faces the same tension, but the categories become public. The community ends up deciding whether a run is glitchless or exploit-based.

Speedrunning is public, so categories emerge naturally and get enforced by the community. Hiring is not, and this remains largely unsolved.

Always one step behind

Modern tech interviews often lag behind the work they claim to measure. The pattern is old: real work moves to a higher level of abstraction, and hiring keeps testing the old bottleneck.

There was a time when low-level control over machine details was a much larger part of everyday programming. Then compilers got better, languages like C raised the level of abstraction, and raw assembly skill stopped being the right default proxy for whether someone could build useful systems. Many hiring loops still treat the previous layer as proof of seriousness after the work itself has moved.

The same thing happened again with search. Once Google made recall cheap, a lot of engineering value shifted away from memorization and toward problem framing, system judgment, and synthesis. Interviews did not respond by testing those things directly. They responded by hardening around LeetCode, whiteboard puzzles, and contrived pressure tests. In other words, they turned the previous generation's bottleneck into a ritual.

AI coding tools are now doing to the Google-and-LeetCode era what Google did to memorization and what C did to assembly. They are raising the abstraction level again. A candidate who can use modern tools to understand a codebase faster, search a design space better, or articulate a sharper systems tradeoff is often more valuable than a candidate who can manually simulate yesterday's constraints on demand. But many interview loops still reward the simulation.

That is the failure mode: confusing friction with signal. Something can be hard without being the right thing to test, and a puzzle can be easy to score without revealing much about engineering. Hiring systems keep arriving one step behind the work, then mistaking that lag for rigor.

My bias is toward level-aware interviews, and toward formats like "present a real project" or "teach me something you know well." Those reveal judgment, depth, clarity, and tool use more honestly than most puzzles, especially if you evaluate whether the candidate can make the other person smarter rather than perform an older constraint.