Don't hate the TC, hate the game
Quit playing games (with my heart)
Tech interviews are notoriously stressful, and many people see them as unreliable. Difficulty matters, but expectation mismatch is often the bigger driver of unreliability. The most common mismatch, in my experience (having been on both sides of the table), is between what the assignment allows and what the interviewer rewards, especially when amplified by cognitive bias.
Live coding further amplifies this: candidates end up reading the room instead of the code. This can be favorable for room readers, at least before the pandemic, but room readers now have to become Zoom readers (or Google Meet readers).
Jump in the line
You are a tech candidate (TC), who receives a take-home challenge that says:
"Optimize the kernel (in KernelBuilder.build_kernel) as much as possible in the available time, as measured by test_kernel_cycles on a frozen separate copy of the simulator."
"Validate your results using
python tests/submission_tests.pywithout modifying anything in the tests/ folder."
You ask an LLM to explain the problem and learn about the memory layout, fused multiply-add ops, and the need to focus on careful scheduling. You think deeply about the random initial state of the machine and the deterministic computation: can you precompute some of this to reduce the number of cycles?
You have the LLM write a kernel and, a few million tokens later, you produce a vectorized kernel that runs in 1524 cycles (not nearly enough to top the already 4-day-old leaderboard). You ask for a formal proof to find the lower bound. Is it 1372, 1084, 1024? You try a manual kernel first, assuming rounds 0,1,2,3 and 11,12,13,14 are select while the rest are load. You are not a kernel engineer, so you ask the LLM to write a grid search script that sweeps a tunable vectorized kernel. It finds 1396 after 30 min of runs on an M3. You look at the leaderboard and cry.
Then you
notice a comment
left by the take-home designer: the harness uses a non-seeded
pseudo-random number generator (PRNG). Is this a clue to unwind the
RNG, or a hint not to bother? You also remember seeing an optional
debug_info() hook.
You look a few lines below at how Machine is initialized.
In the harness, it looks like
Machine(mem, kb.instrs, kb.debug_info(), ...). You notice
it receives your static instructions (kb.instrs) and then
it also receives a call to debug_info() after that.
You wonder if the lru_cache on
KernelBuilder and kernel_builder() are red
herrings. Why are they there? They do not change the fact that
debug_info() is called in the same process. At that
point, it is hard not to see the exploit.
You notice a subtle detail that makes this work. Python evaluates
arguments left to right. Because debug_info() can mutate
the list in place via slice assignment (kb.instrs[:] = ...),
that earlier reference now points at rewritten instructions by the time
Machine(mem, kb.instrs, kb.debug_info(), ...) is called.
Now the PRNG comment and the debug_info() hook look less
like debugging helpers and more like part of the real surface area. A
system-level approach can precompute results and largely bypass the
kernel work.
You may begin to wonder: are you being tested on kernel technique, or something else?
In Anthropic's writeup, they note that “many candidates worked past the 4-hour limit because they were enjoying themselves.” You are having fun.
In the same article, they also say, “But each new Claude model has forced us to redesign the test.” You begin to wonder if this is a test to redesign the test.
You wonder if Tristan (or Anthropic) are even listening. You wonder if you should write a test hardening patch in your solution. You even start questioning who makes stuffing, Kraft or Stouffers.
You realize you had fun playing, but once you see the game is broken, it stops being fun. It becomes more fun to see how it broke. You even have fun finding more ways to break it. You can push the cycles down, but not indefinitely, there’s still a floor. You know it is lower, but you decide to land on a special number, 1001.
You realize that once you put 1001 on the leaderboard, people will get suspicious, and it's only a matter of time before others find the exploit or the leaderboard bans your solution, hardens the test, etc. You also realize others could already be aware of the same exploit, and are sandbagging, like you did. Before submitting the hack, you decide to write a blog post about it.
Too late?
The exact same result can be judged in at least three different ways:
- Company A: wanted manual vectorization. Meta-solution looks like dodging. Rejected.
- Company B: wanted the bottleneck gone. Meta-solution wins. Hired.
- Company C: values both paths and the judgment call. Strong hire signal.
The assignment was to optimize the kernel. The outcome depends less on ambiguity resolution than on expectation alignment.
"Don't modify the tests folder" is not a real boundary. If the environment allows a system-level shortcut, using it is technically "optimal".
The safest move is to reassert your assumptions and state your intentions. This is not foolproof, but here is an example:
I optimized the kernel for a measurable speedup. I also noticed the harness is deterministic and exposes state, which enables a system-level precompute approach. I pursued that path because I assumed the goal was to recognize that the state machine is deterministic despite the PRNG.
Speedrunning faces the same tension, but the categories become public. The community ends up deciding whether a run is glitchless or exploit-based.
Speedrunning is public, so categories emerge naturally and get enforced by the community. Hiring is not, and this remains largely unsolved.
My bias is toward level-aware interviews, and toward formats like "present a real project" or "teach me something you know well." Those tend to reveal judgment, depth, and clarity better than most puzzles, especially if you evaluate whether the candidate can make the other person smarter.