Cloud's metal linings playbook

NVIDIA's Blackwell chip draws 1,200 watts. A MacBook Air has roughly 30 watts for the entire system. That is the difference between a delivery truck and a bike: both move things, but you choose them for different jobs. Frontier API responses often land at 400ms to 800ms, and longer under load. We build for throughput while users feel latency. Local inference trades peak capability for predictable latency and fewer external constraints. With a quantized model on Apple Silicon, responses are consistent because there is no network variance, no API rate limits, and no cold starts. The gap to frontier models is real, but it keeps shrinking, especially for instruction-driven coding tasks.

Latency and cognition

Nielsen's research laid out response-time thresholds decades ago, and Miller's capacity limits explain why latency erodes working memory. The Doherty Threshold puts the break in flow at around 400ms, which is where AI assistants often live, so the tool saves typing at the expense of thinking.

When you are deep in a refactor, you juggle a handful of things in working memory: the function signature, call sites, the failing test, the edge case, the architectural constraint. A 400ms pause gives those items time to fade. By the time the AI responds, you have lost the edge case. Latency erodes more than productivity; it erodes working memory itself.

Performance constraints

Datacenter GPUs run with active cooling, while laptops throttle quickly. On fanless hardware, clocks drop as temperatures rise, so watts per token sets sustained performance: it is both an energy constraint and a performance ceiling. Quantization helps because it reduces memory bandwidth pressure; even when dequantization eats some of the gains, the net improvement still makes edge inference viable.
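To make the bandwidth argument concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it is an illustrative assumption, not a measurement: decoding with a memory-bound model streams the weights once per token, so tokens per second is roughly usable bandwidth divided by bytes per token.

```python
# Rough, memory-bound estimate of decode speed: each generated token streams
# the full weight set once, so tokens/sec ≈ usable bandwidth / bytes per token.
# All figures are illustrative assumptions, not benchmarks.

def decode_tokens_per_sec(params_billion: float, bits_per_weight: float,
                          bandwidth_gb_s: float, efficiency: float = 0.6) -> float:
    bytes_per_token = params_billion * 1e9 * bits_per_weight / 8
    usable_bandwidth = bandwidth_gb_s * 1e9 * efficiency  # fraction of peak actually reached
    return usable_bandwidth / bytes_per_token

# Hypothetical 8B-parameter model on a laptop-class chip with ~100 GB/s of bandwidth.
for label, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{decode_tokens_per_sec(8, bits, 100):.0f} tokens/sec")
```

Halving the bits roughly doubles the ceiling, which is why quantization buys more on bandwidth-starved edge hardware than the dequantization overhead costs.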

Running inference locally via WebGPU brings cross-platform deployment and browser sandboxing, but WGSL, even with subgroups, still trails CUDA on performance and consistency, and the lack of tensor-core-style ops keeps matmuls slower. CUDA exposes tensor cores and async memory copies to overlap compute and transfer, while WebGPU cannot expose vendor-specific hardware, so matmuls run on general SIMD and latency cannot be hidden. It is a portability tax.
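A toy timing model shows what latency hiding buys. The per-tile costs below are invented; the point is the structure: with async copies the next tile's transfer overlaps the current tile's compute, and without them the two serialize.

```python
# Toy model of a tiled matmul: CUDA-style async copies let transfer and compute
# overlap, so each step costs max(transfer, compute); without that (the
# WebGPU-style case above), each step costs transfer + compute.
# Timings are invented for illustration.

def step_time(transfer_ms: float, compute_ms: float, overlap: bool) -> float:
    return max(transfer_ms, compute_ms) if overlap else transfer_ms + compute_ms

tiles = 64
transfer_ms, compute_ms = 0.30, 0.35  # hypothetical per-tile costs

serial = tiles * step_time(transfer_ms, compute_ms, overlap=False)
overlapped = tiles * step_time(transfer_ms, compute_ms, overlap=True)
print(f"serialized: {serial:.1f} ms, overlapped: {overlapped:.1f} ms")
```

When transfer and compute are comparable, losing the overlap nearly doubles the step time, and that is a large part of the tax.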

Model the thermal constraints of edge inference. Set target tokens per second and model size, then watch junction temperature climb as inference runs. Compare power envelopes (fanless laptop vs desktop vs datacenter) and see what determines sustained throughput.
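Here is a minimal simulation sketch of that exercise, using a first-order thermal model. The thermal resistances, throttle temperature, and peak throughputs are all invented assumptions.

```python
# Minimal thermal-throttling sketch: junction temperature rises toward
# (ambient + power * thermal_resistance); once it crosses the throttle
# threshold, the chip scales clocks (and tokens/sec) down.
# All constants are invented for illustration.

def sustained_tps(power_w: float, thermal_resistance: float, peak_tps: float,
                  throttle_temp: float = 95.0, ambient: float = 25.0,
                  tau_s: float = 60.0, seconds: int = 600) -> float:
    temp, scale = ambient, 1.0
    for _ in range(seconds):
        if temp > throttle_temp:
            scale = max(0.4, scale - 0.01)       # throttle while too hot
        steady = ambient + power_w * scale * thermal_resistance
        temp += (steady - temp) / tau_s          # first-order approach to steady state
    return peak_tps * scale                      # throughput after ten minutes

# Hypothetical envelopes: fanless laptop, desktop GPU, datacenter GPU.
for name, power, r_th, peak in [("fanless laptop", 15, 6.0, 20),
                                ("desktop", 300, 0.2, 120),
                                ("datacenter", 1200, 0.05, 400)]:
    print(f"{name}: ~{sustained_tps(power, r_th, peak):.0f} tokens/sec sustained")
```

The specific numbers do not matter; the shape does. On the fanless envelope the steady-state temperature sits above the throttle point, so sustained throughput is set by thermals, not by what the silicon can do in the first minute.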

Routing and coordination

The solution is not local-only or cloud-only. Route by complexity: keep short-loop tasks local, use local ensembles for higher-stakes work, and send the hardest or most novel tasks to the cloud.

Tier 1 is autocomplete, small refactors, and documentation lookup. Tier 2 is multi-file refactors, test generation, and code review. Tier 3 is novel architecture questions and deep debugging that still need cloud-scale models.
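A minimal router over those tiers might look like the sketch below. The tier contents come from the paragraph above; the task labels and the enum are assumptions for illustration.

```python
# Minimal complexity router for the three tiers described above.
# Task labels and tier membership are illustrative assumptions.
from enum import Enum

class Backend(Enum):
    LOCAL = "local model"
    LOCAL_ENSEMBLE = "local ensemble"
    CLOUD = "cloud-scale model"

TIER_1 = {"autocomplete", "small_refactor", "doc_lookup"}
TIER_2 = {"multi_file_refactor", "test_generation", "code_review"}

def route(task_kind: str) -> Backend:
    if task_kind in TIER_1:
        return Backend.LOCAL           # short loop: latency matters most
    if task_kind in TIER_2:
        return Backend.LOCAL_ENSEMBLE  # higher stakes, still on device
    return Backend.CLOUD               # novel architecture, deep debugging

print(route("autocomplete"))         # Backend.LOCAL
print(route("test_generation"))      # Backend.LOCAL_ENSEMBLE
print(route("novel_architecture"))   # Backend.CLOUD
```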

Hybrid routing assumes a single device. With multiple local agents, coordination needs a mesh, not a cloud API.

Conflict-free Replicated Data Types (CRDTs) solve this for databases: operations applied in any order still converge. The same principle maps to local agent coordination without routing everything through a cloud hub. Local-first means your laptop, your phone, and your home server can sync context without touching the internet.
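The simplest CRDT makes the convergence property concrete. A grow-only set merges by union, which is commutative, associative, and idempotent, so replicas agree regardless of sync order. This is only a sketch; real shared agent context would want richer types such as registers or sequences.

```python
# Grow-only set CRDT: merge is set union, so syncs commute and converge.
# A sketch only; real agent context would need richer CRDTs.

class GSet:
    def __init__(self) -> None:
        self.items: set[str] = set()

    def add(self, item: str) -> None:
        self.items.add(item)

    def merge(self, other: "GSet") -> None:
        self.items |= other.items

# Laptop and phone record context independently, offline.
laptop, phone = GSet(), GSet()
laptop.add("edge case: empty input")
phone.add("constraint: no network calls in the hot path")

# Sync peer to peer, in any order: both replicas end in the same state.
laptop.merge(phone)
phone.merge(laptop)
assert laptop.items == phone.items
```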

The edge trade-off

Cloud inference will always be more powerful, but latency, privacy, offline capability, and cost pull work to the edge. The question is which tier fits the task, and how much of the loop can stay on device. Local models are improving quickly, so more day-to-day workflows already can.

For complex reasoning, the answer is not yet all of it. Coding agents that need to understand novel architectures, debug subtle race conditions, or design systems still benefit from cloud-scale models. But the boundary is moving as efficiency improves, so more of the loop can stay local without losing capability.