Cloud's metal linings playbook
An NVIDIA Blackwell GPU can draw up to about 1,200 watts; a MacBook Air draws about 30. That gap matters when a team is running 6–10 prompt cycles per minute. In practice, local inference is a routing choice: predictable latency and no external dependency in exchange for lower peak capability. With a quantized model on Apple Silicon, response times can be consistent because there is no network variance, no API rate limit, and no cold start. The gap to frontier models is real, but it keeps shrinking, especially for instruction-driven coding tasks.
Latency and cognition
Nielsen's research laid out response-time thresholds decades ago (roughly 0.1 s, 1 s, and 10 s), and Miller's limits on working-memory capacity help explain why latency is costly: the few items you can hold start to fade while you wait. The Doherty Threshold puts the break around 400ms, and AI assistants often respond at or beyond that mark, so the tool saves typing at the expense of thinking.
When you are deep in a refactor, you juggle a handful of things in working memory: the function signature, call sites, the failing test, the edge case, the architectural constraint. A 400ms pause gives those items time to fade. By the time the AI responds, you have lost the edge case. Latency breaks the working-memory loop.
Performance constraints
Datacenter GPUs run with active cooling, while laptops throttle quickly. On fanless hardware, clocks drop as temperatures rise, so energy per token (joules per token, often loosely called watts per token) determines sustained performance: it is both an energy constraint and a performance ceiling. Quantization helps because it reduces memory-bandwidth pressure. Dequantization eats some of the gain, but the net improvement still makes edge inference viable.
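The bandwidth argument can be made concrete with a back-of-the-envelope estimate. The sketch below assumes single-stream decoding is memory-bound, so each generated token requires reading every weight once. The 7B parameter count, the 100 GB/s bandwidth figure, the 30 W power draw, and the `efficiency` factor standing in for dequantization overhead are illustrative assumptions, not measurements.

```python
def decode_tokens_per_second(params_billion, bits_per_weight, bandwidth_gb_s,
                             efficiency=1.0):
    """Upper-bound decode rate if every weight is read once per token.

    efficiency (0..1) is a hypothetical fudge factor for overheads such as
    dequantization; 1.0 means the full memory bandwidth is usable.
    """
    model_bytes = params_billion * 1e9 * bits_per_weight / 8
    return efficiency * bandwidth_gb_s * 1e9 / model_bytes


def joules_per_token(power_watts, tokens_per_second):
    """Energy per token: watts are joules per second, so divide by tokens/s."""
    return power_watts / tokens_per_second


if __name__ == "__main__":
    # Assumed: 7B-parameter model, 100 GB/s memory bandwidth, 30 W power draw.
    for bits in (16, 8, 4):
        tps = decode_tokens_per_second(7, bits, 100, efficiency=0.8)
        print(f"{bits:>2}-bit: {tps:5.1f} tok/s, "
              f"{joules_per_token(30, tps):.2f} J/token")
```

Under these assumptions, halving the bits per weight roughly doubles the decode ceiling and halves the energy per token, which is why quantization matters more on bandwidth-limited laptops than raw compute does.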
Running inference locally via WebGPU brings cross-platform deployment and browser sandboxing, but WGSL, even with subgroup operations, still trails CUDA on performance and consistency. CUDA gives developers tensor-core paths and async memory controls that can overlap compute and transfer. WebGPU can use fast hardware through the browser's implementation, but portable WGSL cannot assume CUDA-style tensor intrinsics or vendor-specific scheduling controls. It is a portability tax: one runtime target across browsers and devices, but lower peak throughput and less explicit control than CUDA.
A simple thermal model makes the constraint concrete. Fix a target tokens per second and a model size, and junction temperature climbs as inference runs until the chip throttles. Comparing power envelopes (fanless laptop vs desktop vs datacenter) shows what determines sustained throughput: how fast the system can shed heat, not its peak clocks.
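One way to explore this is a lumped thermal model: a single thermal resistance and capacitance per device, with a crude throttle that backs off when the junction hits its limit. Everything numeric below (the resistance and capacitance values, the 100 °C limit, the 5% back-off and 1% recovery rates, and 1 J per token) is an illustrative assumption, not data for any real device.

```python
from dataclasses import dataclass


@dataclass
class Envelope:
    name: str
    r_th: float  # thermal resistance, degC per watt (lower = better cooling)
    c_th: float  # thermal capacitance, joules per degC


def simulate(env, target_tps, joules_per_token, seconds=3600, dt=1.0,
             t_ambient=25.0, t_limit=100.0):
    """Euler-step a one-node thermal model; return (sustained_tps, peak_temp)."""
    temp, tps, peak = t_ambient, target_tps, t_ambient
    history = []
    for _ in range(int(seconds / dt)):
        if temp >= t_limit:
            tps *= 0.95                           # throttle: back off 5%
        else:
            tps = min(target_tps, tps * 1.01)     # recover slowly toward target
        power = tps * joules_per_token            # watts = tokens/s * J/token
        # Heat in minus heat shed to ambient, divided by thermal mass.
        temp += dt * (power - (temp - t_ambient) / env.r_th) / env.c_th
        peak = max(peak, temp)
        history.append(tps)
    n = max(1, len(history) // 6)                 # average the final sixth
    return sum(history[-n:]) / n, peak


if __name__ == "__main__":
    envelopes = [
        Envelope("fanless laptop", r_th=5.0, c_th=50.0),
        Envelope("desktop", r_th=0.5, c_th=200.0),
        Envelope("datacenter", r_th=0.1, c_th=500.0),
    ]
    for env in envelopes:
        sustained, peak = simulate(env, target_tps=20.0, joules_per_token=1.0)
        print(f"{env.name:>15}: sustained {sustained:5.1f} tok/s, "
              f"peak {peak:5.1f} degC")
```

In this toy setup the fanless envelope can only shed about (100 - 25) / 5 = 15 W at the limit, so a 20 W workload settles near 15 tokens per second regardless of the target, while the better-cooled envelopes hold the target indefinitely.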
Routing and coordination
This is a routing problem, not a purity test. Keep short-loop tasks local, use local ensembles for higher-stakes work, and send the hardest or most novel tasks to the cloud.
Tier 1 is autocomplete, small refactors, and documentation lookup. Tier 2 is multi-file refactors, test generation, and code review. Tier 3 is novel architecture questions and deep debugging that still need cloud-scale models.
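The three tiers map naturally onto a small lookup-based router. This is a minimal sketch: the task labels, the `Tier` and `route` names, the choice to treat unlisted tasks as novel, and the offline fallback to the local ensemble are all assumptions made for illustration.

```python
from enum import Enum


class Tier(Enum):
    LOCAL = 1           # Tier 1: single local model
    LOCAL_ENSEMBLE = 2  # Tier 2: several local models cross-checking
    CLOUD = 3           # Tier 3: cloud-scale model


# Hypothetical task labels, grouped as in the tiers described above.
TASK_TIERS = {
    "autocomplete": Tier.LOCAL,
    "small_refactor": Tier.LOCAL,
    "doc_lookup": Tier.LOCAL,
    "multi_file_refactor": Tier.LOCAL_ENSEMBLE,
    "test_generation": Tier.LOCAL_ENSEMBLE,
    "code_review": Tier.LOCAL_ENSEMBLE,
    "novel_architecture": Tier.CLOUD,
    "deep_debugging": Tier.CLOUD,
}


def route(task_kind, cloud_available=True):
    """Pick a tier for a task; unknown tasks are treated as novel (cloud)."""
    tier = TASK_TIERS.get(task_kind, Tier.CLOUD)
    if tier is Tier.CLOUD and not cloud_available:
        # Assumed fallback: offline, the best remaining option is the ensemble.
        return Tier.LOCAL_ENSEMBLE
    return tier


if __name__ == "__main__":
    for kind in ("autocomplete", "code_review", "deep_debugging"):
        print(kind, "->", route(kind).name)
    print("deep_debugging (offline) ->",
          route("deep_debugging", cloud_available=False).name)
```

A real router would likely classify tasks from the prompt and context rather than from a fixed label, but a static table keeps the policy easy to audit.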
Hybrid routing usually assumes one device. Once you have several local agents, they need to sync with each other instead of bouncing everything through a cloud API.
Conflict-free Replicated Data Types (CRDTs) solve one part of this for databases: shared state can converge even when updates arrive in different orders. Local agents need that kind of state synchronization, plus separate policy for trust, permissions, and conflict handling. Local-first means your laptop, your phone, and your home server can sync context without routing every update through a cloud hub.
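To illustrate the convergence property, here is a minimal last-writer-wins map, one of the simplest CRDTs. The `LWWMap` class, the key names, and the integer timestamps are hypothetical. Note that last-writer-wins silently drops the older of two concurrent writes, which is exactly the kind of conflict-handling policy that has to be chosen separately.

```python
class LWWMap:
    """Last-writer-wins map: each key keeps the write with the highest
    (timestamp, replica_id) pair, so merges commute and are idempotent."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.entries = {}  # key -> (timestamp, replica_id, value)

    def _accept(self, key, candidate):
        current = self.entries.get(key)
        # Compare only (timestamp, replica_id); replica_id breaks ties.
        if current is None or candidate[:2] > current[:2]:
            self.entries[key] = candidate

    def set(self, key, value, timestamp):
        self._accept(key, (timestamp, self.replica_id, value))

    def merge(self, other):
        """Fold another replica's state into this one."""
        for key, entry in other.entries.items():
            self._accept(key, entry)

    def value(self, key):
        entry = self.entries.get(key)
        return entry[2] if entry else None


if __name__ == "__main__":
    laptop, phone = LWWMap("laptop"), LWWMap("phone")
    laptop.set("open_file", "a.py", timestamp=1)
    phone.set("open_file", "b.py", timestamp=2)
    phone.set("branch", "main", timestamp=1)

    # Sync directly in both directions, in either order, with no cloud hub.
    laptop.merge(phone)
    phone.merge(laptop)
    print(laptop.value("open_file"), phone.value("open_file"))
    print(laptop.entries == phone.entries)
```

Both replicas end up with the same entries regardless of merge order, and merging again changes nothing. Trust and permissions are outside what the data type provides and would sit in a layer above it.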
The edge trade-off
Cloud inference will keep the highest peak capability, but latency, privacy, offline use, and cost pull routine work to the edge. The question is which tier fits the task: autocomplete, small refactors, and documentation lookup can often stay local; novel architecture questions, subtle race conditions, and system design still benefit from cloud-scale models. That boundary will keep shifting as local models get cheaper and better.