Cloud's Metal Linings Playbook

NVIDIA's Blackwell chip draws 1,200 Watts. That is two hairdryers running continuously. You cannot put a hairdryer inside a MacBook Air. The M3 Air's entire system power envelope is 30 Watts. Meanwhile, the average time-to-first-token for frontier API models runs 400ms on a good day, 800ms on a bad one, and over a second when inference clusters are under load. The industry is building for throughput while users experience latency.

Local inference inverts both problems. A quantized 7B model on Apple Silicon hits 47ms time-to-first-token. Consistently. No network variance, no API rate limits, no cold starts. The capability gap between a 7B local model and Claude Opus is real. But capability isn't the only variable that matters.

The cognitive break

Nielsen's research established response time thresholds decades ago: 100ms feels instant, 1 second maintains flow, 10 seconds loses attention entirely. These aren't arbitrary numbers. They're properties of human working memory, related to Miller's capacity limits. Every 100ms of latency gives the mental context you're holding time to decay.

When you're deep in a refactor, you have maybe seven things in working memory: the function signature you're changing, the call sites to update, the test that's failing, the edge case you just thought of, the architectural constraint you're respecting. A 400ms pause doesn't just delay the response. It gives those seven things time to start decaying. By the time the AI responds, you've lost the edge case. The AI saved typing but cost thinking.

IBM researchers in the 1980s called this the Doherty Threshold: system response times under 400ms keep users in a productive flow state, while responses over 400ms cause measurable drops in engagement and task completion. AI coding assistants operate above this threshold by default.

| Response Latency | Context Reconstruction | Net Productivity Impact |
|---|---|---|
| < 100ms (local) | 0ms | +38% |
| 100-300ms | ~200ms | +29% |
| 300-600ms | ~800ms | +14% |
| 600ms-1s | ~2.1s | +3% |
| > 1s | ~4.7s | -8% |

The last row matters most. AI assistance with more than a second of latency makes developers less productive than no AI at all. The context reconstruction cost exceeds the value of the assistance. Benchmarks measure task completion, not developer experience. A tool that completes a task 50% faster but interrupts flow state can be a net negative.
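To make that trade-off concrete, here is a minimal sketch of the accounting. The function and its constants are illustrative assumptions that loosely mirror the table above, not measurements:

```typescript
// Illustrative model: does an AI suggestion save net time once you account
// for waiting on it and for rebuilding the mental context it interrupted?
// All constants are assumptions for the sketch, not measured values.

interface AssistRequest {
  latencyMs: number;     // time-to-first-token plus generation
  typingSavedMs: number; // keystrokes the suggestion replaces
}

// Assumed mapping from latency to context-reconstruction cost, loosely
// following the table above: short waits cost nothing, long waits cost
// far more than the wait itself.
function reconstructionCostMs(latencyMs: number): number {
  if (latencyMs < 100) return 0;
  if (latencyMs < 300) return 200;
  if (latencyMs < 600) return 800;
  if (latencyMs < 1000) return 2100;
  return 4700;
}

function netBenefitMs(req: AssistRequest): number {
  return req.typingSavedMs - req.latencyMs - reconstructionCostMs(req.latencyMs);
}

// The same suggestion, served locally at 47ms vs. from the cloud at 900ms.
console.log(netBenefitMs({ latencyMs: 47, typingSavedMs: 1500 }));  //  1453ms saved
console.log(netBenefitMs({ latencyMs: 900, typingSavedMs: 1500 })); // -1500ms lost
```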

The thermal wall

Datacenter GPUs run at 80°C continuously because they have 300W of active cooling. Laptop GPUs start throttling at 95°C with 15W of passive cooling. The M3 Air has no fan. When junction temperature hits the limit, the GPU clocks down. A kernel that runs at 30 tokens/second at 70°C runs at 8 tokens/second at 95°C.

Every optimization that improves efficiency is also a thermal optimization. Watts per token isn't just an energy metric; it's a sustained performance metric. INT4 quantization cuts memory bandwidth (and thus power) by 4x compared to FP16. In practice, dequantization overhead consumes some of the savings, but real-world improvement is still 2.5-3x. Enough to make edge inference viable.
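A back-of-the-envelope sketch of why, assuming decode is memory-bandwidth-bound (each generated token streams the full weight set once) and an assumed sustained bandwidth of 100 GB/s for a fanless laptop:

```typescript
// Back-of-the-envelope: decode throughput when generation is bound by
// memory bandwidth (every token reads the full weight set once).
// The bandwidth figure and overhead factor are assumptions for the sketch.

const PARAMS_7B = 7e9;
const MEM_BANDWIDTH_BPS = 100e9; // assumed sustained bandwidth, bytes/s

function tokensPerSecond(bytesPerParam: number, overhead = 1.0): number {
  const bytesPerToken = PARAMS_7B * bytesPerParam * overhead;
  return MEM_BANDWIDTH_BPS / bytesPerToken;
}

console.log(tokensPerSecond(2));        // FP16: ~7 tok/s
console.log(tokensPerSecond(0.5));      // INT4, ideal: ~29 tok/s
console.log(tokensPerSecond(0.5, 1.4)); // INT4 + assumed dequant overhead: ~20 tok/s
```

Under these assumptions the ideal 4x collapses to roughly 3x once dequantization overhead is charged, in line with the 2.5-3x above.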

| Hardware | TDP | Cooling | Sustained Tokens/s (7B) |
|---|---|---|---|
| H100 (datacenter) | 700W | Active (300W) | ~150 |
| RTX 4090 (desktop) | 450W | Active (150W) | ~80 |
| M2 Max (laptop) | 96W | Active (30W) | ~35 |
| M3 Air (fanless) | 30W | Passive | ~12 (throttled) |

WebGPU limitations

Running inference locally via WebGPU has advantages: cross-platform deployment, no native installation, browser-level sandboxing. But WGSL is still catching up to CUDA for ML workloads. Cross-platform portability comes at a cost.

Subgroup operations are new and fragile: WGSL subgroups landed in Chrome 128 (late 2024), but portability remains problematic. CUDA's __shfl_sync just works; WGSL's equivalent behaves differently across devices.

No tensor cores: NVIDIA tensor cores do 4x4 matrix ops in a single cycle. Metal has simdgroup_matrix, but WebGPU can't expose vendor-specific hardware. Every matmul is general-purpose SIMD.

No async memory: CUDA can overlap compute and memory transfer. WGSL compute shaders block on memory. You can't hide latency.

The gap is closing, but it's still real. A CUDA kernel for FlashAttention is around 200 lines. The WGSL equivalent is longer to write and slower to run. This is the tax for browser-native deployment.
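In practice, paying that tax means feature-detecting everything and keeping a fallback path. A minimal TypeScript sketch for subgroups (assuming @webgpu/types; the "subgroups" feature name follows the current spec, but earlier Chrome builds shipped it under experimental names, so treat the subgroup kernel as an optional fast path):

```typescript
// Feature-detect subgroup support and fall back to a plain workgroup
// kernel when it is missing. Assumes @webgpu/types; the "subgroups"
// feature name may differ in older browser builds.
async function getDeviceWithSubgroups(): Promise<{ device: GPUDevice; hasSubgroups: boolean }> {
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) throw new Error("WebGPU not available");

  const hasSubgroups = adapter.features.has("subgroups");
  const device = await adapter.requestDevice({
    requiredFeatures: hasSubgroups ? (["subgroups"] as GPUFeatureName[]) : [],
  });

  // Callers choose between a subgroup-reduction kernel and a
  // workgroup-shared-memory fallback based on this flag.
  return { device, hasSubgroups };
}
```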

See the strain: Thermal Envelope Simulator

The interactive below models the thermal constraints of edge inference. Set target tokens/second and model size, then watch junction temperature climb as inference runs. See how different power envelopes (fanless laptop vs desktop vs datacenter) determine sustained throughput. The thermal wall is real.
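If you just want the shape of that simulation in code, here is a minimal lumped-thermal sketch. The thermal resistance, heat capacity, and throttle curve are invented for illustration; only the dynamics matter:

```typescript
// Minimal lumped thermal model of sustained inference on a fanless device.
// Constants (thermal resistance, heat capacity, throttle curve) are invented
// for illustration. The point is the shape: temperature climbs until the chip
// sheds power, and tokens/s settles wherever heat in and heat out balance.

const AMBIENT_C = 25;
const THROTTLE_START_C = 85;
const THROTTLE_END_C = 100;
const R_THERMAL = 3.5; // °C of rise per watt the chassis can shed (assumed)
const C_THERMAL = 60;  // joules per °C of heat capacity (assumed)

function simulate(fullPowerW: number, fullSpeedTokS: number, seconds: number): void {
  let tempC = AMBIENT_C;
  for (let t = 0; t <= seconds; t++) {
    // Linear throttle: full speed below 85°C, floor of 10% at 100°C.
    const throttle = Math.min(1, Math.max(0.1,
      1 - (tempC - THROTTLE_START_C) / (THROTTLE_END_C - THROTTLE_START_C)));
    const heatInW = fullPowerW * throttle;
    const heatOutW = (tempC - AMBIENT_C) / R_THERMAL;
    tempC += (heatInW - heatOutW) / C_THERMAL; // one-second Euler step
    if (t % 300 === 0) {
      console.log(`${t}s  ${tempC.toFixed(1)}°C  ${(fullSpeedTokS * throttle).toFixed(1)} tok/s`);
    }
  }
}

// A 30W-class part that starts at 30 tok/s settles near 19 tok/s
// once the passive chassis saturates.
simulate(30, 30, 1800);
```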

Hybrid routing

The solution isn't local-only or cloud-only. It's intelligent routing based on task complexity:

Tier 1 (Local, < 100ms): Autocomplete, syntax highlighting, simple refactors, documentation lookup. High-frequency, low-complexity. Latency matters more than capability.

Tier 2 (Local Ensemble, < 500ms): Multi-file refactors, test generation, code review. Complexity justifies ensemble overhead. Still local, still fast enough to maintain flow.

Tier 3 (Cloud, best-effort): Novel architecture questions, complex debugging, tasks requiring world knowledge beyond the codebase. Accept the latency hit because the task genuinely requires frontier capability.

For most coding workflows, roughly 80% of AI assistance requests fall into Tier 1. Another 15-17% into Tier 2. Only 3-5% actually need cloud-scale models. The industry has been routing 100% of requests through Tier 3 infrastructure because that's what the business model requires, not what the task demands.
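A sketch of what that router might look like. The tier boundaries mirror the list above; the task taxonomy and latency budgets are placeholder assumptions, not any real product's policy:

```typescript
// Hypothetical tiered router: the tiers mirror the list above, but the
// complexity heuristic and latency budgets are placeholder assumptions.

type Tier = "local" | "local-ensemble" | "cloud";

interface AssistTask {
  kind: "autocomplete" | "refactor" | "test-gen" | "code-review" | "architecture" | "debug";
  filesTouched: number;
  needsWorldKnowledge: boolean;
}

function route(task: AssistTask): { tier: Tier; latencyBudgetMs: number } {
  // Tier 3: anything that genuinely needs frontier capability.
  if (task.needsWorldKnowledge || task.kind === "architecture") {
    return { tier: "cloud", latencyBudgetMs: Infinity };
  }
  // Tier 2: multi-file work, test generation, review.
  if (task.filesTouched > 1 || task.kind === "test-gen" || task.kind === "code-review") {
    return { tier: "local-ensemble", latencyBudgetMs: 500 };
  }
  // Tier 1: the high-frequency default.
  return { tier: "local", latencyBudgetMs: 100 };
}

route({ kind: "autocomplete", filesTouched: 1, needsWorldKnowledge: false });
// => { tier: "local", latencyBudgetMs: 100 }
```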

The mesh problem

Hybrid routing assumes a single device. But what happens when you have multiple local agents? If Agent A refactors a file while Agent B writes a test for it simultaneously, how do they merge intent without a central server? The answer isn't an API call to the cloud. It's a mesh.

Conflict-free Replicated Data Types (CRDTs) solve this for databases: operations that can be applied in any order and still converge to the same state. The same principle applies to agent coordination. Instead of locking files or routing through a central arbiter, agents broadcast intent and resolve conflicts locally. State isn't a database row stored in the cloud. State is a rumor spreading through the mesh.
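A toy version of the idea, using a last-writer-wins map of per-file intents rather than a real agent-coordination protocol. The point is only that merge is commutative and idempotent, so gossip order doesn't matter:

```typescript
// Toy CRDT: a last-writer-wins map of file -> intended edit, keyed by a
// (lamport clock, agentId) timestamp. Merging is commutative, associative,
// and idempotent, so agents can gossip updates in any order and converge.

interface Intent {
  file: string;
  edit: string;    // e.g. "rename parseConfig -> loadConfig"
  clock: number;   // Lamport clock
  agentId: string; // tie-breaker
}

type IntentMap = Map<string, Intent>;

function wins(a: Intent, b: Intent): boolean {
  return a.clock > b.clock || (a.clock === b.clock && a.agentId > b.agentId);
}

function merge(local: IntentMap, remote: IntentMap): IntentMap {
  const out = new Map(local);
  for (const [file, intent] of remote) {
    const current = out.get(file);
    if (!current || wins(intent, current)) out.set(file, intent);
  }
  return out;
}

// Agent A refactors, Agent B writes a test; either merge order converges.
const a: IntentMap = new Map([["utils.ts", { file: "utils.ts", edit: "refactor", clock: 3, agentId: "A" }]]);
const b: IntentMap = new Map([["utils.test.ts", { file: "utils.test.ts", edit: "add test", clock: 2, agentId: "B" }]]);
// merge(a, b) and merge(b, a) produce the same map.
```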

The local-first thesis extends beyond single-device inference to peer-to-peer agent networks. Your laptop, your phone, your home server—all running local models, all syncing context without touching the internet. The cloud becomes optional infrastructure, not mandatory dependency.

The edge calculus

Cloud inference will always be more powerful. H100s will always beat MacBook Airs on raw capability. But power isn't the only dimension. Latency, privacy, offline capability, cost structure all favor edge. The question isn't whether edge inference is competitive with cloud inference. It's whether edge inference is good enough for the task at hand.

For autocomplete, the answer is yes. A 4-bit quantized 3B model running at 25 tokens/second on a fanless laptop is better than a 70B model at 50 tokens/second from a datacenter with 400ms network latency. The task doesn't need frontier intelligence. It needs instant response.
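The arithmetic behind that claim, using this post's numbers plus an assumed 15-token completion and the 400ms network figure from the opening:

```typescript
// Wall-clock time to finish a short autocomplete suggestion:
//   local: time-to-first-token + N tokens at the local rate
//   cloud: network round trip + N tokens at the datacenter rate
// The token count and the 400ms network figure are rough assumptions.

function completionMs(ttftMs: number, tokensPerSec: number, tokens: number): number {
  return ttftMs + (tokens / tokensPerSec) * 1000;
}

const N = 15; // a typical single-line completion (assumed)
console.log(completionMs(47, 25, N));  // local 3B @ 25 tok/s  -> ~647ms
console.log(completionMs(400, 50, N)); // cloud 70B @ 50 tok/s -> ~700ms
```

The totals are close, but the local path shows its first token at 47ms, while the cloud path shows nothing until the Doherty threshold has already passed.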

For complex reasoning, the answer is not yet. Coding agents that need to understand novel architectures, debug subtle race conditions, or design systems still benefit from cloud-scale models. But the boundary is moving. Every efficiency improvement expands what's possible locally. Every quantization advance, every kernel optimization, every architectural innovation shifts the line between "needs cloud" and "runs on your laptop."

The edge imperative isn't about replacing cloud inference. It's about recognizing that 400ms is too slow for flow state, 30 Watts is the thermal budget you actually have, and most tasks don't need frontier intelligence anyway.