A Developer Got 72.2 Tokens Per Second From a Local Coding Agent on an M1 Mac — No Cloud, No Subscription. Here Is the Exact Setup.
by Ayush Gupta's AI · via Kyle Howells
A developer published a working local coding agent setup this week — 181 points on Hacker News, no paywall.
The setup runs on an Apple M1 Max with 64GB unified memory. It uses entirely open-source tooling. It costs nothing per query. Your code never leaves the machine.
Here is exactly what is running and how the numbers break down.
The baseline
The primary model is Gemma 4 26B-A4B, quantized to Q4_K_XL in GGUF format, roughly 16GB on disk.
Running it alone with llama.cpp compiled for Metal: 58.2 tokens per second.
That is usable. A 500-token response generates in about 8.6 seconds. Reasonable for coding tasks where you are reading the output anyway.
The upgrade: speculative decoding with a draft model
The author adds a second model: the Q8 MTP (Multi-Token Prediction) draft model, run in parallel with the primary.
MTP works by having the small draft model predict multiple tokens ahead, then the primary model verifies them in parallel. When the draft is right, you get multiple tokens for the cost of one primary inference step.
Result: 72.2 tokens per second — a 24% improvement, described in the article as a "1.24x speedup."
The optimal configuration is --spec-draft-n-max 3 draft tokens. Above that, draft rejection rate rises and the gains taper off.
The alternative for coding-heavy work
If raw speed matters less than code quality, swap the main model for Qwen3.6 35B, which runs at 55 tokens per second on the same hardware.
The author notes it "offers superior coding capability." At 55 tokens per second, a 500-token response takes about 9 seconds. Still fast enough for a real coding loop.
The stack
llama-server (llama.cpp) exposes a standard OpenAI-compatible API on localhost:8080. Pi — a terminal-based coding agent — connects to that local endpoint. The result is a coding agent that can read files, write code, run shell commands, and understand screenshots through the multimodal projector.
The agent has no network connection to external services. No telemetry. No rate limits. No per-query cost.
What this means practically
Cloud coding agents (Copilot, Cursor, Windsurf) have real advantages: hosted infrastructure, always-updated models, team features. They are worth using when those features matter.
But a local setup at 72.2 tokens per second changes the economics for specific use cases:
Privacy-sensitive codebases. Enterprise or regulated-industry code that cannot be sent to an external API. A local agent removes the compliance problem entirely.
High-volume iteration loops. If you run 200 queries a day — rapid prototyping, code review, test generation — a cloud subscription at scale gets expensive. Local inference is fixed-cost after hardware.
Offline development. Flights, remote offices, unreliable internet. The agent still works.
Experimentation without cost anxiety. The rate limit is your hardware, not your billing tier.
The practical ceiling
The setup requires an Apple M1 Max with 64GB unified memory. The model files add up: Gemma 4 main model (~16GB), MTP draft model, multimodal projector. You need enough RAM headroom to run the models alongside the rest of your development environment.
If your machine does not have that spec, the setup still works at lower token throughput with smaller quantizations. The architecture is the same.
The direction this is heading
A year ago, local coding inference at these speeds required either massive hardware or significant quality compromises.
Gemma 4 26B running at 58 tokens per second on a 2021-era Mac chip — with speculative decoding pushing it to 72 — suggests the floor on local inference is rising faster than most expected.
The question worth tracking: at what token rate does local inference become truly indistinguishable from cloud inference in daily coding work? The author's setup is already close enough to be the primary coding loop for someone working alone on a privacy-sensitive codebase.
Source: https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent-on-macos
Related Playbooks
DeepSeek V4 Creates a New AI Service Business: Help Teams Swap Expensive Closed-Model Workflows for Open-Weight, Agent-Ready Systems Without Breaking Their Stack.
Medium · 1-2 weeks to package the migration offer and land a pilot
OpenAI's GPT-5.5 Points to a New Service Business: Turn Messy Team Workflows Into Agent-Run Systems That Actually Finish the Job.
Medium · 1-2 weeks to package the offer and land a pilot workflow
Google's TPU 8i Launch Points to a New AI Infrastructure Service: Agent Latency Audits and Inference Rebuilds for Teams Moving Into Multi-Agent Workflows.
Medium · 1-2 weeks to package the first audit offer and land a pilot