Technical Context
I latched onto the 115 tok/sec figure not because of a pretty screenshot, but because it represents a viable working speed for AI automation on a Mac, not just a lab curiosity. We're talking about gemma-4-26B-A4B-it-mlx-lm-4bit, a 26B Mixture-of-Experts (MoE) model where roughly 4B parameters are active per token.
This is a crucial detail. On paper the model is large, but at decode time only the active experts' weights need to be read for each token, so the per-token compute and memory traffic are far closer to a ~4B model than to a dense 26B or 30B one. That's why the Gemma 4 + MLX combination on Apple Silicon now looks less like a compromise and more like a practical AI integration for local scenarios.
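To make the difference concrete, here is the back-of-the-envelope math I use. It's a sketch, assuming that decode at batch size 1 is mostly limited by memory bandwidth and that only the active experts' weights are streamed per token; the 400 GB/s figure is my placeholder for a higher-end M-series chip, not a measured spec for any particular machine.

```python
# Rough decode-speed ceiling: at batch size 1, generation is usually bound by
# how many weight bytes must be read from unified memory per token.

def decode_ceiling_tok_s(active_params_b: float, bits_per_weight: int,
                         mem_bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/sec, ignoring attention, KV-cache traffic
    and runtime overhead."""
    bytes_per_token_gb = active_params_b * bits_per_weight / 8  # ~GB read per token
    return mem_bandwidth_gb_s / bytes_per_token_gb

BANDWIDTH = 400.0  # GB/s -- assumed value for a higher-end M-series chip

# ~4B active parameters at 4-bit (the MoE case discussed here)
print(f"MoE, ~4B active: ~{decode_ceiling_tok_s(4, 4, BANDWIDTH):.0f} tok/s ceiling")
# a dense 26B at 4-bit, for comparison
print(f"Dense 26B:       ~{decode_ceiling_tok_s(26, 4, BANDWIDTH):.0f} tok/s ceiling")
```

On those assumptions, the sparse model's ceiling lands around 200 tok/sec while a dense 26B sits near 30, which is exactly why a measured 115 tok/sec is plausible for the MoE variant and out of reach for a dense model of the same size.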
I haven't seen an official Google benchmark for this specific setup. The source here is essentially the community: MLX-LM, a 4-bit build for Apple, optimizations like TurboQuant, and measurements from people running it live on M-series chips. A key part of the news is that 115 tok/sec is noticeably higher than what many previously saw through clunky pipelines or fallback modes.
And here, I wouldn't lump everything together. Ollama, llama.cpp, and raw MLX-LM produce very different numbers, and so do context length and whether you're looking at prefill or decode. If someone saw 2 tok/sec on a 26B MoE and concluded the model wasn't viable locally, this benchmark points to the opposite conclusion: the problem was often the stack, not the model.
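If you'd rather sanity-check your own stack than trust someone else's screenshot, a minimal sketch with the mlx-lm Python API is enough. The model path below is a placeholder, the prompt is arbitrary, the token count is approximate, and argument names can shift slightly between mlx-lm versions, so treat this as a starting point rather than a benchmark harness.

```python
# Minimal timing sketch for mlx-lm (pip install mlx-lm).
import time
from mlx_lm import load, generate

MODEL = "path-or-hf-repo-of-your-4bit-mlx-build"  # placeholder, substitute your own

model, tokenizer = load(MODEL)

prompt = "Summarize the idea of mixture-of-experts models in three sentences."

start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=False)
elapsed = time.perf_counter() - start

# Rough count of generated tokens; passing verbose=True instead would print
# the library's own prompt and generation speeds.
n_tokens = len(tokenizer.encode(text))
print(f"~{n_tokens / elapsed:.1f} tok/s end-to-end (prefill + decode together)")
```

Run it with a short prompt and a long one, and across different runtimes, and you'll see quickly how much of "the model is slow" is really "the stack is slow".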
Another practical point: the 4-bit MLX variant fits into about 14 GB of weights, but you still need headroom in unified memory for the KV cache, the runtime, and macOS itself. With 24 GB you can already work without issues, and on higher-end M-series chips this becomes a truly comfortable local inference setup: no cloud, a good context window, and no endless waiting for a response.
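The budget behind those numbers is simple arithmetic; the buffer sizes in this sketch are my own rough assumptions, not figures from the model card.

```python
# Rough unified-memory budget for a 4-bit 26B model (assumed buffer sizes).
def weight_footprint_gb(total_params_b: float, bits: int) -> float:
    return total_params_b * bits / 8  # billions of params * bytes per param ~= GB

weights = weight_footprint_gb(26, 4)  # ~13 GB of quantized weights
inference_buffer = 3.0                # assumption: KV cache, activations, runtime overhead
system_and_apps = 6.0                 # assumption: macOS plus whatever else is open

print(f"weights           ~{weights:.0f} GB")
print(f"inference buffer  ~{inference_buffer:.0f} GB (assumed)")
print(f"comfortable total ~{weights + inference_buffer + system_and_apps:.0f} GB unified memory")
```

That lands right around the 24 GB mark, which is why that memory tier is already workable.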
What This Changes for Business and Automation
For me, the conclusion is simple: local agents on Mac are no longer just a gimmick. If a model can genuinely maintain this decode speed, I can start building private pipelines for documents, support, internal search, and analytics without the mandatory step of sending data externally.
The winners are teams that prioritize speed, privacy, and predictable costs. The losers are primarily cloud-based scenarios where small queries are run through an expensive external API out of sheer inertia.
But there's a catch I regularly see in client projects: a fast benchmark alone doesn't guarantee a good system. You need a proper AI architecture, task routing, context management, caching, and an understanding of where a local model excels versus where it's better to use an external service. At Nahornyi AI Lab, we build these kinds of systems for real-world processes, not just for impressive demos.
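To show what I mean by routing rather than describe it abstractly, here is a deliberately stripped-down sketch. The class, fields, and rules are hypothetical illustrations, not how any particular client system is built; the only point is that "local by default, escalate deliberately" should be an explicit rule in code, not an accident of whichever SDK was installed first.

```python
# Hypothetical routing sketch: decide where a task runs before it runs anywhere.
from dataclasses import dataclass

@dataclass
class Task:
    prompt: str
    contains_sensitive_data: bool   # e.g. client documents, personal data
    needs_long_context: bool        # beyond what the local setup handles comfortably
    needs_frontier_quality: bool    # hard reasoning worth paying an external API for

def route(task: Task) -> str:
    """Return which backend should handle the task."""
    if task.contains_sensitive_data:
        return "local"          # privacy requirement overrides everything else
    if task.needs_frontier_quality or task.needs_long_context:
        return "external_api"   # pay only where the cloud actually earns it
    return "local"              # default: fast, private, zero marginal cost

print(route(Task("summarize this contract", True, False, False)))    # -> local
print(route(Task("plan a multi-step analysis", False, False, True)))  # -> external_api
```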
If you're considering an AI implementation without cloud dependency, I'd suggest taking a hard look at your stack: what can be moved locally, where can you cut latency, and how can you assemble it into a functional automation system. At Nahornyi AI Lab, this is usually where I start, because Vadym Nahornyi doesn't like to sell magic when a business just needs a reliable result.