Flash-MoE: Running a 397 Billion Parameter Model on a Laptop

March 23, 2026

A developer just open-sourced Flash-MoE, a pure C and Metal inference engine that runs Qwen3.5-397B — a 397 billion parameter Mixture-of-Experts model — on a MacBook Pro with 48GB of RAM at 4.4 tokens per second. No Python, no frameworks. Just C, Objective-C, and hand-tuned Metal shaders streaming 209GB of model weights directly from SSD through a custom GPU compute pipeline.

The project takes inspiration from Apple's own "LLM in a Flash" research, trusting the OS page cache to handle expert caching rather than building a custom caching layer. Because only 4 of each layer's 512 experts are active per token, their weights can be loaded on demand while the rest stay on disk, keeping memory usage manageable. The result: production-quality output with full tool calling on consumer hardware.

Why This Matters

This is a concrete proof point that frontier-scale AI models can escape the data center. A 397B model is larger than most commercially deployed LLMs, and running it at usable speed on a laptop — with no cloud dependency — changes the calculus on privacy, cost, and accessibility. For developers, researchers, and small teams, the implication is clear: you may not need an API subscription or GPU cluster to work with very large models.

Apple Silicon's unified memory architecture and fast NVMe storage are doing more for local AI than most people realize. Flash-MoE is a sharp example of what happens when someone writes an inference engine from scratch to exploit those specific hardware characteristics, rather than layering abstractions on top of abstractions.
