Flash-MoE: Running a 397 Billion Parameter Model on a Laptop

March 23, 2026

A developer just open-sourced Flash-MoE, a pure C and Metal inference engine that runs Qwen3.5-397B — a 397 billion parameter Mixture-of-Experts model — on a MacBook Pro with 48GB of RAM at 4.4 tokens per second. No Python, no frameworks. Just C, Objective-C, and hand-tuned Metal shaders streaming 209GB of model weights directly from SSD through a custom GPU compute pipeline.

The project takes inspiration from Apple's own "LLM in a Flash" research, trusting the OS page cache to handle expert caching rather than building a custom caching layer. Because only 4 of each layer's 512 experts are active per token, their weights can be loaded on demand while the rest stay on disk, keeping memory usage manageable. The result: production-quality output with full tool calling on consumer hardware.

Why This Matters

This is a concrete proof point that frontier-scale AI models can escape the data center. A 397B model is larger than most commercially deployed LLMs, and running it at usable speed on a laptop — with no cloud dependency — changes the calculus on privacy, cost, and accessibility. For developers, researchers, and small teams, the implication is clear: you may not need an API subscription or GPU cluster to work with very large models.

Apple Silicon's unified memory architecture and fast NVMe storage are doing more for local AI than most people realize. Flash-MoE is a sharp example of what happens when someone writes an inference engine from scratch to exploit those specific hardware characteristics, rather than layering abstractions on top of abstractions.
