AI models on M5 Mac Mini
I just did the math, and the upcoming M5 Mac Minis should be able to run models of the following sizes:
**M5 Base with 16 GB** ($600, 256 GB SSD ¹)
Runs 8 billion parameter models at around 25 tokens per second, or 14 billion parameter models at around 15 tokens per second with a tight ~5 GB left for context.
**M5 Base with 24 GB** (projected to cost $800 with 256 GB SSD ¹)
Runs 14 billion parameter models comfortably, at around 20 tokens per second.
This configuration leaves plenty of room for the context window, around 12 GB.
**M5 Base with 32 GB** ($1,000 with 256 GB SSD ¹)
Runs 32 billion parameter models with 8 to 9 GB left for context.
Probably the best deal for price-conscious users who can make effective use of 32 billion parameter models.
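The figures above follow from a simple rule of thumb: a 4-bit quantized model needs roughly half a byte per parameter, and whatever unified memory is left after weights and OS overhead goes to the context. A minimal sketch of that math; the ~4 GB OS overhead is my own assumption and varies per setup:

```python
def weights_gb(params_billion: float, bits: int = 4) -> float:
    """Approximate size of quantized weights in GB (params * bits/8 bytes)."""
    return params_billion * bits / 8

def context_headroom_gb(total_ram_gb: float, params_billion: float,
                        os_overhead_gb: float = 4.0) -> float:
    """Unified memory left for the KV cache / context after weights and OS."""
    return total_ram_gb - weights_gb(params_billion) - os_overhead_gb

print(weights_gb(14))               # 7.0 GB of weights for a 4-bit 14B model
print(context_headroom_gb(16, 14))  # 5.0 GB left -> the "tight" 16 GB case
```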
And now the Pro models:
**M5 Pro with 24 GB** (projected to cost $1,400)
This one has faster cores and significantly higher memory bandwidth, which means a faster time to first token and more tokens per second.
It also comes with a 512 GB SSD by default.
Still, this configuration is killed by its RAM: 32 billion parameter models barely fit, and would be severely limited to short context windows.
And in my mind, it's pointless to use such high-performance hardware to run 14 billion parameter models.
**M5 Pro with 48 GB** ($2,000)
This automatically brings the strongest chip available in a Mac Mini, cutting time to first token by 25% compared to the base M5 Pro.
It still makes little sense, though, since you would be running 70 billion parameter models with just 6 GB of RAM left for context.
Still possible, but the upgrade to 64 GB costs just $200 more and is absolutely worth it.
**M5 Pro with 64 GB** (projected to cost $2,200)
This one is the holy grail: expensive, but it runs enterprise-grade 70 billion parameter models at a reading speed of around 7 TPS, while leaving a substantial 20 GB for the context window.
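The token rates above can be sanity-checked with a roofline estimate: dense decoding streams the full weights once per generated token, so throughput is roughly memory bandwidth divided by weight size. A sketch; the 250 GB/s figure is a hypothetical M5 Pro bandwidth, not a confirmed spec:

```python
def decode_tps(bandwidth_gb_per_s: float, weights_gb: float) -> float:
    """Roofline estimate: each decoded token reads all weights once."""
    return bandwidth_gb_per_s / weights_gb

# 70B parameters at 4-bit quantization ~ 35 GB of weights
print(round(decode_tps(250.0, 35.0), 1))  # ~7.1 tokens/s, near the ~7 TPS above
```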
---------------------------------
How to:
1) Configure the related services (like MLX, or Ollama with MLX) as a *system service*.
2) Boot only to the login screen and manage the device remotely; this saves a good 2 GB of unified memory.
3) Use **only** 4-bit quantized MLX models.
4) Lower the memory limit for the CPU to ~20% (down from the ~33% default).
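On Apple Silicon, step 4 maps to the `iogpu.wired_limit_mb` sysctl, which caps how much unified memory the GPU may wire; raising it toward ~80% of RAM effectively leaves ~20% for the CPU side. The value below is sized for a 16 GB machine and is my own example; the setting resets on reboot unless you persist it:

```shell
# Let the GPU wire up to ~80% of a 16 GB machine (12.8 GB = 13107 MB),
# leaving roughly 20% of unified memory for the CPU side.
sudo sysctl iogpu.wired_limit_mb=13107

# Check the active limit
sysctl iogpu.wired_limit_mb
```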
¹ You can use external SSDs to store the models and other data, to stay within the limits of the 256 GB internal SSD. The upgrade to 512 GB costs $200 more.
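To keep model weights off the small internal SSD (footnote ¹) while following step 3, you can point the Hugging Face cache at an external drive before launching mlx-lm; the volume path and model name below are examples, not recommendations:

```shell
# Keep model downloads on an external SSD (example path)
export HF_HOME=/Volumes/External/huggingface

# Generate with a 4-bit MLX model from the mlx-community hub
# (requires: pip install mlx-lm)
mlx_lm.generate --model mlx-community/Meta-Llama-3-8B-Instruct-4bit \
  --prompt "Hello" --max-tokens 64
```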