Run DeepSeek R1 or V3 with MLX Distributed

Setup

On every machine in the cluster install OpenMPI and mlx-lm:

conda install openmpi
pip install -U mlx-lm
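
A quick way to confirm both installs on a machine is to check that mpirun is on the PATH and that the package imports:

mpirun --version
python -c "import mlx_lm"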

Next, download the pipeline-parallel generation script to the same path on every machine:

curl -O https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py
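
If password-less SSH between the machines is already set up, a small loop like the following (hostnames and path are placeholders matching the hosts file below) can copy the script to the same location everywhere:

for host in hostname1 hostname2; do
  scp pipeline_generate.py $host:/path/to/pipeline_generate.py
done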

Make a hosts.txt file on the machine you plan to launch the generation from. For two machines it should look like this:

hostname1 slots=1
hostname2 slots=1
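
For three machines (for example the three M2 Ultras mentioned below for the 4-bit model), add one line per host:

hostname1 slots=1
hostname2 slots=1
hostname3 slots=1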

Also make sure you can ssh from every machine to every other machine without a password. Check out the MLX documentation for more information on setting up and testing MPI.
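
A minimal connectivity check, using the placeholder hostnames above, is to run a trivial command first over plain SSH and then through mpirun itself:

ssh hostname2 hostname
mpirun -np 2 --hostfile /path/to/hosts.txt hostname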

Raise the wired memory limit on each machine so more of the unified memory can be used for the model. For example, on a 192 GB M2 Ultra set:

sudo sysctl iogpu.wired_limit_mb=180000
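
To apply the limit on every machine from one place, a loop like the following works, assuming password-less SSH and sudo on each host; pick a value appropriate to each machine's RAM:

for host in hostname1 hostname2; do
  ssh $host sudo sysctl iogpu.wired_limit_mb=180000
done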

Run

Run the generation with a command like the following:

mpirun -np 2 --hostfile /path/to/hosts.txt python /path/to/pipeline_generate.py \
  --model mlx-community/DeepSeek-R1-3bit \
  --prompt "What's better a straight or a flush in texas hold'em?" \
  --max-tokens 1024

For DeepSeek R1 quantized to 3 bits you need about 350 GB of RAM in aggregate across the cluster, e.g. two 192 GB M2 Ultras. To run the model quantized to 4 bits you need about 450 GB of RAM in aggregate, or three 192 GB M2 Ultras.
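
As a rough sanity check on those numbers: DeepSeek R1 has about 671B total parameters, so at 3 bits per weight the weights alone take roughly 671e9 × 3 / 8 ≈ 252 GB, and at 4 bits roughly 335 GB; the 350 GB and 450 GB figures leave headroom for the KV cache, activations, and framework overhead.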
