On every machine in the cluster install OpenMPI (here via conda) and mlx-lm:
conda install openmpi
pip install -U mlx-lm
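To sanity-check both installs on each machine, you can verify that mpirun is on the path and that mlx_lm imports cleanly:
mpirun --version
python -c "import mlx_lm"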
Next, download the pipeline-parallel generation script. Download it to the same path on every machine:
curl -O https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py
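Since mpirun invokes the script by path on every host, it is worth confirming the file landed in the same place everywhere. A quick check, assuming the hypothetical hostnames hostname1 and hostname2 and that /path/to/ is where you downloaded it:
for h in hostname1 hostname2; do ssh "$h" ls /path/to/pipeline_generate.py; done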
Make a hosts.txt
file on the machine from which you plan to launch the generation. For two machines it should look like this:
hostname1 slots=1
hostname2 slots=1
Also make sure you can ssh hostname
from every machine to every other machine. Check out the MLX documentation for more information on setting up and testing MPI.
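A quick smoke test for the MPI setup is to run a trivial command across the cluster; each machine should print its own hostname:
mpirun -np 2 --hostfile /path/to/hosts.txt hostname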
Raise the wired memory limit so the GPU can use more of each machine's RAM. For example, on a 192 GB M2 Ultra set this:
sudo sysctl iogpu.wired_limit_mb=180000
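You can read the current value first with:
sysctl iogpu.wired_limit_mb
Note that the setting does not persist across reboots, so it needs to be reapplied after restarting the machines.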
Run the generation with a command like the following:
mpirun -np 2 --hostfile /path/to/hosts.txt python /path/to/pipeline_generate.py \
--model mlx-community/DeepSeek-R1-3bit \
--prompt "What's better a straight or a flush in texas hold'em?" \
--max-tokens 1024
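If the machines have multiple network interfaces, MPI may pick a slow or unreachable one. With OpenMPI you can pin the TCP transport to a specific interface (en0 here is an assumption; substitute the interface your machines actually use):
mpirun -np 2 --mca btl_tcp_if_include en0 --hostfile /path/to/hosts.txt python /path/to/pipeline_generate.py ...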
To run DeepSeek R1 quantized to 3-bit you need about 350 GB of RAM in aggregate across the cluster, e.g. two 192 GB M2 Ultras. To run the model quantized to 4-bit you need about 450 GB of aggregate RAM, e.g. three 192 GB M2 Ultras.
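As a rough sanity check on those figures, DeepSeek R1 has on the order of 671B parameters, so the 3-bit weights alone take roughly:
python -c "print(671e9 * 3 / 8 / 1e9, 'GB')"   # ≈ 252 GB
The remaining headroom covers the KV cache, activations, and per-machine overhead.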