On every machine in the cluster install OpenMPI (here via conda) and mlx-lm:
conda install openmpi
pip install -U mlx-lm
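To sanity-check both installs on each machine, you can verify that mpirun is on the path and that mlx_lm imports cleanly:
mpirun --version
python -c "import mlx_lm"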
Next, download the pipeline-parallel generation script. Download it to the same path on every machine:
curl -O https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py
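Since mpirun invokes the script by path on every host, it is worth confirming the file landed in the same place everywhere. A quick check, assuming the hypothetical hostnames hostname1 and hostname2 and that /path/to/ is where you downloaded it:
for h in hostname1 hostname2; do ssh "$h" ls /path/to/pipeline_generate.py; done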
Make a hosts.txt
file on the machine from which you plan to launch the generation. For two machines it should look like this:
hostname1 slots=1
hostname2 slots=1
Also make sure you can ssh hostname
from every machine to every other machine. Check out the MLX documentation for more information on setting up and testing MPI.
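A quick smoke test for the MPI setup is to run a trivial command across the cluster; each machine should print its own hostname:
mpirun -np 2 --hostfile /path/to/hosts.txt hostname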
Raise the wired memory limit so the GPU can use more of each machine's RAM. For example, on a 192 GB M2 Ultra set this:
sudo sysctl iogpu.wired_limit_mb=180000
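You can read the current value first with:
sysctl iogpu.wired_limit_mb
Note that the setting does not persist across reboots, so it needs to be reapplied after restarting the machines.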
Run the generation with a command like the following:
mpirun -np 2 --hostfile /path/to/hosts.txt python /path/to/pipeline_generate.py \
--model mlx-community/DeepSeek-R1-3bit \
--prompt "What's better a straight or a flush in texas hold'em?" \
--max-tokens 1024
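If the machines have multiple network interfaces, MPI may pick a slow or unreachable one. With OpenMPI you can pin the TCP transport to a specific interface (en0 here is an assumption; substitute the interface your machines actually use):
mpirun -np 2 --mca btl_tcp_if_include en0 --hostfile /path/to/hosts.txt python /path/to/pipeline_generate.py ...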
To run DeepSeek R1 quantized to 3-bit you need about 350 GB of RAM in aggregate across the cluster, e.g. two 192 GB M2 Ultras. To run the model quantized to 4-bit you need about 450 GB of aggregate RAM, e.g. three 192 GB M2 Ultras.
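As a rough sanity check on those figures, DeepSeek R1 has on the order of 671B parameters, so the 3-bit weights alone take roughly:
python -c "print(671e9 * 3 / 8 / 1e9, 'GB')"   # ≈ 252 GB
The remaining headroom covers the KV cache, activations, and per-machine overhead.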