On every machine in the cluster, install Open MPI (using conda) and mlx-lm:
conda install openmpi
pip install -U mlx-lm
Next, download the pipeline-parallel run script. Save it to the same path on every machine:
curl -O https://raw.githubusercontent.com/ml-explore/mlx-examples/refs/heads/main/llms/mlx_lm/examples/pipeline_generate.py
Make a hosts.txt file on the machine from which you plan to launch the generation. For two machines it should look like this:
hostname1 slots=1
hostname2 slots=1
Also make sure that ssh hostname works from every machine to every other machine. Check out the MLX documentation for more information on setting up and testing MPI.
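The connectivity requirement above can be checked with a small loop over the hostfile. This is only a sketch: `check_hosts` is a hypothetical helper name (not part of MLX or Open MPI), and it assumes key-based ssh login is already configured, since `BatchMode=yes` makes ssh fail rather than prompt for a password.

```shell
#!/usr/bin/env bash
# Sketch: confirm this machine can ssh to every host listed in a hostfile.
# check_hosts is a hypothetical helper, not part of MLX or Open MPI.
check_hosts() {
  local hostfile="$1" host rest failed=0
  # Each hostfile line looks like "hostname1 slots=1"; only the first field is used.
  while read -r host rest; do
    [ -z "$host" ] && continue   # skip blank lines
    # -n keeps ssh from consuming the hostfile on stdin.
    # BatchMode=yes fails instead of prompting; ConnectTimeout bounds the wait.
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
      echo "ok   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done < "$hostfile"
  return "$failed"
}
```

Running `check_hosts /path/to/hosts.txt` on each machine in turn covers the every-machine-to-every-other-machine case; a non-zero exit status means at least one host was unreachable.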
Raise the wired limit on each machine so it can use more memory. For example, on a 192 GB M2 Ultra, set:
sudo sysctl iogpu.wired_limit_mb=180000
Run the generation with a command like the following:
mpirun -np 2 --hostfile /path/to/hosts.txt python /path/to/pipeline_generate.py \
  --model mlx-community/DeepSeek-R1-3bit \
  --prompt "What's better a straight or a flush in texas hold'em?" \
  --max-tokens 1024
For DeepSeek R1 quantized to 3-bit you need 350 GB of RAM in aggregate across the cluster, e.g. two 192 GB M2 Ultras. To run the model quantized to 4-bit you need 450 GB of aggregate RAM, or three 192 GB M2 Ultras.
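The machine counts above follow from a ceiling division of the aggregate requirement by the memory per machine. A minimal sketch of that arithmetic, where `machines_needed` is a hypothetical helper and sizes are whole gigabytes:

```shell
#!/usr/bin/env bash
# Sketch: how many machines are needed to reach an aggregate memory target.
# machines_needed is a hypothetical helper; both arguments are whole gigabytes.
machines_needed() {
  local required_gb="$1" per_machine_gb="$2"
  # Integer ceiling division: round up so the total is never short.
  echo $(( (required_gb + per_machine_gb - 1) / per_machine_gb ))
}

machines_needed 350 192   # 3-bit: prints 2
machines_needed 450 192   # 4-bit: prints 3
```

This counts raw RAM only, so treat the result as a lower bound. Passing 180 instead of 192 (the 180000 MB wired limit from the example) gives the same counts for both quantizations.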