
Iron-llama

This repository hosts the latest iteration of a customized LLaMA inference setup optimized for macOS 26.2 on a machine equipped with an Intel Xeon W-3235 CPU and two AMD Radeon PRO W6800X Duo GPUs. The implementation is MetalV3 compatible and deliberately avoids Apple Silicon (M1/M2/M3) optimizations, ensuring compatibility with Intel-based Mac hardware.

🎯 Objective

The goal is to run large language models (LLMs) efficiently using GGUF quantization on Metal-compatible GPUs, focusing on:

- Single GPU optimization
- Quantized models: Q4_K_M, Q4_K, Q6_K
- Support for F32 tensors: 241 tensors
- Support for Q4_K tensors: 289 tensors
- Support for Q6_K tensors: 49 tensors

⚙️ Configuration

- Target Platform: macOS 26.2 (Intel-based)
- CPU: Intel Xeon W-3235
- GPUs: two AMD Radeon PRO W6800X Duo (MetalV3 compatible)
- Model Format: GGUF quantized models (Q4_K_M, Q4_K, Q6_K)
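Before building, you can confirm which GPUs macOS actually exposes and their reported Metal support using stock macOS tooling (nothing specific to this repo):

```sh
# List display adapters and their reported Metal support level.
system_profiler SPDisplaysDataType | grep -E "Chipset Model|Metal"
```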

📦 Installation

```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git fetch --tags
git checkout b6123
```

Then replace the two files in llama.cpp/ggml/src/ggml-metal/ with the modified versions from this repo, and voilà!
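As a sketch of that step, assuming the two patched files are ggml-metal.m and ggml-metal.metal (the Metal backend sources at tag b6123; substitute whatever file names the repo actually ships):

```sh
# Assumed file names -- check this repo for the actual pair of patched files.
cp /path/to/iron-llama/ggml-metal.m     ggml/src/ggml-metal/
cp /path/to/iron-llama/ggml-metal.metal ggml/src/ggml-metal/
```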

```sh
brew install cmake git libomp glslang molten-vk shaderc vulkan-loader vulkan-headers
```

(cmake, git, and libomp are needed for the Metal build; the glslang/MoltenVK/shaderc/Vulkan packages are only required for the Vulkan comparison build further down.)

```sh
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DGGML_METAL_MGPU=ON \
      -DOpenMP_ROOT=$(brew --prefix)/opt/libomp
cmake --build build -j
```
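A quick smoke test after the build; the binaries land under build/bin/, and in stock llama.cpp the --version flag should print the build info (here it should report build 6123):

```sh
# Confirm the main binaries were produced by the build.
ls build/bin/llama-bench build/bin/llama-server

# Should report the checked-out build (6123).
./build/bin/llama-cli --version
```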

⚙️ Execution Command:

```sh
export GGML_METAL_N_CB=4
export GGML_METAL_DEVICE_INDEX=0   # choose the GPU index
```

```sh
export GGML_METAL_N_CB=4; ./build/bin/llama-bench \
    -m /Volumes/NM790-4To/Qwen3-2507/Qwen3-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-1M-Q4_K_M.gguf \
    -fa 0 -ub 32 -b 32
```

(pp512 = processing of a 512-token prompt; tg128 = generation of 128 tokens; t/s is tokens per second.)

| model | size | params | backend | threads | n_batch | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 32 | 32 | pp512 | 271.87 ± 0.17 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 32 | 32 | tg128 | 72.57 ± 0.29 |

```sh
./build/bin/llama-server --port 8010 -ngl 99 --temp 0.7 -c 16192 \
    -m ~/models/Qwen3-Coder-30B-A3B-Instruct-1M-Q4_K_M.gguf \
    --jinja --prio 2 -ub 32 -b 32
```
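Once the server is running, it exposes llama-server's standard OpenAI-compatible HTTP API, so a quick functional check looks like this (the prompt text is just an example):

```sh
# Send a chat completion request to the server started above on port 8010.
curl http://localhost:8010/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Write a hello world in C."}],
        "temperature": 0.7
      }'
```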

🧠 Optimizations

Currently, the code is optimized for a single GPU using Q4_K_M quantized tensors and MoE models, which balances performance and accuracy for inference. Future steps will extend support to both GPUs and add further optimizations for other tensor types.
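Since only one GPU is used at a time, it can be worth benchmarking each device to find the fastest one. A minimal sketch, assuming GGML_METAL_DEVICE_INDEX (this repo's selector from the execution section above) enumerates the same five devices the Vulkan log below lists:

```sh
# Hypothetical sweep over the five device indices; compare the t/s columns.
MODEL=~/models/Qwen3-Coder-30B-A3B-Instruct-1M-Q4_K_M.gguf
for idx in 0 1 2 3 4; do
  echo "=== GGML_METAL_DEVICE_INDEX=$idx ==="
  GGML_METAL_N_CB=4 GGML_METAL_DEVICE_INDEX=$idx \
    ./build/bin/llama-bench -m "$MODEL" -fa 0 -ub 32 -b 32
done
```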

📦 Dependencies

To get started, ensure you have the following installed:

- llama.cpp (with Metal support)
- MetalV3 drivers for AMD GPUs
- Compatible GGUF model files for F16, F32, Q4_K_M, Q4_K, and Q6_K quantization

🛠️ Strategy

1. Validate the current setup using ./build/bin/llama-server with the specified model.
2. Iterate on single GPU optimization with Q4_K_M tensors.
3. Expand to dual GPU support as needed.
4. Tune for performance and memory usage on Intel-based Macs.

📝 Notes

No file or function names are invented; all code and structure are consistent with existing llama.cpp implementations. The model path and execution parameters are based on the provided example and should be adapted to your system.

Performance: llama.cpp MetalV3 on an Intel Mac (iRon-Llama)

Passing comma-separated values to -b and -ub makes llama-bench test every batch/micro-batch combination:

```sh
export GGML_METAL_N_CB=4; ./build/bin/llama-bench \
    -m ~/models/Qwen3-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-1M-Q4_K_M.gguf \
    -fa 0 -ub 16,32,64 -b 16,32,64
```

| model | size | params | backend | threads | n_batch | n_ubatch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 16 | 16 | pp512 | 250.96 ± 1.91 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 16 | 16 | tg128 | 80.56 ± 0.27 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 16 | 32 | pp512 | 245.31 ± 0.21 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 16 | 32 | tg128 | 72.25 ± 0.50 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 16 | 64 | pp512 | 235.57 ± 0.15 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 16 | 64 | tg128 | 62.70 ± 0.61 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 32 | 16 | pp512 | 251.67 ± 0.29 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 32 | 16 | tg128 | 80.33 ± 0.17 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 32 | 32 | pp512 | 269.62 ± 0.11 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 32 | 32 | tg128 | 71.63 ± 0.27 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 32 | 64 | pp512 | 262.39 ± 0.20 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 32 | 64 | tg128 | 60.68 ± 0.49 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 64 | 16 | pp512 | 254.45 ± 0.22 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 64 | 16 | tg128 | 78.94 ± 0.47 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 64 | 32 | pp512 | 271.10 ± 0.21 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 64 | 32 | tg128 | 71.00 ± 0.31 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 64 | 64 | pp512 | 275.74 ± 1.71 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Metal,BLAS | 12 | 64 | 64 | tg128 | 61.09 ± 0.17 |

build: 79c1160b0 (6123)

llama.cpp on Vulkan

For comparison, the same sweep on the stock Vulkan (MoltenVK) backend. -mg 3 pins device 3 as the main GPU and -sm none disables layer splitting, so the whole model runs on a single W6800X Duo GPU.

```sh
./build/bin/llama-bench \
    -m /Volumes/NM790-4To/Qwen3-2507/Qwen3-30B-A3B/Qwen3-Coder-30B-A3B-Instruct-1M-Q4_K_M.gguf \
    -fa 0 -ub 16,32,64 -b 16,32,64 -mg 3 -sm none
```

```
ggml_vulkan: Found 5 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6800 XT (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon PRO W6800X Duo (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 2 = AMD Radeon PRO W6800X Duo (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 3 = AMD Radeon PRO W6800X Duo (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
ggml_vulkan: 4 = AMD Radeon PRO W6800X Duo (MoltenVK) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
```

| model | size | params | backend | threads | n_batch | n_ubatch | main_gpu | sm | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 16 | 16 | 3 | none | pp512 | 82.14 ± 1.22 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 16 | 16 | 3 | none | tg128 | 87.85 ± 0.22 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 16 | 32 | 3 | none | pp512 | 83.21 ± 0.70 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 16 | 32 | 3 | none | tg128 | 85.24 ± 2.63 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 16 | 64 | 3 | none | pp512 | 83.45 ± 0.80 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 16 | 64 | 3 | none | tg128 | 87.50 ± 0.35 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 32 | 16 | 3 | none | pp512 | 82.34 ± 0.94 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 32 | 16 | 3 | none | tg128 | 87.09 ± 0.36 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 32 | 32 | 3 | none | pp512 | 164.17 ± 2.78 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 32 | 32 | 3 | none | tg128 | 86.02 ± 2.65 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 32 | 64 | 3 | none | pp512 | 164.53 ± 0.76 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 32 | 64 | 3 | none | tg128 | 87.55 ± 0.37 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 64 | 16 | 3 | none | pp512 | 84.02 ± 0.22 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 64 | 16 | 3 | none | tg128 | 88.08 ± 0.05 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 64 | 32 | 3 | none | pp512 | 163.21 ± 0.74 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 64 | 32 | 3 | none | tg128 | 87.25 ± 0.44 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 64 | 64 | 3 | none | pp512 | 25.65 ± 0.35 |
| qwen3moe 30B.A3B Q4_K - Medium | 17.28 GiB | 30.53 B | Vulkan,BLAS | 12 | 64 | 64 | 3 | none | tg128 | 85.43 ± 0.54 |

build: bcb43163a (7833)
