Iron-llama

This repository hosts the latest iteration of a customized LLaMA inference setup for macOS 26.2 on a machine equipped with an Intel Xeon W-3235 CPU and two Radeon Pro W6800X Duo GPUs. The implementation targets Metal 3 compatibility and deliberately avoids Apple Silicon (M1, M2, M3) optimizations, ensuring it runs on Intel-based Mac hardware.

🎯 Objective

The goal is to run large language models (LLMs) efficiently using GGUF quantization on Metal-compatible GPUs, focusing on:

• Single-GPU optimization
• Quantized models: Q4_K_M, Q4_K, Q6_K

llama.cpp Metal mgpu overlay (env shim + optional hooks)

This overlay adds a tiny Objective‑C helper that:

  1. Lets you specify Metal device(s) via GGML_METAL_DEVICES="3,4,5"

    • If GGML_METAL_DEVICE_INDEX is not set, it is derived from the first index in GGML_METAL_DEVICES (see the sketch after this list)
    • Example log:
      [metal-env-shim] derived GGML_METAL_DEVICE_INDEX='3' from GGML_METAL_DEVICES='3,4'
  2. Provides weak optional hooks.
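A minimal sketch of the derivation in item 1, assuming the shim is compiled into the binary and runs as a library constructor before the Metal backend reads its environment; the function name and parsing details are illustrative, not the gist's actual code:

// metal-env-shim.m — illustrative sketch of the env derivation
#import <Foundation/Foundation.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Runs before main(), so the derived variable is already set
// by the time llama.cpp's Metal backend queries the environment.
__attribute__((constructor))
static void metal_env_shim_init(void) {
    const char * devices = getenv("GGML_METAL_DEVICES");
    if (devices == NULL || devices[0] == '\0') {
        return;                                  // nothing to derive
    }
    if (getenv("GGML_METAL_DEVICE_INDEX") != NULL) {
        return;                                  // an explicitly set index wins
    }
    // Copy the first comma-separated entry, e.g. "3" out of "3,4,5".
    char first[16] = {0};
    size_t n = strcspn(devices, ",");
    if (n == 0 || n >= sizeof(first)) {
        return;                                  // malformed list: leave the env untouched
    }
    memcpy(first, devices, n);
    setenv("GGML_METAL_DEVICE_INDEX", first, 0); // 0 = do not overwrite
    fprintf(stderr,
        "[metal-env-shim] derived GGML_METAL_DEVICE_INDEX='%s' from GGML_METAL_DEVICES='%s'\n",
        first, devices);
}

With the shim loaded, a run looks like GGML_METAL_DEVICES="3,4" ./llama-cli -m model-Q4_K_M.gguf -ngl 99 (llama-cli and -ngl are stock llama.cpp; the shim only derives the index and emits the log line shown above).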

Prompt Processing vs Token Generation
A classic LLM-inference trace on the GPU: prompt processing (prefill) evaluates the whole prompt in large batches and is typically compute-bound, while token generation (decode) produces one token at a time and is typically limited by memory bandwidth, which is why the two phases look so different in a GPU trace.
ggml-metal-optimized-4.m
Evolution of a new Metal 3 backend for llama.cpp
#import "ggml-metal.h"
#import "ggml-impl.h"
#import "ggml-backend-impl.h"
#import "ggml-metal-impl.h"
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
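The imports above are just the top of the file. As an illustration of what a multi-GPU-aware Metal 3 backend has to do, the sketch below (not from the gist) shows one way to honor GGML_METAL_DEVICE_INDEX when choosing among several GPUs, using the standard Metal APIs MTLCopyAllDevices and MTLCreateSystemDefaultDevice; the function name metal_pick_device is hypothetical:

#include <stdlib.h>

// Illustrative: pick the Metal device selected by GGML_METAL_DEVICE_INDEX,
// falling back to the system default GPU when the variable is absent or out of range.
static id<MTLDevice> metal_pick_device(void) {
    NSArray<id<MTLDevice>> * all = MTLCopyAllDevices();   // every GPU visible to Metal (macOS)
    const char * idx_env = getenv("GGML_METAL_DEVICE_INDEX");
    if (idx_env != NULL) {
        NSUInteger idx = (NSUInteger) strtoul(idx_env, NULL, 10);
        if (idx < all.count) {
            return all[idx];                              // e.g. one half of a W6800X Duo
        }
    }
    return MTLCreateSystemDefaultDevice();                // default GPU
}

Note that MTLCopyAllDevices does not document a stable ordering across boots, so a production backend would more likely match devices by registryID or name than by a bare index.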