Iron-llama

This repository hosts the latest iteration of a customized LLaMA inference setup for macOS 26.2 on a machine equipped with an Intel Xeon W-3235 CPU and two Radeon Pro W6800X Duo GPUs. The implementation targets Metal 3 compatibility and deliberately avoids Apple Silicon (M1, M2, M3) optimizations, ensuring it runs on Intel-based Mac hardware.

🎯 Objective

The goal is to run large language models (LLMs) efficiently using GGUF quantization on Metal-compatible GPUs, focusing on:

• Single-GPU optimization
• Quantized models: Q4_K_M, Q4_K, Q6_K

llama.cpp Metal mgpu overlay (env shim + optional hooks)

This overlay adds a tiny Objective‑C helper that:

  1. Lets you specify Metal device(s) via GGML_METAL_DEVICES="3,4,5"

    • If GGML_METAL_DEVICE_INDEX is not set, it is derived from the first index in GGML_METAL_DEVICES (see the sketch after this list)
    • Example log:
      [metal-env-shim] derived GGML_METAL_DEVICE_INDEX='3' from GGML_METAL_DEVICES='3,4'
  2. Provides weak optional hooks.
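A minimal sketch of the derivation in item 1, assuming the shim is compiled into the binary and runs as a library constructor before the Metal backend reads its environment; the function name and parsing details are illustrative, not the gist's actual code:

// metal-env-shim.m — illustrative sketch of the env derivation
#import <Foundation/Foundation.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Runs before main(), so the derived variable is already set
// by the time llama.cpp's Metal backend queries the environment.
__attribute__((constructor))
static void metal_env_shim_init(void) {
    const char * devices = getenv("GGML_METAL_DEVICES");
    if (devices == NULL || devices[0] == '\0') {
        return;                                  // nothing to derive
    }
    if (getenv("GGML_METAL_DEVICE_INDEX") != NULL) {
        return;                                  // an explicitly set index wins
    }
    // Copy the first comma-separated entry, e.g. "3" out of "3,4,5".
    char first[16] = {0};
    size_t n = strcspn(devices, ",");
    if (n == 0 || n >= sizeof(first)) {
        return;                                  // malformed list: leave the env untouched
    }
    memcpy(first, devices, n);
    setenv("GGML_METAL_DEVICE_INDEX", first, 0); // 0 = do not overwrite
    fprintf(stderr,
        "[metal-env-shim] derived GGML_METAL_DEVICE_INDEX='%s' from GGML_METAL_DEVICES='%s'\n",
        first, devices);
}

With the shim loaded, a run looks like GGML_METAL_DEVICES="3,4" ./llama-cli -m model-Q4_K_M.gguf -ngl 99 (llama-cli and -ngl are stock llama.cpp; the shim only derives the index and emits the log line shown above).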

Prompt Processing vs Token Generation
A classic LLM-inference trace on the GPU: prompt processing (prefill) evaluates the whole prompt in large batches and is typically compute-bound, while token generation (decode) produces one token at a time and is typically limited by memory bandwidth, which is why the two phases look so different in a GPU trace.
ggml-metal-optimized-4.m
Evolution of a new Metal 3 backend for llama.cpp
#import "ggml-metal.h"
#import "ggml-impl.h"
#import "ggml-backend-impl.h"
#import "ggml-metal-impl.h"
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
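The imports above are just the top of the file. As an illustration of what a multi-GPU-aware Metal 3 backend has to do, the sketch below (not from the gist) shows one way to honor GGML_METAL_DEVICE_INDEX when choosing among several GPUs, using the standard Metal APIs MTLCopyAllDevices and MTLCreateSystemDefaultDevice; the function name metal_pick_device is hypothetical:

#include <stdlib.h>

// Illustrative: pick the Metal device selected by GGML_METAL_DEVICE_INDEX,
// falling back to the system default GPU when the variable is absent or out of range.
static id<MTLDevice> metal_pick_device(void) {
    NSArray<id<MTLDevice>> * all = MTLCopyAllDevices();   // every GPU visible to Metal (macOS)
    const char * idx_env = getenv("GGML_METAL_DEVICE_INDEX");
    if (idx_env != NULL) {
        NSUInteger idx = (NSUInteger) strtoul(idx_env, NULL, 10);
        if (idx < all.count) {
            return all[idx];                              // e.g. one half of a W6800X Duo
        }
    }
    return MTLCreateSystemDefaultDevice();                // default GPU
}

Note that MTLCopyAllDevices does not document a stable ordering across boots, so a production backend would more likely match devices by registryID or name than by a bare index.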