@DiegoRBaquero
Last active May 18, 2026 21:21
Running Claude Code with a local LLM

1. Download and install oMLX (macOS-native MLX server with smart caching)

https://github.com/jundot/omlx/releases

2. Download the model

Go to model downloader

Multiple options, depending on your RAM

35B parameters, 3B active:

  1. unsloth/Qwen3.6-35B-A3B-UD-MLX-3bit - 17.4 GB (36GB+ RAM ideal)
  2. unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit - 21.6 GB (48GB+ RAM ideal)
  3. unsloth/Qwen3.6-35B-A3B-MLX-8bit - 37.7 GB (64GB+ RAM ideal)

27B parameters:

  1. unsloth/Qwen3.6-27B-UD-MLX-4bit - 26.2 GB (48GB+ RAM ideal)
  2. unsloth/Qwen3.6-27B-UD-MLX-6bit - 30.5 GB (64GB+ RAM ideal)
  3. unsloth/Qwen3.6-27B-UD-MLX-8bit - 34.7 GB (64GB+ RAM ideal)
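
The RAM guidance above can be condensed into a small lookup. This is a minimal Python sketch, not part of oMLX: the helper name `pick_models` and the table layout are hypothetical, while the sizes and RAM thresholds are copied from the lists above.

```python
# Hypothetical helper: the data mirrors the lists above; nothing here calls oMLX.
MODELS = [
    # (repo id, download size in GB, suggested minimum RAM in GB)
    ("unsloth/Qwen3.6-35B-A3B-UD-MLX-3bit", 17.4, 36),
    ("unsloth/Qwen3.6-35B-A3B-UD-MLX-4bit", 21.6, 48),
    ("unsloth/Qwen3.6-35B-A3B-MLX-8bit", 37.7, 64),
    ("unsloth/Qwen3.6-27B-UD-MLX-4bit", 26.2, 48),
    ("unsloth/Qwen3.6-27B-UD-MLX-6bit", 30.5, 64),
    ("unsloth/Qwen3.6-27B-UD-MLX-8bit", 34.7, 64),
]


def pick_models(ram_gb: float) -> list[str]:
    """Return the repo ids whose suggested minimum RAM fits in ram_gb."""
    return [name for name, _size_gb, min_ram_gb in MODELS if ram_gb >= min_ram_gb]


if __name__ == "__main__":
    for ram in (36, 48, 64):
        print(ram, pick_models(ram))
```

The thresholds are the "ideal" figures from the lists, so a machine just below a threshold may still load the model with less headroom for context and cache.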

3. Configure oMLX settings

  • Go to model settings
  • Pin the downloaded model and set it as the default
  • Open the model's settings
  • Enable TurboQuant KV Cache in 3.5-bit
  • Go to global settings
  • Turn on Fallback to Default Model
  • Set Hot Cache Limit (In-Memory Cache) to 10%
  • Set Cold Cache Limit (SSD Cache) to 10%
  • Increase Max Context Window to 256000
  • Increase Max Tokens to 64000
  • Save
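
To get a feel for what the percentage limits mean in absolute terms, here is a rough Python sketch. It assumes the hot cache percentage is taken relative to total RAM, which is a guess about how oMLX interprets the setting (the 20% / 25GB figure reported on a 128 GB machine in the comments is roughly consistent with it). `cache_budget_gb` is a hypothetical helper, not an oMLX function.

```python
def cache_budget_gb(total_gb: float, limit_percent: float) -> float:
    """Convert a percentage limit into an absolute budget in GB."""
    if not 0 <= limit_percent <= 100:
        raise ValueError("limit_percent must be between 0 and 100")
    return total_gb * limit_percent / 100


if __name__ == "__main__":
    # Hot cache at 10% for a few common RAM sizes (assumed to be relative to total RAM).
    for ram_gb in (36, 48, 64, 128):
        print(f"{ram_gb} GB RAM -> {cache_budget_gb(ram_gb, 10):.1f} GB hot cache")
```

With plenty of spare RAM, raising the percentage trades model headroom for more cached context; on smaller machines the 10% suggested above leaves more room for the model itself.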

4. Configure Claude Code

  • Add "CLAUDE_CODE_ATTRIBUTION_HEADER": "0" under the env key in ~/.claude/settings.json (Ref)

    Example:

      {
        "env": {
          "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
        }
      }
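
If settings.json already contains other settings, editing it by hand risks overwriting them. The following is a minimal Python sketch of merging the key in place. The function name `set_claude_env` is hypothetical, and it assumes the file is plain JSON with an optional top-level "env" object, as in the example above.

```python
import json
from pathlib import Path


def set_claude_env(settings_path: Path, key: str, value: str) -> dict:
    """Set env[key] = value in a settings file, keeping every other key."""
    settings = {}
    if settings_path.exists():
        text = settings_path.read_text(encoding="utf-8")
        if text.strip():  # tolerate an empty file
            settings = json.loads(text)
    # Create the "env" object if it is missing, then add or overwrite the key.
    settings.setdefault("env", {})[key] = value
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n", encoding="utf-8")
    return settings
```

Calling `set_claude_env(Path.home() / ".claude" / "settings.json", "CLAUDE_CODE_ATTRIBUTION_HEADER", "0")` would apply the change from this step. Back up the file first, since the sketch rewrites it with its own indentation.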
    

5. Configure oMLX's Claude Code settings

  • Go to Dashboard
  • Scroll down to Claude Code with oMLX
  • Set Qwen3.6-35B-A3B-UD-MLX-4bit (or whichever model you downloaded) for all three tiers
  • Enable Context Scaling for Claude Code and set Target Context Size to 64000
  • Run the displayed Command
  • Optional, for lighter work: add --bare to the command

6. Share your findings and optimizations!

@christ-off

Thanks a LOT for this page. It's simply the most useful page I've found for running an LLM locally on my MacBook Pro.
As I have 128 GB of RAM, I currently use "Qwen3.6-35B-A3B-bf16".
Let me try that with Claude Code on a few projects for a few more days.
I will give you my feedback on the model and settings.
Thanks a lot again.

@kynrai

kynrai commented May 2, 2026

I'm not lucky enough to have such awesome hardware. Can you report back on your experience with this? I'm very curious.

@christ-off

Hi
Using Qwen3.6-35B-A3B-bf16, I reach 1539.3 tok/s on input and more than 50 tok/s on output.
I am using it as my main Claude Code LLM now.
For bigger tasks, I do the planning with Claude Code online and have it write the implementation plan in Markdown.
Then I give my local Claude the plan to implement.
It's fast enough for me.
I still have plenty of memory left for a large cache (Hot Cache Limit (In-Memory Cache) at 20%, 25 GB).
I complement Claude with rtk to make it more efficient, and I also have it use Context7 to get the latest docs, features, and releases for my code.
My personal code is too small to tell whether this will work on big enterprise monoliths, though.
So for now I'm keeping the small $20 Claude Code subscription.
