Run GLM 4.7 Flash with OpenCode on RTX 5090 (or other 32GB+ setups)

Grab the latest llama.cpp sources and build them:

git clone https://github.com/ggml-org/llama.cpp/
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
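
A quick sanity check after the build (the paths assume the build directory used above; --version is expected to just print the build info and exit):

ls llama.cpp/build/bin/
./llama.cpp/build/bin/llama-server --version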

Get OpenCode from https://opencode.ai/; the install instructions are on that page.

Put this in ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8888/v1"
      },
      "models": {
        "GLM-4.7-flash": {
          "name": "unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL",
          "modalities": { "input": ["text"], "output": ["text"] },
          "limit": {
            "context": 64000,
            "output": 65536
          }
        }
      }
    }
  }
}
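
The baseURL here has to match the host and port llama-server is started with below (127.0.0.1:8888). To confirm the file is valid JSON before launching OpenCode, a quick check (assuming jq is installed):

jq . ~/.config/opencode/opencode.json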

Fire up llama.cpp:

./llama.cpp/build/bin/llama-server -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL --jinja --threads -1 --ctx-size 65000 --temp 0.7 --top-p 1.0 --min-p 0.01 --dry-multiplier 0.0 --fit off -fa auto --port 8888 --host 127.0.0.1 --no-op-offload --no-mmap
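
Once the model has finished loading, you can verify the server is answering before pointing OpenCode at it. This check assumes llama-server's standard HTTP endpoints (/health plus the OpenAI-compatible /v1/models):

curl http://127.0.0.1:8888/health
curl http://127.0.0.1:8888/v1/models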

Start OpenCode, press "Ctrl-X, m" to open the model list, scroll to the bottom, and select GLM 4.7 Flash.

Be the hero you want to be.
