Run GLM 4.7 Flash with OpenCode on RTX 5090 (or other 32GB+ setups)

Grab the latest llama.cpp sources and build them:

git clone https://github.com/ggml-org/llama.cpp/
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
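
A quick sanity check after the build (the paths assume the build directory used above; --version is expected to just print the build info and exit):

ls llama.cpp/build/bin/
./llama.cpp/build/bin/llama-server --version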

Get OpenCode from https://opencode.ai/; the install instructions are on that page.

Put this in ~/.config/opencode/opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "llama.cpp": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "llama-server (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8888/v1"
      },
      "models": {
        "GLM-4.7-flash": {
          "name": "unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL",
          "modalities": { "input": ["text"], "output": ["text"] },
          "limit": {
            "context": 64000,
            "output": 65536
          }
        }
      }
    }
  }
}
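
The baseURL here has to match the host and port llama-server is started with below (127.0.0.1:8888). To confirm the file is valid JSON before launching OpenCode, a quick check (assuming jq is installed):

jq . ~/.config/opencode/opencode.json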

Fire up llama.cpp:

./llama.cpp/build/bin/llama-server -hf unsloth/GLM-4.7-Flash-GGUF:UD-Q4_K_XL --jinja --threads -1 --ctx-size 65000 --temp 0.7 --top-p 1.0 --min-p 0.01 --dry-multiplier 0.0 --fit off -fa auto --port 8888 --host 127.0.0.1 --no-op-offload --no-mmap
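
Once the model has finished loading, you can verify the server is answering before pointing OpenCode at it. This check assumes llama-server's standard HTTP endpoints (/health plus the OpenAI-compatible /v1/models):

curl http://127.0.0.1:8888/health
curl http://127.0.0.1:8888/v1/models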

Start OpenCode, press "Ctrl-X, m" to open the model list, scroll to the bottom, and select GLM 4.7 Flash.

Be the hero you want to be.
