@mmguero
Created January 9, 2026 05:02
Run llama-server on a CPU-only box with plenty of RAM.
File: llama-server.sh
#!/usr/bin/env bash
set -euo pipefail

# Paths
LLAMA_BIN="./llama.cpp/build/bin/llama-server"
MODELS_DIR="./models"

# Network
HOST="127.0.0.1"
PORT="8081"

# Global performance settings
THREADS=4        # worker threads; matched to the cores pinned below
BATCH_SIZE=256   # logical batch size for prompt processing
CTX_SIZE=16384   # context window, in tokens

# Pin the server to cores 0-3 so the thread count matches the CPU set.
# --n-gpu-layers 0 keeps everything on the CPU, --mlock pins model
# weights in RAM so they aren't swapped out, and the q4_0 cache types
# quantize the KV cache to cut its memory footprint.
taskset -c 0-3 \
  "$LLAMA_BIN" \
  --host "$HOST" \
  --port "$PORT" \
  --cpu-moe \
  --n-gpu-layers 0 \
  --mlock \
  --threads "$THREADS" \
  --batch-size "$BATCH_SIZE" \
  --ctx-size "$CTX_SIZE" \
  --cache-type-k q4_0 \
  --cache-type-v q4_0 \
  --models-dir "$MODELS_DIR" \
  --models-max 1 \
  --temp 0.3 \
  --top-p 0.85 \
  --api-prefix /v1
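
Once the server is up, you can smoke-test it with curl. A minimal sketch, assuming --api-prefix acts as a plain path prefix on every route (so the built-in /health endpoint moves to /v1/health, and the native OpenAI-style route /v1/chat/completions picks up a second /v1 segment); the model name below is a placeholder, not something from the gist:

# Health check (all routes sit behind the /v1 prefix)
curl http://127.0.0.1:8081/v1/health

# Chat completion against the OpenAI-compatible endpoint; note the
# doubled segment from the prefix. "placeholder.gguf" is hypothetical --
# use a model actually present in ./models.
curl http://127.0.0.1:8081/v1/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "placeholder.gguf", "messages": [{"role": "user", "content": "Say hi"}]}'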