Skip to content

Instantly share code, notes, and snippets.

@ichim-david
Forked from joematthews/humble-pi.md
Created June 23, 2026 13:22
Show Gist options
  • Select an option

  • Save ichim-david/4fc711674c4f92f63c0e608107762ba3 to your computer and use it in GitHub Desktop.

Select an option

Save ichim-david/4fc711674c4f92f63c0e608107762ba3 to your computer and use it in GitHub Desktop.
Humble Pi — local agentic coding on minimal hardware

Humble Pi — local agentic coding on minimal hardware

A coding agent that runs entirely on your own machine. No API keys, no cloud, works offline.

pi is the agent you talk to. It runs against a local llama.cpp server hosting Google Gemma 4 (the unsloth builds).

Pick a model by how much memory you have:

Model unsloth repo Download VRAM Start Max
Gemma 4 E4B unsloth/gemma-4-E4B-it-GGUF 5.13 GB 8 GB+ 65536 (64k) 131072 (128k)
Gemma 4 12B unsloth/gemma-4-12b-it-GGUF 7.37 GB 12 GB+ 65536 (64k) 262144 (256k)

1. Install llama.cpp

This gives you the llama binary; you start the server with llama serve. Check it installed with llama version.

macOSHomebrew:

brew install llama.cpp

Linux (any distro) — grab a prebuilt binary with installama.sh.

This script auto-detects your CPU and GPU (CUDA / ROCm / Vulkan) and drops llama into ~/.local/bin (it'll tell you if that's not on your PATH):

curl -fsSL https://angt.github.io/installama.sh | sh

2. Start the model server

Add an alias to start the server. It also saves the server's output to a timestamped log file, which helps if something goes wrong. Put this in your shell config — ~/.zshrc on macOS, ~/.bashrc on Linux:

# Gemma 4 E4B @ 64k for pi (f16 KV, flash-attn, :8080, vision on, retain-reasoning template)
alias llamagemma4b='mkdir -p ~/.llama-logs && llama serve -hf unsloth/gemma-4-E4B-it-GGUF:UD-Q4_K_XL -c 65536 -fa 1 --jinja --chat-template-file ~/.llama-templates/gemma-e4b-retain.jinja --parallel 1 --cache-ram 0 --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0 2>&1 | tee ~/.llama-logs/llama-$(date +%Y%m%d-%H%M%S).log'

# Gemma 4 12B @ 64k for pi (needs ~12GB) — same flags, bigger model
alias llamagemma12b='mkdir -p ~/.llama-logs && llama serve -hf unsloth/gemma-4-12b-it-GGUF:UD-Q4_K_XL -c 65536 -fa 1 --jinja --parallel 1 --cache-ram 0 --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0 --reasoning on 2>&1 | tee ~/.llama-logs/llama-$(date +%Y%m%d-%H%M%S).log'

Install the E4B reasoning template (one time)

The E4B alias points at a custom chat template with --chat-template-file. Fetch it before the first launch, or the server won't start:

mkdir -p ~/.llama-templates
curl -fsSL https://gist.githubusercontent.com/joematthews/162c8ca749c68702c29780de9c51b9e1/raw/gemma-e4b-retain.jinja -o ~/.llama-templates/gemma-e4b-retain.jinja

Why it matters. Gemma 4's stock template drops the model's reasoning from history on each new message. On a cache-by-prefix backend like llama.cpp that shifts the prompt and forces the server to reprocess the entire conversation every turn — slow, and it throws away the reasoning you just generated. This template keeps reasoning in history, so the cache is reused and the model's prior reasoning carries forward. (The 12B alias doesn't need it — it runs with --reasoning on and the stock template.)

The template is a small modification of Gemma 4's official chat template, made by vevi33 and shared in llama.cpp #21912, where aldehir identified the underlying cause. Credit to both.

Then start it:

source ~/.zshrc      # or ~/.bashrc
llamagemma4b         # first run downloads the model (~5 GB), then serves on :8080

Note

When you run the alias for the first time llama.cpp will download the model automatically and this takes some time. You may see no output in the terminal, wait a few minutes for the model to download and the server should start up after. Subsequent starts will be instant.

What the flags do

flag what it does
--cache-ram 0 Keeps memory steady. Without it, llama.cpp piles up cached copies of the conversation that grow every time you /new. This keeps just one in place and reuses it.
-c 65536 The context window — 64k tokens here. A good default; raise it if you have memory to spare (max 128k on E4B, 256k on the 12B).
-fa 1 Flash attention — faster, and uses less memory.
--parallel 1 One conversation slot, which keeps the cache simple.
--temp 1.0 --top-p 0.95 --top-k 64 --min-p 0 Gemma 4's recommended sampling settings.
--reasoning on Enables reasoning. Only needed by 12b which has reasoning off by default.
--chat-template-file …gemma-e4b-retain.jinja E4B only. Keeps the model's reasoning in history so llama.cpp reuses its prefix cache instead of reprocessing every turn (see below).

Gemma 4 can read images, that adds about 1.2 GB of memory; add --no-mmproj to the alias if you'd rather run text-only.

You don't need to set anything for thinking — Gemma 4 handles it, and pi controls it. Unlike some models, turning thinking on here doesn't slow things down.

3. Install pi and its packages

# install pi
curl -fsSL https://pi.dev/install.sh | sh

# install extension packages
pi install npm:pi-llama-cpp         # connects pi to the llama.cpp server
pi install npm:pi-smart-fetch       # lets the agent read web pages
pi install npm:pi-smart-web-search  # lets the agent search the web

Together they let a small model look things up instead of guessing.

4. Run it

# terminal 1: start the model server
source ~/.zshrc && llamagemma4b

# terminal 2: open your project and start coding
cd ~/your-project
pi

That's the whole setup — a private coding agent that runs on a modest laptop.

5. Prompts to try

Give it a spin. These all need up-to-date info from the web, which is exactly where a small local model needs a hand:

What's the latest version of Node.js, and what's new in it?
Compare Bun and Deno for a new TypeScript API in 2026.
Scaffold a minimal Vite + React + TypeScript app in ./demo, then explain the structure.
Read package.json and tell me which dependencies are out of date.
Find the current recommended way to set up GitHub Actions for a Node project, then write the workflow file.

Watch what it does: it searches, opens the most useful results, reads them, and answers from what it read — instead of guessing from old training data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment