This guide explains how to download, set up, and run the Gemma 4 Vision (E2B) model with llama.cpp. It covers both text-only generation and multimodal (text + image) generation, along with test examples for Windows and Linux.
You will need to download the following model files from HuggingFace:
- Main Model (GGUF): gemma-4-E2B-it-UD-Q4_K_XL.gguf
- Multimodal Projector: mmproj-F16.gguf
Download the latest llama.cpp release suitable for your operating system from the project's GitHub releases page.
The server can be started in two modes depending on whether you want to process images alongside text. Keep the server running in your terminal to process requests.
Option A: Text-Only Mode
llama-server.exe -m "C:\Users\DEEPANSHU\Desktop\workspace\models\lfm2.5 thinking\gemma-4-E2B-it-UD-Q4_K_XL.gguf" --host 0.0.0.0 --port 8000 --n-gpu-layers -1 --embeddings --pooling mean -c 65536

Option B: Multimodal (Text and Image) Mode
llama-server.exe -m "C:\Users\DEEPANSHU\Desktop\workspace\models\lfm2.5 thinking\gemma-4-E2B-it-UD-Q4_K_XL.gguf" --mmproj "C:\Users\DEEPANSHU\Desktop\workspace\models\lfm2.5 thinking\mmproj-F16.gguf" --host 0.0.0.0 --port 8000 --n-gpu-layers -1 --embeddings --pooling mean -c 65536

Note: Replace /path/to/models/... with your actual Linux file paths.
Option A: Text-Only Mode
./llama-server -m "/path/to/models/gemma-4-E2B-it-UD-Q4_K_XL.gguf" --host 0.0.0.0 --port 8000 --n-gpu-layers -1 --embeddings --pooling mean -c 65536

Option B: Multimodal (Text and Image) Mode
./llama-server -m "/path/to/models/gemma-4-E2B-it-UD-Q4_K_XL.gguf" --mmproj "/path/to/models/mmproj-F16.gguf" --host 0.0.0.0 --port 8000 --n-gpu-layers -1 --embeddings --pooling mean -c 65536

Once the server is running, you can test a standard text completion request.
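Before sending requests, you can optionally confirm the model has finished loading. A minimal Python sketch using only the standard library (it assumes your llama-server build exposes the /health endpoint, which recent builds do):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll the server's /health endpoint until it answers 200 OK."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or still loading the model
        time.sleep(delay)
    return False

# Usage:
#   wait_for_server("http://localhost:8000")  # True once the model is loaded
```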
Windows (Command Prompt / PowerShell)
curl http://localhost:8000/v1/chat/completions ^
-H "Content-Type: application/json" ^
-d "{\"model\":\"gemma\",\"messages\":[{\"role\":\"user\",\"content\":\"Explain quantum computing in one sentence.\"}],\"temperature\":0.7}"

Linux (Bash)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gemma","messages":[{"role":"user","content":"Explain quantum computing in one sentence."}],"temperature":0.7}'

To test multimodal generation, you need to encode an image to Base64 format and send it as part of the JSON payload.
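Both the text request above and the multimodal request described next target the same /v1/chat/completions endpoint. If you prefer scripting over curl, the text case looks like this in Python (a standard-library sketch mirroring the curl payload):

```python
import json
import urllib.request

def build_chat_payload(prompt: str, temperature: float = 0.7) -> dict:
    """Same JSON body as the curl examples: one user message."""
    return {
        "model": "gemma",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(base_url: str, prompt: str) -> str:
    """POST to the OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (with the server running):
#   print(chat("http://localhost:8000", "Explain quantum computing in one sentence."))
```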
Download any image file locally (e.g., test.jpg).
Windows:
certutil -encode test.jpg temp.txt

After running, open temp.txt in a text editor and remove the first and last lines (-----BEGIN CERTIFICATE----- and -----END CERTIFICATE-----). certutil also wraps the output at 64 characters per line, so join the remaining text into a single line; that is your Base64 string.
Linux:
base64 -w 0 test.jpg > temp.txt

The temp.txt file now contains your Base64 string on a single continuous line (-w 0 disables line wrapping), so no extra trimming is required.
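Alternatively, the encoding step works the same on both platforms with a few lines of Python (standard library only):

```python
import base64
from pathlib import Path

def image_to_base64(path: str) -> str:
    """Return the file's contents as a single-line Base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")

# Usage:
#   Path("temp.txt").write_text(image_to_base64("test.jpg"))
```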
Create a file named file.json and paste the following content. Replace <PASTE THAT BASE 64 HERE> with the text from your temp.txt:
{
"model": "gemma",
"messages": [
{
"role": "user",
"content": [
{ "type": "text", "text": "Describe this image in detail." },
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,<PASTE THAT BASE 64 HERE>"
}
}
]
}
]
}

Windows (Command Prompt / PowerShell)
curl http://localhost:8000/v1/chat/completions ^
-H "Content-Type: application/json" ^
-d @file.json

Linux (Bash)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d @file.json
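The manual steps above (encode the image, paste it into file.json, post with curl) can also be scripted end-to-end. A standard-library Python sketch, assuming a JPEG input file:

```python
import base64
import json
from pathlib import Path

def build_vision_payload(image_path: str, question: str) -> dict:
    """Build the same JSON as file.json, embedding the image as a data URL."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "model": "gemma",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
    }

def write_payload(image_path: str, question: str, out_path: str = "file.json") -> None:
    """Serialize the payload so it can be posted with: curl ... -d @file.json"""
    Path(out_path).write_text(json.dumps(build_vision_payload(image_path, question)))

# Usage:
#   write_payload("test.jpg", "Describe this image in detail.")
```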