@deepanshu-yadav
Created April 3, 2026 18:23
Test the new Gemma 4 on llama.cpp (no GPU required)

Running Gemma 4 Vision with llama.cpp

This guide explains how to download, set up, and run the Gemma 4 Vision (E2B) model using llama.cpp. It covers both text-only and multimodal (text + image) generation, with test examples for Windows and Linux.

1. Download Required Files

Models

You will need to download the following model files from HuggingFace:

  1. Main Model (GGUF): gemma-4-E2B-it-UD-Q4_K_XL.gguf (this is the file the server commands below reference)
  2. Multimodal Projector: mmproj-F16.gguf

llama.cpp Server

Download the latest llama.cpp release suitable for your operating system from the llama.cpp GitHub releases page.

2. Starting the Server

The server can be started in two modes depending on whether you want to process images alongside text. Keep the server running in your terminal to process requests.

πŸͺŸ Windows Instructions

Option A: Text-Only Mode

llama-server.exe -m "C:\Users\DEEPANSHU\Desktop\workspace\models\lfm2.5 thinking\gemma-4-E2B-it-UD-Q4_K_XL.gguf" --host 0.0.0.0 --port 8000 --n-gpu-layers -1 --embeddings --pooling mean -c 65536

Option B: Multimodal (Text and Image) Mode

llama-server.exe -m "C:\Users\DEEPANSHU\Desktop\workspace\models\lfm2.5 thinking\gemma-4-E2B-it-UD-Q4_K_XL.gguf" --mmproj "C:\Users\DEEPANSHU\Desktop\workspace\models\lfm2.5 thinking\mmproj-F16.gguf" --host 0.0.0.0 --port 8000 --n-gpu-layers -1 --embeddings --pooling mean -c 65536

🐧 Linux Instructions

Note: Replace /path/to/models/... with your actual Linux file paths.

Option A: Text-Only Mode

./llama-server -m "/path/to/models/gemma-4-E2B-it-UD-Q4_K_XL.gguf" --host 0.0.0.0 --port 8000 --n-gpu-layers -1 --embeddings --pooling mean -c 65536

Option B: Multimodal (Text and Image) Mode

./llama-server -m "/path/to/models/gemma-4-E2B-it-UD-Q4_K_XL.gguf" --mmproj "/path/to/models/mmproj-F16.gguf" --host 0.0.0.0 --port 8000 --n-gpu-layers -1 --embeddings --pooling mean -c 65536
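Before sending requests, you can verify the server is up. This is a minimal sketch using only the Python standard library; it assumes the server from the commands above is listening on localhost:8000 and exposes llama.cpp's GET /health endpoint.

```python
import urllib.request


def server_ready(base_url="http://localhost:8000", timeout=2):
    """Return True if the llama.cpp server answers its /health endpoint."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

If this returns False, check that the server terminal shows no load errors and that the port matches the `--port` flag you used.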

3. Testing the Inference

Testing Text Completion

Once the server is running, you can test a standard text completion request.

πŸͺŸ Windows (Command Prompt / PowerShell)

curl http://localhost:8000/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d "{\"model\":\"gemma\",\"messages\":[{\"role\":\"user\",\"content\":\"Explain quantum computing in one sentence.\"}],\"temperature\":0.7}"

🐧 Linux (Bash)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"gemma","messages":[{"role":"user","content":"Explain quantum computing in one sentence."}],"temperature":0.7}'
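The same request can also be sent from Python with only the standard library. This is a sketch, not part of the original guide; it assumes the server from step 2 is running on localhost:8000.

```python
import json
import urllib.request


def build_chat_payload(prompt, model="gemma", temperature=0.7):
    """Build the OpenAI-style payload that /v1/chat/completions expects."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }


def send_chat(payload, url="http://localhost:8000/v1/chat/completions"):
    """POST the payload and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage: `send_chat(build_chat_payload("Explain quantum computing in one sentence."))` returns the same JSON body the curl commands above print.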

Testing Multimodal (Image) Completion

To test multimodal generation, you need to encode an image to Base64 format and send it as part of the JSON payload.

Step 1: Base64 Encode Your Image

Download any image file locally (e.g., test.jpg).

πŸͺŸ Windows:

certutil -encode test.jpg temp.txt

After it finishes, open temp.txt in a text editor, delete the first and last lines (-----BEGIN CERTIFICATE----- and -----END CERTIFICATE-----), and join the remaining lines into one continuous string: certutil wraps its output across multiple lines, but the Base64 data must be a single line when pasted into the JSON payload.

🐧 Linux:

base64 -w 0 test.jpg > temp.txt

The temp.txt file now contains the Base64-encoded string on a single line (`-w 0` disables line wrapping), so no trimming is required.
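As a cross-platform alternative to the two commands above, the encoding step can be done in Python; this sketch always produces an unwrapped single-line string, so no manual trimming is needed on either OS.

```python
import base64


def image_to_base64(path):
    """Read an image file and return its Base64 encoding as a single-line string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```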

Step 2: Prepare the JSON Payload

Create a file named file.json with the following content, replacing <PASTE THAT BASE 64 HERE> with the Base64 string from your temp.txt:

{
  "model": "gemma",
  "messages": [
    {
      "role": "user",
      "content": [
        { "type": "text", "text": "Describe this image in detail." },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,<PASTE THAT BASE 64 HERE>"
          }
        }
      ]
    }
  ]
}
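If you prefer to avoid hand-editing JSON, the payload above can be generated programmatically. This is a sketch assuming the `image_to_base64` output from the previous step; the prompt text and file name mirror the guide's example.

```python
import json


def build_image_payload(b64_image, prompt="Describe this image in detail.", model="gemma"):
    """Build the multimodal chat payload with an inline data-URL image."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": "data:image/jpeg;base64," + b64_image},
                    },
                ],
            }
        ],
    }
```

To produce the file used in the next step: `json.dump(build_image_payload(b64_string), open("file.json", "w"))`.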

Step 3: Send the Request

πŸͺŸ Windows (Command Prompt / PowerShell)

curl http://localhost:8000/v1/chat/completions ^
  -H "Content-Type: application/json" ^
  -d @file.json

🐧 Linux (Bash)

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @file.json
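The curl commands above can likewise be reproduced in Python, loading file.json from disk and posting it. A minimal sketch, again assuming the multimodal server from step 2 is running on localhost:8000:

```python
import json
import urllib.request


def load_payload(path):
    """Load a JSON payload (e.g. file.json) from disk."""
    with open(path, "rb") as f:
        return json.load(f)


def send_payload(payload, url="http://localhost:8000/v1/chat/completions"):
    """POST a payload dict to the server and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage: `send_payload(load_payload("file.json"))`; the model's description of the image appears under `choices[0].message.content` in the response.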