This guide explains how to download, set up, and run the Gemma 4 Vision (E2B) model with llama.cpp. It covers both text-only generation and multimodal (text + image) generation, along with test examples for Windows and Linux.
You will need to download the following model files from HuggingFace:
- Main Model (GGUF): gemma-4-E2B-it-UD-Q4_K_XL.gguf
- Multimodal Projector: mmproj-F16.gguf
Download the latest llama.cpp release suitable for your operating system from the project's GitHub releases page.
The server can be started in two modes depending on whether you want to process images alongside text. Keep the server running in your terminal to process requests.
Option A: Text-Only Mode
llama-server.exe -m "C:\Users\DEEPANSHU\Desktop\workspace\models\lfm2.5 thinking\gemma-4-E2B-it-UD-Q4_K_XL.gguf" --host 0.0.0.0 --port 8000 --n-gpu-layers -1 --embeddings --pooling mean -c 65536

Option B: Multimodal (Text and Image) Mode
llama-server.exe -m "C:\Users\DEEPANSHU\Desktop\workspace\models\lfm2.5 thinking\gemma-4-E2B-it-UD-Q4_K_XL.gguf" --mmproj "C:\Users\DEEPANSHU\Desktop\workspace\models\lfm2.5 thinking\mmproj-F16.gguf" --host 0.0.0.0 --port 8000 --n-gpu-layers -1 --embeddings --pooling mean -c 65536

Note: Replace /path/to/models/... with your actual Linux file paths.
Option A: Text-Only Mode
./llama-server -m "/path/to/models/gemma-4-E2B-it-UD-Q4_K_XL.gguf" --host 0.0.0.0 --port 8000 --n-gpu-layers -1 --embeddings --pooling mean -c 65536

Option B: Multimodal (Text and Image) Mode
./llama-server -m "/path/to/models/gemma-4-E2B-it-UD-Q4_K_XL.gguf" --mmproj "/path/to/models/mmproj-F16.gguf" --host 0.0.0.0 --port 8000 --n-gpu-layers -1 --embeddings --pooling mean -c 65536

Once the server is running, you can test a standard text completion request.
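Before sending requests, you can optionally confirm the model has finished loading. A minimal Python sketch using only the standard library (it assumes your llama-server build exposes the /health endpoint, which recent builds do):

```python
import time
import urllib.error
import urllib.request

def wait_for_server(base_url: str, attempts: int = 10, delay: float = 1.0) -> bool:
    """Poll the server's /health endpoint until it answers 200 OK."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or still loading the model
        time.sleep(delay)
    return False

# Usage:
#   wait_for_server("http://localhost:8000")  # True once the model is loaded
```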
Windows (Command Prompt / PowerShell)
curl http://localhost:8000/v1/chat/completions ^
-H "Content-Type: application/json" ^
-d "{\"model\":\"gemma\",\"messages\":[{\"role\":\"user\",\"content\":\"Explain quantum computing in one sentence.\"}],\"temperature\":0.7}"

Linux (Bash)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gemma","messages":[{"role":"user","content":"Explain quantum computing in one sentence."}],"temperature":0.7}'

To test multimodal generation, you need to encode an image to Base64 format and send it as part of the JSON payload.
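Both the text request above and the multimodal request described next target the same /v1/chat/completions endpoint. If you prefer scripting over curl, the text case looks like this in Python (a standard-library sketch mirroring the curl payload):

```python
import json
import urllib.request

def build_chat_payload(prompt: str, temperature: float = 0.7) -> dict:
    """Same JSON body as the curl examples: one user message."""
    return {
        "model": "gemma",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(base_url: str, prompt: str) -> str:
    """POST to the OpenAI-compatible endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Usage (with the server running):
#   print(chat("http://localhost:8000", "Explain quantum computing in one sentence."))
```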
Download any image file locally (e.g., test.jpg).
Windows:
certutil -encode test.jpg temp.txt

After running, open temp.txt in a text editor and remove the first and last lines (-----BEGIN CERTIFICATE----- and -----END CERTIFICATE-----). certutil also wraps the output at 64 characters per line, so join the remaining text into a single line; that is your Base64 string.
Linux:
base64 -w 0 test.jpg > temp.txt

The temp.txt file now contains your Base64 string on a single continuous line (-w 0 disables line wrapping), so no extra trimming is required.
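Alternatively, the encoding step works the same on both platforms with a few lines of Python (standard library only):

```python
import base64
from pathlib import Path

def image_to_base64(path: str) -> str:
    """Return the file's contents as a single-line Base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")

# Usage:
#   Path("temp.txt").write_text(image_to_base64("test.jpg"))
```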
Create a file named file.json and paste the following content. Replace <PASTE THAT BASE 64 HERE> with the text from your temp.txt:
{
"model": "gemma",
"messages": [
{
"role": "user",
"content": [
{ "type": "text", "text": "Describe this image in detail." },
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,<PASTE THAT BASE 64 HERE>"
}
}
]
}
]
}

Windows (Command Prompt / PowerShell)
curl http://localhost:8000/v1/chat/completions ^
-H "Content-Type: application/json" ^
-d @file.json

Linux (Bash)
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d @file.json
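The manual steps above (encode the image, paste it into file.json, post with curl) can also be scripted end-to-end. A standard-library Python sketch, assuming a JPEG input file:

```python
import base64
import json
from pathlib import Path

def build_vision_payload(image_path: str, question: str) -> dict:
    """Build the same JSON as file.json, embedding the image as a data URL."""
    b64 = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return {
        "model": "gemma",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
    }

def write_payload(image_path: str, question: str, out_path: str = "file.json") -> None:
    """Serialize the payload so it can be posted with: curl ... -d @file.json"""
    Path(out_path).write_text(json.dumps(build_vision_payload(image_path, question)))

# Usage:
#   write_payload("test.jpg", "Describe this image in detail.")
```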