GGUF (GPT-Generated Unified Format) has quickly become the go-to standard for running large language models on your machine. There’s a growing number of GGUF models on Hugging Face, and thanks to community contributors like TheBloke, you now have easy access to them.
Ollama is an application built on llama.cpp that allows you to interact with large language models directly on your computer. With Ollama, you can use any GGUF quantized model available on Hugging Face directly, without needing to create a new Modelfile or download the model manually.
In this guide, we'll explore two methods to run GGUF models locally with Ollama:
- Directly Running GGUF Models from Hugging Face via Ollama
- Downloading GGUF Models and Running Them Locally with Ollama
Let's dive into each method.
Ollama now supports running any GGUF model available on Hugging Face directly, without manual downloads or Modelfiles. At the time of writing, there are over 45,000 public GGUF checkpoints on the Hugging Face Hub that you can run with a single ollama run command.
If you haven't installed Ollama yet, you can do so by following the instructions on their GitHub repository.
For macOS users with Homebrew:
brew install ollama
For other platforms, please refer to the installation guide.
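On Linux, the official one-line installer from the Ollama website also works (review the script before piping it to a shell if you prefer):
curl -fsSL https://ollama.com/install.sh | sh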
To run a GGUF model directly from Hugging Face:
ollama run hf.co/{username}/{repository}
For example, to run the Llama-3.2-1B-Instruct-GGUF model:
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
This command will automatically download the model and run it.
Here are some models you can try running directly:
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
ollama run hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF
ollama run hf.co/arcee-ai/SuperNova-Medius-GGUF
ollama run hf.co/bartowski/Humanish-LLama3-8B-Instruct-GGUF
By default, the Q4_K_M quantization scheme is used when it's present inside the model repository. If not, Ollama falls back to a reasonable quantization type available in the repository.
To select a different quantization scheme, you can specify it in the command:
ollama run hf.co/{username}/{repository}:{quantization}
For example:
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:IQ3_M
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
# The quantization name is case-insensitive
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:iq3_m
# You can also use the full filename as a tag
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Llama-3.2-3B-Instruct-IQ3_M.gguf
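If you're not sure which quantization tags a repository actually offers, one option is to list its files through the Hugging Face Hub API before picking one. This is just a quick sketch and assumes jq is installed:
# List the GGUF files available in a repository
curl -s https://huggingface.co/api/models/bartowski/Llama-3.2-3B-Instruct-GGUF | jq -r '.siblings[].rfilename'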
By default, a chat template is selected automatically based on the tokenizer.chat_template metadata stored inside the GGUF file.
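Once you've pulled a model, you can check which template Ollama ended up using; recent Ollama versions expose it via ollama show:
ollama show hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF --template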
If your GGUF file doesn't have a built-in template, or if you want to customize your chat template, you can create a new file called template in the repository. The template must be a Go template (not a Jinja template). Here's an example:
{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>
You can optionally configure a system prompt by putting it into a new file named system in the repository.
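For example, the system file can contain nothing but the prompt text itself (the wording below is purely illustrative):
You are a concise assistant. Keep answers to three sentences or fewer.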
To change sampling parameters, create a file named params in the repository. The file must be in JSON format. For a list of all available parameters, please refer to the Ollama documentation.
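As an illustration, a params file might look like the following; the values are placeholders, and the option names (temperature, top_p, num_ctx, stop) are standard Ollama parameters:
{
  "temperature": 0.6,
  "top_p": 0.9,
  "num_ctx": 4096,
  "stop": ["<|end|>"]
}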
In some cases, you might prefer to download GGUF models manually, especially if you need to make modifications or use custom Modelfiles for advanced configurations.
You'll need the Hugging Face CLI to download models.
pip install huggingface_hub
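If the model you want is gated or private, log in first; the public MistralLite example below doesn't require it:
huggingface-cli login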
- Go to the Hugging Face model page: Visit the model's page on Hugging Face, such as huihui-ai/Llama-3.2-3B-Instruct or TheBloke/MistralLite-7B-GGUF.
- Click on the "Files and Versions" tab:
  - Do not rely on the model card alone. The card only gives general information about the model; you need to confirm the file format.
  - Look for .gguf files in the list. This is the format required by Ollama.
Example of a GGUF file listing:
mistrallite.Q4_K_M.gguf 4.37 GB
Example of incompatible files:
model-00001-of-00002.safetensors 4.97 GB
model-00002-of-00002.safetensors 2.25 GB
If you see files like .safetensors or .bin, those are not GGUF-compatible and cannot be used directly with Ollama without conversion.
Once you’ve confirmed that the model has a GGUF file, proceed with the download. For this example, we’ll use TheBloke/MistralLite-7B-GGUF:
huggingface-cli download \
TheBloke/MistralLite-7B-GGUF \
mistrallite.Q4_K_M.gguf \
--local-dir downloads
Note: Be sure to specify the exact GGUF file (mistrallite.Q4_K_M.gguf) to avoid downloading other formats that won't work with Ollama.
This particular file is over 4GB, so if you’re downloading at home, grab a coffee or connect via Ethernet to speed things up.
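When the download finishes, a quick check confirms the file is in place and roughly the expected size:
ls -lh downloads/mistrallite.Q4_K_M.gguf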
Once you’ve successfully downloaded the GGUF model, you can create a Modelfile that points to the downloaded GGUF file. This allows for further customization, such as setting parameters and system prompts.
Create a file named Modelfile with the following content:
# Modelfile
FROM ./downloads/mistrallite.Q4_K_M.gguf
Now, build the model using the following command:
ollama create mistrallite -f Modelfile
This command tells Ollama to create a new model named mistrallite using the GGUF file specified in the Modelfile.
You’re now ready to run the model! Let’s test it by asking it a question:
ollama run mistrallite "What is Grafana?"
The output will vary due to the stochastic nature of these models, but here's an example response:
Grafana is an open-source data visualization and monitoring tool that allows users to create interactive and customizable dashboards. It supports various data sources and helps in visualizing metrics, logs, and alerts.
If the model you want only provides safetensors or bin files, these formats won’t work directly with Ollama. You would need to either:
- Look for a GGUF version of the model on Hugging Face (check different repositories; see the search example after this list).
- Convert the model into GGUF format, which can be complex and requires additional tools and steps (outside the scope of this guide).
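For the first option, one way to search the Hub for existing GGUF repositories is the public models API; this is a sketch that assumes jq is installed, and the search term is just an example:
# Find GGUF repositories whose name matches a search term
curl -s 'https://huggingface.co/api/models?search=MistralLite&filter=gguf&limit=10' | jq -r '.[].id'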
Once you have the GGUF model, proceed to create a Modelfile and build the model as shown above.
Using Modelfiles allows you to customize your models further. For example, you can set parameters like temperature, top_p, and top_k, or define a system prompt.
# custom-coder.modelfile
FROM ./downloads/mistrallite.Q4_K_M.gguf
# Balance creativity and consistency
PARAMETER temperature 0.7
# Limit randomness in results
PARAMETER top_p 0.9
# Keep suggestions varied but useful
PARAMETER top_k 40
SYSTEM """
You are a senior developer:
1. Explain why something works or breaks.
2. Point out potential performance bottlenecks.
3. Suggest improvements and include example code.
"""
ollama create mycoder -f custom-coder.modelfile
ollama run mycoder "Review this code:"
After working with models, you might want to clean things up.
# List installed models
ollama list
# Remove all models
ollama list | awk 'NR>1 {print $1}' | xargs ollama rm
# Remove specific models
ollama list | awk 'NR>1 {print $1}' | grep 'llama' | xargs ollama rm
Speed up your workflow by creating shell aliases or functions.
alias oc='ollama run codellama'
alias om='ollama run mistral'
model() {
local model_name="$1"
shift
if ! ollama list | grep -q "$model_name"; then
echo "Model $model_name not found. Downloading..."
ollama pull "$model_name"
fi
ollama run "$model_name" "$@"
}
# Example usage
model codellama "Explain this Python code:"
You can chain models together for more advanced workflows.
# Analyze code and explain it to a non-technical audience
cat complex.py | \
ollama run codellama "Analyze this code:" | \
ollama run mistral "Explain this in layman's terms."
# Generate code and review it
ollama run codellama "Write a Redis cache wrapper." | \
ollama run mycoder "Review this code."
Keep an eye on your resource usage, especially with large models.
# Monitor GPU memory
watch -n 1 'nvidia-smi | grep MiB'
# Check model disk usage
du -sh ~/.ollama
# Reduce VRAM use by offloading fewer layers to the GPU,
# e.g. inside an interactive ollama run session:
#   /set parameter num_gpu 20
Choose models based on your hardware capabilities.
- 8GB VRAM GPU: Use mistral or codellama for most tasks.
- 12GB VRAM GPU: Step up to wizard-math for larger models.
- 24GB+ VRAM GPU: Go for the big models like llama2:70b.
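To see how a loaded model actually fits on your hardware, recent Ollama versions report the memory footprint and GPU/CPU split of running models:
# Show running models, their size, and whether they're on GPU or CPU
ollama ps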
If you encounter issues, here’s a quick fix guide.
# Kill stuck processes
pkill ollama
# Restart Ollama
ollama serve
# Full reset (re-download models after this)
rm -rf ~/.ollama
By following this guide, you should be able to effectively run and manage GGUF models locally using Ollama, whether by running them directly from Hugging Face or by downloading and customizing them yourself.
Happy modeling!