GGUF (GPT-Generated Unified Format) has quickly become the go-to standard for running large language models on your machine. There’s a growing number of GGUF models on Hugging Face, and thanks to community contributors like TheBloke, you now have easy access to them.
Ollama is an application built on llama.cpp that allows you to interact with large language models directly on your computer. With Ollama, you can use any GGUF quantized model available on Hugging Face directly, without needing to create a new Modelfile or download the model manually.
In this guide, we'll explore two methods to run GGUF models locally with Ollama:
- Directly Running GGUF Models from Hugging Face via Ollama
- Downloading GGUF Models and Running Them Locally with Ollama
Let's dive into each method.
Ollama now supports running any GGUF model available on Hugging Face directly, without manual downloads or Modelfiles. At the time of writing, there are over 45,000 public GGUF checkpoints on the Hugging Face Hub that you can run with a single ollama run command.
If you haven't installed Ollama yet, you can do so by following the instructions on their GitHub repository.
For macOS users with Homebrew:
brew install ollama
For other platforms, please refer to the installation guide.
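On Linux, the official one-line installer from the Ollama website also works (review the script before piping it to a shell if you prefer):
curl -fsSL https://ollama.com/install.sh | sh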
To run a GGUF model directly from Hugging Face:
ollama run hf.co/{username}/{repository}
For example, to run the Llama-3.2-1B-Instruct-GGUF model:
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
This command will automatically download the model and run it.
Here are some models you can try running directly:
ollama run hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF
ollama run hf.co/mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated-GGUF
ollama run hf.co/arcee-ai/SuperNova-Medius-GGUF
ollama run hf.co/bartowski/Humanish-LLama3-8B-Instruct-GGUF
By default, the Q4_K_M quantization scheme is used when it's present inside the model repository. If not, Ollama falls back to a reasonable quantization type available in the repository.
To select a different quantization scheme, you can specify it in the command:
ollama run hf.co/{username}/{repository}:{quantization}
For example:
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:IQ3_M
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
# The quantization name is case-insensitive
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:iq3_m
# You can also use the full filename as a tag
ollama run hf.co/bartowski/Llama-3.2-3B-Instruct-GGUF:Llama-3.2-3B-Instruct-IQ3_M.gguf
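If you're not sure which quantization tags a repository actually offers, one option is to list its files through the Hugging Face Hub API before picking one. This is just a quick sketch and assumes jq is installed:
# List the GGUF files available in a repository
curl -s https://huggingface.co/api/models/bartowski/Llama-3.2-3B-Instruct-GGUF | jq -r '.siblings[].rfilename'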
By default, a chat template is selected automatically based on the tokenizer.chat_template metadata stored inside the GGUF file.
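Once you've pulled a model, you can check which template Ollama ended up using; recent Ollama versions expose it via ollama show:
ollama show hf.co/bartowski/Llama-3.2-1B-Instruct-GGUF --template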
If your GGUF file doesn't have a built-in template, or if you want to customize your chat template, you can create a new file called template in the repository. The template must be a Go template (not a Jinja template). Here's an example:
{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
{{ .Response }}<|end|>
You can optionally configure a system prompt by putting it into a new file named system in the repository.
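For example, the system file can contain nothing but the prompt text itself (the wording below is purely illustrative):
You are a concise assistant. Keep answers to three sentences or fewer.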
To change sampling parameters, create a file named params in the repository. The file must be in JSON format. For a list of all available parameters, please refer to the Ollama documentation.
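As an illustration, a params file might look like the following; the values are placeholders, and the option names (temperature, top_p, num_ctx, stop) are standard Ollama parameters:
{
  "temperature": 0.6,
  "top_p": 0.9,
  "num_ctx": 4096,
  "stop": ["<|end|>"]
}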
In some cases, you might prefer to download GGUF models manually, especially if you need to make modifications or use custom Modelfiles for advanced configurations.
You'll need the Hugging Face CLI to download models.
pip install huggingface_hub
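If the model you want is gated or private, log in first; the public MistralLite example below doesn't require it:
huggingface-cli login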
- Go to the Hugging Face model page: Visit the model's page on Hugging Face, such as huihui-ai/Llama-3.2-3B-Instruct or TheBloke/MistralLite-7B-GGUF.
- Click on the "Files and Versions" tab:
  - Do not rely on the model card alone. The card only gives general information about the model; you need to confirm the file format.
  - Look for .gguf files in the list. This is the format required by Ollama.
Example of a GGUF file listing:
mistrallite.Q4_K_M.gguf 4.37 GB
Example of incompatible files:
model-00001-of-00002.safetensors 4.97 GB
model-00002-of-00002.safetensors 2.25 GB
If you see files like .safetensors or .bin, those are not GGUF-compatible and cannot be used directly with Ollama without conversion.
Once you’ve confirmed that the model has a GGUF file, proceed with the download. For this example, we’ll use TheBloke/MistralLite-7B-GGUF:
huggingface-cli download \
TheBloke/MistralLite-7B-GGUF \
mistrallite.Q4_K_M.gguf \
--local-dir downloads
Note: Be sure to specify the exact GGUF file (mistrallite.Q4_K_M.gguf) to avoid downloading other formats that won't work with Ollama.
This particular file is over 4GB, so if you’re downloading at home, grab a coffee or connect via Ethernet to speed things up.
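When the download finishes, a quick check confirms the file is in place and roughly the expected size:
ls -lh downloads/mistrallite.Q4_K_M.gguf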
Once you’ve successfully downloaded the GGUF model, you can create a Modelfile that points to the downloaded GGUF file. This allows for further customization, such as setting parameters and system prompts.
Create a file named Modelfile with the following content:
# Modelfile
FROM ./downloads/mistrallite.Q4_K_M.gguf
Now, build the model using the following command:
ollama create mistrallite -f Modelfile
This command tells Ollama to create a new model named mistrallite using the GGUF file specified in the Modelfile.
You’re now ready to run the model! Let’s test it by asking it a question:
ollama run mistrallite "What is Grafana?"
The output will vary due to the stochastic nature of these models, but here's an example response:
Grafana is an open-source data visualization and monitoring tool that allows users to create interactive and customizable dashboards. It supports various data sources and helps in visualizing metrics, logs, and alerts.
If the model you want only provides safetensors or bin files, these formats won’t work directly with Ollama. You would need to either:
- Look for a GGUF version of the model on Hugging Face (check different repositories; see the search example after this list).
- Convert the model into GGUF format, which can be complex and requires additional tools and steps (outside the scope of this guide).
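For the first option, one way to search the Hub for existing GGUF repositories is the public models API; this is a sketch that assumes jq is installed, and the search term is just an example:
# Find GGUF repositories whose name matches a search term
curl -s 'https://huggingface.co/api/models?search=MistralLite&filter=gguf&limit=10' | jq -r '.[].id'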
Once you have the GGUF model, proceed to create a Modelfile and build the model as shown above.
Using Modelfiles allows you to customize your models further. For example, you can set parameters like temperature, top_p, and top_k, or define a system prompt.
# custom-coder.modelfile
FROM ./downloads/mistrallite.Q4_K_M.gguf
# Balance creativity and consistency
PARAMETER temperature 0.7
# Limit randomness in results
PARAMETER top_p 0.9
# Keep suggestions varied but useful
PARAMETER top_k 40
SYSTEM """
You are a senior developer:
1. Explain why something works or breaks.
2. Point out potential performance bottlenecks.
3. Suggest improvements and include example code.
"""
ollama create mycoder -f custom-coder.modelfile
ollama run mycoder "Review this code:"
After working with models, you might want to clean things up.
# List installed models
ollama list
# Remove all models
ollama list | awk 'NR>1 {print $1}' | xargs ollama rm
# Remove specific models
ollama list | awk 'NR>1 {print $1}' | grep 'llama' | xargs ollama rm
Speed up your workflow by creating shell aliases or functions.
alias oc='ollama run codellama'
alias om='ollama run mistral'
model() {
local model_name="$1"
shift
if ! ollama list | grep -q "$model_name"; then
echo "Model $model_name not found. Downloading..."
ollama pull "$model_name"
fi
ollama run "$model_name" "$@"
}
# Example usage
model codellama "Explain this Python code:"
You can chain models together for more advanced workflows.
# Analyze code and explain it to a non-technical audience
cat complex.py | \
ollama run codellama "Analyze this code:" | \
ollama run mistral "Explain this in layman's terms."
# Generate code and review it
ollama run codellama "Write a Redis cache wrapper." | \
ollama run mycoder "Review this code."
Keep an eye on your resource usage, especially with large models.
# Monitor GPU memory
watch -n 1 'nvidia-smi | grep MiB'
# Check model disk usage
du -sh ~/.ollama
# Reduce VRAM use by offloading fewer layers to the GPU,
# e.g. inside an interactive ollama run session:
#   /set parameter num_gpu 20
Choose models based on your hardware capabilities.
- 8GB VRAM GPU: Use mistral or codellama for most tasks.
- 12GB VRAM GPU: Step up to wizard-math for larger models.
- 24GB+ VRAM GPU: Go for the big models like llama2:70b.
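To see how a loaded model actually fits on your hardware, recent Ollama versions report the memory footprint and GPU/CPU split of running models:
# Show running models, their size, and whether they're on GPU or CPU
ollama ps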
If you encounter issues, here’s a quick fix guide.
# Kill stuck processes
pkill ollama
# Restart Ollama
ollama serve
# Full reset (re-download models after this)
rm -rf ~/.ollama
By following this guide, you should be able to effectively run and manage GGUF models locally using Ollama, whether by running them directly from Hugging Face or by downloading and customizing them yourself.
Happy modeling!