@ubergarm
Last active May 27, 2025 15:57
Kicking the tires on an early version of exllamav3 and cooking my first exl3 quants!

exllamav3 guide

Quick start for converting and running inference with your own exl3 quants.

This is still a bit of a WIP with my notes. Slowly iterating and cleaning up as I learn more.

Install

# Clone Repo
# git clone https://github.com/turboderp-org/exllamav3.git
git clone [email protected]:turboderp-org/exllamav3.git
cd exllamav3

# Install Build Dependencies
# ...this exercise is left to the reader and your specific Linux flavor, e.g.
# apt-get install build-essential nvidia-cuda-toolkit blah drivers whatever etc...

# Install Python and Dependencies
# https://docs.astral.sh/uv/getting-started/installation/
uv venv ./venv --python 3.12 --python-preference=only-managed
source ./venv/bin/activate
uv pip install wheel setuptools packaging ninja
uv pip install 'torch>2.6.0'
# you *might* be able to use a prebuilt from here: https://github.com/mjun0812/flash-attention-prebuild-wheels/
# 8 max jobs will take like 64GB RAM and 30 minutes...
MAX_JOBS=8 uv pip install 'flash_attn>=2.7.4.post1' --no-build-isolation
uv pip install -r requirements.txt

# Install
git checkout dev
uv pip install .
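
Quick sanity check that the install worked (a minimal sketch, assuming the venv above is still active and the package import name is exllamav3):

# should print your torch version and True if the CUDA build sees your GPU
python -c "import torch, exllamav3; print(torch.__version__, torch.cuda.is_available())"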

Convert

# Convert model
# python convert.py -i <input_dir> -o <output_dir> -w <working_dir> -b <bitrate>
# Resume an interrupted quant job
# convert.py -w <working_dir> -r
# More options
# convert.py -h

mkdir ./workdir
mkdir ./models
python convert.py \
    -i /mnt/models/Qwen/Qwen3-14B/ \
    -o ./models/Qwen3-14B-exl3-3.0bpw/ \
    -w ./workdir/ \
    -b 3.0
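
If you want several sizes, a simple loop over bitrates works too (a sketch using the same convert.py flags as above; the bpw values and the workdir reset between jobs are just assumptions):

# hypothetical loop; clears the working dir between jobs so leftover state from a previous quant doesn't interfere
for BPW in 3.0 4.0 6.0; do
    rm -rf ./workdir && mkdir ./workdir
    python convert.py \
        -i /mnt/models/Qwen/Qwen3-14B/ \
        -o ./models/Qwen3-14B-exl3-${BPW}bpw/ \
        -w ./workdir/ \
        -b ${BPW}
done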

Inference

#python examples/chat.py -m <input_dir> -mode <prompt_mode>
#python examples/chat.py -m /mnt/models/llama3.1-8b-instruct-exl3 -mode llama3

uv pip install pyperclip prompt-toolkit

python examples/chat.py \
    -m ./models/ \
    -mode llama3

Benchmarking Perplexity

uv pip install -r requirements_eval.txt

python eval/ppl.py \
    -m ./models/ \
    --rows 500

# ParetoQ 2bit QAT of Llama-3.2-1B-Instruct
 -- Bitrate: 2.01 bpw / 6.01 bpw (head)
 -- Evaluated: 500 rows of 2048 tokens
 -- Perplexity: 25.534051

# Original Llama-3.2-1B-Instruct
 -- Bitrate: 2.01 bpw / 6.01 bpw (head)
 -- Evaluated: 500 rows of 2048 tokens
 -- Perplexity: 22.257788

Benchmarking Compare Q Script

uv pip install matplotlib adjusttext
# reduce MAX_JOBS if you OOM on RAM and wait 5 minutes or so...
MAX_JOBS=8 uv pip install gptqmodel --no-build-isolation
uv pip install -r requirements_eval.txt
uv pip install tokenicer device-smi logbar
CMAKE_ARGS="-DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1" \
    uv pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir

python eval/compare_q.py \
    --help

# example data and model spec files live here:
# https://github.com/turboderp-org/exllamav3/tree/master/eval/spec
# dataspec.json
{
  "dataset": "wiki2",
  "eval_stride": 512,
  "eval_len": 2048,
  "max_rows": 20,
  "tokenizer_dir": "./models/Qwen3-14B-UG-exl3-3.47bpw/",
  "tokenize_fn": "transformers"
}
# modelspec.jsonl
# despite the name it doesn't seem to be JSONL, but a JSON list of specs like those found in ./eval/spec/
# maybe the idea is to put a bunch into a directory and grab them with a glob e.g. `spec/Qwen3-14B*.json`
# if you specify a folder for "out_logits" in a model spec it will save the big baseline KLD logits there for comparison with other models
#    you would do this only for a bf16 / fully unquantized model, which is then used as the baseline for comparisons (see the sketch after the example below)
#    "out_logits": "/mnt/astrodata/llm/models/ubergarm/Qwen3-14B-GGUF/kld-base-logits-Qwen3-14B-BF16/"
# modelspec.jsonl
[
    {
        "load_fn": "exllamav3",
        "fwd_fn": "exllamav3",
        "label": "EXL3 3.47bpw ubergarm",
        "model_dir": "./models/Qwen3-14B-UG-exl3-3.47bpw/"
    }
]
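
For the KLD workflow described in the comments above, a two-entry modelspec with the unquantized bf16 as the baseline might look like this (a hedged sketch that only reuses keys and paths already shown in these notes, and assumes the exllamav3 load_fn can also load the unquantized model; untested):

# hypothetical two-entry modelspec.jsonl: first entry is the bf16 baseline writing out_logits,
# second entry is the quant to be compared against it
cat > modelspec.jsonl << 'EOF'
[
    {
        "load_fn": "exllamav3",
        "fwd_fn": "exllamav3",
        "label": "BF16 baseline",
        "model_dir": "/mnt/astrodata/llm/models/Qwen/Qwen3-14B/",
        "out_logits": "/mnt/astrodata/llm/models/ubergarm/Qwen3-14B-GGUF/kld-base-logits-Qwen3-14B-BF16/"
    },
    {
        "load_fn": "exllamav3",
        "fwd_fn": "exllamav3",
        "label": "EXL3 3.47bpw ubergarm",
        "model_dir": "./models/Qwen3-14B-UG-exl3-3.47bpw/"
    }
]
EOF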

# You can collect perplexity alone, but now I realize
# you can gather it at the same time as KLD, so maybe skip this one
# Perplexity
python eval/compare_q.py \
    -d dataspec.json \
    -m modelspec.jsonl \
    --plot \
    --vram \
    --title "Qwen3-14B Quant Perplexity vs VRAM"

# KLD seems to require you to run it on the baseline first and have a logits file
# add your bf16 to the modelspec list; it's the only entry with "out_logits" pointing to a directory to hold the KLD baseline
# then the rest of the quants will get their KLD compared against that baseline
# i believe it will grab PPL as well during the run so this is probably the way to go about it
# To clean up old stuff if necessary: `rm -rf ./eval/__disk_lru_cache__/`
python eval/compare_q.py \
    -d dataspec.json \
    -m modelspec.jsonl \
    --plot \
    --kld

# KLD
# this is for a one-off KLD, but using the command above is the way to get those nice plots
# The only reference is here: https://github.com/turboderp-org/exllamav3/pull/26#issuecomment-2855565926
# python eval/model_diff.py -ma /mnt/models/test_model -mb /mnt/models/ref_model -r 5
# WOW, it gives perplexity and KLD stats against the *full unquantized model*
# without needing enough VRAM to load the full bf16, as it seems to go one layer at a time!?
# but you have to keep the `-r N` value very low or it will OOM at the very end...
python eval/model_diff.py \
    -ma ./models/Qwen3-14B-exl3-3.0bpw/ \
    -mb /mnt/astrodata/llm/models/Qwen/Qwen3-14B/ \
    -r 10

# -- A perplexity: 11.42925760
# -- B perplexity: 10.72348506
# -- A label in top-K:
#      K = 1: 0.5059
#      K = 2: 0.6207
#      K = 3: 0.6871
#      K = 4: 0.7240
#      K = 5: 0.7520
# -- B label in top-K:
#      K = 1: 0.5153
#      K = 2: 0.6292
#      K = 3: 0.6918
#      K = 4: 0.7299
#      K = 5: 0.7572
# -- Top-K agreement, A vs B:
#      K = 1: 0.8999
#      K = 2: 0.7209
#      K = 3: 0.5231
#      K = 4: 0.3565
#      K = 5: 0.2243
# -- KL divergence (A, B):  0.05868838 # <--- this must be how much A diverges from reference bf16 B ?
# -- KL divergence (B, A):  0.06291481

python eval/model_diff.py \
    -ma ./models/Qwen3-14B-UG-exl3-3.47bpw/ \
    -mb /mnt/astrodata/llm/models/Qwen/Qwen3-14B/ \
    -r 10

# -- A perplexity: 11.09519256
# -- B perplexity: 10.72348506
# -- A label in top-K:
#      K = 1: 0.5121
#      K = 2: 0.6269
#      K = 3: 0.6879
#      K = 4: 0.7256
#      K = 5: 0.7532
# -- B label in top-K:
#      K = 1: 0.5153
#      K = 2: 0.6292
#      K = 3: 0.6918
#      K = 4: 0.7299
#      K = 5: 0.7572
# -- Top-K agreement, A vs B:
#      K = 1: 0.9260
#      K = 2: 0.7836
#      K = 3: 0.6134
#      K = 4: 0.4474
#      K = 5: 0.3130
# -- KL divergence (A, B):  0.03526903
# -- KL divergence (B, A):  0.03676973

tabbyAPI

Add an OpenAI-compliant API endpoint to serve your fresh exl3 quants!

git clone [email protected]:theroyallab/tabbyAPI.git
cd tabbyAPI
git checkout exl3
git rebase main
uv venv ./venv --python 3.11 --python-preference=only-managed
source venv/bin/activate

# *IMPORTANT* Install the exact version of exllamav3 you built above!
uv pip install -U .[cu121]
uv pip uninstall exllamav3
cd ../exllamav3/ && uv pip install . && cd -

# edit config to point at your models directory and specify the default model path name to load
cp config_sample.yml config.yaml
vi config.yaml

# finally run the api endpoint and connect with your fav client
python main.py --config config.yaml
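
Once it's up, a quick smoke test against the OpenAI-style chat endpoint (a hedged sketch assuming the default 127.0.0.1:5000 bind; the API key is whatever tabbyAPI generated for you):

# hypothetical request; tabbyAPI serves whichever model you configured above
curl http://127.0.0.1:5000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TABBY_API_KEY" \
    -d '{"messages": [{"role": "user", "content": "How are you?"}], "max_tokens": 128}'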

Compare to ik_llama.cpp?

How to compare perplexity and KLD against ik_llama.cpp-exclusive quant types, e.g. iq4_ks etc.?

As there is no llama-cpp-python support for ik_llama.cpp, you can't just use it that way, pretty sure...

Plan A

Use the same settings and corpus and just run it manually with ik_llama.cpp's llama-perplexity, e.g.

  1. Get the exact wiki2 wikitext-2-raw-v1 test split used by ppl.py and write it to a file (see the sketch after this list).
  2. Run llama-perplexity with the same settings as ppl.py or the dataspec.json, e.g.
    • eval_len = 2048 = params.n_ctx, probably, as in eval/compare_q_llama.cpp
    • eval_stride = 512 = params.ppl_stride, or params.n_batch, or params.n_ctx?
    • max_rows = 100 = params.n_chunks? note ik_llama.cpp does params.n_ctx += params.ppl_stride/2 for strided perplexity
  3. Test ppl.py vs ik_llama.cpp llama-perplexity on the same GGUF, e.g. a Q4_K_M type
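
A rough sketch of steps 1 and 2 (assumes the `datasets` package for the dump; the GGUF path is a placeholder, and exact stride/ctx equivalence is still the open question above):

# dump the same wikitext-2-raw-v1 test split that ppl.py's wiki2 dataset uses to a plain text file
uv pip install datasets
python -c 'from datasets import load_dataset; open("wiki.test.raw", "w").write("".join(r["text"] for r in load_dataset("wikitext", "wikitext-2-raw-v1", split="test")))'

# then feed it to ik_llama.cpp with settings that roughly match dataspec.json
./build/bin/llama-perplexity \
    -m /mnt/models/Qwen_Qwen3-14B-Q4_K_L.gguf \
    -f wiki.test.raw \
    --ctx-size 2048 \
    --batch-size 2048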

Is exllamav3 tokenizing the input text the same way as llama.cpp?
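
One way to eyeball that: compare token counts from the HF tokenizer (what the "transformers" tokenize_fn uses) and llama-cpp-python on the same file (a hedged sketch; the model paths are placeholders):

# hypothetical check; loads the GGUF with vocab_only so it doesn't need the weights in VRAM
python -c '
from transformers import AutoTokenizer
from llama_cpp import Llama
text = open("wiki.test.raw").read()
hf = AutoTokenizer.from_pretrained("/mnt/models/Qwen/Qwen3-14B/")
gg = Llama(model_path="/mnt/models/Qwen_Qwen3-14B-Q4_K_L.gguf", vocab_only=True, verbose=False)
print("transformers:", len(hf(text)["input_ids"]))
print("llama.cpp   :", len(gg.tokenize(text.encode("utf-8"), add_bos=False)))
'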

Experience

  • So ik's wiki.test.raw from huggingface seems to match the first 100 "rows", so it's the same text! (about 1.3MiB)

  • exllamav3 says PPL for bartowski Qwen_Qwen3-14B-Q4_K_L.gguf is 8.876035154437135

  • ik_llama.cpp says PPL for bartowski Qwen_Qwen3-14B-Q4_K_L.gguf is 9.1395 +/- 0.07236 (--ctx-size 512 --batch-size 2048)

    • perplexity: calculating perplexity over 584 chunks, n_ctx=512, batch_size=2048, n_seq=4
  • ik_llama.cpp says PPL for bartowski Qwen_Qwen3-14B-Q4_K_L.gguf is 7.7043 +/- 0.05615 (--ctx-size 2048 --batch-size 512)

    • perplexity: calculating perplexity over 146 chunks, n_ctx=2048, batch_size=512, n_seq=1
  • ik_llama.cpp says PPL for bartowski Qwen_Qwen3-14B-Q4_K_L.gguf is 7.5303 +/- 0.05443 (--ctx-size 2048 --batch-size 2048)

    • perplexity: calculating perplexity over 146 chunks, n_ctx=2048, batch_size=2048, n_seq=1
  • llama.cpp says PPL for bartowski Qwen_Qwen3-14B-Q4_K_L.gguf is 7.5304 +/- 0.05443 (--ctx-size 2048 --batch-size 2048)

    • perplexity: calculating perplexity over 146 chunks, n_ctx=2048, batch_size=2048, n_seq=1
  • ik_llama.cpp says PPL for bartowski Qwen_Qwen3-14B-Q4_K_L.gguf is 7.5303 +/- 0.05443 (--ctx-size 2048 --batch-size 8192)

    • perplexity: calculating perplexity over 146 chunks, n_ctx=2048, batch_size=8192, n_seq=4
  • ik_llama.cpp says PPL for bartowski Qwen_Qwen3-14B-Q4_K_L.gguf is 7.4742 +/- 0.06484 (--ctx-size 2048 --batch-size 2048 --chunks 100)

    • perplexity: calculating perplexity over 100 chunks, n_ctx=2048, batch_size=2048, n_seq=1
  • ik_llama.cpp/build/bin/llama-perplexity --ppl-stride 512 -b 2048 # NOPE, setting --ppl-stride switches to the second PPL mode, which is super slow?!

  • ik_llama.cpp/build/bin/llama-perplexity --ctx-size 512 --batch-size 2048 # maybe?

References



Amusing Test of Goofy 1B 2bit ParetoQ QAT'd exl3 2.0bpw Test Quant

My first exl3 quant, lmao...

User: How are you?

Assistant: A bit tired, I must admit. I'm feeling a bit overwhelmed with all the questions and tasks I'm
supposed to help users with. It seems like there's a lot of demand for assistance, and I'm not sure which topic should be
answered first.

But, hey, I'm here to help, so don't worry if I'm not the best person to answer them. I'll do my best to provide you with
the most accurate and helpful information possible.

So, how are you? What's been going on? Is there anything new or unusual about the task I'm working on right now? Or is
there something that's been frustrating for you for some time? Tell me about it! I'm here to help you if you need it!

User: How many letter "r"s are there in the word "strawberry"?

Assistant: The word "strawberry" contains two letter "r"s.

The word "strawberry" has seven letters: S-T-R-A-R-B-Y.

Let me count the letter "r"s:

S-T-R-A-R-B-Y S-T-R-A-R-B-Y S-T-R-A-R-B-Y S-T-R-A-R-B-Y S-T-R-A-R-B-Y S-T-R-A-R-B-Y S-T-R-A-R-B-Y S-T-R-A-R-B-Y
S-T-R-A-R-B-Y S-T-R-A-R-B-Y S-T-R-A-R-B-Y S-T-R-A-R-B-Y S-T-R-A-R-B-Y S-T-R-A-R-B-Y S-T-R-A-R-B-Y S-T-R-A-RB-Y
S-T-R-A-RB-Y S-TR-A-RB-Y S-TR-A-RBY

It looks like there are eight letter "r"s in the word "strawberry". Let me check again.

The word "strawberry" contains eight letter "r"s.

Let me count the letter "r"s:

S-T-R-A-R-B-Y S-T-R-A-RB-Y S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY
S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY
S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY S-TR-A-RBY

I'm happy to see that I was able to accurately count the letter "r"s. The word "strawberry" has eight letter "r"s.

Thank you for pointing out my mistake. I'm glad to have had the chance to correct it. It was a great learning experience
and I'm glad I could help.