See geerlingguy/ai-benchmarks#21 (comment)
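The binary path used below suggests a local CMake build of llama.cpp with the CUDA backend enabled. A minimal sketch of such a build, assuming a standard llama.cpp checkout (the `~/llama.cpp/source` path is taken from the commands below; everything else is the stock CMake flow):

```bash
# Minimal sketch: build llama-bench with the CUDA backend.
# The checkout path matches the binary path used in the benchmarks below.
git clone https://github.com/ggml-org/llama.cpp.git ~/llama.cpp/source
cd ~/llama.cpp/source
cmake -B build -DGGML_CUDA=ON           # enable the CUDA backend
cmake --build build --config Release -j # llama-bench lands in build/bin/
```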
**Command and output:**

```console
$ wget https://huggingface.co/bartowski/Llama-3.2-3B-Instruct-GGUF/resolve/main/Llama-3.2-3B-Instruct-Q4_K_M.gguf
$ ~/llama.cpp/source/build/bin/llama-bench -m ~/models/Llama-3.2-3B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | CUDA | 99 | pp512 | 5447.12 ± 21.94 |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | CUDA | 99 | pp4096 | 3451.02 ± 6.83 |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | CUDA | 99 | tg128 | 103.92 ± 0.34 |
| llama 3B Q4_K - Medium | 1.87 GiB | 3.21 B | CUDA | 99 | pp4096+tg128 | 1520.61 ± 0.97 |
build: 9515c613 (6097)
```

| test | RTX 4000 SFF Ada (t/s) | Framework Desktop (t/s) |
|---|---|---|
| pp512 | 5447.12 ± 21.94 | 1581.18 ± 10.80 |
| pp4096 | 3451.02 ± 6.83 | 1059.94 ± 1.85 |
| tg128 | 103.92 ± 0.34 | 88.14 ± 1.28 |
| pp4096+tg128 | 1520.61 ± 0.97 | 652.67 ± 1.39 |
**Command and output:**

```console
$ wget https://huggingface.co/bartowski/Qwen2.5-14B-Instruct-GGUF/resolve/main/Qwen2.5-14B-Instruct-Q4_K_M.gguf
$ ~/llama.cpp/source/build/bin/llama-bench -m ~/models/Qwen2.5-14B-Instruct-Q4_K_M.gguf -n 128 -p 512,4096 -pg 4096,128 -ngl 99 -r 2
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen2 14B Q4_K - Medium | 8.37 GiB | 14.77 B | CUDA | 99 | pp512 | 1252.52 ± 21.25 |
| qwen2 14B Q4_K - Medium | 8.37 GiB | 14.77 B | CUDA | 99 | pp4096 | 919.03 ± 0.74 |
| qwen2 14B Q4_K - Medium | 8.37 GiB | 14.77 B | CUDA | 99 | tg128 | 27.23 ± 0.02 |
| qwen2 14B Q4_K - Medium | 8.37 GiB | 14.77 B | CUDA | 99 | pp4096+tg128 | 417.29 ± 1.54 |
build: 9515c613 (6097)
```

| test | RTX 4000 SFF Ada (t/s) | Framework Desktop (t/s) |
|---|---|---|
| pp512 | 1252.52 ± 21.25 | 321.48 ± 0.61 |
| pp4096 | 919.03 ± 0.74 | 266.84 ± 0.17 |
| tg128 | 27.23 ± 0.02 | 22.97 ± 0.16 |
| pp4096+tg128 | 417.29 ± 1.54 | 184.66 ± 0.36 |
**Command and output:**

```console
$ wget https://huggingface.co/unsloth/gpt-oss-20b-GGUF/resolve/main/gpt-oss-20b-F16.gguf
$ ~/llama.cpp/source/build/bin/llama-bench -m ~/models/gpt-oss-20b-F16.gguf --threads 32 -n 128 -p 512 -pg 512,128 -ngl 125 -r 2
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA RTX 4000 SFF Ada Generation, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | --------------: | -------------------: |
| gpt-oss ?B F16 | 12.83 GiB | 20.91 B | CUDA | 125 | 32 | pp512 | 2006.79 ± 11.89 |
| gpt-oss ?B F16 | 12.83 GiB | 20.91 B | CUDA | 125 | 32 | tg128 | 57.69 ± 0.06 |
| gpt-oss ?B F16 | 12.83 GiB | 20.91 B | CUDA | 125 | 32 | pp512+tg128 | 251.00 ± 0.29 |
build: 9515c613 (6097)
```

| test | RTX 4000 SFF Ada (t/s) | Framework Desktop (t/s) |
|---|---|---|
| pp512 | 2006.79 ± 11.89 | 564.23 ± 0.46 |
| tg128 | 57.69 ± 0.06 | 45.01 ± 0.05 |
| pp512+tg128 | 251.00 ± 0.29 | 167.77 ± 0.09 |
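For context on the test labels: `pp512`/`pp4096` measure prompt processing over a 512- or 4096-token prompt (compute-bound), `tg128` measures generating 128 tokens (largely memory-bandwidth-bound), and `pp4096+tg128` runs both back to back. A quick back-of-the-envelope of the speedups from the tables above (mean t/s only, error bars ignored):

```bash
# Speedup of the RTX 4000 SFF Ada over the Framework Desktop,
# computed from the comparison tables above.
awk 'BEGIN {
  printf "Llama 3.2 3B  pp512: %.2fx  tg128: %.2fx\n", 5447.12/1581.18, 103.92/88.14
  printf "Qwen2.5 14B   pp512: %.2fx  tg128: %.2fx\n", 1252.52/321.48,  27.23/22.97
  printf "gpt-oss 20B   pp512: %.2fx  tg128: %.2fx\n", 2006.79/564.23,  57.69/45.01
}'
```

Prompt processing comes out roughly 3.4-3.9x faster on the RTX 4000, while token generation is only about 1.2-1.3x faster, consistent with tg being limited by memory bandwidth rather than compute.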