llama.cpp - reading the source code

Looking through the llama.cpp source code, to learn about language models. Disclaimer: it's January 2026, so things will likely change at some point.

  • llama.cpp repo link
  • build instructions
    • contains instructions on how to build the debug version
    • default build creates the backend libraries 'libggml-base.so' and 'libggml-cpu.so' - you probably need to set up additional requirements for the other backends.
    • if you need additional backends: documentation on each backend includes build instructions
  • looking at main.cpp of llama-simple - a CLI program that continues a given prompt specified on the command line. command line: ./llama-simple -m model.gguf [-n n_predict] [-ngl n_gpu_layers] {prompt-text}
    • -m <file> - (mandatory) path to the model file in gguf format
    • -ngl <n_gpu_layers> : (default 99) number of layers to offload to the GPU (-1 means all layers)
    • -n <num_output_tokens_to_predict> : (default 32)
    • {prompt-text} : it has a default prompt value: Hello my name is

loading the backend shared libraries

  • each backend has its own shared library (base, cpu, cuda, etc.) - libggml-base.so, libggml-cpu.so, libggml-cuda.so
  • backend api is defined in ggml-backend-impl.h
    • each backend supports a set of functions defined in struct ggml_backend_device_i (file ggml-backend-impl.h)
  • Each backend implementation lives under ggml/src/ (e.g. ggml-cpu, ggml-cuda, ggml-opencl)
  • what is in a 'backend'? (see the code sketch after this list)
    • ggml_backend_reg_t - the backend registration object; all the following structs are in this file
      • has a field for the API version
      • void *context
      • struct ggml_backend_reg_i iface; - the 'registration interface' - this one returns the nested 'devices'
    • struct ggml_backend_reg_i
      • get_name - function returns backend name
      • get_device_count - returns the number of devices (count - 1 is the max index that can be passed to get_device)
      • get_device - takes a device index, returns ggml_backend_dev_t - the 'device' (this contains the interface functions!)
      • void * (*get_proc_address)(ggml_backend_reg_t reg, const char * name); - optional; can return 'custom' functions that are not in the standard device interface (messy interface)
    • ggml_backend_dev_t
      • struct ggml_backend_device_i iface; - the interface functions of the standard device interface finally live here
      • void * context; - device internal context (set by implementation of ggml_backend_init)
      • ggml_backend_reg_t reg;
    • ggml_backend_device_i - type with function pointers that make up the interface supported by a backend 'device'.
  • When loading a single backend instance: ggml_backend_load_best
    • finds all instances of the shared library in the configured search path,
    • loads each shared library & calls ggml_backend_score - this function returns a score number
    • loads the backend with the max positive score (score of zero means - not supported on this machine)
    • to init: calls ggml_backend_init of the shared library - the return value is ggml_backend_reg_t
    • checks that the returned version field in the ggml_backend_reg_t is as expected
      • registers the 'backend' pointer, enumerates all contained 'devices' (that's the struct with the interface table) and registers them too.
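To make the relationships between these structs concrete, here is a short sketch (not verbatim llama.cpp code) that walks one registered backend and its devices through the registration interface described above. Field and function names follow the notes; the exact signatures live in ggml-backend-impl.h and may differ.

```cpp
#include <cstdio>
#include "ggml-backend-impl.h"

// sketch: enumerate the devices exposed by one backend registration.
// reg->iface is the 'registration interface' (struct ggml_backend_reg_i);
// each returned ggml_backend_dev_t carries the standard device interface
// (struct ggml_backend_device_i) in dev->iface plus a backend-private context.
static void dump_backend(ggml_backend_reg_t reg) {
    const char * name  = reg->iface.get_name(reg);
    size_t       n_dev = reg->iface.get_device_count(reg);
    printf("backend '%s' exposes %zu device(s)\n", name, n_dev);

    for (size_t i = 0; i < n_dev; ++i) {
        ggml_backend_dev_t dev = reg->iface.get_device(reg, i);
        // dev->iface   : function pointers of the device interface
        // dev->context : device-internal state set up by the backend
        // dev->reg     : back-pointer to the registration we started from
        (void) dev;
    }
}
```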

load the model

Models are stored in the GGUF format (spec here).

  • The model is stored as a single file. However llama.cpp allows splitting the model into multiple files called 'shards' (the purpose is to bypass a file size limit on HuggingFace?)
  • GGUF models can be loaded via mmap - llama.cpp works directly from the memory-mapped file.
  • The breakup of a GGUF file (main sections; see the dump sketch after this list)
    • header/version
    • metadata - that's a key/value map.
      • the keys are strings, with names like 'general.architecture', 'llama.context_length', 'tokenizer.ggml.tokens'
      • the value types can be either
        • numeric (all numeric integer types, floating point number types)
        • boolean
        • string
        • array
      • an important array entry in the metadata: the vocabulary
        • the vocabulary is a list of basic tokens. Each basic token is a string, a sub-word unit.
        • the role of the vocabulary is to split the input text into token units, where each token unit is then converted into an embedding vector (that's what the LLM works with)
        • how tokens are converted into embedding vectors: each token in the vocabulary list has a token ID (the index of the token in the vocabulary array). This token ID is used as an index into the tensor that stores the embeddings (the 'primary embedding' tensor).
    • a series of tensors (a tensor is an n-dimensional array, n >= 1)
      • each tensor has a name
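To see these sections in practice, here is a minimal sketch that dumps the metadata keys and tensor names of a GGUF file using the gguf_* helper API that ships with ggml; the header name (gguf.h vs. ggml.h in older trees) and the exact signatures are assumptions that may differ between versions.

```cpp
#include <cstdio>
#include "gguf.h"   // older trees declare the gguf_* functions in ggml.h instead

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    // no_alloc = true: only parse header, metadata and tensor info, do not load tensor data
    struct gguf_init_params params = { /* no_alloc = */ true, /* ctx = */ nullptr };
    struct gguf_context * gctx = gguf_init_from_file(argv[1], params);
    if (!gctx) { fprintf(stderr, "failed to read %s\n", argv[1]); return 1; }

    // metadata section: key/value pairs such as 'general.architecture'
    for (int64_t i = 0; i < gguf_get_n_kv(gctx); ++i) {
        printf("kv    : %s\n", gguf_get_key(gctx, i));
    }
    // tensor section: every tensor is addressed by its name
    for (int64_t i = 0; i < gguf_get_n_tensors(gctx); ++i) {
        printf("tensor: %s\n", gguf_get_tensor_name(gctx, i));
    }

    gguf_free(gctx);
    return 0;
}
```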

loading a model

  • llama_model_load_from_file loads a model file. parameters:
    • Full path of model
    • ptr to llama_model_params - this one contains instructions on how to load (including whether to use mmap), and will also hold the 'devices' to be used by the model
  • llama_model_load_from_file_impl - does the loading
    • Enumerates all devices of the GPU and IGPU type backends, adds them to the llama_model_params::devices field
    • Calls llama_model_load
      • delegates all the action to llama_model_loader ctor
    • As part of loading the model: constructs the vocabulary object (now that's a big whopper!)
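Condensed into code, this load path looks roughly like the following (a sketch based on examples/simple/simple.cpp; llama_model_load_from_file, llama_model_free and ggml_backend_load_all are the names in the current llama.h/ggml-backend.h and may be named differently in older releases):

```cpp
#include <cstdio>
#include "llama.h"
#include "ggml-backend.h"

int main(int argc, char ** argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    ggml_backend_load_all();                      // load the backend shared libraries (see above)

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99;                    // same default as llama-simple's -ngl

    // parses the GGUF file (mmap by default), enumerates devices, builds the vocabulary
    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (!model) { fprintf(stderr, "failed to load %s\n", argv[1]); return 1; }

    llama_model_free(model);
    return 0;
}
```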

tokenization of input prompt

Get the vocabulary from the parsed model (on the role of the vocabulary see the previous explanation): `const llama_vocab * vocab = llama_model_get_vocab(model);`

Perform tokenization (a token is a word or sub-word present in the vocabulary)

  • Call down to the text tokenizer: llama_tokenize - a C function that calls into the vocabulary object in order to perform the tokenization process
  • This calls down further into llama_vocab::impl::tokenize
  • Tokenize starts for real.
    • Now it gets messy: there are multiple vocabulary types, each one is handled separately...
    • Let's look at one case, LLAMA_VOCAB_TYPE_SPM (this handles LLaMA tokenization). The tokenization process in this case is handled by the implementation class llm_tokenizer_spm_session.
    • llm_tokenizer_spm_session::tokenize does the tokenization.
      • Split the input into its unicode symbols and store them in the symbols array (each entry also has n - the length of the token that starts with this symbol)
      • Iterate over all bi-grams (pairs of adjacent symbols); if a bi-gram is part of the vocabulary it is added to the work_queue.
      • Next: while the work_queue is not empty, pop an entry and try to extend the recognized vocabulary sub-word, first to the right and then to the left - again, this assumes that the extension is also part of the vocabulary! (try_add_bigram)
      • Last loop: recursive re-segmentation
      • Next loop: build the output array of token ids: pass over all symbols; each start symbol stores the length of the substring that starts with it and has been recognized as a token - look up that token in the vocabulary and add its token id to the output array.
      • Reminder: token vectors are represented by their IDs - the ID is the index of the token vector in the array of vectors kept in the model.
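A sketch of how the caller drives this tokenizer (modeled on llama-simple): llama_tokenize is first called with a null buffer to learn the required token count (returned as a negative number), then called again to fill the buffer. The vocab-based signature with add_special/parse_special flags is taken from the current llama.h and may differ in other versions.

```cpp
#include <cstdio>
#include <string>
#include <vector>
#include "llama.h"

static std::vector<llama_token> tokenize_prompt(const llama_model * model, const std::string & prompt) {
    const llama_vocab * vocab = llama_model_get_vocab(model);

    // first pass: null buffer, the negated return value is the number of tokens needed
    const int n_tokens = -llama_tokenize(vocab, prompt.c_str(), (int32_t) prompt.size(),
                                         nullptr, 0, /*add_special=*/true, /*parse_special=*/true);

    // second pass: fill the buffer with the token ids
    std::vector<llama_token> tokens(n_tokens);
    if (llama_tokenize(vocab, prompt.c_str(), (int32_t) prompt.size(),
                       tokens.data(), (int32_t) tokens.size(),
                       /*add_special=*/true, /*parse_special=*/true) < 0) {
        fprintf(stderr, "tokenization failed\n");
        tokens.clear();
    }
    return tokens;
}
```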

the completion process - finding the next token

  • Context initialization: llama_init_from_model. The LLM context looks at a window of token vectors; this window represents the text that the LLM is processing at the current moment. The aim of completion is to find the most probable next token. That's the whole point of this magic.

    • after checking the parameter object, creates the llama_context object; in the ctor:
      • allocates the context
      • copies parameters from parameter object
      • sets up a list of backends in the context and allocates per context buffer
      • sets up various caches (key/value cache, model file cache)
  • prepare the 'batch' - these are the new tokens that will be added to the context window in the next run of the transformer model. On the first call these are all the tokens that form the prompt.

  • llama_decode - evaluates the current batch with the transformer model...

  • the sampler seems to be picking the next predicted token id from the result. llama_sampler_sample does this

  • llama_token_to_piece - uses vocabulary to turn token_id into token text.

  • the next iteration is prepared by putting the token that was just predicted into a new batch, i.e. it is appended to the context window (?)
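Putting the steps of this section together, the generation loop of llama-simple looks roughly like this (a condensed sketch; llama_init_from_model, llama_batch_get_one, llama_sampler_sample, llama_vocab_is_eog and the llama_token_to_piece signature follow the current llama.h and may have changed in other versions):

```cpp
#include <cstdio>
#include <vector>
#include "llama.h"

static void generate(llama_model * model, std::vector<llama_token> prompt_tokens, int n_predict) {
    const llama_vocab * vocab = llama_model_get_vocab(model);

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx   = (uint32_t) prompt_tokens.size() + n_predict;  // context window: prompt + generated tokens
    cparams.n_batch = (uint32_t) prompt_tokens.size();              // the whole prompt goes into one batch
    llama_context * ctx = llama_init_from_model(model, cparams);

    // greedy sampler: always pick the most probable next token
    llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
    llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

    // first batch = the whole prompt; later batches contain one new token each
    llama_batch batch = llama_batch_get_one(prompt_tokens.data(), (int32_t) prompt_tokens.size());

    llama_token new_token;
    for (int i = 0; i < n_predict; ++i) {
        if (llama_decode(ctx, batch) != 0) break;           // run the transformer on the batch

        new_token = llama_sampler_sample(smpl, ctx, -1);    // pick the next predicted token id
        if (llama_vocab_is_eog(vocab, new_token)) break;    // stop at end-of-generation

        char buf[128];
        int n = llama_token_to_piece(vocab, new_token, buf, sizeof(buf), 0, true);
        if (n > 0) fwrite(buf, 1, n, stdout);               // token id -> token text

        batch = llama_batch_get_one(&new_token, 1);         // feed the predicted token back in
    }
    printf("\n");

    llama_sampler_free(smpl);
    llama_free(ctx);
}
```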


to-be-continued.