This document explains how to do LoRA hot-swapping via llama.cpp's Python wrapper, llama-cpp-python.
This provides extra docs for my pull request: abetlen/llama-cpp-python#1817
For now, the llama-cpp-python main branch doesn't have LoRA hot-swapping support merged, so this guide builds the code from the pull request branch to test out LoRA support.
Prerequisites: git and uv.
Get a clean repo for testing.
git clone https://github.com/abetlen/llama-cpp-python.git llama-cpp-python
cd llama-cpp-python
git remote add rd https://github.com/richdougherty/llama-cpp-python.git
git fetch rd update-lora-api
git switch update-lora-api
git submodule update --init --recursive
cd ..
We will create a Python virtual environment to install the library and test it.
For testing we use Python 3.8, which is the oldest version supported by llama-cpp-python.
uv venv lora-demo-venv --python 3.8
source lora-demo-venv/bin/activate
# Try to import llama_cpp before building - should fail
python -c 'import llama_cpp'
uv pip install --upgrade pip
# As per llama-cpp-python README.md - builds llama.cpp as well
pip install -e .
# Now the import will work
python -c 'import llama_cpp'
Download the model - note that we are using a quantised model.
pip install huggingface_hub[cli]
export MODEL_GGUF=$(huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q4_K_S.gguf)
Now run the Python interpreter.
python
Python 3.8.20 (default, Oct 2 2024, 16:34:12)
>>> import llama_cpp
>>> import os
>>> llm = llama_cpp.Llama(os.environ['MODEL_GGUF'])
>>> llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt="Question: What shape is an orange?\n\nAnswer:", stop=["\n"])
{'id': 'cmpl-03e8d861-3dbc-49c3-bd62-04f5b02f5491', 'object': 'text_completion', 'created': 1732428089, 'model': '$HOME/.cache/huggingface/hub/models--TheBloke--Mistral-7B-v0.1-GGUF/snapshots/d4ae605152c8de0d6570cf624c083fa57dd0d551/mistral-7b-v0.1.Q4_K_S.gguf', 'choices': [{'text': ' A sphere.', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 14, 'completion_tokens': 4, 'total_tokens': 18}}
Now exit the demo venv.
deactivate
Now we're going to convert some LoRAs into GGUF format so we can use them.
First we install the llama.cpp conversion script dependencies.
uv venv lora-convert-venv --python 3.12
source lora-convert-venv/bin/activate
# Install base requirements before lora_to_gguf requirements, due to issue with numpy in torch repo
uv pip install -r vendor/llama.cpp/requirements/requirements-convert_legacy_llama.txt
# Install torch needed to convert LoRA adapters
uv pip install -r vendor/llama.cpp/requirements/requirements-convert_lora_to_gguf.txt
Then we get the params and tokenizer info for the base model (note that we don't need the weights).
export BASE_MODEL_SETTINGS_DIR=$(huggingface-cli download mistralai/Mistral-7B-v0.1 config.json tokenizer.json)
Now we download some LoRA adapters and convert them. We'll use some from the LoRAX project, since there are a lot of them.
mkdir adapters
vendor/llama.cpp/convert_lora_to_gguf.py --base $BASE_MODEL_SETTINGS_DIR --outfile adapters/lora_conllpp.gguf \
$(huggingface-cli download predibase/conllpp adapter_config.json adapter_model.safetensors)
vendor/llama.cpp/convert_lora_to_gguf.py --base $BASE_MODEL_SETTINGS_DIR --outfile adapters/lora_dbpedia.gguf \
$(huggingface-cli download predibase/dbpedia adapter_config.json adapter_model.safetensors)
vendor/llama.cpp/convert_lora_to_gguf.py --base $BASE_MODEL_SETTINGS_DIR --outfile adapters/lora_tldr_content_gen.gguf \
$(huggingface-cli download predibase/tldr_content_gen adapter_config.json adapter_model.safetensors)
vendor/llama.cpp/convert_lora_to_gguf.py --base $BASE_MODEL_SETTINGS_DIR --outfile adapters/lora_tldr_headline_gen.gguf \
$(huggingface-cli download predibase/tldr_headline_gen adapter_config.json adapter_model.safetensors)
vendor/llama.cpp/convert_lora_to_gguf.py --base $BASE_MODEL_SETTINGS_DIR --outfile adapters/lora_magicoder.gguf \
$(huggingface-cli download predibase/magicoder adapter_config.json adapter_model.safetensors)
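The five conversion commands above all follow one pattern, so they could also be scripted. Here is a sketch in Python; the `convert_cmd` helper is our own, not part of llama.cpp, and the conversions only run when `BASE_MODEL_SETTINGS_DIR` is set so the command construction can be checked separately.

```python
import os
import subprocess

ADAPTERS = ["conllpp", "dbpedia", "tldr_content_gen", "tldr_headline_gen", "magicoder"]

def convert_cmd(name, base_dir, adapter_dir):
    """Build the convert_lora_to_gguf.py command line for one adapter."""
    return ["vendor/llama.cpp/convert_lora_to_gguf.py",
            "--base", base_dir,
            "--outfile", "adapters/lora_{}.gguf".format(name),
            adapter_dir]

# Only run the conversions when the environment from this guide is set up.
if os.environ.get("BASE_MODEL_SETTINGS_DIR"):
    base = os.environ["BASE_MODEL_SETTINGS_DIR"]
    os.makedirs("adapters", exist_ok=True)
    for name in ADAPTERS:
        # huggingface-cli prints the local download directory on stdout.
        adapter_dir = subprocess.check_output(
            ["huggingface-cli", "download", "predibase/" + name,
             "adapter_config.json", "adapter_model.safetensors"],
            text=True).strip()
        subprocess.check_call(convert_cmd(name, base, adapter_dir))
```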
Now that we've converted the adapters we exit the venv.
deactivate
Activate the demo venv that we created at the start of this guide.
source lora-demo-venv/bin/activate
Run the Python interpreter. Examples of text are taken from the READMEs of the adapters' HF repos.
python
>>> import llama_cpp
>>> import os
>>> llm = llama_cpp.Llama(os.environ['MODEL_GGUF'])
>>> llm.lora_adapters
>>> llm.set_lora_adapter_scale('./adapters/lora_tldr_headline_gen.gguf', 1.0)
>>> llm.lora_adapters
{'./adapters/lora_tldr_headline_gen.gguf': 1.0}
>>> llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt="The following passage is content from a news report. Please summarize this passage in one sentence or less.\n\nPassage: Jeffrey Berns, CEO of Blockchains LLC, wants the Nevada government to allow companies like his to form local governments on land they own, granting them power over everything from schools to law enforcement. Berns envisions a city based on digital currencies and blockchain storage. His company is proposing to build a 15,000 home town 12 miles east of Reno. Nevada Lawmakers have responded with intrigue and skepticism. The proposed legislation has yet to be formally filed or discussed in public hearings.\n\nSummary: ", stop=["\n"])
{'id': 'cmpl-c2cee6f9-122c-4243-b469-e11446417ac3', 'object': 'text_completion', 'created': 1732429888, 'model': '$HOME/.cache/huggingface/hub/models--TheBloke--Mistral-7B-v0.1-GGUF/snapshots/d4ae605152c8de0d6570cf624c083fa57dd0d551/mistral-7b-v0.1.Q4_K_S.gguf', 'choices': [{'text': 'Blockchain CEO wants to build a city in Nevada ', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 146, 'completion_tokens': 12, 'total_tokens': 158}}
>>> llm.set_lora_adapter_scale('./adapters/lora_tldr_headline_gen.gguf', 0.0)
>>> llm.lora_adapters
{'./adapters/lora_tldr_headline_gen.gguf': 0.0}
>>> llm.set_lora_adapter_scale('./adapters/lora_tldr_content_gen.gguf', 1.0)
>>> llm.lora_adapters
{'./adapters/lora_tldr_headline_gen.gguf': 0.0, './adapters/lora_tldr_content_gen.gguf': 1.0}
>>> llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt="The following headline is the headline of a news report. Please write the content of the news passage based on only this headline.\n\nHeadline: Latest success from Google’s AI group: Controlling a fusion reactor \n\nContent: ", stop=["\n"])
{'id': 'cmpl-b48f45d3-bdd8-457f-8b16-27119a06d1c3', 'object': 'text_completion', 'created': 1732430123, 'model': '$HOME/.cache/huggingface/hub/models--TheBloke--Mistral-7B-v0.1-GGUF/snapshots/d4ae605152c8de0d6570cf624c083fa57dd0d551/mistral-7b-v0.1.Q4_K_S.gguf', 'choices': [{'text': "Google's AI group has developed a system that can control a nuclear fusion reactor. The system uses machine learning to control the plasma in a nuclear fusion reactor. It can control the plasma in a way that humans can't. The system is able to control the plasma in a way that humans can't. It is able to control the plasma in a way that humans can't.", 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 56, 'completion_tokens': 82, 'total_tokens': 138}}
>>> llm.set_lora_adapter_scale('./adapters/lora_tldr_content_gen.gguf', 0.0)
>>> llm.lora_adapters
{'./adapters/lora_tldr_headline_gen.gguf': 0.0, './adapters/lora_tldr_content_gen.gguf': 0.0}
>>> llm.set_lora_adapter_scale('./adapters/lora_dbpedia.gguf', 1.0)
>>> llm.lora_adapters
{'./adapters/lora_tldr_headline_gen.gguf': 0.0, './adapters/lora_tldr_content_gen.gguf': 0.0, './adapters/lora_dbpedia.gguf': 1.0}
>>> llm.create_completion(seed=12345, temperature=0, max_tokens=256, prompt="You are given the title and the body of an article below. Please determine the type of the article.\n### Title: Great White Whale\n\n### Body: Great White Whale is the debut album by the Canadian rock band Secret and Whisper. The album was in the works for about a year and was released on February 12 2008. A music video was shot in Pittsburgh for the album's first single XOXOXO. The album reached number 17 on iTunes's top 100 albums in its first week on sale.\n\n### Article Type: ", stop=["\n"])
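The swap pattern used above (zero the old adapter's scale, enable the new one) can be wrapped in a small helper. This is a sketch against the `lora_adapters` / `set_lora_adapter_scale` API from the PR branch; the `use_only_adapter` name is our own.

```python
def use_only_adapter(llm, path, scale=1.0):
    """Hot-swap helper (sketch): zero every loaded adapter, then enable
    just the adapter at `path` with the given scale."""
    # llm.lora_adapters is None until the first adapter is loaded.
    for loaded in list(llm.lora_adapters or {}):
        llm.set_lora_adapter_scale(loaded, 0.0)
    llm.set_lora_adapter_scale(path, scale)
```

For example, `use_only_adapter(llm, './adapters/lora_dbpedia.gguf')` disables whichever adapter was active and enables the dbpedia one, without reloading the base model.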
Deactivate the demo venv.
deactivate
Now create a separate venv for testing the server.
uv venv lora-server-venv --python 3.8
source lora-server-venv/bin/activate
uv pip install --upgrade pip
pip install -e .[server]
Get the model (if not already done earlier).
pip install huggingface_hub[cli]
export MODEL_GGUF=$(huggingface-cli download TheBloke/Mistral-7B-v0.1-GGUF mistral-7b-v0.1.Q4_K_S.gguf)
Create a server settings file.
SERVER_CACHE=false
SERVER_VERBOSE=true
cat >server-settings.json <<EOF
{
"host": "127.0.0.1",
"port": 8080,
"models": [
{
"model_alias": "mistral",
"model": "$MODEL_GGUF",
"cache": $SERVER_CACHE,
"verbose": $SERVER_VERBOSE
},
{
"model_alias": "conllpp",
"model": "$MODEL_GGUF",
"lora_adapters": {
"adapters/lora_conllpp.gguf": 1.0
},
"cache": $SERVER_CACHE,
"verbose": $SERVER_VERBOSE
},
{
"model_alias": "dbpedia",
"model": "$MODEL_GGUF",
"lora_adapters": {
"adapters/lora_dbpedia.gguf": 1.0
},
"cache": $SERVER_CACHE,
"verbose": $SERVER_VERBOSE
},
{
"model_alias": "magicoder",
"model": "$MODEL_GGUF",
"lora_adapters": {
"adapters/lora_magicoder.gguf": 1.0
},
"cache": $SERVER_CACHE,
"verbose": $SERVER_VERBOSE
},
{
"model_alias": "tldr_content_gen",
"model": "$MODEL_GGUF",
"lora_adapters": {
"adapters/lora_tldr_content_gen.gguf": 1.0
},
"cache": $SERVER_CACHE,
"verbose": $SERVER_VERBOSE
},
{
"model_alias": "tldr_headline_gen",
"model": "$MODEL_GGUF",
"lora_adapters": {
"adapters/lora_tldr_headline_gen.gguf": 1.0
},
"cache": $SERVER_CACHE,
"verbose": $SERVER_VERBOSE
}
]
}
EOF
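Since each adapter entry in the settings file differs only in its alias and GGUF path, the same file could also be generated with a short script. A sketch, assuming the adapter files produced earlier in this guide:

```python
import json
import os

# Adapter aliases from this guide; each maps to adapters/lora_<name>.gguf.
ADAPTERS = ["conllpp", "dbpedia", "magicoder", "tldr_content_gen", "tldr_headline_gen"]
MODEL = os.environ.get("MODEL_GGUF", "mistral-7b-v0.1.Q4_K_S.gguf")

# Plain base model first, then one entry per adapter.
models = [{"model_alias": "mistral", "model": MODEL, "cache": False, "verbose": True}]
for name in ADAPTERS:
    models.append({
        "model_alias": name,
        "model": MODEL,
        "lora_adapters": {"adapters/lora_{}.gguf".format(name): 1.0},
        "cache": False,
        "verbose": True,
    })

settings = {"host": "127.0.0.1", "port": 8080, "models": models}
with open("server-settings.json", "w") as f:
    json.dump(settings, f, indent=2)
```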
Run the server.
python -m llama_cpp.server --config_file server-settings.json
In another terminal try some completions.
curl -X POST http://localhost:8080/v1/completions \
-H 'accept: application/json' -H 'Content-Type: application/json' \
-d '{
"model": "mistral",
"temperature": 0, "seed": 12345, "max_tokens": 256,
"prompt": "Question: What shape is an orange?\n\nAnswer:",
"stop": ["\n"]
}'
{"id":"cmpl-2b224335-4016-4417-8b3a-c3288bb813a2","object":"text_completion","created":1732431287,"model":"mistral","choices":[{"text":" A sphere.","index":0,"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":14,"completion_tokens":4,"total_tokens":18}}
curl -X POST http://localhost:8080/v1/completions \
-H 'accept: application/json' -H 'Content-Type: application/json' \
-d '{
"model": "tldr_headline_gen",
"temperature": 0, "seed": 12345, "max_tokens": 256,
"prompt": "The following passage is content from a news report. Please summarize this passage in one sentence or less.\n\nPassage: Jeffrey Berns, CEO of Blockchains LLC, wants the Nevada government to allow companies like his to form local governments on land they own, granting them power over everything from schools to law enforcement. Berns envisions a city based on digital currencies and blockchain storage. His company is proposing to build a 15,000 home town 12 miles east of Reno. Nevada Lawmakers have responded with intrigue and skepticism. The proposed legislation has yet to be formally filed or discussed in public hearings.\n\nSummary: ",
"stop": ["\n"]
}'
{"id":"cmpl-f260baef-bd63-4734-bf5c-c3253b931213","object":"text_completion","created":1732431350,"model":"tldr_headline_gen","choices":[{"text":"Blockchain CEO wants to build a city in Nevada ","index":0,"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":146,"completion_tokens":12,"total_tokens":158}}
curl -X POST http://localhost:8080/v1/completions \
-H 'accept: application/json' -H 'Content-Type: application/json' \
-d '{
"model": "tldr_content_gen",
"temperature": 0, "seed": 12345, "max_tokens": 256,
"prompt": "The following headline is the headline of a news report. Please write the content of the news passage based on only this headline.\n\nHeadline: Latest success from Google’s AI group: Controlling a fusion reactor \n\nContent: ",
"stop": ["\n"]
}'
{"id":"cmpl-a29a33be-c62c-4028-938e-c278c706c902","object":"text_completion","created":1732431424,"model":"tldr_content_gen","choices":[{"text":"Google's AI group has developed a system that can control a nuclear fusion reactor. The system uses machine learning to predict the behavior of plasma in a nuclear fusion reactor and then adjusts the reactor's magnetic fields accordingly. It is able to keep the plasma stable for longer than humans could, which means that more energy can be produced from each reaction. The technology could eventually lead to a new source of clean energy.","index":0,"logprobs":null,"finish_reason":"stop"}],"usage":{"prompt_tokens":56,"completion_tokens":88,"total_tokens":144}}
curl -X POST http://localhost:8080/v1/completions \
-H 'accept: application/json' -H 'Content-Type: application/json' \
-d '{
"model": "dbpedia",
"temperature": 0, "seed": 12345, "max_tokens": 256,
"prompt": "You are given the title and the body of an article below. Please determine the type of the article.\n### Title: Great White Whale\n\n### Body: Great White Whale is the debut album by the Canadian rock band Secret and Whisper. The album was in the works for about a year and was released on February 12 2008. A music video was shot in Pittsburgh for the album'\''s first single XOXOXO. The album reached number 17 on iTunes'\''s top 100 albums in its first week on sale.\n\n### Article Type: ",
"stop": ["\n"]
}'
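The curl calls above can also be made from Python with nothing but the standard library. A sketch; the `complete` helper is our own, wrapping the server's OpenAI-compatible `/v1/completions` endpoint.

```python
import json
import urllib.request

def complete(model, prompt, url="http://localhost:8080/v1/completions"):
    """POST a completion request to the llama-cpp-python server and
    return the generated text (sketch)."""
    body = json.dumps({
        "model": model,
        "temperature": 0, "seed": 12345, "max_tokens": 256,
        "prompt": prompt,
        "stop": ["\n"],
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

With the server from this guide running, `complete("mistral", "Question: What shape is an orange?\n\nAnswer:")` sends the same request as the first curl example, while changing `model` to an adapter alias such as `"dbpedia"` selects the corresponding LoRA.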