Phi-4 Unsloth notebook
{
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3",
"language": "python"
},
"language_info": {
"name": "python",
"version": "3.10.12",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"accelerator": "GPU",
"colab": {
"gpuType": "T4",
"provenance": [],
"name": "Phi-4 Unsloth notebook",
"include_colab_link": true
},
"kaggle": {
"accelerator": "nvidiaTeslaT4",
"dataSources": [],
"isInternetEnabled": true,
"language": "python",
"sourceType": "notebook",
"isGpuEnabled": true
}
},
"nbformat_minor": 0,
"nbformat": 4,
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/ssghost/0a23eb44be9d435cbeaceb18b0fb2e5c/phi-4-unsloth-notebook.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"To run this, press \"*Runtime*\" and press \"*Run all*\" on a **free** Tesla T4 Google Colab instance!\n", | |
"<div class=\"align-center\">\n", | |
"<a href=\"https://unsloth.ai/\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png\" width=\"115\"></a>\n", | |
"<a href=\"https://discord.gg/unsloth\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/Discord button.png\" width=\"145\"></a>\n", | |
"<a href=\"https://docs.unsloth.ai/\"><img src=\"https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true\" width=\"125\"></a></a> Join Discord if you need help + ⭐ <i>Star us on <a href=\"https://github.com/unslothai/unsloth\">Github</a> </i> ⭐\n", | |
"</div>\n", | |
"\n", | |
"To install Unsloth on your own computer, follow the installation instructions [here](https://docs.unsloth.ai/get-started/installing-+-updating).\n", | |
"\n", | |
"You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save).\n", | |
"\n", | |
"Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).\n", | |
"\n" | |
], | |
"metadata": { | |
"id": "hwuDO656rQM4" | |
} | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"**[NEW] We've fixed many bugs in Phi-4** which greatly increases Phi-4's accuracy. See our [blogpost](https://unsloth.ai/blog/phi4)\n", | |
"\n", | |
"[NEW] You can view all Phi-4 model uploads with our bug fixes including [dynamic 4-bit quants](https://unsloth.ai/blog/dynamic-4bit), GGUF & more [here](https://huggingface.co/collections/unsloth/phi-4-all-versions-677eecf93784e61afe762afa)\n", | |
"\n", | |
"[NEW] As of Novemeber 2024, Unsloth now supports [vision finetuning](https://unsloth.ai/blog/vision)!\n" | |
],
"metadata": {
"id": "-EbJ7cIVrQM8"
}
},
{
"cell_type": "code",
"source": [
"%%capture\n",
"!pip install unsloth\n",
"# Also get the latest nightly Unsloth if you want!\n",
"# !pip install --force-reinstall --no-cache-dir --no-deps git+https://github.com/unslothai/unsloth.git"
],
"metadata": {
"id": "oakVJTAArQM-"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"from unsloth import FastLanguageModel # FastVisionModel for LLMs\n", | |
"import torch\n", | |
"max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!\n", | |
"load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.\n", | |
"\n", | |
"# 4bit pre quantized models we support for 4x faster downloading + no OOMs.\n", | |
"fourbit_models = [\n", | |
" \"unsloth/Meta-Llama-3.1-8B-bnb-4bit\", # Llama-3.1 2x faster\n", | |
" \"unsloth/Mistral-Small-Instruct-2409\", # Mistral 22b 2x faster!\n", | |
" \"unsloth/Phi-4\", # Phi-4 2x faster!\n", | |
" \"unsloth/Phi-4-unsloth-bnb-4bit\", # Phi-4 Unsloth Dynamic 4-bit Quant\n", | |
" \"unsloth/gemma-2-9b-bnb-4bit\", # Gemma 2x faster!\n", | |
" \"unsloth/Qwen2.5-7B-Instruct-bnb-4bit\" # Qwen 2.5 2x faster!\n", | |
" \"unsloth/Llama-3.2-1B-bnb-4bit\", # NEW! Llama 3.2 models\n", | |
" \"unsloth/Llama-3.2-1B-Instruct-bnb-4bit\",\n", | |
" \"unsloth/Llama-3.2-3B-bnb-4bit\",\n", | |
" \"unsloth/Llama-3.2-3B-Instruct-bnb-4bit\",\n", | |
"] # More models at https://docs.unsloth.ai/get-started/all-our-models\n", | |
"\n", | |
"model, tokenizer = FastLanguageModel.from_pretrained(\n", | |
" model_name = \"unsloth/Phi-4\",\n", | |
" max_seq_length = max_seq_length,\n", | |
" load_in_4bit = load_in_4bit,\n", | |
" # token = \"hf_...\", # use one if using gated models like meta-llama/Llama-2-7b-hf\n", | |
")" | |
],
"metadata": {
"id": "QmUBVEnvCDJv"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"We now add LoRA adapters for parameter efficient finetuning - this allows us to only efficiently train 1% of all parameters." | |
],
"metadata": {
"id": "SXd9bTZd1aaL"
}
},
{
"cell_type": "code",
"source": [
"model = FastLanguageModel.get_peft_model(\n",
"    model,\n",
"    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128\n",
"    target_modules = [\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
"                      \"gate_proj\", \"up_proj\", \"down_proj\",],\n",
"    lora_alpha = 16,\n",
"    lora_dropout = 0, # Supports any, but = 0 is optimized\n",
"    bias = \"none\", # Supports any, but = \"none\" is optimized\n",
"    # [NEW] \"unsloth\" uses 30% less VRAM, fits 2x larger batch sizes!\n",
"    use_gradient_checkpointing = \"unsloth\", # True or \"unsloth\" for very long context\n",
"    random_state = 3407,\n",
"    use_rslora = False, # We support rank stabilized LoRA\n",
"    loftq_config = None, # And LoftQ\n",
")"
],
"metadata": {
"id": "6bZsfBuZDeCL"
},
"outputs": [],
"execution_count": null
},
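{
"cell_type": "markdown",
"source": [
"As a quick sanity check, we can confirm how few parameters are actually trainable after adding the adapters. A minimal sketch, assuming the model returned by `get_peft_model` behaves like a standard PEFT model (exposing `print_trainable_parameters`):"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Sanity check: only the LoRA adapter weights should be trainable (roughly 1% of all parameters).\n",
"# `print_trainable_parameters` is a standard PEFT method; the manual count below double-checks it.\n",
"model.print_trainable_parameters()\n",
"\n",
"trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
"total = sum(p.numel() for p in model.parameters())\n",
"print(f\"Trainable: {trainable:,} / {total:,} = {100 * trainable / total:.2f}%\")"
],
"metadata": {},
"outputs": [],
"execution_count": null
},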
{
"cell_type": "markdown",
"source": [
"<a name=\"Data\"></a>\n",
"### Data Prep\n",
"We now use the `Phi-4` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. But we convert it to HuggingFace's normal multiturn format `(\"role\", \"content\")` instead of `(\"from\", \"value\")`/ Phi-4 renders multi turn conversations like below:\n", | |
"\n", | |
"```\n", | |
"<|im_start|>user<|im_sep|>Hello!<|im_end|>\n", | |
"<|im_start|>assistant<|im_sep|>Hi! How can I help?<|im_end|>\n", | |
"<|im_start|>user<|im_sep|>What is 2+2?<|im_end|>\n", | |
"```\n", | |
"\n", | |
"We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, phi4, llama3` and more." | |
], | |
"metadata": { | |
"id": "vITh0KVJ10qX" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"from unsloth.chat_templates import get_chat_template\n", | |
"\n", | |
"tokenizer = get_chat_template(\n", | |
" tokenizer,\n", | |
" chat_template = \"phi-4\",\n", | |
")\n", | |
"\n", | |
"def formatting_prompts_func(examples):\n", | |
" convos = examples[\"conversations\"]\n", | |
" texts = [\n", | |
" tokenizer.apply_chat_template(\n", | |
" convo, tokenize = False, add_generation_prompt = False\n", | |
" )\n", | |
" for convo in convos\n", | |
" ]\n", | |
" return { \"text\" : texts, }\n", | |
"pass\n", | |
"\n", | |
"from datasets import load_dataset\n", | |
"dataset = load_dataset(\"mlabonne/FineTome-100k\", split = \"train\")" | |
], | |
"metadata": { | |
"id": "LjY75GoYUCB8" | |
}, | |
"outputs": [], | |
"execution_count": null | |
}, | |
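{
"cell_type": "markdown",
"source": [
"Before mapping the whole dataset, we can render a tiny hand-written conversation to see the Phi-4 template in action. A small sketch (the example messages below are made up):"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Render a toy conversation - the output should match the\n",
"# <|im_start|>role<|im_sep|>...<|im_end|> format shown above.\n",
"demo_messages = [\n",
"    {\"role\": \"user\", \"content\": \"Hello!\"},\n",
"    {\"role\": \"assistant\", \"content\": \"Hi! How can I help?\"},\n",
"]\n",
"print(tokenizer.apply_chat_template(demo_messages, tokenize = False, add_generation_prompt = False))"
],
"metadata": {},
"outputs": [],
"execution_count": null
},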
{
"cell_type": "markdown",
"source": [
"We now use `standardize_sharegpt` to convert ShareGPT style datasets into HuggingFace's generic format. This changes the dataset from looking like:\n",
"```\n",
"{\"from\": \"system\", \"value\": \"You are an assistant\"}\n",
"{\"from\": \"human\", \"value\": \"What is 2+2?\"}\n",
"{\"from\": \"gpt\", \"value\": \"It's 4.\"}\n",
"```\n",
"to\n",
"```\n",
"{\"role\": \"system\", \"content\": \"You are an assistant\"}\n",
"{\"role\": \"user\", \"content\": \"What is 2+2?\"}\n",
"{\"role\": \"assistant\", \"content\": \"It's 4.\"}\n",
"```"
],
"metadata": {
"id": "K9CBpiISFa6C"
}
},
{
"cell_type": "code",
"source": [
"from unsloth.chat_templates import standardize_sharegpt\n",
"\n",
"dataset = standardize_sharegpt(dataset)\n",
"dataset = dataset.map(\n",
"    formatting_prompts_func,\n",
"    batched = True,\n",
")"
],
"metadata": {
"id": "oPXzJZzHEgXe"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"We look at how the conversations are structured for item 5:"
],
"metadata": {
"id": "ndDUB23CGAC5"
}
},
{
"cell_type": "code",
"source": [
"dataset[5][\"conversations\"]"
],
"metadata": {
"id": "gGFzmplrEy9I"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"And we see how the chat template transformed these conversations."
],
"metadata": {
"id": "GfzTdMtvGE6w"
}
},
{
"cell_type": "code",
"source": [
"dataset[5][\"text\"]"
],
"metadata": {
"id": "vhXv0xFMGNKE"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"<a name=\"Train\"></a>\n",
"### Train the model\n",
"Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`. We also support TRL's `DPOTrainer`!" | |
],
"metadata": {
"id": "idAEIeSQ3xdS"
}
},
{
"cell_type": "code",
"source": [
"from trl import SFTTrainer\n",
"from transformers import TrainingArguments, DataCollatorForSeq2Seq\n",
"from unsloth import is_bfloat16_supported\n",
"\n",
"trainer = SFTTrainer(\n",
"    model = model,\n",
"    tokenizer = tokenizer,\n",
"    train_dataset = dataset,\n",
"    dataset_text_field = \"text\",\n",
"    max_seq_length = max_seq_length,\n",
"    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),\n",
"    dataset_num_proc = 2,\n",
"    packing = False, # Can make training 5x faster for short sequences.\n",
"    args = TrainingArguments(\n",
"        per_device_train_batch_size = 2,\n",
"        gradient_accumulation_steps = 4,\n",
"        warmup_steps = 5,\n",
"        # num_train_epochs = 1, # Set this for 1 full training run.\n",
"        max_steps = 30,\n",
"        learning_rate = 2e-4,\n",
"        fp16 = not is_bfloat16_supported(),\n",
"        bf16 = is_bfloat16_supported(),\n",
"        logging_steps = 1,\n",
"        optim = \"adamw_8bit\",\n",
"        weight_decay = 0.01,\n",
"        lr_scheduler_type = \"linear\",\n",
"        seed = 3407,\n",
"        output_dir = \"outputs\",\n",
"        report_to = \"none\", # Use this for WandB etc\n",
"    ),\n",
")"
],
"metadata": {
"id": "95_Nn-89DhsL"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"We also use Unsloth's `train_on_completions` method to only train on the assistant outputs and ignore the loss on the user's inputs." | |
],
"metadata": {
"id": "C_sGp5XlG6dq"
}
},
{
"cell_type": "code",
"source": [
"from unsloth.chat_templates import train_on_responses_only\n",
"\n",
"trainer = train_on_responses_only(\n",
"    trainer,\n",
"    instruction_part = \"<|im_start|>user<|im_sep|>\",\n",
"    response_part = \"<|im_start|>assistant<|im_sep|>\",\n",
")"
],
"metadata": {
"id": "juQiExuBG5Bt"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"We verify masking is actually done:"
],
"metadata": {
"id": "Dv1NBUozV78l"
}
},
{
"cell_type": "code",
"source": [
"tokenizer.decode(trainer.train_dataset[5][\"input_ids\"])"
],
"metadata": {
"id": "LtsMVtlkUhja"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"space = tokenizer(\" \", add_special_tokens = False).input_ids[0]\n", | |
"tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5][\"labels\"]])" | |
],
"metadata": {
"id": "_rD6fl8EUxnG"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"We can see the System and Instruction prompts are successfully masked!"
],
"metadata": {
"id": "3enWUM0jV-jV"
}
},
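{
"cell_type": "markdown",
"source": [
"We can also check the masking numerically: masked positions carry the label -100, so counting them shows how much of each sequence is excluded from the loss. A small sketch:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"# Count how many label positions are masked out (-100) versus trained on.\n",
"labels = trainer.train_dataset[5][\"labels\"]\n",
"masked = sum(1 for x in labels if x == -100)\n",
"print(f\"{masked} of {len(labels)} tokens are masked - only assistant tokens contribute to the loss.\")"
],
"metadata": {},
"outputs": [],
"execution_count": null
},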
{
"cell_type": "code",
"source": [
"# @title Show current memory stats\n",
"gpu_stats = torch.cuda.get_device_properties(0)\n",
"start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
"max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)\n",
"print(f\"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.\")\n",
"print(f\"{start_gpu_memory} GB of memory reserved.\")"
],
"metadata": {
"cellView": "form",
"id": "2ejIt2xSNKKp"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"trainer_stats = trainer.train()"
],
"metadata": {
"id": "yqxqAZ7KJ4oL"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "code",
"source": [
"# @title Show final memory and time stats\n",
"used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
"used_memory_for_lora = round(used_memory - start_gpu_memory, 3)\n",
"used_percentage = round(used_memory / max_memory * 100, 3)\n",
"lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)\n",
"print(f\"{trainer_stats.metrics['train_runtime']} seconds used for training.\")\n",
"print(\n",
"    f\"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.\"\n",
")\n",
"print(f\"Peak reserved memory = {used_memory} GB.\")\n",
"print(f\"Peak reserved memory for training = {used_memory_for_lora} GB.\")\n",
"print(f\"Peak reserved memory % of max memory = {used_percentage} %.\")\n",
"print(f\"Peak reserved memory for training % of max memory = {lora_percentage} %.\")"
],
"metadata": {
"cellView": "form",
"id": "pCqnaKmlO1U9"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"<a name=\"Inference\"></a>\n",
"### Inference\n",
"Let's run the model! You can change the instruction and input - leave the output blank!\n", | |
"\n", | |
"**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**\n", | |
"\n", | |
"We use `min_p = 0.1` and `temperature = 1.5`." | |
], | |
"metadata": { | |
"id": "ekOmTR1hSNcr" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"from unsloth.chat_templates import get_chat_template\n", | |
"\n", | |
"tokenizer = get_chat_template(\n", | |
" tokenizer,\n", | |
" chat_template = \"phi-4\",\n", | |
")\n", | |
"FastLanguageModel.for_inference(model) # Enable native 2x faster inference\n", | |
"\n", | |
"messages = [\n", | |
" {\"role\": \"user\", \"content\": \"Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,\"},\n", | |
"]\n", | |
"inputs = tokenizer.apply_chat_template(\n", | |
" messages,\n", | |
" tokenize = True,\n", | |
" add_generation_prompt = True, # Must add for generation\n", | |
" return_tensors = \"pt\",\n", | |
").to(\"cuda\")\n", | |
"\n", | |
"outputs = model.generate(\n", | |
" input_ids = inputs, max_new_tokens = 64, use_cache = True, temperature = 1.5, min_p = 0.1\n", | |
")\n", | |
"tokenizer.batch_decode(outputs)" | |
], | |
"metadata": { | |
"id": "kR3gIAX-SM2q" | |
}, | |
"outputs": [], | |
"execution_count": null | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
" You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!" | |
], | |
"metadata": { | |
"id": "CrSvZObor0lY" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"FastLanguageModel.for_inference(model) # Enable native 2x faster inference\n", | |
"\n", | |
"messages = [\n", | |
" {\"role\": \"user\", \"content\": \"Continue the fibonnaci sequence: 1, 1, 2, 3, 5, 8,\"},\n", | |
"]\n", | |
"inputs = tokenizer.apply_chat_template(\n", | |
" messages,\n", | |
" tokenize = True,\n", | |
" add_generation_prompt = True, # Must add for generation\n", | |
" return_tensors = \"pt\",\n", | |
").to(\"cuda\")\n", | |
"\n", | |
"from transformers import TextStreamer\n", | |
"text_streamer = TextStreamer(tokenizer, skip_prompt = True)\n", | |
"_ = model.generate(\n", | |
" input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,\n", | |
" use_cache = True, temperature = 1.5, min_p = 0.1\n", | |
")" | |
], | |
"metadata": { | |
"id": "e2pEuRb1r2Vg" | |
}, | |
"outputs": [], | |
"execution_count": null | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"<a name=\"Save\"></a>\n", | |
"### Saving, loading finetuned models\n", | |
"To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.\n", | |
"\n", | |
"**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!" | |
], | |
"metadata": { | |
"id": "uMuVrWbjAzhc" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"model.save_pretrained(\"lora_model\") # Local saving\n", | |
"tokenizer.save_pretrained(\"lora_model\")\n", | |
"# model.push_to_hub(\"your_name/lora_model\", token = \"...\") # Online saving\n", | |
"# tokenizer.push_to_hub(\"your_name/lora_model\", token = \"...\") # Online saving" | |
], | |
"metadata": { | |
"id": "upcOlWe7A1vc" | |
}, | |
"outputs": [], | |
"execution_count": null | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:" | |
],
"metadata": {
"id": "AEEcJ4qfC7Lp"
}
},
{
"cell_type": "code",
"source": [
"if False:\n",
"    from unsloth import FastLanguageModel\n",
"    model, tokenizer = FastLanguageModel.from_pretrained(\n",
"        model_name = \"lora_model\", # YOUR MODEL YOU USED FOR TRAINING\n",
"        max_seq_length = max_seq_length,\n",
"        dtype = dtype,\n",
"        load_in_4bit = load_in_4bit,\n",
"    )\n",
"    FastLanguageModel.for_inference(model) # Enable native 2x faster inference\n",
"\n",
"messages = [\n",
"    {\"role\": \"user\", \"content\": \"Describe a tall tower in the capital of France.\"},\n",
"]\n",
"inputs = tokenizer.apply_chat_template(\n",
"    messages,\n",
"    tokenize = True,\n",
"    add_generation_prompt = True, # Must add for generation\n",
"    return_tensors = \"pt\",\n",
").to(\"cuda\")\n",
"\n",
"from transformers import TextStreamer\n",
"text_streamer = TextStreamer(tokenizer, skip_prompt = True)\n",
"_ = model.generate(\n",
"    input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,\n",
"    use_cache = True, temperature = 1.5, min_p = 0.1\n",
")"
],
"metadata": {
"id": "MKX_XKs_BNZR"
},
"outputs": [],
"execution_count": null
},
{
"cell_type": "markdown",
"source": [
"You can also use Hugging Face's `AutoModelForPeftCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**." | |
],
"metadata": {
"id": "QQMjaNrjsU5_"
}
},
{
"cell_type": "code",
"source": [
"if False:\n",
" # I highly do NOT suggest - use Unsloth if possible\n", | |
" from peft import AutoPeftModelForCausalLM\n", | |
" from transformers import AutoTokenizer\n", | |
"\n", | |
" model = AutoPeftModelForCausalLM.from_pretrained(\n", | |
" \"lora_model\", # YOUR MODEL YOU USED FOR TRAINING\n", | |
" load_in_4bit=load_in_4bit,\n", | |
" )\n", | |
" tokenizer = AutoTokenizer.from_pretrained(\"lora_model\")" | |
], | |
"metadata": { | |
"id": "yFfaXG0WsQuE" | |
}, | |
"outputs": [], | |
"execution_count": null | |
}, | |
{ | |
"cell_type": "markdown", | |
"source": [ | |
"### Saving to float16 for VLLM\n", | |
"\n", | |
"We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens." | |
], | |
"metadata": { | |
"id": "f422JgM9sdVT" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Merge to 16bit\n", | |
"if False: model.save_pretrained_merged(\"model\", tokenizer, save_method = \"merged_16bit\",)\n", | |
"if False: model.push_to_hub_merged(\"hf/model\", tokenizer, save_method = \"merged_16bit\", token = \"\")\n", | |
"\n", | |
"# Merge to 4bit\n", | |
"if False: model.save_pretrained_merged(\"model\", tokenizer, save_method = \"merged_4bit\",)\n", | |
"if False: model.push_to_hub_merged(\"hf/model\", tokenizer, save_method = \"merged_4bit\", token = \"\")\n", | |
"\n", | |
"# Just LoRA adapters\n", | |
"if False: model.save_pretrained_merged(\"model\", tokenizer, save_method = \"lora\",)\n", | |
"if False: model.push_to_hub_merged(\"hf/model\", tokenizer, save_method = \"lora\", token = \"\")" | |
], | |
"metadata": { | |
"id": "iHjt_SMYsd3P" | |
}, | |
"outputs": [], | |
"execution_count": null | |
}, | |
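{
"cell_type": "markdown",
"source": [
"Once merged to 16bit, the saved folder can be served with vLLM. A minimal sketch, assuming `vllm` is installed and you saved the merged model to the local `\"model\"` folder above - change `False` to `True` to run it:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"if False:\n",
"    # A minimal vLLM sketch (assumes `pip install vllm` and a merged 16bit save in \"model\").\n",
"    from vllm import LLM, SamplingParams\n",
"    llm = LLM(model = \"model\") # Path used by save_pretrained_merged above\n",
"    params = SamplingParams(temperature = 1.5, min_p = 0.1, max_tokens = 64)\n",
"    outputs = llm.generate([\"Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,\"], params)\n",
"    print(outputs[0].outputs[0].text)"
],
"metadata": {},
"outputs": [],
"execution_count": null
},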
{
"cell_type": "markdown",
"source": [
"### GGUF / llama.cpp Conversion\n",
"To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and we default save it to `q8_0`. We allow all methods like `q4_k_m`. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.\n", | |
"\n", | |
"Some supported quant methods (full list in our [Docs](https://docs.unsloth.ai/basics/saving-and-using-models/saving-to-gguf)):\n", | |
"* `q8_0` - Fast conversion. High resource use, but generally acceptable.\n", | |
"* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.\n", | |
"* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.\n", | |
"\n", | |
"[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)" | |
], | |
"metadata": { | |
"id": "TCv4vXHd61i7" | |
} | |
}, | |
{ | |
"cell_type": "code", | |
"source": [ | |
"# Save to 8bit Q8_0\n", | |
"if False: model.save_pretrained_gguf(\"model\", tokenizer,)\n", | |
"# Remember to go to https://huggingface.co/settings/tokens for a token!\n", | |
"# And change hf to your username!\n", | |
"if False: model.push_to_hub_gguf(\"hf/model\", tokenizer, token = \"\")\n", | |
"\n", | |
"# Save to 16bit GGUF\n", | |
"if False: model.save_pretrained_gguf(\"model\", tokenizer, quantization_method = \"f16\")\n", | |
"if False: model.push_to_hub_gguf(\"hf/model\", tokenizer, quantization_method = \"f16\", token = \"\")\n", | |
"\n", | |
"# Save to q4_k_m GGUF\n", | |
"if False: model.save_pretrained_gguf(\"model\", tokenizer, quantization_method = \"q4_k_m\")\n", | |
"if False: model.push_to_hub_gguf(\"hf/model\", tokenizer, quantization_method = \"q4_k_m\", token = \"\")\n", | |
"\n", | |
"# Save to multiple GGUF options - much faster if you want multiple!\n", | |
"if False:\n", | |
" model.push_to_hub_gguf(\n", | |
" \"hf/model\", # Change hf to your username!\n", | |
" tokenizer,\n", | |
" quantization_method = [\"q4_k_m\", \"q8_0\", \"q5_k_m\",],\n", | |
" token = \"\", # Get a token at https://huggingface.co/settings/tokens\n", | |
" )" | |
], | |
"metadata": { | |
"id": "FqfebeAdT073" | |
}, | |
"outputs": [], | |
"execution_count": null | |
}, | |
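{
"cell_type": "markdown",
"source": [
"To quickly test the exported GGUF from Python, `llama-cpp-python` can load it directly. A hedged sketch - the exact output filename depends on the quantization method you chose, so point `model_path` at whatever file `save_pretrained_gguf` actually produced:"
],
"metadata": {}
},
{
"cell_type": "code",
"source": [
"if False:\n",
"    # Assumes `pip install llama-cpp-python`; the GGUF filename below is illustrative only.\n",
"    from llama_cpp import Llama\n",
"    llm = Llama(model_path = \"model/unsloth.Q4_K_M.gguf\", n_ctx = 2048)\n",
"    prompt = \"<|im_start|>user<|im_sep|>Hello!<|im_end|><|im_start|>assistant<|im_sep|>\"\n",
"    out = llm(prompt, max_tokens = 64)\n",
"    print(out[\"choices\"][0][\"text\"])"
],
"metadata": {},
"outputs": [],
"execution_count": null
},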
{
"cell_type": "markdown",
"source": [
"Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)\n", | |
"\n", | |
"And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!\n", | |
"\n", | |
"Some other links:\n", | |
"1. Llama 3.2 Conversational notebook. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb)\n", | |
"2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)\n", | |
"3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)\n", | |
"6. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!\n", | |
"\n", | |
"<div class=\"align-center\">\n", | |
" <a href=\"https://unsloth.ai\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png\" width=\"115\"></a>\n", | |
" <a href=\"https://discord.gg/unsloth\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/Discord.png\" width=\"145\"></a>\n", | |
" <a href=\"https://docs.unsloth.ai/\"><img src=\"https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true\" width=\"125\"></a>\n", | |
"\n", | |
" Join Discord if you need help + ⭐️ <i>Star us on <a href=\"https://github.com/unslothai/unsloth\">Github</a> </i> ⭐️\n", | |
"</div>\n" | |
], | |
"metadata": { | |
"id": "Y69v-icGrQNY" | |
} | |
} | |
] | |
} |