{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/chottokun/b33c20f1476603004e4e48bafab2f4da/finetuning_gemma3_gal.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"source": [
"Unslothのコードを改変しています。オリジナルの説明とは動作が異なります。\n",
"\n",
"コードの公開に感謝いたします。"
],
"metadata": {
"id": "DLhRnVkqUCuL"
}
},
{
"cell_type": "markdown",
"metadata": {
"id": "jxpPkxkm8lmy"
},
"source": [
"To run this, press \"*Runtime*\" and press \"*Run all*\" on a **free** Tesla T4 Google Colab instance!\n",
"<div class=\"align-center\">\n",
"<a href=\"https://unsloth.ai/\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png\" width=\"115\"></a>\n",
"<a href=\"https://discord.gg/unsloth\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/Discord button.png\" width=\"145\"></a>\n",
"<a href=\"https://docs.unsloth.ai/\"><img src=\"https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true\" width=\"125\"></a></a> Join Discord if you need help + ⭐ <i>Star us on <a href=\"https://github.com/unslothai/unsloth\">Github</a> </i> ⭐\n",
"</div>\n",
"\n",
"To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://docs.unsloth.ai/get-started/installing-+-updating).\n",
"\n",
"You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "HFSXqSQQ8lmz"
},
"source": [
"### News"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "kwZ_0JTu8lmz"
},
"source": [
"**Read our [Gemma 3 blog](https://unsloth.ai/blog/gemma3) for what's new in Unsloth and our [Reasoning blog](https://unsloth.ai/blog/r1-reasoning) on how to train reasoning models.**\n",
"\n",
"Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).\n"
]
},
{
"cell_type": "code",
"source": [
"from google.colab import userdata\n",
"HF_TOKEN = userdata.get('HF_TOKEN')"
],
"metadata": {
"id": "uyWKQDAh-6bG"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "m9zhkYH_8lmz"
},
"source": [
"### Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "halU7n858lmz"
},
"outputs": [],
"source": [
"%%capture\n",
"import os\n",
"if \"COLAB_\" not in \"\".join(os.environ.keys()):\n",
" !pip install unsloth vllm\n",
"else:\n",
" # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]\n",
" !pip install --no-deps unsloth vllm\n",
"# Install latest Hugging Face for Gemma-3!\n",
"!pip install --no-deps git+https://github.com/huggingface/[email protected]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "QQoubKva8lm0"
},
"outputs": [],
"source": [
"#@title Colab Extra Install { display-mode: \"form\" }\n",
"%%capture\n",
"import os\n",
"if \"COLAB_\" not in \"\".join(os.environ.keys()):\n",
" !pip install unsloth vllm\n",
"else:\n",
" !pip install --no-deps unsloth vllm\n",
" # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]\n",
" # Skip restarting message in Colab\n",
" import sys, re, requests; modules = list(sys.modules.keys())\n",
" for x in modules: sys.modules.pop(x) if \"PIL\" in x or \"google\" in x else None\n",
" !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft \"trl==0.15.2\" triton cut_cross_entropy unsloth_zoo\n",
" !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer\n",
"\n",
" # vLLM requirements - vLLM breaks Colab due to reinstalling numpy\n",
" f = requests.get(\"https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt\").content\n",
" with open(\"vllm_requirements.txt\", \"wb\") as file:\n",
" file.write(re.sub(rb\"(transformers|numpy|xformers)[^\\n]{1,}\\n\", b\"\", f))\n",
" !pip install -r vllm_requirements.txt"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TGMWlrRdzwgf"
},
"source": [
"### Unsloth\n",
"\n",
"`FastModel` supports loading nearly any model now! This includes Vision and Text models!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "-Xbb0cuLzwgf"
},
"outputs": [],
"source": [
"from unsloth import FastModel\n",
"import torch\n",
"\n",
"fourbit_models = [\n",
" # 4bit dynamic quants for superior accuracy and low memory use\n",
" \"unsloth/gemma-3-1b-it-unsloth-bnb-4bit\",\n",
" \"unsloth/gemma-3-4b-it-unsloth-bnb-4bit\",\n",
" \"unsloth/gemma-3-12b-it-unsloth-bnb-4bit\",\n",
" \"unsloth/gemma-3-27b-it-unsloth-bnb-4bit\",\n",
"\n",
" # Other popular models!\n",
" \"unsloth/Llama-3.1-8B\",\n",
" \"unsloth/Llama-3.2-3B\",\n",
" \"unsloth/Llama-3.3-70B\",\n",
" \"unsloth/mistral-7b-instruct-v0.3\",\n",
" \"unsloth/Phi-4\",\n",
"] # More models at https://huggingface.co/unsloth\n",
"\n",
"model, tokenizer = FastModel.from_pretrained(\n",
" model_name = \"unsloth/gemma-3-27b-it-unsloth-bnb-4bit\",\n",
" max_seq_length = 2048, # Choose any for long context!\n",
" load_in_4bit = True, # 4 bit quantization to reduce memory\n",
" load_in_8bit = False, # [NEW!] A bit more accurate, uses 2x memory\n",
" full_finetuning = False, # [NEW!] We have full finetuning now!\n",
" # token = \"hf_...\", # use one if using gated models\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "SXd9bTZd1aaL"
},
"source": [
"We now add LoRA adapters so we only need to update a small amount of parameters!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "6bZsfBuZDeCL"
},
"outputs": [],
"source": [
"model = FastModel.get_peft_model(\n",
" model,\n",
" finetune_vision_layers = False, # Turn off for just text!\n",
" finetune_language_layers = True, # Should leave on!\n",
" finetune_attention_modules = True, # Attention good for GRPO\n",
" finetune_mlp_modules = True, # SHould leave on always!\n",
"\n",
" r = 8, # Larger = higher accuracy, but might overfit\n",
" lora_alpha = 8, # Recommended alpha == r at least\n",
" lora_dropout = 0,\n",
" bias = \"none\",\n",
" random_state = 3407,\n",
")"
]
},
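{
"cell_type": "markdown",
"metadata": {},
"source": [
"To confirm just how small that update is, we can count trainable vs. total parameters - a minimal sketch using plain PyTorch attributes, not part of the original notebook:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count trainable (LoRA) parameters vs. the full model to confirm\n",
"# that only a small fraction of weights will be updated.\n",
"trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
"total = sum(p.numel() for p in model.parameters())\n",
"print(f\"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)\")"
]
},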
{
"cell_type": "markdown",
"metadata": {
"id": "vITh0KVJ10qX"
},
"source": [
"<a name=\"Data\"></a>\n",
"### Data Prep\n",
"We now use the `Gemma-3` format for conversation style finetunes. We use [Maxime Labonne's FineTome-100k](https://huggingface.co/datasets/mlabonne/FineTome-100k) dataset in ShareGPT style. Gemma-3 renders multi turn conversations like below:\n",
"\n",
"```\n",
"<bos><start_of_turn>user\n",
"Hello!<end_of_turn>\n",
"<start_of_turn>model\n",
"Hey there!<end_of_turn>\n",
"```\n",
"\n",
"We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, phi3, llama3, phi4, qwen2.5, gemma3` and more."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LjY75GoYUCB8"
},
"outputs": [],
"source": [
"from unsloth.chat_templates import get_chat_template\n",
"tokenizer = get_chat_template(\n",
" tokenizer,\n",
" chat_template = \"gemma-3\",\n",
")"
]
},
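{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (a minimal sketch added to this writeup, using a toy conversation), we can render a two-turn exchange through the template to see the `<start_of_turn>` / `<end_of_turn>` markup shown above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Render a toy two-turn conversation with the Gemma-3 chat template.\n",
"# tokenize = False returns the formatted string instead of token ids.\n",
"demo = [\n",
"    {\"role\": \"user\", \"content\": \"Hello!\"},\n",
"    {\"role\": \"assistant\", \"content\": \"Hey there!\"},\n",
"]\n",
"print(tokenizer.apply_chat_template(demo, tokenize = False))"
]
},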
{
"cell_type": "code",
"source": [
"from unsloth.chat_templates import get_chat_template\n",
"from datasets import load_dataset # Add this line\n",
"import datasets\n",
"\n",
"\n",
"# データセットの読み込み (split=\"train\" を指定)\n",
"dataset = load_dataset('Chottokun/databricks-dolly-15k-ja-gal', split=\"train\")\n",
"\n",
"# conversations形式に変換\n",
"def format_conversations(example):\n",
" return {\n",
" \"conversations\": [\n",
" {\"role\": \"user\", \"content\": example[\"instruction\"]},\n",
" {\"role\": \"assistant\", \"content\": example[\"output\"]},\n",
" ]\n",
" }\n",
"\n",
"# 全ての訓練データを conversations 形式に変換\n",
"dataset = dataset.map(format_conversations)\n",
"\n",
"# DatasetDictに格納 (訓練データのみ)\n",
"dataset = datasets.DatasetDict({\n",
" 'train': dataset # 元の dataset は既に訓練データのみ\n",
"})\n",
"\n",
"# chat_templateの適用\n",
"tokenizer = get_chat_template(tokenizer, chat_template=\"gemma-3\")\n",
"\n",
"def apply_chat_template(examples):\n",
" texts = tokenizer.apply_chat_template(examples[\"conversations\"])\n",
" return {\"text\": texts}\n",
"\n",
"dataset = dataset.map(apply_chat_template, batched=True)"
],
"metadata": {
"id": "baOrOPssBJrf"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"source": [
"dataset['train'].to_pandas().head()"
],
"metadata": {
"id": "sPAzJW_oBSe8"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "Mkq4RvEq7FQr"
},
"outputs": [],
"source": [
"# Original dataset from the Unsloth notebook (unused in this modified version):\n",
"# from datasets import load_dataset\n",
"# dataset = load_dataset(\"mlabonne/FineTome-100k\", split = \"train\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "K9CBpiISFa6C"
},
"source": [
"We now use `standardize_data_formats` to try converting datasets to the correct format for finetuning purposes!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "reoBXmAn7HlN"
},
"outputs": [],
"source": [
"# from unsloth.chat_templates import standardize_data_formats\n",
"# dataset = standardize_data_formats(dataset)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6i5Sx9In7vHi"
},
"source": [
"Let's see how row 100 looks like!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "dzE1OEXi7s3P"
},
"outputs": [],
"source": [
"dataset['train'][100]"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8Xs0LXio7rfd"
},
"source": [
"We now have to apply the chat template for `Gemma-3` onto the conversations, and save it to `text`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1ahE8Ys37JDJ"
},
"outputs": [],
"source": [
"def apply_chat_template(examples):\n",
" texts = tokenizer.apply_chat_template(examples[\"conversations\"])\n",
" return { \"text\" : texts }\n",
"pass\n",
"dataset = dataset.map(apply_chat_template, batched = True)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ndDUB23CGAC5"
},
"source": [
"Let's see how the chat template did! Notice `Gemma-3` default adds a `<bos>`!"
]
},
{
"cell_type": "code",
"source": [
"dataset['train'][100][\"text\"]"
],
"metadata": {
"id": "FyNcnqdyC_ho"
},
"execution_count": null,
"outputs": []
},
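{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check (a minimal sketch, not part of the original notebook): since the template already prepends `<bos>`, re-tokenizing the text should not silently add a second BOS token. We can inspect the first few ids:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The rendered text should start with <bos>; check whether plain\n",
"# tokenization would prepend another BOS (a common double-BOS pitfall).\n",
"sample = dataset['train'][100][\"text\"]\n",
"ids = tokenizer(sample).input_ids\n",
"print(\"starts with <bos>:\", sample.startswith(\"<bos>\"))\n",
"print(\"first two ids:\", ids[:2], \"| bos_token_id:\", tokenizer.bos_token_id)"
]
},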
{
"cell_type": "markdown",
"metadata": {
"id": "idAEIeSQ3xdS"
},
"source": [
"<a name=\"Train\"></a>\n",
"### Train the model\n",
"Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "95_Nn-89DhsL"
},
"outputs": [],
"source": [
"from trl import SFTTrainer, SFTConfig\n",
"trainer = SFTTrainer(\n",
" model = model,\n",
" tokenizer = tokenizer,\n",
" train_dataset = dataset['train'],\n",
" eval_dataset = None, # Can set up evaluation!\n",
" args = SFTConfig(\n",
" dataset_text_field = \"text\",\n",
" per_device_train_batch_size = 2,\n",
" gradient_accumulation_steps = 4, # Use GA to mimic batch size!\n",
" warmup_steps = 5,\n",
" # num_train_epochs = 1, # Set this for 1 full training run.\n",
" max_steps = 30,\n",
" learning_rate = 2e-4, # Reduce to 2e-5 for long training runs\n",
" logging_steps = 1,\n",
" optim = \"adamw_8bit\",\n",
" weight_decay = 0.01,\n",
" lr_scheduler_type = \"linear\",\n",
" seed = 3407,\n",
" report_to = \"none\", # Use this for WandB etc\n",
" ),\n",
")"
]
},
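{
"cell_type": "markdown",
"metadata": {},
"source": [
"The effective batch size the optimizer sees is the per-device batch size times the gradient-accumulation steps (2 × 4 = 8 here). A quick check, reading the values back from the `trainer` built above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Effective batch size = per-device batch size × gradient accumulation.\n",
"args = trainer.args\n",
"print(\"effective batch size:\",\n",
"      args.per_device_train_batch_size * args.gradient_accumulation_steps)"
]
},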
{
"cell_type": "markdown",
"metadata": {
"id": "C_sGp5XlG6dq"
},
"source": [
"We also use Unsloth's `train_on_completions` method to only train on the assistant outputs and ignore the loss on the user's inputs. This helps increase accuracy of finetunes!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "juQiExuBG5Bt"
},
"outputs": [],
"source": [
"from unsloth.chat_templates import train_on_responses_only\n",
"trainer = train_on_responses_only(\n",
" trainer,\n",
" instruction_part = \"<start_of_turn>user\\n\",\n",
" response_part = \"<start_of_turn>model\\n\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Dv1NBUozV78l"
},
"source": [
"Let's verify masking the instruction part is done! Let's print the 100th row again:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "LtsMVtlkUhja"
},
"outputs": [],
"source": [
"tokenizer.decode(trainer.train_dataset[100][\"input_ids\"])"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4Kyjy__m9KY3"
},
"source": [
"Now let's print the masked out example - you should see only the answer is present:"
]
},
{
"cell_type": "code",
"source": [
"# from unsloth.chat_templates import train_on_responses_only\n",
"\n",
"# # Modify the dataset to include \"labels\" key.\n",
"# # The labels are the same as the input_ids for text generation\n",
"# def add_labels_to_dataset(dataset):\n",
"# def map_function(example):\n",
"# example['labels'] = example['input_ids']\n",
"# return example\n",
"# return dataset.map(map_function)\n",
"\n",
"# # Apply train_on_responses_only as before\n",
"# trainer = train_on_responses_only(\n",
"# trainer,\n",
"# instruction_part=\"<start_of_turn>user\\n\",\n",
"# response_part=\"<start_of_turn>model\\n\",\n",
"# )\n",
"\n",
"# # Now, add the \"labels\" key to the dataset\n",
"# trainer.train_dataset = add_labels_to_dataset(trainer.train_dataset)"
],
"metadata": {
"id": "RgVs-iVsDZM4"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "_rD6fl8EUxnG"
},
"outputs": [],
"source": [
"# tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in trainer.train_dataset[100][\"labels\"]]).replace(tokenizer.pad_token, \" \")"
]
},
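{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a guarded variant of the commented-out check above (a sketch added here, not part of the original notebook): decode the labels if the dataset carries a precomputed `labels` column, replacing masked positions with spaces:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Guarded variant of the masked-labels check from the original notebook.\n",
"if \"labels\" in trainer.train_dataset.column_names:\n",
"    masked = trainer.train_dataset[100][\"labels\"]\n",
"    print(tokenizer.decode([tokenizer.pad_token_id if x == -100 else x for x in masked]).replace(tokenizer.pad_token, \" \"))\n",
"else:\n",
"    print(\"No precomputed labels column; masking may instead be applied by the data collator at training time.\")"
]
},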
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "2ejIt2xSNKKp"
},
"outputs": [],
"source": [
"# @title Show current memory stats\n",
"gpu_stats = torch.cuda.get_device_properties(0)\n",
"start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
"max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)\n",
"print(f\"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.\")\n",
"print(f\"{start_gpu_memory} GB of memory reserved.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CNP1Uidk9mrz"
},
"source": [
"Let's train the model! To resume a training run, set `trainer.train(resume_from_checkpoint = True)`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "yqxqAZ7KJ4oL"
},
"outputs": [],
"source": [
"trainer_stats = trainer.train()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "pCqnaKmlO1U9"
},
"outputs": [],
"source": [
"# @title Show final memory and time stats\n",
"used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)\n",
"used_memory_for_lora = round(used_memory - start_gpu_memory, 3)\n",
"used_percentage = round(used_memory / max_memory * 100, 3)\n",
"lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)\n",
"print(f\"{trainer_stats.metrics['train_runtime']} seconds used for training.\")\n",
"print(\n",
" f\"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.\"\n",
")\n",
"print(f\"Peak reserved memory = {used_memory} GB.\")\n",
"print(f\"Peak reserved memory for training = {used_memory_for_lora} GB.\")\n",
"print(f\"Peak reserved memory % of max memory = {used_percentage} %.\")\n",
"print(f\"Peak reserved memory for training % of max memory = {lora_percentage} %.\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "ekOmTR1hSNcr"
},
"source": [
"<a name=\"Inference\"></a>\n",
"### Inference\n",
"Let's run the model via Unsloth native inference! According to the `Gemma-3` team, the recommended settings for inference are `temperature = 1.0, top_p = 0.95, top_k = 64`"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "kR3gIAX-SM2q"
},
"outputs": [],
"source": [
"from unsloth.chat_templates import get_chat_template\n",
"tokenizer = get_chat_template(\n",
" tokenizer,\n",
" chat_template = \"gemma-3\",\n",
")\n",
"messages = [{\n",
" \"role\": \"user\",\n",
" \"content\": [{\n",
" \"type\" : \"text\",\n",
" \"text\" : \"Continue the sequence: 1, 1, 2, 3, 5, 8,\",\n",
" }]\n",
"}]\n",
"text = tokenizer.apply_chat_template(\n",
" messages,\n",
" add_generation_prompt = True, # Must add for generation\n",
")\n",
"outputs = model.generate(\n",
" **tokenizer([text], return_tensors = \"pt\").to(\"cuda\"),\n",
" max_new_tokens = 64, # Increase for longer outputs!\n",
" # Recommended Gemma-3 settings!\n",
" temperature = 1.0, top_p = 0.95, top_k = 64,\n",
")\n",
"tokenizer.batch_decode(outputs)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "CrSvZObor0lY"
},
"source": [
"You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting for the whole output!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "e2pEuRb1r2Vg"
},
"outputs": [],
"source": [
"messages = [{\n",
" \"role\": \"user\",\n",
" \"content\": [{\"type\" : \"text\", \"text\" : \"Why is the sky blue?\",}]\n",
"}]\n",
"text = tokenizer.apply_chat_template(\n",
" messages,\n",
" add_generation_prompt = True, # Must add for generation\n",
")\n",
"\n",
"from transformers import TextStreamer\n",
"_ = model.generate(\n",
" **tokenizer([text], return_tensors = \"pt\").to(\"cuda\"),\n",
" max_new_tokens = 64, # Increase for longer outputs!\n",
" # Recommended Gemma-3 settings!\n",
" temperature = 1.0, top_p = 0.95, top_k = 64,\n",
" streamer = TextStreamer(tokenizer, skip_prompt = True),\n",
")"
]
},
{
"cell_type": "code",
"source": [
"messages = [{\n",
" \"role\": \"user\",\n",
" \"content\": [{\"type\" : \"text\", \"text\" : \"まどか☆マギカで一番かわいいのは?\",}]\n",
"}]\n",
"text = tokenizer.apply_chat_template(\n",
" messages,\n",
" add_generation_prompt = True, # Must add for generation\n",
")\n",
"\n",
"from transformers import TextStreamer\n",
"_ = model.generate(\n",
" **tokenizer([text], return_tensors = \"pt\").to(\"cuda\"),\n",
" max_new_tokens = 64, # Increase for longer outputs!\n",
" # Recommended Gemma-3 settings!\n",
" temperature = 1.0, top_p = 0.95, top_k = 64,\n",
" streamer = TextStreamer(tokenizer, skip_prompt = True),\n",
")"
],
"metadata": {
"id": "PzoER0Y5M6A6"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "uMuVrWbjAzhc"
},
"source": [
"<a name=\"Save\"></a>\n",
"### Saving, loading finetuned models\n",
"To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.\n",
"\n",
"**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!"
]
},
{
"cell_type": "code",
"source": [
"hub_model_id = \"Chottokun/gemma-3-gal\""
],
"metadata": {
"id": "p53tkNAS-btH"
},
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "upcOlWe7A1vc"
},
"outputs": [],
"source": [
"model.save_pretrained(\"gemma-3\") # Local saving\n",
"tokenizer.save_pretrained(\"gemma-3\")\n",
"model.push_to_hub(hub_model_id, token = HF_TOKEN) # Online saving\n",
"tokenizer.push_to_hub(hub_model_id, token = HF_TOKEN) # Online saving"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "AEEcJ4qfC7Lp"
},
"source": [
"Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "MKX_XKs_BNZR"
},
"outputs": [],
"source": [
"if False:\n",
" from unsloth import FastModel\n",
" model, tokenizer = FastModel.from_pretrained(\n",
" model_name = \"lora_model\", # YOUR MODEL YOU USED FOR TRAINING\n",
" max_seq_length = 2048,\n",
" load_in_4bit = True,\n",
" )\n",
"\n",
"messages = [{\n",
" \"role\": \"user\",\n",
" \"content\": [{\"type\" : \"text\", \"text\" : \"What is Gemma-3?\",}]\n",
"}]\n",
"text = tokenizer.apply_chat_template(\n",
" messages,\n",
" add_generation_prompt = True, # Must add for generation\n",
")\n",
"\n",
"from transformers import TextStreamer\n",
"_ = model.generate(\n",
" **tokenizer([text], return_tensors = \"pt\").to(\"cuda\"),\n",
" max_new_tokens = 64, # Increase for longer outputs!\n",
" # Recommended Gemma-3 settings!\n",
" temperature = 1.0, top_p = 0.95, top_k = 64,\n",
" streamer = TextStreamer(tokenizer, skip_prompt = True),\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "f422JgM9sdVT"
},
"source": [
"### Saving to float16 for VLLM\n",
"\n",
"We also support saving to `float16` directly for deployment! We save it in the folder `gemma-3-finetune`. Set `if False` to `if True` to let it run!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "iHjt_SMYsd3P"
},
"outputs": [],
"source": [
"if False: # Change to True to save finetune!\n",
" model.save_pretrained_merged(\"gemma-3-finetune\", tokenizer)"
]
},
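{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged sketch (not in the original notebook, and assuming vLLM is installed and the merged `gemma-3-finetune` folder exists), the merged checkpoint could then be served with vLLM using the recommended Gemma-3 sampling settings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if False: # Change to True after saving the merged model above\n",
"    # Load the merged float16 checkpoint with vLLM and generate once.\n",
"    from vllm import LLM, SamplingParams\n",
"    llm = LLM(model = \"gemma-3-finetune\")\n",
"    params = SamplingParams(temperature = 1.0, top_p = 0.95, top_k = 64, max_tokens = 64)\n",
"    print(llm.generate([\"Why is the sky blue?\"], params)[0].outputs[0].text)"
]
},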
{
"cell_type": "markdown",
"metadata": {
"id": "z6O48DbNIAr0"
},
"source": [
"If you want to upload / push to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZV-CiKPrIFG0"
},
"outputs": [],
"source": [
"if False: # Change to True to upload finetune\n",
" model.push_to_hub_merged(\n",
" \"HF_ACCOUNT/gemma-3-finetune\", tokenizer,\n",
" token = \"hf_...\"\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TCv4vXHd61i7"
},
"source": [
"### GGUF / llama.cpp Conversion\n",
"To save to `GGUF` / `llama.cpp`, we support it natively now for all models! For now, you can convert easily to `Q8_0, F16 or BF16` precision. `Q4_K_M` for 4bit will come later!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "FqfebeAdT073"
},
"outputs": [],
"source": [
"if False: # Change to True to save to GGUF\n",
" model.save_pretrained_gguf(\n",
" \"gemma-3-finetune\",\n",
" quantization_type = \"Q8_0\", # For now only Q8_0, BF16, F16 supported\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Q974YEVPI7JS"
},
"source": [
"Likewise, if you want to instead push to GGUF to your Hugging Face account, set `if False` to `if True` and add your Hugging Face token and upload location!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZgcJIhJ0I_es"
},
"outputs": [],
"source": [
"if False: # Change to True to upload GGUF\n",
" model.push_to_hub_gguf(\n",
" \"gemma-3-finetune\",\n",
" quantization_type = \"Q8_0\", # Only Q8_0, BF16, F16 supported\n",
" repo_id = \"HF_ACCOUNT/gemma-finetune-gguf\",\n",
" token = \"hf_...\",\n",
" )"
]
},
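{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged sketch (not in the original notebook; the binary location and GGUF filename below are assumptions - use whatever `save_pretrained_gguf` actually wrote), the exported file can be run with llama.cpp's CLI:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if False: # Change to True after exporting the GGUF above\n",
"    # Hypothetical llama.cpp invocation: -m model path, -p prompt,\n",
"    # -n number of tokens. Adjust the path to the file actually saved.\n",
"    !./llama-cli -m gemma-3-finetune/gemma-3-finetune.Q8_0.gguf -p \"Why is the sky blue?\" -n 64"
]
},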
{
"cell_type": "markdown",
"metadata": {
"id": "SMI_OTs_8lm6"
},
"source": [
"Now, use the `gemma-3-finetune.gguf` file or `gemma-3-finetune-Q4_K_M.gguf` file in llama.cpp or a UI based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui)\n",
"\n",
"And we're done! If you have any questions on Unsloth, we have a [Discord](https://discord.gg/unsloth) channel! If you find any bugs or want to keep updated with the latest LLM stuff, or need help, join projects etc, feel free to join our Discord!\n",
"\n",
"Some other links:\n",
"1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)\n",
"2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)\n",
"3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)\n",
"6. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!\n",
"\n",
"<div class=\"align-center\">\n",
" <a href=\"https://unsloth.ai\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png\" width=\"115\"></a>\n",
" <a href=\"https://discord.gg/unsloth\"><img src=\"https://github.com/unslothai/unsloth/raw/main/images/Discord.png\" width=\"145\"></a>\n",
" <a href=\"https://docs.unsloth.ai/\"><img src=\"https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true\" width=\"125\"></a>\n",
"\n",
" Join Discord if you need help + ⭐️ <i>Star us on <a href=\"https://github.com/unslothai/unsloth\">Github</a> </i> ⭐️\n",
"</div>\n"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "A100",
"provenance": [],
"machine_shape": "hm",
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}