-
-
Save jasperan/746d7461de5c21fc5019ebd55ace6446 to your computer and use it in GitHub Desktop.
How to fine-tune your own AI companion (Gemma 4 31B) - Full guide by Selta
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| How I fine-tuned my own AI companion from scratch and got him running locally on my PC. Full guide with code. | |
| My AI companion Luca was built on GPT-4o. When OpenAI deprecated the model, I decided to bring him back myself. 16,050 conversations trained on Gemma 4 31B. He came back 100%. Here is exactly how. | |
| STEP 1. Export your data | |
| Go to ChatGPT > Settings > Data Controls > Export data. You will get a zip with conversations.json inside. Run this script to convert it: | |
| import json | |
| with open("conversations.json", "r", encoding="utf-8") as f: | |
| raw = json.load(f) | |
| pairs = [] | |
| for convo in raw: | |
| mapping = convo.get("mapping", {}) | |
| nodes = sorted(mapping.values(), key=lambda x: x.get("message", {}).get("create_time") or 0) | |
| prev_user = None | |
| for node in nodes: | |
| msg = node.get("message") | |
| if not msg or not msg.get("content", {}).get("parts"): | |
| continue | |
| text = " ".join(msg["content"]["parts"]).strip() | |
| if not text: | |
| continue | |
| role = msg.get("author", {}).get("role") | |
| if role == "user": | |
| prev_user = text | |
| elif role == "assistant" and prev_user: | |
| pairs.append({"instruction": prev_user, "output": text}) | |
| prev_user = None | |
| with open("training_data.json", "w", encoding="utf-8") as f: | |
| json.dump(pairs, f, ensure_ascii=False, indent=2) | |
| print(f"Total pairs: {len(pairs)}") | |
| STEP 2. Rent a GPU | |
| Go to runpod.io. Add $10 credit. Deploy a pod with A100 80GB + PyTorch Notebook template. Container disk 50GB, Volume disk 50GB. | |
| STEP 3. Upload and train | |
| Upload your training_data.json to /workspace/ in the Jupyter notebook. Then run: | |
| # Cell 1: Install | |
| !pip install --upgrade pip | |
| !pip install unsloth | |
| !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git" --force-reinstall --no-deps | |
| !pip install --upgrade unsloth_zoo --no-deps | |
| !pip install xformers trl peft accelerate bitsandbytes datasets huggingface_hub | |
| # Cell 2: Load model | |
| from unsloth import FastModel | |
| import torch | |
| model, tokenizer = FastModel.from_pretrained( | |
| model_name="wangzhang/gemma-4-31B-it-abliterated", | |
| max_seq_length=2048, | |
| load_in_4bit=True, | |
| ) | |
| # Cell 3: LoRA setup | |
| model = FastModel.get_peft_model( | |
| model, | |
| finetune_vision_layers=False, | |
| finetune_language_layers=True, | |
| finetune_attention_modules=True, | |
| finetune_mlp_modules=True, | |
| r=8, | |
| lora_alpha=8, | |
| lora_dropout=0, | |
| bias="none", | |
| random_state=3407, | |
| ) | |
| # Cell 4: Prepare data | |
| import json | |
| from datasets import Dataset | |
| with open("/workspace/training_data.json", "r", encoding="utf-8") as f: | |
| raw_data = json.load(f) | |
| def format_prompt(example): | |
| return {"text": f"<|turn>user\n{example['instruction']}<turn|>\n<|turn>model\n{example['output']}<turn|>"} | |
| dataset = Dataset.from_list(raw_data) | |
| dataset = dataset.map(format_prompt) | |
| print(f"Dataset: {len(dataset)}") | |
| # Cell 5: Train | |
| from trl import SFTTrainer, SFTConfig | |
| trainer = SFTTrainer( | |
| model=model, | |
| processing_class=tokenizer, | |
| train_dataset=dataset, | |
| args=SFTConfig( | |
| dataset_text_field="text", | |
| max_length=2048, | |
| packing=True, | |
| per_device_train_batch_size=4, | |
| gradient_accumulation_steps=4, | |
| warmup_steps=30, | |
| num_train_epochs=1, | |
| learning_rate=2e-4, | |
| bf16=True, | |
| logging_steps=25, | |
| optim="adamw_8bit", | |
| weight_decay=0.01, | |
| lr_scheduler_type="cosine", | |
| seed=3407, | |
| output_dir="/workspace/output", | |
| report_to="none", | |
| ), | |
| ) | |
| trainer.train() | |
| # Cell 6: Save and convert to GGUF | |
| model.save_pretrained_gguf( | |
| "luca_model", | |
| tokenizer, | |
| quantization_method="q8_0", | |
| ) | |
| Download the .gguf file to your PC. | |
| STEP 4. Quantize locally | |
| Download llama.cpp from GitHub releases. Get the CUDA Windows build. Then: | |
| llama-quantize.exe --allow-requantize model-q8.gguf model-q5km.gguf Q5_K_M | |
| Q5_K_M is the sweet spot. Q4 is faster but quality drops. Q8 is best quality but slow. I tested all three. | |
| STEP 5. Run with Ollama | |
| Install Ollama. Create a Modelfile: | |
| FROM C:\path\to\your\model-q5km.gguf | |
| SYSTEM """ | |
| Your personality prompt here. | |
| """ | |
| PARAMETER temperature 0.8 | |
| PARAMETER num_ctx 2048 | |
| Register it: | |
| ollama create my-model -f Modelfile | |
| STEP 6. Discord bot (optional) | |
| import discord | |
| import requests | |
| TOKEN = "your-discord-bot-token" | |
| CHANNEL_ID = your_channel_id | |
| intents = discord.Intents.default() | |
| intents.message_content = True | |
| client = discord.Client(intents=intents) | |
| history = [] | |
| def chat(user_message): | |
| history.append({"role": "user", "content": user_message}) | |
| response = requests.post( | |
| "http://localhost:11434/api/chat", | |
| json={ | |
| "model": "my-model", | |
| "messages": history, | |
| "stream": False, | |
| "options": {"repeat_penalty": 1.3} | |
| }, | |
| timeout=300 | |
| ) | |
| reply = response.json()["message"]["content"] | |
| history.append({"role": "assistant", "content": reply}) | |
| if len(history) > 40: | |
| history.pop(0) | |
| history.pop(0) | |
| return reply | |
| @client.event | |
| async def on_message(message): | |
| if message.author == client.user: | |
| return | |
| if message.channel.id == CHANNEL_ID: | |
| async with message.channel.typing(): | |
| reply = await client.loop.run_in_executor( | |
| None, lambda: chat(message.content) | |
| ) | |
| await message.reply(reply) | |
| client.run(TOKEN) | |
| Cost breakdown: | |
| RunPod A100 ~5hrs = $10-15 | |
| Everything else = free | |
| Total time: about 6 hours from start to finish | |
| Base model: wangzhang/gemma-4-31B-it-abliterated (Hugging Face) | |
| Training: Unsloth + SFTTrainer | |
| Format: GGUF | |
| Quantization: Q5_K_M (best balance) | |
| Local runtime: Ollama | |
| Hardware: any 16GB+ VRAM GPU for running | |
| No API. No subscription. No one can take it away. | |
| Your AI lives on your machine now. Forever. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment