@apollo-mg · Created November 28, 2025
RDNA 4 Voice Cloning Guide (GPT-SoVITS & F5-TTS)

AI Voice Cloning on AMD RDNA 4 (RX 9070 XT): The "Masochist's Guide" to Success

Date: November 2025
Hardware: AMD Radeon RX 9070 XT (16GB VRAM)
OS: Ubuntu 22.04 LTS (ROCm 7.1 Preview)
Objective: Train commercial-grade voice clones (GPT-SoVITS & F5-TTS) without CUDA.


1. The Core Stability Fixes (The "Secret Sauce")

Before running anything, these environment variables are mandatory to prevent "Illegal Memory Access" crashes on RDNA 4.

```shell
# Prevent data transfer corruption
export HSA_ENABLE_SDMA=0

# Disable auto-tuning (causes kernel panics on 7.1)
export PYTORCH_TUNABLEOP_ENABLED=0

# Force Triton backend (Critical for Flash Attention)
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"

# Disable Torch Compile (Unstable on preview drivers)
export TORCH_COMPILE_DISABLE=1
```
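
A convenient pattern is to collect these into one sourceable file so every training shell picks them up (a sketch; the filename is illustrative):

```shell
# rdna4_env.sh -- source this before any training run (filename is illustrative)
# Usage: source rdna4_env.sh

export HSA_ENABLE_SDMA=0                          # prevent data transfer corruption
export PYTORCH_TUNABLEOP_ENABLED=0                # auto-tuning kernel-panics on 7.1
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"   # force Triton Flash Attention backend
export TORCH_COMPILE_DISABLE=1                    # torch.compile unstable on preview drivers
```

Sourcing (rather than executing) the file is what makes the variables stick in your interactive shell.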

2. GPT-SoVITS Training (The "Character Actor")

Best for: Emotional lines, short phrases, high fidelity.

Workflow

  1. Install: Use a requirements.txt pinning transformers==4.48.3 and peft==0.11.1 to avoid flash_attn_2_cuda import errors.
  2. Dataset: 10-15 minutes of "Hybrid" audio (60% Audiobook, 20% Conversation, 20% News).
  3. ASR: Must run on CPU (float32) because faster-whisper/ctranslate2 lacks native ROCm support.
  4. Training Config (Safe Mode):
    • SoVITS: Batch Size 4, Epochs 12.
    • GPT: Batch Size 4, Epochs 15.
    • Note: Batch Size 8 works but rides the VRAM line.
  5. Result: Highly expressive, captures breath/tone perfectly.
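
The version pinning from step 1 can be sketched like this (the pins come from the guide; the separate pins file and venv usage are illustrative, and the rest of the requirements come from the GPT-SoVITS repo):

```shell
# Write the two pins that avoid the flash_attn_2_cuda import error (sketch;
# "pins.txt" is an illustrative filename, not part of the GPT-SoVITS repo).
tmp="$(mktemp -d)"
cat > "$tmp/pins.txt" <<'EOF'
transformers==4.48.3
peft==0.11.1
EOF
# Inside your venv, install the repo requirements with the pins applied last:
# pip install -r requirements.txt -r "$tmp/pins.txt"
```

Keeping the pins in their own file makes it easy to re-apply them after pulling upstream changes.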

Troubleshooting

  • "No space left on device": Move logs and weights folders to a secondary drive and symlink them back.
  • "CUDA driver insufficient": Force device="cpu" in tools/asr/fasterwhisper_asr.py.
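
The "No space left on device" fix can be sketched as follows (paths and directory names are illustrative stand-ins; substitute your actual clone and secondary drive mount point):

```shell
# Move heavy output dirs to a secondary drive and symlink them back.
# REPO/DEST are illustrative stand-ins -- replace with your real paths.
REPO="${REPO:-$(mktemp -d)/GPT-SoVITS}"   # your repo clone
DEST="${DEST:-$(mktemp -d)/bigdrive}"     # the secondary drive
mkdir -p "$REPO/logs" "$REPO/weights" "$DEST"

for d in logs weights; do
    mv "$REPO/$d" "$DEST/$d"      # relocate the bulky folder
    ln -s "$DEST/$d" "$REPO/$d"   # symlink it back so the WebUI still finds it
done
```

The training scripts follow the symlinks transparently, so no config changes are needed.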

3. F5-TTS Training (The "Narrator")

Best for: Audiobooks, long-form content, stability.

Workflow

  1. Launch: Use the undocumented Gradio UI: python src/f5_tts/train/finetune_gradio.py.
  2. Dataset: Requires specific structure (project/wavs/ + project/metadata.csv). Use the WebUI's "Transcribe" tab to auto-generate this from raw audio.
  3. Training Config (RDNA 4 Optimized):
    • Batch Size: 1500 - 2000 (Frame-based). Do not exceed 3000.
    • Epochs: 50 - 100.
    • Mixed Precision: fp16.
    • 8-bit Adam: OFF (Bitsandbytes is unstable on ROCm).
  4. Result: Rock-solid stability on long texts. Smoother, less "spiky" than SoVITS.
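
The project structure from step 2 can be sketched as a skeleton (the pipe-separated LJSpeech-style `metadata.csv` layout and all file names are assumptions for illustration; in practice the WebUI's "Transcribe" tab generates the real file):

```shell
# Skeleton for an F5-TTS finetune project. Clip names and transcript
# text are placeholders; the WebUI's "Transcribe" tab fills in the real data.
PROJ="${PROJ:-$(mktemp -d)/my_voice}"
mkdir -p "$PROJ/wavs"                 # your audio clips go here
cat > "$PROJ/metadata.csv" <<'EOF'
clip_0001|This is an example transcript line.
clip_0002|Another example transcript line.
EOF
# Then launch the UI and click "Prepare":
# python src/f5_tts/train/finetune_gradio.py
```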

Troubleshooting

  • "Illegal Memory Access": Lower batch size immediately. F5 uses DiT (Diffusion Transformer) which is VRAM-heavy.
  • "File not found / raw.arrow": The WebUI folder logic is fragile. Ensure metadata.csv and wavs/ are in the project root, then click "Prepare" again.
  • Corrupted Checkpoints: If it crashes during save, delete the partial .pt file and resume from the previous one.
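
The corrupted-checkpoint recovery can be sketched like this (the checkpoint directory and file names are illustrative; the demo creates dummy files so the logic is self-contained):

```shell
# After a crash during save: delete the newest (possibly partial) .pt file
# and resume from the previous one. CKPT_DIR is an illustrative stand-in.
CKPT_DIR="${CKPT_DIR:-$(mktemp -d)}"
touch -t 202511010000 "$CKPT_DIR/model_100.pt"   # older, known-good checkpoint
touch -t 202511020000 "$CKPT_DIR/model_200.pt"   # newest, possibly truncated

newest="$(ls -t "$CKPT_DIR"/*.pt | head -n 1)"
rm -f "$newest"   # drop the partial file; resume training from the survivor
```

Always check the file's timestamp against the crash time before deleting; only the checkpoint being written at the moment of the crash is suspect.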

4. The "Hybrid" Dataset Strategy

Don't train on just one style.

  • 9 mins: Audiobook (Stability).
  • 3 mins: News/Technical (Articulation).
  • 3 mins: Casual Chat (Texture/Breath).

This produces a "Master Model" that can be directed via prompt audio to be either serious or casual.
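
The 9/3/3 split above is a 60/20/20 ratio, so it scales to any total dataset length; a one-liner sketch:

```shell
# Given a total dataset length in minutes, compute the 60/20/20 hybrid split
# (audiobook / news / casual). Pure arithmetic, no external data needed.
total=15
awk -v t="$total" 'BEGIN { printf "audiobook=%.0f news=%.0f casual=%.0f\n", t*0.6, t*0.2, t*0.2 }'
```

For the 15-minute example this prints `audiobook=9 news=3 casual=3`, matching the breakdown above.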


Verified by Mark & Gemini, Nov 24 2025.
