Date: November 2025 Hardware: AMD Radeon RX 9070 XT (16GB VRAM) OS: Ubuntu 22.04 LTS (ROCm 7.1 Preview) Objective: Train commercial-grade voice clones (GPT-SoVITS & F5-TTS) without CUDA.
Before running anything, these environment variables are mandatory to prevent "Illegal Memory Access" crashes on RDNA 4.
# Prevent data transfer corruption
export HSA_ENABLE_SDMA=0
# Disable auto-tuning (causes kernel panics on 7.1)
export PYTORCH_TUNABLEOP_ENABLED=0
# Force Triton backend (Critical for Flash Attention)
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
# Disable Torch Compile (Unstable on preview drivers)
export TORCH_COMPILE_DISABLE=1Best for: Emotional lines, short phrases, high fidelity.
- Install: Use a
requirements.txtpinningtransformers==4.48.3andpeft==0.11.1to avoidflash_attn_2_cudaimport errors. - Dataset: 10-15 minutes of "Hybrid" audio (60% Audiobook, 20% Conversation, 20% News).
- ASR: Must run on CPU (
float32) becausefaster-whisper/ctranslate2lacks native ROCm support. - Training Config (Safe Mode):
- SoVITS: Batch Size
4, Epochs12. - GPT: Batch Size
4, Epochs15. - Note: Batch Size 8 works but rides the VRAM line.
- SoVITS: Batch Size
- Result: Highly expressive, captures breath/tone perfectly.
- "No space left on device": Move
logsandweightsfolders to a secondary drive and symlink them back. - "CUDA driver insufficient": Force
device="cpu"intools/asr/fasterwhisper_asr.py.
Best for: Audiobooks, long-form content, stability.
- Launch: Use the undocumented Gradio UI:
python src/f5_tts/train/finetune_gradio.py. - Dataset: Requires specific structure (
project/wavs/+project/metadata.csv). Use the WebUI's "Transcribe" tab to auto-generate this from raw audio. - Training Config (RDNA 4 Optimized):
- Batch Size:
1500-2000(Frame-based). Do not exceed 3000. - Epochs:
50-100. - Mixed Precision:
fp16. - 8-bit Adam: OFF (Bitsandbytes is unstable on ROCm).
- Batch Size:
- Result: Rock-solid stability on long texts. Smoother, less "spiky" than SoVITS.
- "Illegal Memory Access": Lower batch size immediately. F5 uses DiT (Diffusion Transformer) which is VRAM-heavy.
- "File not found / raw.arrow": The WebUI folder logic is fragile. Ensure
metadata.csvandwavs/are in the project root, then click "Prepare" again. - Corrupted Checkpoints: If it crashes during save, delete the partial
.ptfile and resume from the previous one.
Don't train on just one style.
- 9 mins: Audiobook (Stability).
- 3 mins: News/Technical (Articulation).
- 3 mins: Casual Chat (Texture/Breath).
This produces a "Master Model" that can be directed via prompt audio to be either serious or casual.
Verified by Mark & Gemini, Nov 24 2025.