@apollo-mg · Created November 28, 2025
RDNA 4 Voice Cloning Guide (GPT-SoVITS & F5-TTS)

AI Voice Cloning on AMD RDNA 4 (RX 9070 XT): The "Masochist's Guide" to Success

Date: November 2025
Hardware: AMD Radeon RX 9070 XT (16GB VRAM)
OS: Ubuntu 22.04 LTS (ROCm 7.1 Preview)
Objective: Train commercial-grade voice clones (GPT-SoVITS & F5-TTS) without CUDA.


1. The Core Stability Fixes (The "Secret Sauce")

Before running anything, these environment variables are mandatory to prevent "Illegal Memory Access" crashes on RDNA 4.

```shell
# Prevent data transfer corruption
export HSA_ENABLE_SDMA=0

# Disable auto-tuning (causes kernel panics on 7.1)
export PYTORCH_TUNABLEOP_ENABLED=0

# Force Triton backend (Critical for Flash Attention)
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"

# Disable Torch Compile (Unstable on preview drivers)
export TORCH_COMPILE_DISABLE=1
```
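
A convenient pattern is to collect these into one sourceable file so every training shell picks them up (a sketch; the filename is illustrative):

```shell
# rdna4_env.sh -- source this before any training run (filename is illustrative)
# Usage: source rdna4_env.sh

export HSA_ENABLE_SDMA=0                          # prevent data transfer corruption
export PYTORCH_TUNABLEOP_ENABLED=0                # auto-tuning kernel-panics on 7.1
export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"   # force Triton Flash Attention backend
export TORCH_COMPILE_DISABLE=1                    # torch.compile unstable on preview drivers
```

Sourcing (rather than executing) the file is what makes the variables stick in your interactive shell.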

2. GPT-SoVITS Training (The "Character Actor")

Best for: Emotional lines, short phrases, high fidelity.

Workflow

  1. Install: Use a requirements.txt pinning transformers==4.48.3 and peft==0.11.1 to avoid flash_attn_2_cuda import errors.
  2. Dataset: 10-15 minutes of "Hybrid" audio (60% Audiobook, 20% Conversation, 20% News).
  3. ASR: Must run on CPU (float32) because faster-whisper/ctranslate2 lacks native ROCm support.
  4. Training Config (Safe Mode):
    • SoVITS: Batch Size 4, Epochs 12.
    • GPT: Batch Size 4, Epochs 15.
    • Note: Batch Size 8 works but rides the VRAM line.
  5. Result: Highly expressive, captures breath/tone perfectly.
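
The version pinning from step 1 can be sketched like this (the pins come from the guide; the separate pins file and venv usage are illustrative, and the rest of the requirements come from the GPT-SoVITS repo):

```shell
# Write the two pins that avoid the flash_attn_2_cuda import error (sketch;
# "pins.txt" is an illustrative filename, not part of the GPT-SoVITS repo).
tmp="$(mktemp -d)"
cat > "$tmp/pins.txt" <<'EOF'
transformers==4.48.3
peft==0.11.1
EOF
# Inside your venv, install the repo requirements with the pins applied last:
# pip install -r requirements.txt -r "$tmp/pins.txt"
```

Keeping the pins in their own file makes it easy to re-apply them after pulling upstream changes.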

Troubleshooting

  • "No space left on device": Move logs and weights folders to a secondary drive and symlink them back.
  • "CUDA driver insufficient": Force device="cpu" in tools/asr/fasterwhisper_asr.py.
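
The "No space left on device" fix can be sketched as follows (paths and directory names are illustrative stand-ins; substitute your actual clone and secondary drive mount point):

```shell
# Move heavy output dirs to a secondary drive and symlink them back.
# REPO/DEST are illustrative stand-ins -- replace with your real paths.
REPO="${REPO:-$(mktemp -d)/GPT-SoVITS}"   # your repo clone
DEST="${DEST:-$(mktemp -d)/bigdrive}"     # the secondary drive
mkdir -p "$REPO/logs" "$REPO/weights" "$DEST"

for d in logs weights; do
    mv "$REPO/$d" "$DEST/$d"      # relocate the bulky folder
    ln -s "$DEST/$d" "$REPO/$d"   # symlink it back so the WebUI still finds it
done
```

The training scripts follow the symlinks transparently, so no config changes are needed.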

3. F5-TTS Training (The "Narrator")

Best for: Audiobooks, long-form content, stability.

Workflow

  1. Launch: Use the undocumented Gradio UI: python src/f5_tts/train/finetune_gradio.py.
  2. Dataset: Requires specific structure (project/wavs/ + project/metadata.csv). Use the WebUI's "Transcribe" tab to auto-generate this from raw audio.
  3. Training Config (RDNA 4 Optimized):
    • Batch Size: 1500 - 2000 (Frame-based). Do not exceed 3000.
    • Epochs: 50 - 100.
    • Mixed Precision: fp16.
    • 8-bit Adam: OFF (Bitsandbytes is unstable on ROCm).
  4. Result: Rock-solid stability on long texts. Smoother, less "spiky" than SoVITS.
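
The project structure from step 2 can be sketched as a skeleton (the pipe-separated LJSpeech-style `metadata.csv` layout and all file names are assumptions for illustration; in practice the WebUI's "Transcribe" tab generates the real file):

```shell
# Skeleton for an F5-TTS finetune project. Clip names and transcript
# text are placeholders; the WebUI's "Transcribe" tab fills in the real data.
PROJ="${PROJ:-$(mktemp -d)/my_voice}"
mkdir -p "$PROJ/wavs"                 # your audio clips go here
cat > "$PROJ/metadata.csv" <<'EOF'
clip_0001|This is an example transcript line.
clip_0002|Another example transcript line.
EOF
# Then launch the UI and click "Prepare":
# python src/f5_tts/train/finetune_gradio.py
```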

Troubleshooting

  • "Illegal Memory Access": Lower batch size immediately. F5 uses DiT (Diffusion Transformer) which is VRAM-heavy.
  • "File not found / raw.arrow": The WebUI folder logic is fragile. Ensure metadata.csv and wavs/ are in the project root, then click "Prepare" again.
  • Corrupted Checkpoints: If it crashes during save, delete the partial .pt file and resume from the previous one.
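
The corrupted-checkpoint recovery can be sketched like this (the checkpoint directory and file names are illustrative; the demo creates dummy files so the logic is self-contained):

```shell
# After a crash during save: delete the newest (possibly partial) .pt file
# and resume from the previous one. CKPT_DIR is an illustrative stand-in.
CKPT_DIR="${CKPT_DIR:-$(mktemp -d)}"
touch -t 202511010000 "$CKPT_DIR/model_100.pt"   # older, known-good checkpoint
touch -t 202511020000 "$CKPT_DIR/model_200.pt"   # newest, possibly truncated

newest="$(ls -t "$CKPT_DIR"/*.pt | head -n 1)"
rm -f "$newest"   # drop the partial file; resume training from the survivor
```

Always check the file's timestamp against the crash time before deleting; only the checkpoint being written at the moment of the crash is suspect.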

4. The "Hybrid" Dataset Strategy

Don't train on just one style.

  • 9 mins: Audiobook (Stability).
  • 3 mins: News/Technical (Articulation).
  • 3 mins: Casual Chat (Texture/Breath).

This produces a "Master Model" that can be directed via prompt audio to be either serious or casual.
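
The 9/3/3 split above is a 60/20/20 ratio, so it scales to any total dataset length; a one-liner sketch:

```shell
# Given a total dataset length in minutes, compute the 60/20/20 hybrid split
# (audiobook / news / casual). Pure arithmetic, no external data needed.
total=15
awk -v t="$total" 'BEGIN { printf "audiobook=%.0f news=%.0f casual=%.0f\n", t*0.6, t*0.2, t*0.2 }'
```

For the 15-minute example this prints `audiobook=9 news=3 casual=3`, matching the breakdown above.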


Verified by Mark & Gemini, Nov 24 2025.
