This is stuff that has worked well for me.
Tested on Arch Linux, Ryzen 7 5800X, 64GB RAM, RX 7900 GRE
Changelog
- 2026-06-18: Switch to `rocm.nightlies.amd.com/whl-multi-arch` torch builds.
- 2026-06-14: Re-add `PYTORCH_MIOPEN_SUGGEST_NHWC` since it affects pytorch 2.10. Comment out `PYTORCH_NO_HIP_MEMORY_CACHING=1` as it should only be used if OOMs are encountered.
- 2026-06-12: Update flash-attention install workarounds for commit `fc8cbad6`. Remove aiter `ENABLE_CK`, `LIBRARY_PATH` usage, add `AITER_TRITON_ONLY=1`.
- 2026-06-01: Add note about aiter pre-configured fwd attn config making `FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON` no longer necessary for gfx1100. Tweak `PYTORCH_TUNABLEOP_ENABLED` advice.
- 2026-05-28: Add flash-attention/aiter issues + workarounds. Remove `PYTORCH_MIOPEN_SUGGEST_NHWC` usage.
- 2026-02-07: Switch to upstream flash-attention + `FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON`.
Create python 3.14 venv
python3.14 -m venv venv
pip install --index-url https://rocm.nightlies.amd.com/whl-multi-arch/ "torch[device-gfx1100]" "torchvision[device-gfx1100]" torchaudio
See rocm install instructions in https://github.com/Dao-AILab/flash-attention.
An optimised forward attention config can be set with `FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON` (see env vars section).
Alternatively you can try autotuning: see the repo's README and make sure you have a new enough version checked out.
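A rough sketch of the first option. The env vars section notes that aiter already ships this config for gfx1100, so the variable only matters on older flash-attention checkouts or on other GPUs. The helper name `set_fwd_attn_config` and its arch argument are made up for illustration, and whether these block sizes suit a GPU other than gfx1100 is an assumption that has not been tested here.

```shell
# Hypothetical helper: export the forward attention config unless aiter
# already has one built in for this GPU arch (gfx1100, per the env vars notes).
set_fwd_attn_config() {
    arch="$1"
    case "$arch" in
        gfx1100)
            # Recent aiter is pre-configured for gfx1100: leave the variable unset.
            unset FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON
            ;;
        *)
            # Values copied from the env vars section; assumed (not verified)
            # to be a reasonable starting point on other archs.
            export FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON='{"BLOCK_M":128,"BLOCK_N":64,"waves_per_eu":6,"PRE_LOAD_V":false,"num_stages":1,"num_warps":8}'
            ;;
    esac
}
```

Call it with your arch before launching, e.g. `set_fwd_attn_config gfx1100`. On an older flash-attention checkout you would export the JSON by hand even on gfx1100.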
Since flash-attention moved to using aiter it may have issues installing & running.
The following workaround works on flash-attention main: d16e381f.
- In the flash-attention dir:
  BUILD_TARGET="rocm" FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" pip install --no-build-isolation .
- The above step successfully installs flash_attn but may re-install triton. To get the rocm triton version back, re-install torch etc.
- We'll also need to set the `AITER_TRITON_ONLY` env var for aiter JIT (see env vars section).
Installed packages should look something like:
$ pip list | rg "torch|flash|aiter|triton"
amd-aiter 0.0.0
amd-torch-device-gfx11 2.12.0+rocm7.14.0a20260618
amd-torch-device-gfx1100 2.12.0+rocm7.14.0a20260618
amd-torchvision-device-gfx1100 0.27.0+rocm7.14.0a20260618
flash_attn 2.8.4
torch 2.12.0+rocm7.14.0a20260618
torchaudio 2.11.0+rocm7.14.0a20260617
torchsde 0.2.6
torchvision 0.27.0+rocm7.14.0a20260618
triton 3.7.0+gitb4e20bbe.rocm7.14.0a20260618
triton_kernels 1.0.0+amd.rocm7.0.0.gitd0d77a509
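Since the flash-attention install may swap out triton, a quick check on the `pip list` output can catch that. This is a minimal sketch: the function name `has_rocm_triton` is made up, and it assumes the ROCm triton build carries `rocm` in its version string, as it does in the listing above.

```shell
# Hypothetical check: succeeds if the given `pip list` text contains a
# `triton` row whose version mentions rocm. The pattern requires whitespace
# right after "triton", so the separate `triton_kernels` row is not matched.
has_rocm_triton() {
    printf '%s\n' "$1" | grep -Eq '^triton[[:space:]]+.*rocm'
}
```

For example: `has_rocm_triton "$(pip list)" || echo "triton is not the rocm build, re-install torch etc"`.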
pip install -r requirements.txt
Note: Also install any custom_nodes requirements (not described here).
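One way to find those custom node requirements is sketched below. It assumes the usual layout of one `requirements.txt` per directory under `custom_nodes/`. The function name is made up and your nodes may be laid out differently.

```shell
# Hypothetical helper: print every custom_nodes/<node>/requirements.txt found.
# Takes the custom_nodes directory as an optional argument.
list_custom_node_requirements() {
    root="${1:-custom_nodes}"
    for req in "$root"/*/requirements.txt; do
        # If nothing matches, the glob stays literal and this test skips it.
        [ -f "$req" ] && printf '%s\n' "$req"
    done
    return 0
}
```

Usage from the ComfyUI dir could look like `list_custom_node_requirements | while read -r req; do pip install -r "$req"; done`. Afterwards it is worth re-checking `pip list`, as a node's requirements could replace the rocm torch or triton packages.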
# Slower, but more stable / fewer OOMs. No OOMs? You don't need this.
# I find this necessary to run 720p wan gen
#export PYTORCH_NO_HIP_MEMORY_CACHING=1
# aiter
# https://github.com/Dao-AILab/flash-attention/issues/2580#issuecomment-4607206930
export AITER_TRITON_ONLY=1
# triton
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
## Significantly faster attn_fwd performance for wan2.2 workflows
## Note: Only necessary for older flash-attention or non-gfx1100 as aiter has this pre-configured for gfx1100
## See <https://github.com/ROCm/aiter/blob/ca781d8197cc85240111093f8c796ecbc2af6294/aiter/ops/triton/_triton_kernels/flash_attn_triton_amd/fwd_prefill.py#L169>
# export FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON='{"BLOCK_M":128,"BLOCK_N":64,"waves_per_eu":6,"PRE_LOAD_V":false,"num_stages":1,"num_warps":8}'
# pytorch switches on NHWC for rocm > 7, causes significant miopen regressions for upscaling
# https://github.com/ROCm/TheRock/issues/2485#issuecomment-3666986174
# affects pytorch 2.10, fixed in later version (not exactly sure which though)
# export PYTORCH_MIOPEN_SUGGEST_NHWC=0
# miopen
## Tell comfyui to *not* disable miopen/cudnn, otherwise upscale perf is much worse
export COMFYUI_ENABLE_MIOPEN=1
## miopen default find mode causes significant initial slowness, yields little or no benefit to workloads I tested
export MIOPEN_FIND_MODE=FAST
Notes:
`PYTORCH_TUNABLEOP_ENABLED` (tunable ops) can be beneficial but it's slow to tune and can be harmful for workflows with distinct resolutions (like detailer workflows). Consider enabling online tuning for specific workflows for a while and then setting `PYTORCH_TUNABLEOP_TUNING=0` to ensure no further downsides.
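The two phases described above could be wrapped like this. It is only a sketch: the function names are made up, while the two environment variables are the PyTorch ones named in the note.

```shell
# Phase 1: tunable ops on, online tuning on (slow while it tunes).
tunableop_tune() {
    export PYTORCH_TUNABLEOP_ENABLED=1
    export PYTORCH_TUNABLEOP_TUNING=1
}

# Phase 2: keep using tuned results, but stop tuning new shapes.
tunableop_freeze() {
    export PYTORCH_TUNABLEOP_ENABLED=1
    export PYTORCH_TUNABLEOP_TUNING=0
}
```

Run `tunableop_tune` before launching for the workflows you want tuned, then switch to `tunableop_freeze` on later launches.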
- `--use-flash-attention`: use faster flash attention installed above.
- `--disable-pinned-memory`: Comfy-Org/ComfyUI#11781 (comment)
- `--cache-ram 32`: optional, helps prevent comfy from using up all 64GB of ram.
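Put together, a launch with these flags might look like the sketch below. It assumes you are in the ComfyUI dir with the venv active and the env vars already exported. The `launch_comfy` name and the `COMFY_RUNNER` override are made up for illustration.

```shell
# Hypothetical launcher: runs ComfyUI's main.py with the flags listed above.
# The first argument is the --cache-ram value (defaults to 32).
# COMFY_RUNNER overrides the interpreter; set it to `echo` for a dry run.
launch_comfy() {
    cache_ram_gb="${1:-32}"
    ${COMFY_RUNNER:-python} main.py \
        --use-flash-attention \
        --disable-pinned-memory \
        --cache-ram "$cache_ram_gb"
}
```

`COMFY_RUNNER=echo launch_comfy 16` prints the arguments without starting anything, which is handy for checking what will be passed.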
- Comfy-Org/ComfyUI#10238: WanImageToVideo, WanFirstLastFrameToVideo: Add `vae_tile_size` optional arg. Use `vae_tile_size: 256` for significant encode perf improvement. Add with e.g.
  git remote add alexheretic https://github.com/alexheretic/ComfyUI
  git fetch alexheretic
  git merge --squash alexheretic/wan-vae-tiled-encode
- Use tiled vae decode nodes (size 256 for wan).
Thanks for sharing, proved also useful for my 9060 XT :)