7900 GRE / gfx1100 optimised ComfyUI setup for Linux

This is stuff that has worked well for me.

Tested on Arch Linux, Ryzen 7 5800X, 64GB RAM, RX 7900 GRE

Changelog

2026-06-18: Switch to rocm.nightlies.amd.com/whl-multi-arch torch builds.
2026-06-14: Re-add PYTORCH_MIOPEN_SUGGEST_NHWC since it affects pytorch 2.10. Comment out PYTORCH_NO_HIP_MEMORY_CACHING=1 as it should only be used if OOMs are encountered.
2026-06-12: Update flash-attention install workarounds for commit fc8cbad6. Remove aiter ENABLE_CK, LIBRARY_PATH usage, add AITER_TRITON_ONLY=1.
2026-06-01: Add note about aiter pre-configured fwd attn config making FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON no longer necessary for gfx1100. Tweak PYTORCH_TUNABLEOP_ENABLED advice.
2026-05-28: Add flash-attention/aiter issues + workarounds. Remove PYTORCH_MIOPEN_SUGGEST_NHWC usage.
2026-02-07: Switch to upstream flash-attention + FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON.

Setup venv

Create python 3.14 venv

python3.14 -m venv venv
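
The later pip commands assume the venv is active. A minimal sketch of activating it, assuming it was created as ./venv by the command above (VENV_DIR is a hypothetical override, not something ComfyUI or pip reads). Run it in your interactive shell or source it, since activation does not persist from a child script:

```shell
# Directory of the venv; defaults to ./venv as created above (VENV_DIR is a made-up helper variable)
VENV_DIR="${VENV_DIR:-venv}"

if [ -f "$VENV_DIR/bin/activate" ]; then
    # Activate so that `python` and `pip` resolve to the venv
    . "$VENV_DIR/bin/activate"
    echo "using $(command -v python)"
else
    echo "no venv found at $VENV_DIR, create it first" >&2
fi
```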

Install torch

pip install --index-url https://rocm.nightlies.amd.com/whl-multi-arch/ "torch[device-gfx1100]" "torchvision[device-gfx1100]" torchaudio
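
After the install, it can be worth confirming that the ROCm build of torch was picked up and that it sees the GPU. The helper below is only a sketch: check_torch is a made-up name, and it assumes `python` resolves to the venv's interpreter. ROCm builds of torch expose the GPU through the torch.cuda API.

```shell
# Print the torch version, its HIP version and whether a GPU is visible.
# check_torch is a hypothetical helper name; call it with the venv active.
check_torch() {
    python - <<'EOF'
import torch

print("torch:", torch.__version__)
print("hip:", torch.version.hip)  # None on non-ROCm builds
print("gpu available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
EOF
}
```

Call `check_torch` once torch is installed. If "hip" prints None, a non-ROCm torch build is installed.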

Install flash-attention

See the ROCm install instructions in https://github.com/Dao-AILab/flash-attention.

An optimised forward attention config can be set with FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON (see env vars section), or you can try autotuning: see the repo's README and ensure you have a new enough version checked out.

Issues/workarounds

Since flash-attention moved to using aiter it may have issues installing & running.

The following workaround works on flash-attention main: d16e381f.

In the flash-attention dir: BUILD_TARGET="rocm" FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" pip install --no-build-isolation .
The above step successfully installs flash_attn but may re-install triton, replacing the ROCm build. To get the ROCm triton version back, re-install torch etc.
We'll also need to set the AITER_TRITON_ONLY env var for aiter JIT (see env vars section).

Installed packages should look something like:

$ pip list | rg "torch|flash|aiter|triton"
amd-aiter                              0.0.0
amd-torch-device-gfx11                 2.12.0+rocm7.14.0a20260618
amd-torch-device-gfx1100               2.12.0+rocm7.14.0a20260618
amd-torchvision-device-gfx1100         0.27.0+rocm7.14.0a20260618
flash_attn                             2.8.4
torch                                  2.12.0+rocm7.14.0a20260618
torchaudio                             2.11.0+rocm7.14.0a20260617
torchsde                               0.2.6
torchvision                            0.27.0+rocm7.14.0a20260618
triton                                 3.7.0+gitb4e20bbe.rocm7.14.0a20260618
triton_kernels                         1.0.0+amd.rocm7.0.0.gitd0d77a509
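
Since the flash-attention install may swap out triton, a quick check of the installed triton version can catch that early. This is a sketch only: it assumes ROCm builds carry "rocm" in their version string, as in the listing above, and is_rocm_build / triton_version are hypothetical helper names.

```shell
# Succeeds when a version string contains a ROCm tag,
# e.g. "3.7.0+gitb4e20bbe.rocm7.14.0a20260618"
is_rocm_build() {
    case "$1" in
        *rocm*) return 0 ;;
        *) return 1 ;;
    esac
}

# Installed triton version according to pip; empty if triton (or pip) is missing
triton_version() {
    pip show triton 2>/dev/null | awk '/^Version:/ {print $2}'
}

ver="$(triton_version || true)"
if is_rocm_build "$ver"; then
    echo "triton $ver looks like a ROCm build"
else
    echo "triton '$ver' does not look like a ROCm build, re-install torch etc" >&2
fi
```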

Install comfy requirements

pip install -r requirements.txt

Note: Also install any custom_nodes requirements (not described here).

Env vars

# Slower, but more stable / fewer OOMs. No OOMs? You don't need this.
# I find this necessary to run 720p wan gen
#export PYTORCH_NO_HIP_MEMORY_CACHING=1

# aiter
# https://github.com/Dao-AILab/flash-attention/issues/2580#issuecomment-4607206930
export AITER_TRITON_ONLY=1

# triton
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE

## Significantly faster attn_fwd performance for wan2.2 workflows
## Note: Only necessary for older flash-attention or non-gfx1100 as aiter has this pre-configured for gfx1100
## See <https://github.com/ROCm/aiter/blob/ca781d8197cc85240111093f8c796ecbc2af6294/aiter/ops/triton/_triton_kernels/flash_attn_triton_amd/fwd_prefill.py#L169>
# export FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON='{"BLOCK_M":128,"BLOCK_N":64,"waves_per_eu":6,"PRE_LOAD_V":false,"num_stages":1,"num_warps":8}'

# pytorch switches on NHWC for rocm > 7, which causes significant miopen regressions for upscaling
# https://github.com/ROCm/TheRock/issues/2485#issuecomment-3666986174
# affects pytorch 2.10, fixed in later version (not exactly sure which though)
# export PYTORCH_MIOPEN_SUGGEST_NHWC=0

# miopen
## Tell comfyui to *not* disable miopen/cudnn, otherwise upscale perf is much worse
export COMFYUI_ENABLE_MIOPEN=1
## miopen default find mode causes significant initial slowness, yields little or no benefit to workloads I tested
export MIOPEN_FIND_MODE=FAST

Notes:

PYTORCH_TUNABLEOP_ENABLED (tunable ops) can be beneficial but it's slow to tune and can be harmful for workflows with distinct resolutions (like detailer workflows). Consider enabling online tuning for specific workflows for a while and then setting PYTORCH_TUNABLEOP_TUNING=0 to ensure no further downsides.
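
The two-phase approach described above could look roughly like this. It is a sketch, not a tested recommendation: when to switch phases is up to you, and it relies on PyTorch's default location for the tuning results file.

```shell
# Phase 1: enable tunable ops with online tuning, then run the workflow you care about
export PYTORCH_TUNABLEOP_ENABLED=1
export PYTORCH_TUNABLEOP_TUNING=1

# Phase 2 (later launches): keep using the recorded results but stop tuning new shapes
# export PYTORCH_TUNABLEOP_ENABLED=1
# export PYTORCH_TUNABLEOP_TUNING=0
```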

ComfyUI args

--use-flash-attention: use faster flash attention installed above.
--disable-pinned-memory: see the comment on Comfy-Org/ComfyUI#11781.
--cache-ram 32: optional, helps prevent comfy from using up all 64GB of ram.
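
Put together, a launch helper might look like the sketch below. COMFY_ARGS and launch_comfy are hypothetical names. It assumes you are in the ComfyUI directory with the venv active and the env vars from the previous section already exported.

```shell
# Flags from the list above; --cache-ram 32 is optional
COMFY_ARGS="--use-flash-attention --disable-pinned-memory --cache-ram 32"

# Start ComfyUI with those flags plus any extra args passed to the function
launch_comfy() {
    # $COMFY_ARGS is intentionally unquoted so it splits into separate flags
    python main.py $COMFY_ARGS "$@"
}
```

Extra arguments are passed through, so something like `launch_comfy --port 8188` also works.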

ComfyUI proposed patches

Comfy-Org/ComfyUI#10238 adds an optional vae_tile_size arg to WanImageToVideo and WanFirstLastFrameToVideo. Use vae_tile_size: 256 for a significant encode perf improvement. Apply it with e.g.:
```
git remote add alexheretic https://github.com/alexheretic/ComfyUI
git fetch alexheretic
git merge --squash alexheretic/wan-vae-tiled-encode
```

Usage hints

Use tiled vae decode nodes (size 256 for wan).

alexheretic/gfx1100-comfyui-setup.md
