@alexheretic
Last active June 25, 2026 23:04
7900 GRE / gfx1100 optimised ComfyUI setup for Linux

This is stuff that has worked well for me.

Tested on Arch Linux, Ryzen 7 5800X, 64GB RAM, RX 7900 GRE

Changelog
  • 2026-06-18: Switch to rocm.nightlies.amd.com/whl-multi-arch torch builds.
  • 2026-06-14: Re-add PYTORCH_MIOPEN_SUGGEST_NHWC since it affects pytorch 2.10. Comment out PYTORCH_NO_HIP_MEMORY_CACHING=1 as it should only be used if OOMs are encountered.
  • 2026-06-12: Update flash-attention install workarounds for commit fc8cbad6. Remove aiter ENABLE_CK, LIBRARY_PATH usage, add AITER_TRITON_ONLY=1.
  • 2026-06-01: Add note about aiter pre-configured fwd attn config making FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON no longer necessary for gfx1100. Tweak PYTORCH_TUNABLEOP_ENABLED advice.
  • 2026-05-28: Add flash-attention/aiter issues + workarounds. Remove PYTORCH_MIOPEN_SUGGEST_NHWC usage.
  • 2026-02-07: Switch to upstream flash-attention + FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON.

Setup venv

Create python 3.14 venv

python3.14 -m venv venv
source venv/bin/activate

Install torch

pip install --index-url https://rocm.nightlies.amd.com/whl-multi-arch/ "torch[device-gfx1100]" "torchvision[device-gfx1100]" torchaudio

Install flash-attention

See rocm install instructions in https://github.com/Dao-AILab/flash-attention.

An optimised forward attention config can be set with FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON (see env vars section). Alternatively you can try autotuning: see the repo's README & ensure you have a new enough version checked out.

Issues/workarounds

Since flash-attention moved to using aiter it may have issues installing & running.

The following workaround works on flash-attention main: d16e381f.

  • In the flash-attention dir: BUILD_TARGET="rocm" FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" pip install --no-build-isolation .
  • The above step successfully installs flash_attn but may re-install triton. To get the rocm triton version back, re-install torch etc. (same command as in "Install torch").
  • We'll also need to set AITER_TRITON_ONLY env var for aiter JIT (see env vars section).
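The workaround steps above can be strung together roughly like this. It's a sketch, not a tested installer: it assumes the venv is active and that flash-attention is cloned into ./flash-attention. The `run` helper, `DRY_RUN` and `FA_DIR` are names made up for this example; by default the script only prints the commands so you can check them first.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: print each command, only execute it when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

# Assumed location of the flash-attention checkout.
FA_DIR="${FA_DIR:-./flash-attention}"

# Check out the commit the workaround was written against.
run git -C "$FA_DIR" checkout d16e381f

# Build and install flash-attention with the Triton AMD backend.
run env BUILD_TARGET=rocm FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE \
  pip install --no-build-isolation "$FA_DIR"

# The step above may have replaced triton, so re-install torch etc to get
# the rocm triton back. If pip reports everything as already satisfied,
# adding --force-reinstall is one option (an assumption, not from the gist).
run pip install --index-url https://rocm.nightlies.amd.com/whl-multi-arch/ \
  "torch[device-gfx1100]" "torchvision[device-gfx1100]" torchaudio
```

Run it once as-is to preview, then with `DRY_RUN=0` to actually execute. AITER_TRITON_ONLY is a runtime setting, so it belongs with the other env vars rather than in this script.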

Installed packages should look something like:

$ pip list | rg "torch|flash|aiter|triton"
amd-aiter                              0.0.0
amd-torch-device-gfx11                 2.12.0+rocm7.14.0a20260618
amd-torch-device-gfx1100               2.12.0+rocm7.14.0a20260618
amd-torchvision-device-gfx1100         0.27.0+rocm7.14.0a20260618
flash_attn                             2.8.4
torch                                  2.12.0+rocm7.14.0a20260618
torchaudio                             2.11.0+rocm7.14.0a20260617
torchsde                               0.2.6
torchvision                            0.27.0+rocm7.14.0a20260618
triton                                 3.7.0+gitb4e20bbe.rocm7.14.0a20260618
triton_kernels                         1.0.0+amd.rocm7.0.0.gitd0d77a509
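Since the flash-attention install may swap out triton, a quick check of the installed triton version can save some head scratching. A minimal sketch, assuming the rocm triton build always carries "rocm" in its version string as in the listing above; `has_rocm_triton` is a made-up helper name.

```shell
# Hypothetical helper: reads `pip list` output on stdin and succeeds only if
# the line for the "triton" package has "rocm" in its version column.
# The triton_kernels line is ignored because the name must match exactly.
has_rocm_triton() {
  awk '$1 == "triton" { found = ($2 ~ /rocm/) } END { exit (found ? 0 : 1) }'
}
```

Usage: `pip list | has_rocm_triton || echo "triton is not the rocm build, re-install torch etc"`.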

Install comfy requirements

pip install -r requirements.txt

Note: Also install any custom_nodes requirements (not described here).

Env vars

# Slower, but more stable / fewer OOMs. No OOMs? You don't need this.
# I find this necessary to run 720p wan gen
#export PYTORCH_NO_HIP_MEMORY_CACHING=1

# aiter
# https://github.com/Dao-AILab/flash-attention/issues/2580#issuecomment-4607206930
export AITER_TRITON_ONLY=1

# triton
export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE

## Significantly faster attn_fwd performance for wan2.2 workflows
## Note: Only necessary for older flash-attention or non-gfx1100 as aiter has this pre-configured for gfx1100
## See <https://github.com/ROCm/aiter/blob/ca781d8197cc85240111093f8c796ecbc2af6294/aiter/ops/triton/_triton_kernels/flash_attn_triton_amd/fwd_prefill.py#L169>
# export FLASH_ATTENTION_FWD_TRITON_AMD_CONFIG_JSON='{"BLOCK_M":128,"BLOCK_N":64,"waves_per_eu":6,"PRE_LOAD_V":false,"num_stages":1,"num_warps":8}'

# pytorch switches on NHWC for rocm > 7, causes significant miopen regressions for upscaling
# https://github.com/ROCm/TheRock/issues/2485#issuecomment-3666986174
# affects pytorch 2.10, fixed in later version (not exactly sure which though)
# export PYTORCH_MIOPEN_SUGGEST_NHWC=0

# miopen
## Tell comfyui to *not* disable miopen/cudnn, otherwise upscale perf is much worse
export COMFYUI_ENABLE_MIOPEN=1
## miopen default find mode causes significant initial slowness, yields little or no benefit to workloads I tested
export MIOPEN_FIND_MODE=FAST

Notes:

  • PYTORCH_TUNABLEOP_ENABLED (tunable ops) can be beneficial but it's slow to tune and can be harmful for workflows with distinct resolutions (like detailer workflows). Consider enabling online tuning for specific workflows for a while and then setting PYTORCH_TUNABLEOP_TUNING=0 to ensure no further downsides.
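The tune-then-freeze approach from the note above could look something like this. It's only a sketch: `tunableop` is a made-up helper name, and it relies on my understanding that PyTorch reloads previously recorded tuning results when tunable ops are enabled with tuning switched off.

```shell
# Hypothetical helper: switch tunable op modes for the current shell.
#   tune   - enable tunable ops with online tuning (slow while it tunes)
#   frozen - keep using tuned results but stop tuning new shapes
#   off    - back to the default (tunable ops disabled)
tunableop() {
  case "${1:-}" in
    tune)   export PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=1 ;;
    frozen) export PYTORCH_TUNABLEOP_ENABLED=1 PYTORCH_TUNABLEOP_TUNING=0 ;;
    off)    unset PYTORCH_TUNABLEOP_ENABLED PYTORCH_TUNABLEOP_TUNING ;;
    *)      echo "usage: tunableop tune|frozen|off" >&2; return 1 ;;
  esac
}
```

E.g. run `tunableop tune` before starting ComfyUI for a few runs of one workflow, then switch to `tunableop frozen` for everyday use.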

ComfyUI args

  • --use-flash-attention: use faster flash attention installed above.
  • --disable-pinned-memory: Comfy-Org/ComfyUI#11781 (comment)
  • --cache-ram 32: optional, helps prevent comfy from using up all 64GB of ram.
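Putting the env vars and args together, a launcher function along these lines can go in a file you source from your shell. A sketch only: `comfy`, `comfy_env` and `COMFY_DIR` are names invented here, and it assumes the venv lives inside the ComfyUI checkout as set up above.

```shell
# Assumed location of the ComfyUI checkout (with the venv inside it).
COMFY_DIR="${COMFY_DIR:-$HOME/ComfyUI}"

# Env vars from the "Env vars" section (uncommented ones only).
comfy_env() {
  export AITER_TRITON_ONLY=1
  export TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
  export FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
  export COMFYUI_ENABLE_MIOPEN=1
  export MIOPEN_FIND_MODE=FAST
}

# The body is in parentheses so it runs in a subshell: the exports, cd and
# venv activation do not leak into the calling shell.
comfy() (
  comfy_env
  cd "$COMFY_DIR" || exit 1
  . venv/bin/activate || exit 1
  # --cache-ram 32 is optional, drop it if RAM usage is not a problem.
  exec python main.py --use-flash-attention --disable-pinned-memory \
    --cache-ram 32 "$@"
)
```

Then `comfy` starts ComfyUI, with any extra arguments passed through to main.py.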

ComfyUI proposed patches

  • Comfy-Org/ComfyUI#10238: WanImageToVideo, WanFirstLastFrameToVideo: Add vae_tile_size optional arg. Use vae_tile_size: 256 for significant encode perf improvement. Add with e.g.
    git remote add alexheretic https://github.com/alexheretic/ComfyUI
    git fetch alexheretic
    git merge --squash alexheretic/wan-vae-tiled-encode

Usage hints

  • Use tiled vae decode nodes (size 256 for wan).
@legitsplit

Thanks for sharing, proved also useful for my 9060 XT :)

@mikharju

Good for 7900 XTX too. I did try to get Flash Attention to work before on Bazzite and DistroBox, but wasn't sure if it was working or not since not much improvement could be seen. With all of your optimizations though, WAN videos are coming twice as fast now compared to before. Huge thanks! Also your vae_tile_size option rocks!
