Claude Code fixing a CubeCL/WGSL kernel bug
| ╭─── Claude Code v2.0.76 ──────────────────────────────────────────────────────╮ | |
| │ │ Tips for getting │ | |
| │ Welcome back Thomas! │ started │ | |
| │ │ Run /init to create a … │ | |
| │ * ▐▛███▜▌ * │ ─────────────────────── │ | |
| │ * ▝▜█████▛▘ * │ Recent activity │ | |
| │ * ▘▘ ▝▝ * │ No recent activity │ | |
| │ │ │ | |
| │ Opus 4.5 · Claude Max · │ │ | |
| │ [email protected]'s Organization │ │ | |
| │ ~/…/projects-2026-01/radix-sort-project/wgpu_sort │ │ | |
| ╰──────────────────────────────────────────────────────────────────────────────╯ | |
| > The current folder contains a radix sort implementation using WGSL in | |
| Rust. I have made a port of this into "CubeCL" - a GPU-agnostic programming | |
| framework. It works... except when it doesn't. | |
| The code uses a "decoupled lookback" mechanism. So it assumes that some | |
| blocks are scheduled for execution first - and if they are not, the | |
| subsequent ones block indefinitely. This seems to be happening from time to | |
| time in my code. | |
| Can you check how this code avoids that? As far as I know, a CUDA backend | |
| doesn't guarantee any order of execution for blocks. This seems to | |
| specifically affect the scatter_odd and scatter_even kernels, which do the | |
| decoupled lookback with a spin-wait. | |
| ⏺ I'll explore this codebase to understand how the decoupled lookback mechanism is implemented and whether there are any safeguards against the scheduling issue you're describing. | |
| ⏺ Explore(Explore radix sort lookback) | |
| ⎿ Done (19 tool uses · 55.6k tokens · 1m 36s) | |
| ⏺ This is a thorough analysis. Let me look at the specific spin-wait code directly to give you a more precise answer about the ordering guarantees: | |
| ⏺ Read(src/radix_sort.wgsl) | |
| ⎿ Read 60 lines | |
| ⏺ Let me also see Block 0's handling and the initialization: | |
| ⏺ Read(src/radix_sort.wgsl) | |
| ⎿ Read 45 lines | |
| ⏺ Now I have a complete picture. Here's the key finding: | |
| How This Code "Avoids" the Ordering Problem | |
| Short answer: It doesn't really avoid it - it just assumes Block 0 will execute and complete before others need its data. | |
| The Critical Assumption | |
| The code at lines 319-333 shows Block 0 (workgroup 0) has special handling: | |
| - It reads from the global histogram (no dependency on other blocks) | |
| - It writes its result with partition_mask_prefix status immediately | |
| - No spin-wait for Block 0 | |
| All other blocks (lines 335-373) spin-wait looking backward for either: | |
| 1. A PREFIX status (terminates the chain) | |
| 2. A REDUCTION status (accumulate and keep looking back) | |
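| In sketch form, the lookback each later block performs looks like this | |
| (illustrative names such as partition_status, STATUS_MASK, and VALUE_MASK; | |
| not the repository's exact code): | |
|     // Every workgroup except 0 walks backward through the status buffer. | |
|     var exclusive_prefix = 0u; | |
|     var lookback_id = workgroup_id.x - 1u; | |
|     loop { | |
|         let packed = atomicLoad(&partition_status[lookback_id]); | |
|         let status = packed & STATUS_MASK; | |
|         if status == STATUS_INVALID { | |
|             continue;                    // not published yet: keep spinning | |
|         } | |
|         exclusive_prefix += packed & VALUE_MASK; | |
|         if status == STATUS_PREFIX { | |
|             break;                       // inclusive prefix found: chain ends | |
|         } | |
|         lookback_id -= 1u;               // REDUCTION: keep looking back | |
|     } | |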
| Why It Works on WebGPU/Vulkan (Usually) | |
| 1. Block 0 has zero dependencies - it can complete immediately | |
| 2. The assumption is that by the time later blocks reach the spin-wait, Block 0 has already written its PREFIX status | |
| 3. WebGPU/Vulkan schedulers tend to launch blocks somewhat in order for small dispatches | |
| Why Your CubeCL/CUDA Port Might Fail | |
| CUDA makes no guarantees about block scheduling order. The problem: | |
| 1. All blocks launch concurrently - Block 100 might start executing before Block 0 | |
| 2. Block 100 reaches the spin-wait at line 351 | |
| 3. Block 100 spins indefinitely on atomicLoad waiting for Block 99's data | |
| 4. Block 99 is also spinning waiting for Block 98 | |
| 5. All GPU resources are consumed by spinning threads | |
| 6. Block 0 never gets scheduled because there are no free resources | |
| 7. Deadlock | |
| The Spin-Wait Has No Yield | |
| while true { | |
|     let prev = atomicLoad(&histograms[...]); | |
|     if (prev & rs_partition_mask_status) == partition_mask_invalid { | |
|         continue; // Tight spin - no way to yield to other blocks | |
|     } | |
|     ... | |
| } | |
| There's no mechanism to release resources and let other blocks run. | |
| Solutions for Your CubeCL Port | |
| 1. Limit concurrent blocks: Ensure only a subset of blocks run at once, guaranteeing earlier blocks complete before later ones start. CUDA allows setting max blocks per SM. | |
| 2. Use cooperative groups with grid sync: CUDA's cooperative kernel launch can provide ordering guarantees. | |
| 3. Add a fallback/timeout: After N spins, fall back to a slower but guaranteed-progress algorithm. | |
| 4. Use persistent threads pattern: Launch fewer blocks than data requires, have each block process multiple chunks sequentially. | |
| 5. Two-phase approach: First pass computes all reductions (no dependency), second pass does the prefix (after barrier). | |
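| As an illustration, option 5 in sketch form (illustrative names | |
| block_reductions and local_reduction; not the repository's code): | |
|     // Pass A: each workgroup publishes only its own reduction - no | |
|     // dependency on any other block. | |
|     block_reductions[workgroup_id.x] = local_reduction; | |
|     // Pass B (separate dispatch, so pass A is fully complete): sum the | |
|     // reductions of all earlier partitions to get the exclusive prefix. | |
|     var exclusive_prefix = 0u; | |
|     for (var i = 0u; i < workgroup_id.x; i = i + 1u) { | |
|         exclusive_prefix = exclusive_prefix + block_reductions[i]; | |
|     } | |
| (A real pass B would scan block_reductions in parallel; the serial sum just | |
| shows the structure.) | |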
| Would you like me to look at specific patterns for implementing any of these solutions in CubeCL? | |
| > CubeCL doesn't use CUDA. I mentioned CUDA because, presumably, WGSL code | |
| *can* run on CUDA GPUs. CubeCL is backend-agnostic. It has a "wgpu" backend | |
| which I am using. So I am surprised that it is not working. | |
| ⏺ Ah, that's an important clarification! If both are using wgpu, then the scheduling behavior should theoretically be similar. This suggests the issue might be elsewhere. Let me reconsider: | |
| Possible Causes in Your CubeCL Port | |
| 1. Initialization of the partition buffer | |
| The status bits must start as INVALID (00 for even passes, 10 for odd passes). If the buffer isn't zeroed properly before each pass, blocks might see stale PREFIX/REDUCTION status from previous runs and skip waiting when they shouldn't. | |
| 2. Status bit encoding mismatch | |
| The original uses different encodings for even vs odd: | |
| - scatter_even: INVALID=0, REDUCTION=1, PREFIX=2 | |
| - scatter_odd: INVALID=2, REDUCTION=3, PREFIX=0 | |
| This alternation prevents stale data from a previous pass being misinterpreted (see the sketch after this list). Is your port handling this correctly? | |
| 3. Atomic memory ordering | |
| CubeCL might generate different atomic semantics. The spin-wait relies on atomicLoad having acquire semantics and atomicStore having release semantics. | |
| 4. Does the original WGSL code also fail occasionally? | |
| You said your port "works except when it doesn't" - does the original WGSL version here work 100% reliably, or does it also occasionally hang? | |
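| As a sketch of the encoding from point 2 (illustrative constants matching | |
| the values described above; status in the top 2 bits, count in the low 30): | |
|     const STATUS_MASK : u32 = 0xC0000000u;   // top 2 bits | |
|     const VALUE_MASK  : u32 = 0x3FFFFFFFu;   // low 30 bits | |
|     // even pass: INVALID = 0u,        REDUCTION = 1u << 30u, PREFIX = 2u << 30u | |
|     // odd pass:  INVALID = 2u << 30u, REDUCTION = 3u << 30u, PREFIX = 0u | |
|     // A stale PREFIX from the previous pass therefore decodes as INVALID in | |
|     // the current pass and cannot short-circuit the lookback. | |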
| --- | |
| Could you share: | |
| 1. Your CubeCL spin-wait/lookback implementation? | |
| 2. How you're initializing the partition buffer between passes? | |
| 3. Which GPU/driver you're testing on? | |
| This would help pinpoint whether it's a translation issue or something more fundamental about the algorithm's assumptions. | |
| > It is in ../cubecl-recipes/src/bin/07_radix_sort/main.rs | |
| ⏺ Read(~/code/projects-2026-01/radix-sort-project/cubecl-recipes/src/bin/07_radi | |
| x_sort/main.rs) | |
| ⎿ Read 761 lines | |
| ⏺ Now I can analyze the differences. Your CubeCL implementation looks quite faithful to the original, but I've identified a few potential issues: | |
| Key Findings | |
| 1. Memory Ordering - Most Likely Culprit | |
| CubeCL's Atomic::load and Atomic::store may not have the same memory ordering semantics as WGSL's atomicLoad/atomicStore. | |
| In WGSL, atomics provide acquire/release semantics across workgroups. If CubeCL's atomics use relaxed ordering, Block 1 might see stale zeros even after Block 0 has written its PREFIX. | |
| This would explain intermittent failures - sometimes the timing/caching works out, sometimes it doesn't. | |
| 2. Missing Last-Block Guard | |
| In the WGSL code (lines 368-369): | |
| if wid.x < nwg.x - 1u { // only store when inbetween, skip for last workgroup | |
|     atomicAdd(&histograms[partition_offset + partition_base], exc | (1u << 30u)); | |
| } | |
| Your CubeCL code (lines 354-357) doesn't have this check: | |
| Atomic::add( | |
|     &partition_status[UNIT_POS + partition_base], | |
|     global_prefix | (1u32 << 30), | |
| ); | |
| This isn't the hang cause, but it writes unnecessary data. | |
| 3. Same issue at lines 313-318 | |
| WGSL checks wid.x < nwg.x - 1u before writing REDUCTION. Your code writes for all blocks. Again, not the hang cause, but a minor difference. | |
| The Core Problem | |
| Both implementations fundamentally rely on the GPU scheduler eventually running Block 0. The difference is likely in memory visibility: | |
| - WGSL atomics guarantee cross-workgroup visibility | |
| - CubeCL might not have the same guarantees by default | |
| Questions to Help Debug | |
| 1. Does it hang consistently with small block counts (2-3 blocks) or only with many blocks? | |
| 2. Which GPU/driver are you testing on? (Intel, AMD, NVIDIA?) | |
| 3. Can you check what CubeCL generates for the atomics? There might be a way to dump the generated WGSL/SPIR-V. | |
| 4. What's the value of PLANE_DIM at runtime? Different subgroup sizes could affect behavior. | |
| 5. Does adding a storage barrier after the atomic writes help? CubeCL might have something like storageBarrier() or a memory fence you can insert. | |
| > 1. I have only tested with a 2-block setup so far. And it works... until | |
| it doesn't. I am on an M1 MacBook Pro and using the "wgpu" backend for | |
| CubeCL, which is based on WebGPU (WGSL). | |
| 2. M1 MacBook with the wgpu (WebGPU) driver | |
| 3. See scatter_kernel.wgsl | |
| 4. PLANE_DIM is 32 | |
| 5. There is "sync_storage()". I tried putting that after every atomic-write | |
| in decoupled-lookback. Still no help. | |
| Oh, it doesn't get stuck forever, it may just take a minute or so. 72 | |
| seconds on last try. And once it continues, it does sort correctly. When it | |
| works fine, it is less than a second (in debug mode, not even release). and | |
| when it breaks, it is ~70+ seconds (still gives correct results) | |
| ⏺ This is crucial information! The behavior you describe - not a deadlock, but a 70+ second delay before succeeding - strongly points to a memory visibility/cache coherence issue on Apple Silicon. | |
| Let me look at the generated WGSL: | |
| ⏺ Search(pattern: "**/scatter_kernel.wgsl") | |
| ⎿ Found 1 file (ctrl+o to expand) | |
| ⏺ Read(scatter_kernel.wgsl) | |
| ⎿ Read 821 lines | |
| ⏺ Found it! Look at the spin-wait loop in the generated WGSL (lines 192-212): | |
| loop { | |
|     let l_86 = l_mut_85 + local_idx; | |
|     let l_576 = info[3u];            // <-- Reading from info buffer every iteration! | |
|     let l_577 = l_576 * 1u; | |
|     let l_578 = min(l_86, l_577);    // <-- Bounds check every iteration | |
|     let l_579 = &buffer_3_global[l_578]; | |
|     let l_87 = l_579; | |
|     let l_88 = atomicLoad(l_87); | |
|     // ... rest of loop | |
| } | |
| The problem: CubeCL is inserting bounds-checking that reads info[3u] on every single iteration of the spin-wait loop. This is a non-atomic read from a different buffer. | |
| Compare to the original WGSL which has a tight loop with just atomicLoad: | |
| while true { | |
|     let prev = atomicLoad(&histograms[partition_base_prev + partition_offset]); | |
|     // ... just atomic operations, no other memory reads | |
| } | |
| Why This Matters on Apple Silicon | |
| On M1's unified memory architecture with Metal backend: | |
| 1. The non-atomic read (info[3u]) might be forcing the GPU to serialize memory accesses differently | |
| 2. Or worse - it could be thrashing caches between the info buffer and the partition_status buffer | |
| 3. The GPU might be prioritizing coherence of the info reads over the atomic partition reads | |
| Quick Test | |
| Can you try modifying your CubeCL code to not use bounds checking in the spin-wait loop? Something like removing the ArrayArg size check or using raw pointer access if CubeCL supports it? | |
| Alternatively, does CubeCL have an unchecked array access mode? | |
| Also Worth Checking | |
| Have you actually tested the original wgpu_sort code on your M1? If that also has the 70-second delay, then it's a Metal/M1 issue. If it works fine, the bounds checking is likely the culprit. | |
| > I tried a launch_unchecked macro. I got this new kernel: | |
| scatter_kernel_unchecked.wgsl. Can you check that? | |
| ⏺ Read(scatter_kernel_unchecked.wgsl) | |
| ⎿ Read 719 lines | |
| ⏺ The unchecked version looks much cleaner. The spin-wait loop (lines 173-189) is now tight: | |
| loop { | |
|     let l_86 = l_mut_85 + local_idx; | |
|     let l_87 = &buffer_3_global[l_86]; | |
|     let l_88 = atomicLoad(l_87); | |
|     let l_89 = l_88 & 3221225472u; | |
|     let l_90 = l_89 != scalars_u32[0]; | |
|     if l_90 { | |
|         let l_91 = l_88 & 1073741823u; | |
|         l_mut_22 = l_mut_22 + l_91; | |
|         let l_92 = l_89 == scalars_u32[2]; | |
|         if l_92 { | |
|             break; | |
|         } else { | |
|             l_mut_85 = l_mut_85 - 256u; | |
|         } | |
|     } | |
| } | |
| No more info[] reads inside the loop. This is very close to the original WGSL. | |
| One remaining difference: The mask values (scalars_u32[0], scalars_u32[2]) are read from a storage buffer, whereas in the original they're function parameters/constants. This shouldn't matter since they're uniform, but it's still a storage read vs. a register/constant. | |
| Next Steps | |
| 1. Test this unchecked version - does it still have the 70-second delays? | |
| 2. Critical question: Have you actually run the original wgpu_sort (the WGSL code in this repo) on your M1? If that also has delays, the issue is with Metal/M1's atomic visibility, not CubeCL. | |
| 3. If the original works fine but CubeCL still delays, the remaining difference is the scalars_u32[] reads. You could try making those comptime constants in CubeCL to eliminate all storage reads from the spin loop. | |
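| As a sketch of what that could look like (illustrative even-pass constants | |
| from the encoding discussed earlier; not what CubeCL currently emits): | |
|     const STATUS_MASK    : u32 = 3221225472u;   // 0xC0000000u | |
|     const STATUS_INVALID : u32 = 0u; | |
|     const STATUS_PREFIX  : u32 = 2147483648u;   // 2u << 30u | |
|     loop { | |
|         let packed = atomicLoad(&buffer_3_global[l_mut_85 + local_idx]); | |
|         let status = packed & STATUS_MASK; | |
|         if status != STATUS_INVALID { | |
|             l_mut_22 = l_mut_22 + (packed & 1073741823u); | |
|             if status == STATUS_PREFIX { | |
|                 break; | |
|             } | |
|             l_mut_85 = l_mut_85 - 256u; | |
|         } | |
|     } | |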
| Does the unchecked version improve things? | |
| > It actually seems fine now. Thanks! | |
| ⏺ Excellent! So the culprit was indeed the bounds-checking reads from info[] inside the spin-wait loop. | |
| On Apple Silicon, those extra storage buffer reads were interfering with the cache coherence of the atomic partition status reads. The GPU was likely prioritizing/serializing those info[] accesses in a way that delayed visibility of Block 0's atomic writes. | |
| Key takeaway for CubeCL GPU kernels with spin-waits: | |
| - Use launch_unchecked (or equivalent) for kernels with tight spin-wait loops | |
| - Any non-atomic memory reads inside a spin loop can interfere with atomic visibility | |
| - This is especially sensitive on Apple Silicon/Metal | |
| Good luck with the rest of your radix sort implementation! | |