@nihalpasham
Created September 10, 2025 02:07
The only true cross-platform GPGPU API — WebGPU (with WGML 🦀)

True cross-platform GPU compute — WebGPU + WGML

![basic-webgpu-stack](image.png)

WebGPU vs WebGL

  • ✅ No global state
  • ✅ Explicit command buffers (WebGL doesn’t have these)
  • ✅ Pipelines (WebGL uses shader programs + fixed function state but no explicit pipeline object)
  • ✅ Bind groups (WebGL uses individual uniform locations and textures)
  • ✅ Render/compute passes (WebGL has framebuffers but no explicit passes)
  • ✅ Compute shaders (WebGL doesn’t support compute shaders)

Differences to Vulkan

  • Some rendering related features/extensions not available (yet)
    • VRS
    • Raytracing
    • Bindless
    • Geometry shaders
    • Etc.
  • 🔴 No explicit memory management
  • 🟡 No explicit synchronization (a sketch of what is exposed follows this list)
    • No multi-stream / overlapping execution control.
    • No cross-workgroup synchronization inside a dispatch.
    • No explicit barriers between passes — the API handles them.
    • No low-level fences/timeline semaphores.
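
For contrast, the synchronization WGSL does expose is scoped to a single workgroup: workgroupBarrier() for workgroup memory and storageBarrier() for storage buffers. A minimal sketch (illustrative only; the variable names are made up here):

var<workgroup> tile: array<f32, 64>;

@compute @workgroup_size(64)
fn main(@builtin(local_invocation_id) lid: vec3<u32>) {
    tile[lid.x] = f32(lid.x) * 2.0;
    // Synchronizes only the 64 invocations of *this* workgroup; there is no
    // primitive to wait on other workgroups of the same dispatch.
    workgroupBarrier();
    let neighbor = tile[(lid.x + 1u) % 64u];
    // storageBarrier() similarly orders accesses to storage buffers within a workgroup.
    _ = neighbor;
}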

WGSL

When mapping CUDA terminology to WGSL (WebGPU Shading Language) terminology, it helps to think in terms of the thread hierarchy, the memory model, and the execution model, since CUDA is very GPU-centric while WGSL is designed around WebGPU's abstraction.

| CUDA Concept | WGSL / WebGPU Equivalent | Notes |
| --- | --- | --- |
| thread | invocation | A single execution of a compute shader; a WGSL compute shader invocation is analogous to a CUDA thread. |
| `threadIdx` | `@builtin(local_invocation_id)` | Gives the ID of the invocation within a workgroup; equivalent to CUDA's per-block thread index. |
| `blockIdx` | `@builtin(workgroup_id)` | Gives the ID of the workgroup; equivalent to CUDA's block index. |
| `blockDim` | `@workgroup_size(x, y, z)` attribute | The number of invocations per workgroup is fixed on the entry point at shader-creation time (there is no `@builtin(workgroup_size)`); it plays the role of `blockDim`. |
| `gridDim` | `@builtin(num_workgroups)` | The total number of workgroups is set by the dispatch call, e.g., `dispatchWorkgroups(x, y, z)`, and is readable in the shader via `num_workgroups`. |
| shared memory | `var<workgroup>` | Memory shared among invocations in the same workgroup (CUDA's `__shared__`). |
| global memory | `var<storage>` / `var<storage, read_write>` | Device-visible storage buffers, similar to CUDA global memory. |
| local memory | `var<private>` (or function-scope `var`) | Private to each invocation, analogous to per-thread registers/local memory in CUDA. |
| `__syncthreads()` | `workgroupBarrier()` | Synchronizes all invocations in the same workgroup/block. |
| grid-stride loops | manual loop over the global invocation index | Compute the linear global id (`workgroup_id * workgroup_size + local_invocation_id`, or use `@builtin(global_invocation_id)` directly), then stride by `workgroup_size * num_workgroups` as needed. |
| warp | not exposed | WGSL abstracts hardware warps/wavefronts; CUDA warps are a hardware execution detail, not part of the high-level model. |
| `atomic*` | `atomic<u32>` / `atomic<i32>` with `atomicLoad`, `atomicStore`, `atomicAdd`, ... | WGSL provides atomics on `var<workgroup>` and `var<storage, read_write>` variables declared with an atomic type. |

Example: CUDA → WGSL Thread Mapping

CUDA kernel:

__global__ void add(float *a, float *b, float *c) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    c[idx] = a[idx] + b[idx];
}

WGSL compute shader:

@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> c: array<f32>;

@compute @workgroup_size(64)
fn main(
    @builtin(local_invocation_id) local_id: vec3<u32>,
    @builtin(workgroup_id) group_id: vec3<u32>
) {
    let idx = local_id.x + group_id.x * 64u;
    c[idx] = a[idx] + b[idx];
}

Notice how 64u in WGSL corresponds to blockDim.x in CUDA.
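
The same kernel can also be written with @builtin(global_invocation_id), which already equals workgroup_id * workgroup_size + local_invocation_id. A minimal sketch (with an added bounds check, since the buffer length is usually not an exact multiple of the workgroup size):

@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> c: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let idx = gid.x;
    // Guard against reading/writing past the end of the runtime-sized arrays.
    if (idx < arrayLength(&a)) {
        c[idx] = a[idx] + b[idx];
    }
}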

Softmax Implementation

What is Softmax? Softmax is like a "probability converter" - it takes a list of numbers and turns them into probabilities that add up to 1.0. Think of it like converting test scores into percentages where higher scores get higher probabilities.

Example:

Here's a simple 8x8 matrix before and after applying softmax to each column:

BEFORE SOFTMAX:

 2.3  -1.2   0.8   3.1  -0.5   1.7   2.9  -2.1
 1.5   0.3  -1.8   2.4   1.1  -0.7   0.9   1.6
-0.9   2.7   1.4  -2.3   0.6   3.2  -1.5   0.2
 3.8  -0.4   2.1   1.9  -1.1   0.5   2.6  -0.8
 0.7   1.8  -2.5   0.1   2.3  -1.4   1.2   3.4
-1.6   0.9   3.5  -0.6   1.7   2.8  -0.3   1.0
 2.2  -2.0   0.4   2.7   0.8  -1.9   3.1   0.6
 1.3   3.6  -1.7   1.5  -0.2   0.7  -2.4   2.5

AFTER SOFTMAX:

0.2156 0.0302 0.1089 0.4982 0.0364 0.2648 0.3542 0.0061
0.0969 0.1297 0.0079 0.2508 0.1808 0.0237 0.0476 0.2359
0.0088 0.1456 0.1978 0.0022 0.1088 0.1178 0.0043 0.0583
0.9787 0.0647 0.3854 0.1488 0.0198 0.0789 0.2639 0.0217
0.0434 0.0598 0.0040 0.0251 0.5542 0.0118 0.0640 1.4377
0.0044 0.0245 0.1500 0.0124 0.3264 0.0779 0.0142 0.1301
0.1946 0.0135 0.0720 0.3312 0.1328 0.0072 0.4518 0.0869
0.0796 0.3620 0.0088 0.1013 0.0488 0.0096 0.0018 0.5788

After column-wise softmax, each column sums to 1.0 and every value lies between 0 and 1.

The Math Formula: For each number x[i] in your list, softmax calculates:

softmax(x[i]) = e^(x[i] - max) / sum(e^(x[j] - max) for all j)
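
For example, for x = [1.0, 2.0, 3.0] the max is 3.0, so the shifted exponentials are e^(-2) ≈ 0.1353, e^(-1) ≈ 0.3679, and e^(0) = 1.0. Their sum is ≈ 1.5032, giving softmax(x) ≈ [0.0900, 0.2447, 0.6652], which sums to 1.0. Subtracting the max changes nothing mathematically (the common factor e^(-max) cancels), but it keeps the exponentials from overflowing.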

Imports & bindings

#import wgblas::shape as Shape;

@group(0) @binding(0)
var<uniform> shape: Shape::Shape;
@group(0) @binding(1)
var<storage, read_write> in_out_mat: array<f32>;
  • shape describes the dimensions of the matrix (rows, cols, strides).
  • in_out_mat is the actual matrix data in memory.
  • These are GPU resources bound at runtime.
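
The shader does not define Shape itself; it comes from the wgblas import. As a rough, hypothetical sketch of what such a struct and its indexing helper might look like (field names are invented for illustration and the module namespacing is ignored; the actual wgblas definitions may differ):

struct Shape {
    nrows: u32,
    ncols: u32,
    row_stride: u32,   // hypothetical field names, matching the "rows, cols, strides" description
    col_stride: u32,
    offset: u32
}

// im(shape, i, j): linear index of element (i, j) in the backing array.
fn im(s: Shape, i: u32, j: u32) -> u32 {
    return s.offset + i * s.row_stride + j * s.col_stride;
}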

Workgroup configuration

const WORKGROUP_SIZE: u32 = 64;

var<workgroup> workspace: array<f32, WORKGROUP_SIZE>;
var<workgroup> the_max: f32;
var<workgroup> denominator: f32;
  • WORKGROUP_SIZE = 64 → each GPU workgroup has 64 threads.
  • workspace is shared memory for temporary values inside a workgroup.
  • the_max and denominator will hold the final values of the column maximum and the softmax denominator (sum of exponentials).

Reduction helpers

fn reduce_max(thread_id: u32, stride: u32) { ... }
fn reduce_sum(thread_id: u32, stride: u32) { ... }
  • These functions do tree reductions in shared memory.
  • They progressively combine values in workspace until only one remains:
    • reduce_max finds the maximum value.
    • reduce_sum finds the total sum.
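
The bodies are elided above. A minimal sketch of how such tree reductions are typically written, assuming the workspace and WORKGROUP_SIZE declarations shown earlier (the actual implementation may differ):

fn reduce_max(thread_id: u32, stride: u32) {
    // Make the previous round's writes to workspace visible to the whole workgroup.
    workgroupBarrier();
    if (thread_id < stride) {
        workspace[thread_id] = max(workspace[thread_id], workspace[thread_id + stride]);
    }
}

fn reduce_sum(thread_id: u32, stride: u32) {
    workgroupBarrier();
    if (thread_id < stride) {
        workspace[thread_id] = workspace[thread_id] + workspace[thread_id + stride];
    }
}

Each call halves the number of live slots, so calling a helper with strides 32, 16, 8, 4, 2, 1 collapses the 64-entry workspace down to workspace[0].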

Main compute function

@compute @workgroup_size(WORKGROUP_SIZE, 1, 1)
fn main(@builtin(workgroup_id) workgroup_id: vec3<u32>,
        @builtin(local_invocation_id) local_id: vec3<u32>) {
  • Each workgroup handles one column j of the matrix.
  • Each thread inside the group has an ID (thread_id) and works on a subset of rows.

Step 1. Compute the column maximum

let j = workgroup_id.x;
let thread_id = local_id.x;
let data_len = shape.nrows;
var my_max = -1.0e38;   // very negative initial value

for (var i = thread_id; i < data_len; i += WORKGROUP_SIZE) {
    let val_i = in_out_mat[Shape::im(shape, i, j)];
    my_max = max(my_max, val_i);
}

workspace[thread_id] = my_max;
  • Each thread scans part of the column j, tracking the largest value it sees.
  • Writes its local max into workspace.
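
For example, with 256 rows, thread 0 visits rows 0, 64, 128, and 192 of column j, thread 1 visits rows 1, 65, 129, and 193, and so on; together the 64 threads cover every row.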

Step 2. Reduce to a single max

reduce_max(thread_id, 32u);
reduce_max(thread_id, 16u);
reduce_max(thread_id, 8u);
reduce_max(thread_id, 4u);
reduce_max(thread_id, 2u);
reduce_max(thread_id, 1u);

if (thread_id == 0) {
    the_max = workspace[0];
}
  • Threads cooperatively reduce workspace down to a single maximum.
  • Thread 0 saves the result in the_max.
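
One detail not shown in these excerpts: before Step 3 reads the_max, the workgroup needs a barrier so the value written by thread 0 is visible to every invocation (the same applies to denominator before Step 5). Roughly:

if (thread_id == 0) {
    the_max = workspace[0];
}
// Publish the_max to the whole workgroup before the next step reads it.
workgroupBarrier();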

Step 3. Compute the denominator (sum of exponentials)

var my_denominator = 0.0;
for (var i = thread_id; i < data_len; i += WORKGROUP_SIZE) {
    let ii = Shape::im(shape, i, j);
    let val_i = in_out_mat[ii];
    let exp_i = exp(val_i - the_max);
    my_denominator += exp_i;
    in_out_mat[ii] = exp_i;   // overwrite with exp(x - max)
}
workspace[thread_id] = my_denominator;
  • Each thread computes exp(val - max) for its portion of the column.
  • This avoids overflow by subtracting the_max (softmax trick).
  • Each thread accumulates a partial sum (my_denominator).
  • Writes results back into in_out_mat.

Step 4. Reduce to get total denominator

reduce_sum(thread_id, 32u);
reduce_sum(thread_id, 16u);
reduce_sum(thread_id, 8u);
reduce_sum(thread_id, 4u);
reduce_sum(thread_id, 2u);
reduce_sum(thread_id, 1u);

if (thread_id == 0) {
    denominator = workspace[0];
}
  • The same tree reduction as for the max, but summing instead; thread 0 stores the total in denominator.

Step 5. Normalize

for (var i = thread_id; i < data_len; i += WORKGROUP_SIZE) {
    let ii = Shape::im(shape, i, j);
    let val_i = in_out_mat[ii];
    in_out_mat[ii] = val_i / denominator;
}
  • Each thread divides its portion of the column by the total denominator.
  • At the end, in_out_mat contains the softmax values.

Shape utilities

The rest of the file (Shape, im, it, etc.) is just utility code for indexing into row-major or column-major matrices stored in linear memory.

In summary: This shader implements softmax on each column of a matrix, using parallel reductions in GPU shared memory. The steps are: 1. Find the max of the column. 2. Subtract max, exponentiate, and sum. 3. Divide by the sum to normalize.

