@nihalpasham
Created September 10, 2025 02:07
The only true cross-platform GPGPU API — WebGPU (with WGML 🦀)

True cross-platform GPU compute — WebGPU + WGML

![basic-webgpu-stack](image.png)

WebGPU vs WebGL

  • ✅ No global state
  • ✅ Explicit command buffers (WebGL doesn’t have these)
  • ✅ Pipelines (WebGL uses shader programs + fixed function state but no explicit pipeline object)
  • ✅ Bind groups (WebGL uses individual uniform locations and textures)
  • ✅ Render/compute passes (WebGL has framebuffers but no explicit passes)
  • ✅ Compute shaders (WebGL doesn’t support compute shaders)

Differences to Vulkan

  • Some rendering related features/extensions not available (yet)
    • VRS
    • Raytracing
    • Bindless
    • Geometry shaders
    • Etc.
  • 🔴 No explicit memory management
  • 🟡 No explicit synchronization (a sketch of what is exposed follows this list)
    • No multi-stream / overlapping execution control.
    • No cross-workgroup synchronization inside a dispatch.
    • No explicit barriers between passes — the API handles them.
    • No low-level fences/timeline semaphores.
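
For contrast, the synchronization WGSL does expose is scoped to a single workgroup: workgroupBarrier() for workgroup memory and storageBarrier() for storage buffers. A minimal sketch (illustrative only; the variable names are made up here):

var<workgroup> tile: array<f32, 64>;

@compute @workgroup_size(64)
fn main(@builtin(local_invocation_id) lid: vec3<u32>) {
    tile[lid.x] = f32(lid.x) * 2.0;
    // Synchronizes only the 64 invocations of *this* workgroup; there is no
    // primitive to wait on other workgroups of the same dispatch.
    workgroupBarrier();
    let neighbor = tile[(lid.x + 1u) % 64u];
    // storageBarrier() similarly orders accesses to storage buffers within a workgroup.
    _ = neighbor;
}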

WGSL

When mapping CUDA terminology to WGSL (WebGPU Shading Language) terminology, it helps to think in terms of the thread hierarchy, the memory model, and the execution model, since CUDA is very GPU-centric while WGSL is designed around WebGPU's abstraction.

| CUDA Concept | WGSL / WebGPU Equivalent | Notes |
| --- | --- | --- |
| thread | invocation | A single execution of a compute shader; a WGSL compute shader invocation is analogous to a CUDA thread. |
| `threadIdx` | `@builtin(local_invocation_id)` | Gives the ID of the invocation within a workgroup; equivalent to CUDA's per-block thread index. |
| `blockIdx` | `@builtin(workgroup_id)` | Gives the ID of the workgroup; equivalent to CUDA's block index. |
| `blockDim` | `@workgroup_size(x, y, z)` attribute | The number of invocations per workgroup is fixed on the entry point at shader-creation time (there is no `@builtin(workgroup_size)`); it plays the role of `blockDim`. |
| `gridDim` | `@builtin(num_workgroups)` | The total number of workgroups is set by the dispatch call, e.g., `dispatchWorkgroups(x, y, z)`, and is readable in the shader via `num_workgroups`. |
| shared memory | `var<workgroup>` | Memory shared among invocations in the same workgroup (CUDA's `__shared__`). |
| global memory | `var<storage>` / `var<storage, read_write>` | Device-visible storage buffers, similar to CUDA global memory. |
| local memory | `var<private>` (or function-scope `var`) | Private to each invocation, analogous to per-thread registers/local memory in CUDA. |
| `__syncthreads()` | `workgroupBarrier()` | Synchronizes all invocations in the same workgroup/block. |
| grid-stride loops | manual loop over the global invocation index | Compute the linear global id (`workgroup_id * workgroup_size + local_invocation_id`, or use `@builtin(global_invocation_id)` directly), then stride by `workgroup_size * num_workgroups` as needed. |
| warp | not exposed | WGSL abstracts hardware warps/wavefronts; CUDA warps are a hardware execution detail, not part of the high-level model. |
| `atomic*` | `atomic<u32>` / `atomic<i32>` with `atomicLoad`, `atomicStore`, `atomicAdd`, ... | WGSL provides atomics on `var<workgroup>` and `var<storage, read_write>` variables declared with an atomic type. |

Example: CUDA → WGSL Thread Mapping

CUDA kernel:

__global__ void add(float *a, float *b, float *c) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    c[idx] = a[idx] + b[idx];
}

WGSL compute shader:

@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> c: array<f32>;

@compute @workgroup_size(64)
fn main(
    @builtin(local_invocation_id) local_id: vec3<u32>,
    @builtin(workgroup_id) group_id: vec3<u32>
) {
    let idx = local_id.x + group_id.x * 64u;
    c[idx] = a[idx] + b[idx];
}

Notice how 64u in WGSL corresponds to blockDim.x in CUDA.
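
The same kernel can also be written with @builtin(global_invocation_id), which already equals workgroup_id * workgroup_size + local_invocation_id. A minimal sketch (with an added bounds check, since the buffer length is usually not an exact multiple of the workgroup size):

@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> c: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let idx = gid.x;
    // Guard against reading/writing past the end of the runtime-sized arrays.
    if (idx < arrayLength(&a)) {
        c[idx] = a[idx] + b[idx];
    }
}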

Softmax Implementation

What is Softmax? Softmax is like a "probability converter" - it takes a list of numbers and turns them into probabilities that add up to 1.0. Think of it like converting test scores into percentages where higher scores get higher probabilities.

Example:

Here's a simple 8x8 matrix before and after applying softmax to each column:

BEFORE SOFTMAX:

 2.3  -1.2   0.8   3.1  -0.5   1.7   2.9  -2.1
 1.5   0.3  -1.8   2.4   1.1  -0.7   0.9   1.6
-0.9   2.7   1.4  -2.3   0.6   3.2  -1.5   0.2
 3.8  -0.4   2.1   1.9  -1.1   0.5   2.6  -0.8
 0.7   1.8  -2.5   0.1   2.3  -1.4   1.2   3.4
-1.6   0.9   3.5  -0.6   1.7   2.8  -0.3   1.0
 2.2  -2.0   0.4   2.7   0.8  -1.9   3.1   0.6
 1.3   3.6  -1.7   1.5  -0.2   0.7  -2.4   2.5

AFTER SOFTMAX:

0.2156 0.0302 0.1089 0.4982 0.0364 0.2648 0.3542 0.0061
0.0969 0.1297 0.0079 0.2508 0.1808 0.0237 0.0476 0.2359
0.0088 0.1456 0.1978 0.0022 0.1088 0.1178 0.0043 0.0583
0.9787 0.0647 0.3854 0.1488 0.0198 0.0789 0.2639 0.0217
0.0434 0.0598 0.0040 0.0251 0.5542 0.0118 0.0640 1.4377
0.0044 0.0245 0.1500 0.0124 0.3264 0.0779 0.0142 0.1301
0.1946 0.0135 0.0720 0.3312 0.1328 0.0072 0.4518 0.0869
0.0796 0.3620 0.0088 0.1013 0.0488 0.0096 0.0018 0.5788

After column-wise softmax, each column sums to 1.0 and every value lies between 0 and 1.

The Math Formula: For each number x[i] in your list, softmax calculates:

softmax(x[i]) = e^(x[i] - max) / sum(e^(x[j] - max) for all j)
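
For example, for x = [1.0, 2.0, 3.0] the max is 3.0, so the shifted exponentials are e^(-2) ≈ 0.1353, e^(-1) ≈ 0.3679, and e^(0) = 1.0. Their sum is ≈ 1.5032, giving softmax(x) ≈ [0.0900, 0.2447, 0.6652], which sums to 1.0. Subtracting the max changes nothing mathematically (the common factor e^(-max) cancels), but it keeps the exponentials from overflowing.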

Imports & bindings

#import wgblas::shape as Shape;

@group(0) @binding(0)
var<uniform> shape: Shape::Shape;
@group(0) @binding(1)
var<storage, read_write> in_out_mat: array<f32>;
  • shape describes the dimensions of the matrix (rows, cols, strides).
  • in_out_mat is the actual matrix data in memory.
  • These are GPU resources bound at runtime.
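
The shader does not define Shape itself; it comes from the wgblas import. As a rough, hypothetical sketch of what such a struct and its indexing helper might look like (field names are invented for illustration and the module namespacing is ignored; the actual wgblas definitions may differ):

struct Shape {
    nrows: u32,
    ncols: u32,
    row_stride: u32,   // hypothetical field names, matching the "rows, cols, strides" description
    col_stride: u32,
    offset: u32
}

// im(shape, i, j): linear index of element (i, j) in the backing array.
fn im(s: Shape, i: u32, j: u32) -> u32 {
    return s.offset + i * s.row_stride + j * s.col_stride;
}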

Workgroup configuration

const WORKGROUP_SIZE: u32 = 64;

var<workgroup> workspace: array<f32, WORKGROUP_SIZE>;
var<workgroup> the_max: f32;
var<workgroup> denominator: f32;
  • WORKGROUP_SIZE = 64 → each GPU workgroup has 64 threads.
  • workspace is shared memory for temporary values inside a workgroup.
  • the_max and denominator will hold the final values of the column maximum and the softmax denominator (sum of exponentials).

Reduction helpers

fn reduce_max(thread_id: u32, stride: u32) { ... }
fn reduce_sum(thread_id: u32, stride: u32) { ... }
  • These functions do tree reductions in shared memory.
  • They progressively combine values in workspace until only one remains:
    • reduce_max finds the maximum value.
    • reduce_sum finds the total sum.
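
The bodies are elided above. A minimal sketch of how such tree reductions are typically written, assuming the workspace and WORKGROUP_SIZE declarations shown earlier (the actual implementation may differ):

fn reduce_max(thread_id: u32, stride: u32) {
    // Make the previous round's writes to workspace visible to the whole workgroup.
    workgroupBarrier();
    if (thread_id < stride) {
        workspace[thread_id] = max(workspace[thread_id], workspace[thread_id + stride]);
    }
}

fn reduce_sum(thread_id: u32, stride: u32) {
    workgroupBarrier();
    if (thread_id < stride) {
        workspace[thread_id] = workspace[thread_id] + workspace[thread_id + stride];
    }
}

Each call halves the number of live slots, so calling a helper with strides 32, 16, 8, 4, 2, 1 collapses the 64-entry workspace down to workspace[0].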

Main compute function

@compute @workgroup_size(WORKGROUP_SIZE, 1, 1)
fn main(@builtin(workgroup_id) workgroup_id: vec3<u32>,
        @builtin(local_invocation_id) local_id: vec3<u32>) {
  • Each workgroup handles one column j of the matrix.
  • Each thread inside the group has an ID (thread_id) and works on a subset of rows.

Step 1. Compute the column maximum

let j = workgroup_id.x;
let thread_id = local_id.x;
let data_len = shape.nrows;
var my_max = -1.0e38;   // very negative initial value

for (var i = thread_id; i < data_len; i += WORKGROUP_SIZE) {
    let val_i = in_out_mat[Shape::im(shape, i, j)];
    my_max = max(my_max, val_i);
}

workspace[thread_id] = my_max;
  • Each thread scans part of the column j, tracking the largest value it sees.
  • Writes its local max into workspace.
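
For example, with 256 rows, thread 0 visits rows 0, 64, 128, and 192 of column j, thread 1 visits rows 1, 65, 129, and 193, and so on; together the 64 threads cover every row.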

Step 2. Reduce to a single max

reduce_max(thread_id, 32u);
reduce_max(thread_id, 16u);
reduce_max(thread_id, 8u);
reduce_max(thread_id, 4u);
reduce_max(thread_id, 2u);
reduce_max(thread_id, 1u);

if (thread_id == 0) {
    the_max = workspace[0];
}
  • Threads cooperatively reduce workspace down to a single maximum.
  • Thread 0 saves the result in the_max.
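
One detail not shown in these excerpts: before Step 3 reads the_max, the workgroup needs a barrier so the value written by thread 0 is visible to every invocation (the same applies to denominator before Step 5). Roughly:

if (thread_id == 0) {
    the_max = workspace[0];
}
// Publish the_max to the whole workgroup before the next step reads it.
workgroupBarrier();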

Step 3. Compute the denominator (sum of exponentials)

var my_denominator = 0.0;
for (var i = thread_id; i < data_len; i += WORKGROUP_SIZE) {
    let ii = Shape::im(shape, i, j);
    let val_i = in_out_mat[ii];
    let exp_i = exp(val_i - the_max);
    my_denominator += exp_i;
    in_out_mat[ii] = exp_i;   // overwrite with exp(x - max)
}
workspace[thread_id] = my_denominator;
  • Each thread computes exp(val - max) for its portion of the column.
  • This avoids overflow by subtracting the_max (softmax trick).
  • Each thread accumulates a partial sum (my_denominator).
  • Writes results back into in_out_mat.

Step 4. Reduce to get total denominator

reduce_sum(thread_id, 32u);
reduce_sum(thread_id, 16u);
reduce_sum(thread_id, 8u);
reduce_sum(thread_id, 4u);
reduce_sum(thread_id, 2u);
reduce_sum(thread_id, 1u);

if (thread_id == 0) {
    denominator = workspace[0];
}
  • The same tree reduction as for the max, but summing instead; thread 0 stores the total in denominator.

Step 5. Normalize

for (var i = thread_id; i < data_len; i += WORKGROUP_SIZE) {
    let ii = Shape::im(shape, i, j);
    let val_i = in_out_mat[ii];
    in_out_mat[ii] = val_i / denominator;
}
  • Each thread divides its portion of the column by the total denominator.
  • At the end, in_out_mat contains the softmax values.

Shape utilities

The rest of the file (Shape, im, it, etc.) is just utility code for indexing into row-major or column-major matrices stored in linear memory.

In summary: This shader implements softmax on each column of a matrix, using parallel reductions in GPU shared memory. The steps are: 1. Find the max of the column. 2. Subtract max, exponentiate, and sum. 3. Divide by the sum to normalize.

