
- ✅ No global state
- ✅ Explicit command buffers (WebGL doesn’t have these)
- ✅ Pipelines (WebGL uses shader programs + fixed function state but no explicit pipeline object)
- ✅ Bind groups (WebGL uses individual uniform locations and textures)
- ✅ Render/compute passes (WebGL has framebuffers but no explicit passes)
- ✅ Compute shaders (WebGL doesn’t support compute shaders)
- Some rendering-related features/extensions are not available (yet):
    - VRS (variable rate shading)
    - Raytracing
    - Bindless resources
    - Geometry shaders
    - Etc.
- ==🔴No explicit memory management==
- ==🟡No explicit synchronization==
    - No multi-stream / overlapping execution control.
    - No cross-workgroup synchronization inside a dispatch (within a workgroup, `workgroupBarrier()` is available; see the sketch after this list).
    - No explicit barriers between passes; the API handles them.
    - No low-level fences/timeline semaphores.
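What WGSL does expose is synchronization within a single workgroup. A minimal sketch (the `out` buffer and the values written are illustrative, not from the notes above):

```wgsl
@group(0) @binding(0) var<storage, read_write> out: array<f32>;

var<workgroup> tile: array<f32, 64>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>,
        @builtin(local_invocation_id) lid: vec3<u32>) {
    tile[lid.x] = f32(gid.x);
    workgroupBarrier(); // all 64 invocations of *this* workgroup sync here
    // After the barrier, every write to `tile` above is visible to every invocation.
    out[gid.x] = tile[63u - lid.x];
    // There is no counterpart for synchronizing different workgroups of the same
    // dispatch; ordering across dispatches/passes comes from the WebGPU API itself.
}
```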
When mapping CUDA terminology to WGSL (WebGPU Shading Language) terminology, it helps to think in terms of the thread hierarchy, the memory model, and the execution model, since CUDA is very GPU-centric while WGSL is designed around WebGPU's abstraction.
| CUDA Concept | WGSL / WebGPU Equivalent | Notes |
|---|---|---|
| thread | invocation (a single execution of a compute shader) | In WGSL, a compute shader invocation is analogous to a CUDA thread. |
| threadIdx | @builtin(local_invocation_id) | Gives the ID of the invocation within a workgroup; equivalent to CUDA’s per-block thread index. |
| blockIdx | @builtin(workgroup_id) | Gives the ID of the workgroup; equivalent to CUDA’s block index. |
| blockDim | The @workgroup_size(x, y, z) attribute (a compile-time constant) | Number of invocations in a workgroup (like blockDim in CUDA); WGSL has no runtime built-in for it, the size is fixed in the shader. |
| gridDim | @builtin(num_workgroups) | The number of workgroups per dimension, as set by the dispatch call, e.g., dispatchWorkgroups(x, y, z). |
| shared memory | `var<workgroup>` | Memory shared among invocations in the same workgroup (CUDA's `__shared__`). |
| global memory | `var<storage>` / `var<storage, read_write>` | Device-visible storage buffers, similar to CUDA global memory. |
| local memory | `var<private>` / function-scope `var` | Private to each invocation, analogous to per-thread registers/local memory in CUDA. |
| __syncthreads() | workgroupBarrier() | Synchronizes all invocations in the same workgroup/block. |
| grid-stride loops | Manual loop using @builtin(global_invocation_id) | global_invocation_id equals workgroup_id × workgroup_size + local_invocation_id; stride by workgroup_size × num_workgroups as needed (see the grid-stride sketch below). |
| warp | Not exposed | WGSL abstracts hardware warps/wavefronts; CUDA warps are a hardware execution detail, not part of the high-level model. |
| atomic* | `atomic<u32>` / `atomic<i32>` with atomicLoad, atomicStore, atomicAdd, etc. | WGSL provides atomics on `var<workgroup>` and `var<storage, read_write>` variables (see the sketch below). |
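To make the last row concrete, here is a minimal atomics sketch; the buffer names and the threshold are illustrative, not taken from the text above:

```wgsl
// Count how many input values exceed a threshold, using a single atomic counter.
@group(0) @binding(0) var<storage, read> values: array<f32>;
@group(0) @binding(1) var<storage, read_write> counter: atomic<u32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    if (gid.x < arrayLength(&values) && values[gid.x] > 0.5) {
        atomicAdd(&counter, 1u); // like atomicAdd on global memory in CUDA
    }
}
```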
A simple vector-add kernel shows the mapping in practice. In CUDA:

```cuda
__global__ void add(float *a, float *b, float *c) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    c[idx] = a[idx] + b[idx];
}
```

The same kernel in WGSL:

```wgsl
@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> c: array<f32>;

@compute @workgroup_size(64)
fn main(
    @builtin(local_invocation_id) local_id: vec3<u32>,
    @builtin(workgroup_id) group_id: vec3<u32>
) {
    let idx = local_id.x + group_id.x * 64u;
    c[idx] = a[idx] + b[idx];
}
```

Notice how `64u` in WGSL corresponds to `blockDim.x` in CUDA.
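The kernel above assumes the dispatch exactly covers the array. A grid-stride variant, the pattern from the table, might look like the following sketch; it uses @builtin(global_invocation_id) and @builtin(num_workgroups), and takes the length from arrayLength rather than a separate uniform:

```wgsl
@group(0) @binding(0) var<storage, read> a: array<f32>;
@group(0) @binding(1) var<storage, read> b: array<f32>;
@group(0) @binding(2) var<storage, read_write> c: array<f32>;

@compute @workgroup_size(64)
fn main(
    @builtin(global_invocation_id) gid: vec3<u32>, // workgroup_id * workgroup_size + local_invocation_id
    @builtin(num_workgroups) num_wg: vec3<u32>     // the "gridDim" chosen by dispatchWorkgroups()
) {
    let n = arrayLength(&c);
    let stride = num_wg.x * 64u; // total invocations along x
    // Each invocation handles elements gid.x, gid.x + stride, gid.x + 2*stride, ...
    for (var i = gid.x; i < n; i += stride) {
        c[i] = a[i] + b[i];
    }
}
```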
What is Softmax? Softmax is like a "probability converter" - it takes a list of numbers and turns them into probabilities that add up to 1.0. Think of it like converting test scores into percentages where higher scores get higher probabilities.
Example:
Here's a simple 8x8 matrix before and after applying softmax to each column:
BEFORE SOFTMAX:
```
2.3 -1.2 0.8 3.1 -0.5 1.7 2.9 -2.1
1.5 0.3 -1.8 2.4 1.1 -0.7 0.9 1.6
-0.9 2.7 1.4 -2.3 0.6 3.2 -1.5 0.2
3.8 -0.4 2.1 1.9 -1.1 0.5 2.6 -0.8
0.7 1.8 -2.5 0.1 2.3 -1.4 1.2 3.4
-1.6 0.9 3.5 -0.6 1.7 2.8 -0.3 1.0
2.2 -2.0 0.4 2.7 0.8 -1.9 3.1 0.6
1.3 3.6 -1.7 1.5 -0.2 0.7 -2.4 2.5
```
AFTER SOFTMAX:
```
0.2156 0.0302 0.1089 0.4982 0.0364 0.2648 0.3542 0.0061
0.0969 0.1297 0.0079 0.2508 0.1808 0.0237 0.0476 0.2359
0.0088 0.1456 0.1978 0.0022 0.1088 0.1178 0.0043 0.0583
0.9787 0.0647 0.3854 0.1488 0.0198 0.0789 0.2639 0.0217
0.0434 0.0598 0.0040 0.0251 0.5542 0.0118 0.0640 1.4377
0.0044 0.0245 0.1500 0.0124 0.3264 0.0779 0.0142 0.1301
0.1946 0.0135 0.0720 0.3312 0.1328 0.0072 0.4518 0.0869
0.0796 0.3620 0.0088 0.1013 0.0488 0.0096 0.0018 0.5788
```
Each column now sums to 1.0 and all values are between 0 and 1.
The Math Formula:
For each number x[i] in your list, softmax calculates:
softmax(x[i]) = e^(x[i] - max) / sum(e^(x[j] - max) for all j)
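In standard notation, with $m = \max_j x_j$:

$$
\mathrm{softmax}(x)_i = \frac{e^{x_i - m}}{\sum_j e^{x_j - m}} = \frac{e^{x_i}}{\sum_j e^{x_j}}
$$

The two forms are equal because the common factor $e^{-m}$ cancels; subtracting the maximum only keeps the exponentials in a numerically safe range, which is exactly the trick the shader uses below.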
Here is the WGSL shader that implements this, walked through piece by piece. First, the bindings:

```wgsl
#import wgblas::shape as Shape;

@group(0) @binding(0)
var<uniform> shape: Shape::Shape;
@group(0) @binding(1)
var<storage, read_write> in_out_mat: array<f32>;
```
- `shape` describes the dimensions of the matrix (rows, cols, strides).
- `in_out_mat` is the actual matrix data in memory.
- These are GPU resources bound at runtime.
⸻
```wgsl
const WORKGROUP_SIZE: u32 = 64;

var<workgroup> workspace: array<f32, WORKGROUP_SIZE>;
var<workgroup> the_max: f32;
var<workgroup> denominator: f32;
```
- `WORKGROUP_SIZE` = 64 → each GPU workgroup has 64 threads.
- `workspace` is shared memory for temporary values inside a workgroup.
- `the_max` and `denominator` will hold the final values of the column maximum and the softmax denominator (sum of exponentials).
⸻
```wgsl
fn reduce_max(thread_id: u32, stride: u32) { ... }
fn reduce_sum(thread_id: u32, stride: u32) { ... }
```
- These functions do tree reductions in shared memory.
- They progressively combine values in `workspace` until only one remains: `reduce_max` finds the maximum value; `reduce_sum` finds the total sum.
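The bodies are elided above. A typical shared-memory tree reduction, sketched here rather than copied from the library, could look like this; it reuses the `workspace` array declared earlier and assumes each call happens in uniform control flow:

```wgsl
fn reduce_max(thread_id: u32, stride: u32) {
    // Make the previous round's writes to `workspace` visible before reading.
    workgroupBarrier();
    if (thread_id < stride) {
        workspace[thread_id] = max(workspace[thread_id], workspace[thread_id + stride]);
    }
}

fn reduce_sum(thread_id: u32, stride: u32) {
    workgroupBarrier();
    if (thread_id < stride) {
        workspace[thread_id] = workspace[thread_id] + workspace[thread_id + stride];
    }
}
```

Calling it with strides 32, 16, 8, 4, 2, 1 halves the active range each time, so after the last call `workspace[0]` holds the combined result of all 64 entries.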
⸻
```wgsl
@compute @workgroup_size(WORKGROUP_SIZE, 1, 1)
fn main(@builtin(workgroup_id) workgroup_id: vec3<u32>,
        @builtin(local_invocation_id) local_id: vec3<u32>) {
```
- Each workgroup handles one column `j` of the matrix.
- Each thread inside the group has an ID (`thread_id`) and works on a subset of rows.
⸻
Step 1. Compute the maximum of the column
```wgsl
let j = workgroup_id.x;
let thread_id = local_id.x;
let data_len = shape.nrows;

var my_max = -1.0e38; // very negative initial value
for (var i = thread_id; i < data_len; i += WORKGROUP_SIZE) {
    let val_i = in_out_mat[Shape::im(shape, i, j)];
    my_max = max(my_max, val_i);
}
workspace[thread_id] = my_max;
```
- Each thread scans part of column `j`, tracking the largest value it sees.
- Writes its local max into `workspace`.
⸻
Step 2. Reduce to a single max
```wgsl
reduce_max(thread_id, 32u);
reduce_max(thread_id, 16u);
reduce_max(thread_id, 8u);
reduce_max(thread_id, 4u);
reduce_max(thread_id, 2u);
reduce_max(thread_id, 1u);

if (thread_id == 0) {
    the_max = workspace[0];
}
```
- Threads cooperatively reduce `workspace` down to a single maximum.
- Thread 0 saves the result in `the_max`.
⸻
Step 3. Compute the denominator (sum of exponentials)
```wgsl
var my_denominator = 0.0;
for (var i = thread_id; i < data_len; i += WORKGROUP_SIZE) {
    let ii = Shape::im(shape, i, j);
    let val_i = in_out_mat[ii];
    let exp_i = exp(val_i - the_max);
    my_denominator += exp_i;
    in_out_mat[ii] = exp_i; // overwrite with exp(x - max)
}
workspace[thread_id] = my_denominator;
```
- Each thread computes `exp(val - the_max)` for its portion of the column.
- Subtracting `the_max` avoids overflow (the softmax trick).
- Each thread accumulates a partial sum (`my_denominator`).
- Results are written back into `in_out_mat`.
⸻
Step 4. Reduce to get total denominator
```wgsl
reduce_sum(thread_id, 32u);
reduce_sum(thread_id, 16u);
reduce_sum(thread_id, 8u);
reduce_sum(thread_id, 4u);
reduce_sum(thread_id, 2u);
reduce_sum(thread_id, 1u);

if (thread_id == 0) {
    denominator = workspace[0];
}
```
- Just like the max reduction, but summing instead.
⸻
Step 5. Normalize
```wgsl
for (var i = thread_id; i < data_len; i += WORKGROUP_SIZE) {
    let ii = Shape::im(shape, i, j);
    let val_i = in_out_mat[ii];
    in_out_mat[ii] = val_i / denominator;
}
```
- Each thread divides its portion of the column by the total denominator.
- At the end, `in_out_mat` contains the softmax values.
⸻
The rest of the file (Shape, im, it, etc.) is just utility code for indexing into row-major or column-major matrices stored in linear memory.
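For intuition, a hypothetical version of the indexing helper might look like the following; the actual wgblas definitions may differ:

```wgsl
struct Shape {
    nrows: u32,
    ncols: u32,
    row_stride: u32, // elements between consecutive rows (assumed)
    col_stride: u32, // elements between consecutive columns (assumed)
    offset: u32,     // start of the matrix inside the buffer (assumed)
}

// Linear index of element (i, j) in the flattened array.
fn im(s: Shape, i: u32, j: u32) -> u32 {
    return s.offset + i * s.row_stride + j * s.col_stride;
}
```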
⸻
✅ In summary: This shader implements softmax on each column of a matrix, using parallel reductions in GPU shared memory. The steps are: 1. Find the max of the column. 2. Subtract max, exponentiate, and sum. 3. Divide by the sum to normalize.
