Skip to content

Instantly share code, notes, and snippets.

@dondragmer
dondragmer / BitonicSort.hsl
Last active April 15, 2025 09:21
An optimized bitonic sorting compute shader. Has no bank conflicts, uses shader model 6.6 wave interstices, and works with different sort sizes and per-wave lane counts.
RWBuffer<uint> SortBuffer : register(u0);
static const uint sSortSize = 1024; // can be any power of up to 1024 (the max threads in a group)
// to avoid all bank conflicts there needs to be a space of padding inserted at every multiple of every power of the wave size
// (if an index is a multiple of several powers of the wave size a pad needs to be added for each)
// the smallest wave size possible is 4, so the most padding needed is (sort size) * (1/4 + 1/16 + 1/64 ...)
// which convergets to (sort size) / 3;
static const uint sGroupsharedSortValuesToPaddingRatio = 3;
@Marc-B-Reynolds
Marc-B-Reynolds / output.md
Last active November 25, 2024 22:53
brute force testing of 1/sqrt functions
click for range breakdown

checking on [3f800000,40000000] [1.000000e+00,2.000000e+00]

func e max ULP CR FR 2 ULP > 2 ULP CR% FR% 2 ULP% > 2 ULP%
vrsqrte_f32 -- 4947968 103 225 216 8388065 0.001228 0.002682 0.002575 99.993515
FRSR_Mon0 -- 564177 3 8 6 8388592 0.000036 0.000095 0.000072 99.999797
FRSR_Deg0 -- 403258 0 0 0 8388609 0.000000 0.000000 0.000000 100.000000
FRSR_Mon1 -- 14751 230 464 466 8387449 0.002742 0.005531 0.005555 99.986172
// based on xxhash32
unsigned int hash(char *data, size_t len)
{
unsigned int hash;
if (len < 4) {
// load 3 bytes, overlapping if len < 3
static unsigned char offset1[4] = { 0,0,1,1 };
static unsigned char offset2[4] = { 0,0,0,2 };
unsigned int h = data[0] + (data[offset1[len]]<<8) + (data[offset2[len]]<<16);
h = xxprime1 + h*xxprime2;
#if _WIN32
struct Timer {
LARGE_INTEGER win32_freq;
LARGE_INTEGER win32_start;
Timer() {
QueryPerformanceFrequency(&win32_freq);
win32_start.QuadPart = 0;
}