

🦆
rubber duck debugging

Ben {chmark} Adams benaadams

animetosho / gf2p8affineqb-articles.md
Last active April 28, 2025 19:48
A list of articles documenting uses of the GF2P8AFFINE instruction

Unexpected Uses for the Galois Field Affine Transformation Instruction

Intel added the Galois Field New Instructions (GFNI) extension to its Sunny Cove and Tremont cores. What’s particularly interesting is that GFNI is the only new SIMD extension that came with SSE and VEX/AVX encodings (in addition to EVEX/AVX512), allowing it to be supported on all future Intel cores, including those which don’t support AVX512 (such as the Atom line, as well as Celeron/Pentium-branded “big” cores).

I suspect GFNI was aimed at accelerating SM4 encryption; however, one of its instructions can be used for many other purposes. The extension includes three instructions, but of particular interest here is the Affine Transformation (GF2P8AFFINEQB) instruction, also known as bit-matrix multiply.
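To make "bit-matrix multiply" concrete, here is a scalar C model of what GF2P8AFFINEQB does to each byte, written from my reading of Intel's pseudocode: the 64-bit matrix operand holds eight row bytes, output bit i is the parity of (row byte 7-i AND the input byte), and the immediate byte is XORed in afterwards. The helper name and the two matrix constants are illustrative; in intrinsics code the real operation is `_mm_gf2p8affine_epi64_epi8(x, A, b)`, applied to every byte of `x` using the qword of `A` from the same 64-bit lane.

```c
#include <stdint.h>

/* Hypothetical scalar model of one byte lane of GF2P8AFFINEQB:
   output bit i = parity(matrix byte (7 - i) AND x) XOR bit i of b. */
static uint8_t gf2p8affine_byte(uint64_t matrix, uint8_t x, uint8_t b) {
    uint8_t out = 0;
    for (int i = 0; i < 8; i++) {
        /* select the matrix row that feeds output bit i, mask with x */
        uint8_t v = (uint8_t)(matrix >> (8 * (7 - i))) & x;
        v ^= v >> 4;  /* fold the byte down so bit 0 holds its parity */
        v ^= v >> 2;
        v ^= v >> 1;
        out |= (uint8_t)(((v ^ (b >> i)) & 1) << i);
    }
    return out;
}

/* Identity matrix: output bit i takes input bit i. */
#define GF2P8_IDENTITY 0x0102040810204080ULL
/* Bit reversal: output bit i takes input bit 7 - i. */
#define GF2P8_REVERSE  0x8040201008040201ULL
```

Choosing a different matrix rearranges, duplicates, or XOR-combines input bits per byte, which is what makes the instruction useful well beyond its Galois-field origins.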

There have been various articles which discuss out-of-band

animetosho / galois-field-affine-uses.md
Last active April 25, 2025 19:50
A list of “out-of-band” uses for the GF2P8AFFINEQB instruction I haven’t seen documented elsewhere

Count Leading/Trailing Zero Bits (Byte-wise)

The trailing zero bit count (TZCNT) can be computed by isolating the lowest set bit, then depositing it into the appropriate bit positions for the count. The leading zero bit count (LZCNT) can be computed by reversing the bits, then computing the TZCNT.

__m128i _mm_tzcnt_epi8(__m128i a) {
	// isolate lowest bit
	a = _mm_andnot_si128(_mm_add_epi8(a, _mm_set1_epi8(0xff)), a);
	// convert lowest bit to index
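The preview above stops at the index-conversion step. As an illustration only (the matrix and immediate below are my own, not necessarily animetosho's), here is a scalar C sketch in which that conversion is a single affine transform: a one-hot byte is mapped to its bit index, and the immediate 0x08 turns an all-zero input into 8. LZCNT then follows the text: reverse the bits with another affine transform and take the TZCNT. On hardware the same qwords would be broadcast with `_mm_set1_epi64x` and passed to `_mm_gf2p8affine_epi64_epi8`; `affine_byte` and the macro names are hypothetical.

```c
#include <stdint.h>

/* Scalar model of one GF2P8AFFINEQB byte lane (as I read Intel's pseudocode):
   output bit i = parity(matrix byte (7 - i) AND x) XOR bit i of b. */
static uint8_t affine_byte(uint64_t matrix, uint8_t x, uint8_t b) {
    uint8_t out = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t v = (uint8_t)(matrix >> (8 * (7 - i))) & x;
        v ^= v >> 4;
        v ^= v >> 2;
        v ^= v >> 1;                       /* bit 0 now holds the parity */
        out |= (uint8_t)(((v ^ (b >> i)) & 1) << i);
    }
    return out;
}

/* Hypothetical matrix: rows 0xAA, 0xCC, 0xF0 write the index of a one-hot
   input into bits 0..2; row 0xFF XOR imm bit 3 sets bit 3 only when x == 0. */
#define TZCNT_MATRIX   0xAACCF0FF00000000ULL
#define REVERSE_MATRIX 0x8040201008040201ULL

static uint8_t tzcnt_byte(uint8_t x) {
    /* isolate lowest set bit: x & ~(x - 1), as in the SIMD code above */
    uint8_t lowest = x & (uint8_t)~(uint8_t)(x - 1);
    return affine_byte(TZCNT_MATRIX, lowest, 0x08);   /* x == 0 yields 8 */
}

static uint8_t lzcnt_byte(uint8_t x) {
    return tzcnt_byte(affine_byte(REVERSE_MATRIX, x, 0));
}
```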
Erkaman / taa.frag
Last active January 21, 2025 01:51
Rudimentary temporal anti-aliasing solution that is a good starting point for more advanced TAA techniques.
/*
The MIT License (MIT)
Copyright (c) 2018 Eric Arnebäck
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
eshelman / latency.txt
Last active March 4, 2025 03:41 — forked from jboner/latency.txt
HPC-oriented Latency Numbers Every Programmer Should Know
Latency Comparison Numbers
--------------------------
L1 cache reference/hit                      1.5 ns         4 cycles
Floating-point add/mult/FMA operation       1.5 ns         4 cycles
L2 cache reference/hit                        5 ns   12 ~ 17 cycles
Branch mispredict                             6 ns   15 ~ 20 cycles
L3 cache hit (unshared cache line)           16 ns        42 cycles
L3 cache hit (shared line in another core)   25 ns        65 cycles
Mutex lock/unlock                            25 ns
L3 cache hit (modified in another core)      29 ns        75 cycles
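The two columns are tied together by the clock rate (cycles ≈ latency × frequency). Working backwards from the rows with a single cycle count suggests the cycle figures assume a core clock of roughly 2.6 GHz; that number is my inference from the table, not something the list states:

$$
f \approx \frac{42}{16\ \text{ns}} \approx 2.63\ \text{GHz},\qquad
\frac{65}{25\ \text{ns}} = 2.60\ \text{GHz},\qquad
\frac{75}{29\ \text{ns}} \approx 2.59\ \text{GHz}
$$

At that clock one cycle is about 0.38 ns, so, for example, the 6 ns branch-mispredict entry corresponds to roughly 16 cycles, inside the listed 15 ~ 20 range.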
TheRealMJP / Tex2DCatmullRom.hlsl
Last active December 30, 2024 10:01
An HLSL function for sampling a 2D texture with Catmull-Rom filtering, using 9 texture samples instead of 16
// The following code is licensed under the MIT license: https://gist.github.com/TheRealMJP/bc503b0b87b643d3505d41eab8b332ae
// Samples a texture with Catmull-Rom filtering, using 9 texture fetches instead of 16.
// See http://vec3.ca/bicubic-filtering-in-fewer-taps/ for more details
float4 SampleTextureCatmullRom(in Texture2D<float4> tex, in SamplerState linearSampler, in float2 uv, in float2 texSize)
{
    // We're going to sample a 4x4 grid of texels surrounding the target UV coordinate. We'll do this by rounding
    // down the sample location to get the exact center of our "starting" texel. The starting texel will be at
    // location [1, 1] in the grid, where [0, 0] is the top left corner.
    float2 samplePos = uv * texSize;
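The "9 instead of 16" trick works per axis: the two middle Catmull-Rom weights are never negative, so their two taps can be replaced by one linear-filtered fetch placed between them, leaving 3 fetches per axis and 3 × 3 = 9 in 2D. The sketch below is my own 1D C illustration of that idea, not MJP's HLSL; `fetch_linear` is a hypothetical stand-in for the GPU's linear sampler, and the test compares the 3-fetch result against the plain 4-tap sum.

```c
/* Catmull-Rom weights for taps at texel offsets -1, 0, +1, +2 relative to
   the "starting" texel; f in [0, 1] is the fractional sample position. */
static void catmull_rom_weights(float f, float w[4]) {
    w[0] = f * (-0.5f + f * (1.0f - 0.5f * f));
    w[1] = 1.0f + f * f * (-2.5f + 1.5f * f);
    w[2] = f * (0.5f + f * (2.0f - 1.5f * f));
    w[3] = f * f * (-0.5f + 0.5f * f);
}

/* Hypothetical stand-in for a hardware linear fetch on a 1D texture.
   Assumes pos >= 0 and that tex[(int)pos + 1] is in bounds. */
static float fetch_linear(const float *tex, float pos) {
    int i = (int)pos;                 /* truncation equals floor for pos >= 0 */
    float t = pos - (float)i;
    return tex[i] * (1.0f - t) + tex[i + 1] * t;
}

/* Reference: four point fetches around texel i. */
static float catmull_rom_4tap(const float *tex, int i, float f) {
    float w[4];
    catmull_rom_weights(f, w);
    return w[0] * tex[i - 1] + w[1] * tex[i] + w[2] * tex[i + 1] + w[3] * tex[i + 2];
}

/* Three fetches: the middle pair is merged into one linear fetch. */
static float catmull_rom_3tap(const float *tex, int i, float f) {
    float w[4];
    catmull_rom_weights(f, w);
    float w12 = w[1] + w[2];          /* >= 1 on [0, 1] because w[0], w[3] <= 0 */
    float offset12 = w[2] / w12;      /* in [0, 1], between texels i and i + 1 */
    return w[0] * tex[i - 1]
         + w12 * fetch_linear(tex, (float)i + offset12)
         + w[3] * tex[i + 2];
}
```

Because Catmull-Rom reproduces quadratics, sampling squared values at position 3.5 gives 3.5² = 12.25 either way.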
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
namespace ConsoleApp2
{
    public unsafe class UnmanagedArray<T>
        where T : struct
    {
        private static IntPtr _value;
bishboria / springer-free-maths-books.md
Last active March 24, 2025 13:36
Springer made a bunch of books available for free; these were the direct links
paulirish / what-forces-layout.md
Last active April 28, 2025 06:24
What forces layout/reflow. The comprehensive list.

What forces layout / reflow

All of the below properties or methods, when requested/called in JavaScript, will trigger the browser to synchronously calculate the style and layout*. This is also called reflow or layout thrashing, and is a common performance bottleneck.

Generally, all APIs that synchronously provide layout metrics will trigger forced reflow / layout. Read on for additional cases and details.

Element APIs

Getting box metrics
  • elem.offsetLeft, elem.offsetTop, elem.offsetWidth, elem.offsetHeight, elem.offsetParent
patriciogonzalezvivo / GLSL-Noise.md
Last active April 28, 2025 11:55
GLSL Noise Algorithms

Please consider using http://lygia.xyz instead of copy/pasting these functions. It expands support for voronoi, voronoise, fbm, noise, worley, derivatives and much more, through simple file dependencies. Take a look at https://github.com/patriciogonzalezvivo/lygia/tree/main/generative

Generic 1,2,3 Noise

float rand(float n){return fract(sin(n) * 43758.5453123);}

float noise(float p){
    float fl = floor(p);
    float fc = fract(p);
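The `noise()` preview stops after computing `fl = floor(p)` and `fc = fract(p)`, i.e. the integer lattice cell and the position within it. A minimal C sketch of 1D value noise built on that split follows. It assumes a smoothstep blend between the two lattice values and swaps the sin-based `rand()` for a hypothetical integer hash (`hash01`), so treat it as an illustration of the idea, not the gist's implementation.

```c
#include <stdint.h>

/* Hypothetical integer hash standing in for the sin-based rand():
   maps a lattice coordinate to a pseudo-random value in [0, 1). */
static float hash01(int32_t n) {
    uint32_t x = (uint32_t)n;
    x ^= x >> 16;
    x *= 0x7feb352dU;
    x ^= x >> 15;
    x *= 0x846ca68bU;
    x ^= x >> 16;
    return (float)(x >> 8) * (1.0f / 16777216.0f);  /* top 24 bits -> [0, 1) */
}

/* floor() without libm; assumes |p| is well inside the int32_t range. */
static int32_t floor_to_int(float p) {
    int32_t i = (int32_t)p;          /* truncates toward zero */
    return ((float)i > p) ? i - 1 : i;
}

/* 1D value noise: hash the two surrounding lattice points and blend them
   with a smoothstep curve, like mix(a, b, smoothstep(0.0, 1.0, fc)) in GLSL. */
static float value_noise1(float p) {
    int32_t fl = floor_to_int(p);
    float fc = p - (float)fl;                /* fract(p) */
    float t = fc * fc * (3.0f - 2.0f * fc);  /* smoothstep weight */
    float a = hash01(fl);
    float b = hash01(fl + 1);
    return a + (b - a) * t;
}
```

At integer inputs the result equals the lattice hash exactly, and halfway between two lattice points it is their average.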