
Chris Sullivan csullivan

  • NVIDIA
  • Portland
@csullivan
csullivan / test_group_gemm.py
Created May 21, 2025 23:50
Roughly analogous in performance to the fp8xfp8 first FC layer from the triton-lang/triton python/triton_kernels _p_matmul_ogs.py Mixture of Experts kernel when the routing is exactly uniform (even, with no variance) across all the experts
import pytest
from typing import Optional
import torch
import triton
import triton.language as tl
DEVICE = "cuda"
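A minimal harness in this spirit (an assumed sketch, not the gist's kernel; it reuses the imports above and substitutes torch.bmm for the grouped fp8 GEMM under test):

def bench_uniform_group_gemm(num_experts=8, m=1024, n=4096, k=4096):
    # Uniform routing: every expert receives exactly m tokens (no variance),
    # so the grouped GEMM degenerates to a batched GEMM of equal shapes.
    a = torch.randn(num_experts, m, k, device=DEVICE, dtype=torch.float16)
    b = torch.randn(num_experts, k, n, device=DEVICE, dtype=torch.float16)
    ms = triton.testing.do_bench(lambda: torch.bmm(a, b))
    tflops = 2 * num_experts * m * n * k / (ms * 1e-3) / 1e12
    print(f"{ms:.3f} ms, {tflops:.1f} TFLOP/s")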

Creamy Basil Cashew Pesto Pasta

Ingredients

  • Pasta: 3/4 of a 14.5 oz box (approx. 300–310 g), cooked in very salty water (“like the ocean”)
  • Basil: 1 package (12 g) lightly dried chopped basil or a large handful of fresh leaves or 3–4 tbsp dried basil
  • Cashews: 1/3 to 1/2 cup unsalted
  • Parmesan: 1/2 cup grated
  • Garlic: 1 large clove (or 2 medium), peeled
@csullivan
csullivan / 2024_09_26_nsys_single_instance_wgmma_register_and_shared_layout.txt
Created September 26, 2024 16:55
Performance comparison: 5% gain using wgmma with LHS in registers vs shared. [1] https://github.com/csullivan/wgmma-intrin
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Name
-------- --------------- --------- -------- -------- -------- -------- ----------- ----------------------------------------------------------------------------------------------------
---- 2495089 101 24703.9 24736.0 24544 27520 302.9 wgmma_f16_m64n256k16_kernel_shared_layout(__half *, __half *, __half *)
---- 2361204 101 23378.3 23423.0 23231 25600 245.6 wgmma_f16_m64n256k16_register_layout_kernel(__half *, __half *, __half *)
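A quick sanity check of the quoted figure, using plain arithmetic on the averages in the table:

shared_ns, register_ns = 24703.9, 23378.3
print(f"{(shared_ns / register_ns - 1) * 100:.1f}% faster")  # 5.7% faster with the register layout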
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.11 python3.11-distutils python3.11-venv libpython3.11-dev
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.11
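A typical follow-up (illustrative, not part of the gist) is creating a 3.11 virtual environment with the just-installed venv support:

python3.11 -m venv ~/.venvs/py311
~/.venvs/py311/bin/python -m pip install --upgrade pip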
@csullivan
csullivan / note.md
Last active September 7, 2023 06:50
CUTLASS CMake configuration for Hopper (sm90a)
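A minimal sketch of such a configuration (assumed, not the note's own commands; CUTLASS_NVCC_ARCHS is CUTLASS's real architecture option, and 90a enables Hopper's architecture-specific wgmma/TMA features):

cmake -S cutlass -B build -DCUTLASS_NVCC_ARCHS=90a -DCMAKE_BUILD_TYPE=Release
cmake --build build -j --target cutlass_profiler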
@csullivan
csullivan / sharded_decode.py
Last active September 6, 2023 23:30
Sharded decode; the sharding rewrite is done after FuseOpsByPattern (cuBLAS/CUTLASS BYOC), with debug tracing calls
# Ignore `tvm.save_and_copy_tensor` packed functions inserted for debugging
@R.function
def decode(input_ids1: R.Tensor((1, 1), dtype="int32"), all_seq_len: R.Shape(["n"]), kv_cache: R.Tuple(R.Object, R.Object, ...)):  # one R.Object per KV-cache tensor; the full tuple is elided
@csullivan
csullivan / test_torch_distributed.py
Created July 17, 2023 05:44
Minimal NCCL torch.distributed example
import os
import torch
import torch.distributed as dist

def read_file_and_all_reduce():
    # Get the rank and world size from environment variables set by the launcher
    rank = int(os.environ['LOCAL_RANK'])
    world_size = int(os.environ['WORLD_SIZE'])
    # Initialize the NCCL process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
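The script is meant to be run under torchrun, which sets LOCAL_RANK and WORLD_SIZE for each process, e.g.:

torchrun --nproc_per_node=2 test_torch_distributed.py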
The fence we walked between the years
Did balance us serene;
It was a place half in the sky where
In the green of leaf and promising of peach
We’d reach our hand to touch, and almost touch the sky.
If we could reach and touch, we said,
‘Twould teach us not to, never to, be dead.
We ached and almost touched that stuff;
Our reach was never quite enough.
@csullivan
csullivan / CMake_3_20_Ubuntu_18_04.md
Last active March 30, 2023 05:39 — forked from bmegli/CMake_3_20_Ubuntu_18_04.md
apt reversible source builds with checkinstall -- Example: CMake 3.20 in Ubuntu 18.04 (reversible way)

Motivation

  • modern CMake is required for building a lot of new software
  • CMake is a dependency for many packages (e.g. ROS-related ones)
  • we don't want to remove CMake (which would remove the packages that depend on it)
  • we want a safe procedure for updating CMake that can be reversed easily

Current version in OS

Check current version
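The installed version can be checked with:

cmake --version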