Mamy Ratsimbazafy mratsim

  • Paris
@abcdabcd987
abcdabcd987 / 2025-02-23-mla-flashinfer-vs-deepseek.py
Last active April 6, 2025 11:34
MLA Kernel Performance: FlashInfer vs DeepSeek FlashMLA
# Results: https://docs.google.com/spreadsheets/d/1t0Txa7Ph9u7Su9LyWpS24vqr9A5FB-FyL0EZNpYOqwg/edit?gid=0#gid=0
# FlashInfer: 28053ac54023fbf3fb552f7be015b0f90a37ed76
# FlashMLA : accc1695ee0ff996ec63eaf2ebcbf6874ed0e7df
import itertools
import torch
from flash_mla import flash_mla_with_kvcache, get_mla_metadata
from flashinfer import BatchMLAPagedAttentionWrapper
from triton.testing import do_bench # type: ignore[import]
@VictorTaelin
VictorTaelin / conversation_2024-08-28T19-40-20-814Z.md
Last active March 26, 2025 03:54
conversation_2024-08-28T19-40-20-814Z.txt

Refactoring Kind's switch with ChatSH

(This is a readable version of my ChatSH session. For the full log, click here.)

Taelin: Hello. We're going to refactor an aspect of the implementation of the Kind language. Are you ready? Start by doing 'ls', then 'cat kind-lang.cabal' to get familiar with the repo.

ChatSH: Certainly! I'm ready to help you refactor an aspect of the Kind language implementation. Let's start by examining the repository structure and the contents of the Cabal file.

ls && echo "---" && cat kind-lang.cabal
@VictorTaelin
VictorTaelin / fast_scansum.cu
Created April 9, 2024 21:22
Fast CUDA block-local prefix sum (scansum) using warp sync primitives (__shfl_up_sync)
// Fast block-local prefix-sum on CUDA, using warp-syncs.
// The input is an array of u32. It is mutated in place. Example:
// arr = [1,1,1,1,...]
// Becomes:
// arr = [1,2,3,4,...]
// The number of elements must be equal to threads per block (TPB).
#include <stdio.h>
#include <cuda_runtime.h>
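
The preview above is cut off before the kernel itself, so the following is only a minimal CPU-side sketch, in Python, of the general technique the description names: an inclusive scan built from shuffle-up steps within each warp, followed by a scan over the per-warp totals. The function names (`shfl_up`, `warp_inclusive_scan`, `block_inclusive_scan`) and the warp size of 32 are assumptions made for illustration; this is not the gist's actual code, and unlike the gist it returns a new list instead of mutating the input in place.

```python
WARP_SIZE = 32  # assumed warp width, as on current NVIDIA GPUs


def shfl_up(values, delta):
    """Simulate a shuffle-up over one warp: lane i reads lane i - delta.

    Lanes with i < delta have no source lane and keep their own value,
    which mirrors the documented behavior of __shfl_up_sync.
    """
    return [values[i - delta] if i >= delta else values[i]
            for i in range(len(values))]


def warp_inclusive_scan(values):
    """Inclusive prefix sum of one warp using log2(n) shuffle-up steps."""
    vals = list(values)
    delta = 1
    while delta < len(vals):
        shifted = shfl_up(vals, delta)
        # Only lanes that actually received a neighbor's value accumulate it.
        vals = [v + s if lane >= delta else v
                for lane, (v, s) in enumerate(zip(vals, shifted))]
        delta *= 2
    return vals


def block_inclusive_scan(arr, warp_size=WARP_SIZE):
    """Inclusive prefix sum of a whole block (hypothetical helper)."""
    # Step 1: scan each warp independently.
    warps = [warp_inclusive_scan(arr[i:i + warp_size])
             for i in range(0, len(arr), warp_size)]
    # Step 2: scan the per-warp totals (last element of each warp).
    totals = warp_inclusive_scan([w[-1] for w in warps])
    # Step 3: add the total of all preceding warps to each element.
    out = []
    for idx, warp in enumerate(warps):
        offset = totals[idx - 1] if idx > 0 else 0
        out.extend(v + offset for v in warp)
    return out


if __name__ == "__main__":
    print(block_inclusive_scan([1] * 64)[:8])
```

On a GPU, each list comprehension corresponds to all lanes of a warp executing one instruction in lockstep, and step 2 would typically pass the per-warp totals through shared memory.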
@Nicklas373
Nicklas373 / av1_nvenc
Created March 25, 2024 00:42
AV1 NVENC Config for FFMPEG
Encoder av1_nvenc [NVIDIA NVENC av1 encoder]:
General capabilities: dr1 delay hardware
Threading capabilities: none
Supported hardware devices: cuda cuda d3d11va d3d11va
Supported pixel formats: yuv420p nv12 p010le yuv444p p016le yuv444p16le bgr0 bgra rgb0 rgba x2rgb10le x2bgr10le gbrp gbrp16le cuda d3d11
av1_nvenc AVOptions:
-preset <int> E..V....... Set the encoding preset (from 0 to 18) (default p4)
     default 0 E..V.......
     slow 1 E..V....... hq 2 passes
     medium 2 E..V....... hq 1 pass
@moyix
moyix / Makefile
Created March 8, 2024 05:26
Claude 3 writes a fuzzer
all: gifread gifread.asan gifread.ubsan gifread.coverage
gifread: gifdec.c gifread.c gifdec.h
	$(CC) $(CFLAGS) -o $@ gifdec.c gifread.c $(LDFLAGS)
gifread.asan: gifdec.c gifread.c gifdec.h
	$(CC) $(CFLAGS) -g -fsanitize=address -o $@ gifdec.c gifread.c $(LDFLAGS)
gifread.ubsan: gifdec.c gifread.c gifdec.h
	$(CC) $(CFLAGS) -g -fsanitize=undefined -o $@ gifdec.c gifread.c $(LDFLAGS)
@VictorTaelin
VictorTaelin / itt-coc.ts
Last active January 26, 2025 18:02
ITT-Flavored Calculus of Constructions Type Checker
// A nano dependent type-checker featuring inductive types via self encodings.
// All computation rules are justified by interaction combinator semantics,
// resulting in major simplifications and improvements over old Kind-Core.
// Specifically, computable annotations (ANNs) and their counterpart (ANN
// binders) and a new self encoding based on equality (rather than dependent
// motives) greatly reduce code size. A more complete file, including
// superpositions (for optimal unification) is available on the
// Interaction-Type-Theory repository.
// Credits also to Franchu and T6 for insights.
@ibireme
ibireme / kpc_demo.c
Last active April 24, 2025 08:25
A demo showing how to read Intel or Apple M1 CPU performance counters on macOS.
// =============================================================================
// XNU kperf/kpc demo
// Available for 64-bit Intel/Apple Silicon, macOS/iOS, with root privileges
//
//
// Demo 1 (profile a function in current thread):
// 1. Open directory '/usr/share/kpep/', find your CPU PMC database.
// M1 (Pro/Max/Ultra): /usr/share/kpep/a14.plist
// M2 (Pro/Max): /usr/share/kpep/a15.plist
// M3: /usr/share/kpep/as1.plist
@Matthias247
Matthias247 / a_case_for_cancellationtokens.md
Last active January 17, 2022 22:50
A case for CancellationTokens

A case for CancellationTokens

Background

The Rust async working group is actively discussing ways to improve async/await. Niko Matsakis documented the main goals and ideas in the async vision document.

As part of the improved async ecosystem, users should be able to make use of

Base C++ class cppbase.cpp

#pragma once

#include <stdio.h>

struct CppBase {
  virtual void baseMethod(int arg) {
    printf("arg from nim - %d -\n", arg);
@kobigurk
kobigurk / BLS_with_help.sol
Created August 25, 2020 15:51
Created using remix-ide: Realtime Ethereum Contract Compiler and Runtime. Load this file by pasting this gists URL or ID at https://remix.ethereum.org/#version=soljson-v0.5.17+commit.d19bba13.js&optimize=false&gist=
pragma solidity ^0.5.15;
contract BLS {
// Field order
uint256 constant N = 21888242871839275222246405745257275088696311157297823662689037894645226208583;
// Negated generator of G2
uint256 constant nG2x1 = 11559732032986387107991004021392285783925812861821192530917403151452391805634;
uint256 constant nG2x0 = 10857046999023057135944570762232829481370756359578518086990519993285655852781;
uint256 constant nG2y1 = 17805874995975841540914202342111839520379459829704422454583296818431106115052;