Rahul Nair (rahulunair)

🏠 Working from home
@mingfeima
mingfeima / part_2_parallelization_techniques.md
Last active June 28, 2024 11:03
PyTorch CPU Performance Optimization Tutorial - Section II
<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
<head>
<title>RSS subscriptions for [email protected]</title>
<dateCreated>Sat, 27 Feb 2021 17:33:45 +0000</dateCreated>
<ownerEmail>[email protected]</ownerEmail>
</head>
<body>
<outline text="Google AI Blog" title="Google AI Blog" type="rss" xmlUrl="http://feeds.feedburner.com/blogspot/gJZg" htmlUrl="http://ai.googleblog.com/"/>
<outline text="FastML" title="FastML" type="rss" xmlUrl="http://fastml.com/atom.xml" htmlUrl="http://fastml.com/"/>
@mingfeima
mingfeima / part_1_memory_format_and_channels_last_optimization.md
Last active March 24, 2025 16:56
PyTorch CPU Performance Optimization Tutorial - Section I
@mingfeima
mingfeima / pytorch_channels_last_perf_optimization.md
Last active September 1, 2023 03:02
PyTorch Channels Last memory format perf optimization and oneDNN integration plan.

PyTorch Channels Last Memory Format Performance Optimization on CPU Path

("mkldnn" has been renamed to "oneDNN", but exsiting PyTorch APIs still use "mkldnn", future work will align PyTorch user level APIs to "oneDNN")

Table of Contents

  • PyTorch Channels Last memory format introduction
  • oneDNN API for NHWC layout
  • Generic Channels Last memory format optimization with ATen native
  • oneDNN NHWC integration

NB: Memory format refers to the data representation that describes how a multidimensional (nD) array is stored in linear (1D) memory address space. Memory format has the same semantics as layout in oneDNN. Layout in PyTorch carries a different meaning: it describes whether a tensor is dense or sparse, via the attributes torch.strided and torch.sparse_coo.
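
A short illustration of the distinction above (not part of the original gist; it only uses standard PyTorch tensor APIs):

import torch

# The same logical tensor in the two memory formats
x = torch.randn(1, 3, 4, 4)                                 # contiguous (NCHW), strides (48, 16, 4, 1)
y = x.to(memory_format=torch.channels_last)                 # NHWC in memory, strides (48, 1, 12, 3)

print(x.shape == y.shape)                                   # True: the logical shape does not change
print(y.is_contiguous(memory_format=torch.channels_last))   # True
print(x.layout, y.layout)                                   # torch.strided for both: "layout" means dense vs. sparse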

@mingfeima
mingfeima / pytorch_performance_profiling.md
Last active April 11, 2025 15:38
How to do performance profiling on PyTorch

(Internal Training Material)

Usually the first step in performance optimization is profiling, e.g. identifying the performance hotspots of a workload. This gist covers the basics of performance profiling on PyTorch; you will learn:

  • How to find the bottleneck operator?
  • How to trace the source file of a particular operator?
  • How to identify threading issues (oversubscription)?
  • How to tell whether a specific operator is running efficiently or not?

This tutorial takes one of my recent projects - pssp-transformer - as an example to guide you through the path of PyTorch CPU performance optimization. Focus will be on Part 1 & Part 2.
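
As a minimal sketch of the first step (surfacing hot operators), the built-in torch.profiler can print a per-operator summary; the resnet50 workload below is only a stand-in, not the pssp-transformer project:

import torch
import torchvision.models as models

model = models.resnet50().eval()   # placeholder workload
x = torch.randn(1, 3, 224, 224)

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(x)

# Sort by total CPU time to surface the hottest operators
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))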

@mingfeima
mingfeima / pytorch_cpu_perf_bkm.md
Last active September 6, 2024 01:40
BKM for PyTorch CPU Performance

General guidelines for CPU performance on PyTorch

This file serves as a BKM (best known methods) to get better performance on CPU for PyTorch, mostly focusing on inference or deployment. A Chinese version is available here.

1. Use channels last memory format

Right now, on the PyTorch CPU path, you may choose from 3 types of memory formats.

  • torch.contiguous_format: the default memory format, also referred to as NCHW.
  • torch.channels_last: also referred to as NHWC.
  • torch._mkldnn: the mkldnn blocked format.
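
A hedged sketch of how the channels last option is typically applied for inference (the tiny model and input below are stand-ins, not part of the BKM itself):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, kernel_size=3), nn.ReLU()).eval()  # stand-in model
x = torch.randn(1, 3, 224, 224)

# Convert both weights and activations to channels last (NHWC)
model = model.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)

with torch.no_grad():
    y = model(x)
print(y.is_contiguous(memory_format=torch.channels_last))  # conv propagates the NHWC format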
@FedeMiorelli
FedeMiorelli / turbo_colormap_mpl.py
Last active March 31, 2023 02:45
Turbo Colormap for Matplotlib
# -*- coding: utf-8 -*-
"""
Created on 2019-08-22 09:37:36
@author: fmiorell
"""
# This script registers the "turbo" colormap to matplotlib, and the reversed version as "turbo_r"
# Reference: https://ai.googleblog.com/2019/08/turbo-improved-rainbow-colormap-for.html
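
A minimal sketch of the registration step only (the real gist ships the full 256-entry turbo RGB table; the values below are illustrative placeholders, not the actual data):

import matplotlib
from matplotlib.colors import ListedColormap

# Placeholder for the 256x3 RGB list from the Google AI blog post (not real turbo values)
turbo_colormap_data = [
    [0.19, 0.07, 0.23],
    [0.28, 0.42, 0.89],
    [0.83, 0.89, 0.26],
    [0.48, 0.02, 0.01],
]

turbo = ListedColormap(turbo_colormap_data, name="turbo")
turbo_r = ListedColormap(turbo_colormap_data[::-1], name="turbo_r")

# Newer matplotlib exposes a colormap registry; older releases used cm.register_cmap
matplotlib.colormaps.register(turbo)
matplotlib.colormaps.register(turbo_r)

After registration, plt.imshow(data, cmap="turbo") picks the colormap up by name.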
@lolz0r
lolz0r / basis.py
Created January 22, 2019 17:06
Learned basis function, pytorch
import torch
import torch.nn as nn
from torch.nn import Parameter

class ConvSeluSVD(nn.Module):
    def __init__(self, inputSize, outputSize, stride=1, maxpool=False, ownBasis=False):
        super(ConvSeluSVD, self).__init__()
        self.inputSize = inputSize
        self.outputSize = outputSize
        self.stride = stride
        # Learnable 3-element basis vectors, one set per (output, input) channel pair
        self.params = Parameter(torch.Tensor(outputSize * inputSize, 1, 3).normal_(0, .02))
@gkbrk
gkbrk / lobsters-mastodon.lisp
Last active January 20, 2023 06:20
Common lisp Mastodon bot
(ql:quickload :drakma)
(ql:quickload :cl-json)
(ql:quickload :plump)
(ql:quickload :babel)
(ql:quickload :tooter)
(ql:quickload :split-sequence)
(defvar *feed-path* "https://lobste.rs/rss")
(setf drakma:*drakma-default-external-format* :UTF-8)
@nadavrot
nadavrot / Matrix.md
Last active May 19, 2025 10:19
Efficient matrix multiplication

High-Performance Matrix Multiplication

This is a short post that explains how to write a high-performance matrix multiplication program on modern processors. In this tutorial I will use a single core of the Skylake-client CPU with AVX2, but the principles in this post also apply to other processors with different instruction sets (such as AVX512).
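
For concreteness (this snippet is not from the post), the operation being optimized is the textbook triple loop below; the article is about reorganizing this loop nest for caches and vector units:

import numpy as np

def naive_matmul(A, B):
    # C[i, j] = sum over k of A[i, k] * B[k, j]  -- O(M*N*K) work
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(64, 64)
B = np.random.rand(64, 64)
assert np.allclose(naive_matmul(A, B), A @ B)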

Intro

Matrix multiplication is a mathematical operation that defines the product of