Rahul Nair (rahulunair)

🏠 Working from home
@mingfeima
mingfeima / part_2_parallelization_techniques.md
Last active June 28, 2024 11:03
PyTorch CPU Performance Optimization Tutorial - Section II
<?xml version="1.0" encoding="UTF-8"?>
<opml version="1.0">
<head>
<title>RSS subscriptions for [email protected]</title>
<dateCreated>Sat, 27 Feb 2021 17:33:45 +0000</dateCreated>
<ownerEmail>[email protected]</ownerEmail>
</head>
<body>
<outline text="Google AI Blog" title="Google AI Blog" type="rss" xmlUrl="http://feeds.feedburner.com/blogspot/gJZg" htmlUrl="http://ai.googleblog.com/"/>
<outline text="FastML" title="FastML" type="rss" xmlUrl="http://fastml.com/atom.xml" htmlUrl="http://fastml.com/"/>
@mingfeima
mingfeima / part_1_memory_format_and_channels_last_optimization.md
Last active March 24, 2025 16:56
PyTorch CPU Performance Optimization Tutorial - Section I
@mingfeima
mingfeima / pytorch_channels_last_perf_optimization.md
Last active September 1, 2023 03:02
PyTorch Channels Last memory format perf optimization and oneDNN integration plan.

PyTorch Channels Last Memory Format Performance Optimization on CPU Path

("mkldnn" has been renamed to "oneDNN", but exsiting PyTorch APIs still use "mkldnn", future work will align PyTorch user level APIs to "oneDNN")

Table of Contents

  • PyTorch Channels Last memory format introduction
  • oneDNN API for NHWC layout
  • Generic Channels Last memory format optimization with ATen native
  • oneDNN NHWC integration

NB: Memory format refers to the data representation that describes how a multidimensional (nD) array is stored in linear (1D) memory address space. Memory format has the same semantics as layout in oneDNN. Layout in PyTorch carries a different meaning: it describes whether a tensor is dense or sparse, via the attributes torch.strided and torch.sparse_coo.
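
A short illustration of the distinction above (not part of the original gist; it only uses standard PyTorch tensor APIs):

import torch

# The same logical tensor in the two memory formats
x = torch.randn(1, 3, 4, 4)                                 # contiguous (NCHW), strides (48, 16, 4, 1)
y = x.to(memory_format=torch.channels_last)                 # NHWC in memory, strides (48, 1, 12, 3)

print(x.shape == y.shape)                                   # True: the logical shape does not change
print(y.is_contiguous(memory_format=torch.channels_last))   # True
print(x.layout, y.layout)                                   # torch.strided for both: "layout" means dense vs. sparse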

@mingfeima
mingfeima / pytorch_performance_profiling.md
Last active April 11, 2025 15:38
How to do performance profiling on PyTorch

(Internal Training Material)

Usually the first step in performance optimization is profiling, e.g. identifying the performance hotspots of a workload. This gist covers the basics of performance profiling on PyTorch; you will learn:

  • How to find the bottleneck operator?
  • How to trace the source file of a particular operator?
  • How to identify threading issues (oversubscription)?
  • How to tell whether a specific operator is running efficiently or not?

This tutorial takes one of my recent projects - pssp-transformer - as an example to guide you through the path of PyTorch CPU performance optimization. Focus will be on Part 1 & Part 2.
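
As a minimal sketch of the first step (surfacing hot operators), the built-in torch.profiler can print a per-operator summary; the resnet50 workload below is only a stand-in, not the pssp-transformer project:

import torch
import torchvision.models as models

model = models.resnet50().eval()   # placeholder workload
x = torch.randn(1, 3, 224, 224)

with torch.profiler.profile(activities=[torch.profiler.ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(x)

# Sort by total CPU time to surface the hottest operators
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))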

@mingfeima
mingfeima / pytorch_cpu_perf_bkm.md
Last active September 6, 2024 01:40
BKM for PyTorch CPU Performance

General guidelines for CPU performance on PyTorch

This file serves as a BKM (best known methods) to get better performance on CPU for PyTorch, mostly focusing on inference or deployment. A Chinese version is available here.

1. Use channels last memory format

Right now, on the PyTorch CPU path, you may choose from 3 types of memory formats.

  • torch.contiguous_format: the default memory format, also referred to as NCHW.
  • torch.channels_last: also referred to as NHWC.
  • torch._mkldnn: the mkldnn blocked format.
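
A hedged sketch of how the channels last option is typically applied for inference (the tiny model and input below are stand-ins, not part of the BKM itself):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, kernel_size=3), nn.ReLU()).eval()  # stand-in model
x = torch.randn(1, 3, 224, 224)

# Convert both weights and activations to channels last (NHWC)
model = model.to(memory_format=torch.channels_last)
x = x.to(memory_format=torch.channels_last)

with torch.no_grad():
    y = model(x)
print(y.is_contiguous(memory_format=torch.channels_last))  # conv propagates the NHWC format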
@FedeMiorelli
FedeMiorelli / turbo_colormap_mpl.py
Last active March 31, 2023 02:45
Turbo Colormap for Matplotlib
# -*- coding: utf-8 -*-
"""
Created on 2019-08-22 09:37:36
@author: fmiorell
"""
# This script registers the "turbo" colormap to matplotlib, and the reversed version as "turbo_r"
# Reference: https://ai.googleblog.com/2019/08/turbo-improved-rainbow-colormap-for.html
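
A minimal sketch of the registration step only (the real gist ships the full 256-entry turbo RGB table; the values below are illustrative placeholders, not the actual data):

import matplotlib
from matplotlib.colors import ListedColormap

# Placeholder for the 256x3 RGB list from the Google AI blog post (not real turbo values)
turbo_colormap_data = [
    [0.19, 0.07, 0.23],
    [0.28, 0.42, 0.89],
    [0.83, 0.89, 0.26],
    [0.48, 0.02, 0.01],
]

turbo = ListedColormap(turbo_colormap_data, name="turbo")
turbo_r = ListedColormap(turbo_colormap_data[::-1], name="turbo_r")

# Newer matplotlib exposes a colormap registry; older releases used cm.register_cmap
matplotlib.colormaps.register(turbo)
matplotlib.colormaps.register(turbo_r)

After registration, plt.imshow(data, cmap="turbo") picks the colormap up by name.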
@lolz0r
lolz0r / basis.py
Created January 22, 2019 17:06
Learned basis function, pytorch
import torch
import torch.nn as nn
from torch.nn import Parameter

class ConvSeluSVD(nn.Module):
    def __init__(self, inputSize, outputSize, stride=1, maxpool=False, ownBasis=False):
        super(ConvSeluSVD, self).__init__()
        self.inputSize = inputSize
        self.outputSize = outputSize
        self.stride = stride
        # Learnable 3-element basis vectors, one set per (output, input) channel pair
        self.params = Parameter(torch.Tensor(outputSize * inputSize, 1, 3).normal_(0, .02))
@gkbrk
gkbrk / lobsters-mastodon.lisp
Last active January 20, 2023 06:20
Common lisp Mastodon bot
(ql:quickload :drakma)
(ql:quickload :cl-json)
(ql:quickload :plump)
(ql:quickload :babel)
(ql:quickload :tooter)
(ql:quickload :split-sequence)
(defvar *feed-path* "https://lobste.rs/rss")
(setf drakma:*drakma-default-external-format* :UTF-8)
@nadavrot
nadavrot / Matrix.md
Last active May 19, 2025 10:19
Efficient matrix multiplication

High-Performance Matrix Multiplication

This is a short post that explains how to write a high-performance matrix multiplication program on modern processors. In this tutorial I will use a single core of the Skylake-client CPU with AVX2, but the principles in this post also apply to other processors with different instruction sets (such as AVX512).
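
For concreteness (this snippet is not from the post), the operation being optimized is the textbook triple loop below; the article is about reorganizing this loop nest for caches and vector units:

import numpy as np

def naive_matmul(A, B):
    # C[i, j] = sum over k of A[i, k] * B[k, j]  -- O(M*N*K) work
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

A = np.random.rand(64, 64)
B = np.random.rand(64, 64)
assert np.allclose(naive_matmul(A, B), A @ B)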

Intro

Matrix multiplication is a mathematical operation that defines the product of