PyTorch Channels Last memory format perf optimization and oneDNN integration plan.
PyTorch Channels Last Memory Format Performance Optimization on CPU Path
("mkldnn" has been renamed to "oneDNN", but exsiting PyTorch APIs still use "mkldnn", future work will align PyTorch user level APIs to "oneDNN")
Table of Contents
PyTorch Channels Last memory format introduction
oneDNN API for NHWC layout
Generic Channels Last memory format optimization with ATen native
oneDNN NHWC integration
NB: Memory format refers to the data representation that describes how a multidimensional (nD) array is stored in linear (1D) memory address space. Memory format has the same semantics as "layout" in oneDNN. Note that "layout" in PyTorch carries a different meaning: it describes whether a tensor is dense or sparse, via the attributes torch.strided and torch.sparse_coo.
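To make the distinction above concrete, here is a minimal sketch using the standard PyTorch tensor API: converting a tensor to channels last changes only its physical memory format (i.e. its strides), not its logical NCHW shape or its torch.strided layout.

```python
import torch

# A 4D tensor is created in the default contiguous (NCHW) memory format.
x = torch.randn(1, 3, 16, 16)
print(x.stride())   # (768, 256, 16, 1): W is the fastest-moving dimension

# Converting to channels last keeps the logical NCHW shape but stores the
# data physically as NHWC (channels become the fastest-moving dimension).
y = x.to(memory_format=torch.channels_last)
print(y.shape)      # torch.Size([1, 3, 16, 16]): unchanged
print(y.stride())   # (768, 1, 48, 3)
print(y.is_contiguous(memory_format=torch.channels_last))  # True
print(y.layout)     # torch.strided: the layout attribute is unaffected
```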
Usually the first step in performance optimization is profiling, e.g. identifying the performance hotspots of a workload.
This gist covers the basics of performance profiling in PyTorch; you will learn:
How to find the bottleneck operator?
How to trace the source file of a particular operator?
How to tell whether a specific operator is running efficiently?
This tutorial takes one of my recent projects, pssp-transformer, as an example to guide you through the path of PyTorch CPU performance optimization. The focus will be on Part 1 & Part 2.
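As a starting point for the profiling workflow described above, the sketch below uses the standard torch.autograd.profiler API with a toy convolution model standing in for a real workload; it produces a per-operator breakdown from which the bottleneck operator can be identified.

```python
import torch
import torch.nn as nn
from torch.autograd import profiler

# Toy model standing in for a real workload.
model = nn.Sequential(nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU())
model.eval()
x = torch.randn(32, 3, 224, 224)

# Profile a few forward passes on CPU; each ATen operator call is recorded.
with torch.no_grad(), profiler.profile(record_shapes=True) as prof:
    for _ in range(10):
        model(x)

# Sort by self CPU time to surface the bottleneck operator (e.g. a convolution
# kernel), which can then be traced back to its ATen source file.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
```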
This is a short post that explains how to write a high-performance matrix
multiplication program on modern processors. In this tutorial I will use a
single core of a Skylake-client CPU with AVX2, but the principles in this post
also apply to other processors with different instruction sets (such as AVX512).
Intro
Matrix multiplication is a mathematical operation that defines the product of