Rahul Nair rahulunair

benchmark

Based on the huggingface repo for performance evaluation; the actual benchmark run script is placed at repo. To reproduce the performance numbers:

Prepare the dataset according to link.
Update GLUE_DIR to the actual dataset path in run_inference.sh.
Change the env settings; the default setting uses 20 cores.
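
The thread-count setting in the last step can also be applied from Python. The sketch below is an assumption about what such env settings typically look like (the contents of run_inference.sh are not shown here): it sets OMP_NUM_THREADS before importing torch and then sizes the intra-op thread pool. The cap at the machine's core count is an illustrative choice, not part of the original script.

```python
import os

# Hypothetical target: the 20-core default mentioned above, capped to this machine.
n_threads = min(20, os.cpu_count() or 1)

# OpenMP reads this variable at startup, so set it before importing torch.
os.environ["OMP_NUM_THREADS"] = str(n_threads)

import torch

# Size of the intra-op thread pool used by CPU operators.
torch.set_num_threads(n_threads)
print("intra-op threads:", torch.get_num_threads())
```

Binding threads to the physical cores of a single socket (for example with numactl) is usually done outside Python and is not shown here.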

MKL vs. MKLDNN

Inference performance results on Xeon 6148 (2x20 cores), single socket and single thread.

General guidelines for CPU performance on PyTorch

This file serves as a BKM (best known method) for getting better CPU performance with PyTorch, mostly focusing on inference or deployment. Chinese version available here.

1. Use channels last memory format

Right now, on the PyTorch CPU path, you may choose among 3 memory formats:

torch.contiguous_format: default memory format, also referred to as NCHW.
torch.channels_last: also referred to as NHWC.
torch._mkldnn: mkldnn blocked format.
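
As a minimal sketch of how these three formats are selected in practice (the tensor shape is arbitrary, and the mkldnn conversion is guarded because it assumes a build with mkldnn support):

```python
import torch

# A 4-D activation tensor: N=2, C=3, H=8, W=8.
x = torch.randn(2, 3, 8, 8)  # torch.contiguous_format (NCHW) by default

# Convert to channels last (NHWC); the logical shape stays (N, C, H, W).
x_cl = x.to(memory_format=torch.channels_last)

print(x.is_contiguous())                                       # NCHW check
print(x_cl.is_contiguous(memory_format=torch.channels_last))   # NHWC check

# The mkldnn blocked format is an opaque layout; only try it when available.
if torch.backends.mkldnn.is_available():
    x_blocked = x.to_mkldnn()
    print(x_blocked.layout)
    x_back = x_blocked.to_dense()  # convert back to a regular strided tensor
```

Only the physical ordering in memory changes; indexing such as `x_cl[n, c, h, w]` still uses NCHW semantics.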

BKMs to check whether mkl or mkldnn is enabled on PyTorch

PyTorch can be installed via different channels: conda, pip, docker, source code...

By default, mkl and mkl-dnn are enabled, but this might not always be true, so it is still useful to learn how to check this yourself:

1. How to check whether mkl is enabled?

### check where your torch is installed
python -c 'import torch; print(torch.__path__)'
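
Besides inspecting the installed binaries, PyTorch can report its own build configuration. The following is a small sketch of that alternative check, not necessarily the method the original gist goes on to describe:

```python
import torch

# True if this PyTorch build was compiled with MKL / MKLDNN (oneDNN) support.
print("MKL available:   ", torch.backends.mkl.is_available())
print("MKLDNN available:", torch.backends.mkldnn.is_available())

# Full build summary (compiler, BLAS backend, enabled CPU features, ...).
build_info = torch.__config__.show()
print(build_info)
```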

(Internal Training Material)

Usually the first step in performance optimization is profiling, e.g. identifying the performance hotspots of a workload. This gist covers the basics of performance profiling on PyTorch. It answers the following questions:

How to find the bottleneck operator?
How to trace source file of a particular operator?
How do I identify threading issues? (oversubscription)
How do I tell whether a specific operator is running efficiently?
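
The first of these questions, finding the bottleneck operator, can be sketched with the built-in autograd profiler. The toy model below is a hypothetical stand-in for a real workload:

```python
import torch

# Hypothetical toy workload; replace with the real model and input.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 8),
)
x = torch.randn(32, 64)

# Record CPU time per operator over a few inference iterations.
with torch.no_grad(), torch.autograd.profiler.profile() as prof:
    for _ in range(10):
        model(x)

# Aggregate by operator name and sort by time spent in the operator itself.
table = prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10)
print(table)
```

Operators at the top of the table are the first candidates for optimization.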

This tutorial takes one of my recent projects, pssp-transformer, as an example to guide you through the path of PyTorch CPU performance optimization. The focus will be on Part 1 & Part 2.

Part IV: BFloat16 Kernel Optimization

(Training material on PyTorch CPU performance optimization)

Part I: Memory Formats and Channels Last Optimization
Part II: Parallelization Techniques
Part III: Vectorization Techniques

Chinese version for this chapter, link.

This section contains the following subjects:

Part III: Vectorization Techniques

(Training material on PyTorch CPU performance optimization)

Part I: Memory Formats and Channels Last Optimization
Part II: Parallelization Techniques
Part IV: BFloat16 Kernel Optimization

Chinese version for this chapter, link.

This section contains the following subjects:

Part II: Parallelization Techniques

(Training material on PyTorch CPU performance optimization)

Part I: Memory Formats and Channels Last Optimization
Part III: Vectorization Techniques
Part IV: BFloat16 Kernel Optimization

Chinese version for this chapter, link.

This section contains the following subjects:

Part I: Memory Formats and Channels Last Optimization

(Training material on PyTorch CPU performance optimization)

Part II: Parallelization Techniques
Part III: Vectorization Techniques
Part IV: BFloat16 Kernel Optimization

Chinese version for this chapter, link.

PyTorch Channels Last Memory Format Performance Optimization on CPU Path

("mkldnn" has been renamed to "oneDNN", but existing PyTorch APIs still use "mkldnn"; future work will align PyTorch user-level APIs with "oneDNN")

NB: Memory format refers to the data representation that describes how multidimensional (nD) arrays are stored in a linear (1D) memory address space. Memory format has the same semantics as layout in oneDNN. Layout in PyTorch has a different meaning: it describes whether a tensor is dense or sparse, with the attributes 'torch.strided' and 'torch.sparse_coo'.
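
The distinction can be made concrete by looking at strides, which encode how the nD index maps to the 1D address space. This is a minimal sketch with an arbitrarily chosen shape:

```python
import torch

# N=2, C=3, H=4, W=5
x = torch.zeros(2, 3, 4, 5)
x_cl = x.to(memory_format=torch.channels_last)

# Memory format: same logical shape, different element order in memory.
print(x.stride())     # NCHW: (C*H*W, H*W, W, 1) = (60, 20, 5, 1)
print(x_cl.stride())  # NHWC: (H*W*C, 1, W*C, C) = (60, 1, 15, 3)

# Layout: both tensors are dense (strided), regardless of memory format.
print(x.layout, x_cl.layout)

# A sparse tensor has a different layout attribute.
print(x.to_sparse().layout)
```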

	"""
	PyTorch implementation of a sequence labeler (POS tagger).

	Basic architecture:
	- take words
	- run through bidirectional GRU
	- predict labels one word at a time (left to right), using a recurrent neural network "decoder"

	The decoder updates hidden state based on:
	- most recent word
