This cheatsheet serves as a quick reference for PyTorch users who are interested in trying MXNet, and vice versa.
PyTorch is a deep learning framework that provides imperative tensor manipulation and neural network training. MXNet provides similar imperative tensor manipulation through the ndarray package and neural network training through gluon. This cheatsheet maps functions one-by-one between the two frameworks.
Note that MXNet also has a symbolic interface, similar to Keras and TensorFlow, that may provide better performance and portability. This cheatsheet mainly focuses on MXNet's imperative interface.
PyTorch recommends installation via conda, while MXNet's default installation is through pip.
Hardware | PyTorch | MXNet |
---|---|---|
CPU | conda install pytorch-cpu -c pytorch | pip install mxnet |
Intel CPU1 | Same as above | pip install mxnet-mkl |
Nvidia GPU2 | conda install pytorch -c pytorch | pip install mxnet-cu91 |
1: PyTorch ships with MKL, while mxnet-mkl additionally uses MKL-DNN, a DNN acceleration library for Intel CPUs (similar to cuDNN for Nvidia GPUs). 2: The CUDA Toolkit must be installed first. PyTorch uses CUDA 8.0 by default; CUDA 9.0 can be chosen by specifying cuda90. MXNet requires the CUDA version to be specified explicitly; other variants include mxnet-cu80 and mxnet-cu90.
Description | PyTorch | MXNet |
---|---|---|
Main package | import torch | import mxnet as mx |
Tensor | Already under torch | from mxnet import nd |
Imperative training | torch is purely imperative | from mxnet import gluon |
Neural network | from torch import nn | from mxnet.gluon import nn |
Recurrent neural network | Already under torch.nn | from mxnet.gluon import rnn |
PyTorch inherited its tensor functionality from Torch, while MXNet adopted an interface similar to numpy's ndarray. So for the same multi-dimensional array, PyTorch calls it a tensor while MXNet calls it an ndarray. In this cheatsheet we use the tensor naming convention.
One main difference that often confuses users is how to specify the shape. PyTorch allows the shape elements to be positional arguments, e.g. torch.zeros(2,3), while in most cases MXNet only accepts a shape tuple, e.g. nd.zeros((2,3)).
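For example, a minimal sketch of this difference (assuming both packages are installed):

import torch
from mxnet import nd

a = torch.zeros(2, 3)        # shape passed as positional arguments
b = nd.zeros((2, 3))         # shape passed as a tuple
print(a.size(), b.shape)     # torch.Size([2, 3]) and (2, 3)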
Description | PyTorch | MXNet |
---|---|---|
Uninitialized 5x3 matrix | torch.Tensor(5,3) | nd.empty((5,3)) |
Matrix initialized with 0s | torch.zeros(2,3) | nd.zeros((2,3)) |
Matrix initialized with 1s | torch.ones(5,3) | nd.ones((5,3)) |
Tensor filled with the scalar value 1, with the same size as input | torch.ones_like(x) | nd.ones_like(x) |
A range of values | torch.arange(-5,5) | nd.arange(-5,5) |
Construct from a list | torch.FloatTensor([[1,],[2,]]) | nd.array([[1,],[2,]]) |
Function | PyTorch | MXNet |
---|---|---|
Print the matrix | print(x) | Same |
Get the tensor shape | x.size() | x.shape |
Get the tensor size (#elements) | x.numel() | x.size |
Data type of a tensor | x.type() | x.dtype |
Address of the first element | x.data_ptr() | Not available |
Device context of a tensor | x.device | x.context |
Description | PyTorch | MXNet |
---|---|---|
Fancy index read | x[:], x[1,2], x[1:2,2:3] | Same |
Write | x[1:2,2:3]=4 | Same |
Take elements by an index list | x.index_select(dim, idx) | x.pick(idx), where idx is a list of (n-1)-D indices |
Gather values along an axis specified by dim | torch.gather(x, 1, torch.LongTensor([[0,0],[1,0]])) | nd.gather_nd(x, nd.array([[[0,0],[1,1]],[[0,0],[1,0]]])) |
Scatter | torch.zeros(2, 4).scatter_(1, torch.LongTensor([[2], [3]]), 1.23) | nd.scatter_nd(nd.array([1.23,1.23]), nd.array([[0,1],[2,3]]), (2,4)) |
Description | PyTorch | MXNet |
---|---|---|
+, -, *, / with broadcasting | x+y, x-y, x*y, x/y | Same |
abs, ceil, sin, exp, ... | x.abs(), x.ceil(), ... | Same |
arcsine, arccosine, ... | x.asin(), x.acos(): arc is shortened to a | x.arcsin(), x.arccos() |
Comparison | x>=y, x==y: return a byte tensor | Same, but the returned data type is the same as x's |
Element-wise mod | x.fmod(3) | nd.modulo(x, 3) |
Floor, ceil, round, trunc | x.floor(), x.ceil(), x.round(), x.trunc() | Same |
Fractional portion of a tensor | x.frac() | x - x.trunc() |
Limit the values of a tensor to between min and max | x.clamp(min, max) | x.clip(min, max) |
Element-wise division of t1 by t2, multiplied by v and added to t | torch.addcdiv(t, v, t1, t2) | t + v*(t1/t2) |
Element-wise apply a function | x.apply_(callable) | Not available, but there is nd.custom(x, 'op') |
Compute the Gauss error function | x.erf() | Not available |
A PyTorch comparison returns a byte tensor, which can be used for boolean indexing. For example, x[x>1] returns the elements of x that are larger than 1. In MXNet, x>1 returns a tensor with the same data type as x.
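A minimal sketch of this difference, using a small 1-D tensor with arbitrarily chosen values:

import torch
from mxnet import nd

xt = torch.Tensor([0.5, 1.5, 2.5])
print(xt > 1)               # byte (bool in newer PyTorch) mask
print(xt[xt > 1])           # boolean indexing keeps 1.5 and 2.5

xm = nd.array([0.5, 1.5, 2.5])
mask = xm > 1               # float mask [0. 1. 1.] with the same dtype as xm
print(xm * mask)            # one workaround: zero out the filtered-out entries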
Description | PyTorch | MXNet |
---|---|---|
max, min, mean, sum, and product1 | x.max(), x.min(), ...: return a scalar result | Same, but return a (1,)-shape tensor |
Same functions as above but along axis i2 | x.max(dim=i) | x.max(axis=i) |
sum and prod that treat NaN as zero | Not available | nansum(), nanprod() |
Cumulative sum, product along axis i | x.cumsum(i), x.cumprod(i) | Not available |
p-norm | x.norm(p) | x.norm(): only supports the L2 norm |
p-norm distance between x and y | x.dist(y, p) | (x-y).norm(): only supports the L2 norm |
1: In PyTorch, x.max() returns a scalar, while the same code in MXNet returns a tensor with shape (1,); call asscalar(), i.e. x.max().asscalar(), to get the scalar result. 2: PyTorch's max/min along an axis returns a tuple of (max/min values, their positions), while MXNet's min and max only return the values.
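A minimal sketch of footnote 1, assuming a small NDArray x:

from mxnet import nd

x = nd.array([[1, 2, 3], [4, 5, 6]])
m = x.max()                 # NDArray of shape (1,): [6.]
print(m.asscalar())         # plain Python scalar: 6.0
print(x.max(axis=1))        # per-row maxima, values only: [3. 6.]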
Description | PyTorch | MXNet |
---|---|---|
Broadcast/expand tensor to a new shape | x.expand(3,4) | x.broadcast_to((3,4)) |
Repeat tensor | x.repeat(4,2) | x.tile((4,2)) |
Description | PyTorch | MXNet |
---|---|---|
Sort | x.sort() | Same |
Sort along axis i | x.sort(dim=i) | x.sort(axis=i) |
Top k elements | x.topk(k) | x.topk(k=k) |
Indices that sort a tensor | _,idx=x.sort() | x.argsort() |
Indices of the max/min values | _,idx=x.max(), _,idx=x.min() | x.argmax(), x.argmin() |
Description | PyTorch | MXNet |
---|---|---|
Reshape without changing data | x.view(3,5) | x.reshape((3,5)) |
Reshape to y's shape without changing data | x.view_as(y) | x.reshape_like(y) |
Insert a new axis with size 1 at the k-th dim | x.unsqueeze(k) | x.expand_dims(k) |
Repeat along each axis | x.repeat(4,2) | x.tile((4,2)) |
Number of elements in a tensor | x.numel() | x.size |
Split a tensor along a given dim | x.chunk(num_of_chunk) | nd.split(x, num_outputs=num_of_chunk) |
Description | PyTorch | MXNet |
---|---|---|
Change value type to int8 | x.char() | x.astype('int8') |
Return a copy of the tensor cast to a specified type | x.type(type) | x.astype(dtype) |
Return a copy | x.clone() | x.copy() |
Construct from a numpy array x | torch.from_numpy(x) | nd.array(x) |
Return this tensor as a NumPy ndarray | x.numpy() | x.asnumpy() |
In-place z = x+y | torch.add(x,y,out=z) | nd.elemwise_add(x, y, out=z) |
In-place y = x+y | y.add_(x) | y+=x |
Description | PyTorch | MXNet |
---|---|---|
Random values in [0,1] | torch.rand(5,3) | nd.random.uniform(shape=(5,3)) |
Normal distribution | torch.normal(means, std): means and std should be tensors with the same shape as the output | nd.random.normal(loc, scale, (5,3)): loc (mean) and scale (std) can be scalars |
Uniform distribution | torch.rand(*sizes) for [0, 1); torch.Tensor.uniform_(from, to) for [from, to) | nd.random.uniform(low, high, (5, 3)) |
Cauchy distribution | torch.Tensor.cauchy_(median, sigma) (in-place) | Not available |
Exponential distribution | torch.Tensor.exponential_(lambd) (in-place) | nd.random.exponential(scale, (5, 3)) |
Geometric distribution | torch.Tensor.geometric_(p) (in-place) | Not available |
Log-normal distribution | torch.Tensor.log_normal_(mean, std) (in-place) | Not available |
Bernoulli distribution | x.bernoulli() | Not available |
Description | PyTorch | MXNet Gluon |
---|---|---|
Matrix transpose | x.t() | x.T |
Matrix product with broadcasting | torch.matmul(x,y) | nd.dot(x,y) |
GEMM | torch.addmm(b,z,a,x,y) | nd.linalg.gemm(x,y,z,a,b) |
Batch matrix product | x.bmm(y) or torch.bmm(x, y) | nd.linalg.gemm2(x, y) |
Batch matrix product and accumulation | torch.addbmm(M, batch1, batch2) | nd.linalg.gemm(batch1, batch2, M): the leading n-2 dims are reduced |
Outer product of two vectors added to a matrix | m.addr(vec1, vec2) | Not available |
Cross product | x.cross(y) | Not available |
Create a diagonal tensor | x.diag(diagonal) | Not available |
Solve least squares & least norm | B.gels(A) | Not available |
Eigendecomposition of a symmetric matrix | e, v = a.symeig() | v, e = nd.linalg.syevd(a) |
Description | PyTorch | MXNet Gluon |
---|---|---|
Create on GPU 0 | torch.ones(5,3).cuda() | nd.ones((5,3), ctx=mx.gpu()) |
Context scope | with torch.cuda.device(1): torch.ones(5,3) | with mx.gpu(1): nd.ones((5,3)) |
Copy to GPU 1 (no change if already on GPU 1) | x.cuda(1) | x.as_in_context(mx.gpu(1)) |
Always copy | x.cuda(1) | x.copyto(mx.gpu(1)) |
Copy data from x to y (possibly on different GPUs) | y.copy_(x) | x.copyto(y) |
Convert a GPU Tensor/NDArray to a numpy array | x = y.numpy() | x = y.asnumpy() |
The autograd package of PyTorch/MXNet enables automatic differentiation of Tensors/NDArrays.
- PyTorch: construct the Tensor with requires_grad specifying whether gradients are needed:
x = torch.ones((1,), requires_grad=True)
y = x * 2
- MXNet: create the gradient buffer for an ndarray with attach_grad, and put the computation within the autograd.record() scope:
x = mx.nd.ones((1,))
x.attach_grad()
with mx.autograd.record():
    y = x * 2
Description | PyTorch | MXNet Gluon |
---|---|---|
Run backward | y.backward() | Same |
Access gradients | x.grad | Same |
Grad check | test = gradcheck(Linear.apply, input, eps=1e-6, atol=1e-4) | Not implemented yet for Gluon |
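Putting the pieces together, a minimal end-to-end sketch of the MXNet flow described above (record, backward, read the gradient):

import mxnet as mx

x = mx.nd.ones((1,))
x.attach_grad()                  # allocate a gradient buffer for x
with mx.autograd.record():       # record operations for differentiation
    y = x * 2
y.backward()                     # dy/dx = 2
print(x.grad)                    # [2.]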
Some operators (Dropout, BatchNorm, etc.) behave differently during training and prediction. In MXNet this can be controlled with the train_mode and predict_mode scopes. The pause scope is for code that does not need gradients to be calculated.
import mxnet as mx
from mxnet import autograd

x = mx.nd.ones((1,))
with autograd.train_mode():
    y = mx.nd.Dropout(x)
with autograd.predict_mode():
    z = mx.nd.Dropout(y)

w = mx.nd.ones((1,))
w.attach_grad()
with autograd.record():
    y = x * w
y.backward()
with autograd.pause():
    w += w.grad
MXNet uses lazy (asynchronous) evaluation to achieve better performance: the Python thread simply pushes operations into the backend engine and then returns. During training, a synchronization point is needed at the end of each batch, e.g. asnumpy(), wait_to_read(), or metric.update(...).
for (data, label) in train_data:
    with autograd.record():
        output = net(data)
        L = loss(output, label)
    L.backward()
    trainer.step(data.shape[0])
    metric.update([label], [output])
We can construct a neural network by creating its layers and defining the forward function.
- PyTorch: inherit from nn.Module:
class Net(nn.Module):
    def __init__(self, D_in, D_out):
        super(Net, self).__init__()
        self.linear = nn.Linear(D_in, D_out)
    def forward(self, x):
        return self.linear(x)

net = Net(2,4)
print(net(x))
- MXNet: inherit from nn.Block. Compared to PyTorch:
  - Layers defined in nn don't require an input shape, since it can be inferred later.
  - Layers need to be defined within the parent's name scope to give model parameters proper names, which are needed to save and load the parameters.
class Net(nn.Block):
    def __init__(self, D_out):
        super(Net, self).__init__()
        with self.name_scope():
            self.dense = nn.Dense(D_out)
    def forward(self, x):
        return self.dense(x)

net = Net(4)
print(net(x))
Both of them provide a sequential container to chain layers into a network:
- PyTorch:
net = nn.Sequential(
    nn.Linear(2,3),
    nn.ReLU(),
    nn.Linear(3,4)
)
- MXNet:
net = nn.Sequential()
with net.name_scope():
    net.add(
        nn.Dense(3, activation='relu'),
        nn.Dense(4)
    )
- PyTorch: needs to wrap the input data into a Variable:
x = torch.autograd.Variable(torch.ones(2,2))
y = net(x)
print(y)
- MXNet: needs to initialize the parameters explicitly:
net.initialize()
x = nd.ones((2,2))
y = net(x)
print(y)
- PyTorch:
x = torch.autograd.Variable(torch.ones(2,2))
z = torch.autograd.Variable(torch.ones(2,4)*2)
loss = nn.MSELoss()
net.zero_grad()
y = net(x)
L = loss(y, z)
L.backward()
print(net[0].weight.grad)
- propagates Variable to define the scope for backward
- uses the add semantic to store gradients, namely w.grad += cur_grad, so the gradients need to be explicitly zeroed before every forward pass
- MSELoss returns the average of the L2 losses over all examples, so L has size 1
- MXNet:
x = nd.ones((2,2))
z = nd.ones((2,4))*2
loss = gluon.loss.L2Loss()
net.initialize()
with mx.autograd.record():
    y = net(x)
    L = loss(y, z)
L.backward()
print(net[0].weight.grad())
- explicitly puts the computations that need to be differentiated inside the autograd.record() scope
- uses write by default to store gradients, namely w.grad = cur_grad, so calling zero_grad is not necessary unless we change to add, e.g. net[0].weight.grad_req = 'add'
- L2Loss applies the L2 loss to examples one by one, so L has size 2. Calling L.backward() is equivalent to summing L's elements and then calling backward. To behave the same as PyTorch's MSELoss, change to L = loss(y, z).mean(), as shown in the sketch below
Both PyTorch and Gluon define various neural network layers in the nn module. Compared to PyTorch, MXNet:
- adopts Keras-style layer definitions, while PyTorch's naming convention is closer to Chainer's
- again makes the input size of a layer optional
Description | PyTorch | Gluon |
---|---|---|
Sequential container | nn.Sequential(x,y) | nn.Sequential(x,y) |
Fully connected | nn.Linear(in, out) | nn.Dense(out) |
Activation | nn.ReLU() | nn.Activation('relu') |
3x3 2D convolution with stride 1 and padding 0 | nn.Conv2d(in, out, 3, 1, 0) | nn.Conv2D(out, 3, 1, 0) |
3x3 2D max pooling with stride 1 and padding 0 | nn.MaxPool2d(3, 1, 0) | nn.MaxPool2D(3, 1, 0) |
Adaptive max pooling with fixed output size | nn.AdaptiveMaxPool2d(4) | Not available |
Adaptive avg pooling with fixed output size | nn.AdaptiveAvgPool2d(4) | Not available |
Batch normalization | nn.BatchNorm2d(in) | nn.BatchNorm() |
Embedding | nn.Embedding(in, out) | nn.Embedding(in, out) |
1: supports applying an activation layer on the output by activation='relu'
PyTorch defines recurrent neural network layers in the nn module, while MXNet defines them in the rnn module.
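A minimal sketch of constructing and running a Gluon recurrent layer, assuming the default 'TNC' layout (sequence length 6, batch size 3, input dimension 10):

from mxnet import nd
from mxnet.gluon import rnn

layer = rnn.LSTM(20)                      # hidden dimension 20; input dimension is inferred
layer.initialize()
x = nd.random.uniform(shape=(6, 3, 10))   # (seq_len, batch, input_dim)
output = layer(x)                         # shape (6, 3, 20)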
Description | PyTorch | Gluon |
---|---|---|
Elman RNN with tanh activation, input dimension 10 and hidden dimension 20 | nn.RNN(10, 20, nonlinearity='tanh') | rnn.RNN(20, activation='tanh'), input dimension is not required |
Bidirectional LSTM with input dimension 10, hidden dimension 20 per direction (40 in total) | nn.LSTM(10, 20, bidirectional=True) | rnn.LSTM(20, bidirectional=True) |
GRU with output dropout rate 0.2 on each GRU layer except the last | nn.GRU(10, 20, dropout=0.2) | rnn.GRU(20, dropout=0.2) |
LSTM cell with input dimension 10 and hidden dimension 20 | nn.LSTMCell(10, 20) | rnn.LSTMCell(20) |
Same LSTM cell without bias | nn.LSTMCell(10, 20, bias=False) | Not available |
Enable zoneout with probability 0.2 on the recurrent state of cell | Not available | rnn.ZoneoutCell(cell, zoneout_states=0.2) |
Single-step unroll of cell with input x of shape (3, 10), hidden dimension 20, and beginning states h, c both of shape (3, 20) | h, c = cell(x, (h, c)) | Same |
Unroll cell for 6 steps, with input of shape (6, 3, 10), hidden dimension 20, and beginning states h, c both of shape (3, 20) | output = []; for i in range(6): h, c = cell(input[i], (h, c)); output.append(h) | output, last_state = cell.unroll(6, input, begin_state=(h, c), layout='TNC') |
Gluon's loss functions are in the gluon.loss
module.
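For example, a minimal sketch of evaluating a Gluon loss on hypothetical predictions and labels:

from mxnet import nd, gluon

loss_fn = gluon.loss.L2Loss()
pred = nd.array([[0.5], [1.5]])
label = nd.array([[1.0], [1.0]])
print(loss_fn(pred, label))      # one value per example: [0.125 0.125]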
Description | PyTorch | Gluon |
---|---|---|
Square loss | nn.MSELoss | gluon.loss.L2Loss |
L1 loss | nn.L1Loss | gluon.loss.L1Loss |
Cross-entropy loss with softmax | nn.CrossEntropyLoss | gluon.loss.SoftmaxCrossEntropyLoss |
Logistic loss | nn.SoftMarginLoss | gluon.loss.LogisticLoss |
Binary cross-entropy loss | nn.BCELoss | gluon.loss.SigmoidBCELoss(from_sigmoid=True) |
Binary cross-entropy loss with sigmoid | nn.BCEWithLogitsLoss | gluon.loss.SigmoidBCELoss |
The Kullback-Leibler divergence loss | nn.KLDivLoss | gluon.loss.KLDivLoss |
Huber loss | nn.SmoothL1Loss | gluon.loss.HuberLoss |
Hinge loss | nn.HingeEmbeddingLoss | gluon.loss.HingeLoss |
Squared hinge loss | Not available | gluon.loss.SquaredHingeLoss |
Triplet loss | nn.TripletMarginLoss | gluon.loss.TripletLoss |
Connectionist Temporal Classification (CTC) loss | Not available | gluon.loss.CTCLoss |
Negative log likelihood loss | nn.NLLLoss | Not available |
Negative log likelihood loss with Poisson distribution of target | nn.PoissonNLLLoss | Not available |
Margin ranking loss | nn.MarginRankingLoss | Not available |
Multi-class hinge loss with 2D target | nn.MultiLabelMarginLoss | Not available |
Multi-class classification hinge loss with 1D target | nn.MultiMarginLoss | Not available |
Loss for multi-label one-versus-all classification based on max-entropy | nn.MultiLabelSoftMarginLoss | Not available |
Cosine loss | nn.CosineEmbeddingLoss | Not available |
PyTorch has an nn.functional module providing low-level functions, for example:
import torch.nn.functional as F
x = torch.arange(-5,5)
y = F.relu(torch.autograd.Variable(x))
The nd package in MXNet serves a similar purpose:
import mxnet.nd as F
x = F.arange(-5,5)
y = F.relu(x)
Gluon's Parameter differs from PyTorch's in two ways:
- it has a string name (necessary for switching to symbolic execution)
- it can be created with a shape and an initialization method but without the actual data, which enables lazy initialization, namely creating layers whose input size is determined later (see the sketch below)
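A minimal sketch of lazy initialization (the layer and input sizes here are arbitrary):

from mxnet import nd
from mxnet.gluon import nn

layer = nn.Dense(4)                  # no input size specified
layer.initialize()                   # parameters are not allocated yet
y = layer(nd.ones((2, 3)))           # the first forward pass infers the input size (3)
print(layer.weight.data().shape)     # (4, 3)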
Description | PyTorch | MXNet |
---|---|---|
Create | w = torch.nn.Parameter(torch.ones(2,3)) | w = gluon.Parameter('w', shape=(2,3), init=mx.init.One()) w.initialize() |
Get data | w.data | w.data() |
Get gradient | w.grad | w.grad() |
Write data | w.data -= 1 w.data[:] = 2 | w.data() -= 1 w.data()[:] = 2 |
Change to a random initialization | nn.init.uniform(w) | w.initialize(init=mx.init.Uniform(), force_reinit=True) |
Move to GPU 1 | w = w.cuda(1) | w.reset_ctx(mx.gpu(1)) |
Description | PyTorch | MXNet |
---|---|---|
Get all parameters | params = net.named_parameters() | params = net.collect_params() |
Query parameters by regex | Not available | params = net.collect_params(regex) |
Initialization | Done when constructing net | net.initialize(init=mx.init.Uniform()) |
Use a different init method | for w in params: nn.init.normal(w) | params.initialize(init=mx.init.Normal(), force_reinit=True) |
Get a layer's parameter | net.linear.weight | net.dense.weight |
Initialize a layer differently | nn.init.normal(net.linear.weight) | net.dense.weight.initialize(init=mx.init.Normal(), force_reinit=True) |
Specify a layer's initializer | layer = torch.nn.Linear(20, 10) torch.nn.init.normal(layer.weight, 0, 0.01) | layer = mx.gluon.nn.Dense(10, weight_initializer=mx.init.Normal(0.01)) |
Create a parameter that doesn't update | Variable(torch.ones(5, 5), requires_grad=False) | self.params.get('weight', shape=(5, 5), init='ones', grad_req='null') |
Stop a parameter from updating | weight = weight.detach() before forward | net.dense.weight.grad_req = 'null', can be called anywhere |
Move weights to GPU 1 | net.cuda(1) | params.reset_ctx(mx.gpu(1)) |
Set all grads to 0 | net.zero_grad() | params.zero_grad() |
Description | PyTorch | MXNet |
---|---|---|
Uniform in [-1,1] | nn.init.uniform(w,-1,1) | mx.init.Uniform(1) |
Normal with 0 mean and 1 std | nn.init.normal(w, 0, 1) | mx.init.Normal(1) |
Xavier with uniform dist | nn.init.xavier_uniform(w, gain=3) | mx.init.Xavier(magnitude=3) |
Constant | nn.init.constant(w, 2) | mx.init.Constant(2) |
Serialization and loading of model parameters are achieved by calling save_params
and load_params
. Computation graph serialization is done by calling export
.
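A minimal sketch under these assumptions: net is an initialized (and, for export, hybridized) Gluon model, x is a sample input, and the file names are arbitrary:

import mxnet as mx

net.save_params('net.params')                  # save parameters to disk
net.load_params('net.params', ctx=mx.cpu())    # load them back onto the CPU

net.hybridize()
net(x)                                         # one forward pass so the graph is cached
net.export('model', epoch=0)                   # writes model-symbol.json and model-0000.params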
Class | PyTorch | MXNet Gluon |
---|---|---|
Save model parameters | torch.save(the_model.state_dict(), filename) | model.save_params(filename) |
Load parameters | the_model.load_state_dict(torch.load(PATH)) | model.load_params(filename, ctx, allow_missing=False, ignore_extra=False) |
Export computation graph and parameters | torch.onnx.export(model, dummy_input, 'model.proto') | model.export('./model', epoch=0) |
MXNet also supports dumping the network to disk through hybridization; see this tutorial.
HybridBlock supports forwarding with both Symbol and NDArray. Once hybridized, a HybridBlock creates a symbolic graph representing the forward computation and caches it. Most of the built-in blocks (Dense, Conv2D, MaxPool2D, BatchNorm, etc.) are HybridBlocks. Instead of explicitly declaring the number of inputs to a layer, we can simply state the number of outputs; the input shape is inferred on the fly once the network is provided with some input.
Function | PyTorch | MXNet Gluon |
---|---|---|
Hybridize a network whose input shape is not yet known | Not available | net = mx.gluon.nn.HybridSequential() with net.name_scope(): net.add(mx.gluon.nn.Dense(10)) net.hybridize() |
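A minimal sketch of a custom HybridBlock along these lines (the layer name and sizes are arbitrary):

from mxnet import nd
from mxnet.gluon import nn

class Net(nn.HybridBlock):
    def __init__(self, **kwargs):
        super(Net, self).__init__(**kwargs)
        with self.name_scope():
            self.dense = nn.Dense(10)

    def hybrid_forward(self, F, x):
        # F is mxnet.nd before hybridize() and mxnet.sym afterwards
        return F.relu(self.dense(x))

net = Net()
net.initialize()
net.hybridize()                  # a symbolic graph is cached on the next forward pass
print(net(nd.ones((2, 4))))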
SymbolBlock can construct a block from a symbol. This is useful for using pre-trained models as feature extractors.
Function | PyTorch | MXNet Gluon |
---|---|---|
SymbolBlock | Not Available | alexnet = mx.gluon.model_zoo.vision.alexnet(pretrained=True, prefix='model_') out = alexnet(inputs) internals = out.get_internals() outputs = [internals['model_dense0_relu_fwd_output']] feat_model = gluon.SymbolBlock(outputs, inputs, params=alexnet.collect_params()) |
- PyTorch:
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
for x, y in data:
    optimizer.zero_grad()
    L = loss(net(x), y)
    L.backward()
    optimizer.step()
- MXNet:
optimizer = mx.optimizer.SGD(learning_rate=0.01, momentum=0.9)
trainer = gluon.Trainer(net.collect_params(), optimizer)
for x, y in data:
    with mx.autograd.record():
        L = loss(net(x), y)
    L.backward()
    trainer.step(x.shape[0])  # batch size
zero_grad in the optimizer (PyTorch) or Trainer (Gluon) clears the gradients of all parameters. In Gluon there is no need to clear the gradients every batch if grad_req = 'write' (the default).
Description | PyTorch | MXNet |
---|---|---|
SGD | optim.SGD | optimizer.SGD |
SGD with Nesterov momentum | optim.SGD(nesterov=True) | optimizer.NAG |
Averaged SGD | optim.ASGD | Not available |
Adam | optim.Adam | optimizer.Adam |
Sparse Adam | optim.SparseAdam | optimizer.Adam(lazy_update=True) |
AdaGrad | optim.Adagrad | optimizer.AdaGrad |
RMSProp | optim.RMSprop | optimizer.RMSProp |
Rprop | optim.Rprop | Not available |
AdaDelta | optim.Adadelta | optimizer.AdaDelta |
Adamax | optim.Adamax | optimizer.Adamax |
L-BFGS | optim.LBFGS | Not available |
Ftrl | Not available | optimizer.Ftrl |
Nesterov Adam optimizer | Not available | optimizer.Nadam |
Signum | Not available | optimizer.Signum |
Follow the Moving Leader | Not available | optimizer.FTML |
Large Batch SGD | Not available | optimizer.LBSGD |
Asynchronous SGD with delay compensation | Not available | optimizer.DCASGD |
Stochastic Gradient Riemannian Langevin Dynamics | Not available | optimizer.SGLD |
Function | PyTorch | MXNet Gluon |
---|---|---|
Data parallelism | net = torch.nn.DataParallel(model, device_ids=[0, 1, 2]) output = net(data) | ctx = [mx.gpu(i) for i in range(3)] data = gluon.utils.split_and_load(data, ctx) label = gluon.utils.split_and_load(label, ctx) with autograd.record(): losses = [loss(net(X), Y) for X, Y in zip(data, label)] for l in losses: l.backward() |
Function | PyTorch | MXNet Gluon |
---|---|---|
Distributed data parallelism | torch.distributed.init_process_group(...) model = torch.nn.parallel.DistributedDataParallel(model, ...) | store = kv.create('dist') trainer = gluon.Trainer(net.collect_params(), ..., kvstore=store) |
Gluon provides several predefined metrics for online evaluation of a learned model's performance.
Function | PyTorch | MXNet Gluon |
---|---|---|
Metric | Not available | metric = mx.metric.Accuracy() with autograd.record(): output = net(data) L = loss(output, label) L.backward() trainer.step(batch_size) metric.update(label, output) |
tensorboardX (PyTorch) and dmlc-tensorboard (Gluon) can be used to visualize your network and plot quantitative metrics about the execution of your graph.
Function | PyTorch | MXNet Gluon |
---|---|---|
visualization | writer = tensorboardX.SummaryWriter() ... for name, param in model.named_parameters(): grad = param.clone().cpu().data.numpy() writer.add_histogram(name, grad, n_iter) ... writer.close() | summary_writer = tensorboard.FileWriter('./logs/') ... for name, param in net.collect_params(): grad = param.grad.asnumpy().flatten() s = tensorboard.summary.histogram(name, grad) summary_writer.add_summary(s) ... summary_writer.close() |
Besides the imperative interface, MXNet offers a symbolic programming interface, which is somewhat similar to TensorFlow and Keras. See module tutorials.
Dataset
and DataLoader
are the basic components for loading data.
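A minimal sketch of this pipeline, assuming small in-memory numpy arrays:

import numpy as np
from mxnet import gluon

X = np.random.uniform(size=(10, 3)).astype('float32')
y = np.random.randint(0, 2, size=(10,)).astype('float32')

dataset = gluon.data.ArrayDataset(X, y)
loader = gluon.data.DataLoader(dataset, batch_size=4, shuffle=True)
for data, label in loader:
    print(data.shape, label.shape)    # e.g. (4, 3) and (4,)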
Class | PyTorch | MXNet Gluon |
---|---|---|
Dataset holding arrays | torch.utils.data.TensorDataset(data_tensor, label_tensor) | gluon.data.ArrayDataset(data_array, label_array) |
Data loader | torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=<function default_collate>, drop_last=False) | gluon.data.DataLoader(dataset, batch_size=None, shuffle=False, sampler=None, last_batch='keep', batch_sampler=None, batchify_fn=None, num_workers=0) |
Sequentially applied sampler | torch.utils.data.sampler.SequentialSampler(data_source) | gluon.data.SequentialSampler(length) |
Random order sampler | torch.utils.data.sampler.RandomSampler(data_source) | gluon.data.RandomSampler(length) |
Some commonly used computer vision datasets are provided in the mx.gluon.data.vision package.
Class | PyTorch | MXNet Gluon |
---|---|---|
MNIST handwritten digits dataset | torchvision.datasets.MNIST | mx.gluon.data.vision.MNIST |
CIFAR10 dataset | torchvision.datasets.CIFAR10 | mx.gluon.data.vision.CIFAR10 |
CIFAR100 dataset | torchvision.datasets.CIFAR100 | mx.gluon.data.vision.CIFAR100 |
A generic data loader where the images are arranged in folders | torchvision.datasets.ImageFolder(root, transform=None, target_transform=None, loader=<function default_loader>) | mx.gluon.data.vision.ImageFolderDataset(root, flag, transform=None) |