PyTorch to MXNet

This cheatsheet serves as a quick reference for PyTorch users who are interested in trying MXNet, and vice versa.

PyTorch is a deep learning framework that provides imperative tensor manipulation and neural network training. MXNet provides similar imperative tensor manipulation through the ndarray package and neural network training through gluon. This cheatsheet maps functions one-to-one between the two frameworks.

Note that MXNet also has a symbolic interface, similar to Keras and TensorFlow, that may provide better performance and portability. This cheatsheet mainly focuses on MXNet's imperative interface.

Installation

PyTorch recommends installing through conda, while MXNet's default installation is through pip.

A PyTorch MXNet
CPU conda install pytorch-cpu -c pytorch pip install mxnet
Intel CPU1 Same as above pip install mxnet-mkl
Nvidia GPU2 conda install pytorch -c pytorch pip install mxnet-cu91

1: PyTorch ships with MKL, while mxnet-mkl additionally uses MKL-DNN, a DNN acceleration library for Intel CPUs (similar to cuDNN for Nvidia GPUs).
2: The CUDA Toolkit needs to be installed first. PyTorch by default uses CUDA 8.0; one can choose CUDA 9.0 by specifying cuda90. MXNet requires the CUDA version to be specified explicitly; other variants include mxnet-cu80 and mxnet-cu90.

Load Package

A PyTorch MXNet
Main package import torch import mxnet as mx
Tensor Already under torch from mxnet import nd
Imperative training torch is purely imperative from mxnet import gluon
Neural network from torch import nn from mxnet.gluon import nn
Recurrent Neural network Already under torch.nn from mxnet.gluon import rnn

Manipulate Tensor

PyTorch inherited its tensor functionality from Torch, while MXNet adopted an interface similar to NumPy's ndarray. So for the same multi-dimensional array, PyTorch calls it a tensor, while MXNet names it an ndarray. In this cheatsheet, we use the Tensor naming convention.

Create tensor

One main difference that often confuses users is how to specify the shape. PyTorch allows the shape elements to be positional arguments, e.g. torch.zeros(2,3), while in most cases MXNet only accepts a tuple shape, e.g. nd.zeros((2,3)); see the sketch after the table below.

Description PyTorch MXNet
Uninitialized 5x3 matrix torch.Tensor(5,3) nd.empty((5,3))
Initialized matrix with 0s torch.zeros(2,3) nd.zeros((2,3))
Initialized matrix with 1s torch.ones(5,3) nd.ones((5,3))
Returns a Tensor filled with the scalar value 1, with the same size as input torch.ones_like(x) nd.ones_like(x)
A range list torch.arange(-5,5) nd.arange(-5,5)
Construct from list torch.FloatTensor([[1,],[2,]]) nd.array([[1,],[2,]])
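For illustration, a minimal runnable sketch of the two shape conventions, using only calls from the table above:

import torch
from mxnet import nd

a = torch.zeros(2, 3)      # shape given as positional arguments
b = nd.zeros((2, 3))       # shape given as a single tuple
print(a.size(), b.shape)   # torch.Size([2, 3]) and (2, 3)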

Query Tensor Information

Function PyTorch MXNet
Print the matrix print(x) Same
Get the tensor shape x.size() x.shape
Get the tensor size (#elements) x.numel() x.size
Data type of a tensor x.type() x.dtype
Address of the first element x.data_ptr() Not available
Device context of a tensor x.device x.context

Indexing

Description PyTorch MXNet
Fancy index read x[:], x[1,2], x[1:2,2:3] same
Write x[1:2,2:3]=4 same
Take elements by an index list x.index_select(dim, idx) x.pick(idx), idx is a list of (n-1)-D indices
Gathers values along an axis specified by dim torch.gather(x, 1, torch.LongTensor([[0,0],[1,0]])) nd.gather_nd(x, nd.array([[[0,0],[1,1]],[[0,0],[1,0]]]))
Scatter torch.zeros(2, 4).scatter_(1, torch.LongTensor([[2], [3]]), 1.23) nd.scatter_nd(nd.array([1.23,1.23]), nd.array([[0,1],[2,3]]), (2,4))

Elementwise operations

Description PyTorch MXNet
+, -, *, / with broadcasting x+y, x-y, x*y, x/y Same
abs,ceil,sin,exp... x.abs(),x.ceil(),... Same
arcsine, arccosine,... x.asin(), x.acos() (arc is shortened to a) x.arcsin(), x.arccos()
Comparison x>=y, x==y: returns a byte tensor Same, but the returned data type is the same as x's
Element-wise mod x.fmod(3) nd.modulo(x, 3)
Floor, ceil, and round, trunc x.floor(), x.ceil(), x.round(), x.trunc() Same
Fractional portion of a tensor x.frac() x - x.trunc()
Limits the values of a tensor to between min and max x.clamp(min, max) x.clip(min, max)
Element-wise division of t1, t2, multiply v, and add t torch.addcdiv(t, v, t1, t2) t + v*(t1/t2)
Element-wise applies a function x.apply_(callable) Not available, but there is nd.custom(x, 'op')
Computes Gauss error function x.erf() Not available

A PyTorch comparison returns a byte tensor, which can be used for boolean indexing. For example, x[x>1] returns the elements of x that are larger than 1, while x>1 in MXNet returns a tensor with the same data type as x.
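A minimal sketch of this difference (the values are arbitrary):

import torch
from mxnet import nd

xt = torch.FloatTensor([0.5, 1.5, 2.5])
print(xt > 1)        # byte tensor mask: 0, 1, 1
print(xt[xt > 1])    # boolean indexing keeps the elements > 1

xm = nd.array([0.5, 1.5, 2.5])
mask = xm > 1        # float NDArray of 0s and 1s (same dtype as xm)
print(xm * mask)     # elementwise select: entries <= 1 become 0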

Reduction operators

Description PyTorch MXNet
max, min, mean, sum, and product1 x.max(), x.min(), ...: return a scalar result Same, but returns a (1,)-shape tensor
Same functions as above, but along axis i2 x.max(dim=i) x.max(axis=i)
sum and prod that treat NaN as zero Not available nansum(), nanprod()
Cumulative sum, product along axis i x.cumprod(i), x.cumsum(i) Not available
p-norm x.norm(p) x.norm() only supports L2 norm
p-norm distance between x and y x.dist(y, p) (x-y).norm(): only support L2 norm

1. In PyTorch, x.max() returns a scalar, while the same code in MXNet returns a tensor with shape (1,). We can then call asscalar(), namely x.max().asscalar(), to get the scalar result.
2. PyTorch's max/min along an axis returns a tuple of (max/min element values, max/min element positions), while MXNet's min and max only return the values.
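A minimal sketch of note 1, recovering a Python scalar with asscalar():

import torch
from mxnet import nd

xt = torch.FloatTensor([[1, 2], [3, 4]])
print(xt.max())                # scalar result

xm = nd.array([[1, 2], [3, 4]])
print(xm.max())                # NDArray of shape (1,)
print(xm.max().asscalar())     # plain Python scalar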

Expansion operators

Description PyTorch MXNet
Broadcasts/expands tensor to new shape x.expand(3,4) x.broadcast_to((3,4))
Repeats tensor x.repeat(4,2) x.tile((4,2))

Sorting

Description PyTorch MXNet
Sort x.sort() Same
Sort along axis i x.sort(dim=i) x.sort(axis=i)
Top k elements x.topk(k) x.topk(k=k)
Indices that sort a tensor _,idx=x.sort() x.argsort()
Indices of the max/min values _,idx=x.max(), _,idx=x.min() x.argmax(), x.argmin()

Manipulating shape

Description PyTorch MXNet
Reshape without changing data x.view(3,5) x.reshape((3,5))
Reshape to y's shape without change data x.view_as(y) x.reshape_like(y)
Insert a new axis with size 1 at k-th dim x.unsqueeze(k) x.expand_dims(axis=k)
Repeat along each axis x.repeat(4,2) x.tile((4,2))
Number of elements in a tensor x.numel() x.size
Splits a tensor in a given dim x.chunk(num_of_chunk) nd.split(x, num_outputs=num_of_chunk)

Change Data Type

Description PyTorch MXNet
Change value type to int8 x.char() x.astype('int8')
Returns a copy of the tensor after casting to a specified type x.type(type) x.astype(dtype)
Return a copy x.clone() x.copy()
Construct from a numpy array x torch.from_numpy(x) nd.array(x)
Returns this tensor as a NumPy ndarray x.numpy() x.asnumpy()
In-place write x+y into z torch.add(x,y,out=z) nd.elemwise_add(x, y, out=z)
In-place add x to y y.add_(x) y+=x
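A minimal sketch of casting and NumPy round-trips, combining a few rows from the table above:

import numpy as np
import torch
from mxnet import nd

a = np.ones((2, 3), dtype='float32')

xt = torch.from_numpy(a)          # shares memory with the NumPy array
print(xt.type(torch.IntTensor))   # cast copy to int32
print(xt.numpy())                 # back to NumPy

xm = nd.array(a)                  # copies into an NDArray
print(xm.astype('int32'))         # cast copy to int32
print(xm.asnumpy())               # back to NumPy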

Sampling Random Tensors

Description PyTorch MXNet
Random values in [0,1] torch.rand(5,3) nd.random.uniform(shape=(5,3))
Normal distribution torch.normal(means, std): means and std should be tensors with the same shape as the output nd.random.normal(loc, scale, (5,3)): loc (mean) and scale (std) can be scalars
Uniform distribution torch.rand(*sizes) for [0, 1); torch.Tensor.uniform_(from, to) for [from, to) nd.random.uniform(low, high, (5, 3))
Cauchy distribution torch.Tensor.cauchy_(median, sigma)(Inplace) N/A
Exponential distribution torch.Tensor.exponential_(lambd) (Inplace) nd.random.exponential(scale, (5, 3))
Geometric distribution torch.Tensor.geometric_(p) (Inplace) N/A
Log-normal distribution torch.Tensor.log_normal_(mean, std)(Inplace) N/A
Bernoulli distribution x.bernoulli() Not available

Linear Algebra

Description PyTorch MXNet Gluon
Matrix transpose x.t() x.T
Matrix products with broadcasting torch.matmul(x,y) nd.dot(x,y)
GEMM torch.addmm(b,z,a,x,y) nd.linalg.gemm(x,y,z,a,b)
batch matrix product x.bmm(y) or torch.bmm(x, y) nd.linalg.gemm2(x, y)
batch matrix product and accumulation torch.addbmm(M, batch1, batch2) nd.linalg.gemm(batch1, batch2, M) Leading n-2 dim are reduced
Outer product of two vectors added to a matrix m.addr(vec1, vec2) Not available
Cross product x.cross(y) Not available
Creates a diagonal tensor x.diag(diagonal) Not available
Solves least square & least norm B.gels(A) Not available
Eigendecomposition for symmetric matrix e, v = a.symeig() v, e = nd.linalg.syevd(a)

On GPUs

Description PyTorch MXNet Gluon
Create on GPU 0 torch.ones(5,3).cuda() nd.ones((5,3), ctx=mx.gpu())
Context scope with torch.cuda.device(1): torch.ones(5,3) with mx.gpu(1): nd.ones((5,3))
Copy to GPU 1 (No change if already on GPU 1) x.cuda(1) x.as_in_context(mx.gpu(1))
Always copy x.cuda(1) x.copyto(mx.gpu(1))
Copy data from x to y (possibly on different GPUs) y.copy_(x) x.copyto(y)
Convert GPU Tensor/NDArray to numpy array x = y.numpy() x = y.asnumpy()
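A minimal sketch of the placement calls above, assuming at least two GPUs are available (otherwise the gpu(...) calls will fail):

import torch
import mxnet as mx
from mxnet import nd

xt = torch.ones(5, 3).cuda()            # create on the default GPU
xt = xt.cuda(1)                         # copy to GPU 1 if not already there

xm = nd.ones((5, 3), ctx=mx.gpu())      # create on GPU 0
xm = xm.as_in_context(mx.gpu(1))        # move to GPU 1 (no-op if already there)
print(xt.cpu().numpy(), xm.asnumpy())   # bring the results back to the host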

Autograd

Tensor params vs autograd scope

The autograd packages of PyTorch and MXNet enable automatic differentiation of Tensors/NDArrays.

  • PyTorch: construct the Tensor with requires_grad specifying whether gradients are needed
x = torch.ones((1,), requires_grad=True)
y = x * 2
  • MXNet: attach a gradient buffer to the ndarray with attach_grad, and put the computation within the autograd.record() scope:
x = mx.nd.ones((1,))
x.attach_grad()
with mx.autograd.record():
    y = x * 2
Description PyTorch MXNet Gluon
Run backward y.backward() Same
Access gradients x.grad Same
Grad check test = gradcheck(Linear.apply, input, eps=1e-6, atol=1e-4) Not Implemented yet for Gluon
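Putting the pieces together, a minimal end-to-end sketch of both autograd workflows:

import torch
import mxnet as mx

# PyTorch: mark the tensor as requiring gradients, then call backward.
xt = torch.ones((1,), requires_grad=True)
yt = xt * 2
yt.backward()
print(xt.grad)       # gradient of y = 2*x is 2

# MXNet: attach a gradient buffer and record the computation.
xm = mx.nd.ones((1,))
xm.attach_grad()
with mx.autograd.record():
    ym = xm * 2
ym.backward()
print(xm.grad)       # also 2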

scope override (pause, train_mode, predict_mode)

Some operators (Dropout, BatchNorm, etc.) behave differently during training and prediction. This can be controlled with the train_mode and predict_mode scopes in MXNet. The pause scope is for code that does not need gradients to be calculated.

x = mx.nd.ones((1,))
with autograd.train_mode():
    y = mx.nd.Dropout(x)
    with autograd.predict_mode():
        z = mx.nd.Dropout(y)

w = mx.nd.ones((1,))
w.attach_grad()
with autograd.record():
    y = x * w
    y.backward()
    with autograd.pause():
        w += w.grad

batch-end synchronization is needed

MXNet uses lazy evaluation to achieve superior performance. The Python thread just pushes operations into the backend engine and then returns. In the training phase, batch-end synchronization is needed, e.g. asnumpy(), wait_to_read(), or metric.update(...).

for (data, label) in train_data:
    with autograd.record():
        output = net(data)
        L = loss(output, label)
        L.backward()
    trainer.step(data.shape[0])
    metric.update([label], [output])

Neural networks

Construct Networks

We can construct a neural network by creating its layers and defining the forward function.

  • PyTorch: inherit from nn.Module
class Net(nn.Module):
    def __init__(self, D_in, D_out):
        super(Net, self).__init__()
        self.linear = nn.Linear(D_in, D_out)
    def forward(self, x):
        return self.linear(x)
net = Net(2,4)
print(net(x))
  • MXNet: inherit from nn.Block. Compared to PyTorch:
    • Layers defined in nn don't require the input shape, since it can be inferred later.
    • Layers need to be defined within the parent's name scope so that model parameters get proper names, which are needed to save and load the parameters
class Net(nn.Block):
    def __init__(self, D_out):
        super(Net, self).__init__()
        with self.name_scope():
            self.dense = nn.Dense(D_out)
    def forward(self, x):
        return self.dense(x)
net = Net(4)
print(net(x))

Both of them provide a sequential container to chain layers into a network:

  • Pytorch
net = nn.Sequential(
    nn.Linear(2,3),
    nn.ReLU(),
    nn.Linear(3,4)
)
  • MXNet:
net = nn.Sequential()
with net.name_scope():
    net.add(
        nn.Dense(3, activation='relu'),
        nn.Dense(4)
    )

Forward: Compute outputs

  • PyTorch: needs to wrap the input data into a Variable
x = torch.autograd.Variable(torch.ones(2,2))
y = net(x)
print(y)
  • MXNet: needs to initialize the parameters explicitly.
net.initialize()
x = nd.ones((2,2))
y = net(x)
print(y)

Backward: Compute Gradients

  • PyTorch:
x = torch.autograd.Variable(torch.ones(2,2))
z = torch.autograd.Variable(torch.ones(2,4)*2)
loss = nn.MSELoss()

net.zero_grad()
y = net(x)
L = loss(y, z)
L.backward()
print(net[0].weight.grad)
  • propagates Variables to define the scope for backward
  • uses add semantics to store gradients, namely w.grad += cur_grad, so the gradients need to be explicitly zeroed before every forward pass
  • MSELoss returns the average of the L2 losses over all examples, so L has size 1
  • MXNet:
x = nd.ones((2,2))
z = nd.ones((2,4))*2
loss = gluon.loss.L2Loss()

net.initialize()
with mx.autograd.record():
    y = net(x)
    L = loss(y, z)
L.backward()
print(net[0].weight.grad())
  • explicitly put the computations that need to be differentiated in the autograd.record() scope.
  • uses write by default to store gradients, namely w.grad = cur_grad, so calling zero_grad is not necessary unless we change to add, e.g. net[0].weight.grad_req = 'add'
  • L2Loss applies the L2 loss to examples one by one, so L has size 2. Calling L.backward() is equivalent to summing L's elements and then calling backward. To behave the same as PyTorch's MSELoss, change to L = loss(y, z).mean(), as in the sketch below.
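A minimal sketch of the shape difference and the .mean() workaround (the values are arbitrary):

from mxnet import nd, gluon

y = nd.ones((2, 4))
z = nd.ones((2, 4)) * 2
loss = gluon.loss.L2Loss()

L_per_example = loss(y, z)    # shape (2,): one loss value per example
L_mean = loss(y, z).mean()    # shape (1,): MSELoss-style averaging
print(L_per_example, L_mean)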

Feedforward Layers

Both PyTorch and Gluon define various neural network layers in the nn module. Compared to PyTorch, MXNet

  • adopts Keras' layer definitions, while PyTorch's naming convention is closer to Chainer's
  • Again, the input size of a layer is optional in MXNet
A Pytorch Gluon
Sequential container nn.Sequential(x,y) nn.Sequential(x,y)
Fully connected nn.Linear(in, out) nn.Dense(out) 
Activation nn.ReLU() nn.Activation('relu')
3x3 2D convolution with 1-stride 0-padding nn.Conv2d(in, out,3,1,0) nn.Conv2D(out,3,1,0)
3x3 2D max pooling with 1-stride 0-padding nn.MaxPool2d(3,1,0) nn.MaxPool2D(3,1,0)
Adaptive max pooling with fixed output size nn.AdaptiveMaxPool2d(4) Not Available
Adaptive avg pooling with fixed output size nn.AdaptiveAvgPool2d(4) Not Available
Batch normalization nn.BatchNorm2d(in) nn.BatchNorm()
Embedding nn.Embedding(in,out) nn.Embedding(in,out)

1: supports applying an activation layer on the output by activation='relu'
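A minimal sketch of footnote 1: the fused spelling in the first line behaves like the explicit Dense + Activation pair below it:

from mxnet.gluon import nn

net_a = nn.Dense(4, activation='relu')     # activation fused into the layer

net_b = nn.Sequential()                    # equivalent two-layer spelling
with net_b.name_scope():
    net_b.add(nn.Dense(4), nn.Activation('relu'))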

Recurrent Layers

PyTorch defines recurrent neural network layers in the nn module, while MXNet defines them in the rnn module.

A Pytorch Gluon
Elman RNN with tanh activation, with input dimension 10 and hidden dimension 20 nn.RNN(10, 20, nonlinearity='tanh') rnn.RNN(20, activation='tanh'), input dimension is not required
Bidirectional LSTM with input dimension 10, hidden dimension 20 per direction (total 40) nn.LSTM(10, 20, bidirectional=True) rnn.LSTM(20, bidirectional=True)
GRU with output dropout rate 0.2 on each GRU layer except the last nn.GRU(10, 20, dropout=0.2) rnn.GRU(20, dropout=0.2)
LSTM cell with input dimension 10 and hidden dimension 20. nn.LSTMCell(10, 20) rnn.LSTMCell(20)
Same LSTM cell without bias nn.LSTMCell(10, 20, bias=False) Not available
Enable Zoneout with probability 0.2 on recurrent state for cell Not available ZoneoutCell(cell, zoneout_states=0.2)
Single step unroll of cell with input x of shape (3, 10), hidden dimension 20, and beginning states h, c both of shape (3, 20) h, c = cell(x, (h, c)) Same
Unroll cell for 6 steps, with input of shape (6, 3, 10), hidden dimension 20, and beginning states h, c both of shape (3, 20) output = []; for i in range(6): h, c = cell(input[i], (h, c)); output.append(h) output, last_state = cell.unroll(6, input, begin_state=(h, c), layout='TNC')

Loss functions

Gluon's loss functions are in the gluon.loss module.

Description PyTorch Gluon
Square loss nn.MSELoss gluon.loss.L2Loss
L1 Loss nn.L1Loss gluon.loss.L1Loss
Cross-entropy loss with softmax nn.CrossEntropyLoss gluon.loss.SoftmaxCrossEntropyLoss
Logistic loss nn.SoftMarginLoss gluon.loss.LogisticLoss
Binary cross-entropy loss nn.BCELoss gluon.loss.SigmoidBCELoss(from_sigmoid=False)
Binary cross-entropy loss with sigmoid nn.BCEWithLogitsLoss gluon.loss.SigmoidBCELoss
The Kullback-Leibler divergence Loss nn.KLDivLoss gluon.loss.KLDivLoss
Huber loss nn.SmoothL1Loss gluon.loss.HuberLoss
Hinge loss nn.HingeEmbeddingLoss gluon.loss.HingeLoss
Squared hinge loss Not available gluon.loss.SquaredHingeLoss
Triplet loss nn.TripletMarginLoss gluon.loss.TripletLoss
Connectionist Temporal Classification (CTC) loss Not available gluon.loss.CTCLoss
Negative log likelihood loss nn.NLLLoss Not available
Negative log likelihood loss with Poisson distribution of target nn.PoissonNLLLoss Not available
Margin ranking loss nn.MarginRankingLoss Not available
Multi-class hinge loss with 2D target nn.MultiLabelMarginLoss Not available
Multi-class classification hinge loss with 1D target nn.MultiMarginLoss Not available
Loss for multi-label one-versus-all classification based on max-entropy nn.MultiLabelSoftMarginLoss Not available
Cosine loss nn.CosineEmbeddingLoss Not available

Low-level functions

PyTorch has an nn.functional module providing low-level functions, for example:

import torch.nn.functional as F
x = torch.arange(-5,5)
y = F.relu(torch.autograd.Variable(x))

The nd package in MXNet serves a similar purpose:

from mxnet import nd as F
x = F.arange(-5,5)
y = F.relu(x)

Manipulate Parameters

Parameter: container of weights

Gluon's Parameter differs from PyTorch's in two ways:

  1. it has a string name (necessary to switch to symbolic execution)
  2. it can be created with a shape and an initialization method but without the actual data. This enables lazy initialization, namely creating layers whose input size is determined later; see the sketch after the table below.
A PyTorch MXNet
Create w = torch.nn.Parameter(torch.ones(2,3)) w = gluon.Parameter('w', shape=(2,3), init=mx.init.One()); w.initialize()
Get data w.data w.data()
Get gradient w.grad w.grad()
Write data w.data -= 1 or w.data[:] = 2 w.data()[:] -= 1 or w.data()[:] = 2
Change a random initialization nn.init.uniform(w) w.initialize(init=mx.init.Uniform(), force_reinit=True)
Move to GPU 1 w = w.cuda(1) w.reset_ctx(mx.gpu(1))
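A minimal sketch of the two points above: an explicitly shaped Parameter, and a Dense layer whose input size is only inferred at the first forward pass:

import mxnet as mx
from mxnet import nd, gluon
from mxnet.gluon import nn

w = gluon.Parameter('w', shape=(2, 3), init=mx.init.One())
w.initialize()
print(w.data(), w.grad())      # data and gradient buffers are now allocated

layer = nn.Dense(4)            # no input dimension given
layer.initialize()             # parameters remain deferred placeholders
y = layer(nd.ones((2, 2)))     # first forward infers the input size (2)
print(layer.weight.shape)      # (4, 2)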

Network Parameters

A PyTorch MXNet
Get all parameters params = net.named_parameters() params = net.collect_params()
Query parameters by regex Not available params = net.collect_params(regex)
Initialization Done when constructing net net.initialize(init=mx.init.Uniform())
Use a different init method for w in params: nn.init.normal(w) params.initialize(init=mx.init.Normal(), force_reinit=True)
Get a layer's parameter net.linear.weight net.dense.weight
Initialize a layer differently nn.init.normal(net.linear.weight) net.dense.weight.initialize(init=mx.init.Normal(), force_reinit=True)
Specify layer initializer layer = torch.nn.Linear(20, 10); torch.nn.init.normal(layer.weight, 0, 0.01) layer = mx.gluon.nn.Dense(10, weight_initializer=mx.init.Normal(0.01))
Create a parameter that doesn't update Variable(torch.ones(5, 5), requires_grad=False) self.params.get('weight', shape=(5, 5), init='ones', grad_req='null')
Stop a parameter from updating weight = weight.detach() before forward net.dense.weight.grad_req = 'null', can be called anywhere
Move weights to GPU 1 net.cuda(1) params.reset_ctx(mx.gpu(1))
Set all grads to 0 net.zero_grad() params.zero_grad()

Initialization Methods

A PyTorch MXNet
Uniform in [-1,1] nn.init.uniform(w,-1,1) mx.init.Uniform(1)
Normal with 0-mean 1-std nn.init.normal(w, 0, 1) mx.init.Normal(1)
Xavier with uniform dist nn.init.xavier_uniform(w, gain=3) mx.init.Xavier(magnitude=3)
Constant nn.init.constant(w, 2) mx.init.Constant(2)

Serialization

Saving and loading of model parameters are done by calling save_params and load_params. Computation graph serialization is done by calling export. A sketch follows the table below.

Class Pytorch MXNet Gluon
Save model parameters torch.save(the_model.state_dict(), filename) model.save_params(filename)
Load parameters the_model.load_state_dict(torch.load(PATH)) model.load_params(filename, ctx, allow_missing=False, ignore_extra=False)
Export computation graph and parameters torch.onnx.export(model, dummy_input, 'model.proto') model.export('./model', epoch=0)
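A minimal sketch of saving and reloading Gluon parameters with the calls from the table (the network and filename are arbitrary):

import mxnet as mx
from mxnet import nd
from mxnet.gluon import nn

net = nn.Sequential()
with net.name_scope():
    net.add(nn.Dense(3, activation='relu'), nn.Dense(4))
net.initialize()
net(nd.ones((2, 2)))                         # run once so all shapes are known

net.save_params('net.params')                # write parameters to disk
net.load_params('net.params', ctx=mx.cpu())  # restore them later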

MXNet also supports dumping the network to disk through hybridization; see this tutorial.

HybridBlock

HybridBlock supports forwarding with both Symbol and NDArray. After being hybridized, a HybridBlock creates a symbolic graph representing the forward computation and caches it. Most of the built-in blocks (Dense, Conv2D, MaxPool2D, BatchNorm, etc.) are HybridBlocks. Instead of explicitly declaring the number of inputs to a layer, we can simply state the number of outputs. The shape will be inferred on the fly once the network is provided with some input.

Function: partial-shape hybridization
PyTorch: Not available
MXNet Gluon:
    net = mx.gluon.nn.HybridSequential()
    with net.name_scope():
        net.add(mx.gluon.nn.Dense(10))
    net.hybridize()

SymbolBlock

SymbolBlock can construct a block from a symbol. This is useful for using pre-trained models as feature extractors.

Function: SymbolBlock
PyTorch: Not available
MXNet Gluon:
    alexnet = mx.gluon.model_zoo.vision.alexnet(pretrained=True, prefix='model_')
    inputs = mx.sym.var('data')   # symbolic input variable (assumed; needed by get_internals)
    out = alexnet(inputs)
    internals = out.get_internals()
    outputs = [internals['model_dense0_relu_fwd_output']]
    feat_model = gluon.SymbolBlock(outputs, inputs, params=alexnet.collect_params())

Optimization

Optimization Method

  • PyTorch:
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
for x, y in data:
    optimizer.zero_grad()
    L = loss(net(x), y)
    L.backward()
    optimizer.step()
  • MXNet:
optimizer = mx.optimizer.SGD(learning_rate=0.01, momentum=0.9)
trainer = gluon.Trainer(net.collect_params(), optimizer)
for x, y in data:
    with mx.autograd.record():
        L = loss(net(x), y)
    L.backward()
    trainer.step(x.shape[0]) # batch size

zero_grad in the optimizer (PyTorch) or Trainer (Gluon) clears the gradients of all parameters. In Gluon, there is no need to clear the gradients every batch if grad_req = 'write' (the default).

Optimizers

A PyTorch MXNet
SGD optim.SGD optimizer.SGD
SGD with Nesterov momentum optim.SGD(nesterov=True) optimizer.NAG
Averaged SGD optim.ASGD Not available
Adam optim.Adam optimizer.Adam
Sparse Adam optim.SparseAdam optimizer.Adam(lazy_update=True)
AdaGrad optim.Adagrad optimizer.AdaGrad
RMSProp optim.RMSprop optimizer.RMSProp
Rprop optim.Rprop Not available
AdaDelta optim.Adadelta optimizer.AdaDelta
Adamax optim.Adamax optimizer.Adamax
L-BFGS optim.LBFGS Not available
Ftrl Not available optimizer.Ftrl
Nesterov Adam optimizer Not available optimizer.Nadam
Signum Not available optimizer.Signum
Follow the Moving Leader Not available optimizer.FTML
Large Batch SGD Not available optimizer.LBSGD
Asynchronous SGD with delay compensation Not available optimizer.DCASGD
Stochastic Gradient Riemannian Langevin Dynamics Not available optimizer.SGLD

Multi-GPU training

Function: data parallelism
PyTorch:
    net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])
    output = net(data)
MXNet Gluon:
    ctx = [mx.gpu(i) for i in range(3)]
    data = gluon.utils.split_and_load(data, ctx)
    label = gluon.utils.split_and_load(label, ctx)
    with autograd.record():
        losses = [loss(net(X), Y) for X, Y in zip(data, label)]
    for l in losses:
        l.backward()

Distributed training

Function: distributed data parallelism
PyTorch:
    torch.distributed.init_process_group(...)
    model = torch.nn.parallel.DistributedDataParallel(model, ...)
MXNet Gluon:
    store = mx.kv.create('dist')
    trainer = gluon.Trainer(net.collect_params(), ..., kvstore=store)

Monitoring

MXNet has pre-defined metrics

Gluon provides several predefined metrics that can evaluate the performance of a model online during training.

Function: metric
PyTorch: Not available
MXNet Gluon:
    metric = mx.metric.Accuracy()
    with autograd.record():
        output = net(data)
        L = loss(output, label)
    L.backward()
    trainer.step(batch_size)
    metric.update([label], [output])

Data visualization

tensorboardX (PyTorch) and dmlc-tensorboard (Gluon) can be used to visualize your network and plot quantitative metrics about the execution of your graph.

Function: visualization
PyTorch:
    writer = tensorboardX.SummaryWriter()
    ...
    for name, param in model.named_parameters():
        grad = param.clone().cpu().data.numpy()
        writer.add_histogram(name, grad, n_iter)
    ...
    writer.close()
MXNet Gluon:
    summary_writer = tensorboard.FileWriter('./logs/')
    ...
    for name, param in net.collect_params().items():
        grad = param.grad().asnumpy().flatten()
        s = tensorboard.summary.histogram(name, grad)
        summary_writer.add_summary(s)
    ...
    summary_writer.close()

Symbolic Programming

Besides the imperative interface, MXNet offers a symbolic programming interface, which is somewhat similar to TensorFlow and Keras. See module tutorials.

Data loading

Dataset and DataLoader are the basic components for loading data; a sketch follows the table below.

Class Pytorch MXNet Gluon
Dataset holding arrays torch.utils.data.TensorDataset(data_tensor, label_tensor) gluon.data.ArrayDataset(data_array, label_array)
Data loader torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=<function default_collate>, drop_last=False) gluon.data.DataLoader(dataset, batch_size=None, shuffle=False, sampler=None, last_batch='keep', batch_sampler=None, batchify_fn=None, num_workers=0)
Sequentially applied sampler torch.utils.data.sampler.SequentialSampler(data_source) gluon.data.SequentialSampler(length)
Random order sampler torch.utils.data.sampler.RandomSampler(data_source) gluon.data.RandomSampler(length)
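A minimal sketch of an ArrayDataset wrapped in a DataLoader (the array contents are arbitrary):

from mxnet import nd
from mxnet.gluon import data

X = nd.random.uniform(shape=(10, 3))
y = nd.arange(10)

dataset = data.ArrayDataset(X, y)
loader = data.DataLoader(dataset, batch_size=4, shuffle=True)

for batch_x, batch_y in loader:
    print(batch_x.shape, batch_y.shape)   # batches of 4 (last batch has 2)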

Some commonly used datasets for computer vision are provided in the mx.gluon.data.vision package.

Class Pytorch MXNet Gluon
MNIST handwritten digits dataset. torchvision.datasets.MNIST mx.gluon.data.vision.MNIST
CIFAR10 Dataset. torchvision.datasets.CIFAR10 mx.gluon.data.vision.CIFAR10
CIFAR100 Dataset. torchvision.datasets.CIFAR100 mx.gluon.data.vision.CIFAR100
A generic data loader where the images are arranged in folders. torchvision.datasets.ImageFolder(root, transform=None, target_transform=None, loader=<function default_loader>) mx.gluon.data.vision.ImageFolderDataset(root, flag, transform=None)