This cheatsheet serves as a quick reference for PyTorch users who are interested in trying MXNet, and vice versa.
PyTorch is a deep learning framework that provides imperative tensor manipulation and neural network training. MXNet provides similar imperative tensor manipulation through the ndarray package and neural network training through gluon. This cheatsheet maps functions one-by-one between the two frameworks.
Note that MXNet also has a symbolic interface, similar to Keras and TensorFlow, that may provide better performance and portability. This cheatsheet mainly focuses on MXNet's imperative interface.
PyTorch recommends installation via conda, while MXNet's default installation is through pip.
Hardware | PyTorch | MXNet |
---|---|---|
CPU | conda install pytorch-cpu -c pytorch | pip install mxnet |
Intel CPU1 | Same as above | pip install mxnet-mkl |
Nvidia GPU2 | conda install pytorch -c pytorch | pip install mxnet-cu91 |
1: PyTorch ships with MKL, while mxnet-mkl additionally uses MKL-DNN, a DNN acceleration library for Intel CPUs (similar to cuDNN for Nvidia GPUs). 2: The CUDA Toolkit must be installed first. PyTorch uses CUDA 8.0 by default; CUDA 9.0 can be chosen by specifying cuda90. MXNet requires the CUDA version to be specified explicitly; other variants include mxnet-cu80 and mxnet-cu90.
Description | PyTorch | MXNet |
---|---|---|
Main package | import torch | import mxnet as mx |
Tensor | Already under torch | from mxnet import nd |
Imperative training | torch is purely imperative | from mxnet import gluon |
Neural network | from torch import nn | from mxnet.gluon import nn |
Recurrent neural network | Already under torch.nn | from mxnet.gluon import rnn |
PyTorch inherited its tensor functionality from Torch, while MXNet adopted an interface similar to numpy's ndarray. So for the same multi-dimensional array, PyTorch calls it a tensor while MXNet calls it an ndarray. In this cheatsheet we use the tensor naming convention.
One main difference that often confuses users is how to specify the shape. PyTorch allows the shape elements to be positional arguments, e.g. torch.zeros(2,3), while in most cases MXNet only accepts a shape tuple, e.g. nd.zeros((2,3)).
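For example, a minimal sketch of this difference (assuming both packages are installed):

import torch
from mxnet import nd

a = torch.zeros(2, 3)        # shape passed as positional arguments
b = nd.zeros((2, 3))         # shape passed as a tuple
print(a.size(), b.shape)     # torch.Size([2, 3]) and (2, 3)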
Description | PyTorch | MXNet |
---|---|---|
Uninitialized 5x3 matrix | torch.Tensor(5,3) | nd.empty((5,3)) |
Matrix initialized with 0s | torch.zeros(2,3) | nd.zeros((2,3)) |
Matrix initialized with 1s | torch.ones(5,3) | nd.ones((5,3)) |
Tensor filled with the scalar value 1, with the same size as input | torch.ones_like(x) | nd.ones_like(x) |
A range of values | torch.arange(-5,5) | nd.arange(-5,5) |
Construct from a list | torch.FloatTensor([[1,],[2,]]) | nd.array([[1,],[2,]]) |
Function | PyTorch | MXNet |
---|---|---|
Print the matrix | print(x) | Same |
Get the tensor shape | x.size() | x.shape |
Get the tensor size (#elements) | x.numel() | x.size |
Data type of a tensor | x.type() | x.dtype |
Address of the first element | x.data_ptr() | Not available |
Device context of a tensor | x.device | x.context |
Description | PyTorch | MXNet |
---|---|---|
Fancy index read | x[:], x[1,2], x[1:2,2:3] | Same |
Write | x[1:2,2:3]=4 | Same |
Take elements by an index list | x.index_select(dim, idx) | x.pick(idx), where idx is a list of (n-1)-D indices |
Gather values along an axis specified by dim | torch.gather(x, 1, torch.LongTensor([[0,0],[1,0]])) | nd.gather_nd(x, nd.array([[[0,0],[1,1]],[[0,0],[1,0]]])) |
Scatter | torch.zeros(2, 4).scatter_(1, torch.LongTensor([[2], [3]]), 1.23) | nd.scatter_nd(nd.array([1.23,1.23]), nd.array([[0,1],[2,3]]), (2,4)) |
Description | PyTorch | MXNet |
---|---|---|
+, -, *, / with broadcasting | x+y, x-y, x*y, x/y | Same |
abs, ceil, sin, exp, ... | x.abs(), x.ceil(), ... | Same |
arcsine, arccosine, ... | x.asin(), x.acos(): arc is shortened to a | x.arcsin(), x.arccos() |
Comparison | x>=y, x==y: return a byte tensor | Same, but the returned data type is the same as x's |
Element-wise mod | x.fmod(3) | nd.modulo(x, 3) |
Floor, ceil, round, trunc | x.floor(), x.ceil(), x.round(), x.trunc() | Same |
Fractional portion of a tensor | x.frac() | x - x.trunc() |
Limit the values of a tensor to between min and max | x.clamp(min, max) | x.clip(min, max) |
Element-wise division of t1 by t2, multiplied by v and added to t | torch.addcdiv(t, v, t1, t2) | t + v*(t1/t2) |
Element-wise apply a function | x.apply_(callable) | Not available, but there is nd.custom(x, 'op') |
Compute the Gauss error function | x.erf() | Not available |
A PyTorch comparison returns a byte tensor, which can be used for boolean indexing. For example, x[x>1] returns the elements of x that are larger than 1. In MXNet, x>1 returns a tensor with the same data type as x.
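A minimal sketch of this difference, using a small 1-D tensor with arbitrarily chosen values:

import torch
from mxnet import nd

xt = torch.Tensor([0.5, 1.5, 2.5])
print(xt > 1)               # byte (bool in newer PyTorch) mask
print(xt[xt > 1])           # boolean indexing keeps 1.5 and 2.5

xm = nd.array([0.5, 1.5, 2.5])
mask = xm > 1               # float mask [0. 1. 1.] with the same dtype as xm
print(xm * mask)            # one workaround: zero out the filtered-out entries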
Description | PyTorch | MXNet |
---|---|---|
max, min, mean, sum, and product1 | x.max(), x.min(), ...: return a scalar result | Same, but return a (1,)-shape tensor |
Same functions as above but along axis i2 | x.max(dim=i) | x.max(axis=i) |
sum and prod that treat NaN as zero | Not available | nansum(), nanprod() |
Cumulative sum, product along axis i | x.cumsum(i), x.cumprod(i) | Not available |
p-norm | x.norm(p) | x.norm(): only supports the L2 norm |
p-norm distance between x and y | x.dist(y, p) | (x-y).norm(): only supports the L2 norm |
1: In PyTorch, x.max() returns a scalar, while the same code in MXNet returns a tensor with shape (1,); call asscalar(), i.e. x.max().asscalar(), to get the scalar result. 2: PyTorch's max/min along an axis returns a tuple of (max/min values, their positions), while MXNet's min and max only return the values.
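A minimal sketch of footnote 1, assuming a small NDArray x:

from mxnet import nd

x = nd.array([[1, 2, 3], [4, 5, 6]])
m = x.max()                 # NDArray of shape (1,): [6.]
print(m.asscalar())         # plain Python scalar: 6.0
print(x.max(axis=1))        # per-row maxima, values only: [3. 6.]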
Description | PyTorch | MXNet |
---|---|---|
Broadcast/expand tensor to a new shape | x.expand(3,4) | x.broadcast_to((3,4)) |
Repeat tensor | x.repeat(4,2) | x.tile((4,2)) |
Description | PyTorch | MXNet |
---|---|---|
Sort | x.sort() | Same |
Sort along axis i | x.sort(dim=i) | x.sort(axis=i) |
Top k elements | x.topk(k) | x.topk(k=k) |
Indices that sort a tensor | _,idx=x.sort() | x.argsort() |
Indices of the max/min values | _,idx=x.max(), _,idx=x.min() | x.argmax(), x.argmin() |
Description | PyTorch | MXNet |
---|---|---|
Reshape without changing data | x.view(3,5) | x.reshape((3,5)) |
Reshape to y's shape without changing data | x.view_as(y) | x.reshape_like(y) |
Insert a new axis with size 1 at the k-th dim | x.unsqueeze(k) | x.expand_dims(k) |
Repeat along each axis | x.repeat(4,2) | x.tile((4,2)) |
Number of elements in a tensor | x.numel() | x.size |
Split a tensor along a given dim | x.chunk(num_of_chunk) | nd.split(x, num_outputs=num_of_chunk) |
Description | PyTorch | MXNet |
---|---|---|
Change value type to int8 | x.char() | x.astype('int8') |
Return a copy of the tensor cast to a specified type | x.type(type) | x.astype(dtype) |
Return a copy | x.clone() | x.copy() |
Construct from a numpy array x | torch.from_numpy(x) | nd.array(x) |
Return this tensor as a NumPy ndarray | x.numpy() | x.asnumpy() |
In-place z = x+y | torch.add(x,y,out=z) | nd.elemwise_add(x, y, out=z) |
In-place y = x+y | y.add_(x) | y+=x |
Description | PyTorch | MXNet |
---|---|---|
Random values in [0,1] | torch.rand(5,3) | nd.random.uniform(shape=(5,3)) |
Normal distribution | torch.normal(means, std): means and std should be tensors with the same shape as the output | nd.random.normal(loc, scale, (5,3)): loc (mean) and scale (std) can be scalars |
Uniform distribution | torch.rand(*sizes) for [0, 1); torch.Tensor.uniform_(from, to) for [from, to) | nd.random.uniform(low, high, (5, 3)) |
Cauchy distribution | torch.Tensor.cauchy_(median, sigma) (in-place) | Not available |
Exponential distribution | torch.Tensor.exponential_(lambd) (in-place) | nd.random.exponential(scale, (5, 3)) |
Geometric distribution | torch.Tensor.geometric_(p) (in-place) | Not available |
Log-normal distribution | torch.Tensor.log_normal_(mean, std) (in-place) | Not available |
Bernoulli distribution | x.bernoulli() | Not available |
Description | PyTorch | MXNet Gluon |
---|---|---|
Matrix transpose | x.t() | x.T |
Matrix product with broadcasting | torch.matmul(x,y) | nd.dot(x,y) |
GEMM | torch.addmm(b,z,a,x,y) | nd.linalg.gemm(x,y,z,a,b) |
Batch matrix product | x.bmm(y) or torch.bmm(x, y) | nd.linalg.gemm2(x, y) |
Batch matrix product and accumulation | torch.addbmm(M, batch1, batch2) | nd.linalg.gemm(batch1, batch2, M): the leading n-2 dims are reduced |
Outer product of two vectors added to a matrix | m.addr(vec1, vec2) | Not available |
Cross product | x.cross(y) | Not available |
Create a diagonal tensor | x.diag(diagonal) | Not available |
Solve least squares & least norm | B.gels(A) | Not available |
Eigendecomposition of a symmetric matrix | e, v = a.symeig() | v, e = nd.linalg.syevd(a) |
Description | PyTorch | MXNet Gluon |
---|---|---|
Create on GPU 0 | torch.ones(5,3).cuda() | nd.ones((5,3), ctx=mx.gpu()) |
Context scope | with torch.cuda.device(1): torch.ones(5,3) | with mx.gpu(1): nd.ones((5,3)) |
Copy to GPU 1 (no change if already on GPU 1) | x.cuda(1) | x.as_in_context(mx.gpu(1)) |
Always copy | x.cuda(1) | x.copyto(mx.gpu(1)) |
Copy data from x to y (possibly on different GPUs) | y.copy_(x) | x.copyto(y) |
Convert a GPU Tensor/NDArray to a numpy array | x = y.numpy() | x = y.asnumpy() |
The autograd package of PyTorch/MXNet enables automatic differentiation of Tensors/NDArrays.
- PyTorch: construct the Tensor with requires_grad specifying whether gradients are needed:
x = torch.ones((1,), requires_grad=True)
y = x * 2
- MXNet: create the gradient buffer for an ndarray with attach_grad, and put the computation within the autograd.record() scope:
x = mx.nd.ones((1,))
x.attach_grad()
with mx.autograd.record():
    y = x * 2
Description | PyTorch | MXNet Gluon |
---|---|---|
Run backward | y.backward() | Same |
Access gradients | x.grad | Same |
Grad check | test = gradcheck(Linear.apply, input, eps=1e-6, atol=1e-4) | Not implemented yet for Gluon |
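Putting the pieces together, a minimal end-to-end sketch of the MXNet flow described above (record, backward, read the gradient):

import mxnet as mx

x = mx.nd.ones((1,))
x.attach_grad()                  # allocate a gradient buffer for x
with mx.autograd.record():       # record operations for differentiation
    y = x * 2
y.backward()                     # dy/dx = 2
print(x.grad)                    # [2.]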
Some operators (Dropout, BatchNorm, etc.) behave differently during training and prediction. In MXNet this can be controlled with the train_mode and predict_mode scopes. The pause scope is for code that does not need gradients to be calculated.
import mxnet as mx
from mxnet import autograd

x = mx.nd.ones((1,))
with autograd.train_mode():
    y = mx.nd.Dropout(x)
with autograd.predict_mode():
    z = mx.nd.Dropout(y)

w = mx.nd.ones((1,))
w.attach_grad()
with autograd.record():
    y = x * w
y.backward()
with autograd.pause():
    w += w.grad
MXNet uses lazy (asynchronous) evaluation to achieve better performance: the Python thread simply pushes operations into the backend engine and then returns. During training, a synchronization point is needed at the end of each batch, e.g. asnumpy(), wait_to_read(), or metric.update(...).
for (data, label) in train_data:
    with autograd.record():
        output = net(data)
        L = loss(output, label)
    L.backward()
    trainer.step(data.shape[0])
    metric.update([label], [output])
We can construct a neural network by creating its layers and defining the forward function.
- PyTorch: inherit from nn.Module:
class Net(nn.Module):
    def __init__(self, D_in, D_out):
        super(Net, self).__init__()
        self.linear = nn.Linear(D_in, D_out)
    def forward(self, x):
        return self.linear(x)

net = Net(2,4)
print(net(x))
- MXNet: inherit from nn.Block. Compared to PyTorch:
  - Layers defined in nn don't require an input shape, since it can be inferred later.
  - Layers need to be defined within the parent's name scope to give model parameters proper names, which are needed to save and load the parameters.
class Net(nn.Block):
    def __init__(self, D_out):
        super(Net, self).__init__()
        with self.name_scope():
            self.dense = nn.Dense(D_out)
    def forward(self, x):
        return self.dense(x)

net = Net(4)
print(net(x))
Both of them provide a sequential container to chain layers into a network:
- PyTorch:
net = nn.Sequential(
    nn.Linear(2,3),
    nn.ReLU(),
    nn.Linear(3,4)
)
- MXNet:
net = nn.Sequential()
with net.name_scope():
    net.add(
        nn.Dense(3, activation='relu'),
        nn.Dense(4)
    )
- PyTorch: needs to wrap the input data into a Variable:
x = torch.autograd.Variable(torch.ones(2,2))
y = net(x)
print(y)
- MXNet: needs to initialize the parameters explicitly:
net.initialize()
x = nd.ones((2,2))
y = net(x)
print(y)
- PyTorch:
x = torch.autograd.Variable(torch.ones(2,2))
z = torch.autograd.Variable(torch.ones(2,4)*2)
loss = nn.MSELoss()
net.zero_grad()
y = net(x)
L = loss(y, z)
L.backward()
print(net[0].weight.grad)
- propagates Variable to define the scope for backward
- uses the add semantic to store gradients, namely w.grad += cur_grad, so the gradients need to be explicitly zeroed before every forward pass
- MSELoss returns the average of the L2 losses over all examples, so L has size 1
- MXNet:
x = nd.ones((2,2))
z = nd.ones((2,4))*2
loss = gluon.loss.L2Loss()
net.initialize()
with mx.autograd.record():
    y = net(x)
    L = loss(y, z)
L.backward()
print(net[0].weight.grad())
- explicitly puts the computations that need to be differentiated inside the autograd.record() scope
- uses write by default to store gradients, namely w.grad = cur_grad, so calling zero_grad is not necessary unless we change to add, e.g. net[0].weight.grad_req = 'add'
- L2Loss applies the L2 loss to examples one by one, so L has size 2. Calling L.backward() is equivalent to summing L's elements and then calling backward. To behave the same as PyTorch's MSELoss, change to L = loss(y, z).mean(), as shown in the sketch below
Both PyTorch and Gluon define various neural network layers in the nn module. Compared to PyTorch, MXNet:
- adopts Keras-style layer definitions, while PyTorch's naming convention is closer to Chainer's
- again makes the input size of a layer optional
Description | PyTorch | Gluon |
---|---|---|
Sequential container | nn.Sequential(x,y) | nn.Sequential(x,y) |
Fully connected | nn.Linear(in, out) | nn.Dense(out) |
Activation | nn.ReLU() | nn.Activation('relu') |
3x3 2D convolution with stride 1 and padding 0 | nn.Conv2d(in, out, 3, 1, 0) | nn.Conv2D(out, 3, 1, 0) |
3x3 2D max pooling with stride 1 and padding 0 | nn.MaxPool2d(3, 1, 0) | nn.MaxPool2D(3, 1, 0) |
Adaptive max pooling with fixed output size | nn.AdaptiveMaxPool2d(4) | Not available |
Adaptive avg pooling with fixed output size | nn.AdaptiveAvgPool2d(4) | Not available |
Batch normalization | nn.BatchNorm2d(in) | nn.BatchNorm() |
Embedding | nn.Embedding(in, out) | nn.Embedding(in, out) |
1: supports applying an activation layer on the output by activation='relu'
PyTorch defines recurrent neural network layers in the nn module, while MXNet defines them in the rnn module.
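A minimal sketch of constructing and running a Gluon recurrent layer, assuming the default 'TNC' layout (sequence length 6, batch size 3, input dimension 10):

from mxnet import nd
from mxnet.gluon import rnn

layer = rnn.LSTM(20)                      # hidden dimension 20; input dimension is inferred
layer.initialize()
x = nd.random.uniform(shape=(6, 3, 10))   # (seq_len, batch, input_dim)
output = layer(x)                         # shape (6, 3, 20)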
Description | PyTorch | Gluon |
---|---|---|
Elman RNN with tanh activation, input dimension 10 and hidden dimension 20 | nn.RNN(10, 20, nonlinearity='tanh') | rnn.RNN(20, activation='tanh'), input dimension is not required |
Bidirectional LSTM with input dimension 10, hidden dimension 20 per direction (40 in total) | nn.LSTM(10, 20, bidirectional=True) | rnn.LSTM(20, bidirectional=True) |
GRU with output dropout rate 0.2 on each GRU layer except the last | nn.GRU(10, 20, dropout=0.2) | rnn.GRU(20, dropout=0.2) |
LSTM cell with input dimension 10 and hidden dimension 20 | nn.LSTMCell(10, 20) | rnn.LSTMCell(20) |
Same LSTM cell without bias | nn.LSTMCell(10, 20, bias=False) | Not available |
Enable zoneout with probability 0.2 on the recurrent state of cell | Not available | rnn.ZoneoutCell(cell, zoneout_states=0.2) |
Single-step unroll of cell with input x of shape (3, 10), hidden dimension 20, and beginning states h, c both of shape (3, 20) | h, c = cell(x, (h, c)) | Same |
Unroll cell for 6 steps, with input of shape (6, 3, 10), hidden dimension 20, and beginning states h, c both of shape (3, 20) | output = []; for i in range(6): h, c = cell(input[i], (h, c)); output.append(h) | output, last_state = cell.unroll(6, input, begin_state=(h, c), layout='TNC') |
Gluon's loss functions are in the gluon.loss
module.
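For example, a minimal sketch of evaluating a Gluon loss on hypothetical predictions and labels:

from mxnet import nd, gluon

loss_fn = gluon.loss.L2Loss()
pred = nd.array([[0.5], [1.5]])
label = nd.array([[1.0], [1.0]])
print(loss_fn(pred, label))      # one value per example: [0.125 0.125]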
Description | PyTorch | Gluon |
---|---|---|
Square loss | nn.MSELoss | gluon.loss.L2Loss |
L1 loss | nn.L1Loss | gluon.loss.L1Loss |
Cross-entropy loss with softmax | nn.CrossEntropyLoss | gluon.loss.SoftmaxCrossEntropyLoss |
Logistic loss | nn.SoftMarginLoss | gluon.loss.LogisticLoss |
Binary cross-entropy loss | nn.BCELoss | gluon.loss.SigmoidBCELoss(from_sigmoid=True) |
Binary cross-entropy loss with sigmoid | nn.BCEWithLogitsLoss | gluon.loss.SigmoidBCELoss |
The Kullback-Leibler divergence loss | nn.KLDivLoss | gluon.loss.KLDivLoss |
Huber loss | nn.SmoothL1Loss | gluon.loss.HuberLoss |
Hinge loss | nn.HingeEmbeddingLoss | gluon.loss.HingeLoss |
Squared hinge loss | Not available | gluon.loss.SquaredHingeLoss |
Triplet loss | nn.TripletMarginLoss | gluon.loss.TripletLoss |
Connectionist Temporal Classification (CTC) loss | Not available | gluon.loss.CTCLoss |
Negative log likelihood loss | nn.NLLLoss | Not available |
Negative log likelihood loss with Poisson distribution of target | nn.PoissonNLLLoss | Not available |
Margin ranking loss | nn.MarginRankingLoss | Not available |
Multi-class hinge loss with 2D target | nn.MultiLabelMarginLoss | Not available |
Multi-class classification hinge loss with 1D target | nn.MultiMarginLoss | Not available |
Loss for multi-label one-versus-all classification based on max-entropy | nn.MultiLabelSoftMarginLoss | Not available |
Cosine loss | nn.CosineEmbeddingLoss | Not available |
PyTorch has an nn.functional module providing low-level functions, for example:
import torch.nn.functional as F
x = torch.arange(-5,5)
y = F.relu(torch.autograd.Variable(x))
The nd package in MXNet serves a similar purpose:
import mxnet.nd as F
x = F.arange(-5,5)
y = F.relu(x)
Gluon's Parameter differs from PyTorch's in two ways:
- it has a string name (necessary for switching to symbolic execution)
- it can be created with a shape and an initialization method but without the actual data, which enables lazy initialization, namely creating layers whose input size is determined later (see the sketch below)
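A minimal sketch of lazy initialization (the layer and input sizes here are arbitrary):

from mxnet import nd
from mxnet.gluon import nn

layer = nn.Dense(4)                  # no input size specified
layer.initialize()                   # parameters are not allocated yet
y = layer(nd.ones((2, 3)))           # the first forward pass infers the input size (3)
print(layer.weight.data().shape)     # (4, 3)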
Description | PyTorch | MXNet |
---|---|---|
Create | w = torch.nn.Parameter(torch.ones(2,3)) | w = gluon.Parameter('w', shape=(2,3), init=mx.init.One()) w.initialize() |
Get data | w.data | w.data() |
Get gradient | w.grad | w.grad() |
Write data | w.data -= 1 w.data[:] = 2 | w.data() -= 1 w.data()[:] = 2 |
Change to a random initialization | nn.init.uniform(w) | w.initialize(init=mx.init.Uniform(), force_reinit=True) |
Move to GPU 1 | w = w.cuda(1) | w.reset_ctx(mx.gpu(1)) |
Description | PyTorch | MXNet |
---|---|---|
Get all parameters | params = net.named_parameters() | params = net.collect_params() |
Query parameters by regex | Not available | params = net.collect_params(regex) |
Initialization | Done when constructing net | net.initialize(init=mx.init.Uniform()) |
Use a different init method | for w in params: nn.init.normal(w) | params.initialize(init=mx.init.Normal(), force_reinit=True) |
Get a layer's parameter | net.linear.weight | net.dense.weight |
Initialize a layer differently | nn.init.normal(net.linear.weight) | net.dense.weight.initialize(init=mx.init.Normal(), force_reinit=True) |
Specify a layer's initializer | layer = torch.nn.Linear(20, 10) torch.nn.init.normal(layer.weight, 0, 0.01) | layer = mx.gluon.nn.Dense(10, weight_initializer=mx.init.Normal(0.01)) |
Create a parameter that doesn't update | Variable(torch.ones(5, 5), requires_grad=False) | self.params.get('weight', shape=(5, 5), init='ones', grad_req='null') |
Stop a parameter from updating | weight = weight.detach() before forward | net.dense.weight.grad_req = 'null', can be called anywhere |
Move weights to GPU 1 | net.cuda(1) | params.reset_ctx(mx.gpu(1)) |
Set all grads to 0 | net.zero_grad() | params.zero_grad() |
Description | PyTorch | MXNet |
---|---|---|
Uniform in [-1,1] | nn.init.uniform(w,-1,1) | mx.init.Uniform(1) |
Normal with 0 mean and 1 std | nn.init.normal(w, 0, 1) | mx.init.Normal(1) |
Xavier with uniform dist | nn.init.xavier_uniform(w, gain=3) | mx.init.Xavier(magnitude=3) |
Constant | nn.init.constant(w, 2) | mx.init.Constant(2) |
Serialization and loading of model parameters are achieved by calling save_params
and load_params
. Computation graph serialization is done by calling export
.
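A minimal sketch under these assumptions: net is an initialized (and, for export, hybridized) Gluon model, x is a sample input, and the file names are arbitrary:

import mxnet as mx

net.save_params('net.params')                  # save parameters to disk
net.load_params('net.params', ctx=mx.cpu())    # load them back onto the CPU

net.hybridize()
net(x)                                         # one forward pass so the graph is cached
net.export('model', epoch=0)                   # writes model-symbol.json and model-0000.params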
Class | PyTorch | MXNet Gluon |
---|---|---|
Save model parameters | torch.save(the_model.state_dict(), filename) | model.save_params(filename) |
Load parameters | the_model.load_state_dict(torch.load(PATH)) | model.load_params(filename, ctx, allow_missing=False, ignore_extra=False) |
Export computation graph and parameters | torch.onnx.export(model, dummy_input, 'model.proto') | model.export('./model', epoch=0) |
MXNet also supports dumping the network to disk through hybridization; see this tutorial.
HybridBlock supports forwarding with both Symbol and NDArray. Once hybridized, a HybridBlock creates a symbolic graph representing the forward computation and caches it. Most of the built-in blocks (Dense, Conv2D, MaxPool2D, BatchNorm, etc.) are HybridBlocks. Instead of explicitly declaring the number of inputs to a layer, we can simply state the number of outputs; the input shape is inferred on the fly once the network is provided with some input.
Function | PyTorch | MXNet Gluon |
---|---|---|
Hybridize a network whose input shape is not yet known | Not available | net = mx.gluon.nn.HybridSequential() with net.name_scope(): net.add(mx.gluon.nn.Dense(10)) net.hybridize() |
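A minimal sketch of a custom HybridBlock along these lines (the layer name and sizes are arbitrary):

from mxnet import nd
from mxnet.gluon import nn

class Net(nn.HybridBlock):
    def __init__(self, **kwargs):
        super(Net, self).__init__(**kwargs)
        with self.name_scope():
            self.dense = nn.Dense(10)

    def hybrid_forward(self, F, x):
        # F is mxnet.nd before hybridize() and mxnet.sym afterwards
        return F.relu(self.dense(x))

net = Net()
net.initialize()
net.hybridize()                  # a symbolic graph is cached on the next forward pass
print(net(nd.ones((2, 4))))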
SymbolBlock can construct a block from a symbol. This is useful for using pre-trained models as feature extractors.
Function | PyTorch | MXNet Gluon |
---|---|---|
SymbolBlock | Not Available | alexnet = mx.gluon.model_zoo.vision.alexnet(pretrained=True, prefix='model_') out = alexnet(inputs) internals = out.get_internals() outputs = [internals['model_dense0_relu_fwd_output']] feat_model = gluon.SymbolBlock(outputs, inputs, params=alexnet.collect_params()) |
- PyTorch:
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
for x, y in data:
    optimizer.zero_grad()
    L = loss(net(x), y)
    L.backward()
    optimizer.step()
- MXNet:
optimizer = mx.optimizer.SGD(learning_rate=0.01, momentum=0.9)
trainer = gluon.Trainer(net.collect_params(), optimizer)
for x, y in data:
    with mx.autograd.record():
        L = loss(net(x), y)
    L.backward()
    trainer.step(x.shape[0])  # batch size
zero_grad in the optimizer (PyTorch) or Trainer (Gluon) clears the gradients of all parameters. In Gluon there is no need to clear the gradients every batch if grad_req = 'write' (the default).
Description | PyTorch | MXNet |
---|---|---|
SGD | optim.SGD | optimizer.SGD |
SGD with Nesterov momentum | optim.SGD(nesterov=True) | optimizer.NAG |
Averaged SGD | optim.ASGD | Not available |
Adam | optim.Adam | optimizer.Adam |
Sparse Adam | optim.SparseAdam | optimizer.Adam(lazy_update=True) |
AdaGrad | optim.Adagrad | optimizer.AdaGrad |
RMSProp | optim.RMSprop | optimizer.RMSProp |
Rprop | optim.Rprop | Not available |
AdaDelta | optim.Adadelta | optimizer.AdaDelta |
Adamax | optim.Adamax | optimizer.Adamax |
L-BFGS | optim.LBFGS | Not available |
Ftrl | Not available | optimizer.Ftrl |
Nesterov Adam optimizer | Not available | optimizer.Nadam |
Signum | Not available | optimizer.Signum |
Follow the Moving Leader | Not available | optimizer.FTML |
Large Batch SGD | Not available | optimizer.LBSGD |
Asynchronous SGD with delay compensation | Not available | optimizer.DCASGD |
Stochastic Gradient Riemannian Langevin Dynamics | Not available | optimizer.SGLD |
Function | PyTorch | MXNet Gluon |
---|---|---|
Data parallelism | net = torch.nn.DataParallel(model, device_ids=[0, 1, 2]) output = net(data) | ctx = [mx.gpu(i) for i in range(3)] data = gluon.utils.split_and_load(data, ctx) label = gluon.utils.split_and_load(label, ctx) with autograd.record(): losses = [loss(net(X), Y) for X, Y in zip(data, label)] for l in losses: l.backward() |
Function | PyTorch | MXNet Gluon |
---|---|---|
Distributed data parallelism | torch.distributed.init_process_group(...) model = torch.nn.parallel.DistributedDataParallel(model, ...) | store = kv.create('dist') trainer = gluon.Trainer(net.collect_params(), ..., kvstore=store) |
Gluon provides several predefined metrics for online evaluation of a learned model's performance.
Function | PyTorch | MXNet Gluon |
---|---|---|
Metric | Not available | metric = mx.metric.Accuracy() with autograd.record(): output = net(data) L = loss(output, label) L.backward() trainer.step(batch_size) metric.update(label, output) |
tensorboardX (PyTorch) and dmlc-tensorboard (Gluon) can be used to visualize your network and plot quantitative metrics about the execution of your graph.
Function | PyTorch | MXNet Gluon |
---|---|---|
visualization | writer = tensorboardX.SummaryWriter() ... for name, param in model.named_parameters(): grad = param.clone().cpu().data.numpy() writer.add_histogram(name, grad, n_iter) ... writer.close() | summary_writer = tensorboard.FileWriter('./logs/') ... for name, param in net.collect_params(): grad = param.grad.asnumpy().flatten() s = tensorboard.summary.histogram(name, grad) summary_writer.add_summary(s) ... summary_writer.close() |
Besides the imperative interface, MXNet offers a symbolic programming interface, which is somewhat similar to TensorFlow and Keras. See module tutorials.
Dataset
and DataLoader
are the basic components for loading data.
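A minimal sketch of this pipeline, assuming small in-memory numpy arrays:

import numpy as np
from mxnet import gluon

X = np.random.uniform(size=(10, 3)).astype('float32')
y = np.random.randint(0, 2, size=(10,)).astype('float32')

dataset = gluon.data.ArrayDataset(X, y)
loader = gluon.data.DataLoader(dataset, batch_size=4, shuffle=True)
for data, label in loader:
    print(data.shape, label.shape)    # e.g. (4, 3) and (4,)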
Class | PyTorch | MXNet Gluon |
---|---|---|
Dataset holding arrays | torch.utils.data.TensorDataset(data_tensor, label_tensor) | gluon.data.ArrayDataset(data_array, label_array) |
Data loader | torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False, sampler=None, batch_sampler=None, num_workers=0, collate_fn=<function default_collate>, drop_last=False) | gluon.data.DataLoader(dataset, batch_size=None, shuffle=False, sampler=None, last_batch='keep', batch_sampler=None, batchify_fn=None, num_workers=0) |
Sequentially applied sampler | torch.utils.data.sampler.SequentialSampler(data_source) | gluon.data.SequentialSampler(length) |
Random order sampler | torch.utils.data.sampler.RandomSampler(data_source) | gluon.data.RandomSampler(length) |
Some commonly used computer vision datasets are provided in the mx.gluon.data.vision package.
Class | PyTorch | MXNet Gluon |
---|---|---|
MNIST handwritten digits dataset | torchvision.datasets.MNIST | mx.gluon.data.vision.MNIST |
CIFAR10 dataset | torchvision.datasets.CIFAR10 | mx.gluon.data.vision.CIFAR10 |
CIFAR100 dataset | torchvision.datasets.CIFAR100 | mx.gluon.data.vision.CIFAR100 |
A generic data loader where the images are arranged in folders | torchvision.datasets.ImageFolder(root, transform=None, target_transform=None, loader=<function default_loader>) | mx.gluon.data.vision.ImageFolderDataset(root, flag, transform=None) |