MXNet Tutorial (2019)

MXNet Complete Tutorial


The MXNet framework is a very robust and versatile platform for a variety of DL systems. Although we demonstrated its functionality in Python, it is equally powerful when used with other programming languages.


In addition, the Gluon interface is a useful add-on. If you are new to DL applications, we recommend you use Gluon as your go-to tool when employing the MXNet framework. This doesn’t mean that the framework itself is limited to Gluon, though, since the MXNet package is versatile and robust in a variety of programming platforms.


Moreover, in this tutorial, we covered just the basics of MXNet and Gluon. Going through all the details of these robust systems would take a whole blog! Learn more about the details of the Gluon interface in the Straight Dope tutorial, which is part of the MXNet documentation.


Finally, the examples in this blog are executed in a Docker container; as such, you may experience some lag. When developing a DL system on a computer cluster, of course, things run significantly faster.


MXNet in action

Now let’s take a look at what we can do with MXNet, using Python, on a Docker image with all the necessary software already installed. We’ll begin with a brief description of the datasets we’ll use, and then proceed to a couple of specific DL applications using that data (namely classification and regression).


Upon mastering these, you can explore some more advanced DL systems of this framework on your own.


Datasets description

In this section, we'll introduce the two synthetic datasets we prepared: one for classification and one for regression.


The reason we use synthetic datasets in these exercises is to maximize our understanding of the data, so that we can evaluate the results of the DL systems independently of data quality.


The first dataset comprises 4 variables: 3 features and 1 label. With 250,000 data points, it is adequately large for a DL network to work with, while its small dimensionality makes it ideal for visualization. It is also made to have a great deal of non-linearity, making it a good challenge for any data model (though not too hard for a DL system).


Furthermore, classes 2 and 3 of this dataset are close enough to be confusing, but still distinct. This makes them a good option for a clustering application, as we’ll see later.


The second dataset is somewhat larger, comprising 21 variables—20 of which are the features used to predict the last, which is the target variable. With 250,000 data points, again, it is ideal for a DL system. Note that only 10 of the 20 features are relevant to the target variable (which is a combination of these 10).


A bit of noise is added to the data to make the whole problem a bit more challenging. The remaining 10 features are just random data that must be filtered out by the DL model.


Relevant or not, this dataset has enough features altogether to render a dimensionality reduction application worthwhile. Naturally, due to its dimensionality, we cannot plot this dataset.
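Since the exact generation process of these datasets isn't shown in this blog, here is a rough NumPy sketch of how a dataset like the second one could be created; the weights, noise level, and sample size are illustrative placeholders:

```python
import numpy as np

np.random.seed(42)
ndp = 1000               # number of data points (250,000 in the actual dataset)
nf_rel, nf_irr = 10, 10  # relevant and irrelevant features

X_rel = np.random.randn(ndp, nf_rel)           # features that drive the target
X_irr = np.random.randn(ndp, nf_irr)           # pure-noise features to be filtered out
w = np.random.uniform(-2, 2, nf_rel)           # hypothetical combination weights
y = X_rel.dot(w) + 0.1 * np.random.randn(ndp)  # target = combination of the 10 relevant features, plus a bit of noise

data = np.column_stack([X_rel, X_irr, y])      # 21 columns in total, as in the second dataset
print(data.shape)  # (1000, 21)
```

The key property this sketch reproduces is that the target depends only on the first 10 features, so a model must learn to ignore the other 10.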


Loading a dataset into an NDArray

Let’s now take a look at how we can load a dataset in MXNet so that we can process it with a DL model later on. First, let’s start with setting some parameters:

data_ctx = mx.cpu() # assign context of the data used

BatchSize = 64 # batch size parameter for the data loader objects

r = 0.8 # ratio of training data

nf = 3 # number of features in the dataset (for the classification problem)

Now, we can import the data as we’d normally do in a conventional DS project, but this time store it in NDArrays instead of Pandas or NumPy arrays:

with open("../data/data1.csv") as f:
    data_raw = f.read()

lines = data_raw.splitlines() # split the data into separate lines
ndp = len(lines) # number of data points
X = nd.zeros((ndp, nf), ctx=data_ctx)
Y = nd.zeros((ndp, 1), ctx=data_ctx)

for i, line in enumerate(lines):
    tokens = line.split()
    Y[i] = int(tokens[0])
    for token in tokens[1:]:
        index = int(token[:-2]) - 1
        X[i, index] = 1


Now we can split the data into a training set and a testing set so that we can use it both to build and to validate our classification model:

import numpy as np # we'll be needing this package as well

data = nd.concat(X, Y, dim=1) # combine features and labels into a single array
n = int(np.round(ndp * r)) # number of training data points
train = data[:n, :] # training set partition
test = data[n:, :] # testing set partition

data_train = gluon.data.DataLoader(gluon.data.ArrayDataset(train[:, :3], train[:, 3]), batch_size=BatchSize, shuffle=True)
data_test = gluon.data.DataLoader(gluon.data.ArrayDataset(test[:, :3], test[:, 3]), batch_size=BatchSize, shuffle=False)



We'll then need to repeat the same process to load the second dataset, this time using data2.csv as the source file. Also, to avoid confusion with the data loader objects of dataset 1, you can name the new data loaders data_train2 and data_test2, respectively.
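As a rough illustration of that process, the sketch below parses CSV text and splits it 80/20 with NumPy; the fabricated rows stand in for data2.csv, and in the actual tutorial the resulting arrays would be stored in NDArrays and wrapped in data loaders named data_train2 and data_test2:

```python
import numpy as np

# Hypothetical stand-in for the contents of data2.csv; we fabricate a few
# rows here so the snippet is self-contained.
csv_text = "\n".join(",".join(str(v) for v in row)
                     for row in np.random.randn(100, 21).round(4))

rows = [list(map(float, line.split(","))) for line in csv_text.splitlines()]
data2 = np.array(rows)

r = 0.8
n = int(round(len(data2) * r))
train2, test2 = data2[:n], data2[n:]  # 80/20 split, as with the first dataset

# The features are the first 20 columns; the target is the last one.
X_train2, y_train2 = train2[:, :20], train2[:, 20]
print(X_train2.shape, y_train2.shape)  # (80, 20) (80,)
```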



Now let’s explore how we can use this data to build an MLP system that can discern the different classes within the data we have prepared. For starters, let’s see how to do this using the MXNet package on its own; then we’ll examine how the same thing can be achieved using Gluon.

First, let’s define some constants that we’ll use later to build, train, and test the MLP network:

nhn = 256 # number of hidden nodes for each layer

WeightScale = 0.01 # scale multiplier for weights

ModelCtx = mx.cpu() # assign context of the model itself

no = 3 # number of outputs (classes)

ne = 10 # number of epochs (for training)

lr = 0.001 # learning rate (for training)

sc = 0.01 # smoothing constant (for training)

ns = test.shape[0] # number of samples (for testing)


Next, let’s initialize the network’s parameters (weights and biases) for the first layer:

W1 = nd.random_normal(shape=(nf, nhn), scale=WeightScale, ctx=ModelCtx)

b1 = nd.random_normal(shape=nhn, scale=WeightScale, ctx=ModelCtx)

And do the same for the second layer:

W2 = nd.random_normal(shape=(nhn, nhn), scale=WeightScale, ctx=ModelCtx)

b2 = nd.random_normal(shape=nhn, scale=WeightScale, ctx=ModelCtx)

Then let’s initialize the output layer and aggregate all the parameters into a single data structure called params:

W3 = nd.random_normal(shape=(nhn, no), scale=WeightScale, ctx=ModelCtx)

b3 = nd.random_normal(shape=no, scale=WeightScale, ctx=ModelCtx)

params = [W1, b1, W2, b2, W3, b3]

Finally, let’s allocate some space for a gradient for each one of these parameters:

for param in params:
    param.attach_grad()

Remember that without any non-linear functions in the MLP’s neurons, the whole system would be too rudimentary to be useful. We’ll make use of the ReLU and the Softmax functions as activation functions for our system:

def relu(X):
    return nd.maximum(X, nd.zeros_like(X))

def softmax(y_linear):
    exp = nd.exp(y_linear - nd.max(y_linear))
    partition = nd.nansum(exp, axis=0, exclude=True).reshape((-1, 1))
    return exp / partition


Note that the Softmax function will be used in the output neurons, while the ReLU function will be used in all the remaining neurons of the network.
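To build some intuition about these two functions, here are NumPy equivalents (for illustration only; the network itself uses the nd versions defined above), showing that ReLU clips negative values to zero while the Softmax outputs always sum to 1:

```python
import numpy as np

def relu_np(x):
    # Element-wise max against zero: negatives are clipped to 0
    return np.maximum(x, 0)

def softmax_np(y_linear):
    # Subtracting the max first keeps the exponentials numerically stable
    exp = np.exp(y_linear - np.max(y_linear))
    return exp / exp.sum()

scores = np.array([-1.0, 0.5, 2.0])
print(relu_np(scores))                      # negative entry becomes 0
print(round(softmax_np(scores).sum(), 6))   # 1.0
```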


For the cost function of the network (or, in other words, the fitness function of the optimization method under the hood), we’ll use the cross-entropy function:

def cross_entropy(yhat, y):
    return -nd.nansum(y * nd.log(yhat), axis=0, exclude=True)


To make the whole system a bit more efficient, we can combine the softmax and the cross-entropy functions into one, as follows:

def softmax_cross_entropy(yhat_linear, y):
    return -nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)

After all this, we can now define the function of the whole neural network, based on the above architecture:

def net(X):
    h1_linear = nd.dot(X, W1) + b1
    h1 = relu(h1_linear)
    h2_linear = nd.dot(h1, W2) + b2
    h2 = relu(h2_linear)
    yhat_linear = nd.dot(h2, W3) + b3
    return yhat_linear

The optimization method for training the system must also be defined. In this case, we'll utilize a form of Stochastic Gradient Descent:

def SGD(params, lr):
    for param in params:
        param[:] = param - lr * param.grad

For the purposes of this example, we’ll use a simple evaluation metric for the model: accuracy rate. Of course, this needs to be defined first:

def evaluate_accuracy(data_iterator, net):
    numerator = 0.
    denominator = 0.
    for i, (data, label) in enumerate(data_iterator):
        data = data.as_in_context(ModelCtx).reshape((-1, nf))
        label = label.as_in_context(ModelCtx)
        output = net(data)
        predictions = nd.argmax(output, axis=1)
        numerator += nd.sum(predictions == label)
        denominator += data.shape[0]
    return (numerator / denominator).asscalar()


Now we can train the system as follows:

for e in range(ne):
    cumulative_loss = 0
    for i, (data, label) in enumerate(data_train):
        data = data.as_in_context(ModelCtx).reshape((-1, nf))
        label = label.as_in_context(ModelCtx)
        label_one_hot = nd.one_hot(label, no)
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label_one_hot)
        loss.backward()
        SGD(params, lr)
        cumulative_loss += nd.sum(loss).asscalar()
    test_accuracy = evaluate_accuracy(data_test, net)
    train_accuracy = evaluate_accuracy(data_train, net)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" %
          (e, cumulative_loss / n, train_accuracy, test_accuracy))

Finally, we can use the system to make some predictions, using the following code:

def model_predict(net, data):
    output = net(data)
    return nd.argmax(output, axis=1)

SampleData = gluon.data.DataLoader(gluon.data.ArrayDataset(test[:, :3], test[:, 3]), ns, shuffle=True)

for i, (data, label) in enumerate(SampleData):
    data = data.as_in_context(ModelCtx)
    pred = model_predict(net, data)
    print('model predictions are:', pred)
    print('true labels          :', label)
    break



If you run the above code (preferably in the Docker environment provided), you will see that this simple MLP system does a good job at predicting the classes of some unknown data points—even if the class boundaries are highly non-linear.


Experiment with this system more and see how you can improve its performance even further, using the MXNet framework.


Now we'll see how we can significantly simplify all this by employing the Gluon interface. First, let's define a Python class to cover some common cases of Multi-Layer Perceptrons, extending the gluon.Block object into something that can be leveraged to gradually build a neural network consisting of multiple layers (i.e., an MLP):

class MLP(gluon.Block):
    def __init__(self, **kwargs):
        super(MLP, self).__init__(**kwargs)
        with self.name_scope():
            self.dense0 = gluon.nn.Dense(64) # architecture of 1st layer (hidden)
            self.dense1 = gluon.nn.Dense(64) # architecture of 2nd layer (hidden)
            self.dense2 = gluon.nn.Dense(3) # architecture of 3rd layer (output)

    def forward(self, x): # process the data (x) by passing it forward (towards the output layer)
        x = nd.relu(self.dense0(x)) # outputs of first hidden layer
        x = nd.relu(self.dense1(x)) # outputs of second hidden layer
        x = self.dense2(x) # outputs of final layer (output)
        return x


Of course, this is just an example of how you can define an MLP using Gluon, not a one-size-fits-all kind of solution. You may want to define the MLP class differently since the architecture you use will have an impact on the system’s performance. 


However, if you find what follows too challenging, and you don’t have the time to assimilate the theory behind DL systems, you can use an MLP object like the one above for your project.


Since DL systems are rarely as compact as the MLP above, and since we often need to add more layers (which would be cumbersome in the above approach), it is common to use a different class called Sequential.


After we define the number of neurons in each hidden layer and specify the activation function for these neurons, we can build an MLP like a ladder, with each step representing one layer in the MLP:

nhn = 64 # number of hidden neurons (in each layer)
af = "relu" # activation function to be used in each neuron

net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(nhn, activation=af))
    net.add(gluon.nn.Dense(nhn, activation=af))
    net.add(gluon.nn.Dense(3)) # output layer: one neuron per class


This takes care of the architecture for us. To make the above network functional, we’ll first need to initialize it:

sigma = 0.1 # sigma value for the distribution of the weights of the ANN connections
ModelCtx = mx.cpu()
lr = 0.01 # learning rate
oa = 'sgd' # optimization algorithm
ne = 10 # number of epochs for training

net.collect_params().initialize(mx.init.Normal(sigma=sigma), ctx=ModelCtx)
softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()
trainer = gluon.Trainer(net.collect_params(), oa, {'learning_rate': lr})


Next, we must define how we assess the network's progress through an evaluation metric function. For simplicity, we'll use the standard accuracy rate metric:

def AccuracyEvaluation(iterator, net):
    acc = mx.metric.Accuracy()
    for i, (data, label) in enumerate(iterator):
        data = data.as_in_context(ModelCtx).reshape((-1, 3))
        label = label.as_in_context(ModelCtx)
        output = net(data)
        predictions = nd.argmax(output, axis=1)
        acc.update(preds=predictions, labels=label)
    return acc.get()


Finally, it’s time to train and test the MLP, using the aforementioned settings:

for e in range(ne):
    cumulative_loss = 0
    for i, (data, label) in enumerate(data_train):
        data = data.as_in_context(ModelCtx).reshape((-1, 3))
        label = label.as_in_context(ModelCtx)
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(data.shape[0])
        cumulative_loss += nd.sum(loss).asscalar()
    train_accuracy = AccuracyEvaluation(data_train, net)
    test_accuracy = AccuracyEvaluation(data_test, net)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" %
          (e, cumulative_loss / n, train_accuracy, test_accuracy))

Running the above code should yield similar results to those from conventional MXNet commands.

To make things easier, we’ll rely on the Gluon interface in the example that follows. Nevertheless, we still recommend that you experiment with the standard MXNet functions afterward, should you wish to develop your own architectures (or better understand the theory behind DL).


Creating checkpoints for models developed in MXNet

As training a system may take some time, the ability to save and load DL models and data through this framework is essential. We must create “checkpoints” in our work so that we can pick up from where we’ve stopped, without having to recreate a network from scratch every time. This is achieved through the following process.


First import all the necessary packages and classes, and then define the context parameter:

import mxnet as mx
from mxnet import nd, autograd, gluon
import os

ctx = mx.cpu() # context for NDArrays

We’ll then save the data, but let’s put some of it into a dictionary first:

data_dict = {"X": X, "Y": Y} # named data_dict so we don't shadow Python's built-in dict

Now we'll set the name of the file and save it:

filename = "test.dat"
nd.save(filename, data_dict)

We can verify that everything has been saved properly by loading that checkpoint as follows:

Z = nd.load(filename)



When using Gluon, there is a shortcut for saving all the parameters of the DL network we have developed. It involves the save_params() function:

filename = "MyNet.params"
net.save_params(filename)


To restore the DL network, however, you’ll need to recreate the original network’s architecture, and then load the original network’s parameters from the corresponding file:

net2 = gluon.nn.Sequential()
with net2.name_scope():
    net2.add(gluon.nn.Dense(nhn, activation="relu"))
    net2.add(gluon.nn.Dense(nhn, activation="relu"))
    net2.add(gluon.nn.Dense(3)) # output layer, matching the original network

net2.load_params(filename, ctx=ctx)


It's best to save your work at different parts of the pipeline and give the checkpoint files descriptive names. It is also important to keep in mind that there is no "untraining" option, and it is likely that the optimal performance occurs before the completion of the training phase.


Because of this, we may want to create a checkpoint after each training epoch, so that we can revert to the best one once we find out at which point the optimal performance was achieved.
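The bookkeeping for such per-epoch checkpoints might look like the sketch below; the validation accuracies are simulated, and the commented-out net.save_params(fname) line marks where the actual checkpoint call would go:

```python
# Sketch of per-epoch checkpointing logic. In the real pipeline, the call
# inside the loop would be net.save_params(fname); here we simulate the
# accuracies and just track the file names, to show the bookkeeping.
ne = 10
val_acc_history = {}
simulated_accuracies = [0.61, 0.68, 0.73, 0.77, 0.80,
                        0.82, 0.81, 0.79, 0.78, 0.77]

for e in range(ne):
    fname = "MyNet_epoch%02d.params" % e  # descriptive, epoch-stamped name
    # net.save_params(fname)              # <- actual checkpoint call
    val_acc_history[fname] = simulated_accuracies[e]

# After training, revert to the checkpoint with the best validation accuracy
best_checkpoint = max(val_acc_history, key=val_acc_history.get)
print(best_checkpoint)  # MyNet_epoch05.params
```

Note how the simulated accuracies peak mid-training, which is exactly the scenario that makes per-epoch checkpoints worthwhile.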


Moreover, for the computer to make sense of these files when you load them into your programming environment, you'll need to have MXNet's nd module loaded, in whatever programming language you are using.



MXNet is a deep learning framework developed by Apache. It exhibits ease of use, flexibility, and high speed, among other perks. All of this makes MXNet an attractive option for DL, in a variety of programming languages, including Python, Julia, Scala, and R.


MXNet models can be deployed to all kinds of computing systems, including smart devices. This is achieved by exporting them as a single file, to be executed by these devices.


Gluon is a package that provides a simple interface for all your DL work using MXNet. Its main benefits include ease of use, no significant overhead, ability to handle dynamic graphs for your ANN models, and flexibility.


NDArrays are useful data structures when working with the MXNet framework. They can be imported from the MXNet package as the nd module. They are similar to NumPy arrays, but more versatile and efficient when it comes to DL applications.


The MXNet package is Python’s API for the MXNet framework and contains a variety of modules for building and using DL systems.

Data can be loaded into MXNet by reading it into an NDArray directly from the data file, and then creating a data loader object to feed the data into the model built afterward.


Classification in MXNet involves creating an MLP (or some other DL network), training it, and using it to predict unknown data, allocating one neuron for every class in the dataset. Classification is significantly simpler when using Gluon.


Regression in MXNet is like classification, but the output layer has a single neuron. Also, additional care must be taken so that the system doesn’t overfit; therefore we often use some regularization function such as L2.


Creating project checkpoints in MXNet involves saving the model and any other relevant data into NDArrays so that you can retrieve them at another time. This is also useful for sharing your work with others, for reviewing purposes.

Remember that MXNet generally runs faster than it does on the Docker container used in this blog's examples, and that it is equally useful and robust in other programming languages.