Deep learning (Best Tutorial 2019)


Deep learning is arguably the most popular aspect of AI, especially when it comes to data science (DS) applications. But what exactly are deep learning frameworks, and how are they related to other terms often used in AI and data science?


In this context, “framework” refers to a set of tools and processes for developing a certain system, testing it, and ultimately deploying it. Most AI systems today are created using frameworks. When developers download and install a framework on their computer, it is usually accompanied by a library.


This library (or package, as it is often termed in high-level languages) will be compiled in the programming languages supported by the AI framework. The library acts as a proxy to the framework, making its various processes available through a series of functions and classes in the programming language used.


This way, you can do everything the framework enables you to do, without leaving the programming environment where you have the rest of your scripts and data. So, for all practical purposes, that library is the framework, even if the framework can manifest in other programming languages too.


This way, a framework supported by both Python and Julia can be accessed through either one of these languages, making the language you use a matter of preference.


Since enabling a framework to function in a different language is a challenging task for the creators of the framework, oftentimes the options they provide for the languages compatible with that framework are rather limited.


But what is a system, exactly? In a nutshell, a system is a standalone program or script designed to accomplish a certain task or set of tasks. In a data science setting, a system often corresponds to a data model. However, systems can include features beyond just models, such as an I/O process or a data transformation process.


The term model involves a mathematical abstraction used to represent a real-world situation in a simpler, more workable manner. Models in DS are optimized through a process called training and validated through a process called testing before they are deployed.


Another term that often appears alongside these terms is methodology, which refers to a set of methods and the theory behind those methods, for solving a particular type of problem in a certain field. Different methodologies are often geared toward different applications/objectives.


It’s easy to see why frameworks are celebrities of sorts in the AI world. They help make the modeling aspect of the pipeline faster, and they make the data engineering demanded by deep learning models significantly easier.


This makes AI frameworks great for companies that cannot afford a whole team of data scientists, or prefer to empower and develop the data scientists they already have.


These systems are fairly simple, but not quite “plug and play.” In this blog, we’ll explore the utility behind deep learning models, their key characteristics, how they are used, their main applications, and the methodologies they support.


About deep learning systems

Deep Learning (DL) is a subset of AI that is used for predictive analytics, using an AI system called an Artificial Neural Network (ANN). Predictive analytics is a group of data science methodologies that are related to the prediction of certain variables. This includes various techniques such as classification, regression, etc.


As for an ANN, it is a clever abstraction of the human brain, at a much smaller scale. ANNs have proven able to approximate a wide variety of functions (mappings), making them well-suited to most data analytics tasks. In data science, ANNs are categorized as machine learning methodologies.


The main drawback DL systems have is that they are “black boxes.” It is exceedingly difficult – practically unfeasible – to figure out exactly how their predictions happen, as the data flux in them is extremely complicated.


Deep Learning generally involves large ANNs that are often specialized for specific tasks. Convolutional Neural Networks (CNNs), for instance, are better suited to processing images, video, and audio data streams.


However, all DL systems share a similar structure. This involves elementary modules called neurons organized in layers, with various connections among them.


These modules can perform some basic transformations (usually non-linear ones) as data passes through them.


Since there are a plethora of potential connections among these neurons, organizing them in a structured way (much like real neurons are organized in brain tissue) yields a more robust and functional form of these modules. This is what an artificial neural network is, in a nutshell.


In general, DL frameworks include tools for building a DL system, methods for testing it, and various other Extract, Transform, and Load (ETL) processes; when taken together, these framework components help you seamlessly integrate DL systems with the rest of your pipeline. We’ll look at this in more detail later in this blog.


Although deep learning systems share some similarities with machine learning systems, certain characteristics make them sufficiently distinct. For example, conventional machine learning systems tend to be simpler and have fewer options for training.


DL systems are noticeably more sophisticated; they each have a set of training algorithms, along with several parameters regarding the systems’ architecture. This is one of the reasons we consider them a distinct framework in data science.


DL systems also tend to be more autonomous than their machine learning counterparts. To some extent, DL systems can do their own feature engineering. More conventional systems tend to require more fine-tuning of the feature set, and sometimes require dimensionality reduction to provide any decent results.


In addition, the generalization of conventional ML systems generally doesn’t improve as much as that of DL systems when additional data is provided. This is also one of the key characteristics that make DL systems the preferable option when big data is involved.


Finally, DL systems take longer to train and require more computational resources than conventional ML systems, due to their more sophisticated functionality. However, as the work of DL systems is easily parallelizable, modern computing architectures, as well as cloud computing, benefit DL systems the most, compared to other predictive analytics systems.


How deep learning systems work

At their core, all DL frameworks work similarly, particularly when it comes to the development of DL networks. First, a DL network consists of several neurons organized in layers; many of these are connected to other neurons in other layers. In the simplest DL networks, connections take place only between neurons in adjacent layers.


The first layer of the network corresponds to the features of our dataset; the last layer corresponds to its outputs. In the case of classification, each class has its own node, with node values reflecting how confident the system is that a data point belongs to that class.


The layers in the middle involve some combination of these features. Since they aren’t visible to the end user of the network, they are described as hidden.
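To make this concrete, here is a minimal sketch in NumPy (with made-up, randomly initialized weights, not a trained model) of data flowing from an input layer, through one hidden layer, to an output layer of class confidences:

```python
import numpy as np

# Made-up weights for a toy network: 3 input features,
# 4 hidden neurons, and 2 output classes
rng = np.random.default_rng(42)
W1 = rng.normal(size=(3, 4))  # input layer -> hidden layer
W2 = rng.normal(size=(4, 2))  # hidden layer -> output layer

def forward(x):
    h = np.tanh(x @ W1)                # hidden layer: combinations of the features
    scores = h @ W2                    # output layer: one raw score per class
    e = np.exp(scores - scores.max())  # softmax, shifted for numerical stability
    return e / e.sum()                 # class confidences, summing to 1

probs = forward(np.array([0.2, -1.3, 0.7]))
```

Each entry of `probs` reflects how confident this (untrained) network is that the data point belongs to the corresponding class.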


The connections among the nodes are weighted, indicating the contribution of each node to the nodes it is connected to in the next layer. The weights are initially randomized when the network object is created, but are refined as the ANN is trained.


Moreover, each node contains a mathematical function that creates a transformation of the received signal, before it is passed to the next layer.


This is referred to as the transfer function (also known as the activation function). The sigmoid function is the most well-known of these, but others include softmax, tanh, and ReLU. We’ll delve more into these in a moment.
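As a rough sketch, these common transfer functions can be written in a few lines of NumPy (actual frameworks may use slightly different variants):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centered sibling of the sigmoid, with outputs in (-1, 1)
    return np.tanh(x)

def relu(x):
    # Passes positive values through unchanged; zeroes out negatives
    return np.maximum(0.0, x)

def softmax(x):
    # Turns a vector of scores into probabilities that sum to 1
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()
```

Softmax is typically reserved for the output layer of a classifier, while the others are used within the hidden layers.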


Furthermore, each layer has a bias node, which is a constant that appears unchanged on each layer. Just like all the other nodes, the bias node has a weight attached to its output.


However, it has no transfer function. Its weighted value is simply added to the other nodes it is connected to, much like a constant c is added to a regression model in Statistics.


The presence of such a term balances out any bias the other terms inevitably bring to the model, ensuring that the overall bias in the model is minimal. As the topic of bias is a very complex one, we recommend you check out some external resources if you are not familiar with it.


Once the transformed inputs (features) and the biases arrive at the end of the DL network, they are compared with the target variable. The differences that inevitably occur are relayed back to the various nodes of the network, and the weights are changed accordingly.


Then the whole process is repeated until the error margin of the outputs is within a certain predefined level, or until the maximum number of iterations is reached. Iterations of this process are often referred to as training epochs, and the whole process is intimately connected to the training algorithm used.


In fact, the number of epochs used for training a DL network is often set as a parameter and it plays an important role in the ANN’s performance.


All of the data entering a neuron (via connections with neurons of the previous layer, as well as the bias node) is summed, and then the transfer function is applied to the sum, so that the data flow from that node is y = f(Σ wixi + b), where wi is the weight of the connection from node i of the previous layer, xi is that node's output, b is the weighted bias of that layer, and f() is the mathematical expression of the transfer function.
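As an illustration, this single-neuron computation can be sketched in NumPy; the input, weight, and bias values below are made-up numbers:

```python
import numpy as np

def neuron_output(x, w, b, f):
    # y = f(sum_i(w_i * x_i) + b): weighted sum of the inputs
    # plus the weighted bias, passed through the transfer function f
    return f(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])  # outputs of the previous layer's neurons
w = np.array([0.4, 0.3, -0.2])  # weights of the incoming connections
b = 0.1                         # weighted contribution of the bias node
y = neuron_output(x, w, b, lambda z: 1.0 / (1.0 + np.exp(-z)))  # sigmoid
```

With these numbers the weighted sum is -0.4, so the sigmoid yields an output of roughly 0.4.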


This relatively simple process is at the core of every ANN. The process is equivalent to that which takes place in a perceptron system—a rudimentary AI model that emulates the function of a single neuron.


Although a perceptron system is rarely used on its own in practice, it is the most basic element of an ANN, and the first system created using this paradigm.
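To illustrate, here is a minimal perceptron trained on the logical AND function; the learning rate and number of epochs are arbitrary choices for this toy example:

```python
import numpy as np

# Training data for logical AND: inputs and target outputs
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
t = np.array([0, 0, 0, 1])

w = np.zeros(2)  # connection weights, to be learned
b = 0.0          # bias term
lr = 0.1         # learning rate

# One pass over the data is one training epoch; the output errors
# are fed back into the weights, as described earlier.
for epoch in range(20):
    for xi, ti in zip(X, t):
        y = int(np.dot(w, xi) + b > 0)  # step transfer function
        err = ti - y                    # difference from the target
        w += lr * err * xi              # nudge weights toward the target
        b += lr * err
```

After a few epochs, the weights settle on values that classify all four input patterns correctly.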


The function of a single neuron is basically a single, predefined transformation of the data at hand. This can be viewed as a kind of meta-feature of the framework, as it takes a certain input x and after applying a (usually non-linear) function f() to it, x is transformed into something else, which is the neuron’s output y.


While in the majority of cases one single meta-feature would be terrible at predicting the target variable, several of them across several layers can work together quite effectively – no matter how complex the mapping of the original features to the target variable.


The downside is that such a system can easily overfit, which is why the training of an ANN doesn’t end until the error is minimal (smaller than a predefined threshold).


This most rudimentary description of a DL network works for networks of the multi-layer perceptron type. Of course, there are several variants beyond this type.


CNNs, for example, contain specialized layers with huge numbers of neurons, while RNNs have connections that go back to previous layers. Additionally, some training algorithms involve pruning nodes of the network to ensure that no overfitting takes place.


Once the DL network is trained, it can be used to make predictions about any data similar to the data it was trained on.


Furthermore, its generalization capability is quite good, particularly if the data it is trained on is diverse. What’s more, most DL networks are quite robust when it comes to noisy data, which sometimes helps them achieve even better generalization.


When it comes to classification problems, the performance of a DL system is improved by the class boundaries it creates.


Although many conventional ML systems create straightforward boundary landscapes (e.g. rectangles or simple curves), a DL system creates a more sophisticated line around each class (reminiscent of the borders of certain counties in the US).


This is because the DL system is trying to capture every bit of signal it is given in order to make fewer mistakes when classifying, boosting its raw performance. Of course, this highly complex mapping of the classes makes interpretation of the results a very challenging, if not unfeasible, task. More on that later in this blog.


Main deep learning frameworks

Having knowledge of multiple DL frameworks gives you a better understanding of the AI field. You will not be limited by the capabilities of a specific framework.


For example, some DL frameworks are geared towards a certain programming language, which may make focusing on just that framework an issue, since languages come and go.


After all, things change very rapidly in technology, especially when it comes to software. What better way to shield yourself from any unpleasant developments than to be equipped with a diverse portfolio of DL know-how?


The main frameworks in DL include MXNet, TensorFlow, and Keras. Pytorch and Theano have also played an important role, but currently they are not as powerful or versatile as the aforementioned frameworks, which we will focus on in this blog.


Also, for those keen on the Julia language, there is the Knet framework, which, to the best of our knowledge, is the only deep learning framework written mainly in a high-level language (in this case, Julia). You can learn more about it at its GitHub repository.


MXNet is developed by Apache and it’s Amazon’s favorite framework. Some of Amazon’s researchers have collaborated with researchers from the University of Washington to benchmark it and make it more widely known to the scientific community. 


TensorFlow is probably the most well-known DL framework, partly because it has been developed by Google. As such, it is widely used in the industry and there are many courses and blogs discussing it. 


Keras is a high-level framework; it works on top of TensorFlow (as well as other frameworks like Theano). Its ease of use without losing flexibility or power makes it one of the favorite deep learning libraries today.


Any data science enthusiast who wants to dig into the realm of deep learning can start using Keras with reasonably little effort.


Moreover, Keras’ seamless integration with TensorFlow, plus the official support it gets from Google, have convinced many that Keras will be one of the long-lasting deep learning frameworks, and that its corresponding library will continue to be maintained.


Main deep learning programming languages

As a set of techniques, DL is language-agnostic; any computer language can potentially be used to apply its methods and construct its data structures (the DL networks), even if each DL framework focuses on specific languages only.


This is because it is more practical to develop frameworks that are compatible with certain languages; as a result, some programming languages, such as Python, are used more than others. The fact that certain languages are more commonly used in data science plays an important role in language selection, too.


Besides, DL is more of a data science framework nowadays anyway, so it is marketed to the data science community mainly, as part of Machine Learning (ML). This likely contributes to the confusion about what constitutes ML and AI these days.


Because of this, the language that dominates the DL domain is Python. This is also the reason why we use it in the DL part of this blog. It is also one of the easiest languages to learn, even if you haven’t done any programming before.


However, if you are using a different language in your everyday work, there are DL frameworks that support other languages, such as Julia, Scala, R, JavaScript, Matlab, and Java. Julia is particularly useful for this sort of task, as it is high-level (like Python, R, and Matlab) but also very fast (comparable to lower-level languages such as Java).


In addition, almost all the DL frameworks support C / C++, since they are usually written in C or its object-oriented counterpart. Note that all these languages access the DL frameworks through APIs, which take the form of packages in these languages.


Therefore, in order to use a DL framework in your favorite language’s environment, you must become familiar with the corresponding package, its classes, and its various functions.


How to leverage deep learning frameworks

Deep learning frameworks add value to AI and DS practitioners in various ways. The most important value-adding processes include ETL processes, building data models, and deploying these models. Beyond these main functions, a DL framework may offer other things that a data scientist can leverage to make their work easier.


For example, a framework may include some visualization functionality, helping you produce some slick graphics to use in your report or presentation. As such, it’s best to read up on each framework’s documentation, becoming familiar with its capabilities to leverage it for your data science projects.


ETL processes

A DL framework can be helpful in fetching data from various sources, such as databases and files. This is a rather time-consuming process if done manually, so using a framework is very advantageous.


The framework will also do some formatting on the data so that you can start using it in your model without too much data engineering. However, doing some data processing of your own is always useful, particularly if you have some domain knowledge.


Building data models

The main function of a DL framework is to enable you to efficiently build data models. The framework facilitates the architecture design part, as well as all the data flow aspects of the ANN, including the training algorithm.


In addition, the framework allows you to view the performance of the system as it is being trained so that you gain insight into how likely it is to overfit.


Moreover, the DL framework takes care of all the testing required before the model is applied to data other than the dataset it was trained on (i.e., new data). All this makes building and fine-tuning a DL data model a straightforward and intuitive process, empowering you to make a more informed choice about which model to use for your data science project.


Deploying data models

Model deployment is something that DL frameworks can handle, too, making movement through the data science pipeline swifter. This mitigates the risk of errors through this critical process, while also facilitating easy updating of the deployed model. All this enables the data scientist to focus more on the tasks that require more specialized or manual attention.


For example, if you (rather than the DL model) worked on the feature engineering, you would have a greater awareness of exactly what is going into the model.


Deep learning methodologies and applications

Deep learning is a very broad AI category, encompassing several data science methodologies through its various systems. As we have seen, for example, it can be successfully used in classification—if the output layer of the network is built with the same number of neurons as the number of classes in the dataset.


When DL is applied to problems with the regression methodology, things are simpler, as a single neuron in the output layer is enough. Reinforcement learning is another methodology where DL is used; along with the other two methodologies (classification and regression, which are supervised learning methodologies), it falls under the predictive analytics umbrella.


DL is also used for dimensionality reduction, which (in this case) comprises a set of meta-features that are usually developed by an autoencoder system.

This approach to dimensionality reduction is also more efficient than the traditional statistical ones, which are computationally expensive when the number of features is remarkably high. Clustering is another methodology where deep learning can be used, with the proper changes in the ANN’s structure and data flow.
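As a deliberately tiny, linear sketch of the autoencoder idea (the data, layer sizes, learning rate, and epoch count are all arbitrary), gradient descent can learn an encoding that compresses three features into two meta-features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 200 points with 3 features that actually lie in a 2-D subspace
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 3))

# Encoder and decoder weights: 3 features <-> 2 meta-features
W_enc = rng.normal(scale=0.1, size=(3, 2))
W_dec = rng.normal(scale=0.1, size=(2, 3))

lr = 0.01
for epoch in range(500):
    Z = X @ W_enc        # encode: compress each point to 2 meta-features
    X_hat = Z @ W_dec    # decode: reconstruct the 3 original features
    err = X_hat - X      # reconstruction error
    # Gradients of the mean squared reconstruction error
    grad_dec = (Z.T @ err) / len(X)
    grad_enc = (X.T @ (err @ W_dec.T)) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

Z = X @ W_enc  # the compressed, 2-D representation of the dataset
```

A real autoencoder would use non-linear transfer functions and a DL framework, but even this linear version recovers a lower-dimensional representation, much as PCA does.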


Clustering and dimensionality reduction are the most popular unsupervised learning methodologies in data science and provide a lot of value when exploring a dataset. Beyond these data science methodologies involving DL, there are others that are more specialized and require some domain expertise. We’ll talk about some of them more, shortly.


There are many applications of deep learning. Some are more established or general, while others are more specialized or novel. Since DL is still a new tool, its applications in the data science world remain works in progress, so keep an open mind about this matter.


After all, the purpose of all AI systems is to be as universally applicable as possible, so the list of applications is only going to grow.


For the time being, DL is used in complex problems where high-accuracy predictions are required. These could be datasets with high dimensionality and/or highly non-linear patterns.


In the case of high-dimensional datasets that need to be summarized in a more compact form with fewer dimensions, DL is a highly effective tool for the job.


Also, since the very beginning of its creation, DL has been applied to image, sound, and video analytics, with a focus on images. Such data is quite difficult to process otherwise; the tools used before DL could only help so much, and developing those features manually was a very time-consuming process.


Moving on to more niche applications, DL is widely used in various natural language processing (NLP) methods. This includes all kinds of data related to the everyday text, such as that found in articles, blogs, and even social media posts.


Where it is important to identify any positive or negative attitudes in the text, we use a methodology called “sentiment analysis,” which offers a fertile ground for many DL systems.


There are also DL networks that perform text prediction, which is common in many mobile devices and some text editors.


More advanced DL systems manage to link images to captions by mapping these images to words that are relevant and that form sentences. Such advanced applications of DL include chatbots, in which the AI system both creates text and understands the text it is given.


Also, applications like text summarization are under the NLP umbrella too and DL contributes to them significantly. Some DL applications are more advanced or domain-specific – so much so that they require a tremendous amount of data and computing power to work. However, as computing becomes more readily available, these are bound to become more accessible in the short term.


Assessing a deep learning framework

DL frameworks make it easy and efficient to employ DL in a data science project. Of course, part of the challenge is deciding which framework to use. Because not all DL frameworks are built equal, there are factors to keep in mind when comparing or evaluating these frameworks.


The number of languages supported by a framework is especially important. Since programming languages are particularly fluid in the data science world, it is best to have your language bases covered in the DL framework you plan to use.


What’s more, multi-language support in a DL framework enables the formation of a more diverse data science team, with each member having different programming expertise.


You must also consider the raw performance of the DL systems developed by the framework in question. Although most of these systems use the same low-level language on the back end, not all of them are fast.


There may also be other overhead costs involved. As such, it’s best to do your due diligence before investing your time in a DL framework—particularly if your decision affects other people in your organization.


Furthermore, consider the ETL processes supporting a DL framework. Not all frameworks are good at ETL, which is both inevitable and time-consuming in a data science pipeline. Again, any inefficiencies of a DL framework in this aspect are not going to be advertised; you must do some research to uncover them yourself.


Finally, the user community and documentation around a DL framework are important things, too. Naturally, the documentation of the framework is going to be helpful, though in some cases it may leave much to be desired.


If there is a healthy community of users for the DL framework you are considering, things are bound to be easier when learning its more esoteric aspects—as well as when you need to troubleshoot issues that may arise.



Model interpretability

Interpretability is the capability of a model to be understood in terms of its functionality and its results. Although interpretability is often a given with conventional data science systems, it is a pain point of every DL system.


This is because every DL model is a “black box,” offering little to no explanation for why it yields the results it does. Unlike the framework itself, whose various modules and their functionality are clear, the models developed by these frameworks are convoluted graphs.


There is no comprehensive explanation as to how the inputs you feed them turn into the outputs they yield.


Although obtaining an accurate result through such a method may be enticing, it is quite hard to defend, especially when the results are controversial or carry a demographic bias.


The reason for a demographic bias has to do with the data, by the way, so no number of bias nodes in the DL networks can fix that, since a DL network’s predictions can only be as good as the data used to train it. Also, the fact that we have no idea how the predictions correspond to the inputs allows biased predictions to slip through unnoticed.


However, this lack of interpretability may be resolved in the future. This may require a new approach, but if there is one thing the progress of AI systems has demonstrated over the years, it is that innovations are still possible and that new model architectures are still being discovered. Perhaps one of the newer DL systems will have interpretability as one of its key characteristics.


Model maintenance

Maintenance is essential to every data science model. This entails updating or even upgrading a model in production, as new data becomes available. Alternatively, the assumptions of the problem may change; when this happens, model maintenance is also needed. In a DL setting, model maintenance usually involves retraining the DL network.


If the retrained model doesn’t perform well enough, more significant changes may be considered such as changing the architecture or the training parameters. Whatever the case, this whole process is largely straightforward and not too time-consuming.


How often model maintenance is required depends on the dataset and the problem in general. Whatever the case, it is good to keep the previous model available too when doing major changes, in case the new model has unforeseen issues.


Also, the whole model maintenance process can be automated to some extent, at least the offline part, when the model is retrained as new data is integrated with the original dataset.


When to use DL over conventional data science systems

Deciding when to use a DL system instead of a conventional method is an important task. It is easy to be enticed by the new and exciting features of DL and to use it for all kinds of data science problems. However, not all problems require DL. Sometimes, the extra performance of DL is not worth the extra resources required.


In cases where conventional data science systems fail or don’t offer any advantage (like interpretability), DL systems may be preferable. Complex problems with lots of variables and cases with non-linear relationships between the features and the target variables are great matches for a DL framework.


If there is an abundance of data, and the main objective is a good raw performance in the model, a DL system is typically preferable. This is particularly true if computational resources are not a concern, since a DL system requires quite a lot of them, especially during its training phase.


Whatever the case, it’s good to consider alternatives before setting off to build a DL model. While these models are incredibly versatile and powerful, sometimes simpler systems are good enough.



Deep Learning is a particularly important aspect of AI and has found a lot of applications in data science. Deep Learning employs a certain kind of AI system called an Artificial Neural Network (ANN).


An ANN is a graph-based system involving a series of (usually non-linear) operations, whereby the original features are transformed into a few meta-features capable of predicting the target variable more accurately than the original features.


The main frameworks in DL are MXNet, TensorFlow, and Keras, though Pytorch and Theano also play roles in the whole DL ecosystem. Also, Knet is an interesting alternative for those using Julia primarily.


There are various programming languages used in DL, including Python, Julia, Scala, Javascript, R, and C / C++. Python is the most popular.

A DL framework offers diverse functionality, including ETL processes, building data models, deploying and evaluating models, and other functions like creating visuals.


A DL system can be used in various data science methodologies, including Classification, Regression, Reinforcement Learning, Dimensionality Reduction, Clustering, and Sentiment Analysis.


Classification and regression are supervised learning methodologies, and dimensionality reduction and clustering are unsupervised ones, while reinforcement learning forms a category of its own.


Applications of DL include making high-accuracy predictions for complex problems; summarizing data into a more compact form; analyzing images, sound, or video; natural language processing and sentiment analysis; text prediction; linking images to captions; chatbots; and text summarization.


A DL framework needs to be assessed on various metrics (not just popularity). Such factors include the programming languages it supports, its raw performance, how well it handles ETL processes, the strength of its documentation and user communities, and the need for future maintenance.


It is not currently very easy to interpret DL results and trace them back to specific features (i.e. DL results currently have low interpretability).


Giving more weight to raw performance or to interpretability can help you decide whether a DL system or a conventional data science system is ideal for your particular problem. Other factors, like the computational resources at your disposal, are also essential for making this decision.


Deep Learning Libraries

This section introduces some of the widely used deep learning libraries, including Theano, TensorFlow, and Keras, along with a basic tutorial on each one.



Theano was an open source project: a numerical computation library for Python with a syntax similar to NumPy’s. It is efficient at evaluating complex mathematical expressions involving multidimensional arrays, which makes it a good choice for neural networks.


Theano’s documentation will give the user a better idea of the various operations involved. We will illustrate the installation steps for Theano on different platforms, followed by the basic tutorials involved.


Theano is a mathematical library that provides ways to create machine learning models that can be used later on multiple datasets. Many tools have been implemented on top of Theano. Principally, these include:

• Blocks

• Lasagne

• PyLearn2


Note It should be noted that at the time of writing this blog, contributions to the Theano package have been stopped by the community members, owing to a substantial increase in the usage of other deep learning packages.


Theano Installation

The following command will work like a charm for Theano installation on Ubuntu:

> sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git

> sudo pip install Theano


For detailed instructions on installing Theano on different platforms, please refer to the following link: theano/install.html. Docker images with CPU and GPU compatibility are also available.


Note It is always advisable to proceed with installation in a separate virtual environment.

> git clone http://git://

> cd Theano

> python setup.py install


For installation on Windows, take the following steps (sourced from an answer on Stack Overflow):

1. Install TDM GCC x64.

2. Install Anaconda x64 (say, in C:\Anaconda).

3. After the Anaconda installation, run the following commands:

a. conda update conda

b. conda update --all

c. conda install mingw libpython

4. Add 'C:\Anaconda\Scripts' to the PATH environment variable.

5. Install Theano, either an older version or the latest version available.

a. Older version:

> pip install Theano

b. Latest version:

> pip install --upgrade --no-deps git+git://


Theano Examples

The following section introduces basic code using the Theano library. The tensor subpackage of Theano contains most of the required symbols.


The following example makes use of the tensor subpackage and performs operations on two numbers (outputs have been included for reference):

> import theano

> import theano.tensor as T

> import numpy

> from theano import function

# Variables 'x' and 'y' are defined

> x = T.dscalar('x') # dscalar : Theano datatype
> y = T.dscalar('y')

# 'x' and 'y' are instances of TensorVariable, and are of dscalar theano type

> type(x)

<class 'theano.tensor.var.TensorVariable'>

> x.type
TensorType(float64, scalar)

> T.dscalar
TensorType(float64, scalar)

# 'z' represents the sum of the 'x' and 'y' variables. Theano's pp (pretty-print) function is used to display the computation represented by 'z'

> z = x + y

> from theano import pp
> print(pp(z))
(x + y)


# 'f' is a compiled Theano function, which takes the list of inputs as its first argument, and the output as its second argument

# 'f' is being compiled in C code

> f = function([x, y], z)

The preceding function could be used in the following manner to perform the addition operation:

> f(6, 10)
array(16.0)

> numpy.allclose(f(10.3, 5.4), 15.7)
True



TensorFlow is an open source library from Google for large-scale machine learning implementations. TensorFlow is, in a true sense, the successor of DistBelief, an earlier software framework released by Google that was capable of utilizing computing clusters with thousands of machines to train large models.


TensorFlow is the brainchild of software engineers and researchers from the Google Brain Team, which is part of the Google group (now Alphabet) and is primarily focused on deep learning and its applications. It makes use of data flow graphs for numerical computation, described in detail below.


It has been designed in such a way that computations on CPUs or GPUs across a single desktop, servers, or mobile devices are all catered to by a single API.


TensorFlow allows highly intensive computational tasks to be moved from CPUs to heterogeneous GPU-oriented platforms with minimal code changes. Also, a model trained on one machine can be used on a lighter device, such as an Android-enabled mobile device, for final deployment.


TensorFlow is the foundation for the implementation of applications such as DeepDream, an automated program that amplifies and visualizes the patterns a trained network detects in images, and RankBrain, which helps Google process search queries and provide more relevant search results to users.


To get a better sense of the workings and implementation of TensorFlow, one can read the relevant white paper at http://download.


Data Flow Graphs

TensorFlow uses data flow graphs to represent mathematical computations. These are directed graphs, with nodes and edges.


The nodes represent mathematical operations and act as terminals for data input, output of results, or reading/writing of persistent variables. The edges express the input/output relationships between nodes.


The data edges carry tensors (dynamically sized multidimensional data arrays) between the nodes. The movement of these tensor units through the whole graph is what led to the name TensorFlow. The nodes in a graph, upon receiving all their respective tensors from the incoming edges, execute asynchronously and in parallel.


The overall design and flow of computations covered within a data flow graph occur in a session and are then executed on the desired machines. TensorFlow offers Python and C++ APIs, while relying on C++ for optimized computations.


The following features of TensorFlow make it a strong choice for the massive parallelism and high scalability required in the field of machine learning:


• Deep flexibility: Users have full freedom to write their own libraries on top of TensorFlow. One need only express the whole computation in the form of a graph, and the rest is taken care of by TensorFlow.


• True portability: The portability offered by TensorFlow enables machine learning code written on a laptop to be trained on GPUs for faster model training with no code changes, and to be deployed on mobile, in a final product, or on Docker as a cloud service.


• Automatic differentiation: TensorFlow handles derivative computation for gradient-based machine learning algorithms through its automatic differentiation functionality. Computing derivatives helps in understanding how the values in the extended graph relate to each other.


• Language options: TensorFlow offers Python and C++ interfaces to build and execute the computational graphs.

• Performance maximization: The compute elements of a TensorFlow graph can be assigned to multiple devices, and TensorFlow maximizes performance through its wide support for threads, queues, and asynchronous computation.
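The automatic differentiation mentioned in the list above can be made concrete with a minimal, hand-rolled sketch of reverse-mode differentiation for scalar values. This is purely illustrative (the `Var` class and its methods are our own invention, not TensorFlow's actual implementation):

```python
class Var:
    """A scalar value that records how it was computed, so that
    derivatives can be propagated backward through the graph."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent node, local gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self):
        # Seed the output gradient and push it back down the graph.
        self.grad = 1.0
        stack = [self]
        while stack:
            node = stack.pop()
            for parent, local_grad in node.parents:
                parent.grad += node.grad * local_grad
                stack.append(parent)

# f(x, y) = x * y + x  ->  df/dx = y + 1, df/dy = x
x, y = Var(3.0), Var(4.0)
f = x * y + x
f.backward()
print(f.value, x.grad, y.grad)  # 15.0 5.0 3.0
```

TensorFlow applies the same chain-rule bookkeeping, but over whole tensors and a much richer set of operations.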


TensorFlow Installation

TensorFlow installation is as easy as that of any other Python package and can be achieved with a single pip install command. If required, users can follow the detailed installation explanation on the main TensorFlow site ( os_setup.html, for the r0.10 version).


Installation via pip must be preceded by installation of the binary package relevant to the platform. Please refer to the following link for more details on the TensorFlow package and its repository.


To check the installation of TensorFlow on Windows, check out the following blog link: PlayingWithTensorFlowOnWindows.aspx.


TensorFlow Examples

Running and experimenting with TensorFlow is as easy as the installation. The tutorial on the official web site is quite clear and covers basic to expert-level examples.


Following is one such example, with the basics of TensorFlow (outputs have been included for reference):

> import tensorflow as tf

> hello = tf.constant('Hello, Tensors!')

> sess = tf.Session()

> print(sess.run(hello))
Hello, Tensors!

# Mathematical computation
> a = tf.constant(10)

> b = tf.constant(32)

> print(sess.run(a + b))
42


The run() method takes the variables to be computed as arguments, and a backward chain of the required calls is made to produce their values.


TensorFlow graphs are formed starting from nodes that do not require any input, i.e., the sources. These nodes pass their output to further nodes, which perform computations on the resulting tensors, and the whole process proceeds in this pattern.
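The execution order described above can be sketched with a toy dataflow graph in plain Python. The node names and the dictionary structure here are our own illustration, not TensorFlow internals:

```python
# A toy dataflow graph: each node is (operation, list of input node names).
# Source nodes carry constants and need no inputs.
graph = {
    "a": (lambda: 10, []),
    "b": (lambda: 32, []),
    "sum": (lambda a, b: a + b, ["a", "b"]),
    "double": (lambda s: 2 * s, ["sum"]),
}

def evaluate(graph, node, cache=None):
    """Evaluate a node by first evaluating its inputs (sources first)."""
    if cache is None:
        cache = {}
    if node not in cache:
        op, inputs = graph[node]
        cache[node] = op(*(evaluate(graph, n, cache) for n in inputs))
    return cache[node]

print(evaluate(graph, "double"))  # 84
```

The `cache` dictionary plays the role of a session's memory: each node is computed once, and downstream nodes reuse the stored tensors.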


The following example shows the creation of two matrices using NumPy, the assignment of these matrices as TensorFlow constant objects, and the multiplication of the two matrices. A second example includes the addition and subtraction of two constants. A TensorFlow session is activated to perform the operations and deactivated once they are complete.

> import tensorflow as tf

> import numpy as np

> mat_1 = 10*np.random.random_sample((3, 4)) # Creating NumPy matrices


> mat_2 = 10*np.random.random_sample((4, 6))

# Creating a pair of constant ops, and including the above made matrices

> tf_mat_1 = tf.constant(mat_1)
> tf_mat_2 = tf.constant(mat_2)

# Multiplying TensorFlow matrices with matrix multiplication operation

> tf_mat_prod = tf.matmul(tf_mat_1, tf_mat_2)

> sess = tf.Session() # Launching a session

# run() executes required ops and performs the request to store output in 'mult_matrix' variable

> mult_matrix = sess.run(tf_mat_prod)

> print(mult_matrix)

# Performing constant operations with the addition and subtraction of two constants

> a = tf.constant(10)
> b = tf.constant(20)

> print("Addition of constants 10 and 20 is %i " % sess.run(a+b))

Addition of constants 10 and 20 is 30

> print("Subtraction of constants 10 and 20 is %i " % sess.run(a-b))

Subtraction of constants 10 and 20 is -10

> sess.close() # Closing the session


Note As no graph was specified in the preceding example, the TensorFlow session makes use of the default graph instance.



Keras is a highly modular neural network library that runs on top of Theano or TensorFlow. Keras is one of the libraries that supports both CNNs and RNNs (we will discuss these two types of neural networks in detail in later blogs), and it runs effortlessly on GPU and CPU.


A model is understood as a sequence or a graph of standalone, fully configurable modules that can be plugged together with as few restrictions as possible. In particular, neural layers, cost functions, optimizers, initialization schemes, activation functions, and regularization schemes are all standalone modules that can be combined to create new models.


Keras Installation

In addition to Theano or TensorFlow as the back end, Keras makes use of a few libraries as dependencies. Installing these before the Theano or TensorFlow installation eases the process.

> pip install numpy scipy

> pip install scikit-learn

> pip install pillow

> pip install h5py


Note Keras always requires the latest version of Theano to be installed. We have made use of TensorFlow as the back end for Keras throughout the blog.

> pip install keras


Keras Principles

Keras offers a model as one of its main data structures. Each model is a customizable entity that can be made up of different layers, cost functions, activation functions, and regularization schemes.


Keras offers a wide range of pre-built layers to plug in a neural network, a few of which include convolutional, dropout, pooling, locally connected, recurrent, noise, and normalization layers. An individual layer of the network is considered to be an input object for the next layer.


Built primarily for the implementation of neural networks and deep learning, Keras code snippets will also appear in later blogs, alongside their relevant neural networks.


Keras Examples

The base data structure of Keras is a model type, made up of the different layers of the network. The sequential model is the major type of model in Keras, in which layers are added one by one until the final output layer.


The following example of Keras uses the blood transfusion dataset from the UCI ML Repository. One can find the details regarding the data here: Service+Center).


The data is taken from a blood transfusion service center located in Taiwan and has four attributes, in addition to the target variable.


The problem is one of binary classification, with '1' standing for a person who donated blood and '0' for a person who declined to donate. More details regarding the attributes can be gleaned from the link mentioned.


Save the dataset shared at the website in the current working directory (if possible, with the headers removed). We start by loading the dataset, building a basic MLP model in Keras, followed by fitting the model on the dataset.


The basic type of model in Keras is sequential, which offers layer-by-layer addition of complexity to the model. Multiple layers can be fabricated with their respective configurations and stacked onto the initial base model.

# Importing the required libraries and layers and model from Keras

> import keras

> from keras.layers import Dense

> from keras.models import Sequential
> import numpy as np

# Dataset Link : # +Transfusion+Service+Center

# Save the dataset as a .csv file :

> tran_ = np.genfromtxt('transfusion.csv', delimiter=',')

> X = tran_[:,0:4] # The dataset offers 4 input variables

> Y = tran_[:,4] # Target variable with '1' and '0'

> print(X)


As the input data has four corresponding variables, the input_dim, which refers to the number of different input variables, has been set to four.


We have made use of fully connected layers, defined as Dense layers in Keras, to build the additional layers. The selection of the network structure is done on the basis of the complexity of the problem.


Here, the first hidden layer is made up of eight neurons, which are responsible for further capturing the nonlinearity.


The layer has been initialized with the uniformly distributed random numbers and with the activation function as ReLU, as described previously in this blog. The second layer has six neurons and configurations similar to its previous layer.

# Creating our first MLP model with Keras
> mlp_keras = Sequential()

> mlp_keras.add(Dense(8, input_dim=4, init='uniform', activation='relu'))

> mlp_keras.add(Dense(6, init='uniform', activation='relu'))


In the last, output layer, we have set the activation to sigmoid, mentioned previously, which generates a value between 0 and 1 and thereby supports the binary classification.

> mlp_keras.add(Dense(1, init='uniform', activation='sigmoid'))
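To see why the sigmoid suits a binary output, here it is written out in plain Python (a standalone sketch, independent of Keras):

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5, 0, 5):
    print(z, sigmoid(z))
# Large negative inputs go toward 0, large positive inputs toward 1,
# and 0 maps to exactly 0.5.
```

The output can therefore be read as the probability of the positive class, and thresholding it at 0.5 yields the '0'/'1' prediction.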


To compile the network, we have made use of binary cross-entropy (the logarithmic loss for binary classification), selected Adam as the default choice of optimizer, and chosen accuracy as the metric to be tracked.


The network is trained using the backpropagation algorithm, along with the given optimization algorithm and loss function.

> mlp_keras.compile(loss = 'binary_crossentropy', optimizer='adam',metrics=['accuracy'])
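To make the logarithmic loss concrete, here is binary cross-entropy written out by hand in plain Python (the labels and predicted probabilities below are made up for illustration):

```python
import math

def binary_crossentropy(y_true, y_pred):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    per_sample = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for y, p in zip(y_true, y_pred)
    ]
    return sum(per_sample) / len(per_sample)

# A confident, correct prediction gives a small loss ...
print(binary_crossentropy([1, 0], [0.9, 0.1]))
# ... while a confident, wrong one is penalized heavily.
print(binary_crossentropy([1, 0], [0.1, 0.9]))
```

Minimizing this quantity pushes the predicted probabilities toward the true labels.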


The model is trained on the given dataset with a small number of iterations (nb_epoch), starting with a feasible batch size of instances (batch_size).


The parameters could be chosen either on the basis of prior experience of working with such kinds of datasets, or one can even make use of Grid Search to optimize the choice of such parameters. We will be covering the same concept in later blogs, where necessary.

> mlp_keras.fit(X, Y, nb_epoch=200, batch_size=8, verbose=0)


The next step is to finally evaluate the model that has been built and to check out the performance metrics, loss, and accuracy on the initial training dataset. The same operation could be performed on a new test dataset with which the model is not acquainted and could be a better measure of the model performance.

> accuracy = mlp_keras.evaluate(X,Y)

> print("Accuracy : %.2f%% " % (accuracy[1]*100 ))


If one wants to further optimize the model by using different combinations of parameters and other tweaks, it could be done by using different parameters and steps while undertaking model creation and validation, though it need not result in better performance in all cases.

# Using a different optimizer
> from keras.optimizers import SGD
> opt = SGD(lr=0.01)


The following creates a model with configurations similar to those in the earlier model but with a different optimizer and including a validation dataset from the initial training data:

> mlp_optim = Sequential()

> mlp_optim.add(Dense(8, input_dim=4, init='uniform', activation='relu'))

> mlp_optim.add(Dense(6, init='uniform', activation='relu'))

> mlp_optim.add(Dense(1, init='uniform', activation='sigmoid'))

# Compiling the model with SGD

> mlp_optim.compile(loss = 'binary_crossentropy', optimizer=opt, metrics=['accuracy'])

# Fitting the model and checking accuracy

> mlp_optim.fit(X, Y, validation_split=0.3, nb_epoch=150, batch_size=10, verbose=0)

> results_optim = mlp_optim.evaluate(X,Y)

> print("Accuracy : %.2f%%" % (results_optim[1]*100 ) )


Make sure that all the packages mentioned for natural language processing and deep learning in the preceding sections are installed before moving forward. Once you have set up the system, you will be good to go with the examples offered throughout this blog.


AI Methodologies Beyond Deep Learning

As we’ve seen, deep learning is a key aspect of most robust AI systems, but it’s not the only way to use AI. This blog covers some alternatives to deep learning. Even if these methods are not as popular as DL methods, they can be very useful in certain scenarios.


We’ll take a look at the two main methodologies – optimization and fuzzy logic – as well as some less well-known methods such as artificial creativity. We’ll cover new trends in AI methodologies. Finally, we’ll explore some useful considerations to leverage these methods and make the most out of them for your data science projects.


Many of the AI methodologies that serve as alternatives to DL don’t use ANNs of any kind but rely on other systems that exhibit a certain level of intelligence. As some such systems don’t use an opaque graph for making their predictions (as ANNs do), they are more transparent than DL, making them useful when interpreting results.


Most of these alternative AI methodologies have been around for a few decades now, so there is plenty of support behind them, making them reliable resources overall. Others are generally newer but are quite robust and reliable nevertheless.


Since the field of AI is rapidly evolving, these alternatives to DL may become even more relevant over the next few years. After all, many data science problems involve optimizing a function.


Among the various alternative AI methodologies out there, the ones that are more suitable for data science work can be classified under the optimization umbrella. However, fuzzy logic systems may be useful, even though they apply mainly to low-dimensionality datasets, as we’ll see later.


Optimization, on the other hand, applies to all kinds of datasets and is often used within other data science systems.



Optimization is the process of finding the maximum or minimum of a given function (also known as a fitness function), by calculating the best values for its variables (also known as a “solution”).


Despite the simplicity of this definition, optimization is not an easy process; it often involves constraints, as well as complex relationships among the various variables. Even though some functions can be optimized analytically, most functions we encounter in data science are not as simple, requiring a more advanced technique.
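As a minimal illustration of the idea (our own sketch, not any particular library's optimizer), here is a random-search optimizer for a one-variable fitness function:

```python
import random

def random_search(fitness, bounds, n_iter=10000, seed=42):
    """Minimize `fitness` by sampling candidate solutions uniformly
    at random within `bounds` and keeping the best one seen."""
    rng = random.Random(seed)
    lo, hi = bounds
    best_x = rng.uniform(lo, hi)
    best_f = fitness(best_x)
    for _ in range(n_iter):
        x = rng.uniform(lo, hi)
        f = fitness(x)
        if f < best_f:
            best_x, best_f = x, f
    return best_x, best_f

# Minimize the fitness function (x - 3)^2; the true minimum is at x = 3.
x, f = random_search(lambda x: (x - 3) ** 2, bounds=(-10, 10))
print(x, f)  # x lands close to 3, f close to 0
```

Real optimizers sample far more cleverly than this, but the shape of the problem (a fitness function, a search space, a best-so-far solution) is the same.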


Optimization systems (or optimizers, as they are often referred to) aim to optimize in a systematic way, oftentimes using a heuristics-based approach.


Such an approach enables the AI system to use a macro-level concept as part of its low-level calculations, accelerating the whole process and making it more lightweight. After all, most of these systems are designed with scalability in mind, so the heuristic approach is the most practical.


Importance of optimization

Optimization is especially important in many data science problems, particularly those involving many variables that need to be fine-tuned, or cases where conventional tools don’t seem to work. In order to tackle more complex problems beyond classical methodologies, optimization is essential.


Moreover, optimization is useful for various data engineering tasks such as feature selection, in cases where maintaining a high degree of interpretability is desired. We’ll investigate the main applications of optimizers in data science later in this blog.


Optimization systems overview

There are different kinds of optimization systems. The most basic ones have been around the longest. These are called “deterministic optimizers,” and they tend to yield the best possible solution for the problem at hand.


That is, they find the absolute maximum or minimum of the fitness function. Since they are quite time-consuming and cannot handle large-scale problems, deterministic optimizers are usually used for applications where the number of variables is relatively small.


A classic example of such an optimizer is the one used for least-squares regression: a simple method for finding the optimal line that fits a set of data points, in a space with relatively small dimensionality.


In addition to deterministic optimizers, there are stochastic optimizers, which more closely fit the definition of AI. After all, most of these are based on natural phenomena, such as the movement of the members of a swarm, or the way a metal melts. The main advantage of these methods is that they are very efficient.


Although they usually don’t yield the absolute maximum or minimum of the function they are trying to optimize, their solutions are good enough for all practical purposes (even if they vary slightly every time you run the optimizer).
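Simulated annealing, inspired by the way a cooling metal settles, is a classic stochastic optimizer of this kind. A minimal sketch for a one-variable function follows (the cooling schedule and parameters are illustrative choices, not canonical ones):

```python
import math
import random

def simulated_annealing(fitness, x0, n_iter=20000, temp0=10.0, seed=0):
    """Minimize `fitness` with random local moves, occasionally accepting
    worse solutions with a probability that shrinks as the system cools."""
    rng = random.Random(seed)
    x, f = x0, fitness(x0)
    best_x, best_f = x, f
    for i in range(1, n_iter + 1):
        temp = temp0 / i                 # simple cooling schedule
        candidate = x + rng.gauss(0, 1)  # random neighboring solution
        f_cand = fitness(candidate)
        # Always accept improvements; sometimes accept worse moves.
        if f_cand < f or rng.random() < math.exp((f - f_cand) / temp):
            x, f = candidate, f_cand
        if f < best_f:
            best_x, best_f = x, f
    return best_x, best_f

# Minimize (x - 1)^2 starting far from the optimum.
x, f = simulated_annealing(lambda x: (x - 1) ** 2, x0=8.0)
print(x, f)  # x ends up close to 1
```

Because the accepted solution is random, repeated runs give slightly different answers, exactly as described above: good enough for practical purposes rather than provably optimal.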


Stochastic optimizers also scale very well, so they are ideal for complex problems involving many variables. In this blog, we will focus on some of these stochastic optimization methods.


Programming languages for optimization

Optimization is supported by most programming languages through libraries, like the Optim and JuMP packages in Julia. However, each algorithm is simple enough that you can code it yourself if you cannot find an available “off-the-shelf” function. In this blog, we’ll examine the main algorithms for advanced optimization and how they are implemented in Julia.


We chose this programming language because it combines ease of use and high execution speed. Remember that all the code is available in the Docker environment that accompanies this blog.


Fuzzy inference systems

Fuzzy logic (FL) is a methodology designed to emulate the human capacity of imprecise or approximate reasoning. This ability to judge under uncertainty was previously considered strictly human, but FL has made it possible for machines, too.


Despite its name, there is nothing unclear about the outputs of fuzzy logic. Fuzzy logic is an extension of classical logic in which partial truths are included, extending bivalued logic (true or false) to a multivalued logic (degrees of truth between true and false).
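The standard way to extend the logical operators to degrees of truth uses min, max, and complement. A quick sketch (the propositions and their truth degrees are made up for illustration):

```python
# Degrees of truth are numbers in [0, 1] rather than just True/False.
def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a

hot, humid = 0.8, 0.4   # two partially true propositions
print(fuzzy_and(hot, humid))  # 0.4
print(fuzzy_or(hot, humid))   # 0.8
print(fuzzy_not(hot))         # approximately 0.2
```

When the degrees are restricted to exactly 0 and 1, these operators reduce to the familiar AND, OR, and NOT of classical logic.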


According to its creator, Professor Zadeh, the ultimate goal of fuzzy logic is to form the theoretical foundation for reasoning about imprecise propositions (also known as “approximate reasoning”). Over the past couple of decades, FL has gained ground and become regarded as one of the most promising AI methodologies.


An FL system contains a series of mappings corresponding to the various features of the data at hand. This system contains terms that make sense to us, such as high-low, hot-cold, and large-medium-small, terms that may appear fuzzy since there are no clear-cut boundaries among them.


Also, these attributes are generally relative and require some context to become explicit, through a given mapping between each term and some number that the system can use in its processes.


This mapping is described mathematically through a set of membership functions, graphically taking the form of triangles, trapezoids, or even curves. This way something somewhat abstract like “large” can take very specific dimensions in the form of “how large on a scale of 0 to 1” it is. The process of coding data into these states is called fuzzification.


Once all the data is coded in this manner, the various mappings are merged together through logical operators, such as inference rules (for example, “If A and B then C,” where A and B correspond to states of two different features and C to the target variable). The result is a new membership function describing this complex relationship, usually depicted as a polygon.


This is then turned into a crisp value, through one of the various methods, in a process called defuzzification. Since this whole process is graphically accessible to the user, and the terms used are borrowed from human language, the result is always something clear-cut and interpretable (given some understanding of how FL works).
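Putting the pieces together, here is a deliberately tiny fuzzy inference sketch: triangular membership functions fuzzify a temperature reading, one rule per output term fires, and a weighted average (one simple defuzzification method) produces a crisp fan speed. All of the term names and numbers are made up for illustration:

```python
def triangular(x, a, b, c):
    """Triangular membership function peaking at b, zero outside [a, c]."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Fuzzification: map a crisp temperature to degrees of "cold"/"warm"/"hot".
def fuzzify(temp):
    return {
        "cold": triangular(temp, -10, 0, 15),
        "warm": triangular(temp, 5, 18, 30),
        "hot":  triangular(temp, 22, 35, 50),
    }

# One rule per term ("if hot then fan fast", etc.); each output term is
# represented here by a single crisp fan speed for simplicity.
fan_speed_for = {"cold": 0.0, "warm": 40.0, "hot": 100.0}

def defuzzify(memberships):
    """Weighted average of the output values -- a simple defuzzification."""
    total = sum(memberships.values())
    if total == 0:
        return 0.0
    return sum(memberships[t] * fan_speed_for[t] for t in memberships) / total

m = fuzzify(25)      # partially "warm" and partially "hot"
print(m)
print(defuzzify(m))  # a crisp fan speed between 40 and 100
```

Every step of this pipeline is inspectable, which is the transparency advantage discussed below.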


Interestingly, FL has also been used in conjunction with ANNs to form what is referred to as neuro-fuzzy systems. Instead of having a person create the membership functions by hand, an FL system can make use of the optimization method in a neural network’s training algorithm to calculate them on the fly.


This whole process and the data structure that it entails take the form of an automated fuzzy system, combining the best of both worlds.


Why systems based on fuzzy logic are still relevant

Although FL was originally developed with certain types of engineering systems in mind, such as control systems, its ease of use and low cost of implementation have made it relevant as an AI methodology across a variety of other fields, including data science.


What’s more, fuzzy systems are very accessible, especially when automated through optimization for their membership functions (such as the neuro-fuzzy systems mentioned previously). Such a system employs a set of FL rules (which are created based on the data) to infer the target variable. These systems are called Fuzzy Inference Systems, or FISs.


The main advantage of this FIS approach is that it is transparent—a big plus if you want to defend your results to the project stakeholders. The transparency of a FIS makes the whole problem more understandable, enabling you to figure out which features are more relevant.


In addition, a FIS can be used in conjunction with custom-made rules based on an expert’s knowledge. This is particularly useful if you are looking at upgrading a set of heuristic rules using AI. Certain larger companies that are planning to use data science to augment their existing systems are likely to be interested in such a solution.


The downside of fuzzy inference systems

Despite all the merits of FIS, these AI systems don’t always meet the expectations of modern data science projects. Specifically, when the dimensionality of the data at hand is quite large, the number of rules produced increases exponentially, making these systems too large for any practical purposes.
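To see how fast the rule base grows: with m linguistic terms per feature and k features, an exhaustive rule base has m^k rules. A quick calculation (the choice of three terms per feature is just an example):

```python
# Number of rules in an exhaustive fuzzy rule base: terms ** features.
def rule_count(terms_per_feature, n_features):
    return terms_per_feature ** n_features

for n_features in (2, 5, 10, 20):
    print(n_features, rule_count(3, n_features))
# 2 -> 9, 5 -> 243, 10 -> 59049, 20 -> 3486784401
```

At twenty features with three terms each, the rule base already exceeds three billion rules, which is why high-dimensional data makes a FIS impractical.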


Of course, you can mitigate this issue with an autoencoder or a statistical process, like PCA or ICA, that creates a smaller set of features.


However, when you do this, the whole interpretability benefit of FIS goes out the window. Why? With a reduced feature set, the relationship with the original features (and the semantic meaning they carry) is warped.


As such, it is very difficult to reconstruct meaning; the new features will require a different interpretation if they are to be meaningful. This is not always feasible.


Nevertheless, for datasets of smaller dimensionality, a FIS is a worthwhile alternative, even if it’s not a particularly popular one. We’ll explore Fuzzy Logic and FIS more in blog 11, where we’ll discuss alternative AI methodologies.


Artificial creativity

Artificial creativity (AC) is a relatively new AI methodology, in which new information is created based on the relevant data the system has been trained on. Its applications span various domains, including most of the arts, as well as industrial design, and even data science.


This kind of AI methodology makes use of a specialized DL network that is trained to develop new data that retains some characteristics of the data it was trained on.


When you feed such a specialized AI system some image data, and it’s been trained on the artwork of a particular painter, it will produce new “artwork” that makes use of the images it is fed, but using the painting patterns of the artist it is trained to emulate. The results may not win any art prizes, but they are certainly interesting and original!


In data science, AC can aid in the creation of new data, which is quite useful in certain cases. This new data may not be particularly helpful as an expansion of the training set for that ANN, but it can be especially useful in other ways.


For example, if the original data is sensitive like medical data, and contains too much personally identifiable information (PII), you can generate new data using AC that, although very similar to the original data, cannot be mapped back to a real individual.


In addition, data created from an AC system can be useful for different data models—perhaps as a new test set or even part of their training set.


This way it can offer the potential for better generalization for these models, as there is more data available for them to train or test on. This can be particularly useful in domains where labeled data is hard to come by or is expensive to generate otherwise.


Additional AI methodologies

Beyond all the AI methodologies we’ve discussed so far, there exist several others worth noting. These systems also have a role to play in data science, while their similarities to DL systems make them easier to comprehend. Also, as the AI field is constantly expanding, it’s good to be aware of all the new methodologies that pop up.


The Extreme Learning Machine (or ELM) is an example of an alternative AI methodology that hasn’t yet received the attention it deserves. Although they are architecturally similar to DL networks, ELMs are quite distinct in the way they are trained.


In fact, their training is so unconventional that some people consider the whole approach borderline unscientific (the professor who came up with ELMs has recently received a lot of criticism from other academics).


Instead of optimizing all the weights across the network, ELMs focus on just the connections of the two last layers—namely the last set of meta-features and their outputs.


The rest of the weights keep the random values they were initialized with. Because the focus is solely on the weights of the last layers, this optimization is extremely fast and very precise.


As a result, ELMs are the fastest network-based methodology out there, and their performance is quite decent too. What’s more, they are quite unlikely to overfit, which is another advantage.


Despite its counter-intuitive approach, an ELM system does essentially the same thing as a conventional DL system; instead of optimizing all the meta-features it creates, though, it focuses on optimizing the way they work together to form a predictive analytics model.
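The training shortcut described above can be sketched in a few lines of NumPy: the hidden weights stay random, and only the output weights are solved for directly with a least-squares fit, rather than learned iteratively. This is a bare-bones illustration of the idea on a toy problem, not a full ELM implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem: learn y = sin(x) from 200 samples.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

# 1. Random, untrained hidden layer (these weights are never updated).
n_hidden = 50
W = rng.normal(size=(1, n_hidden))
b = rng.normal(size=n_hidden)
H = np.tanh(X @ W + b)            # hidden-layer activations (meta-features)

# 2. Solve for the output weights in one shot with least squares.
beta, *_ = np.linalg.lstsq(H, y, rcond=None)

# Predictions come from the fixed random features times the solved weights.
y_pred = H @ beta
mse = np.mean((y - y_pred) ** 2)
print(mse)  # a small value: random features plus a linear fit suffice here
```

There is no gradient descent anywhere: the only "training" is the single linear solve, which is what makes ELMs so fast.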


Another new alternative AI methodology is Capsule Networks (CapsNets). Although the CapsNet should be regarded as a member of the deep learning methods family, its architecture and its optimization training method are quite novel.


CapsNets try to capture the relative relationships between the objects within a relevant context. A CNN model that achieves high performance in image recognition tasks may not necessarily be able to identify the same object from different angles. CapsNets, though, capture those kinds of contextual relationships quite well.


Their performance on some tasks has already surpassed the leading models by about 45%, which is quite astonishing. Considering this, their future looks promising.


Self-organizing Maps (SOMs) are a special type of AI system. Although they are also ANNs of sorts, they are unique in function. SOMs offer a way to map the feature space into a two-dimensional grid so that it can be better visualized afterward.


Since it doesn’t make use of a target variable, a SOM is an unsupervised learning methodology; as such, it is ideal for data exploration.


SOMs have been successfully applied in various domains, such as meteorology, oceanography, oil and gas exploration, and project prioritization. One key difference SOMs have from other ANNs is that their learning is based on competition instead of error correction.
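The competition-based learning mentioned above can be sketched as follows: for each input, the closest node (the "best matching unit," or BMU) wins, and it and its grid neighbors are pulled toward the input. This is a stripped-down illustration of the idea, not a complete SOM:

```python
import numpy as np

rng = np.random.default_rng(1)

# A 5x5 grid of nodes, each holding a weight vector in the input space.
grid_h, grid_w, dim = 5, 5, 3
weights = rng.random((grid_h, grid_w, dim))
coords = np.dstack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                               indexing="ij")).astype(float)

def train_step(x, lr=0.5, radius=1.5):
    # Competition: find the node whose weights are closest to the input.
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Cooperation: nodes near the BMU on the grid move toward the input,
    # with influence decaying with grid distance (no error correction).
    grid_dist = np.linalg.norm(coords - np.array(bmu, dtype=float), axis=2)
    influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
    weights[:] = weights + lr * influence[..., None] * (x - weights)

x = np.array([0.9, 0.1, 0.5])
before = np.min(np.linalg.norm(weights - x, axis=2))
for _ in range(10):
    train_step(x)
after = np.min(np.linalg.norm(weights - x, axis=2))
print(before, after)  # the winning region of the map has moved toward x
```

A real SOM also shrinks the learning rate and neighborhood radius over time and cycles through many inputs, so that the 2-D grid gradually arranges itself to mirror the structure of the feature space.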


Also, their architecture is quite different, as their various nodes are only connected to the input layer with no lateral connections. This unique design was first introduced by Professor Kohonen, which is why SOMs are also referred to as “Kohonen Maps.”
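As an illustration of this competition-based learning, here is a minimal NumPy sketch of a SOM (a simplified version of Kohonen's algorithm; the grid size, learning rate, and neighborhood radius are all assumptions for demonstration):

```python
import numpy as np

rng = np.random.default_rng(1)

# 3-D input data to be mapped onto a 4x4 grid of nodes
data = rng.random((500, 3))
grid_w = grid_h = 4
weights = rng.random((grid_w * grid_h, 3))  # one weight vector per node

# grid coordinates of each node, used by the neighborhood function
coords = np.array([(i, j) for i in range(grid_w) for j in range(grid_h)], dtype=float)

lr, radius = 0.5, 2.0
for x in data:
    # competition: the best matching unit (BMU) is the node closest to the input
    bmu = int(np.argmin(((weights - x) ** 2).sum(axis=1)))
    # cooperation: nodes near the BMU on the grid are pulled along with it
    d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
    h = np.exp(-d2 / (2 * radius ** 2))
    # adaptation: move weights toward the input, scaled by the neighborhood
    weights += lr * h[:, None] * (x - weights)
    lr *= 0.995  # decay the learning rate over time
```

After training, plotting each data point at the grid position of its best matching unit yields the two-dimensional visualization described above.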


The Generative Adversarial Network (GAN) is a very interesting type of AI methodology, geared towards optimizing a DL network in a rather creative way. A GAN comprises two distinct ANNs.


One is for learning, and the other is for “breaking” the first one – finding cases where the predictions of the first ANN are off. These systems are comparable to the “white hat” hackers of cybersecurity.


In essence, the second ANN creates increasingly more demanding challenges for the first ANN, thereby constantly improving its generalization (even with a limited amount of data).


GANs are used for simulations as well as data science problems. Their main field of application is astronomy, where a somewhat limited quantity of images and videos of the cosmos is available to use in training.


The idea of GANs has been around for some time, but it only recently gained popularity; this is largely due to the amount of computational resources demanded by such a system (just like any other DL-related AI system).


Artificial Emotional Intelligence (AEI) is another kind of AI that’s novel on both the methodological as well as the application levels. The goal of AEI is to facilitate an understanding of the emotional context of data (which is usually text-based) and to assess it just like a human.


Applications of AEI are currently limited to comprehension; in the future, though, more interactive systems could provide a smoother interface between humans and machines. There is an intersection between AEI and ANNs, but some aspects of AEI make use of other kinds of ML systems on the back end.

Glimpse into the future

While the field of AI expands in various directions, making it hard to speculate about how it will evolve, there is one common drawback to most of the AI systems used today: a lack of interpretability.


As such, it is quite likely that some future AI system will address this matter, providing a more comprehensive result, or at least some information as to how the result came about (something like a rationale for each prediction), all while maintaining the scalability of modern AI systems.


A more advanced AI system of the future will likely have a network structure, just like current DL systems—though it may be quite different architecturally. Such a system would be able to learn with fewer data points (possibly assisted by a GAN), as well as generate new data (just like variational autoencoders).


Could an AI system learn to build new AI systems? It is possible; however, the excessive resources required for such a task have made it feasible only for cloud-based systems. Google showcases its progress in this area in what it refers to as Automated Machine Learning (AutoML).


So, if you were to replicate this task with your own system, who is to say that the AI-created AI would be better than what you yourself would have built? Furthermore, would you be able to pinpoint its shortcomings, which may be quite subtle and obscure?


After all, an AI system requires a lot of effort to make sure that its results are not just accurate but also useful, addressing the end user's needs. You can imagine how risky it would be to have an AI system built that you know nothing about!


However, all this is just an idea of a potential evolutionary course, since AI can always evolve in unexpected ways. Fortunately, with all the popularity of AI systems today, if something new and better comes along, you’ll probably find out about it sooner rather than later.


Perhaps for things like that, it’s best to stop and think about the why’s instead of focusing only on the how’s, since as many science and industry experts have warned us, AI is a high-risk endeavor and needs to be handled carefully and always with fail-safes set in place.


For example, although it’s fascinating and in some cases important to think about how we can develop AIs that improve themselves, it’s also crucial to understand what implications this may have and plan for AI safety matters beforehand.


Also, prioritizing certain characteristics of an AI system (e.g. interpretability, ease of use, having limited issues in the case of malfunction, etc.) over raw performance, may provide more far-reaching benefits. After all, isn’t improving our lives in the long-term the reason why we have AI in the first place?


About the methods

It is not hard to find problems that can be tackled with optimization. For example, you may be looking at an optimum configuration of a marketing process to minimize the total cost, or to maximize the number of people reached.


Although data science can lend some aid in solving such a problem, at the end of the day, you’ll need to employ an optimizer to find a true solution to a problem like this.


Furthermore, optimization can help in data engineering, too. Some feature selection methods, for instance, use optimization to keep only the features that work well together.


There are also cases of feature fusion that employ optimization (although few people use this method since it sacrifices some interpretability of the model that makes use of these meta-features).


In addition, when building a custom predictive analytics system combining other classifiers or regressors, you often need to maximize the overall accuracy rate (or some other performance metric).


To do this, you must figure out the best parameters for each module (i.e. the ones that optimize a certain performance metric for the corresponding model), and consider the weights of each module's output in the overall decision rule for the final output of the system that comprises all these modules.


This work really requires an optimizer, since often the number of variables involved is considerable.
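As a small, hypothetical example of this kind of weight tuning, the sketch below combines the outputs of two imaginary regression modules and brute-forces the mixing weight that minimizes the overall error (a real system with many modules would use a proper optimizer instead of a grid search):

```python
import numpy as np

rng = np.random.default_rng(2)

y_true = rng.normal(size=300)
# hypothetical outputs of two modules, each noisy in its own way
pred_a = y_true + 0.3 * rng.normal(size=300)
pred_b = y_true + 0.6 * rng.normal(size=300)

# brute-force search over the mixing weight of the two modules
weights = np.linspace(0.0, 1.0, 101)
errors = [np.mean((w * pred_a + (1 - w) * pred_b - y_true) ** 2) for w in weights]
best_w = float(weights[int(np.argmin(errors))])
```

By construction, the best combined model is at least as accurate as either module on its own, which is the whole point of optimizing the ensemble weights.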


In general, if you are tackling a predictive analytics problem with a dataset whose dimensionality you have reduced through feature selection, it can be effectively processed through a Fuzzy Inference System (FIS). In addition, if interpretability is a key requirement for your data model, using a FIS is a good strategy to follow.


Finally, if you already have a set of heuristic rules at your disposal from an existing predictive analytics system, then you can use a FIS to merge those rules with some new ones that the FIS creates. This way, you won’t have to start from square one when developing your solution.


Novel AI systems tend to be less predictable and, as a result, somewhat unreliable. They may work well for a certain dataset, but that performance may not hold true with other datasets.


That’s why it is critical to try out different AI systems before settling on one to use as your main data model. In many cases, optimizing a certain AI system may yield better performance, despite the time and resources it takes to optimize.


Striking the balance between exploring various alternatives and digging deeper into existing ones is something that comes about with experience.


Moreover, it’s a good idea to set your project demands and user requirements beforehand. Knowing what is needed can make the selection of your AI system (or if you are more adventurous, your design of a new one) much easier and more straightforward.


For example, if you state early on that interpretability is more important than performance, this will affect which model you decide to use. Make sure you understand what you are looking for in an AI system from the beginning, as this is bound to help you significantly in making the optimal choice.


Although AI systems have a lot to offer in all kinds of data science problems, they are not panaceas. If the data at your disposal is not of high veracity, meaning not of good quality or reliability, no AI system can remedy that.


All AI systems function based on the data we train them with; if the training data is very noisy, biased, or otherwise problematic, their generalizations are not going to be any better.


This underlines the importance of data engineering and of utilizing data from various sources, thereby maximizing your chances of creating a robust and useful data model. This is also why it’s good to always keep a human in the loop when it comes to data science projects, even (or perhaps especially) when they involve AI.



Optimization is an AI-related process for finding the maximum or minimum of a given function by tweaking the values of its variables. It is an integral part of many other systems (including ANNs) and comprises both deterministic and stochastic methods.


Optimization systems (or “optimizers,” as they are often called) can be implemented in all programming languages since their main algorithms are fairly straightforward.
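As a minimal illustration of such an optimizer, the following pure-Python snippet implements the simplest deterministic method, gradient descent, on a toy function (the function and all settings are chosen purely for demonstration):

```python
# minimize f(x, y) = (x - 3)**2 + (y + 1)**2 with plain gradient descent
def grad(x, y):
    # partial derivatives of f with respect to x and y
    return 2 * (x - 3), 2 * (y + 1)

x, y, lr = 0.0, 0.0, 0.1
for _ in range(200):
    gx, gy = grad(x, y)
    x -= lr * gx  # step against the gradient
    y -= lr * gy
# (x, y) converges to the minimum at (3, -1)
```

Stochastic optimizers (e.g. simulated annealing or particle swarm methods) follow the same "tweak the variables, evaluate the function" loop, but inject randomness to escape local optima.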


Fuzzy Logic (FL) is an AI methodology that attempts to model imprecise data as well as uncertainty. Systems employing FL are referred to as Fuzzy Inference Systems (FIS).


These systems involve the development and use of fuzzy rules, which automatically link features to the target variable. A FIS is great for datasets of small dimensionality since it doesn’t scale well as the number of features increases.
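To give a flavor of how fuzzy sets model imprecision, here is a minimal sketch (the "temperature" sets and their boundaries are invented for illustration): a crisp value can belong to several fuzzy sets at once, to different degrees.

```python
def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set rising at a, peaking at b, falling at c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# fuzzy sets for a hypothetical "temperature" feature (degrees Celsius)
def cold(t): return triangular(t, -10, 0, 15)
def warm(t): return triangular(t, 5, 20, 30)
def hot(t): return triangular(t, 25, 35, 50)

t = 18
degrees = {"cold": cold(t), "warm": warm(t), "hot": hot(t)}
```

Fuzzy rules (e.g. "IF temperature is warm THEN demand is moderate") then operate on these membership degrees rather than on the raw values.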


Artificial creativity (AC) is an AI methodology of sorts that creates new information based on patterns derived from the data it is fed. It has many applications in the arts and industrial design.


This methodology could also be useful in data science, through the creation of new data points for sensitive datasets (for example, where data privacy is important).


Artificial Emotional Intelligence (AEI) is another alternative AI, emulating human emotions. Currently, its applications are limited to comprehension.


Although speculative, if a truly novel AI methodology were to arise in the near future, it would probably combine the characteristics of existing AI systems, with an emphasis on interpretability.


Even though theoretically possible, an AI system that can design and build other AI systems is not a trivial task. A big part of this involves the excessive risks of using such a system since we would have little control over the result.


It’s important to understand the subtle differences between all these methodologies, as well as their various limitations. Most importantly, for data science, there is no substitute for high veracity data.


Building a DL Network Using MXNet

We’ll begin our in-depth examination of DL frameworks with one that seems among the most promising: Apache’s MXNet. We’ll cover its core components, including the Gluon interface, NDArrays, and the MXNet package in Python.


You will learn how to save your work (such as the networks you have trained) in data files, along with some other useful things to keep in mind about MXNet.


MXNet supports a variety of programming languages through its API, most of which are useful for data science. Languages like Python, Julia, Scala, R, Perl, and C++ have their own wrappers of the MXNet system, which makes them easily integrated with your pipeline.


Also, MXNet allows for parallelism, letting you take full advantage of your machine’s additional hardware resources, such as extra CPUs and GPUs. This makes MXNet quite fast, which is essential when tackling computationally heavy problems, like the ones found in most DL applications.


Interestingly, the DL systems you create in MXNet can be deployed on all kinds of computer platforms, including smart devices.


This is possible through a process called amalgamation, which ports a whole system into a single file that can then be executed as a standalone program. Amalgamation in MXNet was created by Jack Deng, and involves the development of .cc files, which use the BLAS library as their only dependency.


Files like this tend to be quite large (more than 30000 lines long).

There is also the option of compiling .h files using a program called emscripten.

This program is independent of any library and can be used by other programming languages with the corresponding API.


Finally, there exist several tutorials for MXNet, should you wish to learn more about its various functions. Because MXNet is an open-source project, you can even create your own tutorial, if you are so inclined.


What’s more, it is a cross-platform tool, running on all major operating systems. MXNet has been around long enough that it is a topic of much research.


Core components: the Gluon interface

Gluon is a simple interface for all your DL work using MXNet. You install it on your machine just like any Python library:

pip install mxnet --pre --user


The main selling point of Gluon is that it is straightforward. It offers an abstraction of the whole network building process, which can be intimidating for people new to the craft.


Also, Gluon is very fast, not adding any significant overhead to the training of your DL system. Moreover, Gluon can handle dynamic graphs, offering some malleability in the structure of the ANNs created. Finally, Gluon has an overall flexible structure, making the development process for any ANN less rigid.


Naturally, for Gluon to work, you must have MXNet installed on your machine (although you don’t need to if you are using the Docker container provided with this blog). This is achieved using the familiar pip command:

pip install mxnet --pre --user


Because of its utility and excellent integration with MXNet, we’ll be using Gluon throughout this blog, as we explore this DL framework. However, to get a better understanding of MXNet, we’ll first briefly consider how you can use some of its other functions (which will come in handy for one of the case studies we examine later).



The NDArray is a particularly useful data structure that’s used throughout an MXNet project. NDArrays are essentially NumPy arrays, but with the added capability of asynchronous CPU processing.


They are also compatible with distributed cloud architectures, and can even utilize automatic differentiation, which is particularly useful when training a deep learning system, but NDArrays can be effectively used in other ML applications too. NDArrays are part of the MXNet package, which we will examine shortly. You can import the NDArrays module as follows:

from mxnet import nd

To create a new NDArray consisting of 4 rows and 5 columns, for example, you can type the following:

nd.empty((4, 5))


The output will differ every time you run it since the framework will allocate whatever value it finds in the parts of the memory that it allocates to that array. If you want the NDArray to have just zeros instead, type:

nd.zeros((4, 5))

To find the number of rows and columns of a variable holding an NDArray, you use the .shape attribute, just like in NumPy:

x = nd.empty((2, 7))

x.shape # returns (2, 7)


Finally, if you want to find the total number of elements in an NDArray, you use the .size attribute:

x.size # returns 14 for the array above

The operations in an NDArray are just like the ones in NumPy, so we won’t elaborate on them here. Contents are also accessed in the same way, through indexing and slicing.


Should you want to turn an NDArray into a more familiar data structure from the NumPy package, you can use the asnumpy() function:

y = x.asnumpy()

The reverse can be achieved using the array() function:

z = nd.array(y)


One of the distinguishing characteristics of NDArrays is that they can assign different computational contexts to different arrays—either on the CPU or on a GPU attached to your machine (this is referred to as “context” when discussing NDArrays).


This is made possible by the ctx parameter in all the package’s relevant functions. For example, to create an array of zeros and assign it to the first GPU, simply type:

a = nd.zeros(shape=(5,5), ctx=mx.gpu(0))


Of course, the data assigned to a particular processing unit is not set in stone. It is easy to copy data to a different location, linked to a different processing unit, using the copyto() function:

y = x.copyto(mx.gpu(1)) # copy the data of NDArray x to the 2nd GPU

You can find the context of a variable through the .context attribute:

x.context

It is often more convenient to define the context of both the data and the models, using a separate variable for each. For example, say that your DL project uses data that you want to be processed by the CPU, and a model that you prefer to be handled by the first GPU. In this case, you’d type something like:

DataCtx = mx.cpu()

ModelCtx = mx.gpu(0)

MXNet package in Python


The mxnet package (typed with all lower-case letters in Python) is a very robust and self-sufficient library. It provides deep learning capabilities through the MXNet framework. Importing this package in Python is fairly straightforward:

import mxnet as mx


If you want to perform some additional processes that make the MXNet experience even better, it is highly recommended that you first install the following packages on your computer:

graphviz (ver. 0.8.1 or later)

requests (ver. 2.18.4 or later)

numpy (ver. 1.13.3 or later)

You can learn more about the MXNet package through the corresponding GitHub repository.


MXNet in action

Now let’s take a look at what we can do with MXNet, using Python, on a Docker image with all the necessary software already installed. We’ll begin with a brief description of the datasets we’ll use, and then proceed to a couple of specific DL applications using that data (namely classification and regression).


Upon mastering these, you can explore some more advanced DL systems of this framework on your own.


Datasets description

In this section, we’ll introduce the two synthetic datasets that we prepared: the first is for classification, and the second for regression.


The reason we use synthetic datasets in these exercises is to maximize our understanding of the data, so that we can evaluate the results of the DL systems independently of data quality.


The first dataset comprises 4 variables: 3 features and 1 label. With 250,000 data points, it is adequately large for a DL network to work with. Its small dimensionality makes it ideal for visualization (see Figure 2). It is also made to exhibit a great deal of non-linearity, making it a good challenge for any data model (though not too hard for a DL system).


Furthermore, classes 2 and 3 of this dataset are close enough to be confusing, but still distinct. This makes them a good option for a clustering application, as we’ll see later.


The second dataset is somewhat larger, comprising 21 variables—20 of which are the features used to predict the last, which is the target variable. With 250,000 data points, again, it is ideal for a DL system. Note that only 10 of the 20 features are relevant to the target variable (which is a combination of these 10).


A bit of noise is added to the data to make the whole problem a bit more challenging. The remaining 10 features are just random data that must be filtered out by the DL model.


Relevant or not, this dataset has enough features altogether to render a dimensionality reduction application worthwhile. Naturally, due to its dimensionality, we cannot plot this dataset.
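Although we don’t know the exact generator behind the second dataset, data with the described structure can be sketched in NumPy as follows (the weights on the relevant features and the noise level are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

n, nf, n_relevant = 250_000, 20, 10
X = rng.normal(size=(n, nf))

# the target depends only on the first 10 features; the rest are distractors
true_w = np.ones(n_relevant)  # each relevant feature contributes equally
y = X[:, :n_relevant] @ true_w + 0.1 * rng.normal(size=n)
```

A model trained on such data has to learn to ignore the ten irrelevant columns, which is exactly the filtering challenge described above.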


Loading a dataset into an NDArray

Let’s now take a look at how we can load a dataset in MXNet so that we can process it with a DL model later on. First, let’s start with setting some parameters:

DataCtx = mx.cpu() # assign context of the data used

BatchSize = 64 # batch parameter for dataloader object

r = 0.8 # ratio of training data

nf = 3 # number of features in the dataset (for the classification problem)

Now, we can import the data as we’d normally do in a conventional DS project, but this time store it in NDArrays instead of Pandas or NumPy arrays:

with open("../data/data1.csv") as f:
    data_raw =

lines = data_raw.splitlines() # split the data into separate lines
ndp = len(lines) # number of data points

X = nd.zeros((ndp, nf), ctx=DataCtx)
Y = nd.zeros((ndp, 1), ctx=DataCtx)

for i, line in enumerate(lines):
    tokens = line.split()
    Y[i] = int(tokens[0]) # the first token is the label
    for token in tokens[1:]:
        index = int(token[:-2]) - 1 # feature index encoded in the token
        X[i, index] = 1


Now we can split the data into a training set and a testing set, so that we can use them both to build and to validate our classification model:

import numpy as np # we’ll be needing this package as well

data = nd.concat(X, Y, dim=1) # features and labels in a single array
n = int(np.round(ndp * r)) # number of training data points
train = data[:n, :] # training set partition
test = data[n:, :] # testing set partition

data_train =[:, :3], train[:, 3]), batch_size=BatchSize, shuffle=True)

data_test =[:, :3], test[:, 3]), batch_size=BatchSize, shuffle=True)



We’ll then need to repeat the same process to load the second dataset— this time using data2.csv as the source file. Also, to avoid confusion with the data loader objects of dataset 1, you can name the new data loaders data_train2 and data_test2, respectively.



Now let’s explore how we can use this data to build an MLP system that can discern the different classes within the data we have prepared. For starters, let’s see how to do this using the MXNet package on its own; then we’ll examine how the same thing can be achieved using Gluon.

First, let’s define some constants that we’ll use later to build, train, and test the MLP network:

nhn = 256 # number of hidden nodes for each layer

WeightScale = 0.01 # scale multiplier for weights

ModelCtx = mx.cpu() # assign context of the model itself

no = 3 # number of outputs (classes)

ne = 10 # number of epochs (for training)

lr = 0.001 # learning rate (for training)

sc = 0.01 # smoothing constant (for training)

ns = test.shape[0] # number of samples (for testing)


Next, let’s initialize the network’s parameters (weights and biases) for the first layer:

W1 = nd.random_normal(shape=(nf, nhn), scale=WeightScale, ctx=ModelCtx)

b1 = nd.random_normal(shape=nhn, scale=WeightScale, ctx=ModelCtx)

And do the same for the second layer:

W2 = nd.random_normal(shape=(nhn, nhn), scale=WeightScale, ctx=ModelCtx)

b2 = nd.random_normal(shape=nhn, scale=WeightScale, ctx=ModelCtx)

Then let’s initialize the output layer and aggregate all the parameters into a single data structure called params:

W3 = nd.random_normal(shape=(nhn, no), scale=WeightScale, ctx=ModelCtx)

b3 = nd.random_normal(shape=no, scale=WeightScale, ctx=ModelCtx)

params = [W1, b1, W2, b2, W3, b3]

Finally, let’s allocate some space for a gradient for each one of these parameters:

for param in params:
    param.attach_grad()

Remember that without any non-linear functions in the MLP’s neurons, the whole system would be too rudimentary to be useful. We’ll make use of the ReLU and the Softmax functions as activation functions for our system:

def relu(X):
    return nd.maximum(X, nd.zeros_like(X))

def softmax(y_linear):
    exp = nd.exp(y_linear - nd.max(y_linear))
    partition = nd.nansum(exp, axis=0, exclude=True).reshape((-1, 1))
    return exp / partition


Note that the Softmax function will be used in the output neurons, while the ReLU function will be used in all the remaining neurons of the network.


For the cost function of the network (or, in other words, the fitness function of the optimization method under the hood), we’ll use the cross-entropy function:

def cross_entropy(yhat, y):
    return - nd.nansum(y * nd.log(yhat), axis=0, exclude=True)


To make the whole system a bit more efficient, we can combine the softmax and the cross-entropy functions into one, as follows:

def softmax_cross_entropy(yhat_linear, y):
    return - nd.nansum(y * nd.log_softmax(yhat_linear), axis=0, exclude=True)
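The benefit of merging the two functions is numerical stability: working with the log-softmax directly avoids overflowing the exponentials for large outputs. A NumPy sketch of the same idea (independent of MXNet, shown here purely to illustrate the trick):

```python
import numpy as np

def log_softmax(z):
    # subtracting the max prevents overflow in exp() for large logits
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def stable_softmax_cross_entropy(z, y_one_hot):
    return -(y_one_hot * log_softmax(z)).sum(axis=-1)

logits = np.array([[1000.0, 0.0, -1000.0]])  # would overflow a naive softmax
target = np.array([[1.0, 0.0, 0.0]])
loss = stable_softmax_cross_entropy(logits, target)
```

A naive implementation (exponentiate first, then take the log) would return NaN for these logits; the combined version stays finite.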


After all this, we can now define the function of the whole neural network, based on the above architecture:

def net(X):
    h1_linear =, W1) + b1
    h1 = relu(h1_linear)
    h2_linear =, W2) + b2
    h2 = relu(h2_linear)
    yhat_linear =, W3) + b3
    return yhat_linear

The optimization method for training the system must also be defined.

In this case we’ll utilize a form of Gradient Descent:

def SGD(params, lr):
    for param in params:
        param[:] = param - lr * param.grad

For the purposes of this example, we’ll use a simple evaluation metric for the model: accuracy rate. Of course, this needs to be defined first:

def evaluate_accuracy(data_iterator, net):
    numerator = 0.
    denominator = 0.
    for i, (data, label) in enumerate(data_iterator):
        data = data.as_in_context(ModelCtx).reshape((-1, nf))
        label = label.as_in_context(ModelCtx)
        output = net(data)
        predictions = nd.argmax(output, axis=1)
        numerator += nd.sum(predictions == label)
        denominator += data.shape[0]
    return (numerator / denominator).asscalar()


Now we can train the system as follows:

for e in range(ne):
    cumulative_loss = 0
    for i, (data, label) in enumerate(data_train):
        data = data.as_in_context(ModelCtx).reshape((-1, nf))
        label = label.as_in_context(ModelCtx)
        label_one_hot = nd.one_hot(label, no)
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label_one_hot)
        loss.backward()
        SGD(params, lr)
        cumulative_loss += nd.sum(loss).asscalar()
    test_accuracy = evaluate_accuracy(data_test, net)
    train_accuracy = evaluate_accuracy(data_train, net)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" %
          (e, cumulative_loss / n, train_accuracy, test_accuracy))

Finally, we can use the system to make some predictions, using the following code:

def model_predict(net, data):
    output = net(data)
    return nd.argmax(output, axis=1)

SampleData =[:ns, :3], test[:ns, 3]), ns, shuffle=True)

for i, (data, label) in enumerate(SampleData):
    data = data.as_in_context(ModelCtx)
    pred = model_predict(net, data.reshape((-1, nf)))
    print('model predictions are:', pred)
    print('true labels :', label)



If you run the above code (preferably in the Docker environment provided), you will see that this simple MLP system does a good job at predicting the classes of some unknown data points—even if the class boundaries are highly non-linear.


Experiment with this system more and see how you can improve its performance even further, using the MXNet framework.


Now we’ll see how we can significantly simplify all this by employing the Gluon interface. First, let’s define a Python class covering some common cases of multi-layer perceptrons, by subclassing gluon.Block into something that can be leveraged to gradually build a neural network consisting of multiple layers (i.e. an MLP):

class MLP(gluon.Block):
    def __init__(self, **kwargs):
        super(MLP, self).__init__(**kwargs)
        with self.name_scope():
            self.dense0 = gluon.nn.Dense(64) # architecture of 1st layer (hidden)
            self.dense1 = gluon.nn.Dense(64) # architecture of 2nd layer (hidden)
            self.dense2 = gluon.nn.Dense(3) # architecture of 3rd layer (output)

    def forward(self, x): # process data (x) by passing it forward (towards the output layer)
        x = nd.relu(self.dense0(x)) # outputs of first hidden layer
        x = nd.relu(self.dense1(x)) # outputs of second hidden layer
        x = self.dense2(x) # outputs of final layer (output)
        return x


Of course, this is just an example of how you can define an MLP using Gluon, not a one-size-fits-all kind of solution. You may want to define the MLP class differently since the architecture you use will have an impact on the system’s performance. (This is particularly true for complex problems where additional hidden layers would be useful.)


However, if you find what follows too challenging, and you don’t have the time to assimilate the theory behind DL systems, you can use an MLP object like the one above for your project.


Since DL systems are rarely as compact as the MLP above, and since we often need to add more layers (which would be cumbersome in the above approach), it is common to use a different class called Sequential.


After we define the number of neurons in each hidden layer and specify the activation function for these neurons, we can build an MLP like a ladder, with each step representing one layer in the MLP:

nhn = 64 # number of hidden neurons (in each layer)
af = "relu" # activation function to be used in each neuron

net = gluon.nn.Sequential()
with net.name_scope():
    net.add(gluon.nn.Dense(nhn, activation=af))
    net.add(gluon.nn.Dense(nhn, activation=af))
    net.add(gluon.nn.Dense(no)) # output layer, with one neuron per class


This takes care of the architecture for us. To make the above network functional, we’ll first need to initialize it:

sigma = 0.1 # sigma for the distribution of the ANN connection weights
ModelCtx = mx.cpu()
lr = 0.01 # learning rate
oa = 'sgd' # optimization algorithm

net.collect_params().initialize(mx.init.Normal(sigma=sigma), ctx=ModelCtx)

softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss()

trainer = gluon.Trainer(net.collect_params(), oa, {'learning_rate': lr})

ne = 10 # number of epochs for training


Next, we must define how we assess the network’s progress, through an evaluation metric function. For the purposes of simplicity, we’ll use the standard accuracy rate metric:

def AccuracyEvaluation(iterator, net):
    acc = mx.metric.Accuracy()
    for i, (data, label) in enumerate(iterator):
        data = data.as_in_context(ModelCtx).reshape((-1, 3))
        label = label.as_in_context(ModelCtx)
        output = net(data)
        predictions = nd.argmax(output, axis=1)
        acc.update(preds=predictions, labels=label)
    return acc.get()


Finally, it’s time to train and test the MLP, using the aforementioned settings:

for e in range(ne):
    cumulative_loss = 0
    for i, (data, label) in enumerate(data_train):
        data = data.as_in_context(ModelCtx).reshape((-1, 3))
        label = label.as_in_context(ModelCtx)
        with autograd.record():
            output = net(data)
            loss = softmax_cross_entropy(output, label)
        loss.backward()
        trainer.step(data.shape[0])
        cumulative_loss += nd.sum(loss).asscalar()
    train_accuracy = AccuracyEvaluation(data_train, net)
    test_accuracy = AccuracyEvaluation(data_test, net)
    print("Epoch %s. Loss: %s, Train_acc %s, Test_acc %s" %
          (e, cumulative_loss / ns, train_accuracy, test_accuracy))


Running the above code should yield similar results to those from conventional MXNet commands.

To make things easier, we’ll rely on the Gluon interface in the example that follows. Nevertheless, we still recommend that you experiment with the standard MXNet functions afterward, should you wish to develop your own architectures (or better understand the theory behind DL).



Creating a regression MLP system is similar to creating a classification one, with some differences. The regression case is generally simpler, since regressors are typically lighter architecturally than classifiers. For this example, we’ll use the second dataset.

First, let’s start by importing the necessary classes from the MXNet package and setting the context for the model:

import mxnet as mx

from mxnet import nd, autograd, gluon

ModelCtx = mx.cpu()


To load data to the model, we’ll use the data loaders created previously (data_train2 and data_test2). Let’s now define some basic settings and build the DL network gradually:

nf = 20 # we have 20 features in this dataset

sigma = 1.0 # sigma for the distribution of the ANN connection weights

net = gluon.nn.Dense(1, in_units=nf) # the "1" here is the number of output neurons, which is 1 in regression

Let’s now initialize the network with some random values for the weights and biases:

net.collect_params().initialize(mx.init.Normal(sigma=sigma), ctx=ModelCtx)


Just like with any other DL system, we need to define the loss function. Using this function, the system understands how much each deviation from the target variable’s values costs. Cost functions can also penalize model complexity, since overly complex models are prone to overfitting:

square_loss = gluon.loss.L2Loss()


Now it’s time to train the network using the data at hand. After we define some essential parameters (just like in the classification case), we can create a loop for the network to train:

ne = 10 # number of epochs for training

loss_sequence = [] # cumulative loss for the various epochs

nb = ns / BatchSize # number of batches

trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01}) # the optimizer; the learning rate here is illustrative

for e in range(ne):
    cumulative_loss = 0
    for i, (data, label) in enumerate(data_train2): # inner loop
        data = data.as_in_context(ModelCtx)
        label = label.as_in_context(ModelCtx)
        with autograd.record():
            output = net(data)
            loss = square_loss(output, label)
        loss.backward()
        trainer.step(BatchSize)
        cumulative_loss += nd.mean(loss).asscalar()
    print("Epoch %s, loss: %s" % (e, cumulative_loss / ns))
    loss_sequence.append(cumulative_loss)


If you wish to view the parameters of the model, you can do so by collecting them into a dictionary structure:

params = net.collect_params()

for param in params.values():
    print(param.name, param.data())

Printing out the parameters may not seem very useful, as there are usually too many of them, especially once we add new layers to the system. Adding a hidden layer would look something like this (rebuilding the network as a Sequential model):

net = gluon.nn.Sequential()

with net.name_scope():
    net.add(gluon.nn.Dense(nhn, activation="relu"))
    net.add(gluon.nn.Dense(1))

where nhn is the number of neurons in that additional hidden layer. Note that the network requires an output layer with a single neuron, so be sure to insert any additional layers between the input and output layers.


Creating checkpoints for models developed in MXNet

As training a system may take some time, the ability to save and load DL models and data through this framework is essential. We must create “checkpoints” in our work so that we can pick up from where we’ve stopped, without having to recreate a network from scratch every time. This is achieved through the following process.


First import all the necessary packages and classes, and then define the context parameter:

import mxnet as mx

from mxnet import nd, autograd, gluon

import os

ctx = mx.cpu() # context for NDArrays

We’ll then save the data, but let’s put some of it into a dictionary first:

data_dict = {"X": X, "Y": Y} # named data_dict rather than dict, to avoid shadowing the built-in

Now we’ll set the name of the file and save it:

filename = "test.dat"

nd.save(filename, data_dict)

We can verify that everything has been saved properly by loading that checkpoint as follows:

Z = nd.load(filename)



When using gluon, there is a shortcut for saving all the parameters of the DL network we have developed. It involves the save_params() function:

filename = "MyNet.params"

net.save_params(filename)
To restore the DL network, however, you’ll need to recreate the original network’s architecture, and then load the original network’s parameters from the corresponding file:

net2 = gluon.nn.Sequential()

with net2.name_scope():
    net2.add(gluon.nn.Dense(num_hidden, activation="relu"))
    net2.add(gluon.nn.Dense(num_hidden, activation="relu"))
    net2.add(gluon.nn.Dense(num_outputs)) # output layer, matching the original network's architecture

net2.load_params(filename, ctx=ctx)


It’s best to save your work at different parts of the pipeline, and give the checkpoint files descriptive names. Keep in mind that there is no “untraining” option, and optimal performance may well occur before the training phase completes.


Because of this, we may want to create a checkpoint after each training epoch, so that once we find out at which point the optimal performance was achieved, we can revert to the corresponding checkpoint.
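That strategy can be sketched as follows: save a checkpoint after every epoch, record the validation loss of each, and afterwards pick the epoch whose checkpoint to reload (the helper below is hypothetical, not part of MXNet):

```python
def best_checkpoint(val_losses):
    # Index of the epoch with the lowest validation loss,
    # i.e. the checkpoint we would revert to after training
    best = 0
    for e, loss in enumerate(val_losses):
        if loss < val_losses[best]:
            best = e
    return best
```

With per-epoch files named along the lines of "net_epoch_3.params", this index tells us which file to pass to load_params().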


Moreover, for the computer to make sense of these files when you load them in your programming environment, you’ll need to have the nd class of MXNet in memory, in whatever programming language you are using.


MXNet tips

The MXNet framework is a very robust and versatile platform for a variety of DL systems. Although we demonstrated its functionality in Python, it is equally powerful when used with other programming languages.


In addition, the Gluon interface is a useful add-on. If you are new to DL applications, we recommend you use Gluon as your go-to tool when employing the MXNet framework. This doesn’t mean that the framework itself is limited to Gluon, though, since the MXNet package is versatile and robust in a variety of programming platforms.


Moreover, in this blog, we covered just the basics of MXNet and Gluon. Going through all the details of these robust systems would take a whole blog! Learn more about the details of the Gluon interface in the Straight Dope tutorial, which is part of the MXNet documentation.


Finally, the examples in this blog are executed in a Docker container; as such, you may experience some lagging. When developing a DL system on a computer cluster, of course, it is significantly faster.



MXNet is a deep learning framework developed by Apache. It exhibits ease of use, flexibility, and high speed, among other perks. All of this makes MXNet an attractive option for DL, in a variety of programming languages, including Python, Julia, Scala, and R.


MXNet models can be deployed to all kinds of computing systems, including smart devices. This is achieved by exporting them as a single file, to be executed by these devices.


Gluon is a package that provides a simple interface for all your DL work using MXNet. Its main benefits include ease of use, no significant overhead, ability to handle dynamic graphs for your ANN models, and flexibility.


NDArrays are useful data structures when working with the MXNet framework. They can be imported as modules from the MXNet package as nd. They are similar to NumPy arrays, but more versatile and efficient when it comes to DL applications.


The MXNet package is Python’s API for the MXNet framework and contains a variety of modules for building and using DL systems.

Data can be loaded into MXNet by reading it into an NDArray directly from the data file, and then creating a data loader object to feed the data into the model built afterward.


Classification in MXNet involves creating an MLP (or some other DL network), training it, and using it to predict unknown data, allocating one neuron for every class in the dataset. Classification is significantly simpler when using Gluon.


Regression in MXNet is like classification, but the output layer has a single neuron. Also, additional care must be taken so that the system doesn’t overfit; therefore we often use some regularization function such as L2.


Creating project checkpoints in MXNet involves saving the model and any other relevant data into NDArrays so that you can retrieve them at another time. This is also useful for sharing your work with others, for reviewing purposes.


Remember that MXNet generally runs faster than on the Docker container used in this blog’s examples, and that it is equally useful and robust in other programming languages.


Building an Advanced Deep Learning System

In the previous blogs, we discovered how to build deep learning models using the MXNet, TensorFlow, and Keras frameworks. Recall that the models we used in those blogs are known as Artificial Neural Networks, or ANNs for short. Recent research has produced a broad family of neural networks with special architectures, different from the basic ANNs we built so far.


In this blog, we introduce two of the most popular alternative architectures, which are quite useful for tasks like image classification and natural language translation.


The first model we cover is the Convolutional Neural Network. These models perform well in computer vision tasks; in some domains, like image recognition, they have already surpassed human performance.


The second model we cover is the Recurrent Neural Network, which is very convenient for sequence modeling, including machine translation and speech recognition. Although we restrict our attention to these two types of neural networks in this blog, you can read more on other prominent network architectures in the appendices of this blog.


Convolutional Neural Networks (CNN)

One of the most interesting DL systems is the Convolutional Neural Network (usually abbreviated CNN, though some use the term ConvNet). These are DL networks that are very effective in solving image- or sound-related problems, particularly within the classification methodology.


Over the years, though, their architecture has evolved and applicability has expanded to include a variety of cases, such as NLP (natural language processing—the processing and classification of various human sentences).


Furthermore, convolutional layers used in CNNs can be integrated as components of more advanced DL systems, such as GANs. Let’s start by describing the architecture and the building blocks of CNNs.


CNN components

CNNs have evolved considerably since their introduction in the 1980s. However, most of them use some variation of the LeNet architecture, which was introduced by Yann LeCun and flourished in the 1990s.


Back then, CNNs were used primarily for character recognition tasks; this changed as they became more versatile in other areas like object detection and segmentation.


A CNN is composed of several layers, each specialized in some way. These layers eventually develop a series of meta-features, which are then used to classify the original data into one of the classes, represented as separate neurons in the output layer.


In the output layer, a function like softmax is usually used to calculate scores for each class. These scores can be interpreted as probabilities. For example, if the score for the first class is 0.20, we can say that the probability that the observation belongs to the first class is 20%.
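Converting output-layer scores into probabilities is typically done with the softmax function; a minimal sketch:

```python
import math

def softmax(scores):
    # Exponentiate each score and normalize so the results sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```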


Data flow and functionality

The data in a CNN flows the same way as in a basic DL system like an MLP. However, a CNN’s functionality is characterized by a series of key operations that are unique to this type of DL network. Namely, functionality is described in terms of:

  • 1. Convolution
  • 2. Non-linearity (usually through the ReLU function, though tanh and sigmoid are also viable options)
  • 3. Pooling (a special kind of sub-sampling)
  • 4. Classification through the fully connected layer(s)


We’ll go over each of these operations below. Before that, note that all the data of the original image (or audio clip, etc.) takes the form of a series of integer features.


In the case of an image, each of these features corresponds to a particular pixel in that image. However, a CNN can also use sensor data as input, making it a very versatile system.



Convolution

This is where CNNs get their name. The idea of convolution is to extract features from the input image in a methodical and efficient manner. The key benefit of this process is that it considers the spatial relationships of the pixels. This is accomplished by using a small square matrix (aka the “filter”) that traverses the image matrix in pixel-long steps.


From a programmatic point of view, it is helpful to think of the input to a convolutional layer as a two-dimensional matrix. In effect, the convolution operation is just a series of matrix multiplications, where the filter matrix is multiplied by a shifted part of the input matrix each time, and the elements of the resulting matrix are summed.
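The multiply-and-sum process described above can be sketched in plain Python (a naive implementation, with no padding or striding options):

```python
def convolve2d(image, kernel):
    # Slide the filter over the image one pixel at a time; at each
    # position, multiply the overlapping elements and sum the products
    kh, kw = len(kernel), len(kernel[0])
    oh = len(image) - kh + 1
    ow = len(image[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(image[i + a][j + b] * kernel[a][b]
                            for a in range(kh) for b in range(kw))
    return out
```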


This simple mathematical process enables the CNN system to obtain information regarding the local aspects of the data analyzed, giving it a sense of context, which it can leverage in the data science task it undertakes.


The output of this process is represented by a series of neurons comprising the feature map. Normally, more than one filter is used to better capture the subtleties of the original image, resulting in a feature map with a certain “depth” (which is basically a stack of different layers, each corresponding to a filter).



Non-linearity

Non-linearity is essential in all DL systems; since convolution is a linear operation, we need to introduce non-linearity in a different way. One such way is the ReLU function, which is applied to each pixel in the image.


Note that other non-linear functions can also be used, such as the hyperbolic tangent (tanh) or the sigmoid. Descriptions of these functions are in the glossary.



Pooling

Since the feature maps and the results of the non-linear transformations of the original data are rather large, the next step is to make them smaller through a process called pooling.


This involves some summarization operation, such as taking the maximum value (called “max pooling”), the average, or even the sum of a particular neighborhood (e.g. a 3x3 window).


Various experiments have indicated that max pooling yields the best performance. Finally, the pooling process is an effective way to prevent overfitting.
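Max pooling over a neighborhood can be sketched as follows (a naive version that assumes the input dimensions are divisible by the window size):

```python
def max_pool(feature_map, size=2):
    # Keep only the largest value of each size-by-size neighborhood
    oh = len(feature_map) // size
    ow = len(feature_map[0]) // size
    return [[max(feature_map[i * size + a][j * size + b]
                 for a in range(size) for b in range(size))
             for j in range(ow)]
            for i in range(oh)]
```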



Classification

This final part of a CNN’s functionality is almost identical to that of an MLP that uses softmax as the transfer function in the final layer. As inputs, the CNN uses the meta-features created by pooling.


Fully-connected layers in this part of the CNN allow for additional non-linearity and different combinations of these high-level features, yielding a better generalization at a relatively low computational cost.


Training process

When training a CNN, we can use various algorithms; the most popular is backpropagation. Naturally, we must model the outputs using a series of binary (one-hot) vectors, the size of which is the number of classes.
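Encoding the outputs as binary (one-hot) vectors can be sketched like this:

```python
def to_one_hot(label, num_classes):
    # A vector of zeros with a single 1 at the position of the class
    vec = [0] * num_classes
    vec[label] = 1
    return vec
```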


Also, the initial weights in all the connections and the filters are all random. Once the CNN is fully trained, it can be used to identify new images that are related to the predefined classes.


Visualization of a CNN model

Visualizing a CNN is often necessary, as this enables us to better understand the results and decide whether the CNN has been trained properly. This is particularly useful when dealing with image data since we can see how the CNN’s perception of the input image evolves through the various layers.


For a hands-on example, we’ll build a CNN that classifies handwritten digits from the MNIST dataset. Keras’ datasets module already provides this dataset, so no additional download is required. The code below is taken from the official Keras repository.


As usual, we begin by importing the relevant libraries. We should also import the MNIST dataset from the datasets module of Keras:

from __future__ import print_function

import keras

from keras.datasets import mnist

from keras.models import Sequential

from keras.layers import Dense, Dropout, Flatten

from keras.layers import Conv2D, MaxPooling2D

from keras import backend as K


Then we define the batch size as 128, the number of classes as 10 (the number of digits from 0 to 9), the epochs to run the model as 12, and the input image dimension as (28,28), since all of the corresponding images are 28 by 28 pixels:

batch_size = 128

num_classes = 10

epochs = 12

img_rows, img_cols = 28, 28

Next, we obtain the MNIST data and load it to variables, after splitting as train and test sets:

(x_train, y_train), (x_test, y_test) = mnist.load_data()

It is time for some pre-processing—mostly reshaping the variables that hold the data:

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')

x_test = x_test.astype('float32')

x_train /= 255

x_test /= 255

print('x_train shape:', x_train.shape)

print(x_train.shape[0], 'train samples')

print(x_test.shape[0], 'test samples')

Then we convert the vectors that hold classes into binary class matrices:

y_train = keras.utils.to_categorical(y_train, num_classes)

y_test = keras.utils.to_categorical(y_test, num_classes)


After these steps, we are now ready to build our graph, using a sequential model. We first add two convolutional layers on top of each other, then we apply the max-pooling operation to the output of the second convolutional layer.


Next, we apply dropout. Before we feed the resulting output to the dense layer, we flatten our variables, to comply with the input shapes of the dense layer.


The output of this dense layer is regulated with dropout; the resulting output is then fed into the last dense layer for classification. The softmax function is used to turn the results into something that can be interpreted in terms of probabilities. Here is the code snippet of the model building part:

model = Sequential()

model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))

model.add(Conv2D(64, (3, 3), activation='relu'))

model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Dropout(0.25))

model.add(Flatten())

model.add(Dense(128, activation='relu'))

model.add(Dropout(0.5))

model.add(Dense(num_classes, activation='softmax'))


We next compile our model using cross-entropy loss and the Adadelta optimization algorithm. We use accuracy as the evaluation metric, as usual:

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

It is time to train our model on the training set that we separated from the original MNIST dataset before. We just use the fit() function of the model object to train our model:

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))


CNNs can be used in several different applications:

Identifying faces. This application is particularly useful in image analysis cases. It works by first rejecting parts of the image that don’t contain a face (processed in low resolution), then focusing on the parts that do contain a face and drawing the perceived boundaries in high resolution for better accuracy.


Computer vision (CV) in general. Beyond face recognition, CNNs are applied in various other computer vision scenarios. This has been a hot topic for the past decade or so and has yielded a variety of applications.


Self-driving cars. Since CV features heavily in self-driving cars, CNNs are often the AI tool of choice for this technology. Their versatility in the kinds of inputs they accept, and the fact that they have been studied thoroughly, make them the go-to option for NVIDIA’s self-driving car project.


NLP. Due to their high speed and versatility, CNNs lend themselves well to NLP applications.


The key here is to “translate” all the words into the corresponding embeddings, using specialized methods such as GloVe or word2vec. CNNs are a good fit here, since NLP models range from incredibly simple ones, like “bag of words,” to computationally demanding ones, like n-grams.
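The “translation” into embeddings amounts to a lookup table from words to vectors. A toy sketch (the table values below are made up for illustration; real GloVe or word2vec embeddings have hundreds of dimensions and are learned from large corpora):

```python
# Hypothetical 2-dimensional embeddings, purely for illustration
embeddings = {"good": [0.9, 0.1], "movie": [0.4, 0.4], "bad": [-0.8, 0.2]}

def embed(tokens, table, dim=2):
    # Unknown words map to a zero vector in this sketch
    return [table.get(t, [0.0] * dim) for t in tokens]
```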


Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are an interesting type of DL network, widely used for NLP applications. They process data sequentially, resulting in an improved analysis of complex datasets, through the modeling of the temporal aspect of the data at hand. In a way, RNNs mimic human memory; this enables them to understand relationships among the data points like we do.


Interestingly, RNNs can also be used for text-related artificial creativity. It’s common for them to generate text that stylistically resembles some famous writer’s prose or even poems. Because of their popularity, RNNs have a few variants that are even more effective for the tasks in which they specialize.


RNN components

RNNs have “recurrent” as part of their name because they perform the same task for each element of a sequence, while the output depends on the previous computations.


In principle, this architecture makes it possible for the network to consider an unlimited number of previous states of the data. In practice, though, it usually includes merely a few steps. This is enough to give the RNN system a sense of “memory,” enabling it to see each data point within the context of the data points preceding it.


Since the recurrent connections in an RNN are not always easy to depict or comprehend (particularly when trying to analyze its data flow), we often “unfold” them.


This creates a more spread-out version of the same network, where the temporal aspect of the data is more apparent. This process is sometimes referred to as “unrolling” or “unfolding”.


Data flow and functionality

The data in an RNN flows in loops, as the system gradually learns how each data point correlates with some of the previous ones. In this context, the hidden nodes of an RNN (which are often referred to as “states”) are basically the memory of the system.


As you would expect, these nodes have a non-linear activation function such as ReLU or tanh. The final layer before the output usually uses a softmax function, though, so as to approximate probabilities.


Contrary to a traditional DL system, which uses different weights at each layer, an RNN shares the same parameters across all steps. This is because it is basically performing the same task at every step, with the only difference being the inputs. This significantly decreases the total number of parameters it must learn, making the training phase significantly faster and computationally lighter.
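Parameter sharing can be illustrated with a toy scalar RNN: the same two weights are reused at every step, while only the input and the hidden state change (the scalar weights below are illustrative, not trained; real RNNs use weight matrices):

```python
import math

def rnn_states(inputs, w_in, w_rec):
    # The SAME parameters (w_in, w_rec) are applied at every step
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)  # new state from input and previous state
        states.append(h)
    return states
```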


Training process

When training an RNN, we employ many of the same principles as with other DL networks—with a key difference in the training algorithm (which is typically backpropagation). RNNs demand an algorithm that considers the number of steps we needed to traverse before reaching the node when calculating the gradient of the error of each output node.


This variant of the training algorithm is called Backpropagation Through Time (BPTT). Because gradients become unstable (vanishing or exploding) as they propagate through the many steps of an RNN, BPTT is not good at helping the RNN learn long-term dependencies among its data points. Fortunately, this issue is resolved using a specialized architecture called the LSTM, which we’ll discuss in the next section.


RNN variants

When it comes to variants of RNNs, the ones that stand out are Bidirectional RNNs (as well as their “deep” version), LSTMs, and GRUs.


Bidirectional RNNs and their deeper counterparts

A bidirectional RNN is like an ensemble of two RNNs. The key difference between these two networks is that one of them considers previous data points, while the other looks at data points that follow. This way, the two of them together can have a more holistic view of the data at hand, since they know both what’s before and what’s after.


A deep bidirectional RNN is like a regular bidirectional RNN, but with several layers for each time step. This enables a better prediction, but it requires a much larger dataset.


LSTMs and GRUs

Short for Long Short-Term Memory, an LSTM network (or cell) is a special type of RNN that is widely used in NLP problems. It comprises four distinct ANN components that work together to create a kind of memory that is not limited by the training algorithm (as it is in conventional RNNs).


This is possible because LSTMs have an internal mechanism that allows them to selectively forget, and to combine different previous states, in a way that facilitates the mapping of long-term dependencies. 


LSTMs are quite complex; as such, developers quickly sought a more straightforward version. This is where GRUs, or Gated Recurrent Units, come into play. A GRU is basically a lightweight LSTM: a network with two gates, one for resetting and one for updating previous states.


The first gate determines how to best combine the new input with the previous memory, while the second gate specifies how much of the previous memory to hold onto.
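A single GRU step can be sketched with scalars (the weights here are hypothetical and purely illustrative; real GRUs use weight matrices and vector-valued states):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h_prev, wz, uz, wr, ur, wh, uh):
    z = sigmoid(wz * x + uz * h_prev)  # update gate: how much old memory to keep
    r = sigmoid(wr * x + ur * h_prev)  # reset gate: how to combine input with memory
    h_tilde = math.tanh(wh * x + uh * (r * h_prev))  # candidate new state
    return (1 - z) * h_prev + z * h_tilde  # blend previous state with candidate
```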


RNNs in action

Here we provide an example of text classification using the LSTM variant of the RNN. The code is implemented in Python and Keras. The dataset we use is from the IMDB movie database; it is already available with the Keras’ datasets module. The dataset includes IMDB’s users’ comments on movies and their associated sentiments.


Our task is to classify comments as positive or negative sentiments—a binary classification task. The code below is taken from the official Keras repository.


First, we begin by importing the relevant libraries, as usual. Notice that we also import the IMDB dataset from Keras’ datasets module:

from __future__ import print_function

from keras.preprocessing import sequence

from keras.models import Sequential

from keras.layers import Dense, Embedding

from keras.layers import LSTM

from keras.datasets import imdb

Then, we set the maximum number of features to 20,000; the maximum number of words in a text to 80; and batch size to 32:

max_features = 20000

maxlen = 80

batch_size = 32

Next, we load the dataset into some variables, after splitting train and test sets:

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)


We need to pad some text comments, as some of them are shorter than 80 words, and our model only accepts inputs of the same length. In short, padding works by adding a predefined token to a sequence (by default at its start, in Keras) to bring it to the desired length. The code below pads the sequences:

x_train = sequence.pad_sequences(x_train, maxlen=maxlen)

x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
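Keras’ pad_sequences pads at the start of sequences by default; under the hood, that behavior can be sketched like this (a simplified, hypothetical stand-in for the real function, which offers more options):

```python
def pad_seqs(seqs, maxlen, value=0):
    padded = []
    for s in seqs:
        s = list(s)[-maxlen:]  # keep only the last maxlen tokens
        padded.append([value] * (maxlen - len(s)) + s)  # pad at the front
    return padded
```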


Now, we are all set to build our sequential model. First, we add an embedding layer, and then we add an LSTM. Last, we add the dense layer for classification:

model = Sequential()

model.add(Embedding(max_features, 128))

model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))

model.add(Dense(1, activation='sigmoid'))


After we build our model, we can train it on our train set. We use binary cross-entropy loss as our loss function, the Adam algorithm as our optimizer, and accuracy as our evaluation metric:

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

It is time to train our model:

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=10,
          validation_data=(x_test, y_test))

Last, we test the performance of our model using the test set:

score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)

print('Test score:', score)

print('Test accuracy:', acc)


The model achieves almost 82% accuracy on the test set after 10 epochs, which is a satisfactory result for such a simple model. The output of the code above is:


Train on 25000 samples, validate on 25000 samples

Before closing this blog, we will bring to your attention several other advanced DL models (e.g. GANs), contained in appendices of this blog. We strongly encourage you to read those appendices to learn more about DL models. After that, you can find many other useful resources to dig into the details of DL models.


RNNs are ideal tools to solve the following problems:

NLP. As mentioned before, RNNs excel at working with natural language text. Tasks, like predicting the next word or figuring out the general topic of a block of text, are solved well by RNNs.


Text synthesis. A particular NLP application that deserves its own bullet point is text synthesis. This involves creating new streams of words, which is an extension of the “predicting the next word” application. RNNs can create whole paragraphs of text, taking text prediction to a whole new level.


Automated translation. This is a harder problem than it seems, since each language and dialect has its own intricacies (for instance, the order of words in constructing a sentence). To accurately translate something, a computer must process sentences as a whole, something that’s made possible through an RNN model.


Image caption generation. Although this is not an exclusively RNN-related application, it is certainly a valid one. When combined with CNNs, RNNs can generate short descriptions of an image, perfect for captions. They can even evaluate and rank the most important parts of the image, from most to least relevant.


Speech recognition. When the sound of someone talking is transformed into a digitized sound wave, it is not far-fetched to ask an RNN to understand the context of each bit of sound. The next step is turning that into written text, which is quite challenging, but plausible using the same RNN technology.