Deep learning (Best Tutorial 2019)


Deep Learning (DL) is a subset of AI that is used for predictive analytics, using an AI system called an Artificial Neural Network (ANN). Predictive analytics is a group of data science methodologies concerned with predicting certain variables, and includes techniques such as classification and regression. This tutorial explains deep learning through worked examples.


As for an ANN, it is a clever abstraction of the human brain, at a much smaller scale. In practice, ANNs have managed to approximate a very wide range of functions (mappings), which makes them well suited to many data analytics tasks. In data science, ANNs are categorized as machine learning methodologies.


The main drawback DL systems have is that they are “black boxes.” It is exceedingly difficult, and in practice often infeasible, to figure out exactly how their predictions happen, as the data flow within them is extremely complicated.


Deep Learning generally involves large ANNs that are often specialized for specific tasks. Convolutional Neural Networks (CNNs), for instance, are better for processing images, video, and audio data streams.


However, all DL systems share a similar structure. This involves elementary modules called neurons organized in layers, with various connections among them.


These modules can perform some basic transformations (usually non-linear ones) as data passes through them.


Since there is a plethora of potential connections among these neurons, organizing them in a structured way (much like real neurons are organized in networks in brain tissue) yields a more robust and functional form of these modules. This is what an artificial neural network is, in a nutshell.


In general, DL frameworks include tools for building a DL system, methods for testing it, and various other Extract, Transform, and Load (ETL) processes; when taken together, these framework components help you seamlessly integrate DL systems with the rest of your pipeline. We’ll look at this in more detail later in this blog.


Although deep learning systems share some similarities with machine learning systems, certain characteristics make them sufficiently distinct. For example, conventional machine learning systems tend to be simpler and have fewer options for training.


DL systems are noticeably more sophisticated; they each have a set of training algorithms, along with several parameters regarding the systems’ architecture. This is one of the reasons we consider them a distinct framework in data science.


DL systems also tend to be more autonomous than their conventional machine learning counterparts. To some extent, DL systems can do their own feature engineering. More conventional systems tend to require more fine-tuning of the feature set, and sometimes require dimensionality reduction to provide any decent results.


In addition, the generalization of conventional ML systems generally doesn’t improve as much as that of DL systems when additional data is provided. This is one of the key characteristics that make DL systems a preferable option when big data is involved.


Finally, DL systems take longer to train and require more computational resources than conventional ML systems. This is due to their more sophisticated functionality. However, because the work of DL systems is easily parallelizable, they benefit more from modern computing architectures and cloud computing than other predictive analytics systems do.


How deep learning systems work

At their cores, all DL frameworks work similarly, particularly when it comes to the development of DL networks. First, a DL network consists of several neurons organized in layers; many of these are connected to other neurons in other layers. In the simplest DL network, connections take place only between neurons in adjacent layers.


The first layer of the network corresponds to the features of our dataset; the last layer corresponds to its outputs. In the case of classification, each class has its own node, with node values reflecting how confident the system is that a data point belongs to that class.
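To make the idea of per-class confidence values concrete, here is a minimal sketch in plain Python. It assumes the output nodes produce raw scores that are normalized with a softmax function; softmax is a common choice but is not named in the text above, and the scores are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw output-node scores into confidences that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores from three output nodes (one node per class)
raw_scores = [2.0, 1.0, 0.1]
confidences = softmax(raw_scores)

# The predicted class is the node with the highest confidence
predicted_class = confidences.index(max(confidences))
print(confidences, predicted_class)
```

The node with the largest value is taken as the predicted class, while the other values show how much confidence is left for the alternatives.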


The layers in the middle involve some combination of these features. Since they aren’t visible to the end user of the network, they are described as hidden.


The connections among the nodes are weighted, indicating the contribution of each node to the nodes it is connected to in the next layer. The weights are initially randomized when the network object is created, and are then refined as the ANN is trained.


Moreover, each node contains a mathematical function that creates a transformation of the received signal, before it is passed to the next layer.


This is referred to as the transfer function (also known as the activation function). 

Furthermore, each layer has a bias node, which is a constant that appears unchanged on each layer. Just like all the other nodes, the bias node has a weight attached to its output.


However, it has no transfer function. Its weighted value is simply added to the other nodes it is connected to, much like a constant c is added to a regression model in Statistics.


The presence of such a term balances out any bias the other terms inevitably bring to the model, ensuring that the overall bias in the model is minimal. As the topic of bias is a very complex one, we recommend you check out some external resources if you are not familiar with it.


Once the transformed inputs (features) and the biases arrive at the end of the DL network, they are compared with the target variable. The differences that inevitably occur are relayed back to the various nodes of the network, and the weights are changed accordingly.


Then the whole process is repeated until the error margin of the outputs is within a certain predefined level, or until the maximum number of iterations is reached. Iterations of this process are often referred to as training epochs, and the whole process is intimately connected to the training algorithm used.
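The training cycle described above can be sketched for the smallest possible case: a single neuron. This is an illustrative sketch only. The sigmoid transfer function, the mean squared error, the learning rate, and the toy dataset (the logical OR function) are all assumptions chosen to keep the example short.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, targets, lr=0.5, max_epochs=5000, tolerance=0.01, seed=0):
    """Train one sigmoid neuron; stop on small error or on the epoch limit."""
    rng = random.Random(seed)
    n_inputs = len(samples[0])
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]  # random start
    bias = 0.0
    for epoch in range(1, max_epochs + 1):
        # Forward pass: compute the output for every sample
        outputs = [sigmoid(sum(w * x for w, x in zip(weights, s)) + bias)
                   for s in samples]
        # Compare the outputs with the target variable (mean squared error)
        error = sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(samples)
        if error < tolerance:
            break  # the error is within the predefined level
        # Backward pass: relay each difference back and adjust the weights
        for s, o, t in zip(samples, outputs, targets):
            delta = (o - t) * o * (1.0 - o)  # error signal at the weighted sum
            for i in range(n_inputs):
                weights[i] -= lr * delta * s[i]
            bias -= lr * delta
    return weights, bias, error, epoch

# Toy dataset: the logical OR of two binary inputs
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 1]
weights, bias, final_error, epochs_run = train_neuron(samples, targets)
print(epochs_run, round(final_error, 4))
```

The two exit conditions of the loop mirror the two stopping rules in the text: the error threshold (`tolerance`) and the maximum number of epochs (`max_epochs`).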


In fact, the number of epochs used for training a DL network is often set as a parameter and it plays an important role in the ANN’s performance.


All of the data entering a neuron (via connections with neurons of the previous layer, as well as the bias node) is summed, and then the transfer function is applied to the sum, so that the output of that node is y = f(Σ_i w_i x_i + b).


Here w_i is the weight of the connection from node i of the previous layer, x_i is that node’s output, and b is the bias of that layer, which is added once rather than once per input. Also, f() is the mathematical expression of the transfer function.
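As a quick illustration of this formula, the sketch below computes the output of one neuron in plain Python. The input values, weights, and bias are made up, and tanh is used as the transfer function purely as an example; any other transfer function could be passed in instead.

```python
import math

def neuron_output(inputs, weights, bias, transfer=math.tanh):
    """Compute y = f(sum_i(w_i * x_i) + b) for a single neuron."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transfer(weighted_sum)

# Hypothetical values chosen only for illustration
x = [0.5, -1.0, 2.0]   # outputs of the previous layer
w = [0.4, 0.3, -0.2]   # connection weights
b = 0.1                # bias term, added once to the weighted sum

y = neuron_output(x, w, b)
print(y)
```

Here the weighted sum is 0.2 - 0.3 - 0.4 + 0.1 = -0.4, so the neuron outputs tanh(-0.4).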


This relatively simple process is at the core of every ANN. The process is equivalent to that which takes place in a perceptron system—a rudimentary AI model that emulates the function of a single neuron.


Although a perceptron system is never used on its own in practice, it is the most basic element of an ANN, and the first system created using this paradigm.


The function of a single neuron is basically a single, predefined transformation of the data at hand. This can be viewed as a kind of meta-feature of the framework, as it takes a certain input x and after applying a (usually non-linear) function f() to it, x is transformed into something else, which is the neuron’s output y.


While in the majority of cases one single meta-feature would be terrible at predicting the target variable, several of them across several layers can work together quite effectively – no matter how complex the mapping of the original features to the target variable.


The downside is that such a system can easily overfit, which is why the training of an ANN is typically stopped once the error drops below a predefined threshold instead of being continued indefinitely.


This most rudimentary description of a DL network works for networks of the multi-layer perceptron type. Of course, there are several variants beyond this type.


CNNs, for example, contain specialized layers with huge numbers of neurons, while RNNs have connections that go back to previous layers. Additionally, some training algorithms involve pruning nodes of the network to reduce the risk of overfitting.


Once the DL network is trained, it can be used to make predictions about any data similar to the data it was trained on.


Furthermore, its generalization capability is quite good, particularly if the data it is trained on is diverse. What’s more, most DL networks are quite robust when it comes to noisy data, which sometimes helps them achieve even better generalization.


When it comes to classification problems, the performance of a DL system is improved by the class boundaries it creates.


Although many conventional ML systems create straightforward boundary landscapes (e.g. rectangles or simple curves), a DL system creates a more sophisticated line around each class.


This is because the DL system is trying to capture every bit of signal it is given in order to make fewer mistakes when classifying, boosting its raw performance. Of course, this highly complex mapping of the classes makes interpretation of the results a very challenging, if not unfeasible, task. More on that later in this blog.


Main deep learning frameworks

Having knowledge of multiple DL frameworks gives you a better understanding of the AI field. You will not be limited by the capabilities of a specific framework.


For example, some DL frameworks are geared towards a certain programming language, which may make focusing on just that framework an issue, since languages come and go.


After all, things change very rapidly in technology, especially when it comes to software. What better way to shield yourself from any unpleasant developments than to be equipped with a diverse portfolio of DL know-how?


For those keen on the Julia language, there is the Knet framework, which, to the best of our knowledge, is the only deep learning framework written mainly in a high-level language (in this case, Julia). You can learn more about it at its GitHub repository.


MXNet is developed by Apache and it’s Amazon’s favorite framework. Some of Amazon’s researchers have collaborated with researchers from the University of Washington to benchmark it and make it more widely known to the scientific community. 


TensorFlow is probably the most well-known DL framework, partly because it has been developed by Google. As such, it is widely used in the industry and there are many courses and blogs discussing it. 


Keras is a high-level framework; it works on top of TensorFlow (as well as other frameworks like Theano). Its ease of use without losing flexibility or power makes it one of the favorite deep learning libraries today.


Any data science enthusiast who wants to dig into the realm of deep learning can start using Keras with reasonably little effort.


Moreover, Keras’ seamless integration with TensorFlow, plus the official support it gets from Google, have convinced many that Keras will be one of the long-lasting frameworks for deep learning models, and that its corresponding library will continue to be maintained.


How to leverage deep learning frameworks

Deep learning frameworks add value to AI and DS practitioners in various ways. The most important value-adding processes include ETL processes, building data models, and deploying these models. Beyond these main functions, a DL framework may offer other things that a data scientist can leverage to make their work easier.


For example, a framework may include some visualization functionality, helping you produce some slick graphics to use in your report or presentation. As such, it’s best to read up on each framework’s documentation, becoming familiar with its capabilities to leverage it for your data science projects.


ETL processes

A DL framework can be helpful in fetching data from various sources, such as databases and files. This is a rather time-consuming process if done manually, so using a framework is very advantageous.


The framework will also do some formatting on the data so that you can start using it in your model without too much data engineering. However, doing some data processing of your own is always useful, particularly if you have some domain knowledge.


Building data models

The main function of a DL framework is to enable you to efficiently build data models. The framework facilitates the architecture design part, as well as all the data flow aspects of the ANN, including the training algorithm.


In addition, the framework allows you to view the performance of the system as it is being trained so that you gain insight into how likely it is to overfit.


Moreover, the DL framework takes care of all the testing required before the model is applied to data other than the dataset it was trained on (new data). All this makes building and fine-tuning a DL data model a straightforward and intuitive process, empowering you to make a more informed choice about what model to use for your data science project.


Deploying data models

Model deployment is something that DL frameworks can handle, too, making movement through the data science pipeline swifter. This mitigates the risk of errors through this critical process, while also facilitating easy updating of the deployed model. All this enables the data scientist to focus more on the tasks that require more specialized or manual attention.


For example, if you (rather than the DL model) worked on the feature engineering, you would have a greater awareness of exactly what is going into the model.


Deep learning methodologies and applications

Deep learning is a very broad AI category, encompassing several data science methodologies through its various systems. As we have seen, for example, it can be successfully used in classification—if the output layer of the network is built with the same number of neurons as the number of classes in the dataset.
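To illustrate why the output layer needs one neuron per class, here is a small sketch of how class labels are commonly turned into target vectors (one-hot encoding). The labels are invented for the example, and one-hot encoding is an assumption about how the targets are represented; the text does not prescribe it.

```python
def one_hot(labels):
    """Map each class label to a target vector with one slot per class."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    targets = []
    for label in labels:
        row = [0] * len(classes)
        row[index[label]] = 1  # mark the slot of the correct class
        targets.append(row)
    return classes, targets

# Hypothetical labels for four data points
labels = ["cat", "dog", "bird", "dog"]
classes, targets = one_hot(labels)

# The output layer needs as many neurons as there are classes
n_output_neurons = len(classes)
print(classes, targets, n_output_neurons)
```

Each target vector has the same length as the output layer, so the network’s outputs can be compared with it node by node.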


When DL is applied to regression problems, things are simpler, as a single neuron in the output layer is enough. Reinforcement learning is another methodology where DL is used. Classification and regression are supervised learning methodologies; reinforcement learning, which learns from reward signals rather than labeled targets, is usually treated as a paradigm of its own. All three fall under the broad predictive analytics umbrella.


DL is also used for dimensionality reduction, which (in this case) comprises a set of meta-features that are usually developed by an autoencoder system.

This approach to dimensionality reduction is also more efficient than the traditional statistical ones, which are computationally expensive when the number of features is remarkably high. Clustering is another methodology where deep learning can be used, with the proper changes in the ANN’s structure and data flow.
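The autoencoder idea can be sketched with a deliberately tiny example: two features are compressed into one meta-feature and then reconstructed. Everything here is an assumption made for illustration, including the linear encoder and decoder, the learning rate, and the toy data (points lying on the line y = 2x). Real autoencoders are usually non-linear and built with a DL framework.

```python
import random

def reconstruction_error(data, enc, dec):
    """Mean squared distance between each point and its reconstruction."""
    total = 0.0
    for x in data:
        code = enc[0] * x[0] + enc[1] * x[1]      # 2 features -> 1 meta-feature
        recon = (dec[0] * code, dec[1] * code)    # 1 meta-feature -> 2 features
        total += (recon[0] - x[0]) ** 2 + (recon[1] - x[1]) ** 2
    return total / len(data)

def train_autoencoder(data, lr=0.01, epochs=2000, seed=0):
    """Fit encoder and decoder weights by gradient descent on the reconstruction error."""
    rng = random.Random(seed)
    enc = [rng.uniform(-0.5, 0.5) for _ in range(2)]
    dec = [rng.uniform(-0.5, 0.5) for _ in range(2)]
    for _ in range(epochs):
        for x in data:
            code = enc[0] * x[0] + enc[1] * x[1]
            err = [dec[i] * code - x[i] for i in range(2)]
            grad_code = err[0] * dec[0] + err[1] * dec[1]  # error relayed to the code
            for i in range(2):
                dec[i] -= lr * err[i] * code
                enc[i] -= lr * grad_code * x[i]
    return enc, dec

# Toy dataset: two correlated features (second feature = 2 * first feature)
data = [(-1.0, -2.0), (-0.5, -1.0), (0.5, 1.0), (1.0, 2.0)]

initial_enc, initial_dec = train_autoencoder(data, epochs=0)  # untrained weights
initial_error = reconstruction_error(data, initial_enc, initial_dec)

enc, dec = train_autoencoder(data)
final_error = reconstruction_error(data, enc, dec)
print(initial_error, final_error)
```

After training, the single value `code` plays the role of the meta-feature: it carries enough information to rebuild both original features.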


Clustering and dimensionality reduction are the most popular unsupervised learning methodologies in data science and provide a lot of value when exploring a dataset. Beyond these data science methodologies involving DL, there are others that are more specialized and require some domain expertise. We’ll talk about some of them more, shortly.


There are many applications of deep learning. Some are more established or general, while others are more specialized or novel. Since DL is still a new tool, its applications in the data science world remain works in progress, so keep an open mind about this matter.


After all, the purpose of all AI systems is to be as universally applicable as possible, so the list of applications is only going to grow.


For the time being, DL is used in complex problems where high-accuracy predictions are required. These could be datasets with high dimensionality and/or highly non-linear patterns.


In the case of high-dimensional datasets that need to be summarized in a more compact form with fewer dimensions, DL is a highly effective tool for the job.


Also, since the very beginning of its creation, DL has been applied to image, sound, and video analytics, with a focus on images. Such data is quite difficult to process otherwise; the tools used before DL could only help so much, and developing those features manually was a very time-consuming process.


Moving on to more niche applications, DL is widely used in various natural language processing (NLP) methods. This includes all kinds of data related to everyday text, such as that found in articles, blogs, and even social media posts.


Where it is important to identify any positive or negative attitudes in the text, we use a methodology called “sentiment analysis,” which offers a fertile ground for many DL systems.


There are also DL networks that perform text prediction, which is common in many mobile devices and some text editors.


More advanced DL systems manage to link images to captions by mapping these images to words that are relevant and that form sentences. Such advanced applications of DL include chatbots, in which the AI system both creates text and understands the text it is given.


Also, applications like text summarization are under the NLP umbrella too and DL contributes to them significantly. Some DL applications are more advanced or domain-specific – so much so that they require a tremendous amount of data and computing power to work. However, as computing becomes more readily available, these are bound to become more accessible in the short term.


Assessing a deep learning framework

DL frameworks make it easy and efficient to employ DL in a data science project. Of course, part of the challenge is deciding which framework to use. Because not all DL frameworks are built equal, there are factors to keep in mind when comparing or evaluating these frameworks.


The number of languages supported by a framework is especially important. Since programming languages are particularly fluid in the data science world, it is best to have your language bases covered in the DL framework you plan to use.


What’s more, multi-language support in a DL framework enables the formation of a more diverse data science team, with each member bringing different programming expertise.


You must also consider the raw performance of the DL systems developed by the framework in question. Although most of these systems use the same low-level language on the back end, not all of them are fast.


There may also be other overhead costs involved. As such, it’s best to do your due diligence before investing your time in a DL framework—particularly if your decision affects other people in your organization.


Furthermore, consider the ETL processes supporting a DL framework. Not all frameworks are good at ETL, which is both inevitable and time-consuming in a data science pipeline. Again, any inefficiencies of a DL framework in this aspect are not going to be advertised; you must do some research to uncover them yourself.


Finally, the user community and documentation around a DL framework are important things, too. Naturally, the documentation of the framework is going to be helpful, though in some cases it may leave much to be desired.


If there is a healthy community of users for the DL framework you are considering, things are bound to be easier when learning its more esoteric aspects—as well as when you need to troubleshoot issues that may arise.



Interpretability is the capability of a model to be understood in terms of its functionality and its results. Although interpretability is often a given with conventional data science systems, it is a pain point of every DL system.


This is because every DL model is a “black box,” offering little to no explanation for why it yields the results it does. Unlike the framework itself, whose various modules and their functionality are clear, the models developed by these frameworks are convoluted graphs.


There is no comprehensive explanation as to how the inputs you feed them turn into the outputs they yield.


Although obtaining an accurate result through such a method may be enticing, it is quite hard to defend, especially when the results are controversial or carry a demographic bias.


The reason for a demographic bias has to do with the data, by the way, so no number of bias nodes in the DL networks can fix that since a DL network’s predictions can only be as good as the data used to train it. Also, the fact that we have no idea how the predictions correspond to the inputs allows biased predictions to slip through unnoticed.


However, this lack of interpretability may be resolved in the future. This may require a new approach to DL models, but if there is one thing that the progress of AI systems has demonstrated over the years, it is that innovations are still possible and that new model architectures are still being discovered. Perhaps one of the newer DL systems will have interpretability as one of its key characteristics.


Model maintenance

Maintenance is essential to every data science model. This entails updating or even upgrading a model in production, as new data becomes available. Alternatively, the assumptions of the problem may change; when this happens, model maintenance is also needed. In a DL setting, model maintenance usually involves retraining the DL network.


If the retrained model doesn’t perform well enough, more significant changes may be considered such as changing the architecture or the training parameters. Whatever the case, this whole process is largely straightforward and not too time-consuming.


How often model maintenance is required depends on the dataset and the problem in general. Whatever the case, it is good to keep the previous model available too when doing major changes, in case the new model has unforeseen issues.
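The advice to keep the previous model available can be sketched as a simple selection step. This is a hypothetical illustration, not a feature of any particular framework: the models are stand-in functions, and `choose_model`, the validation data, and the error measure are all invented for the example.

```python
def mean_squared_error(model, data):
    """Average squared difference between predictions and targets."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

def choose_model(previous_model, retrained_model, validation_data):
    """Promote the retrained model only if it is at least as good as the old one."""
    prev_err = mean_squared_error(previous_model, validation_data)
    new_err = mean_squared_error(retrained_model, validation_data)
    if new_err <= prev_err:
        return retrained_model, "retrained"
    return previous_model, "previous"

# Hypothetical stand-ins for trained models: plain functions of one input
def previous_model(x):
    return 2.0 * x

def retrained_model(x):
    return 2.0 * x + 5.0  # a retraining run that went wrong

validation_data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
active_model, chosen = choose_model(previous_model, retrained_model, validation_data)
print(chosen)
```

Because the previous model is still at hand, falling back to it when the retrained one underperforms is a single comparison rather than an emergency.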


Also, the whole model maintenance process can be automated to some extent, at least the offline part, when the model is retrained as new data is integrated with the original dataset.


When to use DL over conventional data science systems

Deciding when to use a DL system instead of a conventional method is an important task. It is easy to be enticed by the new and exciting features of DL and to use it for all kinds of data science problems. However, not all problems require DL. Sometimes, the extra performance of DL is not worth the extra resources required.


In cases where conventional data science systems fail or don’t offer any advantage (like interpretability), DL systems may be preferable. Complex problems with lots of variables and cases with non-linear relationships between the features and the target variables are great matches for a DL framework.


If there is an abundance of data, and the main objective is a good raw performance in the model, a DL system is typically preferable. This is particularly true if computational resources are not a concern, since a DL system requires quite a lot of them, especially during its training phase.


Whatever the case, it’s good to consider alternatives before setting off to build a DL model. While these models are incredibly versatile and powerful, sometimes simpler systems are good enough.



Deep Learning is a particularly important aspect of AI and has found a lot of applications in data science. Deep Learning employs a certain kind of AI system called Artificial Neural Networks (or ANN). 


An ANN is a graph-based system involving a series of (usually non-linear) operations, whereby the original features are transformed into a few meta-features capable of predicting the target variable more accurately than the original features.


The main frameworks in DL are MXNet, TensorFlow, and Keras, though PyTorch and Theano also play roles in the whole DL ecosystem.


There are various programming languages used in DL, including Python, Julia, Scala, JavaScript, R, and C / C++. Python is the most popular.

A DL framework offers diverse functionality, including ETL processes, building data models, deploying and evaluating models, and other functions like creating visuals.


A DL system can be used in various data science methodologies, including Classification, Regression, Reinforcement Learning, Dimensionality Reduction, Clustering, and Sentiment Analysis.


Classification and regression are supervised learning methodologies, dimensionality reduction and clustering are unsupervised, and reinforcement learning is usually treated as a separate paradigm that learns from reward signals.


Applications of DL include making high-accuracy predictions for complex problems; summarizing data into a more compact form; analyzing images, sound, or video; natural language processing and sentiment analysis; text prediction; linking images to captions; chatbots; and text summarization.


A DL framework needs to be assessed on various metrics (not just popularity). Such factors include the programming languages it supports, its raw performance, how well it handles ETL processes, the strength of its documentation and user communities, and the need for future maintenance.


It is not currently very easy to interpret DL results and trace them back to specific features (i.e. DL results currently have low interpretability).


Giving more weight to raw performance or interpretability can help you decide whether a DL system or conventional data science system is ideal for your particular problem. Other factors, like the amount of computational resources at your disposal, are also essential for making this decision.


Deep Learning Libraries

This section introduces some of the widely used deep learning libraries, including Theano, TensorFlow, and Keras, along with a basic tutorial on each of them.



Theano is an open source numerical computation library for Python with a syntax similar to NumPy’s. It is efficient at evaluating complex mathematical expressions involving multidimensional arrays, which makes it a good fit for neural networks.

We will be illustrating the installation steps for Theano on different platforms, followed by the basic tutorials involved.


Theano is a mathematical library that provides ways to create machine learning models that can later be reused across multiple datasets. Many tools have been implemented on top of Theano, principally:

• Blocks

• Lasagne

• PyLearn2


Note: At the time of writing this blog, community contributions to the Theano package have stopped, owing to a substantial increase in the usage of other deep learning packages.


Theano Installation

The following commands install Theano on Ubuntu:

> sudo apt-get install python-numpy python-scipy python-dev python-pip python-nose g++ libopenblas-dev git

> sudo pip install Theano


For detailed instructions on installing Theano on different platforms, please refer to the following link: theano/install.html. Even docker images with CPU and GPU compatibility are available.


Note: It is always advisable to proceed with installation in a separate virtual environment.

To install from source instead:

> git clone http://git://

> cd Theano

> python setup.py install


For the installation on Windows, take the following steps (sourced from an answer on Stack Overflow):

1. Install TDM-GCC x64.

2. Install Anaconda x64 (for example, in C:/Anaconda).

3. After the Anaconda installation, run the following commands:

a. conda update conda

b. conda update --all

c. conda install mingw libpython

4. Add 'C:\Anaconda\Scripts' to the PATH environment variable.

5. Install Theano, either the older version or the latest version available.

a. Older version:

> pip install Theano

b. Latest version:

> pip install --upgrade --no-deps git+git://


Theano Examples

The following section introduces basic code in the Theano library. The Tensor subpackage of the Theano library contains most of the required symbols.


The following example makes use of the Tensor subpackage and adds two numbers (outputs have been included for reference):

> import theano

> import theano.tensor as T

> import numpy

> from theano import function

# Variables 'x' and 'y' are defined

> x = T.dscalar('x') # dscalar: Theano type for a float64 scalar

> y = T.dscalar('y')

# 'x' and 'y' are instances of TensorVariable, and are of dscalar theano type

> type(x)

<class 'theano.tensor.var.TensorVariable'>

> x.type

TensorType(float64, scalar)

> T.dscalar

TensorType(float64, scalar)

# 'z' represents the sum of 'x' and 'y'. Theano's pp (pretty-print) function displays the computation behind 'z'

> z = x + y

> from theano import pp

> print(pp(z))

(x + y)

# 'f' is a compiled function that takes the list of inputs as its first argument and the output as its second; calling it returns a zero-dimensional numpy.ndarray

# 'f' is being compiled in C code

> f = function([x, y], z)

The preceding function could be used in the following manner to perform the addition operation:

> f(6, 10)

array(16.0)

> numpy.allclose(f(10.3, 5.4), 15.7)

True


TensorFlow Examples

Running and experimenting with TensorFlow is as easy as the installation. 


Following is one such example, with the basics of TensorFlow (outputs have been included for reference):

> import tensorflow as tf

> hello = tf.constant('Hello, Tensors!')

> sess = tf.Session() # tf.Session is part of the TensorFlow 1.x API

> print(sess.run(hello))

Hello, Tensors!

# Mathematical computation

> a = tf.constant(10)

> b = tf.constant(32)

> sess.run(a + b)

42


The run() method takes the variables to be computed as arguments, and a backward chain of the calls required to produce them is made.


TensorFlow graphs get formed from nodes not requiring any kind of input, i.e., the source. These nodes then pass their output to further nodes, which perform computations on the resulting tensors, and the whole process moves in this pattern.


The following examples show the creation of two matrices using NumPy, then using TensorFlow to assign these matrices as objects in TensorFlow, and then multiplying both the matrices. The second example includes the addition and subtraction of two constants. A TensorFlow session has also been activated to perform the operation and deactivated once the operation is complete.

> import tensorflow as tf

> import numpy as np

> mat_1 = 10*np.random.random_sample((3, 4)) # Creating NumPy arrays

> mat_2 = 10*np.random.random_sample((4, 6))

# Creating a pair of constant ops, and including the above made matrices

> tf_mat_1 = tf.constant(mat_1)

> tf_mat_2 = tf.constant(mat_2)

# Multiplying TensorFlow matrices with matrix multiplication operation

> tf_mat_prod = tf.matmul(tf_mat_1, tf_mat_2)

> sess = tf.Session() # Launching a session

# run() executes required ops and performs the request to store output in 'mult_matrix' variable

> mult_matrix = sess.run(tf_mat_prod)

> print(mult_matrix)

# Performing constant operations with the addition and subtraction of two constants

> a = tf.constant(10)

> b = tf.constant(20)

> print("Addition of constants 10 and 20 is %i " % sess.run(a+b))

Addition of constants 10 and 20 is 30

> print("Subtraction of constants 10 and 20 is %i " % sess.run(a-b))

Subtraction of constants 10 and 20 is -10

> sess.close() # Closing the session
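For readers who want to check the matrix shapes without installing TensorFlow, the multiplication step can be reproduced in plain Python. This is only a sketch of the underlying arithmetic, not of how TensorFlow computes it; `random_matrix` and `matmul` are helper names made up for the example.

```python
import random

def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of random values in [0, 10)."""
    return [[10 * rng.random() for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), giving an n x m matrix."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

rng = random.Random(0)
mat_1 = random_matrix(3, 4, rng)   # same shapes as in the TensorFlow example
mat_2 = random_matrix(4, 6, rng)
product = matmul(mat_1, mat_2)

shape = (len(product), len(product[0]))
print(shape)
```

The inner dimensions (4 and 4) must match, and the result takes its shape from the outer dimensions, which is why a 3 x 4 matrix times a 4 x 6 matrix gives a 3 x 6 matrix.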


Note: As no graph was specified in the preceding TensorFlow example, the session makes use of the default graph instance.


As the input data has four corresponding variables, the input_dim, which refers to the number of different input variables, has been set to four.


We have made use of the fully connected layers defined by dense layers in Keras to build the additional layers. The selection of the network structure is done on the basis of the complexity of the problem.


Here, the first hidden layer is made up of eight neurons, which are responsible for further capturing the nonlinearity.


The layer is initialized with uniformly distributed random numbers and uses ReLU as its activation function, as described previously in this blog. The second layer has six neurons and a configuration similar to that of the first.

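# Creating our first MLP model with Keras

# Note: 'init' is the Keras 1.x argument name; Keras 2 renamed it to 'kernel_initializer'

> from keras.models import Sequential

> from keras.layers import Dense

> mlp_keras = Sequential()

> mlp_keras.add(Dense(8, input_dim=4, init='uniform', activation='relu'))

> mlp_keras.add(Dense(6, init='uniform', activation='relu'))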


In the last layer of output, we have set the activation as sigmoid, mentioned previously, which is responsible for generating a value between 0 and 1 and helps in the binary classification.

> mlp_keras.add(Dense(1, init='uniform', activation='sigmoid'))
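To make the 4-8-6-1 structure more concrete, here is a minimal sketch in plain Python of what a single forward pass through such a network computes. It is not how Keras implements its layers; the helper names (dense, uniform_layer), the weight range of 0.05, and the sample values are assumptions made purely for illustration.

```python
import math
import random

random.seed(0)

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """Fully connected layer: activation(w . x + b) for every neuron."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

def uniform_layer(n_out, n_in, limit=0.05):
    """Weights drawn uniformly from [-limit, limit] (assumed range); zero biases."""
    weights = [[random.uniform(-limit, limit) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

w1, b1 = uniform_layer(8, 4)   # first hidden layer: 4 inputs -> 8 neurons
w2, b2 = uniform_layer(6, 8)   # second hidden layer: 8 -> 6 neurons
w3, b3 = uniform_layer(1, 6)   # output layer: 6 -> 1 neuron

x = [5.1, 3.5, 1.4, 0.2]       # one made-up sample with four input variables
h1 = dense(x, w1, b1, relu)
h2 = dense(h1, w2, b2, relu)
y_hat = dense(h2, w3, b3, sigmoid)
print(len(h1), len(h2), y_hat)
```

The single output value always lies between 0 and 1, which is what lets it be read as the probability of the positive class.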


To compile the network, we use logarithmic loss for binary classification (binary cross-entropy), select Adam as the default choice of optimizer, and track accuracy as the desired metric.


The network is trained using the backpropagation algorithm, along with the given optimization algorithm and loss function.

> mlp_keras.compile(loss = 'binary_crossentropy', optimizer='adam',metrics=['accuracy'])
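As a rough illustration of what the loss and the metric measure, the sketch below computes binary cross-entropy and accuracy by hand for a few made-up predictions. The function names, the clipping constant eps, and the 0.5 decision threshold are assumptions for this example rather than a description of the Keras internals.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() away from 0
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if (p >= threshold) == (t == 1))
    return hits / len(y_true)

y_true = [1, 0, 1, 0]
y_pred = [0.9, 0.2, 0.4, 0.1]   # made-up sigmoid outputs

loss = binary_crossentropy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print("loss: %.4f, accuracy: %.2f" % (loss, acc))
```

Note how the third sample (predicted 0.4 for a true label of 1) dominates the loss: confident mistakes are penalized much more heavily than hesitant ones.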


The model is trained on the given dataset for a small number of iterations (nb_epoch), using a modest number of instances per batch (batch_size).


The parameters could be chosen either on the basis of prior experience of working with such kinds of datasets, or one can even make use of Grid Search to optimize the choice of such parameters. We will be covering the same concept in later blogs, where necessary.
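The Grid Search idea can be sketched in a few lines: try every combination from a small grid and keep the best one. In the sketch below, validation_score is a hypothetical stand-in (a made-up formula) for training the model with the given settings and measuring its accuracy on held-out data; the grid values are likewise assumptions.

```python
from itertools import product

def validation_score(epochs, batch_size):
    """Hypothetical stand-in: in practice, train the model with these
    settings and return its accuracy on held-out data."""
    return 1.0 - abs(epochs - 150) / 1000 - abs(batch_size - 10) / 100

grid = {"epochs": [50, 100, 150, 200], "batch_size": [5, 8, 10, 16]}

best_score, best_params = float("-inf"), None
for epochs, batch_size in product(grid["epochs"], grid["batch_size"]):
    score = validation_score(epochs, batch_size)
    if score > best_score:
        best_score = score
        best_params = {"epochs": epochs, "batch_size": batch_size}

print(best_params, best_score)
```

The cost grows with the product of the grid sizes, which is why grids are usually kept small.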

> mlp_keras.fit(X, Y, nb_epoch=200, batch_size=8, verbose=0)


The next step is to evaluate the model that has been built and to check the performance metrics, loss and accuracy, on the initial training dataset. The same operation could be performed on a new test dataset that the model has not seen, which would be a better measure of model performance.

> accuracy = mlp_keras.evaluate(X,Y)

> print("Accuracy : %.2f%% " % (accuracy[1]*100 ))


If one wants to optimize the model further, different combinations of parameters and other tweaks can be tried during model creation and validation, though this need not result in better performance in all cases.

# Using a different optimizer

> from keras.optimizers import SGD

> opt = SGD(lr=0.01)


The following creates a model with configurations similar to those in the earlier model but with a different optimizer and including a validation dataset from the initial training data:

> mlp_optim = Sequential()

> mlp_optim.add(Dense(8, input_dim=4, init='uniform', activation='relu'))

> mlp_optim.add(Dense(6, init='uniform', activation='relu'))

> mlp_optim.add(Dense(1, init='uniform', activation='sigmoid'))

# Compiling the model with SGD

> mlp_optim.compile(loss = 'binary_crossentropy', optimizer=opt, metrics=['accuracy'])

# Fitting the model and checking accuracy

> mlp_optim.fit(X, Y, validation_split=0.3, nb_epoch=150, batch_size=10, verbose=0)

> results_optim = mlp_optim.evaluate(X,Y)

> print("Accuracy : %.2f%%" % (results_optim[1]*100 ) )


Make sure that all the packages mentioned for natural language processing and deep learning in the preceding sections are installed before moving forward. Once you have set up the system, you will be good to go with the examples offered throughout this blog.


AI Methodologies Beyond Deep Learning

Optimization, by contrast, applies to all kinds of datasets and is often used within other data science systems.



Optimization is the process of finding the maximum or minimum of a given function (also known as a fitness function), by calculating the best values for its variables (also known as a “solution”).


Despite the simplicity of this definition, it is not an easy process; it often involves restrictions, as well as complex relationships among the various variables. Even though some functions can be optimized using some mathematical process, most functions we encounter in data science are not as simple, requiring a more advanced technique.


Optimization systems (or optimizers, as they are often referred to) aim to optimize in a systematic way, oftentimes using a heuristics-based approach.


Such an approach enables the AI system to use a macro level concept as part of its low-level calculations, accelerating the whole process and making it more light-weight. After all, most of these systems are designed with scalability in mind, so the heuristic approach is most practical.


Importance of optimization

Optimization is especially important in many data science problems, particularly those involving a lot of variables that need to be fine-tuned, or cases where the conventional tools don’t seem to work. In order to tackle more complex problems, beyond classical methodologies, optimization is essential.


Moreover, optimization is useful for various data engineering tasks such as feature selection, in cases where maintaining a high degree of interpretability is desired. We’ll investigate the main applications of optimizers in data science later in this blog.


Optimization systems overview

There are different kinds of optimization systems. The most basic ones have been around the longest. These are called “deterministic optimizers,” and they tend to yield the best possible solution for the problem at hand, that is, the absolute maximum or minimum of the fitness function. Since they are quite time-consuming and cannot handle large-scale problems, these deterministic optimizers are usually used for applications where the number of variables is relatively small.


A classic example of such an optimizer is the one used for least squares regression: a simple method to figure out the optimal line that fits a set of data points, in a space with relatively small dimensionality.
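As a minimal sketch of such a deterministic optimizer, the following fits a line to a handful of points using the closed-form least squares solution. It is written in Python, like the earlier examples; the function name fit_line and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Slope and intercept that minimize the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.1, 4.9, 7.2, 8.8]   # made-up, roughly y = 2x + 1

slope, intercept = fit_line(xs, ys)
print("y = %.2f * x + %.2f" % (slope, intercept))
```

Running it twice gives exactly the same line, which is what makes the method deterministic: there is a single best solution and the formula lands on it directly.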


In addition to deterministic optimizers, there are stochastic optimizers, which more closely fit the definition of AI. After all, most of these are based on natural phenomena, such as the movement of the members of a swarm, or the way a metal cools during annealing. The main advantage of these methods is that they are very efficient.


Although they usually don’t yield the absolute maximum or minimum of the function they are trying to optimize, their solutions are good enough for all practical purposes (even if they vary slightly every time you run the optimizer).


Stochastic optimizers also scale very well, so they are ideal for complex problems involving many variables.
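To give a feel for how a stochastic optimizer works, here is a small simulated annealing sketch in Python. It is one possible variant under simple assumptions (a linear cooling schedule, uniform random steps, a one-variable fitness function made up for the example), not a reference implementation.

```python
import math
import random

def simulated_annealing(f, x0, steps=5000, temp0=1.0, step_size=0.5, seed=42):
    """Minimize f starting from x0; returns the best point found and its value."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_fx = x, fx
    for k in range(steps):
        temp = temp0 * (1.0 - k / steps) + 1e-9   # assumed linear cooling
        candidate = x + rng.uniform(-step_size, step_size)
        fc = f(candidate)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if fc < fx or rng.random() < math.exp((fx - fc) / temp):
            x, fx = candidate, fc
            if fx < best_fx:
                best_x, best_fx = x, fx
    return best_x, best_fx

def fitness(x):
    return (x - 3.0) ** 2 + 2.0   # minimum value 2.0 at x = 3.0

best_x, best_fx = simulated_annealing(fitness, x0=-10.0)
print(best_x, best_fx)
```

Changing the seed gives a slightly different answer each time, but all of them end up close to the true minimum, which mirrors the "good enough for all practical purposes" behavior described above.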


Programming languages for optimization

Optimization is supported by most programming languages in terms of libraries, like the Optim and JuMP packages in Julia. However, each algorithm is simple enough so that you can code it yourself if you cannot find an available “off-the-shelf” function. In this blog, we’ll examine the main algorithms for advanced optimization and how they are implemented in Julia.


We chose this programming language because it combines ease of use and high execution speed. Remember that all the code is available in the Docker environment that accompanies this blog.


Fuzzy inference systems

Fuzzy logic (FL) is a methodology designed to emulate the human capacity of imprecise or approximate reasoning. This ability to judge under uncertainty was previously considered strictly human, but FL has made it possible for machines, too.


Despite its name, there is nothing unclear about the outputs of fuzzy logic. This is because fuzzy logic is an extension of classical logic in which partial truths are included, extending bivalued logic (true or false) to a multivalued logic (degrees of truth between true and false).


According to its creator, Professor Zadeh, the ultimate goal of fuzzy logic is to form the theoretical foundation for reasoning about imprecise propositions (also known as “approximate reasoning”). Over the past couple of decades, FL has gained ground and become regarded as one of the most promising AI methodologies.


An FL system contains a series of mappings corresponding to the various features of the data at hand. This system contains terms that make sense to us, such as high-low, hot-cold, and large-medium-small, terms that may appear fuzzy since there are no clear-cut boundaries among them.


Also, these attributes are generally relative and require some context to become explicit, through a given mapping between each term and some number that the system can use in its processes.


This mapping is described mathematically through a set of membership functions, graphically taking the form of triangles, trapezoids, or even curves. This way something somewhat abstract like “large” can take very specific dimensions in the form of “how large on a scale of 0 to 1” it is. The process of coding data into these states is called fuzzification.
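A triangular membership function is easy to sketch. In the Python example below, the three temperature terms and their boundaries are made up for illustration; the fuzzify helper simply reports how strongly a crisp value belongs to each term.

```python
def triangular(x, left, peak, right):
    """Degree of membership (0 to 1) of x in a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Hypothetical terms for a temperature feature, as (left, peak, right) in Celsius.
temperature_terms = {
    "cold": (-10.0, 0.0, 15.0),
    "warm": (10.0, 20.0, 30.0),
    "hot": (25.0, 35.0, 45.0),
}

def fuzzify(value, terms):
    """Map a crisp value to its degree of membership in every term."""
    return {name: triangular(value, *abc) for name, abc in terms.items()}

memberships = fuzzify(12.0, temperature_terms)
print(memberships)
```

A reading of 12 degrees comes out as partly “cold” and partly “warm” at the same time, which is exactly the kind of overlap that clear-cut boundaries cannot express.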


Once all the data is coded in this manner, the various mappings are merged together through logical operators, such as inference rules (for example, “If A and B then C,” where A and B correspond to states of two different features and C to the target variable). The result is a new membership function describing this complex relationship, usually depicted as a polygon.


This is then turned into a crisp value, through one of the various methods, in a process called defuzzification. Since this whole process is graphically accessible to the user, and the terms used are borrowed from human language, the result is always something clear-cut and interpretable (given some understanding of how FL works).
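The rule evaluation and defuzzification steps can be sketched as follows. This is one common style of fuzzy inference (min for AND, clipping of the output sets, max for merging, centroid for defuzzification), chosen here as an assumption; the two rules, the fan speed sets, and the input degrees are all made up.

```python
def triangular(x, left, peak, right):
    """Degree of membership (0 to 1) of x in a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Fuzzified inputs for two features (made-up degrees of membership).
temp_hot, temp_cold = 0.8, 0.2
humid_high, humid_low = 0.6, 0.4

# Inference rules; the AND operator is modeled with min.
fast_strength = min(temp_hot, humid_high)   # if hot and humid then fan is fast
slow_strength = min(temp_cold, humid_low)   # if cold and dry then fan is slow

# Clip each output set at its rule strength and merge the results with max.
speeds = list(range(0, 101))                # candidate fan speeds, in percent
aggregated = [
    max(
        min(slow_strength, triangular(s, 0.0, 25.0, 60.0)),
        min(fast_strength, triangular(s, 40.0, 75.0, 100.0)),
    )
    for s in speeds
]

# Defuzzification by centroid: the membership-weighted average speed.
crisp_speed = sum(s * m for s, m in zip(speeds, aggregated)) / sum(aggregated)
print(round(crisp_speed, 1))
```

Because the "fast" rule fires more strongly than the "slow" one, the crisp output lands in the upper half of the speed range.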


Interestingly, FL has also been used in conjunction with ANNs to form what is referred to as neuro-fuzzy systems. Instead of having a person create the membership functions by hand, an FL system can make use of the optimization method in a neural network’s training algorithm to calculate them on the fly.


This whole process and the data structure that it entails take the form of an automated fuzzy system, combining the best of both worlds.


Why systems based on fuzzy logic are still relevant

Although FL was originally developed with a certain type of engineering system in mind, such as control systems, its ease of use and low cost of implementation have made it relevant as an AI methodology across a variety of other fields, including data science.


What’s more, fuzzy systems are very accessible, especially when automated through optimization for their membership functions (such as the neuro-fuzzy systems mentioned previously). Such a system employs a set of FL rules (which are created based on the data) to infer the target variable. These systems are called Fuzzy Inference Systems, or FISs.


The main advantage of this FIS approach is that it is transparent—a big plus if you want to defend your results to the project stakeholders. The transparency of a FIS makes the whole problem more understandable, enabling you to figure out which features are more relevant.


In addition, a FIS can be used in conjunction with custom-made rules based on an expert’s knowledge. This is particularly useful if you are looking at upgrading a set of heuristic rules using AI. Certain larger companies that are planning to use data science to augment their existing systems are likely to be interested in such a solution.


The downside of fuzzy inference systems

Despite all the merits of FIS, these AI systems don’t always meet the expectations of modern data science projects. Specifically, when the dimensionality of the data at hand is quite large, the number of rules produced increases exponentially, making these systems too large for any practical purposes.
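A quick calculation shows how fast the rule base can grow. The sketch assumes a rule for every combination of terms, with three terms per feature (for example low, medium, high); both are assumptions made for illustration, and real systems may prune many of these combinations.

```python
def rule_count(n_features, terms_per_feature=3):
    """Number of rules when every combination of terms gets its own rule."""
    return terms_per_feature ** n_features

counts = {n: rule_count(n) for n in (2, 5, 10, 20)}
for n, count in counts.items():
    print("%2d features -> %d rules" % (n, count))
```

With 20 features the full rule base would already hold several billion rules.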


Of course, you can mitigate this issue with an autoencoder or a statistical process, like PCA or ICA, that creates a smaller set of features.


However, when you do this, the whole interpretability benefit of FIS goes out the window. Why? With a reduced feature set, the relationship with the original features (and the semantic meaning they carry) is warped.


As such, it is very difficult to reconstruct meaning; the new features will require a different interpretation if they are to be meaningful. This is not always feasible.


Nevertheless, for datasets of smaller dimensionality, a FIS is a worthwhile alternative, even if it’s not a particularly popular one. We’ll explore Fuzzy Logic and FIS more in blog 11, where we’ll discuss alternative AI methodologies.


Artificial creativity

Artificial creativity (AC) is a relatively new methodology of AI, where new information is created based on relevant data it has been trained on. Its applications span across various domains, including most of the arts, as well as industrial design, and even data science.


This kind of AI methodology makes use of a specialized DL network that is trained to develop new data that retains some characteristics of the data it was trained on.


When you feed such a specialized AI system some image data, and it’s been trained on the artwork of a particular painter, it will produce new “artwork” that makes use of the images it is fed, but using the painting patterns of the artist it is trained to emulate. The results may not win any art prizes, but they are certainly interesting and original!


In data science, AC can aid in the creation of new data, which is quite useful in certain cases. This new data may not be particularly helpful as an expansion of the training set for that ANN, but it can be especially useful in other ways.


For example, if the original data is sensitive like medical data, and contains too much personally identifiable information (PII), you can generate new data using AC that, although very similar to the original data, cannot be mapped back to a real individual.


In addition, data created from an AC system can be useful for different data models—perhaps as a new test set or even part of their training set.


This way it can offer the potential for better generalization for these models, as there is more data available for them to train or test on. This can be particularly useful in domains where labeled data is hard to come by or is expensive to generate otherwise.


Additional AI methodologies

Beyond all the AI methodologies we’ve discussed so far, there exist several others worth noting. These systems also have a role to play in data science, while their similarities to DL systems make them easier to comprehend. Also, as the AI field is constantly expanding, it’s good to be aware of all the new methodologies that pop up.


The Extreme Learning Machine (or ELM) is an example of an alternative AI methodology that hasn’t yet received the attention it deserves. Although they are architecturally like DL networks, ELMs are quite distinct in the way they are trained.


In fact, their training is so unconventional that some people considered the whole approach borderline unscientific (the professor who came up with ELMs has recently received a lot of criticism from other academics).


Instead of optimizing all the weights across the network, ELMs focus on just the connections of the last two layers, namely the last set of meta-features and their outputs.


The rest of the weights maintain their initial random values from the beginning. Because the focus is solely on just the optimized weights of the last layers, this optimization is extremely fast and very precise.


As a result, ELMs are the fastest network-based methodology out there, and their performance is quite decent too. What’s more, they are quite unlikely to overfit, which is another advantage.


Despite its counter-intuitive approach, an ELM system does essentially the same thing as a conventional DL system; instead of optimizing all the meta-features it creates, though, it focuses on optimizing the way they work together to form a predictive analytics model.
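The training idea behind an ELM fits in a few lines of NumPy. The sketch below is a bare-bones illustration under simple assumptions (one hidden layer, tanh activation, a made-up toy regression dataset), not a full ELM library: the hidden weights are random and never updated, and only the output weights are found with a single least squares solve.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (made up): y = sin(x) plus a little noise.
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=200)

n_hidden = 50
W = rng.normal(size=(1, n_hidden))   # random input-to-hidden weights, never trained
b = rng.normal(size=n_hidden)        # random hidden biases, never trained

def hidden(inputs):
    """Random meta-features produced by the fixed hidden layer."""
    return np.tanh(inputs @ W + b)

# Only the hidden-to-output weights are fitted, in one least squares solve.
H = hidden(X)
beta, *_ = np.linalg.lstsq(H, y, rcond=None)

pred = hidden(X) @ beta
mse = float(np.mean((pred - y) ** 2))
print("training MSE: %.4f" % mse)
```

There is no loop over epochs at all, which is where the speed of this approach comes from.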


Another new alternative AI methodology is Capsule Networks (CapsNets). Although the CapsNet should be regarded as a member of the deep learning methods family, its architecture and its optimization training method are quite novel.


CapsNets try to capture the relative relationships between the objects within a relevant context. A CNN model that achieves high performance in image recognition tasks may not necessarily be able to identify the same object from different angles. CapsNets, though, capture those kinds of contextual relationships quite well.


Their performance on some tasks has already surpassed the leading models by about 45%, which is quite astonishing and points to a promising future.


Self-organizing Maps (SOMs) are a special type of AI system. Although they are also ANNs of sorts, they are unique in function. SOMs offer a way to map the feature space into a two-dimensional grid so that it can be better visualized afterward.


Since it doesn’t make use of a target variable, a SOM is an unsupervised learning methodology; as such, it is ideal for data exploration.


SOMs have been successfully applied in various domains, such as meteorology, oceanography, oil and gas exploration, and project prioritization. One key difference SOMs have from other ANNs is that their learning is based on competition instead of error correction.


Also, their architecture is quite different, as their various nodes are only connected to the input layer with no lateral connections. This unique design was first introduced by Professor Kohonen, which is why SOMs are also referred to as “Kohonen Maps.”
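The competitive learning behind a SOM can be sketched compactly with NumPy. The grid size, the decay schedules for the learning rate and the neighborhood radius, and the random three-feature dataset are all assumptions chosen to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.random((300, 3))              # made-up dataset with 3 features in [0, 1)

rows, cols = 6, 6
weights = rng.random((rows, cols, 3))    # one weight vector per grid node
grid = np.stack(
    np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
)                                        # grid[r, c] == (r, c)

def best_matching_unit(x):
    """Grid coordinates of the node whose weights are closest to x."""
    dists = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(dists), dists.shape)

n_steps = 2000
for t in range(n_steps):
    x = data[rng.integers(len(data))]
    frac = t / n_steps
    lr = 0.5 * (1.0 - frac)               # decaying learning rate
    radius = 3.0 * (1.0 - frac) + 0.5     # shrinking neighborhood
    bmu = np.array(best_matching_unit(x))
    # Competition: the winning node and its grid neighbors move toward the sample.
    grid_dist2 = np.sum((grid - bmu) ** 2, axis=-1)
    influence = np.exp(-grid_dist2 / (2.0 * radius ** 2))
    weights += lr * influence[..., None] * (x - weights)

# Each data point is mapped to a cell of the two-dimensional grid.
mapped = [best_matching_unit(x) for x in data[:5]]
print(mapped)
```

No target variable appears anywhere in the loop; the map organizes itself purely from the inputs, and plotting the mapped cells is what gives the two-dimensional view of the feature space.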


The Generative Adversarial Network (GAN) is a very interesting type of AI methodology, geared towards optimizing a DL network in a rather creative way. A GAN comprises two distinct ANNs.


One is for learning, and the other is for “breaking” the first one – finding cases where the predictions of the first ANN are off. These systems are comparable to the “white hat” hackers of cybersecurity.


In essence, the second ANN creates increasingly more demanding challenges for the first ANN, thereby constantly improving its generalization (even with a limited amount of data).


GANs are used for simulations as well as data science problems. A notable field of application is astronomy, where a somewhat limited quantity of images and videos of the cosmos is available to use in training.


The idea of GANs has been around for over a decade, but has only recently managed to gain popularity; this is largely due to the amount of computational resources demanded by such a system (just like any other DL-related AI system).


Artificial Emotional Intelligence (AEI) is another kind of AI that’s novel on both the methodological as well as the application levels. The goal of AEI is to facilitate an understanding of the emotional context of data (which is usually text-based) and to assess it just like a human.


Applications of AEI are currently limited to comprehension; in the future, though, more interactive systems could provide a smoother interface between humans and machines. There is an intersection between AEI and ANNs, but some aspects of AEI make use of other kinds of ML systems on the back end.

Glimpse into the future

While the field of AI expands in various directions, making it hard to speculate about how it will evolve, there is one common drawback to most of the AI systems used today: a lack of interpretability.


As such, it is quite likely that some future AI system will address this matter, providing a more comprehensive result, or at least some information as to how the result came about (something like a rationale for each prediction), all while maintaining the scalability of modern AI systems.


A more advanced AI system of the future will likely have a network structure, just like current DL systems—though it may be quite different architecturally. Such a system would be able to learn with fewer data points (possibly assisted by a GAN), as well as generate new data (just like variational autoencoders).


Could an AI system learn to build new AI systems? It is possible; however, the excessive resources required for such a task have made it feasible only for cloud-based systems. Google may showcase its progress in this area in what it refers to as Automated Machine Learning (AutoML).


So, if you were to replicate this task with your own system, who is to say that the AI-created AI would be better than what you yourself would have built? Furthermore, would you be able to pinpoint its shortcomings, which may be quite subtle and obscure?


After all, an AI system requires a lot of effort to make sure that its results are not just accurate but also useful, addressing the end user's needs. You can imagine how risky it would be to have an AI system built that you know nothing about!


However, all this is just an idea of a potential evolutionary course, since AI can always evolve in unexpected ways. Fortunately, with all the popularity of AI systems today, if something new and better comes along, you’ll probably find out about it sooner rather than later.


Perhaps for things like that, it’s best to stop and think about the why’s instead of focusing only on the how’s, since as many science and industry experts have warned us, AI is a high-risk endeavor and needs to be handled carefully and always with fail-safes set in place.


For example, although it’s fascinating and in some cases important to think about how we can develop AIs that improve themselves, it’s also crucial to understand what implications this may have and plan for AI safety matters beforehand.


Also, prioritizing certain characteristics of an AI system (e.g. interpretability, ease of use, having limited issues in the case of malfunction, etc.) over raw performance, may provide more far-reaching benefits. After all, isn’t improving our lives in the long-term the reason why we have AI in the first place?


About the methods

It is not hard to find problems that can be tackled with optimization. For example, you may be looking at an optimum configuration of a marketing process to minimize the total cost, or to maximize the number of people reached.


Although data science can lend some aid in solving such a problem, at the end of the day, you’ll need to employ an optimizer to find a true solution to a problem like this.


Furthermore, optimization can help in data engineering, too. Some feature selection methods, for instance, use optimization to keep only the features that work well together.


There are also cases of feature fusion that employ optimization (although few people use this method since it sacrifices some interpretability of the model that makes use of these meta-features).


In addition, when building a custom predictive analytics system combining other classifiers or regressors, you often need to maximize the overall accuracy rate (or some other performance metric).


To do this, you must figure out the best parameters for each module (i.e. the ones that optimize a certain performance metric for the corresponding model), and consider the weights of each module’s output in the overall decision rule for the final output of the system that comprises all these modules.


This work really requires an optimizer, since often the number of variables involved is considerable.
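As a toy version of this task, the sketch below searches for the weight that best blends the outputs of two classifiers. The predicted probabilities and labels are made up, and a simple scan over eleven candidate weights stands in for a proper optimizer, which would be needed once many modules and parameters are involved.

```python
def accuracy(probs, labels, threshold=0.5):
    """Share of samples whose thresholded probability matches the label."""
    hits = sum((p >= threshold) == (y == 1) for p, y in zip(probs, labels))
    return hits / len(labels)

labels = [1, 0, 1, 1, 0, 0, 1, 0]
model_a = [0.9, 0.6, 0.4, 0.8, 0.3, 0.2, 0.45, 0.1]   # made-up probabilities
model_b = [0.6, 0.2, 0.7, 0.4, 0.6, 0.1, 0.7, 0.4]    # made-up probabilities

best_w, best_acc = None, -1.0
for step in range(11):
    w = step / 10                                      # weight given to model_a
    blended = [w * a + (1 - w) * b for a, b in zip(model_a, model_b)]
    acc = accuracy(blended, labels)
    if acc > best_acc:
        best_w, best_acc = w, acc

print("model_a alone:", accuracy(model_a, labels))
print("model_b alone:", accuracy(model_b, labels))
print("best blend: w = %.1f, accuracy = %.2f" % (best_w, best_acc))
```

Neither model is perfect on its own, yet a suitable weighting classifies every sample correctly, because the two models make their mistakes on different samples.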


In general, if you are tackling a predictive analytics problem and you have a dataset whose dimensionality you have reduced through feature selection, it can be effectively processed through a FIS. In addition, if interpretability is a key requirement for your data model, using a FIS is a good strategy to follow.


Finally, if you already have a set of heuristic rules at your disposal from an existing predictive analytics system, then you can use a FIS to merge those rules with some new ones that the FIS creates. This way, you won’t have to start from square one when developing your solution.


Novel AI systems tend to be less predictable and, as a result, somewhat unreliable. They may work well for a certain dataset, but that performance may not hold true with other datasets.


That’s why it is critical to try out different AI systems before settling on one to use as your main data model. In many cases, optimizing a certain AI system may yield better performance, despite the time and resources it takes to optimize.


Striking the balance between exploring various alternatives and digging deeper into existing ones is something that comes about with experience.


Moreover, it’s a good idea to set your project demands and user requirements beforehand. Knowing what is needed can make the selection of your AI system (or if you are more adventurous, your design of a new one) much easier and more straightforward.


For example, if you state early on that interpretability is more important than performance, this will affect which model you decide to use. Make sure you understand what you are looking for in an AI system from the beginning, as this is bound to help you significantly in making the optimal choice.


Although AI systems have a lot to offer in all kinds of data science problems, they are not panaceas. If the data at your disposal is not of high veracity, meaning not of good quality or reliability, no AI system can remedy that.


All AI systems function based on the data we train them with; if the training data is very noisy, biased, or otherwise problematic, their generalizations are not going to be any better.


This underlines the importance of data engineering and utilizing data from various sources, thereby maximizing your chances of creating a robust and useful data model. This is also why it’s good to always keep a human in the loop when it comes to data science projects, even (or perhaps especially) when they involve AI.



Optimization is an AI-related process for finding the maximum or minimum of a given function, by tweaking the values for its variables. It is an integral part of many other systems (including ANNs) and consists of deterministic and stochastic systems.


Optimization systems (or “optimizers,” as they are often called) can be implemented in all programming languages since their main algorithms are fairly straightforward.


Fuzzy Logic (FL) is an AI methodology that attempts to model imprecise data as well as uncertainty. Systems employing FL are referred to as Fuzzy Inference Systems (FIS).


These systems involve the development and use of fuzzy rules, which automatically link features to the target variable. A FIS is great for datasets of small dimensionality since it doesn’t scale well as the number of features increases.


Artificial creativity (AC) is an AI methodology of sorts that creates new information based on patterns derived from the data it is fed. It has many applications in the arts and industrial design.


This methodology could also be useful in data science, through the creation of new data points for sensitive datasets (for example, where data privacy is important).


Artificial Emotional Intelligence (AEI) is another alternative AI, emulating human emotions. Currently, its applications are limited to comprehension.


Although speculative, if a truly novel AI methodology were to arise in the near future, it would probably combine the characteristics of existing AI systems, with an emphasis on interpretability.


Even though it is theoretically possible, building an AI system that can design and build other AI systems is not a trivial task. A big part of this involves the excessive risks of using such a system, since we would have little control over the result.


It’s important to understand the subtle differences between all these methodologies, as well as their various limitations. Most importantly, for data science, there is no substitute for high veracity data.



Creating a regression MLP system is similar to creating a classification one but with some differences. In the regression case, the network will be simpler, since regressors are typically lighter architecturally than classifiers. For this example, we’ll use the second dataset.

First, let’s start by importing the necessary classes from the MXNet package and setting the context for the model:

import mxnet as mx

from mxnet import nd, autograd, gluon

ModelCtx = mx.cpu()


To load data to the model, we’ll use the data loaders created previously (data_train2 and data_test2). Let’s now define some basic settings and build the DL network gradually:

nf = 20 # we have 20 features in this dataset

sigma = 1.0 # sigma value for the distribution of weights for the ANN connections

net = gluon.nn.Dense(1, in_units=nf) # the “1” here is the number of output neurons, which is 1 in regression

Let’s now initialize the network with some random values for the weights and biases:

net.collect_params().initialize(mx.init.Normal(sigma=sigma), ctx=ModelCtx)


Just like any other DL system, we need to define the loss function. Using this function, the system understands how much of an error each deviation from the target variable’s values costs. At the same time, cost functions can also account for the complexity of the models (since overly complex models tend to overfit):

square_loss = gluon.loss.L2Loss()
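To see what this loss measures, here is a plain Python sketch of a squared-error loss with the factor of one half that, as far as we can tell from the Gluon documentation, L2Loss applies. The helper name l2_loss and the numbers are made up for illustration.

```python
def l2_loss(predictions, labels):
    """Per-sample loss: 0.5 * (prediction - label) ** 2."""
    return [0.5 * (p - y) ** 2 for p, y in zip(predictions, labels)]

predictions = [2.5, 0.0, 2.0]   # made-up network outputs
labels = [3.0, -0.5, 2.0]       # made-up target values

losses = l2_loss(predictions, labels)
mean_loss = sum(losses) / len(losses)
print(losses, mean_loss)
```

Because the error is squared, a prediction that is off by 2 costs four times as much as one that is off by 1.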


Now it’s time to train the network using the data at hand. After we define some essential parameters (just like in the classification case), we can create a loop for the network to train:

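ne = 10 # number of epochs for training

loss_sequence = [] # cumulative loss for the various epochs

nb = ns / BatchSize # number of batches (ns and BatchSize are assumed to be defined earlier)

for e in range(ne):
    CumulativeLoss = 0
    for i, (data, label) in enumerate(train_data): # inner loop over the batches of the training data loader
        data = data.as_in_context(ModelCtx)
        label = label.as_in_context(ModelCtx)
        with autograd.record():
            output = net(data)
            loss = square_loss(output, label)
        # Backpropagate and update the weights; trainer is assumed to be a
        # gluon.Trainer created beforehand, as in the classification case
        loss.backward()
        trainer.step(BatchSize)
        CumulativeLoss += nd.mean(loss).asscalar()
    print("Epoch %s, loss: %s" % (e, CumulativeLoss / ns))
    loss_sequence.append(CumulativeLoss)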


If you wish to view the parameters of the model, you can do so by collecting them into a dictionary structure:

params = net.collect_params()

for param in params.values():
    print(param.name, param.data())

Printing out the parameters may not seem useful, as there are usually too many of them, especially when we add new layers to the system, something we’d accomplish as follows:



where nhn is the number of neurons in that additional hidden layer. Note that the network requires an output layer with a single neuron, so be sure to insert any additional layers between the input and output layers.



Non-linearity is essential in all DL systems and since convolution is a linear operation, we need to introduce non-linearity in a different way. One such way is the ReLU function which is applied to each pixel in the image.


Note that other non-linear functions can also be used, such as the hyperbolic tangent (tanh) or the sigmoid. Descriptions of these functions are in the glossary.
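The three functions mentioned here are short enough to write out. The sketch below applies ReLU element by element to a tiny made-up feature map and evaluates tanh and the sigmoid at a couple of points for comparison.

```python
import math

def relu(x):
    """Keeps positive values and replaces negative ones with zero."""
    return max(0.0, x)

def sigmoid(x):
    """Squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# A tiny made-up feature map, as produced by a convolution step.
feature_map = [[-1.5, 0.0], [2.0, -0.3]]
activated = [[relu(v) for v in row] for row in feature_map]

print(activated)
print(math.tanh(0.0), math.tanh(2.0))   # tanh squashes into (-1, 1)
print(sigmoid(0.0), sigmoid(2.0))
```

ReLU is the cheapest of the three to compute, since it involves only a comparison.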



Since the feature maps and the results of the non-linear transformations to the original data are rather large, in the part that follows, we make them smaller through a process called pooling.


This involves some summarization operation, such as taking the maximum value (called “max pooling”), the average, or even the sum of a particular neighborhood (e.g. a 3x3 window).


Various experiments have indicated that max pooling tends to yield the best performance. Finally, the pooling process is an effective way to reduce overfitting.
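Max pooling is simple to sketch by hand. The example below uses non-overlapping 2x2 windows on a small made-up feature map; the window size and the max_pool helper are assumptions for illustration.

```python
def max_pool(feature_map, size=2):
    """Keep the largest value from each non-overlapping size x size window."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]

pooled = max_pool(feature_map)
print(pooled)
```

The 4x4 map shrinks to 2x2, so the layers that follow have a quarter of the values to process.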



This final part of the CNN’s functionality is almost identical to that of an MLP which uses softmax as a transfer function in the final layer. As inputs, the CNN uses the meta-features created by pooling.


Fully-connected layers in this part of the CNN allow for additional non-linearity and different combinations of these high-level features, yielding a better generalization at a relatively low computational cost.
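The softmax function used in that final layer turns a vector of raw scores into probabilities that sum to one. A minimal Python sketch, with made-up scores for three classes:

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)                       # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]                      # made-up outputs of the last layer
probs = softmax(scores)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The class with the highest score also gets the highest probability, so the prediction is simply the index of the largest value.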


Training process

When training a CNN, we can use various algorithms; the most popular is backpropagation. Naturally, we must model the outputs using a series of binary vectors, the size of which is the number of classes.


Also, the initial weights in all the connections and the filters are all random. Once the CNN is fully trained, it can be used to identify new images that are related to the predefined classes.


Visualization of a CNN model

Visualizing a CNN is often necessary, as this enables us to better understand the results and decide whether the CNN has been trained properly. This is particularly useful when dealing with image data since we can see how the CNN’s perception of the input image evolves through the various layers.


We’ll see that Keras’ datasets module already provides this dataset, so no additional download is required. The code below is taken from the official Keras repository.


As usual, we begin by importing the relevant libraries. We should also import the MNIST dataset from the datasets module of Keras:

from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K


Then we define the batch size as 128, the number of classes as 10 (the number of digits from 0 to 9), the epochs to run the model as 12, and the input image dimension as (28,28), since all of the corresponding images are 28 by 28 pixels:

batch_size = 128

num_classes = 10

epochs = 12

img_rows, img_cols = 28, 28

Next, we obtain the MNIST data and load it to variables, after splitting as train and test sets:

(x_train, y_train), (x_test, y_test) = mnist.load_data()

It is time for some pre-processing—mostly reshaping the variables that hold the data:

if K.image_data_format() == 'channels_first':
    # Channel axis comes before the spatial axes: (samples, 1, rows, cols)
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    # Channel axis comes last: (samples, rows, cols, 1)
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

# Convert to floats and scale pixel values from [0, 255] to [0, 1]
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

Then we convert the vectors that hold classes into binary class matrices:

y_train = keras.utils.to_categorical(y_train, num_classes)

y_test = keras.utils.to_categorical(y_test, num_classes)


After these steps, we are now ready to build our graph, using a sequential model. We first add two convolutional layers on top of each other, then we apply the max-pooling operation to the output of the second convolutional layer.


Next, we apply dropout. Before we feed the resulting output to the dense layer, we flatten our variables, to comply with the input shapes of the dense layer.


The output of this dense layer is regulated with dropout; the resulting output is then fed into the last dense layer for classification. The softmax function is used to turn the results into something that can be interpreted in terms of probabilities. Here is the code snippet of the model building part:

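model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
# Dropout rates (0.25 and 0.5) follow the official Keras MNIST CNN example
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))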
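To see why the flatten step is needed and how large its output is, the spatial sizes can be traced by hand. The sketch below assumes the Keras defaults for the convolutional layers (stride 1, no padding); the helper name is made up for illustration.

```python
def conv_output_size(size, kernel, stride=1):
    """Spatial size after a convolution without padding."""
    return (size - kernel) // stride + 1

size = 28                         # MNIST images are 28 x 28
size = conv_output_size(size, 3)  # first 3x3 convolution  -> 26
size = conv_output_size(size, 3)  # second 3x3 convolution -> 24
size = size // 2                  # 2x2 max pooling        -> 12
flattened = size * size * 64      # 64 filters in the second convolutional layer
print(size, flattened)            # 12 9216
```

Under these assumptions, the first dense layer receives a vector of 9,216 values per image.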


We next compile our model using cross-entropy loss and the Adadelta optimization algorithm. We use accuracy as the evaluation metric, as usual:

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])


It is time to train our model on the training set that we separated from the original MNIST dataset before. We just use the fit() function of the model object to train our model:

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test))
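For intuition about what the loss and the metric measure, here is a hand-rolled NumPy sketch of categorical cross-entropy and accuracy on two made-up predictions. It illustrates the formulas only and is not the Keras implementation.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

# Two samples, three classes (values are illustrative).
y_true = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0]])
y_pred = np.array([[0.1, 0.8, 0.1],
                   [0.6, 0.3, 0.1]])

loss = categorical_crossentropy(y_true, y_pred)
accuracy = float(np.mean(y_true.argmax(axis=1) == y_pred.argmax(axis=1)))
print(loss, accuracy)
```

The loss shrinks as the probability assigned to the correct class approaches 1, while accuracy only checks whether the most probable class is the correct one.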


CNNs can be used in several different applications:

Identifying faces. This application is particularly useful in image analysis. The CNN first processes the image in low resolution and rejects the parts that don’t contain a face. It then focuses on the parts containing a face and draws the perceived boundaries in high resolution for better accuracy.


Computer vision (CV) in general. Beyond face recognition, CNNs are applied in various other computer vision scenarios. This has been a hot topic for the past decade or so and has yielded a variety of applications.


Self-driving cars. Since CV features heavily in self-driving cars, the CNN is often the AI tool of choice for this technology. The versatility of CNNs in the kinds of inputs they accept, and the fact that they have been studied thoroughly, make them the go-to option for NVIDIA’s self-driving car project.


NLP. Due to their high speed and versatility, CNNs lend themselves well to NLP applications.


The key here is to “translate” all the words into the corresponding embeddings, using specialized methods such as GloVe or word2vec. CNNs are a good fit here since NLP models range from incredibly simple, like a “bag of words”, to computationally demanding, like n-grams.


Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are an interesting type of DL network, widely used for NLP applications. They process data sequentially, resulting in an improved analysis of complex datasets, through the modeling of the temporal aspect of the data at hand. In a way, RNNs mimic human memory; this enables them to understand relationships among the data points like we do.


Interestingly, RNNs can also be used for text-related artificial creativity. It’s common for them to generate text that stylistically resembles some famous writer’s prose or even poems. Because of their popularity, RNNs have a few variants that are even more effective for the tasks in which they specialize.


RNN components

RNNs have “recurrent” as part of their name because they perform the same task for each element of a sequence, while the output depends on the previous computations.


This architecture could make it possible for the network to consider an unlimited number of previous states of the data. In reality, though, it usually includes merely a few steps. This is enough to give the RNN system a sense of “memory,” enabling it to see each data point within the context of the other data points preceding it.


Since the recurrent connections in an RNN are not always easy to depict or comprehend (particularly when trying to analyze its data flow), we often “unfold” them.


This creates a more spread-out version of the same network, where the temporal aspect of the data is more apparent. This process is sometimes referred to as “unrolling” or “unfolding”.


Data flow and functionality

The data in an RNN flows in loops, as the system gradually learns how each data point correlates with some of the previous ones. In this context, the hidden nodes of an RNN (which are often referred to as “states”) are basically the memory of the system.


As you would expect, these nodes have a non-linear activation function such as ReLU or tanh. The final layer before the output usually uses a softmax function, though, so as to approximate probabilities.


Contrary to a traditional DL system, which uses different weights at each layer, an RNN shares the same parameters across all steps. This is because it is basically performing the same task at every step, with the only difference being the inputs. This greatly decreases the total number of parameters it must learn, making the training phase faster and computationally lighter.
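The data flow and weight sharing described above can be sketched as a simple recurrent loop. This is a minimal NumPy illustration of a basic (Elman-style) RNN; the sizes, weight names, and random inputs are all assumptions made for the example.

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a simple RNN over a sequence and return the hidden state at each step."""
    h = np.zeros(W_hh.shape[0])  # the initial "memory" is empty
    states = []
    for x_t in inputs:
        # The same weights are reused at every time step.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 3, 4, 5  # hypothetical sizes
W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

sequence = rng.normal(size=(steps, input_dim))
states = rnn_forward(sequence, W_xh, W_hh, b_h)
print(states.shape)  # (5, 4)
```

Each hidden state depends on the current input and on the previous state, so it only carries information about the data points that came before it.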


Training process

When training an RNN, we employ many of the same principles as with other DL networks, with a key difference in the training algorithm (which is typically backpropagation). When calculating the gradient of the error at each output node, an RNN needs an algorithm that also accounts for the number of time steps traversed before reaching that node.


This variant of the training algorithm is called Backpropagation Through Time (BPTT). Because gradients tend to vanish or explode as they are propagated back through many time steps, BPTT is not good at helping a plain RNN learn long-term dependencies among its data points. Fortunately, this issue is largely addressed by a specialized architecture called the LSTM, which we’ll discuss in the next section.
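The instability of the gradient can be illustrated with a toy example. The sketch below uses a one-unit RNN, h_t = tanh(w * h_{t-1}), which is a deliberately simplified assumption, and multiplies the per-step derivatives as the chain rule requires.

```python
import math

def gradient_through_time(w, steps, h0=0.5):
    """Derivative of h_T with respect to h_0 for h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)  # chain rule: d tanh(w * h) / d h
    return grad

for steps in (1, 10, 50):
    print(steps, gradient_through_time(w=0.9, steps=steps))
```

With a recurrent weight below 1, every step multiplies the gradient by a factor smaller than 1, so the influence of early inputs fades quickly as the sequence gets longer.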


RNN variants

When it comes to variants of RNNs, the ones that stand out are Bidirectional RNNs (as well as their “deep” version), LSTMs, and GRUs.


Bidirectional RNNs and their deeper counterparts

A bidirectional RNN is like an ensemble of two RNNs. The key difference between these two networks is that one of them considers previous data points, while the other looks at data points that follow. This way, the two of them together can have a more holistic view of the data at hand, since they know both what’s before and what’s after.


A deep bidirectional RNN is like a regular bidirectional RNN, but with several layers for each time step. This enables a better prediction, but it requires a much larger dataset.
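A single-layer bidirectional RNN can be sketched by running one simple RNN forward, another one over the reversed sequence, and concatenating their states. This is an illustrative NumPy version with made-up sizes and random weights.

```python
import numpy as np

def run_rnn(inputs, W_xh, W_hh):
    """Simple tanh RNN that returns the hidden state at every time step."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        states.append(h)
    return np.stack(states)

def bidirectional_rnn(inputs, forward_weights, backward_weights):
    """Concatenate the states of a forward RNN and a backward RNN."""
    forward = run_rnn(inputs, *forward_weights)
    # Process the reversed sequence, then flip back to align with time.
    backward = run_rnn(inputs[::-1], *backward_weights)[::-1]
    return np.concatenate([forward, backward], axis=1)

rng = np.random.default_rng(1)
steps, input_dim, hidden_dim = 6, 3, 4  # hypothetical sizes
make_weights = lambda: (rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
                        rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)))
fw, bw = make_weights(), make_weights()

sequence = rng.normal(size=(steps, input_dim))
outputs = bidirectional_rnn(sequence, fw, bw)
print(outputs.shape)  # (6, 8)
```

At every time step, the first half of the output summarizes what came before and the second half summarizes what comes after.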


LSTMs and GRUs

Short for Long Short-Term Memory, an LSTM network (or cell) is a distinctive type of RNN that is widely used in NLP problems. It comprises four interacting neural network layers that work together to create a kind of memory that is not limited by the training algorithm (as it is in conventional RNNs).


This is possible because LSTMs have an internal mechanism that allows them to selectively forget, and to combine different previous states, in a way that facilitates the mapping of long-term dependencies. 
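One step of an LSTM cell can be sketched as follows. This is a minimal NumPy version of the standard formulation; the stacked parameter layout (input, forget, candidate, output) and all sizes are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W, U and b stack the parameters of the four internal layers."""
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * n:1 * n])  # input gate: how much new information to store
    f = sigmoid(z[1 * n:2 * n])  # forget gate: how much old memory to keep
    g = np.tanh(z[2 * n:3 * n])  # candidate values for the memory
    o = sigmoid(z[3 * n:4 * n])  # output gate: how much memory to expose
    c_t = f * c_prev + i * g     # selectively forget, then add new content
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
input_dim, hidden = 3, 4  # hypothetical sizes
W = rng.normal(scale=0.1, size=(4 * hidden, input_dim))
U = rng.normal(scale=0.1, size=(4 * hidden, hidden))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Because the memory cell is updated additively and guarded by the forget gate, information can be carried across many steps when the gate stays close to 1.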


LSTMs are quite complex; as such, developers quickly sought a more straightforward version. This is where GRUs, or Gated Recurrent Units, come into play; a GRU is basically a lightweight LSTM. A GRU has two gates: one for resetting, and one for updating previous states.


The first gate determines how to best combine the new input with the previous memory, while the second gate specifies how much of the previous memory to hold onto.
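The two gates can be sketched in a single GRU step. This is a minimal NumPy version of one common formulation (some references swap the roles of z and 1 - z); the parameter names and sizes are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, p):
    """One GRU step; p is a dict holding the weight matrices and biases."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])  # update gate
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])  # reset gate
    # The reset gate controls how the previous memory enters the candidate state.
    candidate = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    # The update gate decides how much of the previous memory to keep.
    return z * h_prev + (1.0 - z) * candidate

rng = np.random.default_rng(0)
input_dim, hidden = 3, 4  # hypothetical sizes
params = {}
for gate in ("z", "r", "h"):
    params["W_" + gate] = rng.normal(scale=0.1, size=(hidden, input_dim))
    params["U_" + gate] = rng.normal(scale=0.1, size=(hidden, hidden))
    params["b_" + gate] = np.zeros(hidden)

h = np.zeros(hidden)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, params)
print(h.shape)  # (4,)
```

Compared with an LSTM, there is no separate memory cell and one gate fewer, which is why the GRU has fewer parameters for the same hidden size.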


RNNs in action

Here we provide an example of text classification using the LSTM variant of the RNN. The code is implemented in Python and Keras. The dataset we use is from the IMDB movie database; it is already available in Keras’ datasets module. The dataset includes IMDB users’ comments on movies and their associated sentiment labels.


Our task is to classify comments as positive or negative sentiments—a binary classification task. The code below is taken from the official Keras repository.


First, we begin by importing the relevant libraries, as usual. Notice that we also import the IMDB dataset from Keras’ datasets module:

from __future__ import print_function

from keras.preprocessing import sequence

from keras.models import Sequential

from keras.layers import Dense, Embedding

from keras.layers import LSTM

from keras.datasets import imdb

Then, we set the maximum number of features to 20,000; the maximum number of words in a text to 80; and batch size to 32:

max_features = 20000

maxlen = 80

batch_size = 32

Next, we load the dataset into some variables, after splitting train and test sets:

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)


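We need to pad some text comments, as some of them are shorter than 80 words, and our model only accepts inputs of the same length. In short, padding works by adding a predefined value (0 by default) to a sequence until it reaches the desired length. By default, Keras’ pad_sequences adds the padding at the beginning of the sequence and truncates longer sequences so that only their last 80 words are kept. The code below pads the sequences: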

x_train = sequence.pad_sequences(x_train, maxlen=maxlen)

x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
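To show what padding does to a single sequence, here is a plain-Python sketch that mimics the default behavior of pad_sequences (padding and truncating at the start). The helper name is hypothetical, and it assumes maxlen is greater than zero.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    seq = list(seq)[-maxlen:]  # truncate from the start if too long
    return [value] * (maxlen - len(seq)) + seq

print(pad_or_truncate([5, 8, 2], maxlen=5))              # [0, 0, 5, 8, 2]
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5))  # [3, 4, 5, 6, 7]
```

After this step every comment is represented by exactly maxlen word indices, which is what the embedding layer expects.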


Now, we are all set to build our sequential model. First, we add an embedding layer, and then we add an LSTM. Last, we add the dense layer for classification:

model = Sequential()

model.add(Embedding(max_features, 128))

model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))

model.add(Dense(1, activation='sigmoid'))
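A quick sanity check on this model is to count its trainable parameters by hand. The arithmetic below assumes the standard LSTM formulation with four internal layers, each with input weights, recurrent weights, and a bias.

```python
max_features, embedding_dim, lstm_units = 20000, 128, 128

# One 128-dimensional vector per word in the vocabulary.
embedding_params = max_features * embedding_dim
# Four internal layers, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
# A single sigmoid unit on top of the LSTM output.
dense_params = lstm_units * 1 + 1

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)  # 2560000 131584 129 2691713
```

Most of the parameters sit in the embedding layer, which is typical for text models with a large vocabulary.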


After building the model, we compile it. We use binary cross-entropy loss as our loss function, the Adam algorithm as our optimizer, and accuracy as our evaluation metric:

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

It is time to train our model:

# 10 epochs, matching the result reported below
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=10,
          validation_data=(x_test, y_test))

Last, we test the performance of our model using the test set:

score, acc = model.evaluate(x_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)


The model achieves almost 82% accuracy on the test set after 10 epochs, which is a satisfactory result for such a simple model. The output of the code above begins with:


Train on 25000 samples, validate on 25000 samples

Before closing this blog, we will bring to your attention several other advanced DL models (e.g. GANs), contained in appendices of this blog. We strongly encourage you to read those appendices to learn more about DL models. After that, you can find many other useful resources to dig into the details of DL models.


RNNs are ideal tools to solve the following problems:

NLP. As mentioned before, RNNs excel at working with natural language text. Tasks such as predicting the next word or figuring out the general topic of a block of text are handled well by RNNs.


Text synthesis. A particular NLP application that deserves its own bullet point is text synthesis. This involves creating new streams of words, which is an extension of the “predicting the next word” application. RNNs can create whole paragraphs of text, taking text prediction to a whole new level.


Automated translation. This is a harder problem than it seems, since each language and dialect has its own intricacies (for instance, the order of words in constructing a sentence). To accurately translate something, a computer must process sentences as a whole, which is made possible through an RNN model.


Image caption generation. Although this is not entirely RNN-related, it is certainly a valid application. When combined with CNNs, RNNs can generate short descriptions of an image, perfect for captions. They can even evaluate and rank the most important parts of the image, from most to least relevant.


Speech recognition. When the sound of someone talking is transformed into a digitized sound wave, it is not far-fetched to ask an RNN to understand the context of each sound bit. The next step is turning that into written text, which is quite challenging, but plausible using the same RNN technology.