What is Deep Learning


Google uses deep learning for its voice recognition algorithms. Amazon and Netflix use it to decide what you might want to watch or buy next.


MIT researchers use it to predict the future. This well-established and still-growing industry is always eager to pitch its tools as revolutionary. But what is deep learning, exactly? Is it just a fad, a sexy new name used to repackage old-fashioned AI?


It would probably be helpful to look at deep learning as the cutting edge of the cutting edge. Machine learning takes some of the core ideas of AI and focuses them on solving real-world problems with neural networks designed to mimic the human brain's decision-making process.


Deep learning focuses on an even narrower subset of machine learning tools and techniques, and applies them to almost any problem that requires thought, whether artificial or human.


If you are just starting out in deep learning, or if you last worked with neural networks some time ago, you will probably find yourself a bit confused. Many people have been baffled by the field's resurgence, especially those who learned about neural networks in the 1990s and early 2000s.


Deep learning hardware 

CPU cores

Most deep learning applications and libraries use a single CPU core unless they are run within a parallelization framework such as Message Passing Interface (MPI), MapReduce, or Spark.


For example, CaffeOnSpark by the team at Yahoo! uses Spark with Caffe to parallelize network training across multiple GPUs and CPUs. In most normal single-machine settings, one CPU core is enough for deep learning application development.


CPU cache size

The CPU cache is an important CPU component used for high-speed computation. A CPU cache is usually organized as a hierarchy of cache layers, from L1 to L4, with L1 and L2 being the smaller, faster layers and L3 and L4 the larger, slower ones.


In an ideal setting, all the data needed by the application resides in cache, so no reads from RAM are required, making the overall operation faster.


However, this is rarely the case for deep learning applications. For example, a typical ImageNet experiment with a batch size of 128 needs more than 85 MB of CPU cache to store all the information for one mini-batch.


Since such datasets are not small enough to be cache-only, RAM reads cannot be avoided. Hence, modern CPU cache sizes have little to no impact on the performance of deep learning applications.
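To see why caches fall short, here is a back-of-the-envelope estimate of one mini-batch's memory footprint. The assumptions (256x256 RGB crops stored as 32-bit floats) are illustrative; actual pipelines vary:

```python
# Rough memory footprint of one ImageNet mini-batch.
# Assumptions (typical but illustrative): 256x256 RGB crops, float32 pixels.
batch_size = 128
height, width, channels = 256, 256, 3
bytes_per_float = 4

batch_bytes = batch_size * height * width * channels * bytes_per_float
batch_mb = batch_bytes / (1024 ** 2)

print(f"One mini-batch: {batch_mb:.0f} MB")  # ~96 MB, far larger than typical L3/L4 caches
```

Even a generous 64 MB L4 cache cannot hold a single mini-batch, let alone the model's parameters.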


RAM size

As we saw previously in this section, most deep learning applications read directly from RAM rather than from CPU caches. Hence, it is often advisable to keep CPU RAM at least as large as GPU RAM, if not larger.


The amount of GPU RAM you need depends on the size of your deep learning model. For example, ImageNet-based deep learning models have a large number of parameters, taking 4 GB to 5 GB of space, so a GPU with at least 6 GB of RAM is an ideal fit for such applications.


Pairing that GPU with a CPU that has at least 8 GB (and preferably more) of RAM lets application developers focus on the key aspects of their application instead of debugging RAM performance issues.


Hard drive

Typical deep learning applications require large datasets, often in the hundreds of gigabytes. Since this data cannot fit in any RAM, an ongoing data pipeline must be constructed: the application consumes mini-batches from GPU RAM, which in turn keeps reading data from CPU RAM, which loads data from the hard drive.


Since GPUs have a large number of cores, and each of these cores works on its own mini-batch of data, large volumes of data must constantly be read from disk to sustain high data parallelism.


For example, in AlexNet's Convolutional Neural Network (CNN) based model, roughly 300 MB of data needs to be read every second. A slow disk can cripple overall application performance, so a solid-state drive (SSD) is often the right choice for most deep learning application developers.


Deep learning software frameworks

Every good deep learning application needs several components in order to function correctly. These include:

  1. A model layer that gives the developer flexibility in designing his or her own models
  2. A GPU layer that makes it seamless for application developers to choose between GPU and CPU for their application
  3. A parallelization layer that allows the developer to scale an application to run on multiple devices or instances


As you can imagine, implementing these modules is not easy. A developer often ends up spending more time debugging implementation issues than legitimate model issues.


Thankfully, a number of software frameworks exist in the industry today that make deep learning application development practically a first-class citizen of the programming language.


These frameworks vary in architecture, design, and features, but almost all of them provide immense value by giving developers an easy and fast implementation path for their applications. In this section, we will take a look at some popular deep learning software frameworks and how they compare with each other.


AI, Deep Learning, and Machine Learning

Artificial intelligence examines how to create machines capable of carrying out tasks that would normally require human intelligence. This loose definition means that AI encompasses many fields of research, from expert systems to genetic algorithms, and it leaves plenty of room for argument over what counts as AI.


Machine learning has recently driven much of the success in AI research. It has allowed computers to surpass, or come very close to matching, human performance in areas ranging from face recognition to speech and language recognition.


Machine learning is the process of teaching a computer system to perform a certain task, rather than programming it to perform the task step by step.


Once training is finished, the system can make accurate predictions when given new data.


This may all sound dry, but those predictions can answer whether the fruit in a picture is an apple or a banana, whether a person is walking in front of a self-driving vehicle, whether the word in a sentence refers to a hotel reservation or a paperback book, or whether an email message is spam; they can even recognize speech well enough to generate captions for YouTube videos.


Normally, machine learning is divided into supervised learning, where the computer learns by example from labeled data, and unsupervised learning, where the computer groups similar data together and finds the anomalies.


Deep learning is a subfield of machine learning whose capabilities differ from traditional shallow machine learning in many important ways. It allows computers to solve a whole host of complex problems that couldn't be tackled any other way.


A good example of a shallow machine learning task is predicting how ice cream sales will vary with the outdoor temperature. Predictions like this use only a couple of data features and are relatively straightforward; they can be carried out with a shallow technique known as linear regression with gradient descent.
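As a minimal sketch of that shallow technique, here is linear regression fitted by gradient descent on a made-up temperature-versus-sales dataset (all numbers are invented for illustration):

```python
# Linear regression via gradient descent: predict ice cream sales from temperature.
# The data points below are invented for illustration (a perfectly linear toy relationship).
temps = [20.0, 25.0, 30.0, 35.0]   # degrees Celsius
sales = [40.0, 50.0, 60.0, 70.0]   # units sold

w, b = 0.0, 0.0        # slope and intercept, initialized at zero
lr = 0.001             # learning rate
n = len(temps)

for _ in range(50000):
    # Gradients of mean squared error with respect to w and b
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(temps, sales)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(temps, sales)) / n
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))   # approaches w = 2, b = 0 for this toy data
```

Two features, one straight line: there is no hierarchy of features to learn, which is exactly what makes this a shallow task.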


The problem is that a large number of real-world problems don't fit such a simple model. One example of a complex real-world problem is recognizing handwritten numbers.


To solve this problem, the computer has to cope with large variations in how the data can be presented. Each digit from 0 to 9 can be written in a myriad of different ways.


Even the size and shape of handwritten digits can vary widely, depending on who is writing them and under what circumstances.


Coping with the variability of all these features, and the large mess of interactions between them, is where deep neural networks and deep learning become useful. Neural networks, which we will cover more completely in a later blog, are mathematical models whose structure is very loosely based on the brain.


Every neuron in the network is a function: it receives data through an input, transforms that data into a more amenable form, and then sends it out through an output. These neurons are arranged in layers.


Each of these networks has an input layer, where the starting data is fed in, and an output layer, which generates the final prediction.


In a deep neural network, several hidden layers of neurons sit between the input and output layers, each one feeding data into the next.


This is the "deep" in deep learning and deep neural networks: it refers to the number of hidden layers, normally more than three, at the heart of these networks.
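The layer-by-layer flow described above can be sketched as a tiny forward pass in plain Python. Every weight and bias below is an arbitrary placeholder, not a trained model:

```python
import math

def sigmoid(z):
    """Squash a value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One dense layer: each neuron sums its weighted inputs, adds a bias, then activates."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# A toy network: 2 inputs -> 3 hidden neurons -> 2 hidden neurons -> 1 output.
# All weights and biases are made-up placeholders for illustration.
x = [0.5, -1.2]
h1 = layer(x, [[0.1, 0.4], [-0.3, 0.8], [0.5, -0.5]], [0.0, 0.1, -0.1])
h2 = layer(h1, [[0.7, -0.2, 0.3], [0.2, 0.9, -0.4]], [0.05, -0.05])
y = layer(h2, [[1.0, -1.0]], [0.0])

print(y)  # a single prediction between 0 and 1
```

Each call to `layer` is one of the hidden layers feeding the next; stacking more calls is what makes the network "deep."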


A neuron is said to activate once the sum of the values entering it passes a certain threshold. What activation means differs depending on the layer it is in.


In the first hidden layer, activation could mean that the image of the handwritten number contains a certain combination of pixels resembling the horizontal line at the top of a seven. In this way, the first hidden layer detects many of the important lines and curves that will eventually combine into the final number.


A real neural network would likely have several hidden layers, with many neurons in each. All of the small curves and lines found by the first layer are fed into the second hidden layer, which detects how they combine into recognizable shapes belonging to a certain digit, like the full loop of a six.

By feeding data between the different layers in this way, each layer handles progressively higher-level features.


How do these layers tell the computer which number is written? Together, the layers of neurons let the network build a rough hierarchy of the features that make up the written number in question.


For example, if the input is an array of values representing the individual pixels in a photo of a written number, the next layer might combine those pixels into lines and shapes, the layer after that might combine the shapes into distinctive images, such as the loops in an eight or the triangle in a four, and so on.


By gradually building up a picture of these features, a modern neural network can determine, with very good accuracy, which number corresponds to the written digit.


In a similar manner, different kinds of deep neural networks can be trained to pick out faces in a picture, or to turn audio into written words.


The process of building these increasingly complex hierarchies of features, out of nothing except pixels, is learned by the network. The computer can learn because the network is able to alter the importance of the connections between each layer's neurons.


Each link has an attached value known as a weight, which modifies the value sent out by a neuron as it travels between layers. By changing the values of the different weights, along with a value known as the bias, it is possible to emphasize or diminish the importance of the links between neurons in the network.


For example, when recognizing a handwritten number, these weights can be changed to highlight the importance of a certain pixel group that forms a line, or of a pair of intersecting lines that form a seven.


Neural Networks

Neural networks, sometimes referred to as Artificial Neural Networks, are machine learning models that loosely simulate how the human brain works. You should understand that neural networks don't provide a solution for every problem; rather, for many machine learning tasks they deliver the best results among the available techniques.


The most common uses of neural networks are classification and clustering. They can also be used for regression, but better methods exist for that.


A neuron is the building unit of a neural network, and it works something like a human neuron. A typical neural network uses the sigmoid activation function, largely because its derivative can be written in terms of f(x) itself, which works well for minimizing error.


Let's look at what a Perceptron is, starting with the biological inspiration. Dendrites are extensions that come off a nerve cell. They receive signals and pass them on to the cell body, which processes the stimulus and decides whether or not to trigger a signal. When the cell does trigger a signal, the cell body's extension, known as an axon, carries out a chemical transmission at its end to another cell.


There is no need to feel like you have to memorize any of this. We aren’t actually studying neuroscience, so you only need a vague impression of how this works.


A Perceptron looks similar to an actual neuron because it was inspired by the way real neurons work. Keep in mind that it was only inspired by a neuron; it in no way acts exactly like a real one. A Perceptron processes data as follows:


1. There are small circles on the left side of the Perceptron, the "neurons," labeled with subscripts x1, x2, …, xm; these carry the data input.

2. Each input is multiplied by a weight, labeled with a subscript 1, 2, …, m, and travels along a long arrow, called a synapse, to the big circle in the middle. So you will have w1 * x1, w2 * x2, w3 * x3, and so on.


3. After all of the inputs have been multiplied by their weights, you sum them all up and add a pre-determined bias.


4. The result is then pushed to the right, through the step function. The step function says that if the number from step three is greater than or equal to zero, the output is one; otherwise, if the result is lower than zero, the output is zero.


5. You will get an output of either zero or one.


If you move the bias to the right-hand side of the activation condition, writing it as sum(wx) ≥ -b, then -b is known as the threshold value. In this form, if the sum is greater than or equal to the threshold, the activation is one; otherwise, it is zero. Pick whichever form helps you understand the process, because the two representations are interchangeable.
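A quick sketch of the two equivalent formulations (the inputs, weights, and bias below are arbitrary example values):

```python
def perceptron_bias(inputs, weights, bias):
    """Bias form: fires (outputs 1) when sum(w*x) + b >= 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total + bias >= 0 else 0

def perceptron_threshold(inputs, weights, threshold):
    """Threshold form: fires when sum(w*x) >= threshold, where threshold = -bias."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Arbitrary example values; the two forms always agree when threshold = -bias.
x, w, b = [1, 0, 1], [2.0, -1.0, 0.5], -2.0
print(perceptron_bias(x, w, b))        # 1  (2.0 + 0.5 - 2.0 = 0.5 >= 0)
print(perceptron_threshold(x, w, -b))  # 1  (2.5 >= 2.0)
```

Since the two functions only differ by where the constant sits in the inequality, they produce identical outputs for every input.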


Now you have a pretty good understanding of how a Perceptron works: it is nothing but some mechanical multiplications, followed by a summation, then an activation, which finally gives you an output.


To make sure you fully understand this, let's look at a very simple (and not very realistic) example. Assume that, having read this book, you are extremely motivated and must decide whether or not to study deep learning. Three major factors will drive your decision:


  1. Will you be able to make more money once you master deep learning? 0 – No, 1 – Yes.
  2. Are the required programming and mathematics simple? 0 – No, 1 – Yes.
  3. Can you use deep learning immediately, without having to buy an expensive GPU? 0 – No, 1 – Yes.


Our input variables x1, x2, and x3 correspond to these factors, and each takes a binary value, since they are all simple yes-or-no questions.

Let's assume you really love deep learning and are ready to work through your lifelong fear of programming and math. You also have some money put away to invest in the expensive Nvidia GPU that will train your deep learning models.


You can assume these two factors carry the same importance, because both can be compromised on. But you really want to make extra money after putting all that time and energy into learning deep learning. Since you have a high expectation of ROI, if you can't make more money, you aren't going to waste your time on deep learning.


Now that we understand the decision preferences, assume you have a 100 percent probability of making extra money once you've learned deep learning, because there is plenty of demand and little supply; that means x1 = 1. Let's assume the programming and math are extremely hard; that means x2 = 0.


Finally, let's assume you will need a powerful GPU such as a Titan X; that means x3 = 0. Now that you have the inputs, you can initialize the weights. We're going to try w1 = 8, w2 = 3, w3 = 3.


The higher a weight's value, the bigger the influence its input has. Since the money you will make matters most to your decision, w1 is greater than w2, and w1 is greater than w3.


Let's set the threshold to five, which corresponds to a bias of negative five. We sum everything up and add in the bias term: 8 * 1 + 3 * 0 + 3 * 0 - 5 = 3 ≥ 0, so the output is one. With a threshold of five, you will decide to learn deep learning whenever you are going to make more money.


Even if the math turns out to be easy, or you won't have to buy a GPU, neither of those factors on its own will clear the threshold; you won't study deep learning unless you can make extra money later on.


Now you have a decent understanding of bias and threshold. With a threshold as high as five, the heavily weighted factor essentially has to be satisfied for the output to be one; otherwise, you will receive a zero.
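Plugging the example's numbers into a step-function Perceptron confirms the outcome (all values are taken from the scenario above):

```python
def perceptron(inputs, weights, bias):
    """Step-function Perceptron: output 1 if the weighted sum plus bias is >= 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total + bias >= 0 else 0

weights = [8, 3, 3]   # money matters most, so w1 dominates
bias = -5             # equivalent to a threshold of 5

# More money, hard math, expensive GPU required:
print(perceptron([1, 0, 0], weights, bias))  # 1 -> study deep learning (8 - 5 = 3 >= 0)

# No extra money, easy math, but still an expensive GPU:
print(perceptron([0, 1, 0], weights, bias))  # 0 -> skip it (3 - 5 = -2 < 0)
```

Changing `weights` or `bias` here is exactly the "varying the threshold, bias, and weights" experiment described next.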


The fun part comes next: varying the threshold, bias, and weights gives you different possible decision-making models. In our example, lowering the threshold from five to three produces new scenarios in which the output is one.


Despite how well loved Perceptrons were, their popularity faded quietly due to their limitations. Later on, people realized that a multi-layer Perceptron could learn the logic of an XOR gate, but this requires backpropagation so that the network can learn from its own mistakes. Every deep learning neural network is data-driven.


If a model's output differs from the desired output, you need a way to backpropagate the error information through the network, letting the weights know they need to adjust themselves by certain amounts.


This way, the model's actual outputs gradually get closer to the desired outputs with each round of training.


As it turned out, for more complicated tasks whose outputs cannot be expressed as a linear combination of the inputs, that is, outputs that are not linearly separable, the step function won't work, because it doesn't support backpropagation. Backpropagation requires an activation function with meaningful derivatives.


Here's just a bit of calculus: the step function's derivative comes out to 0 for every input except at the point 0 itself.


At the point 0, the derivative is undefined, because the function is discontinuous there. Even though the step function is easy and simple, it can't handle more complicated tasks.


Sigmoid function: f(x) = 1/(1 + e^(-x))
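A quick numeric sketch of the sigmoid and its convenient derivative, f'(x) = f(x)(1 - f(x)):

```python
import math

def sigmoid(x):
    """f(x) = 1 / (1 + e^-x): a smooth, differentiable squashing function."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """The derivative can be written using f(x) itself: f'(x) = f(x) * (1 - f(x))."""
    fx = sigmoid(x)
    return fx * (1.0 - fx)

print(sigmoid(0))             # 0.5
print(sigmoid_derivative(0))  # 0.25, the maximum slope, at x = 0
# Unlike the step function, the slope is defined everywhere,
# which is exactly what backpropagation needs.
```

This is why sigmoid-style activations replaced the step function once backpropagation became the standard training method.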


The following TypeScript listing uses the synaptic library together with Jimp to train a small Perceptron on per-pixel image data:

import Jimp = require("jimp");
import Promise from "ts-promise";
const synaptic = require("synaptic");
const _ = require("lodash");

const Neuron = synaptic.Neuron,
    Layer = synaptic.Layer,
    Network = synaptic.Network,
    Trainer = synaptic.Trainer,
    Architect = synaptic.Architect;

// Read an image and build an input set of [red, green] values, one pair per pixel
function getImgData(filename) {
    return new Promise((resolve, reject) => {
        Jimp.read(filename).then((image) => {
            let inputSet: any = [];
            image.scan(0, 0, image.bitmap.width, image.bitmap.height, function (x, y, idx) {
                var red = image.bitmap.data[idx + 0];
                var green = image.bitmap.data[idx + 1];
                inputSet.push([red, green]);
            });
            resolve(inputSet);
        }).catch(function (err) {
            reject(err);
        });
    });
}

// A Perceptron with 2 inputs (red, green), one hidden layer of 3 neurons, and 2 outputs
const myPerceptron = new Architect.Perceptron(2, 3, 2);
const trainer = new Trainer(myPerceptron);
const trainingSet: any = [];

getImgData('imagefilename.jpg').then((inputs: any) => {
    getImgData('imagefilename.jpg').then((outputs: any) => {
        // Normalize the 0-255 channel values to the 0-1 range the network expects
        for (let i = 0; i < inputs.length; i++) {
            trainingSet.push({
                input: _.map(inputs[i], (val: any) => val / 255),
                output: _.map(outputs[i], (val: any) => val / 255)
            });
        }

        trainer.train(trainingSet, {
            iterations: 200,
            error: .005,
            shuffle: true,
            log: 10,
            cost: Trainer.cost.CROSS_ENTROPY
        });

        // Run the trained network over a new image, pixel by pixel
        Jimp.read('yours.jpg').then((image) => {
            image.scan(0, 0, image.bitmap.width, image.bitmap.height, (x, y, idx) => {
                var red = image.bitmap.data[idx + 0];
                var green = image.bitmap.data[idx + 1];
                var out = myPerceptron.activate([red / 255, green / 255]);
                image.bitmap.data[idx + 0] = _.round(out[0] * 255);
                image.bitmap.data[idx + 1] = _.round(out[1] * 255);
            });
        }).catch(function (err) {
            console.error(err);
        });
    });
});


Automatically Placing Audio in Silent Movies

In this task, a system synthesizes sounds to match a silent video. The system was trained on a thousand examples of videos in which a drumstick strikes different types of surfaces and creates different types of sounds.


The deep learning model associates the video frames with a database of pre-recorded sounds in order to select the sound to play that best matches what is happening in the scene.


The system was then evaluated with a Turing-Test-like setup, where humans had to determine whether a video's sounds were real or synthesized. This application uses both LSTMs and RNNs.


Automatic Machine Translation

This is the task of automatically translating a given word, phrase, or sentence from one language into another. The technology has been around for a while, but deep learning has achieved the best results in two areas:

  • Image translations
  • Text translations


Text translation can be done without any pre-processing of the sequence, which allows the algorithm to learn the dependencies between words and their mapping into the new language.


Automatic Text Generation

This is one of the most interesting tasks. A body of text is learned, and new text is generated from it, either character-by-character or word-by-word.


The model can learn to capture the text's style, sentence forms, punctuation, and spelling. Large recurrent neural networks are helpful for learning the relationships between items in an input sequence, and can then generate new text.


Automatic Handwriting Generation

For this task, a corpus of handwriting examples is provided, and new handwriting is generated for a given word or phrase. The handwriting samples are given as the sequences of pen coordinates traced as they were created. From this corpus, the relationship between pen movement and letter shapes is learned, and new examples can be generated ad hoc.


Internet Search

Chances are, when you hear the word search, your first thought is Google. But there are actually several other search engines out there, such as DuckDuckGo, AOL, Ask, Bing, and Yahoo.


Every search engine uses some form of data science algorithm to provide its users with the best results for their query in under a second. Think about this: Google processes over 20 petabytes of data every single day. Without data science, Google would not be as good as it is today.


Building a DL Network Using MXNet

MXNet is independent of any single programming language and can be used from other languages through the corresponding APIs.


Finally, there exist several tutorials for MXNet, should you wish to learn more about its various functions. Because MXNet is an open-source project, you can even create your own tutorial, if you are so inclined.


What’s more, it is a cross-platform tool, running on all major operating systems. MXNet has been around long enough that it is a topic of much research.


Core components

Gluon interface

Gluon is a simple interface for all your DL work using MXNet. You install it on your machine just like any Python library:

pip install mxnet --pre --user


The main selling point of Gluon is that it is straightforward. It offers an abstraction of the whole network building process, which can be intimidating for people new to the craft.


Also, Gluon is very fast, not adding any significant overhead to the training of your DL system. Moreover, Gluon can handle dynamic graphs, offering some malleability in the structure of the ANNs created. Finally, Gluon has an overall flexible structure, making the development process for any ANN less rigid.


Naturally, for Gluon to work, you must have MXNet installed on your machine (although you don’t need to if you are using the Docker container provided with this blog). This is achieved using the familiar pip command:

pip install mxnet --pre --user


Because of its utility and excellent integration with MXNet, we’ll be using Gluon throughout this blog, as we explore this DL framework. However, to get a better understanding of MXNet, we’ll first briefly consider how you can use some of its other functions.



The NDArray is a particularly useful data structure used throughout an MXNet project. NDArrays are essentially NumPy arrays, with the added capability of asynchronous computation.


They are also compatible with distributed cloud architectures, and can even utilize automatic differentiation, which is particularly useful when training a deep learning system; but NDArrays can be used effectively in other ML applications too. NDArrays are part of the MXNet package, which we will examine shortly. You can import the NDArray module as follows:

from mxnet import nd

To create a new NDArray consisting of 4 rows and 5 columns, for example, you can type the following:

nd.empty((4, 5))


The output will differ every time you run it, since the framework allocates whatever values it finds in the memory assigned to that array. If you want the NDArray to contain just zeros instead, type:

nd.zeros((4, 5))

To find the number of rows and columns of a variable holding an NDArray, use the .shape attribute, just like in NumPy:

x = nd.empty((2, 7))

x.shape # (2, 7)


Finally, if you want to find the total number of elements in an NDArray, use the .size attribute:

x.size # 14
The operations on an NDArray are just like the ones in NumPy, so we won't elaborate on them here. Contents are also accessed in the same way, through indexing and slicing.
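For reference, here is what those operations look like in NumPy; the same element-wise arithmetic, indexing, and slicing carry over to NDArrays almost verbatim, with `nd.array` in place of `np.array`:

```python
import numpy as np

# Element-wise arithmetic: the same syntax works on MXNet NDArrays.
x = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
y = np.ones((2, 3))

print(x + y)    # element-wise addition
print(x * y)    # element-wise (not matrix) multiplication

# Indexing and slicing, identical in both libraries:
print(x[1, 2])  # 6.0 -- a single element
print(x[:, 1])  # [2. 5.] -- the second column
```

This symmetry is deliberate: it keeps the learning curve flat for anyone already comfortable with NumPy.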


Should you want to turn an NDArray into a more familiar data structure from the NumPy package, you can use the asnumpy() function:

y = x.asnumpy()

The reverse can be achieved using the array() function:

z = nd.array(y)


One of the distinguishing characteristics of NDArrays is that each array can be assigned its own computational context, either the CPU or a GPU attached to your machine (this is what "context" refers to when discussing NDArrays).


This is made possible by the ctx parameter in all of the package's relevant functions. For example, to create an array of zeros assigned to the first GPU, simply type:

a = nd.zeros(shape=(5,5), ctx=mx.gpu(0))


Of course, the data assigned to a particular processing unit is not set in stone. It is easy to copy data to a different location, linked to a different processing unit, using the copyto() function:

y = x.copyto(mx.gpu(1)) # copy the data of NDArray x to the 2nd GPU

You can find the context of a variable through the .context attribute:

x.context

It is often more convenient to define the context of both the data and the model, using a separate variable for each. For example, say that your DL project uses data that you want processed by the CPU, and a model that you prefer handled by the first GPU. In this case, you'd type something like:

data_ctx = mx.cpu()

model_ctx = mx.gpu(0)

MXNet package in Python


The MXNet package (typed "mxnet", in all lower-case letters, in Python) is a very robust and self-sufficient Python library that provides deep learning capabilities through the MXNet framework. Importing the package is fairly straightforward:

import mxnet as mx


If you want access to some additional features that make the MXNet experience even better, it is highly recommended that you first install the following packages on your computer:

graphviz (ver. 0.8.1 or later)

requests (ver. 2.18.4 or later)

numpy (ver. 1.13.3 or later)

You can learn more about the MXNet package through the corresponding GitHub repository.