Machine Learning using Python (2019)



Machine Learning using Python Tutorial

The business world is bursting with activities and philosophies about machine learning and its application to various business environments. Machine learning is the capability of systems to learn without explicit software development. It evolved from the study of pattern recognition and computational learning theory. This tutorial explains several machine learning algorithms using Python.


The impact is that, with the appropriate processing and skills, you can amplify your own data capabilities, by training a processing environment to turn massive amounts of data into actionable knowledge, while you have a cup of coffee, for example.


This skill is an essential part of achieving major gains in shortening the data-to-knowledge cycle.


I will cover limited rudimentary theory, as machine learning encompasses a wide area of expertise that merits a book by itself. So, I will introduce you only to the core theories.


Supervised Learning

Supervised learning is the machine-learning task of inferring a function from labeled training data. The training data consists of a set of training examples. In supervised learning, each example is a pair consisting of an input object and the desired output value. You use this when you know the required outcome for a set of input features.



If you buy bread and jam, you can make a jam sandwich. Without either, you have no jam sandwich. If you investigate this data set, you can easily spot from the indicators what is bread and what is jam. A data science model could perform the same task, using supervised learning. 
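As a minimal sketch of this idea, you can train a classifier on a labeled bread-and-jam data set. The toy encoding below is my own, invented purely for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Each training example: [have_bread, have_jam]
# Label: 1 = a jam sandwich is possible, 0 = it is not
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 0, 0, 0]

model = DecisionTreeClassifier(random_state=0)
model.fit(X, y)

# The model has inferred the input-to-output function from labeled pairs
print(model.predict([[1, 1], [0, 1]]))
```

Because every example is an (input, desired output) pair, this is supervised learning in its simplest form.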


Unsupervised Learning

Unsupervised learning is the machine-learning task of inferring a function to describe hidden structures from unlabeled data. This encompasses many other techniques that seek to summarize and explain key features of the data.



You can take a bag of marbles with different colors, sizes, and materials and split them into three equal groups, by applying a set of features and a model. You do not know up front what the criteria are for splitting the marbles.
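A minimal sketch of the marble idea, assuming each marble is reduced to two numeric features (the size and weight numbers below are invented stand-ins); k-means is one common choice of model for this kind of split:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Three hidden kinds of marble, described by [size, weight];
# the numbers are hypothetical, chosen only to form three groups
marbles = np.vstack([
    rng.normal([1.0, 2.5], 0.1, (20, 2)),   # small glass
    rng.normal([2.0, 7.8], 0.1, (20, 2)),   # medium steel
    rng.normal([3.0, 1.2], 0.1, (20, 2)),   # large clay
])

# No labels are supplied; the model infers the grouping itself
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(marbles)
print(np.bincount(model.labels_))
```

The model recovers three equal groups without ever being told what the criteria were.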


Reinforcement Learning

Reinforcement learning (RL) is an area of machine learning, inspired by behavioral psychology, that is concerned with how software agents should take actions in an environment, so as to maximize some notion of cumulative reward.


This is used in several different areas, such as game theory, swarm intelligence, control theory, operations research, simulation-based optimization, multi-agent systems, statistics, and genetic algorithms.


The process is simple. Your agent extracts features from the environment that are either “state” or “reward.” State features indicate that something has happened.


Reward features indicate that something happened that has improved or worsened the perceived gain in reward. The agent uses the state and reward to determine actions to change the environment.


This process of extracting state and reward, plus responding with action, will continue until a pre-agreed end reward is achieved. The real-world applications for this type of reinforcement learning are endless. You can apply reinforcement learning to any environment that you can control with an agent.
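The loop of extracting reward and responding with action can be sketched without any library at all. The two-action environment below is hypothetical, invented only to show the shape of the loop:

```python
import random

random.seed(0)
values = [0.0, 0.0]   # the agent's estimated reward per action
counts = [0, 0]       # how often each action has been taken

def environment(action):
    # Hypothetical rewards: action 1 pays better on average
    return random.gauss(1.0 if action == 1 else 0.2, 0.1)

for step in range(500):
    # Explore occasionally; otherwise act greedily on current estimates
    if random.random() < 0.1:
        action = random.randrange(2)
    else:
        action = values.index(max(values))
    reward = environment(action)
    counts[action] += 1
    # Update the estimate as a running mean of observed rewards
    values[action] += (reward - values[action]) / counts[action]

print('Best action:', values.index(max(values)))
```

The agent converges on the better action purely from the stream of rewards, with no labeled training data.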


I build many RL systems that monitor processes, such as a sorting system of purchases or assembly of products. It is also the core of most robot projects, as robots are physical agents that can interact with the environment.


I also build many “soft-robots” that take decisions on such data processing as approval of loans, payments of money, and fixing of data errors.


Bagging Data

Bootstrap aggregating, also called bagging, is a machine-learning ensemble meta-algorithm that aims to advance the stability and accuracy of machine-learning algorithms used in statistical classification and regression. It decreases variance and helps systems to avoid overfitting.


I want to cover this concept, as I have seen many data science solutions over the last few years that suffered from overfitting, because they were trained with a known data set that eventually became the only data set they could process.


Thanks to inefficient processing and algorithms, we naturally had leeway for variance in the data.


The new GPU (graphics processing unit)-based systems are so accurate that they overfit easily, if the training data is a consistent set, with little or no major changes in the patterns within the data set.


You will now see how to perform a simple bagging process. Open your Python editor and create this ecosystem:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
You need a few select settings.
n_repeat = 100 # Number of iterations for processing
n_train = 100 # Size of the Training Data set
n_test = 10000 # Size of the Test Data set
noise = 0.1 # Standard deviation of the noise introduced
np.random.seed(0)
You will select two estimators to compare.
estimators = [("Tree", DecisionTreeRegressor()),
              ("Bagging(Tree)", BaggingRegressor(DecisionTreeRegressor()))]
n_estimators = len(estimators)
You will need a set of data to perform the bagging against, so generate a random data set.
def f(x):
    x = x.ravel()
    return np.exp(-x ** 2) - 2 * np.exp(-(x - 2) ** 2)
You can experiment with other data configurations, if you want to see the process working.
You need to create a function to add the noise to the data.
def generate(n_samples, noise, n_repeat=1):
    X = np.random.rand(n_samples) * 10 - 5
    X = np.sort(X)
    if n_repeat == 1:
        y = f(X) + np.random.normal(0.0, noise, n_samples)
    else:
        y = np.zeros((n_samples, n_repeat))
        for i in range(n_repeat):
            y[:, i] = f(X) + np.random.normal(0.0, noise, n_samples)
    X = X.reshape((n_samples, 1))
    return X, y
You can now train the system using these Transform steps.
X_train = []
y_train = []
You train the system with the bagging data set, by taking a sample each cycle. This exposes the model to a more diverse training spread of the data.
for i in range(n_repeat):
    X, y = generate(n_samples=n_train, noise=noise)
    X_train.append(X)
    y_train.append(y)
You can now test your models.
X_test, y_test = generate(n_samples=n_test, noise=noise, n_repeat=n_repeat)
You can now loop over estimators to compare the results, by computing your predictions.
for n, (name, estimator) in enumerate(estimators):
    y_predict = np.zeros((n_test, n_repeat))
    for i in range(n_repeat):
[i], y_train[i])
        y_predict[:, i] = estimator.predict(X_test)
    # Bias^2 + Variance + Noise decomposition of the mean squared error
    y_error = np.zeros(n_test)
    for i in range(n_repeat):
        for j in range(n_repeat):
            y_error += (y_test[:, j] - y_predict[:, i]) ** 2
    y_error /= (n_repeat * n_repeat)
    y_noise = np.var(y_test, axis=1)
    y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2
    y_var = np.var(y_predict, axis=1)
You can now display your results.
    print("{0}: {1:.4f} (error) = {2:.4f} (bias^2) "
          " + {3:.4f} (var) + {4:.4f} (noise)".format(
              name, np.mean(y_error), np.mean(y_bias),
              np.mean(y_var), np.mean(y_noise)))
You can now plot your results.
    plt.subplot(2, n_estimators, n + 1)
    plt.plot(X_test, f(X_test), "b", label="$f(x)$")
    plt.plot(X_train[0], y_train[0], ".b", label="LS ~ $y = f(x)+noise$")
    for i in range(n_repeat):
        if i == 0:
            plt.plot(X_test, y_predict[:, i], "r", label="$\^y(x)$")
        else:
            plt.plot(X_test, y_predict[:, i], "r", alpha=0.05)
    plt.plot(X_test, np.mean(y_predict, axis=1), "c",
             label="$\mathbb{E}_{LS} \^y(x)$")
    plt.xlim([-5, 5])
    if n == 0:
        plt.legend(loc="upper left", prop={"size": 11})
    plt.subplot(2, n_estimators, n_estimators + n + 1)
    plt.plot(X_test, y_error, "r", label="$error(x)$")
    plt.plot(X_test, y_bias, "b", label="$bias^2(x)$")
    plt.plot(X_test, y_var, "g", label="$variance(x)$")
    plt.plot(X_test, y_noise, "c", label="$noise(x)$")
    plt.xlim([-5, 5])
    plt.ylim([0, 0.1])
    if n == 0:
        plt.legend(loc="upper left", prop={"size": 11})
Display your hard work!


Well done. You have completed your bagging example.

Remember: Bagging enables the training engine to train against different sets of the data you expect it to process. This reduces the impact of outliers and extremes on the data model. So, remember that you took a model and trained it against several training sets sampled from the same population of data.


Random Forests

Random forests, or random decision forests, are an ensemble learning method for classification and regression that works by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.


Random decision forests correct for decision trees’ habit of overfitting to their training set.


The result is an aggregation of all the trees’ results, by performing a majority vote against the range of results. So, if five trees return three yeses and two nos, it passes a yes out of the Transform step. 
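The five-tree vote above can be sketched directly:

```python
from collections import Counter

# Five trees return three yeses and two nos
tree_votes = ['yes', 'yes', 'no', 'yes', 'no']

# The majority vote aggregates the individual tree results
decision = Counter(tree_votes).most_common(1)[0][0]
print(decision)
```

The aggregation passes a yes out of the Transform step, exactly as described above.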


Sometimes, this is also called tree bagging, as you take a bagging concept to the next level by not only training the model on a range of samples from the data population but by actually performing the complete process with the data bag and then aggregating the data results.


Let me guide you through this process. Open your Python editor and prepare the following ecosystem:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import sys
import os
import datetime as dt
import calendar as cal
Set up the data location and load the data.
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('Working Base :',Base, ' using ', sys.platform)
basedate = dt.datetime(2018,1,1,0,0,0)
InputFile=Base+'/'+Company+'/03-Process/01-EDS/02-Python/' + InputFileName
ShareRawData=pd.read_csv(InputFile, header=0,
    usecols=['Open','Close','UnitsOwn'], low_memory=False)
You must perform some preprocessing to reveal features in the data.
ShareRawData.index.names = ['ID']
ShareRawData['nRow'] = ShareRawData.index
ShareRawData['TradeDate']=ShareRawData.apply(lambda row:\
    (basedate - dt.timedelta(days=row['nRow'])),axis=1)
ShareRawData['WeekDayName']=ShareRawData.apply(lambda row:\
    (cal.day_name[row['TradeDate'].weekday()])\
    ,axis=1)
ShareRawData['WeekDayNum']=ShareRawData.apply(lambda row:\
    (row['TradeDate'].weekday())\
    ,axis=1)
ShareRawData['sTarget']=ShareRawData.apply(lambda row:\
    'true' if row['Open'] < row['Close'] else 'false'\
    ,axis=1)
Here is your data set:
Select a data frame with the feature variables and the target.
sColumns = ['Open','Close','UnitsOwn','WeekDayNum','sTarget']
df = pd.DataFrame(ShareRawData, columns=sColumns)
Let’s look at the top-five rows.
print(df.head())
You need to select the target column.
df2 = pd.DataFrame(df['sTarget'])
df2.columns = ['WeekDayNum']
You must select a training data set.
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
Now create two new data frames, one with the training rows and the other with the test rows.
train, test = df[df['is_train']==True], df[df['is_train']==False]
Here is the number of observations for the test and training data frames:
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:',len(test))
Start processing the data by creating a list of the feature column names.
features = df.columns[:3]
Display your features.
print(features)
You must factorize your target to use the model I selected.
y = pd.factorize(train['WeekDayNum'])[0]
You can now view the target values.
print(y)
You can now train the random forest classifier.
Create a random forest classifier.
clf = RandomForestClassifier(n_jobs=2, random_state=0)
You now train the classifier to take the training features and learn how they relate to the training y (weekday number).[features], y)
Now apply the classifier to the test data. This action is called “scoring.”
preds = clf.predict(test[features])
You can look at the predicted probabilities of the first ten observations.
print(clf.predict_proba(test[features])[0:10])
Evaluate the classifier. Is it any good?
Look at the PREDICTED week day number for the first ten observations.
print('PREDICTED Week Day Number:', preds[0:10])
Look at the ACTUAL week day number for the first ten observations.
print('ACTUAL Week Day Number:', test['WeekDayNum'].head(10).values)
I suggest you create a confusion matrix.
c=pd.crosstab(test['WeekDayNum'], preds,
    rownames=['Actual Week Day Number'],
    colnames=['Predicted Week Day Number'])
print(c)
You can also look at a list of the features and their importance scores.
print(list(zip(train[features], clf.feature_importances_)))


You have completed the Transform steps for a random forest solution. At this point, I want to explain an additional aspect of random forests: the daisy-chaining of a series of random forests to create a solution.


I have found these to become more popular over the last two years, as solutions become more demanding and data sets become larger. The same principles apply; you are simply repeating them several times in a chain.


Computer Vision (CV)


Computer vision is a complex feature extraction area, but once you have the features exposed, it simply becomes a matrix of values.

Open your Python editor and enter this quick example:

import matplotlib.pyplot as plt
from PIL import Image
import numpy as np
imageIn ='...')  # the image path was not preserved; point this at your own picture
fig1=plt.figure(figsize=(10, 10))
fig1.suptitle('Audi R8', fontsize=20)
imgplot = plt.imshow(imageIn)
You should see a car.
imagewidth, imageheight = imageIn.size
imageMatrix = np.asarray(imageIn)
pixelscnt = (imagewidth * imageheight)
print('Pixels:', pixelscnt)
print('Size:', imagewidth, ' x ', imageheight)


This is what your computer sees!

You have achieved computer vision. Remember how I showed you that movies consist of several frames? Each frame becomes an entry in a matrix, and now you can make your data science “see.”


Natural Language Processing (NLP)


Natural language processing is the area in data science that investigates the process we as humans use to communicate with each other. This covers mainly written and spoken words that form bigger concepts. Your data science is aimed at intercepting or interacting with humans, to react to the natural language.


There are two clear requirements in natural language processing. First is the direct interaction with humans, such as when you speak to your smartphone, and it responds with an appropriate answer. For example, you request “phone home,” and the phone calls the number set as “home.”


The second type of interaction is taking the detailed content of the interaction and understanding its context and relationship with other text or recorded information.


Examples of these are news reports that are examined, and common trends are found among different news reports. This is a study of the natural language’s meaning, not simply a response to a human interaction.



If you want to process text, you must set up an ecosystem to perform the basic text processing.

I recommend that you use library nltk (conda install -c anaconda nltk).


Open your Python editor and set up your ecosystem. You then require the base data.

import nltk

You will see a program that enables you to download several text libraries, which will assist the process to perform text analysis against any text you submit for analysis. The basic principle is that the library matches your text against the text stored in the data libraries and will return the correct matching text analysis.


Open your Python editor and create the following ecosystem, to enable you to investigate this library:

from nltk.tokenize import sent_tokenize, word_tokenize
Txt = "Good Day Mr. Vermeulen, \
how are you doing today? \
The weather is great, and Data Science is awesome. \
You are doing well!"
print('Identify sentences')
print(sent_tokenize(Txt))
print('Identify words')
print(word_tokenize(Txt))




There is a major demand for speech-to-text conversion, to extract features. I suggest looking at the SpeechRecognition library (SpeechRecognition on PyPI, the Python Package Index).


You can install it by using conda install -c conda-forge speechrecognition.

 This transform area is highly specialized, and I will not provide any further details on this subject.


Neural Networks


Neural networks (also known as artificial neural networks) are inspired by the human nervous system. They simulate how complex information is absorbed and processed by the human system. Just like humans, neural networks learn by example and are configured to a specific application.


Neural networks are used to find patterns in extremely complex data and, thus, deliver forecasts and classify data points. 


Let me offer an example of feature development for neural networks. Suppose you have to select three colors for your dog’s new ball: blue, yellow, or pink. The features would be

•\ Is the ball blue? Yes/No

•\ Is the ball yellow? Yes/No

•\ Is the ball pink? Yes/No


When you feed these to the neural network, it will use them as simple 0 or 1 values, and this is what neural networks really excel at solving.
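A minimal sketch of turning the three yes/no questions into the 0-or-1 values the network consumes (the helper name ball_features is my own, invented for illustration):

```python
# Encode the three colour questions as 0/1 features
def ball_features(colour):
    return [int(colour == 'blue'),
            int(colour == 'yellow'),
            int(colour == 'pink')]

print(ball_features('yellow'))
```

Each answer becomes one input node of the network, holding a simple 0 or 1.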


Unfortunately, the most important feature when buying a ball for a dog is “Does the dog fit under the house?” It took me two hours to retrieve one dog and one black ball from a space I did not fit into!


The lesson: You must be ready to change the criteria as you develop the neural network. If you keep the questions simple, you can simply add or remove questions that result in features.


Note, too, that you can daisy-chain neural networks, and I design such systems on a regular basis. I call it “neural pipelining and clustering,” as I sell it as a service. Before you start your example, you must understand two concepts.




Regularization Strength

Regularization strength is the parameter that prevents overfitting of the neural network, by penalizing large weights. The parameter enables the neural network to match the best set of weights for a general data set. In the example that follows, this is the reg_lambda parameter; the epsilon parameter, a related setting, is the learning rate that drives gradient descent.
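As a sketch of how these two settings enter the mathematics (written to match the reg_lambda and epsilon names used in the example that follows; L_data stands for whatever data loss the network uses):

```latex
L = L_{\text{data}} + \frac{\lambda}{2}\left(\lVert W_1 \rVert^2 + \lVert W_2 \rVert^2\right)
\qquad\qquad
W \leftarrow W - \epsilon\,\frac{\partial L}{\partial W}
```

The λ term punishes large weights (the regularization strength), and ε scales each gradient-descent step (the learning rate).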


Simple Neural Network

Now that you understand these basic parameters, I will show you an example of a neural network. Open a new Python file in your Python editor. Let’s build a simple neural network.


Set up the ecosystem.

import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
You need a visualization procedure.
def plot_decision_boundary(pred_func):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z,
    plt.scatter(X[:, 0], X[:, 1], c=y,
You will generate a data set for the example.
I suggest 200 points, but feel free to increase or decrease as you experiment with your neural network.
X, y = datasets.make_moons(200, noise=0.20)
You can plot the data to see what you generated.
plt.scatter(X[:,0], X[:,1], s=40, c=y,
I suggest we first train a logistic regression classifier on these features, as a baseline.
clf = linear_model.LogisticRegressionCV(), y)
Now you can plot the decision boundary.
plot_decision_boundary(lambda x: clf.predict(x))
plt.title("Logistic Regression")
You now configure the neural network. I kept it simple, with two inputs and two outputs.
num_examples = len(X) # training set size
nn_input_dim = 2 # input layer dimensionality
nn_output_dim = 2 # output layer dimensionality
You set the gradient descent parameters. This drives the speed at which you resolve the neural network.
Set the learning rate for gradient descent and the regularization strength. Experiment with these, as they drive the transform speeds.
epsilon = 0.01 # learning rate for gradient descent
reg_lambda = 0.01 # regularization strength
You must engineer a helper function, to evaluate the total loss on the data set.
def calculate_loss(model):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to calculate our predictions
    z1 = + b1
    a1 = np.tanh(z1)
    z2 = + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Calculating the loss
    correct_logprobs = -np.log(probs[range(num_examples), y])
    data_loss = np.sum(correct_logprobs)
    # Add regularization term to loss (optional)
    data_loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return 1./num_examples * data_loss
You also require a helper function, to predict an output (0 or 1).
def predict(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation
    z1 = + b1
    a1 = np.tanh(z1)
    z2 = + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)
Your next function to engineer is central to the neural network. This function learns parameters for the neural network and returns the model.
•\ nn_hdim: Number of nodes in the hidden layer
•\ num_passes: Number of passes through the training data for gradient descent. I suggest 20000, but you can experiment with different sizes.
•\ print_loss: If True, print the loss every 1000 iterations.
def build_model(nn_hdim, num_passes=20000, print_loss=False):
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))
    # This is what we return at the end
    model = {}
    # Gradient descent. For each batch...
    for i in range(0, num_passes):
        # Forward propagation
        z1 = + b1
        a1 = np.tanh(z1)
        z2 = + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = * (1 - np.power(a1, 2))
        dW1 =, delta2)
        db1 = np.sum(delta2, axis=0)
        # Add regularization terms (b1 and b2 don't have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1
        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2
        # Assign new parameters to the model
        model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        # Optionally print the loss.
        # This is expensive because it uses the whole dataset,
        # so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss(model)))
    return model
You now define the model with a hidden layer of size 3.
model = build_model(3, print_loss=True)
You can now plot the decision boundary.
plot_decision_boundary(lambda x: predict(model, x))
plt.title("Decision Boundary for hidden layer size 3")
You can now visualize what you have achieved.
plt.figure(figsize=(16, 32))
hidden_layer_dimensions = [1, 2, 3, 4, 5, 20, 50]
for i, nn_hdim in enumerate(hidden_layer_dimensions):
    plt.subplot(5, 2, i+1)
    plt.title('Hidden Layer size %d' % nn_hdim)
    model = build_model(nn_hdim, print_loss=True)
    plot_decision_boundary(lambda x: predict(model, x))
You can now build neural networks. Well done.
The preceding is neural networking in its simplest form.




TensorFlow

TensorFlow is an open source software library for numerical computation using data-flow graphs.


I will guide you through a few examples, to demonstrate the capability. I have several installations that use this ecosystem, and it is gaining popularity in the data science communities.


To use it, you will require a library named TensorFlow. You install it using conda install -c conda-forge tensorflow. Details about the library are available on the TensorFlow website.


The next big advantage is the Cloud Tensor Processing Unit (TPU) hardware product, which was specifically designed to calculate tensor processing at better performance levels than standard CPU or GPU hardware. The TPU supports the TensorFlow process with an extremely effective hardware ecosystem.


Basic TensorFlow

I will take you through a basic example by explaining, as a starting point, how to convert the following mathematical equation into a TensorFlow graph:

a = (b + c) * (c + 2)
I will calculate the following for you:
•\ b = 2.5
•\ c = 10
Open your Python editor and create the following ecosystem:
import tensorflow as tf
Create a TensorFlow constant.
const = tf.constant(2.0, name="const")
Create TensorFlow variables.
b = tf.Variable(2.5, name='b')
c = tf.Variable(10.0, name='c')
You must now create the operations.
d = tf.add(b, c, name='d')
e = tf.add(c, const, name='e')
a = tf.multiply(d, e, name='a')
Next, set up the variable initialization.
init_op = tf.global_variables_initializer()
You can now start the session.
with tf.Session() as sess:
    # initialise the variables
    # compute the output of the graph
    a_out =
    print("Variable a is {}".format(a_out))


Well done. You have just successfully deployed a TensorFlow solution. I will now guide you through a more advanced example: how to feed a range of values into a TensorFlow graph.

a = (b + c) * (c + 22)

I will calculate for the following:

•\ b = range(-5, 5), i.e., -5, -4, -3, -2, -1, 0, 1, 2, 3, 4

•\ c = 3

Open your Python editor and create the following ecosystem:

import tensorflow as tf
import numpy as np
Create a TensorFlow constant.
const = tf.constant(22.0, name="const")
Now create the TensorFlow variables. Note the range format for variable b.
b = tf.placeholder(tf.float32, [None, 1], name='b')
c = tf.Variable(3.0, name='c')
You will create the required operations next.
d = tf.add(b, c, name='d')
e = tf.add(c, const, name='e')
a = tf.multiply(d, e, name='a')
Start the setup of the variable initialization.
init_op = tf.global_variables_initializer()
Start the session construction.
with tf.Session() as sess:
    # initialise the variables
    # compute the output of the graph
    a_out =, feed_dict={b: np.arange(-5, 5)[:, np.newaxis]})
    print("Variable a is {}".format(a_out))


Did you notice how with minor changes TensorFlow handles larger volumes of data with ease?


The advantage of TensorFlow is the simplicity of the basic building block you use to create it, and the natural graph nature of the data pipelines, which enable you to easily convert data flows from the real world into complex simulations within the TensorFlow ecosystem.


I will use a fun game called the One-Arm Bandits to offer a sample real-world application of this technology. Open your Python editor and create the following ecosystem:

import tensorflow as tf
import numpy as np
Let’s construct a model for your bandits. There are four one-arm bandits, and currently, bandit 4 is set to provide a positive reward most often.
bandits = [0.2,0.0,-0.2,-2.0]
num_bandits = len(bandits)
You must model the bandit by creating a pull-bandit action.
def pullBandit(bandit):
    #Get a random number.
    result = np.random.randn(1)
    if result > bandit:
        #Return a positive reward.
        return 1
    else:
        #Return a negative reward.
        return -1
Now, you reset the ecosystem.
tf.reset_default_graph()
You need the following two lines to establish the feed-forward part of the network.
You perform the actual selection using this formula.
weights = tf.Variable(tf.ones([num_bandits]))
chosen_action = tf.argmax(weights,0)
These next six lines establish the training procedure. You feed the reward and chosen action into the network by computing the loss and using it to update the network.
reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
responsible_weight = tf.slice(weights,action_holder,[1])
loss = -(tf.log(responsible_weight)*reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)
Now, you must train the agent to play the one-arm bandits.
total_episodes = 1000 #Set total number of episodes to train the bandit.
total_reward = np.zeros(num_bandits) #Set scoreboard for bandits to 0.
e = 0.1 #Set the chance of taking a random action.
Initialize the ecosystem now.
init = tf.initialize_all_variables()
Launch the TensorFlow graph processing.
with tf.Session() as sess:
    i = 0
    while i < total_episodes:
        #Choose either a random action or one from our network.
        if np.random.rand(1) < e:
            action = np.random.randint(num_bandits)
        else:
            action =
        reward = pullBandit(bandits[action])
Collect your reward from picking one of the bandits and update the network.
        _,resp,ww =[update,responsible_weight,weights],
            feed_dict={reward_holder:[reward],action_holder:[action]})
Update your running tally of the scores.
        total_reward[action] += reward
        if i % 50 == 0:
            print ("Running reward for the " + str(num_bandits) +
                   " bandits: " + str(total_reward))
        i += 1
print ("The agent thinks bandit " + str(np.argmax(ww)+1) + " is the most promising....")
if np.argmax(ww) == np.argmax(-np.array(bandits)):
    print ("...and it was right!")
else:
    print ("...and it was wrong!")


Congratulations! You have a fully functional TensorFlow solution. Can you think of three real-life examples you can model using this ecosystem?


Decision Trees


Decision trees, as the name suggests, are a tree-shaped visual representation of routes you can follow to reach a particular decision, by laying down all options and their probability of occurrence. Decision trees are exceptionally easy to understand and interpret.


At each node of the tree, one can interpret what would be the consequence of selecting that node or option. The series of decisions lead you to the end result.
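A minimal sketch of a tree whose nodes read as the consequence of each decision. The toy weather data set below is hypothetical, invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [temperature_celsius, is_raining]; label 1 = play outside
X = [[25, 0], [30, 0], [10, 1], [12, 1], [28, 1], [8, 0]]
y = [1, 1, 0, 0, 0, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the series of decisions leading to each end result
print(export_text(tree, feature_names=['temperature', 'raining']))
```

Reading the printed rules from the root down lays out every route and its end result, which is exactly what makes decision trees easy to interpret.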


Before you start the example, I must discuss a common add-on algorithm to decision trees called AdaBoost. AdaBoost, short for “adaptive boosting,” is a machine-learning meta-algorithm.


The classifier is a meta-estimator because it begins by fitting a classifier on the original data set and then fits additional copies of the classifier on the same data set, but where the weights of incorrectly classified instances are adjusted, such that subsequent classifiers focus more on difficult cases.


It boosts the learning impact of less clear differences in the specific variable, by adding a progressive weight to boost the impact.


This boosting enables the data scientist to force decisions down unclear decision routes for specific data entities, to enhance the outcome.


Example of a Decision Tree using AdaBoost:

Open your Python editor and set up this ecosystem:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
You need a data set. So, you can build a random data set.
rng = np.random.RandomState(1)
X = np.linspace(0, 6, 1000)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])
You then fit the normal regression model to the data set.
regr_1 = DecisionTreeRegressor(max_depth=4)
You then also apply an AdaBoost model to the data set. The parameters can be changed, if you want to experiment.
regr_2 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), n_estimators=300, random_state=rng)
You then train the model., y), y)
You activate the predict.
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)
You plot the results.
plt.figure(figsize=(15, 10))
plt.scatter(X, y, c="k", label="Training Samples")
plt.plot(X, y_1, c="g", label="n_Estimators=1", linewidth=2)
plt.plot(X, y_2, c="r", label="n_Estimators=300", linewidth=2)
plt.title("Boosted Decision Tree Regression")

Congratulations! You just built a decision tree. You can now test your new skills against a more complex example.


Once more, open your Python editor and set up this ecosystem:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_gaussian_quantiles
You are building a more complex data set.
X1, y1 = make_gaussian_quantiles(cov=2.,
n_samples=2000, n_features=2,
n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5,
n_samples=3000, n_features=2,
n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, - y2 + 1))
You now have a data set ready. Create and fit an AdaBoosted decision tree to the data
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME",
                         n_estimators=200)
You train the model., y)
You now need to set up a few parameters that you will require.
plot_colors = "br"
plot_step = 0.02
class_names = "AB"
You now create a visualization of the results.
plt.figure(figsize=(10, 5))
Add the plot for the decision boundaries.
plt.subplot(121)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
np.arange(y_min, y_max, plot_step))
Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z,
plt.axis("tight")
Plot your training points.
for i, n, c in zip(range(2), class_names, plot_colors):
idx = np.where(y == i)
plt.scatter(X[idx, 0], X[idx, 1], c=c,, s=20, edgecolor='k', label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right')
plt.title('Decision Boundary')
Plot the two-class decision scores.
plt.subplot(122)
twoclass_output = bdt.decision_function(X)
plot_range = (twoclass_output.min(), twoclass_output.max())
for i, n, c in zip(range(2), class_names, plot_colors):
plt.hist(twoclass_output[y == i],
         bins=10, range=plot_range,
         facecolor=c, alpha=.5, edgecolor='k',
         label='Class %s' % n)
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, y1, y2 * 1.2))
plt.legend(loc='upper right')
plt.title('Decision Scores')

Well done, you have just completed a complex decision tree.


Support Vector Machines

The support vector machine (SVM) constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification and regression.


The support vector network (SVN) daisy chains more than one SVM together to form a network. All the data flows through the same series of SVMs.


The support vector cluster (SVC) runs SVM on different clusters of the data in parallel. Hence, not all data flows through all the SVMs.


The support vector grid (SVG) is an SVC of an SVN or an SVN of an SVC. This solution is the most likely configuration you will develop at a customer site. It uses SVMs to handle smaller clusters of the data, to apply specific transform steps. As a beginner data scientist, you only need to note that they exist.


Support Vector Machines

A support vector machine is a discriminative classifier formally defined by a separating hyperplane. The method calculates an optimal hyperplane with a maximum margin, to ensure it classifies the data set into separate clusters of data points.
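To make the "maximum margin" concrete, here is a minimal sketch, with a small hypothetical data set, that fits a linear SVM and computes the margin width from the learned weights:

```python
import numpy as np
from sklearn import svm

# Hypothetical two-cluster data set: two separable groups of points.
X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
              [4.0, 4.0], [4.5, 4.2], [4.2, 3.8]])
y = [0, 0, 0, 1, 1, 1]

# A large C approximates a hard margin on separable data.
clf = svm.SVC(kernel='linear', C=1000)
clf.fit(X, y)

# The separating hyperplane is w.x + b = 0; the margin width is 2/||w||.
w = clf.coef_[0]
margin = 2 / np.linalg.norm(w)
print('Support vectors:', clf.support_vectors_)
print('Margin width   :', round(margin, 3))
```

The support vectors are the points that sit on the margin; moving any other point does not change the hyperplane.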


I will guide you through a sample SVM. Open your Python editor and create this ecosystem:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
Here is your data set and targets.
X = np.c_[(.4, -.7),
(-1.5, -1),
(-1.4, -.9),
(-1.3, -1.2),
(-1.1, -.2),
(-1.2, -.4),
(-.5, 1.2),
(-1.5, 2.1),
(1, 1),
(1.3, .8),
(1.2, .5),
(.2, -2),
(.5, -2.4),
(.2, -2.3),
(0, -2.7),
(1.3, 2.1)].T
Y = [0] * 8 + [1] * 8


You have several kernels you can use to fit the data model. I will take you through three of them.

fignum = 1
for kernel in ('linear', 'poly', 'rbf'):
    clf = svm.SVC(kernel=kernel, gamma=2), Y)


You now plot the line, the points, and the nearest vectors to the plane.

plt.figure(fignum, figsize=(8, 6))
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=80,
            facecolors='none', zorder=10, edgecolors='k')
plt.scatter(X[:, 0], X[:, 1], c=Y, zorder=10,, edgecolors='k')
x_min = -3
x_max = 3
y_min = -3
y_max = 3
XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])
You now apply the result into a color plot.
Z = Z.reshape(XX.shape)
plt.figure(fignum, figsize=(8, 6))
plt.pcolormesh(XX, YY, Z > 0,
plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
levels=[-.5, 0, .5])
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
fignum = fignum + 1
You now show your plots.

Well done. You have just completed three different types of SVMs.


Support Vector Networks

The support vector network is an ensemble of support vector machines that together classify the same data set, by using different parameters or even different kernels. This is a common feature-engineering method, created by chaining SVMs together.


Tip You can change kernels on the same flow to expose new features. Practice against the data sets with the different kernels, to understand what they give you.


Support Vector Clustering

Support vector clustering is used where the data points are classified into clusters, with support vector machines performing the classification at the cluster level.


This is commonly used in high dimensional data sets, where the clustering creates a grouping that can then be exploited by the SVM to subdivide the data points, using different kernels and other parameters.
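A minimal sketch of that idea: cluster first, then fit one SVM per cluster, so not all data flows through all the SVMs. The four-blob data set and the two-cluster split are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Hypothetical data: four well-separated blobs; the blob id is the target.
centers = [(0, 0), (0, 4), (8, 0), (8, 4)]
X, y = make_blobs(n_samples=400, centers=centers, cluster_std=0.5,
                  random_state=1)

# Step 1: split the data into coarse clusters.
clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

# Step 2: fit a separate SVM inside each cluster.
models = {}
for c in np.unique(clusters):
    mask = clusters == c
    if len(np.unique(y[mask])) < 2:
        continue  # a single-class cluster needs no classifier
    models[c] = SVC(kernel='rbf', gamma='scale').fit(X[mask], y[mask])
    print('Cluster %d: %d points, training accuracy %.3f'
          % (c, mask.sum(), models[c].score(X[mask], y[mask])))
```

Each SVM only has to separate the classes inside its own cluster, which is the parallel structure described above.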


I have seen SVC, SVN, SVM, and SVG processes in many of the deep-learning algorithms that I work with every day. The volume, variety, and velocity of the data require that the deep learning perform multistage classifications, to enable the more detailed analysis of the data points to occur after a primary result is published.


Data Mining

Data mining is the processing of data to pinpoint patterns and establish relationships between data entities. Here are a small number of critical data-mining concepts you need to understand, to be successful with data mining.


Association Patterns

This involves detecting patterns in which one event is associated with another. If, for example, a loading bay door is opened, it is fair to assume that a truck is loading goods. Association patterns simply discover the correlation of events in the data. You will use some core statistical skills for this processing.


Warning “Correlation does not imply causation.”

Correlation is only a relationship or indication of behavior between two data sets.

The relationship is not a cause-driven action.



If you discover a relationship between hot weather and ice cream sales, it does not mean high ice cream sales cause hot weather or vice versa. It is only an observed relationship. This is commonly used in retail basket analysis and recommender systems. I will guide you through an example now.


Please open your Python editor and create this ecosystem:

import pandas as pd
df1 = pd.DataFrame({'A': range(8), 'B': [2*i for i in range(8)]})
df2 = pd.DataFrame({'A': range(8), 'B': [-2*i for i in range(8)]})
Here is your data.
print('Positive Data Set')
print('Negative Data Set')
Here are your results.
print('Correlation Positive:', df1['A'].corr(df1['B']))
print('Correlation Negative:', df2['A'].corr(df2['B']))

You should see a correlation of either +1 or -1. If it is +1, there is a 100% positive correlation between the two values: they change at the same rate. If it is -1, there is a 100% negative correlation: one value increases while the other decreases.


Tip In real-world data sets, the two extremes of +1 and -1 rarely occur. The range is normally -1 < C < +1.


You will now apply changes that will interfere with the correlation.

df1.loc[2, 'B'] = 10
df2.loc[2, 'B'] = -10
You can now see the impact.
print('Positive Data Set')
print('Negative Data Set')
print('Correlation Positive:', df1['A'].corr(df1['B']))
print('Correlation Negative:', df2['A'].corr(df2['B']))
Can you sense the impact a minor change has on the model?
So, let’s add a bigger change.
df1.loc[3, 'B'] = 100
df2.loc[3, 'B'] = -100
You check the impact.
print('Positive Data Set')
print('Negative Data Set')
print('Correlation Positive:', df1['A'].corr(df1['B']))
print('Correlation Negative:', df2['A'].corr(df2['B']))

Well done, if you understood the changes that interfere with the relationship. If you see the relationship, you have achieved an understanding of association patterns.


These “What if?” analyses are common among data scientists’ daily tasks. For example, one such analysis might relate to the following: What happens when I remove 300 of my 1000 staff?


And at what point does the removal of staff have an impact? These small-step increases of simulated impact can be used to simulate progressive planned changes in the future.
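A tiny what-if sketch of this idea; the output model and all its numbers are invented for illustration, stepping the staff removal in increments of 100:

```python
# Hypothetical model: each person produces base_rate units of output,
# minus a fixed coordination overhead. All numbers are illustrative.
def monthly_output(staff, base_rate=7.5, overhead=500):
    return max(staff * base_rate - overhead, 0)

# Step the simulated change in small increments and watch the impact.
for removed in range(0, 301, 100):
    staff = 1000 - removed
    print('Staff: %4d -> output: %7.1f' % (staff, monthly_output(staff)))
```

Replacing the toy function with a model fitted on real data turns this loop into a genuine progressive-impact simulation.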


Classification Patterns

This technique discovers new patterns in the data, to enhance the quality of the complete data set. Data classification is the process of consolidating data into categories, for its most effective and efficient use in data processing.


For example, if the data is related to the shipping department, you must then augment a label on the data that states that fact.


A carefully planned data-classification system creates vital data structures that are easy to find and retrieve. You do not want to scour your complete data lake to find data every time you want to analyze a new data pattern.
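In pandas, such a classification is often a simple augmented label column. The record fields, department names, and mapping below are hypothetical:

```python
import pandas as pd

# Hypothetical records from a data lake, missing a department label.
df = pd.DataFrame({
    'DocID': [101, 102, 103, 104],
    'Source': ['shipping', 'finance', 'shipping', 'hr'],
})

# Augment a classification label so the data is easy to find and retrieve.
label_map = {'shipping': 'Logistics', 'finance': 'Accounts', 'hr': 'People'}
df['Category'] = df['Source'].map(label_map)
print(df)
```

Queries can now filter on `Category` instead of scouring the complete data lake.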


Clustering Patterns

Clustering is the discovery and labeling of groups of specifics not previously known. An example of clustering is when your customers buy bread and milk together on a Monday night, and you group, or cluster, these customers as "start-of-the-week small-size shoppers," simply by looking at their typical basic shopping basket.


Any combination of variables that you can use to cluster data entries into a specific group can be viewed as some form of clustering. For data scientists, the following clustering types are beneficial to master.


Connectivity-Based Clustering

You can discover the interaction between data items by studying the connections between them. This process is sometimes also described as hierarchical clustering.
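A minimal sketch of hierarchical clustering with scikit-learn's AgglomerativeClustering, on a small hypothetical data set:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical points in two loose groups.
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Hierarchical (connectivity-based) clustering merges the closest
# pairs first, building a tree of clusters from the bottom up.
model = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = model.fit_predict(X)
print('Cluster labels:', labels)
```

The first three points end up in one cluster and the last three in the other, because the merge order follows the connections between nearby points.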


Centroid-Based Clustering (K-Means Clustering)

”Centroid-based” describes the cluster as a relationship between data entries and a virtual center point in the data set. K-means clustering is the most popular centroid-based clustering algorithm. I will guide you through an example.


Open your Python editor and set up this ecosystem:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn import datasets
You now set up a data set to study.
iris = datasets.load_iris()
X =
y =
Configure your estimators for the clustering.
estimators = [('k_means_iris_8', KMeans(n_clusters=8)),
              ('k_means_iris_3', KMeans(n_clusters=3)),
              ('k_means_iris_bad_init', KMeans(n_clusters=3, n_init=1,
                                               init='random'))]
Get ready to virtualize your results.
fignum = 1
titles = ['8 clusters', '3 clusters', '3 clusters, bad initialization']
for name, est in estimators:
fig = plt.figure(fignum, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
labels = est.labels_
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(np.float), edgecolor='k')
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title(titles[fignum - 1])
ax.dist = 12
fignum = fignum + 1
Plot your results.
fig = plt.figure(fignum, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
for name, label in [('Setosa', 0),
('Versicolour', 1),
('Virginica', 2)]:
ax.text3D(X[y == label, 3].mean(),
X[y == label, 0].mean(),
X[y == label, 2].mean() + 2, name,
bbox=dict(alpha=.2, edgecolor='w', facecolor='w'))
Reorder the labels to have colors matching the cluster results.
y = np.choose(y, [1, 2, 0]).astype(np.float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor='k')
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title('Ground Truth')
ax.dist = 12

Well done. You should see a set of results for your hard work. Can you see how the centroid-based clustering uses the distance between data points and the cluster centers to solve the transformation steps?


Grid-Based Method

Grid-based approaches are common for mining large multidimensional space clusters having denser regions than their surroundings. The grid-based clustering approach differs from the conventional clustering algorithms in that it does not use the data points but a value space that surrounds the data points.
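A sketch of the grid-based idea with plain NumPy: bin the value space into cells and keep the cells denser than a threshold. The data and the threshold are illustrative assumptions:

```python
import numpy as np

# Hypothetical 2-D points: a dense clump plus sparse background noise.
rng = np.random.RandomState(1)
dense = rng.normal(loc=5.0, scale=0.3, size=(200, 2))
noise = rng.uniform(0, 10, size=(20, 2))
points = np.vstack([dense, noise])

# Grid-based view: bin the value space, then keep cells denser than a
# threshold -- the cells, not the individual points, define the cluster.
H, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1],
                                   bins=10, range=[[0, 10], [0, 10]])
dense_cells = np.argwhere(H > 5)
print('Cells above density threshold:')
print(dense_cells)
```

Because only the grid cells are examined, the cost depends on the number of cells rather than the number of points, which is why this scales to large data sets.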


Bayesian Classification

Naive Bayes (NB) classifiers are a group of probabilistic classifiers established by applying Bayes’s theorem with strong independence assumptions between the features of the data set. There is one more specific Bayesian classification you must take note of, and it is called tree augmented naive Bayes (TAN).


Tree augmented naive Bayes is a semi-naive Bayesian learning method. It relaxes the naive Bayes attribute independence assumption by assigning a tree structure, in which each attribute only depends on the class and one other attribute.


A maximum weighted spanning tree that maximizes the likelihood of the training data is used to perform classification.


The naive Bayesian (NB) classifier and the tree augmented naive Bayes (TAN) classifier are well-known models that will be discussed next. Here is an example:


Open your Python editor and start with this ecosystem.


import numpy as np
import urllib.request
Load data via the web interface.
url = ""
Download the file.
raw_data = urllib.request.urlopen(url)
Load the CSV file into a numpy matrix.
dataset = np.loadtxt(raw_data, delimiter=",")
Separate the data from the target attributes in the data set.
X = dataset[:,0:8]
y = dataset[:,8]
Add extra processing capacity.
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB(), y)
Produce predictions.
expected = y
predicted = model.predict(X)


# Summarize the fit of the model.
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

You have just built your naive Bayes classifiers. Good progress!


Sequence or Path Analysis

This identifies patterns in which one event leads to another, later event, resulting in insights into the business. Path analysis is a chain of consecutive events that a given business entity performs during a set period. You analyze it to understand behavior and gain actionable insights into the data.
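A minimal path-analysis sketch with networkx; the event names form a hypothetical customer journey:

```python
import networkx as nx

# Hypothetical customer journey: each edge links two consecutive events.
G = nx.DiGraph()
events = ['Start', 'PriceQuery', 'DowngradePlan', 'RemoveService', 'Churn']
for a, b in zip(events, events[1:]):
    G.add_edge(a, b)

# The path between the first and last event is the customer's journey.
path = nx.shortest_path(G, source='Start', target='Churn')
print(' -> '.join(path))
# prints: Start -> PriceQuery -> DowngradePlan -> RemoveService -> Churn
```

The churn example later in this section builds the same kind of graph, only with one sub-path per customer per month.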


I suggest you use a combination of tools to handle this type of analysis. I normally model the sequence or path with the help of a graph database, or, for smaller projects, I use a library called networkx in Python.



Your local telecommunications company is interested in understanding the reasons or flow of events that resulted in people churning their telephone plans to their competitor.

Open your Python editor and set up this ecosystem.

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
import sqlite3 as sq
import networkx as nx
import datetime
pd.options.mode.chained_assignment = None
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :', Base, ' using ', sys.platform)
print('################################')
Company='01-Vermeulen'
sDataBaseDir=Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName=sDataBaseDir + '/Vermeulen.db'
conn = sq.connect(sDatabaseName)
You must create a new graph to track the individual paths of each customer through their journey over the last nine months.
G = nx.Graph()
The following loop structure enables you to compare two sequential months and determine the changes:
for M in range(1,10):
print('Month: ', M)
MIn = str(M - 1)
MOut = str(M)
sFile0 = 'ConnectionsChurn' + MIn + '.csv'
sFile1 = 'ConnectionsChurn' + MOut + '.csv'
sTable0 = 'ConnectionsChurn' + MIn
sTable1 = 'ConnectionsChurn' + MOut
sFileName0=Base + '/' + Company + '/00-RawData/' + sFile0
ChurnData0=pd.read_csv(sFileName0, header=0, low_memory=False, encoding="latin-1")
sFileName1=Base + '/' + Company + '/00-RawData/' + sFile1
ChurnData1=pd.read_csv(sFileName1, header=0, low_memory=False, encoding="latin-1")
Owing to an error during extraction, the file dates are not correct, so you perform a data-quality correction.
dt1 = datetime.datetime(year=2017, month=1, day=1)
dt2 = datetime.datetime(year=2017, month=2, day=1)
ChurnData0['Date'] = dt1.strftime('%Y/%m/%d')
ChurnData1['Date'] = dt2.strftime('%Y/%m/%d')
You now compare all the relevant features of the customer, to see what actions were taken during the month under investigation.
TrackColumns=['SeniorCitizen', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
'StreamingTV', 'PaperlessBilling', 'StreamingMovies', 'InternetService',
'Contract', 'PaymentMethod', 'MonthlyCharges']
for i in range(ChurnData0.shape[0]):
    t = 0
    for TColumn in TrackColumns:
        if ChurnData0[TColumn][i] == 'No':
            t += 1
    if t > 4:
        ChurnData0['Churn'][i] = 'Yes'
    else:
        ChurnData0['Churn'][i] = 'No'
for i in range(ChurnData1.shape[0]):
    t = 0
    for TColumn in TrackColumns:
        if ChurnData1[TColumn][i] == 'No':
            t += 1
    if t > 4:
        ChurnData1['Churn'][i] = 'Yes'
    else:
        ChurnData1['Churn'][i] = 'No'
print('Store CSV Data')
ChurnData0.to_csv(sFileName0, index=False)
ChurnData1.to_csv(sFileName1, index=False)
print('Store SQLite Data')
ChurnData0.to_sql(sTable0, conn, if_exists='replace')
ChurnData1.to_sql(sTable1, conn, if_exists='replace')
for TColumn in TrackColumns:
for i in range(ChurnData0.shape[0]):
You always start with a “Root” node and then attach the customers. Then you connect each change in status and perform a complete path analysis.
Node0 = 'Root'
Node1 = '(' + ChurnData0['customerID'][i] + '-Start)'
G.add_edge(Node0, Node1)
Node5 = '(' + ChurnData0['customerID'][i] + '-Stop)'
if ChurnData0['Churn'][i] == 'Yes':
NodeA = '(' + ChurnData0['customerID'][i] + '-Start)'
NodeB = '(' + ChurnData0['customerID'][i] + '-Stop)'
if nx.has_path(G, source=NodeA, target=NodeB) == False:
    NodeC = '(' + ChurnData0['customerID'][i] + '):(Churn)=>(' + ChurnData1['Churn'][i] + ')'
    G.add_edge(NodeA, NodeC)
    G.add_edge(NodeC, NodeB)
if ChurnData0[TColumn][i] != ChurnData1[TColumn][i]:
    #print(M, ChurnData0['customerID'][i], ChurnData0['Date'][i], ChurnData1['Date'][i], TColumn, ChurnData0[TColumn][i], ChurnData1[TColumn][i])
    Node2 = '(' + ChurnData0['customerID'][i] + ')-(' + ChurnData0['Date'][i] + ')'
    G.add_edge(Node1, Node2)
    Node3 = Node2 + '-(' + TColumn + ')'
    G.add_edge(Node2, Node3)
    Node4 = Node3 + ':(' + ChurnData0[TColumn][i] + ')=>(' + ChurnData1[TColumn][i] + ')'
    G.add_edge(Node3, Node4)
if M == 9:
    Node6 = '(' + ChurnData0['customerID'][i] + '):(Churn)=>(' + ChurnData1['Churn'][i] + ')'
    G.add_edge(Node4, Node6)
    G.add_edge(Node6, Node5)
else:
    G.add_edge(Node4, Node5)
You can use these lines to investigate the nodes and the edges you created.
for n in G.nodes():
    print('Node:', n)
for e in G.edges():
    print('Edge:', e)
You must now store your graph for future use.
sGraphOutput=Base + '/' + Company + \
nx.write_gml(G, sGraphOutput)
You now investigate the paths taken by a customer over the nine months and produce an output of all the steps taken by the customer over the period.
sFile0 = 'ConnectionsChurn9.csv'
sFileName0=Base + '/' + Company + '/00-RawData/' + sFile0
for i in range(ChurnData0.shape[0]):
sCustomer = ChurnData0['customerID'][i]
NodeX = '(' + ChurnData0['customerID'][i] + '-Start)'
NodeY = '(' + ChurnData0['customerID'][i] + '-Stop)'
if nx.has_path(G, source=NodeX, target=NodeY) == False:
    NodeZ = '(' + ChurnData0['customerID'][i] + '):(Churn)=>(' + ChurnData0['Churn'][i] + ')'
    G.add_edge(NodeX, NodeZ)
    G.add_edge(NodeZ, NodeY)
if nx.has_path(G, source=NodeX, target=NodeY) == True:
This function enables you to expose all the paths between the two nodes you created for each customer.
pset = nx.all_shortest_paths(G, source=NodeX, target=NodeY)
c = 0
for p in pset:
    ps = 'Path: ' + str(p)
    t = 0
    for s in p:
        t += 1
        ts = 'Step: ' + str(t)
        #print(NodeX, NodeY, ps, ts, s)
        c += 1
        if c == 1:
            pl = [[sCustomer, ps, ts, s]]
        else:
            pl.append([sCustomer, ps, ts, s])
You now store the path analysis results into a CSV for later use.
sFileOutput=Base + '/' + Company + '/04-Transform/01-EDS/02-Python/Transform_ConnectionsChurn.csv'
df = pd.DataFrame(pl, columns=['Customer', 'Path', 'Step', 'StepName']) = 'RecID'
df.to_csv(sFileOutput, index=False)
sTable = 'ConnectionsChurnPaths'
df.to_sql(sTable, conn, if_exists='replace')
print('### Done!! ############################################')
Well done. You have just completed your first path analysis over a period of nine months.



Can you understand why the routes are different for each customer but still have the same outcome? Can you explain why one customer churns and another does not?


The common activity you will spot is that customers begin to remove services from your client as they start to churn. As they enable their new services with the competitor, they no longer support your client's services 100%.


If they have fewer than five services, customers normally churn, as they are now likely with the other telecommunications company.


If you were advising the client, I would suggest you highlight that the optimum trigger indicating a customer is about to churn is a change in configuration that includes fewer services from your client.


You can still prevent a churn if you intervene before the customer drops below the five-service minimum.
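That trigger can be sketched as a simple rule. The threshold of five comes from the discussion above; the service names and the helper function itself are hypothetical:

```python
# Hypothetical churn trigger: flag a customer as at risk when the number
# of active services drops below the threshold discussed above.
def churn_risk(services, threshold=5):
    active = [s for s, on in services.items() if on]
    return len(active) < threshold

# Illustrative customer record: only four services are still active.
customer = {'PhoneService': True, 'InternetService': True,
            'StreamingTV': False, 'OnlineBackup': True,
            'DeviceProtection': False, 'OnlineSecurity': True}
print('At churn risk:', churn_risk(customer))
# prints: At churn risk: True
```

In practice, you would run this check each month over the tracked columns, so the intervention happens before the customer crosses the threshold.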


Well done. You can now perform Transform steps for path analysis, get a simple list of node and edges (relationships) between them, and model a graph and ask pertinent questions.



Forecasting

This technique is used to discover patterns in data that result in practical predictions about a future result, as indicated by predictive analytics of future probabilities and trends. We have been performing forecasting at several points in this blog.


Pattern Recognition

Pattern recognition identifies regularities and irregularities in data sets. The most common application of this is in text analysis, to find complex patterns in the data.


I will guide you through an example of text extraction. The example extracts text from the common 20 newsgroups data set and then creates categories of text found together in the same document. This provides you with the most common words used in the newsgroups.


Open your Python editor and set up this ecosystem:

from pprint import pprint
from time import time
import logging
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
Modify your logging, to display progress logs on stdout (standard output).
,
                    format='%(asctime)s %(levelname)s %(message)s')
You can now load your categories from the training set.
categories = [

You could use this for all the categories, but I would perform it as a later experiment, as it slows down your processing, by using larger volumes of data from your data lake.

#categories = None
You can now investigate what you loaded.
print("Loading 20 newsgroups dataset for categories:")
You must now load the training data.
data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
You now have to define a pipeline, combining a text feature extractor with a simple classifier.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
parameters = {
'vect__max_df': (0.5, 0.75, 1.0),
#'vect__max_features': (None, 5000, 10000, 50000),
'vect__ngram_range': ((1, 1), (1, 2)),
#'tfidf__use_idf': (True, False),
#'tfidf__norm': ('l1', 'l2'),
'clf__alpha': (0.00001, 0.000001),
'clf__penalty': ('l2', 'elasticnet'),
#'clf__n_iter': (10, 50, 80),
}
You can now build the main processing engine.
if __name__ == "__main__":
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
t0 = time(),
print("done in %0.3fs" % (time() - t0))
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
print("\t%s: %r" % (param_name, best_parameters[param_name]))

When you execute the program, you will have successfully completed a text extraction. The data sources for this type of extract are typically documents, e-mail, Twitter, or note fields in databases. Any text source can receive a transform step.