Transform Data (Best Data Science Tutorial 2019)


How to Transform Data?

The Transform data superstep allows you, as a data scientist, to take data from the data vault and formulate answers to questions raised by your investigations. This tutorial explains the transformation step in the data science process, which converts results into insights.


Transforming with Data Science


Data exploration and preparation, from the data lake into the data vault and from the data vault to the data warehouse, is discussed in the next section. We will now introduce you to the basic data science needed to transform your data into insights. To gain insights from your data, you must understand a selected set of basic investigation practices.


Steps of Data Exploration and Preparation

You must keep detailed notes of what techniques you employed to prepare the data. Make sure you keep your data traceability matrix up to date after each data engineering step has completed.


Update your Data Lineage and Data Provenance, to ensure that you have both the technical and business details of the entire process. Now, I will take you through a small number of the standard transform checkpoints, to ensure that your data science is complete.


Missing Value Treatment

You must describe in detail what the missing value treatments are for the data lake transformation. Make sure you take your business community with you along the journey.


At the end of the process, they must trust your techniques and results. If they trust the process, they will implement the business decisions that you, as a data scientist, aspire to achieve.


Why Missing Value Treatment Is Required

Explain with notes on the data traceability matrix why there is missing data in the data lake. Remember:

Every inconsistency in the data lake is conceivably the missing insight your customer is seeking from you as a data scientist. So, find them and explain them. Your customer will exploit them for business value.


Why Data Has Missing Values


The 5 Whys technique helps you get to the root cause in your analysis. Cause-and-effect (fishbone) diagrams will assist you in resolving those questions.


I have found the following common reasons for missing data:

  • Data fields renamed during upgrades
  • Migration processes from old systems to new systems where mappings were incomplete
  • Incorrect tables supplied in loading specifications by a subject-matter expert
  • Data simply not recorded, as it was not available
  • Legal reasons, owing to data protection legislation, such as the General Data Protection Regulation (GDPR), resulting in a not-to-process tag on the data entry
  • Someone else’s “bad” data science. People and projects make mistakes, and you will have to fix their errors in your own data science.


Warning Ensure that your data science processing is not the reason you are missing data. That is the quickest way to lose your customer’s trust.


What Methods Treat Missing Values?

During your progress through the supersteps, you have used many techniques to resolve missing data. Record them in your lineage, but also make sure you collect precisely how each technique applies to the processing flow.
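As a sketch of what such recorded treatments can look like in practice, the following flags, imputes, and marks missing entries so that each choice stays traceable. The column names and values are hypothetical, and pandas is assumed:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps (column names are illustrative).
df = pd.DataFrame({
    'Age': [25, np.nan, 31, 40, np.nan],
    'Country': ['IS', 'IS', None, 'UK', 'UK'],
})

# Treatment 1: flag the gap first, so the lineage records what was changed.
df['AgeWasMissing'] = df['Age'].isna()

# Treatment 2: impute a numeric column with its median.
df['Age'] = df['Age'].fillna(df['Age'].median())

# Treatment 3: give a categorical column an explicit 'Unknown' marker.
df['Country'] = df['Country'].fillna('Unknown')

print(df)
```

Each of the three treatments is a separate, documented decision: the flag column preserves the fact that a value was imputed, which is exactly what the traceability matrix needs.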


Techniques of Outlier Detection and Treatment

During the processing, you will have detected several outliers that do not comply with your expected ranges, e.g., you expected “Yes” or “No” but found some “N/A”s, or you expected number ranges between 1 and 10 but got 11, 12, and 13 also. These out-of-order items are the outliers.


I suggest you treat them as you treat the missing data. Make sure that your customer agrees with the process, as it will affect the insights you will process and their decisions.
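A minimal sketch of that range check, assuming pandas and a hypothetical survey data set, could look as follows:

```python
import pandas as pd

# Hypothetical responses: you expected 'Yes'/'No' answers and scores from 1 to 10.
df = pd.DataFrame({
    'Answer': ['Yes', 'No', 'N/A', 'Yes'],
    'Score': [3, 11, 7, 13],
})

answer_outliers = ~df['Answer'].isin({'Yes', 'No'})
score_outliers = ~df['Score'].between(1, 10)

# Keep the outliers visible for the business discussion; do not silently drop them.
outliers = df[answer_outliers | score_outliers]
print(outliers)
```

Surfacing the flagged rows, rather than deleting them, gives your business community something concrete to agree on before you apply a treatment.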


Elliptic Envelope


I will introduce a function called Elliptic Envelope. The basic idea is to assume that a data set is from a known distribution and then evaluate any entries not complying with that assumption. Fitting an elliptic envelope is one of the more common techniques used to detect outliers in a Gaussian distributed data set.


The scikit-learn package provides the object covariance.EllipticEnvelope, which fits a robust covariance estimate to the data and thus fits an ellipse to the central data points, ignoring points outside the central mode.


For instance, if the inlier data are Gaussian distributed, it will estimate the inlier location and covariance in a robust way (i.e., without being influenced by outliers).


The Mahalanobis distances obtained from this estimate are used to derive a measure of outlyingness.

If you want more in-depth details on the function, see the scikit-learn documentation for sklearn.covariance.EllipticEnvelope.



# -*- coding: utf-8 -*-
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM
import matplotlib.pyplot as plt
import matplotlib.font_manager
from sklearn.datasets import load_boston
# Get data
X1 = load_boston()['data'][:, [8, 10]]  # two clusters
X2 = load_boston()['data'][:, [5, 12]]  # "banana"-shaped
# Define "classifiers" to be used
classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.,
                                             contamination=0.261),
    "Robust Covariance (Minimum Covariance Determinant)":
        EllipticEnvelope(contamination=0.261),
    "OCSVM": OneClassSVM(nu=0.261, gamma=0.05)}

The classifiers assume you are testing the normal data entries. Let’s look at EllipticEnvelope(support_fraction=1., contamination=0.261).


The support_fraction is the portion of the complete population you want to use to determine the border between inliers and outliers. In this case, we use 1, which means 100%.


The contamination is the indication of what portion of the population could be outliers, hence, the amount of contamination of the data set, i.e., the proportion of outliers in the data set. In your case, this is set to 0.261 against a possible 0.5, more generally described as 26.1% contamination.


The EllipticEnvelope(contamination=0.261) is only a change of the included population, by using the defaults for all the settings, except for contamination that is set to 26.1%.


The third classifier is another type of detection called sklearn.svm.OneClassSVM, which is discussed later in this blog.
colors = ['m', 'g', 'b']
legend1 = {}
legend2 = {}
# Learn a frontier for outlier detection with several classifiers
xx1, yy1 = np.meshgrid(np.linspace(-8, 28, 500), np.linspace(3, 40, 500))
xx2, yy2 = np.meshgrid(np.linspace(3, 10, 500), np.linspace(-5, 45, 500))
for i, (clf_name, clf) in enumerate(classifiers.items()):
    fig1a = plt.figure(1)
    fig1a.set_size_inches(10, 10)
    clf.fit(X1)
    Z1 = clf.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
    Z1 = Z1.reshape(xx1.shape)
    legend1[clf_name] = plt.contour(
        xx1, yy1, Z1, levels=[0], linewidths=2, colors=colors[i])
    plt.figure(2)
    clf.fit(X2)
    Z2 = clf.decision_function(np.c_[xx2.ravel(), yy2.ravel()])
    Z2 = Z2.reshape(xx2.shape)
    legend2[clf_name] = plt.contour(
        xx2, yy2, Z2, levels=[0], linewidths=2, colors=colors[i])
legend1_values_list = list(legend1.values())
legend1_keys_list = list(legend1.keys())
# Plot the results (= shape of the data points cloud)
fig1b = plt.figure(1)  # two clusters
fig1b.set_size_inches(10, 10)
plt.title("Outlier detection on a real data set (boston housing)")
plt.scatter(X1[:, 0], X1[:, 1], color='black')
bbox_args = dict(boxstyle="round", fc="0.8")
arrow_args = dict(arrowstyle="->")
plt.annotate("several confounded points", xy=(24, 19),
             xycoords="data", textcoords="data",
             xytext=(13, 10), bbox=bbox_args, arrowprops=arrow_args)
plt.xlim((xx1.min(), xx1.max()))
plt.ylim((yy1.min(), yy1.max()))
plt.legend((legend1_values_list[0].collections[0],
            legend1_values_list[1].collections[0],
            legend1_values_list[2].collections[0]),
           (legend1_keys_list[0], legend1_keys_list[1],
            legend1_keys_list[2]),
           loc="upper center",
           prop=matplotlib.font_manager.FontProperties(size=12))
plt.ylabel("accessibility to radial highways")
plt.xlabel("pupil-teacher ratio by town")
legend2_values_list = list(legend2.values())
legend2_keys_list = list(legend2.keys())
fig2a = plt.figure(2)  # "banana" shape
fig2a.set_size_inches(10, 10)
plt.title("Outlier detection on a real data set (boston housing)")
plt.scatter(X2[:, 0], X2[:, 1], color='black')
plt.xlim((xx2.min(), xx2.max()))
plt.ylim((yy2.min(), yy2.max()))
plt.legend((legend2_values_list[0].collections[0],
            legend2_values_list[1].collections[0],
            legend2_values_list[2].collections[0]),
           (legend2_keys_list[0], legend2_keys_list[1],
            legend2_keys_list[2]),
           loc="upper center",
           prop=matplotlib.font_manager.FontProperties(size=12))
plt.ylabel("% lower status of the population")
plt.xlabel("average number of rooms per dwelling")
plt.show()
There are a few other outlier detection techniques you can investigate.


Isolation Forest

One efficient way of performing outlier detection in high-dimensional data sets is to use random forests. The ensemble.IsolationForest tool “isolates” observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.


Because recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.


This path length, averaged over a forest of such random trees, is a measure of normality and our decision function. Random partitioning produces a noticeably shorter path for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.

See the scikit-learn documentation for sklearn.ensemble.IsolationForest.
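As a minimal sketch of this idea, the following isolates a few injected outliers from a tight cluster. The data is synthetic, and the parameter values are illustrative assumptions, not recommendations:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Synthetic data: a tight Gaussian cluster plus five far-away points.
inlier_points = 0.3 * rng.randn(100, 2)
outlier_points = rng.uniform(low=4, high=6, size=(5, 2))
X = np.r_[inlier_points, outlier_points]

# contamination is the assumed share of outliers (illustrative value).
clf = IsolationForest(contamination=0.05, random_state=42)
labels = clf.fit_predict(X)  # +1 = inlier, -1 = outlier
print('Flagged as outliers:', int((labels == -1).sum()))
```

The far-away points sit on short isolation paths, so the forest flags them with very few splits.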


Novelty Detection

Novelty Detection

Novelty detection simply performs an evaluation in which we add one more observation to a data set. Is the new observation so different from the others that we can doubt that it is regular? (I.e., does it come from the same distribution?)


Or, on the contrary, is it so similar to the others that we cannot distinguish it from the original observations? This is the question addressed by novelty detection tools and methods.


The sklearn.svm.OneClassSVM tool is a good example of this unsupervised outlier detection technique.

For more information, see the scikit-learn documentation for sklearn.svm.OneClassSVM.
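A small sketch of novelty detection with sklearn.svm.OneClassSVM, using synthetic training data and illustrative parameter values, might look like this:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
# Regular (training) observations: a single Gaussian cloud.
X_train = 0.3 * rng.randn(200, 2)

# Learn a frontier around the regular observations.
clf = OneClassSVM(nu=0.1, gamma=0.1).fit(X_train)

# Two new observations: one similar to the training data, one clearly different.
X_new = np.array([[0.1, -0.2],
                  [5.0, 5.0]])
pred = clf.predict(X_new)  # +1 = regular, -1 = novelty
print(pred)
```

The point near the training cloud is classified as regular, while the distant one is reported as a novelty, which is precisely the question posed above.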


Local Outlier Factor

An efficient way to perform outlier detection on moderately high-dimensional data sets is to use the local outlier factor (LOF) algorithm. The neighbors.LocalOutlierFactor algorithm computes a score (called a local outlier factor) reflecting the degree of abnormality of the observations.


It measures the local density deviation of a given data point with respect to its neighbors. The idea is to detect the samples that have a substantially lower density than their neighbors.


In practice, the local density is obtained from the k-nearest neighbors. The LOF score of an observation is equal to the ratio of the average local density of its k-nearest neighbors and its own local density.


A normal instance is expected to have a local density like that of its neighbors, while abnormal data are expected to have a much smaller local density.



See the scikit-learn documentation for sklearn.neighbors.LocalOutlierFactor for additional information.
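A minimal sketch of the LOF approach, on synthetic data with illustrative parameter values, could be:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
# A dense cluster plus two points in low-density regions.
X = np.r_[0.3 * rng.randn(100, 2),
          np.array([[4.0, 4.0], [5.0, -5.0]])]

# n_neighbors and contamination are illustrative assumptions.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
labels = lof.fit_predict(X)            # -1 marks outliers
scores = lof.negative_outlier_factor_  # lower = more abnormal
print('Outliers found:', int((labels == -1).sum()))
```

The two isolated points have a much lower local density than their twenty nearest neighbors, so their LOF scores mark them as abnormal.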


When the amount of contamination is known, the example that follows illustrates four different ways of performing outlier detection.

The first is based on a robust estimator of covariance, which assumes that the data are Gaussian distributed and, in that case, performs better than the one-class SVM.

The second is the one-class SVM, which has the ability to capture the shape of the data set and, hence, performs better when the data is strongly non-Gaussian, i.e., with two well-separated clusters.

The third is the isolation forest algorithm, which is based on random forests and, hence, better adapted to large-dimensional settings, even if it performs quite well in the example you will run next.

The fourth is the local outlier factor, which measures the local deviation of a given data point with respect to its neighbors, by comparing their local densities.



In the resulting plots, the underlying truth about inliers and outliers is given by the points’ colors, and the orange-filled area indicates which points are reported as inliers by each method. You are assumed to know the fraction of outliers in the data sets, and the following provides an example of what you can achieve.


Open your Python editor, set up this ecosystem, and investigate the preceding techniques.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
rng = np.random.RandomState(42)
# Your example settings
n_samples = 200
outliers_fraction = 0.25
clusters_separation = [0, 1, 2]
# Define four outlier detection tools to be compared
classifiers = {
    "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
                                     kernel="rbf", gamma=0.1),
    "Robust covariance": EllipticEnvelope(contamination=outliers_fraction),
    "Isolation Forest": IsolationForest(max_samples=n_samples,
                                        contamination=outliers_fraction,
                                        random_state=rng),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=35,
        contamination=outliers_fraction)}
# Compare given classifiers under given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 100), np.linspace(-7, 7, 100))
n_inliers = int((1. - outliers_fraction) * n_samples)
n_outliers = int(outliers_fraction * n_samples)
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = -1
# Fit the problem with varying cluster separation
for i, offset in enumerate(clusters_separation):
    # Data generation
    X1 = 0.3 * np.random.randn(n_inliers // 2, 2) - offset
    X2 = 0.3 * np.random.randn(n_inliers // 2, 2) + offset
    X = np.r_[X1, X2]
    # Add outliers
    X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]
    # Fit the model
    plt.figure(figsize=(9, 7))
    for i, (clf_name, clf) in enumerate(classifiers.items()):
        # fit the data and tag outliers
        if clf_name == "Local Outlier Factor":
            y_pred = clf.fit_predict(X)
            scores_pred = clf.negative_outlier_factor_
        else:
            clf.fit(X)
            scores_pred = clf.decision_function(X)
            y_pred = clf.predict(X)
        threshold = stats.scoreatpercentile(scores_pred,
                                            100 * outliers_fraction)
        n_errors = (y_pred != ground_truth).sum()
        # plot the level lines and the points
        if clf_name == "Local Outlier Factor":
            # decision_function is private for LOF
            Z = clf._decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        subplot = plt.subplot(2, 2, i + 1)
        subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
                         cmap=plt.cm.Blues_r)
        a = subplot.contour(xx, yy, Z, levels=[threshold],
                            linewidths=2, colors='red')
        subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
                         colors='orange')
        b = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1],
                            c='white', s=20, edgecolor='k')
        c = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1],
                            c='black', s=20, edgecolor='k')
        subplot.legend(
            [a.collections[0], b, c],
            ['learned decision function', 'true inliers', 'true outliers'],
            prop=matplotlib.font_manager.FontProperties(size=10),
            loc='lower right')
        subplot.set_xlabel("%d. %s (errors: %d)" % (i + 1, clf_name, n_errors))
        subplot.set_xlim((-7, 7))
        subplot.set_ylim((-7, 7))
    plt.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
    plt.suptitle("Outlier detection")
plt.show()

You can now detect outliers in a data set. That is a major achievement, as the most interesting features of data science work are found in these outliers and why they exist. This supports you in performing the feature engineering of the data sets.

Dimension Consolidation

The data vault consists of five categories of data (Time, Person, Object, Location, and Event), with linked relationships and additional characteristics in satellite hubs.


To perform dimension consolidation, you start with a given relationship in the data vault and construct a sun model for that relationship. I will cover the example of a person being born, to illustrate the consolidation process.


Open a new file in the Python editor and save it in the directory ..\VKHCG\01-Vermeulen\04-Transform.


You will require the Python ecosystem, so set it up by adding the following libraries in your editor:

import sys
import os
from datetime import datetime
from datetime import timedelta
from pytz import timezone, all_timezones
import pandas as pd
import sqlite3 as sq
from pandas.io import sql
import uuid
pd.options.mode.chained_assignment = None

Find the working directory of the examples.

if sys.platform == 'linux':
    Base = os.path.expanduser('~') + '/VKHCG'
else:
    Base = 'C:/VKHCG'
print('Working Base :', Base, ' using ', sys.platform)

Set up the company you are processing.

Company = '01-Vermeulen'

Add the company work space:

sDataBaseDir = Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName = sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)

Add the data vault.

sDataVaultDir = Base + '/88-DV'
if not os.path.exists(sDataVaultDir):
    os.makedirs(sDataVaultDir)
sDatabaseName = sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName)

Add the data warehouse.

sDataWarehousetDir = Base + '/99-DW'
if not os.path.exists(sDataWarehousetDir):
    os.makedirs(sDataWarehousetDir)
sDatabaseName = sDataWarehousetDir + '/datawarehouse.db'
conn3 = sq.connect(sDatabaseName)


Execute the Python code now, to set up the basic ecosystem.

Note The new data structure, called a data warehouse, is in the directory ../99-DW. The data warehouse is the only data structure delivered from the Transform step.


Let’s look at a real-world scenario.

Gunnarsson was born on December 20, 1960, at 9:15 in Landspítali, Hringbraut 101, 101 Reykjavík, Iceland. Following is what I would expect to find in the data vault.




You need a date and time of December 20, 1960, at 9:15 in Reykjavík, Iceland. Enter the following code into your editor (you start with a UTC time):

print('Time Category')
print('UTC Time')
BirthDateUTC = datetime(1960,12,20,10,15,0)
BirthDateZoneUTC = BirthDateUTC.replace(tzinfo=timezone('UTC'))
BirthDateZoneStr = BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
BirthDateZoneUTCStr = BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")

Formulate a Reykjavík local time.

print('Birth Date in Reykjavik :')
BirthZone = 'Atlantic/Reykjavik'
BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone))
BirthDateStr = BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDateLocal = BirthDate.strftime("%Y-%m-%d %H:%M:%S")


You have successfully discovered the time key for the time hub and the time zone satellite for Atlantic/Reykjavik.


  • Time Hub: You have a UTC date and time for December 20, 1960, at 9:15 in Reykjavík, Iceland, as follows:

1960-12-20 10:15:00 (UTC) (+0000)

  • Time Satellite: Birth date in Reykjavík: 1960-12-20 09:15:00 (-01) (-0100)

Now you can save your work, by adding the following to your code.

Build a data frame, as follows:

IDZoneNumber = str(uuid.uuid4())
sDateTimeKey = BirthDateZoneStr.replace(' ', '-').replace(':', '-')
TimeLine = [('ZoneBaseKey', ['UTC']),
            ('IDNumber', [IDZoneNumber]),
            ('DateTimeKey', [sDateTimeKey]),
            ('UTCDateTimeValue', [BirthDateZoneUTC]),
            ('Zone', [BirthZone]),
            ('DateTimeValue', [BirthDateStr])]
TimeFrame = pd.DataFrame.from_items(TimeLine)


Create the time hub.

TimeHub = TimeFrame[['IDNumber', 'ZoneBaseKey', 'DateTimeKey', 'DateTimeValue']]
TimeHubIndex = TimeHub.set_index(['IDNumber'], inplace=False)
sTable = 'Hub-Time-Gunnarsson'
print('Storing :', sDatabaseName, '\n Table:', sTable)
TimeHubIndex.to_sql(sTable, conn2, if_exists="replace")
sTable = 'Dim-Time-Gunnarsson'
TimeHubIndex.to_sql(sTable, conn3, if_exists="replace")


Create the time satellite.

TimeSatellite = TimeFrame[['IDNumber', 'DateTimeKey', 'Zone', 'DateTimeValue']]
TimeSatelliteIndex = TimeSatellite.set_index(['IDNumber'], inplace=False)
BirthZoneFix = BirthZone.replace(' ', '-').replace('/', '-')
sTable = 'Satellite-Time-' + BirthZoneFix + '-Gunnarsson'
print('Storing :', sDatabaseName, '\n Table:', sTable)
TimeSatelliteIndex.to_sql(sTable, conn2, if_exists="replace")
sTable = 'Dim-Time-' + BirthZoneFix + '-Gunnarsson'
TimeSatelliteIndex.to_sql(sTable, conn3, if_exists="replace")

Well done. You now have a Time category. The next category is Person.



You must record that Guðmundur Gunnarsson was born on December 20, 1960, at 9:15 in Iceland.

Add the following to your existing code:

print('Person Category')
FirstName = 'Guðmundur'
LastName = 'Gunnarsson'
print('Name:', FirstName, LastName)
print('Birth Date:', BirthDateLocal)
print('Birth Zone:', BirthZone)
print('UTC Birth Date:', BirthDateZoneStr)

Now create the person in the person hub.

IDPersonNumber = str(uuid.uuid4())
PersonLine = [('IDNumber', [IDPersonNumber]),
              ('FirstName', [FirstName]),
              ('LastName', [LastName]),
              ('Zone', ['UTC']),
              ('DateTimeValue', [BirthDateZoneStr])]
PersonFrame = pd.DataFrame.from_items(PersonLine)
PersonHubIndex = PersonFrame.set_index(['IDNumber'], inplace=False)
sTable = 'Hub-Person-Gunnarsson'
print('Storing :', sDatabaseName, '\n Table:', sTable)
PersonHubIndex.to_sql(sTable, conn2, if_exists="replace")
sTable = 'Dim-Person-Gunnarsson'
PersonHubIndex.to_sql(sTable, conn3, if_exists="replace")

Well done. You now have a person hub.

Can you create a location hub and satellite for National University Hospital of Iceland?

Latitude: 64° 08' 10.80" N or 64.136332788

Longitude: 21° 55' 22.79" W or -21.922996308


What else can you process from the address data?

Landspítali: Local Name

Hringbraut 101: Street Address

101: Post Code

Reykjavík: City Name

Iceland: Country


You can add an entry to the location hub if the specific latitude and longitude for Reykjavík does not exist yet. You can then add to a satellite named Hospitals the key from the location hub, plus all the extra information.


Alternatively, you can simply add it as a satellite called BuildingAt with the same extra information and a type_of_building indicator.


Whatever way you perform it, I advise discussing it with your customer.


Note You can data-mine many characteristics from a simple connection to a location or building. Can you now create an event hub for “Birth”? Yes, you can, easily, but you already have an event called “Born.” Use that.


Note I have found that various people have diverse names for similar events, for example, born and birth, or death and deceased. I keep a library of synonymous event names in an additional table, and when we create an event for one of the words in a synonym list, we also create it for the others.


For example, if we create an event for both born and birth, that facilitates queries from the Transform step. Whether you ask “Who was born?” or “Whose birth was it?” you get the same results retrieved from the data vault into the data warehouse.
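One possible sketch of such a synonym table, with purely illustrative names and structure, is a simple lookup that resolves every variant to a canonical event:

```python
# Illustrative synonym registry: each variant maps to one canonical event name.
event_synonyms = {
    'born': 'Born',
    'birth': 'Born',
    'death': 'Deceased',
    'deceased': 'Deceased',
}

def canonical_event(name):
    """Resolve an event name to its canonical form; unknown names pass through."""
    return event_synonyms.get(name.strip().lower(), name)

print(canonical_event('Birth'))  # Born
print(canonical_event('born'))   # Born
```

In a real data vault, this registry would live in its own table, so both questions above resolve to the same event hub entry.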


So, what about Object? Yes, you could use the genus/species data. Guðmundur is a Homo sapiens (human). Try to find the correct record to link to.


You must build three items: dimension Person, dimension Time, and fact PersonBornAtTime.


Open your Python editor and create a file in the directory ..\VKHCG\01-Vermeulen\04-Transform. Here is your basic ecosystem:

# -*- coding: utf-8 -*-
import sys
import os
from datetime import datetime
from pytz import timezone
import pandas as pd
import sqlite3 as sq
import uuid
pd.options.mode.chained_assignment = None
if sys.platform == 'linux' or sys.platform == 'darwin':
    Base = os.path.expanduser('~') + '/VKHCG'
else:
    Base = 'C:/VKHCG'
print('################################')
print('Working Base :', Base, ' using ', sys.platform)
print('################################')
Company = '01-Vermeulen'
sDataBaseDir = Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName = sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
sDataWarehousetDir = Base + '/99-DW'
if not os.path.exists(sDataWarehousetDir):
    os.makedirs(sDataWarehousetDir)
sDatabaseName = sDataWarehousetDir + '/datawarehouse.db'
conn2 = sq.connect(sDatabaseName)
Here is your Time dimension:
print('Time Dimension')
BirthZone = 'Atlantic/Reykjavik'
BirthDateUTC = datetime(1960,12,20,10,15,0)
BirthDateZoneUTC = BirthDateUTC.replace(tzinfo=timezone('UTC'))
BirthDateZoneStr = BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
BirthDateZoneUTCStr = BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone))
BirthDateStr = BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDateLocal = BirthDate.strftime("%Y-%m-%d %H:%M:%S")
IDTimeNumber = str(uuid.uuid4())
TimeLine = [('TimeID', [IDTimeNumber]),
            ('UTCDate', [BirthDateZoneStr]),
            ('LocalTime', [BirthDateLocal]),
            ('TimeZone', [BirthZone])]
TimeFrame = pd.DataFrame.from_items(TimeLine)
DimTimeIndex = TimeFrame.set_index(['TimeID'], inplace=False)
sTable = 'Dim-Time'
print('Storing :', sDatabaseName, '\n Table:', sTable)
DimTimeIndex.to_sql(sTable, conn1, if_exists="replace")
DimTimeIndex.to_sql(sTable, conn2, if_exists="replace")
Well done. You have a Time dimension. Let’s build the Person dimension.
print('\n#################################')
print('Dimension Person')
print('\n#################################')
FirstName = 'Guðmundur'
LastName = 'Gunnarsson'
IDPersonNumber = str(uuid.uuid4())
PersonLine = [('PersonID', [IDPersonNumber]),
              ('FirstName', [FirstName]),
              ('LastName', [LastName]),
              ('Zone', ['UTC']),
              ('DateTimeValue', [BirthDateZoneStr])]
PersonFrame = pd.DataFrame.from_items(PersonLine)
DimPersonIndex = PersonFrame.set_index(['PersonID'], inplace=False)
sTable = 'Dim-Person'
print('Storing :', sDatabaseName, '\n Table:', sTable)
DimPersonIndex.to_sql(sTable, conn1, if_exists="replace")
DimPersonIndex.to_sql(sTable, conn2, if_exists="replace")
Finally, we add the fact, as follows:
print('\n#################################')
print('Fact - Person - time')
print('\n#################################')
IDFactNumber = str(uuid.uuid4())
PersonTimeLine = [('IDNumber', [IDFactNumber]),
                  ('IDPersonNumber', [IDPersonNumber]),
                  ('IDTimeNumber', [IDTimeNumber])]
PersonTimeFrame = pd.DataFrame.from_items(PersonTimeLine)
FctPersonTimeIndex = PersonTimeFrame.set_index(['IDNumber'], inplace=False)
sTable = 'Fact-Person-Time'
print('Storing:', sDatabaseName, '\n Table:', sTable)
FctPersonTimeIndex.to_sql(sTable, conn1, if_exists="replace")
FctPersonTimeIndex.to_sql(sTable, conn2, if_exists="replace")

Can you now understand how to formulate a sun model and build the required dimensions and facts?


Building a Data Warehouse


As you have performed so well up to now, I will ask you to open the Transform-Sun- file from the directory ..\VKHCG\01-Vermeulen\04-Transform.


Note The Python program will build you a good and solid warehouse with which to try new data science techniques. Please be patient; it will supply you with a big push forward.


Execute the program and have some coffee once it runs. You can either code it for yourself or simply upload the code from the samples directory. The code follows if you want to understand the process. Can you understand the transformation from data vault to data warehouse?

# -*- coding: utf-8 -*-
import sys
import os
from datetime import datetime
from pytz import timezone
import pandas as pd
import sqlite3 as sq
import uuid
pd.options.mode.chained_assignment = None
if sys.platform == 'linux':
    Base = os.path.expanduser('~') + '/VKHCG'
else:
    Base = 'C:/VKHCG'
print('################################')
print('Working Base :', Base, ' using ', sys.platform)
print('################################')
Company = '01-Vermeulen'
sDataBaseDir = Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName = sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
sDataVaultDir = Base + '/88-DV'
if not os.path.exists(sDataVaultDir):
    os.makedirs(sDataVaultDir)
sDatabaseName = sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName)
sDataWarehouseDir = Base + '/99-DW'
if not os.path.exists(sDataWarehouseDir):
    os.makedirs(sDataWarehouseDir)
sDatabaseName = sDataWarehouseDir + '/datawarehouse.db'
conn3 = sq.connect(sDatabaseName)
sSQL = " SELECT DateTimeValue FROM [Hub-Time];"
DateDataRaw = pd.read_sql_query(sSQL, conn2)
DateData = DateDataRaw.head(1000)
print(DateData)
print('Time Dimension')
BirthZone = ('Atlantic/Reykjavik', 'Europe/London', 'UCT')
t = 0
mt = DateData.shape[0]
for i in range(mt):
    for j in range(len(BirthZone)):
        t += 1
        BirthDateUTC = datetime.strptime(DateData['DateTimeValue'][i],
                                         "%Y-%m-%d %H:%M:%S")
        BirthDateZoneUTC = BirthDateUTC.replace(tzinfo=timezone('UTC'))
        BirthDateZoneStr = BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
        BirthDateZoneUTCStr = BirthDateZoneUTC.strftime(
            "%Y-%m-%d %H:%M:%S (%Z) (%z)")
        BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone[j]))
        BirthDateStr = BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
        BirthDateLocal = BirthDate.strftime("%Y-%m-%d %H:%M:%S")
        IDTimeNumber = str(uuid.uuid4())
        TimeLine = [('TimeID', [str(IDTimeNumber)]),
                    ('UTCDate', [str(BirthDateZoneStr)]),
                    ('LocalTime', [str(BirthDateLocal)]),
                    ('TimeZone', [str(BirthZone[j])])]
        if t == 1:
            TimeFrame = pd.DataFrame.from_items(TimeLine)
        else:
            TimeRow = pd.DataFrame.from_items(TimeLine)
            TimeFrame = TimeFrame.append(TimeRow)
DimTimeIndex = TimeFrame.set_index(['TimeID'], inplace=False)
sTable = 'Dim-Time'
print('Storing :', sDatabaseName, '\n Table:', sTable)
DimTimeIndex.to_sql(sTable, conn1, if_exists="replace")
DimTimeIndex.to_sql(sTable, conn3, if_exists="replace")
sSQL = " SELECT " + \
       " FirstName," + \
       " SecondName," + \
       " LastName," + \
       " BirthDateKey " + \
       " FROM [Hub-Person];"
PersonDataRaw = pd.read_sql_query(sSQL, conn2)
PersonData = PersonDataRaw.head(1000)
print('Dimension Person')
t = 0
mt = PersonData.shape[0]
for i in range(mt):
    FirstName = str(PersonData['FirstName'][i])
    SecondName = str(PersonData['SecondName'][i])
    if len(SecondName) > 0:
        SecondName = SecondName.strip()
    LastName = str(PersonData['LastName'][i])
    BirthDateKey = str(PersonData['BirthDateKey'][i])
    IDPersonNumber = str(uuid.uuid4())
    PersonLine = [('PersonID', [str(IDPersonNumber)]),
                  ('FirstName', [FirstName]),
                  ('SecondName', [SecondName]),
                  ('LastName', [LastName]),
                  ('Zone', [str('UTC')]),
                  ('BirthDate', [BirthDateKey])]
    t += 1
    if t == 1:
        PersonFrame = pd.DataFrame.from_items(PersonLine)
    else:
        PersonRow = pd.DataFrame.from_items(PersonLine)
        PersonFrame = PersonFrame.append(PersonRow)
DimPersonIndex = PersonFrame.set_index(['PersonID'], inplace=False)
sTable = 'Dim-Person'
print('Storing :', sDatabaseName, '\n Table:', sTable)
DimPersonIndex.to_sql(sTable, conn1, if_exists="replace")
DimPersonIndex.to_sql(sTable, conn3, if_exists="replace")

You should now have a good example of a data vault to data warehouse transformation. Congratulations on your progress!


What Is Feature Engineering?


Feature engineering is your core technique to determine the important data characteristics in the data lake and ensure they get the correct treatment through the steps of processing. Make sure that any feature extraction technique is documented in the data transformation matrix and the data lineage.


Common Feature Extraction Techniques

I will introduce you to several common feature extraction techniques that will help you to enhance an existing data warehouse, by applying data science to the data in the warehouse.



Binning

Binning is a technique used to reduce the complexity of data sets, enabling the data scientist to evaluate the data with an organized grouping technique. Binning is a good way to turn continuous data into a data set that has specific features you can evaluate for patterns.


A simple example is the cost of candy in your local store, which might range anywhere from a penny to ten dollars. If you subgroup the price into, say, a rounded-up value, that gives you a range of five values instead of five hundred.


You have just reduced your processing complexity to one-hundredth of what it was before. There are several good techniques, which I will discuss next.
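The candy-price idea can be sketched as follows. The prices are randomly generated for illustration, and the range is trimmed to five dollars so the rounded values land in the five bins mentioned above:

```python
import numpy as np

rng = np.random.RandomState(42)
# Hypothetical candy prices between one cent and five dollars.
prices = rng.uniform(0.01, 5.0, size=500)

# Round each price up to the next whole dollar: five bin values instead of hundreds.
price_bins = np.ceil(prices)
print('Distinct values before:', len(np.unique(prices)))
print('Distinct values after :', len(np.unique(price_bins)))
```

After the rounding, every downstream pattern search works on five groups instead of hundreds of distinct prices.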


I have two binning techniques that you can use against the data sets. Open your Python editor and try these examples. The first technique is to use the digitize function.

import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
print(bin_means)
The second is to use the histogram function.
bin_means2 = (numpy.histogram(data, bins, weights=data)[0] / numpy.histogram(data, bins)[0])

This transform technique can be used to reduce the location dimension into three latitude bins and four longitude bins.

You will require the NumPy library.

import numpy as np
Set up the latitude and longitude data sets.
LatitudeData = np.array(range(-90,90,1))
LongitudeData = np.array(range(-180,180,1))
Set up the latitude and longitude data bins.
LatitudeBins = np.array(range(-90,90,45))
LongitudeBins = np.array(range(-180,180,60))
Digitize the data sets with the data bins.
LatitudeDigitized = np.digitize(LatitudeData, LatitudeBins)
LongitudeDigitized = np.digitize(LongitudeData, LongitudeBins)
Calculate the mean against the bins:
LatitudeBinMeans = [LatitudeData[LatitudeDigitized == i].mean() for i in range(1, len(LatitudeBins))]
LongitudeBinMeans = [LongitudeData[LongitudeDigitized == i].mean() for i in range(1, len(LongitudeBins))]
Well done. You have the three latitude bins and four longitude bins.
You can also use the histogram function to achieve similar results.
LatitudeBinMeans2 = (np.histogram(LatitudeData, LatitudeBins, weights=LatitudeData)[0] /
                     np.histogram(LatitudeData, LatitudeBins)[0])
LongitudeBinMeans2 = (np.histogram(LongitudeData, LongitudeBins, weights=LongitudeData)[0] /
                      np.histogram(LongitudeData, LongitudeBins)[0])

Now you can apply two different techniques for binning.



Averaging of Data

The use of averaging enables you to reduce the number of records you require to report any activity that demands a more indicative, rather than a precise, total.


Create a model that enables you to calculate the average position for ten sample points. First, set up the ecosystem.

import numpy as np
import pandas as pd
Create two series to model the latitude and longitude ranges.
LatitudeData = pd.Series(np.array(range(-90,91,1)))
LongitudeData = pd.Series(np.array(range(-180,181,1)))
You then select ten samples from each range. (The sample() calls are an assumption; the source omits this step, but a random sample of ten fits the text, and rerunning the code will then draw different samples.)
LatitudeSet = LatitudeData.sample(10)
LongitudeSet = LongitudeData.sample(10)
Calculate the average of each.
LatitudeAverage = np.average(LatitudeSet)
LongitudeAverage = np.average(LongitudeSet)
See your results.
print('Latitude (Avg):',LatitudeAverage)
print('Longitude (Avg):', LongitudeAverage)

You can now calculate the average of any range of numbers. If you run the code several times, you should get different samples. 


Challenge Question One:

In directory ..\VKHCG\01-Vermeulen\00-RawData, there is a file called IP_DATA_CORE.csv. Try to import the latitude and longitude columns and calculate the average values.


Challenge Question Two:

Try to calculate the average number of IP addresses:

  • Per country
  • Per place-name
  • Per postcode
Tip Number of IP addresses = (-1 * First IP Number + Last IP Number)
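A hedged sketch of the second challenge, using a tiny made-up frame in place of IP_DATA_CORE.csv (the column names here are assumptions, so match them to the real file before reusing this):

```python
import pandas as pd

# Made-up sample; the real column names in IP_DATA_CORE.csv may differ.
df = pd.DataFrame({
    'Country': ['US', 'US', 'GB'],
    'PlaceName': ['New York', 'New York', 'London'],
    'PostCode': ['10001', '10001', 'SW1'],
    'FirstIPNumber': [100, 600, 1000],
    'LastIPNumber': [355, 855, 1255],
})
# Apply the tip formula: number of IP addresses = -1*First IP Number + Last IP Number.
df['IPCount'] = -1 * df['FirstIPNumber'] + df['LastIPNumber']
print(df.groupby('Country')['IPCount'].mean())
```

Swap `'Country'` for `'PlaceName'` or `'PostCode'` in the groupby to answer the other two parts of the challenge.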


Latent Dirichlet Allocation (LDA)


Latent Dirichlet allocation (LDA) is a statistical model that allows sets of observations to be explained by unobserved groups, elucidating why parts of the data match or belong together — for example, topics within text documents.


This technique is useful when investigating text from a collection of documents that are common in the data lake, as companies store all their correspondence in a data lake. This model is also useful for Twitter or e-mail analysis.


Note To run the example, you will require pip install lda.


In your Python editor, create a new file named Transform_Latent_Dirichlet_ in the directory ..\VKHCG\01-Vermeulen\04-Transform. Following is an example of what you can achieve:

import numpy as np
import lda
import lda.datasets
X = lda.datasets.load_reuters()
vocab = lda.datasets.load_reuters_vocab()
titles = lda.datasets.load_reuters_titles()
You can experiment with ranges of n_topics and n_iter values to observe the impact on the process.
model = lda.LDA(n_topics=50, n_iter=1500, random_state=1)
topic_word = model.topic_word_
n_top_words = 10
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Investigate the top topic for the first ten documents.
doc_topic = model.doc_topic_
for i in range(10):
    print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))


Well done. You can now analyze text documents.

Do you think if you had Hillman’s logistics shipping document you could use this technique? Indeed, you could, as this technique will work on any text note fields or e-mail, even Twitter entries.

  • Now, you can read your Twitter accounts.
  • Now, you can read e-mail.

The complete process is about getting data to the data lake and then guiding it through the steps: retrieve, assess, process, and transform.


A tip I have found, on average, that it is only after the third recheck that 90% of the data science is complete. The process is an iterative design process. The methodology is based on a cyclic process of prototyping, testing, analyzing, and refining. Success will be achieved as you close out the prototypes.

I will now explain a set of common data science terminology that you will encounter in the field of data science.


Hypothesis Testing


Hypothesis testing is not precisely an algorithm, but it’s a must-know for any data scientist. You cannot progress until you have thoroughly mastered this technique.


Hypothesis testing is the process of using statistical tests to check whether a hypothesis is supported by the data. Based on the test results, data scientists choose to accept or reject the hypothesis.


When an event occurs, it can be a trend or happen by chance. To check whether the event is an important occurrence or just happenstance, hypothesis testing is necessary.
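As a minimal illustration of trend versus chance (the coin-flip numbers are invented), a binomial test checks whether 58 heads in 100 flips is a real bias or just happenstance; this assumes SciPy 1.7 or later, which provides stats.binomtest:

```python
from scipy import stats

# H0: the coin is fair (p = 0.5). Observed: 58 heads out of 100 flips.
result = stats.binomtest(58, n=100, p=0.5)
print(round(result.pvalue, 3))
```

Because the p-value stays above 0.05, 58 heads is not enough evidence to call the coin biased; the deviation is plausibly chance.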


There are many tests for hypothesis testing, but the following two are the most popular.



T-Test

The t-test is a popular statistical test used to make inferences about a single mean, or about two means or variances — that is, to check whether two groups' means are statistically different from each other — when n < 30 and the standard deviation is unknown.


For more information on the t-test, see the SciPy reference guide entries for scipy.stats.t and scipy.stats.ttest_ind.



First, you set up the ecosystem, as follows:

import numpy as np

from scipy.stats import ttest_ind, ttest_ind_from_stats
from scipy.special import stdtr


Create a set of “unknown” data. (This can be a set of data you want to analyze.) In the following example, there are five random data sets. You can select them by having nSet equal 1, 2, 3, 4, or 5.

nSet = 1
if nSet == 1:
    a = np.random.randn(40)
    b = 4*np.random.randn(50)
if nSet == 2:
    pass  # data set definition missing in the source
if nSet == 3:
    pass  # data set definition missing in the source
if nSet == 4:
    pass  # data set definition missing in the source
if nSet == 5:
    a = np.array([55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
    b = np.array([55.0, 56.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
First, you will use scipy’s t-test.
# Use scipy.stats.ttest_ind.
t, p = ttest_ind(a, b, equal_var=False)
print("t-Test_ind: t = %g p = %g" % (t, p))
Second, you will get the descriptive statistics.
# Compute the descriptive statistics of a and b.
abar = a.mean()
avar = a.var(ddof=1)
na = a.size
adof = na - 1
bbar = b.mean()
bvar = b.var(ddof=1)
nb = b.size
bdof = nb - 1
# Use scipy.stats.ttest_ind_from_stats.
t2, p2 = ttest_ind_from_stats(abar, np.sqrt(avar), na,
                              bbar, np.sqrt(bvar), nb,
                              equal_var=False)
print("t-Test_ind_from_stats: t = %g p = %g" % (t2, p2))
Look at Welch’s t-test formula.
Third, you can use the formula to calculate the test.
# Use the formulas directly.
tf = (abar - bbar) / np.sqrt(avar/na + bvar/nb)
dof = (avar/na + bvar/nb)**2 / (avar**2/(na**2*adof) + bvar**2/ (nb**2*bdof))
pf = 2*stdtr(dof, -np.abs(tf))
print("Formula: t = %g p = %g" % (tf, pf))
if pf < 0.001:
    print('Statistically highly significant:', pf)
elif pf < 0.05:
    print('Statistically significant:', pf)
else:
    print('No conclusion')
You should see results like this:
t-Test_ind: t = -1.5827 p = 0.118873
t-Test_ind_from_stats: t = -1.5827 p = 0.118873
Formula: t = -1.5827 p = 0.118873


No conclusion

Your results are as follows. The p is the probability — how likely it is that your results occurred by chance. In this case, it is about 12 percent (p-value ≈ 0.12).


The p-value results can be statistically significant when P < 0.05 and statistically highly significant if P < 0.001 (a less than one-in-a-thousand chance of being wrong).


So, in this case, the result is neither statistically significant nor statistically highly significant, as the p-value is about 0.12. Go back and change nSet at the beginning of the code you just entered.

Remember: I mentioned that you can select them by nSet = 1, 2, 3, 4, or 5.


Retest the data sets. You should now see that the p-value changes, and you should also understand that the test gives you a good indicator of whether the two results sets are similar.

Can you find the 99.99%?


Chi-Square Test


A chi-square (χ2) test is used to examine whether two distributions of categorical variables are significantly different from each other.


Try these examples that are generated with five different datasets. First, set up the ecosystem.

import numpy as np
import scipy.stats as st
Create data sets.
# Create sample data sets.
nSet = 1
if nSet == 1:
    a = abs(np.random.randn(50))
    b = abs(50*np.random.randn(50))
if nSet == 2:
    pass  # data set definition missing in the source
if nSet == 3:
    pass  # data set definition missing in the source
if nSet == 4:
    pass  # data set definition missing in the source
if nSet == 5:
    a = np.array([55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
    b = np.array([55.0, 56.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
obs = np.array([a, b])
Perform the test.
chi2, p, dof, expected = st.chi2_contingency(obs)
Display the results.
msg = "Test Statistic: {}\np-value: {}\ndof: {}\nexpected: {}\n"
print(msg.format(chi2, p, dof, expected))
if p < 0.001:
    print('Statistically highly significant:', p)
elif p < 0.05:
    print('Statistically significant:', p)
else:
    print('No conclusion')

Can you understand what the test indicates as you cycle the nSet through samples 1–5?
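To see the mechanics on a small, made-up contingency table, independent of the nSet samples:

```python
import numpy as np
import scipy.stats as st

# Made-up 2x3 table of observed counts for two categorical variables.
obs = np.array([[10, 20, 30],
                [20, 20, 20]])
chi2, p, dof, expected = st.chi2_contingency(obs)
print('dof:', dof)  # (rows-1)*(cols-1) = 2
print('expected counts under independence:\n', expected)
```

The expected array shows the counts you would see if the two variables were independent; the larger the gap between observed and expected, the larger the chi-square statistic.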


Overfitting and Underfitting

Overfitting and underfitting are major problems when data scientists retrieve data insights from the data sets they are investigating.


Overfitting occurs when the data scientist generates a model that fits the training set perfectly but does not generalize to an unknown, future, real-world data set. Because the model is so tightly fitted to the known data, even a minor outlier is not classified correctly.


The solution works only for the specific data set and no other. For example, suppose you classify any person earning more than $150,000 as rich and everyone else as poor. This rigid binary classification will not generalize: is a person earning $145,000 really poor?


Underfitting occurs when the data scientist's results are so nonspecific that the predictive models are inappropriately applied, or the insights drawn from them are questionable.


For example, your person classifier has a 48% success rate to determine the sex of a person. That will never work, as with a binary guess, you could achieve a 50% rating by simply guessing.
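You can simulate this argument directly: random guessing on a binary label already lands near 50%, so a 48% classifier is worse than no model at all:

```python
import random

random.seed(42)
n = 10000
truth = [random.choice(['M', 'F']) for _ in range(n)]
guess = [random.choice(['M', 'F']) for _ in range(n)]  # pure guessing, no model
accuracy = sum(g == t for g, t in zip(guess, truth)) / n
print(round(accuracy, 2))  # hovers around 0.50
```

Any useful binary classifier must beat this baseline by a clear margin before its insights deserve your customers' trust.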


Your data science must offer a significant level of insight for you to secure the trust of your customers, so they can confidently take business decisions, based on the insights you provide them.


Polynomial Features


The polynomic formula is the following: (a1x + b1)(a2x + b2) = a1a2x^2 + (a1b2 + a2b1)x + b1b2.


The polynomial feature extraction can use a chain of polynomic formulas to create a hyperplane that will subdivide any data sets into the correct cluster groups. The higher the polynomic complexity, the more precise the result that can be achieved.
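You can verify the polynomic formula numerically with NumPy's polynomial multiplication (the coefficient values here are arbitrary):

```python
import numpy as np

# (a1*x + b1)(a2*x + b2) = a1*a2*x^2 + (a1*b2 + a2*b1)*x + b1*b2
a1, b1, a2, b2 = 2.0, 3.0, 5.0, 7.0
product = np.polymul([a1, b1], [a2, b2])  # coefficients, highest degree first
print(product)  # [10. 29. 21.], i.e. 10x^2 + 29x + 21
```

Check the middle term by hand: a1*b2 + a2*b1 = 2*7 + 5*3 = 29, matching the second coefficient.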


import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

def f(x):
    """Function to approximate by polynomial interpolation."""
    return x * np.sin(x)

# Generate points used to plot.
x_plot = np.linspace(0, 10, 100)
# Generate points and keep a subset of them.
x = np.linspace(0, 10, 100)
rng = np.random.RandomState(0)
rng.shuffle(x)
x = np.sort(x[:20])
y = f(x)
# Create matrix versions of these arrays.
X = x[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]
colors = ['teal', 'yellowgreen', 'gold']
lw = 2
plt.plot(x_plot, f(x_plot), color='cornflowerblue', linewidth=lw, label="Ground Truth")
plt.scatter(x, y, color='navy', s=30, marker='o', label="Training points")
for count, degree in enumerate([3, 4, 5]):
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    model.fit(X, y)
    y_plot = model.predict(X_plot)
    plt.plot(x_plot, y_plot, color=colors[count], linewidth=lw, label="Degree %d" % degree)
plt.legend(loc='lower left')
plt.show()

Now that you know how to generate a polynomic formula to match any curve, I will show you a practical application using a real-life data set.


Common Data-Fitting Issue


These higher-order polynomic formulas are, however, more prone to overfitting, while lower-order formulas are more likely to underfit. Good data science is a delicate balance between these two extremes.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def true_fun(X):
    return np.cos(1.5 * np.pi * X)

n_samples = 30
degrees = [1, 4, 15]
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    # Evaluate the models using cross-validation.
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
plt.show()


Receiver Operating Characteristic (ROC) Analysis Curves


A receiver operating characteristic (ROC) analysis curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.


The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall, or probability of detection.
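A minimal sketch of one point on the ROC curve, computed by hand at a single threshold (the scores and labels are invented):

```python
import numpy as np

y_true  = np.array([0, 0, 0, 1, 1, 1])           # ground-truth labels
y_score = np.array([0.2, 0.6, 0.3, 0.8, 0.55, 0.9])  # classifier scores
threshold = 0.5
y_pred = (y_score >= threshold).astype(int)
tpr = ((y_pred == 1) & (y_true == 1)).sum() / (y_true == 1).sum()  # sensitivity/recall
fpr = ((y_pred == 1) & (y_true == 0)).sum() / (y_true == 0).sum()
print(tpr, round(fpr, 3))
```

Sweeping the threshold from 1 down to 0 and plotting each (FPR, TPR) pair traces out the full ROC curve that the larger example below draws with scikit-learn.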


You will find the ROC analysis curves useful for evaluating whether your classification or feature engineering is good enough to determine the value of the insights you are finding.


This helps with repeatable results against real-world data sets. So, if you suggest that your customers take a specific action as a result of your findings, ROC analysis curves will support your advice and also convey the quality of the insights at the given parameters.


You should now open your Python editor and create the following ecosystem.


import numpy as np
from scipy import interp
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
# #########
# Data IO and generation
# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
X, y = X[y != 2], y[y != 2]
n_samples, n_features = X.shape
# Add noisy features
random_state = np.random.RandomState(0)
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
# #########
# Classification and ROC analysis
# Run classifier with cross-validation and plot ROC curves
cv = StratifiedKFold(n_splits=6)
classifier = svm.SVC(kernel='linear', probability=True, random_state=random_state)
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 0
for train, test in cv.split(X, y):
    probas_ = classifier.fit(X[train], y[train]).predict_proba(X[test])
    # Compute the ROC curve and the area under the curve.
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    tprs.append(interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
             label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i += 1
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Luck', alpha=.8)
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc), lw=2, alpha=.8)
std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='gray', alpha=.2, label=r'$\pm$ 1 std. dev.')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()



Cross-Validation Test


Cross-validation is a model validation technique for evaluating how the results of a statistical analysis will generalize to an independent data set. It is mostly used in settings where the goal is prediction.


Knowing how to calculate a test such as this enables you to validate the application of your model on real-world, i.e., independent data sets.
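As a minimal sketch before the fuller walkthrough, cross_val_score can validate a simple classifier on a standard sample data set in a few lines:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()
# Five-fold cross-validation: train on four folds, score on the held-out fold.
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         iris.data, iris.target, cv=5)
print(scores.mean())  # typically well above 0.9 on this data set
```

Each of the five scores comes from a fold the model never saw during training, which is what makes the average an honest estimate of generalization.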


I will guide you through a test. Open your Python editor and create the following ecosystem:

import numpy as np
from sklearn.model_selection import cross_val_score from sklearn import datasets, svm import matplotlib.pyplot as plt
digits = datasets.load_digits()
X =
y =
Let’s pick three different kernels and compare how they will perform.
kernels=['linear', 'poly', 'rbf']
for kernel in kernels:
    svc = svm.SVC(kernel=kernel)
    C_s = np.logspace(-15, 0, 15)
    scores = list()
    scores_std = list()
    for C in C_s:
        svc.C = C
        this_scores = cross_val_score(svc, X, y, n_jobs=1)
        scores.append(np.mean(this_scores))
        scores_std.append(np.std(this_scores))
Plot your results. (This block continues inside the kernel loop, so it stays indented.)
    Title = "Kernel:> " + kernel
    fig = plt.figure(1, figsize=(8, 6))
    fig.suptitle(Title, fontsize=20)
    plt.semilogx(C_s, scores)
    plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
    plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
    locs, labels = plt.yticks()
    plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
    plt.ylabel('Cross-Validation Score')
    plt.xlabel('Parameter C')
    plt.ylim(0, 1.1)
    plt.show()

Well done. You can now perform cross-validation of your results.


Univariate Analysis


Univariate analysis is the simplest form of analyzing data. Uni means “one,” so your data has only one variable. It doesn’t deal with causes or relationships, and its main purpose is to describe. It takes data, summarizes that data, and finds patterns in the data.


The patterns found in univariate data include central tendency (mean, mode, and median) and dispersion (range, variance, maximum, minimum, quartiles — including the interquartile range — and standard deviation).
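These univariate summaries can be computed directly with the standard library (the scores below are invented):

```python
import statistics

# One variable: a small set of made-up exam scores.
scores = [52, 55, 55, 60, 62, 70, 75]
print('mean  :', round(statistics.mean(scores), 2))
print('median:', statistics.median(scores))
print('mode  :', statistics.mode(scores))
print('stdev :', round(statistics.stdev(scores), 2))
print('range :', max(scores) - min(scores))
```

Note there is no second variable anywhere in the calculation: describing one column at a time is the whole of univariate analysis.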



For example: How many students are graduating with a data science degree? You have several options for describing data using a univariate approach: frequency distribution tables, frequency polygons, histograms, bar charts, or pie charts.


Bivariate Analysis

Bivariate analysis is when two variables are analyzed together for any possible association or empirical relationship — for example, the correlation between gender and graduation with a data science degree.


Canonical correlation in the experimental context is to take two sets of variables and see what is common between the two sets.


Graphs that are appropriate for bivariate analysis depend on the type of variable. For two continuous variables, a scatterplot is a common graph. When one variable is categorical and the other continuous, a box plot is common, and when both are categorical, a mosaic plot is common.
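A small bivariate sketch: the Pearson correlation between two continuous variables (the values are made up):

```python
import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5, 6])
exam_score    = np.array([52, 55, 61, 64, 70, 74])
# Pearson correlation coefficient: +1 perfect positive, -1 perfect negative.
r = np.corrcoef(hours_studied, exam_score)[0, 1]
print(round(r, 3))  # close to +1: a strong positive association
```

This pair of continuous variables is exactly the case where a scatterplot is the appropriate graph.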


Multivariate Analysis

Multivariate data analysis refers to any statistical technique used to analyze data that arises from more than one variable. This essentially models reality, in which each situation, product, or decision involves more than a single variable. More than two variables are analyzed together for any possible association or interactions.



For example: What is the correlation between gender, country of residence, and graduation with a data science degree? Any statistical modeling exercise, such as regression, decision trees, SVM, and clustering, is multivariate in nature. The analysis is used when more than two variables determine the final outcome.


Linear Regression


Linear regression is a statistical modeling technique that endeavors to model the relationship between an explanatory variable and a dependent variable, by fitting the observed data points on a linear equation, for example, modeling the body mass index (BMI) of individuals by using their weight.


Warning A regression analysis with one dependent variable and four independent variables is not a multivariate regression. It’s a multiple regression. Multivariate analysis always refers to the dependent variable.


Simple Linear Regression

Linear regression is used if there is a relationship or significant association between the variables. This can be checked by scatterplots. If no linear association appears between the variables, fitting a linear regression model to the data will not provide a useful model.

A linear regression line has an equation of the following form:

Y = a + bX

where X = explanatory variable,
Y = dependent variable,
b = slope of the line, and
a = intercept (the value of Y when X = 0).
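A quick check of the Y = a + bX form, fitting points that lie exactly on y = 2x + 1 (the values are chosen for illustration):

```python
import numpy as np

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
Y = 2.0 * X + 1.0                # points exactly on the line y = 2x + 1
b, a = np.polyfit(X, Y, 1)       # degree-1 fit returns slope, then intercept
print(round(b, 6), round(a, 6))  # 2.0 1.0
```

The fit recovers b = 2 (slope) and a = 1 (intercept), matching the equation's coefficients.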


# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
import sqlite3 as sq
import matplotlib.pyplot as plt
import numpy as np

Company = '01-Vermeulen'
if sys.platform == 'linux':
    Base = os.path.expanduser('~') + '/VKHCG'
else:
    Base = 'C:/VKHCG'
print('################################')
print('Working Base :', Base, ' using ', sys.platform)
print('################################')

sDataBaseDir = Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName1 = sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName1)

sDataVaultDir = Base + '/88-DV'
if not os.path.exists(sDataVaultDir):
    os.makedirs(sDataVaultDir)
sDatabaseName2 = sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName2)

sDataWarehouseDir = Base + '/99-DW'
if not os.path.exists(sDataWarehouseDir):
    os.makedirs(sDataWarehouseDir)
sDatabaseName3 = sDataWarehouseDir + '/datawarehouse.db'
conn3 = sq.connect(sDatabaseName3)

t = 0
for heightSelect in range(100, 300, 10):
    for weightSelect in range(30, 300, 5):
        height = round(heightSelect/100, 3)
        weight = int(weightSelect)
        t += 1
        bmi = weight/(height*height)
        # The category cut-offs for 2-4 are assumed from the standard BMI bands;
        # the source only shows the first branch.
        if bmi <= 18.5:
            BMI_Result = 1      # underweight
        elif bmi < 25:
            BMI_Result = 2      # normal range
        elif bmi < 30:
            BMI_Result = 3      # overweight
        else:
            BMI_Result = 4      # obese
        PersonLine = {'PersonID': [str(t)],
                      'Height': [height],
                      'Weight': [weight],
                      'bmi': [bmi],
                      'Indicator': [BMI_Result]}
        if t == 1:
            PersonFrame = pd.DataFrame(PersonLine)
        else:
            PersonRow = pd.DataFrame(PersonLine)
            # DataFrame.append was removed in pandas 2.0; concat is the idiom.
            PersonFrame = pd.concat([PersonFrame, PersonRow])

DimPerson = PersonFrame
DimPersonIndex = DimPerson.set_index(['PersonID'], inplace=False)

sTable = 'Transform-BMI'
print('Storing :', sDatabaseName1, '\n Table:', sTable)
DimPersonIndex.to_sql(sTable, conn1, if_exists="replace")

sTable = 'Person-Satellite-BMI'
print('Storing :', sDatabaseName2, '\n Table:', sTable)
DimPersonIndex.to_sql(sTable, conn2, if_exists="replace")

sTable = 'Dim-BMI'
print('Storing :', sDatabaseName3, '\n Table:', sTable)
DimPersonIndex.to_sql(sTable, conn3, if_exists="replace")

fig = plt.figure()
# The source does not define x and y; plotting height against BMI is illustrative.
x = DimPersonIndex['Height']
y = DimPersonIndex['bmi']
plt.plot(x, y, ".")
plt.title("BMI Curve")
plt.show()


Now that we have identified the persons at risk, we can study the linear regression of these diabetics.


Note You will use the standard diabetes data sample set that is installed with the sklearn library, the reason being the protection of medical data.


As this data is in the public domain, you are permitted to access it.

Warning When you process people's personal information, you are accountable for any issues your processing causes. So, work with great care.


Note In the next example, we will use a medical data set that is part of the standard sklearn library. This ensures that you are not working with unauthorized medical results.

You set up the ecosystem, as follows:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
Load the data set.
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
Perform feature development.
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
Split the data into train and test data sets.
diabetes_X_train = diabetes_X[:-30]
diabetes_X_test = diabetes_X[-30:]
Split the target into train and test data sets. (The source's test slice overlapped the training slice; the last 30 records are used here so the two sets stay disjoint.)
diabetes_y_train = diabetes.target[:-30]
diabetes_y_test = diabetes.target[-30:]
Generate a linear regression model.
regr = linear_model.LinearRegression()
Train the model using the training sets.
regr.fit(diabetes_X_train, diabetes_y_train)
Create predictions, using the testing set.
diabetes_y_pred = regr.predict(diabetes_X_test)
Display the coefficients.
print('Coefficients: \n', regr.coef_)
Display the mean squared error.
print("Mean squared error: %.2f"
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
Display the variance score. (Tip: A score of 1 is perfect prediction.)
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
Plot outputs.
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)
plt.show()

Well done. You have successfully calculated the BMI and determined the diabetes rate of our staff.


RANSAC Linear Regression


RANSAC (RANdom SAmple Consensus) is an iterative algorithm for the robust estimation of parameters from a subset of inliers from the complete data set.


An advantage of RANSAC is its ability to do a robust estimation of the model parameters, i.e., it can estimate the parameters with a high degree of accuracy, even when a significant number of outliers is present in the data set. Because the algorithm is so robust, it will usually find a solution even in noisy data.


Generally, this technique is used in image processing, owing to the noise in that domain. See the scikit-learn documentation for sklearn.linear_model.RANSACRegressor.


import numpy as np
from matplotlib import pyplot as plt
from sklearn import linear_model, datasets
n_samples = 1000
n_outliers = 50
X, y, coef = datasets.make_regression(n_samples=n_samples, n_features=1,
n_informative=1, noise=10,
coef=True, random_state=0)
# Add outlier data
np.random.seed(0)
X[:n_outliers] = 3 + 0.5 * np.random.normal(size=(n_outliers, 1))
y[:n_outliers] = -3 + 10 * np.random.normal(size=n_outliers)
# Fit a line using all the data
lr = linear_model.LinearRegression()
lr.fit(X, y)
# Robustly fit the linear model with the RANSAC algorithm
ransac = linear_model.RANSACRegressor()
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)
# Predict data of estimated models
line_X = np.arange(X.min(), X.max())[:, np.newaxis]
line_y = lr.predict(line_X)
line_y_ransac = ransac.predict(line_X)
# Compare estimated coefficients
print("Estimated coefficients (true, linear regression, RANSAC):")
print(coef, lr.coef_, ransac.estimator_.coef_)
lw = 2
plt.scatter(X[inlier_mask], y[inlier_mask], color='yellowgreen', marker='.', label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask], color='gold', marker='.', label='Outliers')
plt.plot(line_X, line_y, color='navy', linewidth=lw, label='Linear regressor')
plt.plot(line_X, line_y_ransac, color='cornflowerblue', linewidth=lw, label='RANSAC regressor')
plt.legend(loc='lower right')
plt.show()


This regression technique is extremely useful when using robotics and robot vision in which the robot requires the regression of the changes between two data frames or data sets.


Hough Transform

The Hough transform is a feature extraction technique used in image analysis, computer vision, and digital image processing. The purpose of the technique is to find imperfect instances of objects within a certain class of shapes, by a voting procedure.


This voting procedure is carried out in a parameter space, from which object candidates are obtained as local maxima in a so-called accumulator space that is explicitly constructed by the algorithm for computing the Hough transform.


With the help of the Hough transform, this regression improves on the resolution of the RANSAC technique. It is extremely useful in robotics and robot vision, in which the robot requires the regression of the changes between two data frames or data sets to move through an environment.
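The voting procedure can be sketched with a tiny accumulator in (rho, theta) space; here ten points on the vertical line x = 5 all vote for the same cell (a toy illustration, not an image-processing pipeline):

```python
import numpy as np

points = [(5, y) for y in range(10)]    # ten points on the line x = 5
thetas = np.deg2rad(np.arange(0, 180))  # candidate line orientations in degrees
diag = 20                               # offset so rho indexes stay non-negative
accumulator = np.zeros((2 * diag, len(thetas)), dtype=int)
for x, y in points:
    for t_idx, theta in enumerate(thetas):
        # Normal form of a line: rho = x*cos(theta) + y*sin(theta).
        rho = int(round(x * np.cos(theta) + y * np.sin(theta))) + diag
        accumulator[rho, t_idx] += 1
print(accumulator.max())  # 10: every point votes for the shared line
```

The local maximum in the accumulator is the "object candidate" the text describes: the (rho, theta) cell that collected the most votes identifies the line.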


Logistic Regression


Logistic regression is the technique to find relationships between a set of input variables and an output variable (just like any regression), but the output variable, in this case, is a binary outcome (think of 0/1 or yes/no).


Simple Logistic Regression

I will guide you through a simple logistic regression that only compares two values. A real-world business example would be the study of a traffic jam at a certain location in London, using a binary variable. The output is a categorical: yes or no. Hence, is there a traffic jam? Yes or no?


The probability of occurrence of traffic jams can be dependent on attributes such as weather condition, the day of the week and month, time of day, number of vehicles, etc.


Using logistic regression, you can find the best-fitting model that explains the relationship between independent attributes and traffic jam occurrence rates and predicts the probability of jam occurrence.


This process is called a binary logistic regression.

The state of the traffic changes from No = zero to Yes = one, by moving along a curve, which the following code sketches:

import math
for x in range(-10, 10, 1):
    print(math.sin(x/10))
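The curve conventionally used for this zero-to-one transition is the logistic (sigmoid) function; a sketch follows:

```python
import math

def sigmoid(z):
    # Maps any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))             # 0.5 - the decision boundary
print(round(sigmoid(6), 3))   # near 1: traffic jam = yes
print(round(sigmoid(-6), 3))  # near 0: traffic jam = no
```

Logistic regression fits a weighted sum of the input attributes (weather, day of the week, time of day, and so on) and passes it through this function to produce the probability of a jam.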

I will now demonstrate logistic regression, using a sample data set.



from sklearn import datasets, neighbors, linear_model
Load the data.
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
n_samples = len(X_digits)
Select the train data set.
X_train = X_digits[:int(.9 * n_samples)]
y_train = y_digits[:int(.9 * n_samples)]
X_test = X_digits[int(.9 * n_samples):]
y_test = y_digits[int(.9 * n_samples):]
Select the K-Neighbor classifier.
knn = neighbors.KNeighborsClassifier()
Select the logistic regression model.
logistic = linear_model.LogisticRegression()
Train the model to perform logistic regression.
print('KNN score: %f' % knn.fit(X_train, y_train).score(X_test, y_test))
Apply the trained model against the test data set.
print('LogisticRegression score: %f'
      % logistic.fit(X_train, y_train).score(X_test, y_test))

Well done. You have just completed your next transform step, by successfully deploying a logistic regression model with a K-Neighbor classifier against the sample data set.


Tip Using this simple process, I have discovered numerous thought-provoking correlations or busted myths about relationships between data values.


Multinomial Logistic Regression


Multinomial logistic regression (MLR) is a form of linear regression analysis conducted when the dependent variable is nominal with more than two levels.


It is used to describe data and to explain the relationship between one dependent nominal variable and one or more continuous-level (interval or ratio scale) independent variables. You can consider the nominal variable as a variable that has no intrinsic ordering.


This type of data is most common in the business world, as it generally covers most data entries within the data sources and directly indicates what you could expect in the average data lake. The data has no intrinsic order or relationship.


I will now guide you through the example, to show you how to deal with these data sets. You need the following libraries for the ecosystem:

import time
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_mldata
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import check_random_state


Set up the results target.

sPicNameOut = '../VKHCG/01-Vermeulen/04-Transform/01-EDS/02-Python/Letters.png'

You must tune a few parameters.

t0 = time.time()
train_samples = 5000
Get the sample data from simulated data lake.
mnist = fetch_mldata('MNIST original')
# Note: in newer scikit-learn versions, fetch_mldata was replaced by
# fetch_openml('mnist_784').
Data engineer the data lake data.
X = mnist.data.astype('float64')
y = mnist.target
random_state = check_random_state(0)
permutation = random_state.permutation(X.shape[0])
X = X[permutation]
y = y[permutation]
X = X.reshape((X.shape[0], -1))
Train the data model.
X_train, X_test, y_train, y_test = train_test_split( X, y, train_size=train_samples, test_size=10000)
Apply a scaler to training to inhibit overfitting.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Tune the tolerance (tol = 0.1) for faster convergence.
clf = LogisticRegression(C=50. / train_samples,
                         penalty='l1', solver='saga', tol=0.1)
Apply the model to the data set.
clf.fit(X_train, y_train)
sparsity = np.mean(clf.coef_ == 0) * 100
Score the model.
score = clf.score(X_test, y_test)
print('C: %.4f' % clf.C)
print("Sparsity with L1 penalty: %.2f%%" % sparsity)
print("Test score with L1 penalty: %.4f" % score)
coef = clf.coef_.copy()
Display the results.
Fig=plt.figure(figsize=(15, 6))
scale = np.abs(coef).max()
for i in range(10):
l1_plot = plt.subplot(2, 5, i + 1)
l1_plot.imshow(coef[i].reshape(28, 28), interpolation='nearest',
               cmap=plt.cm.RdBu, vmin=-scale, vmax=scale)
l1_plot.set_xlabel('Letter %i' % i)
plt.suptitle('Classification vector for...')
run_time = time.time() - t0
print('Process run in %.3f s' % run_time)
Save results to disk.
Fig.savefig(sPicNameOut, dpi=300)

You’re performing well with the examples. Now, you can handle data that has no intrinsic order. If you understand this process, you have successfully achieved a major milestone in your understanding of the transformation of data lakes via data vaults, by using a Transform step.


Tip I normally store any results back into the data warehouse as a sun model, which then gets physicalized as facts and dimensions in the data warehouse.


Ordinal Logistic Regression

Ordinal logistic regression is a type of binomial logistic regression. Ordinal regression is used to predict a dependent variable that has multiple ordered categories, using one or more independent variables.


In other words, it is used to facilitate the interaction of dependent variables (having multiple ordered levels) with one or more independent variables.


This data type is an extremely good data set to process, as you already have a relationship between the data entries that is known. Deploying your Transform step’s algorithms will give you insights into how strongly or weakly this relationship supports the data discovery process.
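Scikit-learn has no dedicated ordinal regression estimator, so the sketch below uses the common cumulative-split trick: fit one binary logistic model per threshold P(y > k) and difference the cumulative estimates. The data and threshold values are synthetic, invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 2))
# An ordered target driven by the first feature: 0=low, 1=medium, 2=high.
score = X[:, 0] + 0.1 * rng.normal(size=300)
y = np.digitize(score, [-0.5, 0.5])

# One binary classifier per ordered threshold: P(y > 0) and P(y > 1).
models = [LogisticRegression().fit(X, (y > k).astype(int)) for k in range(2)]
p_gt = np.column_stack([m.predict_proba(X)[:, 1] for m in models])

# Recover per-level probabilities by differencing the cumulative estimates.
proba = np.column_stack([1 - p_gt[:, 0],
                         p_gt[:, 0] - p_gt[:, 1],
                         p_gt[:, 1]])
pred = proba.argmax(axis=1)
print('Training accuracy: %.3f' % (pred == y).mean())
```

This is a sketch of the idea only; the two binary models are fitted independently, so a dedicated ordinal estimator would enforce consistency between the thresholds.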


Clustering Techniques

Clustering Techniques

Clustering (or segmentation) is a kind of unsupervised learning algorithm, in which a data set is grouped into unique, differentiated clusters. Let’s say we have customer data spanning 1000 rows.


Using clustering, we can group the customers into separate clusters or segments, based on the variables. In the case of customers’ data, the variables can be demographic information or purchasing activities.


Clustering is an unsupervised learning algorithm because the input is unknown to the data scientist as no training set is available to pre-train a model of the solution.


You do not train the algorithm on any past input-output information, but you let the algorithm define the output for itself. Therefore, there is no right solution to a clustering algorithm, only a reasonably best-fit solution, based on business usability. Clustering is also known as unsupervised classification.

There are two basic types of clustering techniques.


Hierarchical Clustering

Hierarchical Clustering

Hierarchical clustering is a method of cluster analysis whereby you build a hierarchy of clusters. This works well for data sets that are complex and have distinct characteristics for separated clusters of data.


The following would be an example. People on a budget are more attracted to your sale items and multi-buy combinations, while more prosperous shoppers are more brand-orientated. These are two clearly different clusters, with poles-apart driving forces to buy an item. There are two design styles:



Agglomerative: This is a bottom-up approach. Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.



Divisive: This is a top-down approach. All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. I will take you through the transformation process to generate a hierarchical cluster.

Open your Python editor and set up a new ecosystem.

from matplotlib import pyplot as plt

from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np


You will now generate two clusters: (a), with 100 points, and (b), with 50.

a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
X = np.concatenate((a, b),)
This creates 150 samples with 2 dimensions.
print( X.shape)
Let’s quickly display it.
plt.scatter(X[:,0], X[:,1])


You must generate the linkage matrix. The matrix contains the linkage criterion that determines the distance between sets of observations as a function of the pairwise distances between observations.


For more information on this matrix, see the SciPy reference documentation for scipy.cluster.hierarchy.linkage.

Z = linkage(X, 'ward')
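It helps to see what the linkage matrix actually stores. Each row records one merge: the two cluster indices joined, the merge distance, and the number of original samples in the new cluster. The small data set below is generated only for this illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

np.random.seed(0)
pts = np.random.multivariate_normal([0, 0], [[1, 0], [0, 1]], size=10)
Z_small = linkage(pts, 'ward')

# Row 0 is the first (closest) merge performed by the algorithm.
# For n samples, the linkage matrix has n-1 rows and 4 columns.
idx1, idx2, dist, count = Z_small[0]
print('Merged clusters %d and %d at distance %.3f (%d samples)'
      % (idx1, idx2, dist, count))
```

The first merge always joins two singleton observations, so its sample count is 2.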


Enhance the ecosystem, by adding extra libraries.

from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist


Compute the cophenetic correlation coefficient. The cophenetic correlation coefficient is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. In simple terms, how accurate is the measure?

c, coph_dists = cophenet(Z, pdist(X))
print('Cophenetic Correlation Coefficient:',c)
You will now calculate a full dendrogram.
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
dendrogram(Z, leaf_rotation=90., leaf_font_size=8.)
Now, you can truncate the cluster (show only the last p merged clusters).
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index')
dendrogram(Z, truncate_mode='lastp', p=12, leaf_rotation=90.,
           leaf_font_size=12., show_contracted=True)
You now must define a new function, to improve the diagram’s display.
def fancy_dendrogram(*args, **kwargs):
    max_d = kwargs.pop('max_d', None)
    if max_d and 'color_threshold' not in kwargs:
        kwargs['color_threshold'] = max_d
    annotate_above = kwargs.pop('annotate_above', 0)
    ddata = dendrogram(*args, **kwargs)
    if not kwargs.get('no_plot', False):
        plt.title('Hierarchical Clustering Dendrogram (truncated)')
        plt.xlabel('sample index or (cluster size)')
        for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
            x = 0.5 * sum(i[1:3])
            y = d[1]
            if y > annotate_above:
                plt.plot(x, y, 'o', c=c)
                plt.annotate("%.3g" % y, (x, y), xytext=(0, -5),
                             textcoords='offset points',
                             va='top', ha='center')
        if max_d:
            plt.axhline(y=max_d, c='k')
    return ddata
You can now use the new function against your clusters.
fancy_dendrogram(Z, truncate_mode='lastp', p=12, leaf_rotation=90.,
                 leaf_font_size=12., show_contracted=True,
                 annotate_above=10)  # useful in small plots so annotations don't overlap
Let’s set the cutoff to 50.
max_d = 50
Now, you just replot the new data.
fancy_dendrogram(Z, truncate_mode='lastp', p=12, leaf_rotation=90.,
                 leaf_font_size=12., show_contracted=True,
                 annotate_above=10, max_d=max_d)
Change the cut to 16.
max_d = 16
fancy_dendrogram(Z, truncate_mode='lastp', p=12, leaf_rotation=90.,
                 leaf_font_size=12., show_contracted=True,
                 annotate_above=10, max_d=max_d)
Now, you add extra functions to investigate your transformation.
from scipy.cluster.hierarchy import inconsistent
You can investigate at a depth of five.
depth = 5
incons = inconsistent(Z, depth)
What are you seeing?
Move to depth of three.
depth = 3
incons = inconsistent(Z, depth)
What do you see? You will see it better with a graph.
last = Z[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)
You should now look at the acceleration.
acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
k = acceleration_rev.argmax() + 2 # if idx 0 is the max of this we want 2 clusters
print ("Clusters:", k)
c = np.random.multivariate_normal([40, 40], [[20, 1], [1, 30]], size=[200,])
d = np.random.multivariate_normal([80, 80], [[30, 1], [1, 30]], size=[200,])
e = np.random.multivariate_normal([0, 100], [[100, 1], [1, 100]], size=[200,])
X2 = np.concatenate((X, c, d, e),)
plt.scatter(X2[:,0], X2[:,1])
Can you see the clusters? Here is a proper cluster diagram:
Z2 = linkage(X2, 'ward')
fancy_dendrogram(Z2, truncate_mode='lastp', p=30, leaf_rotation=90.,
                 leaf_font_size=12., show_contracted=True, annotate_above=40)
Well done. You can now see the clusters.
Let’s look at the data in more detail.
last = Z2[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)
You can now perform more analysis.
acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
k = acceleration_rev.argmax() + 2 # if idx 0 is the max of this we want 2 clusters
print ("Clusters:", k)
print (inconsistent(Z2, 5)[-10:])
Let’s extract flat clusters with fcluster.
from scipy.cluster.hierarchy import fcluster
max_d = 50
clusters = fcluster(Z, max_d, criterion='distance')
Can you see the clusters?
fcluster(Z, k, criterion='maxclust')
And now can you spot the clusters? It is not that easy to spot the clusters, is it? Let’s try a different angle.
from scipy.cluster.hierarchy import fcluster
fcluster(Z, 8, depth=10)
plt.figure(figsize=(10, 8))
plt.scatter(X[:,0], X[:,1], c=clusters, cmap='winter')

You can see them now!

Note Visualize insights. People understand graphics much more easily. Columned data sets require explanation and additional contextual information. In blog 11, I will discuss how to report your insights in an easy but effective manner.


You have successfully completed hierarchical clustering. This should enable you to understand that there are numerous subsets of clusters that can be combined to form bigger cluster structures.


Partitional Clustering

A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters), such that each data object is in exactly one subset. Remember when you were at school?


During breaks, when you played games, you could only belong to either the blue team or the red team. If you forgot which team was yours, the game normally ended in disaster!
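The classic partitional technique is k-means, which the following minimal sketch demonstrates on two generated "teams" of points (the data and cluster count are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
team_blue = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
team_red = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
X = np.vstack([team_blue, team_red])

# Each point lands in exactly one of k clusters; no overlaps are possible.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print('Cluster sizes:', np.bincount(km.labels_))
```

With well-separated groups like these, k-means recovers the two original teams exactly.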


Open your Python editor, and let’s perform a transformation to demonstrate how you can create a proper partitional clustering solution. As always, you will require the ecosystem.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
You can generate some sample data, as follows:
centers = [[4, 4], [-4, -4], [4, -4],[6,0],[0,0]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.5, random_state=0)
Let’s apply a scaler transform.
X = StandardScaler().fit_transform(X)
You can now apply the DBSCAN transformation.
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
Select the number of clusters in labels. You can ignore noise, if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
Here are your findings:
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))
You can also plot it. Remember: Graphics explains better.
import matplotlib.pyplot as plt
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black is used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)

If you see this layout, well done. You can perform partitional clustering. This type of clustering is commonly used, as it places any data entry in distinct clusters without overlay.


Open a new file in the Python editor and try the following location cluster. Set up the ecosystem.

import sys
import os
from math import radians, cos, sin, asin, sqrt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import pandas as pd


Remember this distance measures function? It calculates the great circle distance between two points on the earth, when specified in decimal degrees.

def haversine(lonlat1, lonlat2):
    # convert decimal degrees to radians
    lat1, lon1 = lonlat1
    lat2, lon2 = lonlat2
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of earth in kilometers. Use 3956 for miles.
    return c * r
Let’s get the location data.
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
sFileName=Base + '/01-Vermeulen/00-RawData/IP_DATA_ALL.csv'
print('Loading :',sFileName)
scolumns=['Latitude','Longitude']
RawData = pd.read_csv(sFileName,usecols=scolumns,header=0,low_memory=False)
You now need a sample of 1000 data points.
X = RawData.sample(n=1000)
Now calculate the cluster distance parameter.
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))
Next, apply the clustering.
db = DBSCAN(eps=0.2, min_samples=2, metric='precomputed', algorithm='auto')
y_db = db.fit_predict(distance_matrix)
X['cluster'] = y_db
C = X.cluster.unique()
Let’s visualize the data.
fig=plt.figure(1, figsize=(20, 20))
plt.title('Estimated number of clusters: %d' % len(C))
plt.scatter(X['Latitude'], X['Longitude'], c=X['cluster'],marker='D')
Now save your results.
sImageFile = Base + '/01-Vermeulen/04-Transform/01-EDS/02-Python/Location_

Note The latent Dirichlet allocation (LDA), as discussed earlier in this blog, would be classed as a clustering algorithm and should be mentioned among other clustering algorithms. You have completed the partitional clustering solution.



ANOVA

ANOVA

The one-way analysis of variance (ANOVA) test is used to determine whether the means of more than two groups of data sets differ significantly from one another.



A BOGOF (buy-one-get-one-free) campaign is executed on 5 groups of 100 customers each. Each group is different in terms of its demographic attributes.


We would like to determine whether these five respond differently to the campaign. This would help us optimize the right campaign for the right demographic group, increase the response rate, and reduce the cost of the campaign.


The analysis of variance works by comparing the variance between the groups to that within the group. The core of this technique lies in assessing whether all the groups are in fact part of one larger population or a completely different population with different characteristics.
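The example that follows uses the PlantGrowth.csv file. If you want a self-contained version first, the same between-versus-within comparison can be run on synthetic campaign data (the group means and sizes here are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
# Five demographic groups of 100 customers; one group responds more strongly.
groups = [rng.normal(loc=10, scale=2, size=100) for _ in range(4)]
groups.append(rng.normal(loc=12, scale=2, size=100))

# One-way ANOVA: are all five group means plausibly from one population?
F, p = stats.f_oneway(*groups)
print('F = %.3f, p = %.5f' % (F, p))
# A small p-value indicates at least one group responds differently.
```

Because the fifth group was generated with a higher mean, the test rejects the hypothesis that all five are part of one larger population.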


Open your Python editor and set up the following ecosystem:

import pandas as pd
datafile='../VKHCG/01-Vermeulen/00-RawData/PlantGrowth.csv'
data = pd.read_csv(datafile)
Now you must create a boxplot against the data.
data.boxplot('weight', by='group', figsize=(12, 8))
You must now perform feature extraction and engineering.
ctrl = data['weight'][data.group == 'ctrl']
grps = pd.unique(data.group.values)
d_data = {grp:data['weight'][data.group == grp] for grp in grps}
k = len(pd.unique(data.group))  # number of conditions
N = len(data.values)  # conditions times participants
n = data.groupby('group').size()[0]  # Participants in each condition
You now need extra functions from an extra library:
from scipy import stats


Now activate the one-way ANOVA test for the null hypothesis that two or more groups have the same population mean transformation.

F, p = stats.f_oneway(d_data['ctrl'], d_data['trt1'], d_data['trt2'])
You need to set up a few parameters.
DFbetween = k - 1
DFwithin = N - k
DFtotal = N - 1
You can now further investigate the results from the transformation.
SSbetween = (sum(data.groupby('group').sum()['weight']**2)/n) \
    - (data['weight'].sum()**2)/N
sum_y_squared = sum([value**2 for value in data['weight'].values])
SSwithin = sum_y_squared - sum(data.groupby('group').sum()['weight']**2)/n
SStotal = sum_y_squared - (data['weight'].sum()**2)/N
MSbetween = SSbetween/DFbetween
MSwithin = SSwithin/DFwithin
F = MSbetween/MSwithin
eta_sqrd = SSbetween/SStotal
omega_sqrd = (SSbetween - (DFbetween * MSwithin))/(SStotal + MSwithin)
Here are the results of your investigation:
print('F-statistic: %.3f' % F)
print('Eta-squared: %.3f' % eta_sqrd)
print('Omega-squared: %.3f' % omega_sqrd)
Well done. You have performed a successful one-way ANOVA test.


Principal Component Analysis (PCA)

Principal Component Analysis

Dimension (variable) reduction techniques aim to reduce a data set of higher dimension to one of lower dimension, without losing the information conveyed by the data set. The dimension here can be conceived as the number of variables the data set contains.

Two commonly used variable reduction techniques follow.


Factor Analysis

Factor analysis explains the correlations among the observed variables in terms of a smaller number of unobserved latent factors.


Principal Component Analysis

The crux of PCA lies in measuring the data from the perspective of a principal component. A principal component of a data set is the direction with the largest variance.


A PCA analysis involves rotating the axis of each variable to the highest Eigenvector/Eigenvalue pair and defining the principal components, i.e., the highest variance axis or, in other words, the direction that most defines the data. Principal components are uncorrelated and orthogonal.
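You can verify this eigenvector view directly: the first principal component found by scikit-learn matches the top eigenvector of the covariance matrix, up to sign. A NumPy-only sketch on generated data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.multivariate_normal([0, 0], [[3, 1], [1, 2]], size=500)

# Eigen-decompose the covariance of the centered data.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc.T))
first_pc = eigvecs[:, np.argmax(eigvals)]  # direction of largest variance

pca = PCA(n_components=1).fit(X)
# The two directions agree up to sign, so the absolute dot product is ~1.
agreement = abs(np.dot(first_pc, pca.components_[0]))
print('Alignment with sklearn PCA: %.4f' % agreement)
```

The sign ambiguity is expected: an eigenvector and its negation define the same axis of variance.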


PCA is fundamentally a dimensionality reduction algorithm, but it is just as useful as a tool for visualization, for noise filtering, for feature extraction, and engineering. Here is an example.


Open your Python editor and set up this ecosystem:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.decomposition import PCA
Import some data to apply your skills against.
iris = datasets.load_iris()
Take only the first two features.
X = iris.data[:, :2]
You need the following target:
y = iris.target
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
You can quickly visualize the data set.
plt.figure(2, figsize=(8, 6))
Now plot the training points.
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
I suggest getting a better understanding of the interaction of the dimensions. Plot the first three PCA dimensions.
fig = plt.figure(1, figsize=(8, 6))
ax = Axes3D(fig, elev=-150, azim=110)
X_reduced = PCA(n_components=3).fit_transform(iris.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y,
           cmap=plt.cm.Set1, edgecolor='k', s=40)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.set_ylabel("2nd eigenvector")
ax.set_zlabel("3rd eigenvector")

Can you see the advantage of reducing the features in a controlled manner? Let’s deploy your new knowledge against a more complex data set.



Hillman Ltd has introduced a CCTV tracking process on the main gates, as it suspects unauthorized offloading of goods is happening through its harbor port’s warehouse. You will load a set of digits from the cameras, to assist the company to read the numbers on the side of the containers that left the warehouse.


Open your Python editor and set up this ecosystem:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np


Load the digits from the preprocessed image software.

digits = load_digits()
You now apply the transformation to project from 64 to 2 dimensions.
pca = PCA(2)
projected = pca.fit_transform(digits.data)
Display your findings.
plt.figure(figsize=(10, 15))
plt.scatter(projected[:, 0], projected[:, 1], c=digits.target,
            edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('nipy_spectral_r', 10))
plt.xlabel('Value 1')
plt.ylabel('Value 2')
Apply the PCA transform.
pca = PCA().fit(digits.data)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')


You will require the following function to plot the digits:

def plot_digits(data):
    fig, axes = plt.subplots(4, 10, figsize=(10, 4),
                             subplot_kw={'xticks':[], 'yticks':[]},
                             gridspec_kw=dict(hspace=0.1, wspace=0.1))
    for i, ax in enumerate(axes.flat):
        ax.imshow(data[i].reshape(8, 8),
                  cmap='binary', interpolation='nearest', clim=(0, 16))
You can now plot results.
plot_digits(digits.data)
One of the cameras, however, has not generated perfect images.
noisy = np.random.normal(digits.data, 4)
You can still use the data; it just needs more processing.
pca = PCA(0.50).fit(noisy)
You can determine the distortion by filtering the noise factors.
components = pca.transform(noisy)
filtered = pca.inverse_transform(components)
Visualize your results:
plot_digits(filtered)


Question Can you see the digits from the cameras?

Do you think we have a workable solution?

You have now proven that you can take a complex 64-dimension data set, and using PCA, reduce it to 2 dimensions and still achieve good data science results. Good progress on your side.


Conjoint Analysis

Conjoint analysis is widely used in market research to identify customers’ preference for various attributes that make up a product. The attributes can be various features, such as size, color, usability, price, etc.


Using conjoint (trade-off) analysis, brand managers can identify which features customers would trade off for a certain price point. Thus, such analysis is a highly used technique in new product design or pricing strategies.



The data is a ranking of three different features (TV size, TV type, TV color).

  • TV size options are 42", 47", or 60".
  • TV type options are LCD or Plasma.
  • TV color options are red, blue, or pink.

The data rates the different stimuli types for each customer. You are tasked with determining which TVs to display on Krennwallner’s billboards.


Open your Python editor and set up the following ecosystem:

import numpy as np
import pandas as pd
Retrieve the customer buy choices.
sFile='C:/VKHCG/01-Vermeulen/00-RawData/BuyChoices.txt'
caInputeDF = pd.read_csv(sFile, sep = ";")
You can display the choices.
print(caInputeDF)


You need a new data structure, and you must create dummy variables for every stimulus. In total, you require 9 different stimuli and 18 different combinations.

ConjointDummyDF = pd.DataFrame(np.zeros((18,9)), columns=["Rank","A1", "A2", "A3","B1","B2","C1", "C2","C3"])
You now need to feature engineer the data choices into the new structure:
ConjointDummyDF.Rank = caInputeDF.Rank
for index, row in caInputeDF.iterrows():
    stimuli1 = str(caInputeDF["Stimulus"].iloc[index][:2])
    stimuli2 = str(caInputeDF["Stimulus"].iloc[index][2:4])
    stimuli3 = str(caInputeDF["Stimulus"].iloc[index][4:6])
    stimuliLine=[stimuli1,stimuli2,stimuli3]
    Columns=ConjointDummyDF.columns
    for stimuli in stimuliLine:
        for i in range(len(Columns)):
            if stimuli == Columns[i]:
                ConjointDummyDF.iloc[index, i] = 1
Well done. Let’s look at your new structure.
You only have unknown numbers. I suggest you add suitable stimulus names. Be mindful of the Transform step you normally present to businesspeople. They understand "Plasma TV" far better than "TVType3."
fullNames = {"Rank":"Rank", "A1":"42\" (106cm)", "A2":"47\" (120cm)",
             "A3":"60\" (152cm)", "B1":"Plasma", "B2":"LCD",
             "C1":"Blue", "C2":"Red", "C3":"Pink"}
ConjointDummyDF.rename(columns=fullNames, inplace=True)
That’s better. Now look at the data structure.
Next, you estimate the main effects with a linear regression.
You need extra libraries to achieve this action.
import statsmodels.api as sm
You can select any of these columns for your model.
I suggest you select these:
X = ConjointDummyDF[[u'42" (106cm)', u'47" (120cm)',\
u'60" (152cm)', u'Plasma', u'LCD', u'Red', u'Blue', u'Pink']]
You now need to set up the model.
X = sm.add_constant(X)
Y = ConjointDummyDF.Rank
You now activate the model against your new data structure.
linearRegression = sm.OLS(Y, X).fit()
You can now look at the results.
print(linearRegression.summary())


There are numerous indicators that you can research. I am not going to cover any of these in detail, as several require complex mathematics to explain them fully.

The indicators we are interested in are the part worth values and relative importance of the stimuli:

Importance of Stimuli = Max(beta) - Min(beta)


Relative Importance of Stimuli = Importance of Stimuli / Sum(Importance of All Stimuli)
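As a tiny numeric illustration of these two formulas, with invented part-worth (beta) values for each feature's levels:

```python
# Hypothetical part-worth values per feature level, purely for illustration.
betas = {
    'Size': [0.5, 1.2, 2.9],    # 42", 47", 60"
    'Type': [-0.4, 0.6],        # Plasma, LCD
    'Color': [0.1, 0.8, -0.9],  # Blue, Red, Pink
}

# Importance of a feature = max(beta) - min(beta) over its levels.
importance = {f: max(b) - min(b) for f, b in betas.items()}

# Relative importance = each feature's share of the total importance.
total = sum(importance.values())
relative = {f: round(100 * v / total, 1) for f, v in importance.items()}
print(importance)
print(relative)
```

The relative values sum to roughly 100%, which makes them easy to present to the business community as a share of the purchase decision.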

You will now investigate these indicators. To do this, you need some data engineering. You now require several basic indicator variables to start the process.

importance = []
relative_importance = []
rangePerFeature = []
begin = "A"
tempRange = []
You now need to load the stimuli.
for stimuli in fullNames.keys():
    if stimuli[0] == begin:
        tempRange.append(linearRegression.params[fullNames[stimuli]])
    elif stimuli == "Rank":
        continue
    else:
        rangePerFeature.append(tempRange)
        begin = stimuli[0]
        tempRange = [linearRegression.params[fullNames[stimuli]]]
rangePerFeature.append(tempRange)
Then, you need the feature ranges.
for item in rangePerFeature:
    importance.append(max(item) - min(item))
You then calculate the importance of a feature.
for item in importance:
    relative_importance.append(100 * round(item/sum(importance), 3))
Start a base data structure for the part worth values.
partworths = []
item_levels = [1,3,5,8]
You now calculate the part worth values.
for i in range(1,4):
    part_worth_range = linearRegression.params[item_levels[i-1]:item_levels[i]]
    print(part_worth_range)
You need to determine the mean rank of the data set.
meanRank = []
for i in ConjointDummyDF.columns[1:]:
    newmeanRank = ConjointDummyDF["Rank"].loc[ConjointDummyDF[i] == 1].mean()
    meanRank.append(newmeanRank)
I suggest you use the total mean, known as the “basic utility,” as the “zero alternative.”
totalMeanRank = sum(meanRank) / len(meanRank)
You will now rank the value of each part of the stimuli, i.e., what features of the TV are important?
partWorths = {}
for i in range(len(meanRank)):
    name = fullNames[sorted(fullNames.keys())[i]]
    partWorths[name] = meanRank[i] - totalMeanRank
You now have a result set.
Now, I will help you to develop the results into insights and deliver the basis for a summary and results report.
print("Relative Importance of Feature:\n\nMonitor Size:", relative_importance[0], "%",
      "\nType of Monitor:", relative_importance[1], "%",
      "\nColor of TV:", relative_importance[2], "%\n\n")
print("--"*30)
print("Importance of Feature:\n\nMonitor Size:", importance[0],
      "\nType of Monitor:", importance[1],
      "\nColor of TV:", importance[2])
So, what is the optimal product bundle? I suggest 60", LCD, Red. Do you agree? Let’s test that assumed fact.
optBundle = [1,0,0,1,0,1,0,1,0]
print ("The best possible Combination of Stimuli would have the highest rank:",linearRegression.predict(optBundle)[0])
optimalWorth = partWorths["60\" (152cm)"] + partWorths["LCD"] + partWorths["Red"]
print ("Choosing the optimal Combination brings the Customer an extra ", optimalWorth, "'units' of utility")

Does your science support my assumed fact? Congratulations, you have just completed a conjoint analysis.


Tip I suggest you try the equivalent analysis with other data sets, as this is a common question I am asked. What makes people pick one product and not another? You should practice with data sets of all sizes, to understand how the formulas react to changes in the stimuli factors.

I have confidence that there is a future for you at Krennwallner as a data scientist.


Decision Trees

Decision Trees

Decision trees, as the name suggests, are a tree-shaped visual representation of routes you can follow to reach a particular decision, by laying down all options and their probability of occurrence. Decision trees are exceptionally easy to understand and interpret.


At each node of the tree, one can interpret what would be the consequence of selecting that node or option. The series of decisions lead you to the end result.
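Because every node is readable, a fitted tree can be printed as the routes it encodes. A minimal sketch on the iris data set (my own illustration; export_text lays out each decision route from root to leaf):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each printed branch is one route of decisions to an end result.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Reading the output top to bottom traces exactly the "series of decisions" described above.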


Before you start the example, I must discuss a common add-on algorithm to decision trees called AdaBoost. AdaBoost, short for “adaptive boosting,” is a machine-learning meta-algorithm.


The classifier is a meta-estimator because it begins by fitting a classifier on the original data set and then fits additional copies of the classifier on the same data set, but where the weights of incorrectly classified instances are adjusted, such that subsequent classifiers focus more on difficult cases.


It boosts the learning impact of less clear differences in a specific variable, by adding a progressive weight to boost the impact. See the scikit-learn documentation for AdaBoostClassifier.


This boosting enables the data scientist to force decisions down unclear decision routes for specific data entities, to enhance the outcome.


Example of a Decision Tree using AdaBoost:

Open your Python editor and set up this ecosystem:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
You need a data set. So, you can build a random data set.
rng = np.random.RandomState(1)
X = np.linspace(0, 6, 1000)[:, np.newaxis]
y = np.sin(X).ravel() + np.sin(6 * X).ravel() + rng.normal(0, 0.1, X.shape[0])
You then fit the normal regression model to the data set.
regr_1 = DecisionTreeRegressor(max_depth=4)
You then also apply an AdaBoost model to the data set. The parameters can be changed, if you want to experiment.
regr_2 = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4), n_estimators=300, random_state=rng)
You then train the model.
regr_1.fit(X, y)
regr_2.fit(X, y)
You activate the predict.
y_1 = regr_1.predict(X)
y_2 = regr_2.predict(X)
You plot the results.
plt.figure(figsize=(15, 10))
plt.scatter(X, y, c="k", label="Training Samples")
plt.plot(X, y_1, c="g", label="n_estimators=1", linewidth=2)
plt.plot(X, y_2, c="r", label="n_estimators=300", linewidth=2)
plt.legend()
plt.title("Boosted Decision Tree Regression")

Congratulations! You just built a decision tree. You can now test your new skills against a more complex example.


Once more, open your Python editor and set up this ecosystem:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_gaussian_quantiles
You are building a more complex data set.
X1, y1 = make_gaussian_quantiles(cov=2.,
n_samples=2000, n_features=2,
n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5,
n_samples=3000, n_features=2,
n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, - y2 + 1))
You now have a data set ready. Create and fit an AdaBoosted decision tree to the data.
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME",
                         n_estimators=200)
You train the model.
bdt.fit(X, y)
You now need to set up a few parameters that you will require.
plot_colors = "br"
plot_step = 0.02
class_names = "AB"
You now create a visualization of the results.
plt.figure(figsize=(10, 5))
Add the plot for the decision boundaries.
plt.subplot(121)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
np.arange(y_min, y_max, plot_step))
Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
Plot your training points.
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1], c=c, cmap=plt.cm.Paired,
                s=20, edgecolor='k', label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right')
plt.title('Decision Boundary')
Plot the two-class decision scores.
plt.subplot(122)
twoclass_output = bdt.decision_function(X)
plot_range = (twoclass_output.min(), twoclass_output.max())
for i, n, c in zip(range(2), class_names, plot_colors):
    plt.hist(twoclass_output[y == i],
             bins=10, range=plot_range,
             facecolor=c, label='Class %s' % n,
             alpha=.5, edgecolor='k')
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, y1, y2 * 1.2))
plt.legend(loc='upper right')
plt.title('Decision Scores')

Well done, you have just completed a complex decision tree.


Support Vector Machines, Networks, Clusters, and Grids

Support Vector Machines

The support vector machine (SVM) constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space, which can be used for classification and regression.


The support vector network (SVN) daisy chains more than one SVM together to form a network. All the data flows through the same series of SVMs.


The support vector cluster (SVC) runs SVM on different clusters of the data in parallel. Hence, not all data flows through all the SVMs.


The support vector grid (SVG) is an SVC of an SVN or an SVN of an SVC. This solution is the most likely configuration you will develop at a customer site. It uses SVMs to handle smaller clusters of the data, to apply specific transform steps. As a beginner data scientist, you only need to note that they exist.


Support Vector Machines

A support vector machine is a discriminative classifier formally defined by a separating hyperplane. The method calculates an optimal hyperplane with a maximum margin, to ensure it classifies the data set into separate clusters of data points.


I will guide you through a sample SVM. Open your Python editor and create this ecosystem:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm
Here is your data set and targets.
X = np.c_[(.4, -.7),
(-1.5, -1),
(-1.4, -.9),
(-1.3, -1.2),
(-1.1, -.2),
(-1.2, -.4),
(-.5, 1.2),
(-1.5, 2.1),
(1, 1),
(1.3, .8),
(1.2, .5),
(.2, -2),
(.5, -2.4),
(.2, -2.3),
(0, -2.7),
(1.3, 2.1)].T
Y = [0] * 8 + [1] * 8


You have several kernels you can use to fit the data model. I will take you through all of them. 

fignum = 1
for kernel in ('linear', 'poly', 'rbf'):
    clf = svm.SVC(kernel=kernel, gamma=2)
    clf.fit(X, Y)


You now plot the line, the points, and the nearest vectors to the plane.

    plt.figure(fignum, figsize=(8, 6))
    plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=80,
                facecolors='none', zorder=10, edgecolors='k')
    plt.scatter(X[:, 0], X[:, 1], c=Y, zorder=10, cmap=plt.cm.Paired,
                edgecolor='k', s=20)
    x_min = -3
    x_max = 3
    y_min = -3
    y_max = 3
    XX, YY = np.mgrid[x_min:x_max:200j, y_min:y_max:200j]
    Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])

You now apply the result into a color plot.

    Z = Z.reshape(XX.shape)
    plt.pcolormesh(XX, YY, Z > 0, cmap=plt.cm.Paired)
    plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
                levels=[-.5, 0, .5])
    plt.xlim(x_min, x_max)
    plt.ylim(y_min, y_max)
    fignum = fignum + 1

You now show your plots.

plt.show()

Well done. You have just completed three different types of SVMs.


Support Vector Networks

The support vector network is an ensemble of support vector machines that together classify the same data set, by using different parameters or even different kernels. This is a common method of performing feature engineering, by creating a chain of SVMs.


Tip You can change kernels on the same flow to expose new features. Practice against the data sets with the different kernels, to understand what they give you.
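To make the daisy-chain idea concrete, here is a minimal sketch of my own (not the book's code, and the data is synthetic): the decision score of a first, RBF-kernel SVM is appended as an extra feature for a second, linear SVM, so all the data flows through both machines in series.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Build a small labeled data set.
X, y = make_classification(n_samples=200, n_features=4, random_state=1)

# Stage 1: an RBF-kernel SVM produces a decision score per sample.
svm1 = SVC(kernel='rbf', gamma=2).fit(X, y)
score1 = svm1.decision_function(X).reshape(-1, 1)

# Stage 2: a linear SVM consumes the original features plus the
# stage-1 score, so the data flows through both machines in series.
X2 = np.hstack([X, score1])
svm2 = SVC(kernel='linear').fit(X2, y)

accuracy = svm2.score(X2, y)
```

In practice, you would fit the chain on training data and score held-out data; this sketch only shows how one machine's output becomes the next machine's input.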


Support Vector Clustering

Support vector clustering is used where the data points are classified into clusters, with support vector machines performing the classification at the cluster level.


This is commonly used in high dimensional data sets, where the clustering creates a grouping that can then be exploited by the SVM to subdivide the data points, using different kernels and other parameters.
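As an illustration of the cluster-level idea, here is a sketch of my own (the data, cluster count, and kernel are assumptions): the data points are first grouped into clusters, and then a separate SVM is trained per cluster, so not all data flows through all the SVMs.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=1)

# Step 1: group the data points into clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Step 2: train one SVM per cluster; each SVM only ever
# sees its own cluster's data points.
models = {}
for c in np.unique(km.labels_):
    mask = km.labels_ == c
    if len(np.unique(y[mask])) > 1:      # an SVM needs both classes present
        models[c] = SVC(kernel='rbf', gamma=2).fit(X[mask], y[mask])
```

To classify a new point, you would first assign it to its nearest cluster and then apply that cluster's SVM.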


I have seen SVC, SVN, SVM, and SVG processes in many of the deep-learning algorithms that I work with every day. The volume, variety, and velocity of the data require that the deep learning perform multistage classifications, to enable more detailed analysis of the data points after a primary result is published.


Data Mining

Data mining is processing data to pinpoint patterns and establish relationships between data entities. Here are a small number of critical data-mining theories you need, to understand data patterns and be successful with data mining.


Association Patterns

This involves detecting patterns in which one occasion is associated with another. If, for example, a loading bay door is opened, it is fair to assume that a truck is loading goods. Pattern associations simply discover the correlation of occasions in the data. You will use some core statistical skills for this processing.


Warning “Correlation does not imply causation.”

Correlation is only a relationship or indication of behavior between two data sets.

The relationship is not a cause-driven action.



If you discover a relationship between hot weather and ice cream sales, it does not mean high ice cream sales cause hot weather or vice versa. It is only an observed relationship. This is commonly used in retail basket analysis and recommender systems. I will guide you through an example now.


Please open your Python editor and create this ecosystem:

import pandas as pd
df1 = pd.DataFrame({'A': range(8), 'B': [2*i for i in range(8)]})
df2 = pd.DataFrame({'A': range(8), 'B': [-2*i for i in range(8)]})
Here is your data.
print('Positive Data Set')
print(df1)
print('Negative Data Set')
print(df2)
Here are your results.
print('Correlation Positive:', df1['A'].corr(df1['B']))
print('Correlation Negative:', df2['A'].corr(df2['B']))

You should see a correlation of either +1 or -1. A value of +1 means a 100% positive correlation: the two values change at the same rate. A value of -1 means a 100% negative correlation: the two values still have a relationship, but one increases as the other decreases.


Tip In real-world data sets, extremes such as +1 and -1 rarely occur. The range is normally -1 < C < +1.


You will now apply changes that will interfere with the correlation.

df1.loc[2, 'B'] = 10
df2.loc[2, 'B'] = -10
You can now see the impact.
print('Positive Data Set')
print(df1)
print('Negative Data Set')
print(df2)
print('Correlation Positive:', df1['A'].corr(df1['B']))
print('Correlation Negative:', df2['A'].corr(df2['B']))
Can you sense the impact a minor change has on the model?
So, let’s add a bigger change.
df1.loc[3, 'B'] = 100
df2.loc[3, 'B'] = -100
You check the impact.
print('Positive Data Set')
print(df1)
print('Negative Data Set')
print(df2)
print('Correlation Positive:', df1['A'].corr(df1['B']))
print('Correlation Negative:', df2['A'].corr(df2['B']))

Well done, if you understood the changes that interfere with the relationship. If you see the relationship, you have achieved an understanding of association patterns.


These “What if?” analyses are common among data scientists’ daily tasks. For example, one such analysis might relate to the following: What happens when I remove 300 of my 1000 staff?


And at what point does the removal of staff have an impact? These small-step increases of simulated impact can be used to simulate progressive planned changes in the future.


Classification Patterns

This technique discovers new patterns in the data, to enhance the quality of the complete data set. Data classification is the process of consolidating data into categories, for its most effective and efficient use by the data processing.


For example, if the data is related to the shipping department, you must then augment a label on the data that states that fact.


A carefully planned data-classification system creates vital data structures that are easy to find and retrieve. You do not want to scour your complete data lake to find data every time you want to analyze a new data pattern.
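A minimal sketch of such labeling, using a hypothetical pandas data set (the column names and department labels are illustrative only, not from the data lake):

```python
import pandas as pd

# Hypothetical records; 'source' is the originating business system.
df = pd.DataFrame({
    'record_id': [1, 2, 3, 4],
    'source': ['shipping', 'billing', 'shipping', 'billing'],
    'amount': [120.0, 45.5, 300.0, 80.0],
})

# Augment a department label so later steps can retrieve the data
# by category instead of scanning the complete data lake.
df['department'] = df['source'].map({
    'shipping': 'Shipping Department',
    'billing': 'Finance Department',
})

shipping_only = df[df['department'] == 'Shipping Department']
```

The label column now acts as the classification structure: any later analysis can filter on it directly.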


Clustering Patterns

Clustering is the discovery and labeling of groups of specifics not previously known. An example of clustering: if your customers buy bread and milk together on a Monday night, you group, or cluster, these customers as “start-of-the-week small-size shoppers,” simply by looking at their typical basic shopping basket.


Any combination of variables that you can use to cluster data entries into a specific group can be viewed as some form of clustering. For data scientists, the following clustering types are beneficial to master.


Connectivity-Based Clustering

You can discover the interaction between data items by studying the connections between them. This process is sometimes also described as hierarchical clustering.
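A short sketch of hierarchical clustering, using scikit-learn's AgglomerativeClustering on a tiny illustrative data set (my own example): the closest points are merged first, building the hierarchy bottom-up.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Six points forming two well-separated groups.
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Hierarchical (connectivity-based) clustering merges the closest
# points first, so each tight group ends up in its own cluster.
model = AgglomerativeClustering(n_clusters=2).fit(X)
labels = model.labels_
```
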


Centroid-Based Clustering (K-Means Clustering)

“Centroid-based” describes the cluster as a relationship between data entries and a virtual center point in the data set. K-means clustering is the most popular centroid-based clustering algorithm. I will guide you through an example.


Open your Python editor and set up this ecosystem:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn import datasets
You now set up a data set to study.
iris = datasets.load_iris()
X = iris.data
y = iris.target
Configure your estimators for the clustering.
estimators = [('k_means_iris_8', KMeans(n_clusters=8)),
              ('k_means_iris_3', KMeans(n_clusters=3)),
              ('k_means_iris_bad_init', KMeans(n_clusters=3, n_init=1,
                                               init='random'))]
Get ready to virtualize your results.
fignum = 1
titles = ['8 clusters', '3 clusters', '3 clusters, bad initialization']
for name, est in estimators:
    fig = plt.figure(fignum, figsize=(4, 3))
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    est.fit(X)
    labels = est.labels_
    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float), edgecolor='k')
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    ax.set_title(titles[fignum - 1])
    ax.dist = 12
    fignum = fignum + 1
Plot your results.
fig = plt.figure(fignum, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
for name, label in [('Setosa', 0),
                    ('Versicolour', 1),
                    ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean(),
              X[y == label, 2].mean() + 2, name,
              bbox=dict(alpha=.2, edgecolor='w', facecolor='w'))
Reorder the labels to have colors matching the cluster results.
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor='k')
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title('Ground Truth')
ax.dist = 12
plt.show()

Well done. You should see a set of results for your hard work. Can you see how centroid-based clustering uses the relationship between the data points and the cluster centers to solve the transformation steps?


Distribution-Based Clustering

This type of clustering model relates data sets to statistical distribution models. Data entries are grouped by how likely they are to belong to the same statistical distribution, for example, a Gaussian mixture model.


Density-Based Clustering

In density-based clustering, areas of higher density are separated from the remainder of the data set. Data entries in the sparse areas between clusters are considered noise, outliers, or border data entries. The most widespread density-based clustering technique is DBSCAN.
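A minimal DBSCAN sketch on an illustrative data set (the eps and min_samples values are assumptions chosen for this toy data): the two dense groups become clusters, and the isolated point in the sparse area is labeled as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point in a sparse area.
X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
              [4.0, 4.0], [4.1, 4.1], [4.0, 4.2],
              [10.0, 10.0]])

db = DBSCAN(eps=0.5, min_samples=2).fit(X)
labels = db.labels_   # -1 marks noise (the sparse-area point)
```
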


Grid-Based Method

Grid-based approaches are common for mining large multidimensional space clusters having denser regions than their surroundings. The grid-based clustering approach differs from the conventional clustering algorithms in that it does not use the data points but a value space that surrounds the data points.
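You can sketch the grid-based idea with plain NumPy: partition the value space into cells, count the points per cell, and keep the cells denser than a threshold. The data, grid size, and threshold here are illustrative assumptions.

```python
import numpy as np

# Synthetic points: a dense patch near (1, 1) inside a sparse field.
rng = np.random.RandomState(1)
sparse = rng.uniform(0, 10, size=(50, 2))
dense = rng.normal(loc=(1.0, 1.0), scale=0.1, size=(200, 2))
X = np.vstack([sparse, dense])

# Partition the value space into a 10x10 grid and count points per
# cell; cells denser than a threshold form the cluster regions.
counts, xedges, yedges = np.histogram2d(X[:, 0], X[:, 1],
                                        bins=10, range=[[0, 10], [0, 10]])
dense_cells = np.argwhere(counts > 20)
```

Note that the clustering operates on the grid cells (the value space), not on the individual data points, which is what makes the approach scale to large data sets.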


Bayesian Classification

Naive Bayes (NB) classifiers are a group of probabilistic classifiers established by applying Bayes’s theorem with strong independence assumptions between the features of the data set. There is one more specific Bayesian classification you must take note of, and it is called tree augmented naive Bayes (TAN).


Tree augmented naive Bayes is a semi-naive Bayesian learning method. It relaxes the naive Bayes attribute independence assumption by assigning a tree structure, in which each attribute only depends on the class and one other attribute.


A maximum weighted spanning tree that maximizes the likelihood of the training data is used to perform classification.


The naive Bayesian (NB) classifier and the tree augmented naive Bayes (TAN) classifier are well-known models that will be discussed next. Here is an example:


Open your Python editor and start with this ecosystem.

import numpy as np
import urllib.request
Load data via the web interface.
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
Download the file.
raw_data = urllib.request.urlopen(url)
Load the CSV file into a numpy matrix.
dataset = np.loadtxt(raw_data, delimiter=",")
Separate the data from the target attributes in the data set.
X = dataset[:,0:8]
y = dataset[:,8]
Add extra processing capacity.
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X, y)
Produce predictions.
expected = y
predicted = model.predict(X)


# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

You have just built your naive Bayes classifiers. Good progress!


Sequence or Path Analysis

This identifies patterns in which one event leads to another, later resulting in insights into the business. Path analysis is a chain of consecutive events that a given business entity performs during a set period, to understand behavior, in order to gain actionable insights into the data.


I suggest you use a combination of tools to handle this type of analysis. I normally model the sequence or path with the help of a graph database, or, for smaller projects, I use a library called networkx in Python.



Your local telecommunications company is interested in understanding the reasons or flow of events that resulted in people churning their telephone plans to their competitor.

Open your Python editor and set up this ecosystem.

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
import sqlite3 as sq
import networkx as nx
import datetime
pd.options.mode.chained_assignment = None
if sys.platform == 'linux':
    Base = os.path.expanduser('~') + '/VKHCG'
else:
    Base = 'C:/VKHCG'
print('################################')
print('Working Base :', Base, ' using ', sys.platform)
print('################################')
Company = '01-Vermeulen'
sDataBaseDir = Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName = sDataBaseDir + '/Vermeulen.db'
conn = sq.connect(sDatabaseName)
You must create a new graph to track the individual paths of each customer through their journey over the last nine months.
G = nx.Graph()
The following loop structure enables you to compare two sequential months and determine the changes:
for M in range(1, 10):
    print('Month: ', M)
    MIn = str(M - 1)
    MOut = str(M)
    sFile0 = 'ConnectionsChurn' + MIn + '.csv'
    sFile1 = 'ConnectionsChurn' + MOut + '.csv'
    sTable0 = 'ConnectionsChurn' + MIn
    sTable1 = 'ConnectionsChurn' + MOut
    sFileName0 = Base + '/' + Company + '/00-RawData/' + sFile0
    ChurnData0 = pd.read_csv(sFileName0, header=0, low_memory=False,
                             encoding="latin-1")
    sFileName1 = Base + '/' + Company + '/00-RawData/' + sFile1
    ChurnData1 = pd.read_csv(sFileName1, header=0, low_memory=False,
                             encoding="latin-1")
Owing to an error during extraction, the file dates are not correct, so you perform a data-quality correction.
    dt1 = datetime.datetime(year=2017, month=1, day=1)
    dt2 = datetime.datetime(year=2017, month=2, day=1)
    ChurnData0['Date'] = dt1.strftime('%Y/%m/%d')
    ChurnData1['Date'] = dt2.strftime('%Y/%m/%d')
You now compare all the relevant features of the customer, to see what actions were taken during the month under investigation.
    TrackColumns = ['SeniorCitizen', 'Partner', 'Dependents', 'PhoneService',
                    'MultipleLines', 'OnlineSecurity', 'OnlineBackup',
                    'DeviceProtection', 'StreamingTV', 'PaperlessBilling',
                    'StreamingMovies', 'InternetService', 'Contract',
                    'PaymentMethod', 'MonthlyCharges']
    for i in range(ChurnData0.shape[0]):
        t = 0
        for TColumn in TrackColumns:
            if ChurnData0[TColumn][i] == 'No':
                t += 1
        if t > 4:
            ChurnData0['Churn'][i] = 'Yes'
        else:
            ChurnData0['Churn'][i] = 'No'
    for i in range(ChurnData1.shape[0]):
        t = 0
        for TColumn in TrackColumns:
            if ChurnData1[TColumn][i] == 'No':
                t += 1
        if t > 4:
            ChurnData1['Churn'][i] = 'Yes'
        else:
            ChurnData1['Churn'][i] = 'No'
    print('Store CSV Data')
    ChurnData0.to_csv(sFileName0, index=False)
    ChurnData1.to_csv(sFileName1, index=False)
    print('Store SQLite Data')
    ChurnData0.to_sql(sTable0, conn, if_exists='replace')
    ChurnData1.to_sql(sTable1, conn, if_exists='replace')
    for TColumn in TrackColumns:
        for i in range(ChurnData0.shape[0]):
You always start with a “Root” node and then attach the customers. Then you connect each change in status and perform a complete path analysis.
            Node0 = 'Root'
            Node1 = '(' + ChurnData0['customerID'][i] + '-Start)'
            G.add_edge(Node0, Node1)
            Node5 = '(' + ChurnData0['customerID'][i] + '-Stop)'
            if ChurnData0['Churn'][i] == 'Yes':
                NodeA = '(' + ChurnData0['customerID'][i] + '-Start)'
                NodeB = '(' + ChurnData0['customerID'][i] + '-Stop)'
                if nx.has_path(G, source=NodeA, target=NodeB) == False:
                    NodeC = '(' + ChurnData0['customerID'][i] + \
                            '): (Churn)=>(' + ChurnData1['Churn'][i] + ')'
                    G.add_edge(NodeA, NodeC)
                    G.add_edge(NodeC, NodeB)
            if ChurnData0[TColumn][i] != ChurnData1[TColumn][i]:
                Node2 = '(' + ChurnData0['customerID'][i] + \
                        ')-(' + ChurnData0['Date'][i] + ')'
                G.add_edge(Node1, Node2)
                Node3 = Node2 + '-(' + TColumn + ')'
                G.add_edge(Node2, Node3)
                Node4 = Node3 + ':(' + str(ChurnData0[TColumn][i]) + \
                        ')=>(' + str(ChurnData1[TColumn][i]) + ')'
                G.add_edge(Node3, Node4)
                if M == 9:
                    Node6 = '(' + ChurnData0['customerID'][i] + \
                            '): (Churn)=>(' + ChurnData1['Churn'][i] + ')'
                    G.add_edge(Node4, Node6)
                    G.add_edge(Node6, Node5)
                else:
                    G.add_edge(Node4, Node5)
You can use these lines to investigate the nodes and the edges you created.
for n in G.nodes():
    print('Node:', n)
for e in G.edges():
    print('Edge:', e)
You must now store your graph for future use.
sGraphOutput=Base + '/' + Company + \
    '/04-Transform/01-EDS/02-Python/Transform_ConnectionsChurn.gml'
nx.write_gml(G, sGraphOutput)
You now investigate the paths taken by a customer over the nine months and produce an output of all the steps taken by the customer over the period.
sFile0 = 'ConnectionsChurn9.csv'
sFileName0=Base + '/' + Company + '/00-RawData/' + sFile0
ChurnData0=pd.read_csv(sFileName0,header=0,low_memory=False, encoding="latin-1")
c = 0
for i in range(ChurnData0.shape[0]):
    sCustomer = ChurnData0['customerID'][i]
    NodeX = '(' + ChurnData0['customerID'][i] + '-Start)'
    NodeY = '(' + ChurnData0['customerID'][i] + '-Stop)'
    if nx.has_path(G, source=NodeX, target=NodeY) == False:
        NodeZ = '(' + ChurnData0['customerID'][i] + \
                '):(Churn)=>(' + ChurnData0['Churn'][i] + ')'
        G.add_edge(NodeX, NodeZ)
        G.add_edge(NodeZ, NodeY)
    if nx.has_path(G, source=NodeX, target=NodeY) == True:
This function enables you to expose all the paths between the two nodes you created for each customer.
        pset = nx.all_shortest_paths(G, source=NodeX, target=NodeY)
        for p in pset:
            ps = 'Path: ' + str(p)
            t = 0
            for s in p:
                t += 1
                ts = 'Step: ' + str(t)
                # print(NodeX, NodeY, ps, ts, s)
                c += 1
                if c == 1:
                    pl = [[sCustomer, ps, ts, s]]
                else:
                    pl.append([sCustomer, ps, ts, s])
You now store the path analysis results into a CSV for later use.
sFileOutput=Base + '/' + Company + \
    '/04-Transform/01-EDS/02-Python/Transform_ConnectionsChurn.csv'
df = pd.DataFrame(pl, columns=['Customer', 'Path', 'Step', 'StepName'])
df.index.name = 'RecID'
df.to_csv(sFileOutput, index=True)
sTable = 'ConnectionsChurnPaths'
df.to_sql(sTable, conn, if_exists='replace')
print('### Done!! ############################################')
Well done. You have just completed your first path analysis over a period of nine months.



Can you understand why the routes are different for each customer but still have the same outcome? Can you explain why one customer churns and another does not?


The common activity you will spot is that customers begin to remove services from your client as they start to churn. As they enable their new services with the competitor, they no longer support your client's services 100%.


Customers with fewer than five services normally churn, as by then they are likely already with the other telecommunications company.


If you are advising the client, I suggest you highlight that the optimum trigger indicating a customer is about to churn is a configuration change that includes fewer services from your client.


You can still prevent a churn if you intervene before the customer drops below the five-service minimum.


Well done. You can now perform Transform steps for path analysis, get a simple list of nodes and the edges (relationships) between them, and model a graph and ask pertinent questions.



Forecasting

This technique is used to discover patterns in data that result in practical predictions about a future result, as indicated by predictive analytics of future probabilities and trends. We have been performing forecasting at several points in the blog.
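A minimal forecasting sketch (the sales figures are illustrative, not from the data lake): fit a straight-line trend to past values and project it one step forward.

```python
import numpy as np

# Twelve months of illustrative sales figures with a steady upward trend.
months = np.arange(12)
sales = 100 + 5 * months + np.array([1.0, -2.0, 0.5, 1.5, -1.0, 0.0,
                                     2.0, -0.5, 1.0, -1.5, 0.5, 0.0])

# Fit a straight-line trend and project it one month ahead.
slope, intercept = np.polyfit(months, sales, deg=1)
forecast_month_12 = slope * 12 + intercept
```

Real forecasting work would validate the trend model against held-out periods and consider seasonality, but the principle is the same: a pattern fitted on the past is projected into the future.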


Pattern Recognition

Pattern recognition identifies regularities and irregularities in data sets. The most common application of this is in text analysis, to find complex patterns in the data.


I will guide you through an example of text extraction. The example will extract text files from a common 20 newsgroups data set and then create categories of text found together in the same document. This will provide you with the most common words used in the newsgroups.


Open your Python editor and set up this ecosystem:

from pprint import pprint
from time import time
import logging
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
Modify your logging, to display progress logs on stdout (standard output).
logging.basicConfig(level=logging.INFO,
format='%(asctime)s %(levelname)s %(message)s')
You can now load your categories from the training set.
categories = [
    'alt.atheism',
    'talk.religion.misc',
]

You could use this for all the categories, but I would perform it as a later experiment, as it slows down your processing, by using larger volumes of data from your data lake.

#categories = None
You can now investigate what you loaded.
print("Loading 20 newsgroups dataset for categories:")
You must now load the training data.
data = fetch_20newsgroups(subset='train', categories=categories)
print("%d documents" % len(data.filenames))
print("%d categories" % len(data.target_names))
You now have to define a pipeline, combining a text feature extractor with a simple classifier.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    #'vect__max_features': (None, 5000, 10000, 50000),
    'vect__ngram_range': ((1, 1), (1, 2)),
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    #'clf__n_iter': (10, 50, 80),
}
You can now build the main processing engine.
if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    t0 = time()
    grid_search.fit(data.data, data.target)
    print("done in %0.3fs" % (time() - t0))
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

When you execute the program, you will have successfully completed a text extraction. The data sources for this type of extract are typically documents, e-mail, Twitter, or note fields in databases. Any text source can receive a transform step.


Machine Learning

The business world is bursting with activities and philosophies about machine learning and its application to various business environments. Machine learning is the capability of systems to learn without explicit software development. It evolved from the study of pattern recognition and computational learning theory.


The impact is that with the appropriate processing and skills, you can amplify your own data capabilities, by training a processing environment to accomplish massive amounts of the discovery of data into actionable knowledge, while you have a cup of coffee, for example.


This skill is an essential part of achieving major gains in shortening the data-to-knowledge cycle.


 I will cover a limited rudimentary theory, but machine learning encompasses a wide area of expertise that merits a blog by itself. So, I will introduce you only to the core theories.


Supervised Learning

Supervised learning is the machine-learning task of inferring a function from labeled training data. The training data consists of a set of training examples. In supervised learning, each example is a pair consisting of an input object and the desired output value. You use this when you know the required outcome for a set of input features.



If you buy bread and jam, you can make a jam sandwich. Without either, you have no jam sandwich. If you investigate this data set, you can easily spot from the indicators what is bread and what is jam.


A data science model could perform the same task, using supervised learning. If you investigate the bigger data set stored in the directory ..\VKHCG\03-Hillman\00-RawData in file KitchenData.csv, you can see how the preceding knowledge can be used to infer the correct selection of items that will result in a jam sandwich.
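Before turning to the KitchenData.csv data set, here is a minimal supervised-learning sketch of the jam-sandwich example (my own illustration, not the book's code): the input features and desired output labels are known up front, and the model infers the rule from the labeled pairs.

```python
from sklearn.tree import DecisionTreeClassifier

# Labeled training pairs: [has_bread, has_jam] -> can make a jam sandwich.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 0, 0, 0]

# The model learns the rule from the labeled examples.
model = DecisionTreeClassifier(random_state=0).fit(X, y)
prediction = model.predict([[1, 1]])[0]
```
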


Unsupervised Learning

Unsupervised learning is the machine-learning task of inferring a function to describe hidden structures from unlabeled data. This encompasses many other techniques that seek to summarize and explain key features of the data.



You can take a bag of marbles with different colors, sizes, and materials and split it into three equal groups, by applying a set of features and a model. You do not know up front what the criteria are for splitting the marbles.
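A sketch of the marbles example with unlabeled data (the features and group centers are illustrative assumptions): the model is told only how many groups to find, not what defines them.

```python
import numpy as np
from sklearn.cluster import KMeans

# Marbles described by two unlabeled features: size (mm) and weight (g).
rng = np.random.RandomState(0)
small = rng.normal((10, 5), 0.5, size=(30, 2))
medium = rng.normal((16, 12), 0.5, size=(30, 2))
large = rng.normal((25, 30), 0.5, size=(30, 2))
marbles = np.vstack([small, medium, large])

# No labels are given; the model discovers the three groups on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(marbles)
group_sizes = np.bincount(km.labels_)
```
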


Reinforcement Learning

Reinforcement learning (RL) is an area of machine learning, inspired by behavioral psychology, that is concerned with how software agents should take actions in an environment, so as to maximize some notion of cumulative reward.


This is used in several different areas, such as game theory, swarm intelligence, control theory, operations research, simulation-based optimization, multi-agent systems, statistics, and genetic algorithms.


The process is simple. Your agent extracts features from the environment that are either “state” or “reward.” State features indicate that something has happened.


Reward features indicate that something happened that has improved or worsened the perceived gain in reward. The agent uses the state and reward to determine actions to change the environment.


This process of extracting state and reward, plus responding with action, continues until a pre-agreed end reward is achieved. The real-world applications for this type of reinforcement learning are endless. You can apply reinforcement learning to any environment that you can control with an agent.
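The state-reward-action loop can be sketched with a tiny tabular Q-learning agent (my own illustration; the corridor environment, learning rate, and reward values are assumptions): the agent repeatedly observes its state, takes an action, receives a reward, and updates its value estimates until it learns the path to the rewarding end state.

```python
import random

random.seed(1)

# A one-dimensional corridor: states 0..4, reward only at state 4.
n_states, n_actions = 5, 2          # actions: 0 = left, 1 = right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(200):
    state = 0
    while state != 4:
        # Epsilon-greedy: mostly exploit, sometimes explore.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = 0 if Q[state][0] >= Q[state][1] else 1
        next_state = max(0, state - 1) if action == 0 else state + 1
        reward = 1.0 if next_state == 4 else 0.0
        # Update the value estimate from the observed state and reward.
        Q[state][action] += alpha * (
            reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state
```

After training, the "right" action carries the higher value in every state, so the greedy policy walks straight to the reward.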


I build many RL systems that monitor processes, such as a sorting system of purchases or assembly of products. It is also the core of most robot projects, as robots are physical agents that can interact with the environment.


I also build many “soft-robots” that take decisions on such data processing as approval of loans, payments of money, and fixing of data errors.


Bagging Data

Bootstrap aggregating, also called bagging, is a machine-learning ensemble meta-algorithm that aims to advance the stability and accuracy of machine-learning algorithms used in statistical classification and regression. It decreases variance and helps systems avoid overfitting.


I want to cover this concept, as I have seen many data science solutions over the last years that suffer from overfitting because they were trained with a known data set that eventually became the only data set they could process.


Thanks to inefficient processing and algorithms, we naturally had leeway for variance in the data.


The new GPU (graphics processing unit)-based systems are so accurate that they overfit easily, if the training data is a consistent set, with little or no major changes in the patterns within the data set.


You will now see how to perform a simple bagging process. Open your Python editor and create this ecosystem:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
You need a few select settings.
n_repeat = 100 # Number of iterations for processing
n_train = 100 # Size of the Training Data set
n_test = 10000 # Size of the Test Data set
noise = 0.1 # Standard deviation of the noise introduced
np.random.seed(0)
You will select two estimators to compare.
estimators = [("Tree", DecisionTreeRegressor()),
              ("Bagging(Tree)", BaggingRegressor(DecisionTreeRegressor()))]
n_estimators = len(estimators)
You will need a set of data to perform the bagging against, so generate a random data set.
def f(x):
    x = x.ravel()
    return np.exp(-x ** 2) - 2 * np.exp(-(x - 2) ** 2)
You can experiment with other data configurations, if you want to see the process working.
You need to create a function to add the noise to the data.
def generate(n_samples, noise, n_repeat=1):
    X = np.random.rand(n_samples) * 10 - 5
    X = np.sort(X)
    if n_repeat == 1:
        y = f(X) + np.random.normal(0.0, noise, n_samples)
    else:
        y = np.zeros((n_samples, n_repeat))
        for i in range(n_repeat):
            y[:, i] = f(X) + np.random.normal(0.0, noise, n_samples)
    X = X.reshape((n_samples, 1))
    return X, y
You can now train the system using these Transform steps.
X_train = []
y_train = []
You train the system with the bagging data set, by taking a sample each cycle. This exposes the model to a more diverse data-training spread of the data.
for i in range(n_repeat):
    X, y = generate(n_samples=n_train, noise=noise)
    X_train.append(X)
    y_train.append(y)
You can now test your models.
X_test, y_test = generate(n_samples=n_test, noise=noise, n_repeat=n_repeat)
You can now loop over estimators to compare the results, by computing your predictions.
for n, (name, estimator) in enumerate(estimators):
    y_predict = np.zeros((n_test, n_repeat))
    for i in range(n_repeat):
        estimator.fit(X_train[i], y_train[i])
        y_predict[:, i] = estimator.predict(X_test)
    # Bias^2 + Variance + Noise decomposition of the mean squared error
    y_error = np.zeros(n_test)
    for i in range(n_repeat):
        for j in range(n_repeat):
            y_error += (y_test[:, j] - y_predict[:, i]) ** 2
    y_error /= (n_repeat * n_repeat)
    y_noise = np.var(y_test, axis=1)
    y_bias = (f(X_test) - np.mean(y_predict, axis=1)) ** 2
    y_var = np.var(y_predict, axis=1)
You can now display your results.
    print("{0}: {1:.4f} (error) = {2:.4f} (bias^2) "
          " + {3:.4f} (var) + {4:.4f} (noise)".format(name, np.mean(y_error),
          np.mean(y_bias), np.mean(y_var), np.mean(y_noise)))
You can now plot your results.
    plt.subplot(2, n_estimators, n + 1)
    plt.plot(X_test, f(X_test), "b", label="$f(x)$")
    plt.plot(X_train[0], y_train[0], ".b", label="LS ~ $y = f(x)+noise$")
    for i in range(n_repeat):
        if i == 0:
            plt.plot(X_test, y_predict[:, i], "r", label="$\^y(x)$")
        else:
            plt.plot(X_test, y_predict[:, i], "r", alpha=0.05)
    plt.plot(X_test, np.mean(y_predict, axis=1), "c",
             label="$\mathbb{E}_{LS} \^y(x)$")
    plt.xlim([-5, 5])
    plt.title(name)
    if n == 0:
        plt.legend(loc="upper left", prop={"size": 11})
    plt.subplot(2, n_estimators, n_estimators + n + 1)
    plt.plot(X_test, y_error, "r", label="$error(x)$")
    plt.plot(X_test, y_bias, "b", label="$bias^2(x)$")
    plt.plot(X_test, y_var, "g", label="$variance(x)$")
    plt.plot(X_test, y_noise, "c", label="$noise(x)$")
    plt.xlim([-5, 5])
    plt.ylim([0, 0.1])
    if n == 0:
        plt.legend(loc="upper left", prop={"size": 11})
Display your hard work!
plt.show()


Well done. You have completed your bagging example.

Remember: Bagging enables the training engine to train against different sets of the data you expect it to process. This reduces the impact of outliers and extremes on the data model. So, remember that you took a model and trained it against several training sets sampled from the same population of data.


Random Forests

Random forests, or random decision forests, are an ensemble learning method for classification and regression that works by constructing a multitude of decision trees at training time and outputting the mode of the classes (classification) or the mean prediction (regression) of the individual trees.


Random decision forests correct for decision trees’ habit of overfitting to their training set.


The result is an aggregation of all the trees’ results, by performing a majority vote against the range of results. So, if five trees return three yeses and two nos, it passes a yes out of the Transform step. 


Sometimes, this is also called tree bagging, as you take the bagging concept to the next level: not only training the model on a range of samples from the data population but performing the complete process on each data bag and then aggregating the results.
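The majority vote itself takes only a couple of lines. A sketch with hypothetical outputs from the five trees mentioned above:

```python
from collections import Counter

# Hypothetical votes from five trees in the forest.
tree_votes = ['yes', 'yes', 'no', 'yes', 'no']

# The aggregated answer is the mode of the individual results.
verdict = Counter(tree_votes).most_common(1)[0][0]
print(verdict)  # three yeses beat two nos
```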


Let me guide you through this process. Open your Python editor and prepare the following ecosystem:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import sys
import os
import datetime as dt
import calendar as cal
Set up the data location and load the data.
if sys.platform == 'linux':
    Base = os.path.expanduser('~') + '/VKHCG'
else:
    Base = 'C:/VKHCG'
print('Working Base :', Base, ' using ', sys.platform)
basedate = dt.datetime(2018,1,1,0,0,0)
# Company and InputFileName identify your data set; set them before this step.
InputFile = Base + '/' + Company + '/03-Process/01-EDS/02-Python/' + InputFileName
ShareRawData = pd.read_csv(InputFile,
                           usecols=['Open','Close','UnitsOwn'],
                           low_memory=False)
You must perform some preprocessing to reveal features in the data.
ShareRawData.index.names = ['ID']
ShareRawData['nRow'] = ShareRawData.index
ShareRawData['TradeDate'] = ShareRawData.apply(lambda row:
    (basedate - dt.timedelta(days=row['nRow'])), axis=1)
ShareRawData['WeekDayName'] = ShareRawData.apply(lambda row:
    (cal.day_name[row['TradeDate'].weekday()]), axis=1)
ShareRawData['WeekDayNum'] = ShareRawData.apply(lambda row:
    (row['TradeDate'].weekday()), axis=1)
ShareRawData['sTarget'] = ShareRawData.apply(lambda row:
    'true' if row['Open'] < row['Close'] else 'false', axis=1)
Here is your data set:
print(ShareRawData.head())
Select a data frame with the feature variables and the target.
sColumns = ['Open','Close','UnitsOwn','WeekDayNum']
df = pd.DataFrame(ShareRawData, columns=sColumns)
Let’s look at the top-five rows.
print(df.head())
You need to select the target column.
df2 = pd.DataFrame(df, columns=['WeekDayNum'])
You must select a training data set.
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
Now create two new data frames, one with the training rows and the other with the test rows.
train, test = df[df['is_train']==True], df[df['is_train']==False]
Here is the number of observations for the test and training data frames:
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:',len(test))
Start processing the data by creating a list of the feature columns’ names.
features = df.columns[:3]
Display your features.
print(features)
You must factorize your target to use the model I selected.
y = pd.factorize(train['WeekDayNum'])[0]
You can now view the target values.
print(y)
You now train the random forest classifier.
Create a random forest classifier.
clf = RandomForestClassifier(n_jobs=2, random_state=0)
You now train the classifier to take the training features and learn how they relate to the training y (weekday number).[features], y)
Now apply the classifier to the test data. This action is called “scoring.”
preds = clf.predict(test[features])
You can look at the predicted probabilities of the first ten observations.
print(clf.predict_proba(test[features])[0:10])
Evaluate the classifier. Is it any good?
Look at the PREDICTED week day number for the first ten observations.
print('PREDICTED Week Day Number:', preds[0:10])
Look at the ACTUAL week day number for the first ten observations.
print('ACTUAL Week Day Number:', test['WeekDayNum'].head(10).values)
I suggest you create a confusion matrix.
c = pd.crosstab(test['WeekDayNum'], preds,
                rownames=['Actual Week Day Number'],
                colnames=['Predicted Week Day Number'])
print(c)
You can also look at a list of the features and their importance scores.
print(list(zip(train[features], clf.feature_importances_)))


You have completed the Transform steps for a random forest solution. At this point, I want to explain an additional aspect of random forests: the daisy-chaining of a series of random forests to create a solution.


I have found these to become more popular over the last two years, as solutions become more demanding and data sets become larger. The same principles apply; you are simply repeating them several times in a chain.


Computer Vision (CV)

Computer vision is a complex feature extraction area, but once you have the features exposed, it simply becomes a matrix of values.

Open your Python editor and enter this quick example:

import matplotlib.pyplot as plt
from PIL import Image
import numpy as np
imageIn =  # the image file path was not captured here; point this at your picture
fig1=plt.figure(figsize=(10, 10))
fig1.suptitle('Audi R8', fontsize=20)
imgplot = plt.imshow(imageIn)
You should see a car.
imagewidth, imageheight = imageIn.size
imageMatrix = np.asarray(imageIn)
pixelscnt = (imagewidth * imageheight)
print('Pixels:', pixelscnt)
print('Size:', imagewidth, ' x ', imageheight)


This is what your computer sees!

You have achieved computer vision. Remember how I showed you that movies consist of several frames? Each frame becomes an entry in a matrix, and now you can make your data science “see.”
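A quick sketch of that idea: stack a few frames (here, random pixel matrices standing in for real footage) into one array, and the "movie" becomes a three-dimensional matrix of values.

```python
import numpy as np

# Three 4x4 "frames" of 8-bit pixel values stand in for a movie clip.
rng = np.random.default_rng(0)
frames = [rng.integers(0, 256, size=(4, 4), dtype=np.uint8) for _ in range(3)]

# Stacking the frames yields one pixel matrix per time step.
movie = np.stack(frames)
print(movie.shape)  # (3, 4, 4): frames, height, width
```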


Natural Language Processing (NLP)

Natural language processing is the area in data science that investigates the process we as humans use to communicate with each other. This covers mainly written and spoken words that form bigger concepts. Your data science is aimed at intercepting or interacting with humans, to react to the natural language.


There are two clear requirements in natural language processing. First is the direct interaction with humans, such as when you speak to your smartphone, and it responds with an appropriate answer. For example, you request “phone home,” and the phone calls the number set as “home.”


The second type of interaction is taking the detailed content of the interaction and understanding its context and relationship with other text or recorded information.


Examples of these are news reports that are examined to find common trends among different reports. This is a study of the natural language’s meaning, not simply a response to a human interaction.



If you want to process text, you must set up an ecosystem to perform the basic text processing.

I recommend that you use the nltk library (conda install -c anaconda nltk).


Open your Python editor and set up your ecosystem. You then require the base data.

import nltk

You will see a program that enables you to download several text libraries, which will assist the process to perform text analysis against any text you submit for analysis. The basic principle is that the library matches your text against the text stored in the data libraries and will return the correct matching text analysis.


Open your Python editor and create the following ecosystem, to enable you to investigate this library:

from nltk.tokenize import sent_tokenize, word_tokenize
Txt = "Good Day Mr. Vermeulen, \
how are you doing today? \
The weather is great, and Data Science is awesome. \
You are doing well!"
print('Identify sentences')
print(sent_tokenize(Txt))
print('Identify words')
print(word_tokenize(Txt))



There is a major demand for speech-to-text conversion, to extract features. I suggest looking at the SpeechRecognition library on PyPI (


You can install it by using conda install -c conda-forge speechrecognition.

 This transform area is highly specialized, and I will not provide any further details on this subject.


Neural Networks

Neural networks (also known as artificial neural networks) are inspired by the human nervous system. They simulate how complex information is absorbed and processed by the human system. Just like humans, neural networks learn by example and are configured to a specific application.


Neural networks are used to find patterns in extremely complex data and, thus, deliver forecasts and classify data points. Neural networks are usually organized in layers. Layers are made up of a number of interconnected “nodes.”


Patterns (features) are presented to the network via the “input layer,” which communicates to one or more “hidden layers,” where the actual processing is done. The hidden layers then link to an “output layer,” where the answer is output.
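That input → hidden → output flow can be sketched directly in NumPy, with made-up weights (the full worked example later in this section learns such weights from data instead):

```python
import numpy as np

x = np.array([0.5, -0.2])              # input layer: two features
W1 = np.array([[0.1, 0.4, -0.3],
               [0.2, -0.5, 0.6]])      # connections into three hidden nodes
b1 = np.zeros(3)
W2 = np.array([[0.3], [-0.2], [0.7]])  # connections into one output node
b2 = np.zeros(1)

hidden = np.tanh( + b1)   # hidden layer: the actual processing
output = + b2          # output layer: the answer
print(output.shape)  # (1,)
```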


Let me offer an example of feature development for neural networks. Suppose you have to select three colors for your dog’s new ball: blue, yellow, or pink. The features would be

•\ Is the ball blue? Yes/No

•\ Is the ball yellow? Yes/No

•\ Is the ball pink? Yes/No


When you feed these to the neural network, it will use them as simple 0 or 1 values, and this is what neural networks really excel at solving.
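The three yes/no questions amount to one-hot encoding the ball's color. A tiny sketch:

```python
colors = ['blue', 'yellow', 'pink']

def ball_features(color):
    # One position per question; 1 means "yes", 0 means "no".
    return [1 if color == c else 0 for c in colors]

print(ball_features('yellow'))  # [0, 1, 0]
```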


Unfortunately, the most important feature when buying a ball for a dog is “Does the dog fit under the house?” It took me two hours to retrieve one dog and one black ball from a space I did not fit!


The lesson: You must be willing to change the criteria as you develop the neural network. If you keep the questions simple, you can simply add more questions, or remove questions, to adjust the resulting features.


Note, too, that you can daisy-chain neural networks, and I design such systems on a regular basis. I call it “neural pipelining and clustering,” as I sell it as a service. Before you start your example, you must understand two concepts.


Gradient Descent

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient of the function at the current point.


Open your Python editor and investigate this code:
cur_x = 3 # The algorithm starts at x=3
gamma = 0.01 # step size multiplier
precision = 0.00001
previous_step_size = cur_x
df = lambda x: 4 * x**3 - 9 * x**2
while previous_step_size > precision:
    prev_x = cur_x
    cur_x += -gamma * df(prev_x)
    previous_step_size = abs(cur_x - prev_x)
    print("Current X at %f" % cur_x, " with step at %f" % previous_step_size)
print("The local minimum occurs at %f" % cur_x)


Can you see how the steps shrink as the process closes in on the local minimum? Gradient descent can take many iterations to compute a local minimum to a required accuracy.


This process is useful when you want to find an unknown value that fits your model. In neural networks, it supports adjusting the weights of the hidden layers to a required accuracy.


Regularization Strength

Regularization strength is the parameter that prevents overfitting of the neural network. The parameter enables the neural network to find the set of weights that best generalizes across a data set. In the example that follows, it is set alongside the epsilon parameter, also known as the learning rate.
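To see why the learning rate matters, rerun the earlier gradient descent with two different epsilon values (same gradient function, a fixed number of steps; the specific values here are illustrative):

```python
# Gradient of f(x) = x**4 - 3*x**3, as in the gradient descent example.
df = lambda x: 4 * x**3 - 9 * x**2

def descend(epsilon, start=3.0, steps=100):
    # Take a fixed number of gradient descent steps of size epsilon.
    x = start
    for _ in range(steps):
        x -= epsilon * df(x)
    return x

# The true local minimum is at x = 9/4 = 2.25. The larger learning
# rate gets much closer in the same number of steps.
print(round(descend(0.01), 3), round(descend(0.001), 3))
```

Too large an epsilon can overshoot and diverge, so in practice you experiment with this setting, as the text suggests.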


Simple Neural Network

Now that you understand these basic parameters, I will show you an example of a neural network. You can now open a new Python file in your Python editor.

Let’s build a simple neural network.


Set up the ecosystem.

import numpy as np
from sklearn import datasets, linear_model
import matplotlib.pyplot as plt
You need a visualization procedure.
def plot_decision_boundary(pred_func):
    # Set min and max values and give it some padding
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    # Predict the function value for the whole grid
    Z = pred_func(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    # Plot the contour and training examples
    plt.contourf(xx, yy, Z,
    plt.scatter(X[:, 0], X[:, 1], c=y,
You will generate a data set for the example.
I suggest 200 points, but feel free to increase or decrease as you experiment with your neural network.
X, y = datasets.make_moons(200, noise=0.20)
You can plot the data to see what you generated.
plt.scatter(X[:,0], X[:,1], s=40, c=y,
I suggest we train a logistic regression classifier to feed the features.
clf = linear_model.LogisticRegressionCV(), y)
Now you can plot the decision boundary
plot_decision_boundary(lambda x: clf.predict(x))
plt.title("Logistic Regression")
You now configure the neural network. I kept it simple, with two inputs and two outputs.
num_examples = len(X) # training set size
nn_input_dim = 2 # input layer dimensionality
nn_output_dim = 2 # output layer dimensionality
You set the gradient descent parameters. These drive the speed at which the neural network resolves.
Set the learning rate for gradient descent and the regularization strength; experiment with them, as they drive the transform speed.
epsilon = 0.01 # learning rate for gradient descent
reg_lambda = 0.01 # regularization strength
You must engineer a helper function, to evaluate the total loss on the data set.
def calculate_loss(model):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation to calculate our predictions
    z1 = + b1
    a1 = np.tanh(z1)
    z2 = + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    # Calculating the loss
    correct_logprobs = -np.log(probs[range(num_examples), y])
    data_loss = np.sum(correct_logprobs)
    # Add regularization term to loss (optional)
    data_loss += reg_lambda/2 * (np.sum(np.square(W1)) + np.sum(np.square(W2)))
    return 1./num_examples * data_loss
You also require a helper function, to predict an output (0 or 1).
def predict(model, x):
    W1, b1, W2, b2 = model['W1'], model['b1'], model['W2'], model['b2']
    # Forward propagation
    z1 = + b1
    a1 = np.tanh(z1)
    z2 = + b2
    exp_scores = np.exp(z2)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return np.argmax(probs, axis=1)
Your next function to engineer is central to the neural network. This function learns parameters for the neural network and returns the model.
•\ nn_hdim: Number of nodes in the hidden layer
•\ num_passes: Number of passes through the training data for gradient descent. I suggest 20000, but you can experiment with different sizes.
•\ print_loss: If True, print the loss every 1000 iterations.
def build_model(nn_hdim, num_passes=20000, print_loss=False):
    # Initialize the parameters to random values. We need to learn these.
    np.random.seed(0)
    W1 = np.random.randn(nn_input_dim, nn_hdim) / np.sqrt(nn_input_dim)
    b1 = np.zeros((1, nn_hdim))
    W2 = np.random.randn(nn_hdim, nn_output_dim) / np.sqrt(nn_hdim)
    b2 = np.zeros((1, nn_output_dim))
    # This is what we return at the end
    model = {}
    # Gradient descent. For each batch...
    for i in range(0, num_passes):
        # Forward propagation
        z1 = + b1
        a1 = np.tanh(z1)
        z2 = + b2
        exp_scores = np.exp(z2)
        probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
        # Backpropagation
        delta3 = probs
        delta3[range(num_examples), y] -= 1
        dW2 = (a1.T).dot(delta3)
        db2 = np.sum(delta3, axis=0, keepdims=True)
        delta2 = * (1 - np.power(a1, 2))
        dW1 =, delta2)
        db1 = np.sum(delta2, axis=0)
        # Add regularization terms (b1 and b2 don't have regularization terms)
        dW2 += reg_lambda * W2
        dW1 += reg_lambda * W1
        # Gradient descent parameter update
        W1 += -epsilon * dW1
        b1 += -epsilon * db1
        W2 += -epsilon * dW2
        b2 += -epsilon * db2
        # Assign new parameters to the model
        model = {'W1': W1, 'b1': b1, 'W2': W2, 'b2': b2}
        # Optionally print the loss.
        # This is expensive because it uses the whole dataset,
        # so we don't want to do it too often.
        if print_loss and i % 1000 == 0:
            print("Loss after iteration %i: %f" % (i, calculate_loss(model)))
    return model
You now build the model with a hidden layer of size 3.
model = build_model(3, print_loss=True)
You can now plot the decision boundary.
plot_decision_boundary(lambda x: predict(model, x))
plt.title("Decision Boundary for hidden layer size 3")
You can now visualize what you have achieved.
plt.figure(figsize=(16, 32))
hidden_layer_dimensions = [1, 2, 3, 4, 5, 20, 50]
for i, nn_hdim in enumerate(hidden_layer_dimensions):
    plt.subplot(5, 2, i+1)
    plt.title('Hidden Layer size %d' % nn_hdim)
    model = build_model(nn_hdim, print_loss=True)
    plot_decision_boundary(lambda x: predict(model, x))
You can now build neural networks. Well done.
The preceding is neural networking in its simplest form.



TensorFlow is an open source software library for numerical computation using data-flow graphs.


Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs.


TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google’s machine intelligence research organization, for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains.


I will guide you through a few examples, to demonstrate the capability. I have several installations that use this ecosystem, and it is gaining popularity in the data science communities.


To use it, you will require a library named tensorflow. You can install it using conda install -c conda-forge tensorflow. Details about the library are available at


The next big advantage is the Cloud Tensor Processing Unit (TPU) (, a hardware product specifically designed to calculate tensor processing at better performance levels than standard CPU or GPU hardware. The TPU supports the TensorFlow process with an extremely effective hardware ecosystem.


The TensorFlow Research Cloud provides users with access to these second-generation cloud TPUs, each of which provides 180 teraflops of machine-learning acceleration.


Basic TensorFlow

I will take you through a basic example by explaining, as a starting point, how to convert the following mathematical equation into a TensorFlow graph:

\ a = (b + c) ∗ (c + 2) \
I will calculate the following for you:
•\ b = 2.5
•\ c = 10
Open your Python editor and create the following ecosystem:
import tensorflow as tf
Create a TensorFlow constant.
const = tf.constant(2.0, name="const")
Create TensorFlow variables.
b = tf.Variable(2.5, name='b')
c = tf.Variable(10.0, name='c')
You must now create the operations.
d = tf.add(b, c, name='d')
e = tf.add(c, const, name='e')
a = tf.multiply(d, e, name='a')
Next, set up the variable initialization.
init_op = tf.global_variables_initializer()
You can now start the session.
with tf.Session() as sess:
    # initialise the variables
    # compute the output of the graph
    a_out =
    print("Variable a is {}".format(a_out))


Well done. You have just successfully deployed a TensorFlow solution. I will now guide you through a more advanced example: how to feed a range of values into a TensorFlow graph.

\ a = (b + c) ∗ (c + 22) \

I will calculate for the following:

•\ b = range (-5,-4,-3,-2,-1,0,1,2,3,4,5)

•\ c = 3

Open your Python editor and create the following ecosystem:

import tensorflow as tf
import numpy as np
Create a TensorFlow constant.
const = tf.constant(22.0, name="const")
Now create the TensorFlow variables. Note the placeholder format for variable b, which accepts a range of values.
b = tf.placeholder(tf.float32, [None, 1], name='b')
c = tf.Variable(3.0, name='c')
You will create the required operations next.
d = tf.add(b, c, name='d')
e = tf.add(c, const, name='e')
a = tf.multiply(d, e, name='a')
Start the setup of the variable initialization.
init_op = tf.global_variables_initializer()
Start the session construction.
with tf.Session() as sess:
    # initialise the variables
    # compute the output of the graph
    a_out =, feed_dict={b: np.arange(-5, 5)[:, np.newaxis]})
    print("Variable a is {}".format(a_out))


Did you notice how with minor changes TensorFlow handles larger volumes of data with ease?


The advantage of TensorFlow is the simplicity of the basic building block you use to create it, and the natural graph nature of the data pipelines, which enable you to easily convert data flows from the real world into complex simulations within the TensorFlow ecosystem.


I will use a fun game called the One-Arm Bandits to offer a sample real-world application of this technology. Open your Python editor and create the following ecosystem:

import tensorflow as tf
import numpy as np
Let’s construct a model for your bandits. There are four one-arm bandits, and currently, bandit 4 is set to provide a positive reward most often.
bandits = [0.2,0.0,-0.2,-2.0]
num_bandits = len(bandits)
You must model the bandit by creating a pull-bandit action.
def pullBandit(bandit):
    # Get a random number.
    result = np.random.randn(1)
    if result > bandit:
        # return a positive reward.
        return 1
    else:
        # return a negative reward.
        return -1
Now, you reset the ecosystem.
tf.reset_default_graph()
You need the following two lines to establish the feed-forward part of the network. You perform the actual selection using this formula.
weights = tf.Variable(tf.ones([num_bandits]))
chosen_action = tf.argmax(weights,0)
These next six lines establish the training procedure. You feed the reward and chosen action into the network by computing the loss and using it to update the network.
reward_holder = tf.placeholder(shape=[1],dtype=tf.float32)
action_holder = tf.placeholder(shape=[1],dtype=tf.int32)
responsible_weight = tf.slice(weights,action_holder,[1])
loss = -(tf.log(responsible_weight)*reward_holder)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
update = optimizer.minimize(loss)
Now, you must train the system to perform as a one-arm bandit.
total_episodes = 1000 #Set total number of episodes to train the bandit.
total_reward = np.zeros(num_bandits) #Set scoreboard for bandits to 0.
e = 0.1 #Set the chance of taking a random action.
Initialize the ecosystem now.
init = tf.initialize_all_variables()
Launch the TensorFlow graph processing.
with tf.Session() as sess:
    i = 0
    while i < total_episodes:
        # Choose either a random action or one from our network.
        if np.random.rand(1) < e:
            action = np.random.randint(num_bandits)
        else:
            action =
        reward = pullBandit(bandits[action])
Collect your reward from picking one of the bandits and update the network.
        _, resp, ww =[update, responsible_weight, weights],
            feed_dict={reward_holder: [reward], action_holder: [action]})
Update your running tally of the scores.
        total_reward[action] += reward
        if i % 50 == 0:
            print("Running reward for the " + str(num_bandits) +
                  " bandits: " + str(total_reward))
        i += 1
print("The agent thinks bandit " + str(np.argmax(ww)+1) + " is the most promising....")
if np.argmax(ww) == np.argmax(-np.array(bandits)):
    print("...and it was right!")
else:
    print("...and it was wrong!")


Congratulations! You have a fully functional TensorFlow solution. Can you think of three real-life examples you can model using this ecosystem?