How to Transform data (2019)

The Transform superstep allows you, as a data scientist, to take data from the data vault and formulate answers to questions raised by your investigations. This tutorial explains the Transform step of the data science process, the step that converts results into insights.

 

This tutorial explains how to transform data, with examples in Python, and also covers the steps of data exploration and preparation used when transforming data.

 

Transforming with Data Science

Data exploration and preparation from the data lake into the data vault, and from the data vault into the data warehouse, are discussed in the next section. We will now introduce you to the basic data science needed to transform your data into insights. To gain insights from your data, you must understand a selected set of basic investigation practices.

 

Steps of Data Exploration and Preparation

You must keep detailed notes of what techniques you employed to prepare the data. Make sure you keep your data traceability matrix up to date after each data engineering step has completed.

 

Update your data lineage and data provenance, to ensure that you have both the technical and business details of the entire process. Now, I will take you through a small number of the standard transform checkpoints, to ensure that your data science is complete.
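As a minimal illustration (the structure, column names, and helper function here are hypothetical, not a prescribed format), you could keep the traceability matrix as a small pandas data frame and append a row after every data engineering step:

import pandas as pd
from datetime import datetime

# Hypothetical traceability matrix: one row per completed data engineering step
traceability = pd.DataFrame(columns=['Step', 'Source', 'Target', 'Technique', 'RunDateTime'])

def record_step(step, source, target, technique):
    # Append the technical detail of a completed step to the matrix
    traceability.loc[len(traceability)] = [step, source, target, technique, datetime.now()]

record_step('Missing value treatment', 'Data Lake: IP_DATA_ALL.csv',
            'Data Vault: Hub-Location', 'Median imputation')
print(traceability)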

 

Missing Value Treatment

You must describe in detail what the missing value treatments are for the data lake transformation. Make sure you take your business community with you along the journey.

 

At the end of the process, they must trust your techniques and results. If they trust the process, they will implement the business decisions that you, as a data scientist, aspire to achieve.

 

Why Missing Value Treatment Is Required

Explain with notes on the data traceability matrix why there is missing data in the data lake. Remember:

Every inconsistency in the data lake is conceivably the missing insight your customer is seeking from you as a data scientist. So, find them and explain them. Your customer will exploit them for business value.

 

Why Data Has Missing Values


The 5 Whys technique helps you get to the root cause during your analysis. Cause-and-effect fishbone diagrams will assist you in resolving those questions.

 

I have found the following common reasons for missing data:

  • Data fields renamed during upgrades
  • Migration processes from old systems to new systems where mappings were incomplete
  • Incorrect tables supplied in loading specifications by a subject-matter expert
  • Data simply not recorded, as it was not available
  • Legal reasons, owing to data protection legislation, such as the General Data Protection Regulation (GDPR), resulting in a not-to-process tag on the data entry
  • Someone else’s “bad” data science. People and projects make mistakes, and you will have to fix their errors in your own data science.

 

Warning Ensure that your data science processing is not the reason you are missing data. That is the quickest way to lose your customer’s trust.

 

What Methods Treat Missing Values?

During your progress through the supersteps, you have used many techniques to resolve missing data. Record them in your lineage, but also make sure you collect precisely how each technique applies to the processing flow.
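As a brief, hedged sketch (the data frame and column names are made up purely for illustration), the common pandas treatments look like this; whichever you choose, record it in your lineage:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Age': [34, np.nan, 29, np.nan, 41],
                   'Country': ['IS', 'IS', None, 'DE', 'DE']})

df_drop = df.dropna()                                   # drop rows with any missing value
df_fill_const = df.fillna({'Country': 'Unknown'})       # replace with a constant
df_fill_stat = df.fillna({'Age': df['Age'].median()})   # replace with a statistic
print(df_drop, df_fill_const, df_fill_stat, sep='\n\n')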

 

Techniques of Outlier Detection and Treatment

During the processing, you will have detected several outliers that do not comply with your expected ranges, e.g., you expected “Yes” or “No” but found some “N/A”s, or you expected number ranges between 1 and 10 but got 11, 12, and 13 also. These out-of-order items are the outliers.

 

I suggest you treat them as you treat the missing data. Make sure that your customer agrees with the process, as it will affect the insights you will process and their decisions.
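For instance, here is a minimal sketch (hypothetical column names) of the two cases just described, flagging the out-of-range entries so the business can agree on how to treat them:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Answer': ['Yes', 'No', 'N/A', 'Yes'],
                   'Score': [3, 7, 11, 13]})

# Flag entries outside the expected domains as outliers
df['AnswerOutlier'] = ~df['Answer'].isin(['Yes', 'No'])
df['ScoreOutlier'] = ~df['Score'].between(1, 10)

# One possible treatment: set the outliers to missing and handle them
# with the same missing-value treatment agreed with the customer
df.loc[df['ScoreOutlier'], 'Score'] = np.nan
print(df)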

 

Elliptic Envelope

I will introduce a function called Elliptic Envelope. The basic idea is to assume that a data set is from a known distribution and then evaluate any entries not complying with that assumption. Fitting an elliptic envelope is one of the more common techniques used to detect outliers in a Gaussian distributed data set.

 

The scikit-learn package provides the covariance.EllipticEnvelope object, which fits a robust covariance estimate to the data and thus fits an ellipse to the central data points, ignoring points outside the central mode.

 

For instance, if the inlier data are Gaussian distributed, it will estimate the inlier location and covariance in a robust way (i.e., without being influenced by outliers).

 

The Mahalanobis distances obtained from this estimate are used to derive a measure of outlyingness.

If you want more in-depth details on the function, visit http://scikit-learn.org/stable/modules/generated/sklearn.covariance

 

Example:

# -*- coding: utf-8 -*-
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM
import matplotlib.pyplot as plt
import matplotlib.font_manager
from sklearn.datasets import load_boston
# Get data
X1 = load_boston()['data'][:, [8, 10]]  # two clusters
X2 = load_boston()['data'][:, [5, 12]]  # "banana"-shaped
# Define "classifiers" to be used
classifiers = {
    "Empirical Covariance": EllipticEnvelope(support_fraction=1.,
                                             contamination=0.261),
    "Robust Covariance (Minimum Covariance Determinant)":
        EllipticEnvelope(contamination=0.261),
    "OCSVM": OneClassSVM(nu=0.261, gamma=0.05)}

The classifiers assume you are testing the normal data entries. Let’s look at EllipticEnvelope(support_fraction=1., contamination=0.261).

 

The support_fraction is the portion of the complete population you want to use to determine the border between inliers and outliers. In this case, we use 1, which means 100%.

 

The contamination is the indication of what portion of the population could be outliers, hence, the amount of contamination of the data set, i.e., the proportion of outliers in the data set. In your case, this is set to 0.261 against a possible 0.5, more generally described as 26.1% contamination.

 

The EllipticEnvelope(contamination=0.261) only changes the included population, using the defaults for all settings except contamination, which is set to 26.1%.

 

Third is another type of detection called sklearn.svm.OneClassSVM, which is discussed later in this blog.
colors = ['m', 'g', 'b']
legend1 = {}
legend2 = {}
# Learn a frontier for outlier detection with several classifiers
xx1, yy1 = np.meshgrid(np.linspace(-8, 28, 500), np.linspace(3, 40, 500))
xx2, yy2 = np.meshgrid(np.linspace(3, 10, 500), np.linspace(-5, 45, 500))
for i, (clf_name, clf) in enumerate(classifiers.items()):
    fig1a = plt.figure(1)
    fig1a.set_size_inches(10, 10)
    clf.fit(X1)
    Z1 = clf.decision_function(np.c_[xx1.ravel(), yy1.ravel()])
    Z1 = Z1.reshape(xx1.shape)
    legend1[clf_name] = plt.contour(
        xx1, yy1, Z1, levels=[0], linewidths=2, colors=colors[i])
    plt.figure(2)
    clf.fit(X2)
    Z2 = clf.decision_function(np.c_[xx2.ravel(), yy2.ravel()])
    Z2 = Z2.reshape(xx2.shape)
    legend2[clf_name] = plt.contour(
        xx2, yy2, Z2, levels=[0], linewidths=2, colors=colors[i])
legend1_values_list = list(legend1.values())
legend1_keys_list = list(legend1.keys())
# Plot the results (= shape of the data points cloud)
fig1b = plt.figure(1)  # two clusters
fig1b.set_size_inches(10, 10)
plt.title("Outlier detection on a real data set (boston housing)")
plt.scatter(X1[:, 0], X1[:, 1], color='black')
bbox_args = dict(boxstyle="round", fc="0.8")
arrow_args = dict(arrowstyle="->")
plt.annotate("several confounded points", xy=(24, 19),
             xycoords="data", textcoords="data",
             xytext=(13, 10), bbox=bbox_args, arrowprops=arrow_args)
plt.xlim((xx1.min(), xx1.max()))
plt.ylim((yy1.min(), yy1.max()))
plt.legend((legend1_values_list[0].collections[0],
            legend1_values_list[1].collections[0],
            legend1_values_list[2].collections[0]),
           (legend1_keys_list[0], legend1_keys_list[1], legend1_keys_list[2]),
           loc="upper center",
           prop=matplotlib.font_manager.FontProperties(size=12))
plt.ylabel("accessibility to radial highways")
plt.xlabel("pupil-teacher ratio by town")
legend2_values_list = list(legend2.values())
legend2_keys_list = list(legend2.keys())
fig2a = plt.figure(2)  # "banana" shape
fig2a.set_size_inches(10, 10)
plt.title("Outlier detection on a real data set (boston housing)")
plt.scatter(X2[:, 0], X2[:, 1], color='black')
plt.xlim((xx2.min(), xx2.max()))
plt.ylim((yy2.min(), yy2.max()))
plt.legend((legend2_values_list[0].collections[0],
            legend2_values_list[1].collections[0],
            legend2_values_list[2].collections[0]),
           (legend2_keys_list[0], legend2_keys_list[1], legend2_keys_list[2]),
           loc="upper center",
           prop=matplotlib.font_manager.FontProperties(size=12))
plt.ylabel("% lower status of the population")
plt.xlabel("average number of rooms per dwelling")
plt.show()
There are a few other outlier detection techniques you can investigate.

 

Isolation Forest

One efficient way of performing outlier detection in high-dimensional data sets is to use random forests. The ensemble.IsolationForest tool “isolates” observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

 

Because recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node.

 

This path length, averaged over a forest of such random trees, is a measure of normality and our decision function. Random partitioning produces a noticeably shorter path for anomalies. Hence, when a forest of random trees collectively produces shorter path lengths for particular samples, they are highly likely to be anomalies.

See http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html#sklearn.ensemble.IsolationForest.
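As a minimal sketch of the tool (the data below is synthetic, purely for illustration), you train on a compact cloud of inliers and then ask the forest to score scattered points:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(100, 2)                      # inliers around the origin
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))  # scattered points

clf = IsolationForest(contamination=0.1, random_state=rng)
clf.fit(X_train)
print(clf.predict(X_outliers))   # -1 marks predicted anomalies, 1 marks inliers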

 

Novelty Detection

Novelty detection simply performs an evaluation in which we add one more observation to a data set. Is the new observation so different from the others that we can doubt that it is regular? (I.e., does it come from the same distribution?)

 

Or, on the contrary, is it so similar to the others that we cannot distinguish it from the original observations? This is the question addressed by the novelty detection tools and methods.

 

The sklearn.svm.OneClassSVM tool is a good example of this unsupervised outlier detection technique.

For more information, see http://scikit-learn.org/stable/modules/generated/sklearn.svm.OneClassSVM.html#sklearn.svm.OneClassSVM.
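Here is a minimal sketch of the tool (synthetic data, purely illustrative): train on observations you regard as regular, then query new observations to see whether they look like novelties.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(100, 2) + 2        # "regular" observations

clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
clf.fit(X_train)

X_new = np.array([[2.1, 1.9], [6.0, -6.0]])  # one similar, one very different
print(clf.predict(X_new))                    # 1 = regular, -1 = novelty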

 

Local Outlier Factor

An efficient way to perform outlier detection on moderately high-dimensional data sets is to use the local outlier factor (LOF) algorithm. The neighbors.LocalOutlierFactor algorithm computes a score (called a local outlier factor) reflecting the degree of abnormality of the observations.

 

It measures the local density deviation of a given data point with respect to its neighbors. The idea is to detect the samples that have a substantially lower density than their neighbors.

 

In practice, the local density is obtained from the k-nearest neighbors. The LOF score of an observation is equal to the ratio of the average local density of its k-nearest neighbors and its own local density.

 

A normal instance is expected to have a local density like that of its neighbors, while abnormal data are expected to have a much smaller local density.

 

See http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html#sklearn.neighbors.LocalOutlierFactor for additional information.
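A minimal sketch of the algorithm (synthetic data, purely illustrative) follows; fit_predict labels the points, and the negative LOF scores are available afterwards:

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(42)
X = np.r_[0.3 * rng.randn(100, 2), rng.uniform(low=-4, high=4, size=(10, 2))]

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred = lof.fit_predict(X)                  # -1 for outliers, 1 for inliers
print('Outliers found:', (y_pred == -1).sum())
print('Negated LOF scores (first five):', lof.negative_outlier_factor_[:5])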

 

When the amount of contamination is known, the example below illustrates three different ways of performing outlier detection, alongside a method based on a robust estimator of covariance, which assumes that the data are Gaussian distributed and performs better than the one-class SVM in that case.

 

The first is the one-class SVM, which has the ability to capture the shape of the data set and, hence, performs better when the data is strongly non-Gaussian, i.e., with two well-separated clusters.

 

The second is the isolation forest algorithm, which is based on random forests and, hence, better adapted to large-dimensional settings, even if it performs quite well in the example you will perform next.

 

The third is the local outlier factor to measure the local deviation of a given data point with respect to its neighbors, by comparing their local density.

 

Example:

Here, the underlying truth about inliers and outliers is given by the points’ colors. The orange-filled area indicates which points are reported as inliers by each method. You are assumed to know the fraction of outliers in the data sets, and the following provides an example of what you can achieve.

 

Open your Python editor, set up this ecosystem, and investigate the preceding techniques.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import matplotlib.font_manager
from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
rng = np.random.RandomState(42)
# Your example settings
n_samples = 200
outliers_fraction = 0.25
clusters_separation = [0, 1, 2]
# define two outlier detection tools to be compared
classifiers = {
    "One-Class SVM": svm.OneClassSVM(nu=0.95 * outliers_fraction + 0.05,
                                     kernel="rbf", gamma=0.1),
    "Robust covariance": EllipticEnvelope(contamination=outliers_fraction),
    "Isolation Forest": IsolationForest(max_samples=n_samples,
                                        contamination=outliers_fraction,
                                        random_state=rng),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=35,
        contamination=outliers_fraction)}
# Compare given classifiers under given settings
xx, yy = np.meshgrid(np.linspace(-7, 7, 100), np.linspace(-7, 7, 100))
n_inliers = int((1. - outliers_fraction) * n_samples)
n_outliers = int(outliers_fraction * n_samples)
ground_truth = np.ones(n_samples, dtype=int)
ground_truth[-n_outliers:] = -1
# Fit the problem with varying cluster separation
for i, offset in enumerate(clusters_separation):
    np.random.seed(42)
    # Data generation
    X1 = 0.3 * np.random.randn(n_inliers // 2, 2) - offset
    X2 = 0.3 * np.random.randn(n_inliers // 2, 2) + offset
    X = np.r_[X1, X2]
    # Add outliers
    X = np.r_[X, np.random.uniform(low=-6, high=6, size=(n_outliers, 2))]
    # Fit the model
    plt.figure(figsize=(9, 7))
    for i, (clf_name, clf) in enumerate(classifiers.items()):
        # fit the data and tag outliers
        if clf_name == "Local Outlier Factor":
            y_pred = clf.fit_predict(X)
            scores_pred = clf.negative_outlier_factor_
        else:
            clf.fit(X)
            scores_pred = clf.decision_function(X)
            y_pred = clf.predict(X)
        threshold = stats.scoreatpercentile(scores_pred,
                                            100 * outliers_fraction)
        n_errors = (y_pred != ground_truth).sum()
        # plot the levels lines and the points
        if clf_name == "Local Outlier Factor":
            # decision_function is private for LOF
            Z = clf._decision_function(np.c_[xx.ravel(), yy.ravel()])
        else:
            Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        subplot = plt.subplot(2, 2, i + 1)
        subplot.contourf(xx, yy, Z, levels=np.linspace(Z.min(), threshold, 7),
                         cmap=plt.cm.Blues_r)
        a = subplot.contour(xx, yy, Z, levels=[threshold],
                            linewidths=2, colors='red')
        subplot.contourf(xx, yy, Z, levels=[threshold, Z.max()],
                         colors='orange')
        b = subplot.scatter(X[:-n_outliers, 0], X[:-n_outliers, 1],
                            c='white', s=20, edgecolor='k')
        c = subplot.scatter(X[-n_outliers:, 0], X[-n_outliers:, 1],
                            c='black', s=20, edgecolor='k')
        subplot.axis('tight')
        subplot.legend(
            [a.collections[0], b, c],
            ['learned decision function', 'true inliers', 'true outliers'],
            prop=matplotlib.font_manager.FontProperties(size=10),
            loc='lower right')
        subplot.set_xlabel("%d. %s (errors: %d)" % (i + 1, clf_name, n_errors))
        subplot.set_xlim((-7, 7))
        subplot.set_ylim((-7, 7))
    plt.subplots_adjust(0.04, 0.1, 0.96, 0.94, 0.1, 0.26)
    plt.suptitle("Outlier detection")
plt.show()

You can now detect outliers in a data set. That is a major achievement, as the most interesting features of data science work are found in these outliers and why they exist. This supports you in performing the feature engineering of the data sets.

Dimension Consolidation

The data vault consists of five categories of data, with linked relationships and additional characteristics in satellite hubs. 

 

To perform dimension consolidation, you start with a given relationship in the data vault and construct a sun model for that relationship. I will cover the example of a person being born, to illustrate the consolidation process.

 

Open a new file in the Python editor and save it as Transform-Gunnarsson_is_Born.py in the directory ..\VKHCG\01-Vermeulen\04-Transform.

 

You will require the Python ecosystem, so set it up by adding the following libraries in your editor:

import sys
import os
from datetime import datetime
from datetime import timedelta
from pytz import timezone, all_timezones
import pandas as pd
import sqlite3 as sq
from pandas.io import sql
import uuid
pd.options.mode.chained_assignment = None
Find the working directory of the examples.
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
Set up the company you are processing.
Company='01-Vermeulen'
Add the company work space:
sDataBaseDir=Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName=sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
Add the data vault.
sDataVaultDir=Base + '/88-DV'
if not os.path.exists(sDataVaultDir):
    os.makedirs(sDataVaultDir)
sDatabaseName=sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName)
Add the data warehouse.
sDataWarehousetDir=Base + '/99-DW'
if not os.path.exists(sDataWarehousetDir):
    os.makedirs(sDataWarehousetDir)
sDatabaseName=sDataWarehousetDir + '/datawarehouse.db'
conn3 = sq.connect(sDatabaseName)

 

Execute the Python code now, to set up the basic ecosystem.

Note The new data structure, called a data warehouse, is in the directory ../99-DW. The data warehouse is the only data structure delivered from the Transform step.

 

Time

You need a date and time of December 20, 1960, at 9:15 in Iceland. Enter the following code into your editor (you start with a UTC time):

print('Time Category')
print('UTC Time')
BirthDateUTC = datetime(1960,12,20,10,15,0)
BirthDateZoneUTC=BirthDateUTC.replace(tzinfo=timezone('UTC'))
BirthDateZoneStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
BirthDateZoneUTCStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
print(BirthDateZoneUTCStr)
Formulate a Reykjavík local time.
print('Birth Date in Reykjavik :')
BirthZone = 'Atlantic/Reykjavik'
BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone))
BirthDateStr=BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDateLocal=BirthDate.strftime("%Y-%m-%d %H:%M:%S")
print(BirthDateStr)

 

You have successfully discovered the time key for the time hub and the time zone satellite for Atlantic.

 

  • Time Hub: You have a UTC date and time of December 20, 1960, at 10:15, corresponding to 9:15 in Reykjavík, Iceland, as follows:

1960-12-20 10:15:00 (UTC) (+0000)

  • Time Satellite: Birth date in Reykjavík: 1960-12-20 09:15:00 (-01) (-0100)

Now you can save your work, by adding the following to your code.

Build a data frame, as follows:

IDZoneNumber=str(uuid.uuid4())
sDateTimeKey=BirthDateZoneStr.replace(' ','-').replace(':','-')
TimeLine=[('ZoneBaseKey', ['UTC']),
          ('IDNumber', [IDZoneNumber]),
          ('DateTimeKey', [sDateTimeKey]),
          ('UTCDateTimeValue', [BirthDateZoneUTC]),
          ('Zone', [BirthZone]),
          ('DateTimeValue', [BirthDateStr])]
TimeFrame = pd.DataFrame.from_items(TimeLine)

 

Create the time hub.

TimeHub=TimeFrame[['IDNumber','ZoneBaseKey','DateTimeKey','DateTimeValue']]
TimeHubIndex=TimeHub.set_index(['IDNumber'],inplace=False)
sTable = 'Hub-Time-Gunnarsson'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
TimeHubIndex.to_sql(sTable, conn2, if_exists="replace")
sTable = 'Dim-Time-Gunnarsson'
TimeHubIndex.to_sql(sTable, conn3, if_exists="replace")

 

Create the time satellite.

TimeSatellite=TimeFrame[['IDNumber','DateTimeKey','Zone','DateTimeValue']]
TimeSatelliteIndex=TimeSatellite.set_index(['IDNumber'],inplace=False)
BirthZoneFix=BirthZone.replace(' ','-').replace('/','-')
sTable = 'Satellite-Time-' + BirthZoneFix + '-Gunnarsson'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
TimeSatelliteIndex.to_sql(sTable, conn2, if_exists="replace")
sTable = 'Dim-Time-' + BirthZoneFix + '-Gunnarsson'
TimeSatelliteIndex.to_sql(sTable, conn3, if_exists="replace")

Well done. You now have a Time category. The next category is Person.

 

Person

You must record that Guðmundur Gunnarsson was born on December 20, 1960, at 9:15 in Iceland.

Add the following to your existing code:

print('Person Category')
FirstName = 'Guðmundur'
LastName = 'Gunnarsson'
print('Name:',FirstName,LastName)
print('Birth Date:',BirthDateLocal)
print('Birth Zone:',BirthZone)
print('UTC Birth Date:',BirthDateZoneStr)
You just created the person in the person hub.
IDPersonNumber=str(uuid.uuid4())
PersonLine=[('IDNumber', [IDPersonNumber]),
('FirstName', [FirstName]),
('LastName', [LastName]),
('Zone', ['UTC']),
('DateTimeValue', [BirthDateZoneStr])]
PersonFrame = pd.DataFrame.from_items(PersonLine)
TimeHub=PersonFrame
TimeHubIndex=TimeHub.set_index(['IDNumber'],inplace=False)
sTable = 'Hub-Person-Gunnarsson'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
TimeHubIndex.to_sql(sTable, conn2, if_exists="replace")
sTable = 'Dim-Person-Gunnarsson'
TimeHubIndex.to_sql(sTable, conn3, if_exists="replace")

Well done. You now have a person hub.

Can you create a location hub and satellite for National University Hospital of Iceland?

Latitude: 64° 08' 10.80" N or 64.136332788
Longitude: -21° 55' 22.79" W or -21.922996308

 

What else can you process from the address data?

Landspítali: Local Name

Hringbraut 101: Street Address

101: Post Code
Reykjavík: City Name
Iceland: Country

 

If the specific latitude and longitude for Reykjavík does not exist yet, you can add an entry for it to the location hub. You can then add to a satellite named Hospitals the key from the location hub, plus all the extra information.

 

Alternatively, you can simply add it as a satellite called BuildingAt with the same extra information and a type_of_building indicator.

 

Whatever way you perform it, I advise discussing it with your customer.
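If you want a starting point for the exercise, here is a minimal sketch that follows the same hub-and-satellite pattern as the time and person code above. The values come from the address details listed; the table names are my own suggestion, and it assumes the connections conn2 and conn3 from the earlier script are still open (note that pd.DataFrame.from_items, used here for consistency with the surrounding code, is removed in newer pandas versions).

IDLocationNumber=str(uuid.uuid4())
LocationLine=[('IDNumber', [IDLocationNumber]),
              ('Latitude', [64.136332788]),
              ('Longitude', [-21.922996308])]
LocationFrame = pd.DataFrame.from_items(LocationLine)
LocationHubIndex=LocationFrame.set_index(['IDNumber'],inplace=False)
LocationHubIndex.to_sql('Hub-Location-Gunnarsson', conn2, if_exists="replace")

HospitalLine=[('IDNumber', [IDLocationNumber]),
              ('LocalName', ['Landspítali']),
              ('StreetAddress', ['Hringbraut 101']),
              ('PostCode', ['101']),
              ('CityName', ['Reykjavík']),
              ('Country', ['Iceland'])]
HospitalFrame = pd.DataFrame.from_items(HospitalLine)
HospitalSatelliteIndex=HospitalFrame.set_index(['IDNumber'],inplace=False)
HospitalSatelliteIndex.to_sql('Satellite-Location-Hospitals', conn2, if_exists="replace")
HospitalSatelliteIndex.to_sql('Dim-Location-Hospitals', conn3, if_exists="replace")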

 

Note You can data-mine many characteristics from a simple connection to a location or building. Can you now create an event hub for “Birth”? Yes, you can—easily—but you have an event called “Born.” Use that.

 

Note I have found that various people have diverse names for similar events, for example, born and birth, or death and deceased. I keep a library of the names we model for the same event in an additional table, and when we create an event for one of the words in a synonym list, we also create it for the others in that list.

 

For example, if we create an event for born and birth, that facilitates queries from the Transform step. If you asked “Who was born?” or “Whose birth was it?” you would get results retrieved from the data vault into the data warehouse.
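A minimal sketch of such a synonym table (the names and structure here are hypothetical, purely to illustrate the idea): when you create an event under one name, you look up the list and create the equivalent events as well.

# Hypothetical synonym library for event names
event_synonyms = {
    'Born': ['Born', 'Birth'],
    'Deceased': ['Deceased', 'Death'],
}

def expand_event(event_name):
    # Return every name the event should be recorded under
    for names in event_synonyms.values():
        if event_name in names:
            return names
    return [event_name]

print(expand_event('Birth'))   # ['Born', 'Birth']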

 

So, what about Object? Yes, you could use the genus/species data. Guðmundur is a Homo sapiens (human). Try to find the correct record to link to.

 

You must build three items: dimension Person, dimension Time, and fact PersonBornAtTime.

 

Open your Python editor and create a file named Transform-Gunnarsson-Sun-Model.py in the directory ..\VKHCG\01-Vermeulen\04-Transform. Here is your basic ecosystem:

# -*- coding: utf-8 -*-
import sys
import os
from datetime import datetime
from pytz import timezone
import pandas as pd
import sqlite3 as sq
import uuid
pd.options.mode.chained_assignment = None
if sys.platform == 'linux' or sys.platform == 'darwin':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
Company='01-Vermeulen'
sDataBaseDir=Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName=sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
sDataWarehousetDir=Base + '/99-DW'
if not os.path.exists(sDataWarehousetDir):
    os.makedirs(sDataWarehousetDir)
sDatabaseName=sDataWarehousetDir + '/datawarehouse.db'
conn2 = sq.connect(sDatabaseName)
Here is your Time dimension:
print('\n#################################')
print('Time Dimension')
BirthZone = 'Atlantic/Reykjavik'
BirthDateUTC = datetime(1960,12,20,10,15,0)
BirthDateZoneUTC=BirthDateUTC.replace(tzinfo=timezone('UTC'))
BirthDateZoneStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
BirthDateZoneUTCStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone))
BirthDateStr=BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDateLocal=BirthDate.strftime("%Y-%m-%d %H:%M:%S")
IDTimeNumber=str(uuid.uuid4())
TimeLine=[('TimeID', [IDTimeNumber]),
          ('UTCDate', [BirthDateZoneStr]),
          ('LocalTime', [BirthDateLocal]),
          ('TimeZone', [BirthZone])]
TimeFrame = pd.DataFrame.from_items(TimeLine)
DimTime=TimeFrame
DimTimeIndex=DimTime.set_index(['TimeID'],inplace=False)
sTable = 'Dim-Time'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimTimeIndex.to_sql(sTable, conn1, if_exists="replace")
DimTimeIndex.to_sql(sTable, conn2, if_exists="replace")
Well done. You have a Time dimension. Let’s build the Person dimension.
print('\n#################################')
print('Dimension Person')
print('\n#################################')
FirstName = 'Guðmundur'
LastName = 'Gunnarsson'
###############################################################
IDPersonNumber=str(uuid.uuid4())
PersonLine=[('PersonID', [IDPersonNumber]),
('FirstName', [FirstName]),
('LastName', [LastName]),
('Zone', ['UTC']),
('DateTimeValue', [BirthDateZoneStr])]
PersonFrame = pd.DataFrame.from_items(PersonLine)
DimPerson=PersonFrame
DimPersonIndex=DimPerson.set_index(['PersonID'],inplace=False)
sTable = 'Dim-Person'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimPersonIndex.to_sql(sTable, conn1, if_exists="replace")
DimPersonIndex.to_sql(sTable, conn2, if_exists="replace")
Finally, we add the fact, as follows:
print('\n#################################')
print('Fact - Person - time')
print('\n#################################')
IDFactNumber=str(uuid.uuid4())
PersonTimeLine=[('IDNumber', [IDFactNumber]),
                ('IDPersonNumber', [IDPersonNumber]),
                ('IDTimeNumber', [IDTimeNumber])]
PersonTimeFrame = pd.DataFrame.from_items(PersonTimeLine)
FctPersonTime=PersonTimeFrame
FctPersonTimeIndex=FctPersonTime.set_index(['IDNumber'],inplace=False)
sTable = 'Fact-Person-Time'
print('\n#################################')
print('Storing:',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
FctPersonTimeIndex.to_sql(sTable, conn1, if_exists="replace")
FctPersonTimeIndex.to_sql(sTable, conn2, if_exists="replace")

Can you now understand how to formulate a sun model and build the required dimensions and facts?

 

Building a Data Warehouse

As you have performed so well up to now, I will ask you to open the Transform-Sun-Models.py file from the directory ..\VKHCG\01-Vermeulen\04-Transform.

 

Note The Python program will build you a good and solid warehouse with which to try new data science techniques. Please be patient; it will supply you with a big push forward.

 

Execute the program and have some coffee once it runs. You can either code it for yourself or simply upload the code from the samples directory. The code follows if you want to understand the process. Can you understand the transformation from data vault to data warehouse?

# -*- coding: utf-8 -*-
import sys
import os
from datetime import datetime
from pytz import timezone
import pandas as pd
import sqlite3 as sq
import uuid
pd.options.mode.chained_assignment = None
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
Company='01-Vermeulen'
sDataBaseDir=Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName=sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName)
sDataVaultDir=Base + '/88-DV'
if not os.path.exists(sDataVaultDir):
    os.makedirs(sDataVaultDir)
sDatabaseName=sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName)
sDataWarehouseDir=Base + '/99-DW'
if not os.path.exists(sDataWarehouseDir):
    os.makedirs(sDataWarehouseDir)
sDatabaseName=sDataWarehouseDir + '/datawarehouse.db'
conn3 = sq.connect(sDatabaseName)
sSQL=" SELECT DateTimeValue FROM [Hub-Time];"
DateDataRaw=pd.read_sql_query(sSQL, conn2)
DateData=DateDataRaw.head(1000)
print(DateData)
print('\n#################################')
print('Time Dimension')
print('\n#################################')
t=0
mt=DateData.shape[0]
for i in range(mt):
    BirthZone = ('Atlantic/Reykjavik','Europe/London','UCT')
    for j in range(len(BirthZone)):
        t+=1
        print(t,mt*3)
        BirthDateUTC = datetime.strptime(DateData['DateTimeValue'][i],
                                         "%Y-%m-%d %H:%M:%S")
        BirthDateZoneUTC=BirthDateUTC.replace(tzinfo=timezone('UTC'))
        BirthDateZoneStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
        BirthDateZoneUTCStr=BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
        BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone[j]))
        BirthDateStr=BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
        BirthDateLocal=BirthDate.strftime("%Y-%m-%d %H:%M:%S")
        IDTimeNumber=str(uuid.uuid4())
        TimeLine=[('TimeID', [str(IDTimeNumber)]),
                  ('UTCDate', [str(BirthDateZoneStr)]),
                  ('LocalTime', [str(BirthDateLocal)]),
                  ('TimeZone', [str(BirthZone)])]
        if t==1:
            TimeFrame = pd.DataFrame.from_items(TimeLine)
        else:
            TimeRow = pd.DataFrame.from_items(TimeLine)
            TimeFrame=TimeFrame.append(TimeRow)
DimTime=TimeFrame
DimTimeIndex=DimTime.set_index(['TimeID'],inplace=False)
sTable = 'Dim-Time'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimTimeIndex.to_sql(sTable, conn1, if_exists="replace")
DimTimeIndex.to_sql(sTable, conn3, if_exists="replace")
sSQL=" SELECT " + \
     " FirstName," + \
     " SecondName," + \
     " LastName," + \
     " BirthDateKey " + \
     " FROM [Hub-Person];"
PersonDataRaw=pd.read_sql_query(sSQL, conn2)
PersonData=PersonDataRaw.head(1000)
print('\n#################################')
print('Dimension Person')
print('\n#################################')
t=0
mt=DateData.shape[0]
for i in range(mt):
    t+=1
    print(t,mt)
    FirstName = str(PersonData["FirstName"])
    SecondName = str(PersonData["SecondName"])
    if len(SecondName) > 0:
        SecondName=""
    LastName = str(PersonData["LastName"])
    BirthDateKey = str(PersonData["BirthDateKey"])
    IDPersonNumber=str(uuid.uuid4())
    PersonLine=[('PersonID', [str(IDPersonNumber)]),
                ('FirstName', [FirstName]),
                ('SecondName', [SecondName]),
                ('LastName', [LastName]),
                ('Zone', [str('UTC')]),
                ('BirthDate', [BirthDateKey])]
    if t==1:
        PersonFrame = pd.DataFrame.from_items(PersonLine)
    else:
        PersonRow = pd.DataFrame.from_items(PersonLine)
        PersonFrame = PersonFrame.append(PersonRow)
DimPerson=PersonFrame
print(DimPerson)
DimPersonIndex=DimPerson.set_index(['PersonID'],inplace=False)
sTable = 'Dim-Person'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimPersonIndex.to_sql(sTable, conn1, if_exists="replace")
DimPersonIndex.to_sql(sTable, conn3, if_exists="replace")

You should now have a good example of a data vault to data warehouse transformation. Congratulations on your progress!

 

 

Hough Transform

The Hough transform is a feature extraction technique used in image analysis, computer vision, and digital image processing. The purpose of the technique is to find imperfect instances of objects within a certain class of shapes, by a voting procedure.

 

This voting procedure is carried out in a parameter space, from which object candidates are obtained as local maxima in a so-called accumulator space that is explicitly constructed by the algorithm for computing the Hough transform.

 

With the help of the Hough transform, this regression improves the resolution of the RANSAC technique. This is extremely useful in robotics and robot vision, in which the robot requires the regression of the changes between two data frames or data sets in order to move through an environment.
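The section gives no code for this technique, so here is a minimal sketch using scikit-image (an assumption on my part; any library with a Hough transform would do). It votes line candidates into the accumulator space and reads back the local maxima:

import numpy as np
from skimage.transform import hough_line, hough_line_peaks

# Synthetic binary image containing one straight line
image = np.zeros((100, 100))
image[30, 10:90] = 1

# Accumulate votes in the (angle, distance) parameter space
h, theta, d = hough_line(image)

# Local maxima in the accumulator correspond to detected lines
for _, angle, dist in zip(*hough_line_peaks(h, theta, d)):
    print('Detected line: angle=%.3f rad, distance=%.1f px' % (angle, dist))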

 


 

Clustering Techniques

Clustering (or segmentation) is a kind of unsupervised learning algorithm, in which a data set is grouped into unique, differentiated clusters. Let’s say we have customer data spanning 1000 rows.

 

Using clustering, we can group the customers into separate clusters or segments, based on the variables. In the case of customers’ data, the variables can be demographic information or purchasing activities.

 

Clustering is an unsupervised learning algorithm, because the desired output is unknown to the data scientist: no training set is available with which to pre-train a model of the solution.

 

You do not train the algorithm on any past input-output information, but you let the algorithm define the output for itself. Therefore, there is no right solution to a clustering algorithm, only a reasonably best-fit solution, based on business usability. Clustering is also known as unsupervised classification.

There are two basic types of clustering techniques.

 

Hierarchical Clustering

Hierarchical clustering is a method of cluster analysis whereby you build a hierarchy of clusters. This works well for data sets that are complex and have distinct characteristics for separated clusters of data.

 

The following would be an example. People on a budget are more attracted to your sale items and multi-buy combinations, while more prosperous shoppers are more brand-orientated. These are two clearly different clusters, with poles-apart driving forces to buy an item. There are two design styles:

 

Agglomerative

This is a bottom-up approach. Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

 

Divisive

This is a top-down approach. All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy. I will take you through the transformation process to generate a hierarchical cluster.

Open your Python editor and set up a new ecosystem.

from matplotlib import pyplot as plt

from scipy.cluster.hierarchy import dendrogram, linkage
import numpy as np

 

You will now generate two clusters: (a), with 100 points, and (b), with 50.

np.random.seed(4711)
a = np.random.multivariate_normal([10, 0], [[3, 1], [1, 4]], size=[100,])
b = np.random.multivariate_normal([0, 20], [[3, 1], [1, 4]], size=[50,])
X = np.concatenate((a, b),)
This creates 150 samples with 2 dimensions.
print( X.shape)
Let’s quickly display it.
plt.scatter(X[:,0], X[:,1])
plt.show()

 

You must generate the linkage matrix. The matrix contains the linkage criterion that determines the distance between sets of observations as a function of the pairwise distances between observations.

 

For more information on this matrix, see https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.cluster.hierarchy.linkage.html.

Z = linkage(X, 'ward')

 

Enhance the ecosystem, by adding extra libraries.

from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

 

Trigger the cophenetic correlation coefficient. The cophenetic correlation coefficient is a measure of how faithfully a dendrogram preserves the pairwise distances between the original unmodeled data points. In simple terms, how accurate is the measure?

c, coph_dists = cophenet(Z, pdist(X))
print('Cophenetic Correlation Coefficient:',c)
You will now calculate a full dendrogram.
plt.figure(figsize=(25, 10))
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
dendrogram(
Z,
leaf_rotation=90,
leaf_font_size=8,
)
plt.show()
Now, you can truncate the cluster (show only the last p merged clusters).
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
Z,
truncate_mode='lastp',
p=12,
show_leaf_counts=False,
leaf_rotation=90,
leaf_font_size=12,
show_contracted=True,
)
plt.show()
You now must define a new function, to improve the diagram’s display.
def fancy_dendrogram(*args, **kwargs):
    max_d = kwargs.pop('max_d', None)
    if max_d and 'color_threshold' not in kwargs:
        kwargs['color_threshold'] = max_d
    annotate_above = kwargs.pop('annotate_above', 0)
    ddata = dendrogram(*args, **kwargs)
    if not kwargs.get('no_plot', False):
        plt.title('Hierarchical Clustering Dendrogram (truncated)')
        plt.xlabel('sample index or (cluster size)')
        plt.ylabel('distance')
        for i, d, c in zip(ddata['icoord'], ddata['dcoord'], ddata['color_list']):
            x = 0.5 * sum(i[1:3])
            y = d[1]
            if y > annotate_above:
                plt.plot(x, y, 'o', c=c)
                plt.annotate("%.3g" % y, (x, y), xytext=(0, -5),
                             textcoords='offset points',
                             va='top', ha='center')
        if max_d:
            plt.axhline(y=max_d, c='k')
    return ddata
You can now use the new function against your clusters.
fancy_dendrogram(
Z,
truncate_mode='lastp',
p=12,
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,
annotate_above=10, # useful in small plots so annotations don't overlap
)
plt.show()
Let’s set the cutoff to 50.
max_d = 50
Now, you just replot the new data.
fancy_dendrogram(
Z,
truncate_mode='lastp',
p=12,
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,
annotate_above=10,
max_d=max_d,
)
plt.show()
Change the cut to 16.
fancy_dendrogram(
Z,
truncate_mode='lastp',
p=12,
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,
annotate_above=10,
max_d=16,
)
plt.show()
Now, you add extra functions to investigate your transformation.
from scipy.cluster.hierarchy import inconsistent
You can investigate at a depth of five.
depth = 5
incons = inconsistent(Z, depth)
print(incons[-10:])
What are you seeing?
Move to depth of three.
depth = 3
incons = inconsistent(Z, depth)
print(incons[-10:])
What do you see? You will see it better with a graph.
last = Z[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)
You should now look at the acceleration.
acceleration = np.diff(last, 2)
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()
k = acceleration_rev.argmax() + 2 # if idx 0 is the max of this we want 2 clusters
print ("Clusters:", k)
c = np.random.multivariate_normal([40, 40], [[20, 1], [1, 30]], size=[200,])
d = np.random.multivariate_normal([80, 80], [[30, 1], [1, 30]], size=[200,])
e = np.random.multivariate_normal([0, 100], [[100, 1], [1, 100]], size=[200,])
X2 = np.concatenate((X, c, d, e),)
plt.scatter(X2[:,0], X2[:,1])
plt.show()
Can you see the clusters? Here is a proper cluster diagram:
Z2 = linkage(X2, 'ward')
plt.figure(figsize=(10,10))
fancy_dendrogram(
Z2,
truncate_mode='lastp',
p=30,
leaf_rotation=90.,
leaf_font_size=12.,
show_contracted=True,
annotate_above=40,
max_d=170,
)
plt.show()
Well done, you can now see the clusters.
Let’s look at the data in more detail.
last = Z2[-10:, 2]
last_rev = last[::-1]
idxs = np.arange(1, len(last) + 1)
plt.plot(idxs, last_rev)
You can now perform more analysis.
acceleration = np.diff(last, 2)  # 2nd derivative of the distances
acceleration_rev = acceleration[::-1]
plt.plot(idxs[:-2] + 1, acceleration_rev)
plt.show()
k = acceleration_rev.argmax() + 2 # if idx 0 is the max of this we want 2 clusters
print ("Clusters:", k)
print (inconsistent(Z2, 5)[-10:])
Let’s look at an F-cluster.
from scipy.cluster.hierarchy import fcluster
max_d = 50
clusters = fcluster(Z, max_d, criterion='distance')
print(clusters)
Can you see the clusters?
k=2
fcluster(Z, k, criterion='maxclust')
And now can you spot the clusters? It is not that easy to spot the clusters, is it? Let’s try a different angle.
from scipy.cluster.hierarchy import fcluster
fcluster(Z, 8, depth=10)
plt.figure(figsize=(10, 8))
plt.scatter(X[:,0], X[:,1], c=clusters, cmap='winter')
plt.show()

You can see them now!

Note Visualize insights. People understand graphics much more easily; columned data sets require explanation and additional contextual information. In blog 11, I will discuss how to report your insights in an easy but effective manner.

 

You have successfully completed hierarchical clustering. This should enable you to understand that there are numerous subsets of clusters that can be combined to form bigger cluster structures.

 

Partitional Clustering

A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters), such that each data object is in exactly one subset. Remember when you were at school?

 

During breaks, when you played games, you could only belong to either the blue team or the red team. If you forgot which team was yours, the game normally ended in disaster!

 

Open your Python editor, and let’s perform a transformation to demonstrate how you can create a proper partitional clustering solution. As always, you will require the ecosystem.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets.samples_generator import make_blobs
from sklearn.preprocessing import StandardScaler
You can generate some sample data, as follows:
centers = [[4, 4], [-4, -4], [4, -4],[6,0],[0,0]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.5, random_state=0)
Let’s apply a scaler transform.
X = StandardScaler().fit_transform(X)
You can now apply the DBSCAN transformation.
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
Select the number of clusters in labels. You can ignore noise, if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
Here are your findings:
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
% metrics.adjusted_rand_score(labels_true, labels)) print("Adjusted Mutual Information: %0.3f"
% metrics.adjusted_mutual_info_score(labels_true, labels)) print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, labels))
You can also plot it. Remember: Graphics explains better.
import matplotlib.pyplot as plt
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

If you see this layout, well done. You can perform partitional clustering. This type of clustering is commonly used, as it places any data entry in distinct clusters without overlay.

 

Open a new file in the Python editor and try the following location cluster. Set up the ecosystem.

import sys
import os
from math import radians, cos, sin, asin, sqrt
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import pandas as pd

 

Remember this distance measures function? It calculates the great circle distance between two points on the earth, when specified in decimal degrees.

def haversine(lonlat1, lonlat2):
    # convert decimal degrees to radians
    lat1, lon1 = lonlat1
    lat2, lon2 = lonlat2
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    r = 6371  # Radius of earth in kilometers. Use 3956 for miles
    return c * r
Let’s get the location data.
if sys.platform == 'linux':
    Base=os.path.expanduser('~') + '/VKHCG'
else:
    Base='C:/VKHCG'
sFileName=Base + '/01-Vermeulen/00-RawData/IP_DATA_ALL.csv'
print('Loading :',sFileName)
scolumns=("Latitude","Longitude")
RawData = pd.read_csv(sFileName,usecols=scolumns,header=0,low_memory=False)
print(RawData)
You now need a sample of 1000 data points.
X=RawData.sample(n=1000)
Now calculate the cluster distance parameter.
distance_matrix = squareform(pdist(X, (lambda u,v: haversine(u,v))))
print(distance_matrix)
Next, apply the clustering.
db = DBSCAN(eps=0.2, min_samples=2, metric='precomputed', algorithm='auto')
y_db = db.fit_predict(distance_matrix)
X['cluster'] = y_db
C = X.cluster.unique()
Let’s visualize the data.
fig=plt.figure(1, figsize=(20, 20))
plt.title('Estimated number of clusters: %d' % len(C))
plt.scatter(X['Latitude'], X['Longitude'], c=X['cluster'],marker='D')
plt.show()
Now save your results.
sImageFile = Base + '/01-Vermeulen/04-Transform/01-EDS/02-Python/Location_Cluster.jpg'
fig.savefig(sImageFile)
plt.close(fig)

Note The latent Dirichlet allocation (LDA), as discussed earlier in this blog, would be classed as a clustering algorithm and should be mentioned among other clustering algorithms. You have completed the partitional clustering solution.

 

ANOVA

The one-way analysis of variance (ANOVA) test is used to determine whether the means of more than two groups of data are significantly different from one another.

 

Example:

A BOGOF (buy-one-get-one-free) campaign is executed on 5 groups of 100 customers each. Each group is different in terms of its demographic attributes.

 

We would like to determine whether these five groups respond differently to the campaign. This would help us optimize the right campaign for the right demographic group, increase the response rate, and reduce the cost of the campaign.

 

The analysis of variance works by comparing the variance between the groups to that within the group. The core of this technique lies in assessing whether all the groups are in fact part of one larger population or a completely different population with different characteristics.

 

Open your Python editor and set up the following ecosystem:

import pandas as pd
datafile='../VKHCG/01-Vermeulen/00-RawData/PlantGrowth.csv'
data = pd.read_csv(datafile)
Now you must create a boxplot against the data.
data.boxplot('weight', by='group', figsize=(12, 8))
You must now perform feature extraction and engineering.
ctrl = data['weight'][data.group == 'ctrl']
grps = pd.unique(data.group.values)
d_data = {grp:data['weight'][data.group == grp] for grp in grps}
k = len(pd.unique(data.group))  # number of conditions
N = len(data.values)  # conditions times participants
n = data.groupby('group').size()[0]  # Participants in each condition
You now need extra functions from an extra library:
from scipy import stats

 

Now activate the one-way ANOVA test for the null hypothesis that two or more groups have the same population mean transformation.

F, p = stats.f_oneway(d_data['ctrl'], d_data['trt1'], d_data['trt2'])
You need to set up a few parameters.
DFbetween = k - 1
DFwithin = N - k
DFtotal = N - 1
You can now further investigate the results from the transformation.
SSbetween = (sum(data.groupby('group').sum()['weight']**2)/n) \
            - (data['weight'].sum()**2)/N
sum_y_squared = sum([value**2 for value in data['weight'].values])
SSwithin = sum_y_squared - sum(data.groupby('group').sum()['weight']**2)/n
SStotal = sum_y_squared - (data['weight'].sum()**2)/N
MSbetween = SSbetween/DFbetween
MSwithin = SSwithin/DFwithin
F = MSbetween/MSwithin
eta_sqrd = SSbetween/SStotal
omega_sqrd = (SSbetween - (DFbetween * MSwithin))/(SStotal + MSwithin)
Here are the results of your investigation:
print(F,p,eta_sqrd,omega_sqrd)
Well done. You have performed a successful one-way ANOVA test.

 

Principal Component Analysis (PCA)

Dimension (variable) reduction techniques aim to reduce a data set of higher dimension to one of lower dimension, without losing the information that is conveyed by the data set. The dimension here can be understood as the number of variables the data set contains.

Two commonly used variable reduction techniques follow.

 

Factor Analysis

The crux of PCA lies in measuring the data from the perspective of a principal component. A principal component of a data set is the direction with the largest variance.

 

A PCA analysis involves rotating the axis of each variable to the highest Eigenvector/Eigenvalue pair and defining the principal components, i.e., the highest variance axis or, in other words, the direction that most defines the data. Principal components are uncorrelated and orthogonal.

 

PCA is fundamentally a dimensionality reduction algorithm, but it is just as useful as a tool for visualization, for noise filtering, for feature extraction, and engineering. Here is an example.

 

Open your Python editor and set up this ecosystem:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets
from sklearn.decomposition import PCA
Import some data to apply your skills against.
iris = datasets.load_iris()
Take only the first two features.
X = iris.data[:, :2]
You need the following target:
y = iris.target
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
You can quickly visualize the data set.
plt.clf()
plt.figure(2, figsize=(8, 6))
Now plot the training points.
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor='k')
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
I suggest getting a better understanding of the interaction of the dimensions. Plot the first three PCA dimensions.
fig = plt.figure(1, figsize=(8, 6))
ax = Axes3D(fig, elev=-150, azim=110)
X_reduced = PCA(n_components=3).fit_transform(iris.data)
ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2], c=y,
           cmap=plt.cm.Set1, edgecolor='k', s=40)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.w_xaxis.set_ticklabels([])
ax.set_ylabel("2nd eigenvector")
ax.w_yaxis.set_ticklabels([])
ax.set_zlabel("3rd eigenvector")
ax.w_zaxis.set_ticklabels([])
plt.show()

Can you see the advantage of reducing the features in a controlled manner? Let’s deploy your new knowledge against a more complex data set.

 

Example:

Hillman Ltd has introduced a CCTV tracking process on the main gates, as it suspects unauthorized offloading of goods is happening through its harbor port’s warehouse. You will load a set of digits from the cameras, to assist the company to read the numbers on the side of the containers that left the warehouse.

 

Open your Python editor and set up this ecosystem:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np

 

Load the digits from the preprocessed image software.

digits = load_digits()
You now apply the transformation to project from 64 to 2 dimensions.
pca = PCA(2)
projected = pca.fit_transform(digits.data)
print(digits.data.shape)
print(projected.shape)
Display your findings.
plt.figure(figsize=(10, 15))
plt.scatter(projected[:, 0], projected[:, 1],
            c=digits.target, edgecolor='none', alpha=0.5,
            cmap=plt.cm.get_cmap('nipy_spectral_r', 10))
plt.xlabel('Value 1')
plt.ylabel('Value 2')
plt.colorbar()
Apply the PCA transform.
pca = PCA().fit(digits.data)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')

 

You will require the following function to plot the digits:

def plot_digits(data):
    fig, axes = plt.subplots(4, 10, figsize=(10, 4),
                             subplot_kw={'xticks':[], 'yticks':[]},
                             gridspec_kw=dict(hspace=0.1, wspace=0.1))
    for i, ax in enumerate(axes.flat):
        ax.imshow(data[i].reshape(8, 8),
                  cmap='binary', interpolation='nearest', clim=(0, 16))
You can now plot results.
plot_digits(digits.data)
One of the cameras, however, has not generated perfect images.
np.random.seed(42)
noisy = np.random.normal(digits.data, 4)
plot_digits(noisy)
You can still use the data; it just needs more processing.
pca = PCA(0.50).fit(noisy)
pca.n_components_
You can determine the distortion by filtering the noise factors.
components = pca.transform(noisy)
filtered = pca.inverse_transform(components)
Visualize your results:
plot_digits(filtered)

 

Question Can you see the digits from the cameras?

Do you think we have a workable solution?

You have now proven that you can take a complex 64-dimension data set, and using PCA, reduce it to 2 dimensions and still achieve good data science results. Good progress on your side.

 

Conjoint Analysis

Conjoint analysis is widely used in market research to identify customers’ preference for various attributes that make up a product. The attributes can be various features, such as size, color, usability, price, etc.

 

Using conjoint (trade-off) analysis, brand managers can identify which features customers would trade off for a certain price point. Thus, such analysis is a highly used technique in new product design or pricing strategies.

 

Example:

The data is a ranking of three different features (TV size, TV type, TV color).

  • TV size options are 42", 47", or 60".
  • TV type options are LCD or Plasma.
  • TV color options are red, blue, or pink.

The data rates the different stimuli types for each customer. You are tasked with determining which TVs to display on Krennwallner’s billboards.

 

Open your Python editor and set up the following ecosystem:

import numpy as np
import pandas as pd
Retrieve the customer buy choices.
sFile='C:/VKHCG/01-Vermeulen/00-RawData/BuyChoices.txt'
caInputeDF = pd.read_csv(sFile, sep = ";")
You can display the choices.
print(caInputeDF)

 

You need a new data structure, and you must create dummy variables for every stimulus. In total, you require 9 different stimuli and 18 different combinations.

ConjointDummyDF = pd.DataFrame(np.zeros((18,9)), columns=["Rank","A1", "A2", "A3","B1","B2","C1", "C2","C3"])
You now need to feature engineer the data choices into the new structure:
ConjointDummyDF.Rank = caInputeDF.Rank
for index, row in caInputeDF.iterrows():
    stimuli1 = str(caInputeDF["Stimulus"].iloc[index][:2])
    stimuli2 = str(caInputeDF["Stimulus"].iloc[index][2:4])
    stimuli3 = str(caInputeDF["Stimulus"].iloc[index][4:6])
    stimuliLine=[stimuli1,stimuli2,stimuli3]
    Columns=ConjointDummyDF.columns
    for stimuli in stimuliLine:
        for i in range(len(Columns)):
            if stimuli == Columns[i]:
                ConjointDummyDF.iloc[index, i] = 1
Well done. Let’s look at your new structure.
print(ConjointDummyDF.head())
You only have unknown numbers. I suggest you add suitable stimulus names. Be mindful of the Transform step output you normally present to businesspeople: they understand “Plasma TV”; they do not understand a code such as TVType3.
fullNames = {"Rank":"Rank", "A1": "42\" (106cm)","A2": "47\" (120cm)","A3":
"60\" (152cm)","B1": "Plasma","B2":"LCD","C1":"Blue","C2":"Red","C3":
"Pink",}
ConjointDummyDF.rename(columns=fullNames, inplace=True)
That’s better. Now look at the data structure.
print(ConjointDummyDF.head())
Next, you estimate the main effects with a linear regression.
You need extra libraries to achieve this action.
import statsmodels.api as sm
You can select any of these columns for your model.
print(ConjointDummyDF.columns)
I suggest you select these:
X = ConjointDummyDF[[u'42" (106cm)', u'47" (120cm)',\
u'60" (152cm)', u'Plasma', u'LCD', u'Red', u'Blue', u'Pink']]
You now need to set up the model.
X = sm.add_constant(X)
Y = ConjointDummyDF.Rank
You now activate the model against your new data structure.
linearRegression = sm.OLS(Y, X).fit()
You can now look at the results.
print(linearRegression.summary())

 

There are numerous indicators that you can research. I am not going to cover any of these in detail, as several require complex mathematics to explain them fully.

The indicators we are interested in are the part worth values and relative importance of the stimuli:

Importance of Stimuli = Max(beta) - Min(beta)

 

Relative Importance of Stimuli = Importance of Stim/ Sum(Importance of All Stimuli)

You will now investigate these indicators. To do this, you need some data engineering. You now require several basic indicator variables to start the process.

importance = []
relative_importance = []
rangePerFeature = []
begin = "A"
tempRange = []
You now need to load the stimuli.
for stimuli in fullNames.keys():
    if stimuli[0] == begin:
        tempRange.append(linearRegression.params[fullNames[stimuli]])
    elif stimuli == "Rank":
        rangePerFeature.append(tempRange)
    else:
        rangePerFeature.append(tempRange)
        begin = stimuli[0]
        tempRange = [linearRegression.params[fullNames[stimuli]]]
Then, you need the feature ranges.
for item in rangePerFeature:
    importance.append(max(item) - min(item))
You then calculate the importance of a feature.
for item in importance:
    relative_importance.append(100*round(item/sum(importance),3))
Start a base data structure for the part worth values.
partworths = []
item_levels = [1,3,5,8]
You now calculate the part worth values.
for i in range(1,4):
    part_worth_range = linearRegression.params[item_levels[i-1]:item_levels[i]]
    print(part_worth_range)
You need to determine the mean rank of the data set.
meanRank = []
for i in ConjointDummyDF.columns[1:]:
    newmeanRank = ConjointDummyDF["Rank"].loc[ConjointDummyDF[i] == 1].mean()
    meanRank.append(newmeanRank)
I suggest you use the total mean, known as the “basic utility,” as the “zero alternative.”
totalMeanRank = sum(meanRank) / len(meanRank)
You will now rank the value of each part of the stimuli, i.e., what features of the TV are important?
partWorths = {}
for i in range(len(meanRank)):
    name = fullNames[sorted(fullNames.keys())[i]]
    partWorths[name] = meanRank[i] - totalMeanRank
You now have a result set.
print(partWorths)
Now, I will help you to develop the results into insights and deliver the basis for a summary and results report.
print ("Relative Importance of Feature:\n\nMonitor Size:",relative_ importance[0], "%","\nType of Monitor:", relative_importance[1], "%", "\ nColor of TV:", relative_importance[2], "%\n\n") print ("--"*30)
print ("Importance of Feature:\n\nMonitor Size:",importance[0],"\nType of Monitor:", importance[1], "\nColor of TV:", importance[2])
So, what is the optimal product bundle? I suggest 60", LCD, Red. Do you agree? Let’s test that assumed fact.
optBundle = [1,0,0,1,0,1,0,1,0]
print ("The best possible Combination of Stimuli would have the highest rank:",linearRegression.predict(optBundle)[0])
optimalWorth = partWorths["60\" (152cm)"] + partWorths["LCD"] + partWorths["Red"]
print ("Choosing the optimal Combination brings the Customer an extra ", optimalWorth, "'units' of utility")

Does your science support my assumed fact? Congratulations, you have just completed a conjoint analysis.

 

Tip I suggest you try the equivalent analysis with other data sets, as this is a common question I am asked. What makes people pick one product and not another? You should practice with data sets of all sizes, to understand how the formulas react to changes in the stimuli factors.

I have confidence that there is a future for you at Krennwallner as a data scientist.
