What Is Feature Engineering (2019)


Feature Engineering

What Is Feature Engineering?

Feature engineering is the core technique for identifying the important characteristics of the data in your data lake and ensuring that they receive the correct treatment at each processing step. Make sure that every feature extraction technique you apply is documented in the data transformation matrix and the data lineage.


This tutorial explains what feature engineering is and how it works, and introduces the common feature extraction techniques it relies on.


Common Feature Extraction Techniques

I will introduce you to several common feature extraction techniques that will help you to enhance an existing data warehouse, by applying data science to the data in the warehouse.



Binning is a technique that is used to reduce the complexity of data sets, to enable the data scientist to evaluate the data with an organized grouping technique. Binning is a good way for you to turn continuous data into a data set that has specific features that you can evaluate for patterns.


A simple example is the cost of candy in your local store, which might range anywhere from a penny to ten dollars. Priced to the cent, that is up to 1,000 distinct values; if you group each price into a rounded-up dollar value, you are left with only ten values to evaluate.


You have just reduced the number of distinct values you must process to 1/100th of what it was before. There are several good techniques, which I will discuss next.
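As a rough illustration of the candy example, the following sketch rounds randomly generated prices up to the next whole dollar and compares the number of distinct values before and after. The prices are made-up data, and the variable names are assumptions chosen for the illustration.

```python
import numpy as np

# Hypothetical candy prices in cents, from 1 cent up to 1,000 cents ($10.00).
rng = np.random.default_rng(42)
prices_cents = rng.integers(1, 1001, size=5000)

# Round every price up to the next whole dollar: 1..100 cents -> 1, 101..200 -> 2, ...
price_bins = np.ceil(prices_cents / 100).astype(int)

distinct_before = np.unique(prices_cents).size
distinct_after = np.unique(price_bins).size
print('Distinct prices before binning:', distinct_before)
print('Distinct prices after binning :', distinct_after)
```

The binned column keeps the rough price level of every item while leaving far fewer distinct values to evaluate for patterns.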


I have two binning techniques that you can use against the data sets. Open your Python editor and try these examples. The first technique is to use the digitize function.

import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
print(bin_means)
The second is to use the histogram function.
bin_means2 = (numpy.histogram(data, bins, weights=data)[0] / numpy.histogram(data, bins)[0])
print(bin_means2)

This transform technique can be used to reduce the location dimension into three latitude bins and four longitude bins.

You will require the NumPy library.

import numpy as np
Set up the latitude and longitude data sets.
LatitudeData = np.array(range(-90, 90, 1))
LongitudeData = np.array(range(-180, 180, 1))
Set up the latitude and longitude data bins. Four bin edges give three latitude bins, and five bin edges give four longitude bins.
LatitudeBins = np.array(range(-90, 91, 60))
LongitudeBins = np.array(range(-180, 181, 90))
Digitize the data sets with the data bins.
LatitudeDigitized = np.digitize(LatitudeData, LatitudeBins)
LongitudeDigitized = np.digitize(LongitudeData, LongitudeBins)
Calculate the mean against the bins:
LatitudeBinMeans = [LatitudeData[LatitudeDigitized == i].mean() for i in range(1, len(LatitudeBins))]
LongitudeBinMeans = [LongitudeData[LongitudeDigitized == i].mean() for i in range(1, len(LongitudeBins))]
print(LatitudeBinMeans)
print(LongitudeBinMeans)
Well done. You have the three latitude bins and four longitude bins.
You can also use the histogram function to achieve similar results.
LatitudeBinMeans2 = (np.histogram(LatitudeData, LatitudeBins, weights=LatitudeData)[0] /
                     np.histogram(LatitudeData, LatitudeBins)[0])
LongitudeBinMeans2 = (np.histogram(LongitudeData, LongitudeBins, weights=LongitudeData)[0] /
                      np.histogram(LongitudeData, LongitudeBins)[0])

Now you can apply two different techniques for binning.



The use of averaging enables you to reduce the number of records you require to report any activity that demands a more indicative, rather than a precise, total.


Create a model that enables you to calculate the average position for ten sample points. First, set up the ecosystem.

import numpy as np
import pandas as pd
Create two series to model the latitude and longitude ranges.
LatitudeData = pd.Series(np.array(range(-90,91,1)))
LongitudeData = pd.Series(np.array(range(-180,181,1)))
You then select 10 samples for each range:
LatitudeSet = LatitudeData.sample(10)
LongitudeSet = LongitudeData.sample(10)
Calculate the average of each.
LatitudeAverage = np.average(LatitudeSet)
LongitudeAverage = np.average(LongitudeSet)
See your results.
print('Latitude (Avg):',LatitudeAverage)
print('Longitude (Avg):', LongitudeAverage)

You can now calculate the average of any range of numbers. If you run the code several times, you should get different samples. 
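To show how averaging reduces the number of records you have to report, the following sketch collapses a set of position readings into one average row per vehicle. The readings, the column names, and the vehicle labels are all assumptions made up for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_records = 1000

# Hypothetical position readings for four vehicles.
readings = pd.DataFrame({
    'Vehicle': rng.choice(['A', 'B', 'C', 'D'], size=n_records),
    'Latitude': rng.uniform(-90, 90, size=n_records),
    'Longitude': rng.uniform(-180, 180, size=n_records),
})

# One indicative (average) position per vehicle instead of every single reading.
average_position = readings.groupby('Vehicle')[['Latitude', 'Longitude']].mean()
print(len(readings), 'records reduced to', len(average_position))
print(average_position)
```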


Latent Dirichlet Allocation (LDA)

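Latent Dirichlet allocation (LDA) is a statistical model that explains sets of observations through unobserved groups, which account for why parts of the data are similar. For text documents, the observations are the words, and the unobserved groups are the topics that explain why certain words occur together.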


This technique is useful when investigating text from a collection of documents that are common in the data lake, as companies store all their correspondence in a data lake. This model is also useful for Twitter or e-mail analysis.


Note: To run the example, you must first install the lda library (pip install lda).


In your Python editor, create a new file named Transform_Latent_Dirichlet_allocation.py in the directory ..\VKHCG\01-Vermeulen\04-Transform. Following is an example of what you can achieve:

import numpy as np
import lda
import lda.datasets
X = lda.datasets.load_reuters()
vocab = lda.datasets.load_reuters_vocab()
titles = lda.datasets.load_reuters_titles()
You can experiment with ranges of n_topics and n_iter values to observe the impact on the process.
model = lda.LDA(n_topics=50, n_iter=1500, random_state=1)
# The model must be fitted before topic_word_ and doc_topic_ are available.
model.fit(X)
topic_word = model.topic_word_
n_top_words = 10
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
Investigate the top topic for each of the first ten documents.
doc_topic = model.doc_topic_
for i in range(10):
    print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))


Well done. You can now analyze text documents.

If you had Hillman’s logistics shipping documents, could you use this technique? Indeed, you could, as this technique works on any free-text note field, e-mail, or even Twitter entry.

  • Now, you can read your Twitter accounts.
  • Now, you can read e-mail.

The complete process is about getting data to the data lake and then guiding it through the steps: retrieve, assess, process, and transform.


Tip: I have found that, on average, it is only after the third recheck that 90% of the data science is complete. The process is an iterative design process. The methodology is based on a cyclic process of prototyping, testing, analyzing, and refining. Success will be achieved as you close out the prototypes.

I will now explain a set of common data science terminology that you will encounter in the field of data science.


Hypothesis Testing

Hypothesis testing is not precisely an algorithm, but it’s a must-know for any data scientist. You cannot progress until you have thoroughly mastered this technique.


Hypothesis testing is the process of applying statistical tests to data to check whether a hypothesis is supported. Based on the outcome of the test, data scientists choose to reject the hypothesis or not.
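A minimal sketch of that decision, assuming a one-sample t-test from SciPy and a significance level of 0.05: the delivery times are randomly generated, and the claimed mean of 30 minutes is a hypothetical figure invented for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical delivery times in minutes; the hypothesis under test is a mean of 30.
delivery_times = rng.normal(loc=33.0, scale=4.0, size=50)

# Null hypothesis: the true mean delivery time is 30 minutes.
t_stat, p_value = stats.ttest_1samp(delivery_times, popmean=30.0)

alpha = 0.05  # assumed significance level
print('t = %.3f, p = %.4f' % (t_stat, p_value))
if p_value < alpha:
    print('Reject the hypothesis that the mean is 30 minutes.')
else:
    print('The data does not give enough evidence to reject the hypothesis.')
```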


Logistic Regression

Logistic regression is the technique to find relationships between a set of input variables and an output variable (just like any regression), but the output variable, in this case, is a binary outcome (think of 0/1 or yes/no).


Simple Logistic Regression

I will guide you through a simple logistic regression that has only two possible outcomes. A real-world business example would be the study of a traffic jam at a certain location in London, using a binary variable. The output is categorical: Is there a traffic jam? Yes or no.


The probability of occurrence of traffic jams can be dependent on attributes such as weather condition, the day of the week and month, time of day, number of vehicles, etc.


Using logistic regression, you can find the best-fitting model that explains the relationship between independent attributes and traffic jam occurrence rates and predicts the probability of jam occurrence.


This process is called a binary logistic regression.
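The traffic example can be sketched with scikit-learn as follows. The data is simulated, and the two attributes (vehicle count and rain) and their effect sizes are assumptions invented for the sketch, not measurements from London.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Made-up attributes: vehicles counted per five minutes and a rain flag.
vehicles = rng.uniform(0, 200, size=n)
raining = rng.integers(0, 2, size=n)

# Made-up ground truth: more vehicles and rain both make a jam more likely.
logit = 0.05 * (vehicles - 100) + 1.0 * raining
jam = (rng.uniform(size=n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([vehicles, raining])
model = LogisticRegression(max_iter=1000).fit(X, jam)

# Predicted probability of a jam (class 1) for two situations.
quiet_dry = model.predict_proba([[20, 0]])[0, 1]
busy_wet = model.predict_proba([[180, 1]])[0, 1]
print('P(jam | 20 vehicles, dry) = %.3f' % quiet_dry)
print('P(jam | 180 vehicles, rain) = %.3f' % busy_wet)
```

The model returns a probability between zero and one, which you can turn into a yes/no answer by choosing a cut-off such as 0.5.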

The state of the traffic changes from No = Zero to Yes = One by moving along the logistic (sigmoid) curve modeled by the following code.

import math
for x in range(-10, 10, 1):
    print(1 / (1 + math.exp(-x)))

I will now discuss the logistic regression, using a sample data set.



from sklearn import datasets, neighbors, linear_model
Load the data.
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target
n_samples = len(X_digits)
Select the train and test data sets.
X_train = X_digits[:int(.9 * n_samples)]
y_train = y_digits[:int(.9 * n_samples)]
X_test = X_digits[int(.9 * n_samples):]
y_test = y_digits[int(.9 * n_samples):]
Select the K-Neighbor classifier.
knn = neighbors.KNeighborsClassifier()
Select the logistic regression model.
logistic = linear_model.LogisticRegression()
Train the K-Neighbor classifier and score it against the test data set.
print('KNN score: %f' % knn.fit(X_train, y_train).score(X_test, y_test))
Train the logistic regression model and score it against the test data set.
print('LogisticRegression score: %f'
      % logistic.fit(X_train, y_train).score(X_test, y_test))

Well done. You have just completed your next transform step, by successfully training a logistic regression model and a K-Neighbor classifier and comparing their scores against the sample data set.


Tip: Using this simple process, I have discovered numerous thought-provoking correlations or busted myths about relationships between data values.


Multinomial Logistic Regression

Multinomial logistic regression (MLR) is an extension of logistic regression that is used when the dependent variable is nominal with more than two levels.


It is used to describe data and to explain the relationship between one dependent nominal variable and one or more continuous-level (interval or ratio scale) independent variables. You can consider the nominal variable as a variable that has no intrinsic ordering.


This type of data is most common in the business world, as it generally covers most data entries within the data sources and directly indicates what you could expect in the average data lake. The data has no intrinsic order or relationship.
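Before the larger example, here is a small sketch of the idea on the iris data set that ships with scikit-learn, whose target is a nominal variable with three levels. It assumes a recent scikit-learn release, in which LogisticRegression handles a multi-class target directly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data, iris.target  # three unordered classes: 0, 1, 2

clf = LogisticRegression(max_iter=1000).fit(X, y)

# One probability per class for each sample; each row sums to one.
probabilities = clf.predict_proba(X[:3])
print('Classes:', clf.classes_)
print(np.round(probabilities, 3))
print('Row sums:', probabilities.sum(axis=1))
```

Instead of a single yes/no probability, the model returns one probability for every level of the nominal variable, and the predicted class is the level with the highest probability.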


I will now guide you through the example, to show you how to deal with these data sets. You need the following libraries for the ecosystem:

import time
import matplotlib.pyplot as plt
import numpy as np
# fetch_mldata has been removed from scikit-learn; fetch_openml replaces it.
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils import check_random_state


Set up the results target.

sPicNameOut='../VKHCG/01-Vermeulen/04-Transform/01-EDS/02-Python/Letters.png'

You must tune a few parameters.

t0 = time.time()
train_samples = 5000
Get the sample data from the simulated data lake.
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
Data engineer the data lake data.
X = mnist.data.astype('float64')
y = mnist.target
random_state = check_random_state(0)
permutation = random_state.permutation(X.shape[0])
X = X[permutation]
y = y[permutation]
X = X.reshape((X.shape[0], -1))
Split the data into train and test data sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=train_samples, test_size=10000)
Apply a scaler, fitted on the training set only, to standardize the features.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Turn up the tolerance (tol = 0.1) for faster convergence.
clf = LogisticRegression(C=50. / train_samples,
                         penalty='l2', solver='sag', tol=0.1)
Apply the model to the data set.
clf.fit(X_train, y_train)
sparsity = np.mean(clf.coef_ == 0) * 100
Score the model.
score = clf.score(X_test, y_test)
print('C: %.4f' % clf.C)
print("Sparsity with L2 penalty: %.2f%%" % sparsity)
print("Test score with L2 penalty: %.4f" % score)
coef = clf.coef_.copy()
Display the results.
Fig=plt.figure(figsize=(15, 6))
scale = np.abs(coef).max()
for i in range(10):
    l1_plot = plt.subplot(2, 5, i + 1)
    l1_plot.imshow(coef[i].reshape(28, 28), interpolation='nearest', cmap=plt.cm.RdBu, vmin=-scale, vmax=scale)
    l1_plot.set_xlabel('Digit %i' % i)
plt.suptitle('Classification vector for...')
run_time = time.time() - t0
print('Process run in %.3f s' % run_time)
Save results to disk.
Fig.savefig(sPicNameOut)
plt.show()

You’re performing well with the examples. Now, you can handle data that has no intrinsic order. If you understand this process, you have successfully achieved a major milestone in your understanding of the transformation of data lakes via data vaults, by using a Transform step.


Tip: I normally store any results back into the data warehouse as a sun model, which then gets physicalized as facts and dimensions in the data warehouse.


Ordinal Logistic Regression

Ordinal logistic regression is an extension of binomial logistic regression. It is used to predict a dependent variable that has multiple ordered categories from one or more independent variables.


This data type is an extremely good data set to process, as you already have a relationship between the data entries that is known. Deploying your Transform step’s algorithms will give you insights into how strongly or weakly this relationship supports the data discovery process.
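One simple way to exploit the known ordering is to fit one binary logistic model per threshold ("is the outcome above level k?") and combine the results. The sketch below uses that threshold decomposition on simulated data; it is an illustrative approximation under stated assumptions, not the proportional-odds model that statistical packages usually mean by ordinal logistic regression, and all names and cut points are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: one input and an ordered outcome 0 < 1 < 2 (low, medium, high).
x = rng.uniform(0, 10, size=600)
latent = x + rng.normal(scale=0.5, size=x.size)
y = np.digitize(latent, [3.3, 6.6])
X = x.reshape(-1, 1)
n_classes = 3

# One binary logistic model per threshold k, each estimating P(y > k).
threshold_models = [
    LogisticRegression(max_iter=1000).fit(X, (y > k).astype(int))
    for k in range(n_classes - 1)
]

def predict_ordinal(models, X_new):
    """Combine the threshold models into a single ordered-class prediction."""
    greater = np.column_stack([m.predict_proba(X_new)[:, 1] for m in models])
    ones = np.ones((X_new.shape[0], 1))
    zeros = np.zeros((X_new.shape[0], 1))
    # P(y = k) = P(y > k - 1) - P(y > k), with P(y > -1) = 1 and P(y > K - 1) = 0.
    class_probs = np.hstack([ones, greater]) - np.hstack([greater, zeros])
    return class_probs.argmax(axis=1)

predictions = predict_ordinal(threshold_models, X)
accuracy = np.mean(predictions == y)
sample_predictions = predict_ordinal(threshold_models, np.array([[1.0], [5.0], [9.0]]))
print('Training accuracy: %.3f' % accuracy)
print('Predicted levels for x = 1, 5, 9:', sample_predictions)
```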

For more information on the t-test, see https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html#scipy.stats.t and the scipy.stats.ttest_ind page (generated/scipy.stats.ttest_ind.html) of the SciPy v1.1.0 Reference Guide.


First, you set up the ecosystem, as follows:

import numpy as np
from scipy.stats import ttest_ind, ttest_ind_from_stats
from scipy.special import stdtr


Create a set of “unknown” data. (This can be a set of data you want to analyze.) In the following example, there are five random data sets. You can select them by having nSet equal 1, 2, 3, 4, or 5.

nSet = 1
if nSet==1:
    a = np.random.randn(40)
    b = 4*np.random.randn(50)
if nSet==2:
    pass  # values for this data set are not shown; assign your own arrays a and b
if nSet==3:
    pass  # values for this data set are not shown; assign your own arrays a and b
if nSet==4:
    pass  # values for this data set are not shown; assign your own arrays a and b
if nSet==5:
    a = np.array([55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
    b = np.array([55.0, 56.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
First, you will use scipy’s t-test.
# Use scipy.stats.ttest_ind.
t, p = ttest_ind(a, b, equal_var=False)
print("t-Test_ind: t = %g p = %g" % (t, p))
Second, you will get the descriptive statistics.
# Compute the descriptive statistics of a and b.
abar = a.mean()
avar = a.var(ddof=1)
na = a.size
adof = na - 1
bbar = b.mean()
bvar = b.var(ddof=1)
nb = b.size
bdof = nb - 1
# Use scipy.stats.ttest_ind_from_stats.
t2, p2 = ttest_ind_from_stats(abar, np.sqrt(avar), na,
                              bbar, np.sqrt(bvar), nb,
                              equal_var=False)
print("t-Test_ind_from_stats: t = %g p = %g" % (t2, p2))
Third, you can use Welch’s t-test formula to calculate the test directly.
# Use the formulas directly.
tf = (abar - bbar) / np.sqrt(avar/na + bvar/nb)
dof = (avar/na + bvar/nb)**2 / (avar**2/(na**2*adof) + bvar**2/(nb**2*bdof))
pf = 2*stdtr(dof, -np.abs(tf))
print("Formula: t = %g p = %g" % (tf, pf))
Finally, interpret the p-value.
if p < 0.001:
    print('Statistically highly significant:', p)
elif p < 0.05:
    print('Statistically significant:', p)
else:
    print('No conclusion')
You should see results like this:
t-Test_ind: t = -1.5827 p = 0.118873
t-Test_ind_from_stats: t = -1.5827 p = 0.118873
Formula: t = -1.5827 p = 0.118873


No conclusion

Your results are as follows. The p-value is the probability of seeing a difference at least this large by chance if the two data sets really had the same mean. In this case, it’s about 12%, or p-value = 0.119.


The p-value results can be statistically significant when P < 0.05 and statistically highly significant if P < 0.001 (a less than one-in-a-thousand chance of seeing such a result when there is no real difference).


So, in this case, it cannot be noted as either statistically significant or statistically highly significant, as it is 0.119. Go back and change nSet at the beginning of the code you just entered.

Remember: I mentioned that you can select them by nSet = 1, 2, 3, 4, or 5.


Retest the data sets. You should now see that the p-value changes, and you should also understand that the test gives you a good indicator of whether the two result sets are similar.

Can you find the 99.99%?
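For reference, the direct-formula step of the t-test code (the lines that compute tf, dof, and pf) corresponds to Welch's t-test written out as follows. The symbols are chosen here to mirror the code: \bar{a} and \bar{b} are the sample means (abar, bbar), s_a^2 and s_b^2 the sample variances (avar, bvar), n_a and n_b the sample sizes (na, nb), and F_\nu the cumulative distribution function of Student's t distribution with \nu degrees of freedom (stdtr).

```latex
t = \frac{\bar{a} - \bar{b}}{\sqrt{\dfrac{s_a^2}{n_a} + \dfrac{s_b^2}{n_b}}},
\qquad
\nu = \frac{\left(\dfrac{s_a^2}{n_a} + \dfrac{s_b^2}{n_b}\right)^{2}}
           {\dfrac{s_a^4}{n_a^2\,(n_a - 1)} + \dfrac{s_b^4}{n_b^2\,(n_b - 1)}},
\qquad
p = 2\,F_\nu\!\left(-\lvert t \rvert\right)
```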


Chi-Square Test

A chi-square (or chi-squared, χ²) test is used to examine whether two distributions of categorical variables are significantly different from each other.
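As a small worked case, the following sketch applies SciPy's chi2_contingency to a 2 x 2 table of counts. The counts and their interpretation (two customer groups and two product preferences) are made up for the illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are customer groups, columns are product preferences.
observed = np.array([[30, 10],
                     [10, 30]])

# For a 2 x 2 table, SciPy applies Yates' continuity correction by default.
chi2, p, dof, expected = chi2_contingency(observed)

print('Test statistic:', chi2)
print('p-value:', p)
print('Degrees of freedom:', dof)
print('Expected counts if the groups did not differ:')
print(expected)
```

Every expected count is 20 here, so the observed split of 30/10 against 10/30 is far from what independence would produce, and the p-value is correspondingly small.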


Try these examples that are generated with five different datasets. First, set up the ecosystem.

import numpy as np
import scipy.stats as st
Create data sets.
# Create sample data sets.
nSet=1
if nSet==1:
    a = abs(np.random.randn(50))
    b = abs(50*np.random.randn(50))
if nSet==2:
    pass  # values for this data set are not shown; assign your own arrays a and b
if nSet==3:
    pass  # values for this data set are not shown; assign your own arrays a and b
if nSet==4:
    pass  # values for this data set are not shown; assign your own arrays a and b
if nSet==5:
    a = np.array([55.0, 55.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
    b = np.array([55.0, 56.0, 47.0, 47.0, 55.0, 55.0, 55.0, 63.0])
obs = np.array([a,b])
Perform the test.
chi2, p, dof, expected = st.chi2_contingency(obs)
Display the results.
msg = "Test Statistic : {}\np-value: {}\ndof: {}\n"
print(msg.format(chi2, p, dof))
if p < 0.001:
    print('Statistically highly significant:', p)
elif p < 0.05:
    print('Statistically significant:', p)
else:
    print('No conclusion')

Can you understand what the test indicates as you cycle the nSet through samples 1–5?



Overfitting and Underfitting

Overfitting and underfitting are major problems when data scientists retrieve data insights from the data sets they are investigating.


Overfitting is when the data scientist generates a model that fits a training set perfectly but does not generalize well to an unknown future real-world data set. Because the model is so tightly fitted to the known data set, even the most minor outlier does not get classified correctly.


The solution works only for that specific data set and no other. For example, suppose a rule states that a person who earns more than $150,000 is rich and everyone else is poor. Such a binary classification of rich or poor will not work: can a person earning about $145,000 really be poor?


Underfitting is when the data scientist’s model is so nonspecific that it fails to capture the patterns in the data, so that the predictive models are inappropriately applied and the resulting insights are questionable.


For example, your person classifier has a 48% success rate in determining the sex of a person. That will never work, as with a binary choice you could achieve a 50% success rate by simply guessing.


Your data science must offer a significant level of insight for you to secure the trust of your customers, so they can confidently take business decisions, based on the insights you provide them.
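The difference between the two problems shows up when you compare the score on the training set with the score on data the model has not seen. The following sketch does this with decision trees of different depths on a synthetic data set; the choice of decision trees, the depths, and the data set parameters are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of the labels flipped, so a perfect fit means fitting noise.
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

results = {}
for name, depth in [('depth 1', 1), ('depth 4', 4), ('unlimited depth', None)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[name] = (tree.score(X_train, y_train), tree.score(X_test, y_test))
    print('%-16s train score = %.3f  test score = %.3f'
          % (name, results[name][0], results[name][1]))
```

The unlimited tree memorizes the training set, which is the signature of overfitting, while the single-split tree scores poorly even on the data it was trained on, which is the signature of underfitting.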


Polynomial Features

The polynomial formula is the following: $(a_1 x + b_1)(a_2 x + b_2) = a_1 a_2 x^2 + (a_1 b_2 + a_2 b_1)x + b_1 b_2$.


Polynomial feature extraction can use a chain of polynomial formulas to create a hyperplane that subdivides a data set into the correct cluster groups. The higher the polynomial degree, the more closely the model can fit the training data.
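To see what the extraction actually produces, the following short sketch expands two tiny inputs with scikit-learn's PolynomialFeatures; the input values are arbitrary numbers picked for readability.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One input feature expanded to degree 3: columns are 1, x, x^2, x^3.
one_feature = np.array([[2.0], [3.0]])
cubic = PolynomialFeatures(degree=3).fit_transform(one_feature)
print(cubic)

# Two input features (a, b) expanded to degree 2: columns are 1, a, b, a^2, a*b, b^2.
two_features = np.array([[2.0, 3.0]])
quadratic = PolynomialFeatures(degree=2).fit_transform(two_features)
print(quadratic)
```

A linear model fitted on these expanded columns is linear in the new features but polynomial in the original inputs, which is how the pipeline in the next example fits a curve.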


import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
def f(x):
    """ function to approximate by polynomial interpolation"""
    return x * np.sin(x)
# generate points used to plot
x_plot = np.linspace(0, 10, 100)
# generate points and keep a subset of them
x = np.linspace(0, 10, 100)
rng = np.random.RandomState(0)
rng.shuffle(x)
x = np.sort(x[:20])
y = f(x)
# create matrix versions of these arrays
X = x[:, np.newaxis]
X_plot = x_plot[:, np.newaxis]
colors = ['teal', 'yellowgreen', 'gold']
lw = 2
plt.plot(x_plot, f(x_plot), color='cornflowerblue', linewidth=lw, label="Ground Truth")
plt.scatter(x, y, color='navy', s=30, marker='o', label="training points")
for count, degree in enumerate([3, 4, 5]):
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    model.fit(X, y)
    y_plot = model.predict(X_plot)
    plt.plot(x_plot, y_plot, color=colors[count], linewidth=lw, label="Degree %d" % degree)
plt.legend(loc='lower left')
plt.show()

Now that you know how to generate a polynomial formula to match a curve, I will show you a common issue that arises when you apply it.


Common Data-Fitting Issue

These higher-order polynomial formulas are, however, more prone to overfitting, while lower-order formulas are more likely to underfit. Good data science depends on a delicate balance between these two extremes.


import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
def true_fun(X):
    return np.cos(1.5 * np.pi * X)
n_samples = 30
degrees = [1, 4, 15]
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features), ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    # Evaluate the models using crossvalidation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10)
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(degrees[i], -scores.mean(), scores.std()))
plt.show()


Receiver Operating Characteristic (ROC) Analysis Curves

A receiver operating characteristic (ROC) analysis curve is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.


The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall, or probability of detection.
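The two rates can be computed by hand for a single threshold and then compared with what scikit-learn reports across all thresholds. The four labels and scores below are a toy example, and the threshold of 0.5 is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy example: true classes and the classifier's scores for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Rates at one discrimination threshold.
threshold = 0.5
y_pred = (y_score >= threshold).astype(int)
tp = np.sum((y_pred == 1) & (y_true == 1))
fn = np.sum((y_pred == 0) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))
tpr = tp / (tp + fn)  # true positive rate (sensitivity, recall)
fpr = fp / (fp + tn)  # false positive rate
print('Threshold %.2f: TPR = %.2f, FPR = %.2f' % (threshold, tpr, fpr))

# The ROC curve repeats this for every threshold; AUC summarizes the whole curve.
fpr_curve, tpr_curve, thresholds = roc_curve(y_true, y_score)
auc_value = roc_auc_score(y_true, y_score)
print('FPR values:', fpr_curve)
print('TPR values:', tpr_curve)
print('AUC = %.2f' % auc_value)
```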


You will find the ROC analysis curves useful for evaluating whether your classification or feature engineering is good enough to determine the value of the insights you are finding.


This helps with repeatable results against a real-world data set. So, if you suggest that your customers should take a specific action as a result of your findings, ROC analysis curves will not only support your advice and insights but also convey the quality of the insights at given parameters.


You should now open your Python editor and create the following ecosystem.


import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold
# #########
# Data IO and generation
# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
X, y = X[y != 2], y[y != 2]
n_samples, n_features = X.shape
# Add noisy features
random_state = np.random.RandomState(0)
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
# #########
# Classification and ROC analysis
# Run classifier with cross-validation and plot ROC curves
cv = StratifiedKFold(n_splits=6)
classifier = svm.SVC(kernel='linear', probability=True, random_state=random_state)
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
i = 0
for train, test in cv.split(X, y):
    probas_ = classifier.fit(X[train], y[train]).predict_proba(X[test])
    # Compute the ROC curve and the area under the curve
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    # np.interp is used in place of scipy.interp, which newer SciPy releases no longer provide
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
    plt.plot(fpr, tpr, lw=1, alpha=0.3,
             label='ROC fold %d (AUC = %0.2f)' % (i, roc_auc))
    i += 1
plt.plot([0, 1], [0, 1], linestyle='--', lw=2, color='r', label='Luck', alpha=.8)
mean_tpr = np.mean(tprs, axis=0)
mean_tpr[-1] = 1.0
mean_auc = auc(mean_fpr, mean_tpr)
std_auc = np.std(aucs)
plt.plot(mean_fpr, mean_tpr, color='b',
         label=r'Mean ROC (AUC = %0.2f $\pm$ %0.2f)' % (mean_auc, std_auc), lw=2, alpha=.8)
std_tpr = np.std(tprs, axis=0)
tprs_upper = np.minimum(mean_tpr + std_tpr, 1)
tprs_lower = np.maximum(mean_tpr - std_tpr, 0)
plt.fill_between(mean_fpr, tprs_lower, tprs_upper, color='gray', alpha=.2, label=r'$\pm$ 1 std. dev.')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()


Cross-Validation Test

Cross-validation is a model validation technique for evaluating how the results of a statistical analysis will generalize to an independent data set. It is mostly used in settings where the goal is prediction.


Knowing how to calculate a test such as this enables you to validate the application of your model on real-world (that is, independent) data sets.
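What the test does can be sketched by writing the fold loop out by hand and comparing it with scikit-learn's cross_val_score helper. The model, the iris data set, and the choice of five folds are assumptions picked to keep the sketch small.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop: train on four folds, score on the fold that was held out.
manual_scores = []
for train_idx, test_idx in folds.split(X):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

# The helper performs the same procedure in one call.
library_scores = cross_val_score(model, X, y, cv=folds)

print('Manual scores :', np.round(manual_scores, 3))
print('Library scores:', np.round(library_scores, 3))
print('Mean score: %.3f (+/- %.3f)' % (library_scores.mean(), library_scores.std()))
```

Every observation is used for testing exactly once, and the model that is scored on a fold never saw that fold during training, which is what makes the mean score an estimate of performance on independent data.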


I will guide you through a test. Open your Python editor and create the following ecosystem:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm
import matplotlib.pyplot as plt
digits = datasets.load_digits()
X = digits.data
y = digits.target
Let’s pick three different kernels and compare how they will perform.
kernels=['linear', 'poly', 'rbf']
for kernel in kernels:
    svc = svm.SVC(kernel=kernel)
    C_s = np.logspace(-15, 0, 15)
    scores = list()
    scores_std = list()
    for C in C_s:
        svc.C = C
        this_scores = cross_val_score(svc, X, y, n_jobs=1)
        # Keep the mean and standard deviation of the fold scores for plotting.
        scores.append(np.mean(this_scores))
        scores_std.append(np.std(this_scores))
    # You must plot your results.
    Title="Kernel:>" + kernel
    fig=plt.figure(1, figsize=(8, 6))
    plt.clf()
    fig.suptitle(Title, fontsize=20)
    plt.semilogx(C_s, scores)
    plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
    plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
    locs, labels = plt.yticks()
    plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
    plt.ylabel('Cross-Validation Score')
    plt.xlabel('Parameter C')
    plt.ylim(0, 1.1)
    plt.show()

Well done. You can now perform cross-validation of your results.


Univariate Analysis

Univariate analysis is the simplest form of analyzing data. Uni means “one,” so your data has only one variable. It doesn’t deal with causes or relationships, and its main purpose is to describe. It takes data, summarizes that data, and finds patterns in the data.


The patterns found in univariate data include central tendency (mean, mode, and median) and dispersion (range, variance, maximum, minimum, quartiles, including the interquartile range, and standard deviation).
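These measures can all be read off a single pandas Series. The eight values below are an arbitrary small sample chosen so that the results are easy to check by hand.

```python
import pandas as pd

# A single variable with eight observations (arbitrary illustrative values).
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

summary = {
    'mean': values.mean(),
    'median': values.median(),
    'mode': values.mode().iloc[0],
    'minimum': values.min(),
    'maximum': values.max(),
    'range': values.max() - values.min(),
    'variance': values.var(),   # sample variance (divides by n - 1)
    'std': values.std(),        # sample standard deviation
    'q1': values.quantile(0.25),
    'q3': values.quantile(0.75),
}
summary['iqr'] = summary['q3'] - summary['q1']

for name, value in summary.items():
    print('%-9s %s' % (name, value))
```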



How many students are graduating with a data science degree? You have several options for describing data using a univariate approach. You can use frequency distribution tables, frequency polygons, histograms, bar charts, or pie charts.

# -*- coding: utf-8 -*-
import sys
import os
import pandas as pd
import sqlite3 as sq
import matplotlib.pyplot as plt
import numpy as np
# Assumption: the same base directory is used on every platform; adjust Base to suit your setup.
Base=os.path.expanduser('~') + '/VKHCG'
print('################################')
print('Working Base :',Base, ' using ', sys.platform)
print('################################')
# Assumption: Company matches the 01-Vermeulen directory used in the earlier examples.
Company='01-Vermeulen'
sDataBaseDir=Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName1=sDataBaseDir + '/Vermeulen.db'
conn1 = sq.connect(sDatabaseName1)
sDataVaultDir=Base + '/88-DV'
if not os.path.exists(sDataVaultDir):
    os.makedirs(sDataVaultDir)
sDatabaseName2=sDataVaultDir + '/datavault.db'
conn2 = sq.connect(sDatabaseName2)
sDataWarehouseDir=Base + '/99-DW'
if not os.path.exists(sDataWarehouseDir):
    os.makedirs(sDataWarehouseDir)
sDatabaseName3=sDataWarehouseDir + '/datawarehouse.db'
conn3 = sq.connect(sDatabaseName3)
t=0
PersonRows=[]
for heightSelect in range(100,300,10):
    for weightSelect in range(30,300,5):
        height = round(heightSelect/100,3)
        weight = int(weightSelect)
        bmi = weight/(height*height)
        # Assumption: indicators 2 to 4 continue the numbering started by indicator 1.
        if bmi <= 18.5:
            BMI_Result=1
        elif bmi < 25:
            BMI_Result=2
        elif bmi < 30:
            BMI_Result=3
        else:
            BMI_Result=4
        t+=1
        PersonRows.append({'PersonID': str(t), 'Height': height, 'Weight': weight,
                           'bmi': bmi, 'Indicator': BMI_Result})
PersonFrame = pd.DataFrame(PersonRows)
DimPersonIndex = PersonFrame.set_index('PersonID')
sTable = 'Transform-BMI'
print('Storing :',sDatabaseName1,'\n Table:',sTable)
DimPersonIndex.to_sql(sTable, conn1, if_exists="replace")
sTable = 'Person-Satellite-BMI'
print('\n#################################')
print('Storing :',sDatabaseName2,'\n Table:',sTable)
print('\n#################################')
DimPersonIndex.to_sql(sTable, conn2, if_exists="replace")
sTable = 'Dim-BMI'
print('Storing :',sDatabaseName3,'\n Table:',sTable)
DimPersonIndex.to_sql(sTable, conn3, if_exists="replace")
fig = plt.figure()
# Assumption: one marker style per BMI indicator, plotted as weight against height.
for indicator, marker in zip([1, 2, 3, 4], [".", "o", "+", "^"]):
    PlotPerson = PersonFrame[PersonFrame['Indicator'] == indicator]
    x = PlotPerson['Height']
    y = PlotPerson['Weight']
    plt.plot(x, y, marker)
plt.title("BMI Curve")
plt.xlabel('Height (m)')
plt.ylabel('Weight (kg)')
plt.show()


Now that we have identified the persons at risk, we can study the linear regression of these diabetics.


Note: You will use the standard diabetes data sample set that is installed with the sklearn library, the reason being the protection of medical data.

As this data is in the public domain, you are permitted to access it.

Warning: When you process people’s personal information, you are accountable for any issues your processing causes. So, work with great care.

Note: In the next example, we will use a medical data set that is part of the standard sklearn library. This ensures that you are not working with unauthorized medical results.

You set up the ecosystem, as follows:

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
Load the data set.
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
Perform feature development.
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
Split the data into train and test data sets, without any overlap between the two.
diabetes_X_train = diabetes_X[:-30]
diabetes_X_test = diabetes_X[-30:]
Split the target into train and test data sets.
diabetes_y_train = diabetes.target[:-30]
diabetes_y_test = diabetes.target[-30:]
Generate a linear regression model.
regr = linear_model.LinearRegression()
Train the model using the training sets.
regr.fit(diabetes_X_train, diabetes_y_train)
Create predictions, using the testing set.
diabetes_y_pred = regr.predict(diabetes_X_test)
Display the coefficients.
print('Coefficients: \n', regr.coef_)
Display the mean squared error.
print("Mean squared error: %.2f"
% mean_squared_error(diabetes_y_test, diabetes_y_pred))
Display the variance score. (Tip: A score of 1 is perfect prediction.)
print('Variance score: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
Plot outputs.
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)

Well done. You have successfully calculated the BMI and determined the diabetes rate of our staff.


RANSAC Linear Regression

RANSAC (RANdom SAmple Consensus) is an iterative algorithm for the robust estimation of parameters from a subset of inliers from the complete data set.


An advantage of RANSAC is its ability to do a robust estimation of the model parameters, i.e., it can estimate the parameters with a high degree of accuracy, even when a significant number of outliers is present in the data set. Because the final fit uses the inliers only, the process still finds a usable solution where an ordinary least-squares fit would be pulled off course.


Generally, this technique is used when dealing with image processing, owing to noise in the domain. See http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RANSACRegressor.html.
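The iterative idea can be sketched in a few lines of NumPy before you use the library version. This is a simplified illustration and not the scikit-learn implementation: the data is simulated around the line y = 2x + 1, and the residual threshold of 1.0 and the 200 iterations are assumed values picked for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points, n_outliers = 100, 20

# Simulated data around the true line y = 2x + 1, with the first 20 points corrupted.
x = rng.uniform(0, 10, size=n_points)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=n_points)
y[:n_outliers] = rng.uniform(-20, 20, size=n_outliers)

residual_threshold = 1.0  # assumed tolerance for counting a point as an inlier
best_inliers = np.zeros(n_points, dtype=bool)

for _ in range(200):
    # 1. Pick a minimal random sample: two points define a candidate line.
    i, j = rng.choice(n_points, size=2, replace=False)
    if x[i] == x[j]:
        continue
    slope = (y[j] - y[i]) / (x[j] - x[i])
    intercept = y[i] - slope * x[i]
    # 2. Count the points that agree with the candidate line (the consensus set).
    inliers = np.abs(y - (slope * x + intercept)) < residual_threshold
    # 3. Keep the candidate with the largest consensus set.
    if inliers.sum() > best_inliers.sum():
        best_inliers = inliers

# 4. Refit by least squares on the best consensus set only.
ransac_slope, ransac_intercept = np.polyfit(x[best_inliers], y[best_inliers], 1)
ols_slope, ols_intercept = np.polyfit(x, y, 1)

print('True line           : slope = 2.000, intercept = 1.000')
print('Least squares (all) : slope = %.3f, intercept = %.3f' % (ols_slope, ols_intercept))
print('RANSAC sketch       : slope = %.3f, intercept = %.3f' % (ransac_slope, ransac_intercept))
```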


import numpy as np
from matplotlib import pyplot as plt
from sklearn import linear_model, datasets
n_samples = 1000
n_outliers = 50
X, y, coef = datasets.make_regression(n_samples=n_samples, n_features=1,
                                      n_informative=1, noise=10,
                                      coef=True, random_state=0)
# Add outlier data
np.random.seed(0)
X[:n_outliers] = 3 + 0.5 * np.random.normal(size=(n_outliers, 1))
y[:n_outliers] = -3 + 10 * np.random.normal(size=n_outliers)
# Fit line using all data
lr = linear_model.LinearRegression()
lr.fit(X, y)
# Robustly fit linear model with RANSAC algorithm
ransac = linear_model.RANSACRegressor()
ransac.fit(X, y)
inlier_mask = ransac.inlier_mask_
outlier_mask = np.logical_not(inlier_mask)
# Predict data of estimated models
line_X = np.arange(X.min(), X.max())[:, np.newaxis]
line_y = lr.predict(line_X)
line_y_ransac = ransac.predict(line_X)
# Compare estimated coefficients
print("Estimated coefficients (true, linear regression, RANSAC):")
print(coef, lr.coef_, ransac.estimator_.coef_)
lw = 2
plt.scatter(X[inlier_mask], y[inlier_mask], color='yellowgreen', marker='.',
            label='Inliers')
plt.scatter(X[outlier_mask], y[outlier_mask], color='gold', marker='.', label='Outliers')
plt.plot(line_X, line_y, color='navy', linewidth=lw, label='Linear regressor')
plt.plot(line_X, line_y_ransac, color='cornflowerblue', linewidth=lw,
         label='RANSAC regressor')
plt.legend(loc='lower right')
plt.show()


This regression technique is extremely useful in robotics and robot vision, in which the robot requires the regression of the changes between two data frames or data sets.