Linear Algebra in Python


Linear algebra is a branch of mathematics concerning vector spaces and linear mappings between such spaces. Simply put, it explores line-like relationships. Practically every area of modern science approximates modeling equations with linear algebra. In particular, data science relies on linear algebra for machine learning, mathematical modeling, and high-dimensional problem solving with Python.

 

Vector Spaces

A vector space is a collection of vectors. A vector is any quantity with magnitude and direction that determines the position of one point in space relative to another. Magnitude is the size of an object measured by movement, length, and/or velocity. Vectors can be added and multiplied (by scalars) to form new vectors. A scalar is any quantity with magnitude (size) only. In application, a vector can point anywhere in space.
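As a quick illustration of these two operations, here is a minimal sketch (not one of the chapter's examples) that adds two vectors and scales one by a scalar with numpy; the vectors themselves are arbitrary:

import numpy as np

a = np.array([2, 1])     # a vector with magnitude and direction
b = np.array([-1, 3])    # another vector in the same space
print (a + b)            # vector addition: [1 4]
print (3 * a)            # scalar multiplication: [6 3]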

 

Vector examples include breathing, walking, and displacement. Breathing requires diaphragm muscles to exert a force that has magnitude and direction. Walking requires movement in some direction. Displacement measures how far an object moves in a certain direction.

 

Vector Math


In vector math, a vector is depicted as a directed line segment whose length represents its magnitude, with an arrow indicating the direction from tail to head. The tail is where the line segment begins and the head is where it ends (the arrow). Vectors are the same if they have the same magnitude and direction.

 

To add two vectors a and b, start b where a finishes, and complete the triangle. Visually, start at some point of origin, draw a, start b from head of a, and the result c is a line from tail of a to head of b. The 1st example illustrates vector addition as well as a graphic depiction of the process:

import matplotlib.pyplot as plt, numpy as np
def vector_add(a, b):
return np.add(a, b)
def set_up():
plt.figure()
plt.xlim(-.05, add_vectors[0]+0.4)
plt.ylim(-1.1, add_vectors[1]+0.4)
if __name__ == "__main__":
v1, v2 = np.array([3, -1]), np.array([2, 3])
add_vectors = vector_add(v1, v2)
set_up()
ax = plt.axes()
ax.arrow(0, 0, 3, -1, head_width=0.1, fc='b', ec='b')
ax.text(1.5, -0.35, 'a')
ax.set_facecolor('honeydew')
set_up()
ax = plt.axes()
ax.arrow(0, 0, 3, -1, head_width=0.1, fc='b', ec='b')
ax.arrow(3, -1, 2, 3, head_width=0.1, fc='crimson', ec='crimson')
ax.text(1.5, -0.35, 'a')
ax.text(4, -0.1, 'b')
ax.set_facecolor('honeydew')
set_up()
ax = plt.axes()
ax.arrow(0, 0, 3, -1, head_width=0.1, fc='b', ec='b')
ax.arrow(3, -1, 2, 3, head_width=0.1, fc='crimson', ec='crimson')
ax.arrow(0, 0, 5, 2, head_width=0.1, fc='springgreen',
ec='springgreen')
ax.text(1.5, -0.35, 'a')
ax.text(4, -0.1, 'b')
ax.text(2.3, 1.2, 'a + b')
ax.text(4.5, 2.08, add_vectors, color='fuchsia')
ax.set_facecolor('honeydew')
plt.show()

 

The code begins by importing matplotlib and numpy libraries. Library matplotlib is a plotting library used for high-quality visualization. Library numpy is the fundamental package for scientific computing. It is a wonderful library for working with vectors and matrices. The code continues with two functions—vector_add() and set_up().

 

Function vector_add() adds two vectors. Function set_up() sets up the figure for plotting. The main block begins by creating two vectors and adding them. The remainder of the code demonstrates graphically how vector addition works.

 

First, it creates an axes() object with an arrow representing vector a beginning at the origin (0, 0) and ending at (3, -1). It continues by adding text and a background color. Next, it creates a 2nd axes() object with the same arrow a, but adds arrow b (vector b) starting at (3, -1) with components (2, 3), so it ends at (5, 2). Finally, it creates a 3rd axes() object with the same arrows a and b, but adds arrow c (a + b) starting at (0, 0) and ending at (5, 2).

 

The 2nd example modifies the previous example by using subplots. Subplots divide a figure into an m × n grid for a different visualization experience.

import matplotlib.pyplot as plt, numpy as np
def vector_add(a, b):
return np.add(a, b)
if __name__ == "__main__":
v1, v2 = np.array([3, -1]), np.array([2, 3])
add_vectors = vector_add(v1, v2)
f, ax = plt.subplots(3)
x, y = [0, 3], [0, -1]
ax[0].set_xlim([-0.05, 3.1])
ax[0].set_ylim([-1.1, 0.1])
ax[0].scatter(x,y,s=1)
ax[0].arrow(0, 0, 3, -1, head_width=0.1, head_length=0.07, fc='b', ec='b')
ax[0].text(1.5, -0.35, 'a')
ax[0].set_facecolor('honeydew')
x, y = ([0, 3, 5]), ([0, -1, 2])
ax[1].set_xlim([-0.05, 5.1])
ax[1].set_ylim([-1.2, 2.2])
ax[1].scatter(x,y,s=0.5)
ax[1].arrow(0, 0, 3, -1, head_width=0.2, head_length=0.1, fc='b', ec='b')
ax[1].arrow(3, -1, 2, 3, head_width=0.16, head_length=0.1, fc='crimson', ec='crimson')
ax[1].text(1.5, -0.35, 'a')
ax[1].text(4, -0.1, 'b')
ax[1].set_facecolor('honeydew')
x, y = ([0, 3, 5]), ([0, -1, 2])
ax[2].set_xlim([-0.05, 5.25])
ax[2].set_ylim([-1.2, 2.3])
ax[2].scatter(x,y,s=0.5)
ax[2].arrow(0, 0, 3, -1, head_width=0.15, head_length=0.1, fc='b', ec='b')
ax[2].arrow(3, -1, 2, 3, head_width=0.15, head_length=0.1, fc='crimson', ec='crimson')
ax[2].arrow(0, 0, 5, 2, head_width=0.1, head_length=0.1, fc='springgreen', ec='springgreen')
ax[2].text(1.5, -0.35, 'a')
ax[2].text(4, -0.1, 'b')
ax[2].text(2.3, 1.2, 'a + b')
ax[2].text(4.9, 1.4, add_vectors, color='fuchsia')
ax[2].set_facecolor('honeydew')
plt.tight_layout()
plt.show()

 

The code begins by importing matplotlib and numpy libraries. It continues with the same vector_add() function. The main block creates three subplots with plt.subplots(3) and assigns to f and ax, where f represents the figure and ax represents each subplot (ax[0], ax[1], and ax[2]). Instead of working with one figure, the code builds each subplot by indexing ax. The code uses plt.tight_layout() to automatically align each subplot.

 

The 3rd example adds vector subtraction. Subtracting two vectors is addition with the opposite (negation) of a vector. So, vector a minus vector b is the same as a + (-b). The code example demonstrates vector addition and subtraction for both 2- and 3-D vectors:

import numpy as np
def vector_add(a, b):
return np.add(a, b)
def vector_sub(a, b):
return np.subtract(a, b)
if __name__ == "__main__":
v1, v2 = np.array([3, -1]), np.array([2, 3])
add = vector_add(v1, v2)
sub = vector_sub(v1, v2)
print ('2D vectors:')
print (v1, '+', v2, '=', add)
print (v1, '-', v2, '=', sub)
v1 = np.array([1, 3, -5])
v2 = np.array([2, -1, 3])
add = vector_add(v1, v2)
sub = vector_sub(v1, v2)
print ('\n3D vectors:')
print (v1, '+', v2, '=', add)
print (v1, '-', v2, '=', sub)

 

The code begins by importing the numpy library. It continues with functions vector_add() and vector_sub(), which add and subtract vectors respectively. The main block begins by creating two 2-D vectors and adding and subtracting them. It continues by adding and subtracting two 3-D vectors. Vectors of any dimension can be added and subtracted in the same manner.

 

Magnitude is measured by the distance formula. The magnitude of a single vector is measured from the origin (0, 0) to the vector. Magnitude between two vectors is measured from the 1st vector to the 2nd vector. For 2-D vectors (x1, y1) and (x2, y2), the distance formula is sqrt((x2 - x1)^2 + (y2 - y1)^2).
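A minimal sketch of the distance formula follows; the vectors are the same ones used in the magnitude example later in the chapter:

import math

def dist(v, w):
    return math.sqrt((w[0]-v[0])**2 + (w[1]-v[1])**2)

print (dist([0, 0], [3, 4]))    # magnitude of [3, 4] from the origin: 5.0
print (dist([2, 3], [5, 8]))    # magnitude between two vectors: ~5.83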

 

Matrix Math


A matrix is an array of numbers. Many operations can be performed on a matrix such as addition, subtraction, negation, multiplication, and division. The dimension of a matrix is its size in a number of rows and columns in that order. That is, a 2 × 3 matrix has two rows and three columns. Generally, an m × n matrix has m rows and n columns.

 

An element is an entry in a matrix. Specifically, an element in row i and column j of matrix A is denoted as a(i, j). Finally, a vector in a matrix is typically viewed as a column. So, a 2 × 3 matrix has three vectors (columns), each with two elements. This is a very important concept to understand when performing matrix multiplication and/or using matrices in data science algorithms.
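A minimal sketch of this notation (the 2 × 3 matrix here is hypothetical):

import numpy as np

A = np.matrix([[1, 2, 3], [4, 5, 6]])    # a 2 x 3 matrix
print (A[0, 2])     # element in row 0, column 2: 3
print (A[:, 1])     # the 2nd column viewed as a vector
print (A.shape)     # dimension: (2, 3)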

 

The 1st code example creates a numpy matrix, multiplies it by a scalar, calculates means row- and column-wise, creates a numpy matrix from numpy arrays, and displays it by row and element:

import numpy as np
def mult_scalar(m, s):
matrix = np.empty(m.shape)
m_shape = m.shape
for i, v in enumerate(range(m_shape[0])):
result = [x * s for x in m[v]]
x = np.array(result[0])
matrix[i] = x
return matrix
def display(m):
s = np.shape(m)
cols = s[1]
for i, row in enumerate(m):
print ('row', str(i) + ':', row, 'elements:', end=' ')
for col in range(cols):
print (row[col], end=' ')
print ()
if __name__ == "__main__":
v1, v2, v3 = [1, 7, -4], [2, -3, 10], [3, 5, 6]
A = np.matrix([v1, v2, v3])
print ('matrix A:\n', A)
scalar = 0.5
B = mult_scalar(A, scalar)
print ('\nmatrix B:\n', B)
mu_col = np.mean(A, axis=0, dtype=np.float64)
print ('\nmean A (column-wise):\n', mu_col)
mu_row = np.mean(A, axis=1, dtype=np.float64)
print ('\nmean A (row-wise):\n', mu_row)
print ('\nmatrix C:')
C = np.array([[2, 14, -8], [4, -6, 20], [6, 10, 12]])
print (C)
print ('\ndisplay each row and element:')
display(C)

 

The code begins by importing numpy. It continues with two functions—mult_scalar() and display(). Function mult_scalar() multiplies a matrix by a scalar. Function display() displays a matrix by row and each element of a row. The main block creates three vectors and adds them to numpy matrix A. B is created by multiplying scalar 0.5 by A. Next, means for A are calculated by column and row. Finally, numpy matrix C is created from three numpy arrays and displayed by row and element.

 

The 2nd code example creates a numpy matrix A, sums its columns and rows, calculates the dot product of two vectors, and calculates the dot product of two matrices. The dot product multiplies two vectors to produce a scalar that can be used to compute lengths of vectors and angles between vectors. Specifically, the dot product of two 2-D vectors a and b is ax × bx + ay × by, where ax and ay are the components of a, and bx and by are the components of b.

 

For matrix multiplication, dot product produces matrix C from two matrices A and B. However, two vectors cannot be multiplied when both are viewed as column matrices. To rectify this problem, transpose the 1st vector from A, turning it into a 1 × n row matrix so it can be multiplied by the 1st vector from B and summed.

 

The product is now well defined because the product of a 1 × n matrix with an n × 1 matrix is a 1 × 1 matrix (a scalar). To get the dot product, repeat this process for the remaining vectors from A and B. Numpy includes a handy function that calculates dot product for you, which greatly simplifies matrix multiplication.
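Before the full example below, a quick sketch (not part of the book's listing) verifies the 1 × n by n × 1 idea with the same matrices A and B used in that example: the 1st row of A dotted with the 1st column of B reproduces element C[0, 0] of the numpy dot product.

import numpy as np

A = np.matrix([[1, 7, -4], [2, -3, 10], [3, 5, 6]])
B = np.matrix([[-2, 5, 4], [1, 2, 9], [10, -9, 3]])
C = A.dot(B)
# 1st row of A dotted with 1st column of B reproduces C[0, 0]
manual = sum(A[0, k] * B[k, 0] for k in range(3))
print (manual, '==', C[0, 0])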

import numpy as np
def sum_cols(matrix):
return np.sum(matrix, axis=0)
def sum_rows(matrix):
return np.sum(matrix, axis=1)
def dot(v, w):
return np.dot(v, w)
if __name__ == "__main__":
v1, v2, v3 = [1, 7, -4], [2, -3, 10], [3, 5, 6]
A = np.matrix([v1, v2, v3])
print ('matrix A:\n', A)
v_cols = sum_cols(A)
print ('\nsum A by column:\n', v_cols)
v_rows = sum_rows(A)
print ('\nsum A by row:\n', v_rows)
dot_product = dot(v1, v2)
print ('\nvector 1:', v1)
print ('vector 2:', v2)
print ('\ndot product v1 and v2:')
print (dot_product)
v1, v2, v3 = [-2, 5, 4], [1, 2, 9], [10, -9, 3]
B = np.matrix([v1, v2, v3])
print ('\nmatrix B:\n', B)
C = A.dot(B)
print ('\nmatrix C (dot product A and B):\n', C)
print ('\nC by row:')
for i, row in enumerate(C):
print ('row', str(i) + ': ', end='')
for v in np.nditer(row):
print (v, end=' ')
print()

 

The code begins by importing numpy. It continues with three functions— sum_cols(), sum_rows(), and dot(). Function sum_cols() sums each column and returns a row with these values. Function sum_rows() sums each row and returns a column with these values. Function dot() calculates the dot product.

 

The main block begins by creating three vectors that are then used to create matrix A. Columns and rows are summed for A. Dot product is then calculated for two vectors (v1 and v2). Next, three new vectors are created that are then used to create matrix B. Matrix C is created by calculating the dot product for A and B. Finally, each row of C is displayed.

 

The 3rd code example illuminates a realistic scenario. Suppose a company sells three types of pies: beef, chicken, and vegetable. Beef pies cost $3 each, chicken pies cost $4 each, and vegetable pies cost $2 each. The vector representation for pie cost is [3, 4, 2]. You also know sales by pie for Monday through Thursday.

 

Beef sales are 13 for Monday, 9 for Tuesday, 7 for Wednesday, and 15 for Thursday. The vector for beef sales is thereby [13, 9, 7, 15]. Using the same logic, the vectors for chicken and vegetable sales are [8, 7, 4, 6] and [6, 4, 0, 3], respectively. The goal is to calculate total sales for four days (Monday–Thursday).

import numpy as np
def dot(v, w):
return np.dot(v, w)
def display(m):
for i, row in enumerate(m):
print ('total sales by day:\n', end='')
for v in np.nditer(row):
print (v, end=' ')
print()
if __name__ == "__main__":
a = [3, 4, 2]
A = np.matrix([a])
print ('cost matrix A:\n', A)
v1, v2, v3 = [13, 9, 7, 15], [8, 7, 4, 6], [6, 4, 0, 3]
B = np.matrix([v1, v2, v3])
print ('\ndaily sales by item matrix B:\n', B)
C = A.dot(B)
print ('\ndot product matrix C:\n', C, '\n')
display(C)

 

The code begins by importing numpy. It continues with function dot() that calculates the dot product, and function display() that displays the elements of a matrix, row by row. The main block begins by creating a vector that holds the cost of each type of pie. It continues by converting the vector into matrix A. Next, three vectors are created that represent sales for each type of pie for Monday through Thursday.

 

The code continues by converting the three vectors into matrix B. Matrix C is created by finding the dot product of A and B. This scenario demonstrates how dot product can be used for solving business problems.

 

The 4th code example calculates the magnitude (distance) and direction (angle) with a single vector and between two vectors:

import math, numpy as np
def sqrt_sum_squares(ls):
return math.sqrt(sum(map(lambda x:x*x,ls)))
def mag(v):
return np.linalg.norm(v)
def a_tang(v):
return math.degrees(math.atan(v[1]/v[0]))
def dist(v, w):
return math.sqrt(((w[0]-v[0])** 2) + ((w[1]-v[1])** 2))
def mags(v, w):
return np.linalg.norm(v - w)
def a_tangs(v, w):
val = (w[1] - v[1]) / (w[0] - v[0])
return math.degrees(math.atan(val))
if __name__ == "__main__":
v = np.array([3, 4])
print ('single vector', str(v) + ':')
print ('magnitude:', sqrt_sum_squares(v))
print ('NumPY magnitude:', mag(v))
print ('direction:', round(a_tang(v)), 'degrees\n')
v1, v2 = np.array([2, 3]), np.array([5, 8])
print ('two vectors', str(v1) + ' and ' + str(v2) + ':')
print ('magnitude', round(dist(v1, v2),2))
print ('NumPY magnitude:', round(mags(v1, v2),2))
print ('direction:', round(a_tangs(v1, v2)), 'degrees\n')
v1, v2 = np.array([0, 0]), np.array([3, 4])
print ('use origin (0,0) as 1st vector:')
print ('"two vectors', str(v1) + ' and ' + str(v2) + '"')
print ('magnitude:', round(mags(v1, v2),2))
print ('direction:', round(a_tangs(v1, v2)), 'degrees')

 

The code begins by importing math and numpy libraries. It continues with six functions. Function sqrt_sum_squares() calculates magnitude for one vector from scratch. Function mag() does the same but uses numpy.

 

Function a_tang() calculates the arctangent of a vector, which is the direction (angle) of a vector from the origin (0,0). Function dist() calculates magnitude between two vectors from scratch. Function mags() does the same but uses numpy. Function a_tangs() calculates the arctangent of two vectors. The main block creates a vector, calculates magnitude and direction, and displays. Next, magnitude and direction are calculated and displayed for two vectors.

 

Finally, magnitude and direction for a single vector are calculated using the two vector formulas. This is accomplished by using the origin (0,0) as the 1st vector. So, functions that calculate magnitude and direction for a single vector are not needed, because any single vector always begins from the origin (0,0). Therefore, a vector is simply a point in space measured either from the origin (0,0) or in relation to another vector by magnitude and direction.
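One caveat: math.atan() only returns angles between -90 and 90 degrees, so vectors that point to the left of the y-axis are reported with the wrong direction. A minimal sketch of an alternative (not used in the chapter's code) relies on math.atan2(), which accounts for the quadrant:

import math

def direction(v):
    # atan2 takes y then x and returns the angle in the correct quadrant
    return math.degrees(math.atan2(v[1], v[0]))

print (direction([3, 4]))      # ~53 degrees
print (direction([-3, -4]))    # ~-127 degrees; atan() alone would report ~53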

 

Basic Matrix Transformations


The 1st code example introduces the identity matrix, which is a square matrix with ones on the main diagonal and zeros elsewhere. The product of matrix A and its identity matrix is A, which is important mathematically because the identity property of multiplication states that any number multiplied by 1 is equal to itself.

import numpy as np
def slice_row(M, i):
return M[i,:]
def slice_col(M, j):
return M[:, j]
def to_int(M):
return M.astype(np.int64)
if __name__ == "__main__":
A = [[1, 9, 3, 6, 7],
[4, 8, 6, 2, 1],
[9, 8, 7, 1, 2],
[1, 1, 9, 2, 4],
[9, 1, 1, 3, 5]]
A = np.matrix(A)
print ('A:\n', A)
print ('\n1st row: ', slice_row(A, 0))
print ('\n3rd column:\n', slice_col(A, 2))
shapeA = np.shape(A)
I = np.identity(np.shape(A)[0])
I = to_int(I)
print ('\nI:\n', I)
dot_product = np.dot(A, I)
print ('\nA * I = A:\n', dot_product)
print ('\nA\':\n', A.I)
A_by_Ainv = np.round(np.dot(A, A.I), decimals=0, out=None)
A_by_Ainv = to_int(A_by_Ainv)
print ('\nA * A\':\n', A_by_Ainv)
The code begins by importing numpy. It continues with three functions. Function slice_row() slices a row from a matrix. Function slice_col() slices a column from a matrix. Function to_int() converts matrix elements to integers. The main block begins by creating matrix A. It continues by creating the identity matrix for A. Finally, it recreates the identity matrix by taking the dot product of A with A' (the inverse of A).

The 2nd code example converts a list of lists into a numpy matrix and traverses it:
import numpy as np
if __name__ == "__main__":
data = [
[41, 72, 180], [27, 66, 140],
[18, 59, 101], [57, 72, 160],
[21, 59, 112], [29, 77, 250],
[55, 60, 120], [28, 72, 110],
[19, 59, 99], [32, 68, 125],
[31, 79, 322], [36, 69, 111]
]
A = np.matrix(data)
print ('manual traversal:')
for p in range(A.shape[0]):
for q in range(A.shape[1]):
print (A[p,q], end=' ')
print ()

 

The code begins by importing numpy. The main block begins by creating a list of lists, converting it into numpy matrix A, and traversing A. Although I have demonstrated several methods for traversing a numpy matrix, this is my favorite method.

The 3rd code example converts a list of lists into numpy matrix A and then slices and dices it:
import numpy as np
if __name__ == "__main__":
points_3D_space = [
[0, 0, 0],
[1, 2, 3],
[2, 2, 2],
[9, 9, 9] ]
A = np.matrix(points_3D_space)
print ('slice entire A:')
print (A[:])
print ('\nslice 2nd column:')
print (A[0:4, 1])
print ('\nslice 2nd column (alt method):')
print (A[:, 1])
print ('\nslice 2nd & 3rd value 3rd column:')
print (A[1:3, 2])
print ('\nslice last row:')
print (A[-1])
print ('\nslice last row (alt method):')
print (A[3])
print ('\nslice 1st row:')
print (A[0, :])
print ('\nslice 2nd row; 2nd & 3rd value:')
print (A[1, 1:3])

 

The code begins by importing numpy. The main block begins by creating a list of lists and converting it into numpy matrix A. The code continues by slicing and dicing the matrix.

 

Pandas Matrix Applications


The pandas library provides high-performance, easy-to-use data structures and analysis tools. The most commonly used pandas object is a DataFrame (df). A df is a 2-D structure with labeled axes (row and column) of potentially different types. Math operations align on both row and column labels. A df can be conceptualized by column or row. To operate column-wise, use axis=0 or axis='index'. To operate row-wise, use axis=1 or axis='columns'. This may seem counterintuitive when working with rows, but this is the way pandas implemented this feature.
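A minimal sketch of the axis convention, using a small hypothetical df:

import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
print (df.sum(axis=0))    # column-wise: a=4, b=6
print (df.sum(axis=1))    # row-wise: 3 and 7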

 

A pandas df is much easier to work with than a numpy matrix, but it is also less efficient. That is, it takes a lot more resources to process a pandas df. The numpy library is optimized for processing large amounts of data and numerical calculations.
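If you want to see the difference yourself, a rough sketch like the following compares summing a column with numpy and with pandas; the array size and column names are arbitrary, and timings vary by machine:

import numpy as np, pandas as pd, timeit

A = np.random.rand(100000, 3)
df = pd.DataFrame(A, columns=['x', 'y', 'z'])
print ('numpy:', timeit.timeit(lambda: A[:, 0].sum(), number=1000))
print ('pandas:', timeit.timeit(lambda: df['x'].sum(), number=1000))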

 

The 1st example creates a list of lists, places it into a pandas df, and displays some data:

import pandas as pd
if __name__ == "__main__":
data = [
[41, 72, 180], [27, 66, 140],
[18, 59, 101], [57, 72, 160],
[21, 59, 112], [29, 77, 250],
[55, 60, 120], [28, 72, 110],
[19, 59, 99], [32, 68, 125],
[31, 79, 322], [36, 69, 111]
]
headers = ['age', 'height', 'weight']
df = pd.DataFrame(data, columns=headers)
n = 3
print ('First', n, '"df" rows:\n', df.head(n))
print ('\nFirst "df" row:')
print (df[0:1])
print ('\nRows 2 through 4')
print (df[2:5])
print ('\nFirst', n, 'rows "age" column')
print (df[['age']].head(n))
print ('\nLast', n, 'rows "weight" and "age" columns')
print (df[['weight', 'age']].tail(n))
print ('\nRows 3 through 6 "weight" and "age" columns')
print (df.loc[3:6, ['weight', 'age']])

 

The code begins by importing pandas. The main block begins by creating a list of lists and adding it to a pandas df. It is a good idea to create your own headers as we do here. Methods head() and tail() automatically display the 1st five records and last five records, respectively, unless a value is included.

 

In this case, we display the 1st and last three records. Using head() and tail() is very useful, especially with a large df. Notice how easy it is to slice and dice the df. Also, notice how easy it is to display column data of your choice.

 

The 2nd example creates a list of lists, places it into numpy matrix A, and puts A into a pandas df. This ability is very important because it shows how easy it is to create a df from a numpy matrix. So, you can be working with numpy matrices for precision and performance, and then convert to pandas for slicing, dicing, and other operations.

import pandas as pd, numpy as np
if __name__ == "__main__":
data = [
[41, 72, 180], [27, 66, 140],
[18, 59, 101], [57, 72, 160],
[21, 59, 112], [29, 77, 250],
[55, 60, 120], [28, 72, 110],
[19, 59, 99], [32, 68, 125],
[31, 79, 322], [36, 69, 111]
]
A = np.matrix(data)
headers = ['age', 'height', 'weight']
df = pd.DataFrame(A, columns=headers)
print ('Entire "df":')
print (df, '\n')
print ('Sliced by "age" and "height":')
print (df[['age', 'height']])

 

The code begins by importing pandas and numpy. The main block begins by creating a list of lists, converting it to numpy matrix A, and then adding A to a pandas df.

 

The 3rd example creates a list of lists, places it into a list of dictionary elements, and puts it into a pandas df. This ability is also very important because dictionaries are very efficient data structures when working with data science applications.

import pandas as pd
if __name__ == "__main__":
data = [
[41, 72, 180], [27, 66, 140],
[18, 59, 101], [57, 72, 160],
[21, 59, 112], [29, 77, 250],
[55, 60, 120], [28, 72, 110],
[19, 59, 99], [32, 68, 125],
[31, 79, 322], [36, 69, 111]
]
d = {}
dls = []
key = ['age', 'height', 'weight']
for row in data:
for i, num in enumerate(row):
d[key[i]] = num
dls.append(d)
d = {}
df = pd.DataFrame(dls)
print ('dict elements from list:')
for row in dls:
print (row)
print ('\nheight from 1st dict element is:', end=' ')
print (dls[0]['height'])
print ('\n"df" converted from dict list:\n', df)
print ('\nheight 1st df element:\n', df[['height']].head(1))
The 4th code example creates two lists of lists—data and scores. The data list holds ages, heights, and weights for 12 athletes. The scores list holds three exam scores for 12 students. The data list is put directly into df1, and the scores list is put directly into df2. Averages are computed and displayed.
import pandas as pd, numpy as np
if __name__ == "__main__":
data = [
[41, 72, 180], [27, 66, 140],
[18, 59, 101], [57, 72, 160],
[21, 59, 112], [29, 77, 250],
[55, 60, 120], [28, 72, 110],
[19, 59, 99], [32, 68, 125],
[31, 79, 322], [36, 69, 111]
]
scores = [
[99, 90, 88], [77, 66, 81], [78, 77, 83],
[75, 72, 79], [88, 77, 93], [88, 77, 94],
[100, 99, 93], [94, 74, 90], [98, 97, 99],
[73, 68, 77], [55, 50, 68], [36, 77, 90]
]
n = 3
key1 = ['age', 'height', 'weight']
df1 = pd.DataFrame(data, columns=key1)
print ('df1 slice:\n', df1.head(n))
avg_cols = df1.apply(np.mean, axis=0)
print ('\naverage by columns:')
print (avg_cols)
avg_wt = df1[['weight']].apply(np.mean, axis='index')
print ('\naverage weight')
print (avg_wt)
key2 = ['exam1', 'exam2', 'exam3']
df2 = pd.DataFrame(scores, columns=key2)
print ('\ndf2 slice:\n', df2.head(n))
avg_scores = df2.apply(np.mean, axis=1)
print ('\naverage scores for 1st', n, 'students (rows):')
print (avg_scores.head(n))
avg_slice = df2[['exam1','exam3']].apply(np.mean, axis='columns')
print ('\naverage "exam1" & "exam3" 1st', n, 'students (rows):')
print (avg_slice[0:n])

The code begins by importing pandas and numpy. The main block creates the data and scores lists and puts them in df1 and df2, respectively. With df1 (data), we average by column because our goal is to return the average age, height, and weight for all athletes. With df2 (scores), we average by row because our goal is to return the average overall exam score for each student. We could average by column for df2 if the goal is to calculate the average overall score for one of the exams. Try this if you wish.

 

Gradient Descent


Gradient descent (GD) is an algorithm that minimizes (or maximizes) functions. To apply, start at an initial set of a function’s parameter values and iteratively move toward a set of parameter values that minimize the function. Iterative minimization is achieved using calculus by taking steps in the negative direction of the function’s gradient. GD is important because optimization is a big part of machine learning. Also, GD is easy to implement, generic, and efficient (fast).

 

Simple Function Minimization (and Maximization)

GD is a 1st order iterative optimization algorithm for finding the minimum of a function f. A function can be denoted as f or f(x). Simply, GD finds the minimum error by minimizing (or maximizing) a cost function. A cost function is something that you want to minimize.

 

Let's begin with a minimization example. To find the local minimum of f, take steps proportional to the negative of the gradient of f at the current point. The gradient is the derivative (rate of change) of f. The main weakness of GD is that it finds a local minimum rather than the global minimum of the function.

The power rule is used to differentiate functions of the form f(x) = x^n:

d/dx x^n = n*x^(n-1)

So, the derivative of x^n equals n*x^(n-1). Simply, the derivative is the product of the exponent and x raised to the exponent reduced by 1. To minimize f(x) = x^4 - 3x^3 + 2, find the derivative, which is f'(x) = 4x^3 - 9x^2. So, the 1st step is always to find the derivative f'(x). The 2nd step is to plot the original function to get an idea of its shape. The 3rd step is to run GD. The 4th step is to plot the local minimum.
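Before running GD, it can also help to sanity-check the derivative numerically. The following small sketch (not part of the chapter's example) compares f'(2) to a central-difference approximation; the step size h and the test point are assumptions:

def f(x):
    return x**4 - 3 * x**3 + 2

def df(x):
    return 4 * x**3 - 9 * x**2

h, x = 1e-6, 2.0
approx = (f(x + h) - f(x - h)) / (2 * h)    # central-difference estimate
print (df(x), '~', approx)                  # both close to -4.0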

 

The 1st example finds the local minimum of f(x) and displays f(x), f'(x), and the GD path:

import matplotlib.pyplot as plt, numpy as np
def f(x):
return x**4 - 3 * x**3 + 2
def df(x):
return 4 * x**3 - 9 * x**2
if __name__ == "__main__":
x = np.arange(-5, 5, 0.2)
y, y_dx = f(x), df(x)
f, axarr = plt.subplots(3, sharex=True)
axarr[0].plot(x, y, color='mediumspringgreen')
axarr[0].set_xlabel('x')
axarr[0].set_ylabel('f(x)')
axarr[0].set_title('f(x)')
axarr[1].plot(x, y_dx, color='coral')
axarr[1].set_xlabel('x')
axarr[1].set_ylabel('dy/dx(x)')
axarr[1].set_title('derivative of f(x)')
axarr[2].set_xlabel('x')
axarr[2].set_ylabel('GD')
axarr[2].set_title('local minimum')
iterations, cur_x, gamma, precision = 0, 6, 0.01, 0.00001
previous_step_size = cur_x
while previous_step_size > precision:
prev_x = cur_x
cur_x += -gamma * df(prev_x)
previous_step_size = abs(cur_x - prev_x)
iterations += 1
axarr[2].plot(prev_x, cur_x, "o")
f.subplots_adjust(hspace=0.3)
f.tight_layout()
plt.show()
print ('minimum:', cur_x, '\niterations:', iterations)

 

The code example begins by importing matplotlib and numpy. It continues with function f(x) used to plot the original function and function df(x) used to plot the derivative. The main block begins by creating values for f(x). It continues by creating a subplot. GD begins by initializing variables. Variable cur_x is the starting point for the simulation. Variable gamma is the step size.

 

Variable precision is the tolerance. Smaller tolerance translates into more precision, but requires more iterations (resources). The simulation continues while previous_step_size is greater than precision. Each iteration multiplies -gamma (the step size) by the gradient (derivative) at the current point to move toward the local minimum. Variable previous_step_size is then assigned the difference between cur_x and prev_x.

 

Each point is plotted. The minimum for f(x) solving for x is approximately 2.25. I know this result is correct because I calculated it by hand. Check out the lesson "How to Find Local Extrema with the First Derivative Test" on the Calculus For Dummies site for a nice explanation of how to calculate it by hand.

 

The 2nd example finds the local minimum and maximum of f(x) = x^3 - 6x^2 + 9x + 15. First find f'(x), which is 3x^2 - 12x + 9. Next, find and plot the local minimum, then find and plot the local maximum. I don't use a subplot in this case because the visualization is not as rich. That is, it is much easier to see the approximate local minimum and maximum by looking at a plot of f(x), and easier to see how the GD process works its magic.

import matplotlib.pyplot as plt, numpy as np
def f(x):
return x**3 - 6 * x**2 + 9 * x + 15
def df(x):
return 3 * x**2 - 12 * x + 9
if __name__ == "__main__":
x = np.arange(-0.5, 5, 0.2)
y = f(x)
plt.figure('f(x)')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.title('f(x)')
plt.plot(x, y, color='blueviolet')
plt.figure('local minimum')
plt.xlabel('x')
plt.ylabel('GD')
plt.title('local minimum')
iterations, cur_x, gamma, precision = 0, 6, 0.01, 0.00001
previous_step_size = cur_x
while previous_step_size > precision:
prev_x = cur_x
cur_x += -gamma * df(prev_x)
previous_step_size = abs(cur_x - prev_x)
iterations += 1
plt.plot(prev_x, cur_x, "o")
local_min = cur_x
print ('minimum:', local_min, 'iterations:', iterations)
plt.figure('local maximum')
plt.xlabel('x')
plt.ylabel('GD')
plt.title('local maximum')
iterations, cur_x, gamma, precision = 0, 0.5, 0.01, 0.00001
previous_step_size = cur_x
while previous_step_size > precision:
prev_x = cur_x
cur_x += -gamma * -df(prev_x)
previous_step_size = abs(cur_x - prev_x)
iterations += 1
plt.plot(prev_x, cur_x, "o")
local_max = cur_x
print ('maximum:', local_max, 'iterations:', iterations)
plt.show()

The code begins by importing matplotlib and numpy libraries. It continues with functions f(x) and df(x), which represent the original function and its derivative algorithmically. The main block begins by creating data for f(x) and plotting it. It continues by finding the local minimum and maximum, and plotting them.

 

Notice the cur_x (the beginning point) for local minimum is 6, while it is 0.5 for local maximum. This is where data science is more of an art than a science, because I found these points by trial and error. Also notice that GD for the local maximum is the negation of the derivative. Again, I know that the results are correct because I calculated both local minimum and maximum by hand.

 

The main reason that I used separate plots rather than a subplot for this example is to demonstrate why it is so important to plot f(x). Just by looking at the plot, you can tell that the local maximum of f(x) occurs near x = 1 and the local minimum occurs near x = 3. You can also see from this plot that the function keeps increasing to the right, so the local maximum is not the overall maximum of the function.

 

Sigmoid Function Minimization (and Maximization)


A sigmoid function is a mathematical function with an S-shaped or sigmoid curve. It is very important in data science for several reasons. First, it is easily differentiable with respect to network parameters, which are pivotal in training neural networks. Second, the cumulative distribution functions for many common probability distributions are sigmoidal.

 

Third, many natural processes (e.g., complex learning curves) follow a sigmoidal curve over time. So, a sigmoid function is often used if no specific mathematical model is available. The 1st example finds the local minimum of the sigmoid function:

import matplotlib.pyplot as plt, numpy as np
def sigmoid(x):
return 1 / (1 + np.exp(-x))
def df(x):
return x * (1-x)
if __name__ == "__main__":
x = np.arange(-10., 10., 0.2)
y, y_dx = sigmoid(x), df(x)
f, axarr = plt.subplots(3, sharex=True)
axarr[0].plot(x, y, color='lime')
axarr[0].set_xlabel('x')
axarr[0].set_ylabel('f(x)')
axarr[0].set_title('Sigmoid Function')
axarr[1].plot(x, y_dx, color='coral')
axarr[1].set_xlabel('x')
axarr[1].set_ylabel('dy/dx(x)')
axarr[1].set_title('Derivative of f(x)')
axarr[2].set_xlabel('x')
axarr[2].set_ylabel('GD')
axarr[2].set_title('local minimum')
iterations, cur_x, gamma, precision = 0, 0.01, 0.01, 0.00001
previous_step_size = cur_x
while previous_step_size > precision:
prev_x = cur_x
cur_x += -gamma * df(prev_x)
previous_step_size = abs(cur_x - prev_x)
iterations += 1
plt.plot(prev_x, cur_x, "o")
f.subplots_adjust(hspace=0.3)
f.tight_layout()
print ('minimum:', cur_x, '\niterations:', iterations)
plt.show()

 

The code begins by importing matplotlib and numpy. It continues with functions sigmoid(x) and df(x), which represent the sigmoid function and its derivative algorithmically. The main block begins by creating data for f(x) and f'(x). It continues by creating subplots for f(x), f'(x), and the local minimum. In this case, using subplots was fine for visualization. It is easy to see from the f(x) and f'(x) plots  that the local minimum is close to 0. Next, the code runs GD to find the local minimum and plots it.

 

Again, the starting point for GD, cur_x, was found by trial and error. If you start cur_x further from the local minimum (you can estimate this by looking at the subplot of f'(x)), the number of iterations increases because it takes longer for the GD algorithm to converge on the local minimum. As expected, the local minimum is approximately 0.

 

The 2nd example finds the local maximum of the sigmoid function:

import matplotlib.pyplot as plt, numpy as np
def sigmoid(x):
return 1 / (1 + np.exp(-x))
def df(x):
return x * (1-x)
if __name__ == "__main__":
x = np.arange(-10., 10., 0.2)
y, y_dx = sigmoid(x), df(x)
f, axarr = plt.subplots(3, sharex=True)
axarr[0].plot(x, y, color='lime')
axarr[0].set_xlabel('x')
axarr[0].set_ylabel('f(x)')
axarr[0].set_title('Sigmoid Function')
axarr[1].plot(x, y_dx, color='coral')
axarr[1].set_xlabel('x')
axarr[1].set_ylabel('dy/dx(x)')
axarr[1].set_title('Derivative of f(x)')
axarr[2].set_xlabel('x')
axarr[2].set_ylabel('GD')
axarr[2].set_title('local maximum')
iterations, cur_x, gamma, precision = 0, 0.01, 0.01, 0.00001
previous_step_size = cur_x
while previous_step_size > precision:
prev_x = cur_x
cur_x += -gamma * -df(prev_x)
previous_step_size = abs(cur_x - prev_x)
iterations += 1
plt.plot(prev_x, cur_x, "o")
f.subplots_adjust(hspace=0.3)
f.tight_layout()
print ('maximum:', cur_x, '\niterations:', iterations)
plt.show()

 

The code begins by importing matplotlib and numpy. It continues with functions sigmoid(x) and df(x), which represent the sigmoid function and its derivative algorithmically. The main block begins by creating data for f(x) and f'(x). It continues by creating subplots for f(x), f'(x), and the local maximum.

 

It is easy to see from the f(x) plot that the local maximum is close to 1. Next, the code runs GD to find the local maximum and plots it. Again, the starting point for GD, cur_x, was found by trial and error. If you start cur_x further from the local maximum (you can estimate this by looking at the subplot of f(x)), the number of iterations increases because it takes longer for the GD algorithm to converge on the local maximum. As expected, the local maximum is approximately 1.

 

Euclidean Distance Minimization Controlling for Step Size


Euclidean distance is the ordinary straight-line distance between two points in Euclidean space. With this distance, Euclidean space becomes a metric space. The associated norm is the Euclidean norm (EN). The EN assigns each vector the length of its arrow. So, EN is really just the magnitude of a vector. A vector space on which a norm is defined is a normed vector space.
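A minimal sketch of the norm and the Euclidean distance, using the same numpy and scipy functions that appear in the example below (the vectors themselves are hypothetical):

import numpy as np
from scipy.spatial import distance

v, w = np.array([1, 2, 2]), np.array([4, 6, 2])
print (np.linalg.norm(v))           # EN (magnitude) of v: 3.0
print (distance.euclidean(v, w))    # straight-line distance from v to w: 5.0
print (np.linalg.norm(v - w))       # the same distance via the norm: 5.0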

 

To find the local minimum of f(x) in three-dimensional (3-D) space, the 1st step is to find the minimum for all 3-D vectors. The 2nd step is to create a random 3-D vector [x, y, z]. The 3rd step is to pick a random starting point, and then take tiny steps in the opposite direction of the gradient f'(x) until a point is reached where the gradient is very small.

 

Each tiny step (from the current vector to the next vector) is measured with the ED metric. The ED metric is the distance between two points in Euclidean space. The metric is required because we need to know how to move for each tiny step. So, the ED metric supplements GD to find the local minimum in 3-D space.

 

The code example finds the local minimum of the sigmoid function in 3-D space:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import random, numpy as np
from scipy.spatial import distance
def step(v, direction, step_size):
return [v_i + step_size * direction_i
for v_i, direction_i in zip(v, direction)]
def sigmoid_gradient(v):
return [v_i * (1-v_i) for v_i in v]
def mod_vector(v):
for i, v_i in enumerate(v):
if v_i == float("inf") or v_i == float("-inf"):
v[i] = random.randint(-1, 1)
return v
if __name__ == "__main__":
v = [random.randint(-10, 10) for i in range(3)]
tolerance = 0.0000001
iterations = 1
fig = plt.figure('Euclidean')
ax = fig.add_subplot(111, projection='3d')
while True:
gradient = sigmoid_gradient(v)
next_v = step(v, gradient, -0.01)
xs = gradient[0]
ys = gradient[1]
zs = gradient[2]
ax.scatter(xs, ys, zs, c='lime', marker='o')
v = mod_vector(v)
next_v = mod_vector(next_v)
test_v = distance.euclidean(v, next_v)
if test_v < tolerance:
break
v = next_v
iterations += 1
print ('minimum:', test_v, '\niterations:', iterations)
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
plt.tight_layout()
plt.show()

 

The code begins by importing matplotlib, mpl_toolkits, random, numpy, and scipy libraries. Function step() moves a vector in a direction (based on the gradient), by a step size. Function sigmoid_gradient() is the f'(sigmoid) returned as a point in 3-D space. Function mod_vector() ensures that an erroneous vector generated by the simulation is handled properly.

 

The main block begins by creating a randomly generated 3-D vector [x, y, z] as a starting point for the simulation. It continues by creating a tolerance (precision). A smaller tolerance results in a more accurate result. A subplot is created to hold a 3-D rendering of the local minimum. The GD simulation creates a set of 3-D vectors influenced by the sigmoid gradient until the gradient is very small. The size (magnitude) of each step, that is, the distance between the current and next vector, is calculated by the ED metric. The local minimum, as expected, is close to 0.

 

Stabilizing Euclidean Distance Minimization with Monte Carlo Simulation


The Euclidean distance experiment in the previous example is anchored by a stochastic process. Namely, the starting vector v is stochastically generated by randint(). As a result, each run of the GD experiment generates a different result for the number of iterations.

 

The code example first wraps the GD experiment in a loop that runs n simulations. With n simulations, an average number of iterations is calculated. The resultant code is then wrapped in another loop that runs m trials. With m trials, the gap between the average iteration counts is calculated.

 

Gap is calculated by subtracting the minimum from the maximum average iteration. The smaller the gap, the more stable (accurate) the result. To increase accuracy, increase simulations (n). The only limitation is computing power. That is, running 1,000 simulations takes a lot more computing power than 100. Stable (accurate) results allow comparison to alternative experiments.

import random, numpy as np
from scipy.spatial import distance
def step(v, direction, step_size):
return [v_i + step_size * direction_i
for v_i, direction_i in zip(v, direction)]
def sigmoid_gradient(v):
return [v_i * (1-v_i) for v_i in v]
def mod_vector(v):
for i, v_i in enumerate(v):
if v_i == float("inf") or v_i == float("-inf"):
v[i] = random.randint(-1, 1)
return v
if __name__ == "__main__":
trials = 10
sims = 10
avg_its = []
for _ in range(trials):
its = []
for _ in range(sims):
v = [random.randint(-10, 10) for i in range(3)]
tolerance = 0.0000001
iterations = 0
while True:
gradient = sigmoid_gradient(v)
next_v = step(v, gradient, -0.01)
v = mod_vector(v)
next_v = mod_vector(next_v)
test_v = distance.euclidean(v, next_v)
if test_v < tolerance:
break
v = next_v
iterations += 1
its.append(iterations)
a = round(np.mean(its))
avg_its.append(a)
gap = np.max(avg_its) - np.min(avg_its)
print (trials, 'trials with', sims, 'simulations each:')
print ('gap', gap)
print ('avg iterations', round(np.mean(avg_its)))

 

Output is for 10, 100, and 1,000 simulations. By running 1,000 simulations ten times (trials), the gap is down to 13. So, confidence is high that the number of iterations required to minimize the function is close to 1,089. We can further stabilize by wrapping the code in another loop to decrease variation in gap and number of iterations.

 

However, computer processing time becomes an issue. Leveraging MCS for this type of experiment makes a strong case for cloud computing. It may be tough to get your head around this application of MCS, but it is a very powerful tool for working with and solving data science problems.

 

Substituting a NumPy Method to Hasten Euclidean Distance Minimization


Since numpy arrays are faster than Python lists, it follows that using a numpy method would be more efficient for calculating Euclidean distance. The code example substitutes np.linalg.norm() for distance.euclidean() to calculate Euclidean distance for the GD experiment.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import random, numpy as np
def step(v, direction, step_size):
return [v_i + step_size * direction_i
for v_i, direction_i in zip(v, direction)]
def sigmoid_gradient(v):
return [v_i * (1-v_i) for v_i in v]
def round_v(v):
return np.round(v, decimals=3)
if __name__ == "__main__":
v = [random.randint(-10, 10) for i in range(3)]
tolerance = 0.0000001
iterations = 1
fig = plt.figure('norm')
ax = fig.add_subplot(111, projection='3d')
while True:
gradient = sigmoid_gradient(v)
next_v = step(v, gradient, -0.01)
round_gradient = round_v(gradient)
xs = round_gradient[0]
ys = round_gradient[1]
zs = round_gradient[2]
ax.scatter(xs, ys, zs, c='lime', marker='o')
norm_v = np.linalg.norm(v)
norm_next_v = np.linalg.norm(next_v)
test_v = norm_v - norm_next_v
if test_v < tolerance:
break
v = next_v
iterations += 1
print ('minimum:', test_v, '\niterations:', iterations)
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
plt.show()

The number of iterations is much lower at 31. However, given that the GD experiment is stochastic, we can use MCS for objective comparison.

 

Using the same MCS methodology, the code example first wraps the GD experiment in a loop that runs n number of simulations. The resultant code is then wrapped in another loop that runs m trials.

import random, numpy as np
def step(v, direction, step_size):
return [v_i + step_size * direction_i
for v_i, direction_i in zip(v, direction)]
def sigmoid_gradient(v):
return [v_i * (1-v_i) for v_i in v]
def round_v(v):
return np.round(v, decimals=3)
if __name__ == "__main__":
trials = 10
sims = 10
avg_its = []
for _ in range(trials):
its = []
for _ in range(sims):
v = [random.randint(-10, 10) for i in range(3)]
tolerance = 0.0000001
iterations = 0
while True:
gradient = sigmoid_gradient(v)
next_v = step(v, gradient, -0.01)
norm_v = np.linalg.norm(v)
norm_next_v = np.linalg.norm(next_v)
test_v = norm_v - norm_next_v
if test_v < tolerance:
break
v = next_v
iterations += 1
its.append(iterations)
a = round(np.mean(its))
avg_its.append(a)
gap = np.max(avg_its) - np.min(avg_its)
print (trials, 'trials with', sims, 'simulations each:')
print ('gap', gap)
print ('avg iterations', round(np.mean(avg_its)))

 

Processing is much faster using numpy. The average number of iterations is close to 193. As such, using the numpy alternative for calculating Euclidean distance is more than five times faster!

 

Stochastic Gradient Descent Minimization and Maximization


Up to this point in the chapter, optimization experiments used batch GD. Batch GD computes the gradient using the whole dataset. Stochastic GD computes the gradient using a single sample, so it is computationally much faster. It is called stochastic GD because the sample used to compute the gradient is chosen at random.

 

However, unlike batch GD, stochastic GD is an approximation. If the exact gradient is required, stochastic GD is not optimal. Another issue with stochastic GD is that it can hover around the minimum forever without actually converging. So, it is important to plot progress of the simulation to see what is happening.

 

Let's change direction and optimize another important function—residual sum of squares (RSS). An RSS function is a statistical technique that measures the amount of error (variance) remaining between a regression function and the data set. Regression analysis is an algorithm that estimates relationships between variables. It is widely used for prediction and forecasting. It is also a popular modeling and predictive algorithm for data science applications.
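As a minimal sketch of what RSS measures, consider a hypothetical data set and a candidate regression line y = 2x + 1 (both are assumptions for illustration, not part of the chapter's experiment):

import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array([3.1, 4.9, 7.2, 8.8])    # observed values
predicted = 2 * x + 1                 # candidate regression line y = 2x + 1
rss = np.sum((y - predicted) ** 2)    # residual sum of squares
print (rss)                           # ~0.1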

 

The 1st code example generates a sample, runs the GD experiment n times, and processes the sample randomly:

import matplotlib.pyplot as plt
import random, numpy as np
def rnd():
return [random.randint(-10,10) for i in range(3)]
def random_vectors(n):
ls = []
for v in range(n):
ls.append(rnd())
return ls
def sos(v):
return sum(v_i ** 2 for v_i in v)
def sos_gradient(v):
return [2 * v_i for v_i in v]
def in_random_order(data):
indexes = [i for i, _ in enumerate(data)]
random.shuffle(indexes)
for i in indexes:
yield data[i]
if __name__ == "__main__":
v, x, y = rnd(), random_vectors(3), random_vectors(3)
data = list(zip(x, y))
theta = v
alpha, value = 0.01, 0
min_theta, min_value = None, float("inf")
iterations_with_no_improvement = 0
n, x = 30, 1
for i, _ in enumerate(range(n)):
y = np.linalg.norm(theta)
plt.scatter(x, y, c='r')
x = x + 1
s = []
for x_i, y_i in data:
s.extend([sos(theta), sos(x_i), sos(y_i)])
value = sum(s)
if value < min_value:
min_theta, min_value = theta, value
iterations_with_no_improvement = 0
alpha = 0.01
else:
iterations_with_no_improvement += 1
alpha *= 0.9
g = []
for x_i, y_i in in_random_order(data):
g.extend([sos_gradient(theta), sos_gradient(x_i),
sos_gradient(y_i)])
for v in g:
theta = np.around(np.subtract(theta, alpha*np.array(v)), 3)
g = []
print ('minimum:', np.around(min_theta, 4), 'with', i+1, 'iterations')
print ('iterations with no improvement:', iterations_with_no_improvement)
print ('magnitude of min vector:', np.linalg.norm(min_theta))
plt.show()

 

The code begins by importing matplotlib, random, and numpy. It continues with function rnd(), which returns a list of random integers from -10 to 10. Function random_vectors() generates a random sample of n such vectors. Function sos() returns the RSS for a vector. Function sos_gradient() returns the derivative (gradient) of RSS for a vector. Function in_random_order() generates a list of randomly shuffled indexes. This function adds the stochastic flavor to the GD algorithm.

 

The main block begins by generating a random vector v as the starting point for the simulation. It continues by creating a sample of x and y vectors of size 3. Next, the vector is assigned to theta, which is a common name for a vector of some general probability distribution. We can call the vector anything we want, but a common data science problem is to find the value(s) of theta.

 

The code continues with a fixed step size alpha, minimum theta value, minimum ending value, iterations with no improvement, number of simulations n, and a plot value for the x-coordinate.

 

The simulation begins by assigning y the magnitude of theta. Next, it plots the current x and y coordinates. The x-coordinate is incremented by 1 to plot the convergence to the minimum for each y-coordinate. The next block of code finds the RSS for each theta, and the sample of x and y values. This value determines if the simulation is hovering around the local minimum rather than converging.

 

The final part of the code traverses the sample data points in random (stochastic) order, finds the gradient of theta, x and y, places these three values in list g, and traverses this vector to find the next theta value.

 

Whew! This is not simple, but this is how stochastic GD operates. Notice that the minimum generated is 2.87, which is not the true minimum of 0. So, stochastic GD requires few iterations but does not produce the true minimum.

 

The previous simulation can be refined by adjusting the algorithm for finding the next theta. In the previous example, the next theta is calculated for the gradient based on the current theta, x value, and y value for each sample. However, the actual new theta is based on the 3rd data point in the sample. So, the 2nd example is refined by taking the minimum theta from the entire sample rather than the 3rd data point:

import matplotlib.pyplot as plt
import random, numpy as np
def rnd():
return [random.randint(-10,10) for i in range(3)]
def random_vectors(n):
ls = []
for v in range(n):
ls.append([random.randint(-10,10) for i in range(3)])
return ls
def sos(v):
return sum(v_i ** 2 for v_i in v)
def sos_gradient(v):
return [2 * v_i for v_i in v]
def in_random_order(data):
indexes = [i for i, _ in enumerate(data)]
random.shuffle(indexes)
for i in indexes:
yield data[i]
if __name__ == "__main__":
v, x, y = rnd(), random_vectors(3), random_vectors(3)
data = list(zip(x, y))
theta = v
alpha, value = 0.01, 0
min_theta, min_value = None, float("inf")
iterations_with_no_improvement = 0
n, x = 60, 1
for i, _ in enumerate(range(n)):
y = np.linalg.norm(theta)
plt.scatter(x, y, c='r')
x = x + 1
s = []
for x_i, y_i in data:
s.extend([sos(theta), sos(x_i), sos(y_i)])
value = sum(s)
if value < min_value:
min_theta, min_value = theta, value
iterations_with_no_improvement = 0
alpha = 0.01
else:
iterations_with_no_improvement += 1
alpha *= 0.9
g, t, m = [], [], []
for x_i, y_i in in_random_order(data):
g.extend([sos_gradient(theta), sos_gradient(x_i),
sos_gradient(y_i)])
m = np.around([np.linalg.norm(x) for x in g], 2)
for v in g:
theta = np.around(np.subtract(theta, alpha*np.array(v)), 3)
t.append(np.around(theta,2))
mm = np.argmin(m)
theta = t[mm]
g, m, t = [], [], []
print ('minimum:', np.around(min_theta, 4), 'with', i+1, 'iterations')
print ('iterations with no improvement:', iterations_with_no_improvement)
print ('magnitude of min vector:', np.linalg.norm(min_theta))
plt.show()

 

The only difference in the code is toward the bottom where the minimum theta is calculated. Although it took 60 iterations, the minimum is much closer to 0 and much more stable. That is, the prior example deviates quite a bit more each time the experiment is run.

 

The 3rd example finds the maximum:

import matplotlib.pyplot as plt
import random, numpy as np
def rnd():
return [random.randint(-10,10) for i in range(3)]
def random_vectors(n):
ls = []
for v in range(n):
ls.append([random.randint(-10,10) for i in range(3)])
return ls
def sos_gradient(v):
return [2 * v_i for v_i in v]
def negate(function):
def new_function(*args, **kwargs):
return np.negative(function(*args, **kwargs))
return new_function
def in_random_order(data):
indexes = [i for i, _ in enumerate(data)]
random.shuffle(indexes)
for i in indexes:
yield data[i]
if __name__ == "__main__":
v, x, y = rnd(), random_vectors(3), random_vectors(3)
data = list(zip(x, y))
theta, alpha = v, 0.01
neg_gradient = negate(sos_gradient)
n, x = 100, 1
for i, row in enumerate(range(n)):
y = np.linalg.norm(theta)
plt.scatter(x, y, c='r')
x = x + 1
g = []
for x_i, y_i in in_random_order(data):
g.extend([neg_gradient(theta), neg_gradient(x_i),
neg_gradient(y_i)])
for v in g:
theta = np.around(np.subtract(theta, alpha*np.array(v)), 3)
g = []
print ('maximum:', np.around(theta, 4), 'with', i+1, 'iterations')
print ('magnitude of max vector:', np.linalg.norm(theta))
plt.show()

The only difference in the code from the 1st example is the negate() function, which negates the gradient to find the maximum. Since the maximum of RSS is infinity, we can stop at 100 iterations. Try 1,000 iterations and see what happens.

 

Working with Data


Working with data details the earliest processes of data science problem-solving. The 1st step is to identify the problem, which determines all else that needs to be done. The 2nd step is to gather data. The 3rd step is to wrangle (munge) data, which is critical. Wrangling is getting data into a form that is useful for machine learning and other data science problems. Of course, wrangled data will probably have to be cleaned. The 4th step is to visualize the data. Visualization helps you get to know the data and, hopefully, identify patterns.

 

One-Dimensional Data Example

The code example generates visualizations of two very common data distributions—uniform and normal. The uniform distribution has constant probability. That is, all events that belong to the distribution are equally probable. The normal distribution is symmetrical about the center, which means that 50% of its values are less than the mean and 50% of its values are greater than the mean. Its shape resembles a bell curve. The normal distribution is extremely important because it models many naturally occurring events.

import matplotlib.pyplot as plt
import numpy as np
if __name__ == "__main__": plt.figure('Uniform Distribution') uniform = np.random.uniform(-3, 3, 1000)
count, bins, ignored = plt.hist(uniform, 20, facecolor='lime')
plt.xlabel('Interval: [-3, 3]')
plt.ylabel('Frequency')
plt.title('Uniform Distribution')
plt.axis([-3,3,0,100])
plt.grid(True)
plt.figure('Normal Distribution')
normal = np.random.normal(0, 1, 1000)
count, bins, ignored = plt.hist(normal, 20,
facecolor='fuchsia')
plt.xlabel('Interval: [-3, 3]')
plt.ylabel('Frequency')
plt.title('Normal Distribution')
plt.axis([-3,3,0,140])
plt.grid(True)
plt.show()

The code example begins by importing matplotlib and numpy. The main block begins by creating a figure and data for a uniform distribution. Next, a histogram is created and plotted based on the data. The same steps are then repeated for the normal distribution.

 

Two-Dimensional Data Example


Modeling 2-D data offers a more realistic picture of naturally occurring events. The code example compares two normally distributed datasets of randomly generated data with the same mean and standard deviation (SD). SD measures the amount of variation (dispersion) of a set of data values. Although both data sets are normally distributed with the same mean and SD, each has a very different joint distribution (correlation) with the x data. Correlation is the interdependence of two variables.

import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np, random
from scipy.special import ndtri
def inverse_normal_cdf(r):
return ndtri(r)
def random_normal():
return inverse_normal_cdf(random.random())
def scatter(loc):
plt.scatter(xs, ys1, marker='.', color='black', label='ys1')
plt.scatter(xs, ys2, marker='.', color='gray', label='ys2')
plt.xlabel('xs')
plt.ylabel('ys')
plt.legend(loc=loc)
plt.tight_layout()
if __name__ == "__main__":
xs = [random_normal() for _ in range(1000)]
ys1 = [ x + random_normal() / 2 for x in xs]
ys2 = [-x + random_normal() / 2 for x in xs]
gs = gridspec.GridSpec(2, 2)
fig = plt.figure()
ax1 = fig.add_subplot(gs[0,0])
plt.title('ys1 data')
n, bins, ignored = plt.hist(ys1, 50, density=True,
facecolor='chartreuse',
alpha=0.75)
ax2 = fig.add_subplot(gs[0,1])
plt.title('ys2 data')
n, bins, ignored = plt.hist(ys2, 50, density=True,
facecolor='fuchsia',
alpha=0.75)
ax3 = fig.add_subplot(gs[1,:])
plt.title('Correlation')
scatter(6)
print (np.corrcoef(xs, ys1)[0, 1])
print (np.corrcoef(xs, ys2)[0, 1])
plt.show()

 

The code example begins by importing matplotlib, numpy, random, and scipy libraries. Module gridspec specifies the geometry of the grid where subplots are placed. Function ndtri() returns the inverse of the standard normal cumulative distribution function (CDF); that is, given a probability p, it returns the value x such that P(X ≤ x) = p. The CDF itself is the probability that a random variable X takes on a value less than or equal to x. The code continues with three functions. Function inverse_normal_cdf() simply wraps ndtri() and returns the inverse CDF for a given probability.

 

Function random_normal() calls inverse_normal_cdf() with a uniformly distributed random value between 0 and 1, which yields a normally distributed value (inverse transform sampling). Function scatter() creates a scatter plot. The main block begins by creating randomly generated values xs, ys1, and ys2. A GridSpec is created to hold the subplots. Histograms are created for the ys1 and ys2 data, respectively. Next, a scatter plot is created for both distributions. Finally, correlations are generated for the two distributions.

 

The code example teaches two important lessons. First, applying ndtri() to uniformly distributed random numbers produces a normally distributed dataset; this is inverse transform sampling. Second, two normally distributed datasets are not necessarily similar even though their histograms look alike. In this case, their correlations with xs are opposite in sign. So, both visualization and correlation are needed to demonstrate the difference between the datasets.
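A minimal sketch (with a hypothetical seed and sample size, not part of the example above) confirms the first lesson: pushing uniform random numbers through ndtri() yields a sample with mean near 0 and SD near 1:

import numpy as np
from scipy.special import ndtri

if __name__ == "__main__":
    np.random.seed(1)                 # hypothetical seed for repeatability
    u = np.random.random(100000)      # uniform values in [0, 1)
    z = ndtri(u)                      # inverse standard normal CDF of each value
    # the transformed sample should look standard normal
    print(round(z.mean(), 3), round(z.std(), 3))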

 

Data Correlation and Basic Statistics


Correlation is the extent to which two or more variables fluctuate (move) together. A correlation matrix is a table showing correlation coefficients between sets of variables. Correlation coefficients measure the strength of association between two or more variables.
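As a quick reference, Pearson's correlation coefficient is the covariance of two variables divided by the product of their standard deviations. The following sketch (using hypothetical data generated the same way as the example below) computes it by hand and checks it against np.corrcoef():

import numpy as np

def pearson_r(x, y):
    # covariance divided by the product of the standard deviations
    x, y = np.asarray(x), np.asarray(y)
    return ((x - x.mean()) * (y - y.mean())).mean() / (x.std() * y.std())

if __name__ == "__main__":
    np.random.seed(0)
    x = np.random.randint(0, 50, 1000)
    y = x + np.random.normal(0, 10, 1000)
    print(round(pearson_r(x, y), 4))            # hand-rolled coefficient
    print(round(np.corrcoef(x, y)[0, 1], 4))    # matches numpy's result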

 

The code example creates three datasets with x and y coordinates, calculates correlations, and plots. The 1st dataset represents a positive correlation; the 2nd, a negative correlation; and the 3rd, a weak correlation.

import random, numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

if __name__ == "__main__":
    np.random.seed(0)
    x = np.random.randint(0, 50, 1000)
    y = x + np.random.normal(0, 10, 1000)
    print('highly positive:\n', np.corrcoef(x, y))
    gs = gridspec.GridSpec(2, 2)
    fig = plt.figure()
    ax1 = fig.add_subplot(gs[0, 0])
    plt.title('positive correlation')
    plt.scatter(x, y, color='springgreen')
    y = 100 - x + np.random.normal(0, 10, 1000)
    print('\nhighly negative:\n', np.corrcoef(x, y))
    ax2 = fig.add_subplot(gs[0, 1])
    plt.title('negative correlation')
    plt.scatter(x, y, color='crimson')
    y = np.random.normal(0, 10, 1000)
    print('\nno/weak:\n', np.corrcoef(x, y))
    ax3 = fig.add_subplot(gs[1, :])
    plt.title('weak correlation')
    plt.scatter(x, y, color='peachpuff')
    plt.tight_layout()
    plt.show()

The code example begins by importing random, numpy, and matplotlib libraries. The main block begins by generating x and y coordinates with a positive correlation and displaying the correlation matrix. It continues by creating a grid to hold the subplot, the 1st subplot grid, and a scatterplot.

 

Next, x and y coordinates are created with a negative correlation and the correlation matrix is displayed. The 2nd subplot grid is created and plotted. Finally, x and y coordinates are created with a weak correlation and the correlation matrix is displayed. The 3rd subplot grid is created and plotted, and all three scatterplots are displayed. 

 

Pandas Correlation and Heat Map Examples


Pandas is a Python package that provides fast, flexible, and expressive data structures to make working with virtually any type of data easy, intuitive, and practical in real-world data analysis. A DataFrame (df) is a 2-D labeled data structure and the most commonly used object in pandas.

 

The 1st code example creates a correlation matrix with an associated visualization:

import random, numpy as np, pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as colors

if __name__ == "__main__":
    np.random.seed(0)
    df = pd.DataFrame({'a': np.random.randint(0, 50, 1000)})
    df['b'] = df['a'] + np.random.normal(0, 10, 1000)
    df['c'] = 100 - df['a'] + np.random.normal(0, 5, 1000)
    df['d'] = np.random.randint(0, 50, 1000)
    colormap = cm.viridis
    colorlist = [colors.rgb2hex(colormap(i))
                 for i in np.linspace(0, 1, len(df['a']))]
    df['colors'] = colorlist
    # select the numeric columns so the string 'colors' column is excluded
    print(df[['a', 'b', 'c', 'd']].corr())
    # diagonal accepts 'hist' or 'kde'; the original 'd' is not a valid option
    pd.plotting.scatter_matrix(df, c=df['colors'], diagonal='kde',
                               figsize=(10, 6))
    plt.show()

The code example begins by importing random, numpy, pandas, and matplotlib libraries. The main block begins by creating a df with four columns populated by various random number possibilities. It continues by building a list of colors (one per row) from the viridis colormap, adding that list as a column, printing the correlation matrix for the four numeric columns, and plotting a scatter matrix colored by the list.

 

We can see from the correlation matrix that the most highly correlated variables are a and b (0.83), a and c (–0.95), and b and c (–0.79). From the scatter matrix, we can see that a and b are positively correlated, a and c are negatively correlated, and b and c are negatively correlated. However, the actual correlation values are not apparent from the visualization.

 

A heat map is a graphical representation of data where individual values in a matrix are represented as colors. It is a popular visualization technique in data science. Applied to a correlation matrix, a heat map provides a sophisticated visualization in which each correlation value is mapped to a color, so the strength and sign of every pairwise relationship can be read at a glance.

 

The 2nd code example uses a Heat map to visualize variable correlations. You need to install library seaborn if you don’t already have it installed on your computer (e.g., pip install seaborn).

import random, numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

if __name__ == "__main__":
    np.random.seed(0)
    df = pd.DataFrame({'a': np.random.randint(0, 50, 1000)})
    df['b'] = df['a'] + np.random.normal(0, 10, 1000)
    df['c'] = 100 - df['a'] + np.random.normal(0, 5, 1000)
    df['d'] = np.random.randint(0, 50, 1000)
    plt.figure()
    sns.heatmap(df.corr(), annot=True, cmap='OrRd')
    plt.show()

The code begins by importing random, numpy, pandas, matplotlib, and seaborn libraries. Seaborn is a Python visualization library based on matplotlib. The main block begins by generating four columns of data (variables) and plots a heat map of their correlation matrix. Attribute cmap selects a matplotlib colormap. A list of available colormaps can be found in the matplotlib documentation (see the colormaps_reference example).

 

Various Visualization Examples


The 1st code example introduces the Andrews curve, which is a way to visualize structure in high-dimensional data. Data for this example is the Iris dataset, which is one of the best known datasets in the pattern recognition literature. The Iris dataset consists of petal and sepal measurements for three different types of irises (Setosa, Versicolour, and Virginica).

 

Andrews curves allow multivariate data to be plotted as a large number of curves that are created using the attributes (variables) of each sample as coefficients. By coloring the curves differently for each class, it is possible to visualize data clustering. Curves belonging to samples of the same class will usually be closer together and form larger structures. Raw data for the Iris dataset is located at the following URL:

https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import andrews_curves

if __name__ == "__main__":
    data = pd.read_csv('data/iris.csv')
    plt.figure()
    andrews_curves(data, 'Name',
                   color=['b', 'mediumspringgreen', 'r'])
    plt.show()

 

The code example begins by importing matplotlib and pandas. The main block begins by reading the iris dataset into pandas df data. Next, Andrews curves are plotted for each class—Iris-setosa, Iris-versicolor, and Iris-virginica. From this visualization, it is difficult to see which attributes distinctly define each class.
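For reference, the curve plotted for each sample is the finite Fourier series f(t) = x1/√2 + x2 sin(t) + x3 cos(t) + x4 sin(2t) + …, where x1, x2, … are the sample's attribute values. The following sketch (with a hypothetical four-value sample, not a row taken from the file) computes one such curve by hand:

import numpy as np

def andrews_curve(row, t):
    # f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ...
    terms = [row[0] / np.sqrt(2)]
    for k, x in enumerate(row[1:], start=1):
        if k % 2 == 1:
            terms.append(x * np.sin((k // 2 + 1) * t))
        else:
            terms.append(x * np.cos((k // 2) * t))
    return sum(terms)

if __name__ == "__main__":
    t = np.linspace(-np.pi, np.pi, 5)       # a few sample points of the curve
    sample = [5.1, 3.5, 1.4, 0.2]           # hypothetical sepal/petal measurements
    print(andrews_curve(sample, t))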

 

The 2nd code example introduces parallel coordinates:

import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

if __name__ == "__main__":
    data = pd.read_csv('data/iris.csv')
    plt.figure()
    parallel_coordinates(data, 'Name',
                         color=['b', 'mediumspringgreen', 'r'])
    plt.show()

 

Parallel coordinates is another technique for plotting multivariate data. It allows visualization of clusters in data and estimation of other statistics visually. Points are represented as connected line segments. Each vertical line represents one attribute. One set of connected line segments represents one data point. Points that tend to cluster appear closer together.

 

The code example begins by importing matplotlib and pandas. The main block begins by reading the iris dataset into pandas df data. Next, parallel coordinates are plotted for each class. From this visualization, attributes PetalLength and PetalWidth are most distinct for the three species (classes of Iris). So, PetalLength and PetalWidth are the best discriminators for species of Iris. The Andrews curves did not reveal this important information as clearly.

 

Here is a useful URL:

http://wilkelab.org/classes/SDS348/2016_spring/worksheets/class9.html

The 3rd code example introduces RadViz:
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import radviz

if __name__ == "__main__":
    data = pd.read_csv('data/iris.csv')
    plt.figure()
    radviz(data, 'Name',
           color=['b', 'mediumspringgreen', 'r'])
    plt.show()

RadViz is yet another technique for visualizing multivariate data. The code example begins by importing matplotlib and pandas. The main block begins by reading the iris dataset into pandas df data. Next, RadViz coordinates are plotted for each class. With this visualization, it is not easy to see any distinctions. So, the parallel coordinates technique appears to be the best of the three in terms of recognizing variation (for this example).

 

Cleaning a CSV File with Pandas and JSON


The code example loads a dirty CSV file into a Pandas df and displays to locate bad data. It then loads the same CSV file into a list of dictionary elements for cleaning. Finally, the cleansed data is saved to JSON.

import csv, pandas as pd, json

def to_dict(d):
    return [dict(row) for row in d]

def dump_json(f, d):
    with open(f, 'w') as f:
        json.dump(d, f)

def read_json(f):
    with open(f) as f:
        return json.load(f)

if __name__ == "__main__":
    df = pd.read_csv("data/audio.csv")
    print(df, '\n')
    data = csv.DictReader(open('data/audio.csv'))
    d = to_dict(data)
    for row in d:
        if row['pno'][0] not in ['a', 'c', 'p', 's']:
            if row['pno'][0] == '8':
                row['pno'] = 'a' + row['pno']
            elif row['pno'][0] == '7':
                row['pno'] = 'p' + row['pno']
            elif row['pno'][0] == '5':
                row['pno'] = 's' + row['pno']
        if row['color'] == '-':
            row['color'] = 'silver'
        if row['model'] == '-':
            row['model'] = 'S1'
        if row['mfg'] == '100':
            row['mfg'] = 'Linn'
        if row['desc'] == '0' and row['pno'][0] == 'p':
            row['desc'] = 'preamplifier'
        elif row['desc'] == '-' and row['pno'][0] == 's':
            row['desc'] = 'speakers'
        if row['price'][0] == '$':
            row['price'] =\
                row['price'].translate({ord(i): None for i in '$,.'})
    json_file = 'data/audio.json'
    dump_json(json_file, d)
    data = read_json(json_file)
    for i, row in enumerate(data):
        if i < 5:
            print(row)

The code example begins by importing CSV, pandas, and JSON libraries. Function to_dict() converts a list of OrderedDict elements to a list of regular dictionary elements for easier processing. Function dump_json() saves data to a JSON file. Function read_json() reads JSON data into a Python list.

 

The main block begins by loading a CSV file into a Pandas df and displaying it to visualize dirty data. It continues by loading the same CSV file into a list of dictionary elements for easier cleaning. Next, all dirty data is cleansed. The code continues by saving the cleansed data to JSON file audio.json. Finally, audio.json is loaded and a few records are displayed to ensure that everything worked properly.

 

Slicing and Dicing


Slicing and dicing are ways of breaking data into smaller parts or views to better understand it and present it as information in a variety of different and useful ways. In a multidimensional dataset, a slice selects the data corresponding to a single value of one dimension of interest. While a slice filters on a particular attribute, a dice is like a zoom feature: it selects a sub-cube by picking specific values (or ranges of values) for two or more dimensions.
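As a quick illustration of the distinction (using a small hypothetical NumPy cube rather than the audio data below), a slice fixes one dimension while a dice carves out a sub-cube:

import numpy as np

if __name__ == "__main__":
    # hypothetical 3-D cube: 3 stocks x 5 days x 4 measures
    np.random.seed(0)
    cube = np.random.random((3, 5, 4))
    slice_ = cube[:, :, 2]      # slice: fix one dimension (the 3rd measure)
    dice = cube[0:2, 1:3, :]    # dice: specific values of several dimensions
    print(slice_.shape)         # (3, 5)
    print(dice.shape)           # (2, 2, 4)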

 

The code example loads audio.json into a Pandas df, slices data by column and row, and displays:

import pandas as pd

if __name__ == "__main__":
    df = pd.read_json("data/audio.json")
    amps = df[df.desc == 'amplifier']
    print(amps, '\n')
    price = df.query('price >= 40000')
    print(price, '\n')
    between = df.query('4999 < price < 6000')
    print(between, '\n')
    row = df.loc[[0, 10, 19]]
    print(row)

 

The code example begins by importing Pandas. The main block begins by loading audio.json into a Pandas df. Next, the df is sliced by amplifier from the desc column. The code continues by slicing the price column for equipment priced at $40,000 or more. The next slice is by price column for equipment between $5,000 and $6,000. The final slice is by rows 0, 10, and 19.

 

Data Cubes


A data cube is an n-dimensional array of values. Since it is hard to conceptualize an n-dimensional cube, most are 3-D in practice. Let's build a cube that holds three stocks: GOOGL, AMZN, and MKL. Each stock includes five days of data, and each day includes open, high, low, close, adj close, and volume values. So, the three dimensions are stock, day, and values. Data was garnered from actual stock quotes.

 

The code example creates a cube, saves it to a JSON file, reads the JSON, and displays some information:

import json

def dump_json(f, d):
    with open(f, 'w') as f:
        json.dump(d, f)

def read_json(f):
    with open(f) as f:
        return json.load(f)

def rnd(n):
    return '{:.2f}'.format(n)

if __name__ == "__main__":
    d = dict()
    googl = dict()
    googl['2017-09-25'] =\
        {'Open':939.450012, 'High':939.750000, 'Low':924.510010,
         'Close':934.280029, 'Adj Close':934.280029, 'Volume':1873400}
    googl['2017-09-26'] =\
        {'Open':936.690002, 'High':944.080017, 'Low':935.119995,
         'Close':937.429993, 'Adj Close':937.429993, 'Volume':1672700}
    googl['2017-09-27'] =\
        {'Open':942.739990, 'High':965.429993, 'Low':941.950012,
         'Close':959.900024, 'Adj Close':959.900024, 'Volume':2334600}
    googl['2017-09-28'] =\
        {'Open':956.250000, 'High':966.179993, 'Low':955.549988,
         'Close':964.809998, 'Adj Close':964.809998, 'Volume':1400900}
    googl['2017-09-29'] =\
        {'Open':966.000000, 'High':975.809998, 'Low':966.000000,
         'Close':973.719971, 'Adj Close':973.719971, 'Volume':2031100}
    amzn = dict()
    amzn['2017-09-25'] =\
        {'Open':949.309998, 'High':949.419983, 'Low':932.890015,
         'Close':939.789978, 'Adj Close':939.789978, 'Volume':5124000}
    amzn['2017-09-26'] =\
        {'Open':945.489990, 'High':948.630005, 'Low':931.750000,
         'Close':937.429993, 'Adj Close':938.599976, 'Volume':3564800}
    amzn['2017-09-27'] =\
        {'Open':948.000000, 'High':955.299988, 'Low':943.299988,
         'Close':950.869995, 'Adj Close':950.869995, 'Volume':3148900}
    amzn['2017-09-28'] =\
        {'Open':951.859985, 'High':959.700012, 'Low':950.099976,
         'Close':956.400024, 'Adj Close':956.400024, 'Volume':2522600}
    amzn['2017-09-29'] =\
        {'Open':960.109985, 'High':964.830017, 'Low':958.380005,
         'Close':961.349976, 'Adj Close':961.349976, 'Volume':2543800}
    mkl = dict()
    mkl['2017-09-25'] =\
        {'Open':1056.199951, 'High':1060.089966, 'Low':1047.930054,
         'Close':1050.250000, 'Adj Close':1050.250000, 'Volume':23300}
    mkl['2017-09-26'] =\
        {'Open':1052.729980, 'High':1058.520020, 'Low':1045.000000,
         'Close':1045.130005, 'Adj Close':1045.130005, 'Volume':25800}
    mkl['2017-09-27'] =\
        {'Open':1047.560059, 'High':1069.099976, 'Low':1047.010010,
         'Close':1064.040039, 'Adj Close':1064.040039, 'Volume':21100}
    mkl['2017-09-28'] =\
        {'Open':1064.130005, 'High':1073.000000, 'Low':1058.079956,
         'Close':1070.550049, 'Adj Close':1070.550049, 'Volume':23500}
    mkl['2017-09-29'] =\
        {'Open':1068.439941, 'High':1073.000000, 'Low':1060.069946,
         'Close':1067.979980, 'Adj Close':1067.979980, 'Volume':20700}
    d['GOOGL'], d['AMZN'], d['MKL'] = googl, amzn, mkl
    json_file = 'data/cube.json'
    dump_json(json_file, d)
    d = read_json(json_file)
    s = ' '
    print('\'Adj Close\' slice:')
    print(10*s, 'AMZN', s, 'GOOGL', s, 'MKL')
    print('Date')
    # display the 'Adj Close' slice, one row per trading day
    for day in ['2017-09-25', '2017-09-26', '2017-09-27',
                '2017-09-28', '2017-09-29']:
        print(day, rnd(d['AMZN'][day]['Adj Close']),
              rnd(d['GOOGL'][day]['Adj Close']),
              rnd(d['MKL'][day]['Adj Close']))

The code example begins by importing json. Function dump_json() and read_json() save and read JSON data respectively. The main block creates a cube by creating a dictionary d, dictionaries for each stock, and adding data by day and attribute to each stock dictionary. The code continues by saving the cube to JSON file cube.json. Finally, the code reads cube.json and displays a slice from the cube.
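As an optional follow-up, the same cube can be loaded into a pandas DataFrame with a (stock, date) MultiIndex, which reduces the slice to one line. This is only a sketch, assuming cube.json was created by the example above:

import json
import pandas as pd

if __name__ == "__main__":
    # assumes data/cube.json exists from the previous example
    with open('data/cube.json') as f:
        d = json.load(f)
    # one DataFrame per stock (rows = dates, columns = measures),
    # then concatenate into a (stock, date) MultiIndex
    frames = {k: pd.DataFrame(v).T for k, v in d.items()}
    cube = pd.concat(frames, names=['stock', 'date'])
    # the 'Adj Close' slice for a single day across all stocks
    print(cube.xs('2017-09-25', level='date')['Adj Close'])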

 

Data Scaling and Wrangling


Data scaling is changing type, spread, and/or position to compare data that are otherwise incomparable. Data scaling is very common in data science. Mean centering is the 1st technique, which transforms data by subtracting out the mean. Normalization is the 2nd technique, which transforms data to fall within the range between 0 and 1. Standardization is the 3rd technique, which transforms data to zero mean and unit variance (SD = 1), which is commonly referred to as standard normal.
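Before walking through the three full examples below, here is a compact sketch of the three transformations side by side, applied to a small hypothetical array:

import numpy as np

def center(d):
    return d - d.mean()                          # mean centering

def normalize(d):
    return (d - d.min()) / (d.max() - d.min())   # min-max scaling to [0, 1]

def standardize(d):
    return (d - d.mean()) / d.std()              # zero mean, unit variance

if __name__ == "__main__":
    d = np.array([3.0, 7.0, 10.0, 14.0, 21.0])   # hypothetical data
    print(center(d))
    print(normalize(d))
    print(standardize(d))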

 

The 1st code example generates and centers a normal distribution:

import numpy as np
import matplotlib.pyplot as plt

def rnd_nrml(m, s, n):
    return np.random.normal(m, s, n)

def ctr(d):
    return [x - np.mean(d) for x in d]

if __name__ == "__main__":
    mu, sigma, n, c1, c2, b = 10, 15, 100, 'pink', 'springgreen', True
    s = rnd_nrml(mu, sigma, n)
    plt.figure()
    ax = plt.subplot(211)
    ax.set_title('normal distribution')
    # 'density' replaces the deprecated 'normed' histogram argument
    count, bins, ignored = plt.hist(s, 30, color=c1, density=b)
    sc = ctr(s)
    ax = plt.subplot(212)
    ax.set_title('normal distribution "centered"')
    count, bins, ignored = plt.hist(sc, 30, color=c2, density=b)
    plt.tight_layout()
    plt.show()

 

The code example begins by importing numpy and matplotlib. Function rnd_nrml() generates a normal distribution based on mean (mu), SD (sigma), and n number of data points. Function ctr() subtracts the mean from every data point. The main block begins by creating the normal distribution. The code continues by plotting the original and centered distributions. Notice that the two distributions have exactly the same shape, but the 2nd is shifted so that its mean is 0.

 

The 2nd code example generates and normalizes a normal distribution:

import numpy as np
import matplotlib.pyplot as plt

def rnd_nrml(m, s, n):
    return np.random.normal(m, s, n)

def nrml(d):
    return [(x - np.amin(d)) / (np.amax(d) - np.amin(d)) for x in d]

if __name__ == "__main__":
    mu, sigma, n, c1, c2, b = 10, 15, 100, 'orchid', 'royalblue', True
    s = rnd_nrml(mu, sigma, n)
    plt.figure()
    ax = plt.subplot(211)
    ax.set_title('normal distribution')
    count, bins, ignored = plt.hist(s, 30, color=c1, density=b)
    sn = nrml(s)
    ax = plt.subplot(212)
    ax.set_title('normal distribution "normalized"')
    count, bins, ignored = plt.hist(sn, 30, color=c2, density=b)
    plt.tight_layout()
    plt.show()

 

The code example begins by importing numpy and matplotlib. Function rnd_nrml() generates a normal distribution based on mean (mu), SD (sigma), and n number of data points. Function nrml() transforms data to fall within the range between 0 and 1. The main block begins by creating the normal distribution. The code continues by plotting the original and normalized distributions. Notice that the two distributions have exactly the same shape, but the 2nd is rescaled to the interval [0, 1].

 

The 3rd code example transforms data to zero mean and unit variance (standard normal):

import numpy as np
import matplotlib.pyplot as plt

def rnd_nrml(m, s, n):
    return np.random.normal(m, s, n)

def std_nrml(d, m, s):
    return [(x - m) / s for x in d]

if __name__ == "__main__":
    mu, sigma, n, b = 0, 1, 1000, True
    c1, c2 = 'peachpuff', 'lime'
    s = rnd_nrml(mu, sigma, n)
    plt.figure(1)
    plt.title('standard normal distribution')
    # 'density' replaces the deprecated 'normed' histogram argument
    count, bins, ignored = plt.hist(s, 30, color=c1, density=b)
    plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *
             np.exp(-(bins - mu)**2 / (2 * sigma**2)),
             linewidth=2, color=c2)
    start1, start2 = 5, 600
    mu1, sigma1, n, b = 10, 15, 500, True
    x1 = np.arange(start1, n+start1, 1)
    y1 = rnd_nrml(mu1, sigma1, n)
    mu2, sigma2, n, b = 25, 5, 500, True
    x2 = np.arange(start2, n+start2, 1)
    y2 = rnd_nrml(mu2, sigma2, n)
    plt.figure(2)
    ax = plt.subplot(211)
    ax.set_title('dataset1 (mu=10, sigma=15)')
    count, bins, ignored = plt.hist(y1, 30, color='r', density=b)
    ax = plt.subplot(212)
    ax.set_title('dataset2 (mu=25, sigma=5)')
    count, bins, ignored = plt.hist(y2, 30, color='g', density=b)
    plt.tight_layout()
    plt.figure(3)
    ax = plt.subplot(211)
    ax.set_title('Normal Distributions')
    g1, g2 = (x1, y1), (x2, y2)
    data = (g1, g2)
    colors = ('red', 'green')
    groups = ('dataset1', 'dataset2')
    for data, color, group in zip(data, colors, groups):
        x, y = data
        ax.scatter(x, y, alpha=0.8, c=color, edgecolors='none',
                   s=30, label=group)
    plt.legend(loc=4)
    ax = plt.subplot(212)
    ax.set_title('Standard Normal Distributions')
    ds1 = (x1, std_nrml(y1, mu1, sigma1))
    y1_sn = ds1[1]
    ds2 = (x2, std_nrml(y2, mu2, sigma2))
    y2_sn = ds2[1]
    g1, g2 = (x1, y1_sn), (x2, y2_sn)
    data = (g1, g2)
    for data, color, group in zip(data, colors, groups):
        x, y = data
        ax.scatter(x, y, alpha=0.8, c=color, edgecolors='none',
                   s=30, label=group)
    plt.tight_layout()
    plt.show()

 

The code example begins by importing numpy and matplotlib. Function rnd_nrml() generates a normal distribution based on the mean (mu), SD (sigma), and n number of data points. Function std_nrml() transforms data to standard normal. The main block begins by creating a standard normal distribution and plotting it as both a histogram and a line.

 

The code continues by creating and plotting two different normally distributed datasets. Next, both data sets are rescaled to standard normal and plotted. Now, the datasets can be compared with each other. Although the original plots of the datasets appear to be very different, they are actually very similar distributions.

 

The 4th code example reads a CSV dataset, saves it to JSON, wrangles it, and prints a few records. The URL for the data is: https://community.tableau.com/docs/DOC-1236. However, the data on this site changes, so please use the data from our website to work with this example:

import csv, json

def read_dict(f):
    return csv.DictReader(open(f))

def to_dict(d):
    return [dict(row) for row in d]

def dump_json(f, d):
    with open(f, 'w') as fout:
        json.dump(d, fout)

def read_json(f):
    with open(f) as f:
        return json.load(f)

def mk_data(d):
    for i, row in enumerate(d):
        e = {}
        e['_id'] = i
        e['cust'] = row['Customer Name']
        e['item'] = row['Sub-Category']
        e['sale'] = rnd(row['Sales'])
        e['quan'] = row['Quantity']
        e['disc'] = row['Discount']
        e['prof'] = rnd(row['Profit'])
        e['segm'] = row['Segment']
        yield e

def rnd(v):
    return str(round(float(v), 2))

if __name__ == "__main__":
    f = 'data/superstore.csv'
    d = read_dict(f)
    data = to_dict(d)
    jsonf = 'data/superstore.json'
    dump_json(jsonf, data)
    print('"superstore" data added to JSON\n')
    json_data = read_json(jsonf)
    print("{:20s} {:15s} {:10s} {:3s} {:5s} {:12s} {:10s}".format(
        'CUSTOMER', 'ITEM', 'SALES', 'Q', 'DISC', 'PROFIT', 'SEGMENT'))
    generator = mk_data(json_data)
    for i, row in enumerate(generator):
        if i < 10:
            print("{:20s} {:15s}".format(row['cust'], row['item']),
                  "{:10s} {:3s}".format(row['sale'], row['quan']),
                  "{:5s} {:12s}".format(row['disc'], row['prof']),
                  "{:10s}".format(row['segm']))
        else:
            break

The code example begins by importing csv and json libraries. Function read_dict() reads a CSV file as an OrderedDict. Function to_dict() converts an OrderedDict to a regular dictionary. Function dump_json() saves a file to JSON. Function read_json() reads a JSON file. Function mk_data() creates a generator object consisting of wrangled data from the JSON file. Function rnd() rounds a number to 2 decimal places. The main block begins by reading a CSV file and converting it to JSON.

 

The code continues by reading the newly created JSON data. Next, a generator object is created from the JSON data. The generator object is important because it produces wrangled rows lazily, one at a time, instead of building another full list in memory. Since the dataset is close to 10,000 records, this keeps processing light. To verify that the data was created correctly, the generator object is iterated a few times to print some of the wrangled records.
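The following sketch (unrelated to the superstore data) illustrates the point: a generator occupies a small, roughly constant amount of memory no matter how many rows it will eventually produce, whereas a list materializes everything up front:

import sys

def squares_list(n):
    return [i * i for i in range(n)]     # materializes every value up front

def squares_gen(n):
    return (i * i for i in range(n))     # produces values lazily, one at a time

if __name__ == "__main__":
    n = 10000                            # roughly the size of the superstore dataset
    print(sys.getsizeof(squares_list(n)))   # tens of kilobytes
    print(sys.getsizeof(squares_gen(n)))    # on the order of a hundred bytes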

 

The 5th and final code example reads the JSON file created in the previous example, wrangles it, and saves the wrangled data set to JSON:

import json

def read_json(f):
    with open(f) as f:
        return json.load(f)

def mk_data(d):
    for i, row in enumerate(d):
        e = {}
        e['_id'] = i
        e['cust'] = row['Customer Name']
        e['item'] = row['Sub-Category']
        e['sale'] = rnd(row['Sales'])
        e['quan'] = row['Quantity']
        e['disc'] = row['Discount']
        e['prof'] = rnd(row['Profit'])
        e['segm'] = row['Segment']
        yield e

def rnd(v):
    return str(round(float(v), 2))

if __name__ == "__main__":
    jsonf = 'data/superstore.json'
    json_data = read_json(jsonf)
    l = len(list(mk_data(json_data)))
    generator = mk_data(json_data)
    jsonf = 'data/wrangled.json'
    with open(jsonf, 'w') as f:
        f.write('[')
    for i, row in enumerate(generator):
        j = json.dumps(row)
        if i < l - 1:
            with open(jsonf, 'a') as f:
                f.write(j)
                f.write(',')
        else:
            with open(jsonf, 'a') as f:
                f.write(j)
                f.write(']')
    json_data = read_json(jsonf)
    for i, row in enumerate(json_data):
        if i < 5:
            print(row['cust'], row['item'], row['sale'])
        else:
            break

The code example imports json. Function read_json() reads a JSON file. Function mk_data() creates a generator object consisting of wrangled data from the JSON file. Function rnd() rounds a number to two decimal places. The main block begins by reading a JSON file. A generator object must be created twice.

 

The 1st generator allows us to find the length of the JSON file. The 2nd generator consists of wrangled data from the JSON file. Next, the generator is traversed so we can create a JSON file of the wrangled data. Although the generator object is created and can be traversed very fast, it takes a bit of time to create a JSON file consisting of close to 10,000 wrangled records. On my machine, it took a bit over 33 seconds, so be patient.
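If the incremental file writes become a bottleneck, an alternative is to materialize the generator once and write the JSON in a single call. This is only a sketch; it assumes mk_data() from the example above is importable (superstore_wrangle is a hypothetical module name) and that data/superstore.json already exists:

import json
from superstore_wrangle import mk_data   # hypothetical module holding mk_data()

if __name__ == "__main__":
    with open('data/superstore.json') as fin:
        json_data = json.load(fin)
    wrangled = list(mk_data(json_data))      # materialize the generator once
    with open('data/wrangled.json', 'w') as fout:
        json.dump(wrangled, fout)            # one write instead of many file opens
    print(len(wrangled), 'records written')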