Supervised and Unsupervised Learning

Prof. Kristian Hardy
Published Date: 26-07-2017
Unsupervised Learning

Unsupervised vs. Supervised Learning
- Most of this course focuses on supervised learning methods such as regression and classification.
- In that setting we observe both a set of features X_1, X_2, ..., X_p for each object, as well as a response or outcome variable Y. The goal is then to predict Y using X_1, X_2, ..., X_p.
- Here we instead focus on unsupervised learning, where we observe only the features X_1, X_2, ..., X_p. We are not interested in prediction, because we do not have an associated response variable Y.

The Goals of Unsupervised Learning
- The goal is to discover interesting things about the measurements: is there an informative way to visualize the data? Can we discover subgroups among the variables or among the observations?
- We discuss two methods:
  - principal components analysis, a tool used for data visualization or data pre-processing before supervised techniques are applied, and
  - clustering, a broad class of methods for discovering unknown subgroups in data.

The Challenge of Unsupervised Learning
- Unsupervised learning is more subjective than supervised learning, as there is no simple goal for the analysis, such as prediction of a response.
- But techniques for unsupervised learning are of growing importance in a number of fields:
  - subgroups of breast cancer patients grouped by their gene expression measurements,
  - groups of shoppers characterized by their browsing and purchase histories,
  - movies grouped by the ratings assigned by movie viewers.

Another advantage
- It is often easier to obtain unlabeled data from a lab instrument or a computer than labeled data, which can require human intervention.
- For example, it is difficult to automatically assess the overall sentiment of a movie review: is it favorable or not?

Principal Components Analysis
- PCA produces a low-dimensional representation of a dataset.
- It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated.
- Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization.

Principal Components Analysis: details
- The first principal component of a set of features X_1, X_2, ..., X_p is the normalized linear combination of the features

      Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \cdots + \phi_{p1} X_p

  that has the largest variance. By normalized, we mean that \sum_{j=1}^{p} \phi_{j1}^2 = 1.
- We refer to the elements \phi_{11}, ..., \phi_{p1} as the loadings of the first principal component; together, the loadings make up the principal component loading vector, \phi_1 = (\phi_{11}\ \phi_{21}\ \cdots\ \phi_{p1})^T.
- We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance.

PCA: example
[Figure: the population size (pop) and ad spending (ad) for 100 different cities are shown as purple circles. The green solid line indicates the first principal component direction, and the blue dashed line indicates the second principal component direction.]

Computation of Principal Components
- Suppose we have an n × p data set X. Since we are only interested in variance, we assume that each of the variables in X has been centered to have mean zero (that is, the column means of X are zero).
- We then look for the linear combination of the sample feature values of the form

      z_{i1} = \phi_{11} x_{i1} + \phi_{21} x_{i2} + \cdots + \phi_{p1} x_{ip}    (1)

  for i = 1, ..., n that has largest sample variance, subject to the constraint that \sum_{j=1}^{p} \phi_{j1}^2 = 1.
- Since each of the x_{ij} has mean zero, then so does z_{i1} (for any values of \phi_{j1}). Hence the sample variance of the z_{i1} can be written as \frac{1}{n} \sum_{i=1}^{n} z_{i1}^2.

Computation: continued
- Plugging in (1), the first principal component loading vector solves the optimization problem

      \max_{\phi_{11},\ldots,\phi_{p1}} \; \frac{1}{n} \sum_{i=1}^{n} \Big( \sum_{j=1}^{p} \phi_{j1} x_{ij} \Big)^2  subject to  \sum_{j=1}^{p} \phi_{j1}^2 = 1.

- This problem can be solved via a singular-value decomposition of the matrix X, a standard technique in linear algebra.
- We refer to Z_1 as the first principal component, with realized values z_{11}, ..., z_{n1}.

Geometry of PCA
- The loading vector \phi_1 with elements \phi_{11}, \phi_{21}, ..., \phi_{p1} defines a direction in feature space along which the data vary the most.
- If we project the n data points x_1, ..., x_n onto this direction, the projected values are the principal component scores z_{11}, ..., z_{n1} themselves.

Further principal components
- The second principal component is the linear combination of X_1, ..., X_p that has maximal variance among all linear combinations that are uncorrelated with Z_1.
- The second principal component scores z_{12}, z_{22}, ..., z_{n2} take the form

      z_{i2} = \phi_{12} x_{i1} + \phi_{22} x_{i2} + \cdots + \phi_{p2} x_{ip},

  where \phi_2 is the second principal component loading vector, with elements \phi_{12}, \phi_{22}, ..., \phi_{p2}.

Further principal components: continued
- It turns out that constraining Z_2 to be uncorrelated with Z_1 is equivalent to constraining the direction \phi_2 to be orthogonal (perpendicular) to the direction \phi_1. And so on.
- The principal component directions \phi_1, \phi_2, \phi_3, ... are the ordered sequence of right singular vectors of the matrix X, and the variances of the components are \frac{1}{n} times the squares of the singular values. There are at most min(n - 1, p) principal components.

Illustration
- USArrests data: For each of the fifty states in the United States, the data set contains the number of arrests per 100,000 residents for each of three crimes: Assault, Murder, and Rape. We also record UrbanPop (the percent of the population in each state living in urban areas).
- The principal component score vectors have length n = 50, and the principal component loading vectors have length p = 4.
- PCA was performed after standardizing each variable to have mean zero and standard deviation one.

USArrests data: PCA plot
[Figure: biplot of the first two principal components for the USArrests data. Blue state names are plotted at their first and second principal component scores (axes roughly -3 to 3); orange arrows show the loading vectors for Murder, Assault, Rape, and UrbanPop, with loading axes (-0.5 to 0.5) on the top and right.]

Figure details
The first two principal components for the USArrests data.
- The blue state names represent the scores for the first two principal components.
- The orange arrows indicate the first two principal component loading vectors (with axes on the top and right). For example, the loading for Rape on the first component is 0.54, and its loading on the second principal component is 0.17; the word Rape is centered at the point (0.54, 0.17).
- This figure is known as a biplot, because it displays both the principal component scores and the principal component loadings.
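The standardize-then-SVD recipe behind these score and loading vectors can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic stand-in data (the actual USArrests values are not reproduced in these slides), with variable names of my choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))      # stand-in for USArrests: n = 50 states, p = 4 variables

# Standardize each column: mean zero, standard deviation one
X = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD: the rows of Vt are the loading vectors phi_1, ..., phi_p
U, s, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]                      # first loading vector; its sum of squares is 1
scores = U * s                    # n x p matrix; column m holds the scores z_{1m}, ..., z_{nm}

print(scores[:, 0].shape)         # score vectors have length n = 50
print(phi1.shape)                 # loading vectors have length p = 4
```

Note that `scores` equals `X @ Vt.T`, i.e. projecting the standardized observations onto the loading directions, and the columns come out ordered by decreasing variance.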
PCA loadings

              PC1          PC2
  Murder      0.5358995   -0.4181809
  Assault     0.5831836   -0.1879856
  UrbanPop    0.2781909    0.8728062
  Rape        0.5434321    0.1673186

Another Interpretation of Principal Components
[Figure: the observations plotted in the plane spanned by the first and second principal components, both axes ranging from -1.0 to 1.0.]

PCA finds the hyperplane closest to the observations
- The first principal component loading vector has a very special property: it defines the line in p-dimensional space that is closest to the n observations (using average squared Euclidean distance as a measure of closeness).
- The notion of principal components as the dimensions that are closest to the n observations extends beyond just the first principal component.
- For instance, the first two principal components of a data set span the plane that is closest to the n observations, in terms of average squared Euclidean distance.

Scaling of the variables matters
- If the variables are in different units, scaling each to have standard deviation equal to one is recommended.
- If they are in the same units, you might or might not scale the variables.
[Figure: two biplots of the USArrests data, one with scaled variables and one unscaled; the score axes run roughly from -3 to 3 when scaled, versus roughly -100 to 150 when unscaled.]

Proportion Variance Explained
- To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one.
- The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as

      \sum_{j=1}^{p} \mathrm{Var}(X_j) = \sum_{j=1}^{p} \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2,

  and the variance explained by the mth principal component is

      \mathrm{Var}(Z_m) = \frac{1}{n} \sum_{i=1}^{n} z_{im}^2.

- It can be shown that \sum_{j=1}^{p} \mathrm{Var}(X_j) = \sum_{m=1}^{M} \mathrm{Var}(Z_m), with M = min(n - 1, p).
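Using the same SVD machinery, the PVE definitions above take only a few lines. A sketch on synthetic centered data (the names are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4)) @ rng.normal(size=(4, 4))   # synthetic data, n = 60, p = 4
X = X - X.mean(axis=0)                                   # center: column means zero

n = X.shape[0]
_, s, _ = np.linalg.svd(X, full_matrices=False)

var_explained = s**2 / n          # Var(Z_m): (1/n) times the squared singular values
total_var = np.sum(X**2) / n      # sum_j Var(X_j) = (1/n) sum over all x_ij^2
pve = var_explained / total_var   # proportion of variance explained by each component

print(pve)                        # non-increasing, and sums to 1
```

The identity at the end of the slide is exactly why `pve` sums to one: the squared Frobenius norm of X equals the sum of its squared singular values.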

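Finally, the closest-line property of the first principal component discussed above can be verified numerically: no unit direction gives a smaller average squared distance from the observations to the line it spans. A sketch comparing \phi_1 against a randomly drawn competitor direction (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])  # unequal spread per column
X = X - X.mean(axis=0)                                     # center the observations

_, _, Vt = np.linalg.svd(X, full_matrices=False)
phi1 = Vt[0]                      # first principal component direction

def avg_sq_dist(X, v):
    """Average squared Euclidean distance from the rows of X to the line spanned by unit vector v."""
    proj = np.outer(X @ v, v)     # projection of each observation onto the line
    return np.mean(np.sum((X - proj) ** 2, axis=1))

v = rng.normal(size=3)
v /= np.linalg.norm(v)            # a random competing unit direction

print(avg_sq_dist(X, phi1) <= avg_sq_dist(X, v))   # phi_1 defines the closest line
```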