K-Means Algorithms

Big Data Analytics CSCI 4030

[Course topic map: high-dimensional data (locality-sensitive hashing, clustering, dimensionality reduction), graph data (PageRank, SimRank, community detection, spam detection), infinite data (filtering data streams, queries on streams, web advertising), machine learning (SVM, decision trees, perceptron, kNN, association rules, duplicate document detection), and apps (recommender systems).]

The Problem of Clustering

- Given a cloud of data points, we want to understand its structure.
- Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that:
  - Members of a cluster are close/similar to each other.
  - Members of different clusters are dissimilar.
- Usually:
  - Points are in a high-dimensional space.
  - Similarity is defined using a distance measure: Euclidean, cosine, Jaccard, edit distance, ...

[Figure: a two-dimensional scatter of points forming three visible clusters, plus one outlier.]

Clustering Is a Hard Problem

- Clustering in two dimensions looks easy.
- Clustering small amounts of data looks easy.
- And in most cases, looks are not deceiving.
- But many applications involve not 2, but 10 or 10,000 dimensions.
- High-dimensional spaces look different: many pairs of points are at about the same distance.

Clustering Problem: Sky Objects

- A catalog of 2 billion "sky objects" represents objects by their radiation in 7 dimensions (frequency bands).
- Problem: cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
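As a minimal sketch of the distance measures listed above (the helper names are my own, not from the lecture), the Euclidean, cosine, and Jaccard distances can be computed as follows:

```python
import math

def euclidean(x, y):
    # Straight-line distance between two points.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    # 1 - cos(angle between the two vectors).
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (nx * ny)

def jaccard_distance(s, t):
    # 1 - |intersection| / |union| for two sets.
    s, t = set(s), set(t)
    return 1.0 - len(s & t) / len(s | t)

print(euclidean((0, 0), (3, 4)))               # 5.0
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 0.5
```

Which measure is appropriate depends on how the data is represented: points, vectors, or sets, as discussed below for documents.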
- Sloan Digital Sky Survey.

Clustering Problem: Music CDs

- Intuitively: music divides into categories, and customers prefer a few categories.
- But what are categories, really?
- Represent a CD by the set of customers who bought it.
- Similar CDs have similar sets of customers, and vice versa.

Space of all CDs:
- Think of a space with one dimension for each customer.
- Values in a dimension may be 0 or 1 only.
- A CD is a point in this space (x_1, x_2, ..., x_k), where x_i = 1 iff the i-th customer bought the CD.
- For Amazon, the dimension is in the tens of millions.
- Task: find clusters of similar CDs.

Clustering Problem: Documents

Finding topics:
- Represent a document by a vector (x_1, x_2, ..., x_k), where x_i = 1 iff the i-th word (in some order) appears in the document.
- Documents with similar sets of words may be about the same topic.

Cosine, Jaccard, and Euclidean

- As with CDs, we have a choice when we think of documents as sets of words or shingles:
  - Sets as vectors: measure similarity by the cosine distance.
  - Sets as sets: measure similarity by the Jaccard distance.
  - Sets as points: measure similarity by the Euclidean distance.

Overview: Methods of Clustering

- Hierarchical:
  - Agglomerative (bottom up): initially, each point is a cluster; repeatedly combine the two "nearest" clusters into one.
  - Divisive (top down): start with one cluster and recursively split it.
- Point assignment:
  - Maintain a set of clusters.
  - Points belong to the "nearest" cluster.

Hierarchical Clustering

- Key operation: repeatedly combine the two nearest clusters.
- Three important questions:
  1) How do you represent a cluster of more than one point?
  2) How do you determine the "nearness" of clusters?
  3) When do you stop combining clusters?

- (1) How to represent a cluster of many points?
  - Key problem: as you merge clusters, how do you represent the "location" of each cluster, to tell which
pair of clusters is closest?
  - Euclidean case: each cluster has a centroid = average of its (data) points.
- (2) How to determine "nearness" of clusters?
  - Measure cluster distances by the distances of their centroids.

Example: Hierarchical Clustering

[Figure: data points o at (0,0), (1,2), (2,1), (4,1), (5,0), and (5,3); merged pairs get centroids x at (1.5,1.5), (4.5,0.5), and (4.7,1.3); a dendrogram shows the grouping. Legend: o ... data point, x ... centroid.]

And in the Non-Euclidean Case?

- What about the non-Euclidean case? The only "locations" we can talk about are the points themselves; i.e., there is no "average" of two points.
- Approach 1:
  - (1) How to represent a cluster of many points? Clustroid = (data) point "closest" to the other points.
  - (2) How do you determine the "nearness" of clusters? Treat the clustroid as if it were the centroid when computing intercluster distances.

"Closest" Point?

- (1) How to represent a cluster of many points? Clustroid = point "closest" to the other points.
- Possible meanings of "closest":
  - Smallest maximum distance to the other points.
  - Smallest average distance to the other points.
  - Smallest sum of squares of distances to the other points.
    - For distance metric d, the clustroid c of cluster C is: argmin_c Σ_{x∈C} d(x, c)²
- Centroid: the average of all (data) points in the cluster. This means the centroid is an "artificial" point.
- Clustroid: an existing (data) point that is "closest" to all other points in the cluster.

[Figure: centroid (X) vs. clustroid in a cluster of three data points.]
Defining "Nearness" of Clusters

- (2) How do you determine the "nearness" of clusters?
- Approach 2: intercluster distance = minimum of the distances between any two points, one from each cluster.
- Approach 3: pick a notion of "cohesion" of clusters, e.g., maximum distance from the clustroid; merge the clusters whose union is most cohesive.

Cohesion

- Approach 3.1: use the diameter of the merged cluster = maximum distance between points in the cluster.
- Approach 3.2: use the average distance between points in the cluster.
- Approach 3.3: use a density-based approach: take the diameter or average distance, e.g., and divide by the number of points in the cluster.

Implementation

- Naïve implementation of hierarchical clustering:
  - At each step, compute pairwise distances between all pairs of clusters, then merge: O(N³).
- Too expensive for really big datasets that do not fit in memory.

k-Means Algorithm(s)

- Assumes a Euclidean space/distance.
- Start by picking k, the number of clusters.
- Initialize the clusters by picking one point per cluster.
  - Example: pick one point at random, then k−1 other points, each as far away as possible from the previous points.

Populating Clusters

- 1) For each point, place it in the cluster whose current centroid it is nearest.
- 2) After all points are assigned, update the locations of the centroids of the k clusters.
- 3) Reassign all points to their closest centroid.
  - This sometimes moves points between clusters.
- Repeat 2 and 3 until convergence.
  - Convergence: points don't move between clusters and centroids stabilize.

[Figures: the same data set (legend: x ... centroid, dots ... data points) showing the clusters after round 1, after round 2, and at the end.]

Getting the k Right

- How to select k? Try different k, looking at
the change in the average distance to the centroid as k increases.
- The average falls rapidly until the right k, then changes little.

[Figure: average distance to centroid vs. k; the "elbow" marks the best value of k. With too few clusters, there are many long distances to the centroid; with the right k, distances are rather short; with too many clusters, there is little improvement in the average distance.]

The BFR Algorithm: Extension of k-means to Large Data

- BFR (Bradley-Fayyad-Reina) is a variant of k-means designed to handle very large (disk-resident) data sets.
- Assumes that clusters are normally distributed around a centroid in a Euclidean space.
  - Standard deviations in different dimensions may vary.
  - Clusters are axis-aligned ellipses.
- Efficient way to summarize clusters: we want the memory required to be O(clusters), not O(data).

BFR Algorithm

- Points are read from disk into main memory in chunks.
- Most points from previous memory loads are summarized by simple statistics.
- To begin, from the initial load we select the initial k centroids by some sensible approach:
  - Take k random points.
  - Take a small random sample and cluster it optimally.
  - Take a sample; pick a random point, and then k−1 more points, each as far from the previously selected points as possible.

Three Classes of Points

3 sets of points which we keep track of:
- Discard set (DS): points close enough to a centroid to be summarized. Only summaries are kept in main memory.
- Compression set (CS): groups of points that are close together but not close to any existing centroid. These points are summarized, but not assigned to a cluster. Only summaries are kept in main memory.
- Retained set (RS): isolated points waiting to be assigned to a compression set. Held in main memory exactly as they are.

[Figure: the "Galaxies" picture — a cluster with its centroid, whose points are in the DS; compressed sets, whose points are in the CS; isolated points in the RS.]

Summarizing Sets of Points

For each cluster, the discard set (DS) is summarized by:
- N: the number of points.
- The vector SUM, whose i-th component is the sum of the coordinates of the points in the i-th dimension.
- The vector SUMSQ, whose i-th component is the sum of the squares of the coordinates in the i-th dimension.

Summarizing Points: Comments

- 2d + 1 values represent a cluster of any size (d = number of dimensions).
- The average in each dimension (the centroid) can be calculated as SUM_i / N, where SUM_i is the i-th component of SUM.
- The variance of a cluster's discard set in dimension i is: (SUMSQ_i / N) − (SUM_i / N)².
  - The standard deviation is the square root of that.
- Next step: actual clustering.

The "Memory-Load" of Points

Processing the "Memory-Load" of points (1), i.e., a chunk of points:
- 1) Find those points that are "sufficiently close" to a cluster centroid; add those points to that cluster and to the DS.
  - These points are so close to the centroid that they can be summarized and then discarded.
- 2) Use any main-memory clustering algorithm to cluster the remaining points and the old RS.
  - Clusters go to the CS; outlying points go to the RS.
Processing the "Memory-Load" of points (2):
- 3) DS: adjust the statistics of the clusters to account for the new points.
  - Add the Ns, SUMs, and SUMSQs.
- 4) Consider merging compressed sets in the CS.
- 5) If this is the last round (last chunk of data), merge all compressed sets in the CS and all RS points into their nearest cluster (or treat them as outliers).

A Few Details...

- Q1) How do we decide if a point is "close enough" to a DS cluster that we will add the point to that cluster?
- Q2) How do we decide whether two compressed sets (CS) deserve to be combined into one?

How Close Is Close Enough?

- Q1) We need a way to decide whether to put a new point into a cluster (and discard it).
- E.g., BFR suggests: accept the point if its Mahalanobis distance is less than a threshold.

Mahalanobis Distance

- Normalized Euclidean distance from the centroid.
- For point (x_1, ..., x_d) and centroid (c_1, ..., c_d):
  1. Normalize in each dimension: y_i = (x_i − c_i) / σ_i
  2. Take the sum of the squares of the y_i.
  3. Take the square root.

  d(x, c) = sqrt( Σ_{i=1..d} ((x_i − c_i) / σ_i)² )

  where σ_i is the standard deviation of the points in the cluster in the i-th dimension.

- If clusters are normally distributed in d dimensions, then after the transformation, one standard deviation = √d.
  - I.e., 68% of the points of the cluster will have a Mahalanobis distance < √d.
- Accept a point for a cluster if its Mahalanobis distance
is below some threshold, e.g., 2 standard deviations.

[Figure: Euclidean vs. Mahalanobis distance — contours of points equidistant from the origin, for uniformly distributed points under Euclidean distance, normally distributed points under Euclidean distance, and normally distributed points under Mahalanobis distance.]

Should 2 CS Subclusters Be Combined?

- Q2) Should two CS subclusters be combined?
- Compute the variance of the combined subcluster.
  - N, SUM, and SUMSQ allow us to make that calculation quickly.
- Combine if the combined variance is below some threshold.

The CURE Algorithm: Extension of k-means to Clusters of Arbitrary Shapes

- Problem with BFR:
  - Assumes clusters are normally distributed in each dimension.
  - And the axes are fixed; ellipses at an angle are not OK.
- CURE (Clustering Using REpresentatives):
  - Assumes a Euclidean distance.
  - Allows clusters to assume any shape.
  - Uses a collection of representative points to represent clusters.

[Figure: salary vs. age scatter with two classes of points, h and e, forming non-elliptical clusters.]

Starting CURE

2-pass algorithm. Pass 1:
- 0) Pick a random sample of points that fit in main memory.
- 1) Initial clusters: cluster these points hierarchically, grouping the nearest points/clusters.
- 2) Pick representative points:
  - For each cluster, pick a sample of points, as dispersed as possible.
  - From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster.

[Figures: the salary/age data set; picking (say) 4 remote points for each cluster; moving the points (say) 20% toward the centroid.]
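The "move the representatives (say) 20% toward the centroid" step can be sketched as follows. This is a minimal illustration with hypothetical helper names; the shrink fraction and the choice of dispersed sample points are parameters of CURE:

```python
def centroid(points):
    # Component-wise average of a list of points.
    d, n = len(points[0]), len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(d))

def shrink_toward_centroid(representatives, cluster_points, fraction=0.2):
    # Move each representative the given fraction of the way to the centroid.
    c = centroid(cluster_points)
    return [tuple(r[i] + fraction * (c[i] - r[i]) for i in range(len(r)))
            for r in representatives]

# A representative at (0, 0) in a cluster whose centroid is (10, 10)
# moves to (2, 2) when shrunk by 20%.
reps = shrink_toward_centroid([(0, 0)], [(0, 0), (20, 20)], fraction=0.2)
print(reps)  # [(2.0, 2.0)]
```

Moving the dispersed sample points inward keeps the representatives inside the cluster's shape while still tracing its outline.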
Finishing CURE

Pass 2:
- Now rescan the whole dataset and visit each point p in the data set.
- Place p in the "closest cluster."
  - Normal definition of "closest": find the representative closest to p and assign p to that representative's cluster.

Summary

- Clustering: given a set of points, with a notion of distance between points, group the points into some number of clusters.
- Algorithms:
  - Agglomerative hierarchical clustering: centroid and clustroid.
  - k-means: initialization, picking k.
  - BFR.
  - CURE.

Exercise 1

- Apply hierarchical clustering on the following data in a 2-dimensional Euclidean space.
- Assume the stopping point is k = 6.
- Present the tree showing the complete grouping of the points.

Ans:
- Initially, each point is in a cluster by itself and is the centroid of that cluster.
- Among all the pairs of points, there are two pairs that are the closest: (10,5) and (11,4), or (11,4) and (12,3). Each is at distance √2.
- Let us break the tie arbitrarily and decide to combine (11,4) with (12,3). The result is shown in the figure below, including the centroid of the new cluster, which is at (11.5, 3.5).
- Thus, now the two closest clusters are those of the points (4,8) and (4,10). We combine them into one cluster with centroid (4,9). See the figure on the next slide.
- At this point, the two closest centroids are (10,5) and (11.5, 3.5), so we combine these two clusters. The result is a cluster of three points (10,5), (11,4), and (12,3). The centroid of this cluster is (11,4).
- Now, there are several pairs of centroids that are at distance √5, and these are the closest centroids.
- (6,8) is combined with the cluster of two elements having centroid (4,9).
- (2,2) is combined with (3,4).
- (9,3) is combined with the cluster of three elements having centroid (11,4).

[Figure: the tree (dendrogram) presenting the grouping.]

Exercise 2

- Let us consider the twelve points presented below. The k-means clustering algorithm for k = 3 is used.
- Assume the initial choice of point is (6,8). What are the other two initial points? Pick points that are as far away as possible.
- Provide the result of the k-means algorithm after the 2nd round.

Ans:
- The furthest point from (6,8) is (12,3), so that point is chosen.
- Among the remaining ten points, the one whose minimum distance to either (6,8) or (12,3) is a maximum is (2,2).
  - That point has distance √52 = 7.21 from (6,8) and distance √101 = 10.05 to (12,3); thus its "score" is 7.21.
  - You can check easily that any other point is less than distance 7.21 from at least one of (6,8) and (12,3).
- Our selection of three starting points is thus (6,8), (12,3), and (2,2).
- Note: compute the result of k-means after the 2nd round on your own and compare with other students.

Exercise 3

- Which of the following three shapes are allowed for clusters in the BFR algorithm?

Ans:
- The clusters in data for which the BFR algorithm may be used can have standard deviations that differ along different axes, but the axes of the cluster must align with the axes of the space.

Exercise 4

- Suppose a cluster consists of the points (5, 1), (6, −2), and (7, 0).
- Compute each of the following statistics: N, SUM, SUMSQ, centroid, variance, standard deviation.
- Do the statistics have to be computed from scratch in each round of the BFR algorithm?

Ans:
- N = 3, SUM = (18, −1), and SUMSQ = (110, 5).
- The centroid is SUM/N, or (6, −1/3).
- The variance in the first dimension is 110/3 − (18/3)² = 0.667, so the standard deviation is √0.667 = 0.816.
- In the second dimension, the variance is 5/3 − (−1/3)² = 1.56, so the standard deviation is √1.56 = 1.25.
- NO. They can be updated based on the previous values of the statistics to save computation. This is at the heart of the BFR algorithm.

Exercise 5

- Assuming the following selection of representative points from each cluster, move the representative points 20% of the distance to the cluster's centroid.

Ans:
- [Figure: the representative points after moving 20% of the distance to the cluster's centroid.]

Exercise 6

- Suppose a cluster of three-dimensional points has standard deviations of 2, 3, and 5, in the three dimensions, in that order. Compute the Mahalanobis distance between the origin (0, 0, 0) and the point (1, −3, 4).

Ans:
- Mahalanobis distance: ...
- Compute the results individually and then compare them with other students.

Summary

- Clustering: clusters are often a useful summary of data that is in the form of points in some space.
- To cluster points, we need a distance measure on that space. Ideally, points in the same cluster have small distances between them, while points in different clusters have large distances between them.
- Clustering algorithms generally have one of two forms:
  - Hierarchical clustering algorithms begin with all points in a cluster of their own, and nearby clusters are merged iteratively.
  - Point-assignment clustering algorithms consider points in turn and assign them to the cluster in which they best fit.
- Centroids and clustroids: in a Euclidean space, the members of a cluster can be averaged, and this average is called the centroid.
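The N/SUM/SUMSQ bookkeeping from Exercise 4 can be checked with a short script (a sketch; the function names are my own). Because SUM and SUMSQ are additive, the summaries for a new chunk can simply be added to the old ones, which is why BFR never recomputes from scratch:

```python
def summarize(points):
    # BFR discard-set summary: N, per-dimension SUM and SUMSQ.
    d, n = len(points[0]), len(points)
    s = [sum(p[i] for p in points) for i in range(d)]
    sq = [sum(p[i] ** 2 for p in points) for i in range(d)]
    return n, s, sq

def centroid_and_variance(n, s, sq):
    # Centroid_i = SUM_i / N; variance_i = SUMSQ_i / N - (SUM_i / N)^2.
    cent = [si / n for si in s]
    var = [sqi / n - (si / n) ** 2 for si, sqi in zip(s, sq)]
    return cent, var

n, s, sq = summarize([(5, 1), (6, -2), (7, 0)])
print(n, s, sq)  # 3 [18, -1] [110, 5]
cent, var = centroid_and_variance(n, s, sq)
# cent ~ [6.0, -0.333]; var ~ [0.667, 1.556], matching the worked answer.
```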
- In non-Euclidean spaces, there is no guarantee that points have an "average," so we are forced to use one of the members of the cluster as a representative or typical element of the cluster. That representative is called the clustroid.
- k-means algorithms: this family of algorithms is of the point-assignment type and assumes a Euclidean space.
  - It is assumed that there are exactly k clusters for some known k.
  - After picking k initial cluster centroids, the points are considered one at a time and assigned to the closest centroid.
  - The centroid of a cluster can migrate during point assignment, and an optional last step is to reassign all the points, while holding the centroids fixed at their final values obtained during the first pass.
- The BFR algorithm: a version of k-means designed to handle data that is too large to fit in main memory. It assumes clusters are normally distributed about the axes.
  - Processing points in BFR: most of the points in a main-memory load will be assigned to a nearby cluster, and the parameters for that cluster will be adjusted to account for the new points.
  - Unassigned points can be formed into new miniclusters, and these miniclusters can be merged with previously discovered miniclusters or retained points.
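The k-means procedure summarized above can be sketched in a few lines. This is a minimal in-memory version for illustration (real implementations, and BFR in particular, avoid holding all points at once):

```python
import math
import random

def kmeans(points, k, rounds=100, seed=0):
    # Plain k-means: assign each point to its nearest centroid, then
    # recompute centroids; stop when assignments no longer change.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(rounds):
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:  # convergence
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return centroids, assignment

# Two well-separated groups; k-means recovers them.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents, assign = kmeans(pts, 2)
```

Note that this sketch uses random initial centroids; the farthest-point initialization from the lecture generally gives better starting points.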