Question? Leave a message!




K-Means Algorithms

K-Means Algorithms
Dr.GordenMorse Profile Pic
Dr.GordenMorse,France,Professional
Published Date:22-07-2017
Website URL
Comment
Big Data Analytics CSCI 4030High dim. Graph Infinite Machine Apps data data data learning Locality Filtering PageRank, Recommen sensitive data SVM SimRank der systems hashing streams Community Web Decision Association Clustering Detection advertising Trees Rules Dimensional Duplicate Spam Queries on Perceptron, ity document Detection streams kNN reduction detection Big Data Analytics CSCI 4030 2 Given a cloud of data points we want to understand its structure Big Data Analytics CSCI 4030 3 Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that  Members of a cluster are close/similar to each other  Members of different clusters are dissimilar  Usually:  Points are in a high-dimensional space  Similarity is defined using a distance measure  Euclidean, Cosine, Jaccard, edit distance, … Big Data Analytics CSCI 4030 4Big Data Analytics CSCI 4030 5x x x xx xx x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x Cluster Outlier Big Data Analytics CSCI 4030 6Big Data Analytics CSCI 4030 7Big Data Analytics CSCI 4030 8 Clustering in two dimensions looks easy  Clustering small amounts of data looks easy  And in most cases, looks are not deceiving  Many applications involve not 2, but 10 or 10,000 dimensions  High-dimensional spaces look different: Many pairs of points are at about the same distance Big Data Analytics CSCI 4030 9 A catalog of 2 billion “sky objects” represents objects by their radiation in 7 dimensions (frequency bands)  Problem: Cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.  Sloan Digital Sky Survey Big Data Analytics CSCI 4030 10 Intuitively: Music divides into categories, and customers prefer a few categories  But what are categories really?  Represent a CD by a set of customers who bought it  Similar CDs have similar sets of customers, and vice-versa Big Data Analytics CSCI 4030 11Space of all CDs:  Think of a space with one dim. for each customer  Values in a dimension may be 0 or 1 only  A CD is a point in this space (x , x ,…, x ), 1 2 k th where x = 1 iff the i customer bought the CD i  For Amazon, the dimension is tens of millions  Task: Find clusters of similar CDs Big Data Analytics CSCI 4030 12Finding topics:  Represent a document by a vector th (x , x ,…, x ), where x = 1 iff the i word 1 2 k i (in some order) appears in the document  Documents with similar sets of words may be about the same topic Big Data Analytics CSCI 4030 13 As with CDs we have a choice when we think of documents as sets of words or shingles:  Sets as vectors: Measure similarity by the Cosine Distance  Sets as sets: Measure similarity by the Jaccard Distance  Sets as points: Measure similarity by Euclidean Distance Big Data Analytics CSCI 4030 14 Hierarchical:  Agglomerative (bottom up):  Initially, each point is a cluster  Repeatedly combine the two “nearest” clusters into one  Divisive (top down):  Start with one cluster and recursively split it  Point assignment:  Maintain a set of clusters  Points belong to “nearest” cluster Big Data Analytics CSCI 4030 15 Key operation: Repeatedly combine two nearest clusters  Three important questions:  1) How do you represent a cluster of more than one point?  2) How do you determine the “nearness” of clusters?  3) When to stop combining clusters? Big Data Analytics CSCI 4030 16 Key operation: Repeatedly combine two nearest clusters  (1) How to represent a cluster of many points?  Key problem: As you merge clusters, how do you represent the “location” of each cluster, to tell which pair of clusters is closest?  Euclidean case: each cluster has a centroid = average of its (data)points  (2) How to determine “nearness” of clusters?  Measure cluster distances by distances of centroids Big Data Analytics CSCI 4030 17(5,3) o (1,2) o x (1.5,1.5) x (4.7,1.3) o (2,1) o (4,1) x (1,1) x (4.5,0.5) o (0,0) o (5,0) Data: o … data point x … centroid Dendrogram Big Data Analytics CSCI 4030 18What about the Non-Euclidean case?  The only “locations” we can talk about are the points themselves  i.e., there is no “average” of two points  Approach 1:  (1) How to represent a cluster of many points? clustroid = (data)point “closest” to other points  (2) How do you determine the “nearness” of clusters? Treat clustroid as if it were centroid, when computing inter-cluster distances Big Data Analytics CSCI 4030 19 (1) How to represent a cluster of many points? clustroid = point “closest” to other points  Possible meanings of “closest”:  Smallest maximum distance to other points  Smallest average distance to other points  Smallest sum of squares of distances to other points 2 min d(x,c)  For distance metric d clustroid c of cluster C is:  c xC Centroid Datapoint Centroid is the avg. of all (data)points in the cluster. This means centroid is X an “artificial” point. Clustroid Clustroid is an existing (data)point that is “closest” to all other points in Cluster on the cluster. 3 datapoints Big Data Analytics CSCI 4030 20