Chapter 7
Partition-Based Clustering with k-Means

7.1 Exploratory Data Analysis and Clustering

Nowadays, huge data-sets are commonly publicly available, and it becomes increasingly important to process them efficiently in order to discover worthwhile structures (or "patterns") in those seas of data. Exploratory data analysis is concerned with this challenge of finding such structural information without any prior knowledge: techniques that learn from data without prior knowledge are generically called unsupervised machine learning.

Let X = {x_1, ..., x_n} denote a data-set, like a collection of images (often static, that is, given once and for all when we begin the analysis). We seek compact subsets of the data, called clusters, that represent categories of data (say, the cluster of car images or the cluster of cat images, etc.). Each datum x_i ∈ X (with X denoting the space of data, usually X ⊂ R^d) is described as an attribute vector x_i = (x_i^1, ..., x_i^d), called a feature vector. We adopt the notation x_i^(j) for the j-th coordinate of vector x_i. Attributes can be either numerical quantities or categorical values (that is, qualitative attributes) like the fixed set of words of a dictionary, or ordered categorical data (like the ordered ranking A < B < C < D < E), or a mix of those different data types.

Exploratory analysis differs from supervised classification, which consists, in a first stage, in learning a classifier function C(·) from the labeled data of a training set Z = {(x_1, y_1), ..., (x_n, y_n)}, with the x_i's the attributes and the y_i's the class labels, in order, in a second stage, to classify new unlabeled observations x_j of a testing set: ŷ_j = C(x_j). The hat notation in ŷ_j indicates that the class has been estimated: a so-called inference task, performed from a training data set.
Clustering is a set of techniques that consists in detecting subsets of data that define groups, or clusters. Those groups should ideally represent semantic categories of data: for example, flower groups organized by species in a database of flower images. One such famous public data-set, available from the UCI repository, is the file Iris: it contains n = 150 numerical data in dimension 4 (with attributes describing the length and the width of both the sepals and the petals, in centimeters), classified into k = 3 botanical groups: Setosa iris, Virginica iris, and Versicolor iris. To summarize, classification enables one to label new observations, while clustering allows one to discover those classes as clusters (Fig. 7.1).

(© Springer International Publishing Switzerland 2016. F. Nielsen, Introduction to HPC with MPI for Data Science, Undergraduate Topics in Computer Science, DOI 10.1007/978-3-319-21903-5_7.)

Fig. 7.1 Exploratory analysis consists in finding intrinsic structures in data-sets, like groups of data called clusters. Clustering is a set of techniques that seek homogeneous clusters in data-sets. In this 2D toy example, the human eye perceives three well-formed clusters for the digits '4', '4', '2'. In practice, data-sets live in high dimensions, and thus cannot be visually inspected; we therefore require clustering algorithms to find those groups automatically.

7.1.1 Hard Clustering: Partitioning Data Sets

Partition-based clustering consists in dividing a data set X = {x_1, ..., x_n} into k homogeneous groups G_1 ⊂ X, ..., G_k ⊂ X (the non-overlapping clusters G_i) such that:

X = ∪_{i=1}^k G_i, with G_i ∩ G_j = ∅ for all i ≠ j,

which we write for short as X := ⊎_{i=1}^k G_i. The notation a := b indicates that the equality sign should be understood by definition (that is, it is not an equality resulting from some mathematical calculation).
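The two partition constraints above (the groups cover X, and no element belongs to two groups) can be stated directly in code; a minimal, illustrative Python sketch with hypothetical element names:

```python
# Illustrative sketch: checking that candidate groups form a valid hard
# partition of X (a pairwise-disjoint cover), per the definition above.

X = {"x1", "x2", "x3", "x4", "x5"}
G = [{"x1", "x2"}, {"x3"}, {"x4", "x5"}]

# Cover: X = G_1 ∪ ... ∪ G_k
assert set().union(*G) == X
# Disjointness: G_i ∩ G_j = ∅ for i != j (group sizes add up to |X|)
assert sum(len(g) for g in G) == len(X)
print("valid partition")
```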
Thus a data element (datum) x_i is allocated to a unique group G_{l(x_i)}: partition-based clustering is a hard clustering technique, and differentiates itself from soft clustering techniques that give a positive membership weight l_{i,j} ≥ 0 for all the x_i's and all the groups G_j's, with Σ_{j=1}^k l_{i,j} = 1 (normalization constraint). Hard clustering is the special case l_{i,j} = 1 if and only if (iff.) j = l(x_i). We denote by L = [l_{i,j}] the membership matrix, of size n × k.

7.1.2 Cost Functions and Model-Based Clustering

Finding a good partitioning of the data X = ⊎_{i=1}^k G_i requires being able to evaluate the clustering fitness of partitions. However, we often proceed the other way around: from a given cost function, we seek an efficient algorithm that partitions X by minimizing this prescribed cost function. A generic cost function e_k(·;·) (also synonymously called an energy function, loss function, or objective function) is written as the sum of the costs of each group, as follows:

e_k(X; G_1, ..., G_k) = Σ_{i=1}^k e_1(G_i),

with e_1(G) the cost function for a single group.

We can also associate to each group G_i a model c_i that defines the "center" of that cluster. The collection of centers, the c_i's, are called the prototypes, and those prototypes allow one to define a distance between any data point x ∈ X and any cluster G (with corresponding prototype c) as follows:

D_M(x, G) = D(x, c).

The function D_M(x, G) denotes the distance between an element x and a cluster, using the prototype c of that cluster. The function D(p, q) is a base distance, to be chosen appropriately according to the nature of the data set.
Given the set of k prototypes C = {c_1, ..., c_k}, one can define the overall cost of a partition-based clustering by:

e_k(X; C) = Σ_{i=1}^n min_{j ∈ {1,...,k}} D(x_i, c_j),

and the cost of a single cluster is defined by e_1(G, c) = Σ_{x ∈ G} D(x, c). Model-based clustering with a single center associated to each cluster induces a partition of the data set X: G(C) = ⊎_{j=1}^k G_j, with G_j = {x_i ∈ X : D(x_i, c_j) ≤ D(x_i, c_l) ∀ l ∈ {1, ..., k}}.

There exist many clustering cost/loss functions, which give rise to many different kinds of partitions. Next, we shall introduce the most celebrated such function, called k-means, and explain why minimizing this loss function provides good clustering partitions in practice.

7.2 The k-Means Objective Function

The k-means cost function asks to minimize the sum of squared Euclidean distances of the data points to their closest prototype centers:

e_k(X; C) = Σ_{i=1}^n min_{j ∈ {1,...,k}} ||x_i − c_j||².

Although the squared Euclidean distance D(x, c) = ||x − c||² is a symmetric dissimilarity measure, equal to zero if and only if x = c, it is not a metric, because it fails to satisfy the triangle inequality of the ordinary Euclidean distance ||x − c|| = sqrt(Σ_{j=1}^d (x^j − c^j)²). In fact, there is a good reason to choose the squared Euclidean distance instead of the Euclidean distance: the cost of a single cluster e_1(G) = e_1(G, c) is minimized when we choose for the cluster prototype its center of mass c, called the centroid:

c(G) := argmin_c Σ_{x ∈ G} ||x − c||² = (1/|G|) Σ_{x ∈ G} x,

where |G| denotes the cardinality of G, that is, the number of elements contained in group G. We use the notation argmin_x f(x) to denote the argument that yields the minimum, in case this minimum value is unique.³

Thus the minimal cost is e_1(G, c(G)) = Σ_{x ∈ G} ||x − c(G)||² := v(G), the normalized variance of cluster G. Indeed, the normalized variance of X is defined in statistics as:

v(X) = (1/n) Σ_{i=1}^n ||x_i − x̄||²,

with x̄ = (1/n) Σ_{i=1}^n x_i the center of mass.
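The centroid's optimality can be checked numerically; an illustrative Python sketch (toy 2D cluster, not from the book), using the fact that any candidate center c incurs an extra cost of exactly |G| · ||c − c(G)||² over the centroid:

```python
# Numerically check that the centroid minimizes e1(G, c) = sum_x ||x - c||^2
# (toy 2D data; illustrative sketch only).

def e1(G, c):
    """Cost of a single cluster with prototype c (sum of squared distances)."""
    return sum((x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2 for x in G)

G = [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]
n = len(G)
centroid = (sum(x[0] for x in G) / n, sum(x[1] for x in G) / n)  # (1.0, 1.0)

for dx, dy in [(0.1, 0.0), (-0.2, 0.3), (0.0, -0.5)]:
    c = (centroid[0] + dx, centroid[1] + dy)
    # any other center is strictly worse, by exactly n * ||c - centroid||^2:
    assert abs(e1(G, c) - (e1(G, centroid) + n * (dx * dx + dy * dy))) < 1e-9
    assert e1(G, c) > e1(G, centroid)

print(e1(G, centroid))  # -> 8.0
```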
Technically speaking, one often meets in the literature the unbiased variance formula v_unbiased(X) = (1/(n−1)) Σ_{i=1}^n ||x_i − x̄||², but for fixed n the minimizer of the biased/unbiased variance does not change. We can rewrite the variance of a d-dimensional point cloud X as:

v(X) = (1/n) Σ_{i=1}^n ||x_i||² − ||x̄||².

This formula is mathematically the same as the variance of a random variable X:

V[X] = E[(X − μ(X))²] = E[X²] − (E[X])²,

where μ(X) = E[X] denotes the expectation of the random variable X.

(³ Otherwise, we can choose the "smallest" x that yields the minimum value, according to some lexicographic order on X.)

We can define a positive weight attribute w_i = w(x_i) > 0 for each element x_i of X, such that Σ_{i=1}^n w_i = 1 (normalized data weights). The following theorem characterizes the prototype c as the center for a single cluster (case k = 1 with X = G_1):

Theorem 3 Let X = {(w_1, x_1), ..., (w_n, x_n)} ⊂ R^d be a weighted data-set with w_i > 0 and Σ_{i=1}^n w_i = 1. The center c that minimizes the weighted variance v(X) = Σ_{i=1}^n w_i ||x_i − c||² is the unique barycenter: c = x̄ = Σ_{i=1}^n w_i x_i.

Proof Let ⟨x, y⟩ denote the scalar product: ⟨x, y⟩ = x⊤y = Σ_{j=1}^d x^j y^j = ⟨y, x⟩. The scalar product is a symmetric bilinear form: ⟨λx + b, y⟩ = λ⟨x, y⟩ + ⟨b, y⟩ for λ ∈ R. The squared Euclidean distance D(x, y) = ||x − y||² can be rewritten using scalar products as D(x, y) = ⟨x − y, x − y⟩ = ⟨x, x⟩ − 2⟨x, y⟩ + ⟨y, y⟩.

We seek to minimize min_{c ∈ R^d} Σ_{i=1}^n w_i ⟨x_i − c, x_i − c⟩. We can mathematically rewrite this optimization problem as:

min_{c ∈ R^d} Σ_{i=1}^n w_i ⟨x_i − c, x_i − c⟩
= min_{c ∈ R^d} Σ_{i=1}^n w_i (⟨x_i, x_i⟩ − 2⟨x_i, c⟩ + ⟨c, c⟩)
= min_{c ∈ R^d} ( Σ_{i=1}^n w_i ⟨x_i, x_i⟩ − 2⟨Σ_{i=1}^n w_i x_i, c⟩ + ⟨c, c⟩ ),

using Σ_{i=1}^n w_i = 1 for the ⟨c, c⟩ term. We can remove the term Σ_{i=1}^n w_i ⟨x_i, x_i⟩ from the minimization, since it is independent of c. Thus we seek to minimize, equivalently:

min_{c ∈ R^d} E(c) := −2⟨Σ_{i=1}^n w_i x_i, c⟩ + ⟨c, c⟩.

A convex function f(x) satisfies f(αx + (1−α)y) ≤ α f(x) + (1−α) f(y) for any α ∈ [0, 1]. It is a strictly convex function iff.
f(αx + (1−α)y) < α f(x) + (1−α) f(y) for any α ∈ (0, 1). Figure 7.2 plots the graph of a strictly convex function. Note that the epigraph, defined as the geometric object F = {(x, y) : x ∈ R, y ≥ f(x)}, is a geometric convex object. A geometric convex object satisfies the property that any line segment joining two points of the object fully lies inside the object.

For univariate convex functions, there exists at most one global minimum x* (for example, exp(−x) is strictly convex but attains no minimum), and it can be found by setting the derivative to zero: f′(x*) = 0. For a multivariate real-valued function F, we denote by ∇_x F(x) its gradient (the vector of partial derivatives), and by ∇²_x F(x) its Hessian matrix (of second-order derivatives). A smooth function F is strictly convex if ∇²F ≻ 0, where M ≻ 0 denotes that the matrix M is positive-definite: ∀ x ≠ 0, x⊤Mx > 0. A strictly convex function admits at most one global minimum x*, such that ∇F(x*) = 0.

Fig. 7.2 Plot of a strictly convex function, satisfying f(αx + (1−α)y) < α f(x) + (1−α) f(y) for any α ∈ (0, 1).

We get the following d partial derivatives to set to zero:

∂E(c)/∂c^j = −2 Σ_{i=1}^n w_i x_i^j + 2c^j, ∀ j ∈ {1, ..., d}.

Consider the d² second-order derivatives (for proving the convexity of the objective function):

∂²E(c)/∂c^j ∂c^l = 2 if l = j, and 0 otherwise, for all j, l ∈ {1, ..., d}.

Hence the Hessian is 2I ≻ 0: the cost function E(c) is strictly convex, and admits a unique minimum. This minimum is obtained by zeroing all partial derivatives:

∂E(c)/∂c^j = 0 ⇔ c^j = Σ_{i=1}^n w_i x_i^j.

Thus we have shown that the minimizer of the weighted sum of squared Euclidean distances of the center to the points is the unique barycenter:

c = x̄ = Σ_{i=1}^n w_i x_i.

The centroid is also called the isobarycenter when w_i = 1/n. □

Fig.
7.3 The k-means cost function tends to find globular-shaped clusters that minimize the weighted sum of the cluster variances. k-Means clustering is a model-based clustering where each cluster is associated to a prototype: its center of mass, or centroid. Here, we have chosen k = 4 groups for the k-means; the cluster prototypes, the centroids, are illustrated with large disks.

If, instead of choosing the squared Euclidean distance, we had chosen the ordinary Euclidean distance, one would obtain the so-called Fermat-Weber point, which generalizes the notion of median. It is thus also called the geometric median. Although the Fermat-Weber point is unique and often used in operations research for facility-location problems, it does not admit a closed-form solution, but it can be approximated arbitrarily finely. The k-median clustering is the clustering obtained by minimizing the cost function min_C Σ_{i=1}^n min_{j ∈ {1,...,k}} ||x_i − c_j|| (observe that the squared Euclidean distance of k-means has been replaced by the regular Euclidean distance). Note that the partitions obtained from k-means and k-medians can be very different from each other. Indeed, the centroid location can differ from the median even for a single cluster. Moreover, centroids can easily be corrupted by adding a single outlier point. We say that the breakdown point of the centroid is 0: a single outlier p_0 diverging to infinity will drive the centroid to infinity too. But the median is provably more robust, since it requires about n/2 outliers (that is, about 50% of outliers) to steer the median point to ∞. Therefore, k-median clustering is often preferred when there are many outliers in the data-set.

Let us remark that finding the center of a single cluster is the particular case of clustering with k = 1 cluster. With the squared Euclidean distance cost, we find that the center is the mean of the attributes: hence the name, k-means.
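The breakdown-point claim is easy to see numerically: a single huge outlier drags the mean away, while the median barely moves. An illustrative Python sketch (toy 1D data, not from the book):

```python
# Illustrative sketch: the centroid (mean) has breakdown point 0, while the
# median resists outliers (hypothetical 1D data).
data = [1.0, 2.0, 3.0, 4.0, 5.0]

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s = sorted(xs)
    m = len(s) // 2
    return s[m] if len(s) % 2 else 0.5 * (s[m - 1] + s[m])

corrupted = data + [10 ** 6]             # a single outlier diverging to infinity
print(mean(data), mean(corrupted))        # 3.0 vs ~166669.2: the mean diverges
print(median(data), median(corrupted))    # 3.0 vs 3.5: the median barely moves
```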
Figure 7.3 displays the clustering result on a given data-set. This figure has been produced using the following code in the R language:

WWW source code: Example-kMeans.R

# filename: Example-kMeans.R
# k-means clustering using the R language
N <- 100000
x <- matrix(0, N, 2)
x[seq(1, N, by=4), ] <- rnorm(N/2)
x[seq(2, N, by=4), ] <- rnorm(N/2, 3, 1)
x[seq(3, N, by=4), ] <- rnorm(N/2, -3, 1)
x[seq(4, N, by=4), 1] <- rnorm(N/4, 2, 1)
x[seq(4, N, by=4), 2] <- rnorm(N/4, -2.5, 1)
start.kmeans <- proc.time()[3]
ans.kmeans <- kmeans(x, 4, nstart=3, iter.max=10, algorithm="Lloyd")
ans.kmeans$centers
end.kmeans <- proc.time()[3]
end.kmeans - start.kmeans
these <- sample(1:nrow(x), 1000)
plot(x[these, 1], x[these, 2], pch="+", xlab="x", ylab="y")
title(main="Clustering", sub="(globular shapes of clusters)")
points(ans.kmeans$centers, pch=19, cex=2, col=1:4)

7.2.1 Rewriting the k-Means Cost Function for a Dual Interpretation of Clustering: Group Intra-cluster or Separate Inter-cluster Data

The k-means cost function seeks compact globular clusters of small variances. Indeed, the cost function can be reinterpreted as the minimization of the weighted sum of cluster variances, as follows:

min_{C={c_1,...,c_k}} Σ_{i=1}^n w_i min_{j ∈ {1,...,k}} ||x_i − c_j||²
= min_C Σ_{j=1}^k Σ_{x ∈ G_j} w(x) ||x − c_j||²
= min_C Σ_{j=1}^k W_j v(G_j),

where W_j := Σ_{x ∈ G_j} w(x) denotes the cumulative weight of the elements of cluster G_j (see Sect. 7.12).

We can also show that clustering data into homogeneous groups corresponds, equivalently, to separating the data of X into groups. Indeed, let A := Σ_{i=1}^n Σ_{j=i+1}^n ||x_i − x_j||² denote the constant that is the sum of the inter-point squared Euclidean distances (fixed for a given data-set, and independent of k).
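The rewriting of the cost as a weighted sum of cluster variances can be checked numerically when the centers sit at their group centroids (as after a Lloyd update step); an illustrative Python sketch with toy 1D data and uniform weights w_i = 1/n:

```python
# Illustrative check (toy 1D data, uniform weights w_i = 1/n) that the k-means
# cost equals the weighted sum of cluster variances: e_k = sum_j W_j v(G_j),
# assuming each center is the centroid of its induced group.

X = [0.0, 1.0, 2.0, 10.0, 12.0]
C = [1.0, 11.0]          # two prototypes, chosen at the group centroids
n = len(X)
w = 1.0 / n              # uniform normalized weights

# Left-hand side: each point pays its squared distance to the closest center.
lhs = sum(w * min((x - c) ** 2 for c in C) for x in X)

# Right-hand side: split X into the induced groups, then sum W_j * v(G_j).
groups = [[x for x in X if min(C, key=lambda c: (x - c) ** 2) == c_j] for c_j in C]
rhs = 0.0
for g in groups:
    W = w * len(g)                                # cumulative weight of the group
    cg = sum(g) / len(g)                          # group centroid
    v = sum((x - cg) ** 2 for x in g) / len(g)    # normalized group variance
    rhs += W * v
assert abs(lhs - rhs) < 1e-9
print(lhs)  # -> 0.8
```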
For a given partition, we can decompose A into two terms: the sum of the intra-cluster distances (between points inside the same cluster), and the sum of the inter-cluster distances (between points of two distinct clusters):

A = Σ_{l=1}^k ( Σ_{x_i, x_j ∈ G_l} ||x_i − x_j||² + Σ_{x_i ∈ G_l, x_j ∉ G_l} ||x_i − x_j||² ).

Thus, minimizing the sum of the intra-cluster squared Euclidean distances Σ_{l=1}^k Σ_{x_i, x_j ∈ G_l} ||x_i − x_j||² is equivalent to maximizing the sum of the inter-cluster squared Euclidean distances, since A is a constant (for a given X):

min_C Σ_{l=1}^k Σ_{x_i, x_j ∈ G_l} ||x_i − x_j||²
= min_C ( A − Σ_{l=1}^k Σ_{x_i ∈ G_l, x_j ∉ G_l} ||x_i − x_j||² )
≡ max_C Σ_{l=1}^k Σ_{x_i ∈ G_l, x_j ∉ G_l} ||x_i − x_j||².

Therefore, we have a dual description to define a good clustering:

• cluster data into homogeneous groups in order to minimize the weighted sum of cluster variances, or
• separate data in order to maximize the inter-cluster squared Euclidean distances.

7.2.2 Complexity and Tractability of the k-Means Optimization Problem

Finding the minimum of the k-means cost function is an NP-hard problem as soon as the dimension d > 1 and the number of clusters k > 1. When k = 1, we have shown that we can compute the optimal solution (the centroid) in linear time (computing the mean of the group). When d = 1, we can compute an optimal k-means solution using dynamic programming: using O(nk) memory, we can solve the k-means for n scalar values in time O(n²k) (see the exercises at the end of this chapter for further details).

Theorem 4 (k-means complexity) Finding a partition that minimizes the k-means cost function is NP-hard when k > 1 and d > 1. When d = 1, we can solve the exact k-means using dynamic programming, in O(n²k) time using O(nk) memory.

We quickly recall that P is the class of decision problems (that is, problems answering yes/no questions) that can be solved in polynomial time, and NP is the class of problems for which one can verify a solution in polynomial time (like, for example, 3-SAT⁶).
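The 1D dynamic program of Theorem 4 can be sketched as follows; an illustrative pure-Python implementation (not the book's code), exploiting the fact that on sorted 1D data, the clusters of an optimal solution are contiguous intervals:

```python
# Sketch of exact 1D k-means by dynamic programming: O(n^2 k) time, O(nk)
# memory (illustrative implementation; returns the optimal cost only).

def kmeans_1d(points, k):
    """Return the minimal k-means cost for 1D points."""
    xs = sorted(points)
    n = len(xs)
    # prefix sums for O(1) cost of any contiguous segment xs[i..j]
    s = [0.0] * (n + 1)    # running sum of xs
    s2 = [0.0] * (n + 1)   # running sum of xs^2
    for i, x in enumerate(xs):
        s[i + 1] = s[i] + x
        s2[i + 1] = s2[i] + x * x

    def seg_cost(i, j):
        """Cost of one cluster covering xs[i..j] (inclusive), centered at its mean."""
        m = j - i + 1
        seg_sum = s[j + 1] - s[i]
        return (s2[j + 1] - s2[i]) - seg_sum * seg_sum / m

    INF = float("inf")
    # D[c][j] = best cost of clustering the first j points into c clusters
    D = [[INF] * (n + 1) for _ in range(k + 1)]
    D[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(1, n + 1):
            for i in range(c - 1, j):  # the last cluster is xs[i..j-1]
                cand = D[c - 1][i] + seg_cost(i, j - 1)
                if cand < D[c][j]:
                    D[c][j] = cand
    return D[k][n]

print(kmeans_1d([0.0, 1.0, 9.0, 10.0], 2))  # -> 1.0 (clusters {0,1} and {9,10})
```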
The NP-complete class is the class of those problems of NP to which any other problem of NP can be reduced by a polynomial-time reduction scheme: X is NP-complete iff X ∈ NP and Y ∝_polynomial X for all Y ∈ NP. The NP-hard class is the class of problems X, not necessarily in NP, such that there exists an NP-complete problem Y with Y ∝_polynomial X.

Since the k-means problem is theoretically NP-hard, we seek efficient heuristics to approximately minimize the cost function. We distinguish two classes of such heuristics:

1. global heuristics, which do not depend on an initialization, and
2. local heuristics, which start from a solution (a partition) and iteratively improve this partition using "pivot rules."

Of course, one needs to initialize the local heuristics with a global heuristic. This yields many strategies for obtaining a good k-means clustering in practice. Finding novel k-means heuristics is still an active research topic, 50 years after the method's inception.

7.3 Lloyd's Batched k-Means Local Heuristic

We now present the celebrated Lloyd's heuristic (1957), which consists, from a given initialization, in iteratively repeating until convergence the following two steps:

Assign points to clusters. For all x_i ∈ X, let l_i = argmin_l ||x_i − c_l||², and define the k cluster groups G_j = {x_i : l_i = j}, with n_j = |G_j| the number of elements of X falling into the j-th cluster.

Update centers. For all j ∈ {1, ..., k}, update the centers to their cluster centroids: c_j = (1/n_j) Σ_{x ∈ G_j} x (or the barycenters c_j = (Σ_{x ∈ G_j} w(x) x) / (Σ_{x ∈ G_j} w(x)) for weighted data-sets).

Figure 7.4 illustrates a few iterations of Lloyd's algorithm.

Theorem 5 Lloyd's k-means heuristic converges monotonically to a local minimum, after a finite number of iterations upper bounded by the number of distinct partitions of X into k groups.

Proof Let G_1^(t), ..., G_k^(t) denote the partition of X at time t, with cost e_k(X; C_t). Let G(C_t) = ⊎_{j=1}^k G_j^(t) denote the clusters induced by the k centers C_t. At stage t+1, since we assign each point to the cluster whose center has the smallest squared Euclidean distance, we minimize:

e_k(X; G^(t+1)) ≤ e_k(X; G(C_t)).
(⁶ The 3-SAT problem consists in answering whether a boolean formula with n clauses of 3 literals each is satisfiable or not. 3-SAT is a famous NP-complete problem (Cook's theorem, 1971), a cornerstone of theoretical computer science.)

Fig. 7.4 Illustration of Lloyd's k-means algorithm: (a) input data set, (b) random initialization of cluster centers, (c) assigning points to clusters, (d) center relocations; and so on, until the algorithm converges to a local minimum of the cost function.

Now, recall that the k-means cost function is equal to the weighted sum of the intra-cluster variances: e_k(X; G^(t+1)) = Σ_{j=1}^k v(G_j^(t+1), c_j^(t)). When we update the cluster centers to their centroids (which are the points minimizing the squared Euclidean distances within their groups), we have for each group v(G_j^(t+1), c(G_j^(t+1))) ≤ v(G_j^(t+1), c_j^(t)). Thus we deduce that:

e_k(X; C_{t+1}) ≤ e_k(X; G^(t+1)) ≤ e_k(X; C_t).

Since e_k(X; C) ≥ 0, and since we can never repeat twice the same partition among the finitely many potential partitions,⁷ we necessarily converge to a local minimum after a finite number of iterations. □

Fig. 7.5 Empty-cluster exceptions in Lloyd's heuristic: cluster centers are depicted with large circles. An initialization, followed by an assignment step with a center relocation step, and a new assignment step: one of the clusters becomes empty.

Let us now present a few remarkable observations on k-means clustering:

Observation 1 Although Lloyd's heuristic performs remarkably well in practice, it has been shown that in the worst case it may require an exponential number of iterations, even in the planar case d = 2 [1, 2]. In 1D, Lloyd's k-means can require Ω(n) iterations until convergence [2].
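Stepping back, the assign/update iteration of Lloyd's heuristic can be sketched in a few lines; an illustrative pure-Python version (2D, unweighted, toy data; not the book's code):

```python
# Minimal sketch of Lloyd's batched heuristic: alternate the assign and
# update steps until the centers stop moving (illustrative, 2D, unweighted).

def lloyd(points, centers):
    while True:
        # Assign step: each point goes to its nearest center (squared distance).
        groups = [[] for _ in centers]
        for x in points:
            j = min(range(len(centers)),
                    key=lambda c: (x[0] - centers[c][0]) ** 2
                                + (x[1] - centers[c][1]) ** 2)
            groups[j].append(x)
        # Update step: move each center to the centroid of its group.
        new_centers = []
        for j, g in enumerate(groups):
            if g:
                new_centers.append((sum(p[0] for p in g) / len(g),
                                    sum(p[1] for p in g) / len(g)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # fixed point: the cost can no longer decrease
            return centers, groups
        centers = new_centers

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, groups = lloyd(pts, [(0.0, 0.0), (10.0, 10.0)])
print(centers)  # two centroids: (1/3, 1/3) and (31/3, 31/3)
```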
But recall that the 1D case can be solved in polynomial time using dynamic programming.

Observation 2 Even though the k-means cost function has a unique global minimum value, there can be exponentially many partition solutions that yield this minimum value (one single min, but many argmins). For example, let us consider the 4 vertices of a square: for k = 2, we have two optimal solutions (splitting along either pair of parallel edges). Let us now clone n/4 copies of this square, each square located very far away from the others, and consider k = n/4: in that case, there exist 2^k optimal clusterings.

Observation 3 In some cases, after assigning points to clusters in Lloyd's batched heuristic, we can obtain empty clusters. This case happens rarely in practice, but its probability increases with the dimension. Thus one has to take care of these potential empty-cluster exceptions when implementing Lloyd's k-means. One such configuration is illustrated in Fig. 7.5. Note that this problem is in fact a blessing, since we can choose new center points to reinitialize those empty clusters, while ensuring that the sum of cluster variances decreases.

Lloyd's k-means is a local heuristic that starts from a given initialization (either induced by an initial set of k prototypes, or by a given starting partition that induces the centroid prototypes) and guarantees monotone convergence to a local minimum. We shall now describe a few initialization methods: that is, global heuristics for k-means.

(⁷ The number of distinct partitions of a set of n elements into k non-empty subsets is given by the Stirling number of the second kind: S(n, k) = (1/k!) Σ_{j=0}^k (−1)^{k−j} C(k, j) j^n.)

7.4 Initializing k-Means with Global Heuristics

7.4.1 Random Seed Initialization (Also Known as Forgy's Initialization)

Let us choose the k distinct seeds randomly from X (for example, by sampling uniformly k indices from [n] = {1, ..., n}). There are C(n, k) such different random drawings.
Then we create the group partition G(C) = {G_1, ..., G_k} from the seed variates C = {c_1, ..., c_k}. There is no theoretical guarantee that e_k(X, G) is close to the global minimum e_k*(X, G) = min_C e_k(X, G). Thus, in order to increase the chance of finding a good initialization, not too far from e_k*(X, G), we can initialize l times to get the seed sets C_1, ..., C_l, and keep the seed set that yielded the best k-means cost so far: that is, keep C_{l*} with l* = argmin_l e_k(X; G(C_l)). This method is sometimes called Forgy's initialization with l restarts in software packages.

7.4.2 Global k-Means: The Best Greedy Initialization

In the global k-means, we first choose randomly the first seed c_1, then greedily choose the seeds c_2 to c_k. Let C_{≤i} = {c_1, ..., c_i} denote the set of the first i seeds. We choose c_i ∈ X in order to minimize e_i(X, C_{≤i}) (there are only n − i + 1 possible choices, which can be tested exhaustively). Finally, we consider for c_1 all the n potential choices c_1 = x_1, ..., c_1 = x_n, and we keep the best seed set overall.

7.4.3 k-Means++: A Simple Probabilistically-Guaranteed Initialization

Let us now consider a probabilistic initialization that guarantees, with high probability, a good initialization. Denote by e_k*(X) = min_C e_k(X; C) = e_k(X, C*) the global minimum value of the k-means cost, with C* = argmin_C e_k(X; C). A (1+ε)-approximation of the k-means is defined by a set of prototypes C such that:

e_k*(X) ≤ e_k(X, C) ≤ (1+ε) e_k*(X).

In other words, the ratio e_k(X, C)/e_k*(X) is at most 1+ε.

The k-means++ initialization chooses the seeds iteratively, by weighting the elements x_i according to the squared Euclidean distance of x_i to the already chosen seeds. Let D²(x, C) denote the minimum squared Euclidean distance of x to an element of C: D²(x, C) = min_{c ∈ C} ||x − c||². For a weighted set X, the k-means++ initialization procedure writes as follows:

• Choose c_1 uniformly at random in X.
(If we had shuffled X beforehand, we can simply take its first element.) We set C_++ = {c_1}.
• For i = 2 to k:
  – draw c_i = x ∈ X with probability

    p(x) = w(x) D²(x, C_++) / Σ_y w(y) D²(y, C_++),

  – and update C_++ ← C_++ ∪ {c_i}.

Theorem 6 (k-means++ [3]) The k-means++ probabilistic initialization guarantees, with high probability, that E[e_k(X, C_++)] ≤ 8(2 + ln k) e_k*(X).

That is, k-means++ is Õ(log k)-competitive. The notation Õ(·) emphasizes the fact that the analysis is probabilistic, in expectation. The technical proof is reported in the paper [3]. Here, to give a flavor of the tools used in the proof, we shall give an elementary proof in the case of k = 1 cluster. In that case, we choose uniformly at random a point x_0 ∈ X. Let c* denote the center of mass of X. We have:

E[e_1(X)] = (1/|X|) Σ_{x_0 ∈ X} Σ_{x ∈ X} ||x − x_0||².

We shall use the following variance-bias decomposition, valid for any z:

Σ_{x ∈ X} ||x − z||² − Σ_{x ∈ X} ||x − c*||² = |X| ||c* − z||².

Thus we deduce that:

E[e_1(X)] = (1/|X|) Σ_{x_0 ∈ X} ( Σ_{x ∈ X} ||x − c*||² + |X| ||x_0 − c*||² )
= 2 Σ_{x ∈ X} ||x − c*||²
= 2 e_1*(X).

Hence, in the case of k = 1 cluster, by choosing the first seed c_1 uniformly at random in X, we guarantee a 2-approximation in expectation.

7.5 Application of k-Means to Vector Quantization (VQ)

7.5.1 Vector Quantization

In vector quantization (VQ for short), we are given a set X = {x_1, ..., x_n} that we seek to encode with words c_1, ..., c_k. This dictionary of words is called a codebook. We define the encoding and decoding functions as follows:

• a quantization function i(·): x ∈ R^d → {1, ..., k},
• a decoding function c(·), mapping an index back to its codeword.

To compress and code a message (t_1, ..., t_m) of m elements over an alphabet X of n characters (t_i ∈ X), we associate to each t_i its code i(t_i), via the encoding function i(·), among the k words of the codebook.
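A toy vector quantizer makes the encode/decode pipeline concrete; an illustrative Python sketch (1D "symbols", a hypothetical 3-word codebook, not from the book), which also counts the index bits needed to code the message with the codebook:

```python
import math

# Illustrative VQ sketch: codebook C, quantization function i(.), decoding
# function c(.), and the MSE distortion (toy 1D data, hypothetical values).

C = [10.0, 50.0, 200.0]                       # codebook of k = 3 words

def i_enc(x):                                  # quantization: x -> index
    return min(range(len(C)), key=lambda j: (x - C[j]) ** 2)

def c_dec(j):                                  # decoding: index -> codeword
    return C[j]

message = [12.0, 47.0, 199.0, 55.0, 8.0]       # m symbols to compress
codes = [i_enc(t) for t in message]
decoded = [c_dec(j) for j in codes]
mse = sum((t - d) ** 2 for t, d in zip(message, decoded)) / len(message)

# coding cost for the message contents: m * ceil(log2 k) bits for the indices
m, k = len(message), len(C)
bits_vq = m * math.ceil(math.log2(k))
print(codes, mse, bits_vq)  # -> [0, 1, 2, 1, 0] 8.6 10
```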
For example, we may quantize the 24-bit colors of an image (encoded using 3 color channels: 'R' for red, 'G' for green, and 'B' for blue) into k distinct color levels using vector quantization (the k prototypes of a k-means run on all the pixel colors). Thus, instead of coding an image of dimension m = w × h pixels using 24m bits, we rather code the (R, G, B) colors of a palette of size k, and then encode the pixel colors using m × log₂ k bits (the contents of the image). Thus we save memory storage, or shrink the communication time when sending an image over a digital channel.

The distortion error introduced by the quantization of the colors into an alphabet of k letters is E = (1/n) Σ_{i=1}^n ||x_i − c(i(x_i))||², the Mean Square Error (MSE).

k-Means clustering allows one to find a codebook minimizing (locally) the MSE: e_k(X, C) = v(X) − v(C). Here, the variance denotes the information spread, and we seek to minimize this loss of information. Indeed, we can rewrite the k-means cost function as follows:

e_k(X, C) = Σ_{i=1}^n min_{j ∈ {1,...,k}} ||x_i − c_j||²
= v(X) − v(C), with C = {(n_j, c_j)}_{j=1}^k.

Thus, k-means can be reinterpreted as the minimization of the difference between the variance of X and the variance of its quantization by the k centers C. In other words, quantization asks to minimize the difference between the variance of a discrete random variable on n letters and the variance of a discrete random variable on k letters. The variance plays the role of information, and we seek to minimize this difference of information between these two random variables. In Information Theory (IT), this is related to rate-distortion theory.

7.5.2 Lloyd's Local Minima and Stable Voronoi Partitions

For all x_i ∈ X, we associate a label in N:

l_C(x) = argmin_{j ∈ {1,...,k}} ||x − c_j||².

We can extend this labeling function to the full space X, and we obtain a partition of X called a Voronoi diagram.
This partition of space is illustrated in Fig. 7.6. A Voronoi cell V_j of the Voronoi diagram is defined by:

V_j = { x ∈ R^d : ||x − c_j|| ≤ ||x − c_l|| ∀ l ∈ {1, ..., k} }.

Here, let us remark that the squared Euclidean distance and the ordinary Euclidean distance yield the same Voronoi diagram, which decomposes the space into proximity cells. In fact, Voronoi cells do not change if we apply any strictly monotonous function to the base distance; the square function is such an example of a monotonous function on R₊.

Now, let us note that once Lloyd's k-means heuristic has converged, the cluster groups G_i form a Voronoi partition, and the convex hulls (co) of the groups are pairwise disjoint: ∀ i ≠ j, co(G_i) ∩ co(G_j) = ∅, where:

co(X) = { x : x = Σ_{i=1}^n λ_i x_i, Σ_{i=1}^n λ_i = 1, λ_i ≥ 0 }.

Fig. 7.6 Voronoi diagram induced by the k centers C, called the generators, and the Voronoi partition of X induced by C: for example, a point p with l_C(p) = 1 lies in the cell of c_1, and a point q with l_C(q) = 3 lies in the cell of c_3.

7.6 A Physical Interpretation of k-Means: The Inertia Decomposition

Let us consider X = {(x_1, w_1), ..., (x_n, w_n)} as n body masses located at positions x_i, with respective weights w_i. In Physics, the concept of inertia measures the resistance of a body to being rotated around a given point. We measure the total inertia I(X) of a point cloud X as the weighted sum of the squared Euclidean distances of the points of X to its center of mass c = Σ_{i=1}^n w_i x_i:

I(X) := Σ_{i=1}^n w_i ||x_i − c||².

Thus, when we increase the masses of the points, the inertia increases too (that is, it becomes more difficult to rotate the point set around its center of mass). Similarly, when points get farther away from the center of mass, it also becomes harder to rotate the point cloud around its center of mass c. Therefore, k-means can be reinterpreted physically as the task of identifying k groups such that the sum of the inertias of these groups, with respect to their barycenters, is minimal.
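This intra/inter inertia bookkeeping can be checked numerically on a toy weighted point set; an illustrative Python sketch (hypothetical data): the total inertia splits exactly into the intra-group part plus the inter-group part.

```python
# Numerical check that total inertia = intra-group + inter-group inertia
# on a toy weighted 1D point set (illustrative sketch).

def barycenter(points):
    """points: list of (weight, position); returns (total weight, barycenter)."""
    W = sum(w for w, _ in points)
    return W, sum(w * x for w, x in points) / W

groups = [[(0.25, 0.0), (0.25, 2.0)], [(0.25, 10.0), (0.25, 12.0)]]
allpts = groups[0] + groups[1]

W, c = barycenter(allpts)                          # overall center of mass
total = sum(w * (x - c) ** 2 for w, x in allpts)   # total inertia I(X)

intra = inter = 0.0
for g in groups:
    Wg, cg = barycenter(g)
    intra += sum(w * (x - cg) ** 2 for w, x in g)  # inertia inside the group
    inter += Wg * (cg - c) ** 2                    # group mass at its centroid

assert abs(total - (intra + inter)) < 1e-9
print(total, intra, inter)  # -> 26.0 1.0 25.0
```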
Huygens' formula reports an invariant, a mathematical identity between the total inertia of the system and its decomposition into the sum of the intra-group inertia plus the inter-group inertia:

Theorem 7 (Huygens' formula: inertia decomposition) The total inertia I(X) = Σ_{i=1}^n w_i ||x_i − x̄||² equals I_intra(G) + I_inter(C), with the intra-group inertia I_intra(G) = Σ_{i=1}^k I(G_i) = Σ_{i=1}^k Σ_{x_j ∈ G_i} w_j ||x_j − c_i||², and the inter-group inertia I_inter(C) = Σ_{i=1}^k W_i ||c_i − c||² (with c the unique overall centroid), where W_i = Σ_{x ∈ G_i} w(x).

Fig. 7.7 The total inertia of a system of point masses is invariant under group decomposition: total inertia = intra-group inertia + inter-group inertia. k-Means optimizes the decomposition that minimizes the intra-group inertia.

Figure 7.7 illustrates two decompositions of inertia that share the same total inertia. Since the total inertia is invariant, minimizing the intra-group inertia amounts to maximizing the inter-group inertia.

7.7 Choosing k in k-Means: Model Selection

So far, we have considered that the number of clusters k was prescribed, and known beforehand. This is not true in practice when one performs exploratory data analysis, and k has to be guessed as well. Finding the right value of k is an important problem, referred to as model selection in the literature. For any value of k, we can consider the optimal k-means cost e_k*(X) (which can be estimated empirically in practice, say, by running Lloyd's heuristic from several initializations). Let us notice that e_k*(X) decreases monotonically with k, until reaching e_n*(X) = 0 (in that case, each point is trivially allocated to its own cluster).

7.7.1 Model Selection via the Elbow Method

To choose a correct value for k, one can use the so-called elbow method. This is a visual procedure: first, we plot the function (k, e_k*(X)) for k ∈ [n] = {1, ..., n}, and we choose the k that defines the inflexion point: the elbow (separating the forearm from the arm).
The reason to choose this value of k is that for small values of k, the sum of cluster variances decreases quickly; then, starting from some value, the sum of variances reaches a plateau. This visual inspection method is called the "elbow method" because the function f(k) = e*_k(X) looks like an arm on practical data-sets (with the plateau being the forearm): the elbow returns the optimal number of clusters (Fig. 7.8). One drawback of this method is that it is computationally very expensive, and sometimes (depending on the data-set) the inflexion point between the sharp decrease and the plateau is not so well defined.

7.7.2 Model Selection: Explaining Variance Reduction with k

We calculate the proportion of variance that is explained by k classes:

R²(k) = I_inter(k) / I_total.

We have 0 ≤ R²(k) ≤ 1. We then choose the k* that minimizes the ratio R²(k)/R²(k+1):

k* = argmin_k ( R²(k) / R²(k+1) ).

Fig. 7.8 Choosing k with the elbow method: the elbow defines the value of k that separates the area of steep decrease of the k-means objective function e_k(X) (the arm) from the plateau area (the forearm).

We have presented two essential methods to choose the right value of k, the number of clusters. In machine learning, k denotes the complexity of the model, and choosing k is therefore called the model selection problem. There exist some algorithms that perform clustering without requiring k to be known beforehand. For example, affinity propagation [4] is such a popular algorithm, as is a fast convex relaxation minimization of k-means [5].

7.8 Parallel k-Means Clustering on a Cluster of Machines

There are many ways to design a parallel version of Lloyd's k-means heuristic on a cluster of computers: that is, on a set of interconnected machines that we consider as a "super-computer" with distributed memory (one local memory per machine).
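The elbow behavior is easy to reproduce on synthetic data. The sketch below is a minimal one-dimensional Lloyd implementation (a deterministic farthest-point initialization stands in for the random restarts mentioned above; all names are illustrative). Run on data with three obvious clumps, the cost e_k(X) drops sharply until k = 3 and then flattens out.

```python
def lloyd(points, k, iters=100):
    # Deterministic farthest-point initialization (a stand-in for random
    # restarts; not the initialization prescribed by the text).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda x: min(abs(x - c) for c in centers)))
    for _ in range(iters):
        # Assignment step: label each point with its nearest center.
        groups = [[] for _ in range(k)]
        for x in points:
            j = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            groups[j].append(x)
        # Update step: recompute each center as its group mean.
        new_centers = [sum(g) / len(g) if g else c
                       for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    # Final k-means cost e_k(X) for the centers found.
    return sum(min((x - c) ** 2 for c in centers) for x in points)

data = [0, 1, 2, 10, 11, 12, 20, 21, 22]          # three obvious clumps
costs = [lloyd(data, k) for k in range(1, 6)]
print(costs)   # sharp drop until k = 3, then a plateau: the elbow
```

Plotting `costs` against k reproduces the arm/forearm shape of Fig. 7.8, with the elbow at the true number of clumps.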
As usual, for the sake of simplicity, we consider that each computer has a single processing core (a processing unit) and associate a node to each machine to describe the communication links by a graph. Those different processors communicate with each other by sending and receiving messages using the MPI interface: the Message Passing Interface. Sending a message has a communication cost that is split into two terms: one for the latency of initializing the communication, and one that is proportional to the length of the message. Sending structured data requires first serializing it into a universal "string" at the sender node before sending (encoding), and reconstructing it (de-serializing the string to rebuild the structure) at the receiver node.

A k-means clustering problem can be characterized by the number of attributes d (that is, the dimension of the data point cloud), the number of data elements n, and the number of clusters k. We consider k ≪ n (that is, k = o(n)) so that all cluster centers, requiring a memory space O(dk), can be stored in the local memory of each machine. Since in practice the memory (RAM) is fixed (that is, O(1)), this means that we consider theoretically the special case k = O(1) here. However, the number of elements n is considered (very) large compared to k, so that the full data-set X needs to be distributed among the P processors (since X cannot be fully contained in the RAM of a single computer).

In order to design a simple but efficient parallel k-means heuristic, we rely on the following composability/decomposability theorem:

Theorem 8 (Composability of barycenters) Let X_1 and X_2 be two weighted data-sets with respective overall weights W_1 > 0 and W_2 > 0. We have:

x̄(X_1 ∪ X_2) = (W_1 / (W_1 + W_2)) x̄(X_1) + (W_2 / (W_1 + W_2)) x̄(X_2),

where x̄(X_i) denotes the barycenter of X_i, for i ∈ {1, 2}.
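Theorem 8 can be checked numerically in a few lines. The snippet below (illustrative names) merges two weighted data-sets and verifies that the weighted combination of their barycenters equals the barycenter of the union.

```python
def barycenter(pts):
    # pts: list of (position, weight) pairs; returns (barycenter, total weight).
    W = sum(w for _, w in pts)
    return sum(w * x for x, w in pts) / W, W

X1 = [(1.0, 2.0), (3.0, 1.0)]        # (position, weight)
X2 = [(10.0, 1.0), (14.0, 1.0)]

b1, W1 = barycenter(X1)
b2, W2 = barycenter(X2)
merged, _ = barycenter(X1 + X2)

# Composability: xbar(X1 ∪ X2) = W1/(W1+W2)·xbar(X1) + W2/(W1+W2)·xbar(X2).
combined = (W1 * b1 + W2 * b2) / (W1 + W2)
assert abs(merged - combined) < 1e-12
print(merged)
```

This is the property that lets each machine summarize its local data by a (barycenter, weight) pair of size independent of n, which is why only O(dk) numbers per machine need to be exchanged.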
This property turns out to be essential for distributing the centroid computation over a partition of X into P subsets X_1, ..., X_P, where P is the number of processors (or machines of the cluster). The parallel algorithm is described in pseudo-code in Algorithm 5.

At runtime, each processor knows the overall number of processors of the cluster, P, by using the function MPI_Comm_size(), and its rank number, indexed between 0 and P − 1, by using the standard MPI function MPI_Comm_rank(). It is the task of processor P_0 (the root machine) to initialize the k cluster prototypes and to broadcast them to all other processors using the primitive MPI_Bcast(C, root processor). Then we loop until convergence using a while structure: each processor P_l computes the labeling of its group of data X_l, the cumulative sums of the vectors of the k groups restricted to P_l, and the local cardinality of each cluster. Then we aggregate and broadcast all those group cardinalities using the MPI primitive MPI_Allreduce. The aggregation operation (which is associative and commutative) can be chosen among a set of binary operators like + or min, etc. We specify this binary operation as an argument of the MPI primitive MPI_Allreduce: here, it is MPI_SUM to indicate the cumulative sum operation. Since some clusters in some partial data-sets associated to machines may be empty, we take care of those cases when computing the local centroids (see the max(n_j, 1) operation in Algorithm 5).

A complete source code using the C API of OpenMPI is syntactically different from Algorithm 5 written in pseudo-code, as the arguments in the MPI
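The MPI pattern above can be emulated in plain Python without an actual MPI runtime, which makes the data flow of Algorithm 5 easy to inspect. In the sketch below (all names are illustrative), each of P simulated processors computes, for its local chunk X_l, the per-cluster vector sums and counts; an all-reduce with + then yields global sums and counts, from which every machine derives the same new centroids as a serial pass would. A real implementation would replace `allreduce_sum` with MPI_Allreduce (op = MPI_SUM), e.g. via the C API or mpi4py.

```python
def local_stats(chunk, centers, k):
    # Each simulated processor computes per-cluster sums and cardinalities.
    sums, counts = [0.0] * k, [0] * k
    for x in chunk:
        j = min(range(k), key=lambda j: (x - centers[j]) ** 2)
        sums[j] += x
        counts[j] += 1
    return sums, counts

def allreduce_sum(parts):
    # Stand-in for MPI_Allreduce with op = MPI_SUM: componentwise addition.
    return [sum(vals) for vals in zip(*parts)]

data = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0, 21.0, 22.0]
centers = [0.0, 11.0, 22.0]
k, P = 3, 3
chunks = [data[i::P] for i in range(P)]        # distribute X over P machines

stats = [local_stats(c, centers, k) for c in chunks]
g_sums = allreduce_sum([s for s, _ in stats])
g_counts = allreduce_sum([n for _, n in stats])

# Empty clusters are guarded by max(n_j, 1), as in Algorithm 5.
new_centers = [s / max(n, 1) for s, n in zip(g_sums, g_counts)]

# The distributed update coincides with the serial one (Theorem 8).
serial_sums, serial_counts = local_stats(data, centers, k)
serial = [s / max(n, 1) for s, n in zip(serial_sums, serial_counts)]
assert new_centers == serial
print(new_centers)
```

Only the k partial sums and k counts cross the network per iteration, i.e., O(dk + k) numbers per machine, independently of the local data size n/P.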
