k-means clustering algorithms implementation and comparison and k-means algorithms
Dr.GordenMorse,France,Professional
Published Date:22-07-2017
Your Website URL(Optional)
Comment
Big Data Analytics CSCI 4030High dim. Graph Infinite Machine
Apps
data data data learning
Locality Filtering
PageRank, Recommen
sensitive data
SVM
SimRank der systems
hashing streams
Community Web Decision Association
Clustering
Detection advertising Trees Rules
Dimensional Duplicate
Spam Queries on Perceptron,
ity document
Detection streams kNN
reduction detection
Big Data Analytics CSCI 4030
2 Given a cloud of data points we want to
understand its structure
Big Data Analytics CSCI 4030
3 Given a set of points, with a notion of distance
between points, group the points into some
number of clusters, so that
Members of a cluster are close/similar to each other
Members of different clusters are dissimilar
Usually:
Points are in a high-dimensional space
Similarity is defined using a distance measure
Euclidean, Cosine, Jaccard, edit distance, …
Big Data Analytics CSCI 4030
4Big Data Analytics CSCI 4030
5x
x x
xx xx x x
x x x x
x x x x
x x x x x x
x x x x x x x x
x x
x x x x x x x x
x x x x x x
x x x x x x
x x
x x x x
x x x x
x
x x x x x x x x
x x x x x x
x x
Cluster
Outlier
Big Data Analytics CSCI 4030
6Big Data Analytics CSCI 4030
7Big Data Analytics CSCI 4030
8 Clustering in two dimensions looks easy
Clustering small amounts of data looks easy
And in most cases, looks are not deceiving
Many applications involve not 2, but 10 or
10,000 dimensions
High-dimensional spaces look different:
Many pairs of points are at about the same
distance
Big Data Analytics CSCI 4030
9 A catalog of 2 billion “sky objects” represents
objects by their radiation in 7 dimensions
(frequency bands)
Problem: Cluster into similar objects, e.g.,
galaxies, nearby stars, quasars, etc.
Sloan Digital Sky Survey
Big Data Analytics CSCI 4030
10 Intuitively: Music divides into categories, and
customers prefer a few categories
But what are categories really?
Represent a CD by a set of customers who
bought it
Similar CDs have similar sets of customers,
and vice-versa
Big Data Analytics CSCI 4030
11Space of all CDs:
Think of a space with one dim. for each
customer
Values in a dimension may be 0 or 1 only
A CD is a point in this space (x , x ,…, x ),
1 2 k
th
where x = 1 iff the i customer bought the CD
i
For Amazon, the dimension is tens of millions
Task: Find clusters of similar CDs
Big Data Analytics CSCI 4030
12Finding topics:
Represent a document by a vector
th
(x , x ,…, x ), where x = 1 iff the i word
1 2 k i
(in some order) appears in the document
Documents with similar sets of words
may be about the same topic
Big Data Analytics CSCI 4030
13 As with CDs we have a choice when we
think of documents as sets of words or
shingles:
Sets as vectors: Measure similarity by the
Cosine Distance
Sets as sets: Measure similarity by the
Jaccard Distance
Sets as points: Measure similarity by
Euclidean Distance
Big Data Analytics CSCI 4030
14 Hierarchical:
Agglomerative (bottom up):
Initially, each point is a cluster
Repeatedly combine the two
“nearest” clusters into one
Divisive (top down):
Start with one cluster and recursively split it
Point assignment:
Maintain a set of clusters
Points belong to “nearest” cluster
Big Data Analytics CSCI 4030
15 Key operation:
Repeatedly combine
two nearest clusters
Three important questions:
1) How do you represent a cluster of more
than one point?
2) How do you determine the “nearness” of
clusters?
3) When to stop combining clusters?
Big Data Analytics CSCI 4030
16 Key operation: Repeatedly combine two
nearest clusters
(1) How to represent a cluster of many points?
Key problem: As you merge clusters, how do you
represent the “location” of each cluster, to tell which
pair of clusters is closest?
Euclidean case: each cluster has a
centroid = average of its (data)points
(2) How to determine “nearness” of clusters?
Measure cluster distances by distances of centroids
Big Data Analytics CSCI 4030
17(5,3)
o
(1,2)
o
x (1.5,1.5)
x (4.7,1.3)
o (2,1) o (4,1)
x (1,1)
x (4.5,0.5)
o (0,0) o (5,0)
Data:
o … data point
x … centroid
Dendrogram
Big Data Analytics CSCI 4030
18What about the Non-Euclidean case?
The only “locations” we can talk about are the
points themselves
i.e., there is no “average” of two points
Approach 1:
(1) How to represent a cluster of many points?
clustroid = (data)point “closest” to other points
(2) How do you determine the “nearness” of
clusters? Treat clustroid as if it were centroid, when
computing inter-cluster distances
Big Data Analytics CSCI 4030
19 (1) How to represent a cluster of many points?
clustroid = point “closest” to other points
Possible meanings of “closest”:
Smallest maximum distance to other points
Smallest average distance to other points
Smallest sum of squares of distances to other points
2
min d(x,c)
For distance metric d clustroid c of cluster C is:
c
xC
Centroid
Datapoint
Centroid is the avg. of all (data)points
in the cluster. This means centroid is
X
an “artificial” point.
Clustroid Clustroid is an existing (data)point
that is “closest” to all other points in
Cluster on
the cluster.
3 datapoints
Big Data Analytics CSCI 4030
20
Advise:Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.