Cluster Analysis

Dr. Jake Finlay, Germany, Teacher
Published: 22-07-2017
Topics covered:

- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Subspace Clustering / Bi-clustering
- Model-Based Clustering

What is Cluster Analysis?

- Cluster analysis finds groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. A good grouping keeps intra-cluster distances small while inter-cluster distances are large.
- Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.
- Cluster analysis: grouping a set of data objects into clusters.
- Clustering is unsupervised classification: there are no predefined classes.
- Clustering is used:
  - as a stand-alone tool to get insight into the data distribution; visualizing the clusters may unveil important information;
  - as a preprocessing step for other algorithms; efficient indexing or compression often relies on clustering.

Some Applications of Clustering

- Pattern recognition
- Image processing: cluster images based on their visual content
- Bioinformatics
- WWW and information retrieval: document classification; clustering Weblog data to discover groups of similar access patterns

What Is Good Clustering?

- A good clustering method produces high-quality clusters with high intra-class similarity and low inter-class similarity.
- The quality of a clustering result depends both on the similarity measure used by the method and on its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Requirements of Clustering in Data Mining

- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Ability to handle high dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability

Outliers

- Outliers are objects that do not belong to any cluster, or that form clusters of very small cardinality.
- In some applications we are interested in discovering the outliers rather than the clusters (outlier analysis).

Data Structures

- Data matrix ("two modes"): the classic data input, with n objects (rows) described by p attributes/dimensions (columns):

    [ x_11  ...  x_1f  ...  x_1p ]
    [  ...  ...   ...  ...   ... ]
    [ x_i1  ...  x_if  ...  x_ip ]
    [  ...  ...   ...  ...   ... ]
    [ x_n1  ...  x_nf  ...  x_np ]

- Dissimilarity (or distance) matrix ("one mode"): an n-by-n table of pairwise distances between objects (tuples). Assuming a symmetric distance, d(i, j) = d(j, i), so only the lower triangle needs to be stored:

    [ 0                           ]
    [ d(2,1)   0                  ]
    [ d(3,1)   d(3,2)   0         ]
    [   :        :      :         ]
    [ d(n,1)   d(n,2)   ...     0 ]

Measuring Similarity in Clustering

- Dissimilarity/similarity metric: the dissimilarity d(i, j) between two objects i and j is expressed in terms of a distance function, which is typically a metric:
  - d(i, j) ≥ 0 (non-negativity)
  - d(i, i) = 0 (isolation)
  - d(i, j) = d(j, i) (symmetry)
  - d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality)
- The definitions of distance functions are usually different for interval-scaled, boolean, categorical, ordinal, and ratio-scaled variables.
- Weights may be associated with different variables, based on the application and the data semantics.
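To make the two data structures concrete, here is a minimal Python sketch (not from the original slides; the toy data matrix and variable names are made up for illustration) that builds the symmetric dissimilarity matrix from a small data matrix, using Euclidean distance as the metric:

```python
import numpy as np

# Toy data matrix: 4 objects (rows) x 2 interval-scaled attributes (columns).
X = np.array([[1.0, 2.0],
              [1.5, 1.8],
              [8.0, 8.0],
              [9.0, 9.5]])

n = X.shape[0]
D = np.zeros((n, n))          # n x n dissimilarity matrix, zero diagonal
for i in range(n):
    for j in range(i):
        # d(i, j) = sqrt(sum over attributes of (x_if - x_jf)^2)
        dist = np.sqrt(((X[i] - X[j]) ** 2).sum())
        D[i, j] = D[j, i] = dist   # symmetry: d(i, j) = d(j, i)

print(D)
```

The result satisfies the metric properties listed above: zeros on the diagonal, symmetric entries, and non-negative values.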
Types of Data in Cluster Analysis

- Interval-scaled variables: e.g., salary, height
- Binary variables: e.g., gender (M/F), has_cancer (T/F)
- Nominal (categorical) variables: e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
- Ordinal variables: e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
- Ratio-scaled variables: e.g., population growth (1, 10, 100, 1000, ...)
- Variables of mixed types: multiple attributes with various types

Similarity and Dissimilarity Between Objects

- Distance metrics are normally used to measure the similarity or dissimilarity between two data objects.
- The most popular metrics conform to the Minkowski distance:

    L_p(i, j) = (|x_i1 - x_j1|^p + |x_i2 - x_j2|^p + ... + |x_in - x_jn|^p)^(1/p)

  where i = (x_i1, x_i2, ..., x_in) and j = (x_j1, x_j2, ..., x_jn) are two n-dimensional data objects and p is a positive integer.
- If p = 1, L_1 is the Manhattan (or city-block) distance:

    L_1(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_in - x_jn|

- If p = 2, L_2 is the Euclidean distance:

    d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_in - x_jn|^2)

- Properties: d(i, j) ≥ 0; d(i, i) = 0; d(i, j) = d(j, i); d(i, j) ≤ d(i, k) + d(k, j).
- One can also use a weighted distance:

    d(i, j) = sqrt(w_1 |x_i1 - x_j1|^2 + w_2 |x_i2 - x_j2|^2 + ... + w_n |x_in - x_jn|^2)

Binary Variables

- A binary variable has two states: 0 (absent) and 1 (present).
- A contingency table for two binary objects, e.g. i = (0011101001) and j = (1001100110):

                     object j
                     1        0        sum
    object i    1    a        b        a + b
                0    c        d        c + d
               sum   a + c    b + d    p

- Simple matching coefficient distance (invariant if the binary variable is symmetric):

    d(i, j) = (b + c) / (a + b + c + d)

- Jaccard coefficient distance (non-invariant if the binary variable is asymmetric):

    d(i, j) = (b + c) / (a + b + c)

- Another approach is to define the similarity of two objects instead of their distance. In that case:
  - Simple matching coefficient similarity: s(i, j) = (a + d) / (a + b + c + d)
  - Jaccard coefficient similarity: s(i, j) = a / (a + b + c)
  - Note that s(i, j) = 1 - d(i, j).

Dissimilarity Between Binary Variables

- Example (Jaccard coefficient):

    Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
    Jack   1       0       1        0        0        0
    Mary   1       0       1        0        1        0
    Jim    1       1       0        0        0        0

- All attributes are asymmetric binary; 1 denotes presence or a positive test, 0 denotes absence or a negative test.
- d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
- d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
- d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
- (These values are reproduced in the short code sketch below.)

A Simpler Definition

- Each object is mapped to a bitmap (binary vector) over the variables:
  - Jack: 101000
  - Mary: 101010
  - Jim:  110000
- Simple match distance:

    d(i, j) = (number of non-common bit positions) / (total number of bits)

- Jaccard coefficient:

    d(i, j) = 1 - (number of 1's in i AND j) / (number of 1's in i OR j)

Variables of Mixed Types

- A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval-scaled, and ratio-scaled.
- One may use a weighted formula to combine their effects:

    d(i, j) = ( Σ_{f=1..p} δ_ij^(f) · d_ij^(f) ) / ( Σ_{f=1..p} δ_ij^(f) )

  where d_ij^(f) is the dissimilarity contributed by variable f and δ_ij^(f) is an indicator that is 1 when the comparison of objects i and j on variable f is meaningful, and 0 otherwise.
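The Jack/Mary/Jim values above can be reproduced with a short Python sketch. The helper name jaccard_distance is hypothetical; the function only uses the counts a, b, c from the contingency table, since the attributes are asymmetric binary and 0-0 matches are ignored:

```python
def jaccard_distance(x, y):
    """Jaccard coefficient distance for asymmetric binary vectors:
    d = (b + c) / (a + b + c), ignoring positions where both bits are 0."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)  # both 1
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)  # 1 in x only
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)  # 1 in y only
    return (b + c) / (a + b + c)

jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_distance(jack, mary), 2))  # 0.33
print(round(jaccard_distance(jack, jim), 2))   # 0.67
print(round(jaccard_distance(jim, mary), 2))   # 0.75
```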
Major Clustering Approaches

- Partitioning algorithms: construct an initial (random) partition and then iteratively refine it according to some criterion.
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
- Density-based methods: based on connectivity and density functions.
- Grid-based methods: based on a multiple-level granularity structure.
- Model-based methods: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to those models.

Partitioning Algorithms: Basic Concept

- Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
- k-means (MacQueen, 1967): each cluster is represented by the center (mean) of the cluster.
- k-medoids or PAM, Partitioning Around Medoids (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster.
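Below is a minimal sketch of the k-means idea just described. It is not MacQueen's original formulation; it assumes numeric (interval-scaled) data, Euclidean distance, and that no cluster becomes empty, and the toy data is purely illustrative:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: each cluster is represented by its center (mean)."""
    rng = np.random.default_rng(seed)
    # Initialize the k centers with k randomly chosen objects from X.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each object joins the cluster whose center is
        # nearest under Euclidean (L2) distance.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each center as the mean of its members
        # (assumes no cluster becomes empty, which holds for this toy data).
        new_centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Illustrative data: two well-separated groups of 2-D points.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.3]])
labels, centers = kmeans(X, k=2)
print(labels)
print(centers)
```

Each iteration alternates assigning objects to the nearest center with recomputing the centers as cluster means, which is the "cluster represented by its center" idea above; k-medoids/PAM differs in that the representative must be one of the objects in the cluster.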