Introduction to Information Retrieval: Hierarchical Clustering

Overview
❶ Recap
❷ Introduction
❸ Single-link / Complete-link
❹ Centroid / GAAC
❺ Variants
❻ Labeling clusters

❶ Recap

Applications of clustering in IR

| Application              | What is clustered       | Benefit                                                      | Example                              |
|--------------------------|-------------------------|--------------------------------------------------------------|--------------------------------------|
| Search result clustering | search results          | more effective information presentation to the user         |                                      |
| Scatter-Gather           | (subsets of) collection | alternative user interface: "search without typing"         |                                      |
| Collection clustering    | collection              | effective information presentation for exploratory browsing | McKeown et al. 2002, news.google.com |
| Cluster-based retrieval  | collection              | higher efficiency: faster search                             | Salton 1971                          |

K-means algorithm
[algorithm figure omitted]

Initialization of K-means
- Random seed selection is just one of many ways K-means can be initialized.
- Random seed selection is not very robust: it is easy to get a suboptimal clustering.
- Better heuristics:
  - Select seeds not randomly, but using some heuristic (e.g., filter out outliers, or find a set of seeds that has "good coverage" of the document space).
  - Use hierarchical clustering to find good seeds (next class).
  - Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, and select the clustering with the lowest RSS.

External criterion: Purity
- Ω = {ω1, ω2, ..., ωK} is the set of clusters and C = {c1, c2, ..., cJ} is the set of classes.
- For each cluster ωk: find the class cj with the most members nkj in ωk.
- Sum all the nkj and divide by the total number of points.

❷ Introduction

Hierarchical clustering
Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier for Reuters. We want to create this hierarchy automatically. We can do this either top-down or bottom-up. The best-known bottom-up method is hierarchical agglomerative clustering.

Hierarchical agglomerative clustering (HAC)
- HAC creates a hierarchy in the form of a binary tree.
- It assumes a similarity measure for determining the similarity of two clusters.
- Up to now, our similarity measures were for documents.
- We will look at four different cluster similarity measures.

Hierarchical agglomerative clustering (HAC)
- Start with each document in a separate cluster.
- Then repeatedly merge the two clusters that are most similar,
- until there is only one cluster.
- The history of merging is a hierarchy in the form of a binary tree.
- The standard way of depicting this history is a dendrogram.

A dendrogram
- The history of mergers can be read off from bottom to top.
- The horizontal line of each merger tells us what the similarity of the merger was.
- We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering.
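The dendrogram cut just described can be tried out directly. The following is a minimal sketch using SciPy's hierarchical-clustering helpers; the four toy document vectors and the 0.5 cut threshold are made up for illustration and are not from the slides.

```python
# Sketch: run single-link HAC and cut the dendrogram to get a flat clustering.
# SciPy works with distances, so a cut at cosine distance 0.5 corresponds to
# a cut at cosine similarity 1 - 0.5 = 0.5.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

docs = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])

Z = linkage(docs, method="single", metric="cosine")   # merge history (dendrogram)
labels = fcluster(Z, t=0.5, criterion="distance")     # cut the dendrogram
print(labels)   # doc 0 and doc 1 end up in one cluster, doc 2 and doc 3 in the other
```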
Divisive clustering
- Divisive clustering is top-down: the alternative to HAC (which is bottom-up).
- Start with all docs in one big cluster.
- Then recursively split clusters.
- Eventually each node forms a cluster on its own.
- → Bisecting K-means at the end.
- For now: HAC (= bottom-up).

Naive HAC algorithm
[algorithm figure omitted]

Computational complexity of the naive algorithm
- First, we compute the similarity of all N × N pairs of documents.
- Then, in each of N iterations:
  - We scan the O(N × N) similarities to find the maximum similarity.
  - We merge the two clusters with maximum similarity.
  - We compute the similarity of the new cluster with all other (surviving) clusters.
- There are O(N) iterations, each performing an O(N × N) "scan" operation.
- Overall complexity is O(N³).
- We'll look at more efficient algorithms later.

Key question: How to define cluster similarity
- Single-link: maximum similarity
  - Maximum similarity of any two documents.
- Complete-link: minimum similarity
  - Minimum similarity of any two documents.
- Centroid: average "inter-similarity"
  - Average similarity of all document pairs (but excluding pairs of docs in the same cluster).
  - This is equivalent to the similarity of the centroids.
- Group-average: average "intra-similarity"
  - Average similarity of all document pairs, including pairs of docs in the same cluster.

Cluster similarity: Example (figures omitted)
- Single-link: maximum similarity.
- Complete-link: minimum similarity.
- Centroid: average inter-similarity (inter-similarity = similarity of two documents in different clusters).
- Group average: average intra-similarity (intra-similarity = similarity of any pair, including cases where the two documents are in the same cluster).

Cluster similarity: Larger example (figures for single-link, complete-link, centroid, and group average omitted)

❸ Single-link / Complete-link

Single-link HAC
- The similarity of two clusters is the maximum inter-similarity, i.e., the maximum similarity of a document from the first cluster and a document from the second cluster.
- Once we have merged two clusters, how do we update the similarity matrix?
- This is simple for single-link: SIM(ωi, ωk1 ∪ ωk2) = max(SIM(ωi, ωk1), SIM(ωi, ωk2)).

This dendrogram was produced by single-link
- Notice: many small clusters (1 or 2 members) being added to the main cluster.
- There is no balanced 2-cluster or 3-cluster clustering that can be derived by cutting the dendrogram.
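As a concrete illustration of the naive algorithm and the single-link update rule above, here is a minimal Python sketch; it is not the book's pseudocode, and the function and variable names are only illustrative.

```python
# Sketch: naive HAC with the single-link update rule
#   SIM(wi, wk1 ∪ wk2) = max(SIM(wi, wk1), SIM(wi, wk2)).
# O(N^3) overall: O(N) merge iterations, each scanning an O(N x N) similarity matrix.
import numpy as np

def naive_single_link_hac(sim):
    """sim: N x N document similarity matrix; returns the list of merges."""
    n = sim.shape[0]
    S = sim.astype(float).copy()
    np.fill_diagonal(S, -np.inf)      # a cluster is never merged with itself
    active = list(range(n))           # indices of surviving clusters
    merges = []
    while len(active) > 1:
        # scan all pairs of surviving clusters for the maximum similarity
        i, j = max(((a, b) for a in active for b in active if a < b),
                   key=lambda p: S[p])
        merges.append((i, j, S[i, j]))
        # single-link update: the merged cluster (kept under index i) is as
        # similar to any other cluster as the more similar of its two parts
        for k in active:
            if k not in (i, j):
                S[i, k] = S[k, i] = max(S[i, k], S[j, k])
        active.remove(j)              # cluster j no longer survives
    return merges
```

A complete-link variant would use min instead of max in the update step, which is exactly the difference discussed next.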
Complete-link HAC
- The similarity of two clusters is the minimum inter-similarity, i.e., the minimum similarity of a document from the first cluster and a document from the second cluster.
- Once we have merged two clusters, how do we update the similarity matrix?
- Again, this is simple: SIM(ωi, ωk1 ∪ ωk2) = min(SIM(ωi, ωk1), SIM(ωi, ωk2)).
- We measure the similarity of two clusters by computing the diameter of the cluster that we would get if we merged them.

Complete-link dendrogram
- Notice that this dendrogram is much more balanced than the single-link one.
- We can create a 2-cluster clustering with two clusters of about the same size.

Exercise: Compute single-link and complete-link clustering
[exercise figure omitted]

Single-link clustering
[figure omitted]

Complete-link clustering
[figure omitted]

Single-link vs. complete-link clustering
[figure omitted]

Single-link: Chaining
Single-link clustering often produces long, straggly clusters. For most applications, these are undesirable.

What 2-cluster clustering will complete-link produce?
Coordinates: 1 + 2ε, 4, 5 + 2ε, 6, 7 − ε. [figure omitted]

Complete-link: Sensitivity to outliers
- The complete-link clustering of this set splits d2 from its right neighbors: clearly undesirable.
- The reason is the outlier d1.
- This shows that a single outlier can negatively affect the outcome of complete-link clustering.
- Single-link clustering does better in this case.

❹ Centroid / GAAC

Centroid HAC
- The similarity of two clusters is the average inter-similarity, i.e., the average similarity of documents from the first cluster with documents from the second cluster.
- A naive implementation of this definition is inefficient (O(N²)), but the definition is equivalent to computing the similarity of the centroids.
- Hence the name: centroid HAC.
- Note: this is the dot product, not cosine similarity.

Exercise: Compute centroid clustering
[exercise figure omitted]

Centroid clustering
[figure omitted]

The inversion in centroid clustering
- In an inversion, the similarity increases during a merge sequence. This results in an "inverted" dendrogram.
- Below: the similarity of the first merger (d1 ∪ d2) is 4.0; the similarity of the second merger ((d1 ∪ d2) ∪ d3) is ≈ −3.5. [figure omitted]

Inversions
- Hierarchical clustering algorithms that allow inversions are inferior.
- The rationale for hierarchical clustering is that at any given point we have found the most coherent clustering of a given size.
- Intuitively: smaller clusterings should be more coherent than larger clusterings.
- An inversion contradicts this intuition: we have a large cluster that is more coherent than one of its subclusters.

Group-average agglomerative clustering (GAAC)
- GAAC also has an "average-similarity" criterion, but does not have inversions.
- The similarity of two clusters is the average intra-similarity: the average similarity of all document pairs (including those from the same cluster).
- But we exclude self-similarities.
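The equivalence claimed for centroid HAC above (the average pairwise inter-similarity under the dot product equals the dot product of the two centroids) is easy to verify numerically. A minimal sketch with made-up vectors:

```python
# Sketch: the average pairwise dot product between two clusters
# equals the dot product of their centroids.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 4))    # cluster A: 5 documents, 4-dimensional vectors
B = rng.normal(size=(3, 4))    # cluster B: 3 documents

avg_inter_sim = np.mean([a @ b for a in A for b in B])   # naive O(|A||B|) average
centroid_sim = A.mean(axis=0) @ B.mean(axis=0)           # similarity of the centroids

print(np.isclose(avg_inter_sim, centroid_sim))   # True
```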
Group-average agglomerative clustering (GAAC)
- Again, a naive implementation is inefficient (O(N²)), and there is an equivalent, more efficient, centroid-based definition (see book).
- Again, this is the dot product, not cosine similarity.

Which HAC clustering should I use?
- Don't use centroid HAC, because of inversions.
- In most cases GAAC is best, since it is not subject to chaining or to sensitivity to outliers.
- However, we can only use GAAC for vector representations.
- For other types of document representations (or if only pairwise similarities for documents are available): use complete-link.
- There are also some applications for single-link (e.g., duplicate detection in web search).

Flat or hierarchical clustering?
- For high efficiency, use flat clustering (or perhaps bisecting K-means).
- For deterministic results: HAC.
- When a hierarchical structure is desired: a hierarchical algorithm.
- HAC can also be applied if K cannot be predetermined (it can start without knowing K).

❺ Variants

Efficient single-link clustering
[algorithm figure omitted]

Time complexity of HAC
- The single-link algorithm we just saw is O(N²).
- Much more efficient than the O(N³) algorithm we looked at earlier.
- There is no known O(N²) algorithm for complete-link, centroid and GAAC.
- The best time complexity for these three is O(N² log N): see book.
- In practice: little difference between O(N² log N) and O(N²).

Combination similarities of the four algorithms
[table omitted]

Comparison of HAC algorithms

| method        | combination similarity             | time compl. | optimal | comment                           |
|---------------|------------------------------------|-------------|---------|-----------------------------------|
| single-link   | max inter-similarity of any 2 docs | Θ(N²)       | yes     | chaining effect                   |
| complete-link | min inter-similarity of any 2 docs | Θ(N² log N) | no      | sensitive to outliers             |
| group-average | average of all sims                | Θ(N² log N) | no      | best choice for most applications |
| centroid      | average inter-similarity           | Θ(N² log N) | no      | inversions can occur              |

What to do with the hierarchy?
- Use as is (e.g., for browsing, as in the Yahoo hierarchy).
- Cut at a predetermined threshold.
- Cut to get a predetermined number of clusters K.
  - This ignores the hierarchy below and above the cutting line.

Bisecting K-means: a top-down algorithm
- Start with all documents in one cluster.
- Split the cluster into 2 using K-means.
- Of the clusters produced so far, select one to split (e.g., select the largest one).
- Repeat until we have produced the desired number of clusters.

Bisecting K-means
[algorithm figure omitted]

Bisecting K-means
- If we don't generate a complete hierarchy, then a top-down algorithm like bisecting K-means is much more efficient than HAC algorithms.
- But bisecting K-means is not deterministic.
- There are deterministic versions of bisecting K-means (see resources at the end), but they are much less efficient.
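A minimal sketch of the bisecting K-means procedure described above, assuming scikit-learn's KMeans for the 2-way splits; the toy data, the largest-cluster split heuristic, and the parameter choices are illustrative only.

```python
# Sketch: bisecting K-means. Repeatedly split the largest remaining cluster
# with 2-means until K clusters have been produced. The outcome still depends
# on the K-means initialization of each split (fixed here for reproducibility).
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, k, seed=0):
    clusters = [np.arange(X.shape[0])]            # start: all docs in one cluster
    while len(clusters) < k:
        # select the cluster to split: here, simply the largest one
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        split = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[members])
        clusters.append(members[split == 0])
        clusters.append(members[split == 1])
    return clusters

# Toy usage: 4 clusters over random vectors.
X = np.random.default_rng(1).normal(size=(100, 5))
print([len(c) for c in bisecting_kmeans(X, 4)])
```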
❻ Labeling clusters

Major issue in clustering: labeling
- After a clustering algorithm finds a set of clusters: how can they be made useful to the end user?
- We need a pithy label for each cluster.
- For example, in search result clustering for "jaguar", the labels of the three clusters could be "animal", "car", and "operating system".
- Topic of this section: how can we automatically find good labels for clusters?

Exercise
- Come up with an algorithm for labeling clusters.
- Input: a set of documents, partitioned into K clusters (flat clustering).
- Output: a label for each cluster.
- Part of the exercise: what types of labels should we consider? Words?

Discriminative labeling
- To label cluster ω, compare ω with all other clusters.
- Find terms or phrases that distinguish ω from the other clusters.
- We can use any of the feature selection criteria we introduced in text classification to identify discriminating terms: mutual information, χ² and frequency (but the latter is actually not discriminative).

Non-discriminative labeling
- Select terms or phrases based solely on information from the cluster itself, e.g., terms with high weights in the centroid (if we are using a vector space model).
- Non-discriminative methods sometimes select frequent terms that do not distinguish clusters, for example MONDAY, TUESDAY, ... in newspaper text.

Using titles for labeling clusters
- Terms and phrases are hard to scan and condense into a holistic idea of what the cluster is about.
- Alternative: titles, for example the titles of two or three documents that are closest to the centroid.
- Titles are easier to scan than a list of phrases.

Cluster labeling: Example

| #  | docs | centroid                                                                | mutual information                                                      | title                                                |
|----|------|-------------------------------------------------------------------------|-------------------------------------------------------------------------|------------------------------------------------------|
| 4  | 622  | oil plant mexico production crude power 000 refinery gas bpd           | plant oil production barrels crude bpd mexico dolly capacity petroleum | MEXICO: Hurricane Dolly heads for Mexico coast       |
| 9  | 1017 | police security russian people military peace killed told grozny court | police killed military security peace told troops forces rebels people | RUSSIA: Russia's Lebed meets rebel chief in Chechnya |
| 10 | 1259 | 00 000 tonnes traders futures wheat prices cents september tonne       | delivery traders futures tonne tonnes desk wheat prices 000 00         | USA: Export Business Grain/oilseeds complex          |

- Three labeling methods: most prominent terms in the centroid, differential labeling using MI, and the title of the document closest to the centroid.
- All three methods do a pretty good job.

Resources
- Chapter 17 of IIR
- Resources at http://ifnlp.org/ir
- Columbia Newsblaster (a precursor of Google News): McKeown et al. (2002)
- Bisecting K-means clustering: Steinbach et al. (2000)
- PDDP (similar to bisecting K-means; deterministic, but also less efficient): Savaresi and Boley (2004)
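To make the centroid-based (non-discriminative) labeling method above concrete, here is a minimal sketch; the toy documents, the fixed cluster assignment, and the use of scikit-learn's TfidfVectorizer are illustrative assumptions, not part of the slides.

```python
# Sketch: label each cluster with the highest-weighted terms of its centroid
# in a tf-idf vector space (non-discriminative labeling).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "jaguar speed cat wild animal habitat",
    "jaguar rainforest animal prey",
    "jaguar car engine speed luxury",
    "jaguar car price dealer test drive",
]
clusters = {0: [0, 1], 1: [2, 3]}      # cluster id -> document indices (toy assignment)

vec = TfidfVectorizer()
X = vec.fit_transform(docs).toarray()
terms = np.array(vec.get_feature_names_out())

for cid, members in clusters.items():
    centroid = X[members].mean(axis=0)
    top_terms = terms[np.argsort(centroid)[::-1][:3]]   # three highest-weighted terms
    print(cid, list(top_terms))
```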