Question? Leave a message!

Applications of clustering in IR

Applications of clustering in IR
WilliamsMcmahon Profile Pic
WilliamsMcmahon,United States,Professional
Published Date:20-07-2017
Website URL
Introduction to Information Retrieval Introduction to Information Retrieval Hierarchical Clustering 1Introduction to Information Retrieval Overview ❶ Recap ❷ Introduction ❸ Single-link/ Complete-link ❹ Centroid/ GAAC ❺ Variants ❻ Labeling clusters 2Introduction to Information Retrieval Outline ❶ Recap ❷ Introduction ❸ Single-link/ Complete-link ❹ Centroid/ GAAC ❺ Variants ❻ Labeling clusters 3Introduction to Information Retrieval Applications of clustering in IR Application What is Benefit Example clustered? Search result search more effective clustering results information presentation to user Scatter-Gather (subsets alternative user of) collection interface: “search without typing” Collection collection effective information McKeown et al. clustering presentation for 2002, exploratory browsing Cluster-based collection higher efficiency: Salton 1971 retrieval faster search 4 4Introduction to Information Retrieval K- means algorithm 5 5Introduction to Information Retrieval Initialization of K-means  Random seed selection is just one of many ways K-means can be initialized.  Random seed selection is not very robust: It’s easy to get a suboptimal clustering.  Better heuristics:  Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has “good coverage” of the document space)  Use hierarchical clustering to find good seeds (next class)  Select i (e.g., i = 10) different sets of seeds, do a K-means clustering for each, select the clustering with lowest RSS 6 6Introduction to Information Retrieval External criterion: Purity  Ω= ω , ω , . . . , ω is the set of clusters and 1 2 K C = c , c , . . . , c is the set of classes. 1 2 J  For each cluster ω : find class c with most members n in ω k j kj k  Sum all n and divide by total number of points kj 7 7Introduction to Information Retrieval Outline ❶ Recap ❷ Introduction ❸ Single-link/ Complete-link ❹ Centroid/ GAAC ❺ Variants ❻ Labeling clusters 8Introduction to Information Retrieval Hierarchical clustering Our goal in hierarchical clustering is to create a hierarchy like the one we saw earlier in Reuters: We want to create this hierarchy automatically. We can do this either top-down or bottom-up. The best known bottom-up method is hierarchical agglomerative clustering. 9 9Introduction to Information Retrieval Hierarchical agglomerative clustering (HAC)  HAC creates a hierachy in the form of a binary tree.  Assumes a similarity measure for determining the similarity of two clusters.  Up to now, our similarity measures were for documents.  We will look at four different cluster similarity measures. 10 10Introduction to Information Retrieval Hierarchical agglomerative clustering (HAC)  Start with each document in a separate cluster  Then repeatedly merge the two clusters that are most similar  Until there is only one cluster  The history of merging is a hierarchy in the form of a binary tree.  The standard way of depicting this history is a dendrogram. 11 11Introduction to Information Retrieval A dendogram  The history of mergers can be read off from bottom to top.  The horizontal line of each merger tells us what the similarity of the merger was.  We can cut the dendrogram at a particular point (e.g., at 0.1 or 0.4) to get a flat clustering. 12 12Introduction to Information Retrieval Divisive clustering  Divisive clustering is top-down.  Alternative to HAC (which is bottom up).  Divisive clustering:  Start with all docs in one big cluster  Then recursively split clusters  Eventually each node forms a cluster on its own.  → Bisecting K-means at the end  For now: HAC (= bottom-up) 13 13Introduction to Information Retrieval Naive HAC algorithm 14 14Introduction to Information Retrieval Computational complexity of the naive algorithm  First, we compute the similarity of all N× N pairs of documents.  Then, in each of N iterations:  We scan the O(N× N) similarities to find the maximum similarity.  We merge the two clusters with maximum similarity.  We compute the similarity of the new cluster with all other (surviving) clusters.  There are O(N) iterations, each performing a O(N× N) “scan” operation. 3  Overall complexity is O(N ).  We’ll look at more efficient algorithms later. 15 15Introduction to Information Retrieval Key question: How to define cluster similarity  Single-link: Maximum similarity  Maximum similarity of any two documents  Complete-link: Minimum similarity  Minimum similarity of any two documents  Centroid: Average “intersimilarity”  Average similarity of all document pairs (but excluding pairs of docs in the same cluster)  This is equivalent to the similarity of the centroids.  Group-average: Average “intrasimilarity”  Average similary of all document pairs, including pairs of docs in the same cluster 16 16Introduction to Information Retrieval Cluster similarity: Example 17 17Introduction to Information Retrieval Single-link: Maximum similarity 18 18Introduction to Information Retrieval Complete-link: Minimum similarity 19 19Introduction to Information Retrieval Centroid: Average intersimilarity intersimilarity = similarity of two documents in different clusters 20 20