Document clustering

Introduction to Information Retrieval

Today's Topic: Clustering

- Document clustering
  - Motivations
  - Document representations
  - Success criteria
- Clustering algorithms
  - Partitional
  - Hierarchical

What is clustering? (Ch. 16)

- Clustering: the process of grouping a set of objects into classes of similar objects
  - Documents within a cluster should be similar.
  - Documents from different clusters should be dissimilar.
- The commonest form of unsupervised learning
  - Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
- A common and important task that finds many applications in IR and elsewhere

A data set with clear cluster structure (Ch. 16)

[Figure: a 2-D scatter plot of points falling into three well-separated groups]

- How would you design an algorithm for finding the three clusters in this case?

Applications of clustering in IR (Sec. 16.1)

- Whole-corpus analysis/navigation
  - Better user interface: search without typing
- Improving recall in search applications
  - Better search results (like pseudo relevance feedback)
- Better navigation of search results
  - Effective "user recall" will be higher
- Speeding up vector space retrieval
  - Cluster-based retrieval gives faster search

The Yahoo! hierarchy isn't clustering, but it is the kind of output you want from clustering

[Figure: a fragment of the Yahoo! directory hierarchy rooted at www.yahoo.com/Science, branching into categories such as agriculture, biology, physics, CS, and space, with subcategories such as dairy, crops, agronomy, botany, cell, evolution, magnetism, relativity, AI, HCI, courses, craft, forestry, and missions]

Google News: automatic clustering gives an effective news presentation metaphor

Scatter/Gather: Cutting, Karger, and Pedersen (Sec. 16.1)

For visualizing a document collection and its themes

- Wise et al., "Visualizing the non-visual", PNNL
- ThemeScapes, Cartia
  - Mountain height = cluster size

For improving search recall (Sec. 16.1)

- Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs
- Therefore, to improve search recall:
  - Cluster docs in the corpus a priori
  - When a query matches a doc D, also return other docs in the cluster containing D
- Hope: the query "car" will also return docs containing "automobile"
  - Because clustering grouped together docs containing "car" with those containing "automobile". Why might this happen?

yippy.com – grouping search results

Issues for clustering (Sec. 16.2)

- Representation for clustering
  - Document representation
    - Vector space? Normalization?
    - Centroids aren't length-normalized
  - Need a notion of similarity/distance (a small sketch follows below)
- How many clusters?
  - Fixed a priori?
  - Completely data-driven?
- Avoid "trivial" clusters, whether too large or too small
  - If a cluster is too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much.
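To make the representation bullets concrete, here is a minimal sketch, not from the original slides, of building length-normalized tf-idf vectors and comparing them with cosine similarity. The toy corpus and the `tfidf_vectors` and `cosine` helper names are introduced here purely for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized docs into length-normalized tf-idf vectors (term -> weight dicts)."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: tf[t] * math.log(N / df[t]) for t in tf}   # tf * idf
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})  # unit length
    return vectors

def cosine(u, v):
    """Cosine similarity; for unit-length vectors this is just the dot product."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Toy corpus for illustration only
docs = [["car", "engine", "road"], ["automobile", "engine"], ["botany", "cell"]]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # shared term "engine" -> nonzero similarity
print(cosine(vecs[0], vecs[2]))  # no shared terms -> 0.0
```

Length-normalizing each vector up front means cosine similarity reduces to a dot product, which is one reason the slides flag normalization as a representation issue.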
Notion of similarity/distance

- Ideal: semantic similarity
- Practical: term-statistical similarity
  - We will use cosine similarity.
  - Docs as vectors.
- For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
- We will mostly speak of Euclidean distance
  - But real implementations use cosine similarity

Clustering Algorithms

- Flat algorithms
  - Usually start with a random (partial) partitioning
  - Refine it iteratively
    - K-means clustering
    - (Model-based clustering)
- Hierarchical algorithms
  - Bottom-up, agglomerative
  - (Top-down, divisive)

Hard vs. soft clustering

- Hard clustering: each document belongs to exactly one cluster
  - More common and easier to do
- Soft clustering: a document can belong to more than one cluster
  - Makes more sense for applications like creating browsable hierarchies
  - You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
  - You can only do that with a soft clustering approach.
- We won't do soft clustering today. See IIR 16.5, 18.

Partitioning Algorithms

- Partitioning method: construct a partition of n documents into a set of K clusters
- Given: a set of documents and the number K
- Find: a partition into K clusters that optimizes the chosen partitioning criterion
  - The globally optimal partition is intractable to find for many objective functions: it would require exhaustively enumerating all partitions
  - Effective heuristic methods: the K-means and K-medoids algorithms
- See also Kleinberg, NIPS 2002 – an impossibility result for natural clustering

K-Means (Sec. 16.4)

- Assumes documents are real-valued vectors.
- Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster c:

  $\vec{\mu}(c) = \frac{1}{|c|} \sum_{\vec{x} \in c} \vec{x}$

- Reassignment of instances to clusters is based on distance to the current cluster centroids.
  - (Or one can equivalently phrase it in terms of similarities.)

K-Means Algorithm (Sec. 16.4)

    Select K random docs s_1, s_2, ..., s_K as seeds.
    Until clustering converges (or other stopping criterion):
        For each doc d_i:
            Assign d_i to the cluster c_j such that dist(x_i, s_j) is minimal.
        (Next, update the seeds to the centroid of each cluster.)
        For each cluster c_j:
            s_j = mu(c_j)

K-Means Example (K=2) (Sec. 16.4)

[Figure: K-means iterations on 2-D points with K=2: pick seeds; reassign clusters; compute centroids (marked x); reassign clusters; compute centroids; reassign clusters; converged]

(A runnable sketch of this loop appears below.)

Termination conditions (Sec. 16.4)

- Several possibilities, e.g.:
  - A fixed number of iterations.
  - Doc partition unchanged.
  - Centroid positions don't change.
- Does this mean that the docs in a cluster are unchanged?
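As a companion to the pseudocode and termination slides above, here is a minimal K-means sketch in plain Python. It is an illustration under assumptions, not code from the lecture: it uses Euclidean distance as the slides do, stops when the doc partition is unchanged or after a fixed number of iterations (two of the listed termination conditions), and the `kmeans`, `euclidean`, and `centroid` names are hypothetical.

```python
import random

def euclidean(x, s):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, s)) ** 0.5

def centroid(cluster):
    """Mean of the vectors in a cluster: mu(c) = (1/|c|) * sum of x in c."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def kmeans(docs, K, max_iters=100):
    # Illustrative sketch; not from the IIR slides.
    seeds = random.sample(docs, K)          # select K random docs as seeds
    assignment = None
    for _ in range(max_iters):              # fixed iteration cap as a fallback stop
        # Assign each doc to the cluster whose seed is nearest
        new_assignment = [min(range(K), key=lambda j: euclidean(x, seeds[j]))
                          for x in docs]
        if new_assignment == assignment:    # doc partition unchanged -> converged
            break
        assignment = new_assignment
        # Update each seed to the centroid of its cluster
        for j in range(K):
            members = [x for x, a in zip(docs, assignment) if a == j]
            if members:                     # keep the old seed if a cluster is empty
                seeds[j] = centroid(members)
    return assignment, seeds

# Toy 2-D data with two obvious clusters
docs = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.1, 4.9)]
print(kmeans(docs, K=2))
```

Note the empty-cluster guard: the slides' pseudocode doesn't say what to do when a seed attracts no documents, so this sketch simply keeps the previous seed in that case.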