Clustering in information retrieval ppt

applications of clustering in information retrieval and item clustering in information retrieval systems
WilliamsMcmahon Profile Pic
WilliamsMcmahon,United States,Professional
Published Date:20-07-2017
Your Website URL(Optional)
Comment
Introduction to Information Retrieval Introduction to Information Retrieval Applications in clustering in IRIntroduction to Information Retrieval Applications in clustering in IR What is Application Benefit clustered? Search results search results more effective information clustering presentation to user Scatter-Gather (subsets of) alternative user interface: collection “search without typing” Collection clustering collection effective information presentation for exploratory browsing Cluser-based retrieval collection Higher efficiency: faster searchIntroduction to Information Retrieval K-means algorithmIntroduction to Information Retrieval Initialization of K-means  Random seed selection is just one of many ways K-means can be initialized.  Random seed selection is not very robust: It's easy to get a suboptimal clustering.  Better ways of computing initial centroids:  Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has “good coverage” of the space)  Use hierarchical clustering to find good seeds  Select i (e.g., i = 10) different random sets of seeds , do a K-means clustering for each, select the clustering with lowest RSS. 6Introduction to Information Retrieval External criterion: Purity  W = w , w , …, w is the set of clusters and 1 2 K C = c , c , …, c is the set of classes 1 2 J  For each cluster w : find class c with most k j members n in w kj k  Sum all n and divide by total number of points kj 7Introduction to Information Retrieval Take-away today  Anchor text: What exactly are links on the web and why are they important for IR?  Citation analysis: the mathematical foundation of PageRank and link-based ranking  PageRank : the original algorithm that was used for link- based ranking on the web  Hubs & Authorities: an alternative link-based ranking algorithm 14Introduction to Information Retrieval The web as a directed graph  Assumption 1: A hyperlink is a quality signal.  The hyperlink d → d indicates that d ‘s author deems d 1 2 1 2 high-quality and relevant.  Assumption 2: The anchor text describes the content of d . 2  We use anchor text somewhat loosely here for: the text surrounding the hyperlink .  Example: “You can find cheap cars ˂a href =http://…˃here ˂/a ˃. ” 22Introduction to Information Retrieval The web as a directed graph  Assumption 1: A hyperlink is a quality signal.  The hyperlink d → d indicates that d ‘s author deems d 1 2 1 2 high-quality and relevant.  Assumption 2: The anchor text describes the content of d . 2  We use anchor text somewhat loosely here for: the text surrounding the hyperlink .  Example: “You can find cheap cars ˂a href =http://…˃here ˂/a ˃. ”  Anchor text: “You can find cheap here” 23Introduction to Information Retrieval text of d only vs. text of d + + anchor text → d 2 2 2  Searching on text of d + + anchor text → d is often 2 2 more effective than searching on text of d only. 2  Example: Query IBM  Matches IBM’s copyright page  Matches many spam pages  Matches IBM wikipedia article  May not match IBM home page  … if IBM home page is mostly graphics  Searching on anchor text → d is better for the query IBM. 2 32Introduction to Information Retrieval text of d only vs. text of d + + anchor text → d 2 2 2  Searching on text of d + + anchor text → d is often 2 2 more effective than searching on text of d only. 2  Example: Query IBM  Matches IBM’s copyright page  Matches many spam pages  Matches IBM wikipedia article  May not match IBM home page  … if IBM home page is mostly graphics  Searching on anchor text → d is better for the query IBM. 2  In this representation, the page with most occurences of IBM is www.ibm.com 33Introduction to Information Retrieval Indexing anchor text  Thus: Anchor text is often a better description of a page’s content than the page itself.  Anchor text can be weighted more highly than document text. (based on Assumption 1&2) 38Introduction to Information Retrieval Exercise: Assumptions underlying PageRank  Assumption 1: A link on the web is a quality signal – the author of the link thinks that the linked-to page is high- quality.  Assumption 2: The anchor text describes the content of the linked-to page.  Is assumption 1 true in general?  Is assumption 2 true in general? 43Introduction to Information Retrieval Google bombs  A Google bomb is a search with “bad” results due to maliciously manipulated anchor text.  Google introduced a new weighting function in January 2007 that fixed many Google bombs.  Still some remnants: dangerous cult on Google, Bing, Yahoo  Coordinated link creation by those who dislike the Church of Scientology  Defused Google bombs: dumb motherf…+, who is a failure?+, evil empire 49Introduction to Information Retrieval Origins of PageRank: Citation analysis (1)  Citation analysis: analysis of citations in the scientific literature.  Example citation: “Miller (2001) has shown that physical activity alters the metabolism of estrogens.”  We can view “Miller (2001)” as a hyperlink linking two scientific articles.  One application of these “hyperlinks” in the scientific literature:  Measure the similarity of two articles by the overlap of other articles citing them. 56Introduction to Information Retrieval Origins of PageRank: Citation analysis (1)  Citation analysis: analysis of citations in the scientific literature.  Example citation: “Miller (2001) has shown that physical activity alters the metabolism of estrogens.”  We can view “Miller (2001)” as a hyperlink linking two scientific articles.  One application of these “hyperlinks” in the scientific literature:  Measure the similarity of two articles by the overlap of other articles citing them.  This is called cocitation similarity. 57Introduction to Information Retrieval Origins of PageRank: Citation analysis (1)  Citation analysis: analysis of citations in the scientific literature.  Example citation: “Miller (2001) has shown that physical activity alters the metabolism of estrogens.”  We can view “Miller (2001)” as a hyperlink linking two scientific articles.  One application of these “hyperlinks” in the scientific literature:  Measure the similarity of two articles by the overlap of other articles citing them.  This is called cocitation similarity.  Cocitation similarity on the web: Google’s “find pages like this” or “Similar” feature. 58Introduction to Information Retrieval Origins of PageRank: Citation analysis (2)  Another application: Citation frequency can be used to measure the impact of an article .  Simplest measure: Each article gets one vote – not very accurate.  On the web: citation frequency = inlink count  A high inlink count does not necessarily mean high quality ...  ... mainly because of link spam.  Better measure: weighted citation frequency or citation rank  An article’s vote is weighted according to its citation impact. 66Introduction to Information Retrieval Origins of PageRank: Citation analysis (2)  Another application: Citation frequency can be used to measure the impact of an article .  Simplest measure: Each article gets one vote – not very accurate.  On the web: citation frequency = inlink count  A high inlink count does not necessarily mean high quality ...  ... mainly because of link spam.  Better measure: weighted citation frequency or citation rank  An article’s vote is weighted according to its citation impact.  Circular? No: can be formalized in a well-defined way. 67Introduction to Information Retrieval Origins of PageRank: Citation analysis (3)  Better measure: weighted citation frequency or citation rank.  This is basically PageRank.  PageRank was invented in the context of citation analysis by Pinsker and Narin in the 1960s.  Citation analysis is a big deal: The budget and salary of this lecturer are / will be determined by the impact of his publications 72Introduction to Information Retrieval Origins of PageRank: Summary  We can use the same formal representation for  citations in the scientific literature  hyperlinks on the web  Appropriately weighted citation frequency is an excellent measure of quality ...  ... both for web pages and for scientific publications.  Next: PageRank algorithm for computing weighted citation frequency on the web. 79