Question? Leave a message!




Applications in clustering in IR

Applications in clustering in IR
WilliamsMcmahon Profile Pic
WilliamsMcmahon,United States,Professional
Published Date:20-07-2017
Website URL
Comment
Introduction to Information Retrieval Introduction to Information Retrieval Applications in clustering in IRIntroduction to Information Retrieval Outline ❶ Recap ❷ Anchor Text ❸ Citation Analysis ❹ PageRank ❺ HITS: Hubs & AuthoritiesIntroduction to Information Retrieval Outline ❶ Recap ❷ Anchor Text ❸ Citation Analysis ❹ PageRank ❺ HITS: Hubs & AuthoritiesIntroduction to Information Retrieval Applications in clustering in IR What is Application Benefit clustered? Search results search results more effective information clustering presentation to user Scatter-Gather (subsets of) alternative user interface: collection “search without typing” Collection clustering collection effective information presentation for exploratory browsing Cluser-based retrieval collection Higher efficiency: faster searchIntroduction to Information Retrieval K-means algorithmIntroduction to Information Retrieval Initialization of K-means  Random seed selection is just one of many ways K-means can be initialized.  Random seed selection is not very robust: It's easy to get a suboptimal clustering.  Better ways of computing initial centroids:  Select seeds not randomly, but using some heuristic (e.g., filter out outliers or find a set of seeds that has “good coverage” of the space)  Use hierarchical clustering to find good seeds  Select i (e.g., i = 10) different random sets of seeds , do a K-means clustering for each, select the clustering with lowest RSS. 6Introduction to Information Retrieval External criterion: Purity  W = w , w , …, w is the set of clusters and 1 2 K C = c , c , …, c is the set of classes 1 2 J  For each cluster w : find class c with most k j members n in w kj k  Sum all n and divide by total number of points kj 7Introduction to Information Retrieval Finding the “knee” in the curve 8Introduction to Information Retrieval Finding the “knee” in the curve Pick the number of clusters where curve “flattens”. Here: 4 or 9. 9Introduction to Information Retrieval Take-away today 10Introduction to Information Retrieval Take-away today  Anchor text: What exactly are links on the web and why are they important for IR? 11Introduction to Information Retrieval Take-away today  Anchor text: What exactly are links on the web and why are they important for IR?  Citation analysis: the mathematical foundation of PageRank and link-based ranking 12Introduction to Information Retrieval Take-away today  Anchor text: What exactly are links on the web and why are they important for IR?  Citation analysis: the mathematical foundation of PageRank and link-based ranking  PageRank : the original algorithm that was used for link- based ranking on the web 13Introduction to Information Retrieval Take-away today  Anchor text: What exactly are links on the web and why are they important for IR?  Citation analysis: the mathematical foundation of PageRank and link-based ranking  PageRank : the original algorithm that was used for link- based ranking on the web  Hubs & Authorities: an alternative link-based ranking algorithm 14Introduction to Information Retrieval Outline ❶ Recap ❷ Anchor Text ❸ Citation Analysis ❹ PageRank ❺ HITS: Hubs & AuthoritiesIntroduction to Information Retrieval The web as a directed graph 16Introduction to Information Retrieval The web as a directed graph 17Introduction to Information Retrieval The web as a directed graph  Assumption 1: A hyperlink is a quality signal. 18Introduction to Information Retrieval The web as a directed graph  Assumption 1: A hyperlink is a quality signal.  The hyperlink d → d indicates that d ‘s author deems d 1 2 1 2 high-quality and relevant. 19Introduction to Information Retrieval The web as a directed graph  Assumption 1: A hyperlink is a quality signal.  The hyperlink d → d indicates that d ‘s author deems d 1 2 1 2 high-quality and relevant.  Assumption 2: The anchor text describes the content of d . 2 20