Introduction to Information Retrieval

Outline
❶ Recap
❷ Anchor Text
❸ Citation Analysis
❹ PageRank
❺ HITS: Hubs & Authorities
❶ Recap

Applications of clustering in IR

Application              | What is clustered       | Benefit
Search result clustering | search results          | more effective information presentation to the user
Scatter-Gather           | (subsets of) collection | alternative user interface: "search without typing"
Collection clustering    | collection              | effective information presentation for exploratory browsing
Cluster-based retrieval  | collection              | higher efficiency: faster search

Initialization of K-means
- Random seed selection is just one of many ways K-means can be initialized.
- Random seed selection is not very robust: it is easy to get a suboptimal clustering.
- Better ways of computing initial centroids:
  - Select seeds not randomly, but using some heuristic (e.g., filter out outliers, or find a set of seeds that has "good coverage" of the space).
  - Use hierarchical clustering to find good seeds.
  - Select i (e.g., i = 10) different random sets of seeds, do a K-means clustering for each, and select the clustering with the lowest RSS. A sketch of this restart strategy follows below.
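To make the last strategy concrete, here is a minimal sketch of K-means with random restarts. The mini Lloyd-iteration implementation, the toy data, and the names (kmeans, kmeans_restarts) are illustrative assumptions, not code from the lecture.

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """One run of K-means; returns (centroids, assignment, RSS)."""
    rng = rng or np.random.default_rng()
    # Random seed selection: k distinct data points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Recompute centroids (keep the old centroid if a cluster went empty).
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    rss = ((X - centroids[assign]) ** 2).sum()  # residual sum of squares
    return centroids, assign, rss

def kmeans_restarts(X, k, i=10, seed=0):
    """Run K-means i times with different random seeds; keep the lowest-RSS run."""
    rng = np.random.default_rng(seed)
    return min((kmeans(X, k, rng=rng) for _ in range(i)), key=lambda r: r[2])

# Toy usage: two well-separated blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centroids, assign, rss = kmeans_restarts(X, k=2, i=10)
```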
External criterion: Purity
- Ω = {ω1, ω2, …, ωK} is the set of clusters and C = {c1, c2, …, cJ} is the set of classes.
- For each cluster ωk: find the class cj with the most members nkj in ωk.
- Sum all the nkj and divide by the total number of points. A small computation sketch follows below.
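A minimal sketch of the purity computation just described, i.e. purity(Ω, C) = (1/N) Σ_k max_j nkj; encoding clusters and classes as parallel label lists is an assumption for illustration.

```python
from collections import Counter

def purity(clusters, classes):
    """clusters, classes: parallel lists giving each point's cluster id
    and true class label.  purity = (1/N) * sum_k max_j n_kj."""
    n_kj = Counter(zip(clusters, classes))   # counts per (cluster, class) pair
    best = Counter()
    for (k, j), n in n_kj.items():
        best[k] = max(best[k], n)            # majority-class count per cluster
    return sum(best.values()) / len(clusters)

# Example: 3 clusters over 6 points.
print(purity([0, 0, 1, 1, 2, 2], ["A", "A", "A", "B", "B", "B"]))  # 5/6 ≈ 0.83
```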
Finding the "knee" in the curve
- Pick the number of clusters where the curve "flattens". Here: 4 or 9.

Takeaway today
- Anchor text: what exactly links on the web are, and why they are important for IR.
- Citation analysis: the mathematical foundation of PageRank and link-based ranking.
- PageRank: the original algorithm that was used for link-based ranking on the web.
- Hubs & Authorities: an alternative link-based ranking algorithm.

❷ Anchor Text

The web as a directed graph
- Assumption 1: A hyperlink is a quality signal.
  - The hyperlink d1 → d2 indicates that d1's author deems d2 high-quality and relevant.
- Assumption 2: The anchor text describes the content of d2.
  - We use "anchor text" somewhat loosely here for: the text surrounding the hyperlink.
  - Example: "You can find cheap cars <a href=http://…>here</a>."
  - Anchor text: "You can find cheap here"
Text of d2 only vs. text of d2 + anchor text → d2
- Searching on [text of d2 + anchor text → d2] is often more effective than searching on [text of d2] only.
- Example: query IBM
  - Matches IBM's copyright page
  - Matches many spam pages
  - Matches the IBM Wikipedia article
  - May not match the IBM home page …
  - … if the IBM home page is mostly graphics.
- Searching on [anchor text → d2] is better for the query IBM.
- In this representation, the page with the most occurrences of IBM is www.ibm.com.

(Figure: anchor text containing IBM pointing to www.ibm.com.)

Indexing anchor text
- Thus: anchor text is often a better description of a page's content than the page itself.
- Anchor text can be weighted more highly than document text (based on Assumptions 1 and 2). A toy indexing sketch follows below.
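As a toy illustration of crediting anchor text to the *target* page rather than the source page, here is a sketch using Python's standard html.parser. Note that it captures only the text strictly inside the <a> element, whereas the slides use "anchor text" loosely to include surrounding text; all names here are illustrative.

```python
from html.parser import HTMLParser
from collections import defaultdict

class AnchorExtractor(HTMLParser):
    """Collect (target_url, anchor_text) pairs from one page's HTML."""
    def __init__(self):
        super().__init__()
        self.pairs, self._href, self._text = [], None, []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
    def handle_data(self, data):
        if self._href is not None:        # inside an <a>…</a> element
            self._text.append(data)
    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, " ".join(self._text).strip()))
            self._href, self._text = None, []

# Index anchor text under the target document, not the source document.
anchor_index = defaultdict(list)
p = AnchorExtractor()
p.feed('You can find cheap cars <a href="http://example.com/cars">here</a>.')
for url, text in p.pairs:
    anchor_index[url].append(text)        # later: weight these terms for url
print(dict(anchor_index))                 # {'http://example.com/cars': ['here']}
```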
Exercise: Assumptions underlying PageRank
- Assumption 1: A link on the web is a quality signal – the author of the link thinks that the linked-to page is high-quality.
- Assumption 2: The anchor text describes the content of the linked-to page.
- Is assumption 1 true in general?
- Is assumption 2 true in general?

Google bombs
- A Google bomb is a search with "bad" results due to maliciously manipulated anchor text.
- Google introduced a new weighting function in January 2007 that fixed many Google bombs.
- Still some remnants: [dangerous cult] on Google, Bing, Yahoo
  - Coordinated link creation by those who dislike the Church of Scientology.
- Defused Google bombs: [dumb motherf…], [who is a failure], [evil empire]

❸ Citation Analysis

Origins of PageRank: Citation analysis (1)
- Citation analysis: the analysis of citations in the scientific literature.
- Example citation: "Miller (2001) has shown that physical activity alters the metabolism of estrogens."
- We can view "Miller (2001)" as a hyperlink linking two scientific articles.
- One application of these "hyperlinks" in the scientific literature:
  - Measure the similarity of two articles by the overlap of other articles citing them.
  - This is called co-citation similarity (see the sketch below).
  - Co-citation similarity on the web: Google's "find pages like this" or "Similar" feature.
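A sketch of co-citation similarity under one common formalization, the Jaccard overlap of the citing sets; the slides only say "overlap", so the normalization is an assumption, and the data are made up.

```python
def cocitation_similarity(citers_of, d1, d2):
    """Co-citation similarity: overlap of the sets of documents citing d1 and d2.
    citers_of maps a document to the set of documents that cite it."""
    a, b = citers_of.get(d1, set()), citers_of.get(d2, set())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)   # Jaccard-normalized overlap (assumption)

citers_of = {
    "miller2001": {"p1", "p2", "p3"},
    "smith2003":  {"p2", "p3", "p4"},
}
print(cocitation_similarity(citers_of, "miller2001", "smith2003"))  # 2/4 = 0.5
```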
Origins of PageRank: Citation analysis (2)
- Another application: citation frequency can be used to measure the impact of an article.
  - Simplest measure: each article gets one vote – not very accurate.
- On the web: citation frequency = inlink count.
  - A high inlink count does not necessarily mean high quality …
  - … mainly because of link spam.
- Better measure: weighted citation frequency, or citation rank.
  - An article's vote is weighted according to its citation impact.
  - Circular? No: this can be formalized in a well-defined way.

Origins of PageRank: Citation analysis (3)
- Better measure: weighted citation frequency, or citation rank.
- This is basically PageRank.
- PageRank was invented in the context of citation analysis by Pinsker and Narin in the 1960s.
- Citation analysis is a big deal: the budget and salary of this lecturer are / will be determined by the impact of his publications.

Origins of PageRank: Summary
- We can use the same formal representation for
  - citations in the scientific literature
  - hyperlinks on the web
- Appropriately weighted citation frequency is an excellent measure of quality …
- … both for web pages and for scientific publications.
- Next: the PageRank algorithm for computing weighted citation frequency on the web.
❹ PageRank

Model behind PageRank: Random walk
- Imagine a web surfer doing a random walk on the web:
  - Start at a random page.
  - At each step, go out of the current page along one of the links on that page, equiprobably.
- In the steady state, each page has a long-term visit rate.
- This long-term visit rate is the page's PageRank.
- PageRank = long-term visit rate = steady-state probability.

Formalization of the random walk: Markov chains
- A Markov chain consists of N states, plus an N × N transition probability matrix P.
- state = page
- At each step, we are on exactly one of the pages.
- For 1 ≤ i, j ≤ N, the matrix entry P_ij tells us the probability of j being the next page, given that we are currently on page i.
- Clearly, for all i: Σ_{j=1}^{N} P_ij = 1.
N P 1  Clearly, for all i,  ij j1 92Introduction to Information Retrieval Example web graphIntroduction to Information Retrieval Link matrix for example 94Introduction to Information Retrieval Link matrix for example d d d d d d d 0 1 2 3 4 5 6 d 0 0 1 0 0 0 0 0 d 0 1 1 0 0 0 0 1 d 1 0 1 1 0 0 0 2 d 0 0 0 1 1 0 0 3 d 0 0 0 0 0 0 1 4 d 0 0 0 0 0 1 1 5 d 0 0 0 1 1 0 1 6 95Introduction to Information Retrieval Transition probability matrix P for example 96Introduction to Information Retrieval Transition probability matrix P for example d d d d d d d 0 1 2 3 4 5 6 d 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0 d 0.00 0.50 0.50 0.00 0.00 0.00 0.00 1 d 0.33 0.00 0.33 0.33 0.00 0.00 0.00 2 d 0.00 0.00 0.00 0.50 0.50 0.00 0.00 3 d 0.00 0.00 0.00 0.00 0.00 0.00 1.00 4 d 0.00 0.00 0.00 0.00 0.00 0.50 0.50 5 d 0.00 0.00 0.00 0.33 0.33 0.00 0.33 6 97Introduction to Information Retrieval Longterm visit rate 98Introduction to Information Retrieval Longterm visit rate  Recall: PageRank = longterm visit rate. 99Introduction to Information Retrieval Longterm visit rate  Recall: PageRank = longterm visit rate.  Longterm visit rate of page d is the probability that a web surfer is at page d at a given point in time. 100Introduction to Information Retrieval Longterm visit rate  Recall: PageRank = longterm visit rate.  Longterm visit rate of page d is the probability that a web surfer is at page d at a given point in time.  Next: what properties must hold of the web graph for the longterm visit rate to be well defined 101Introduction to Information Retrieval Longterm visit rate  Recall: PageRank = longterm visit rate.  Longterm visit rate of page d is the probability that a web surfer is at page d at a given point in time.  Next: what properties must hold of the web graph for the longterm visit rate to be well defined  The web graph must correspond to an ergodic Markov chain. 102Introduction to Information Retrieval Longterm visit rate  Recall: PageRank = longterm visit rate.  Longterm visit rate of page d is the probability that a web surfer is at page d at a given point in time.  Next: what properties must hold of the web graph for the longterm visit rate to be well defined  The web graph must correspond to an ergodic Markov chain.  First a special case: The web graph must not contain dead ends. 103Introduction to Information Retrieval Dead ends 104Introduction to Information Retrieval Dead ends 105Introduction to Information Retrieval Dead ends  The web is full of dead ends. 106Introduction to Information Retrieval Dead ends  The web is full of dead ends.  Random walk can get stuck in dead ends. 107Introduction to Information Retrieval Dead ends  The web is full of dead ends.  Random walk can get stuck in dead ends.  If there are dead ends, longterm visit rates are not welldefined (or nonsensical). 108Introduction to Information Retrieval Teleporting – to get us of dead ends 109Introduction to Information Retrieval Teleporting – to get us of dead ends  At a dead end, jump to a random web page with prob. 1/ N . 110Introduction to Information Retrieval Teleporting – to get us of dead ends  At a dead end, jump to a random web page with prob. 1/ N .  At a nondead end, with probability 10, jump to a random web page (to each with a probability of 0.1/N ). 111Introduction to Information Retrieval Teleporting – to get us of dead ends  At a dead end, jump to a random web page with prob. 1/ N . 
Long-term visit rate
- Recall: PageRank = long-term visit rate.
- The long-term visit rate of page d is the probability that a web surfer is at page d at a given point in time.
- Next: what properties must hold of the web graph for the long-term visit rate to be well-defined?
- The web graph must correspond to an ergodic Markov chain.
- First, a special case: the web graph must not contain dead ends.

Dead ends
- The web is full of dead ends (pages with no outgoing links).
- A random walk can get stuck in a dead end.
- If there are dead ends, long-term visit rates are not well-defined (or nonsensical).

Teleporting – to get us out of dead ends
- At a dead end, jump to a random web page with probability 1/N.
- At a non-dead end, with probability 10%, jump to a random web page (to each with a probability of 0.1/N).
- With the remaining probability (90%), go out on a random hyperlink.
  - For example, if the page has 4 outgoing links: randomly choose one with probability (1 − 0.10)/4 = 0.225.
- 10% is a parameter, the teleportation rate.
- Note: "jumping" from a dead end is independent of the teleportation rate. (A sketch of the teleported matrix construction follows below.)
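A sketch of applying teleporting to a transition matrix, following the two rules above (uniform jump with probability 1/N from a dead end; otherwise mix α/N of teleporting into each entry). The function name and the vectorized formulation are illustrative.

```python
import numpy as np

def teleport(P, alpha=0.10):
    """Apply teleporting with teleportation rate alpha to transition matrix P."""
    N = P.shape[0]
    P = P.copy()
    dead = P.sum(axis=1) == 0
    P[dead] = 1.0 / N                             # dead end: jump anywhere, prob 1/N
    P[~dead] = (1 - alpha) * P[~dead] + alpha / N # else: mix alpha/N of teleporting in
    return P
```

With alpha = 0.14, teleport(P) applied to the example matrix above reproduces, up to rounding, the teleported matrix shown later in this lecture (the 0.02 / 0.31 / 0.45 / 0.88 entries); e.g., 0.86 · 0.5 + 0.02 = 0.45.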
Result of teleporting
- With teleporting, we cannot get stuck in a dead end.
- But even without dead ends, a graph may not have well-defined long-term visit rates.
- More generally, we require that the Markov chain be ergodic.

Ergodic Markov chains
- A Markov chain is ergodic if it is irreducible and aperiodic.
  - Irreducibility. Roughly: there is a path from any page to any other page.
  - Aperiodicity. Roughly: the pages cannot be partitioned such that the random walker visits the partitions sequentially.
- (Figure: a non-ergodic Markov chain.)
- Theorem: For any ergodic Markov chain, there is a unique long-term visit rate for each state.
  - This is the steady-state probability distribution.
  - Over a long time period, we visit each state in proportion to this rate.
  - It doesn't matter where we start.
- Teleporting makes the web graph ergodic.
- ⇒ The web graph + teleporting has a steady-state probability distribution.
- ⇒ Each page in the web graph + teleporting has a PageRank.
i  Example: ( 0.05 0.01 0.0 … 0.2 … 0.01 0.05 0.03 ) 1 2 3 … i … N2 N1 N  S x = 1 i 136Introduction to Information Retrieval Change in probability vector   If the probability vector is x = (x , ..., x ), at this step, what 1 N is it at the next step 137Introduction to Information Retrieval Change in probability vector   If the probability vector is x = (x , ..., x ), at this step, what 1 N is it at the next step  Recall that row i of the transition probability matrix P tells us where we go next from state i. 138Introduction to Information Retrieval Change in probability vector   If the probability vector is x = (x , ..., x ), at this step, what 1 N is it at the next step  Recall that row i of the transition probability matrix P tells us where we go next from state i.    So from x, our next state is distributed as xP. 139Introduction to Information Retrieval Steady state in vector notation 140Introduction to Information Retrieval Steady state in vector notation  The steady state in vector notation is simply a vector  p = (p , p , …, p ) of probabilities. 1 2 N 141Introduction to Information Retrieval Steady state in vector notation  The steady state in vector notation is simply a vector  p = (p , p , …, p ) of probabilities. 1 2 N   (We use p to distinguish it from the notation for the  probability vector x.) 142Introduction to Information Retrieval Steady state in vector notation  The steady state in vector notation is simply a vector  p = (p , p , …, p ) of probabilities. 1 2 N   (We use p to distinguish it from the notation for the  probability vector x.)  p is the longterm visit rate (or PageRank) of page i. 143Introduction to Information Retrieval Steady state in vector notation  The steady state in vector notation is simply a vector  p = (p , p , …, p ) of probabilities. 1 2 N   (We use p to distinguish it from the notation for the  probability vector x.)  p is the longterm visit rate (or PageRank) of page i.  So we can think of PageRank as a very long vector – one entry per page. 
Steady state in vector notation
- The steady state is simply a vector π = (π1, π2, …, πN) of probabilities.
- (We use π to distinguish it from the notation for the probability vector x.)
- π_i is the long-term visit rate (or PageRank) of page i.
- So we can think of PageRank as a very long vector – one entry per page.

Steady-state distribution: Example
- What is the PageRank / steady state in this example?

  P11 = 0.25, P12 = 0.75
  P21 = 0.25, P22 = 0.75

        P_t(d1)   P_t(d2)
  t0    0.25      0.75
  t1    0.25      0.75    (convergence)

- Update equations:
  P_t(d1) = P_{t−1}(d1) · P11 + P_{t−1}(d2) · P21
  P_t(d2) = P_{t−1}(d1) · P12 + P_{t−1}(d2) · P22
- PageRank vector = π = (π1, π2) = (0.25, 0.75)
How do we compute the steady-state vector?
- In other words: how do we compute PageRank?
- Recall: π = (π1, π2, …, πN) is the PageRank vector, the vector of steady-state probabilities …
- … and if the distribution at this step is x, then the distribution at the next step is xP.
- But π is the steady state! So: π = πP.
- Solving this matrix equation gives us π.
- π is the principal left eigenvector of P …
- … that is, π is the left eigenvector with the largest eigenvalue.
- All transition probability matrices have largest eigenvalue 1.
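Since π is the principal left eigenvector of P, it can be computed directly with an eigendecomposition: a left eigenvector of P is a right eigenvector of Pᵀ. A sketch on the two-state example above, assuming numpy:

```python
import numpy as np

P = np.array([[0.25, 0.75],
              [0.25, 0.75]])            # two-state example from above
vals, vecs = np.linalg.eig(P.T)         # right eigenvectors of P^T = left of P
v = vecs[:, np.argmax(vals.real)].real  # eigenvector for the largest eigenvalue (1)
pi = v / v.sum()                        # normalize to a probability distribution
print(pi)                               # [0.25 0.75]
```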
One way of computing the PageRank π
- Start with any distribution x, e.g., the uniform distribution.
- After one step, we're at xP.
- After two steps, we're at xP².
- After k steps, we're at xP^k.
- Algorithm: multiply x by increasing powers of P until convergence.
- This is called the power method.
- Recall: regardless of where we start, we eventually reach the steady state π.
- Thus: we will eventually (in asymptotia) reach the steady state.
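A minimal sketch of the power method; the convergence test (L1 distance between successive vectors) and the tolerance are illustrative choices.

```python
import numpy as np

def power_method(P, tol=1e-10, max_iter=1000):
    """Multiply a start distribution x by increasing powers of P until convergence."""
    N = P.shape[0]
    x = np.full(N, 1.0 / N)          # any start distribution works, e.g. uniform
    for _ in range(max_iter):
        x_next = x @ P               # one step of the random walk
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x

P = np.array([[0.1, 0.9],
              [0.3, 0.7]])           # the example worked through next
print(power_method(P).round(2))      # [0.25 0.75]
```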
Power method: Example
- What is the PageRank / steady state in this example?

  P11 = 0.1, P12 = 0.9
  P21 = 0.3, P22 = 0.7

        P_t(d1)   P_t(d2)
  t0    0         1         = x
  t1    0.3       0.7       = xP
  t2    0.24      0.76      = xP²
  t3    0.252     0.748     = xP³
  t4    0.2496    0.7504    = xP⁴
  …
  t∞    0.25      0.75      = xP^∞

- P_t(d1) = P_{t−1}(d1) · P11 + P_{t−1}(d2) · P21
- P_t(d2) = P_{t−1}(d1) · P12 + P_{t−1}(d2) · P22
- PageRank vector = π = (π1, π2) = (0.25, 0.75)
- The steady-state distribution (= the PageRanks) in this example are 0.25 for d1 and 0.75 for d2.
Exercise: Compute PageRank using the power method

Solution

  P11 = 0.7, P12 = 0.3
  P21 = 0.2, P22 = 0.8

        P_t(d1)   P_t(d2)
  t0    0         1
  t1    0.2       0.8
  t2    0.3       0.7
  t3    0.35      0.65
  t4    0.375     0.625
  …
  t∞    0.4       0.6

- PageRank vector = π = (π1, π2) = (0.4, 0.6)

PageRank summary
- Preprocessing
  - Given the graph of links, build the matrix P.
  - Apply teleportation.
  - From the modified matrix, compute π.
  - π_i is the PageRank of page i.
- Query processing
  - Retrieve the pages satisfying the query.
  - Rank them by their PageRank.
  - Return the reranked list to the user.

PageRank issues
- Real surfers are not random surfers.
  - Examples of nonrandom surfing: back button, short vs. long paths, bookmarks, directories – and search.
  - → The Markov model is not a good model of surfing.
  - But it's good enough as a model for our purposes.
- Simple PageRank ranking (as described on the previous slide) produces bad results for many pages.
  - Consider the query [video service].
  - The Yahoo home page (i) has a very high PageRank and (ii) contains both "video" and "service".
  - If we rank all Boolean hits according to PageRank, then the Yahoo home page would be top-ranked.
  - Clearly not desirable.
- In practice: rank according to a weighted combination of raw text match, anchor text match, PageRank, and other factors.
  - → See the lecture on Learning to Rank.
Example web graph

[Figure: the seven-page example web graph d0-d6 used in the following slides.]

Transition (probability) matrix

        d0    d1    d2    d3    d4    d5    d6
  d0   0.00  0.00  1.00  0.00  0.00  0.00  0.00
  d1   0.00  0.50  0.50  0.00  0.00  0.00  0.00
  d2   0.33  0.00  0.33  0.33  0.00  0.00  0.00
  d3   0.00  0.00  0.00  0.50  0.50  0.00  0.00
  d4   0.00  0.00  0.00  0.00  0.00  0.00  1.00
  d5   0.00  0.00  0.00  0.00  0.00  0.50  0.50
  d6   0.00  0.00  0.00  0.33  0.33  0.00  0.33

Transition matrix with teleporting

        d0    d1    d2    d3    d4    d5    d6
  d0   0.02  0.02  0.88  0.02  0.02  0.02  0.02
  d1   0.02  0.45  0.45  0.02  0.02  0.02  0.02
  d2   0.31  0.02  0.31  0.31  0.02  0.02  0.02
  d3   0.02  0.02  0.02  0.45  0.45  0.02  0.02
  d4   0.02  0.02  0.02  0.02  0.02  0.02  0.88
  d5   0.02  0.02  0.02  0.02  0.02  0.45  0.45
  d6   0.02  0.02  0.02  0.31  0.31  0.02  0.31

Power method vectors xP^k

        x     xP    xP^2  xP^3  xP^4  xP^5  xP^6  xP^7  xP^8  xP^9  xP^10 xP^11 xP^12 xP^13
  d0   0.14  0.06  0.09  0.07  0.07  0.06  0.06  0.06  0.06  0.05  0.05  0.05  0.05  0.05
  d1   0.14  0.08  0.06  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04
  d2   0.14  0.25  0.18  0.17  0.15  0.14  0.13  0.12  0.12  0.12  0.12  0.11  0.11  0.11
  d3   0.14  0.16  0.23  0.24  0.24  0.24  0.24  0.25  0.25  0.25  0.25  0.25  0.25  0.25
  d4   0.14  0.12  0.16  0.19  0.19  0.20  0.21  0.21  0.21  0.21  0.21  0.21  0.21  0.21
  d5   0.14  0.08  0.06  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04  0.04
  d6   0.14  0.25  0.23  0.25  0.27  0.28  0.29  0.29  0.30  0.30  0.30  0.30  0.31  0.31

Example web graph: PageRank

  d0   0.05
  d1   0.04
  d2   0.11
  d3   0.25
  d4   0.21
  d5   0.04
  d6   0.31
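A sketch of the full computation in Python/NumPy, assuming a teleport probability of 0.14 (inferred from the 0.02 = 0.14/7 entries in the teleporting matrix above):

```python
import numpy as np

# Out-links of the example web graph: page i links to out_links[i].
out_links = [[2], [1, 2], [0, 2, 3], [3, 4], [6], [5, 6], [3, 4, 6]]
N = len(out_links)
alpha = 0.14  # teleport probability (assumption: 0.14/7 = 0.02)

# Row-stochastic transition matrix, then mix in teleporting.
P = np.zeros((N, N))
for i, targets in enumerate(out_links):
    P[i, targets] = 1.0 / len(targets)
P = (1 - alpha) * P + alpha / N

# Power method: x <- xP, starting from the uniform distribution.
x = np.ones(N) / N
for _ in range(100):
    x = x @ P

print(np.round(x, 2))  # -> [0.05 0.04 0.11 0.25 0.21 0.04 0.31]
```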
How important is PageRank?

- Frequent claim: PageRank is the most important component of web ranking.
- The reality:
  - There are several components that are at least as important: e.g., anchor text, phrases, proximity, tiered indexes ...
  - Rumor has it that PageRank in its original form (as presented here) now has a negligible impact on ranking.
  - However, variants of a page's PageRank are still an essential part of ranking.
  - Addressing link spam is difficult and crucial.

Outline

❶ Recap
❷ Anchor Text
❸ Citation Analysis
❹ PageRank
❺ HITS: Hubs & Authorities

HITS: Hyperlink-Induced Topic Search

- Premise: there are two different types of relevance on the web.
- Relevance type 1: Hubs. A hub page is a good list of links to pages answering the information need.
  - E.g., for the query "chicago bulls": Bob's list of recommended resources on the Chicago Bulls sports team
- Relevance type 2: Authorities. An authority page is a direct answer to the information need.
  - E.g., the home page of the Chicago Bulls sports team
- By definition: Links to authority pages occur repeatedly on hub pages.
- Most approaches to search (including PageRank ranking) don't make the distinction between these two very different types of relevance.

Hubs and authorities: Definition

- A good hub page for a topic links to many authority pages for that topic.
- A good authority page for a topic is linked to by many hub pages for that topic.
- Circular definition: we will turn this into an iterative computation.

Example for hubs and authorities

[Figure: example hub pages linking to example authority pages.]

How to compute hub and authority scores

- Do a regular web search first
- Call the search result the root set
- Find all pages that are linked to or link to pages in the root set
- Call this larger set the base set
- Finally, compute hubs and authorities for the base set (which we'll view as a small web graph)

Root set and base set (1)

[Figure: the root set, the nodes that root set nodes link to, the nodes that link to root set nodes, and the resulting base set.]

Root set and base set (2)

- The root set typically has 200-1000 nodes.
- The base set may have up to 5000 nodes.
- Computation of the base set, as shown on the previous slide:
  - Follow out-links by parsing the pages in the root set
  - Find d's in-links by searching for all pages containing a link to d

Hub and authority scores

- Compute for each page d in the base set a hub score h(d) and an authority score a(d)
- Initialization: for all d: h(d) = 1, a(d) = 1
- Iteratively update all h(d), a(d)
- After convergence:
  - Output pages with the highest h scores as top hubs
  - Output pages with the highest a scores as top authorities
  - So we output two ranked lists

Iterative update

- For all d: h(d) = Σ_{d→y} a(y)   (sum of the authority scores of the pages d links to)
- For all d: a(d) = Σ_{y→d} h(y)   (sum of the hub scores of the pages linking to d)
- Iterate these two steps until convergence
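A minimal sketch of this update loop in Python. The three-page toy graph and the L1 normalization are assumptions of this sketch; the slides only require some scaling, discussed on the next slide:

```python
# Hypothetical toy graph: out_links[d] lists the pages d links to.
out_links = {"d0": ["d1", "d2"], "d1": ["d2"], "d2": ["d0"]}
pages = list(out_links)

h = {d: 1.0 for d in pages}  # initialization: h(d) = 1
a = {d: 1.0 for d in pages}  # initialization: a(d) = 1

for _ in range(50):
    # h(d) = sum of a(y) over all pages y that d links to
    h = {d: sum(a[y] for y in out_links[d]) for d in pages}
    # a(d) = sum of h(y) over all pages y that link to d
    a = {d: sum(h[y] for y in pages if d in out_links[y]) for d in pages}
    # scale down so scores don't grow without bound;
    # only relative values matter
    h_sum, a_sum = sum(h.values()), sum(a.values())
    h = {d: v / h_sum for d, v in h.items()}
    a = {d: v / a_sum for d, v in a.items()}

print(h, a, sep="\n")
```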
Details

- Scaling
  - To prevent the a(·) and h(·) values from getting too big, we can scale down after each iteration.
  - The scaling factor doesn't really matter: we only care about the relative (as opposed to absolute) values of the scores.
- In most cases, the algorithm converges after a few iterations.

Authorities for query "Chicago Bulls"

  0.85  www.nba.com/bulls
  0.25  www.essex1.com/people/jmiller/bulls.htm             "da Bulls"
  0.20  www.nando.net/SportServer/basketball/nba/chi.html   "The Chicago Bulls"
  0.15  users.aol.com/rynocub/bulls.htm                     "The Chicago Bulls Home Page"
  0.13  www.geocities.com/Colosseum/6095                    "Chicago Bulls"

  (Ben-Shaul et al., WWW8)

The authority page for "Chicago Bulls"

[Figure: the top-ranked authority page.]

Hubs for query "Chicago Bulls"

  1.62  www.geocities.com/Colosseum/1778                    "Unbelieveabulls"
  1.24  www.webring.org/cgibin/webringring=chbulls          "Chicago Bulls"
  0.74  www.geocities.com/Hollywood/Lot/3330/Bulls.html     "Chicago Bulls"
  0.52  www.nobull.net/webposition/kwsearch15M2.html        "Excite Search Results: bulls"
  0.52  www.halcyon.com/wordltd/bball/bulls.html            "Chicago Bulls Links"

  (Ben-Shaul et al., WWW8)

A hub page for "Chicago Bulls"

[Figure: a hub page collecting links to Chicago Bulls resources.]

Hubs & Authorities: Comments

- HITS can pull together good pages regardless of page content.
- Once the base set is assembled, we only do link analysis, no text matching.
- Pages in the base set often do not contain any of the query words.
- In theory, an English query can retrieve Japanese-language pages (if supported by the link structures between English and Japanese pages).
- Danger: topic drift. The pages found by following links may not be related to the original query.
Proof of convergence

- We define an N×N adjacency matrix A. (We called this the link matrix earlier.)
- For 1 ≤ i, j ≤ N, the matrix entry A_ij tells us whether there is a link from page i to page j (A_ij = 1) or not (A_ij = 0).
- Example:

        d1  d2  d3
   d1    0   1   0
   d2    1   1   1
   d3    1   0   0

Write update rules as matrix operations

- Define the hub vector h⃗ = (h1, ..., hN) as the vector of hub scores. h_i is the hub score of page d_i.
- Similarly for a⃗, the vector of authority scores.
- Now we can write h(d) = Σ_{d→y} a(y) as a matrix operation: h⃗ = A a⃗ ...
- ... and we can write a(d) = Σ_{y→d} h(y) as a⃗ = A^T h⃗
- HITS algorithm in matrix notation:
  - Compute h⃗ = A a⃗
  - Compute a⃗ = A^T h⃗
  - Iterate until convergence
HITS as an eigenvector problem

- HITS algorithm in matrix notation. Iterate:
  - Compute h⃗ = A a⃗
  - Compute a⃗ = A^T h⃗
- By substitution we get: h⃗ = A A^T h⃗ and a⃗ = A^T A a⃗ (up to a scaling factor)
- Thus, h⃗ is an eigenvector of A A^T and a⃗ is an eigenvector of A^T A.
- So the HITS algorithm is actually a special case of the power method, and hub and authority scores are eigenvector values.
- HITS and PageRank both formalize link analysis as eigenvector problems.
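A small sketch of this equivalence in Python/NumPy, reusing the 3×3 example matrix from two slides back; normalizing the eigenvectors to sum to 1 is a choice of this sketch so the results can be compared with the iterative computation:

```python
import numpy as np

A = np.array([[0, 1, 0],
              [1, 1, 1],
              [1, 0, 0]], dtype=float)

def principal_eigenvector(M):
    # Eigenvector for the largest eigenvalue, scaled to sum to 1.
    vals, vecs = np.linalg.eig(M)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

h = principal_eigenvector(A @ A.T)   # hub scores
a = principal_eigenvector(A.T @ A)   # authority scores
print(np.round(h, 3), np.round(a, 3), sep="\n")
```

Running the iterative update from earlier on the same A converges to these same vectors (up to scaling), which is exactly the power method statement above.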
Example web graph

[Figure: the seven-page example web graph, as in the PageRank section.]

Raw matrix A for HITS
(entries of 2 count pages that link twice to the same target)

        d0  d1  d2  d3  d4  d5  d6
   d0    0   0   1   0   0   0   0
   d1    0   1   1   0   0   0   0
   d2    1   0   1   2   0   0   0
   d3    0   0   0   1   1   0   0
   d4    0   0   0   0   0   0   1
   d5    0   0   0   0   0   1   1
   d6    0   0   0   2   1   0   1

Hub vectors h⃗0, h⃗_i = (1/d_i) A a⃗_i, i ≥ 1

        h⃗0    h⃗1    h⃗2    h⃗3    h⃗4    h⃗5
   d0  0.14  0.06  0.04  0.04  0.03  0.03
   d1  0.14  0.08  0.05  0.04  0.04  0.04
   d2  0.14  0.28  0.32  0.33  0.33  0.33
   d3  0.14  0.14  0.17  0.18  0.18  0.18
   d4  0.14  0.06  0.04  0.04  0.04  0.04
   d5  0.14  0.08  0.05  0.04  0.04  0.04
   d6  0.14  0.30  0.33  0.34  0.35  0.35

Authority vectors a⃗_i = (1/c_i) A^T h⃗_{i-1}, i ≥ 1

        a⃗1    a⃗2    a⃗3    a⃗4    a⃗5    a⃗6    a⃗7
   d0  0.06  0.09  0.10  0.10  0.10  0.10  0.10
   d1  0.06  0.03  0.01  0.01  0.01  0.01  0.01
   d2  0.19  0.14  0.13  0.12  0.12  0.12  0.12
   d3  0.31  0.43  0.46  0.46  0.46  0.47  0.47
   d4  0.13  0.14  0.16  0.16  0.16  0.16  0.16
   d5  0.06  0.03  0.02  0.01  0.01  0.01  0.01
   d6  0.19  0.14  0.13  0.13  0.13  0.13  0.13

Example web graph: hub and authority scores

         a     h
   d0  0.10  0.03
   d1  0.01  0.04
   d2  0.12  0.33
   d3  0.47  0.18
   d4  0.16  0.04
   d5  0.01  0.04
   d6  0.13  0.35

Top-ranked pages

- Pages with highest in-degree: d2, d3, d6
- Pages with highest out-degree: d2, d6
- Page with highest PageRank: d6
- Page with highest hub score: d6 (close: d2)
- Page with highest authority score: d3
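A short sketch in Python/NumPy that re-runs the HITS iteration on the raw matrix A above; with L1 normalization (an assumption of this sketch) it reproduces the converged scores up to rounding:

```python
import numpy as np

# Raw matrix A for HITS from the example web graph.
A = np.array([[0, 0, 1, 0, 0, 0, 0],   # d0
              [0, 1, 1, 0, 0, 0, 0],   # d1
              [1, 0, 1, 2, 0, 0, 0],   # d2
              [0, 0, 0, 1, 1, 0, 0],   # d3
              [0, 0, 0, 0, 0, 0, 1],   # d4
              [0, 0, 0, 0, 0, 1, 1],   # d5
              [0, 0, 0, 2, 1, 0, 1]],  # d6
             dtype=float)

h = np.ones(7) / 7   # h0 = (0.14, ..., 0.14), as in the table above
a = np.ones(7) / 7
for _ in range(100):
    a = A.T @ h; a /= a.sum()   # a_i = (1/c_i) A^T h_{i-1}
    h = A @ a;   h /= h.sum()   # h_i = (1/d_i) A a_i

print(np.round(a, 2))  # -> [0.10 0.01 0.12 0.47 0.16 0.01 0.13]
print(np.round(h, 2))  # -> [0.03 0.04 0.33 0.18 0.04 0.04 0.35]
```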
PageRank vs. HITS: Discussion

- PageRank can be precomputed; HITS has to be computed at query time.
  - HITS is too expensive in most application scenarios.
- PageRank and HITS make two different design choices concerning (i) the eigenproblem formalization and (ii) the set of pages to apply the formalization to.
- These two are orthogonal.
  - We could also apply HITS to the entire web and PageRank to a small base set.
- Claim: On the web, a good hub almost always is also a good authority.
- The actual difference between PageRank ranking and HITS ranking is therefore not as large as one might expect.

Exercise

- Why is a good hub almost always also a good authority?

Takeaway today

- Anchor text: What exactly are links on the web and why are they important for IR?
- Citation analysis: the mathematical foundation of PageRank and link-based ranking
- PageRank: the original algorithm that was used for link-based ranking on the web
- Hubs & Authorities: an alternative link-based ranking algorithm

Resources

- Chapter 21 of IIR
- Resources at http://ifnlp.org/ir
  - American Mathematical Society article on PageRank (popular science style)
  - Jon Kleinberg's home page (main person behind HITS)
  - A Google bomb and its defusing
  - Google's official description of PageRank: "PageRank reflects our view of the importance of web pages by considering more than 500 million variables and 2 billion terms. Pages that we believe are important pages receive a higher PageRank and are more likely to appear at the top of the search results."