Introduction to Information Retrieval
Scores in a Complete Search System

Overview
❶ Recap
❷ Why rank
❸ More on cosine
❹ Implementation of ranking
❺ The complete search system

Outline
❶ Recap
❷ Why rank
❸ More on cosine
❹ Implementation of ranking
❺ The complete search system

Term frequency weight
- The log frequency weight of term t in d is defined as follows:
  w_{t,d} = 1 + \log_{10}(tf_{t,d}) if tf_{t,d} > 0, and w_{t,d} = 0 otherwise.

idf weight
- The document frequency df_t is defined as the number of documents that t occurs in.
- We define the idf weight of term t as follows (N is the number of documents in the collection):
  idf_t = \log_{10}(N / df_t)
- idf is a measure of the informativeness of the term.

tf-idf weight
- The tf-idf weight of a term is the product of its tf weight and its idf weight:
  w_{t,d} = (1 + \log_{10}(tf_{t,d})) \cdot \log_{10}(N / df_t)

Cosine similarity between query and document
  \cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}| |\vec{d}|} = \frac{\sum_i q_i d_i}{\sqrt{\sum_i q_i^2} \sqrt{\sum_i d_i^2}}
- q_i is the tf-idf weight of term i in the query.
- d_i is the tf-idf weight of term i in the document.
- |\vec{q}| and |\vec{d}| are the lengths of \vec{q} and \vec{d}.
- \vec{q}/|\vec{q}| and \vec{d}/|\vec{d}| are length-1 vectors (= normalized).

Cosine similarity illustrated
[Figure: query and document vectors in term space; not reproduced here]

tf-idf example: lnc.ltn
Query: “best car insurance”. Document: “car insurance auto insurance”. (The df and idf values below assume a collection of N = 1,000,000 documents, consistent with idf_t = \log_{10}(N / df_t).)

word      | query: tf-raw  tf-wt  df      idf  weight | document: tf-raw  tf-wt  weight  n’lized | product
auto      |        0       0      5000    2.3  0      |           1       1      1       0.52    | 0
best      |        1       1      50000   1.3  1.3    |           0       0      0       0       | 0
car       |        1       1      10000   2.0  2.0    |           1       1      1       0.52    | 1.04
insurance |        1       1      1000    3.0  3.0    |           2       1.3    1.3     0.68    | 2.04

Key: tf-raw: raw term frequency, tf-wt: log-weighted term frequency, df: document frequency, idf: inverse document frequency, weight: the final weight of the term in the query or document, n’lized: document weights after cosine normalization (the document length is \sqrt{1^2 + 1^2 + 1.3^2} ≈ 1.92, so 1/1.92 ≈ 0.52 and 1.3/1.92 ≈ 0.68), product: the product of final query weight and final document weight.

Final similarity score between query and document: \sum_i w_{qi} \cdot w_{di} = 0 + 0 + 1.04 + 2.04 = 3.08
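To make the lnc.ltn arithmetic concrete, here is a minimal Python sketch (not part of the original slides) that recomputes the example above. The only assumption is the collection size N = 1,000,000, which reproduces the idf values in the table.

```python
import math
from collections import Counter

N = 1_000_000  # assumed collection size; gives the idf values in the table
df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}

query = ["best", "car", "insurance"]
document = ["car", "insurance", "auto", "insurance"]

def log_tf(tf):
    """Log frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

# ltn for the query: log tf, times idf, no normalization
q_wt = {t: log_tf(tf) * math.log10(N / df[t]) for t, tf in Counter(query).items()}

# lnc for the document: log tf, no idf, cosine normalization
d_raw = {t: log_tf(tf) for t, tf in Counter(document).items()}
d_len = math.sqrt(sum(w * w for w in d_raw.values()))  # ≈ 1.92
d_wt = {t: w / d_len for t, w in d_raw.items()}

score = sum(w * d_wt.get(t, 0.0) for t, w in q_wt.items())
print(f"{score:.2f}")  # 3.07 -- the slide's 3.08 comes from rounding the products first
```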
Takeaway today
- The importance of ranking: User studies at Google
- Length normalization: Pivot normalization
- Implementation of ranking
- The complete search system

Outline
❶ Recap
❷ Why rank
❸ More on cosine
❹ Implementation of ranking
❺ The complete search system

Why is ranking so important
- Last lecture: Problems with unranked retrieval
  - Users want to look at a few results – not thousands.
  - It’s very hard to write queries that produce a few results, even for expert searchers.
- → Ranking is important because it effectively reduces a large set of results to a very small one.
- Next: More data on “users only look at a few results”
- Actually, in the vast majority of cases they only examine 1, 2, or 3 results.

Empirical investigation of the effect of ranking
- How can we measure how important ranking is?
- Observe what searchers do when they are searching in a controlled setting
  - Videotape them
  - Ask them to “think aloud”
  - Interview them
  - Eye-track them
  - Time them
  - Record and count their clicks
- The following slides are from Dan Russell’s JCDL talk.
- Dan Russell is the “Über Tech Lead for Search Quality & User Happiness” at Google.

[Slides 13–19: excerpts from Dan Russell’s talk (eye-tracking and click data); images not reproduced here]

Importance of ranking: Summary
- Viewing abstracts: Users are a lot more likely to read the abstracts of the top-ranked pages (1, 2, 3, 4) than the abstracts of the lower-ranked pages (7, 8, 9, 10).
- Clicking: The distribution is even more skewed for clicking.
- In 1 out of 2 cases, users click on the top-ranked page.
- Even if the top-ranked page is not relevant, 30% of users will click on it.
- → Getting the ranking right is very important.
- → Getting the top-ranked page right is most important.

Outline
❶ Recap
❷ Why rank
❸ More on cosine
❹ Implementation of ranking
❺ The complete search system

Why distance is a bad idea
The Euclidean distance of \vec{q} and \vec{d_2} is large although the distribution of terms in the query q and the distribution of terms in the document d_2 are very similar. That’s why we do length normalization or, equivalently, use cosine to compute query–document matching scores.
[Figure: query and documents in term space; not reproduced here]

Exercise: A problem for cosine normalization
- Query q: “anti-doping rules Beijing 2008 olympics”
- Compare three documents:
  - d_1: a short document on anti-doping rules at the 2008 Olympics
  - d_2: a long document that consists of a copy of d_1 and 5 other news stories, all on topics different from Olympics/anti-doping
  - d_3: a short document on anti-doping rules at the 2004 Athens Olympics
- What ranking do we expect in the vector space model?
- What can we do about this?

Pivot normalization
- Cosine normalization produces weights that are too large for short documents and too small for long documents (on average).
- Adjust cosine normalization by a linear adjustment: “turning” the average normalization on the pivot.
- Effect: Similarities of short documents with the query decrease; similarities of long documents with the query increase.
- This removes the unfair advantage that short documents have.
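The slides leave the adjustment implicit; as a sketch of the standard formulation (pivoted document length normalization, following Singhal, Buckley & Mitra 1996, with slope s and pivot p as tuning parameters), the cosine normalizer |d| is replaced by a linear function of |d|:

```latex
% Pivoted length normalization (sketch): s is the slope, p the pivot,
% both tuned on held-out data; s = 1 recovers plain cosine normalization.
w'_{t,d} = \frac{w_{t,d}}{(1 - s)\,p + s\,\lvert \vec{d} \rvert}
```

Choosing s < 1 shrinks the normalizer for long documents and inflates it for short ones, which is exactly the “turning on the pivot” effect described above.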
Predicted and true probability of relevance
[Figure: predicted vs. true probability of relevance as a function of document length; source: Lillian Lee]

Pivot normalization
[Figure: cosine normalization vs. pivoted normalization; source: Lillian Lee]

Pivoted normalization: Amit Singhal’s experiments
[Table: relevant documents retrieved and (change in) average precision; not reproduced here]

Outline
❶ Recap
❷ Why rank
❸ More on cosine
❹ Implementation of ranking
❺ The complete search system

Now we also need term frequencies in the index
[Figure: postings lists annotated with term frequencies. We also need positions; not shown here.]

Term frequencies in the inverted index
- In each posting, store tf_{t,d} in addition to the docID d.
- As an integer frequency, not as a (log-)weighted real number . . .
- . . . because real numbers are difficult to compress.
- Unary code is effective for encoding term frequencies. Why?
- Overall, the additional space requirements are small: less than a byte per posting with bitwise compression.
- Or a byte per posting with variable byte code.

Exercise: How do we compute the top k in ranking?
- In many applications, we don’t need a complete ranking.
- We just need the top k for a small k (e.g., k = 100).
- If we don’t need a complete ranking, is there an efficient way of computing just the top k?
- Naive:
  - Compute scores for all N documents
  - Sort
  - Return the top k
- What’s bad about this?
- Alternative?

Use min heap for selecting top k out of N
- Use a binary min heap.
- A binary min heap is a binary tree in which each node’s value is less than the values of its children.
- Takes O(N log k) operations to construct (where N is the number of documents) . . .
- . . . then read off the k winners in O(k log k) steps.

Binary min heap
[Figure: a binary min heap; not reproduced here]

Selecting top k scoring documents in O(N log k)
- Goal: Keep the top k documents seen so far
- Use a binary min heap
- To process a new document d′ with score s′:
  - Get the current minimum h_m of the heap (O(1))
  - If s′ ≤ h_m, skip to the next document
  - If s′ > h_m, heap-delete-root (O(log k))
  - Heap-add d′/s′ (O(log k))
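A compact Python version of this loop (a sketch, not the lecture’s code), using the standard-library heapq module; the docIDs and scores at the end are invented for illustration:

```python
import heapq

def top_k(scored_docs, k):
    """Select the k highest-scoring documents from a stream of scores.

    Keeps a min heap of size at most k: the root is the smallest score
    among the current candidates, so each new document is compared
    against it in O(1) and inserted in O(log k) only if it qualifies.
    """
    heap = []  # min heap of (score, docID)
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:                    # beats the current minimum?
            heapq.heapreplace(heap, (score, doc_id))  # pop root, push new item
    # Read off the winners, best first: O(k log k)
    return sorted(heap, reverse=True)

# Hypothetical scores for illustration:
scores = [("d1", 0.9), ("d2", 0.3), ("d3", 1.2), ("d4", 0.7), ("d5", 1.0)]
print(top_k(scores, k=3))  # [(1.2, 'd3'), (1.0, 'd5'), (0.9, 'd1')]
```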
Priority queue example
[Figure: worked example of maintaining the heap; not reproduced here]

Even more efficient computation of top k?
- Ranking has time complexity O(N), where N is the number of documents.
- Optimizations reduce the constant factor, but they are still O(N) – and N > 10^10.
- Are there sublinear algorithms?
- What we’re doing in effect: solving the k-nearest-neighbor (kNN) problem for the query vector (= query point).
- There are no general solutions to this problem that are sublinear.
- We will revisit this issue when we do kNN classification in IIR 14.

More efficient computation of top k: Heuristics
- Idea 1: Reorder postings lists
  - Instead of ordering according to docID . . .
  - . . . order according to some measure of “expected relevance”.
- Idea 2: Heuristics to prune the search space
  - Not guaranteed to be correct . . .
  - . . . but fails rarely.
  - In practice, close to constant time.
- For this, we’ll need the concepts of document-at-a-time processing and term-at-a-time processing.

Non-docID ordering of postings lists
- So far: postings lists have been ordered according to docID.
- Alternative: a query-independent measure of “goodness” of a page
- Example: PageRank g(d) of page d, a measure of how many “good” pages hyperlink to d (chapter 21)
- Order documents in postings lists according to PageRank: g(d_1) ≥ g(d_2) ≥ g(d_3) ≥ . . .
- Define the composite score of a document: net-score(q, d) = g(d) + cos(q, d)
- This scheme supports early termination: we do not have to process postings lists in their entirety to find the top k.

Non-docID ordering of postings lists (2)
- Order documents in postings lists according to PageRank: g(d_1) ≥ g(d_2) ≥ g(d_3) ≥ . . .
- Define the composite score of a document: net-score(q, d) = g(d) + cos(q, d)
- Suppose: (i) g → [0, 1]; (ii) g(d) < 0.1 for the document d we’re currently processing; (iii) the smallest top-k score we’ve found so far is 1.2.
- Then all subsequent scores will be < 1.1 (since cos(q, d) ≤ 1 and g only decreases).
- So we’ve already found the top k and can stop processing the remainder of the postings lists. (See the early-termination sketch at the end of this section.)
- Questions?

Document-at-a-time processing
- Both docID ordering and PageRank ordering impose a consistent ordering on the documents in the postings lists.
- Computing cosines in this scheme is document-at-a-time:
- We complete the computation of the query–document similarity score of document d_i before starting to compute the query–document similarity score of d_{i+1}.
- Alternative: term-at-a-time processing

Weight-sorted postings lists
- Idea: don’t process postings that contribute little to the final score
- Order the documents in each postings list according to weight
- Simplest case: normalized tf-idf weight (rarely done: hard to compress)
- Documents in the top k are likely to occur early in these ordered lists.
- → Early termination while processing postings lists is unlikely to change the top k.
- But:
  - We no longer have a consistent ordering of documents in the postings lists.
  - We can no longer employ document-at-a-time processing.

Term-at-a-time processing
- Simplest case: completely process the postings list of the first query term
- Create an accumulator for each docID you encounter
- Then completely process the postings list of the second query term
- . . . and so forth

Term-at-a-time processing
[Figure: pseudocode for term-at-a-time cosine scoring; not reproduced here – see the sketch at the end of this section]

Computing cosine scores
- For the web (20 billion documents), an array of accumulators A in memory is infeasible.
- Thus: only create accumulators for docs occurring in postings lists.
- This is equivalent to: do not create accumulators for docs with zero scores (i.e., docs that do not contain any of the query terms).

Accumulators: Example
- For the query Brutus Caesar:
- Only need accumulators for docs 1, 5, 7, 13, 17, 83, 87
- Don’t need accumulators for docs 8, 40, 85

Removing bottlenecks
- Use a heap / priority queue, as discussed earlier.
- Can further limit to docs with non-zero cosines on rare (high-idf) words
- Or enforce conjunctive search (à la Google): non-zero cosines on all words in the query
- Example: just one accumulator for Brutus Caesar in the example above . . .
- . . . because only d_1 contains both words.
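Putting the last few slides together, here is a minimal term-at-a-time sketch in Python (not the lecture’s pseudocode): accumulators are created lazily, and the top k are read off with a heap as before. The postings weights and query weights are invented; the docIDs follow the accumulator example above.

```python
from collections import defaultdict
import heapq

# Toy index: term -> postings list of (docID, document weight).
# DocIDs mirror the Brutus/Caesar example; the weights are made up.
postings = {
    "brutus": [(1, 1.2), (7, 0.4), (13, 0.8), (17, 0.5), (83, 0.9), (87, 0.3)],
    "caesar": [(1, 0.7), (5, 0.6)],
}
query_weights = {"brutus": 1.0, "caesar": 2.0}  # hypothetical ltn weights

def term_at_a_time(query_weights, postings, k):
    # Accumulators are created lazily: only docIDs that occur in some
    # postings list get one (docs 8, 40, 85 never appear, so never do).
    acc = defaultdict(float)
    for term, q_w in query_weights.items():   # one full postings list at a time
        for doc_id, d_w in postings.get(term, []):
            acc[doc_id] += q_w * d_w
    # Read off the top k (internally heap-based, as in the earlier sketch).
    return heapq.nlargest(k, acc.items(), key=lambda kv: kv[1])

print(term_at_a_time(query_weights, postings, k=3))
# [(1, 2.6), (5, 1.2), (83, 0.9)]
```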
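And, looking back at the net-score slides above, a document-at-a-time sketch of early termination over PageRank-ordered postings; it assumes g and the cosine both lie in [0, 1] and treats the cosine computation as a black box:

```python
import heapq

def netscore_top_k(docs_by_pagerank, cosine, k):
    """Sketch of early termination with net-score(q, d) = g(d) + cos(q, d).

    docs_by_pagerank: (docID, g) pairs in decreasing order of g, g in [0, 1].
    cosine: function docID -> cos(q, d), assumed to lie in [0, 1].
    """
    heap = []  # min heap of (net score, docID), size <= k
    for doc_id, g in docs_by_pagerank:
        # g + 1.0 bounds the net score of this and every later document.
        if len(heap) == k and g + 1.0 <= heap[0][0]:
            break  # e.g., g < 0.1 and current k-th best 1.2: safe to stop
        score = g + cosine(doc_id)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)
```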
Outline
❶ Recap
❷ Why rank
❸ More on cosine
❹ Implementation of ranking
❺ The complete search system

Complete search system
[Figure: schematic of a complete search system; not reproduced here]

Tiered indexes
- Basic idea:
  - Create several tiers of indexes, corresponding to the importance of indexing terms.
  - During query processing, start with the highest-tier index.
  - If the highest-tier index returns at least k (e.g., k = 100) results: stop and return the results to the user.
  - If we’ve only found < k hits: repeat for the next index in the tier cascade.
- Example: two-tier system
  - Tier 1: Index of all titles
  - Tier 2: Index of the rest of the documents
- Pages containing the search words in the title are better hits than pages containing the search words in the body of the text.

Tiered index
[Figure: example of a tiered index; not reproduced here]

Tiered indexes
- The use of tiered indexes is believed to be one of the reasons that Google’s search quality was significantly higher initially (2000/01) than that of competitors
- (along with PageRank, use of anchor text, and proximity constraints).

Exercise
- Design criteria for a tiered system:
  - Each tier should be an order of magnitude smaller than the next tier.
  - The top 100 hits for most queries should be in tier 1, the top 100 hits for most of the remaining queries in tier 2, etc.
  - We need a simple test for “can I stop at this tier or do I have to go to the next one?”
  - There is no advantage to tiering if we have to hit most tiers for most queries anyway.
- Question 1: Consider a two-tier system where the first tier indexes titles and the second tier everything. What are potential problems with this type of tiering?
- Question 2: Can you think of a better way of setting up a multi-tier system? Which “zones” of a document should be indexed in the different tiers (title, body of document, others)? What criterion do you want to use for including a document in tier 1?

Complete search system
[Figure: schematic of a complete search system, repeated; not reproduced here]

Components we have introduced thus far
- Document preprocessing (linguistic and otherwise)
- Positional indexes
- Tiered indexes
- Spelling correction
- k-gram indexes for wildcard queries and spelling correction
- Query processing
- Document scoring
- Term-at-a-time processing

Components we haven’t covered yet
- Document cache: we need this for generating snippets (= dynamic summaries)
- Zone indexes: they separate the indexes for different zones: the body of the document, all highlighted text in the document, anchor text, text in metadata fields, etc.
- Machine-learned ranking functions
- Proximity ranking (e.g., rank documents in which the query terms occur in the same local window higher than documents in which the query terms occur far from each other)
- Query parser

Vector space retrieval: Interactions
- How do we combine phrase retrieval with vector space retrieval?
  - We do not want to compute document frequency / idf for every possible phrase. Why?
- How do we combine Boolean retrieval with vector space retrieval?
  - For example: “+”-constraints and “−”-constraints
  - Postfiltering is simple, but can be very inefficient – no easy answer.
- How do we combine wildcards with vector space retrieval?
  - Again, no easy answer.

Takeaway today
- The importance of ranking: User studies at Google
- Length normalization: Pivot normalization
- Implementation of ranking
- The complete search system

Resources
- Chapters 6 and 7 of IIR
- Resources at http://ifnlp.org/ir
  - How Google tweaks its ranking function
  - Interview with Google search guru Udi Manber
  - Yahoo Search BOSS: opens up the search engine to developers. For example, you can rerank search results.
  - Compare Google and Yahoo ranking for a query
  - How Google uses eye tracking for improving search