
Scoring, Term Weighting, The Vector Space Model

Introduction to Information Retrieval: Scoring, Term Weighting, The Vector Space Model

Overview
❶ Recap
❷ Why ranked retrieval?
❸ Term frequency
❹ tf-idf weighting
❺ The vector space model

Recap: Heaps' law
• Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1.
• For these data, the dashed line log₁₀ M = 0.49 · log₁₀ T + 1.64 is the best least-squares fit.
• Thus M = 10^1.64 · T^0.49, i.e. k = 10^1.64 ≈ 44 and b = 0.49.

Recap: Zipf's law
• The most frequent term (the) occurs cf₁ times, the second most frequent term (of) occurs cf₁/2 times, the third most frequent term (and) occurs cf₁/3 times, etc.

Recap: Dictionary as a string
(figure-only slide)

Recap: Gap encoding
(figure-only slide)

Variable byte (VB) code
• Dedicate 1 bit (the high bit) of each byte to be a continuation bit c.
• If the gap G fits within 7 bits, binary-encode it in the 7 available bits and set c = 1.
• Otherwise: set c = 0, encode the high-order 7 bits, and then use one or more additional bytes to encode the lower-order bits using the same algorithm.

Gamma codes for gap encoding
• Represent a gap G as a pair of length and offset.
• Offset is G in binary, with the leading bit chopped off.
• Length is the length of offset.
• Length is encoded in unary code.
• The gamma code is the concatenation of length and offset.
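As a quick numeric check of the Heaps' law fit recapped above, here is a small Python sketch of ours (only the constants k ≈ 44 and b = 0.49 come from the slides):

```python
# Heaps' law fit for Reuters-RCV1: log10 M = 0.49 * log10 T + 1.64,
# i.e. M = k * T**b with k = 10**1.64 and b = 0.49.
k = 10 ** 1.64
b = 0.49

def heaps_vocab(T):
    """Predicted vocabulary size M for a collection of T tokens."""
    return k * T ** b

print(round(k))  # 44, matching the slide's k = 10^1.64 ≈ 44
```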
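The VB scheme described above can be sketched in a few lines (Python is our choice here; the slides only give the algorithm in prose):

```python
def vb_encode(gap):
    """Variable byte (VB) encode one gap (a positive integer)."""
    assert gap > 0
    out = []
    while True:
        out.insert(0, gap % 128)   # prepend the low-order 7 bits
        if gap < 128:
            break
        gap //= 128
    out[-1] += 128                 # set continuation bit c = 1 on the final byte
    return bytes(out)

def vb_decode(data):
    """Decode a byte stream of concatenated VB-encoded gaps."""
    gaps, n = [], 0
    for byte in data:
        if byte < 128:             # c = 0: this gap continues in the next byte
            n = 128 * n + byte
        else:                      # c = 1: final byte of the current gap
            gaps.append(128 * n + (byte - 128))
            n = 0
    return gaps
```

For example, the gap 5 encodes as the single byte 10000101, a larger gap such as 214577 spans three bytes, and decoding the concatenated bytes recovers the original gap sequence.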
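The gamma code above admits a similarly short sketch (again Python, our own illustration using bit strings for readability rather than a packed bit representation):

```python
def gamma_encode(gap):
    """Gamma-encode a gap G > 0 as a bit string: unary(length) + offset."""
    assert gap > 0
    offset = bin(gap)[3:]              # G in binary with the leading 1-bit chopped off
    length = '1' * len(offset) + '0'   # unary code for len(offset)
    return length + offset

def gamma_decode(bits):
    """Decode one gamma code from the front of a bit string -> (gap, remainder)."""
    n = bits.index('0')                # number of leading 1s = len(offset)
    offset = bits[n + 1 : n + 1 + n]
    gap = int('1' + offset, 2)         # restore the chopped-off leading 1-bit
    return gap, bits[n + 1 + n:]
```

For example, gamma_encode(13) yields '1110101': unary 1110 for length 3, followed by offset 101.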
Compression of Reuters-RCV1

data structure                             size in MB
dictionary, fixed-width                          11.2
dictionary, term pointers into string             7.6
∼, with blocking, k = 4                           7.1
∼, with blocking & front coding                   5.9
collection (text, xml markup etc)             3,600.0
collection (text)                               960.0
T/D incidence matrix                         40,000.0
postings, uncompressed (32-bit words)           400.0
postings, uncompressed (20 bits)                250.0
postings, variable byte encoded                 116.0
postings, γ encoded                             101.0

Take-away today
• Ranking search results: why it is important (as opposed to just presenting a set of unordered Boolean results)
• Term frequency: a key ingredient for ranking
• tf-idf ranking: the best-known traditional ranking scheme
• Vector space model: one of the most important formal models for information retrieval (along with the Boolean and probabilistic models)

Ranked retrieval
• Thus far, our queries have all been Boolean: documents either match or don't.
• Good for expert users with a precise understanding of their needs and of the collection.
• Also good for applications: applications can easily consume 1000s of results.
• Not good for the majority of users: most users are not capable of writing Boolean queries . . . or they are, but they think it's too much work.
• Most users don't want to wade through 1000s of results. This is particularly true of web search.

Problem with Boolean search: Feast or famine
• Boolean queries often result in either too few (= 0) or too many (1000s) results.
• Query 1 (Boolean conjunction): standard user dlink 650 → 200,000 hits – feast
• Query 2 (Boolean conjunction): standard user dlink 650 no card found → 0 hits – famine
• In Boolean retrieval, it takes a lot of skill to come up with a query that produces a manageable number of hits.

Feast or famine: No problem in ranked retrieval
• With ranking, large result sets are not an issue.
• Just show the top 10 results; this doesn't overwhelm the user.
• Premise: the ranking algorithm works, i.e., more relevant results are ranked higher than less relevant results.

Scoring as the basis of ranked retrieval
• We wish to rank documents that are more relevant higher than documents that are less relevant.
• How can we accomplish such a ranking of the documents in the collection with respect to a query?
• Assign a score to each query-document pair, say in [0, 1].
• This score measures how well the document and query "match".

Query-document matching scores
• How do we compute the score of a query-document pair?
• Let's start with a one-term query.
• If the query term does not occur in the document, the score should be 0.
• The more frequent the query term is in the document, the higher the score.
• We will look at a number of alternatives for doing this.

Take 1: Jaccard coefficient
• A commonly used measure of the overlap of two sets.
• Let A and B be two sets.
• Jaccard coefficient: JACCARD(A, B) = |A ∩ B| / |A ∪ B|
• JACCARD(A, A) = 1
• JACCARD(A, B) = 0 if A ∩ B = ∅
• A and B don't have to be the same size.
• Always assigns a number between 0 and 1.

Jaccard coefficient: Example
• What is the query-document match score that the Jaccard coefficient computes for query "ides of March" and document "Caesar died in March"?
• JACCARD(q, d) = 1/6

What's wrong with Jaccard?
 It doesn’t consider term frequency (how many occurrences a term has).  Rare terms are more informative than frequent terms. Jaccard does not consider this information.  We need a more sophisticated way of normalizing for the length of a document.  Later in this lecture, we’ll use (cosine) . . .  . . . instead of A ∩ B/A∪ B (Jaccard) for length normalization. 20 20