Introduction to Information Retrieval: Scoring, Term Weighting, and the Vector Space Model

Heaps' law

- Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1.
- For these data, the dashed line log10 M = 0.49 · log10 T + 1.64 is the best least-squares fit.
- Thus M = 10^1.64 · T^0.49, with k = 10^1.64 ≈ 44 and b = 0.49.

Zipf's law

- The most frequent term (the) occurs cf1 times, the second most frequent term (of) occurs cf1/2 times, the third most frequent term (and) occurs cf1/3 times, etc.

Dictionary as a string

(figure)

Gap encoding

(figure)

Variable byte (VB) code

- Dedicate 1 bit (the high bit) of each byte to be a continuation bit c.
- If the gap G fits within 7 bits, binary-encode it in the 7 available bits and set c = 1.
- Else: set c = 0, encode the high-order 7 bits, and then use one or more additional bytes to encode the lower-order bits using the same algorithm.
- A runnable sketch of this scheme appears after the compression table below.

Gamma codes for gap encoding

- Represent a gap G as a pair of length and offset.
- Offset is the gap in binary, with the leading bit chopped off.
- Length is the length of offset.
- Encode length in unary code.
- The gamma code is the concatenation of length and offset.

Compression of Reuters data

data structure                             size in MB
dictionary, fixed-width                        11.2
dictionary, term pointers into string           7.6
  ~, with blocking, k = 4                       7.1
  ~, with blocking & front coding               5.9
collection (text, xml markup etc.)          3,600.0
collection (text)                             960.0
T/D incidence matrix                       40,000.0
postings, uncompressed (32-bit words)         400.0
postings, uncompressed (20 bits)              250.0
postings, variable byte encoded               116.0
postings, γ encoded                           101.0
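The slides state the VB scheme in prose only; below is a minimal Python sketch of the scheme as described, not taken from the slides. The function names (vb_encode_number, vb_encode, vb_decode) are illustrative, and the code follows the slide's convention that the continuation bit c is 1 on the final byte of each gap.

```python
def vb_encode_number(n):
    """VB-encode one gap: 7 payload bits per byte, high bit = continuation bit c."""
    out = []
    while True:
        out.insert(0, n % 128)      # low-order 7 bits first (built back to front)
        if n < 128:
            break
        n //= 128
    out[-1] += 128                  # set c = 1 on the last byte of the gap
    return out

def vb_encode(gaps):
    """Concatenate the VB codes of a gap sequence into one byte stream."""
    return bytes(b for g in gaps for b in vb_encode_number(g))

def vb_decode(stream):
    """Recover the gap sequence from a VB-encoded byte stream."""
    gaps, n = [], 0
    for byte in stream:
        if byte < 128:              # c = 0: more bytes of this gap follow
            n = 128 * n + byte
        else:                       # c = 1: final byte of this gap
            gaps.append(128 * n + (byte - 128))
            n = 0
    return gaps

# Round trip on the gap sequence used as a VB example in the IIR book:
assert vb_decode(vb_encode([824, 5, 214577])) == [824, 5, 214577]
```

Building each byte list back to front keeps the high-order bits in the leading bytes, which is what the slide's "encode high-order 7 bits, then recurse on the lower-order bits" description requires.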
Take-away today

- Ranking search results: why it is important (as opposed to just presenting a set of unordered Boolean results)
- Term frequency: a key ingredient for ranking
- Tf-idf ranking: the best-known traditional ranking scheme
- Vector space model: one of the most important formal models for information retrieval (along with the Boolean and probabilistic models)

Ranked retrieval

- Thus far, our queries have all been Boolean: documents either match or don't.
- Good for expert users with a precise understanding of their needs and of the collection.
- Also good for applications: applications can easily consume 1000s of results.
- Not good for the majority of users:
  - Most users are not capable of writing Boolean queries...
  - ...or they are, but they think it's too much work.
  - Most users don't want to wade through 1000s of results. This is particularly true of web search.

Problem with Boolean search: feast or famine

- Boolean queries often result in either too few (= 0) or too many (1000s) results.
- Query 1 (Boolean conjunction): [standard user dlink 650] → 200,000 hits – feast
- Query 2 (Boolean conjunction): [standard user dlink 650 no card found] → 0 hits – famine
- In Boolean retrieval, it takes a lot of skill to come up with a query that produces a manageable number of hits.

Feast or famine: no problem in ranked retrieval

- With ranking, large result sets are not an issue.
- Just show the top 10 results; this doesn't overwhelm the user.
- Premise: the ranking algorithm works, i.e., more relevant results are ranked higher than less relevant results.

Scoring as the basis of ranked retrieval

- We wish to rank documents that are more relevant higher than documents that are less relevant.
- How can we accomplish such a ranking of the documents in the collection with respect to a query?
- Assign a score to each query-document pair, say in [0, 1].
- This score measures how well document and query "match".

Query-document matching scores

- How do we compute the score of a query-document pair?
- Let's start with a one-term query.
- If the query term does not occur in the document, the score should be 0.
- The more frequent the query term in the document, the higher the score.
- We will look at a number of alternatives for doing this.

Take 1: Jaccard coefficient

- A commonly used measure of the overlap of two sets.
- Let A and B be two sets. Jaccard coefficient: JACCARD(A, B) = |A ∩ B| / |A ∪ B|
- JACCARD(A, A) = 1
- JACCARD(A, B) = 0 if A ∩ B = ∅
- A and B don't have to be the same size.
- Always assigns a number between 0 and 1.

Jaccard coefficient: example

- What is the query-document match score that the Jaccard coefficient computes for:
  - Query: "ides of March"
  - Document: "Caesar died in March"
- JACCARD(q, d) = 1/6

What's wrong with Jaccard?

- It doesn't consider term frequency (how many occurrences a term has).
- Rare terms are more informative than frequent terms; Jaccard does not consider this information.
- We need a more sophisticated way of normalizing for the length of a document.
- Later in this lecture, we'll use |A ∩ B| / √(|A ∪ B|) (cosine) instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

Bag of words model

- We do not consider the order of words in a document.
- "John is quicker than Mary" and "Mary is quicker than John" are represented the same way.
- This is called a bag of words model.
- In a sense, this is a step back: the positional index was able to distinguish these two documents.
- We will look at "recovering" positional information later in this course. For now: bag of words model.

Term frequency tf

- The term frequency tf_{t,d} of term t in document d is defined as the number of times that t occurs in d.
- We want to use tf when computing query-document match scores. But how?
- Raw term frequency is not what we want because:
  - A document with tf = 10 occurrences of the term is more relevant than a document with tf = 1 occurrence of the term...
  - ...but not 10 times more relevant.
- Relevance does not increase proportionally with term frequency.

Instead of raw frequency: log frequency weighting

- The log frequency weight of term t in d is defined as:
  w_{t,d} = 1 + log10 tf_{t,d} if tf_{t,d} > 0, and 0 otherwise.
- So tf_{t,d} → w_{t,d}: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- Score for a document-query pair: sum over terms t in both q and d:
  tf-matching-score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_{t,d})
- The score is 0 if none of the query terms is present in the document. (A short sketch of this scoring function follows below.)
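To make the definition concrete, here is a minimal Python sketch of tf-matching-score, not part of the slides; log_tf_weight and tf_matching_score are hypothetical names, and the inputs are assumed to be already-tokenized, lowercased term lists.

```python
import math
from collections import Counter

def log_tf_weight(tf):
    """w(t, d) = 1 + log10 tf_{t,d} if tf_{t,d} > 0, else 0 (slide definition)."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

def tf_matching_score(query_terms, doc_terms):
    """Sum of log tf weights over the terms that occur in both q and d."""
    tf = Counter(doc_terms)                       # raw term frequencies in d
    return sum(log_tf_weight(tf[t]) for t in set(query_terms) if t in tf)

doc = "caesar died in march".split()
print(tf_matching_score("ides of march".split(), doc))   # 1.0: only 'march' matches, tf = 1
print(tf_matching_score("brutus antony".split(), doc))   # 0.0: no query term in the document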