# Vector Space Model in Information Retrieval

Williams McMahon, United States, Professional
Published Date: 20-07-2017
Introduction to Information Retrieval: Scoring, Term Weighting, and the Vector Space Model

## Heaps' law

Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1. For these data, the dashed line log10 M = 0.49 · log10 T + 1.64 is the best least-squares fit. Thus M = 10^1.64 · T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.

## Zipf's law

The most frequent term (the) occurs cf₁ times, the second most frequent term (of) occurs cf₁/2 times, the third most frequent term (and) occurs cf₁/3 times, and so on.

## Dictionary as a string

(figure-only slide)

## Gap encoding

(figure-only slide)

## Variable byte (VB) code

- Dedicate 1 bit (the high bit) of each byte as a continuation bit c.
- If the gap G fits within 7 bits, binary-encode it in the 7 available bits and set c = 1.
- Else: set c = 0, encode the high-order 7 bits, and then use one or more additional bytes to encode the lower-order bits with the same algorithm.

## Gamma codes for gap encoding

- Represent a gap G as a pair of length and offset.
- Offset is the gap in binary, with the leading bit chopped off.
- Length is the length of offset, encoded in unary code.
- The gamma code is the concatenation of length and offset.
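As a rough sketch of the two gap-encoding schemes above (the byte-list representation, function names, and single-codeword decoder are illustrative choices, not from the slides):

```python
def vb_encode(gap):
    """Variable byte encode one gap as a list of byte values (0-255).

    Each byte carries 7 payload bits; the high (continuation) bit c
    is set to 1 only on the final byte of the gap.
    """
    out = []
    while True:
        out.insert(0, gap % 128)   # low-order 7 bits first (prepended)
        if gap < 128:
            break
        gap //= 128
    out[-1] += 128                 # set c = 1 on the last byte
    return out

def gamma_encode(gap):
    """Gamma code of a gap G >= 1 as a bit string: unary length, then offset."""
    offset = bin(gap)[3:]              # G in binary with the leading 1 chopped off
    length = '1' * len(offset) + '0'   # len(offset) in unary, terminated by 0
    return length + offset

def gamma_decode(code):
    """Decode a single gamma codeword back to the gap."""
    n = code.index('0')                         # unary part: n ones before the 0
    return int('1' + code[n + 1:n + 1 + n], 2)  # restore the chopped leading bit
```

For example, vb_encode(5) gives [133] (binary 10000101, with c = 1), and gamma_encode(13) gives '1110101' (length 1110, offset 101).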
## Compression of Reuters RCV1

| data structure                        | size in MB |
| ------------------------------------- | ---------- |
| dictionary, fixed-width               | 11.2       |
| dictionary, term pointers into string | 7.6        |
| ∼, with blocking, k = 4               | 7.1        |
| ∼, with blocking & front coding       | 5.9        |
| collection (text, XML markup etc.)    | 3600.0     |
| collection (text)                     | 960.0      |
| T/D incidence matrix                  | 40,000.0   |
| postings, uncompressed (32-bit words) | 400.0      |
| postings, uncompressed (20 bits)      | 250.0      |
| postings, variable byte encoded       | 116.0      |
| postings, γ encoded                   | 101.0      |

## Take-away today

- Ranking search results: why it is important (as opposed to just presenting a set of unordered Boolean results)
- Term frequency: a key ingredient for ranking
- Tf-idf ranking: the best-known traditional ranking scheme
- Vector space model: one of the most important formal models for information retrieval (along with the Boolean and probabilistic models)

## Ranked retrieval

- Thus far, our queries have all been Boolean: documents either match or don't.
- Good for expert users with a precise understanding of their needs and of the collection.
- Also good for applications: applications can easily consume 1000s of results.
- Not good for the majority of users:
  - Most users are not capable of writing Boolean queries, or they are but think it's too much work.
  - Most users don't want to wade through 1000s of results. This is particularly true of web search.

## Problem with Boolean search: Feast or famine

- Boolean queries often result in either too few (= 0) or too many (1000s of) results.
- Query 1 (Boolean conjunction): standard user dlink 650 → 200,000 hits (feast)
- Query 2 (Boolean conjunction): standard user dlink 650 no card found → 0 hits (famine)
- In Boolean retrieval, it takes a lot of skill to come up with a query that produces a manageable number of hits.
## Feast or famine: No problem in ranked retrieval

- With ranking, large result sets are not an issue: just show the top 10 results.
- This doesn't overwhelm the user.
- Premise: the ranking algorithm works, i.e., more relevant results are ranked higher than less relevant ones.

## Scoring as the basis of ranked retrieval

- We wish to rank documents that are more relevant higher than documents that are less relevant.
- How can we accomplish such a ranking of the documents in the collection with respect to a query?
- Assign a score to each query-document pair, say in [0, 1].
- This score measures how well the document and the query "match".

## Query-document matching scores

- How do we compute the score of a query-document pair?
- Let's start with a one-term query.
- If the query term does not occur in the document, the score should be 0.
- The more frequent the query term is in the document, the higher the score.
- We will look at a number of alternatives for doing this.

## Take 1: Jaccard coefficient

- A commonly used measure of the overlap of two sets A and B: JACCARD(A, B) = |A ∩ B| / |A ∪ B|
- JACCARD(A, A) = 1
- JACCARD(A, B) = 0 if A ∩ B = ∅
- A and B don't have to be the same size.
- It always assigns a number between 0 and 1.

## Jaccard coefficient: Example

- What is the query-document match score that the Jaccard coefficient computes for the query "ides of March" and the document "Caesar died in March"?
- JACCARD(q, d) = 1/6 (one term in common, six terms in the union)

## What's wrong with Jaccard?

- It doesn't consider term frequency (how many occurrences a term has).
- Rare terms are more informative than frequent terms; Jaccard does not use this information.
- We need a more sophisticated way of normalizing for the length of a document.
- Later in this lecture, we'll use |A ∩ B| / √|A ∪ B| (cosine) . . .
. . . instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.

## Bag of words model

- We do not consider the order of words in a document.
- "John is quicker than Mary" and "Mary is quicker than John" are represented the same way.
- This is called a bag of words model.
- In a sense, this is a step back: the positional index was able to distinguish these two documents.
- We will look at "recovering" positional information later in this course. For now: bag of words model.

## Term frequency tf

- The term frequency tf_t,d of term t in document d is defined as the number of times that t occurs in d.
- We want to use tf when computing query-document match scores. But how?
- Raw term frequency is not what we want, because a document with tf = 10 occurrences of a term is more relevant than a document with tf = 1 occurrence of that term, but not 10 times more relevant.
- Relevance does not increase proportionally with term frequency.

## Instead of raw frequency: Log frequency weighting

- The log frequency weight of term t in d is defined as: w_t,d = 1 + log10(tf_t,d) if tf_t,d > 0, otherwise 0.
- tf → w: 0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
- Score for a document-query pair: sum over terms t in both q and d: tf-matching-score(q, d) = Σ_{t ∈ q ∩ d} (1 + log10 tf_t,d)
- The score is 0 if none of the query terms is present in the document.
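A minimal sketch of these two scoring ideas in Python (the function names and lowercase-token normalization are my own choices, not from the slides):

```python
import math
from collections import Counter

def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token sets."""
    return len(a & b) / len(a | b)

def log_tf_weight(tf):
    """Log frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

def tf_matching_score(query_tokens, doc_tokens):
    """Sum of log-tf weights over the terms shared by query and document."""
    doc_tf = Counter(doc_tokens)
    return sum(log_tf_weight(doc_tf[t]) for t in set(query_tokens) & set(doc_tf))
```

On the slide's example, jaccard(set('ides of march'.split()), set('caesar died in march'.split())) is 1/6; the tf-matching score of the same pair is 1.0, since the only shared term, march, has tf = 1 and weight 1 + log10 1 = 1.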
