Introduction to Information Retrieval: Scoring, Term Weighting, The Vector Space Model
[Figure] Vocabulary size M as a function of collection size T (number
of tokens) for Reuters-RCV1. For these data, the dashed line
log10 M = 0.49 ∗ log10 T + 1.64 is the best least-squares fit.
Thus, M = 10^1.64 ∗ T^0.49, i.e. k = 10^1.64 ≈ 44 and b = 0.49.
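The fitted line above can be turned into a vocabulary-size predictor. A minimal sketch, assuming Heaps' law M = k ∗ T^b with the fitted parameters k ≈ 44 and b = 0.49 (the function name is illustrative):

```python
# Heaps' law: vocabulary size M as a function of token count T,
# M = k * T^b, with the Reuters-RCV1 least-squares fit k ≈ 44, b = 0.49.
def heaps_vocabulary(T, k=44, b=0.49):
    """Predicted number of distinct terms after seeing T tokens."""
    return k * T ** b

# Predicted vocabulary after one million tokens: roughly 38,000 terms.
print(round(heaps_vocabulary(1_000_000)))
```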
The most frequent term (the) occurs cf1 times, the second most
frequent term (of) occurs cf1/2 times, the third most frequent
term (and) occurs cf1/3 times.
Dictionary as a string
Variable byte (VB) code
Dedicate 1 bit (high bit) to be a continuation bit c.
If the gap G fits within 7 bits, binary-encode it in the 7
available bits and set c = 1.
Else: set c = 0, encode high-order 7 bits and then use one or
more additional bytes to encode the lower order bits using
the same algorithm.
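The byte layout above can be sketched in code. A minimal sketch of VB encoding and decoding, with the continuation bit c set on the final byte of each gap, as described; function names are illustrative:

```python
def vb_encode(gap):
    """Encode one gap as a list of byte values (7 payload bits per byte)."""
    bytes_out = []
    while True:
        bytes_out.insert(0, gap % 128)   # low-order 7 bits
        if gap < 128:
            break
        gap //= 128
    bytes_out[-1] += 128                 # set continuation bit c on the last byte
    return bytes_out

def vb_decode(stream):
    """Decode a byte stream back into a list of gaps."""
    gaps, n = [], 0
    for byte in stream:
        if byte < 128:                   # c = 0: more bytes follow
            n = n * 128 + byte
        else:                            # c = 1: last byte of this gap
            gaps.append(n * 128 + (byte - 128))
            n = 0
    return gaps

print(vb_encode(824))                    # [6, 184]
print(vb_decode(vb_encode(5) + vb_encode(824)))  # [5, 824]
```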
Gamma codes for gap encoding
Represent a gap G as a pair of length and offset.
Offset is the gap in binary, with the leading bit chopped off.
Length is the length of offset.
Encode length in unary code (n is represented as n 1s followed by a 0).
The Gamma code is the concatenation of length and offset.
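The steps above translate directly into code. A minimal sketch of gamma encoding and decoding over bit strings (function names are illustrative):

```python
def gamma_encode(gap):
    """Gamma-encode a gap >= 1 as a bit string: unary(length) + offset."""
    offset = bin(gap)[3:]                # binary representation, leading bit chopped off
    length = "1" * len(offset) + "0"     # unary code for len(offset)
    return length + offset

def gamma_decode(bits):
    """Decode a single gamma code given as a bit string."""
    length = bits.index("0")             # number of leading 1s = offset length
    offset = bits[length + 1 : 2 * length + 1]
    return int("1" + offset, 2)          # restore the chopped leading bit

print(gamma_encode(13))                  # 1110101  (length 1110, offset 101)
print(gamma_decode("1110101"))           # 13
```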
Compression of Reuters-RCV1

data structure                            size in MB
dictionary, fixed-width                         11.2
dictionary, term pointers into string            7.6
∼, with blocking, k = 4                          7.1
∼, with blocking & front coding                  5.9
collection (text, XML markup etc.)            3600.0
collection (text)                              960.0
T/D incidence matrix                        40,000.0
postings, uncompressed (32-bit words)          400.0
postings, uncompressed (20 bits)               250.0
postings, variable byte encoded                116.0
postings, γ encoded                            101.0
Ranking search results: why it is important (as opposed to just
presenting a set of unordered Boolean results)
Term frequency: This is a key ingredient for ranking.
Tf-idf ranking: best known traditional ranking scheme
Vector space model: One of the most important formal
models for information retrieval (along with Boolean and
probabilistic models)
Thus far, our queries have all been Boolean.
Documents either match or don’t.
Good for expert users with precise understanding of their
needs and of the collection.
Also good for applications: Applications can easily consume
1000s of results.
Not good for the majority of users
Most users are not capable of writing Boolean queries . . .
. . . or they are, but they think it’s too much work.
Most users don’t want to wade through 1000s of results.
This is particularly true of web search.
Problem with Boolean search: Feast or famine
Boolean queries often result in either too few (=0) or too
many (1000s) results.
Query 1 (boolean conjunction): standard user dlink 650
→ 200,000 hits – feast
Query 2 (boolean conjunction): standard user dlink 650 no card found
→ 0 hits – famine
In Boolean retrieval, it takes a lot of skill to come up with a
query that produces a manageable number of hits.
Feast or famine: No problem in ranked retrieval
With ranking, large result sets are not an issue.
Just show the top 10 results
Doesn’t overwhelm the user
Premise: the ranking algorithm works: More relevant results
are ranked higher than less relevant results.
Scoring as the basis of ranked retrieval
We wish to rank documents that are more relevant higher
than documents that are less relevant.
How can we accomplish such a ranking of the documents in
the collection with respect to a query?
Assign a score to each query-document pair, say in [0, 1].
This score measures how well document and query “match”.
Query-document matching scores
How do we compute the score of a query-document pair?
Let’s start with a one-term query.
If the query term does not occur in the document: score
should be 0.
The more frequent the query term in the document, the
higher the score
We will look at a number of alternatives for doing this.
Take 1: Jaccard coefficient
A commonly used measure of overlap of two sets
Let A and B be two sets
JACCARD(A, B) = |A ∩ B| / |A ∪ B|
JACCARD(A, A) = 1
JACCARD(A, B) = 0 if A ∩ B = ∅
A and B don’t have to be the same size.
Always assigns a number between 0 and 1.
Jaccard coefficient: Example
What is the query-document match score that the Jaccard
coefficient computes for:
Query: “ides of March”
Document: “Caesar died in March”
JACCARD(q, d) = 1/6
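The example above can be checked directly. A minimal sketch computing the Jaccard coefficient over word sets (lowercased, whitespace-tokenized, an assumption for illustration):

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    return len(a & b) / len(a | b)

query = set("ides of march".split())
doc = set("caesar died in march".split())
# Intersection {march}, union of 6 terms -> 1/6.
print(jaccard(query, doc))
```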
What’s wrong with Jaccard?
It doesn’t consider term frequency (how many occurrences a term has).
Rare terms are more informative than frequent terms.
Jaccard does not consider this information.
We need a more sophisticated way of normalizing for the
length of a document.
Later in this lecture, we’ll use cosine . . .
. . . instead of |A ∩ B| / |A ∪ B| (Jaccard) for length normalization.
Bag of words model
We do not consider the order of words in a document.
John is quicker than Mary and Mary is quicker than John
are represented the same way.
This is called a bag of words model.
In a sense, this is a step back: The positional index was able
to distinguish these two documents.
We will look at “recovering” positional information later in this course.
For now: bag of words model
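The order-blindness described above is easy to demonstrate. A minimal sketch using a term-count mapping (lowercased, whitespace-tokenized, an assumption for illustration):

```python
from collections import Counter

def bag_of_words(text):
    """Represent a document as a term -> count mapping, discarding word order."""
    return Counter(text.lower().split())

d1 = bag_of_words("John is quicker than Mary")
d2 = bag_of_words("Mary is quicker than John")
print(d1 == d2)   # True: identical bags, word order is lost
```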
Term frequency tf
The term frequency tf of term t in document d is defined
as the number of times that t occurs in d.
We want to use tf when computing query-document match scores.
Raw term frequency is not what we want because:
A document with tf = 10 occurrences of the term is more
relevant than a document with tf = 1 occurrence of the term.
But not 10 times more relevant.
Relevance does not increase proportionally with term frequency.
Instead of raw frequency: Log frequency
The log frequency weight of term t in d is defined as follows:
w(t, d) = 1 + log10 tf(t, d) if tf(t, d) > 0, and 0 otherwise.
tf → w :
0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
Score for a document-query pair: sum over terms t in both q and d:
tf-matching-score(q, d) = Σ over t in q ∩ d of (1 + log tf(t, d))
The score is 0 if none of the query terms is present in the document.
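The weighting and score above can be sketched in a few lines, assuming lowercased whitespace tokenization for illustration:

```python
import math
from collections import Counter

def log_tf_weight(tf):
    """w = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0

def tf_matching_score(query, doc):
    """Sum of log-frequency weights over terms occurring in both query and doc."""
    counts = Counter(doc.lower().split())
    return sum(log_tf_weight(counts[t]) for t in set(query.lower().split()))

print(log_tf_weight(1), log_tf_weight(10), log_tf_weight(1000))  # 1.0 2.0 4.0
```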