automatic indexing in information retrieval systems and also describe systems and procedures for storing and retrieving information
Introduction to Information Retrieval
1Introduction to Information Retrieval
Token – an instance of a word or term occurring in a
Type – an equivalence class of tokens
In June, the dog likes to chase the cat in the barn.
12 word tokens, 9 word types
4 4Introduction to Information Retrieval
Problems in tokenization
What are the delimiters? Space? Apostrophe? Hyphen?
For each of these: sometimes they delimit, sometimes they
No whitespace in many languages (e.g., Chinese)
No whitespace in Dutch, German, Swedish compounds
5 5Introduction to Information Retrieval
Problems with equivalence classing
A term is an equivalence class of tokens.
How do we define equivalence classes?
Numbers (3/20/91 vs. 20/3/91)
Stemming, Porter stemmer
Morphological analysis: inflectional vs. derivational
Equivalence classing problems in other languages
More complex morphology than in English
Finnish: a single verb may have 12,000 different forms
6 6Introduction to Information Retrieval
7 7Introduction to Information Retrieval
Postings lists in a nonpositional index: each posting is just a
Postings lists in a positional index: each posting is a docID and
a list of positions
Example query: “to be or not to be ”
1 2 3 4 5 6
‹ 1: ‹7, 18, 33, 72, 86, 231›;
2: ‹1, 17, 74, 222, 255›;
4: ‹8, 16, 190, 429, 433›;
5: ‹363, 367›;
7: ‹13, 23, 191›; . . . ›
‹ 1: ‹17, 25›;
4: ‹17, 191, 291, 430, 434›;
5: ‹14, 19, 101›; . . . › Document 4 is a match
8 8Introduction to Information Retrieval
With a positional index, we can answer phrase queries.
With a positional index, we can answer proximity queries.
9 9Introduction to Information Retrieval
Tolerant retrieval: What to do if there is no exact match
between query term and document term
10 10Introduction to Information Retrieval
12 12Introduction to Information Retrieval
13 13Introduction to Information Retrieval
The dictionary is the data structure for storing the term
Term vocabulary: the data
Dictionary: the data structure for storing the term vocabulary
14 14Introduction to Information Retrieval
Dictionary as array of fixed-width entries
For each term, we need to store a couple of items:
pointer to postings list
. . .
Assume for the time being that we can store this information
in a fixed-length entry.
Assume that we store these entries in an array.
15 15Introduction to Information Retrieval
Dictionary as array of fixed-width entries
space needed: 20 bytes 4 bytes 4 bytes
How do we look up a query term q in this array at query time?
That is: which data structure do we use to locate the entry (row)
in the array where q is stored?
16 16Introduction to Information Retrieval
Data structures for looking up term
Two main classes of data structures: hashes and trees
Some IR systems use hashes, some use trees.
Criteria for when to use hashes vs. trees:
Is there a fixed number of terms or will it keep growing?
What are the relative frequencies with which various keys will
How many terms are we likely to have?
17 17Introduction to Information Retrieval
Each vocabulary term is hashed into an integer.
Try to avoid collisions
At query time, do the following: hash query term, resolve
collisions, locate entry in fixed-width array
Pros: Lookup in a hash is faster than lookup in a tree.
Lookup time is constant.
no way to find minor variants (resume vs. résumé)
no prefix search (all terms starting with automat)
need to rehash everything periodically if vocabulary keeps
18 18Introduction to Information Retrieval
Trees solve the prefix problem (find all terms starting with
Simplest tree: binary tree
Search is slightly slower than in hashes: O(logM), where M is
the size of the vocabulary.
O(logM) only holds for balanced trees.
Rebalancing binary trees is expensive.
B-trees mitigate the rebalancing problem.
B-tree definition: every internal node has a number of
children in the interval a, b where a, b are appropriate
positive integers, e.g., 2, 4.
19 19Introduction to Information Retrieval
mon: find all docs containing any term beginning with mon
Easy with B-tree dictionary: retrieve all terms t in the range:
mon ≤ t moo
mon: find all docs containing any term ending with mon
Maintain an additional tree for terms backwards
Then retrieve all terms t in the range: nom ≤ t non
Result: A set of terms that are matches for wildcard query
Then retrieve documents that contain any of these terms
23 23Introduction to Information Retrieval
How to handle in the middle of a term
We could look up m and nchen in the B-tree and intersect
the two term sets.
Alternative: permuterm index
Basic idea: Rotate every wildcard query, so that the occurs
at the end.
Store each of these rotations in the dictionary, say, in a B-tree
24 24Introduction to Information Retrieval
For term HELLO: add hello, elloh, llohe, lohel, and ohell
to the B-tree where is a special symbol
25 25Introduction to Information Retrieval
Permuterm → term mapping