Dictionaries and tolerant Retrieval ppt

automatic indexing in information retrieval systems and also describe systems and procedures for storing and retrieving information
WilliamsMcmahon Profile Pic
WilliamsMcmahon,United States,Professional
Published Date:20-07-2017
Your Website URL(Optional)
Introduction to Information Retrieval Introduction to Information Retrieval Dictionaries and tolerant retrieval 1Introduction to Information Retrieval Type/token distinction  Token – an instance of a word or term occurring in a document  Type – an equivalence class of tokens  In June, the dog likes to chase the cat in the barn.  12 word tokens, 9 word types 4 4Introduction to Information Retrieval Problems in tokenization  What are the delimiters? Space? Apostrophe? Hyphen?  For each of these: sometimes they delimit, sometimes they don’t.  No whitespace in many languages (e.g., Chinese)  No whitespace in Dutch, German, Swedish compounds (Lebensversicherungsgesellschaftsangestellter) 5 5Introduction to Information Retrieval Problems with equivalence classing  A term is an equivalence class of tokens.  How do we define equivalence classes?  Numbers (3/20/91 vs. 20/3/91)  Case folding  Stemming, Porter stemmer  Morphological analysis: inflectional vs. derivational  Equivalence classing problems in other languages  More complex morphology than in English  Finnish: a single verb may have 12,000 different forms  Accents, umlauts 6 6Introduction to Information Retrieval Skip pointers 7 7Introduction to Information Retrieval Positional indexes  Postings lists in a nonpositional index: each posting is just a docID  Postings lists in a positional index: each posting is a docID and a list of positions  Example query: “to be or not to be ” 1 2 3 4 5 6 TO, 993427: ‹ 1: ‹7, 18, 33, 72, 86, 231›; 2: ‹1, 17, 74, 222, 255›; 4: ‹8, 16, 190, 429, 433›; 5: ‹363, 367›; 7: ‹13, 23, 191›; . . . › BE, 178239: ‹ 1: ‹17, 25›; 4: ‹17, 191, 291, 430, 434›; 5: ‹14, 19, 101›; . . . › Document 4 is a match 8 8Introduction to Information Retrieval Positional indexes  With a positional index, we can answer phrase queries.  With a positional index, we can answer proximity queries. 9 9Introduction to Information Retrieval Take-away  Tolerant retrieval: What to do if there is no exact match between query term and document term  Wildcard queries  Spelling correction 10 10Introduction to Information Retrieval Inverted index 12 12Introduction to Information Retrieval Inverted index 13 13Introduction to Information Retrieval Dictionaries  The dictionary is the data structure for storing the term vocabulary.  Term vocabulary: the data  Dictionary: the data structure for storing the term vocabulary 14 14Introduction to Information Retrieval Dictionary as array of fixed-width entries  For each term, we need to store a couple of items:  document frequency  pointer to postings list  . . .  Assume for the time being that we can store this information in a fixed-length entry.  Assume that we store these entries in an array. 15 15Introduction to Information Retrieval Dictionary as array of fixed-width entries space needed: 20 bytes 4 bytes 4 bytes How do we look up a query term q in this array at query time? i That is: which data structure do we use to locate the entry (row) in the array where q is stored? i 16 16Introduction to Information Retrieval Data structures for looking up term  Two main classes of data structures: hashes and trees  Some IR systems use hashes, some use trees.  Criteria for when to use hashes vs. trees:  Is there a fixed number of terms or will it keep growing?  What are the relative frequencies with which various keys will be accessed?  How many terms are we likely to have? 17 17Introduction to Information Retrieval Hashes  Each vocabulary term is hashed into an integer.  Try to avoid collisions  At query time, do the following: hash query term, resolve collisions, locate entry in fixed-width array  Pros: Lookup in a hash is faster than lookup in a tree.  Lookup time is constant.  Cons  no way to find minor variants (resume vs. résumé)  no prefix search (all terms starting with automat)  need to rehash everything periodically if vocabulary keeps growing 18 18Introduction to Information Retrieval Trees  Trees solve the prefix problem (find all terms starting with automat).  Simplest tree: binary tree  Search is slightly slower than in hashes: O(logM), where M is the size of the vocabulary.  O(logM) only holds for balanced trees.  Rebalancing binary trees is expensive.  B-trees mitigate the rebalancing problem.  B-tree definition: every internal node has a number of children in the interval a, b where a, b are appropriate positive integers, e.g., 2, 4. 19 19Introduction to Information Retrieval Wildcard queries  mon: find all docs containing any term beginning with mon  Easy with B-tree dictionary: retrieve all terms t in the range: mon ≤ t moo  mon: find all docs containing any term ending with mon  Maintain an additional tree for terms backwards  Then retrieve all terms t in the range: nom ≤ t non  Result: A set of terms that are matches for wildcard query  Then retrieve documents that contain any of these terms 23 23Introduction to Information Retrieval How to handle in the middle of a term  Example: mnchen  We could look up m and nchen in the B-tree and intersect the two term sets.  Expensive  Alternative: permuterm index  Basic idea: Rotate every wildcard query, so that the occurs at the end.  Store each of these rotations in the dictionary, say, in a B-tree 24 24Introduction to Information Retrieval Permuterm index  For term HELLO: add hello, elloh, llohe, lohel, and ohell to the B-tree where is a special symbol 25 25Introduction to Information Retrieval Permuterm → term mapping 26 26