Dictionaries and tolerant retrieval
Introduction to Information Retrieval
Overview
❶ Recap
❷ Dictionaries
❸ Wildcard queries
❹ Edit distance
❺ Spelling correction
❻ Soundex

❶ Recap

Type/token distinction
• Token – an instance of a word or term occurring in a document
• Type – an equivalence class of tokens
• Example: "In June, the dog likes to chase the cat in the barn."
  – 12 word tokens, 9 word types

Problems in tokenization
• What are the delimiters? Space? Apostrophe? Hyphen?
• For each of these: sometimes they delimit, sometimes they don't.
• No whitespace in many languages (e.g., Chinese)
• No whitespace in Dutch, German, Swedish compounds (Lebensversicherungsgesellschaftsangestellter)

Problems with equivalence classing
• A term is an equivalence class of tokens.
• How do we define equivalence classes?
  – Numbers (3/20/91 vs. 20/3/91)
  – Case folding
  – Stemming (Porter stemmer)
  – Morphological analysis: inflectional vs. derivational
• Equivalence classing problems in other languages
  – More complex morphology than in English
  – Finnish: a single verb may have 12,000 different forms
  – Accents, umlauts

Skip pointers (figure)

Positional indexes
• Postings lists in a nonpositional index: each posting is just a docID
• Postings lists in a positional index: each posting is a docID and a list of positions
• Example query: "to₁ be₂ or₃ not₄ to₅ be₆"
  TO, 993427: ‹1: ‹7, 18, 33, 72, 86, 231›; 2: ‹1, 17, 74, 222, 255›; 4: ‹8, 16, 190, 429, 433›; 5: ‹363, 367›; 7: ‹13, 23, 191›; . . .›
  BE, 178239: ‹1: ‹17, 25›; 4: ‹17, 191, 291, 430, 434›; 5: ‹14, 19, 101›; . . .›
• Document 4 is a match: TO occurs at positions 429 and 433, each immediately followed by BE at 430 and 434.

Positional indexes
• With a positional index, we can answer phrase queries.
• With a positional index, we can answer proximity queries.

Takeaway
• Tolerant retrieval: what to do if there is no exact match between query term and document term
• Wildcard queries
• Spelling correction

❷ Dictionaries

Inverted index (figure)

Inverted index (figure)

Dictionaries
• The dictionary is the data structure for storing the term vocabulary.
• Term vocabulary: the data
• Dictionary: the data structure for storing the term vocabulary

Dictionary as array of fixed-width entries
• For each term, we need to store a couple of items:
  – document frequency
  – pointer to postings list
  – . . .
• Assume for the time being that we can store this information in a fixed-length entry.
• Assume that we store these entries in an array.
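To make the fixed-width array concrete, here is a minimal sketch in Python (not from the slides: the toy terms are illustrative, and the sorted-array-plus-binary-search lookup merely anticipates the "which data structure?" question on the next slide):

    import bisect

    # One fixed-width row per term: (term, document frequency, postings pointer).
    # Keeping the rows sorted by term lets us locate a query term by binary search.
    dictionary = [
        ("brutus",    2, 0),
        ("caesar",    3, 1),
        ("calpurnia", 1, 2),
    ]
    terms = [t for (t, _, _) in dictionary]  # parallel, sorted key array

    def lookup(q):
        """Return the dictionary row for query term q, or None."""
        i = bisect.bisect_left(terms, q)
        if i < len(terms) and terms[i] == q:
            return dictionary[i]
        return None

    print(lookup("caesar"))  # ('caesar', 3, 1)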
Dictionary as array of fixed-width entries (figure)
• Space needed per entry: 20 bytes (term) + 4 bytes (document frequency) + 4 bytes (pointer to postings list)
• How do we look up a query term q in this array at query time?
• That is: which data structure do we use to locate the entry (row) in the array where q is stored?

Data structures for looking up terms
• Two main classes of data structures: hashes and trees
• Some IR systems use hashes, some use trees.
• Criteria for when to use hashes vs. trees:
  – Is there a fixed number of terms or will it keep growing?
  – What are the relative frequencies with which various keys will be accessed?
  – How many terms are we likely to have?

Hashes
• Each vocabulary term is hashed into an integer.
• Try to avoid collisions.
• At query time: hash the query term, resolve collisions, locate the entry in the fixed-width array
• Pro: lookup in a hash is faster than lookup in a tree.
  – Lookup time is constant.
• Cons:
  – no way to find minor variants (resume vs. résumé)
  – no prefix search (all terms starting with automat)
  – need to rehash everything periodically if the vocabulary keeps growing

Trees
• Trees solve the prefix problem (find all terms starting with automat).
• Simplest tree: binary tree
• Search is slightly slower than in hashes: O(log M), where M is the size of the vocabulary.
• O(log M) only holds for balanced trees.
• Rebalancing binary trees is expensive.
• B-trees mitigate the rebalancing problem.
• B-tree definition: every internal node has a number of children in the interval [a, b], where a and b are appropriate positive integers, e.g., [2, 4].

Binary tree (figure)

B-tree (figure)

❸ Wildcard queries

Wildcard queries
• mon*: find all docs containing any term beginning with mon
• Easy with a B-tree dictionary: retrieve all terms t in the range mon ≤ t < moo
• *mon: find all docs containing any term ending with mon
  – Maintain an additional tree for terms written backwards
  – Then retrieve all terms t in the range nom ≤ t < non
• Result: a set of terms that match the wildcard query
• Then retrieve the documents that contain any of these terms.

How to handle * in the middle of a term
• Example: m*nchen
• We could look up m* and *nchen in the B-tree and intersect the two term sets.
  – Expensive
• Alternative: permuterm index
• Basic idea: rotate every wildcard query, so that the * occurs at the end.
• Store each of these rotations in the dictionary, say, in a B-tree.

Permuterm index
• For term HELLO: add hello$, ello$h, llo$he, lo$hel, and o$hell to the B-tree, where $ is a special symbol.

Permuterm → term mapping (figure)

Permuterm index
• For HELLO, we've stored: hello$, ello$h, llo$he, lo$hel, and o$hell
• Queries:
  – For X, look up X$
  – For X*, look up $X*
  – For *X, look up X$*
  – For *X*, look up X*
  – For X*Y, look up Y$X*
• Example: for hel*o, look up o$hel*
• Permuterm index would better be called permuterm tree.
• But permuterm index is the more common name.
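A minimal sketch of the rotation scheme just described, assuming exactly one * per query (the function names are mine, not the slides'):

    def permuterm_keys(term):
        """All rotations of term + '$' that get stored in the B-tree."""
        s = term + "$"
        return [s[i:] + s[:i] for i in range(len(s))]

    def rotate_query(q):
        """Rotate a wildcard query so that the single '*' ends up at the end."""
        s = q + "$"
        star = s.index("*")
        return s[star + 1:] + s[:star + 1]

    print(permuterm_keys("hello"))  # ['hello$', 'ello$h', 'llo$he', 'lo$hel', 'o$hell', '$hello']
    print(rotate_query("hel*o"))    # 'o$hel*'  -- then do a prefix lookup for 'o$hel'
    print(rotate_query("m*nchen"))  # 'nchen$m*'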
Processing a lookup in the permuterm index
• Rotate the query wildcard to the right
• Use B-tree lookup as before
• Problem: permuterm more than quadruples the size of the dictionary compared to a regular B-tree. (empirical number)

k-gram indexes
• More space-efficient than the permuterm index
• Enumerate all character k-grams (sequences of k characters) occurring in a term
• 2-grams are called bigrams.
• Example: from "April is the cruelest month" we get the bigrams:
  $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt th h$
• $ is a special word boundary symbol, as before.
• Maintain an inverted index from bigrams to the terms that contain the bigram.

Postings list in a 3-gram inverted index (figure)

k-gram (bigram, trigram, . . . ) indexes
• Note that we now have two different types of inverted indexes:
  – The term-document inverted index for finding documents based on a query consisting of terms
  – The k-gram index for finding terms based on a query consisting of k-grams

Processing wildcarded terms in a bigram index
• Query mon* can now be run as: $m AND mo AND on
• Gets us all terms with the prefix mon . . .
• . . . but also many "false positives" like MOON.
• We must postfilter these terms against the query.
• Surviving terms are then looked up in the term-document inverted index.
• (A small code sketch of this lookup follows after the next two slides.)
• k-gram index vs. permuterm index:
  – k-gram index is more space efficient.
  – Permuterm index doesn't require postfiltering.

Exercise
• Google has very limited support for wildcard queries.
• For example, this query doesn't work very well on Google: gen* universit*
• Intention: you are looking for the University of Geneva, but don't know which accents to use for the French words for university and Geneva.
• According to Google search basics, 2010-04-29: "Note that the * operator works only on whole words, not parts of words."
• But this is not entirely true. Try pythag* and m*nchen.
• Exercise: Why doesn't Google fully support wildcard queries?

Processing wildcard queries in the term-document index
• Problem 1: we must potentially execute a large number of Boolean queries.
  – Most straightforward semantics: conjunction of disjunctions
  – For gen* universit*: geneva university OR geneva université OR genève university OR genève université OR general universities OR . . .
  – Very expensive
• Problem 2: users hate to type.
  – If abbreviated queries like pyth* theo* for pythagoras' theorem are allowed, users will use them a lot.
  – This would significantly increase the cost of answering queries.
  – Somewhat alleviated by Google Suggest
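As promised above, a minimal sketch of the bigram-index lookup with postfiltering (the vocabulary is a toy, and using fnmatch for the postfilter is my choice, not the slides'):

    from fnmatch import fnmatchcase

    def bigrams(s):
        return [s[i:i + 2] for i in range(len(s) - 1)]

    # Toy k-gram index: bigram -> set of terms containing it ($ = word boundary).
    vocabulary = ["moon", "month", "monday", "salmon"]
    kgram_index = {}
    for term in vocabulary:
        for g in set(bigrams("$" + term + "$")):
            kgram_index.setdefault(g, set()).add(term)

    def wildcard_terms(query):
        """E.g. mon* is run as $m AND mo AND on, then postfiltered."""
        grams = [g for g in bigrams("$" + query + "$") if "*" not in g]
        candidates = set.intersection(*(kgram_index.get(g, set()) for g in grams))
        # Postfilter: MOON satisfies $m AND mo AND on but does not match mon*.
        return {t for t in candidates if fnmatchcase(t, query)}

    print(wildcard_terms("mon*"))  # {'month', 'monday'} -- 'moon' is filtered out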
❹ Edit distance

Spelling correction
• Two principal uses:
  – Correcting documents being indexed
  – Correcting user queries
• Two different methods for spelling correction:
  – Isolated word spelling correction
    · Check each word on its own for misspelling
    · Will not catch typos resulting in correctly spelled words, e.g., "an asteroid that fell form the sky"
  – Context-sensitive spelling correction
    · Look at surrounding words
    · Can correct the form/from error above

Correcting documents
• We're not interested in interactive spelling correction of documents (e.g., MS Word) in this class.
• In IR, we use document correction primarily for OCR'ed documents. (OCR = optical character recognition)
• The general philosophy in IR is: don't change the documents.

Correcting queries
• First: isolated word spelling correction
• Premise 1: there is a list of "correct words" from which the correct spellings come.
• Premise 2: we have a way of computing the distance between a misspelled word and a correct word.
• Simple spelling correction algorithm: return the "correct" word that has the smallest distance to the misspelled word.
• Example: informaton → information
• For the list of correct words, we can use the vocabulary of all words that occur in our collection.
• Why is this problematic?

Alternatives to using the term vocabulary
• A standard dictionary (Webster's, OED etc.)
• An industry-specific dictionary (for specialized IR systems)
• The term vocabulary of the collection, appropriately weighted

Distance between misspelled word and "correct" word
• We will study several alternatives:
  – Edit distance and Levenshtein distance
  – Weighted edit distance
  – k-gram overlap

Edit distance
• The edit distance between string s₁ and string s₂ is the minimum number of basic operations that convert s₁ to s₂.
• Levenshtein distance: the admissible basic operations are insert, delete, and replace.
• Levenshtein distance dog–do: 1
• Levenshtein distance cat–cart: 1
• Levenshtein distance cat–cut: 1
• Levenshtein distance cat–act: 2
• Damerau-Levenshtein distance cat–act: 1
• Damerau-Levenshtein includes transposition as a fourth possible operation.
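The next slides step through the computation on example matrices; as a reference point, here is a minimal sketch of the standard dynamic-programming recurrence (the comments anticipate the "each cell" slide below):

    def levenshtein(s, t):
        """Minimum number of inserts, deletes, replaces turning s into t."""
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]  # d[i][j]: distance s[:i] -> t[:j]
        for i in range(m + 1):
            d[i][0] = i  # delete everything
        for j in range(n + 1):
            d[0][j] = j  # insert everything
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(
                    d[i - 1][j - 1] + (s[i - 1] != t[j - 1]),  # upper left: copy or replace
                    d[i - 1][j] + 1,                           # upper: delete
                    d[i][j - 1] + 1,                           # left: insert
                )
        return d[m][n]

    print(levenshtein("cat", "cart"))  # 1
    print(levenshtein("cat", "act"))   # 2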
Levenshtein distance: Computation (figure)

Levenshtein distance: Algorithm (figures, built up over several slides)

Levenshtein distance: Example (figure)

Each cell of the Levenshtein matrix
• cost of getting here from my upper left neighbor (copy or replace)
• cost of getting here from my upper neighbor (delete)
• cost of getting here from my left neighbor (insert)
• the minimum of the three possible "movements": the cheapest way of getting here

Levenshtein distance: Example (figure)

Dynamic programming (Cormen et al.)
• Optimal substructure: the optimal solution to the problem contains within it subsolutions, i.e., optimal solutions to subproblems.
• Overlapping subsolutions: the subsolutions overlap. These subsolutions are computed over and over again when computing the global optimal solution with a brute-force algorithm.
• Subproblem in the case of edit distance: what is the edit distance of two prefixes?
• Overlapping subsolutions: we need most distances of prefixes 3 times – this corresponds to moving right, diagonally, and down.

Weighted edit distance
• As above, but the weight of an operation depends on the characters involved.
• Meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q.
• Therefore, replacing m by n is a smaller edit distance than replacing m by q.
• We now require a weight matrix as input.
• Modify the dynamic programming to handle weights.

Using edit distance for spelling correction
• Given a query, first enumerate all character sequences within a preset (possibly weighted) edit distance.
• Intersect this set with our list of "correct" words.
• Then suggest the terms in the intersection to the user.
• → exercise in a few slides
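A minimal sketch of this enumerate-and-intersect idea; it is essentially the approach of Peter Norvig's spelling corrector, listed in the resources at the end (edits1 and the toy vocabulary are illustrative, and a real corrector would also handle distance 2 and transpositions):

    import string

    def edits1(word):
        """All strings within Levenshtein distance 1 of word."""
        splits   = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes  = [L + R[1:] for L, R in splits if R]
        replaces = [L + c + R[1:] for L, R in splits if R for c in string.ascii_lowercase]
        inserts  = [L + c + R for L, R in splits for c in string.ascii_lowercase]
        return set(deletes + replaces + inserts)

    vocabulary = {"information", "informal", "formation"}  # toy "correct word" list

    def suggestions(word):
        """Vocabulary terms within edit distance 1 of the (misspelled) word."""
        return vocabulary & edits1(word)

    print(suggestions("informaton"))  # {'information'}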
Exercise
❶ Compute the Levenshtein distance matrix for OSLO – SNOW.
❷ What are the Levenshtein editing operations that transform cat into catcat?
(Both answers are checked programmatically a few slides below.)

[Slides 54–87: the Levenshtein matrix for OSLO – SNOW, filled in cell by cell (figures)]

How do I read out the editing operations that transform OSLO into SNOW?

[Slides 89–99: tracing back through the matrix to read off the editing operations (figures)]

❺ Spelling correction

Spelling correction
• Now that we can compute edit distance: how to use it for isolated word spelling correction – this is the last slide in this section.
• k-gram indexes for isolated word spelling correction
• Context-sensitive spelling correction
• General issues
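As flagged in the exercise above, the answers can be checked with the levenshtein sketch from earlier (assuming that function is still in scope):

    print(levenshtein("oslo", "snow"))   # 3: delete O, replace L -> N, insert W
    print(levenshtein("cat", "catcat"))  # 3: insert c, a, t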
k-gram indexes for spelling correction
• Enumerate all k-grams in the query term
• Example: bigram index, misspelled word bordroom
  – Bigrams: bo, or, rd, dr, ro, oo, om
• Use the k-gram index to retrieve "correct" words that match query-term k-grams
• Threshold by the number of matching k-grams
  – E.g., only vocabulary terms that differ by at most 3 k-grams

k-gram indexes for spelling correction: bordroom (figure)
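A minimal sketch of the retrieve-and-threshold step just described (the vocabulary is a toy; thresholding here counts matching bigrams, a close cousin of the slide's "differ by at most 3 k-grams" criterion):

    def char_bigrams(term):
        """Character bigrams without boundary symbols, as in the bordroom example."""
        return {term[i:i + 2] for i in range(len(term) - 1)}

    vocabulary = ["border", "boardroom", "bedroom", "borderline"]  # toy word list

    # Inverted index: bigram -> set of terms containing it.
    index = {}
    for t in vocabulary:
        for g in char_bigrams(t):
            index.setdefault(g, set()).add(t)

    def candidates(misspelled, min_overlap=2):
        """Vocabulary terms sharing at least min_overlap bigrams with the query."""
        counts = {}
        for g in char_bigrams(misspelled):
            for t in index.get(g, ()):
                counts[t] = counts.get(t, 0) + 1
        return {t for t, c in counts.items() if c >= min_overlap}

    # All four toy terms share >= 2 bigrams with bordroom; boardroom shares 6.
    # A second pass (e.g., edit distance) would rank or filter the survivors.
    print(candidates("bordroom"))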
Context-sensitive spelling correction
• Our example was: an asteroid that fell form the sky
• How can we correct form here?
• One idea: hit-based spelling correction
  – Retrieve "correct" terms close to each query term
  – For flew form munich: flea for flew, from for form, munch for munich
  – Now try all possible resulting phrases as queries with one word "fixed" at a time:
    · Try query "flea form munich"
    · Try query "flew from munich"
    · Try query "flew form munch"
  – The correct query "flew from munich" has the most hits.
• Suppose we have 7 alternatives for flew, 20 for form and 3 for munich. How many "corrected" phrases will we enumerate?

Context-sensitive spelling correction
• The hit-based algorithm we just outlined is not very efficient.
• More efficient alternative: look at the "collection" of queries, not documents.

General issues in spelling correction
• User interface
  – automatic vs. suggested correction
  – "Did you mean?" only works for one suggestion.
  – What about multiple possible corrections?
  – Tradeoff: simple vs. powerful UI
• Cost
  – Spelling correction is potentially expensive.
  – Avoid running it on every query?
  – Maybe just on queries that match few documents.
  – Guess: spelling correction of major search engines is efficient enough to be run on every query.

Exercise: Understand Peter Norvig's spelling corrector

❻ Soundex

Soundex
• Soundex is the basis for finding phonetic (as opposed to orthographic) alternatives.
• Example: chebyshev / tchebyscheff
• Algorithm:
  – Turn every token to be indexed into a 4-character reduced form
  – Do the same with query terms
  – Build and search an index on the reduced forms

Soundex algorithm
❶ Retain the first letter of the term.
❷ Change all occurrences of the following letters to '0' (zero): A, E, I, O, U, H, W, Y.
❸ Change letters to digits as follows:
  – B, F, P, V to 1
  – C, G, J, K, Q, S, X, Z to 2
  – D, T to 3
  – L to 4
  – M, N to 5
  – R to 6
❹ Repeatedly remove one out of each pair of consecutive identical digits.
❺ Remove all zeros from the resulting string; pad the resulting string with trailing zeros and return the first four positions, which will consist of a letter followed by three digits.

Example: Soundex of HERMAN
• Retain H
• ERMAN → 0RM0N
• 0RM0N → 06505
• 06505 → 06505 (no consecutive identical digits to remove)
• 06505 → 655
• Return H655
• Note: HERMANN will generate the same code.

How useful is Soundex?
• Not very – for information retrieval
• Ok for "high recall" tasks in other applications (e.g., Interpol)
• Zobel and Dart (1996) suggest better alternatives for phonetic matching in IR.

Exercise
• Compute the Soundex code of your last name.

Takeaway
• Tolerant retrieval: what to do if there is no exact match between query term and document term
• Wildcard queries
• Spelling correction

Resources
• Chapter 3 of IIR
• Resources at http://ifnlp.org/ir
  – Soundex demo
  – Levenshtein distance demo
  – Peter Norvig's spelling corrector
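For reference, a straightforward transcription of the Soundex algorithm above into Python (input is assumed to be purely alphabetic; that restriction is mine):

    SOUNDEX_CODES = {}
    for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                           ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            SOUNDEX_CODES[ch] = digit

    def soundex(term):
        term = term.upper()
        digits = [SOUNDEX_CODES[c] for c in term[1:]]  # steps 2-3: letters -> digits
        # Step 4: repeatedly remove one of each pair of consecutive identical digits.
        deduped = []
        for d in digits:
            if not deduped or deduped[-1] != d:
                deduped.append(d)
        # Step 5: drop zeros, pad with trailing zeros, return letter + three digits.
        code = term[0] + "".join(d for d in deduped if d != "0")
        return (code + "000")[:4]

    print(soundex("HERMAN"))   # H655
    print(soundex("HERMANN"))  # H655 -- same code, as the slide notes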