Dictionary of data structures and algorithms

Dictionary data structures for inverted indexes and dictionary data structure implementation
WilliamsMcmahon Profile Pic
WilliamsMcmahon,United States,Professional
Published Date:20-07-2017
Your Website URL(Optional)
Comment
Introduction to Information Retrieval Introduction to Information Retrieval Dictionaries and tolerant retrieval www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.1 Dictionary data structures for inverted indexes  The dictionary data structure stores the term vocabulary, document frequency, pointers to each postings list … in what data structure? www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.1 A naïve dictionary  An array of struct: char20 int Postings 20 bytes 4/8 bytes 4/8 bytes  How do we store a dictionary in memory efficiently?  How do we quickly look up elements at query time? www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.1 Dictionary data structures  Two main choices:  Hashtables  Trees  Some IR systems use hashtables, some trees www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.1 Hashtables  Each vocabulary term is hashed to an integer  (We assume you’ve seen hashtables before)  Pros:  Lookup is faster than for a tree: O(1)  Cons:  No easy way to find minor variants:  judgment/judgement  No prefix search tolerant retrieval  If vocabulary keeps growing, need to occasionally do the expensive operation of rehashing everything www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.1 Tree: binary tree Root a-m n-z a-hu hy-m n-sh si-z www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.1 Tree: B-tree n-z a-hu hy-m  Definition: Every internal nodel has a number of children in the interval a,b where a, b are appropriate natural numbers, e.g., 2,4. www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.1 Trees  Simplest: binary tree  More usual: B-trees  Trees require a standard ordering of characters and hence strings … but we typically have one  Pros:  Solves the prefix problem (terms starting with hyp)  Cons:  Slower: O(log M) and this requires balanced tree  Rebalancing binary trees is expensive  But B-trees mitigate the rebalancing problem www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.2 Wild-card queries:  mon: find all docs containing any word beginning with “mon”.  Easy with binary tree (or B-tree) lexicon: retrieve all words in range: mon ≤ w moo  mon: find words ending in “mon”: harder  Maintain an additional B-tree for terms backwards. Can retrieve all words in range: nom ≤ w non. Exercise: from this, how can we enumerate all terms meeting the wild-card query procent ? www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.2 Query processing  At this point, we have an enumeration of all terms in the dictionary that match the wild-card query.  We still have to look up the postings for each enumerated term.  E.g., consider the query: seate AND filer This may result in the execution of many Boolean AND queries. www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.2 B-trees handle ’s at the end of a query term  How can we handle ’s in the middle of query term?  cotion  We could look up co AND tion in a B-tree and intersect the two term sets  Expensive  The solution: transform wild-card queries so that the ’s occur at the end  This gives rise to the Permuterm Index. www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.2.1 Permuterm index  For term hello, index under:  hello, elloh, llohe, lohel, ohell, hello where is a special symbol.  Queries:  X lookup on X X lookup on X  X lookup on X X lookup on X  XY lookup on YX XYZ ??? Exercise Query = helo X=hel, Y=o Lookup ohel www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.2.1 Permuterm query processing  Rotate query wild-card to the right  Now use B-tree lookup as before.  Permuterm problem: ≈ quadruples lexicon size Empirical observation for English. www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.2.2 Bigram (k-gram) indexes  Enumerate all k-grams (sequence of k chars) occurring in any term  e.g., from text “April is the cruelest month” we get the 2-grams (bigrams) a,ap,pr,ri,il,l,i,is,s,t,th,he,e,c,cr,ru, ue,el,le,es,st,t, m,mo,on,nt,h  is a special word boundary symbol  Maintain a second inverted index from bigrams to dictionary terms that match each bigram. www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.2.2 Bigram index example  The k-gram index finds terms based on a query consisting of k-grams (here k=2). m mace madden mo among amortize on along among www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.2.2 Processing wild-cards  Query mon can now be run as  m AND mo AND on  Gets terms that match AND version of our wildcard query.  But we’d enumerate moon.  Must post-filter these terms against query.  Surviving enumerated terms are then looked up in the term-document inverted index.  Fast, space efficient (compared to permuterm). www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.2.2 Processing wild-card queries  As before, we must execute a Boolean query for each enumerated, filtered term.  Wild-cards can result in expensive query execution (very large disjunctions…)  pyth AND prog  If you encourage “laziness” people will respond Search Type your search terms, use ‘’ if you need to. E.g., Alex will match Alexander. www.ThesisScientist.com  Which web search engines allow wildcard queries?Introduction to Information Retrieval SPELLING CORRECTION www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.3 Spell correction  Two principal uses  Correcting document(s) being indexed  Correcting user queries to retrieve “right” answers  Two main flavors:  Isolated word  Check each word on its own for misspelling  Will not catch typos resulting in correctly spelled words  e.g., from  form  Context-sensitive  Look at surrounding words,  e.g., I flew form Heathrow to Narita. www.ThesisScientist.comIntroduction to Information Retrieval Sec. 3.3 Document correction  Especially needed for OCR’ed documents  Correction algorithms are tuned for this: rn/m  Can use domain-specific knowledge  E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing).  But also: web pages and even printed material have typos  Goal: the dictionary contains fewer misspellings  But often we don’t change the documents and instead fix the query-document mapping www.ThesisScientist.com