Question? Leave a message!

Constructing the inverted index

Constructing the inverted index
Introduction to Information Retrieval Introduction to Information Retrieval The term vocabulary and postings lists 1Introduction to Information Retrieval Overview ❶ Recap ❷ Documents ❸ Terms  General + NonEnglish  English ❹ Skip pointers ❺ Phrase queries 2Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + NonEnglish  English ❹ Skip pointers ❺ Phrase queries 3Introduction to Information Retrieval Inverted Index For each term t, we store a list of all documents that contain t. dictionary postings 4 4Introduction to Information Retrieval Intersecting two posting lists 5 5Introduction to Information Retrieval Constructing the inverted index: Sort postings 6 6Introduction to Information Retrieval Westlaw: Example queries Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company Query: “trade secret” /s disclos /s prevent /s employe Information need: Requirements for disabled people to be able to access a workplace Query: disab /p access /s worksite workplace (employment /3 place) Information need: Cases about a host’s responsibility for drunk guests Query: host /p (responsib liab) /p (intoxicat drunk) /p guest 7 7Introduction to Information Retrieval Does Google use the Boolean model  On Google, the default interpretation of a query w w . . .w 1 2 n is w AND w AND . . .AND w 1 2 n  Cases where you get hits that do not contain one of the wi :  anchor text  page contains variant of w (morphology, spelling correction, i synonym)  long queries (n large)  boolean expression generates very few hits  Simple Boolean vs. Ranking of result set  Simple Boolean retrieval returns matching documents in no particular order.  Google (and most well designed Boolean engines) rank the result set – they rank good hits (according to some estimator of relevance) higher than bad hits. 8 8Introduction to Information Retrieval Takeaway  Understanding of the basic unit of classical information retrieval systems: words and documents: What is a document, what is a term  Tokenization: how to get from raw text to words (or tokens)  More complex indexes: skip pointers and phrases 9 9Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + NonEnglish  English ❹ Skip pointers ❺ Phrase queries 10Introduction to Information Retrieval Documents  Last lecture: Simple Boolean retrieval system  Our assumptions were:  We know what a document is.  We can “machineread” each document.  This can be complex in reality. 11 11Introduction to Information Retrieval Parsing a document  We need to deal with format and language of each document.  What format is it in pdf, word, excel, html etc.  What language is it in  What character set is in use  Each of these is a classification problem, which we will study later in this course (IIR 13).  Alternative: use heuristics 12 12Introduction to Information Retrieval Format/Language: Complications  A single index usually contains terms of several languages.  Sometimes a document or its components contain multiple languages/formats.  French email with Spanish pdf attachment  What is the document unit for indexing  A file  An email  An email with 5 attachments  A group of files (ppt or latex in HTML)  Upshot: Answering the question “what is a document” is not trivial and requires some design decisions.  Also: XML 13 13Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + NonEnglish  English ❹ Skip pointers ❺ Phrase queries 14Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + NonEnglish  English ❹ Skip pointers ❺ Phrase queries 15Introduction to Information Retrieval Definitions  Word – A delimited string of characters as it appears in the text.  Term – A “normalized” word (case, morphology, spelling etc); an equivalence class of words.  Token – An instance of a word or term occurring in a document.  Type – The same as a term in most cases: an equivalence class of tokens. 16 16Introduction to Information Retrieval Normalization  Need to “normalize” terms in indexed text as well as query terms into the same form.  Example: We want to match U.S.A. and USA  We most commonly implicitly define equivalence classes of terms.  Alternatively: do asymmetric expansion  window → window, windows  windows → Windows, windows  Windows (no expansion)  More powerful, but less efficient  Why don’t you want to put window, Window, windows, and Windows in the same equivalence class 17 17Introduction to Information Retrieval Normalization: Other languages  Normalization and language detection interact.  PETER WILL NICHT MIT. → MIT = mit  He got his PhD from MIT. → MIT ≠ mit 18 18Introduction to Information Retrieval Recall: Inverted index construction  Input:  Output:  Each token is a candidate for a postings entry.  What are valid tokens to emit 19 19Introduction to Information Retrieval Exercises In June, the dog likes to chase the cat in the barn. – How many word tokens How many word types Why tokenization is difficult – even in English. Tokenize: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing. 20 20Introduction to Information Retrieval Tokenization problems: One word or two (or several)  HewlettPackard  Stateoftheart  coeducation  the holdhimbackanddraghimaway maneuver  data base  San Francisco  Los Angelesbased company  cheap San FranciscoLos Angeles fares  York University vs. New York University 21 21Introduction to Information Retrieval Numbers  3/20/91  20/3/91  Mar 20, 1991  B52   (800) 2342333  800.234.2333  Older IR systems may not index numbers . . .  . . . but generally it’s a useful feature. 22 22Introduction to Information Retrieval Chinese: No whitespace 23 23Introduction to Information Retrieval Ambiguous segmentation in Chinese The two characters can be treated as one word meaning ‘monk’ or as a sequence of two words meaning ‘and’ and ‘still’. 24 24Introduction to Information Retrieval Other cases of “no whitespace”  Compounds in Dutch, German, Swedish  Computerlinguistik → Computer + Linguistik  Lebensversicherungsgesellschaftsangestellter  → leben + versicherung + gesellschaft + angestellter  Inuit: tusaatsiarunnanngittualuujunga (I can’t hear very well.)  Many other languages with segmentation difficulties: Finnish, Urdu, . . . 25 25Introduction to Information Retrieval Japanese 4 different “alphabets”: Chinese characters, hiragana syllabary for inflectional endings and functional words, katakana syllabary for transcription of foreign words and other uses, and latin. No spaces (as in Chinese). End user can express query entirely in hiragana 26 26Introduction to Information Retrieval Arabic script 27 27Introduction to Information Retrieval Arabic script: Bidirectionality ← → ← → ← START ‘Algeria achieved its independence in 1962 after 132 years of French occupation.’ Bidirectionality is not a problem if text is coded in Unicode. 28 28Introduction to Information Retrieval Accents and diacritics  Accents: résumé vs. resume (simple omission of accent)  Umlauts: Universität vs. Universitaet (substitution with special letter sequence “ae”)  Most important criterion: How are users likely to write their queries for these words  Even in languages that standardly have accents, users often do not type them. (Polish) 29 29Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + NonEnglish  English ❹ Skip pointers ❺ Phrase queries 30Introduction to Information Retrieval Case folding  Reduce all letters to lower case  Possible exceptions: capitalized words in midsentence  MIT vs. mit  Fed vs. fed  It’s often best to lowercase everything since users will use lowercase regardless of correct capitalization. 31 31Introduction to Information Retrieval Stop words  stop words = extremely common words which would appear to be of little value in helping select documents matching a user need  Examples: a, an, and, are, as, at, be, by, for, from, has, he, in, is, it, its, of, on, that, the, to, was, were, will, with  Stop word elimination used to be standard in older IR systems.  But you need stop words for phrase queries, e.g. “King of Denmark”  Most web search engines index stop words. 32 32Introduction to Information Retrieval More equivalence classing  Soundex: IIR 3 (phonetic equivalence, Muller = Mueller)  Thesauri: IIR 9 (semantic equivalence, car = automobile) 33 33Introduction to Information Retrieval Lemmatization  Reduce inflectional/variant forms to base form  Example: am, are, is → be  Example: car, cars, car’s, cars’ → car  Example: the boy’s cars are different colors → the boy car be different color  Lemmatization implies doing “proper” reduction to dictionary headword form (the lemma).  Inflectional morphology (cutting → cut) vs. derivational morphology (destruction → destroy) 34 34Introduction to Information Retrieval Stemming  Definition of stemming: Crude heuristic process that chops off the ends of words in the hope of achieving what “principled” lemmatization attempts to do with a lot of linguistic knowledge.  Language dependent  Often inflectional and derivational  Example for derivational: automate, automatic, automation all reduce to automat 35 35Introduction to Information Retrieval Porter algorithm  Most common algorithm for stemming English  Results suggest that it is at least as good as other stemming options  Conventions + 5 phases of reductions  Phases are applied sequentially  Each phase consists of a set of commands.  Sample command: Delete final ement if what remains is longer than 1 character  replacement → replac  cement → cement  Sample convention: Of the rules in a compound command, select the one that applies to the longest suffix. 36 36Introduction to Information Retrieval Porter stemmer: A few rules Example Rule caresses → caress SSES → SS ponies → poni IES → I caress → caress SS → SS cats → cat S → 37 37Introduction to Information Retrieval Three stemmers: A comparison Sample text: Such an analysis can reveal features that are not easil visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation Porter stemmer: such an analysi can reveal featur that ar not easili visibl from the variat in the individu gene and can lead to pictur of express that is more biolog transpar and access to interpret Lovins stemmer: such an analys can reve featur that ar not eas vis from th vari in th individu gen and can lead to a pictur of expres that is mor biolog transpar and acces to interpres Paice stemmer: such an analys can rev feat that are not easy vis from the vary in the individ gen and can lead to a pict of express that is mor biolog transp and access to interpret 38 38Introduction to Information Retrieval Does stemming improve effectiveness  In general, stemming increases effectiveness for some queries, and decreases effectiveness for others.  Queries where stemming is likely to help: tartan sweaters, sightseeing tour san francisco (equivalence classes: sweater,sweaters, tour,tours)  Porter Stemmer equivalence class oper contains all of operate operating operates operation operative operatives operational.  Queries where stemming hurts: operational AND research, operating AND system, operative AND dentistry 39 39Introduction to Information Retrieval Exercise: What does Google do  Stop words  Normalization  Tokenization  Lowercasing  Stemming  Nonlatin alphabets  Umlauts  Compounds  Numbers 40 40Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + NonEnglish  English ❹ Skip pointers ❺ Phrase queries 41Introduction to Information Retrieval Recall basic intersection algorithm  Linear in the length of the postings lists.  Can we do better 42 42Introduction to Information Retrieval Skip pointers  Skip pointers allow us to skip postings that will not figure in the search results.  This makes intersecting postings lists more efficient.  Some postings lists contain several million entries – so efficiency can be an issue even if basic intersection is linear.  Where do we put skip pointers  How do we make sure intersection results are correct 43 43Introduction to Information Retrieval Basic idea 44 44Introduction to Information Retrieval Skip lists: Larger example 45 45Introduction to Information Retrieval Intersection with skip pointers 46 46Introduction to Information Retrieval Where do we place skips  Tradeoff: number of items skipped vs. frequency skip can be taken  More skips: Each skip pointer skips only a few items, but we can frequently use it.  Fewer skips: Each skip pointer skips many items, but we can not use it very often. 47 47Introduction to Information Retrieval Where do we place skips (cont)  Simple heuristic: for postings list of length P, use evenlyspaced skip pointers.  This ignores the distribution of query terms.  Easy if the index is static; harder in a dynamic environment because of updates.  How much do skip pointers help  They used to help a lot.  With today’s fast CPUs, they don’t help that much anymore. 48 48Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + NonEnglish  English ❹ Skip pointers ❺ Phrase queries 49Introduction to Information Retrieval Phrase queries  We want to answer a query such as stanford university – as a phrase.  Thus The inventor Stanford Ovshinsky never went to university should not be a match.  The concept of phrase query has proven easily understood by users.  About 10 of web queries are phrase queries.  Consequence for inverted index: it no longer suffices to store docIDs in postings lists.  Two ways of extending the inverted index:  biword index  positional index 50 50Introduction to Information Retrieval Biword indexes  Index every consecutive pair of terms in the text as a phrase.  For example, Friends, Romans, Countrymen would generate two biwords: “friends romans” and “romans countrymen”  Each of these biwords is now a vocabulary term.  Twoword phrases can now easily be answered. 51 51Introduction to Information Retrieval Longer phrase queries  A long phrase like “stanford university palo alto” can be represented as the Boolean query “STANFORD UNIVERSITY” AND “UNIVERSITY PALO” AND “PALO ALTO”  We need to do postfiltering of hits to identify subset that actually contains the 4word phrase. 52 52Introduction to Information Retrieval Extended biwords  Parse each document and perform partofspeech tagging  Bucket the terms into (say) nouns (N) and articles/prepositions (X)  Now deem any string of terms of the form NXN to be an extended biword  Examples: catcher in the rye N X X N king of Denmark N X N  Include extended biwords in the term vocabulary  Queries are processed accordingly 53 53Introduction to Information Retrieval Issues with biword indexes  Why are biword indexes rarely used  False positives, as noted above  Index blowup due to very large term vocabulary 54 54Introduction to Information Retrieval Positional indexes  Positional indexes are a more efficient alternative to biword indexes.  Postings lists in a nonpositional index: each posting is just a docID  Postings lists in a positional index: each posting is a docID and a list of positions 55 55Introduction to Information Retrieval Positional indexes: Example Query: “to be or not to be ” TO, 993427: 1 2 3 4 5 6 ‹ 1: ‹7, 18, 33, 72, 86, 231›; 2: ‹1, 17, 74, 222, 255›; 4: ‹8, 16, 190, 429, 433›; 5: ‹363, 367›; 7: ‹13, 23, 191›; . . . › BE, 178239: ‹ 1: ‹17, 25›; 4: ‹17, 191, 291, 430, 434›; 5: ‹14, 19, 101›; . . . › Document 4 is a match 56 56Introduction to Information Retrieval Proximity search  We just saw how to use a positional index for phrase searches.  We can also use it for proximity search.  For example: employment /4 place  Find all documents that contain EMPLOYMENT and PLACE within 4 words of each other.  Employment agencies that place healthcare workers are seeing growth is a hit.  Employment agencies that have learned to adapt now place healthcare workers is not a hit. 57 57Introduction to Information Retrieval Proximity search  Use the positional index  Simplest algorithm: look at crossproduct of positions of (i) EMPLOYMENT in document and (ii) PLACE in document  Very inefficient for frequent words, especially stop words  Note that we want to return the actual matching positions, not just a list of documents.  This is important for dynamic summaries etc. 58 58Introduction to Information Retrieval “Proximity” intersection 59 59Introduction to Information Retrieval Combination scheme  Biword indexes and positional indexes can be profitably combined.  Many biwords are extremely frequent: Michael Jackson, Britney Spears etc  For these biwords, increased speed compared to positional postings intersection is substantial.  Combination scheme: Include frequent biwords as vocabulary terms in the index. Do all other phrases by positional intersection.  Williams et al. (2004) evaluate a more sophisticated mixed indexing scheme. Faster than a positional index, at a cost of 26 more space for index. 60 60Introduction to Information Retrieval “Positional” queries on Google  For web search engines, positional queries are much more expensive than regular Boolean queries.  Let’s look at the example of phrase queries.  Why are they more expensive than regular Boolean queries  Can you demonstrate on Google that phrase queries are more expensive than Boolean queries 61 61Introduction to Information Retrieval Takeaway  Understanding of the basic unit of classical information retrieval systems: words and documents: What is a document, what is a term  Tokenization: how to get from raw text to words (or tokens)  More complex indexes: skip pointers and phrases 62 62Introduction to Information Retrieval Resources  Chapter 2 of IIR  Resources at  Porter stemmer 63 63
Website URL
Document Information
User Name:
User Type:
United States
Uploaded Date: