Question? Leave a message!




Constructing the inverted index

Constructing the inverted index
WilliamsMcmahon Profile Pic
WilliamsMcmahon,United States,Professional
Published Date:20-07-2017
Website URL
Comment
Introduction to Information Retrieval Introduction to Information Retrieval The term vocabulary and postings lists 1Introduction to Information Retrieval Overview ❶ Recap ❷ Documents ❸ Terms  General + Non-English  English ❹ Skip pointers ❺ Phrase queries 2Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + Non-English  English ❹ Skip pointers ❺ Phrase queries 3Introduction to Information Retrieval Inverted Index For each term t, we store a list of all documents that contain t. dictionary postings 4 4Introduction to Information Retrieval Intersecting two posting lists 5 5Introduction to Information Retrieval Constructing the inverted index: Sort postings 6 6Introduction to Information Retrieval Westlaw: Example queries Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company Query: “trade secret” /s disclos /s prevent /s employe Information need: Requirements for disabled people to be able to access a workplace Query: disab /p access /s work-site work-place (employment /3 place) Information need: Cases about a host’s responsibility for drunk guests Query: host /p (responsib liab) /p (intoxicat drunk) /p guest 7 7Introduction to Information Retrieval Does Google use the Boolean model?  On Google, the default interpretation of a query w w . . .w 1 2 n is w AND w AND . . .AND w 1 2 n  Cases where you get hits that do not contain one of the wi :  anchor text  page contains variant of w (morphology, spelling correction, i synonym)  long queries (n large)  boolean expression generates very few hits  Simple Boolean vs. Ranking of result set  Simple Boolean retrieval returns matching documents in no particular order.  Google (and most well designed Boolean engines) rank the result set – they rank good hits (according to some estimator of relevance) higher than bad hits. 8 8Introduction to Information Retrieval Take-away  Understanding of the basic unit of classical information retrieval systems: words and documents: What is a document, what is a term?  Tokenization: how to get from raw text to words (or tokens)  More complex indexes: skip pointers and phrases 9 9Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + Non-English  English ❹ Skip pointers ❺ Phrase queries 10Introduction to Information Retrieval Documents  Last lecture: Simple Boolean retrieval system  Our assumptions were:  We know what a document is.  We can “machine-read” each document.  This can be complex in reality. 11 11Introduction to Information Retrieval Parsing a document  We need to deal with format and language of each document.  What format is it in? pdf, word, excel, html etc.  What language is it in?  What character set is in use?  Each of these is a classification problem, which we will study later in this course (IIR 13).  Alternative: use heuristics 12 12Introduction to Information Retrieval Format/Language: Complications  A single index usually contains terms of several languages.  Sometimes a document or its components contain multiple languages/formats.  French email with Spanish pdf attachment  What is the document unit for indexing?  A file?  An email?  An email with 5 attachments?  A group of files (ppt or latex in HTML)?  Upshot: Answering the question “what is a document?” is not trivial and requires some design decisions.  Also: XML 13 13Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + Non-English  English ❹ Skip pointers ❺ Phrase queries 14Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + Non-English  English ❹ Skip pointers ❺ Phrase queries 15Introduction to Information Retrieval Definitions  Word – A delimited string of characters as it appears in the text.  Term – A “normalized” word (case, morphology, spelling etc); an equivalence class of words.  Token – An instance of a word or term occurring in a document.  Type – The same as a term in most cases: an equivalence class of tokens. 16 16Introduction to Information Retrieval Normalization  Need to “normalize” terms in indexed text as well as query terms into the same form.  Example: We want to match U.S.A. and USA  We most commonly implicitly define equivalence classes of terms.  Alternatively: do asymmetric expansion  window → window, windows  windows → Windows, windows  Windows (no expansion)  More powerful, but less efficient  Why don’t you want to put window, Window, windows, and Windows in the same equivalence class? 17 17Introduction to Information Retrieval Normalization: Other languages  Normalization and language detection interact.  PETER WILL NICHT MIT. → MIT = mit  He got his PhD from MIT. → MIT ≠ mit 18 18Introduction to Information Retrieval Recall: Inverted index construction  Input:  Output:  Each token is a candidate for a postings entry.  What are valid tokens to emit? 19 19Introduction to Information Retrieval Exercises In June, the dog likes to chase the cat in the barn. – How many word tokens? How many word types? Why tokenization is difficult – even in English. Tokenize: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing. 20 20