Match the Vocabulary term with its meaning

The term Vocabulary and postings lists ppt and match each vocabulary term to its definition the term vocabulary refers to the structure of sentences
WilliamsMcmahon Profile Pic
WilliamsMcmahon,United States,Professional
Published Date:20-07-2017
Your Website URL(Optional)
Comment
Introduction to Information Retrieval Introduction to Information Retrieval The term vocabulary and postings lists 1Introduction to Information Retrieval Overview ❶ Recap ❷ Documents ❸ Terms  General + Non-English  English ❹ Skip pointers ❺ Phrase queries 2Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + Non-English  English ❹ Skip pointers ❺ Phrase queries 3Introduction to Information Retrieval Inverted Index For each term t, we store a list of all documents that contain t. dictionary postings 4 4Introduction to Information Retrieval Intersecting two posting lists 5 5Introduction to Information Retrieval Constructing the inverted index: Sort postings 6 6Introduction to Information Retrieval Westlaw: Example queries Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company Query: “trade secret” /s disclos /s prevent /s employe Information need: Requirements for disabled people to be able to access a workplace Query: disab /p access /s work-site work-place (employment /3 place) Information need: Cases about a host’s responsibility for drunk guests Query: host /p (responsib liab) /p (intoxicat drunk) /p guest 7 7Introduction to Information Retrieval Does Google use the Boolean model?  On Google, the default interpretation of a query w w . . .w 1 2 n is w AND w AND . . .AND w 1 2 n  Cases where you get hits that do not contain one of the wi :  anchor text  page contains variant of w (morphology, spelling correction, i synonym)  long queries (n large)  boolean expression generates very few hits  Simple Boolean vs. Ranking of result set  Simple Boolean retrieval returns matching documents in no particular order.  Google (and most well designed Boolean engines) rank the result set – they rank good hits (according to some estimator of relevance) higher than bad hits. 8 8Introduction to Information Retrieval Take-away  Understanding of the basic unit of classical information retrieval systems: words and documents: What is a document, what is a term?  Tokenization: how to get from raw text to words (or tokens)  More complex indexes: skip pointers and phrases 9 9Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + Non-English  English ❹ Skip pointers ❺ Phrase queries 10Introduction to Information Retrieval Documents  Last lecture: Simple Boolean retrieval system  Our assumptions were:  We know what a document is.  We can “machine-read” each document.  This can be complex in reality. 11 11Introduction to Information Retrieval Parsing a document  We need to deal with format and language of each document.  What format is it in? pdf, word, excel, html etc.  What language is it in?  What character set is in use?  Each of these is a classification problem, which we will study later in this course (IIR 13).  Alternative: use heuristics 12 12Introduction to Information Retrieval Format/Language: Complications  A single index usually contains terms of several languages.  Sometimes a document or its components contain multiple languages/formats.  French email with Spanish pdf attachment  What is the document unit for indexing?  A file?  An email?  An email with 5 attachments?  A group of files (ppt or latex in HTML)?  Upshot: Answering the question “what is a document?” is not trivial and requires some design decisions.  Also: XML 13 13Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + Non-English  English ❹ Skip pointers ❺ Phrase queries 14Introduction to Information Retrieval Outline ❶ Recap ❷ Documents ❸ Terms  General + Non-English  English ❹ Skip pointers ❺ Phrase queries 15Introduction to Information Retrieval Definitions  Word – A delimited string of characters as it appears in the text.  Term – A “normalized” word (case, morphology, spelling etc); an equivalence class of words.  Token – An instance of a word or term occurring in a document.  Type – The same as a term in most cases: an equivalence class of tokens. 16 16Introduction to Information Retrieval Normalization  Need to “normalize” terms in indexed text as well as query terms into the same form.  Example: We want to match U.S.A. and USA  We most commonly implicitly define equivalence classes of terms.  Alternatively: do asymmetric expansion  window → window, windows  windows → Windows, windows  Windows (no expansion)  More powerful, but less efficient  Why don’t you want to put window, Window, windows, and Windows in the same equivalence class? 17 17Introduction to Information Retrieval Normalization: Other languages  Normalization and language detection interact.  PETER WILL NICHT MIT. → MIT = mit  He got his PhD from MIT. → MIT ≠ mit 18 18Introduction to Information Retrieval Recall: Inverted index construction  Input:  Output:  Each token is a candidate for a postings entry.  What are valid tokens to emit? 19 19Introduction to Information Retrieval Exercises In June, the dog likes to chase the cat in the barn. – How many word tokens? How many word types? Why tokenization is difficult – even in English. Tokenize: Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing. 20 20