Introduction to Information Retrieval

introduction to information retrieval exercise solutions manual pdf organizing knowledge an introduction to information retrieval solution manual for introduction to information retrieval
ImogenCameron Profile Pic
Published Date:14-07-2017
Your Website URL(Optional)
P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 Introduction to Information Retrieval Introduction to Information Retrieval is the first textbook with a coherent treat- ment of classical and web information retrieval, including web search and the related areas of text classification and text clustering. Written from a computer science perspective, it gives an up-to-date treatment of all aspects of the design and implementation of systems for gathering, indexing, and searching documents and of methods for evaluating systems, along with an introduction to the use of machine learning methods on text collections. Designed as the primary text for a graduate or advanced undergraduate course in information retrieval, the book will also interest researchers and professionals. A complete set of lecture slides and exercises that accompany the book are available on the web. Christopher D. Manning is Associate Professor of Computer Science and Lin- guistics at Stanford University. Prabhakar Raghavan is Head of Yahoo Research and a Consulting Professor of Computer Science at Stanford University. Hinrich Schutze ¨ is Chair of Theoretical Computational Linguistics at the In- stitute for Natural Language Processing, University of Stuttgart. iP1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 1Boolean retrieval The meaning of the term information retrieval (IR) can be very broad. Just get- ting a credit card out of your wallet so that you can type in the card number is a form of information retrieval. However, as an academic field of study, information information retrieval might be defined thus: retrieval Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). As defined in this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar pro- fessional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web 1 search engine or search their email. Information retrieval is fast becoming the dominant form of information access, overtaking traditional database- style searching (the sort that is going on when a clerk says to you: “I’m sorry, I can only look up your order if you can give me your order ID”). Information retrieval can also cover other kinds of data and informa- tion problems beyond that specified in the core definition above. The term “unstructured data” refers to data that does not have clear, semantically overt, easy-for-a-computer structure. It is the opposite of structured data, the canonical example of which is a relational database, of the sort companies usually use to maintain product inventories and personnel records. In reality, almost no data are truly “unstructured.” This is definitely true of all text data if you count the latent linguistic structure of human languages. But even ac- cepting that the intended notion of structure is overt structure, most text has structure, such as headings, paragraphs, and footnotes, which is commonly represented in documents by explicit markup (such as the coding underlying web pages). Information retrieval is also used to facilitate “semistructured” 1 In modern parlance, the word “search” has tended to replace “(information) retrieval”; the term “search” is quite ambiguous, but in context we use the two synonymously. 1P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 2 Boolean retrieval search such as finding a document where the title contains Java and the body contains threading. The field of IR also covers supporting users in browsing or filtering docu- ment collections or further processing a set of retrieved documents. Given a set of documents, clustering is the task of coming up with a good grouping of the documents based on their contents. It is similar to arranging books on a bookshelf according to their topic. Given a set of topics, standing informa- tion needs, or other categories (such as suitability of texts for different age groups), classification is the task of deciding which class(es), if any, each of a set of documents belongs to. It is often approached by first manually classi- fying some documents and then hoping to be able to classify new documents automatically. Information retrieval systems can also be distinguished by the scale at which they operate, and it is useful to distinguish three prominent scales. In web search, the system has to provide search over billions of documents stored on millions of computers. Distinctive issues are needing to gather doc- uments for indexing, being able to build systems that work efficiently at this enormous scale, and handling particular aspects of the web, such as the ex- ploitation of hypertext and not being fooled by site providers manipulating page content in an attempt to boost their search engine rankings, given the commercial importance of the web. We focus on all these issues in Chap- ters 19–21. At the other extreme is personal information retrieval.Inthe last few years, consumer operating systems have integrated information retrieval (such as Apple’s Mac OS X Spotlight or Windows Vista’s Instant Search). Email programs usually not only provide search but also text classification: they at least provide a spam (junk mail) filter, and commonly also provide either manual or automatic means for classifying mail so that it can be placed directly into particular folders. Distinctive issues here include handling the broad range of document types on a typical personal computer, and mak- ing the search system maintenance free and sufficiently lightweight in terms of startup, processing, and disk space usage that it can run on one machine without annoying its owner. In between is the space ofenterprise,institutional, and domain-specific search, where retrieval might be provided for collections such as a corporation’s internal documents, a database of patents, or research articles on biochemistry. In this case, the documents are typically stored on centralized file systems and one or a handful of dedicated machines provide search over the collection. This book contains techniques of value over this whole spectrum, but our coverage of some aspects of parallel and distributed search in web-scale search systems is comparatively light owing to the rel- atively small published literature on the details of such systems. However, outside of a handful of web search companies, a software developer is most likely to encounter the personal search and enterprise scenarios. In this chapter, we begin with a very simple example of an IR problem, and introduce the idea of a term-document matrix (Section 1.1) and theP1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 1.1 An example information retrieval problem 3 central inverted index data structure (Section 1.2). We then examine the Boo- lean retrieval model and how Boolean queries are processed (Sections 1.3 and 1.4). 1.1Anexampleinformationretrievalproblem A fat book that many people own is Shakespeare’s Collected Works. Suppose you wanted to determine which plays of Shakespeare contain the words Brutus and Caesar and not Calpurnia. One way to do that is to start at the beginning and to read through all the text, noting for each play whether it contains Brutus and Caesar and excluding it from consideration if it con- tains Calpurnia. The simplest form of document retrieval is for a computer to do this sort of linear scan through documents. This process is commonly grep referred to as grepping through text, after the Unix command grep, which performs this process. Grepping through text can be a very effective process, especially given the speed of modern computers, and often allows useful possibilities for wildcard pattern matching through the use of regular expres- sions. With modern computers, for simple querying of modest collections (the size of Shakespeare’s Collected Works is a bit under one million words of text in total), you really need nothing more. But for many purposes, you do need more: 1. To process large document collections quickly. The amount of online data has grown at least as quickly as the speed of computers, and we would now like to be able to search collections that total in the order of billions to trillions of words. 2. To allow more flexible matching operations. For example, it is impractical to perform the query Romans near countrymen with grep, where near might be defined as “within 5 words” or “within the same sentence.” 3. To allow ranked retrieval. In many cases, you want the best answer to an information need among many documents that contain certain words. index The way to avoid linearly scanning the texts for each query is to index the documents in advance. Let us stick withShakespeare’sCollectedWorks,anduse it to introduce the basics of the Boolean retrieval model. Suppose we record for each document – here a play of Shakespeare’s – whether it contains each word out of all the words Shakespeare used (Shakespeare used about 32,000 incidence different words). The result is a binary term-document incidencematrix,asin matrix Figure 1.1.Terms are the indexed units (further discussed in Section 2.2); they term are usually words, and for the moment you can think of them as words, but the information retrieval literature normally speaks of terms because some of them, such as perhaps I-9 or Hong Kong are not usually thought of as words. Now, depending on whether we look at the matrix rows or columns, we canP1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 4 Boolean retrieval Antony Julius The Hamlet Othello Macbeth . . . and Caesar Tempest Cleopatra Antony 1 1 0 0 0 1 Brutus 1 1 0 1 0 0 Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 Cleopatra 1 0 0 0 0 0 mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0 ... Figure 1.1 A term-document incidence matrix. Matrix element (t, d) is 1 if the play in column d contains the word in row t, and is 0 otherwise. have a vector for each term, which shows the documents it appears in, or a 2 vector for each document, showing the terms that occur in it. To answer the query Brutus and Caesar and not Calpurnia, we take the vectors for Brutus, Caesar and Calpurnia, complement the last, and then do a bitwiseand: 110100and 110111and 101111 = 100100 The answers for this query are thus Antony and Cleopatra and Hamlet (Fig- ure 1.2). TheBoolean retrievalmodel is a model for information retrieval in which we Boolean can pose any query which is in the form of a Boolean expression of terms, retrieval that is, in which terms are combined with the operators and, or,and not. model The model views each document as just a set of words. Let us now consider a more realistic scenario, simultaneously using the opportunity to introduce some terminology and notation. Suppose we have document N = 1 million documents. By documents we mean whatever units we have decided to build a retrieval system over. They might be individual memos or chapters of a book (see Section 2.1.2 (page 20) for further discussion). We refer to the group of documents over which we perform retrieval as the (doc- collection ument) collection. It is sometimes also referred to as a corpus (a body of texts). corpus Suppose each document is about 1,000 words long (2–3 book pages). If we assume an average of 6 bytes per word including spaces and punctuation, then this is a document collection about 6 gigabytes (GB) in size. Typically, there might be about M = 500,000 distinct terms in these documents. There is nothing special about the numbers we have chosen, and they might vary by an order of magnitude or more, but they give us some idea of the dimen- sions of the kinds of problems we need to handle. We will discuss and model these size assumptions in Section 5.1 (page 79). adhoc Our goal is to develop a system to address the ad hoc retrieval task. This is retrieval the most standard IR task. In it, a system aims to provide documents from 2 Formally, we take the transpose of the matrix to be able to get the terms as column vectors.P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 1.1 An example information retrieval problem 5 Antony and Cleopatra, Act III, Scene ii Agrippa Aside to Domitius Enobarbus: Why, Enobarbus, When Antony found Julius Caesar dead, He cried almost to roaring; and he wept When at Philippi he found Brutus slain. Hamlet, Act III, Scene ii Lord Polonius: I did enact Julius Caesar: I was killed i’ the Capitol; Brutus killed me. Figure1.2 Results from Shakespeare for the query Brutusand Caesarandnot Calpurnia. within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query. An information information need is the topic about which the user desires to know more, and need is differentiated from a query, which is what the user conveys to the com- query puter in an attempt to communicate the information need. A document is relevance relevant if it is one that the user perceives as containing information of value with respect to their personal information need. Our example above was rather artificial in that the information need was defined in terms of par- ticular words, whereas, usually a user is interested in a topic like “pipeline leaks” and would like to find relevant documents regardless of whether they precisely use those words or express the concept with other words such as effectiveness pipeline rupture. To assess the effectiveness of an IR system (the quality of its search results), a user usually wants to know two key statistics about the system’s returned results for a query: precision Precision: What fraction of the returned results are relevant to the informa- tion need? recall Recall: What fraction of the relevant documents in the collection were re- turned by the system? Detailed discussion of relevance and evaluation measures including preci- sion and recall is found in Chapter 8. We now cannot build a term-document matrix in a naive way. A 500K × 1M matrix has half-a-trillion 0’s and 1’s – too many to fit in a computer’s memory. But the crucial observation is that the matrix is extremely sparse, that is, it has few nonzero entries. Because each document is 1,000 words long, the matrix has no more than one billion 1’s, so a minimum of 99.8% of the cells are zero. A much better representation is to record only the things that do occur, that is, the 1 positions. This idea is central to the first major concept in information retrieval, inverted the inverted index. The name is actually redundant: an index always maps index back from terms to the parts of a document where they occur. Nevertheless, inverted index, or sometimes inverted file, has become the standard term inP1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 6 Boolean retrieval Brutus −→ 1 2 4 11 31 45 173 174 Caesar −→ 1 2 4 5 6 16 57 132 ... Calpurnia −→ 2 31 54 101 . . .     Dictionary Postings Figure1.3 The two parts of an inverted index. The dictionary is commonly kept in memory, with pointers to each postings list, which is stored on disk. 3 IR. The basic idea of an inverted index is shown in Figure 1.3. We keep a dictionary dictionary of terms (sometimes also referred to as a vocabulary or lexicon;in vocabulary this book, we usedictionary for the data structure andvocabulary for the set of terms). Then, for each term, we have a list that records which documents the lexicon term occurs in. Each item in the list – which records that a term appeared in a document (and, later, often, the positions in the document) – is convention- 4 posting ally called a posting. The list is then called a postings list (or inverted list), postingslist and all the postings lists taken together are referred to as the postings.The postings dictionary in Figure 1.3 has been sorted alphabetically and each postings list is sorted by document ID. We see why this is useful in Section 1.3; later, we also consider alternatives to doing this (Section 7.1.5). 1.2Afirsttakeatbuildinganinvertedindex To gain the speed benefits of indexing at retrieval time, we have to build the index in advance. The major steps in this are: 1. Collect the documents to be indexed: Friends, Romans, countrymen. So let it be with Caesar ... 2. Tokenize the text, turning each document into a list of tokens: Friends Romans countrymen So ... 3 Some IR researchers prefer the term inverted file, but expressions like index construction and index compression are much more common than inverted file construction and inverted file com- pression. For consistency, we use (inverted) index throughout this book. 4 In a (nonpositional) inverted index, a posting is just a document ID, but it is inherently asso- ciated with a term, via the postings list it is placed on; sometimes we will also talk of a (term, docID) pair as a posting. P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 1.2 A first take at building an inverted index 7 3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms: friend roman countryman so ... 4. Index the documents that each term occurs in by creating an inverted in- dex, consisting of a dictionary and postings. We define and discuss the earlier stages of processing, that is, steps 1–3, in Section 2.2. Until then you can think of tokens and normalized tokens as also loosely equivalent to words. Here, we assume that the first three steps have already been done, and we examine building a basic inverted index by sort- based indexing. Within a document collection, we assume that each document has a unique docID serial number, known as the document identifier (docID). During index con- struction, we can simply assign successive integers to each new document when it is first encountered. The input to indexing is a list of normalized to- kens for each document, which we can equally think of as a list of pairs of sorting term and docID, as in Figure 1.4. The core indexing step is sorting this list so that the terms are alphabetical, giving us the representation in the middle column of Figure 1.4. Multiple occurrences of the same term from the same 5 document are then merged. Instances of the same term are then grouped, and the result is split into a dictionary and postings, as shown in the right column of Figure 1.4. Because a term generally occurs in a number of doc- uments, this data organization already reduces the storage requirements of the index. The dictionary also records some statistics, such as the number of document documents which contain each term (the document frequency, which is here frequency also the length of each postings list). This information is not vital for a basic Boolean search engine, but it allows us to improve the efficiency of the search engine at query time, and it is a statistic later used in many ranked retrieval models. The postings are secondarily sorted by docID. This provides the ba- sis for efficient query processing. This inverted index structure is essentially without rival as the most efficient structure for supporting ad hoc text search. In the resulting index, we pay for storage of both the dictionary and the postings lists. The latter are much larger, but the dictionary is commonly kept in memory, and postings lists are normally kept on disk, so the size of each is important. In Chapter 5, we examine how each can be optimized for storage and access efficiency. What data structure should be used for a postings list? A fixed length array would be wasteful; some words occur in many docu- ments, and others in very few. For an in-memory postings list, two good al- ternatives are singly linked lists or variable length arrays.Singlylinkedlists al- low cheap insertion of documents into postings lists (following updates, such as when recrawling the web for updated documents), and naturally extend 5 Unix users can note that these steps are similar to use of thesort and thenuniq commands. P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 8 Boolean retrieval Doc1 Doc2 I did enact Julius Caesar: I was So let it be with Caesar. The noble killed i’ the Capitol; Brutus killed Brutus hath told you Caesar was me. ambitious: term docID term docID I1 ambitious 2 did 1 be 2 term doc.freq. → postingslists enact 1 brutus 1 ambitious 1 → 2 julius 1 brutus 2 be 1 → 2 caesar 1 capitol 1 brutus 2 → 1 → 2 I1 caesar 1 → 1 capitol 1 was 1 caesar 2 killed 1 caesar 2 caesar 2 → 1 → 2 i’ 1 did 1 did 1 → 1 the 1 enact 1 enact 1 → 1 capitol 1 hath 1 hath 1 → 2 brutus 1I1 → 1 I 1 killed 1I1 i’ 1 → 1 me 1 i’ 1 =⇒ =⇒ it 1 → 2 so 2 it 2 julius 1 → 1 let 2 julius 1 killed 1 → 1 it 2 killed 1 let 1 → 2 be 2 killed 1 → 1 me 1 with 2 let 2 → 2 noble 1 caesar 2 me 1 the 2 noble 2 so 1 → 2 noble 2 so 2 the 2 → 1 → 2 brutus 2 the 1 told 1 → 2 hath 2 the 2 you 1 → 2 told 2 told 2 was 2 → 1 → 2 you 2 you 2 → 2 with 1 caesar 2 was 1 was 2 was 2 ambitious 2 with 2 Figure 1.4 Building an index by sorting and grouping. The sequence of terms in each docu- ment, tagged by their documentID (left) is sorted alphabetically (middle). Instances of the same term are then grouped by word and then by documentID. The terms and documentIDs are then separated out (right). The dictionary stores the terms, and has a pointer to the postings list for each term. It commonly also stores other summary information such as, here, the doc- ument frequency of each term. We use this information for improving query time efficiency and, later, for weighting in ranked retrieval models. Each postings list stores the list of docu- ments in which a term occurs, and may store other information such as the term frequency (the frequency of each term in each document) or the position(s) of the term in each document. P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 1.3 Processing Boolean queries 9 to more advanced indexing strategies such as skip lists (Section 2.3), which require additional pointers. Variable length arrays win in space requirements by avoiding the overhead for pointers and in time requirements because their use of contiguous memory increases speed on modern processors with mem- ory caches. Extra pointers can in practice be encoded into the lists as offsets. If updates are relatively infrequent, variable length arrays are more compact and faster to traverse. We can also use a hybrid scheme, with a linked list of fixed length arrays for each term. When postings lists are stored on disk, they are stored (perhaps compressed) as a contiguous run of postings without ex- plicit pointers (as in Figure 1.3), so as to minimize the size of the postings list and the number of disk seeks to read a postings list into memory. Exercise1.1 Draw the inverted index that would be built for the following document collection. (See Figure 1.3 for an example.) ? Doc1 new home sales top forecasts Doc2 home sales rise in july Doc3 increase in home sales in july Doc4 july new home sales rise Exercise1.2 Consider these documents: Doc1 breakthrough drug for schizophrenia Doc2 new schizophrenia drug Doc3 new approach for treatment of schizophrenia Doc4 new hopes for schizophrenia patients a. Draw the term-document incidence matrix for this document collec- tion. b. Draw the inverted index representation for this collection, as in Fig- ure 1.3 (page 6). Exercise1.3 For the document collection shown in Exercise 1.2, what are the returned results for these queries? a. schizophreniaand drug b. forandnot (drugor approach) 1.3ProcessingBooleanqueries simple How do we process a query using an inverted index and the basic Boolean conjunctive retrieval model? Consider processing the simple conjunctive query: queries (1.1) Brutusand Calpurnia over the inverted index partially shown in Figure 1.3 (page 6). We: 1. Locate Brutus in the dictionary. 2. Retrieve its postings. 3. Locate Calpurnia in the dictionary.P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 10 Boolean retrieval Brutus −→ 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174 Calpurnia −→ 2 → 31 → 54 → 101 Intersection =⇒ 2 → 31 Figure1.5 Intersecting the postings lists for Brutus and Calpurnia from Figure 1.3. 4. Retrieve its postings. 5. Intersect the two postings lists, as shown in Figure 1.5. postingslist The intersection operation is the crucial one: We need to efficiently intersect intersection postings lists so as to be able to quickly find documents that contain both postings terms. (This operation is sometimes referred to as merging postings lists, this merge slightly counterintuitive name reflects using the term merge algorithm for a general family of algorithms that combine multiple sorted lists by inter- leaved advancing of pointers through each; here we are merging the lists with a logicaland operation.) There is a simple and effective method of intersecting postings lists using the merge algorithm (see Figure 1.6): We maintain pointers into both lists and walk through the two postings lists simultaneously, in time linear in the to- tal number of postings entries. At each step, we compare the docID pointed to by both pointers. If they are the same, we put that docID in the results list, and advance both pointers. Otherwise we advance the pointer pointing to the smaller docID. If the lengths of the postings lists are x and y,the in- tersection takes O(x + y) operations. Formally, the complexity of querying 6 is (N), where N is the number of documents in the collection. Our index- ing methods gain us just a constant, not a difference in  time complexity compared with a linear scan, but in practice the constant is huge. To use this algorithm, it is crucial that postings be sorted by a single global ordering. Using a numeric sort by docID is one simple way to achieve this. We can extend the intersection operation to process more complicated queries like: (1.2) (Brutusor Caesar)andnot Calpurnia query Query optimization is the process of selecting how to organize the work of an- optimization swering a query so that the least total amount of work needs to be done by the system. A major element of this for Boolean queries is the order in which postings lists are accessed. What is the best order for query processing? Con- sider a query that is anand of t terms, for instance: (1.3) Brutusand Caesarand Calpurnia 6 The notation (·) is used to express an asymptotically tight bound on the complexity of an algorithm. Informally, this is often written as O(·), but this notation really expresses an asymptotic upper bound, which need not be tight (Cormen et al. 1990).P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 1.3 Processing Boolean queries 11 Intersect(p,p ) 1 2 1 answer ← 2 while p =  nil and p =  nil 1 2 3 doifdocID(p ) = docID(p ) 1 2 4 thenAdd(answer,docID(p )) 1 5 p ← next(p ) 1 1 6 p ← next(p ) 2 2 7 else ifdocID(p ) docID(p ) 1 2 8 then p ← next(p ) 1 1 9 else p ← next(p ) 2 2 10 returnanswer Figure1.6 Algorithm for the intersection of two postings lists p and p . 1 2 For each of the t terms, we need to get its postings, then and them together. The standard heuristic is to process terms in order of increasing document frequency; if we start by intersecting the two smallest postings lists, then all intermediate results must be no bigger than the smallest postings list, and we are therefore likely to do the least amount of total work. So, for the postings lists in Figure 1.3 (page 6), we execute the above query as: (1.4) (Calpurniaand Brutus)and Caesar This is a first justification for keeping the frequency of terms in the dictionary; it allows us to make this ordering decision based on in-memory data before accessing any postings list. Consider now the optimization of more general queries, such as: (1.5) (maddingor crowd)and (ignobleor strife)and (killedor slain) As before, we get the frequencies for all terms, and we can then (conser- vatively) estimate the size of each or by the sum of the frequencies of its disjuncts. We can then process the query in increasing order of the size of each disjunctive term. For arbitrary Boolean queries, we have to evaluate and temporarily store the answers for intermediate expressions in a complex expression. However, in many circumstances, either because of the nature of the query language, or just because this is the most common type of query that users submit, a query is purely conjunctive. In this case, rather than viewing merging post- ings lists as a function with two inputs and a distinct output, it is more ef- ficient to intersect each retrieved postings list with the current intermediate result in memory, where we initialize the intermediate result by loading the postings list of the least frequent term. This algorithm is shown in Figure 1.7. The intersection operation is then asymmetric: The intermediate results list is in memory while the list it is being intersected with is being read from disk. Moreover, the intermediate results list is always at least as short as the other list, and in many cases it is orders of magnitude shorter. The postingsP1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 12 Boolean retrieval Intersect(t ,...,t ) 1 n 1 terms ← SortByIncreasingFrequency(t ,...,t ) 1 n 2 result ← postings(first(terms)) 3 terms ←rest(terms) 4 whileterms = nil andresult = nil 5 doresult ← Intersect(result, postings(first(terms))) 6 terms ←rest(terms) 7 returnresult Figure 1.7 Algorithm for conjunctive queries that returns the set of documents containing each term in the input list of terms. intersection can still be done by the algorithm in Figure 1.6, but when the dif- ference between the list lengths is very large, opportunities to use alternative techniques open up. The intersection can be calculated in place by destruc- tively modifying or marking invalid items in the intermediate results list. Or the intersection can be done as a sequence of binary searches in the long postings lists for each posting in the intermediate results list. Another possi- bility is to store the long postings list as a hashtable, so that membership of an intermediate result item can be calculated in constant rather than linear or log time. However, such alternative techniques are difficult to combine with postings list compression of the sort discussed in Chapter 5. Moreover, stan- dard postings list intersection operations remain necessary when both terms of a query are very common. Exercise1.4 For the queries below, can we still run through the intersec- tion in time O(x + y), where x and y are the lengths of the postings lists for ? Brutus and Caesar? If not, what can we achieve? a. Brutusandnot Caesar b. Brutusornot Caesar Exercise1.5 Extend the postings merge algorithm to arbitrary Boolean query formulas. What is its time complexity? For instance, consider: c. (Brutusor Caesar)andnot (Antonyor Cleopatra) Can we always merge in linear time? Linear in what? Can we do better than this? Exercise1.6 We can use distributive laws for and and or to rewrite queries. a. Show how to rewrite the query in Exercise 1.5 into disjunctive normal form using the distributive laws. b. Would the resulting query be more or less efficiently evaluated than the original form of this query? c. Is this result true in general or does it depend on the words and the contents of the document collection?P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 1.4 The extended Boolean model versus ranked retrieval 13 Exercise1.7 Recommend a query processing order for d. (tangerine or trees) and (marmalade or skies) and (kaleidoscope or eyes) given the following postings list sizes: Term Postingssize eyes 213312 kaleidoscope 87009 marmalade 107913 skies 271658 tangerine 46653 trees 316812 Exercise1.8 If the query is: e. friendsand romansand (not countrymen) how could we use the frequency of countrymen in evaluating the best query evaluation order? In particular, propose a way of handling negation in de- termining the order of query processing. Exercise1.9 For a conjunctive query, is processing postings lists in order of size guaranteed to be optimal? Explain why it is, or give an example where it is not. Exercise1.10 Write out a postings merge algorithm, in the style of Fig- ure 1.6 (page 11), for an x or y query. Exercise1.11 How should the Boolean query x and not y be handled? Why is naive evaluation of this query normally very expensive? Write out a postings merge algorithm that evaluates this query efficiently. 1.4TheextendedBooleanmodelversusrankedretrieval ranked The Boolean retrieval model contrasts with ranked retrieval models such as the retrieval vector space model (Section 6.3), in which users largely use free text queries, models that is, just typing one or more words rather than using a precise language with operators for building up query expressions, and the system decides freetext queries which documents best satisfy the query. Despite decades of academic re- search on the advantages of ranked retrieval, systems implementing the Boo- lean retrieval model were the main or only search option provided by large commercial information providers for three decades until the early 1990s (approximately the date of arrival of the World Wide Web). However, these systems did not have just the basic Boolean operations (and, or,and not) that have been presented so far. A strict Boolean expression over terms with an unordered results set is too limited for many of the information needs that people have, and these systems implemented extended Boolean re- trieval models by incorporating additional operators such as term proximityP1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 14 Boolean retrieval proximity operators. A proximity operator is a way of specifying that two terms in a operator query must occur close to each other in a document, where closeness may be measured by limiting the allowed number of intervening words or by reference to a structural unit such as a sentence or paragraph. Example 1.1: Commercial Boolean searching: Westlaw. Westlaw (http:// ✎ is the largest commercial legal search service (in terms of the number of paying subscribers), with over half a million subscribers performing millions of searches a day over tens of terabytes of text data. The service was started in 1975. In 2005, Boolean search (called Terms and Connectors by Westlaw) was still the default, and used by a large percent- age of users, although ranked free text querying (called Natural Language by Westlaw) was added in 1992. Here are some example Boolean queries on Westlaw: Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a com- peting company. Query: “trade secret” /s disclos /s prevent /s employe Information need: Requirements for disabled people to be able to access a workplace. Query: disab /p access /s work-site work-place (employment /3 place) Information need: Cases about a host’s responsibility for drunk guests. Query: host /p (responsib liab) /p (intoxicat drunk) /p guest Note the long, precise queries and the use of proximity operators, both uncommon in web search. Submitted queries average about ten words in length. Unlike web search conventions, a space between words represents disjunction (the tightest binding operator), & is and and /s, /p, and /k ask for matches in the same sentence, same paragraph or within k words respectively. Double quotes give a phrase search (consecutive words); see Section 2.4 (page 36). The exclamation mark () gives a trailing wildcard query (see Section 3.2, page 48); thus liab matches all words starting with liab. Additionally work-site matches any of worksite, work-site or work site; see Section 2.2.1. Typical expert queries are usually carefully defined and incrementally developed until they obtain what look to be good results to the user. Many users, particularly professionals, prefer Boolean query models. Boolean queries are precise: A document either matches the query or it does not. This offers the user greater control and transparency over what is retrieved. And some domains, such as legal materials, allow an effec- tive means of document ranking within a Boolean model: Westlaw re- turns documents in reverse chronological order, which is in practice quiteP1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 1.4 The extended Boolean model versus ranked retrieval 15 effective. In 2007, the majority of law librarians still seem to recommend terms and connectors for high recall searches, and the majority of legal users think they are getting greater control by using them. However, this does not mean that Boolean queries are more effective for professional searchers. Indeed, experimenting on a Westlaw subcollection, Turtle (1994) found that free text queries produced better results than Boolean queries prepared by Westlaw’s own reference librarians for the majority of the information needs in his experiments. A general problem with Boolean search is that using and operators tends to produce high precision but low recall searches, while usingor operators gives low precision but high recall searches, and it is difficult or impossible to find a satisfactory middle ground. In this chapter, we have looked at the structure and construction of a basic inverted index, comprising a dictionary and postings lists. We introduced the Boolean retrieval model, and examined how to do efficient retrieval via linear time merges and simple query optimization. In Chapters 2–7, we consider in detail richer query models and the sort of augmented index structures that are needed to handle them efficiently. Here we just mention a few of the main additional things we would like to be able to do. 1. We would like to better determine the set of terms in the dictionary and to provide retrieval that is tolerant to spelling mistakes and inconsistent choice of words. 2. It is often useful to search for compounds or phrases that denote a concept such as “operating system.” As the Westlaw examples show, we might also wish to do proximity queries such as Gatesnear Microsoft. To answer such queries, the index has to be augmented to capture the proximities of terms in documents. 3. A Boolean model only records term presence or absence, but often we would like to accumulate evidence, giving more weight to documents that have a term several times as opposed to ones that contain it only once. To term be able to do this we needtermfrequency information (the number of times frequency a term occurs in a document) in postings lists. 4. Boolean queries just retrieve a set of matching documents, but commonly we wish to have an effective method to order (or rank) the returned re- sults. This requires having a mechanism for determining a document score which encapsulates how good a match a document is for a query. With these additional ideas, we will have seen most of the basic technol- ogy that supports ad hoc searching over unstructured information. Ad hoc searching over documents has recently conquered the world, powering not only web search engines but the kind of unstructured search that lies behind the large eCommerce web sites. Although the main web search engines differ by emphasizing free text querying, most of the basic issues and technologiesP1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 16 Boolean retrieval of indexing and querying remain the same, as we will see in later chapters. Moreover, over time, web search engines have added at least partial imple- mentations of some of the most popular operators from extended Boolean models: phrase search is especially popular and most have a very partial im- plementation of Boolean operators. Nevertheless, although these options are liked by expert searchers, they are little used by most people and are not the main focus in work on trying to improve web search engine performance. Exercise1.12 Write a query using Westlaw syntax that would find any of the words professor, teacher,or lecturer in the same sentence as a form of the ? verb explain. Exercise1.13 Try using the Boolean search features on a couple of major web search engines. For instance, choose a word, such as burglar,and sub- mit the queries (i) burglar, (ii) burglarand burglar, and (iii) burglaror burglar. Look at the estimated number of results and top hits. Do they make sense in terms of Boolean logic? Often they haven’t for major search engines. Can you make sense of what is going on? What about if you try different words? For example, query for (i) knight, (ii) conquer, and then (iii) knight OR conquer. What bound should the number of results from the first two queries place on the third query? Is this bound observed? 1.5Referencesandfurtherreading The practical pursuit of computerized information retrieval began in the late 1940s (Cleverdon 1991; Liddy 2005). A great increase in the production of scientific literature, much in the form of less formal technical reports rather than traditional journal articles, coupled with the availability of computers, led to interest in automatic document retrieval. However, in those days, doc- ument retrieval was always based on author, title, and keywords; full-text search came much later. The article by Bush (1945) provided lasting inspiration for the new field: Consider a future device for individual use, which is a sort of mecha- nized private file and library. It needs a name, and, to coin one at random, ‘memex’ will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory. The term information retrieval was coined by Calvin Mooers in 1948/1950 (Mooers 1950). In 1958, much newspaper attention was paid to demonstrations at a confer- ence (see Taube and Wooster 1958) of IBM “auto-indexing” machines, based primarily on the work of H. P. Luhn. Commercial interest quickly gravitatedP1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 1.5 References and further reading 17 toward Boolean retrieval systems, but the early years saw a heady debate over various disparate technologies for retrieval systems. For example, Moo- ers (1961) dissented: It is a common fallacy, underwritten at this date by the investment of sev- eral million dollars in a variety of retrieval hardware, that the algebra of George Boole (1847) is the appropriate formalism for retrieval system de- sign. This view is as widely and uncritically accepted as it is wrong. The observation of and versus or giving you opposite extremes in a pre- cision/recall tradeoff, but not the middle ground comes from (Lee and Fox 1988). The book (Witten et al. 1999) is the standard reference for an in-depth com- parison of the space and time efficiency of the inverted index versus other possible data structures; a more succinct and up-to-date presentation ap- pears in Zobel and Moffat (2006). We further discuss several approaches in Chapter 5. regular Friedl (2006) covers the practical usage of regular expressions for searching. expressions The underlying computer science appears in (Hopcroft et al. 2000). P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 The term vocabulary and 2postings lists Recall the major steps in inverted index construction: 1. Collect the documents to be indexed. 2. Tokenize the text. 3. Do linguistic preprocessing of tokens. 4. Index the documents that each term occurs in. In this chapter, we first briefly mention how the basic unit of a document can be defined and how the character sequence that it comprises is determined (Section 2.1). We then examine in detail some of the substantive linguistic issues of tokenization and linguistic preprocessing, which determine the vo- cabulary of terms that a system uses (Section 2.2). Tokenization is the pro- cess of chopping character streams into tokens; linguistic preprocessing then deals with building equivalence classes of tokens, which are the set of terms that are indexed. Indexing itself is covered in Chapters 1 and 4. Then we re- turn to the implementation of postings lists. In Section 2.3, we examine an extended postings list data structure that supports faster querying, and Sec- tion 2.4 covers building postings data structures suitable for handling phrase and proximity queries, of the sort that commonly appear in both extended Boolean models and on the web. 2.1Documentdelineationandcharactersequencedecoding 2.1.1 Obtaining the charactersequence in a document Digital documents that are the input to an indexing process are typically bytes in a file or on a web server. The first step of processing is to con- vert this byte sequence into a linear sequence of characters. For the case of plain English text in ASCII encoding, this is trivial. But often things get much more complex. The sequence of characters may be encoded by one of various single-byte or multibyte encoding schemes, such as Unicode UTF-8, 18P1: KRU/IRP irbook CUUS232/Manning 978 0 521 86571 5 May 27, 2008 12:8 2.1 Document delineation and character sequence decoding 19 Figure 2.1 An example of a vocalized Modern Standard Arabic word. The writing is from right to left and letters undergo complex mutations as they are combined. The representation of short vowels (here, /i/ and /u/) and the final /n/ (nunation) departs from strict linearity by being represented as diacritics above and below letters. Nevertheless, the represented text is still clearly a linear ordering of characters representing sounds. Full vocalization, as here, normally appears only in the Koran and children’s books. Day-to-day text is unvocalized (short vowels are not represented, but the letter for a ¯ would still appear) or partially vocalized, with short vowels in- serted in places where the writer perceives ambiguities. These choices add further complexities to indexing. or various national or vendor-specific standards. We need to determine the correct encoding. This can be regarded as a machine learning classification 1 problem, as discussed in Chapter 13, but is often handled by heuristic meth- ods, user selection, or using provided document metadata. Once the encod- ing is determined, we decode the byte sequence to a character sequence. We might save the choice of encoding because it gives some evidence about what language the document is written in. The characters may have to be decoded out of some binary representation like Microsoft Word DOC files and/or a compressed format such as zip files. Again, we must determine the document format, and then an appropriate decoder has to be used. Even for plain text documents, additional decoding may need to be done. In XML documents (Section 10.1, page 180), charac- ter entities, such as &, need to be decoded to give the correct character, namely, & for &. Finally, the textual part of the document may need to be extracted out of other material that will not be processed. This might be the desired handling for XML files, if the markup is going to be ignored; we would almost certainly want to do this with postscript or PDF files. We do not deal further with these issues in this book, and assume henceforth that our documents are a list of characters. Commercial products usually need to support a broad range of document types and encodings, because users want things to just work with their data as is. Often, they just think of docu- ments as text inside applications and are not even aware of how it is encoded on disk. This problem is usually solved by licensing a software library that handles decoding document formats and character encodings. The idea that text is a linear sequence of characters is also called into question by some writing systems, such as Arabic, where text takes on some two-dimensional and mixed-order characteristics, as shown in Fig- ures 2.1 and 2.2. But, despite some complicated writing system conventions, there is an underlying sequence of sounds being represented and hence an 1 A classifier is a function that takes objects of some sort and assigns them to one of a num- ber of distinct classes. Usually classification is done by machine learning methods such as probabilistic models, but it can also be done by hand-written rules.