Lecture notes on Information Retrieval

how information retrieval system works and how to evaluate information retrieval systems | pdf free download
Dr.JamesSmith Profile Pic
Dr.JamesSmith,France,Professional
Published Date:11-07-2017
Your Website URL(Optional)
Comment
An Introduction to Information Retrieval Draft of April 1, 2009 Online edition (c) 2009 Cambridge UPOnline edition (c) 2009 Cambridge UPAn Introduction to Information Retrieval Christopher D. Manning Prabhakar Raghavan Hinrich Schütze Cambridge University Press Cambridge, England Online edition (c) 2009 Cambridge UPDRAFT DO NOT DISTRIBUTE WITHOUT PRIOR PERMISSION © 2009 Cambridge University Press By Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze Printed on April 1, 2009 Website: http://www.informationretrieval.org/ Comments, corrections, and other feedback most welcome at: informationretrievalyahoogroups.com Online edition (c) 2009 Cambridge UPDRAFT © April 1, 2009 Cambridge University Press. Feedback welcome. v BriefContents 1 Booleanretrieval 1 2 Thetermvocabularyandpostingslists 19 3 Dictionariesandtolerantretrieval 49 4 Indexconstruction 67 5 Indexcompression 85 6 Scoring,termweightingandthevectorspacemodel 109 7 Computingscoresinacompletesearchsystem 135 8 Evaluationininformationretrieval 151 9 Relevance feedbackandqueryexpansion 177 10 XMLretrieval 195 11 Probabilisticinformationretrieval 219 12 Languagemodelsforinformationretrieval 237 13 TextclassificationandNaiveBayes 253 14 Vectorspaceclassification 289 15 Supportvectormachinesandmachinelearning ondocuments 319 16 Flatclustering 349 17 Hierarchicalclustering 377 18 Matrixdecompositionsandlatentsemanticindexing 403 19 Web searchbasics 421 20 Web crawlingandindexes 443 21 Linkanalysis 461 Online edition (c) 2009 Cambridge UPOnline edition (c) 2009 Cambridge UPDRAFT © April 1, 2009 Cambridge University Press. Feedback welcome. vii Contents ListofTables xv ListofFigures xix TableofNotation xxvii Preface xxxi 1 Booleanretrieval 1 1.1 An example information retrieval problem 3 1.2 A first take at building an inverted index 6 1.3 Processing Boolean queries 10 1.4 The extended Boolean model versus ranked retrieval 14 1.5 References and further reading 17 2 Thetermvocabularyandpostingslists 19 2.1 Document delineation and character sequence decoding 19 2.1.1 Obtaining the character sequence in a document 19 2.1.2 Choosing a document unit 20 2.2 Determining the vocabulary of terms 22 2.2.1 Tokenization 22 2.2.2 Dropping common terms: stop words 27 2.2.3 Normalization (equivalence classing of terms) 28 2.2.4 Stemming and lemmatization 32 2.3 Faster postings list intersection via skip pointers 36 2.4 Positional postings and phrase queries 39 2.4.1 Biword indexes 39 2.4.2 Positional indexes 41 2.4.3 Combination schemes 43 2.5 References and further reading 45 Online edition (c) 2009 Cambridge UPviii Contents 3 Dictionariesandtolerantretrieval 49 3.1 Search structures for dictionaries 49 3.2 Wildcard queries 51 3.2.1 General wildcard queries 53 3.2.2 k-gram indexes for wildcard queries 54 3.3 Spelling correction 56 3.3.1 Implementing spelling correction 57 3.3.2 Forms of spelling correction 57 3.3.3 Edit distance 58 3.3.4 k-gram indexes for spelling correction 60 3.3.5 Context sensitive spelling correction 62 3.4 Phonetic correction 63 3.5 References and further reading 65 4 Indexconstruction 67 4.1 Hardware basics 68 4.2 Blocked sort-based indexing 69 4.3 Single-pass in-memory indexing 73 4.4 Distributed indexing 74 4.5 Dynamic indexing 78 4.6 Other types of indexes 80 4.7 References and further reading 83 5 Indexcompression 85 5.1 Statistical properties of terms in information retrieval 86 5.1.1 Heaps’ law: Estimating the number of terms 88 5.1.2 Zipf’s law: Modeling the distribution of terms 89 5.2 Dictionary compression 90 5.2.1 Dictionary as a string 91 5.2.2 Blocked storage 92 5.3 Postings file compression 95 5.3.1 Variable byte codes 96 5.3.2 γ codes 98 5.4 References and further reading 105 6 Scoring,termweightingandthevectorspacemodel 109 6.1 Parametric and zone indexes 110 6.1.1 Weighted zone scoring 112 6.1.2 Learning weights 113 6.1.3 The optimal weight g 115 6.2 Term frequency and weighting 117 6.2.1 Inverse document frequency 117 6.2.2 Tf-idf weighting 118 Online edition (c) 2009 Cambridge UPContents ix 6.3 The vector space model for scoring 120 6.3.1 Dot products 120 6.3.2 Queries as vectors 123 6.3.3 Computing vector scores 124 6.4 Variant tf-idf functions 126 6.4.1 Sublinear tf scaling 126 6.4.2 Maximum tf normalization 127 6.4.3 Document and query weighting schemes 128 6.4.4 Pivoted normalized document length 129 6.5 References and further reading 133 7 Computingscoresinacompletesearchsystem 135 7.1 Efficient scoring and ranking 135 7.1.1 Inexact top K document retrieval 137 7.1.2 Index elimination 137 7.1.3 Champion lists 138 7.1.4 Static quality scores and ordering 138 7.1.5 Impact ordering 140 7.1.6 Cluster pruning 141 7.2 Components of an information retrieval system 143 7.2.1 Tiered indexes 143 7.2.2 Query-term proximity 144 7.2.3 Designing parsing and scoring functions 145 7.2.4 Putting it all together 146 7.3 Vector space scoring and query operator interaction 147 7.4 References and further reading 149 8 Evaluationininformationretrieval 151 8.1 Information retrieval system evaluation 152 8.2 Standard test collections 153 8.3 Evaluation of unranked retrieval sets 154 8.4 Evaluation of ranked retrieval results 158 8.5 Assessing relevance 164 8.5.1 Critiques and justifications of the concept of relevance 166 8.6 A broader perspective: System quality and user utility 168 8.6.1 System issues 168 8.6.2 User utility 169 8.6.3 Refining a deployed system 170 8.7 Results snippets 170 8.8 References and further reading 173 9 Relevancefeedbackandqueryexpansion 177 Online edition (c) 2009 Cambridge UPx Contents 9.1 Relevance feedback and pseudo relevance feedback 178 9.1.1 The Rocchio algorithm for relevance feedback 178 9.1.2 Probabilistic relevance feedback 183 9.1.3 When does relevance feedback work? 183 9.1.4 Relevance feedback on the web 185 9.1.5 Evaluation of relevance feedback strategies 186 9.1.6 Pseudo relevance feedback 187 9.1.7 Indirect relevance feedback 187 9.1.8 Summary 188 9.2 Global methods for query reformulation 189 9.2.1 Vocabulary tools for query reformulation 189 9.2.2 Query expansion 189 9.2.3 Automatic thesaurus generation 192 9.3 References and further reading 193 10 XMLretrieval 195 10.1 Basic XML concepts 197 10.2 Challenges in XML retrieval 201 10.3 A vector space model for XML retrieval 206 10.4 Evaluation of XML retrieval 210 10.5 Text-centric vs. data-centric XML retrieval 214 10.6 References and further reading 216 10.7 Exercises 217 11 Probabilisticinformationretrieval 219 11.1 Review of basic probability theory 220 11.2 The Probability Ranking Principle 221 11.2.1 The 1/0 loss case 221 11.2.2 The PRP with retrieval costs 222 11.3 The Binary Independence Model 222 11.3.1 Deriving a ranking function for query terms 224 11.3.2 Probability estimates in theory 226 11.3.3 Probability estimates in practice 227 11.3.4 Probabilistic approaches to relevance feedback 228 11.4 An appraisal and some extensions 230 11.4.1 An appraisal of probabilistic models 230 11.4.2 Tree-structured dependencies between terms 231 11.4.3 Okapi BM25: a non-binary model 232 11.4.4 Bayesian network approaches to IR 234 11.5 References and further reading 235 12 Languagemodelsforinformationretrieval 237 12.1 Language models 237 Online edition (c) 2009 Cambridge UPContents xi 12.1.1 Finite automata and language models 237 12.1.2 Types of language models 240 12.1.3 Multinomial distributions over words 241 12.2 The query likelihood model 242 12.2.1 Using query likelihood language models in IR 242 12.2.2 Estimating the query generation probability 243 12.2.3 Ponte and Croft’s Experiments 246 12.3 Language modeling versus other approaches in IR 248 12.4 Extended language modeling approaches 250 12.5 References and further reading 252 13 TextclassificationandNaiveBayes 253 13.1 The text classification problem 256 13.2 Naive Bayes text classification 258 13.2.1 Relation to multinomial unigram language model 262 13.3 The Bernoulli model 263 13.4 Properties of Naive Bayes 265 13.4.1 A variant of the multinomial model 270 13.5 Feature selection 271 13.5.1 Mutual information 272 2 13.5.2 χ Feature selection 275 13.5.3 Frequency-based feature selection 277 13.5.4 Feature selection for multiple classifiers 278 13.5.5 Comparison of feature selection methods 278 13.6 Evaluation of text classification 279 13.7 References and further reading 286 14 Vectorspaceclassification 289 14.1 Document representations and measures of relatedness in vector spaces 291 14.2 Rocchio classification 292 14.3 k nearest neighbor 297 14.3.1 Time complexity and optimality of kNN 299 14.4 Linear versus nonlinear classifiers 301 14.5 Classification with more than two classes 306 14.6 The bias-variance tradeoff 308 14.7 References and further reading 314 14.8 Exercises 315 15 Supportvectormachinesandmachinelearningondocuments 319 15.1 Support vector machines: The linearly separable case 320 15.2 Extensions to the SVM model 327 15.2.1 Soft margin classification 327 Online edition (c) 2009 Cambridge UPxii Contents 15.2.2 Multiclass SVMs 330 15.2.3 Nonlinear SVMs 330 15.2.4 Experimental results 333 15.3 Issues in the classification of text documents 334 15.3.1 Choosing what kind of classifier to use 335 15.3.2 Improving classifier performance 337 15.4 Machine learning methods in ad hoc information retrieval 341 15.4.1 A simple example of machine-learned scoring 341 15.4.2 Result ranking by machine learning 344 15.5 References and further reading 346 16 Flatclustering 349 16.1 Clustering in information retrieval 350 16.2 Problem statement 354 16.2.1 Cardinality – the number of clusters 355 16.3 Evaluation of clustering 356 16.4 K-means 360 16.4.1 Cluster cardinality in K-means 365 16.5 Model-based clustering 368 16.6 References and further reading 372 16.7 Exercises 374 17 Hierarchicalclustering 377 17.1 Hierarchical agglomerative clustering 378 17.2 Single-link and complete-link clustering 382 17.2.1 Time complexity of HAC 385 17.3 Group-average agglomerative clustering 388 17.4 Centroid clustering 391 17.5 Optimality of HAC 393 17.6 Divisive clustering 395 17.7 Cluster labeling 396 17.8 Implementation notes 398 17.9 References and further reading 399 17.10 Exercises 401 18 Matrixdecompositionsandlatentsemanticindexing 403 18.1 Linear algebra review 403 18.1.1 Matrix decompositions 406 18.2 Term-document matrices and singular value decompositions 407 18.3 Low-rank approximations 410 18.4 Latent semantic indexing 412 18.5 References and further reading 417 Online edition (c) 2009 Cambridge UPContents xiii 19 Websearchbasics 421 19.1 Background and history 421 19.2 Web characteristics 423 19.2.1 The web graph 425 19.2.2 Spam 427 19.3 Advertising as the economic model 429 19.4 The search user experience 432 19.4.1 User query needs 432 19.5 Index size and estimation 433 19.6 Near-duplicates and shingling 437 19.7 References and further reading 441 20 Webcrawlingandindexes 443 20.1 Overview 443 20.1.1 Features a crawlermust provide 443 20.1.2 Features a crawlershould provide 444 20.2 Crawling 444 20.2.1 Crawler architecture 445 20.2.2 DNS resolution 449 20.2.3 The URL frontier 451 20.3 Distributing indexes 454 20.4 Connectivity servers 455 20.5 References and further reading 458 21 Linkanalysis 461 21.1 The Web as a graph 462 21.1.1 Anchor text and the web graph 462 21.2 PageRank 464 21.2.1 Markov chains 465 21.2.2 The PageRank computation 468 21.2.3 Topic-specific PageRank 471 21.3 Hubs and Authorities 474 21.3.1 Choosing the subset of the Web 477 21.4 References and further reading 480 Bibliography 483 AuthorIndex 519 Online edition (c) 2009 Cambridge UPOnline edition (c) 2009 Cambridge UPDRAFT © April 1, 2009 Cambridge University Press. Feedback welcome. xv ListofTables 4.1 Typical system parameters in 2007. The seek time is the time needed to position the disk head in a new position. The transfer time per byte is the rate of transfer from disk to memory when the head is in the right position. 68 4.2 Collection statistics for Reuters-RCV1. Values are rounded for the computations in this book. The unrounded values are: 806,791 documents, 222 tokens per document, 391,523 (distinct) terms, 6.04 bytes per token with spaces and punctuation, 4.5 bytes per token without spaces and punctuation, 7.5 bytes per term, and 96,969,056 tokens. The numbers in this table correspond to the third line (“case folding”) in Table 5.1 (page 87). 70 4.3 The five steps in constructing an index for Reuters-RCV1 in blocked sort-based indexing. Line numbers refer to Figure 4.2. 82 4.4 Collection statistics for a large collection. 82 5.1 The effect of preprocessing on the number of terms, nonpositional postings, and tokens for Reuters-RCV1. “Δ%” indicates the reduction in size from the previous line, except that “30 stop words” and “150 stop words” both use “case folding” as their reference line. “T%” is the cumulative (“total”) reduction from unfiltered. We performed stemming with the Porter stemmer (Chapter 2, page 33). 87 5.2 Dictionary compression for Reuters-RCV1. 95 5.3 Encoding gaps instead of document IDs. For example, we store gaps 107, 5, 43, . . . , instead of docIDs 283154, 283159, 283202, . . . for computer. The first docID is left unchanged (only shown for arachnocentric). 96 5.4 VB encoding. 97 Online edition (c) 2009 Cambridge UPxvi ListofTables 5.5 Some examples of unary andγ codes. Unary codes are only shown for the smaller numbers. Commas inγ codes are for readability only and are not part of the actual codes. 98 5.6 Index and dictionary compression for Reuters-RCV1. The compression ratio depends on the proportion of actual text in the collection. Reuters-RCV1 contains a large amount of XML markup. Using the two best compression schemes,γ encoding and blocking with front coding, the ratio compressed index to collection size is therefore especially small for Reuters-RCV1: (101+ 5.9)/3600≈ 0.03. 103 5.7 Two gap sequences to be merged in blocked sort-based indexing 105 6.1 Cosine computation for Exercise 6.19. 132 8.1 Calculation of 11-point Interpolated Average Precision. 159 8.2 Calculating the kappa statistic. 165 10.1 RDB (relational database) search, unstructured information retrieval and structured information retrieval. 196 10.2 INEX 2002 collection statistics. 211 10.3 INEX 2002 results of the vector space model in Section 10.3 for content-and-structure (CAS) queries and the quantization functionQ. 213 10.4 A comparison of content-only and full-structure search in INEX 2003/2004. 214 13.1 Data for parameter estimation examples. 261 13.2 Training and test times for NB. 261 13.3 Multinomial versus Bernoulli model. 268 13.4 Correct estimation implies accurate prediction, but accurate prediction does not imply correct estimation. 269 13.5 A set of documents for which the NB independence assumptions are problematic. 270 2 13.6 Critical values of theχ distribution with one degree of freedom. For example, if the two events are independent, 2 2 then P(X 6.63) 0.01. So for X 6.63 the assumption of independence can be rejected with 99% confidence. 277 13.7 The ten largest classes in the Reuters-21578 collection with number of documents in training and test sets. 280 Online edition (c) 2009 Cambridge UPListofTables xvii 13.8 Macro- and microaveraging. “Truth” is the true class and “call” the decision of the classifier. In this example, macroaveraged precision is 10/(10+ 10)+ 90/(10+ 90)/2 = (0.5+ 0.9)/2 = 0.7. Microaveraged precision is 100/(100+ 20)≈ 0.83. 282 13.9 Text classification effectiveness numbers on Reuters-21578 for F (in percent). Results from Li and Yang (2003) (a), Joachims 1 (1998) (b: kNN) and Dumais et al. (1998) (b: NB, Rocchio, trees, SVM). 282 13.10 Data for parameter estimation exercise. 284 14.1 Vectors and class centroids for the data in Table 13.1. 294 14.2 Training and test times for Rocchio classification. 296 14.3 Training and test times for kNN classification. 299 14.4 A linear classifier. 303 14.5 A confusion matrix for Reuters-21578. 308 15.1 Training and testing complexity of various classifiers including SVMs. 329 15.2 SVM classifier break-even F from (Joachims 2002a, p. 114). 334 1 15.3 Training examples for machine-learned scoring. 342 16.1 Some applications of clustering in information retrieval. 351 16.2 The four external evaluation measures applied to the clustering in Figure 16.4. 357 16.3 The EM clustering algorithm. 371 17.1 Comparison of HAC algorithms. 395 17.2 Automatically computed cluster labels. 397 Online edition (c) 2009 Cambridge UPOnline edition (c) 2009 Cambridge UPDRAFT © April 1, 2009 Cambridge University Press. Feedback welcome. xix ListofFigures 1.1 A term-document incidence matrix. 4 1.2 Results from Shakespeare for the query Brutus AND Caesar AND NOT Calpurnia. 5 1.3 The two parts of an inverted index. 7 1.4 Building an index by sorting and grouping. 8 1.5 Intersecting the postings lists for Brutus and Calpurnia from Figure 1.3. 10 1.6 Algorithm for the intersection of two postings lists p and p . 11 1 2 1.7 Algorithm for conjunctive queries that returns the set of documents containing each term in the input list of terms. 12 2.1 An example of a vocalized Modern Standard Arabic word. 21 2.2 The conceptual linear order of characters is not necessarily the order that you see on the page. 21 2.3 The standard unsegmented form of Chinese text using the simplified characters of mainland China. 26 2.4 Ambiguities in Chinese word segmentation. 26 2.5 A stop list of 25 semantically non-selective words which are common in Reuters-RCV1. 26 2.6 An example of how asymmetric expansion of query terms can usefully model users’ expectations. 28 2.7 Japanese makes use of multiple intermingled writing systems and, like Chinese, does not segment words. 31 2.8 A comparison of three stemming algorithms on a sample text. 34 2.9 Postings lists with skip pointers. 36 2.10 Postings lists intersection with skip pointers. 37 2.11 Positional index example. 41 2.12 An algorithm for proximity intersection of postings lists p 1 and p . 42 2 Online edition (c) 2009 Cambridge UPxx ListofFigures 3.1 A binary search tree. 51 3.2 A B-tree. 52 3.3 A portion of a permuterm index. 54 3.4 Example of a postings list in a 3-gram index. 55 3.5 Dynamic programming algorithm for computing the edit distance between strings s ands . 59 1 2 3.6 Example Levenshtein distance computation. 59 3.7 Matching at least two of the three 2-grams in the query bord. 61 4.1 Document from the Reuters newswire. 70 4.2 Blocked sort-based indexing. 71 4.3 Merging in blocked sort-based indexing. 72 4.4 Inversion of a block in single-pass in-memory indexing 73 4.5 An example of distributed indexing with MapReduce. Adapted from Dean and Ghemawat (2004). 76 4.6 Map and reduce functions in MapReduce. 77 4.7 Logarithmic merging. Each token (termID,docID) is initially added to in-memory index Z by LMERGEADDTOKEN. 0 LOGARITHMICMERGE initializes Z andindexes. 79 0 4.8 A user-document matrix for access control lists. Element (i,j) is 1 if useri has access to document j and 0 otherwise. During query processing, a user’s access postings list is intersected 81 with the results list returned by the text part of the index. 5.1 Heaps’ law. 88 5.2 Zipf’s law for Reuters-RCV1. 90 5.3 Storing the dictionary as an array of fixed-width entries. 91 5.4 Dictionary-as-a-string storage. 92 5.5 Blocked storage with four terms per block. 93 5.6 Search of the uncompressed dictionary (a) and a dictionary compressed by blocking with k = 4 (b). 94 5.7 Front coding. 94 5.8 VB encoding and decoding. 97 5.9 Entropy H(P) as a function of P(x ) for a sample space with 1 two outcomes x and x . 100 1 2 5.10 Stratification of terms for estimating the size of aγ encoded inverted index. 102 6.1 Parametric search. 111 6.2 Basic zone index 111 6.3 Zone index in which the zone is encoded in the postings rather than the dictionary. 111 Online edition (c) 2009 Cambridge UP

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.