Introduction to Information Retrieval
Dictionary compression and Postings compression

Overview
❶ Recap
❷ Compression
❸ Term statistics
❹ Dictionary compression
❺ Postings compression


❶ Recap

Blocked sort-based indexing
[figure: blocked sort-based indexing (BSBI)]

Single-pass in-memory indexing
• Abbreviation: SPIMI
• Key idea 1: Generate separate dictionaries for each block – no need to maintain a term-termID mapping across blocks.
• Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
• With these two ideas we can generate a complete inverted index for each block.
• These separate indexes can then be merged into one big index.

SPIMI-Invert
[figure: the SPIMI-Invert algorithm]

MapReduce for index construction
[figure: index construction with MapReduce]

Dynamic indexing: Simplest approach
• Maintain a big main index on disk.
• New documents go into a small auxiliary index in memory.
• Search across both and merge the results.
• Periodically, merge the auxiliary index into the big index.

Roadmap
• Today: index compression
• Next 2 weeks: the perspective of the user – how can we give the user relevant results, how can we measure relevance, what types of user interaction are effective?
• After Pentecost: statistical classification and clustering in information retrieval
• Last 3 weeks: web information retrieval

Take-away today
• Motivation for compression in information retrieval systems
• How can we compress the dictionary component of the inverted index?
• How can we compress the postings component of the inverted index?
• Term statistics: how are terms distributed in document collections?


❷ Compression

Why compression? (in general)
• Use less disk space (saves money).
• Keep more stuff in memory (increases speed).
• Increase the speed of transferring data from disk to memory (again, increases speed): reading compressed data and decompressing it in memory is faster than reading uncompressed data.
• Premise: decompression algorithms are fast. This is true of the decompression algorithms we will use.

Why compression in information retrieval?
• First, we will consider space for the dictionary. Main motivation for dictionary compression: make it small enough to keep in main memory.
• Then the postings file. Motivation: reduce the disk space needed and decrease the time needed to read postings from disk.
• Note: large search engines keep a significant part of the postings in memory.
• We will devise various compression schemes for dictionary and postings.

Lossy vs. lossless compression
• Lossy compression: discard some information. Several of the preprocessing steps we frequently use can be viewed as lossy compression: downcasing, stop word elimination, Porter stemming, number elimination (sketched in code below).
• Lossless compression: all information is preserved. This is what we mostly do in index compression.
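As an aside not on the original slides, the lossy steps just listed are easy to make concrete in a few lines of Python; the tiny stop word list below is a toy assumption, and Porter stemming is only indicated in a comment rather than implemented:

```python
# Minimal sketch of lossy preprocessing: each step discards information
# that can never be recovered from the resulting index.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in"}  # toy list; real systems use longer ones

def lossy_preprocess(tokens):
    terms = []
    for tok in tokens:
        tok = tok.lower()          # downcasing: "Reuters" and "reuters" collapse
        if tok in STOP_WORDS:      # stop word elimination
            continue
        if tok.isdigit():          # number elimination
            continue
        terms.append(tok)          # a Porter stemmer would additionally be applied here
    return terms

print(lossy_preprocess("The 3 Reuters documents in 1987".split()))
# -> ['reuters', 'documents']
```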
❸ Term statistics

Model collection: The Reuters collection (Reuters-RCV1)

  symbol  statistic                                     value
  N       documents                                     800,000
  L       avg. tokens per document                      200
  M       word types (= terms)                          400,000
          avg. bytes per token (incl. spaces/punct.)    6
          avg. bytes per token (without spaces/punct.)  4.5
          avg. bytes per term (= word type)             7.5
  T       non-positional postings                       100,000,000

Effect of preprocessing for Reuters
[table: reduction in dictionary and postings size for each preprocessing step; not reproduced here]

How big is the term vocabulary?
• That is, how many distinct words are there?
• Can we assume there is an upper bound? Not really: there are at least 70^20 ≈ 10^37 different words of length 20.
• The vocabulary will keep growing with collection size.
• Heaps' law: M = k * T^b
• M is the size of the vocabulary, T is the number of tokens in the collection.
• Typical values for the parameters k and b are: 30 ≤ k ≤ 100 and b ≈ 0.5.
• Heaps' law is linear in log-log space. It is the simplest possible relationship between collection size and vocabulary size in log-log space.
• Empirical law

Heaps' law for Reuters
[figure: vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1]
• For these data, the dashed line log10 M = 0.49 * log10 T + 1.64 is the best least-squares fit.
• Thus M = 10^1.64 * T^0.49, i.e. k = 10^1.64 ≈ 44 and b = 0.49.

Empirical fit for Reuters
• Good, as we just saw in the graph.
• Example: for the first 1,000,020 tokens Heaps' law predicts 38,323 terms: 44 × 1,000,020^0.49 ≈ 38,323 (see the sketch below).
• The actual number is 38,365 terms, very close to the prediction.
• Empirical observation: the fit is good in general.
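The prediction above is easy to reproduce; the following small Python sketch (not part of the original slides) simply evaluates Heaps' law with the fitted Reuters-RCV1 parameters k = 44 and b = 0.49:

```python
# Heaps' law: M = k * T^b, with the Reuters-RCV1 fit k = 44, b = 0.49.
def heaps_law(T, k=44, b=0.49):
    return k * T ** b

predicted = heaps_law(1_000_020)
print(round(predicted))  # 38323 -- the slides report 38,365 terms actually observed
```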