
Dictionary-based compression

Uploaded by WilliamsMcmahon (United States)
Published: 20-07-2017
Introduction to Information Retrieval

Dictionary compression
Postings compression

Blocked Sort-Based Indexing

Single-pass in-memory indexing
- Abbreviation: SPIMI
- Key idea 1: Generate separate dictionaries for each block - no need to maintain the term-termID mapping across blocks.
- Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
- With these two ideas we can generate a complete inverted index for each block.
- These separate indexes can then be merged into one big index.

SPIMI-Invert

MapReduce for index construction

Dynamic indexing: simplest approach
- Maintain a big main index on disk.
- New documents go into a small auxiliary index in memory.
- Search across both and merge the results.
- Periodically, merge the auxiliary index into the big index.

Roadmap
- Today: index compression.
- Next 2 weeks: the perspective of the user: how can we give the user relevant results, how can we measure relevance, what types of user interaction are effective?
- After Pentecost: statistical classification and clustering in information retrieval.
- Last 3 weeks: web information retrieval.

Take-away today
- Motivation for compression in information retrieval systems.
- How can we compress the dictionary component of the inverted index?
- How can we compress the postings component of the inverted index?
- Term statistics: how are terms distributed in document collections?
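The two SPIMI ideas above (per-block dictionaries, no sorting of postings) can be sketched in a few lines of Python. This is a simplified in-memory illustration rather than the lecture's SPIMI-Invert pseudocode: the token stream is assumed to be (term, docID) pairs in document order, and names like `spimi_invert` and `merge_blocks` are mine.

```python
from collections import defaultdict

def spimi_invert(token_stream):
    """Build the inverted index for one block: no global term-termID
    mapping, and postings are accumulated unsorted as tokens arrive."""
    index = defaultdict(list)          # term -> postings list (doc IDs)
    for term, doc_id in token_stream:
        postings = index[term]         # a new term gets a fresh list
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)    # append in arrival order, no sorting
    # terms are sorted only once, when the block is written out
    return {term: index[term] for term in sorted(index)}

def merge_blocks(blocks):
    """Merge the separate per-block indexes into one big index."""
    merged = defaultdict(list)
    for block in blocks:
        for term, postings in block.items():
            merged[term].extend(postings)
    return dict(merged)

block1 = spimi_invert([("caesar", 1), ("brutus", 1), ("caesar", 2)])
block2 = spimi_invert([("brutus", 3), ("calpurnia", 3)])
full_index = merge_blocks([block1, block2])
# full_index["caesar"] == [1, 2], full_index["brutus"] == [1, 3]
```

Because each block is processed independently, nothing global has to fit in memory at once; only the final merge touches all blocks.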
Why compression? (in general)
- Use less disk space (saves money).
- Keep more stuff in memory (increases speed).
- Increase the speed of transferring data from disk to memory (again, increases speed): reading compressed data and decompressing it in memory is faster than reading uncompressed data.
- Premise: decompression algorithms are fast. This is true of the decompression algorithms we will use.

Why compression in information retrieval?
- First, we will consider space for the dictionary. The main motivation for dictionary compression: make it small enough to keep in main memory.
- Then the postings file. Motivation: reduce the disk space needed, decrease the time needed to read from disk.
- Note: large search engines keep a significant part of the postings in memory.
- We will devise various compression schemes for dictionary and postings.

Lossy vs. lossless compression
- Lossy compression: discard some information. Several of the preprocessing steps we frequently use can be viewed as lossy compression: downcasing, stop word removal, Porter stemming, number elimination.
- Lossless compression: all information is preserved. This is what we mostly do in index compression.

Model collection: the Reuters collection

symbol  statistic                                      value
N       documents                                      800,000
L       avg. tokens per document                       200
M       word types (terms)                             400,000
        avg. bytes per token (incl. spaces/punct.)     6
        avg. bytes per token (without spaces/punct.)   4.5
        avg. bytes per term (= word type)              7.5
T       non-positional postings                        100,000,000

Effect of preprocessing for Reuters

How big is the term vocabulary?
- That is, how many distinct words are there?
- Can we assume there is an upper bound? Not really: there are at least 70^20 ≈ 10^37 different words of length 20.
- The vocabulary will keep growing with collection size.
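That the vocabulary keeps growing is easy to see empirically: count distinct terms over ever-longer prefixes of a token stream. The sketch below uses synthetic tokens drawn from a heavy-tailed (Pareto) distribution, an assumption chosen to mimic natural-language term frequencies, not data from the lecture:

```python
import random

random.seed(0)  # deterministic toy data

# Synthetic token stream: a few term ranks are very frequent, but new
# rare terms keep appearing (assumed heavy-tailed data, not lecture data).
tokens = [f"w{int(random.paretovariate(1.0))}" for _ in range(50_000)]

def vocab_growth(tokens, checkpoints):
    """Vocabulary size M after the first T tokens, for each checkpoint T."""
    seen, sizes, i = set(), [], 0
    for t_limit in checkpoints:
        while i < t_limit:
            seen.add(tokens[i])
            i += 1
        sizes.append(len(seen))
    return sizes

sizes = vocab_growth(tokens, [1_000, 10_000, 50_000])
print(sizes)  # the vocabulary keeps growing with collection size
```

On real text the same computation shows the same behavior: no plateau, just slowing growth, which is exactly what Heaps' law models next.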
b  Heaps’ law: M = kT  M is the size of the vocabulary, T is the number of tokens in the collection.  Typical values for the parameters k and b are: 30 ≤ k ≤ 100 and b ≈ 0.5.  Heaps’ law is linear in log-log space.  It is the simplest possible relationship between collection size and vocabulary size in log-log space.  Empirical law 18 18Introduction to Information Retrieval Heaps’ law for Reuters Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1. For these data, the dashed line log M = 10 0.49 ∗ log T + 1.64 is the 10 best least squares fit. 1.64 0.49 Thus, M = 10 T 1.64 and k = 10 ≈ 44 and b = 0.49. 19 19Introduction to Information Retrieval Empirical fit for Reuters  Good, as we just saw in the graph.  Example: for the first 1,000,020 tokens Heaps’ law predicts 38,323 terms: 0.49 44 × 1,000,020 ≈ 38,323  The actual number is 38,365 terms, very close to the prediction.  Empirical observation: fit is good in general. 20 20Introduction to Information Retrieval Exercise ❶What is the effect of including spelling errors vs. automatically correcting spelling errors on Heaps’ law? ❷Compute vocabulary size M  Looking at a collection of web pages, you find that there are 3000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens.  Assume a search engine indexes a total of 20,000,000,000 10 (2 × 10 ) pages, containing 200 tokens on average  What is the size of the vocabulary of the indexed collection as predicted by Heaps’ law? 21 21Introduction to Information Retrieval Zipf’s law  Now we have characterized the growth of the vocabulary in collections.  We also want to know how many frequent vs. infrequent terms we should expect in a collection.  In natural language, there are a few very frequent terms and very many very rare terms. th  Zipf’s law: The i most frequent term has frequency cf i proportional to 1/i . 
- cf_i is the collection frequency: the number of occurrences of the term t_i in the collection.

Zipf's law (continued)
- Zipf's law: the i-th most frequent term has frequency proportional to 1/i, where cf_i, the collection frequency, is the number of occurrences of the term in the collection.
- So if the most frequent term (the) occurs cf_1 times, then the second most frequent term (of) has half as many occurrences . . .
- . . . and the third most frequent term (and) has a third as many occurrences.
- Equivalently: cf_i = c · i^k and log cf_i = log c + k · log i (for k = −1).
- Zipf's law is an example of a power law.

Zipf's law for Reuters
The fit is not great. What is important is the key insight: few frequent terms, very many rare terms.
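A minimal numeric sketch of the k = −1 power law, assuming a hypothetical count of 60,000 for the most frequent term (the number is illustrative, not from the lecture):

```python
from math import log

def zipf_cf(c, i):
    """Predicted collection frequency of the i-th most frequent term,
    for Zipf's law with k = -1: cf_i = c / i."""
    return c / i

c = 60_000  # hypothetical frequency of the most frequent term ("the")
print(zipf_cf(c, 1))  # 60000.0 - "the"
print(zipf_cf(c, 2))  # 30000.0 - "of": half as many occurrences
print(zipf_cf(c, 3))  # 20000.0 - "and": a third as many

# Power-law form: log cf_i = log c + k * log i with k = -1,
# i.e. a straight line of slope -1 in log-log space.
assert all(abs(log(zipf_cf(c, i)) - (log(c) - log(i))) < 1e-9
           for i in range(1, 100))
```

Plotting log cf_i against log i for a real collection gives the (imperfect) straight line seen in the Reuters graph.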