Introduction to Information Retrieval
Dictionary compression and postings compression
Overview
❶ Recap
❷ Compression
❸ Term statistics
❹ Dictionary compression
❺ Postings compression

❶ Recap

Blocked Sort-Based Indexing
(figure not reproduced)

Single-pass in-memory indexing
- Abbreviation: SPIMI
- Key idea 1: Generate separate dictionaries for each block; no need to maintain a term-termID mapping across blocks.
- Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
- With these two ideas we can generate a complete inverted index for each block.
- These separate indexes can then be merged into one big index.

SPIMI-Invert
(algorithm figure not reproduced; a minimal sketch follows below)
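As a substitute for the missing SPIMI-Invert figure, here is a minimal Python sketch of the two key ideas above. The names (token_stream, block_size, etc.) are illustrative assumptions, not from the original slides, and writing/merging of blocks is only hinted at in a comment.

```python
from collections import defaultdict

def spimi_invert(token_stream, block_size):
    """Minimal sketch of one SPIMI block (illustrative, not the slides' figure).

    token_stream yields (term, docID) pairs. No term-termID mapping is
    maintained (key idea 1), and postings are accumulated in postings lists
    as they occur rather than sorted (key idea 2); terms are sorted only
    once, when the completed block index is handed back.
    """
    index = defaultdict(list)            # term -> postings list for this block
    for i, (term, doc_id) in enumerate(token_stream):
        postings = index[term]           # creates the dictionary entry on demand
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)      # docIDs arrive in increasing order
        if i + 1 >= block_size:
            break                        # memory exhausted: finish this block
    return sorted(index.items())         # sorted terms make block merging easy

# Each returned block index would be written to disk; the separate blocks
# are then combined with a multi-way merge over their sorted term lists.
```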
MapReduce for index construction
(figure not reproduced)

Dynamic indexing: simplest approach
- Maintain a big main index on disk.
- New docs go into a small auxiliary index in memory.
- Search across both, merge the results.
- Periodically, merge the auxiliary index into the big index.

Roadmap
- Today: index compression.
- Next 2 weeks: perspective of the user: how can we give the user relevant results, how can we measure relevance, what types of user interactions are effective?
- After Pentecost: statistical classification and clustering in information retrieval.
- Last 3 weeks: web information retrieval.

Takeaway today
- Motivation for compression in information retrieval systems
- How can we compress the dictionary component of the inverted index?
- How can we compress the postings component of the inverted index?
- Term statistics: how are terms distributed in document collections?

❷ Compression

Why compression (in general)?
- Use less disk space. (saves money)
- Keep more stuff in memory. (increases speed)
- Increase the speed of transferring data from disk to memory: reading compressed data and decompressing it in memory is faster than reading uncompressed data. (again, increases speed)
- Premise: decompression algorithms are fast. This is true of the decompression algorithms we will use.

Why compression in information retrieval?
- First, we will consider space for the dictionary. Main motivation for dictionary compression: make it small enough to keep in main memory.
- Then for the postings file. Motivation: reduce the disk space needed and decrease the time needed to read from disk.
- Note: large search engines keep a significant part of the postings in memory.
- We will devise various compression schemes for dictionary and postings.

Lossy vs. lossless compression
- Lossy compression: discard some information. Several of the preprocessing steps we frequently use can be viewed as lossy compression: downcasing, stop word removal, Porter stemming, number elimination.
- Lossless compression: all information is preserved. This is what we mostly do in index compression.

❸ Term statistics

Model collection: the Reuters collection

symbol   statistic                                       value
N        documents                                       800,000
L        avg. tokens per document                        200
M        word types (= terms)                            400,000
         avg. bytes per token (incl. spaces/punct.)      6
         avg. bytes per token (without spaces/punct.)    4.5
         avg. bytes per term (= word type)               7.5
T        non-positional postings                         100,000,000

Effect of preprocessing for Reuters
(table not reproduced)

How big is the term vocabulary?
- That is, how many distinct words are there?
- Can we assume there is an upper bound? Not really: there are at least 70^20 ≈ 10^37 different words of length 20.
- The vocabulary will keep growing with collection size.
- Heaps' law: M = k * T^b, where M is the size of the vocabulary and T is the number of tokens in the collection.
- Typical values for the parameters k and b are: 30 ≤ k ≤ 100 and b ≈ 0.5.
- Heaps' law is linear in log-log space: log M = log k + b * log T.
- It is the simplest possible relationship between collection size and vocabulary size in log-log space.
- Empirical law

Heaps' law for Reuters
(graph not reproduced)
Vocabulary size M as a function of collection size T (number of tokens) for Reuters-RCV1. For these data, the dashed line log10 M = 0.49 * log10 T + 1.64 is the best least-squares fit.
Thus M = 10^1.64 * T^0.49, i.e. k = 10^1.64 ≈ 44 and b = 0.49.

Empirical fit for Reuters
- Good, as we just saw in the graph.
- Example: for the first 1,000,020 tokens, Heaps' law predicts 44 × 1,000,020^0.49 ≈ 38,323 terms.
- The actual number is 38,365 terms, very close to the prediction.
- Empirical observation: the fit is good in general.

Exercise
❶ What is the effect of including spelling errors vs. automatically correcting spelling errors on Heaps' law?
❷ Compute vocabulary size M: looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens. Assume a search engine indexes a total of 20,000,000,000 (2 × 10^10) pages, containing 200 tokens on average. What is the size of the vocabulary of the indexed collection as predicted by Heaps' law? (A sketch of the arithmetic follows at the end of this section.)

Zipf's law
- Now we have characterized the growth of the vocabulary in collections.
- We also want to know how many frequent vs. infrequent terms we should expect in a collection.
- In natural language, there are a few very frequent terms and very many very rare terms.
- Zipf's law: the i-th most frequent term has frequency cf_i proportional to 1/i.
- cf_i is collection frequency: the number of occurrences of the term t_i in the collection.
- So if the most frequent term (the) occurs cf_1 times, then the second most frequent term (of) has half as many occurrences, and the third most frequent term (and) has a third as many occurrences.
- Equivalent: cf_i = c * i^k and log cf_i = log c + k * log i (for k = −1)
- Zipf's law is an example of a power law.

Zipf's law for Reuters
(graph not reproduced)
The fit is not great. What is important is the key insight: few frequent terms, many rare terms.
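To make the Heaps' law arithmetic concrete (and, as announced above, to sketch exercise ❷), here is a small Python computation. It is my own illustration, assuming only the two (tokens, terms) observations given in the exercise.

```python
import math

def fit_heaps(t1, m1, t2, m2):
    """Fit M = k * T^b through two (tokens T, vocabulary size M) observations."""
    b = math.log(m2 / m1) / math.log(t2 / t1)   # slope in log-log space
    k = m1 / t1 ** b
    return k, b

# Observations from the exercise: 3,000 terms after 10^4 tokens,
# 30,000 terms after 10^6 tokens.
k, b = fit_heaps(1e4, 3_000, 1e6, 30_000)       # gives k = 30, b = 0.5

# Extrapolate to 2 * 10^10 pages * 200 tokens/page = 4 * 10^12 tokens.
T = 2e10 * 200
M = k * T ** b                                  # = 60,000,000 predicted terms
print(f"k = {k:.0f}, b = {b:.2f}, predicted vocabulary M = {M:,.0f}")
```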
❹ Dictionary compression

Dictionary compression
- The dictionary is small compared to the postings file.
- But we want to keep it in memory.
- Also: competition with other applications, cell phones, on-board computers, fast startup time.
- So compressing the dictionary is important.

Recall: dictionary as array of fixed-width entries
(figure not reproduced)
Space needed: 20 bytes (term) + 4 bytes (frequency) + 4 bytes (pointer to postings list) per entry. For Reuters: (20 + 4 + 4) × 400,000 = 11.2 MB.

Fixed-width entries are bad.
- Most of the bytes in the term column are wasted: we allot 20 bytes even for terms of length 1.
- And we can't handle HYDROCHLOROFLUOROCARBONS and SUPERCALIFRAGILISTICEXPIALIDOCIOUS.
- Average length of a term in English: 8 characters.
- How can we use 8 characters per term on average?

Dictionary as a string
(figure not reproduced)

Space for dictionary as a string
- 4 bytes per term for frequency
- 4 bytes per term for pointer to postings list
- 8 bytes (on average) for the term in the string
- 3 bytes per pointer into the string (we need log2(8 × 400,000) < 24 bits to resolve 8 × 400,000 positions)
- Space: 400,000 × (4 + 4 + 3 + 8) = 7.6 MB (compared to 11.2 MB for the fixed-width array)

Dictionary as a string with blocking
(figure not reproduced; a sketch follows at the end of this section)

Space for dictionary as a string with blocking
- Example block size k = 4.
- Where we used 4 × 3 bytes for term pointers without blocking, we now use 3 bytes for one pointer plus 4 bytes for indicating the length of each term.
- We save 12 − (3 + 4) = 5 bytes per block.
- Total savings: 400,000/4 × 5 bytes = 0.5 MB.
- This reduces the size of the dictionary from 7.6 MB to 7.1 MB.

Lookup of a term without blocking
(figure not reproduced)

Lookup of a term with blocking: (slightly) slower
(figure not reproduced)

Front coding
One block in blocked compression (k = 4):
8automata 8automate 9automatic 10automation
⇓ further compressed with front coding:
8automat∗a 1⋄e 2⋄ic 3⋄ion
(A sketch of front coding also follows at the end of this section.)

Dictionary compression for Reuters: summary

data structure                           size in MB
dictionary, fixed-width                  11.2
dictionary, term pointers into string    7.6
∼, with blocking, k = 4                  7.1
∼, with blocking & front coding          5.9

Exercise
- Which prefixes should be used for front coding? What are the tradeoffs?
- Input: list of terms (= the term vocabulary)
- Output: list of prefixes that will be used in front coding
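The dictionary-as-a-string figures above were not reproduced, so here is a Python sketch of blocking with length-prefixed terms, as promised above. It is my own illustration with hypothetical names; frequencies and postings pointers are omitted to keep it short.

```python
def build_blocked_string(sorted_terms, k=4):
    """Pack a sorted vocabulary into one string, keeping a pointer only to
    every k-th term; within a block each term carries a 1-byte length prefix."""
    pieces, block_ptrs, pos = [], [], 0
    for i, term in enumerate(sorted_terms):
        if i % k == 0:
            block_ptrs.append(pos)              # one term pointer per block
        pieces.append(chr(len(term)) + term)    # length prefix, then the term
        pos += 1 + len(term)
    return "".join(pieces), block_ptrs

def read_block(big_string, ptr, k=4):
    """Decode the up-to-k terms of one block by walking the length prefixes,
    as in the 'lookup with blocking' slide."""
    terms, pos = [], ptr
    while len(terms) < k and pos < len(big_string):
        length = ord(big_string[pos])
        terms.append(big_string[pos + 1 : pos + 1 + length])
        pos += 1 + length
    return terms

s, ptrs = build_blocked_string(["automata", "automate", "automatic", "automation"])
print(read_block(s, ptrs[0]))   # ['automata', 'automate', 'automatic', 'automation']
```

Lookup then binary-searches the block pointers (decoding the first term at each pointer for comparison) and scans linearly inside the block; that scan is the slight slowdown the slides mention.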
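Similarly, a minimal sketch of front coding for one block. Again my own illustration: the output deliberately mimics the slide's human-readable ∗/⋄ notation rather than a real byte-level format.

```python
import os

def front_code_block(block_terms):
    """Front-code one block: write the longest common prefix once (ended by *),
    then for each further term only the suffix length and suffix (after a ⋄)."""
    prefix = os.path.commonprefix(block_terms)
    first = block_terms[0]
    parts = [f"{len(first)}{prefix}*{first[len(prefix):]}"]
    for term in block_terms[1:]:
        parts.append(f"{len(term) - len(prefix)}\u22c4{term[len(prefix):]}")
    return "".join(parts)

print(front_code_block(["automata", "automate", "automatic", "automation"]))
# -> 8automat*a1⋄e2⋄ic3⋄ion, matching the slide's example
```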
❺ Postings compression

Postings compression
- The postings file is much larger than the dictionary, by a factor of at least 10.
- Key desideratum: store each posting compactly.
- A posting, for our purposes, is a docID.
- For Reuters (800,000 documents), we would use 32 bits per docID when using 4-byte integers.
- Alternatively, we can use log2 800,000 ≈ 19.6 < 20 bits per docID.
- Our goal: use a lot less than 20 bits per docID.

Key idea: store gaps instead of docIDs
- Each postings list is ordered in increasing order of docID.
- Example postings list: COMPUTER: 283154, 283159, 283202, ...
- It suffices to store the gaps between successive docIDs: 283159 − 283154 = 5, 283202 − 283159 = 43.
- Example postings list using gaps: COMPUTER: 283154, 5, 43, ...
- Gaps for frequent terms are small.
- Thus: we can encode small gaps with fewer than 20 bits.

Gap encoding
(figure not reproduced)

Variable-length encoding
- Aim:
  - For ARACHNOCENTRIC and other rare terms, we will use about 20 bits per gap (= posting).
  - For THE and other very frequent terms, we will use only a few bits per gap (= posting).
- In order to implement this, we need to devise some form of variable-length encoding.
- Variable-length encoding uses few bits for small gaps and many bits for large gaps.

Variable byte (VB) code
- Used by many commercial/research systems.
- A good low-tech blend of variable-length coding and sensitivity to alignment (contrast with the bit-level codes introduced later).
- Dedicate 1 bit (the high bit) of each byte to be a continuation bit c.
- If the gap G fits within 7 bits, binary-encode it in the 7 available bits and set c = 1.
- Else: encode the lower-order 7 bits and then use one or more additional bytes to encode the higher-order bits using the same algorithm.
- At the end, set the continuation bit of the last byte to 1 (c = 1) and of the other bytes to 0 (c = 0).

VB code examples

docIDs    824                 829        215406
gaps                          5          214577
VB code   00000110 10111000   10000101   00001101 00001100 10110001

VB code encoding algorithm
(figure not reproduced; see the sketch below)

VB code decoding algorithm
(figure not reproduced; see the sketch below)

Other variable codes
- Instead of bytes, we can also use a different "unit of alignment": 32 bits (words), 16 bits, 4 bits (nibbles), etc.
- Variable byte alignment wastes space if you have many small gaps; nibbles do better on those.
- There is recent work on word-aligned codes that efficiently "pack" a variable number of gaps into one word; see the resources at the end.

Gamma codes for gap encoding
- You can get even more compression with another type of variable-length encoding: bit-level codes.
- The gamma code is the best known of these.
- First, we need the unary code to be able to introduce the gamma code.
- Unary code: represent n as n 1s followed by a final 0.
- The unary code for 3 is 1110.
- The unary code for 40 is 11111111111111111111111111111111111111110.
- The unary code for 70 is 11111111111111111111111111111111111111111111111111111111111111111111110.

Gamma code
- Represent a gap G as a pair of length and offset.
- The offset is G in binary, with the leading bit chopped off. For example 13 → 1101 → offset 101.
- The length is the length of the offset. For 13 (offset 101), this is 3.
- Encode the length in unary code: 1110.
- The gamma code of 13 is the concatenation of length and offset: 1110101.

Gamma code examples
(table not reproduced)

Exercise
- Compute the variable byte code of 130.
- Compute the gamma code of 130.
(Both are computed in the sketches below.)

Length of gamma code
- The length of the offset is ⌊log2 G⌋ bits.
- The length of the length part is ⌊log2 G⌋ + 1 bits.
- So the length of the entire code is 2⌊log2 G⌋ + 1 bits: gamma codes are always of odd length.
- Gamma codes are within a factor of 2 of the optimal encoding length log2 G (assuming the frequency of a gap G is proportional to log2 G, which is not really true).
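The VB encoding and decoding algorithm figures were not reproduced in this transcript. Below is a hedged Python sketch of the scheme just described, including the docIDs-to-gaps step; the helper names are mine, not the slides'. It reproduces the example bytes from the table above.

```python
def vb_encode_number(gap):
    """Encode one gap: 7 payload bits per byte; the continuation bit (high
    bit) is 1 on the last byte and 0 on the others, as described above."""
    out = []
    while True:
        out.insert(0, gap % 128)      # prepend the next 7 low-order bits
        if gap < 128:
            break
        gap //= 128
    out[-1] += 128                    # set c = 1 on the last byte
    return bytes(out)

def vb_encode(doc_ids):
    """Turn a sorted postings list into gaps and VB-encode each gap."""
    stream, prev = bytearray(), 0
    for doc_id in doc_ids:
        stream += vb_encode_number(doc_id - prev)
        prev = doc_id
    return bytes(stream)

def vb_decode(stream):
    """Decode a VB-coded gap stream back into docIDs."""
    doc_ids, n, prev = [], 0, 0
    for byte in stream:
        n = 128 * n + (byte % 128)    # append 7 payload bits
        if byte >= 128:               # c = 1: this byte ends the gap
            prev += n                 # undo the gap encoding
            doc_ids.append(prev)
            n = 0
    return doc_ids

encoded = vb_encode([824, 829, 215406])      # the table's example postings list
print(" ".join(f"{b:08b}" for b in encoded))
# 00000110 10111000 10000101 00001101 00001100 10110001
assert vb_decode(encoded) == [824, 829, 215406]
print(" ".join(f"{b:08b}" for b in vb_encode_number(130)))  # exercise: 00000001 10000010
```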
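And a matching sketch of the gamma code, again my own illustration; it operates on '0'/'1' strings for readability rather than on packed bits. It reproduces the slide's example for 13 and answers the gamma half of the exercise.

```python
def gamma_encode(gap):
    """Gamma-encode a gap G >= 1: unary code of the offset length, then the
    offset (G in binary with the leading 1 chopped off)."""
    offset = bin(gap)[3:]                 # e.g. 13 -> '1101' -> offset '101'
    return "1" * len(offset) + "0" + offset

def gamma_decode_stream(bits):
    """Decode a concatenation of gamma codes; prefix-freeness makes the
    code word boundaries unambiguous."""
    gaps, i = [], 0
    while i < len(bits):
        length = bits.index("0", i) - i   # unary part: count the leading 1s
        i += length + 1                   # skip the 1s and the terminating 0
        gaps.append(int("1" + bits[i:i + length], 2))   # restore the leading 1
        i += length
    return gaps

print(gamma_encode(13))    # 1110101, as on the slide
print(gamma_encode(130))   # 111111100000010 (offset 0000010, its length 7 in unary)
assert gamma_decode_stream(gamma_encode(13) + gamma_encode(130)) == [13, 130]
```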
Gamma code: properties
- The gamma code is prefix-free: a valid code word is not a prefix of any other valid code word.
- The encoding is optimal within a factor of 3 (and within a factor of 2 when making additional assumptions).
- This result is independent of the distribution of the gaps: we can use gamma codes for any distribution. The gamma code is universal.
- The gamma code is parameter-free.

Gamma codes: alignment
- Machines have word boundaries: 8, 16, 32 bits.
- Compressing and manipulating at the granularity of bits can be slow.
- Variable byte encoding is aligned and thus potentially more efficient.
- Regardless of efficiency, variable byte is conceptually simpler, at little additional space cost.

Compression of Reuters

data structure                           size in MB
dictionary, fixed-width                  11.2
dictionary, term pointers into string    7.6
∼, with blocking, k = 4                  7.1
∼, with blocking & front coding          5.9
collection (text, XML markup etc.)       3,600.0
collection (text)                        960.0
term-document incidence matrix           40,000.0
postings, uncompressed (32-bit words)    400.0
postings, uncompressed (20 bits)         250.0
postings, variable byte encoded          116.0
postings, gamma encoded                  101.0

Term-document incidence matrix
(figure not reproduced)
An entry is 1 if the term occurs in the document (example: CALPURNIA occurs in Julius Caesar) and 0 if it doesn't (example: CALPURNIA doesn't occur in The Tempest).

Summary
- We can now create an index for highly efficient Boolean retrieval that is very space efficient.
- The index is only 10-15% of the total size of the text in the collection.
- However, we've ignored positional and frequency information.
- For this reason, the space savings are less in reality.

Takeaway today
- Motivation for compression in information retrieval systems
- How can we compress the dictionary component of the inverted index?
- How can we compress the postings component of the inverted index?
- Term statistics: how are terms distributed in document collections?

Resources
- Chapter 5 of IIR
- Resources at http://ifnlp.org/ir
- Original publication on word-aligned binary codes by Anh and Moffat (2005); also: Anh and Moffat (2006a)
- Original publication on variable byte codes by Scholer, Williams, Yiannis and Zobel (2002)
- More details on compression (including compression of positions and frequencies) in Zobel and Moffat (2006)