Index Compression in Information Retrieval

Today (Ch. 5)
- Collection statistics in more detail (with RCV1)
- How big will the dictionary and postings be?
- Dictionary compression
- Postings compression

Why compression (in general)? (Ch. 5)
- Use less disk space: saves a little money.
- Keep more stuff in memory: increases speed.
- Increase the speed of data transfer from disk to memory: reading compressed data and then decompressing it is faster than reading uncompressed data.
- Premise: decompression algorithms are fast. This is true of the decompression algorithms we use.

Why compression for inverted indexes? (Ch. 5)
- Dictionary: make it small enough to keep in main memory; make it so small that you can keep some postings lists in main memory too.
- Postings file(s): reduce the disk space needed; decrease the time needed to read postings lists from disk.
- Large search engines keep a significant part of the postings in memory; compression lets you keep more in memory.
- We will devise various IR-specific compression schemes.

Recall Reuters RCV1 (Sec. 5.1)

  symbol  statistic                                     value
  N       documents                                     800,000
  L       avg. tokens per document                      200
  M       terms (= word types)                          400,000
          avg. bytes per token (incl. spaces/punct.)    6
          avg. bytes per token (without spaces/punct.)  4.5
          avg. bytes per term                           7.5
          non-positional postings                       100,000,000

Index parameters vs. what we index (details: IIR Table 5.1, p. 80; Sec. 5.1)

                 dictionary              non-positional postings   positional postings
                 size (K)  ∆%  cumul %   size (K)  ∆%  cumul %     size (K)  ∆%  cumul %
  Unfiltered       484                   109,971                   197,879
  No numbers       474     -2     -2     100,680    -8     -8      179,158    -9     -9
  Case folding     392    -17    -19      96,969    -3    -12      179,158     0     -9
  30 stopwords     391     -0    -19      83,390   -14    -24      121,858   -31    -38
  150 stopwords    391     -0    -19      67,002   -30    -39       94,517   -47    -52
  Stemming         322    -17    -33      63,812    -4    -42       94,517     0    -52

- Exercise: give intuitions for all the '0' entries. Why do some zero entries correspond to big deltas in other columns?

Lossless vs. lossy compression (Sec. 5.1)
- Lossless compression: all information is preserved. This is what we mostly do in IR.
- Lossy compression: discard some information.
- Several of the preprocessing steps can be viewed as lossy compression: case folding, stop word removal, stemming, number elimination.
- Chapter/Lecture 7: prune postings entries that are unlikely to turn up in the top-k list for any query. Almost no loss of quality for the top-k list.

Vocabulary vs. collection size (Sec. 5.1)
- How big is the term vocabulary? That is, how many distinct words are there?
- Can we assume an upper bound? Not really: there are at least 70^20 ≈ 10^37 different words of length 20.
- In practice, the vocabulary will keep growing with the collection size, especially with Unicode.

Vocabulary vs. collection size: Heaps' law (Sec. 5.1)
- Heaps' law: M = k * T^b.
- M is the size of the vocabulary; T is the number of tokens in the collection.
- Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5.
- In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope of about 1/2.
- It is the simplest possible relationship between the two in log-log space.
- An empirical finding ("empirical law").
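To make the law concrete, here is a minimal Python sketch of M = k * T^b. The constants k = 44 and b = 0.49 are the Reuters RCV1 least-squares fit quoted on the next slide, not universal values; any other collection needs its own fit.

```python
# Minimal sketch of Heaps' law: M = k * T**b.
# k = 44, b = 0.49 are the RCV1 fit from the next slide (assumption:
# they describe RCV1 only; refit k and b for any other collection).

def heaps_vocabulary_size(T, k=44, b=0.49):
    """Predicted vocabulary size M for a collection of T tokens."""
    return k * T ** b

for T in (10_000, 1_000_020, 10**9):
    print(f"T = {T:>13,} tokens -> M = {heaps_vocabulary_size(T):,.0f} terms")
```

For T = 1,000,020 this prints about 38,323 terms, matching the prediction quoted on the next slide.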
Heaps' law for RCV1 (Fig. 5.1, p. 81; Sec. 5.1)
- For RCV1, the dashed line log10 M = 0.49 * log10 T + 1.64 is the best least-squares fit.
- Thus M = 10^1.64 * T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
- Good empirical fit for Reuters RCV1: for the first 1,000,020 tokens, the law predicts 38,323 terms; the actual number is 38,365 terms.

Exercises (Sec. 5.1)
- What is the effect of including spelling errors, vs. automatically correcting spelling errors, on Heaps' law?
- Compute the vocabulary size M for this scenario: looking at a collection of web pages, you find that there are 3,000 different terms in the first 10,000 tokens and 30,000 different terms in the first 1,000,000 tokens. Assume a search engine indexes a total of 20,000,000,000 (2 x 10^10) pages, containing 200 tokens on average. What is the size of the vocabulary of the indexed collection as predicted by Heaps' law?

Zipf's law (Sec. 5.1)
- Heaps' law gives the vocabulary size in collections.
- We also study the relative frequencies of terms.
- In natural language, there are a few very frequent terms and very many very rare terms.
- Zipf's law: the i-th most frequent term has frequency proportional to 1/i, i.e. cf_i ∝ 1/i = K/i, where K is a normalizing constant.
- cf_i is the collection frequency: the number of occurrences of the term t_i in the collection.

Zipf consequences (Sec. 5.1)
- If the most frequent term (the) occurs cf_1 times, then the second most frequent term (of) occurs cf_1/2 times, and the third most frequent term (and) occurs cf_1/3 times ...
- Equivalently: cf_i = K/i where K is a normalizing factor, so log cf_i = log K - log i.
- Linear relationship between log cf_i and log i.
- Another power-law relationship.

Zipf's law for Reuters RCV1 (Sec. 5.1)
- [Figure: log-log plot of term rank i vs. collection frequency cf_i for RCV1.]
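The rank-frequency prediction is easy to sanity-check in code. In this small sketch, cf_1 = 1,000,000 is an invented collection frequency for illustration (not an RCV1 statistic); the printed log values show the linear log-log relationship.

```python
# Small sketch of Zipf's law: cf_i = K / i, taking K = cf_1.
# cf_1 = 1_000_000 is a made-up collection frequency, illustration only.
import math

cf_1 = 1_000_000

def zipf_cf(i):
    """Predicted collection frequency of the i-th most frequent term."""
    return cf_1 / i

# log cf_i = log K - log i: a straight line of slope -1 in log-log space.
for i in (1, 2, 3, 10, 100):
    print(f"rank {i:>3}: cf_i = {zipf_cf(i):>11,.0f}   "
          f"log10 cf_i = {math.log10(zipf_cf(i)):.2f}")
```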
Compression (Ch. 5)
- Now we will consider compressing the space for the dictionary and postings.
- Basic Boolean index only.
- No study of positional indexes, etc.
- We will consider various compression schemes.

DICTIONARY COMPRESSION (Sec. 5.2)

Why compress the dictionary? (Sec. 5.2)
- Search begins with the dictionary, so we want to keep it in memory.
- Its memory footprint competes with other applications.
- Embedded/mobile devices may have very little memory.
- Even if the dictionary isn't in memory, we want it to be small for a fast search startup time.
- So compressing the dictionary is important.

Dictionary storage: first cut (Sec. 5.2)
- Array of fixed-width entries: 400,000 terms at 28 bytes/term = 11.2 MB.

  Term (20 bytes)   Freq. (4 bytes)   Postings ptr. (4 bytes)
  a                 656,265           ->
  aachen            65                ->
  ....              ....              ....
  zulu              221               ->

- A dictionary search structure sits on top of this array.

Fixed-width terms are wasteful (Sec. 5.2)
- Most of the bytes in the Term column are wasted: we allot 20 bytes even for one-letter terms.
- And we still can't handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons.
- Written English averages 4.5 characters per word.
- Exercise: why is/isn't this the number to use for estimating the dictionary size?
- The average dictionary word in English is 8 characters: short words dominate token counts but not the type average.
- How do we use 8 characters per dictionary term?

Compressing the term list: Dictionary-as-a-String (Sec. 5.2)
- Store the dictionary as a (long) string of characters; a pointer to the next word marks the end of the current word.
- Hope to save up to 60% of dictionary space.

  ....systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo....

  Freq.   Postings ptr.   Term ptr.
  33      ->              ->
  29      ->              ->
  44      ->              ->
  126     ->              ->

- Total string length = 400K x 8B = 3.2 MB.
- Term pointers must resolve 3.2M positions: log2(3.2M) = 22 bits = 3 bytes.
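A minimal Python sketch of the dictionary-as-a-string idea, assuming a toy seven-term vocabulary; in the real structure, each row would also hold the frequency, the postings pointer, and a 3-byte term pointer.

```python
# Minimal sketch of dictionary-as-a-string (toy vocabulary; assumption:
# a real dictionary holds ~400K terms). Terms are concatenated into one
# string; each entry stores an offset ("term pointer") into it, and the
# next entry's offset marks the end of the current term.

terms = sorted(["systile", "syzygetic", "syzygial", "syzygy",
                "szaibelyite", "szczecin", "szomo"])

term_string = "".join(terms)      # the (long) string of characters
term_ptrs, offset = [], 0         # offsets into term_string
for t in terms:
    term_ptrs.append(offset)
    offset += len(t)
term_ptrs.append(offset)          # sentinel: end of the string

def term_at(i):
    """Recover the i-th term from the string via its term pointers."""
    return term_string[term_ptrs[i]:term_ptrs[i + 1]]

def lookup(term):
    """Binary search over the sorted, string-packed dictionary."""
    lo, hi = 0, len(terms) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        t = term_at(mid)
        if t == term:
            return mid            # row index where freq/postings ptr live
        if t < term:
            lo = mid + 1
        else:
            hi = mid - 1
    return None

print(term_string)                # systilesyzygetic...szomo
print(lookup("syzygy"))           # 3
```

With 400K terms of 8 bytes on average, the string is about 3.2 MB, and each entry trades 20 bytes of fixed-width term storage for a 3-byte term pointer; that trade is where the savings come from.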