Index construction in information retrieval

Plan

Last lecture:
- Dictionary data structures
- Tolerant retrieval: wildcards, spell correction, Soundex

This time:
- Index construction

Index construction (Ch. 4)

- How do we construct an index?
- What strategies can we use with limited main memory?

Hardware basics (Sec. 4.1)

- Many design decisions in information retrieval are based on the characteristics of hardware, so we begin by reviewing hardware basics.
- Access to data in memory is much faster than access to data on disk.
- Disk seeks: no data is transferred from disk while the disk head is being positioned. Therefore, transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
- Disk I/O is block-based: entire blocks are read and written, as opposed to smaller chunks. Block sizes range from 8 KB to 256 KB.
- Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB. Available disk space is several (2–3) orders of magnitude larger.
- Fault tolerance is very expensive: it's much cheaper to use many regular machines than one fault-tolerant machine.

Hardware assumptions for this lecture (Sec. 4.1)

symbol   statistic                        value
s        average seek time                5 ms = 5 × 10⁻³ s
b        transfer time per byte           0.02 μs = 2 × 10⁻⁸ s
         processor's clock rate           10⁹ s⁻¹
p        low-level operation              0.01 μs = 10⁻⁸ s
         (e.g., compare & swap a word)
         size of main memory              several GB
         size of disk space               1 TB or more

RCV1: our collection for this lecture (Sec. 4.2)

- Shakespeare's collected works definitely aren't large enough to demonstrate many of the points in this course.
- The collection we'll use isn't really large enough either, but it's publicly available and is at least a more plausible example.
- As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection: one year of Reuters newswire (part of 1996 and 1997).

[Figure: a sample Reuters RCV1 document]

Reuters RCV1 statistics (Sec. 4.2)

symbol   statistic                        value
N        documents                        800,000
L        avg. tokens per doc              200
M        terms (= word types)             400,000
         avg. bytes per token             6 (incl. spaces/punct.)
         avg. bytes per token             4.5 (without spaces/punct.)
         avg. bytes per term              7.5
         non-positional postings          100,000,000

Why 4.5 bytes per word token but 7.5 bytes per word type? (Hint: the most frequent tokens tend to be short function words, while the type inventory includes many rare, longer words.)

Recall index construction from IIR chapter 1 (Sec. 4.2)

Documents are parsed to extract words, and these are saved together with the document ID:

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

In order of appearance, parsing yields the (term, doc) pairs:
(I,1) (did,1) (enact,1) (julius,1) (caesar,1) (I,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1) (so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)
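To make this parsing pass concrete, here is a minimal Python sketch. The regex tokenizer and the parse helper are simplifying assumptions for this toy example, not RCV1's actual parser:

```python
import re

def parse(doc_id, text):
    """First pass: emit one (term, doc_id) pair per token, in order of appearance."""
    for token in re.findall(r"[a-z']+", text.lower()):
        yield (token, doc_id)

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

pairs = [p for doc_id, text in docs.items() for p in parse(doc_id, text)]
print(pairs[:4])  # [('i', 1), ('did', 1), ('enact', 1), ('julius', 1)]
```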
Key step (Sec. 4.2)

- After all documents have been parsed, the inverted file is sorted by term (and by document ID within a term).
- We focus on this sort step: for RCV1 we have 100M items to sort.

For the two-document example, sorting produces:
(ambitious,2) (be,2) (brutus,1) (brutus,2) (capitol,1) (caesar,1) (caesar,2) (caesar,2) (did,1) (enact,1) (hath,2) (I,1) (I,1) (i',1) (it,2) (julius,1) (killed,1) (killed,1) (let,2) (me,1) (noble,2) (so,2) (the,1) (the,2) (told,2) (was,1) (was,2) (with,2) (you,2)

Scaling index construction (Sec. 4.2)

- In-memory index construction does not scale: we can't stuff the entire collection into memory, sort it, and write it back.
- How can we construct an index for very large collections, taking into account the hardware constraints we just reviewed (memory, disk, speed, etc.)?

Sort-based index construction (Sec. 4.2)

- As we build the index, we parse docs one at a time. While building the index, we cannot easily exploit compression tricks (we can, but it is much more complex), and the final postings for any term are incomplete until the end.
- At 12 bytes per non-positional postings entry (term, doc, freq), this demands a lot of space for large collections: T = 100,000,000 in the case of RCV1.
- We could do this in memory in 2009, but typical collections are much larger; e.g., the New York Times provides an index of 150 years of newswire.
- Thus: we need to store intermediate results on disk.

Sort using disk as "memory"? (Sec. 4.2)

- Can we use the same index construction algorithm for larger collections, but using disk instead of memory?
- No: sorting T = 100,000,000 records on disk is too slow, because it requires too many disk seeks.
- We need an external sorting algorithm.

Bottleneck (Sec. 4.2)

- We parse and build postings entries one doc at a time, then sort the postings entries by term (and by doc within each term).
- Doing this with random disk seeks would be too slow: we must sort T = 100M records.
- If every comparison took 2 disk seeks, and N items could be sorted with N log₂ N comparisons, how long would this take? (Roughly: 10⁸ × log₂ 10⁸ ≈ 2.7 × 10⁹ comparisons, at 2 × 5 ms each, is about 2.7 × 10⁷ seconds, i.e., well over 300 days.)

BSBI: blocked sort-based indexing (Sec. 4.2)
Sorting with fewer disk seeks:

- 12-byte (4+4+4) records (term, doc, freq) are generated as we parse docs.
- We must now sort 100M such 12-byte records by term.
- Define a block as 10M such records: we can easily fit a couple of blocks into memory, and we will have 10 such blocks to start with.
- Basic idea of the algorithm: accumulate postings for each block, sort the block, and write it to disk; then merge the blocks into one long sorted order (a code sketch of both phases appears at the end of this section).

Sorting 10 blocks of 10M records (Sec. 4.2)

- First, read each block and sort within it: quicksort takes 2N ln N expected steps, in our case 2 × (10M ln 10M) steps.
- Exercise: estimate the total time to read each block from disk and quicksort it. (A rough calculation follows.)
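One back-of-the-envelope answer to the exercise, plugging in the hardware assumptions from Sec. 4.1 (a sketch: the variable names are ours, and real constant factors will differ):

```python
import math

s = 5e-3   # average seek time (s), from the hardware assumptions
b = 2e-8   # transfer time per byte (s)
p = 1e-8   # time per low-level operation, e.g. one comparison (s)

n = 10_000_000        # records in one block
block_bytes = n * 12  # 12-byte (term, doc, freq) records

read_time = s + block_bytes * b      # one seek, then sequential transfer
sort_time = 2 * n * math.log(n) * p  # 2 N ln N expected quicksort steps

print(f"read ≈ {read_time:.1f} s, sort ≈ {sort_time:.1f} s")
# read ≈ 2.4 s, sort ≈ 3.2 s, so roughly 5–6 seconds per block
```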
- 10 times this estimate gives us 10 sorted runs of 10M records each.
- Done straightforwardly, this needs 2 copies of the data on disk, but that can be optimized.
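To tie the two BSBI phases together, here is a minimal Python sketch under simplifying assumptions: postings are stored as tab-separated text lines rather than packed 12-byte records, and the file paths are illustrative:

```python
import heapq

def write_sorted_block(pairs, path):
    """Phase 1: accumulate one block of (term, doc_id) pairs in memory,
    sort it, and write it to disk as a sorted run."""
    with open(path, "w") as f:
        for term, doc_id in sorted(pairs):
            f.write(f"{term}\t{doc_id}\n")

def merge_blocks(block_paths, out_path):
    """Phase 2: stream the sorted runs through a k-way heap merge,
    producing one long sorted order without re-sorting on disk."""
    runs = [open(path) for path in block_paths]
    with open(out_path, "w") as out:
        for line in heapq.merge(*runs):  # lazily keeps one line per run
            out.write(line)
    for run in runs:
        run.close()
```

Because heapq.merge is lazy, only one buffered line per run needs to be in memory at a time, which is what makes the merge disk-friendly. Note that this sketch compares doc IDs as strings, which is fine for illustration; a real implementation would merge binary termID–docID records.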