Question? Leave a message!

Inverted index in Mapreduce

inverted index mapreduce algorithm and inverted index mapreduce example
WilliamsMcmahon Profile Pic
WilliamsMcmahon,United States,Professional
Published Date:20-07-2017
Website URL
Introduction to Information Retrieval Introduction to Information Retrieval Index Construction 1Introduction to Information Retrieval Dictionary as array of fixed-width entries space needed: 20 bytes 4 bytes 4 bytes 4 4Introduction to Information Retrieval B-tree for looking up entries in array 5 5Introduction to Information Retrieval Wildcard queries using a permuterm index Queries:  For X, look up X  For X, look up X  For X, look up X  For X, look up X  For XY, look up YX 6 6Introduction to Information Retrieval k-gram indexes for spelling correction: bordroom 7 7Introduction to Information Retrieval Levenshtein distance for spelling correction 8 8Introduction to Information Retrieval Take-away  Two index construction algorithms: BSBI (simple) and SPIMI (more realistic)  Distributed index construction: MapReduce  Dynamic index construction: how to keep the index up-to-date as the collection changes 10 10Introduction to Information Retrieval Hardware basics  Many design decisions in information retrieval are based on hardware constraints.  We begin by reviewing hardware basics that we’ll need in this course. 12 12Introduction to Information Retrieval Hardware basics  Access to data is much faster in memory than on disk. (roughly a factor of 10)  Disk seeks are “idle” time: No data is transferred from disk while the disk head is being positioned.  To optimize transfer time from disk to memory: one large chunk is faster than many small chunks.  Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks). Block sizes: 8KB to 256 KB  Servers used in IR systems typically have several GB of main memory, sometimes tens of GB, and TBs or 100s of GB of disk space.  Fault tolerance is expensive: It’s cheaper to use many regular machines than one fault tolerant machine. 13 13Introduction to Information Retrieval Some stats (ca. 2008) symbol statistic value −3 s average seek time 5 ms = 5 × 10 s −8 b transfer time per byte 0.02 μs = 2 × 10 s 9 −1 processor’s clock rate 10 s −8 P lowlevel operation (e.g., compare & swap a 0.01 μs = 10 s word) size of main memory several GB size of disk space 1 TB or more 14 14Introduction to Information Retrieval RCV1 collection  Shakespeare’s collected works are not large enough for demonstrating many of the points in this course.  As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.  English newswire articles sent over the wire in 1995 and 1996 (one year). 15 15Introduction to Information Retrieval A Reuters RCV1 document 16 16Introduction to Information Retrieval Reuters RCV1 statistics N documents 800,000 L tokens per document 200 M terms (= word types) 400,000 bytes per token (incl. spaces/punct.) 6 bytes per token (without spaces/punct.) 4.5 bytes per term (= word type) 7.5 T non-positional postings 100,000,000 Exercise: Average frequency of a term (how many tokens)? 4.5 bytes per word token vs. 7.5 bytes per word type: why the difference? How many positional postings? 17 17Introduction to Information Retrieval Outline ❶ Recap ❷ Introduction ❸ BSBI algorithm ❹ SPIMI algorithm ❺ Distributed indexing ❻ Dynamic indexing 18Introduction to Information Retrieval Goal: construct the inverted Index dictonary postings 19 19Introduction to Information Retrieval Index construction in IIR 1: Sort postings in memory 20 20Introduction to Information Retrieval Sort-based index construction  As we build index, we parse docs one at a time.  The final postings for any term are incomplete until the end.  Can we keep all postings in memory and then do the sort in- memory at the end?  No, not for large collections  At 10–12 bytes per postings entry, we need a lot of space for large collections.  T = 100,000,000 in the case of RCV1: we can do this in memory on a typical machine in 2010.  But in-memory index construction does not scale for large collections.  Thus: We need to store intermediate results on disk. 21 21Introduction to Information Retrieval Same algorithm for disk?  Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?  No: Sorting T = 100,000,000 records on disk is too slow – too many disk seeks.  We need an external sorting algorithm. 22 22Introduction to Information Retrieval “External” sorting algorithm (using few disk seeks)  We must sort T = 100,000,000 non-positional postings.  Each posting has size 12 bytes (4+4+4: termID, docID, document frequency).  Define a block to consist of 10,000,000 such postings  We can easily fit that many postings into memory.  We will have 10 such blocks for RCV1.  Basic idea of algorithm:  For each block: (i) accumulate postings, (ii) sort in memory, (iii) write to disk  Then merge the blocks into one long sorted order. 23 23Introduction to Information Retrieval Merging two blocks 24 24