Question? Leave a message!




Index construction in MapReduce

Index construction in MapReduce
WilliamsMcmahon Profile Pic
WilliamsMcmahon,United States,Professional
Published Date:20-07-2017
Website URL
Comment
Introduction to Information Retrieval Introduction to Information Retrieval Index Construction 1Introduction to Information Retrieval Overview ❶ Recap ❷ Introduction ❸ BSBI algorithm ❹ SPIMI algorithm ❺ Distributed indexing ❻ Dynamic indexing 2Introduction to Information Retrieval Outline ❶ Recap ❷ Introduction ❸ BSBI algorithm ❹ SPIMI algorithm ❺ Distributed indexing ❻ Dynamic indexing 3Introduction to Information Retrieval Dictionary as array of fixed-width entries space needed: 20 bytes 4 bytes 4 bytes 4 4Introduction to Information Retrieval B-tree for looking up entries in array 5 5Introduction to Information Retrieval Wildcard queries using a permuterm index Queries:  For X, look up X  For X, look up X  For X, look up X  For X, look up X  For XY, look up YX 6 6Introduction to Information Retrieval k-gram indexes for spelling correction: bordroom 7 7Introduction to Information Retrieval Levenshtein distance for spelling correction 8 8Introduction to Information Retrieval Exercise: Understand Peter Norvig’s spelling corrector 9 9Introduction to Information Retrieval Take-away  Two index construction algorithms: BSBI (simple) and SPIMI (more realistic)  Distributed index construction: MapReduce  Dynamic index construction: how to keep the index up-to-date as the collection changes 10 10Introduction to Information Retrieval Outline ❶ Recap ❷ Introduction ❸ BSBI algorithm ❹ SPIMI algorithm ❺ Distributed indexing ❻ Dynamic indexing 11Introduction to Information Retrieval Hardware basics  Many design decisions in information retrieval are based on hardware constraints.  We begin by reviewing hardware basics that we’ll need in this course. 12 12Introduction to Information Retrieval Hardware basics  Access to data is much faster in memory than on disk. (roughly a factor of 10)  Disk seeks are “idle” time: No data is transferred from disk while the disk head is being positioned.  To optimize transfer time from disk to memory: one large chunk is faster than many small chunks.  Disk I/O is block-based: Reading and writing of entire blocks (as opposed to smaller chunks). Block sizes: 8KB to 256 KB  Servers used in IR systems typically have several GB of main memory, sometimes tens of GB, and TBs or 100s of GB of disk space.  Fault tolerance is expensive: It’s cheaper to use many regular machines than one fault tolerant machine. 13 13Introduction to Information Retrieval Some stats (ca. 2008) symbol statistic value −3 s average seek time 5 ms = 5 × 10 s −8 b transfer time per byte 0.02 μs = 2 × 10 s 9 −1 processor’s clock rate 10 s −8 P lowlevel operation (e.g., compare & swap a 0.01 μs = 10 s word) size of main memory several GB size of disk space 1 TB or more 14 14Introduction to Information Retrieval RCV1 collection  Shakespeare’s collected works are not large enough for demonstrating many of the points in this course.  As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.  English newswire articles sent over the wire in 1995 and 1996 (one year). 15 15Introduction to Information Retrieval A Reuters RCV1 document 16 16Introduction to Information Retrieval Reuters RCV1 statistics N documents 800,000 L tokens per document 200 M terms (= word types) 400,000 bytes per token (incl. spaces/punct.) 6 bytes per token (without spaces/punct.) 4.5 bytes per term (= word type) 7.5 T non-positional postings 100,000,000 Exercise: Average frequency of a term (how many tokens)? 4.5 bytes per word token vs. 7.5 bytes per word type: why the difference? How many positional postings? 17 17Introduction to Information Retrieval Outline ❶ Recap ❷ Introduction ❸ BSBI algorithm ❹ SPIMI algorithm ❺ Distributed indexing ❻ Dynamic indexing 18Introduction to Information Retrieval Goal: construct the inverted Index dictonary postings 19 19Introduction to Information Retrieval Index construction in IIR 1: Sort postings in memory 20 20