Lecture Notes on Data Structures and Algorithms for Big Databases

use of data structures and algorithms for big databases pdf free download
GeorgeThomson Profile Pic
GeorgeThomson,Greenland,Professional
Published Date:09-07-2017
Your Website URL(Optional)
Comment
Data Structures and Algorithms for Big Databases Michael A. Bender Bradley C. Kuszmaul Stony Brook & Tokutek MIT & Tokutekoy vey 365 42 Big data problem data ingestion queries + answers ??? ??? ??? ??? query processor data indexing Important and universal problem. Hot topic. 2oy vey 365 42 For on-disk data, one sees funny tradeoffs in the speeds Big data problem of data ingestion, query speed, and freshness of data. data ingestion queries + answers ??? ??? ??? ??? query processor data indexing Important and universal problem. Hot topic. 242 Funny tradeoff in ingestion, querying, freshness • “I'm trying to create indexes on a table with 308 million rows. It took 20 minutes to load the table but 10 days to build indexes on it.” MySQL bug 9544 ‣ • Typical record of all kinds of metadata is 150 bytes. • “Select queries were slow until I added an index onto the timestamp field... Adding the index really helped our reporting, BUT now the inserts are taking • Different parts of metadata are accessed separately. forever.” Comment on mysqlperformanceblog.com ‣ • “They indexed their tables, and indexed them well, And lo, did the queries run quick But that wasn’t the last of their troubles, to tell– Their insertions, like molasses, ran thick.” Not from Alice in Wonderland by Lewis Carroll ‣ queries + answers ??? data ingestion query processor 3 Don’t Thrash: How to Cache Your Hash in Flash data indexing42 Funny tradeoff in ingestion, querying, freshness • “I'm trying to create indexes on a table with 308 million rows. It took 20 minutes to load the table but 10 days to build indexes on it.” MySQL bug 9544 ‣ • Typical record of all kinds of metadata is 150 bytes. • “Select queries were slow until I added an index onto the timestamp field... Adding the index really helped our reporting, BUT now the inserts are taking • Different parts of metadata are accessed separately. forever.” Comment on mysqlperformanceblog.com ‣ • “They indexed their tables, and indexed them well, And lo, did the queries run quick But that wasn’t the last of their troubles, to tell– Their insertions, like molasses, ran thick.” Not from Alice in Wonderland by Lewis Carroll ‣ queries + answers ??? data ingestion query processor 4 Don’t Thrash: How to Cache Your Hash in Flash data indexing42 Funny tradeoff in ingestion, querying, freshness • “I'm trying to create indexes on a table with 308 million rows. It took 20 minutes to load the table but 10 days to build indexes on it.” MySQL bug 9544 ‣ • Typical record of all kinds of metadata is 150 bytes. • “Select queries were slow until I added an index onto the timestamp field... Adding the index really helped our reporting, BUT now the inserts are taking • Different parts of metadata are accessed separately. forever.” Comment on mysqlperformanceblog.com ‣ • “They indexed their tables, and indexed them well, And lo, did the queries run quick But that wasn’t the last of their troubles, to tell– Their insertions, like treacle, ran thick.” Not from Alice in Wonderland by Lewis Carroll ‣ queries + answers ??? data ingestion query processor 5 Don’t Thrash: How to Cache Your Hash in Flash data indexing® Fractal-tree This tutorial index ɛ B -tree • Better data LSM structures tree significantly mitigate the insert/query/ freshness tradeoff. • These structures scale to much larger sizes while efficiently using the memory- hierarchy. 6What we mean by Big Data We don’t define Big Data in terms of TB, PB, EB. By Big Data, we mean • The data is too big to fit in main memory. • We need data structures on the data. • Words like “index” or “metadata” suggest that there are underlying data structures. • These data structures are also too big to fit in main memory. 7 Don’t Thrash: How to Cache Your Hash in FlashIn this tutorial we study the underlying data structures for managing big data. SQL File systems NoSQL NewSQL 8 Don’t Thrash: How to Cache Your Hash in FlashBut enough about databases... ... more about us. Tokutek A few years ago we started working together on I/O-efficient and cache-oblivious data structures. Michael Martin Bradley Along the way, we started Tokutek to commercialize our research. 10 Don’t Thrash: How to Cache Your Hash in FlashStorage engines in MySQL Application MySQL Database SQL Processing, Query Optimization… File System Tokutek sells TokuDB, an ACID compliant, closed-source storage engine for MySQL. 11 Don’t Thrash: How to Cache Your Hash in FlashStorage engines in MySQL Application MySQL Database SQL Processing, Query Optimization… File System Tokutek sells TokuDB, an ACID compliant, closed-source storage engine for MySQL. 11 Don’t Thrash: How to Cache Your Hash in FlashStorage engines in MySQL Application TokuDB also has a Berkeley DB API and can MySQL Database be used independent of MySQL. SQL Processing, Query Optimization… File System Tokutek sells TokuDB, an ACID compliant, closed-source storage engine for MySQL. 11 Don’t Thrash: How to Cache Your Hash in FlashStorage engines in MySQL Application TokuDB also has a Berkeley DB API and can MySQL Database be used independent of MySQL. SQL Processing, Query Optimization… Many of the data structures ideas in this tutorial were used in developing TokuDB. But this tutorial is about data structures and File System algorithms, not TokuDB or any other platform. Tokutek sells TokuDB, an ACID compliant, closed-source storage engine for MySQL. 11 Don’t Thrash: How to Cache Your Hash in FlashOur Mindset • This tutorial is self contained. • We want to teach. • If something we say isn’t clear to you, please ask questions or ask us to clarify/repeat something. • You should be comfortable using math. • You should want to listen to data structures for an afternoon. 12 Don’t Thrash: How to Cache Your Hash in FlashTopics and Outline for this Tutorial I/O model and cache-oblivious analysis. Write-optimized data structures. How write-optimized data structures can help file systems. Block-replacement algorithms. Indexing strategies. Log-structured merge trees. Bloom filters. 13 Don’t Thrash: How to Cache Your Hash in FlashData Structures and Algorithms for Big Data Module 1: I/O Model and Cache- Oblivious Analysis Michael A. Bender Bradley C. Kuszmaul Stony Brook & Tokutek MIT & TokutekStory for Module • If we want to understand the performance of data structures within databases we need algorithmic models for modeling I/Os. • There’s a long history of models for understanding the memory hierarchy. Many are beautiful. Most have not found practical use. • Two approaches are very powerful. • That’s what we’ll present here so we have a foundation for the rest of the tutorial. 2 I/O modelsModeling I/O Using the Disk Access Model How computation works: • Data is transferred in blocks between RAM and disk. • The of block transfers dominates the running time. Goal: Minimize of block transfers • Performance bounds are parameterized by block size B, memory size M, data size N. B RAM Disk M B Aggarwal+Vitter ’88 3 I/O models