Question? Leave a message!

GFS, Mapreduce and Bigtable

GFS, Mapreduce and Bigtable
GFS, Mapreduce and Bigtable Seminar on big data management Lecturer: Jiaheng Lu Spring 2016 1.2.2016 1Google big data techniquesGoogle File System MapReduce model Bigtable data storage platform Google File System Google File System (GFS) A scalable distributed file system for large distributed data intensive applications Multiple GFS clusters are currently deployed. The largest ones (in 2003) have: 1000+ storage nodes 300+ TeraBytes of disk storage heavily accessed by hundreds of clients on distinct machines Shares many same goals as previous distributed file systems performance, scalability, reliability, etc GFS design has been driven by four key observation of Google application workloads and technological environment Observations 1 1. Component failures are the norm constant monitoring, error detection, fault tolerance and automatic recovery are integral to the system 2. Huge files (by traditional standards) Multi GB files are common I/O operations and blocks sizes must be revisited Observations 2 3. Most files are mutated by appending new data This is the focus of performance optimization and atomicity guarantees 4. Codesigning the applications and APIs benefits overall system by increasing flexibility Design Cluster consists of a single master and multiple chunkservers and is accessed by multiple clients Master Maintains all file system metadata. names space, access control info, file to chunk mappings, chunk (including replicas) location, etc. Periodically communicates with chunkservers in HeartBeat messages to give instructions and check state Master Helps make sophisticated chunk placement and replication decision, using global knowledge For reading and writing, client contacts Master to get chunk locations, then deals directly with chunkservers Master is not a bottleneck for reads/writes Files are broken into chunks. Each chunk has a immutable globally unique 64bit chunk handle. handle is assigned by the master at chunk creation Chunk size is 64 MB Each chunk is replicated on 3 (default) servers Linked to apps using the file system API. Communicates with master and chunkservers for reading and writing Master interactions only for metadata Chunkserver interactions for data Only caches metadata information Data is too large to cache. Locations Master does not keep a persistent record of locations of chunks and replicas. Polls chunkservers at startup, and when new chunkservers join/leave for this. Stays up to date by controlling placement of new chunks and through HeartBeat messages (when monitoring chunkservers) to MapReduceMapReduce: Insight  A Toy Problem:   We have 10 billion documents   Average documents size is 20KB   10 Billion docs == 200 TB   We want build a language model of the Web:  – Basically count how many times each word occurMapReduce: Insight  for each document d  for each word w in d wordcountw++;  Approximately 1 month.  Assumptions:  1. All disk reads are sequential  2. Dictionary fits into the memoryMapReduce Programming Model  Inspired from map and reduce operations commonly used in functional programming languages like Lisp.  Users implement interface of two primary methods:  1. Map: (key1, val1) → (key2, val2)  2. Reduce: (key2, val2) → val3Map operation  Map, a pure function, written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs.  e.g. (doc—id, doccontent)  Draw an analogy to SQL, map can be visualized as groupby clause of an aggregate query.Reduce operation  On completion of map phase, all the intermediate values for a given output key are combined together into a list and given to a reducer.  Can be visualized as aggregate function (e.g., average) that is computed over all the rows with the same groupby attribute.Pseudocode map(String inputkey, String inputvalue): // inputkey: document name // inputvalue: document contents for each word w in inputvalue: EmitIntermediate(w, "1"); reduce(String outputkey, Iterator intermediatevalues): // outputkey: a word // outputvalues: a list of counts int result = 0; for each v in intermediatevalues: result += ParseInt(v); Emit(AsString(result));MapReduce: Execution overviewMapReduce: ExampleMapReduce in Parallel: ExampleMapReduce: Some More Apps MapReduce Programs In Google Source Tree  Distributed Grep.  Count of URL Access Frequency.  Clustering (Kmeans)  Graph Algorithms.  Indexing SystemsGoogle File System MapReduce model Bigtable data storage platform A Distributed Storage System for Structured DataIntroduction  BigTable is a distributed storage system for managing structured data.  Designed to scale to a very large size  Petabytes of data across thousands of servers  Used for many Google projects  Web indexing, Personalized Search, Google Earth, Google Analytics, Google Finance, …  Flexible, highperformance solution for all of Google’s productsMotivation  Lots of (semi)structured data at Google  URLs: ‒ Contents, crawl metadata, links, anchors, pagerank, …  Peruser data: ‒ User preference settings, recent queries/search results, …  Geographic locations: ‒ Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …  Scale is large  Billions of URLs, many versions/page (20K/version)Why not just use commercial DB  Scale is too large for most commercial databases  Even if it weren’t, cost would be very high  Building internally means system can be applied across many projects for low incremental costGoals  Want asynchronous processes to be continuously updating different pieces of data  Want access to most current data at any time  Need to support:  Very high read/write rates (millions of ops per second)  Efficient scans over all or interesting subsets of data  Efficient joins of large onetoone and onetomany datasets  Often want to examine data changes over time  E.g. Contents of a web page over multiple crawlsBigTable  Distributed multilevel map  Faulttolerant, persistent  Scalable  Thousands of servers  Terabytes of inmemory data  Petabyte of diskbased data  Millions of reads/writes per second, efficient scans  Selfmanaging  Servers can be added/removed dynamically  Servers adjust to load imbalanceBuilding Blocks  Building blocks:  Google File System (GFS): Raw storage  Scheduler: schedules jobs onto machines  Lock service: distributed lock manager  MapReduce: simplified largescale data processingBasic Data Model  A BigTable is a sparse, distributed persistent multidimensional sorted map (row, column, timestamp) cell contents  Good match for most Google applicationsWebTable Example  Want to keep copy of a large collection of web pages and related information  Use URLs as row keys  Various aspects of web page as column names  Store contents of web pages in thecontents: column under the timestamps when they were fetched.Rows  Name is an arbitrary string  Access to data in a row is atomic  Row creation is implicit upon storing data  Rows ordered lexicographically  Rows close together lexicographically usually on one or a small number of machinesRows (cont.) Reads of short row ranges are efficient and typically require communication with a small number of machines.  Can exploit this property by selecting row keys so they get good locality for data access.  Example:,,, VS edu.gatech.math, edu.gatech.phys, edu.uga.math, edu.uga.physColumns  Columns have twolevel name structure: ‒ family:optionalqualifier  Column family  Unit of access control  Has associated type information  Qualifier gives unbounded columns  Additional levels of indexing, if desiredTimestamps  Used to store different versions of data in a cell  New writes default to current time, but timestamps for writes can also be set explicitly by clients  Lookup options:  “Return most recent K values”  “Return all values in timestamp range (or all values)”  Column families can be marked w/ attributes:  “Only retain most recent K values in a cell”  “Keep values until they are older than K seconds”Implementation – Three Major Components  Library linked into every client  One master server  Responsible for: ‒ Assigning tablets to tablet servers ‒ Detecting addition and expiration of tablet servers ‒ Balancing tabletserver load ‒ Garbage collection  Many tablet servers  Tablet servers handle read and write requests to its table  Splits tablets that have grown too largeTablets  Large tables broken into tablets at row boundaries  Tablet holds contiguous range of rows ‒ Clients can often choose row keys to achieve locality  Aim for 100MB to 200MB of data per tablet  Serving machine responsible for 100 tablets  Fast recovery: ‒ 100 machines each pick up 1 tablet for failed machine  Finegrained load balancing: ‒ Migrate tablets away from overloaded machine ‒ Master makes loadbalancing decisionsTablet Assignment  Each tablet is assigned to one tablet server at a time.  Master server keeps track of the set of live tablet servers and current assignments of tablets to servers. Also keeps track of unassigned tablets.  When a tablet is unassigned, master assigns the tablet to an tablet server with sufficient room.API  Metadata operations  Create/delete tables, column families, change metadata  Writes (atomic)  Set(): write cells in a row  DeleteCells(): delete cells in a row  DeleteRow(): delete all cells in a row  Reads  Scanner: read arbitrary cells in a bigtable ‒ Each row read is atomic ‒ Can restrict returned rows to a particular range ‒ Can ask for just data from 1 row, all rows, etc. ‒ Can ask for all columns, just certain column families, or specific columnsSummary  GFS is developed by Google for big data challenge  Hadoop is an opensource software of GFS  Mapreduce is a distributed programming framework  Bigtable is developed to store largescale Web data Matemaattisluonnontieteellinen tiedekunta / Iso tiedonhallinta/ 1.2.2016 44 Jiaheng Lu
Document Information
User Name:
User Type:
Uploaded Date: