Introducing Apache Hadoop: The Modern Data Operating System

11/16/2011, Stanford EE380 Computer Systems Colloquium
Dr. Amr Awadallah, Founder, CTO, VP of Engineering, Cloudera
aaa@cloudera.com, twitter: awadallah
©2011 Cloudera, Inc. All Rights Reserved.

Slide 2: Limitations of Existing Data Analytics Architecture

[Diagram: instrumentation and collection feed a storage-only grid holding the original raw data (mostly append), which is archived to tape ("premature data death"), and an ETL compute grid that loads aggregated data into an RDBMS serving BI reports and interactive apps. Two problems are called out: you can't explore the original high-fidelity raw data, and moving data to compute doesn't scale.]

Slide 3: So What Is Apache Hadoop?

• A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license).
• Core Hadoop has two main systems:
  – Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage.
  – MapReduce: distributed, fault-tolerant resource management and scheduling coupled with a scalable data programming abstraction.

Slide 4: The Key Benefit: Agility/Flexibility

Schema-on-Write (RDBMS):
• The schema must be created before any data can be loaded.
• An explicit load operation has to take place, which transforms the data to the database's internal structure.
• New columns must be added explicitly before data for such columns can be loaded into the database.
• Pros: reads are fast; standards and governance.

Schema-on-Read (Hadoop):
• Data is simply copied to the file store; no transformation is needed.
• A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding).
• New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it.
• Pros: loads are fast; flexibility and agility.

Slide 5: Innovation: Explore Original Raw Data

[Image: a "Data Committee" contrasted with a "Data Scientist".]

Slide 6: Flexibility: Complex Data Processing

1. Java MapReduce: the most flexibility and performance, but a tedious development cycle (the assembly language of Hadoop).
2. Streaming MapReduce (aka Pipes): lets you develop in any programming language of your choice, with slightly lower performance and less flexibility than native Java MapReduce.
3. Crunch: a library for multi-stage MapReduce pipelines in Java (modeled after Google's FlumeJava).
4. Pig Latin: a high-level language out of Yahoo, suitable for batch data-flow workloads.
5. Hive: a SQL interpreter out of Facebook, which also includes a metastore mapping files to their schemas and associated SerDes.
6. Oozie: an XML PDL workflow engine that enables creating a workflow of jobs composed of any of the above.

Slide 7: Scalability: Scalable Software Development

Grows without requiring developers to re-architect their algorithms or applications. [Image: "AUTO SCALE".]

Slide 8: Scalability: Data Beats Algorithm

[Chart: more data beats smarter algorithms.]
A. Halevy et al., "The Unreasonable Effectiveness of Data", IEEE Intelligent Systems, March 2009.

Slide 9: Scalability: Keep All Data Alive Forever

Extract value from all your data instead of archiving it to tape and never seeing it again.

Slide 10: Use the Right Tool for the Right Job

Relational databases, use when you need:
• Interactive OLAP analytics (1 sec)
• Multistep ACID transactions
• 100% SQL compliance

Hadoop, use when you need:
• Structured data or not (flexibility)
• Scalability of storage/compute
• Complex data processing

Slide 11: HDFS: Hadoop Distributed File System

A given file is broken down into blocks (default = 64 MB), then the blocks are replicated across the cluster (default = 3 replicas).
• Optimized for: throughput, put/get/delete, and appends.
• Block replication provides: durability, availability, and throughput.
• Block replicas are distributed across servers and racks.
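To make the block-size and replication defaults concrete, here is a minimal Python sketch of the arithmetic (the helper name `hdfs_footprint` is illustrative, not part of any Hadoop API): a file is split into ceil(size / 64 MB) blocks, and each block is stored three times across the cluster.

```python
import math

BLOCK_SIZE = 64 * 1024**2   # HDFS default block size in this era of Hadoop: 64 MB
REPLICATION = 3             # default replication factor

def hdfs_footprint(file_size_bytes):
    """Return (block_count, raw_bytes_stored) for a file of the given size."""
    blocks = math.ceil(file_size_bytes / BLOCK_SIZE)
    return blocks, file_size_bytes * REPLICATION

# A 1 GB file occupies 16 blocks and 3 GB of raw storage across the cluster.
blocks, raw = hdfs_footprint(1024**3)
print(blocks, raw // 1024**2)   # 16 3072
```

Replication trades 3x raw capacity for the durability, availability, and read throughput the slide lists: any of three servers can serve a given block.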
Slide 12: MapReduce: Computational Framework

[Diagram: the Unix analogue "cat *.txt | mapper.pl | sort | reducer.pl > out.txt" generalized. N input splits of (docid, text) records, e.g. "To Be Or Not To Be?", feed M map tasks, which emit (word, count) pairs. The shuffle sorts the pairs and routes each key to one of R reduce tasks, which sum the counts per word and write R output files.]

Map(in_key, in_value) -> list of (out_key, intermediate_value)
Reduce(out_key, list of intermediate_values) -> out_value(s)

Slide 13: MapReduce: Resource Manager / Scheduler

A given job is broken down into tasks, then the tasks are scheduled as close to the data as possible. Three levels of data locality:
• Same server as the data (local disk).
• Same rack as the data (rack/leaf switch).
• Wherever there is a free slot (cross-rack).

Optimized for batch processing and failure recovery: the system detects laggard tasks and speculatively executes parallel tasks on the same slice of data.

Slide 14: But Networks Are Faster Than Disks

Yes; however, core and disk density per server are going up very quickly:
• 1 hard disk = 100 MB/sec (1 Gbps)
• Server = 12 hard disks = 1.2 GB/sec (12 Gbps)
• Rack = 20 servers = 24 GB/sec (240 Gbps)
• Avg. cluster = 6 racks = 144 GB/sec (1.4 Tbps)
• Large cluster = 200 racks = 4.8 TB/sec (48 Tbps)
• Scanning 4.8 TB at 100 MB/sec takes 13 hours.

Slide 15: Hadoop High-Level Architecture

The Hadoop client contacts the Name Node for data, or the Job Tracker to submit jobs.
• Name Node: maintains the mapping of file names to blocks to Data Node slaves.
• Job Tracker: tracks resources and schedules jobs across Task Tracker slaves.
• Data Node: stores and serves blocks of data.
• Task Tracker: runs tasks (work units) within a job.
Data Nodes and Task Trackers share the same physical nodes.
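The word-count data flow on the MapReduce slide can be simulated in plain Python: map emits (word, 1) pairs, a sort-and-group stands in for the shuffle, and reduce sums the counts per key. This is a single-process sketch of the programming model, not Hadoop's actual runtime, and the function names are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def mapper(doc):
    """Map phase: emit a (word, 1) pair for every word in the document."""
    for word in doc.lower().replace("?", "").split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce phase: sum all the counts seen for one word."""
    return (word, sum(counts))

def run(docs):
    # Shuffle/sort: collect and sort the intermediate pairs by key,
    # as the framework does between the map and reduce phases.
    intermediate = sorted(pair for doc in docs for pair in mapper(doc))
    return [reducer(key, (count for _, count in group))
            for key, group in groupby(intermediate, key=itemgetter(0))]

print(run(["To Be Or Not To Be?"]))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Because the shuffle delivers each key's values to exactly one reducer, map and reduce tasks can run on thousands of machines without coordinating with each other.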
Slide 16: Changes for Better Availability/Scalability

• High availability: the Name Node gains an active standby.
• Federation partitions out the name space across Name Nodes.
• Each job has its own Application Manager, and the Resource Manager is decoupled from MapReduce.

Slide 17: CDH: Cloudera's Distribution Including Apache Hadoop

• File system mount: FUSE-DFS
• UI framework/SDK: HUE
• Data mining: APACHE MAHOUT
• Workflow: APACHE OOZIE
• Scheduling: APACHE OOZIE
• Metadata: APACHE HIVE
• Languages/compilers: APACHE PIG, APACHE HIVE
• Fast read/write access: APACHE HBASE
• Data integration: APACHE FLUME, APACHE SQOOP
• Coordination: APACHE ZOOKEEPER
• Installation wizard for CDH: SCM Express
• Build/test: APACHE BIGTOP

Slide 18: Books

Slide 19: Conclusion

• The key benefits of Apache Hadoop:
  – Agility/flexibility (quickest time to insight).
  – Complex data processing (any language, any problem).
  – Scalability of storage/compute (freedom to grow).
  – Economical storage (keep all your data alive forever).
• The key systems of Apache Hadoop:
  – Hadoop Distributed File System: self-healing, high-bandwidth clustered storage.
  – MapReduce: distributed, fault-tolerant resource management coupled with scalable data processing.

Slide 20: Appendix: Backup Slides
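The agility benefit in the conclusion is the schema-on-read idea from earlier in the deck, and it can be sketched in a few lines of Python. This is a toy model, not Hive's actual SerDe interface: raw lines land in the store unparsed, and a deserializer function is applied only at read time, so swapping in a new parser makes old data retroactively queryable with no reload.

```python
raw_store = []  # stands in for files on HDFS: data lands unparsed

def load(line):
    """Schema-on-write would parse and validate here; schema-on-read just appends."""
    raw_store.append(line)

def read(serde):
    """Late binding: the SerDe runs only when the data is read."""
    return [serde(line) for line in raw_store]

load("2011-11-16,stanford,ee380")
load("2011-11-17,berkeley,cs262")

# A v1 SerDe that knows about two columns...
v1 = lambda line: dict(zip(["date", "campus"], line.split(",")))
# ...and a v2 SerDe that adds a third column retroactively, with no reload.
v2 = lambda line: dict(zip(["date", "campus", "course"], line.split(",")))

print(read(v1)[0])  # {'date': '2011-11-16', 'campus': 'stanford'}
print(read(v2)[0])  # {'date': '2011-11-16', 'campus': 'stanford', 'course': 'ee380'}
```

The trade-off is exactly the one slide 4 names: loading is trivially fast and flexible, but every read pays the parsing cost that schema-on-write pays once at load time.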