Question? Leave a message!




Introducing Apache Hadoop: The Modern Data Operating System

Introducing Apache Hadoop: The Modern Data Operating System 5
11/16/2011, Stanford EE380 Computer Systems Colloquium Introducing Apache Hadoop: The Modern Data Operating System Dr. Amr Awadallah Founder, CTO, VP of Engineering aaacloudera.com, twitter: awadallahLimitations of Existing Data Analytics Architecture Can’t Explore Original BI Reports + Interactive Apps High Fidelity Raw Data RDBMS (aggregated data) ETL Compute Grid Moving Data To Compute Doesn’t Scale Storage Only Grid (original raw data) Archiving = Mostly Append Premature Data Death Collection Instrumentation 2 ©2011 Cloudera, Inc. All Rights Reserved.So What is Apache Hadoop • A scalable faulttolerant distributed system for data storage and processing (open source under the Apache license). • Core Hadoop has two main systems: – Hadoop Distributed File System: selfhealing highbandwidth clustered storage. – MapReduce: distributed faulttolerant resource management and scheduling coupled with a scalable data programming abstraction. 3 ©2011 Cloudera, Inc. All Rights Reserved.The Key Benefit: Agility/Flexibility SchemaonWrite (RDBMS): SchemaonRead (Hadoop): • Data is simply copied to the file • Schema must be created before store, no transformation is needed. any data can be loaded. • A SerDe (Serializer/Deserlizer) is • An explicit load operation has to applied during read time to extract take place which transforms the required columns (late binding) data to DB internal structure. • New data can start flowing anytime • New columns must be added and will appear retroactively once explicitly before new data for the SerDe is updated to parse it. such columns can be loaded into the database. • Load is Fast • Read is Fast P Pr ros os • Flexibility/Agility • Standards/Governance 4 ©2011 Cloudera, Inc. All Rights Reserved.Innovation: Explore Original Raw Data Data Committee Data Scientist 5 ©2011 Cloudera, Inc. All Rights Reserved.Flexibility: Complex Data Processing 1. Java MapReduce: Most flexibility and performance, but tedious development cycle (the assembly language of Hadoop). 2. Streaming MapReduce (aka Pipes): Allows you to develop in any programming language of your choice, but slightly lower performance and less flexibility than native Java MapReduce. 3. Crunch: A library for multistage MapReduce pipelines in Java (modeled After Google’s FlumeJava) 4. Pig Latin: A highlevel language out of Yahoo, suitable for batch data flow workloads. 5. Hive: A SQL interpreter out of Facebook, also includes a meta store mapping files to their schemas and associated SerDes. 6. Oozie: A PDL XML workflow engine that enables creating a workflow of jobs composed of any of the above. 6 ©2011 Cloudera, Inc. All Rights Reserved.Scalability: Scalable Software Development Grows without requiring developers to rearchitect their algorithms/application. AUTO SCALE 7 ©2011 Cloudera, Inc. All Rights Reserved.Scalability: Data Beats Algorithm Smarter Algos More Data A. Halevy et al, “The Unreasonable Effectiveness of Data”, IEEE Intelligent Systems, March 2009 8 ©2011 Cloudera, Inc. All Rights Reserved.Scalability: Keep All Data Alive Forever Extract Value From Archive to Tape and Never See It Again All Your Data 9 ©2011 Cloudera, Inc. All Rights Reserved.Use The Right Tool For The Right Job Relational Databases: Hadoop: Use when: Use when: • Interactive OLAP Analytics (1sec) • Structured or Not (Flexibility) • Multistep ACID Transactions • Scalability of Storage/Compute • 100 SQL Compliance • Complex Data Processing 10 ©2011 Cloudera, Inc. All Rights Reserved.HDFS: Hadoop Distributed File System A given file is broken down into blocks (default=64MB), then blocks are replicated across cluster (default=3). Optimized for: • Throughput • Put/Get/Delete • Appends Block Replication for: • Durability • Availability • Throughput Block Replicas are distributed across servers and racks. 11 ©2011 Cloudera, Inc. All Rights Reserved.MapReduce: Computational Framework cat .txt mapper.pl sort reducer.pl out.txt (words, counts) (docid, text) Map 1 (sorted words, counts) Split 1 Output (sorted words, File 1 Be, 5 Reduce 1 sum of counts) “To Be Be, 30 Or Not To Be” Be, 12 Output (sorted words, File i Reduce i sum of counts) (docid, text) Map i Split i Be, 7 Be, 6 Shuffle Output (sorted words, File R Reduce R sum of counts) (docid, text) Map M (sorted words, counts) Split N (words, counts) Map(inkey, invalue) = list of (outkey, intermediatevalue) Reduce(outkey, list of intermediatevalues) = outvalue(s) 12 ©2011 Cloudera, Inc. All Rights Reserved.MapReduce: Resource Manager / Scheduler A given job is broken down into tasks, then tasks are scheduled to be as close to data as possible. Three levels of data locality: • Same server as data (local disk) • Same rack as data (rack/leaf switch) • Wherever there is a free slot (cross rack) Optimized for: • Batch Processing • Failure Recovery System detects laggard tasks and speculatively executes parallel tasks on the same slice of data. 13 ©2011 Cloudera, Inc. All Rights Reserved.But Networks Are Faster Than Disks Yes, however, core and disk density per server are going up very quickly: • 1 Hard Disk = 100MB/sec (1Gbps) • Server = 12 Hard Disks = 1.2GB/sec (12Gbps) • Rack = 20 Servers = 24GB/sec (240Gbps) • Avg. Cluster = 6 Racks = 144GB/sec (1.4Tbps) • Large Cluster = 200 Racks = 4.8TB/sec (48Tbps) • Scanning 4.8TB at 100MB/sec takes 13 hours. 14 ©2011 Cloudera, Inc. All Rights Reserved.Hadoop HighLevel Architecture Hadoop Client Contacts Name Node for data or Job Tracker to submit jobs Name Node Job Tracker Maintains mapping of file names Tracks resources and schedules to blocks to data node slaves. jobs across task tracker slaves. Data Node Task Tracker Stores and serves Runs tasks (work units) blocks of data within a job Share Physical Node 15 ©2011 Cloudera, Inc. All Rights Reserved.Changes for Better Availability/Scalability Hadoop Client Contacts Name Node for data Federation partitions Each job has its own or Job Tracker to submit jobs out the name space, Application Manager, High Availability via Resource Manager is an Active Standby. decoupled from MR. Name Node Job Tracker Data Node Task Tracker Stores and serves Runs tasks (work units) blocks of data within a job Share Physical Node 16 ©2011 Cloudera, Inc. All Rights Reserved.CDH: Cloudera’s Distribution Including Apache Hadoop File System Mount UI Framework/SDK Data Mining FUSEDFS HUE APACHE MAHOUT Workflow Scheduling Metadata APACHE OOZIE APACHE OOZIE APACHE HIVE Languages / Compilers Fast Data APACHE PIG, APACHE HIVE Read/Write Integration Access APACHE FLUME, APACHE HBASE APACHE SQOOP Coordination APACHE ZOOKEEPER SCM Express (Installation Wizard for CDH) 17 ©2011 Cloudera, Inc. All Rights Reserved. Build/Test: APACHE BIGTOPBooks 18 ©2011 Cloudera, Inc. All Rights Reserved.Conclusion • The Key Benefits of Apache Hadoop: – Agility/Flexibility (Quickest Time to Insight). – Complex Data Processing (Any Language, Any Problem). – Scalability of Storage/Compute (Freedom to Grow). – Economical Storage (Keep All Your Data Alive Forever). • The Key Systems for Apache Hadoop are: – Hadoop Distributed File System: selfhealing high bandwidth clustered storage. – MapReduce: distributed faulttolerant resource management coupled with scalable data processing. 19 ©2011 Cloudera, Inc. All Rights Reserved.Appendix BACKUP SLIDES 20 ©2011 Cloudera, Inc. All Rights Reserved.Unstructured Data is Exploding Complex, Unstructured Relational • 2,500 exabytes of new information in 2012 with Internet as primary driver • Digital universe grew by 62 last year to 800K petabytes and will grow to 1.2 “zettabytes” this year Source: IDC White Paper sponsored by EMC. As the Economy Contracts, the Digital Universe Expands. May 2009. 21 ©2011 Cloudera, Inc. All Rights Reserved.Hadoop Creation History • Fastest sort of a TB, 62secs over 1,460 nodes • Sorted a PB in 16.25hours over 3,658 nodes 22 ©2011 Cloudera, Inc. All Rights Reserved.Hadoop in the Enterprise Data Stack Data Scientists Analysts Business Users Enterprise IDEs BI, Analytics Reporting Development Tools Business Intelligence Tools System Operators ODBC, JDBC, Cloudera NFS, Native Mgmt Suite Enterprise Data Warehouse Sqoop Data Customers Architects LowLatency Web Flume Flume Sqoop Flume Application Serving Relational Systems Logs Files Web Data Databases 23 ©2011 Cloudera, Inc. All Rights Reserved. ETL ToolsMapReduce Next Gen Main idea is to split up the JobTracker functions: • Cluster resource management (for tracking and allocating nodes) • Application lifecycle management (for MapReduce scheduling and execution) Enables: • High Availability • Better Scalability • Efficient Slot Allocation • Rolling Upgrades • NonMapReduce Apps 24 ©2011 Cloudera, Inc. All Rights Reserved.Two Core Use Cases Common Across Many Industries Use Case Industry Use Case Application Application Web Social Network Analysis Clickstream Sessionization Media Clickstream Sessionization Content Optimization Telco Network Analytics Mediation Retail Loyalty Promotions Data Factory Financial Fraud Analysis Trade Reconciliation Federal Entity Analysis SIGINT Sequencing Analysis Genome Mapping Bioinformatics Manufacturing Product Quality Mfg Process Tracking 25 ©2011 Cloudera, Inc. All Rights Reserved. ADVANCED ANALYTICS DATA PROCESSINGWhat is Cloudera Enterprise Cloudera Enterprise makes open CLOUDERA ENTERPRISE COMPONENTS source Apache Hadoop enterpriseeasy Cloudera Production  Simplify and Accelerate Hadoop Deployment Management Level Support  Reduce Adoption Costs and Risks Suite  Lower the Cost of Administration Our Team of Experts Comprehensive  Increase the Transparency Control of Hadoop OnCall to Help You Toolset for Hadoop Meet Your SLAs  Leverage the Experience of Our Experts Administration 3 of the top 5 telecommunications, mobile services, defense intelligence, banking, media and retail organizations depend on Cloudera EFFECTIVENESS EFFICIENCY Ensuring Repeatable Value from Enabling Apache Hadoop to be Apache Hadoop Deployments Affordably Run in Production 26 ©2011 Cloudera, Inc. All Rights Reserved.Hive vs Pig Latin (count distinct values 0) • Hive Syntax: SELECT COUNT(DISTINCT col1) FROM mytable WHERE col1 0; • Pig Latin Syntax: mytable = LOAD ‘myfile’ AS (col1, col2, col3); mytable = FOREACH mytable GENERATE col1; mytable = FILTER mytable BY col1 0; mytable = DISTINCT col1; mytable = GROUP mytable BY col1; mytable = FOREACH mytable GENERATE COUNT(mytable); DUMP mytable; 27 ©2011 Cloudera, Inc. All Rights Reserved.Apache Hive Key Features • A subset of SQL covering the most common statements • JDBC/ODBC support • Agile data types: Array, Map, Struct, and JSON objects • Pluggable SerDe system to work on unstructured files directly • User Defined Functions and Aggregates • Regular Expression support • MapReduce support • Partitions and Buckets (for performance optimization) • Microstrategy/Tableau Compatibility (through ODBC) • In The Works: Indices, Columnar Storage, Views, Explode/Collect • More details: http://wiki.apache.org/hadoop/Hive 28 ©2011 Cloudera, Inc. All Rights Reserved.Hive Agile Data Types • STRUCTS: – SELECT mytable.mycolumn.myfield FROM … • MAPS (Hashes): – SELECT mytable.mycolumnmykey FROM … • ARRAYS: – SELECT mytable.mycolumn5 FROM … • JSON: – SELECT getjsonobject(mycolumn, objpath) FROM … 29 ©2011 Cloudera, Inc. All Rights Reserved.
Website URL
Comment