Cloud Computing
Data Management in the Cloud
Dell Zhang
Birkbeck, University of London
2016/17

Data Management in Today's Organisations

Big Data Analysis
• Peta-scale datasets are everywhere:
  – Facebook: 2.5PB of user data + 15TB/day (4/2009)
  – eBay: 6.5PB of user data + 50TB/day (5/2009)
  – …
• A lot of these datasets are (mostly) structured
  – Query logs
  – Point-of-sale records
  – User data (e.g., demographics)
  – …

Big Data Analysis
• How do we perform data analysis at scale?
  – Relational databases (RDBMS)
  – MapReduce (Hadoop)

RDBMS vs MapReduce
• Relational databases:
  – Multipurpose: transactions & analysis; batch & interactive
  – Data integrity via ACID transactions
  – Lots of tools in the software ecosystem (for ingesting, reporting, etc.)
  – Support SQL (and SQL integration, e.g., JDBC)
  – Automatic SQL query optimization
Source: O'Reilly blog post by Joseph Hellerstein (11/19/2008)

RDBMS vs MapReduce
• MapReduce (Hadoop):
  – Designed for large clusters; fault tolerant
  – Data is accessed in "native format"
  – Supports many query languages
  – Programmers retain control over performance
  – Open source
Source: O'Reilly blog post by Joseph Hellerstein (11/19/2008)

Database Workloads
• Online Transaction Processing (OLTP)
  – Typical applications: e-commerce, banking, airline reservations
  – User-facing: real-time, low latency, highly concurrent
  – Tasks: relatively small set of "standard" transactional queries
  – Data access pattern: random reads, updates, writes (involving relatively small amounts of data)

Database Workloads
• Online Analytical Processing (OLAP)
  – Typical applications: business intelligence, data mining
  – Back-end processing: batch workloads, less concurrency
  – Tasks: complex analytical queries, often ad hoc
  – Data access pattern: table scans, large amounts of data involved per query

One Database or Two?
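The contrast between the two workloads above can be made concrete with two queries against the same toy table. A minimal sketch using Python's built-in sqlite3 module (the table, columns, and data are invented for illustration):

```python
import sqlite3

# In-memory toy database; all names and values here are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, user_id INT, amount REAL)")
conn.executemany(
    "INSERT INTO sales (user_id, amount) VALUES (?, ?)",
    [(1, 10.0), (2, 25.0), (1, 5.0), (3, 40.0)],
)

# OLTP-style access: a small, indexed point read touching one record,
# as a user-facing transaction would.
row = conn.execute("SELECT amount FROM sales WHERE id = ?", (2,)).fetchone()

# OLAP-style access: a full table scan that aggregates over all records,
# as an analytical query would.
total_by_user = conn.execute(
    "SELECT user_id, SUM(amount) FROM sales GROUP BY user_id ORDER BY user_id"
).fetchall()

print(row)            # → (25.0,)
print(total_by_user)  # → [(1, 15.0), (2, 25.0), (3, 40.0)]
```

The point lookup touches one row via the primary-key index; the aggregation must scan every row. Run concurrently on one database, these two patterns fight over memory and I/O, which is the motivation for separating them.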
• Downsides of co-existing OLTP and OLAP workloads:
  – Poor memory management
  – Conflicting data access patterns
  – Variable latency
• Solution: separate databases
  – OLTP database for user-facing transactions
  – OLAP database for data warehousing
• How do we connect the two?

OLTP/OLAP Architecture
(diagram: OLTP database → ETL (Extract, Transform, and Load) → OLAP database)

OLTP/OLAP Integration
• Extract-Transform-Load (ETL)
  – Extract records from the OLTP database
  – Transform records: clean data, check integrity, aggregate, etc.
  – Load records into the OLAP database

OLTP/OLAP Integration
• OLTP database for user-facing transactions
  – Retain records of all activity
  – Periodic ETL (e.g., nightly)
• OLAP database for data warehousing
  – Business intelligence: reporting, ad hoc queries, data mining, etc.
  – Feedback to improve OLTP services

Business Intelligence
• Premise: more data leads to better business decisions
  – Periodic reporting as well as ad hoc queries
  – Analysts, not programmers
• Importance of tools and dashboards

Business Intelligence
• Examples:
  – Slicing and dicing activity by different dimensions to better understand the marketplace
  – Analyzing log data to improve the OLTP experience
  – Analyzing log data to better optimize ad placement
  – Analyzing purchasing trends for better supply-chain management
  – Mining for correlations between otherwise unrelated activities

OLTP/OLAP Architecture: Hadoop?
(diagram: OLTP database → ETL (Extract, Transform, and Load) → OLAP database — where does Hadoop fit?)

OLTP/OLAP/Hadoop Architecture
(diagram: OLTP database → ETL (Extract, Transform, and Load) → Hadoop → OLAP database)

ETL Bottleneck
• Reporting is often a nightly task:
  – ETL is often slow: why?
  – What happens if processing 24 hours of data takes longer than 24 hours?

ETL Bottleneck
• Hadoop is perfect:
  – Most likely, you already have some data warehousing solution
  – Ingestion is limited by the speed of HDFS
  – Scales out with more nodes
  – Massively parallel
  – Ability to use any processing tool
  – Much cheaper than parallel databases
  – ETL is a batch process anyway

MapReduce Algorithms for Processing Relational and Matrix Data

Working Scenario
• Two tables:
  – User demographics (gender, age, income, etc.)
  – User page visits (URL, time spent, etc.)
• Analyses we might want to perform:
  – Statistics on demographic characteristics
  – Statistics on page visits
  – Statistics on page visits by URL
  – Statistics on page visits by demographic characteristic
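The last analysis above requires joining the two tables on user id, which in MapReduce is typically done with a reduce-side join: mappers tag each record with its table of origin and emit it keyed by user id, the shuffle groups both tables' records for the same user together, and the reducer joins and aggregates. A minimal pure-Python simulation of the pattern (all data and field names invented), computing total time spent by age bracket:

```python
from collections import defaultdict

# Toy inputs: (user_id, gender, age_bracket) and (user_id, url, seconds).
users = [(1, "F", "18-25"), (2, "M", "26-35"), (3, "F", "26-35")]
visits = [(1, "a.com", 10), (2, "b.com", 40), (3, "a.com", 20), (1, "b.com", 30)]

# Map phase: tag each record with its table of origin, keyed by user id.
mapped = []
for uid, gender, age in users:
    mapped.append((uid, ("U", age)))
for uid, url, secs in visits:
    mapped.append((uid, ("V", secs)))

# Shuffle: group all values by key, as the framework would between phases.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: for each user, join the demographic record with the visit
# records, then aggregate time spent per age bracket.
time_by_age = defaultdict(int)
for uid, values in groups.items():
    age = next(v for tag, v in values if tag == "U")
    for tag, v in values:
        if tag == "V":
            time_by_age[age] += v

print(dict(time_by_age))  # → {'18-25': 40, '26-35': 60}
```

In real Hadoop the map and reduce functions run distributed over HDFS blocks and the shuffle is done by the framework, but the data flow is the same: the join key is the map output key, so all records needed for one join land in one reduce call.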