Cloud Computing
Data Management in the Cloud
Dell Zhang
Birkbeck, University of London
2016/17

Data Management in Today’s Organisations

Big Data Analysis
• Peta-scale datasets are everywhere:
– Facebook: 2.5PB of user data + 15TB/day (4/2009)
– eBay: 6.5PB of user data + 50TB/day (5/2009)
– …
• A lot of these datasets are (mostly) structured
– Query logs
– Point-of-sale records
– User data (e.g., demographics)
– …

Big Data Analysis
• How do we perform data analysis at scale?
– Relational databases (RDBMS)
– MapReduce (Hadoop)

RDBMS vs MapReduce
• Relational databases
– Multipurpose
• transactions & analysis
• batch & interactive
– Data integrity via ACID transactions
– Lots of tools in software ecosystem
• for ingesting, reporting, etc.
– Supports SQL (and SQL integration, e.g., JDBC)
– Automatic SQL query optimization
Source: O’Reilly Blog post by Joseph Hellerstein (11/19/2008)

RDBMS vs MapReduce
• MapReduce (Hadoop):
– Designed for large clusters, fault tolerant
– Data is accessed in “native format”
– Supports many query languages
– Programmers retain control over performance
– Open source
Source: O’Reilly Blog post by Joseph Hellerstein (11/19/2008)
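To make the MapReduce programming model above concrete, here is a minimal word-count job written for Hadoop Streaming in Python; the file names, paths, and launch command are illustrative placeholders, not part of the original slides.

#!/usr/bin/env python
# mapper.py: hypothetical Hadoop Streaming mapper, emits (word, 1) per token.
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py: hypothetical Hadoop Streaming reducer, sums counts per word.
# Hadoop delivers the mapper output to each reducer grouped and sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, 0
    current_count += int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

Such a job would be launched with something like: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out (the exact streaming jar location depends on the Hadoop installation).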
Database Workloads
• Online Transaction Processing (OLTP)
– Typical applications:
• e-commerce, banking, airline reservations
– User facing:
• real-time, low latency, highly-concurrent
– Tasks:
• relatively small set of “standard” transactional queries
– Data access pattern:
• random reads, updates, writes (involving relatively small amounts of data)
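For concreteness, a minimal sketch of the OLTP access pattern above, using a hypothetical e-commerce schema; Python’s built-in sqlite3 stands in for a production RDBMS.

# Minimal OLTP-style interaction: small, targeted reads and writes
# wrapped in a single ACID transaction (schema is hypothetical).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, user_id INT, total REAL)")
conn.execute("CREATE TABLE inventory (item_id INTEGER PRIMARY KEY, stock INT)")
conn.execute("INSERT INTO inventory VALUES (42, 10)")

# One user-facing transaction: a point update plus a small insert,
# committed (or rolled back) atomically by the 'with' block.
with conn:
    conn.execute("UPDATE inventory SET stock = stock - 1 WHERE item_id = ?", (42,))
    conn.execute("INSERT INTO orders (user_id, total) VALUES (?, ?)", (7, 19.99))

print(conn.execute("SELECT stock FROM inventory WHERE item_id = 42").fetchone())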
Database Workloads
• Online Analytical Processing (OLAP)
– Typical applications:
• business intelligence, data mining
– Back-end processing:
• batch workloads, less concurrency
– Tasks:
• complex analytical queries, often ad hoc
– Data access pattern:
• table scans, large amounts of data involved per query
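By contrast, a typical OLAP query scans and aggregates an entire fact table; a sketch over hypothetical warehouse tables (users, orders), shown here as a SQL string in Python.

# An ad hoc analytical query: it scans the whole orders fact table and
# aggregates by demographic dimensions, rather than touching a few rows
# as the OLTP transaction above does. Table and column names are assumed.
analytic_sql = """
SELECT u.gender, u.age_band, COUNT(*) AS orders, SUM(o.total) AS revenue
FROM   orders o
JOIN   users  u ON u.user_id = o.user_id
GROUP  BY u.gender, u.age_band
ORDER  BY revenue DESC
"""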
One Database or Two?
• Downsides of co-existing OLTP and OLAP workloads
– Poor memory management
– Conflicting data access patterns
– Variable latency
• Solution: separate databases
– OLTP database for user-facing transactions
– OLAP database for data warehousing
• How do we connect the two?

OLTP/OLAP Architecture
[Diagram: OLTP database → ETL (Extract, Transform, and Load) → OLAP database]

OLTP/OLAP Integration
• Extract-Transform-Load (ETL)
– Extract records from OLTP database
– Transform records
• clean data, check integrity, aggregate, etc.
– Load records into OLAP database
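A minimal sketch of these three steps, assuming hypothetical table and column names; sqlite3 again stands in for both the OLTP source and the OLAP warehouse.

# Minimal ETL sketch: extract raw records from the OLTP side, transform
# (clean and aggregate) them, and load the results into the warehouse.
import sqlite3

oltp = sqlite3.connect("oltp.db")        # source: user-facing database
olap = sqlite3.connect("warehouse.db")   # target: data warehouse

# Extract: pull yesterday's raw order records.
rows = oltp.execute(
    "SELECT user_id, total, created_at FROM orders "
    "WHERE created_at >= date('now', '-1 day')"
).fetchall()

# Transform: drop malformed rows and roll up revenue per user.
daily = {}
for user_id, total, created_at in rows:
    if user_id is None or total is None or total < 0:
        continue                         # basic integrity check
    daily[user_id] = daily.get(user_id, 0.0) + total

# Load: append the aggregated facts to the warehouse.
olap.execute("CREATE TABLE IF NOT EXISTS daily_revenue (user_id INT, revenue REAL)")
with olap:
    olap.executemany("INSERT INTO daily_revenue VALUES (?, ?)", daily.items())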
OLTP/OLAP Integration
• OLTP database for user-facing transactions
– Retain records of all activity
– Periodic ETL (e.g., nightly)
• OLAP database for data warehousing
– Business intelligence
• reporting, ad hoc queries, data mining, etc.
– Feedback to improve OLTP services

Business Intelligence
• Premise: more data leads to better business decisions
– Periodic reporting as well as ad hoc queries
– Analysts, not programmers
• Importance of tools and dashboards

Business Intelligence
• Examples:
– Slicing-and-dicing activity by different dimensions to better understand the marketplace
– Analyzing log data to improve OLTP experience
– Analyzing log data to better optimize ad placement
– Analyzing purchasing trends for better supply-chain management
– Mining for correlations between otherwise unrelated activities

OLTP/OLAP Architecture: Hadoop?
[Diagram: OLTP database → ETL (Extract, Transform, and Load) → OLAP database]

OLTP/OLAP/Hadoop Architecture
[Diagram: OLTP database → Hadoop (ETL: Extract, Transform, and Load) → OLAP database]

ETL Bottleneck
• Reporting is often a nightly task:
– ETL is often slow: why?
– What happens if processing 24 hours of data takes longer than 24 hours?

ETL Bottleneck
• Hadoop is perfect:
– Most likely, you already have some data warehousing solution
– Ingestion is limited by the speed of HDFS
– Scales out with more nodes
– Massively parallel
– Ability to use any processing tool
– Much cheaper than parallel databases
– ETL is a batch process anyway
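As an illustration of pushing the “transform” step of ETL into Hadoop, here is a hypothetical map-only Hadoop Streaming job that cleans raw visit logs in parallel; the field layout and file name are assumptions, not from the slides.

#!/usr/bin/env python
# clean_mapper.py: hypothetical map-only streaming job that drops malformed
# log records, normalises fields, and emits tab-separated output suitable
# for bulk loading into the warehouse.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        continue                         # skip malformed records
    user_id, url, seconds = fields
    if not seconds.isdigit():
        continue
    print("%s\t%s\t%s" % (user_id, url.lower(), seconds))

Run with the reducer count set to zero (e.g., -numReduceTasks 0 in Hadoop Streaming), so every node cleans its own slice of the data in parallel.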
MapReduce Algorithms for Processing Relational and Matrix Data

Working Scenario
• Two tables:
– User demographics (gender, age, income, etc.)
– User page visits (URL, time spent, etc.)
• Analyses we might want to perform:
– Statistics on demographic characteristics
– Statistics on page visits
– Statistics on page visits by URL
– Statistics on page visits by demographic characteristic
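The last analysis above requires joining the two tables on user ID. Below is a minimal local sketch of the reduce-side join pattern that the next section develops; the data, field names, and aggregation are made up, and the shuffle is simulated in plain Python.

# Reduce-side join, simulated locally: tag records by source table in the
# map phase, group by the join key (user_id), then pair demographics with
# visits in the reduce phase and aggregate time spent by gender.
from collections import defaultdict

users  = [("u1", "F", "25-34"), ("u2", "M", "35-44")]            # user_id, gender, age band
visits = [("u1", "a.com", 30), ("u1", "b.com", 12), ("u2", "a.com", 7)]

def map_phase():
    for user_id, gender, age in users:
        yield user_id, ("U", gender)          # tag demographic records
    for user_id, url, seconds in visits:
        yield user_id, ("V", seconds)         # tag visit records

# Shuffle: group all tagged values by the join key.
groups = defaultdict(list)
for key, value in map_phase():
    groups[key].append(value)

# Reduce: join each user's demographics with their visits, then aggregate.
time_by_gender = defaultdict(int)
for user_id, values in groups.items():
    gender = next((v for tag, v in values if tag == "U"), "unknown")
    for tag, seconds in values:
        if tag == "V":
            time_by_gender[gender] += seconds

print(dict(time_by_gender))   # e.g. {'F': 42, 'M': 7}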