Hive: SQL for Hadoop

Uploaded by RyanCanon (United Arab Emirates). Published 21-07-2017.
Slide 1: Hive: SQL for Hadoop — Dean Wampler (Wednesday, May 14, 2014)
Notes: I'll argue that Hive is indispensable to people creating "data warehouses" with Hadoop, because it gives them a "similar" SQL interface to their data, making it easier to migrate skills and even apps from existing relational tools to Hadoop.

Slide 2: Dean Wampler
Consultant for Typesafe. Big Data, Scala, Functional Programming expert.
dean.wampler@typesafe.com • @deanwampler • Hire me

Slide 3: Why Hive?

Slide 4: Since your team knows SQL and all your Data Warehouse apps are written in SQL, Hive minimizes the effort of migrating to Hadoop.

Slide 5: Hive
• Ideal for data warehousing.
• Ad-hoc queries of data.
• Familiar SQL dialect.
• Analysis of large data sets.
• Hadoop MapReduce jobs.
Notes: Hive is a killer app, in our opinion, for data warehouse teams migrating to Hadoop, because it gives them a familiar SQL language that hides the complexity of MR programming.

Slide 6: Hive
• Invented at Facebook.
• Open sourced to Apache in 2008.
• http://hive.apache.org

Slide 7: A Scenario: Mining Daily Click Stream Logs

Slide 8: Ingest & Transform
• From: file://server1/var/log/clicks.log
  Jan 9 09:02:17 server1 movies[18]: 1234: search for "vampires in love". …
Notes: As we copy the daily click stream log over to a local staging location, we transform it into the Hive table format we want.

Slide 9: (the same log line) — annotation: Timestamp.

Slide 10: (the same log line) — annotation: The server.

Slide 11: (the same log line) — annotation: The process ("movies search") and the process id.
Slide 12: (the same log line) — annotation: Customer id.

Slide 13: (the same log line) — annotation: The log "message".

Slide 14: Ingest & Transform
• From: file://server1/var/log/clicks.log
  Jan 9 09:02:17 server1 movies[18]: 1234: search for "vampires in love". …
• To: /staging/2012-01-09.log
  09:02:17^Aserver1^Amovies^A18^A1234^Asearch for "vampires in love". …
Notes: As we copy the daily click stream log over to a local staging location, we transform it into the Hive table format we want.

Slide 15: Ingest & Transform — the transformations:
• Removed month (Jan) and day (09).
• Added ^A (Ctrl-A) as field separators (Hive convention).
• Separated process id from process name.
Notes: The transformations we made. (You could use many different Linux, scripting, code, or Hadoop-related ingestion tools to do this.)

Slide 16: Ingest & Transform — put in HDFS:
  hadoop fs -put /staging/2012-01-09.log \
    /clicks/2012/01/09/log.txt
• (The final file name doesn't matter…)
Notes: Here we use the hadoop shell command to put the file where we want it in the file system. Note that the name of the target file doesn't matter; we'll just tell Hive to read all the files in the directory, so there could be many files there.

Slide 17: Back to Hive... Create an external Hive table:
  CREATE EXTERNAL TABLE clicks (
    hms       STRING,
    hostname  STRING,
    process   STRING,
    pid       INT,
    uid       INT,
    message   STRING)
  PARTITIONED BY (
    year  INT,
    month INT,
    day   INT);
Notes: Now let's create an "external" table that will read those files as the "backing store".
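As an aside, the ingest-and-transform step above (raw syslog-style line → ^A-separated staging record) could be scripted in many ways; here is a minimal Python sketch. The regex, the function name, and the bracketed pid syntax (`movies[18]`) are illustrative assumptions, not part of the original deck.

```python
import re

# Hypothetical parser for one raw click-log line, e.g.
#   Jan 9 09:02:17 server1 movies[18]: 1234: search for "vampires in love".
# Field names follow the Hive table used in this deck:
#   hms, hostname, process, pid, uid, message.
LINE_RE = re.compile(
    r'^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+'   # dropped in the output
    r'(?P<hms>\d{2}:\d{2}:\d{2})\s+'
    r'(?P<hostname>\S+)\s+'
    r'(?P<process>\w+)\[(?P<pid>\d+)\]:\s+'      # split process name from pid
    r'(?P<uid>\d+):\s+'
    r'(?P<message>.*)$'
)

SEP = "\x01"  # Ctrl-A, Hive's default field separator (shown as ^A on the slides)

def to_staging(line: str) -> str:
    """Transform one raw log line into a ^A-separated staging record:
    drop month/day, separate process name from process id."""
    m = LINE_RE.match(line)
    if m is None:
        raise ValueError(f"unparseable line: {line!r}")
    return SEP.join(m.group("hms", "hostname", "process", "pid", "uid", "message"))

raw = 'Jan 9 09:02:17 server1 movies[18]: 1234: search for "vampires in love".'
print(to_staging(raw).replace(SEP, "^A"))
# 09:02:17^Aserver1^Amovies^A18^A1234^Asearch for "vampires in love".
```

Note that Hive's default delimiter is the non-printing Ctrl-A character (\x01); the "^A" on the slides is its conventional printable rendering.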
Notes (slide 17, cont.): Also, we make it partitioned to accelerate queries that limit by year, month, or day. (You don't have to use EXTERNAL and PARTITIONED together…)

Slide 18: Back to Hive... Add a partition for 2012-01-09:
  ALTER TABLE clicks ADD IF NOT EXISTS
  PARTITION (year=2012, month=01, day=09)
  LOCATION '/clicks/2012/01/09';
• A directory in HDFS.
Notes: We add a partition for each day. Note the LOCATION path, which is the directory where we wrote our file.

Slide 19: Now, Analyze
• What's with the kids and vampires??
  SELECT hms, uid, message FROM clicks
  WHERE message LIKE '%vampire%'
    AND year = 2012 AND month = 01 AND day = 09;
• After some MapReduce crunching...
  …
  09:02:29 1234 search for "twilight of the vampires"
  09:02:35 1234 add to cart "vampires want their genre back"
  …
Notes: And we can run SQL queries.

Slide 20: Recap
• SQL analysis with Hive.
• Other tools can use the data, too.
• Massive scalability with Hadoop.
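To make the final query concrete: Hive compiles that SELECT into MapReduce jobs that scan only the files under the matching partition directory and filter each ^A-separated record. Here is a rough Python stand-in for the per-record work (field names come from the CREATE TABLE above; partition pruning and HDFS I/O are elided, and the function name is made up for illustration):

```python
SEP = "\x01"  # Ctrl-A, Hive's default field separator
FIELDS = ["hms", "hostname", "process", "pid", "uid", "message"]

def query_vampires(lines):
    """Rough stand-in for:
        SELECT hms, uid, message FROM clicks
        WHERE message LIKE '%vampire%' ...
    Partition pruning would select the directory; here we only
    show the record-level split and filter."""
    for line in lines:
        rec = dict(zip(FIELDS, line.rstrip("\n").split(SEP)))
        if "vampire" in rec["message"]:      # LIKE '%vampire%'
            yield rec["hms"], rec["uid"], rec["message"]

# Two staged records, one matching and one not:
staged = [
    SEP.join(["09:02:29", "server1", "movies", "18", "1234",
              'search for "twilight of the vampires"']),
    SEP.join(["09:02:31", "server1", "movies", "18", "5678",
              "browse comedies"]),
]
for row in query_vampires(staged):
    print(row)
# ('09:02:29', '1234', 'search for "twilight of the vampires"')
```

The point of the deck stands: Hive generates this kind of scan-split-filter plumbing for you from a familiar SQL dialect.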