



Hive: SQL for Hadoop

Dean Wampler

I'll argue that Hive is indispensable to people creating "data warehouses" with Hadoop, because it gives them a "similar" SQL interface to their data, making it easier to migrate skills and even apps from existing relational tools to Hadoop.

Dean Wampler
• Consultant for Typesafe.
• Big Data, Scala, Functional Programming expert.
• dean.wampler@typesafe.com
• @deanwampler
• Hire me!

Why Hive?

Since your team knows SQL and all your Data Warehouse apps are written in SQL, Hive minimizes the effort of migrating to Hadoop.

Hive
• Ideal for data warehousing.
• Ad-hoc queries of data.
• Familiar SQL dialect.
• Analysis of large data sets.
• Hadoop MapReduce jobs.

Hive is a killer app, in our opinion, for data warehouse teams migrating to Hadoop, because it gives them a familiar SQL language that hides the complexity of MR programming.

Hive
• Invented at Facebook.
• Open sourced to Apache in 2008.
• http://hive.apache.org

A Scenario: Mining Daily Click Stream Logs

Ingest Transform:
• From: file://server1/var/log/clicks.log

    Jan 9 09:02:17 server1 movies[18]: 1234: search for "vampires in love".
    ...

  The fields are the timestamp, the server, the process ("movies" search) and the process id, the customer id, and the log "message".

• To: /staging/20120109.log

    09:02:17^Aserver1^Amovies^A18^A1234^Asearch for "vampires in love".
    ...

The transformations we made:
• Removed the month (Jan) and day (09).
• Added ^A (Control-A) as the field separator (a Hive convention).
• Separated the process id from the process name.

As we copy the daily click stream log over to a local staging location, we transform it into the Hive table format we want. (You could use many different Linux, scripting, code, or Hadoop-related ingestion tools to do this.)

• Put in HDFS:

    hadoop fs -put /staging/20120109.log \
      /clicks/2012/01/09/log.txt

  (The final file name doesn't matter...)

Here we use the hadoop shell command to put the file where we want it in the file system. Note that the name of the target file doesn't matter; we'll just tell Hive to read all files in the directory, so there could be many files there.
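Incidentally, the reshaping doesn't have to happen outside Hive. Here is a minimal sketch of an alternative, assuming the raw log layout shown above: point an external table at the untouched logs using the contrib RegexSerDe (which requires all columns to be STRING), then let a query do the transform. The raw_clicks name, the regex, and the /raw/clicks path are illustrative, not from the talk.

    CREATE EXTERNAL TABLE raw_clicks (
      hms STRING, hostname STRING, process STRING,
      pid STRING, uid STRING, message STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "\\w+ +\\d+ (\\S+) (\\S+) (\\w+)\\[(\\d+)\\]: (\\d+): (.*)")
    LOCATION '/raw/clicks';

    -- Reshape and load into the partitioned clicks table defined below:
    INSERT OVERWRITE TABLE clicks
    PARTITION (year = 2012, month = 1, day = 9)
    SELECT hms, hostname, process, CAST(pid AS INT), CAST(uid AS INT), message
    FROM raw_clicks;

The regex groups skip the month and day, then capture the time, host, process name, process id, customer id, and message, mirroring the staging transform above.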
Back to Hive...
• Create an external Hive table:

    CREATE EXTERNAL TABLE clicks (
      hms       STRING,
      hostname  STRING,
      process   STRING,
      pid       INT,
      uid       INT,
      message   STRING)
    PARTITIONED BY (
      year  INT,
      month INT,
      day   INT);

  (You don't have to use EXTERNAL and PARTITIONED together...)

Now let's create an "external" table that will read those files as the "backing store". Also, we make it partitioned to accelerate queries that limit by year, month, or day.

• Add a partition for 2012-01-09:

    ALTER TABLE clicks ADD IF NOT EXISTS
    PARTITION (year = 2012, month = 01, day = 09)
    LOCATION '/clicks/2012/01/09';

  The LOCATION is a directory in HDFS.

We add a partition for each day. Note the LOCATION path, which is the directory where we wrote our file.

Now, Analyze
• What's with the kids and vampires?

    SELECT hms, uid, message FROM clicks
    WHERE message LIKE '%vampire%'
    AND year = 2012 AND month = 01 AND day = 09;

• After some MapReduce crunching...

    ...
    09:02:29 1234 search for "twilight of the vampires"
    09:02:35 1234 add to cart "vampires want their genre back"
    ...

And we can run SQL queries.

Recap
• SQL analysis with Hive.
• Other tools can use the data, too.
• Massive scalability with Hadoop.

[Architecture diagram: client interfaces (Karmasphere, JDBC, ODBC, CLI, HWI, the Thrift server, and others) talk to the Driver (compiles, optimizes, executes), which uses the Metastore and submits work to the Hadoop master (Job Tracker, Name Node) and HDFS.]

Hive queries generate MR jobs. (Some operations don't invoke Hadoop processes, e.g., some very simple queries and commands that just write updates to the metastore.) CLI = Command Line Interface. HWI = Hive Web Interface.

Tables
• Tables can be stored in HDFS, MapR, S3, HBase (new), and others...

There is "early" support for using Hive with HBase. Other databases and distributed file systems will no doubt follow.

• Table metadata is stored in a relational DB.

For production, you need to set up a MySQL or PostgreSQL database for Hive's metadata. Out of the box, Hive uses a Derby DB, but it can only be used by a single user and a single process at a time, so it's fine for personal development only.

Queries
• Most queries use MapReduce jobs.

Hive generates MapReduce jobs to implement all but the simplest queries.

MapReduce Queries
• Benefits: horizontal scalability.
• Drawbacks: latency.

The high latency makes Hive unsuitable for "online" database use. (Hive also doesn't support transactions and has other limitations that are relevant here...) So, these limitations make Hive best for offline (batch mode) use, such as data warehouse apps.

HDFS Storage
• Benefits: horizontal scalability, data redundancy.
• Drawbacks: no insert, update, or delete.

You can generate new tables or write to local files. Forthcoming versions of HDFS will support appending data.
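Because rows can't be changed in place, results are materialized by rewriting a whole table or partition, or by exporting to a directory. A minimal sketch of both idioms; the daily_counts table is hypothetical, not from the talk:

    -- Hypothetical summary table, rewritten wholesale on each run:
    CREATE TABLE daily_counts (uid INT, clicks BIGINT);

    INSERT OVERWRITE TABLE daily_counts
    SELECT uid, COUNT(*) FROM clicks
    WHERE year = 2012 AND month = 01 AND day = 09
    GROUP BY uid;

    -- Or dump query results to a local directory:
    INSERT OVERWRITE LOCAL DIRECTORY '/tmp/daily_counts'
    SELECT * FROM daily_counts;

Rewriting a day's partition rather than updating individual rows is the standard Hive pattern for "corrections" in this batch-oriented model.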
Schema on Read
• Schema enforcement happens at query time, not write time.

Especially for external tables, but even for internal ones, since the files are HDFS files, Hive can't enforce that records written to table files have the specified schema, so it does these checks at query time.

Other Limitations
• No transactions.
• Some SQL features not implemented (yet).

More on Tables and Schemas

Data Types
• The usual scalar types:
• TINYINT, ..., BIGINT.
• FLOAT, DOUBLE.
• BOOLEAN.
• STRING.

Like most databases...

• The unusual complex types:
• STRUCT.
• MAP.
• ARRAY.

Structs are like "objects" or "C-style structs". Maps are key-value pairs, and you know what arrays are ;)

    CREATE TABLE employees (
      name          STRING,
      salary        FLOAT,
      subordinates  ARRAY<STRING>,
      deductions    MAP<STRING, FLOAT>,
      address       STRUCT<street:STRING, city:STRING,
                           state:STRING, zip:INT>);

subordinates references other records by the employee name. (Hive doesn't have indexes, in the usual sense, but an indexing feature was recently added.) deductions is a key-value list of the name of each deduction and a float indicating the amount. address is like a "class", "object", or "C-style struct", whatever you prefer.

File Record Formats

    CREATE TABLE employees (…)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\001'
    COLLECTION ITEMS TERMINATED BY '\002'
    MAP KEYS TERMINATED BY '\003'
    LINES TERMINATED BY '\n'
    STORED AS TEXTFILE;

Suppose our employees table has a custom format and field delimiters. We can change them, although here I'm showing all the default values Hive uses for text files.

Select, Where, Group By, Join,...

Common SQL...
• You get most of the usual suspects for SELECT, WHERE, GROUP BY, and JOIN.

We'll just highlight a few unique features.

"User Defined Functions"

    ADD JAR MyUDFs.jar;

    CREATE TEMPORARY FUNCTION net_salary
    AS 'com.example.NetCalcUDF';

    SELECT name, net_salary(salary, deductions)
    FROM employees;

Following a Hive-defined API, implement your own functions, build them, put them in a jar, and then use them in your queries. Here we (pretend to) implement a function that takes the employee's salary and deductions, then computes the net salary.

ORDER BY vs. SORT BY
• A total ordering: one reducer.

    SELECT name, salary FROM employees
    ORDER BY salary ASC;

• A local ordering: sorts within each reducer.

    SELECT name, salary FROM employees
    SORT BY salary ASC;

For a giant data set, piping everything through one reducer might take a very long time. A compromise is to sort "locally", so each reducer sorts its output. However, if you structure your jobs right, you might achieve a total order, depending on how data gets to the reducers. (E.g., if each reducer handles a year's worth of data, joining the files together would yield a total sort.)

Inner Joins
• Only equality (x = y).

    SELECT ... FROM clicks a JOIN clicks b
    ON (a.uid = b.uid AND a.day = b.day)
    WHERE a.process = 'movies'
    AND b.process = 'books'
    AND a.year = 2012;

Note that the a.year = 2012 test is in the WHERE clause, not the ON clause for the JOIN. (I'm doing a correlation query: which users searched for movies and books on the same day?) Some outer and semi join constructs are supported, as well as some Hadoop-specific optimization constructs.
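As a sketch of the semi join support just mentioned (not an example from the talk): LEFT SEMI JOIN is Hive's stand-in for IN/EXISTS subqueries. Restrictions on the right-hand table must go in the ON clause, and only the left table's columns may appear in the SELECT list.

    -- Movie searches by users who also looked at books the same day:
    SELECT a.hms, a.uid, a.message
    FROM clicks a
    LEFT SEMI JOIN clicks b
    ON (a.uid = b.uid AND a.day = b.day AND b.process = 'books')
    WHERE a.process = 'movies' AND a.year = 2012;

Unlike the inner join version above, this returns each matching left-side row at most once, which is usually what you want for an existence test.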
A Final Example of Controlling MapReduce...

Specify Map & Reduce Processes
• Calling out to external programs to perform the map and reduce operations.

Example

    FROM (
      FROM clicks
      MAP message
      USING '/tmp/vampire_extractor'
      AS item_title, count
      CLUSTER BY item_title) it
    INSERT OVERWRITE TABLE vampire_stuff
    REDUCE it.item_title, it.count
    USING '/tmp/thing_counter.py'
    AS item_title, counts;

Note the MAP ... USING and REDUCE ... USING clauses, which call specific external map and reduce processes. CLUSTER BY is like GROUP BY, but directs output to specific reducers (here, distributing and sorting on item_title). The INSERT OVERWRITE TABLE clause shows how to populate an "internal" table.
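For reference, MAP ... USING and REDUCE ... USING are syntactic variants of Hive's more general TRANSFORM clause; despite the names, neither forces the script into an actual map or reduce phase. A minimal equivalent sketch, reusing the illustrative extractor script from above:

    -- Equivalent to the MAP ... USING step in the example:
    FROM clicks
    SELECT TRANSFORM (message)
    USING '/tmp/vampire_extractor'
    AS item_title, count;

The external script simply reads tab-separated input lines on stdin and writes tab-separated output lines on stdout, so it can be written in any language.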
Hive: Conclusions

Hive Disadvantages
• Not a real SQL database: no transactions, updates, etc.
• ... but features will grow.
• High-latency queries.
• Documentation is poor.

Hive Advantages
• Indispensable for SQL users.
• Easier than the Java MapReduce API.
• Makes porting data warehouse apps to Hadoop much easier.

Questions?
dean.wampler@typesafe.com
@deanwampler
github.com/deanwampler/Presentations
Hire me!