Databricks Spark SQL

Published: 02-08-2017
Chapter 8: Spark Databricks

Creating a big data analytics cluster, importing data, and creating ETL streams to cleanse and process the data are hard to do, and also expensive. The aim of Databricks is to reduce this complexity, and to make cluster creation and data processing easier. They have created a cloud-based platform, based on Apache Spark, that automates cluster creation, and simplifies data import, processing, and visualization. Currently, the storage is based upon AWS but, in the future, they plan to expand to other cloud providers. The same people who designed Apache Spark are involved in the Databricks system. At the time of writing this book, the service was only accessible via registration, and I was offered a 30-day trial period. Over the next two chapters, I will examine the service and its components, and offer some sample code to show how it works. This chapter will cover the following topics:

• Installing Databricks
• AWS configuration
• Account management
• The menu system
• Notebooks and folders
• Importing jobs via libraries
• Development environments
• Databricks tables
• The Databricks DbUtils package

Given that this book is provided in a static format, it will be difficult to fully examine functionality such as streaming.

Overview

The Databricks service is based upon the idea of a cluster. This is similar to a Spark cluster, which has already been examined and used in previous chapters: it contains a master, workers, and executors. However, the configuration and the size of the cluster are automated, depending upon the amount of memory that you specify. Features such as security, isolation, process monitoring, and resource management are all automatically managed for you. If you have an immediate requirement for a Spark-based cluster using 200 GB of memory for a short period of time, this service can be used to dynamically create it, and process your data.
You can terminate the cluster to reduce your costs when the processing is finished. Within a cluster, the idea of a Notebook is introduced as a location for you to create scripts and run programs. Folders can be created within the workspace, and Notebooks, which can be based upon Scala, Python, or SQL, can be created within them. Jobs can be created to execute the functionality, and can be called from the Notebook code or from imported libraries. Notebooks can call the functionality of other Notebooks, and jobs can be scheduled to run at a given time, or on an event. This should give you a feel for what the Databricks service provides; the following sections will explain each major item that has been introduced. Please keep in mind that what is presented here is new and evolving. Also, I used the AWS US East (North Virginia) region for this demonstration, as the Asia Sydney region currently has limitations that caused the Databricks install to fail.

Installing Databricks

In order to create this demonstration, I used the AWS offer of a year's free access. This has limitations, such as 5 GB of S3 storage and 750 hours of Amazon Elastic Compute Cloud (EC2), but it allowed me low-cost access and reduced my overall EC2 costs. The AWS account provides the following:

• An account ID
• An access key ID
• A secret access key

These items of information are used by Databricks to access your AWS storage, install the Databricks system, and create the cluster components that you specify. From the moment of the install, you begin to incur AWS EC2 costs, as the Databricks system uses at least two running instances, even without any clusters. Once you have successfully entered your AWS and billing information, you will be prompted to launch the Databricks cloud. Having done this, you will be provided with a URL to access your cloud, an admin account, and a password.
This will allow you to access the Databricks web-based user interface, as shown in the following screenshot: This is the welcome screen. It shows the menu bar at the top of the image which, from left to right, contains the menu, search, help, and account icons. While using the system, there may also be a clock-faced icon that shows recent activity. From this single interface, you may search through the help screens and usage examples before creating your own clusters and code.

AWS billing

Please note that, once you have the Databricks system installed, you will start incurring AWS EC2 storage costs. Databricks attempts to minimize your costs by keeping EC2 resources active for a full charging period. For instance, if you terminate a Databricks cluster, the cluster-based EC2 instances will still exist for the hour in which AWS bills for them; in this way, Databricks can reuse them if you create a new cluster. The following screenshot shows that, although I am using a free AWS account, and though I have carefully reduced my resource usage, I have incurred AWS EC2 costs in a short period of time: You need to be aware of the Databricks clusters that you create, and understand that, while they exist and are used, AWS costs are being incurred. Only keep the clusters that you really require, and terminate any others. In order to examine the Databricks data import functionality, I also created an AWS S3 bucket, and uploaded data files to it. This will be explained later in this chapter.

Databricks menus

By selecting the top-left menu icon on the Databricks web interface, it is possible to expand the menu system. The following screenshot shows the top-level menu options, as well as the Workspace option, expanded to a folder hierarchy of /folder1/folder2/. Finally, it shows the actions that can be carried out on folder2, that is, creating a notebook, creating a dashboard, and more.
All of these actions will be expanded in future sections. The next section will examine account management, before moving on to clusters.

Account management

Account management is quite simple within Databricks. There is a default Administrator account, and subsequent accounts can be created, but you need to know the Administrator password to do so. Account options can be accessed from the top-right menu option, shown in the following screenshot: This also allows the user to log out. By selecting the account setting, you can change your password. By selecting the Accounts menu option, an Accounts list is generated. There, you will find an option to Add Account, and each account can be deleted via an X option on each account line, as shown in the following screenshot: It is also possible to reset the account passwords from the accounts list. Selecting the Add Account option creates a new account window that requires an email address, a full name, the administrator password, and the user's password. So, if you want to create a new user, you need to know your Databricks instance Administrator password. You must also follow the rules for new passwords, which are as follows:

• A minimum of eight characters
• At least one digit in the range 0-9
• At least one upper case character in the range A-Z
• At least one non-alphanumeric character, for example %

The next section will examine the Clusters menu option, which enables you to manage your own Databricks Spark clusters.

Cluster management

Selecting the Clusters menu option provides a list of your current Databricks clusters and their status. Of course, currently you have none. Selecting the Add Cluster option allows you to create one.
Note that the amount of memory you specify determines the size of your cluster. A minimum of 54 GB is required to create a cluster with a single master and worker; for each additional 54 GB specified, another worker is added. The following screenshot is a concatenated image, showing a new cluster called semclust1 being created, in a Pending state. While Pending, the cluster has no dashboard, and the cluster nodes are not accessible. Once created, the cluster memory is listed and its status changes from Pending to Running. A default dashboard is automatically attached, and the Spark master and worker user interfaces can be accessed. It is important to note here that Databricks automatically starts and manages the cluster processes. There is also an Option column to the right of this display that offers the ability to Configure, Restart, or Terminate a cluster, as shown in the following screenshot: By reconfiguring a cluster, you can change its size; by adding more memory, you can add more workers. The following screenshot shows a cluster, created at the default size of 54 GB, having its memory extended to 108 GB. Terminating a cluster removes it, and it cannot be recovered, so you need to be sure that deletion is the correct course of action. Databricks prompts you to confirm your action before the termination actually takes place. It takes time for a cluster to be both created and terminated. During termination, the cluster is marked with an orange banner, and a state of Terminating, as shown in the following screenshot: Note that the cluster type in the previous screenshot is shown as On-demand. When creating a cluster, it is possible to select a check box called Use spot instances to create a spot cluster. These clusters are cheaper than on-demand clusters, as they bid for a cheaper AWS spot price. However, they can be slower to start than on-demand clusters.
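The sizing rule described above (a 54 GB minimum buys a master plus one worker, and each further 54 GB adds a worker) can be sketched as a small helper. The function name is mine, purely illustrative, and not part of any Databricks API:

```python
def workers_for_memory(memory_gb: int) -> int:
    """Illustrative only: estimate the worker count Databricks would
    create for a given cluster memory setting, per the rule above."""
    if memory_gb < 54:
        raise ValueError("minimum cluster size is 54 GB")
    return memory_gb // 54  # one worker per 54 GB block

# For example: 54 GB gives 1 worker, 108 GB gives 2 workers.
```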
The Spark user interfaces are the same as those you would expect on a non-Databricks Spark cluster: you can examine workers, executors, configuration, and log files. As you create clusters, they will be added to your cluster list. One of the clusters will be used as the cluster where the dashboards are run; this can be changed by using the Make Dashboard Cluster option. As you add libraries and Notebooks to your cluster, the cluster details entry will be updated with a count of the numbers added. The only thing that I would note about the (otherwise familiar) Databricks Spark user interface at this time is that it displays the Spark version that is used. The following screenshot, extracted from the master user interface, shows that the Spark version being used (1.3.0) is very up-to-date. At the time of writing, the latest Apache Spark release was 1.3.1, dated 17 April, 2015. The next section will examine Databricks Notebooks and folders: how to create them, and how they can be used.

Notebooks and folders

A Notebook is a special type of Databricks folder that can be used to create Spark scripts. Notebooks can call other Notebook scripts to create a hierarchy of functionality. When created, the type of Notebook must be specified (Python, Scala, or SQL), and a cluster can then be specified for the Notebook functionality to run against. The following screenshot shows the Notebook creation. Note that a menu option, to the right of a Notebook session, allows the type of the Notebook to be changed. The following example shows that a Python notebook can be changed to Scala, SQL, or Markdown: Note that a Scala Notebook cannot be changed to Python, and a Python Notebook cannot be changed to Scala. The terms Python, Scala, and SQL are well understood as development languages; however, Markdown is new. Markdown allows formatted documentation to be created from formatting commands embedded in text.
A simple reference can be found at help.html. This means that formatted comments can be added to the Notebook session as scripts are created. Notebooks are further subdivided into cells, which contain the commands to be executed. Cells can be moved within a Notebook by hovering over the top-left corner, and dragging them into position. New cells can be inserted into a cell list within a Notebook. Also, using the %sql command within a Scala or Python Notebook cell allows SQL syntax to be used, and using the %md command allows Markdown comments to be added within a cell. Typically, the key combination of Shift + Enter causes text blocks in a Notebook or folder to be executed. Comments can also be added to a Notebook cell. The menu options available at the top-right section of a Notebook cell, shown in the following screenshot, include comment, as well as the minimize and maximize options: Multiple web-based sessions may share a Notebook. The actions that occur within the Notebook will be propagated to each web interface viewing it. Also, the Markdown and comment options can be used to enable communication between users, to aid interactive data investigation within a distributed group. The previous screenshot shows the header of a Notebook session for notebook1. It shows the Notebook name and type (Scala). It also shows the option to lock the Notebook to make it read-only, as well as the option to detach it from its cluster. The following screenshot shows the creation of a folder within a Notebook workspace: A drop-down menu, from the Workspace main menu option, allows for the creation of a folder, in this case named folder1. The later sections will describe the other options in this menu. Once created and selected, a drop-down menu from the new folder called folder1 shows the actions associated with it in the following screenshot: So, a folder can be exported to a DBC archive. It can be locked, or cloned to create a copy.
It can also be renamed, or deleted. Items can be imported into it, for instance files, which will be explained by example later. Also, new notebooks, dashboards, libraries, and folders can be created within it. In the same way as actions can be carried out against a folder, a Notebook has a set of possible actions. The following screenshot shows the actions available via a drop-down menu for the Notebook called notebook1, which is currently attached to the running cluster called semclust1. It is possible to rename, delete, lock, or clone a Notebook. It is also possible to detach it from its current cluster, or attach it if it is detached. It is also possible to export the Notebook to a file, or to a DBC archive. From the folder Import option, files can be imported to a folder. The following screenshot shows the file drop-option window that is invoked if this option is selected. It is possible to either drop a file onto the upload pane from the local server, or click on this pane to open a navigation browser to search the local server for files to upload. Note that the files that are uploaded need to be of a specific type. The following screenshot, taken from the file browser when browsing for a file to upload, shows the supported file types, which make sense: Scala, SQL, and Python scripts, as well as DBC archives and JAR file libraries. Before leaving this section, it should also be noted that Notebooks and folders can be dragged and dropped to change their position. The next section will examine Databricks jobs and libraries via simple worked examples.

Jobs and libraries

Within Databricks, it is possible to import JAR libraries and run the classes in them on your clusters. I will create a very simple piece of Scala code to print out the first 100 elements of the Fibonacci series as BigInt values, locally on my CentOS Linux server.
I will compile my class into a JAR file using SBT, run it locally to check the result, and then run it on my Databricks cluster to compare the results. The code looks as follows:

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._
  import org.apache.spark.SparkConf

  object db_ex1 extends App {

    val appName = "Databricks example 1"
    val conf = new SparkConf()
    conf.setAppName(appName)
    // Note: no setMaster() call is made. Databricks supplies the Spark
    // master URL itself, so it must not be hard-coded here.
    val sparkCxt = new SparkContext(conf)

    var seed1: BigInt = 1
    var seed2: BigInt = 1
    val limit = 100
    var resultStr = seed1 + " " + seed2 + " "

    for (i <- 1 to limit) {
      val fib: BigInt = seed1 + seed2
      resultStr += fib.toString + " "
      seed1 = seed2
      seed2 = fib
    }

    println()
    println("Result : " + resultStr)
    println()

    sparkCxt.stop()

  } // end application

This is not the most elegant piece of code, nor the best way to generate Fibonacci numbers, but I just want a sample JAR and class to use with Databricks. When run locally, I get the first 100 terms, which look as follows (I have clipped this data to save space):

Result : 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181 6765 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040 1346269 2178309 3524578 5702887 9227465 14930352 24157817 39088169 63245986 102334155 165580141 267914296 433494437 701408733 1134903170 1836311903 2971215073 4807526976 7778742049 12586269025 20365011074 32951280099 53316291173 ... 4660046610375530309 7540113804746346429 12200160415121876738 19740274219868223167 31940434634990099905 51680708854858323072 83621143489848422977 135301852344706746049 218922995834555169026 354224848179261915075 573147844013817084101 927372692193078999176

The library that has been created is called data-bricks_2.10-1.0.jar. From my folder menu, I can create a new Library using the menu drop-down option. This allows me to specify the library source as a JAR file, name the new library, and load the library JAR file from my local server.
The following screenshot shows an example of this process: When the library has been created, it can be attached to the cluster called semclust1, my Databricks cluster, using the Attach option. The following screenshot shows the new library in the process of attaching: In the following example, a job called job2 has been created by selecting the jar option on the Task item. For the job, the same JAR file has been loaded, and the class db_ex1 has been assigned to run from the library. The cluster has been specified as on-demand, meaning that a cluster will be created automatically to run the job. The Active runs section shows the job running in the following screenshot: Once run, the job is moved to the Completed runs section of the display. The following screenshot, for the same job, shows that it took 47 seconds to run, that it was launched manually, and that it succeeded. By selecting the run named Run 1 in the previous screenshot, it is possible to see the run output. The following screenshot shows the same result as the run on my local server. I have clipped the output text to make it presentable and readable on this page, but you can see that the output is the same. So, even from this very simple example, it is obvious that it is possible to develop applications remotely, and load them onto a Databricks cluster as JAR files in order to execute them. However, each time a Databricks cluster is created on AWS EC2, the Spark URL changes, so the application must not hard-code details such as the Spark master URL; Databricks will set the Spark URL automatically. When running JAR file classes in this way, it is also possible to define class parameters. Jobs may be scheduled to run at a given time, or periodically. Job timeouts and alert email addresses may also be specified.
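As a quick sanity check on the job output shown above, the same series can be reproduced outside Spark with a few lines of plain Python (Python integers, like Scala's BigInt, have arbitrary precision). This sketch is included only to verify the numbers, and is not part of the Databricks example:

```python
def fib_series(limit: int) -> list:
    """Return the Fibonacci terms printed by the Scala example: the two
    seed values of 1, followed by `limit` generated terms."""
    seed1, seed2 = 1, 1
    terms = [seed1, seed2]
    for _ in range(limit):
        seed1, seed2 = seed2, seed1 + seed2
        terms.append(seed2)
    return terms

# The final generated term matches the end of the job output above.
print(fib_series(100)[-1])  # 927372692193078999176
```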
Development environments

It has been shown that scripts can be created in Notebooks in Scala, Python, or SQL, but it is also possible to use an IDE such as IntelliJ or Eclipse to develop code. By installing an SBT plugin into the development environment, it is possible to develop code for your Databricks environment. The current release of Databricks, as I write this book, is 1.3.2d. The Release Notes link, under New Features on the start page, contains a link to the IDE integration documentation. The documentation URL contains a section starting with dbc, which should be changed to match the URL for the Databricks cloud that you create. I won't expand on this here, but leave it to you to investigate. In the next section, I will investigate the Databricks table data processing functionality.

Databricks tables

The Databricks Tables menu option allows you to store your data in a tabular form with an associated schema. The Tables menu option allows you to both create a table and refresh your tables list, as the following screenshot shows:
