Apache HBase Reference Guide Tutorial

OliviaCutts
Published Date:01-08-2017
Apache HBase™ Reference Guide
Apache HBase Team
Version 3.0.0-SNAPSHOT

Support and Testing Expectations

The phrases /supported/, /not supported/, /tested/, and /not tested/ occur in several places throughout this guide. In the interest of clarity, here is a brief explanation of what is generally meant by these phrases in the context of HBase.

Commercial technical support for Apache HBase is provided by many Hadoop vendors. This is not the sense in which the term /support/ is used in the context of the Apache HBase project. The Apache HBase team assumes no responsibility for your HBase clusters, your configuration, or your data.

Supported
In the context of Apache HBase, /supported/ means that HBase is designed to work in the way described, and deviation from the defined behavior or functionality should be reported as a bug.

Not Supported
In the context of Apache HBase, /not supported/ means that a use case or use pattern is not expected to work and should be considered an antipattern. If you think this designation should be reconsidered for a given feature or use pattern, file a JIRA or start a discussion on one of the mailing lists.

Tested
In the context of Apache HBase, /tested/ means that a feature is covered by unit or integration tests, and has been proven to work as expected.

Not Tested
In the context of Apache HBase, /not tested/ means that a feature or use pattern may or may not work in a given way, and may or may not corrupt your data or cause operational issues.
It is an unknown, and there are no guarantees. If you can provide proof that a feature designated as /not tested/ does work in a given way, please submit the tests and/or the metrics so that other users can gain certainty about such features or use patterns.

Getting Started

Chapter 1. Introduction

Quickstart will get you up and running on a single-node, standalone instance of HBase.

Chapter 2. Quick Start - Standalone HBase

This section describes the setup of a single-node standalone HBase. A standalone instance has all HBase daemons (the Master, RegionServers, and ZooKeeper) running in a single JVM persisting to the local filesystem. It is our most basic deploy profile. We will show you how to create a table in HBase using the hbase shell CLI, insert rows into the table, perform put and scan operations against the table, enable or disable the table, and start and stop HBase. Apart from downloading HBase, this procedure should take less than 10 minutes.

Prior to HBase 0.94.x, HBase expected the loopback IP address to be 127.0.0.1. Ubuntu and some other distributions default to 127.0.1.1, and this will cause problems for you. See Why does HBase care about /etc/hosts? for details.

The following /etc/hosts file works correctly for HBase 0.94.x and earlier, on Ubuntu. Use this as a template if you run into trouble.

    127.0.0.1 localhost
    127.0.0.1 ubuntu.ubuntu-domain ubuntu

This issue has been fixed in hbase-0.96.0 and beyond.

2.1. JDK Version Requirements

HBase requires that a JDK be installed. See Java for information about supported JDK versions.

2.2. Get Started with HBase

Procedure: Download, Configure, and Start HBase in Standalone Mode

1. Choose a download site from this list of Apache Download Mirrors. Click on the suggested top link. This will take you to a mirror of HBase Releases. Click on the folder named stable and then download the binary file that ends in .tar.gz to your local filesystem. Do not download the file ending in src.tar.gz for now.

2.
Extract the downloaded file, and change to the newly-created directory.

    $ tar xzvf hbase-3.0.0-SNAPSHOT-bin.tar.gz
    $ cd hbase-3.0.0-SNAPSHOT/

3. You are required to set the JAVA_HOME environment variable before starting HBase. You can set the variable via your operating system's usual mechanism, but HBase provides a central mechanism, conf/hbase-env.sh. Edit this file, uncomment the line starting with JAVA_HOME, and set it to the appropriate location for your operating system. The JAVA_HOME variable should be set to a directory which contains the executable file bin/java. Most modern Linux operating systems provide a mechanism, such as /usr/bin/alternatives on RHEL or CentOS, for transparently switching between versions of executables such as Java. In this case, you can set JAVA_HOME to the directory containing the symbolic link to bin/java, which is usually /usr.

    JAVA_HOME=/usr

4. Edit conf/hbase-site.xml, which is the main HBase configuration file. At this time, you only need to specify the directory on the local filesystem where HBase and ZooKeeper write data. By default, a new directory is created under /tmp. Many servers are configured to delete the contents of /tmp upon reboot, so you should store the data elsewhere. The following configuration will store HBase's data in the hbase directory, in the home directory of the user called testuser. Paste the <property> tags beneath the <configuration> tags, which should be empty in a new HBase install.

Example 1. Example hbase-site.xml for Standalone HBase

    <configuration>
      <property>
        <name>hbase.rootdir</name>
        <value>file:///home/testuser/hbase</value>
      </property>
      <property>
        <name>hbase.zookeeper.property.dataDir</name>
        <value>/home/testuser/zookeeper</value>
      </property>
    </configuration>

You do not need to create the HBase data directory. HBase will do this for you. If you create the directory, HBase will attempt to do a migration, which is not what you want.
The hbase.rootdir in the above example points to a directory in the local filesystem. The 'file://' prefix is how we denote the local filesystem. To home HBase on an existing instance of HDFS, set hbase.rootdir to point at a directory on your instance, e.g. hdfs://namenode.example.org:8020/hbase. For more on this variant, see the section below on Standalone HBase over HDFS.

5. The bin/start-hbase.sh script is provided as a convenient way to start HBase. Issue the command, and if all goes well, a message is logged to standard output showing that HBase started successfully. You can use the jps command to verify that you have one running process called HMaster. In standalone mode HBase runs all daemons within this single JVM, i.e. the HMaster, a single HRegionServer, and the ZooKeeper daemon. Go to http://localhost:16010 to view the HBase Web UI.

Java needs to be installed and available. If you get an error indicating that Java is not installed, but it is on your system, perhaps in a non-standard location, edit the conf/hbase-env.sh file and modify the JAVA_HOME setting to point to the directory that contains bin/java on your system.

Procedure: Use HBase For the First Time

1. Connect to HBase.
Connect to your running instance of HBase using the hbase shell command, located in the bin/ directory of your HBase install. In this example, some usage and version information that is printed when you start HBase Shell has been omitted. The HBase Shell prompt ends with a > character.

    $ ./bin/hbase shell
    hbase(main):001:0>

2. Display HBase Shell Help Text.
Type help and press Enter to display some basic usage information for HBase Shell, as well as several example commands. Notice that table names, rows, and columns must all be enclosed in quote characters.

3. Create a table.
Use the create command to create a new table. You must specify the table name and the ColumnFamily name.

    hbase(main):001:0> create 'test', 'cf'
    0 row(s) in 0.4170 seconds

    => Hbase::Table - test

4.
List Information About your Table.
Use the list command to confirm that your table exists.

    hbase(main):002:0> list 'test'
    TABLE
    test
    1 row(s) in 0.0180 seconds

    => ["test"]

5. Put data into your table.
To put data into your table, use the put command.

    hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
    0 row(s) in 0.0850 seconds

    hbase(main):004:0> put 'test', 'row2', 'cf:b', 'value2'
    0 row(s) in 0.0110 seconds

    hbase(main):005:0> put 'test', 'row3', 'cf:c', 'value3'
    0 row(s) in 0.0100 seconds

Here, we insert three values, one at a time. The first insert is at row1, column cf:a, with a value of value1. Columns in HBase are composed of a column family prefix, cf in this example, followed by a colon and then a column qualifier suffix, a in this case.

6. Scan the table for all data at once.
One of the ways to get data from HBase is to scan. Use the scan command to scan the table for data. You can limit your scan, but for now, all data is fetched.

    hbase(main):006:0> scan 'test'
    ROW       COLUMN+CELL
     row1     column=cf:a, timestamp=1421762485768, value=value1
     row2     column=cf:b, timestamp=1421762491785, value=value2
     row3     column=cf:c, timestamp=1421762496210, value=value3
    3 row(s) in 0.0230 seconds

7. Get a single row of data.
To get a single row of data at a time, use the get command.

    hbase(main):007:0> get 'test', 'row1'
    COLUMN    CELL
     cf:a     timestamp=1421762485768, value=value1
    1 row(s) in 0.0350 seconds

8. Disable a table.
If you want to delete a table or change its settings, as well as in some other situations, you need to disable the table first, using the disable command. You can re-enable it using the enable command.

    hbase(main):008:0> disable 'test'
    0 row(s) in 1.1820 seconds

    hbase(main):009:0> enable 'test'
    0 row(s) in 0.1770 seconds

Disable the table again if you tested the enable command above:

    hbase(main):010:0> disable 'test'
    0 row(s) in 1.1820 seconds

9. Drop the table.
To drop (delete) a table, use the drop command.

    hbase(main):011:0> drop 'test'
    0 row(s) in 0.1370 seconds

10. Exit the HBase Shell.
To exit the HBase Shell and disconnect from your cluster, use the quit command. HBase is still running in the background.

Procedure: Stop HBase

1. In the same way that the bin/start-hbase.sh script is provided to conveniently start all HBase daemons, the bin/stop-hbase.sh script stops them.

    $ ./bin/stop-hbase.sh
    stopping hbase....................

2. After issuing the command, it can take several minutes for the processes to shut down. Use the jps command to be sure that the HMaster and HRegionServer processes are shut down.

The above has shown you how to start and stop a standalone instance of HBase. In the next sections we give a quick overview of other modes of HBase deployment.

2.3. Pseudo-Distributed Local Install

After working your way through quickstart standalone mode, you can re-configure HBase to run in pseudo-distributed mode. Pseudo-distributed mode means that HBase still runs completely on a single host, but each HBase daemon (HMaster, HRegionServer, and ZooKeeper) runs as a separate process. In standalone mode, all daemons ran in one JVM process. By default, unless you configure the hbase.rootdir property as described in quickstart, your data is still stored in /tmp/. In this walk-through, we store your data in HDFS instead, assuming you have HDFS available. You can skip the HDFS configuration to continue storing your data in the local filesystem.

Hadoop Configuration
This procedure assumes that you have configured Hadoop and HDFS on your local system and/or a remote system, and that they are running and available. It also assumes you are using Hadoop 2. The guide on Setting up a Single Node Cluster in the Hadoop documentation is a good starting point.

1. Stop HBase if it is running.
If you have just finished quickstart and HBase is still running, stop it. This procedure will create a totally new directory where HBase will store its data, so any databases you created before will be lost.

2. Configure HBase.
Edit the hbase-site.xml configuration.
First, add the following property, which directs HBase to run in distributed mode, with one JVM instance per daemon.

    <property>
      <name>hbase.cluster.distributed</name>
      <value>true</value>
    </property>

Next, change the hbase.rootdir from the local filesystem to the address of your HDFS instance, using the hdfs://// URI syntax. In this example, HDFS is running on the localhost at port 8020.

    <property>
      <name>hbase.rootdir</name>
      <value>hdfs://localhost:8020/hbase</value>
    </property>

You do not need to create the directory in HDFS. HBase will do this for you. If you create the directory, HBase will attempt to do a migration, which is not what you want.

3. Start HBase.
Use the bin/start-hbase.sh command to start HBase. If your system is configured correctly, the jps command should show the HMaster and HRegionServer processes running.

4. Check the HBase directory in HDFS.
If everything worked correctly, HBase created its directory in HDFS. In the configuration above, it is stored in /hbase/ on HDFS. You can use the hadoop fs command in Hadoop's bin/ directory to list this directory.

    $ ./bin/hadoop fs -ls /hbase
    Found 7 items
    drwxr-xr-x   - hbase users          0 2014-06-25 18:58 /hbase/.tmp
    drwxr-xr-x   - hbase users          0 2014-06-25 21:49 /hbase/WALs
    drwxr-xr-x   - hbase users          0 2014-06-25 18:48 /hbase/corrupt
    drwxr-xr-x   - hbase users          0 2014-06-25 18:58 /hbase/data
    -rw-r--r--   3 hbase users         42 2014-06-25 18:41 /hbase/hbase.id
    -rw-r--r--   3 hbase users          7 2014-06-25 18:41 /hbase/hbase.version
    drwxr-xr-x   - hbase users          0 2014-06-25 21:49 /hbase/oldWALs

5. Create a table and populate it with data.
You can use the HBase Shell to create a table, populate it with data, and scan and get values from it, using the same procedure as in shell exercises.

6. Start and stop a backup HBase Master (HMaster) server.
Running multiple HMaster instances on the same hardware does not make sense in a production environment, in the same way that running a pseudo-distributed cluster does not make sense for production.
This step is offered for testing and learning purposes only.

The HMaster server controls the HBase cluster. You can start up to 9 backup HMaster servers, which makes 10 total HMasters, counting the primary. To start a backup HMaster, use the local-master-backup.sh script. For each backup master you want to start, add a parameter representing the port offset for that master. Each HMaster uses three ports (16010, 16020, and 16030 by default). The port offset is added to these ports, so using an offset of 2, the backup HMaster would use ports 16012, 16022, and 16032. The following command starts 3 backup servers using ports 16012/16022/16032, 16013/16023/16033, and 16015/16025/16035.

    $ ./bin/local-master-backup.sh start 2 3 5

To kill a backup master without killing the entire cluster, you need to find its process ID (PID). The PID is stored in a file with a name like /tmp/hbase-USER-X-master.pid. The file contains only the PID. You can use the kill -9 command to kill that PID. The following command will kill the master with port offset 1, but leave the cluster running:

    $ cat /tmp/hbase-testuser-1-master.pid | xargs kill -9

7. Start and stop additional RegionServers.
The HRegionServer manages the data in its StoreFiles as directed by the HMaster. Generally, one HRegionServer runs per node in the cluster. Running multiple HRegionServers on the same system can be useful for testing in pseudo-distributed mode. The local-regionservers.sh command allows you to run multiple RegionServers. It works in a similar way to the local-master-backup.sh command, in that each parameter you provide represents the port offset for an instance. Each RegionServer requires two ports, and the default ports are 16020 and 16030. However, the base ports for additional RegionServers are not the default ports, since the default ports are used by the HMaster, which is also a RegionServer since HBase version 1.0.0. The base ports are 16200 and 16300 instead.
You can run 99 additional RegionServers that are not an HMaster or backup HMaster on a server. The following command starts four additional RegionServers, running on sequential ports starting at 16202/16302 (base ports 16200/16300 plus 2).

    $ ./bin/local-regionservers.sh start 2 3 4 5

To stop a RegionServer manually, use the local-regionservers.sh command with the stop parameter and the offset of the server to stop.

    $ ./bin/local-regionservers.sh stop 3

8. Stop HBase.
You can stop HBase the same way as in the quickstart procedure, using the bin/stop-hbase.sh command.

2.4. Advanced - Fully Distributed

In reality, you need a fully-distributed configuration to fully test HBase and to use it in real-world scenarios. In a distributed configuration, the cluster contains multiple nodes, each of which runs one or more HBase daemons. These include primary and backup Master instances, multiple ZooKeeper nodes, and multiple RegionServer nodes.

This advanced quickstart adds two more nodes to your cluster. The architecture will be as follows:

Table 1. Distributed Cluster Demo Architecture

    Node Name            Master   ZooKeeper   RegionServer
    node-a.example.com   yes      yes         no
    node-b.example.com   backup   yes         yes
    node-c.example.com   no       yes         yes

This quickstart assumes that each node is a virtual machine and that they are all on the same network. It builds upon the previous quickstart, Pseudo-Distributed Local Install, assuming that the system you configured in that procedure is now node-a. Stop HBase on node-a before continuing.

Be sure that all the nodes have full access to communicate, and that no firewall rules are in place which could prevent them from talking to each other. If you see any errors like no route to host, check your firewall.

Procedure: Configure Passwordless SSH Access

node-a needs to be able to log into node-b and node-c (and to itself) in order to start the daemons.
The easiest way to accomplish this is to use the same username on all hosts, and configure password-less SSH login from node-a to each of the others.

1. On node-a, generate a key pair.
While logged in as the user who will run HBase, generate an SSH key pair, using the following command:

    $ ssh-keygen -t rsa

If the command succeeds, the location of the key pair is printed to standard output. The default name of the public key is id_rsa.pub.

2. Create the directory that will hold the shared keys on the other nodes.
On node-b and node-c, log in as the HBase user and create a .ssh/ directory in the user's home directory, if it does not already exist. If it already exists, be aware that it may already contain other keys.

3. Copy the public key to the other nodes.
Securely copy the public key from node-a to each of the nodes, by using scp or some other secure means. On each of the other nodes, create a new file called .ssh/authorized_keys if it does not already exist, and append the contents of the id_rsa.pub file to the end of it. Note that you also need to do this for node-a itself.

    $ cat id_rsa.pub >> ~/.ssh/authorized_keys

4. Test password-less login.
If you performed the procedure correctly, you should not be prompted for a password when you SSH from node-a to either of the other nodes using the same username.

5. Since node-b will run a backup Master, repeat the procedure above, substituting node-b everywhere you see node-a. Be sure not to overwrite your existing .ssh/authorized_keys files, but concatenate the new key onto the existing file using the >> operator rather than the > operator.

Procedure: Prepare node-a

node-a will run your primary master and ZooKeeper processes, but no RegionServers. Stop the RegionServer from starting on node-a.

1. Edit conf/regionservers and remove the line which contains localhost. Add lines with the hostnames or IP addresses for node-b and node-c.
Even if you did want to run a RegionServer on node-a, you should refer to it by the hostname the other servers would use to communicate with it. In this case, that would be node-a.example.com. This enables you to distribute the configuration to each node of your cluster without any hostname conflicts. Save the file.

2. Configure HBase to use node-b as a backup master.
Create a new file in conf/ called backup-masters, and add a new line to it with the hostname for node-b. In this demonstration, the hostname is node-b.example.com.

3. Configure ZooKeeper.
In reality, you should carefully consider your ZooKeeper configuration. You can find out more about configuring ZooKeeper in the zookeeper section. This configuration will direct HBase to start and manage a ZooKeeper instance on each node of the cluster. On node-a, edit conf/hbase-site.xml and add the following properties.

    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>node-a.example.com,node-b.example.com,node-c.example.com</value>
    </property>
    <property>
      <name>hbase.zookeeper.property.dataDir</name>
      <value>/usr/local/zookeeper</value>
    </property>

4. Everywhere in your configuration that you have referred to node-a as localhost, change the reference to point to the hostname that the other nodes will use to refer to node-a. In these examples, the hostname is node-a.example.com.

Procedure: Prepare node-b and node-c

node-b will run a backup master server and a ZooKeeper instance.

1. Download and unpack HBase.
Download and unpack HBase to node-b, just as you did for the standalone and pseudo-distributed quickstarts.

2. Copy the configuration files from node-a to node-b and node-c.
Each node of your cluster needs to have the same configuration information. Copy the contents of the conf/ directory to the conf/ directory on node-b and node-c.

Procedure: Start and Test Your Cluster

1. Be sure HBase is not running on any node.
If you forgot to stop HBase from previous testing, you will have errors.
Check to see whether HBase is running on any of your nodes by using the jps command. Look for the processes HMaster, HRegionServer, and HQuorumPeer. If they exist, kill them.

2. Start the cluster.
On node-a, issue the start-hbase.sh command. Your output will be similar to that below.

    $ bin/start-hbase.sh
    node-c.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-c.example.com.out
    node-a.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-a.example.com.out
    node-b.example.com: starting zookeeper, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-zookeeper-node-b.example.com.out
    starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-node-a.example.com.out
    node-c.example.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-c.example.com.out
    node-b.example.com: starting regionserver, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-regionserver-node-b.example.com.out
    node-b.example.com: starting master, logging to /home/hbuser/hbase-0.98.3-hadoop2/bin/../logs/hbase-hbuser-master-nodeb.example.com.out

ZooKeeper starts first, followed by the master, then the RegionServers, and finally the backup masters.

3. Verify that the processes are running.
On each node of the cluster, run the jps command and verify that the correct processes are running on each server. You may see additional Java processes running on your servers as well, if they are used for other purposes.

Example 2. node-a jps Output

    $ jps
    20355 Jps
    20071 HQuorumPeer
    20137 HMaster

Example 3. node-b jps Output

    $ jps
    15930 HRegionServer
    16194 Jps
    15838 HQuorumPeer
    16010 HMaster

Example 4.
node-c jps Output

    $ jps
    13901 Jps
    13639 HQuorumPeer
    13737 HRegionServer

ZooKeeper Process Name
The HQuorumPeer process is a ZooKeeper instance which is controlled and started by HBase. If you use ZooKeeper this way, it is limited to one instance per cluster node and is appropriate for testing only. If ZooKeeper is run outside of HBase, the process is called QuorumPeer. For more about ZooKeeper configuration, including using an external ZooKeeper instance with HBase, see the zookeeper section.

4. Browse to the Web UI.

Web UI Port Changes
In HBase newer than 0.98.x, the HTTP ports used by the HBase Web UI changed from 60010 for the Master and 60030 for each RegionServer to 16010 for the Master and 16030 for the RegionServer.

If everything is set up correctly, you should be able to connect to the UI for the Master at http://node-a.example.com:16010/ or the secondary master at http://node-b.example.com:16010/ using a web browser. If you can connect via localhost but not from another host, check your firewall rules. You can see the web UI for each of the RegionServers at port 16030 of their IP addresses, or by clicking their links in the web UI for the Master.

5. Test what happens when nodes or services disappear.
With a three-node cluster like the one you have configured, things will not be very resilient. You can still test the behavior of the primary Master or a RegionServer by killing the associated processes and watching the logs.

2.5. Where to go next

The next chapter, configuration, gives more information about the different HBase run modes, system requirements for running HBase, and critical configuration areas for setting up a distributed HBase cluster.

Apache HBase Configuration

This chapter expands upon the Getting Started chapter to further explain configuration of Apache HBase. Please read this chapter carefully, especially the Basic Prerequisites, to ensure that your HBase testing and deployment goes smoothly, and prevent data loss.
Familiarize yourself with Support and Testing Expectations as well.

Chapter 3. Configuration Files

Apache HBase uses the same configuration system as Apache Hadoop. All configuration files are located in the conf/ directory, which needs to be kept in sync for each node on your cluster.

HBase Configuration File Descriptions

backup-masters
Not present by default. A plain-text file which lists hosts on which the Master should start a backup Master process, one host per line.

hadoop-metrics2-hbase.properties
Used to connect HBase to Hadoop's Metrics2 framework. See the Hadoop Wiki entry for more information on Metrics2. Contains only commented-out examples by default.

hbase-env.cmd and hbase-env.sh
Scripts for Windows and Linux / Unix environments to set up the working environment for HBase, including the location of Java, Java options, and other environment variables. The files contain many commented-out examples to provide guidance.

hbase-policy.xml
The default policy configuration file used by RPC servers to make authorization decisions on client requests. Only used if HBase security is enabled.

hbase-site.xml
The main HBase configuration file. This file specifies configuration options which override HBase's default configuration. You can view (but do not edit) the default configuration file at docs/hbase-default.xml. You can also view the entire effective configuration for your cluster (defaults and overrides) in the HBase Configuration tab of the HBase Web UI.

log4j.properties
Configuration file for HBase logging via log4j.

regionservers
A plain-text file containing a list of hosts which should run a RegionServer in your HBase cluster. By default this file contains the single entry localhost. It should contain a list of hostnames or IP addresses, one per line, and should only contain localhost if each node in your cluster will run a RegionServer on its localhost interface.
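As a rough illustration of the one-host-per-line format used by regionservers and backup-masters, the following shell sketch counts the entries in such a file and flags a localhost entry that appears alongside other hosts. The function name check_hosts_file and the warning rule are assumptions made for this example; HBase does not ship such a check, and a mixed list may be intentional in some setups.

```shell
# Sketch only: check_hosts_file is a hypothetical helper, not an HBase command.
# It reads a one-host-per-line file such as conf/regionservers.
check_hosts_file() {
  file=$1
  total=0
  has_localhost=0
  # The "|| [ -n ... ]" part also handles a last line without a newline.
  while IFS= read -r host || [ -n "$host" ]; do
    # Skip blank lines so they are not counted as hosts.
    if [ -z "$host" ]; then
      continue
    fi
    total=$((total + 1))
    if [ "$host" = "localhost" ]; then
      has_localhost=1
    fi
  done < "$file"
  # Heuristic (an assumption of this sketch): localhost mixed with other
  # hostnames usually means the default entry was not removed.
  if [ "$has_localhost" -eq 1 ] && [ "$total" -gt 1 ]; then
    echo "warning: localhost is listed with $((total - 1)) other host(s)"
    return 1
  fi
  echo "ok: $total host(s)"
}
```

Calling check_hosts_file conf/regionservers from the HBase install directory would print a one-line summary and return a non-zero status when the heuristic fires.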
Checking XML Validity
When you edit XML, it is a good idea to use an XML-aware editor to be sure that your syntax is correct and your XML is well-formed. You can also use the xmllint utility to check that your XML is well-formed. By default, xmllint re-flows and prints the XML to standard output. To check for well-formedness and only print output if errors exist, use the command xmllint -noout filename.xml.

Keep Configuration In Sync Across the Cluster
When running in distributed mode, after you make an edit to an HBase configuration, make sure you copy the contents of the conf/ directory to all nodes of the cluster. HBase will not do this for you. Use rsync, scp, or another secure mechanism for copying the configuration files to your nodes. For most configurations, a restart is needed for servers to pick up changes. Dynamic configuration is an exception to this, to be described later below.
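The two notes above can be combined into a single step: check the XML files, then copy conf/ to every node. The sketch below is one possible way to do that, not a script that ships with HBase. The function name sync_conf, the DRY_RUN switch, and the remote install path passed as the third argument are all assumptions made for illustration. It relies on password-less SSH being configured, and it skips the XML check when xmllint is not installed.

```shell
# Sketch only: sync_conf and DRY_RUN are hypothetical, not part of HBase.
# Usage: sync_conf <local conf dir> <hosts file, one per line> <remote HBase dir>
sync_conf() {
  conf_dir=$1
  hosts_file=$2
  dest_dir=$3
  # Refuse to distribute malformed XML when xmllint is available.
  if command -v xmllint >/dev/null 2>&1; then
    for f in "$conf_dir"/*.xml; do
      # An unmatched glob stays literal, so skip names that do not exist.
      if [ ! -e "$f" ]; then
        continue
      fi
      xmllint --noout "$f" || return 1
    done
  fi
  while IFS= read -r host || [ -n "$host" ]; do
    if [ -z "$host" ]; then
      continue
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
      # Print the command instead of running it.
      echo "rsync -az $conf_dir/ $host:$dest_dir/conf/"
    else
      # Redirect stdin so ssh (started by rsync) cannot consume the host list.
      rsync -az "$conf_dir/" "$host:$dest_dir/conf/" < /dev/null || return 1
    fi
  done < "$hosts_file"
}
```

Setting DRY_RUN=1 before calling sync_conf conf conf/regionservers /opt/hbase (the path is a placeholder) prints one rsync command per host, which makes it easy to review the plan before copying anything. Remember that most changes still require a restart to take effect.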
