Titan HBase Tutorial

HalfoedGibbs, United Kingdom, Professional
Published Date: 02-08-2017
Chapter 6: Graph-based Storage

Processing with Apache Spark, and especially GraphX, provides in-memory, cluster-based, real-time processing for graphs. However, Apache Spark does not provide storage; the graph-based data must come from somewhere, and after processing it will probably need to be stored. In this chapter, I will examine graph-based storage using the Titan graph database as an example. This chapter will cover the following topics:

• An overview of Titan
• An overview of TinkerPop
• Installing Titan
• Using Titan with HBase
• Using Titan with Cassandra
• Using Titan with Spark

The young age of this field of processing means that the storage integration between Apache Spark and the graph-based storage system Titan is not yet mature. In the previous chapter, the Neo4j Mazerunner architecture was examined, which showed how Spark-based transactions could be replicated to Neo4j. This chapter deals with Titan not because of the functionality that it shows today, but because of the future promise that it offers for graph-based storage when used with Apache Spark.

Titan

Titan is a graph database that was developed by Aurelius (http://thinkaurelius.com/). The application source and binaries can be downloaded from GitHub (http://thinkaurelius.github.io/titan/), and this location also contains the Titan documentation. Titan has been released as an open source application under an Apache 2 license. At the time of writing this book, Aurelius has been acquired by DataStax, although Titan releases should go ahead.

Titan offers a number of storage options, but I will concentrate on only two: HBase, the Hadoop NoSQL database, and Cassandra, the non-Hadoop NoSQL database. Using these underlying storage mechanisms, Titan is able to provide graph-based storage in the big data range.
The TinkerPop3-based Titan release 0.9.0-M2 was released in June 2015, and will enable greater integration with Apache Spark (TinkerPop will be explained in the next section). It is this release that I will use in this chapter. Titan now uses TinkerPop for graph manipulation. This Titan release is an experimental development release, but future releases should hopefully consolidate Titan functionality.

This chapter concentrates on the Titan database rather than an alternative graph database, such as Neo4j, because Titan can use Hadoop-based storage. Titan also offers the future promise of integration with Apache Spark for big data scale, in-memory, graph-based processing. The following diagram shows the architecture being discussed in this chapter. The dotted line shows direct Spark database access, whereas the solid lines represent Spark access to the data through Titan classes.

[Figure: architecture diagram showing Apache Spark, direct DB access, Titan, HBase, Cassandra, Oracle Berkeley DB, ZooKeeper, and HDFS]

The Spark interface doesn't officially exist yet (it is only available in the M2 development release); it is included in the diagram for reference. Although Titan offers the option of using Oracle Berkeley DB for storage, it will not be covered in this chapter. I will initially examine the Titan-to-HBase and Titan-to-Cassandra architectures, and consider the Apache Spark integration later. Distributed HBase also requires ZooKeeper. Given that I am using an existing CDH5 cluster, HBase and ZooKeeper are already installed.

TinkerPop

TinkerPop, at version 3 as of July 2015, is an Apache incubator project, and can be found at http://tinkerpop.incubator.apache.org/. Graph databases (like Titan) and graph analytic systems (like Giraph) can use it as a subsystem for graph processing rather than creating their own graph processing modules.
The previous figure (borrowed from the TinkerPop website) shows the TinkerPop architecture. The blue layer shows the Core TinkerPop API, which offers the graph processing API for graph, vertex, and edge processing. The Vendor API boxes show the APIs that the vendors will implement to integrate their systems. The diagram shows that there are two possible APIs: one for the OLTP database systems, and another for the OLAP analytics systems. The diagram also shows that the Gremlin language is used to create and manage graphs for TinkerPop, and so for Titan. Finally, the Gremlin server sits at the top of the architecture, and allows integration with monitoring systems like Ganglia.

Installing Titan

As Titan is required throughout this chapter, I will install it now, and show how it can be acquired, installed, and configured. I have downloaded the latest prebuilt version (0.9.0-M2) of Titan from s3.thinkaurelius.com/downloads/titan/titan-0.9.0-M2-hadoop1.zip. I have downloaded the zipped release to a temporary directory, as shown next. Carry out the following steps to ensure that Titan is installed on each node in the cluster:

    [hadoop@hc2nn tmp]$ ls -lh titan-0.9.0-M2-hadoop1.zip
    -rw-r--r-- 1 hadoop hadoop 153M Jul 22 15:13 titan-0.9.0-M2-hadoop1.zip

Using the Linux unzip command, unpack the zipped Titan release file:

    [hadoop@hc2nn tmp]$ unzip titan-0.9.0-M2-hadoop1.zip
    [hadoop@hc2nn tmp]$ ls -l
    total 155752
    drwxr-xr-x 10 hadoop hadoop      4096 Jun  9 00:56 titan-0.9.0-M2-hadoop1
    -rw-r--r--  1 hadoop hadoop 159482381 Jul 22 15:13 titan-0.9.0-M2-hadoop1.zip

Now, use the Linux su (switch user) command to change to the root account, and move the install to the /usr/local/ location.
Change the file and group membership of the install to the hadoop user, and create a symbolic link called titan so that the current Titan release can be referred to by the simplified path /usr/local/titan:

    [hadoop@hc2nn ~]$ su -
    [root@hc2nn ~]# cd /home/hadoop/tmp
    [root@hc2nn titan]# mv titan-0.9.0-M2-hadoop1 /usr/local
    [root@hc2nn titan]# cd /usr/local
    [root@hc2nn local]# chown -R hadoop:hadoop titan-0.9.0-M2-hadoop1
    [root@hc2nn local]# ln -s titan-0.9.0-M2-hadoop1 titan
    [root@hc2nn local]# ls -ld *titan*
    lrwxrwxrwx  1 root   root     19 Mar 13 14:10 titan -> titan-0.9.0-M2-hadoop1
    drwxr-xr-x 10 hadoop hadoop 4096 Feb 14 13:30 titan-0.9.0-M2-hadoop1

Titan is now available for use through the Titan Gremlin shell, which will be demonstrated later. This version of Titan needs Java 8; make sure that you have it installed.

Titan with HBase

As the previous diagram shows, HBase depends upon ZooKeeper. Given that I have a working ZooKeeper quorum on my CDH5 cluster (running on the hc2r1m2, hc2r1m3, and hc2r1m4 nodes), I only need to ensure that HBase is installed and working on my Hadoop cluster.

The HBase cluster

I will install a distributed version of HBase using the Cloudera CDH cluster manager. Using the manager console, it is a simple task to install HBase. The only decision required is where to locate the HBase servers on the cluster. The following figure shows the View By Host form from the CDH HBase installation. The HBase components are shown to the right as Added Roles.

I have chosen to add the HBase region servers (RS) to the hc2r1m2, hc2r1m3, and hc2r1m4 nodes. I have installed the HBase master (M), the HBase REST server (HBREST), and the HBase Thrift server (HBTS) on the hc2r1m1 host.

I have manually installed and configured many Hadoop-based components in the past, and I find that this simple manager-based installation and configuration of components is both quick and reliable. It saves me time so that I can concentrate on other systems, such as Titan.
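The Java 8 requirement noted above for this Titan release is easy to check before going further. The following shell sketch is illustrative only: java_major and check_java_for_titan are hypothetical helper names, not part of Titan, and the sketch assumes the version string has already been extracted (for example, the quoted value that java -version prints).

```shell
#!/bin/sh
# Sketch: decide whether a Java version string meets the Java 8 requirement.
# java_major and check_java_for_titan are hypothetical helper names.

# Print the major version for strings such as "1.8.0_51" (legacy) or "11.0.2".
java_major() {
  jm_v="$1"
  case "$jm_v" in
    1.*) jm_v="${jm_v#1.}" ;;     # legacy scheme: 1.8.0_51 -> 8.0_51
  esac
  echo "${jm_v%%[._-]*}"          # keep what precedes the first '.', '_' or '-'
}

# Return status 0 only when the major version is 8 or later.
check_java_for_titan() {
  cj_major="$(java_major "$1")"
  if [ "$cj_major" -ge 8 ]; then
    echo "OK: Java $cj_major"
  else
    echo "Too old: Java $cj_major (Java 8 is required)"
    return 1
  fi
}

# Demonstration with a literal version string.
check_java_for_titan "1.8.0_51"
```

Because the check works on a plain string, it can be tried without Java installed; on a cluster node the argument would come from the local java -version output instead of a literal.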
Once HBase is installed, and has been started from the CDH manager console, it needs to be checked to ensure that it is working. I will do this using the HBase shell command shown here:

    [hadoop@hc2r1m2 ~]$ hbase shell
    Version 0.98.6-cdh5.3.2, rUnknown, Tue Feb 24 12:56:59 PST 2015
    hbase(main):001:0>

As you can see from the previous commands, I run the HBase shell as the Linux user hadoop. HBase version 0.98.6 has been installed; this version number will become important later when we start using Titan:

    hbase(main):001:0> create 'table2', 'cf1'
    hbase(main):002:0> put 'table2', 'row1', 'cf1:1', 'value1'
    hbase(main):003:0> put 'table2', 'row2', 'cf1:1', 'value2'

I have created a simple table called table2 with a column family of cf1. I have then added two rows with two different values. This table has been created from the hc2r1m2 node, and will now be checked from an alternate node called hc2r1m4 in the HBase cluster:

    [hadoop@hc2r1m4 ~]$ hbase shell
    hbase(main):001:0> scan 'table2'
    ROW     COLUMN+CELL
    row1    column=cf1:1, timestamp=1437968514021, value=value1
    row2    column=cf1:1, timestamp=1437968520664, value=value2
    2 row(s) in 0.3870 seconds

As you can see, the two data rows are visible in table2 from a different host, so HBase is installed and working. It is now time to try to create a graph in Titan using HBase and the Titan Gremlin shell.

The Gremlin HBase script

I have checked my Java version to make sure that I am on version 8, otherwise Titan 0.9.0-M2 will not work:

    [hadoop@hc2r1m2 ~]$ java -version
    openjdk version "1.8.0_51"

If you do not set your Java version correctly, you will get errors like this, which don't seem to be meaningful until you Google them:

    Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/tinkerpop/gremlin/groovy/plugin/RemoteAcceptor : Unsupported major.minor version 52.0

The interactive Titan Gremlin shell can be found within the bin directory of the Titan install, as shown here.
Once started, it offers a Gremlin prompt:

    [hadoop@hc2r1m2 bin]$ pwd
    /usr/local/titan/
    [hadoop@hc2r1m2 titan]$ bin/gremlin.sh
    gremlin>

The following script will be entered using the Gremlin shell. The first section of the script defines the configuration in terms of the storage (HBase), the ZooKeeper servers used, the ZooKeeper port number, and the HBase table name that is to be used:

    hBaseConf = new BaseConfiguration();
    hBaseConf.setProperty("storage.backend","hbase");
    hBaseConf.setProperty("storage.hostname","hc2r1m2,hc2r1m3,hc2r1m4");
    hBaseConf.setProperty("storage.hbase.ext.hbase.zookeeper.property.clientPort","2181")
    hBaseConf.setProperty("storage.hbase.table","titan")
    titanGraph = TitanFactory.open(hBaseConf);

The next section uses the management system to define the generic vertex properties name and age for the graph to be created, and builds a composite index on each. It then commits the management system changes:

    manageSys = titanGraph.openManagement();
    nameProp = manageSys.makePropertyKey('name').dataType(String.class).make();
    ageProp = manageSys.makePropertyKey('age').dataType(String.class).make();
    manageSys.buildIndex('nameIdx',Vertex.class).addKey(nameProp).buildCompositeIndex();
    manageSys.buildIndex('ageIdx',Vertex.class).addKey(ageProp).buildCompositeIndex();
    manageSys.commit();

Now, six vertices are added to the graph. Each one is given a numeric label to represent its identity.
Each vertex is given an age and name value:

    v1=titanGraph.addVertex(label, '1');
    v1.property('name', 'Mike');
    v1.property('age', '48');
    v2=titanGraph.addVertex(label, '2');
    v2.property('name', 'Sarah');
    v2.property('age', '45');
    v3=titanGraph.addVertex(label, '3');
    v3.property('name', 'John');
    v3.property('age', '25');
    v4=titanGraph.addVertex(label, '4');
    v4.property('name', 'Jim');
    v4.property('age', '53');
    v5=titanGraph.addVertex(label, '5');
    v5.property('name', 'Kate');
    v5.property('age', '22');
    v6=titanGraph.addVertex(label, '6');
    v6.property('name', 'Flo');
    v6.property('age', '52');

Finally, the graph edges are added to join the vertices together. Each edge has a relationship value. Once created, the changes are committed to store them to Titan, and therefore HBase:

    v6.addEdge("Sister", v1)
    v1.addEdge("Husband", v2)
    v2.addEdge("Wife", v1)
    v5.addEdge("Daughter", v1)
    v5.addEdge("Daughter", v2)
    v3.addEdge("Son", v1)
    v3.addEdge("Son", v2)
    v4.addEdge("Friend", v1)
    v1.addEdge("Father", v5)
    v1.addEdge("Father", v3)
    v2.addEdge("Mother", v5)
    v2.addEdge("Mother", v3)
    titanGraph.tx().commit();

This results in a simple person-based graph, shown in the following figure, which was also used in the previous chapter:

[Figure: person graph with vertices 1 Mike, 2 Sarah, 3 John, 4 Jim, 5 Kate, and 6 Flo, joined by Husband, Wife, Sister, Son, Daughter, Mother, and Friend edges]

This graph can then be tested in Titan via the Gremlin shell using a similar script to the previous one. Just enter the following script at the gremlin prompt, as was shown previously. It uses the same initial six lines to create the titanGraph configuration, but it then creates a graph traversal variable g:

    hBaseConf = new BaseConfiguration();
    hBaseConf.setProperty("storage.backend","hbase");
    hBaseConf.setProperty("storage.hostname","hc2r1m2,hc2r1m3,hc2r1m4");
    hBaseConf.setProperty("storage.hbase.ext.hbase.zookeeper.property.clientPort","2181")
    hBaseConf.setProperty("storage.hbase.table","titan")
    titanGraph = TitanFactory.open(hBaseConf);
    gremlin> g = titanGraph.traversal()

Now, the graph traversal variable can be used to check the graph contents. Using the valueMap option, it is possible to search for the graph nodes called Mike and Flo. They have been successfully found here:

    gremlin> g.V().has('name','Mike').valueMap();
    ==>[name:[Mike], age:[48]]
    gremlin> g.V().has('name','Flo').valueMap();
    ==>[name:[Flo], age:[52]]

So, the graph has been created and checked in Titan using the Gremlin shell, but we can also check the storage in HBase using the HBase shell, and check the contents of the Titan table. The following scan shows that the table exists, and contains 72 rows of data for this small graph:

    [hadoop@hc2r1m2 ~]$ hbase shell
    hbase(main):002:0> scan 'titan'
    72 row(s) in 0.8310 seconds

Now that the graph has been created, and I am confident that it has been stored in HBase, I will attempt to access the data using Apache Spark. I have already started Apache Spark on all the nodes, as shown in the previous chapter. This will be direct access from Apache Spark 1.3 to the HBase storage. I won't at this stage be attempting to use Titan to interpret the HBase stored graph.

Spark on HBase

In order to access HBase from Spark, I will be using Cloudera's SparkOnHBase module, which can be downloaded from https://github.com/cloudera-labs/SparkOnHBase. The downloaded file is in a zipped format, and needs to be unzipped.
I have done this using the Linux unzip command in a temporary directory:

    [hadoop@hc2r1m2 tmp]$ ls -l SparkOnHBase-cdh5-0.0.2.zip
    -rw-r--r-- 1 hadoop hadoop 370439 Jul 27 13:39 SparkOnHBase-cdh5-0.0.2.zip
    [hadoop@hc2r1m2 tmp]$ unzip SparkOnHBase-cdh5-0.0.2.zip
    [hadoop@hc2r1m2 tmp]$ ls
    SparkOnHBase-cdh5-0.0.2  SparkOnHBase-cdh5-0.0.2.zip

I have then moved into the unpacked module, and used the Maven command mvn to build the JAR file:

    [hadoop@hc2r1m2 tmp]$ cd SparkOnHBase-cdh5-0.0.2
    [hadoop@hc2r1m2 SparkOnHBase-cdh5-0.0.2]$ mvn clean package
    [INFO] -----------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] -----------------------------------------------------------
    [INFO] Total time: 13:17 min
    [INFO] Finished at: 2015-07-27T14:05:55+12:00
    [INFO] Final Memory: 50M/191M
    [INFO] -----------------------------------------------------------

Finally, I moved the built component to my development area to keep things tidy, so that I could use this module in my Spark HBase code:

    [hadoop@hc2r1m2 SparkOnHBase-cdh5-0.0.2]$ cd ..
    [hadoop@hc2r1m2 tmp]$ mv SparkOnHBase-cdh5-0.0.2 /home/hadoop/spark

Accessing HBase with Spark

As in previous chapters, I will be using SBT and Scala to compile my Spark-based scripts into applications. Then, I will use spark-submit to run these applications on the Spark cluster. My SBT configuration file looks like this.
It contains the Hadoop, Spark, and HBase libraries:

    [hadoop@hc2r1m2 titan_hbase]$ pwd
    /home/hadoop/spark/titan_hbase
    [hadoop@hc2r1m2 titan_hbase]$ more titan.sbt
    name := "T i t a n"
    version := "1.0"
    scalaVersion := "2.10.4"
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"
    libraryDependencies += "com.cloudera.spark" % "hbase" % "5-0.0.2" from "file:///home/hadoop/spark/SparkOnHBase-cdh5-0.0.2/target/SparkHBase.jar"
    libraryDependencies += "org.apache.hadoop.hbase" % "client" % "5-0.0.2" from "file:///home/hadoop/spark/SparkOnHBase-cdh5-0.0.2/target/SparkHBase.jar"
    resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

Notice that I am running this application on the hc2r1m2 server, using the Linux hadoop account, under the directory /home/hadoop/spark/titan_hbase. I have created a Bash shell script called run_titan.bash.hbase, which allows me to run any application that is created and compiled under the src/main/scala subdirectory:

    [hadoop@hc2r1m2 titan_hbase]$ pwd ; more run_titan.bash.hbase
    /home/hadoop/spark/titan_hbase
    #!/bin/bash
    SPARK_HOME=/usr/local/spark
    SPARK_BIN=$SPARK_HOME/bin
    SPARK_SBIN=$SPARK_HOME/sbin
    JAR_PATH=/home/hadoop/spark/titan_hbase/target/scala-2.10/t-i-t-a-n_2.10-1.0.jar
    CLASS_VAL=$1
    CDH_JAR_HOME=/opt/cloudera/parcels/CDH/lib/hbase/
    CONN_HOME=/home/hadoop/spark/SparkOnHBase-cdh5-0.0.2/target/
    HBASE_JAR1=$CDH_JAR_HOME/hbase-common-0.98.6-cdh5.3.3.jar
    HBASE_JAR2=$CONN_HOME/SparkHBase.jar
    cd $SPARK_BIN
    ./spark-submit \
      --jars $HBASE_JAR1 \
      --jars $HBASE_JAR2 \
      --class $CLASS_VAL \
      --master spark://hc2nn.semtech-solutions.co.nz:7077 \
      --executor-memory 100M \
      --total-executor-cores 50 \
      $JAR_PATH

The Bash script is held within the same titan_hbase directory, and takes a single parameter: the application class name.
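Before running a wrapper script like this against a live cluster, it can help to print the command it would issue. The following is a minimal dry-run sketch, not the author's script: build_submit_cmd is a hypothetical helper name, the paths and master URL are copied from the listing above, and the two JARs are joined into a single --jars value because spark-submit documents that option as a comma-separated list.

```shell
#!/bin/sh
# Dry-run sketch: print the spark-submit command instead of executing it.
# build_submit_cmd is a hypothetical helper; paths are taken from the text.
SPARK_HOME=/usr/local/spark
CDH_JAR_HOME=/opt/cloudera/parcels/CDH/lib/hbase
CONN_HOME=/home/hadoop/spark/SparkOnHBase-cdh5-0.0.2/target
JAR_PATH=/home/hadoop/spark/titan_hbase/target/scala-2.10/t-i-t-a-n_2.10-1.0.jar

build_submit_cmd() {
  # $1 = fully qualified application class name
  if [ -z "$1" ]; then
    echo "usage: build_submit_cmd <class-name>" >&2
    return 1
  fi
  echo "$SPARK_HOME/bin/spark-submit" \
    --jars "$CDH_JAR_HOME/hbase-common-0.98.6-cdh5.3.3.jar,$CONN_HOME/SparkHBase.jar" \
    --class "$1" \
    --master spark://hc2nn.semtech-solutions.co.nz:7077 \
    --executor-memory 100M \
    --total-executor-cores 50 \
    "$JAR_PATH"
}

# Print the command for the application class used in this chapter.
build_submit_cmd nz.co.semtechsolutions.spark3_hbase2
```

The sketch also rejects a missing class name up front, which the original wrapper would pass through to spark-submit as an empty --class value.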
The parameters to the spark-submit call are the same as in the previous examples. In this case, there is only a single script under src/main/scala, called spark3_hbase2.scala:

    [hadoop@hc2r1m2 scala]$ pwd
    /home/hadoop/spark/titan_hbase/src/main/scala
    [hadoop@hc2r1m2 scala]$ ls
    spark3_hbase2.scala

The Scala script starts by defining the package name to which the application class will belong. It then imports the Spark, Hadoop, and HBase classes:

    package nz.co.semtechsolutions

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._
    import org.apache.spark.SparkConf
    import org.apache.hadoop.hbase._
    import org.apache.hadoop.fs.Path
    import com.cloudera.spark.hbase.HBaseContext
    import org.apache.hadoop.hbase.client.Scan

The application class name is defined, as well as the main method. A configuration object is then created in terms of the application name and the Spark URL. Finally, a Spark context is created from the configuration:

    object spark3_hbase2
    {
      def main(args: Array[String]) {
        val sparkMaster = "spark://hc2nn.semtech-solutions.co.nz:7077"
        val appName = "Spark HBase 2"
        val conf = new SparkConf()
        conf.setMaster(sparkMaster)
        conf.setAppName(appName)
        val sparkCxt = new SparkContext(conf)

Next, an HBase configuration object is created, and a Cloudera CDH hbase-site.xml file-based resource is added:

        val jobConf = HBaseConfiguration.create()
        val hbasePath="/opt/cloudera/parcels/CDH/etc/hbase/conf.dist/"
        jobConf.addResource(new Path(hbasePath+"hbase-site.xml"))

An HBase context object is created using the Spark context and the HBase configuration object. The scan and cache configurations are also defined:

        val hbaseContext = new HBaseContext(sparkCxt, jobConf)
        var scan = new Scan()
        scan.setCaching(100)

Finally, the data from the HBase titan table is retrieved using the hbaseRDD HBase context method and the scan object.
The RDD count is printed, and then the script closes:

        var hbaseRdd = hbaseContext.hbaseRDD("titan", scan)
        println( "Rows in Titan hbase table : " + hbaseRdd.count() )
        println( " Script Finished " )
      } // end main
    } // end spark3_hbase2

I am only printing the count of the data retrieved because Titan compresses the data in GZ format, so it would make little sense to try to manipulate it directly.

Using the run_titan.bash.hbase script, the Spark application called spark3_hbase2 is run. It outputs an RDD row count of 72, matching the Titan table row count that was previously found. This proves that Apache Spark has been able to access the raw Titan HBase stored graph data, but Spark has not yet used the Titan libraries to access the Titan data as a graph. This will be discussed later. Here is the run and its output:

    [hadoop@hc2r1m2 titan_hbase]$ ./run_titan.bash.hbase nz.co.semtechsolutions.spark3_hbase2
    Rows in Titan hbase table : 72
    Script Finished

Titan with Cassandra

In this section, the Cassandra NoSQL database will be used as a storage mechanism for Titan. Although it does not use Hadoop, it is a large-scale, cluster-based database in its own right, and can scale to very large cluster sizes. This section will follow the same process as for HBase: a graph will be created and stored in Cassandra using the Titan Gremlin shell. It will then be checked using Gremlin, and the stored data will be checked in Cassandra. The raw Titan Cassandra graph-based data will then be accessed from Spark. The first step, then, will be to install Cassandra on each node in the cluster.

Installing Cassandra

Create a repo file that will allow the community version of DataStax Cassandra to be installed using the Linux yum command. Root access will be required for this, so the su command has been used to switch the user to root.
Carry out these steps on all the nodes:

    [hadoop@hc2nn lib]$ su -
    [root@hc2nn ~]# vi /etc/yum.repos.d/datastax.repo
    [datastax]
    name= DataStax Repo for Apache Cassandra
    baseurl=http://rpm.datastax.com/community
    enabled=1
    gpgcheck=0

Now, install Cassandra on each node in the cluster using the Linux yum command:

    [root@hc2nn ~]# yum -y install dsc20-2.0.13-1 cassandra20-2.0.13-1

Set up the Cassandra configuration under /etc/cassandra/conf by altering the cassandra.yaml file:

    [root@hc2nn ~]# cd /etc/cassandra/conf ; vi cassandra.yaml

I have made the following changes to specify my cluster name, the server seed IP addresses, the RPC address, and the snitch value. Seed nodes are the nodes that the other nodes will try to connect to first. In this case, the NameNode (103) and node2 (108) have been used as seeds. The snitch method manages network topology and routing:

    cluster_name: 'Cluster1'
    seeds: "192.168.1.103,192.168.1.108"
    listen_address:
    rpc_address: 0.0.0.0
    endpoint_snitch: GossipingPropertyFileSnitch

Cassandra can now be started on each node as root using the service command:

    [root@hc2nn ~]# service cassandra start

Log files can be found under /var/log/cassandra, and the data is stored under /var/lib/cassandra.
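Since the same cassandra.yaml changes have to be made on every node, they can be applied with a script rather than by hand in vi. The following is a minimal sketch only. It works on a cut-down stand-in for cassandra.yaml written to a temporary directory; the stand-in's starting values and its nested seed_provider layout are assumptions, not a copy of the installed file. It then uses sed to set the same five values listed above.

```shell
#!/bin/sh
# Sketch: apply the cluster settings to a sample cassandra.yaml with sed.
# The sample file below is an assumed, cut-down stand-in for the real one.
work="$(mktemp -d)"
cat > "$work/cassandra.yaml" <<'EOF'
cluster_name: 'Test Cluster'
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "127.0.0.1"
listen_address: localhost
rpc_address: localhost
endpoint_snitch: SimpleSnitch
EOF

# Rewrite each setting; the result goes to a new file so the input is kept.
sed \
  -e "s/^cluster_name:.*/cluster_name: 'Cluster1'/" \
  -e 's/^\( *- seeds:\).*/\1 "192.168.1.103,192.168.1.108"/' \
  -e 's/^listen_address:.*/listen_address:/' \
  -e 's/^rpc_address:.*/rpc_address: 0.0.0.0/' \
  -e 's/^endpoint_snitch:.*/endpoint_snitch: GossipingPropertyFileSnitch/' \
  "$work/cassandra.yaml" > "$work/cassandra.yaml.new"

# Show the edited settings for review.
cat "$work/cassandra.yaml.new"
```

Writing to a separate .new file leaves the original untouched, so the two can be compared with diff before anything is copied into /etc/cassandra/conf.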
The nodetool command can be used on any Cassandra node to check the status of the Cassandra cluster:

    [root@hc2nn cassandra]# nodetool status
    Datacenter: DC1
    ===============
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving
    --  Address        Load      Tokens  Owns (effective)  Host ID                               Rack
    UN  192.168.1.105  63.96 KB  256     37.2%             f230c5d7-ff6f-43e7-821d-c7ae2b5141d3  RAC1
    UN  192.168.1.110  45.86 KB  256     39.9%             fc1d80fe-6c2d-467d-9034-96a1f203c20d  RAC1
    UN  192.168.1.109  45.9 KB   256     40.9%             daadf2ee-f8c2-4177-ae72-683e39fd1ea0  RAC1
    UN  192.168.1.108  50.44 KB  256     40.5%             b9d796c0-5893-46bc-8e3c-187a524b1f5a  RAC1
    UN  192.168.1.103  70.68 KB  256     41.5%             53c2eebd-a66c-4a65-b026-96e232846243  RAC1

The Cassandra CQL shell command called cqlsh can be used to access the cluster, and create objects. The shell is invoked next, and it shows that Cassandra version 2.0.13 is installed:

    [hadoop@hc2nn ~]$ cqlsh
    Connected to Cluster1 at localhost:9160.
    [cqlsh 4.1.1 | Cassandra 2.0.13 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
    Use HELP for help.
    cqlsh>

Next, a key space called keyspace1 is created and used via the CQL shell, and the existing key spaces are listed:

    cqlsh> CREATE KEYSPACE keyspace1 WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
    cqlsh> USE keyspace1;
    cqlsh:keyspace1> SELECT * FROM system.schema_keyspaces;

     keyspace_name | durable_writes | strategy_class                              | strategy_options
    ---------------+----------------+---------------------------------------------+----------------------------
         keyspace1 |           True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"1"}
            system |           True |  org.apache.cassandra.locator.LocalStrategy | {}
     system_traces |           True | org.apache.cassandra.locator.SimpleStrategy | {"replication_factor":"2"}

Since Cassandra is installed and working, it is now time to create a Titan graph using Cassandra for storage. This will be tackled in the next section using the Titan Gremlin shell. It will follow the same format as the previous HBase section.
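Before moving on, the "installed and working" check based on the nodetool status listing earlier in this section can be scripted, so that it is easy to repeat after a node restart. The following is a hedged sketch: sample_status and count_up_normal are hypothetical helper names, and the sample data is a shortened copy of the five-node listing from the text with the Host ID column left out.

```shell
#!/bin/sh
# Sketch: count Up/Normal nodes in `nodetool status` style output.
# sample_status and count_up_normal are hypothetical helper names.

# Shortened copy of the five-node listing from the text (Host ID omitted).
sample_status() {
cat <<'EOF'
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load      Tokens  Owns (effective)  Rack
UN  192.168.1.105  63.96 KB  256     37.2%             RAC1
UN  192.168.1.110  45.86 KB  256     39.9%             RAC1
UN  192.168.1.109  45.9 KB   256     40.9%             RAC1
UN  192.168.1.108  50.44 KB  256     40.5%             RAC1
UN  192.168.1.103  70.68 KB  256     41.5%             RAC1
EOF
}

# Node lines start with a two-letter code: U or D, then N, L, J or M.
count_up_normal() {
  awk '$1 ~ /^[UD][NLJM]$/ { total++; if ($1 == "UN") up++ }
       END { printf "%d of %d nodes Up/Normal\n", up, total }'
}

sample_status | count_up_normal
```

On a live node, the same filter could be fed from the real command, as in nodetool status | count_up_normal; any node reported as down or still joining would then lower the first number.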
The Gremlin Cassandra script

As with the previous Gremlin script, this Cassandra version creates the same simple graph. The difference with this script is in the configuration. The backend storage type is defined as Cassandra, and the hostnames are defined to be the Cassandra seed nodes. The key space and the port number are specified, and finally the graph is created:

    cassConf = new BaseConfiguration();
    cassConf.setProperty("storage.backend","cassandra");
    cassConf.setProperty("storage.hostname","hc2nn,hc2r1m2");
    cassConf.setProperty("storage.port","9160")
    cassConf.setProperty("storage.keyspace","titan")
    titanGraph = TitanFactory.open(cassConf);

From this point, the script is the same as the previous HBase example, so I will not repeat it. This script will be available in the download package as cassandra_create.bash. The same checks, using the previous configuration, can be carried out in the Gremlin shell to check the data. They return the same results as the previous checks, and so prove that the graph has been stored:

    gremlin> g = titanGraph.traversal()
    gremlin> g.V().has('name','Mike').valueMap();
    ==>[name:[Mike], age:[48]]
    gremlin> g.V().has('name','Flo').valueMap();
    ==>[name:[Flo], age:[52]]

Using the Cassandra CQL shell and the titan keyspace, it can be seen that a number of Titan tables have been created in Cassandra:

    [hadoop@hc2nn ~]$ cqlsh
    cqlsh> use titan;
    cqlsh:titan> describe tables;
    edgestore        graphindex        system_properties        systemlog  txlog
    edgestore_lock_  graphindex_lock_  system_properties_lock_  titan_ids

It can also be seen that the data exists in the edgestore table within Cassandra:

    cqlsh:titan> select * from edgestore;
     key                | column1            | value
    --------------------+--------------------+----------------------------------------------
     0x0000000000004815 | 0x02               | 0x00011ee0
     0x0000000000004815 | 0x10c0             | 0xa0727425536fee1ec0
     .......
     0x0000000000001005 | 0x10c8             | 0x00800512644c1b149004a0
     0x0000000000001005 | 0x30c9801009800c20 | 0x000101143c01023b0101696e6465782d706ff30200

This assures me that a Titan graph has been created in the Gremlin shell, and is stored in Cassandra. Now, I will try to access the data from Spark.

The Spark Cassandra connector

In order to access Cassandra from Spark, I will download the DataStax Spark Cassandra connector and driver libraries. Information and version matching on this can be found at http://mvnrepository.com/artifact/com.datastax.spark/.

The version compatibility section of this URL shows the Cassandra connector version that should be used with each Cassandra and Spark version. The version table shows that the connector version should match the Spark version that is being used. The libraries themselves can be sourced at http://mvnrepository.com/artifact/com.datastax.spark/spark-cassandra-connector_2.10.

By following the previous URL, and selecting a library version, you will see a compile dependencies table associated with the library, which lists all of the other dependent libraries, and their versions, that you will need. The following libraries are those that are needed for use with Spark 1.3.1.
Be careful to choose just (and all of) those library versions that are required:

    [hadoop@hc2r1m2 titan_cass]$ pwd ; ls *.jar
    /home/hadoop/spark/titan_cass
    spark-cassandra-connector_2.10-1.3.0-M1.jar
    cassandra-driver-core-2.1.5.jar
    cassandra-thrift-2.1.3.jar
    libthrift-0.9.2.jar
    cassandra-clientutil-2.1.3.jar
    guava-14.0.1.jar
    joda-time-2.3.jar
    joda-convert-1.2.jar

Accessing Cassandra with Spark

Now that I have the Cassandra connector library and all of its dependencies in place, I can begin to think about the Scala code required to connect to Cassandra. The first thing to do, given that I am using SBT as a development tool, is to set up the SBT build configuration file. Mine looks like this:

    [hadoop@hc2r1m2 titan_cass]$ pwd ; more titan.sbt
    /home/hadoop/spark/titan_cass
    name := "Spark Cass"
    version := "1.0"
    scalaVersion := "2.10.4"
    libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0"
    libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.1"
    libraryDependencies += "com.datastax.spark" % "spark-cassandra-connector" % "1.3.0-M1" from "file:///home/hadoop/spark/titan_cass/spark-cassandra-connector_2.10-1.3.0-M1.jar"
    libraryDependencies += "com.datastax.cassandra" % "cassandra-driver-core" % "2.1.5" from "file:///home/hadoop/spark/titan_cass/cassandra-driver-core-2.1.5.jar"
    libraryDependencies += "org.joda" % "time" % "2.3" from "file:///home/hadoop/spark/titan_cass/joda-time-2.3.jar"
    libraryDependencies += "org.apache.cassandra" % "thrift" % "2.1.3" from "file:///home/hadoop/spark/titan_cass/cassandra-thrift-2.1.3.jar"
    libraryDependencies += "com.google.common" % "collect" % "14.0.1" from "file:///home/hadoo
