Apache Spark GraphX tutorial

Chapter 5: Apache Spark GraphX

In this chapter, I want to examine the Apache Spark GraphX module, and graph processing in general. I also want to briefly examine graph-based storage by looking at the graph database called Neo4j. So, this chapter will cover the following topics:

• GraphX coding
• Mazerunner for Neo4j

The GraphX coding section, written in Scala, will provide a series of graph coding examples. The work carried out on the experimental Mazerunner product by Kenny Bastani, which I will also examine, ties the two topics together in one practical example. It provides a Docker-based prototype that replicates data between Apache Spark GraphX and Neo4j storage.

Before writing code in Scala to use the Spark GraphX module, I think it would be useful to provide an overview of what a graph actually is in terms of graph processing. The following section provides a brief introduction using a couple of simple graphs as examples.

Overview

A graph can be considered to be a data structure that consists of a group of vertices, and edges that connect them. The vertices or nodes in the graph can be objects, or perhaps people, and the edges are the relationships between them. The edges can be directional, meaning that the relationship operates from one node to the next; for instance, node A is the father of node B. In the following diagram, the circles represent the vertices or nodes (A to D), whereas the thick lines represent the edges, or relationships, between them (E1 to E6). Each node or edge may have properties, and these values are represented by the associated grey squares (P1 to P7).

So, if a graph represented a physical route map for route finding, then the edges might represent minor roads or motorways. The nodes would be motorway junctions, or road intersections. The node and edge properties might be the road type, speed limit, distance, cost, and grid locations.

There are many types of graph implementation, but some example applications are fraud modeling, financial currency transaction modeling, social modeling (as in friend-to-friend connections on Facebook), map processing, web processing, and page ranking.

[Figure: a generic graph with vertices A to D, edges E1 to E6, and properties P1 to P7; the directed edge E2 acts from node B to node C.]

The previous diagram shows a generic example of a graph with associated properties. It also shows that the edge relationships can be directional, that is, the E2 edge acts from node B to node C. The following example, however, uses family members, and the relationships between them, to create a graph. Note that there can be multiple edges between two nodes or vertices; for instance, the husband-and-wife relationships between Mike and Sarah. It is also possible for a node or an edge to carry multiple properties.

[Figure: a family graph with vertices 1 Mike, 2 Sarah, 3 John, 4 Jim, 5 Kate, and 6 Flo, joined by Husband, Wife, Sister, Son, Daughter, Mother, and Friend relationships.]

So, in the previous example, the Sister relationship acts from node 6, Flo, to node 1, Mike. These are simple graphs used to explain the structure of a graph and the nature of its elements. Real graph applications can reach extreme sizes, and require both distributed processing and distributed storage to enable them to be manipulated. Facebook is able to process graphs containing over 1 trillion edges using Apache Giraph (source: Avery Ching, Facebook). Giraph is an Apache Hadoop ecosystem tool for graph processing, which has historically based its processing on MapReduce, but now uses TinkerPop, which will be introduced in Chapter 6, Graph-based Storage. Although this book concentrates on Apache Spark, that number of edges provides a very impressive indicator of the size that a graph can reach. In the next section, I will examine the use of the Apache Spark GraphX module using Scala.
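Before that, it may help to see the family graph from the diagram written down as plain data. The following is a minimal illustrative sketch, not taken from the book's examples: each vertex becomes an (id, name, age) tuple and each directed edge becomes a (source id, destination id, relationship) tuple, which is the same shape as the CSV files used in the next section. Only a few of the relationships are listed here:

val people = Seq(
  (1L, "Mike", 48), (2L, "Sarah", 45), (3L, "John", 25),
  (4L, "Jim", 53),  (5L, "Kate", 22),  (6L, "Flo", 52)
)

val relationships = Seq(
  (6L, 1L, "Sister"),   // Flo is Mike's sister
  (1L, 2L, "Husband"),  // Mike is Sarah's husband
  (2L, 1L, "Wife"),     // Sarah is Mike's wife
  (4L, 1L, "Friend")    // Jim is Mike's friend
)

Reading the tuples this way makes the directional nature of the edges obvious: each relationship is stated from the source vertex towards the destination vertex.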
GraphX coding

This section will examine Apache Spark GraphX programming in Scala, using the family relationship graph data sample that was shown in the last section. This data will be stored on HDFS, and will be accessed as a list of vertices and edges. Although this data set is small, the graphs that you build in this way could be very large. I have used HDFS for storage because, if your graph scales to big data sizes, you will need some type of distributed and redundant storage. As this chapter shows by way of example, that could be HDFS. Using the Apache Spark SQL module, the storage could also be Apache Hive; see Chapter 4, Apache Spark SQL, for details.

Environment

I have used the hadoop Linux account on the server hc2nn to develop the Scala-based GraphX code. The structure for SBT compilation follows the same pattern as the previous examples, with the code tree existing in a subdirectory named graphx, where an sbt configuration file called graph.sbt resides:

[hadoop@hc2nn graphx]$ pwd
/home/hadoop/spark/graphx

[hadoop@hc2nn graphx]$ ls
src graph.sbt project target

The source code lives, as expected, under a subtree of this level called src/main/scala, and contains five code samples:

[hadoop@hc2nn scala]$ pwd
/home/hadoop/spark/graphx/src/main/scala

[hadoop@hc2nn scala]$ ls
graph1.scala graph2.scala graph3.scala graph4.scala graph5.scala

In each graph-based example, the Scala file uses the same code to load data from HDFS and to create a graph; each file then demonstrates a different facet of GraphX-based graph processing. As a different Spark module is being used in this chapter, the sbt configuration file graph.sbt has been changed to support this work:

[hadoop@hc2nn graphx]$ more graph.sbt

name := "Graph X"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"
libraryDependencies += "org.apache.spark" %% "spark-graphx" % "1.0.0"
// If using CDH, also add Cloudera repo
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

The contents of the graph.sbt file are shown previously, via the Linux more command. There are only two changes here to note from previous examples: the value of name has changed to represent the content and, more importantly, the Spark GraphX 1.0.0 library has been added as a library dependency.

Two data files have been placed on HDFS, under the /data/spark/graphx/ directory. They contain the data that will be used for this section in terms of the vertices and edges that make up a graph. As the Hadoop file system ls command shows next, the files are called graph1_edges.csv and graph1_vertex.csv:

[hadoop@hc2nn scala]$ hdfs dfs -ls /data/spark/graphx
Found 2 items
-rw-r--r--   3 hadoop supergroup  129 2015-03-01 13:52 /data/spark/graphx/graph1_edges.csv
-rw-r--r--   3 hadoop supergroup   59 2015-03-01 13:52 /data/spark/graphx/graph1_vertex.csv

The vertex file, shown next via a Hadoop file system cat command, contains just six lines, representing the graph used in the last section.
Each vertex represents a person, and has a vertex ID number, a name, and an age value:

[hadoop@hc2nn scala]$ hdfs dfs -cat /data/spark/graphx/graph1_vertex.csv

1,Mike,48
2,Sarah,45
3,John,25
4,Jim,53
5,Kate,22
6,Flo,52

The edge file contains a set of directed edge values in the form of source vertex ID, destination vertex ID, and relationship. So, record one forms a Sister relationship between Flo and Mike:

[hadoop@hc2nn scala]$ hdfs dfs -cat /data/spark/graphx/graph1_edges.csv

6,1,Sister
1,2,Husband
2,1,Wife
5,1,Daughter
5,2,Daughter
3,1,Son
3,2,Son
4,1,Friend
1,5,Father
1,3,Father
2,5,Mother
2,3,Mother

Having explained the sbt environment, and the HDFS-based data, we are now ready to examine some of the GraphX code samples. As in the previous examples, the code can be compiled and packaged as follows from the graphx subdirectory. This creates a JAR called graph-x_2.10-1.0.jar from which the example applications can be run:

[hadoop@hc2nn graphx]$ pwd
/home/hadoop/spark/graphx

[hadoop@hc2nn graphx]$ sbt package

Loading /usr/share/sbt/bin/sbt-launch-lib.bash
[info] Set current project to Graph X (in build file:/home/hadoop/spark/graphx/)
[info] Compiling 5 Scala sources to /home/hadoop/spark/graphx/target/scala-2.10/classes...
[info] Packaging /home/hadoop/spark/graphx/target/scala-2.10/graph-x_2.10-1.0.jar ...
[info] Done packaging.
[success] Total time: 30 s, completed Mar 3, 2015 5:27:10 PM

Creating a graph

This section will explain the generic Scala code, up to the point of creating a GraphX graph, from the HDFS-based data. This will save time, as the same code is reused in each example. Once this is explained, I will concentrate on the actual graph-based manipulation in each code example.

The generic code starts by importing the Spark context, GraphX, and RDD functionality for use in the Scala code:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

Then, an application is defined, which extends the App class; the application name changes, for each example, from graph1 to graph5. This application name will be used when running the application using spark-submit:

object graph1 extends App

The data files are defined in terms of the HDFS server and port, the path that they reside under in HDFS, and their file names. As already mentioned, there are two data files that contain the vertex and edge information:

val hdfsServer = "hdfs://hc2nn.semtech-solutions.co.nz:8020"
val hdfsPath   = "/data/spark/graphx/"
val vertexFile = hdfsServer + hdfsPath + "graph1_vertex.csv"
val edgeFile   = hdfsServer + hdfsPath + "graph1_edges.csv"

The Spark Master URL is defined, as is the application name, which will appear in the Spark user interface when the application runs. A new Spark configuration object is created, and the URL and name are assigned to it:

val sparkMaster = "spark://hc2nn.semtech-solutions.co.nz:7077"
val appName = "Graph 1"
val conf = new SparkConf()
conf.setMaster(sparkMaster)
conf.setAppName(appName)

A new Spark context is created using the configuration that was just defined:

val sparkCxt = new SparkContext(conf)

The vertex information from the HDFS-based file is then loaded into an RDD-based structure called vertices using the sparkCxt.textFile method. The data is stored as a Long VertexId, and strings to represent the person's name and age.
The data lines are split by commas, as this is CSV-based data:

val vertices: RDD[(VertexId, (String, String))] =
  sparkCxt.textFile(vertexFile).map { line =>
    val fields = line.split(",")
    ( fields(0).toLong, ( fields(1), fields(2) ) )
  }

Similarly, the HDFS-based edge data is loaded into an RDD-based data structure called edges. The CSV-based data is again split by comma values. The first two data values are converted into Long values, as they represent the source and destination vertex IDs. The final value, representing the relationship of the edge, is left as a string. Note that each record in the RDD structure edges is actually now an Edge record:

val edges: RDD[Edge[String]] =
  sparkCxt.textFile(edgeFile).map { line =>
    val fields = line.split(",")
    Edge(fields(0).toLong, fields(1).toLong, fields(2))
  }

A default value is defined in case a connection, or a vertex, is missing; then the graph is constructed from the RDD-based structures vertices and edges, and the default record:

val default = ("Unknown", "Missing")
val graph = Graph(vertices, edges, default)

This creates a GraphX-based structure called graph, which can now be used for each of the examples. Remember that, although these data samples are small, you can create extremely large graphs using this approach. Many of these algorithms are iterative applications, for instance, PageRank and triangle count, and as a result the programs will generate many iterative Spark jobs.

Example 1 – counting

The graph has been loaded, and we know the data volumes in the data files, but what about the data content in terms of vertices and edges in the actual graph itself? It is very simple to extract this information by using the vertices and edges count functions, as shown here:

println( "vertices : " + graph.vertices.count )
println( "edges    : " + graph.edges.count )

Running the graph1 example, using the example name and the JAR file created previously, will provide the count information. The master URL is supplied to connect to the Spark cluster, and some default parameters are supplied for the executor memory and the total executor cores:

spark-submit \
  --class graph1 \
  --master spark://hc2nn.semtech-solutions.co.nz:7077 \
  --executor-memory 700M \
  --total-executor-cores 100 \
  /home/hadoop/spark/graphx/target/scala-2.10/graph-x_2.10-1.0.jar

The Spark cluster job called graph1 provides the following output, which is as expected, and also matches the data files:

vertices : 6
edges : 12

Example 2 – filtering

What happens if we need to create a subgraph from the main graph, and filter by the person's age or relationships? The example code from the second example Scala file, graph2, shows how this can be done:

val c1 = graph.vertices.filter {
  case (id, (name, age)) => age.toLong > 40
}.count

val c2 = graph.edges.filter {
  case Edge(from, to, property) => property == "Father" || property == "Mother"
}.count

println( "Vertices count : " + c1 )
println( "Edges count    : " + c2 )

The two example counts have been created from the main graph. The first filters the person-based vertices on age, only taking those people who are older than 40 years. Notice that the age value, which was stored as a string, has been converted into a Long for comparison. The second example filters the edges on the relationship property of Mother or Father. The two count values, c1 and c2, are created and printed, as the Spark output shows here:

Vertices count : 4
Edges count : 4
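The counts above filter the vertex and edge RDDs independently. If an actual subgraph is wanted, rather than just counts, GraphX provides a subgraph method that applies an edge predicate and a vertex predicate and returns a new Graph. The following is a minimal sketch of that approach; it is not part of the book's graph2.scala, and the value name over40Parents is purely illustrative:

val over40Parents = graph.subgraph(
  // Edge predicate: keep only Mother/Father relationships.
  triplet => triplet.attr == "Mother" || triplet.attr == "Father",
  // Vertex predicate: keep only people older than 40.
  (id, person) => person._2.toLong > 40
)

println( "Subgraph vertices : " + over40Parents.vertices.count )
println( "Subgraph edges    : " + over40Parents.edges.count )

Because subgraph only keeps an edge when both of its endpoint vertices also pass the vertex predicate, its edge count will generally differ from the separate edge filter count shown above.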
Example 3 – PageRank

The PageRank algorithm provides a ranking value for each of the vertices in a graph. It makes the assumption that the vertices that are connected to the most edges are the most important ones. Search engines use PageRank to provide ordering for the page display during a web search:

val tolerance = 0.0001
val ranking = graph.pageRank(tolerance).vertices

val rankByPerson = vertices.join(ranking).map {
  case (id, ( (person, age), rank )) => (rank, id, person)
}

The previous example code creates a tolerance value, and calls the graph pageRank method using it. The vertices are then ranked into a new value, ranking. In order to make the ranking more meaningful, the ranking values are joined with the original vertices RDD. The rankByPerson value then contains the rank, vertex ID, and person's name.

The PageRank result, held in rankByPerson, is then printed record by record, using a case statement to identify the record contents, and a format statement to print the contents. I did this because I wanted to define the format of the rank value, which can vary:

rankByPerson.collect().foreach {
  case (rank, id, person) =>
    println ( f"Rank $rank%1.2f id $id person $person" )
}

The output from the application is then shown here. As expected, Mike and Sarah have the highest rank, as they have the most relationships:

Rank 0.15 id 4 person Jim
Rank 0.15 id 6 person Flo
Rank 1.62 id 2 person Sarah
Rank 1.82 id 1 person Mike
Rank 1.13 id 3 person John
Rank 1.13 id 5 person Kate

Example 4 – triangle counting

The triangle count algorithm provides a vertex-based count of the number of triangles associated with that vertex. For instance, vertex Mike (1) is connected to Kate (5), who is connected to Sarah (2); Sarah is connected to Mike (1), and so a triangle is formed. This can be useful for route finding, where minimum, triangle-free, spanning tree graphs need to be generated for route planning.

The code to execute a triangle count, and print it, is simple, as shown next. The graph triangleCount method is executed for the graph vertices. The result is saved in the value tCount, and then printed:

val tCount = graph.triangleCount().vertices
println( tCount.collect().mkString("\n") )

The results of the application job show that the vertices Jim (4) and Flo (6) have no triangles, whereas Mike (1) and Sarah (2) have the most, as expected, since they have the most relationships:

(4,0)
(6,0)
(2,4)
(1,4)
(3,2)
(5,2)
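The raw triangle count output is just (vertexId, count) pairs. Mirroring the join used in the PageRank example, the counts can be tied back to people's names. This is a minimal sketch, not part of the book's graph4.scala, and the value name triByPerson is simply illustrative:

// Join the triangle counts with the original vertex RDD so that the
// output shows a person's name rather than a bare vertex ID.
val triByPerson = vertices.join( graph.triangleCount().vertices ).map {
  case (id, ( (person, age), tri )) => (person, tri)
}

triByPerson.collect().foreach {
  case (person, tri) => println( person + " is part of " + tri + " triangle(s)" )
}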
Example 5 – connected components

When a large graph is created from the data, it might contain unconnected subgraphs, that is, subgraphs that are isolated from each other, and contain no bridging or connecting edges between them. These algorithms provide a measure of that connectivity. It might be important, depending upon your processing, to know that all the vertices are connected.

The Scala code for this example calls two graph methods: connectedComponents and stronglyConnectedComponents. The strong method requires a maximum iteration count, which has been set to 1000. These counts are acting on the graph vertices:

val iterations = 1000

val connected  = graph.connectedComponents().vertices
val connectedS = graph.stronglyConnectedComponents(iterations).vertices

The vertex counts are then joined with the original vertex records, so that the connection counts can be associated with the vertex information, such as the person's name:

val connByPerson = vertices.join(connected).map {
  case (id, ( (person, age), conn )) => (conn, id, person)
}

val connByPersonS = vertices.join(connectedS).map {
  case (id, ( (person, age), conn )) => (conn, id, person)
}

The results are then output using a case statement, and formatted printing:

connByPerson.collect().foreach {
  case (conn, id, person) => println ( f"Weak $conn $id $person" )
}

As expected for the connectedComponents algorithm, the results show that, for each vertex, there is only one component. This means that all the vertices are members of a single graph, as the graph diagram earlier in the chapter showed:

Weak 1 4 Jim
Weak 1 6 Flo
Weak 1 2 Sarah
Weak 1 1 Mike
Weak 1 3 John
Weak 1 5 Kate

The stronglyConnectedComponents method gives a measure of the connectivity in a graph, taking into account the direction of the relationships between the vertices. The results for the stronglyConnectedComponents algorithm are output as follows:

connByPersonS.collect().foreach {
  case (conn, id, person) => println ( f"Strong $conn $id $person" )
}

You might notice from the graph that the relationships Sister and Friend act from the vertices Flo (6) and Jim (4) towards Mike (1), as the edge and vertex data shows here:

6,1,Sister
4,1,Friend

1,Mike,48
4,Jim,53
6,Flo,52

So, the strong method output shows that, for most vertices, there is only one graph component, signified by the 1 in the second column. However, vertices 4 and 6 cannot be reached from the rest of the graph, because their relationships point only outwards, so each forms its own strongly connected component and is labelled with its own vertex ID instead of the shared component ID:

Strong 4 4 Jim
Strong 6 6 Flo
Strong 1 2 Sarah
Strong 1 1 Mike
Strong 1 3 John
Strong 1 5 Kate
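If all that is needed is the number of components, rather than the per-vertex labels, the component IDs can be counted directly. GraphX labels each component with the lowest vertex ID it contains, so counting the distinct labels gives the number of components. This is a minimal sketch, not part of the book's graph5.scala:

// Count the distinct component labels produced by each algorithm.
val weakCount   = connected.map  { case (id, comp) => comp }.distinct.count
val strongCount = connectedS.map { case (id, comp) => comp }.distinct.count

println( "Weakly connected components   : " + weakCount )    // 1 for this graph
println( "Strongly connected components : " + strongCount )  // 3 for this graph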
Mazerunner for Neo4j

In the previous sections, you have been shown how to write Apache Spark GraphX code in Scala to process HDFS-based graph data, and how to execute graph-based algorithms, such as PageRank and triangle counting. However, this approach has a limitation: Spark does not provide storage of its own, and storing graph-based data in flat files on HDFS does not allow you to manipulate it in its place of storage. For instance, if you had data stored in a relational database, you could use SQL to interrogate it in place. Databases such as Neo4j are graph databases, which means that their storage mechanisms and data access language act on graphs. In this section, I want to take a look at the work done on Mazerunner, created as a GraphX-to-Neo4j processing prototype by Kenny Bastani.

The following figure describes the Mazerunner architecture. It shows that data in Neo4j is exported to HDFS, and processed by GraphX via a notification process. The GraphX data updates are then saved back to HDFS as a list of key-value updates. These changes are then propagated to Neo4j to be stored. The algorithms in this prototype architecture are accessed via a REST-based HTTP URL, which will be shown later. The point here, though, is that the algorithms can be run via processing in GraphX, but the data changes can be checked via Neo4j database Cypher language queries. Kenny's work, and further details, can be found at http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html.

This section will be dedicated to explaining the Mazerunner architecture, and will show, with the help of an example, how it can be used. This architecture provides a unique example of GraphX-based processing, coupled with graph-based storage.

[Figure: the Mazerunner architecture. The Mazerunner service (Scala) sits between Neo4j, HDFS, and GraphX on Spark: 1) Neo4j exports a subgraph to HDFS; 2) a notification, plus the exported data, is passed to the service; 3) the graph is processed in GraphX on Spark; 4) a key-value update list is written back to HDFS; 5) Neo4j is updated from that list.]

Installing Docker

The process for installing the Mazerunner example code is described at https://github.com/kbastani/neo4j-mazerunner. I have used the 64-bit Linux CentOS 6.5 machine hc1r1m1 for the install. The Mazerunner example uses the Docker tool, which creates virtual containers with a small footprint, for running HDFS, Neo4j, and Mazerunner in this example. First, I must install Docker. I have done this, as follows, using the Linux root user via yum commands. The first command installs the docker-io module (the docker name was already used for CentOS 6.5 by another application):

[root@hc1r1m1 bin]# yum -y install docker-io

I needed to enable the public_ol6_latest repository, and install the device-mapper-event-libs package, as I found that the lib-device-mapper library I had installed wasn't exporting the Base-versioned symbol that Docker needed. I executed the following commands as root:

[root@hc1r1m1 ~]# yum-config-manager --enable public_ol6_latest
[root@hc1r1m1 ~]# yum install device-mapper-event-libs

The actual error that I encountered was as follows:

/usr/bin/docker: relocation error: /usr/bin/docker: symbol dm_task_get_info_with_deferred_remove, version Base not defined in file libdevmapper.so.1.02 with link time reference

I can then check that Docker will run by checking the Docker version number with the following call:

[root@hc1r1m1 ~]# docker version
Client version: 1.4.1
Client API version: 1.16
Go version (client): go1.3.3
Git commit (client): 5bc2ff8/1.4.1
OS/Arch (client): linux/amd64
Server version: 1.4.1
Server API version: 1.16
Go version (server): go1.3.3
Git commit (server): 5bc2ff8/1.4.1

I can start the Linux docker service using the following service command, and I can also force Docker to start on Linux server startup using the following chkconfig command:

[root@hc1r1m1 bin]# service docker start
[root@hc1r1m1 bin]# chkconfig docker on

The three Docker images (HDFS, Mazerunner, and Neo4j) can then be downloaded. They are large, so this may take some time:

[root@hc1r1m1 ~]# docker pull sequenceiq/hadoop-docker:2.4.1
Status: Downloaded newer image for sequenceiq/hadoop-docker:2.4.1

[root@hc1r1m1 ~]# docker pull kbastani/docker-neo4j:latest
Status: Downloaded newer image for kbastani/docker-neo4j:latest

[root@hc1r1m1 ~]# docker pull kbastani/neo4j-graph-analytics:latest
Status: Downloaded newer image for kbastani/neo4j-graph-analytics:latest

Once downloaded, the Docker containers can be started in the order HDFS, Mazerunner, and then Neo4j. The default Neo4j movie database will be loaded, and the Mazerunner algorithms will be run using this data.
The HDFS container starts as follows:

[root@hc1r1m1 ~]# docker run -i -t --name hdfs sequenceiq/hadoop-docker:2.4.1 /etc/bootstrap.sh -bash

Starting sshd: [ OK ]
Starting namenodes on [26d939395e84]
26d939395e84: starting namenode, logging to /usr/local/hadoop/logs/hadoop-root-namenode-26d939395e84.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-root-datanode-26d939395e84.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-root-secondarynamenode-26d939395e84.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn--resourcemanager-26d939395e84.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-root-nodemanager-26d939395e84.out

The Mazerunner service container starts as follows:

[root@hc1r1m1 ~]# docker run -i -t --name mazerunner --link hdfs:hdfs kbastani/neo4j-graph-analytics

The output is long, so I will not include it all here, but you will see no errors. There is also a line which states that the install is waiting for messages:

Waiting for messages. To exit press CTRL+C

In order to start the Neo4j container, I need the install to create a new Neo4j database for me, as this is a first-time install. Otherwise, on restart, I would just supply the path of the database directory. Using the link option, the Neo4j container is linked to the HDFS and Mazerunner containers:

[root@hc1r1m1 ~]# docker run -d -P -v /home/hadoop/neo4j/data:/opt/data --name graphdb --link mazerunner:mazerunner --link hdfs:hdfs kbastani/docker-neo4j

By checking the neo4j/data path, I can now see that a database directory named graph.db has been created:

[root@hc1r1m1 data]# pwd
/home/hadoop/neo4j/data

[root@hc1r1m1 data]# ls
graph.db

I can then use the following docker inspect command, which supplies the local, container-based IP address that the Docker-based Neo4j container is making available; I will need this address to access the Neo4j container. The curl command, along with the port number, which I know from Kenny's website defaults to 7474, shows me that the REST interface is running:

[root@hc1r1m1 data]# docker inspect --format="{{ .NetworkSettings.IPAddress }}" graphdb
172.17.0.5

[root@hc1r1m1 data]# curl 172.17.0.5:7474
{
  "management" : "http://172.17.0.5:7474/db/manage/",
  "data" : "http://172.17.0.5:7474/db/data/"
}

The Neo4j browser

The rest of the work in this section will be carried out using the Neo4j browser URL, which is as follows:

http://172.17.0.5:7474/browser

This is a local, Docker-based IP address that will be accessible from the hc1r1m1 server. It will not be visible on the rest of the local intranet without further network configuration.

This URL shows the default Neo4j browser page. The Movie graph can be installed by following the movie link on that page, selecting the supplied Cypher query, and executing it. The data can then be interrogated using Cypher queries, which will be examined in more depth in the next chapter. The following figures are supplied, along with their associated Cypher queries, to show that the data can be accessed as graphs that are displayed visually. The first graph shows a simple Person-to-Movie relationship, with the relationship details displayed on the connecting edges. The second graph, provided as a visual example of the power of Neo4j, shows a far more complex Cypher query, and the resulting graph, which contains 135 nodes and 180 relationships.
These are relatively small numbers in processing terms, but it is clear that the graph is becoming complex.

The following figures show the Mazerunner example algorithms being called via an HTTP REST URL. The call is defined by the algorithm to be called, and the attribute that it is going to act upon within the graph:

http://localhost:7474/service/mazerunner/analysis/algorithm/attribute

So, for instance, as the next section will show, this generic URL can be used to run the PageRank algorithm by setting algorithm=pagerank. The algorithm will operate on the FOLLOWS relationship by setting attribute=FOLLOWS. The next section will show how each Mazerunner algorithm can be run, along with an example of the Cypher output.

The Mazerunner algorithms

This section shows how the Mazerunner example algorithms may be run using the REST-based HTTP URL, which was shown in the last section. Many of these algorithms have already been examined, and coded, in this chapter. Remember that the interesting thing occurring in this section is that the data starts in Neo4j, is processed on Spark with GraphX, and is then updated back into Neo4j. It looks simple, but there are underlying processes doing all of the work. In each example, the attribute that the algorithm has added to the graph is interrogated via a Cypher query. So, each example isn't so much about the query, but about the fact that the data update to Neo4j has occurred.

The PageRank algorithm

The first call shows the PageRank algorithm, and the PageRank attribute being added to the movie graph. As before, the PageRank algorithm gives a rank to each vertex, depending on how many edge connections it has. In this case, it is using the FOLLOWS relationship for processing. The following image shows a screenshot of the PageRank algorithm result. The text at the top of the image (starting with MATCH) shows the Cypher query, which proves that the PageRank property has been added to the graph.

The closeness centrality algorithm

The closeness algorithm attempts to determine the most important vertices in the graph. In this case, the closeness attribute has been added to the graph.
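As a concrete illustration of the URL pattern shown above, the analysis endpoint can be invoked from a few lines of Scala. This is a minimal sketch under the assumption that the Mazerunner service is reachable on localhost:7474, exactly as in the generic URL shown earlier; the helper name analysisUrl is purely illustrative and not part of Mazerunner itself:

import scala.io.Source

object runMazerunnerAnalysis extends App {

  // Build the generic Mazerunner analysis URL from an algorithm name and
  // the relationship attribute it should act upon.
  def analysisUrl(algorithm: String, attribute: String): String =
    s"http://localhost:7474/service/mazerunner/analysis/$algorithm/$attribute"

  // Trigger the PageRank job on the FOLLOWS relationship and print the reply.
  val response = Source.fromURL( analysisUrl("pagerank", "FOLLOWS") ).mkString
  println(response)
}

Once the job has finished, the newly added pagerank (or closeness) property can be checked from the Neo4j browser with a Cypher query, as the screenshots described above show.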