Chapter 7: Utilizing data structures and algorithms

This chapter covers
■ Representing data structures such as graphs and Bloom filters in MapReduce
■ Applying algorithms such as PageRank and semi-joins to large amounts of data
■ Learning how social network companies recommend making connections with people outside your network

In this chapter we'll look at how you can implement algorithms in MapReduce to work with internet-scale data. We'll focus on nontrivial data, which is commonly represented using graphs. We'll also look at how you can use graphs to model connections between entities, such as relationships in a social network. We'll run through a number of useful algorithms that can be performed over graphs, such as the shortest path algorithm, friends-of-friends (FoF) to help expand the interconnectedness of a network, and PageRank, which looks at how to determine the popularity of web pages.

You'll learn how to use Bloom filters, whose unique space-saving properties make them handy for solving distributed system problems in P2P (peer-to-peer) and distributed databases. We'll also create Bloom filters in MapReduce, and then use them to optimize joins in MapReduce.

A chapter on scalable algorithms wouldn't be complete without mention of statistics and machine learning; these topics are covered in chapters 8 and 9. In addition, sorting and joining algorithms are covered in chapter 4. Let's kick things off with a look at how to model graphs in MapReduce.

7.1 Modeling data and solving problems with graphs

Graphs are mathematical constructs that represent an interconnected set of objects. They're used to represent data such as the hyperlink structure of the internet, social networks (where they represent relationships between users), and internet routing, where they determine optimal paths for forwarding packets.
A graph consists of a number of nodes (formally called vertices) and links (informally called edges) that connect nodes together. Figure 7.1 shows a graph with nodes and edges.

Figure 7.1 A small graph with nodes and edges

The edges can be directed (implying a one-way relationship) or undirected. For example, you would use a directed graph to model relationships between users in a social network, because relationships are not always bidirectional. Figure 7.2 shows examples of directed and undirected graphs.

Figure 7.2 Directed and undirected graphs

Graphs can be cyclic or acyclic. In a cyclic graph it's possible for a vertex to reach itself by traversing a sequence of edges; in an acyclic graph it's not possible for a vertex to traverse a path to reach itself. Figure 7.3 shows examples of cyclic and acyclic graphs.

Figure 7.3 Cyclic and acyclic graphs

To start working with graphs, you'll need to be able to represent them in your code. So what are the common methods used to represent these graph structures?

7.1.1 Modeling graphs

Two common ways of representing graphs are with adjacency matrices and with adjacency lists.

ADJACENCY MATRIX

With an adjacency matrix, you represent a graph as an N x N square matrix M, where N is the number of nodes and Mij represents an edge between nodes i and j. Figure 7.4 shows a directed graph representing connections in a social graph, along with its adjacency matrix:

       jim  ali  bob  dee
  jim   0    0    0    0
  ali   1    0    1    1
  bob   1    1    0    0
  dee   0    1    0    0

The arrows in the graph indicate a one-way relationship between two people, and the adjacency matrix shows how this graph would be represented.

Figure 7.4 An adjacency matrix representation of a graph

The disadvantage of adjacency matrices is that they model both the existence and the absence of a relationship, which makes them a dense data structure.
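To make the matrix form concrete, here's a minimal plain-Java sketch (this is not code from the book; the class and method names are invented for this example, and the node ordering and edge values mirror the jim/ali/bob/dee matrix in figure 7.4):

```java
import java.util.Arrays;
import java.util.List;

public class AdjacencyMatrixSketch {
    // Row/column order matches figure 7.4: row i holds node i's outbound edges.
    static final List<String> NODES = Arrays.asList("jim", "ali", "bob", "dee");
    static final int[][] M = {
        // jim ali bob dee
        {0, 0, 0, 0},   // jim follows nobody
        {1, 0, 1, 1},   // ali follows jim, bob, and dee
        {1, 1, 0, 0},   // bob follows jim and ali
        {0, 1, 0, 0},   // dee follows ali
    };

    // A nonzero Mij means there is a directed edge from node i to node j.
    static boolean hasEdge(String from, String to) {
        return M[NODES.indexOf(from)][NODES.indexOf(to)] != 0;
    }

    public static void main(String[] args) {
        System.out.println("ali -> dee? " + hasEdge("ali", "dee")); // true
        System.out.println("jim -> ali? " + hasEdge("jim", "ali")); // false
        // The density problem: an N-node graph always stores N*N cells,
        // even though this graph has only 6 actual edges.
        System.out.println("cells stored: " + M.length * M[0].length);
    }
}
```

Note how the matrix spends 16 cells to describe 6 edges; that overhead is exactly why the adjacency list below is preferred for sparse graphs.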
ADJACENCY LIST

Adjacency lists are similar to adjacency matrices, except that they don't model the absence of a relationship. Figure 7.5 shows how you'd represent a graph using an adjacency list.

Figure 7.5 An adjacency list representation of a graph

The advantage of the adjacency list is that it offers a sparse representation of the data, which is good because it requires less space. It also fits well when representing graphs in MapReduce, because the key can represent a vertex, and the values are a list of vertices that denote a directed or undirected relationship with that vertex.

Next up we'll cover three graph algorithms, starting with the shortest path algorithm.

7.1.2 Shortest path algorithm

The shortest path algorithm is a common problem in graph theory, where the goal is to find the shortest route between two nodes. Figure 7.6 shows an example of this algorithm on a graph where the edges don't have a weight, in which case the shortest path is the path with the smallest number of hops, or intermediary nodes, between the source and destination.

Figure 7.6 Example of the shortest path between nodes A and E

Applications of this algorithm include traffic-mapping software that determines the shortest route between two addresses, routers that compute the shortest path tree for each route, and social networks that determine connections between users.

TECHNIQUE 52 Find the shortest distance between two users

Dijkstra's algorithm is a shortest path algorithm commonly taught in undergraduate computer science courses. A basic implementation uses a sequential iterative process to traverse the entire graph from the starting node, as seen in the algorithm presented in figure 7.7. The basic algorithm doesn't scale to graphs that exceed the size of your memory, and it's also sequential and not optimized for parallel processing.
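For reference, the sequential form can be sketched as follows. This is a minimal single-machine illustration, not the book's code: the class and method names are invented, edges are assumed unweighted (each hop costs 1), and the hard-coded graph mirrors the social network used later in this technique:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class DijkstraSketch {
    // Undirected, unweighted adjacency list.
    static final Map<String, List<String>> graph = new HashMap<>();

    static void edge(String a, String b) {
        graph.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        graph.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }

    static {
        edge("dee", "kia"); edge("dee", "ali"); edge("kia", "ali");
        edge("ali", "bob"); edge("ali", "joe"); edge("bob", "joe");
    }

    static Map<String, Integer> shortestDistances(String start) {
        Map<String, Integer> dist = new HashMap<>();
        for (String n : graph.keySet()) dist.put(n, Integer.MAX_VALUE); // "infinite"
        dist.put(start, 0);
        // Queue entries are (distance, node) pairs; stale entries are skipped.
        PriorityQueue<Map.Entry<Integer, String>> queue =
            new PriorityQueue<>(Map.Entry.comparingByKey());
        queue.add(new AbstractMap.SimpleEntry<>(0, start));
        while (!queue.isEmpty()) {
            Map.Entry<Integer, String> entry = queue.poll();
            if (entry.getKey() > dist.get(entry.getValue())) continue; // stale
            for (String adj : graph.get(entry.getValue())) {
                int candidate = entry.getKey() + 1;   // each edge adds one hop
                if (candidate < dist.get(adj)) {      // relax the edge
                    dist.put(adj, candidate);
                    queue.add(new AbstractMap.SimpleEntry<>(candidate, adj));
                }
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        System.out.println(shortestDistances("dee"));
    }
}
```

The whole `dist` map lives in memory and the main loop settles one node at a time, which is exactly the scaling and parallelism limitation the MapReduce formulation below works around.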
Problem

You need to use MapReduce to find the shortest path between two people in a social graph.

Solution

Use an adjacency list to model the graph, and for each node store the distance from the original node, as well as a backpointer to the original node. Use the mappers to propagate the distance from the original node, and the reducer to restore the state of the graph. Iterate until the target node has been reached.

Figure 7.7 Pseudo-code for Dijkstra's algorithm. All nodes other than the starting node start with a distance of infinity, denoting the fact that they haven't been visited; the start node's distance is set to zero. An iterative process visits all the unvisited nodes, and the distance from the start node is propagated through the graph by adding the weights encountered as edges are traversed.

Discussion

Figure 7.8 shows a small social network, which you'll use for this technique. Your goal is to find the shortest path between Dee and Joe. There are four paths that you can take from Dee to Joe, but only one of them results in the fewest number of hops.

Figure 7.8 Social network used in this technique (Dee is the start node, Joe the end node; the other users are Kia, Ali, and Bob)

You'll implement a parallel breadth-first search algorithm to find the shortest path between the two users. Because you're operating on a social network, you don't need to care about weights on your edges. The pseudo-code for the algorithm can be seen in figure 7.9, and figure 7.10 shows the algorithm iterations in play with your social graph. Just like Dijkstra's algorithm, you'll start with all the node distances set to infinite, and set the distance of the starting node, Dee, to zero. With each MapReduce pass you'll determine the nodes that don't have an infinite distance and propagate their distance values to their adjacent nodes. You continue this until you reach the end node.

You first need to create the starting point.
This is done by reading in the social network (which is stored as an adjacency list) from file and setting the initial distance values. Figure 7.11 shows the two file formats, the second being the format that's used iteratively in your MapReduce code.

Figure 7.9 Pseudo-code for parallel breadth-first search over a graph using MapReduce

Figure 7.10 Shortest path iterations through the network. All nodes other than the starting node start with a distance of infinity, denoting the fact that they haven't been visited; the start node's distance is set to zero. In the first iteration all the unvisited nodes are examined, and the distance from the start node is propagated to the nodes adjacent to any node whose distance isn't infinity. The second iteration propagates the distances one further hop through the graph, at which point the target node, Joe, has been reached and the algorithm is complete.

Figure 7.11 Original social network file format and the MapReduce form optimized for the algorithm:

  Original friends file:
    NodeName \t AdjacentNodeName \t AdjacentNodeName \t ...

  MapReduce format:
    NodeName \t Distance \t Backpointers \t AdjacentNodeName \t ...

  MapReduce format example (after one iteration):
    kia \t 1 \t dee \t ali \t dee

Your first step is to create the MapReduce form from the original file. The following listing shows the original input file, and the MapReduce-ready form of the input file generated by the transformation code. The MapReduce-ready form adds a numeric that indicates the number of hops from the source node:

cat test-data/ch7/friends-short-path.txt
dee   kia   ali
ali   dee   kia   bob   joe
joe   bob   ali
kia   ali   dee
bob   ali   joe

hadoop fs -cat output/input.txt
The starting node is Dee, so she has a hop count of 0. All other nodes use Integer.MAX_VALUE to indicate that they haven't been visited:

dee   0            kia   ali
ali   2147483647   dee   kia   bob   joe
joe   2147483647   bob   ali
kia   2147483647   ali   dee
bob   2147483647   ali   joe

The code that generates the previous output is shown here:¹

OutputStream os = fs.create(targetFile);
LineIterator iter = org.apache.commons.io.IOUtils
    .lineIterator(fs.open(file), "UTF8");
while (iter.hasNext()) {
  // Read each line from the original social network file.
  String line = iter.nextLine();
  String[] parts = StringUtils.split(line);
  // The default distance for a node is infinite
  // (represented with Integer.MAX_VALUE).
  int distance = Node.INFINITE;
  if (startNode.equals(parts[0])) {
    // The current node is the starting node, so set its distance to zero.
    distance = 0;
  }
  // Write out the distance and an empty backpointer...
  IOUtils.write(parts[0] + '\t' +
      String.valueOf(distance) + "\t\t", os);
  // ...followed by the adjacent nodes (the friends).
  IOUtils.write(StringUtils.join(parts, '\t', 1, parts.length), os);
  IOUtils.write("\n", os);
}

The structure of the MapReduce data isn't changed across iterations of the algorithm; each job produces the same structure, which makes it easy to iterate, because the input format is the same as the output format.

Your map function performs two major functions. First, it outputs all the node data to preserve the original structure of the graph. If you didn't do this, you couldn't make this an iterative process, because the reducer wouldn't be able to reproduce the original graph structure for the next map phase. The second function of the map is, for all adjacent nodes, to output each adjacent node with its distance and a backpointer, if the node has a non-infinite distance.
The backpointer carries information about the nodes visited from the starting node, so that when you reach the end node you know the exact path that was taken to get there:²

@Override
protected void map(Text key, Text value, Context context)
    throws IOException, InterruptedException {
  // Create a Node object from the input. The fromMR method simply splits
  // the string and extracts the distance, backpointer, and adjacent nodes.
  Node node = Node.fromMR(value.toString());
  // Preserve the graph structure.
  context.write(key, value);
  // Only output the neighbor details if this node has a distance value set.
  if (node.isDistanceSet()) {
    // Calculate the distance for the adjacent nodes.
    int neighborDistance = node.getDistance() + 1;
    // The backpointer is simply the existing node's backpointer
    // with the node's name concatenated to the end.
    String backpointer = node.constructBackpointer(key.toString());
    String[] adjNodes = node.getAdjacentNodeNames();
    // Loop through all the adjacent nodes.
    for (int i = 0; i < adjNodes.length; i++) {
      String neighbor = adjNodes[i];
      outKey.set(neighbor);
      Node adjacentNode = new Node()
          .setDistance(neighborDistance)
          .setBackpointer(backpointer);
      outValue.set(adjacentNode.toString());
      // Output the adjacent node details.
      context.write(outKey, outValue);
    }
  }
}

1 GitHub source—https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch7/shortestpath/Main.java
2 GitHub source—https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch7/shortestpath/Map.java

When outputting the original input node, as well as the adjacent nodes and the distances to them, the format (not the contents) of the map output values is identical, which makes it easier for your reducer to read the data. To do this you use a Node class to model the notion of a node, its adjacent nodes, and the distance from the starting node.
Its toString method generates a string form of this data, which is used as the map output value, as shown in the following listing.³

The Node class, which helps with serialization in the MapReduce code

public class Node {
  private int distance = INFINITE;
  private String backpointer;
  private String[] adjacentNodeNames;

  public static int INFINITE = Integer.MAX_VALUE;
  public static final char fieldSeparator = '\t';

  public String constructBackpointer(String name) {
    StringBuilder backpointer = new StringBuilder();
    if (StringUtils.trimToNull(getBackpointer()) != null) {
      backpointer.append(getBackpointer()).append(":");
    }
    backpointer.append(name);
    return backpointer.toString();
  }

  @Override
  public String toString() {
    StringBuilder sb = new StringBuilder();
    sb.append(distance)
        .append(fieldSeparator)
        .append(backpointer);
    if (getAdjacentNodeNames() != null) {
      sb.append(fieldSeparator)
          .append(StringUtils
              .join(getAdjacentNodeNames(), fieldSeparator));
    }
    return sb.toString();
  }

  public static Node fromMR(String value) throws IOException {
    String[] parts = StringUtils.splitPreserveAllTokens(
        value, fieldSeparator);
    if (parts.length < 2) {
      throw new IOException(
          "Expected 2 or more parts but received " + parts.length);
    }
    Node node = new Node()
        .setDistance(Integer.valueOf(parts[0]))
        .setBackpointer(StringUtils.trimToNull(parts[1]));
    if (parts.length > 2) {
      node.setAdjacentNodeNames(
          Arrays.copyOfRange(parts, 2, parts.length));
    }
    return node;
  }
}

3 GitHub source—https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch7/shortestpath/Node.java

Your reducer's job is to calculate the minimum distance for each node, and to output the minimum distance, the backpointer, and the original adjacent nodes. The following listing shows this code.⁴
The reducer code for the shortest path algorithm

4 GitHub source—https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch7/shortestpath/Reduce.java

// The counter enum you'll use to record the number of hops
// when you've reached the target node, and the path taken.
public static enum PathCounter {
  TARGET_NODE_DISTANCE_COMPUTED,
  PATH
}

private Text outValue = new Text();
private String targetNode;

@Override
protected void setup(Context context)
    throws IOException, InterruptedException {
  // Read the target node name from the configuration.
  targetNode = context.getConfiguration().get(Main.TARGET_NODE);
}

@Override
public void reduce(Text key, Iterable<Text> values, Context context)
    throws IOException, InterruptedException {
  // The initial minimum distance is infinite.
  int minDistance = Node.INFINITE;
  Node shortestAdjacentNode = null;
  Node originalNode = null;

  for (Text textValue : values) {
    // Convert the input value into a Node.
    Node node = Node.fromMR(textValue.toString());
    if (node.containsAdjacentNodes()) {
      // This value is the original node (it carries the adjacent
      // nodes), so preserve it.
      originalNode = node;
    }
    if (node.getDistance() < minDistance) {
      // The distance propagated from an adjacent node is less than
      // the minimum distance seen so far, so keep it.
      minDistance = node.getDistance();
      shortestAdjacentNode = node;
    }
  }

  if (shortestAdjacentNode != null) {
    // Store the minimum distance and backpointer from the adjacent node.
    originalNode.setDistance(minDistance);
    originalNode.setBackpointer(
        shortestAdjacentNode.getBackpointer());
  }

  // Write out the node.
  outValue.set(originalNode.toString());
  context.write(key, outValue);

  // If the current node is the target node and you have a valid distance
  // value, you're done; indicate this by recording the distance and
  // backpointer in MapReduce counters.
  if (minDistance != Node.INFINITE
      && targetNode.equals(key.toString())) {
    Counter counter = context.getCounter(
        PathCounter.TARGET_NODE_DISTANCE_COMPUTED);
    counter.increment(minDistance);
    context.getCounter(PathCounter.PATH.toString(),
        shortestAdjacentNode.getBackpointer()).increment(1);
  }
}

You're ready to run your code. You need to copy the input file into HDFS, and then kick off your MapReduce job, specifying the start node name (dee) and target node name (joe):

hadoop fs -put \
  test-data/ch7/friends-short-path.txt \
  friends-short-path.txt

bin/run.sh com.manning.hip.ch7.shortestpath.Main dee joe \
  friends-short-path.txt output

==========================================
= Shortest path found, details as follows.
=
= Start node: dee
= End node: joe
= Hops: 2
= Path: dee:ali
==========================================

hadoop fs -cat output/2/part
ali   1        dee       dee   kia   bob   joe
bob   2        dee:ali   ali   joe
dee   0        null      kia   ali
joe   2        dee:ali   bob   ali
kia   1        dee       ali   dee

The output of your job shows that the minimum number of hops between Dee and Joe is 2, and that Ali was the connecting node.

Summary

This exercise showed how a shortest path algorithm can be used to determine the minimum number of hops between two people in a social network. An algorithm related to the shortest path algorithm, called graph diameter estimation,⁵ attempts to determine the average number of hops between nodes. This has been used to support the notion of six degrees of separation in large social network graphs with millions of nodes.⁶

The shortest path algorithm has multiple applications, but an arguably more useful and more heavily utilized algorithm in social networks is friends-of-friends (FoF).

7.1.3 Friends-of-friends

Social network sites such as LinkedIn and Facebook use the FoF algorithm to help users broaden their networks.

TECHNIQUE 53 Calculating FoFs

The friends-of-friends (FoF) algorithm suggests friends that a user may know but who aren't part of their immediate network.
For the purposes of this section and technique, consider a FoF to be someone in the 2nd-degree network, as shown in figure 7.12. The key ingredient to success with this approach is to order the FoFs by the number of common friends, which increases the chances that the user knows the FoF.

Problem

You want to implement the FoF algorithm in MapReduce.

Solution

Two MapReduce jobs are required to calculate the FoFs for each user in a social network. The first job calculates the common friends for each user, and the second job sorts the common friends by the number of connections to your friends.

5 See "HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop," at http://goo.gl/Estxk.
6 See "Four Degrees of Separation," at http://arxiv.org/abs/1111.4570.

Figure 7.12 An example of FoFs where Joe and Jon are considered FoFs to Jim (the figure shows Jim's 1st, 2nd, and 3rd degrees of separation across the users Joe, Ben, Kia, Jim, Ali, Bill, and Jon)

Discussion

You should first look at an example graph and understand what results you're looking for. Figure 7.13 shows a network of people with Jim, one of the users, highlighted. In this graph Jim's FoFs are shown in bold (Dee, Joe, and Jon), and next to each FoF is the number of friends that the FoF and Jim have in common. Your goal here is to determine all the FoFs and order them by the number of friends in common. Therefore, your expected results would have Joe as the first FoF recommendation, followed by Dee, and then Jon.
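Before turning to the MapReduce jobs, the counting-and-ranking logic can be sketched on a single machine. This is an illustrative plain-Java sketch, not the book's implementation: the class and method names are invented, and the hard-coded friendship lists mirror the technique's friends.txt file:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class FofSketch {
    // Adjacency lists mirroring this technique's friends.txt file.
    static Map<String, List<String>> sampleGraph() {
        Map<String, List<String>> friends = new LinkedHashMap<>();
        friends.put("joe", Arrays.asList("jon", "kia", "bob", "ali"));
        friends.put("kia", Arrays.asList("joe", "jim", "dee"));
        friends.put("dee", Arrays.asList("kia", "ali"));
        friends.put("ali", Arrays.asList("dee", "jim", "bob", "joe", "jon"));
        friends.put("jon", Arrays.asList("joe", "ali"));
        friends.put("bob", Arrays.asList("joe", "ali", "jim"));
        friends.put("jim", Arrays.asList("kia", "bob", "ali"));
        return friends;
    }

    // For one user: count the friends in common with every 2nd-degree
    // contact, then order the FoFs by that count, highest first.
    static List<String> rankedFofs(Map<String, List<String>> friends,
                                   String user) {
        Set<String> direct = new HashSet<>(friends.get(user));
        Map<String, Integer> counts = new HashMap<>();
        for (String friend : direct) {
            for (String fof : friends.get(friend)) {
                // Skip the user themselves and people who are already friends.
                if (fof.equals(user) || direct.contains(fof)) continue;
                counts.merge(fof, 1, Integer::sum); // one more common friend
            }
        }
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
            .map(e -> e.getKey() + ":" + e.getValue())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(rankedFofs(sampleGraph(), "jim"));
        // [joe:3, dee:2, jon:1]
    }
}
```

The two MapReduce jobs in this technique distribute exactly these two steps: job one produces the per-pair common-friend counts, and job two performs the per-user ordering.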
The text file representing the social graph for this technique is shown here:

cat test-data/ch7/friends.txt
joe   jon   kia   bob   ali
kia   joe   jim   dee
dee   kia   ali
ali   dee   jim   bob   joe   jon
jon   joe   ali
bob   joe   ali   jim
jim   kia   bob   ali

Figure 7.13 An example of FoFs where Joe and Jon are considered FoFs to Jim. Joe is a FoF to Jim with three common friends: Ali, Kia, and Bob. Dee is a FoF to Jim with two common friends: Ali and Kia. Jon is a FoF to Jim with one common friend: Ali.

This algorithm requires you to write two MapReduce jobs. The first job, whose pseudo-code is shown in figure 7.14, calculates the FoFs and counts the number of friends in common. The result of the job is a line for each FoF relationship, excluding people who are already friends.

Figure 7.14 The first MapReduce job, which calculates the common friends for each user

The output of executing this job against the graph in figure 7.13 is shown here:

ali   kia   3
bob   dee   1
bob   jon   2
bob   kia   2
dee   jim   2
dee   joe   2
dee   jon   1
jim   joe   3
jim   jon   1
jon   kia   1

Your second job needs to produce, for each user, a list of FoFs ordered by the number of common friends. You use secondary sort to order a user's FoFs by common-friend count. Figure 7.15 shows the algorithm.

Figure 7.15 The second MapReduce job, which sorts the FoFs by the number of friends in common

The output of executing this job against the output of the previous job can be seen here:

ali   kia:3
bob   kia:2,jon:2,dee:1
dee   jim:2,joe:2,jon:1,bob:1
jim   joe:3,dee:2,jon:1
joe   jim:3,dee:2
jon   bob:2,kia:1,dee:1,jim:1
kia   ali:3,bob:2,jon:1

Let's dive into the code in the following listing and look at the first MapReduce job, which calculates the FoFs for each user.⁷
Mapper and reducer implementations for the FoF calculation

public static class Map
    extends Mapper<Text, Text, TextPair, IntWritable> {

  private TextPair pair = new TextPair();
  private IntWritable one = new IntWritable(1);
  private IntWritable two = new IntWritable(2);

  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] friends = StringUtils.split(value.toString());
    // Go through all the adjacent nodes in the graph (the user's friends).
    for (int i = 0; i < friends.length; i++) {
      // For each friend, emit the fact that they're friends so that this
      // relationship can be discarded in the reduce phase. The TextPair
      // class lexicographically orders the two names so that for a given
      // pair of users there'll be a single reducer key.
      pair.set(key.toString(), friends[i]);
      context.write(pair, one);
      // For each friend, go through the remaining friends and emit
      // the fact that they're a FoF (two hops apart).
      for (int j = i + 1; j < friends.length; j++) {
        pair.set(friends[i], friends[j]);
        context.write(pair, two);
      }
    }
  }
}

7 GitHub source—https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch7/friendsofafriend/CalcMapReduce.java

public static class Reduce
    extends Reducer<TextPair, IntWritable, TextPair, IntWritable> {

  private IntWritable friendsInCommon = new IntWritable();

  @Override
  public void reduce(TextPair key, Iterable<IntWritable> values,
                     Context context)
      throws IOException, InterruptedException {
    int commonFriends = 0;
    boolean alreadyFriends = false;
    for (IntWritable hops : values) {
      if (hops.get() == 1) {
        // Ignore this relationship if the users are already friends.
        alreadyFriends = true;
        break;
      }
      commonFriends++;
    }
    // Output the fact that they're FoFs, including a count of common
    // friends. The TextPair key lexicographically orders the user names.
    if (!alreadyFriends) {
      friendsInCommon.set(commonFriends);
      context.write(key, friendsInCommon);
    }
  }
}

The job of the second MapReduce job, shown in the following listing, is to sort the FoFs such that FoFs with a higher number of mutual friends appear before those with a smaller number of mutual friends.⁸

Mapper and reducer implementations that sort FoFs by the number of shared common friends

public static class Map extends Mapper<Text, Text, Person, Person> {

  private Person outputKey = new Person();
  private Person outputValue = new Person();

  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] parts = StringUtils.split(value.toString());
    String name = parts[0];
    int commonFriends = Integer.valueOf(parts[1]);
    // Emit one half of the relationship...
    outputKey.set(name, commonFriends);
    outputValue.set(key.toString(), commonFriends);
    context.write(outputKey, outputValue);
    // ...and then the other half.
    outputValue.set(name, commonFriends);
    outputKey.set(key.toString(), commonFriends);
    context.write(outputKey, outputValue);
  }
}

8 GitHub source—https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch7/friendsofafriend/SortMapReduce.java

public static class Reduce extends Reducer<Person, Person, Text, Text> {

  private Text name = new Text();
  private Text potentialFriends = new Text();

  @Override
  public void reduce(Person key, Iterable<Person> values, Context context)
      throws IOException, InterruptedException {
    // All the people in the list arrive sorted in order of common friends.
    StringBuilder sb = new StringBuilder();
    int count = 0;
    for (Person potentialFriend : values) {
      if (sb.length() > 0) {
        sb.append(",");
      }
      sb.append(potentialFriend.getName())
          .append(":")
          .append(potentialFriend.getCommonFriends());
      // Only keep the top 10.
      if (++count == 10) {
        break;
      }
    }
    name.set(key.getName());
    potentialFriends.set(sb.toString());
    // Emit the FoFs for the user.
    context.write(name, potentialFriends);
  }
}
I won't show the whole driver code, but to enable secondary sort I had to write a few extra classes, and inform the job to use them for partitioning and sorting purposes (for more details on how secondary sort works, see chapter 4):

job.setPartitionerClass(PersonNamePartitioner.class);
job.setSortComparatorClass(PersonComparator.class);
job.setGroupingComparatorClass(PersonNameComparator.class);

You'll copy an input file containing the friend relationships into HDFS and then run your driver code, which runs the two MapReduce jobs. The last two arguments are the output directories for the two jobs:

hadoop fs -put test-data/ch7/friends.txt .

bin/run.sh com.manning.hip.ch7.friendsofafriend.Main \
  friends.txt calc-output sort-output

After running your code you can look at the output in HDFS:

hadoop fs -cat sort-output/part
ali   kia:3
bob   kia:2,jon:2,dee:1
dee   jim:2,joe:2,jon:1,bob:1
jim   joe:3,dee:2,jon:1
joe   jim:3,dee:2
jon   bob:2,kia:1,dee:1,jim:1
kia   ali:3,bob:2,jon:1

This output verifies what you saw in figure 7.13: Jim has three FoFs, and they're ordered by the number of common friends.

Summary

This approach can be used not only as a recommendation engine to help users grow their networks, but also for informational purposes as the user browses the social network's website. For example, when you view people on LinkedIn, you're shown the degrees of separation between you and the person being viewed. Your approach can be used to precompute that information for two hops. To reproduce this for three hops (for example, to show friends-of-friends of a friend) you would need to introduce a third MapReduce job that computes the third hop from the output of the first job. This is left as an exercise for the reader.

To simplify the approach, you used an undirected graph, which implies that user relationships are bidirectional.
Most social networks don't have such a notion, and the algorithm would need some minor tweaks to model this behavior.

This brings us to the final graph technique: using PageRank to calculate the popularity of web pages.

7.1.4 PageRank

PageRank⁹ is a formula introduced by the founders of Google during their Stanford years in 1998. The paper discusses their overall approach to crawling and indexing the web, and includes as part of that a calculation, which they titled PageRank, that gives a score to each web page indicating the page's importance. This wasn't the first paper to introduce a scoring mechanism for web pages,¹⁰ but it was the first to weigh scores propagated to each outbound link based on the total number of outbound links.

TECHNIQUE 54 Calculate PageRank over a web graph

Fundamentally, PageRank gives pages that have a large number of inbound links a higher score than pages that have a smaller number of inbound links. When evaluating the score for a page, PageRank uses the scores of all the inbound links to calculate the page's PageRank. But it penalizes individual inbound links from sources that have a high number of outbound links, by dividing that outbound link's PageRank by the number of outbound links. Figure 7.16 presents a simple example of a web graph with three pages and their respective PageRank values.

Figure 7.16 PageRank values for a simple web graph

9 See "The Anatomy of a Large-Scale Hypertextual Web Search Engine" at http://infolab.stanford.edu/pub/papers/google.pdf.
10 Before PageRank, the HITS link analysis method was popular; see http://nlp.stanford.edu/IR-book/html/htmledition/hubs-and-authorities-1.html.

Figure 7.17 The PageRank formula:

  PR(p) = (1 - d) / |webGraph|  +  d * sum( PR(i) / outboundLinks(i) )
          over every page i that links to p

Figure 7.17 shows the PageRank formula. In the formula, |webGraph| is a count of all the pages in the graph, and d, set to 0.85, is a constant damping factor used in two parts.
First, it denotes the probability of a random surfer reaching the page after clicking on many links (this is a constant equal to 0.15 divided by the total number of pages), and second, it dampens the effect of the inbound link PageRanks by 85 percent.

Problem

You want to implement the iterative PageRank graph algorithm in MapReduce.

Solution

PageRank can be implemented by iterating a MapReduce job until the graph has converged. The mappers are responsible for propagating node PageRank values to their adjacent nodes, and the reducers are responsible for calculating new PageRank values for each node, and for re-creating the original graph with the updated PageRank values.

Discussion

One of the advantages of PageRank is that it can be computed iteratively and applied locally. Every vertex starts with a seed value, which is 1 divided by the number of nodes, and with each iteration each node propagates its value to all the pages it links to. Each vertex in turn sums up the value of all the inbound vertex values to compute a new seed value. This iterative process is repeated until convergence is reached. Convergence is a measure of how much the seed values have changed since the last iteration; if the convergence value is below a certain threshold, it means there's been minimal change and you can stop iterating. It's also common to limit the number of iterations for large graphs where convergence takes too many iterations.

Figure 7.18 An example of PageRank iterations. All nodes start with a PageRank value of (1.0 / number of nodes) = 0.33; in each iteration, every node pushes its PageRank value, divided by its number of outbound nodes, to those outbound nodes.
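The iterative computation just described can be sketched on a single machine. This is an illustrative plain-Java sketch, not the book's MapReduce implementation: the class and method names are invented, the three-page graph is hypothetical rather than the exact graph in the figures, and d = 0.85 as in the formula:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PageRankSketch {
    static final double DAMPING = 0.85;

    // Hypothetical three-page web graph (not the book's exact figure):
    // A links to B and C, B links to C, and C links back to A.
    static Map<String, List<String>> sampleLinks() {
        Map<String, List<String>> links = new LinkedHashMap<>();
        links.put("A", Arrays.asList("B", "C"));
        links.put("B", Arrays.asList("C"));
        links.put("C", Arrays.asList("A"));
        return links;
    }

    // One full pass: every page pushes PR/outdegree to each page it links
    // to, then each page recomputes (1 - d)/N + d * (sum of inbound shares).
    static Map<String, Double> iterate(Map<String, List<String>> links,
                                       Map<String, Double> ranks) {
        int n = links.size();
        Map<String, Double> inbound = new HashMap<>();
        for (String page : links.keySet()) inbound.put(page, 0.0);
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            double share = ranks.get(e.getKey()) / e.getValue().size();
            for (String target : e.getValue())
                inbound.merge(target, share, Double::sum);
        }
        Map<String, Double> next = new HashMap<>();
        for (String page : links.keySet())
            next.put(page, (1 - DAMPING) / n + DAMPING * inbound.get(page));
        return next;
    }

    // Seed every page with 1/N and run a fixed number of iterations.
    static Map<String, Double> converge(int iterations) {
        Map<String, List<String>> links = sampleLinks();
        Map<String, Double> ranks = new HashMap<>();
        for (String p : links.keySet()) ranks.put(p, 1.0 / links.size());
        for (int i = 0; i < iterations; i++) ranks = iterate(links, ranks);
        return ranks;
    }

    public static void main(String[] args) {
        // C has two inbound links (from A and B), so it ends up ranked highest.
        System.out.println(converge(20));
    }
}
```

The MapReduce version below splits `iterate` in two: the mapper computes and emits the per-link shares, and the reducer performs the `(1 - d)/N + d * sum` recombination.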
Figure 7.18 shows two iter- ations of the PageRank against the simple graph you saw at the start of this technique. Figure 7.19 shows the PageRank algorithm expressed as map and reduce parts. The map phase is responsible for preserving the graph as well as emitting the PageRank value to all the outbound nodes. The reducer is responsible for recalculating the new PageRank value for each node and including it in the output of the original graph. In this technique you’ll operate on the graph shown in figure 7.20. In this graph all the nodes have both inbound and output edges. Your first step is to write the map task. The map task in the following listing has two primary functions: to preserve the graph structure and to propagate the PageRank num- 11 ber to each outbound node. B A D C Figure 7.20 Sample web graph for this technique 11 GitHub source—https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/ manning/hip/ch7/pagerank/Map.java 272 CHAPTER 7 Utilizing data structures and algorithms The PageRank mapper inverts the graph structure so the reducer can calculate PageRanks public class Map extends MapperText, Text, Text, Text private Text outKey = new Text(); private Text outValue = new Text(); Override protected void map(Text key, Text value, Context context) throws IOException, InterruptedException context.write(key, value); Emit the input to preserve the graph structure. Node node = Node.fromMR(value.toString()); if(node.getAdjacentNodeNames() = null && Calculate the outbound PageRank by node.getAdjacentNodeNames().length 0) dividing the node’s PageRank with the double outboundPageRank = node.getPageRank() / number of outbound links. 
          (double) node.getAdjacentNodeNames().length;
      for (int i = 0; i < node.getAdjacentNodeNames().length; i++) {
        String neighbor = node.getAdjacentNodeNames()[i];
        outKey.set(neighbor);
        Node adjacentNode = new Node()
            .setPageRank(outboundPageRank);
        outValue.set(adjacentNode.toString());
        // For each outbound node, emit that node's name along with
        // the PageRank.
        context.write(outKey, outValue);
      }
    }
  }
}

The map task outputs two pieces of information: the original graph and the outbound PageRank values. Your reducer's job, as shown in the next listing, is to reconstruct the original graph and to update each node with a new PageRank value based on the sum of all the inbound PageRank values.¹²

12 GitHub source—https://github.com/alexholmes/hadoop-book/blob/master/src/main/java/com/manning/hip/ch7/pagerank/Reduce.java

PageRank reducer that calculates new PageRank values

public class Reduce extends Reducer<Text, Text, Text, Text> {

  public static final double CONVERGENCE_SCALING_FACTOR = 1000.0;
  public static final double DAMPING_FACTOR = 0.85;
  public static String CONF_NUM_NODES_GRAPH = "pagerank.numnodes";
  private int numberOfNodesInGraph;

  public static enum Counter {
    CONV_DELTAS
  }
