GitHub API Java

How to access the GitHub API, how to get a GitHub API key, and where to find the GitHub API documentation
HartJohnson,United States,Professional
Published Date:02-08-2017
Chapter 7: Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More

GitHub has rapidly evolved in recent years to become the de facto social coding platform with a deceptively simple premise: provide a top-notch hosted solution for developers to create and maintain open source software projects with an open source distributed version control system called Git. Unlike version control systems such as CVS or Subversion, with Git there is no canonical copy of the code base, per se. All copies are working copies, and developers can commit local changes on a working copy without needing to be connected to a centralized server.

The distributed version control paradigm lends itself exceptionally well to GitHub’s notion of social coding because it allows developers who are interested in contributing to a project to fork a working copy of its code repository and immediately begin working on it in just the same way that the developer who owns the fork works on it. Git not only keeps track of semantics that allow repositories to be forked arbitrarily but also makes it relatively easy to merge changes from a forked child repository back into its parent repository. Through the GitHub user interface, this workflow is called a pull request.

It is a deceptively simple notion, but the ability for developers to create and collaborate on coding projects with elegant workflows that involve minimal overhead (once you understand some fundamental details about how Git works) has certainly streamlined many of the tedious details that have hindered innovation in open source development, including conveniences that transcend into the visualization of data and interoperability with other systems. In other words, think of GitHub as an enabler of open source software development.
In the same way, although developers have collaborated on coding projects for decades, a hosted platform like GitHub supercharges collaboration and enables innovation in unprecedented ways by making it easy to create a project, share out its source code, maintain feedback and an issue tracker, accept patches for improvements and bug fixes, and more. More recently, it even appears that GitHub is increasingly catering to non-developers and becoming one of the hottest social platforms for mainstream collaboration.

Just to be perfectly clear, this chapter does not attempt to provide a tutorial on how to use Git or GitHub as a distributed version control system or even discuss Git software architecture at any level. (See one of the many excellent online Git references, such as git-scm.com, for that kind of instruction.) This chapter does, however, attempt to teach you how to mine GitHub’s API to discover patterns of social collaboration in the somewhat niche software development space.

Always get the latest bug-fixed source code for this chapter (and every other chapter) online at http://bit.ly/MiningTheSocialWeb2E. Be sure to also take advantage of this book’s virtual machine experience, as described in Appendix A, to maximize your enjoyment of the sample code.

7.1. Overview

This chapter provides an introduction to GitHub as a social coding platform and to graph-oriented analysis using NetworkX. In this chapter, you’ll learn how to take advantage of GitHub’s rich data by constructing a graphical model of the data that can be used in a variety of ways. In particular, we’ll treat the relationships between GitHub users, repositories, and programming languages as an interest graph, which is a way of interpreting the nodes and links in the graph primarily from the vantage point of people and the things in which they are interested.
There is a lot of discussion these days amongst hackers, entrepreneurs, and web mavens as to whether or not the future of the Web is largely predicated upon some notion of an interest graph, so now is a fine time to get up to speed on the emerging graph landscape and all that it entails.

In sum, then, this chapter follows the same predictable template as chapters before it and covers:

• GitHub’s developer platform and how to make API requests
• Graph schemas and how to model property graphs with NetworkX
• The concept of an interest graph and how to construct an interest graph from GitHub data
• Using NetworkX to query property graphs
• Graph centrality algorithms, including degree, betweenness, and closeness centrality

7.2. Exploring GitHub’s API

Like the other social web properties featured in this book, GitHub’s developer site offers comprehensive documentation on its APIs, the terms of service governing the use of those APIs, example code, and much more. Although the APIs are fairly rich, we’ll be focusing on only the few API calls that we need in order to collect the data for creating some interest graphs that associate software developers, projects, programming languages, and other aspects of software development. The APIs more or less provide you with everything you’d need to build a rich user experience just like github.com offers itself, and there is no shortage of compelling and possibly even lucrative applications that you could build with these APIs.

The most fundamental primitives for GitHub are users and projects. If you are reading this page, you’ve probably already managed to pull down this book’s source code from its GitHub project page, so this discussion assumes that you’ve at least visited a few GitHub project pages, have poked around a bit, and are familiar with the general notion of what GitHub offers.
A GitHub user has a public profile that generally includes one or more code repositories that have either been created or forked from another GitHub user. For example, the GitHub user ptwobrussell owns a couple of GitHub repositories, including one called Mining-the-Social-Web and another called Mining-the-Social-Web-2nd-Edition. ptwobrussell has also forked a number of repositories in order to capture a particular working snapshot of certain code bases for development purposes, and these forked projects also appear in his public profile.

Part of what makes GitHub so powerful is that ptwobrussell is free to do anything he’d like with any of these forked projects (subject to the terms of their software licenses), in the same way that anyone else could do the same thing to forked projects. When a user forks a code repository, that user effectively owns a working copy of the same repository and can do anything from just fiddle around with it to drastically overhaul and create a long-lived fork of the original project that may never be intended to get merged back into the original parent repository. Although most project forks never materialize into derivative works of their own, the effort involved in creating a derivative work is trivial from the standpoint of source code management. It may be short-lived and manifest as a pull request that is merged back into the parent, or it may be long-lived and become an entirely separate project with its own community. The barrier to entry for open source software contribution and other projects that increasingly find themselves appearing on GitHub is low indeed.

In addition to forking projects on GitHub, a user can also bookmark or star a project to become what is known as a stargazer of the project. Bookmarking a project is essentially the same thing as bookmarking a web page or a tweet.
You are signifying interest in the project, and it’ll appear on your list of GitHub bookmarks for quick reference. What you’ll generally notice is that far fewer people fork code than bookmark it. Bookmarking is an easy and well-understood notion from over a decade of web surfing, whereas forking the code implies having the intent to modify or contribute to it in some way. Throughout the remainder of this chapter, we’ll focus primarily on using the list of stargazers for a project as the basis of constructing an interest graph for it.

7.2.1. Creating a GitHub API Connection

Like other social web properties, GitHub implements OAuth, and the steps to gaining API access involve creating an account followed by one of two possibilities: creating an application to use as the consumer of the API or creating a “personal” access token that will be linked directly to your account. In this chapter, we’ll opt to use a personal access token, which is as easy as clicking a button in the Personal Access API Tokens section of your account’s Applications menu, as shown in Figure 7-1. (See Appendix B for a more extensive overview of OAuth.)

Figure 7-1. Create a “Personal API Access Token” from the Applications menu in your account and provide a meaningful note so that you’ll remember its purpose

A programmatic option for obtaining an access token, as opposed to creating one within the GitHub user interface, is shown in Example 7-1 as an adaptation of “Creating an OAuth token for command-line use” from GitHub’s help site. (If you are not taking advantage of the virtual machine experience for this book, as described in Appendix A, you’ll need to type pip install requests in a terminal prior to running this example.)

Example 7-1. Programmatically obtaining a personal API access token for accessing GitHub’s API

    import requests
    from getpass import getpass
    import json

    username = ''  # Your GitHub username
    password = ''  # Your GitHub password

    # Note that credentials will be transmitted over a secure SSL connection
    url = 'https://api.github.com/authorizations'
    note = 'Mining the Social Web, 2nd Ed.'
    post_data = {'scopes': ['repo'], 'note': note}

    response = requests.post(
        url,
        auth=(username, password),
        data=json.dumps(post_data),
    )

    print("API response:", response.text)
    print()
    print("Your OAuth token is", response.json()['token'])

    # Go to https://github.com/settings/applications to revoke this token

As is the case with many other social web properties, GitHub’s API is built on top of HTTP and accessible through any programming language in which you can make an HTTP request, including command-line tools in a terminal. Following the precedents set by previous chapters, however, we’ll opt to take advantage of a Python library so that we can avoid some of the tedious details involved in making requests, parsing responses, and handling pagination. In this particular case, we’ll use PyGithub, which can be installed with the somewhat predictable pip install PyGithub. We’ll start by taking a look at a couple of examples of how to make GitHub API requests before transitioning into a discussion of graphical models.

Let’s seed an interest graph in this chapter from the Mining-the-Social-Web GitHub repository and create connections between it and its stargazers. Listing the stargazers for a repository is possible with the List Stargazers API. You could try out an API request to get an idea of what the response type looks like by copying and pasting the following URL in your web browser: https://api.github.com/repos/ptwobrussell/Mining-the-Social-Web/stargazers.
Although you are reading Mining the Social Web, 2nd Edition, at the time of this writing the source code repository for the first edition still has much more activity than the second edition, so the first edition repository will serve as the basis of examples for this chapter. Analysis of any repository, including the repository for the second edition of this book, is easy enough to accomplish by simply changing the name of the initial project as introduced in Example 7-3.

The ability to issue an unauthenticated request in this manner is quite convenient as you are exploring the API, and the rate limit of 60 unauthenticated requests per hour is more than adequate for tinkering and exploring. You could, however, append a query string of the form ?access_token=xxx, where xxx specifies your access token, to make the same request in an authenticated fashion. GitHub’s authenticated rate limits are a generous 5,000 requests per hour, as described in the developer documentation for rate limiting. Example 7-2 illustrates a sample request and response. (Keep in mind that this is requesting only the first page of results and, as described in the developer documentation for pagination, metadata information for navigating the pages of results is included in the HTTP headers.)

Example 7-2. Making direct HTTP requests to GitHub’s API

    import json
    import requests

    # An unauthenticated request that doesn't contain an ?access_token=xxx query string
    url = "https://api.github.com/repos/ptwobrussell/Mining-the-Social-Web/stargazers"
    response = requests.get(url)

    # Display one stargazer
    print(json.dumps(response.json()[0], indent=1))
    print()

    # Display headers
    for (k, v) in response.headers.items():
        print(k, "=", v)

Sample output follows:

    {
     "following_url": "https://api.github.com/users/rdempsey/following{/other_user}",
     "events_url": "https://api.github.com/users/rdempsey/events{/privacy}",
     "organizations_url": "https://api.github.com/users/rdempsey/orgs",
     "url": "https://api.github.com/users/rdempsey",
     "gists_url": "https://api.github.com/users/rdempsey/gists{/gist_id}",
     "html_url": "https://github.com/rdempsey",
     "subscriptions_url": "https://api.github.com/users/rdempsey/subscriptions",
     "avatar_url": "https://1.gravatar.com/avatar/8234a5ea3e56fca09c5549ee...png",
     "repos_url": "https://api.github.com/users/rdempsey/repos",
     "received_events_url": "https://api.github.com/users/rdempsey/received_events",
     "gravatar_id": "8234a5ea3e56fca09c5549ee5e23e3e1",
     "starred_url": "https://api.github.com/users/rdempsey/starred{/owner}{/repo}",
     "login": "rdempsey",
     "type": "User",
     "id": 224,
     "followers_url": "https://api.github.com/users/rdempsey/followers"
    }

    status = 200 OK
    access-control-allow-credentials = true
    x-ratelimit-remaining = 58
    x-github-media-type = github.beta
    x-content-type-options = nosniff
    access-control-expose-headers = ETag, Link, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes
    transfer-encoding = chunked
    x-github-request-id = 73f42421-ea0d-448c-9c90-c2d79c5b1fed
    content-encoding = gzip
    vary = Accept, Accept-Encoding
    server = GitHub.com
    last-modified = Sun, 08 Sep 2013 17:01:27 GMT
    x-ratelimit-limit = 60
    link = <https://api.github.com/repositories/1040700/stargazers?page=2>; rel="next", <https://api.github.com/repositories/1040700/stargazers?page=30>; rel="last"
    etag = "ca10cd4edc1a44e91f7b28d3fdb05b10"
    cache-control = public, max-age=60, s-maxage=60
    date = Sun, 08 Sep 2013 19:05:32 GMT
    access-control-allow-origin = *
    content-type = application/json; charset=utf-8
    x-ratelimit-reset = 1378670725

As you can see, there’s a lot of useful information that GitHub is returning to us that is not in the body of the HTTP response and is instead conveyed as HTTP headers, as outlined in the developer documentation. You should skim and understand what all of the various headers mean, but a few of note include the status header, which tells us that the request was OK with a 200 response; headers that involve the rate limit, such as x-ratelimit-remaining; and the link header, which contains a value such as the following:

    <https://api.github.com/repositories/1040700/stargazers?page=2>; rel="next", <https://api.github.com/repositories/1040700/stargazers?page=29>; rel="last"

The link header’s value is giving us a preconstructed URL that could be used to fetch the next page of results as well as an indication of how many total pages of results there are.

7.2.2. Making GitHub API Requests

Although it’s not difficult to use a library like requests and make the most of this information by parsing it out ourselves, a library like PyGithub makes it that much easier for us and tackles the abstraction of the implementation details of GitHub’s API, leaving us to work with a clean Pythonic API. Better yet, if GitHub changes the underlying implementation of its API, we’ll still be able to use PyGithub and our code won’t break.

Before making a request with PyGithub, also take a moment to look at the body of the response itself.
It contains some rich information, but the piece we’re most interested in is a field called login, which is the GitHub username of the user who is stargazing at the repository of interest. This information is the basis of issuing many other queries to other GitHub APIs, such as “List repositories being starred,” an API that returns a list of all repositories a user has starred. This is a powerful pivot because after we have started with an arbitrary repository and queried it for a list of users who are interested in it, we are then able to query those users for additional repositories of interest and potentially discover any patterns that might emerge.

For example, wouldn’t it be interesting to know what is the next-most-bookmarked repository among all of the users who have bookmarked Mining-the-Social-Web? The answer to that question could be the basis of an intelligent recommendation that GitHub users would appreciate, and it doesn’t take much creativity to imagine different domains in which intelligent recommendations could (and often do) provide enhanced user experiences in applications, as is the case with Amazon and Netflix. At its core, an interest graph inherently lends itself to making such intelligent recommendations, and that’s one of the reasons that interest graphs have become such a conversation topic in certain niche circles of late.

Example 7-3 provides an example of how you could use PyGithub to retrieve all of the stargazers for a repository to seed an interest graph.

Example 7-3. Using PyGithub to query for stargazers of a particular repository

    from github import Github

    # XXX: Specify your own access token here
    ACCESS_TOKEN = ''

    # Specify a username and repository of interest for that user.
    USER = 'ptwobrussell'
    REPO = 'Mining-the-Social-Web'

    client = Github(ACCESS_TOKEN, per_page=100)
    user = client.get_user(USER)
    repo = user.get_repo(REPO)

    # Get a list of people who have bookmarked the repo.
    # Since you'll get a lazy iterator back, you have to traverse
    # it if you want to get the total number of stargazers.
    stargazers = [s for s in repo.get_stargazers()]
    print("Number of stargazers", len(stargazers))

Behind the scenes, PyGithub takes care of the API implementation details for you and simply exposes some convenient objects for query. In this case, we create a connection to GitHub and use the per_page keyword parameter to tell it that we’d like to receive the maximum number of results (100) as opposed to the default number (30) in each page of data that comes back. Then, we get a repository for a particular user and query for that repository’s stargazers. It is possible for users to have repositories with identical names, so there is not an unambiguous way to query by just a repository’s name. Since usernames and repository names could overlap, you need to take special care to specify the kind of object that you are working with when using GitHub’s API if using one of these names as an identifier. We’ll account for this as we create graphs with node names that may be ambiguous if we do not qualify them as repositories or users.

Finally, PyGithub generally provides “lazy iterators” as results, which in this case means that it does not attempt to fetch all 29 pages of results when the query is issued. Instead, it waits until a particular page is requested when iterating over the data before it retrieves that page. For this reason, we need to exhaust the lazy iterator with a list comprehension in order to actually count the number of stargazers with the API if we want to get an exact count.
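The page-at-a-time behavior just described can be sketched without PyGithub. What follows is a minimal illustration written for Python 3, not PyGithub’s actual implementation: parse_link_header, last_page_number, and iter_pages are hypothetical helper names, and fetch is an assumed callable that returns one page of items together with the raw link header value for a URL. The made-up FAKE_PAGES data stands in for the network so that the sketch runs offline.

```python
import re
from urllib.parse import urlparse, parse_qs


def parse_link_header(value):
    """Parse an HTTP Link header value into a {rel: url} dictionary.

    Accepts entries with or without angle brackets around the URL. Splitting
    on commas is a simplification that assumes the URLs contain no commas.
    """
    links = {}
    for part in value.split(","):
        match = re.match(r'\s*<?([^>;\s]+)>?\s*;\s*rel="([^"]+)"', part)
        if match:
            url, rel = match.groups()
            links[rel] = url
    return links


def last_page_number(links):
    """Return the page number of the rel="last" link, or 1 if it is absent."""
    if "last" not in links:
        return 1
    query = parse_qs(urlparse(links["last"]).query)
    return int(query["page"][0])


def iter_pages(first_url, fetch):
    """Lazily yield pages of results, following rel="next" links.

    fetch(url) is an assumed callable returning (items, link_header_value).
    No page is requested until the caller asks for it.
    """
    url = first_url
    while url:
        items, link_value = fetch(url)
        yield items
        url = parse_link_header(link_value or "").get("next")


# A link header value modeled on the one shown earlier in this chapter
SAMPLE = ('<https://api.github.com/repositories/1040700/stargazers?page=2>; rel="next", '
          '<https://api.github.com/repositories/1040700/stargazers?page=29>; rel="last"')
links = parse_link_header(SAMPLE)
print("Next page URL:", links["next"])
print("Total pages:", last_page_number(links))

# Hypothetical stand-in for the network: URL -> (items, link header value)
FAKE_PAGES = {
    "page1": (["alice", "bob"], '<page2>; rel="next", <page3>; rel="last"'),
    "page2": (["carol"], '<page3>; rel="next", <page3>; rel="last"'),
    "page3": (["dave"], None),
}

# Exhausting the generator is what forces every page to be fetched
stargazers = [login for page in iter_pages("page1", FAKE_PAGES.get) for login in page]
print("Number of stargazers", len(stargazers))
```

With requests, fetch could be a small function that calls requests.get(url) and returns the decoded JSON body along with response.headers.get('Link'); requests also exposes an already-parsed form of this header as response.links, which would make the hand-written parser unnecessary.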
PyGithub’s documentation is helpful, its API generally mimics the GitHub API in a predictable way, and you’ll usually be able to use its pydoc, such as through the dir() and help() functions in a Python interpreter. Alternatively, tab completion and “question mark magic” in IPython or IPython Notebook will get you to the same place in figuring out what methods are available to call on what objects. It would be worthwhile to poke around at the GitHub API a bit with PyGithub to better familiarize yourself with some of the possibilities before continuing further. As an exercise to test your skills, can you iterate over Mining-the-Social-Web’s stargazers (or some subset thereof) and do some basic frequency analysis that determines which other repositories may be of common interest? You will likely find Python’s collections.Counter or NLTK’s nltk.FreqDist essential in easily computing frequency statistics.

7.3. Modeling Data with Property Graphs

You may recall from Section 2.3.2.2 on page 78 that graphs were introduced in passing as a means of representing, analyzing, and visualizing social network data from Facebook. This section provides a more thorough discussion and hopefully serves as a useful primer for graph computing. Even though it is still a bit under the radar, the graph computing landscape is emerging rapidly given that graphs are a very natural abstraction for modeling many phenomena in the real world. Graphs offer a flexibility in data representation that is especially hard to beat during data experimentation and analysis when compared to other options, such as relational databases. Graph-centric analyses are certainly not a panacea for every problem, but an understanding of how to model your data with graphical structures is a powerful addition to your toolkit.
A general introduction to graph theory is beyond the scope of this chapter, and the discussion that follows simply attempts to provide a gentle introduction to key concepts as they arise. You may enjoy the short YouTube video “Graph Theory—An Introduction” if you’d like to accumulate some general background knowledge before proceeding.

The remainder of this section introduces a common kind of graph called a property graph for the purpose of modeling GitHub data as an interest graph by way of a Python package called NetworkX. A property graph is a data structure that represents entities with nodes and relationships between the entities with edges. Each node (also called a vertex) has a unique identifier, a map of properties that are defined as key/value pairs, and a collection of edges. Likewise, edges are unique in that they connect nodes, can be uniquely identified, and can contain properties.

Figure 7-2 shows a trivial example of a property graph with two nodes that are uniquely identified by X and Y with an undescribed relationship between them. This particular graph is called a digraph because its edges are directed, which need not be the case unless the directionality of the edge is rooted in meaning for the domain being modeled.

Figure 7-2. A trivial property graph with directed edges

Expressed in code with NetworkX, a trivial property graph could be constructed as shown in Example 7-4. (You can use pip install networkx to install this package if you aren’t using the book’s turnkey virtual machine.)

Example 7-4. Constructing a trivial property graph

    import networkx as nx

    # Create a directed graph
    g = nx.DiGraph()

    # Add an edge to the directed graph from X to Y
    g.add_edge('X', 'Y')

    # Print some statistics about the graph
    print(nx.info(g))
    print()

    # Get the nodes and edges from the graph
    print("Nodes:", g.nodes())
    print("Edges:", g.edges())
    print()

    # Get node properties
    print("X props:", g.node['X'])
    print("Y props:", g.node['Y'])

    # Get edge properties
    print("X=>Y props:", g['X']['Y'])
    print()

    # Update a node property
    g.node['X'].update({'prop1': 'value1'})
    print("X props:", g.node['X'])
    print()

    # Update an edge property
    g['X']['Y'].update({'label': 'label1'})
    print("X=>Y props:", g['X']['Y'])

Sample output from the example follows:

    Name:
    Type: DiGraph
    Number of nodes: 2
    Number of edges: 1
    Average in degree:   0.5000
    Average out degree:   0.5000

    Nodes: ['Y', 'X']
    Edges: [('X', 'Y')]

    X props: {}
    Y props: {}
    X=>Y props: {}

    X props: {'prop1': 'value1'}

    X=>Y props: {'label': 'label1'}

In this particular example, the add_edge method of the digraph adds an edge from a node that’s uniquely identified by X to a node that’s uniquely identified by Y, resulting in a graph with two nodes and one edge between them. In terms of its unique identifier, this edge would be represented by the tuple (X, Y) since both nodes that it connects are uniquely identified themselves. Be aware that adding an edge from Y back to X would create a second edge in the graph, and this second edge could contain its own set of edge properties. In general, you wouldn’t want to create this second edge since you can get a node’s incoming or outgoing edges and effectively traverse the edge in either direction, but there may be some situations in which it is more convenient to explicitly include the additional edge.

The degree of a node in a graph is the number of incident edges to it, and for a directed graph, there is a notion of in degree and out degree since edges have direction.
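The ideas above (edge identity, a separate reverse edge, and in and out degree) can be checked interactively. The following is a minimal sketch written for Python 3; it assumes a NetworkX release that provides the g.nodes and g.edges attribute views (2.0 or later, to the best of our knowledge) rather than the g.node accessor used in Example 7-4, so treat the exact accessors as an assumption about your installed version.

```python
import networkx as nx

# Build the same two-node digraph as Example 7-4
g = nx.DiGraph()
g.add_edge('X', 'Y')

# Attach properties through the node and edge views (assumed NetworkX 2.0+)
g.nodes['X']['prop1'] = 'value1'
g.edges['X', 'Y']['label'] = 'label1'

print("Nodes:", list(g.nodes()))
print("Edges:", list(g.edges()))
print("X props:", g.nodes['X'])
print("X=>Y props:", g.edges['X', 'Y'])

# In and out degree for every node while only the X=>Y edge exists
in_degrees = dict(g.in_degree())
out_degrees = dict(g.out_degree())
print("In degrees:", in_degrees)
print("Out degrees:", out_degrees)

# An edge from Y back to X is a distinct edge with its own properties
g.add_edge('Y', 'X', label='label2')
print("Number of edges:", g.number_of_edges())
print("Y=>X props:", g.edges['Y', 'X'])
```

The edge is addressed by the pair of node identifiers it connects, which is why g.edges['X', 'Y'] and g.edges['Y', 'X'] return different property maps once the reverse edge has been added.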
The average in degree and average out degree values provide a normalized score for the graph that represents the number of nodes that have incoming and outgoing edges. In this particular case, the directed graph has a single directed edge, so there is one node with an outgoing edge and one node with an incoming edge.

The in and out degree of a node are fundamental concepts in graph theory. Assuming you know the number of vertices in the graph, the average degree provides a measure of the graph’s density: the number of actual edges compared to the number of possible edges if the graph were fully connected. In a fully connected graph, each node is connected to every other node, and in the case of a directed graph, this means that all nodes have incoming edges from all other nodes.

You calculate the average in degree for an entire graph by summing the values of each node’s in degree and dividing the total by the number of nodes in the graph, which is 1 divided by 2 in Example 7-4. The average out degree is computed the same way, except that the sum of each node’s out degree is divided by the number of nodes in the graph. When you’re considering an entire directed graph, there will always be an equal number of incoming edges and outgoing edges because each edge [1] connects only two nodes, and the average in degree and average out degree values for the entire graph will be the same.

In the general case, the maximum values for average in and out degree in a graph are one less than the number of nodes in the graph. Take a moment to convince yourself that this is the case by considering the number of edges that are necessary to fully connect all of the nodes in a graph.

In the next section, we’ll construct an interest graph using these same property graph primitives and illustrate these additional methods at work on real-world data.
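The arithmetic described above can be verified with a few lines of code. This is a minimal sketch written for Python 3 and assuming NetworkX 2.0 or later, where g.in_degree() and g.out_degree() yield (node, degree) pairs; average_in_degree and average_out_degree are hypothetical helper names introduced only for illustration.

```python
import networkx as nx


def average_in_degree(g):
    """Sum every node's in degree and divide by the number of nodes."""
    return sum(degree for _, degree in g.in_degree()) / g.number_of_nodes()


def average_out_degree(g):
    """Sum every node's out degree and divide by the number of nodes."""
    return sum(degree for _, degree in g.out_degree()) / g.number_of_nodes()


# The two-node digraph from Example 7-4: one edge, two nodes
g = nx.DiGraph()
g.add_edge('X', 'Y')
print("Average in degree:", average_in_degree(g))    # 1 / 2 = 0.5
print("Average out degree:", average_out_degree(g))  # 1 / 2 = 0.5
print("Density:", nx.density(g))                      # 1 edge of 2 possible = 0.5

# A fully connected digraph on n nodes: every ordered pair of distinct nodes
n = 5
full = nx.DiGraph()
full.add_edges_from((u, v) for u in range(n) for v in range(n) if u != v)
print("Average in degree (full):", average_in_degree(full))  # n - 1 = 4.0
print("Density (full):", nx.density(full))                    # 1.0
```

For a directed graph, dividing the average in degree by one less than the number of nodes gives the same value that nx.density reports, which is one way to see why the average can never exceed n - 1.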
First, take a moment to explore by adding some nodes, edges, and properties to the graph. The NetworkX documentation provides a number of useful introductory examples that you can also explore if this is one of your first encounters with graphs and you’d like some extra instruction as a primer.

The Rise of Big Graph Databases

This chapter introduces property graphs, a versatile data structure that can be used to model complex networks with nodes and edges as simple primitives. We’ll be modeling the data according to a flexible graph schema that’s based largely on natural intuition, and for a narrowly focused domain, this pragmatic approach is often sufficient. As we’ll see throughout the remainder of this chapter, property graphs provide flexibility and versatility in modeling and querying complex data.

NetworkX, the Python-based graph toolkit used throughout this book, provides a powerful toolbox for modeling property graphs. Be aware, however, that NetworkX is an in-memory graph database. The limit of what it can do for you is directly proportional to the amount of working memory that you have on the machine on which you are running it. In many situations, you can work around the memory limitation by constraining the domain to a subset of the data or by using a machine with more working memory. In an increasing number of situations involving “big data” and its burgeoning ecosystem that largely involves Hadoop and NoSQL databases, however, in-memory graphs are simply not an option.

There is a nascent ecosystem of so-called “big graph databases” that leverage NoSQL databases for storage and provide property graph semantics that are worth noting. Titan, a promising front-runner, and other big graph databases that adopt the property graph model present an opportunity for a departure from the semantic web stack as we know it, involving technologies such as RDF Schema, OWL, and SPARQL. The idea behind the technologies involved in the semantic web stack is that they provide a mechanism for representing and consolidating data from more complex domains, thereby making it possible to arrive at a standardized vocabulary that can be meaningfully queried. Unfortunately, there has been a litany of historical challenges involved in applying these technologies at web scale. One of Titan’s promising key claims is that it is designed to effectively manage the memory hierarchy in order to scale well.

It will be exciting to see what the future holds as big graph databases based upon NoSQL databases and the property graph model fuse with the ideas and technologies involved in a more traditional semantic web toolchain. The next chapter introduces some current web innovations involving microformats that are a step in the general direction of a more semantic web; it ends with a brief example that demonstrates inferencing on a simple graph with a small subset of technology akin to the semantic web stack.

[1] A more abstract version of a graph called a hypergraph contains hyperedges that can connect an arbitrary number of vertices.

7.4. Analyzing GitHub Interest Graphs

Now equipped with the tools to both query GitHub’s API and model the data that comes back as a graph, let’s put our skills to the test and begin constructing and analyzing an interest graph. We’ll start with a repository that will represent a common interest among a group of GitHub users and use GitHub’s API to discover the stargazers for this repository. From there, we’ll be able to use other APIs to model social connections between GitHub users who follow one another and home in on other interests that these users might share.

We’ll also learn about some fundamental techniques for analyzing graphs called centrality measures.
Although a visual layout of a graph is tremendously useful, many graphs are simply too large or complex for an effective visual inspection, and centrality measures can be helpful in analytically measuring aspects of the network structure. (But don’t worry, we will also visualize a graph before closing this chapter.)

7.4.1. Seeding an Interest Graph

Recall that an interest graph and a social graph are not the same thing. Whereas a social graph’s primary focus is representing connections between people and it generally requires a mutual relationship between the parties involved, an interest graph connects people and interests and involves unidirectional edges. Although the two are by no means totally disjoint concepts, do not confuse the connection between a GitHub user following another GitHub user with a social connection; it is an “interested in” connection because there is not a mutual acceptance criterion involved.

A classic example of a hybridized graphical model that would qualify as a social interest graph is Facebook. It started primarily as a technology platform based upon the concept of a social graph, but the incorporation of the Like button squarely catapulted it into hybrid territory that could be articulated as a social-interest graph. It explicitly represents connections between people as well as connections between people and the things that they are interested in. Twitter has always been some flavor of an interest graph with its asymmetric “following” model, which can be interpreted as a connection between a person and the things (which could be other people) that person is interested in.

Examples 7-5 and 7-6 will introduce code samples to construct the initial “gazes” relationships between users and repositories, demonstrating how to explore the graphical structure that emerges.
The graph that is initially constructed can be referred to as an ego graph in that there is a central point of focus (an ego) that is the basis for most (in this case, all) of the edges. An ego graph is sometimes called a “hub and spoke graph” or a “star graph” since it resembles a hub with spokes emanating from it and looks like a star when rendered visually.

From the standpoint of a graph schema, the graph contains two types of nodes and one type of edge, as is demonstrated in Figure 7-3. We’ll use the graph schema shown in Figure 7-3 as a starting point and evolve it with modifications throughout the remainder of this chapter.

Figure 7-3. The basis of a graph schema that includes GitHub users who are interested in repositories

There is a subtle but important constraint in data modeling that involves the avoidance of naming collisions: usernames and repository names may (and often do) collide with one another. For example, there could be a GitHub user named “ptwobrussell,” as well as multiple repositories named “ptwobrussell”. Recalling that the add_edge method uses the items passed in as its first two parameters as unique identifiers, we can append either “(user)” or “(repo)” to the items to ensure that all nodes in the graph will be unique. From the standpoint of modeling with NetworkX, appending a type for the node mostly takes care of the problem.

Along the same lines, repositories that are owned by different users can have the same name whether they are forks of the same code base or entirely different code bases. At the moment this particular detail is of no concern to us, but once we begin to add in other repositories that other GitHub users are stargazing at, the possibility of this kind of collision will increase.

Whether to allow these types of collisions or to implement a graph construction strategy that avoids them is a design decision that carries with it certain consequences.
For example, it would probably be desirable to have forks of the same repository collapse into the same node in the graph as opposed to representing them all as different repositories, but you would certainly not want completely different projects that shared a name to collapse into the same node.

Given the limited scope of the problem that we are solving and that it’s initially focusing on a particular repository of interest, we’ll opt to avoid the complexity that disambiguating repository names introduces.

With that in mind, take a look at Example 7-5, which constructs an ego graph of a repository and its stargazers, and Example 7-6, which introduces some useful graph operations.

Example 7-5. Constructing an ego graph of a repository and its stargazers

    # Expand the initial graph with (interest) edges pointing each direction for
    # additional people interested. Take care to ensure that user and repo nodes
    # do not collide by appending their type.

    import networkx as nx

    # repo, user, and stargazers are assumed to be the objects already
    # retrieved from GitHub's API.
    g = nx.DiGraph()
    g.add_node(repo.name + '(repo)', type='repo', lang=repo.language, owner=user.login)

    for sg in stargazers:
        g.add_node(sg.login + '(user)', type='user')
        g.add_edge(sg.login + '(user)', repo.name + '(repo)', type='gazes')

Example 7-6. Introducing some handy graph operations

    # Poke around in the current graph to get a better feel for how NetworkX works

    print nx.info(g)
    print
    print g.node['Mining-the-Social-Web(repo)']
    print g.node['ptwobrussell(user)']
    print
    print g['ptwobrussell(user)']['Mining-the-Social-Web(repo)']
    # The next line would throw a KeyError since no such edge exists:
    # print g['Mining-the-Social-Web(repo)']['ptwobrussell(user)']
    print
    print g['ptwobrussell(user)']
    print g['Mining-the-Social-Web(repo)']
    print
    print g.in_edges('ptwobrussell(user)')
    print g.out_edges('ptwobrussell(user)')
    print
    print g.in_edges('Mining-the-Social-Web(repo)')
    print g.out_edges('Mining-the-Social-Web(repo)')

The following sample (abbreviated) output demonstrates some of the possibilities based upon the graph operations just shown:

    Name:
    Type: DiGraph
    Number of nodes: 852
    Number of edges: 851
    Average in degree: 0.9988
    Average out degree: 0.9988

    {'lang': u'JavaScript', 'owner': u'ptwobrussell', 'type': 'repo'}
    {'type': 'user'}

    {'type': 'gazes'}

    {u'Mining-the-Social-Web(repo)': {'type': 'gazes'}}
    {}

    []
    [('ptwobrussell(user)', u'Mining-the-Social-Web(repo)')]

    [(u'gregmoreno(user)', 'Mining-the-Social-Web(repo)'),
     (u'SathishRaju(user)', 'Mining-the-Social-Web(repo)'),
     ...
    ]
    []

With an initial interest graph in place, we can get creative in determining which steps might be most interesting to take next. What we know so far is that there are approximately 850 users who share a common interest in social web mining, as indicated by their stargazing association to ptwobrussell’s Mining-the-Social-Web repository. As expected, the number of edges in the graph is one less than the number of nodes. This is because each stargazer contributes exactly one edge (an edge must exist to connect each stargazer to the repository), while the repository itself adds a node but no edge.
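The listings above depend on live GitHub API objects and on the Python 2 and NetworkX 1.x APIs; recent NetworkX releases no longer offer the g.node attribute or nx.info. As a minimal sketch, assuming Python 3 and a recent NetworkX release, the same kind of ego graph can be built from a hypothetical three-item list of stargazer logins (the variable names repo_name and stargazer_logins are stand-ins introduced here, not part of the original examples):

```python
import networkx as nx

# Hypothetical stand-ins for the repository and stargazer logins that the
# chapter retrieves from GitHub's API.
repo_name = 'Mining-the-Social-Web'
stargazer_logins = ['ptwobrussell', 'gregmoreno', 'SathishRaju']

g = nx.DiGraph()
repo_node = repo_name + '(repo)'
g.add_node(repo_node, type='repo')

for login in stargazer_logins:
    user_node = login + '(user)'
    g.add_node(user_node, type='user')
    g.add_edge(user_node, repo_node, type='gazes')

# Node attributes are read through g.nodes[...] in current NetworkX releases.
print(g.nodes[repo_node])
print(g['ptwobrussell(user)'][repo_node])

# Edges point from stargazer to repository only, so the reverse edge is absent.
print(g.has_edge(repo_node, 'ptwobrussell(user)'))

# Every stargazer contributes exactly one edge, so edges == nodes - 1.
num_nodes, num_edges = g.number_of_nodes(), g.number_of_edges()
print(num_nodes, num_edges)

# Average in degree (equal to average out degree) is edges divided by nodes.
avg_in_degree = sum(d for _, d in g.in_degree()) / num_nodes
print(round(avg_in_degree, 4))
```

With 851 stargazers instead of three, the same ratio becomes 851/852, or roughly 0.9988, which matches the averages reported in the sample output.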
If you recall that the average in degree and average out degree metrics yield a normalized value that provides a measure for the density of the graph, the value of 0.9988 should confirm our intuition. We know that we have 851 nodes corresponding to stargazers that each have an out degree equal to 1, and one node corresponding to a repository that has an in degree of 851. In other words, we know that the number of edges in the graph is one less than the number of nodes, so the average is 851/852, or approximately 0.9988. The density of edges in the graph is quite low given that the maximum value for the average degree in this case is 851.

It might be tempting to think about the topology of the graph, knowing that it looks like a star if visually rendered, and try to make some kind of connection to the value of 0.9988. It is true that we have one node that is connected to all other nodes in the graph, but it would be a mistake to generalize and try to make some kind of connection to the average degree being approximately 1.0 based on this single node. The same 851 edges could just as easily have connected the nodes in many other configurations (a single chain of 852 nodes, for instance) and still arrived at a value of 0.9988. To gain insight that would support this kind of conclusion we would need to consider additional analytics, such as the centrality measures introduced in the next section.

7.4.2. Computing Graph Centrality Measures

A centrality measure is a fundamental graph analytic that provides insight into the relative importance of a particular node in a graph. Let’s consider the following centrality measures, which will help us more carefully examine graphs to gain insights about networks:

Degree centrality
The degree centrality of a node in the graph is a measure of the number of incident edges upon it.
Think of this centrality measure as a way of tabulating the frequency of incident edges on nodes for the purpose of measuring uniformity among them, finding the nodes with the highest or lowest numbers of incident edges, or otherwise trying to discover patterns that provide insight into the network topology based on number of connections as a primary motivation. The degree centrality of a node is just one facet that is useful in reasoning about its role in a network, and it provides a good starting point for identifying outliers or anomalies with respect to connectedness relative to other nodes in the graph. In aggregate, we also know from our earlier discussion that the average degree centrality tells us something about the density of an overall graph. NetworkX provides networkx.degree_centrality as a built-in function to compute the degree centrality of a graph. It returns a dictionary that maps the ID of each node to its degree centrality.

Betweenness centrality
The betweenness centrality of a node is a measure of how often it connects any other nodes in the graph in the sense of lying on the shortest paths between them. You might think about betweenness centrality as a measure of how critical a node is in connecting other nodes as a broker or gateway. Although not necessarily the case, the loss of nodes with a high betweenness centrality measure could be quite disruptive to the flow of energy [2] in a graph, and in some circumstances removing nodes with high betweenness centrality can disintegrate a graph into smaller subgraphs. NetworkX provides networkx.betweenness_centrality as a built-in function to compute the betweenness centrality of a graph. It returns a dictionary that maps the ID of each node to its betweenness centrality.
Closeness centrality
The closeness centrality of a node is a measure of how highly connected (“close”) it is to all other nodes in the graph. This centrality measure is also predicated on the notion of shortest paths in the graph and offers insight into how well connected a particular node is in the graph. Unlike a node’s betweenness centrality, which tells you something about how integral it is in connecting nodes as a broker or gateway, a node’s closeness centrality accounts more for direct connections. Think of closeness in terms of a node’s ability to spread energy to all other nodes in a graph. NetworkX provides networkx.closeness_centrality as a built-in function to compute the closeness centrality of a graph. It returns a dictionary that maps the ID of each node to its closeness centrality.

NetworkX provides a number of other powerful centrality measures, which are described in its online documentation.

Figure 7-4 shows the Krackhardt kite graph, a well-studied graph in social network analysis that illustrates the differences among the centrality measures introduced in this section. It’s called a “kite graph” because when rendered visually, it has the appearance of a kite.

2. In the current discussion, the term “energy” is used to generically describe flow within an abstract graph.
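To see the three measures disagree in practice, the following minimal sketch computes each one on the Krackhardt kite graph using NetworkX's built-in krackhardt_kite_graph generator, which labels the ten nodes with the integers 0 through 9. It assumes Python 3 and a recent NetworkX release; the helper top_nodes is a name introduced here purely for illustration.

```python
import networkx as nx

# Built-in generator for the ten-node Krackhardt kite graph (nodes 0..9).
kite = nx.krackhardt_kite_graph()


def top_nodes(scores, tol=1e-9):
    """Return the sorted list of nodes whose score ties for the maximum."""
    best = max(scores.values())
    return sorted(n for n, s in scores.items() if abs(s - best) <= tol)


# Each function returns a dictionary that maps node ID -> centrality value.
degree = nx.degree_centrality(kite)
betweenness = nx.betweenness_centrality(kite)
closeness = nx.closeness_centrality(kite)

print('Highest degree centrality:     ', top_nodes(degree))
print('Highest betweenness centrality:', top_nodes(betweenness))
print('Highest closeness centrality:  ', top_nodes(closeness))
```

In this graph the three measures single out different nodes: node 3 has the most incident edges, node 7 lies on every shortest path that reaches the tail of the kite, and nodes 5 and 6 tie for the smallest total distance to all other nodes.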