How to Write a LinkedIn Job Description

MattGates, United States, Professional
Published: 02-08-2017
Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More

This chapter introduces techniques and considerations for mining the troves of data tucked away at LinkedIn, a social networking site focused on professional and business relationships. Although LinkedIn may initially seem like any other social network, the nature of its API data is inherently quite different. If you liken Twitter to a busy public forum like a town square and Facebook to a very large room filled with friends and family chatting about things that are (mostly) appropriate for dinner conversation, then you might liken LinkedIn to a private event with a semiformal dress code where everyone is on their best behavior and trying to convey the specific value and expertise that they could bring to the professional marketplace.

Given the somewhat sensitive nature of the data that's tucked away at LinkedIn, its API has its own nuances that make it a bit different from many of the others we've looked at in this book. People who join LinkedIn are principally interested in the business opportunities that it provides, as opposed to arbitrary socializing, and will necessarily be providing sensitive details about business relationships, job histories, and more. For example, while you can generally access all of the details about your LinkedIn connections' educational histories and previous work positions, you cannot determine whether two arbitrary people are "mutually connected" as you could with Facebook. The absence of such an API method is intentional. The API doesn't lend itself to being modeled as a social graph like Facebook or Twitter, and it therefore requires that you ask different types of questions about the data that's available to you.

The remainder of this chapter gets you set up to access data with the LinkedIn API and introduces some fundamental data mining techniques that can help you cluster colleagues according to a similarity measurement in order to answer the following kinds of queries:

• Which of your connections are the most similar based upon a criterion like job title?
• Which of your connections have worked in companies you want to work for?
• Where do most of your connections reside geographically?

In all cases, the pattern for analysis with a clustering technique is essentially the same: extract some features from data in a colleague's profile, define a similarity measurement to compare the features from each profile, and use a clustering technique to group together colleagues that are "similar enough." (A minimal sketch of this three-step pattern appears at the end of the overview just below.) The approach works well for LinkedIn data, and you can apply these same techniques to just about any other kind of data that you'll ever encounter.

Always get the latest bug-fixed source code for this chapter (and every other chapter) online at http://bit.ly/MiningTheSocialWeb2E. Be sure to also take advantage of this book's virtual machine experience, as described in Appendix A, to maximize your enjoyment of the sample code.

3.1. Overview

This chapter introduces content that is foundational in machine learning and, in general, is a bit more advanced than the two chapters before it. It is recommended that you have a firm grasp on the previous two chapters before working through the material presented here.

In this chapter, you'll learn about:

• LinkedIn's Developer Platform and making API requests
• Three common types of clustering, a fundamental machine-learning topic that applies to nearly any problem domain
• Data cleansing and normalization
• Geocoding, a means of arriving at a set of coordinates from a textual reference to a location
• Visualizing geographic data with Google Earth and with cartograms
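The following is a minimal, hedged sketch of that three-step pattern, shown only to make the idea concrete before we turn to the API. The tiny profiles list, the token-set features, and the 0.5 threshold are all illustrative assumptions rather than anything prescribed by this chapter:

    # A minimal sketch of the extract/compare/cluster pattern.
    # The sample data, features, and threshold are assumptions for
    # illustration only; the chapter develops real versions of each step.

    def extract_features(profile):
        # Step 1: extract features (here, the set of tokens in a job title)
        return set(profile['title'].lower().split())

    def similarity(f1, f2):
        # Step 2: a similarity measurement (here, Jaccard overlap of token sets)
        if not f1 or not f2:
            return 0.0
        return len(f1 & f2) / float(len(f1 | f2))

    def cluster(profiles, threshold=0.5):
        # Step 3: greedily group items that are "similar enough"
        clusters = []
        for p in profiles:
            features = extract_features(p)
            for c in clusters:
                if similarity(features, extract_features(c[0])) >= threshold:
                    c.append(p)
                    break
            else:
                clusters.append([p])
        return clusters

    profiles = [{'title': 'Senior Software Engineer'},
                {'title': 'Software Engineer'},
                {'title': 'Chief Executive Officer'}]

    for c in cluster(profiles):
        print [p['title'] for p in c]

The remainder of the chapter essentially replaces each of these toy steps with a more robust counterpart, starting with acquiring the data itself.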
In this chapter, you’ll learn about: • LinkedIn’s Developer Platform and making API requests • Three common types of clustering, a fundamental machine-learning topic that ap‐ plies to nearly any problem domain • Data cleansing and normalization • Geocoding, a means of arriving at a set of coordinates from a textual reference to a location • Visualizing geographic data with Google Earth and with cartograms 3.2. Exploring the LinkedIn API You’ll need a LinkedIn account and a handful of colleagues in your professional network to follow along with this chapter’s examples in a meaningful way. If you don’t have a LinkedIn account, you can still apply the fundamental clustering techniques that you’ll learn about to other domains, but this chapter won’t be quite as engaging since you can’t 90 Chapter 3: Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More a follow along with the examples without your own LinkedIn data. Start developing a LinkedIn professional network if you don’t already have one as a worthwhile investment in your professional life. Although most of the analysis in this chapter is performed against a comma-separated values (CSV) file of your LinkedIn connections that you can download, this section maintains continuity with other chapters in the book by providing an overview of the LinkedIn API. If you’re not interested in learning about the LinkedIn API and would like to jump straight into the analysis, skip ahead to Section 3.2.2 on page 96 and come back to the details about making API requests at a later time. 3.2.1. Making LinkedIn API Requests As is the case with other social web properties, such as Twitter and Facebook (discussed in the preceding chapters), the first step involved in gaining API access to LinkedIn is to create an application. You’ll be able to create a sample application at https://www.link edin.com/secure/developer; you will want to take note of your application’s API Key, Secret Key, OAuth User Token, and OAuth User Secret credentials, which you’ll use to programmatically access the API. Figure 3-1 illustrates the form that you’ll see once you have created an application. Figure 3-1. To access the LinkedIn API, create an application at https://www.linke din.com/secure/developer and take note of the four OAuth credentials (shown here as blurred values) that are available from the application details page 3.2. Exploring the LinkedIn API 91 a With the necessary OAuth credentials in hand, the process for obtaining API access to your own personal data is much like that of Twitter in that you’ll provide these creden‐ tials to a library that will take care of the details involved in making API requests. If you’re not taking advantage of the book’s virtual machine experience, you’ll need to install it by typingpip install python-linkedin in a terminal. See Appendix B for details on implementing an OAuth 2.0 flow, which you will need to build an application that requires an arbitrary user to authorize it to access account data. Example 3-1 illustrates a sample script that uses your LinkedIn credentials to ultimately create an instance of a LinkedInApplication class that can access your account data. Notice that the final line of the script retrieves your basic profile information, which includes your name and headline. Before going too much further, you should take a moment to read about what LinkedIn API operations are available to you as a developer by browsing its REST documentation, which provides a broad overview of what you can do. 
Although we’ll be accessing the API through a Python package that abstracts the HTTP requests that are involved, the core API documentation is always your de‐ finitive reference, and most good libraries mimic its style. Should you need to revoke account access from your application or any other OAuth application, you can do so in your account settings. Example 3-1. Using LinkedIn OAuth credentials to receive an access token suitable for development and accessing your own data from linkedin import linkedin pip install python-linkedin Define CONSUMER_KEY, CONSUMER_SECRET, USER_TOKEN, and USER_SECRET from the credentials provided in your LinkedIn application CONSUMER_KEY = '' CONSUMER_SECRET = '' USER_TOKEN = '' USER_SECRET = '' RETURN_URL = '' Not required for developer authentication Instantiate the developer authentication class auth = linkedin.LinkedInDeveloperAuthentication(CONSUMER_KEY, CONSUMER_SECRET, USER_TOKEN, USER_SECRET, 92 Chapter 3: Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More a RETURN_URL, permissions=linkedin.PERMISSIONS.enums.values()) Pass it in to the app... app = linkedin.LinkedInApplication(auth) Use the app... app.get_profile() In short, the calls available to you through an instance ofLinkedInApplication are the same as those available through the REST API, and thepython-linkedin documenta‐ tion on GitHub provides a number of queries to get you started. A couple of APIs of particular interest are the Connections API and the Search API. You’ll recall from our introductory discussion that you cannot get “friends of friends” (“connections of con‐ nections,” in LinkedIn parlance), but the Connections API returns a list of your con‐ nections, which provides a jumping-off point for obtaining profile information. The Search API provides a means of querying for people, companies, or jobs that are avail‐ able on LinkedIn. Additional APIs are available, and it’s worth your while to take a moment and familiarize yourself with them. The quality of the data available about your professional network is quite remarkable, as it can potentially contain full job histories, company details, geographic information about the location of positions, and more. Example 3-2 shows you how to useapp, an instance of yourLinkedInApplication, to 1 retrieve extended profile information for your connections and save this data to a file so as to avoid making any unnecessary API requests that will count against your rate- throttling limits, which are similar to those of Twitter’s API. Be careful when tinkering around with LinkedIn’s API: the rate limits don’t reset until midnight UTC, and one buggy loop could potentially blow your plans for the next 24 hours if you aren’t careful. Example 3-2. Retrieving your LinkedIn connections and storing them to disk import json connections = app.get_connections() connections_data = 'resources/ch03-linkedin/linkedin_connections.json' 1. If any of your connections have opted out of LinkedIn API access, their first and last names will appear as “private” and additional details will not be available. 3.2. Exploring the LinkedIn API 93 a f = open(conections_data, 'w') f.write(json.dumps(connections, indent=1)) f.close() You can reuse the data without using the API later like this... 
For an initial step in reviewing your connections' data, let's use the prettytable package, as introduced in previous chapters, to display a nicely formatted table of your connections and where they are each located, as shown in Example 3-3. If you're not taking advantage of this book's preconfigured virtual machine, you'll need to type pip install prettytable from a terminal for most of the examples in this chapter to work; it's a package that produces nicely formatted, tabular output.

Example 3-3. Pretty-printing your LinkedIn connections' data

    from prettytable import PrettyTable  # pip install prettytable

    pt = PrettyTable(field_names=['Name', 'Location'])
    pt.align = 'l'
    [pt.add_row((c['firstName'] + ' ' + c['lastName'], c['location']['name']))
     for c in connections['values']
         if c.has_key('location')]
    print pt

Sample (anonymized) results follow and display your connections and where they are currently located according to their profiles.

    +-----------+----------------------------+
    | Name      | Location                   |
    +-----------+----------------------------+
    | Laurel A. | Greater Boston Area        |
    | Eve A.    | Greater Chicago Area       |
    | Jim A.    | Washington D.C. Metro Area |
    | Tom A.    | San Francisco Bay Area     |
    | ...       | ...                        |
    +-----------+----------------------------+

A full scan of the profile information returned from the Connections API reveals that it's pretty spartan, but you can use field selectors as outlined in the Profile Fields online documentation to retrieve additional details, if available. For example, Example 3-4 shows how to fetch a connection's job position history.

Example 3-4. Displaying job position history for your profile and a connection's profile

    import json

    # See http://developer.linkedin.com/documents/profile-fields#fullprofile
    # for details on additional field selectors that can be passed in for
    # retrieving additional profile information.

    # Display your own positions...
    my_positions = app.get_profile(selectors=['positions'])
    print json.dumps(my_positions, indent=1)

    # Display positions for someone in your network...
    # Get an id for a connection. We'll just pick the first one.
    connection_id = connections['values'][0]['id']
    connection_positions = app.get_profile(member_id=connection_id,
                                           selectors=['positions'])
    print json.dumps(connection_positions, indent=1)

Sample output reveals a number of interesting details about each position, including the company name, industry, summary of efforts, and employment dates:

    {
     "positions": {
      "_total": 10,
      "values": [
       {
        "startDate": {
         "year": 2013,
         "month": 2
        },
        "title": "Chief Technology Officer",
        "company": {
         "industry": "Computer Software",
         "name": "Digital Reasoning Systems"
        },
        "summary": "I lead strategic technology efforts...",
        "isCurrent": true,
        "id": 370675000
       },
       {
        "startDate": {
         "year": 2009,
         "month": 10
        },
        ...
       }
      ]
     }
    }

As might be expected, some API responses may not necessarily contain all of the information that you want to know, and some responses may contain more information than you need. Instead of making multiple API calls to piece together information or potentially stripping out information you don't want to keep, you could take advantage of the field selector syntax to customize the response details. Example 3-5 shows how you can retrieve only the name, industry, and id fields for companies as part of a response for profile positions.
Example 3-5. Using field selector syntax to request additional details for APIs

    # See http://developer.linkedin.com/documents/understanding-field-selectors
    # for more information on the field selector syntax

    my_positions = app.get_profile(selectors=['positions:(company:(name,industry,id))'])
    print json.dumps(my_positions, indent=1)

Once you're familiar with the basic APIs that are available to you, have a few handy pieces of documentation bookmarked, and have made a few API calls to familiarize yourself with the basics, you're up and running with LinkedIn.

3.2.2. Downloading LinkedIn Connections as a CSV File

While using the API provides programmatic access to everything that would be visible to you as an authenticated user browsing profiles at http://linkedin.com, you can get all of the job title details you'll need for much of this chapter by exporting your LinkedIn connections as address book data in a CSV file format. To initiate the export, select the Connections menu item from the Contacts menu to navigate to your LinkedIn connections page, and then select the "Export connections" link from within your LinkedIn account. Alternatively, you can navigate directly to the Export LinkedIn Connections dialog illustrated in Figure 3-2. Later in this chapter, we'll be using the csv module that's part of Python's standard library to parse the exported data, so in order to ensure compatibility with the upcoming code listing, choose the Outlook CSV option from the available choices.

Figure 3-2. A lesser-known feature of LinkedIn is that you can export all of your connections in a convenient and portable CSV format at http://www.linkedin.com/people/export-settings
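Before moving on, it can be worth a quick sanity check that the export parses as expected. The snippet below is a hedged aside rather than one of the chapter's listings, and the file path is an assumption that depends on where you saved your export:

    import csv

    # Hypothetical path to your exported "Outlook CSV" file
    csvReader = csv.DictReader(open('my_connections.csv'), delimiter=',', quotechar='"')

    # For an Outlook CSV export you should see fields such as
    # 'Job Title' and 'Company' among the column headers
    print csvReader.fieldnames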
3.3. Crash Course on Clustering Data

Now that you have a basic understanding of how to access LinkedIn's API, let's dig into some more specific analysis with what will turn out to be a fairly thorough discussion of clustering, an unsupervised machine-learning technique that is a staple in any data mining toolkit. Clustering involves taking a collection of items and partitioning them into smaller collections (clusters) according to some heuristic that is usually designed to compare items in the collection. (Without splitting hairs over technical nuances, clustering is also commonly referred to as approximate matching, fuzzy matching, and/or deduplication, among many other names.)

Clustering is a fundamental data mining technique, and as part of a proper introduction to it, this chapter includes some footnotes and interlaced discussion of a somewhat mathematical nature that undergirds the problem. Although you should strive to eventually understand these details, you don't need to grasp all of the finer points to successfully employ clustering techniques, and you certainly shouldn't feel any pressure to understand them the first time that you encounter them. It may take a little bit of reflection to digest some of the discussion, especially if you don't have a mathematical background.

For example, if you were considering a geographic relocation, you might find it useful to cluster your LinkedIn connections into some number of geographic regions in order to better understand the economic opportunities available. We'll revisit this concept momentarily, but first let's take a moment to briefly discuss some nuances associated with clustering.

When implementing solutions to problems that lend themselves to clustering on LinkedIn or elsewhere, you'll repeatedly encounter at least two primary themes (see the sidebar "The Role of Dimensionality Reduction in Clustering" for a discussion of a third) as part of a clustering analysis:

Data normalization
    Even when you're retrieving data from a nice API, it's usually not the case that the data will be provided to you in exactly the format you'd like—it often takes more than a little bit of munging to get the data into a form suitable for analysis. For example, LinkedIn members can enter free text that describes their job titles, so you won't always end up with perfectly normalized job titles. One executive might choose the title "Chief Technology Officer," while another may opt for the more ambiguous "CTO," and still others may choose other variations of the same role. We'll revisit the data normalization problem and implement a pattern for handling certain aspects of it for LinkedIn data momentarily.

Similarity computation
    Assuming you have reasonably well-normalized items, you'll need to measure similarity between any two of them, whether they're job titles, company names, professional interests, geographic labels, or any other field that can be entered as free text, so you'll need to define a heuristic that can approximate the similarity between any two values. In some situations computing a similarity heuristic can be quite obvious, but in others it can be tricky. For example, comparing the combined years of career experience for two people might be as simple as some addition operations, but comparing a broad professional element such as "leadership aptitude" in a fully automated manner could be quite a challenge.

The Role of Dimensionality Reduction in Clustering

Although data normalization and similarity computation are two overarching themes that you'll encounter in clustering at an abstract level, dimensionality reduction is a third theme that soon emerges once the scale of the data you are working with becomes nontrivial. To cluster all of the items in a set using a similarity metric, you would ideally compare every member to every other member. Thus, for a set of n members in a collection, you would perform somewhere on the order of n^2 similarity computations in your algorithm for the worst-case scenario, because you have to compare each of the n items to n-1 other items.

Computer scientists call this predicament an n-squared problem and generally use the nomenclature O(n^2) to describe it; conversationally, you'd say it's a "Big-O of n-squared" problem. O(n^2) problems become intractable for very large values of n, and most of the time, the use of the term intractable means you'd have to wait "too long" for a solution to be computed. "Too long" might be minutes, years, or eons, depending on the nature of the problem and its constraints.

An exploration of dimensionality reduction techniques is beyond the scope of the current discussion, but suffice it to say that a typical dimensionality reduction technique involves using a function to organize "similar enough" items into a fixed number of bins so that the items within each bin can then be more exhaustively compared to one another.
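To make the binning idea concrete, here is a minimal sketch under assumed inputs; the deliberately crude binning function and the sample titles are illustrative assumptions, not a recommended production technique:

    from collections import defaultdict

    def bin_key(title):
        # A deliberately crude binning function: bucket titles by their
        # first token. Real systems use smarter, hash-like functions.
        tokens = title.lower().split()
        return tokens[0] if tokens else ''

    titles = ['Senior Software Engineer', 'Senior Research Scientist',
              'Chief Executive Officer', 'Chief Technology Officer']

    bins = defaultdict(list)
    for title in titles:
        bins[bin_key(title)].append(title)

    # Exhaustive pairwise comparison now happens only within each bin
    for key, members in bins.items():
        print key, '=>', members

With b roughly equal bins, the pairwise work drops from roughly n^2 comparisons to about n^2/b, at the cost of possibly separating items that a better binning function would have kept together.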
Dimensionality reduction is often as much art as it is science, and is frequently considered proprietary information or a trade secret by organizations that successfully employ it to gain a competitive advantage.

Techniques for clustering are a fundamental part of any legitimate data miner's tool belt, because in nearly any sector of any industry—ranging from defense intelligence to fraud detection at a bank to landscaping—there can be a truly immense amount of semi-standardized relational data that needs to be analyzed, and the rise of data scientist job opportunities over the previous years has been a testament to this.

What generally happens is that a company establishes a database for collecting some kind of information, but not every field is enumerated into some predefined universe of valid answers. Whether it's because the application's user interface logic wasn't designed properly, because some fields just don't lend themselves to having static predetermined values, or because it was critical to the user experience that users be allowed to enter whatever they'd like into a text box, the result is always the same: you eventually end up with a lot of semi-standardized data, or "dirty records." While there might be a total of N distinct string values for a particular field, some number of these string values will actually relate to the same concept. Duplicates can occur for various reasons—for example, misspellings, abbreviations or shorthand, and differences in the case of words.

Although it may not be obvious, this is exactly one of the classic situations we're faced with in mining LinkedIn data: LinkedIn members are able to enter their professional information as free text, which results in a certain amount of unavoidable variation. For example, if you wanted to examine your professional network and try to determine where most of your connections work, you'd need to consider common variations in company names. Even the simplest of company names has a few common variations you'll almost certainly encounter. For example, it should be obvious to most people that "Google" is an abbreviated form of "Google, Inc.," but even these kinds of simple variations in naming conventions must be explicitly accounted for during standardization efforts. In standardizing company names, a good starting point is to first consider suffixes such as LLC and Inc.

3.3.1. Clustering Enhances User Experiences

Simple clustering techniques can create incredibly compelling user experiences by leveraging results even as simple as the job title ones we just produced. Figure 3-3 demonstrates a powerful alternative view of your data via a simple tree widget that could be used as part of a navigation pane or faceted display for filtering search criteria. Assuming that the underlying similarity metrics you've chosen have produced meaningful clusters, a simple hierarchical display that presents data in logical groups with a count of each group's items can streamline the process of finding information and power intuitive workflows for almost any application where a lot of skimming would otherwise be required to find the results.

The code for creating a faceted display from your LinkedIn connections is included as a turnkey example with the IPython Notebook for this chapter.
Figure 3-3. Intelligently clustered data lends itself to faceted displays and compelling user experiences.

The code to create a simple navigational display can be surprisingly simple, given the maturity of Ajax toolkits and other UI libraries, and there's incredible value in being able to create user experiences that present data in intuitive ways that power workflows. Something as simple as an intelligently crafted hierarchical display can inadvertently motivate users to spend more time on a site, discover more information than they normally would, and ultimately realize more value in the services the site offers.
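As a hedged illustration of just how little code such a display can take, a text-mode facsimile of Figure 3-3 is little more than a loop over grouped counts; the clusters dictionary below is an assumed stand-in for real clustering output, not the notebook's turnkey example:

    # Assumed output of a clustering step: facet label -> member titles
    clusters = {
        'Chief Executive Officer': ['CEO', 'President/CEO', 'Chief Executive Officer'],
        'Software Engineer': ['Software Engineer', 'Senior Software Engineer'],
    }

    # Render a simple tree: each facet shows its item count, then its members
    for label in sorted(clusters):
        members = clusters[label]
        print '%s (%d)' % (label, len(members))
        for member in members:
            print '    %s' % member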
3.3.2. Normalizing Data to Enable Analysis

As a necessary and helpful interlude toward building a working knowledge of clustering algorithms, let's explore a few of the common situations you may face in normalizing LinkedIn data. In this section, we'll implement a common pattern for normalizing company names and job titles. As a more advanced exercise, we'll also briefly divert and discuss the problem of disambiguating and geocoding geographic references from LinkedIn profile information. (In other words, we'll attempt to convert labels from LinkedIn profiles such as "Greater Nashville Area" to coordinates that can be plotted on a map.)

The chief artifact of data normalization efforts is that you can count and analyze important features of the data and enable advanced data mining techniques such as clustering. In the case of LinkedIn data, we'll be examining entities such as companies, job titles, and geographic locations.

3.3.2.1. Normalizing and counting companies

Let's take a stab at standardizing company names from your professional network. Recall that the two primary ways you can access your LinkedIn data are either by using the LinkedIn API to programmatically retrieve the relevant fields or by employing a slightly lesser-known mechanism that allows you to export your professional network as address book data, which includes basic information such as name, job title, company, and contact information.

Assuming you have a CSV file of contacts that you've exported from LinkedIn, you could normalize and display selected entities from a histogram, as illustrated in Example 3-6. As you'll notice in the opening comments of code listings such as Example 3-6, you'll need to copy and rename the CSV file of your LinkedIn connections that you exported to a particular directory in your source code checkout, per the guidance provided in Section 3.2.2.

Example 3-6. Simple normalization of company suffixes from address book data

    import os
    import csv
    from collections import Counter
    from operator import itemgetter
    from prettytable import PrettyTable

    # XXX: Place your "Outlook CSV" formatted file of connections from
    # http://www.linkedin.com/people/export-settings at the following
    # location: resources/ch03-linkedin/my_connections.csv

    CSV_FILE = os.path.join("resources", "ch03-linkedin", 'my_connections.csv')

    # Define a set of transforms that converts the first item
    # to the second item. Here, we're simply handling some
    # commonly known abbreviations, stripping off common suffixes, etc.

    transforms = [(', Inc.', ''), (', Inc', ''), (', LLC', ''), (', LLP', ''),
                  (' LLC', ''), (' Inc.', ''), (' Inc', '')]

    csvReader = csv.DictReader(open(CSV_FILE), delimiter=',', quotechar='"')
    contacts = [row for row in csvReader]
    companies = [c['Company'].strip() for c in contacts if c['Company'].strip() != '']

    for i, _ in enumerate(companies):
        for transform in transforms:
            companies[i] = companies[i].replace(*transform)

    pt = PrettyTable(field_names=['Company', 'Freq'])
    pt.align = 'l'
    c = Counter(companies)

    [pt.add_row([company, freq])
     for (company, freq) in sorted(c.items(), key=itemgetter(1), reverse=True)
         if freq > 1]

    print pt

The following illustrates typical results for frequency analysis:

    +---------------------------+------+
    | Company                   | Freq |
    +---------------------------+------+
    | Digital Reasoning Systems | 31   |
    | O'Reilly Media            | 19   |
    | Google                    | 18   |
    | Novetta Solutions         | 9    |
    | Mozilla Corporation       | 9    |
    | Booz Allen Hamilton       | 8    |
    | ...                       | ...  |
    +---------------------------+------+

Python allows you to pass arguments to a function by dereferencing a list and dictionary as parameters, which is sometimes convenient, as illustrated in Example 3-6. For example, calling f(*args, **kw) is equivalent to calling f(1, 7, x=23) so long as args is defined as [1, 7] and kw is defined as {'x': 23}. See Appendix C for more Python tips.

Keep in mind that you'll need to get a little more sophisticated to handle more complex situations, such as the various manifestations of company names—like O'Reilly Media—that have evolved over the years. For example, you might see this company's name represented as O'Reilly & Associates, O'Reilly Media, O'Reilly, Inc., or just O'Reilly. (If you think this is starting to sound complicated, just consider the work taken on by Dun & Bradstreet, the "Who's Who" of company information, blessed with the challenge of maintaining a worldwide directory that identifies companies spanning multiple languages from all over the globe.)
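One hedged starting point for catching such variations is the standard library's difflib module, which scores how closely two strings resemble one another; the 0.8 cutoff and the sample names below are assumptions for illustration, not a rule from this chapter:

    from difflib import SequenceMatcher

    def is_same_company(a, b, threshold=0.8):
        # ratio() is 1.0 for identical strings and decreases as they diverge
        return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

    print is_same_company("O'Reilly Media", "O'Reilly Media, Inc.")  # True
    print is_same_company("O'Reilly Media", "Google")                # False

Any threshold like this trades false merges against missed merges, which is one reason the expert-review loop described in the next section is so common in practice.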
Let’s reuse the same concepts from normalizing company names to implement a pattern for normalizing common job titles and then perform a basic frequency analysis on those titles as an initial basis for clus‐ tering. Assuming you have a reasonable number of exported contacts, the minor nuances among job titles that you’ll encounter may actually be surprising, but before we get into that, let’s introduce some sample code that establishes some patterns for normalizing record data and takes a basic inventory sorted by frequency. Example 3-7 inspects job titles and prints out frequency information for the titles them‐ selves and for individual tokens that occur in them. Example 3-7. Standardizing common job titles and computing their frequencies import os import csv from operator import itemgetter from collections import Counter from prettytable import PrettyTable XXX: Place your "Outlook CSV" formatted file of connections from http://www.linkedin.com/people/export-settings at the following location: resources/ch03-linkedin/my_connections.csv CSV_FILE = os.path.join("resources", "ch03-linkedin", 'my_connections.csv') transforms = ('Sr.', 'Senior'), ('Sr', 'Senior'), ('Jr.', 'Junior'), ('Jr', 'Junior'), ('CEO', 'Chief Executive Officer'), ('COO', 'Chief Operating Officer'), ('CTO', 'Chief Technology Officer'), ('CFO', 'Chief Finance Officer'), ('VP', 'Vice President'), csvReader = csv.DictReader(open(CSV_FILE), delimiter=',', quotechar='"') contacts = row for row in csvReader Read in a list of titles and split apart any combined titles like "President/CEO." Other variations could be handled as well, such as "President & CEO", "President and CEO", etc. titles = for contact in contacts: titles.extend(t.strip() for t in contact'Job Title'.split('/') if contact'Job Title'.strip() = '') 104 Chapter 3: Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More a Replace common/known abbreviations for i, _ in enumerate(titles): for transform in transforms: titlesi = titlesi.replace(transform) Print out a table of titles sorted by frequency pt = PrettyTable(field_names='Title', 'Freq') pt.align = 'l' c = Counter(titles) pt.add_row(title, freq) for (title, freq) in sorted(c.items(), key=itemgetter(1), reverse=True) if freq 1 print pt Print out a table of tokens sorted by frequency tokens = for title in titles: tokens.extend(t.strip(',') for t in title.split()) pt = PrettyTable(field_names='Token', 'Freq') pt.align = 'l' c = Counter(tokens) pt.add_row(token, freq) for (token, freq) in sorted(c.items(), key=itemgetter(1), reverse=True) if freq 1 and len(token) 2 print pt In short, the code reads in CSV records and makes a mild attempt at normalizing them by splitting apart combined titles that use the forward slash (like a title of “President/ CEO”) and replacing known abbreviations. Beyond that, it just displays the results of a frequency distribution of both full job titles and individual tokens contained in the job titles. This is not all that different from the previous exercise with company names, but it serves as a useful starting template and provides you with some reasonable insight into how the data breaks down. Sample results follow: +-++ Title Freq +-++ Chief Executive Officer 19 Senior Software Engineer 17 President 12 Founder 9 ... ... +-++ 3.3. Crash Course on Clustering Data 105 a+-++ Token Freq +-++ Engineer 43 Chief 43 Senior 42 Officer 37 ... ... 
One thing that's notable about the sample results is that the most common job title based on exact matches is "Chief Executive Officer," which is closely followed by other senior positions such as "President" and "Founder." Hence, the ego of this professional network has reasonably good access to entrepreneurs and business leaders. The most common tokens from within the job titles are "Engineer" and "Chief." The "Chief" token correlates back to the previous thought about connections to higher-ups in companies, while the token "Engineer" provides a slightly different clue into the nature of the professional network. Although "Engineer" is not a constituent token of the most common job title, it does appear in a large number of job titles such as "Senior Software Engineer" and "Software Engineer," which show up near the top of the job titles list. Therefore, the ego of this network appears to have connections to technical practitioners as well.

In job title or address book data analysis, this is precisely the kind of insight that motivates the need for an approximate matching or clustering algorithm. The next section investigates further.

3.3.2.3. Normalizing and counting locations

Although LinkedIn includes a general geographic region that usually corresponds to a metropolitan area for each of your connections, this label is not specific enough that it can be pinpointed on a map without some additional work. Knowing that someone works in the "Greater Nashville Area" is useful, and as human beings with additional knowledge, we know that this label probably refers to the Nashville, Tennessee metro area. However, writing code to transform "Greater Nashville Area" to a set of coordinates that you could render on a map can be trickier than it sounds, particularly when the human-readable label for a region is especially common.

As a generalized problem, disambiguating geographic references is quite difficult. The population of New York City might be high enough that you can reasonably infer that "New York" refers to New York City, New York, but what about "Smithville"? There are hundreds of Smithvilles in the United States, and with most states having several of them, geographic context beyond the surrounding state is needed to make the right determination. It won't be the case that a highly ambiguous place like "Greater Smithville Area" is something you'll see on LinkedIn, but it serves to illustrate the general problem of disambiguating a geographic reference so that it can be resolved to a specific set of coordinates.

Disambiguating and geocoding the whereabouts of LinkedIn connections is slightly easier than the most generalized form of the problem, because most professionals tend to identify with the larger metropolitan area that they're associated with, and there are a relatively finite number of these regions. Although not always the case, you can generally employ the crude assumption that the location referred to in a LinkedIn profile is a relatively well-known location and is likely to be the "most popular" metropolitan region by that name.

You can install a Python package called geopy via pip install geopy; it provides a generalized mechanism for passing in labels for locations and getting back lists of coordinates that might match.
The geopy package itself is a proxy to multiple web services providers, such as Bing and Google, that perform the geocoding, and an advantage of using it is that it provides a standardized API for interfacing with various geocoding services so that you don't have to manually craft requests and parse responses. The geopy GitHub code repository is a good starting point for reading the documentation that's available online.

Example 3-8 illustrates how to use geopy with Microsoft's Bing, which offers a generous number of API calls for accounts that fall under educational usage guidelines that apply to situations such as learning from this book. To run the script, you will need to request an API key from Bing.

Bing is the recommended geocoder for exercises in this book with geopy, because at the time of this writing the Yahoo geocoding service was not operational due to some changes in product strategy resulting in the creation of a new product called Yahoo BOSS Geo Services. Although the Google Maps (v3) API was operational, its maximum number of requests per day seemed less ideal than that offered by Bing.

Example 3-8. Geocoding locations with Microsoft Bing

    from geopy import geocoders

    GEO_APP_KEY = ''  # XXX: Get this from https://www.bingmapsportal.com

    g = geocoders.Bing(GEO_APP_KEY)

    print g.geocode("Nashville", exactly_one=False)

The keyword parameter exactly_one=False tells the geocoder not to trigger an error if there is more than one possible result, which is more common than you might imagine. Sample results from this script follow and illustrate the nature of using an ambiguous label like "Nashville" to resolve a set of coordinates:

    [(u'Nashville, TN, United States', (36.16783905029297, -86.77816009521484)),
     (u'Nashville, AR, United States', (33.94792938232422, -93.84703826904297)),
     (u'Nashville, GA, United States', (31.206039428710938, -83.25031280517578)),
     (u'Nashville, IL, United States', (38.34368133544922, -89.38263702392578)),
     (u'Nashville, NC, United States', (35.97433090209961, -77.96495056152344))]

The Bing geocoding service appears to return the most populous locations first in the list of results, so we'll opt to simply select the first item in the list as our response, given that LinkedIn generally exposes locations in profiles as large metropolitan areas. However, before we'll be able to geocode, we'll have to return to the problem of data normalization, because passing a value such as "Greater Nashville Area" to the geocoder won't return a response to us. (Try it and see for yourself.) As a pattern, we can transform locations such that common prefixes and suffixes are routinely stripped, as illustrated in Example 3-9.
Example 3-9. Geocoding locations of LinkedIn connections with Microsoft Bing

    from geopy import geocoders

    GEO_APP_KEY = ''  # XXX: Get this from https://www.bingmapsportal.com

    g = geocoders.Bing(GEO_APP_KEY)

    transforms = [('Greater ', ''), (' Area', '')]

    results = {}
    for c in connections['values']:
        if not c.has_key('location'):
            continue
        transformed_location = c['location']['name']
        for transform in transforms:
            transformed_location = transformed_location.replace(*transform)
        geo = g.geocode(transformed_location, exactly_one=False)
        if geo == []:
            continue
        results.update({ c['location']['name'] : geo })

    print json.dumps(results, indent=1)

Sample results from the geocoding exercise follow:

    {
     "Greater Chicago Area": [
      [
       "Chicago, IL, United States",
       [
        41.884151458740234,
        -87.63240814208984
       ]
      ]
     ],
     "Greater Boston Area": [
      [
       "Boston, MA, United States",
       [
        42.3586311340332,
        -71.05670166015625
       ]
      ]
     ],
     ...
    }
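To close the loop on the "where do most of your connections reside?" question posed at the start of the chapter, the following hedged sketch (which assumes the results dictionary from Example 3-9 and the connections data from Example 3-2 are in scope) selects the first, most populous match for each label and tallies connections per resolved metro area:

    from collections import Counter

    # Map each LinkedIn location label to its first (most populous) match
    resolved = dict((label, geo[0][0]) for (label, geo) in results.items())

    # Tally connections by resolved metro area
    tally = Counter(resolved[c['location']['name']]
                    for c in connections['values']
                    if c.has_key('location') and c['location']['name'] in resolved)

    for city, freq in tally.most_common():
        print city, freq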