Data mining using Google

How to do data mining with Google and Google's web data mining tools
MattGates, United States, Professional
Published: 02-08-2017
Mining Google+: Computing Document Similarity, Extracting Collocations, and More

Google+ initially serves as our primary source of data for this chapter because it's inherently social, features content that's often expressed as longer-form notes that resemble blog entries, and is now an established staple in the social web. It's not hard to make a case that Google+ activities (notes) fill a niche somewhere between Twitter and blogs. The concepts from previous chapters could equally be applied to the data behind Google+'s diverse and powerful API, although we'll opt to leave regurgitations of those examples (mostly) as exercises for the motivated reader. Wherever possible we won't reinvent the wheel and implement analysis tools from scratch, but we will take a couple of "deep dives" when particularly foundational topics come up that are essential to an understanding of text mining. The Natural Language Toolkit (NLTK) is a powerful technology that you may recall from Chapter 3; it provides many of the tools we'll use in this chapter. Its rich suites of APIs can be a bit overwhelming at first, but don't worry: while text analytics is an incredibly diverse and complex field of study, there are lots of powerful fundamentals that can take you a long way without too significant an investment. This chapter and the chapters after it aim to home in on those fundamentals. (A full-blown introduction to NLTK is outside the scope of this book, but you can review the full text of Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit (O'Reilly) at the NLTK website.)

Always get the latest bug-fixed source code for this chapter (and every other chapter) online at http://bit.ly/MiningTheSocialWeb2E. Be sure to also take advantage of this book's virtual machine experience, as described in Appendix A, to maximize your enjoyment of the sample code.

4.1. Overview

This chapter uses Google+ to begin our journey in analyzing human language data. In this chapter you'll learn about:

• The Google+ API and making API requests
• TF-IDF (Term Frequency-Inverse Document Frequency), a fundamental technique for analyzing words in documents
• How to apply NLTK to the problem of understanding human language
• How to apply cosine similarity to common problems such as querying documents by keyword
• How to extract meaningful phrases from human language data by detecting collocation patterns

4.2. Exploring the Google+ API

Anyone with a Gmail account can trivially create a Google+ account and start collaborating with friends. From a product standpoint, Google+ has evolved rapidly and adopted some of the most compelling features of existing social network platforms such as Twitter and Facebook while carving out its own set of unique capabilities. As with the other social websites featured in this book, a full overview of Google+ isn't in scope for this chapter, but you can easily read about it (or sign up) online. The best way to learn is by creating an account and spending some time exploring. For the most part, where there's a feature in the user interface, there's an API that provides that feature for you to tap into.
Suffice it to say that Google+ has leveraged tried-and-true features of existing social networks, such as marking content with hashtags and maintaining a profile according to customizable privacy settings, with additional novelties such as a fresh take on content sharing called circles, video chats called hangouts, and extensive integration with other Google services such as Gmail contacts. In Google+ API parlance, social interactions are framed in terms of people, activities, comments, and moments.

The API documentation that's available online is always the definitive source of guidance, but a brief overview may be helpful to get you thinking about how Google+ compares to another platform such as Twitter or Facebook:

People
People are Google+ users. Programmatically, you'll either discover users by using the search API, look them up by a personalized URL if they're a celebrity type,[2] or strip their Google+ IDs out of the URLs that appear in your web browser and use them for exploring their profiles. Later in this chapter, we'll use the search API to find users on Google+.

Activities
Activities are the things that people do on Google+. An activity is essentially a note and can be as long or short as the author likes: it can be as long as a blog post, or it can be devoid of any real textual meaning (e.g., just used to share links or multimedia content). Given a Google+ user, you can easily retrieve a list of that person's activities, and we'll do that later in this chapter. Like a tweet, an activity contains a lot of fascinating metadata, such as the number of times the activity has been reshared.

Comments
Leaving comments is the way Google+ users interact with one another. Simple statistical analysis of comments on Google+ could be very interesting and potentially reveal a lot of insights into a person's social circles or the virality of content. For example, which other Google+ users most frequently comment on activities? Which activities have the highest numbers of comments (and why)?

Moments
Moments are a relatively recent addition to Google+ and represent a way of capturing interactions between a user and a Google+ application. Moments are similar to Facebook's social graph stories in that they are designed to capture and create opportunities for user interaction with an application that can be displayed on a timeline. For example, if you were to make a purchase in an application, upload a photo, or watch a YouTube video, it could be captured as a moment (something you did in time) and displayed in a history of your actions or shared with friends in an activity stream by the application.

[2] Google+ only offered personalized URLs such as https://plus.google.com/+TimOReilly to well-known individuals at the outset, but is in the process of making them available to all users. Until you are either selected by Google or eligible to apply, you'll have to continue using your more arcane URL with a numeric identifier in it (such as https://plus.google.com/107033731246200681024).

For the purposes of this chapter, we'll be focusing on harvesting and analyzing Google+ activity data that is textual and intended to convey the same kind of meaning that you might encounter in a tweet, blog post, or Facebook status update. In other words, we'll be trying to analyze human language data.
If you haven't signed up for Google+ yet, it's worth taking the time to do so, as well as spending a few moments to familiarize yourself with a Google+ profile. One of the easiest ways to find specific users on Google+ is to just search for them at http://plus.google.com, and explore the platform from there.

As you explore Google+, bear in mind that it has a unique set of capabilities and is a little awkward to compare directly to other social web properties. If you're expecting it to be a straightforward comparison, you might find yourself a bit surprised. It's similar to Twitter in that it provides a "following" model where you can add individuals to one of your Google+ circles and keep up with their happenings without any approval on their part, but the Google+ platform also offers rich integration with other Google web properties, sports a powerful feature for videoconferencing (hangouts), and has an API similar to Facebook's in the way that users share content and interact. Without further ado, let's get busy exploring the API and work toward mining some data.

4.2.1. Making Google+ API Requests

From a software development standpoint, Google+ leverages OAuth like the rest of the social web to enable an application that you'll build to access data on a user's behalf, so you'll need to register an application to get appropriate credentials for accessing the Google+ platform. The Google API Console provides a means of registering an application (called a project in the Google API Console) to get OAuth credentials, but it also exposes an API key that you can use for "simple API access." This API key is what we'll use in this chapter to programmatically access the Google+ platform and just about every other Google service.

Once you've created an application, you'll also need to specifically enable it to use Google+ as a separate step. Figure 4-1 provides a screenshot of the Google API Console as well as a view that demonstrates what it looks like when you have enabled your application for Google+ API access.

Figure 4-1. Registering an application with the Google API Console to gain API access to Google services; don't forget to enable Google+ API access as one of the service options

You can install a Python package called google-api-python-client for accessing Google's API via pip install google-api-python-client. This is one of the standard Python-based options for accessing Google+ data. The online documentation for google-api-python-client is marginally helpful in familiarizing yourself with the capabilities of what it offers, but in general, you'll just be plugging parameters from the official Google+ API documents into some predictable access patterns with the Python package. Once you've walked through a couple of exercises, it's a relatively straightforward process.

Don't forget that pydoc can be helpful for gathering clues about a package, class, or method in a terminal as you are learning it. The help function in a standard Python interpreter is also useful. Recall that appending ? to a method name in IPython is a shortcut for displaying its docstring.

As an initial exercise, let's consider the problem of locating a person on Google+. Like any other social web API, the Google+ API offers a means of searching, and in particular we'll be interested in the People: search API. Example 4-1 illustrates how to search for a person with the Google+ API.
Since Tim O'Reilly is a well-known personality with an active and compelling Google+ account, let's look him up.

The basic pattern that you'll repeatedly use with the Python client is to create an instance of a service that's parameterized for Google+ access with your API key that you can then instantiate for particular platform services. Here, we create a connection to the People API by invoking service.people() and then chaining on some additional API operations deduced from reviewing the API documentation online. In a moment we'll query for activity data, and you'll see that the same basic pattern holds.

Example 4-1. Searching for a person with the Google+ API

import httplib2
import json
import apiclient.discovery # pip install google-api-python-client

# XXX: Enter any person's name
Q = "Tim O'Reilly"

# XXX: Enter in your API key from https://code.google.com/apis/console
API_KEY = ''

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(),
                                    developerKey=API_KEY)

people_feed = service.people().search(query=Q).execute()

print json.dumps(people_feed['items'], indent=1)

Following are sample results for searching for Tim O'Reilly:

[
 {
  "kind": "plus#person",
  "displayName": "Tim O'Reilly",
  "url": "https://plus.google.com/+TimOReilly",
  "image": {
   "url": "https://lh4.googleusercontent.com/-J8..."
  },
  "etag": "\"WIBkkymG3C8dXBjiaEVMpCLNTTs/wwgOCMn...\"",
  "id": "107033731246200681024",
  "objectType": "person"
 },
 {
  "kind": "plus#person",
  "displayName": "Tim O'Reilly",
  "url": "https://plus.google.com/11566571170551...",
  "image": {
   "url": "https://lh3.googleusercontent.com/-yka..."
  },
  "etag": "\"WIBkkymG3C8dXBjiaEVMpCLNTTs/0z-EwRK7...\"",
  "id": "115665711705516993369",
  "objectType": "person"
 },
 ...
]

The results do indeed return a list of people named Tim O'Reilly, but how can we tell which one of these results refers to the well-known Tim O'Reilly of technology fame that we are looking for? One option would be to request profile or activity information for each of these results and try to disambiguate them manually. Another option is to render the avatars included in each of the results as images within IPython Notebook, which is trivial to do. Example 4-2 illustrates how to display avatars and the corresponding ID values for each search result by generating HTML and rendering it inline as a result in the notebook.

Example 4-2. Displaying Google+ avatars in IPython Notebook provides a quick way to disambiguate the search results and discover the person you are looking for

from IPython.core.display import HTML

html = []

for p in people_feed['items']:
    html += ['<p><img src="%s" /> %s: %s</p>' % \
             (p['image']['url'], p['id'], p['displayName'])]

HTML(''.join(html))

Sample results are displayed in Figure 4-2 and provide the "quick fix" that we're looking for in our search for the particular Tim O'Reilly of O'Reilly Media.

Figure 4-2. Rendering Google+ avatars as images allows you to quickly scan the search results to disambiguate the person you are looking for

Although there's a multiplicity of things we could do with the People API, our focus in this chapter is on an analysis of the textual content in accounts, so let's turn our attention to the task of retrieving activities associated with this account.
As you're about to find out, Google+ activities are the linchpin of Google+ content, containing a variety of rich content associated with the account and providing logical pivots to other platform objects such as comments. To get some activities, we'll need to tweak the design pattern we applied for searching for people, as illustrated in Example 4-3.

Example 4-3. Fetching recent activities for a particular Google+ user

import httplib2
import json
import apiclient.discovery

USER_ID = '107033731246200681024' # Tim O'Reilly

# XXX: Re-enter your API_KEY from https://code.google.com/apis/console
# if not currently set
API_KEY = ''

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(),
                                    developerKey=API_KEY)

activity_feed = service.activities().list(
  userId=USER_ID,
  collection='public',
  maxResults='100' # Max allowed per API
).execute()

print json.dumps(activity_feed, indent=1)

Sample results for the first item in the results (activity_feed['items'][0]) follow and illustrate the basic nature of a Google+ activity:

{
 "kind": "plus#activity",
 "provider": {
  "title": "Google+"
 },
 "title": "This is the best piece about privacy that I've read in a ...",
 "url": "https://plus.google.com/107033731246200681024/posts/78UeZ1jdRsQ",
 "object": {
  "resharers": {
   "totalItems": 191,
   "selfLink": "https://www.googleapis.com/plus/v1/activities/z125xvy..."
  },
  "attachments": [
   {
    "content": "Many governments (including our own, here in the US) ...",
    "url": "http://www.zdziarski.com/blog/?p=2155",
    "displayName": "On Expectation of Privacy Jonathan Zdziarski's Domain",
    "objectType": "article"
   }
  ],
  "url": "https://plus.google.com/107033731246200681024/posts/78UeZ1jdRsQ",
  "content": "This is the best piece about privacy that I&#39;ve read ...",
  "plusoners": {
   "totalItems": 356,
   "selfLink": "https://www.googleapis.com/plus/v1/activities/z125xvyid..."
  },
  "replies": {
   "totalItems": 48,
   "selfLink": "https://www.googleapis.com/plus/v1/activities/z125xvyid..."
  },
  "objectType": "note"
 },
 "updated": "2013-04-25T14:46:16.908Z",
 "actor": {
  "url": "https://plus.google.com/107033731246200681024",
  "image": {
   "url": "https://lh4.googleusercontent.com/-J8nmMwIhpiA/AAAAAAAAAAI/A..."
  },
  "displayName": "Tim O'Reilly",
  "id": "107033731246200681024"
 },
 "access": {
  "items": [
   {
    "type": "public"
   }
  ],
  "kind": "plus#acl",
  "description": "Public"
 },
 "verb": "post",
 "etag": "\"WIBkkymG3C8dXBjiaEVMpCLNTTs/d-ppAzuVZpXrW_YeLXc5ctstsCM\"",
 "published": "2013-04-25T14:46:16.908Z",
 "id": "z125xvyidpqjdtol423gcxizetybvpydh"
}

Each activity object follows a three-tuple pattern of the form (actor, verb, object). In this post, the tuple (Tim O'Reilly, post, note) tells us that this particular item in the results is a note, which is essentially just a status update with some textual content. A closer look at the result reveals that the content is something that Tim O'Reilly feels strongly about, as indicated by the title "This is the best piece about privacy that I've read in a long time," and hints that the note is active, as evidenced by the number of reshares and comments.
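To make the (actor, verb, object) pattern concrete, here is a minimal sketch (my own addition, not one of the book's numbered examples) that assumes the activity_feed response from Example 4-3 is still in memory and prints the triple for each activity; all of the fields it reads appear in the sample output above.

for activity in activity_feed['items']:
    # Each activity reduces to an (actor, verb, object type) triple
    print (activity['actor']['displayName'],
           activity['verb'],
           activity['object']['objectType'])

# e.g., the first item corresponds to (Tim O'Reilly, post, note)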
If you reviewed the output carefully, you may have noticed that the content field for the activity contains HTML markup, as evidenced by the HTML entity I&#39;ve that appears. In general, you should assume that the textual data exposed as Google+ activities contains some basic markup—such as <br /> tags and escaped HTML entities for apostrophes—so as a best practice we need to do a little bit of additional filtering to clean it up.

Example 4-4 provides an example of how to distill plain text from the content field of a note by introducing a function called cleanHtml. It takes advantage of a clean_html function provided by NLTK and another handy package for manipulating HTML, called BeautifulSoup, that converts HTML entities back to plain text. If you haven't already encountered BeautifulSoup, it's a package that you won't want to live without once you've added it to your toolbox—it has the ability to process HTML in a reasonable way even if it is invalid and violates standards or other reasonable expectations (à la web data). You should install these packages via pip install nltk beautifulsoup4 if you haven't already.

Example 4-4. Cleaning HTML in Google+ content by stripping out HTML tags and converting HTML entities back to plain-text representations

from nltk import clean_html
from BeautifulSoup import BeautifulStoneSoup

# clean_html removes tags and
# BeautifulStoneSoup converts HTML entities

def cleanHtml(html):
    if html == "": return ""

    return BeautifulStoneSoup(clean_html(html),
                              convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

print activity_feed['items'][0]['object']['content']
print
print cleanHtml(activity_feed['items'][0]['object']['content'])

The output from the note's content, once cleansed with cleanHtml, is very clean text that can be processed without additional concerns about noise in the content. As we'll learn in this chapter and follow-on chapters about text mining, reduction of noise in text content is a critical aspect of improving accuracy. The before and after content follows.

Here's the raw content in activity_feed['items'][0]['object']['content']:

This is the best piece about privacy that I&#39;ve read in a long time If it doesn&#39;t change how you think about the privacy issue, I&#39;ll be surprised. It opens:<br /><br />"Many governments (including our own, here in the US) would have its citizens believe that privacy is a switch (that is, you either reasonably expect it, or you don’t). This has been demonstrated in many legal tests, and abused in many circumstances ranging from spying on electronic mail, to drones in our airspace monitoring the movements of private citizens. But privacy doesn’t work like a switch – at least it shouldn’t for a country that recognizes that privacy is an inherent right. In fact, privacy, like other components to security, works in layers..."<br /><br /> Please read

And here's the content rendered after cleansing with cleanHtml(activity_feed['items'][0]['object']['content']):

This is the best piece about privacy that I've read in a long time If it doesn't change how you think about the privacy issue, I'll be surprised. It opens: "Many governments (including our own, here in the US) would have its citizens believe that privacy is a switch (that is, you either reasonably expect it, or you don’t). This has been demonstrated in many legal tests, and abused in many circumstances ranging from spying on electronic mail, to drones in our airspace monitoring the movements of private citizens. But privacy doesn’t work like a switch – at least it shouldn’t for a country that recognizes that privacy is an inherent right.
In fact, privacy, like other components to security, works in layers..." Please read

The ability to query out clean text from Google+ is the basis for the remainder of the text mining exercises in this chapter, but one additional consideration that you may find useful before we focus our attention elsewhere is a pattern for fetching multiple pages of content.

Whereas the previous example fetched 100 activities, the maximum number of results for a query, it may be the case that you'll want to iterate over an activities feed and retrieve more than the maximum number of activities per page. The pattern for pagination is outlined in the HTTP API Overview, and the Python client wrapper takes care of most of the hassle. Example 4-5 shows how to fetch multiple pages of activities and distill the text from them if they are notes and have meaningful content.

Example 4-5. Looping over multiple pages of Google+ activities and distilling clean text from notes

import os
import httplib2
import json
import apiclient.discovery
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

USER_ID = '107033731246200681024' # Tim O'Reilly

# XXX: Re-enter your API_KEY from https://code.google.com/apis/console
# if not currently set
API_KEY = ''

MAX_RESULTS = 200 # Will require multiple requests

def cleanHtml(html):
    if html == "": return ""

    return BeautifulStoneSoup(clean_html(html),
                              convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(),
                                    developerKey=API_KEY)

activity_feed = service.activities().list(
  userId=USER_ID,
  collection='public',
  maxResults='100' # Max allowed per request
)

activity_results = []

while activity_feed != None and len(activity_results) < MAX_RESULTS:

    activities = activity_feed.execute()

    if 'items' in activities:

        for activity in activities['items']:

            if activity['object']['objectType'] == 'note' and \
               activity['object']['content'] != '':

                activity['title'] = cleanHtml(activity['title'])
                activity['object']['content'] = cleanHtml(activity['object']['content'])
                activity_results += [activity]

    # list_next requires the previous request and response objects
    activity_feed = service.activities().list_next(activity_feed, activities)

# Write the output to a file for convenience
f = open(os.path.join('resources', 'ch04-googleplus', USER_ID + '.json'), 'w')
f.write(json.dumps(activity_results, indent=1))
f.close()

print str(len(activity_results)), "activities written to", f.name

With the know-how to explore the Google+ API and fetch some interesting human language data from activities' content, let's now turn our attention to the problem of analyzing the content.

4.3. A Whiz-Bang Introduction to TF-IDF

Although rigorous approaches to natural language processing (NLP) that include such things as sentence segmentation, tokenization, word chunking, and entity detection are necessary in order to achieve the deepest possible understanding of textual data, it's helpful to first introduce some fundamentals from information retrieval theory. The remainder of this chapter introduces some of its more foundational aspects, including TF-IDF, the cosine similarity metric, and some of the theory behind collocation detection. Chapter 5 provides a deeper discussion of NLP as a logical continuation of this discussion.
If you want to dig deeper into IR theory, the full text of Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze's Introduction to Information Retrieval (Cambridge University Press) is available online and provides more information than you could (probably) ever want to know about the field.

Information retrieval is an extensive field with many specialties. This discussion narrows in on TF-IDF, one of the most fundamental techniques for retrieving relevant documents from a corpus (collection). TF-IDF stands for term frequency-inverse document frequency and can be used to query a corpus by calculating normalized scores that express the relative importance of terms in the documents.

Mathematically, TF-IDF is expressed as the product of the term frequency and the inverse document frequency, tf-idf = tf × idf, where the term tf represents the importance of a term in a specific document, and idf represents the importance of a term relative to the entire corpus. Multiplying these terms together produces a score that accounts for both factors and has been an integral part of every major search engine at some point in its existence. To get a more intuitive idea of how TF-IDF works, let's walk through each of the calculations involved in computing the overall score.

4.3.1. Term Frequency

For simplicity in illustration, suppose you have a corpus containing three sample documents in which terms are calculated by simply breaking on whitespace, as illustrated in Example 4-6 as ordinary Python code.

Example 4-6. Sample data structures used in illustrations for the rest of this chapter

corpus = {
 'a' : "Mr. Green killed Colonel Mustard in the study with the candlestick. \
Mr. Green is not a very nice fellow.",
 'b' : "Professor Plum has a green plant in his study.",
 'c' : "Miss Scarlett watered Professor Plum's green plant while he was away \
from his office last week."
}

terms = {
 'a' : [ i.lower() for i in corpus['a'].split() ],
 'b' : [ i.lower() for i in corpus['b'].split() ],
 'c' : [ i.lower() for i in corpus['c'].split() ]
}

A term's frequency could simply be represented as the number of times it occurs in the text, but it is more commonly the case that you normalize it by taking into account the total number of terms in the text, so that the overall score accounts for document length relative to a term's frequency. For example, "green" (once normalized to lowercase) occurs twice in corpus['a'] and only once in corpus['b'], so corpus['a'] would produce a higher score if frequency were the only scoring criterion. However, if you normalize for document length, corpus['b'] would have a slightly higher term frequency score for "green" (1/9) than corpus['a'] (2/19), because corpus['b'] is shorter than corpus['a']—even though "green" occurs more frequently in corpus['a']. A common technique for scoring a compound query such as "Mr. Green" is to sum the term frequency scores for each of the query terms in each document, and return the documents ranked by the summed term frequency score.

Let's illustrate how term frequency works by querying our sample corpus for "Mr. Green," which would produce the normalized scores reported in Table 4-1 for each document.

Table 4-1. Sample term frequency scores for "Mr. Green"

Document       tf(mr.)   tf(green)   Sum
corpus['a']    2/19      2/19        4/19 (0.2105)
corpus['b']    0         1/9         1/9 (0.1111)
corpus['c']    0         1/16        1/16 (0.0625)
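As a quick check on these numbers, the following short sketch (my own addition, not one of the book's numbered examples; the helper name summed_tf is mine) reproduces the summed, normalized term frequency scores in Table 4-1, assuming the corpus dictionary from Example 4-6 is already defined:

def summed_tf(query_terms, doc):
    # Normalize each term count by document length so that longer documents
    # aren't unfairly favored
    tokens = [t.lower() for t in doc.split()]
    return sum(tokens.count(term.lower()) / float(len(tokens))
               for term in query_terms)

for label in sorted(corpus):
    print label, round(summed_tf(['mr.', 'green'], corpus[label]), 4)

# Prints:
# a 0.2105
# b 0.1111
# c 0.0625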
For this contrived example, a cumulative term frequency scoring scheme works out and returns corpus['a'] (the document that we'd expect it to return), since corpus['a'] is the only one that contains the compound token "Mr. Green." However, a number of problems could have emerged, because the term frequency scoring model looks at each document as an unordered collection of words. For example, queries for "Green Mr." or "Green Mr. Foo" would have returned the exact same scores as the query for "Mr. Green," even though neither of those phrases appears in the sample sentences. Additionally, there are a number of scenarios that we could easily contrive to illustrate fairly poor results from the term frequency ranking technique, given that trailing punctuation is not handled properly and that the context around tokens of interest is not taken into account by the calculations.

Considering term frequency alone turns out to be a common source of problems when scoring on a document-by-document basis, because it doesn't account for very frequent words, called stopwords,[3] that are common across many documents. In other words, all terms are weighted equally, regardless of their actual importance. For example, "the green plant" contains the stopword "the," which skews overall term frequency scores in favor of corpus['a'] because "the" appears twice in that document, as does "green." In contrast, in corpus['c'] "green" and "plant" each appear only once. Consequently, the scores would break down as shown in Table 4-2, with corpus['a'] ranked as more relevant than corpus['c'], even though intuition might lead you to believe that ideal query results probably shouldn't have turned out that way. (Fortunately, however, corpus['b'] still ranks highest.)

[3] Stopwords are words that appear frequently in text but usually relay little information. Common examples of stopwords are a, an, the, and other determiners.

Table 4-2. Sample term frequency scores for "the green plant"

Document       tf(the)   tf(green)   tf(plant)   Sum
corpus['a']    2/19      2/19        0           4/19 (0.2105)
corpus['b']    0         1/9         1/9         2/9 (0.2222)
corpus['c']    0         1/16        1/16        1/8 (0.125)

4.3.2. Inverse Document Frequency

Toolkits such as NLTK provide lists of stopwords that can be used to filter out terms such as and, a, and the, but keep in mind that there may be terms that evade even the best stopword lists and yet still are quite common to specialized domains. Although you can certainly customize a list of stopwords with domain knowledge, the inverse document frequency metric is a calculation that provides a generic normalization metric for a corpus. It works in the general case by accounting for the appearance of common terms across a set of documents by considering the total number of documents in which a query term ever appears.

The intuition behind this metric is that it produces a higher value if a term is somewhat uncommon across the corpus than if it is common, which helps to account for the problem with stopwords we just investigated. For example, a query for "green" in the corpus of sample documents should return a lower inverse document frequency score than a query for "candlestick," because "green" appears in every document while "candlestick" appears in only one.
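The following minimal sketch (again my own, not one of the book's numbered examples) makes that intuition concrete, assuming the corpus dictionary from Example 4-6 is still defined. It uses the same 1 + log(N/df) formulation that Example 4-7 introduces shortly, and it strips trailing periods so that "candlestick." in corpus['a'] matches the query term:

from math import log

def idf(term, docs):
    # df = the number of documents that contain the term at least once
    df = len([True for doc in docs
              if term.lower() in [t.lower().rstrip('.') for t in doc.split()]])
    return 1.0 + log(float(len(docs)) / df)

print idf('green', corpus.values())        # 1.0    -- appears in all three documents
print idf('candlestick', corpus.values())  # 2.0986 -- appears in only one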
Mathematically, the only nuance of interest for the inverse document frequency calculation is that a logarithm is used to reduce the result into a compressed range, since its usual application is in multiplying it against term frequency as a scaling factor. For reference, a logarithm function is shown in Figure 4-3; as you can see, the logarithm function grows very slowly as values for its domain increase, effectively "squashing" its input.

Figure 4-3. The logarithm function "squashes" a large range of values into a more compressed space—notice how slowly the y-values grow as the values of x increase

Table 4-3 provides inverse document frequency scores that correspond to the term frequency scores from the previous section. Example 4-7 in the next section presents source code that shows how to compute these scores. In the meantime, you can notionally think of the IDF score for a term as the logarithm of a quotient that is defined by the number of documents in the corpus divided by the number of texts in the corpus that contain the term. When viewing these tables, keep in mind that whereas a term frequency score is calculated on a per-document basis, an inverse document frequency score is computed on the basis of the entire corpus. Hopefully this makes sense given that its purpose is to act as a normalizer for common words across the entire corpus.

Table 4-3. Sample inverse document frequency scores for terms appearing in "mr. green" and "the green plant"

idf(mr.)                idf(green)           idf(the)                idf(plant)
1 + log(3/1) = 2.0986   1 + log(3/3) = 1.0   1 + log(3/1) = 2.0986   1 + log(3/2) = 1.4055

4.3.3. TF-IDF

At this point, we've come full circle and devised a way to compute a score for a multiterm query that accounts for the frequency of terms appearing in a document, the length of the document in which any particular term appears, and the overall uniqueness of the terms across documents in the entire corpus. We can combine the concepts behind term frequency and inverse document frequency into a single score by multiplying them together, so that TF-IDF = TF × IDF. Example 4-7 is a naive implementation of this discussion that should help solidify the concepts described. Take a moment to review it, and then we'll discuss a few sample queries.

Example 4-7. Running TF-IDF on sample data

from math import log

# XXX: Enter in a query term from the corpus variable
QUERY_TERMS = ['mr.', 'green']

def tf(term, doc, normalize=True):
    doc = doc.lower().split()
    if normalize:
        return doc.count(term.lower()) / float(len(doc))
    else:
        return doc.count(term.lower()) / 1.0

def idf(term, corpus):
    num_texts_with_term = len([True for text in corpus
                               if term.lower() in text.lower().split()])

    # tf-idf calc involves multiplying against a tf value less than 1,
    # so it's necessary to return a value greater than 1 for consistent scoring.
    # (Multiplying two values less than 1 returns a value less than each of them.)
    try:
        return 1.0 + log(float(len(corpus)) / num_texts_with_term)
    except ZeroDivisionError:
        return 1.0

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = \
    {'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. \
Mr. Green is not a very nice fellow.',
     'b': 'Professor Plum has a green plant in his study.',
     'c': "Miss Scarlett watered Professor Plum's green plant while he was away \
from his office last week."}
for (k, v) in sorted(corpus.items()):
    print k, ':', v
print

# Score queries by calculating cumulative tf_idf score for each term in query

query_scores = {'a': 0, 'b': 0, 'c': 0}
for term in [t.lower() for t in QUERY_TERMS]:
    for doc in sorted(corpus):
        print 'TF(%s): %s' % (doc, term), tf(term, corpus[doc])
    print 'IDF: %s' % (term, ), idf(term, corpus.values())
    print

    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], corpus.values())
        print 'TF-IDF(%s): %s' % (doc, term), score
        query_scores[doc] += score
    print

print "Overall TF-IDF scores for query '%s'" % (' '.join(QUERY_TERMS), )
for (doc, score) in sorted(query_scores.items()):
    print doc, score

Sample output follows:

a : Mr. Green killed Colonel Mustard in the study...
b : Professor Plum has a green plant in his study.
c : Miss Scarlett watered Professor Plum's green...

TF(a): mr. 0.105263157895
TF(b): mr. 0.0
TF(c): mr. 0.0
IDF: mr. 2.09861228867

TF-IDF(a): mr. 0.220906556702
TF-IDF(b): mr. 0.0
TF-IDF(c): mr. 0.0

TF(a): green 0.105263157895
TF(b): green 0.111111111111
TF(c): green 0.0625
IDF: green 1.0

TF-IDF(a): green 0.105263157895
TF-IDF(b): green 0.111111111111
TF-IDF(c): green 0.0625

Overall TF-IDF scores for query 'mr. green'
a 0.326169714597
b 0.111111111111
c 0.0625

Although we're working on a trivially small scale, the calculations involved work the same for larger data sets. Table 4-4 is a consolidated adaptation of the program's output for three sample queries that involve four distinct terms:

• "green"
• "mr. green"
• "the green plant"

Even though the IDF calculations for terms are computed on the basis of the entire corpus, they are displayed on a per-document basis so that you can easily verify TF-IDF scores by skimming a single row and multiplying two numbers. As you work through the query results, you'll find that it's remarkable just how powerful TF-IDF is, given that it doesn't account for the proximity or ordering of words in a document.

Table 4-4. Calculations involved in TF-IDF sample queries, as computed by Example 4-7

Document       tf(mr.)   tf(green)   tf(the)   tf(plant)
corpus['a']    0.1053    0.1053      0.1053    0
corpus['b']    0         0.1111      0         0.1111
corpus['c']    0         0.0625      0         0.0625

               idf(mr.)  idf(green)  idf(the)  idf(plant)
               2.0986    1.0         2.099     1.4055

Document       tf-idf(mr.)                tf-idf(green)            tf-idf(the)              tf-idf(plant)
corpus['a']    0.1053 × 2.0986 = 0.2209   0.1053 × 1.0 = 0.1053    0.1053 × 2.099 = 0.2209  0 × 1.4055 = 0
corpus['b']    0 × 2.0986 = 0             0.1111 × 1.0 = 0.1111    0 × 2.099 = 0            0.1111 × 1.4055 = 0.1562
corpus['c']    0 × 2.0986 = 0             0.0625 × 1.0 = 0.0625    0 × 2.099 = 0            0.0625 × 1.4055 = 0.0878

The same results for each query are shown in Table 4-5, with the TF-IDF values summed on a per-document basis.

Table 4-5. Summed TF-IDF values for sample queries as computed by Example 4-7 (values in bold are the maximum scores for each of the three queries)

Query             corpus['a']                    corpus['b']                    corpus['c']
green             0.1053                         0.1111                         0.0625
Mr. Green         0.2209 + 0.1053 = 0.3262       0 + 0.1111 = 0.1111            0 + 0.0625 = 0.0625
the green plant   0.2209 + 0.1053 + 0 = 0.3262   0 + 0.1111 + 0.1562 = 0.2673   0 + 0.0625 + 0.0878 = 0.1503
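To verify one cell by hand (a quick check of my own, not from the book), multiply the tf and idf values for "mr." in corpus['a']:

from math import log

# tf-idf("mr.", corpus['a']) from Table 4-4: the term frequency 2/19 multiplied
# by the inverse document frequency 1 + log(3/1)
print (2 / 19.0) * (1.0 + log(3 / 1.0))  # 0.220906556702, matching the sample output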
From a qualitative standpoint, the query results are quite reasonable. The corpus['b'] document is the winner for the query "green," with corpus['a'] just a hair behind. In this case, the deciding factor was the length of corpus['b'] being much smaller than that of corpus['a']—the normalized TF score tipped the results in favor of corpus['b'] for its one occurrence of "green," even though "Green" appeared in corpus['a'] two times. Since "green" appears in all three documents, the net effect of the IDF term in the calculations was a wash.

Do note, however, that if we had returned 0.0 instead of 1.0 for "green," as is done in some IDF implementations, the TF-IDF scores for "green" would have been 0.0 for all three documents due to the effect of multiplying the TF score by zero. Depending on the particular situation, it may be better to return 0.0 for the IDF scores rather than 1.0. For example, if you had 100,000 documents and "green" appeared in all of them, you'd almost certainly consider it to be a stopword and want to remove its effects in a query entirely.

For the query "Mr. Green," the clear and appropriate winner is the corpus['a'] document. However, this document also comes out on top for the query "the green plant." A worthwhile exercise is to consider why corpus['a'] scored highest for this query as opposed to corpus['b'], which at first blush might have seemed a little more obvious.

A final nuanced point to observe is that the sample implementation provided in Example 4-7 adjusts the IDF score by adding a value of 1.0 to the logarithm calculation, for the purposes of illustration and because we're dealing with a trivial document set. Without the 1.0 adjustment in the calculation, it would be possible to have the idf function return values that are less than 1.0, which would result in two fractions being multiplied in the TF-IDF calculation. Since multiplying two fractions together results in a value smaller than either of them, this turns out to be an easily overlooked edge case in the TF-IDF calculation. Recall that the intuition behind the TF-IDF calculation is to reward terms that are common within a given document but relatively rare across the corpus as a whole.
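To make that edge case concrete, here is a brief sketch (my own addition, not from Example 4-7; the adjustment parameter is a name I introduce purely for illustration) comparing the two conventions for a term that appears in every document of the sample corpus:

from math import log

corpus = ["Mr. Green killed Colonel Mustard in the study with the candlestick. "
          "Mr. Green is not a very nice fellow.",
          "Professor Plum has a green plant in his study.",
          "Miss Scarlett watered Professor Plum's green plant while he was away "
          "from his office last week."]

def idf(term, docs, adjustment=1.0):
    # Count the documents containing the term, then scale the logarithm as
    # discussed above
    num_texts_with_term = len([True for text in docs
                               if term.lower() in text.lower().split()])
    return adjustment + log(float(len(docs)) / num_texts_with_term)

print idf('green', corpus)                  # 1.0 -- leaves the tf scores intact
print idf('green', corpus, adjustment=0.0)  # 0.0 -- would zero out every tf-idf
                                            #        score for "green"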