Mining Google+: Computing Document Similarity, Extracting Collocations, and More
Google+ initially serves as our primary source of data for this chapter because it's inherently social, features content that's often expressed as longer-form notes that resemble blog entries, and is now an established staple in the social web. It's not hard to make a case that Google+ activities (notes) fill a niche somewhere between Twitter and blogs. The concepts from previous chapters could equally be applied to the data behind Google+'s diverse and powerful API, although we'll opt to leave regurgitations of those examples (mostly) as exercises for the motivated reader.
Wherever possible we won't reinvent the wheel and implement analysis tools from scratch, but we will take a couple of "deep dives" when particularly foundational topics come up that are essential to an understanding of text mining. The Natural Language Toolkit (NLTK) is a powerful technology that you may recall from Chapter 3; it provides many of the tools we'll use in this chapter. Its rich suites of APIs can be a bit overwhelming at first, but don't worry: while text analytics is an incredibly diverse and complex field of study, there are lots of powerful fundamentals that can take you a long way without too significant an investment. This chapter and the chapters after it aim to hone in on those fundamentals. (A full-blown introduction to NLTK is outside the scope of this book, but you can review the full text of Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit (O'Reilly) at the NLTK website.)
Always get the latest bug-fixed source code for this chapter (and every
other chapter) online at http://bit.ly/MiningTheSocialWeb2E. Be sure
to also take advantage of this book’s virtual machine experience, as
described in Appendix A, to maximize your enjoyment of the sample
code.
4.1. Overview
This chapter uses Google+ to begin our journey in analyzing human language data. In
this chapter you’ll learn about:
• The Google+ API and making API requests
• TF-IDF (Term Frequency–Inverse Document Frequency), a fundamental technique for analyzing words in documents
• How to apply NLTK to the problem of understanding human language
• How to apply cosine similarity to common problems such as querying documents
by keyword
• How to extract meaningful phrases from human language data by detecting collocation patterns
4.2. Exploring the Google+ API
Anyone with a Gmail account can trivially create a Google+ account and start collaborating with friends. From a product standpoint, Google+ has evolved rapidly and used some of the most compelling features of existing social network platforms such as Twitter and Facebook in carving out its own set of unique capabilities. As with the other social websites featured in this book, a full overview of Google+ isn't in scope for this chapter, but you can easily read about it (or sign up) online. The best way to learn is by creating an account and spending some time exploring. For the most part, where there's a feature in the user interface, there's an API that provides that feature, which you can tap into. Suffice it to say that Google+ has leveraged tried-and-true features of existing social networks, such as marking content with hashtags and maintaining a profile according to customizable privacy settings, with additional novelties such as a fresh take on content sharing called circles, video chats called hangouts, and extensive integration with other Google services such as Gmail contacts. In Google+ API parlance, social interactions are framed in terms of people, activities, comments, and moments.
The API documentation that's available online is always the definitive source of guidance, but a brief overview may be helpful to get you thinking about how Google+ compares to another platform such as Twitter or Facebook:
People
People are Google+ users. Programmatically, you'll either discover users by using the search API, look them up by a personalized URL if they're a celebrity type, or strip their Google+ IDs out of the URLs that appear in your web browser and use them for exploring their profiles. Later in this chapter, we'll use the search API to find users on Google+.
Activities
Activities are the things that people do on Google+. An activity is essentially a note and can be as long or short as the author likes: it can be as long as a blog post, or it can be devoid of any real textual meaning (e.g., just used to share links or multimedia content). Given a Google+ user, you can easily retrieve a list of that person's activities, and we'll do that later in this chapter. Like a tweet, an activity contains a lot of fascinating metadata, such as the number of times the activity has been reshared.
Comments
Leaving comments is the way Google+ users interact with one another. Simple statistical analysis of comments on Google+ could be very interesting and potentially reveal a lot of insights into a person's social circles or the virality of content. For example, which other Google+ users most frequently comment on activities? Which activities have the highest numbers of comments (and why)? A short sketch of pulling the comments for a single activity appears just after this list.
Moments
Moments are a relatively recent addition to Google+ and represent a way of capturing interactions between a user and a Google+ application. Moments are similar to Facebook's social graph stories in that they are designed to capture and create opportunities for user interaction with an application that can be displayed on a timeline. For example, if you were to make a purchase in an application, upload a photo, or watch a YouTube video, it could be captured as a moment (something you did in time) and displayed in a history of your actions or shared with friends in an activity stream by the application.

2. Google+ only offered personalized URLs such as https://plus.google.com/+TimOReilly to well-known individuals at the outset, but is in the process of making them available to all users. Until you are either selected by Google or eligible to apply, you'll have to continue using your more arcane URL with a numeric identifier in it (such as https://plus.google.com/107033731246200681024).
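To make the relationship between these objects a bit more concrete, here is a rough sketch (not one of this chapter's numbered examples) of tallying the most frequent commenters on a single activity. It assumes the same plus v1 service object built in Example 4-1, that the comments().list(activityId=...) method behaves as the online Google+ API documentation describes, and that ACTIVITY_ID is a placeholder you'd fill in with an activity ID harvested later in this chapter.

import httplib2
import json
import apiclient.discovery
from collections import Counter

# XXX: Your API key from https://code.google.com/apis/console
API_KEY = ''

# XXX: An activity ID, e.g. one harvested with Example 4-3 later in this chapter
ACTIVITY_ID = ''

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(),
                                    developerKey=API_KEY)

# Comments hang off of a particular activity, so the activity ID is the pivot
comments_feed = service.comments().list(activityId=ACTIVITY_ID,
                                        maxResults='100').execute()

# Tally which users comment most often on this activity
commenters = Counter(c['actor']['displayName']
                     for c in comments_feed.get('items', []))

print json.dumps(commenters.most_common(5), indent=1)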
For the purposes of this chapter, we’ll be focusing on harvesting and analyzing Google+
activity data that is textual and intended to convey the same kind of meaning that you
might encounter in a tweet, blog post, or Facebook status update. In other words, we’ll
be trying to analyze human language data. If you haven’t signed up for Google+ yet, it’s
worth taking the time to do so, as well as spending a few moments to familiarize yourself
with a Google+ profile. One of the easiest ways to find specific users on Google+ is to
just search for them at http://plus.google.com, and explore the platform from there.
As you explore Google+, bear in mind that it has a unique set of capabilities and is a
little awkward to compare directly to other social web properties. If you’re expecting it
to be a straightforward comparison, you might find yourself a bit surprised. It’s similar
to Twitter in that it provides a “following” model where you can add individuals to one
of your Google+ circles and keep up with their happenings without any approval on
their part, but the Google+ platform also offers rich integration with other Google web
properties, sports a powerful feature for videoconferencing (hangouts), and has an API
similar to Facebook’s in the way that users share content and interact. Without further
ado, let’s get busy exploring the API and work toward mining some data.
4.2.1. Making Google+ API Requests
From a software development standpoint, Google+ leverages OAuth like the rest of the social web to enable an application that you'll build to access data on a user's behalf, so you'll need to register an application to get appropriate credentials for accessing the Google+ platform. The Google API Console provides a means of registering an application (called a project in the Google API Console) to get OAuth credentials but also exposes an API key that you can use for "simple API access." This API key is what we'll use in this chapter to programmatically access the Google+ platform and just about every other Google service.
Once you’ve created an application, you’ll also need to specifically enable it to use
Google+ as a separate step. Figure 4-1 provides a screenshot of the Google+ API Console
as well as a view that demonstrates what it looks like when you have enabled your
application for Google+ API access.
Figure 4-1. Registering an application with the Google API Console to gain API access to Google services; don't forget to enable Google+ API access as one of the service options
You can install a Python package called google-api-python-client for accessing Google's API via pip install google-api-python-client. This is one of the standard Python-based options for accessing Google+ data. The online documentation for google-api-python-client is marginally helpful in familiarizing yourself with the capabilities of what it offers, but in general, you'll just be plugging parameters from the official Google+ API documents into some predictable access patterns with the Python package. Once you've walked through a couple of exercises, it's a relatively straightforward process.
Don't forget that pydoc can be helpful for gathering clues about a package, class, or method in a terminal as you are learning it. The help function in a standard Python interpreter is also useful. Recall that appending ? to a method name in IPython is a shortcut for displaying its docstring.
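For example, assuming google-api-python-client is installed, any of the following will surface documentation for the discovery module that this chapter's examples import (the module and function names here are simply the ones used in those examples):

$ pydoc apiclient.discovery          # from a terminal

>>> import apiclient.discovery
>>> help(apiclient.discovery.build)  # from a standard Python interpreter

In [1]: import apiclient.discovery
In [2]: apiclient.discovery.build?   # from IPython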
As an initial exercise, let’s consider the problem of locating a person on Google+. Like
any other social web API, the Google+ API offers a means of searching, and in particular
we’ll be interested in the People: search API. Example 4-1 illustrates how to search for
a person with the Google+ API. Since Tim O’Reilly is a well-known personality with an
active and compelling Google+ account, let’s look him up.
The basic pattern that you'll repeatedly use with the Python client is to create an instance of a service that's parameterized for Google+ access with your API key that you can then instantiate for particular platform services. Here, we create a connection to the People API by invoking service.people() and then chaining on some additional API operations deduced from reviewing the API documentation online. In a moment we'll query for activity data, and you'll see that the same basic pattern holds.
Example 4-1. Searching for a person with the Google+ API
import httplib2
import json
import apiclient.discovery # pip install google-api-python-client

# XXX: Enter any person's name
Q = "Tim O'Reilly"

# XXX: Enter in your API key from https://code.google.com/apis/console
API_KEY = ''

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(),
                                    developerKey=API_KEY)

people_feed = service.people().search(query=Q).execute()

print json.dumps(people_feed['items'], indent=1)
Following are sample results for searching for Tim O'Reilly:

[
 {
  "kind": "plus#person",
  "displayName": "Tim O'Reilly",
  "url": "https://plus.google.com/+TimOReilly",
  "image": {
   "url": "https://lh4.googleusercontent.com/-J8..."
  },
  "etag": "\"WIBkkymG3C8dXBjiaEVMpCLNTTs/wwgOCMn...\"",
  "id": "107033731246200681024",
  "objectType": "person"
 },
 {
  "kind": "plus#person",
  "displayName": "Tim O'Reilly",
  "url": "https://plus.google.com/11566571170551...",
  "image": {
   "url": "https://lh3.googleusercontent.com/-yka..."
  },
  "etag": "\"WIBkkymG3C8dXBjiaEVMpCLNTTs/0z-EwRK7...\"",
  "id": "115665711705516993369",
  "objectType": "person"
 },

 ...
]
The results do indeed return a list of people named Tim O’Reilly, but how can we tell
which one of these results refers to the well-known Tim O’Reilly of technology fame
that we are looking for? One option would be to request profile or activity information
for each of these results and try to disambiguate them manually. Another option is to
render the avatars included in each of the results, which is trivial to do by rendering the
avatars as images within IPython Notebook. Example 4-2 illustrates how to display
avatars and the corresponding ID values for each search result by generating HTML
and rendering it inline as a result in the notebook.
Example 4-2. Displaying Google+ avatars in IPython Notebook provides a quick way to
disambiguate the search results and discover the person you are looking for
from IPython.core.display import HTML

html = []

for p in people_feed['items']:
    html += ['<p><img src="%s" /> %s: %s</p>' % \
             (p['image']['url'], p['id'], p['displayName'])]

HTML(''.join(html))
Sample results are displayed in Figure 4-2 and provide the "quick fix" that we're looking for in our search for the particular Tim O'Reilly of O'Reilly Media.

Figure 4-2. Rendering Google+ avatars as images allows you to quickly scan the search results to disambiguate the person you are looking for

Although there's a multiplicity of things we could do with the People API, our focus in this chapter is on an analysis of the textual content in accounts, so let's turn our attention to the task of retrieving activities associated with this account. As you're about to find out, Google+ activities are the linchpin of Google+ content, containing a variety of rich content associated with the account and providing logical pivots to other platform objects such as comments. To get some activities, we'll need to tweak the design pattern we applied for searching for people, as illustrated in Example 4-3.
Example 4-3. Fetching recent activities for a particular Google+ user
import httplib2
import json
import apiclient.discovery

USER_ID = '107033731246200681024' # Tim O'Reilly

# XXX: Re-enter your API_KEY from https://code.google.com/apis/console
# if not currently set
API_KEY = ''

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(),
                                    developerKey=API_KEY)

activity_feed = service.activities().list(
  userId=USER_ID,
  collection='public',
  maxResults='100' # Max allowed per API
).execute()

print json.dumps(activity_feed, indent=1)
Sample results for the first item in the results (activity_feed['items'][0]) follow and illustrate the basic nature of a Google+ activity:

{
 "kind": "plus#activity",
 "provider": {
  "title": "Google+"
 },
 "title": "This is the best piece about privacy that I've read in a ...",
 "url": "https://plus.google.com/107033731246200681024/posts/78UeZ1jdRsQ",
 "object": {
  "resharers": {
   "totalItems": 191,
   "selfLink": "https://www.googleapis.com/plus/v1/activities/z125xvy..."
  },
  "attachments": [
   {
    "content": "Many governments (including our own, here in the US) ...",
    "url": "http://www.zdziarski.com/blog/?p=2155",
    "displayName": "On Expectation of Privacy Jonathan Zdziarski's Domain",
    "objectType": "article"
   }
  ],
  "url": "https://plus.google.com/107033731246200681024/posts/78UeZ1jdRsQ",
  "content": "This is the best piece about privacy that I&#39;ve read ...",
  "plusoners": {
   "totalItems": 356,
   "selfLink": "https://www.googleapis.com/plus/v1/activities/z125xvyid..."
  },
  "replies": {
   "totalItems": 48,
   "selfLink": "https://www.googleapis.com/plus/v1/activities/z125xvyid..."
  },
  "objectType": "note"
 },
 "updated": "2013-04-25T14:46:16.908Z",
 "actor": {
  "url": "https://plus.google.com/107033731246200681024",
  "image": {
   "url": "https://lh4.googleusercontent.com/-J8nmMwIhpiA/AAAAAAAAAAI/A..."
  },
  "displayName": "Tim O'Reilly",
  "id": "107033731246200681024"
 },
 "access": {
  "items": [
   {
    "type": "public"
   }
  ],
  "kind": "plus#acl",
  "description": "Public"
 },
 "verb": "post",
 "etag": "\"WIBkkymG3C8dXBjiaEVMpCLNTTs/d-ppAzuVZpXrW_YeLXc5ctstsCM\"",
 "published": "2013-04-25T14:46:16.908Z",
 "id": "z125xvyidpqjdtol423gcxizetybvpydh"
}
Each activity object follows a three-tuple pattern of the form (actor, verb, object). In this
post, the tuple (Tim O’Reilly, post, note) tells us that this particular item in the results
is a note, which is essentially just a status update with some textual content. A closer
look at the result reveals that the content is something that Tim O’Reilly feels strongly
about as indicated by the title “This is the best piece about privacy that I’ve read in a
long time” and hints that the note is active as evidenced by the number of reshares and
comments.
If you reviewed the output carefully, you may have noticed that the content field for the activity contains HTML markup, as evidenced by the HTML entity I&#39;ve that appears. In general, you should assume that the textual data exposed as Google+ activities contains some basic markup—such as <br /> tags and escaped HTML entities for apostrophes—so as a best practice we need to do a little bit of additional filtering to clean it up. Example 4-4 provides an example of how to distill plain text from the content field of a note by introducing a function called cleanHtml. It takes advantage of a clean_html function provided by NLTK and another handy package for manipulating HTML, called BeautifulSoup, that converts HTML entities back to plain text. If you haven't already encountered BeautifulSoup, it's a package that you won't want to live without once you've added it to your toolbox—it has the ability to process HTML in a reasonable way even if it is invalid and violates standards or other reasonable expectations (à la web data). You should install these packages via pip install nltk beautifulsoup4 if you haven't already.
Example 4-4. Cleaning HTML in Google+ content by stripping out HTML tags and converting HTML entities back to plain-text representations
from nltk import clean_html
from BeautifulSoup import BeautifulStoneSoup

# clean_html removes tags and
# BeautifulStoneSoup converts HTML entities

def cleanHtml(html):
    if html == "": return ""

    return BeautifulStoneSoup(clean_html(html),
                convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

print activity_feed['items'][0]['object']['content']
print
print cleanHtml(activity_feed['items'][0]['object']['content'])
The output from the note’s content, once cleansed with cleanHtml, is very clean text
that can be processed without additional concerns about noise in the content. As we’ll
learn in this chapter and follow-on chapters about text mining, reduction of noise in
text content is a critical aspect of improving accuracy. The before and after content
follows.
Here's the raw content in activity_feed['items'][0]['object']['content']:
This is the best piece about privacy that I&#39;ve read in a long time
If it doesn&#39;t change how you think about the privacy issue, I&#39;ll be
surprised. It opens:<br /><br />"Many governments (including our own,
here in the US) would have its citizens believe that privacy is a switch (that
is, you either reasonably expect it, or you don’t). This has been demonstrated
in many legal tests, and abused in many circumstances ranging from spying
on electronic mail, to drones in our airspace monitoring the movements of
private citizens. But privacy doesn’t work like a switch – at least it shouldn’t
for a country that recognizes that privacy is an inherent right. In fact,
privacy, like other components to security, works in layers..."<br /><br />
Please read
And here's the content rendered after cleansing with cleanHtml(activity_feed['items'][0]['object']['content']):
This is the best piece about privacy that I've read in a long time If it
doesn't change how you think about the privacy issue, I'll be surprised. It
opens: "Many governments (including our own, here in the US) would have its
citizens believe that privacy is a switch (that is, you either reasonably expect it,
or you don’t). This has been demonstrated in many legal tests, and abused
in many circumstances ranging from spying on electronic mail, to drones in our
airspace monitoring the movements of private citizens. But privacy doesn’t work like
a switch – at least it shouldn’t for a country that recognizes that privacy is an
inherent right. In fact, privacy, like other components to security, works in
layers..." Please read
The ability to query out clean text from Google+ is the basis for the remainder of the
text mining exercises in this chapter, but one additional consideration that you may find
useful before we focus our attention elsewhere is a pattern for fetching multiple pages
of content.
Whereas the previous example fetched 100 activities, the maximum number of results
for a query, it may be the case that you’ll want to iterate over an activities feed and retrieve
more than the maximum number of activities per page. The pattern for pagination is
outlined in the HTTP API Overview, and the Python client wrapper takes care of most
of the hassle.
Example 4-5 shows how to fetch multiple pages of activities and distill the text from
them if they are notes and have meaningful content.
Example 4-5. Looping over multiple pages of Google+ activities and distilling clean text
from notes
import os
import httplib2
import json
import apiclient.discovery
from BeautifulSoup import BeautifulStoneSoup
from nltk import clean_html

USER_ID = '107033731246200681024' # Tim O'Reilly

# XXX: Re-enter your API_KEY from https://code.google.com/apis/console
# if not currently set
API_KEY = ''

MAX_RESULTS = 200 # Will require multiple requests

def cleanHtml(html):
    if html == "": return ""

    return BeautifulStoneSoup(clean_html(html),
                convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]

service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(),
                                    developerKey=API_KEY)

activity_feed = service.activities().list(
  userId=USER_ID,
  collection='public',
  maxResults='100' # Max allowed per request
)

activity_results = []

while activity_feed != None and len(activity_results) < MAX_RESULTS:

    activities = activity_feed.execute()

    if 'items' in activities:
        for activity in activities['items']:
            if activity['object']['objectType'] == 'note' and \
               activity['object']['content'] != '':

                activity['title'] = cleanHtml(activity['title'])
                activity['object']['content'] = cleanHtml(activity['object']['content'])
                activity_results += [activity]

    # list_next requires the previous request and response objects
    activity_feed = service.activities().list_next(activity_feed, activities)

# Write the output to a file for convenience
f = open(os.path.join('resources', 'ch04-googleplus', USER_ID + '.json'), 'w')
f.write(json.dumps(activity_results, indent=1))
f.close()

print str(len(activity_results)), "activities written to", f.name
With the know-how to explore the Google+ API and fetch some interesting human
language data from activities’ content, let’s now turn our attention to the problem of
analyzing the content.
4.3. A Whiz-Bang Introduction to TF-IDF
Although rigorous approaches to natural language processing (NLP) that include such things as sentence segmentation, tokenization, word chunking, and entity detection are necessary in order to achieve the deepest possible understanding of textual data, it's helpful to first introduce some fundamentals from information retrieval theory. The remainder of this chapter introduces some of its more foundational aspects, including TF-IDF, the cosine similarity metric, and some of the theory behind collocation detection. Chapter 5 provides a deeper discussion of NLP as a logical continuation of this discussion.
If you want to dig deeper into IR theory, the full text of Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze's Introduction to Information Retrieval (Cambridge University Press) is available online and provides more information than you could (probably) ever want to know about the field.
Information retrieval is an extensive field with many specialties. This discussion narrows in on TF-IDF, one of the most fundamental techniques for retrieving relevant documents from a corpus (collection). TF-IDF stands for term frequency-inverse document frequency and can be used to query a corpus by calculating normalized scores that express the relative importance of terms in the documents.
Mathematically, TF-IDF is expressed as the product of the term frequency and the inverse document frequency, tf_idf = tf * idf, where the term tf represents the importance of a term in a specific document, and idf represents the importance of a term relative to the entire corpus. Multiplying these terms together produces a score that accounts for both factors and has been an integral part of every major search engine at some point in its existence. To get a more intuitive idea of how TF-IDF works, let's walk through each of the calculations involved in computing the overall score.
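Spelled out as a compact formula, the particular (smoothed) formulation that Example 4-7 implements later in this chapter is, for a term t, a document d, and a corpus D of N documents:

\[
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\lvert d \rvert},
\qquad
\mathrm{idf}(t, D) = 1 + \log\frac{N}{\lvert \{\, d \in D : t \in d \,\} \rvert},
\qquad
\mathrm{tf\_idf}(t, d, D) = \mathrm{tf}(t, d) \times \mathrm{idf}(t, D)
\]

where f_{t,d} is the number of times t occurs in d and |d| is the total number of terms in d. The +1 adjustment on the IDF term is one common choice among several; the sections that follow explain why it is used here.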
4.3.1. Term Frequency
For simplicity in illustration, suppose you have a corpus containing three sample documents and terms are calculated by simply breaking on whitespace, as illustrated in Example 4-6 as ordinary Python code.
Example 4-6. Sample data structures used in illustrations for the rest of this chapter
corpus = {
 'a' : "Mr. Green killed Colonel Mustard in the study with the candlestick. \
Mr. Green is not a very nice fellow.",
 'b' : "Professor Plum has a green plant in his study.",
 'c' : "Miss Scarlett watered Professor Plum's green plant while he was away \
from his office last week."
}

terms = {
 'a' : [ i.lower() for i in corpus['a'].split() ],
 'b' : [ i.lower() for i in corpus['b'].split() ],
 'c' : [ i.lower() for i in corpus['c'].split() ]
 }
A term's frequency could simply be represented as the number of times it occurs in the text, but it is more commonly the case that you normalize it by taking into account the total number of terms in the text, so that the overall score accounts for document length relative to a term's frequency. For example, "green" (once normalized to lowercase) occurs twice in corpus['a'] and only once in corpus['b'], so corpus['a'] would produce a higher score if frequency were the only scoring criterion. However, if you normalize for document length, corpus['b'] would have a slightly higher term frequency score for "green" (1/9) than corpus['a'] (2/19), because corpus['b'] is shorter than corpus['a']—even though "green" occurs more frequently in corpus['a']. A common technique for scoring a compound query such as "Mr. Green" is to sum the term frequency scores for each of the query terms in each document, and return the documents ranked by the summed term frequency score.
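As a minimal check of that normalization, the sketch below reuses the corpus dictionary from Example 4-6 and computes the same numbers that appear in Table 4-1; the standalone tf helper here mirrors the one in Example 4-7.

# Normalized term frequency: occurrences of the term divided by document length
def tf(term, doc):
    tokens = doc.lower().split()
    return tokens.count(term.lower()) / float(len(tokens))

print tf('mr.', corpus['a'])    # 2/19 ~= 0.1053
print tf('green', corpus['a'])  # 2/19 ~= 0.1053
print tf('green', corpus['b'])  # 1/9  ~= 0.1111
print tf('green', corpus['c'])  # 1/16 ~= 0.0625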
Let's illustrate how term frequency works by querying our sample corpus for "Mr. Green," which would produce the normalized scores reported in Table 4-1 for each document.
Table 4-1. Sample term frequency scores for “Mr. Green”
Document     tf(mr.)  tf(green)  Sum
corpus['a']  2/19     2/19       4/19 (0.2105)
corpus['b']  0        1/9        1/9 (0.1111)
corpus['c']  0        1/16       1/16 (0.0625)
For this contrived example, a cumulative term frequency scoring scheme works out and returns corpus['a'] (the document that we'd expect it to return), since corpus['a'] is the only one that contains the compound token "Mr. Green." However, a number of problems could have emerged, because the term frequency scoring model looks at each document as an unordered collection of words. For example, queries for "Green Mr." or "Green Mr. Foo" would have returned the exact same scores as the query for "Mr. Green," even though neither of those phrases appears in the sample sentences. Additionally, there are a number of scenarios that we could easily contrive to illustrate fairly poor results from the term frequency ranking technique given that trailing punctuation is not handled properly, and that the context around tokens of interest is not taken into account by the calculations.
Considering term frequency alone turns out to be a common source of problems when scoring on a document-by-document basis, because it doesn't account for very frequent words, called stopwords, that are common across many documents. In other words, all terms are weighted equally, regardless of their actual importance. For example, "the green plant" contains the stopword "the," which skews overall term frequency scores in favor of corpus['a'] because "the" appears twice in that document, as does "green." In contrast, in corpus['c'] "green" and "plant" each appear only once.

Consequently, the scores would break down as shown in Table 4-2, with corpus['a'] ranked as more relevant than corpus['c'], even though intuition might lead you to believe that ideal query results probably shouldn't have turned out that way. (Fortunately, however, corpus['b'] still ranks highest.)
3. Stopwords are words that appear frequently in text but usually relay little information. Common examples of
stopwords are a, an, the, and other determinants.
Table 4-2. Sample term frequency scores for "the green plant"
Document     tf(the)  tf(green)  tf(plant)  Sum
corpus['a']  2/19     2/19       0          4/19 (0.2105)
corpus['b']  0        1/9        1/9        2/9 (0.2222)
corpus['c']  0        1/16       1/16       1/8 (0.125)
4.3.2. Inverse Document Frequency
Toolkits such as NLTK provide lists of stopwords that can be used to filter out terms
such as and, a, and the, but keep in mind that there may be terms that evade even the
best stopword lists and yet still are quite common to specialized domains. Although
you can certainly customize a list of stopwords with domain knowledge, the inverse
document frequency metric is a calculation that provides a generic normalization metric
for a corpus. It works in the general case by accounting for the appearance of common
terms across a set of documents by considering the total number of documents in which
a query term ever appears.
The intuition behind this metric is that it produces a higher value if a term is somewhat uncommon across the corpus than if it is common, which helps to account for the problem with stopwords we just investigated. For example, a query for "green" in the corpus of sample documents should return a lower inverse document frequency score than a query for "candlestick," because "green" appears in every document while "candlestick" appears in only one. Mathematically, the only nuance of interest for the inverse document frequency calculation is that a logarithm is used to reduce the result into a compressed range, since its usual application is in multiplying it against term frequency as a scaling factor. For reference, a logarithm function is shown in Figure 4-3; as you can see, the logarithm function grows very slowly as values for its domain increase, effectively "squashing" its input.
Figure 4-3. The logarithm function "squashes" a large range of values into a more compressed space—notice how slowly the y-values grow as the values of x increase
Table 4-3 provides inverse document frequency scores that correspond to the term frequency scores in the previous section. Example 4-7 in the next section presents source code that shows how to compute these scores. In the meantime, you can notionally think of the IDF score for a term as the logarithm of a quotient that is defined by the number of documents in the corpus divided by the number of texts in the corpus that contain the term. When viewing these tables, keep in mind that whereas a term frequency score is calculated on a per-document basis, an inverse document frequency score is computed on the basis of the entire corpus. Hopefully this makes sense given that its purpose is to act as a normalizer for common words across the entire corpus.
Table 4-3. Sample inverse document frequency scores for terms appearing in "mr. green" and "the green plant"
idf(mr.)               idf(green)          idf(the)               idf(plant)
1 + log(3/1) = 2.0986  1 + log(3/3) = 1.0  1 + log(3/1) = 2.0986  1 + log(3/2) = 1.4055
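To reproduce the numbers in Table 4-3, here is a minimal sketch using the same smoothed 1 + log(N/n) formulation that Example 4-7 implements, run against the corpus dictionary from Example 4-6.

from math import log

# Smoothed inverse document frequency over the three-document corpus
def idf(term, corpus):
    num_texts_with_term = len([True for text in corpus
                               if term.lower() in text.lower().split()])
    return 1.0 + log(float(len(corpus)) / num_texts_with_term)

for term in ['mr.', 'green', 'the', 'plant']:
    print term, idf(term, corpus.values())
# mr. 2.0986..., green 1.0, the 2.0986..., plant 1.4055...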
4.3.3. TF-IDF
At this point, we've come full circle and devised a way to compute a score for a multiterm query that accounts for the frequency of terms appearing in a document, the length of the document in which any particular term appears, and the overall uniqueness of the terms across documents in the entire corpus. We can combine the concepts behind term frequency and inverse document frequency into a single score by multiplying them together, so that TF-IDF = TF * IDF. Example 4-7 is a naive implementation of this discussion that should help solidify the concepts described. Take a moment to review it, and then we'll discuss a few sample queries.
Example 4-7. Running TF-IDF on sample data
from math import log

# XXX: Enter in a query term from the corpus variable
QUERY_TERMS = ['mr.', 'green']

def tf(term, doc, normalize=True):
    doc = doc.lower().split()
    if normalize:
        return doc.count(term.lower()) / float(len(doc))
    else:
        return doc.count(term.lower()) / 1.0

def idf(term, corpus):
    num_texts_with_term = len([True for text in corpus if term.lower()
                               in text.lower().split()])

    # tf-idf calc involves multiplying against a tf value less than 1, so it's
    # necessary to return a value greater than 1 for consistent scoring.
    # (Multiplying two values less than 1 returns a value less than each of
    # them.)

    try:
        return 1.0 + log(float(len(corpus)) / num_texts_with_term)
    except ZeroDivisionError:
        return 1.0

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = \
    {'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. \
Mr. Green is not a very nice fellow.',
     'b': 'Professor Plum has a green plant in his study.',
     'c': "Miss Scarlett watered Professor Plum's green plant while he was away \
from his office last week."}

for (k, v) in sorted(corpus.items()):
    print k, ':', v
print

# Score queries by calculating cumulative tf_idf score for each term in query

query_scores = {'a': 0, 'b': 0, 'c': 0}
for term in [t.lower() for t in QUERY_TERMS]:
    for doc in sorted(corpus):
        print 'TF(%s): %s' % (doc, term), tf(term, corpus[doc])
    print 'IDF: %s' % (term, ), idf(term, corpus.values())
    print

    for doc in sorted(corpus):
        score = tf_idf(term, corpus[doc], corpus.values())
        print 'TF-IDF(%s): %s' % (doc, term), score
        query_scores[doc] += score
    print

print "Overall TF-IDF scores for query '%s'" % (' '.join(QUERY_TERMS), )
for (doc, score) in sorted(query_scores.items()):
    print doc, score
Sample output follows:
a : Mr. Green killed Colonel Mustard in the study...
b : Professor Plum has a green plant in his study.
c : Miss Scarlett watered Professor Plum's green...
TF(a): mr. 0.105263157895
TF(b): mr. 0.0
TF(c): mr. 0.0
IDF: mr. 2.09861228867
TF-IDF(a): mr. 0.220906556702
TF-IDF(b): mr. 0.0
TF-IDF(c): mr. 0.0
TF(a): green 0.105263157895
TF(b): green 0.111111111111
TF(c): green 0.0625
IDF: green 1.0
TF-IDF(a): green 0.105263157895
TF-IDF(b): green 0.111111111111
TF-IDF(c): green 0.0625
Overall TF-IDF scores for query 'mr. green'
a 0.326169714597
b 0.111111111111
c 0.0625
Although we’re working on a trivially small scale, the calculations involved work the
same for larger data sets. Table 4-4 is a consolidated adaptation of the program’s output
for three sample queries that involve four distinct terms:
• “green”
• “mr. green”
• “the green plant”
Even though the IDF calculations for terms are computed on the basis of the entire
corpus, they are displayed on a per-document basis so that you can easily verify TF-IDF
scores by skimming a single row and multiplying two numbers. As you work through
the query results, you’ll find that it’s remarkable just how powerful TF-IDF is, given that
it doesn’t account for the proximity or ordering of words in a document.
Table 4-4. Calculations involved in TF-IDF sample queries, as computed by
Example 4-7
Document     tf(mr.)   tf(green)  tf(the)   tf(plant)
corpus['a']  0.1053    0.1053     0.1053    0
corpus['b']  0         0.1111     0         0.1111
corpus['c']  0         0.0625     0         0.0625

             idf(mr.)  idf(green)  idf(the)  idf(plant)
             2.0986    1.0         2.099     1.4055

             tf-idf(mr.)             tf-idf(green)         tf-idf(the)            tf-idf(plant)
corpus['a']  0.1053*2.0986 = 0.2209  0.1053*1.0 = 0.1053   0.1053*2.099 = 0.2209  0*1.4055 = 0
corpus['b']  0*2.0986 = 0            0.1111*1.0 = 0.1111   0*2.099 = 0            0.1111*1.4055 = 0.1562
corpus['c']  0*2.0986 = 0            0.0625*1.0 = 0.0625   0*2.099 = 0            0.0625*1.4055 = 0.0878
The same results for each query are shown in Table 4-5, with the TF-IDF values summed
on a per-document basis.
Table 4-5. Summed TF-IDF values for sample queries as computed by Example 4-7
(values in bold are the maximum scores for each of the three queries)
Query            corpus['a']                   corpus['b']                    corpus['c']
green            0.1053                        0.1111                         0.0625
Mr. Green        0.2209 + 0.1053 = 0.3262      0 + 0.1111 = 0.1111            0 + 0.0625 = 0.0625
the green plant  0.2209 + 0.1053 + 0 = 0.3262  0 + 0.1111 + 0.1562 = 0.2673   0 + 0.0625 + 0.0878 = 0.1503
From a qualitative standpoint, the query results are quite reasonable. The corpus['b'] document is the winner for the query "green," with corpus['a'] just a hair behind. In this case, the deciding factor was the length of corpus['b'] being much smaller than that of corpus['a']—the normalized TF score tipped the results in favor of corpus['b'] for its one occurrence of "green," even though "Green" appeared in corpus['a'] two times. Since "green" appears in all three documents, the net effect of the IDF term in the calculations was a wash.
Do note, however, that if we had returned 0.0 instead of 1.0 for "green," as is done in some IDF implementations, the TF-IDF scores for "green" would have been 0.0 for all three documents due to the effect of multiplying the TF score by zero. Depending on the particular situation, it may be better to return 0.0 for the IDF scores rather than 1.0. For example, if you had 100,000 documents and "green" appeared in all of them, you'd almost certainly consider it to be a stopword and want to remove its effects in a query entirely.
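If you wanted that behavior, one possible variant (a sketch, not part of Example 4-7) is simply to drop the +1 adjustment, so that a term appearing in every document contributes nothing to the score:

from math import log

# Unsmoothed IDF: a term present in every document gets 0.0 and is
# effectively filtered out of the query, as with the ubiquitous "green" here
def idf_unsmoothed(term, corpus):
    num_texts_with_term = len([True for text in corpus
                               if term.lower() in text.lower().split()])
    if num_texts_with_term == 0:
        return 0.0
    return log(float(len(corpus)) / num_texts_with_term)

print idf_unsmoothed('green', corpus.values())  # 0.0 -- appears in all three docs
print idf_unsmoothed('plant', corpus.values())  # log(3/2) ~= 0.4055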
For the query "Mr. Green," the clear and appropriate winner is the corpus['a'] document. However, this document also comes out on top for the query "the green plant." A worthwhile exercise is to consider why corpus['a'] scored highest for this query as opposed to corpus['b'], which at first blush might have seemed a little more obvious.
A final nuanced point to observe is that the sample implementation provided in
Example 4-7 adjusts the IDF score by adding a value of 1.0 to the logarithm calculation,
for the purposes of illustration and because we’re dealing with a trivial document set.
Without the 1.0 adjustment in the calculation, it would be possible to have the idf
function return values that are less than 1.0, which would result in two fractions being
multiplied in the TF-IDF calculation. Since multiplying two fractions together results
in a value smaller than either of them, this turns out to be an easily overlooked edge
case in the TF-IDF calculation. Recall that the intuition behind the TF-IDF calculation