how to send email from unix shell script and how to use mail command in unix
MattGates,United States,Professional
Published Date:02-08-2017
Chapter 6: Mining Mailboxes: Analyzing Who’s Talking to Whom About What, How Often, and More
Mail archives are arguably the ultimate kind of social web data and the basis of the
earliest online social networks. Mail data is ubiquitous, and each message is inherently
social, involving conversations and interactions among two or more people. Furthermore,
each message consists of human language data that’s inherently expressive, and
is laced with structured metadata fields that anchor the human language data in particular
timespans and unambiguous identities. Mining mailboxes certainly provides an
opportunity to synthesize all of the concepts you’ve learned in previous chapters and
opens up incredible opportunities for discovering valuable insights.
Whether you are the CIO of a corporation and want to analyze corporate communications
for trends and patterns, you have a keen interest in mining online mailing lists for
insights, or you’d simply like to explore your own mailbox for patterns as part of quantifying
yourself, the following discussion provides a primer to help you get started. This
chapter introduces some fundamental tools and techniques for exploring mailboxes to
answer questions such as:
• Who sends mail to whom (and how much/often)?
• Is there a particular time of the day (or day of the week) when the most mail chatter
happens?
• Which people send the most messages to one another?
• What are the subjects of the liveliest discussion threads?
Although social media sites are racking up petabytes of near-real-time social data, there
is still the significant drawback that social networking data is centrally managed by a
service provider that gets to create the rules about exactly how you can access it and
what you can and can’t do with it. Mail archives, on the other hand, are decentralized
and scattered across the Web in the form of rich mailing list discussions about a litany
of topics, as well as the many thousands of messages that people have tucked away in
their own accounts. When you take a moment to think about it, it seems as though being
able to effectively mine mail archives could be one of the most essential capabilities in
your data mining toolbox.
Although it’s not always easy to find realistic social data sets for purposes of illustration,
this chapter showcases the fairly well-studied Enron corpus as its basis in order to maximize
the opportunity for analysis without introducing any legal or privacy concerns (see footnote 1).
We’ll standardize the data set into the well-known Unix mailbox (mbox) format so that
we can employ a common set of tools to process it. Finally, although we could just opt
to process the data in a JSON format that we store in a flat file, we’ll take advantage of
the inherently document-centric nature of a mail message and learn how to use MongoDB
to store and analyze the data in a number of powerful and interesting ways, including
various types of frequency analysis and keyword search.
As a general-purpose database for storing and querying arbitrary JSON data, MongoDB
is hard to beat, and it’s a powerful and versatile tool that you’ll want to have on hand for
a variety of circumstances. (Although we’ve opted to avoid the use of external dependencies
such as databases until this chapter, when it has all but become a necessity given
the nature of our subject matter here, you’ll soon realize that you could use MongoDB
to store any of the social web data we’ve been retrieving and accessing as flat JSON files.)
Always get the latest bug-fixed source code for this chapter
(and every other chapter) online at http://bit.ly/MiningTheSocialWeb2E.
Be sure to also take advantage of this book’s virtual machine
experience, as described in Appendix A, to maximize your
enjoyment of the sample code.
6.1. Overview
Mail data is incredibly rich and presents opportunities for analysis involving everything
you’ve learned about so far in this book. In this chapter you’ll learn about:
• The process of standardizing mail data to a convenient and portable format
1. Should you want to analyze mailing list data, be advised that most service providers (such as Google and
Yahoo) restrict your use of mailing list data if you retrieve it using their APIs, but you can easily enough
collect and archive mailing list data yourself by subscribing to a list and waiting for your mailbox to start
filling up. You might also be able to ask the list owner or members of the list to provide you with an archive
as another option.
226 Chapter 6: Mining Mailboxes: Analyzing Who’s Talking to Whom About What, How Often, and More
• MongoDB, a powerful document-oriented database that is ideal for storing mail
and other forms of social web data
• The Enron corpus, a public data set consisting of the contents of employee mailboxes
from around the time of the Enron scandal
• Using MongoDB to query the Enron corpus in arbitrary ways
• Tools for accessing and exporting your own mailbox data for analysis
6.2. Obtaining and Processing a Mail Corpus
This section illustrates how to obtain a mail corpus, convert it into a standardized mbox,
and then import the mbox into MongoDB, which will serve as a general-purpose API
for storing and querying the data. We’ll start out by analyzing a small fictitious mailbox
and then proceed to processing the Enron corpus.
6.2.1. A Primer on Unix Mailboxes
An mbox is really just a large text file of concatenated mail messages that are easily
accessible by text-based tools. Mail tools and protocols have long since evolved beyond
mboxes, but it’s usually the case that you can use this format as a lowest common denominator
to easily process the data and feel confident that if you share or distribute
the data it’ll be just as easy for someone else to process it. In fact, most mail clients
provide an “export” or “save as” option to export data to this format (even though the
verbiage may vary), as illustrated in Figure 6-2 in Section 6.5.
In terms of specification, the beginning of each message in an mbox is signaled by a
special From_ line formatted to the pattern "From user@example.com asctime", where
asctime is a standardized fixed-width representation of a timestamp in the form Fri
Dec 25 00:06:42 2009. The boundary between messages is determined by a From_
line preceded (except for the first occurrence) by exactly two newline characters. (Visually,
as shown below, this appears as though there is a single blank line that precedes
the From_ line.) A small slice from a fictitious mbox containing two messages follows:
From santa@northpole.example.org Fri Dec 25 00:06:42 2009
Message-ID: <16159836.1075855377439@mail.northpole.example.org>
References: <88364590.8837464573838@mail.northpole.example.org>
In-Reply-To: <194756537.0293874783209@mail.northpole.example.org>
Date: Fri, 25 Dec 2001 00:06:42 -0000 (GMT)
From: St. Nick <santa@northpole.example.org>
To: rudolph@northpole.example.org
Subject: RE: FWD: Tonight
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Sounds good. See you at the usual location.

Thanks,
-S

-----Original Message-----
From: Rudolph
Sent: Friday, December 25, 2009 12:04 AM
To: Claus, Santa
Subject: FWD: Tonight

Santa -

Running a bit late. Will come grab you shortly. Standby.

Rudy

Begin forwarded message:

> Last batch of toys was just loaded onto sleigh.
> Please proceed per the norm.
> Regards,
> Buddy
> Buddy the Elf
> Chief Elf
> Workshop Operations
> North Pole
> buddy.the.elf@northpole.example.org

From buddy.the.elf@northpole.example.org Fri Dec 25 00:03:34 2009
Message-ID: <88364590.8837464573838@mail.northpole.example.org>
Date: Fri, 25 Dec 2001 00:03:34 -0000 (GMT)
From: Buddy <buddy.the.elf@northpole.example.org>
To: workshop@northpole.example.org
Subject: Tonight
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Last batch of toys was just loaded onto sleigh.

Please proceed per the norm.

Regards,
Buddy

Buddy the Elf
Chief Elf
Workshop Operations
North Pole
buddy.the.elf@northpole.example.org

In the preceding sample mailbox we see two messages, although there is evidence of at
least one other message that was replied to that might exist elsewhere in the mbox.
Chronologically, the first message was authored by a fellow named Buddy and was sent
out to workshop@northpole.example.org to announce that the toys had just been loaded.
The other message in the mbox is a reply from Santa to Rudolph. Not shown in the
sample mbox is an intermediate message in which Rudolph forwarded Buddy’s message
to Santa with the note saying that he was running late. Although we could infer these
things by reading the text of the messages themselves as humans with contextualized
knowledge, the Message-ID, References, and In-Reply-To headers also provide important
clues that can be analyzed.
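As a concrete check of the From_ convention described before the sample mailbox, the following minimal sketch parses a From_ line into a sender and a timestamp. It is written for Python 3 (unlike the chapter's Python 2 listings) and uses only the standard library. FROM_LINE and parse_from_line are illustrative names rather than part of any library, and the pattern assumes a sender without spaces followed by a 24-character asctime stamp.

```python
import re
import time

# Assumed shape of an mbox From_ line: "From <sender> <24-character asctime>".
FROM_LINE = re.compile(r"^From (\S+) (.{24})$")


def parse_from_line(line):
    """Return (sender, struct_time) for an mbox From_ line, or None if it does not match."""
    match = FROM_LINE.match(line)
    if match is None:
        return None
    sender, stamp = match.groups()
    # asctime layout: weekday, month, day of month, time, year
    return sender, time.strptime(stamp, "%a %b %d %H:%M:%S %Y")


sender, when = parse_from_line(
    "From santa@northpole.example.org Fri Dec 25 00:06:42 2009")
print(sender)
print(time.asctime(when))
```

An ordinary From: header (with a colon) does not match the pattern, which is what lets a parser tell message boundaries apart from header fields.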
These headers are pretty intuitive and provide the basis for algorithms that display
threaded discussions and things of that nature. We’ll look at a well-known algorithm
that uses these fields to thread messages a bit later, but the gist is that each message has
a unique message ID, contains a reference to the exact message that is being replied to
in the case of it being a reply, and can reference multiple other messages in the reply
chain that are part of the larger discussion thread at hand.
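To make the role of these headers concrete, here is a minimal Python 3 sketch that links each reply to its parent and reports parents that are referenced but absent from the mailbox. It is a deliberate simplification and not the well-known threading algorithm mentioned above; parent_id and link_replies are illustrative names, and the header values are the ones from the northpole sample.

```python
def parent_id(headers):
    """Pick the message being replied to: In-Reply-To first, else the last References entry."""
    in_reply_to = headers.get("In-Reply-To", "").split()
    if in_reply_to:
        return in_reply_to[0]
    references = headers.get("References", "").split()
    return references[-1] if references else None


def link_replies(messages):
    """Return (children, missing): replies grouped by parent ID, and parent IDs not in the mailbox."""
    known = {m["Message-ID"] for m in messages}
    children, missing = {}, set()
    for m in messages:
        parent = parent_id(m)
        if parent is None:
            continue
        children.setdefault(parent, []).append(m["Message-ID"])
        if parent not in known:
            missing.add(parent)
    return children, missing


BUDDY = "<88364590.8837464573838@mail.northpole.example.org>"
RUDOLPH = "<194756537.0293874783209@mail.northpole.example.org>"
SANTA = "<16159836.1075855377439@mail.northpole.example.org>"

messages = [
    {"Message-ID": SANTA, "In-Reply-To": RUDOLPH, "References": BUDDY},
    {"Message-ID": BUDDY},
]
children, missing = link_replies(messages)
print(children)
print(missing)
```

In this sketch Rudolph's message ID shows up in missing, which mirrors the observation that the sample mbox contains evidence of a third message it does not include.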
Because we’ll be employing some Python modules to do much of the
tedious work for us, we won’t need to digress into discussions concerning
the nuances of email messages, such as multipart content,
MIME types, and 7-bit content transfer encoding.
These headers are vitally important. Even with this simple example, you can already see
how things can get quite messy when you’re parsing the actual body of a message:
Rudolph’s client quoted forwarded content with > characters, while the mail client Santa
used to reply apparently didn’t quote anything, but instead included a human-readable
message header.
Most mail clients have an option to display extended mail headers beyond the ones you
normally see, if you’re interested in a technique that’s a little more accessible than digging
into raw storage when you want to view this kind of information; Figure 6-1 shows
sample headers as displayed by Apple Mail.
Figure 6-1. Most mail clients allow you to view the extended headers through an options menu
Luckily for us, there’s a lot you can do without having to essentially reimplement a mail
client. Besides, if all you wanted to do was browse the mailbox, you’d simply import it
into a mail client and browse away, right?
It’s worth taking a moment to explore whether your mail client has an
option to import/export data in the mbox format so that you can use
the tools in this chapter to manipulate it.
To get the ball rolling on some data processing, Example 6-1 illustrates a routine that
makes numerous simplifying assumptions about an mbox to introduce the mailbox and
email packages that are part of Python’s standard library.
Example 6-1. Converting a toy mailbox to JSON
# Note: this listing targets Python 2 (mailbox.UnixMailbox and the print
# statement are not available in Python 3).
import mailbox
import email
import json

MBOX = 'resources/ch06-mailboxes/data/northpole.mbox'

# A routine that makes a ton of simplifying assumptions
# about converting an mbox message into a Python object
# given the nature of the northpole.mbox file in order
# to demonstrate the basic parsing of an mbox with mail
# utilities

def objectify_message(msg):

    # Map in fields from the message
    o_msg = dict([ (k, v) for (k,v) in msg.items() ])

    # Assume one part to the message and get its content
    # and its content type
    part = [p for p in msg.walk()][0]
    o_msg['contentType'] = part.get_content_type()
    o_msg['content'] = part.get_payload()

    return o_msg

# Create an mbox that can be iterated over and transform each of its
# messages to a convenient JSON representation

mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)

messages = []

while 1:
    msg = mbox.next()
    if msg is None: break
    messages.append(objectify_message(msg))

print json.dumps(messages, indent=1)
Although this little script for processing an mbox file seems pretty clean and produces
reasonable results, trying to parse arbitrary mail data or determine the exact flow of a
conversation from mailbox data for the general case can be a tricky enterprise. Many
factors contribute to this, such as the ambiguity involved and the variation that can
occur in how humans embed replies and comments into reply chains, how different
mail clients handle messages and replies, etc.
Table 6-1 illustrates the message flow and explicitly includes the third message that was
referenced but not present in the northpole.mbox to highlight this point. Truncated
sample output from the script follows:
[
 {
  "From": "St. Nick <santa@northpole.example.org>",
  "Content-Transfer-Encoding": "7bit",
  "content": "Sounds good. See you at the usual location.\n\nThanks,...",
  "To": "rudolph@northpole.example.org",
  "References": "<88364590.8837464573838@mail.northpole.example.org>",
  "Mime-Version": "1.0",
  "In-Reply-To": "<194756537.0293874783209@mail.northpole.example.org>",
  "Date": "Fri, 25 Dec 2001 00:06:42 -0000 (GMT)",
  "contentType": "text/plain",
  "Message-ID": "<16159836.1075855377439@mail.northpole.example.org>",
  "Content-Type": "text/plain; charset=us-ascii",
  "Subject": "RE: FWD: Tonight"
 },
 {
  "From": "Buddy <buddy.the.elf@northpole.example.org>",
  "Subject": "Tonight",
  "Content-Transfer-Encoding": "7bit",
  "content": "Last batch of toys was just loaded onto sleigh. \n\nPlease...",
  "To": "workshop@northpole.example.org",
  "Date": "Fri, 25 Dec 2001 00:03:34 -0000 (GMT)",
  "contentType": "text/plain",
  "Message-ID": "<88364590.8837464573838@mail.northpole.example.org>",
  "Content-Type": "text/plain; charset=us-ascii",
  "Mime-Version": "1.0"
 }
]
Table 6-1. Message flow from northpole.mbox
Date                                     Message activity
Fri, 25 Dec 2001 00:03:34 -0000 (GMT)    Buddy sends a message to the workshop
Friday, December 25, 2009 12:04 AM       Rudolph forwards Buddy’s message to Santa with an additional note
Fri, 25 Dec 2001 00:06:42 -0000 (GMT)    Santa replies to Rudolph
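For readers on Python 3, where mailbox.UnixMailbox is no longer available, the conversion in Example 6-1 can be approximated with mailbox.mbox. The sketch below is an adaptation under the same simplifying assumptions (a single text part per message), not the book's own listing. It writes a one-message sample to a temporary file so that it is self-contained; SAMPLE and mbox_to_objects are illustrative names.

```python
import json
import mailbox
import os
import tempfile

# A one-message mbox, modeled on the second message of the northpole sample.
SAMPLE = (
    "From buddy.the.elf@northpole.example.org Fri Dec 25 00:03:34 2009\n"
    "Message-ID: <88364590.8837464573838@mail.northpole.example.org>\n"
    "From: Buddy <buddy.the.elf@northpole.example.org>\n"
    "To: workshop@northpole.example.org\n"
    "Subject: Tonight\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Last batch of toys was just loaded onto sleigh.\n"
)


def objectify_message(msg):
    """Copy the headers into a dict and add the first part's type and payload."""
    o_msg = dict(msg.items())
    part = next(msg.walk())  # assumes a single-part message
    o_msg["contentType"] = part.get_content_type()
    o_msg["content"] = part.get_payload()
    return o_msg


def mbox_to_objects(path):
    """Read every message in the mbox at path and return a list of dicts."""
    box = mailbox.mbox(path, create=False)
    try:
        return [objectify_message(msg) for msg in box]
    finally:
        box.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.mbox")
    with open(path, "w", newline="\n") as f:
        f.write(SAMPLE)
    messages = mbox_to_objects(path)

print(json.dumps(messages, indent=1))
```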
With a basic appreciation for mailboxes in place, let’s now shift our attention to converting
the Enron corpus to an mbox so that we can leverage Python’s standard library
as much as possible.
6.2.2. Getting the Enron Data
A downloadable form of the full Enron data set in a raw form is available in multiple
formats requiring various amounts of processing. We’ll opt to start with the original
raw form of the data set, which is essentially a set of folders that organizes a collection
of mailboxes by person and folder. Data standardization and cleansing is a routine
problem, and this section should give you some perspective and some appreciation for
it.
If you are taking advantage of the virtual machine experience for this book, the IPython
Notebook for this chapter provides a script that downloads the data to the proper working
location for you to seamlessly follow along with these examples. The full Enron
corpus is approximately 450 MB in the compressed form in which you would download
it to follow along with these exercises. It may take upward of 10 minutes to download
and decompress if you have a reasonable Internet connection speed and a relatively new
computer.
Unfortunately, if you are using the virtual machine, the time that it takes for Vagrant to
synchronize the thousands of files that are unarchived back to the host machine can be
upward of two hours. If time is a significant factor and you can’t let this script run at an
opportune time, you could opt to skip the download and initial processing steps since
the refined version of the data, as produced from Example 6-3, is checked in with the
source code and available at ipynb/resources/ch06-mailboxes/data/enron.mbox
.json.bz2. See the notes in the IPython Notebook for this chapter for more details.
The download and decompression of the file is relatively fast compared
to the time that it takes for Vagrant to synchronize the high
number of decompressed files with the host machine, and at the time
of this writing, there isn’t a known workaround that will speed this up
for all platforms. It may take longer than an hour for Vagrant to synchronize
the thousands of files that decompress.
The output from the following terminal session illustrates the basic structure of the
corpus once you’ve downloaded and unarchived it. It’s worthwhile to explore the data
in a terminal session for a few minutes once you’ve downloaded it to familiarize yourself
with what’s there and learn how to navigate through it.
If you are working on a Windows system or are not comfortable
working in a terminal, you can poke around in the ipynb/resources/ch06-mailboxes/data
folder, which will be synchronized onto your host
machine if you are taking advantage of the virtual machine experience
for this book.
$ cd enron_mail_20110402/maildir # Go into the mail directory

maildir $ ls # Show folders/files in the current directory

allen-p crandell-s gay-r horton-s
lokey-t nemec-g rogers-b slinger-r
tycholiz-b arnold-j cuilla-m geaccone-t
hyatt-k love-p panus-s ruscitti-k
smith-m ward-k arora-h dasovich-j
germany-c hyvl-d lucci-p parks-j
sager-e solberg-g watson-k badeer-r
corman-s gang-l holst-k lokay-m
...directory listing truncated...
neal-s rodrique-r skilling-j townsend-j

$ cd allen-p/ # Go into the allen-p folder

allen-p $ ls # Show files in the current directory

_sent_mail contacts discussion_threads notes_inbox
sent_items all_documents deleted_items inbox
sent straw

allen-p $ cd inbox/ # Go into the inbox for allen-p

inbox $ ls # Show the files in the inbox for allen-p

1. 11. 13. 15. 17. 19. 20. 22. 24. 26. 28. 3. 31. 33. 35. 37. 39. 40.
42. 44. 5. 62. 64. 66. 68. 7. 71. 73. 75. 79. 83. 85. 87. 10. 12. 14.
16. 18. 2. 21. 23. 25. 27. 29. 30. 32. 34. 36. 38. 4. 41. 43. 45. 6.
63. 65. 67. 69. 70. 72. 74. 78. 8. 84. 86. 9.

inbox $ head -20 1. # Show the first 20 lines of the file named "1."

Message-ID: <16159836.1075855377439.JavaMail.evans@thyme>
Date: Fri, 7 Dec 2001 10:06:42 -0800 (PST)
From: heather.dunton@enron.com
To: k..allen@enron.com
Subject: RE: West Position
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Dunton, Heather </O=ENRON/OU=NA/CN=RECIPIENTS/CN=HDUNTON>
X-To: Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Pallen>
X-cc:
X-bcc:
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\Inbox
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Please let me know if you still need Curve Shift.

Thanks,
The final command in the terminal session shows that mail messages are organized into
files and contain metadata in the form of headers that can be processed along with the
content of the data itself. The data is in a fairly consistent format, but not necessarily a
well-known format with great tools for processing it. So, let’s do some preprocessing on
the data and convert a portion of it to the well-known Unix mbox format in order to
illustrate the general process of standardizing a mail corpus to a format that is widely
known and well tooled.
6.2.3. Converting a Mail Corpus to a Unix Mailbox
Example 6-2 illustrates an approach that searches the directory structure of the Enron
corpus for folders named “inbox” and adds messages contained in them to a single
output file that’s written out as enron.mbox. To run this script, you will need to download
the Enron corpus and unarchive it to the path specified by MAILDIR in the script.
The script takes advantage of a package called dateutil to handle the parsing of dates
into a standard format. We didn’t do this earlier, and it’s slightly trickier than it may
sound given the room for variation in the general case. You can install this package with
pip install python_dateutil. (In this particular instance, the package name that pip
installs is slightly different from the name you import in your code.) Otherwise, the
script just uses some tools from Python’s standard library to munge the data into an
mbox. Although not analytically interesting, the script provides reminders of how to
use regular expressions, uses the email package that we’ll continue to see, and illustrates
some other concepts that may be useful for general data processing. Be sure that you
understand how the script works to broaden your overall working knowledge and data
mining toolchain.
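The date handling is the least obvious step, so here is a minimal Python 3 sketch of it in isolation. It swaps in the standard library's email.utils.parsedate_to_datetime for dateutil so that it runs without extra packages; that substitution is an assumption of this sketch, and dateutil tolerates more date variations. The from_line helper is an illustrative name.

```python
from email.utils import parsedate_to_datetime
from time import asctime


def from_line(sender, date_header):
    """Build an mbox From_ line from a sender address and an RFC 2822 Date header."""
    when = parsedate_to_datetime(date_header)
    # asctime renders the wall-clock fields of the header; the UTC offset is dropped.
    return "From %s %s" % (sender, asctime(when.timetuple()))


line = from_line("heather.dunton@enron.com",
                 "Fri, 7 Dec 2001 10:06:42 -0800 (PST)")
print(line)
```

Note that asctime pads a single-digit day with a space, which keeps the timestamp at a fixed width.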
This script may take 10–15 minutes to run on the entire Enron corpus,
depending on your hardware. IPython Notebook will indicate that
it is still processing data by displaying a “Kernel Busy” message in the
upper-right corner of the user interface.
Example 6-2. Converting the Enron corpus to a standardized mbox format
import re
import email
from time import asctime
import os
import sys
from dateutil.parser import parse # pip install python_dateutil

# XXX: Download the Enron corpus to resources/ch06-mailboxes/data
# and unarchive it there.

MAILDIR = 'resources/ch06-mailboxes/data/enron_mail_20110402/' + \
          'enron_data/maildir'

# Where to write the converted mbox
MBOX = 'resources/ch06-mailboxes/data/enron.mbox'

# Create a file handle that we'll be writing into...
mbox = open(MBOX, 'w')

# Walk the directories and process any folder named 'inbox'

for (root, dirs, file_names) in os.walk(MAILDIR):

    # Skip every folder that is not an inbox
    if root.split(os.sep)[-1].lower() != 'inbox':
        continue

    # Process each message in 'inbox'

    for file_name in file_names:
        file_path = os.path.join(root, file_name)
        message_text = open(file_path).read()

        # Compute fields for the From_ line in a traditional mbox message

        _from = re.search(r"From: ([^\r]+)", message_text).groups()[0]
        _date = re.search(r"Date: ([^\r]+)", message_text).groups()[0]

        # Convert _date to the asctime representation for the From_ line

        _date = asctime(parse(_date).timetuple())

        msg = email.message_from_string(message_text)
        msg.set_unixfrom('From %s %s' % (_from, _date))

        mbox.write(msg.as_string(unixfrom=True) + "\n\n")

mbox.close()
If you peek at the mbox file that we’ve just created, you’ll see that it looks quite similar
to the mail format we saw earlier, except that it now conforms to well-known specifications
and is a single file.
Keep in mind that you could just as easily create separate mbox files
for each individual person or a particular group of people if you preferred
to analyze a more focused subset of the Enron corpus.
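One way to split the output per person is to derive the output filename from the first directory under the maildir root, which in the Enron layout is the mailbox owner (for example, allen-p). The Python 3 sketch below shows only that path logic; mailbox_owner and mbox_path_for are hypothetical helpers, and the directory names are illustrative.

```python
import os


def mailbox_owner(root, maildir):
    """Return the first path component of root below maildir, e.g. 'allen-p'."""
    relative = os.path.relpath(root, maildir)
    return relative.split(os.sep)[0]


def mbox_path_for(root, maildir, out_dir):
    """Choose a per-person mbox file for messages found under root."""
    return os.path.join(out_dir, mailbox_owner(root, maildir) + ".mbox")


maildir = os.path.join("enron_mail_20110402", "maildir")
inbox = os.path.join(maildir, "allen-p", "inbox")
print(mbox_path_for(inbox, maildir, "out"))
```

Inside a directory walk like the one in Example 6-2, each message could then be appended to the file that mbox_path_for returns instead of to a single shared mbox.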
6.2.4. Converting Unix Mailboxes to JSON
Having an mbox file is especially convenient because of the variety of tools available to
process it across computing platforms and programming languages. In this section we’ll
look at eliminating many of the simplifying assumptions from Example 6-1, to the point
that we can robustly process the Enron mailbox and take into account several of the
common issues that you’ll likely encounter with mailbox data from the wild. Python’s
tooling for mboxes is included in its standard library, and the script in Example 6-3
introduces a means of converting mbox data to a line-delimited JSON format that can
be imported into a document-oriented database such as MongoDB. We’ll talk more
about MongoDB and why it’s such a great fit for storing content such as mail data in a
moment, but for now, it’s sufficient to know that it stores data in what’s conceptually a
JSON-like format and provides some powerful capabilities for indexing and manipulating
the data.
One additional accommodation that we make for MongoDB is that we normalize the
date of each message to a standard epoch format that’s the number of milliseconds since
January 1, 1970, and pass it in with a special hint so that MongoDB can interpret each
date field in a standardized way. Although we could have done this after we loaded the
data into MongoDB, this chore falls into the “data cleansing” category and enables us
to run some queries that use the Date field of each mail message in a consistent way
immediately after the data is loaded.
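The date normalization can be sketched on its own as follows. This Python 3 version uses only the standard library and honors the UTC offset in the header, which is a design choice of this sketch rather than a description of Example 6-3. The $date key is the hint that tells MongoDB's import tooling to treat the value as a date; to_mongo_date is an illustrative name, and treating dates without a usable offset as UTC is an assumption.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def to_mongo_date(date_header):
    """Convert an RFC 2822 Date header to {'$date': <milliseconds since the epoch>}."""
    when = parsedate_to_datetime(date_header)
    if when.tzinfo is None:
        # Headers with a "-0000" offset parse as naive datetimes; assume UTC.
        when = when.replace(tzinfo=timezone.utc)
    return {"$date": int(when.timestamp() * 1000)}


doc = to_mongo_date("Fri, 7 Dec 2001 10:06:42 -0800 (PST)")
print(doc)
```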
Finally, in order to actually get the data to import into MongoDB, we need to write out
a file in which each line contains a single JSON object, per MongoDB’s documentation.
Once again, although not interesting from the standpoint of analysis, this script illustrates
some additional realities in data cleansing and processing: namely, that mail data
may not be in a particular encoding like UTF-8 and may contain HTML formatting that
needs to be stripped out.
Example 6-3 calls decode('utf-8', 'ignore') in
several places. When you’re working with text-based data such as
emails or web pages, it’s not at all uncommon to run into the infamous
UnicodeDecodeError because of unexpected character encodings,
and it’s not always immediately obvious what’s going on or how
to fix the problem. You can call the decode method on any string value
and pass it a second argument that specifies what to do in the event of
a UnicodeDecodeError. The default value is 'strict', which results
in the exception being raised, but you can use 'ignore' or 'replace'
instead, depending on your needs.
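The three error-handling modes are easy to compare side by side. The sketch below uses Python 3, where decode is a method of bytes objects rather than of str; the sample bytes are made up for illustration and contain one Latin-1 byte that is not valid UTF-8.

```python
raw = b"caf\xe9 au lait"  # 0xE9 is Latin-1 for an accented e, but invalid UTF-8 here

ignored = raw.decode("utf-8", "ignore")    # drops the offending byte
replaced = raw.decode("utf-8", "replace")  # substitutes U+FFFD for it

try:
    raw.decode("utf-8")  # the default mode is 'strict'
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(ascii(ignored))
print(ascii(replaced))
print(strict_failed)
```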
Example 6-3. Converting an mbox to a JSON structure suitable for import into
MongoDB
import sys
import mailbox
import email
import quopri
import json
import time
from BeautifulSoup import BeautifulSoup
from dateutil.parser import parse

MBOX = 'resources/ch06-mailboxes/data/enron.mbox'
OUT_FILE = 'resources/ch06-mailboxes/data/enron.mbox.json'

def cleanContent(msg):

    # Decode message from "quoted printable" format
    msg = quopri.decodestring(msg)

    # Strip out HTML tags, if any are present.
    # Bail on unknown encodings if errors happen in BeautifulSoup.
    try:
        soup = BeautifulSoup(msg)
    except:
        return ''
    return ''.join(soup.findAll(text=True))

# There's a lot of data to process, and the Pythonic way to do it is with a
# generator. See http://wiki.python.org/moin/Generators.
# Using a generator requires a trivial encoder to be passed to json for object
# serialization.

class Encoder(json.JSONEncoder):
    def default(self, o): return list(o)

# The generator itself...
def gen_json_msgs(mb):
    while 1:
        msg = mb.next()
        if msg is None:
            break
        yield jsonifyMessage(msg)

def jsonifyMessage(msg):
    json_msg = {'parts': []}
    for (k, v) in msg.items():
        json_msg[k] = v.decode('utf-8', 'ignore')

    # The To, Cc, and Bcc fields, if present, could have multiple items.
    # Note that not all of these fields are necessarily defined.

    for k in ['To', 'Cc', 'Bcc']:
        if not json_msg.get(k):
            continue
        json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r', '')\
                                 .replace(' ', '').decode('utf-8', 'ignore').split(',')

    for part in msg.walk():
        json_part = {}
        if part.get_content_maintype() == 'multipart':
            continue

        json_part['contentType'] = part.get_content_type()
        content = part.get_payload(decode=False).decode('utf-8', 'ignore')
        json_part['content'] = cleanContent(content)

        json_msg['parts'].append(json_part)

    # Finally, convert date from asctime to milliseconds since epoch using the
    # $date descriptor so it imports "natively" as an ISODate object in MongoDB
    then = parse(json_msg['Date'])
    millis = int(time.mktime(then.timetuple())*1000 + then.microsecond/1000)
    json_msg['Date'] = {'$date' : millis}

    return json_msg

mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)

# Write each message out as a JSON object on a separate line
# for easy import into MongoDB via mongoimport

f = open(OUT_FILE, 'w')
for msg in gen_json_msgs(mbox):
    if msg is not None:
        f.write(json.dumps(msg, cls=Encoder) + '\n')
f.close()
There’s always more data cleansing that we could do, but we’ve addressed some of the
most common issues, including a primitive mechanism for decoding quoted-printable
text and stripping out HTML tags. (The quopri package is used to handle the quoted-printable
format, an encoding used to transfer 8-bit content over a 7-bit channel; see footnote 2.)
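The two cleansing steps can be tried in isolation with the standard library alone. The Python 3 sketch below keeps quopri for the quoted-printable decoding but substitutes html.parser for BeautifulSoup, which is an assumption made here to avoid a third-party dependency; TextExtractor and clean_content are illustrative names, and the input bytes are made up.

```python
import quopri
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean_content(raw_bytes):
    """Decode quoted-printable bytes, then strip any HTML tags."""
    decoded = quopri.decodestring(raw_bytes).decode("utf-8", "ignore")
    extractor = TextExtractor()
    extractor.feed(decoded)
    extractor.close()
    return "".join(extractor.chunks)


# "=3D" is the quoted-printable escape for a literal "=" character.
cleaned = clean_content(b"<p>1 + 1 =3D 2, see <b>above</b></p>")
print(cleaned)
```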
Following is one line of pretty-printed sample output from running Example 6-3 on the
Enron mbox file, to demonstrate the basic form of the output:
{
  "Content-Transfer-Encoding": "7bit",
  "Content-Type": "text/plain; charset=us-ascii",
  "Date": {
    "$date": 988145040000
  },
  "From": "craig_estes@enron.com",
  "Message-ID": "<24537021.1075840152262.JavaMail.evans@thyme>",
  "Mime-Version": "1.0",
  "Subject": "Parent Child Mountain Adventure, July 21-25, 2001",
  "X-FileName": "jskillin.pst",
  "X-Folder": "\\jskillin\\Inbox",
  "X-From": "Craig_Estes",
  "X-Origin": "SKILLING-J",
  "X-To": "",
  "X-bcc": "",
  "X-cc": "",
  "parts": [
    {
      "content": "Please respond to Keith_Williams...",
      "contentType": "text/plain"
    }
  ]
}
2. See Wikipedia for an overview, or RFC 2045 if you are interested in the nuts and bolts of how this works.
This short script does a pretty decent job of removing some of the noise, parsing out
the most pertinent information from an email, and constructing a data file that we can
now trivially import into MongoDB. This is where the real fun begins. With your newfound
ability to cleanse and process mail data into an accessible format, the urge to start
analyzing it is only natural. In the next section, we’ll import the data into MongoDB
and begin the data analysis.
If you opted not to download the original Enron data and follow along
with the preprocessing steps, you can still produce the output from
Example 6-3 by following along with the notes in the IPython Notebook
for this chapter and proceed from here per the standard discussion
that continues.
6.2.5. Importing a JSONified Mail Corpus into MongoDB
Using the right tool for the job can significantly streamline the effort involved in analyzing
data, and although Python is a language that would make it fairly simple to process
JSON data, it still wouldn’t be nearly as easy as storing the JSON data in a document-oriented
database like MongoDB.
For all practical purposes, think of MongoDB as a database that makes storing and
manipulating JSON just about as easy as it should be. You can organize it into collections,
iterate over it and query it in efficient ways, full-text index it, and much more. In the
current context of analyzing the Enron corpus, MongoDB provides a natural API into
the data since it allows us to create indexes and query on arbitrary fields of the JSON
documents, even performing a full-text search if desired.
For our exercises, you’ll just be running an instance of MongoDB on your local machine,
but you can also scale MongoDB across a cluster of machines as your data grows. It
comes with great administration utilities, and it’s backed by a professional services
company should you need pro support. A full-blown discussion about MongoDB is
outside the scope of this book, but it should be straightforward enough to follow along
with this section even if you’ve never heard of MongoDB until reading this chapter. Its
online documentation and tutorials are superb, so take a moment to bookmark them
since they make such a handy reference.
Regardless of your operating system, should you choose to install MongoDB instead of
using the virtual machine, you should be able to follow the instructions online easily
enough; nice packaging for all major platforms is available. Just make sure that you are
using version 2.4 or higher since some of the exercises in this chapter rely on full-text
indexing, which is a new beta feature introduced in version 2.4. For reference, the
MongoDB instance that is preinstalled with the virtual machine is managed as a service
with no particular customization aside from setting a parameter in its configuration
file (located at /etc/mongodb.conf) to enable full-text search indexing.
Verify that the Enron data is loaded, full-text indexed, and ready for analysis by executing
Examples 6-4, 6-5, and 6-6. These examples take advantage of a lightweight
wrapper around the subprocess package called Envoy, which allows you to easily execute
terminal commands from a Python program and get the standard output and
standard error. Per the standard protocol, you can install envoy with pip install envoy
from a terminal.
Example 6-4. Getting the options for the mongoimport command from IPython
Notebook
import envoy # pip install envoy
r = envoy.run('mongoimport')
print r.std_out
print r.std_err
Example 6-5. Using mongoimport to load data into MongoDB from IPython Notebook
import os
import sys
import envoy
data_file = os.path.join(os.getcwd(), 'resources/ch06-mailboxes/data/enron.mbox.json')
# Run a command just as you would in a terminal on the virtual machine to
# import the data file into MongoDB.
r = envoy.run('mongoimport --db enron --collection mbox ' + \
              '--file %s' % data_file)
# Print its standard output
print r.std_out
# Write its standard error to this process's stderr
sys.stderr.write(r.std_err)
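Since Envoy is only a thin wrapper around subprocess, the same import can also be sketched with the standard library alone. The following is a minimal Python 3 sketch rather than the approach used in this chapter; the helper names build_mongoimport_args and run_command are hypothetical, and the actual import assumes that mongoimport is on your PATH. Passing the arguments as a list avoids any shell quoting problems with the file path.

```python
import subprocess
import sys


def build_mongoimport_args(db, collection, data_file):
    """Return the argument list for a mongoimport call (hypothetical helper)."""
    return ['mongoimport', '--db', db, '--collection', collection,
            '--file', data_file]


def run_command(args):
    """Run a command without a shell; return (exit code, stdout, stderr)."""
    completed = subprocess.run(args,
                               stdout=subprocess.PIPE,
                               stderr=subprocess.PIPE,
                               universal_newlines=True)
    return completed.returncode, completed.stdout, completed.stderr


# Build the same command line that Example 6-5 passes to envoy.run.
args = build_mongoimport_args(
    'enron', 'mbox', 'resources/ch06-mailboxes/data/enron.mbox.json')
print(' '.join(args))

# Demonstrate run_command with the Python interpreter as a stand-in, since
# mongoimport may not be installed. To perform the real import, call
# run_command(args) instead.
code, out, err = run_command([sys.executable, '-c', 'print("hello")'])
print(code, out.strip())
```

A nonzero exit code, or any text in the captured standard error, is the first place to look if the import does not behave as expected.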
Example 6-6. Simulating a MongoDB shell that you can run from within IPython
Notebook
# We can even simulate a MongoDB shell using envoy to execute commands.
# For example, let's get some stats out of MongoDB just as though we were working
# in a shell by passing it the command and wrapping it in a printjson function to
# display it for us.
def mongo(db, cmd):
    r = envoy.run("mongo %s --eval 'printjson(%s)'" % (db, cmd,))
    print r.std_out
    if r.std_err: print r.std_err

mongo('enron', 'db.mbox.stats()')
Sample output from Example 6-6 follows and illustrates that it’s exactly what you’d see
if you were writing commands in the MongoDB shell. Neat!
MongoDB shell version: 2.4.3
connecting to: enron
{
    "ns" : "enron.mbox",
    "count" : 41299,
    "size" : 157744000,
    "avgObjSize" : 3819.5597956366983,
    "storageSize" : 185896960,
    "numExtents" : 10,
    "nindexes" : 1,
    "lastExtentSize" : 56438784,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 0,
    "totalIndexSize" : 1349040,
    "indexSizes" : {
        "_id_" : 1349040
    },
    "ok" : 1
}
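To get a feel for what these numbers mean, it can help to relate a few of them to one another. The following short sketch uses values copied by hand from the sample output; it relies on the understanding that size is the total size of the documents in bytes, count is the number of documents, and avgObjSize is simply size divided by count. The variable names are illustrative only.

```python
# Values copied by hand from the sample db.mbox.stats() output.
stats = {
    'count': 41299,
    'size': 157744000,
    'storageSize': 185896960,
    'totalIndexSize': 1349040,
}

BYTES_PER_MB = 1024 * 1024

# Average document size in bytes: total data size divided by document count.
avg_obj_size = stats['size'] / float(stats['count'])

# Data size versus the storage allocated for the collection, in megabytes.
size_mb = stats['size'] / float(BYTES_PER_MB)
storage_mb = stats['storageSize'] / float(BYTES_PER_MB)

print('average document size: %.1f bytes' % avg_obj_size)
print('data size: %.1f MB, allocated storage: %.1f MB' % (size_mb, storage_mb))
```

The computed average agrees with the avgObjSize field that MongoDB reports, which is a quick sanity check that you are reading the statistics correctly.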
Loading the JSON data through a terminal session on the virtual machine
can be accomplished through mongoimport in exactly the same
fashion as illustrated in Example 6-5 with the following command:
mongoimport --db enron --collection mbox --file \
/home/vagrant/share/ipynb/resources/ch06-mailboxes/data/enron.mbox.json
Once MongoDB is installed, the final administrative task you’ll need to perform is installing
the Python client package pymongo via the usual pip install pymongo command,
since we’ll soon be using a Python client to connect to MongoDB and access the
Enron data.
Be advised that MongoDB supports only databases of up to 2 GB in
size for 32-bit systems. Although this limitation is not likely to be an
issue for the Enron data set that we’re working with in this chapter,
you may want to take note of it in case any of the machines you commonly
work on are 32-bit systems.
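If you are unsure whether a given machine falls into this category, a quick check from Python can give you a hint. The following is a minimal sketch using only the standard library. Note the assumption: it reports the word size of the running Python interpreter and the machine type that the operating system reports, which usually (but not always) match the MongoDB build that is installed, so treat the result as a hint rather than as proof.

```python
import platform
import struct

# Size of a C pointer, in bits, for the running Python interpreter:
# 32 on a 32-bit build and 64 on a 64-bit build.
pointer_bits = struct.calcsize('P') * 8

print('Python interpreter: %d-bit' % pointer_bits)
print('Machine type reported by the OS: %s' % platform.machine())
```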
6.2.5.1. The MongoDB shell
Although we are programmatically using Python for our exercises in this chapter, MongoDB
has a shell that can be quite convenient if you are comfortable working in a
terminal, and this brief section introduces you to it. If you are taking advantage of the
virtual machine experience for this book, you will need to log into the virtual machine
over a secure shell session in order to follow along. Typing vagrant ssh from inside
the top-level checkout folder containing your Vagrantfile automatically logs you into
the virtual machine.
If you run Mac OS X or Linux, an SSH client will already exist on your system and
vagrant ssh will just work. If you are a Windows user and followed the instructions
in Appendix A recommending the installation of Git for Windows, which provides an
SSH client, vagrant ssh will also work so long as you explicitly opt to install the SSH
client as part of the installation process. If you are a Windows user and prefer to use
PuTTY, typing vagrant ssh provides some instructions on how to configure it:
$ vagrant ssh
Last login: Sat Jun 1 04:18:57 2013 from 10.0.2.2
vagrant@precise64:~$ mongo
MongoDB shell version: 2.4.3
connecting to: test
> show dbs
enron 0.953125GB
local 0.078125GB
> use enron
switched to db enron
> db.mbox.stats()
{
    "ns" : "enron.mbox",
    "count" : 41300,
    "size" : 157756112,
    "avgObjSize" : 3819.7605811138014,
    "storageSize" : 174727168,
    "numExtents" : 11,
    "nindexes" : 2,
    "lastExtentSize" : 50798592,
    "paddingFactor" : 1,
    "systemFlags" : 0,
    "userFlags" : 1,
    "totalIndexSize" : 221471488,
    "indexSizes" : {
        "_id_" : 1349040,
        "TextIndex" : 220122448
    },
    "ok" : 1
}
> db.mbox.findOne()
{
    "_id" : ObjectId("51968affaada66efc5694cb7"),
    "X-cc" : "",
    "From" : "heather.dunton@enron.com",
    "X-Folder" : "\\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\Inbox",
    "Content-Transfer-Encoding" : "7bit",
    "X-bcc" : "",
    "X-Origin" : "Allen-P",
    "To" : [
        "k..allen@enron.com"
    ],
    "parts" : [
        {
            "content" : " \nPlease let me know if you still need...",
            "contentType" : "text/plain"
        }
    ],
    "X-FileName" : "pallen (Non-Privileged).pst",
    "Mime-Version" : "1.0",
    "X-From" : "Dunton, Heather </O=ENRON/OU=NA/CN=RECIPIENTS/CN=HDUNTON>",
    "Date" : ISODate("2001-12-07T16:06:42Z"),
    "X-To" : "Allen, Phillip K. </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Pallen>",
    "Message-ID" : "<16159836.1075855377439.JavaMail.evans@thyme>",
    "Content-Type" : "text/plain; charset=us-ascii",
    "Subject" : "RE: West Position"
}
The commands in this shell session showed the available databases, set the working
database to enron, displayed the statistics for the mbox collection in enron, and fetched
an arbitrary document for display. We won’t spend more time in the MongoDB shell in this
chapter, but you’ll likely find it useful as you work with data, so it seemed appropriate to
briefly introduce you to it. See “The Mongo Shell” in MongoDB’s online documentation for
details about the capabilities of the MongoDB shell.
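Before moving on, it’s worth a moment to consider how a document like the one returned by findOne looks once it reaches Python. The following sketch builds an abridged copy of that document by hand as a plain dictionary and pulls out the fields that matter most for the analysis to come. It assumes, as the layout of the shell output suggests, that To holds a list of addresses and parts holds a list of subdocuments; the Date value is written as a datetime because that is how PyMongo returns ISODate values. The summarize function is a hypothetical helper, not part of any library.

```python
from datetime import datetime

# A hand-built, abridged copy of the document shown by db.mbox.findOne().
message = {
    'From': 'heather.dunton@enron.com',
    'To': ['k..allen@enron.com'],
    'Subject': 'RE: West Position',
    'Date': datetime(2001, 12, 7, 16, 6, 42),
    'parts': [
        {'content': ' \nPlease let me know if you still need...',
         'contentType': 'text/plain'},
    ],
}


def summarize(msg):
    """Return the sender, recipients, timestamp, and plain-text body."""
    text_parts = [part['content'] for part in msg.get('parts', [])
                  if part.get('contentType') == 'text/plain']
    return {
        'sender': msg['From'],
        'recipients': list(msg.get('To', [])),
        'sent': msg['Date'].isoformat(),
        'text': '\n'.join(text_parts).strip(),
    }


summary = summarize(message)
print(summary['sender'], '->', ', '.join(summary['recipients']))
print(summary['sent'], summary['text'])
```

Keeping the recipients as a list, rather than as a single string, is what makes questions such as “who sends mail to whom” straightforward to answer later on.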
6.2.6. Programmatically Accessing MongoDB with Python
With MongoDB successfully loaded with the Enron corpus (or any other data, for that
matter), you’ll want to access and manipulate it with a programming language.
MongoDB is sure to please with a broad selection of libraries for many programming
languages, including PyMongo, the recommended way to work with MongoDB from
Python. A pip install pymongo should get PyMongo ready to use; Example 6-7 contains
a simple script to show how it works. Queries are serviced by MongoDB’s versatile
find function, which you’ll want to get acquainted with since it’s the basis of most queries
you’ll perform with MongoDB.
Example 6-7. Using PyMongo to access MongoDB from Python
import json
import pymongo # pip install pymongo
from bson import json_util # Comes with pymongo