Mailbox in Unix

how to send email from unix shell script and how to use mail command in unix
MattGates,United States,Professional
Published Date:02-08-2017
Mining Mailboxes: Analyzing Who’s Talking

Mail archives are arguably the ultimate kind of social web data and the basis of the earliest online social networks. Mail data is ubiquitous, and each message is inherently social, involving conversations and interactions among two or more people. Furthermore, each message consists of human language data that’s inherently expressive, and is laced with structured metadata fields that anchor the human language data in particular timespans and unambiguous identities. Mining mailboxes certainly provides an opportunity to synthesize all of the concepts you’ve learned in previous chapters and opens up incredible opportunities for discovering valuable insights.

Whether you are the CIO of a corporation and want to analyze corporate communications for trends and patterns, you have a keen interest in mining online mailing lists for insights, or you’d simply like to explore your own mailbox for patterns as part of quantifying yourself, the following discussion provides a primer to help you get started. This chapter introduces some fundamental tools and techniques for exploring mailboxes to answer questions such as:

• Who sends mail to whom (and how much/often)?
• Is there a particular time of the day (or day of the week) when the most mail chatter happens?
• Which people send the most messages to one another?
• What are the subjects of the liveliest discussion threads?

Although social media sites are racking up petabytes of near-real-time social data, there is still the significant drawback that social networking data is centrally managed by a service provider that gets to create the rules about exactly how you can access it and what you can and can’t do with it. Mail archives, on the other hand, are decentralized and scattered across the Web in the form of rich mailing list discussions about a litany of topics, as well as the many thousands of messages that people have tucked away in their own accounts.
When you take a moment to think about it, it seems as though being able to effectively mine mail archives could be one of the most essential capabilities in your data mining toolbox.

Although it’s not always easy to find realistic social data sets for purposes of illustration, this chapter showcases the fairly well-studied Enron corpus as its basis in order to maximize the opportunity for analysis without introducing any legal or privacy concerns. [1] We’ll standardize the data set into the well-known Unix mailbox (mbox) format so that we can employ a common set of tools to process it. Finally, although we could just opt to process the data in a JSON format that we store in a flat file, we’ll take advantage of the inherently document-centric nature of a mail message and learn how to use MongoDB to store and analyze the data in a number of powerful and interesting ways, including various types of frequency analysis and keyword search.

As a general-purpose database for storing and querying arbitrary JSON data, MongoDB is hard to beat, and it’s a powerful and versatile tool that you’ll want to have on hand for a variety of circumstances. (Although we’ve opted to avoid the use of external dependencies such as databases until this chapter, when it has all but become a necessity given the nature of our subject matter here, you’ll soon realize that you could use MongoDB to store any of the social web data we’ve been retrieving and accessing as flat JSON files.)

Always get the latest bug-fixed source code for this chapter (and every other chapter) online at Web2E. Be sure to also take advantage of this book’s virtual machine experience, as described in Appendix A, to maximize your enjoyment of the sample code.

6.1. Overview

Mail data is incredibly rich and presents opportunities for analysis involving everything you’ve learned about so far in this book.
In this chapter you’ll learn about:

• The process of standardizing mail data to a convenient and portable format
• MongoDB, a powerful document-oriented database that is ideal for storing mail and other forms of social web data
• The Enron corpus, a public data set consisting of the contents of employee mailboxes from around the time of the Enron scandal
• Using MongoDB to query the Enron corpus in arbitrary ways
• Tools for accessing and exporting your own mailbox data for analysis

[1] Should you want to analyze mailing list data, be advised that most service providers (such as Google and Yahoo) restrict your use of mailing list data if you retrieve it using their APIs, but you can easily enough collect and archive mailing list data yourself by subscribing to a list and waiting for your mailbox to start filling up. You might also be able to ask the list owner or members of the list to provide you with an archive as another option.

Chapter 6: Mining Mailboxes: Analyzing Who’s Talking to Whom About What, How Often, and More

6.2. Obtaining and Processing a Mail Corpus

This section illustrates how to obtain a mail corpus, convert it into a standardized mbox, and then import the mbox into MongoDB, which will serve as a general-purpose API for storing and querying the data. We’ll start out by analyzing a small fictitious mailbox and then proceed to processing the Enron corpus.

6.2.1. A Primer on Unix Mailboxes

An mbox is really just a large text file of concatenated mail messages that are easily accessible by text-based tools. Mail tools and protocols have long since evolved beyond mboxes, but it’s usually the case that you can use this format as a lowest common denominator to easily process the data and feel confident that if you share or distribute the data it’ll be just as easy for someone else to process it.
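To make the idea of "concatenated messages in one text file" concrete, here is a minimal sketch that writes two short messages into an mbox and prints the raw file. It is written for Python 3 and uses the standard library's mailbox.mbox class rather than the Python 2 calls used in this chapter's listings; the addresses, subjects, and the toy.mbox filename are made up for illustration.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

# Hypothetical messages, loosely modeled on a North Pole mailbox
TOY_MESSAGES = [
    ('buddy@northpole.example.org', 'Tonight',
     'Last batch of toys was just loaded onto sleigh.'),
    ('santa@northpole.example.org', 'RE: Tonight',
     'Sounds good. See you at the usual location.'),
]

def build_toy_mbox(path):
    """Append each toy message to the mbox file at `path`."""
    box = mailbox.mbox(path)  # creates the file if it does not exist
    try:
        for sender, subject, body in TOY_MESSAGES:
            msg = EmailMessage()
            msg['From'] = sender
            msg['Subject'] = subject
            msg.set_content(body)
            box.add(msg)  # the library writes the "From " separator line
        box.flush()
    finally:
        box.close()

toy_path = os.path.join(tempfile.mkdtemp(), 'toy.mbox')
build_toy_mbox(toy_path)

with open(toy_path) as f:
    raw_text = f.read()

# Every message is introduced by a separator line that begins with "From "
separator_lines = [line for line in raw_text.splitlines()
                   if line.startswith('From ')]

print(raw_text)
print('%d messages found' % len(separator_lines))
```

Because no envelope information is supplied here, the library fills in a placeholder sender on each separator line; a real export from a mail client would normally carry the actual sender address.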
In fact, most mail clients provide an “export” or “save as” option to export data to this format (even though the verbiage may vary), as illustrated in Figure 6-2 in Section 6.5.

In terms of specification, the beginning of each message in an mbox is signaled by a special From_ line formatted to the pattern "From sender asctime", where sender is the address of the message’s sender and asctime is a standardized fixed-width representation of a timestamp in the form Fri Dec 25 00:06:42 2009. The boundary between messages is determined by a From_ line preceded (except for the first occurrence) by exactly two newline characters. (Visually, as shown below, this appears as though there is a single blank line that precedes the From_ line.) A small slice from a fictitious mbox containing two messages follows (addresses and message identifiers are omitted from this sample):

    From Fri Dec 25 00:06:42 2009
    Message-ID:
    References:
    In-Reply-To:
    Date: Fri, 25 Dec 2001 00:06:42 -0000 (GMT)
    From: St. Nick
    To:
    Subject: RE: FWD: Tonight
    Mime-Version: 1.0
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: 7bit

    Sounds good. See you at the usual location.

    Thanks,
    -S

    -Original Message-
    From: Rudolph
    Sent: Friday, December 25, 2009 12:04 AM
    To: Claus, Santa
    Subject: FWD: Tonight

    Santa -

    Running a bit late. Will come grab you shortly. Standby.

    Rudy

    Begin forwarded message:

    > Last batch of toys was just loaded onto sleigh.
    >
    > Please proceed per the norm.
    >
    > Regards,
    > Buddy
    >
    > Buddy the Elf
    > Chief Elf
    > Workshop Operations
    > North Pole

    From Fri Dec 25 00:03:34 2009
    Message-ID:
    Date: Fri, 25 Dec 2001 00:03:34 -0000 (GMT)
    From: Buddy
    To:
    Subject: Tonight
    Mime-Version: 1.0
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: 7bit

    Last batch of toys was just loaded onto sleigh.

    Please proceed per the norm.
    Regards,
    Buddy

    Buddy the Elf
    Chief Elf
    Workshop Operations
    North Pole

In the preceding sample mailbox we see two messages, although there is evidence of at least one other message that was replied to that might exist elsewhere in the mbox. Chronologically, the first message was authored by a fellow named Buddy and was sent out to the workshop to announce that the toys had just been loaded. The other message in the mbox is a reply from Santa to Rudolph. Not shown in the sample mbox is an intermediate message in which Rudolph forwarded Buddy’s message to Santa with the note saying that he was running late.

Although we could infer these things by reading the text of the messages themselves as humans with contextualized knowledge, the Message-ID, References, and In-Reply-To headers also provide important clues that can be analyzed. These headers are pretty intuitive and provide the basis for algorithms that display threaded discussions and things of that nature. We’ll look at a well-known algorithm that uses these fields to thread messages a bit later, but the gist is that each message has a unique message ID, contains a reference to the exact message that is being replied to in the case of it being a reply, and can reference multiple other messages in the reply chain that are part of the larger discussion thread at hand.

Because we’ll be employing some Python modules to do much of the tedious work for us, we won’t need to digress into discussions concerning the nuances of email messages, such as multipart content, MIME types, and 7-bit content transfer encoding.

These headers are vitally important.
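The role that Message-ID and In-Reply-To play in grouping a conversation can be sketched in a few lines. This is a deliberately simplified stand-in written for Python 3, not the well-known threading algorithm mentioned above: it ignores the References header entirely, and the message IDs are invented for illustration (it also assumes, for simplicity, that the forwarded message carries an In-Reply-To header).

```python
# Hypothetical headers for three messages in one conversation; the IDs
# are made up and exist only to show how the links are followed.
messages = [
    {'Message-ID': '<id-buddy@example.org>',
     'In-Reply-To': None,
     'Subject': 'Tonight'},
    {'Message-ID': '<id-rudolph@example.org>',
     'In-Reply-To': '<id-buddy@example.org>',
     'Subject': 'FWD: Tonight'},
    {'Message-ID': '<id-santa@example.org>',
     'In-Reply-To': '<id-rudolph@example.org>',
     'Subject': 'RE: FWD: Tonight'},
]

def find_thread_root(message_id, parent_of):
    """Follow In-Reply-To links upward until no known parent remains."""
    seen = set()  # guards against accidental reference cycles
    while parent_of.get(message_id) and message_id not in seen:
        seen.add(message_id)
        message_id = parent_of[message_id]
    return message_id

def group_by_thread(msgs):
    """Map each thread's root message ID to the subjects in that thread."""
    parent_of = {m['Message-ID']: m['In-Reply-To'] for m in msgs}
    threads = {}
    for m in msgs:
        root = find_thread_root(m['Message-ID'], parent_of)
        threads.setdefault(root, []).append(m['Subject'])
    return threads

threads = group_by_thread(messages)
for root, subjects in threads.items():
    print(root, subjects)
```

If a parent message is missing from the mailbox, the walk simply stops at the missing ID, so replies to an absent message still end up grouped together under it.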
Even with this simple example, you can already see how things can get quite messy when you’re parsing the actual body of a message: Rudolph’s client quoted forwarded content with > characters, while the mail client Santa used to reply apparently didn’t quote anything, but instead included a human-readable message header.

Most mail clients have an option to display extended mail headers beyond the ones you normally see, if you’re interested in a technique that’s a little more accessible than digging into raw storage when you want to view this kind of information; Figure 6-1 shows sample headers as displayed by Apple Mail.

Figure 6-1. Most mail clients allow you to view the extended headers through an options menu

Luckily for us, there’s a lot you can do without having to essentially reimplement a mail client. Besides, if all you wanted to do was browse the mailbox, you’d simply import it into a mail client and browse away, right? It’s worth taking a moment to explore whether your mail client has an option to import/export data in the mbox format so that you can use the tools in this chapter to manipulate it.

To get the ball rolling on some data processing, Example 6-1 illustrates a routine that makes numerous simplifying assumptions about an mbox to introduce the mailbox and email packages that are part of Python’s standard library.

Example 6-1. Converting a toy mailbox to JSON

    # Written for Python 2: mailbox.UnixMailbox and the print statement
    # were removed in Python 3.
    import mailbox
    import email
    import json

    MBOX = 'resources/ch06-mailboxes/data/northpole.mbox'

    # A routine that makes a ton of simplifying assumptions
    # about converting an mbox message into a Python object
    # given the nature of the northpole.mbox file in order
    # to demonstrate the basic parsing of an mbox with mail
    # utilities

    def objectify_message(msg):

        # Map in fields from the message
        o_msg = dict((k, v) for (k, v) in msg.items())

        # Assume one part to the message and get its content
        # and its content type
        part = [p for p in msg.walk()][0]
        o_msg['contentType'] = part.get_content_type()
        o_msg['content'] = part.get_payload()

        return o_msg

    # Create an mbox that can be iterated over and transform each of its
    # messages to a convenient JSON representation

    mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)

    messages = []

    while 1:
        msg = mbox.next()
        if msg is None:
            break
        messages.append(objectify_message(msg))

    print json.dumps(messages, indent=1)

Although this little script for processing an mbox file seems pretty clean and produces reasonable results, trying to parse arbitrary mail data or determine the exact flow of a conversation from mailbox data for the general case can be a tricky enterprise. Many factors contribute to this, such as the ambiguity involved and the variation that can occur in how humans embed replies and comments into reply chains, how different mail clients handle messages and replies, etc. Table 6-1 illustrates the message flow and explicitly includes the third message that was referenced but not present in the northpole.mbox to highlight this point. Truncated sample output from the script follows:

    [
     {
      "From": "St. Nick",
      "Content-Transfer-Encoding": "7bit",
      "content": "Sounds good. See you at the usual location.\n\nThanks,...",
      "To": "",
      "References": "",
      "Mime-Version": "1.0",
      "In-Reply-To": "",
      "Date": "Fri, 25 Dec 2001 00:06:42 -0000 (GMT)",
      "contentType": "text/plain",
      "Message-ID": "",
      "Content-Type": "text/plain; charset=us-ascii",
      "Subject": "RE: FWD: Tonight"
     },
     {
      "From": "Buddy",
      "Subject": "Tonight",
      "Content-Transfer-Encoding": "7bit",
      "content": "Last batch of toys was just loaded onto sleigh. \n\nPlease...",
      "To": "",
      "Date": "Fri, 25 Dec 2001 00:03:34 -0000 (GMT)",
      "contentType": "text/plain",
      "Message-ID": "",
      "Content-Type": "text/plain; charset=us-ascii",
      "Mime-Version": "1.0"
     }
    ]

Table 6-1. Message flow from northpole.mbox

    Date                                     Message activity
    Fri, 25 Dec 2001 00:03:34 -0000 (GMT)    Buddy sends a message to the workshop
    Friday, December 25, 2009 12:04 AM       Rudolph forwards Buddy’s message to Santa with an additional note
    Fri, 25 Dec 2001 00:06:42 -0000 (GMT)    Santa replies to Rudolph

With a basic appreciation for mailboxes in place, let’s now shift our attention to converting the Enron corpus to an mbox so that we can leverage Python’s standard library as much as possible.

6.2.2. Getting the Enron Data

A downloadable form of the full Enron data set in a raw form is available in multiple formats requiring various amounts of processing. We’ll opt to start with the original raw form of the data set, which is essentially a set of folders that organizes a collection of mailboxes by person and folder. Data standardization and cleansing is a routine problem, and this section should give you some perspective and some appreciation for it.

If you are taking advantage of the virtual machine experience for this book, the IPython Notebook for this chapter provides a script that downloads the data to the proper working location for you to seamlessly follow along with these examples.
The full Enron corpus is approximately 450 MB in the compressed form in which you would download it to follow along with these exercises. It may take upward of 10 minutes to download and decompress if you have a reasonable Internet connection speed and a relatively new computer. Unfortunately, if you are using the virtual machine, the time that it takes for Vagrant to synchronize the thousands of files that are unarchived back to the host machine can be upward of two hours. If time is a significant factor and you can’t let this script run at an opportune time, you could opt to skip the download and initial processing steps since the refined version of the data, as produced from Example 6-3, is checked in with the source code and available at ipynb/resources/ch06-mailboxes/data/enron.mbox.json.bz2. See the notes in the IPython Notebook for this chapter for more details.

The download and decompression of the file is relatively fast compared to the time that it takes for Vagrant to synchronize the high number of files that decompress with the host machine, and at the time of this writing, there isn’t a known workaround that will speed this up for all platforms. It may take longer than an hour for Vagrant to synchronize the thousands of files that decompress.

The output from the following terminal session illustrates the basic structure of the corpus once you’ve downloaded and unarchived it. It’s worthwhile to explore the data in a terminal session for a few minutes once you’ve downloaded it to familiarize yourself with what’s there and learn how to navigate through it.

If you are working on a Windows system or are not comfortable working in a terminal, you can poke around in the ipynb/resources/ch06-mailboxes/data folder, which will be synchronized onto your host machine if you are taking advantage of the virtual machine experience for this book.
    $ cd enron_mail_20110402/maildir # Go into the mail directory

    maildir $ ls # Show folders/files in the current directory

    allen-p   crandell-s  gay-r       horton-s  lokey-t  nemec-g  rogers-b    slinger-r  tycholiz-b
    arnold-j  cuilla-m    geaccone-t  hyatt-k   love-p   panus-s  ruscitti-k  smith-m    ward-k
    arora-h   dasovich-j  germany-c   hyvl-d    lucci-p  parks-j  sager-e     solberg-g  watson-k
    badeer-r  corman-s    gang-l      holst-k   lokay-m

    ...listing truncated...

    neal-s    rodrique-r  skilling-j  townsend-j

    $ cd allen-p/ # Go into the allen-p folder

    allen-p $ ls # Show files in the current directory

    _sent_mail     contacts       discussion_threads  notes_inbox  sent_items
    all_documents  deleted_items  inbox               sent         straw

    allen-p $ cd inbox/ # Go into the inbox for allen-p

    inbox $ ls # Show the files in the inbox for allen-p

    1.   11.  13.  15.  17.  19.  20.  22.  24.  26.  28.  3.   31.  33.  35.  37.  39.
    40.  42.  44.  5.   62.  64.  66.  68.  7.   71.  73.  75.  79.  83.  85.  87.
    10.  12.  14.  16.  18.  2.   21.  23.  25.  27.  29.  30.  32.  34.  36.  38.  4.
    41.  43.  45.  6.   63.  65.  67.  69.  70.  72.  74.  78.  8.   84.  86.  9.

    inbox $ head -20 1. # Show the first 20 lines of the file named "1."

    Message-ID: <16159836.1075855377439.JavaMail.evans@thyme>
    Date: Fri, 7 Dec 2001 10:06:42 -0800 (PST)
    From:
    To:
    Subject: RE: West Position
    Mime-Version: 1.0
    Content-Type: text/plain; charset=us-ascii
    Content-Transfer-Encoding: 7bit
    X-From: Dunton, Heather /O=ENRON/OU=NA/CN=RECIPIENTS/CN=HDUNTON
    X-To: Allen, Phillip K. /O=ENRON/OU=NA/CN=RECIPIENTS/CN=Pallen
    X-cc:
    X-bcc:
    X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\Inbox
    X-Origin: Allen-P
    X-FileName: pallen (Non-Privileged).pst

    Please let me know if you still need Curve Shift.

    Thanks,

The final command in the terminal session shows that mail messages are organized into files and contain metadata in the form of headers that can be processed along with the content of the data itself. The data is in a fairly consistent format, but not necessarily a well-known format with great tools for processing it.
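The observation that each file is a block of headers followed by content can be checked with the standard library's email package. The following is a minimal Python 3 sketch, assuming a message shaped like the head -20 output above; the From and To addresses are placeholders, since the listing does not show them, and several of the X- headers are left out for brevity.

```python
import email

# Headers and body adapted from the sample message; the two addresses
# are hypothetical placeholders.
RAW_MESSAGE = """\
Date: Fri, 7 Dec 2001 10:06:42 -0800 (PST)
From: sender@example.com
To: recipient@example.com
Subject: RE: West Position
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Please let me know if you still need Curve Shift.

Thanks,
"""

msg = email.message_from_string(RAW_MESSAGE)

headers = dict(msg.items())  # header name -> header value
body = msg.get_payload()     # everything after the first blank line

print(headers['Subject'])
print(headers['X-Origin'])
print(msg.get_content_type())
print(body)
```

Once a file has been parsed this way, the headers become ordinary key/value pairs, which is what makes the later conversion steps mostly a matter of bookkeeping.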
So, let’s do some preprocessing on the data and convert a portion of it to the well-known Unix mbox format in order to illustrate the general process of standardizing a mail corpus to a format that is widely known and well tooled.

6.2.3. Converting a Mail Corpus to a Unix Mailbox

Example 6-2 illustrates an approach that searches the directory structure of the Enron corpus for folders named “inbox” and adds messages contained in them to a single output file that’s written out as enron.mbox. To run this script, you will need to download the Enron corpus and unarchive it to the path specified by MAILDIR in the script.

The script takes advantage of a package called dateutil to handle the parsing of dates into a standard format. We didn’t do this earlier, and it’s slightly trickier than it may sound given the room for variation in the general case. You can install this package with pip install python_dateutil. (In this particular instance, the package name that pip tries to install is slightly different than what you import in your code.) Otherwise, the script is just using some tools from Python’s standard library to munge the data into an mbox. Although not analytically interesting, the script provides reminders of how to use regular expressions, uses the email package that we’ll continue to see, and illustrates some other concepts that may be useful for general data processing. Be sure that you understand how the script works to broaden your overall working knowledge and data mining toolchain.

This script may take 10–15 minutes to run on the entire Enron corpus, depending on your hardware. IPython Notebook will indicate that it is still processing data by displaying a “Kernel Busy” message in the upper-right corner of the user interface.

Example 6-2. Converting the Enron corpus to a standardized mbox format

    import re
    import email
    from time import asctime
    import os
    import sys
    from dateutil.parser import parse # pip install python_dateutil

    # XXX: Download the Enron corpus to resources/ch06-mailboxes/data
    # and unarchive it there.

    MAILDIR = 'resources/ch06-mailboxes/data/enron_mail_20110402/' + \
              'enron_data/maildir'

    # Where to write the converted mbox
    MBOX = 'resources/ch06-mailboxes/data/enron.mbox'

    # Create a file handle that we'll be writing into...
    mbox = open(MBOX, 'w')

    # Walk the directories and process any folder named 'inbox'

    for (root, dirs, file_names) in os.walk(MAILDIR):

        if root.split(os.sep)[-1].lower() != 'inbox':
            continue

        # Process each message in 'inbox'

        for file_name in file_names:
            file_path = os.path.join(root, file_name)
            message_text = open(file_path).read()

            # Compute fields for the From_ line in a traditional mbox message

            _from = re.search(r"From: ([^\r]+)", message_text).groups()[0]
            _date = re.search(r"Date: ([^\r]+)", message_text).groups()[0]

            # Convert _date to the asctime representation for the From_ line

            _date = asctime(parse(_date).timetuple())

            msg = email.message_from_string(message_text)
            msg.set_unixfrom('From %s %s' % (_from, _date))

            mbox.write(msg.as_string(unixfrom=True) + "\n\n")

    mbox.close()

If you peek at the mbox file that we’ve just created, you’ll see that it looks quite similar to the mail format we saw earlier, except that it now conforms to well-known specifications and is a single file. Keep in mind that you could just as easily create separate mbox files for each individual person or a particular group of people if you preferred to analyze a more focused subset of the Enron corpus.

6.2.4. Converting Unix Mailboxes to JSON

Having an mbox file is especially convenient because of the variety of tools available to process it across computing platforms and programming languages.
In this section we’ll look at eliminating many of the simplifying assumptions from Example 6-1, to the point that we can robustly process the Enron mailbox and take into account several of the common issues that you’ll likely encounter with mailbox data from the wild. Python’s tooling for mboxes is included in its standard library, and the script in Example 6-3 introduces a means of converting mbox data to a line-delimited JSON format that can be imported into a document-oriented database such as MongoDB. We’ll talk more about MongoDB and why it’s such a great fit for storing content such as mail data in a moment, but for now, it’s sufficient to know that it stores data in what’s conceptually a JSON-like format and provides some powerful capabilities for indexing and manipulating the data.

One additional accommodation that we make for MongoDB is that we normalize the date of each message to a standard epoch format that’s the number of milliseconds since January 1, 1970, and pass it in with a special hint so that MongoDB can interpret each date field in a standardized way. Although we could have done this after we loaded the data into MongoDB, this chore falls into the “data cleansing” category and enables us to run some queries that use the Date field of each mail message in a consistent way immediately after the data is loaded.

Finally, in order to actually get the data to import into MongoDB, we need to write out a file in which each line contains a single JSON object, per MongoDB’s documentation. Once again, although not interesting from the standpoint of analysis, this script illustrates some additional realities in data cleansing and processing, namely, that mail data may not be in a particular encoding like UTF-8 and may contain HTML formatting that needs to be stripped out.

Example 6-3 includes the decode('utf-8', 'ignore') function in several places.
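The two cleansing steps just described, date normalization and tolerant decoding, can be tried in isolation before reading the full script. The following is a minimal Python 3 sketch rather than the chapter's own code: it swaps in email.utils.parsedate_to_datetime from the standard library for dateutil, uses a timezone-aware timestamp, and applies decode to a bytes value (in Python 3 only bytes have a decode method). The helper name to_mongo_date and the sample byte string are made up for illustration; the $date key is the hint that tells MongoDB to treat the number as a date.

```python
import json
from email.utils import parsedate_to_datetime

def to_mongo_date(date_header):
    """Turn an RFC 2822 Date header into {"$date": <ms since the epoch>}."""
    then = parsedate_to_datetime(date_header)
    return {'$date': int(then.timestamp() * 1000)}

# Hypothetical raw bytes ending in one byte (0xff) that is not valid UTF-8
raw_subject = b'RE: West Position \xff'

record = {
    # 'ignore' silently drops bytes that cannot be decoded
    'Subject': raw_subject.decode('utf-8', 'ignore'),
    'Date': to_mongo_date('Fri, 7 Dec 2001 10:06:42 -0800'),
}

# One JSON object per line is the shape a line-delimited import expects
json_line = json.dumps(record)
print(json_line)

# The same bytes under the other two error handlers
replaced = raw_subject.decode('utf-8', 'replace')  # inserts U+FFFD instead
try:
    raw_subject.decode('utf-8', 'strict')  # the default behavior
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(replaced), strict_failed)
```

Choosing between 'ignore' and 'replace' is a trade-off: the former keeps the text clean, while the latter leaves a visible marker where data was lost.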
When you’re working with text-based data such as emails or web pages, it’s not at all uncommon to run into the infamous UnicodeDecodeError because of unexpected character encodings, and it’s not always immediately obvious what’s going on or how to fix the problem. You can run the decode function on any string value and pass it a second argument that specifies what to do in the event of a UnicodeDecodeError. The default value is 'strict', which results in the exception being raised, but you can use 'ignore' or 'replace' instead, depending on your needs.

Example 6-3. Converting an mbox to a JSON structure suitable for import into MongoDB

    import sys
    import mailbox
    import email
    import quopri
    import json
    import time
    from BeautifulSoup import BeautifulSoup
    from dateutil.parser import parse

    MBOX = 'resources/ch06-mailboxes/data/enron.mbox'
    OUT_FILE = 'resources/ch06-mailboxes/data/enron.mbox.json'

    def cleanContent(msg):

        # Decode message from "quoted printable" format
        msg = quopri.decodestring(msg)

        # Strip out HTML tags, if any are present.
        # Bail on unknown encodings if errors happen in BeautifulSoup.
        try:
            soup = BeautifulSoup(msg)
        except Exception:
            return ''
        return ''.join(soup.findAll(text=True))

    # There's a lot of data to process, and the Pythonic way to do it is with a
    # generator. Using a generator requires a trivial encoder to be passed to
    # json for object serialization.

    class Encoder(json.JSONEncoder):
        def default(self, o):
            return list(o)

    # The generator itself...
    def gen_json_msgs(mb):
        while 1:
            msg = mb.next()
            if msg is None:
                break
            yield jsonifyMessage(msg)

    def jsonifyMessage(msg):
        json_msg = {'parts': []}
        for (k, v) in msg.items():
            json_msg[k] = v.decode('utf-8', 'ignore')

        # The To, Cc, and Bcc fields, if present, could have multiple items.
        # Note that not all of these fields are necessarily defined.

        for k in ['To', 'Cc', 'Bcc']:
            if not json_msg.get(k):
                continue
            json_msg[k] = json_msg[k].replace('\n', '').replace('\t', '').replace('\r', '')\
                                     .replace(' ', '').decode('utf-8', 'ignore').split(',')

        for part in msg.walk():
            json_part = {}
            if part.get_content_maintype() == 'multipart':
                continue

            json_part['contentType'] = part.get_content_type()
            content = part.get_payload(decode=False).decode('utf-8', 'ignore')
            json_part['content'] = cleanContent(content)

            json_msg['parts'].append(json_part)

        # Finally, convert date from asctime to milliseconds since epoch using the
        # $date descriptor so it imports "natively" as an ISODate object in MongoDB
        then = parse(json_msg['Date'])
        millis = int(time.mktime(then.timetuple())*1000 + then.microsecond/1000)
        json_msg['Date'] = {'$date' : millis}

        return json_msg

    mbox = mailbox.UnixMailbox(open(MBOX, 'rb'), email.message_from_file)

    # Write each message out as a JSON object on a separate line
    # for easy import into MongoDB via mongoimport

    f = open(OUT_FILE, 'w')
    for msg in gen_json_msgs(mbox):
        if msg is not None:
            f.write(json.dumps(msg, cls=Encoder) + '\n')
    f.close()

There’s always more data cleansing that we could do, but we’ve addressed some of the most common issues, including a primitive mechanism for decoding quoted-printable text and stripping out HTML tags. (The quopri package is used to handle the quoted-printable format, an encoding used to transfer 8-bit content over a 7-bit channel. [2]) Following is one line of pretty-printed sample output from running Example 6-3 on the Enron mbox file, to demonstrate the basic form of the output:

    {
      "Content-Transfer-Encoding": "7bit",
      "Content-Type": "text/plain; charset=us-ascii",
      "Date": {
        "$date": 988145040000
      },
      "From": "",
      "Message-ID": "<24537021.1075840152262.JavaMail.evans@thyme>",
      "Mime-Version": "1.0",
      "Subject": "Parent Child Mountain Adventure, July 21-25, 2001",
      "X-FileName": "jskillin.pst",
      "X-Folder": "\\jskillin\\Inbox",
      "X-From": "Craig_Estes",
      "X-Origin": "SKILLING-J",
      "X-To": "",
      "X-bcc": "",
      "X-cc": "",
      "parts": [
        {
          "content": "Please respond to Keith_Williams...",
          "contentType": "text/plain"
        }
      ]
    }

[2] See Wikipedia for an overview, or RFC 2045 if you are interested in the nuts and bolts of how this works.

This short script does a pretty decent job of removing some of the noise, parsing out the most pertinent information from an email, and constructing a data file that we can now trivially import into MongoDB. This is where the real fun begins. With your newfound ability to cleanse and process mail data into an accessible format, the urge to start analyzing it is only natural. In the next section, we’ll import the data into MongoDB and begin the data analysis.

If you opted not to download the original Enron data and follow along with the preprocessing steps, you can still produce the output from Example 6-3 by following along with the notes in the IPython Notebook for this chapter and proceed from here per the standard discussion that continues.

6.2.5. Importing a JSONified Mail Corpus into MongoDB

Using the right tool for the job can significantly streamline the effort involved in analyzing data, and although Python is a language that would make it fairly simple to process JSON data, it still wouldn’t be nearly as easy as storing the JSON data in a document-oriented database like MongoDB.
For all practical purposes, think of MongoDB as a database that makes storing and manipulating JSON just about as easy as it should be. You can organize it into collections, iterate over it and query it in efficient ways, full-text index it, and much more. In the current context of analyzing the Enron corpus, MongoDB provides a natural API into the data since it allows us to create indexes and query on arbitrary fields of the JSON documents, even performing a full-text search if desired.

For our exercises, you’ll just be running an instance of MongoDB on your local machine, but you can also scale MongoDB across a cluster of machines as your data grows. It comes with great administration utilities, and it’s backed by a professional services company should you need pro support. A full-blown discussion about MongoDB is outside the scope of this book, but it should be straightforward enough to follow along with this section even if you’ve never heard of MongoDB until reading this chapter. Its online documentation and tutorials are superb, so take a moment to bookmark them since they make such a handy reference.

Regardless of your operating system, should you choose to install MongoDB instead of using the virtual machine, you should be able to follow the instructions online easily enough; nice packaging for all major platforms is available. Just make sure that you are using version 2.4 or higher since some of the exercises in this chapter rely on full-text indexing, which is a new beta feature introduced in version 2.4. For reference, the MongoDB that is preinstalled with the virtual machine is installed and managed as a service with no particular customization aside from setting a parameter in its configuration file (located at /etc/mongodb.conf) to enable full-text search indexing.
Verify that the Enron data is loaded, full-text indexed, and ready for analysis by exe‐ cuting Examples 6-4, 6-5, and 6-6. These examples take advantage of a lightweight wrapper around the subprocess package called Envoy, which allows you to easily ex‐ ecute terminal commands from a Python program and get the standard output and standard error. Per the standard protocol, you can installenvoy withpip install en voy from a terminal. Example 6-4. Getting the options for the mongoimport command from IPython Notebook import envoy pip install envoy r ='mongoimport') print r.std_out print r.std_err Example 6-5. Using mongoimport to load data into MongoDB from IPython Notebook import os import sys import envoy data_file = os.path.join(os.getcwd(), 'resources/ch06-mailboxes/data/enron.mbox.json') Run a command just as you would in a terminal on the virtual machine to import the data file into MongoDB. r ='mongoimport db enron collection mbox ' + \ 'file %s' % data_file) Print its standard output print r.std_out print sys.stderr.write(r.std_err) Example 6-6. Simulating a MongoDB shell that you can run from within IPython Notebook We can even simulate a MongoDB shell using envoy to execute commands. For example, let's get some stats out of MongoDB just as though we were working in a shell by passing it the command and wrapping it in a printjson function to display it for us. def mongo(db, cmd): r ="mongo %s eval 'printjson(%s)'" % (db, cmd,)) print r.std_out if r.std_err: print r.std_err mongo('enron', 'db.mbox.stats()') 6.2. Obtaining and Processing a Mail Corpus 241 a Sample output from Example 6-6 follows and illustrates that it’s exactly what you’d see if you were writing commands in the MongoDB shell. 
Neat!
    MongoDB shell version: 2.4.3
    connecting to: enron
    {
        "ns" : "enron.mbox",
        "count" : 41299,
        "size" : 157744000,
        "avgObjSize" : 3819.5597956366983,
        "storageSize" : 185896960,
        "numExtents" : 10,
        "nindexes" : 1,
        "lastExtentSize" : 56438784,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 1349040,
        "indexSizes" : {
            "_id_" : 1349040
        },
        "ok" : 1
    }
Loading the JSON data through a terminal session on the virtual machine can be accomplished through mongoimport in exactly the same fashion as illustrated in Example 6-5 with the following command:
    mongoimport --db enron --collection mbox --file /home/vagrant/share/ipynb/resources/ch06-mailboxes/data/enron.mbox.json
Once MongoDB is installed, the final administrative task you’ll need to perform is installing the Python client package pymongo via the usual pip install pymongo command, since we’ll soon be using a Python client to connect to MongoDB and access the Enron data.
Be advised that MongoDB supports only databases of up to 2 GB in size for 32-bit systems. Although this limitation is not likely to be an issue for the Enron data set that we’re working with in this chapter, you may want to take note of it in case any of the machines you commonly work on are 32-bit systems.
The MongoDB shell
Although we are programmatically using Python for our exercises in this chapter, MongoDB has a shell that can be quite convenient if you are comfortable working in a terminal, and this brief section introduces you to it. If you are taking advantage of the virtual machine experience for this book, you will need to log into the virtual machine over a secure shell session in order to follow along. Typing vagrant ssh from inside the top-level checkout folder containing your Vagrantfile automatically logs you into the virtual machine.
If you run Mac OS X or Linux, an SSH client will already exist on your system and vagrant ssh will just work. If you are a Windows user and followed the instructions in Appendix A recommending the installation of Git for Windows, which provides an SSH client, vagrant ssh will also work so long as you explicitly opt to install the SSH client as part of the installation process. If you are a Windows user and prefer to use PuTTY, typing vagrant ssh provides some instructions on how to configure it:
    $ vagrant ssh
    Last login: Sat Jun 1 04:18:57 2013 from
    vagrant@precise64:~$ mongo
    MongoDB shell version: 2.4.3
    connecting to: test
    > show dbs
    enron 0.953125GB
    local 0.078125GB
    > use enron
    switched to db enron
    > db.mbox.stats()
    {
        "ns" : "enron.mbox",
        "count" : 41300,
        "size" : 157756112,
        "avgObjSize" : 3819.7605811138014,
        "storageSize" : 174727168,
        "numExtents" : 11,
        "nindexes" : 2,
        "lastExtentSize" : 50798592,
        "paddingFactor" : 1,
        "systemFlags" : 0,
        "userFlags" : 1,
        "totalIndexSize" : 221471488,
        "indexSizes" : {
            "_id_" : 1349040,
            "TextIndex" : 220122448
        },
        "ok" : 1
    }
    > db.mbox.findOne()
    {
        "_id" : ObjectId("51968affaada66efc5694cb7"),
        "X-cc" : "",
        "From" : "",
        "X-Folder" : "\\Phillip_Allen_Jan2002_1\\Allen, Phillip K.\\Inbox",
        "Content-Transfer-Encoding" : "7bit",
        "X-bcc" : "",
        "X-Origin" : "Allen-P",
        "To" : "",
        "parts" : [
            {
                "content" : " \nPlease let me know if you still need...",
                "contentType" : "text/plain"
            }
        ],
        "X-FileName" : "pallen (Non-Privileged).pst",
        "Mime-Version" : "1.0",
        "X-From" : "Dunton, Heather /O=ENRON/OU=NA/CN=RECIPIENTS/CN=HDUNTON",
        "Date" : ISODate("2001-12-07T16:06:42Z"),
        "X-To" : "Allen, Phillip K. /O=ENRON/OU=NA/CN=RECIPIENTS/CN=Pallen",
        "Message-ID" : "16159836.1075855377439.JavaMail.evansthyme",
        "Content-Type" : "text/plain; charset=us-ascii",
        "Subject" : "RE: West Position"
    }
The commands in this shell session showed the available databases, set the working database to enron, displayed statistics for the mbox collection in enron, and fetched an arbitrary document for display. We won’t spend more time in the MongoDB shell in this chapter, but you’ll likely find it useful as you work with data, so it seemed appropriate to briefly introduce you to it. See “The Mongo Shell” in MongoDB’s online documentation for details about the capabilities of the MongoDB shell.
6.2.6. Programmatically Accessing MongoDB with Python
With MongoDB successfully loaded with the Enron corpus (or any other data, for that matter), you’ll want to access and manipulate it with a programming language. MongoDB is sure to please with a broad selection of libraries for many programming languages, including PyMongo, the recommended way to work with MongoDB from Python. A pip install pymongo should get PyMongo ready to use; Example 6-7 contains a simple script to show how it works. Queries are serviced by MongoDB’s versatile find function, which you’ll want to get acquainted with since it’s the basis of most queries you’ll perform with MongoDB.
Example 6-7. Using PyMongo to access MongoDB from Python
    import json
    import pymongo  # pip install pymongo
    from bson import json_util  # Comes with pymongo