How Google web crawler works

how web crawling works and what is a web crawler hit and what is web crawler search engine
HartJohnson Profile Pic
HartJohnson,United States,Professional
Published Date:02-08-2017
Your Website URL(Optional)
Comment
Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, This chapter follows closely on the heels of the chapter before it and is a modest attempt to introduce natural language processing (NLP) and apply it to the vast source of human 1 language data that you’ll encounter on the social web (or elsewhere). The previous chapter introduced some foundational techniques from information retrieval (IR) theory, which generally treats text as document-centric “bags of words” (unordered collections of words) that can be modeled and manipulated as vectors. Although these models often perform remarkably well in many circumstances, a recurring shortcoming is that they do not maximize cues from the immediate context that ground words in meaning. This chapter employs different techniques that are more context-driven and delves deeper into the semantics of human language data. Social web APIs that return data conforming to a well-defined schema are essential, but the most basic currency of hu‐ man communication is natural language data such as the words that you are reading on this page, Facebook posts, web pages linked into tweets, and so forth. Human language is by far the most ubiquitous kind of data available to us, and the future of data-driven innovation depends largely upon our ability to effectively harness machines to under‐ stand digital forms of human communication. 1. Throughout this chapter, the phrase human language data refers to the object of natural language processing and is intended to convey the same meaning as natural language data or unstructured data. No particular distinction is intended to be drawn by this choice of words other than its precision in describing the data itself. 181 a It is highly recommended that you have a good working knowledge of the content in the previous chapter before you dive into this chapter. A good understanding of NLP presupposes an appreciation and work‐ ing knowledge of some of the fundamental strengths and weaknesses of TF-IDF, vector space models, and so on. In that regard, this chap‐ ter and the one before it have a somewhat tighter coupling than most other chapters in this book. In the spirit of the prior chapters, we’ll attempt to cover the minimal level of detail required to empower you with a solid general understanding of an inherently complex topic, while also providing enough of a technical drill-down that you’ll be able to im‐ mediately get to work mining some data. While continuing to cut corners and attempt‐ ing to give you the crucial 20% of the skills that you can use to do 80% of the work (no single chapter out of any book—or small multivolume set of books, for that matter— could possibly do the topic of NLP justice), the content in this chapter is a pragmatic introduction that’ll give you enough information to do some pretty amazing things with the human language data that you’ll find all over the social web. Although we’ll be focused on extracting human language data from web pages and feeds, keep in mind that just about every social website with an API is going to return human language, so these techniques generalize to just about any social website. Always get the latest bug-fixed source code for this chapter (and every other chapter) online at http://bit.ly/MiningTheSocialWeb2E. Be sure to also take advantage of this book’s virtual machine experience, as described in Appendix A, to maximize your enjoyment of the sample code. 5.1. Overview This chapter continues our journey in analyzing human language data and uses arbitrary web pages and feeds as a basis. In this chapter you’ll learn about: • Fetching web pages and extracting the human language data from them • Leveraging NLTK for completing fundamental tasks in natural language processing • Contextually driven analysis in NLP • Using NLP to complete analytical tasks such as generating document abstracts • Metrics for measuring quality for domains that involve predictive analysis 182 Chapter 5: Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More a5.2. Scraping, Parsing, and Crawling the Web Although it’s trivial to use a programming language or terminal utility such ascurl or wget to fetch an arbitrary web page, extracting the isolated text that you want from the page isn’t quite as trivial. Although the text is certainly in the page, so is lots of other boilerplate content such as navigation bars, headers, footers, advertisements, and other sources of noise that you probably don’t care about. Hence, the bad news is that the problem isn’t quite as simple as just stripping out the HTML tags and processing the text that is left behind, because the removal of HTML tags would have done nothing to remove the boilerplate itself. In some cases, there may actually be more boilerplate in the page that contributes noise than the signal you were looking for in the first place. The good news is that the tools for helping to identify the content you’re interested in have continued to mature over the years, and there are some excellent options for iso‐ lating the material that you’d want for text-mining purposes. Additionally, the relative ubiquity of feeds such as RSS and Atom can often aid the process of retrieving clean text without all of the cruft that’s typically in web pages, if you have the foresight to fetch the feeds while they are available. It’s often the case that feeds are published only for “recent” content, so you may sometimes have to process web pages even if feeds are avail‐ able. If given the choice, you’ll probably want to prefer the feeds over arbitrary web pages, but you’ll need to be prepared for both. One excellent tool for web scraping (the process of extracting text from a web page) is the Java-based boilerpipe library, which is designed to identify and remove the boiler‐ plate from web pages. The boilerpipe library is based on a published paper entitled “Boilerplate Detection Using Shallow Text Features,” which explains the efficacy of using supervised machine learning techniques to bifurcate the boilerplate and the content of the page. Supervised machine learning techniques involve a process that creates a pre‐ dictive model from training samples that are representative of its domain, and thus, boilerpipe is customizable should you desire to tweak it for increased accuracy. Even though the library is Java-based, it’s useful and popular enough that a Python package wrapper calledpython-boilerpipe is available.. Installation of this package is predictable: usepip install boilerpipe. Assuming you have a relatively recent ver‐ sion of Java on your system, that’s all that should be required to use boilerpipe. Example 5-1 demonstrates a sample script that illustrates its rather straightforward us‐ age for the task of extracting the body content of an article, as denoted by the Arti cleExtractor parameter that’s passed into theExtractor constructor. You can also try out a hosted version of boilerpipe online to see the difference between some of its other provided extractors, such as theLargestContentExtractor orDefaultExtractor. 5.2. Scraping, Parsing, and Crawling the Web 183 a There’s a default extractor that works for the general case, an extractor that has been trained for web pages containing articles, and an extractor that is trained to extract the largest body of text on a page, which might be suitable for web pages that tend to have just one large block of text. In any case, there may still be some light post-processing required on the text, depending on what other features you can identify that may be noise or require attention, but employing boilerpipe to do the heavy lifting is just about as easy as it should be. Example 5-1. Using boilerpipe to extract the text from a web page from boilerpipe.extract import Extractor URL='http://radar.oreilly.com/2010/07/louvre-industrial-age-henry-ford.html' extractor = Extractor(extractor='ArticleExtractor', url=URL) print extractor.getText() Although web scraping used to be the only way to fetch content from websites, there’s potentially an easier way to harvest content, especially if it’s content that’s coming from a news source, blog, or other syndicated source. But before we get into that, let’s take a quick trip down memory lane. If you’ve been using the Web long enough, you may remember a time in the late 1990s when news readers didn’t exist. Back then, if you wanted to know what the latest changes to a website were, you just had to go to the site and see if anything had changed. As a result, syndication formats took advantage of the self-publishing movement with blogs and formats such as RSS (Really Simple Syndication) and Atom built upon the evolving XML specifications that were growing in popularity to handle content providers pub‐ lishing content and consumers subscribing to it. Parsing feeds is an easier problem to solve since the feeds are well-formed XML data that validates to a published schema, whereas web pages may or may not be well formed, be valid, or even conform to best practices. A commonly used Python package for processing feed, appropriately named feed parser, is an essential utility to have on hand for parsing feeds. You can install it with pip using the standard pip install feedparser approach in a terminal, and Example 5-2 illustrates minimal usage to extract the text, title, and source URL for an entry in an RSS feed. Example 5-2. Using feedparser to extract the text (and other fields) from an RSS or Atom feed import feedparser FEED_URL='http://feeds.feedburner.com/oreilly/radar/atom' fp = feedparser.parse(FEED_URL) 184 Chapter 5: Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More a for e in fp.entries: print e.title print e.links0.href print e.content0.value HTML, XML, and XHTML As the early Web evolved, the difficulty of separating the content in a page from its presentation quickly became recognized as a problem, and XML was (in part) a solution. The idea was that content creators could publish data in an XML format and use a stylesheet to transform it into XHTML for presentation to an end user. XHTML is essentially just HTML that is written as well-formed XML: each tag is defined in low‐ ercase, tags are properly nested as a tree structure, and tags are either self-closing (such asbr /), or each opening tag (e.g.,p) has a corresponding closing tag (/p). In the context of web scraping, these conventions have the added benefit of making each web page much easier to process with a parser, and in terms of design, it appeared that XHTML was exactly what the Web needed. There was a lot to gain and virtually nothing to lose from the proposition: well-formed XHTML content could be proven valid against an XML schema and enjoy all of the other perks of XML, such as custom attributes using namespaces (a device that semantic web technologies such as RDFa rely upon). The problem is that it just didn’t catch on. As a result, we now live in a world where semantic markup based on the HTML 4.01 standard that’s over a decade old continues to thrive, while XHTML and XHTML-based technologies such as RDFa remain on the fringe. (In fact, libraries such as BeautifulSoup are designed with the specific intent of being able to reasonably process HTML that probably isn’t well-formed or even sane.) Most of the web development world is holding its breath and hoping that HTML5 will indeed create a long-overdue convergence as technologies like microdata catch on and publishing tools modernize. The HTML article on Wikipedia is worth reading if you find this kind of history interesting. Crawling websites is a logical extension of the same concepts already presented in this section: it typically consists of fetching a page, extracting the hyperlinks in the page, and then systematically fetching all of those pages that are hyperlinked. This process is re‐ peated to an arbitrary depth, depending on your objective. This very process is the way that the earliest search engines used to work, and the way most search engines that index the Web still continue to work today. Although a crawl of the Web is far outside our scope, it is helpful to have a working knowledge of the problem, so let’s briefly think about the computational complexity of harvesting all of those pages. 5.2. Scraping, Parsing, and Crawling the Web 185 a In terms of implementing your own web crawl, should you ever desire to do so, Scrapy is a Python-based framework for web scraping that serves as an excellent resource if your endeavors entail web crawling. The actual exercise of performing a web crawl is just outside the scope of this chapter, but the Scrapy documentation online is excellent and includes all of the tutelage you’ll need to get a targeted web crawl going with little effort up front. The next section is a brief aside that discusses the computational com‐ plexity of how web crawls are typically implemented so that you can better appreciate what you may be getting yourself into. Nowadays, you can get a periodically updated crawl of the Web that’s suitable for most research purposes from a source such as Amazon’s Common Crawl Corpus, which features more than 5 billion web pa‐ ges and checks in at over 81 terabytes of data 5.2.1. Breadth-First Search in Web Crawling This section contains some detailed content and analysis about how web crawls can be implemented and is not essential to your under‐ standing of the content in this chapter (although you will likely find it interesting and edifying). If this is your first reading of the chapter, feel free to save it for next time. The basic algorithm for a web crawl can be framed as a breadth-first search, which is a fundamental technique for exploring a space that’s typically modeled as a tree or a graph given a starting node and no other known information except a set of possibilities. In our web crawl scenario, our starting node would be the initial web page and the set of neighboring nodes would be the other pages that are hyperlinked. There are alternative ways to search the space, with a depth-first search being a common alternative to a breadth-first search. The particular choice of one technique versus an‐ other often depends on available computing resources, specific domain knowledge, and even theoretical considerations. A breadth-first search is a reasonable approach for exploring a sliver of the Web. Example 5-3 presents some pseudocode that illustrates how it works, and Figure 5-1 provides some visual insight into how the search would look if you were to draw it out on the back of a napkin. Example 5-3. Pseudocode for a breadth-first search Create an empty graph Create an empty queue to keep track of nodes that need to be processed Add the starting point to the graph as the root node Add the root node to a queue for processing 186 Chapter 5: Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More aRepeat until some maximum depth is reached or the queue is empty: Remove a node from the queue For each of the node's neighbors: If the neighbor hasn't already been processed: Add it to the queue Add it to the graph Create an edge in the graph that connects the node and its neighbor We generally haven’t taken quite this long of a pause to analyze an approach, but breadth- first search is a fundamental tool you’ll want to have in your belt and understand well. In general, there are two criteria for examination that you should always consider for an algorithm: efficiency and effectiveness (or, to put it another way: performance and quality). Standard performance analysis of any algorithm generally involves examining its worst- case time and space complexity—in other words, the amount of time it would take the program to execute, and the amount of memory required for execution over a very large data set. The breadth-first approach we’ve used to frame a web crawl is essentially a breadth-first search, except that we’re not actually searching for anything in particular because there are no exit criteria beyond expanding the graph out either to a maximum depth or until we run out of nodes. If we were searching for something specific instead of just crawling links indefinitely, that would be considered an actual breadth-first search. Thus, a more common variation of a breadth-first search is called a bounded breadth-first search, which imposes a limit on the maximum depth of the search just as we do in this example. For a breadth-first search (or breadth-first crawl), both the time and space complexity d can be bounded in the worst case by b , where b is the branching factor of the graph and d is the depth. If you sketch out an example on paper, as in Figure 5-1, and think about it, this analysis quickly becomes more apparent. 5.2. Scraping, Parsing, and Crawling the Web 187 a Figure 5-1. In a breadth-first search, each step of the search expands the depth by one level until a maximum depth or some other termination criterion is reached 188 Chapter 5: Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More a If every node in a graph had five neighbors, and you only went out to a depth of one, you’d end up with six nodes in all: the root node and its five neighbors. If all five of those neighbors had five neighbors too and you expanded out another level, you’d end up with 31 nodes in all: the root node, the root node’s five neighbors, and five neighbors d for each of the root node’s neighbors. Table 5-1 provides an overview of how b grows for a few sizes of b and d. Table 5-1. Example branching factor calculations for graphs of varying depths Branching factor Nodes for Nodes for Nodes for Nodes for Nodes for depth = 1 depth = 2 depth = 3 depth = 4 depth = 5 2 3 7 15 31 63 3 4 13 40 121 364 4 5 21 85 341 1,365 5 6 31 156 781 3,906 6 7 43 259 1,555 9,331 Figure 5-2 provides a visual for the values displayed in Table 5-1. Figure 5-2. The growth in the number of nodes as the depth of a breadth-first search increases 5.2. Scraping, Parsing, and Crawling the Web 189 www.it-ebooks.infoWhile the previous comments pertain primarily to the theoretical bounds of the algo‐ rithm, one final consideration worth noting is the practical performance of the algorithm for a data set of a fixed size. Mild profiling of a breadth-first implementation that fetches web pages would likely reveal that the code is primarily I/O bound from the standpoint that the vast majority of time is spent waiting for a library call to return content to be processed. In situations in which you are I/O bound, a thread pool is a common technique for increasing performance. 5.3. Discovering Semantics by Decoding Syntax You may recall from the previous chapter that perhaps the most fundamental weak‐ nesses of TF-IDF and cosine similarity are that these models inherently don’t leverage a deep semantic understanding of the data and throw away a lot of critical context. Quite the contrary, the examples in that chapter took advantage of very basic syntax that separated tokens by whitespace to break an otherwise opaque document into a bag of words and used frequency and simple statistical similarity metrics to determine which tokens were likely to be important in the data. Although you can do some really amazing things with these techniques, they don’t really give you any notion of what any given token means in the context in which it appears in the document. Look no further than 2 a sentence containing a homograph such as fish, bear, or even google as a case in point; either one could be a noun or a verb. NLP is inherently complex and difficult to do even reasonably well, and completely nailing it for a large set of commonly spoken languages may well be the problem of the century. However, despite what many believe, it’s far from being a solved problem, and it is already the case that we are starting to see a rising interest in a “deeper understand‐ ing” of the Web with initiatives such as Google’s Knowledge Graph, which is being promoted as “the future of search.” After all, a complete mastery of NLP is essentially a plausible strategy for acing the Turing Test, and to the most careful observer, a computer program that achieves this level of “understanding” demonstrates an uncanny amount of human-like intelligence even if it is through a brain that’s mathematically modeled in software as opposed to a biological one. Whereas structured or semistructured sources are essentially collections of records with some presupposed meaning given to each field that can immediately be analyzed, there are more subtle considerations to be handled with human language data for even the seemingly simplest of tasks. For example, let’s suppose you’re given a document and asked to count the number of sentences in it. It’s a trivial task if you’re a human and have just a basic understanding of English grammar, but it’s another story entirely for a 2. A homonym is a special case of a homograph. Two words are homographs if they have the same spelling. Two words are homonyms if they have the same spelling and the same pronunciation. For some reason, homonym seems more common in parlance than homograph, even if it’s being misused. 190 Chapter 5: Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More www.it-ebooks.info machine, which will require a complex and detailed set of instructions to complete the same task. The encouraging news is that machines can detect the ends of sentences on relatively well-formed data quickly and with nearly perfect accuracy. Even if you’ve accurately detected all of the sentences, though, there’s still a lot that you probably don’t know about the ways that words or phrases are used in those sentences. Look no further than sarcasm or other forms of ironic language as cases in point. Even with perfect infor‐ mation about the structure of a sentence, you often still need additional context outside the sentence to properly interpret it. Thus, as an overly broad generalization, we can say that NLP is fundamentally about taking an opaque document that consists of an ordered collection of symbols adhering to proper syntax and a reasonably well-defined grammar, and deducing the semantics associated with those symbols. Let’s get back to the task of detecting sentences, the first step in most NLP pipelines, to illustrate some of the complexity involved in NLP. It’s deceptively easy to overestimate the utility of simple rule-based heuristics, and it’s important to work through an exercise so that you realize what some of the key issues are and don’t waste time trying to reinvent the wheel. Your first attempt at solving the sentence detection problem might be to just count the periods, question marks, and exclamation points in the sentence. That’s the most obvi‐ ous heuristic for starting out, but it’s quite crude and has the potential for producing an extremely high margin of error. Consider the following (relatively unambiguous) accusation: Mr. Green killed Colonel Mustard in the study with the candlestick. Mr. Green is not a very nice fellow. Simply tokenizing the sentence by splitting on punctuation (specifically, periods) would produce the following result: txt = "Mr. Green killed Colonel Mustard in the study with the \ ... candlestick. Mr. Green is not a very nice fellow." txt.split(".") 'Mr', 'Green killed Colonel Mustard in the study with the candlestick', 'Mr', 'Green is not a very nice fellow', '' It should be immediately obvious that performing sentence detection by blindly break‐ ing on periods without incorporating some notion of context or higher-level information is insufficient. In this case, the problem is the use of “Mr.,” a valid abbre‐ viation that’s commonly used in the English language. Although we already know from the previous chapter that n-gram analysis of this sample would likely tell us that “Mr. Green” is really one compound token called a collocation or chunk, if we had a larger amount of text to analyze, it’s not hard to imagine other edge cases that would be difficult to detect based on the appearance of collocations. Thinking ahead a bit, it’s also worth 5.3. Discovering Semantics by Decoding Syntax 191 www.it-ebooks.info pointing out that finding the key topics in a sentence isn’t easy to accomplish with trivial logic either. As an intelligent human, you can easily discern that the key topics in our sample might be “Mr. Green,” “Colonel Mustard,” “the study,” and “the candlestick,” but training a machine to tell you the same things without human intervention is a complex task. Take a moment to think about how you might write a computer pro‐ gram to solve the problem before continuing in order to get the most out of the remainder of this discussion. A few obvious possibilities are probably occurring to you, such as doing some “Title Case” detection with a regular expression, constructing a list of common abbreviations to parse out the proper noun phrases, and applying some variation of that logic to the problem of finding end-of-sentence (EOS) boundaries to prevent yourself from getting into trouble on that front. OK, sure. Those things will work for some examples, but what’s the margin of error going to be like for arbitrary English text? How forgiving is your algorithm for poorly formed English text; highly abbreviated information such as text messages or tweets; or (gasp) other romantic languages, such as Spanish, French, or Italian? There are no simple answers here, and that’s why text analytics is such an important topic in an age where the amount of digitally available human language data is literally increasing every second. 5.3.1. Natural Language Processing Illustrated Step-by-Step Let’s prepare to step through a series of examples that illustrate NLP with NLTK. The NLP pipeline we’ll examine involves the following steps: 1. EOS detection 2. Tokenization 3. Part-of-speech tagging 4. Chunking 5. Extraction We’ll continue to use the following sample text from the previous chapter for the pur‐ poses of illustration: “Mr. Green killed Colonel Mustard in the study with the candle‐ stick. Mr. Green is not a very nice fellow.” Remember that even though you have already read the text and understand its underlying grammatical structure, it’s merely an opaque string value to a machine at this point. Let’s look at the steps we need to work through in more detail. 192 Chapter 5: Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More www.it-ebooks.info The following NLP pipeline is presented as though it is unfolding in a Python interpreter session for clarity and ease of illustration in the input and expected output of each step. However, each step of the pipeline is preloaded into this chapter’s IPython Notebook so that you can follow along per the norm with all other examples. The five steps are: EOS detection This step breaks a text into a collection of meaningful sentences. Since sentences generally represent logical units of thought, they tend to have a predictable syntax that lends itself well to further analysis. Most NLP pipelines you’ll see begin with this step because tokenization (the next step) operates on individual sentences. Breaking the text into paragraphs or sections might add value for certain types of analysis, but it is unlikely to aid in the overall task of EOS detection. In the inter‐ preter, you’d parse out a sentence with NLTK like so: import nltk txt = "Mr. Green killed Colonel Mustard in the study with the \ ... candlestick. Mr. Green is not a very nice fellow." txt = "Mr. Green killed Colonel Mustard in the study with the \ ... candlestick. Mr. Green is not a very nice fellow." sentences = nltk.tokenize.sent_tokenize(txt) sentences 'Mr. Green killed Colonel Mustard in the study with the candlestick.', 'Mr. Green is not a very nice fellow.' We’ll talk a little bit more about what is happening under the hood withsent_token ize in the next section. For now, we’ll accept at face value that proper sentence detection has occurred for arbitrary text—a clear improvement over breaking on characters that are likely to be punctuation marks. Tokenization This step operates on individual sentences, splitting them into tokens. Following along in the example interpreter session, you’d do the following: tokens = nltk.tokenize.word_tokenize(s) for s in sentences tokens 'Mr.', 'Green', 'killed', 'Colonel', 'Mustard', 'in', 'the', 'study', 'with', 'the', 'candlestick', '.', 'Mr.', 'Green', 'is', 'not', 'a', 'very', 'nice', 'fellow', '.' Note that for this simple example, tokenization appeared to do the same thing as splitting on whitespace, with the exception that it tokenized out EOS markers (the periods) correctly. As we’ll see in a later section, though, it can do a bit more if we give it the opportunity, and we already know that distinguishing between whether a period is an EOS marker or part of an abbreviation isn’t always trivial. As an anecdotal note, some written languages, such as ones that use pictograms as 5.3. Discovering Semantics by Decoding Syntax 193 www.it-ebooks.info opposed to letters, don’t necessarily even require whitespace to separate the tokens in sentences and require the reader (or machine) to distinguish the boundaries. POS tagging This step assigns part-of-speech (POS) information to each token. In the example interpreter session, you’d run the tokens through one more step to have them dec‐ orated with tags: pos_tagged_tokens = nltk.pos_tag(t) for t in tokens pos_tagged_tokens ('Mr.', 'NNP'), ('Green', 'NNP'), ('killed', 'VBD'), ('Colonel', 'NNP'), ('Mustard', 'NNP'), ('in', 'IN'), ('the', 'DT'), ('study', 'NN'), ('with', 'IN'), ('the', 'DT'), ('candlestick', 'NN'), ('.', '.'), ('Mr.', 'NNP'), ('Green', 'NNP'), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('very', 'RB'), ('nice', 'JJ'), ('fellow', 'JJ'), ('.', '.') You may not intuitively understand all of these tags, but they do represent POS information. For example, 'NNP' indicates that the token is a noun that is part of a noun phrase, 'VBD' indicates a verb that’s in simple past tense, and'JJ' indicates an adjective. The Penn Treebank Project provides a full summary of the POS tags that could be returned. With POS tagging completed, it should be getting pretty apparent just how powerful analysis can become. For example, by using the POS tags, we’ll be able to chunk together nouns as part of noun phrases and then try to reason about what types of entities they might be (e.g., people, places, or organi‐ zations). If you never thought that you’d need to apply those exercises from ele‐ mentary school regarding parts of speech, think again: it’s essential to a proper application of natural language processing. Chunking This step involves analyzing each tagged token within a sentence and assembling compound tokens that express logical concepts—quite a different approach than statistically analyzing collocations. It is possible to define a custom grammar through NLTK’schunk.RegexpParser, but that’s beyond the scope of this chapter; see Chapter 9 (“Building Feature Based Grammars”) of Natural Language Process‐ ing with Python (O’Reilly) for full details. Besides, NLTK exposes a function that combines chunking with named entity extraction, which is the next step. 194 Chapter 5: Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More www.it-ebooks.infoExtraction This step involves analyzing each chunk and further tagging the chunks as named entities, such as people, organizations, locations, etc. The continuing saga of NLP in the interpreter demonstrates: ne_chunks = nltk.batch_ne_chunk(pos_tagged_tokens) print ne_chunks Tree('S', Tree('PERSON', ('Mr.', 'NNP')), Tree('PERSON', ('Green', 'NNP')), ('killed', 'VBD'), Tree('ORGANIZATION', ('Colonel', 'NNP'), ('Mustard', 'NNP')), ('in', 'IN'), ('the', 'DT'), ('study', 'NN'), ('with', 'IN'), ('the', 'DT'), ('candlestick', 'NN'), ('.', '.')), Tree('S', Tree('PERSON', ('Mr.', 'NNP')), Tree('ORGANIZATION', ('Green', 'NNP')), ('is', 'VBZ'), ('not', 'RB'), ('a', 'DT'), ('very', 'RB'), ('nice', 'JJ'), ('fellow', 'JJ'), ('.', '.')) print ne_chunks0.pprint() You can pretty-print each chunk in the tree (S (PERSON Mr./NNP) (PERSON Green/NNP) killed/VBD (ORGANIZATION Colonel/NNP Mustard/NNP) in/IN the/DT study/NN with/IN the/DT candlestick/NN ./.) Don’t get too wrapped up in trying to decipher exactly what the tree output means just yet. In short, it has chunked together some tokens and attempted to classify them as being certain types of entities. (You may be able to discern that it has iden‐ tified “Mr. Green” as a person, but unfortunately categorized “Colonel Mustard” as an organization.) Figure 5-3 illustrates output in IPython Notebook. As worthwhile as it would be to continue exploring natural language with NLTK, that level of engagement isn’t really our purpose here. The background in this section is provided to motivate an appreciation for the difficulty of the task and to encourage you to review the NLTK book or one of the many other plentiful resources available online if you’d like to pursue the topic further. 5.3. Discovering Semantics by Decoding Syntax 195 www.it-ebooks.info Figure 5-3. NLTK can interface with drawing toolkits so that you can inspect the chunked output in a more intuitive visual form than the raw text output you see in the interpreter Given that it’s possible to customize certain aspects of NLTK, the remainder of this chapter assumes you’ll be using NLTK “as is” unless otherwise noted. With that brief introduction to NLP concluded, let’s get to work mining some blog data. 5.3.2. Sentence Detection in Human Language Data Given that sentence detection is probably the first task you’ll want to ponder when building an NLP stack, it makes sense to start there. Even if you never complete the remaining tasks in the pipeline, it turns out that EOS detection alone yields some pow‐ erful possibilities, such as document summarization, which we’ll be considering as a follow-up exercise in the next section. But first, we’ll need to fetch some clean human language data. Let’s use the tried-and-true feedparser package, along with some util‐ ities introduced in the previous chapter that are based onnltk andBeautifulSoup, to clean up HTML formatting that may appear in the content to fetch some posts from the O’Reilly Radar blog. The listing in Example 5-4 fetches a few posts and saves them to a local file as JSON. 196 Chapter 5: Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More www.it-ebooks.infoExample 5-4. Harvesting blog data by parsing feeds import os import sys import json import feedparser from BeautifulSoup import BeautifulStoneSoup from nltk import clean_html FEED_URL = 'http://feeds.feedburner.com/oreilly/radar/atom' def cleanHtml(html): return BeautifulStoneSoup(clean_html(html), convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents0 fp = feedparser.parse(FEED_URL) print "Fetched %s entries from '%s'" % (len(fp.entries0.title), fp.feed.title) blog_posts = for e in fp.entries: blog_posts.append('title': e.title, 'content' : cleanHtml(e.content0.value), 'link': e.links0.href) out_file = os.path.join('resources', 'ch05-webpages', 'feed.json') f = open(out_file, 'w') f.write(json.dumps(blog_posts, indent=1)) f.close() print 'Wrote output file to %s' % (f.name, ) Obtaining human language data from a reputable source affords us the luxury of assuming good English grammar; hopefully this also means that one of NLTK’s out-of- the-box sentence detectors will work reasonably well. There’s no better way to find out than hacking some code to see what happens, so go ahead and review the code listing in Example 5-5. It introduces thesent_tokenize andword_tokenize methods, which are aliases for NLTK’s currently recommended sentence detector and word tokenizer. A brief discussion of the listing is provided afterward. Example 5-5. Using NLTK’s NLP tools to process human language in blog data import json import nltk Download nltk packages used in this example nltk.download('stopwords') BLOG_DATA = "resources/ch05-webpages/feed.json" blog_data = json.loads(open(BLOG_DATA).read()) Customize your list of stopwords as needed. Here, we add common 5.3. Discovering Semantics by Decoding Syntax 197 www.it-ebooks.info punctuation and contraction artifacts. stop_words = nltk.corpus.stopwords.words('english') + '.', ',', '', '\'s', '?', ')', '(', ':', '\'', '\'re', '"', '-', '', '', u'—', for post in blog_data: sentences = nltk.tokenize.sent_tokenize(post'content') words = w.lower() for sentence in sentences for w in nltk.tokenize.word_tokenize(sentence) fdist = nltk.FreqDist(words) Basic stats num_words = sum(i1 for i in fdist.items()) num_unique_words = len(fdist.keys()) Hapaxes are words that appear only once num_hapaxes = len(fdist.hapaxes()) top_10_words_sans_stop_words = w for w in fdist.items() if w0 not in stop_words:10 print post'title' print '\tNum Sentences:'.ljust(25), len(sentences) print '\tNum Words:'.ljust(25), num_words print '\tNum Unique Words:'.ljust(25), num_unique_words print '\tNum Hapaxes:'.ljust(25), num_hapaxes print '\tTop 10 Most Frequent Words (sans stop words):\n\t\t', \ '\n\t\t'.join('%s (%s)' % (w0, w1) for w in top_10_words_sans_stop_words) print 198 Chapter 5: Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More www.it-ebooks.info The first things you’re probably wondering about are the sent_tokenize and word_tokenize calls. NLTK provides several options for tokenization, but it provides “recommendations” as to the best available via these aliases. At the time of this writing (you can double-check this withpydoc or a command likenltk.tokenize.sent_token ize? in IPython or IPython Notebook at any time), the sentence detector is the Punkt SentenceTokenizer and the word tokenizer is theTreebankWordTokenizer. Let’s take a brief look at each of these. Internally, thePunktSentenceTokenizer relies heavily on being able to detect abbrevi‐ ations as part of collocation patterns, and it uses some regular expressions to try to intelligently parse sentences by taking into account common patterns of punctuation usage. A full explanation of the innards of thePunktSentenceTokenizer’s logic is out‐ side the scope of this book, but Tibor Kiss and Jan Strunk’s original paper “Unsupervised Multilingual Sentence Boundary Detection” discusses its approach in a highly readable way, and you should take some time to review it. As we’ll see in a bit, it is possible to instantiate thePunktSentenceTokenizer with sample text that it trains on to try to improve its accuracy. The type of underlying algorithm that’s used is an unsupervised learning algorithm; it does not require you to explicitly mark up the sample training data in any way. Instead, the algorithm inspects certain features that appear in the text itself, such as the use of capitalization and the co- occurrences of tokens, to derive suitable parameters for breaking the text into sentences. While NLTK’sWhitespaceTokenizer, which creates tokens by breaking a piece of text on whitespace, would have been the simplest word tokenizer to introduce, you’re already familiar with some of the shortcomings of blindly breaking on whitespace. Instead, NLTK currently recommends theTreebankWordTokenizer, a word tokenizer that op‐ 3 erates on sentences and uses the same conventions as the Penn Treebank Project. The one thing that may catch you off guard is that theTreebankWordTokenizer’s tokeniza‐ tion does some less-than-obvious things, such as separately tagging components in contractions and nouns having possessive forms. For example, the parsing for the sen‐ tence “I’m hungry” would yield separate components for “I” and “’m,” maintaining a distinction between the subject and verb for the two words conjoined in the contraction “I’m.” As you might imagine, finely grained access to this kind of grammatical infor‐ mation can be quite valuable when it’s time to do advanced analysis that scrutinizes relationships between subjects and verbs in sentences. Given a sentence tokenizer and a word tokenizer, we can first parse the text into sen‐ tences and then parse each sentence into tokens. While this approach is fairly intuitive, its Achilles’ heel is that errors produced by the sentence detector propagate forward and 3. Treebank is a very specific term that refers to a corpus that’s been specially tagged with advanced linguistic information. In fact, the reason such a corpus is called a treebank is to emphasize that it’s a bank (think: collection) of sentences that have been parsed into trees adhering to a particular grammar. 5.3. Discovering Semantics by Decoding Syntax 199 www.it-ebooks.info can potentially bound the upper limit of the quality that the rest of the NLP stack can produce. For example, if the sentence tokenizer mistakenly breaks a sentence on the period after “Mr.” that appears in a section of text such as “Mr. Green killed Colonel Mustard in the study with the candlestick,” it may not be possible to extract the entity “Mr. Green” from the text unless specialized repair logic is in place. Again, it all depends on the sophistication of the full NLP stack and how it accounts for error propagation. The out-of-the-boxPunktSentenceTokenizer is trained on the Penn Treebank corpus and performs quite well. The end goal of the parsing is to instantiate annltk.FreqDist object (which is like a slightly more sophisticatedcollections.Counter), which expects a list of tokens. The remainder of the code in Example 5-5 is a straightforward usage of a few of the commonly used NLTK APIs. If you have a lot of trouble with advanced word tokenizers such as NLTK’s TreebankWordTokenizer or PunktWordTokenizer, it’s fine to default to the WhitespaceTokenizer until you decide whether it’s worth the investment to use a more advanced tokenizer. In fact, us‐ ing a more straightforward tokenizer can often be advantageous. For example, using an advanced tokenizer on data that frequently inlines URLs might be a bad idea. The aim of this section was to familiarize you with the first step involved in building an NLP pipeline. Along the way, we developed a few metrics that make a feeble attempt at characterizing some blog data. Our pipeline doesn’t involve part-of-speech tagging or chunking (yet), but it should give you a basic understanding of some concepts and get you thinking about some of the subtler issues involved. While it’s true that we could have simply split on whitespace, counted terms, tallied the results, and still gained a lot of information from the data, it won’t be long before you’ll be glad that you took these initial steps toward a deeper understanding of the data. To illustrate one possible ap‐ plication for what you’ve just learned, in the next section we’ll look at a simple document summarization algorithm that relies on little more than sentence segmentation and frequency analysis. 5.3.3. Document Summarization Being able to perform reasonably good sentence detection as part of an NLP approach to mining unstructured data can enable some pretty powerful text-mining capabilities, such as crude but very reasonable attempts at document summarization. There are numerous possibilities and approaches, but one of the simplest to get started with dates all the way back to the April 1958 issue of IBM Journal. In the seminal article entitled “The Automatic Creation of Literature Abstracts,” H.P. Luhn describes a technique that 200 Chapter 5: Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More www.it-ebooks.info