Natural Language Processing with Python

CHAPTER 1
Language Processing and Python

It is easy to get our hands on millions of words of text. What can we do with it, assuming we can write some simple programs? In this chapter, we'll address the following questions:

1. What can we achieve by combining simple programming techniques with large quantities of text?
2. How can we automatically extract key words and phrases that sum up the style and content of a text?
3. What tools and techniques does the Python programming language provide for such work?
4. What are some of the interesting challenges of natural language processing?

This chapter is divided into sections that skip between two quite different styles. In the "computing with language" sections, we will take on some linguistically motivated programming tasks without necessarily explaining how they work. In the "closer look at Python" sections we will systematically review key programming concepts. We'll flag the two styles in the section titles, but later chapters will mix both styles without being so up-front about it. We hope this style of introduction gives you an authentic taste of what will come later, while covering a range of elementary concepts in linguistics and computer science. If you have basic familiarity with both areas, you can skip to Section 1.5; we will repeat any important points in later chapters, and if you miss anything you can easily consult the online reference material at http://www.nltk.org/. If the material is completely new to you, this chapter will raise more questions than it answers, questions that are addressed in the rest of this book.

1.1 Computing with Language: Texts and Words

We're all very familiar with text, since we read and write it every day. Here we will treat text as raw data for the programs we write, programs that manipulate and analyze it in a variety of interesting ways. But before we can do this, we have to get started with the Python interpreter.

Getting Started with Python

One of the friendly things about Python is that it allows you to type directly into the interactive interpreter, the program that will be running your Python programs. You can access the Python interpreter using a simple graphical interface called the Interactive DeveLopment Environment (IDLE). On a Mac you can find this under Applications→MacPython, and on Windows under All Programs→Python. Under Unix you can run Python from the shell by typing idle (if this is not installed, try typing python). The interpreter will print a blurb about your Python version; simply check that you are running Python 2.4 or 2.5 (here it is 2.5.1):

    Python 2.5.1 (r251:54863, Apr 15 2008, 22:57:26)
    [GCC 4.0.1 (Apple Inc. build 5465)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>>

If you are unable to run the Python interpreter, you probably don't have Python installed correctly. Please visit http://python.org/ for detailed instructions.

The >>> prompt indicates that the Python interpreter is now waiting for input. When copying examples from this book, don't type the >>> yourself. Now, let's begin by using Python as a calculator:

    >>> 1 + 5 * 2 - 3
    8
    >>>

Once the interpreter has finished calculating the answer and displaying it, the prompt reappears. This means the Python interpreter is waiting for another instruction.

Your Turn: Enter a few more expressions of your own. You can use asterisk (*) for multiplication and slash (/) for division, and parentheses for bracketing expressions.
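For instance, a session along these lines (a small sketch; the results shown assume Python 2, where / between whole numbers rounds downwards):

    >>> 2 * 3 - 4
    2
    >>> (2 + 3) * 4
    20
    >>> 7 / 2      # integer division; more on this surprise below
    3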
Note that division doesn't always behave as you might expect: it does integer division (with rounding of fractions downwards) when you type 1/3, and "floating-point" (or decimal) division when you type 1.0/3.0. In order to get the expected behavior of division (standard in Python 3.0), you need to type: from __future__ import division.

The preceding examples demonstrate how you can work interactively with the Python interpreter, experimenting with various expressions in the language to see what they do. Now let's try a nonsensical expression to see how the interpreter handles it:

    >>> 1 +
      File "<stdin>", line 1
        1 +
          ^
    SyntaxError: invalid syntax

This produced a syntax error. In Python, it doesn't make sense to end an instruction with a plus sign. The Python interpreter indicates the line where the problem occurred (line 1 of <stdin>, which stands for "standard input").

Now that we can use the Python interpreter, we're ready to start working with language data.

Getting Started with NLTK

Before going further you should install NLTK, downloadable for free from http://www.nltk.org/. Follow the instructions there to download the version required for your platform.

Once you've installed NLTK, start up the Python interpreter as before, and install the data required for the book by typing the following two commands at the Python prompt, then selecting the book collection as shown in Figure 1-1.

    >>> import nltk
    >>> nltk.download()

Figure 1-1. Downloading the NLTK Book Collection: Browse the available packages using nltk.download(). The Collections tab on the downloader shows how the packages are grouped into sets, and you should select the line labeled book to obtain all data required for the examples and exercises in this book. It consists of about 30 compressed files requiring about 100Mb disk space. The full collection of data (i.e., all in the downloader) is about five times this size (at the time of writing) and continues to expand.

Once the data is downloaded to your machine, you can load some of it using the Python interpreter. The first step is to type a special command at the Python prompt, which tells the interpreter to load some texts for us to explore: from nltk.book import *. This says "from NLTK's book module, load all items." The book module contains all the data you will need as you read this chapter. After printing a welcome message, it loads the text of several books (this will take a few seconds). Here's the command again, together with the output that you will see. Take care to get spelling and punctuation right, and remember that you don't type the >>>.

    >>> from nltk.book import *
    *** Introductory Examples for the NLTK Book ***
    Loading text1, ..., text9 and sent1, ..., sent9
    Type the name of the text or sentence to view it.
    Type: 'texts()' or 'sents()' to list the materials.
    text1: Moby Dick by Herman Melville 1851
    text2: Sense and Sensibility by Jane Austen 1811
    text3: The Book of Genesis
    text4: Inaugural Address Corpus
    text5: Chat Corpus
    text6: Monty Python and the Holy Grail
    text7: Wall Street Journal
    text8: Personals Corpus
    text9: The Man Who Was Thursday by G . K . Chesterton 1908

Any time we want to find out about these texts, we just have to enter their names at the Python prompt:

    >>> text1
    <Text: Moby Dick by Herman Melville 1851>
    >>> text2
    <Text: Sense and Sensibility by Jane Austen 1811>

Now that we can use the Python interpreter, and have some data to work with, we're ready to get started.
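If you are curious what kind of object names like text1 refer to, you can ask Python directly; a quick check (the exact class name printed may vary with your NLTK version):

    >>> type(text1)
    <class 'nltk.text.Text'>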
Searching Text

There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. Here we look up the word monstrous in Moby Dick by entering text1 followed by a period, then the term concordance, and then placing "monstrous" in parentheses:

    >>> text1.concordance("monstrous")
    Building index...
    Displaying 11 of 11 matches:
    ong the former , one was of a most monstrous size . ... This came towards us ,
    ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
    ll over with a heathenish array of monstrous clubs and spears . Some were thick
    d as you gazed , and wondered what monstrous cannibal and savage could ever hav
    that has survived the flood ; most monstrous and most mountainous ! That Himmal
    they might scout at Moby Dick as a monstrous fable , or still worse and more de
    th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
    ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
    ere to enter upon those still more monstrous stories of them which are to be fo
    ght have been rummaged out of this monstrous cabinet there is no telling . But
    of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u

Your Turn: Try searching for other words; to save re-typing, you might be able to use up-arrow, Ctrl-up-arrow, or Alt-p to access the previous command and modify the word being searched. You can also try searches on some of the other texts we have included. For example, search Sense and Sensibility for the word affection, using text2.concordance("affection"). Search the book of Genesis to find out how long some people lived, using text3.concordance("lived"). You could look at text4, the Inaugural Address Corpus, to see examples of English going back to 1789, and search for words like nation, terror, god to see how these words have been used differently over time. We've also included text5, the NPS Chat Corpus: search this for unconventional words like im, ur, lol. (Note that this corpus is uncensored!)

Once you've spent a little while examining these texts, we hope you have a new sense of the richness and diversity of language. In the next chapter you will learn how to access a broader range of text, including text in languages other than English.

A concordance permits us to see words in context. For example, we saw that monstrous occurred in contexts such as the ___ pictures and the ___ size. What other words appear in a similar range of contexts? We can find out by appending the term similar to the name of the text in question, then inserting the relevant word in parentheses:

    >>> text1.similar("monstrous")
    Building word-context index...
    subtly impalpable pitiable curious imperial perilous trustworthy
    abundant untoward singular lamentable few maddens horrible loving lazy
    mystifying christian exasperate puzzled
    >>> text2.similar("monstrous")
    Building word-context index...
    very exceedingly so heartily a great good amazingly as sweet remarkably
    extremely vast

Observe that we get different results for different texts. Austen uses this word quite differently from Melville; for her, monstrous has positive connotations, and sometimes functions as an intensifier like the word very. The term common_contexts allows us to examine just the contexts that are shared by two or more words, such as monstrous and very.
We have to enclose these words by square brackets as well as parentheses, and separate them with a comma:

    >>> text2.common_contexts(["monstrous", "very"])
    be_glad am_glad a_pretty is_pretty a_lucky

Your Turn: Pick another pair of words and compare their usage in two different texts, using the similar() and common_contexts() functions.

It is one thing to automatically detect that a particular word occurs in a text, and to display some words that appear in the same context. However, we can also determine the location of a word in the text: how many words from the beginning it appears. This positional information can be displayed using a dispersion plot. Each stripe represents an instance of a word, and each row represents the entire text. In Figure 1-2 we see some striking patterns of word usage over the last 220 years (in an artificial text constructed by joining the texts of the Inaugural Address Corpus end-to-end). You can produce this plot as shown below. You might like to try more words (e.g., liberty, constitution) and different texts. Can you predict the dispersion of a word before you view it? As before, take care to get the quotes, commas, brackets, and parentheses exactly right.

    >>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])

Figure 1-2. Lexical dispersion plot for words in U.S. Presidential Inaugural Addresses: This can be used to investigate changes in language use over time.

Important: You need to have Python's NumPy and Matplotlib packages installed in order to produce the graphical plots used in this book. Please see http://www.nltk.org/ for installation instructions.

Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we type the name of the text followed by the term generate. (We need to include the parentheses, but there's nothing that goes between them.)

    >>> text3.generate()
    In the beginning of his brother is a hairy man , whose top may reach
    unto heaven ; and ye shall sow the land of Egypt there was no bread in
    all that he was taken out of the month , upon the earth . So shall thy
    wages be ? And they made their father ; and Isaac was old , and kissed
    him : and Laban with his cattle in the midst of the hands of Esau thy
    first born , and Phichol the chief butler unto his son Isaac , she

Note that the first time you run this command, it is slow because it gathers statistics about word sequences. Each time you run it, you will get different output text. Now try generating random text in the style of an inaugural address or an Internet chat room. Although the text is random, it reuses common words and phrases from the source text and gives us a sense of its style and content. (What is lacking in this randomly generated text?)

When generate produces its output, punctuation is split off from the preceding word. While this is not correct formatting for English text, we do it to make clear that words and punctuation are independent of one another. You will learn more about this in Chapter 3.
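To get a feel for what generate() is doing behind the scenes, here is a minimal sketch of the same idea using bigram counts in plain Python. This is an illustration only, not NLTK's actual implementation (which gathers richer statistics), and the function name bigram_babble is made up for this example:

    >>> import random
    >>> def bigram_babble(words, start, n=20):
    ...     # tally, for every word, the words observed to follow it
    ...     followers = {}
    ...     for w1, w2 in zip(words, words[1:]):
    ...         followers.setdefault(w1, []).append(w2)
    ...     # repeatedly append a random successor of the latest word;
    ...     # fall back to any word if the current one has no successor
    ...     output = [start]
    ...     for i in range(n):
    ...         output.append(random.choice(followers.get(output[-1], words)))
    ...     return ' '.join(output)
    ...
    >>> bigram_babble(list(text3), 'In')

Each run produces different pseudo-Genesis, for the same reason each run of text3.generate() does: the successor of each word is drawn at random from the words that actually followed it in the source text.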
Counting Vocabulary

The most obvious fact about texts that emerges from the preceding examples is that they differ in the vocabulary they use. In this section, we will see how to use the computer to count the words in a text in a variety of useful ways. As before, you will jump right in and experiment with the Python interpreter, even though you may not have studied Python systematically yet. Test your understanding by modifying the examples, and trying the exercises at the end of the chapter.

Let's begin by finding out the length of a text from start to finish, in terms of the words and punctuation symbols that appear. We use the term len to get the length of something, which we'll apply here to the book of Genesis:

    >>> len(text3)
    44764

So Genesis has 44,764 words and punctuation symbols, or "tokens." A token is the technical name for a sequence of characters, such as hairy, his, or :), that we want to treat as a group. When we count the number of tokens in a text, say, the phrase to be or not to be, we are counting occurrences of these sequences. Thus, in our example phrase there are two occurrences of to, two of be, and one each of or and not. But there are only four distinct vocabulary items in this phrase. How many distinct words does the book of Genesis contain? To work this out in Python, we have to pose the question slightly differently. The vocabulary of a text is just the set of tokens that it uses, since in a set, all duplicates are collapsed together. In Python we can obtain the vocabulary items of text3 with the command set(text3). When you do this, many screens of words will fly past. Now try the following:

    >>> sorted(set(text3))
    ['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)',
    'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech',
    'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', ...]
    >>> len(set(text3))
    2789

By wrapping sorted() around the Python expression set(text3), we obtain a sorted list of vocabulary items, beginning with various punctuation symbols and continuing with words starting with A. All capitalized words precede lowercase words. We discover the size of the vocabulary indirectly, by asking for the number of items in the set, and again we can use len to obtain this number. Although it has 44,764 tokens, this book has only 2,789 distinct words, or "word types." A word type is the form or spelling of the word independently of its specific occurrences in a text; that is, the word considered as a unique item of vocabulary. Our count of 2,789 items will include punctuation symbols, so we will generally call these unique items types instead of word types.

Now, let's calculate a measure of the lexical richness of the text. The next example shows us that each word is used 16 times on average (we need to make sure Python uses floating-point division):

    >>> from __future__ import division
    >>> len(text3) / len(set(text3))
    16.050197203298673

Next, let's focus on particular words. We can count how often a word occurs in a text, and compute what percentage of the text is taken up by a specific word:

    >>> text3.count("smote")
    5
    >>> 100 * text4.count('a') / len(text4)
    1.4643016433938312

Your Turn: How many times does the word lol appear in text5? How much is this as a percentage of the total number of words in this text?
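Before moving on, you can verify the token/type distinction from earlier in this section on the toy phrase itself; a quick check:

    >>> phrase = ['to', 'be', 'or', 'not', 'to', 'be']
    >>> len(phrase)             # tokens: every occurrence counts
    6
    >>> len(set(phrase))        # types: duplicates are collapsed
    4
    >>> sorted(set(phrase))
    ['be', 'not', 'or', 'to']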
You may want to repeat such calculations on several texts, but it is tedious to keep retyping the formula. Instead, you can come up with your own name for a task, like "lexical_diversity" or "percentage", and associate it with a block of code. Now you only have to type a short name instead of one or more complete lines of Python code, and you can reuse it as often as you like. The block of code that does a task for us is called a function, and we define a short name for our function with the keyword def. The next example shows how to define two new functions, lexical_diversity() and percentage():

    >>> def lexical_diversity(text):
    ...     return len(text) / len(set(text))
    ...
    >>> def percentage(count, total):
    ...     return 100 * count / total
    ...

Caution! The Python interpreter changes the prompt from >>> to ... after encountering the colon at the end of the first line. The ... prompt indicates that Python expects an indented code block to appear next. It is up to you to do the indentation, by typing four spaces or hitting the Tab key. To finish the indented block, just enter a blank line.

In the definition of lexical_diversity(), we specify a parameter labeled text. This parameter is a "placeholder" for the actual text whose lexical diversity we want to compute, and reoccurs in the block of code that will run when the function is used. Similarly, percentage() is defined to take two parameters, labeled count and total.

Once Python knows that lexical_diversity() and percentage() are the names for specific blocks of code, we can go ahead and use these functions:

    >>> lexical_diversity(text3)
    16.050197203298673
    >>> lexical_diversity(text5)
    7.4200461589185629
    >>> percentage(4, 5)
    80.0
    >>> percentage(text4.count('a'), len(text4))
    1.4643016433938312

To recap, we use or call a function such as lexical_diversity() by typing its name, followed by an open parenthesis, the name of the text, and then a close parenthesis. These parentheses will show up often; their role is to separate the name of a task, such as lexical_diversity(), from the data that the task is to be performed on, such as text3. The data value that we place in the parentheses when we call a function is an argument to the function.

You have already encountered several functions in this chapter, such as len(), set(), and sorted(). By convention, we will always add an empty pair of parentheses after a function name, as in len(), just to make clear that what we are talking about is a function rather than some other kind of Python expression. Functions are an important concept in programming, and we only mention them at the outset to give newcomers a sense of the power and creativity of programming. Don't worry if you find it a bit confusing right now.

Later we'll see how to use functions when tabulating data, as in Table 1-1. Each row of the table will involve the same computation but with different data, and we'll do this repetitive work using a function.

Table 1-1. Lexical diversity of various genres in the Brown Corpus

    Genre                 Tokens    Types    Lexical diversity
    skill and hobbies      82345    11935    6.9
    humor                  21695     5017    4.3
    fiction: science       14470     3233    4.5
    press: reportage      100554    14394    7.0
    fiction: romance       70022     8452    8.3
    religion               39399     6373    6.2
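As a preview of that repetitive work, here is a sketch of how rows of Table 1-1 could be computed with our new function. It assumes the Brown Corpus is installed (it is part of the book collection downloaded earlier); the corpus interface itself is properly introduced in Chapter 2:

    >>> from nltk.corpus import brown
    >>> for genre in ['humor', 'religion']:
    ...     tokens = brown.words(categories=genre)
    ...     # one table row per genre: token count, type count, diversity
    ...     print genre, len(tokens), len(set(tokens)), lexical_diversity(tokens)
    ...

The numbers printed should match the corresponding rows of Table 1-1, up to rounding.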
1.2 A Closer Look at Python: Texts as Lists of Words

You've seen some important elements of the Python programming language. Let's take a few moments to review them systematically.

Lists

What is a text? At one level, it is a sequence of symbols on a page such as this one. At another level, it is a sequence of chapters, made up of a sequence of sections, where each section is a sequence of paragraphs, and so on. However, for our purposes, we will think of a text as nothing more than a sequence of words and punctuation. Here's how we represent text in Python, in this case the opening sentence of Moby Dick:

    >>> sent1 = ['Call', 'me', 'Ishmael', '.']

After the prompt we've given a name we made up, sent1, followed by the equals sign, and then some quoted words, separated with commas, and surrounded with brackets. This bracketed material is known as a list in Python: it is how we store a text. We can inspect it by typing the name. We can ask for its length. We can even apply our own lexical_diversity() function to it.

    >>> sent1
    ['Call', 'me', 'Ishmael', '.']
    >>> len(sent1)
    4
    >>> lexical_diversity(sent1)
    1.0

Some more lists have been defined for you, one for the opening sentence of each of our texts, sent2 … sent9. We inspect two of them here; you can see the rest for yourself using the Python interpreter (if you get an error saying that sent2 is not defined, you need to first type from nltk.book import *).

    >>> sent2
    ['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled',
    'in', 'Sussex', '.']
    >>> sent3
    ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and',
    'the', 'earth', '.']

Your Turn: Make up a few sentences of your own, by typing a name, equals sign, and a list of words, like this: ex1 = ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']. Repeat some of the other Python operations we saw earlier in Section 1.1, e.g., sorted(ex1), len(set(ex1)), ex1.count('the').

A pleasant surprise is that we can use Python's addition operator on lists. Adding two lists creates a new list with everything from the first list, followed by everything from the second list:

    >>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']
    ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']

This special use of the addition operation is called concatenation; it combines the lists together into a single list. We can concatenate sentences to build up a text. We don't have to literally type the lists either; we can use short names that refer to pre-defined lists.

    >>> sent4 + sent1
    ['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the',
    'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']

What if we want to add a single item to a list? This is known as appending. When we append() to a list, the list itself is updated as a result of the operation.

    >>> sent1.append("Some")
    >>> sent1
    ['Call', 'me', 'Ishmael', '.', 'Some']

Indexing Lists

As we have seen, a text in Python is a list of words, represented using a combination of brackets and quotes. Just as with an ordinary page of text, we can count up the total number of words in text1 with len(text1), and count the occurrences in a text of a particular word, say, heaven, using text1.count('heaven').

With some patience, we can pick out the 1st, 173rd, or even 14,278th word in a printed text. Analogously, we can identify the elements of a Python list by their order of occurrence in the list. The number that represents this position is the item's index. We instruct Python to show us the item that occurs at an index such as 173 in a text by writing the name of the text followed by the index inside square brackets:

    >>> text4[173]
    'awaken'

We can do the converse; given a word, find the index of when it first occurs:

    >>> text4.index('awaken')
    173

Indexes are a common way to access the words of a text, or, more generally, the elements of any list.
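One detail worth noting: index() only ever reports the first occurrence, even when a word appears several times; a tiny check:

    >>> w = ['heaven', 'and', 'earth', 'and', 'sea']
    >>> w.count('and')
    2
    >>> w.index('and')     # the position of the first 'and' only
    1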
Python permits us to access sublists as well, extracting manageable pieces of language from large texts, a technique known as slicing.

    >>> text5[16715:16735]
    ['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good',
    'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without',
    'buying', 'it']
    >>> text6[1600:1625]
    ['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.',
    'We', 'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of',
    'executive', 'officer', 'for', 'the', 'week']

Indexes have some subtleties, and we'll explore these with the help of an artificial sentence:

    >>> sent = ['word1', 'word2', 'word3', 'word4', 'word5',
    ...         'word6', 'word7', 'word8', 'word9', 'word10']
    >>> sent[0]
    'word1'
    >>> sent[9]
    'word10'

Notice that our indexes start from zero: sent element zero, written sent[0], is the first word, 'word1', whereas sent element 9 is 'word10'. The reason is simple: the moment Python accesses the content of a list from the computer's memory, it is already at the first element; we have to tell it how many elements forward to go. Thus, zero steps forward leaves it at the first element.

This practice of counting from zero is initially confusing, but typical of modern programming languages. You'll quickly get the hang of it if you've mastered the system of counting centuries where 19XY is a year in the 20th century, or if you live in a country where the floors of a building are numbered from 1, and so walking up n-1 flights of stairs takes you to level n.

Now, if we accidentally use an index that is too large, we get an error:

    >>> sent[10]
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    IndexError: list index out of range

This time it is not a syntax error, because the program fragment is syntactically correct. Instead, it is a runtime error, and it produces a Traceback message that shows the context of the error, followed by the name of the error, IndexError, and a brief explanation.

Let's take a closer look at slicing, using our artificial sentence again. Here we verify that the slice 5:8 includes sent elements at indexes 5, 6, and 7:

    >>> sent[5:8]
    ['word6', 'word7', 'word8']
    >>> sent[5]
    'word6'
    >>> sent[6]
    'word7'
    >>> sent[7]
    'word8'

By convention, m:n means elements m…n-1. As the next example shows, we can omit the first number if the slice begins at the start of the list, and we can omit the second number if the slice goes to the end:

    >>> sent[:3]
    ['word1', 'word2', 'word3']
    >>> text2[141525:]
    ['among', 'the', 'merits', 'and', 'the', 'happiness', 'of', 'Elinor',
    'and', 'Marianne', ',', 'let', 'it', 'not', 'be', 'ranked', 'as', 'the',
    'least', 'considerable', ',', 'that', 'though', 'sisters', ',', 'and',
    'living', 'almost', 'within', 'sight', 'of', 'each', 'other', ',', 'they',
    'could', 'live', 'without', 'disagreement', 'between', 'themselves', ',',
    'or', 'producing', 'coolness', 'between', 'their', 'husbands', '.',
    'THE', 'END']

We can modify an element of a list by assigning to one of its index values. In the next example, we put sent[0] on the left of the equals sign. We can also replace an entire slice with new material. A consequence of this last change is that the list only has four elements, and accessing a later value generates an error.

    >>> sent[0] = 'First'
    >>> sent[9] = 'Last'
    >>> len(sent)
    10
    >>> sent[1:9] = ['Second', 'Third']
    >>> sent
    ['First', 'Second', 'Third', 'Last']
    >>> sent[9]
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    IndexError: list index out of range

Your Turn: Take a few minutes to define a sentence of your own and modify individual words and groups of words (slices) using the same methods used earlier. Check your understanding by trying the exercises on lists at the end of this chapter.
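One more indexing convention, which we will rely on shortly: negative indexes count backwards from the end of a list. A brief sketch:

    >>> ex = ['one', 'two', 'three']
    >>> ex[-1]          # the last element
    'three'
    >>> ex[-2:]         # a slice of the last two elements
    ['two', 'three']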
Variables

From the start of Section 1.1, you have had access to texts called text1, text2, and so on. It saved a lot of typing to be able to refer to a 250,000-word book with a short name like this! In general, we can make up names for anything we care to calculate. We did this ourselves in the previous sections, e.g., defining a variable sent1, as follows:

    >>> sent1 = ['Call', 'me', 'Ishmael', '.']

Such lines have the form: variable = expression. Python will evaluate the expression, and save its result to the variable. This process is called assignment. It does not generate any output; you have to type the variable on a line of its own to inspect its contents. The equals sign is slightly misleading, since information is moving from the right side to the left. It might help to think of it as a left-arrow. The name of the variable can be anything you like, e.g., my_sent, sentence, xyzzy. It must start with a letter, and can include numbers and underscores. Here are some examples of variables and assignments:

    >>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
    ... 'forth', 'from', 'Camelot', '.']
    >>> noun_phrase = my_sent[1:4]
    >>> noun_phrase
    ['bold', 'Sir', 'Robin']
    >>> wOrDs = sorted(noun_phrase)
    >>> wOrDs
    ['Robin', 'Sir', 'bold']

Remember that capitalized words appear before lowercase words in sorted lists.

Notice in the previous example that we split the definition of my_sent over two lines. Python expressions can be split across multiple lines, so long as this happens within any kind of brackets. Python uses the ... prompt to indicate that more input is expected. It doesn't matter how much indentation is used in these continuation lines, but some indentation usually makes them easier to read.

It is good to choose meaningful variable names to remind you, and to help anyone else who reads your Python code, what your code is meant to do. Python does not try to make sense of the names; it blindly follows your instructions, and does not object if you do something confusing, such as one = 'two' or two = 3. The only restriction is that a variable name cannot be any of Python's reserved words, such as def, if, not, and import. If you use a reserved word, Python will produce a syntax error:

    >>> not = 'Camelot'
      File "<stdin>", line 1
        not = 'Camelot'
            ^
    SyntaxError: invalid syntax

We will often use variables to hold intermediate steps of a computation, especially when this makes the code easier to follow. Thus len(set(text1)) could also be written:

    >>> vocab = set(text1)
    >>> vocab_size = len(vocab)
    >>> vocab_size
    19317

Caution! Take care with your choice of names (or identifiers) for Python variables. First, you should start the name with a letter, optionally followed by digits (0 to 9) or letters. Thus, abc23 is fine, but 23abc will cause a syntax error. Names are case-sensitive, which means that myVar and myvar are distinct variables. Variable names cannot contain whitespace, but you can separate words using an underscore, e.g., my_var. Be careful not to insert a hyphen instead of an underscore: my-var is wrong, since Python interprets the "-" as a minus sign.
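For instance, since names are case-sensitive, assigning to myVar does not define myvar; a quick demonstration:

    >>> myVar = 'one'
    >>> myvar
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    NameError: name 'myvar' is not defined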
Strings

Some of the methods we used to access the elements of a list also work with individual words, or strings. For example, we can assign a string to a variable, index a string, and slice a string:

    >>> name = 'Monty'
    >>> name[0]
    'M'
    >>> name[:4]
    'Mont'

We can also perform multiplication and addition with strings:

    >>> name * 2
    'MontyMonty'
    >>> name + '!'
    'Monty!'

We can join the words of a list to make a single string, or split a string into a list, as follows:

    >>> ' '.join(['Monty', 'Python'])
    'Monty Python'
    >>> 'Monty Python'.split()
    ['Monty', 'Python']

We will come back to the topic of strings in Chapter 3. For the time being, we have two important building blocks, lists and strings, and are ready to get back to some language analysis.

1.3 Computing with Language: Simple Statistics

Let's return to our exploration of the ways we can bring our computational resources to bear on large quantities of text. We began this discussion in Section 1.1, and saw how to search for words in context, how to compile the vocabulary of a text, how to generate random text in the same style, and so on. In this section, we pick up the question of what makes a text distinct, and use automatic methods to find characteristic words and expressions of a text. As in Section 1.1, you can try new features of the Python language by copying them into the interpreter, and you'll learn about these features systematically in the following section.

Before continuing further, you might like to check your understanding of the last section by predicting the output of the following code. You can use the interpreter to check whether you got it right. If you're not sure how to do this task, it would be a good idea to review the previous section before continuing further.

    >>> saying = ['After', 'all', 'is', 'said', 'and', 'done',
    ...           'more', 'is', 'said', 'than', 'done']
    >>> tokens = set(saying)
    >>> tokens = sorted(tokens)
    >>> tokens[-2:]
    what output do you expect here?

Frequency Distributions

How can we automatically identify the words of a text that are most informative about the topic and genre of the text? Imagine how you might go about finding the 50 most frequent words of a book. One method would be to keep a tally for each vocabulary item, like that shown in Figure 1-3. The tally would need thousands of rows, and it would be an exceedingly laborious process, so laborious that we would rather assign the task to a machine.

Figure 1-3. Counting words appearing in a text (a frequency distribution).

The table in Figure 1-3 is known as a frequency distribution, and it tells us the frequency of each vocabulary item in the text. (In general, it could count any kind of observable event.) It is a "distribution" since it tells us how the total number of word tokens in the text are distributed across the vocabulary items. Since we often need frequency distributions in language processing, NLTK provides built-in support for them.
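To see what such a tally involves, here is a minimal sketch of a frequency count written in plain Python with a dictionary; NLTK's FreqDist, introduced next, does this bookkeeping (and much more) for us:

    >>> counts = {}
    >>> for word in text1:
    ...     # increment the running tally for this word, starting from 0
    ...     counts[word] = counts.get(word, 0) + 1
    ...
    >>> counts['whale']
    906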
Let's use a FreqDist to find the 50 most frequent words of Moby Dick. Try to work out what is going on here, then read the explanation that follows.

    >>> fdist1 = FreqDist(text1)
    >>> fdist1
    <FreqDist with 260819 outcomes>
    >>> vocabulary1 = fdist1.keys()
    >>> vocabulary1[:50]
    [',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-',
    'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for',
    'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on',
    'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were',
    'now', 'which', '?', 'me', 'like']
    >>> fdist1['whale']
    906

When we first invoke FreqDist, we pass the name of the text as an argument. We can inspect the total number of words ("outcomes") that have been counted up: 260,819 in the case of Moby Dick. The expression keys() gives us a list of all the distinct types in the text, and we can look at the first 50 of these by slicing the list.

Your Turn: Try the preceding frequency distribution example for yourself, for text2. Be careful to use the correct parentheses and uppercase letters. If you get an error message NameError: name 'FreqDist' is not defined, you need to start your work with from nltk.book import *.

Do any words produced in the last example help us grasp the topic or genre of this text? Only one word, whale, is slightly informative! It occurs over 900 times. The rest of the words tell us nothing about the text; they're just English "plumbing." What proportion of the text is taken up with such words? We can generate a cumulative frequency plot for these words, using fdist1.plot(50, cumulative=True), to produce the graph in Figure 1-4. These 50 words account for nearly half the book!

Figure 1-4. Cumulative frequency plot for the 50 most frequently used words in Moby Dick, which account for nearly half of the tokens.

If the frequent words don't help us, how about the words that occur once only, the so-called hapaxes? View them by typing fdist1.hapaxes(). This list contains lexicographer, cetological, contraband, expostulations, and about 9,000 others. It seems that there are too many rare words, and without seeing the context we probably can't guess what half of the hapaxes mean in any case! Since neither frequent nor infrequent words help, we need to try something else.

Fine-Grained Selection of Words

Next, let's look at the long words of a text; perhaps these will be more characteristic and informative. For this we adapt some notation from set theory. We would like to find the words from the vocabulary of the text that are more than 15 characters long. Let's call this property P, so that P(w) is true if and only if w is more than 15 characters long. Now we can express the words of interest using mathematical set notation as shown in (1a). This means "the set of all w such that w is an element of V (the vocabulary) and w has property P."

    (1) a. {w | w ∈ V & P(w)}
        b. [w for w in V if p(w)]

The corresponding Python expression is given in (1b). (Note that it produces a list, not a set, which means that duplicates are possible.) Observe how similar the two notations are. Let's go one more step and write executable Python code:

    >>> V = set(text1)
    >>> long_words = [w for w in V if len(w) > 15]
    >>> sorted(long_words)
    ['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness',
    'cannibalistically', 'characteristically', 'circumnavigating',
    'circumnavigation', 'circumnavigations', 'comprehensiveness',
    'hermaphroditical', 'indiscriminately', 'indispensableness',
    'irresistibleness', 'physiognomically', 'preternaturalness',
    'responsibilities', 'simultaneousness', 'subterraneousness',
    'supernaturalness', 'superstitiousness', 'uncomfortableness',
    'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']

For each word w in the vocabulary V, we check whether len(w) is greater than 15; all other words will be ignored. We will discuss this syntax more carefully later.

Your Turn: Try out the previous statements in the Python interpreter, and experiment with changing the text and changing the length condition. Does it make any difference to your results if you change the variable names, e.g., using [word for word in vocab if ...]?
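(A sketch of an answer to that last question: the names are arbitrary labels, so the result is identical as long as the structure of the expression is unchanged.)

    >>> vocab = set(text1)
    >>> sorted(word for word in vocab if len(word) > 15) == sorted(long_words)
    True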
Let's return to our task of finding words that characterize a text. Notice that the long words in text4 reflect its national focus (constitutionally, transcontinental), whereas those in text5 reflect its informal content: boooooooooooglyyyyyy and yuuuuuuuuuuuummmmmmmmmmmm. Have we succeeded in automatically extracting words that typify a text? Well, these very long words are often hapaxes (i.e., unique) and perhaps it would be better to find frequently occurring long words. This seems promising since it eliminates frequent short words (e.g., the) and infrequent long words (e.g., antiphilosophists). Here are all words from the chat corpus that are longer than seven characters, that occur more than seven times:

    >>> fdist5 = FreqDist(text5)
    >>> sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
    ['#14-19teens', '#talkcity_adults', '((((((((((', '........', 'Question',
    'actually', 'anything', 'computer', 'cute.-ass', 'everyone', 'football',
    'innocent', 'listening', 'remember', 'seriously', 'something', 'together',
    'tomorrow', 'watching']

Notice how we have used two conditions: len(w) > 7 ensures that the words are longer than seven letters, and fdist5[w] > 7 ensures that these words occur more than seven times. At last we have managed to automatically identify the frequently occurring content-bearing words of the text. It is a modest but important milestone: a tiny piece of code, processing tens of thousands of words, produces some informative output.

Collocations and Bigrams

A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses; for example, maroon wine sounds very odd.

To get a handle on collocations, we start off by extracting from a text a list of word pairs, also known as bigrams. This is easily accomplished with the function bigrams():

    >>> bigrams(['more', 'is', 'said', 'than', 'done'])
    [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

Here we see that the pair of words than-done is a bigram, and we write it in Python as ('than', 'done'). Now, collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequency of individual words. The collocations() function does this for us (we will see how it works later):

    >>> text4.collocations()
    Building collocations list
    United States; fellow citizens; years ago; Federal Government; General
    Government; American people; Vice President; Almighty God; Fellow
    citizens; Chief Magistrate; Chief Justice; God bless; Indian tribes;
    public debt; foreign nations; political parties; State governments;
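To get a feel for why that special scoring matters, try counting raw bigram frequencies instead; a sketch, reusing FreqDist and bigrams from above (and relying, as this chapter already has, on keys() listing samples in decreasing order of frequency):

    >>> fdist_bigrams = FreqDist(bigrams(text4))
    >>> fdist_bigrams.keys()[:10]     # the ten most frequent word pairs

The top pairs this produces are mostly grammatical "plumbing" such as of the, which is exactly what collocations() is designed to see past by weighting bigram frequency against the frequency of the individual words.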
