Natural Language Processing Lecture Notes

Dr. Davis Hardy
Published: 22-07-2017
HANDBOOK OF NATURAL LANGUAGE PROCESSING, SECOND EDITION

Part I: Classical Approaches

1 Classical Approaches to Natural Language Processing (Robert Dale)
  Context • The Classical Toolkit • Conclusions • Reference
2 Text Preprocessing (David D. Palmer)
  Introduction • Challenges of Text Preprocessing • Tokenization • Sentence Segmentation • Conclusion • References
3 Lexical Analysis (Andrew Hippisley)
  Introduction • Finite State Morphonology • Finite State Morphology • "Difficult" Morphology and Lexical Analysis • Paradigm-Based Lexical Analysis • Concluding Remarks • Acknowledgments • References
4 Syntactic Parsing (Peter Ljunglöf and Mats Wirén)
  Introduction • Background • The Cocke–Kasami–Younger Algorithm • Parsing as Deduction • Implementing Deductive Parsing • LR Parsing • Constraint-Based Grammars • Issues in Parsing • Historical Notes and Outlook • Acknowledgments • References
5 Semantic Analysis (Cliff Goddard and Andrea C. Schalley)
  Basic Concepts and Issues in Natural Language Semantics • Theories and Approaches to Semantic Representation • Relational Issues in Lexical Semantics • Fine-Grained Lexical-Semantic Analysis: Three Case Studies • Prospectus and "Hard Problems" • Acknowledgments • References
6 Natural Language Generation (David D. McDonald)
  Introduction • Examples of Generated Texts: From Complex to Simple and Back Again • The Components of a Generator • Approaches to Text Planning • The Linguistic Component • The Cutting Edge • Conclusions • References

1 Classical Approaches to Natural Language Processing

1.1 Context
1.2 The Classical Toolkit
    Text Preprocessing • Lexical Analysis • Syntactic Parsing • Semantic Analysis • Natural Language Generation
1.3 Conclusions
Reference

Robert Dale
Macquarie University

1.1 Context

The first edition of this handbook appeared in 2000, but the project that resulted in that volume in fact began 4 years earlier, in mid-1996. When Hermann Moisl, Harold Somers, and I started planning the content of the book, the field of natural language processing was less than 10 years into what some might call its "statistical revolution." It was still early enough that there were occasional signs of friction between some of the "old guard," who hung on to the symbolic approaches to natural language processing that they had grown up with, and the "young turks," with their new-fangled statistical processing techniques, which just kept gaining ground. Some in the old guard would give talks pointing out that there were problems in natural language processing that were beyond the reach of statistical or corpus-based methods; meanwhile, the occasional young turk could be heard muttering a variation on Fred Jelinek's 1988 statement that "whenever I fire a linguist our system performance improves." Then there were those with an eye to a future peaceful coexistence that promised jobs for all, arguing that we needed to develop hybrid techniques and applications that integrated the best properties of both the symbolic approaches and the statistical approaches. At the time, we saw the handbook as being one way of helping to catalog the constituent tools and techniques that might play a role in such hybrid enterprises. So, in the first edition of the handbook, we adopted a tripartite structure, with the 38 book chapters being fairly evenly segregated into Symbolic Approaches to NLP, Empirical Approaches to NLP, and NLP based on Artificial Neural Networks.
The editors of the present edition have renamed the Symbolic Approaches to NLP part as Classical Approaches: that name change surely says something about the way in which things have developed over the last 10 years or so. In the various conferences and journals in our field, papers that make use of statistical techniques now very significantly outnumber those that do not. The number of chapters in the present edition of the handbook that focus on these "classical" approaches is half the number that focus on the empirical and statistical approaches. But these changes should not be taken as an indication that the earlier-established approaches are somehow less relevant; in fact, the reality is quite the opposite, as the incorporation of linguistic knowledge into statistical processing becomes more and more common. Those who argue for the study of the classics in the more traditional sense of that word make great claims for the value of such study: it encourages the questioning of cultural assumptions, allows one to appreciate different cultures and value systems, promotes creative thinking, and helps one to understand the development of ideas. That is just as true in natural language processing as it is in the study of Greek literature.

So, in the spirit of all those virtues that surely no one would question, this part of the handbook provides thoughtful and reflective overviews of a number of areas of natural language processing that are in some sense foundational. They represent areas of research and technological development that have been around for some time; long enough to benefit from hindsight and a measured and more objective assessment than is possible for work that is more recent. This introduction comments briefly on each of these chapters as a way of setting the scene for this part of the handbook as a whole.
1.2 The Classical Toolkit

Traditionally, work in natural language processing has tended to view the process of language analysis as being decomposable into a number of stages, mirroring the theoretical linguistic distinctions drawn between SYNTAX, SEMANTICS, and PRAGMATICS. The simple view is that the sentences of a text are first analyzed in terms of their syntax; this provides an order and structure that is more amenable to an analysis in terms of semantics, or literal meaning; and this is followed by a stage of pragmatic analysis whereby the meaning of the utterance or text in context is determined. This last stage is often seen as being concerned with DISCOURSE, whereas the previous two are generally concerned with sentential matters. This attempt at a correlation between a stratificational distinction (syntax, semantics, and pragmatics) and a distinction in terms of granularity (sentence versus discourse) sometimes causes some confusion in thinking about the issues involved in natural language processing; and it is widely recognized that in real terms it is not so easy to separate the processing of language neatly into boxes corresponding to each of the strata. However, such a separation serves as a useful pedagogic aid, and also constitutes the basis for architectural models that make the task of natural language analysis more manageable from a software engineering point of view.

Nonetheless, the tripartite distinction into syntax, semantics, and pragmatics only serves at best as a starting point when we consider the processing of real natural language texts. A finer-grained decomposition of the process is useful when we take into account the current state of the art in combination with the need to deal with real language data; this is reflected in Figure 1.1.

[FIGURE 1.1: The stages of analysis in processing natural language. From the surface text upward: tokenization, lexical analysis, syntactic analysis, semantic analysis, and pragmatic analysis, yielding the speaker's intended meaning.]

We identify here the stage of tokenization and sentence segmentation as a crucial first step. Natural language text is generally not made up of the short, neat, well-formed, and well-delimited sentences we find in textbooks; and for languages such as Chinese, Japanese, or Thai, which do not share the apparently easy space-delimited tokenization we might believe to be a property of languages like English, the ability to address issues of tokenization is essential to getting off the ground at all. We also treat lexical analysis as a separate step in the process. To some degree this finer-grained decomposition reflects our current state of knowledge about language processing: we know quite a lot about general techniques for tokenization, lexical analysis, and syntactic analysis, but much less about semantics and discourse-level processing. But it also reflects the fact that the known is the surface text, and anything deeper is a representational abstraction that is harder to pin down; so it is not so surprising that we have better-developed techniques at the more concrete end of the processing spectrum.

Of course, natural language analysis is only one-half of the story. We also have to consider natural language generation, where we are concerned with mapping from some (typically nonlinguistic) internal representation to a surface text. In the history of the field so far, there has been much less work on natural language generation than there has been on natural language analysis. One sometimes hears the suggestion that this is because natural language generation is easier, so that there is less to be said. This is far from the truth: there are a great many complexities to be addressed in generating fluent and coherent multi-sentential texts from an underlying source of information.
A more likely reason for the relative lack of work in generation is precisely the correlate of the observation made at the end of the previous paragraph: it is relatively straightforward to build theories around the processing of something known (such as a sequence of words), but much harder when the input to the process is more or less left to the imagination. This is the question that causes researchers in natural language generation to wake in the middle of the night in a cold sweat: what does generation start from? Much work in generation is concerned with addressing these questions head-on; work in natural language understanding may eventually see benefit in taking generation's starting point as its end goal.

1.2.1 Text Preprocessing

As we have already noted, not all languages deliver text in the form of words neatly delimited by spaces. Languages such as Chinese, Japanese, and Thai require first that a segmentation process be applied, analogous to the segmentation process that must first be applied to a continuous speech stream in order to identify the words that make up an utterance. As Palmer demonstrates in his chapter, there are significant segmentation and tokenization issues in apparently easier-to-segment languages, such as English, too. Fundamentally, the issue here is that of what constitutes a word; as Palmer shows, there is no easy answer here. This chapter also looks at the problem of sentence segmentation: since so much work in natural language processing views the sentence as the unit of analysis, clearly it is of crucial importance to ensure that, given a text, we can break it into sentence-sized pieces. This turns out not to be so trivial either. Palmer offers a catalog of tips and techniques that will be useful to anyone faced with dealing with real raw text as the input to an analysis process, and provides a healthy reminder that these problems have tended to be idealized away in much earlier, laboratory-based work in natural language processing.
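The segmentation problem for unsegmented scripts can be illustrated with a toy greedy longest-match segmenter. This is only a sketch of one classical strategy (dictionary-based maximum matching), not a method from the chapter; the tiny lexicon and the English-without-spaces input are illustrative assumptions.

```python
# Toy greedy longest-match (maximum matching) word segmenter for text
# written without spaces, as in Chinese or Thai. The lexicon below is
# a made-up illustration; real systems use large dictionaries and
# statistical models.
LEXICON = {"the", "tab", "table", "led", "sat", "down"}

def segment(text, lexicon=LEXICON, max_len=5):
    words = []
    i = 0
    while i < len(text):
        # Try the longest dictionary match starting at position i.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                i = j
                break
        else:
            # No dictionary match: emit a single character and move on.
            words.append(text[i])
            i += 1
    return words

print(segment("thetabledown"))  # ['the', 'table', 'down']
```

Greedy matching is simple but can commit to a wrong long match; that weakness is one reason more sophisticated segmentation methods exist.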
1.2.2 Lexical Analysis

The previous chapter addressed the problem of breaking a stream of input text into the words and sentences that will be subject to subsequent processing. The words, of course, are not atomic, and are themselves open to further analysis. Here we enter the realms of computational morphology, the focus of Andrew Hippisley's chapter. By taking words apart, we can uncover information that will be useful at later stages of processing. The combinatorics also mean that decomposing words into their parts, and maintaining rules for how combinations are formed, is much more efficient in terms of storage space than would be the case if we simply listed every word as an atomic element in a huge inventory. And, once more returning to our concern with the handling of real texts, there will always be words missing from any such inventory; morphological processing can go some way toward handling such unrecognized words. Hippisley provides a wide-ranging and detailed review of the techniques that can be used to carry out morphological processing, drawing on examples from languages other than English to demonstrate the need for sophisticated processing methods; along the way he provides some background in the relevant theoretical aspects of phonology and morphology.

1.2.3 Syntactic Parsing

A presupposition in most work in natural language processing is that the basic unit of meaning analysis is the sentence: a sentence expresses a proposition, an idea, or a thought, and says something about some real or imaginary world. Extracting the meaning from a sentence is thus a key issue. Sentences are not, however, just linear sequences of words, and so it is widely recognized that to carry out this task requires an analysis of each sentence, which determines its structure in one way or another. In NLP approaches based on generative linguistics, this is generally taken to involve the determining of the syntactic or grammatical structure of each sentence. In their chapter, Ljunglöf and Wirén present a range of techniques that can be used to achieve this end.
This area is probably the most well established in the field of NLP, enabling the authors here to provide an inventory of basic concepts in parsing, followed by a detailed catalog of parsing techniques that have been explored in the literature.

1.2.4 Semantic Analysis

Identifying the underlying syntactic structure of a sequence of words is only one step in determining the meaning of a sentence; it provides a structured object that is more amenable to further manipulation and subsequent interpretation. It is these subsequent steps that derive a meaning for the sentence in question. Goddard and Schalley's chapter turns to these deeper issues. It is here that we begin to reach the bounds of what has so far been scaled up from theoretical work to practical application. As pointed out earlier in this introduction, the semantics of natural language have been less studied than syntactic issues, and so the techniques described here are not yet developed to the extent that they can easily be applied in a broad-coverage fashion.

After setting the scene by reviewing a range of existing approaches to semantic interpretation, Goddard and Schalley provide a detailed exposition of Natural Semantic Metalanguage, an approach to semantics that is likely to be new to many working in natural language processing. They end by cataloging some of the challenges to be faced if we are to develop truly broad-coverage semantic analyses.

1.2.5 Natural Language Generation

At the end of the day, determining the meaning of an utterance is only really one-half of the story of natural language processing: in many situations, a response then needs to be generated, either in natural language alone or in combination with other modalities. For many of today's applications, what is required here is rather trivial and can be handled by means of canned responses; increasingly, however, we are seeing natural language generation techniques applied in the context of more sophisticated back-end systems, where the need to be able to custom-create fluent multi-sentential texts on demand becomes a priority.
The generation-oriented chapters in the Applications part bear testimony to the scope here.

In his chapter, David McDonald provides a far-reaching survey of work in the field of natural language generation. McDonald begins by lucidly characterizing the differences between natural language analysis and natural language generation. He goes on to show what can be achieved using natural language generation techniques, drawing examples from systems developed over the last 35 years. The bulk of the chapter is then concerned with laying out a picture of the component processes and representations required in order to generate fluent multi-sentential or multi-paragraph texts, built around the now-standard distinction between text planning and linguistic realization.

1.3 Conclusions

Early research into machine translation was underway in both U.K. and U.S. universities in the mid-1950s, and the first annual meeting of the Association for Computational Linguistics was in 1963; so, depending on how you count, the field of natural language processing has either passed or is fast approaching its 50th birthday. A lot has been achieved in this time. This part of the handbook provides a consolidated summary of the outcomes of significant research agendas that have shaped the field and the issues it chooses to address. An awareness and understanding of this work is essential for any modern-day practitioner of natural language processing; as George Santayana put it over 100 years ago, "Those who cannot remember the past are condemned to repeat it."

One aspect of computational work not represented here is the body of research that focuses on discourse and pragmatics.
As noted earlier, it is in these areas that our understanding is still very much weaker than in areas such as morphology and syntax. It is probably also the case that there is currently less work going on here than there was in the past: there is a sense in which the shift to statistically based work restarted investigations of language processing from the ground up, and current approaches have many intermediate problems to tackle before they reach the concerns that were once the focus of "the discourse community." There is no doubt that these issues will resurface; but right now, the bulk of attention is focused on dealing with syntax and semantics.* When most problems here have been solved, we can expect to see a renewed interest in discourse-level phenomena and pragmatics, and at that point the time will be ripe for another edition of this handbook that puts classical approaches to discourse back on the table as a source of ideas and inspiration. Meanwhile, a good survey of various approaches can be found in Jurafsky and Martin (2008).

* A notable exception is the considerable body of work on text summarization that has developed over the last 10 years.

Reference

Jurafsky, D. and Martin, J. H., 2008, Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics, 2nd edition. Prentice-Hall, Upper Saddle River, NJ.

2 Text Preprocessing

2.1 Introduction
2.2 Challenges of Text Preprocessing
    Character-Set Dependence • Language Dependence • Corpus Dependence • Application Dependence
2.3 Tokenization
    Tokenization in Space-Delimited Languages • Tokenization in Unsegmented Languages
2.4 Sentence Segmentation
    Sentence Boundary Punctuation • The Importance of Context • Traditional Rule-Based Approaches • Robustness and Trainability • Trainable Algorithms
2.5 Conclusion
References

David D. Palmer
Autonomy Virage

2.1 Introduction

In the linguistic analysis of a digital natural language text, it is necessary to clearly define the characters, words, and sentences in any document. Defining these units presents different challenges depending on the language being processed and the source of the documents, and the task is not trivial, especially when considering the variety of human languages and writing systems. Natural languages contain inherent ambiguities, and writing systems often amplify ambiguities as well as generate additional ambiguities. Much of the challenge of Natural Language Processing (NLP) involves resolving these ambiguities. Early work in NLP focused on a small number of well-formed corpora in a small number of languages, but significant advances have been made in recent years by using large and diverse corpora from a wide range of sources, including a vast and ever-growing supply of dynamically generated text from the Internet. This explosion in corpus size and variety has necessitated techniques for automatically harvesting and preparing text corpora for NLP tasks.
In this chapter, we discuss the challenges posed by text preprocessing, the task of converting a raw text file, essentially a sequence of digital bits, into a well-defined sequence of linguistically meaningful units: at the lowest level characters representing the individual graphemes in a language's written system, words consisting of one or more characters, and sentences consisting of one or more words. Text preprocessing is an essential part of any NLP system, since the characters, words, and sentences identified at this stage are the fundamental units passed to all further processing stages, from analysis and tagging components, such as morphological analyzers and part-of-speech taggers, through applications, such as information retrieval and machine translation systems.

Text preprocessing can be divided into two stages: document triage and text segmentation. Document triage is the process of converting a set of digital files into well-defined text documents. For early corpora, this was a slow, manual process, and these early corpora were rarely more than a few million words. In contrast, current corpora harvested from the Internet can encompass billions of words each day, which requires a fully automated document triage process. This process can involve several steps, depending on the origin of the files being processed. First, in order for any natural language document to be machine readable, its characters must be represented in a character encoding, in which one or more bytes in a file maps to a known character. Character encoding identification determines the character encoding (or encodings) for any file and optionally converts between encodings. Second, in order to know what language-specific algorithms to apply to a document, language identification determines the natural language for a document; this step is closely linked to, but not uniquely determined by, the character encoding. Finally, text sectioning identifies the actual content within a file while discarding undesirable elements, such as images, tables, headers, links, and HTML formatting.
The output of the document triage stage is a well-defined text corpus, organized by language, suitable for text segmentation and further analysis.

Text segmentation is the process of converting a well-defined text corpus into its component words and sentences. Word segmentation breaks up the sequence of characters in a text by locating the word boundaries, the points where one word ends and another begins. For computational linguistics purposes, the words thus identified are frequently referred to as tokens, and word segmentation is also known as tokenization. Text normalization is a related step that involves merging different written forms of a token into a canonical normalized form; for example, a document may contain the equivalent tokens "Mr.", "Mr", "mister", and "Mister" that would all be normalized to a single form.

Sentence segmentation is the process of determining the longer processing units consisting of one or more words. This task involves identifying sentence boundaries between words in different sentences. Since most written languages have punctuation marks that occur at sentence boundaries, sentence segmentation is frequently referred to as sentence boundary detection, sentence boundary disambiguation, or sentence boundary recognition. All these terms refer to the same task: determining how a text should be divided into sentences for further processing.

In practice, sentence and word segmentation cannot be performed successfully independently of one another. For example, an essential subtask in both word and sentence segmentation for most European languages is identifying abbreviations, because a period can be used to mark an abbreviation as well as to mark the end of a sentence. In the case of a period marking an abbreviation, the period is usually considered a part of the abbreviation token, whereas a period at the end of a sentence is usually considered a token in and of itself. In the case of an abbreviation at the end of a sentence, the period marks both the abbreviation and the sentence boundary.
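The text normalization step described above can be sketched with a small lookup-table example. The mapping table, function name, and choice of "Mr." as the canonical form are illustrative assumptions for this sketch, not a specification from the chapter.

```python
# Sketch of token normalization: merge different written forms of a
# token into one canonical form, as with "Mr.", "Mr", "mister", and
# "Mister". The table and canonical spelling are illustrative only.
CANONICAL = {
    "mr.": "Mr.",
    "mr": "Mr.",
    "mister": "Mr.",
}

def normalize(token):
    # Look up a canonical form (case-insensitively); otherwise
    # return the token unchanged.
    return CANONICAL.get(token.lower(), token)

tokens = ["Mister", "Smith", "met", "Mr.", "Jones"]
print([normalize(t) for t in tokens])
```

In a real system the normalization table would itself be the product of careful corpus analysis, since over-merging (e.g., conflating the noun "mister" with the title) can lose distinctions an application needs.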
This chapter provides an introduction to text preprocessing in a variety of languages and writing systems. We begin in Section 2.2 with a discussion of the challenges posed by text preprocessing, and emphasize the document triage issues that must be considered before implementing a tokenization or sentence segmentation algorithm. The section describes the dependencies on the language being processed and the character set in which the language is encoded. It also discusses the dependency on the application that uses the output of the segmentation and the dependency on the characteristics of the specific corpus being processed.

In Section 2.3, we introduce some common techniques currently used for tokenization. The first part of the section focuses on issues that arise in tokenizing and normalizing languages in which words are separated by whitespace. The second part of the section discusses tokenization techniques in languages where no such whitespace word boundaries exist. In Section 2.4, we discuss the problem of sentence segmentation and introduce some common techniques currently used to identify sentence boundaries in texts.

2.2 Challenges of Text Preprocessing

There are many issues that arise in text preprocessing that need to be addressed when designing NLP systems, and many can be addressed as part of document triage in preparing a corpus for analysis. The type of writing system used for a language is the most important factor for determining the best approach to text preprocessing. Writing systems can be logographic, where a large number (often thousands) of individual symbols represent words. In contrast, writing systems can be syllabic, in which individual symbols represent syllables, or alphabetic, in which individual symbols (more or less) represent sounds; unlike logographic systems, syllabic and alphabetic systems typically have fewer than 100 symbols. According to Comrie et al. (1996), the majority of all written languages use an alphabetic or syllabic system.
However, in practice, no modern writing system employs symbols of only one kind, so no natural language writing system can be classified as purely logographic, syllabic, or alphabetic. Even English, with its relatively simple writing system based on the Roman alphabet, utilizes logographic symbols including Arabic numerals (0–9), currency symbols ($, £), and other symbols (%, &). English is nevertheless predominately alphabetic, and most other writing systems are comprised of symbols which are mainly of one type.

In this section, we discuss the essential document triage steps, and we emphasize the main types of dependencies that must be addressed in developing algorithms for text segmentation: character-set dependence (Section 2.2.1), language dependence (Section 2.2.2), corpus dependence (Section 2.2.3), and application dependence (Section 2.2.4).

2.2.1 Character-Set Dependence

At its lowest level, a computer-based text or document is merely a sequence of digital bits in a file. The first essential task is to interpret these bits as characters of a writing system of a natural language.

About Character Sets

Historically, interpreting digital text files was trivial, since nearly all texts were encoded in the 7-bit character set ASCII, which allowed only 128 (2^7) characters and included only the Roman (or Latin) alphabet and essential characters for writing English. This limitation required the "asciification" or "romanization" of many texts, in which ASCII equivalents were defined for characters not defined in the character set. An example of this asciification is the adaptation of many European languages containing umlauts and accents, in which the umlauts are replaced by a double quotation mark or the letter 'e' and accents are denoted by a single quotation mark or even a number code.
In this system, the German word über would be written as u"ber or ueber, and the French word déjà would be written as de'ja` or de1ja2. Languages that do not use the Roman alphabet, such as Russian and Arabic, required much more elaborate romanization systems, usually based on a phonetic mapping of the source characters to the Roman characters. The Pinyin transliteration of Chinese writing is another example of asciification of a more complex writing system. These adaptations are still common due to the widespread familiarity with the Roman characters; in addition, some computer applications are still limited to this 7-bit encoding.

Eight-bit character sets can encode 256 (2^8) characters using a single 8-bit byte, but most of these 8-bit sets reserve the first 128 characters for the original ASCII characters. Eight-bit encodings exist for all common alphabetic and some syllabic writing systems; for example, the ISO-8859 series of 10+ character sets contains encoding definitions for most European characters, including separate ISO-8859 sets for the Cyrillic and Greek alphabets. However, since all 8-bit character sets are limited to exactly the same 256 byte codes (decimal 0–255), this results in a large number of overlapping character sets for encoding characters in different languages.

Writing systems with larger character sets, such as those of written Chinese and Japanese, which have several thousand distinct characters, require multiple bytes to encode a single character. A two-byte character set can represent 65,536 (2^16) distinct characters, since 2 bytes contain 16 bits. Determining individual characters in two-byte character sets involves grouping pairs of bytes representing a single character. This process can be complicated by the tokenization equivalent of code-switching, in which characters from many different writing systems occur within the same text.
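The asciification scheme described above can be sketched as a simple character-by-character mapping. The particular mapping table below (umlauts rewritten with a trailing 'e', acute accents marked with an apostrophe, grave accents with a backquote) is an illustrative assumption modeled on the über and déjà examples, not a standard transliteration table.

```python
# Toy "asciification": replace non-ASCII letters with ASCII
# approximations, per the conventions illustrated in the text.
# The table is illustrative and far from complete.
ASCII_MAP = {
    "ü": "ue", "ö": "oe", "ä": "ae", "ß": "ss",
    "é": "e'",  # acute accent marked with an apostrophe
    "à": "a`",  # grave accent marked with a backquote
}

def asciify(text):
    # Map each character through the table; ASCII characters pass through.
    return "".join(ASCII_MAP.get(ch, ch) for ch in text)

print(asciify("über"))
print(asciify("déjà"))
```

Note that such lossy mappings are exactly what Unicode was designed to make unnecessary, though they persist for legacy reasons.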
It is very common in digital texts to encounter multiple writing systems and thus multiple encodings, or, as discussed previously, character encodings that include other encodings as subsets. In Chinese and Japanese texts, single-byte letters, spaces, punctuation marks (e.g., periods, quotation marks, and parentheses), and Arabic numerals (0–9) are commonly interspersed with 2-byte Chinese and Japanese characters. Such texts also frequently contain ASCII headers. Multiple encodings also exist for these character sets; for example, the Chinese character set is represented in two widely used encodings, Big-5 for the complex-form (traditional) character set and GB for the simple-form (simplified) set, with several minor variants of these sets also commonly found.

The Unicode 5.0 standard (Unicode Consortium 2006) seeks to eliminate this character set ambiguity by specifying a Universal Character Set that includes over 100,000 distinct coded characters derived from over 75 supported scripts representing all the writing systems commonly used today. The Unicode standard is most commonly implemented in the UTF-8 variable-length character encoding, in which each character is represented by a 1- to 4-byte encoding. In the UTF-8 encoding, ASCII characters require 1 byte, most other characters included in ISO-8859 character encodings and other alphabetic systems require 2 bytes, and all other characters, including Chinese, Japanese, and Korean, require 3 bytes (and very rarely 4 bytes). The Unicode standard and its implementation in UTF-8 allow for the encoding of all supported characters with no overlap or confusion between conflicting byte ranges, and it is rapidly replacing older character encoding sets for multilingual applications.

Character Encoding Identification and Its Impact on Tokenization

Despite the growing use of Unicode, the fact that the same range of numeric values can represent different characters in different encodings can be a problem for tokenization.
For example, English or Spanish texts are both normally stored in the common 8-bit encoding Latin-1 (or ISO-8859-1). An English or Spanish tokenizer would need to be aware that bytes in the (decimal) range 161–191 in Latin-1 represent punctuation marks and other symbols (such as '¡', '¿', '£', and '©'). Tokenization rules would then be required to handle each symbol (and thus its byte code) appropriately for that language. However, this same byte range in UTF-8 represents the second (or third or fourth) byte of a multi-byte sequence and is meaningless by itself; an English or Spanish tokenizer for UTF-8 would thus need to model multi-byte character sequences explicitly. Furthermore, the same byte range in ISO-8859-5, a common Russian encoding, contains Cyrillic characters; in KOI8-R, another Russian encoding, the range contains a different set of Cyrillic characters. Tokenizers thus must be targeted to a specific language in a specific encoding.

Tokenization is unavoidably linked to the underlying character encoding of the text being processed, and character encoding identification is an essential first step. While the header of a digital document may contain information regarding its character encoding, this information is not always present or even reliable, in which case the encoding must be determined automatically.

A character encoding identification algorithm must first explicitly model the known encoding systems, in order to know in what byte ranges to look for valid characters as well as which byte ranges are unlikely to appear frequently in that encoding. The algorithm then analyzes the bytes in a file to construct a profile of which byte ranges are represented in the file. Next, the algorithm compares the patterns of bytes found in the file to the expected byte ranges from the known encodings and decides which encoding best fits the data.

Russian encodings provide a good example of the different byte ranges encountered for a given language.
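The encoding dependence just described is easy to demonstrate directly with Python's built-in codecs: the same byte decodes to different characters under Latin-1 and under the two Russian encodings mentioned above, while UTF-8 characters vary in byte length. The specific byte value 0xC0 is chosen purely for illustration.

```python
# The same byte value means different things in different 8-bit
# encodings, which is why a tokenizer must know the encoding.
raw = bytes([0xC0])
for encoding in ["latin-1", "iso8859-5", "koi8-r"]:
    print(encoding, raw.decode(encoding))

# In UTF-8, by contrast, a character occupies 1 byte (ASCII),
# 2 bytes (most other alphabetic scripts), or 3 bytes (e.g., CJK),
# so bytes like 0xC0 are only fragments of multi-byte sequences.
for ch in ["a", "é", "Ж", "中"]:
    print(repr(ch), len(ch.encode("utf-8")), "byte(s)")
```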
In the ISO-8859-5 encoding for Russian texts, the capital Cyrillic letters are in the (hexadecimal) range B0-CF (and are listed in the traditional Cyrillic alphabetical order); the lowercase letters are in the range D0-EF. In contrast, in the KOI8-R encoding, the capital letters are E0-FF (and are listed in pseudo-Roman order); the lowercase letters are C0-DF. In Unicode, Cyrillic characters require two bytes, and the capital letters are in the range 0410 (the byte 04 followed by the byte 10) through 042F; the lowercase letters are in the range 0430-045F. A character encoding identification algorithm seeking to determine the encoding of a given Russian text would examine the bytes contained in the file to determine the byte ranges present. The hex byte 04 is a rare control character in ISO-8859-5 and in KOI8-R but would comprise nearly half the bytes in a Unicode Russian file. Similarly, a file in ISO-8859-5 would likely contain many bytes in the range B0-BF but few in F0-FF, while a file in KOI8-R would contain few in B0-BF and many in F0-FF. Using these simple heuristics to analyze the byte distribution in a file should allow for straightforward encoding identification for Russian texts.

Note that, due to the overlap between existing character encodings, even with a high-quality character encoding classifier, it may be impossible to determine the character encoding. For example, since most character encodings reserve the first 128 characters for the ASCII characters, a document that contains only these 128 characters could be any of the ISO-8859 encodings or even UTF-8.

2.2.2 Language Dependence

Impact of Writing System on Text Segmentation

In addition to the variety of symbol types (logographic, syllabic, or alphabetic) used in writing systems, there is a range of orthographic conventions used in written languages to denote the boundaries between linguistic units such as syllables, words, or sentences.
In many written Amharic texts, for example, both word and sentence boundaries are explicitly marked, while in written Thai texts neither is marked. In the latter case, where no boundaries are explicitly indicated in the written language, written Thai is similar to spoken language, where there are no explicit boundaries and few cues to indicate segments at any level. Between the two extremes are languages that mark boundaries to different degrees. English employs whitespace between most words and punctuation marks at sentence boundaries, but neither feature is sufficient to segment the text completely and unambiguously. Tibetan and Vietnamese both explicitly mark syllable boundaries, either through layout or by punctuation, but neither marks word boundaries. Written Chinese and Japanese have adopted punctuation marks for sentence boundaries, but neither denotes word boundaries.

In this chapter, we provide general techniques applicable to a variety of different writing systems. Since many segmentation issues are language-specific, we will also highlight the challenges faced by robust, broad-coverage tokenization efforts. For a very thorough description of the various writing systems employed to represent natural languages, including detailed examples of all languages and features discussed in this chapter, we recommend Daniels and Bright (1996).

Language Identification

The wide range of writing systems used by the languages of the world results in language-specific as well as orthography-specific features that must be taken into account for successful text segmentation. An important step in the document triage stage is thus to identify the language of each document or document section, since some documents are multilingual at the section level or even paragraph level.
For languages with a unique alphabet not used by any other languages, such as Greek or Hebrew, language identification is determined by character set identification. Similarly, character set identification can be used to narrow the task of language identification to a smaller number of languages that all share many characters, such as Arabic vs. Persian, Russian vs. Ukrainian, or Norwegian vs. Swedish. The byte range distribution used to determine character set identification can further be used to identify bytes, and thus characters, that are predominant in one of the remaining candidate languages, if the languages do not share exactly the same characters. For example, while Arabic and Persian both use the Arabic alphabet, the Persian language uses several supplemental characters that do not appear in Arabic. For more difficult cases, such as European languages that use exactly the same character set but with different frequencies, final identification can be performed by training models of byte/character distributions in each of the languages. A basic but very effective algorithm for this would sort the bytes in a file by frequency count and use the sorted list as a signature vector for comparison via an n-gram or vector distance model.

2.2.3 Corpus Dependence

Early NLP systems rarely addressed the problem of robustness, and they normally could process only well-formed input conforming to their hand-built grammars. The increasing availability of large corpora in multiple languages that encompass a wide range of data types (e.g., newswire texts, email messages, closed captioning data, Internet news pages, and weblogs) has required the development of robust NLP approaches, as these corpora frequently contain misspellings, erratic punctuation and spacing, and other irregular features. It has become increasingly clear that algorithms which rely on input texts to be well-formed are much less successful on these different types of texts.
Similarly, algorithms that expect a corpus to follow a set of conventions for a written language are frequently not robust enough to handle a variety of corpora, especially those harvested from the Internet. It is notoriously difficult to prescribe rules governing the use of a written language; it is even more difficult to get people to "follow the rules." This is in large part due to the nature of written language, in which the conventions are not always in line with actual usage and are subject to frequent change. So while punctuation roughly corresponds to the use of suprasegmental features in spoken language, reliance on well-formed sentences delimited by predictable punctuation can be very problematic. In many corpora, traditional prescriptive rules are commonly ignored. This fact is particularly important to our discussion of both word and sentence segmentation, which to a large degree depends on the regularity of spacing and punctuation. Most existing segmentation algorithms for natural languages are both language-specific and corpus-dependent, developed to handle the predictable ambiguities in a well-formed text. Depending on the origin and purpose of a text, capitalization and punctuation rules may be followed very closely (as in most works of literature), erratically (as in various newspaper texts), or not at all (as in email messages and personal Web pages). Corpora automatically harvested from the Internet can be especially ill-formed, such as Example (1), an actual posting to a Usenet newsgroup, which shows the erratic use of capitalization and punctuation, "creative" spelling, and domain-specific terminology inherent in such texts.

(1) ive just loaded pcl onto my akcl. when i do an 'in- package' to load pcl, ill get the prompt but im not able to use functions like defclass, etc... is there womething basic im missing or am i just left hanging, twisting in the breeze?

Many digital text files, such as those harvested from the Internet, contain large regions of text that are undesirable for the NLP application processing the file.
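A minimal sketch of stripping such undesirable regions from a Web page, using only the standard library; this illustrates the idea of text sectioning, and is not a full cleaning system like those evaluated in Cleaneval:

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Collect visible text, skipping script and style regions."""
    SKIP = {"script", "style"}  # regions with no linguistic content

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

A real sectioning system would also need to remove navigation links, advertisements, and boilerplate text, which requires heuristics beyond markup stripping.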
For example, Web pages can contain headers, images, advertisements, site navigation links, browser scripts, search engine optimization terms, and other markup, little of which is considered actual content. Robust text segmentation algorithms designed for use with such corpora must therefore have the capability to handle the range of irregularities which distinguish these texts from well-formed corpora. A key task in the document triage stage for such files is text sectioning, in which extraneous text is removed. The sectioning and cleaning of Web pages has recently become the focus of Cleaneval, "a shared task and competitive evaluation on the topic of cleaning arbitrary Web pages, with the goal of preparing Web data for use as a corpus for linguistic and language technology research and development" (Baroni et al. 2008).

2.2.4 Application Dependence

Although word and sentence segmentation are necessary, in reality, there is no absolute definition for what constitutes a word or a sentence. Both are relatively arbitrary distinctions that vary greatly across written languages. However, for the purposes of computational linguistics we need to define exactly what we need for further processing; in most cases, the language and task at hand determine the necessary conventions. For example, the English words I am are frequently contracted to I'm, and a tokenizer frequently expands the contraction to recover the essential grammatical features of the pronoun and the verb. A tokenizer that does not expand this contraction to the component words would pass the single token I'm to later processing stages. Unless these processors, which may include morphological analyzers, part-of-speech taggers, lexical lookup routines, or parsers, are aware of both the contracted and uncontracted forms, the token may be treated as an unknown word.

Another example of the dependence of tokenization output on later processing stages is the treatment of the English possessive 's in various tagged corpora.∗ In the Brown corpus (Francis and Kucera 1982), the word governor's is considered one token and is tagged as a possessive noun.
In the Susanne corpus (Sampson 1995), on the other hand, the same word is treated as two tokens, governor and 's, tagged singular noun and possessive, respectively.

Examples such as the above are usually addressed during tokenization by normalizing the text to meet the requirements of the applications. For example, language modeling for automatic speech recognition requires that the tokens be represented in a form similar to how they are spoken (and thus input to the speech recognizer). For example, the written token $300 would be spoken as "three hundred dollars," and the text normalization would convert the original to the desired three tokens. Other applications may require that this and all other monetary amounts be converted to a single token such as "MONETARY_TOKEN."

In languages such as Chinese, which do not contain white space between any words, a wide range of word segmentation conventions are currently in use. Different segmentation standards have a significant impact on applications such as information retrieval and text-to-speech synthesis, as discussed in Wu (2003). Task-oriented Chinese segmentation has received a great deal of attention in the MT community (Chang et al. 2008; Ma and Way 2009; Zhang et al. 2008).

The tasks of word and sentence segmentation overlap with the techniques discussed in many other chapters in this handbook, in particular the chapters on Lexical Analysis, Corpus Creation, and Multiword Expressions, as well as practical applications discussed in other chapters.

2.3 Tokenization

Section 2.2 discussed the many challenges inherent in segmenting freely occurring text. In this section, we focus on the specific technical issues that arise in tokenization.

Tokenization is well-established and well-understood for artificial languages such as programming languages.† However, such artificial languages can be strictly defined to eliminate lexical and structural ambiguities; we do not have this luxury with natural languages, in which the same character can serve many different purposes and in which the syntax is not strictly defined.
Many factors can affect the difficulty of tokenizing a particular natural language. One fundamental difference exists between tokenization approaches for space-delimited languages and approaches for unsegmented languages. In space-delimited languages, such as most European languages, some word boundaries are indicated by the insertion of whitespace. The character sequences delimited are not necessarily the tokens required for further processing, due both to the ambiguous nature of the writing systems and to the range of tokenization conventions required by different applications. In unsegmented languages, such as Chinese and Thai, words are written in succession with no indication of word boundaries. The tokenization of unsegmented languages therefore requires additional lexical and morphological information.

In both unsegmented and space-delimited languages, the specific challenges posed by tokenization are largely dependent on both the writing system (logographic, syllabic, or alphabetic, as discussed in Section 2.2.2) and the typographical structure of the words. There are three main categories into which word structures can be placed,‡ and each category exists in both unsegmented and space-delimited writing systems. The morphology of words in a language can be isolating, where words do not divide into smaller units; agglutinating (or agglutinative), where words divide into smaller units (morphemes) with clear boundaries between the morphemes; or inflectional, where the boundaries between morphemes are not clear and where the component morphemes can express more than one grammatical meaning.

∗ This example is taken from Grefenstette and Tapanainen (1994).
† For a thorough introduction to the basic techniques of tokenization in programming languages, see Aho et al. (1986).
‡ This classification comes from Comrie et al. (1996) and Crystal (1987).
While individual languages show tendencies toward one specific type (e.g., Mandarin Chinese is predominantly isolating, Japanese is strongly agglutinative, and Latin is largely inflectional), most languages exhibit traces of all three. A fourth typological classification frequently studied by linguists, polysynthetic, can be considered an extreme case of agglutinative, where several morphemes are put together to form complex words that can function as a whole sentence. Chukchi and Inuktitut are examples of polysynthetic languages, and some research in machine translation has focused on a Nunavut Hansards parallel corpus of Inuktitut and English (Martin et al. 2003).

Since the techniques used in tokenizing space-delimited languages are very different from those used in tokenizing unsegmented languages, we discuss the techniques separately in Sections 2.3.1 and 2.3.2, respectively.

2.3.1 Tokenization in Space-Delimited Languages

In many alphabetic writing systems, including those that use the Latin alphabet, words are separated by whitespace. Yet even in a well-formed corpus of sentences, there are many issues to resolve in tokenization. Most tokenization ambiguity exists among uses of punctuation marks, such as periods, commas, quotation marks, apostrophes, and hyphens, since the same punctuation mark can serve many different functions in a single sentence, let alone a single text. Consider example sentence (3) from the Wall Street Journal (1988).

(3) Clairson International Corp. said it expects to report a net loss for its second quarter ended March 26 and doesn't expect to meet analysts' profit estimates of $3.9 to $4 million, or 76 cents a share to 79 cents a share, for its year ending Sept. 24.

This sentence has several items of interest that are common for Latinate, alphabetic, space-delimited languages. First, it uses periods in three different ways: within numbers as a decimal point (3.9), to mark abbreviations (Corp. and Sept.), and to mark the end of the sentence, in which case the period following the number 24 is not a decimal point.
The sentence uses apostrophes in two ways: to mark the genitive case (where the apostrophe denotes possession) in analysts' and to show contractions (places where letters have been left out of words) in doesn't. The tokenizer must thus be aware of the uses of punctuation marks and be able to determine when a punctuation mark is part of another token and when it is a separate token.

In addition to resolving these cases, we must make tokenization decisions about a phrase such as 76 cents a share, which on the surface consists of four tokens. However, when used adjectivally such as in the phrase a 76-cents-a-share dividend, it is normally hyphenated and appears as one. The semantic content is the same despite the orthographic differences, so it makes sense to treat the two identically, as the same number of tokens. Similarly, we must decide whether to treat the phrase $3.9 to $4 million differently than if it had been written as 3.9 to 4 million dollars or $3,900,000 to $4,000,000. Note also that the semantics of numbers can be dependent on both the genre and the application; in scientific literature, for example, the numbers 3.9, 3.90, and 3.900 have different significant digits and are not semantically equivalent. We discuss these ambiguities and other issues in the following sections.

A logical initial tokenization of a space-delimited language would be to consider as a separate token any sequence of characters preceded and followed by space. This successfully tokenizes words that are a sequence of alphabetic characters, but does not take into account punctuation characters. In many cases, characters such as commas, semicolons, and periods should be treated as separate tokens, although they are not preceded by whitespace (such as the case with the comma after $4 million in Example (3)).
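These punctuation issues can be illustrated with a small regex-based tokenizer; the pattern and the abbreviation list below are deliberately simplistic stand-ins for a real tokenizer's enumerated rules:

```python
import re

# A tiny, illustrative abbreviation list; real tokenizers enumerate many more.
ABBREVS = {"Corp.", "Sept.", "St."}

TOKEN_RE = re.compile(r"""
      \d+(?:[.,]\d+)*      # numbers, including decimals such as 3.9
    | [A-Za-z]+\.          # candidate abbreviations (Corp., Sept., ...)
    | [A-Za-z]+(?:'\w+)?   # words, keeping internal apostrophes (doesn't)
    | \S                   # any other single non-space character
""", re.VERBOSE)

def tokenize(sentence: str) -> list:
    tokens = []
    for tok in TOKEN_RE.findall(sentence):
        # Split the period off word-final periods that are not abbreviations.
        if tok.endswith(".") and tok not in ABBREVS:
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens
```

Note how the decimal point in 3.9 survives intact, the period stays attached to Corp., and the sentence-final period becomes its own token.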
Additionally, many texts contain certain classes of character sequences which should be filtered out before actual tokenization; these include existing markup and headers (including HTML markup), extra whitespace, and extraneous control characters.

Tokenizing Punctuation

While punctuation characters are usually treated as separate tokens, there are many cases when they should be "attached" to another token. The specific cases vary from one language to the next, and the specific treatment of the punctuation characters needs to be enumerated within the tokenizer for each language. In this section, we give examples of English tokenization.

Abbreviations are used in written language to denote the shortened form of a word. In many cases, abbreviations are written as a sequence of characters terminated with a period. When an abbreviation occurs at the end of a sentence, a single period marks both the abbreviation and the sentence boundary. For this reason, recognizing abbreviations is essential for both tokenization and sentence segmentation. Compiling a list of abbreviations can help in recognizing them, but abbreviations are productive, and it is not possible to compile an exhaustive list of all abbreviations in any language. Additionally, many abbreviations can also occur as words elsewhere in a text (e.g., the word Mass is also the abbreviation for Massachusetts). An abbreviation can also represent several different words, as is the case for St., which can stand for Saint, Street, or State. However, as Saint it is less likely to occur at a sentence boundary than as Street or State. Examples (4) and (5) from the Wall Street Journal (1991 and 1987, respectively) demonstrate the difficulties produced by such ambiguous cases, where the same abbreviation can represent different words and can occur both within and at the end of a sentence.

(4) The contemporary viewer may simply ogle the vast wooded vistas rising up from the Saguenay River and Lac St. Jean, standing in for the St. Lawrence River.
(5) The firm said it plans to sublease its current headquarters at 55 Water St. A spokesman declined to elaborate.

Recognizing an abbreviation is thus not sufficient for complete tokenization, and the appropriate definition for an abbreviation can be ambiguous, as discussed in Park and Byrd (2001). We address abbreviations at sentence boundaries fully in Section 2.4.2.

Quotation marks and apostrophes (" " ' ') are a major source of tokenization ambiguity. In most cases, single and double quotes indicate a quoted passage, and the extent of the tokenization decision is to determine whether they open or close the passage. In many character sets, single quote and apostrophe are the same character, and it is therefore not always possible to immediately determine if the single quotation mark closes a quoted passage, or serves another purpose as an apostrophe. In addition, as discussed in Section 2.2.1, quotation marks are also commonly used when "romanizing" writing systems, in which umlauts are replaced by a double quotation mark and accents are denoted by a single quotation mark or an apostrophe.

The apostrophe is a very ambiguous character. In English, the main uses of apostrophes are to mark the genitive form of a noun, to mark contractions, and to mark certain plural forms. In the genitive case, some applications require a separate token while some require a single token, as discussed in Section 2.2.4. How to treat the genitive case is important, since in other languages, the possessive form of a word is not marked with an apostrophe and cannot be as readily recognized. In German, for example, the possessive form of a noun is usually formed by adding the letter s to the word, without an apostrophe, as in Peters Kopf (Peter's head). However, in modern (informal) usage in German, Peter's Kopf would also be common; the apostrophe is also frequently omitted in modern (informal) English such that Peters head is a possible construction. Furthermore, in English, 's can serve as a contraction for the verb is, as in he's, it's, she's, and Peter's head and shoulders above the rest.
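Where an application does call for expanding contractions (as discussed in Section 2.2.4), the expansion is typically driven by an enumerated table; a minimal sketch, with a deliberately tiny table:

```python
# An illustrative contraction table; a real tokenizer enumerates far more,
# and ambiguous cases ('s as "is" vs. possessive) need context to resolve.
CONTRACTIONS = {
    "I'm": ["I", "am"],
    "we've": ["we", "have"],
    "doesn't": ["does", "not"],
}

def expand(tokens):
    """Replace each listed contraction with its component words."""
    out = []
    for tok in tokens:
        out.extend(CONTRACTIONS.get(tok, [tok]))
    return out
```

Tokens absent from the table pass through unchanged, which is exactly the behavior needed for word-internal apostrophes that should not be expanded.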
It also occurs in the plural form of some words, such as I.D.'s or 1980's, although the apostrophe is also frequently omitted from such plurals. The tokenization decision in these cases is context dependent and is closely tied to syntactic analysis.

In the case of apostrophe as contraction, tokenization may require the expansion of the word to eliminate the apostrophe, but the cases where this is necessary are very language-dependent. The English contraction I'm could be tokenized as the two words I am, and we've could become we have. Written French contains a completely different set of contractions, including contracted articles (l'homme, c'etait), as well as contracted pronouns (j'ai, je l'ai) and other forms such as n'y, qu'ils, d'ailleurs, and aujourd'hui. Clearly, recognizing the contractions to expand requires knowledge of the language, and the specific contractions to expand, as well as the expanded forms, must be enumerated. All other word-internal apostrophes are treated as a part of the token and not expanded, which allows the proper tokenization of multiply-contracted words such as fo'c's'le (forecastle) and Pudd'n'head (Puddinghead) as single words. In addition, since contractions are not always demarcated with apostrophes, as in the French du, which is a contraction of de le, or the Spanish del, contraction of de el, other words to expand must also be listed in the tokenizer.

Multi-Part Words

To different degrees, many written languages contain space-delimited words composed of multiple units, each expressing a particular grammatical meaning. For example, the single Turkish word çöplüklerimizdekilerdenmiydi∗ means "was it from those that were in our garbage cans?" This type of construction is particularly common in strongly agglutinative languages such as Swahili, Quechua, and most Altaic languages. It is also common in languages such as German, where noun–noun (Lebensversicherung, life insurance), adverb–noun (Nichtraucher, nonsmoker), and preposition–noun (Nachkriegszeit, postwar period) compounding are all possible.
In fact, though it is not an agglutinative language, German compounding can be quite complex, as in Feuerundlebensversicherung (fire and life insurance) or Kundenzufriedenheitsabfragen (customer satisfaction survey).

To some extent, agglutinating constructions are present in nearly all languages, though this compounding can be marked by hyphenation, in which the use of hyphens can create a single word with multiple grammatical parts. In English, it is commonly used to create single-token words like end-of-line as well as multi-token words like Boston-based. As with the apostrophe, the use of the hyphen is not uniform; for example, hyphen usage varies greatly between British and American English, as well as between different languages. However, as with the case of apostrophes as contractions, many common language-specific uses of hyphens can be enumerated in the tokenizer.

Many languages use the hyphen to create essential grammatical structures. In French, for example, hyphenated compounds such as va-t-il (will it?), c'est-à-dire (that is to say), and celui-ci (it) need to be expanded during tokenization, in order to recover necessary grammatical features of the sentence. In these cases, the tokenizer needs to contain an enumerated list of structures to be expanded, as with the contractions discussed above.

Another tokenization difficulty involving hyphens stems from the practice, common in traditional typesetting, of using hyphens at the ends of lines to break a word too long to include on one line. Such end-of-line hyphens can thus occur within words that are not normally hyphenated. Removing these hyphens is necessary during tokenization, yet it is difficult to distinguish between such incidental hyphenation and cases where naturally hyphenated words happen to occur at a line break. In an attempt to dehyphenate the artificial cases, it is possible to incorrectly remove necessary hyphens.
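One plausible, hedged sketch of dehyphenation: rejoin a line-broken word only when the joined form appears in a known vocabulary, and otherwise keep the hyphen. The vocabulary here is a tiny stand-in for a real lexicon, and the heuristic will still make the errors the text describes when a joined form happens to be a valid word:

```python
# Toy vocabulary standing in for a full lexicon.
VOCAB = {"tokenization", "preprocessing"}

def dehyphenate(lines):
    """Merge end-of-line hyphenated words when the joined form is known."""
    out = []
    for line in lines:
        if out and out[-1].endswith("-"):
            prev = out.pop()[:-1]             # drop the end-of-line hyphen
            head = prev.rsplit(" ", 1)[-1]    # the broken word's first half
            prefix = prev[: len(prev) - len(head)]
            first, _, rest = line.partition(" ")
            # incidental hyphen: join; otherwise restore the hyphen
            joiner = "" if (head + first).lower() in VOCAB else "-"
            out.append((prefix + head + joiner + first + " " + rest).strip())
        else:
            out.append(line)
    return " ".join(out)
```

The first case below removes an incidental typesetting hyphen; the second preserves a naturally hyphenated word that happened to fall at a line break.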
Grefenstette and Tapanainen (1994) found that nearly 5% of the end-of-line hyphens in an English corpus were word-internal hyphens, which happened to also occur as end-of-line hyphens.

In tokenizing multi-part words, such as hyphenated or agglutinative words, whitespace does not provide much useful information to further processing stages. In such cases, the problem of tokenization is very closely related both to tokenization in unsegmented languages, discussed in Section 2.3.2, and to morphological analysis, discussed in Chapter 3 of this handbook.

Multiword Expressions

Spacing conventions in written languages do not always correspond to the desired tokenization for NLP applications, and the resulting multiword expressions are an important consideration in the tokenization stage. A later chapter of this handbook addresses Multiword Expressions in full detail, so we touch briefly in this section on some of the tokenization issues raised by multiword expressions.

For example, the three-word English expression in spite of is, for all intents and purposes, equivalent to the single word despite, and both could be treated as a single token. Similarly, many common English expressions, such as au pair, de facto, and joie de vivre, consist of foreign loan words that can be treated as a single token.

Multiword numerical expressions are also commonly identified in the tokenization stage. Numbers are ubiquitous in all types of texts in every language, but their representation in the text can vary greatly. For most applications, sequences of digits and certain types of numerical expressions, such as dates and times, money expressions, and percents, can be treated as a single token. Several examples of such phrases can be seen in Example (3) above: March 26, $3.9 to $4 million, and Sept. 24 could each be treated as a single token. Similarly, phrases such as 76 cents a share and $3-a-share convey roughly the same meaning, despite the difference in hyphenation, and the tokenizer should normalize the two phrases to the same number of tokens (either one or four).

∗ This example is from Hankamer (1986).
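Normalization of such numeric phrases might be sketched as follows for dates; ISO 8601 is chosen here only as an example target form, and the patterns and month table are illustrative rather than exhaustive:

```python
import re

# Toy month table; a real normalizer covers all months and abbreviations.
MONTHS = {"nov": 11, "november": 11}

def normalize_date(text: str) -> str:
    """Map a few date variants to YYYY-MM-DD; pass other text through."""
    m = re.fullmatch(r"([A-Za-z]+)\.?\s+(\d{1,2}),\s*(\d{4})", text)  # Nov. 18, 1989
    if m and m.group(1).lower() in MONTHS:
        return f"{m.group(3)}-{MONTHS[m.group(1).lower()]:02d}-{int(m.group(2)):02d}"
    m = re.fullmatch(r"(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})", text)      # 18 November 1989
    if m and m.group(2).lower() in MONTHS:
        return f"{m.group(3)}-{MONTHS[m.group(2).lower()]:02d}-{int(m.group(1)):02d}"
    return text
```

Numeric forms such as 11/18/89 vs. 18/11/89 are deliberately omitted here, since resolving them requires knowing the language and regional convention of the text.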
Tokenizing numeric expressions requires knowledge of the syntax of such expressions, since numerical expressions are written differently in different languages. Even within a language or in languages as similar as English and French, major differences exist in the syntax of numeric expressions, in addition to the obvious vocabulary differences. For example, the English date November 18, 1989 could alternately appear in English texts as any number of variations, such as Nov. 18, 1989, 18 November 1989, 11/18/89 or 18/11/89. These examples underscore the importance of text normalization during the tokenization process, such that dates, times, monetary expressions, and all other numeric phrases can be converted into a form that is consistent with the processing required by the NLP application.

Closely related to hyphenation, the treatment of multiword expressions is highly language-dependent and application-dependent, but can easily be handled in the tokenization stage if necessary. We need to be careful, however, when combining words into a single token. The phrase no one, along with noone and no-one, is a commonly encountered English equivalent for nobody, and should normally be treated as a single token. However, in a context such as No one man can do it alone, it needs to be treated as two words. The same is true of the two-word phrase can not, which is not always equivalent to the single word cannot or the contraction can't.∗ In such cases, it is safer to allow a later process (such as a parser) to make the decision.

2.3.2 Tokenization in Unsegmented Languages

The nature of the tokenization task in unsegmented languages like Chinese, Japanese, and Thai is fundamentally different from tokenization in space-delimited languages like English. The lack of any spaces between words necessitates a more informed approach than simple lexical analysis. The specific approach to word segmentation for a particular unsegmented language is further limited by the writing system and orthography of the language, and a single general approach has not been developed. In
In Section2.3.2.1,wedescribesomealgorithms,whichhavebeenappliedtotheproblemtoobtainaninitial approximationforavarietyoflanguages.InSections2.3.2.2and2.3.2.3,wegivedetailsofsomesuccessful approachestoChineseandJapanesesegmentation,andinSection2.3.2.4,wedescribesomeapproaches, whichhavebeenappliedtolanguageswithunsegmentedalphabeticorsyllabicwritingsystems. Common Approaches Anextensivewordlistcombinedwithaninformedsegmentationalgorithmcanhelptoachieveacertain degree of accuracy in word segmentation, but the greatest barrier to accurate word segmentation is in recognizing unknown (or out-of-vocabulary) words, words not in the lexicon of the segmenter. This problem is dependent both on the source of the lexicon as well as the correspondence (in vocabulary) betweenthetextinquestionandthelexicon;forexample,WuandFung(1994)reportedthatsegmentation ∗ Forexample,considerthefollowingsentence:“WhyismysodacannotwhereIleftit?”20 HandbookofNaturalLanguageProcessing accuracyinChineseissignificantlyhigherwhenthelexiconisconstructedusingthesametypeofcorpus asthecorpusonwhichitistested. Another obstacle to high-accuracy word segmentation is the fact that there are no widely accepted guidelines as to what constitutes a word, and there is therefore no agreement on how to “correctly” segmentatextinanunsegmentedlanguage.Nativespeakersofalanguagedonotalwaysagreeaboutthe “correct” segmentation, and the same text could be segmented into several very different (and equally correct)setsofwordsbydifferentnativespeakers.AsimpleexamplefromEnglishwouldbethehyphenated phrase Boston-based. If asked to “segment” this phrase into words, some native English speakers might say Boston-based is a single word and some might say Boston and based are two separate words; in this lattercasetheremightalsobedisagreementaboutwhetherthehyphen“belongs”tooneofthetwowords (and to which one) or whether it is a “word” by itself. Disagreement by native speakers of Chinese is much more prevalent; in fact, Sproat et al. 
(1996) give empirical results showing that native speakers of Chinese agree on the correct segmentation in fewer than 70% of the cases. Such ambiguity in the definition of what constitutes a word makes it difficult to evaluate segmentation algorithms that follow different conventions, since it is nearly impossible to construct a "gold standard" against which to directly compare results.

A simple word segmentation algorithm consists of considering each character to be a distinct word. This is practical for Chinese because the average word length is very short (usually between one and two characters, depending on the corpus∗) and actual words can be recognized with this algorithm. Although it does not assist in tasks such as parsing, part-of-speech tagging, or text-to-speech systems (see Sproat et al. 1996), the character-as-word segmentation algorithm is very common in Chinese information retrieval, a task in which the words in a text play a major role in indexing and where incorrect segmentation can hurt system performance.

A very common approach to word segmentation is to use a variation of the maximum matching algorithm, frequently referred to as the greedy algorithm. The greedy algorithm starts at the first character in a text and, using a word list for the language being segmented, attempts to find the longest word in the list starting with that character. If a word is found, the maximum-matching algorithm marks a boundary at the end of the longest word, then begins the same longest match search starting at the character following the match. If no match is found in the word list, the greedy algorithm simply segments that character as a word (as in the character-as-word algorithm above) and begins the search starting at the next character. A variation of the greedy algorithm segments a sequence of unmatched characters as a single word; this variant is more likely to be successful in writing systems with longer average word lengths. In this manner, an initial segmentation can be obtained that is more informed than a simple character-as-word approach. The success of this algorithm is largely dependent on the word list.
As a demonstration of the application of the character-as-word and greedy algorithms, consider an example of artificially “desegmented” English, in which all the white space has been removed. The desegmented version of the phrase the table down there would thus be thetabledownthere. Applying the character-as-word algorithm would result in the useless sequence of tokens t h e t a b l e d o w n t h e r e, which is why this algorithm only makes sense for languages with short average word length, such as Chinese. Applying the greedy algorithm with a “perfect” word list containing all known English words would first identify the word theta, since that is the longest sequence of letters starting at the initial t which forms an actual word. Starting at the b following theta, the algorithm would then identify bled as the maximum match. Continuing in this manner, thetabledownthere would be segmented by the greedy algorithm as theta bled own there.

A variant of the maximum matching algorithm is the reverse maximum matching algorithm, in which the matching proceeds from the end of the string of characters, rather than the beginning. In the example above, thetabledownthere would be correctly segmented as the table down there by the reverse maximum matching algorithm. Greedy matching from the beginning and the end of the string of characters enables an algorithm such as forward-backward matching, in which the results are compared and the segmentation optimized based on the two results. In addition to simple greedy matching, it is possible to encode language-specific heuristics to refine the matching as it progresses.

∗ As many as 95% of Chinese words consist of one or two characters, according to Fung and Wu (1994).
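The reverse variant can be sketched in the same illustrative style; run on the desegmented example above, it recovers the intended segmentation. The small lexicon below is invented for the demonstration:

```python
def reverse_segment(text, lexicon):
    """Reverse maximum matching: scan from the end of the string,
    taking the longest lexicon word that ends at the current position."""
    max_len = max(len(w) for w in lexicon)
    words = []
    j = len(text)
    while j > 0:
        # The smallest i gives the longest candidate ending at j.
        for i in range(max(0, j - max_len), j):
            if text[i:j] in lexicon:
                words.append(text[i:j])
                j = i
                break
        else:
            words.append(text[j - 1])  # unmatched character becomes a word
            j -= 1
    words.reverse()
    return words

lexicon = {"the", "theta", "table", "bled", "down", "there", "own"}
print(reverse_segment("thetabledownthere", lexicon))
# ['the', 'table', 'down', 'there']
```

A forward-backward matcher would run both directions and then prefer, for example, the result with fewer words or fewer single-character fallbacks.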
Chinese Segmentation

The Chinese writing system consists of several thousand characters known as Hanzi, with a word consisting of one or more characters. In this section, we provide a few examples of previous approaches to Chinese word segmentation, but a detailed treatment is beyond the scope of this chapter. Much of our summary is taken from Sproat et al. (1996) and Sproat and Shih (2001). For a comprehensive summary of early work in Chinese segmentation, we also recommend Wu and Tseng (1993).

Most previous work in Chinese segmentation falls into one of three categories: statistical approaches, lexical rule-based approaches, and hybrid approaches that use both statistical and lexical information. Statistical approaches use data such as the mutual information between characters, compiled from a training corpus, to determine which characters are most likely to form words. Lexical approaches use manually encoded features about the language, such as syntactic and semantic information, common phrasal structures, and morphological rules, in order to refine the segmentation. The hybrid approaches combine information from both statistical and lexical sources.

Sproat et al. (1996) describe such a hybrid approach that uses a weighted finite-state transducer to identify both dictionary entries as well as unknown words derived by productive lexical processes. Palmer (1997) also describes a hybrid statistical-lexical approach in which the segmentation is incrementally improved by a trainable sequence of transformation rules; Hockenmaier and Brew (1998) describe a similar approach. Teahan et al. (2000) describe a novel approach based on adaptive language models similar to those used in text compression. Gao et al. (2005) describe an adaptive segmentation algorithm that allows for rapid retraining for new genres or segmentation standards and which does not assume a universal segmentation standard.

One of the significant challenges in comparing segmentation algorithms is the range in segmentation standards, and thus the lack of a common evaluation corpus, which would enable the direct comparison of algorithms.
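To make the statistical side concrete, one simple statistic of the kind mentioned above is the pointwise mutual information between adjacent characters, estimated from bigram counts: character pairs with high scores tend to cohere as words, and boundaries can be placed at low-scoring transitions. This is a toy sketch of the statistic itself, not a reconstruction of any of the cited systems:

```python
import math
from collections import Counter

def char_pmi(corpus):
    """Estimate PMI(x, y) = log2( P(xy) / (P(x) P(y)) ) for each pair
    of adjacent characters in the corpus. High-PMI pairs are more
    likely to belong to the same word."""
    unigrams = Counter(corpus)
    bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        pmi[pair] = math.log2(p_xy / (p_x * p_y))
    return pmi
```

A segmenter built on this idea would compute the scores over a training corpus and place word boundaries wherever the score of an adjacent pair falls below an empirically chosen threshold.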
In response to this challenge, Chinese word segmentation has been the focus of several organized evaluations in recent years. The “First International Chinese Word Segmentation Bakeoff” in 2003 (Sproat and Emerson 2003), and several others since, have built on similar evaluations within China to encourage a direct comparison of segmentation methods. These evaluations have helped to develop consistent standards both for segmentation and for evaluation, and they have made significant contributions by cleaning up inconsistencies within existing corpora.

Japanese Segmentation

The Japanese writing system incorporates alphabetic, syllabic, and logographic symbols. Modern Japanese texts, for example, frequently consist of many different writing systems: Kanji (Chinese Hanzi symbols), hiragana (a syllabary for grammatical markers and for words of Japanese origin), katakana (a syllabary for words of foreign origin), romanji (words written in the Roman alphabet), Arabic numerals, and various punctuation symbols. In some ways, the multiple character sets make tokenization easier, as transitions between character sets give valuable information about word boundaries. However, character set transitions are not enough, since a single word may contain characters from multiple character sets, such as inflected verbs, which can contain a Kanji base and a hiragana inflectional ending. Company names also frequently contain a mix of Kanji and romanji. For these reasons, most previous approaches to Japanese segmentation, such as the popular JUMAN (Matsumoto and Nagao 1994) and Chasen programs (Matsumoto et al. 1997), rely on manually derived morphological analysis rules.

To some extent, Japanese can be segmented using the same statistical techniques developed for Chinese. For example, Nagata (1994) describes an algorithm for Japanese segmentation similar to that used for Chinese segmentation by Sproat et al. (1996). More recently, Ando and Lee (2003) developed an
