Lecture notes on regular expressions

How to test, use, and write regular expressions.
Dr. Douglas Patton, United States, Teacher
Published Date: 26-07-2017
Basic Text Processing: Regular Expressions
Dan Jurafsky

Regular expressions
• A formal language for specifying text strings
• How can we search for any of these?
  • woodchuck
  • woodchucks
  • Woodchuck
  • Woodchucks

www.ThesisScientist.com

Regular Expressions: Disjunctions
• Letters inside square brackets []

  Pattern         Matches
  [wW]oodchuck    Woodchuck, woodchuck
  [1234567890]    Any digit

• Ranges [A-Z]

  Pattern   Matches                 Example
  [A-Z]     An upper case letter    Drenched Blossoms
  [a-z]     A lower case letter     my beans were impatient
  [0-9]     A single digit          Chapter 1: Down the Rabbit Hole

Regular Expressions: Negation in Disjunction
• Negations [^Ss]
  • Caret means negation only when first in []

  Pattern   Matches                     Example
  [^A-Z]    Not an upper case letter    Oyfn pripetchik
  [^Ss]     Neither 'S' nor 's'         I have no exquisite reason
  [^e^]     Neither e nor ^             Look here
  a^b       The pattern a caret b       Look up a^b now

Regular Expressions: More Disjunction
• Woodchuck is another name for groundhog
• The pipe | for disjunction

  Pattern                      Matches
  groundhog|woodchuck          groundhog woodchuck
  yours|mine                   yours mine
  a|b|c                        = [abc]
  [gG]roundhog|[Ww]oodchuck    Groundhog groundhog Woodchuck woodchuck

Regular Expressions: ? * + .

  Pattern   Matches                       Example
  colou?r   Optional previous char        color colour
  oo*h      0 or more of previous char    oh ooh oooh ooooh
  o+h       1 or more of previous char    oh ooh oooh ooooh

  Pattern   Example
  baa+      baa baaa baaaa baaaaa
  beg.n     begin begun beg3n

The * and + operators are named after Stephen C. Kleene: Kleene *, Kleene +.

Regular Expressions: Anchors ^ $

  Pattern       Matches
  ^[A-Z]        Palo Alto
  ^[^A-Za-z]    1 “Hello”
  \.$           The end.
  .$            The end? The end

Example
• Find me all instances of the word “the” in a text.
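One way to explore this question is to try successively stricter patterns with Python's re module. The following is a minimal sketch, not part of the slides: the sample sentence and the helper name find_all are made up for illustration, and re.findall is used simply because it returns every non-overlapping match.

```python
import re

# Hypothetical sample sentence, chosen so that "the" appears capitalized,
# in lower case, and hidden inside another word ("other").
TEXT = "I saw The cat and the other dog."

def find_all(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)

# Three candidate patterns, from naive to stricter.
CANDIDATES = [
    r"the",                        # lower case only
    r"[tT]he",                     # either capitalization
    r"[^a-zA-Z][tT]he[^a-zA-Z]",   # must have a non-letter on both sides
]

for pattern in CANDIDATES:
    print(f"{pattern!r:32} {find_all(pattern, TEXT)}")
```

Note that the third pattern consumes the surrounding non-letter characters, so it cannot match a "the" at the very start or end of the string, where no such character exists.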
  the                         Misses capitalized examples
  [tT]he                      Incorrectly returns other or theology
  [^a-zA-Z][tT]he[^a-zA-Z]

Errors
• The process we just went through was based on fixing two kinds of errors
  • Matching strings that we should not have matched (there, then, other)
    • False positives (Type I)
  • Not matching things that we should have matched (The)
    • False negatives (Type II)

Errors cont.
• In NLP we are always dealing with these kinds of errors.
• Reducing the error rate for an application often involves two antagonistic efforts:
  • Increasing accuracy or precision (minimizing false positives)
  • Increasing coverage or recall (minimizing false negatives).

Summary
• Regular expressions play a surprisingly large role
  • Sophisticated sequences of regular expressions are often the first model for any text processing task
• For many hard tasks, we use machine learning classifiers
  • But regular expressions are used as features in the classifiers
  • Can be very useful in capturing generalizations

Basic Text Processing: Word tokenization

Text Normalization
• Every NLP task needs to do text normalization:
  1. Segmenting/tokenizing words in running text
  2. Normalizing word formats
  3. Segmenting sentences in running text

How many words?
• I do uh main- mainly business data processing
  • Fragments, filled pauses
• Seuss’s cat in the hat is different from other cats
  • Lemma: same stem, part of speech, rough word sense
    • cat and cats = same lemma
  • Wordform: the full inflected surface form
    • cat and cats = different wordforms

How many words?
they lay back on the San Francisco grass and looked at the stars and their
• Type: an element of the vocabulary.
• Token: an instance of that type in running text.
• How many?
  • 15 tokens (or 14)
  • 13 types (or 12) (or 11?)

How many words?
N = number of tokens
V = vocabulary = set of types
|V| is the size of the vocabulary
Church and Gale (1990): |V| > O(N^½)

                                     Tokens = N     Types = |V|
  Switchboard phone conversations    2.4 million    20 thousand
  Shakespeare                        884,000        31 thousand
  Google N-grams                     1 trillion     13 million

Simple Tokenization in UNIX
• (Inspired by Ken Church’s UNIX for Poets.)
• Given a text file, output the word tokens and their frequencies

  tr -sc 'A-Za-z' '\n' < shakes.txt |   # change all non-alpha to newlines
    sort |                              # sort in alphabetical order
    uniq -c                             # merge and count each type

  1945 A
    72 AARON
    19 ABBESS
     5 ABBOT
   ...
    25 Aaron
     6 Abate
     1 Abates
     5 Abbess
     6 Abbey
     3 Abbot
   ...

The first step: tokenizing

  tr -sc 'A-Za-z' '\n' < shakes.txt | head

  THE
  SONNETS
  by
  William
  Shakespeare
  From
  fairest
  creatures
  We
  ...

The second step: sorting

  tr -sc 'A-Za-z' '\n' < shakes.txt | sort | head

  A
  A
  A
  A
  A
  A
  A
  A
  A
  ...
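The same tokenize, sort, and count steps can be sketched in Python. This is an illustrative equivalent rather than anything from the slides: the function names tokenize and count_types are hypothetical, and the San Francisco sentence from the "How many words?" slide is used as input so that the token and type counts can be checked against the figures given there.

```python
import re
from collections import Counter

def tokenize(text):
    # Maximal runs of ASCII letters, mirroring tr -sc 'A-Za-z' '\n':
    # every non-letter acts as a separator and repeats are squeezed.
    return re.findall(r"[A-Za-z]+", text)

def count_types(tokens):
    # Mirror sort | uniq -c: one (count, type) pair per distinct type,
    # ordered by code point, so upper case sorts before lower case.
    counts = Counter(tokens)
    return [(counts[word], word) for word in sorted(counts)]

sentence = ("they lay back on the San Francisco grass "
            "and looked at the stars and their")

tokens = tokenize(sentence)
pairs = count_types(tokens)

print(len(tokens), "tokens,", len(pairs), "types")
for count, word in pairs:
    print(f"{count:7d} {word}")
```

On this sentence the sketch reports 15 tokens and 13 types, because "the" and "and" each occur twice. The ordering agrees with the UNIX sort output only when sort compares bytes (for example under the C locale); other locales may interleave upper and lower case.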
