Applications of regular expressions ppt

regular expressions in theory of computation ppt and Formal language and regular expressions ppt
Dr.BenjaminClark Profile Pic
Dr.BenjaminClark,United States,Teacher
Published Date:21-07-2017
Your Website URL(Optional)
Comment
ROBERT SEDGEWICK KEVIN WAYNE Algorithms 5.4 REGULAR EXPRESSIONS regular expressions ‣ REs and NFAs ‣ NFA simulation ‣ Algorithms FOUR TH EDITION NFA construction ‣ applications ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu5.4 REGULAR EXPRESSIONS regular expressions ‣ REs and NFAs ‣ NFA simulation ‣ Algorithms NFA construction ‣ applications ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduPattern matching Substring search. Find a single string in text. Pattern matching. Find one of a specified set of strings in text. Ex. genomics Fragile X syndrome is a common cause of mental retardation. A human's genome is a string. It contains triplet repeats of CGG or AGG, bracketed by GCG at the beginning and CTG at the end. Number of repeats is variable and is correlated to syndrome. pattern GCG(CGGAGG)CTG GCGGCGTGTGTGCGAGAGAGTGGGTTTAAAGCTGGCGCGGAGGCGGCTGGCGCGGAGGCTG text 3Syntax highlighting input output / Compilation: javac NFA.java Ada HTML Execution: java NFA regexp text Asm XHTML Dependencies: Stack.java Bag.java Digraph.java DirectedDFS.java Applescript LATEX Awk MediaWiki % java NFA "(ABAC)D" AAAABD Bat ODF true Bib TEXINFO Bison ANSI % java NFA "(ABAC)D" AAAAC C/C++ DocBook false C Cobol / Caml Changelog public class NFA Css D private Digraph G; // digraph of epsilon transitions Erlang private String regexp; // regular expression Flex private int M; // number of characters in regular expression Fortran GLSL // Create the NFA for the given RE Haskell public NFA(String regexp) Html Java this.regexp = regexp; Javalog M = regexp.length(); Javascript StackInteger ops = new StackInteger(); Latex G = new Digraph(M+1); Lisp ... Lua ⋮ GNU source-highlight 3.1.4 4Google code search http://code.google.com/p/chromium/source/search 5Pattern matching: applications Test if a string matches some pattern. Scan for virus signatures. Process natural language. Specify a programming language. Access information in digital libraries. Search genome using PROSITE patterns. Filter text (spam, NetNanny, Carnivore, malware). Validate data-entry fields (dates, email, URL, credit card). ... Parse text files. Compile a Java program. Crawl and index the Web. Read in data stored in ad hoc input file format. Create Java documentation from Javadoc comments. ... 6Regular expressions A regular expression is a notation to specify a set of strings. possibly infinite operation order example RE matches does not match 3 AABAAB AABAAB concatenation every other string AA 4 AA BAAB or every other string BAAB AA AB 2 ABA closure ABBBBBBBBA ABABA AAAAB A(AB)AAB every other string ABAAB 1 1 parent parenthes heses es A AA (AB)A ABABABABABA ABBA 7Regular expression shortcuts Additional operations are often added for convenience. operation example RE matches does not match CUMULUS SUCCUBUS .U.U.U. wildcard JUGULUM TUMULTUOUS word camelCase A-Za-za-z character class Capitalized 4illegal ABCDE ADE A(BC)+DE at least 1 ABCBCDE BCDE 08540-1321 111111111 0-95-0-94 exactly k 19072-5541 166-54-111 Ex. A-E+ is shorthand for (ABCDE)(ABCDE) 8Regular expression examples RE notation is surprisingly expressive. regular expression matches does not match .SPB. RASPBERRY SUBSPACE CRISPBREAD SUBSPECIES (substring search) 0-93-0-92-0-94 166-11-4433 11-55555555 166-45-1111 8675309 (U. S. Social Security numbers) a-z+(a-z+\.)+(educom) wayneprinceton.edu spamnowhere rsprinceton.edu (simplified email addresses) _A-Za-z_A-Za-z0-9 ident3 3a PatternMatcher ident3 (Java identifiers) REs play a well-understood role in the theory of computation. 9Regular expression golf yes no obama romney bush mccain clinton kerry reagan gore … ... washington clinton http://xkcd.com/1313 Ex. Match elected presidents but not opponents (unless they later won). RE. burntcoyemtgajisonhlaedlevshlndipools madison harrison 10Illegally screening a job candidate “ First name and pre/2 last name w/7 bush or gore or republican or democrat or charg or accus or criticiz or blam or defend or iran contra First name of a candidate and pre/2 last name of a or clinton or spotted owl or florida recount or sex candidate w/7 bush or gore or republican or democrat or charg or accus or criticiz or blam or or controvers or fraud or investigat or bankrupt defend or iran contra or clinton or spotted owl or or layoff or downsiz or PNTR or NAFTA or outsourc florida recount or sex or controvers or racis or fraud or indict or enron or kerry or iraq or wmd or arrest or investigat or bankrupt or layoff or downsiz or or intox or fired or racis or intox or slur PNTR or NAFTA or outsourc or indict or enron or kerry or controvers or abortion or gay or homosexual or iraq or wmd or arrest or intox or fired or sex or or gun or firearm ” racis or intox or slur or arrest or fired or controvers or abortion or gay or homosexual or gun or firearm — LexisNexis search string used by Monica Goodling to illegally screen candidates for DOJ positions http://www.justice.gov/oig/special/s0807/final.pdf 11Can the average web surfer learn to use REs? Google. Supports for full word wildcard and for union. 12Regular expressions to the rescue http://xkcd.com/208 13Can the average programmer learn to use REs? Perl RE for valid RFC822 email addresses (?:(?:\r\n)? \t)(?:(?:(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))"(?:\"\r\\\\.(?:(?:\r\n)? \t))"(?:(?: \r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))"(?:\"\r\\\\.(?:(?:\r\n)? \t))"(?:(?:\r\n)? \t)))(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))\(\\\r\\\\.)\ (?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))\(\\\r\\\\.)\(?: (?:\r\n)? \t)))(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))"(?:\"\r\\\\.(?:(?:\r\n)? \t))"(?:(?:\r\n) ? \t))\(?:(?:\r\n)? \t)(?:(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))\(\\\r\\\\.)\(?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))\(\\\r\\\\.)\(?:(?:\r\n)? \t )))(?:,(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))\(\\\r\\\\.)\(?:(?:\r\n)? \t) )(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))\(\\\r\\\\.)\(?:(?:\r\n)? \t)))) :(?:(?:\r\n)? \t))?(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))"(?:\"\r\\\\.(?:(?:\r\n)? \t))"(?:(?:\r \n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))"(?:\"\r\\\\.(?:(?:\r\n)? \t ))"(?:(?:\r\n)? \t)))(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))\(\\\r\\\\.)\( ?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))\(\\\r\\\\.)\(?:(? :\r\n)? \t)))\(?:(?:\r\n)? \t))(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))"(?:\"\r\\\\.(?:(?:\r\n)? \t))"(?:(?:\r\n)? \t)):(?:(?:\r\n)? \t)(?:(?:(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))"(?:\"\r\\ \\.(?:(?:\r\n)? \t))"(?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))" (?:\"\r\\\\.(?:(?:\r\n)? \t))"(?:(?:\r\n)? \t)))(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\ ".\\))\(\\\r\\\\.)\(?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\ \))\(\\\r\\\\.)\(?:(?:\r\n)? \t)))(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))"(?:\"\r\\\\.( ?:(?:\r\n)? \t))"(?:(?:\r\n)? \t))\(?:(?:\r\n)? \t)(?:(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))\( \\\r\\\\.)\(?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))\(\\ \r\\\\.)\(?:(?:\r\n)? \t)))(?:,(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))\(\\\ r\\\\.)\(?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))\(\\\r\\ \\.)\(?:(?:\r\n)? \t)))):(?:(?:\r\n)? \t))?(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))"(?:\"\r\\\\ .(?:(?:\r\n)? \t))"(?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\))"(? :\"\r\\\\.(?:(?:\r\n)? \t))"(?:(?:\r\n)? \t)))(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\". \\))\(\\\r\\\\.)\(?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\ ))\(\\\r\\\\.)\(?:(?:\r\n)? \t)))\(?:(?:\r\n)? \t))(?:,\s(?:(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\ ".\\))"(?:\"\r\\\\.(?:(?:\r\n)? \t))"(?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?= \"(),;:\\".\\))"(?:\"\r\\\\.(?:(?:\r\n)? \t))"(?:(?:\r\n)? \t)))(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t )+\Z(?=\"(),;:\\".\\))\(\\\r\\\\.)\(?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+ \Z(?=\"(),;:\\".\\))\(\\\r\\\\.)\(?:(?:\r\n)? \t)))(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\".\\ ))"(?:\"\r\\\\.(?:(?:\r\n)? \t))"(?:(?:\r\n)? \t))\(?:(?:\r\n)? \t)(?:(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\" (),;:\\".\\))\(\\\r\\\\.)\(?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"() ,;:\\".\\))\(\\\r\\\\.)\(?:(?:\r\n)? \t)))(?:,(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(), ;:\\".\\))\(\\\r\\\\.)\(?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\ ".\\))\(\\\r\\\\.)\(?:(?:\r\n)? \t)))):(?:(?:\r\n)? \t))?(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\"(),;:\\". \\))"(?:\"\r\\\\.(?:(?:\r\n)? \t))"(?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z(?=\ "(),;:\\".\\))"(?:\"\r\\\\.(?:(?:\r\n)? \t))"(?:(?:\r\n)? \t)))(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t) +\Z(?=\"(),;:\\".\\))\(\\\r\\\\.)\(?:(?:\r\n)? \t))(?:\.(?:(?:\r\n)? \t)(?:(),;:\\".\\ \000-\031+(?:(?:(?:\r\n)? \t)+\Z (?=\"(),;:\\".\\))\(\\\r\\\\.)\(?:(?:\r\n)? \t)))\(?:(?:\r\n)? \t))))?;\s) http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html 14Regular expression caveat Writing a RE is like writing a program. Need to understand programming model. Can be easier to write than read. Can be difficult to debug. “ Some people, when confronted with a problem, think 'I know I'll use regular expressions.' Now they have two problems. ” — Jamie Zawinski (flame war on alt.religion.emacs) Bottom line. REs are amazingly powerful and expressive, but using them in applications can be amazingly complex and error-prone. 155.4 REGULAR EXPRESSIONS regular expressions ‣ REs and NFAs ‣ NFA simulation ‣ Algorithms NFA construction ‣ applications ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduDuality between REs and DFAs RE. Concise way to describe a set of strings. DFA. Machine to recognize whether a given string is in a given set. Kleene's theorem. For any DFA, there exists a RE that describes the same set of strings. For any RE, there exists a DFA that recognizes the same set of strings. RE DFA 0 (0101010) number of 1's is a multiple of 3 Stephen Kleene Princeton Ph.D. 1934 number of 1's is a multiple of 3 17reject Pattern matching implementation: basic plan (first attempt) Overview is the same as for KMP. No backup in text input stream. Linear-time guarantee. Ken Thompson Turing Award '83 Underlying abstraction. Deterministic finite state automata (DFA). Basic plan. apply Kleene’s theorem Build DFA from RE. Simulate DFA with text as input. pattern matches text text DFA for pattern AAAABD (ABAC)D pattern does not match text Bad news. Basic plan is infeasible (DFA may have exponential of states). 18 acceptreject Pattern matching implementation: basic plan (revised) Overview is similar to KMP. No backup in text input stream. Quadratic-time guarantee (linear-time typical). Ken Thompson Turing Award '83 Underlying abstraction. Nondeterministic finite state automata (NFA). Basic plan. apply Kleene’s theorem Build NFA from RE. Simulate NFA with text as input. pattern matches text text NFA for pattern AAAABD (ABAC)D pattern does not match text Q. What is an NFA? 19 acceptNondeterministic finite-state automata Regular-expression-matching NFA. We assume RE enclosed in parentheses. One state per RE character (start = 0, accept = M). Red ε-transition (change state, but don't scan text). Black match transition (change state and scan to next text char). Accept if any sequence of transitions ends in accept state. after scanning all text characters Nondeterminism. One view: machine can guess the proper sequence of state transitions. Another view: sequence is a proof that the machine accepts the text. 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) accept state NFA corresponding to the pattern ( ( A B A C ) D ) 20