Question? Leave a message!




REGULAR EXPRESSIONS

REGULAR EXPRESSIONS
ROBERT SEDGEWICK KEVIN WAYNE Algorithms 5.4 REGULAR EXPRESSIONS regular expressions ‣ REs and NFAs ‣ NFA simulation ‣ Algorithms FOUR TH EDITION NFA construction ‣ applications ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.edu5.4 REGULAR EXPRESSIONS regular expressions ‣ REs and NFAs ‣ NFA simulation ‣ Algorithms NFA construction ‣ applications ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduPattern matching Substring search. Find a single string in text. Pattern matching. Find one of a specified set of strings in text. Ex. genomics Fragile X syndrome is a common cause of mental retardation. A human's genome is a string. It contains triplet repeats of CGG or AGG, bracketed by GCG at the beginning and CTG at the end. Number of repeats is variable and is correlated to syndrome. pattern GCG(CGGAGG)CTG GCGGCGTGTGTGCGAGAGAGTGGGTTTAAAGCTGGCGCGGAGGCGGCTGGCGCGGAGGCTG text 3Syntax highlighting input output / Compilation: javac NFA.java Ada HTML Execution: java NFA regexp text Asm XHTML Dependencies: Stack.java Bag.java Digraph.java DirectedDFS.java Applescript LATEX Awk MediaWiki java NFA "(ABAC)D" AAAABD Bat ODF true Bib TEXINFO Bison ANSI java NFA "(ABAC)D" AAAAC C/C++ DocBook false C Cobol / Caml Changelog public class NFA Css D private Digraph G; // digraph of epsilon transitions Erlang private String regexp; // regular expression Flex private int M; // number of characters in regular expression Fortran GLSL // Create the NFA for the given RE Haskell public NFA(String regexp) Html Java this.regexp = regexp; Javalog M = regexp.length(); Javascript StackInteger ops = new StackInteger(); Latex G = new Digraph(M+1); Lisp ... Lua ⋮ GNU sourcehighlight 3.1.4 4Google code search http://code.google.com/p/chromium/source/search 5Pattern matching: applications Test if a string matches some pattern. Scan for virus signatures. Process natural language. Specify a programming language. Access information in digital libraries. Search genome using PROSITE patterns. Filter text (spam, NetNanny, Carnivore, malware). Validate dataentry fields (dates, email, URL, credit card). ... Parse text files. Compile a Java program. Crawl and index the Web. Read in data stored in ad hoc input file format. Create Java documentation from Javadoc comments. ... 6Regular expressions A regular expression is a notation to specify a set of strings. possibly infinite operation order example RE matches does not match 3 AABAAB AABAAB concatenation every other string AA 4 AA BAAB or every other string BAAB AA AB 2 ABA closure ABBBBBBBBA ABABA AAAAB A(AB)AAB every other string ABAAB 1 1 parent parenthes heses es A AA (AB)A ABABABABABA ABBA 7Regular expression shortcuts Additional operations are often added for convenience. operation example RE matches does not match CUMULUS SUCCUBUS .U.U.U. wildcard JUGULUM TUMULTUOUS word camelCase AZazaz character class Capitalized 4illegal ABCDE ADE A(BC)+DE at least 1 ABCBCDE BCDE 085401321 111111111 095094 exactly k 190725541 16654111 Ex. AE+ is shorthand for (ABCDE)(ABCDE) 8Regular expression examples RE notation is surprisingly expressive. regular expression matches does not match .SPB. RASPBERRY SUBSPACE CRISPBREAD SUBSPECIES (substring search) 093092094 166114433 1155555555 166451111 8675309 (U. S. Social Security numbers) az+(az+\.)+(educom) wayneprinceton.edu spamnowhere rsprinceton.edu (simplified email addresses) AZazAZaz09 ident3 3a PatternMatcher ident3 (Java identifiers) REs play a wellunderstood role in the theory of computation. 9Regular expression golf yes no obama romney bush mccain clinton kerry reagan gore … ... washington clinton http://xkcd.com/1313 Ex. Match elected presidents but not opponents (unless they later won). RE. burntcoyemtgajisonhlaedlevshlndipools madison harrison 10Illegally screening a job candidate “ First name and pre/2 last name w/7 bush or gore or republican or democrat or charg or accus or criticiz or blam or defend or iran contra First name of a candidate and pre/2 last name of a or clinton or spotted owl or florida recount or sex candidate w/7 bush or gore or republican or democrat or charg or accus or criticiz or blam or or controvers or fraud or investigat or bankrupt defend or iran contra or clinton or spotted owl or or layoff or downsiz or PNTR or NAFTA or outsourc florida recount or sex or controvers or racis or fraud or indict or enron or kerry or iraq or wmd or arrest or investigat or bankrupt or layoff or downsiz or or intox or fired or racis or intox or slur PNTR or NAFTA or outsourc or indict or enron or kerry or controvers or abortion or gay or homosexual or iraq or wmd or arrest or intox or fired or sex or or gun or firearm ” racis or intox or slur or arrest or fired or controvers or abortion or gay or homosexual or gun or firearm — LexisNexis search string used by Monica Goodling to illegally screen candidates for DOJ positions http://www.justice.gov/oig/special/s0807/final.pdf 11Can the average web surfer learn to use REs Google. Supports for full word wildcard and for union. 12Regular expressions to the rescue http://xkcd.com/208 13Can the average programmer learn to use REs Perl RE for valid RFC822 email addresses (:(:\r\n) \t)(:(:(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))"(:\"\r\\\\.(:(:\r\n) \t))"(:(: \r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))"(:\"\r\\\\.(:(:\r\n) \t))"(:(:\r\n) \t)))(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))\(\\\r\\\\.)\ (:(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))\(\\\r\\\\.)\(: (:\r\n) \t)))(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))"(:\"\r\\\\.(:(:\r\n) \t))"(:(:\r\n) \t))\(:(:\r\n) \t)(:(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))\(\\\r\\\\.)\(:(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))\(\\\r\\\\.)\(:(:\r\n) \t )))(:,(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))\(\\\r\\\\.)\(:(:\r\n) \t) )(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))\(\\\r\\\\.)\(:(:\r\n) \t)))) :(:(:\r\n) \t))(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))"(:\"\r\\\\.(:(:\r\n) \t))"(:(:\r \n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))"(:\"\r\\\\.(:(:\r\n) \t ))"(:(:\r\n) \t)))(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))\(\\\r\\\\.)\( :(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))\(\\\r\\\\.)\(:( :\r\n) \t)))\(:(:\r\n) \t))(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))"(:\"\r\\\\.(:(:\r\n) \t))"(:(:\r\n) \t)):(:(:\r\n) \t)(:(:(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))"(:\"\r\\ \\.(:(:\r\n) \t))"(:(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))" (:\"\r\\\\.(:(:\r\n) \t))"(:(:\r\n) \t)))(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\ ".\\))\(\\\r\\\\.)\(:(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\ \))\(\\\r\\\\.)\(:(:\r\n) \t)))(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))"(:\"\r\\\\.( :(:\r\n) \t))"(:(:\r\n) \t))\(:(:\r\n) \t)(:(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))\( \\\r\\\\.)\(:(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))\(\\ \r\\\\.)\(:(:\r\n) \t)))(:,(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))\(\\\ r\\\\.)\(:(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))\(\\\r\\ \\.)\(:(:\r\n) \t)))):(:(:\r\n) \t))(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))"(:\"\r\\\\ .(:(:\r\n) \t))"(:(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\))"( :\"\r\\\\.(:(:\r\n) \t))"(:(:\r\n) \t)))(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\". \\))\(\\\r\\\\.)\(:(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\ ))\(\\\r\\\\.)\(:(:\r\n) \t)))\(:(:\r\n) \t))(:,\s(:(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\ ".\\))"(:\"\r\\\\.(:(:\r\n) \t))"(:(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(= \"(),;:\\".\\))"(:\"\r\\\\.(:(:\r\n) \t))"(:(:\r\n) \t)))(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t )+\Z(=\"(),;:\\".\\))\(\\\r\\\\.)\(:(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+ \Z(=\"(),;:\\".\\))\(\\\r\\\\.)\(:(:\r\n) \t)))(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\".\\ ))"(:\"\r\\\\.(:(:\r\n) \t))"(:(:\r\n) \t))\(:(:\r\n) \t)(:(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\" (),;:\\".\\))\(\\\r\\\\.)\(:(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"() ,;:\\".\\))\(\\\r\\\\.)\(:(:\r\n) \t)))(:,(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(), ;:\\".\\))\(\\\r\\\\.)\(:(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\ ".\\))\(\\\r\\\\.)\(:(:\r\n) \t)))):(:(:\r\n) \t))(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\"(),;:\\". \\))"(:\"\r\\\\.(:(:\r\n) \t))"(:(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z(=\ "(),;:\\".\\))"(:\"\r\\\\.(:(:\r\n) \t))"(:(:\r\n) \t)))(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t) +\Z(=\"(),;:\\".\\))\(\\\r\\\\.)\(:(:\r\n) \t))(:\.(:(:\r\n) \t)(:(),;:\\".\\ \000\031+(:(:(:\r\n) \t)+\Z (=\"(),;:\\".\\))\(\\\r\\\\.)\(:(:\r\n) \t)))\(:(:\r\n) \t))));\s) http://www.exparrot.com/pdw/MailRFC822Address.html 14Regular expression caveat Writing a RE is like writing a program. Need to understand programming model. Can be easier to write than read. Can be difficult to debug. “ Some people, when confronted with a problem, think 'I know I'll use regular expressions.' Now they have two problems. ” — Jamie Zawinski (flame war on alt.religion.emacs) Bottom line. REs are amazingly powerful and expressive, but using them in applications can be amazingly complex and errorprone. 155.4 REGULAR EXPRESSIONS regular expressions ‣ REs and NFAs ‣ NFA simulation ‣ Algorithms NFA construction ‣ applications ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduDuality between REs and DFAs RE. Concise way to describe a set of strings. DFA. Machine to recognize whether a given string is in a given set. Kleene's theorem. For any DFA, there exists a RE that describes the same set of strings. For any RE, there exists a DFA that recognizes the same set of strings. RE DFA 0 (0101010) number of 1's is a multiple of 3 Stephen Kleene Princeton Ph.D. 1934 number of 1's is a multiple of 3 17reject Pattern matching implementation: basic plan (first attempt) Overview is the same as for KMP. No backup in text input stream. Lineartime guarantee. Ken Thompson Turing Award '83 Underlying abstraction. Deterministic finite state automata (DFA). Basic plan. apply Kleene’s theorem Build DFA from RE. Simulate DFA with text as input. pattern matches text text DFA for pattern AAAABD (ABAC)D pattern does not match text Bad news. Basic plan is infeasible (DFA may have exponential of states). 18 acceptreject Pattern matching implementation: basic plan (revised) Overview is similar to KMP. No backup in text input stream. Quadratictime guarantee (lineartime typical). Ken Thompson Turing Award '83 Underlying abstraction. Nondeterministic finite state automata (NFA). Basic plan. apply Kleene’s theorem Build NFA from RE. Simulate NFA with text as input. pattern matches text text NFA for pattern AAAABD (ABAC)D pattern does not match text Q. What is an NFA 19 acceptNondeterministic finitestate automata Regularexpressionmatching NFA. We assume RE enclosed in parentheses. One state per RE character (start = 0, accept = M). Red εtransition (change state, but don't scan text). Black match transition (change state and scan to next text char). Accept if any sequence of transitions ends in accept state. after scanning all text characters Nondeterminism. One view: machine can guess the proper sequence of state transitions. Another view: sequence is a proof that the machine accepts the text. 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) accept state NFA corresponding to the pattern ( ( A B A C ) D ) 20Nondeterministic finitestate automata Q. Is AAAABD matched by NFA A. Yes, because some sequence of legal transitions ends in state 11. A A A A B D 0 1 2 3 2 3 2 3 2 3 4 5 8 9 10 11 match transition: transition: accept state reached scan to next input character and all text characters scanned: change state and change state with no match pattern found Finding a pattern with ( ( A B A C ) D ) NFA 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) NFA corresponding to the pattern ( ( A B A C ) D ) 21Nondeterministic finitestate automata Q. Is AAAABD matched by NFA A. Yes, because some sequence of legal transitions ends in state 11. even though some sequences end in wrong state or stall A A A 0 1 2 3 2 3 4 no way out of state 4 wrong guess if input is A A A A B D A 0 1 6 7 no way out of state 7 A A A A C 0 1 2 3 2 3 2 3 2 3 4 no way out 0 1 2 3 4 5 6 7 8 9 10 11 of state 4 ( ( A B A C ) D ) Stalling sequences for ( ( A B A C ) D ) NFA NFA corresponding to the pattern ( ( A B A C ) D ) 22 A A A Nondeterministic finitestate automata 0 1 2 3 2 3 4 no way out of state 4 wrong guess if input is Q. Is AAAC matched by NFA A A A A B D A. No, because no sequence of legal transitions ends in state 11. A but need to argue about all possible sequences 0 1 6 7 no way out of state 7 A A A A C 0 1 2 3 2 3 2 3 2 3 4 no way out of state 4 Stalling sequences for ( ( A B A C ) D ) NFA 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) NFA corresponding to the pattern ( ( A B A C ) D ) 23Nondeterminism Q. How to determine whether a string is matched by an automaton DFA. Deterministic ⇒ easy because exactly one applicable transition. NFA. Nondeterministic ⇒ can be several applicable transitions; need to select the right one Q. How to simulate NFA A. Systematically consider all possible transition sequences. stay tuned 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) NFA corresponding to the pattern ( ( A B A C ) D ) 245.4 REGULAR EXPRESSIONS regular expressions ‣ REs and NFAs ‣ NFA simulation ‣ Algorithms NFA construction ‣ applications ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduNFA representation State names. Integers from 0 to M. number of symbols in RE Matchtransitions. Keep regular expression in array re. εtransitions. Store in a digraph G. 0→1, 1→2, 1→6, 2→3, 3→2, 3→4, 5→8, 8→9, 10→11 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) accept state NFA corresponding to the pattern ( ( A B A C ) D ) 26NFA simulation Q. How to efficiently simulate an NFA A. Maintain set of all possible states that NFA could be in after reading in the first i text characters. Q. How to perform reachability 27NFA simulation demo Goal. Check whether input matches pattern. input A A A A B B D D match transitions εtransitions 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) NFA corresponding to the pattern ( ( A B A C ) D ) 28NFA simulation demo When no more input characters: Accept if any state reachable is an accept state. Reject otherwise. input A A B D 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) accept set of states reachable : 10, 11 29Digraph reachability Digraph reachability. Find all vertices reachable from a given source or set of vertices. recall Section 4.2 public class public class public class DirectedDFS DirectedDFS DirectedDFS find vertices reachable from s DirectedDFS(Digraph G, int s) find vertices reachable from sources DirectedDFS(Digraph G, IterableInteger s) is v reachable from source(s) boolean marked(int v) Solution. Run DFS from each source, without unmarking vertices. Performance. Runs in time proportional to E + V. 30NFA simulation: Java implementation public class NFA private char re; // match transitions private Digraph G; // epsilon transition digraph private int M; // number of states public NFA(String regexp) M = regexp.length(); re = regexp.toCharArray(); stay tuned (next segment) G = buildEpsilonTransitionDigraph(); public boolean recognizes(String txt) / see next slide / public Digraph buildEpsilonTransitionDigraph() / stay tuned / 31NFA simulation: Java implementation public boolean recognizes(String txt) BagInteger pc = new BagInteger(); states reachable from DirectedDFS dfs = new DirectedDFS(G, 0); start by εtransitions for (int v = 0; v G.V(); v++) if (dfs.marked(v)) pc.add(v); for (int i = 0; i txt.length(); i++) set of states reachable after BagInteger states = new BagInteger(); scanning past txt.charAt(i) for (int v : pc) not necessarily a match if (v == M) continue; (RE needs to match full text) if ((rev == txt.charAt(i)) rev == '.') states.add(v+1); dfs = new DirectedDFS(G, states); pc = new BagInteger(); follow εtransitions for (int v = 0; v G.V(); v++) if (dfs.marked(v)) pc.add(v); for (int v : pc) accept if can end in state M if (v == M) return true; return false; 32NFA simulation: analysis Proposition. Determining whether an Ncharacter text is recognized by the NFA corresponding to an Mcharacter pattern takes time proportional to M N in the worst case. Pf. For each of the N text characters, we iterate through a set of states of size no more than M and run DFS on the graph of εtransitions. The NFA construction we will consider ensures the number of edges ≤ 3M. 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) accept state NFA corresponding to the pattern ( ( A B A C ) D ) 335.4 REGULAR EXPRESSIONS regular expressions ‣ REs and NFAs ‣ NFA simulation ‣ Algorithms NFA construction ‣ applications ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduBuilding an NFA corresponding to an RE States. Include a state for each symbol in the RE, plus an accept state. 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) accept state NFA corresponding to the pattern ( ( A B A C ) D ) 35Building an NFA corresponding to an RE Concatenation. Add matchtransition edge from state corresponding to characters in the alphabet to next state. Alphabet. A B C D Metacharacters. ( ) . 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) NFA corresponding to the pattern ( ( A B A C ) D ) 36Building an NFA corresponding to an RE Parentheses. Add εtransition edge from parentheses to next state. 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) NFA corresponding to the pattern ( ( A B A C ) D ) 37singlecharacter closure Building an NFA corresponding to an RE i i+1 A Closure. Add three εtransition edges for each operator. G.addEdge(i, i+1); G.addEdge(i+1, i); closure expression singlecharacter closure i i+1 i i+1 lp A . . . ( ) G.addEdge(i, i+1); G.addEdge(i+1, i); G.addEdge(lp, i+1); G.addEdge(i+1, lp); closure expression or expression i i+1 lp or i . . . lp ( ) ... ... ( ) G.addEdge(lp, i+1); G.addEdge(lp, or+1); G.addEdge(i+1, lp); 0 1 2 3 4 5 6 7 8 G.addEdge(or, i); 9 10 11 ( ( A B A C ) D ) NFA construction rules or expression or i lp ... ... ( ) NFA corresponding to the pattern ( ( A B A C ) D ) G.addEdge(lp, or+1); 38 G.addEdge(or, i); NFA construction rulessinglecharacter closure i i+1 A G.addEdge(i, i+1); G.addEdge(i+1, i); closure expression Building an NFA correspondi i+1 ing to an RE lp . . . ( ) Or. Add two εtransition edges for each operator. G.addEdge(lp, i+1); G.addEdge(i+1, lp); or expression or i lp ... ... ( ) G.addEdge(lp, or+1); G.addEdge(or, i); NFA construction rules 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) NFA corresponding to the pattern ( ( A B A C ) D ) 39NFA construction: implementation Goal. Write a program to build the εtransition digraph. Challenges. Remember left parentheses to implement closure and or; remember to implement or. Solution. Maintain a stack. ( symbol: push ( onto stack. symbol: push onto stack. ) symbol: pop corresponding ( and any intervening ; add εtransition edges for closure/or. 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) NFA corresponding to the pattern ( ( A B A C ) D ) 40NFA construction demo stack ( ( A B A C ) D ) 41NFA construction demo stack 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) accept state NFA corresponding to the pattern ( ( A B A C ) D ) 42NFA construction: Java implementation private Digraph buildEpsilonTransitionDigraph() Digraph G = new Digraph(M+1); StackInteger ops = new StackInteger(); for (int i = 0; i M; i++) int lp = i; left parentheses and if (rei == '(' rei == '') ops.push(i); else if (rei == ')') int or = ops.pop(); if (reor == '') 2way or lp = ops.pop(); G.addEdge(lp, or+1); G.addEdge(or, i); else lp = or; closure if (i M1 rei+1 == '') (needs 1character lookahead) G.addEdge(lp, i+1); G.addEdge(i+1, lp); if (rei == '(' rei == '' rei == ')') metasymbols G.addEdge(i, i+1); return G; 43NFA construction: analysis Proposition. Building the NFA corresponding to an Mcharacter RE takes time and space proportional to M. Pf. For each of the M characters in the RE, we add at most three εtransitions and execute at most two stack operations. 0 1 2 3 4 5 6 7 8 9 10 11 ( ( A B A C ) D ) NFA corresponding to the pattern ( ( A B A C ) D ) 445.4 REGULAR EXPRESSIONS regular expressions ‣ REs and NFAs ‣ NFA simulation ‣ Algorithms NFA construction ‣ applications ‣ ROBERT SEDGEWICK KEVIN WAYNE http://algs4.cs.princeton.eduGeneralized regular expression print Grep. Take a RE as a commandline argument and print the lines from standard input having some substring that is matched by the RE. public class GREP public static void main(String args) contains RE String re = "(." + args0 + ".)"; as a substring NFA nfa = new NFA(re); while (StdIn.hasNextLine()) String line = StdIn.readLine(); if (nfa.recognizes(line)) StdOut.println(line); Bottom line. Worstcase for grep (proportional to M N ) is the same as for bruteforce substring search. 46Typical grep application: crossword puzzles more words.txt a aback dictionary (standard in Unix) abacus abalone abandon … grep "s..ict.." words.txt constrictor stricter stricture 47Industrialstrength grep implementation To complete the implementation: Add multiway or. Handle metacharacters. Support character classes. Add capturing capabilities. Extend the closure operator. Error checking and recovery. Greedy vs. reluctant matching. Ex. Which substring(s) should be matched by the RE blink./blink reluctant reluctant blinktext/blinksome textblinkmore text/blink greedy 48Regular expressions in other languages Broadly applicable programmer's tool. Originated in Unix in the 1970s. Many languages support extended regular expressions. Built into grep, awk, emacs, Perl, PHP, Python, JavaScript, ... print all lines containing NEWLINE which grep 'NEWLINE' /.java occurs in any file with a .java extension egrep 'qwertyuiopzxcvbnm' words.txt egrep '...........' typewritten PERL. Practical Extraction and Report Language. replace all occurrences of from perl p i e 'sfromtog' input.txt with to in the file input.txt print all words that start perl n e 'print if /AZAZaz/' words.txt with uppercase letter do for each line 49Regular expressions in Java Validity checking. Does the input match the re Java string library. Use input.matches(re) for basic RE matching. public class Validate public static void main(String args) String regexp = args0; String input = args1; StdOut.println(input.matches(re)); java Validate "AZazAZaz09" ident123 legal Java identifier true valid email address java Validate "az+(az+\.)+(educom)" rscs.princeton.edu (simplified) true java Validate "093092094" 166114433 Social Security number true 50Harvesting information Goal. Print all substrings of input that match a RE. java Harvester "gcg(cggagg)ctg" chromosomeX.txt gcgcggcggcggcggcggctg gcgctg gcgctg harvest patterns from DNA gcgcggcggcggaggcggaggcggctg harvest links from website java Harvester "http://(\\w+\\.)(\\w+)" http://www.cs.princeton.edu http://www.princeton.edu http://www.google.com http://www.cs.princeton.edu/news 51Harvesting information RE pattern matching is implemented in Java's java.util.regexp.Pattern and java.util.regexp.Matcher classes. import java.util.regex.Pattern; import java.util.regex.Matcher; compile() creates a public class Harvester Pattern (NFA) from RE public static void main(String args) matcher() creates a String regexp = args0; Matcher (NFA simulator) In in = new In(args1); from NFA and text String input = in.readAll(); Pattern pattern = Pattern.compile(regexp); Matcher matcher = pattern.matcher(input); find() looks for while (matcher.find()) the next match StdOut.println(matcher.group()); group() returns the substring most recently found by find() 52Algorithmic complexity attacks Warning. Typical implementations do not guarantee performance Unix grep, Java, Perl, Python java Validate "(aaa)b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 1.6 seconds java Validate "(aaa)b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 3.7 seconds java Validate "(aaa)b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 9.7 seconds java Validate "(aaa)b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 23.2 seconds java Validate "(aaa)b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 62.2 seconds java Validate "(aaa)b" aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac 161.6 seconds SpamAssassin regular expression. java RE "az+az+(az\.+\.)+az+" spammerx...................... Takes exponential time on pathological email addresses. Troublemaker can use such addresses to DOS a mail server. 53Notsoregular expressions Backreferences. \1 notation matches subexpression that was matched earlier. Supported by typical RE implementations. (.+)\1 // beriberi couscous 1(11+)\1+ // 1111 111111 111111111 Some nonregular languages. Strings of the form w w for some string w: beriberi. Unary strings with a composite number of 1s: 111111. Bitstrings with an equal number of 0s and 1s: 01110100. WatsonCrick complemented palindromes: atttcggaaat. Remark. Pattern matching with backreferences is intractable. 54Context Abstract machines, languages, and nondeterminism. Basis of the theory of computation. Intensively studied since the 1930s. Basis of programming languages. Compiler. A program that translates a program to machine code. KMP string ⇒ DFA. grep RE ⇒ NFA. javac Java language ⇒ Java byte code. KMP grep Java pattern string RE program parser unnecessary check if legal check if legal compiler output DFA NFA byte code simulator DFA simulator NFA simulator JVM 55Summary of patternmatching algorithms Programmer. Implement substring search via DFA simulation. Implement RE pattern matching via NFA simulation. Theoretician. RE is a compact description of a set of strings. NFA is an abstract machine equivalent in power to RE. DFAs, NFAs, and REs have limitations. You. Practical application of core computer science principles. Example of essential paradigm in computer science. Build intermediate abstractions. Pick the right ones Solve important practical problems. 56
sharer
Presentations
Free
Document Information
Category:
Presentations
User Name:
Dr.BenjaminClark
User Type:
Teacher
Country:
United States
Uploaded Date:
21-07-2017