Question? Leave a message!




Regular Expressions and Context-Free Grammars

Regular Expressions and Context-Free Grammars
MIT 6.035 Specifying Languages with Regular Exp pressions and ContextFree Grammars Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology Language Definition Problem • How to precisely define language • L Layeredd st truct ture of f l language d defifi nitii ti on • Start with a set of letters in language • • Lexical Lexical structure structure identifies identifies “words words ” in in language language (each word is a sequence of letters) • • Syntactic Syntactic s structure tructure identifies identifies “sentences sentences ” in in language (each sentence is a sequence of words) • Semantics meaning g of p prog gram ( (sp pecifies what result should be for each input) • Today’s topic: lexical and syntactic structures Specifying Formal Languages • Huge Triumph of Computer Science • Beautiful Theoretical Results • Practical Techniques and Applications • Two Dual Notions • Generative approach (g ( rammar or regullar expressiion) ) • Recognition approach (automaton) • Lots Lots of of theorems theorems about about converting converting o one ne approach approach automatically to another Specifying Lexical Structure Using Regular Regular Expressions Expressions • Have some alphabet ∑ = set of letters • R Regull ar expressii ons are b b uilt ilt f f rom: • ε empty string • • Any A nyl letter ett erf from roma alphabet l ph ab et ∑∑ •r r – regular expression r followed by r 1 2 1 2 (sequence) (sequence) •r r – either regular expression r or r 1 2 1 2 ( (choice) ) • r iterated sequence and choice ε r rr … • Parentheses to indicate grouping/precedence Concept of Regular Expression Generating Generating a a String String Rewrite regular expression until have only a sequence sequence of of letters letters (string) (string) left left Examp ple Gener Genera al l R Rul ules es (0 1).(01) 1) r r→ r 1 2 1 () (0 1)((0 ) 1).((01)) 2) r r→ r 1(01).(01) 1 2 2 1.(01) 3) ) r →rr 1.(01)(01) 4) r →ε 1.(01) 1.0 Nondeterminism in Generation • Rewriting is similar to equational reasoning • But different rule applications may yield different final results Examp p le 1 Example Example 2 2 (01).(01) (01).(01) (01)(01).(01) (() 01)((01)).((01)) 1(01).(01) 0(01).(01) 1.(01) 0.(01) 1 1.(01)(01) (01)(01) 0.(01)(01) 1.(01) 0.(01) 110 .0 001 .1 Concept of Language Generated by Regular Regular Expressions Expressions • Set of all strings generated by a regular expression expression is is language language of of regular regular expression expression • In general, language may be (countably) infinite • • String String in in language language is is often often called called a a token tokenExamples of Languages and Regular Expressions Expressions • ∑ = 0, 1, . • (01) (01) .(01) (01) Bi Binary fl fl oati ti ng poii nt t numb bers • (00) evenlength allzero strings • • 1(0101) 1(0101) strings strings w with ith e even ven number number of of zeros • ∑∑ = a a,bb ,cc , 00 , 11 , 2 2 • (abc)(abc012) alphanumeric identifiers • (012) trinary numbers Alternate Abstraction Finite FiniteState State A Automata utomata •Alphabet ∑ • S S et t of f st tat tes wiith th i iniiti ti al l and d accept t st tat tes • Transitions between states, labeled with letters (01).(01) Start state 1 1 . 0 0 Accept stateAutomaton Accepting String Conceptually, run string through automaton • Have current state and current letter in string • Start with start state and first letter in string • At each step, match current letter against a transition wh hose llab bell is same as lletter • Continue until reach end of string or match fails • If If end d iin accept t st tat te, aut tomat ton accept ts st triing • Language of automaton is set of strings it accepts Examp p le Current state Start state 1 1 . . 0 0 Accept state 11.0 Current Current let letter ter Examp ple Current state Start state 1 1 . . 0 0 Accept state 11.0 Current Current let letter ter Examp ple Current state Start state 1 1 . . 0 0 Accept state 11.0 Current Current let letter ter Examp ple Current state Start state 1 1 . . 0 0 Accept state 11.0 Current Current let letter ter Examp ple Current state Start state 1 1 . . 0 0 Accept state 11.0 Current Current let letter ter Examp ple Current state Start state 1 1 . . 0 0 Accept state 11.0 String is accepted Current Current let letter ter Generative Versus Recognition • Regular expressions give you a way to generate all strings in language • Automata give you a way to recognize if a specific string is in language •Phl hilosoph hicalllly very d diff fferent • Theoretically equivalent (for regular expressions expressions a and nd automata) automata) • Standard approach • • Use Use r regular egular expressions expressions w when hen d define efine llanguage anguage • Translated automatically into automata for implementation implementation From Regular Expressions to Automata Automata • Construction by structural induction • Gi Given an arbit bitrary regull ar expressiion r • Assume we can convert r to an automaton with • • One One start start state state • One accept state • • Show Show how how to convert all constructors to deliver o c c o an automaton with • One One start start state state • One accept state Basic Constructs Start state Accept Accept s state tate ε εε a a∈Σ Sequence Start state Accept Accept s state tate r r r r 1 1 2 2 1 1 2 2Sequence Old start state Start state Old Old accept accept state state Accept Accept s state tate r r r r 1 1 2 2 1 1 2 2Sequence Old start state Start state Old Old accept accept state state Accept Accept s state tate ε r r r r 1 1 2 2 1 1 2 2Sequence Old start state Start state Old Old accept accept state state Accept Accept s state tate ε ε r r r r 1 1 2 2 1 1 2 2Sequence Old start state Start state Old Old accept accept state state Accept Accept s state tate ε ε ε r r r r 1 1 2 2 1 1 2 2Choice Start state Accept Accept s state tate r 1 r r r r 1 2 r 2Choice Old start state Start state Old Old accept accept state state Accept Accept s state tate r 1 r r r r 1 2 r 2Choice Old start state Start state Old Old accept accept state state Accept Accept s state tate r 1 ε r r r r 1 2 ε r 2Choice Old start state Start state Old Old accept accept state state Accept Accept s state tate ε r 1 ε r r r r 1 2 ε r ε 2Kleene Star Old start state Start state Old Old accept accept state state Accept Accept s state tate r r r r Kleene Star Old start state Start state Old Old accept accept state state Accept Accept s state tate r r r r Kleene Star Old start state Start state Old Old accept accept state state Accept Accept s state tate ε ε r r r r Kleene Star Old start state Start state Old Old accept accept state state Accept Accept s state tate ε ε ε r r r r Kleene Star Old start state Start state Old Old accept accept state state Accept Accept s state tate ε ε ε r r r r ε NFA vs. DFA •DFA • • No Noεε transitions transitions • At most one transition from each state for each each letter letter a a NO NOT T OK OK OK b a • NFA – neither restrictionConversions • Our regular expression to automata conversion produces produces an an NFA NFA • Would like to have a DFA to make recognition ag algo orit thm s simp ple er • Can convert from NFA to DFA (but DFA may be exponentially larger than NFA) NFA to DFA Construction • DFA has a state for each subset of states in NFA • DFA start state corresponds to set of states reachable by following ε t tra ansitions nsitions f f rom om NFA NFA sta start t state state • DFA state is an accept state if an NFA accept state is in its set of NFA states • • To To compute compute the the transition transition for for a a given given D DFA FA state state D D and and lletter etter a a • Set S to empty set • Find the set N of D’s NFA states • For For a all ll NFA NFA s states tates n n in in N N – Compute set of states N’ that the NFA may be in after matching a – Set Set S S to to S S union union N N ’ • If S is nonempty, there is a transition for a from D to the DFA state that has the set S of NFA states • • Otherwise Otherwise , there there is is no no transition transition for for a a from from D D NFA to DFA Example for (ab).(ab) εε a a ε 3 5ε 11 13 εε εεεε . 1 1 2 2 7 7 8 8 9 9 10 10 15 15 16 16 εε εε 4 6 12 14 εε b b .. a a a a 5,7,2,3,4,8 13,15,10,11,12,16 a . a a a a a 1,2,3,4,8 9,10,11,12,16 b b b b .. 6, 67 7,22,33,44,88 14 14,15 15,10 10,11 11,12 12,16 16 b bLexical Structure in Languages Each language typically has several categories of words. words. In In a a typical typical programming programming llanguage: anguage: • Keywords (if, while) • Arithmetic Operations (+, , , /) • Integer numbers (1, 2, 45, 67) • Floating point numbers (1.0, .2, 3.337) •Id dentif fiers ( (ab bc, i, j, ab b345) ) • Typically have a lexical category for each keyword keyword a and/or nd/or e each ach c category ategory • Each lexical category defined by regexp Lexical Categories Example • IfKeyword = if • • WhileKeyword WhileKeyword = = while while • Operator = +/ •Intege tege r = 0 09 9 0 09 9 • Float = 09. 09 • Identifier = az(az09) • Note that 09 = (0123456789) az = (abc…yz) • Will Will use llexiicall cat tegoriies iin next t llevel lProgramming Language Syntax • Regular languages suboptimal for specifying programming programming language language syntax syntax • Why Constructs with nested syntax • (a+(b (a+(bc)) c)) (d (d (x (x(y (y z))) z))) • • if (x y) if (y z) a = 5 else a = 6 else a = 7 • • Regular Regular languages languages lack lack state state required required to to model model nesting • Canonical examp ple: nested exp pressions • No regular expression for language of parenthesized expressions Solution – ContextFree Grammar • Set of terminals Op = +/ Op, Int, Open, Close Int = 09 09 Each terminal defined Open = bl by regular expressiion Cl Close = • Set of nonterminals Start Start, E Expr xpr • Set of productions Start→ Expr •Sing gle nonterminal on LHS Exp pr→ Exp pr Op p Exp pr • Sequence of terminals and Expr→ Int nonterminals on RHS Expr→ Open Expr CloseProduction Game have a current string start start w with ith Start Start nonterminal nonterminal loop until no more nonterminals c choose oose a a no ont te ermina al in cu curre ent t st string g choose a production with nonterminal in LHS replace nonterminal with RHS of production substitute regular expressions with corresponding strings generated generated s string tring i is s i in n llanguage anguage Note: Note: d different ifferent choices choices p produce roduce different different strings strings Sample Derivation Op = +/ Start In Int t = 0 0 9 9 0 0 9 9 E Expr Open = Expr Op Expr Open Expr Close Op Expr Close = Open Expr Op Expr Close Op Expr Open Int Op Expr Close Op Expr Open Open Int Int Op Op Expr Expr Close Close Op Op Int Int 1) Start→ Expr Open Int Op Int Close Op Int 2) Expr→ Expr Op Expr 2 1 + 1 3) 3) E Expr→ It Int 4) Expr→ Open Expr Close Parse Tree • Internal Nodes: Nonterminals • L Leaves: T T ermii nalls • Edges: • • From From Nonterminal Nonterminal of of L LHS HS of of production production • To Nodes from RHS of production • • Captures Captures derivation derivation of of string stringParse Tree for 21+1 Start Expr Expr Expr Expr Op Op Open + Close Expr Int 1 Expr Expr Op Int Int 2 1 1 Ambiguity in Grammar Grammar is ambiguous if there are multiple derivations (therefore multiple parse trees) for a single string Derivation and parse tree usually reflect semantics of the the p program rogram g g Ambigy uity in grammar often reflects ambigguityy in semantics of language (which is considered undesirable) Ambiguity Example Two parse trees for 21+1 Tree corresponding T Tr re ee e corr corresponding esponding to 21+1 to 21+1 Start Start Expr Expr E Expr Op Expr Expr Op Expr + Int In Int t 2 Expr Op Expr Expr Op Expr 1 + Int Int Int Int 1 1 2 1 Eliminating Ambiguity Solution: hack the grammar Original Grammar Hacked Grammar Start→ Expr Start→ Expr Expr→ Expr Op Expr Expr→ Expr Op Int Expr→ Int Expr→ Int Ep Expr→ Open Open EEp xpr Close Close Ep Expr→ Open Open Ep Expr Close Close Concep ptually, y, makes all op perators associate to left Parse Trees for Hacked Grammar Only one parse tree for 21+1 Valid parse tree No longer valid parse tree Start Start Expr Expr Expr Op Int Expr Op Expr + 1 In Int t Expr Op Int 2 Expr Op Expr 1 + In Int t In Int t In In t t 2 1 1 Precedence Violations • All operators associate to left Parse tree for 2 2 34 34 • Vi Violl at tes preced d ence of f over + + Start • 234 associates like 234 Expr Expr Expr Expr Op Op In Int t 4 Exp pr Op p In Int t 3 Int 2 2Hacking Around Precedence Original Grammar Hacked Grammar Op = +/ AddOp = + Int = 09 09 MulOp = / Open = Int = 09 09 Close = Open = Close = Start Start→→ Expr Expr Start Start→→ Expr Expr Expr→ Expr Op Int Expr→ Expr AddOp Term Expr Expr→→ In Int t Expr Expr→→ Te Ter rm m Term→ Term MulOp Num Expr→ Open Expr Close Term→ Num Num→ Int Num→ Open Expr Close Parse Tree Changes New parse tree for 234 Old parse tree Start Start f for 2 234 34 Start Expr Expr AddOp Expr Term Expr Expr Op Op I Int Term 4 Term MulOp Num Expr Expr Op Op In Int t Num Num 3 Num Int Int Int 4 2 2 2 2 Int 3 General Idea • Group Operators into Precedence Levels • and d // are at t t top l level l, bibi ndd t trongest t s • + and are at next level, bind next strongest • • Nonterminal Nonterminal for f or each each Precedence Precedence Level Level • Term is nonterminal for and / for and • • Expr Expr is is nonterminal nonterminal or + + a • Can make operators left or right associative within within each each level level • Generalizes for arbitrary levels of precedence Parser • Converts program into a parse tree • Can be written by y hand • Or produced automatically by parser generator • Accepts a grammar as input • Produces a parser as output • Practical problem • Parse tree for hacked grammar is complicated • Would like to start with more intuitive parse tree Solution • Abstract versus Concrete Syntax • Ab Ab st tract t synt tax correspond ds t to “ “ii nt tuiti itive” ” way of thinking of structure of program • • Omits Omits d details etails like like s superfluous uperfluous keywords keywords that that are there to make the language unambiguous • Abstract syntax may be ambiguous • Concrete Syntax corresponds to full grammar used to parse the language • Parsers are often written to produce abstract syntax syntax trees trees. Abstract Syntax Trees • Start with intuitive but ambiguous grammar • HHk ack grammar t to mak ke i it t unamb bigi uous • Concrete parse trees • • Less Less intuitive intuitive • Convert concrete parse trees to abstract syntax trees trees • Correspond to intuitive grammar for language • Simpler Simpler f for or program program to to manipulate manipulateExample Hacked Unambiguous Gr Grammar ammar AddOp = + MulOp = / In Int t = = 0 0 9 9 0 0 9 9 Intuitive but Ambiguous Open = Close = Grammar St Start t → E Expr Op = /+ Expr→ Expr AddOp Term Int = 09 09 Expr→ Term Sta Start t→ EEp xpr Term → Term MulOp Num Term→ Num Expr→ Expr Op Expr Num → Int Expr Expr→→ In Int t Num→ Open Expr Close Start Concrete parse Abstract syntax tr tree ee tr tree ee Expr for 234 for 234 Op Expr Expr Start Start Expr Expr Expr Expr Expr Int Op p 4 4 Int Int AddOp Expr 2 3 Term • Uses intuitive Term grammar Term MulOp Num Num Num • • Eliminates Eliminates superfluous superfluous Num terminals Int Int 4 •Open 2 2 In In t t •Close 3 Start Abstr Abstract act parse parse tr tre ee e Further Further s simplified implified Start for 234 abstract syntax Expr Expr tree for 234 Op Expr Expr Op Int Expr 4 Expr Expr Expr Expr In Int t I Int O Op I Int Op 4 2 3 Int Int 2 2 3 3Summary • Lexical and Syntactic Levels of Structure • LLi exical l – regull ar expressii ons and d aut tomat ta • Syntactic – grammars • • Grammar Grammar ambiguities ambiguities • Hacked grammars • • Abstract Abstract syntax syntax trees trees • Generation versus Recognition Approaches • • Generation Generation more more convenient convenient for for specification specification • Recognition required in implementation Handling If Then Else Start → Stat Stat Stat →→ if if Expr Expr then then Stat Stat else else Stat Stat Stat → if Expr then Stat Stat → ... Parse Trees • Consider Statement if e then if e then s else s 1 2 1 2 Stat Two Parse Trees if Stat Expr e1 if Expr then Stat else Stat e2 2 s1 1 s2 2 Stat if if Expr Expr then then Stat Stat else else Stat Stat Which is correct e1 s2 if Expr then s1 e2 e2 Alternative Readings • Parse Tree Number 1 if if e 1 if e s 2 1 Gr Grammar ammar iis s a ambiguous mbiguous else s 2 • Parse Tree Number 2 if e 1 if e s 2 1 else s 2 Hack ed Gr ammar Goal → Stat Stat Stat →→ WithElse WithElse Stat → LastElse WithElse → if Expr then WithElse else WithElse WithElse → statements without if then or if then else LastElse → if Expr then Stat LastElse LastElse →→ if if Expr Expr then then WithElse WithElse else else LastElse LastElse Hacked Grammar • Basic Idea: control carefully where an if without an an else else can can occur occur • Either at top level of statement • • Or Or as as very very last last in in a a sequence sequence of of if if t then hen else else if if then ... statementsGrammar Vocabulary • Leftmost derivation • Al Always expand d s ll eft ft most t remaii niing nonterminal • • Similarly Similarly for for rightmost rightmost d derivation erivation • Sentential form • • Partially Partially o or r f fully ully derived derived s string tring f from rom a a step step in in valid derivation •0 + Expr p Op p Expr p •0 + Expr 2 Defining a Language • Grammar • Generative Generative approach approach • All strings that grammar generates (How many are there for grammar in previous example) • Automaton • Recognition approach • All strings that automaton accepts • Different flavors of grammars and automata • In general, grammars and automata correspond Regular Languages • Automaton Characterization • ( (S S,A A ,F F ,s ,s ) ) 0 F • Finite set of states S • • Finite Finite Alphabet Alphabet A A • Transition function F : S×A→ S • • Start Start s state tate s s 0 • Final states s F • • Lanuage Lanuage is is set set of of strings strings accepted accepted by by Automaton AutomatonRegular Languages • Regular Grammar Characterization • ( (TTN ,NTTSP ,S,P) ) • Finite set of Terminals T • • Finite Finite set set of of Nonterminals Nonterminals NT NT • Start Nonterminal S (goal symbol, start symbol) symbol) • Finite set of Productions P: NT→ T U NT U T NT • Language is set of strings generated by grammar Grammar and Automata Correspondence Correspondence Grammar Automaton Regular Regular rammar Finite FiniteState State A Automaton utomaton Grammar ContextFree Grammar PushDown Automaton ContextSensitive Grammar Turing g Machine ContextFree Grammars • Grammar Characterization • ( (TN T,NTTSP ,S,P) ) • Finite set of Terminals T • • Finite Finite set set of of Nonterminals Nonterminals NT NT • Start Nonterminal S (goal symbol, start symbol) symbol) • Finite set of Productions P: NT→ (T NT) • RHS RHS o of f p production roduction c can an have have any any sequence sequence of of terminals or nonterminals PushDown Automata • DFA Plus a Stack • ( (S S,A A ,V V, F F,s ,s ) ) 0 F • Finite set of states S • • Finite Finite Input Input A Alphabet lphabet A A, Stack Stack A Alphabet lphabet V V • Transition relation F : S ×(A Uε)×V→ S× V • • Start Start s state tate s s 0 • Final states s F • • Each Each configuration configuration consists consists o of f a a state state, a a stack stack, and remaining input string CFG Versus PDA • CFGs and PDAs are of equivalent power • G G rammar I Impl lement tat tii on M Mech h aniism: • Translate CFG to PDA, then use PDA to parse input input string string • Foundation for bottomup parser generators ContextSensitive Grammars and Turing Turing Machines Machines • ContextSensitive Grammars Allow Productions to Use Use C Context ontext • P: (T.NT)+→ (T.NT) • • Turing Turing Machines Machines Have Have • Finite State Control • • Two TwoWay Way T Tape ape Instead Instead of of A A Stack StackMIT OpenCourseWare http://ocw.mit.edu 6.035 Computer Language Engineering Spring 2010 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.