Question? Leave a message!




Regular Expressions and Context-Free Grammars

Regular Expressions and Context-Free Grammars
MIT 6.035 Specifying Languages with Regular Exp pressions and Context-Free Grammars Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology Language Definition Problem • How to precisely define language • L Layeredd st truct ture of f l language d defifi nitii ti on • Start with a set of letters in language • • Lexical Lexical structure structure - identifies identifies “words words ” in in language language (each word is a sequence of letters) • • Syntactic Syntactic s structure tructure - identifies identifies “sentences sentences ” in in language (each sentence is a sequence of words) • Semantics - meaning g of p prog gram ( (sp pecifies what result should be for each input) • Today’s topic: lexical and syntactic structures Specifying Formal Languages • Huge Triumph of Computer Science • Beautiful Theoretical Results • Practical Techniques and Applications • Two Dual Notions • Generative approach (g ( rammar or regullar expressiion) ) • Recognition approach (automaton) • Lots Lots of of theorems theorems about about converting converting o one ne approach approach automatically to another Specifying Lexical Structure Using Regular Regular Expressions Expressions • Have some alphabet ∑ = set of letters • R Regull ar expressii ons are b b uilt ilt f f rom: • ε -empty string • • Any A nyl letter ett erf from roma alphabet l ph ab et ∑∑ •r r – regular expression r followed by r 1 2 1 2 (sequence) (sequence) •r r – either regular expression r or r 1 2 1 2 ( (choice) ) • r - iterated sequence and choice ε r rr … • Parentheses to indicate grouping/precedence Concept of Regular Expression Generating Generating a a String String Rewrite regular expression until have only a sequence sequence of of letters letters (string) (string) left left Examp ple Gener Genera al l R Rul ules es (0 1).(01) 1) r r→ r 1 2 1 () (0 1)((0 ) 1).((01)) 2) r r→ r 1(01).(01) 1 2 2 1.(01) 3) ) r →rr 1.(01)(01) 4) r →ε 1.(01) 1.0 Nondeterminism in Generation • Rewriting is similar to equational reasoning • But different rule applications may yield different final results Examp p le 1 Example Example 2 2 (01).(01) (01).(01) (01)(01).(01) (() 01)((01)).((01)) 1(01).(01) 0(01).(01) 1.(01) 0.(01) 1 1.(01)(01) (01)(01) 0.(01)(01) 1.(01) 0.(01) 110 .0 001 .1 Concept of Language Generated by Regular Regular Expressions Expressions • Set of all strings generated by a regular expression expression is is language language of of regular regular expression expression • In general, language may be (countably) infinite • • String String in in language language is is often often called called a a token tokenExamples of Languages and Regular Expressions Expressions • ∑ = 0, 1, . • (01) (01) .(01) (01) Bi Binary fl fl oati ti ng poii nt t numb bers - • (00) - even-length all-zero strings • • 1(0101) 1(0101) strings strings w with ith e even ven number number of of - zeros • ∑∑ = a a,bb ,cc , 00 , 11 , 2 2 • (abc)(abc012) - alphanumeric identifiers • (012) - trinary numbers Alternate Abstraction Finite Finite-State State A Automata utomata •Alphabet ∑ • S S et t of f st tat tes wiith th i iniiti ti al l and d accept t st tat tes • Transitions between states, labeled with letters (01).(01) Start state 1 1 . 0 0 Accept stateAutomaton Accepting String Conceptually, run string through automaton • Have current state and current letter in string • Start with start state and first letter in string • At each step, match current letter against a transition wh hose llab bell is same as lletter • Continue until reach end of string or match fails • If If end d iin accept t st tat te, aut tomat ton accept ts st triing • Language of automaton is set of strings it accepts Examp p le Current state Start state 1 1 . . 0 0 Accept state 11.0 Current Current let letter ter Examp ple Current state Start state 1 1 . . 0 0 Accept state 11.0 Current Current let letter ter Examp ple Current state Start state 1 1 . . 0 0 Accept state 11.0 Current Current let letter ter Examp ple Current state Start state 1 1 . . 0 0 Accept state 11.0 Current Current let letter ter Examp ple Current state Start state 1 1 . . 0 0 Accept state 11.0 Current Current let letter ter Examp ple Current state Start state 1 1 . . 0 0 Accept state 11.0 String is accepted Current Current let letter ter Generative Versus Recognition • Regular expressions give you a way to generate all strings in language • Automata give you a way to recognize if a specific string is in language •Phl hilosoph hicalllly very d diff fferent • Theoretically equivalent (for regular expressions expressions a and nd automata) automata) • Standard approach • • Use Use r regular egular expressions expressions w when hen d define efine llanguage anguage • Translated automatically into automata for implementation implementation From Regular Expressions to Automata Automata • Construction by structural induction • Gi Given an arbit bitrary regull ar expressiion r • Assume we can convert r to an automaton with • • One One start start state state • One accept state • • Show Show how how to convert all constructors to deliver o c c o an automaton with • One One start start state state • One accept state Basic Constructs Start state Accept Accept s state tate ε εε a a∈Σ Sequence Start state Accept Accept s state tate r r r r 1 1 2 2 1 1 2 2Sequence Old start state Start state Old Old accept accept state state Accept Accept s state tate r r r r 1 1 2 2 1 1 2 2Sequence Old start state Start state Old Old accept accept state state Accept Accept s state tate ε r r r r 1 1 2 2 1 1 2 2Sequence Old start state Start state Old Old accept accept state state Accept Accept s state tate ε ε r r r r 1 1 2 2 1 1 2 2Sequence Old start state Start state Old Old accept accept state state Accept Accept s state tate ε ε ε r r r r 1 1 2 2 1 1 2 2Choice Start state Accept Accept s state tate r 1 r r r r 1 2 r 2Choice Old start state Start state Old Old accept accept state state Accept Accept s state tate r 1 r r r r 1 2 r 2Choice Old start state Start state Old Old accept accept state state Accept Accept s state tate r 1 ε r r r r 1 2 ε r 2Choice Old start state Start state Old Old accept accept state state Accept Accept s state tate ε r 1 ε r r r r 1 2 ε r ε 2Kleene Star Old start state Start state Old Old accept accept state state Accept Accept s state tate r r r r Kleene Star Old start state Start state Old Old accept accept state state Accept Accept s state tate r r r r Kleene Star Old start state Start state Old Old accept accept state state Accept Accept s state tate ε ε r r r r Kleene Star Old start state Start state Old Old accept accept state state Accept Accept s state tate ε ε ε r r r r Kleene Star Old start state Start state Old Old accept accept state state Accept Accept s state tate ε ε ε r r r r ε NFA vs. DFA •DFA • • No Noεε transitions transitions • At most one transition from each state for each each letter letter a a NO NOT T OK OK OK b a • NFA – neither restrictionConversions • Our regular expression to automata conversion produces produces an an NFA NFA • Would like to have a DFA to make recognition ag algo orit thm s simp ple er • Can convert from NFA to DFA (but DFA may be exponentially larger than NFA) NFA to DFA Construction • DFA has a state for each subset of states in NFA • DFA start state corresponds to set of states reachable by following ε t tra ansitions nsitions f f rom om NFA NFA sta start t state state • DFA state is an accept state if an NFA accept state is in its set of NFA states • • To To compute compute the the transition transition for for a a given given D DFA FA state state D D and and lletter etter a a • Set S to empty set • Find the set N of D’s NFA states • For For a all ll NFA NFA s states tates n n in in N N – Compute set of states N’ that the NFA may be in after matching a – Set Set S S to to S S union union N N ’ • If S is nonempty, there is a transition for a from D to the DFA state that has the set S of NFA states • • Otherwise Otherwise , there there is is no no transition transition for for a a from from D D NFA to DFA Example for (ab).(ab) εε a a ε 3 5ε 11 13 εε εεεε . 1 1 2 2 7 7 8 8 9 9 10 10 15 15 16 16 εε εε 4 6 12 14 εε b b .. a a a a 5,7,2,3,4,8 13,15,10,11,12,16 a . a a a a a 1,2,3,4,8 9,10,11,12,16 b b b b .. 6, 67 7,22,33,44,88 14 14,15 15,10 10,11 11,12 12,16 16 b bLexical Structure in Languages Each language typically has several categories of words. words. In In a a typical typical programming programming llanguage: anguage: • Keywords (if, while) • Arithmetic Operations (+, -, , /) • Integer numbers (1, 2, 45, 67) • Floating point numbers (1.0, .2, 3.337) •Id dentif fiers ( (ab bc, i, j, ab b345) ) • Typically have a lexical category for each keyword keyword a and/or nd/or e each ach c category ategory • Each lexical category defined by regexp Lexical Categories Example • IfKeyword = if • • WhileKeyword WhileKeyword = = while while • Operator = +-/ •Intege tege r = 0 0-9 9 0 0-9 9 • Float = 0-9. 0-9 • Identifier = a-z(a-z0-9) • Note that 0-9 = (0123456789) a-z = (abc…yz) • Will Will use llexiicall cat tegoriies iin next t llevel lProgramming Language Syntax • Regular languages suboptimal for specifying programming programming language language syntax syntax • Why? Constructs with nested syntax • (a+(b (a+(b-c)) c)) (d (d -(x (x-(y (y -z))) z))) • • if (x y) if (y z) a = 5 else a = 6 else a = 7 • • Regular Regular languages languages lack lack state state required required to to model model nesting • Canonical examp ple: nested exp pressions • No regular expression for language of parenthesized expressions Solution – Context-Free Grammar • Set of terminals Op = +-/ Op, Int, Open, Close Int = 0-9 0-9 Each terminal defined Open = bl by regular expressiion Cl Close = • Set of nonterminals Start Start, E Expr xpr • Set of productions Start→ Expr •Sing gle nonterminal on LHS Exp pr→ Exp pr Op p Exp pr • Sequence of terminals and Expr→ Int nonterminals on RHS Expr→ Open Expr CloseProduction Game have a current string start start w with ith Start Start nonterminal nonterminal loop until no more nonterminals c choose oose a a no ont te ermina al in cu curre ent t st string g choose a production with nonterminal in LHS replace nonterminal with RHS of production substitute regular expressions with corresponding strings generated generated s string tring i is s i in n llanguage anguage Note: Note: d different ifferent choices choices p produce roduce different different strings strings Sample Derivation Op = +-/ Start In Int t = 0 0 -9 9 0 0 -9 9 E Expr Open = Expr Op Expr Open Expr Close Op Expr Close = Open Expr Op Expr Close Op Expr Open Int Op Expr Close Op Expr Open Open Int Int Op Op Expr Expr Close Close Op Op Int Int 1) Start→ Expr Open Int Op Int Close Op Int 2) Expr→ Expr Op Expr 2 - 1 + 1 3) 3) E Expr→ It Int 4) Expr→ Open Expr Close Parse Tree • Internal Nodes: Nonterminals • L Leaves: T T ermii nalls • Edges: • • From From Nonterminal Nonterminal of of L LHS HS of of production production • To Nodes from RHS of production • • Captures Captures derivation derivation of of string stringParse Tree for 2-1+1 Start Expr Expr Expr Expr Op Op Open + Close Expr Int 1 Expr Expr Op - Int Int 2 1 1 Ambiguity in Grammar Grammar is ambiguous if there are multiple derivations (therefore multiple parse trees) for a single string Derivation and parse tree usually reflect semantics of the the p program rogram g g Ambigy uity in grammar often reflects ambigguityy in semantics of language (which is considered undesirable) Ambiguity Example Two parse trees for 2-1+1 Tree corresponding T Tr re ee e corr corresponding esponding to 2-1+1 to 2-1+1 Start Start Expr Expr E Expr Op Expr Expr Op Expr - + Int In Int t 2 Expr Op Expr Expr Op Expr 1 + - Int Int Int Int 1 1 2 1 Eliminating Ambiguity Solution: hack the grammar Original Grammar Hacked Grammar Start→ Expr Start→ Expr Expr→ Expr Op Expr Expr→ Expr Op Int Expr→ Int Expr→ Int Ep Expr→ Open Open EEp xpr Close Close Ep Expr→ Open Open Ep Expr Close Close Concep ptually, y, makes all op perators associate to left Parse Trees for Hacked Grammar Only one parse tree for 2-1+1 Valid parse tree No longer valid parse tree Start Start Expr Expr Expr Op Int Expr Op Expr + 1 - In Int t Expr Op Int 2 Expr Op Expr - 1 + In Int t In Int t In In t t 2 1 1 Precedence Violations • All operators associate to left Parse tree for 2 2 -34 34 • Vi Violl at tes preced d ence of f over + + Start • 2-34 associates like 2-34 Expr Expr Expr Expr Op Op In Int t 4 Exp pr Op p In Int t - 3 Int 2 2Hacking Around Precedence Original Grammar Hacked Grammar Op = +-/ AddOp = +- Int = 0-9 0-9 MulOp = / Open = Int = 0-9 0-9 Close = Open = Close = Start Start→→ Expr Expr Start Start→→ Expr Expr Expr→ Expr Op Int Expr→ Expr AddOp Term Expr Expr→→ In Int t Expr Expr→→ Te Ter rm m Term→ Term MulOp Num Expr→ Open Expr Close Term→ Num Num→ Int Num→ Open Expr Close Parse Tree Changes New parse tree for 2-34 Old parse tree Start Start f for 2 2-34 34 Start Expr Expr AddOp Expr Term - Expr Expr Op Op I Int Term 4 Term MulOp Num Expr Expr Op Op In Int t Num Num - 3 Num Int Int Int 4 2 2 2 2 Int 3 General Idea • Group Operators into Precedence Levels • and d // are at t t top l level l, bibi ndd t trongest t s • + and - are at next level, bind next strongest • • Nonterminal Nonterminal for f or each each Precedence Precedence Level Level • Term is nonterminal for and / for and • • Expr Expr is is nonterminal nonterminal or + + a - • Can make operators left or right associative within within each each level level • Generalizes for arbitrary levels of precedence Parser • Converts program into a parse tree • Can be written by y hand • Or produced automatically by parser generator • Accepts a grammar as input • Produces a parser as output • Practical problem • Parse tree for hacked grammar is complicated • Would like to start with more intuitive parse tree Solution • Abstract versus Concrete Syntax • Ab Ab st tract t synt tax correspond ds t to “ “ii nt tuiti itive” ” way of thinking of structure of program • • Omits Omits d details etails like like s superfluous uperfluous keywords keywords that that are there to make the language unambiguous • Abstract syntax may be ambiguous • Concrete Syntax corresponds to full grammar used to parse the language • Parsers are often written to produce abstract syntax syntax trees trees. Abstract Syntax Trees • Start with intuitive but ambiguous grammar • HHk ack grammar t to mak ke i it t unamb bigi uous • Concrete parse trees • • Less Less intuitive intuitive • Convert concrete parse trees to abstract syntax trees trees • Correspond to intuitive grammar for language • Simpler Simpler f for or program program to to manipulate manipulateExample Hacked Unambiguous Gr Grammar ammar AddOp = +- MulOp = / In Int t = = 0 0 -9 9 0 0 -9 9 Intuitive but Ambiguous Open = Close = Grammar St Start t → E Expr Op = /+- Expr→ Expr AddOp Term Int = 0-9 0-9 Expr→ Term Sta Start t→ EEp xpr Term → Term MulOp Num Term→ Num Expr→ Expr Op Expr Num → Int Expr Expr→→ In Int t Num→ Open Expr Close Start Concrete parse Abstract syntax tr tree ee tr tree ee Expr for 2-34 for 2-34 Op Expr Expr Start Start Expr Expr Expr Expr Expr Int Op p 4 4 - Int Int AddOp Expr 2 3 Term - • Uses intuitive Term grammar Term MulOp Num Num Num • • Eliminates Eliminates superfluous superfluous Num terminals Int Int 4 •Open 2 2 In In t t •Close 3 Start Abstr Abstract act parse parse tr tre ee e Further Further s simplified implified Start for 2-34 abstract syntax Expr Expr tree for 2-34 Op Expr Expr Op Int Expr 4 Expr Expr Expr Expr In Int t I Int O Op I Int Op 4 2 - 3 - Int Int 2 2 3 3Summary • Lexical and Syntactic Levels of Structure • LLi exical l – regull ar expressii ons and d aut tomat ta • Syntactic – grammars • • Grammar Grammar ambiguities ambiguities • Hacked grammars • • Abstract Abstract syntax syntax trees trees • Generation versus Recognition Approaches • • Generation Generation more more convenient convenient for for specification specification • Recognition required in implementation Handling If Then Else Start → Stat Stat Stat →→ if if Expr Expr then then Stat Stat else else Stat Stat Stat → if Expr then Stat Stat → ... Parse Trees • Consider Statement if e then if e then s else s 1 2 1 2 Stat Two Parse Trees if Stat Expr e1 if Expr then Stat else Stat e2 2 s1 1 s2 2 Stat if if Expr Expr then then Stat Stat else else Stat Stat Which is correct? e1 s2 if Expr then s1 e2 e2 Alternative Readings • Parse Tree Number 1 if if e 1 if e s 2 1 Gr Grammar ammar iis s a ambiguous mbiguous else s 2 • Parse Tree Number 2 if e 1 if e s 2 1 else s 2 Hack ed Gr ammar Goal → Stat Stat Stat →→ WithElse WithElse Stat → LastElse WithElse → if Expr then WithElse else WithElse WithElse → statements without if then or if then else LastElse → if Expr then Stat LastElse LastElse →→ if if Expr Expr then then WithElse WithElse else else LastElse LastElse Hacked Grammar • Basic Idea: control carefully where an if without an an else else can can occur occur • Either at top level of statement • • Or Or as as very very last last in in a a sequence sequence of of if if t then hen else else if if then ... statementsGrammar Vocabulary • Leftmost derivation • Al Always expand d s ll eft ft most t remaii niing nonterminal • • Similarly Similarly for for rightmost rightmost d derivation erivation • Sentential form • • Partially Partially o or r f fully ully derived derived s string tring f from rom a a step step in in valid derivation •0 + Expr p Op p Expr p •0 + Expr - 2 Defining a Language • Grammar • Generative Generative approach approach • All strings that grammar generates (How many are there for grammar in previous example?) • Automaton • Recognition approach • All strings that automaton accepts • Different flavors of grammars and automata • In general, grammars and automata correspond Regular Languages • Automaton Characterization • ( (S S,A A ,F F ,s ,s ) ) 0 F • Finite set of states S • • Finite Finite Alphabet Alphabet A A • Transition function F : S×A→ S • • Start Start s state tate s s 0 • Final states s F • • Lanuage Lanuage is is set set of of strings strings accepted accepted by by Automaton AutomatonRegular Languages • Regular Grammar Characterization • ( (TTN ,NTTSP ,S,P) ) • Finite set of Terminals T • • Finite Finite set set of of Nonterminals Nonterminals NT NT • Start Nonterminal S (goal symbol, start symbol) symbol) • Finite set of Productions P: NT→ T U NT U T NT • Language is set of strings generated by grammar Grammar and Automata Correspondence Correspondence Grammar Automaton Regular Regular rammar Finite Finite-State State A Automaton utomaton Grammar Context-Free Grammar Push-Down Automaton Context-Sensitive Grammar Turing g Machine Context-Free Grammars • Grammar Characterization • ( (TN T,NTTSP ,S,P) ) • Finite set of Terminals T • • Finite Finite set set of of Nonterminals Nonterminals NT NT • Start Nonterminal S (goal symbol, start symbol) symbol) • Finite set of Productions P: NT→ (T NT) • RHS RHS o of f p production roduction c can an have have any any sequence sequence of of terminals or nonterminals Push-Down Automata • DFA Plus a Stack • ( (S S,A A ,V V, F F,s ,s ) ) 0 F • Finite set of states S • • Finite Finite Input Input A Alphabet lphabet A A, Stack Stack A Alphabet lphabet V V • Transition relation F : S ×(A Uε)×V→ S× V • • Start Start s state tate s s 0 • Final states s F • • Each Each configuration configuration consists consists o of f a a state state, a a stack stack, and remaining input string CFG Versus PDA • CFGs and PDAs are of equivalent power • G G rammar I Impl lement tat tii on M Mech h aniism: • Translate CFG to PDA, then use PDA to parse input input string string • Foundation for bottom-up parser generators Context-Sensitive Grammars and Turing Turing Machines Machines • Context-Sensitive Grammars Allow Productions to Use Use C Context ontext • P: (T.NT)+→ (T.NT) • • Turing Turing Machines Machines Have Have • Finite State Control • • Two Two-Way Way T Tape ape Instead Instead of of A A Stack StackMIT OpenCourseWare http://ocw.mit.edu 6.035 Computer Language Engineering Spring 2010 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
Website URL
Comment