Lexical analysis and parsing

lexical analyzer and parser example and reasons for separating lexical analysis from parsing
Dr.MasonHanks Profile Pic
Published Date:23-07-2017
Your Website URL(Optional)
Lexical Analysis www.ThesisScientist.comOutline  Role of lexical analyzer  Specification of tokens  Recognition of tokens  Lexical analyzer generator  Finite automata  Design of lexical analyzer generator www.ThesisScientist.comThe role of lexical analyzer token Source To semantic Lexical Parser program analysis Analyzer getNextToken Symbol table www.ThesisScientist.comWhy to separate Lexical analysis and parsing 1. Simplicity of design 2. Improving compiler efficiency 3. Enhancing compiler portability www.ThesisScientist.comTokens, Patterns and Lexemes  A token is a pair a token name and an optional token value  A pattern is a description of the form that the lexemes of a token may take  A lexeme is a sequence of characters in the source program that matches the pattern for a token www.ThesisScientist.comExample Token Informal description Sample lexemes if if Characters i, f else Characters e, l, s, e else =, = comparison or or = or = or == or = id Letter followed by letter and digits pi, score, D2 Any numeric constant 3.14159, 0, 6.02e23 number literal Anything but “ sorrounded by “ “core dumped” printf(“total = %d\n”, score); www.ThesisScientist.comAttributes for tokens  E = M C 2  id, pointer to symbol table entry for E  assign-op  id, pointer to symbol table entry for M  mult-op  id, pointer to symbol table entry for C  exp-op  number, integer value 2 www.ThesisScientist.comLexical errors  Some errors are out of power of lexical analyzer to recognize:  fi (a == f(x)) …  However it may be able to recognize errors like:  d = 2r  Such errors are recognized when no pattern for tokens matches a character sequence www.ThesisScientist.comError recovery  Panic mode: successive characters are ignored until we reach to a well formed token  Delete one character from the remaining input  Insert a missing character into the remaining input  Replace a character by another character  Transpose two adjacent characters www.ThesisScientist.comInput buffering  Sometimes lexical analyzer needs to look ahead some symbols to decide about the token to return  In C language: we need to look after -, = or to decide what token to return  In Fortran: DO 5 I = 1.25  We need to introduce a two buffer scheme to handle large look-aheads safely E = M C 2 eof www.ThesisScientist.comSentinels eof E = M eof C 2 eof Switch (forward++) case eof: if (forward is at end of first buffer) reload second buffer; forward = beginning of second buffer; else if forward is at end of second buffer) reload first buffer;\ forward = beginning of first buffer; else / eof within a buffer marks the end of input / terminate lexical analysis; break; cases for the other characters; www.ThesisScientist.com Specification of tokens  In theory of compilation regular expressions are used to formalize the specification of tokens  Regular expressions are means for specifying regular languages  Example:  Letter_(letter_ digit)  Each regular expression is a pattern specifying the form of strings www.ThesisScientist.comRegular expressions Ɛ is a regular expression, L(Ɛ) = Ɛ  If a is a symbol in ∑then a is a regular expression, L(a) = a  (r) (s) is a regular expression denoting the language L(r) ∪ L(s)  (r)(s) is a regular expression denoting the language L(r)L(s)  (r) is a regular expression denoting (L9r))  (r) is a regular expression denting L(r) www.ThesisScientist.comRegular definitions d1 - r1 d2 - r2 … dn - rn  Example: letter_ - A B … Z a b … Z _ digit - 0 1 … 9 id - letter_ (letter_ digit) www.ThesisScientist.comExtensions  One or more instances: (r)+  Zero of one instances: r?  Character classes: abc  Example:  letter_ - A-Za-z_  digit - 0-9  id - letter_(letterdigit) www.ThesisScientist.comRecognition of tokens  Starting point is the language grammar to understand the tokens: stmt - if expr then stmt if expr then stmt else stmt Ɛ expr - term relop term term term - id number www.ThesisScientist.comRecognition of tokens (cont.)  The next step is to formalize the patterns: digit - 0-9 Digits - digit+ number - digit(.digits)? (E+-? Digit)? letter - A-Za-z_ id - letter (letterdigit) If - if Then - then Else - else Relop - = = =  We also need to handle whitespaces: ws - (blank tab newline)+ www.ThesisScientist.comTransition diagrams  Transition diagram for relop www.ThesisScientist.comTransition diagrams (cont.)  Transition diagram for reserved words and identifiers www.ThesisScientist.comTransition diagrams (cont.)  Transition diagram for unsigned numbers www.ThesisScientist.com