Lecture notes on Regular languages and finite Automata

regular languages and finite automata theory of computation and regular languages and finite automata exercises pdf free downlaod
FrankRoberts Profile Pic
Published Date:11-07-2017
Your Website URL(Optional)
N Lecture Notes on Regular Languages and Finite Automata for Part IA of the Computer Science Tripos Marcelo Fiore Cambridge University Computer Laboratoryii Learning Guide The notes are designed to accompany six lectures on regular languages and finite automata for Part IA of the Cambridge University Computer Science Tripos. The aim of this short course will be to introduce the mathematical formalisms of finite state machines, regular expressions and grammars, and to explain their applications to computer languages. As such, it covers some basic theoretical material which Every Computer Scientist Should Know. Direct applications of the course material occur in the various CST courses on compilers. Further and related developments will be found in the CST Part IB courses Computation Theory and Semantics of Programming Languages and the CST Part II course Topics in Concurrency. This course contains the kind of material that is best learned through practice. The books mentioned below contain a large number of problems of varying degrees of difficulty, and some contain solutions to selected problems. A few exercises are given at the end of each section of these notes and relevant past Tripos questions are indicated there. At the end of the course students should be able to explain how to convert between the three ways of representing regular sets of strings introduced in the course; and be able to carry out such conversions by hand for simple cases. They should also be able to prove whether or not a given set of strings is regular. Recommended books Textbooks which cover the material in this course also tend to cover the material you will meet in the CST Part IB courses on Computation Theory and Complexity Theory, and the theory underlying parsing in various courses on compilers. There is a large number of such books. Three recommended ones are listed below. • J. E. Hopcroft, R. Motwani and J. D. Ullman, Introduction to Automata Theory, Languages, and Computation, Second Edition (Addison-Wesley, 2001). • D. C. Kozen, Automata and Computability (Springer-Verlag, New York, 1997). • T. A. Sudkamp, Languages and Machines (Addison-Wesley Publishing Company, Inc., 1988). Note The material in these notes has been drawn from several different sources, including the books mentioned above and previous versions of this course by the author and by others. Any errors are of course all the author’s own work. A list of corrections will be available from the course web page. Please take time to fill out the on-line lecture feedback form. Marcelo Fiore Marcelo.Fiorecl.cam.ac.uk1 1 Regular Expressions Doubtless you have used pattern matching in the command-line shells of various operating systems (Slide 1) and in the search facilities of text editors. Another important example of the same kind is the ‘lexical analysis’ phase in a compiler during which the text of a program is divided up into the allowed tokens of the programming language. The algorithms which implement such pattern-matching operations make use of the notion of a finite automaton (which is Greeklish for finite state machine). This course reveals (some of) the beautiful theory of finite automata (yes, that is the plural of ‘automaton’) and their use for recognising when a particular string matches a particular pattern. Pattern matching What happens if, at a Unix/Linux shell prompt, you type ls∗ and press return? Suppose the current directory contains files calledregfla.tex, regfla.aux,regfla.log,regfla.dvi, and (strangely).aux. What happens if you type ls ∗.aux and press return? Slide 1 1.1 Alphabets, strings, and languages The purpose of Section 1 is to introduce a particular language for patterns, called regular expressions, and to formulate some important problems to do with pattern-matching which will be solved in the subsequent sections. But first, here is some notation and terminology to do with character strings that we will be using throughout the course.2 1 REGULAR EXPRESSIONS Alphabets An alphabet is specified by giving a finite set,Σ, whose elements are called symbols. For us, any set qualifies as a possible alphabet, so long as it is finite. Examples: Σ =0,1,2,3,4,5,6,7,8,9 —10-element set of decimal digits. 1 Σ =a,b,c,...,x,y,z —26-element set of lower-case characters of the 2 English language. 10 Σ =S S ⊆ Σ —2 -element set of all subsets of the alphabet of 3 1 decimal digits. Non-example: N =0,1,2,3,... — set of all non-negative whole numbers is not an alphabet, because it is infinite. Slide 2 Strings over an alphabet A string of lengthn (≥ 0) over an alphabetΣ is just an orderedn-tuple of elements ofΣ, written without punctuation. Example: ifΣ =a,b,c, thena,ab,aac, andbbac are strings overΣ of lengths one, two, three and four respectively. def ∗ Σ = set of all strings overΣ of any finite length. N.B. there is a unique string of length zero overΣ, called the null string (or empty string) and denoted ε (no matter whichΣ we are talking about). Slide 31.1 Alphabets, strings, and languages 3 Concatenation of strings ∗ The concatenation of two stringsu,v ∈ Σ is the stringuv obtained by joining the strings end-to-end. Examples: Ifu =ab,v =ra andw =cad, thenvu =raab,uu =abab andwv =cadra. This generalises to the concatenation of three or more strings. E.g.uvwuv = abracadabra. Slide 4 ∗ Slides 2 and 3 define the notions of an alphabetΣ, and the setΣ of finite strings over an alphabet. The length of a stringu will be denoted bylength(u). Slide 4 defines the operation of concatenation of strings. We make no notational distinction between a symbola∈ Σ and ∗ the corresponding string of length one overΣ: soΣ can be regarded as a subset ofΣ . Note ∗ thatΣ is never empty—it always contains the null string,ε, the unique string of length zero. ∗ Note also that for anyu,v,w∈ Σ uε =u = εu and (uv)w =uvw =u(vw) andlength(uv) = length(u)+length(v). ∗ Example 1.1.1. Examples ofΣ for differentΣ: ∗ (i) IfΣ =a, thenΣ contains ε,a,aa,aaa,aaaa,... ∗ (ii) IfΣ =a,b, thenΣ contains ε,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,... ∗ (iii) If Σ = ∅ (the empty set — the unique set with no elements), then Σ = ε, the set just containing the null string.4 1 REGULAR EXPRESSIONS 1.2 Pattern matching Slide 5 defines the patterns, or regular expressions, over an alphabet Σ that we will use. Each such regular expression, r, represents a whole set (possibly an infinite set) of strings ∗ in Σ that match r. The precise definition of this matching relation is given on Slide 6. It might seem odd to include a regular expression∅ that is matched by no strings at all—but it is technically convenient to do so. Note that the regular expressionε is in fact equivalent to ∗ ∗ ∅ , in the sense that a stringu matches∅ iff it matchesε (iffu =ε). Regular expressions over an alphabetΣ • each symbola∈ Σ is a regular expression • ε is a regular expression • ∅ is a regular expression • ifr ands are regular expressions, then so is(rs) • ifr ands are regular expressions, then so isrs ∗ • ifr is a regular expression, then so is(r) Every regular expression is built up inductively, by finitely many applications of the above rules. ∗ (N.B. we assumeε,∅,(,),, and are not symbols inΣ.) Slide 5 Remark 1.2.1 (Binding precedence in regular expressions). In the definition on Slide 5 we assume implicitly that the alphabetΣ does not contain the six symbols ∗ ε ∅ ( ) Then, concretely speaking, the regular expressions over Σ form a certain set of strings over the alphabet obtained by adding these six symbols to Σ. However it makes things more readable if we adopt a slightly more abstract syntax, dropping as many brackets as possible and using the convention that ∗ − binds more tightly than−−, binds more tightly than−−. ∗ ∗ ∗ ∗ So, for example,rst means(rs(t) ), not(rs)(t) , or((rst)) , etc.1.2 Pattern matching 5 Matching strings to regular expressions • u matchesa∈ Σ iffu =a • u matchesε iffu =ε • no string matches∅ • u matchesrs iffu matches eitherr ors • u matchesrs iff it can be expressed as the concatenation of two strings,u =vw, withv matchingr andw matchings ∗ • u matchesr iff eitheru =ε, oru matchesr, oru can be expressed as the concatenation of two or more strings, each of which matchesr Slide 6 ∗ The definition of ‘u matchesr ’ on Slide 6 is equivalent to saying for some n ≥ 0, u can be expressed as a concatenation of n strings, u = u u ...u , where eachu matchesr. 1 2 n i ∗ The casen = 0 just means thatu =ε (soε always matchesr ); and the casen = 1 just means ∗ that u matches r (so any string matchingr also matches r ). For example, if Σ = a,b,c ∗ andr =ab, then the strings matchingr are ε,ab,abab,ababab, etc. Note that we didn’t include a regular expression for the ‘∗’ occurring in the UNIX examples on Slide 1. However, once we know which alphabet we are referring to, Σ = a ,a ,...,a say, we can get the effect of∗ using the regular expression 1 2 n ∗ (a a ...a ) 1 2 n ∗ which is indeed matched by any string inΣ (becausea a ...a is matched by any symbol 1 2 n inΣ).6 1 REGULAR EXPRESSIONS Examples of matching, withΣ =0,1 • 01 is matched by each symbol inΣ ∗ ∗ • 1(01) is matched by any string inΣ that starts with a ‘1’ ∗ ∗ • ((01)(01)) is matched by any string of even length inΣ ∗ ∗ ∗ • (01) (01) is matched by any string inΣ • (ε0)(ε1)11 is matched by just the stringsε,0,1,01, and11 • ∅10 is just matched by0 Slide 7 Notation 1.2.2. The notationr+s is quite often used for what we write asrs. n The notation r , for n ≥ 0, is an abbreviation for the regular expression obtained by concatenatingn copies ofr. Thus: ( def 0 r = ε def n+1 n r = r(r ). ∗ n Thusu matchesr iffu matchesr for somen≥ 0. + ∗ + We user as an abbreviation forrr . Thusu matchesr iff it can be expressed as the concatenation of one or more strings, each one matchingr. 1.3 Some questions about languages Slide 8 defines the notion of a formal language over an alphabet. We take a very extensional view of language: a formal language is completely determined by the ‘words in the dictionary’, rather than by any grammatical rules. Slide 9 gives some important questions about languages, regular expressions, and the matching relation between strings and regular expressions.1.3 Some questions about languages 7 Languages ∗ A (formal) languageL over an alphabetΣ is just a set of strings inΣ . ∗ Thus any subsetL⊆ Σ determines a language overΣ. The language determined by a regular expressionr overΣ is def ∗ L(r) = u∈ Σ u matchesr. Two regular expressionsr ands (over the same alphabet) are equivalent iffL(r) andL(s) are equal sets (i.e. have exactly the same members). Slide 8 Some questions (a) Is there an algorithm which, given a stringu and a regular expression r (over the same alphabet), computes whether or notu matchesr? (b) In formulating the definition of regular expressions, have we missed out some practically useful notions of pattern? (c) Is there an algorithm which, given two regular expressionsr ands (over the same alphabet), computes whether or not they are equivalent? (Cf. Slide 8.) (d) Is every language of the formL(r)? Slide 98 1 REGULAR EXPRESSIONS The answer to question (a) on Slide 9 is ‘yes’. Algorithms for deciding such pattern- matching questions make use of finite automata. We will see this during the next few sections. If you have used the UNIX utility grep, or a text editor with good facilities for regular expression based search, likeemacs, you will know that the answer to question (b) on Slide 9 is also ‘yes’—the regular expressions defined on Slide 5 leave out some forms of pattern that one sees in such applications. However, the answer to the question is also ‘no’, in the sense that (for a fixed alphabet) these extra forms of regular expression are definable, up to equivalence, from the basic forms given on Slide 5. For example, if the symbols of the alphabet are ordered in some standard way, it is common to provide a form of pattern for naming ranges of symbols—for examplea−z might denote a pattern matching any lower- case letter. It is not hard to see how to define a regular expression (albeit a rather long one) which achieves the same effect. However, some other commonly occurring kinds of pattern are much harder to describe using the rather minimalist syntax of Slide 5. The principal example is complementation,∼(r): u matches∼(r) iff u does not matchr. It will be a corollary of the work we do on finite automata (and a good measure of its power) that every pattern making use of the complementation operation ∼(−) can be replaced by an equivalent regular expression just making use of the operations on Slide 5. But why do we stick to the minimalist syntax of regular expressions on that slide? The answer is that it reduces the amount of work we will have to do to show that, in principle, matching strings against patterns can be decided via the use of finite automata. The answer to question (c) on Slide 9 is ‘yes’ and once again this will be a corollary of the work we do on finite automata. (See Section 5.3.) Finally, the answer to question (d) is easily seen to be ‘no’, provided the alphabet Σ ∗ contains at least one symbol. For in that caseΣ is countably infinite; and hence the number of ∗ languages overΣ, i.e. the number of subsets ofΣ is uncountable. (Recall Cantor’s diagonal argument.) But since Σ is a finite set, there are only countably many regular expressions overΣ. (Why?) So the answer to (d) is ‘no’ for cardinality reasons. However, even amongst the countably many languages that are ‘finitely describable’ (an intuitive notion that we will not formulate precisely) many are not of the form L(r) for any regular expression r. For example, in Section 5.2 we will use the ‘Pumping Lemma’ to see that n n a b n≥ 0 is not of this form. 1.4 Exercises Exercise 1.4.1. Write down an ML data type declaration for a type constructor’a regExp whose values correspond to the regular expressions over an alphabet’a. Exercise 1.4.2. Find regular expressions over0,1 that determine the following languages: (a) uu contains an even number of1’s1.4 Exercises 9 (b) uu contains an odd number of0’s ∗ Exercise 1.4.3. For which alphabets Σ is the set Σ of all finite strings over Σ itself an alphabet? Tripos questions 2005.2.1(d) 1999.2.1(s) 1997.2.1(q) 1996.2.1(i) 1993.5.1210 1 REGULAR EXPRESSIONS11 2 Finite State Machines We will be making use of mathematical models of physical systems called finite automata, or finite state machines to recognise whether or not a string is in a particular language. This section introduces this idea and gives the precise definition of what constitutes a finite automaton. We look at several variations on the definition (to do with the concept of determinism) and see that they are equivalent for the purpose of recognising whether or not a string is in a given language. 2.1 Finite automata Example of a finite automaton a b a a a q q q q 0 1 2 3 b b b States:q ,q ,q ,q . 0 1 2 3 Input symbols:a,b. Transitions: as indicated above. Start state:q . 0 Accepting state(s):q . 3 Slide 10 The key features of this abstract notion of ‘machine’ are listed below and are illustrated by the example on Slide 10. • There are only finitely many different states that a finite automaton can be in. In the example there are four states, labelledq ,q ,q , andq . 0 1 2 3 • We do not care at all about the internal structure of machine states. All we care about is which transitions the machine can make between the states. A symbol from some fixed alphabet Σ is associated with each transition: we think of the elements of Σ as input symbols. Thus all the possible transitions of the finite automaton can be specified by giving a finite graph whose vertices are the states and whose edges have12 2 FINITE STATE MACHINES both a direction and a label (drawn from Σ). In the exampleΣ =a,b and the only possible transitions from stateq are 1 b a q − → q and q − →q . 1 0 1 2 In other words, in state q the machine can either input the symbol b and enter state 1 q , or it can input the symbola and enter state q . (Note that transitions from a state 0 2 a back to the same state are allowed: e.g.q − →q in the example.) 3 3 1 • There is a distinguished start state. In the example it is q . In the graphical 0 representation of a finite automaton, the start state is usually indicated by means of a unlabelled arrow. 2 • The states are partitioned into two kinds: accepting states and non-accepting states. In the graphical representation of a finite automaton, the accepting states are indicated by double circles round the name of each such state, and the non-accepting states are indicated using single circles. In the example there is only one accepting state,q ; the 3 other three states are non-accepting. (The two extreme possibilities that all states are accepting, or that no states are accepting, are allowed; it is also allowed for the start state to be accepting.) The reason for the partitioning of the states of a finite automaton into ‘accepting’ and ‘non-accepting’ has to do with the use to which one puts finite automata—namely to recognise ∗ ∗ whether or not a string u ∈ Σ is in a particular language (= subset of Σ ). Given u we begin in the start state of the automaton and traverse its graph of transitions, using up the symbols inu in the correct order reading the string from left to right. If we can use up all the symbols in u in this way and reach an accepting state, then u is in the language ‘accepted’ (or ‘recognised’) by this particular automaton; otherwise u is not in that language. This is summed up on Slide 11. 1 The term initial state is a common synonym for ‘start state’. 2 The term final state is a common synonym for ‘accepting state’.2.1 Finite automata 13 L(M), language accepted by a finite automatonM consists of all stringsu over its alphabet of input symbols satisfying u ∗ q − → q withq the start state andq some accepting state. Here 0 0 u ∗ q − → q 0 means, ifu =a a ...a say, that for some statesq ,q ,...,q =q 1 2 n 1 2 n (not necessarily all distinct) there are transitions of the form a a a a 1 2 3 n q −→q −→q −→···−→q = q. 0 1 2 n N.B. ε ∗ ′ ′ casen = 0: q− → q iff q =q a a ∗ ′ ′ casen = 1: q− → q iff q− → q . Slide 11 Example 2.1.1. Let M be the finite automaton pictured on Slide 10. Using the notation introduced on Slide 11 we have: aaab ∗ q −−−→ q (soaaab∈ L(M)) 0 3 abaa ∗ q −−−→ q iff q =q (soabaa∈ / L(M)) 0 2 baaa ∗ q −−−→ q iff q =q (no conclusion aboutL(M)). 2 3 In fact in this case L(M) =u u contains three consecutivea’s. (For q (i = 0,1,2) corresponds to the state in the process of reading a string in which the i lasti symbols read were alla’s.) SoL(M) coincides with the languageL(r) determined by the regular expression ∗ ∗ r = (ab) aaa(ab) (cf. Slide 8).14 2 FINITE STATE MACHINES A non-deterministic finite automaton (NFA),M, is specified by • a finite setStates (of states) M • a finite setΣ (the alphabet of input symbols) M • for eachq∈ States and eacha∈ Σ , a subset M M Δ (q,a)⊆ States (the set of states that can be reached from M M q with a single transition labelleda) • an elements ∈ States (the start state) M M • a subsetAccept ⊆ States (of accepting states) M M Slide 12 2.2 Determinism, non-determinism, and ε-transitions Slide 12 gives the formal definition of the notion of finite automaton. Note that the function a ′ ′ Δ gives a precise way of specifying the allowed transitions ofM, via: q − → q iff q ∈ M Δ (q,a). M The reason for the qualification ‘non-deterministic’ on Slide 12 is because in general, for each state q ∈ States and each input symbola ∈ Σ , we allow the possibilities that M M there are no, one, or many states that can be reached in a single transition labelleda fromq, corresponding to the cases thatΔ (q,a) has no, one, or many elements. For example, ifM M is the NFA pictured on Slide 13, then Δ (q ,b) =∅ i.e. inM, no state can be reached fromq with a transition labelledb; M 1 1 Δ (q ,a) =q i.e. in M, precisely one state can be reached from q with a transition M 1 2 1 labelleda; Δ (q ,a) =q ,q i.e. in M, precisely two states can be reached from q with a M 0 0 1 0 transition labelleda.2.2 Determinism, non-determinism, andε-transitions 15 Example of a non-deterministic finite automaton Input alphabet:a,b. States, transitions, start state, and accepting states as shown: a a a a a q q q q 0 1 2 3 b b The language accepted by this automaton is the same as for the automaton on Slide 10, namely ∗ u∈a,b u contains three consecutivea’s. Slide 13 When each subset Δ (q,a) has exactly one element we say that M is deterministic. M This is a particularly important case and is singled out for definition on Slide 14. The finite automaton pictured on Slide 10 is deterministic. But note that if we took the same graph of transitions but insisted that the alphabet of input symbols was a,b,c say, then we have specified an NFA not a DFA, since for exampleΔ (q ,c) =∅. The moral of M 0 this is: when specifying an NFA, as well as giving the graph of state transitions, it is important to say what is the alphabet of input symbols (because some input symbols may not appear in the graph at all). When constructing machines for matching strings with regular expressions (as we will do in Section 3) it is useful to consider finite state machines exhibiting an ‘internal’ form of non-determinism in which the machine is allowed to change state without consuming any input symbol. One calls such transitionsε-transitions and writes them as ε ′ q− →q . ε This leads to the definition on Slide 15. Note that in an NFA , M, we always assume thatε is not an element of the alphabetΣ of input symbols. M16 2 FINITE STATE MACHINES A deterministic finite automaton (DFA) is an NFAM with the property that for eachq∈ States and M a∈ Σ , the finite setΔ (q,a) contains exactly one element—call it M M δ (q,a). M Thus in this case transitions inM are essentially specified by a next-state function,δ , mapping each (state, input symbol)-pair(q,a) M to the unique stateδ (q,a) which can be reached fromq by a transition M labelleda: a ′ ′ q− → q iff q = δ (q,a) M Slide 14 ε An NFA withε-transitions (NFA ) is specified by an NFAM together with a binary relation, called the ε-transition relation, on the setStates . We write M ε ′ q− →q ′ to indicate that the pair of states(q,q ) is in this relation. Example (with input alphabet =a,b): a a q q q a 1 2 3 a ε ε q q 0 7 ε ε q q q b 4 5 6 b b b Slide 152.3 A subset construction 17 ε L(M), language accepted by an NFA M consists of all stringsu over the alphabetΣ of input symbols satisfying M u − q ⇒q withq the initial state andq some accepting state. Here·⇒· 0 0 is defined by: ε ε ′ ′ ′ q⇒q iffq =q or there is a sequenceq− →···q of one or more ′ ε-transitions inM fromq toq a ε a ε ′ ′ q⇒q (fora∈ Σ ) iffq⇒·− →·⇒q M ab ε a ε b ε ′ ′ q⇒q (fora,b∈ Σ ) iffq⇒·− →·⇒·− →·⇒ q M and similarly for longer strings Slide 16 ε ∗ When using an NFA M to accept a stringu∈ Σ of input symbols, we are interested in sequences of transitions in which the symbols in u occur in the correct order, but with zero or moreε-transitions before or after each one. We write u ′ q⇒q ′ ε to indicate that such a sequence exists from stateq to stateq in the NFA . Then, by definition u ε u is accepted by the NFA M iffq ⇒q holds forq the start state andq some accepting state: 0 0 ε see Slide 16. For example, for the NFA on Slide 15, it is not too hard to see that the language accepted consists of all strings which either contain two consecutive a’s or contain two ∗ ∗ consecutiveb’s, i.e. the language determined by the regular expression(ab) (aabb)(ab) . 2.3 A subset construction Note that every DFA is an NFA (whose transition relation is deterministic) and that every ε NFA is an NFA (whoseε-transition relation is empty). It might seem that non-determinism andε-transitions allow a greater range of languages to be characterised as recognisable by a finite automaton, but this is not so. We can use a construction, called the subset construction, ε to convert an NFA M into a DFA PM accepting the same language (at the expense of increasing the number of states, possibly exponentially). Slide 17 gives an example of this construction. The name ‘subset construction’ refers to the fact that there is one state ofPM ′ for each subset of the setStates of states ofM. Given two subsetsS,S ⊆ States , there M M a ′ ′ ′ is a transitionS − → S inPM just in case S consists of all the M-statesq reachable from18 2 FINITE STATE MACHINES a states q in S via the· ⇒ · relation defined on Slide 16, i.e. such that we can get from q to ′ q inM via finitely manyε-transitions followed by ana-transition followed by finitely many ε-transitions. Example of the subset construction M : δ : a b PM a ∅ ∅ ∅ q q ,q ,q q q 0 0 1 2 2 1 q q ∅ 1 1 ε q ∅ q 2 2 q a 0 q ,q q ,q ,q q 0 1 0 1 2 2 q ,q q ,q ,q q ε 0 2 0 1 2 2 q ,q q q 1 2 1 2 q 2 q ,q ,q q ,q ,q q 0 1 2 0 1 2 2 b Slide 17 By definition, the start state of PM is the subset of States whose elements are the M states reachable by ε-transitions from the start state of M; and a subsetS ⊆ States is an M accepting state ofPM iff some accepting state ofM is an element ofS. Thus in the example on Slide 17 the start state isq ,q ,q and 0 1 2 a ε b ab∈ L(M) because inM: q − → q − →q − →q 0 0 2 2 a b ab∈ L(PM) because inPM: q ,q ,q − →q ,q ,q − →q . 0 1 2 0 1 2 2 ∗ ∗ Indeed, in this caseL(M) =L(a b ) =L(PM). The fact thatM andPM accept the same language in this case is no accident, as the Theorem on Slide 18 shows. That slide also gives the definition of the subset construction in general.