Lecture notes on Regular languages and finite Automata
regular languages and finite automata theory of computation and regular languages and finite automata exercises pdf free downlaod
Lecture Notes on
and Finite Automata
for Part IA of the Computer Science Tripos
Cambridge University Computer Laboratoryii
The notes are designed to accompany six lectures on regular languages and ﬁnite automata
for Part IA of the Cambridge University Computer Science Tripos. The aim of this short
course will be to introduce the mathematical formalisms of ﬁnite state machines, regular
expressions and grammars, and to explain their applications to computer languages. As such,
it covers some basic theoretical material which Every Computer Scientist Should Know.
Direct applications of the course material occur in the various CST courses on compilers.
Further and related developments will be found in the CST Part IB courses Computation
Theory and Semantics of Programming Languages and the CST Part II course Topics in
This course contains the kind of material that is best learned through practice. The books
mentioned below contain a large number of problems of varying degrees of difﬁculty, and
some contain solutions to selected problems. A few exercises are given at the end of each
section of these notes and relevant past Tripos questions are indicated there. At the end
of the course students should be able to explain how to convert between the three ways of
representing regular sets of strings introduced in the course; and be able to carry out such
conversions by hand for simple cases. They should also be able to prove whether or not a
given set of strings is regular.
Recommended books Textbooks which cover the material in this course also tend to
cover the material you will meet in the CST Part IB courses on Computation Theory and
Complexity Theory, and the theory underlying parsing in various courses on compilers.
There is a large number of such books. Three recommended ones are listed below.
• J. E. Hopcroft, R. Motwani and J. D. Ullman, Introduction to Automata Theory,
Languages, and Computation, Second Edition (Addison-Wesley, 2001).
• D. C. Kozen, Automata and Computability (Springer-Verlag, New York, 1997).
• T. A. Sudkamp, Languages and Machines (Addison-Wesley Publishing Company,
Note The material in these notes has been drawn from several different sources, including
the books mentioned above and previous versions of this course by the author and by others.
Any errors are of course all the author’s own work. A list of corrections will be available
from the course web page. Please take time to ﬁll out the on-line lecture feedback form.
1 Regular Expressions
Doubtless you have used pattern matching in the command-line shells of various operating
systems (Slide 1) and in the search facilities of text editors. Another important example of
the same kind is the ‘lexical analysis’ phase in a compiler during which the text of a program
is divided up into the allowed tokens of the programming language. The algorithms which
implement such pattern-matching operations make use of the notion of a ﬁnite automaton
(which is Greeklish for ﬁnite state machine). This course reveals (some of) the beautiful
theory of ﬁnite automata (yes, that is the plural of ‘automaton’) and their use for recognising
when a particular string matches a particular pattern.
What happens if, at a Unix/Linux shell prompt, you type
and press return?
Suppose the current directory contains ﬁles calledregfla.tex,
regfla.aux,regfla.log,regfla.dvi, and (strangely).aux. What
happens if you type
and press return?
1.1 Alphabets, strings, and languages
The purpose of Section 1 is to introduce a particular language for patterns, called regular
expressions, and to formulate some important problems to do with pattern-matching which
will be solved in the subsequent sections. But ﬁrst, here is some notation and terminology to
do with character strings that we will be using throughout the course.2 1 REGULAR EXPRESSIONS
An alphabet is speciﬁed by giving a ﬁnite set,Σ, whose elements are
called symbols. For us, any set qualiﬁes as a possible alphabet, so long
as it is ﬁnite.
Σ =0,1,2,3,4,5,6,7,8,9 —10-element set of decimal digits.
Σ =a,b,c,...,x,y,z —26-element set of lower-case characters of the
Σ =S S ⊆ Σ —2 -element set of all subsets of the alphabet of
N =0,1,2,3,... — set of all non-negative whole numbers is not an
alphabet, because it is inﬁnite.
Strings over an alphabet
A string of lengthn (≥ 0) over an alphabetΣ is just an orderedn-tuple
of elements ofΣ, written without punctuation.
Example: ifΣ =a,b,c, thena,ab,aac, andbbac are strings overΣ of
lengths one, two, three and four respectively.
Σ = set of all strings overΣ of any ﬁnite length.
N.B. there is a unique string of length zero overΣ, called the null string
(or empty string) and denoted ε (no matter whichΣ we are talking
Slide 31.1 Alphabets, strings, and languages 3
Concatenation of strings
The concatenation of two stringsu,v ∈ Σ is the stringuv obtained by
joining the strings end-to-end.
Examples: Ifu =ab,v =ra andw =cad, thenvu =raab,uu =abab
This generalises to the concatenation of three or more strings.
E.g.uvwuv = abracadabra.
Slides 2 and 3 deﬁne the notions of an alphabetΣ, and the setΣ of ﬁnite strings over an
alphabet. The length of a stringu will be denoted bylength(u). Slide 4 deﬁnes the operation
of concatenation of strings. We make no notational distinction between a symbola∈ Σ and
the corresponding string of length one overΣ: soΣ can be regarded as a subset ofΣ . Note
thatΣ is never empty—it always contains the null string,ε, the unique string of length zero.
Note also that for anyu,v,w∈ Σ
uε =u = εu and (uv)w =uvw =u(vw)
andlength(uv) = length(u)+length(v).
Example 1.1.1. Examples ofΣ for differentΣ:
(i) IfΣ =a, thenΣ contains
(ii) IfΣ =a,b, thenΣ contains
(iii) If Σ = ∅ (the empty set — the unique set with no elements), then Σ = ε, the set just
containing the null string.4 1 REGULAR EXPRESSIONS
1.2 Pattern matching
Slide 5 deﬁnes the patterns, or regular expressions, over an alphabet Σ that we will use.
Each such regular expression, r, represents a whole set (possibly an inﬁnite set) of strings
in Σ that match r. The precise deﬁnition of this matching relation is given on Slide 6. It
might seem odd to include a regular expression∅ that is matched by no strings at all—but it
is technically convenient to do so. Note that the regular expressionε is in fact equivalent to
∅ , in the sense that a stringu matches∅ iff it matchesε (iffu =ε).
Regular expressions over an alphabetΣ
• each symbola∈ Σ is a regular expression
• ε is a regular expression
• ∅ is a regular expression
• ifr ands are regular expressions, then so is(rs)
• ifr ands are regular expressions, then so isrs
• ifr is a regular expression, then so is(r)
Every regular expression is built up inductively, by ﬁnitely many
applications of the above rules.
(N.B. we assumeε,∅,(,),, and are not symbols inΣ.)
Remark 1.2.1 (Binding precedence in regular expressions). In the deﬁnition on Slide 5 we
assume implicitly that the alphabetΣ does not contain the six symbols
ε ∅ ( )
Then, concretely speaking, the regular expressions over Σ form a certain set of strings over
the alphabet obtained by adding these six symbols to Σ. However it makes things more
readable if we adopt a slightly more abstract syntax, dropping as many brackets as possible
and using the convention that
− binds more tightly than−−, binds more tightly than−−.
∗ ∗ ∗ ∗
So, for example,rst means(rs(t) ), not(rs)(t) , or((rst)) , etc.1.2 Pattern matching 5
Matching strings to regular expressions
• u matchesa∈ Σ iffu =a
• u matchesε iffu =ε
• no string matches∅
• u matchesrs iffu matches eitherr ors
• u matchesrs iff it can be expressed as the concatenation of two
strings,u =vw, withv matchingr andw matchings
• u matchesr iff eitheru =ε, oru matchesr, oru can be
expressed as the concatenation of two or more strings, each of which
The deﬁnition of ‘u matchesr ’ on Slide 6 is equivalent to saying
for some n ≥ 0, u can be expressed as a concatenation of n strings, u =
u u ...u , where eachu matchesr.
1 2 n i
The casen = 0 just means thatu =ε (soε always matchesr ); and the casen = 1 just means
that u matches r (so any string matchingr also matches r ). For example, if Σ = a,b,c
andr =ab, then the strings matchingr are
Note that we didn’t include a regular expression for the ‘∗’ occurring in the UNIX
examples on Slide 1. However, once we know which alphabet we are referring to, Σ =
a ,a ,...,a say, we can get the effect of∗ using the regular expression
1 2 n
(a a ...a )
1 2 n
which is indeed matched by any string inΣ (becausea a ...a is matched by any symbol
1 2 n
inΣ).6 1 REGULAR EXPRESSIONS
Examples of matching, withΣ =0,1
• 01 is matched by each symbol inΣ
• 1(01) is matched by any string inΣ that starts with a ‘1’
• ((01)(01)) is matched by any string of even length inΣ
∗ ∗ ∗
• (01) (01) is matched by any string inΣ
• (ε0)(ε1)11 is matched by just the stringsε,0,1,01, and11
• ∅10 is just matched by0
Notation 1.2.2. The notationr+s is quite often used for what we write asrs.
The notation r , for n ≥ 0, is an abbreviation for the regular expression obtained by
concatenatingn copies ofr. Thus:
r = ε
r = r(r ).
Thusu matchesr iffu matchesr for somen≥ 0.
+ ∗ +
We user as an abbreviation forrr . Thusu matchesr iff it can be expressed as the
concatenation of one or more strings, each one matchingr.
1.3 Some questions about languages
Slide 8 deﬁnes the notion of a formal language over an alphabet. We take a very extensional
view of language: a formal language is completely determined by the ‘words in the
dictionary’, rather than by any grammatical rules. Slide 9 gives some important questions
about languages, regular expressions, and the matching relation between strings and regular
expressions.1.3 Some questions about languages 7
A (formal) languageL over an alphabetΣ is just a set of strings inΣ .
Thus any subsetL⊆ Σ determines a language overΣ.
The language determined by a regular expressionr overΣ is
L(r) = u∈ Σ u matchesr.
Two regular expressionsr ands (over the same alphabet) are
equivalent iffL(r) andL(s) are equal sets (i.e. have exactly the same
(a) Is there an algorithm which, given a stringu and a regular expression
r (over the same alphabet), computes whether or notu matchesr?
(b) In formulating the deﬁnition of regular expressions, have we missed
out some practically useful notions of pattern?
(c) Is there an algorithm which, given two regular expressionsr ands
(over the same alphabet), computes whether or not they are
equivalent? (Cf. Slide 8.)
(d) Is every language of the formL(r)?
Slide 98 1 REGULAR EXPRESSIONS
The answer to question (a) on Slide 9 is ‘yes’. Algorithms for deciding such pattern-
matching questions make use of ﬁnite automata. We will see this during the next few sections.
If you have used the UNIX utility grep, or a text editor with good facilities for regular
expression based search, likeemacs, you will know that the answer to question (b) on Slide 9
is also ‘yes’—the regular expressions deﬁned on Slide 5 leave out some forms of pattern
that one sees in such applications. However, the answer to the question is also ‘no’, in the
sense that (for a ﬁxed alphabet) these extra forms of regular expression are deﬁnable, up
to equivalence, from the basic forms given on Slide 5. For example, if the symbols of the
alphabet are ordered in some standard way, it is common to provide a form of pattern for
naming ranges of symbols—for examplea−z might denote a pattern matching any lower-
case letter. It is not hard to see how to deﬁne a regular expression (albeit a rather long one)
which achieves the same effect. However, some other commonly occurring kinds of pattern
are much harder to describe using the rather minimalist syntax of Slide 5. The principal
example is complementation,∼(r):
u matches∼(r) iff u does not matchr.
It will be a corollary of the work we do on ﬁnite automata (and a good measure of its power)
that every pattern making use of the complementation operation ∼(−) can be replaced by
an equivalent regular expression just making use of the operations on Slide 5. But why do
we stick to the minimalist syntax of regular expressions on that slide? The answer is that it
reduces the amount of work we will have to do to show that, in principle, matching strings
against patterns can be decided via the use of ﬁnite automata.
The answer to question (c) on Slide 9 is ‘yes’ and once again this will be a corollary of
the work we do on ﬁnite automata. (See Section 5.3.)
Finally, the answer to question (d) is easily seen to be ‘no’, provided the alphabet Σ
contains at least one symbol. For in that caseΣ is countably inﬁnite; and hence the number of
languages overΣ, i.e. the number of subsets ofΣ is uncountable. (Recall Cantor’s diagonal
argument.) But since Σ is a ﬁnite set, there are only countably many regular expressions
overΣ. (Why?) So the answer to (d) is ‘no’ for cardinality reasons. However, even amongst
the countably many languages that are ‘ﬁnitely describable’ (an intuitive notion that we will
not formulate precisely) many are not of the form L(r) for any regular expression r. For
example, in Section 5.2 we will use the ‘Pumping Lemma’ to see that
a b n≥ 0
is not of this form.
Exercise 1.4.1. Write down an ML data type declaration for a type constructor’a regExp
whose values correspond to the regular expressions over an alphabet’a.
Exercise 1.4.2. Find regular expressions over0,1 that determine the following languages:
(a) uu contains an even number of1’s1.4 Exercises 9
(b) uu contains an odd number of0’s
Exercise 1.4.3. For which alphabets Σ is the set Σ of all ﬁnite strings over Σ itself an
Tripos questions 2005.2.1(d) 1999.2.1(s) 1997.2.1(q) 1996.2.1(i) 1993.5.1210 1 REGULAR EXPRESSIONS11
2 Finite State Machines
We will be making use of mathematical models of physical systems called ﬁnite automata,
or ﬁnite state machines to recognise whether or not a string is in a particular language.
This section introduces this idea and gives the precise deﬁnition of what constitutes a ﬁnite
automaton. We look at several variations on the deﬁnition (to do with the concept of
determinism) and see that they are equivalent for the purpose of recognising whether or not
a string is in a given language.
2.1 Finite automata
Example of a ﬁnite automaton
a a a
q q q q
0 1 2 3
States:q ,q ,q ,q .
0 1 2 3
Transitions: as indicated above.
Start state:q .
Accepting state(s):q .
The key features of this abstract notion of ‘machine’ are listed below and are illustrated
by the example on Slide 10.
• There are only ﬁnitely many different states that a ﬁnite automaton can be in. In the
example there are four states, labelledq ,q ,q , andq .
0 1 2 3
• We do not care at all about the internal structure of machine states. All we care about
is which transitions the machine can make between the states. A symbol from some
ﬁxed alphabet Σ is associated with each transition: we think of the elements of Σ
as input symbols. Thus all the possible transitions of the ﬁnite automaton can be
speciﬁed by giving a ﬁnite graph whose vertices are the states and whose edges have12 2 FINITE STATE MACHINES
both a direction and a label (drawn from Σ). In the exampleΣ =a,b and the only
possible transitions from stateq are
q − → q and q − →q .
1 0 1 2
In other words, in state q the machine can either input the symbol b and enter state
q , or it can input the symbola and enter state q . (Note that transitions from a state
back to the same state are allowed: e.g.q − →q in the example.)
• There is a distinguished start state. In the example it is q . In the graphical
representation of a ﬁnite automaton, the start state is usually indicated by means of
a unlabelled arrow.
• The states are partitioned into two kinds: accepting states and non-accepting states.
In the graphical representation of a ﬁnite automaton, the accepting states are indicated
by double circles round the name of each such state, and the non-accepting states are
indicated using single circles. In the example there is only one accepting state,q ; the
other three states are non-accepting. (The two extreme possibilities that all states are
accepting, or that no states are accepting, are allowed; it is also allowed for the start
state to be accepting.)
The reason for the partitioning of the states of a ﬁnite automaton into ‘accepting’ and
‘non-accepting’ has to do with the use to which one puts ﬁnite automata—namely to recognise
whether or not a string u ∈ Σ is in a particular language (= subset of Σ ). Given u we
begin in the start state of the automaton and traverse its graph of transitions, using up the
symbols inu in the correct order reading the string from left to right. If we can use up all the
symbols in u in this way and reach an accepting state, then u is in the language ‘accepted’
(or ‘recognised’) by this particular automaton; otherwise u is not in that language. This is
summed up on Slide 11.
The term initial state is a common synonym for ‘start state’.
The term ﬁnal state is a common synonym for ‘accepting state’.2.1 Finite automata 13
L(M), language accepted by a ﬁnite automatonM
consists of all stringsu over its alphabet of input symbols satisfying
q − → q withq the start state andq some accepting state. Here
q − → q
means, ifu =a a ...a say, that for some statesq ,q ,...,q =q
1 2 n 1 2 n
(not necessarily all distinct) there are transitions of the form
a a a a
1 2 3 n
q −→q −→q −→···−→q = q.
0 1 2 n
∗ ′ ′
casen = 0: q− → q iff q =q
∗ ′ ′
casen = 1: q− → q iff q− → q .
Example 2.1.1. Let M be the ﬁnite automaton pictured on Slide 10. Using the notation
introduced on Slide 11 we have:
q −−−→ q (soaaab∈ L(M))
q −−−→ q iff q =q (soabaa∈ / L(M))
q −−−→ q iff q =q (no conclusion aboutL(M)).
In fact in this case
L(M) =u u contains three consecutivea’s.
(For q (i = 0,1,2) corresponds to the state in the process of reading a string in which the
lasti symbols read were alla’s.) SoL(M) coincides with the languageL(r) determined by
the regular expression
r = (ab) aaa(ab)
(cf. Slide 8).14 2 FINITE STATE MACHINES
A non-deterministic ﬁnite automaton (NFA),M,
is speciﬁed by
• a ﬁnite setStates (of states)
• a ﬁnite setΣ (the alphabet of input symbols)
• for eachq∈ States and eacha∈ Σ , a subset
Δ (q,a)⊆ States (the set of states that can be reached from
q with a single transition labelleda)
• an elements ∈ States (the start state)
• a subsetAccept ⊆ States (of accepting states)
2.2 Determinism, non-determinism, and ε-transitions
Slide 12 gives the formal deﬁnition of the notion of ﬁnite automaton. Note that the function
Δ gives a precise way of specifying the allowed transitions ofM, via: q − → q iff q ∈
The reason for the qualiﬁcation ‘non-deterministic’ on Slide 12 is because in general,
for each state q ∈ States and each input symbola ∈ Σ , we allow the possibilities that
there are no, one, or many states that can be reached in a single transition labelleda fromq,
corresponding to the cases thatΔ (q,a) has no, one, or many elements. For example, ifM
is the NFA pictured on Slide 13, then
Δ (q ,b) =∅ i.e. inM, no state can be reached fromq with a transition labelledb;
M 1 1
Δ (q ,a) =q i.e. in M, precisely one state can be reached from q with a transition
M 1 2 1
Δ (q ,a) =q ,q i.e. in M, precisely two states can be reached from q with a
M 0 0 1 0
transition labelleda.2.2 Determinism, non-determinism, andε-transitions 15
Example of a non-deterministic ﬁnite automaton
States, transitions, start state, and accepting states as shown:
a a a
q q q q
0 1 2 3
The language accepted by this automaton is the same as for the
automaton on Slide 10, namely
u∈a,b u contains three consecutivea’s.
When each subset Δ (q,a) has exactly one element we say that M is deterministic.
This is a particularly important case and is singled out for deﬁnition on Slide 14.
The ﬁnite automaton pictured on Slide 10 is deterministic. But note that if we took the
same graph of transitions but insisted that the alphabet of input symbols was a,b,c say,
then we have speciﬁed an NFA not a DFA, since for exampleΔ (q ,c) =∅. The moral of
this is: when specifying an NFA, as well as giving the graph of state transitions, it is important
to say what is the alphabet of input symbols (because some input symbols may not appear in
the graph at all).
When constructing machines for matching strings with regular expressions (as we will
do in Section 3) it is useful to consider ﬁnite state machines exhibiting an ‘internal’ form
of non-determinism in which the machine is allowed to change state without consuming any
input symbol. One calls such transitionsε-transitions and writes them as
q− →q .
This leads to the deﬁnition on Slide 15. Note that in an NFA , M, we always assume thatε
is not an element of the alphabetΣ of input symbols.
M16 2 FINITE STATE MACHINES
A deterministic ﬁnite automaton (DFA)
is an NFAM with the property that for eachq∈ States and
a∈ Σ , the ﬁnite setΔ (q,a) contains exactly one element—call it
Thus in this case transitions inM are essentially speciﬁed by a
next-state function,δ , mapping each (state, input symbol)-pair(q,a)
to the unique stateδ (q,a) which can be reached fromq by a transition
q− → q iff q = δ (q,a)
An NFA withε-transitions (NFA )
is speciﬁed by an NFAM together with a binary relation, called the
ε-transition relation, on the setStates . We write
to indicate that the pair of states(q,q ) is in this relation.
Example (with input alphabet =a,b):
q q q
a 1 2 3 a
q q q
b 4 5 6 b
Slide 152.3 A subset construction 17
L(M), language accepted by an NFA M
consists of all stringsu over the alphabetΣ of input symbols satisfying
q ⇒q withq the initial state andq some accepting state. Here·⇒·
is deﬁned by:
′ ′ ′
q⇒q iffq =q or there is a sequenceq− →···q of one or more
ε-transitions inM fromq toq
a ε a ε
q⇒q (fora∈ Σ ) iffq⇒·− →·⇒q
ab ε a ε b ε
q⇒q (fora,b∈ Σ ) iffq⇒·− →·⇒·− →·⇒ q
and similarly for longer strings
When using an NFA M to accept a stringu∈ Σ of input symbols, we are interested in
sequences of transitions in which the symbols in u occur in the correct order, but with zero
or moreε-transitions before or after each one. We write
to indicate that such a sequence exists from stateq to stateq in the NFA . Then, by deﬁnition
u is accepted by the NFA M iffq ⇒q holds forq the start state andq some accepting state:
see Slide 16. For example, for the NFA on Slide 15, it is not too hard to see that the language
accepted consists of all strings which either contain two consecutive a’s or contain two
consecutiveb’s, i.e. the language determined by the regular expression(ab) (aabb)(ab) .
2.3 A subset construction
Note that every DFA is an NFA (whose transition relation is deterministic) and that every
NFA is an NFA (whoseε-transition relation is empty). It might seem that non-determinism
andε-transitions allow a greater range of languages to be characterised as recognisable by a
ﬁnite automaton, but this is not so. We can use a construction, called the subset construction,
to convert an NFA M into a DFA PM accepting the same language (at the expense of
increasing the number of states, possibly exponentially). Slide 17 gives an example of this
construction. The name ‘subset construction’ refers to the fact that there is one state ofPM
for each subset of the setStates of states ofM. Given two subsetsS,S ⊆ States , there
′ ′ ′
is a transitionS − → S inPM just in case S consists of all the M-statesq reachable from18 2 FINITE STATE MACHINES
states q in S via the· ⇒ · relation deﬁned on Slide 16, i.e. such that we can get from q to
q inM via ﬁnitely manyε-transitions followed by ana-transition followed by ﬁnitely many
Example of the subset construction
M : δ : a b
∅ ∅ ∅
q q ,q ,q q
q 0 0 1 2 2
q q ∅
q ∅ q
q ,q q ,q ,q q
0 1 0 1 2 2
q ,q q ,q ,q q
ε 0 2 0 1 2 2
q ,q q q
1 2 1 2
q ,q ,q q ,q ,q q
0 1 2 0 1 2 2
By deﬁnition, the start state of PM is the subset of States whose elements are the
states reachable by ε-transitions from the start state of M; and a subsetS ⊆ States is an
accepting state ofPM iff some accepting state ofM is an element ofS. Thus in the example
on Slide 17 the start state isq ,q ,q and
0 1 2
a ε b
ab∈ L(M) because inM: q − → q − →q − →q
0 0 2 2
ab∈ L(PM) because inPM: q ,q ,q − →q ,q ,q − →q .
0 1 2 0 1 2 2
Indeed, in this caseL(M) =L(a b ) =L(PM). The fact thatM andPM accept the same
language in this case is no accident, as the Theorem on Slide 18 shows. That slide also gives
the deﬁnition of the subset construction in general.