Lecture notes on Formal language and Automata

formal languages and automata theory lecture notes and formal languages and automata theory textbook pdf free download
DavidCooper Profile Pic
DavidCooper,Singapore,Researcher
Published Date:11-07-2017
Your Website URL(Optional)
Comment
Formal Languages and Automata Theory D. Goswami and K. V. Krishna November 5, 2010Chapter 2 Formal Languages A language can be seen as a system suitable for expression of certain ideas, facts and concepts. For formalizing the notion of a language one must cover allthevarietiesoflanguagessuchasnatural(human)languagesandprogram- ming languages. Let us look at some common features across the languages. One may broadly see that a language is a collection of sentences; a sentence is a sequence of words; and a word is a combination of syllables. If one con- siders a language that has a script, then it can be observed that a word is a sequence of symbols of its underlying alphabet. It is observed that a formal learning of a language has the following three steps. 1. Learning its alphabet - the symbols that are used in the language. 2. Its words - as various sequences of symbols of its alphabet. 3. Formation of sentences - sequence of various words that follow certain rules of the language. In this learning, step 3 is the most di±cult part. Let us postpone to discuss construction of sentences and concentrate on steps 1 and 2. For the time being instead of completely ignoring about sentences one may look at the common features of a word and a sentence to agree upon both are just se- quenceofsomesymbolsoftheunderlyingalphabet. Forexample,theEnglish sentence "The English articles - a, an and the - are categorized into two types: indefinite and definite." may be treated as a sequence of symbols from the Roman alphabet along with enough punctuation marks such as comma, full-stop, colon and further one more special symbol, namely blank-space which is used to separate two words. Thus, abstractly, a sentence or a word may be interchangeably used 4for a sequence of symbols from an alphabet. With this discussion we start with the basic de¯nitions of alphabets and strings and then we introduce the notion of language formally. Further,inthischapter,weintroducesomeoftheoperationsonlanguages anddiscussalgebraicpropertiesoflanguageswithrespecttothoseoperations. Weendthechapterwithanintroductionto¯niterepresentationoflanguages via regular expressions. 2.1 Strings We formally de¯ne an alphabet as a non-empty ¯nite set. We normally use the symbols a;b;c;::: with or without subscripts or 0;1;2;:::, etc. for the elements of an alphabet. Astring overanalphabet§isa¯nitesequenceofsymbolsof§. Although one writes a sequence as (a ;a ;:::;a ), in the present context, we prefer to 1 2 n write it as a a ¢¢¢a , i.e. by juxtaposing the symbols in that order. Thus, 1 2 n a string is also known as a word or a sentence. Normally, we use lower case letters towards the end of English alphabet, namely z;y;x;w; etc., to denote strings. Example 2.1.1. Let § = fa;bg be an alphabet; then aa;ab;bba;baaba;::: are some examples of strings over §. Since the empty sequence is a ¯nite sequence, it is also a string. Which is ( )inearliernotation; butwiththenotationadaptedforthepresentcontext we require a special symbol. We use ", to denote the empty string. ¤ The set of all strings over an alphabet § is denoted by § . For example, if §=f0;1g, then ¤ § =f";0;1;00;01;10;11;000;001;:::g: ¤ ¤ Although the set § is in¯nite, it is a countable set. In fact, § is countably in¯nite for any alphabet §. In order to understand some such fundamen- tal facts we introduce some string operations, which in turn are useful to manipulate and generate strings. One of the most fundamental operations used for string manipulation is concatenation. Let x = a a ¢¢¢a and y = b b ¢¢¢b be two strings. The 1 2 n 1 2 m concatenation of the pair x, y denoted by xy is the string a a ¢¢¢a b b ¢¢¢b : 1 2 n 1 2 m 5¤ Clearly, the binary operation concatenation on § is associative, i.e., for all ¤ x;y;z2§ , x(yz)=(xy)z: Thus, x(yz) may simply be written as xyz. Also, since " is the empty string, it satis¯es the property "x=x"=x ¤ ¤ for any sting x2 § . Hence, § is a monoid with respect to concatenation. ¤ The operation concatenation is not commutative on § . For a string x and an integer n¸0, we write n+1 n 0 x =x x with the base condition x =": n Thatis,x isobtainedbyconcatenatingncopiesofx. Also,whenevern=0, the string x ¢¢¢x represents the empty string ". 1 n Let x be a string over an alphabet §. For a2 §, the number of occur- rences of a in x shall be denoted by jxj . The length of a string x denoted a byjxj is de¯ned as X jxj= jxj : a a2§ Essentially, the length of a string is obtained by counting the number of symbols in the string. For example,jaabj=3,jaj=1. Note thatj"j=0. If we denote A to be the set of all strings of length n over §, then one n can easily ascertain that ¤ § = A : n n¸0 ¤ And hence, being A a ¯nite set, § is a countably in¯nite set. n We say that x is a substring of y if x occurs in y, that is y = uxv for some strings u and v. The substring x is said to be a pre¯x of y if u = ". Similarly, x is a su±x of y if v =". Generalizingthenotationusedfornumberofoccurrencesofsymbolaina string x, we adopt the notationjyj as the number of occurrences of a string x x in y. 2.2 Languages We have got acquainted with the formal notion of strings that are basic elements of a language. In order to de¯ne the notion of a language in a broad spectrum, it is felt that it can be any collection of strings over an alphabet. ¤ Thus we de¯ne a language over an alphabet § as a subset of § . 6Example 2.2.1. 1. The emptyset? is a language over any alphabet. Similarly,f"g is also a language over any alphabet. 2. The set of all strings overf0;1g that start with 0. 3. The set of all strings overfa;b;cg having ac as a substring. Remark 2.2.2. Note that?6=f"g, because the language? does not contain anystringbutf"gcontainsastring,namely". Alsoitisevidentthatj?j=0; whereas,jf"gj=1. Since languages are sets, we can apply various well known set operations suchasunion,intersection,complement,di®erenceonlanguages. Thenotion of concatenation of strings can be extended to languages as follows. The concatenation of a pair of languages L , L is 1 2 L L =fxy j x2L y2L g: 1 2 1 2 Example 2.2.3. 1. If L =f0;1;01g and L =f1;00g, then 1 2 L L =f01;11;011;000;100;0100g. 1 2 2. For L =fb;ba;babg and L =f";b;bb;abbg, we have 1 2 L L =fb;ba;bb;bab;bbb;babb;baabb;babbb;bababbg. 1 2 Remark 2.2.4. 1. Since concatenation of strings is associative, so is the concatenation of languages. That is, for all languages L ;L and L , 1 2 3 (L L )L =L (L L ): 1 2 3 1 2 3 Hence, (L L )L may simply be written as L L L . 1 2 3 1 2 3 2. The number of strings in L L is always less than or equal to the 1 2 product of individual numbers, i.e. jL L j·jL jjL j: 1 2 1 2 3. L µL L if and only if "2L . 1 1 2 2 7Proof. The \if part" is straightforward; for instance, if " 2 L , then 2 for any x2 L , we have x = x"2 L L . On the other hand, suppose 1 1 2 "2= L . Now, note that a string x2L of shortest length in L cannot 2 1 1 be inL L . This is because, if x=yz for some y2L and a nonempty 1 2 1 string z 2 L , then jyj jxj. A contradiction to our assumption that 2 x is of shortest length in L . Hence L 6µL L . 1 1 1 2 4. Similarly, "2L if and only if L µL L . 1 2 1 2 n We write L to denote the language which is obtained by concatenating n copies of L. More formally, 0 L =f"g and n n¡1 L =L L, for n¸1. Inthecontextofformallanguages,anotherimportantoperationisKleene ¤ star. Kleene star orKleene closure ofalanguageL,denotedbyL ,isde¯ned as ¤ n L = L : n¸0 Example 2.2.5. 1. Kleene star of the languagef01g is n f";01;0101;010101;:::g=f(01) j n¸0g. ¤ 2. If L=f0;10g, then L =f";0;10;00;010;100;1010;000;:::g n Since an arbitrary string in L is of the form x x ¢¢¢x , for x 2 L and 1 2 n i n ¤ L = L , one can easily observe that n¸0 ¤ L =fx x ¢¢¢x j n¸0 and x 2L; for 1·i·ng 1 2 n i ¤ Thus, a typical string in L is a concatenation of ¯nitely many strings of L. Remark 2.2.6. Note that, the Kleene star of the language L = f0;1g over the alphabet §=f0;1g is ¤ 0 2 L = L LL ¢¢¢ = f"gf0;1gf00;01;10;11g¢¢¢ = f";0;1;00;01;10;11;¢¢¢g = the set of all strings over §: ¤ Thus, the earlier introduced notation § is consistent with the notation of Kleene star by considering § as a language over §. 8+ The positive closure of a language L is denoted by L is de¯ned as + n L = L : n¸1 ¤ + Thus, L =L f"g. We often can easily describe various formal languages in English by stat- ing the property that is to be satis¯ed by the strings in the respective lan- guages. It is not only for elegant representation but also to understand the properties of languages better, describing the languages in set builder form is desired. Consider the set of all strings over f0;1g that start with 0. Note that ¤ each such string can be seen as 0x for some x2f0;1g . Thus the language can be represented by ¤ f0xj x2f0;1g g: Examples 1. The set of all strings over fa;b;cg that have ac as substring can be written as ¤ fxacy j x;y2fa;b;cg g: This can also be written as ¤ fx2fa;b;cg jjxj ¸1g; ac stating that the set of all strings over fa;b;cg in which the number of occurrences of substring ac is at least 1. 0 2. The set of all strings over some alphabet § with even number of as is ¤ fx2§ jjxj =2n; for some n2Ng: a Equivalently, ¤ fx2§ jjxj ´0 mod 2g: a 0 3. The set of all strings over some alphabet § with equal number of as 0 and bs can be written as ¤ fx2§ jjxj =jxjg: a b 4. The set of all palindromes over an alphabet § can be written as ¤ R fx2§ j x=x g; R where x is the string obtained by reversing x. 95. The set of all strings over some alphabet § that have an a in the 5th position from the right can be written as ¤ fxay j x;y2§ andjyj=4g: 0 6. The set of all strings over some alphabet § with no consecutive as can be written as ¤ fx2§ jjxj =0g: aa 7. The set of all strings over fa;bg in which every occurrence of b is not before an occurrence of a can be written as m n fa b j m;n¸0g: Note that, this is the set of all strings overfa;bg which do not contain ba as a substring. 2.3 Properties Theusualsettheoreticpropertieswithrespecttounion,intersection,comple- ment, di®erence, etc. hold even in the context of languages. Now we observe certain properties of languages with respect to the newly introduced oper- ations concatenation, Kleene closure, and positive closure. In what follows, L;L ;L , L and L are languages. 1 2 3 4 P1 Recall that concatenation of languages is associative. P2 Sinceconcatenationofstringsisnotcommutative,wehaveL L 6=L L , 1 2 2 1 in general. P3 Lf"g=f"gL=L. P4 L?=?L=?. Proof. Let x2L?; then x=x x for some x 2L and x 2?. But? 1 2 1 2 being emptyset cannot hold any element. Hence there cannot be any element x2L? so that L?=?. Similarly,?L=? as well. P5 Distributive Properties: 1. (L L )L =L L L L . 1 2 3 1 3 2 3 10Proof. Suppose x2(L L )L 1 2 3 =) x=x x ; for some x 2L L ); and some x 2L 1 2 1 1 2 2 3 =) x=x x ; for some x 2L or x 2L ; and x 2L 1 2 1 1 1 2 2 3 =) x=x x ; for some x 2L and x 2L ; 1 2 1 1 2 3 or x 2L and x 2L 1 2 2 3 =) x2L L or x2L L 1 3 2 3 =) x2L L L L : 1 3 2 3 Conversely, suppose x2 L L L L =) x2 L L or x2 L L . 1 3 2 3 1 3 2 3 Without loos of generality, assume x62L L . Then x2L L . 1 3 2 3 =) x=x x ; for some x 2L and x 2L 3 4 3 2 4 3 =) x=x x ; for some x 2L L ; and some x 2L 3 4 3 1 2 4 3 =) x2(L L )L : 1 2 3 Hence, (L L )L =L L L L . 1 2 3 1 3 2 3 2. L (L L )=L L L L . 1 2 3 1 2 1 3 Proof. Similar to the above. From these properties it is clear that the concatenation is distributive over ¯nite unions. Moreover, we can observe that concatenation is also distributive over countably in¯nite unions. That is, Ã L L = LL and i i i¸1 i¸1 Ã L L = LL i i i¸1 i¸1 P6 If L µL and L µL , then L L µL L . 1 2 3 4 1 3 2 4 ¤ P7 ? =f"g. ¤ P8 f"g =f"g. ¤ + P9 If "2L, then L =L . ¤ ¤ + P10 L L=LL =L . 11¤ ¤ Proof. Suppose x 2 L L. Then x = yz for some y 2 L and z 2 L. ¤ But y2L implies y =y ¢¢¢y with y 2L for all i. Hence, 1 n i ¤ x=yz =(y ¢¢¢y )z =y (y ¢¢¢y z)2LL : 1 n 1 2 n ¤ ¤ Converse is similar. Hence, L L=LL . ¤ Further, when x2 L L, as above, we have x = y ¢¢¢y z is clearly in 1 n + + L . On the other hand, x 2 L implies x = x ¢¢¢x with m ¸ 1 1 m 0 0 and x 2 L for all i. Now write x = x ¢¢¢x so that x = xx . i 1 m¡1 m 0 ¤ 0 Here, note that x 2 L ; particularly, when m = 1 then x = ". Thus, ¤ + ¤ x2L L. Hence, L =L L. ¤ ¤ ¤ P11 (L ) =L . ¤ ¤ ¤ P12 L L =L . ¤ ¤ P13 (L L ) L =L (L L ) . 1 2 1 1 2 1 ¤ Proof. Let x 2 (L L ) L . Then x = yz, where z 2 L and y = 1 2 1 1 ¤ y ¢¢¢y 2(L L ) withy 2L L . Noweachy =uv , foru 2L and 1 n 1 2 i 1 2 i i i i 1 v 2 L . Note that vu 2 L L , for all i with 1· i· n¡1. Hence, i 2 i i+1 2 1 x = yz = (y ¢¢¢y )z = (u v ¢¢¢u v )z = u (v u ¢¢¢v u v z) 2 1 n 1 1 n n 1 1 2 n¡1 n n ¤ ¤ ¤ L (L L ) . Converse is similar. Hence, (L L ) L =L (L L ) . 1 2 1 1 2 1 1 2 1 ¤ ¤ ¤ ¤ P14 (L L ) =(L L ) . 1 2 1 2 ¤ ¤ Proof. Observe that L µ L and f"gµ L . Hence, by properties P3 1 1 2 ¤ ¤ ¤ ¤ and P6, we have L = L f"g µ L L . Similarly, L µ L L . Hence, 1 1 2 1 2 1 2 ¤ ¤ ¤ ¤ ¤ ¤ L L µL L . Consequently, (L L ) µ(L L ) . 1 2 1 2 1 2 1 2 ¤ ¤ ¤ ¤ For converse, observe that L µ(L L ) . Similarly, L µ(L L ) . 1 2 1 2 1 2 Thus, ¤ ¤ ¤ ¤ L L µ(L L ) (L L ) : 1 2 1 2 1 2 ¤ ¤ ¤ But, by property P12, we have (L L ) (L L ) = (L L ) so 1 2 1 2 1 2 ¤ ¤ ¤ that L L µ(L L ) . Hence, 1 2 1 2 ¤ ¤ ¤ ¤ ¤ ¤ (L L ) µ((L L ) ) =(L L ) : 1 2 1 2 1 2 122.4 Finite Representation Pro¯ciency in a language does not expect one to know all the sentences of the language; rather with some limited information one should be able to come up with all possible sentences of the language. Even in case of programming languages, a compiler validates a program - a sentence in the programming language - with a ¯nite set of instructions incorporated in it. Thus, we are interested in a ¯nite representation of a language - that is, by giving a ¯nite amount of information, all the strings of a language shall be enumerated/validated. Now, we look at the languages for which ¯nite representation is possible. Given an alphabet §, to start with, the languages with single string fxg and? can have ¯nite representation, say x and?, respectively. In this way, ¯nitelanguagescanalsobegivena¯niterepresentation; say,byenumerating all the strings. Thus, giving ¯nite representation for in¯nite languages is a nontrivial interesting problem. In this context, the operations on languages may be helpful. For example, the in¯nite language f";ab;abab;ababab;:::g can be con- ¤ sidered as the Kleene star of the language fabg, that is fabg . Thus, using Kleene star operation we can have ¯nite representation for some in¯nite lan- guages. Whileoperationsareunderconsideration,togive¯niterepresentationfor languagesonemay¯rstlookattheindivisiblelanguages, namely?;f"g;and fag, for all a2§, as basis elements. ¤ To construct fxg, for x 2 § , we can use the operation concatenation over the basis elements. For example, if x = aba then choose fag and fbg; and concatenate fagfbgfag to get fabag. Any ¯nite language over §, say fx ;:::;x g can be obtained by considering the unionfx g¢¢¢fx g. 1 n 1 n In this section, we look at the aspects of considering operations over basis elements to represent a language. This is one aspect of representing a language. There are many other aspects to give ¯nite representations; some such aspects will be considered in the later chapters. 2.4.1 Regular Expressions We now consider the class of languages obtained by applying union, con- catenation, and Kleene star for ¯nitely many times on the basis elements. Theselanguagesareknownasregularlanguagesandthecorresponding¯nite representations are known as regular expressions. De¯nition 2.4.1 (Regular Expression). We de¯ne a regular expression over an alphabet § recursively as follows. 131. ?;", and a, for each a 2 §, are regular expressions representing the languages?;f"g, andfag, respectively. 2. If r and s are regular expressions representing the languages R and S, respectively, then so are (a) (r+s) representing the language RS, (b) (rs) representing the language RS, and ¤ ¤ (c) (r ) representing the language R . In a regular expression we keep a minimum number of parenthesis which are required to avoid ambiguity in the expression. For example, we may simply write r+st in case of (r+(st)). Similarly, r+s+t for ((r+s)+t). De¯nition2.4.2. Ifr isaregularexpression, thenthelanguagerepresented by r is denoted by L(r). Further, a language L is said to be regular if there is a regular expression r such that L=L(r). Remark 2.4.3. 1. A regular language over an alphabet § is the one that can be obtained from the emptyset, f"g, andfag, for a2 §, by ¯nitely many applica- tions of union, concatenation and Kleene star. 2. The smallest class of languages over an alphabet § which contains ?;f"g, and fag and is closed with respect to union, concatenation, and Kleene star is the class of all regular languages over §. Example 2.4.4. As we observed earlier that the languages?;f"g,fag, and all ¯nite sets are regular. n Example 2.4.5. fa j n ¸ 0g is regular as it can be represented by the ¤ expression a . ¤ Example2.4.6. § , thesetofallstringsoveranalphabet§, isregular. For ¤ instance, if § = fa ;a ;:::;a g, then § can be represented as (a +a + 1 2 n 1 2 ¤ ¢¢¢+a ) . n Example 2.4.7. The set of all strings over fa;bg which contain ab as a substring is regular. For instance, the set can be written as ¤ fx2fa;bg j ab is a substring of xg ¤ = fyabz j y;z2fa;bg g ¤ ¤ = fa;bg fabgfa;bg ¤ ¤ Hence, the corresponding regular expression is (a+b) ab(a+b) . 14Example 2.4.8. The language L overf0;1g that contains 01 or 10 as sub- string is regular. L = fxj 01 is a substring of xgfxj 10 is a substring of xg ¤ ¤ = fy01z j y;z2§ gfu10v j u;v2§ g ¤ ¤ ¤ ¤ = § f01g§ § f10g§ ¤ Since § ,f01g, andf10g are regular we have L to be regular. In fact, at this point, one can easily notice that ¤ ¤ ¤ ¤ (0+1) 01(0+1) +(0+1) 10(0+1) is a regular expression representing L. Example 2.4.9. The set of all strings over fa;bg which do not contain ab as a substring. By analyzing the language one can observe that precisely the language is as follows. n m fb a j m;n¸0g ¤ ¤ Thus, a regular expression of the language is b a and hence the language is regular. Example 2.4.10. The set of strings over fa;bg which contain odd number 0 of as is regular. Although the set can be represented in set builder form as ¤ fx2fa;bg jjxj =2n+1; for some ng; a writing a regular expression for the language is little tricky job. Hence, we postponetheargumenttoChapter3(seeExample3.3.6),whereweconstruct a regular grammar for the language. Regular grammar is a tool to generate regular languages. Example 2.4.11. The set of strings over fa;bg which contain odd number 0 0 of as and even number of bs is regular. As above, a set builder form of the set is: ¤ fx2fa;bg jjxj =2n+1; for some n andjxj =2m; for some mg: a b Writing a regular expression for the language is even more trickier than the earlier example. This will be handled in Chapter 4 using ¯nite automata, yet another tool to represent regular languages. De¯nition 2.4.12. Two regular expressions r and r are said to be equiv- 1 2 alent if they represent the same language; in which case, we write r ¼r . 1 2 15¤ ¤ ¤ ¤ Example2.4.13. Theregularexpressions(10+1) and((10) 1 ) areequiv- alent. ¤ n ¤ m Since L((10) ) = f(10) j n ¸ 0g and L(1 ) = f1 j m ¸ 0g, we have ¤ ¤ n m L((10) 1 )=f(10) 1 j m;n¸0g. This implies ¤ ¤ ¤ n m n m n m 1 1 2 2 l l L(((10) 1 ) ) = f(10) 1 (10) 1 ¢¢¢(10) 1 j m;n ¸0 and 0·i·lg i i l X = fx x ¢¢¢x j x =10 or 1g; where k = (m +n ) 1 2 k i i i i=0 ¤ µ L((10+1) ): ¤ Conversely, suppose x2L((10+1) ). Then, x=x x ¢¢¢x where x =10 or 1 1 2 p i p q p q p q 1 1 2 2 r r =) x=(10) 1 (10) 1 ¢¢¢(10) 1 for p;q ¸0 i j ¤ ¤ ¤ =) x2L(((10) 1 ) ): ¤ ¤ ¤ ¤ ¤ ¤ ¤ ¤ Hence,L((10+1) )=L(((10) 1 ) )andconsequently,(10+1) ¼((10) 1 ) . FrompropertyP14,bychoosingL =f10gandL =f1g,onemaynotice 1 2 that ¤ ¤ ¤ ¤ (f10gf1g) =(f10g f1g ) : Since 10 and 1 represent the regular languages f10g and f1g, respectively, from the above equation we get ¤ ¤ ¤ ¤ (10+1) ¼((10) 1 ) : Sincethosepropertiesholdgoodforalllanguages,byspecializingthoseprop- ertiestoregularlanguagesandinturnreplacingbythecorrespondingregular expressions we get the following identities for regular expressions. Let r;r ;r , and r be any regular expressions 1 2 3 1. r"¼"r¼r. 2. r r 6¼r r , in general. 1 2 2 1 3. r (r r )¼(r r )r . 1 2 3 1 2 3 4. r?¼?r¼?. ¤ 5. ? ¼". ¤ 6. " ¼". 16¤ + 7. If "2L(r), then r ¼r . ¤ ¤ + 8. rr ¼r r¼r . 9. (r +r )r ¼r r +r r . 1 2 3 1 3 2 3 10. r (r +r )¼r r +r r . 1 2 3 1 2 1 3 ¤ ¤ ¤ 11. (r ) ¼r . ¤ ¤ 12. (r r ) r ¼r (r r ) . 1 2 1 1 2 1 ¤ ¤ ¤ ¤ 13. (r +r ) ¼(r r ) . 1 2 1 2 Example 2.4.14. + ¤ ¤ ¤ ¤ + + ¤ + 1. b (a b +")b¼b(b a +")b ¼b a b . Proof. + ¤ ¤ + ¤ ¤ + b (a b +")b ¼ (b a b +b ")b + ¤ ¤ + ¼ b a b b+b b + ¤ + + ¼ b a b +b b + ¤ + + + ¤ + ¼ b a b ; since L(b b)µL(b a b ): ¤ ¤ + + ¤ + Similarly, one can observe that b(b a +")b ¼b a b . + ¤ ¤ ¤ ¤ ¤ 2. (0 (01) 0+0 (10) ) ¼(0+10) . Proof. + ¤ ¤ ¤ ¤ + ¤ ¤ ¤ ¤ (0 (01) 0+0 (10) ) ¼ (0 0(10) +0 (10) ) + ¤ ¤ ¤ ¼ ((0 0+0 )(10) ) ¤ ¤ ¤ + ¤ ¼ (0 (10) ) ; since L(0 0)µL(0 ) ¤ ¼ (0+10) : Notation 2.4.15. If L is represented by a regular expression r, i.e. L(r)=L, then we may simply use r instead of L(r) to indicated the language L. As 0 0 0 a consequence, for two regular expressions r and r, r ¼ r and r = r are equivalent. 17Chapter 3 Grammars Inthischapter,weintroducethenotionofgrammarcalledcontext-freegram- mar(CFG)asalanguagegenerator. Thenotionofderivationisinstrumental in understanding how the strings are generated in a grammar. We explain the various properties of derivations using a graphical representation called derivation trees. A special case of CFG, viz. regular grammar, is discussed as tool to generate to regular languages. A more general notion of grammars is presented in Chapter 7. Inthecontextofnaturallanguages, thegrammarofa languageis aset of rules which are used to construct/validate sentences of the language. It has beenpointedout, intheintroductionofChapter2, thatthisisthethirdstep in a formal learning of a language. Now we draw the attention of a reader to look into the general features of the grammars (of natural languages) to formalize the notion in the present context which facilitate for better under- standing of formal languages. Consider the English sentence The students study automata theory. In order to observe that the sentence is grammatically correct, one may attribute certain rules of the English grammar to the sentence and validate it. For instance, the Article the followed by the Noun students form a Noun-phrase andsimilarlythe Noun automata theory form a Noun-phrase. Further, study is a Verb. Now, choose the Sentential form \Subject Verb Object" of the English grammar. As Subject or Object can be a Noun-phrase by plugging in the above words one may conclude that the given sentence is a grammatically correct English sentence. This veri¯cation/derivation is depicted in Figure 3.1. The derivation can also be represented by a tree structure as shown in Figure 3.2. 18Sentence ) Subject Verb Object ) Noun-phrase Verb Object ) Article Noun Verb Object ) The Noun Verb Object ) The students Verb Object ) The students study Object ) The students study Noun-phrase ) The students study Noun ) The students study automata theory Figure 3.1: Derivation of an English Sentence In this process, we observe that two types of words are in the discussion. 1. The words like the, study, students. 2. The words like Article, Noun, Verb. The main di®erence is, if you arrive at a stage where type (1) words are appearing, then you need not say anything more about them. In case you arrive at a stage where you ¯nd a word of type (2), then you are assumed to say some more about the word. For example, if the word Article comes, then one should say which article need to be chosen among a, an and the. Let us call the type (1) and type (2) words as terminals and nonterminals, respectively, as per their features. Thus,agrammarshouldincludeterminalsandnonterminalsalongwitha setofruleswhichattributesomeinformationregardingnonterminalsymbols. 3.1 Context-Free Grammars We now understand that a grammar should have the following components. ² A set of nonterminal symbols. ² A set of terminal symbols. ² A set of rules. ² As the grammar is to construct/validate sentences of a language, we distinguish a symbol in the set of nonterminals to represent a sen- tence from which various sentences of the language can be gener- ated/validated. 19 Sentence Subject Verb Object Noun−phrase Noun−phrase Article Noun Noun The students study automata theory Figure 3.2: Derivation Tree of an English Sentence With this, we formally de¯ne the notion of grammar as below. De¯nition 3.1.1. A grammar is a quadruple G =(N;§;P;S) where 1. N is a ¯nite set of nonterminals, 2. § is a ¯nite set of terminals, 3. S2N is the start symbol, and ¤ 4. P is a ¯nite subset of N£V called the set of production rules. Here, V =N§. It is convenient to write A®, for the production rule (A;®)2P. To de¯ne a formal notion of validating or deriving a sentence using a grammar, we require the following concepts. De¯nition 3.1.2. LetG =(N;§;P;S) be a grammar with V =N§. ¤ 1. We de¯ne a binary relation)on V by G ®)¯ if and only if ® =® A® ;¯ =® °® and A°2P; 1 2 1 2 G ¤ for all ®;¯2V . 202. The relation)is called one step relation onG. If ®)¯, then we call G G ® yields ¯ in one step inG. ¤ 3. The re°exive-transitive closure of ) is denoted by ) . That is, for G G ¤ ®;¯2V , ½ ¤ 9n¸0 and ® ;® ;:::;® 2V such that 0 1 n ¤ ®)¯ if and only if G ® =® )® )¢¢¢ )® )® =¯: 0 1 n¡1 n G G G G ¤ ¤ 4. For ®;¯ 2 V , if ®)¯, then we say ¯ is derived from ® or ® derives G ¤ ¯. Further, ®)¯ is called as a derivation inG. G 5. If ® =® )® )¢¢¢ )® )® =¯ is a derivation, then the length 0 1 n¡1 n G G G G n of the derivation is n and it may be written as ®)¯. G 6. In a given context, if we deal with only one grammar G, then we may simply write), in stead of). G ¤ 7. If ®)¯ is a derivation, then we say ¯ is the yield of the derivation. ¤ 8. Astring®2V issaidtobeasentential form inG, if® canbederived ¤ from the start symbol S ofG. That is, S )®. ¤ 9. In particular, if ® 2 § , then the sentential form ® is known as a sentence. In which case, we say ® is generated byG. 10. The language generated by G, denoted by L(G), is the set of all sen- tences generated byG. That is, ¤ ¤ L(G)=fx2§ j S )xg: Note that a production rule of a grammar is of the form A ®, where A is a nonterminal symbol. If the nonterminal A appears in a sentential form X ¢¢¢X AX ¢¢¢X , then the sentential form X ¢¢¢X ®X ¢¢¢X 1 k k+1 n 1 k k+1 n can be obtained in one step by replacing A with ®. This replacement is independentoftheneighboringsymbolsofAinX ¢¢¢X AX ¢¢¢X . That 1 k k+1 n 0 is, X s will not play any role in the replacement. One may call A is within i 0 the context of X s and hence the rules A ® are said to be of context-free i type. Thus, the type of grammar that is de¯ned here is known as context- free grammar,simplyCFG.Inthelaterchapters,werelaxtheconstraintand discuss more general types of grammars. 21Example 3.1.3. Let P = fS ab;S bb;S aba;S aabg with § =fa;bg and N =fSg. ThenG = (N;§;P;S) is a context-free grammar. Since left hand side of each production rule is the start symbol S and their right hand sides are terminal strings, every derivation in G is of length one. In fact, we precisely have the following derivation inG. 1. S)ab 2. S)bb 3. S)aba 4. S)aab Hence, the language generated byG, L(G)=fab;bb;aba;aabg: Notation 3.1.4. 1. A® , A® can be written as A® j ® . 1 2 1 2 2. Normally we use S as the start symbol of a grammar, unless otherwise speci¯ed. 3. To give a grammarG =(N;§;P;S), it is su±cient to give the produc- tion rules only since one may easily ¯nd the other components N and § of the grammarG by looking at the rules. Example3.1.5. SupposeLisa¯nitelanguageoveranalphabet§, say L= fx ;x ;:::;x g. Then consider the ¯nite set P =fS x j x j ¢¢¢ j x g. 1 2 n 1 2 n Now, as discussed in Example 3.1.3, one can easily observe that the CFG (fSg;§;P;S) generates the language L. Example 3.1.6. Let § = f0;1g and N = fSg. Consider the CFG G = (N;§;P;S), where P has precisely the following production rules. 1. S0S 2. S1S 3. S" 22

Advise: Why You Wasting Money in Costly SEO Tools, Use World's Best Free SEO Tool Ubersuggest.