Languages, Automata and Grammars Lecture Notes



Consider a nonempty set A of symbols. A string/word w on the set A is a finite
sequence of its elements.

 For example, suppose A = {a, b, c}. Then the following sequences are strings on A:
u = ababb and v = accbaaa.

When discussing strings/words on A, we frequently call A the alphabet, and its
elements are called characters.

 We will also abbreviate our notation and write a² for aa, a³ for aaa, and so on. Thus, for the
above words, u = abab² and v = ac²ba³.
 The empty string has no characters and is denoted ε (Greek letter epsilon), or
denoted by λ (Greek letter lambda).
The set of all strings/words on A is denoted by A* (read: “A star”).

 The length of a string/word u, written |u| or l(u), is the number of elements in its
sequence of letters. For the above strings/words u and v, we have l(u) = 5 and l(v) = 7.
Also, l(λ) = 0, where λ is the empty word.

Concatenation of Strings
 Consider two strings/words u and v on the alphabet A. The concatenation of u and
v, written uv, is the word obtained by writing down the letters of u followed by the
letters of v.
For the strings u = ababb and v = accbaaa,
uv = ababbaccbaaa = abab²ac²ba³

 The concatenation operation for strings/words on an alphabet A is associative.

 The empty string/word ε is an identity element for the concatenation operation.

Adjoining the empty word before or after a word u does not change the word u.

The operation is not commutative, e.g., uv ≠ vu for the above words u and v.
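The properties above can be checked mechanically. The following is a minimal sketch that models words as Python strings, taking u = ababb and v = accbaaa (the two halves of the word uv given above). The helper name `concat` and the third word `w` are illustrative assumptions, not part of the notes.

```python
def concat(x: str, y: str) -> str:
    """Concatenation xy: the letters of x followed by the letters of y."""
    return x + y

u, v = "ababb", "accbaaa"
w = "ca"      # hypothetical third word, used only to illustrate associativity
EMPTY = ""    # the empty word

print(len(u), len(v), len(EMPTY))                            # 5 7 0
print(concat(u, v))                                          # ababbaccbaaa
print(concat(concat(u, v), w) == concat(u, concat(v, w)))    # True: associative
print(concat(EMPTY, u) == u == concat(u, EMPTY))             # True: identity element
print(concat(u, v) == concat(v, u))                          # False: not commutative
```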

Subwords and Initial Segments
 Consider any string/word u = a₁a₂ . . . aₙ on an alphabet A. Any sequence w
= aⱼaⱼ₊₁ . . . aₖ is called a subword/substring of u.
 In particular, the subword/substring w = a₁a₂ . . . aₖ, beginning with the first letter of u, is
called an initial segment of u.
 In other words, w is a subword of u if u = v₁wv₂ and w is an initial segment of u if u = wv.
 Observe that ε and u are both subwords of u since u = εu.
Consider the word u = abca. The subwords and initial segments of u follow:
 Subwords: ε, a, b, c, ab, bc, ca, abc, bca, abca = u
 Initial segments: ε, a, ab, abc, abca = u
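The two lists for u = abca can be reproduced by enumerating slices. This is a small sketch; the function names `subwords` and `initial_segments` are chosen here for illustration.

```python
def subwords(u: str) -> set[str]:
    """All contiguous pieces u[i:j], including the empty word and u itself."""
    return {u[i:j] for i in range(len(u) + 1) for j in range(i, len(u) + 1)}

def initial_segments(u: str) -> list[str]:
    """All prefixes u[:k], from the empty word up to u itself."""
    return [u[:k] for k in range(len(u) + 1)]

# Sort by length, then alphabetically, to mirror the order used in the text.
print(sorted(subwords("abca"), key=lambda s: (len(s), s)))
# ['', 'a', 'b', 'c', 'ab', 'bc', 'ca', 'abc', 'bca', 'abca']
print(initial_segments("abca"))
# ['', 'a', 'ab', 'abc', 'abca']
```

The empty Python string plays the role of ε. Note that the letter a occurs twice in abca but contributes only one subword, because subwords are collected in a set.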
 Recall that A* denotes the set of all strings/words on an alphabet A.
A language over an alphabet A is a collection of words on A.

 Thus a language L is simply a subset of A*.

Let A = {a, b}. The following are languages over A.
 L₁ = {a, ab, ab², . . .}
consists of all words beginning with an a and followed by zero or more b’s.
 L₂ = {aᵐbⁿ | m > 0, n > 0}
consists of all words beginning with one or more a’s followed by one or more b’s.
 L₃ = {aᵐbᵐ | m > 0}
consists of all words beginning with one or more a’s and followed by the same number of b’s.
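A language can be given by a membership test instead of a listing. The sketch below encodes the three verbal descriptions above (one a then any number of b's; one or more a's then one or more b's; equally many a's and b's). The helper `split_a_then_b` and the names `in_L1`, `in_L2`, `in_L3` are assumptions made for this illustration.

```python
from typing import Optional, Tuple

def split_a_then_b(w: str) -> Optional[Tuple[int, int]]:
    """Return (m, n) if w consists of m a's followed by n b's, else None."""
    m = len(w) - len(w.lstrip("a"))   # number of leading a's
    rest = w[m:]
    if rest.strip("b"):               # something other than b's remains
        return None
    return m, len(rest)

def in_L1(w: str) -> bool:
    shape = split_a_then_b(w)
    return shape is not None and shape[0] == 1

def in_L2(w: str) -> bool:
    shape = split_a_then_b(w)
    return shape is not None and shape[0] > 0 and shape[1] > 0

def in_L3(w: str) -> bool:
    shape = split_a_then_b(w)
    return shape is not None and shape[0] == shape[1] and shape[0] > 0

print(in_L1("abb"), in_L2("aab"), in_L3("aabb"))   # True True True
print(in_L1("aab"), in_L2("a"), in_L3("aab"))      # False False False
```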

 Suppose L and M are languages over an alphabet A. Then the “concatenation” of L and M,
denoted by LM, is the language defined as follows:
LM = { uv | u ∈ L, v ∈ M }

If L₁ = { a, ba, bb } and L₂ = { aa, bb }, then
L₁L₂ = { aaa, abb, baaa, babb, bbaa, bbbb }
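For finite languages the concatenation can be computed directly as a set comprehension. The sketch below takes L₁ = {a, ba, bb}, which is the set consistent with the product listed above; the function name `concat_languages` is an assumption for illustration.

```python
def concat_languages(L: set[str], M: set[str]) -> set[str]:
    """LM: every word of L followed by every word of M."""
    return {u + v for u in L for v in M}

L1 = {"a", "ba", "bb"}
L2 = {"aa", "bb"}
print(sorted(concat_languages(L1, L2)))
# ['aaa', 'abb', 'baaa', 'babb', 'bbaa', 'bbbb']
```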

Language Exponentiation
 We can define what it means to “exponentiate” a language as follows:
L⁰ = {ε}
The set containing just the empty string.
This means any string formed by concatenating zero strings together is the empty string.
Lⁿ⁺¹ = LLⁿ
This means concatenating (n+1) strings together works by concatenating
n strings, then concatenating one more.
 An important operation on languages is the Kleene Closure, which is defined as
L* = { w ∈ A* | ∃n ∈ ℕ. w ∈ Lⁿ }
 Mathematically: w ∈ L* if ∃n ∈ ℕ. w ∈ Lⁿ
 Intuitively, L* contains all possible ways of concatenating zero or more strings in L together,
possibly with repetition.
Theorem on Kleene Closure of L :

If L = { a, bb }, then L* = {
ε,
a, bb,
aa, abb, bba, bbbb,
aaa, aabb, abba, abbbb, bbaa, bbabb, bbbba, bbbbbb,
. . . }
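Since L* is infinite whenever L contains a nonempty word, a program can only list it up to a bound. The following sketch computes the powers of L = {a, bb} from the recursive definition and collects them up to a chosen exponent; `power` and `kleene_up_to` are illustrative names, and the cutoff is an assumption of the sketch, not part of the definition.

```python
def power(L: set[str], n: int) -> set[str]:
    """L^n, starting from L^0 = {""} and applying L^(k+1) = L L^k n times."""
    result = {""}
    for _ in range(n):
        result = {u + v for u in L for v in result}
    return result

def kleene_up_to(L: set[str], max_n: int) -> set[str]:
    """Union of L^0, L^1, ..., L^max_n: a finite portion of L*."""
    words: set[str] = set()
    for n in range(max_n + 1):
        words |= power(L, n)
    return words

L = {"a", "bb"}
print(sorted(power(L, 2)))       # ['aa', 'abb', 'bba', 'bbbb']
print(len(power(L, 3)))          # 8
print(len(kleene_up_to(L, 3)))   # 15
```

The count 15 = 1 + 2 + 4 + 8 matches the four rows shown in the example, including the row containing only the empty string.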

 Languages are sets of strings.
 Strings are sequences of characters.
 Characters are individual symbols.
 Alphabets are sets of characters.
Regular Expressions
 Regular expressions are a way of describing a language via a string representation.
 Regular expressions match strings in the language. They describe the general shape
of all strings in the language.
 They’re used extensively in software systems for string processing.
 Conceptually, regular expressions are strings describing how to assemble a larger
language out of smaller pieces.
 Each of the following is a regular expression over an alphabet A.
 The symbol ε is a regular expression that represents the language {ε}.
 The symbol ∅ is a regular expression that represents the empty language ∅. It is also
written as the pair ( ) (the empty expression).
 For any a ∈ A, the symbol a is a regular expression for the language {a}.
 If r is a regular expression, r* is a regular expression for the Kleene closure of the
language of r.
 If r₁ and r₂ are regular expressions, r₁r₂ is a regular expression for the concatenation of
the languages of r₁ and r₂.
 If r₁ and r₂ are regular expressions, r₁ ∪ r₂ is a regular expression for the union of the
languages of r₁ and r₂.
 If r is a regular expression, (r) is a regular expression with the same meaning as r.

 All regular expressions are formed in this way.

 Observe that a regular expression r is a special kind of a word (string) which uses
the letters of A and the five symbols: ( ) * ∪ ε
 The operator precedence for regular expressions, from highest to lowest:
(r)   r*   r₁r₂   r₁ ∪ r₂
 The language of a regular expression is the language described by that regular expression.
 The language L(r) over A defined by a regular expression r over A is as follows:
 L(ε) = {ε}
 L(∅) = ∅, the empty set.
 L(a) = {a}, where a is a letter in A.
 L(r*) = (L(r))* (the Kleene closure of L(r)).
 L(r₁ ∪ r₂) = L(r₁) ∪ L(r₂) (the union of the languages).
 L(r₁r₂) = L(r₁)L(r₂) (the concatenation of the languages).
 L((r)) = L(r)
 Parentheses will be omitted from regular expressions when possible. Since concatenation
and union of languages are associative, many parentheses may be omitted.
 By convention, * takes precedence over concatenation, and concatenation takes precedence
over ∪, so further parentheses may be omitted.

Let L be a language over A. If L is a regular language, then there is a regular expression
r over A such that L = L(r).

Let A = {a, b}. Each of the following is an expression r and its corresponding language L(r):
(a) Let r = a*.
Then L(r) consists of all powers of a, including the empty word ε.
(b) Let r = aa*.
Then L(r) consists of all positive powers of a, excluding the empty word.
(c) Let r = a ∪ b*.
Then L(r) consists of a or any word in b, that is, L(r) = {a, ε, b, b², . . .}.
(d) Let r = (a ∪ b)*.
Note L(a ∪ b) = {a} ∪ {b} = A; hence L(r) = A*, all words over A.
(e) Let r = (a ∪ b)*bb.
Then L(r) consists of the concatenation of any word in A* with bb, that is, all words ending in b².
(f) Let r = a ∧ b*.
L(r) does not exist since r is not a regular expression. (Specifically, ∧ is not one of the symbols
used for regular expressions.)
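Examples (a) to (e) can be tried out with Python's `re` module, which writes union as `|`, concatenation as juxtaposition, and the Kleene star as `*`, with the same precedence order as above. This is only a sketch: Python's pattern language is richer than the formal regular expressions defined here, and the helper name `in_language` is an assumption. `re.fullmatch` is used so that the whole word, not just a part of it, must match.

```python
import re

def in_language(r: str, word: str) -> bool:
    """True if the whole word matches the pattern r."""
    return re.fullmatch(r, word) is not None

print(in_language("a*", ""))            # True: (a) includes the empty word
print(in_language("aa*", ""))           # False: (b) excludes the empty word
print(in_language("a|b*", "bbb"))       # True: (c) a word in b
print(in_language("a|b*", "ab"))        # False: (c) is a alone, or only b's
print(in_language("(a|b)*", "abba"))    # True: (d) every word over {a, b}
print(in_language("(a|b)*bb", "abb"))   # True: (e) ends in bb
print(in_language("(a|b)*bb", "ab"))    # False: (e) does not end in bb
```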
 An automaton (plural: automata) is a mathematical model of a computing device.
 A finite automaton is a simple type of mathematical machine for determining
whether a string is contained within some language.

A finite state automaton (FSA) or, simply, an automaton M, consists of five parts:
 A finite set (alphabet) A of inputs.
 A finite set S of (internal) states.
 A subset Y of S (called accepting or “yes” states).
 An initial state s₀ in S.
 A next-state function F from S × A into S.

Such an automaton M is denoted by M = (A, S, Y, s₀, F).

 Some texts define the next-state function F : S × A → S by means of a collection of functions
fₐ : S → S, one for each a ∈ A. Setting F(s, a) = fₐ(s) shows that both definitions are equivalent.

State Diagram of an Automaton M
 An automaton M is usually defined by means of its state diagram D = D(M) rather
than by listing its five parts. The state diagram D = D(M) is a labeled directed graph
as follows.
 The vertices of D(M) are the states in S, and an accepting state is denoted by means of a
double circle.
 There is an arrow (directed edge) in D(M) from state sⱼ to state sₖ labeled by an input a
if F(sⱼ, a) = sₖ or, equivalently, if fₐ(sⱼ) = sₖ.
 The initial state s₀ is indicated by means of a special arrow which terminates at s₀ but has
no initial vertex.
 For each vertex sⱼ and each letter a in the alphabet A, there will be an arrow leaving sⱼ
which is labeled by a; hence the outdegree of each vertex is equal to the number of
elements in A. For notational convenience, we label a single arrow by all the inputs
which cause the same change of state rather than having an arrow for each such input.
The following defines an automaton M with two input symbols and three states:
 A = {a, b}, input symbols.
 S = {s₀, s₁, s₂}, internal states.
 Y = {, }, “yes” states.
 s₀, initial state.
 Next-state function defined explicitly by a list of functions or by the table below.


[Table: next-state function F, with one row per state and columns for the inputs a and b]
The state diagram D = D(M) of the automaton M above.

 Note that both a and b label the arrow from to since F(, a) = and F(, b) = .
Note also that the outdegree of each vertex is 2, the number of elements in A.

Language L(M) Determined by an Automaton M
 Each automaton M with input alphabet A defines a language over A, denoted by
L(M), as follows.
 Let w = a₁a₂ . . . aₘ be a string on A. Then w determines the following path in the state
diagram graph D(M), where s₀ is the initial state and F(sᵢ₋₁, aᵢ) = sᵢ for i ≥ 1:
P = (s₀, a₁, s₁, a₂, s₂, . . . , aₘ, sₘ)
 We say that M recognizes (accepts) the string/word w if the final state sₘ is an accepting state
in Y.
 The language L(M) of M is the set of all strings on A which are accepted by M.
Determine whether or not the automaton M in the Figure below accepts the strings/words:
; ; the empty word.

Path for string P = (, a, , , , ,,, )

The final state in the path is which is not an accepting state; hence is not accepted by M.

Path for string P = (, b, , , , ,,)

The final state in the path is which is an accepting state; hence is accepted by M.

The final state determined by the empty word ε is the initial state s₀, since no input is read.
Thus ε is accepted by M since s₀ ∈ Y.
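Tracing a path and checking the final state is easy to simulate. The sketch below stores the next-state function as a dictionary keyed by (state, input). The transition table and the set Y used here are assumptions chosen for illustration (an automaton over {a, b} that rejects any word containing two consecutive b's); they are not claimed to be the table of the example above.

```python
def path(F: dict[tuple[str, str], str], start: str, word: str) -> list[str]:
    """Return the path (s0, a1, s1, ..., am, sm) that word determines."""
    state = start
    trace = [state]
    for letter in word:
        state = F[(state, letter)]   # next-state function F(s, a)
        trace += [letter, state]
    return trace

def accepts(F: dict[tuple[str, str], str], start: str, Y: set[str], word: str) -> bool:
    """M accepts word if the final state of its path lies in Y."""
    return path(F, start, word)[-1] in Y

# Hypothetical automaton: s2 is a trap state reached after two consecutive b's.
F = {
    ("s0", "a"): "s0", ("s0", "b"): "s1",
    ("s1", "a"): "s0", ("s1", "b"): "s2",
    ("s2", "a"): "s2", ("s2", "b"): "s2",
}
Y = {"s0", "s1"}

print(path(F, "s0", "abba"))        # ['s0', 'a', 's0', 'b', 's1', 'b', 's2', 'a', 's2']
print(accepts(F, "s0", Y, "abba"))  # False
print(accepts(F, "s0", Y, "baab"))  # True
print(accepts(F, "s0", Y, ""))      # True: the path is just the initial state
```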

A phrase structure grammar or, simply, a grammar G consists of four parts:

 A set N of nonterminal symbols (also called variables)
 A set T of terminal symbols (the alphabet of the Grammar)
 A set P of production rules saying how each nonterminal can be replaced by a
string of terminals and nonterminals
 A start symbol S (which must be a nonterminal) that begins the derivation.

Such a grammar G is denoted by G = G(V, T, S, P) when we want to indicate its four parts.

V is a finite set (vocabulary) such that T is a subset of V and N = V − T.
 Terminals will be denoted by lowercase letters, a, b, c, . . .
 Nonterminals will be denoted by uppercase letters, A, B, C, . . ., with S as the start symbol.
 Also, Greek letters, α, β, . . ., will denote strings in V*, that is, arbitrary strings/words of
terminals and nonterminals.

The following defines a grammar G with S as the start symbol:

The productions may be abbreviated as follows:


Language L(G) of a Grammar
 Suppose w and w′ are words over the vocabulary set V of a grammar G.
 We write w ⇒ w′ if w′ can be obtained from w by using one of the productions; that
is, if there exist words u and v such that w = uαv and w′ = uβv and there is a
production α → β.
 Furthermore, we write
w ⇒⇒ w′ or w ⇒* w′
if w′ can be obtained from w using a finite number of productions.
 A sequence of steps where nonterminals are replaced by the right-hand side of a
production is called a derivation.

Now let G be a grammar with terminal set T. The language of G, denoted by L(G),
consists of all strings/words in T* that can be obtained from the start symbol S by the
above process; that is, L(G) = {w ∈ T* | S ⇒⇒ w}

 That is, L(G) is the set of strings of terminals derivable from the start symbol.
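The derivation relation can be explored by a breadth-first search over words of terminals and nonterminals. The sketch below uses a hypothetical grammar with productions S → aSb and S → ab, chosen only for illustration, and assumes every symbol is a single character. It also assumes that no production shortens a word, so words longer than `max_len` can be discarded safely.

```python
from collections import deque

def one_step(w: str, productions: list[tuple[str, str]]) -> set[str]:
    """All words obtained from w by one production: w = u alpha v, result = u beta v."""
    results = set()
    for alpha, beta in productions:
        i = w.find(alpha)
        while i != -1:
            results.add(w[:i] + beta + w[i + len(alpha):])
            i = w.find(alpha, i + 1)
    return results

def language_up_to(productions, terminals, start="S", max_len=6) -> set[str]:
    """Terminal words of length <= max_len derivable from the start symbol."""
    seen = {start}
    queue = deque([start])
    words = set()
    while queue:
        w = queue.popleft()
        if all(ch in terminals for ch in w):
            words.add(w)
        for nxt in one_step(w, productions):
            # Length pruning is valid only because no production shrinks a word.
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return words

PRODUCTIONS = [("S", "aSb"), ("S", "ab")]   # hypothetical example grammar
TERMINALS = {"a", "b"}
print(sorted(language_up_to(PRODUCTIONS, TERMINALS), key=len))
# ['ab', 'aabb', 'aaabbb']
```

For this grammar the derivable terminal words have the form of m a's followed by m b's with m > 0, for example S ⇒ aSb ⇒ aaSbb ⇒ aaabbb.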

Types of Grammars
 Grammars are classified according to the kinds of production which are allowed.
The following grammar classification is due to Noam Chomsky.
 A Type 0 grammar has no restrictions on its productions.
 Types 1, 2, and 3 are defined as follows:
 A grammar G is said to be of Type 1 if every production is of the form
α → β where |α| ≤ |β|, or of the form α → ε.
 A grammar G is said to be of Type 2 if every production is of the form
A → β, where the left side A is a nonterminal.
 A grammar G is said to be of Type 3 if every production is of the form A → a or A → aB, that is,
where the left side A is a single nonterminal and the right side is a single
terminal or a terminal followed by a nonterminal, or of the form S → ε.
 Observe that the grammars form a hierarchy; that is, every Type 3 grammar is a
Type 2 grammar, every Type 2 grammar is a Type 1 grammar, and every Type 1
grammar is a Type 0 grammar.
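The classification can be turned into a simple check on the shape of each production. The sketch below assumes the usual reading of the definitions: Type 1 requires |α| ≤ |β| or an empty right side, Type 2 requires a single nonterminal on the left, and Type 3 requires the forms A → a, A → aB, or S → ε. Productions are (left, right) pairs of strings with one character per symbol, and the empty Python string stands for ε. The function names and the sample grammars are illustrative assumptions.

```python
def is_type3(productions, N, T, start="S") -> bool:
    def ok(alpha: str, beta: str) -> bool:
        if alpha == start and beta == "":
            return True                          # S -> empty word
        if len(alpha) != 1 or alpha not in N:
            return False                         # left side must be one nonterminal
        if len(beta) == 1:
            return beta in T                     # A -> a
        return len(beta) == 2 and beta[0] in T and beta[1] in N   # A -> aB
    return all(ok(a, b) for a, b in productions)

def is_type2(productions, N) -> bool:
    return all(len(a) == 1 and a in N for a, _ in productions)

def is_type1(productions) -> bool:
    return all(len(a) <= len(b) or b == "" for a, b in productions)

def grammar_type(productions, N, T, start="S") -> int:
    """Largest type number (3, 2, 1 or 0) whose restrictions all productions satisfy."""
    if is_type3(productions, N, T, start):
        return 3
    if is_type2(productions, N):
        return 2
    if is_type1(productions):
        return 1
    return 0

N, T = {"S", "A", "B"}, {"a", "b"}
print(grammar_type([("S", "aS"), ("S", "b")], N, T))                            # 3
print(grammar_type([("S", "aSb"), ("S", "ab")], N, T))                          # 2
print(grammar_type([("S", "AB"), ("AB", "BA"), ("A", "a"), ("B", "b")], N, T))  # 1
print(grammar_type([("S", "AB"), ("AB", "A")], N, T))                           # 0
```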
