Languages, Automata and Grammars Lecture Notes
Definitions
Consider a nonempty set A of symbols. A string/word w on the set A is a finite
sequence of its elements. For example, u = ababb and v = accbaaa are words on the
set A = {a, b, c}.
We will also abbreviate our notation and write a² for aa, a³ for aaa, and so on. Thus, for the
above words, u = abab² and v = ac²ba³.
The empty string has no characters and is denoted by ε (Greek letter epsilon) or
by λ (Greek letter lambda).
The set of all strings/words on A is denoted by A* (read: “A star”).
The length of a string/word u, written |u| or ℓ(u), is the number of elements in its
sequence of letters. For the above strings/words u and v, we have ℓ(u) = 5 and ℓ(v) = 7.
Also, ℓ(λ) = 0, where λ is the empty word.
LANGUAGES
Concatenation of Strings
Consider two strings/words u and v on the alphabet A. The concatenation of u and
v, written uv, is the word obtained by writing down the letters of u followed by the
letters of v.
Example
For the strings u = ababb and v = accbaaa,
uv = ababbaccbaaa = abab²ac²ba³
The operation is not commutative, e.g., uv ≠ vu for the above words u and v.
Subwords and Initial Segments
Consider any string/word u = a₁a₂…aₙ on an alphabet A. Any sequence w
= aᵢaᵢ₊₁…aₖ is called a subword/substring of u.
In particular, the subword/substring w = a₁a₂…aₖ, beginning with the first letter of u, is
called an initial segment of u.
In other words, w is a subword of u if u = v₁wv₂ for some words v₁ and v₂, and w is an
initial segment of u if u = wv for some word v.
Observe that ε and u are both subwords of u since u = εu.
Example
Consider the word u = abca. The subwords and initial segments of u follow:
Subwords: ε, a, b, c, ab, bc, ca, abc, bca, abca = u
Initial segments: ε, a, ab, abc, abca = u
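The two definitions can be checked mechanically. The following is a minimal sketch in Python, assuming words are represented as ordinary strings and the empty word as ""; the function names are illustrative and not part of the notes.

```python
def initial_segments(u):
    """All initial segments (prefixes) of u, from the empty word up to u."""
    return [u[:k] for k in range(len(u) + 1)]


def subwords(u):
    """All subwords (contiguous substrings) of u, including the empty word."""
    n = len(u)
    return {u[i:k] for i in range(n + 1) for k in range(i, n + 1)}


print(initial_segments("abca"))  # ['', 'a', 'ab', 'abc', 'abca']
print(sorted(subwords("abca"), key=lambda w: (len(w), w)))
```

Every initial segment is also a subword, which the sketch reflects: each prefix u[:k] is the slice u[0:k].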
Recall that A* denotes the set of all strings/words on an alphabet A.
A language over an alphabet A is a collection of words on A, that is, a subset of A*.
Example
Let A = {a, b}. The following are languages over A.
L₁ = {a, ab, ab², …}
consists of all words beginning with an a and followed by zero or more b’s.
L₂ = {aᵐbⁿ | m > 0, n > 0}
consists of all words beginning with one or more a’s followed by one or more b’s.
L₃ = {aᵐbᵐ | m > 0}
consists of all words beginning with one or more a’s and followed by the same number of
b’s.
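Since a language is just a set of words, each of the three languages described above can be modelled as a membership test. The sketch below is one possible encoding in Python; the function names are hypothetical, and Python's re module is used only as a convenient way to state the first two tests.

```python
import re


def a_then_bs(w):
    """One a followed by zero or more b's."""
    return re.fullmatch(r"ab*", w) is not None


def as_then_bs(w):
    """One or more a's followed by one or more b's."""
    return re.fullmatch(r"a+b+", w) is not None


def equal_as_and_bs(w):
    """m a's followed by exactly m b's, with m > 0."""
    m = len(w) // 2
    return m > 0 and w == "a" * m + "b" * m


for w in ["a", "abb", "aabbb", "aabb", "ba", ""]:
    print(repr(w), a_then_bs(w), as_then_bs(w), equal_as_and_bs(w))
```

The third test is written by direct comparison rather than with a pattern, since the requirement that the two counts agree is not something the first two patterns express.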
Language Concatenation
Suppose L and M are languages over an alphabet A. Then the “concatenation” of L and M,
denoted by LM, is the language defined as follows:
LM = {uv | u ∈ L, v ∈ M}
Example
If L₁ = { a, ba, bb } and L₂ = { aa, bb }, then
L₁L₂ = { aaa, abb, baaa, babb, bbaa, bbbb }
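For finite languages, concatenation can be computed directly as a set of pairwise string concatenations. The sketch below assumes languages are Python sets of strings; the name concat is illustrative, and the value used for the first language is chosen so that the product matches the six words listed in the example.

```python
def concat(L, M):
    """Concatenation LM: every word of L followed by every word of M."""
    return {u + v for u in L for v in M}


L1 = {"a", "ba", "bb"}   # assumed first language for the example
L2 = {"aa", "bb"}
print(sorted(concat(L1, L2)))
# ['aaa', 'abb', 'baaa', 'babb', 'bbaa', 'bbbb']
```

Note that concat(L, set()) is the empty language, while concat(L, {""}) returns L unchanged.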
Language Exponentiation
We can define what it means to “exponentiate” a language as follows:
L⁰ = {ε}
The set containing just the empty string.
This means any string formed by concatenating zero strings together is the empty
string.
Lⁿ⁺¹ = LLⁿ
This means concatenating (n+1) strings together works by concatenating
n strings, then concatenating one more.
An important operation on languages is the Kleene Closure, which is defined as
L* = { w ∈ Σ* | ∃n ∈ ℕ. w ∈ Lⁿ }
Mathematically: w ∈ L* if and only if ∃n ∈ ℕ. w ∈ Lⁿ
Intuitively, all possible ways of concatenating zero or more strings in L together,
possibly with repetition.
Theorem on Kleene Closure of L :
Example
If L = { a, bb }, then L* = {
ε,
a, bb,
aa, abb, bba, bbbb,
aaa, aabb, abba, abbbb, bbaa, bbabb, bbbba, bbbbbb,
…
}
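The rows of this example are the powers of L for n = 0, 1, 2, 3. The sketch below reproduces them, assuming languages are Python sets of strings. Because L* is infinite whenever L contains a nonempty word, the helper kleene_up_to (a hypothetical name) only collects the powers up to a chosen bound.

```python
def concat(L, M):
    """Concatenation of two languages given as sets of strings."""
    return {u + v for u in L for v in M}


def power(L, n):
    """n-th power of L: {""} for n = 0, then one more factor of L per step."""
    result = {""}
    for _ in range(n):
        result = concat(L, result)
    return result


def kleene_up_to(L, max_n):
    """Finite approximation of L*: the union of the powers 0..max_n."""
    out = set()
    for n in range(max_n + 1):
        out |= power(L, n)
    return out


L = {"a", "bb"}
for n in range(4):
    print(n, sorted(power(L, n), key=lambda w: (len(w), w)))
```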
Summary
Languages are sets of strings.
Strings are sequences of characters.
Characters are individual symbols.
Alphabets are sets of characters.
Regular Expressions
Regular expressions are a way of describing a language via a string representation.
Regular expressions match strings in the language. They describe the general shape
of all strings in the language.
They’re used extensively in software systems for string processing.
Conceptually, regular expressions are strings describing how to assemble a larger
language out of smaller pieces.
Each of the following is a regular expression over an alphabet A.
The symbol ε is a regular expression that represents the language {ε}.
The symbol ∅ is a regular expression that represents the empty language ∅. It is also
written as the pair ( ) (the empty expression).
For any a ∈ A, the symbol a is a regular expression for the language {a}.
If r is a regular expression, r* is a regular expression for the Kleene closure of the
language of r.
If r₁ and r₂ are regular expressions, r₁r₂ is a regular expression for the concatenation of
the languages of r₁ and r₂.
If r₁ and r₂ are regular expressions, r₁ ∪ r₂ is a regular expression for the union of the
languages of r₁ and r₂.
If r is a regular expression, (r) is a regular expression with the same meaning as r.
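The inductive definition above translates almost line by line into a recursive function. The following is a rough sketch, not a practical regex engine: regular expressions are encoded as nested tuples (an encoding assumed here for illustration), and the hypothetical function language returns only the words of length at most max_len, since the full language may be infinite.

```python
def language(r, max_len):
    """Words of length <= max_len in the language of the regular expression r."""
    kind = r[0]
    if kind == "empty":                      # the empty language
        return set()
    if kind == "eps":                        # the language {""}
        return {""}
    if kind == "sym":                        # a single symbol of the alphabet
        return {r[1]} if max_len >= 1 else set()
    if kind == "union":
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "cat":
        left, right = language(r[1], max_len), language(r[2], max_len)
        return {u + v for u in left for v in right if len(u + v) <= max_len}
    if kind == "star":
        base = language(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:                      # extend until no new words appear
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind!r}")


a, b = ("sym", "a"), ("sym", "b")
r = ("star", ("union", a, ("cat", b, b)))    # (a ∪ bb)*
print(sorted(language(r, 3), key=lambda w: (len(w), w)))
# ['', 'a', 'aa', 'bb', 'aaa', 'abb', 'bba']
```

Parentheses need no case of their own in this encoding, because the nesting of the tuples already records the grouping.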
FINITE STATE AUTOMATA
A finite state automaton (FSA) or, simply, an automaton M, consists of five parts:
A finite set (alphabet) A of inputs.
A finite set S of (internal) states.
A subset Y of S (called accepting or “yes” states).
An initial state s₀ in S.
A next-state function F from S × A into S.
[Table: next-state function F, with one row per state and one column for each of the inputs a and b]
Example
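The state diagram D = D(M) of the automaton M above is a labeled directed graph in which
an arrow labeled x runs from state s to state t whenever F(s, x) = t.
Note that both a and b label the same arrow from s to t when F(s, a) = t and F(s, b) = t.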
Note also that the outdegree of each vertex is 2, the number of elements in A.
Language L(M) Determined by an Automaton M
Each automaton M with input alphabet A defines a language over A, denoted by
L(M), as follows.
Let w = a₁a₂…aₘ be a string on A. Then w determines the following path in the state
diagram graph D(M), where s₀ is the initial state and F(sᵢ₋₁, aᵢ) = sᵢ for i ≥ 1:
P = (s₀, a₁, s₁, a₂, s₂, …, aₘ, sₘ)
We say that M recognizes (accepts) the string/word w if the final state sₘ is an accepting
state in Y.
The language L(M) of M is the set of all strings over A which are accepted by M.
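The path construction above amounts to a simple loop over the input. The sketch below is a minimal illustration, assuming the next-state function is stored as a Python dictionary keyed by (state, input) pairs. The particular automaton is a hypothetical example chosen for illustration (it accepts exactly the words over {a, b} with no two consecutive b's) and is not claimed to be the M of these notes.

```python
def final_state(F, s0, w):
    """Follow the path determined by w and return its last state."""
    state = s0
    for symbol in w:
        state = F[(state, symbol)]
    return state


def accepts(F, s0, Y, w):
    """M recognizes w if the path ends in an accepting state."""
    return final_state(F, s0, w) in Y


# Hypothetical automaton: s2 is a trap state reached after two b's in a row.
F = {
    ("s0", "a"): "s0", ("s0", "b"): "s1",
    ("s1", "a"): "s0", ("s1", "b"): "s2",
    ("s2", "a"): "s2", ("s2", "b"): "s2",
}
Y = {"s0", "s1"}

for w in ["", "baab", "ababba"]:
    print(repr(w), accepts(F, "s0", Y, w))
```

For the empty word the loop body never runs, so the final state is the initial state, and the word is accepted exactly when the initial state belongs to Y.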
Example
Determine whether or not the automaton M in the Figure below accepts the strings/words:
w₁ = …; w₂ = …; w₃ = ε, the empty word.
The final state determined by w₃ is the initial state s₀ since w₃ = ε is the empty string.
Thus w₃ is accepted by M since s₀ ∈ Y.
GRAMMAR
Grammar
Example
The following defines a grammar G with S as the start symbol:
Now let G be a grammar with terminal set T. The language of G, denoted by L(G),
consists of all strings/words over T that can be obtained from the start symbol S by the
above process; that is, L(G) = {w ∈ T* | S ⇒⇒ w}, where S ⇒⇒ w means that w is derived
from S by applying finitely many productions.
In other words, L(G) is the set of strings of terminals derivable from the start symbol.
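To make the definition of L(G) concrete, the sketch below enumerates the terminal words derivable from the start symbol by breadth-first search over sentential forms. Everything in it is an assumption made for illustration: uppercase letters stand for nonterminals, lowercase letters for terminals, productions are (left, right) pairs of strings, and the sample grammar S → aSb, S → ab is hypothetical rather than the grammar of the example above. The search discards forms longer than max_len, which is safe only when no production shortens a form.

```python
from collections import deque


def generate(productions, start, max_len):
    """Terminal words of length <= max_len derivable from the start symbol."""
    words, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        if not any(c.isupper() for c in form):
            words.add(form)                  # only terminals left: a word of L(G)
            continue
        for lhs, rhs in productions:
            i = form.find(lhs)
            while i != -1:                   # rewrite each occurrence of lhs
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(lhs, i + 1)
    return words


G = [("S", "aSb"), ("S", "ab")]              # hypothetical grammar
print(sorted(generate(G, "S", 6), key=len))  # ['ab', 'aabb', 'aaabbb']
```

For this sample grammar the derivable words are a's followed by the same number of b's, one of the languages described earlier in these notes.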
Types of Grammars
Grammars are classified according to the kinds of production which are allowed.
The following grammar classification is due to Noam Chomsky.
A Type 0 grammar has no restrictions on its productions.
Types 1, 2, and 3 are defined as follows:
A grammar G is said to be of Type 1 if every production is of the form α → β
where |α| ≤ |β|, or of the form α → ε.
A grammar G is said to be of Type 2 if every production is of the form A → β
where the left side A is a nonterminal.
A grammar G is said to be of Type 3 if every production is of the form A → a or A → aB, that is,
where the left side A is a single nonterminal and the right side is a single
terminal or a terminal followed by a nonterminal, or of the form S → ε.
Observe that the grammars form a hierarchy; that is, every Type 3 grammar is a
Type 2 grammar, every Type 2 grammar is a Type 1 grammar, and every Type 1
grammar is a Type 0 grammar.
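Because the types form a hierarchy, a grammar can be classified by finding the most restrictive type whose conditions all of its productions satisfy. The sketch below does this under stated assumptions: productions are (left, right) pairs of strings, uppercase letters are nonterminals, lowercase letters are terminals, the empty string stands for the empty word, and the Type 1 test uses the usual length condition (the left side is no longer than the right side, or the right side is empty). The function name grammar_type is illustrative.

```python
def grammar_type(productions, start="S"):
    """Largest k in {0, 1, 2, 3} such that every production is of Type k."""

    def is_type3(lhs, rhs):
        if len(lhs) != 1 or not lhs.isupper():
            return False
        if rhs == "":
            return lhs == start              # only S may produce the empty word
        if len(rhs) == 1:
            return rhs.islower()             # A -> a
        return len(rhs) == 2 and rhs[0].islower() and rhs[1].isupper()  # A -> aB

    def is_type2(lhs, rhs):
        return len(lhs) == 1 and lhs.isupper()

    def is_type1(lhs, rhs):
        return len(lhs) <= len(rhs) or rhs == ""

    for k, fits in ((3, is_type3), (2, is_type2), (1, is_type1)):
        if all(fits(lhs, rhs) for lhs, rhs in productions):
            return k
    return 0


print(grammar_type([("S", "aA"), ("A", "b")]))       # 3
print(grammar_type([("S", "aSb"), ("S", "ab")]))     # 2
print(grammar_type([("aS", "ab"), ("S", "aSb")]))    # 1
print(grammar_type([("AB", "a")]))                   # 0
```

The function classifies the grammar as written; the language it generates may also be generated by a grammar of a more restrictive type.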