Languages, Automata and Grammars Lecture Notes


LANGUAGES


Definitions
 
Consider a nonempty set A of symbols. A string/word w on the set A is a finite
sequence of its elements.

 For example, suppose A = {a, b, c}. Then the following sequences are strings on A:

u = ababb and v = accbaaa
When discussing strings/words on A, we frequently call A the alphabet, and its
elements are called characters.

 We will also abbreviate our notation and write a² for aa, a³ for aaa, and so on. Thus, for the
above words, u = abab² and v = ac²ba³.
 The empty string has no characters and is denoted ε (Greek letter epsilon), or
denoted by λ (Greek letter lambda).
 
The set of all strings/words on A is denoted by A* (read: “A star”).

 The length of a string/word u, written |u| or l(u), is the number of elements in its
sequence of letters. For the above strings/words u and v, we have l(u) = 5 and l(v) = 7.
Also, l(λ) = 0, where λ is the empty word.
LANGUAGES

Concatenation of Strings
 Consider two strings/words u and v on the alphabet A. The concatenation of u and
v, written uv, is the word obtained by writing down the letters of u followed by the
letters of v.
Example
For the strings u = ababb and v = accbaaa,
uv = ababbaccbaaa = abab²ac²ba³

 The concatenation operation for strings/words on an alphabet A is associative.

 The empty string/word ε is an identity element for the concatenation operation.


Adjoining the empty word before or after a word u does not change the word u.

The operation is not commutative, e.g., uv ≠ vu for the above words u and v.
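The properties above can be checked with ordinary Python strings. This is a minimal sketch in which `+` plays the role of concatenation and `""` the role of ε; u and v are chosen so that uv matches the example, and the third word `w` is an assumption added only to test associativity.

```python
# Words over the alphabet A = {a, b, c}, modelled as Python strings.
u = "ababb"
v = "accbaaa"
w = "cab"      # hypothetical extra word, used only for the associativity check
empty = ""     # plays the role of the empty word ε

uv = u + v                                # concatenation of u and v
assoc = (u + v) + w == u + (v + w)        # associativity
identity = empty + u == u == u + empty    # ε is an identity element
commutes = u + v == v + u                 # commutativity fails for these words

print(uv, len(uv), assoc, identity, commutes)
```

Note that `len(uv)` equals `len(u) + len(v)`, which mirrors the fact that lengths add under concatenation.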
LANGUAGES

Subwords and Initial Segments
 Consider any string/word u = a₁a₂ . . . aₙ on an alphabet A. Any sequence w
= aⱼaⱼ₊₁ . . . aₖ is called a subword/substring of u.
 In particular, the subword/substring w = a₁a₂ . . . aₖ, beginning with the first letter of u, is
called an initial segment of u.
 In other words, w is a subword of u if u = v₁wv₂, and w is an initial segment of u if u =
wv.
 Observe that ε and u are both subwords of u, since u = εu.
Example
Consider the word u = abca. The subwords and initial segments of u follow:
 Subwords: ε, a, b, c, ab, bc, ca, abc, bca, abca = u
 Initial segments: ε, a, ab, abc, abca = u
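The two lists in this example can be reproduced mechanically. The sketch below assumes words are Python strings; the helper names `subwords` and `initial_segments` are hypothetical and introduced only for illustration.

```python
def subwords(u):
    """Return the set of all contiguous substrings of u, including the empty word ''."""
    return {u[i:j] for i in range(len(u) + 1) for j in range(i, len(u) + 1)}


def initial_segments(u):
    """Return all prefixes of u, from the empty word up to u itself."""
    return [u[:k] for k in range(len(u) + 1)]


# Sort by length, then alphabetically, to mirror the order used in the example.
print(sorted(subwords("abca"), key=lambda w: (len(w), w)))
print(initial_segments("abca"))
```

Every initial segment is also a subword, which the code reflects: each prefix `u[:k]` is the slice `u[0:k]`.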
LANGUAGES
 Recall that A* denotes the set of all strings/words on an alphabet A.
 
A language over an alphabet A is a collection of words on A.

 Thus a language L is simply a subset of A*.

Example
Let A = {a, b}. The following are languages over A .
 L₁ = {a, ab, ab², . . .}
consists of all words beginning with an a and followed by zero or more b’s.
 L₂ = {aᵐbⁿ | m > 0, n > 0}
consists of all words beginning with one or more a’s followed by one or more b’s.
 L₃ = {aᵐbᵐ | m > 0}
consists of all words beginning with one or more a’s and followed by the same number of
b’s.
LANGUAGES

Language Concatenation
 Suppose L and M are languages over an alphabet A. Then the “concatenation” of L and M,
denoted by LM, is the language defined as follows:
LM = { uv | u ∈ L and v ∈ M }

Example
If L₁ = { a, ba, bb } and L₂ = { aa, bb }, then
L₁L₂ = { aaa, abb, baaa, babb, bbaa, bbbb }
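For finite languages, concatenation is a set comprehension over all pairs. This is a minimal sketch that models languages as Python sets of strings; the function name `concat` is hypothetical, and L₁ = {a, ba, bb} is taken as the language that yields the L₁L₂ listed in the example.

```python
def concat(L, M):
    """Return LM = { uv | u in L and v in M } for finite languages L and M."""
    return {u + v for u in L for v in M}


L1 = {"a", "ba", "bb"}
L2 = {"aa", "bb"}

print(sorted(concat(L1, L2)))
```

The result can have fewer than |L|·|M| elements in general, because two different pairs may concatenate to the same word; here all six products are distinct.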

Language Exponentiation
 We can define what it means to “exponentiate” a language as follows:
L⁰ = {ε}
The set containing just the empty string.
This means any string formed by concatenating zero strings together is the empty
string.
Lⁿ⁺¹ = LLⁿ
This means concatenating (n+1) strings together works by concatenating
n strings, then concatenating one more.
LANGUAGES
 
 An important operation on languages is the Kleene Closure, which is defined as
L* = { w ∈ A* | ∃n ∈ ℕ. w ∈ Lⁿ }
 Mathematically: w ∈ L* iff ∃n ∈ ℕ. w ∈ Lⁿ
 Intuitively, all possible ways of concatenating zero or more strings in L together,
possibly with repetition.
 
Theorem on Kleene Closure of L :

Example
If L = { a, bb }, then L* = {
ε,
a, bb,
aa, abb, bba, bbbb,
aaa, aabb, abba, abbbb, bbaa, bbabb, bbbba, bbbbbb,
...
}
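Since L* is infinite whenever L contains a nonempty word, code can only enumerate a finite part of it. The sketch below computes the powers Lⁿ and the union L⁰ ∪ L¹ ∪ ... ∪ Lᵏ for a chosen bound k; the helper names `concat`, `power`, and `kleene_up_to` are hypothetical.

```python
def concat(L, M):
    """Return LM = { uv | u in L and v in M }."""
    return {u + v for u in L for v in M}


def power(L, n):
    """Return L^n, using L^0 = {''} and L^(n+1) = L L^n."""
    result = {""}
    for _ in range(n):
        result = concat(L, result)
    return result


def kleene_up_to(L, k):
    """Return the finite approximation L^0 ∪ L^1 ∪ ... ∪ L^k of L*."""
    words = set()
    for n in range(k + 1):
        words |= power(L, n)
    return words


L = {"a", "bb"}
for n in range(4):
    print(n, sorted(power(L, n), key=lambda w: (len(w), w)))
```

Each printed row corresponds to one row of the listing in the example, from L⁰ = {ε} up to L³.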
LANGUAGES
Summary
 Languages are sets of strings.
 Strings are sequences of characters.
 Characters are individual symbols.
 Alphabets are sets of characters.
LANGUAGES
Regular Expressions
 Regular expressions are a way of describing a language via a string representation.
 Regular expressions match strings in the language. They describe the general shape
of all strings in the language.
 They’re used extensively in software systems for string processing.
 Conceptually, regular expressions are strings describing how to assemble a larger
language out of smaller pieces.
LANGUAGES
 
 Each of the following is a regular expression over an alphabet A.
 The symbol ε is a regular expression that represents the language {ε}.
 The symbol ∅ is a regular expression that represents the empty language ∅. It may
also be written as just the pair ( ) (the empty expression).
 For any a ∈ A, the symbol a is a regular expression for the language {a}.
 If r is a regular expression, r* is a regular expression for the Kleene closure of the
language of r.
 If r₁ and r₂ are regular expressions, r₁r₂ is a regular expression for the concatenation of
the languages of r₁ and r₂.
 If r₁ and r₂ are regular expressions, r₁ ∪ r₂ is a regular expression for the union of the
languages of r₁ and r₂.
 If r is a regular expression, (r) is a regular expression with the same meaning as r.

 All regular expressions are formed in this way.

 Observe that a regular expression r is a special kind of word (string) which uses
the letters of A and the five symbols: ( ) * ∪ ε
 The operator precedence for regular expressions, from highest to lowest:
(r) r* r₁r₂ r₁ ∪ r₂
LANGUAGES
 The language of a regular expression is the language described by that regular
expression.
 The language over A defined by a regular expression r over A is as follows:
 L(ε) = {ε}
 L(∅) = ∅, the empty set.
 L(a) = {a}, where a is a letter in A.
 L(r*) = (L(r))* (the Kleene closure of L(r)).
 L(r₁ ∪ r₂) = L(r₁) ∪ L(r₂) (the union of the languages).
 L(r₁r₂) = L(r₁)L(r₂) (the concatenation of the languages).
 L((r)) = L(r)
 Parentheses will be omitted from regular expressions when possible. Since concatenation
and union of languages are associative, many parentheses may be omitted. In addition,
* takes precedence over concatenation, and concatenation takes precedence over ∪.

Let L be a language over A. If L is a regular language, then there is a regular expression
r over A such that L = L(r).
LANGUAGES

Example
 
Let A = {a, b}. Each of the following is an expression r and its corresponding language L(r):
(a) Let r = a*.
Then L(r) consists of all powers of a, including the empty word ε.
(b) Let r = aa*.
Then L(r) consists of all positive powers of a, excluding the empty word.
(c) Let r = a ∪ b*.
Then L(r) consists of a or any word in b, that is, L(r) = {a, ε, b, b², . . .}.
(d) Let r = (a ∪ b)*.
Note L(a ∪ b) = {a} ∪ {b} = A; hence L(r) = A*, all words over A.
(e) Let r = (a ∪ b)*bb.
Then L(r) consists of the concatenation of any word over A with bb, that is, all words ending in b².
(f) Let r = a ∧ b*.
L(r) does not exist since r is not a regular expression. (Specifically, ∧ is not one of the symbols
used for regular expressions.)
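Examples (a) through (e) can be tried out with Python's `re` module, as a rough sketch. Two assumptions apply: the union operator is written `|` in Python's syntax, and `re.fullmatch` is used so that the whole word must match. Python's `re` supports many features beyond the formal definition given here, so only the operators `*`, `|`, concatenation, and parentheses are used. The helper `in_language` is a hypothetical name.

```python
import re


def in_language(pattern, word):
    """Return True if the whole word matches the pattern, i.e. word is in L(pattern)."""
    return re.fullmatch(pattern, word) is not None


# Patterns for examples (a) to (e), with the union operator written as "|".
patterns = ["a*", "aa*", "a|b*", "(a|b)*", "(a|b)*bb"]
words = ["", "a", "aaa", "bbb", "ab", "abb"]

for p in patterns:
    accepted = [w for w in words if in_language(p, w)]
    print(p, accepted)
```

The pattern `a|b*` is parsed as `a|(b*)`, which agrees with the precedence rule that * binds more tightly than union.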
FINITE STATE AUTOMATA
 An automaton (plural: automata) is a mathematical model of a computing device.
 A finite automaton is a simple type of mathematical machine for determining
whether a string is contained within some language.

 
A finite state automaton (FSA) or, simply, an automaton M, consists of five parts:
 A finite set (alphabet) A of inputs.
 A finite set S of (internal) states.
 A subset Y of S (called accepting or “yes” states).
 An initial state s₀ in S.
 A next-state function F from S × A into S.

Such an automaton M is denoted by M = (A, S, Y, s₀, F).

 Some texts define the next-state function F : S × A → S by means of a collection of functions
fₐ : S → S, one for each a ∈ A. Setting F(s, a) = fₐ(s) shows that both definitions are equivalent.
FINITE STATE AUTOMATA

State Diagram of an Automaton M
 An automaton M is usually defined by means of its state diagram D = D(M) rather
than by listing its five parts. The state diagram D = D(M) is a labeled directed graph
as follows.
 The vertices of D(M) are the states in S and an accepting state is denoted by means of a
double circle.
 There is an arrow (directed edge) in D(M) from state sⱼ to state sₖ labeled by an input a
if F(sⱼ, a) = sₖ or, equivalently, if fₐ(sⱼ) = sₖ.
 The initial state s₀ is indicated by means of a special arrow which terminates at s₀ but has
no initial vertex.
 For each vertex sⱼ and each letter a in the alphabet A, there will be an arrow leaving
sⱼ which is labeled by a; hence the outdegree of each vertex is equal to the number of
elements in A. For notational convenience, we label a single arrow by all the inputs
which cause the same change of state rather than having an arrow for each such
input.
FINITE STATE AUTOMATA

Example
 
The following defines an automaton M with two input symbols and three states:
 A = {a, b}, input symbols.
 S = {s₀, s₁, s₂}, internal states.
 Y = {s₀, s₁}, “yes” states.
 s₀, initial state.
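 Next-state function F : S × A → S defined explicitly by a list of values or by the table below.

F(s₀, a) = s₀, F(s₁, a) = s₀, F(s₂, a) = s₂
F(s₀, b) = s₁, F(s₁, b) = s₂, F(s₂, b) = s₂

F    a    b
s₀   s₀   s₁
s₁   s₀   s₂
s₂   s₂   s₂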
FINITE STATE AUTOMATA
Example
 
The state diagram D = D(M) of the automaton M above.

 Note that both a and b label the arrow from s₂ to s₂ since F(s₂, a) = s₂ and F(s₂, b) = s₂.
Note also that the outdegree of each vertex is 2, the number of elements in A.
FINITE STATE AUTOMATA

Language L(M) Determined by an Automaton M
 Each automaton M with input alphabet A defines a language over A, denoted by
L(M), as follows.
 Let w = a₁a₂ . . . aₘ be a string on A. Then w determines the following path in the state
diagram graph D(M), where s₀ is the initial state and F(sᵢ₋₁, aᵢ) = sᵢ for i ≥ 1:
P = (s₀, a₁, s₁, a₂, s₂, . . . , aₘ, sₘ)
 We say that M recognizes (accepts) the string/word w if the final state sₘ is an accepting state
in Y.
 The language L(M) of M is the set of all strings on A which are accepted by M.
FINITE STATE AUTOMATA
 
Example
Determine whether or not the automaton M defined above accepts the strings/words:
w₁ = ababba; w₂ = baab; w₃ = ε, the empty word.

Path for string w₁: P₁ = (s₀, a, s₀, b, s₁, a, s₀, b, s₁, b, s₂, a, s₂)

The final state in the path is s₂, which is not an accepting state; hence w₁ is not accepted by M.

Path for string w₂: P₂ = (s₀, b, s₁, a, s₀, a, s₀, b, s₁)

The final state in the path is s₁, which is an accepting state; hence w₂ is accepted by M.

The final state determined by w₃ is the initial state s₀, since w₃ = ε is the empty string. Thus w₃ is
accepted by M since s₀ ∈ Y.
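Running an automaton on a word amounts to repeated table lookup. The sketch below assumes a three-state automaton over {a, b} whose only rejecting state s2 is entered after two successive b's and never left; the transition table `F` is an assumption and should be replaced if the intended automaton differs. The names `path` and `accepts` are hypothetical.

```python
# Assumed automaton M = (A, S, Y, s0, F); the table F is an illustrative assumption.
A = {"a", "b"}
S = {"s0", "s1", "s2"}
Y = {"s0", "s1"}          # accepting ("yes") states
INITIAL = "s0"
F = {
    ("s0", "a"): "s0", ("s0", "b"): "s1",
    ("s1", "a"): "s0", ("s1", "b"): "s2",
    ("s2", "a"): "s2", ("s2", "b"): "s2",
}


def path(word):
    """Return the list of states visited while reading word, starting at INITIAL."""
    state = INITIAL
    states = [state]
    for letter in word:
        state = F[(state, letter)]   # one application of the next-state function
        states.append(state)
    return states


def accepts(word):
    """Return True if the final state reached on word is an accepting state."""
    return path(word)[-1] in Y


for w in ["ababba", "baab", ""]:
    print(repr(w), path(w), accepts(w))
```

For the empty word the loop body never runs, so the path consists of the initial state alone, which matches the argument given for the empty word.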
GRAMMAR
Grammar

A phrase structure grammar or, simply, a grammar G consists of four parts:


 A set N of nonterminal symbols (also called variables)
 A set T of terminal symbols (the alphabet of the Grammar)
 A set P of production rules saying how each nonterminal can be replaced by a
string of terminals and nonterminals
 A start symbol S (which must be a nonterminal) that begins the derivation.

Such a grammar G is denoted by G = G(V, T, S, P) when we want to indicate its four
parts.
V is a finite set (vocabulary) such that T is a subset of V and N = V − T.
GRAMMAR
 Terminals will be denoted by lowercase letters a, b, c, . . .
 Nonterminals will be denoted by uppercase letters A, B, C, . . . , with S as the start
symbol.
 Also, Greek letters α, β, . . . will denote strings over V, that is, arbitrary strings/words of
terminals and nonterminals.

Example
The following defines a grammar G with S as the start symbol:

The productions may be abbreviated as follows:


GRAMMAR

Language L(G) of a Grammar
 Suppose w and w′ are words over the vocabulary set V of a grammar G.
 We write w ⇒ w′ if w′ can be obtained from w by using one of the productions; that
is, if there exist words u and v such that w = uαv and w′ = uβv and there is a
production α → β.
 Furthermore, we write
w ⇒⇒ w′ or w ⇒* w′
if w′ can be obtained from w using a finite number of productions.
 A sequence of steps where nonterminals are replaced by the right-hand side of a
production is called a derivation.

Now let G be a grammar with terminal set T. The language of G, denoted by L(G),
consists of all strings/words in T* that can be obtained from the start symbol S by the
above process; that is, L(G) = {w ∈ T* | S ⇒⇒ w}

 That is, L(G) is the set of strings of terminals derivable from the start symbol.
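The relation ⇒ and the set L(G) can be explored by a breadth-first search over sentential forms. The sketch below uses a hypothetical grammar, chosen only for illustration, with productions S → AB, A → Aa, A → a, B → Bb, B → b. Pruning by length is valid here only because none of these productions shortens a word; that is an assumption of the sketch, not a general property of grammars.

```python
from collections import deque

# Hypothetical grammar, assumed for illustration: uppercase = nonterminal, lowercase = terminal.
PRODUCTIONS = [("S", "AB"), ("A", "Aa"), ("A", "a"), ("B", "Bb"), ("B", "b")]
TERMINALS = {"a", "b"}


def one_step(w):
    """Return every w2 with w => w2: replace one occurrence of a left side by its right side."""
    results = set()
    for alpha, beta in PRODUCTIONS:
        start = w.find(alpha)
        while start != -1:
            results.add(w[:start] + beta + w[start + len(alpha):])
            start = w.find(alpha, start + 1)
    return results


def language_up_to(max_len):
    """Return all terminal words of length <= max_len derivable from the start symbol S."""
    seen = {"S"}
    queue = deque(["S"])
    words = set()
    while queue:
        w = queue.popleft()
        if all(ch in TERMINALS for ch in w):
            words.add(w)              # w consists of terminals only, so w is in L(G)
            continue
        for nxt in one_step(w):
            # No production here shortens a word, so longer forms can be discarded.
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return words


print(sorted(language_up_to(4), key=lambda w: (len(w), w)))
```

For this grammar the search finds exactly the words aᵐbⁿ with m, n ≥ 1 within the length bound.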
GRAMMAR

Types of Grammars
 Grammars are classified according to the kinds of production which are allowed.
The following grammar classification is due to Noam Chomsky.
 A Type 0 grammar has no restrictions on its productions.
 Types 1, 2, and 3 are defined as follows:
 A grammar G is said to be of Type 1 if every production is of the form α → β
where |α| ≤ |β|, or of the form α → ε.
 A grammar G is said to be of Type 2 if every production is of the form A → β,
where the left side A is a nonterminal.
 A grammar G is said to be of Type 3 if every production is of the form A → a or A → aB, that is,
where the left side A is a single nonterminal and the right side is a single
terminal or a terminal followed by a nonterminal, or of the form S → ε.
 Observe that the grammars form a hierarchy; that is, every Type 3 grammar is a
Type 2 grammar, every Type 2 grammar is a Type 1 grammar, and every Type 1
grammar is a Type 0 grammar.
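The classification can be expressed as three checks applied from the most restrictive type downward. This is a rough sketch under stated assumptions: productions are pairs of strings, nonterminals are single uppercase letters, terminals are single lowercase letters, the empty word is `""`, and Type 1 is taken to mean |α| ≤ |β| or β = ε. The function names are hypothetical.

```python
def is_type3(productions):
    """Every production is A -> a, A -> aB, or S -> '' (empty word)."""
    for alpha, beta in productions:
        if alpha == "S" and beta == "":
            continue
        if not (len(alpha) == 1 and alpha.isupper()):
            return False
        single_terminal = len(beta) == 1 and beta.islower()
        terminal_then_nonterminal = len(beta) == 2 and beta[0].islower() and beta[1].isupper()
        if not (single_terminal or terminal_then_nonterminal):
            return False
    return True


def is_type2(productions):
    """Every production has a single nonterminal on its left side."""
    return all(len(alpha) == 1 and alpha.isupper() for alpha, _ in productions)


def is_type1(productions):
    """Every production satisfies |alpha| <= |beta|, or has the empty word on the right."""
    return all(len(alpha) <= len(beta) or beta == "" for alpha, beta in productions)


def grammar_type(productions):
    """Return the largest type number (3, 2, 1, or 0) that the productions satisfy."""
    if is_type3(productions):
        return 3
    if is_type2(productions):
        return 2
    if is_type1(productions):
        return 1
    return 0


print(grammar_type([("S", "aA"), ("A", "b")]))
print(grammar_type([("S", "aSb"), ("S", "ab")]))
print(grammar_type([("AB", "BA")]))
print(grammar_type([("AB", "a")]))
```

Testing the types in decreasing order reflects the hierarchy: a grammar that passes the Type 3 check would also pass the Type 2 and Type 1 checks.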
