Teoría de Autómatas Y Lenguajes Formales: Francisco Vico
Francisco Vico
Department: Lenguajes y Ciencias de la Computación
Area of knowledge: Ciencias de la Computación e Inteligencia Artificial
ETSI Informática
Universidad de Málaga
[email protected]
geb.uma.es/fjv
December 10, 2015
OpenCourseWare
Teoría de autómatas y lenguajes formales Universidad de Malaga
Foreword
It is the sign of the 21st century: collaboration and contribution to common initiatives. In
this sense, this manuscript has benefitted from the efforts of many people. Part of its content
was extracted from previous documents, kindly offered to the open access community by their
authors [Ruohonen, 2009; Jian et al, 2002]. The resulting manuscript follows the structure of
[Ramos and Morales, 2011], which is the reference book in Spanish for the subject Teoría de
autómatas y lenguajes formales (this is why the numbering of some headings is not always
consecutive). The selection and compilation of the material was carried out by students taking
the subject during the 2015-16 academic year: Iustina Andronic (Erasmus student) and Esteban
Delgado. It was curated by Francisco Vico, professor of Computer Science and Artificial
Intelligence at the University of Malaga, who is responsible for the content of this subject in
the university's OpenCourseWare programme.
A software package that implements the main concepts in this manuscript is also under
development for the programming language Octave, and it has been made public under a CC0
license at https://bitbucket.org/fjvico/umafol. The latest version of this
document can also be accessed at http://j.mp/talf_ocw.
This is just version 1.0; the coming years will see it develop, grow richer and accommodate
new knowledge and experience.
Francisco Vico
Málaga, December 8, 2015
Vico, F. (2014) Teoría de autómatas y lenguajes formales.
OCW-Universidad de Málaga. http://ocw.uma.es.
Under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Spain license.
Notation
ℕ the set of nonnegative integers (or natural numbers), i.e., {0, 1, 2, … } (U+2115)
ℙ the set of positive integers (U+2119)
ℝ the set of real numbers (U+211D)
ℤ the set of integers (U+2124)
∅ the empty set (U+2205)
⊆ the (infix) subset relation between sets (U+2286)
⊂ the (infix) proper subset relation between sets (U+2282)
∪ the infix union operation on sets (U+222A)
∩ the infix intersection operation on sets (U+2229)
~ the prefix complementation operation on sets (U+007E)
− the infix set difference operation on sets (U+2212)
× the infix cartesian product of sets (U+00D7)
A^n the postfix n-fold cartesian product of A, i.e., A × … × A (n times)
2^A the powerset of A
1 Regular languages
Languages and grammars
Definition 2.1. An alphabet is a finite nonempty set of symbols. Symbols are assumed to be
indivisible.
Definition 2.2. A string over an alphabet Σ is a finite sequence of symbols of Σ.
Definition 2.3. A special string which contains no symbols at all is the empty string, and it is
represented by ε (sometimes λ or Λ).
Definition 2.4. The set of all strings over an alphabet Σ is denoted by Σ*, and the set of all
nonempty strings over Σ is denoted by Σ+. The empty set of strings is denoted by ∅.
For any string x, the empty string is the identity for concatenation: εx = xε = x. For any
string x and integer n ≥ 0, we use x^n to denote the string formed by sequentially
concatenating n copies of x.
These are examples of word concatenation over the alphabet {a, b, c}:
x = aacbba, y = caac, xy = aacbbacaac
x = aacbba, y = ε, xy = x = aacbba
x = ε, y = caac, xy = y = caac
Concatenation is associative, i.e.,
x(yz) = (xy)z.
As a rule it is not commutative, i.e.,
xy ≠ yx,
but not always, and in the case of a unary alphabet concatenation is obviously commutative.
Definition 2.14. The nth (concatenation) power of the word x is
x^0 = ε
x^n = x x^(n−1) = xx…x (n copies).
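As a small illustration of the definitions above, Python strings can stand in for words, with the empty string playing the role of ε and `+` acting as concatenation. The helper name `power` and the constant `EPSILON` are chosen here for illustration only and do not come from the course software.

```python
EPSILON = ""  # the empty string ε


def power(x: str, n: int) -> str:
    """Return x^n, following x^0 = ε and x^n = x x^(n-1)."""
    if n < 0:
        raise ValueError("n must be a nonnegative integer")
    return EPSILON if n == 0 else x + power(x, n - 1)


x, y = "aacbba", "caac"
print(x + y)           # concatenation xy: aacbbacaac
print(power("ab", 3))  # ababab
```

The recursion mirrors the inductive definition directly; in everyday Python one would simply write `x * n`.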
Definition 2.9.
Nearly every characterization problem is algorithmically decidable for regular
languages. The most common ones are the following (where L or L_1 and L_2 are given regular
languages):
Emptiness Problem: Is the language L empty (i.e., does it equal ∅)?
It is fairly easy to check, for a given finite automaton recognizing L, whether or not
there is a state transition chain from an initial state to a terminal state.
Inclusion Problem: Is the language L_1 included in the language L_2?
Clearly L_1 ⊆ L_2 if and only if L_1 − L_2 = ∅.
Equivalence Problem: Is L_1 = L_2?
Clearly L_1 = L_2 if and only if L_1 ⊆ L_2 and L_2 ⊆ L_1.
Finiteness Problem: Is L a finite language?
It is fairly easy to check, for a given finite automaton recognizing L, whether or not it
has arbitrarily long state transition chains from an initial state to a terminal state.
Membership Problem: Is the given word w in the language L or not?
Using a given finite automaton recognizing L it is easy to check whether or not it
accepts the given input word w.
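The five checks above can be sketched for deterministic automata as follows. This is a minimal sketch under stated assumptions: the tuple encoding `(states, alphabet, start, delta, accepting)`, all function names, and the two example automata are hypothetical, and both automata in an inclusion test are assumed to share one alphabet and to have total transition functions.

```python
from collections import deque


def reachable(delta, sources, alphabet):
    """States reachable from any state in `sources` (sources included)."""
    seen, queue = set(sources), deque(sources)
    while queue:
        q = queue.popleft()
        for a in alphabet:
            r = delta[(q, a)]
            if r not in seen:
                seen.add(r)
                queue.append(r)
    return seen


def is_empty(dfa):
    """Emptiness: no terminal state is reachable from the initial state."""
    _, alphabet, start, delta, accepting = dfa
    return not (reachable(delta, [start], alphabet) & set(accepting))


def accepts(dfa, word):
    """Membership: follow the state transition chain of `word`."""
    _, _, start, delta, accepting = dfa
    q = start
    for a in word:
        q = delta[(q, a)]
    return q in accepting


def difference(d1, d2):
    """Product automaton recognizing L_1 − L_2."""
    s1, alphabet, q1, t1, a1 = d1
    s2, _, q2, t2, a2 = d2
    states = {(p, q) for p in s1 for q in s2}
    delta = {((p, q), a): (t1[(p, a)], t2[(q, a)])
             for (p, q) in states for a in alphabet}
    accepting = {(p, q) for (p, q) in states if p in a1 and q not in a2}
    return states, alphabet, (q1, q2), delta, accepting


def is_included(d1, d2):
    """Inclusion: L_1 ⊆ L_2 if and only if L_1 − L_2 = ∅."""
    return is_empty(difference(d1, d2))


def is_equivalent(d1, d2):
    """Equivalence: inclusion in both directions."""
    return is_included(d1, d2) and is_included(d2, d1)


def is_finite(dfa):
    """Finiteness: no cycle through a state that is reachable from the
    initial state and from which a terminal state is reachable."""
    _, alphabet, start, delta, accepting = dfa
    useful = {q for q in reachable(delta, [start], alphabet)
              if reachable(delta, [q], alphabet) & set(accepting)}
    for q in useful:
        successors = {delta[(q, a)] for a in alphabet}
        if q in reachable(delta, successors, alphabet):
            return False  # q lies on a cycle, so chains can be arbitrarily long
    return True


# Hypothetical examples: D1 recognizes words ending in 10, D2 words ending in 0.
D1 = ({"A", "B", "C"}, ["0", "1"], "A",
      {("A", "0"): "A", ("A", "1"): "B", ("B", "0"): "C",
       ("B", "1"): "B", ("C", "0"): "A", ("C", "1"): "B"}, {"C"})
D2 = ({"e", "z"}, ["0", "1"], "e",
      {("e", "0"): "z", ("e", "1"): "e", ("z", "0"): "z", ("z", "1"): "e"}, {"z"})

print(accepts(D1, "110"), is_included(D1, D2), is_equivalent(D1, D2))
```

The product construction is one standard way to obtain an automaton for L_1 − L_2; it is used here because the text reduces inclusion to an emptiness test on that difference.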
Definition 2.20. For any alphabet Σ, a language over Σ is a set of strings over Σ. The
members of a language are also called the words of the language.
Definition 2.30. A grammar is a quadruple (Σ, V, S, P), where:
1. Σ is a finite nonempty set called the terminal alphabet. The elements of Σ are called
the terminals.
2. V is a finite nonempty set disjoint from Σ. The elements of V are called the
nonterminals or variables.
3. S ∈ V is a distinguished nonterminal called the start symbol.
4. P is a finite set of productions (or rules) of the form
α→β
where α ∈ (Σ∪V)*V(Σ∪V)* and β ∈ (Σ∪V)*, i.e., α is a string of terminals and
nonterminals containing at least one nonterminal and β is a string of terminals and
nonterminals.
The sequence γ_1 ⇒ σ_1 ⇒ ... ⇒ σ_n ⇒ γ_2 is called a derivation of γ_2 from γ_1.
The words in L(G) are also called the sentences of L(G).
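To make derivations concrete, the following sketch enumerates the sentences of a grammar up to a length bound by breadth-first search over derivation steps. The encoding (single-character symbols, productions as pairs of strings), the function names, and the example grammar S → aSb | ab are all assumptions made for illustration. The pruning step additionally assumes that no production shortens a string.

```python
from collections import deque


def one_step(form, productions):
    """All strings obtained from `form` by applying one production once."""
    results = set()
    for alpha, beta in productions:
        i = form.find(alpha)
        while i != -1:
            results.add(form[:i] + beta + form[i + len(alpha):])
            i = form.find(alpha, i + 1)
    return results


def sentences(terminals, start, productions, max_len):
    """Sentences of length <= max_len, found by breadth-first derivation.

    Assumes |alpha| <= |beta| for every production, so any sentential form
    longer than max_len can be discarded safely.
    """
    seen, queue, found = {start}, deque([start]), set()
    while queue:
        form = queue.popleft()
        if all(symbol in terminals for symbol in form):
            found.add(form)  # only terminals left: this is a sentence
            continue
        for nxt in one_step(form, productions):
            if len(nxt) <= max_len and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return found


# Hypothetical example grammar: S → aSb | ab
P = [("S", "aSb"), ("S", "ab")]
print(sorted(sentences({"a", "b"}, "S", P, 6), key=len))
```

Each path explored by the search corresponds to a derivation S ⇒ σ_1 ⇒ ... ⇒ w in the sense defined above.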
Let G = (Σ, V, S, P) be a grammar.
Chomsky's Hierarchy. In Chomsky's hierarchy grammars are divided into four types:
Type 0: No restrictions.
Type 1: CS grammars.
Type 2: CF grammars.
Type 3: Linear grammars having productions of the form X_i → wX_j or X_j → w,
where X_i and X_j are nonterminals and w ∈ Σ_T* (a string of terminals), the so-called
right-linear grammars.
Grammars of Types 1 and 2 generate the so-called CS-languages and CF-languages,
respectively; the corresponding families of languages are denoted by CS and CF. Languages
generated by Type 0 grammars are called computably enumerable languages (CE-languages);
the corresponding family is denoted by CE.
Definition 2.58. G is also called a Type-0 grammar or an unrestricted grammar.
Definition 2.59. G is a Type-1 or context-sensitive grammar if each production α→β in P
satisfies |α| ≤ |β|. By "special dispensation," we also allow a Type-1 grammar to have the
production S→ε, provided S does not appear on the right-hand side of any production.
Definition 2.61. G is a Type-3 or right-linear or regular grammar if each production has one
of the following two forms:
A→ cB
A→ c
where A, B are nonterminals (with B = A allowed) and c is a terminal.
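The conditions in Definitions 2.59 and 2.61 are purely syntactic, so they can be checked production by production. The sketch below assumes single-character symbols and productions given as (α, β) pairs of strings; the function names and the two example grammars are hypothetical.

```python
def is_right_linear(terminals, nonterminals, productions):
    """True if every production has the form A → cB or A → c."""
    for alpha, beta in productions:
        if alpha not in nonterminals:
            return False  # left-hand side must be a single nonterminal
        if len(beta) == 1 and beta in terminals:
            continue      # form A → c
        if len(beta) == 2 and beta[0] in terminals and beta[1] in nonterminals:
            continue      # form A → cB
        return False
    return True


def is_context_sensitive(start, productions):
    """True if |α| <= |β| for every production, allowing S → ε only
    when S does not appear on any right-hand side."""
    start_on_rhs = any(start in beta for _, beta in productions)
    for alpha, beta in productions:
        if alpha == start and beta == "":
            if start_on_rhs:
                return False
        elif len(alpha) > len(beta):
            return False
    return True


# Hypothetical grammars used only to exercise the checks.
G3 = [("S", "0S"), ("S", "1A"), ("A", "0")]
G2 = [("S", "aSb"), ("S", "ab")]
print(is_right_linear({"0", "1"}, {"S", "A"}, G3))  # True
print(is_right_linear({"a", "b"}, {"S"}, G2))       # False
print(is_context_sensitive("S", G2))                # True
```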
The positive closure of L, denoted by L+, is the language
L+ = ∪_{i≥1} L^i.
In other words, the Kleene closure of a language L consists of all strings that can be formed
by concatenating zero or more words from L. For example, if L = {0, 01}, then
LL = {00, 001, 010, 0101}, and L* comprises all binary strings in which every 1 is preceded
by a 0. Note that concatenating zero words always gives the empty string, and that a string
with no 1s in it still makes the condition on "every 1" true. L+ has the meaning "concatenate
one or more words from L," and satisfies the properties L* = L+ ∪ {ε} and L+ = LL*.
Furthermore, for any language L, L* always contains ε, and L+ contains ε if and only if L
does. Also note that Σ* is in fact the Kleene closure of the alphabet Σ when Σ is viewed as a
language of words of length 1, and Σ+ is just the positive closure of Σ.
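The example L = {0, 01} can be checked mechanically. Since L* is infinite whenever L contains a nonempty word, the sketch below only computes the words of L* and L+ up to a length bound; the function names are chosen here for illustration.

```python
def concat(L1, L2):
    """The concatenation L1 L2 of two finite languages."""
    return {x + y for x in L1 for y in L2}


def kleene_upto(L, max_len):
    """All words of L* whose length is at most max_len."""
    result, frontier = {""}, {""}
    while frontier:
        # Extend the newest words by one more word of L, keeping only new ones.
        frontier = {w for w in concat(frontier, L) if len(w) <= max_len} - result
        result |= frontier
    return result


def positive_upto(L, max_len):
    """All words of L+ = L L* whose length is at most max_len."""
    return {w for w in concat(L, kleene_upto(L, max_len)) if len(w) <= max_len}


L = {"0", "01"}
print(sorted(concat(L, L)))          # ['00', '001', '010', '0101']
print(sorted(kleene_upto(L, 3), key=len))
```

The bounded sets satisfy the identity L* = L+ ∪ {ε} stated above, restricted to words within the length bound.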
Closure of the types of languages
Closure properties are often useful in constructing new languages from existing languages,
and for proving many theoretical properties of languages and grammars. The closure
properties of the four types of languages in the Chomsky hierarchy are summarized below.
Proofs may be found in [Harrison, 1978], [Hopcroft and Ullman, 1979], or [Gurari, 1989];
the closure of the CSLs under complementation is the famous Immerman-Szelepcsényi
theorem.
Theorem.
Regular expressions
Definition 3.1. The regular expressions over an alphabet Σ and the languages they represent
are defined inductively as follows.
1. The symbol ∅ is a regular expression, and represents the empty language.
2. The symbol ε is a regular expression, and represents the language whose only member
is the empty string, namely {ε}.
3. For each c ∈ Σ, c is a regular expression, and represents the language {c}, whose
only member is the string consisting of the single character c.
4. If r and s are regular expressions representing the languages R and S, then (r + s), (rs)
and (r*) are regular expressions that represent the languages R ∪ S, RS, and R*,
respectively.
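Because Definition 3.1 is inductive, the language of a regular expression can be computed by structural recursion, one case per clause. The sketch below assumes a hypothetical tuple encoding of expressions (`("empty",)`, `("eps",)`, `("sym", c)`, `("union", r, s)`, `("cat", r, s)`, `("star", r)`) and returns only the words up to a given length, since R* is usually infinite.

```python
def language(r, max_len):
    """Words of length <= max_len in the language represented by r."""
    kind = r[0]
    if kind == "empty":   # clause 1: the symbol ∅
        return set()
    if kind == "eps":     # clause 2: the symbol ε
        return {""}
    if kind == "sym":     # clause 3: a single character c
        return {r[1]} if max_len >= 1 else set()
    if kind == "union":   # clause 4: (r + s) represents R ∪ S
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "cat":     # clause 4: (rs) represents RS
        R, S = language(r[1], max_len), language(r[2], max_len)
        return {x + y for x in R for y in S if len(x + y) <= max_len}
    if kind == "star":    # clause 4: (r*) represents R*
        R = language(r[1], max_len)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {x + y for x in frontier for y in R
                        if len(x + y) <= max_len} - result
            result |= frontier
        return result
    raise ValueError(f"unknown expression kind: {kind!r}")


# The expression (0+1)*10 in the assumed encoding.
zero, one = ("sym", "0"), ("sym", "1")
ends_in_10 = ("cat", ("star", ("union", zero, one)), ("cat", one, zero))
print(sorted(language(ends_in_10, 3)))  # ['010', '10', '110']
```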
Finite-state automata
An automaton is finite if it has a finite memory, i.e., the automaton may be thought to be in
one of its (finitely many) (memory) states. A finite deterministic automaton is defined
formally by giving its states, input symbols (the alphabet), the initial state, rules for the state
transition, and the criteria for accepting the input word.
Definition 4.1. A finite (deterministic) automaton (DFA) is a quintuple M = (Q, Σ, q_0, δ, A)
where
Q = {q_0, q_1, …, q_m} is a finite set, the elements of which are called states;
Σ is the set of input symbols (the alphabet of the language);
q_0 is the initial state (q_0 ∈ Q);
δ is the (state) transition function which maps each pair (q_i, a), where q_i is a state and
a is an input symbol, to exactly one next state q_j: δ(q_i, a) = q_j;
A is the so-called set of terminal states (A ⊆ Q).
As its input the automaton M receives a word w = a_1 … a_n. Starting in the initial state q_0
and reading the first symbol a_1, it moves to the next state
q_j = δ(q_0, a_1).
Any word w = a_1 … a_n, be it an input or not, determines a so-called state transition chain of
the automaton M from a state q_{j_0} to a state q_{j_n}:
q_{j_0}, q_{j_1}, … , q_{j_n},
where always q_{j_{i+1}} = δ(q_{j_i}, a_{i+1}).
Any finite automaton can be represented graphically as a so-called state diagram. A state is
then represented by a circle enclosing the symbol of the state, and in particular a terminal
state is represented by a double circle.
A state transition δ(q_i, a) = q_j is represented by an arrow labelled by a, and in particular the
initial state is indicated by an incoming arrow.
Such a representation is in fact an edge-labelled directed graph.
Example. The automaton ({A, B, 10}, {0, 1}, A, δ, {10}) where δ is given by the state
transition table
[state transition table]
is represented by the state transition diagram
[state transition diagram]
The language recognized by the automaton is the regular language (0+1)*10.
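The example can be simulated directly. The transition table is not reproduced in this text, so the table `DELTA` below is an assumption: it is one three-state table over the states A, B and 10 that recognizes (0+1)*10, and it may differ from the one in the original figure. The names `chain` and `accepts` are likewise chosen for illustration.

```python
# Assumed transition table (not taken from the text): A is the initial state,
# B means "the last symbol read was 1", and 10 means "the input so far ends in 10".
DELTA = {
    ("A", "0"): "A",  ("A", "1"): "B",
    ("B", "0"): "10", ("B", "1"): "B",
    ("10", "0"): "A", ("10", "1"): "B",
}
START, TERMINAL = "A", {"10"}


def chain(word, state=START):
    """State transition chain determined by `word`, starting from `state`."""
    states = [state]
    for a in word:
        states.append(DELTA[(states[-1], a)])
    return states


def accepts(word):
    """A word is accepted if its chain from the initial state ends in a terminal state."""
    return chain(word)[-1] in TERMINAL


print(chain("0110"))    # ['A', 'A', 'B', 'B', '10']
print(accepts("0110"))  # True
print(accepts("0101"))  # False
```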
Definition 4.21. Defined formally, a nondeterministic finite automaton (NFA) is a quintuple
M = (Q, Σ, S, δ, A) where:
● Q, Σ and A are as for the deterministic finite automaton;
● S is the set of initial states;
● δ is the (state) transition function which maps each pair (q_i, a), where q_i is a state and
a is an input symbol, to exactly one subset T of the state set Q: δ(q_i, a) = T.
Definition 4.34. M accepts a word w if there is at least one terminal state in the set of states
δ̂*(S, w), i.e., the set of states that can be reached from S by reading w. The empty word Λ
is accepted if there is at least one terminal state in S. The set of exactly all words
accepted by M is the language L(M) recognized by M.
Definition 4.36. The nondeterministic finite automaton may be thought of as a generalization
of the deterministic finite automaton, obtained by identifying in the latter each state q_i with
the corresponding singleton set {q_i}. It is, however, no more powerful in recognition ability.
… and A_1 consists of exactly all sets of states having a nonempty intersection with A. The
states of M_1 are thus all sets of states of M. We clearly have δ_1*(q_0, w) = δ̂*(S, w), so M
and M_1 accept exactly the same words, and M_1 recognizes the language L.
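The construction of a deterministic automaton whose states are sets of states of M can be sketched as follows. Unlike the text, which takes all sets of states, this sketch builds only the sets that are reachable from S, which recognizes the same language with fewer states. The dictionary encoding of δ, the function names, and the example NFA are assumptions made for illustration.

```python
def step(delta, subset, a):
    """Union of δ(q, a) over all states q in `subset`; missing entries mean ∅."""
    return frozenset(r for q in subset for r in delta.get((q, a), ()))


def determinize(alphabet, initial, delta, accepting):
    """Build the reachable part of the set-of-states automaton."""
    start = frozenset(initial)
    dfa_delta, todo, seen = {}, [start], {start}
    while todo:
        subset = todo.pop()
        for a in alphabet:
            target = step(delta, subset, a)
            dfa_delta[(subset, a)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # Terminal sets are those having a nonempty intersection with A.
    dfa_accepting = {t for t in seen if t & set(accepting)}
    return seen, start, dfa_delta, dfa_accepting


def nfa_accepts(initial, delta, accepting, word):
    """Track the set of states reachable from the initial states by `word`."""
    current = frozenset(initial)
    for a in word:
        current = step(delta, current, a)
    return bool(current & set(accepting))


def dfa_accepts(start, dfa_delta, dfa_accepting, word):
    state = start
    for a in word:
        state = dfa_delta[(state, a)]
    return state in dfa_accepting


# Hypothetical NFA for (0+1)*10: guess at p when the final "10" begins.
NFA_DELTA = {("p", "0"): {"p"}, ("p", "1"): {"p", "q"}, ("q", "0"): {"r"}}
states, start, dfa_delta, dfa_accepting = determinize(
    ["0", "1"], {"p"}, NFA_DELTA, {"r"})
print(len(states))  # 3
print(dfa_accepts(start, dfa_delta, dfa_accepting, "0110"))  # True
```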
Regularity conditions
Proof. As is easily seen, in general
δ*(q_i, xy) = δ*(δ*(q_i, x), y).
So
δ*(q_0, wu) = δ*(δ*(q_0, w), u) = δ*(δ*(q_0, v), u) = δ*(q_0, vu).
If a regular language L is defined by a deterministic finite automaton M = (Q, Σ, q_0, δ, A)
recognizing it, then the minimization naturally starts from M. The first step is to remove all
idle states of M, i.e., states that cannot be reached from the initial state. After this we may
assume that all states of M can be expressed as δ*(q_0, w) for some word w. For the
minimization the states of M are partitioned into M-equivalence classes as follows. The states
q_i and q_j are not M-equivalent if there is a word u such that one of the states δ*(q_i, u) and
δ*(q_j, u) is terminal and the other one is not, denoted by q_i ≢_M q_j. If there is no such word
u, then q_i and q_j are M-equivalent, denoted by q_i ≡_M q_j. We may obviously assume
q_i ≡_M q_i. Furthermore, if q_i ≡_M q_j, then also q_j ≡_M q_i, and if q_i ≡_M q_j and
q_j ≡_M q_k it follows that q_i ≡_M q_k. Each equivalence class consists of mutually
M-equivalent states, and the classes are disjoint. (Cf. the L-equivalence classes and the
equivalence relation ≡_L.) Let us denote the M-equivalence class represented by the state q_i
by ⟨q_i⟩. Note that it does not matter which of the M-equivalent states is chosen as the
representative of the class. Let us then denote the set of all M-equivalence classes by Q.
M-equivalence and L-equivalence are related since ⟨δ*(q_0, w)⟩ = ⟨δ*(q_0, v)⟩ if and only if
[w] = [v]. Because now all states can be reached from the initial state, there are as many
M-equivalence classes as there are L-equivalence classes, i.e., the number given by the index
of L. Moreover, M-equivalence classes and L-equivalence classes are in a one-to-one
correspondence:
⟨δ*(q_0, w)⟩ ⇌ [w],
in particular ⟨q_0⟩ ⇌ [Λ].
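The M-equivalence classes can be computed by successive refinement: start from the split into terminal and non-terminal states (these are distinguished by the empty word) and keep splitting classes whose members lead to different classes on some symbol. This refinement scheme is a standard technique swapped in here for illustration; the text itself does not prescribe an algorithm. The function names and the example automaton are hypothetical.

```python
def reachable_states(alphabet, start, delta):
    """States that can be reached from the initial state (the non-idle states)."""
    seen, todo = {start}, [start]
    while todo:
        q = todo.pop()
        for a in alphabet:
            r = delta[(q, a)]
            if r not in seen:
                seen.add(r)
                todo.append(r)
    return seen


def equivalence_classes(alphabet, start, delta, accepting):
    """Partition the reachable states into M-equivalence classes."""
    symbols = sorted(alphabet)
    states = reachable_states(symbols, start, delta)  # remove idle states first
    block = {q: int(q in accepting) for q in states}  # split by the empty word
    while True:
        ids = {}
        refined = {}
        for q in states:
            # Two states stay together only if they are in the same class and
            # every symbol sends them to the same class.
            signature = (block[q],) + tuple(block[delta[(q, a)]] for a in symbols)
            refined[q] = ids.setdefault(signature, len(ids))
        if len(ids) == len(set(block.values())):
            break  # no class was split: the partition is stable
        block = refined
    classes = {}
    for q in states:
        classes.setdefault(block[q], set()).add(q)
    return {frozenset(c) for c in classes.values()}


# Hypothetical DFA for (0+1)*10 with a redundant state A2 and an idle state Z.
DELTA = {
    ("A", "0"): "A2", ("A", "1"): "B",
    ("A2", "0"): "A", ("A2", "1"): "B",
    ("B", "0"): "C",  ("B", "1"): "B",
    ("C", "0"): "A2", ("C", "1"): "B",
    ("Z", "0"): "Z",  ("Z", "1"): "C",
}
for cls in sorted(equivalence_classes({"0", "1"}, "A", DELTA, {"C"}), key=sorted):
    print(sorted(cls))
```

For this example the number of classes is three, which by the correspondence above is also the index of the language.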