automata-and-formal-languages-cheatsheet

The document provides an overview of automata and formal languages, detailing basic definitions, operations on words, and the structure of deterministic and non-deterministic finite state automata (DFA and NFA). It explains concepts such as the Kleene closure, palindromes, and the recognition of languages by automata. Additionally, it discusses the properties of regular languages and the role of transition functions in automata theory.
Automata and Formal Languages

1 Basic Notions

1.1 Basic Definitions

A formal language is a set of strings over finite sets of symbols. We have different ways of describing such languages:
• Finite automata
• Regular expressions
• Grammars
• Turing machines

Example: The set of all strings of tokens generated by the grammar that describes the Java programming language is a formal language.

An alphabet is a finite set of symbols such as letters, digits, etc. Greek letters (Σ, Γ, etc.) are usually used to denote alphabets. Examples:
• Σ1 = {a, b, c}
• Σ2 = {a, b, c, . . . , z}
• Γ = {a, b, c, $, □}

A word over an alphabet Σ is a string of finite length, in other words a finite sequence of symbols of Σ.

A (formal) language over an alphabet Σ is defined as a set of words over Σ. The letters w, u and v are usually used to denote words. The letters s, t and r are usually used for single symbols. Languages are usually denoted by capital letters. Examples of languages:
• L1 = {a, b, aab, abbcc}
• L2 = {acbb, accbb, acccbb, accccbb, . . . }

These languages are defined over the alphabet Σ = {a, b, c}.
• L1 is finite.
• L2 is composed of all the words starting with the symbol a, followed by at least one symbol c and ending with bb.

From now on, only finite alphabets are considered. A language can be infinite, but each of its elements is always finite.

The set of all words of length greater than or equal to 1 over an alphabet Σ is denoted by Σ+. Definition of Σ+:
1. any element of Σ belongs to Σ+
2. for all a ∈ Σ and for all w ∈ Σ+, we have aw ∈ Σ+

Example: Σ+ = {a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, . . . }, where Σ = {a, b}.

The set of all words of length greater than or equal to 0 over an alphabet Σ is denoted by Σ*. Some remarks:
• Σ+ and Σ* are infinite if and only if Σ ≠ ∅
• Σ* = Σ+ ∪ {ϵ}

1.2 Operations on Words

The concatenation operation of two words consists of appending the second word to the end of the first one. The concatenation of two words w1 = s1 s2 . . . sm and w2 = t1 t2 . . . tn is defined by w1 · w2 = s1 . . . sm t1 . . . tn. The concatenation operation is associative, but not commutative.

The operation that returns the length, i.e. the number of symbols of a word w, is denoted by |w|.

ϵ denotes a special word called the empty string (also denoted by λ). Some properties of ϵ:
• |ϵ| = 0
• For any word w, we have w · ϵ = ϵ · w = w.

The mirror operation associates to a word w = s1 s2 . . . sn its reverse, denoted by w^R = sn . . . s2 s1. Recursive definition:
1. ϵ^R = ϵ
2. s^R = s, for any symbol s
3. w^R = v^R · u^R, for any word w = u · v

If w = w^R then w is called a palindrome, i.e. a word that reads the same forwards and backwards.

The concatenation of a word w with itself n times is denoted by w^n, where n ≥ 0. Recursive definition:
1. w^0 = ϵ
2. w^(n+1) = w^n w
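The word operations above are straightforward to state in code. The following is a minimal sketch using Python strings as words; the helper names are mine, not from the text:

```python
def mirror(w: str) -> str:
    """w^R: the reverse of w, e.g. mirror('abc') == 'cba'."""
    return w[::-1]

def is_palindrome(w: str) -> bool:
    """w is a palindrome iff w == w^R."""
    return w == mirror(w)

def power(w: str, n: int) -> str:
    """w^n: w concatenated with itself n times (w^0 is the empty string)."""
    return w * n

print(mirror("abc"))          # cba
print(is_palindrome("abba"))  # True
print(power("ab", 3))         # ababab
```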
1.3 Operations on Languages

Operations on languages L1, L2 and L defined over an alphabet Σ:
• Union: L1 ∪ L2
• Intersection: L1 ∩ L2
• Complement: ~L = Σ*\L

Further operations include:
• Concatenation: L1 · L2 = {w1 · w2 | w1 ∈ L1 and w2 ∈ L2}
• Neutral element: Lϵ = {ϵ}
• Absorbance element: ∅

For the neutral element holds L · Lϵ = Lϵ · L = L. For the absorbance element holds L · ∅ = ∅ · L = ∅. Note that Lϵ ≠ ∅.

Example: let Σ = {a, b}, L1 = {a, b}, L2 = {bac, b, a}. Then

L1 · L2 = {abac, ab, aa, bbac, bb, ba}

The concatenation of the language L with itself n times is denoted by L^n, where n ≥ 0. Recursive definition:
1. L^0 = Lϵ = {ϵ}
2. L^(n+1) = L^n · L

The iterative closure or Kleene closure of a language L, denoted L*, is the set of words resulting from the concatenation of a finite number of words of L. Formally,

L* = L^0 ∪ L^1 ∪ L^2 ∪ L^3 ∪ · · · = ⋃_{i≥0} L^i

A variation of the Kleene closure called the Kleene plus is defined as

L+ = L^1 ∪ L^2 ∪ L^3 ∪ · · · = ⋃_{i≥1} L^i = L · L* = L* · L

Example: let L = {aa, bb, c}. Then

L* = {ϵ, aa, bb, c, aaaa, aabb, aac, bbaa, bbbb, bbc, caa, cbb, cc, . . . }
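For finite languages these operations can be sketched directly; `concat` and `kleene_star` are hypothetical helper names, and since L* is infinite whenever L contains a non-empty word, the closure is enumerated only up to a length bound:

```python
def concat(L1, L2):
    """L1 . L2 = {w1 . w2 | w1 in L1 and w2 in L2}."""
    return {w1 + w2 for w1 in L1 for w2 in L2}

def kleene_star(L, max_len):
    """Enumerate the words of L* of length at most max_len
    (an approximation: the full closure is infinite in general)."""
    result, frontier = {""}, {""}
    while frontier:
        frontier = {w + x for w in frontier for x in L
                    if len(w + x) <= max_len} - result
        result |= frontier
    return result

L1, L2 = {"a", "b"}, {"bac", "b", "a"}
print(sorted(concat(L1, L2)))             # the six words from the example above
print(sorted(kleene_star({"aa", "bb", "c"}, 4)))  # starts with the example's enumeration
```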
2 Regular Languages

2.1 Introduction

How do we specify, i.e. describe, languages? Finite languages: since a language is a set of words, any finite set can be exhaustively enumerated. Infinite languages: an exhaustive enumeration is not possible (in a finite time), so more general concepts are required.

Regular languages are the simplest class of languages. Each one of these formalisms defines this class of languages:
• Finite state automata
• Regular grammars
• Regular expressions

They all have the same expression power, i.e. they all describe the same things.

2.2 Finite State Automata

A finite state automaton (FSA) consists of the following three components:
• A finite set of states (drawn as circles).
• A transition function describing the actions which allow the transition from one state to another (drawn as arrows).
• The initial state, i.e. the state in which the automaton is initially (indicated by an arrow coming from nowhere pointing at it).

First, we will consider deterministic finite state automata (DFA) and then non-deterministic finite state automata (NFA). Both DFA and NFA are FSA.

Example: A switch changing alternately between two states "ON" and "OFF" can be modelled as a two-state automaton.

Example: Another example is that of a circular selector for radio channels LW, MW and SW. This selector can be turned in any direction from LW to SW passing by MW, but not directly from LW to SW.

2.2.1 Deterministic Finite State Automata

Deterministic finite state automata can be formally defined as a 4-tuple (Q, Σ, δ, q0), where:
• Q is a finite set of states
• Σ is a finite set of symbols, i.e. an alphabet
• δ : Q × Σ → Q is a transition function
• q0 ∈ Q is the initial state

The transition function δ : Q × Σ → Q expresses which symbol moves the automaton from one state to the next one. Transitions are drawn as arrows carrying their symbol as a label. Automata with an infinite number of states do not share the same properties as automata with finitely many states.

The extension of δ is the function δ* : Q × Σ* → Q, which provides the next state for a sequence of symbols. For a state q ∈ Q, a word w ∈ Σ* and a symbol s ∈ Σ, δ* is recursively defined as follows:
1. δ*(q, ϵ) = q
2. δ*(q, ws) = δ(δ*(q, w), s)

Example: For the radio selector, the transition function of the automaton is as follows:

δ(MW, right) = SW
δ(SW, left) = MW
δ(MW, left) = LW
δ(LW, right) = MW

Further, the following three uses of δ* are correct:

δ*(MW, left · right) = MW
δ*(LW, right · right · left) = MW
δ*(MW, right) = SW

A language recognizer is a FSA that answers whether a given word belongs to the language described by the automaton. An input word is recognized or accepted by a given recognizer if and only if a final state is reached. The set of all words recognized by a given recognizer forms the language modeled by the automaton and is called the language recognized by the automaton.

Formally defined, a DFA language recognizer is a DFA equipped with a set of final states. Final states are drawn as double circles. It is a 5-tuple (Q, Σ, δ, q0, F) where
• Q is a finite set of states
• Σ is a finite set of symbols, i.e. an alphabet
• δ : Q × Σ → Q is a transition function
• q0 ∈ Q is the initial state
• F ⊆ Q is a set of final states

The difference between a DFA and a DFA recognizer lies only in the additional final states, which are included in Q.

Informally, a word is recognized by a DFA if and only if there exists a path from the initial state to one of the final states. Formally, however, the following definition is used: a word w ∈ Σ* is recognized by a DFA M = (Q, Σ, δ, q0, F) iff δ*(q0, w) ∈ F.

The language recognized (or accepted) by a DFA M is denoted L(M). This is the set of all the words recognized by M.

Finally, we can define regular languages as follows: every language recognized by a finite state recognizer automaton is called a regular language.

Example: The language L2 = {acbb, accbb, acccbb, accccbb, . . . } is recognized by a suitable automaton (figure omitted).

Example: Let A be the DFA recognizer formally defined as

A = (Q, Σ, δ, q0, F)
Q = {q0, qf}
Σ = {0, 1}
F = {qf}
δ(q0, 0) = q0
δ(q0, 1) = qf
δ(qf, 0) = q0
δ(qf, 1) = qf
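The recognizer A and the extension δ* can be simulated directly. This is a minimal sketch in Python, with the transition table written as a dictionary:

```python
# Transition table of the DFA recognizer A (states q0 and qf).
DELTA = {("q0", "0"): "q0", ("q0", "1"): "qf",
         ("qf", "0"): "q0", ("qf", "1"): "qf"}

def delta_star(q, w):
    """Extension delta*: delta*(q, eps) = q, delta*(q, ws) = delta(delta*(q, w), s)."""
    for s in w:
        q = DELTA[(q, s)]
    return q

def recognizes(w):
    """w is recognized iff delta*(q0, w) is a final state."""
    return delta_star("q0", w) in {"qf"}

print(recognizes("101"))  # True: 101 ends in 1, i.e. an odd binary number
print(recognizes("110"))  # False: 110 ends in 0, i.e. even
```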
If we consider that the words over the alphabet Σ = {0, 1} denote the binary numbers, the language recognized by A is

L(A) = {x | x ∈ {0, 1}+ and x is odd}

Example: Let B be the DFA recognizer graphically represented in the figure (omitted). The language recognized by B is then

L(B) = {0^n x 1^m | n ≥ 2, m ≥ 2, x ∈ {0, 1}*}

A function is called totally defined if it is defined for all possible inputs. In finite state automata, δ is called totally defined if it is defined for all states and symbols.

2.2.2 Non-deterministic Finite State Automata

Up until now, we have seen DFA with the property that from one state, another state can be reached in a unique manner by δ. Non-deterministic finite state automata allow multiple transitions from a given state and a given symbol. Finite state automata in general can be defined as the union of DFA and NFA.

NFA and DFA have the same expression power. Everything that can be represented as a DFA can also be represented as a NFA and vice versa. The usage of non-determinism makes automata smaller and easier to understand.

Graphically, a non-deterministic choice is represented by several arrows leaving a state, with each arrow labeled with the same symbol.

The co-domain of the transition function is the power set of the set of states Q:

δ : Q × Σ → P(Q)

where P(Q) is the power set of Q. Formally defined, a non-deterministic automaton is a 5-tuple (Q, Σ, δ, q0, F) where
• Q is a finite set of states
• Σ is a finite set of symbols, i.e. an alphabet
• δ : Q × Σ → P(Q) is a transition function
• q0 ∈ Q is the initial state
• F ⊆ Q is a set of final states

Similarly to DFA, the extension δ* of δ finds the next states for a sequence of symbols. For a state q ∈ Q, a word w ∈ Σ* and a symbol s ∈ Σ, the function δ* : Q × Σ* → P(Q) is recursively defined as follows:
1. δ*(q, ϵ) = {q}
2. δ*(q, ws) = ⋃_{q′ ∈ δ*(q, w)} δ(q′, s)

Similarly to above, there exist NFA language recognizers. A word is recognized by a NFA if and only if there exists at least one path from the initial state to one of the final states. More formally: a word w ∈ Σ* is recognized by a NFA M = (Q, Σ, δ, q0, F) if and only if δ*(q0, w) ∩ F ≠ ∅. The language recognized by a NFA consists of the set of all words recognized by the automaton.

Example: The following NFA (figure omitted) accepts the word abaa since there exists at least one path from q0 to q3.
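The NFA extension δ*, which tracks a set of current states, can be sketched as follows. The automaton here is a hypothetical one accepting words ending in baa (not the automaton of the omitted figure), chosen so that abaa is accepted:

```python
# Hypothetical NFA: q3 is reached exactly by words ending in "baa".
DELTA = {("q0", "a"): {"q0"}, ("q0", "b"): {"q0", "q1"},
         ("q1", "a"): {"q2"}, ("q2", "a"): {"q3"}}

def delta_star(q, w):
    """delta*(q, eps) = {q}; delta*(q, ws) = union of delta(q', s) over q' in delta*(q, w)."""
    states = {q}
    for s in w:
        states = set().union(*(DELTA.get((p, s), set()) for p in states))
    return states

# w is recognized iff delta*(q0, w) contains a final state.
print(delta_star("q0", "abaa") & {"q3"})  # non-empty: abaa is accepted
```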
2.2.3 Equivalence of NFA and DFA

We are going to prove that for any NFA, there exists a DFA which is equivalent, i.e. if we consider the automata as black boxes, we are not able to distinguish them because they recognize the same language. More formally, two finite state automata M1 and M2 are equivalent if and only if they recognize the same language, i.e. L(M1) = L(M2).

The theorem states: for any non-deterministic finite state automaton there exists a deterministic finite state automaton which is equivalent, and vice versa. From this theorem follows:
• Non-determinism does not provide more expression power than determinism.
• NFA are generally more compact than equivalent DFA.

The first step of the proof is trivial, since any DFA is by definition a NFA (a deterministic FSA is a particular case of a non-deterministic FSA). The second step is not as easy. In order to show that for any NFA there exists a DFA, we construct a DFA from the NFA and prove that both accept the same language.

Let M = (Q, Σ, δ, q0, F) be a NFA. We construct a DFA M′ = (Q′, Σ, δ′, q0′, F′) as follows:
1. Q′ = P(Q). An element of Q′ will be denoted [q1, q2, . . . , qi] where q1, . . . , qi belong to Q. This element represents a single state of the deterministic automaton corresponding to a set of states of the NFA.
2. F′ is the set of members of Q′ composed of at least one final state of M.
3. q0′ = [q0]
4. δ′([q1, . . . , qi], s) = [p1, . . . , pj] ⇐⇒ δ({q1, . . . , qi}, s) = {p1, . . . , pj}

Now we have to show that every word w accepted by M is also accepted by M′, and that if M does not accept the word w, then M′ rejects w. We prove by induction on the length of w that

δ′*(q0′, w) = [q1, . . . , qi] ⇐⇒ δ*(q0, w) = {q1, . . . , qi}

Base case: trivial for |w| = 0, because q0′ = [q0] and w = ϵ.

Induction: assume the above is true for all words of length ≤ m. Let s ∈ Σ and ws be a string of length m + 1. By definition of δ′*,

δ′*(q0′, ws) = δ′(δ′*(q0′, w), s).

By induction hypothesis,

δ′*(q0′, w) = [p1, . . . , pj] ⇐⇒ δ*(q0, w) = {p1, . . . , pj}.

But by definition of δ′,

δ′([p1, . . . , pj], s) = [r1, . . . , rk] ⇐⇒ δ({p1, . . . , pj}, s) = {r1, . . . , rk}.

Consequently,

δ′*(q0′, ws) = [r1, . . . , rk] ⇐⇒ δ*(q0, ws) = {r1, . . . , rk}.

Finally, we only have to add that δ′*(q0′, w) ∈ F′ exactly when δ*(q0, w) contains a state of Q that belongs to F. Thus, L(M) = L(M′).

Example: Given a non-deterministic automaton (figure omitted), we eliminate the non-determinism according to the method given in the proof of the theorem.
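The construction from the proof can be sketched as a worklist algorithm that only builds the reachable part of Q′ = P(Q). The NFA used here is a small hypothetical example, not one from the text:

```python
# Hypothetical NFA to determinize.
NFA_DELTA = {("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"}, ("q1", "b"): {"q2"}}
SIGMA = {"a", "b"}
START, FINALS = "q0", {"q2"}

def determinize():
    """Build the DFA whose states are the reachable subsets of NFA states."""
    start = frozenset({START})
    delta, todo, seen = {}, [start], {start}
    while todo:
        R = todo.pop()
        for s in SIGMA:
            # delta'([q1..qi], s) = [p1..pj]  <=>  delta({q1..qi}, s) = {p1..pj}
            T = frozenset().union(*(NFA_DELTA.get((q, s), set()) for q in R))
            delta[(R, s)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    finals = {R for R in seen if R & FINALS}  # F': subsets containing a final state
    return delta, finals

delta, finals = determinize()
print(len({R for (R, _) in delta}))  # 3 reachable DFA states
```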
2.2.4 Automata with ϵ-transitions

Up to now, we have seen FSA where the labels of the transitions belong to the alphabet. We can extend this so that a transition may be caused by the empty string ϵ. Such transitions are called ϵ-transitions, empty transitions or spontaneous transitions. Empty transitions make the design and understanding of NFA easier.

A finite state automaton with ϵ-transitions is defined in the same manner as a classic non-deterministic automaton, except that the symbol ϵ is in the alphabet.

The concept of ϵ-closure can be formally defined as follows: for a state q in Q of an automaton M, ϵ-closure(q) is the set of all states of M reachable from q by a sequence of empty transitions:

ϵ-closure(q) = {p ∈ Q | (q, w) ⊢*_M (p, w)}
ϵ-closure(P) = ⋃_{q ∈ P} ϵ-closure(q)

where P denotes a set of states.

For a given NFA with empty transitions, we define a new transition function δ* that uses the ϵ-closure:
1. δ*(q, ϵ) = ϵ-closure(q)
2. For every w ∈ Σ* and s ∈ Σ: δ*(q, ws) = ϵ-closure(P), where P = {p | p ∈ δ(r, s) for an r ∈ δ*(q, w)}

We extend δ and δ* to a set of states R:

δ(R, s) = ⋃_{q ∈ R} δ(q, s)
δ*(R, w) = ⋃_{q ∈ R} δ*(q, w)

The introduction of ϵ-transitions does not provide any extra expression power: any language L recognized by a NFA with ϵ-transitions is also recognized by a NFA without ϵ-transitions.

The removal method of empty transitions of a non-deterministic automaton consists of computing the ϵ-closure for all states of the automaton, which defines the transition function δ*. Once the operation is performed, it is necessary to add the initial state to the set of final states when the empty string is recognized by the automaton with empty transitions.

Example: For a non-deterministic automaton with some empty transitions (figure omitted), the computation of the ϵ-closure provides the new transition function δ:

δ  | 0            | 1        | 2
q0 | {q0, q1, q2} | {q1, q2} | {q2}
q1 | ∅            | {q1, q2} | {q2}
q2 | ∅            | ∅        | {q2}

All that remains is to add q0 to the set of final states, since the empty string is recognized by the automaton.

2.2.5 Minimization of DFA
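Computing ϵ-closure(q) is a plain graph search over the empty transitions. The sketch below assumes a three-state chain of ϵ-transitions q0 → q1 → q2, in the spirit of the table above:

```python
# Assumed empty transitions: q0 -eps-> q1 -eps-> q2.
EPS = {"q0": {"q1"}, "q1": {"q2"}, "q2": set()}

def eps_closure(q):
    """All states reachable from q by a sequence of empty transitions."""
    closure, stack = {q}, [q]
    while stack:
        for p in EPS[stack.pop()]:
            if p not in closure:
                closure.add(p)
                stack.append(p)
    return closure

print(sorted(eps_closure("q0")))  # ['q0', 'q1', 'q2']
print(sorted(eps_closure("q1")))  # ['q1', 'q2']
```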
2.3 Properties of Regular Languages

Regular languages have a few properties, which are the following:

1. The union L1 ∪ L2 of two regular languages L1 and L2 is a regular language. Let M1 and M2 be two FSA such that L(M1) = L1 and L(M2) = L2. We can construct an automaton combining M1 and M2 (figure omitted) that recognizes the union: if w ∈ L1 (recognized by M1) or w ∈ L2 (recognized by M2), then w ∈ L1 ∪ L2 (recognized by the combined FSA).

2. The concatenation L1 · L2 of two regular languages L1 and L2 is a regular language. If w1 ∈ L1 and w2 ∈ L2, then the concatenation w1 · w2 ∈ L1 · L2 (recognized by an automaton chaining M1 and M2; figure omitted).

3. The complementation ~L = Σ*\L of a regular language L is a regular language as well. To construct an automaton accepting ~L, it suffices to transform every final state into a non-final state and every non-final state into a final state. The transition function must be total.

4. The iterative closure L* of a regular language L is a regular language.

5. The intersection L1 ∩ L2 of two regular languages L1 and L2 is a regular language. Indeed,

L1 ∩ L2 = ~(~L1 ∪ ~L2)

Previously, we saw that the complement of a regular language is regular (so ~L1 and ~L2 are regular), the union of two regular languages is regular, and the final complementation is regular as well. It follows that the intersection of two regular languages is regular.
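Property 3 can be illustrated by swapping the final and non-final states of the odd-binary-number DFA A from earlier; a minimal sketch:

```python
# The DFA A (odd binary numbers): its transition function is already total.
STATES = {"q0", "qf"}
DELTA = {("q0", "0"): "q0", ("q0", "1"): "qf",
         ("qf", "0"): "q0", ("qf", "1"): "qf"}
FINALS = {"qf"}

def recognizes(w, finals):
    q = "q0"
    for s in w:
        q = DELTA[(q, s)]
    return q in finals

# Complement: every non-final state becomes final and vice versa.
complement_finals = STATES - FINALS

print(recognizes("110", FINALS))             # False: 110 is even
print(recognizes("110", complement_finals))  # True: accepted by the complement
```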
2.4 Grammars

2.4.1 Introduction

Just like automata, a grammar is a formalism used for specifying languages. Each grammar corresponds to a type of machine or automaton:
• Regular grammars ≡ finite state automata
• Context-free grammars ≡ pushdown automata
• Grammars without restrictions ≡ Turing machines

Intuitively, a grammar is a system of rules which allows one to "rewrite" a term (or an expression) into another term. Although grammars and FSA are two similar formalisms, there is a difference from an operational standpoint:
• An automaton recognizes a language, i.e. consumes input symbols and determines whether a word belongs to a language.
• A grammar, on the other hand, generates words.

Grammars are sometimes called rewriting systems, and rules are called rewriting rules or productions. A rule is composed of a left-hand side α and a right-hand side β surrounding an arrow:

α → β

Both sides are composed of two types of symbols:
• The terminal symbols that compose the words generated by the grammar.
• The non-terminal symbols that are mainly used in the rewriting process.

The rewriting of a term into another consists of applying a rule to the original term, i.e. replacing the left-hand side where it appears in the term with the right-hand side of the rule, obtaining a new term. Starting the rewriting process from a special symbol called the axiom and applying the rules repeatedly, we obtain (not always) a term composed of terminal symbols only (no more rules can be applied). Such a term is a word generated by the grammar.

The set of words generated by the rewriting process is called the language generated by the grammar. The rewriting process can be endless, corresponding to a computer program which does not terminate. Since any rule can be selected during the rewriting process, the rewriting process is potentially non-deterministic.

The formal definition of a grammar is as follows: a grammar is a 4-tuple G = (VT, VN, P, S), where
• VT is a set of terminal symbols
• VN is a set of non-terminal symbols such that VT ∩ VN = ∅
• P is a set of rules α → β, where α and β are sequences of terminals or non-terminals, but α must contain at least one non-terminal
• S ∈ VN is the axiom or the initial symbol

Let G = (VT, VN, P, S) be a grammar and A → B a production of P. A sequence of terminals and non-terminals αAβ ∈ (VT ∪ VN)* can be derived into the sequence αBβ by applying the production A → B, i.e. by substituting B for A. We call the move from a sequence α to another sequence β by applying one production of the grammar a one-step derivation and denote it α ⇒G β. A derivation is a sequence of one-step derivations and is denoted α ⇒*G β.

The formal definition of the language generated by a grammar is as follows: the language L(G) generated by a grammar G = (VT, VN, P, S) is the set of all words which can be obtained by derivation from the axiom S and do not contain any non-terminals:

L(G) = {w | S ⇒*G w, w ∈ VT*}

Examples:
1. The grammar G = ({a, b}, {S, A}, P, S) in which P = {S → bbA, A → aaA, A → aa} generates the language L(G) = {bb} · {aa}^n, where n ≥ 1.
2. The (regular) grammar G = ({0, 1}, {S}, P, S) where P = {S → 1S, S → 0S, S → 1} generates L(G) = {0, 1}* · {1}.
3. The (context-free) grammar G = ({a, b}, {S}, P, S) where P = {S → aSb, S → ϵ} generates L(G) = {a^n b^n | n ≥ 0}.
4. The language L = {a^n b^n c^n | n ≥ 1} is generated by the grammar G = ({a, b, c}, {S, A, B, C, D}, P, S) where P = {S → aACD, A → aAC, A → ϵ, B → b, CD → BDc, CB → BC, D → ϵ}. This grammar is neither regular nor context-free.
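The rewriting process for the first example grammar can be sketched as a breadth-first enumeration of derivations; the helper `generate` is hypothetical, and terminals are lowercase while non-terminals are uppercase:

```python
# P = {S -> bbA, A -> aaA, A -> aa} from example 1 above.
RULES = [("S", "bbA"), ("A", "aaA"), ("A", "aa")]

def generate(max_steps=5):
    """All words derivable from the axiom S in at most max_steps one-step derivations."""
    words, frontier = set(), {"S"}
    for _ in range(max_steps):
        # apply every applicable rule once (a one-step derivation)
        frontier = {t[:i] + rhs + t[i + 1:]
                    for t in frontier
                    for lhs, rhs in RULES
                    for i in [t.find(lhs)] if i >= 0}
        # terms without non-terminals are generated words
        words |= {t for t in frontier if t.islower()}
    return words

print(sorted(generate()))  # ['bbaa', 'bbaaaa', 'bbaaaaaa', 'bbaaaaaaaa']
```

The output matches L(G) = {bb} · {aa}^n, n ≥ 1, truncated by the step bound.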
2.4.2 Regular Grammars

We formally define regular grammars as follows: a grammar is right-regular if all rewriting rules have the form α → β with the following restrictions:
1. |α| = 1, α ∈ VN
2. β = aB, β = a, or β = ϵ, with B ∈ VN and a ∈ VT

A grammar is left-regular if all productions have the form:
1. |α| = 1, α ∈ VN
2. β = Ba, β = a, or β = ϵ, with B ∈ VN and a ∈ VT

Any right- or left-regular grammar is simply called a regular grammar.

It is important to note that a regular grammar is either left-regular or right-regular, but not both (except for trivial cases). If the rules of right- and left-regular grammars are mixed, then the language generated by the grammar can be non-regular. Non-regular grammars generate more complex languages than regular ones.

FSA and regular grammars are equivalent: a language L is recognized by a FSA iff L is generated by a regular grammar.

2.4.3 Regular Expressions

Regular expressions are the third formalism, after FSA and regular grammars, that can be used to specify regular languages. We say that a regular expression denotes a regular language (automata recognize and grammars generate). Regular expressions are easy to integrate into software systems and are widely used for searching sub-strings in text, for instance in editors or web search engines. Regular expressions are also intensively used for syntax analysis and consequently for compiler design.

The formal definition of a regular expression is as follows: let Σ be an alphabet. A regular expression defined over Σ is recursively defined as follows:
1. ∅ is a regular expression that denotes the empty set
2. ϵ is a regular expression that denotes the set {ϵ}
3. For any a ∈ Σ, a is a regular expression that denotes the set {a}
4. If r and s are two regular expressions which denote the sets R and S respectively, then (r + s), (rs) and (r*) are regular expressions that denote the sets R ∪ S, R · S and R*.
5. Nothing else is a regular expression.

The language denoted by a regular expression r is written L(r).

Examples:
1. Let Σ = {a, b} and the regular expression r = (a(a + b)*); then L(r) = {a} · {a, b}*.
2. The regular expression r = (0*1)* + 0 denotes the set L(r) = (0* · {1})* ∪ {0}, i.e. L(r) = {x | x ∈ {0, 1}*, x represents an odd binary number} ∪ {ϵ, 0}.
3. The regular expression r = (a(ab)*(a + ϵ))* corresponds to an automaton M (figure omitted).

The following equalities hold for regular expressions:
• r + s = s + r
• (r + s) + t = r + (s + t)
• (rs)t = r(st)
• r(s + t) = rs + rt
• (r + s)t = rt + st
• ∅* = ϵ
• (r*)* = r*
• (ϵ + r)* = r*
• (r*s*)* = (r + s)*

Regular expressions are equivalent to FSA:
• Regular expressions to FSA: from any regular expression r, there exists a NFA with ϵ-transitions which recognizes L(r).
• DFA to regular expressions: if a language L is recognized by a DFA, then there exists a regular expression r that denotes L.
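The first example expression, r = (a(a + b)*), can be checked with Python's re module, whose syntax writes the union + as |; full-match against the pattern tests membership in L(r):

```python
import re

# r = (a(a+b)*) in Python's re syntax: union "+" becomes "|".
r = re.compile(r"a(a|b)*")

print(bool(r.fullmatch("a")))     # True:  a followed by the empty word
print(bool(r.fullmatch("abba")))  # True:  a followed by a word over {a, b}
print(bool(r.fullmatch("ba")))    # False: must start with a
```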
3 Context-free Languages

3.1 Introduction

Basic example of a non-regular language: L is composed of all words starting with a's and ending with the same number of b's:

L = {a^n b^n | n ≥ 0}

This language cannot be recognized by a FSA, nor generated by a regular grammar, nor denoted by a regular expression. The limitation comes from the fact that FSA have a finite set of states. Consequently, FSA cannot memorize an arbitrarily large number n of symbols a and verify that the same number of b's follows.

Non-regular languages include:
• Nested parentheses in arithmetic expressions
• Nested begin, end in programming languages
• Nested { and } in programming languages
• And a lot more. . .

Example (Stack): FSA cannot be used to model a stack with the operations push and pop, for which the number of push operations is arbitrarily large.

Note that if a language L is non-regular, a language L′ with L ⊂ L′ is not necessarily non-regular. Example: {a^n b^n | n ≥ 0} ⊂ {a, b}*; the left-hand side language is non-regular, whereas the right-hand side language is regular. The distinguishing characteristic is the structure.
3.2 Pushdown Automata

To overcome some limitations of FSA, pushdown automata have been devised:
• A pushdown automaton (PDA) is a kind of finite state automaton equipped with a stack that memorizes symbols.
• The transitions of a pushdown automaton depend on an input symbol but also on the symbol at the top of the stack.
• The size of the stack has no limit.

Example: Using the previous language {a^n b^n | n ≥ 1}, the PDA behaves as follows:
1. While a symbol a can be read, this symbol is pushed onto the stack.
2. When the first input symbol b comes, one symbol a is removed from the stack.
3. While a symbol b can be read, one symbol a is removed from the stack.
4. The process ends when the stack is empty and all the input symbols have been consumed.

Formal definition: a pushdown automaton PDA is a 6-tuple M = (Q, Σ, Γ, δ, q0, ◁), where:
• Q is a finite set of states
• Σ is an input alphabet
• Γ is an auxiliary alphabet (symbols of the stack) such that Γ ∩ Σ = ∅ (i.e. Γ and Σ share no elements)
• δ : Q × (Σ ∪ {ϵ}) × Γ → Q × Γ* is a transition function
• q0 ∈ Q is the initial state
• ◁ ∈ Γ is a special symbol which indicates that the stack is empty

The empty string ϵ can be used as an input symbol. This is useful to detect that the stack is empty.

A PDA that recognizes a language is defined similarly: we add a set of final states. This means that a recognizer is a 7-tuple M = (Q, Σ, Γ, δ, q0, ◁, F), where F ⊆ Q is the set of final states.

Convention: (a, Z | α) means:
• a is the input symbol
• Z is the symbol at the top of the stack

Example: Let us construct a PDA M which recognizes {a^n b^n | n ≥ 1}:

M = ({q0, q1, q2}, {a, b}, {A, ◁}, δ, q0, ◁, {q2})

where

δ(q0, a, ◁) = (q0, A◁)
δ(q0, a, A) = (q0, AA)
δ(q0, b, A) = (q1, ϵ)
δ(q1, b, A) = (q1, ϵ)
δ(q1, ϵ, ◁) = (q2, ϵ)

Example: A PDA that recognizes L = {a^n b^(2n) | n ≥ 1} can be constructed similarly (figure omitted).

3.2.1 Limitations of PDA

The language L = {a^n b^m | m = n or m = 2n} cannot be recognized by a deterministic PDA with a single stack. It would require two stacks, where the first stack would recognize the number n and the second one would recognize the number 2n.

3.2.2 Non-deterministic PDA

Non-determinism is modeled by means of a transition function whose co-domain is a power set. The formal definition is as follows: a non-deterministic PDA is a 6-tuple M = (Q, Σ, Γ, δ, q0, ◁) with

δ : Q × (Σ ∪ {ϵ}) × Γ → P(Q × Γ*)

where P(Q × Γ*) denotes the power set of Q × Γ*. A word is recognized by a non-deterministic PDA (with final states) iff there is at least one path between the initial state and a final state.

It is important to observe that the classes of languages recognized by deterministic and non-deterministic PDA are not the same. Non-deterministic PDA recognize a larger class of languages than deterministic PDA, i.e. they are more powerful. This is not the case for FSA.
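The PDA M for {a^n b^n | n ≥ 1} can be simulated directly. In the sketch below the bottom-of-stack marker ◁ is written as "<", the stack is a string whose first character is the top, and the final ϵ-move is applied once at the end of the input:

```python
# Transition table of M; (state, input, stack top) -> (state, pushed string).
DELTA = {("q0", "a", "<"): ("q0", "A<"),
         ("q0", "a", "A"): ("q0", "AA"),
         ("q0", "b", "A"): ("q1", ""),
         ("q1", "b", "A"): ("q1", ""),
         ("q1", "",  "<"): ("q2", "")}

def recognizes(w):
    q, stack = "q0", "<"
    for s in w:
        if not stack or (q, s, stack[0]) not in DELTA:
            return False
        q, push = DELTA[(q, s, stack[0])]
        stack = push + stack[1:]  # replace the top symbol by the pushed string
    # one empty-input move once the bottom marker is exposed
    if stack and (q, "", stack[0]) in DELTA:
        q, push = DELTA[(q, "", stack[0])]
        stack = push + stack[1:]
    return q == "q2" and stack == ""

print(recognizes("aabb"))  # True
print(recognizes("aab"))   # False
```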
3.3 Context-free Grammars

The difference between regular grammars and context-free grammars is the restrictions imposed on the rewriting rules. The rules of regular grammars are more restrictive than the ones of context-free grammars, which means that any regular grammar is also a context-free grammar. Context-free grammars have a greater expression power than regular ones.

The formal definition of context-free grammars is as follows: a context-free grammar G = (VT, VN, P, S) is a grammar whose productions α → β satisfy |α| = 1 with α ∈ VN.

The language generated by a context-free grammar G is called a context-free language, denoted by L(G).

Example: Let L = {a^n b^n | n ≥ 1}. A grammar that generates L can be defined as G = ({a, b}, {S}, P, S) where P = {S → aSb, S → ab}.

Example: Let G = ({a, b}, {S, A, B}, P, S) with

P = {S → aB, S → bA,
     A → a, A → aS, A → bAA,
     B → b, B → bS, B → aBB}

G is context-free and generates the language of all words composed of the same number of symbols a and b.

The theorem relating PDA and context-free languages is as follows: a language L is recognized by a non-deterministic PDA iff L is a context-free language.

3.4 BNF

The Backus-Naur Form (BNF) is a notation used to describe the syntax of programming languages (originally Algol 60). BNF uses ASCII characters only and is equivalent to the notion of context-free grammars.
• Non-terminals are strings, e.g. type, identifier, etc.
• Terminals are (single or double) quoted strings or symbols, such as "while", "float", 's', '+', etc.
• = corresponds to the arrow of a production α → β
• | corresponds to an alternative
• , concatenates elements
• ; terminates a production

The extended BNF (EBNF) is similar, but introduces some new constructs that make the notation easier to read and more compact. The most important aspects of EBNF are:
• Parentheses can be used to group elements
• [ γ ] means that γ is optional, i.e. appears zero or one times
• { γ } means that γ can be repeated zero or more times
where ements • Identifiers of programming languages,
3.3 Context-free Grammars • [ γ ] means that γ is optional, i.e. such as anItem, n
δ(q0 , a, ◁) = (q0 , A◁)
The difference between regular grammars appears zero or one times. • Keywords of programming languages,
δ(q0 , a, A) = (q0 , AA) and context-free grammars is the re- • { γ } means that γ can be repeated such as if, extends
strictions imposed on the rewriting rules. zero or more times.
δ(q0 , b, A) = (q1 , ϵ) The rules of regular grammars are more • Various numerical constants
δ(q1 , b, A) = (q1 , ϵ) restrictive than the ones of context-free 3.5 Parse Trees, Ambiguity
grammars, which means that any regular • Character strings such as "This is
δ(q0 , ϵ, ◁) = (q2 , ϵ) grammar is also a context-free grammar. The application of rewriting rules captures a comment", "Yes"
Context-free grammars have a greater the notion of the parse tree. For some • Specific tokens such as (, =, etc.
expression power than regular ones. grammars, a word can be generated in
Convention: (a, Z | α) means: several different ways and have thus several Between terminals lie separators, such as
The formal definition of context-free gram- different parse trees. Such words are blank spaces, tabs, new lines, and comments, 4.2.3 Recursive Descent
• a is the input symbol mars is as follows: A context-free grammar generated ambiguously and may potentially which are all usually ignored. We usually
• Z is the symbol at the top of the G = (VT , VN , P, S) is a grammar whose pro- have several different meanings. Having use lexers or scanners, which are programs Recursive Descent is a top-down ap-
stack before applying the transition ductions have the form α → β, where different meanings is sometimes undesirable, that perform the lexical analysis of an input proach in which we execute a programm
function 1. |α| = 1 and α ∈ VN especially in the context of programming text. The implementation of lexers is gener- that applies a set of recursive functions to
languages, where a given program should ally based on the use of regular expressions process the input. This may involve back-
• α is the content of the stack after the 2. β is a sequence of terminal and/or have unique semantics. and/or FSAs. Some lexer tools include lex, tracking, but we will try to get rid of it, for
application of the transition function non-terminals flex, jflex, anf more. efficiency and simplification reasons.
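The idea can be sketched with a minimal recursive-descent recognizer. The grammar S → a S b | c used below is a hypothetical example chosen for illustration (it is not one of the grammars in the text); since the current token alone decides the alternative, no backtracking is needed:

```python
# Minimal recursive-descent recognizer for the hypothetical grammar
#   S -> 'a' S 'b' | 'c'
# One function per non-terminal; the current input token decides
# which alternative to expand, so no backtracking is required.

def parse(text: str) -> bool:
    pos = 0

    def peek():
        # Return the current token (here: character), or None at end of input.
        return text[pos] if pos < len(text) else None

    def expect(tok):
        # Consume the expected token or fail.
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at position {pos}")
        pos += 1

    def S():
        if peek() == 'a':          # alternative S -> a S b
            expect('a'); S(); expect('b')
        else:                      # alternative S -> c
            expect('c')

    try:
        S()
        return pos == len(text)    # accept only if all input is consumed
    except SyntaxError:
        return False

print(parse("aacbb"))  # True
print(parse("aacb"))   # False
```

Each parsing function mirrors one rewriting rule, which is why the approach maps so directly onto the grammar.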
4.3 Classes of Context-free Grammars

The class of LL(1) grammars is a subset of context-free grammars. One way of parsing such grammars is to use recursive descent parsers in which no backtracking is required.

• LL(1) stands for Left to right Left-most derivation with 1 token lookahead.
• LL(1) is one of the easiest (and sometimes fastest) techniques.
• A recursive descent parser has a parsing function (or method) for each non-terminal that can recognize any sequence of tokens generated by that non-terminal.
• The idea of a recursive descent parser is to use the current input token to decide which alternative to choose.

Left-most derivation means that the left-most rule is used (expanded) first. For example, the word acbb gets recognized as follows: (derivation tree figure omitted)

Left-most derivation visits the nodes of a parse tree according to the DFS algorithm. The main problem of recursive descent is left-recursion, which causes the parser to enter into an infinite recursion.

4.3.1 Left-recursion

A grammar has immediate left-recursion if it contains a rule of the form

X → Xα | β

or more generally

X → Xα1 | … | Xαn | β1 | … | βn

General left-recursion can occur through several rules, for example:

S → Aa | b
A → Ac | Sd | ϵ

To remove immediate left-recursion, auxiliary rules are required, e.g.:

X → Xα1 | … | Xαn | β1 | … | βn

becomes

X → β1 X′ | … | βn X′
X′ → α1 X′ | … | αn X′ | ϵ

4.3.2 Left-factoring

When two or more grammar rule choices share a common prefix, LL(1) parsers cannot determine which choice to use or which part has to be expanded:

X → αβ | αγ

To solve this, we can factorize the common prefix:

X → αX′
X′ → β | γ

4.4 LL Grammars

A grammar is LL(1) iff there exists a recursive descent parser with 1 token lookahead for this grammar (top-down analysis). A grammar is LL(k) iff there exists a recursive descent parser that uses k tokens of lookahead.

The class of LL(k) grammars strictly includes the class LL(k − 1), where k > 1; in other words, there are LL(k) grammars for which no equivalent LL(k − 1) grammar exists. Put differently, LL(k) parsers are more powerful than LL(k − 1) parsers.

Fortunately, most LL(k) grammars can be translated into LL(1) grammars, but not all.

4.5 LR Grammars

LR(1) and LR(k) grammars can be analyzed with parsers based on the bottom-up technique with 1 token and k tokens of lookahead, respectively. LR(k) stands for Left to right Right-most derivation with k tokens lookahead. Bottom-up parsers are much more powerful than top-down parsers. It has been shown that any LR(k) grammar (k > 0) can be translated into an LR(1) grammar.

4.6 Comparison of LL and LR Parsers

The following shows the relationships between LL and LR: (diagram omitted)

Disadvantages of LR parsers vs LL parsers:

• Much more complex than LL parsers
• More difficult to use
• Less efficient
• Require much more space than LL parsers
• Only used through parser generators. Due to the size of the tables of LR parsers, parser generators only work for a subclass of LR grammars called LALR (lookahead LR)

Advantages of LR parsers vs LL parsers:

• Left-recursion is allowed (grammars are easier to read)
• More complex context-free languages can be analyzed
• Errors are detected as soon as they are encountered
• Almost all programming languages can be parsed by bottom-up parsers
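The immediate left-recursion removal rule from section 4.3.1 can be sketched as a small grammar transformation. The representation below (alternatives as tuples of symbols, the empty tuple for ϵ) and the function name are illustrative choices, not part of the text:

```python
# Sketch of removing immediate left-recursion:
#   X  -> X a1 | ... | X an | b1 | ... | bm
# becomes
#   X  -> b1 X' | ... | bm X'
#   X' -> a1 X' | ... | an X' | eps
# Each alternative is a tuple of symbols; () denotes epsilon.

def remove_immediate_left_recursion(name, alternatives):
    # Split alternatives into the left-recursive ones (X a_i, kept
    # without the leading X) and the rest (b_j).
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == name]
    others = [alt for alt in alternatives if not alt or alt[0] != name]
    if not recursive:
        return {name: alternatives}   # nothing to do
    aux = name + "'"                  # auxiliary non-terminal X'
    return {
        name: [alt + (aux,) for alt in others],
        aux: [alt + (aux,) for alt in recursive] + [()],
    }

# Classic example: E -> E + T | T  becomes  E -> T E',  E' -> + T E' | eps
print(remove_immediate_left_recursion("E", [("E", "+", "T"), ("T",)]))
# {'E': [('T', "E'")], "E'": [('+', 'T', "E'"), ()]}
```

The transformed grammar generates the same language but is no longer left-recursive, so it can be handled by a recursive descent parser.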