ch3 M.PPTX - 0
1
Functions of Lexical Analysis
• To identify the tokens we need some method of describing the possible tokens that can appear in the
input stream.
• For this purpose we introduce regular expressions, a notation that can be used to describe essentially
all the tokens of a programming language.
• Secondly, having decided what the tokens are, we need some mechanism to recognize these in the
input stream.
• This is done by the token recognizers, which are designed using transition diagrams and finite
automata.
2
Functions of Lexical Analysis
• The LA is the first phase of a compiler.
• Its main task is to read the input characters (the source code) and produce as output a sequence
of tokens that the parser uses for syntax analysis.
• Upon receiving a ‘get next token’ command from the parser, the lexical analyzer reads input
characters until it can identify the next token.
• The LA returns to the parser a representation of the token it has found.
3
Cont.
• Lexical Analysis Phase: In this phase, the input is the source program, which is read from left
to right, and the output is a sequence of tokens that will be analyzed by the next phase, Syntax
Analysis.
• While scanning the source code, the LA removes characters that carry no meaning for the parser:
comments (// single-line comment and /* multi-line comment */), blank spaces (‘ ’), horizontal tabs
(‘\t’), vertical tabs (‘\v’), newlines (‘\n’, the line feed character), carriage returns (‘\r’, which
moves the cursor back to the start of the line), form feeds (‘\f’, the ASCII page-break character), etc.
• Preprocessor directives (#include <iostream> or other header files) are also gone by the time the
token stream reaches the parser; in C and C++ they are normally handled by the preprocessor.
• The lexical analyzer, or scanner, also helps in error detection: if the input contains an invalid token,
the LA generates an error.
• For example, invalid constants and characters that cannot start any token are reported by the lexical
analysis phase. A misspelled keyword usually still matches the identifier pattern, so it is generally
caught later, by the parser.
• Regular expressions are used as a standard notation for specifying tokens of a programming language.
4
Role of the Lexical Analyzer
Token: A token is a sequence of characters that can be treated as a single logical entity (a unit that
cannot be broken down further).
• Typical tokens are:
1) Identifiers (name of a variable, function, etc.)
2) Keywords (int, float, char, break, continue, if, else, sizeof, return, etc.)
3) Operators (=, +, *, -, /, etc.)
4) Special characters (#, $, _, ->, etc.)
5) Punctuators: ( ) , [ ] { } ; : etc.
6) Constants (3.14, 1, 2, 5, etc.)
7) Literals (anything surrounded by double or single quotation marks, “ ” or ‘ ’)
Pattern: A pattern is a description of the form that the lexemes of a token may take.
• It specifies the rule (usually a regular expression) that the scanner follows to recognize a token.
• The set of strings that count as one token is described by this rule, the pattern associated with the token.
Lexeme: A lexeme is a sequence of characters in the source program that is matched by the pattern for a
token and is identified by the lexical analyzer as an instance of that token.
5
Cont.
Let’s understand now how to count the tokens in source code (C language) with the
following examples
Example 1:
int value = 10; //Input this Source code
o Tokens
int (keyword), value (identifier), = (operator), 10 (constant), and ; (punctuator, semicolon)
o Answer – Total number of tokens = 5
Example 2:
int main() {
// printf() sends the string inside quotation to
// the standard output (the display)
printf("Welcome to GeeksforGeeks!");
return 0;
}
o Tokens
'int', 'main', '(', ')', '{', 'printf', '(', '"Welcome to GeeksforGeeks!"', ')', ';', 'return', '0', ';', '}'
o Answer – Total number of tokens = 14
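The token counts in the two examples above can be reproduced with a small scanner. The following is a minimal Python sketch, not a full C lexer: the `TOKEN_SPEC` table, the token class names, and the `tokenize` helper are assumptions made for illustration, and only the few keywords, operators, and punctuators listed earlier are covered.

```python
import re

# Hypothetical token patterns for a tiny C subset. Order matters: earlier
# alternatives win, so comments come before operators and keywords before identifiers.
TOKEN_SPEC = [
    ("COMMENT",    r"//[^\n]*|/\*.*?\*/"),
    ("WHITESPACE", r"\s+"),
    ("STRING",     r'"(?:\\.|[^"\\])*"'),
    ("KEYWORD",    r"\b(?:int|float|char|return|if|else|break|continue|sizeof)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("CONSTANT",   r"\d+(?:\.\d+)?"),
    ("OPERATOR",   r"[=+\-*/]"),
    ("PUNCTUATOR", r"[(){}\[\];,:]"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,  # lets /* ... */ comments span several lines
)

def tokenize(source):
    """Return (token class, lexeme) pairs, skipping comments and white space."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            # No pattern fits: this is the lexical error case.
            raise ValueError(f"invalid character {source[pos]!r} at offset {pos}")
        if match.lastgroup not in ("COMMENT", "WHITESPACE"):
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

example1 = "int value = 10; //Input this Source code"
example2 = """int main() {
// printf() sends the string inside quotation to
// the standard output (the display)
printf("Welcome to GeeksforGeeks!");
return 0;
}"""

print(len(tokenize(example1)))  # 5
print(len(tokenize(example2)))  # 14
```

Comments and white space are matched but not emitted, which is why Example 1 yields 5 tokens even though the line ends in a comment.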
6
Cont.
Let’s understand Lexeme
Example:
o User-defined names like value, a, b, c are lexemes of type identifier (token)
o (, ), {, } are lexemes of type punctuation (token)
Let’s understand Pattern
Example of Programming Language (C, C++):
o For a keyword to be identified as a valid token, the pattern is the exact sequence of characters that
makes up the keyword.
o For an identifier to be identified as a valid token, the pattern is the predefined rule that it must
start with a letter, followed by any number of letters or digits (C and C++ also allow the underscore
wherever a letter may appear).
7
Regular Expressions
• The grammar defined by regular expressions is known as regular grammar.
• The language defined by regular grammar is known as regular language.
• Regular expression is an important notation for specifying patterns.
• Each pattern matches a set of strings, so regular expressions serve as names for a set of strings.
• Programming language tokens can be described by regular languages.
• The specification of regular expressions is an example of a recursive definition.
• Regular languages are easy to understand and have efficient implementation.
• There are a number of algebraic laws that are obeyed by regular expressions, which can
be used to manipulate regular expressions into equivalent forms.
8
Operations
• The various operations on languages are:
Union of two languages L and M is written as
L U M = {s | s is in L or s is in M}
Concatenation of two languages L and M is written as
LM = {st | s is in L and t is in M}
The Kleene Closure of a language L is written as
L* = the set of strings formed by concatenating zero or more strings from L (zero strings give ε).
Notations
• If r and s are regular expressions denoting the languages L(r) and L(s), then
Union : (r)+(s) or (r)|(s) is a regular expression denoting L(r) U L(s)
Concatenation : (r)(s) is a regular expression denoting L(r)L(s)
Kleene closure : (r)* is a regular expression denoting (L(r))*
(r) is a regular expression denoting L(r)
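For finite languages, the three operations above can be tried out directly on Python sets of strings. This is only a sketch: the function names are made up for illustration, and because L* is infinite whenever L contains a non-empty string, `kleene_star` takes a hypothetical `max_repeats` bound and returns a finite approximation.

```python
from itertools import product

def union(L, M):
    """L U M: strings that are in L or in M."""
    return L | M

def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s, t in product(L, M)}

def kleene_star(L, max_repeats):
    """Finite approximation of L*: at most max_repeats strings of L concatenated."""
    result = {""}   # zero occurrences give the empty string ε
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, L)
        result |= layer
    return result

L = {"a", "ab"}
M = {"b"}
print(sorted(union(L, M)))                     # ['a', 'ab', 'b']
print(sorted(concat(L, M)))                    # ['ab', 'abb']
print(sorted(kleene_star({"a"}, 3), key=len))  # ['', 'a', 'aa', 'aaa']
```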
9
Precedence and Associativity
• *, concatenation (.), and | (pipe sign) are left associative
* has the highest precedence
Concatenation (.) has the second highest precedence.
+ or | (Union) has the lowest precedence of all.
Representing valid tokens of a language in regular expression
• If x is a regular expression, then:
x* means zero or more occurrences of x, i.e., it can generate {ε, x, xx, xxx, xxxx, …}
x+ means one or more occurrences of x, i.e., it can generate {x, xx, xxx, xxxx, …}; it is equivalent to x.x*
x? means at most one occurrence of x, i.e., it can generate either {x} or {ε}.
[a-z] is any one lower-case letter of the English alphabet.
[A-Z] is any one upper-case letter of the English alphabet.
[0-9] is any one decimal digit.
10
Representing occurrence of symbols using regular expressions
letter = [a-z] | [A-Z], i.e., A | B | … | Z | a | b | … | z
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
digits = digit+
sign = + | - (as a character class: [+-])
Representing language tokens using regular expressions
Decimal = (sign)?(digit)+
Identifier = (letter)(letter | digit)*
• The remaining problem for the lexical analyzer is how to check whether a piece of input matches
a regular expression used to specify the pattern of a token, such as a keyword.
• A well-accepted solution is to use finite automata for this recognition.
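The Decimal and Identifier definitions above translate almost directly into Python's `re` syntax. The sketch below is illustrative only; the constant and helper names are assumptions, and `fullmatch` is used so that the whole lexeme has to fit the pattern.

```python
import re

# Building blocks, written the way the slide defines them.
LETTER = r"[A-Za-z]"
DIGIT = r"[0-9]"
SIGN = r"[+-]"

DECIMAL = re.compile(rf"{SIGN}?{DIGIT}+")                       # (sign)?(digit)+
IDENTIFIER = re.compile(rf"{LETTER}(?:{LETTER}|{DIGIT})*")      # (letter)(letter | digit)*

def is_decimal(text):
    """True if the whole text matches (sign)?(digit)+."""
    return DECIMAL.fullmatch(text) is not None

def is_identifier(text):
    """True if the whole text matches (letter)(letter | digit)*."""
    return IDENTIFIER.fullmatch(text) is not None

print(is_decimal("-42"), is_decimal("4-2"))        # True False
print(is_identifier("x1"), is_identifier("1x"))    # True False
```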
11
Examples of Regular Expression
Example 1:
Write the regular expression for the language accepting all combinations of a's, over the set ∑ = {a}
• Solution:
• All combinations of a's means a may appear zero times, once, twice, and so on.
• If a appears zero times, the result is the null string ε.
• That is, we expect the set of strings L = {ε, a, aa, aaa, ....}.
• So we give a regular expression for this as: RE = a*
Example 2:
Write the regular expression for the language accepting all combinations of a's except the null string,
over the set ∑ = {a}
• Solution:
• The regular expression has to be built for the language L, that is, the set of strings
L = {a, aa, aaa, ....}
• This set indicates that there is no null string.
• So we can denote the regular expression as: RE = a+
12
Examples of Regular Expression
Example 3:
Write the regular expression for the language accepting all strings containing any number of a's and
b's.
• Solution:
• The regular expression will be: RE = (a + b)*
• This gives the set of strings L = {ε, a, aa, b, bb, ab, ba, aba, bab, .....}, any combination of a and b.
• The (a + b)* denotes any combination of a and b, including the null string.
Example 4:
Write the regular expression for the language accepting all strings that start with 1 and
end with 0, over ∑ = {0, 1}.
• Solution:
• In the regular expression, the first symbol should be 1, and the last symbol should be 0.
• The RE is as follows: RE = 1 (0+1)* 0
13
Examples of Regular Expression
Example 5:
Write the regular expression for the language of strings over ∑ = {a, b} starting with a but not having
consecutive b's.
• Solution:
• The regular expression has to be built for the language L, that is, the set of strings
L = {a, aa, ab, aaa, aab, aba, abab, .....}
• The regular expression for the above language is: RE = (a + ab)+
• Note that (a + ab)* would also accept ε, which does not start with a.
14
Examples of Regular Expression
Example 6:
Write the regular expression for the language accepting all strings in which any number of a's is
followed by any number of b's, which is followed by any number of c's.
• Solution:
• As we know, any number of a's means a*, any number of b's means b*, and any number of c's means c*.
• Since, as given in the problem statement, b's appear after a's and c's appear after b's, the regular
expression could be: RE = a* b* c*
Example 7:
Write the regular expression for the language over ∑ = {0} having strings of even length.
• Solution:
• The regular expression has to be built for the language L, that is, the set of strings
L = {ε, 00, 0000, 000000, ......}
• The regular expression for the above language is: RE = (00)*
15
Examples of Regular Expression
Example 8:
Write the regular expression for the language over ∑ = {0, 1} of strings that have at least one 0 and at
least one 1.
• Solution:
• Either some 0 comes before some 1, or some 1 comes before some 0, so the regular expression will be:
RE = [(0 + 1)* 0 (0 + 1)* 1 (0 + 1)*] + [(0 + 1)* 1 (0 + 1)* 0 (0 + 1)*]
16
Examples of Regular Expression
Example 9:
Write the regular expression for the language L over ∑ = {0, 1} such that no string contains
the substring 01.
• Solution:
• The language is as follows: the set of strings L = {ε, 0, 1, 00, 11, 10, 100, .....}
• Once a 0 has been read no 1 may follow, so the regular expression is as follows: RE = 1* 0*
17
Examples of Regular Expression
Example 10:
• Write the regular expression for the language containing the strings in which every 0 is
immediately followed by 11.
• Solution:
• The regular expression will be: RE = (011 + 1)*
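One way to gain confidence in answers like Examples 1 to 10 is to compare each regular expression with a plain description of the language on every short string. The sketch below does this with Python's `re`, which writes union as `|` instead of `+`. The helper names and the length bound are assumptions, and agreement on short strings is evidence, not a proof. Example 5 is checked as `(a|ab)+`, because `(a|ab)*` would also match ε, which does not start with a.

```python
import re
from itertools import product

def strings_up_to(alphabet, max_len):
    """Yield every string over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def agrees(pattern, alphabet, predicate, max_len=7):
    """True if the regex and the predicate accept exactly the same short strings."""
    regex = re.compile(pattern)
    return all((regex.fullmatch(w) is not None) == predicate(w)
               for w in strings_up_to(alphabet, max_len))

def every_0_followed_by_11(w):
    return all(w[i + 1:i + 3] == "11" for i, ch in enumerate(w) if ch == "0")

# (regex, alphabet, membership test written directly from the problem statement)
CHECKS = [
    ("a*", "a", lambda w: True),                                             # Example 1
    ("a+", "a", lambda w: len(w) >= 1),                                      # Example 2
    ("(a|b)*", "ab", lambda w: True),                                        # Example 3
    ("1(0|1)*0", "01", lambda w: w.startswith("1") and w.endswith("0")),     # Example 4
    ("(a|ab)+", "ab", lambda w: w.startswith("a") and "bb" not in w),        # Example 5
    ("a*b*c*", "abc", lambda w: "".join(sorted(w)) == w),                    # Example 6
    ("(00)*", "0", lambda w: len(w) % 2 == 0),                               # Example 7
    ("(0|1)*0(0|1)*1(0|1)*|(0|1)*1(0|1)*0(0|1)*", "01",
     lambda w: "0" in w and "1" in w),                                       # Example 8
    ("1*0*", "01", lambda w: "01" not in w),                                 # Example 9
    ("(011|1)*", "01", every_0_followed_by_11),                              # Example 10
]

for pattern, alphabet, predicate in CHECKS:
    print(pattern, agrees(pattern, alphabet, predicate))
```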
18
Finite Automata
• An automaton is defined as a system where information is transmitted and used for performing some
functions without direct human participation.
• An automaton in which the output depends only on the input is called an automaton without memory.
• An automaton in which the output depends on both the input and the state is called an automaton with
memory.
• An automaton in which the output depends only on the state of the machine is called a Moore machine.
• An automaton in which the output depends on the state and the input at any instant of time is called a
Mealy machine.
Description of Automata
• An automaton has a mechanism to read input from an input tape. An automaton recognizes a language,
hence automata are basically ‘language acceptors’ or ‘language recognizers’.
• Types of Finite Automata
Deterministic Automata
Non-Deterministic Automata
19
Deterministic Automata
• A deterministic finite automaton (DFA) has at most one transition from each state on any input.
• A DFA is a special case of an NFA in which:
It has no transitions on input ε,
Each input symbol has at most one transition from any state.
• A DFA is formally defined by the 5-tuple notation M = (Q, Σ, δ, q0, F), where
Q is a finite, non-empty ‘set of states’.
Σ is the ‘input alphabet’, the finite set of input symbols.
q0 is the ‘initial state’, and q0 is in Q.
F is the set of final states (a set of accepting states), and F is a subset of Q.
δ is the ‘transition function’ or mapping function, δ: Q x Σ → Q; using this function the next state can be
determined.
20
Cont.
• A regular expression is converted into a minimized DFA by the following procedure:
Regular expression → NFA → DFA → Minimized DFA
• A finite automaton is called a DFA if there is only one path for a specific input from the current state to
the next state.
• From state S0 for input ‘a’ there is only one path, going to S2.
• Similarly, from S0 there is only one path for input going to S1.
21
Cont.
The Transition Function
• It takes two arguments: a state and an input symbol.
• δ(q, a) = the state that the DFA goes to when it is in state q and input a is received.
• A DFA does not allow non-deterministic state transitions.
• There cannot be multiple state transitions from state q on the same input a.
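Because δ maps each (state, symbol) pair to a single state, a dictionary is a natural way to store it, and applying δ once per symbol extends it from single symbols to whole strings. The sketch below uses a hypothetical two-state DFA that tracks whether the number of 1's read so far is even; the state names and the `delta_hat` helper are assumptions for illustration.

```python
from functools import reduce

# Hypothetical DFA over {0, 1}: "even" / "odd" number of 1's seen so far.
# Each (state, symbol) key appears once, so there is exactly one next state.
DELTA = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}

def delta(state, symbol):
    """δ(q, a): the state the DFA goes to from state q on input a."""
    return DELTA[(state, symbol)]

def delta_hat(state, word):
    """δ extended to strings: apply δ once for each symbol of word, left to right."""
    return reduce(delta, word, state)

print(delta("even", "1"))         # odd
print(delta_hat("even", "0110"))  # even
```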
Graph Representation of DFA’s
• Nodes = states.
• Edges represent transition function.
• Edge from state p to state q labeled by all those input symbols that have transitions from p to q.
• Edge labeled “Start” to the start state.
• Final states indicated by double circles.
• A DFA does not allow non-deterministic edges, i.e., there cannot be more than one edge leaving any state
with the same label.
22
Cont.
Example: Graph of a DFA
• Accepts all strings without two consecutive 1’s.
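[Figure: transition diagram with states A (start), B, and C. A loops on 0 and goes to B on 1; B goes back to A on 0 and to C on 1; C loops on 0,1. A and B are final states.]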
23
Cont.
Transition table of the example DFA
Rows = states, columns = input symbols; the arrow marks the start state, and final states are starred.
          0    1
→ * A     A    B
  * B     A    C
    C     C    C
24
Cont.
Language of a DFA
• Automata of all kinds define languages.
• If A is an automaton, L(A) is its language.
• For a DFA A, L(A) is the set of strings labeling paths from the start state to a final state.
• Formally: L(A) = the set of strings w such that δ(q0, w) is in F.
Example: tracing the string 101 through the example DFA
• Start at A.
• Follow arc labeled 1, reaching B.
• Then arc labeled 0 from current state B, reaching A.
• Finally arc labeled 1 from current state A, reaching B.
• Result is an accepting state, so 101 is in the language.
Concluded
• The language of our example DFA is: {w | w is in {0,1}* and w does not have two consecutive 1’s}
• Read a set former as “The set of strings w … such that … these conditions about w are true.”
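The walk through the example DFA can be replayed in code. The sketch below encodes the transition table of the A, B, C automaton (A is the start state, A and B are final) and records the states visited; the `trace` and `accepts` helper names are assumptions for illustration.

```python
from itertools import product

# Transition table of the example DFA: state -> {input symbol -> next state}.
DELTA = {
    "A": {"0": "A", "1": "B"},
    "B": {"0": "A", "1": "C"},
    "C": {"0": "C", "1": "C"},
}
START, FINAL = "A", {"A", "B"}

def trace(word):
    """Return the list of states visited while reading word from the start state."""
    path = [START]
    for symbol in word:
        path.append(DELTA[path[-1]][symbol])
    return path

def accepts(word):
    """A string is in L(A) if the path it labels ends in a final state."""
    return trace(word)[-1] in FINAL

print(trace("101"))    # ['A', 'B', 'A', 'B']
print(accepts("101"))  # True
print(accepts("110"))  # False

# Compare with the set-former description on every string of length <= 8.
words = ("".join(p) for n in range(9) for p in product("01", repeat=n))
print(all(accepts(w) == ("11" not in w) for w in words))  # True
```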
26
Cont.
Example 1:
Given a DFA, M such that: L(M) = {x | x is in {a,b,c}* and x contains the substring aba}
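[Figure: transition diagram with states q0 (start), q1, q2, and q3 (final). q0 loops on b/c and goes to q1 on a; q1 loops on a, goes to q2 on b, and returns to q0 on c; q2 goes to q3 on a and returns to q0 on b/c; q3 loops on a/b/c.]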
Example 2:
Given a DFA, M such that: L(M) = {x | x is in {a,b}* and x contains aa or bb}
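[Figure: transition diagram with states q0 (start), q1, q2 (final), q3, and q4 (final). q0 goes to q1 on a and to q3 on b; q1 goes to q2 on a and to q3 on b; q3 goes to q4 on b and to q1 on a; q2 and q4 loop on a|b.]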
27
Cont.
Example 3:
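Given a DFA, M such that: L(M) = {x | x is in {a,b}* and every a in x is immediately followed by b}
[Figure: transition diagram with states q0 (start, final), q1, and q2. q0 loops on b and goes to q1 on a; q1 returns to q0 on b and goes to q2 on a; q2 loops on a/b.]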
Example 4:
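Given a DFA, M such that: L(M) = {x | x is in {0,1}* and x ends in 00}
[Figure: transition diagram with states q0 (start), q1, and q2 (final). q0 loops on 1 and goes to q1 on 0; q1 goes to q2 on 0; q2 loops on 0; q1 and q2 return to q0 on 1.]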
28
Cont.
Example 5:
Let M be a DFA given by:
M = ({q0,q1},{a,b}, δ,q0,{q0}) and δ is given as:
δ(q0,a)=q0
δ(q0,b)=q1
δ(q1,a)=q1
δ(q1,b)=q0
Construct transition diagram and table for the given DFA and determine the language L(M)
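One possible way to approach Example 5 is to enter δ as given, print the transition table, and then test a guess about L(M) on short strings. In the sketch below the table layout and helper names are assumptions; the guess being tested is that q0 is reached exactly when the number of b's read so far is even, since only b changes the state.

```python
from itertools import product

STATES = ["q0", "q1"]
ALPHABET = ["a", "b"]
DELTA = {
    ("q0", "a"): "q0", ("q0", "b"): "q1",
    ("q1", "a"): "q1", ("q1", "b"): "q0",
}
START, FINAL = "q0", {"q0"}

def transition_table():
    """One text row per state: '->' marks the start state, '*' marks final states."""
    rows = []
    for q in STATES:
        marker = ("->" if q == START else "  ") + ("*" if q in FINAL else " ")
        rows.append(f"{marker} {q}  " + "  ".join(DELTA[(q, a)] for a in ALPHABET))
    return rows

def accepts(word):
    state = START
    for symbol in word:
        state = DELTA[(state, symbol)]
    return state in FINAL

print("       a   b")
print("\n".join(transition_table()))

# Test the guess "L(M) = strings with an even number of b's" on strings of length <= 8.
words = ("".join(p) for n in range(9) for p in product(ALPHABET, repeat=n))
print(all(accepts(w) == (w.count("b") % 2 == 0) for w in words))  # True
```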
29
Nondeterministic Automata
• An NFA is a mathematical model that consists of
A set of states S.
A set of input symbols Σ.
A transition function, move, that maps state-symbol pairs to sets of states.
A state s0 that is distinguished as the start (or initial) state.
A set of states F distinguished as accepting (or final) states.
A state may have any number of transitions on a single symbol.
• An NFA can be diagrammatically represented by a labeled directed graph, called a transition
graph, in which the nodes are the states and the labeled edges represent the transition function.
• This graph looks like a transition diagram, but the same character can label two or more
transitions out of one state, and edges can be labeled by the special symbol ε as well as by input
symbols.
30
Cont.
• An NFA is a five-tuple:
M = (Q, Σ, δ, q0, F)
Q is a finite set of states
Σ is a finite input alphabet
q0 is the initial/starting state, q0 is in Q
F is a set of final/accepting states, which is a subset of Q
δ is a transition function, which is a total function from Q x Σ to 2^Q
δ: (Q x Σ) → 2^Q, where 2^Q is the power set of Q, the set of all subsets of Q
31
Cont.
NFA Differences with DFA
• Three major differences
1. The range of δ is the power set 2^Q, so δ(q, a) is a set of states rather than a single state.
2. ε (empty string) transitions are possible in an NFA: the NFA can make a transition without consuming an
input symbol.
3. In an NFA, the set δ(qi, a) may be empty; then there is no transition defined for that state and input.
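All three differences show up when an NFA with ε-moves is simulated: the machine is in a set of states, ε-moves are followed without reading input, and a missing entry simply contributes the empty set. The sketch below uses a hypothetical ε-NFA for an optional sign followed by one or more digits; the state names, the `"d"` symbol class for digits, and the helper names are assumptions for illustration.

```python
EPS = ""  # label used in this sketch for ε-moves

# Hypothetical ε-NFA: s0 --(+ | - | ε)--> s1 --digit--> s2 --digit--> s2
DELTA = {
    ("s0", "+"): {"s1"},
    ("s0", "-"): {"s1"},
    ("s0", EPS): {"s1"},   # the sign is optional: move on without consuming input
    ("s1", "d"): {"s2"},
    ("s2", "d"): {"s2"},
}
START, FINAL = "s0", {"s2"}

def epsilon_closure(states):
    """All states reachable from the given states using only ε-moves."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in DELTA.get((q, EPS), set()):
            if r not in closure:
                closure.add(r)
                stack.append(r)
    return closure

def accepts(word):
    current = epsilon_closure({START})
    for ch in word:
        symbol = "d" if ch.isdigit() else ch
        moved = set()
        for q in current:
            moved |= DELTA.get((q, symbol), set())  # empty set when δ(q, a) is undefined
        current = epsilon_closure(moved)
    return bool(current & FINAL)

print(accepts("-42"), accepts("7"), accepts("+"), accepts("4-2"))  # True True False False
```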
32
Cont.
NFA Differences with DFA
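• Example #1: Some 0’s followed by some 1’s
[Figure: transition diagram with states q0 (start), q1, and q2 (final). q0 loops on 0 and goes to q1 on 0; q1 loops on 1 and goes to q2 on 1; q2 loops on 0/1.]
Q = {q0, q1, q2}
Σ = {0, 1}
Start state is q0
F = {q2}
δ:        0            1
  q0   {q0, q1}       {}
  q1     {}        {q1, q2}
* q2    {q2}         {q2}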
33
Cont.
NFA Differences with DFA
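• Example #2: Pair of 0’s or pair of 1’s
[Figure: transition diagram with states q0 (start), q1, q2 (final), q3, and q4 (final). q0 loops on 0/1, goes to q3 on 0 and to q1 on 1; q3 goes to q4 on 0; q1 goes to q2 on 1; q2 and q4 loop on 0/1.]
Q = {q0, q1, q2, q3, q4}
Σ = {0, 1}
Start state is q0
F = {q2, q4}
δ:        0            1
  q0   {q0, q3}     {q0, q1}
  q1     {}          {q2}
* q2    {q2}         {q2}
  q3    {q4}          {}
* q4    {q4}         {q4}
34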
Cont.
Language of an NFA
• A string w is accepted by an NFA if δ(q0, w) contains at least one final state.
• The language of the NFA is the set of strings it accepts.
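The acceptance condition can be made concrete by tracking the set of states the NFA could be in after each symbol. The sketch below transcribes the transition table of the "pair of 0's or pair of 1's" NFA (Example #2, start state q0, final states q2 and q4); the `delta_hat` and `accepts` helper names are assumptions for illustration.

```python
from itertools import product

# δ for the "pair of 0's or pair of 1's" NFA: state -> {symbol -> set of next states}.
DELTA = {
    "q0": {"0": {"q0", "q3"}, "1": {"q0", "q1"}},
    "q1": {"0": set(),        "1": {"q2"}},
    "q2": {"0": {"q2"},       "1": {"q2"}},
    "q3": {"0": {"q4"},       "1": set()},
    "q4": {"0": {"q4"},       "1": {"q4"}},
}
START, FINAL = "q0", {"q2", "q4"}

def delta_hat(word):
    """Set of states the NFA may be in after reading word from the start state."""
    current = {START}
    for symbol in word:
        current = set().union(*(DELTA[q][symbol] for q in current))
    return current

def accepts(word):
    """Accepted if the reachable set contains at least one final state."""
    return bool(delta_hat(word) & FINAL)

print(sorted(delta_hat("01")))           # ['q0', 'q1']
print(accepts("0110"), accepts("0101"))  # True False
```

On every string of length up to 8, `accepts` agrees with the check `"00" in w or "11" in w`, which matches the intended language.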
Equivalence of DFA’s, NFA’s
• A DFA can be turned into an NFA that accepts the same language.
• If δD(q, a) = p, let the NFA have δN(q, a) = {p}.
• Then the NFA is always in a set containing exactly one state – the state the DFA is in after reading
the same input.
• Surprisingly, for any NFA there is a DFA that accepts the same language.
• Proof is the subset construction.
• The DFA may have exponentially more states than the NFA.
• Thus, NFA’s accept exactly the regular languages, just as DFA’s do.
35
Cont.
Subset Construction
• Given an NFA with states Q, inputs Σ, transition function δN, start state q0, and final states F,
construct an equivalent DFA with:
States 2^Q (the set of subsets of Q).
Inputs Σ.
Start state {q0}.
Final states = all those subsets that contain a member of F.
• The transition function δD is defined by: δD({q1,…,qk}, a) is the union over all i = 1,…,k of δN(qi, a)
Critical Point
• The DFA states have names that are sets of NFA states.
• But as a DFA state, an expression like {p, q} must be read as a single symbol, not as a set.
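The construction can be sketched in a few lines of Python. Two choices here are assumptions rather than part of the definition: DFA states are represented as `frozenset` objects, so each set of NFA states is hashable and behaves as one name, and only subsets reachable from {q0} are generated instead of all of 2^Q. The sample NFA is the "some 0's followed by some 1's" automaton.

```python
def subset_construction(nfa_delta, alphabet, start, finals):
    """Build the reachable part of the DFA equivalent to an NFA without ε-moves."""
    start_set = frozenset({start})
    dfa_delta = {}
    pending = [start_set]
    while pending:
        subset = pending.pop()
        if subset in dfa_delta:
            continue  # this DFA state already has its row
        dfa_delta[subset] = {}
        for a in alphabet:
            # δD(S, a) is the union of δN(q, a) over all q in S.
            target = frozenset().union(*(nfa_delta[q][a] for q in subset))
            dfa_delta[subset][a] = target
            pending.append(target)
    dfa_finals = {s for s in dfa_delta if s & finals}
    return dfa_delta, start_set, dfa_finals

# NFA for "some 0's followed by some 1's".
NFA_DELTA = {
    "q0": {"0": {"q0", "q1"}, "1": set()},
    "q1": {"0": set(),        "1": {"q1", "q2"}},
    "q2": {"0": {"q2"},       "1": {"q2"}},
}
dfa_delta, dfa_start, dfa_finals = subset_construction(NFA_DELTA, "01", "q0", {"q2"})

for subset, row in dfa_delta.items():
    print(sorted(subset), {a: sorted(t) for a, t in row.items()})
print(len(dfa_delta), "reachable DFA states out of", 2 ** len(NFA_DELTA))  # 5 ... 8
```

The empty set shows up as an ordinary DFA state: it is the dead state entered when no NFA transition applies.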
36
Cont.
Subset Construction
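• Example #1: We’ll construct the DFA equivalent of this “chessboard” NFA.
[Figure: 3 x 3 chessboard with squares numbered 1 to 9, row by row; squares 2, 4, 6, and 8 are red, the others are black.]
NFA transition table (inputs r and b; the start state is 1, and the final state 9 is starred):
          r            b
  1      2,4           5
  2      4,6         1,3,5
  3      2,6           5
  4      2,8         1,5,7
  5    2,4,6,8      1,3,7,9
  6      2,8         3,5,9
  7      4,8           5
  8      4,6         5,7,9
* 9      6,8           5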
37
Cont.
Subset Construction
• The DFA table is built one row at a time, starting from the start state {1}. Each new set that appears
as an entry gets its own row, and its entries are the unions of the NFA entries of its members.
• The rows below are listed in the order in which they are filled in. Final states (starred) are the sets
that contain 9.
NFA:                               DFA:
          r            b                               r              b
  1      2,4           5             {1}             {2,4}           {5}
  2      4,6         1,3,5           {2,4}         {2,4,6,8}      {1,3,5,7}
  3      2,6           5             {5}           {2,4,6,8}      {1,3,7,9}
  4      2,8         1,5,7           {2,4,6,8}     {2,4,6,8}     {1,3,5,7,9}
  5    2,4,6,8      1,3,7,9          {1,3,5,7}     {2,4,6,8}     {1,3,5,7,9}
  6      2,8         3,5,9         * {1,3,7,9}     {2,4,6,8}         {5}
  7      4,8           5           * {1,3,5,7,9}   {2,4,6,8}     {1,3,5,7,9}
  8      4,6         5,7,9
* 9      6,8           5
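The finished table can be checked against the NFA it came from. The sketch below enters the chessboard NFA and the seven-row DFA, confirms that every DFA entry is the union of the NFA entries of its members, and compares the two machines on all short move sequences. The `fs` helper and the function names are assumptions for illustration.

```python
from itertools import product

def fs(*squares):
    return frozenset(squares)

# Chessboard NFA: state -> (targets on r, targets on b); start state 1, final state 9.
NFA = {
    1: (fs(2, 4), fs(5)),             2: (fs(4, 6), fs(1, 3, 5)),
    3: (fs(2, 6), fs(5)),             4: (fs(2, 8), fs(1, 5, 7)),
    5: (fs(2, 4, 6, 8), fs(1, 3, 7, 9)), 6: (fs(2, 8), fs(3, 5, 9)),
    7: (fs(4, 8), fs(5)),             8: (fs(4, 6), fs(5, 7, 9)),
    9: (fs(6, 8), fs(5)),
}
R, B = fs(2, 4, 6, 8), fs(1, 3, 5, 7, 9)
# DFA from the subset construction: each key is a set of NFA states used as one name.
DFA = {
    fs(1): (fs(2, 4), fs(5)),
    fs(2, 4): (R, fs(1, 3, 5, 7)),
    fs(5): (R, fs(1, 3, 7, 9)),
    R: (R, B),
    fs(1, 3, 5, 7): (R, B),
    fs(1, 3, 7, 9): (R, fs(5)),
    B: (R, B),
}

def rows_are_unions():
    """Every DFA entry must equal the union of the member states' NFA entries."""
    return all(DFA[s][i] == frozenset().union(*(NFA[q][i] for q in s))
               for s in DFA for i in (0, 1))

def nfa_accepts(word):
    current = {1}
    for move in word:
        current = set().union(*(NFA[q]["rb".index(move)] for q in current))
    return 9 in current

def dfa_accepts(word):
    state = fs(1)
    for move in word:
        state = DFA[state]["rb".index(move)]
    return 9 in state

words = ["".join(p) for n in range(9) for p in product("rb", repeat=n)]
print(rows_are_unions())                                     # True
print(all(nfa_accepts(w) == dfa_accepts(w) for w in words))  # True
```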
44
Cont.
• Example #2 : Convert NFA to DFA of M=({q0,q1,q2},{a,b}, δ, q0,{q2}) where δ is given by:
Ans.
45
Cont.
• Example #3 : Convert NFA to DFA of M=({A,B},{0,1}, δ, A,{A}) where δ is given by:
Ans.
46