Ch3 - Lexical Analysis
Lexical Analysis
• Basic Concepts & Regular Expressions
• What does a Lexical Analyzer do?
• How does it Work?
• Formalizing Token Definition & Recognition
[Figure: block diagram whose boxes include the symbol table]
Important Issue:
• What are the responsibilities of each box?
• Focus on the Lexical Analyzer and the Parser.
Lexical Analyzer in Perspective
LEXICAL ANALYZER
• Scan Input
• Remove WS, NL, …
• Identify Tokens
• Create Symbol Table
• Insert Tokens into ST
PARSER
• Perform Syntax Analysis
• Actions Dictated by Token Order
• Update Symbol Table Entries
• Create Abstract Rep. of Source
• LEXEME
• Actual sequence of characters that matches a pattern and is classified by a token
Introducing Basic Terminology
A token classifies a pattern. Actual values are critical. Info is:
1. Stored in symbol table
2. Returned to parser
Attributes for Tokens
Example: E = M * C ** 2
<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>
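The token/attribute pairs listed for E = M * C ** 2 can be reproduced with a short sketch. It assumes Python and a plain list as the symbol table, so the "pointer" is just a list index; the names TOKEN_SPEC and tokenize are hypothetical and do not come from the slides.

```python
import re

# Hypothetical token patterns; "**" is listed before "*" so it is tried first.
TOKEN_SPEC = [
    ("id",        r"[A-Za-z][A-Za-z0-9]*"),
    ("num",       r"[0-9]+"),
    ("exp_op",    r"\*\*"),
    ("mult_op",   r"\*"),
    ("assign_op", r"="),
    ("ws",        r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return (tokens, symbol_table); each token is a (name, attribute) pair."""
    symbol_table = []   # identifier lexemes; an index stands in for a pointer
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue                          # white space yields no token
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)   # stored in the symbol table
            tokens.append((kind, symbol_table.index(lexeme)))
        elif kind == "num":
            tokens.append((kind, int(lexeme)))
        else:
            tokens.append((kind, None))       # operators carry no attribute
    return tokens, symbol_table


tokens, table = tokenize("E = M * C ** 2")
for token in tokens:
    print(token)
```

Each id token carries an index into table, which plays the role of the pointer to the symbol-table entry; num carries the integer value, and the operator tokens carry no attribute.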
Handling Lexical Errors
• It is hard for a lexical analyzer to tell, without the aid of other components, that there is a source-code error.
• If the string fi is encountered for the first time in a C program, the lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared identifier.
• Probably the parser will be able to handle the error in this case.
Buffer Pairs
• The lexical analyzer may need to look ahead several characters beyond the lexeme for a pattern before a match can be announced.
• A function such as ungetc can be used to push look-ahead characters back into the input stream.
• A large amount of time can be consumed moving characters.
[Figure: input buffer holding E = M * C ** 2 eof, with the pointers lexeme_beginning and forward]
Comments and white space can be treated as patterns that yield no token
Code to advance forward pointer
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;
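The three cases of the pseudocode can be tried out with a small sketch. It assumes Python, a text stream as input, and None as a stand-in for eof; the class name BufferPair, the helper read_all, and the tiny half size are illustrative choices only.

```python
import io


class BufferPair:
    """Two-half input buffer; forward scans ahead one character at a time."""

    def __init__(self, stream, half_size=4):
        self.stream = stream
        self.n = half_size
        self.buf = [None] * (2 * half_size)
        self.forward = -1        # index of the last character handed out
        self._load(0)            # fill the first half before scanning starts

    def _load(self, start):
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            # Slots past the end of the input hold None, standing in for eof.
            self.buf[start + i] = chunk[i] if i < len(chunk) else None

    def advance(self):
        """Advance forward by one and return the character there (None at eof)."""
        if self.forward == self.n - 1:           # at end of first half
            self._load(self.n)                   # reload second half
            self.forward += 1
        elif self.forward == 2 * self.n - 1:     # at end of second half
            self._load(0)                        # reload first half
            self.forward = 0                     # move to beginning of first half
        else:
            self.forward += 1
        return self.buf[self.forward]


def read_all(text, half_size=4):
    """Drain the buffer pair and return every character it delivered."""
    pair = BufferPair(io.StringIO(text), half_size)
    out = []
    while (ch := pair.advance()) is not None:
        out.append(ch)
    return "".join(out)


print(read_all("E = M * C ** 2"))
```

Each reload overwrites the half that forward left one half earlier, so a lexeme whose start lies more than one buffer length behind forward is lost; the sketch does not track lexeme_beginning.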
Pitfalls:
1. This buffering scheme works quite well most of the time, but it limits the amount of lookahead.
2. Limited lookahead makes it impossible to recognize tokens in situations where the distance the forward pointer must travel is more than the length of the buffer.
Specification of Tokens
Regular expressions are an important notation for specifying lexeme patterns.
Example
Let: L = { a, b, c, ..., z }
D = { 0, 1, 2, ..., 9 }
• (a|b)*
• a|a*b
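To check which strings these two expressions denote, here is a minimal sketch. It assumes Python's re module (the slides name no implementation), whose |, * and parentheses agree with the notation above for these cases; the helper name matches is hypothetical.

```python
import re


def matches(pattern, s):
    """True when the whole string s belongs to the language of pattern."""
    return re.fullmatch(pattern, s) is not None


# (a|b)* : every string of a's and b's, including the empty string
# a|a*b  : the string "a", or zero or more a's followed by a single b
for s in ["", "a", "abba", "b", "aaab", "ba"]:
    print(repr(s), matches(r"(a|b)*", s), matches(r"a|a*b", s))
```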
Regular Definition
• If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the following form:
d1 → r1
d2 → r2
…
dn → rn
where
• Each di is a new symbol such that di ∉ Σ and di ≠ dj for j < i
• Each ri is a regular expression over Σ ∪ {d1, d2, …, di-1}
Unsigned Number: 1240, 39.45, 6.33E15, or 1.578E-41
digit → 0 | 1 | 2 | … | 9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits optional_fraction optional_exponent
Additional Notation / Shorthand
Shorthand
digit → 0 | 1 | 2 | … | 9
digits → digit+
optional_fraction → ( . digits )?
optional_exponent → ( E ( + | - )? digits )?
num → digits optional_fraction optional_exponent
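The shorthand definition translates almost symbol for symbol into a regular expression in most libraries. The sketch below assumes Python's re module and builds the pattern from the same named pieces; the constant names and the helper is_num are illustrative.

```python
import re

DIGITS = r"[0-9]+"                           # digits → digit+
OPTIONAL_FRACTION = rf"(\.{DIGITS})?"        # optional_fraction → ( . digits )?
OPTIONAL_EXPONENT = rf"(E[+-]?{DIGITS})?"    # optional_exponent → ( E ( + | - )? digits )?
NUM = re.compile(DIGITS + OPTIONAL_FRACTION + OPTIONAL_EXPONENT)


def is_num(s):
    """True when the whole string s is an unsigned number under the definition."""
    return NUM.fullmatch(s) is not None


for s in ["1240", "39.45", "6.33E15", "1.578E-41", "1.", ".5", "1E"]:
    print(s, is_num(s))
```

Strings such as "1." and ".5" are rejected because the definition requires digits on both sides of the decimal point.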
Token Recognition
How can we use concepts developed so far to assist in
recognizing tokens of a source language ?
blank → blank
tab → tab
newline → newline
delim → blank | tab | newline
ws → delim+
In these cases no token is returned to parser
Overall
Regular Expression   Token   Attribute-Value
ws                   -       -
if                   if      -
then                 then    -
else                 else    -
id                   id      pointer to table entry
num                  num     exact value
<                    relop   LT
<=                   relop   LE
=                    relop   EQ
<>                   relop   NE
>                    relop   GT
>=                   relop   GE
Note: Each token has a unique token identifier to define category of lexemes
Constructing Transition Diagrams for Tokens
Example TDs
>= :
state 0 (start) -- > --> state 6
state 6 -- = --> state 7: RTN(GE)
state 6 -- other --> state 8 *: RTN(GT)
* We’ve accepted “>” and have read one extra char that must be unread.
Example : All RELOPs
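state 0 (start) -- < --> state 1
state 1 -- = --> state 2: return(relop, LE)
state 1 -- > --> state 3: return(relop, NE)
state 1 -- other --> state 4 *: return(relop, LT)
state 0 -- = --> state 5: return(relop, EQ)
state 0 -- > --> state 6
state 6 -- = --> state 7: return(relop, GE)
state 6 -- other --> state 8 *: return(relop, GT)
* The extra character read must be unread.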
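A transition diagram like the one for the RELOPs maps directly onto a loop over a state variable. The following sketch assumes Python and uses the state numbers 0 to 8 of the diagram; the function name relop and its return convention (token pair plus the position after the lexeme) are assumptions made for illustration.

```python
def relop(text, pos=0):
    """Follow the relop transition diagram starting at text[pos].

    Returns ((token, attribute), next_pos), or None if no relop starts at pos.
    """
    state = 0
    forward = pos
    while True:
        ch = text[forward] if forward < len(text) else ""   # "" stands for eof
        forward += 1
        if state == 0:
            if ch == "<":
                state = 1
            elif ch == "=":
                return ("relop", "EQ"), forward              # state 5
            elif ch == ">":
                state = 6
            else:
                return None                                  # not a relop
        elif state == 1:
            if ch == "=":
                return ("relop", "LE"), forward              # state 2
            if ch == ">":
                return ("relop", "NE"), forward              # state 3
            return ("relop", "LT"), forward - 1              # state 4 *: unread one char
        else:  # state == 6
            if ch == "=":
                return ("relop", "GE"), forward              # state 7
            return ("relop", "GT"), forward - 1              # state 8 *: unread one char


print(relop("<= 3"))
print(relop(">x"))
```

The two starred states return forward - 1, which is the retraction of the one extra character read.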
Example TDs : id and delim
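id :
state 9 (start) -- letter --> state 10
state 10 -- letter or digit --> state 10
state 10 -- other --> state 11 *
delim :
state 28 (start) -- delim --> state 29
state 29 -- delim --> state 29
state 29 -- other --> state 30 *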
Example TDs : Unsigned #s
digit → 0 | 1 | 2 | … | 9
digits → digit digit*
optional_fraction → . digits | ε
optional_exponent → ( E ( + | - | ε ) digits ) | ε
num → digits optional_fraction optional_exponent
[Figure: transition diagram for unsigned numbers, with edges labeled digit and E; the accepting state executes return(num, install_num())]
Answer
cons → B | C | D | F | G | H | J | … | N | P | … | T | V | … | Z
string → cons* A cons* E cons* I cons* O cons* U cons*
[Figure: transition diagram from start to accept, with edges labeled A, E, I, O, U and other]
Nondeterministic Finite Automata
A nondeterministic finite automaton (NFA) is a
mathematical model that consists of
1. A set of states S
2. A set of input symbols Σ
3. A transition function that maps state/symbol pairs to a set of states
4. A special state s0 called the start state
5. A set of states F (subset of S) of final states
INPUT: string
OUTPUT: yes or no
Example – NFA : (a|b)*abb
S = { 0, 1, 2, 3 }
s0 = 0
F = { 3 }
Σ = { a, b }
[Figure: NFA with start state 0, a loop on state 0 labeled a and b, and edges 0 -a-> 1 -b-> 2 -b-> 3]
ε (null) moves possible: i -ε-> j switches state but does not use any input symbol.
Transition Table
state   input a    input b
0       { 0, 1 }   { 0 }
1       --         { 2 }
2       --         { 3 }
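The transition table can be run directly by tracking the set of states the NFA could be in after each input symbol. The sketch below assumes Python and encodes the table as a dictionary; the names NFA, START, FINAL and nfa_accepts are illustrative, and ε-moves are left out because this particular NFA has none.

```python
# Transition table of the NFA for (a|b)*abb; a missing entry means "no move".
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START = 0
FINAL = {3}


def nfa_accepts(x):
    """INPUT: string x. OUTPUT: True (yes) or False (no)."""
    current = {START}                       # all states reachable so far
    for ch in x:
        following = set()
        for state in current:
            following |= NFA.get((state, ch), set())
        current = following
    return bool(current & FINAL)            # yes if some path ends in a final state


for x in ["abb", "aabb", "babb", "ab", "abba"]:
    print(x, nfa_accepts(x))
```

Tracking a set of states follows every possible path at once, so the string is accepted as soon as any one path ends in state 3.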
How Does an NFA Work?
[Figure: two drawings of the NFA for (a|b)*abb (states 0 to 3); one includes an extra state 4 with edges labeled a, b]
Other Concepts
Not all paths may result in acceptance.
[Figure: the NFA for (a|b)*abb (states 0 to 3), drawn twice]
Relation between RE, NFA and DFA
1. There is an algorithm for converting any RE into an NFA.
2. There is an algorithm for converting any NFA to a DFA.
3. There is an algorithm for converting any DFA to a RE.
These facts tell us that REs, NFAs and DFAs have equivalent expressive
power.
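The second conversion, NFA to DFA, is commonly done with the subset construction, in which each DFA state is a set of NFA states. The sketch below assumes Python, an NFA without ε-moves (the (a|b)*abb automaton from the earlier example), and hypothetical function names; it is one way to realize the conversion, not necessarily the version used in the course.

```python
from collections import deque

# NFA for (a|b)*abb; a missing entry means "no move".
NFA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}


def subset_construction(nfa, start, finals, alphabet):
    """Build a DFA whose states are frozensets of NFA states (no ε-moves)."""
    start_set = frozenset({start})
    dfa = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        state = queue.popleft()
        for ch in alphabet:
            # Union of the NFA moves from every member of the current set.
            target = frozenset(t for s in state for t in nfa.get((s, ch), ()))
            dfa[(state, ch)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    accepting = {s for s in seen if s & finals}
    return dfa, start_set, accepting


def dfa_accepts(dfa, start, accepting, x):
    """Run the DFA on x; x must use only symbols of the alphabet."""
    state = start
    for ch in x:
        state = dfa[(state, ch)]
    return state in accepting


dfa, d_start, d_accept = subset_construction(NFA, 0, {3}, "ab")
print(len({s for (s, _) in dfa}), "DFA states")
print(dfa_accepts(dfa, d_start, d_accept, "aabb"))
```

Only the subsets reachable from the start set are generated, which is why this NFA yields far fewer DFA states than the 16 possible subsets of its four states.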
NFA vs DFA
• An NFA constructed from the RE may be simulated directly by an algorithm
• The algorithm's run time is proportional to |N| * |x|, where |N| is the number of states and |x| is the length of the input
• Alternatively, we can construct a DFA from the NFA and use it to recognize the input
• The space requirement of a DFA can be large. The RE (a|b)*a(a|b)(a|b)…(a|b) [n-1 copies of (a|b) at the end] has no DFA with fewer than 2^n states. Fortunately, such REs do not occur often in practice
       space required   time to simulate
NFA    O(|r|)           O(|r| * |x|)
DFA    O(2^|r|)         O(|x|)