Chapter 2
Lexical Analyzer
[Figure: source program → lexical analyzer → token → parser; the parser asks the lexical analyzer to "get next token"; both access the symbol table]
LEXICAL ANALYZER
• Scan Input
• Remove WS, NL, …
• Identify Tokens
• Create Symbol Table
• Insert Tokens into ST
PARSER
• Perform Syntax Analysis
• Actions Dictated by Token Order
• Update Symbol Table
Handling Lexical Errors
Error handling is very localized with respect to the input source
For example: whil ( x := 0 ) do
generates no lexical errors, because whil is a valid identifier; the misspelling is only caught later, by the parser
In what situations do errors occur?
• A prefix of the remaining input doesn’t match any defined token
Possible error recovery actions:
• Deleting or inserting input characters
• Replacing or transposing characters
• Or, skip over to the next separator to “ignore” the problem
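The last recovery action, skipping ahead to the next separator, might be sketched in C as follows. The function names and the choice of separators (whitespace and `;`) are assumptions made for illustration; the slides do not say which characters count as separators.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: separators are ASSUMED to be whitespace and ';'. */
static int is_separator(char c)
{
    return c == ';' || isspace((unsigned char)c);
}

/* Skip-to-separator recovery: starting at input[pos], where no token
 * pattern matched, skip characters until a separator or the end of the
 * string. Returns the index at which scanning can resume. */
size_t skip_to_separator(const char *input, size_t pos)
{
    assert(input != NULL);
    while (input[pos] != '\0' && !is_separator(input[pos]))
        pos++;
    return pos;
}
```

A scanner would call `skip_to_separator` at the position where matching failed, report the skipped characters as one lexical error, and then resume normal scanning at the returned index.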
Input Buffering
To find the end of a token, the LA may need to go
one or more characters beyond the next
lexeme
E.g., to find the end of an ID, or of >, =, ==
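As a rough illustration of why lookahead is needed, the C sketch below decides between `=` and `==` (and between `>` and `>=`) by inspecting one character past the first. The enum names, the function `scan_op`, and the inclusion of `>=` are assumptions for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Token kinds are hypothetical names chosen for this sketch. */
enum op_kind { OP_NONE, OP_GT, OP_GE, OP_ASSIGN, OP_EQ };

/* Scan an operator at the start of s. The scanner must look one
 * character past '=' or '>' before it knows where the lexeme ends.
 * On return, *len holds the lexeme length (0 if no operator). */
enum op_kind scan_op(const char *s, size_t *len)
{
    assert(s != NULL && len != NULL);
    if (s[0] == '=') {
        if (s[1] == '=') { *len = 2; return OP_EQ; }  /* lookahead is part of lexeme */
        *len = 1; return OP_ASSIGN;                   /* lookahead stays in the input */
    }
    if (s[0] == '>') {
        if (s[1] == '=') { *len = 2; return OP_GE; }
        *len = 1; return OP_GT;
    }
    *len = 0;
    return OP_NONE;
}
```

When the lookahead character does not belong to the lexeme, it is simply not counted in `*len`, so the next call starts scanning at that character.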
Buffer Pairs
Concerned with efficiency issues
Used with a lookahead on the input
[Figure: input buffer holding E = M * C * * 2 eof]
Lexical Analyzer: Implementation Approaches
Formalizing Token Definition
DEFINITIONS:
Language Concepts
A language, L, is simply any set of strings over a fixed alphabet.

Alphabet                           Languages
{0,1}                              {0,10,100,1000,10000,…}
                                   {0,1,00,11,000,111,…}
{a,b,c}                            {abc,aabbcc,aaabbbccc,…}
{A,…,Z}                            {TEE,FORE,BALL,…}
                                   {FOR,WHILE,GOTO,…}
{A,…,Z,a,…,z,0,…,9,+,-,…,<,>,…}    { All legal C/C++ progs }
                                   { All grammatically … }

Special Languages:
∅ - EMPTY LANGUAGE
{ε} - contains the empty string ε only
Language & Regular Expressions
Towards Token Definition
Regular Definitions: Associate names with Regular Expressions
For Example : C/C++ IDs
letter → A | B | C | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
id → letter ( letter | digit )*
Shorthand Notation:
“+” : one or more; r* = r+ | ε and r+ = r r*
“?” : zero or one
[range] : set range of characters (replaces “|”), e.g. [A-Z] = A | B | C | … | Z
Using Shorthand : C/C++ IDs
id → [A-Za-z][A-Za-z0-9]*
We’ll Use Both Techniques
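The regular definition for id translates almost line by line into code. The C sketch below is one possible hand-written matcher; the function names are illustrative, and ASCII letter ranges are assumed.

```c
#include <assert.h>
#include <stddef.h>

/* letter -> A | ... | Z | a | ... | z   (ASCII ranges assumed) */
static int is_letter(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}

/* digit -> 0 | 1 | ... | 9 */
static int is_digit(char c)
{
    return c >= '0' && c <= '9';
}

/* id -> letter ( letter | digit )*
 * Returns the length of the longest prefix of s that matches id,
 * or 0 if s does not start with an identifier. */
size_t match_id(const char *s)
{
    size_t n;
    assert(s != NULL);
    if (!is_letter(s[0]))
        return 0;
    n = 1;
    while (is_letter(s[n]) || is_digit(s[n]))
        n++;
    return n;
}
```

For the input `x1 := 0`, `match_id` reports a lexeme of length 2 (`x1`); for `9lives` it reports 0 because the first character is not a letter.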
Token Recognition
How can we use concepts developed so far to assist
in recognizing tokens of a source language?
Assume Following Tokens:
if, then, else, relop, id, num
What language construct are they used for?
Given Tokens, What are Patterns?
if → if
then → then
else → else
relop → < | <= | > | >= | = | <>
id → letter ( letter | digit )*
num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
What does this represent?
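One way to read the num pattern is as an unsigned number with an optional fraction and an optional exponent. The C sketch below follows the three parts of the pattern in order; the helper names are assumptions, and only an uppercase `E` is accepted because that is what the pattern lists.

```c
#include <assert.h>
#include <stddef.h>

/* Length of the run of decimal digits at the start of s. */
static size_t digits(const char *s)
{
    size_t n = 0;
    while (s[n] >= '0' && s[n] <= '9')
        n++;
    return n;
}

/* num -> digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
 * Returns the length of the longest prefix of s matching num, or 0. */
size_t match_num(const char *s)
{
    size_t n, k, e;
    assert(s != NULL);

    n = digits(s);                 /* digit+ */
    if (n == 0)
        return 0;

    if (s[n] == '.') {             /* optional fraction: . digit+ */
        k = digits(s + n + 1);
        if (k > 0)
            n += 1 + k;
    }

    if (s[n] == 'E') {             /* optional exponent: E (+|-)? digit+ */
        e = n + 1;
        if (s[e] == '+' || s[e] == '-')
            e++;
        k = digits(s + e);
        if (k > 0)
            n = e + k;
    }
    return n;
}
```

A fraction or exponent that is not completed by at least one digit is left out of the lexeme, so `7.` matches only `7`.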
Constructing Transition Diagrams for Tokens
• Transition Diagrams (TD) are used to represent
the tokens; these are automata!
• As characters are read, the relevant TDs are
used to attempt to match lexeme to a pattern
• Each TD has:
• States : Represented by Circles
• Actions : Represented by Arrows between states
• Start State : Beginning of a pattern (Arrowhead)
• Final State(s) : End of pattern (Concentric Circles)
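[Figure: transition diagram for relop (part shown): on = go to state 5, return(relop, EQ); on > go to state 6; from state 6, on = go to state 7, return(relop, GE); on other go to state 8 (*), return(relop, GT)]
[Figure: transition diagram for id: start → state 0; on letter go to state 1; state 1 loops on letter or digit; on other go to state 2 (*), return(id, lexeme)]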
Important Final Notes on Transition Diagrams & Lexical Analyzers
• How does this work?
• What does this do?

state = 0;
token nexttoken()
{
    while (1) {
        switch (state) {
        case 0:
            c = nextchar();
            /* c is lookahead character */
            if (c == blank || c == tab || c == newline) {
                state = 0;
                lexeme_beginning++;
                /* advance beginning of lexeme */
            }
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = fail();
            break;
        …
Tokens / Patterns / Regular Expressions
Lexical Analysis - searches for matches of lexeme
to pattern
Lexical Analyzer returns:
<actual lexeme, symbolic identifier of token>
For Example:
Token       Symbolic ID
if          1
then        2
else        3
>,>=,<,…    4
:=          5
id          6
int         7
real        8
Set of all regular expressions plus symbolic ids plus
analyzer define required functionality.
Non-Deterministic : Has more than one alternative
action for the same input symbol.
Can’t utilize algorithm !
Deterministic : Has at most one action for a given
input symbol.
Both types are used to recognize regular
expressions.
Representing NFAs
Transition Diagrams : Number states (circles),
arcs, final states, …
Transition table for Fig 3 (input alphabet = { a, b }, F = {3}):
state    input a     input b
0        { 0, 1 }    { 0 }
1        --          { 2 }
2        --          { 3 }
ε (null) moves possible: i → j
Switch state but do not use any input symbol
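The Fig 3 transition table can be stored directly as an array of state sets. The C sketch below simulates the NFA by tracking every state it could be in after each input symbol; the bit-mask encoding, the names, and the empty row for state 3 (which the table does not list) are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>

/* A set of NFA states is a bit mask: bit i set means state i is in the set. */
typedef unsigned state_set;
#define S(i) (1u << (i))

/* Transition table of Fig 3: rows are states, columns are inputs a and b.
 * ASSUMPTION: state 3 has no outgoing moves (the table lists rows 0..2 only). */
static const state_set delta[4][2] = {
    /* a             b     */
    { S(0) | S(1),   S(0) },   /* state 0 */
    { 0,             S(2) },   /* state 1 */
    { 0,             S(3) },   /* state 2 */
    { 0,             0    }    /* state 3 */
};

static const state_set final_states = S(3);   /* F = {3} */

/* Accept if at least one path through the NFA ends in a final state. */
int nfa_accepts(const char *input)
{
    state_set current = S(0);                 /* start state 0 */
    size_t i;
    int s;
    assert(input != NULL);

    for (i = 0; input[i] != '\0'; i++) {
        state_set next = 0;
        int col;
        if (input[i] == 'a') col = 0;
        else if (input[i] == 'b') col = 1;
        else return 0;                        /* symbol outside { a, b } */
        for (s = 0; s < 4; s++)
            if (current & S(s))
                next |= delta[s][col];
        current = next;
    }
    return (current & final_states) != 0;
}
```

Because all possible states are tracked at once, the result does not depend on which single path is followed: `nfa_accepts("aabb")` succeeds even though some paths through the diagram never leave state 0.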
NFA- Regular Expressions & Compilation
Problems with NFAs for Regular Expressions:
1. Valid input might not be accepted
2. NFA may behave differently on the same input
Example: for Fig 3, aabb is accepted along path : 0 → 0 → 1 → 2 → 3
BUT… it is not accepted along the valid path: 0 → 0 → 0 → 0 → 0
Relationship of NFAs to Compilation:
1. Regular expression “recognized” by NFA
2. Regular expression is “pattern” for a “token”
3. Tokens are building blocks for lexical analysis
4. Lexical analyzer can be described by a collection
of NFAs. Each NFA is for a language token.
Deterministic Finite Automata (DFA)
NFA to DFA Conversion
Look at the states reachable without consuming
any input, and aggregate them into macro states
Conversion : NFA → DFA Algorithm
[Figure: NFA with states 0 to 8; labeled arcs 2 → 3 on a, 3 → 4 on b, 6 → 7 on c; the remaining arcs are not recoverable]
From State 0, where can we move without consuming
any input?
This forms a new state: 0,1,2,6,8
What transitions are defined for this new state?
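The question "where can we move without consuming any input?" is answered by repeatedly following null moves until no new state appears. The C sketch below does this for the nine-state NFA; the null-move arcs in `eps` are not legible in the figure and are an assumption, chosen only so that the closure of state 0 comes out as the {0,1,2,6,8} stated on the slide.

```c
#include <assert.h>

#define NSTATES 9
typedef unsigned state_set;          /* bit i set means NFA state i is in the set */

/* eps[i] = states reachable from state i by a single null move.
 * ASSUMPTION: these arcs are hypothetical; the slide only gives the
 * resulting closure {0,1,2,6,8} for state 0. */
static const state_set eps[NSTATES] = {
    /* 0 */ (1u << 1) | (1u << 8),
    /* 1 */ (1u << 2) | (1u << 6),
    /* 2 */ 0,
    /* 3 */ 0,
    /* 4 */ (1u << 5),
    /* 5 */ (1u << 1) | (1u << 8),
    /* 6 */ 0,
    /* 7 */ (1u << 5),
    /* 8 */ 0
};

/* Add to the set every state reachable through null moves only,
 * repeating until the set stops growing. */
state_set eps_closure(state_set set)
{
    state_set closure = set, before;
    int s;
    assert((set >> NSTATES) == 0);   /* only states 0..8 exist */
    do {
        before = closure;
        for (s = 0; s < NSTATES; s++)
            if (closure & (1u << s))
                closure |= eps[s];
    } while (closure != before);
    return closure;
}
```

Each macro state of the resulting DFA is one such closure; the transitions of the new state are found by taking the moves on a, b, and c from every member state and closing the result again.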