Chapter 2
Lexical analysis
By Melese Alemante
2025
Outline
Introduction
Interaction of the Lexical Analyzer with the
Parser
Token, pattern, lexeme
Specification of patterns using regular expressions
Regular expressions
Regular expressions for tokens
NFA and DFA
Conversion from RE to NFA to DFA…
Lex Scanner Generator
Creating a Lexical Analyzer with Lex
Regular Expressions in Lex
Lex specifications and examples
Introduction
❑ The role of the lexical analyzer is:
• to read a sequence of characters from the source program,
• to group them into lexemes, and
• to produce as output a token for each lexeme in the source program.
❑ The scanner can also perform the following secondary tasks:
• stripping out blanks, tabs, and new lines
• stripping out comments
• keeping track of line numbers (for error reporting)
Interaction of the Lexical Analyzer
with the Parser
[Figure: the lexical analyzer reads the source program and supplies tokens to the parser on demand; both consult the symbol table, which contains a record for each identifier.]
Token, pattern, lexeme…
Example: The following table shows some tokens and their lexemes in Pascal (a high-level, case-insensitive programming language):

Token   Some lexemes                 Pattern
begin   begin, Begin, BEGIN, beGin…  begin in small or capital letters
if      if, IF, iF, If               if in small or capital letters
ident   Distance, F1, x, Dist1, …    a letter followed by zero or more letters and/or digits
Attributes of tokens…
When a pattern matches more than one lexeme, the lexical analyzer passes an attribute value along with the token, e.g. <id, pointer to the symbol-table entry> or <num, integer value>.
Errors
Very few errors are detected by the lexical analyzer.
For example, if the programmer mistypes begin as ebgin, the lexical analyzer cannot detect the error, since it will consider ebgin a valid identifier.
Nonetheless, if a certain sequence of characters follows none of the specified patterns, the lexical analyzer can detect the error.
Errors…
When an error occurs, the lexical analyzer recovers by:
• skipping (deleting) successive characters from the remaining input until it can find a well-formed token (panic-mode recovery)
• deleting one character from the remaining input
• inserting a missing character into the remaining input
• replacing an incorrect character with a correct one
• transposing two adjacent characters
Errors…
Example:
int num = 42#56; // Invalid character (#)
• The lexer may skip # to recover.
int x = 9$;
• The lexer deletes $, assuming it was a typo.
int y 10; // Missing '='
• The lexer inserts = to correct it: int y = 10;
int val# = 5; // Incorrect: '#' is not allowed in variable names
• The lexer may replace # with _, assuming it was a typo.
pritnf("Hello"); // Incorrect: "pritnf" instead of "printf"
• The lexer detects that "pritnf" is similar to "printf" and corrects it by transposing the two adjacent characters.
Specification of patterns using
regular expressions
Regular expressions
Regular expressions for tokens
Regular expression: Definitions
A regular expression over an alphabet Σ is defined inductively:
• ε is a regular expression denoting the language {ε}.
• Each symbol a in Σ is a regular expression denoting the language {a}.
• If r and s are regular expressions denoting the languages L(r) and L(s), then r|s denotes L(r) ∪ L(s), rs denotes L(r)L(s), and r* denotes (L(r))*.
Regular expression: Language Operations
Union of L and M:
L ∪ M = {s | s ∈ L or s ∈ M}
Concatenation of L and M:
LM = {xy | x ∈ L and y ∈ M}
Exponentiation of L:
L^0 = {ε}; L^i = L^(i-1)L
Kleene closure of L:
L* = ∪ i=0,…,∞ L^i
Positive closure of L:
L+ = ∪ i=1,…,∞ L^i
The following shorthands are often used:
r+ = rr*
r* = r+ | ε
r? = r | ε   (optional operator)
Examples
L1 = {a,b,c,d}, L2 = {1,2}
L1 ∪ L2 = {a,b,c,d,1,2}
L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}
L1* = all strings over the letters a,b,c,d, including the empty string.
L1+ = the set of all strings of one or more of the letters a,b,c,d; the empty string is not included.
Regular expressions…
Examples (more):
1. a | b = {a,b}
2. (a|b)a = {aa,ba}
3. (ab) | ε = {ab, ε}
4. ((a|b)a)* = {ε, aa, ba, aaaa, baba, …}
5. Even binary numbers: (0|1)*0
6. Consider an alphabet of just three characters, Σ = {a,b,c}, and the set of all strings over this alphabet that contain exactly one b:
(a|c)*b(a|c)*, e.g. {b, abc, abaca, baaaac, ccbaca, cccccb}
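Example 6 can also be checked mechanically: a minimal hand-written recognizer in C for (a|c)*b(a|c)* (the function name matches_one_b is our own, not from any tool):

```c
/* Recognizer for (a|c)*b(a|c)* over the alphabet {a,b,c}:
   accepts exactly the strings that contain exactly one 'b'.
   A hand-written sketch, not generated by Lex. */
int matches_one_b(const char *s) {
    int bs = 0;                        /* number of 'b's seen so far */
    for (; *s; s++) {
        if (*s == 'b') bs++;
        else if (*s != 'a' && *s != 'c')
            return 0;                  /* symbol outside the alphabet */
    }
    return bs == 1;
}
```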
Exercises
Describe the languages denoted by the following
regular expressions:
1 a(a|b)*a
2 ((ε|a)b*)*
3 (a|b)*a(a|b)(a|b)
4 a*ba*ba*ba*
5 (aa|bb)*((ab|ba)(aa|bb)*(ab|ba)(aa|bb)*)*
Regular expressions for tokens
Regular expressions for tokens…
Special symbols: including arithmetic operators, assignment and equality, such as =, :=, +, -, *
Identifiers: which are defined to be a sequence of letters and digits beginning with a letter.
We can express this in terms of regular definitions as follows:
letter = A|B|…|Z|a|b|…|z
digit = 0|1|…|9
or
letter = [a-zA-Z]
digit = [0-9]
identifier = letter(letter|digit)*
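This identifier definition translates directly into code; a minimal sketch in C (is_identifier is our own helper name, and note that underscores are not letters under this definition):

```c
#include <ctype.h>

/* Recognizer for identifier = letter(letter|digit)*. */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)*s))
        return 0;                         /* must start with a letter */
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s))
            return 0;                     /* then letters/digits only */
    return 1;
}
```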
Regular expressions for tokens…
Numbers: Numbers can be:
• a sequence of digits (natural numbers), or
• decimal numbers, or
• numbers with an exponent (indicated by an e or E).
Example: 2.71E-2 represents the number 0.0271.
We can write regular definitions for these numbers as follows:
nat = [0-9]+
signedNat = (+|-)? nat
number = signedNat(“.” nat)?(E signedNat)?
Literals or constants: which can include:
• numeric constants such as 42, and
• string literals such as “hello, world”.
Example: Divide the following Java program into
appropriate tokens.
public class Dog {
private String name;
private String color;
Transition diagram that recognizes the lexemes
matching the token relop and id.
Design of a Lexical Analyzer/Scanner
Finite Automata
❑ Lex – turns its input program into a lexical analyzer.
❑ At the heart of the transition is the formalism known as finite automata.
❑ Finite automata are graphs, like transition diagrams, with a few differences:
1. Finite automata are recognizers; they simply say "yes" or "no" about each possible input string.
2. Finite automata come in two flavors:
a) Nondeterministic finite automata (NFA) have no restrictions on the labels of their edges; ε, the empty string, is a possible label.
b) Deterministic finite automata (DFA) have, for each state and for each symbol of the input alphabet, exactly one edge with that symbol leaving that state.
The Whole Scanner Generator Process
Overview
❑ Direct construction of a Nondeterministic Finite Automaton (NFA) to recognize a given regular expression.
• Easy to build in an algorithmic way
• Requires ε-transitions to combine regular subexpressions
❑ Construct a Deterministic Finite Automaton (DFA) to simulate the NFA (optional).
• Use a set-of-states construction
❑ Minimize the number of states in the DFA.
❑ Generate the scanner code.
Design of a Lexical Analyzer …
Token ➔ Pattern
Pattern ➔ Regular Expression
Regular Expression ➔ NFA
NFA ➔ DFA
DFA’s or NFA’s for all tokens ➔ Lexical Analyzer
Non-Deterministic Finite Automata
(NFA)
Definition
An NFA M is a five-tuple (Σ, S, T, s0, F):
• a set of input symbols Σ, the input alphabet,
• a finite set of states S,
• a transition function T: S × (Σ ∪ {ε}) -> 2^S, giving the set of possible next states,
• a start state s0 from S, and
• a set of accepting/final states F ⊆ S.
The language accepted by M, written L(M), is defined as:
the set of strings of characters c1c2…cn, with each ci from Σ ∪ {ε}, such that there exist states s1 in T(s0,c1), s2 in T(s1,c2), …, sn in T(sn-1,cn), with sn an element of F.
NFA…
It is a finite automaton which has a choice of edges:
• The same symbol can label edges from one state to several different states.
• An edge may be labeled by ε, the empty string, so we can have transitions without consuming any input character.
Transition Graph
The transition graph for an NFA recognizing the language of regular expression (a|b)*abb
(all strings of a's and b's ending in the particular string abb):
[Figure: state 0 has self-loops on a and b, and edges 0 -a-> 1 -b-> 2 -b-> 3; state 3 is accepting.]
S = {0,1,2,3}
Σ = {a,b}
s0 = 0
F = {3}
Transition Table
The mapping T of an NFA can be represented in a transition table:

State   Input a   Input b   Input ε
0       {0,1}     {0}       ∅
1       ∅         {2}       ∅
2       ∅         {3}       ∅

Two possible runs on input aabb:
0 -a-> 0 -a-> 1 -b-> 2 -b-> 3   YES (ends in the accepting state 3)
0 -a-> 0 -a-> 0 -b-> 0 -b-> 0   NO
Another NFA
[Figure: from the start state, an a-edge leads to an accepting state with a self-loop on a, and a b-edge leads to an accepting state with a self-loop on b.]
aa*|bb*
Deterministic Finite Automata (DFA)
A DFA is a special case of an NFA in which:
• there are no moves on input ε, and
• for each state s and input symbol a, there is exactly one edge out of s labeled a.
DFA example
A DFA that accepts (a|b)*abb:
[Figure: states 0-3 with transitions 0 -a-> 1, 0 -b-> 0, 1 -a-> 1, 1 -b-> 2, 2 -a-> 1, 2 -b-> 3, 3 -a-> 1, 3 -b-> 0; state 3 is accepting.]
Simulating a DFA: Algorithm
How to apply a DFA to a string.
INPUT:
• An input string x terminated by an end-of-file character eof.
• A DFA D with start state s0, accepting states F, and transition function move.
OUTPUT: Answer "yes" if D accepts x; "no" otherwise.
METHOD:
• Apply the algorithm on the next slide to the input string x.
• The function move(s, c) gives the state to which there is an edge from state s on input c.
• The function nextChar() returns the next character of the input string x.
Simulating a DFA
s = s0;
c = nextChar();
while ( c != eof ) {
    s = move(s, c);
    c = nextChar();
}
if ( s is in F ) return "yes";
else return "no";

(The DFA used here is the one accepting (a|b)*abb.)
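The algorithm above, instantiated for the DFA accepting (a|b)*abb, can be sketched in C with an explicit transition table (accepts_abb and the 0-3 state numbering are our own choices):

```c
/* Simulate the DFA for (a|b)*abb: states 0-3, accepting state 3.
   State s records how much of "abb" the most recent input matches. */
int accepts_abb(const char *x) {
    /* move[s][c]: c = 0 for 'a', c = 1 for 'b' */
    static const int move[4][2] = {
        {1, 0},   /* state 0: a -> 1, b -> 0 */
        {1, 2},   /* state 1: a -> 1, b -> 2 */
        {1, 3},   /* state 2: a -> 1, b -> 3 */
        {1, 0},   /* state 3: a -> 1, b -> 0 */
    };
    int s = 0;                                 /* start state s0 */
    for (; *x; x++) {
        if (*x != 'a' && *x != 'b') return 0;  /* not in the alphabet */
        s = move[s][*x == 'b'];
    }
    return s == 3;                             /* "yes" iff s is in F */
}
```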
Design of a Lexical Analyzer Generator
Two algorithms:
1. Translate a regular expression into an NFA (Thompson’s construction).
2. Translate the NFA into a DFA (subset construction).
Rules:
1. For ε, and for each single symbol a, construct:
[Figure: a start state connected to an accepting state by a single edge labeled ε or a.]
From regular expression to an NFA…
2- For a composition of regular expression:
Case 1: Alternation: for the regular expression s|r, assume that NFAs equivalent to r and s have been constructed.
[Figure: a new start state with ε-edges to the start states of the NFAs for s and r, and ε-edges from their accepting states to a new accepting state.]
From regular expression to an NFA…
Case 2: Concatenation: for the regular expression sr, connect the accepting state of the NFA for s to the start state of the NFA for r with an ε-edge.
Case 3: Repetition: for r*, add a new start state and a new accepting state, with ε-edges that allow the NFA for r to be skipped entirely or traversed repeatedly.
From RE to NFA: Exercises
From an NFA to a DFA
(subset construction algorithm)
Rules:
• The start state of D is ε-closure(s0), where s0 is the start state of N; it is initially unmarked.
• For each unmarked state S of D and each input symbol a, compute ε-closure(move(S, a)); add it as a new (unmarked) state of D if it is not already present, then mark S.
• A state of D is accepting if it contains at least one accepting state of N.
NFA to a DFA…
ε-closure
ε-closure(S’) is the set of states with the following characteristics:
1. S’ ∈ ε-closure(S’) itself.
2. If t ∈ ε-closure(S’) and there is an edge labeled ε from t to v, then v ∈ ε-closure(S’).
3. Repeat step 2 until no more states can be added to ε-closure(S’).
E.g., for the NFA of (a|b)*abb:
ε-closure(0) = {0, 1, 2, 4, 7}
ε-closure(1) = {1, 2, 4}
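The ε-closure computation (steps 1-3 above) can be sketched with bitsets; the ε-edge table below is for the Thompson NFA of (a|b)*abb with the same 0-10 state numbering as the example, and eps_closure is our own name:

```c
#define NSTATES 11

/* eps[s] = bitmask of states reachable from s by a single ε-edge,
   for the Thompson NFA of (a|b)*abb (states 0-10). */
static const unsigned eps[NSTATES] = {
    [0] = (1u << 1) | (1u << 7),
    [1] = (1u << 2) | (1u << 4),
    [3] = 1u << 6,
    [5] = 1u << 6,
    [6] = (1u << 1) | (1u << 7),
};

/* Repeatedly add ε-successors until nothing new appears (step 3). */
unsigned eps_closure(unsigned set) {
    unsigned result = set, frontier = set;
    while (frontier) {
        unsigned next = 0;
        for (int s = 0; s < NSTATES; s++)
            if (frontier & (1u << s))
                next |= eps[s];
        frontier = next & ~result;   /* keep only genuinely new states */
        result |= next;
    }
    return result;
}
```

With this table, eps_closure of {0} yields {0,1,2,4,7}, matching ε-closure(0) above.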
NFA for identifier: letter(letter|digit)*
[Figure: start state 0 -letter-> 1 -ε-> 2; from 2, ε-edges to 3 and 5; 3 -letter-> 4 and 5 -digit-> 6; ε-edges from 4 and 6 to 7; from 7, an ε-edge back to 2 and an ε-edge to the accepting state 8.]
NFA to a DFA…
Example: Convert the NFA for letter(letter|digit)* into the corresponding DFA.
A = {0}
B = {1, 2, 3, 5, 8}
C = {4, 7, 2, 3, 5, 8}
D = {6, 7, 8, 2, 3, 5}
[Figure: DFA with start state A; A -letter-> B; B -letter-> C and B -digit-> D; C -letter-> C and C -digit-> D; D -letter-> C and D -digit-> D; B, C, and D are accepting.]
Exercise: convert the NFA of (a|b)*abb into a DFA.
Other Algorithms
The Lexical- Analyzer Generator: Lex
The first phase in a compiler reads the input source and converts strings in the source to tokens.
Lex: generates a scanner (lexical analyzer or lexer) given a specification of the tokens using REs.
• The input notation for the Lex tool is referred to as the Lex language, and
• the tool itself is the Lex compiler.
• The Lex compiler transforms the input patterns into a transition diagram and generates code, in a file called lex.yy.c, that simulates this transition diagram.
General Compiler Infra-structure
[Figure: program source (a stream of characters) -> Scanner (tokenizer) -> tokens -> Parser -> parse tree -> Semantic Routines -> annotated/decorated tree -> Analysis/Transformations/optimizations -> IR (Intermediate Representation) -> Code Generator -> assembly code; the Symbol and literal Tables are consulted throughout.]
Scanner, Parser, Lex and Yacc
Generating a Lexical Analyzer using Lex
Lex is a scanner generator: it takes a lexical specification as input, and produces a lexical analyzer written in C.

lex.l (Lex source program) -> Lex compiler -> lex.yy.c
lex.yy.c -> C compiler -> a.out (the lexical analyzer)
Lex specification
➢ Program structure:
...declaration section...
%%
...rules section...
%%
...user defined functions...
• Declaration section – variables, constants; C declarations go between %{ and %}.
• Rules section – pairs of regular expression <--> action, one per line:
P1 { action1 }
P2 { action2 }
• The actions are C program fragments.
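A complete minimal specification in this layout might look as follows (a sketch: it counts lines and words; the variable names and the word substitution are our own):

```lex
%{
#include <stdio.h>
int lines = 0, words = 0;     /* C declarations, copied verbatim */
%}
word    [a-zA-Z]+
%%
\n        { lines++; }
{word}    { words++; }
.         { /* ignore any other character */ }
%%
int main(void) {
    yylex();
    printf("%d lines, %d words\n", lines, words);
    return 0;
}
int yywrap(void) { return 1; }
```

Running Lex on this file produces lex.yy.c, which is then compiled with a C compiler as shown above.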
Skeleton of a lex specification (.l file)
x.l (the *.c file is generated after running Lex)
%{
< C global variables, prototypes, comments >
%}
(this part is copied as-is to the top of the generated C file)
[DEFINITION SECTION]
(substitutions simplify pattern matching)
Simulating a combined NFA (built by Thompson’s construction) on input aaba:
ε-closure({0}) = {0,1,3,7}
move({0,1,3,7}, a) = {2,4,7}
ε-closure({2,4,7}) = {2,4,7}
move({2,4,7}, a) = {7}
ε-closure({7}) = {7}
move({7}, b) = {8}
ε-closure({8}) = {8}
move({8}, a) = ∅
Combining and simulation of NFAs of a Set of
Regular Expressions: Example 2
Patterns:
a      { action1 }
abb    { action2 }
a*b+   { action3 }
[Figure: NFAs for the three patterns - a: 1 -a-> 2; abb: 3 -a-> 4 -b-> 5 -b-> 6; a*b+: 7 (self-loop on a) -b-> 8 (self-loop on b) - combined by a new start state 0 with ε-edges to 1, 3, and 7.]
When two or more accepting states are reached, the action of the pattern listed first is executed.
Simulating the combined NFA on input abb:
{0,1,3,7} -a-> {2,4,7}   (2 accepting: action1)
{2,4,7} -b-> {5,8}       (8 accepting: action3)
{5,8} -b-> {6,8}         (6 and 8 accepting: action2, since abb is listed before a*b+)
DFA's for Lexical Analyzers
NFA → DFA: transition table for the DFA.

State   a      b      Token found
0137    247    8      None
247     7      58     a
8       -      8      a*b+
7       7      8      None
58      -      68     a*b+
68      -      8      abb
Pattern matching examples
Meta-characters
Characters with special meaning in Lex patterns include: . * + ? | ( ) [ ] { } ^ $ / \ " < >
To match one of them literally, quote it (".") or escape it (\.).
Lex Regular Expression: Examples
• an integer: 12345
[1-9][0-9]*
• a word: cat
[a-zA-Z]+
• a (possibly) signed integer: 12345 or -12345
[-+]?[1-9][0-9]*
• a floating point number: 1.2345
[0-9]*"."[0-9]+
Regular Expression: Examples…
• a delimiter for an English sentence:
"." | "?" | !   OR
[".""?"!]
• C++ comment: // call foo() here!!
"//".*
• white space:
[ \t]+
• English sentence: Look at this!
([ \t]+|[a-zA-Z]+)+("."|"?"|!)
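Outside of Lex, the same patterns can be tried out with the POSIX regex API (regcomp/regexec with REG_EXTENDED); full_match is our own helper, and the anchoring imitates a scanner rule matching a whole lexeme:

```c
#include <regex.h>
#include <stdio.h>

/* Return 1 if the whole string s matches the extended RE pattern,
   0 if it does not, and -1 if the pattern fails to compile. */
int full_match(const char *pattern, const char *s) {
    char anchored[256];
    regex_t re;
    snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

Note that POSIX EREs differ from Lex patterns in small ways (e.g. no {name} substitutions), so this is only an approximation for experimenting.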
Two Rules
1. Longest match: the scanner chooses the longest prefix of the remaining input that matches some pattern.
2. Rule priority: if the longest prefix matches two or more patterns, the pattern listed first in the Lex specification wins.
Suppose the input is ifx and the lex specification includes:
if → return IF_TOKEN
[a-zA-Z]+ → return IDENTIFIER_TOKEN
Without the longest match rule, lex might match just i or if and stop, but
with this rule, it scans ahead:
• It sees if (matches IF_TOKEN, length 2).
• It continues to ifx and checks if a longer match is possible (e.g., ifx as
an identifier, length 3).
If no longer match is found for ifx as a single token, it commits to if as
IF_TOKEN and leaves x for the next tokenization.
However, if ifx is defined as an identifier, lex would match the entire ifx
(length 3) as IDENTIFIER_TOKEN if it’s the longest possible match.
Suppose the input is if and the lex specification includes:
if        return IF_TOKEN;          // Defined first
[a-z]+    return IDENTIFIER_TOKEN;  // Defined second
Both patterns match the string if (length 2), so the longest-match rule does not decide between them; rule priority does, and if is returned as IF_TOKEN because that rule is listed first.
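The interaction of the two rules can be illustrated with a toy scanner for just these two patterns, "if" listed first and [a-zA-Z]+ second (lex_first is our own name, not part of Lex):

```c
#include <string.h>
#include <ctype.h>

/* Classify the first token of s: the longest match wins; on a tie,
   the pattern listed first ("if") wins. *len gets the lexeme length. */
const char *lex_first(const char *s, int *len) {
    int id_len = 0;
    while (isalpha((unsigned char)s[id_len]))
        id_len++;                                    /* [a-zA-Z]+ */
    int kw_len = (strncmp(s, "if", 2) == 0) ? 2 : 0; /* "if" */
    if (id_len == 0) { *len = 0; return "ERROR"; }   /* no pattern matches */
    if (id_len > kw_len) {                           /* longest match rule */
        *len = id_len;
        return "IDENTIFIER_TOKEN";
    }
    *len = kw_len;                                   /* tie: rule priority */
    return "IF_TOKEN";
}
```

On input ifx the identifier match (length 3) beats the keyword match (length 2); on input if both have length 2 and the keyword rule, listed first, wins.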
Lex functions
• yylex() – the generated scanner; returns the next token from the input.
• yywrap() – called at end of input; returning 1 stops scanning.
• yymore() – appends the next match to the current yytext.
• yyless(n) – pushes back all but the first n characters of the current match.
Lex predefined variables
• yytext – the text (lexeme) of the current match, as a C string.
• yyleng – the length of yytext.
• yyin / yyout – the input and output streams (default stdin and stdout).
Thank you!