Unit II - Lexical Analysis-20-1-2021

LEXICAL ANALYSIS

UNIT II
Contents
• Role of the lexical analyzer
• Specification of tokens
• Recognition of tokens
• Lexical analyzer generator
• Finite automata
• From regular expressions to NFA
• Design of a lexical analyzer generator
• Optimization of DFA-based pattern matchers
The role of the lexical analyzer

                         token
  Source  →  Lexical  ----------→  Parser  →  to semantic
  program    Analyzer ←----------              analysis
                      getNextToken
                 \               /
                   Symbol table
Introduction
• A simple way to build a lexical analyzer is to construct a diagram that illustrates the structure of each token
• The techniques used to implement lexical analyzers are also applicable in query languages and information-retrieval systems
• Pattern-matching algorithms can be utilized
• Two secondary tasks: removal of white space and comments, and correlating error messages with the source program
• The phase is sometimes divided into scanning and lexical analysis proper
Why separate lexical analysis and parsing?

1. Simplicity of design
2. Improved compiler efficiency: specialized buffering techniques for reading input characters and processing tokens can significantly speed up the compiler
3. Enhanced compiler portability
Tokens, Patterns and Lexemes
• A token is a pair consisting of a token name and an optional token value
• A pattern is a description of the form that the lexemes of a token may take
• A lexeme is a sequence of characters in the source program that matches the pattern for a token
Example

Token       Informal description                    Sample lexemes
if          characters i, f                         if
else        characters e, l, s, e                   else
comparison  < or > or <= or >= or == or !=          <=, !=
id          letter followed by letters and digits   pi, score, D2
number      any numeric constant                    3.14159, 0, 6.02e23
literal     anything but " surrounded by "          "core dumped"
Attributes for tokens
• When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about that lexeme
• E = M * C ** 2 is tokenized as:
  <id, pointer to symbol-table entry for E>
  <assign-op>
  <id, pointer to symbol-table entry for M>
  <mult-op>
  <id, pointer to symbol-table entry for C>
  <exp-op>
  <number, integer value 2>
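The token stream above can be produced by a small pattern-matching scanner. A minimal sketch in Python, with illustrative token names (`id`, `assign_op`, `mult_op`, `exp_op`, `number`) that are assumptions, not part of any standard:

```python
import re

# Token specification: (token name, pattern). Order matters: '**' must be
# tried before '*', so exp_op precedes mult_op.
TOKEN_SPEC = [
    ("number",    r"\d+(?:\.\d+)?"),
    ("id",        r"[A-Za-z_][A-Za-z_0-9]*"),
    ("exp_op",    r"\*\*"),
    ("mult_op",   r"\*"),
    ("assign_op", r"="),
    ("ws",        r"[ \t]+"),        # matched but not emitted
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(src):
    """Return the <token-name, lexeme> pairs for a source string."""
    tokens = []
    for m in MASTER.finditer(src):
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("E = M * C ** 2"))
```

Each match's group name plays the role of the token name, and the matched text is the lexeme; a real analyzer would store a symbol-table pointer for each `id` instead of the raw lexeme.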
Lexical errors
• Some errors are beyond the power of the lexical analyzer to recognize:
  fi (a == f(x)) …
• However, it may be able to recognize errors like:
  d = 2r
• Such errors are recognized when no pattern for tokens matches a character sequence
Error recovery
• Panic-mode recovery: the simplest recovery strategy, in which successive characters are deleted until we reach a well-formed token
• Other recovery options:
  • Deleting an extraneous character
  • Inserting a missing character
  • Replacing an incorrect character with the correct one
  • Transposing two adjacent characters
Input Buffering

• To ensure that the right lexeme is found, one or more characters have to be looked ahead beyond the next lexeme.
• Techniques for speeding up the lexical analyzer, such as the use of sentinels to mark the buffer end, have been adopted.
• There are three general approaches to implementing a lexical analyzer:
  1. Use a lexical analyzer generator (the Lex tool) to produce the LA from a regular-expression-based specification
  2. Write the LA in a systems programming language, using its I/O facilities
  3. Write the LA in assembly language
• Going down this list, the approaches are harder to implement but generate a faster LA.
Input Buffering
• Two types of buffering:
  1. One buffer
  2. Two buffers
• The two-buffer scheme consists of two buffers of N characters each, which are reloaded alternately.
• Two pointers, lexemeBegin and forward, are maintained.
• lexemeBegin points to the beginning of the current lexeme, which is yet to be found.
• forward scans ahead until a match for a pattern is found.
• Once a lexeme is found, forward is at the character at its right end, and lexemeBegin is then set to the character immediately after the lexeme just found.
• The current lexeme is the set of characters between the two pointers.
What are Buffer Pairs?

• A specialized buffering technique used to reduce the overhead required to process a single input character while moving characters through the buffers.
Sentinels

  E | = | M | eof | * | C | * | * | 2 | eof | … | eof

• The eof characters are sentinels: one marks the end of each buffer half (and the last marks the true end of input), so the end-of-buffer test is combined with the test for the current character.
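A sketch of how a sentinel removes the per-character bounds check. The single-buffer layout and the helper name `scan_lexeme` are assumptions for illustration; a real scanner keeps two halves and reloads on reaching each half's sentinel:

```python
EOF = "\0"  # sentinel character placed at the end of the buffer

def scan_lexeme(buf, begin):
    """Advance `forward` from `begin` over letters/digits. The sentinel
    lets the loop test only the current character: no separate
    end-of-buffer comparison is needed per character."""
    forward = begin
    while True:
        c = buf[forward]
        if c == EOF:
            # a real scanner would reload the other buffer half here;
            # in this sketch the sentinel means true end of input
            break
        if not c.isalnum():
            break
        forward += 1
    return buf[begin:forward], forward

buf = "count1 = 0" + EOF
print(scan_lexeme(buf, 0))   # ('count1', 6)
```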
Specification of tokens
• In the theory of compilation, regular expressions are used to formalize the specification of tokens
• Regular expressions are a means for specifying regular languages
• Example:
  letter_ (letter_ | digit)*
• Each regular expression is a pattern specifying the form of strings
Regular expressions
• ε is a regular expression, L(ε) = {ε}
• If a is a symbol in Σ, then a is a regular expression, L(a) = {a}
• (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
• (r)(s) is a regular expression denoting the language L(r)L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn

• Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Notational Shorthands
• One or more instances: (r)+
• Zero or one instance: r?
• Character classes: [abc]

• Example:
  letter_ -> [A-Za-z_]
  digit -> [0-9]
  id -> letter_ (letter_ | digit)*
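These character-class shorthands map directly onto, for example, Python regular expressions. A quick check of the `id` definition (variable names here are illustrative):

```python
import re

# The regular definitions above, written with character classes
letter_ = r"[A-Za-z_]"
digit = r"[0-9]"
id_re = re.compile(letter_ + "(?:" + letter_ + "|" + digit + ")*")

for s in ["pi", "score", "D2", "_tmp", "2fast"]:
    print(s, bool(id_re.fullmatch(s)))
```

`2fast` is rejected because the first character must be a letter or underscore, exactly as the definition requires.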
Recognition of tokens
• The starting point is the language grammar, to understand the tokens:
stmt -> if expr then stmt
      | if expr then stmt else stmt

expr -> term relop term
      | term
term -> id
      | number
Recognition of tokens (cont.)
• The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
• We also need to handle whitespace:
delim -> blank | tab | newline
ws -> delim+
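The `number` pattern can be checked directly with a regular-expression engine. A sketch using Python's `re`, keeping the uppercase `E` exponent marker from the pattern above:

```python
import re

digit  = r"[0-9]"
digits = digit + "+"
# number -> digits (. digits)? (E [+-]? digits)?
number = re.compile(digits + r"(\." + digits + r")?(E[+-]?" + digits + ")?")

for s in ["0", "3.14159", "6.02E23", "1E-9", "3."]:
    print(s, bool(number.fullmatch(s)))
```

Note that `3.` is rejected: the optional fraction part requires digits after the dot, so the match cannot cover the whole string.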
Transition diagrams
• Transition diagram for relop
Transition diagrams (cont.)
• Transition diagram for reserved words and identifiers
Transition diagrams (cont.)
• Transition diagram for unsigned numbers
Transition diagrams (cont.)
• Transition diagram for whitespace
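A transition diagram is typically hand-coded as a function that advances an input pointer and retracts when it has read one character too many. A sketch of the relop diagram in Python; the attribute names (`LT`, `LE`, …) are illustrative assumptions:

```python
# Hand-coded version of the relop transition diagram:
# relop -> < | <= | <> | = | > | >=
def relop(s, i=0):
    """Return ((token, attribute), next_index), or None if no relop starts
    at s[i]."""
    c = s[i] if i < len(s) else ""
    if c == "<":
        nxt = s[i + 1] if i + 1 < len(s) else ""
        if nxt == "=":
            return ("relop", "LE"), i + 2
        if nxt == ">":
            return ("relop", "NE"), i + 2
        return ("relop", "LT"), i + 1   # retract: only '<' was part of the token
    if c == "=":
        return ("relop", "EQ"), i + 1
    if c == ">":
        nxt = s[i + 1] if i + 1 < len(s) else ""
        if nxt == "=":
            return ("relop", "GE"), i + 2
        return ("relop", "GT"), i + 1   # retract
    return None

print(relop("<= b"))   # (('relop', 'LE'), 2)
```

The two "retract" branches correspond to the starred states of the diagram, where the lookahead character is pushed back onto the input.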
Lexical Analyzer Generator - Lex

  Lex source program (lex.l)  →  Lex compiler  →  lex.yy.c
  lex.yy.c                    →  C compiler    →  a.out
  Input stream                →  a.out         →  sequence of tokens
Structure of Lex programs

declarations
%%
translation rules      pattern { action }
%%
auxiliary functions/code
Finite Automata
• Regular expressions = specification
• Finite automata = implementation

• A finite automaton consists of
  • An input alphabet Σ
  • A set of states S
  • A start state n
  • A set of accepting states F ⊆ S
  • A set of transitions: state →(input) state
Finite Automata
• The transition
  s1 →a s2
• is read:
  in state s1, on input "a", go to state s2

• At end of input:
  • if in an accepting state => accept, otherwise => reject
• If no transition is possible => reject
Finite Automata State Graphs
• A state: drawn as a circle
• The start state: marked with an incoming arrow
• An accepting state: drawn as a double circle
• A transition: an arrow between states, labeled with an input symbol such as a
A Simple Example
• A finite automaton that accepts only "1"
• A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start state to some accepting state
Another Simple Example
• A finite automaton accepting any number of 1's followed by a single 0
• Alphabet: {0, 1}
• Accepted string examples: 110, 1110

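Because this automaton is deterministic, it can be run straight off a transition table. A sketch with assumed state names (`S` start, `F` accepting, `D` dead):

```python
# DFA for the language 1*0: any number of 1's followed by a single 0
DELTA = {
    ("S", "1"): "S",   # stay while reading 1's
    ("S", "0"): "F",   # the single 0 leads to the accepting state
    ("F", "1"): "D",   # anything after the 0 is rejected
    ("F", "0"): "D",
    ("D", "1"): "D",
    ("D", "0"): "D",
}

def accepts(w):
    state = "S"
    for ch in w:
        state = DELTA[(state, ch)]
    return state == "F"

for w in ["0", "10", "110", "1110", "11", "100"]:
    print(w, accepts(w))
```

The explicit dead state `D` makes the table total: every (state, symbol) pair has exactly one entry, which is the defining property of a DFA.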
Epsilon Moves
• Another kind of transition: ε-moves
  A →ε B
• The machine can move from state A to state B without reading input
Deterministic and Nondeterministic Automata

• Deterministic Finite Automata (DFA)
  • One transition per input per state
  • No ε-moves
• Nondeterministic Finite Automata (NFA)
  • Can have multiple transitions for one input in a given state
  • Can have ε-moves
• Finite automata have finite memory
  • Need only encode the current state
Execution of Finite Automata
• A DFA can take only one path through the state graph
  • Completely determined by the input
• NFAs can choose
  • Whether to make ε-moves
  • Which of multiple transitions to take for a single input
NFA vs. DFA (1)
• NFAs and DFAs recognize the same set of languages (the regular languages)
• DFAs are easier to implement
  • There are no choices to consider
Next

  Lexical Specification → Regular expressions → NFA → DFA → Table-driven implementation of DFA
Thompson's Construction
• Input: a regular expression r
• Output: an NFA accepting L(r)
• Method:
  • Break r into its constituent subexpressions
  • Construct an NFA for each basic symbol
  • Combine the NFAs to get the final one
Properties
• Each state has a unique name
• The NFA for any r has exactly one start state and one accepting state
• N(r) has at most twice as many states as the number of symbols and operators in r
• Each state of the NFA for r has either one outgoing transition on a symbol or at most two outgoing ε-transitions
Regular Expressions to NFA (1)
• For each kind of regular expression, define an NFA
• Notation: the NFA for regular expression A is drawn as a box labeled A, with one start state and one accepting state

• For ε:        start →ε accept
• For input a:  start →a accept

Regular Expressions to NFA (2)
• For AB: connect the accepting state of A to the start state of B with an ε-move
• For A | B: a new start state with ε-moves into A and into B, and ε-moves from the accepting states of A and B into a new accepting state

Regular Expressions to NFA (3)
• For A*: a new start state with ε-moves into A and into a new accepting state; the accepting state of A has ε-moves back to the start of A and on to the new accepting state
Example of RegExp -> NFA conversion

• Consider the regular expression (1 | 0)* 1
• The NFA (states A-J, with J accepting), written as a transition list:
  A →ε B,  A →ε H
  B →ε C,  B →ε D
  C →1 E,  D →0 F
  E →ε G,  F →ε G
  G →ε B,  G →ε H
  H →ε I,  I →1 J

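The NFA for (1 | 0)* 1 can be simulated directly by tracking the set of reachable states and closing under ε-moves. The sketch below assumes one possible Thompson-style labeling, with states A-J and J accepting:

```python
# Thompson-style NFA for (1|0)*1, simulated via epsilon-closures.
EPS = {
    "A": ["B", "H"], "B": ["C", "D"], "E": ["G"], "F": ["G"],
    "G": ["B", "H"], "H": ["I"],
}
MOVE = {("C", "1"): "E", ("D", "0"): "F", ("I", "1"): "J"}

def eclosure(states):
    """All NFA states reachable from `states` via epsilon-moves alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, []):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(w):
    cur = eclosure({"A"})
    for ch in w:
        cur = eclosure({MOVE[(s, ch)] for s in cur if (s, ch) in MOVE})
    return "J" in cur

for w in ["1", "01", "11", "0101", "0", ""]:
    print(w, accepts(w))
```

The language of (1 | 0)* 1 is exactly the binary strings ending in 1, which is what the simulation reports.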
Conversion of an NFA into DFA
• The subset construction algorithm is useful for simulating an NFA by a computer program.
• In the transition table of an NFA, each entry is a set of states; in the transition table of a DFA, each entry is just a single state.
• The general idea behind the NFA-to-DFA construction is that each DFA state corresponds to a set of NFA states.
• The DFA uses its state to keep track of all possible states the NFA can be in after reading each input symbol.
Subset Construction
- constructing a DFA from an NFA

• Input: an NFA N.
• Output: a DFA D accepting the same language.
• Method: we construct a transition table Dtran for D. Each DFA state is a set of NFA states, and we construct Dtran so that D will simulate "in parallel" all possible moves N can make on a given input string.
Subset Construction (II)

• s represents an NFA state; T represents a set of NFA states.
• ε-closure(s): the set of NFA states reachable from s on ε-transitions alone
• ε-closure(T): the union of ε-closure(s) for all s in T
• move(T, a): the set of NFA states to which there is a transition on input symbol a from some state s in T
Subset Construction (III)
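A sketch of the construction, run on a Thompson-style NFA for (1 | 0)* 1 (the state labels A-J are one assumed layout, with J accepting):

```python
# Subset construction: build the DFA transition table Dtran from an NFA.
EPS = {"A": ["B", "H"], "B": ["C", "D"], "E": ["G"], "F": ["G"],
       "G": ["B", "H"], "H": ["I"]}
MOVE = {("C", "1"): "E", ("D", "0"): "F", ("I", "1"): "J"}
ALPHABET = "01"

def eclosure(T):
    """Epsilon-closure of a set of NFA states, as a hashable frozenset."""
    stack, seen = list(T), set(T)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, []):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction(start):
    start_set = eclosure({start})
    dtran, unmarked, dstates = {}, [start_set], {start_set}
    while unmarked:                      # while there is an unmarked state T
        T = unmarked.pop()
        for a in ALPHABET:
            # U = eclosure(move(T, a))
            U = eclosure({MOVE[(s, a)] for s in T if (s, a) in MOVE})
            dtran[(T, a)] = U
            if U not in dstates:         # add U as an unmarked DFA state
                dstates.add(U)
                unmarked.append(U)
    return dstates, dtran

dstates, dtran = subset_construction("A")
print(len(dstates), "DFA states")   # 3 DFA states
```

Each DFA state is a frozenset of NFA states, and the worklist loop marks one unmarked set per iteration, exactly as the algorithm prescribes.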
Minimizing the number of states in a DFA

• Minimize the number of states of a DFA by partitioning its states into groups such that states in different groups can be distinguished by some input string.
• Each group of states that cannot be distinguished from one another is then merged into a single state.
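A minimal sketch of this partition refinement on a small illustrative DFA in which two states turn out to be indistinguishable (the DFA itself is an assumption made up for the example):

```python
# Moore-style partition refinement: states B and C below behave identically
# on every input, so minimization merges them.
STATES = {"A", "B", "C"}
ACCEPT = {"B", "C"}
DELTA = {("A", "0"): "A", ("A", "1"): "B",
         ("B", "0"): "A", ("B", "1"): "C",
         ("C", "0"): "A", ("C", "1"): "C"}
ALPHABET = "01"

def minimize():
    # initial partition: accepting vs non-accepting states
    partition = [frozenset(ACCEPT), frozenset(STATES - ACCEPT)]
    while True:
        def block_of(s):
            return next(i for i, b in enumerate(partition) if s in b)
        groups = {}
        for s in STATES:
            # signature: own block plus the block reached on each symbol;
            # states with equal signatures stay together
            sig = (block_of(s), tuple(block_of(DELTA[(s, a)]) for a in ALPHABET))
            groups.setdefault(sig, set()).add(s)
        refined = [frozenset(b) for b in groups.values()]
        if len(refined) == len(partition):   # no block split: stable
            return refined
        partition = refined

print(sorted(sorted(g) for g in minimize()))   # [['A'], ['B', 'C']]
```

Refinement only ever splits blocks, so when a pass produces the same number of blocks the partition is stable and each block becomes one state of the minimal DFA.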
Minimizing the number of states in DFA (II)
