Unit II - Lexical Analysis-20-1-2021

LEXICAL ANALYSIS

UNIT II
Contents
• Role of the lexical analyzer
• Specification of tokens
• Recognition of tokens
• Lexical analyzer generator
• Finite automata
• From regular expressions to NFA
• Design of a lexical analyzer generator
• Optimization of DFA-based pattern matchers
The role of the lexical analyzer

                         token
  Source  →  Lexical  ----------→  Parser  →  to semantic
  program    Analyzer ←----------              analysis
                      getNextToken
                 \               /
                   Symbol table
Introduction
• A simple way to build a lexical analyzer is to construct a diagram that illustrates the structure of each token
• The techniques used to implement lexical analyzers are also applicable in query languages and information-retrieval systems
• Pattern-matching algorithms can be utilized
• Two secondary tasks: removal of white space and comments, and correlating error messages with the source program
• The phase is sometimes divided into scanning and lexical analysis proper
Why separate lexical analysis and parsing?

1. Simplicity of design
2. Improved compiler efficiency: specialized buffering techniques for reading input characters and processing tokens can significantly speed up the compiler
3. Enhanced compiler portability
Tokens, Patterns and Lexemes
• A token is a pair consisting of a token name and an optional token value
• A pattern is a description of the form that the lexemes of a token may take
• A lexeme is a sequence of characters in the source program that matches the pattern for a token
Example

Token       Informal description                    Sample lexemes
if          characters i, f                         if
else        characters e, l, s, e                   else
comparison  < or > or <= or >= or == or !=          <=, !=
id          letter followed by letters and digits   pi, score, D2
number      any numeric constant                    3.14159, 0, 6.02e23
literal     anything but " surrounded by "          "core dumped"
Attributes for tokens
• When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about that lexeme
• E = M * C ** 2 is tokenized as:
  <id, pointer to symbol-table entry for E>
  <assign-op>
  <id, pointer to symbol-table entry for M>
  <mult-op>
  <id, pointer to symbol-table entry for C>
  <exp-op>
  <number, integer value 2>
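The token stream above can be produced by a small pattern-matching scanner. A minimal sketch in Python, with illustrative token names (`id`, `assign_op`, `mult_op`, `exp_op`, `number`) that are assumptions, not part of any standard:

```python
import re

# Token specification: (token name, pattern). Order matters: '**' must be
# tried before '*', so exp_op precedes mult_op.
TOKEN_SPEC = [
    ("number",    r"\d+(?:\.\d+)?"),
    ("id",        r"[A-Za-z_][A-Za-z_0-9]*"),
    ("exp_op",    r"\*\*"),
    ("mult_op",   r"\*"),
    ("assign_op", r"="),
    ("ws",        r"[ \t]+"),        # matched but not emitted
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(src):
    """Return the <token-name, lexeme> pairs for a source string."""
    tokens = []
    for m in MASTER.finditer(src):
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("E = M * C ** 2"))
```

Each match's group name plays the role of the token name, and the matched text is the lexeme; a real analyzer would store a symbol-table pointer for each `id` instead of the raw lexeme.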
Lexical errors
• Some errors are beyond the power of the lexical analyzer to recognize:
  fi (a == f(x)) …
• However, it may be able to recognize errors like:
  d = 2r
• Such errors are recognized when no pattern for tokens matches a character sequence
Error recovery
• Panic-mode recovery: the simplest recovery strategy, in which successive characters are deleted until we reach a well-formed token
• Other recovery options:
  • Deleting an extraneous character
  • Inserting a missing character
  • Replacing an incorrect character with the correct one
  • Transposing two adjacent characters
Input Buffering

• To ensure that the right lexeme is found, one or more characters have to be looked ahead beyond the next lexeme.
• Techniques for speeding up the lexical analyzer, such as the use of sentinels to mark the buffer end, have been adopted.
• There are three general approaches to implementing a lexical analyzer:
  1. Use a lexical analyzer generator (the Lex tool) to produce the LA from a regular-expression-based specification
  2. Write the LA in a systems programming language, using its I/O facilities
  3. Write the LA in assembly language
• Going down this list, the approaches are harder to implement but generate a faster LA.
Input Buffering
• Two types of buffering:
  1. One buffer
  2. Two buffers
• The two-buffer scheme consists of two buffers of N characters each, which are reloaded alternately.
• Two pointers, lexemeBegin and forward, are maintained.
• lexemeBegin points to the beginning of the current lexeme, which is yet to be found.
• forward scans ahead until a match for a pattern is found.
• Once a lexeme is found, forward is at the character at its right end, and lexemeBegin is then set to the character immediately after the lexeme just found.
• The current lexeme is the set of characters between the two pointers.
What are Buffer Pairs?

• A specialized buffering technique used to reduce the overhead required to process a single input character while moving characters through the buffers.
Sentinels

  E | = | M | eof | * | C | * | * | 2 | eof | … | eof

• The eof characters are sentinels: one marks the end of each buffer half (and the last marks the true end of input), so the end-of-buffer test is combined with the test for the current character.
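A sketch of how a sentinel removes the per-character bounds check. The single-buffer layout and the helper name `scan_lexeme` are assumptions for illustration; a real scanner keeps two halves and reloads on reaching each half's sentinel:

```python
EOF = "\0"  # sentinel character placed at the end of the buffer

def scan_lexeme(buf, begin):
    """Advance `forward` from `begin` over letters/digits. The sentinel
    lets the loop test only the current character: no separate
    end-of-buffer comparison is needed per character."""
    forward = begin
    while True:
        c = buf[forward]
        if c == EOF:
            # a real scanner would reload the other buffer half here;
            # in this sketch the sentinel means true end of input
            break
        if not c.isalnum():
            break
        forward += 1
    return buf[begin:forward], forward

buf = "count1 = 0" + EOF
print(scan_lexeme(buf, 0))   # ('count1', 6)
```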
Specification of tokens
• In the theory of compilation, regular expressions are used to formalize the specification of tokens
• Regular expressions are a means for specifying regular languages
• Example:
  letter_ (letter_ | digit)*
• Each regular expression is a pattern specifying the form of strings
Regular expressions
• ε is a regular expression, L(ε) = {ε}
• If a is a symbol in Σ, then a is a regular expression, L(a) = {a}
• (r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
• (r)(s) is a regular expression denoting the language L(r)L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
Regular definitions
d1 -> r1
d2 -> r2
…
dn -> rn

• Example:
letter_ -> A | B | … | Z | a | b | … | z | _
digit -> 0 | 1 | … | 9
id -> letter_ (letter_ | digit)*
Notational Shorthands
• One or more instances: (r)+
• Zero or one instance: r?
• Character classes: [abc]

• Example:
  letter_ -> [A-Za-z_]
  digit -> [0-9]
  id -> letter_ (letter_ | digit)*
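These character-class shorthands map directly onto, for example, Python regular expressions. A quick check of the `id` definition (variable names here are illustrative):

```python
import re

# The regular definitions above, written with character classes
letter_ = r"[A-Za-z_]"
digit = r"[0-9]"
id_re = re.compile(letter_ + "(?:" + letter_ + "|" + digit + ")*")

for s in ["pi", "score", "D2", "_tmp", "2fast"]:
    print(s, bool(id_re.fullmatch(s)))
```

`2fast` is rejected because the first character must be a letter or underscore, exactly as the definition requires.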
Recognition of tokens
• The starting point is the language grammar, to understand the tokens:
stmt -> if expr then stmt
      | if expr then stmt else stmt

expr -> term relop term
      | term
term -> id
      | number
Recognition of tokens (cont.)
• The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id -> letter (letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>
• We also need to handle whitespace:
delim -> blank | tab | newline
ws -> delim+
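The `number` pattern can be checked directly with a regular-expression engine. A sketch using Python's `re`, keeping the uppercase `E` exponent marker from the pattern above:

```python
import re

digit  = r"[0-9]"
digits = digit + "+"
# number -> digits (. digits)? (E [+-]? digits)?
number = re.compile(digits + r"(\." + digits + r")?(E[+-]?" + digits + ")?")

for s in ["0", "3.14159", "6.02E23", "1E-9", "3."]:
    print(s, bool(number.fullmatch(s)))
```

Note that `3.` is rejected: the optional fraction part requires digits after the dot, so the match cannot cover the whole string.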
Transition diagrams
• Transition diagram for relop
Transition diagrams (cont.)
• Transition diagram for reserved words and identifiers
Transition diagrams (cont.)
• Transition diagram for unsigned numbers
Transition diagrams (cont.)
• Transition diagram for whitespace
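A transition diagram is typically hand-coded as a function that advances an input pointer and retracts when it has read one character too many. A sketch of the relop diagram in Python; the attribute names (`LT`, `LE`, …) are illustrative assumptions:

```python
# Hand-coded version of the relop transition diagram:
# relop -> < | <= | <> | = | > | >=
def relop(s, i=0):
    """Return ((token, attribute), next_index), or None if no relop starts
    at s[i]."""
    c = s[i] if i < len(s) else ""
    if c == "<":
        nxt = s[i + 1] if i + 1 < len(s) else ""
        if nxt == "=":
            return ("relop", "LE"), i + 2
        if nxt == ">":
            return ("relop", "NE"), i + 2
        return ("relop", "LT"), i + 1   # retract: only '<' was part of the token
    if c == "=":
        return ("relop", "EQ"), i + 1
    if c == ">":
        nxt = s[i + 1] if i + 1 < len(s) else ""
        if nxt == "=":
            return ("relop", "GE"), i + 2
        return ("relop", "GT"), i + 1   # retract
    return None

print(relop("<= b"))   # (('relop', 'LE'), 2)
```

The two "retract" branches correspond to the starred states of the diagram, where the lookahead character is pushed back onto the input.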
Lexical Analyzer Generator - Lex

  Lex source program (lex.l)  →  Lex compiler  →  lex.yy.c
  lex.yy.c                    →  C compiler    →  a.out
  Input stream                →  a.out         →  sequence of tokens
Structure of Lex programs

declarations
%%
translation rules      pattern { action }
%%
auxiliary functions/code
Finite Automata
• Regular expressions = specification
• Finite automata = implementation

• A finite automaton consists of
  • An input alphabet Σ
  • A set of states S
  • A start state n
  • A set of accepting states F ⊆ S
  • A set of transitions: state →(input) state
Finite Automata
• The transition
  s1 →a s2
• is read:
  in state s1, on input "a", go to state s2

• At end of input:
  • if in an accepting state => accept, otherwise => reject
• If no transition is possible => reject
Finite Automata State Graphs
• A state: drawn as a circle
• The start state: marked with an incoming arrow
• An accepting state: drawn as a double circle
• A transition: an arrow between states, labeled with an input symbol such as a
A Simple Example
• A finite automaton that accepts only "1"
• A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start state to some accepting state
Another Simple Example
• A finite automaton accepting any number of 1's followed by a single 0
• Alphabet: {0, 1}
• Accepted string examples: 110, 1110

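Because this automaton is deterministic, it can be run straight off a transition table. A sketch with assumed state names (`S` start, `F` accepting, `D` dead):

```python
# DFA for the language 1*0: any number of 1's followed by a single 0
DELTA = {
    ("S", "1"): "S",   # stay while reading 1's
    ("S", "0"): "F",   # the single 0 leads to the accepting state
    ("F", "1"): "D",   # anything after the 0 is rejected
    ("F", "0"): "D",
    ("D", "1"): "D",
    ("D", "0"): "D",
}

def accepts(w):
    state = "S"
    for ch in w:
        state = DELTA[(state, ch)]
    return state == "F"

for w in ["0", "10", "110", "1110", "11", "100"]:
    print(w, accepts(w))
```

The explicit dead state `D` makes the table total: every (state, symbol) pair has exactly one entry, which is the defining property of a DFA.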
Epsilon Moves
• Another kind of transition: ε-moves
  A →ε B
• The machine can move from state A to state B without reading input
Deterministic and Nondeterministic Automata

• Deterministic Finite Automata (DFA)
  • One transition per input per state
  • No ε-moves
• Nondeterministic Finite Automata (NFA)
  • Can have multiple transitions for one input in a given state
  • Can have ε-moves
• Finite automata have finite memory
  • Need only encode the current state
Execution of Finite Automata
• A DFA can take only one path through the state graph
  • Completely determined by the input
• NFAs can choose
  • Whether to make ε-moves
  • Which of multiple transitions to take for a single input
NFA vs. DFA (1)
• NFAs and DFAs recognize the same set of languages (the regular languages)
• DFAs are easier to implement
  • There are no choices to consider
Next

  Lexical Specification → Regular expressions → NFA → DFA → Table-driven implementation of DFA
Thompson's Construction
• Input: a regular expression r
• Output: an NFA accepting L(r)
• Method:
  • Break r into its constituent subexpressions
  • Construct an NFA for each basic symbol
  • Combine the NFAs to get the final one
Properties
• Each state has a unique name
• The NFA for any r has exactly one start state and one accepting state
• N(r) has at most twice as many states as the number of symbols and operators in r
• Each state of the NFA for r has either one outgoing transition on a symbol or at most two outgoing ε-transitions
Regular Expressions to NFA (1)
• For each kind of regular expression, define an NFA
• Notation: the NFA for regular expression A is drawn as a box labeled A, with one start state and one accepting state

• For ε:        start →ε accept
• For input a:  start →a accept

Regular Expressions to NFA (2)
• For AB: connect the accepting state of A to the start state of B with an ε-move
• For A | B: a new start state with ε-moves into A and into B, and ε-moves from the accepting states of A and B into a new accepting state

Regular Expressions to NFA (3)
• For A*: a new start state with ε-moves into A and into a new accepting state; the accepting state of A has ε-moves back to the start of A and on to the new accepting state
Example of RegExp -> NFA conversion

• Consider the regular expression (1 | 0)* 1
• The NFA (states A-J, with J accepting), written as a transition list:
  A →ε B,  A →ε H
  B →ε C,  B →ε D
  C →1 E,  D →0 F
  E →ε G,  F →ε G
  G →ε B,  G →ε H
  H →ε I,  I →1 J

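The NFA for (1 | 0)* 1 can be simulated directly by tracking the set of reachable states and closing under ε-moves. The sketch below assumes one possible Thompson-style labeling, with states A-J and J accepting:

```python
# Thompson-style NFA for (1|0)*1, simulated via epsilon-closures.
EPS = {
    "A": ["B", "H"], "B": ["C", "D"], "E": ["G"], "F": ["G"],
    "G": ["B", "H"], "H": ["I"],
}
MOVE = {("C", "1"): "E", ("D", "0"): "F", ("I", "1"): "J"}

def eclosure(states):
    """All NFA states reachable from `states` via epsilon-moves alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, []):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def accepts(w):
    cur = eclosure({"A"})
    for ch in w:
        cur = eclosure({MOVE[(s, ch)] for s in cur if (s, ch) in MOVE})
    return "J" in cur

for w in ["1", "01", "11", "0101", "0", ""]:
    print(w, accepts(w))
```

The language of (1 | 0)* 1 is exactly the binary strings ending in 1, which is what the simulation reports.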
Conversion of an NFA into DFA
• The subset construction algorithm is useful for simulating an NFA by a computer program.
• In the transition table of an NFA, each entry is a set of states; in the transition table of a DFA, each entry is just a single state.
• The general idea behind the NFA-to-DFA construction is that each DFA state corresponds to a set of NFA states.
• The DFA uses its state to keep track of all possible states the NFA can be in after reading each input symbol.
Subset Construction
- constructing a DFA from an NFA

• Input: an NFA N.
• Output: a DFA D accepting the same language.
• Method: we construct a transition table Dtran for D. Each DFA state is a set of NFA states, and we construct Dtran so that D will simulate "in parallel" all possible moves N can make on a given input string.
Subset Construction (II)

• s represents an NFA state; T represents a set of NFA states.
• ε-closure(s): the set of NFA states reachable from s on ε-transitions alone
• ε-closure(T): the union of ε-closure(s) for all s in T
• move(T, a): the set of NFA states to which there is a transition on input symbol a from some state s in T
Subset Construction (III)
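A sketch of the construction, run on a Thompson-style NFA for (1 | 0)* 1 (the state labels A-J are one assumed layout, with J accepting):

```python
# Subset construction: build the DFA transition table Dtran from an NFA.
EPS = {"A": ["B", "H"], "B": ["C", "D"], "E": ["G"], "F": ["G"],
       "G": ["B", "H"], "H": ["I"]}
MOVE = {("C", "1"): "E", ("D", "0"): "F", ("I", "1"): "J"}
ALPHABET = "01"

def eclosure(T):
    """Epsilon-closure of a set of NFA states, as a hashable frozenset."""
    stack, seen = list(T), set(T)
    while stack:
        s = stack.pop()
        for t in EPS.get(s, []):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def subset_construction(start):
    start_set = eclosure({start})
    dtran, unmarked, dstates = {}, [start_set], {start_set}
    while unmarked:                      # while there is an unmarked state T
        T = unmarked.pop()
        for a in ALPHABET:
            # U = eclosure(move(T, a))
            U = eclosure({MOVE[(s, a)] for s in T if (s, a) in MOVE})
            dtran[(T, a)] = U
            if U not in dstates:         # add U as an unmarked DFA state
                dstates.add(U)
                unmarked.append(U)
    return dstates, dtran

dstates, dtran = subset_construction("A")
print(len(dstates), "DFA states")   # 3 DFA states
```

Each DFA state is a frozenset of NFA states, and the worklist loop marks one unmarked set per iteration, exactly as the algorithm prescribes.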
Minimizing the number of states in a DFA

• Minimize the number of states of a DFA by partitioning its states into groups such that states in different groups can be distinguished by some input string.
• Each group of states that cannot be distinguished from one another is then merged into a single state.
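A minimal sketch of this partition refinement on a small illustrative DFA in which two states turn out to be indistinguishable (the DFA itself is an assumption made up for the example):

```python
# Moore-style partition refinement: states B and C below behave identically
# on every input, so minimization merges them.
STATES = {"A", "B", "C"}
ACCEPT = {"B", "C"}
DELTA = {("A", "0"): "A", ("A", "1"): "B",
         ("B", "0"): "A", ("B", "1"): "C",
         ("C", "0"): "A", ("C", "1"): "C"}
ALPHABET = "01"

def minimize():
    # initial partition: accepting vs non-accepting states
    partition = [frozenset(ACCEPT), frozenset(STATES - ACCEPT)]
    while True:
        def block_of(s):
            return next(i for i, b in enumerate(partition) if s in b)
        groups = {}
        for s in STATES:
            # signature: own block plus the block reached on each symbol;
            # states with equal signatures stay together
            sig = (block_of(s), tuple(block_of(DELTA[(s, a)]) for a in ALPHABET))
            groups.setdefault(sig, set()).add(s)
        refined = [frozenset(b) for b in groups.values()]
        if len(refined) == len(partition):   # no block split: stable
            return refined
        partition = refined

print(sorted(sorted(g) for g in minimize()))   # [['A'], ['B', 'C']]
```

Refinement only ever splits blocks, so when a pass produces the same number of blocks the partition is stable and each block becomes one state of the minimal DFA.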
Minimizing the number of states in DFA (II)
