Compiler Design 2

The document provides an overview of lexical analysis, detailing the role of a lexical analyzer in programming languages, including the definition and classification of tokens. It discusses the syntax, semantics, and grammar of programming languages, as well as formal language definitions and regular expressions used for token specification. Additionally, it explains finite automata and their application in recognizing tokens from input strings.

Uploaded by

tibit40784

Dr.

Rajashekara Murthy S
B.E., M.Tech., Ph.D.
[email protected]
BITS Pilani
Pilani | Dubai | Goa | Hyderabad

Lexical Analysis
Objectives
• To understand:
1. The role of a lexical analyzer
2. Lexical analysis using formal language definitions with finite automata
3. Specification and recognition of tokens
4. A language for specifying lexical analyzers
Programming Language Structure
 Recall that a Programming Language is defined by
1. SYNTAX:
– Decides whether a sentence in a language is well-formed
2. SEMANTICS
– Determines the meaning, if any, of a syntactically well-
formed sentence
3. GRAMMAR
– A formal system that provides a generative finite description
of the language
4
Syntax of a Programming Language
 Describes the structure of programs without any
consideration of their meaning.
 The syntactic elements of a programming language
are determined by the computation model and
pragmatic concerns
 well developed tools (regular, context-free and attribute
grammars) are available for the description of the
syntax of programming language
 Lexical Analyzer & the Parser of a compiler handle the
Syntax of the programming language
5
Some Basic Definitions
• lex-i-cal: Of or relating to words or the vocabulary of a language, as distinguished from its grammar and construction.
• lexical analysis: The task concerned with breaking an input into its smallest meaningful units, called tokens.
• syntax analysis: The task concerned with fitting a sequence of tokens into a specified syntax.
• parsing: To break a sentence down into its component parts of speech with an explanation of the form, function, and syntactical relationship of each part.
Lexical Analyzer (A.k.a. Scanner)
 The only part of a compiler that looks at each
character of the source text and does a linear analysis
 Reads source text and produces TOKENS
 Also keeps track of the source-coordinates of each
token - which file name, line number and position
– (This is useful for debugging & error indication purposes.)
 Advantages of a separate Lexical Analyzer:
– Keeps Compiler design simple
– Improves Efficiency and
– Increases Portability
7
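The Role of a Lexical Analyzer
[Figure: the lexical analyzer issues "get next char" to the source program and receives the next character. The syntax analyzer issues "get next token" to the lexical analyzer and receives the next token. A symbol table, containing a record for each identifier, is attached.]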
Tokens, Patterns and Lexemes
• What are tokens?
– The basic lexical units of the language
– A sequence of abstract characters that can be treated as a unit in the grammar of the language
– A programming language classifies the tokens into a finite set of token types
• Some tokens may have attributes:
– An integer constant token will have the actual integer (17, 42) as an attribute
– Identifiers will have a string with the actual id
• A note on terminology: some texts refer to
– token types as tokens, and
– tokens as lexemes
We will stick to the terms tokens and token types.
Tokens Example
• Let us consider the program segment:
void main() { printf("Hello World\n"); }
• The tokens of this program segment are:
1. void
2. main
3. (
4. )
5. {
6. printf
7. (
8. "Hello World\n"
9. )
10. ;
11. }
Specifications of Tokens
Strings, Words and Sentences
1. Prefix of s: a string obtained by deleting zero or more trailing symbols of s
2. Suffix of s: a string obtained by deleting zero or more leading symbols of s
3. Substring of s: a string obtained by deleting a prefix and a suffix of s
4. Proper prefix, suffix or substring of s: a prefix, suffix or substring x that is nonempty and such that s ≠ x
5. Subsequence of s: a string obtained by deleting zero or more symbols of s, not necessarily contiguous
The Principle of Longest Match
• In most languages, the scanner should pick the longest possible string to make up the next token if there is a choice.
• Example:
return foobar != hohum;
should be recognized as 5 tokens:
RETURN ID(foobar) NEQ ID(hohum) SCOLON
and not more (i.e., not parts of words or identifiers, and not ! and = as separate tokens).
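The longest-match rule can be illustrated with a small hand-written scanner. This is only a sketch: the token names NOT and ASSIGN, the tables `OPERATORS` and `KEYWORDS`, and the function `scan` are assumptions made for the example and do not come from the slides.

```python
# Hypothetical token tables: spellings mapped to token names.
OPERATORS = {"!=": "NEQ", "!": "NOT", "=": "ASSIGN", ";": "SCOLON"}
KEYWORDS = {"return": "RETURN"}

def scan(text):
    """Split text into tokens, always taking the longest possible lexeme."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        if text[i].isalpha():
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1          # extend the word as far as possible
            word = text[i:j]
            tokens.append(KEYWORDS.get(word, f"ID({word})"))
            i = j
            continue
        # Try the longest operator spelling first, so "!=" wins over "!".
        for op in sorted(OPERATORS, key=len, reverse=True):
            if text.startswith(op, i):
                tokens.append(OPERATORS[op])
                i += len(op)
                break
        else:
            raise ValueError(f"unexpected character {text[i]!r}")
    return tokens
```

Calling `scan("return foobar != hohum;")` yields the five tokens listed above, while `!` and `=` only come out as separate tokens when whitespace separates them in the input.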
12
Typical Tokens in Programming Languages
 Operators & Punctuation
– + - * / ( ) { } [ ] ; : :: < <= == = != ! …
– Each of these is a distinct lexical class ( or token type )
 Keywords
– if while for goto return switch void …
– Each of these is also a distinct lexical class (not a string)
 Identifiers
– A single ID lexical class, but parameterized by actual id
 Integer constants
– A single INT lexical class, but parameterized by int value
 Other constants, etc.
13
Tokens of a Typical Language
TYPE EXAMPLE

ID foo, n14, a, temp……

NUM 73 , 0 , 00 , 515 , +2 ……..

REAL 66.1 .5 10. 1e67 5.5e-10 ……..

KEYWORDS IF DO WHILE INT ………

SYMBOLS , (Comma) != (Noteq) ( (Lparen) …….

14
Tokens of a Typical Language
Question: How are tokens defined and recognized?
Answer: By using regular expressions to define a token as a formal regular language.
Formal Theory of Languages
 A language in real life is made up of
1. words made up of alphabets and
2. Sentences made up of words arranged according to the
Grammar of that language
 Natural languages display amazing variety of
expressions with Explicit & implicit meanings and
variations in meaning as well as grammars
 Computer languages on the contrary focus on
– The limited set of tasks to be performed
– Hence mathematical precision is essential in defining
their structure and Grammar
16
Formal Definition of Languages
• Alphabet → A finite (non-empty) set of symbols, denoted by Σ
• String → A finite sequence of symbols from an alphabet; this includes the empty sequence (denoted by λ)
• Language → A set (often infinite) of finite strings
  The set of all possible finite strings of elements of alphabet Σ (including λ) is denoted by Σ*
• Finite specification of (possibly infinite) languages is possible with
1. Automaton – a recognizer; a machine that accepts all strings in a language (and rejects all other strings)
2. Grammar – a generator; a system for producing all strings in the language (and no other strings)
Note: An automaton and a grammar are two different ways of specifying a language; a language may be specified by either.
Formal Language Definition ( Contd. )
 As already defined A language L over an alphabet Σ is
a collection of strings of elements of Σ
– The PASCAL Language is the set of all strings that
constitute legal PASCAL programs (infinite set)
– The Language of primes is a set of all decimal digit strings
that constitute prime numbers (infinite set)
– The language of C reserved words is the set of all
alphabetic strings that can not be used as identifiers in the
C programming language (finite set)
 To specify some of these (possibly infinite) languages
with finite description we use the notation of
Regular Expressions
19
Regular Expressions
• A regular expression is always defined over some alphabet Σ (for programming languages, this is commonly ASCII or Unicode)
• If E is a regular expression, L(E) is the "language" (set of strings) generated by E
• For example, for each symbol a in the alphabet of the language, the regular expression a denotes the language {a}, containing just the string a (such an expression is known as a symbol)
• The regular expression ε denotes the language {λ}, containing just the empty sequence λ
Operations with Regular Expressions
• Given two regular expressions M and N:
• Alternation (denoted by |) makes a new regular expression M | N denoting the union of the languages L(M) and L(N), i.e. L(M) ∪ L(N)
• Concatenation (denoted by . or by juxtaposition) makes a new regular expression MN denoting the language of strings formed by a string from L(M) followed by a string from L(N)
• Repetition (denoted by *) makes a new regular expression M* denoting the language of 0 or more concatenated occurrences of strings from L(M) (Kleene closure)
Regular Expression Example
Expression          Language                 Example Words
a | b               {a, b}                   a, b
ab*a                {a}{b}*{a}               aa, aba, abba, abbba, …
(ab)*               {ab}*                    ε, ab, abab, ababab, …
abba                {abba}                   abba
(0 | 1)*0           {0, 1}*{0}               0, 00, 10, 010, 110, … (all even binary numbers)
b*(abb*)*(a | ε)    Strings of a and b with no consecutive a's

Similarly, using the symbols |, ., * and ε, we can specify the regular expressions corresponding to the lexical tokens of a programming language using rules (a.k.a. productions).
Table of Operators & Abbreviations
Notation     Description
a            An ordinary character that stands for itself
ε            The empty string
M | N        Alternation: choosing from M or N
MN           Concatenation: an M followed by an N
M*           Repetition (zero or more times)
M+           Repetition (one or more times)
M?           Optional (zero or one occurrence of M)
[a-zA-Z]     Character set alternation (any one character in the given ranges)
[abxyz]      One of the given characters (a|b|x|y|z)
.            Stands for any single character (except newline)
"a.+*"       Quotation: a string in quotes stands for itself literally
Regular Expression Construction
• Problem: Specify the set of unsigned numbers as a regular expression. (Examples: 1997, 19.97)
• Observations on numbers:
1. A number can be made up of one or more digits from the set 0-9
2. It can optionally have a decimal point after those digits, followed by 0 or more digits: "." [0-9]*
3. A number can also start with a decimal point followed by one or more digits

( [0-9]+ ( "." [0-9]* )? ) | ( "." [0-9]+ )
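As a quick check, the expression can be transcribed into Python's `re` syntax. This is a sketch only; the names `UNSIGNED` and `is_unsigned_number` are made up for the illustration.

```python
import re

# [0-9]+ ( "." [0-9]* )?   |   "." [0-9]+
UNSIGNED = re.compile(r"(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)")

def is_unsigned_number(s):
    """True if the whole string s is an unsigned number under the three observations."""
    return UNSIGNED.fullmatch(s) is not None
```

Under this expression 1997, 19.97, 10. and .5 are accepted, while a lone decimal point and 1.2.3 are rejected.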

24
Regular Expressions for Some Tokens of a Programming Language
Regular Expression                              Token Type
if                                              return IF
[a-z][a-z0-9]*                                  return ID
[0-9]+                                          return NUM
([0-9]+ "." [0-9]*) | ("." [0-9]+)              return REAL
(‘\’ [ a – z ] ‘\n’ ) | (‘ ’)| ‘\n’ | ‘*/’)+    return Comment
.                                               return ERROR
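A rule table like this is usually applied with two conventions: the longest match wins, and on a tie the rule listed first wins. The sketch below assumes exactly that and leaves out the comment rule; `RULES` and `tokenize` are hypothetical names, and whitespace is simply skipped.

```python
import re

# Rules in priority order, transcribed from the table (comment rule omitted).
RULES = [
    ("IF",    re.compile(r"if")),
    ("ID",    re.compile(r"[a-z][a-z0-9]*")),
    ("NUM",   re.compile(r"[0-9]+")),
    ("REAL",  re.compile(r"[0-9]+\.[0-9]*|\.[0-9]+")),
    ("ERROR", re.compile(r".")),
]

def tokenize(text):
    """Return (token type, lexeme) pairs using longest match, then rule order."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos] in " \t\n":
            pos += 1
            continue
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # A strictly longer match replaces the current best; on a tie
            # the earlier rule is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

With these conventions `if` is reported as IF because IF is listed before ID, whereas `ifx` is reported as ID because the ID match is longer.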
25
A regular Expression Recognizer
 Given an input string,
The function of a “regular Expression Analyzer” is to
say :
– “YES, the input is part of the language generated
from the regular expression”
– “NO, the input isn’t part of the language generated
from the regular expression”
 Using results from Finite Automata theory and theory
of algorithms, we can automate construction of such
recognizers from Regular Expressions
26
Finite Automata
• A finite automaton is a transition graph that has:
– A finite set of states S (represented by nodes) with edges leading from one state to another
– Each edge is labeled with the symbol (from the set Σ) that causes the transition (could be ε also!)
– One state is denoted as the start state s0, and certain of the states are distinguished as final states (normally denoted with two concentric circles)
• Mathematically, it can be represented as:
A = (S, Σ, s0, F, move)
where S is the set of states, Σ the input alphabet, s0 the start state, F the set of final states and move the transition function.
27
Recognizing Expressions as Tokens with
Finite State Automaton
 Operate by reading input symbols (usually characters)
– Transition can be taken if labeled with current symbol
– ε-transition can be taken at any time
 Accept when final state reached & no more input
– Scanner slightly different – accept longest match even if
more input
 Reject if no transition possible or no more input and
not in final state (DFA)
28
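Finite Automata Examples
• if: start → (1) –i→ (2) –f→ (3); final state 3: return IF
• [a-z][a-z0-9]*: start → (1) –a-z→ (2), with loops on 2 labeled a-z and 0-9; final state 2: return ID
• [0-9]+: start → (1) –0-9→ (2), with a loop on 2 labeled 0-9; final state 2: return NUM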
Finite Automata Examples (Contd.)
• ([0-9]+ "." [0-9]*) | ("." [0-9]+):
– start → (1) –0-9→ (2), loop on 2 labeled 0-9, (2) –.→ (3), loop on 3 labeled 0-9
– (1) –.→ (4) –0-9→ (5), loop on 5 labeled 0-9
– final states 3 and 5: return REAL
Deterministic Finite Automata (DFA)
• A finite automaton is deterministic if
1. It has no edges/transitions labeled with epsilon.
2. For each state and for each symbol in the alphabet, there is exactly one edge labeled with that symbol.
• Such a transition graph is called a state graph.
A Deterministic Finite Automaton (DFA) for b*abb:
start → (0) –a→ (1) –b→ (2) –b→ (3), with a loop on 0 labeled b; 3 is the final state
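Simulating a DFA takes one table lookup per input character. The sketch below assumes the b*abb automaton has states 0 to 3 with a b-loop on state 0 and final state 3, and treats a missing table entry as rejection; `MOVE`, `START`, `FINAL` and `accepts` are illustrative names.

```python
# Transition table for the b*abb DFA; a missing entry means "reject".
MOVE = {
    (0, "b"): 0,
    (0, "a"): 1,
    (1, "b"): 2,
    (2, "b"): 3,
}
START, FINAL = 0, {3}

def accepts(s):
    """Run the DFA over s and report whether it ends in a final state."""
    state = START
    for ch in s:
        state = MOVE.get((state, ch))
        if state is None:       # no edge for this symbol: reject immediately
            return False
    return state in FINAL
```

Because each step is a single dictionary lookup, the running time is linear in the length of the input.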
31
Non-deterministic Finite Automata (NFA)
• In a non-deterministic finite automaton:
1. From a state (node), there may be more than one edge labeled with the same symbol, and there may be no edge from a node for some input symbol
2. An edge can be labeled with the empty symbol ε too
A Non-deterministic Finite Automaton (NFA) for (a|b)*abb:
start → (0) –a→ (1) –b→ (2) –b→ (3), with loops on 0 labeled a and b; 3 is the final state
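An NFA can be simulated by tracking the set of states it could currently be in. The sketch below assumes the (a|b)*abb automaton has states 0 to 3, loops on state 0 for a and b, and final state 3; `NFA` and `nfa_accepts` are illustrative names.

```python
# Each entry maps (state, symbol) to the SET of possible next states.
NFA = {
    (0, "a"): {0, 1},   # two choices on a: stay in 0 or move to 1
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}

def nfa_accepts(s, start=0, final=frozenset({3})):
    """Track every state the NFA could be in after each input symbol."""
    current = {start}
    for ch in s:
        current = set().union(*(NFA.get((q, ch), set()) for q in current))
    return bool(current & final)
```

The input is accepted if at least one of the tracked states is final once the input is exhausted.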
32
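Another NFA
[Figure: from the start state, one ε-transition leads to a branch that reads a and then loops on a, and another ε-transition leads to a branch that reads b and then loops on b.]
An ε-transition is taken without consuming any character from the input.
What does the above NFA accept? aa* | bb*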
NFA and DFA – A Comparison
• NFA
– Has edges/transitions labeled with epsilon
– From a state (node), there may be more than one edge labeled with the same symbol, and there may be no edge from a node labeled with an input symbol
– Quicker to build but slower to simulate
• DFA
– No edges/transitions labeled with epsilon
– For each state and for each symbol in the alphabet, there is exactly one edge labeled with that symbol
– Slower to build but quicker to simulate
Relationship between DFA & NFA
 It is obvious that
DFA can be simulated with an NFA
 But what is not so obvious is that
NFA can be simulated with a DFA !!!
 How ?
• Simulate sets of possible states
• Possible exponential blowup in the state space
• Still, Maintain one state per character in the input
stream
35
Automating a RE Recognizer Construction
 To convert a specification into code:
1. Write down the RE for the input language
2. Build a big NFA
3. Build the DFA that simulates the NFA
4. Systematically shrink the DFA
5. Turn it into code
Note: The DFA construction is done automatically by a
tool such as lex ( More on this Later )
36
Building NFA From Regular Expression
• Remember that a regular expression is formed by the use of:
– Basic symbols and their
– Alternation,
– Concatenation, and
– Repetition.
• Hence, all we need to know is:
– How to build the NFA for each of the above (symbols and operations), and
– How to assemble the NFAs corresponding to these pieces into a composite NFA for the whole expression
Building NFA for Symbols & Operations
1. Building NFA for a basic symbol a:
1. Start with an initial state i
2. Draw an edge/transition labeled with the symbol a (this could be the epsilon symbol too)
3. to the final state f

start → (i) –a→ (f)        start → (i) –ε→ (f)
Building NFA for Symbols & Operations
2. Building NFA for Alternation N(s | t):
– Given two NFAs N(s) and N(t),
1. Construct a new start state i and a new final state f.
2. Add a transition from the start state i to the start states of N(s) and N(t), and label them with the epsilon symbol
3. Add a transition from the final states of N(s) and N(t) to the final state f, and label them with the epsilon symbol

start → (i) –ε→ N(s) –ε→ (f)
        (i) –ε→ N(t) –ε→ (f)
Building NFA for Symbols & Operations
3. Building NFA for Concatenation N(s.t) or N(st):
– Given two NFAs N(s) and N(t),
1. Construct a new start state i and a new final state f.
2. Overlap the start state of the latter, N(t), with the final state of the former, N(s); equivalently, add an epsilon transition from the final state of N(s) to the start state of N(t)
3. From the start state i, add an edge labeled with epsilon to the start state of N(s)
4. The final state of N(t) leads to the final state f

start → (i) → N(s) –ε→ N(t) → (f)
Building NFA for Symbols & Operations
4. Building NFA for Repetition N(s*):
1. Construct a new start state i and a new final state f
2. Add an epsilon transition from the new start state to the new final state.
3. Add an epsilon transition from the new final state to the start state of N(s).
4. Add another epsilon transition from the final state of N(s) to the constructed final state.

start → (i) –ε→ (f);  (f) –ε→ N(s);  N(s) –ε→ (f)
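Construction of NFA – Examples
(a|b).(a|b)
• (a): (i) –a→ (f);  (b): (i) –b→ (f)
• (a|b): a new start state with ε-edges into the NFAs for a and b, and ε-edges from their final states into a new final state
• (a|b).(a|b): two copies of the NFA for (a|b) joined in sequence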
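The constructions for symbols, alternation, concatenation and repetition compose mechanically, which is what makes them easy to automate. The sketch below is one possible encoding, not the slides' own code: `Frag`, `symbol`, `alt`, `concat`, `star`, `closure` and `matches` are hypothetical names, epsilon is represented by `None`, and concatenation is done with a single ε-edge between the two fragments rather than with extra start and final states.

```python
import itertools

EPS = None                      # label used for epsilon edges (assumption)
_ids = itertools.count()        # fresh state numbers

class Frag:
    """An NFA fragment: one start state, one final state, a list of edges."""
    def __init__(self, start, final, edges):
        self.start, self.final, self.edges = start, final, edges

def symbol(a):
    i, f = next(_ids), next(_ids)
    return Frag(i, f, [(i, a, f)])

def alt(s, t):
    i, f = next(_ids), next(_ids)
    extra = [(i, EPS, s.start), (i, EPS, t.start),
             (s.final, EPS, f), (t.final, EPS, f)]
    return Frag(i, f, s.edges + t.edges + extra)

def concat(s, t):
    return Frag(s.start, t.final, s.edges + t.edges + [(s.final, EPS, t.start)])

def star(s):
    i, f = next(_ids), next(_ids)
    extra = [(i, EPS, f), (f, EPS, s.start), (s.final, EPS, f)]
    return Frag(i, f, s.edges + extra)

def closure(states, edges):
    """All states reachable from `states` using only epsilon edges."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def matches(frag, text):
    current = closure({frag.start}, frag.edges)
    for ch in text:
        moved = {dst for src, label, dst in frag.edges
                 if src in current and label == ch}
        current = closure(moved, frag.edges)
    return frag.final in current
```

For example, `concat(alt(symbol("a"), symbol("b")), alt(symbol("a"), symbol("b")))` builds an NFA for (a|b).(a|b), and `matches` then accepts exactly the two-letter strings over a and b.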
Construction of NFA – Examples (Contd.)
• [a-z][a-z0-9]*: the symbol edge a-z followed by the repetition of an edge labeled a-z, 0-9; final state: return ID
• [0-9]+ = [0-9][0-9]*: the symbol edge 0-9 followed by the repetition of an edge labeled 0-9; final state: return NUM
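Combining Several NFAs
Combined NFA (start state 1):
• IF: (1) –i→ (2) –f→ (3); 3 is final (IF)
• ID: (1) –ε→ (4) –a-z→ (5); (5) –ε→ (6); (5) –ε→ (8); (6) –a-z, 0-9→ (7); (7) –ε→ (8); (8) –ε→ (6); 8 is final (ID)
• NUM: (1) –ε→ (9) –0-9→ (10); (10) –ε→ (11); (10) –ε→ (13); (11) –0-9→ (12); (12) –ε→ (13); (13) –ε→ (11); 13 is final (NUM)
• ERROR: (1) –ε→ (14) –any character→ (15); 15 is final (ERROR)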
Conversion of NFA to DFA
• A DFA can be constructed from the NFA, where each DFA state represents a set of NFA states
• Key idea:
The state of the DFA after reading some input is the set of all states the NFA could have reached after reading the same input
• If the NFA has n states, the DFA will have at most 2^n states
• The resulting DFA may have more states than needed
• Let us study the conversion with an example
Converting NFA to DFA
(The NFA used here is the combined NFA for IF, ID, NUM and ERROR.)
Q: What states can be reached from state 1 without consuming a character?
A: {1,4,9,14} form the ε-closure of state 1.
Defn: Given a set of NFA states T, ε-closure(T) is the set of states that are reachable through zero or more ε-transitions from any state s ∈ T.
Converting NFA to DFA
What are all the state ε-closures in this NFA?
closure(1) = {1,4,9,14}
closure(5) = {5,6,8}       closure(10) = {10,11,13}
closure(8) = {6,8}         closure(13) = {11,13}
closure(7) = {6,7,8}       closure(12) = {11,12,13}
Converting NFA to DFA
• We already know that:
Given a set of NFA states T, ε-closure(T) is the set of states that are reachable through ε-transitions from any state s ∈ T.
• We now define:
Given a set of NFA states T, move(T, a) is the set of states that are reachable on input a from any state s ∈ T.
• Now the problem definition:
Given an NFA, find the DFA with the minimum number of states that has the same behavior as the NFA for all inputs.
Converting NFA to DFA
1. Start with the initial state of the NFA (s0) and work out the set of states in the DFA, Dstates, initialized with a state representing ε-closure(s0):
Dstates = {1-4-9-14}
2. Compute the transitions out of 1-4-9-14:
– Move(1-4-9-14, a-h) = {5,15}; then ε-closure({5,15}) = {5,6,8,15}, giving the edge 1-4-9-14 –a-h→ 5-6-8-15
– Move(1-4-9-14, i) = {2,5,15}; then ε-closure({2,5,15}) = {2,5,6,8,15}, giving the edge 1-4-9-14 –i→ 2-5-6-8-15
– Move(1-4-9-14, j-z) = {5,15}; then ε-closure({5,15}) = {5,6,8,15}, giving the edge 1-4-9-14 –j-z→ 5-6-8-15
– Move(1-4-9-14, 0-9) = {10,15}; then ε-closure({10,15}) = {10,11,13,15}, giving the edge 1-4-9-14 –0-9→ 10-11-13-15
– Move(1-4-9-14, other) = {15}; then ε-closure({15}) = {15}, giving the edge 1-4-9-14 –other→ 15
3. The analysis for 1-4-9-14 is complete. We mark it and pick another state in the DFA to analyze.
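The marking procedure above can be written as a short worklist loop. The sketch below encodes the combined NFA as an edge list reconstructed from the closures and moves shown in the slides; the names `EDGES`, `eps_closure`, `move` and `subset_construction` are assumptions, and the character `?` stands in for every character that is neither a lowercase letter nor a digit.

```python
import string

LETTERS, DIGITS = set(string.ascii_lowercase), set(string.digits)
ANY = LETTERS | DIGITS | {"?"}      # "?" represents "any other character"
EPS = None

# (source, label, target); a label is a set of characters or EPS.
EDGES = [
    (1, {"i"}, 2), (2, {"f"}, 3),
    (1, EPS, 4), (4, LETTERS, 5), (5, EPS, 6), (5, EPS, 8),
    (6, LETTERS | DIGITS, 7), (7, EPS, 8), (8, EPS, 6),
    (1, EPS, 9), (9, DIGITS, 10), (10, EPS, 11), (10, EPS, 13),
    (11, DIGITS, 12), (12, EPS, 13), (13, EPS, 11),
    (1, EPS, 14), (14, ANY, 15),
]

def eps_closure(states):
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for src, label, dst in EDGES:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return frozenset(seen)

def move(states, ch):
    return {dst for src, label, dst in EDGES
            if src in states and label is not EPS and ch in label}

def subset_construction(start):
    d_start = eps_closure({start})
    dstates, unmarked, dtran = {d_start}, [d_start], {}
    while unmarked:
        T = unmarked.pop()              # "mark" T by removing it from the worklist
        for ch in sorted(ANY):
            U = eps_closure(move(T, ch))
            if not U:
                continue                # no edge on ch: leave the entry undefined
            dtran[(T, ch)] = U
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
    return d_start, dstates, dtran
```

Running `subset_construction(1)` starts from {1,4,9,14} and, on this encoding, discovers eight DFA states in total.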
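Converted DFA
Start state: 1-4-9-14
• 1-4-9-14: –i→ 2-5-6-8-15; –a-h, j-z→ 5-6-8-15; –0-9→ 10-11-13-15; –other→ 15
• 2-5-6-8-15 (ID): –f→ 3-6-7-8; –a-e, g-z, 0-9→ 6-7-8
• 5-6-8-15 (ID): –a-z, 0-9→ 6-7-8
• 3-6-7-8 (IF): –a-z, 0-9→ 6-7-8
• 6-7-8 (ID): –a-z, 0-9→ 6-7-8
• 10-11-13-15 (NUM): –0-9→ 11-12-13
• 11-12-13 (NUM): –0-9→ 11-12-13
• 15 (error): no outgoing edges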
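Another Example of Conversion
[Figure: an NFA with states S0 to S11. S0 reaches S1 and S2 by ε; S1 –a→ S3 and S2 –b→ S4; S3 and S4 reach S5 by ε; S5 –ε→ S6; S6 reaches S7 and S8 by ε; S7 –a→ S9 and S8 –b→ S10; S9 and S10 reach S11 by ε.]

The above NFA would result in the DFA below:
• {s0,s1,s2} –a→ {s3,s5,s6,s7,s8}; {s0,s1,s2} –b→ {s4,s5,s6,s7,s8}
• {s3,s5,s6,s7,s8} –a→ {s9,s11}; {s3,s5,s6,s7,s8} –b→ {s10,s11}
• {s4,s5,s6,s7,s8} –a→ {s9,s11}; {s4,s5,s6,s7,s8} –b→ {s10,s11}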
Systematically Shrink the DFA
• The big picture
– Discover sets of equivalent states
– Represent each such set with just one state
• Two states are equivalent if and only if:
– The sets of paths leading to them are equivalent
– ∀ α ∈ Σ, transitions on α lead to equivalent states (DFA)
– α-transitions to distinct sets ⇒ the states must be in distinct sets
• A partition P of S
– A collection of sets P such that each s ∈ S is in exactly one pi ∈ P
– The algorithm iteratively partitions the DFA's states
Minimization
Example DFA (as far as the figure can be recovered): p0 –a→ p1, p0 –b→ p2, p1 –a→ p3, p1 –b→ p4, p2 –a→ p3, p2 –b→ p4
• Group all the states together: {p0, p1, p2, p3, p4}.
• Separate states according to available exit transitions.
• Separate a set into two if from some of its states one can reach another set and from others one cannot. Repeat until no set can be separated.
Minimization
The above DFA can now be minimized as:
{p0} –a, b→ {p1, p2} –a, b→ {p3, p4}
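The splitting procedure can be sketched as a partition-refinement loop. The code below is an illustration under stated assumptions: the function names `block_of` and `minimize` are made up, the partition is seeded with accepting versus non-accepting states (the usual starting point), and the example treats p3 and p4 as the accepting states, which the slide does not state.

```python
def block_of(partition, q):
    """Index of the block containing q, or None for a missing transition."""
    for idx, block in enumerate(partition):
        if q in block:
            return idx
    return None

def minimize(states, alphabet, delta, finals):
    """Split blocks until no input symbol distinguishes two states of a block."""
    partition = [b for b in (set(finals), set(states) - set(finals)) if b]
    while True:
        refined = []
        for block in partition:
            groups = {}
            for q in block:
                # States stay together only if every symbol sends them
                # to the same block of the current partition.
                signature = tuple(block_of(partition, delta.get((q, a)))
                                  for a in alphabet)
                groups.setdefault(signature, set()).add(q)
            refined.extend(groups.values())
        if len(refined) == len(partition):
            return refined
        partition = refined

# The example DFA, with p3 and p4 assumed to be accepting.
DELTA = {("p0", "a"): "p1", ("p0", "b"): "p2",
         ("p1", "a"): "p3", ("p1", "b"): "p4",
         ("p2", "a"): "p3", ("p2", "b"): "p4"}
BLOCKS = minimize(["p0", "p1", "p2", "p3", "p4"], "ab", DELTA, {"p3", "p4"})
```

On this input the loop stops with three blocks, {p0}, {p1, p2} and {p3, p4}, each of which becomes one state of the minimized DFA.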
Pseudo Code For Lexical Analyzer
function lexan: integer;
var lexbuf : array [0..100] of char;
    c : char;
begin
  loop begin
    read a character into c;
    if c is a blank or a tab then
      do nothing
    else if c is a newline then
      increment lineno
    else if c is a digit then begin
      set tokenval to the value of this and following digits;
      return NUM
    end
    else if c is a letter then begin
      place c and successive letters and digits into lexbuf;
      p := lookup(lexbuf);
      tokenval := p;
      return the token field of table entry p
    end
    else begin /* token is a single character */
      set tokenval to NONE; /* no attribute */
      return integer encoding of character c
    end
  end
end
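The pseudocode translates almost line for line into a runnable form. The sketch below departs from it in a few labeled ways: it is a generator that yields (token, tokenval) pairs instead of returning one token per call, the symbol table `symtable` is a plain list, and `lookup` inserts unknown names as ID, which the pseudocode leaves unspecified.

```python
NUM, ID, NONE = "NUM", "ID", None
symtable = []   # hypothetical symbol table: list of (lexeme, token) records

def lookup(lexeme):
    """Return the table index of lexeme, inserting it as an ID if absent (assumption)."""
    for index, (name, _) in enumerate(symtable):
        if name == lexeme:
            return index
    symtable.append((lexeme, ID))
    return len(symtable) - 1

def lexan(text):
    """Yield (token, tokenval) pairs, following the branches of the pseudocode."""
    lineno, i = 1, 0
    while i < len(text):
        c = text[i]
        if c in " \t":                      # blank or tab: do nothing
            i += 1
        elif c == "\n":                     # newline: count the line
            lineno += 1
            i += 1
        elif c.isdigit():                   # this and following digits
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            yield NUM, int(text[i:j])
            i = j
        elif c.isalpha():                   # letter, then letters and digits
            j = i
            while j < len(text) and text[j].isalnum():
                j += 1
            p = lookup(text[i:j])
            yield symtable[p][1], p
            i = j
        else:                               # single-character token
            yield ord(c), NONE
            i += 1
```

For the input `x = 42;` this produces an ID with table index 0, the character code of `=`, a NUM with value 42, and the character code of `;`.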
Building Lexical Analyzers Automatically
 The point to note is :
The Process studied so far is well suited for Automation
1. Implementer writes down the regular expressions
2. Scanner generator builds NFA, DFA, minimal DFA,
and then writes out the (table-driven or direct-coded)
code
3. This process reliably produces fast, robust Lexical
Analyzers
 One such Tool is Lex
66
Lex – A Tool for Generating Scanners
• A widely used tool for specifying lexical analyzers for a wide variety of languages. How does it work?
1. The specification of a lexical analyzer is prepared by creating a program lex.l (containing REs) in the Lex language
2. Then lex.l is run through the Lex compiler to produce a C program lex.yy.c (which contains a tabular representation of the state transition diagram)
3. lex.yy.c is run through the C compiler to produce the object code of the lexical analyzer, a.out
[Figure: Lex source program lex.l → Lex compiler → lex.yy.c → C compiler → a.out; input stream → a.out → sequence of tokens]
Lex Functions
1. Translates the definitions into an automaton.
2. The automaton looks for the longest matching string.
3. Either returns some value to the reading program (parser), or looks for the next token.
4. Lookahead operator: x/y → allow the token x only if y follows it (but y is not part of the token).
Lex Program Structure
• A Lex program (nothing but the specifications in lex.l) consists of three parts:
1. Declarations: this section includes declarations of variables and manifest constants.
2. Translation rules: this section includes patterns (REs) and the corresponding actions to be taken.
3. Auxiliary procedures: this section includes whatever auxiliary procedures are needed.
The three sections are separated by lines beginning with %%
A Sample Lex Program
1) %{
/* Remove uppercase letters. Commands to execute are
   lex test.l and gcc lex.yy.c -ll -o test */
%}
%%
[A-Z]+ ;
2) %{
/* Line numbering. With flex, yylineno is maintained only when
   %option yylineno (or lex-compatibility mode) is in effect. */
%}
%%
^.*\n printf("%d\t%s", yylineno-1, yytext);
Any
Questions ????

Thank you
71
Regular Expression Construction
• Problem: Specify the set of unsigned numbers as a regular expression. (Examples: 1997, 19.97)
• Solution: Start with symbols and keep defining regular sub-expressions till the final expression is achieved
RULE 1. digit → 0 | 1 | 2 | 3 | … | 9
RULE 2. digits → digit digit* (or digit+)
[digit* is the Kleene closure of digit (zero or more digits), so digit digit* means one or more digits]
RULE 3. optional_fraction → '.' digits | epsilon
RULE 4. Num → digits optional_fraction
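The four rules can be composed in the same bottom-up way in code, with each named sub-expression substituted into the next. This is only a sketch; `NUM_RE` and `derives` are names invented for the example.

```python
import re

digit = r"[0-9]"                              # RULE 1
digits = rf"{digit}{digit}*"                  # RULE 2: one or more digits
optional_fraction = rf"(?:\.{digits}|)"       # RULE 3: empty alternative = epsilon
num = rf"{digits}{optional_fraction}"         # RULE 4
NUM_RE = re.compile(num)

def derives(s):
    """True if the whole string s can be derived from Num."""
    return NUM_RE.fullmatch(s) is not None
```

Under these rules 1997 and 19.97 can be derived, while 10. and .5 cannot, because RULE 3 requires digits after the decimal point and RULE 4 requires digits before it.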
72
Note that we have used all the definitions of a regular expression.
Unsigned Number Validation Using Rules
• Let us derive the number from these rules
RULE 1. digit → 0 | 1 | 2 | 3 | … | 9
RULE 2. digits → digit digit* (or digit+)
[digit* is the Kleene closure of digit (zero or more digits), so digit digit* means one or more digits]
RULE 3. optional_fraction → '.' digits | epsilon
RULE 4. Num → digits optional_fraction

1 9 97 2 5 . 9 7 3 6 . . 14
74
Regular Expression Construction
• Qn: How to write a regular expression for identifiers? (An identifier is a letter followed by zero or more letters or digits.)
• Answer:
1. Letter → a | A | b | B | … | z | Z
2. Digit → 0 | 1 | 2 | 3 | … | 9
3. Letter_or_Digit → Letter | Digit
4. Identifier → Letter Letter_or_Digit*
• One can define similar regular expressions for comments, strings, operators and delimiters (the different tokens of a language)
Grammar for a Tiny Language
• program ::= statement | program statement
• statement ::= assignStmt | ifStmt
• assignStmt ::= id = expr ;
• ifStmt ::= if ( expr ) statement
• expr ::= id | int | expr + expr
• id ::= a | b | c | i | j | k | n | x | y | z
• int ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
The rules of a grammar are also known as productions.

26 pages
Data in ML
No ratings yet
Data in ML
26 pages
AI Question Bank
No ratings yet
AI Question Bank
3 pages
Essentium Drybox Data Sheet - v2 1 (379296)
No ratings yet
Essentium Drybox Data Sheet - v2 1 (379296)
11 pages
Lecture 02 Write Basic Go Web Server
No ratings yet
Lecture 02 Write Basic Go Web Server
17 pages
Summit Advisory 29 February 2024
No ratings yet
Summit Advisory 29 February 2024
4 pages
Introduction To Java Programming
No ratings yet
Introduction To Java Programming
24 pages
Passive Optical Local Area Network (POL)
No ratings yet
Passive Optical Local Area Network (POL)
13 pages
Pre-Stack Seismic Interpretation
No ratings yet
Pre-Stack Seismic Interpretation
4 pages
Pattern Recognition-Theory
No ratings yet
Pattern Recognition-Theory
2 pages
Dekra Se Whitepaper Digital Future of Auditing en Us v2
No ratings yet
Dekra Se Whitepaper Digital Future of Auditing en Us v2
4 pages
Health Care App - 20240723 - 214723 - 0000
No ratings yet
Health Care App - 20240723 - 214723 - 0000
1 page
Secondary Memory
No ratings yet
Secondary Memory
3 pages
Mongodb Database Design
No ratings yet
Mongodb Database Design
2 pages
Lisp Interpreter in Rust
From Everand
Lisp Interpreter in Rust
Vishal Patil
1/5 (1)



