
Compiler Design 2

The document provides an overview of lexical analysis, detailing the role of a lexical analyzer in programming languages, including the definition and classification of tokens. It discusses the syntax, semantics, and grammar of programming languages, as well as formal language definitions and regular expressions used for token specification. Additionally, it explains finite automata and their application in recognizing tokens from input strings.

Uploaded by

tibit40784

Dr.

Rajashekara Murthy S
B.E., M.Tech., Ph.D.
[email protected]
BITS Pilani
Pilani | Dubai | Goa | Hyderabad

–2–
Lexical Analysis
Objectives
• To understand:
1. The role of a lexical analyzer
2. Lexical analysis using formal language definitions with finite automata
3. Specification and recognition of tokens
4. A language for specifying lexical analyzers

Programming Language Structure
• Recall that a programming language is defined by:
1. SYNTAX
– Decides whether a sentence in a language is well-formed
2. SEMANTICS
– Determines the meaning, if any, of a syntactically well-formed sentence
3. GRAMMAR
– A formal system that provides a generative, finite description of the language

Syntax of a Programming Language
• Describes the structure of programs without any consideration of their meaning.
• The syntactic elements of a programming language are determined by the computation model and pragmatic concerns.
• Well-developed tools (regular, context-free and attribute grammars) are available for describing the syntax of a programming language.
• The lexical analyzer and the parser of a compiler handle the syntax of the programming language.

Some Basic Definitions
• lex-i-cal: Of or relating to words or the vocabulary of a language, as distinguished from its grammar and construction.
• lexical analysis: The task concerned with breaking an input into its smallest meaningful units, called tokens.
• syntax analysis: The task concerned with fitting a sequence of tokens into a specified syntax.
• parsing: To break a sentence down into its component parts of speech with an explanation of the form, function, and syntactical relationship of each part.

Lexical Analyzer (a.k.a. Scanner)
• The only part of a compiler that looks at each character of the source text; it performs a linear analysis
• Reads source text and produces TOKENS
• Also keeps track of the source coordinates of each token: file name, line number and position
– (This is useful for debugging and error reporting.)
• Advantages of a separate lexical analyzer:
– Keeps compiler design simple
– Improves efficiency
– Increases portability

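The Role of a Lexical Analyzer
[Figure: the source program feeds the lexical analyzer, which requests characters ("get next char") and receives them ("next char"); the syntax analyzer requests tokens from the lexical analyzer ("get next token") and receives them ("next token"). A symbol table, containing a record for each identifier, is also shown.]
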
Tokens, Patterns and Lexemes
• What are tokens?
– The basic lexical units of the language
– A sequence of abstract characters that can be treated as a unit in the grammar of the language
– A programming language classifies the tokens into a finite set of token types
• Some tokens may have attributes:
– An integer constant token will have the actual integer (17, 42) as an attribute
– Identifiers will have a string with the actual id
• A note on terminology: some texts refer to
– token types as tokens, and
– tokens as lexemes
• We will stick to the terms Tokens and Token Types

Tokens Example
• Let us consider the program segment:
void main() { printf("Hello World\n"); }
• The tokens of this program segment are:
1. void
2. main
3. (
4. )
5. {
6. printf
7. (
8. "Hello World\n"
9. )
10. ;
11. }

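As a rough illustration, the token list for the program segment void main() { printf("Hello World\n"); } can be reproduced with a single regular expression. This is only a sketch: the pattern `TOKEN` and the function `tokens` are hypothetical names, and the pattern covers just the token kinds that occur in this one segment.

```python
import re

# Hypothetical pattern for this one segment only: a string literal,
# an identifier/keyword, or a single punctuation character.
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|[A-Za-z_]\w*|[(){};]')

def tokens(src):
    """Return the lexemes found in src, skipping anything unmatched (whitespace)."""
    return TOKEN.findall(src)

segment = r'void main() { printf("Hello World\n"); }'
result = tokens(segment)
for number, lexeme in enumerate(result, start=1):
    print(number, lexeme)
```

The string literal alternative is listed first so that the quoted text is kept as one token instead of being broken into words.
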
Specifications of Tokens
Strings, Words and Sentences: terms for parts of a string s
1. Prefix of s: a string obtained by deleting zero or more trailing symbols of s
2. Suffix of s: a string obtained by deleting zero or more leading symbols of s
3. Substring of s: a string obtained by deleting a prefix and a suffix from s
4. Proper prefix, suffix or substring of s: any nonempty string x that is, respectively, a prefix, suffix or substring of s such that s ≠ x
5. Subsequence of s: a string obtained by deleting zero or more symbols of s, not necessarily contiguous

The Principle of Longest Match
• In most languages, the scanner should pick the longest possible string to make up the next token if there is a choice
• Example:
return foobar != hohum;
should be recognized as 5 tokens:
RETURN ID(foobar) NEQ ID(hohum) SCOLON
and not more (i.e., not parts of words or identifiers, and not ! and = as separate tokens)

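The longest-match rule can be sketched with a small hand-written scanner. Everything named here (`OPERATORS`, `KEYWORDS`, `scan`, and the token names NOT and ASSIGN) is a hypothetical illustration, not part of the slides; only RETURN, ID, NEQ and SCOLON come from the example.

```python
# Hypothetical token table: (token type, literal spelling).
OPERATORS = [("NEQ", "!="), ("NOT", "!"), ("ASSIGN", "="), ("SCOLON", ";")]
KEYWORDS = {"return": "RETURN"}

def scan(src):
    """Split src into tokens, always taking the longest possible lexeme."""
    out, i = [], 0
    while i < len(src):
        c = src[i]
        if c.isspace():
            i += 1
        elif c.isalpha():
            j = i
            while j < len(src) and src[j].isalnum():
                j += 1                      # extend the word as far as possible
            word = src[i:j]
            out.append(KEYWORDS.get(word, f"ID({word})"))
            i = j
        else:
            # Among the operators that match at position i, prefer the longest.
            matches = [(len(s), t) for t, s in OPERATORS if src.startswith(s, i)]
            if not matches:
                raise ValueError(f"unexpected character {c!r}")
            length, tok = max(matches)
            out.append(tok)
            i += length
    return out

print(scan("return foobar != hohum;"))
```

Because the candidates are compared by length, `!=` wins over `!` followed by `=`, and `foobar` is never split into shorter identifiers.
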
Typical Tokens in Programming Languages
• Operators & punctuation
– + - * / ( ) { } [ ] ; : :: < <= == = != ! …
– Each of these is a distinct lexical class (or token type)
• Keywords
– if while for goto return switch void …
– Each of these is also a distinct lexical class (not a string)
• Identifiers
– A single ID lexical class, but parameterized by the actual id
• Integer constants
– A single INT lexical class, but parameterized by the int value
• Other constants, etc.

Tokens of a Typical Language
TYPE        EXAMPLES
ID          foo, n14, a, temp, …
NUM         73, 0, 00, 515, +2, …
REAL        66.1, .5, 10., 1e67, 5.5e-10, …
KEYWORDS    IF, DO, WHILE, INT, …
SYMBOLS     , (Comma), != (Noteq), ( (Lparen), …

Tokens of a Typical Language
• Question: How are tokens defined and recognized?
• Answer: By using regular expressions to define a token as a formal regular language

Formal Theory of Languages
• A language in real life is made up of:
1. Words made up of letters of an alphabet, and
2. Sentences made up of words arranged according to the grammar of that language
• Natural languages display an amazing variety of expressions, with explicit and implicit meanings, and variations in meaning as well as in grammar
• Computer languages, on the contrary, focus on
– A limited set of tasks to be performed
– Hence mathematical precision is essential in defining their structure and grammar

Formal Definition of Languages
• Alphabet: a finite (non-empty) set of symbols, denoted by Σ
• String: a finite sequence of symbols from an alphabet, including the empty sequence (denoted by λ)
• Language: a set (often infinite) of finite strings
– The set of all possible finite strings of elements of alphabet Σ (including λ) is denoted by Σ*
• Finite specification of (possibly infinite) languages is possible with:
1. Automaton: a recognizer; a machine that accepts all strings in a language (and rejects all other strings)
2. Grammar: a generator; a system for producing all strings in the language (and no other strings)

Formal Definition of Languages
• Note: a language may be specified by many different grammars and automata, BUT a grammar or automaton specifies only one language

Formal Language Definition (Contd.)
• As already defined, a language L over an alphabet Σ is a collection of strings of elements of Σ
– The PASCAL language is the set of all strings that constitute legal PASCAL programs (infinite set)
– The language of primes is the set of all decimal digit strings that constitute prime numbers (infinite set)
– The language of C reserved words is the set of all alphabetic strings that cannot be used as identifiers in the C programming language (finite set)
• To specify some of these (possibly infinite) languages with a finite description, we use the notation of Regular Expressions

Regular Expressions
• A regular expression is always defined over some alphabet Σ (for programming languages, commonly ASCII or Unicode)
• If E is a regular expression, L(E) is the "language" (set of strings) generated by E
• For example, for each symbol a in the alphabet of the language, the regular expression a denotes the language {a} containing just the string a (known as Symbol)
• The regular expression ε denotes the language containing only the empty sequence λ

Operations with Regular Expressions
• Given two regular expressions M and N:
• Alternation (denoted by |) makes a new regular expression M | N denoting the UNION of the languages: L(M | N) = L(M) ∪ L(N)
• Concatenation (denoted by . or by juxtaposition) makes a new regular expression MN denoting the language of strings formed by a string of L(M) followed by a string of L(N)
• Repetition (denoted by *) makes a new regular expression M* denoting the language of 0 or more concatenated occurrences of strings of L(M) (Kleene closure)

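The three operations can be tried out directly on small finite languages represented as Python sets of strings. This is a sketch only: the function names are hypothetical, and because the Kleene closure is an infinite set, `star` takes an assumed bound `max_reps` on the number of repetitions.

```python
def union(M, N):
    """Alternation: L(M | N) is the union of the two languages."""
    return M | N

def concat(M, N):
    """Concatenation: every string of M followed by every string of N."""
    return {x + y for x in M for y in N}

def star(M, max_reps):
    """Kleene closure truncated to at most max_reps repetitions (the real set is infinite)."""
    result, layer = {""}, {""}
    for _ in range(max_reps):
        layer = concat(layer, M)   # strings made of exactly one more piece
        result |= layer
    return result

print(union({"a"}, {"b"}))
print(concat({"a"}, {"b", "c"}))
print(sorted(star({"ab"}, 2), key=len))
```
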
Regular Expression Example
Expression          Language              Example words
a | b               {a, b}                a, b
ab*a                {a}{b}*{a}            aa, aba, abba, abbba, …
(ab)*               {ab}*                 ε, ab, abab, ababab, …
abba                {abba}                abba
(0 | 1)*0           ({0} ∪ {1})*{0}       0, 00, 10, 010, 110, … (all even binary numbers)
b*(abb*)*(a | ε)    strings of a and b with no consecutive a

Similarly, using symbols, |, ., * and ε, we can specify the regular expressions corresponding to the lexical tokens of a programming language using rules (a.k.a. productions)

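The example expressions can be checked with Python's `re` module, whose syntax for |, * and grouping agrees with the notation used here; the empty alternative `(a|)` stands in for (a | ε). The dictionary `EXAMPLES` and the helper `all_match` are hypothetical names for this sketch, and the sample words for the last expression are my own.

```python
import re

# Each expression from the table, paired with words that should belong to its language.
EXAMPLES = {
    r"a|b": ["a", "b"],
    r"ab*a": ["aa", "aba", "abba", "abbba"],
    r"(ab)*": ["", "ab", "abab", "ababab"],
    r"abba": ["abba"],
    r"(0|1)*0": ["0", "00", "10", "010", "110"],
    r"b*(abb*)*(a|)": ["", "b", "ab", "aba", "babab"],
}

def all_match():
    """True if every listed word is matched in full by its expression."""
    return all(re.fullmatch(p, w) for p, words in EXAMPLES.items() for w in words)

print(all_match())
print(re.fullmatch(r"b*(abb*)*(a|)", "baab"))   # two consecutive a's: no match
```

`re.fullmatch` is used instead of `re.match` because membership in a language requires the whole string to match, not just a prefix.
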
Table of Operators & Abbreviations
Notation     Description
a            An ordinary character that stands for itself
ε            The empty string
M | N        Alternation: choosing from M or N
MN           Concatenation: an M followed by an N
M*           Repetition (zero or more times)
M+           Repetition (one or more times)
M?           Optional (zero or one occurrence of M)
[a–zA–Z]     Character set alternation (any one character in the given ranges)
[abxyz]      One of the given characters (a|b|x|y|z)
.            Stands for any single character (except newline)
‘a.+*’       Quotation: a string in quotes stands for itself literally

Regular Expression Construction
• Problem: Specify the set of unsigned numbers as a regular expression. (Examples: 1997, 19.97)
• Observations on numbers:
1. A number can be made up of one or more digits from the set (0–9)
2. It can optionally have a decimal point after those digits, followed by 0 or more digits: "."(0–9)*
3. A number can also start with a point followed by one or more digits
• Resulting expression (square brackets are used here for grouping):
[ (0–9)+ [ "."(0–9)* ]? ] | [ "."(0–9)+ ]

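The three observations translate almost symbol for symbol into a Python regular expression. In this sketch the names `UNSIGNED` and `is_unsigned` are hypothetical, and Python's parentheses replace the grouping brackets of the expression above.

```python
import re

# digits, optionally followed by "." and more digits; or "." followed by digits
UNSIGNED = re.compile(r"[0-9]+(\.[0-9]*)?|\.[0-9]+")

def is_unsigned(text):
    """True if the whole of text is an unsigned number under the three observations."""
    return UNSIGNED.fullmatch(text) is not None

for sample in ["1997", "19.97", "19.", ".97", ".", "1.2.3"]:
    print(sample, is_unsigned(sample))
```
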
Regular Expressions for Some Tokens of a Programming Language
Regular expression                          Action (token type)
if                                          return IF
[a–z][a–z0–9]*                              return ID
[0–9]+                                      return NUM
([0–9]+ ‘.’ [0–9]*) | (‘.’ [0–9]+)          return REAL
(‘\*’ [a–z]* ‘\n’) | (‘ ’ | ‘\n’ | ‘*/’)+   return Comment
.                                           return ERROR

A Regular Expression Recognizer
• Given an input string, the function of a "regular expression analyzer" is to say:
– "YES, the input is part of the language generated from the regular expression", or
– "NO, the input isn't part of the language generated from the regular expression"
• Using results from finite automata theory and the theory of algorithms, we can automate the construction of such recognizers from regular expressions

Finite Automata
• A finite automaton is a transition graph that has:
– A finite set of states S (represented by nodes), with edges leading from one state to another
– Each edge labeled with the symbol (from the set Σ) that causes the transition (could also be ε!)
– One state denoted as the start state s0, and certain states distinguished as final states (normally drawn as two concentric circles)
• Mathematically, it can be represented as a tuple:
A = (S, Σ, s0, F, move)
where F ⊆ S is the set of final states and move is the transition function

Recognizing Expressions as Tokens with Finite State Automaton
• Operate by reading input symbols (usually characters)
– A transition can be taken if it is labeled with the current symbol
– An ε-transition can be taken at any time
• Accept when a final state is reached and there is no more input
– A scanner is slightly different: it accepts the longest match even if there is more input
• Reject if no transition is possible, or if there is no more input and the automaton is not in a final state (DFA)

Finite Automata Examples
if: start → 1 –i→ 2 –f→ 3 (final: return IF)
[a–z][a–z0–9]*: start → 1 –[a–z]→ 2 (final: return ID); state 2 loops on a–z and 0–9
[0–9]+: start → 1 –[0–9]→ 2 (final: return NUM); state 2 loops on 0–9

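Finite Automata Examples (Contd.)
([0–9]+ ‘.’ [0–9]*) | (‘.’ [0–9]+)
[Figure: a five-state automaton. start → 1; 1 –[0–9]→ 2, state 2 loops on 0–9, 2 –.→ 3, state 3 loops on 0–9; 1 –.→ 4, 4 –[0–9]→ 5, state 5 loops on 0–9; the final states return REAL.]
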
Deterministic Finite Automata (DFA)
• A finite automaton is deterministic if:
1. It has no edges/transitions labeled with ε.
2. For each state and for each symbol in the alphabet, there is exactly one edge labeled with that symbol.
• Such a transition graph is called a state graph.
A Deterministic Finite Automaton (DFA) for b*abb:
start → 0; state 0 loops on b; 0 –a→ 1 –b→ 2 –b→ 3 (final)

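Simulating a DFA takes one table lookup per input character. The sketch below encodes the b*abb automaton as I read it from the figure (state 0 loops on b, then a, b, b lead to final state 3); the names `DFA`, `START`, `FINAL` and `accepts` are hypothetical, and a missing table entry is treated as an immediate reject.

```python
# Transition table for b*abb: (state, symbol) -> next state.
DFA = {(0, "b"): 0, (0, "a"): 1, (1, "b"): 2, (2, "b"): 3}
START, FINAL = 0, {3}

def accepts(text):
    """Run the DFA over text; accept only if it ends in a final state."""
    state = START
    for ch in text:
        state = DFA.get((state, ch))
        if state is None:          # no edge for this symbol: reject
            return False
    return state in FINAL

for sample in ["abb", "bbabb", "ab", "abba", "aabb"]:
    print(sample, accepts(sample))
```
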
Non-deterministic Finite Automata (NFA)
• In a non-deterministic finite automaton:
1. From a state (node), there may be more than one edge labeled with the same symbol, and there may be no edge from a node labeled with a given input symbol
2. An edge can also be labeled with the empty symbol ε
A Non-deterministic Finite Automaton (NFA) for (a|b)*abb:
start → 0; state 0 loops on a and on b; 0 –a→ 1 –b→ 2 –b→ 3 (final)

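An NFA can be simulated by tracking the set of states it could be in after each character. The sketch below encodes the (a|b)*abb automaton as I read it from the figure; `NFA` and `accepts` are hypothetical names, and this particular automaton has no ε-edges, so no closure step is needed.

```python
# (state, symbol) -> set of possible next states. State 0 has two edges on "a".
NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
START, FINAL = 0, 3

def accepts(text):
    """Track every state the NFA could be in; accept if the final state is among them."""
    current = {START}
    for ch in text:
        # Union of the targets of all edges labeled ch that leave a current state.
        current = set().union(*(NFA.get((q, ch), set()) for q in current))
    return FINAL in current

for sample in ["abb", "aabb", "babb", "abab", ""]:
    print(sample, accepts(sample))
```
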
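Another NFA
[Figure: from the start state, one ε-edge leads to a branch that reads a and then loops on a, and another ε-edge leads to a branch that reads b and then loops on b.]
An ε-transition is taken without consuming any character from the input.
What does the above NFA accept? aa* | bb*
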
NFA and DFA – A Comparison
• NFA
– May have edges/transitions labeled with ε
– From a state (node), there may be more than one edge labeled with the same symbol, and there may be no edge from a node labeled with a given input symbol
– Quicker to build but slower to simulate
• DFA
– Has no edges/transitions labeled with ε
– For each state and for each symbol in the alphabet, there is exactly one edge labeled with that symbol
– Slower to build but quicker to simulate

Relationship between DFA & NFA
• It is obvious that a DFA can be simulated with an NFA
• But what is not so obvious is that an NFA can be simulated with a DFA!
• How?
– Simulate sets of possible states
– Possible exponential blowup in the state space
– Still, only one state transition is made per character in the input stream

Automating a RE Recognizer Construction
• To convert a specification into code:
1. Write down the RE for the input language
2. Build a big NFA
3. Build the DFA that simulates the NFA
4. Systematically shrink the DFA
5. Turn it into code
Note: The DFA construction is done automatically by a tool such as lex (more on this later)

Building NFA From Regular Expression
• Remember that a regular expression is formed by the use of:
– Basic symbols, and their
– Alternation,
– Concatenation, and
– Repetition.
• Hence, all we need to know is:
– How to build the NFA for each of the above (symbols and operations), and
– How to assemble the NFAs corresponding to these symbols into a composite NFA for the expression

Building NFA for Symbols & Operations
1. Building the NFA for a basic symbol a:
1. Start with an initial state i
2. Draw an edge/transition labeled with the symbol (this could be ε too!)
3. to the final state f
[Figure: start → i –a→ f, and start → i –ε→ f]

Building NFA for Symbols & Operations
2. Building the NFA for alternation, N(s | t):
– Given two NFAs N(s) and N(t),
1. Construct a new start state i and a new final state f.
2. Add a transition from the start state i to each of the start states of N(s) and N(t), and label them with ε
3. Add a transition from each of the final states of N(s) and N(t) to the final state f, and label them with ε
[Figure: start → i; i –ε→ N(s) –ε→ f and i –ε→ N(t) –ε→ f]

Building NFA for Symbols & Operations
3. Building the NFA for concatenation, N(s.t) or N(st):
– Given two NFAs N(s) and N(t),
1. Construct a new start state i and a new final state f.
2. Either overlap the start state of the latter, N(t), with the final state of the former, N(s), or add an ε-transition from the final state of N(s) to the start state of N(t)
3. From the start state i, add an edge labeled with ε to the start state of N(s)
4. From the final state of N(t), add an edge labeled with ε to the final state f
[Figure: start → i, then N(s) followed by N(t), ending in f]

Building NFA for Symbols & Operations
4. Building the NFA for repetition, N(s*):
1. Construct a new start state i and a new final state f
2. Add an ε-transition from the new start state to the new final state
3. Add an ε-transition from the new final state to the start state of N(s)
4. Add another ε-transition from the final state of N(s) to the new final state
[Figure: start → i –ε→ f, with N(s) attached to f by ε-edges in both directions]

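Construction of NFA – Examples
(a|b).(a|b)
[Figure: the NFAs for (a) and (b) are single edges labeled a and b; the NFA for (a|b) joins them between a new start state and a new final state using ε-edges; the NFA for (a|b).(a|b) chains two copies of the (a|b) NFA.]
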
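Construction of NFA – Examples (Contd.)
[a–z][a–z0–9]*: a symbol NFA (start → 1 –[a–z]→ 2) followed by a repetition NFA (states 6, 7, 8, with edges on a–z and 0–9 joined by ε-edges); return ID
[0–9]+ = [0–9][0–9]*: a symbol NFA (start → 1 –[0–9]→ 2) followed by a repetition NFA (states 3, 4, 5, with an edge on 0–9 joined by ε-edges); return NUM
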
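The four constructions (symbol, alternation, concatenation, repetition) can be sketched as small functions that each return an NFA fragment. All names here (`Fragment`, `symbol`, `alternation`, `concatenation`, `repetition`, `matches`) are hypothetical. As a simplifying assumption, concatenation links the two fragments with a single ε-edge rather than merging states, and repetition follows the three ε-edges listed in the repetition steps.

```python
import itertools

_ids = itertools.count()   # source of fresh state numbers
EPS = None                 # label used for ε-edges

class Fragment:
    """An NFA piece: one start state, one final state, a list of (src, label, dst) edges."""
    def __init__(self, start, final, edges):
        self.start, self.final, self.edges = start, final, edges

def symbol(a):
    i, f = next(_ids), next(_ids)
    return Fragment(i, f, [(i, a, f)])

def alternation(s, t):
    i, f = next(_ids), next(_ids)
    extra = [(i, EPS, s.start), (i, EPS, t.start), (s.final, EPS, f), (t.final, EPS, f)]
    return Fragment(i, f, s.edges + t.edges + extra)

def concatenation(s, t):
    # Assumption: an ε-edge from the final state of s to the start of t replaces state merging.
    return Fragment(s.start, t.final, s.edges + t.edges + [(s.final, EPS, t.start)])

def repetition(s):
    i, f = next(_ids), next(_ids)
    extra = [(i, EPS, f), (f, EPS, s.start), (s.final, EPS, f)]
    return Fragment(i, f, s.edges + extra)

def closure(states, edges):
    """All states reachable from `states` using ε-edges only."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for src, label, dst in edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def matches(nfa, text):
    current = closure({nfa.start}, nfa.edges)
    for ch in text:
        moved = {dst for src, label, dst in nfa.edges if src in current and label == ch}
        current = closure(moved, nfa.edges)
    return nfa.final in current

# (a|b).(a|b), the example expression used on the construction slides
nfa = concatenation(alternation(symbol("a"), symbol("b")),
                    alternation(symbol("a"), symbol("b")))
print(matches(nfa, "ab"), matches(nfa, "a"), matches(nfa, "abb"))
```
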
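Combining Several NFA's
[Figure: one NFA with start state 1 and four branches.
IF: 1 –i→ 2 –f→ 3 (final: IF)
ID: 1 –ε→ 4 –[a–z]→ 5 –ε→ 8 (final: ID), 8 –ε→ 6, 6 –[a–z, 0–9]→ 7, 7 –ε→ 8
NUM: 1 –ε→ 9 –[0–9]→ 10 –ε→ 13 (final: NUM), 13 –ε→ 11, 11 –[0–9]→ 12, 12 –ε→ 13
ERROR: 1 –ε→ 14 –any character→ 15 (final: ERROR)]
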
Conversion of NFA to DFA
• A DFA can be constructed from the NFA, where each DFA state represents a set of NFA states
• Key idea: the state of the DFA after reading some input is the set of all states the NFA could have reached after reading the same input
• If the NFA has n states, the DFA will have at most 2^n states
• The resulting DFA may have more states than needed
• Let us study the conversion with an example

Converting NFA to DFA
[Figure: the combined NFA for IF, ID, NUM and ERROR, states 1–15]
Q: What states can be reached from state 1 without consuming a character?
A: {1,4,9,14} form the ε-closure of state 1
Defn: Given a set of NFA states T, ε-closure(T) is the set of states that are reachable through ε-transitions from any state s ∈ T.

Converting NFA to DFA
[Figure: the combined NFA for IF, ID, NUM and ERROR, states 1–15]
What are ALL the state ε-closures in this NFA?
closure(1) = {1,4,9,14}
closure(5) = {5,6,8}      closure(10) = {10,11,13}
closure(8) = {6,8}        closure(13) = {11,13}
closure(7) = {6,7,8}      closure(12) = {11,12,13}

Converting NFA to DFA
• We already know that: given a set of NFA states T, ε-closure(T) is the set of states that are reachable through ε-transitions from any state s ∈ T.
• We now define: given a set of NFA states T, move(T, a) is the set of states that are reachable on input a from any state s ∈ T
• Now the problem definition: given an NFA, find the DFA with the minimum number of states that has the same behavior as the NFA for all inputs.

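The two definitions, ε-closure(T) and move(T, a), can be sketched directly. The edge lists below are my transcription of the combined IF/ID/NUM/ERROR NFA, inferred from the closures stated on the slides, so treat them as an assumption; the names `EPS_EDGES`, `CHAR_EDGES`, `eps_closure` and `move` are hypothetical.

```python
import string

LETTERS = set(string.ascii_lowercase)
DIGITS = set(string.digits)
ANY = None   # marks the "any character" edge

# ε-edges: state -> set of states reachable by one ε-transition (assumed transcription)
EPS_EDGES = {1: {4, 9, 14}, 5: {8}, 8: {6}, 7: {8}, 10: {13}, 13: {11}, 12: {13}}
# labeled edges: (source, set of characters or ANY, destination)
CHAR_EDGES = [(1, {"i"}, 2), (2, {"f"}, 3), (4, LETTERS, 5), (6, LETTERS | DIGITS, 7),
              (9, DIGITS, 10), (11, DIGITS, 12), (14, ANY, 15)]

def eps_closure(T):
    """States reachable from any state in T using ε-transitions only."""
    closure, stack = set(T), list(T)
    while stack:
        for nxt in EPS_EDGES.get(stack.pop(), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def move(T, a):
    """States reachable from any state in T by one edge labeled with character a."""
    return {dst for src, label, dst in CHAR_EDGES
            if src in T and (label is ANY or a in label)}

start = eps_closure({1})
print(sorted(start))
print(sorted(move(start, "i")), sorted(eps_closure(move(start, "i"))))
```
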
Converting NFA to DFA
[Figure: the combined NFA for IF, ID, NUM and ERROR, states 1–15]
Dstates = {1-4-9-14}
1. Start with the initial state of the NFA (s0) and work out the set of states in the DFA, Dstates, initialized with a state representing ε-closure(s0).

Converting NFA to DFA
[Figure: the combined NFA for IF, ID, NUM and ERROR, states 1–15]
Dstates = {1-4-9-14}
Now we need to compute:
move(1-4-9-14, a-h) = {5,15}
Then, ε-closure({5,15}) = {5,6,8,15}
New DFA edge: 1-4-9-14 –a-h→ 5-6-8-15

Converting NFA to DFA
[Figure: the combined NFA for IF, ID, NUM and ERROR, states 1–15]
Dstates = {1-4-9-14}
Next we need to compute:
move(1-4-9-14, i) = {2,5,15}
Then, ε-closure({2,5,15}) = {2,5,6,8,15}
New DFA edge: 1-4-9-14 –i→ 2-5-6-8-15

Converting NFA to DFA
[Figure: the combined NFA for IF, ID, NUM and ERROR, states 1–15]
Dstates = {1-4-9-14}
Next we need to compute:
move(1-4-9-14, j-z) = {5,15}
Then, ε-closure({5,15}) = {5,6,8,15}
New DFA edge: 1-4-9-14 –j-z→ 5-6-8-15

Converting NFA to DFA
[Figure: the combined NFA for IF, ID, NUM and ERROR, states 1–15]
Dstates = {1-4-9-14}
Next we need to compute:
move(1-4-9-14, 0-9) = {10,15}
Then, ε-closure({10,15}) = {10,11,13,15}
New DFA edge: 1-4-9-14 –0-9→ 10-11-13-15

Converting NFA to DFA
[Figure: the combined NFA for IF, ID, NUM and ERROR, states 1–15]
Dstates = {1-4-9-14}
Next we need to compute:
move(1-4-9-14, other) = {15}
Then, ε-closure({15}) = {15}
New DFA edge: 1-4-9-14 –other→ 15

Converting NFA to DFA
[Figure: the combined NFA for IF, ID, NUM and ERROR, states 1–15]
Dstates = {1-4-9-14}
DFA edges found so far from 1-4-9-14: a-h and j-z → 5-6-8-15; i → 2-5-6-8-15; 0-9 → 10-11-13-15; other → 15
The analysis for 1-4-9-14 is complete. We mark it and pick another state in the DFA to analyze.

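Converted DFA
DFA state (token)       Input            Next state
1-4-9-14 (start)        i                2-5-6-8-15
1-4-9-14                a-h, j-z         5-6-8-15
1-4-9-14                0-9              10-11-13-15
1-4-9-14                other            15
2-5-6-8-15 (ID)         f                3-6-7-8
2-5-6-8-15              a-e, g-z, 0-9    6-7-8
3-6-7-8 (IF)            a-z, 0-9         6-7-8
5-6-8-15 (ID)           a-z, 0-9         6-7-8
6-7-8 (ID)              a-z, 0-9         6-7-8
10-11-13-15 (NUM)       0-9              11-12-13
11-12-13 (NUM)          0-9              11-12-13
15 (error)              (none)
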
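Another Example of Conversion
NFA: S0 –ε→ S1, S0 –ε→ S2; S1 –a→ S3, S2 –b→ S4; S3 –ε→ S5, S4 –ε→ S5; S5 –ε→ S6; S6 –ε→ S7, S6 –ε→ S8; S7 –a→ S9, S8 –b→ S10; S9 –ε→ S11, S10 –ε→ S11
The above NFA would result in the DFA below:
{s0,s1,s2} –a→ {s3,s5,s6,s7,s8}
{s0,s1,s2} –b→ {s4,s5,s6,s7,s8}
{s3,s5,s6,s7,s8} –a→ {s9,s11}, –b→ {s10,s11}
{s4,s5,s6,s7,s8} –a→ {s9,s11}, –b→ {s10,s11}
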
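The whole conversion can be sketched as a worklist loop over sets of NFA states (the subset construction). The tables below are my reading of the twelve-state NFA S0 to S11 from this example, including the assumption that S5 reaches S6 by an ε-edge; `EPS`, `STEP` and `subset_construction` are hypothetical names.

```python
# ε-edges and labeled edges of the example NFA (state numbers stand for S0..S11).
EPS = {0: {1, 2}, 3: {5}, 4: {5}, 5: {6}, 6: {7, 8}, 9: {11}, 10: {11}}
STEP = {(1, "a"): 3, (2, "b"): 4, (7, "a"): 9, (8, "b"): 10}

def eps_closure(T):
    closure, stack = set(T), list(T)
    while stack:
        for nxt in EPS.get(stack.pop(), ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def move(T, a):
    return {STEP[(q, a)] for q in T if (q, a) in STEP}

def subset_construction(start, alphabet):
    """Return (DFA states, DFA transitions); each DFA state is a frozenset of NFA states."""
    start_state = frozenset(eps_closure({start}))
    dstates, dtran, unmarked = {start_state}, {}, [start_state]
    while unmarked:
        T = unmarked.pop()                      # mark T by removing it from the worklist
        for a in alphabet:
            U = frozenset(eps_closure(move(T, a)))
            if not U:
                continue                        # no edge on a: leave it out of the table
            dtran[(T, a)] = U
            if U not in dstates:
                dstates.add(U)
                unmarked.append(U)
    return dstates, dtran

dstates, dtran = subset_construction(0, "ab")
for (T, a), U in sorted(dtran.items(), key=lambda item: (sorted(item[0][0]), item[0][1])):
    print(sorted(T), a, sorted(U))
```

Frozensets are used so that a set of NFA states can itself serve as a dictionary key, which is what lets each DFA state "be" a set of NFA states.
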
Systematically shrink the DFA
• The big picture
– Discover sets of equivalent states
– Represent each such set with just one state
• Two states are equivalent if and only if:
– The sets of paths leading to them are equivalent
– ∀ α ∈ Σ, transitions on α lead to equivalent states (DFA)
– Conversely, if α-transitions lead to distinct sets, the states must be in distinct sets
• A partition P of S
– A collection of sets P such that each s ∈ S is in exactly one p_i ∈ P
– The algorithm iteratively partitions the DFA's states

Minimization
[Figure: a DFA with five states p0–p4 and transitions on a and b]
• Group all the states together: {p0, p1, p2, p3, p4}.
• Separate states according to available exit transitions.
• Split a set in two if from some of its states one can reach another set and from the others one cannot.
• Repeat until no set can be split.

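Iterative partitioning can be sketched as follows. Two things here are assumptions rather than content from the slides: the transition table `DELTA` is a guess at the five-state DFA in the figure (p0 branches to p1 and p2, both of which lead to p3 and p4), and the sketch starts from the common final/non-final split instead of a single group. The names `minimize` and `block_of` are hypothetical.

```python
def minimize(states, alphabet, delta, finals):
    """Partition refinement: returns a list of sets of equivalent states."""
    partition = [b for b in (set(finals), set(states) - set(finals)) if b]
    while True:
        def block_of(q):
            # Index of the block containing q; None stands for a missing transition.
            for index, block in enumerate(partition):
                if q in block:
                    return index
            return None

        refined = []
        for block in partition:
            groups = {}
            for q in block:
                # States stay together only if every symbol leads to the same block.
                signature = tuple(block_of(delta.get((q, a))) for a in alphabet)
                groups.setdefault(signature, set()).add(q)
            refined.extend(groups.values())
        if len(refined) == len(partition):      # nothing was split: done
            return refined
        partition = refined

# Assumed transitions for the five-state example DFA.
DELTA = {("p0", "a"): "p1", ("p0", "b"): "p2",
         ("p1", "a"): "p3", ("p1", "b"): "p4",
         ("p2", "a"): "p3", ("p2", "b"): "p4"}
blocks = minimize(["p0", "p1", "p2", "p3", "p4"], "ab", DELTA, {"p3", "p4"})
print(sorted(sorted(b) for b in blocks))
```

Under these assumptions p1 and p2 collapse into one state and p3 and p4 into another, leaving a three-state DFA.
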
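Minimization
[Figure: the same five-state DFA, p0–p4]
The above DFA can now be minimized as:
[Figure: the minimized DFA, in which consecutive states are connected by two edges, one on a and one on b]
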
Pseudo Code For Lexical Analyzer
function lexan: integer;
var lexbuf: array [0..100] of char;
    C: char;
begin
    loop begin
        read a character into C;
        if C is a blank or a tab then
            do nothing
        else if C is a newline then
            increment lineno
        else if C is a digit then
        begin
            set tokenval to the value of this and following digits;
            return NUM
        end
        else if C is a letter then
        begin
            place C and successive letters and digits into lexbuf;
            p := lookup(lexbuf);
            tokenval := p;
            return the token field of table entry p
        end
        else
        begin /* token is a single character */
            set tokenval to NONE; /* no attribute */
            return integer encoding of character C
        end
    end
end

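A runnable rendering of the lexan pseudo code might look like the sketch below. The token codes `NUM`, `ID` and `NONE`, the list-based `symtable`, and the behaviour of `lookup` inserting unknown names as ID are all assumptions made for illustration; the pseudo code does not specify them.

```python
NUM, ID, NONE = 256, 257, -1   # assumed token codes, chosen above the character range
symtable = []                  # entries: (lexeme, token type)

def lookup(lexeme):
    """Return the table index of lexeme, inserting it as an ID if absent (assumption)."""
    for p, (name, _) in enumerate(symtable):
        if name == lexeme:
            return p
    symtable.append((lexeme, ID))
    return len(symtable) - 1

class Lexer:
    def __init__(self, text):
        self.text, self.pos, self.lineno, self.tokenval = text, 0, 1, NONE

    def lexan(self):
        """Return the next token type, setting self.tokenval; None at end of input."""
        while self.pos < len(self.text):
            c = self.text[self.pos]
            if c in " \t":
                self.pos += 1                       # do nothing
            elif c == "\n":
                self.lineno += 1
                self.pos += 1
            elif c.isdigit():
                start = self.pos
                while self.pos < len(self.text) and self.text[self.pos].isdigit():
                    self.pos += 1
                self.tokenval = int(self.text[start:self.pos])
                return NUM
            elif c.isalpha():
                start = self.pos
                while self.pos < len(self.text) and self.text[self.pos].isalnum():
                    self.pos += 1
                p = lookup(self.text[start:self.pos])
                self.tokenval = p
                return symtable[p][1]               # token field of table entry p
            else:
                self.pos += 1                       # token is a single character
                self.tokenval = NONE
                return ord(c)
        return None

lexer = Lexer("x1 = 42\n+")
first = (lexer.lexan(), lexer.tokenval)
print(first, symtable)
```
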
Building Lexical Analyzers Automatically
• The point to note is that the process studied so far is well suited for automation:
1. The implementer writes down the regular expressions
2. The scanner generator builds the NFA, DFA and minimal DFA, and then writes out the (table-driven or direct-coded) code
3. This process reliably produces fast, robust lexical analyzers
• One such tool is Lex

Lex – A Tool for Generating Scanners
• A widely used tool for specifying lexical analyzers for a wide variety of languages. How does it work?
1. The specification of a lexical analyzer is prepared by creating a program lex.l (containing REs) in the Lex language
2. Then lex.l is run through the Lex compiler to produce a program lex.yy.c (contains a tabular representation of the state transition diagram)
3. lex.yy.c is run through the C compiler to produce the object code of the lexical analyzer, a.out
[Figure: Lex source program lex.l → Lex compiler → lex.yy.c → C compiler → a.out; input stream → a.out → sequence of tokens]

Lex Functions
1. Translates the definitions into an automaton.
2. The automaton looks for the longest matching string.
3. Either returns some value to the reading program (parser), or looks for the next token.
4. Lookahead operator: x/y allows the token x only if y follows it (but y is not part of the token).

Lex Program Structure
• A Lex program (nothing but specifications in lex.l) consists of THREE parts:
1. Declarations: this section includes declarations of variables and manifest constants.
2. Translation rules: this section includes patterns (REs) and the corresponding actions to be taken.
3. Auxiliary procedures: this section includes whatever auxiliary procedures are needed.
The three sections are separated by lines beginning with %%

A Sample Lex Program
1) %{
/* Remove uppercase letters. Commands to execute are
lex test.l and gcc lex.yy.c -ll -o test */
%}
%%
[A-Z]+ ;
2) %{
/* Line numbering */
%}
%%
^.*\n printf("%d\t%s",yylineno-1,yytext);

Any
Questions ????

Thank you
Regular Expression Construction
• Problem: Specify the set of unsigned numbers as a regular expression. (Examples: 1997, 19.97)
• Solution: Start with symbols and keep defining regular sub-expressions until the final expression is achieved
RULE 1. digit → 0 | 1 | 2 | 3 | … | 9
RULE 2. digits → digit digit* (or digit+)
[one digit followed by the Kleene star closure digit*, meaning 1 or more digits in total]
RULE 3. optional_fraction → ‘.’ digits | ε
RULE 4. Num → digits optional_fraction

Regular Expression Construction
• Note that we have used ALL the definitions of a regular expression

Unsigned Number Validation Using Rules
• Let us derive numbers from these rules
RULE 1. digit → 0 | 1 | 2 | 3 | … | 9
RULE 2. digits → digit digit* (or digit+)
[one digit followed by the Kleene star closure digit*, meaning 1 or more digits in total]
RULE 3. optional_fraction → ‘.’ digits | ε
RULE 4. Num → digits optional_fraction
Test strings: 1997, 25.97, 36., .14

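Rules 1 to 4 can be composed mechanically into one regular expression and applied to the strings 1997, 25.97, 36. and .14. This is a sketch with hypothetical names (`DIGIT`, `DIGITS`, `OPTIONAL_FRACTION`, `NUM`, `derives`); each Python string mirrors one rule.

```python
import re

DIGIT = r"[0-9]"                              # RULE 1
DIGITS = rf"{DIGIT}{DIGIT}*"                  # RULE 2: digit digit*
OPTIONAL_FRACTION = rf"(\.{DIGITS}|)"         # RULE 3: '.' digits | ε
NUM = re.compile(rf"{DIGITS}{OPTIONAL_FRACTION}")   # RULE 4

def derives(text):
    """True if text can be derived from Num using the four rules."""
    return NUM.fullmatch(text) is not None

for sample in ["1997", "25.97", "36.", ".14"]:
    print(sample, derives(sample))
```

Under these four rules a fraction needs at least one digit after the point and a number needs at least one digit before it, so 36. and .14 are not derivable, unlike under the broader expression that allows both forms.
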
Regular Expression Construction
• Qn: How to write a regular expression for identifiers? (An identifier is a letter followed by zero or more letters or digits.)
• Answer:
1. Letter → a | A | b | B | … | z | Z
2. Digit → 0 | 1 | 2 | 3 | … | 9
3. Letter_or_Digit → Letter | Digit
4. Identifier → Letter Letter_or_Digit*
• One can define similar regular expressions for comments, strings, operators and delimiters (the different tokens of a language)

Grammar for a Tiny Language
• program ::= statement | program statement
• statement ::= assignStmt | ifStmt
• assignStmt ::= id = expr ;
• ifStmt ::= if ( expr ) statement
• expr ::= id | int | expr + expr
• id ::= a | b | c | i | j | k | n | x | y | z
• int ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
The rules of a grammar are also known as productions
