Chapter 2
by…
Mr. Birku L.
B.Sc. CS, M.Sc. SE
The Academic year of 2015 E.C
Revision chapter 1
Answer the following question among
them
1. What are translators?
2. What is ? Explain about
3. Explain about
4. Define and r?
5. What are the and between
and ?
6. Explain about and multi-pass
7. Explain about the
8. Explain about and ?
9. Write about with ?
Chapter 2 Outline
Lexical Analysis?
Role of lexical analysis
Issues in lexical analysis
Token, Pattern, Lexeme
Attributes of a token
Input Buffering
Buffer pairs
Sentinels (Guards)
Specification of tokens using regular expressions
Regular Expression
Recognition of tokens using transition diagrams
Transition diagram
Lexical Analysis
Error Messages
Role of Lexical Analysis
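1. To read the input characters of the source program
2. To group them into lexemes
3. To produce a sequence of tokens, used by the parser as its input for syntax analysis
4. To interact with the symbol table
◦ To insert identifiers into it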
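[Figure: the source program is read by the lexical analyzer, which returns a token to the parser each time the parser calls getNextToken. Both the lexical analyzer and the parser consult the symbol table, and the parser's output goes on to semantic analysis.]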
Role of Lexical Analysis
5. To strip out elements that are not tokens:
• Comments : such as // or /* -------*/
• Whitespaces : such as blank, newline, tab, …
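The roles listed above can be sketched in a few lines of Python. This is only an illustrative sketch: the token names and the regular expressions in TOKEN_SPEC are assumptions made for this example, not the patterns of any particular compiler. The generator plays the part of getNextToken, and comments and whitespace are matched but never handed to the parser.

```python
import re

# Hypothetical token patterns, chosen only for this illustration.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*|/\*.*?\*/"),   # recognised, then discarded
    ("WS",      r"\s+"),                  # recognised, then discarded
    ("NUMBER",  r"\d+(?:\.\d+)?"),
    ("ID",      r"[A-Za-z_]\w*"),
    ("OP",      r"<<|>>|<=|>=|<>|[-+*/%=<>]"),
    ("SYMBOL",  r"[;:,(){}.]"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,  # lets /* ... */ comments span several lines
)

def tokens(source):
    """Yield (token_name, lexeme) pairs, skipping comments and whitespace."""
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        pos = match.end()
        if match.lastgroup not in ("COMMENT", "WS"):
            yield match.lastgroup, match.group()

if __name__ == "__main__":
    for token in tokens("x = 3 // note\n/* skip */ y"):
        print(token)
```

COMMENT is listed before OP so that "//" and "/*" are tried as comment openers before "/" is tried as an operator.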
Token               Sample lexemes           Informal description of pattern
Keyword (if)        if                       characters i, f
Id                  X                        letter followed by a sequence of letters and digits: l(l+d)*
Relation operator   <, <=, =, <>, >=, >      < or <= or = or <> or >= or >
Special symbols     ; : ( ) { } ' " . ,      ; or : or ( or ) or { or } or ' or " or . or ,
Number              3.14                     any numeric constant
Literal/Constant    "core"                   anything but ", surrounded by "
Attributes of a token
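When more than one lexeme matches a pattern, the lexical analyzer must provide additional information about the matched lexeme along with the token.
Information about an identifier is placed in a symbol-table entry, and the lexical analyzer represents these lexemes in the form of tokens as: <token-name, attribute-value>
token-name: an abstract symbol that is used during syntax analysis,
attribute-value: points to an entry in the symbol table for this token.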
E.g. E = M * C ** 2 has 7 tokens and associated attribute-values:
E  :- <id, 1>         (reference to symbol-table entry for E)
=  :- <assign_op, >
M  :- <id, 2>         (reference to symbol-table entry for M)
*  :- <mult_op, >
C  :- <id, 3>         (reference to symbol-table entry for C)
** :- <exp_op, >
2  :- <num, 2>        (integer value 2)
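The token/attribute pairs for E = M * C ** 2 can be reproduced with a small sketch. The token names follow the slide, while the regular expressions, the list-based symbol table, and the function name tokenize are assumptions made for this illustration.

```python
import re

# Token names follow the slide; the patterns themselves are assumptions.
PATTERNS = [
    ("exp_op",    re.compile(r"\*\*")),   # tried before mult_op so "**" is one lexeme
    ("mult_op",   re.compile(r"\*")),
    ("assign_op", re.compile(r"=")),
    ("num",       re.compile(r"\d+")),
    ("id",        re.compile(r"[A-Za-z_]\w*")),
]

def tokenize(source):
    """Return (tokens, symbol_table); each token is a (token-name, attribute-value) pair."""
    symbol_table = []   # entry i (counting from 1) holds the i-th distinct identifier
    result = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        for name, pattern in PATTERNS:
            match = pattern.match(source, pos)
            if match:
                break
        else:
            raise SyntaxError(f"unexpected character {source[pos]!r}")
        lexeme = match.group()
        if name == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            attribute = symbol_table.index(lexeme) + 1   # reference to the table entry
        elif name == "num":
            attribute = int(lexeme)                      # the integer value itself
        else:
            attribute = None                             # operators need no attribute
        result.append((name, attribute))
        pos = match.end()
    return result, symbol_table

if __name__ == "__main__":
    toks, table = tokenize("E = M * C ** 2")
    print(toks)
    print(table)
```

Operators carry no attribute because the token name alone identifies them. Identifiers and numbers need one because many different lexemes share the same token name.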
Attributes of a token
<id, “y”> <assign, > <num, 31> <+, > <num, 28> <*, > <id, “x”>
[Figure: the lexical analyzer passes each token to the parser as a token-name together with its token attribute.]
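Some examples of tokens
Source line                               Tokens
1. #include <iostream.h>                  (handled by the preprocessor)
2. #include <conio.h>                     (handled by the preprocessor)
3. void main()                            void  main  (  )      (tokens 1 to 4)
4. // This is a hello world program       X  (comment, not a token)
5. {                                      {                     (token 5)
So the total number of tokens is 11.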
Tokens of a Hello World program
void main ( )
{ cout < < "Hello world!\n" ; }
Keyword   Identifier   Numbers   Literal/Constant   Special symbols   Operators
Source line                                        Tokens
1. #include <iostream.h>                           (handled by the preprocessor)
2. #include <conio.h>                              (handled by the preprocessor)
9. cout << "The value of c = " << c << endl;       cout < < "The…" < < c < < endl ;
10. }                                              }
So the total number of tokens is 46.
Tokens of Adding Two Numbers
Keyword   Identifier   Numbers   Literal/Constant   Special symbols   Operators
void      a            10        "Enter..."         (                 <<
int       b                      "Sub...."          )                 >>
cout      c                                         {                 +
cin       main                                      }                 *
                                                    ;                 =
                                                    ,
4         4            1         2                  6                 5
+ = <<
* >>
^
2 0 0 0 1 3
Some examples of tokens
Source line                                                 Tokens
1. #include <iostream.h>                                    (handled by the preprocessor)
2. #include <conio.h>                                       (handled by the preprocessor)
3. void main()                                              void main ( )
4. {                                                        {
5. int n, m, r;                                             int n , m , r ;
6. cout << "Enter two No to get G.C.F\n";                   cout < < "Enter two ….." ;
7. cin >> n >> m;                                           cin > > n > > m ;
8. while (n != 0)                                           while ( n ! = 0 )
9. {                                                        {
10. r = m % n;                                              r = m % n ;
11. m = n;                                                  m = n ;
12. n = r;                                                  n = r ;
13. }                                                       }
14. cout << "G.C.F of entered numbers = " << m << endl;     cout < < "G.C.F…" < < m < < endl ;
15. }                                                       }
So the total number of tokens is 60.
E.g. Tokens of a G.C.F program
Keyword   Identifier   Numbers   Literal/Constant   Special symbols   Operators
void      n            0         "Enter..."         (                 <<
int       m                      "G.C.F...."        )                 >>
cout      r                                         {                 %
cin       main                                      }                 !=
while                                               ;                 =
                                                    ,
5         4            1         2                  6                 5
% != = <<
>>
2 1 0 0 1 2
Input Buffer
is also commonly known as the
When referring to computer memory, the input buffer
is a location that holds all incoming information
before it continues on for processing.
• Lexical analysis is the only part of the compiler that examines the entire
source program one character at a time.
• The scanner accounts for a significant share of total compile time.
• There are general approaches to the implementation of a
lexical analyzer, for example:
use a lexical-analyzer generator, such as the Lex or Flex compiler,
to produce the lexical analyzer from a regular-expression-based specification
Buffer Pair
The input buffer has two halves with N
characters in each half.
N might be the size of a disk block, like 1024 or 4096.
Mark the end of the input stream with a special character
eof.
Maintain two pointers into the buffer marking
the beginning and end of the current lexeme:
– the lexeme-beginning pointer marks the beginning of the current
lexeme
– the forward pointer scans ahead until a pattern match is
found
Simple Input Buffering algorithm
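Initially, both pointers point to the first character of the
next lexeme to be found.
The forward pointer is advanced until a match for a
pattern is found. Once the lexeme is processed, set both
pointers to the character following the lexeme.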
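Code to advance forward pointer:
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1;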
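The pointer-advance logic for a buffer pair can be tried out with a small Python sketch. It is a minimal sketch under stated assumptions: the class name BufferPair, the helper read_all, the tiny half size N = 4, and the use of an empty string to mark "no more data" are all choices made for this illustration, and the lexeme-beginning pointer is left out to keep the focus on forward.

```python
import io

N = 4  # characters per half; a real scanner would use a disk-block size such as 4096

class BufferPair:
    """Two halves of n characters each, refilled alternately (illustrative only)."""

    def __init__(self, stream, n=N):
        self.stream = stream
        self.n = n
        self.buf = [""] * (2 * n)   # "" means "no character loaded here"
        self._reload(0)             # fill the first half before scanning starts
        self.forward = 0

    def _reload(self, start):
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            self.buf[start + i] = chunk[i] if i < len(chunk) else ""

    def current(self):
        return self.buf[self.forward]

    def advance(self):
        if self.forward == self.n - 1:           # forward at end of first half
            self._reload(self.n)                 # reload second half
            self.forward += 1
        elif self.forward == 2 * self.n - 1:     # forward at end of second half
            self._reload(0)                      # reload first half
            self.forward = 0                     # move forward to beginning of first half
        else:
            self.forward += 1

def read_all(text, n=N):
    """Scan the whole input through the buffer pair and return what was read."""
    pair = BufferPair(io.StringIO(text), n)
    seen = []
    while pair.current() != "":
        seen.append(pair.current())
        pair.advance()
    return "".join(seen)

if __name__ == "__main__":
    print(read_all("E = M * C ** 2"))
```

Every call to advance makes up to two end-of-half tests before it moves the pointer, and the caller separately tests what character was read. That per-character overhead is what sentinels are meant to reduce.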
Sentinels (Guards)
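Without sentinels, each advance of the forward pointer needs two tests:
• one to test if it is at the end of the buffer
• one to determine what character is read (a multiway branch)
A sentinel is a special character added at each buffer end.
Goal: Optimize the common case by reducing the
number of tests to one per advance of fwd.
Idea: Extend each buffer half to hold a sentinel at the end.
• This is a special character that cannot occur in a
program (e.g., eof)
• It signals the need for some special action (fill
the other buffer half, or terminate processing).
• When eof appears other than at the end of a buffer half, it means that the
input is at an end.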
Simple Sentinels algorithm
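• For almost every character read, we perform the following tests:
1. Is the character eof?
2. Is the forward pointer at the end of the first half?
3. Is the forward pointer at the end of the second half?
• This can be reduced to one test by using sentinels.
• Add an eof character past the end of each buffer half.
• Use the following code to advance the forward pointer.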
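forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at the end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer half: end of input */
        terminate lexical analysis
end;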
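The sentinel scheme can be sketched in Python as follows. The names SentinelBuffer and read_all, the half size of 4, and the choice of "\0" as the eof sentinel are assumptions for this illustration. The sketch also assumes the source text never contains "\0".

```python
import io

EOF = "\0"  # assumed sentinel: a character that cannot occur in the source text

class SentinelBuffer:
    """Buffer pair laid out as [first half][EOF][second half][EOF] (illustrative only)."""

    def __init__(self, stream, n=4):
        self.stream = stream
        self.n = n
        self.buf = [EOF] * (2 * n + 2)
        self._reload(0)
        self.forward = 0

    def _reload(self, start):
        chunk = self.stream.read(self.n)
        for i in range(self.n):
            # A short read leaves EOF inside the half, which marks the end of input.
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def current(self):
        return self.buf[self.forward]

    def advance(self):
        """Move forward by one character; return False once the input is exhausted."""
        self.forward += 1
        if self.buf[self.forward] != EOF:        # the single test made in the common case
            return True
        if self.forward == self.n:               # sentinel at end of first half
            self._reload(self.n + 1)             # reload second half
            self.forward += 1
        elif self.forward == 2 * self.n + 1:     # sentinel at end of second half
            self._reload(0)                      # reload first half
            self.forward = 0                     # move forward to beginning of first half
        else:                                    # eof within a half: end of input
            return False
        # The freshly loaded half may be empty if the input ended on a boundary.
        return self.buf[self.forward] != EOF

def read_all(text, n=4):
    """Scan the whole input through the sentinel buffer and return what was read."""
    buffer = SentinelBuffer(io.StringIO(text), n)
    seen = []
    more = buffer.current() != EOF
    while more:
        seen.append(buffer.current())
        more = buffer.advance()
    return "".join(seen)

if __name__ == "__main__":
    print(read_all("E = M * C ** 2"))
```

For every ordinary character, advance performs one comparison against EOF. The end-of-half checks run only when a sentinel is actually hit.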
Specification of Tokens
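There are two dual notions for the specification and
recognition of tokens:
• Regular expressions (specification)
• Finite automata (automaton) (recognition)
Any finite set of symbols is called an alphabet.
The length of a string is the number of occurrences of
symbols in it.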
Specification of Tokens
An alphabet is denoted by the Greek letter Σ.
The length of a string s is denoted |s|,
e.g. |0010| = 4, |oo7| = 3, |ε| = 0
Law                          Description
r | s = s | r                | is commutative
r | (s | t) = (r | s) | t    | is associative
(r s) t = r (s t)            concatenation is associative
r (s | t) = r s | r t        concatenation distributes over |
(s | t) r = s r | t r
ε r = r                      ε is the identity element for concatenation
r ε = r
r* = (r | ε)*                relation between * and ε
r** = r*                     * is idempotent
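These laws can be spot-checked with Python's re module by comparing the languages of both sides over all short strings. This is a sanity check on sample expressions, not a proof. The helper name language, the alphabet "ab", the length bound, and the concrete expressions substituted for r, s and t are assumptions for this illustration.

```python
import re
from itertools import product

def language(regex, alphabet="ab", max_len=4):
    """Set of strings over `alphabet`, up to `max_len` long, that `regex` matches in full."""
    compiled = re.compile(regex)
    return {
        "".join(chars)
        for length in range(max_len + 1)
        for chars in product(alphabet, repeat=length)
        if compiled.fullmatch("".join(chars))
    }

# Each pair instantiates one law with concrete expressions for r, s, t.
LAWS = [
    ("a|b",      "b|a"),        # | is commutative
    ("a|(b|ab)", "(a|b)|ab"),   # | is associative
    ("(ab)a",    "a(ba)"),      # concatenation is associative
    ("a(b|ab)",  "ab|aab"),     # concatenation distributes over | (from the left)
    ("(a|b)ab",  "aab|bab"),    # concatenation distributes over | (from the right)
    ("(a|)*",    "a*"),         # r* = (r|ε)*, with ε written as an empty alternative
    ("(a*)*",    "a*"),         # * is idempotent
]

if __name__ == "__main__":
    for left, right in LAWS:
        print(left, "=", right, ":", language(left) == language(right))
```

Two regular expressions are equivalent when they denote the same language, so comparing the two sets is the natural test. The length bound only keeps the enumeration finite.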
Note :- A language may be represented by different or equivalent
regular expressions. An idempotent operation is one that has no
additional effect if it is called more than once with the same input.
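Notational shorthands
Certain constructs occur so frequently in
regular expressions that it is convenient to introduce
notational shorthands for them.
One or more instances :- The unary postfix operator +
means "one or more instances of". If r is a
regular expression that denotes the language L(r), then (r)+ is a
regular expression that denotes the language (L(r))+.
e.g. digit+ denotes one or more digits.
The * operator:
r* is shorthand for ε | r+
If x is a regular expression, then:
• x* means zero or more occurrences of x.
i.e., it can generate { ε, x, xx, xxx, … }
• x+ means one or more occurrences of x.
i.e., it can generate { x, xx, xxx, … } or x.x*
• x? means at most one occurrence of x.
i.e., it can generate either { x } or { ε }
[a-z] is all lower-case letters of the English language.
[A-Z] is all upper-case letters of the English language.
[0-9] is all natural digits used in mathematics.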
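The three shorthands behave the same way in Python's re module, which makes them easy to try out. In this small sketch the helper name generated and the list of candidate strings are assumptions for illustration.

```python
import re

def generated(regex, candidates):
    """Return the candidate strings that the whole regex can generate."""
    return [s for s in candidates if re.fullmatch(regex, s)]

CANDIDATES = ["", "x", "xx", "xxx"]

star = generated(r"x*", CANDIDATES)       # zero or more occurrences of x
plus = generated(r"x+", CANDIDATES)       # one or more occurrences of x
optional = generated(r"x?", CANDIDATES)   # at most one occurrence of x

if __name__ == "__main__":
    print("x* :", star)
    print("x+ :", plus)
    print("x? :", optional)
```

The empty string stands for ε here: it is generated by x* and x? but not by x+.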
Notational shorthands
letter = [a-z] | [A-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = + | -
digits = (digit)+
letters = (letter)+
identifier = (letter)(letter | digit)*
decimal = (sign)?(digit)+
real number = ((digit)+ "." (digit)*) | ("." (digit)+)
keyword -> keyword
Relation operator -> < | > | <= | >= | = | <>
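The regular definitions above translate almost directly into Python regular expressions. This is a sketch under assumptions: the variable names and the helper matches are invented for this illustration, and re.fullmatch is used so that the whole lexeme has to fit the pattern.

```python
import re

# Building blocks, written in Python's regex syntax.
letter = r"[A-Za-z]"
digit = r"[0-9]"
sign = r"[+-]"

identifier = rf"{letter}(?:{letter}|{digit})*"          # (letter)(letter | digit)*
decimal = rf"{sign}?{digit}+"                           # (sign)?(digit)+
real_number = rf"(?:{digit}+\.{digit}*|\.{digit}+)"     # digits "." digits* | "." digits+
relation_operator = r"<=|>=|<>|<|>|="                   # longer operators listed first

def matches(pattern, lexeme):
    """True if the entire lexeme fits the pattern."""
    return re.fullmatch(pattern, lexeme) is not None

if __name__ == "__main__":
    for pattern, lexeme in [(identifier, "x1"), (identifier, "1x"),
                            (decimal, "-42"), (real_number, "3.14"),
                            (real_number, ".5"), (relation_operator, "<>")]:
        print(repr(lexeme), matches(pattern, lexeme))
```

Listing "<=" and "<>" before "<" matters once the pattern is used for scanning rather than full matching, because Python tries the alternatives from left to right.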
The only problem left with the lexical analyzer is how to verify
the validity of a regular expression used in specifying the
patterns of keywords of a language. A well-accepted solution is