1 - Scanning Slides Sanyal Part1
Amitabha Sanyal
(www.cse.iitb.ac.in/~as)
January 2016
Introduction
main ()
{
int i,sum;
sum = 0;
for (i=1; i<=10; i++)
sum = sum + i;
printf("%d\n",sum);
}
Introduction
[Figure: the program as the compiler first sees it — an undifferentiated stream of characters, including newlines:
...=1; i<=10; i++)↵sum = sum + i;↵printf("%d\n", sum);↵}]
Step 1: Group the character stream into lexemes:

main ( ) { int i , sum ; sum = 0 ; for (
i = 1 ; i <= 10 ; i ++ ) sum = sum + i
; printf ( "%d\n" , sum ) ; }
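Step 1 can be sketched as a small lexeme-splitting routine. This is an illustrative fragment, not the analyser the slides go on to generate; the function name next_lexeme and the limited set of lexeme shapes it handles (identifiers, integer constants, string literals, a few one- and two-character operators) are assumptions made for the sketch.

```c
#include <ctype.h>
#include <string.h>

/* Copy the next lexeme of *p into buf and advance *p past it.
   Returns 1 if a lexeme was found, 0 at end of input. */
static int next_lexeme(const char **p, char *buf, int bufsz) {
    const char *s = *p;
    while (*s == ' ' || *s == '\n' || *s == '\t') s++;    /* skip whitespace */
    if (*s == '\0') return 0;
    const char *start = s;
    if (isalpha((unsigned char)*s) || *s == '_') {        /* identifier or keyword */
        while (isalnum((unsigned char)*s) || *s == '_') s++;
    } else if (isdigit((unsigned char)*s)) {              /* integer constant */
        while (isdigit((unsigned char)*s)) s++;
    } else if (*s == '"') {                               /* string literal like "%d\n" */
        s++;
        while (*s && *s != '"') s++;
        if (*s == '"') s++;
    } else if ((s[0] == '<' || s[0] == '>' || s[0] == '=' || s[0] == '!')
               && s[1] == '=') {
        s += 2;                                           /* <=, >=, ==, != */
    } else if ((s[0] == '+' && s[1] == '+') || (s[0] == '-' && s[1] == '-')) {
        s += 2;                                           /* ++, -- */
    } else {
        s++;                                              /* single-character lexeme */
    }
    int n = (int)(s - start);
    if (n >= bufsz) n = bufsz - 1;
    memcpy(buf, start, n);
    buf[n] = '\0';
    *p = s;
    return 1;
}
```

Called repeatedly on the program text, this yields exactly the space-separated stream shown above.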
Discovering the structure of the program

[Figure: part of the parse tree for the program — a fundef node whose subtrees include the declaration (int ...) and leaves such as var → identifier, identifier → sum]
Distinguish between
• lexemes – smallest logical units (words) of a program.
Examples – i, sum, for, 10, ++, "%d\n", <=.
• tokens – sets of similar lexemes, i.e. lexemes which have a common
syntactic description.
Examples –
identifier = {i, sum, buffer, . . . }
int constant = {1, 10, . . . }
addop = {+, -}
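The grouping of lexemes into tokens can be sketched as an enumeration plus a classification function. The names (TOK_IDENTIFIER, classify, and so on) are illustrative, not from the slides:

```c
#include <ctype.h>
#include <string.h>

/* Each token is a set of similar lexemes. */
enum token {
    TOK_IDENTIFIER,     /* i, sum, buffer, ... */
    TOK_INT_CONSTANT,   /* 1, 10, ... */
    TOK_ADDOP,          /* +, - */
    TOK_MULOP,          /* *, / */
    TOK_KEYWORD_FOR,    /* each keyword is a token by itself */
    TOK_UNKNOWN
};

static enum token classify(const char *lexeme) {
    if (strcmp(lexeme, "for") == 0) return TOK_KEYWORD_FOR;
    if (strcmp(lexeme, "+") == 0 || strcmp(lexeme, "-") == 0) return TOK_ADDOP;
    if (strcmp(lexeme, "*") == 0 || strcmp(lexeme, "/") == 0) return TOK_MULOP;
    if (isdigit((unsigned char)lexeme[0])) {
        for (const char *s = lexeme; *s; s++)
            if (!isdigit((unsigned char)*s)) return TOK_UNKNOWN;
        return TOK_INT_CONSTANT;
    }
    if (isalpha((unsigned char)lexeme[0]) || lexeme[0] == '_') {
        for (const char *s = lexeme + 1; *s; s++)
            if (!isalnum((unsigned char)*s) && *s != '_') return TOK_UNKNOWN;
        return TOK_IDENTIFIER;
    }
    return TOK_UNKNOWN;
}
```

Note that + and * land in different tokens (TOK_ADDOP vs. TOK_MULOP) precisely because they play different roles during syntax analysis, as discussed next.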
Lexemes, Tokens and Patterns
What is the basis for grouping lexemes into tokens?
• Why can’t addop and mulop be combined? Why can’t + be a token
by itself?
Lexemes which play similar roles during syntax analysis are grouped into
a common token.
• Operators in addop and mulop have different roles – mulop has a
higher precedence than addop.
• Each keyword plays a different role – is therefore a token by itself.
• Each punctuation symbol and each delimiter is a token by itself.
• All comments are uniformly ignored. They are all grouped under the
same token.
• All identifiers are grouped in a common token.
Lexemes, Tokens and Patterns
Apart from the token itself, the lexical analyser also passes other
information regarding the token. These items of information are called
token attributes
EXAMPLE
lexeme <token, token value>
3 < const, 3>
A <identifier, A>
if <if, –>
= <assignop, –>
> <relop, >>
; <semicolon, –>
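One common way to represent a <token, token value> pair is a tagged union; this sketch mirrors the table above, with illustrative names and a tagged-union representation that is one possible choice, not the only one:

```c
#include <string.h>

enum token_kind { TK_CONST, TK_IDENTIFIER, TK_IF, TK_ASSIGNOP, TK_RELOP, TK_SEMICOLON };

struct token_info {
    enum token_kind kind;
    union {
        int value;          /* for TK_CONST: e.g. 3          */
        char name[32];      /* for TK_IDENTIFIER: e.g. "A"   */
        char op[4];         /* for TK_RELOP: which one, e.g. ">" */
    } attr;                 /* TK_IF, TK_ASSIGNOP, TK_SEMICOLON carry no attribute */
};

static struct token_info make_const(int v) {
    struct token_info t;
    t.kind = TK_CONST;
    t.attr.value = v;
    return t;
}

static struct token_info make_identifier(const char *name) {
    struct token_info t;
    t.kind = TK_IDENTIFIER;
    strncpy(t.attr.name, name, sizeof t.attr.name - 1);
    t.attr.name[sizeof t.attr.name - 1] = '\0';
    return t;
}
```

So the lexeme 3 becomes make_const(3) and the lexeme A becomes make_identifier("A"), matching the first two rows of the table.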
Lexemes, Tokens and Patterns
3. Delimiters: (, ), {, }, [, ], ;, . and ,
4. Operators: =, >, <, . . . , >=
5. Keywords: abstract, boolean . . . volatile, while.
Lexemes, Tokens and Patterns
How does one describe the lexemes that make up the token identifier?
A pattern is used to
• specify tokens precisely
• build a recognizer from such specifications
Basic concepts and issues
Where does a lexical analyser fit into the rest of the compiler?
• The front end of most compilers is parser driven.
• When the parser needs the next token, it invokes the Lexical
Analyser.
• Instead of analysing the entire input string, the lexical analyser sees
enough of the input string to return a single token.
• The actions of the lexical analyser and parser are interleaved.
[Figure: the parser calls the lexical analyser, which reads just enough of the input and returns the next token to the parser]
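The parser-driven interface described above can be sketched as follows. The names (get_next_token, parse_count, the token kinds) are assumptions for the sketch; the point is that the lexer consumes only enough input to return one token per call, and the parser calls it on demand:

```c
#include <ctype.h>

enum tok { T_ID, T_NUM, T_PUNCT, T_EOF };

struct lexer { const char *p; };   /* position in the input string */

/* Consume just enough input to return a single token. */
static enum tok get_next_token(struct lexer *lx) {
    while (*lx->p == ' ') lx->p++;
    if (*lx->p == '\0') return T_EOF;
    if (isalpha((unsigned char)*lx->p)) {
        while (isalnum((unsigned char)*lx->p)) lx->p++;
        return T_ID;
    }
    if (isdigit((unsigned char)*lx->p)) {
        while (isdigit((unsigned char)*lx->p)) lx->p++;
        return T_NUM;
    }
    lx->p++;
    return T_PUNCT;
}

/* A (trivial) stand-in for the parser: it drives the lexer, asking for
   one token at a time until the input is exhausted. */
static int parse_count(const char *src) {
    struct lexer lx = { src };
    int n = 0;
    while (get_next_token(&lx) != T_EOF)
        n++;
    return n;
}
```

The interleaving is visible in the structure: no token exists until the parser asks for it.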
Two approaches:
1. Hand code – This is only of historical interest now.
• Possibly more efficient.
2. Use a generator – To generate the lexical analyser from a formal
description.
• The generation process is faster.
• Less prone to errors.
Automatic Generation of Lexical Analysers
Lex can read this description and generate a lexical analyser for whole
numbers and identifiers. How?
• The generator puts together:
• A deterministic finite automaton (DFA) constructed from the token
specification.
• A code fragment called a driver routine which can traverse any DFA.
• Code for the action routines.
• These three things taken together constitute the generated lexical
analyser.
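Two of the three pieces can be sketched concretely: a DFA transition table and a generic driver routine that traverses it. The table below is hand-written for illustration; a generator like Lex would construct it from the regular expression [a-zA-Z][a-zA-Z0-9]* for identifiers. All names here are assumptions for the sketch:

```c
#include <ctype.h>

/* DFA for identifiers: a letter followed by letters or digits. */
enum { S_START, S_IN_ID, S_DEAD, NSTATES };
enum { C_LETTER, C_DIGIT, C_OTHER, NCLASSES };

static const int delta[NSTATES][NCLASSES] = {
    /* S_START */ { S_IN_ID, S_DEAD,  S_DEAD },
    /* S_IN_ID */ { S_IN_ID, S_IN_ID, S_DEAD },
    /* S_DEAD  */ { S_DEAD,  S_DEAD,  S_DEAD },
};
static const int accepting[NSTATES] = { 0, 1, 0 };

static int char_class(int c) {
    if (isalpha(c)) return C_LETTER;
    if (isdigit(c)) return C_DIGIT;
    return C_OTHER;
}

/* Driver routine: this same loop works for any (delta, accepting) pair,
   which is why one driver serves every generated analyser. */
static int dfa_accepts(const char *s) {
    int state = S_START;
    for (; *s; s++)
        state = delta[state][char_class((unsigned char)*s)];
    return accepting[state];
}
```

A real driver is slightly richer — it remembers the last accepting state so it can return the longest matching lexeme and resume after it — but the table-walking core is the same.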
Automatic Generation of Lexical Analysers
• How is the lexical analyser generated from the description?
[Figure: the specification has two parts — regular expressions, which are processed into a DFA, and action routines, which are copied into the generated analyser]
• Note that the driver routine is common for all generated lexical
analysers.