Compilers Lecture 2
Lexical analysis
Definition
Lexical analysis is a relatively simple phase in which
the symbols or tokens of the language are formed.
Examples of single symbols are:
Words such as: for, do, while.
User-defined identifiers such as: name, salary.
Character sequences such as: ++, ==
Role
It is the role of the lexical analyzer to read sequences of
characters and to replace them by language symbols
for input to the syntax and semantic analyzers.
In doing so, the lexical analyzer mimics the human
reader, who, when reading a program, expects to see a
sequence of symbols made up of these characters.
Role (cont.)
At the same time as forming symbols from characters,
the lexical analyzer should deal with multiple spaces
and remove comments and any other characters not
relevant to the later stages of analysis.
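As a minimal sketch of this clean-up step, the function below steps over
white space and comments before the next symbol is formed. The name
skip_blanks_and_comments and the choice of C-style /* ... */ comments are
assumptions made for illustration only.

```c
#include <ctype.h>

/* Hypothetical helper: returns a pointer to the first character of s
   that is neither white space nor part of a C-style comment. */
const char *skip_blanks_and_comments(const char *s)
{
    for (;;) {
        while (isspace((unsigned char)*s))      /* blanks, tabs, newlines */
            s++;
        if (s[0] == '/' && s[1] == '*') {       /* start of a comment */
            s += 2;
            while (*s != '\0' && !(s[0] == '*' && s[1] == '/'))
                s++;
            if (*s != '\0')
                s += 2;                         /* step over the closing delimiter */
        } else {
            return s;                           /* first relevant character */
        }
    }
}
```

The outer loop repeats because a comment may be followed by more white
space or by another comment. An unterminated comment simply consumes the
rest of the input in this sketch.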
The lexical analyzer is only concerned with forming
symbols; it is not concerned in any way with the order
in which symbols may appear.
Role (cont.)
If a program written in C starts with:
; number int return do == ++
then the lexical analyzer should simply pass this sequence
of symbols on to the syntax analyzer.
The lexical analyzer has no context to work with: when
processing a symbol, it has no knowledge of any of the
symbols that precede or will follow this symbol.
Basic ideas
Lexical analysis is the phase in which language
symbols are formed from sequences of characters.
For example, in C there are six types of symbols:
1. Keywords such as const, char, if, else.
2. Identifiers such as sum, main, printf.
3. Constants such as 28, 3.14, 017.
Basic ideas (cont.)
4. String literals such as "Katherine".
5. Operators such as +, -, ++, >>, ==, &&.
6. Punctuators such as {, ], ;
Each of these types of symbol is formed during
lexical analysis by the lexical analyzer or lexer.
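The six types of symbol could be recorded in the lexer roughly as follows.
This is only a sketch: the names token_kind, token and token_kind_name are
hypothetical and are not taken from any particular compiler.

```c
/* Hypothetical names: one enumerator per type of symbol listed above. */
enum token_kind {
    TOK_KEYWORD,        /* const, char, if, else */
    TOK_IDENTIFIER,     /* sum, main, printf     */
    TOK_CONSTANT,       /* 28, 3.14, 017         */
    TOK_STRING_LITERAL, /* "Katherine"           */
    TOK_OPERATOR,       /* +, -, ++, >>, ==, &&  */
    TOK_PUNCTUATOR      /* {, ], ;               */
};

/* A symbol as handed to the syntax analyzer: its type and its text. */
struct token {
    enum token_kind kind;
    const char *lexeme;
};

/* Printable name of a symbol type, e.g. for diagnostics. */
const char *token_kind_name(enum token_kind kind)
{
    static const char *const names[] = {
        "keyword", "identifier", "constant",
        "string literal", "operator", "punctuator"
    };
    return names[kind];
}
```

Keeping the text of the symbol next to its type lets later phases tell
one identifier or constant from another.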
Basic ideas (cont.)
The relatively simple nature of the symbols means
that they can be represented by regular expressions.
It is easy to produce the type of recognizers required
from the corresponding regular grammars.
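One common way to obtain such a recognizer is to turn the regular
expression into a finite automaton and to store its transitions in a
table. The sketch below does this by hand for the pattern
letter (letter|digit)*; the state names, the table layout and the function
is_identifier are assumptions chosen for illustration.

```c
#include <ctype.h>

/* States of the automaton for the pattern  letter (letter|digit)* */
enum { START, IN_IDENT, REJECT, NUM_STATES };
/* Character classes that the automaton distinguishes. */
enum { LETTER, DIGIT, OTHER_CHAR, NUM_CHAR_CLASSES };

/* Transition table: next_state[current state][character class]. */
static const int next_state[NUM_STATES][NUM_CHAR_CLASSES] = {
    /* START    */ { IN_IDENT, REJECT,   REJECT },
    /* IN_IDENT */ { IN_IDENT, IN_IDENT, REJECT },
    /* REJECT   */ { REJECT,   REJECT,   REJECT }
};

static int char_class(int c)
{
    if (isalpha(c)) return LETTER;
    if (isdigit(c)) return DIGIT;
    return OTHER_CHAR;
}

/* Returns 1 if the whole string s is an identifier, 0 otherwise. */
int is_identifier(const char *s)
{
    int state = START;
    for (; *s != '\0'; s++)
        state = next_state[state][char_class((unsigned char)*s)];
    return state == IN_IDENT;   /* IN_IDENT is the only accepting state */
}
```

With this table, is_identifier("salary") returns 1 and is_identifier("1x")
returns 0. Changing the pattern means changing only the table, not the
loop that drives it.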
Basic ideas (cont.)
In addition to recognizing the symbols of the language,
the lexical analyzer will usually:
Delete comments.
Keep track of line numbers.
Evaluate constants.
Some argue that the last of these is better left to the
machine-dependent back end of the compiler.
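Two of these extra duties, keeping track of line numbers and evaluating
constants, can be sketched as below. The struct scanner and both function
names are hypothetical, and only decimal integer constants are handled
(an octal constant such as 017 would need extra treatment).

```c
#include <ctype.h>

/* Hypothetical scanner state: current position and current line number. */
struct scanner {
    const char *pos;
    int line;
};

/* Skip white space, counting each newline so that later error
   messages can report the line on which a symbol appeared. */
void skip_blanks(struct scanner *sc)
{
    while (isspace((unsigned char)*sc->pos)) {
        if (*sc->pos == '\n')
            sc->line++;
        sc->pos++;
    }
}

/* Read a run of decimal digits and return its value, so that the
   syntax analyzer receives the number rather than its spelling.
   Overflow is not checked in this sketch. */
long scan_decimal_constant(struct scanner *sc)
{
    long value = 0;
    while (isdigit((unsigned char)*sc->pos)) {
        value = value * 10 + (*sc->pos - '0');
        sc->pos++;
    }
    return value;
}
```

The value is built up one digit at a time, which is why the evaluation
can be done during lexical analysis at almost no extra cost.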
Interactions between the lexical analyzer
and the parser
Token Class (or Class)
In English:
Noun, verb, adjective, etc.
In a programming language:
Identifier: a string of letters or digits, starting with a
letter
Integer: a non-empty string of digits
Keyword: “else” or “if” or “begin” or …
Whitespace: a non-empty sequence of blanks, newlines,
and tabs
Example 1
if (i == j)
Z = 0;
else
Z = 1;
Example 2
For the code fragment below, choose the correct
number of tokens in each class that appear in the code
fragment.
x = 0;\n\twhile (x < 10) {\n\tx++;\n}
Example 2 (cont.)
1. W = 9; K = 1; I = 3; N = 2; O = 9
2. W = 11; K = 4; I = 0; N = 2; O = 9
3. W = 9; K = 4; I = 0; N = 3; O = 9
4. W = 11; K = 1; I = 3; N = 3; O = 9
W: Whitespace
K: Keyword
I: Identifier
N: Number
O: Other Tokens: { } ( ) < ++ ; =
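The counts can be checked mechanically. The sketch below assumes that \n
and \t in the fragment stand for a newline and a tab, that a run of
adjacent white space forms a single Whitespace token, and that ++ is one
token. The keyword list and all names are hypothetical.

```c
#include <ctype.h>
#include <string.h>

/* Token classes used in Example 2: W, K, I, N, O. */
enum { WHITESPACE, KEYWORD, IDENTIFIER, NUMBER, OTHER_TOKEN, NUM_TOKEN_CLASSES };

/* Hypothetical, deliberately tiny keyword list. */
static int is_keyword(const char *s, size_t len)
{
    static const char *const keywords[] = { "if", "else", "while" };
    size_t i;
    for (i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strlen(keywords[i]) == len && strncmp(s, keywords[i], len) == 0)
            return 1;
    return 0;
}

/* Count the tokens of each class in src.
   counts must have room for NUM_TOKEN_CLASSES entries. */
void count_tokens(const char *src, int counts[])
{
    int c;
    for (c = 0; c < NUM_TOKEN_CLASSES; c++)
        counts[c] = 0;
    while (*src != '\0') {
        const char *start = src;
        if (isspace((unsigned char)*src)) {          /* run of white space */
            while (isspace((unsigned char)*src)) src++;
            counts[WHITESPACE]++;
        } else if (isalpha((unsigned char)*src)) {   /* letter (letter|digit)* */
            while (isalnum((unsigned char)*src)) src++;
            counts[is_keyword(start, (size_t)(src - start)) ? KEYWORD : IDENTIFIER]++;
        } else if (isdigit((unsigned char)*src)) {   /* non-empty string of digits */
            while (isdigit((unsigned char)*src)) src++;
            counts[NUMBER]++;
        } else {                                     /* any other symbol */
            if (src[0] == '+' && src[1] == '+')
                src += 2;                            /* ++ is a single token */
            else
                src++;
            counts[OTHER_TOKEN]++;
        }
    }
}
```

Under these assumptions, the fragment of Example 2 gives W = 9, K = 1,
I = 3, N = 2, O = 9, which is choice 1.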
Symbol Recognition
The lexical analyzer is only concerned with
recognizing language symbols in order to pass them on
to the syntax analyzer.
It is not concerned with the order in which symbols
appear.
It is concerned only with the symbols themselves.
Symbol Recognition (cont.)
Example
The lexical analyzer does not detect an error in the
following sequence of symbols:
64 const char typedef >> +
as each symbol is correct in itself.
The syntax analyzer is responsible for recognizing that these
symbols do not form the start of any program.
Symbol Recognition (cont.)
The lexical analyzer has no understanding of the
scope of variables.
It cannot distinguish between uses of an
identifier to represent two different variables within
two different functions.
For the lexical analyzer it is the same identifier each
time it occurs.
Symbol Recognition (cont.)
For the purpose of lexical analysis, regular expressions
are a convenient method of representing symbols such
as identifiers and constants.
For example, an identifier might be represented as:
letter (letter|digit)*
Symbol Recognition (cont.)
A real number might be represented by:
(+|-|ε) digit* . digit digit*
where ε denotes the empty string, so the sign is optional.
It is relatively simple to write code to recognize such
symbols.
For an identifier the code would be:
Code to recognize identifiers
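#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
/* Report that the input is not a valid symbol and stop. */
void error(void)
{ fprintf(stderr, "error\n");
  exit(EXIT_FAILURE);
}
int main(void)
{ int in;                    /* int rather than char, so that EOF fits */
  in = getchar();
  if (isalpha(in))           /* the first character must be a letter */
    in = getchar();
  else error();
  while (isalpha(in) || isdigit(in))   /* then any letters or digits */
    in = getchar();
  return 0;
}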
Code to recognize real numbers
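#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
/* Report that the input is not a valid symbol and stop. */
void error(void)
{ fprintf(stderr, "error\n");
  exit(EXIT_FAILURE);
}
int main(void)
{ int in;                    /* int rather than char, so that EOF fits */
  in = getchar();
  if (in == '+' || in == '-')          /* optional sign */
    in = getchar();
  while (isdigit(in))                  /* zero or more digits */
    in = getchar();
  if (in == '.')                       /* the decimal point must appear */
    in = getchar();
  else error();
  if (isdigit(in))                     /* so must one digit after it */
    in = getchar();
  else error();
  while (isdigit(in))                  /* zero or more further digits */
    in = getchar();
  printf("Ok\n");
  return 0;
}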
Code to recognize real numbers
There are three situations in this code:
1. Characters that may appear optionally (‘+’, ‘-’): no error if
they do not appear; if one does appear, just read the next
character.
2. Characters that must appear, such as the decimal point and
the digit after it: if they do not appear, call the error
function.
3. Characters that may appear zero or more times, such as
the digits before the decimal point or after the first digit
after the point: set up a while loop to check each
occurrence and read the next character; there is no need to
call the error function.
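The three situations can also be seen in a version that checks a string
instead of reading from standard input. This is a sketch only: the
function name is_real is hypothetical, and it assumes that nothing may
follow the number.

```c
#include <ctype.h>

/* Returns 1 if s matches  (+|-|empty) digit* . digit digit*  , else 0. */
int is_real(const char *s)
{
    /* Situation 1: optional character. Skip a sign if present;
       it is not an error if there is none. */
    if (*s == '+' || *s == '-')
        s++;
    /* Situation 3: zero or more digits before the point. */
    while (isdigit((unsigned char)*s))
        s++;
    /* Situation 2: mandatory characters, the point and one digit. */
    if (*s != '.')
        return 0;
    s++;
    if (!isdigit((unsigned char)*s))
        return 0;
    s++;
    /* Situation 3 again: zero or more further digits. */
    while (isdigit((unsigned char)*s))
        s++;
    return *s == '\0';   /* assumption: nothing may follow the number */
}
```

For example, is_real("+3.14") and is_real(".5") return 1, while
is_real("3.") and is_real("12") return 0 because a mandatory character
is missing.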
Questions?