Compiler Construction: Chapter # 2 - Lexical Analysis Instructor: Ms. Raazia Sosan
Compiler – Front End
Analysis Phase
• Syntax
– Proper form of program
• Semantics
– What a program means; what each program does when it executes
The Role of the Lexical Analyzer
The Role of the Lexical Analyzer
• The main task of the lexical analyzer is to
– read the input characters of the source program and group them into lexemes, and
– produce as output a sequence of tokens, one for each lexeme in the source program.
• The stream of tokens is sent to the parser for syntax analysis.
• It is common for the lexical analyzer to interact with the symbol table as well: when it discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.
• It may perform certain other tasks besides identification of lexemes. One such task is stripping out comments and whitespace.
• Another task is correlating error messages generated by the compiler with the source program.
Lexical Analyzer - Process
• Lexical analyzers are divided into a cascade of two processes:
– Scanning consists of the simple processes that do not require
tokenization of the input, such as deletion of comments and
compaction of consecutive whitespace characters into one.
– Lexical analysis proper is the more complex portion, where the
scanner produces the sequence of tokens as output.
Tokens
• A token is a pair consisting of a token name and an optional
attribute value.
• The token name is an abstract symbol representing a kind of
lexical unit, e.g., a particular keyword, or a sequence of input
characters denoting an identifier.
• The token names are the input symbols that the parser
processes.
• In what follows, we shall generally write the name of a token in
boldface. We will often refer to a token by its token name.
Patterns
• A pattern is a description of the form that the lexemes of a
token may take.
• In the case of a keyword as a token, the pattern is just the
sequence of characters that form the keyword. For identifiers
and some other tokens, the pattern is a more complex
structure that is matched by many strings.
Lexemes
• A lexeme is a sequence of characters in the source program
that matches the pattern for a token and is identified by the
lexical analyzer as an instance of that token.
Example 3.1
printf ( "Total = %d\n" , score ) ;
• both printf and score are lexemes matching the pattern for
token id, and "Total = %d\n" is a lexeme matching the pattern for token literal.
• In many programming languages, the following classes cover
most or all of the tokens:
Example 3.1 contd.
• One token for each keyword. The pattern for a keyword is the
same as the keyword itself.
• Tokens for the operators, either individually or in classes such
as the token comparison mentioned in Fig. 3.2.
• One token representing all identifiers.
• One or more tokens representing constants, such as numbers
and literal strings .
• Tokens for each punctuation symbol, such as left and right
parentheses, comma, and semicolon.
Example 3.2
• Write the token names and associated attribute values for the
Fortran statement
E = M * C ^ 2
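A worked sketch of the answer. Each token is written as a <token-name, attribute-value> pair, as the Dragon book does; the operator token names below are illustrative, and identifier attributes are pointers into the symbol table:

```text
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
```

Note that single-lexeme tokens such as assign_op need no attribute value, while id and number carry one.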
Input Buffering
• Because of the amount of time taken to process characters and
the large number of characters that must be processed during
the compilation of a large source program, specialized
buffering techniques have been developed to reduce the
amount of overhead required to process a single input
character.
Input Buffering - Buffer Pairs
• Each buffer is of the same size N, and N is usually the size of a
disk block, e.g., 4096 bytes.
• Using one system read command we can read N characters
into a buffer, rather than using one system call per character.
• If fewer than N characters remain in the input file, then a
special character, represented by eof, marks the end of the
source file and is different from any possible character of the
source program.
• Two pointers to the input are maintained:
– Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
– Pointer forward scans ahead until a pattern match is found.
Input Buffering - Buffer Pairs
Input Buffering - Buffer Pairs
• Once the next lexeme is determined, forward is set to the
character at its right end. Then, after the lexeme is recorded as
an attribute value of a token returned to the parser,
lexemeBegin is set to the character immediately after the
lexeme just found.
• In Fig. 3.3, we see forward has passed the end of the next
lexeme and must be retracted one position to its left.
• Advancing forward requires that we first test whether we have
reached the end of one of the buffers, and if so, we must
reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.
Input Buffering - Sentinels
• We must check, each time we advance forward, that we have not
moved off one of the buffers; if we do, then we must also reload
the other buffer.
• Thus, for each character read, we make two tests: one for the end
of the buffer, and one to determine what character is read (the
latter may be a multiway branch).
• We can combine the buffer-end test with the test for the current
character if we extend each buffer to hold a sentinel character at
the end.
• The sentinel is a special character that cannot be part of the source
program, and a natural choice is the character eof.
Input Buffering - Sentinels
Challenge Task
• Write a program for input buffer with the desired pointers.
Lexical Analyzer - Process
• Lexical Analyzer
– Specification of Tokens (regular expressions)
– Recognition of Tokens:
· Thompson Construction (RE -> NFA)
· Subset Construction (NFA -> DFA)
Specification of Tokens
• Regular expressions are an important notation for specifying
lexeme patterns.
Strings and Languages
• An alphabet is any finite set of symbols. Typical examples of
symbols are letters, digits, and punctuation. The set {0, 1} is
the binary alphabet. ASCII is an important example of an
alphabet; it is used in many software systems.
• A string over an alphabet is a finite sequence of symbols drawn
from that alphabet. In language theory, the terms "sentence"
and "word" are often used as synonyms for "string." The length
of a string s, usually written |s|, is the number of occurrences
of symbols in s. For example, banana is a string of length six.
The empty string, denoted ε, is the string of length zero.
Strings and Languages
Operations on Languages
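The slide's table is not reproduced here; the standard definitions of the operations on languages L and M are:

```latex
L \cup M = \{\, s \mid s \in L \text{ or } s \in M \,\} \qquad
L M = \{\, st \mid s \in L \text{ and } t \in M \,\} \qquad
L^{i} = \underbrace{L L \cdots L}_{i\ \text{times}}, \quad L^{0} = \{\epsilon\}
```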
Kleene star
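The Kleene star (closure) of a language L, written L*, is the set of strings obtained by concatenating zero or more strings of L:

```latex
L^{*} = \bigcup_{i = 0}^{\infty} L^{i}
```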
Kleene plus
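The Kleene plus (positive closure) is the same union starting at one, so ε belongs to L+ only if it already belongs to L:

```latex
L^{+} = \bigcup_{i = 1}^{\infty} L^{i} = L\,L^{*}
```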
Example of Kleene closure
• Example of Kleene star applied to set of strings:
– {"ab","c"}* = {ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", "ababc",
"abcab", "abcc", "cabab", "cabc", "ccab", "ccc", ...}.
• Example of Kleene star applied to set of characters:
– {"a", "b", "c"}* = { ε, "a", "b", "c", "aa", "ab", "ac", "ba", "bb", "bc", "ca",
"cb", "cc", "aaa", "aab", ...}.
• Example of Kleene star applied to the empty set:
– ∅* = {ε}.
• Example of Kleene plus applied to the empty set:
– ∅+ = ∅ ∅* = { } = ∅
Example 3.3
• Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z} and let D
be the set of digits {0, 1, . . . , 9}. We may think of L and D in
two, essentially equivalent, ways. One way is that L and D are,
respectively, the alphabets of uppercase and lowercase letters
and of digits. The second way is that L and D are languages, all
of whose strings happen to be of length one. Here are some
other languages that can be constructed from languages L and
D.
Example 3.3 - Solution
• L ∪ D is the set of letters and digits - strictly speaking the
language with 62 strings of length one, each of which strings is
either one letter or one digit.
• LD is the set of 520 strings of length two, each consisting of
one letter followed by one digit.
• L^4 is the set of all 4-letter strings.
• L* is the set of all strings of letters, including ε, the empty
string.
• L(L ∪ D)* is the set of all strings of letters and digits beginning
with a letter.
Regular Expressions
• A regular expression is a formula for representing a (complex)
language in terms of “elementary” languages combined using
the three operations union, concatenation and Kleene closure.
• Refer to the book:
Title: Compiler Construction: Principles and Practice
Author(s)/Editor(s): Kenneth C. Louden
Publisher: PWS Publishing Company
ISBN: 0-534-93972-4
Pages: 43-52 and 73-83
Thompson Construction
• Discussed in class
• Refer to the book:
Title: Compiler Construction: Principles and Practice
Author(s)/Editor(s): Kenneth C. Louden
Publisher: PWS Publishing Company
ISBN: 0-534-93972-4
Pages: 43-52 and 73-83
Subset Construction
• DIY
• Refer to the book:
Title: Compiler Construction: Principles and Practice
Author(s)/Editor(s): Kenneth C. Louden
Publisher: PWS Publishing Company
ISBN: 0-534-93972-4
Pages: 43-52 and 73-83
Introduction to FLEX
• FLEX (Fast LEXical analyzer generator) is a tool for generating
scanners. Instead of writing a scanner from scratch, you only
need to identify the vocabulary of a certain language (e.g.
Simple), write a specification of patterns using regular
expressions (e.g. DIGIT [0-9]), and FLEX will construct a scanner
for you.
Environment Setting with FLEX
Flex regular expressions
• s string s literally
• \c character c literally, where c would normally be a lex
operator
• [s] character class
• ^ indicates beginning of line
• [^s] characters not in character class
• [s-t] range of characters
• s? s occurs zero or one time
• . any character except newline
Flex regular expressions
• s+ one or more occurrences of s
• r|s r or s
• (s) grouping
• $ end of line
• s/r s iff followed by r (not recommended) (r is *NOT*
consumed)
• s{m,n} m through n occurrences of s
Examples of regular expressions in flex
• a* zero or more a’s
• .* zero or more of any character except newline
• .+ one or more characters
• [a-z] a lowercase letter
• [a-zA-Z] any alphabetic letter
• [^a-zA-Z] any non-alphabetic character
• a.b a followed by any character followed by b
• rs|tu rs or tu
Examples of regular expressions in flex
• a(b|c)d abd or acd
• ^start the literal characters start at the beginning of a line
• END$ the characters END followed by an end-of-line.
A flex Input File
• flex input files are structured as follows:
%{
Declarations
%}
Definitions
%%
Rules
%%
User subroutines
A flex Input File
• The optional Declarations and User subroutines sections are used
for ordinary C code that you want copied verbatim to the generated
C file. Declarations are copied to the top of the file, user
subroutines to the bottom. The optional Definitions section is
where you specify options for the scanner and can set up definitions
to give names to regular expressions as a simple substitution
mechanism that allows for more readable entries in the Rules section
that follows. The required Rules section is where you specify the
patterns that identify your tokens and the action to perform upon
recognizing them.
flex Global Variables
• The token-grabbing function yylex takes no arguments and returns
an integer. Often more information is needed about the token just
read than that one integer code. The usual way information about
the token is communicated back to the caller is by having the
scanner set the contents of a global variable which can be read by
the caller. After counseling you for years that globals are
absolute evil, we reluctantly sanction their limited use here,
because our tools require we use them.
flex Global Variables - yytext
• yytext is a null-terminated string containing the text of the
lexeme just recognized as a token. This global variable is declared
and managed in the lex.yy.c file. Do not modify its contents. The
buffer is overwritten with each subsequent token, so you must make
your own copy of a lexeme you need to store more permanently.
flex Global Variables - yyleng
• yyleng is an integer holding the length of the lexeme stored in
yytext. This global variable is declared and managed in the
lex.yy.c file. Do not modify its contents.
flex Global Variables - yylval
• yylval is the global variable used to store attributes about the
token, e.g. for an integer lexeme it might store the value, for a
string literal, the pointer to its characters, and so on. This
variable is declared to be of type YYSTYPE, and is usually a union
of all the various fields needed for different token types. If you
are using a parser generator (such as yacc or bison), it will define
this type for you; otherwise, you must provide the definition
yourself. Your scanner actions should appropriately set the
contents of the variable for each token.
flex Global Variables - yylloc
• yylloc is the global variable that is used to store the location
(line and column) of the token. This variable is declared to be of
type YYLTYPE. Again, the parser generator can provide this or it
may be your responsibility. Your scanner actions should
appropriately set the contents of the variable for each token.
Example 1 – ex1.lex

/* either indent or use %{ %} */
%{
int num_lines = 0;
int num_chars = 0;
int num_words = 0;
#define yywrap() 1
%}
%%
\n          {++num_lines; ++num_chars;}
[^ \t\n]+   {++num_words; num_chars += yyleng;}
.           ++num_chars;
%%
int main(int argc, char **argv)
{
    yylex();
    printf("# of lines = %d, # of chars = %d, # of words = %d\n",
           num_lines, num_chars, num_words);
    return 0;
}
Compiling .lex file
• On the command line, execute the following commands:
flex ex1.lex
• This will generate lex.yy.c file
gcc lex.yy.c –o ex1
• Then to execute your scanner execute
ex1.exe
Example 2

%{
#define yywrap() 1
%}
digits   [0-9]
ltr      [a-zA-Z]
alphanum [a-zA-Z0-9]
%%
(-|\+)*{digits}+   printf("found number: '%s'\n", yytext);
'.'                printf("found character: {%s}\n", yytext);
.                  { /* absorb others */ }
%%
int main(int argc, char **argv)
{
    yylex();
    return 0;
}
Example 3

%{
#define yywrap() 1
%}
%%
[0-9]+   printf("?");
%%
int main()
{
    yylex();
    return 0;
}
Reading Assignment
• 3.1.1 Lexical Analysis Versus Parsing
• 3.1.4 Lexical Errors
• 3.3.5 Extensions of Regular Expressions
Assignment # 1
• Due Date: 6th July, 2017
THANK YOU