Unit 1 CD
CD
(Compiler Design)
What is a Compiler?
A compiler is software that converts a
program written in a high-level language (the
source language) into a low-level language
(the object, target, or machine language).
Lex is a computer program that generates lexical analyzers. Lex is commonly used
with the yacc parser generator.
Creating a lexical analyzer
Lex Specification
A Lex program consists of three parts:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
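Definitions include declarations of variables, constants, and regular
definitions.
Rules are statements of the form pattern { action }: each pattern is a regular
expression, and its action is the code to run when the pattern matches.
User subroutines are auxiliary procedures needed by the actions. These can be
compiled separately and loaded with the lexical analyzer.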
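To make the three-part layout concrete, here is a minimal sketch that mirrors it in Python rather than in Lex itself. The names DIGIT, LETTER, RULES and tokenize are hypothetical, chosen only for illustration; the matching policy (longest match, earliest rule on a tie) follows what Lex does.

```python
import re

# --- definitions: named regular definitions (hypothetical names) ---
DIGIT = r"[0-9]"
LETTER = r"[a-zA-Z]"

# --- rules: (pattern, action) pairs, listed in priority order ---
RULES = [
    (re.compile(rf"{DIGIT}+"), lambda text: ("NUMBER", text)),
    (re.compile(rf"{LETTER}({LETTER}|{DIGIT})*"), lambda text: ("ID", text)),
    (re.compile(r"[ \t\n]+"), lambda text: None),  # whitespace: no token
]

# --- user subroutines: helper code that drives the rules ---
def tokenize(source):
    """Split source into tokens using the longest-match policy."""
    tokens = []
    pos = 0
    while pos < len(source):
        best = None
        for pattern, action in RULES:
            m = pattern.match(source, pos)
            # Keep the longest match; on a tie the earlier rule stays.
            if m and (best is None or m.end() > best[0].end()):
                best = (m, action)
        if best is None:
            raise ValueError(f"unexpected character {source[pos]!r} at {pos}")
        m, action = best
        token = action(m.group())
        if token is not None:
            tokens.append(token)
        pos = m.end()
    return tokens

print(tokenize("x1 42 y"))
```

In a real Lex specification the same three pieces would sit in the definitions, rules, and user subroutines sections, separated by %%.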
Flex regular expressions
In addition to the usual regular expressions, Flex introduces some new notations.
[abcd]
This matches any single one of the characters a, b, c or d.
[0-9]
In brackets, a dash indicates a range of characters. For example, [a-zA-Z] matches any
single letter. If you want a dash as one of the characters, put it first.
[^abcd]
This indicates any character except a, b, c or d. For example, [^a-zA-Z] matches any
nonletter.
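The bracket notations above can be tried out quickly. The sketch below uses Python's re module instead of Flex (an assumption made for convenience; these simple bracket expressions behave the same way in both), and the helper name matches is hypothetical.

```python
import re

def matches(pattern, ch):
    """Return True if the single character ch matches the bracket expression."""
    return re.fullmatch(pattern, ch) is not None

print(matches(r"[abcd]", "c"))      # one of the listed characters
print(matches(r"[0-9]", "7"))       # a digit falls inside the range
print(matches(r"[a-zA-Z]", "7"))    # a digit is not a letter
print(matches(r"[-+0-9]", "-"))     # a dash placed first is a literal dash
print(matches(r"[^a-zA-Z]", "7"))   # negated class: any nonletter
```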
The lexical analyzer reads its input characters from secondary storage, but
fetching them one at a time from secondary storage is costly. Hence a
buffering technique is used: a block of data is first read into a buffer
and then scanned by the lexical analyzer.
There are two methods used in this context:
1. One Buffer Scheme
2. Two Buffer Scheme
One Buffer Scheme:
In this scheme, only one buffer is used to store the input string. The
problem with this scheme is that a very long lexeme may cross the buffer
boundary. To scan the rest of the lexeme the buffer has to be refilled,
which overwrites the first part of the lexeme.
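The overwriting problem can be seen in a small simulation. This is a sketch under simplifying assumptions: the input is an in-memory string standing in for secondary storage, a blank ends a lexeme, and the function name first_lexeme_one_buffer is hypothetical. It returns only the part of the first lexeme that is still in the buffer when the lexeme ends.

```python
def first_lexeme_one_buffer(source, buf_size):
    """Scan the first blank-terminated lexeme using a single buffer.

    Returns the characters of the lexeme that are still present in the
    buffer when its end is found, to show what a refill destroys.
    """
    offset = 0
    buffer = source[:buf_size]              # initial fill
    fp = 0                                  # forward pointer into the buffer
    while True:
        if fp == len(buffer):               # reached the buffer boundary
            offset += len(buffer)
            refill = source[offset:offset + buf_size]
            if not refill:                  # end of input: lexeme ends here
                return buffer
            buffer = refill                 # refill overwrites the old contents
            fp = 0
            continue
        if buffer[fp] == " ":               # a blank marks the end of the lexeme
            return buffer[:fp]
        fp += 1

# The lexeme "identifier" is 10 characters long.
print(first_lexeme_one_buffer("identifier = 1", 8))    # buffer too small
print(first_lexeme_one_buffer("identifier = 1", 16))   # buffer large enough
```

With an 8-character buffer only the tail of the lexeme survives the refill; with a 16-character buffer the whole lexeme is still available.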
Two Buffer Scheme:
To overcome the problem of the one buffer scheme, this method uses two
buffers to store the input string. The first buffer and the second buffer
are scanned alternately. When the end of the current buffer is reached, the
other buffer is filled.
Initially both bp (the begin pointer) and fp (the forward pointer) point to
the first character of the first buffer. Then fp moves to the right in
search of the end of the lexeme. As soon as a blank character is
recognized, the string between bp and fp is identified as the
corresponding token. To identify the boundary of the first buffer, an
end-of-buffer character (eof) is placed at the end of the first buffer.
Similarly, the end of the second buffer is recognized by the end-of-buffer
mark placed at its end. When fp encounters the first eof, the end of the
first buffer has been reached, and filling of the second buffer is
started. In the same way, when the second eof is encountered, it indicates
the end of the second buffer, and the first buffer is refilled. The two
buffers are filled alternately in this way until the end of the input
program is reached and the stream of tokens has been identified.
The eof character placed at the end of each buffer is called a sentinel;
it is used to identify the end of the buffer.
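The two buffer scheme with sentinels can be sketched as follows. This is an illustrative model, not a definitive implementation: the class name TwoBufferScanner and its methods are hypothetical, the input is an in-memory string standing in for secondary storage, "\0" is assumed as the sentinel (so it must not occur in the source text), a blank ends a lexeme, and no lexeme may be longer than one buffer.

```python
EOF = "\0"  # sentinel character (assumed never to occur in the source text)

class TwoBufferScanner:
    def __init__(self, source, size):
        self.source = source
        self.size = size
        self.offset = 0                        # next unread position in source
        # Two halves of `size` slots, each followed by one sentinel slot.
        self.buf = [EOF] * (2 * (size + 1))
        self._fill(0)                          # load the first buffer
        self.bp = 0                            # begin pointer (start of lexeme)
        self.fp = 0                            # forward pointer

    def _fill(self, start):
        """Load the next block into the half at `start`; pad with sentinels."""
        chunk = self.source[self.offset:self.offset + self.size]
        self.offset += len(chunk)
        for i in range(self.size + 1):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def _current(self):
        """Return the character under fp, or None at the real end of input."""
        if self.buf[self.fp] == EOF:
            if self.fp == self.size:               # sentinel ending first buffer
                self._fill(self.size + 1)
                self.fp = self.size + 1
            elif self.fp == 2 * self.size + 1:     # sentinel ending second buffer
                self._fill(0)
                self.fp = 0
            else:
                return None                        # sentinel inside a buffer
            if self.buf[self.fp] == EOF:
                return None                        # nothing left to load
        return self.buf[self.fp]

    def _lexeme(self):
        """Collect the characters from bp up to fp, skipping sentinel slots."""
        chars = []
        i = self.bp
        while i != self.fp:
            if self.buf[i] != EOF:
                chars.append(self.buf[i])
            i = (i + 1) % len(self.buf)
        return "".join(chars)

    def tokens(self):
        result = []
        while True:
            ch = self._current()
            if ch is None or ch == " ":        # a blank or end of input ends a lexeme
                lexeme = self._lexeme()
                if lexeme:
                    result.append(lexeme)
                if ch is None:
                    return result
                self.fp += 1                   # step past the blank
                self.bp = self.fp
            else:
                self.fp += 1

print(TwoBufferScanner("int count = 42", 8).tokens())
```

Here the lexeme "count" starts in the first buffer and ends in the second, yet it is recovered whole because filling the second buffer leaves the first one untouched.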