
UNIT 2

LEXICAL ANALYSIS
The Role of The Lexical Analyzer

The lexical analyzer is the first phase of a compiler.

The lexical analyzer reads the input source program from left to
right, one character at a time, and generates a sequence of tokens.
Each token is a single logically cohesive unit such as an identifier,
keyword, operator, or punctuation mark.
The parser can then use these tokens to determine the syntax of the
source program.
The role of the lexical analyzer in the process of compilation is shown
below:
 Source     +----------+     token      +--------+
 program -->| Lexical  |--------------->| Parser |
            | analyzer |<---------------|        |
            +----------+ get next token +--------+
                 |                           |
                 +------>  Symbol  <---------+
                            table
How does the lexical analyzer work?
The syntax checker (parser) in a compiler serves as the master program.
It first sends a request to the lexical analyzer for a valid token.

The scanner then does its pattern matching to create a valid token, if
possible, and sends it back to the syntax checker.

As soon as it sends back the token, the scanner is suspended.

The syntax checker performs its grammar check over the token and then
asks for the next token.

This process repeats until the entire input string is consumed, as the
sketch below illustrates.
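
A minimal C sketch of this request loop follows; the type names and the
function get_next_token are our own illustrative assumptions, not a
fixed interface.

#include <stdio.h>

typedef enum { TOK_ID, TOK_NUM, TOK_OP, TOK_EOF } TokenType;

typedef struct {
    TokenType type;
    char lexeme[64];            /* the matched input string */
} Token;

Token get_next_token(void);     /* the scanner, implemented elsewhere */

void parse(void)
{
    Token t = get_next_token();   /* parser requests the first token */
    while (t.type != TOK_EOF) {
        /* grammar check over t goes here */
        t = get_next_token();     /* scanner resumes for the next token */
    }
}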
Apart from token identification, the lexical analyzer also performs the
following functions.

Functions of lexical analyzer

• It produces a stream of tokens.

• It eliminates blanks and comments.

• It generates the symbol table, which stores information about the
identifiers and constants encountered in the input.

• It keeps track of line numbers.

• It reports the errors encountered while generating the tokens.
Why separate lexical analysis and parsing?
There are several reasons for separating the analysis phase of
compiling into lexical analysis and parsing.
1. Simpler design
This is the most important consideration. The separation of lexical
analysis from syntax analysis often allows us to simplify one or the
other of these phases.
Eg: a parser that had to deal with whitespace and comments directly
would be much harder to design.

2. Compiler efficiency is improved:
A separate lexical analyzer can be optimized, because a large amount of
compilation time is spent reading the source program and partitioning
it into tokens.

3. Compiler portability is enhanced:
Input-alphabet peculiarities and other device-specific anomalies can
be restricted to the lexical analyzer.
Eg: special and non-standard symbols can be isolated there.
Terminology used in lexical analysis
1 Token: A set of input strings that are related through a similar
pattern.

For example, any word that starts with a letter and may contain any
number of letters or digits after it is called an identifier.
Identifier is a token; raghu and raghu123 are instances of it.

2 Lexeme: The actual input string which represents a token. Here,
identifier is the token, while raghu and raghu123 are lexemes.

3 Pattern: The rule which a lexical analyzer follows to create a
token.
Example:
X = X * (acc + 123)

Token               Lexeme
Identifier          X
Operator equal      =
Identifier          X
Operator multiply   *
Left parenthesis    (
Identifier          acc
Operator plus       +
Integer constant    123
Right parenthesis   )
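
As an illustration of a pattern, here is a minimal C sketch of the
identifier rule described above (a letter followed by any number of
letters or digits); the function name is our own.

#include <ctype.h>
#include <stdio.h>

/* Pattern for the identifier token: a letter followed by any number
 * of letters or digits. */
int is_identifier(const char *lexeme)
{
    if (!isalpha((unsigned char)lexeme[0]))
        return 0;                          /* must start with a letter */
    for (const char *p = lexeme + 1; *p; p++)
        if (!isalnum((unsigned char)*p))
            return 0;                      /* only letters or digits after */
    return 1;
}

int main(void)
{
    printf("%d\n", is_identifier("raghu123"));  /* 1: lexeme matches the pattern */
    printf("%d\n", is_identifier("123raghu"));  /* 0: starts with a digit */
    return 0;
}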
Examples of Tokens


• In most programming languages, the following constructs are treated
as tokens: keywords, identifiers, constants, literal strings,
operators, and punctuation symbols.
Attributes for Tokens
• When more than one lexeme matches a pattern, the lexical analyzer
must provide additional information about the particular lexeme that
matched to the subsequent phases of the compiler.

• The lexical analyzer collects such information about tokens into
their associated attributes.
Attributes for Tokens (Cont.)
• E = M * C ** 2
– <id, pointer to symbol table entry for E>
– <assign-op>
– <id, pointer to symbol table entry for M>
– <mult-op>
– <id, pointer to symbol table entry for C>
– <exp-op>
– <number, integer value 2>
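
The pairing of a token with its attribute can be sketched as a tagged
union in C; the names below are our own illustrative choices.

/* A token carries an attribute whose form depends on the token type:
 * identifiers point at their symbol-table entry, numbers carry their
 * value, and operators need no attribute at all. */
typedef enum { ID, ASSIGN_OP, MULT_OP, EXP_OP, NUMBER } TokenType;

struct symtab_entry;                 /* symbol-table entry, defined elsewhere */

typedef struct {
    TokenType type;
    union {
        struct symtab_entry *entry;  /* attribute of an ID token    */
        int value;                   /* attribute of a NUMBER token */
    } attr;                          /* unused for operator tokens  */
} Token;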
Lexical Errors
• Few errors are detectable at the lexical level alone, because a
lexical analyzer has a very localized view of the source program.
• For example, suppose the string fi is encountered in a C program
for the first time in the context
– fi ( a == f(x)) …..
– The lexical analyzer cannot tell whether fi is a misspelling of the
keyword if or an undeclared function identifier.
– Since fi is a valid identifier, the lexical analyzer must return
the token for an identifier and let a later phase handle any error,
as the sketch below illustrates.
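
A minimal C sketch of why fi comes back as an identifier (the keyword
table contents are assumed for illustration): the scanner matches the
identifier pattern first and only then consults the keyword table;
fi is not in it, so it is returned as a plain identifier.

#include <string.h>

static const char *keywords[] = { "if", "else", "while", "for", "return" };

/* Return 1 if the lexeme is a reserved keyword, 0 otherwise. */
int is_keyword(const char *lexeme)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(lexeme, keywords[i]) == 0)
            return 1;
    return 0;   /* "fi" falls through here and is treated as an identifier */
}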
Lexical Errors (Cont.)
• Panic mode: successive characters are ignored until a well-formed
token can begin again (sketched after this list).

• Other possible error-recovery actions are:
– Deleting an extraneous character
– Inserting a missing character
– Replacing an incorrect character by a correct character
– Transposing two adjacent characters
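
A minimal C sketch of panic-mode recovery; the set of characters that
may start a token is an assumption for illustration, since a real
scanner would derive it from its token patterns.

#include <string.h>

/* Skip characters after an error until one that can begin a
 * well-formed token is found; scanning resumes from there. */
const char *panic_mode_skip(const char *p)
{
    static const char starters[] =
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789(+*=";              /* assumed token-start characters */

    while (*p != '\0' && strchr(starters, *p) == NULL)
        p++;                           /* ignore successive characters */
    return p;
}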
Input Buffering

This section examines ways of speeding up the reading of the source
program, including a two-buffer scheme that handles large lookahead
safely.
The lexical analyzer scans the input string from left to right, one
character at a time. It uses two pointers, the begin pointer (bptr)
and the forward pointer (fptr), to keep track of the portion of the
input scanned.

Initially both pointers point to the first character of the input
string. The bptr remains at the beginning of the lexeme being read,
while the fptr moves ahead in search of the end of the lexeme. As soon
as a blank space is encountered, it indicates the end of the lexeme;
fptr then sits on the white space, which it ignores as it moves ahead.
Both bptr and fptr are then set to the start of the next token.

The input characters are read from secondary storage, but reading one
character at a time from secondary storage is costly. Hence a buffering
technique is used: a block of data is first read into a buffer and then
scanned by the lexical analyzer.

There are two methods used in this context: the one-buffer scheme and
the two-buffer scheme.
One buffer scheme

In this scheme, only one buffer is used to store the input string.

The problem with this scheme is that if a lexeme is very long, it
crosses the buffer boundary; to scan the rest of the lexeme the buffer
has to be refilled, which overwrites the first part of the lexeme.
Two Buffer scheme
To overcome the problem of the one-buffer scheme, this method uses two
buffers to store the input string.

The two buffers are scanned alternately: when the end of the current
buffer is reached, the other buffer is filled.

The only remaining problem is that if the length of a lexeme exceeds
the length of a buffer, the input cannot be scanned completely.
Initially both bptr and fptr point to the first character of the first
buffer. The fptr then moves right in search of the end of the lexeme;
as soon as a blank character is recognized, the string between bptr
and fptr is identified as the corresponding token.

To identify the boundary of the first buffer, an end-of-buffer (eof)
character is placed at the end of the first buffer. Similarly, the end
of the second buffer is recognized by the end-of-buffer mark present
at its end.
When fptr encounters the first eof, the end of the first buffer is
recognized, and filling of the second buffer begins. In the same way,
the second eof indicates the end of the second buffer.

The two buffers are refilled alternately in this way until the end of
the input program is reached and the stream of tokens is identified.

The eof character introduced at the end of each buffer is called a
sentinel; it is used to identify the end of the buffer.
if (fptr == eof(buff1))          /* reached end of first buffer */
{
    /* refill buff2 */
    fptr = start of buff2;       /* resume scanning in second buffer */
}
else if (fptr == eof(buff2))     /* reached end of second buffer */
{
    /* refill buff1 */
    fptr = start of buff1;       /* resume scanning in first buffer */
}
else if (fptr == eof(input))     /* eof within a buffer: true end of input */
    return;                      /* terminate scanning */
else
    fptr++;                      /* ordinary character: keep scanning */
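
The same scheme can be fleshed out into a small runnable C sketch; the
buffer size N, the choice of '\0' as the sentinel, and all names here
are our own assumptions for illustration (in particular, an input that
itself contains '\0' would defeat this sentinel choice).

#include <stdio.h>

#define N 4096                      /* size of each buffer half (assumed) */
#define SENTINEL '\0'               /* stands in for the eof mark */

static char buf[2 * N + 2];         /* two halves plus one sentinel slot each */
static char *fptr;                  /* forward pointer */
static FILE *src;                   /* source program */

/* Fill one half with up to N characters and plant the sentinel; a short
 * read leaves the sentinel inside the half, marking true end of input. */
static void fill(char *start)
{
    size_t n = fread(start, 1, N, src);
    start[n] = SENTINEL;
}

/* Advance fptr, refilling the other half whenever a boundary sentinel
 * is reached; returns the next character, or EOF at true end of input. */
static int next_char(void)
{
    for (;;) {
        if (*fptr != SENTINEL)
            return (unsigned char)*fptr++;
        if (fptr == buf + N) {                /* sentinel of first half */
            fill(buf + N + 1);                /* refill second half */
            fptr = buf + N + 1;
        } else if (fptr == buf + 2 * N + 1) { /* sentinel of second half */
            fill(buf);                        /* refill first half */
            fptr = buf;
        } else {
            return EOF;                       /* sentinel inside a half */
        }
    }
}

int main(int argc, char **argv)
{
    src = (argc > 1) ? fopen(argv[1], "r") : stdin;
    if (src == NULL)
        return 1;
    fill(buf);                      /* load the first half */
    fptr = buf;

    long count = 0;
    int c;
    while ((c = next_char()) != EOF)
        count++;                    /* a real scanner would match patterns here */
    printf("%ld characters scanned\n", count);
    return 0;
}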
