
Chapter-2: Lexical Analysis

2.1 Functions of Lexical Analysis
2.2 Role of the Lexical Analyzer
2.3 Input Buffering
2.4 Specification of Tokens
2.5 Recognition of Tokens

Lexical Analysis
Lexical means "anything related to words".
Terminology used in lexical analysis:
1. Token: A set of input strings that are related through a similar pattern.
2. Lexeme: The actual input string that represents an instance of the token.
3. Pattern: The rule that a lexical analyzer follows to create a token.

Functions of Lexical Analysis in Compiler Design

1. Process the input characters that constitute a high-level program into a valid set of tokens.
2. Skip comments and whitespace while creating these tokens.
3. If any erroneous input is provided by the user in the program, correlate that error with the source file and line number.
Ex: For the statement X = X * (Acc + 123), the lexical analyzer produces:

Token               Lexeme
Identifier          X
Operator Eq         =
Identifier          X
Operator Mul        *
Left Parenthesis    (
Identifier          Acc
Operator Plus       +
Integer Constant    123
Right Parenthesis   )
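The tokenization above can be sketched with a small regular-expression-based scanner. This is an illustrative sketch, not a production lexer; the token names simply mirror the table above.

```python
import re

# Token patterns, tried in order; the names mirror the table above.
TOKEN_SPEC = [
    ("INTEGER_CONSTANT", r"\d+"),
    ("IDENTIFIER",       r"[A-Za-z][A-Za-z0-9]*"),
    ("OPERATOR_EQ",      r"="),
    ("OPERATOR_MUL",     r"\*"),
    ("OPERATOR_PLUS",    r"\+"),
    ("LEFT_PAREN",       r"\("),
    ("RIGHT_PAREN",      r"\)"),
    ("WHITESPACE",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Yield (token, lexeme) pairs; whitespace is matched but skipped."""
    for m in MASTER.finditer(source):
        if m.lastgroup != "WHITESPACE":
            yield (m.lastgroup, m.group())

print(list(tokenize("X = X * (Acc + 123)")))
```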

Role of a Lexical Analyzer (Scanner or Lexer)

 The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program.
 The stream of tokens is sent to the parser for syntax analysis.
 The lexical analyzer also interacts with the symbol table; e.g., when the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.
 The interactions are suggested in Figure 2.1.
Figure 2.1: Interactions between the lexical analyzer and the parser

 The following are additional tasks performed by the lexical analyzer other than identifying lexemes:
 Stripping out comments and whitespace (blank, newline, and tab)
 Correlating error messages generated by the compiler with the source program by keeping track of line
numbers (using newline characters)

Input Buffering
- Reading character by character from secondary storage is a slow and time-consuming process.
- It is often necessary to look ahead several characters beyond the lexeme before a match can be announced.
- One technique is to read characters from the source program and, if the pattern is not matched, push the lookahead characters back to the source program.
- This technique is time consuming.
- Buffering techniques eliminate this problem and increase efficiency.
Single Buffer
- A buffer of n characters is defined in memory, where n is usually the size of a disk block (e.g., 1024 characters).
- Two pointers are used:
  o Beginning pointer (BP)
  o Forward pointer (FP)
- BP points to the start of the lexeme while FP scans the input buffer for the end of the lexeme.
- When the end of the lexeme is found, the lexeme is processed, i.e., matched with a pattern and converted into a token, while FP remains still.
- Once the lexeme is processed, both pointers point to the next character/lexeme.
- If the character is whitespace, it is also matched but no token is generated; both pointers simply move ahead to detect the next lexeme.
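The two-pointer scan described above can be sketched as follows. This is an illustrative sketch over an in-memory string, with buffer refilling omitted; lexemes are simplified to runs of non-space characters.

```python
def scan_lexemes(buffer):
    """Scan `buffer` with a beginning pointer (bp) and a forward pointer (fp),
    yielding one lexeme per run of non-space characters. Whitespace is matched
    but produces no token, as described above."""
    bp = fp = 0
    n = len(buffer)
    while bp < n:
        if buffer[bp].isspace():       # whitespace: both pointers move ahead
            bp += 1
            fp = bp
            continue
        while fp < n and not buffer[fp].isspace():
            fp += 1                    # fp scans forward to the lexeme end
        yield buffer[bp:fp]            # lexeme found between bp and fp
        bp = fp                        # both pointers move to the next lexeme

print(list(scan_lexemes("int x = 42 ;")))
```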

Disadvantage of the Single Buffer technique

- If the file size is greater than the buffer size, then the content of the lexeme currently under process will be overwritten by new data when we reload the buffer.

Buffer Pair technique

- In this technique the buffer is divided into two halves, each n characters long, contiguous to one another.
- One half is loaded at a time.
- The same two pointers are used, i.e., BP and FP.
- When FP reaches the end of one half, the other half is loaded and FP points to the beginning of that half.
- Lexeme processing remains the same.
- In this way the single-buffer problem is eliminated.

Disadvantage of the Buffer Pair technique

- Two checks, one for the end of each buffer half, must be performed each time FP is advanced.

Pseudocode for the advancement of FP

If FP at the end of first half then begin
    Reload second half
    FP = FP + 1
End
Else if FP at the end of second half then begin
    Reload first half
    Move FP to the beginning of first half
End
Else
    FP = FP + 1

Sentinel
- While using the buffer pair technique, each time FP is moved we have to check that it doesn't cross the end of its buffer half, and when it reaches the end of a half, the other half needs to be loaded.
- We can reduce these checks by placing a sentinel character at the end of each buffer half.
- The sentinel can be any character that is not part of the source program. EOF is usually preferred, as it also indicates the end of the source program.
- Through this sentinel the end of a buffer half can be detected.
- FP is only checked against this sentinel; when the sentinel is encountered, the appropriate action is taken to load the next half. If EOF is used, then encountering EOF elsewhere in the buffer means the end of the source program.

Algorithm for the advancement of FP while using a sentinel

FP = FP + 1
If character at FP = eof then begin
    If FP at the end of first half then begin
        Reload second half
        FP = FP + 1
    End
    Else if FP at the end of second half then begin
        Reload first half
        Move FP to the beginning of first half
    End
    Else
        /* eof within the buffer signifies end of input */
        Terminate lexical analysis
End
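A runnable sketch of the buffer-pair scheme with sentinels is given below. The class layout and helper names are illustrative assumptions, not part of the original algorithm; the character "\0" stands in for the eof sentinel, and each advance of FP performs only the single sentinel check described above.

```python
EOF = "\0"   # sentinel character (stands in for eof; assumed absent from the source)

class BufferPair:
    """Two n-character halves with a sentinel slot after each half.
    Layout: [half 1][EOF][half 2][EOF]  ->  2n + 2 slots in total."""

    def __init__(self, text, n=4):
        self.text, self.pos, self.n = text, 0, n
        self.buf = [EOF] * (2 * n + 2)
        self._reload(0)              # load the first half
        self.fp = -1                 # so the first advance lands on buf[0]

    def _reload(self, half):
        """Refill one half (0 or 1) from the source text; pad with EOF."""
        start = 0 if half == 0 else self.n + 1
        chunk = self.text[self.pos:self.pos + self.n]
        self.pos += len(chunk)
        for i in range(self.n):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def next_char(self):
        """Advance FP; only one sentinel check is needed per advance."""
        self.fp += 1
        c = self.buf[self.fp]
        if c == EOF:
            if self.fp == self.n:                # sentinel at end of first half
                self._reload(1)
                self.fp += 1
                c = self.buf[self.fp]
            elif self.fp == 2 * self.n + 1:      # sentinel at end of second half
                self._reload(0)
                self.fp = 0
                c = self.buf[self.fp]
            # otherwise: eof within a half signifies the real end of input
        return c

src = "x := y + 12;"
bp = BufferPair(src, n=4)
out = []
while (ch := bp.next_char()) != EOF:
    out.append(ch)
print("".join(out))
```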

Specification of Tokens
Regular expressions are used to specify token patterns.
Notational shorthands:
+ means one or more
* means zero or more
? means zero or one
Character classes: [a-z], [abc], [0-9]

Ex: A Pascal identifier is a letter followed by zero or more letters or digits:
id → letter (letter/digit)*
letter → A/B/…/Z/a/b/…/z
digit → 0/1/…/9
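The identifier pattern can be checked with Python's re module; a small sketch (re uses | for alternation where the notes use /):

```python
import re

# Pascal identifier: a letter followed by zero or more letters or digits.
ID = re.compile(r"[A-Za-z][A-Za-z0-9]*\Z")

for s in ["Acc", "x2", "2x", ""]:
    print(s, bool(ID.match(s)))   # Acc and x2 match; 2x and "" do not
```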

Each regular expression r denotes a language L(r).

Let Σ = {a, b}. Then:
Regular expression a/b denotes the set {a, b}
Regular expression (a/b)(a/b) denotes the set {aa, ab, ba, bb}
Regular expression a* denotes the set {ε, a, aa, aaa, aaaa, …}
Regular expression ab+ denotes the set {ab, abb, abbb, …}
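These denotations can be spot-checked with Python's re module (again with | in place of /); re.fullmatch requires the whole string to match:

```python
import re

# Spot-check membership in the languages denoted above.
print(re.fullmatch(r"a|b", "a") is not None)         # a/b accepts "a"
print(re.fullmatch(r"(a|b)(a|b)", "ba") is not None) # (a/b)(a/b) accepts "ba"
print(re.fullmatch(r"a*", "") is not None)           # a* accepts the empty string
print(re.fullmatch(r"ab+", "abbb") is not None)      # ab+ accepts "abbb"
print(re.fullmatch(r"ab+", "a") is not None)         # but ab+ does not accept "a"
```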

Recognition of Tokens
The lexical analyzer recognizes tokens by using transition diagrams. Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found. Accepting states are indicated by a double circle, and if there is an action to be taken, typically returning a token and an attribute value to the parser, we shall attach that action to the accepting state.
2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state. Any number of *s can be attached, depending on the number of positions to retract.
3. One state is designated the start state, or initial state; it is indicated by an edge labeled "start" entering from nowhere. The transition diagram always begins in the start state before any input symbols have been read.

Figure 2.4 Transition diagram for relop


 Figure 2.4 is a transition diagram that recognizes the lexemes matching the token relop in Pascal.
 Starting in state 0 (the initial state), if we see < as the first input symbol, then the lexemes that match the pattern for relop can only be <, <>, or <=.
 We therefore go to state 1 and look at the next character. If it is =, then we recognize lexeme <=, enter state 2, and return the token relop with attribute LE, the symbolic constant representing this particular operator.
 If in state 1 the next character is >, then instead we have lexeme <>, and we enter state 3 to return an indication that the not-equals operator has been found.
 On any other character after <, the lexeme is just <, and we enter state 4 to return this information. Note, however, that state 4 has a * to indicate that we must retract the input one position.
 On the other hand, if in state 0 the first character we see is =, then this one character must be the lexeme. We immediately return that fact from state 5.
 The remaining possibility is that the first character is >. Then we must enter state 6 and decide, on the basis of the next character, whether the lexeme is >= (if we next see =) or just > (on any other character).
 Note that if, in state 0, we see any character besides <, =, or >, we cannot possibly be seeing a relop lexeme, so this transition diagram will not be used.
