Chapter Two - Lexical Analysis
2.1 Functions of Lexical Analysis
2.2 Role of the Lexical Analyzer
2.3 Input Buffering
2.4 Specification of Tokens
2.5 Recognition of Tokens
Lexical Analysis
Lexical means “Anything related to words”.
Terminology used in Lexical Analysis.
1. Token: A class of input strings that are related through a common pattern.
2. Lexeme: The actual input string that represents an instance of a token.
3. Pattern: The rule the lexical analyzer follows to recognize a token.
For example, the statement X = X * (Acc + 123) is broken into the following token/lexeme pairs (a small code sketch of this token stream follows the table):

Token               Lexeme
Identifier          X
Operator Eq         =
Identifier          X
Operator Mul        *
Left Parenthesis    (
Identifier          Acc
Operator Plus       +
Integer Constant    123
Right Parenthesis   )
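As a rough illustration, the same token stream can be written down in code. The struct layout and the TOK_* names below are assumptions made for this sketch, not part of any fixed lexer interface.

#include <stdio.h>

/* Illustrative token representation; the names are assumptions for this sketch. */
typedef enum { TOK_ID, TOK_EQ, TOK_MUL, TOK_PLUS, TOK_LPAREN, TOK_RPAREN, TOK_INT } TokenType;

typedef struct {
    TokenType type;      /* the token class (which pattern matched)   */
    const char *lexeme;  /* the actual input string for this token    */
} Token;

int main(void) {
    /* Token stream for the statement  X = X * (Acc + 123)  */
    Token stream[] = {
        { TOK_ID,     "X"   }, { TOK_EQ,     "="   }, { TOK_ID,     "X"   },
        { TOK_MUL,    "*"   }, { TOK_LPAREN, "("   }, { TOK_ID,     "Acc" },
        { TOK_PLUS,   "+"   }, { TOK_INT,    "123" }, { TOK_RPAREN, ")"   },
    };
    for (size_t i = 0; i < sizeof stream / sizeof stream[0]; i++)
        printf("token %d  lexeme %s\n", stream[i].type, stream[i].lexeme);
    return 0;
}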
The main task of the lexical analyzer is to read the input characters of the source program, group them into lexemes, and produce as output a sequence of tokens, one for each lexeme in the source program.
The stream of tokens is sent to the parser for syntax analysis.
The lexical analyzer also interacts with the symbol table, e.g., when the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.
The interactions are suggested in Figure 2.1.
Figure 2.1: Interactions between the lexical analyzer and the parser
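A minimal sketch of this interaction: the parser repeatedly asks the lexical analyzer for the next token and enters identifier lexemes into the symbol table. All names here (Token, getNextToken, installId) are illustrative assumptions, and the lexer's functions are left as declarations.

/* Sketch of the lexer/parser interaction suggested in Figure 2.1. */
typedef enum { TOK_EOF, TOK_ID, TOK_OTHER } TokenType;
typedef struct { TokenType type; const char *lexeme; } Token;

Token getNextToken(void);            /* implemented by the lexical analyzer        */
int   installId(const char *name);   /* enters an identifier into the symbol table */

void parse(void) {
    /* The parser drives the lexer: it asks for tokens one at a time. */
    for (Token t = getNextToken(); t.type != TOK_EOF; t = getNextToken()) {
        if (t.type == TOK_ID)
            installId(t.lexeme);     /* identifier lexemes go to the symbol table  */
        /* ... grammar analysis of t happens here ... */
    }
}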
The following are additional tasks performed by the lexical analyzer other than identifying lexemes:
- Stripping out comments and whitespace (blank, newline, and tab)
- Correlating error messages generated by the compiler with the source program by keeping track of line numbers (using newline characters); a sketch of the whitespace and line-number handling follows this list
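The sketch below covers only the whitespace-skipping and line-counting part of these tasks (comment stripping would follow the same pattern); the variable and function names are assumptions.

#include <stdio.h>

/* Blanks, tabs, and newlines are consumed without producing a token;
 * newlines bump a line counter so later error messages can be
 * correlated with the source program. */
static int lineNumber = 1;

void skipWhitespace(FILE *src) {
    int c;
    while ((c = getc(src)) != EOF) {
        if (c == '\n')
            lineNumber++;            /* track line numbers for error reporting */
        else if (c != ' ' && c != '\t') {
            ungetc(c, src);          /* not whitespace: give the character back */
            return;
        }
    }
}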
Input Buffering
- Reading characters one at a time from secondary storage is a slow and time-consuming process
- It is often necessary to look ahead several characters beyond the lexeme before a match against a pattern can be announced
- One technique is to read characters from the source program and, if the pattern is not matched, push the look-ahead characters back to the source program
- This technique is time consuming
- Using buffering techniques eliminates this problem and increases efficiency
Single Buffer
- A buffer of n characters is defined in memory, usually of a disk-block size (e.g., 1024 characters)
- Two pointers are used:
  o Beginning pointer (BP)
  o Forward pointer (FP)
- BP points to the start of the lexeme while FP scans the input buffer for the end of the lexeme (a sketch of the two pointers follows this list)
- When the end of the lexeme is found, the lexeme is processed, i.e., matched against a pattern and converted into a token, while FP remains where it stopped
- Once the lexeme has been processed, both pointers are moved to the start of the next lexeme
- If the character is whitespace, it is also matched but no token is generated; both pointers simply move ahead to detect the next lexeme
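A minimal sketch of the two pointers over a single buffer; the buffer size and the delimiter set used here are assumptions for illustration.

#define BUF_SIZE 1024
char buffer[BUF_SIZE];

void scanOneLexeme(void) {
    char *bp = buffer;                 /* beginning pointer: start of the lexeme */
    char *fp = bp;                     /* forward pointer: scans the buffer      */
    while (*fp != '\0' && *fp != ' ' && *fp != '\n' && *fp != '\t')
        fp++;                          /* advance until a delimiter ends the lexeme */
    /* characters in [bp, fp) form the lexeme: match them against the token
       patterns, emit a token, then move both pointers past the lexeme */
}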
Buffer Pairs
- The buffer is divided into two halves of n characters each; FP is advanced with the following scheme (a C rendering follows the pseudocode):

if FP is at the end of the first half then begin
    reload the second half;
    FP = FP + 1
end
else if FP is at the end of the second half then begin
    reload the first half;
    move FP to the beginning of the first half
end
else
    FP = FP + 1
Sentinel
- While using the buffer-pair technique, each time FP is moved we have to check that it does not cross the end of a buffer half, and when it reaches the end of one half the other half needs to be loaded
- We can avoid this extra check by placing a sentinel character at the end of each half of the buffer
- The sentinel can be any character that is not part of the source program; EOF is usually preferred, since it also indicates the end of the source program
- Through this sentinel the end of a buffer half can be detected
- FP is only checked against this sentinel, and when the sentinel is encountered the appropriate action is taken to fill the next buffer half. If EOF is used, then encountering EOF elsewhere in the buffer means the end of the source program (a sketch of this check follows the list)
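A sketch of the sentinel variant of the advance routine: each half ends in a sentinel, so the common case needs a single comparison instead of two boundary checks. The sizes, the choice of '\0' as the sentinel, and all function names are assumptions for this sketch.

#define N 1024
#define SENTINEL '\0'                  /* stands in for the eof character         */

char buf[2 * N + 2];                   /* buf[N] and buf[2N+1] hold the sentinels */
char *fp = buf;

void reloadSecondHalf(void);           /* fill buf[N+1 .. 2N] from the source     */
void reloadFirstHalf(void);            /* fill buf[0 .. N-1] from the source      */
void endOfInput(void);                 /* eof found inside a half: stop scanning  */

void advance(void) {
    fp++;
    if (*fp != SENTINEL)
        return;                        /* common case: only one comparison        */
    if (fp == buf + N) {               /* sentinel closing the first half         */
        reloadSecondHalf();
        fp++;
    } else if (fp == buf + 2 * N + 1) { /* sentinel closing the second half       */
        reloadFirstHalf();
        fp = buf;
    } else {
        endOfInput();                  /* sentinel elsewhere marks end of source  */
    }
}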
Specification of Tokens
Regular expressions are used to specify token patterns.
Notational shorthands:
+ means one or more
* means zero or more
? means zero or one
Character classes: [a-z], [abc], [0-9]
Ex: A Pascal identifier is a letter followed by zero or more letters or digits (a small checker for this pattern is sketched below).
id → letter (letter | digit)*
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
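A minimal checker for the pattern letter (letter | digit)*; the function name is an assumption, and the ctype classification is used in place of explicit letter/digit alternatives.

#include <ctype.h>

/* Returns 1 if the whole string matches  letter (letter | digit)* , else 0. */
int isIdentifier(const char *s) {
    if (!isalpha((unsigned char)*s))       /* must start with a letter */
        return 0;
    for (s++; *s != '\0'; s++)
        if (!isalnum((unsigned char)*s))   /* then letters or digits   */
            return 0;
    return 1;
}

For example, isIdentifier("Acc") and isIdentifier("x25") return 1, while isIdentifier("123") returns 0.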
Recognition of Tokens
The lexical analyzer recognizes tokens by using transition diagrams. Some important conventions about transition diagrams are:
1. Certain states are said to be accepting or final. These states indicate that a lexeme has been found.
Accepting states are indicated by a double circle, and if there is an action to be taken – typically returning a
token and an attribute value to the parser – we shall attach that action to the accepting state
2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the
symbol that got us to the accepting state), then we shall additionally place a * near that accepting state. Any
number of *s can be attached depending on the number of positions to retract
3. One state is designated the start state, or initial state; it is indicated by an edge labeled “start”, entering from
nowhere. The transition diagram always begins in the start state before any input symbols have been read
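A sketch of simulating the identifier transition diagram over a character buffer, under these conventions: state 0 is the start state, state 1 loops on letters and digits, and state 2 is the accepting state marked with a *, so the forward pointer is retracted by one position before the lexeme is reported. The Lexeme struct and function name are assumptions for this sketch.

#include <ctype.h>

typedef struct { const char *begin; const char *end; } Lexeme;

/* Returns 1 and fills *out if an identifier starts at fp, else returns 0. */
int recognizeId(const char *fp, Lexeme *out) {
    const char *begin = fp;                /* lexemeBegin (BP)                */
    int state = 0;                         /* start state                     */
    for (;;) {
        int c = (unsigned char)*fp++;      /* read one character, advance FP  */
        switch (state) {
        case 0:                            /* start state                     */
            if (isalpha(c)) state = 1;
            else return 0;                 /* first character is not a letter */
            break;
        case 1:                            /* after letter (letter | digit)*  */
            if (!isalnum(c)) state = 2;    /* delimiter reached: accept       */
            break;
        }
        if (state == 2) {                  /* accepting state marked with *   */
            fp--;                          /* retract: delimiter is not part of the lexeme */
            out->begin = begin;
            out->end   = fp;
            return 1;
        }
    }
}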