Compiler Construction Lec 1b

The document discusses the role of lexical analysis in compiler construction, detailing its purpose of tokenization, whitespace removal, and error detection. It explains the components involved, such as the lexer, tokens, and symbol table, as well as challenges like ambiguity and efficiency. Additionally, it covers lexical errors and their handling, providing examples of common errors encountered during the lexical analysis phase.

Uploaded by

usamajaved425
Compiler Construction

Lecture 1b
Instructor: Alisha Farman
LEXICAL ANALYZER
Lexical analysis is the first phase of the compiler design process. It
involves converting a sequence of characters from the source code into a
sequence of tokens.
Purpose of Lexical Analysis
1. Tokenization: The primary goal is to break down the source
code into tokens. Tokens are the smallest units of meaning,
such as keywords, identifiers, operators, and symbols.
2. Removing Whitespace and Comments: The lexical analyzer
(or lexer) strips out unnecessary whitespace and comments,
which are not needed for further analysis.
3. Error Detection: It identifies and reports lexical errors, such
as illegal characters or malformed tokens.
Components of Lexical Analysis
Lexer: The lexer processes the input buffer to recognize and extract
tokens. It uses patterns to identify different types of tokens.
Token: A token is a categorized unit of the source code. Each token
has a type (e.g., keyword, identifier) and a value (e.g., the actual
string of the identifier).
Symbol Table: A symbol table is used to keep track of identifiers and
their attributes (e.g., data type, scope).
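The type/value pairing described above can be sketched as a small C struct. This is only an illustration: the enum names, the 32-byte lexeme buffer, and the helpers make_token and token_type_name are assumptions made for this example, not part of the lecture.

```c
#include <string.h>

/* Hypothetical token categories; the names are illustrative only. */
typedef enum {
    TOK_KEYWORD,
    TOK_IDENTIFIER,
    TOK_OPERATOR,
    TOK_INT_CONST,
    TOK_SEPARATOR
} TokenType;

typedef struct {
    TokenType type;   /* category of the token */
    char lexeme[32];  /* the matched source text (assumed maximum length) */
} Token;

/* Build a token, truncating lexemes that do not fit the fixed buffer. */
Token make_token(TokenType type, const char *text) {
    Token t;
    t.type = type;
    strncpy(t.lexeme, text, sizeof t.lexeme - 1);
    t.lexeme[sizeof t.lexeme - 1] = '\0';
    return t;
}

/* Human-readable name for a token category. */
const char *token_type_name(TokenType type) {
    switch (type) {
    case TOK_KEYWORD:    return "keyword";
    case TOK_IDENTIFIER: return "identifier";
    case TOK_OPERATOR:   return "operator";
    case TOK_INT_CONST:  return "integer constant";
    case TOK_SEPARATOR:  return "separator";
    }
    return "unknown";
}
```

For instance, make_token(TOK_IDENTIFIER, "sum") yields a token whose type is TOK_IDENTIFIER and whose lexeme is "sum".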
Input buffer
An input buffer is a storage area used by the compiler's lexical
analyzer to read source code efficiently.
The lexical analyzer scans the input from left to right, one character
at a time. It uses two pointers, the begin pointer (bp) and the forward
pointer (fp), to keep track of the portion of the input scanned: bp marks
the start of the current lexeme and fp moves ahead until its end is found.
One Buffer Scheme: Only one buffer is used to store the input string. The problem with
this scheme is that a very long lexeme crosses the buffer boundary; to scan the rest of the lexeme the
buffer has to be refilled, which overwrites the first part of the lexeme.
Two Buffer Scheme: To overcome the problem of the one buffer scheme, two buffers
are used to store the input string. The first and second buffers are scanned
alternately: when the end of the current buffer is reached, the other buffer is filled.
Buffering with EOF

● When EOF is encountered, the lexer stops processing further input.
● In double buffering, EOF can be used as a sentinel character to
indicate the end of the input.
● This eof character introduced at the end is called a sentinel and is
used to identify the end of the buffer.
int x = 10; int y = 20; // (long line, buffer is full)
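The two buffer scheme with a sentinel can be sketched roughly as follows. This is a simplified illustration under several assumptions: the source is an in-memory string rather than a file, each half holds only 8 characters so that refills are easy to observe, the '\0' character stands in for the eof sentinel, and the names TwoBuffer, tb_init and tb_next are hypothetical. The begin pointer (bp) is not modeled; only the forward pointer is shown.

```c
#include <stddef.h>
#include <string.h>

#define HALF 8          /* assumed size of each buffer half (tiny on purpose) */
#define SENTINEL '\0'   /* stands in for the eof sentinel */

typedef struct {
    const char *src;        /* source text not yet loaded into a buffer */
    char buf[2][HALF + 1];  /* two halves, each with one extra sentinel slot */
    int cur;                /* index of the half being scanned */
    size_t fp;              /* forward pointer within the current half */
    int refills;            /* number of times a half was filled */
} TwoBuffer;

/* Load up to HALF characters into one half and terminate it with the sentinel. */
static void tb_fill(TwoBuffer *tb, int half) {
    size_t n = strlen(tb->src);
    if (n > HALF) n = HALF;
    memcpy(tb->buf[half], tb->src, n);
    tb->buf[half][n] = SENTINEL;
    tb->src += n;
    tb->refills++;
}

void tb_init(TwoBuffer *tb, const char *src) {
    tb->src = src;
    tb->cur = 0;
    tb->fp = 0;
    tb->refills = 0;
    tb_fill(tb, 0);
}

/* Return the next character, or SENTINEL once the whole input is consumed. */
char tb_next(TwoBuffer *tb) {
    char c = tb->buf[tb->cur][tb->fp];
    if (c == SENTINEL) {
        if (tb->fp < HALF) return SENTINEL;  /* sentinel inside a half: real end of input */
        tb->cur = 1 - tb->cur;               /* sentinel at the end of a half: switch halves */
        tb_fill(tb, tb->cur);
        tb->fp = 0;
        c = tb->buf[tb->cur][0];
        if (c == SENTINEL) return SENTINEL;  /* nothing was left to load */
    }
    tb->fp++;
    return c;
}
```

With the line above as input (23 characters), the halves are filled three times: 8, 8 and then 7 characters. The single sentinel test in tb_next replaces a separate end-of-buffer check on every character, which is the usual motivation for using a sentinel.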
Challenges in Lexical Analysis
1. Ambiguity: Tokens must be unambiguously defined to avoid
confusion. For example, distinguishing between an identifier
and a keyword.
2. Efficiency: The lexer should efficiently handle large inputs and
complex token patterns.
3. Error Handling: Properly identifying and reporting errors while
continuing the analysis process.
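One common way to resolve the keyword/identifier ambiguity is to scan the lexeme as if it were an identifier and then look it up in a table of reserved words. The sketch below assumes a small, incomplete keyword list and a hypothetical helper name is_keyword; a real lexer would list every keyword of the language.

```c
#include <stddef.h>
#include <string.h>

/* Assumed, deliberately incomplete list of reserved words. */
static const char *const KEYWORDS[] = {
    "int", "float", "char", "void", "if", "while", "for", "return"
};

/* Return 1 if the lexeme is a reserved word, 0 if it is an ordinary identifier. */
int is_keyword(const char *lexeme) {
    size_t n = sizeof KEYWORDS / sizeof KEYWORDS[0];
    for (size_t i = 0; i < n; i++) {
        if (strcmp(lexeme, KEYWORDS[i]) == 0) return 1;
    }
    return 0;
}
```

A linear search is enough for a handful of keywords; a sorted table with binary search or a hash table would be the usual choice when efficiency matters.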
C tokens are of the following types (similar classes apply in most
languages):
1. Identifier (ID): built from letters (a-z, A-Z)
2. Assignment operator (=, +=, -=, *=)
3. Arithmetic operator (*, +, -, /, %)
4. Integer constant (decimal digits 0-9)
5. Relational operator (==, <, >, <=, >=, !=)
6. Semicolon (;) separator
7. Float constant (66.78)
8. String ("alisha")
9. Keyword (int, float, ...)
LEXICAL ANALYZER
• Lexical Analyzer is First Phase Of Compiler.
• Input to Lexical Analyzer is “Source Code“
• Lexical Analysis Identifies Different Lexical Units in a Source Code.
• Different Lexical Classes or Tokens or Lexemes
– Identifiers
– Constants
– Keywords
– Operators
• Example : sum = num1 + num2 ;
So the lexical analyzer will produce the following table of tokens:

Token   Type
sum     id
=       op
num1    id
+       op
num2    id
;       separator
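A table like the one above could be produced by a very small hand-written scanner. The following is a minimal sketch, not the lecture's implementation: the function names next_token and print_tokens are hypothetical, only single-character operators are recognized (so == or <= would come out as two tokens), and keywords are not distinguished from identifiers.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Scan one lexeme starting at *p, copy it into lexeme (capacity cap),
   advance *p past it, and return its class name.
   Returns NULL when only whitespace or nothing is left. */
const char *next_token(const char **p, char *lexeme, size_t cap) {
    const char *s = *p;
    while (isspace((unsigned char)*s)) s++;
    if (*s == '\0') { *p = s; return NULL; }

    const char *start = s;
    const char *type;
    if (isalpha((unsigned char)*s) || *s == '_') {
        while (isalnum((unsigned char)*s) || *s == '_') s++;
        type = "id";
    } else if (isdigit((unsigned char)*s)) {
        while (isdigit((unsigned char)*s)) s++;
        type = "const";
    } else if (strchr("=+-*/%<>!", *s)) {
        s++;                      /* single-character operators only */
        type = "op";
    } else if (*s == ';' || *s == ',') {
        s++;
        type = "separator";
    } else {
        s++;                      /* anything else is reported as invalid */
        type = "invalid";
    }

    size_t len = (size_t)(s - start);
    if (len >= cap) len = cap - 1;  /* truncate over-long lexemes */
    memcpy(lexeme, start, len);
    lexeme[len] = '\0';
    *p = s;
    return type;
}

/* Print one "lexeme type" row per token of the given source line. */
void print_tokens(const char *src) {
    char lexeme[32];
    const char *type;
    while ((type = next_token(&src, lexeme, sizeof lexeme)) != NULL) {
        printf("%-6s %s\n", lexeme, type);
    }
}
```

Calling print_tokens("sum = num1 + num2 ;") walks through the six tokens of the example in order.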
Example 1
int main() {
int x = 10;
printf("Hello, world!");
}
int (Keyword)
main (Identifier)
( (Left Parenthesis)
) (Right Parenthesis)
{ (Left Brace)
int (Keyword)
x (Identifier)
= (Assignment Operator)
10 (Integer Literal)
; (Semicolon)
printf (Identifier)
( (Left Parenthesis)
"Hello, world!" (String Literal)
) (Right Parenthesis)
; (Semicolon)
} (Right Brace)
Example:
#include<stdio.h>
void main()
{
int a,b,c;
printf("Enter ");
while(a<b)
a=b+c;
c=a;
}
TASK 1: CREATE TOKENS FOR THE CODE
main()
{
int a="abcxyz=55";
if(a>b)
a+=b;
//c-variable//
}
TASK 2: CREATE TOKENS FOR THE CODE
int main()
{
int x, y, total;
x = 10, y = 20;
total = x + y;
printf("Total = %d \n", total);
}
LEXICAL ERROR
Lexical errors occur during the lexical analysis phase of
compilation when the lexer encounters a sequence of characters
that does not match any valid token in the programming language.
These errors typically arise from typos, incorrect usage of
symbols, or invalid characters.
How Error Tokens Are Handled
1. Discarding the Invalid Token
○ The lexer detects an invalid token and discards it, possibly issuing a
warning or error message.
2. Replacing with a Special Token
○ Some compilers generate a special error token (ERROR_TOKEN) to
handle the issue in later stages.
3. Attempting Recovery
○ The compiler might attempt to recover by suggesting corrections or
ignoring minor mistakes.
int 3var = 10;
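The "special error token" strategy can be sketched for a word such as 3var. The example below assumes the lexer has already isolated a run of letters, digits and underscores; the enum values other than ERROR_TOKEN and the function name classify_word are hypothetical.

```c
#include <ctype.h>

typedef enum { TOKEN_IDENTIFIER, TOKEN_INT_CONST, ERROR_TOKEN } WordClass;

/* Classify a run of letters, digits and underscores.
   A word that starts with a digit must consist of digits only;
   otherwise it is returned as ERROR_TOKEN so later stages can react. */
WordClass classify_word(const char *w) {
    if (*w == '\0') return ERROR_TOKEN;

    if (isdigit((unsigned char)w[0])) {
        for (const char *p = w; *p; p++) {
            if (!isdigit((unsigned char)*p)) return ERROR_TOKEN;  /* e.g. 3var */
        }
        return TOKEN_INT_CONST;
    }

    if (isalpha((unsigned char)w[0]) || w[0] == '_') {
        for (const char *p = w; *p; p++) {
            if (!isalnum((unsigned char)*p) && *p != '_') return ERROR_TOKEN;
        }
        return TOKEN_IDENTIFIER;
    }

    return ERROR_TOKEN;
}
```

Returning ERROR_TOKEN instead of aborting lets the lexer keep scanning, so several errors can be reported in a single run.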
Example 1: Invalid Character
int main() {
int a = 5;
@a = a + 1; // Invalid character '@'
return 0;
}
Error Description: The '@' character is not recognized as a valid token in C,
leading to a lexical error.
Example 2: Unrecognized Token
int main() {
int a = 5;
int b = a $ 3; // Unrecognized token '$'
return 0;
}

Error Description: The '$' symbol is not a valid operator in C, causing a lexical error.
Example 3: Incomplete String Literal
int main() {
char* str = "Hello; // Incomplete string literal
return 0;
}
Error Description: The string literal is not properly closed with a double quote,
leading to a lexical error.
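Detecting an unterminated string literal usually amounts to scanning for the closing quote and giving up at the end of the line. The sketch below is a simplified illustration; the function name scan_string_literal and its return convention (0 means unterminated) are assumptions made for this example.

```c
#include <stddef.h>

/* s must point at the opening double quote.
   Returns the length of the literal including both quotes,
   or 0 if a newline or the end of input is reached first. */
size_t scan_string_literal(const char *s) {
    size_t i = 1;  /* skip the opening quote */
    while (s[i] != '\0' && s[i] != '\n') {
        if (s[i] == '\\' && s[i + 1] != '\0') {
            i += 2;            /* skip an escape sequence such as \" */
            continue;
        }
        if (s[i] == '"') return i + 1;  /* closing quote found */
        i++;
    }
    return 0;  /* unterminated string literal */
}
```

For the line in Example 3 the scan reaches the end of the line without finding a closing quote, so the function returns 0 and the lexer can report the error.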
Example 4: Invalid Number Format
int main() {
int num = 123abc; // Invalid number format
return 0;
}
Error Description: The sequence '123abc' is not a valid integer literal in C,
resulting in a lexical error.
Example 5: Unmatched Comment Delimiter
int main() {
int a = 5;
/* This is a comment
return 0;
}
Error Description: The comment is not properly closed with '*/', causing the
lexer to throw a lexical error.
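An unclosed block comment can be detected in the same spirit: scan for the closing delimiter and report an error if the input ends first. This is a minimal sketch; the function name skip_block_comment and the return convention (0 means unclosed) are assumptions for illustration.

```c
#include <stddef.h>

// s must point at the two characters that open a block comment.
// Returns the offset just past the closing delimiter,
// or 0 if the input ends before the comment is closed.
size_t skip_block_comment(const char *s) {
    size_t i = 2;  // skip the opening delimiter
    while (s[i] != '\0') {
        if (s[i] == '*' && s[i + 1] == '/') return i + 2;
        i++;
    }
    return 0;  // unclosed comment
}
```

In Example 5 the scan consumes the rest of the file, including return 0; and the closing brace, before it reaches the end of input, which is why the error is typically reported at the end of the file.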
Example 6: Use of Reserved Keyword
int main() {
int for = 10; // 'for' is a reserved keyword
return 0;
}
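Error Description: The keyword 'for' is reserved for loop constructs and cannot be
used as an identifier. Strictly speaking, the lexer tokenizes 'for' as a valid keyword;
the error is usually reported by the parser as a syntax error rather than as a lexical error.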
Handling Lexical Errors
Lexical analyzers typically handle these errors by:
● Reporting the Error: Providing a meaningful error message indicating
the type of error and its location in the source code.
● Skipping the Invalid Token: Moving past the invalid token to continue
scanning the rest of the source code.
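Both steps, reporting with a location and skipping ahead, can be sketched as follows. This is a simplified illustration: the set of invalid characters is hard-coded, string literals and comments (where such characters are legal) are not handled, and the function name report_invalid_chars and the message format are assumptions.

```c
#include <stdio.h>

/* Scan src, print a message with line and column for every invalid
   character, skip it, and keep scanning. Returns the number of errors. */
int report_invalid_chars(const char *src, FILE *out) {
    int line = 1, col = 1, errors = 0;
    for (const char *p = src; *p; p++) {
        if (*p == '\n') {      /* new line: reset the column counter */
            line++;
            col = 1;
            continue;
        }
        if (*p == '@' || *p == '$' || *p == '`') {  /* assumed invalid set */
            fprintf(out, "%d:%d: lexical error: invalid character '%c'\n",
                    line, col, *p);
            errors++;          /* the character is skipped, scanning continues */
        }
        col++;
    }
    return errors;
}
```

Called on the text "int a = 5;\n@a = a + 1;\n" with stderr as the output stream, the function reports one error located at line 2, column 1.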
Example:
int main() {

int a = 5;

inttt b = 10; // Invalid token 'inttt'

return 0;

}
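Total Count

● Valid Tokens: 18
● Invalid Tokens: 1 ('inttt', counted here as a misspelled keyword; many C lexers would accept it as an identifier and leave the error to a later phase)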
Example:
int main() {
int a = 5;
@a = a + 1; // Invalid character '@'
return 0;
}

Total Count
● Valid Tokens: 19
● Invalid Tokens: 1 (this count treats '@a' as a single invalid token)
