Concepts - Assignment (Technical Report Template)
Build Scanner
Prepared By
Student Name
Ghada Mohamed Mostafa
Student ID
200047779
Under Supervision
Name of Doctor
Nehal Abdelsalam
Name of T. A.
Faris Emadeldin
1. Introduction
In this report, I discuss the process of building a lexical analyzer, which
performs the first phase of compilation. A lexical analyzer, also known as a
scanner, plays an essential role in converting a stream of characters from
source code into tokens, each representing a meaningful element of the
program. This conversion is crucial for later stages of the compiler, such as
syntax and semantic analysis.
For this project, I converted a simple C-based lexical analyzer into C++ to
make the code more modular and easier to understand. The implementation
follows the traditional approach to lexical analysis, using a finite state
machine to match patterns such as keywords, identifiers, operators, and
literals. This report outlines the methodology used, the tools applied, and
the challenges faced during development.
1.1. Phases of a Compiler
A compiler performs several phases to convert high-level source code
into machine code. The major phases are as follows (a short worked
example follows the list):
1. Lexical Analysis:
This phase breaks the input source code into tokens (basic units of
code such as keywords, operators, and identifiers). The Lexical
Analyzer (also called the Scanner) reads characters and groups them
into lexemes.
2. Syntax Analysis:
The Syntax Analyzer (Parser) examines the syntax of the program to
ensure the correct arrangement of tokens. It checks if the token
sequence adheres to the language’s grammar.
3. Semantic Analysis:
In this phase, the compiler ensures that the program’s logic makes
sense. It checks for semantic errors like type mismatches, undeclared
variables, and other logical errors.
4. Intermediate Code Generation:
The compiler translates the program into an intermediate form that
is easier to manipulate. This code is independent of the target machine.
5. Code Optimization:
The intermediate code is optimized for better performance, which
may include improving memory usage or execution time.
6. Code Generation:
The final phase generates the target machine code (or assembly
language), which can be executed by the computer.
7. Code Linking and Assembly:
The generated machine code is linked with libraries or other modules to
create the final executable.
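To make these phases concrete, here is an illustrative sketch (not part of the
project code; the temporaries t1 and t2 are hypothetical) of how the statement
position = initial + rate * 60 might pass through the first four phases:
// 1. Lexical analysis produces a token stream:
//      IDENT(position) ASSIGN_OP IDENT(initial) ADD_OP IDENT(rate) MULT_OP INT_LIT(60)
// 2. Syntax analysis builds a parse tree that groups rate * 60 before the addition.
// 3. Semantic analysis checks that position, initial, and rate are declared and numeric.
// 4. Intermediate code generation might emit three-address code:
//      t1 = rate * 60
//      t2 = initial + t1
//      position = t2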
2. Lexical Analyzer
The lexical analyzer is the first phase of the compiler, and its job is to read
the source code character-by-character, identifying meaningful sequences
and grouping them into tokens. These tokens represent the smallest units of
the programming language, such as keywords, operators, identifiers, and
literals.
The lexical analyzer recognizes these tokens using a state machine
approach, where each state corresponds to a stage in recognizing a particular
kind of token (e.g., the digits of a number or the characters of an identifier).
It works in conjunction with regular expressions, which define the pattern for
each token type.
The main goal of the lexical analyzer is to simplify the parsing process by
providing a stream of tokens to the next phase of the compiler. The analyzer
reads the input stream and identifies the tokens, helping the compiler build
a structured representation of the program.
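For example, given the source fragment total = sum + 47, a scanner with the
token set used in this project (Section 4) would emit a token stream along
these lines:
IDENT      "total"
ASSIGN_OP  "="
IDENT      "sum"
ADD_OP     "+"
INT_LIT    "47"
The parser then works with these five tokens rather than the raw character
stream.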
3. Software Tools
3.1. Computer Program
The lexical analyzer was written in C++. C++ was chosen for its
efficient memory handling and for the std::string class, which makes
building lexemes straightforward. The program reads input from a file,
processes the characters one by one, and generates tokens based on
predefined patterns.
3.2. Programming Language
The lexer was designed for simple arithmetic expressions. It supports
basic arithmetic operations such as addition, subtraction, multiplication,
and division, as well as parentheses, the assignment operator, and integer
literals. The lexer also identifies identifiers, which can represent
variable or function names.
4. Implementation of a Lexical Analyzer
1. Including Libraries:
#include <iostream>
#include <fstream>
#include <cctype>
#include <string>
using namespace std;
In this section, we include several standard libraries:
• iostream: allows us to handle input and output, which we use to print
data with cout.
• fstream: is used for file operations, enabling us to read from files.
• cctype: provides functions like isalpha() and isdigit(), which check
whether a character is a letter or a digit.
• string: includes the string class to work with text data, which we use
to store the lexemes we encounter.
The using namespace std; directive lets us refer to names such as cout,
string, and ifstream without the std:: prefix.
2. Defining Constants:
#define LETTER 0
#define DIGIT 1
#define UNKNOWN 99
#define END_OF_FILE -1
Here, we define constants to categorize the characters:
• LETTER: indicates that the character is a letter of the alphabet.
• DIGIT: used when the character is a digit.
• UNKNOWN: used when the character is something else that does not fit
into the known categories.
• END_OF_FILE: marks the end of the file.
3. Token Codes:
#define INT_LIT 10
#define IDENT 11
#define ASSIGN_OP 20
#define ADD_OP 21
#define SUB_OP 22
#define MULT_OP 23
#define DIV_OP 24
#define LEFT_PAREN 25
#define RIGHT_PAREN 26
4. Global Variables:
int charClass;
string lexeme = "";
char nextChar;
int lexLen;
int token;
int nextToken;
ifstream inFile;
Here we declare several variables:
• charClass: stores the class of the current character (letter, digit,
etc.).
• lexeme: holds the current lexeme (a sequence of characters treated as
a unit).
• nextChar: stores the next character to be processed.
• lexLen: tracked the lexeme length in the original C version; std::string
makes it unnecessary here, but it is kept for fidelity to that version.
• token: stores the current token code.
• nextToken: stores the token code produced by the most recent call to
lex().
• inFile: the file stream used to open and read from the input file.
5. Function Declarations:
void addChar();
void getChar();
void getNonBlank();
int lex();
int lookup(char ch);
Here we declare the functions that handle the various tasks:
• addChar(): adds the current character to the lexeme.
• getChar(): retrieves the next character from the input file and
classifies it.
• getNonBlank(): skips over whitespace such as spaces and tabs.
• lex(): the main lexical analysis function that processes the input
and identifies tokens.
• lookup(): maps single-character symbols (operators, parentheses, and
the assignment operator) to their token codes.
6. addChar() Function:
void addChar() {
    lexeme += nextChar;
}
This function adds the current character (nextChar) to the lexeme
being built. For example, if we encounter a series of letters or
digits, they are appended one by one to form the full lexeme.
7. getChar() Function:
void getChar() {
    if (inFile.get(nextChar)) {
        if (isalpha(nextChar))
            charClass = LETTER;
        else if (isdigit(nextChar))
            charClass = DIGIT;
        else
            charClass = UNKNOWN;
    } else {
        charClass = END_OF_FILE;
        nextChar = '\0'; // clear the stale character so getNonBlank() cannot loop forever at EOF
    }
}
Here, we read the next character from the file using inFile.get(nextChar)
and classify it:
• If it is a letter (isalpha(nextChar)), we set charClass to LETTER.
• If it is a digit (isdigit(nextChar)), we set charClass to DIGIT.
• If it is neither, we classify it as UNKNOWN.
• If we have reached the end of the file, we set charClass to
END_OF_FILE and clear nextChar. The clearing step matters: inFile.get()
leaves nextChar unchanged on failure, so a trailing space or newline
would otherwise make getNonBlank() loop forever.
8. getNonBlank() Function:
void getNonBlank() {
    while (isspace(nextChar))
        getChar();
}
This function is used to skip over any whitespace (spaces, tabs, etc.)
in the input. It calls getChar() repeatedly until a non-whitespace
character is found.
9. lookup() Function:
int lookup(char ch) {
    switch (ch) {
        case '(': addChar(); return LEFT_PAREN;
        case ')': addChar(); return RIGHT_PAREN;
        case '+': addChar(); return ADD_OP;
        case '-': addChar(); return SUB_OP;
        case '*': addChar(); return MULT_OP;
        case '/': addChar(); return DIV_OP;
        case '=': addChar(); return ASSIGN_OP; // so the ASSIGN_OP code defined above is actually produced
        default:  addChar(); return END_OF_FILE;
    }
}
In this function, we check for single-character symbols such as parentheses
and operators. Based on the character, we:
• Add it to the lexeme using addChar().
• Return the corresponding token code (e.g., LEFT_PAREN for (, ADD_OP
for +, and ASSIGN_OP for =).
Any character that matches none of the cases falls through to the default
branch and is reported as END_OF_FILE, which stops the scan; a production
scanner would report a lexical error here instead.
10. lex() Function:
int lex() {
    lexeme = "";
    getNonBlank();
    switch (charClass) {
        case LETTER:                // identifiers: a letter followed by letters or digits
            addChar();
            getChar();
            while (charClass == LETTER || charClass == DIGIT) {
                addChar();
                getChar();
            }
            nextToken = IDENT;
            break;
        case DIGIT:                 // integer literals: one or more digits
            addChar();
            getChar();
            while (charClass == DIGIT) {
                addChar();
                getChar();
            }
            nextToken = INT_LIT;
            break;
        case UNKNOWN:               // operators, parentheses, and other symbols
            nextToken = lookup(nextChar);
            getChar();
            break;
        case END_OF_FILE:
            nextToken = END_OF_FILE;
            lexeme = "EOF";
            break;
    }
    cout << "Next token is: " << nextToken << ", Next lexeme is " << lexeme << endl;
    return nextToken;
}
In this function, we process the current character and determine what
token it represents (a short trace follows this list):
• If the character is a letter, we start building an identifier
(IDENT).
• If it’s a digit, we build an integer literal (INT_LIT).
• If it’s a special character (like an operator or parenthesis),
we use the lookup() function.
• If we reach the end of the file, we set the token to
END_OF_FILE.
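To see how these cases interact, here is a hedged trace (assuming the
getChar() fix above and that main() has already loaded the first character)
of four successive lex() calls on the input a1 * 25:
// call 1: charClass == LETTER for 'a'; the loop also consumes '1' and stops
//         at the blank (UNKNOWN)   -> Next token is: 11, Next lexeme is a1
// call 2: getNonBlank() skips the blank; charClass == UNKNOWN and
//         lookup('*') returns MULT_OP -> Next token is: 23, Next lexeme is *
// call 3: charClass == DIGIT for '2'; the loop consumes '5', then getChar()
//         reports END_OF_FILE      -> Next token is: 10, Next lexeme is 25
// call 4: charClass == END_OF_FILE -> Next token is: -1, Next lexeme is EOF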
11. main() Function:
int main() {
    inFile.open("front.in");
    if (!inFile.is_open()) {
        cout << "ERROR - cannot open front.in" << endl;
        return 1;
    }
    getChar();
    do {
        lex();
    } while (nextToken != END_OF_FILE);
    inFile.close();
    return 0;
}
In the main function (a sample run follows this list):
• We attempt to open the input file front.in; if this fails, we print an
error message and exit.
• We read the first character from the file with getChar().
• We repeatedly call lex() until we reach the end of the file.
• Finally, we close the file.
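As a quick check, suppose front.in contains the single line
(sum + 47) / total. With the token codes defined in this report, the
program should print:
Next token is: 25, Next lexeme is (
Next token is: 11, Next lexeme is sum
Next token is: 21, Next lexeme is +
Next token is: 10, Next lexeme is 47
Next token is: 26, Next lexeme is )
Next token is: 24, Next lexeme is /
Next token is: 11, Next lexeme is total
Next token is: -1, Next lexeme is EOF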
5. References
Sebesta, R. W. (2019). Concepts of Programming Languages (12th ed.).
Pearson.