Compiler Construction: Department of Computer Science
Question # 01: Explain Lexical Analyzer Generator, Structure of Lex program,
Sample Lex programs. Also explain other Compiler Construction Tools.
➢ Lexical Analyzer Generator (Lex)
Lex is a lexical analyzer generator: it takes a specification of token patterns, written as regular expressions, and produces a C program (the scanner) that recognizes those tokens.
1.1.2. Advantages
o Automation: the scanner is generated from declarative patterns instead of being hand-coded.
o Efficiency: the generated scanner is table-driven and fast.
o Consistency: all tokens are recognized uniformly from a single specification.
➢ Definitions Section
The Definitions Section in a Lex program includes:
Header Declarations:
o Code enclosed between %{ and %} is copied verbatim into the generated lexical analyzer; it typically holds C #include directives and global declarations.
o This code is placed at the start of the generated scanner.
Macro Definitions:
o The definitions section also binds names to regular-expression patterns (for example, DIGIT [0-9]).
o These named patterns can then be referenced, as {DIGIT}, in the rules section of the Lex specification.
Example:
%{
#include <stdio.h>
%}
DIGIT [0-9]
ID [a-zA-Z][a-zA-Z0-9]*
%%
➢ Rules Section
The Rules Section contains the lexical rules and associated actions:
Pattern-Action Pairs:
o The rules section of a Lex program comprises rules, each consisting of a pattern and an associated action.
o Patterns are regular expressions specifying the tokens the lexer should identify.
o Actions define what the lexer does when the corresponding pattern is matched.
Regular Expressions:
o Regular expressions match input text, identifying tokens or patterns to be recognized.
Action Code:
o Action code in Lex can include C or target language code.
o When a pattern is matched, the linked action code is executed.
➢ User Subroutines Section
The final section, after the second %%, contains user-supplied code that is copied verbatim into the generated scanner; it typically defines main().
Example:
%%
int main() {
yylex(); // Function that starts the lexical analysis
return 0;
}
➢ Sample Lex Programs
1. Identifying Keywords and Identifiers
This Lex program identifies keywords (if, else, while, etc.) and identifiers (variable names).
%{
#include <stdio.h>
%}
  /* Definitions section */
letter [a-zA-Z]
digit [0-9]
identifier {letter}({letter}|{digit})*
%%
  /* Rules section */
"if" { printf("<keyword , if>\n"); }
"else" { printf("<keyword , else>\n"); }
"while" { printf("<keyword , while>\n"); }
{identifier} { printf("<id , %s>\n", yytext); }
. { /* Ignore other characters */ }
%%
int main() {
yylex();
return 0;
}
2. Handling Arithmetic Operations and Numbers
This Lex program identifies arithmetic operators (+, -, *, /) and numeric constants.
%{
#include <stdio.h>
%}
%%
  /* Rules section */
"+" { printf("<operator , +>\n"); }
"-" { printf("<operator , ->\n"); }
"*" { printf("<operator , *>\n"); }
"/" { printf("<operator , />\n"); }
[0-9]+ { printf("<number , %s>\n", yytext); }
. { /* Ignore other characters */ }
%%
int main() {
yylex();
return 0;
}
➢ Other Compiler Construction Tools
o Yacc (Yet Another Compiler-Compiler)
o ANTLR (ANother Tool for Language Recognition)
o LLVM (Low-Level Virtual Machine)
o JavaCC (Java Compiler Compiler)
1. Yacc
Yacc generates LALR(1) parsers in C from a context-free grammar specification and is typically used together with a Lex-generated scanner that supplies the tokens.
Example:
%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token NUMBER
%%
expr : NUMBER
     | expr '+' NUMBER
     ;
%%
int main() {
    yyparse();
    return 0;
}
2. ANTLR
ANTLR is a powerful parser generator used for building parsers, lexers, and tree walkers. It's capable of
generating code in various languages like Java, C#, Python, and JavaScript.
Example:
grammar Expr;
expression : term (('+' | '-') term)* ;
term : factor (('*' | '/') factor)* ;
factor : NUMBER | '(' expression ')' ;
NUMBER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
3. LLVM
LLVM is an infrastructure for building compilers that provides reusable libraries and tools for various
compilation tasks, including optimization and code generation.
4. JavaCC
JavaCC is a parser generator specifically designed for the Java programming language, creating parsers,
lexers, and syntax trees.
Example:
options {
STATIC = false;
}
PARSER_BEGIN(ExpressionParser)
public class ExpressionParser {}
PARSER_END(ExpressionParser)
TOKEN: {
<NUMBER: (["0"-"9"])+>
}
void expression() :
{}
{
( <NUMBER> )+
}
Question # 02: Write a program for identification of tokens. Your program will
take a .txt file that contains C-language source code as input and gives the
sequence of tokens as output.
➢ Python Code
import re

def tokenize_c_code(file_path):
    patterns = [
        ('keyword', r'\b(?:int|if|else|while|for|return)\b'),
        ('id', r'\b[a-zA-Z_]\w*\b'),
        ('relational_operator', r'==|<=|>=|!=|[<>]=?'),
        ('operator', r'[+\-*/%<>&|^!]=?'),
        ('assignment', r'='),
        ('number', r'\b\d+(?:\.\d+)?\b'),
        ('delimiter', r'[;,(){}\[\]]'),
        ('whitespace', r'\s+')
    ]
    pattern = '|'.join('(?P<%s>%s)' % pair for pair in patterns)
    regex = re.compile(pattern)
    with open(file_path, 'r') as file:
        source_code = file.read()
    tokens = []
    printed_tokens = set()  # Track tokens already emitted
    for match in regex.finditer(source_code):
        token_type = match.lastgroup
        token_value = match.group()
        if token_type != 'whitespace':
            token = (token_type, token_value)
            if token not in printed_tokens:  # Report each distinct token once
                tokens.append(token)
                printed_tokens.add(token)
    return tokens

input_file_path = 'sample_c_code.txt'  # Replace with the path to your C code file
for token in tokenize_c_code(input_file_path):
    print(token)
o Input
int b;
b = 5;
if ( b== 5 )
{
b=b+1;
}
o Output
('keyword', 'int')
('id', 'b')
('delimiter', ';')
('assignment', '=')
('number', '5')
('keyword', 'if')
('delimiter', '(')
('relational_operator', '==')
('delimiter', ')')
('delimiter', '{')
('operator', '+')
('number', '1')
('delimiter', '}')
Each distinct token is reported once because duplicates are filtered.
Question # 03: Explain Recursive Descent parser and Predictive Parser with the
help of examples.
➢ Recursive Descent Parser
A Recursive Descent parser is a top-down parser implemented as a set of mutually recursive procedures, one for each non-terminal of the grammar.
➢ Key Features
o Top-Down Parsing:
• Parsing process begins at the root of the syntax tree.
• Progresses from root to leaves of the tree.
• Starts from the start symbol of the grammar.
• Seeks to match input with grammar rules.
o LL(k) Parsing:
• Recursive Descent parsers that avoid backtracking correspond to LL(k) parsers.
• LL stands for Left-to-right scan of the input, producing a Leftmost derivation.
• "k" indicates the number of lookahead tokens used for parsing decisions.
• The parser predicts which production to apply from those lookahead tokens.
o Procedure-based:
• Each non-terminal in the grammar corresponds to a parsing procedure.
• This approach makes it easy to conceptualize and implement parsing logic.
o Readability:
• Easy to understand and write, particularly for simpler grammars.
• Parsing logic closely mirrors the grammar rules.
o Backtracking:
• Traditional Recursive Descent parsers may use backtracking.
• Backtracking involves exploring different production rules if the current path fails.
• Drawback in efficiency for complex grammars due to potential reevaluation of paths.
o Error Handling:
• Error recovery and reporting can be challenging, especially when dealing with
ambiguous or incorrect input.
➢ Steps
o Grammar Representation: The grammar is represented explicitly, usually in BNF (Backus-Naur
Form) or EBNF (Extended Backus-Naur Form).
o Parsing Procedures: Recursive procedures are created for each non-terminal symbol in the
grammar. These procedures are called recursively to parse the input.
o Tokenization: The input is tokenized, breaking it down into tokens (like identifiers, keywords,
operators, etc.).
o Parsing Logic: The parsing procedures handle each production rule, recursively calling
themselves to parse sub-components of the input according to the grammar rules.
o Error Handling: Error handling is crucial in Recursive Descent parsers. It can involve identifying
syntax errors and possibly recovering from them to continue parsing.
➢ Example
Grammar:
S → mXn | mZn
X → pq | sq
Z → qr
The parse-tree diagrams for the individual steps are omitted here; in outline, for the illustrative input string msqn:
Step 1: Start with S and try the first alternative, S → mXn; the leading m matches the input.
Step 2: Expand X with its first alternative, X → pq; the next input symbol s does not match p.
Step 3: Backtracking: the parser resets its input position and tries the next alternative, X → sq, which matches sq.
Step 4: The remaining input symbol n matches the trailing n of S → mXn, so the parse succeeds.
➢ Predictive Parser
A Predictive Parser is a specialized form of Recursive Descent parser that can predict which production
rule to use without backtracking. It achieves this by using a parsing table derived from the grammar.
➢ Key Features
o Deterministic Parsing: Based on a parsing table constructed from the grammar, allowing the
parser to make deterministic choices without backtracking.
o Parsing Table: Utilizes a table (often called a parsing or LL(1) table) to predict the production
rule based on the current input symbol and the symbol at the top of the stack.
o Efficiency: Avoids backtracking, resulting in potentially faster parsing for unambiguous
grammars.
o Handling Ambiguity: Works well for grammars that are unambiguous and don't have left
recursion. Ambiguous grammars might require modifications to be parsed predictively.
o Error Handling: Because of its deterministic nature, error handling is generally more
straightforward compared to traditional Recursive Descent parsers.
➢ Steps
o Constructing Parsing Table: The parsing table is created based on the grammar. Rows represent
non-terminal symbols, columns represent terminal symbols, and cells in the table contain the
production rule to use.
o Input Parsing: The parsing algorithm reads the input stream and the parsing table to predict the
correct production rule for each step.
o Stack-based Parsing: The parser uses a stack to simulate the parsing process. It matches the input
symbols with the stack symbols and decides which production rule to apply based on the current
input symbol and the symbol at the top of the stack.
o Handling Errors: Similar to Recursive Descent parsers, error handling is essential in Predictive
parsers. Invalid inputs or syntax errors need to be identified and handled gracefully.
Grammar:
S → aABb
A → c | є
B → d | є
First Sets:
FIRST(S) = { a }
FIRST(A) = { c, є }
FIRST(B) = { d, є }
Follow Sets:
FOLLOW(S) = { $ }
FOLLOW(A) = { d, b }
FOLLOW(B) = { b }
Input:
acdb
Using Stack:
Stack      Input     Action
$S         acdb$     S → aABb
$bBAa      acdb$     match a
$bBA       cdb$      A → c
$bBc       cdb$      match c
$bB        db$       B → d
$bd        db$       match d
$b         b$        match b
$          $         accept
M-Table:
       |    a     |   b    |   c    |   d    |  $
  S    | S → aABb |        |        |        |
  A    |          | A → є  | A → c  | A → є  |
  B    |          | B → є  |        | B → d  |
THE END