Chapter 1-2 Compiler Design
Introduction
Programming languages are notations for describing
computations to people and to machines.
The world as we know it depends on programming
languages, because all the software running on all the
computers was written in some programming
language.
But, before a program can be run, it first must be
translated into a form in which it can be executed by a
computer.
The software systems that do this translation are called
compilers.
This course is about:
How to design and implement compilers.
Basic ideas that can be used to construct
translators for a wide variety of languages and
machines.
The principles and techniques for compiler design,
which also apply beyond compilers.
Phases of a Compiler
A compiler processes source code through several phases to
convert high-level programming code into machine code.
These phases are typically divided into two main parts: *Analysis*
and *Synthesis*. Here’s an overview of each phase:
Analysis: The analysis phase reads the source code and breaks it down into an
intermediate representation.
Synthesis: The synthesis phase takes the intermediate representation
from the analysis phase and transforms it into the target code.
1. Lexical Analysis (Scanner)
2. Syntax Analysis (Parser)
3. Semantic Analysis
4. Intermediate Code Generation
5. Code Optimization
6. Code Generation
7. Code Linking and Loading
Lexical Analysis (Scanner)
The compiler reads the source code as a stream of
characters and groups them into meaningful
sequences, called *tokens* (e.g., keywords,
identifiers, literals).
Removes whitespace and comments, and may perform
simple error checking.
Output: a stream of tokens.
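The scanning step described above can be sketched with regular expressions. This is a minimal illustration, not a production scanner: the token names and the tiny C-like token set in TOKEN_SPEC are assumptions made for the example.

```python
import re

# Hypothetical token patterns for a tiny C-like language (assumed for this sketch).
# Order matters: earlier patterns win when several could match at the same position.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*"),
    ("NUMBER", r"\d+"),
    ("KEYWORD", r"\b(?:int|return|if|else)\b"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=;()]"),
    ("SKIP", r"\s+"),
    ("MISMATCH", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (kind, text) tokens, dropping whitespace and comments."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind in ("SKIP", "COMMENT"):
            continue  # the scanner discards these
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {text!r}")  # simple error check
        tokens.append((kind, text))
    return tokens


print(tokenize("int sum = a + 42; // total"))
```

The character stream goes in and a token stream comes out; the comment and the spaces never reach the parser.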
Syntax Analysis (Parser)
Uses tokens from the lexical analyzer to create
a *parse tree* (or syntax tree) based on the
source language’s grammar.
Checks for syntactical errors and reports if the
code violates the language’s rules.
Output: a parse tree or abstract syntax tree (AST).
Semantic Analysis
Checks the parse tree for semantic errors,
ensuring that expressions, variable types, and
scope rules are valid.
May involve type checking, ensuring that operators
apply to compatible data types.
Output: an annotated syntax tree, with type
information added.
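As a rough sketch of the type-checking step, the function below walks an expression tree and reports operators applied to incompatible types. The nested-tuple tree encoding, the type names, and the rule that arithmetic requires two int operands are all assumptions chosen for illustration.

```python
# Hypothetical AST encoding (assumed for this sketch):
#   ("num", 3), ("str", "hi"), ("var", "sum"), or (operator, left, right)
def check_types(node, symbols):
    """Return the type of an expression node, or raise on a semantic error."""
    kind = node[0]
    if kind == "num":
        return "int"
    if kind == "str":
        return "string"
    if kind == "var":
        name = node[1]
        if name not in symbols:
            raise NameError(f"undeclared identifier {name!r}")
        return symbols[name]
    # Otherwise the node is a binary operation: (operator, left, right).
    op, left, right = node
    left_type = check_types(left, symbols)
    right_type = check_types(right, symbols)
    if left_type != "int" or right_type != "int":
        raise TypeError(f"operator {op!r} applied to {left_type} and {right_type}")
    return "int"


symbols = {"a": "int", "name": "string"}
print(check_types(("+", ("var", "a"), ("num", 3)), symbols))
```

A real semantic analyzer would also attach the computed type to each node, producing the annotated tree mentioned above.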
Intermediate Code Generation
Parse tree and abstract syntax tree for the expression 3 + 5 * 2

Parse Tree

            <expression>
           /     |      \
  <expression>  '+'    <term>
       |              /   |   \
       3         <term>  '*'  <factor>
                   |             |
                   5             2

Abstract Syntax Tree (AST)

      (+)
     /   \
    3    (*)
        /   \
       5     2
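The AST above for 3 + 5 * 2 can be built by a small recursive-descent parser. The sketch below assumes a grammar with only integers, '+' and '*', and represents each AST node as a nested tuple; both choices are simplifications made for the example.

```python
import re


def parse(text):
    """Parse sums and products of integers into a nested-tuple AST."""
    # Simplified scanner: anything other than digits, '+' and '*' is ignored.
    tokens = re.findall(r"\d+|[+*]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def factor():  # factor -> NUMBER
        token = peek()
        if token is None or not token.isdigit():
            raise SyntaxError(f"expected a number, got {token!r}")
        return int(advance())

    def term():  # term -> factor ('*' factor)*
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def expression():  # expression -> term ('+' term)*
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    tree = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse("3 + 5 * 2"))  # ('+', 3, ('*', 5, 2))
```

Because term is parsed below expression, '*' binds tighter than '+', which is why the multiplication ends up deeper in the tree.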
Symbol Table (Semantic Representation)
The compiler creates a *symbol table* that holds
information about identifiers (variables, functions,
etc.), including types, scopes, and memory locations.
This table is essential for semantic checks and
intermediate code generation.
Example: sum is an int variable, scoped locally within a
function.
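One common way to organize such a table is as a stack of scopes. The sketch below is an assumed, minimal design (the class and method names are hypothetical) that records only a type and a nesting depth for each identifier.

```python
class SymbolTable:
    """A stack of scopes; each scope maps an identifier to its attributes."""

    def __init__(self):
        self.scopes = [{}]  # index 0 is the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, type_):
        scope = self.scopes[-1]
        if name in scope:
            raise NameError(f"{name!r} already declared in this scope")
        scope[name] = {"type": type_, "depth": len(self.scopes) - 1}

    def lookup(self, name):
        # Search from the innermost scope outward.
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        raise NameError(f"{name!r} is not declared")


table = SymbolTable()
table.enter_scope()           # entering a function body
table.declare("sum", "int")   # sum is a local int variable
print(table.lookup("sum"))    # {'type': 'int', 'depth': 1}
table.exit_scope()            # sum is no longer visible after this
```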
Intermediate Representation (IR)
A lower-level, platform-independent representation
used between the high-level code and machine
code.
Common IRs include three-address code, quadruples,
and static single-assignment (SSA) form.
Example: sum = a + b in three-address code might look like
t1 = a + b; sum = t1.
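The translation into three-address code can be sketched as a post-order walk over an expression tree that introduces one temporary per operation. The nested-tuple tree format and the function name are assumptions made for this example.

```python
import itertools


def to_three_address(target, expr):
    """Lower `target = expr` to a list of three-address instructions."""
    code = []
    temps = itertools.count(1)  # t1, t2, ...

    def lower(node):
        # Leaves (names or constants) are used directly as operands.
        if not isinstance(node, tuple):
            return str(node)
        op, left, right = node
        left_name = lower(left)
        right_name = lower(right)
        temp = f"t{next(temps)}"
        code.append(f"{temp} = {left_name} {op} {right_name}")
        return temp

    result = lower(expr)
    code.append(f"{target} = {result}")
    return code


print(to_three_address("sum", ("+", "a", "b")))
# ['t1 = a + b', 'sum = t1']
```

Each emitted instruction has at most one operator and two operands, which is the defining property of three-address code.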
Assembly Code (Low-Level Representation)
Assembly code is a human-readable form of
machine instructions, specific to the architecture.
It represents operations directly supported by the
processor, such as MOV, ADD, and JMP.
Example (in x86 Assembly):
    MOV EAX, a
    ADD EAX, b
    MOV sum, EAX
Machine Code (Binary Representation)
The final compiled code is in binary format, consisting of
a series of 0s and 1s.
Machine code is executed directly by the CPU and is
specific to the hardware architecture.
Example: The machine code for MOV EAX, a might look like a
sequence of binary digits, depending on the CPU instruction set.
Compiler Construction Tools
Compiler construction tools are specialized software
utilities designed to aid in developing compilers by
automating various parts of the process.
These tools help in generating parts of the compiler,
such as lexical analyzers and parsers, without requiring
developers to write them from scratch.
Here are some key compiler construction tools:
1. Lexical Analyzer Generators
2. Parser Generators
3. Syntax-Directed Translation Engines
4. Intermediate Code Generators
5. Code Optimization Tools
6. Code Generation Tools
7. Debugger and Profiler Generators
8. Parser/Grammar Visualization Tools
9. Integrated Development Environments (IDEs) for
Compiler Construction
10. Automated Testing Tools
Lexical Analyzer Generators
Tools like Lex and Flex (the fast lexical analyzer generator) are
commonly used to create lexical analyzers.
These tools take a set of regular expressions that define
token patterns and automatically generate code to
identify tokens in the source code.
Example: Lex or Flex can generate a tokenizer for recognizing
keywords, identifiers, numbers, and symbols in a programming
language.
Parser Generators
Parser generators are tools that automate the creation of
parsers from a formal grammar.
A parser is a fundamental component in compilers and
interpreters, responsible for analyzing the syntactic structure of
source code to ensure it matches the grammar of the
programming language.
Tools like Yacc (Yet Another Compiler Compiler), Bison, ANTLR
(Another Tool for Language Recognition), and JavaCC (Java
Compiler Compiler) are widely used to create parsers.
They take a grammar definition, written in Backus-Naur Form
(BNF) or a similar notation, and generate code that can parse
input according to that grammar.
Between them, these tools cover both top-down and bottom-up
parsing: ANTLR and JavaCC generate LL-style (top-down) parsers,
while Yacc and Bison generate LALR (bottom-up) parsers.
Syntax-Directed Translation Engines
Syntax-Directed Translation (SDT) engines are mechanisms
used in compilers and interpreters to perform semantic
analysis and intermediate code generation by associating
specific semantic actions with grammar rules.
Tools like Yacc and ANTLR also support syntax-directed
translation, where semantic actions (code) are embedded in
the grammar.
These actions can be used to build syntax trees, generate
intermediate code, or populate symbol tables while parsing
the input.
Syntax-directed translation engines streamline the process of
associating grammar rules with actions that execute when
those rules are recognized.
Intermediate Code Generators
Some compiler tools have built-in support for
generating intermediate representations (IR) that
simplify code optimization and target code
generation.
LLVM (originally an acronym for Low Level Virtual Machine) is a
popular framework in modern compiler construction that provides
an IR, as well as various backends and optimizations for generating
machine code across different architectures.
Code Optimization Tools
Tools like LLVM and GCC’s optimization libraries
provide a suite of code optimizations that can be
applied to intermediate code to make the final
machine code more efficient.
These tools allow developers to focus on writing the
core of the compiler while leveraging existing, well-
tested optimization techniques, such as dead code
elimination, loop unrolling, and constant folding.
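Constant folding, one of the optimizations named above, can be sketched in a few lines. This is an illustrative pass over a nested-tuple expression tree (an assumed format), not how LLVM or GCC implement it.

```python
import operator

# Operators this sketch knows how to evaluate at compile time.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold(node):
    """Replace subtrees whose operands are all integer constants with their value."""
    if not isinstance(node, tuple):
        return node  # a constant or a variable name
    op, left, right = node
    left, right = fold(left), fold(right)
    if op in OPS and isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)
    return (op, left, right)


print(fold(("+", "a", ("*", 5, 2))))  # ('+', 'a', 10)
```

The multiplication 5 * 2 is computed once by the compiler, so the generated program only has to add 10 to a at run time.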
Code Generation Tools
LLVM also doubles as a code generation tool,
where the IR can be transformed into machine
code for various architectures.
Retargetable code generators (like GCC and
LLVM) can convert intermediate
representations into the machine code of
different CPUs, making the compiler more
versatile.
Debugger and Profiler Generators
Tools like GDB (GNU Debugger) and Valgrind
provide debugging and profiling capabilities.
Some compilers integrate debug information
generators that allow code to be debugged
at the source level, tracking variables,
functions, and line numbers.
Parser/Grammar Visualization Tools
JFLAP (Java Formal Languages and Automata
Package) and other similar tools allow
visualization of parsing and automata, helping
compiler developers better understand the
grammar and its ambiguities.
Visualization tools are helpful for debugging
parser conflicts, like shift/reduce or
reduce/reduce conflicts in LR parsers.
Integrated Development Environments (IDEs) for Compiler Construction
IDEs such as Eclipse IDE with Xtext and IntelliJ
IDEA (when used with appropriate plugins) offer
support for creating domain-specific languages
(DSLs) and language processing features like
syntax highlighting and code completion.
These IDEs simplify compiler construction by
providing editing and debugging support within a
more user-friendly environment.
Automated Testing Tools
Tools like DejaGnu and compiler torture test suites help in
validating compiler implementations by testing for
compliance with language specifications and standard
behaviors.
Automated testing tools are essential to ensure that the
compiler correctly translates programs across a wide
range of cases.
Chapter 2
Computer Threat
Computer threats are malicious activities or software designed to harm,
disrupt, or gain unauthorized access to computer systems.
They can lead to data loss, privacy breaches, and system damage.
Malicious code refers to software or scripts intentionally created to
cause harm, steal data, or compromise system security.
Trojan Horses: programs that are downloaded and installed on a computer and appear harmless.
A Trojan horse appears to be legitimate software but contains hidden malicious functions.
Unlike viruses, Trojans do not replicate themselves but rely on user actions to execute.
Users are tricked into installing them, thinking they are harmless.
Trojans are often used to create backdoors, steal information, or download other malware.
Trojan Horses: Examples
Zeus Trojan: Used to steal banking information by logging keystrokes and capturing sensitive
data.
Emotet: Initially a banking Trojan, later used to deliver other malware.
Remote Access Trojan (RAT): Allows attackers to control a victim's system remotely, often
used for spying.
Worms
A worm is standalone malware that replicates itself to spread across networks.
Worms do not need user action to spread; they exploit vulnerabilities in networks or software.
They can cause network congestion, system crashes, and resource exhaustion.
Examples
Morris Worm: One of the first internet worms; caused widespread network disruption.
Conficker Worm: Exploited Windows vulnerabilities to infect millions of systems, causing
network issues.
Stuxnet Worm: Targeted industrial control systems, notably damaging Iran's nuclear facilities.
Spyware
Software that secretly collects user information without their knowledge.
Monitors user activity, gathers data, and may record keystrokes.
Invades privacy, can lead to identity theft, and slows down system
performance.
Examples
CoolWebSearch: Redirected browsers to unwanted websites, tracking
user activities.
Keyloggers: Monitor and record keystrokes to steal sensitive
information like passwords.
Adware (e.g., Gator): Displayed intrusive ads while secretly collecting
user data.
Adware
Adware is software that displays unwanted advertisements to generate
revenue for the developer.
It may redirect users to advertising sites, show pop-up ads, or add
banners to browsers.
While some adware is relatively harmless and funds free software, others
can be intrusive, slowing down devices and consuming system resources.
Adware can also collect data about user preferences to target ads but is
generally less invasive than spyware.
Phishing
Phishing is the act of impersonating a trusted person or organization to steal sensitive information. Common targets include
passwords, financial data, and system credentials.
How Phishing Works
Attackers may send malicious links or attachments. These can infect systems with malware or trick individuals into
revealing sensitive information.
Example
Companies like Mastercard can lose millions due to successful phishing attacks, which put both business operations and
employee safety at risk.
Threat vs. Attack
Threat | Attack
Threats can be intentional or unintentional. | The attack is intentional.
Information may or may not be altered or damaged. | The chance for information alteration and damage is very high.
Types of Attack
Active Attacks: attempts to change system resources or influence their operation.
Passive Attacks: attempts to learn or retrieve sensitive data from a system
without affecting its resources.
Primary Classes of Attack
Access
Reconnaissance
Denial of Service (DoS)
Access
System access refers to an intruder gaining access to a device for which the intruder has no account or password.
Access attacks are unauthorized attempts to gain access to a network or its resources.
Class of Attacks
Access attacks can be:
External Attacks:
Conducted by outside individuals or groups.
The methods used include hacking, phishing, and exploiting vulnerabilities.
The goal is to steal confidential data or disrupt services.
Internal Attacks:
Conducted by trusted, internal users.
Can involve accessing unauthorized areas out of curiosity or malicious intent.
The goal is sabotage, data theft, or misuse of resources.
Unauthorized access attacks are attempted via four means:
password attacks, trust exploitation, port redirection, and man-in-the-middle attacks.
All of these try to bypass some facet of the authentication process.
Access Attacks
Password Attacks
Attackers use techniques like brute force, dictionary attacks, or credential stuffing to guess or
crack passwords.
Example: Hackers repeatedly try common or stolen passwords until they find one that works.
Prevention: Use strong, unique passwords and multi-factor authentication (MFA).
Trust Exploitation
Occurs when attackers abuse established trust relationships between systems.
Example: An attacker compromises a server in the demilitarized zone (DMZ) to access the
internal network, exploiting trust between systems.
Prevention: Restrict trust relationships and regularly monitor access.
Port Redirection
Involves redirecting traffic from a secure port to an unauthorized one, bypassing security
controls like firewalls.
Example: An attacker uses a compromised internal machine to redirect traffic through a port
that is otherwise blocked.
Prevention: Implement strict firewall rules and monitor network traffic for anomalies.
Man-in-the-Middle (MitM) Attacks
An attacker intercepts and alters communication between two parties
without their knowledge.
Attackers may impersonate one or both parties involved in the
communication.
Decryption
The intercepted data is captured and decoded, allowing attackers to
steal or alter sensitive information, such as login credentials or
financial details.
Reconnaissance Attacks
Reconnaissance is the act of gathering information about a target before launching an
attack.
Important information that can be compiled during a reconnaissance attack includes the
following:
• Ports open on a server
• Ports open on a firewall
• IP addresses on the host network
• Hostnames associated with the IP addresses
Four common tools used in reconnaissance attacks to gather network data are
packet sniffers, ping sweeps, port scans, and information queries.
Ping Sweeps
- Send echo requests to multiple IP addresses to identify active hosts.
- Useful for network mapping.
Port Scans
- Scan the network for open ports to identify running applications.
- Help find vulnerabilities linked to specific ports.
Information Queries
- Resolve hostnames to IP addresses, or vice versa, using tools like nslookup.
- Useful for gathering network information.
Denial of Service (DoS) Attacks
A DoS attack prevents legitimate users from accessing information systems, devices, or
network resources by overwhelming the targeted host or network with excessive traffic.
A DoS attack can use different methods to overwhelm systems, leading to service
disruption or complete outages.
Affected Services:
Email accounts.
Websites.
Online accounts (e.g., banking).
Other services relying on the affected network.
Denial of Service (DoS) Attacks vs. Distributed Denial of Service (DDoS) Attacks
Key Takeaway
DoS Attacks: Simpler but less powerful.
DDoS Attacks: More complex, harder to defend against, and significantly more damaging.
Program Flaws
Buffer Overflows
• Implications:
Affects all software types; can lead to unpredictable behavior, memory access
errors, or crashes.
• Example:
A buffer for an 8-byte username receives a 10-byte input, overwriting memory
past the buffer.
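The 8-byte username example can be simulated safely in Python. Python itself bounds-checks its buffers, so the sketch below models memory as a single bytearray in which the username buffer is followed by adjacent data; the layout, the "SECRET!!" contents, and the function names are assumptions made for illustration.

```python
USERNAME_SIZE = 8


def make_memory():
    """Simulated memory: an 8-byte username buffer followed by adjacent data."""
    return bytearray(USERNAME_SIZE) + bytearray(b"SECRET!!")


def unchecked_copy(memory, data):
    """Mimics a C-style copy that trusts the input length (unsafe)."""
    memory[0:len(data)] = data


def checked_copy(memory, data):
    """Rejects input that does not fit in the username buffer."""
    if len(data) > USERNAME_SIZE:
        raise ValueError(f"input is {len(data)} bytes; buffer holds {USERNAME_SIZE}")
    memory[0:len(data)] = data


memory = make_memory()
unchecked_copy(memory, b"A" * 10)      # 10-byte input into an 8-byte buffer
print(bytes(memory[USERNAME_SIZE:]))   # b'AACRET!!' (adjacent data corrupted)
```

The two extra bytes land in whatever sits after the buffer. The checked version refuses the input instead, which is the essence of the bounds checking that safer languages perform automatically.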
Types of Buffer Overflow Attacks
• Stack-based Buffer Overflows
  • Most common type.
  • Exploits stack memory, which exists only during function execution.
• Heap-based Buffer Overflows
  • More complex and harder to execute.
  • Floods the memory allocated for a program beyond the memory used for current runtime operations.
Vulnerable Programming Languages
Highly Vulnerable:
C and C++: No built-in safeguards against memory overwriting. Mac OSX, Windows, and
Linux all contain code written in these languages.
Less Vulnerable:
PERL, Java, JavaScript, C#: Use built-in safety mechanisms to reduce buffer overflow risks.
How to Prevent Buffer Overflows
• Developer Measures:
  • Implement security practices in code.
  • Use programming languages with built-in protection.
• Operating System Protections:
  • Address Space Layout Randomization (ASLR): Randomizes memory address locations,
    making it difficult for attacks to target specific executable code.
  • Data Execution Prevention: Flags memory areas as non-executable to prevent code
    execution in those regions.
  • Structured Exception Handler Overwrite Protection (SEHOP): Protects against
    exploiting Structured Exception Handling (SEH) via buffer overflows.
Time-of-Check to Time-of-Use (TOCTOU) Flaws
Time-of-check to time-of-use (TOCTOU) refers to a class of software bugs that occur due to race
conditions.
This happens when a system checks the state of a component (e.g., security credential) and then uses that
result without ensuring it remains valid.
• Meaning:
The gap between the time a condition is checked and the time it is used can allow other
processes to alter the state, leading to potential vulnerabilities.
• Occurrence:
Common in Unix systems, especially during file operations.
Can also arise in local sockets and due to improper database transaction handling.
• Historical Examples:
BSD 4.3 UNIX: Had a race condition in its mail utility for temporary files using the mktemp()
function.
OpenSSH: Early versions experienced race conditions with Unix domain sockets.
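The file-operation case can be sketched as follows. The first function separates the check from the use, leaving a window for another process to act; the other two let the operating system perform the check and the creation as a single step. The function names are hypothetical, and this is an illustration of the pattern rather than a complete defense.

```python
import os
import tempfile


def create_racy(path, data):
    """Check-then-use: another process can create `path` between the two steps."""
    if not os.path.exists(path):          # time of check
        with open(path, "w") as handle:   # time of use
            handle.write(data)
        return True
    return False


def create_atomic(path, data):
    """Check and create in one step: mode 'x' fails if the file already exists."""
    try:
        with open(path, "x") as handle:
            handle.write(data)
        return True
    except FileExistsError:
        return False


def write_temp_file(data):
    """mkstemp creates and opens the file atomically, unlike mktemp, which only picks a name."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w") as handle:
        handle.write(data)
    return path
```

With create_atomic there is no moment at which the program holds a stale answer to the question "does this file exist?", so the race window is closed.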
Types of Defenses
Software Development Controls: Secure coding practices, code reviews, and thorough
testing techniques.
Database Management Systems Security: Access controls, encryption, and regular
updates to protect sensitive data.
Controls to Protect Against Program Flaws in Execution
•Mechanisms and practices designed to reduce the risk of exploitation of program flaws
during execution.
Key Controls:
Operating System Support: Utilization of security features in the OS, such as memory
protection and user permissions.
Administrative Controls: Policies and procedures that enforce secure configurations and
access restrictions.