Week-4 (Lecture-2)

The document discusses approaches to reverse engineering, including data gathering techniques like lexical and syntactic analysis of source code as well as control flow and data flow graphing. It provides examples and explanations of these techniques.

Reverse Engineering

Week-4
Reverse Engineering
• Reverse Engineering supports understanding of a system
through identification of the components or artifacts of the
system, discovering relationships between them and
generating abstractions of that information.

• The goal of reverse engineering is not to alter the system in any way.
Reverse Engineering Activities
The three main Reverse Engineering activities:

1. Data Gathering
2. Knowledge Organization
3. Information Exploration
Reverse Engineering Activities
1. Data Gathering

Raw data is used to identify a system’s artifacts and relationships.
Data Gathering
Approaches to Automating Reverse Engineering

• A variety of automated approaches are available to assist the reverse engineer in program comprehension.
• Some of the prominent approaches include:

1. Textual, lexical and syntactic analysis
• These approaches focus on the source code itself and its representations.
• These include the use of lexical metrics (counting assignments, identifiers, etc.) and even automated parsing of the code.
• The unit of examination is the program source itself.
Data Gathering
• Textual, lexical and syntactic analysis

• Lexical analysis is the process of decomposing the sequence of characters in the source code into its constituent lexical units.

• A program performing lexical analysis is called a lexical analyzer, and it is part of a programming language’s compiler.

• Typically, it uses rules describing lexical program structures that are expressed in a mathematical notation called regular expressions.
Data Gathering
Textual, lexical and syntactic analysis
Data Gathering
Lexical and syntactic analysis:
• Tokenization: Lexical analysis involves breaking down the source code
or binary file into a stream of tokens. Tokens are the smallest units of
the code, such as keywords, identifiers, constants, and operators.

• Whitespace and Comments Handling: Lexical analysis also deals with whitespace and comments. Removing or ignoring these elements simplifies the code, making it easier to work with.
• Reverse engineering:
• Lexical analysis helps in understanding the basic structure of the code,
identifying keywords, and recognizing variables and functions.

• It aids in identifying potential vulnerabilities or suspicious code patterns by extracting relevant information.
Data Gathering
Lexical and syntactic analysis:
• Parsing: Syntactic analysis, also known as parsing, checks whether the
sequence of tokens adheres to the grammar rules of the programming
language. It builds a hierarchical structure, often represented as a parse tree
or abstract syntax tree (AST).

• Reverse engineering:
• Syntactic analysis is essential for reconstructing a high-level representation of
the code. This representation makes it easier to understand and modify the
code.

• By analyzing the AST or parse tree, reverse engineers can identify control
flow structures, data structures, and relationships between different parts of
the code.

• It helps in identifying functions and their parameters, which is crucial for understanding the code's functionality.
Data Gathering
Lexical and syntactic analysis:
• Lexeme
– A sequence of characters in the source program with the lowest level of syntactic meaning
– E.g., sum, +, -
• Token
– A category of lexemes
– A lexeme is an instance of a token
– Tokens are the basic building blocks of programs
Data Gathering

Lexical and syntactic analysis:

result = oldsum - value / 100;

What are the tokens and lexemes of this statement?
Data Gathering
Lexical and syntactic analysis:
Assignment Task
• Parse the code to understand its structure and create an AST of the
code.
Data Gathering
Approaches to Automating Reverse Engineering

2. Graphing methods
• There are a variety of graphing approaches for program understanding.
• These include, in increasing order of complexity and richness:
• graphing the control flow of the program,
• the data flow of the program,
• and program dependence graphs.

• The unit of examination is a graphical representation of the program source.


Data Gathering
Graphing Method
• Static Source Code Analysis

• Static analysis of a program is the analysis of its code without regard to its execution or input.

• What analyses are useful for understanding:

• Control flow analysis: what pieces of the code would be executed and in what sequence

• Data flow analysis: how information flows within a program and across programs
Control Flow – Introduction
• Control Flow
• Used to identify the possible paths through the program
• The flow is represented as a directed graph with splits and joins
• Identify loops

• Control Flow is represented as a graph of Basic Blocks
• A basic block is a sequence of operations with one entry and one exit (usually a sequence of statements)
• There is a unique start point where the program begins
• An edge between basic blocks shows the flow
Control Flow Analysis
• The two kinds of control flow analysis are:

1. Intraprocedural: It shows the order in which statements are executed within a subprogram.
2. Interprocedural: It shows the calling relationships among program units.

Intraprocedural analysis:
• The idea of basic blocks is central to constructing a CFG.
• A basic block is a maximal sequence of program statements such that execution
enters at the top of the block and leaves only at the bottom via a conditional or an
unconditional branch statement.
• A basic block is represented with one node in the CFG, and an arc indicates possible
flow of control from one node to another.
Control Flow Analysis
Interprocedural analysis:
• Interprocedural analysis is performed by constructing a call graph.

• Calling relationships between subroutines in a program are represented as a call graph, which is basically a directed graph.

• Specifically, a procedure in the source code is represented by a node in the graph, and an edge from node f to node g indicates that procedure f calls procedure g.

• Call graphs can be static or dynamic. A dynamic call graph is an execution trace of
the program.

• Thus, a dynamic call graph is exact, but it only describes one run of the program.

• On the other hand, a static call graph represents every possible run of the program.
Control Flow Analysis
• An approach that avoids the burden of annotations, and can capture
what a procedure actually does as used in a particular program, is
building a control flow graph for the entire program, rather than just
one procedure.

• To make this work, we handle call and return instructions specially as follows:

• We add additional edges to the control flow graph. For every call to function g,
we add an edge from the call site to the first instruction of g, and from every
return statement of g to the instruction following that call.
Control Flow Graph
Flow Graphs of various blocks
Control Flow – Example
Control Flow – Code View
• Another example of visualizing the control flow of a program is using a Control
Structure Diagram (CSD).

• CSD is an algorithmic-level graphical representation of software source code.

• It automatically documents the program flow within the source code and adds
indentation with graphical symbols

• The following notations are used:
• Sequential flow – straight line
• If/Then/Else/Switch statements – diamonds
• For/While loops – elongated loop symbol
• Loop exit – arrow
• Function – open-ended box
CSD Example
CSD Program Components
CSD Control Constructs
• The basic control constructs are grouped into the following
categories:

• Sequence
• Selection
• Iteration
• Exception Handling
CSD Control Constructs
Data Flow Graph: Data Analysis
• All control edges together form a graph called the Control Flow Graph
(CFG).
• All data edges together form a graph called the Data Flow Graph (DFG).
• A data flow graph shows what kind of information will be
• input to and output from the system,
• where the data will come from and go to,
• where the data will be stored.

• A data flow graph is information oriented.

• It passes data between components.

Example
Example
int max(int a, int b) {
    if (a > b)
        r = a;
    else
        r = b;
    return r;
}

• The data flow graph (DFG) must be derived after the control flow graph.
• The data flow graph has the same set of nodes as the control flow graph.
• The data flow graph requires identification of the data dependencies between every node.
Example
• Start by annotating, with each node, what variables are read (R) and what variables are written (W).

                            //        a   b   r
int max(int a, int b) {     // 1      W   W
    if (a > b)              // 2      R   R
        r = a;              // 3      R       W
    else
        r = b;              // 4          R   W
    return r;               // 5              R
}
Example
• Next, we draw each data dependency.
• A data dependency goes from a node that writes into a variable to
another node that reads from the variable.
• To have a valid dependency, we must identify the correct ‘write’ node
for each ‘read’ node. That is done as follows.

• Start with a node that reads from a variable. For example, node 3 in the
example reads variable a. That read operation is the endpoint of a data
dependency.

• Next, walk backward in the control flow graph until you find a node that
writes the same variable. That is the starting point of a data dependency. For
example, going backward from node 3, we visit node 2, and then node 1. Only
node 1 writes a. Therefore, the data dependency for a goes from 1 to 3.
Example
• Data Flow graph
Class Activity
Definition-Use Pairs
• A def-use (du) pair associates a point in a program where a value is produced with a point where it is used.

• Definition: where a variable gets a value
– Variable declaration
– Variable initialization
– Assignment
– Values received by a parameter
• Use: extraction of a value from a variable
– Expressions
– Conditional statements
– Parameter passing
– Returns
Definition-Use Pairs
Definition-Use Pairs
Definition-Clear & Killing

• A definition-clear path is a path along the CFG from a definition to a use of the same variable without another definition of the variable in between.

• If, instead, another definition is present on the path, then the latter definition kills the former.

• A def-use pair is formed if and only if there is a definition-clear path between the definition and the use.
Definition-Clear & Killing
(Direct) Data Dependence Graph
Control Dependence
