1.1 Static Reverse Engineering
online material
Part 1
2. Role of Entry Points in Code Discovery & How They Are Identified
Disassembly entry points are addresses in the binary where a disassembler starts
decoding instructions. Correct identification is crucial for accurate disassembly.
Common sources of entry points:
1. Executable header information (e.g., the main entry point).
2. Exported symbol information (function symbols).
3. Debug information (if available).
4. Pattern matching of function prologues (common in advanced disassemblers).
5. User or script annotations (in interactive tools like IDA Pro, Ghidra, Binary Ninja).
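The prologue-matching heuristic (point 4) can be sketched as a simple byte scan. The pattern below is the common x86-64 prologue push rbp; mov rbp, rsp (55 48 89 E5); the function name and the toy blob are invented for illustration, and real tools combine many patterns and validate candidates further:

```python
# Hedged sketch: find candidate function entry points by scanning for a
# common x86-64 prologue (push rbp; mov rbp, rsp = 55 48 89 E5).
# Heuristic only: optimized or hand-written code may lack this prologue.

PROLOGUE = bytes([0x55, 0x48, 0x89, 0xE5])

def find_prologue_candidates(code: bytes, base: int = 0) -> list[int]:
    """Return the addresses where the prologue byte pattern occurs."""
    hits = []
    start = 0
    while (idx := code.find(PROLOGUE, start)) != -1:
        hits.append(base + idx)
        start = idx + 1
    return hits

# Toy "binary": two prologues embedded among other bytes.
blob = bytes([0x90, 0x90]) + PROLOGUE + bytes([0xC3]) + PROLOGUE + bytes([0xC3])
print([hex(a) for a in find_prologue_candidates(blob, base=0x1000)])
# → ['0x1002', '0x1007']
```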
3. Linear Sweep vs. Recursive Descent Approaches
1. Linear Sweep
Disassembles instructions starting at a known address and continues sequentially, one instruction after the next.
Simple but prone to interpreting data as code (over-approximation).
Typically skips bytes that fail to decode but continues scanning forward.
2. Recursive Descent
Starts from known entry points; follows only the “real” control flow (including
direct/conditional branches).
When a branch or call is encountered, it adds the target to a “work list” and
disassembles from there.
Can under-approximate if it fails to detect some indirect jumps/calls.
More precise for typical code but misses any code not reached by the discovered
control-flow edges.
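The contrast between the two strategies can be sketched on an invented fixed-width toy ISA (all opcodes and the memory layout below are made up):

```python
# Toy 2-byte ISA: each slot is (opcode, operand). Opcodes: 0=NOP,
# 1=JMP target, 2=JZ target (conditional: also falls through), 3=RET.
# Everything here is hypothetical, just to contrast the strategies.

def decode(mem, addr):
    return mem[addr], mem[addr + 1]

def linear_sweep(mem, start=0):
    """Return the address of every 2-byte slot, treating all bytes as code."""
    return list(range(start, len(mem), 2))

def recursive_descent(mem, entry):
    """Follow control flow from the entry point via a work list."""
    seen, work = set(), [entry]
    while work:
        addr = work.pop()
        if addr in seen or addr >= len(mem):
            continue
        seen.add(addr)
        op, arg = decode(mem, addr)
        if op == 1:                      # JMP: only the target is reachable
            work.append(arg)
        elif op == 2:                    # JZ: target and fall-through
            work.extend([arg, addr + 2])
        elif op == 3:                    # RET: no successors
            pass
        else:                            # NOP etc.: fall through
            work.append(addr + 2)
    return sorted(seen)

# Layout: 0: JMP 6 | 2: <data> | 4: <data> | 6: RET
mem = bytes([1, 6, 0xEE, 0xFF, 0xAA, 0xBB, 3, 0])
print(linear_sweep(mem))          # → [0, 2, 4, 6]  (data decoded as code)
print(recursive_descent(mem, 0))  # → [0, 6]        (data skipped)
```

Note how the linear sweep over-approximates (addresses 2 and 4 are data) while recursive descent visits only addresses reached by discovered control-flow edges.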
After basic code discovery (linear sweep or recursive descent), control-flow recovery must deal with several complications:
Overlapping Instruction Streams
On variable-width ISAs, one sequence of bytes can decode into multiple different overlapping instruction streams, depending on where the decoder starts.
Obfuscation Trick: Attackers can intentionally craft code such that one region of memory
can be decoded in two (or more) ways, leading to confusion in disassemblers.
Different Disassembler Policies:
Some (IDA Pro) insist that each byte belongs to exactly one instruction stream once it’s
recognized as code.
Others (Binary Ninja) allow a single byte to be part of multiple possible disassembly
paths.
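A minimal illustration of the trick, using the well-known x86 byte sequence EB FF C0 (the two encodings are real x86 instructions; the script below merely slices the bytes):

```python
# The classic x86 overlap trick:
#   starting at offset 0: EB FF -> jmp $+1  (jumps INTO its own second byte)
#   starting at offset 1: FF C0 -> inc eax
# The same byte 0xFF belongs to two valid instruction streams, depending
# on where decoding starts.
code = bytes([0xEB, 0xFF, 0xC0])

stream_from_0 = code[0:2]   # b'\xeb\xff' = jmp $+1
stream_from_1 = code[1:3]   # b'\xff\xc0' = inc eax
print(stream_from_0.hex(), stream_from_1.hex())  # → ebff ffc0

# Byte at offset 1 (0xFF) is shared by both decodings:
assert code[1] in stream_from_0 and code[1] in stream_from_1
```

A one-instruction-stream-per-byte policy (IDA Pro) must pick one of the two decodings; a multi-stream policy (Binary Ninja) can keep both.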
Iterative CFG/Data-Flow Refinement
What it is: iteratively apply both data flow analysis and control flow analysis, refining each other over multiple rounds:
1. Initially, a rough CFG is formed (e.g., ignoring complex indirect jumps).
2. Data flow analysis (like value set analysis) can resolve some unknown jump targets
(e.g., switch-case tables).
3. The improved knowledge of jump targets refines the CFG.
4. The refined CFG in turn refines the data flow analysis.
5. Repeat as needed.
Particular Example: Switch dispatchers using computed jumps. A global data flow analysis
might figure out possible jump destinations, adding them as new code blocks in the CFG.
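A minimal sketch of the fixed-point loop, with an invented one-jump program and a "data flow pass" that simply reads a jump table (real value-set analysis is far more involved):

```python
# Hedged sketch of iterative refinement: an indirect jump through a
# switch table is opaque at first; a trivial "data flow pass" reads the
# table from the data section and its targets become new CFG edges.
# Repeat until nothing changes (fixed point). All names are invented.

insns = {
    0: ("cmp", None),
    1: ("jmp_table", 100),   # indirect jump via table at address 100
    10: ("ret", None),
    20: ("ret", None),
}
data = {100: [10, 20]}       # switch-case jump table in the data section

def build_cfg(resolved):
    """CFG edges using direct flow plus already-resolved indirect targets."""
    return {0: [1], 1: resolved.get(1, [])}

def data_flow_resolve(insns, data):
    """Stand-in for value-set analysis: read jump-table contents."""
    out = {}
    for addr, (op, arg) in insns.items():
        if op == "jmp_table":
            out[addr] = data[arg]
    return out

resolved = {}
while True:
    cfg = build_cfg(resolved)
    new = data_flow_resolve(insns, data)
    if new == resolved:
        break                # fixed point: CFG and data flow agree
    resolved = new
print(cfg)                   # → {0: [1], 1: [10, 20]}
```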
Part 2
2. Why Constant Propagation, Dead Code Elimination, and Copy Propagation Are Needed
Constant Propagation:
Replaces register references that hold known constant values with those constants in
subsequent IR operations.
Simplifies expressions and often enables further optimizations.
Dead Code Elimination:
Removes instructions that compute values never used.
E.g., certain condition flags or registers might be set but not read afterward.
Copy Propagation:
Eliminates redundant moves of the same value between registers.
E.g., replaces "mov rA, rB; use rA" with "use rB" when no conflicting redefinition intervenes.
All these steps are essential to produce a readable, high-level IR and to remove low-level
assembly artifacts.
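A sketch of these passes on an invented three-address IR, assuming SSA form (each register assigned exactly once) so that forward propagation is trivially sound; real decompiler IRs and passes are far richer:

```python
# Toy three-address IR: (dest, op, args). Assumes SSA form. All opcode
# names are invented for illustration.

def propagate(ir):
    """Constant + copy propagation in one forward pass."""
    env, out = {}, []
    for dest, op, args in ir:
        args = tuple(env.get(a, a) for a in args)  # substitute known values
        if op in ("const", "copy"):
            env[dest] = args[0]                    # remember dest's value
        out.append((dest, op, args))
    return out

def dead_code_elim(ir, live_out):
    """Backward pass: drop instructions whose result is never used."""
    live, out = set(live_out), []
    for dest, op, args in reversed(ir):
        if dest in live or op == "store":          # keep side effects
            out.append((dest, op, args))
            live.discard(dest)
            live.update(a for a in args if isinstance(a, str))
        # else: result unused -> instruction deleted
    return out[::-1]

ir = [
    ("r1", "const", (7,)),
    ("r2", "copy",  ("r1",)),
    ("r3", "add",   ("r2", "r2")),   # becomes add 7, 7
    ("r4", "add",   ("r1", "r1")),   # dead: r4 is never used
    (None, "store", ("r3",)),
]
opt = dead_code_elim(propagate(ir), live_out=[])
for insn in opt:
    print(insn)
# Only ('r3', 'add', (7, 7)) and the store survive: the const/copy
# instructions became dead once their values were propagated.
```

Note the synergy: propagation alone leaves the const/copy instructions in place, but it makes them dead, which dead code elimination then removes.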
Decompilation relies on heuristics at several points:
1. Function boundary identification (especially in stripped binaries).
2. Stack frame analysis (guessing which parts of the stack are local variables, saved
registers, parameters, etc.).
3. Type inference (constraint solving may be incomplete or ambiguous).
4. Control flow structuring (mapping low-level jumps to high-level loops and
conditionals).
5. Name inference (guessing function/variable names).
Potential issues:
Incorrect function boundaries (merging two functions into one, or splitting one into
multiple).
Incorrect type assignments leading to nonsensical decompiled code.
Ugly or confusing high-level control structures (unreadable spaghetti code even at the
source level).
Wrong variable or function naming.
Interactive decompilers allow humans to fix these by manually renaming, retyping variables, etc.
Part 3
1. How & Why Compiled Java Code Differs from Compiled Native Code (Easier
Disassembly/Decompilation)
Java bytecode:
Produced by a Java compiler that does minimal low-level optimization.
Typically retains a structure closely matching the source code (one-to-one mapping of
many constructs).
The Java Virtual Machine (JVM) does further Just-In-Time (JIT) optimizations at
runtime, so the compiler doesn’t do heavy optimization upfront.
Native code (e.g., x86):
Produced by a native compiler (GCC, Clang, MSVC) that can heavily optimize at a low
level (inlining, loop unrolling, constant folding, strange register usage, etc.).
Often has fewer recognizable remnants of the original high-level structure, making
reverse-engineering more difficult.
Thus, decompiling Java bytecode to something close to the original Java source is usually much easier than going from x86 assembly back to C/C++.
Semantic gap: The difference in abstraction between the original high-level source and the
low-level (or bytecode-level) representation.
For Java:
Bytecode still uses instructions like new + <init> calls (constructor) that map
closely to Java’s object creation.
However, Java bytecode also allows unstructured goto instructions and some
transformations that don’t look so clean in source code (especially in obfuscated
bytecode).
For native code:
High-level structures (for loops, exceptions) can be compiled into complicated sets of jumps, registers, stack manipulations, etc.
High-level variables may be split or merged across multiple registers/stack slots.
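The gap can be illustrated by writing the same loop twice: once structured, once as explicit basic blocks connected by jumps, which is roughly the shape a decompiler has to recover structure from (the block names and the tiny dispatch loop are invented):

```python
# Same computation, structured vs. unstructured. The "unstructured"
# version mimics compiler output: labeled basic blocks plus jumps.

def structured(n):
    total = 0
    for i in range(n):
        total += i
    return total

def unstructured(n):
    """Sum 0..n-1 as explicit basic blocks with conditional jumps."""
    state = {"i": 0, "total": 0, "n": n}

    def head(s):                        # loop test block
        return "body" if s["i"] < s["n"] else "exit"

    def body(s):                        # loop body + increment, jump back
        s["total"] += s["i"]
        s["i"] += 1
        return "head"

    blocks = {"head": head, "body": body}
    label = "head"
    while label != "exit":              # simulate jumps between blocks
        label = blocks[label](state)
    return state["total"]

print(structured(5), unstructured(5))   # → 10 10
```

Control flow structuring in a decompiler is essentially the reverse direction: recognizing that the head/body jump pattern is a for loop.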
Listing Window
Displays assembly code with fields such as addresses, bytes, mnemonics, operands, comments,
and labels. It allows tracking cross-references (XREFs) for navigation within the code.
Decompiler Window
Presents decompiled C-like code for easier readability and comprehension. This window
synchronizes with the Listing and Function Graph windows.
Function Graph Window
Provides a visual representation of function control flow through a Control Flow Graph (CFG), making it easier to understand function structure and execution paths.
Program Tree Window
Shows the hierarchical organization of memory segments in the binary, aiding in code structuring
and navigation.
Symbol Table Window
Lists all symbols within the binary, including functions, variables, and labels, and allows tracking of references.
Function Call Trees Window
Displays a tree structure of function callers and callees, providing insights into function dependencies.
Function Call Graph Window
Offers a graphical overview of function calls throughout the binary, with options to expand deeper into the call hierarchy.
Data Type Manager Window
Manages data types used in the program and allows defining custom structures for better analysis.
Bytes Window
Shows raw memory data in formats such as hexadecimal, ASCII, and binary, assisting in the analysis of encoded or packed data.
Search Features
Search Program Text – Finds labels, comments, and symbols within the program text.
Search Memory – Searches for byte patterns, strings, and numerical values.
Search for Strings – Identifies potential strings within memory with options for null
termination.
Search for Scalars – Locates scalar values such as constants.
Search for Direct References – Finds occurrences where an address is explicitly
referenced.
Search for Address Tables – Detects sequences of addresses used in lookup tables.
Search for Matching Instructions – Identifies specific opcode patterns.
Search for Instruction Patterns – Finds repeating instruction sequences using masks.
Q&A
Function ID (FID)
Purpose:
Identifies known code fragments in a binary by comparing against a library of known
patterns.
Useful for recognizing common library functions and previously analyzed code to speed
up reverse engineering.
Inputs:
A new binary and a catalog of known code fragments (e.g., libraries, previously
analyzed code).
Outputs:
Matches found in the binary with corresponding known patterns, allowing reuse of
knowledge such as function names and annotations.
Operation:
Uses masked byte sequences to ignore variable parts like addresses.
Hashes are generated and compared to detect known patterns.
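A sketch of the masked-hash idea (the offsets, byte patterns, and choice of SHA-256 below are invented for illustration; the point is only that address bytes are zeroed before hashing, so relocated copies of the same code still match):

```python
# Hedged sketch: mask the bytes that encode addresses, then hash, so two
# copies of the same function linked at different addresses compare equal.
import hashlib

def masked_hash(code: bytes, address_offsets: set[int]) -> str:
    masked = bytes(0 if i in address_offsets else b
                   for i, b in enumerate(code))
    return hashlib.sha256(masked).hexdigest()

# The "same" function at two link addresses: only the 4 operand bytes of
# the call differ (offsets 1-4 in this toy layout).
f1 = bytes([0xE8, 0x11, 0x22, 0x33, 0x44, 0xC3])   # call ...; ret
f2 = bytes([0xE8, 0xAA, 0xBB, 0xCC, 0xDD, 0xC3])   # call ...; ret
addr_bytes = {1, 2, 3, 4}

assert masked_hash(f1, addr_bytes) == masked_hash(f2, addr_bytes)
print("match despite different call targets")
```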
Binary Diffing
Purpose:
Compares two binaries to identify differences, such as security patches, changes in
functionality, or malware evolution.
Inputs:
Two binaries (e.g., patched vs. unpatched versions).
Outputs:
Reports on identical and differing functions, code blocks, and data sections.
Operation:
Functions are matched based on attributes such as code hashes, call graphs, and
control flow structures.
Recursive partitioning and refinement are applied to identify changes accurately.
Headless Analyzer
Definition:
A Ghidra feature that allows scripts and analyses to be run from the command line
without launching the GUI.
Use Case:
Automating tasks like batch analysis, processing large numbers of binaries, and
integrating Ghidra into automated pipelines.
C++ Virtual Method Tables (vtables)
Implementation Overview:
C++ uses vtables (virtual tables) to manage polymorphic behavior at runtime.
Each polymorphic class has a vtable storing pointers to virtual functions.
Objects store a pointer to their respective vtable, allowing method resolution via indirect
calls.
Example Code Fragment:
class A {
public:
    virtual void M1() {}
    virtual void M2() {}
};

class B : public A {
public:
    void M2() override {}
};
A object                 B object
+-------------+          +-------------+
| vtable ptr  |          | vtable ptr  |
+-------------+          +-------------+
      |                        |
      v                        v
 vtable for A             vtable for B
+-------------+          +-------------+
| M1 -> A::M1 |          | M1 -> A::M1 |
| M2 -> A::M2 |          | M2 -> B::M2 |
+-------------+          +-------------+
Execution Process:
The call obj->M2(); translates to:
1. Dereferencing the vtable pointer.
2. Looking up the function address in the vtable.
3. Indirectly calling the function.
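The three steps can be simulated directly, mirroring the A/B classes above with explicit tables of function "pointers" (a sketch of the mechanism only, not how any particular compiler lays out memory):

```python
# Simulate C++ vtable dispatch: each object carries a pointer to its
# class's table of function pointers; a virtual call is a table lookup
# followed by an indirect call.

def A_M1(obj): return "A::M1"
def A_M2(obj): return "A::M2"
def B_M2(obj): return "B::M2"

VTABLE_A = {"M1": A_M1, "M2": A_M2}
VTABLE_B = {"M1": A_M1, "M2": B_M2}   # M1 inherited, M2 overridden

def make(vtable):
    return {"vptr": vtable}           # object = vtable pointer (+ fields)

def call_virtual(obj, name):
    vtable = obj["vptr"]              # 1. dereference the vtable pointer
    fn = vtable[name]                 # 2. look up the function address
    return fn(obj)                    # 3. indirect call

b = make(VTABLE_B)
print(call_virtual(b, "M2"))          # → B::M2  (dynamic dispatch)
```

The indirect call in step 3 is exactly what a disassembler sees as `call [reg + offset]`, which is why recovering class hierarchies from binaries starts with locating vtables.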