
1.1 Static Reverse Engineering

The document discusses static reverse engineering techniques, focusing on control flow analysis, disassembly methods, and the challenges of accurately interpreting code. It outlines the differences between over-approximation and under-approximation in disassembly, the importance of entry points, and the iterative process of control flow recovery. Additionally, it covers decompilation steps, the semantic gap between high-level and low-level code, and tools like Ghidra for navigating binaries.

Uploaded by Arthur Werbrouck


1.1 Static Reverse Engineering

online material

Part 1

1. Over-Approximation vs. Under-Approximation of Control Flow

Over-approximation means the disassembler recognizes code paths or instructions that are not actually executed (i.e., it interprets some data regions or “junk bytes” as valid instructions).
Examples:
Linear sweep often decodes data bytes as instructions if it just keeps scanning
forward.
Obfuscated programs may include bogus (never-executed) control-flow edges that
trick a recursive descent disassembler.
Under-approximation means the disassembler fails to identify code that is actually present
and reachable in the program (i.e., misses valid instructions or targets).
Examples:
Indirect jumps or calls through function pointers might not be fully resolved, so the
real jump/call targets are missed.
Obfuscated code that calculates jump addresses on the fly, preventing static
analysis from seeing all possible branches.
Bogus Control Flow and Bogus Data:
Bogus control flow refers to intentionally inserted branch instructions (e.g., conditional
jumps) whose taken or not-taken path leads into dead or junk code/data, tricking the
disassembler into over- or under-interpreting code.
Bogus data in code sections can also look like instructions and lead to over-approximation.

2. Role of Entry Points in Code Discovery & How They Are Identified

Disassembly entry points are addresses in the binary where a disassembler starts
decoding instructions. Correct identification is crucial for accurate disassembly.
Common sources of entry points:
1. Executable header information (e.g., the main entry point).
2. Exported symbol information (function symbols).
3. Debug information (if available).
4. Pattern matching of function prologues (common in advanced disassemblers).
5. User or script annotations (in interactive tools like IDA Pro, Ghidra, Binary Ninja).
3. Linear Sweep vs. Recursive Descent Approaches

1. Linear Sweep
Disassembles instructions starting at a known address and continues line by line.
Simple but prone to interpreting data as code (over-approximation).
Typically skips bytes that fail to decode but continues scanning forward.
2. Recursive Descent
Starts from known entry points; follows only the “real” control flow (including
direct/conditional branches).
When a branch or call is encountered, it adds the target to a “work list” and
disassembles from there.
Can under-approximate if it fails to detect some indirect jumps/calls.
More precise for typical code but misses any code not reached by the discovered
control-flow edges.
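The contrast between the two approaches can be sketched on a tiny hypothetical byte-coded ISA (the opcodes, encodings, and example program below are invented purely for illustration):

```python
# Hypothetical toy ISA (invented for illustration):
#   0x01      NOP       (1 byte)
#   0x02 tt   JMP tt    (2 bytes, absolute target tt)
#   0x03      RET       (1 byte)

def decode(mem, pc):
    """Decode one instruction; return (text, length, jump_target or None)."""
    op = mem[pc]
    if op == 0x01:
        return "NOP", 1, None
    if op == 0x02:
        return f"JMP {mem[pc + 1]}", 2, mem[pc + 1]
    if op == 0x03:
        return "RET", 1, None
    return f"DB 0x{op:02x}", 1, None      # undecodable byte

def linear_sweep(mem, start=0):
    out, pc = {}, start
    while pc < len(mem):
        text, size, _ = decode(mem, pc)
        out[pc] = text
        pc += size
    return out

def recursive_descent(mem, entry=0):
    out, worklist = {}, [entry]
    while worklist:
        pc = worklist.pop()
        while pc < len(mem) and pc not in out:
            text, size, target = decode(mem, pc)
            out[pc] = text
            if target is not None:          # unconditional JMP: follow the
                worklist.append(target)     # target, do not fall through
                break
            if text == "RET":
                break
            pc += size
    return out

# Layout: 0: NOP, 1: JMP 4, 3: data byte that happens to look like NOP, 4: RET
mem = bytes([0x01, 0x02, 0x04, 0x01, 0x03])
ls = linear_sweep(mem)       # decodes the data byte at 3 (over-approximation)
rd = recursive_descent(mem)  # follows the JMP and never visits address 3
```

Linear sweep reports an “instruction” at address 3 (the data byte), while recursive descent skips it; conversely, a recursive descent that cannot resolve an indirect jump would miss real code instead (under-approximation).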

4. The Five Steps in Control Flow Recovery

After basic code discovery (linear sweep or recursive descent), control flow recovery typically
follows these steps:

1. Identify all control flow transfer instructions
Direct jumps/calls are straightforward.
Indirect jumps/calls need further analysis (e.g., pattern matching for switch statements,
value set analysis).
2. Identify all possible targets
For direct jumps/calls, the target is encoded in the instruction.
For indirect jumps/calls, the set of possible targets must be inferred (often complex or
incomplete).
3. Partition the instructions into Basic Blocks
A basic block (BB) is a maximal sequence of instructions with a single entry point and
single exit.
4. Identify potential function entry points
From symbols or from direct call targets, etc.
Build a per-function CFG by starting at the entry block.
5. Grow the CFG and build the Call Graph (CG)
Follow intraprocedural edges (conditional branches, fall-throughs, etc.) to add
successor blocks into the function.
If an instruction is recognized as a call, connect the caller function to callee in the Call
Graph.
Different disassemblers have different “stop” rules—e.g., IDA Pro does not add a block
to a new function if it has already been added to another function.
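A minimal sketch of step 3 (basic-block partitioning) on an already-decoded instruction list; the tuple format and the example program are invented for illustration:

```python
# Each entry: (addr, size, mnemonic, branch_target or None).
def basic_blocks(insns):
    addrs = [a for a, _, _, _ in insns]
    leaders = {addrs[0]}                        # first instruction is a leader
    for i, (addr, size, mnem, target) in enumerate(insns):
        if target is not None:                  # a control transfer
            leaders.add(target)                 # its target starts a block
            if i + 1 < len(insns):
                leaders.add(addrs[i + 1])       # fall-through starts a block
    blocks, current = [], []
    for insn in insns:
        if insn[0] in leaders and current:      # split at every leader
            blocks.append(current)
            current = []
        current.append(insn)
    if current:
        blocks.append(current)
    return blocks

prog = [
    (0, 1, "cmp", None),
    (1, 2, "jz", 5),      # conditional branch to address 5
    (3, 1, "mov", None),  # fall-through block
    (4, 1, "add", None),
    (5, 1, "ret", None),  # branch-target block
]
bbs = basic_blocks(prog)
```

Each resulting block has a single entry point (its leader) and a single exit, which is exactly the basic-block property described above.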
5. Synchronization: Definition & Circumstances

Synchronization refers to how a linear-sweep disassembler (particularly on variable-width ISAs like x86) can “get back on track” after misdecoding a sequence of bytes.
If the disassembler misinterprets some bytes as an instruction, it may desynchronize.
But soon it might re-align to the correct instruction boundaries because the real
instructions’ lengths won’t consistently match the disassembler’s incorrect interpretation.
Relevance:
More common on variable-width instruction sets (x86, x86-64).
Less of an issue on fixed-width ISAs (ARM64, MIPS) because instructions align on
fixed boundaries.
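Re-synchronization can be demonstrated on a toy variable-width ISA (the encoding, where the high nibble of the opcode gives the instruction length, is invented for illustration):

```python
def insn_len(op):
    """Toy variable-width ISA: high nibble 1/2/3 encodes length; else 1."""
    return {0x1: 1, 0x2: 2, 0x3: 3}.get(op >> 4, 1)

def boundaries(mem, start):
    """Linear sweep from `start`, recording decoded instruction addresses."""
    out, pc = [], start
    while pc < len(mem):
        out.append(pc)
        pc += insn_len(mem[pc])
    return out

# True layout: a 3-byte insn at 0, 1-byte at 3, 2-byte at 4, 1-byte at 6.
mem = bytes([0x30, 0xAA, 0xBB, 0x10, 0x20, 0xCC, 0x10])
good = boundaries(mem, 0)   # decode from the true entry point
bad = boundaries(mem, 1)    # desynchronized: starts mid-instruction
```

The desynchronized sweep misdecodes the two operand bytes at 1 and 2, but from address 3 onward it lands back on the real instruction boundaries, which is the self-synchronization behavior described above.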

6. Overlapping Instructions (and Data)

On variable-width ISAs, it is possible for one sequence of bytes to decode into multiple
different overlapping instruction streams, depending on where the decoder starts.
Obfuscation Trick: Attackers can intentionally craft code such that one region of memory
can be decoded in two (or more) ways, leading to confusion in disassemblers.
Different Disassembler Policies:
Some (IDA Pro) insist that each byte belongs to exactly one instruction stream once it’s
recognized as code.
Others (Binary Ninja) allow a single byte to be part of multiple possible disassembly
paths.

7. Iterative Disassembling & Why It’s Useful

What it is: Iteratively apply both data flow analysis and control flow analysis, refining each
other in multiple rounds.
1. Initially, a rough CFG is formed (e.g., ignoring complex indirect jumps).
2. Data flow analysis (like value set analysis) can resolve some unknown jump targets
(e.g., switch-case tables).
3. The improved knowledge of jump targets refines the CFG.
4. The refined CFG in turn refines the data flow analysis.
5. Repeat as needed.
Particular Example: Switch dispatchers using computed jumps. A global data flow analysis
might figure out possible jump destinations, adding them as new code blocks in the CFG.
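The switch-dispatcher case can be mimicked in a few lines; the instruction format, addresses, and the stand-in for value set analysis below are all invented for illustration:

```python
# Round-based sketch: the CFG and a data flow pass refine each other.
insns = {
    0: ("mov", None),
    1: ("jmp_indirect", "table"),   # jmp table[reg]: target unknown statically
    10: ("ret", None),              # case 0 handler
    11: ("ret", None),              # case 1 handler
}
data = {"table": [10, 11]}          # jump table found in the data section

def build_cfg(resolved_targets):
    """Collect control-flow edges; indirect edges need resolved targets."""
    edges = set()
    for addr, (mnem, operand) in insns.items():
        if mnem == "jmp_indirect":
            for t in resolved_targets.get(addr, []):
                edges.add((addr, t))
    return edges

cfg_round1 = build_cfg({})             # round 1: indirect targets unknown
resolved = {1: data["table"]}          # "value set" pass reads the table
cfg_round2 = build_cfg(resolved)       # round 2: two new edges recovered
```

Round 1 produces no edges for the indirect jump; once the data flow pass resolves the table, round 2 adds the two case handlers to the CFG, and a real tool would keep alternating until nothing changes.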

8. Potential Starting Points for Disassemblers (Beyond Binaries & Libraries)

Memory dumps: Snapshots of a running process’s memory.


Program traces: Execution traces (lists of instruction addresses and possibly
register/memory values) can inform a static disassembler about which addresses were truly
executed.
Any partially processed or intermediate representations used or exported by other
analysis tools.

Part 2

1. The 8 Steps of Decompilation

1. Disassembly & CFG reconstruction
Same principles as in Part 1.
2. Code lifting
Translate assembly instructions into an intermediate representation (IR) that is easier to
analyze and transform.
3. Data flow analysis & optimization
Includes constant propagation, dead code elimination, copy propagation, etc. to simplify
the IR.
4. Variable extraction & type inference
Identify which registers/stack locations are local variables, global variables, etc.
Infer data types (integers, pointers, arrays, etc.) through constraint-based analysis.
5. Control flow structure recovery
Convert low-level jumps into high-level control constructs (if-else, loops, switch, etc.).
6. Code generation
Emit code in a high-level language (e.g., C) from the IR.
7. Code optimization/simplification (after generation)
Further structural cleanups (remove redundant gotos, re-combine repeated
expressions, etc.).
8. Name inference
Attempt to give more meaningful names to functions and variables (as opposed to auto-generated var_1, FUN_0xABC, etc.).

2. Why Constant Propagation, Dead Code Elimination, and Copy Propagation Are Needed

Constant Propagation:
Replaces register references that hold known constant values with those constants in
subsequent IR operations.
Simplifies expressions and often enables further optimizations.
Dead Code Elimination:
Removes instructions that compute values never used.
E.g., certain condition flags or registers might be set but not read afterward.
Copy Propagation:
Eliminates unnecessary movements of the same value around.
Replacing “mov rA, rB; use rA” with “use rB” directly if no conflict.

All these steps are essential to produce a readable, high-level IR and to remove low-level
assembly artifacts.
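A toy three-address IR makes the three passes concrete; the IR format and the single-pass implementations below are simplified sketches, not a real decompiler’s algorithms:

```python
# IR instruction: (dest, op, args); integer args are constants.
def constant_prop(ir):
    consts, out = {}, []
    for dest, op, args in ir:
        args = tuple(consts.get(a, a) for a in args)   # substitute knowns
        if op == "const":
            consts[dest] = args[0]
        out.append((dest, op, args))
    return out

def copy_prop(ir):
    copies, out = {}, []
    for dest, op, args in ir:
        args = tuple(copies.get(a, a) for a in args)   # use the original
        if op == "mov" and isinstance(args[0], str):
            copies[dest] = args[0]                     # record dest = src
        out.append((dest, op, args))
    return out

def dead_code_elim(ir, live_out):
    used, kept = set(live_out), []
    for dest, op, args in reversed(ir):                # backward liveness
        if dest in used or op == "ret":
            kept.append((dest, op, args))
            used |= {a for a in args if isinstance(a, str)}
        # else: the value is never read, so the instruction is dropped
    return list(reversed(kept))

ir = [
    ("a", "const", (5,)),
    ("b", "add", ("a", "a")),
    ("c", "mov", ("b",)),       # copy: c = b
    ("d", "const", (99,)),      # dead: d is never used
    (None, "ret", ("c",)),
]
opt = dead_code_elim(copy_prop(constant_prop(ir)), live_out=[])
```

After the pipeline, only `b = 5 + 5; ret b` remains: the constant was propagated (making `a` dead), the copy was folded away (making `c` dead), and the unused `d` was eliminated, which is exactly the cleanup these passes exist for.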

3. Where Heuristics Are Used & Potential Incorrect or Suboptimal Outcomes

Heuristic points:
1. Function boundary identification (especially in stripped binaries).
2. Stack frame analysis (guessing which parts of the stack are local variables, saved
registers, parameters, etc.).
3. Type inference (constraint solving may be incomplete or ambiguous).
4. Control flow structuring (mapping low-level jumps to high-level loops and
conditionals).
5. Name inference (guessing function/variable names).
Potential issues:
Incorrect function boundaries (merging two functions into one, or splitting one into
multiple).
Incorrect type assignments leading to nonsensical decompiled code.
Ugly or confusing high-level control structures (unreadable spaghetti code even at the
source level).
Wrong variable or function naming.

Interactive decompilers allow humans to fix these by manually renaming, retyping variables, etc.

Part 3

1. How & Why Compiled Java Code Differs from Compiled Native Code (Easier
Disassembly/Decompilation)

Java bytecode:
Produced by a Java compiler that does minimal low-level optimization.
Typically retains a structure closely matching the source code (one-to-one mapping of
many constructs).
The Java Virtual Machine (JVM) does further Just-In-Time (JIT) optimizations at
runtime, so the compiler doesn’t do heavy optimization upfront.
Native code (e.g., x86):
Produced by a native compiler (GCC, Clang, MSVC) that can heavily optimize at a low
level (inlining, loop unrolling, constant folding, strange register usage, etc.).
Often has fewer recognizable remnants of the original high-level structure, making
reverse-engineering more difficult.

Thus, disassembling Java bytecode to something close to Java-like code is usually much
easier than going from x86 assembly back to C/C++.

2. Examples & Explanation of the “Semantic Gap”

Semantic gap: The difference in abstraction between the original high-level source and the
low-level (or bytecode-level) representation.
For Java:
Bytecode still uses instructions like new + <init> calls (constructor) that map
closely to Java’s object creation.
However, Java bytecode also allows unstructured goto instructions and some
transformations that don’t look so clean in source code (especially in obfuscated
bytecode).
For native code:
High-level structures (for loops, exceptions) can be compiled into complicated sets of jumps, registers, stack manipulations, etc.
High-level variables may be split or merged across multiple registers/stack slots.

When no obfuscation is used, Java’s semantic gap is smaller, so decompiling to a near-original source is simpler. With obfuscation or highly optimized code, the gap widens and complicates reverse engineering.

Ghidra: Navigating Binaries in CodeBrowser

CodeBrowser Windows and Their Purposes

Listing Window

Displays assembly code with fields such as addresses, bytes, mnemonics, operands, comments,
and labels. It allows tracking cross-references (XREFs) for navigation within the code.

Decompiler Window

Presents decompiled C-like code for easier readability and comprehension. This window
synchronizes with the Listing and Function Graph windows.

Function Graph Window

Provides a visual representation of function control flow through a Control Flow Graph (CFG),
making it easier to understand function structure and execution paths.
Program Tree Window

Shows the hierarchical organization of memory segments in the binary, aiding in code structuring
and navigation.

Symbol Table Window

Lists all symbols within the binary, including functions, variables, and labels, and allows tracking
of references.

Function Call Trees Window

Displays a tree structure of function callers and callees, providing insights into function
dependencies.

Function Call Graph Window

Offers a graphical overview of function calls throughout the binary, with options to expand deeper
into the call hierarchy.

Data Type Manager Window

Manages data types used in the program and allows defining custom structures for better
analysis.

Byte Viewer Window

Shows raw memory data in formats such as hexadecimal, ASCII, and binary, assisting in the
analysis of encoded or packed data.

Common Interactive Activities and Their Associated Windows

Activity | Relevant Window
--- | ---
Viewing disassembled instructions | Listing Window
Understanding function logic | Decompiler Window
Visualizing control flow | Function Graph Window
Tracking variable references | Symbol Table, Listing Windows
Creating custom data types | Data Type Manager
Navigating function calls | Function Call Trees
Annotating and labeling code | Listing Window

Ghidra Search Options

Search Program Text – Finds labels, comments, and symbols within the program text.
Search Memory – Searches for byte patterns, strings, and numerical values.
Search for Strings – Identifies potential strings within memory with options for null
termination.
Search for Scalars – Locates scalar values such as constants.
Search for Direct References – Finds occurrences where an address is explicitly
referenced.
Search for Address Tables – Detects sequences of addresses used in lookup tables.
Search for Matching Instructions – Identifies specific opcode patterns.
Search for Instruction Patterns – Finds repeating instruction sequences using masks.

Navigation and Artifact Finding Strategies

Methods to Navigate Code

Using XREFs to follow cross-references between instructions and data.


The Program Tree for structured navigation of memory blocks.
The Goto command for direct jumps to symbols or addresses.
Navigation history to move back and forth through previous locations.

Finding Relevant Artifacts in Attack Strategies

Crypto Asset Localization


Utilize cross-references to trace function calls related to cryptographic operations and search
for known cryptographic constants and keys in memory.

License Check Localization


Look for licensing-related strings and error messages. Examine exit functions linked to failed
checks and analyze control flow.

Game Resource Hack Localization


Identify frequently accessed resource values in memory and track pointer chains leading to in-game assets.

Q&A

Pattern Matching Approach (Slides 8-9)

Purpose:
Identifies known code fragments in a binary by comparing against a library of known
patterns.
Useful for recognizing common library functions and previously analyzed code to speed
up reverse engineering.
Inputs:
A new binary and a catalog of known code fragments (e.g., libraries, previously
analyzed code).
Outputs:
Matches found in the binary with corresponding known patterns, allowing reuse of
knowledge such as function names and annotations.
Operation:
Uses masked byte sequences to ignore variable parts like addresses.
Hashes are generated and compared to detect known patterns.
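The masking-and-hashing idea can be sketched as follows; the catalog entry, the function name, and the specific mask are hypothetical, but the byte sequence is a plausible x86 fragment (push; call rel32; ret) whose 4-byte call operand varies between binaries:

```python
import hashlib

def masked_hash(code, mask):
    """Zero out variable bytes (mask byte 0) before hashing."""
    normalized = bytes(b & m for b, m in zip(code, mask))
    return hashlib.sha256(normalized).hexdigest()

# Catalog entry for a known fragment: push; call <addr>; ret.
known = bytes([0x55, 0xE8, 0x11, 0x22, 0x33, 0x44, 0xC3])
mask = bytes([0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0xFF])  # mask the operand
catalog = {masked_hash(known, mask): "memcpy_thunk"}      # hypothetical name

# The same code in a new binary, relocated: only the call target differs.
candidate = bytes([0x55, 0xE8, 0xAA, 0xBB, 0xCC, 0xDD, 0xC3])
match = catalog.get(masked_hash(candidate, mask))
```

Because the address bytes are masked to zero on both sides, the relocated copy hashes to the same value and the catalog lookup recovers the known name.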

Binary Diffing Approach (Slides 13-22)

Purpose:
Compares two binaries to identify differences, such as security patches, changes in
functionality, or malware evolution.
Inputs:
Two binaries (e.g., patched vs. unpatched versions).
Outputs:
Reports on identical and differing functions, code blocks, and data sections.
Operation:
Functions are matched based on attributes such as code hashes, call graphs, and
control flow structures.
Recursive partitioning and refinement are applied to identify changes accurately.

Headless Analyzer

Definition:
A Ghidra feature that allows scripts and analyses to be run from the command line
without launching the GUI.
Use Case:
Automating tasks like batch analysis, processing large numbers of binaries, and
integrating Ghidra into automated pipelines.
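A typical batch invocation might look like the following; the project directory, project name, binary, and script name are placeholders, and the `analyzeHeadless` launcher ships in Ghidra’s `support/` directory:

```shell
# Import a binary into a throwaway project, run a post-analysis script,
# then delete the project again (all paths and names are placeholders).
./support/analyzeHeadless /tmp/ghidra_projects DemoProject \
    -import ./target.bin \
    -postScript MyScript.java \
    -deleteProject
```

Looping such a command over a directory of binaries is how Ghidra is typically wired into automated pipelines.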

C++ Polymorphism and Vtables

Implementation Overview:
C++ uses vtables (virtual tables) to manage polymorphic behavior at runtime.
Each polymorphic class has a vtable storing pointers to virtual functions.
Objects store a pointer to their respective vtable, allowing method resolution via indirect
calls.
Example Code Fragment:

class A {
public:
    virtual void M1() {}
    virtual void M2() {}
};

class B : public A {
public:
    void M2() override {}
};

A* obj = new B();
obj->M2();

Memory Layout Drawing:

B object                  vtable for class B
+-------------+           +--------------+
| vtable ptr  |---------> | M1 -> A::M1  |
+-------------+           | M2 -> B::M2  |
                          +--------------+

Execution Process:
The call obj->M2(); translates to:
1. Dereferencing the vtable pointer.
2. Looking up the function address in the vtable.
3. Indirectly calling the function.
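The three dispatch steps can be modeled in a few lines of Python; this is a simulation of the C++ mechanism described above, not how either language actually implements method calls, and the slot contents mirror the A/B example:

```python
# Each vtable maps a method slot to the implementation that slot points at.
VTABLE_A = {"M1": "A::M1", "M2": "A::M2"}
VTABLE_B = {"M1": "A::M1", "M2": "B::M2"}   # B overrides M2 only

class Obj:
    def __init__(self, vtable):
        self.vtable_ptr = vtable            # object stores its vtable pointer

def virtual_call(obj, name):
    vtable = obj.vtable_ptr                 # 1. dereference the vtable pointer
    fn = vtable[name]                       # 2. look up the slot in the vtable
    return fn                               # 3. "indirect call" to that entry

obj = Obj(VTABLE_B)                         # A* obj = new B();
result = virtual_call(obj, "M2")            # obj->M2();
```

Because dispatch goes through the object’s vtable pointer rather than the static type, the call resolves to `B::M2` even though the pointer’s declared type is `A*`, while the non-overridden slot still reaches `A::M1`.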
