Reverse Engineering For Beginners
Dennis Yurichev
<dennis(a)yurichev.com>
©2013-2015, Dennis Yurichev.
This work is licensed under the Creative Commons
Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of
this license, visit
http://creativecommons.org/licenses/by-nc-nd/3.0/.
Text version (November 25, 2015).
The latest version (and Russian edition) of this text is accessible at beginners.re. An
A4-format version is also available.
You can also follow me on Twitter to get information about updates to this text:
@yurichev1, or subscribe to the mailing list2.
The cover was made by Andy Nechaevsky: facebook.
1 twitter.com/yurichev
2 yurichev.com
Warning: this is a shortened
LITE-version!
If you are still interested in reverse engineering, the full version of the book is always
available on my website: beginners.re.
CONTENTS CONTENTS
Contents
I Code patterns 1
1 A short introduction to the CPU 3
3 Hello, world! 7
3.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.1.1 MSVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 x86-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2.1 MSVC—x86-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
5 Stack 13
5.1 Why does the stack grow backwards? . . . . . . . . . . . . . . . . . . . . 13
5.2 What is the stack used for? . . . . . . . . . . . . . . . . . . . . . . . . . . 14
5.2.1 Save the function’s return address . . . . . . . . . . . . . . . . . 14
5.2.2 Passing function arguments . . . . . . . . . . . . . . . . . . . . . 16
5.2.3 Local variable storage . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.2.4 x86: alloca() function . . . . . . . . . . . . . . . . . . . . . . . . . 17
5.2.5 (Windows) SEH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
5.2.6 Buffer overflow protection . . . . . . . . . . . . . . . . . . . . . . 19
5.2.7 Automatic deallocation of data in stack . . . . . . . . . . . . . 19
5.3 A typical stack layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
7 scanf() 26
7.1 Simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
7.1.1 About pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
7.1.2 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
7.1.3 x64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
7.2 Global variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
7.2.1 MSVC: x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7.2.2 MSVC: x64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
7.3 scanf() result checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.3.1 MSVC: x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.3.2 MSVC: x86 + Hiew . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7.3.3 MSVC: x64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
7.4 Exercise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
10 GOTO operator 48
10.1 Dead code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
11 Conditional jumps 50
11.1 Simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
11.1.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
11.2 Calculating absolute value . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
11.2.1 Optimizing MSVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
11.3 Ternary conditional operator . . . . . . . . . . . . . . . . . . . . . . . . . 57
11.3.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
11.3.2 Let’s rewrite it in an if/else way . . . . . . . . . . . . . . . . 59
11.4 Getting minimal and maximal values . . . . . . . . . . . . . . . . . . . . 60
11.4.1 32-bit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
11.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
11.5.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
11.5.2 Branchless . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
12 switch()/case/default 63
12.1 Small number of cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
12.1.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
12.1.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
12.2 A lot of cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
12.2.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
12.2.2 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
12.3 When there are several case statements in one block . . . . . . . . . 72
12.3.1 MSVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
12.4 Fall-through . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
12.4.1 MSVC x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
13 Loops 77
13.1 Simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
13.1.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
13.1.2 One more thing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
13.2 Memory blocks copying routine . . . . . . . . . . . . . . . . . . . . . . . 80
13.2.1 Straight-forward implementation . . . . . . . . . . . . . . . . . 80
13.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
16 Arrays 93
16.1 Simple example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
16.1.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
16.2 Buffer overflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
16.2.1 Reading outside array bounds . . . . . . . . . . . . . . . . . . . . 95
16.2.2 Writing beyond array bounds . . . . . . . . . . . . . . . . . . . . 97
16.3 One more word about arrays . . . . . . . . . . . . . . . . . . . . . . . . . 101
16.4 Array of pointers to strings . . . . . . . . . . . . . . . . . . . . . . . . . . 102
16.4.1 x64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
16.5 Multidimensional arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
19 Structures 129
19.1 MSVC: SYSTEMTIME example . . . . . . . . . . . . . . . . . . . . . . . . . 129
19.1.1 Replacing the structure with array . . . . . . . . . . . . . . . . . 131
19.2 Let’s allocate space for a structure using malloc() . . . . . . . . . . . . 132
19.3 Fields packing in structure . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
19.3.1 x86 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
19.3.2 One more word . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
19.4 Nested structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
19.5 Bit fields in a structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
19.5.1 CPUID example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
21 64 bits 153
21.1 x86-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
23 Memory 159
25 Strings 167
25.1 Text strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
25.1.1 C/C++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
25.1.2 Borland Delphi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
25.1.3 Unicode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
25.1.4 Base64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
25.2 Error/debug messages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
25.3 Suspicious magic strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
27 Constants 177
27.1 Magic numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
27.1.1 DHCP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
27.2 Searching for constants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
IV Tools 190
32 Disassembler 191
32.1 IDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
33 Debugger 192
33.1 tracer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
34 Decompilers 193
37 Blogs 197
37.1 Windows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
38 Other 198
Afterword 200
39 Questions? 200
Index 207
Bibliography 210
Preface
There are several popular meanings of the term “reverse engineering”: 1) the reverse
engineering of software: researching compiled programs; 2) the scanning of
3D structures and the subsequent digital manipulation required in order to duplicate
them; 3) recreating DBMS3 structure. This book is about the first meaning.
• “It’s very well done .. and for free .. amazing.”4 Daniel Bilar, Siege Technologies,
LLC.
• “... excellent and free”5 Pete Finnigan, Oracle RDBMS security guru.
• “... book is interesting, great job!” Michael Sikorski, author of Practical Malware
Analysis: The Hands-On Guide to Dissecting Malicious Software.
3 Database management systems
4 twitter.com/daniel_bilar/status/436578617221742593
5 twitter.com/petefinnigan/status/400551705797869568
• “... my compliments for the very nice tutorial!” Herbert Bos, full professor at
the Vrije Universiteit Amsterdam, co-author of Modern Operating Systems (4th
Edition).
• “... It is amazing and unbelievable.” Luis Rocha, CISSP / ISSAP, Technical
Manager, Network & Information Security at Verizon Business.
• “Thanks for the great work and your book.” Joris van de Vis, SAP Netweaver
& Security specialist.
• “... reasonable intro to some of the techniques.”6 Mike Stay, teacher at the
Federal Law Enforcement Training Center, Georgia, US.
• “I love this book! I have several students reading it at the moment, plan to
use it in graduate course.”7 Sergey Bratus, Research Assistant Professor at the
Computer Science Department at Dartmouth College.
• “Dennis @Yurichev has published an impressive (and free!) book on reverse
engineering”8 Tanel Poder, Oracle RDBMS performance tuning expert.
• “This book is some kind of Wikipedia to beginners...” Archer, Chinese Translator,
IT Security Researcher.
Thanks
For patiently answering all my questions: Andrey “herm1t” Baranovich, Slava “Avid”
Kazakov.
For sending me notes about mistakes and inaccuracies: Stanislav “Beaver” Bobrytskyy,
Alexander Lysenko, Shell Rocket, Zhu Ruijin, Changmin Heo.
For helping me in other ways: Andrew Zubinski, Arnaud Patard (rtp on #debian-arm
IRC), Aliaksandr Autayeu.
For translating the book into Simplified Chinese: Antiy Labs (antiy.cn) and Archer.
For translating the book into Korean: Byungho Min.
For proofreading: Alexander “Lstar” Chernenkiy, Vladimir Botov, Andrei Brazhuk,
Mark “Logxen” Cooper, Yuan Jochen Kang, Mal Malakov, Lewis Porter, Jarle Thorsen.
Vasil Kolev did a great amount of work in proofreading and correcting many mistakes.
For illustrations and cover art: Andy Nechaevsky.
Thanks also to all the folks on github.com who have contributed notes and corrections.
6 reddit
7 twitter.com/sergeybratus/status/505590326560833536
8 twitter.com/TanelPoder/status/524668104065159169
Many LaTeX packages were used; I would like to thank their authors as well.
Donors
Those who supported me during the time when I wrote a significant part of the book:
2 * Oleg Vygovsky (50+100 UAH), Daniel Bilar ($50), James Truscott ($4.5), Luis
Rocha ($63), Joris van de Vis ($127), Richard S Shultz ($20), Jang Minchang ($20),
Shade Atlas (5 AUD), Yao Xiao ($10), Pawel Szczur (40 CHF), Justin Simms ($20),
Shawn the R0ck ($27), Ki Chan Ahn ($50), Triop AB (100 SEK), Ange Albertini (€10+50),
Sergey Lukianov (300 RUR), Ludvig Gislason (200 SEK), Gérard Labadie (€40), Sergey
Volchkov (10 AUD), Vankayala Vigneswararao ($50), Philippe Teuwen ($4), Martin
Haeberli ($10), Victor Cazacov (€5), Tobias Sturzenegger (10 CHF), Sonny Thai ($15),
Bayna AlZaabi ($75), Redfive B.V. (€25), Joona Oskari Heikkilä (€5), Marshall Bishop
($50), Nicolas Werner (€12), Jeremy Brown ($100), Alexandre Borges ($25), Vladimir
Dikovski (€50), Jiarui Hong (100.00 SEK), Jim Di (500 RUR), Tan Vincent ($30), Sri
Harsha Kandrakota (10 AUD), Pillay Harish (10 SGD), Timur Valiev (230 RUR), Carlos
Garcia Prado (€10), Salikov Alexander (500 RUR), Oliver Whitehouse (30 GBP), Katy
Moe ($14), Maxim Dyakonov ($3), Sebastian Aguilera (€20), Hans-Martin Münch
(€15), Jarle Thorsen (100 NOK), Vitaly Osipov ($100), Yuri Romanov (1000 RUR),
Aliaksandr Autayeu (€10), Tudor Azoitei ($40), Z0vsky (€10), Yu Dai ($10).
Thanks a lot to every donor!
mini-FAQ
Q: I have a question...
A: Send it to me by email (dennis(a)yurichev.com).
This is the A5-format version for e-book readers. Although the content is mostly
the same, the illustrations are resized and probably not readable. You may try to
change the scale in your e-book reader. Otherwise, you can always view them in the
A4-format version here: beginners.re.
Part I
Code patterns
Everything is comprehended in
comparison
Author unknown
When the author of this book first started learning C and, later, C++, he used to write
small pieces of code, compile them, and then look at the assembly language output.
This made it very easy for him to understand what was going on in the code that
he had written.13 He did it so many times that the relationship between the C/C++
code and what the compiler produced was imprinted deeply in his mind. Now it is easy
for him to instantly imagine a rough outline of the C code’s appearance and function.
Perhaps this technique could be helpful for others.
Sometimes ancient compilers are used here, in order to get the shortest (or simplest)
possible code snippet.
13 In fact, he still does it when he can’t understand what a particular bit of code does.
CHAPTER 1. A SHORT INTRODUCTION TO THE CPU CHAPTER 1. A SHORT INTRODUCTION TO THE CPU
Chapter 1
A short introduction to the CPU
The CPU is the device that executes the machine code a program consists of.
A short glossary:
Instruction: A primitive CPU command. The simplest examples include moving
data between registers, working with memory, and primitive arithmetic operations.
As a rule, each CPU has its own instruction set architecture (ISA1).
Machine code: Code that the CPU directly processes. Each instruction is usually
encoded by several bytes.
Assembly language: Mnemonic code plus some extensions, like macros, that are
intended to make a programmer’s life easier.
CPU register: Each CPU has a fixed set of general purpose registers (GPR2): ≈ 8 in
x86, ≈ 16 in x86-64, ≈ 16 in ARM. The easiest way to understand a register is
to think of it as an untyped temporary variable. Imagine if you were working
with a high-level PL3 and could only use eight 32-bit (or 64-bit) variables.
Yet a lot can be done using just these!
One might wonder why there needs to be a difference between machine code and
a PL. The answer lies in the fact that humans and CPUs are not alike: it is much
easier for humans to use a high-level PL like C/C++, Java, Python, etc., but it is
easier for a CPU to use a much lower level of abstraction. Perhaps it would
be possible to invent a CPU that can execute high-level PL code, but it would be
many times more complex than the CPUs we know of today. In a similar fashion,
it is very inconvenient for humans to write in assembly language, due to it being
so low-level and difficult to write in without making a huge number of annoying
mistakes. The program that converts the high-level PL code into assembly is
called a compiler.
1 Instruction Set Architecture
2 General Purpose Registers
3 Programming language
CHAPTER 2. THE SIMPLEST FUNCTION CHAPTER 2. THE SIMPLEST FUNCTION
Chapter 2
The simplest function
The simplest possible function is arguably one that simply returns a constant value.
Here it is:
2.1 x86
Here’s what both the optimizing GCC and MSVC compilers produce on the x86 platform:
mov eax, 123
ret
There are just two instructions: the first places the value 123 into the EAX register,
which is used by convention for storing the return value, and the second one is RET,
which returns execution to the caller. The caller will take the result from the EAX
register.
It is worth noting that MOV is a misleading name for the instruction in both x86
and ARM ISAs. The data is not in fact moved, but copied.
CHAPTER 3. HELLO, WORLD! CHAPTER 3. HELLO, WORLD!
Chapter 3
Hello, world!
Let’s use the famous example from the book “The C Programming Language” [Ker88]:
#include <stdio.h>
int main()
{
printf("hello, world\n");
return 0;
}
3.1 x86
3.1.1 MSVC
_TEXT SEGMENT
_main PROC
push ebp
mov ebp, esp
push OFFSET $SG3830
call _printf
add esp, 4
xor eax, eax
pop ebp
ret 0
_main ENDP
_TEXT ENDS
The compiler generated the file, 1.obj, which is to be linked into 1.exe. In our
case, the file contains two segments: CONST (for data constants) and _TEXT (for
code).
The string hello, world in C/C++ has type const char[] [Str13, p176, 7.3.2],
but it does not have its own name. The compiler needs to deal with the string
somehow so it defines the internal name $SG3830 for it.
That is why the example may be rewritten as follows:
#include <stdio.h>
const char $SG3830[]="hello, world\n";
int main()
{
printf($SG3830);
return 0;
}
Let’s go back to the assembly listing. As we can see, the string is terminated by
a zero byte, which is standard for C/C++ strings. More about C strings: 25.1.1 on
page 167.
In the code segment, _TEXT, there is only one function so far: main(). The function
main() starts with prologue code and ends with epilogue code (like almost
any function)1.
After the function prologue we see the call to the printf() function: CALL
_printf. Before the call the string address (or a pointer to it) containing our
greeting is placed on the stack with the help of the PUSH instruction.
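When the printf() function returns control to the main() function, the
string address (or a pointer to it) is still on the stack. Since we do not need it
anymore, the stack pointer (the ESP register) needs to be corrected: ADD ESP, 4
adds 4 to the value in ESP, discarding the 4-byte pointer.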
1 You can read more about it in the section about function prologues and epilogues (4 on page 11).
After calling printf(), the original C/C++ code contains the statement return
0, that is, return 0 as the result of the main() function. In the generated code this is
implemented by the instruction XOR EAX, EAX. XOR is in fact just “eXclusive
OR”, but compilers often use it instead of MOV EAX, 0 because it is
a slightly shorter opcode (2 bytes for XOR against 5 for MOV).
Some compilers emit SUB EAX, EAX, which means SUBtract the value in EAX
from the value in EAX, which, in any case, results in zero.
The last instruction RET returns the control to the caller. Usually, this is C/C++ CRT4
code, which, in turn, returns control to the OS.
3.2 x86-64
3.2.1 MSVC—x86-64
main PROC
sub rsp, 40
lea rcx, OFFSET FLAT:$SG2989
call printf
xor eax, eax
add rsp, 40
ret 0
main ENDP
In x86-64, all registers were extended to 64 bits and their names now have an R-
prefix. In order to use the stack less often (in other words, to access external memory/cache
less often), there is a popular way to pass function arguments via
registers (fastcall): part of the function arguments is passed in registers, and the
rest via the stack. In Win64, the first 4 function arguments are passed in the RCX, RDX, R8,
R9 registers. That is what we see here: a pointer to the string for printf() is
now passed not on the stack, but in the RCX register.
The pointers are 64-bit now, so they are passed in the 64-bit registers (which have
the R- prefix). However, for backward compatibility, it is still possible to access the
32-bit parts, using the E- prefix.
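This is how the RAX/EAX/AX/AL register looks in x86-64 (the 0th byte is the least significant):
Byte number: 7th 6th 5th 4th 3rd 2nd 1st 0th
RAX (x64 only): 7th to 0th (all 64 bits)
EAX: 3rd to 0th (low 32 bits)
AX: 1st and 0th (low 16 bits)
AH: 1st; AL: 0th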
The main() function returns an int-typed value, which in C/C++ is still 32-bit for
better backward compatibility and portability. That is why the EAX register (the
32-bit part of the register) is cleared at the function end instead of RAX.
There are also 40 bytes allocated in the local stack. This is called the “shadow
space”, about which we are going to talk later: 8.2.1 on page 43.
3.3 Conclusion
The main difference between x86/ARM and x64/ARM64 code is that the pointer to
the string is now 64 bits long. Indeed, modern CPUs are now 64-bit due to
both the reduced cost of memory and the greater demand for it by modern applications.
We can add much more memory to our computers than 32-bit pointers are
able to address. As such, all pointers are now 64-bit.
CHAPTER 4. FUNCTION PROLOGUE AND EPILOGUE CHAPTER 4. FUNCTION PROLOGUE AND EPILOGUE
Chapter 4
Function prologue and epilogue
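A function prologue is a sequence of instructions at the start of a function. It
often looks like the following code fragment:
push ebp
mov ebp, esp
sub esp, X
What these instructions do: save the value in the EBP register, set the value of the
EBP register to the value of ESP, and then allocate space on the stack for local
variables.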
The value in EBP stays the same over the period of the function execution and
is used to access local variables and arguments. One can use ESP for the same
purpose, but since it changes over time, this approach is not very convenient.
The function epilogue frees the allocated space on the stack, restores the value in
the EBP register to its initial state, and returns control flow to the caller:
mov esp, ebp
pop ebp
ret 0
Disassemblers usually detect function prologues and epilogues in order to find where
functions begin and end.
4.1 Recursion
Epilogues and prologues can negatively affect the performance of recursive functions,
since they are executed on every recursive call.
CHAPTER 5. STACK CHAPTER 5. STACK
Chapter 5
Stack
The stack is one of the most fundamental data structures in computer science1.
Technically, it is just a block of memory in process memory along with the ESP or
RSP register in x86 or x64, or the SP2 register in ARM, as a pointer within that
block.
The most frequently used stack access instructions are PUSH and POP (in both x86
and ARM Thumb-mode). PUSH subtracts 4 in 32-bit mode (or 8 in 64-bit mode) from
ESP/RSP/SP and then writes the contents of its sole operand to the memory
address pointed to by ESP/RSP/SP.
POP is the reverse operation: it retrieves the data from the memory location that SP
points to, loads it into the instruction operand (often a register), and then adds 4 (or
8) to the stack pointer.
After stack allocation, the stack pointer points to the end (the highest address) of
the stack block. PUSH decreases the stack pointer and POP increases it, so the stack
grows toward the beginning of the memory allocated for it, which is called the bottom
of the stack. It seems strange, but that’s the way it is.
The reason that the stack grows backward is probably historical. When computers
were big and occupied a whole room, it was easy to divide memory into two
parts, one for the heap and one for the stack. Of course, it was unknown how big
the heap and the stack would be during program execution, so letting them grow
toward each other from opposite ends of memory was the simplest possible solution.
Start of heap                                  Start of stack
Heap --->           (free space)           <--- Stack
This reminds us of how some students write notes for two lectures in a single
notebook: notes for the first lecture are written as usual, and notes for the second one
are written from the end of the notebook, by flipping it. The notes may meet
somewhere in between if free space runs out.
x86
When calling another function with a CALL instruction, the address of the point
exactly after the CALL instruction is saved to the stack and then an unconditional
jump to the address in the CALL operand is executed.
ss.cpp
c:\tmp6\ss.cpp(4) : warning C4717: 'f' : recursive on all control paths, function will cause runtime stack overflow
… Also, if we turn on compiler optimization (the /Ox option), the optimized code
will not overflow the stack and will work correctly3 instead:
?f@@YAXXZ PROC ; f
; File c:\tmp6\ss.cpp
; Line 2
$LL3@f:
; Line 3
jmp SHORT $LL3@f
?f@@YAXXZ ENDP ; f
3 irony here
By the way, the callee function does not have any information about how many
arguments were passed. C functions with a variable number of arguments (like
printf()) determine their number using format string specifiers (which begin
with the % symbol). If we write something like
printf("%d %d %d", 1234);
printf() will print 1234, and then two random numbers that happened to be lying
next to it on the stack.
5 For example, in “The Art of Computer Programming” by Donald Knuth, in section 1.4.1
dedicated to subroutines [Knu98, section 1.4.1], we can read that one way to supply arguments to a
subroutine is simply to list them after the JMP instruction that passes control to the subroutine. Knuth
explains that this method was particularly convenient on IBM System/360.
That’s why it is not very important how we declare the main() function: as main(),
main(int argc, char *argv[]) or main(int argc, char *argv[],
char *envp[]).
In fact, the CRT-code is calling main() roughly as:
push envp
push argv
push argc
call main
...
If you declare main() as main() without arguments, they are nevertheless still
present on the stack, but are not used. If you declare main() as main(int argc,
char *argv[]), you will be able to use the first two arguments, and the third will
remain “invisible” for your function. Moreover, it is possible to declare main(int
argc), and it will work.
A function can allocate space on the stack for its local variables just by decreasing
the stack pointer towards the stack bottom. Hence, it is very fast, no matter how
many local variables are defined.
It is also not a requirement to store local variables on the stack. You could store
local variables wherever you like, but traditionally this is how it is done.
#ifdef __GNUC__
#include <alloca.h> // GCC
#else
#include <malloc.h> // MSVC
#endif
#include <stdio.h>
void f(void)
{
char *buf=(char*)alloca (600); // 600 bytes on the stack, freed automatically on return
#ifdef __GNUC__
snprintf (buf, 600, "hi! %d, %d, %d\n", 1, 2, 3); // GCC
#else
_snprintf (buf, 600, "hi! %d, %d, %d\n", 1, 2, 3); // MSVC
#endif
puts (buf); // prints the buffer plus a newline
}
int main(void)
{
f();
return 0;
}
The _snprintf() function works just like printf(), but instead of dumping the result
into stdout (e.g., to a terminal or console), it writes it to the buf buffer. The puts()
function copies the contents of buf to stdout. Of course, these two function calls
might be replaced by one printf() call, but we need to illustrate the usage of a small
buffer.
MSVC
push 3
push 2
push 1
push OFFSET $SG2672
push 600 ; 00000258H
push esi
call __snprintf
push esi
call _puts
add esp, 28 ; 0000001cH
...
The sole alloca() argument is passed via EAX (instead of being pushed onto the
stack)7. After the alloca() call, ESP points to the block of 600 bytes and we
can use it as memory for the buf array.
SEH10 records are also stored on the stack (if they are present).
Perhaps the reason for storing local variables and SEH records on the stack is that
they are freed automatically upon function exit, using just one instruction to correct
the stack pointer (it is often ADD). Function arguments, as we could say, are also
deallocated automatically at the end of the function. In contrast, everything stored in
the heap must be deallocated explicitly.
7 One of the reasons a separate function is needed instead of just a couple of instructions in the code
is that the MSVC alloca() implementation also has code that reads from the memory just allocated,
in order to let the OS map physical memory to this VM region.
10 Structured Exception Handling
Address    Contents
…          …
ESP-0xC    local variable #2, marked in IDA as var_8
ESP-8      local variable #1, marked in IDA as var_4
ESP-4      saved value of EBP
ESP        return address
ESP+4      argument#1, marked in IDA as arg_0
ESP+8      argument#2, marked in IDA as arg_4
ESP+0xC    argument#3, marked in IDA as arg_8
…          …
CHAPTER 6. PRINTF() WITH SEVERAL ARGUMENTS CHAPTER 6. PRINTF() WITH SEVERAL ARGUMENTS
Chapter 6
printf() with several arguments
Now let’s extend the Hello, world! (3 on page 7) example, replacing printf() in
the main() function body with this:
#include <stdio.h>
int main()
{
printf("a=%d; b=%d; c=%d", 1, 2, 3);
return 0;
}
6.1 x86
MSVC
...
push 3
push 2
push 1
push OFFSET $SG3830
call _printf
add esp, 16 ; 00000010H
Almost the same, but now we can see the printf() arguments are pushed onto
the stack in reverse order. The first argument is pushed last.
By the way, variables of int type in a 32-bit environment are 32 bits wide, that is,
4 bytes.
So, we have 4 arguments here. 4 ∗ 4 = 16, so they occupy exactly 16 bytes on the
stack: a 32-bit pointer to a string and 3 numbers of type int.
When the stack pointer (the ESP register) is restored by an ADD ESP, X
instruction after a function call, the number of function arguments can often be
deduced by simply dividing X by 4.
Of course, this is specific to the cdecl calling convention, and only to the 32-bit
environment.
In certain cases where several functions are called right after one another, the compiler
could merge multiple “ADD ESP, X” instructions into one, after the last call:
push a1
push a2
call ...
...
push a1
call ...
...
push a1
push a2
push a3
call ...
add esp, 24
.text:100113F8 push 1
.text:100113FA call sub_100018B0 ; takes one argument (1)
.text:100113FF add esp, 8 ; drops two arguments from stack at once
To see how other arguments are passed via the stack, let’s change our example
again by increasing the number of arguments to 9 (printf() format string + 8 int
variables):
#include <stdio.h>
int main()
{
printf("a=%d; b=%d; c=%d; d=%d; e=%d; f=%d; g=%d; h=%d\n", 1, 2, 3, 4, 5, 6, 7, 8);
return 0;
}
MSVC
As mentioned earlier, the first 4 arguments have to be passed through the
RCX, RDX, R8, R9 registers in Win64, while all the rest go via the stack. That is
exactly what we see here. However, the MOV instruction, instead of PUSH, is used
for preparing the stack, so the values are stored to the stack in a straightforward
manner.
main PROC
sub rsp, 88
mov edx, 1
lea rcx, OFFSET FLAT:$SG2923
call printf
; return 0
xor eax, eax
add rsp, 88
ret 0
main ENDP
_TEXT ENDS
END
The observant reader may ask why 8 bytes are allocated for int values when 4
are enough. One has to remember that 8 bytes are allocated for any data type
shorter than 64 bits. This is done for convenience’s sake: it makes it
easy to calculate the address of an arbitrary argument. Besides, all arguments are
located at aligned memory addresses. It is the same in 32-bit environments: 4 bytes
are reserved for any data type shorter than 32 bits.
6.2 Conclusion
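Here is a rough skeleton of the function call (x86, cdecl):
push 3rd argument
push 2nd argument
push 1st argument
call function
; modify stack pointer (if needed)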
CHAPTER 7. SCANF() CHAPTER 7. SCANF()
Chapter 7
scanf()
#include <stdio.h>
int main()
{
int x;
printf ("Enter X:\n");
scanf ("%d", &x);
printf ("You entered %d...\n", x);
return 0;
};
Using scanf() for user interaction is not a clever idea nowadays, but it does let us illustrate passing a pointer to a variable of type int.
Pointers are one of the fundamental concepts in computer science. Often, passing a large array, structure or object as an argument to another function is too expensive, while passing its address is much cheaper. In addition, if the callee function
7.1.2 x86
MSVC
push ecx
push OFFSET $SG3831 ; 'Enter X:'
call _printf
add esp, 4
lea eax, DWORD PTR _x$[ebp]
push eax
push OFFSET $SG3832 ; '%d'
call _scanf
add esp, 8
mov ecx, DWORD PTR _x$[ebp]
push ecx
push OFFSET $SG3833 ; 'You entered %d...'
call _printf
add esp, 8
; return 0
xor eax, eax
mov esp, ebp
pop ebp
ret 0
_main ENDP
_TEXT ENDS
x is a local variable.
According to the C/C++ standard it must be visible only in this function and not from any other external scope. Traditionally, local variables are stored on the stack. There are probably other ways to allocate them, but in x86 this is how it is done.
The purpose of PUSH ECX, the instruction following the function prologue, is not to save the ECX state (notice the absence of a corresponding POP ECX at the function’s end).
In fact, it allocates 4 bytes on the stack for storing the x variable.
x is accessed with the help of the _x$ macro (it equals -4) and the EBP register pointing to the current frame.
Over the span of the function’s execution, EBP points to the current stack frame, making it possible to access local variables and function arguments via EBP+offset.
It is also possible to use ESP for the same purpose, although that is not very convenient since it changes frequently. The value of EBP can be thought of as a frozen copy of the value of ESP at the start of the function’s execution.
Here is a typical stack frame layout in a 32-bit environment:
…          …
EBP-8      local variable #2, marked in IDA as var_8
EBP-4      local variable #1, marked in IDA as var_4
EBP        saved value of EBP
EBP+4      return address
EBP+8      argument #1, marked in IDA as arg_0
EBP+0xC    argument #2, marked in IDA as arg_4
EBP+0x10   argument #3, marked in IDA as arg_8
…          …
The scanf() function in our example has two arguments.
The first one is a pointer to the string containing %d and the second is the address
of the x variable.
First, the x variable’s address is loaded into the EAX register by the lea eax, DWORD PTR _x$[ebp] instruction.
We could say that in this case LEA simply stores the sum of the EBP register value and the _x$ macro in the EAX register.
This is the same as lea eax, [ebp-4].
So, 4 is subtracted from the EBP register value and the result is loaded into the EAX register. Next, the EAX register value is pushed onto the stack and scanf() is called.
printf() is called after that with its first argument, a pointer to the string You entered %d...\n.
The second argument is prepared with mov ecx, [ebp-4]. This instruction loads the value of the x variable, not its address, into the ECX register.
Next, the value in ECX is pushed onto the stack and the last printf() is called.
By the way
By the way, this simple example demonstrates that the compiler translates a list of expressions in a C/C++ block into a sequential list of instructions. There is nothing between expressions in C/C++, and so there is nothing between them in the resulting machine code either: control flow slips from one expression to the next one.
7.1.3 x64
The picture here is similar, with the difference that registers, rather than the stack, are used for passing arguments.
MSVC
_TEXT SEGMENT
x$ = 32
main PROC
$LN3:
sub rsp, 56
lea rcx, OFFSET FLAT:$SG1289 ; 'Enter X:'
call printf
lea rdx, QWORD PTR x$[rsp]
lea rcx, OFFSET FLAT:$SG1291 ; '%d'
call scanf
mov edx, DWORD PTR x$[rsp]
lea rcx, OFFSET FLAT:$SG1292 ; 'You entered %d...'
call printf
; return 0
xor eax, eax
add rsp, 56
ret 0
main ENDP
_TEXT ENDS
#include <stdio.h>
// now x is a global variable
int x;
int main()
{
printf ("Enter X:\n");
scanf ("%d", &x);
printf ("You entered %d...\n", x);
return 0;
};
_DATA SEGMENT
COMM _x:DWORD
$SG2456 DB 'Enter X:', 0aH, 00H
$SG2457 DB '%d', 00H
$SG2458 DB 'You entered %d...', 0aH, 00H
_DATA ENDS
PUBLIC _main
EXTRN _scanf:PROC
EXTRN _printf:PROC
; Function compile flags: /Odtp
_TEXT SEGMENT
_main PROC
push ebp
mov ebp, esp
push OFFSET $SG2456
call _printf
add esp, 4
push OFFSET _x
push OFFSET $SG2457
call _scanf
add esp, 8
mov eax, DWORD PTR _x
push eax
push OFFSET $SG2458
call _printf
add esp, 8
xor eax, eax
pop ebp
ret 0
_main ENDP
_TEXT ENDS
In this case the x variable is defined in the _DATA segment and no memory is allocated on the local stack. It is accessed directly, not through the stack. Uninitialized global variables take no space in the executable file (indeed, why would one need to allocate space in the file for variables initially set to zero?), but when someone accesses their address, the OS will allocate a block of zeroes there.
Now let’s explicitly assign a value to the variable:
int x=10; // default value
We got:
_DATA SEGMENT
_x DD 0aH
...
Here we see the value 0xA of DWORD type (DD stands for DWORD, i.e. 32 bits) for this variable.
If you open the compiled .exe in IDA, you can see the x variable placed at the
beginning of the _DATA segment, and after it you can see text strings.
If you open the compiled .exe from the previous example in IDA, where the value
of x was not set, you would see something like this:
.data:0040FA80 _x dd ? ; DATA XREF: _main+10
.data:0040FA80 ; _main+22
.data:0040FA84 dword_40FA84 dd ? ; DATA XREF: _memset+1E
.data:0040FA84 ; unknown_libname_1+28
.data:0040FA88 dword_40FA88 dd ? ; DATA XREF: ___sbh_find_block+5
.data:0040FA88 ; ___sbh_free_block+2BC
.data:0040FA8C ; LPVOID lpMem
.data:0040FA8C lpMem dd ? ; DATA XREF: ___sbh_find_block+B
.data:0040FA8C ; ___sbh_free_block+2CA
.data:0040FA90 dword_40FA90 dd ? ; DATA XREF: _V6_HeapAlloc+13
.data:0040FA90 ; __calloc_impl+72
.data:0040FA94 dword_40FA94 dd ? ; DATA XREF: ___sbh_free_block+2FE
_x is marked with ?, like the rest of the variables that do not need to be initialized. This implies that after the .exe is loaded into memory, space for all these variables is allocated and filled with zeroes [ISO07, 6.7.8p10]. In the .exe file itself, however, these uninitialized variables occupy no space. This is convenient for large arrays, for example.
_TEXT SEGMENT
main PROC
$LN3:
sub rsp, 40
; return 0
xor eax, eax
add rsp, 40
ret 0
main ENDP
_TEXT ENDS
The code is almost the same as in x86. Please note that the address of the x variable is passed to scanf() using a LEA instruction, while the variable’s value is passed to the second printf() using a MOV instruction. DWORD PTR is part of the assembly language syntax, not an instruction of its own: it indicates that the variable’s data size is 32 bits, so the MOV instruction has to be encoded accordingly.
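#include <stdio.h>
int main()
{
int x;
printf ("Enter X:\n");
if (scanf ("%d", &x)==1)
printf ("You entered %d...\n", x);
else
printf ("What you entered? Huh?\n");
return 0;
};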
According to the standard, the scanf() function returns the number of fields it has successfully read.
In our case, if everything goes fine and the user enters a number, scanf() returns 1. If the input cannot be read as a number, it returns 0, and if the input ends before anything could be read, it returns EOF.
Let’s add some C code to check the scanf() return value and print an error message in case of an error.
This works as expected:
C:\...>ex3.exe
Enter X:
123
You entered 123...
C:\...>ex3.exe
Enter X:
ouch
What you entered? Huh?
The caller function (main()) needs the result of the callee function (scanf()), so the callee returns it in the EAX register.
We check it with the help of the instruction CMP EAX, 1 (CoMPare). In other words,
we compare the value in the EAX register with 1.
A JNE conditional jump follows the CMP instruction. JNE stands for Jump if Not
Equal.
So, if the value in the EAX register is not equal to 1, the CPU will pass the execution to the address mentioned in the JNE operand, in our case $LN2@main. Passing control to this address results in the CPU executing printf() with the argument What you entered? Huh?. But if everything is fine, the conditional jump is not taken, and another printf() call is executed, with two arguments: 'You entered %d...' and the value of x.
Since in this case the second printf() must not be executed, there is a JMP (unconditional jump) preceding it. It passes control to the point after the second printf() and just before the XOR EAX, EAX instruction, which implements return 0.
So, it could be said that comparing one value with another is usually implemented by a CMP/Jcc instruction pair, where cc is a condition code. CMP compares two values and sets the processor flags4. Jcc checks those flags and decides whether or not to pass control to the specified address.
4 x86 flags, see also: wikipedia.
This may sound paradoxical, but the CMP instruction is in fact SUB (subtract). All arithmetic instructions set the processor flags, not just CMP. If we compare 1 and 1, 1 − 1 is 0, so the ZF flag is set (meaning that the last result was 0). ZF cannot be set under any other circumstances, except when the operands are equal. JNE checks only the ZF flag and jumps only if it is not set. JNE is in fact a synonym for JNZ (Jump if Not Zero): the assembler translates both JNE and JNZ into the same opcode. So, the CMP instruction can be replaced with a SUB instruction and almost everything will be fine, with the difference that SUB alters the value of the first operand. CMP is SUB without saving the result, but still affecting the flags.
This can also serve as a simple example of executable file patching. We may try to patch the executable so that the program always prints the input, no matter what we enter.
Assuming that the executable is compiled against an external MSVCR*.DLL (i.e., with the /MD option), we see the main() function at the beginning of the .text section.
Let’s open the executable in Hiew and find the beginning of the .text section
(Enter, F8, F6, Enter, Enter).
We can see this:
Hiew finds ASCIIZ (zero-terminated ASCII) strings and displays them, as it does with the imported functions’ names.
Move the cursor to address .00401027 (where the JNZ instruction we have to bypass is located), press F3, and then type “9090” (meaning two NOPs7):
Then press F9 (update). Now the executable is saved to the disk. It will behave as
we wanted.
Two NOPs are probably not the most æsthetic approach. Another way to patch this
instruction is to write just 0 to the second opcode byte (jump offset), so that JNZ
will always jump to the next instruction.
We could also do the opposite: replace the first byte with EB while not touching the second byte (jump offset). We would get an unconditional jump that is always taken. In this case the error message would be printed every time, no matter the input.
Since we work here with int-typed variables, which are still 32-bit in x86-64, we see that the 32-bit parts of the registers (prefixed with E-) are used here as well.
7 No OPeration
While working with pointers, however, the 64-bit register parts, prefixed with R-, are used.
_TEXT SEGMENT
x$ = 32
main PROC
$LN5:
sub rsp, 56
lea rcx, OFFSET FLAT:$SG2924 ; 'Enter X:'
call printf
lea rdx, QWORD PTR x$[rsp]
lea rcx, OFFSET FLAT:$SG2926 ; '%d'
call scanf
cmp eax, 1
jne SHORT $LN2@main
mov edx, DWORD PTR x$[rsp]
lea rcx, OFFSET FLAT:$SG2927 ; 'You entered %d...'
call printf
jmp SHORT $LN1@main
$LN2@main:
lea rcx, OFFSET FLAT:$SG2929 ; 'What you entered? ⤦
Ç Huh?'
call printf
$LN1@main:
; return 0
xor eax, eax
add rsp, 56
ret 0
main ENDP
_TEXT ENDS
END
7.4 Exercise
• http://challenges.re/53
Chapter 8
Accessing passed arguments
Now we know that the caller function passes arguments to the callee via the stack. But how does the callee access them?
#include <stdio.h>
int f (int a, int b, int c)
{
return a*b+c;
};
int main()
{
printf ("%d\n", f(1, 2, 3));
return 0;
};
8.1 x86
8.1.1 MSVC
_a$ = 8 ; size = 4
_b$ = 12 ; size = 4
_c$ = 16 ; size = 4
_f PROC
push ebp
mov ebp, esp
mov eax, DWORD PTR _a$[ebp]
imul eax, DWORD PTR _b$[ebp]
add eax, DWORD PTR _c$[ebp]
pop ebp
ret 0
_f ENDP
_main PROC
push ebp
mov ebp, esp
push 3 ; 3rd argument
push 2 ; 2nd argument
push 1 ; 1st argument
call _f
add esp, 12
push eax
push OFFSET $SG2463 ; '%d', 0aH, 00H
call _printf
add esp, 8
; return 0
xor eax, eax
pop ebp
ret 0
_main ENDP
What we see is that the main() function pushes 3 numbers onto the stack and calls f(int,int,int). Argument access inside f() is organized with the help of macros like _a$ = 8, in the same way as for local variables, but with positive offsets (addressed with plus). So, we are addressing the outer side of the stack frame by adding the _a$ macro to the value in the EBP register.
Then the value of a is loaded into EAX. After the IMUL instruction executes, EAX holds the product of the value in EAX and the content of _b. After that, ADD adds the value in _c to EAX. The value in EAX does not need to be moved: it is already where it must be. On return, the caller takes the EAX value and uses it as an argument to printf().
8.2 x64
The story is a bit different in x86-64. Function arguments (the first 4 of them in Win64, the first 6 in the System V AMD64 ABI used on Linux and other Unix-like systems) are passed in registers, i.e. the callee reads them from registers instead of reading them from the stack.
8.2.1 MSVC
Optimizing MSVC:
Listing 8.3: Optimizing MSVC 2012 x64
$SG2997 DB '%d', 0aH, 00H
main PROC
sub rsp, 40
mov edx, 2
lea r8d, QWORD PTR [rdx+1] ; R8D=3
lea ecx, QWORD PTR [rdx-1] ; ECX=1
call f
lea rcx, OFFSET FLAT:$SG2997 ; '%d'
mov edx, eax
call printf
xor eax, eax
add rsp, 40
ret 0
main ENDP
f PROC
; ECX - 1st argument
; EDX - 2nd argument
; R8D - 3rd argument
imul ecx, edx
lea eax, DWORD PTR [r8+rcx]
ret 0
f ENDP
As we can see, the compact function f() takes all its arguments from the registers. The LEA instruction here is used for addition; apparently the compiler considered it faster than ADD. LEA is also used in the main() function to prepare the first and third f() arguments. The compiler probably decided that this would work faster than the usual way of loading values into a register with a MOV instruction.
Let’s take a look at the non-optimizing MSVC output:
Listing 8.4: MSVC 2012 x64
f proc near
; shadow space:
arg_0 = dword ptr 8
arg_8 = dword ptr 10h
arg_10 = dword ptr 18h
; return 0
xor eax, eax
add rsp, 28h
retn
main endp
It looks somewhat puzzling, because all 3 arguments are saved from the registers to the stack for some reason. This is called “shadow space”1: every Win64 function may (but is not required to) save all 4 register values there. This is done for two reasons: 1) it is too lavish to dedicate a whole register (or even 4 registers) to an input argument, so it can be accessed via the stack instead; 2) the debugger always knows where to find the function arguments at a break2.
So, some large functions can save their input arguments in the “shadow space” if they need to use them during execution, but some small functions (like ours) may not do this.
It is the caller’s responsibility to allocate “shadow space” on the stack.
1 MSDN
2 MSDN
Chapter 9
More about results returning
In x86, the result of function execution is usually returned1 in the EAX register. If it is of byte type or a character (char), then the lowest part of the EAX register (AL) is used. If a function returns a float number, the FPU register ST(0) is used instead.
In other words:
exit(main(argc,argv,envp));
If you declare main() as void, nothing is returned explicitly (with the return statement), so whatever happens to be stored in the EAX register at the end of main() becomes the sole argument of the exit() function. Most likely, it will be a random value left over from your function’s execution, so the exit code of the program is pseudorandom.
1 See also: MSDN: Return Values (C++): MSDN
We can illustrate this fact. Please note that here the main() function has a void
return type:
#include <stdio.h>
void main()
{
printf ("Hello, world!\n");
};
The result of the rand() function is left in EAX, in all four cases. But in the first 3
cases, the value in EAX is just thrown away.
Chapter 10
GOTO operator
#include <stdio.h>
int main()
{
printf ("begin\n");
goto exit;
printf ("skip me!\n");
exit:
printf ("end\n");
};
_main PROC
push ebp
mov ebp, esp
push OFFSET $SG2934 ; 'begin'
call _printf
add esp, 4
jmp SHORT $exit$3
The goto statement has been simply replaced by a JMP instruction, which has the
same effect: unconditional jump to another place.
The second printf() could be executed only with human intervention, by using
a debugger or by patching the code.
_main PROC
push OFFSET $SG2981 ; 'begin'
call _printf
push OFFSET $SG2984 ; 'end'
$exit$4:
call _printf
add esp, 8
xor eax, eax
ret 0
_main ENDP
Chapter 11
Conditional jumps
#include <stdio.h>
int main()
{
f_signed(1, 2);
f_unsigned(1, 2);
return 0;
};
11.1.1 x86
x86 + MSVC
The first instruction, JLE, stands for Jump if Less or Equal. In other words, if the second operand is larger than or equal to the first one, the control flow is passed to the address or label specified in the instruction. If this condition does not trigger because the second operand is smaller than the first one, the control flow is not altered and the first printf() is executed. The second check is JNE: Jump if Not Equal. The control flow does not change if the operands are equal. The third check is JGE: Jump if Greater or Equal, that is, jump if the first operand is larger than the second or if they are equal. So, if all three conditional jumps are triggered, none of the printf() calls would be executed whatsoever. This is impossible without special intervention.
Now let’s take a look at the f_unsigned() function. The f_unsigned() func-
tion is the same as f_signed(), with the exception that the JBE and JAE instruc-
tions are used instead of JLE and JGE, as follows:
See also the section about signed number representations (22 on page 157). That is why, if we see JG/JL in use instead of JA/JB or vice versa, we can be almost sure that the variables are signed or unsigned, respectively.
Here is also the main() function, where there is nothing much new to us:
We can try to patch the executable file in a way that the f_unsigned() function
would always print “a==b”, no matter the input values. Here is how it looks in Hiew:
• The third jump we replace with JMP just as we do with the first one, so it will
always trigger.
If we fail to change any of these jumps, several printf() calls may execute, while we want only one of them to execute.
Here is an example:
const char* f (int a)
{
return a==10 ? "it is ten" : "it is not ten";
};
11.3.1 x86
_a$ = 8 ; size = 4
_f PROC
; compare input value with 10
cmp DWORD PTR _a$[esp-4], 10
mov eax, OFFSET $SG792 ; 'it is ten'
; jump to $LN4@f if equal
je SHORT $LN4@f
mov eax, OFFSET $SG793 ; 'it is not ten'
$LN4@f:
ret 0
_f ENDP
a$ = 8
f PROC
; load pointers to the both strings
lea rdx, OFFSET FLAT:$SG1355 ; 'it is ten'
lea rax, OFFSET FLAT:$SG1356 ; 'it is not ten'
; compare input value with 10
cmp ecx, 10
; if equal, copy value from RDX ("it is ten")
; if not, do nothing. The pointer to the string "it is not ten" is
; still in RAX at this point.
cmove rax, rdx
ret 0
f ENDP
Interestingly, optimizing GCC 4.8 for x86 was also able to use the CMOVcc instruction in this case, while non-optimizing GCC 4.8 uses conditional jumps:
Listing 11.8: Optimizing GCC 4.8
.LC0:
.string "it is ten"
.LC1:
.string "it is not ten"
f:
.LFB0:
; compare input value with 10
cmp DWORD PTR [esp+4], 10
mov edx, OFFSET FLAT:.LC1 ; "it is not ten"
mov eax, OFFSET FLAT:.LC0 ; "it is ten"
; if comparison result is Not Equal, copy EDX value to EAX
; if not, do nothing
cmovne eax, edx
ret
11.4.1 32-bit
_a$ = 8
_b$ = 12
_my_max PROC
push ebp
mov ebp, esp
mov eax, DWORD PTR _a$[ebp]
; compare A and B:
cmp eax, DWORD PTR _b$[ebp]
; jump if A is less or equal to B:
jle SHORT $LN2@my_max
; reload A to EAX if otherwise and jump to exit
mov eax, DWORD PTR _a$[ebp]
jmp SHORT $LN3@my_max
jmp SHORT $LN3@my_max ; this is redundant JMP
$LN2@my_max:
; return B
mov eax, DWORD PTR _b$[ebp]
$LN3@my_max:
pop ebp
ret 0
_my_max ENDP
These two functions differ only in the conditional jump instruction: JGE (“Jump if
Greater or Equal”) is used in the first one and JLE (“Jump if Less or Equal”) in the
second.
There is one unneeded JMP instruction in each function, which MSVC probably left
by mistake.
11.5 Conclusion
11.5.1 x86
11.5.2 Branchless
If the body of a condition statement is very short, the conditional move instruction
can be used: MOVcc in ARM (in ARM mode), CSEL in ARM64, CMOVcc in x86.
Chapter 12
switch()/case/default
#include <stdio.h>
void f (int a)
{
switch (a)
{
case 0: printf ("zero\n"); break;
case 1: printf ("one\n"); break;
case 2: printf ("two\n"); break;
default: printf ("something unknown\n"); break;
};
};
int main()
{
f (2); // test
};
12.1.1 x86
Non-optimizing MSVC
Our function with a few cases in switch() is in fact analogous to this construction:
void f (int a)
{
if (a==0)
printf ("zero\n");
else if (a==1)
printf ("one\n");
else if (a==2)
printf ("two\n");
else
printf ("something unknown\n");
};
Optimizing MSVC
12.1.2 Conclusion
A switch() with few cases is indistinguishable from an if/else construction (for example, see listing 12.1.1).
void f (int a)
{
switch (a)
{
case 0: printf ("zero\n"); break;
case 1: printf ("one\n"); break;
case 2: printf ("two\n"); break;
case 3: printf ("three\n"); break;
case 4: printf ("four\n"); break;
default: printf ("something unknown\n"); break;
};
};
int main()
{
f (2); // test
};
12.2.1 x86
Non-optimizing MSVC
push ecx
mov eax, DWORD PTR _a$[ebp]
mov DWORD PTR tv64[ebp], eax
cmp DWORD PTR tv64[ebp], 4
ja SHORT $LN1@f
mov ecx, DWORD PTR tv64[ebp]
jmp DWORD PTR $LN11@f[ecx*4]
$LN6@f:
push OFFSET $SG739 ; 'zero', 0aH, 00H
call _printf
add esp, 4
jmp SHORT $LN9@f
$LN5@f:
push OFFSET $SG741 ; 'one', 0aH, 00H
call _printf
add esp, 4
jmp SHORT $LN9@f
$LN4@f:
push OFFSET $SG743 ; 'two', 0aH, 00H
call _printf
add esp, 4
jmp SHORT $LN9@f
$LN3@f:
push OFFSET $SG745 ; 'three', 0aH, 00H
call _printf
add esp, 4
jmp SHORT $LN9@f
$LN2@f:
push OFFSET $SG747 ; 'four', 0aH, 00H
call _printf
add esp, 4
jmp SHORT $LN9@f
$LN1@f:
push OFFSET $SG749 ; 'something unknown', 0aH, 00H
call _printf
add esp, 4
$LN9@f:
mov esp, ebp
pop ebp
ret 0
npad 2 ; align next label
$LN11@f:
DD $LN6@f ; 0
DD $LN5@f ; 1
DD $LN4@f ; 2
DD $LN3@f ; 3
DD $LN2@f ; 4
_f ENDP
What we see here is a set of printf() calls with various arguments. Each of them has not only an address in the memory of the process, but also an internal symbolic label assigned by the compiler. All these labels are also mentioned in the $LN11@f internal table.
At the function start, if a is greater than 4, control flow is passed to label $LN1@f,
where printf() with argument 'something unknown' is called.
But if the value of a is less than or equal to 4, it gets multiplied by 4 and added to the $LN11@f table address. That is how an address inside the table is constructed, pointing exactly to the element we need. For example, let’s say a is equal to 2.
2 ∗ 4 = 8 (all table elements are addresses in a 32-bit process and that is why all
elements are 4 bytes wide). The address of the $LN11@f table + 8 is the table
element where the $LN4@f label is stored. JMP fetches the $LN4@f address from
the table and jumps to it.
This table is sometimes called jumptable or branch table4 .
Then the corresponding printf() is called with argument 'two'. Literally, the
jmp DWORD PTR $LN11@f[ecx*4] instruction implies jump to the DWORD that
is stored at address $LN11@f + ecx * 4.
npad is an assembly language macro that aligns the next label so that it is stored at an address aligned on a 4-byte (or 16-byte) boundary. This is convenient for the processor, since it can fetch 32-bit values from memory through the memory bus, cache memory, etc. more efficiently if they are aligned.
Non-optimizing GCC
push ebp
mov ebp, esp
sub esp, 18h
cmp [ebp+arg_0], 4
ja short loc_8048444
mov eax, [ebp+arg_0]
shl eax, 2
mov eax, ds:off_804855C[eax]
jmp eax
4 The whole method was once called computed GOTO in early versions of FORTRAN: wikipedia. Not
12.2.2 Conclusion
case1:
; do something
JMP exit
case2:
; do something
JMP exit
case3:
; do something
JMP exit
case4:
; do something
JMP exit
case5:
; do something
JMP exit
default:
...
exit:
....
jump_table dd case1
dd case2
dd case3
dd case4
dd case5
The jump to the address in the jump table may also be implemented with this instruction: JMP jump_table[REG*4], or JMP jump_table[REG*8] in x64.
A jumptable is just an array of pointers, like the one described later: 16.4 on page 102.
void f(int a)
{
switch (a)
{
case 1:
case 2:
case 7:
case 10:
printf ("1, 2, 7, 10\n");
break;
case 3:
case 4:
case 5:
case 6:
printf ("3, 4, 5\n");
break;
case 8:
case 9:
case 20:
case 21:
printf ("8, 9, 21\n");
break;
case 22:
printf ("22\n");
break;
default:
printf ("default\n");
break;
};
};
int main()
{
f(4);
};
It’s too wasteful to generate a block for each possible case, so what is usually done
is to generate each block plus some kind of dispatcher.
12.3.1 MSVC
31 $LN11@f:
32 DD $LN5@f ; print '1, 2, 7, 10'
33 DD $LN4@f ; print '3, 4, 5'
34 DD $LN3@f ; print '8, 9, 21'
35 DD $LN2@f ; print '22'
36 DD $LN1@f ; print 'default'
37 $LN10@f:
38 DB 0 ; a=1
39 DB 0 ; a=2
40 DB 1 ; a=3
41 DB 1 ; a=4
42 DB 1 ; a=5
43 DB 1 ; a=6
44 DB 0 ; a=7
45 DB 2 ; a=8
46 DB 2 ; a=9
47 DB 0 ; a=10
48 DB 4 ; a=11
49 DB 4 ; a=12
50 DB 4 ; a=13
51 DB 4 ; a=14
52 DB 4 ; a=15
53 DB 4 ; a=16
54 DB 4 ; a=17
55 DB 4 ; a=18
56 DB 4 ; a=19
57 DB 2 ; a=20
58 DB 2 ; a=21
59 DB 3 ; a=22
60 _f ENDP
We see two tables here: the first table ($LN10@f) is an index table, and the second
one ($LN11@f) is an array of pointers to blocks.
First, the input value is used as an index in the index table (line 13).
Here is a short legend for the values in the table: 0 is the first case block (for values 1, 2, 7, 10), 1 is the second one (for values 3, 4, 5, 6), 2 is the third one (for values 8, 9, 20, 21), 3 is the fourth one (for value 22), 4 is for the default block.
There we get an index for the second table of code pointers and we jump to it (line
14).
What is also worth noting is that there is no case for input value 0. That’s why we
see the DEC instruction at line 10, and the table starts at a = 1, because there is
no need to allocate a table element for a = 0.
This is a very widespread pattern.
12.4 Fall-through
Another very popular usage of switch() is the fall-through. Here is a small ex-
ample:
#include <stdio.h>
#define R 1
#define W 2
#define RW 3

void f(int type)
{
int read=0, write=0;

switch (type)
{
case RW:
read=1;
// fall through
case W:
write=1;
break;
case R:
read=1;
break;
default:
break;
};
printf ("read=%d, write=%d\n", read, write);
};
The code mostly resembles what is in the source. There are no jumps between labels $LN4@f and $LN3@f, so when code flow is at $LN4@f, read is first set to 1, then write. This is why it’s called fall-through: code flow falls through one piece of code (setting read) to another (setting write). If type = W, we land at $LN3@f, so no code setting read to 1 is executed.
Chapter 13
Loops
13.1.1 x86
There is a special LOOP instruction in the x86 instruction set which decrements the value in the ECX register and, if the result is not 0, passes control flow to the label in the LOOP operand. Probably this instruction is not very convenient, and there are no modern compilers which emit it automatically. So, if you see this instruction somewhere in code, it is most likely a manually written piece of assembly code.
#include <stdio.h>
void printing_function(int i)
{
printf ("f(%d)\n", i);
};
int main()
{
int i;
for (i=2; i<10; i++)
printing_function(i);
return 0;
};
ret 0
_main ENDP
What happens here is that space for the i variable is not allocated on the local stack anymore; instead, the compiler uses an individual register for it, ESI. This is possible in small functions like this one, where there aren’t many local variables.
One very important thing is that the f() function must not change the value in ESI. Our compiler is sure of this here. And if the compiler decides to use the ESI register in f() too, its value would have to be saved in the function’s prologue and restored in the function’s epilogue, almost like in our listing: please note PUSH ESI/POP ESI at the function start and end.
In the generated code we can see that after i is initialized, the body of the loop is not executed right away: the condition for i is checked first, and only after that can the loop body be executed. And that is correct, because if the loop condition is not met at the beginning, the body of the loop must not be executed. This is possible in the following case:
for (i=0; i<total_entries_to_process; i++)
loop_body;
If total_entries_to_process is 0, the body of the loop must not be executed at all. This is why the condition is checked before the execution.
However, an optimizing compiler may swap the condition check and the loop body if it is sure that the situation described here is not possible (as in the case of our very simple example, with Keil, Xcode (LLVM), and MSVC in optimization mode).
ret
13.3 Conclusion
Rough skeleton of a loop from 2 to 9 inclusive:
If the body of the loop is short, a whole register can be dedicated to the counter
variable:
; do something here
; use counter in EBX, but do not modify it!
INC EBX ; increment
check:
CMP EBX, 9
JLE body
Usually the condition is checked before the loop body, but the compiler may rearrange the code so that the condition is checked after the loop body. It does this when it is sure that the condition is always true on the first iteration, so the body of the loop is executed at least once.
Using the LOOP instruction. This is rare, because compilers do not use it. When you see it, it’s a sign that this piece of code is hand-written:
; loop body
; do something here
; use counter in ECX, but do not modify it!
LOOP body
CHAPTER 14. SIMPLE C-STRINGS PROCESSING CHAPTER 14. SIMPLE C-STRINGS PROCESSING
Chapter 14
14.1 strlen()
Let’s talk about loops one more time. Often, the strlen() function1 is implemented using a while() statement. Here is how it is done in the MSVC standard libraries:
int my_strlen (const char * str)
{
    const char *eos = str;
    while( *eos++ ) ;
    // eos now points one byte past the terminating zero
    return( eos - str - 1 );
}
int main()
{
    // test
    return my_strlen("hello!");
};
14.1.1 x86
Non-optimizing MSVC
Let’s compile:
_eos$ = -4 ; size = 4
_str$ = 8 ; size = 4
_strlen PROC
push ebp
mov ebp, esp
push ecx
mov eax, DWORD PTR _str$[ebp] ; place pointer to string from "str"
mov DWORD PTR _eos$[ebp], eax ; place it to local variable "eos"
$LN2@strlen_:
mov ecx, DWORD PTR _eos$[ebp] ; ECX=eos
Optimizing MSVC
Now let’s compile all this in MSVC 2012, with optimizations turned on (/Ox):
Now it is all simpler. Needless to say, the compiler could use registers with such
efficiency only in small functions with a few local variables.
INC/DEC are the increment/decrement instructions. In other words, they add 1 to a variable or subtract 1 from it.
CHAPTER 15. REPLACING ARITHMETIC INSTRUCTIONS TO OTHER
CHAPTER
ONES 15. REPLACING ARITHMETIC INSTRUCTIONS TO OTHER ONES
Chapter 15
Replacing arithmetic
instructions to other ones
15.1 Multiplication
Multiplication by 4 is just shifting the number to the left by 2 bits and inserting 2 zero bits on the right (as the last two bits). It is just like multiplying 3 by 100: we just need to append two zeroes on the right.
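That’s how the shift left instruction works (an 8-bit value shifted by one bit; the digits are the original bit numbers):
before:      7 6 5 4 3 2 1 0
after SHL:   6 5 4 3 2 1 0 0    (bit 7 goes to CF, a zero enters at bit 0)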
It’s also possible to get rid of the multiplication operation when multiplying by numbers like 7 or 17, again by using shifts. The mathematics used here is relatively easy.
32-bit
#include <stdint.h>
int f1(int a)
{
return a*7;
};
int f2(int a)
{
return a*28;
};
int f3(int a)
{
return a*17;
};
x86
; a*28
_a$ = 8
_f2 PROC
; a*17
_a$ = 8
_f3 PROC
mov eax, DWORD PTR _a$[esp-4]
; EAX=a
shl eax, 4
; EAX=EAX<<4=EAX*16=a*16
add eax, DWORD PTR _a$[esp-4]
; EAX=EAX+a=a*16+a=a*17
ret 0
_f3 ENDP
64-bit
#include <stdint.h>
int64_t f1(int64_t a)
{
return a*7;
};
int64_t f2(int64_t a)
{
return a*28;
};
int64_t f3(int64_t a)
{
return a*17;
};
x64
; a*28
f2:
lea rax, [0+rdi*4]
; RAX=RDI*4=a*4
sal rdi, 5
; RDI=RDI<<5=RDI*32=a*32
sub rdi, rax
; RDI=RDI-RAX=a*32-a*4=a*28
mov rax, rdi
ret
; a*17
f3:
mov rax, rdi
sal rax, 4
; RAX=RAX<<4=a*16
add rax, rdi
; RAX=a*16+a=a*17
ret
15.2 Division
Example of division by 4:
unsigned int f(unsigned int a)
{
return a/4;
};
_a$ = 8 ; size = 4
_f PROC
mov eax, DWORD PTR _a$[esp-4]
shr eax, 2
ret 0
_f ENDP
In this example, the SHR (SHift Right) instruction shifts a number by 2 bits to the right. The two freed bits on the left (i.e., the two most significant bits) are set to zero. The two least significant bits are dropped. In fact, these two dropped bits are the remainder of the division.
The SHR instruction works just like SHL, but in the other direction (again an 8-bit value shifted by one bit):
before:      7 6 5 4 3 2 1 0
after SHR:   0 7 6 5 4 3 2 1    (bit 0 goes to CF, a zero enters at bit 7)
So the remainder is dropped, but that’s OK: we work on integer values anyway, not real numbers!
CHAPTER 16. ARRAYS CHAPTER 16. ARRAYS
Chapter 16
Arrays
An array is just a set of variables in memory that lie next to each other and that
have the same type1 .
#include <stdio.h>
int main()
{
    int a[20];
    int i;
    for (i=0; i<20; i++)
        a[i]=i*2;
    for (i=0; i<20; i++)
        printf ("a[%d]=%d\n", i, a[i]);
    return 0;
};
16.1.1 x86
MSVC
Let’s compile:
$LN1@main:
xor eax, eax
mov esp, ebp
pop ebp
ret 0
_main ENDP
Nothing very special, just two loops: the first one fills the array and the second one prints it. The shl ecx, 1 instruction multiplies the value in ECX by 2 (more about this: 15.2.1 on page 92).
80 bytes are allocated on the stack for the array, 20 elements of 4 bytes.
So array indexing is just array[index]. If you study the generated code closely, you’ll probably notice that there is no bounds check verifying that the index is less than 20. What if the index is 20 or greater? This is the C/C++ feature the language is often blamed for.
Here is code that compiles and runs:
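#include <stdio.h>
int main()
{
    int a[20];
    int i;
    for (i=0; i<20; i++)
        a[i]=i*2;
    // deliberately out of bounds: a[20] does not exist
    printf ("a[20]=%d\n", a[20]);
    return 0;
};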
It is just something that was lying in the stack near the array, 80 bytes away from its first element.
OK, we read some values from the stack illegally, but what if we could write something to it?
Here is what we have got:
#include <stdio.h>
int main()
{
    int a[20];
    int i;
    // deliberately writes past the end: a[20]..a[29] do not exist
    for (i=0; i<30; i++)
        a[i]=i;
    return 0;
};
MSVC
The compiled program crashes after running. No wonder. Let’s see where exactly it crashes.
Let’s load it into OllyDbg, and trace until all 30 elements are written:
Figure 16.3: OllyDbg: EIP was restored, but OllyDbg can’t disassemble at 0x15
The a[19]=something statement writes the last int within the bounds of the array (in bounds so far!).
The a[20]=something statement writes something to the place where the value of EBP is saved.
Please take a look at the register state at the moment of the crash. In our case, 20 was written into the element with index 20. At the function end, the function epilogue restores the "original" EBP value, which is now 20 (0x14 in hexadecimal). Then RET gets executed, which is effectively equivalent to a POP EIP instruction.
The RET instruction takes the return address from the stack, that is, the address in the CRT code that called main(). But 21 (0x15 in hexadecimal) is stored there instead. The CPU jumps to address 0x15, but there is no executable code there, so an exception is raised.
Welcome! It is called a buffer overflow 3 .
Replace the int array with a string (char array) and deliberately create a long string. Pass it to a function that doesn’t check the string’s length and copies it into a short buffer. You’ll then be able to point the program to an address to which it must jump. It’s not that simple in reality, but that is how it emerged4.
That’s because the compiler must know the exact array size at the compiling stage to allocate space for it in the local stack layout.
If you need an array of arbitrary size, allocate it using malloc(), then access the allocated memory block as an array of variables of the type you need.
Or use the C99 variable-length array feature [ISO07, pp. 6.7.5/2], which works like alloca() ( 5.2.4 on page 17) internally.
It’s also possible to use garbage collecting libraries for C. There are also libraries supporting smart pointers for C++.
3 wikipedia
4 Classic article about it: [One96].
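const char* month1[]=
{
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
};
// in 0..11 range
const char* get_month1 (int month)
{
    return month1[month];
};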
16.4.1 x64
DQ FLAT:$SG3133
$SG3122 DB 'January', 00H
$SG3123 DB 'February', 00H
$SG3124 DB 'March', 00H
$SG3125 DB 'April', 00H
$SG3126 DB 'May', 00H
$SG3127 DB 'June', 00H
$SG3128 DB 'July', 00H
$SG3129 DB 'August', 00H
$SG3130 DB 'September', 00H
$SG3156 DB '%s', 0aH, 00H
$SG3131 DB 'October', 00H
$SG3132 DB 'November', 00H
$SG3133 DB 'December', 00H
_DATA ENDS
month$ = 8
get_month1 PROC
movsxd rax, ecx
lea rcx, OFFSET FLAT:month1
mov rax, QWORD PTR [rcx+rax*8]
ret 0
get_month1 ENDP
32-bit MSVC
The input value does not need to be extended to 64-bit value, so it is used as is.
And it’s multiplied by 4, because the table elements are 32-bit (or 4 bytes) wide.
Offset in memory    Array element
0                   [0][0]
1                   [0][1]
2                   [0][2]
3                   [0][3]
4                   [1][0]
5                   [1][1]
6                   [1][2]
7                   [1][3]
8                   [2][0]
9                   [2][1]
10                  [2][2]
11                  [2][3]
The same offsets, laid out as rows and columns:
0   1   2   3
4   5   6   7
8   9   10  11
So, in order to calculate the address of the element we need, we first multiply the
first index by 4 (array width) and then add the second index. That’s called row-major
order, and this method of array and matrix representation is used in at least C/C++
and Python. The term row-major order in plain English language means: “first,
write the elements of the first row, then the second row …and finally the elements
of the last row”.
Another method of representation is called column-major order (the array indices are used in reverse order). It is used at least in FORTRAN, MATLAB and R. The term column-major order in plain English means: “first, write the elements of the first column, then the second column …and finally the elements of the last column”.
Which method is better? In general, in terms of performance and cache memory, the best scheme for data organization is the one in which the elements are accessed sequentially. So if your function accesses data per row, row-major order is better, and vice versa.
We are going to work with an array of type char, which implies that each element
requires only one byte in memory.
char a[3][4];
int main()
{
    int x, y;
    // clear array
    for (x=0; x<3; x++)
        for (y=0; y<4; y++)
            a[x][y]=0;
    // fill second row by 0..3:
    for (y=0; y<4; y++)
        a[1][y]=y;
};
All three rows are marked with red. We see that the second row now has values 0, 1, 2 and 3.
char a[3][4];
int main()
{
    int x, y;
    // clear array
    for (x=0; x<3; x++)
        for (y=0; y<4; y++)
            a[x][y]=0;
    // fill third column by 0..2:
    for (x=0; x<3; x++)
        a[x][2]=x;
};
The three rows are also marked in red here. We see that in each row, the values 0, 1 and 2 are written at the third position.
char a[3][4];
int main()
{
a[2][3]=123;
printf ("%d\n", get_by_coordinates1(a, 2, 3));
printf ("%d\n", get_by_coordinates2(a, 2, 3));
printf ("%d\n", get_by_coordinates3(a, 2, 3));
};
array$ = 8
a$ = 16
b$ = 24
get_by_coordinates2 PROC
movsxd rax, r8d
movsxd r9, edx
add rax, rcx
movzx eax, BYTE PTR [rax+r9*4]
ret 0
get_by_coordinates2 ENDP
array$ = 8
a$ = 16
b$ = 24
get_by_coordinates1 PROC
movsxd rax, r8d
movsxd r9, edx
add rax, rcx
movzx eax, BYTE PTR [rax+r9*4]
ret 0
get_by_coordinates1 ENDP
It’s the same with multidimensional arrays. Now we are going to work with an array of type int: each element requires 4 bytes in memory.
Let’s see:
int a[10][20][30];
x86
Nothing special. For index calculation, the three input arguments are used in the formula address = 600 ⋅ 4 ⋅ x + 30 ⋅ 4 ⋅ y + 4z, to represent the array as multidimensional. Do not forget that the int type is 32-bit (4 bytes), so all coefficients must be multiplied by 4.
x = dword ptr 8
y = dword ptr 0Ch
z = dword ptr 10h
value = dword ptr 14h
push ebp
The GCC compiler does it differently. For one of the operations in the calculation (30y), GCC produces code without multiplication instructions. This is how it is done: (y + y) ≪ 4 − (y + y) = (2y) ≪ 4 − 2y = 2 ⋅ 16 ⋅ y − 2y = 32y − 2y = 30y. Thus, the 30y calculation uses only one addition, one bitwise shift and one subtraction. This works faster.
16.6 Conclusion
An array is a pack of values in memory located adjacently. It’s true for any element
type, including structures. Access to a specific array element is just a calculation
of its address.
CHAPTER 17. MANIPULATING SPECIFIC BIT(S) CHAPTER 17. MANIPULATING SPECIFIC BIT(S)
Chapter 17
A lot of functions define their input arguments as flags in bit fields. Of course, these could be replaced by a set of bool-typed variables, but that would not be frugal.
17.1.1 x86
Here we see the TEST instruction. However, it doesn’t take the whole second argument, only its most significant byte (ebp+dwDesiredAccess+3), and checks it for flag 0x40. Here that flag corresponds to GENERIC_WRITE (0x40000000). TEST is basically the same instruction as AND, but without saving the result. Recall that CMP is likewise merely SUB without saving the result ( 7.3.1 on page 35).
The logic of this code fragment is as follows:
if ((dwDesiredAccess&0x40000000) == 0) goto loc_7C83D417
If the AND operation leaves this bit set, the ZF flag is cleared and the JZ conditional jump is not triggered. The conditional jump is triggered only if the 0x40000000 bit is absent in the dwDesiredAccess variable. In that case, the result of AND is 0, ZF is set and the conditional jump is triggered.
#include <stdio.h>
int f(int a)
{
int rt=a;
return rt;
};
int main()
{
f(0x12340678);
};
17.2.1 x86
Non-optimizing MSVC
pop ebp
ret 0
_f ENDP
The OR instruction sets one bit while leaving the rest unchanged.
AND resets one bit. It can be said that AND just copies all bits except one. Indeed, in the second AND operand all the bits that need to be kept are set, and only the one we do not want to copy is cleared (it is 0 in the bitmask). This is an easier way to memorize the logic.
Optimizing MSVC
If we compile it in MSVC with optimization turned on (/Ox), the code is even shorter:
17.3 Shifts
Bit shifts in C/C++ are implemented using ≪ and ≫ operators.
The x86 ISA has the SHL (SHift Left) and SHR (SHift Right) instructions for this.
Shift instructions are often used for division and multiplication by powers of two, 2^n (e.g., 1, 2, 4, 8, etc.): 15.1.2 on page 88, 15.2.1 on page 91.
Shifting operations are also so important because they are often used for specific
bit isolation or for constructing a value of several scattered bits.
return rt;
};
int main()
{
f(0x12345678); // test
};
In this loop, the iteration counter i goes from 0 to 31, so the 1 ≪ i expression goes from 1 to 0x80000000. In natural language, we would describe this operation as "shift 1 left by n bits". In other words, the 1 ≪ i expression consecutively produces all possible bit positions in a 32-bit number. The freed bit on the right is always cleared.
Here is a table of all possible 1 ≪ i for i = 0..31:
2 modern x86 CPUs (supporting SSE4) even have a POPCNT instruction for it
C/C++ expression   Power of two   Decimal form   Hexadecimal form
1 ≪ 0              2^0            1              1
1 ≪ 1              2^1            2              2
1 ≪ 2              2^2            4              4
1 ≪ 3              2^3            8              8
1 ≪ 4              2^4            16             0x10
1 ≪ 5              2^5            32             0x20
1 ≪ 6              2^6            64             0x40
1 ≪ 7              2^7            128            0x80
1 ≪ 8              2^8            256            0x100
1 ≪ 9              2^9            512            0x200
1 ≪ 10             2^10           1024           0x400
1 ≪ 11             2^11           2048           0x800
1 ≪ 12             2^12           4096           0x1000
1 ≪ 13             2^13           8192           0x2000
1 ≪ 14             2^14           16384          0x4000
1 ≪ 15             2^15           32768          0x8000
1 ≪ 16             2^16           65536          0x10000
1 ≪ 17             2^17           131072         0x20000
1 ≪ 18             2^18           262144         0x40000
1 ≪ 19             2^19           524288         0x80000
1 ≪ 20             2^20           1048576        0x100000
1 ≪ 21             2^21           2097152        0x200000
1 ≪ 22             2^22           4194304        0x400000
1 ≪ 23             2^23           8388608        0x800000
1 ≪ 24             2^24           16777216       0x1000000
1 ≪ 25             2^25           33554432       0x2000000
1 ≪ 26             2^26           67108864       0x4000000
1 ≪ 27             2^27           134217728      0x8000000
1 ≪ 28             2^28           268435456      0x10000000
1 ≪ 29             2^29           536870912      0x20000000
1 ≪ 30             2^30           1073741824     0x40000000
1 ≪ 31             2^31           2147483648     0x80000000
These constant numbers (bit masks) very often appear in code, and a practicing reverse engineer must be able to spot them quickly. You probably don’t need to memorize the decimal numbers, but the hexadecimal ones are very easy to remember.
These constants are very often used for mapping flags to specific bits. For example, here is an excerpt from ssl_private.h in the Apache 2.4.6 source code:
/**
* Define the SSL options
*/
*/
#define SSL_OPT_NONE (0)
#define SSL_OPT_RELSET (1<<0)
#define SSL_OPT_STDENVVARS (1<<1)
#define SSL_OPT_EXPORTCERTDATA (1<<3)
#define SSL_OPT_FAKEBASICAUTH (1<<4)
#define SSL_OPT_STRICTREQUIRE (1<<5)
#define SSL_OPT_OPTRENEGOTIATE (1<<6)
#define SSL_OPT_LEGACYDNFORMAT (1<<7)
17.4.1 x86
MSVC
17.4.2 x64
#include <stdint.h>
int f(uint64_t a)
{
    uint64_t i;
    int rt=0;
    // count bits set in a
    for (i=0; i<64; i++)
        if (a & (1ULL<<i))
            rt++;
    return rt;
};
mov edx, 1
lea r8d, QWORD PTR [rax+64]
; R8D=64
npad 5
$LL4@f:
test rdx, rcx
; is this bit absent in the input value?
; then skip the next INC instruction.
je SHORT $LN3@f
inc eax ; rt++
$LN3@f:
rol rdx, 1 ; RDX=RDX<<1
dec r8 ; R8--
jne SHORT $LL4@f
fatret 0
f ENDP
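Here the ROL instruction is used instead of SHL. ROL is in fact “rotate left” rather than “shift left”, but in this example it works just like SHL. The single set bit in RDX reaches bit 63 only on the last iteration, so it never wraps around to bit 0 before the loop ends.
R8 here is counting from 64 down to 0. It’s just like an inverted i.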
Here is a table of some registers during the execution:
RDX R8
0x0000000000000001 64
0x0000000000000002 63
0x0000000000000004 62
0x0000000000000008 61
... ...
0x4000000000000000 2
0x8000000000000000 1
Optimizing MSVC 2012 does almost the same job as optimizing MSVC 2010, but for some reason it generates two identical loop bodies, and the loop count is now 32 instead of 64. To be honest, it’s hard to say why. Some optimization trick? Maybe it’s better for the loop body to be slightly longer? Anyway, this code is shown here to demonstrate that compiler output may sometimes be really weird and illogical, yet perfectly working.
17.5 Conclusion
Analogous to the C/C++ shifting operators ≪ and ≫, the shift instructions in x86
are SHR/SHL (for unsigned values) and SAR/SHL (for signed values).
The shift instructions in ARM are LSR/LSL (for unsigned values) and ASR/LSL (for
signed values). It’s also possible to add shift suffix to some instructions (which are
called “data processing instructions”).
Sometimes, AND is used instead of TEST, but the flags that are set are the same.
This is usually done by this C/C++ code snippet (shift value by n bits right, then cut
off lowest bit):
Or (shift 1 bit n times left, isolate this bit in input value and check if it’s not zero):
CHAPTER 18. LINEAR CONGRUENTIAL GENERATOR CHAPTER 18. LINEAR CONGRUENTIAL GENERATOR
Chapter 18
The linear congruential generator is probably the simplest possible way to generate
random numbers. It’s not in favour in modern times1 , but it’s so simple (just one
multiplication, one addition and one AND operation), we can use it as an example.
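#include <stdint.h>
#define RNG_a 1664525
#define RNG_c 1013904223
static uint32_t rand_state;
void my_srand (uint32_t init)
{
    rand_state=init;
}
int my_rand ()
{
    rand_state=rand_state*RNG_a;
    rand_state=rand_state+RNG_c;
    return rand_state & 0x7fff;
}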
There are two functions: the first one is used to initialize the internal state, and the
second one is called to generate pseudorandom numbers.
We see that two constants are used in the algorithm. They are taken from [Pre+07].
Let’s define them using a #define C/C++ statement. It’s a macro. The difference between a C/C++ macro and a constant is that the C/C++ preprocessor replaces all macros with their values, so macros don’t take any memory, unlike variables. A constant, in contrast, is a read-only variable. It’s possible to take a pointer to (the address of) a constant variable, but impossible to do so with a macro.
The last AND operation keeps the result in the 0..32767 range, like the rand() function in MSVC, whose RAND_MAX is 32767. The C standard only requires RAND_MAX to be at least 32767. If you want to get 32-bit pseudorandom values, just omit the last AND operation.
18.1 x86
_init$ = 8
_srand PROC
mov eax, DWORD PTR _init$[esp-4]
mov DWORD PTR _rand_state, eax
ret 0
_srand ENDP
_TEXT SEGMENT
_rand PROC
imul eax, DWORD PTR _rand_state, 1664525
add eax, 1013904223 ; 3c6ef35fH
mov DWORD PTR _rand_state, eax
and eax, 32767 ; 00007fffH
ret 0
_rand ENDP
_TEXT ENDS
Here we see it: both constants are embedded into the code. There is no memory
allocated for them. The my_srand() function just copies its input value into the
internal rand_state variable.
my_rand() takes it, calculates the next rand_state, cuts it and leaves it in the
EAX register.
_init$ = 8
_srand PROC
push ebp
mov ebp, esp
mov eax, DWORD PTR _init$[ebp]
mov DWORD PTR _rand_state, eax
pop ebp
ret 0
_srand ENDP
_TEXT SEGMENT
_rand PROC
push ebp
mov ebp, esp
imul eax, DWORD PTR _rand_state, 1664525
mov DWORD PTR _rand_state, eax
mov ecx, DWORD PTR _rand_state
add ecx, 1013904223 ; 3c6ef35fH
mov DWORD PTR _rand_state, ecx
mov eax, DWORD PTR _rand_state
and eax, 32767 ; 00007fffH
pop ebp
ret 0
_rand ENDP
_TEXT ENDS
18.2 x64
The x64 version is mostly the same and uses 32-bit registers instead of 64-bit ones (because we are working with int values here). But my_srand() takes its input argument from the ECX register rather than from the stack:
init$ = 8
my_srand PROC
; ECX = input argument
mov DWORD PTR rand_state, ecx
ret 0
my_srand ENDP
_TEXT SEGMENT
my_rand PROC
imul eax, DWORD PTR rand_state, 1664525 ; 0019660dH
add eax, 1013904223 ; 3c6ef35fH
mov DWORD PTR rand_state, eax
and eax, 32767 ; 00007fffH
ret 0
my_rand ENDP
_TEXT ENDS
CHAPTER 19. STRUCTURES CHAPTER 19. STRUCTURES
Chapter 19
Structures
A C/C++ structure, with some assumptions, is just a set of variables that are always stored in memory together and are not necessarily of the same type1.
void main()
{
SYSTEMTIME t;
GetSystemTime (&t);
return;
};
16 bytes are allocated for this structure in the local stack: that is exactly sizeof(WORD)*8 (there are 8 WORD variables in the structure).
Pay attention to the fact that the structure begins with the wYear field. It can be said that a pointer to the SYSTEMTIME structure is passed to GetSystemTime()3. It can equally be said that a pointer to the wYear field is passed, and that is the same thing! GetSystemTime() writes the current year to the WORD the pointer points to, then moves 2 bytes ahead, writes the current month, and so on.
The fact that the structure fields are just variables located side-by-side, can be
easily demonstrated by doing the following. Keeping in mind the SYSTEMTIME
structure description, it’s possible to rewrite this simple example like this:
#include <windows.h>
#include <stdio.h>
void main()
{
WORD array[8];
GetSystemTime (array);
return;
};
push ebp
mov ebp, esp
sub esp, 16
lea eax, DWORD PTR _array$[ebp]
push eax
call DWORD PTR __imp__GetSystemTime@4
movzx ecx, WORD PTR _array$[ebp+12] ; wSecond
push ecx
movzx edx, WORD PTR _array$[ebp+10] ; wMinute
push edx
movzx eax, WORD PTR _array$[ebp+8] ; wHour
push eax
movzx ecx, WORD PTR _array$[ebp+6] ; wDay
push ecx
movzx edx, WORD PTR _array$[ebp+2] ; wMonth
push edx
movzx eax, WORD PTR _array$[ebp] ; wYear
push eax
push OFFSET $SG78573
call _printf
add esp, 28
xor eax, eax
mov esp, ebp
pop ebp
ret 0
_main ENDP
void main()
{
    SYSTEMTIME *t;
    t=(SYSTEMTIME *)malloc (sizeof (SYSTEMTIME));
    GetSystemTime (t);
    free (t);
    return;
};
Let’s compile it now with optimization (/Ox), so that it is easy to see what we need.
void main()
{
    WORD *t;
    t=(WORD *)malloc (16);
    GetSystemTime (t);
    free (t);
    return;
};
We get:
_main PROC
push esi
push 16
call _malloc
add esp, 4
mov esi, eax
push esi
call DWORD PTR __imp__GetSystemTime@4
movzx eax, WORD PTR [esi+12]
movzx ecx, WORD PTR [esi+10]
movzx edx, WORD PTR [esi+8]
push eax
movzx eax, WORD PTR [esi+6]
push ecx
movzx ecx, WORD PTR [esi+2]
push edx
movzx edx, WORD PTR [esi]
push eax
push ecx
push edx
push OFFSET $SG78594
call _printf
push esi
call _free
add esp, 32
xor eax, eax
pop esi
ret 0
_main ENDP
Again, we get code that cannot be distinguished from the previous one. And again, note that you should not do this in practice unless you really know what you are doing.
#include <stdio.h>
struct s
{
char a;
int b;
char c;
int d;
};
void f(struct s s)
{
printf ("a=%d; b=%d; c=%d; d=%d\n", s.a, s.b, s.c, s.d);
};
int main()
{
struct s tmp;
tmp.a=1;
tmp.b=2;
tmp.c=3;
tmp.d=4;
f(tmp);
};
As we see, there are two char fields (exactly one byte each) and two int fields (4 bytes each).
19.3.1 x86
We pass the structure as a whole, but, as we can see, the structure is in fact copied to a temporary one. A place in the stack is allocated for it in line 10, and then all 4 fields are copied one by one in lines 12 … 19. Then its pointer (address) is passed. The structure is copied because it’s not known whether the f() function is going to modify it. Even if f() changes its copy, the structure in main() has to remain as it was. We could use C/C++ pointers, and the resulting code would be almost the same, but without the copying.
As we can see, each field’s address is aligned on a 4-byte boundary. That’s why
each char occupies 4 bytes here (like int). Why? Because it is easier for the CPU to
access memory at aligned addresses and to cache data from it.
However, it is not very economical.
Let’s try to compile it with option (/Zp1) (/Zp[n] pack structures on n-byte boundary).
Listing 19.7: MSVC 2012 /GS- /Zp1
_main PROC
push ebp
mov ebp, esp
sub esp, 12
mov BYTE PTR _tmp$[ebp], 1 ; set field a
Now the structure takes only 10 bytes and each char value takes 1 byte. What does this give us? Size economy. The drawback is that the CPU accesses these fields more slowly than it could.
The structure is also copied in main(). Not field by field, but directly as 10 bytes, using three pairs of MOV. Why not 4? The compiler decided that it’s better to copy 10 bytes using 3 MOV pairs than to copy two 32-bit words and two bytes using 4 MOV pairs.
As can easily be guessed, if the structure is used in many source and object files, all of them must be compiled with the same convention about structure packing.
Aside from the MSVC /Zp option, which sets how each structure field is aligned, there is also the #pragma pack compiler directive, which can be specified right in the source code. It is available in both MSVC5 and GCC6.
Let’s get back to the SYSTEMTIME structure, which consists of 16-bit fields. How does our compiler know to pack them on a 1-byte alignment boundary?
WinNT.h file has this:
And this:
This tells the compiler how to pack the structures defined after #pragma pack.
5 MSDN: Working with Packing Structures
6 Structure-Packing Pragmas
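#include <stdio.h>
struct inner_struct
{
    int a;
    int b;
};
struct outer_struct
{
    char a;
    int b;
    struct inner_struct c;
    char d;
    int e;
};
void f(struct outer_struct s)
{
    printf ("a=%d; b=%d; c.a=%d; c.b=%d; d=%d; e=%d\n",
        s.a, s.b, s.c.a, s.c.b, s.d, s.e);
};
int main()
{
    struct outer_struct s;
    s.a=1;
    s.b=2;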
s.c.a=100;
s.c.b=101;
s.d=3;
s.e=4;
f(s);
};
… in this case, both inner_struct fields are to be placed between the a,b and
d,e fields of the outer_struct.
Let’s compile (MSVC 2010):
_TEXT SEGMENT
_s$ = 8
_f PROC
mov eax, DWORD PTR _s$[esp+16]
movsx ecx, BYTE PTR _s$[esp+12]
mov edx, DWORD PTR _s$[esp+8]
push eax
mov eax, DWORD PTR _s$[esp+8]
push ecx
mov ecx, DWORD PTR _s$[esp+8]
push edx
movsx edx, BYTE PTR _s$[esp+8]
push eax
push ecx
push edx
push OFFSET $SG2802 ; 'a=%d; b=%d; c.a=%d; c.b=%d; d=%d; e=%d'
call _printf
add esp, 28
ret 0
_f ENDP
_s$ = -24
_main PROC
sub esp, 24
push ebx
push esi
push edi
mov ecx, 2
sub esp, 24
mov eax, esp
One curious thing here is that by looking at this assembly code, we do not even see that another structure was used inside of it! So we could say that nested structures are unfolded into a linear, one-dimensional structure.
Of course, if we replace the struct inner_struct c; declaration with struct inner_struct *c; (thus making a pointer here), the situation will be quite different.
C/C++ allows defining the exact number of bits for each structure field. This is very useful if one needs to save memory space. For example, one bit is enough for a bool variable. But of course, it is not rational if speed is important.
Let’s consider the CPUID7 instruction example. This instruction returns information
about the current CPU and its features.
If EAX is set to 1 before the instruction is executed, CPUID returns this information packed into the EAX register:
7 wikipedia
3:0 (4 bits) Stepping
7:4 (4 bits) Model
11:8 (4 bits) Family
13:12 (2 bits) Processor Type
19:16 (4 bits) Extended Model
27:20 (8 bits) Extended Family
MSVC 2010 has the __cpuid() intrinsic, but GCC 4.4.1 does not. So let’s write this function ourselves for GCC, with the help of its built-in assembler8.
#include <stdio.h>
#ifdef __GNUC__
static inline void cpuid(int code, int *a, int *b, int *c, int *d) {
    asm volatile("cpuid":"=a"(*a),"=b"(*b),"=c"(*c),"=d"(*d):"a"(code));
}
#endif
#ifdef _MSC_VER
#include <intrin.h>
#endif
struct CPUID_1_EAX
{
unsigned int stepping:4;
unsigned int model:4;
unsigned int family_id:4;
unsigned int processor_type:2;
unsigned int reserved1:2;
unsigned int extended_model_id:4;
unsigned int extended_family_id:8;
unsigned int reserved2:4;
};
int main()
{
struct CPUID_1_EAX *tmp;
int b[4];
#ifdef _MSC_VER
__cpuid(b,1);
#endif
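#ifdef __GNUC__
    cpuid (1, &b[0], &b[1], &b[2], &b[3]);
#endif
    // treat the EAX value (b[0]) as a CPUID_1_EAX structure
    tmp=(struct CPUID_1_EAX *)&b[0];
    printf ("stepping=%d\n", tmp->stepping);
    printf ("model=%d\n", tmp->model);
    printf ("family_id=%d\n", tmp->family_id);
    printf ("processor_type=%d\n", tmp->processor_type);
    printf ("extended_model_id=%d\n", tmp->extended_model_id);
    printf ("extended_family_id=%d\n", tmp->extended_family_id);
    return 0;
};
8 More about internal GCC assembler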
After CPUID fills EAX/EBX/ECX/EDX, these registers are written to the b[] array. Then we take a pointer to the CPUID_1_EAX structure and point it to the value of EAX in the b[] array.
In other words, we treat a 32-bit int value as a structure. Then we read specific bits
from the structure.
MSVC
push eax
push OFFSET $SG15435 ; 'stepping=%d', 0aH, 00H
call _printf
shr esi, 20
and esi, 255
push esi
push OFFSET $SG15440 ; 'extended_family_id=%d', 0aH, 00H
call _printf
add esp, 48
pop esi
add esp, 16
ret 0
_main ENDP
The SHR instruction shifts the value right by the number of bits that must be
skipped, i.e., it discards the bits on the right side that we do not need.
The AND instruction then clears the unneeded bits on the left; in other words, it
leaves only the bits we need in the register.
CHAPTER 20. 64-BIT VALUES IN 32-BIT ENVIRONMENT CHAPTER 20. 64-BIT VALUES IN 32-BIT ENVIRONMENT
Chapter 20
64-bit values in 32-bit environment
#include <stdint.h>
uint64_t f ()
{
return 0x1234567890ABCDEF;
};
20.1.1 x86
In a 32-bit environment, 64-bit values are returned from functions in the EDX:EAX
register pair.
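#include <stdio.h>
#include <stdint.h>

uint64_t f_add (uint64_t a, uint64_t b)
{
return a+b;
};

void f_add_test ()
{
#ifdef __GNUC__
printf ("%lld\n", f_add(12345678901234, 23456789012345));
#else
printf ("%I64d\n", f_add(12345678901234, 23456789012345));
#endif
};

uint64_t f_sub (uint64_t a, uint64_t b)
{
return a-b;
};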
20.2.1 x86
_f_add_test PROC
push 5461 ; 00001555H
push 1972608889 ; 75939f79H
push 2874 ; 00000b3aH
push 1942892530 ; 73ce2ff2H
call _f_add
push edx
push eax
push OFFSET $SG1436 ; '%I64d', 0aH, 00H
call _printf
add esp, 28
ret 0
_f_add_test ENDP
_f_sub PROC
mov eax, DWORD PTR _a$[esp-4]
sub eax, DWORD PTR _b$[esp-4]
mov edx, DWORD PTR _a$[esp]
sbb edx, DWORD PTR _b$[esp]
ret 0
_f_sub ENDP
We can see in the f_add_test() function that each 64-bit value is passed as two
32-bit values: the high part is pushed first, then the low part.
When adding, the low 32-bit parts are added first. If a carry occurs during this
addition, the CF flag is set. The following ADC instruction adds the high parts of
the values, and also adds 1 if CF = 1.
Subtraction is also done in pairs. The first SUB may set the CF flag (here it signals
a borrow), which is then checked by the subsequent SBB instruction: if the carry
flag is set, 1 is additionally subtracted from the result.
It is easy to see how the result of f_add() is then passed to printf(): EDX (the high part) is pushed first, then EAX (the low part).
#include <stdint.h>

uint64_t f_div (uint64_t a, uint64_t b)
{
return a/b;
};

uint64_t f_rem (uint64_t a, uint64_t b)
{
return a % b;
};
20.3.1 x86
_a$ = 8 ; size = 8
_b$ = 16 ; size = 8
_f_div PROC
push ebp
mov ebp, esp
mov eax, DWORD PTR _b$[ebp+4]
push eax
mov ecx, DWORD PTR _b$[ebp]
push ecx
mov edx, DWORD PTR _a$[ebp+4]
push edx
mov eax, DWORD PTR _a$[ebp]
push eax
call __aulldiv ; unsigned long long division
pop ebp
ret 0
_f_div ENDP
_a$ = 8 ; size = 8
_b$ = 16 ; size = 8
_f_rem PROC
push ebp
mov ebp, esp
mov eax, DWORD PTR _b$[ebp+4]
push eax
mov ecx, DWORD PTR _b$[ebp]
push ecx
mov edx, DWORD PTR _a$[ebp+4]
push edx
mov eax, DWORD PTR _a$[ebp]
push eax
call __aullrem ; unsigned long long remainder
pop ebp
ret 0
_f_rem ENDP
Multiplication and division are more complex operations, so the compiler usually
emits calls to library functions that perform them (here, __aulldiv and __aullrem).
#include <stdint.h>
uint64_t f (uint64_t a)
{
return a>>7;
};
20.4.1 x86
Shifting also occurs in two passes: first the lower part is shifted, then the higher
part. The lower part is shifted with the help of the SHRD instruction: it shifts
the value in EAX by 7 bits, pulling the new bits in from EDX, i.e., from the higher part.
The higher part is shifted using the more popular SHR instruction: indeed, since the
value is unsigned, the freed bits in the higher part must be filled with zeroes.
#include <stdint.h>
int64_t f (int32_t a)
{
return a;
};
20.5.1 x86
Here we need to extend a 32-bit signed value into a 64-bit signed one. Unsigned
values are converted straightforwardly: all bits in the higher part must be set to
0. But this is not appropriate for signed data types: the sign has to be copied
into the higher part of the resulting number. The CDQ instruction does that here:
it takes its input value in EAX, extends it to 64 bits and leaves it in the EDX:EAX
register pair. In other words, CDQ gets the sign of the number from EAX (by taking
the most significant bit of EAX) and, depending on it, sets all 32 bits in EDX to
0 or 1. Its operation is somewhat similar to that of the MOVSX instruction.
CHAPTER 21. 64 BITS CHAPTER 21. 64 BITS
Chapter 21
64 bits
21.1 x86-64
It is a 64-bit extension to the x86 architecture.
From the reverse engineer’s perspective, the most important changes are:
• Almost all registers (except FPU and SIMD) were extended to 64 bits and got
an R- prefix. 8 additional registers were added. The GPRs are now: RAX, RBX, RCX,
RDX, RBP, RSP, RSI, RDI, R8, R9, R10, R11, R12, R13, R14, R15.
It is still possible to access the older register parts as usual. For example, it
is possible to access the lower 32-bit part of the RAX register using EAX:
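Byte layout (the 7th byte is the most significant, the 0th the least significant):
RAX (x64): bytes 7th-0th (all 8 bytes)
EAX: bytes 3rd-0th (lower 4 bytes)
AX: bytes 1st-0th (lower 2 bytes)
AH: 1st byte; AL: 0th byte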
The new R8-R15 registers also have their lower parts: R8D-R15D (lower
32-bit parts), R8W-R15W (lower 16-bit parts), R8L-R15L (lower 8-bit parts).
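Byte layout of R8 (the same applies to R9-R15):
R8: bytes 7th-0th (all 8 bytes)
R8D: bytes 3rd-0th (lower 4 bytes)
R8W: bytes 1st-0th (lower 2 bytes)
R8L: 0th byte (lowest byte)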
The number of SIMD registers was doubled from 8 to 16: XMM0-XMM15.
Part II
Important fundamentals
CHAPTER 22. SIGNED NUMBER REPRESENTATIONS CHAPTER 22. SIGNED NUMBER REPRESENTATIONS
Chapter 22
Signed number representations
There are several methods for representing signed numbers1 , but “two’s comple-
ment” is the most popular one in computers.
Here is a table for some byte values:
binary     hexadecimal  unsigned  signed (2’s complement)
01111111   0x7f         127       127
01111110   0x7e         126       126
...
00000110   0x6          6         6
00000101   0x5          5         5
00000100   0x4          4         4
00000011   0x3          3         3
00000010   0x2          2         2
00000001   0x1          1         1
00000000   0x0          0         0
11111111   0xff         255       -1
11111110   0xfe         254       -2
11111101   0xfd         253       -3
11111100   0xfc         252       -4
11111011   0xfb         251       -5
11111010   0xfa         250       -6
...
10000010   0x82         130       -126
10000001   0x81         129       -127
10000000   0x80         128       -128
1 wikipedia
The difference between signed and unsigned numbers is that if we treat 0xFFFFFFFE
and 0x00000002 as unsigned, then the first number (4294967294) is bigger than
the second one (2). If we treat them both as signed, the first one is −2, and
it is smaller than the second (2). That is the reason why conditional jumps ( 11 on
page 50) are present both for signed (e.g. JG, JL) and unsigned (JA, JB) operations.
CHAPTER 23. MEMORY CHAPTER 23. MEMORY
Chapter 23
Memory
was called on it, which is very dangerous. Example in this book: 19.2 on
page 132.
Part III
Finding important/interesting
stuff in the code
Minimalism is not a prominent feature of modern software.
This is not because programmers write a lot, but because many libraries are
commonly linked statically into executable files. If all external libraries were moved
into external DLL files, the world would be different. (Another reason, for C++,
is the STL and other template libraries.)
Thus, it is very important to determine the origin of a function: whether it comes from a standard
library or a well-known library (like Boost1 , libpng2 ), or whether it is related to what we are
trying to find in the code.
It is just absurd to rewrite all the code in C/C++ to find what we’re looking for.
One of the primary tasks of a reverse engineer is to quickly find the code he/she
needs.
The IDA disassembler allows us to search for text strings, byte sequences and
constants. It is even possible to export the code to .lst or .asm text files and then
use grep, awk, etc.
When you try to understand what some code is doing, it could easily turn out to be some
open-source library like libpng. So when you see constants or text strings
which look familiar, it is always worth googling them. And if you find the open-source
project where they are used, then it’s enough just to compare the functions.
This may solve part of the problem.
For example, if a program uses XML files, the first step may be determining which
XML library is used for processing, since standard (or well-known) libraries are
usually used instead of self-made ones.
For example, the author of these lines once tried to understand how the
compression/decompression of network packets worked in SAP 6.0. It is a huge piece of software,
but a detailed .PDB with debugging information is present, and that is convenient.
He finally came to the conclusion that one of the functions, called CsDecomprLZC, was
doing the decompression of network packets. He immediately googled its name
and quickly found that the function was used in MaxDB (an open-source SAP project).
http://www.google.com/search?q=CsDecomprLZC
Astoundingly, MaxDB and SAP 6.0 software shared the same code for the
compression/decompression of network packets.
1 http://go.yurichev.com/17036
2 http://go.yurichev.com/17037
CHAPTER 24. COMMUNICATION WITH THE OUTER WORLD (WIN32) CHAPTER 24. COMMUNICATION WITH THE OUTER WORLD (WIN32)
Chapter 24
Communication with the outer world (Win32)
Sometimes it’s enough to observe some function’s inputs and outputs in order to
understand what it does. That way you can save time.
Files and registry access: for very basic analysis, the Process Monitor1 utility from
SysInternals can help.
For basic analysis of network accesses, Wireshark2 can be useful.
But you will have to look inside anyway.
The first thing to look for is which functions from the OS’s APIs3 and standard
libraries are used.
If the program is divided into a main executable file and a group of DLL files, some-
times the names of the functions in these DLLs can help.
If we are interested in exactly what can lead to a call to MessageBox() with
specific text, we can try to find this text in the data segment, find the references to
it and find the points from which the control may be passed to the MessageBox()
call we’re interested in.
If we are talking about a video game and we’re interested in which events in it are more
or less random, we may try to find the rand() function or its replacements
(like the Mersenne twister algorithm), find the places from which those functions
are called and, more importantly, how the results are used.
1 http://go.yurichev.com/17301
2 http://go.yurichev.com/17303
3 Application programming interface
But if it is not a game, and rand() is still used, it is also interesting to know why.
There are cases of unexpected rand() usage in data compression algorithms (for
encryption imitation): blog.yurichev.com.
Or, let’s set INT3 breakpoints on all functions with the xml prefix in their name:
--one-time-INT3-bp:somedll.dll!xml.*
The downside is that such breakpoints are triggered only once: tracer will show the
call of a function, if it happens, but only once. Another drawback is that it is
impossible to see the function’s arguments.
Nevertheless, this feature is very useful when you know that the program uses a
DLL, but you do not know which functions are actually used. And there are a lot of
functions.
For example, let’s see what the uptime utility from cygwin uses:
tracer -l:uptime.exe --one-time-INT3-bp:cygwin1.dll!.*
Thus we can see all cygwin1.dll library functions that were called at least
once, and where they were called from:
One-time INT3 breakpoint: cygwin1.dll!__main (called from uptime.exe!OEP+0x6d (0x40106d))
One-time INT3 breakpoint: cygwin1.dll!_geteuid32 (called from uptime.exe!OEP+0xba3 (0x401ba3))
One-time INT3 breakpoint: cygwin1.dll!_getuid32 (called from uptime.exe!OEP+0xbaa (0x401baa))
One-time INT3 breakpoint: cygwin1.dll!_getegid32 (called from uptime.exe!OEP+0xcb7 (0x401cb7))
24 MSDN
CHAPTER 25. STRINGS CHAPTER 25. STRINGS
Chapter 25
Strings
25.1.1 C/C++
A minor difference was that the unit of I/O was the word, not
the byte, because the PDP-7 was a word-addressed machine. In
practice this meant merely that all programs dealing with character
streams ignored null characters, because null was used to pad
a file to an even number of characters.
The string in Pascal and Borland Delphi is preceded by an 8-bit or 32-bit string
length.
For example:
...
CODE:00518AFC dd 10h
CODE:00518B00 aPreparingRun__ db 'Preparing run...',0
25.1.3 Unicode
Often, what is called Unicode is a method of encoding strings where each character
occupies 2 bytes, or 16 bits. This is a common terminological mistake. Unicode
is a standard for assigning a number to each character of the many writing systems
of the world, but it does not describe the encoding method.
The most popular encoding methods are UTF-8 (widespread on the Internet and in *NIX
systems) and UTF-16LE (used in Windows).
UTF-8
UTF-8 is one of the most successful methods for encoding characters. All Latin
symbols are encoded just like in ASCII, and the symbols beyond the ASCII table
are encoded using several bytes. 0 is encoded as before, so all standard C string
functions work with UTF-8 strings just like any other string.
Let’s see how the symbols in various languages are encoded in UTF-8 and how this
looks in FAR, using the 437 codepage 1 :
As you can see, the English language string looks the same as it does in ASCII. The
Hungarian language uses some Latin symbols plus symbols with diacritic marks.
These symbols are encoded using several bytes, which are underlined in red.
It’s the same story with the Icelandic and Polish languages. There is also the “Euro”
currency symbol at the start, which is encoded with 3 bytes. The rest of the writing
systems here have no connection with Latin. At least in Russian, Arabic, Hebrew
and Hindi we can see some recurring bytes, and that is not surprising: all symbols
of a writing system are usually located in the same Unicode block, so their codes
begin with the same numbers.
1 The example and translations were taken from here: http://go.yurichev.com/17304
At the beginning, before the “How much?” string, we see 3 bytes, which are in fact
the BOM2 . The BOM identifies the encoding being used.
UTF-16LE
Many win32 functions in Windows have the suffixes -A and -W. The first type of
functions works with normal strings, the other with UTF-16LE (wide) strings. In
the second case, each symbol is usually stored in a 16-bit value of type short.
In Hiew or FAR, the Latin symbols in UTF-16 strings look as if they are interleaved
with zero bytes:
#include <wchar.h>
int wmain()
{
wprintf (L"Hello, world!\n");
return 0;
};
Strings with characters that occupy exactly 2 bytes are called “Unicode” in IDA:
.data:0040E000 aHelloWorld:
.data:0040E000 unicode 0, <Hello, world!>
.data:0040E000 dw 0Ah, 0
What we can easily spot is that the symbols are interleaved with the diamond
character (the character with code 4 in codepage 437). Indeed, the Cyrillic symbols
are located in the 0x400-0x4FF block of Unicode 3 . Hence, in UTF-16LE every Cyrillic
symbol occupies two bytes, and the second (high) byte is always 0x04.
3 wikipedia
Let’s go back to the example with the string written in multiple languages. Here is
how it looks in UTF-16LE.
Here we can also see the BOM at the beginning. All Latin characters are interleaved
with a zero byte. Some characters with diacritic marks (Hungarian and Icelandic
languages) are also underlined in red.
25.1.4 Base64
The base64 encoding is highly popular for cases when you need to transfer
binary data as a text string. In essence, this algorithm encodes 3 binary bytes into
4 printable characters, using all 52 Latin letters (26 lower case and 26 upper case),
10 digits, the plus sign (“+”) and the slash sign (“/”): 64 characters in total.
One distinctive feature of base64 strings is that they often (but not always) end
with 1 or 2 padding equals signs (“=”), which appear when the length of the input
data is not a multiple of 3, for example:
AVjbbVSVfcUMu1xvjaMgjNtueRwBbxnyJw8dpGnLW8ZW8aKG3v4Y0icuQT+qEJAp9lAOuWs=
WVjbbVSVfcUMu1xvjaMgjNtueRwBbxnyJw8dpGnLW8ZW8aKG3v4Y0icuQT+qEJAp9lAOuQ==
The equals sign (“=”) is never encountered in the middle of base64-encoded strings.
science algorithm which uses such strange byte sequences. And it doesn’t look
like an error or debugging message. So it’s a good idea to inspect the usage of
such weird strings.
Sometimes, such strings are encoded using base64. So it’s a good idea to
decode them all and scan them visually; even a glance should be enough.
More precisely, this method of hiding backdoors is called “security through obscurity”.
CHAPTER 26. CALLS TO ASSERT() CHAPTER 26. CALLS TO ASSERT()
Chapter 26
Calls to assert()
Sometimes the presence of the assert() macro is useful too: commonly this
macro leaves the source file name, line number and condition in the code.
The most useful information is contained in the assert’s condition: we can deduce
variable names or structure field names from it. Other useful pieces of information
are the file names: from them we can try to deduce what type of code is there. It is
also possible to recognize well-known open-source libraries by their file names.
...
...
It is advisable to “google” both the conditions and file names, which can lead us to
an open-source library. For example, if we “google” “sp->lzw_nbits <= BITS_MAX”,
this predictably gives us some open-source code related to LZW compression.
CHAPTER 27. CONSTANTS CHAPTER 27. CONSTANTS
Chapter 27
Constants
Humans, including programmers, often use round numbers like 10, 100, 1000, in
real life as well as in the code.
The practicing reverse engineer usually knows them well in hexadecimal
representation: 10=0xA, 100=0x64, 1000=0x3E8, 10000=0x2710.
The constants 0xAAAAAAAA (10101010101010101010101010101010) and
0x55555555 (01010101010101010101010101010101) are also popular—those
are composed of alternating bits. That may help to distinguish some signal from
the signal where all bits are turned on (1111 …) or off (0000 …). For example, the
0x55AA constant is used at least in the boot sector, MBR1 , and in the ROM2 of
IBM-compatible extension cards.
Some algorithms, especially cryptographic ones, use distinctive constants, which are
easy to find in code using IDA.
For example, the MD53 algorithm initializes its own internal variables like this:
var int h0 := 0x67452301
var int h1 := 0xEFCDAB89
var int h2 := 0x98BADCFE
var int h3 := 0x10325476
If you find these four constants used in the code in a row, it is highly probable
that this function is related to MD5.
…or by calling a function for comparing memory blocks like memcmp() or any other
equivalent code up to a CMPSB instruction.
When you find such a point, you can already say where the loading of the MIDI file
starts; we can also see the location of the buffer with the contents of the MIDI
file, what is used from the buffer, and how.
27.1.1 DHCP
This applies to network protocols as well. For example, the DHCP protocol’s network
packets contain the so-called magic cookie: 0x63538263. Any code that generates
DHCP packets must embed this constant into the packet somewhere. If we find it in
the code, we can find where this happens. And not only that: any program which can
receive DHCP packets must verify the magic cookie by comparing it with the constant.
4 wikipedia
5 wikipedia
For example, let’s take the dhcpcore.dll file from Windows 7 x64 and search for the
constant. And we can find it, twice: it seems that the constant is used in two func-
tions with descriptive names like DhcpExtractOptionsForValidation() and
DhcpExtractFullOptions():
And here are the places where these constants are accessed:
And:
6 GitHub
CHAPTER 28. FINDING THE RIGHT INSTRUCTIONS CHAPTER 28. FINDING THE RIGHT INSTRUCTIONS
Chapter 28
Finding the right instructions
If the program is utilizing FPU instructions and there are very few of them in the
code, one can try to check each one manually with a debugger.
Let’s say we are interested in how Microsoft Excel calculates the formulae
entered by the user, for example, the division operation.
Let’s load excel.exe (from Office 2010) version 14.0.4756.1000 into IDA, make a full
listing and find every FDIV instruction (except the ones which use constants as
a second operand: obviously, they do not suit us):
We can enter a string like =(1/3) in Excel and check each instruction.
FPU StatusWord=
FPU ST(0): 1.000000
ST(0) holds the first argument (1) and the second one is in [EBX].
.text:3011E91B DD 1E fstp qword ptr [esi]
Excel shows 666 in the cell, finally convincing us that we have found the right
point.
If we try the same Excel version, but in x64, we will find only 12 FDIV instructions
there, and the one we are looking for is the third one.
tracer.exe -l:excel.exe bpx=excel.exe!BASE+0x1B7FCC,set(st0,666)
It seems that a lot of division operations on float and double types were replaced
by the compiler with SSE instructions like DIVSD (DIVSD is present 268 times in
total).
CHAPTER 29. SUSPICIOUS CODE PATTERNS CHAPTER 29. SUSPICIOUS CODE PATTERNS
Chapter 29
Suspicious code patterns
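This AWK script can be used for processing IDA listing (.lst) files. The idea is to print
XOR instructions with two different operands (XOR reg, reg just clears a register), skipping
those where the second operand is ESP or EBP:
gawk -e '$2=="xor" { tmp=substr($3, 1, length($3)-1); if (tmp!=$4) if($4!="esp") if ($4!="ebp") { print $1, $2, tmp, ",", $4 } }' filename.lst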
Commonly there is no fixed system for passing arguments to functions in hand-written code.
Indeed, if we look in the WRK1 v1.2 source code, this code can be found easily in
the file WRK-v1.2\base\ntos\ke\i386\cpu.asm.
CHAPTER 30. USING MAGIC NUMBERS WHILE TRACING CHAPTER 30. USING MAGIC NUMBERS WHILE TRACING
Chapter 30
Using magic numbers while tracing
Often, our main goal is to understand how the program uses a value that was either
read from file or received via network. The manual tracing of a value is often a very
labour-intensive task. One of the simplest techniques for this (although not 100%
reliable) is to use your own magic number.
This resembles X-ray computed tomography in some sense: a radiocontrast agent
is injected into the patient’s blood, which is then used to improve the visibility of
the patient’s internal structure to X-rays. It is well known how the blood of
healthy humans percolates in the kidneys, and if the agent is in the blood, it can
easily be seen on tomography how the blood is percolating and whether there are
any stones or tumors.
We can take a 32-bit number like 0x0badf00d, or someone’s birth date like 0x11101979
and write this 4-byte number to some point in a file used by the program we inves-
tigate.
Then, while tracing this program with tracer in code coverage mode, with the help
of grep or just by searching in the text file (of tracing results), we can easily see
where the value was used and how.
Example of grepable tracer results in cc mode:
0x150bf66 (_kziaia+0x14), e= 1 [MOV EBX, [EBP+8]] [EBP+8]=0xf59c934
0x150bf69 (_kziaia+0x17), e= 1 [MOV EDX, [69AEB08h]] [69AEB08h]=0
0x150bf6f (_kziaia+0x1d), e= 1 [FS: MOV EAX, [2Ch]]
This can be used for network packets as well. It is important for the magic number
to be unique and not to be present in the program’s code.
Aside from tracer, DosBox (an MS-DOS emulator) in heavydebug mode is able to write
information about the state of all registers for each executed instruction of the program
to a plain text file1 , so this technique may be useful for DOS programs as well.
CHAPTER 31. OTHER THINGS CHAPTER 31. OTHER THINGS
Chapter 31
Other things
memory on these, but the game usually consumes even less memory) and you know
that you have, let’s say, 100 bullets now, you can make a “snapshot” of all memory and
back it up to some place. Then shoot once: the bullet count goes to 99. Make a second
“snapshot” and then compare both: there must be a byte somewhere which
was 100 in the beginning, and now it is 99. Considering the fact that these 8-bit
games were often written in assembly language and such variables were global,
it can be said for sure which address in memory was holding the bullet count. If
you searched for all references to the address in the disassembled game code, it
was not very hard to find a piece of code decrementing the bullet count, then to
write a NOP instruction there, or a couple of NOP-s, and then have a game with
100 bullets forever. Games on these 8-bit computers were commonly loaded at
a constant address, and there were not many different versions of each game
(commonly just one version was popular for a long span of time), so enthusiastic
gamers knew which bytes must be overwritten (using the BASIC instruction POKE)
at which address in order to hack it. This led to “cheat” lists that contained POKE
instructions, published in magazines related to 8-bit games. See also: wikipedia.
Likewise, it is easy to modify “high score” files, and this works not only with 8-bit
games. Note your score and back up the file somewhere. When the “high
score” count changes, just compare the two files; it can even be done with the
DOS utility FC1 (“high score” files are often in binary form). There will be a point
where a couple of bytes are different, and it is easy to see which ones are holding
the score number. However, game developers are fully aware of such tricks and
may defend the program against them.
It is also possible to compare the Windows registry before and after a program
installation. It is a very popular method of finding which registry elements are used
by the program. Probably, this is the reason why the “windows registry cleaner”
shareware is so popular.
31.3.2 Blink-comparator
Part IV
Tools
CHAPTER 32. DISASSEMBLER CHAPTER 32. DISASSEMBLER
Chapter 32
Disassembler
32.1 IDA
An older freeware version is available for download 1 .
1 hex-rays.com/products/ida/support/download_freeware.shtml
CHAPTER 33. DEBUGGER CHAPTER 33. DEBUGGER
Chapter 33
Debugger
33.1 tracer
The author often uses tracer 1 instead of a debugger.
The author of these lines eventually stopped using a debugger, since all he needs
from it is to spot function arguments while executing, or the state of the registers at some
point. Loading a debugger each time is too much, so a small utility called tracer was
born. It works from the command line and allows intercepting function execution, setting
breakpoints at arbitrary places, reading and changing the state of registers, etc.
However, for learning purposes it is highly advisable to trace code in a debugger
manually: watch how the registers (e.g. classic SoftICE, OllyDbg and WinDbg highlight
changed registers), flags and data change, change them manually, watch
the reaction, etc.
1 yurichev.com
CHAPTER 34. DECOMPILERS CHAPTER 34. DECOMPILERS
Chapter 34
Decompilers
CHAPTER 35. OTHER TOOLS CHAPTER 35. OTHER TOOLS
Chapter 35
Other tools
1 visualstudio.com/en-US/products/visual-studio-express-vs
2 hiew.ru
Part V
CHAPTER 36. BOOKS CHAPTER 36. BOOKS
Chapter 36
Books
36.1 Windows
[RA09].
36.2 C/C++
[ISO13].
36.4 ARM
ARM manuals: http://go.yurichev.com/17024
36.5 Cryptography
[Sch94]
CHAPTER 37. BLOGS CHAPTER 37. BLOGS
Chapter 37
Blogs
37.1 Windows
• Microsoft: Raymond Chen
• nynaeve.net
CHAPTER 38. OTHER CHAPTER 38. OTHER
Chapter 38
Other
1 Reverse Engineering
2 freenode.net
Afterword
CHAPTER 39. QUESTIONS? CHAPTER 39. QUESTIONS?
Chapter 39
Questions?
The author is working on the book a lot, so the page and listing numbers, etc.,
are changing very rapidly. Please do not refer to page and listing numbers in your
emails to me. There is a much simpler method: make a screenshot of the page,
underline the place where you see the error in a graphics editor, and send it to me.
I’ll fix it much faster. And if you are familiar with git and LaTeX, you can fix the error
right in the source code:
GitHub.
Do not worry about bothering me with any petty mistakes you find,
even if you are not very confident. I’m writing for beginners, after all, so beginners’
opinions and comments are crucial for my job.
If you are still interested in reverse engineering, the full version of the book is always
available on my website: beginners.re.
Acronyms used
OS Operating System. xi
PL Programming language. 3
RA Return Address. 66
NOP No OPeration. 38
VM Virtual Memory
Glossary Glossary
Glossary
real number numbers which may contain a dot. These are float and double in C/C++. 92
decrement Decrease by 1. 77, 86, 189
increment Increase by 1. 77, 86
product Multiplication result. 41
stack pointer A register pointing to a place in the stack. 9, 13, 17, 22, 203
quotient Division result. 92
callee A function being called by another. 11, 16, 26, 27, 35, 40, 42, 66, 154, 159
caller A function calling another. 5, 9, 35, 40, 41, 44, 66, 154
heap usually, a big chunk of memory provided by the OS so that applications can
divide it by themselves as they wish. malloc()/free() work with the heap. 14,
16, 132
jump offset a part of the JMP or Jcc instruction’s opcode, to be added to the address
of the next instruction, and this is how the new PC1 is calculated. May be
negative as well. 38, 54
PDB (Win32) Debugging information file, usually just function names, but some-
times also function arguments and local variables names. 162
POKE BASIC language instruction for writing a byte at a specific address. 189
1 Program Counter. IP/EIP/RIP in x86/64. PC in ARM.
register allocator The part of the compiler that assigns CPU registers to local vari-
ables. 86, 154
reverse engineering the act of understanding how a thing works, sometimes in order
to clone it. iv
stack frame A part of the stack that contains information specific to the current
function: local variables, function arguments, RA, etc. 28, 41
stdout standard output. 18, 66
tracer My own simple debugging tool. You can read more about it here: 33.1 on
page 192. 165, 180, 185, 186
Index
INDEX INDEX
BIBLIOGRAPHY BIBLIOGRAPHY
Bibliography
[RA09] Mark E. Russinovich and David A. Solomon with Alex Ionescu. Windows® Internal
2009.
[Rit79] Dennis M. Ritchie. “The Evolution of the Unix Time-sharing System”. In:
(1979).
[RT74] D. M. Ritchie and K. Thompson. “The UNIX Time Sharing System”. In:
(1974). Also available as http://go.yurichev.com/17270.
[Sch94] Bruce Schneier. Applied Cryptography: Protocols, Algorithms, and Source Code in
1994.
[Str13] Bjarne Stroustrup. The C++ Programming Language, 4th Edition. 2013.
[Yur13] Dennis Yurichev. C/C++ programming language notes. Also available as
http://go.yurichev.com/17289. 2013.