IDAPython Book
by Alexander Hanel
Introduction
Hello!
I originally wrote this book as a reference for myself - I wanted a place to go to where I could find
examples of functions that I commonly use (and forget) in IDAPython. Since I started this
book I have used it many times as a quick reference to understand syntax or see an example
of some code - if you follow my blog you may notice a few familiar faces - lots of scripts
that I cover here are the result of sophomoric experiments that I documented online.
Over the years I have received numerous emails asking what the best guide for learning
IDAPython is. Usually I point them to Ero Carrera’s Introduction to IDAPython or the
example scripts in IDAPython’s public repo. They are excellent sources for learning but
they don’t cover some common issues that I have come across. I wanted to create a book
that covers these issues. I feel this book will be of value for anyone learning IDAPython or
wanting a quick reference for examples and snippets. Being an e-book it will not be a static
document and I plan on updating it in the future on a regular basis.
If you come across any issues, typos or have questions please send me an email at
alexander< dot >hanel< at >gmail< dot > com.
Updates
It should be stated that my background is in reverse engineering of malware. This book
does not cover compiler concepts such as basic blocks or other academic concepts used in
static analysis. The reason is that I rarely use these concepts when reverse engineering
malware. Occasionally I have used them for de-obfuscating code but not often enough
that I feel they would be of value for a beginner. After reading this book the reader should feel
comfortable digging into the IDAPython documentation on their own. One last
disclaimer: functions for IDA’s debugger are not covered.
Conventions
IDA’s Output Window (command line interface) was used for the examples and output. For
brevity some examples do not contain the assignment of the current address to a variable,
usually represented as ea = here() . All of the code can be cut and pasted into the
command line or IDA’s script command option shift-F2 . Reading from beginning to end
is the recommended approach for this book. There are a number of examples that are not
explained line by line because it is assumed the reader understands the code from previous
examples. Different authors call IDAPython’s functions in different ways. Sometimes the code will
be called as idc.SegName(ea) or SegName(ea) . In this book we will be using the first
style. I have found this convention to be easier to read and debug. Sometimes when using
this convention an error will be thrown as shown below.
Python>DataRefsTo(here())
<generator object refs at 0x05247828>
Python>idautils.DataRefsTo(here())
Traceback (most recent call last):
File "<string>", line 1, in <module>
NameError: name 'idautils' is not defined
Python>import idautils # manual importing of module
Python>idautils.DataRefsTo(here())
<generator object refs at 0x06A398C8>
If this happens the module will need to be manually imported as shown above.
IDAPython Background
IDAPython was created in 2004. It was a joint effort by Gergely Erdelyi and Ero Carrera. Their
goal was to combine the power of Python with the analysis automation of IDA’s IDC C-like
scripting language. IDAPython consists of three separate modules. The first is idc . It is a
compatibility module for wrapping IDA’s IDC functions. The second module is idautils . It
provides high level utility functions for IDA. The third module is idaapi . It allows access to more
low level data, such as the classes used by IDA.
Basics
Before we dig too deep we should define some keywords and go over the structure of IDA’s
disassembly output. We can use the following line of code as an example.
.text:00012529 mov esi, [esp+4+arg_0]
The .text is the section name and the address is 00012529 . The displayed address is in
hexadecimal format. The instruction mov is referred to as a mnemonic. After the
mnemonic is the first operand esi and the second operand is [esp+4+arg_0] . When
working with IDAPython functions the most commonly passed variable is the address. In the
IDAPython documentation the address is referenced as ea . The address can be accessed
manually by a couple of different functions. The most commonly used functions are
idc.ScreenEA() or here() . They return an integer value. If we want to get the
minimum address that is present in an IDB we can use MinEA() , and to get the maximum we can
use MaxEA() .
Python>ea = idc.ScreenEA()
Python>print "0x%x %s" % (ea, ea)
0x12529 75049
Python>ea = here()
Python>print "0x%x %s" % (ea, ea)
0x12529 75049
Python>hex(MinEA())
0x401000
Python>hex(MaxEA())
0x437000
Python>idaapi.BADADDR
4294967295
Python>hex(idaapi.BADADDR)
0xffffffffL
Python>if BADADDR != here(): print "valid address"
valid address
Segments
Printing a single line is not very useful. The power of IDAPython comes from iterating
through all instructions and cross-reference addresses and searching for code or data. The
last two will be described in more detail later. Iterating through all segments is a good
place to start.
idautils.Segments() returns an iterator type object. We can loop through the object by
using a for loop. Each item in the list is a segment’s start address. The address can be used
to get the name if we pass it as an argument to idc.SegName(ea) . The start and end of
the segment can be found by calling idc.SegStart(ea) or idc.SegEnd(ea) . The
address or ea needs to be within the range of the start and end of the segment. If we didn’t
want to iterate through all segments but wanted to find the next segment we could use
idc.NextSeg(ea) . The address can be any address within the range of the segment whose
successor we want to find. If by chance we wanted to get a segment’s
start address by name we could use idc.SegByName(segname) .
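As a rough sketch of how these calls fit together, the helper below collects the name, start and end of every segment. The name list_segments and its two parameters are hypothetical; inside IDA we would pass the real idc and idautils modules, which are taken as arguments here only so the function has no hard dependency on IDA.

```python
def list_segments(idc_mod, idautils_mod):
    """Return a list of (name, start, end) tuples, one per segment.

    idc_mod and idautils_mod are expected to be IDA's idc and idautils
    modules (hypothetical parameters so the sketch has no hard imports).
    """
    rows = []
    for seg_ea in idautils_mod.Segments():
        # Each item yielded by Segments() is a segment start address.
        name = idc_mod.SegName(seg_ea)
        start = idc_mod.SegStart(seg_ea)
        end = idc_mod.SegEnd(seg_ea)
        rows.append((name, start, end))
    return rows
```

From IDA's command line this might be called as list_segments(idc, idautils) after importing both modules, and each tuple could then be printed with hex() applied to the two addresses.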
Functions
Now that we know how to iterate through all segments we should go over how to iterate
through all known functions.
idautils.Functions() will return a list of known functions. The list will contain the start
address of each function. idautils.Functions() can be passed arguments to search
within a range. If we wanted to do this we would pass the start and end address
idautils.Functions(start_addr, end_addr) . To get a function’s name we use
idc.GetFunctionName(ea) . ea can be any address within the function boundaries.
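A minimal sketch of the enumeration described above, assuming the old-style API names used in this book. The helper list_functions and its parameters are hypothetical; the idc and idautils modules are passed in as arguments so the function can also be exercised outside IDA.

```python
def list_functions(idc_mod, idautils_mod, start_addr=None, end_addr=None):
    """Return (start_ea, name) for each known function.

    When both start_addr and end_addr are given, only functions inside
    that range are requested from idautils.Functions().
    """
    if start_addr is None or end_addr is None:
        funcs = idautils_mod.Functions()
    else:
        funcs = idautils_mod.Functions(start_addr, end_addr)
    # Functions() yields the start address of each function.
    return [(ea, idc_mod.GetFunctionName(ea)) for ea in funcs]
```

Inside IDA, list_functions(idc, idautils) would cover the whole IDB, while passing two addresses narrows the search to that range.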
IDAPython contains a large set of APIs for working with functions. Let’s start with a simple
function. The semantics of this function are not important but we should make a mental
note of the addresses.
Python>func = idaapi.get_func(ea)
Python>type(func)
<class 'idaapi.func_t'>
Python>print "Start: 0x%x, End: 0x%x" % (func.startEA, func.endEA)
Start: 0x45c7c3, End: 0x45c7cd
From the output we can see the attributes startEA and endEA , which are used to access the start and
end of the function. These attributes are only applicable to the current function. If we
wanted to access surrounding functions we could use idc.NextFunction(ea) and
idc.PrevFunction(ea) . The value of ea only needs to be an address within the
boundaries of the analyzed function. A caveat with enumerating functions is that it only
works if IDA has identified the block of code as a function. Until the block of code is marked
as a function it will be skipped during the function enumeration process. Code that is not
marked as a function will be labeled red in the legend (colored bar at the top). These can be
manually fixed or automated.
IDAPython has a lot of different ways to access the same data. A common approach for
accessing the boundaries of a function is using idc.GetFunctionAttr(ea,
FUNCATTR_START) and idc.GetFunctionAttr(ea, FUNCATTR_END) .
Python>ea = here()
Python>start = idc.GetFunctionAttr(ea, FUNCATTR_START)
Python>end = idc.GetFunctionAttr(ea, FUNCATTR_END)
Python>cur_addr = start
Python>while cur_addr <= end:
    print hex(cur_addr), idc.GetDisasm(cur_addr)
    cur_addr = idc.NextHead(cur_addr, end)
Python>
0x45c7c3 mov eax, [ebp-60h]
0x45c7c6 push eax ; void *
0x45c7c7 call w_delete
0x45c7cc retn
FUNC_NORET
This flag is used to identify a function that does not execute a return instruction. It’s
internally represented as equal to 1. An example of a function that does not return a value
can be seen below.
FUNC_FAR
This flag is rarely seen unless reversing software that uses segmented memory. It is
internally represented as an integer of 2.
FUNC_USERFAR
This flag is rarely seen and has very little documentation. HexRays describes the flag as
“user has specified far-ness of the function”. It has an internal value of 32.
FUNC_LIB
This flag is used to find library code. Identifying library code is very useful because it is code
that typically can be ignored when doing analysis. It’s internally represented as an integer
value of 4. Below is an example of its usage and functions it has identified.
FUNC_STATIC
This flag is used to identify functions that were compiled as static functions. In C, functions
are global by default. If the author defines a function as static it can only be accessed by
other functions within that file. In a limited way this could be used to aid in understanding
how the source code was structured.
FUNC_FRAME
This flag indicates the function uses a frame pointer ebp . Functions that use frame
pointers will typically start with the standard function prologue for setting up the stack
frame.
FUNC_BOTTOMBP
Similar to FUNC_FRAME this flag is used to track the frame pointer. It identifies functions
whose frame pointer is equal to the stack pointer.
FUNC_HIDDEN
Functions with the FUNC_HIDDEN flag are hidden and need to be expanded
to view. If we were to go to the address of a function that is marked as hidden it would
automatically be expanded.
FUNC_THUNK
This flag identifies functions that are thunk functions. They are simple functions that jump
to another function.
0x1a716697 FUNC_LIB
0x1a716697 FUNC_FRAME
0x1a716697 FUNC_HIDDEN
0x1a716697 FUNC_BOTTOMBP
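Output like the lines above comes from testing each flag bit in the value returned by idc.GetFunctionFlags(ea) . The sketch below shows that bit test on a plain integer. The helper flag_names is hypothetical. The values 1, 2, 4 and 32 are the ones stated in the text; the remaining bit values are assumptions recalled from IDA's SDK headers and should be checked against the constants IDA itself exports.

```python
# Bit values for the function flags. 1, 2, 4 and 32 are stated in the
# text; the values marked "assumption" are recalled from IDA's SDK.
FUNC_FLAGS = [
    ("FUNC_NORET", 0x1),
    ("FUNC_FAR", 0x2),
    ("FUNC_LIB", 0x4),
    ("FUNC_STATIC", 0x8),      # assumption
    ("FUNC_FRAME", 0x10),      # assumption
    ("FUNC_USERFAR", 0x20),
    ("FUNC_HIDDEN", 0x40),     # assumption
    ("FUNC_THUNK", 0x80),      # assumption
    ("FUNC_BOTTOMBP", 0x100),  # assumption
]


def flag_names(flags):
    """Return the names of all function flags set in the integer flags."""
    return [name for name, bit in FUNC_FLAGS if flags & bit]
```

Inside IDA we could call flag_names(idc.GetFunctionFlags(func)) for each func returned by idautils.Functions() . In real scripts it is safer to test against the constants IDA provides, such as FUNC_LIB , than against hard-coded numbers.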
Instructions
Since we know how to work with functions, let’s go over how to access their instructions. If we
have the address of a function we can use idautils.FuncItems(ea) to get a list of all
the addresses.
Python>dism_addr = list(idautils.FuncItems(here()))
Python>type(dism_addr)
<type 'list'>
Python>print dism_addr
[4573123, 4573126, 4573127, 4573132]
Python>for line in dism_addr: print hex(line), idc.GetDisasm(line)
0x45c7c3 mov eax, [ebp-60h]
0x45c7c6 push eax ; void *
0x45c7c7 call w_delete
0x45c7cc retn
idautils.FuncItems(ea) actually returns an iterator type but is cast to a list . The list
will contain the start address of each instruction in consecutive order. Now that we have a
good knowledge base for looping through segments, functions and instructions, let’s show a
useful example. Sometimes when reversing packed code it is useful to know only where
dynamic calls happen. A dynamic call would be a call or jump to an operand that is a register
such as call eax or jmp edi .
Python>
for func in idautils.Functions():
    flags = idc.GetFunctionFlags(func)
    if flags & FUNC_LIB or flags & FUNC_THUNK:
        continue
    dism_addr = list(idautils.FuncItems(func))
    for line in dism_addr:
        m = idc.GetMnem(line)
        if m == 'call' or m == 'jmp':
            op = idc.GetOpType(line, 0)
            if op == o_reg:
                print "0x%x %s" % (line, idc.GetDisasm(line))
Python>
0x43ebde call eax ; VirtualProtect
We call idautils.Functions() to get a list of all known functions. For each function we
retrieve the function’s flags by calling idc.GetFunctionFlags(ea) . If the function is
library code or a thunk function the function is skipped. Next we call
idautils.FuncItems(ea) to get all the addresses within the function. We loop through
the list using a for loop. Since we are only interested in call and jmp instructions we
need to get the mnemonic by calling idc.GetMnem(ea) . We then use a simple string
comparison to check the mnemonic. If the mnemonic is a jump or call we get the operand
type by calling idc.GetOpType(ea, n) . This function returns an integer that is
internally called op_t.type . This value can be used to determine if the operand is a
register, memory reference, etc. We then check if the op_t.type is a register. If so, we
print the line. Casting the return of idautils.FuncItems(ea) into a list is useful
because iterators do not support functions such as len() . By casting it as a list we can easily
get the number of lines or instructions in a function.
Python>ea = here()
Python>len(idautils.FuncItems(ea))
Traceback (most recent call last):
File "<string>", line 1, in <module>
TypeError: object of type 'generator' has no len()
Python>len(list(idautils.FuncItems(ea)))
39
In the previous example we used a list that contained all addresses within a function. We
looped over each entry to access the next instruction. What if we only had an address and
wanted to get the next instruction? To move to the next instruction address we can use
idc.NextHead(ea) and to get the previous instruction address we use
idc.PrevHead(ea) . These functions get the start of the next instruction but not the
next address. To get the next address we use idc.NextAddr(ea) and to get the previous
address we use idc.PrevAddr(ea) .
Python>ea = here()
Python>print hex(ea), idc.GetDisasm(ea)
0x10004f24 call sub_10004F32
Python>next_instr = idc.NextHead(ea)
Python>print hex(next_instr), idc.GetDisasm(next_instr)
0x10004f29 mov [esi], eax
Python>prev_instr = idc.PrevHead(ea)
Python>print hex(prev_instr), idc.GetDisasm(prev_instr)
0x10004f1e mov [esi+98h], eax
Python>print hex(idc.NextAddr(ea))
0x10004f25
Python>print hex(idc.PrevAddr(ea))
0x10004f23
Operands
Operand types are commonly used so it will be beneficial to go over all the types. As
previously stated we can use idc.GetOpType(ea,n) to get the operand type. ea is the
address and n is the index. There are eight different operand types.
o_void
o_reg
If an operand is a general register it will return this type. This value is internally represented
as 1.
o_mem
If an operand is a direct memory reference it will return this type. This value is internally
represented as 2. This type is useful for finding references to DATA.
o_phrase
This operand is returned if the operand consists of a base register and/or an index register.
This value is internally represented as 3.
o_displ
This operand is returned if the operand consists of registers and a displacement value. The
displacement is an integer value such as 0x18. It is commonly seen when an instruction
accesses values in a structure. Internally it is represented as a value of 4.
o_imm
Operands that are a value such as an integer of 0xC are of this type. Internally it is
represented as 5.
o_far
This operand is not very common when reversing x86 or x86_64. It is used to find operands
that are accessing immediate far addresses. It is represented internally as 6.
o_near
This operand is not very common when reversing x86 or x86_64. It is used to find operands
that are accessing immediate near addresses. It is represented internally as 7.
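When printing results it can help to turn the integer returned by idc.GetOpType(ea, n) back into a readable name. The lookup below is a small sketch using the values listed above. The helper operand_type_name is hypothetical, and the value 0 for o_void is an assumption since the text does not state it.

```python
# Operand type values as listed above. o_void = 0 is an assumption;
# the text names the type but does not give its value.
OPERAND_TYPES = {
    0: "o_void",
    1: "o_reg",
    2: "o_mem",
    3: "o_phrase",
    4: "o_displ",
    5: "o_imm",
    6: "o_far",
    7: "o_near",
}


def operand_type_name(op_type):
    """Return a readable name for an operand type integer."""
    return OPERAND_TYPES.get(op_type, "other (%d)" % op_type)
```

Inside IDA this could be used as operand_type_name(idc.GetOpType(ea, 0)) . Values outside the table are reported as "other" rather than raising an error, since processor modules may define additional types.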
Example
While reversing an executable we might notice that the code keeps referencing recurring
displacement values. This is a likely indicator that the code is passing a structure to
different functions. Let’s go over an example that creates a Python dictionary containing all the
displacements as keys, where each key has a list of the addresses that use it. In the code below there
will be a new function that has yet to be described. The function is similar to
idc.GetOpType(ea, n) .
import idautils
import idaapi
displace = {}
The start of the code should already look familiar. We use a combination of
idautils.Functions() and GetFunctionFlags(ea) to get all applicable functions
while ignoring libraries and thunks. We get each instruction in a function by calling
idautils.FuncItems(ea) . From here our new function
idaapi.decode_insn(ea) is called. This function takes the address of the instruction we
want decoded. Once it is decoded we can access different properties of the instruction
via idaapi.cmd .
Python>dir(idaapi.cmd)
['Op1', 'Op2', 'Op3', 'Op4', 'Op5', 'Op6', 'Operands', .....,
'assign', 'auxpref', 'clink', 'clink_ptr', 'copy', 'cs', 'ea',
'flags', 'get_canon_feature', 'get_canon_mnem', 'insnpref', 'ip',
'is_canon_insn', 'is_macro', 'itype', 'segpref', 'size']
As we can see from the dir() command idaapi.cmd has a good number of attributes.
Now back to our example. The operand type is accessed by using idaapi.cmd.Op1.type .
Please note that the operand index starts at 1 rather than 0, which is different
from idc.GetOpType(ea,n) . We then check if operand one or operand two is of
o_displ type. We use idaapi.tag_remove(idaapi.ua_outop2(ea, n)) to get a
string representation of the operand. It would be shorter and easier to read if we called
idc.GetOpnd(ea, n) . For example purposes this is a good way to show that there is
more than one function to access attributes using IDAPython. If we were to look at the
IDAPython source code for idc.GetOpnd(ea, n) we would see the lower level approach.
if not res:
    return ""
else:
    return idaapi.tag_remove(res)
Now back to our example. Since we have the string we need to check if the operand
contains the string "bp" . This is a quick way to determine if the register bp , ebp or rbp
is present in the operand. We check for “bp” because we need to determine if the
displacement value is negative or not. To access the displacement value we use
idaapi.cmd.Op1.addr . This returns the displacement as stored in the operand, so a negative
displacement appears as a large unsigned value. Now that we have the value we
convert it to an integer, make it positive if needed, and then add it to our dictionary
named displace . If there is a displacement value that we wanted to search for we could
access it using the following for loop.
0x130 is the displacement value we are interested in. This can be modified to print other
displacements.
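The bookkeeping described above can be sketched without IDA by working on plain values. The helpers displacement_magnitude and record_displacement are hypothetical names, and the 32 bit operand width is an assumption. Checking the sign bit before negating is a choice made for this sketch and is not taken from the text.

```python
def displacement_magnitude(addr_field, operand_text, bits=32):
    """Return the positive displacement for an o_displ operand.

    addr_field   -- raw value, e.g. what idaapi.cmd.Op1.addr would hold
    operand_text -- operand as text, e.g. "[ebp-60h]"
    bits         -- operand width; 32 is an assumption for x86 samples
    """
    mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    if "bp" in operand_text and addr_field & sign_bit:
        # Two's complement: turn the large unsigned value into its magnitude.
        return (~(addr_field - 1)) & mask
    return addr_field


def record_displacement(displace, ea, addr_field, operand_text):
    """Append ea to the list stored under its displacement key."""
    key = displacement_magnitude(addr_field, operand_text)
    if key:
        displace.setdefault(key, []).append(ea)
    return displace
```

Once the dictionary is filled, the addresses for one displacement can be read back with a loop such as for ea in displace.get(0x130, []) , printing hex(ea) and idc.GetDisasm(ea) for each entry when running inside IDA.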
Example
Sometimes when reversing a memory dump of an executable the operands are not
recognized as an offset.
The second value being pushed is a memory offset. If we were to right click on it and
change it to a data type, we would see the offset to a string. This is okay to do once or
twice but after that we might as well automate the process.
import idautils
# avoid shadowing Python's built-in min() and max()
min_ea = MinEA()
max_ea = MaxEA()
# for each known function
for func in idautils.Functions():
    flags = idc.GetFunctionFlags(func)
    # skip library & thunk functions
    if flags & FUNC_LIB or flags & FUNC_THUNK:
        continue
    dism_addr = list(idautils.FuncItems(func))
    for curr_addr in dism_addr:
        # operand type 5 is o_imm; convert immediates that fall inside the IDB
        if idc.GetOpType(curr_addr, 0) == 5 and \
           (min_ea < idc.GetOperandValue(curr_addr, 0) < max_ea):
            idc.OpOff(curr_addr, 0, 0)
        if idc.GetOpType(curr_addr, 1) == 5 and \
           (min_ea < idc.GetOperandValue(curr_addr, 1) < max_ea):
            idc.OpOff(curr_addr, 1, 0)
After running the above code we would now see the string.
At the start we get the minimum and maximum address by calling MinEA() and MaxEA() .
We loop through all functions and instructions. For each instruction we check if the operand
type is o_imm , which is represented internally as the number 5. o_imm types are values
such as an integer or an offset. Once a value is found we read the value by calling
idc.GetOperandValue(ea,n) . The value is then checked to see if it is in the range of the
minimum and maximum addresses. If so, we use idc.OpOff(ea, n, base) to convert
the operand to an offset. The first argument ea is the address, n is the operand index
and base is the base address. Our example only needs to have a base of zero.
Xrefs
Being able to locate cross-references, aka xrefs, to data or code is very important. Xrefs are
important because they provide the locations where certain data is being used or where a
function is being called from. For example, what if we wanted to locate the addresses
where WriteFile was called from? Using xrefs all we would need to do is locate the
address of WriteFile in the import table and then find all xrefs to it.
Python>wf_addr = idc.LocByName("WriteFile")
Python>print hex(wf_addr), idc.GetDisasm(wf_addr)
0x1000e1b8 extrn WriteFile:dword
Python>for addr in idautils.CodeRefsTo(wf_addr, 0):\
print hex(addr), idc.GetDisasm(addr)
0x10004932 call ds:WriteFile
0x10005c38 call ds:WriteFile
0x10007458 call ds:WriteFile
In the first line we get the address of the API WriteFile by using idc.LocByName(str) .
This function returns the address of the API. We print out the address of WriteFile
and its string representation. Then we loop through all code cross references by calling
idautils.CodeRefsTo(ea, flow) . It returns an iterator that can be looped through.
ea is the address that we would like to have cross-referenced to. The argument flow is a
bool . It is used to specify whether to follow normal code flow or not. Each cross reference to the
address is then displayed. A quick note about the use of idc.LocByName(str) . All
renamed functions and APIs in an IDB can be accessed by calling idautils.Names() . This
function returns an iterator object which can be looped through to print or access the
names. Each named item is a tuple of (ea, str_name) .
Python>ea = 0x10004932
Python>print hex(ea), idc.GetDisasm(ea)
0x10004932 call ds:WriteFile
Python>for addr in idautils.CodeRefsFrom(ea, 0):\
print hex(addr), idc.GetDisasm(addr)
Python>
0x1000e1b8 extrn WriteFile:dword
Python>hex(ea)
0xa26c78
Python>idc.MakeName(ea, "RtlCompareMemory")
True
Python>for addr in idautils.CodeRefsTo(ea, 0):\
print hex(addr), idc.GetDisasm(addr)
IDA will not label these APIs as code cross references. A little later we will describe a generic
technique to get all cross references. If we wanted to search for cross references to and
from data we could use idautils.DataRefsTo(ea) or idautils.DataRefsFrom(ea) .
The first line displays our address and a string named <Path> . We use
idautils.XrefsTo(ea, 1) to get all cross references to the string. We then use
xref.type to print the xref’s type value. idautils.XrefTypeName(xref.type) is used
to print the string representation of this type. There are twelve different documented
reference type values. The value can be seen on the left and its corresponding name on the
right in the list below.
0 = 'Data_Unknown'
1 = 'Data_Offset'
2 = 'Data_Write'
3 = 'Data_Read'
4 = 'Data_Text'
5 = 'Data_Informational'
16 = 'Code_Far_Call'
17 = 'Code_Near_Call'
18 = 'Code_Far_Jump'
19 = 'Code_Near_Jump'
20 = 'Code_User'
21 = 'Ordinary_Flow'
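The table above can be kept as a dictionary when we want to group references by kind without calling back into IDA for every name. This is only a sketch: the helper summarize_xrefs is hypothetical, and the addresses in the usage note are illustrative sample values.

```python
# Reference type values and names exactly as listed above.
XREF_TYPE_NAMES = {
    0: 'Data_Unknown',
    1: 'Data_Offset',
    2: 'Data_Write',
    3: 'Data_Read',
    4: 'Data_Text',
    5: 'Data_Informational',
    16: 'Code_Far_Call',
    17: 'Code_Near_Call',
    18: 'Code_Far_Jump',
    19: 'Code_Near_Jump',
    20: 'Code_User',
    21: 'Ordinary_Flow',
}


def summarize_xrefs(xrefs):
    """Group (type, frm) pairs into a dict keyed by reference type name."""
    summary = {}
    for xref_type, frm in xrefs:
        name = XREF_TYPE_NAMES.get(xref_type, "Unknown_%d" % xref_type)
        summary.setdefault(name, []).append(frm)
    return summary
```

Inside IDA the pairs could be built with [(x.type, x.frm) for x in idautils.XrefsTo(ea, 1)] . For example, summarize_xrefs([(3, 0xa142a3), (17, 0x10004932)]) groups the first address under Data_Read and the second under Code_Near_Call.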
xref.frm prints out the from address and xref.to prints out the to address.
xref.iscode prints whether the xref is in a code segment. In the previous example we had the
flag of idautils.XrefsTo(ea, 1) set to the value 1. If the flag is zero any cross
reference will be displayed. Let’s say we have the below block of assembly.
We have the cursor at 1000AB02 . This address has a cross reference from 1000AAF6 but
it also has a second cross reference.
The second cross reference is from 1000AAFF to 1000AB02 . Cross references do not
have to be caused by branch instructions. They can also be caused by normal ordinary code
flow. If we set the flag to 1 Ordinary_Flow reference types will not be added. Let’s go back to
our RtlCompareMemory example from earlier. We can use idautils.XrefsTo(ea,
flow) to get all cross references.
Python>hex(ea)
0xa26c78
Python>idc.MakeName(ea, "RtlCompareMemory")
True
Python>for xref in idautils.XrefsTo(ea, 1):
    print xref.type, idautils.XrefTypeName(xref.type), \
        hex(xref.frm), hex(xref.to), xref.iscode
Python>
3 Data_Read 0xa142a3 0xa26c78 0
3 Data_Read 0xa143e8 0xa26c78 0
3 Data_Read 0xa162da 0xa26c78 0
The verboseness comes from both the Data_Read and the Code_Near references being added to the
xrefs. Getting all the addresses and adding them to a set can be useful to slim down
the list of addresses.
def get_to_xrefs(ea):
    xref_set = set([])
    for xref in idautils.XrefsTo(ea, 1):
        xref_set.add(xref.frm)
    return xref_set
Searching
We have already gone over some basic searches by iterating over all known functions or
instructions. This is useful but sometimes we need to search for specific bytes such as
0x55 0x8B 0xEC . This byte pattern is the classic function prologue push ebp, mov
ebp, esp . To search for byte or binary patterns we can use idc.FindBinary(ea,
flag, searchstr, radix=16) . ea is the address that we would like to search from and the
flag is the direction or condition. There are a number of different types of flags. The
names and values can be seen below.
SEARCH_UP = 0
SEARCH_DOWN = 1
SEARCH_NEXT = 2
SEARCH_CASE = 4
SEARCH_REGEX = 8
SEARCH_NOBRK = 16
SEARCH_NOSHOW = 32
SEARCH_UNICODE = 64 **
SEARCH_IDENT = 128 **
SEARCH_BRK = 256 **
** Older versions of IDAPython do not support these
Not all of these flags are worth going over but let’s touch upon the most commonly used flags.
SEARCH_UP and SEARCH_DOWN are used to select the direction we would like our
search to f ollow.
SEARCH_NEXT is used to get the next f ound object.
SEARCH_CASE is used to specif y case sensitivity.
SEARCH_NOSHOW will not show the search progress.
SEARCH_UNICODE is used to treat all search strings as Unicode.
searchstr is the pattern we are searching for. The radix is used when writing processor
modules. This topic is outside of the scope of this book. I would recommend reading
Chapter 19 of Chris Eagle’s The IDA Pro Book. For now the radix field can be left blank. Let’s go
over a quick walk through on finding the function prologue byte pattern mentioned earlier.
In the first line we define our search pattern. The search pattern can be in the format of
hexadecimal starting with 0x as in 0x55 0x8B 0xEC or as bytes appear in IDA’s hex view
55 8B EC . The format \x55\x8B\xEC cannot be used unless we were using
idc.FindText(ea, flag, y, x, searchstr) . MinEA() is used to get the first
address in the executable. We then assign the return of idc.FindBinary(ea, flag,
searchstr, radix=16) to a variable called addr .
When searching it is important to verify that the search did find the pattern. This is tested
by comparing addr with idc.BADADDR . We then print the address and disassembly.
Notice how the address did not increment? This is because we did not pass the
SEARCH_NEXT flag. If this flag is not passed the current address is used to search for the
pattern. If the last address contained our byte pattern the search will never increment
past it. Below is the corrected version.
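One way to write a loop that keeps advancing, given as a minimal sketch rather than a definitive listing. The helper find_all_binary and its parameters are hypothetical. The idc module is passed in as an argument so the function has no hard dependency on IDA, and it is assumed to expose FindBinary , BADADDR , SEARCH_DOWN and SEARCH_NEXT as used in this section.

```python
def find_all_binary(idc_mod, pattern, start_ea):
    """Collect every address after start_ea where pattern is found.

    idc_mod  -- IDA's idc module (hypothetical parameter)
    pattern  -- hex-view style pattern such as "55 8B EC"
    start_ea -- address to begin searching from
    """
    # SEARCH_NEXT makes each call start past the address it was given,
    # so the loop cannot get stuck on the previous hit.
    flags = idc_mod.SEARCH_DOWN | idc_mod.SEARCH_NEXT
    hits = []
    addr = start_ea
    while True:
        addr = idc_mod.FindBinary(addr, flags, pattern)
        if addr == idc_mod.BADADDR:
            break
        hits.append(addr)
    return hits
```

Inside IDA this might be called as find_all_binary(idc, "55 8B EC", MinEA()) . Because SEARCH_NEXT is always set, a match located exactly at start_ea would be skipped, which is a trade-off of this sketch.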
Searching for byte patterns is useful but sometimes we might want to search for strings
such as “chrome.dll”. We could convert the string to hex bytes using [hex(y) for y
in bytearray("chrome.dll")] but this is a little ugly. Also, if the string is Unicode we
would have to account for that format. The simplest approach is using FindText(ea,
flag, y, x, searchstr) . Most of these fields should look familiar because they are the
same as idc.FindBinary . ea is the start address and flag is the direction and types to
search for. y is the number of lines at ea to search from and x is the coordinate in the
line. These fields are typically assigned as 0 . Now let’s search for occurrences of the string
“Accept”. Any string from the strings window shift+F12 can be used for this example.
Python>cur_addr = MinEA()
end = MaxEA()
while cur_addr < end:
    cur_addr = idc.FindText(cur_addr, SEARCH_DOWN, 0, 0, "Accept")
    if cur_addr == idc.BADADDR:
        break
    else:
        print hex(cur_addr), idc.GetDisasm(cur_addr)
    cur_addr = idc.NextHead(cur_addr)
Python>
0x40da72 push offset aAcceptEncoding; "Accept-Encoding:\n"
0x40face push offset aHttp1_1Accept; " HTTP/1.1\r\nAccept: */*
\r\n "
0x40fadf push offset aAcceptLanguage; "Accept-Language: ru
\r\n"
...
0x423c00 db 'Accept',0
0x423c14 db 'Accept-Language',0
0x423c24 db 'Accept-Encoding',0
0x423ca4 db 'Accept-Ranges',0
We use MinEA() to get the minimum address and assign that to a variable named
cur_addr . This is similarly done again for the maximum address by calling MaxEA() and
assigning the return to a variable named end . Since we do not know how many
occurrences of the string will be present, we need to check that the search continues down
and is less than the maximum address. We then assign the return of idc.FindText to the
current address. Since we will be manually incrementing the address by calling
idc.NextHead(ea) we do not need the SEARCH_NEXT flag. The reason why we manually
increment the current address to the following line is because a string can occur multiple
times on a single line. This can make it tricky to get the address of the next string.
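If we did want to search for a string as raw bytes, as mentioned before the FindText example, a small helper can build the pattern in the hex view format that idc.FindBinary accepts. This is a sketch only: the name to_hex_pattern is hypothetical, and treating Unicode strings as UTF-16LE is an assumption about the sample being analyzed.

```python
def to_hex_pattern(text, wide=False):
    """Return text as a space-separated hex byte pattern, e.g. "41 42".

    wide=True encodes the text as UTF-16LE, the layout commonly used
    for Windows wide strings (an assumption about the target sample).
    """
    raw = text.encode("utf-16-le" if wide else "ascii")
    return " ".join("%02X" % b for b in bytearray(raw))
```

The result of to_hex_pattern("chrome.dll") could then be passed as searchstr to idc.FindBinary , with wide=True used when the string is expected to be stored as Unicode.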
Along with the pattern searching previously described there are a couple of functions that can be
used to find other types. The naming conventions of the find APIs make it easy to infer their
overall functionality. Before we discuss finding the different types we will first go over
identifying types by their address. There is a subset of APIs that start with is that can be
used to determine an address’ type. The APIs return a Boolean value of True or False .
idc.isCode(f)
idc.isData(f)
idc.isTail(f)
idc.isUnknown(f)
Returns True if IDA has marked the address as unknown. This type is used when IDA has
not identified if the address is code or data.
idc.isHead(f)
The f is new to us. Rather than passing an address we first need to get the internal flags
representation and then pass it to our idc.is set of functions. To get the internal flags
we use idc.GetFlags(ea) . Now that we have the basics of how the functions can be used
and the different types, let’s do a quick example.
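A short sketch of the two step pattern described above: fetch the flags once with idc.GetFlags(ea) , then hand them to each is function. The helper classify_address is hypothetical, and the idc module is passed in as an argument so the sketch has no hard dependency on IDA.

```python
def classify_address(idc_mod, ea):
    """Return the names of the is* checks that hold for address ea."""
    f = idc_mod.GetFlags(ea)  # internal flags for this address
    checks = [
        ("code", idc_mod.isCode),
        ("data", idc_mod.isData),
        ("tail", idc_mod.isTail),
        ("unknown", idc_mod.isUnknown),
        ("head", idc_mod.isHead),
    ]
    # Each check takes the flags value, not the address itself.
    return [name for name, check in checks if check(f)]
```

Inside IDA, classify_address(idc, here()) would report which of the checks are true for the address under the cursor.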
idc.FindCode(ea, flag)
It is used to find the next address that is marked as code. This can be useful if we want to
find the end of a block of data. If ea is an address that is already marked as code it will
return the next address. The flag is used as previously described in idc.FindText .
As we can see ea is the address 0x4140e8 of some data. We assign the return of
idc.FindCode(ea, SEARCH_DOWN|SEARCH_NEXT) to addr . Then we print addr and
its disassembly. By calling this single function we skipped 36 bytes of data to get the start
of a section marked as code.
idc.FindData(ea, flag)
It is used exactly as idc.FindCode except it will return the start of the next address that
is marked as a block of data. Here we reverse the previous scenario: we start from the address
of code and search up to find the start of the data.
The only thing that is slightly different from the previous example is the direction of
SEARCH_UP|SEARCH_NEXT and searching for data.
idc.FindUnexplored(ea, flag)
This function is used to find the address of bytes that IDA did not identify as code or data.
The unknown type will require further manual analysis either visually or through scripting.
0x41b900 db ? ;
Python>addr = idc.FindExplored(ea, SEARCH_UP)
Python>print hex(addr), idc.GetDisasm(addr)
0x41b5f4 dd ?
This might not seem of any real value but if we were to print the cross references of addr
we would see it is being used.
Rather than searching for a type we might want to search for a specific value. Let’s say for
example that we have a feeling that the code calls rand to generate a random number but
we can’t find the code. If we knew that rand uses the constant 0x343FD (the multiplier in
Microsoft’s implementation) when updating its seed we could search for that number.
In the first line we pass the minimum address via MinEA() , search down and then search
for the value 0x343FD . Rather than returning an address as shown in the previous Find APIs
idc.FindImmediate returns a tuple. The first item in the tuple will be the address and
second will be the operand. Similar to the return of idc.GetOpnd the first operand starts
at zero. When we print the address and disassembly we can see the value is the second
operand. If we wanted to search for all uses of an immediate value we could do the
following.
Python>addr = MinEA()
while True:
    addr, operand = idc.FindImmediate(addr, SEARCH_DOWN|SEARCH_NEXT, 0x7a)
    if addr != BADADDR:
        print hex(addr), idc.GetDisasm(addr), "Operand ", operand
    else:
        break
Python>
0x402434 dd 9, 0FF0Bh, 0Ch, 0FF0Dh, 0Dh, 0FF13h, 13h, 0FF1Bh, 1Bh Operand 0
0x40acee cmp eax, 7Ah Operand 1
0x40b943 push 7Ah Operand 0
0x424a91 cmp eax, 7Ah Operand 1
0x424b3d cmp eax, 7Ah Operand 1
0x425507 cmp eax, 7Ah Operand 1
Most of the code should look familiar but since we are searching for multiple occurrences of
the value we use a while loop and the SEARCH_DOWN|SEARCH_NEXT flags. The loop ends when
idc.FindImmediate returns BADADDR.
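The same loop can be wrapped in a generator so that other scripts can reuse it. The following is only a minimal sketch that runs outside of IDA: iter_hits and fake_find are hypothetical names, the BADADDR value is an assumed 32-bit sentinel, and fake_find stands in for a call such as idc.FindImmediate(addr, SEARCH_DOWN|SEARCH_NEXT, 0x7a).

```python
BADADDR = 0xFFFFFFFF  # assumption: 32-bit sentinel, mirrors idc.BADADDR


def iter_hits(find_next, start):
    """Yield (address, operand) pairs until find_next reports BADADDR.

    find_next(addr) must return an (address, operand) tuple, the same
    shape that idc.FindImmediate returns.
    """
    addr = start
    while True:
        addr, operand = find_next(addr)
        if addr == BADADDR:
            break
        yield addr, operand


# Hypothetical stand-in for idc.FindImmediate: a sorted table of hits.
HITS = [(0x40ACEE, 1), (0x40B943, 0), (0x424A91, 1)]


def fake_find(addr):
    # Return the first hit strictly after addr, as SEARCH_NEXT would.
    for hit_addr, operand in HITS:
        if hit_addr > addr:
            return hit_addr, operand
    return BADADDR, -1


results = list(iter_hits(fake_find, 0x400000))
```

Inside IDA the stand-in could be replaced with a lambda around idc.FindImmediate, leaving the loop logic untouched.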
Selecting Data
We will not always want to write code that automatically searches for code or data. In
some instances we already know the location of the code or data but we want to select it
for analysis. In situations like this we might just want to highlight the code and start
working with it in IDAPython. To get the boundaries of selected data we can use
idc.SelStart() to get the start and idc.SelEnd() to get the end. Say we have the
code below selected.
Python>start = idc.SelStart()
Python>hex(start)
0x408e46
Python>end = idc.SelEnd()
Python>hex(end)
0x408e58
We assign the return of idc.SelStart() to start. This will be the address of the first
selected address. We then assign the return of idc.SelEnd() to end. One
thing to note is that end is not the last selected address but the start of the next address.
If we preferred to make only one API call we could use idaapi.read_selection(). It
returns a tuple with the first value being a bool that states whether the selection was read,
the second being the start address and the last being the end address.
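Because the end address is exclusive, the size of the selection is simply end minus start. The sketch below illustrates this with a hypothetical helper, selection_info, that takes a tuple in the shape described for idaapi.read_selection(). It does not call IDA, so the tuple is passed in by hand.

```python
def selection_info(selection):
    """Turn an (ok, start, end) tuple into (start, end, size) or None.

    The tuple shape follows the description of idaapi.read_selection();
    end is the address after the last selected byte.
    """
    ok, start, end = selection
    if not ok:
        return None
    return start, end, end - start


# Values taken from the idc.SelStart()/idc.SelEnd() example above.
info = selection_info((True, 0x408E46, 0x408E58))
```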
Be cautious when working with 64-bit samples. The base address is not always correct
because the selected start address can cause an integer overflow and the leading digit will
be incorrect.
A personal belief of mine is that if I'm not writing I'm not reversing. Adding comments,
renaming functions and interacting with the assembly is one of the best ways to
understand what the code is doing. Over time some of the interaction becomes redundant.
In situations like this it is useful to automate the process.
Before we go over some examples we should first discuss the basics of comments and
renaming. There are two types of comments. The first one is a regular comment and the
second is a repeatable comment. A regular comment appears at address 0041136B as the
text regular comment. A repeatable comment can be seen at addresses 00411372,
00411386 and 00411392. Only the last comment is a comment that was manually
entered. The other comments appear when an instruction references an address (such as a
branch condition) that contains a repeatable comment.
We print the address, disassembly and function name in the first couple of lines. We then
use idc.SetFunctionCmt(ea, comment, repeatable) to set a repeatable comment of
"check out later". If we look at the start of the function we will see our comment.
Since the comment is repeatable, we will also see it wherever there is a cross-reference to
the function. This is a great place to add reminders or notes about a function.
Renaming functions and addresses is a commonly automated task, especially when dealing
with position independent code (PIC), packers or wrapper functions. The reason why this is
common in PIC or unpacked code is because the import table might not be present in the
dump. In the case of wrapper functions the full function simply calls an API.
In the above code the function could be called w_HeapAlloc. The w_ is short for wrapper.
To rename an address we can use the function idc.MakeName(ea, name). ea is the
address and name is the string name such as "w_HeapAlloc". To rename a function ea
needs to be the first address of the function. To rename the function of our HeapAlloc
wrapper we would use the following code.
Above we can see the function has been renamed. To confirm it has been renamed we can
use idc.GetFunctionName(ea) to print the function's new name.
Python>idc.GetFunctionName(ea)
w_HeapAlloc
Now that we have a good basis of knowledge, let's look at an example of how we can use what
we have learned so far to automate the naming of wrapper functions. Please see the inline
comments to get an idea about the logic.
import idautils

def rename_wrapper(name, func_addr):
    # SN_NOWARN avoids the warning dialogue when the name is already in use
    if idc.MakeNameEx(func_addr, name, SN_NOWARN):
        print "Function at 0x%x renamed %s" % (func_addr, name)
    else:
        print "Rename at 0x%x failed. Name %s is already in use." % (func_addr, name)

def check_for_wrapper(func):
    flags = idc.GetFunctionFlags(func)
    # skip library & thunk functions
    if flags & FUNC_LIB or flags & FUNC_THUNK:
        return
    dism_addr = list(idautils.FuncItems(func))
    # get length of the function
    func_length = len(dism_addr)
    # if over 32 lines of instruction return
    if func_length > 0x20:
        return
    func_call = 0
    instr_cmp = 0
    op = None
    op_addr = None
    op_type = None
    # for each instruction in the function
    for ea in dism_addr:
        m = idc.GetMnem(ea)
        if m == 'call' or m == 'jmp':
            if m == 'jmp':
                temp = idc.GetOperandValue(ea, 0)
                # ignore jump conditions within the function boundaries
                if temp in dism_addr:
                    continue
            func_call += 1
            # wrappers should not contain multiple function calls
            if func_call == 2:
                return
            op_addr = idc.GetOperandValue(ea, 0)
            op_type = idc.GetOpType(ea, 0)
        elif m == 'cmp' or m == 'test':
            # wrapper functions should not contain much logic.
            instr_cmp += 1
            if instr_cmp == 3:
                return
        else:
            continue
    # all instructions in the function have been analyzed
    if op_addr == None:
        return
    name = idc.Name(op_addr)
    # skip mangled function names
    if "[" in name or "$" in name or "?" in name or "@" in name or name == "":
        return
    name = "w_" + name
    if op_type == 7:
        if idc.GetFunctionFlags(op_addr) & FUNC_THUNK:
            rename_wrapper(name, func)
            return
    if op_type == 2 or op_type == 6:
        rename_wrapper(name, func)
        return

# check every function in the IDB
for func in idautils.Functions():
    check_for_wrapper(func)
Example Output
Most of the code should be familiar. One notable difference is the use of
idc.MakeNameEx(ea, name, flag) from rename_wrapper. We use this function
because idc.MakeName will throw a warning dialogue if the function name is already in use.
By passing a flag value of SN_NOWARN or 256 we avoid the dialogue box. We could apply
some logic to rename the function to w_HeapFree_1 but for brevity we will leave that out.
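For readers who want the numbered fallback, the logic could look like the sketch below. It is one possible approach, not the book's original code: unique_name and the name_in_use callback are hypothetical, and the set of existing names stands in for the IDB. Inside IDA the callback could be something like a check that idc.LocByName(name) does not return BADADDR, which is an assumption about how one would query the database.

```python
def unique_name(base, name_in_use, limit=100):
    """Return base, or base_1, base_2, ... for the first unused name."""
    if not name_in_use(base):
        return base
    for suffix in range(1, limit):
        candidate = "%s_%d" % (base, suffix)
        if not name_in_use(candidate):
            return candidate
    return None  # gave up after `limit` attempts


# Stand-in for the names already present in the IDB.
existing = {"w_HeapFree", "w_HeapFree_1"}
picked = unique_name("w_HeapFree", lambda n: n in existing)
```

With the names above already taken, picked is the next free candidate, w_HeapFree_2.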
Being able to access raw data is essential when reverse engineering. Raw data is the binary
representation of the code or data. We can see the raw data or bytes of the instructions on
the left side following the address.
idc.Byte(ea)
idc.Word(ea)
idc.Dword(ea)
idc.Qword(ea)
idc.GetFloat(ea)
idc.GetDouble(ea)
If the cursor was at 00A14380 in the assembly from above we would have the following
output.
When writing decoders it is not always useful to get a single byte or read a dword; often we
want to read a block of raw data. To read a specified number of bytes at an address we can use
idc.GetManyBytes(ea, size, use_dbg=False). The last argument is optional and is
only needed if we want to read the debugger's memory.
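Once a block of bytes has been read it can be interpreted with Python's struct module. The sketch below uses made-up bytes in place of a real idc.GetManyBytes result and assumes a little-endian target such as x86. The sizes mirror what idc.Byte, idc.Word, idc.Dword and idc.Qword read at a single address.

```python
import struct

# Hypothetical bytes, standing in for the result of idc.GetManyBytes(ea, 8).
raw = b"\x8b\x0d\x0c\x6d\xa2\x00\x68\xe0"

byte_value = struct.unpack_from("<B", raw, 0)[0]   # 1 byte
word_value = struct.unpack_from("<H", raw, 0)[0]   # 2 bytes, little-endian
dword_value = struct.unpack_from("<I", raw, 0)[0]  # 4 bytes, little-endian
qword_value = struct.unpack_from("<Q", raw, 0)[0]  # 8 bytes, little-endian
```

The third argument of struct.unpack_from is the offset into the buffer, so the same call can walk through a larger block without slicing it.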
Sometimes when reversing malware the sample will have strings that are encoded. This is
done to slow down the analysis process and to thwart using a strings viewer to recover
indicators. In situations like this patching the IDB is useful. We could rename the address but
renaming is limited by the naming convention restrictions. To patch an address
with a value we can use the following functions.
idc.PatchByte(ea, value)
idc.PatchWord(ea, value)
idc.PatchDword(ea, value)
ea is the address and value is the integer value that we would like to patch the IDB with.
The size of the value needs to match the size specified by the function name we choose.
Say, for example, that we found the following encoded strings.
The function is a standard XOR decoder function with arguments of size, key and a decoded
buffer.
Python>start = idc.SelStart()
Python>end = idc.SelEnd()
Python>print hex(start)
0x1001ed3c
Python>print hex(end)
0x1001ed50
Python>def xor(size, key, buff):
    for index in range(0, size):
        cur_addr = buff + index
        temp = idc.Byte(cur_addr) ^ key
        idc.PatchByte(cur_addr, temp)
Python>
Python>xor(end - start, 0x30, start)
Python>idc.GetString(start)
WSAEnumNetworkEvents
We get the start and end addresses of the highlighted data using idc.SelStart() and
idc.SelEnd(). Then we have a function that reads each byte by calling idc.Byte(ea),
XORs the byte with the key passed to the function and then patches the byte by calling
idc.PatchByte(ea, value).
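It can help to test the decoding logic on a plain buffer before patching the IDB, since a patch changes the database. The sketch below is a stand-alone variant of the XOR routine: xor_bytes is a hypothetical helper that works on a byte string instead of IDB addresses, and the key 0x30 is taken from the example above.

```python
def xor_bytes(data, key):
    """Return a copy of data with every byte XORed with a one-byte key."""
    return bytes(bytearray(b ^ key for b in bytearray(data)))


plain = b"WSAEnumNetworkEvents"
encoded = xor_bytes(plain, 0x30)    # what the sample would store
decoded = xor_bytes(encoded, 0x30)  # applying the key again restores it
```

XOR with a fixed key is its own inverse, which is why the same function serves as both encoder and decoder.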
Importing and exporting files into IDAPython can be useful when we do not know the file
path or when we do not know where the user wants to save their data. To import or save a
file by name we use AskFile(forsave, mask, prompt). forsave can be a value of 0
if we want to open the open-file dialog box or 1 if we want to open the save dialog box. mask is the
file extension or pattern. If we want to open only .dll files we would use a mask of
"*.dll". prompt is the title of the window. A good example of input and output and
selecting data is the following IO_DATA class.
import sys
import idaapi

class IO_DATA():
    def __init__(self):
        # selection boundaries; BADADDR if nothing is selected
        self.start = idc.SelStart()
        self.end = idc.SelEnd()
        self.buffer = ''
        self.ogLen = None
        self.status = True
        self.run()

    def checkBounds(self):
        if self.start == BADADDR or self.end == BADADDR:
            self.status = False

    def getData(self):
        '''get data between start and end put them into object.buffer'''
        self.ogLen = self.end - self.start
        self.buffer = ''
        try:
            for byte in idc.GetManyBytes(self.start, self.ogLen):
                self.buffer = self.buffer + byte
        except:
            self.status = False
        return

    def run(self):
        '''basically main'''
        self.checkBounds()
        if self.status == False:
            sys.stdout.write('ERROR: Please select valid data\n')
            return
        self.getData()

    def patch(self, temp=None):
        '''patch the IDB with temp if given, otherwise with the buffer'''
        if temp != None:
            self.buffer = temp
        for index, byte in enumerate(self.buffer):
            idc.PatchByte(self.start + index, ord(byte))

    def importb(self):
        '''import file to save to buffer'''
        fileName = idc.AskFile(0, "*.*", 'Import File')
        try:
            self.buffer = open(fileName, 'rb').read()
        except:
            sys.stdout.write('ERROR: Cannot access file')

    def export(self):
        '''save the selected buffer to a file'''
        exportFile = idc.AskFile(1, "*.*", 'Export Buffer')
        f = open(exportFile, 'wb')
        f.write(self.buffer)
        f.close()

    def stats(self):
        print "start: %s" % hex(self.start)
        print "end: %s" % hex(self.end)
        print "len: %s" % hex(len(self.buffer))
With this class data can be selected, saved to a buffer and then stored to a file. This is useful
for encoded or encrypted data in an IDB. We can use IO_DATA to select the data, decode
the buffer in Python and then patch the IDB. Below is an example of how to use the IO_DATA class.
Python>f = IO_DATA()
Python>f.stats()
start: 0x401528
end: 0x401549
len: 0x21
Rather than explaining each line of the code it would be useful for the reader to go over the
functions one by one and see how they work. The below bullet points explain each variable
and what each function does. obj is whatever variable we assign the class to. f is the obj
in f = IO_DATA().
obj.start
contains the address of the start of the selected offset.
obj.end
contains the address of the end of the selected offset.
obj.buffer
contains the binary data.
obj.ogLen
contains the size of the buffer.
obj.getData()
copies the binary data between obj.start and obj.end to obj.buffer.
obj.run()
copies the selected data to the buffer in a binary format.
obj.patch()
patches the IDB at obj.start with the data in obj.buffer.
obj.patch(d)
patches the IDB at obj.start with the argument data.
obj.importb()
opens a file and saves the data in obj.buffer.
obj.export()
saves the data in obj.buffer to a file chosen in a save dialog.
obj.stats()
prints hex of obj.start, obj.end and the obj.buffer length.
Pin is a dynamic binary instrumentation framework for IA-32 and x86-64. Combining the
dynamic analysis results of Pin with the static analysis of IDA makes for a powerful mix. A
hurdle for combining IDA and Pin is the initial setup and running of Pin. The below steps are
the 30 second (minus downloads) guide to installing and executing a Pintool that traces an
executable and adds the executed addresses to an IDB.
00401500
00401506
00401520
00401526
00401549
0040154F
0040155E
00401564
0040156A
After the Pintool has executed we can run the following IDAPython code to add comments
to all the executed addresses. The output file itrace.out will need to be in the working
directory of the IDB.
f = open('itrace.out', 'r')
lines = f.readlines()
for y in lines:
    y = int(y, 16)
    idc.SetColor(y, CIC_ITEM, 0xfffff)
    com = idc.GetCommentEx(y, 0)
    if com == None or 'count' not in com:
        idc.MakeComm(y, "count:1")
    else:
        try:
            count = int(com.split(':')[1], 16)
        except:
            # comment is not in the expected "count:<hex>" form, skip it
            print hex(y)
            continue
        tmp = "count:0x%x" % (count + 1)
        idc.MakeComm(y, tmp)
f.close()
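We first open up itrace.out and read all lines into a list. We then iterate over each line in
the list. Since the address in the output file is in hexadecimal string format we need to
convert it into an integer. We then color the address and either add a "count:1" comment or
increment the count stored in an existing comment.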
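An alternative is to count the hits in memory first and write each comment only once. The sketch below shows that idea outside of IDA. count_trace and make_comments are hypothetical helpers and the sample lines are made up in the format of itrace.out. Unlike the script above it always writes the count in hexadecimal, including the first hit.

```python
from collections import Counter


def count_trace(lines):
    """Count how often each address appears in hexadecimal trace lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[int(line, 16)] += 1
    return counts


def make_comments(counts):
    """Map each address to a comment string such as 'count:0x2'."""
    return dict((ea, "count:0x%x" % n) for ea, n in counts.items())


sample = ["00401500\n", "00401506\n", "00401500\n", "\n"]
comments = make_comments(count_trace(sample))
```

Inside IDA each entry of comments could then be applied with a single idc.MakeComm(ea, text) call per address.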
Sometimes it can be useful to create IDBs or ASMs for all the files in a directory. This can
help save time when analyzing a set of samples that are part of the same family of
malware. It is much easier to do batch file generation than doing it manually on a large set.
To do batch analysis we will need to pass the -B argument to the text-mode idaw.exe. The
below code can be copied to the directory that contains all the files we would like to
generate files for.
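import os
import subprocess
import glob

paths = glob.glob("*")
ida_path = os.path.join(os.environ['PROGRAMFILES'], "IDA", "idaw.exe")

for file_path in paths:
    # skip Python scripts such as this one
    if file_path.endswith(".py"):
        continue
    subprocess.call([ida_path, "-B", file_path])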
We use glob.glob("*") to get a list of all files in the directory. The argument can be
modified if we wanted to only select a certain wildcard pattern or file type. If we
wanted to only get files with a .exe extension we would use glob.glob("*.exe").
os.path.join(os.environ['PROGRAMFILES'], "IDA", "idaw.exe") is used to
get the path to idaw.exe. Some versions of IDA have a folder name with the version
number present. If this is the case the argument "IDA" will need to be modified to the
folder name. Also, the whole command might have to be modified if we choose to use a
non-standard install location for IDA. For now let's assume the install path for IDA is
C:\Program Files\IDA. After we find the path we loop through all the files in the
directory that do not have a .py extension and pass them to IDA. For an individual
file it would look like C:\Program Files\IDA\idaw.exe -B bad_file.exe. Once run it would
generate an ASM and IDB for the file. All files will be written in the working directory. An
example output can be seen below.
C:\injected>dir
C:\injected>python batch_analysis.py
C:\injected>dir
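Before launching IDA on a large directory it can be worth checking which commands would be run. The sketch below builds the argument lists without executing them. build_batch_commands is a hypothetical helper, and the install path and file names are assumptions used only for illustration.

```python
import os


def build_batch_commands(ida_path, file_names):
    """Return one argument list per file, skipping Python scripts."""
    commands = []
    for name in file_names:
        if name.lower().endswith(".py"):
            continue  # do not feed the batch script itself to IDA
        commands.append([ida_path, "-B", name])
    return commands


# Assumed install location and hypothetical file names.
ida = os.path.join("C:\\Program Files", "IDA", "idaw.exe")
cmds = build_batch_commands(ida, ["bad_file.exe", "batch_analysis.py"])
```

Each entry of cmds could then be handed to subprocess.call, which accepts a list and so avoids quoting problems with the space in Program Files.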
Executing Scripts
IDAPython scripts can be executed from the command line. We can use the following code
to count each instruction in the IDB and then write it to a file named instru_count.txt.
import idc
import idaapi
import idautils

idaapi.autoWait()

count = 0
for func in idautils.Functions():
    # Ignore Library Code
    flags = idc.GetFunctionFlags(func)
    if flags & FUNC_LIB:
        continue
    for instru in idautils.FuncItems(func):
        count += 1
f = open("instru_count.txt", 'w')
print_me = "Instruction Count is %d" % (count)
f.write(print_me)
f.close()

idc.Exit(0)
From a command line perspective the two most important functions are
idaapi.autoWait() and idc.Exit(0). When IDA opens a file it is important to wait for
the analysis to complete. This allows IDA to populate all functions, structures, or other
values that are based on IDA's analysis engine. To wait for the analysis to complete we call
idaapi.autoWait(). It will wait until IDA has completed its analysis. Once the
analysis is completed it will return control back to the script. It is important to execute this
at the beginning of the script before we call any IDAPython functions that rely on the
analysis being complete. Once our script has executed we will need to call idc.Exit(0).
This will stop execution of our script, close out the database and return to the caller of the
script. If we do not, our IDB will not be closed properly.
If we wanted to execute the IDAPython script to count all instructions in an IDB we would
execute the following command line.
-A is for Autonomous mode and -S signals for IDA to run a script on the IDB once it has
opened. In the working directory we would see a file named instru_count.txt that
contains a count of all instructions.
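The two switches can also be assembled from Python, for example when the script should be run against many IDBs. The sketch below only builds the argument list. build_script_command is a hypothetical helper, and the executable, script and IDB names are placeholders, not the paths used in the example above.

```python
def build_script_command(ida_path, script, idb):
    """Return the argument list for running a script on an IDB."""
    # -A: autonomous mode; -S<script>: run the script once the IDB is open.
    return [ida_path, "-A", "-S" + script, idb]


# Hypothetical names, used only for illustration.
cmd = build_script_command("idaw.exe", "count.py", "sample.idb")
```

Note that the script name follows -S directly, without a space in between.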