Anti Virus 2.0 "Compilers in Disguise": Mihai G. Chiriac Bitdefender


Anti Virus 2.0
“Compilers in disguise”
Mihai G. Chiriac
BitDefender
Talk outline
• AV History
• Emulation basics
• Compiler technology
– Intermediate Language
– Optimizations
– Code generation
• Conclusions
AV History!
• String searching
– Aho-Corasick, KMP, Boyer-Moore
– PolySearch
– Bookmarks (from top, tail, EP?)
– Hashes (B, ofs1, sz1, crc1, ofs2, sz2, crc2)

NYB.A
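The hash entry above can be read as a bookmark plus two (offset, size, CRC) regions. A minimal sketch of such a check follows. It assumes that "B" names the bookmark, that offsets are relative to it, and that the checksum is CRC-32; none of this is confirmed by the slides, and the names HashSig and matches are hypothetical.

```python
import zlib
from typing import NamedTuple


class HashSig(NamedTuple):
    bookmark: str  # assumed meaning of "B": "top", "tail" or "ep"
    ofs1: int
    sz1: int
    crc1: int
    ofs2: int
    sz2: int
    crc2: int


def bookmark_base(data: bytes, bookmark: str, entry_point: int) -> int:
    """Translate a bookmark name into an absolute file offset."""
    if bookmark == "top":
        return 0
    if bookmark == "tail":
        return len(data)
    if bookmark == "ep":
        return entry_point
    raise ValueError(f"unknown bookmark: {bookmark}")


def region_crc(data: bytes, start: int, size: int):
    """CRC-32 of data[start:start+size], or None if the region is out of range."""
    if start < 0 or size <= 0 or start + size > len(data):
        return None
    return zlib.crc32(data[start:start + size])


def matches(data: bytes, sig: HashSig, entry_point: int = 0) -> bool:
    """True only if both regions hash to the values stored in the signature."""
    base = bookmark_base(data, sig.bookmark, entry_point)
    return (region_crc(data, base + sig.ofs1, sig.sz1) == sig.crc1
            and region_crc(data, base + sig.ofs2, sig.sz2) == sig.crc2)
```

Checking two short regions instead of one long string keeps the signature small while still making accidental matches unlikely.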
AV History (cont’d)
• Encrypted viruses
– Static decryption loop (signature)
– Simple encryption (xray-ing)
– Algorithmic detection

Cascade.1706 decryption loop
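X-raying can be illustrated on the simplest possible case: a constant one-byte XOR key. That cipher is an assumption chosen for illustration, not the scheme used by any particular virus, and xray_xor is a hypothetical name. The sketch derives the key from the first byte of a known plaintext and verifies it against the rest.

```python
def xray_xor(buf: bytes, plain: bytes):
    """Find `plain` inside `buf` under a constant one-byte XOR key.

    Returns (offset, key) for the first hit, or None. `plain` must be non-empty.
    """
    n = len(plain)
    for ofs in range(len(buf) - n + 1):
        key = buf[ofs] ^ plain[0]  # the key that would make the first byte match
        if all(buf[ofs + i] ^ key == plain[i] for i in range(n)):
            return ofs, key
    return None
```

No decryption loop is executed here: the scanner attacks the cipher directly, which is why this only works for weak encryption.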


AV History (cont’d)
• Polymorphic viruses
– Simple encryption (xray-ing)
– Algorithmic detection, heuristics

TPE.Giraffe.A
Emulation!
• Hardware
– Virtual CPU
– Virtual memory
– Virtual devices
• Software
– Partial OS simulation
• Bonus Goodies
– Fake IRC, SMTP, DNS, etc servers
Workflow
• Init the CPU / VM
• Init the virtual OS
– Modules
– Structures
• Map the file
• Start emulation from cs:eip
• Scan (when conditions are met)
• Quit (when conditions are met)
Sample

Win32.Parite (Pinfi) decryption loop


Ready to emulate our first instruction?
…Not yet…
Chores! 
• Pre-instruction tasks
– DRx handling
– Segment access rights
– Page access rights
• Post-instruction tasks
– TF handling
– Update the virtual EIP
– Update the EI number
Tasks
• Fetch instruction from cs:eip
• Decode
– Handle prefixes

• Emulate!
• Easy, huh? 
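The pre-instruction, fetch, decode, emulate and post-instruction steps can be sketched on a toy machine. The four-opcode instruction set below is invented for illustration and has nothing to do with x86 encoding; the single bounds check stands in for the real access-rights chores.

```python
import struct

MASK = 0xFFFFFFFF


def emulate(code: bytes, acc: int, max_instructions: int = 10_000):
    """Run a toy program; returns (accumulator, number of emulated instructions).

    Toy opcodes (hypothetical): 00 HALT, 01 ADD acc, imm8, 02 DEC acc, 03 JNZ rel8.
    """
    eip = 0
    executed = 0
    while executed < max_instructions:
        # Pre-instruction task: stand-in for segment / page access checks.
        if not 0 <= eip < len(code):
            raise RuntimeError(f"fetch outside mapped code at {eip:#x}")
        opcode = code[eip]  # fetch
        if opcode == 0x00:  # HALT
            break
        if opcode == 0x01:  # ADD acc, imm8
            acc = (acc + code[eip + 1]) & MASK
            next_eip = eip + 2
        elif opcode == 0x02:  # DEC acc
            acc = (acc - 1) & MASK
            next_eip = eip + 1
        elif opcode == 0x03:  # JNZ rel8 (signed displacement)
            rel = struct.unpack("b", code[eip + 1:eip + 2])[0]
            next_eip = eip + 2 + (rel if acc != 0 else 0)
        else:
            raise RuntimeError(f"undecodable opcode {opcode:#x} at {eip:#x}")
        # Post-instruction tasks: update the virtual EIP and the instruction count.
        eip = next_eip
        executed += 1
    return acc, executed
```

Even in this toy, every loop iteration repeats the same fetch and decode work, which is the overhead the rest of the talk sets out to remove.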
On average, one in every three instructions references memory…

Memory accesses
• We have to virtualize the entire 4 GB space…
• Every memory access needs:
– Segment access checks
– Linear address computation
– Page access checks
– Hardware debugging checks
– SMC checks!! (for memory writes)
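A minimal sketch of what such a checked memory access might look like, assuming a sparse page table, a flat address space (so segment checks and debug-register checks are omitted) and a set of pages known to hold already-decoded code. The class and attribute names are hypothetical.

```python
PAGE_SIZE = 0x1000
READ, WRITE = 1, 2


class PageFault(Exception):
    pass


class VirtualMemory:
    def __init__(self):
        self.pages = {}          # page number -> bytearray, allocated on demand
        self.rights = {}         # page number -> READ / WRITE bit mask
        self.code_pages = set()  # pages that hold already-decoded code
        self.smc_hits = []       # addresses of writes that touched code pages

    def map(self, addr: int, size: int, rights: int) -> None:
        first, last = addr // PAGE_SIZE, (addr + size - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            self.pages.setdefault(page, bytearray(PAGE_SIZE))
            self.rights[page] = rights

    def _check(self, addr: int, needed: int) -> int:
        page = addr // PAGE_SIZE
        if (self.rights.get(page, 0) & needed) != needed:
            raise PageFault(hex(addr))
        return page

    def read8(self, addr: int) -> int:
        page = self._check(addr, READ)
        return self.pages[page][addr % PAGE_SIZE]

    def write8(self, addr: int, value: int) -> None:
        page = self._check(addr, WRITE)
        if page in self.code_pages:  # self-modifying code check, writes only
            self.smc_hits.append(addr)
        self.pages[page][addr % PAGE_SIZE] = value & 0xFF
```

Only touched pages are allocated, so the full 4 GB space is virtualized without being backed by real memory.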
Problems…
• Millions of instructions…
• Polymorphic decryption loops are full of do-nothing, “garbage” code…
• Decompression loops are optimized for size, not speed…
• …This results in unacceptable performance…
[Chart: UPX decompression. Vertical axis: Parses, 0 to 1,600,000. Horizontal axis: code addresses from 0x00565577 to 0x7FF8055D (labels truncated in the source).]
Advantages
• Typically, an emulator spends the most time in loops…
• A small percentage of code is responsible for a large percentage of emulation time…
• So… we know what to optimize!
The plan
• Identify hot-spots
– Basic blocks that execute very frequently

• Try to make them run as fast as possible
– Reducing the set of repetitive actions to a minimum
– Reducing the number of redundant operations to a minimum
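Hot-spot identification can be as simple as counting how often each basic block is entered. The sketch below assumes a fixed threshold; the value and the class name are hypothetical, not taken from the talk.

```python
from collections import Counter


class HotSpotProfiler:
    def __init__(self, threshold: int = 1000):
        self.threshold = threshold
        self.counts = Counter()  # basic-block start address -> entry count
        self.hot = set()         # blocks worth optimizing

    def enter_block(self, addr: int) -> bool:
        """Record one execution; return True exactly when the block becomes hot."""
        self.counts[addr] += 1
        if self.counts[addr] == self.threshold:
            self.hot.add(addr)
            return True
        return False
```

The return value lets the emulator trigger the expensive optimization step once per block rather than on every entry.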
Back to our code…
• .420010 (31 1C 3E) xor [esi+edi], ebx
First thoughts
• For loops, keep the opcodes already
decoded!
• Memory model is usually flat…
– We can catch accesses to DS, SS,…
• Hardware debugging rarely used…
– We can catch accesses to DRx
• Trap Flag rarely used…
– We can monitor accesses to EFlags
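Keeping decoded opcodes can be sketched as a cache keyed by instruction address, invalidated when a write overlaps a cached instruction. The decoder callback and its (decoded, length) return shape are assumptions made for this sketch.

```python
class DecodeCache:
    def __init__(self, decoder):
        self.decoder = decoder  # hypothetical callable: addr -> (decoded, length)
        self.cache = {}
        self.misses = 0

    def fetch(self, addr: int):
        """Return the decoded instruction at addr, decoding it only once."""
        entry = self.cache.get(addr)
        if entry is None:
            self.misses += 1
            entry = self.cache[addr] = self.decoder(addr)
        return entry

    def invalidate(self, addr: int, size: int = 1) -> None:
        """Drop every cached instruction overlapping [addr, addr + size)."""
        stale = [a for a, (_, length) in self.cache.items()
                 if a < addr + size and addr < a + length]
        for a in stale:
            del self.cache[a]
```

In a loop, every iteration after the first is served from the cache; the invalidation hook is what keeps the shortcut safe in the presence of self-modifying code.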
Back to our code…
• .420010 (31 1C 3E) xor [esi+edi], ebx
But we can do much more!
• x86 - Very rich instruction set
– One instruction – many basic operations
– Different encodings, same result
– Hard(er) to optimize…

• Mike’s Intermediate Language Format
• …apparently the acronym is taken
IL Basics
• Very RISC-like
• Single-purpose micro-operations
• Infinite number of virtual registers
• Lots of information useful for optimizations
– Operation type, operands
– Input / output variables (use-define)
• Lots of information useful for dynamic analysis
– Memory access info
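One possible shape for such a micro-op record, with its use and define sets and a memory-access marker. The field names and the lowering helper are hypothetical; the real IL layout is not given in the slides. The temporaries mm0, tm0 and tm1 follow the naming used in the deck.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class MicroOp:
    op: str                       # operation type
    dst: Optional[str]            # defined variable, None for pure side effects
    srcs: Tuple[str, ...]         # used variables
    touches_memory: bool = False  # memory access info for dynamic analysis

    @property
    def defs(self):
        return {self.dst} if self.dst else set()

    @property
    def uses(self):
        return set(self.srcs)


def lower_xor_mem_reg(base: str, index: str, src: str):
    """Lower `xor [base+index], src` into single-purpose micro-ops."""
    ops = [
        MicroOp("add", "mm0", (base, index)),
        MicroOp("load32", "tm0", ("mm0",), touches_memory=True),
        MicroOp("xor", "tm1", ("tm0", src)),
        MicroOp("store32", None, ("mm0", "tm1"), touches_memory=True),
    ]
    # Only the flags whose operand is explicit in the slides are modeled here.
    ops += [MicroOp("compute_" + flag, flag, ("tm1",)) for flag in ("zf", "sf", "pf")]
    return ops
```

Because every micro-op does exactly one thing, the use and define sets can be read off mechanically, which is what later optimization passes rely on.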
Parite.A decryption (1)
• Decrypt:
• .420010 xor [esi+edi], ebx
• .420013 sub esi, 2
• .420016 sub esi, 2
• .420019 jnz Decrypt

mm0 = esi + edi
tm0 = load32 (mm0)
tm1 = tm0 ^ ebx
store32 (mm0, tm1)
Compute_ZF (tm1)
Compute_SF (tm1)
Compute_PF (tm1)
Compute_OF (OP_XOR, …)
Compute_AF (OP_XOR, …)
Compute_CF (OP_XOR, …)
Parite.A decryption (2)
• Decrypt:
• .420010 xor [esi+edi], ebx
• .420013 sub esi, 2
• .420016 sub esi, 2
• .420019 jnz Decrypt

tm0 = esi
esi = esi – 2
Compute_ZF (esi)
Compute_SF (esi)
Compute_PF (esi)
Compute_OF (OP_SUB, …)
Compute_AF (OP_SUB, …)
Compute_CF (OP_SUB, …)
Parite.A decryption (3)
• Decrypt:
• .420010 xor [esi+edi], ebx
• .420013 sub esi, 2
• .420016 sub esi, 2
• .420019 jnz Decrypt

tm0 = esi
esi = esi – 2
Compute_ZF (esi)
Compute_SF (esi)
Compute_PF (esi)
Compute_OF (OP_SUB, …)
Compute_AF (OP_SUB, …)
Compute_CF (OP_SUB, …)
Parite.A decryption (4)

• We can follow the use-def chains and remove unnecessary micro-ops…

mm0 = esi + edi
tm0 = load32 (mm0)
tm1 = tm0 ^ ebx
store32 (mm0, tm1)
esi = esi – 2
tm0 = esi
esi = esi – 2
Compute_ZF (esi)
Compute_SF (esi)
Compute_PF (esi)
Compute_OF (OP_SUB, …)
Compute_AF (OP_SUB, …)
Compute_CF (OP_SUB, …)
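Removing unnecessary micro-ops can be sketched as a backward liveness pass over one basic block. Only ZF and CF are modeled, and treating CF as depending on tm0 and esi is an assumption, since the slides elide those operands. The tuple layout (text, defined variable, used variables, has side effect) is hypothetical.

```python
def eliminate_dead(block, live_out):
    """Keep only micro-ops whose result is still needed or that have side effects."""
    live = set(live_out)
    kept = []
    for text, dst, srcs, side_effect in reversed(block):
        if side_effect or dst in live:
            live.discard(dst)   # this definition satisfies later uses
            live.update(srcs)   # its inputs are now needed
            kept.append(text)
    kept.reverse()
    return kept


PARITE_BLOCK = [
    ("mm0 = esi + edi", "mm0", ("esi", "edi"), False),
    ("tm0 = load32(mm0)", "tm0", ("mm0",), False),
    ("tm1 = tm0 ^ ebx", "tm1", ("tm0", "ebx"), False),
    ("store32(mm0, tm1)", None, ("mm0", "tm1"), True),
    ("zf = ZF(tm1)", "zf", ("tm1",), False),
    ("cf = CF(xor)", "cf", ("tm1",), False),
    ("tm0 = esi", "tm0", ("esi",), False),
    ("esi = esi - 2", "esi", ("esi",), False),
    ("zf = ZF(esi)", "zf", ("esi",), False),
    ("cf = CF(sub)", "cf", ("tm0", "esi"), False),
    ("tm0 = esi", "tm0", ("esi",), False),
    ("esi = esi - 2", "esi", ("esi",), False),
    ("zf = ZF(esi)", "zf", ("esi",), False),
    ("cf = CF(sub)", "cf", ("tm0", "esi"), False),
]
```

Calling eliminate_dead(PARITE_BLOCK, {"esi", "zf", "cf"}) drops the flag computations of the xor and of the first sub, because the second sub overwrites them before anything reads them.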
Parite.A decryption (5)

• We can compute some values only if really needed…

mm0 = esi + edi
tm0 = load32 (mm0)
tm1 = tm0 ^ ebx
store32 (mm0, tm1)
esi = esi – 2
tm0 = esi
esi = esi – 2
Set_LazyFlags (OP_SUB, …)
Compute_ZF (esi)
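Lazy flag evaluation can be sketched by recording the last arithmetic operation and deriving a flag only when something reads it. The class below models ZF, SF and CF for 32-bit add, sub and logical operations; the interface is hypothetical and is not the Set_LazyFlags signature, whose arguments the slides elide.

```python
MASK = 0xFFFFFFFF


class LazyFlags:
    def __init__(self):
        self.op, self.a, self.b, self.result = None, 0, 0, 0

    def set(self, op: str, a: int, b: int, result: int) -> None:
        """Cheap: just remember the operation, no flag is computed here."""
        self.op, self.a, self.b, self.result = op, a & MASK, b & MASK, result & MASK

    def zf(self) -> int:
        return int(self.result == 0)

    def sf(self) -> int:
        return self.result >> 31

    def cf(self) -> int:
        if self.op == "sub":
            return int(self.a < self.b)       # unsigned borrow
        if self.op == "add":
            return int(self.result < self.a)  # unsigned carry out
        return 0                              # logical operations clear CF


def sub32(flags: LazyFlags, a: int, b: int) -> int:
    result = (a - b) & MASK
    flags.set("sub", a, b, result)
    return result
```

In the Parite loop only ZF is ever read (by jnz), so the five other flag computations per instruction are never performed.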
Static single assignment
Sample code…          Three-address code…

int a, b, c;          int a, b, c;

a = 5;                a = 5;
b = 3;                b = 3;
c = a + b + 3;        c = a + b;
b = c + 1;            c = c + 3;
                      b = c + 1;
SSA (cont’d)
Three-address code…   SSA Form

a = 5;                a[0] = cnst(5)
b = 3;                b[0] = cnst(3)
c = a + b;            c[0] = a[0]+b[0]
c = c + 3;            c[1] = c[0]+cnst(3)
b = c + 1;            b[1] = c[1]+cnst(1)

Easy! Create a different version for every variable state!
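For straight-line code, the versioning rule can be sketched in a few lines. The input encoding (destination, list of operands joined by +) is a hypothetical choice for this example; code with branches would additionally need φ-functions at join points, which this sketch does not handle.

```python
def to_ssa(code):
    """Rename straight-line three-address code into SSA form.

    `code` is a list of (dst, operands); integer operands are constants.
    """
    version = {}

    def use(x):
        if isinstance(x, int):
            return f"cnst({x})"
        return f"{x}[{version[x]}]"  # latest version of the variable

    out = []
    for dst, operands in code:
        rhs = "+".join(use(x) for x in operands)    # read current versions first
        version[dst] = version.get(dst, -1) + 1     # then create a new version
        out.append(f"{dst}[{version[dst]}] = {rhs}")
    return out


THREE_ADDRESS = [("a", [5]), ("b", [3]), ("c", ["a", "b"]), ("c", ["c", 3]), ("b", ["c", 1])]
```

The right-hand side is renamed before the destination's version is bumped, so `c = c + 3` correctly reads c[0] and writes c[1].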


SSA (cont’d)
SSA Form…                  Graph!

a[0] = cnst(5)                    b[1]
b[0] = cnst(3)                     +
c[0] = a[0]+b[0]                  / \
c[1] = c[0]+cnst(3)            c[1]  cnst (1)
b[1] = c[1]+cnst(1)             +
                               / \
                            c[0]  cnst (3)
                             +
                            / \
                         a[0]  b[0]
SSA (cont’d)
• Very simple optimization framework
– Constant folding
– Constant propagation
– Common sub-expression elimination
– Dead code removal
• Expensive, so it’s used only when needed…

           b[1]
            +
           / \
        c[1]  cnst (1)
         +
        / \
     c[0]  cnst (3)
      +
     / \
  a[0]  b[0]
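Constant folding and propagation over SSA form can be sketched as a single forward pass, since each name is defined exactly once. The tuple encoding (name, op, args) and the restriction to 32-bit addition are assumptions made to keep the example short.

```python
MASK = 0xFFFFFFFF


def fold_constants(ssa):
    """Propagate known constants and fold additions whose inputs are all known."""
    known = {}  # SSA name -> constant value
    out = []
    for name, op, args in ssa:
        if op == "cnst":
            known[name] = args[0]
        elif op == "add":
            # Constant propagation: replace names by their known values.
            vals = [known.get(a, a) if isinstance(a, str) else a for a in args]
            if all(isinstance(v, int) for v in vals):
                # Constant folding: the whole expression is now a constant.
                known[name] = sum(vals) & MASK
                op, args = "cnst", (known[name],)
            else:
                args = tuple(vals)
        out.append((name, op, args))
    return out


SSA = [
    ("a0", "cnst", (5,)),
    ("b0", "cnst", (3,)),
    ("c0", "add", ("a0", "b0")),
    ("c1", "add", ("c0", 3)),
    ("b1", "add", ("c1", 1)),
]
```

After this pass no instruction depends on another, so dead code removal only has to keep the definitions that are live at the end of the block.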
Memory!
0040517E 812B 84F1183C SUB DWORD PTR DS:[EBX],3C18F184
00405184 832B 96 SUB DWORD PTR DS:[EBX],-6A
00405187 013B ADD DWORD PTR DS:[EBX],EDI
00405189 D1CF ROR EDI,1
0040518D 832B DF SUB DWORD PTR DS:[EBX],-21
00405190 812B 69802E61 SUB DWORD PTR DS:[EBX],612E8069
00405196 29C9 SUB ECX,ECX
00405198 812B CD05B390 SUB DWORD PTR DS:[EBX],90B305CD
0040519E 832B 79 SUB DWORD PTR DS:[EBX],79
004051A3 87C1 XCHG ECX,EAX
004051A5 29D1 SUB ECX,EDX
004051A7 832B C9 SUB DWORD PTR DS:[EBX],-37
004051AE 2933 SUB DWORD PTR DS:[EBX],ESI

Win32.Harrier decryption loop (partial)


Challenges
• Memory locations = variables, but…
– Hard to prove the addresses are valid…
– Problems with pointer aliasing (including
ESP!!)

• A possible solution
– Perform these optimizations only after we’ve
gathered a set of run-time data…
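As an illustration of what becomes possible once a memory operand can be treated as a variable, the sketch below merges adjacent subtractions of constants from the same location. It assumes that the flags of the intermediate instructions are dead and that run-time data has shown the location does not alias the code itself; both are assumptions, and the tuple encoding is hypothetical.

```python
MASK = 0xFFFFFFFF


def fold_mem_subs(ops):
    """Merge runs of `sub dest, imm` on the same dest into one subtraction.

    `ops` is a list of (mnemonic, dest, value); integer values are immediates.
    """
    out = []
    for mnem, dest, val in ops:
        prev = out[-1] if out else None
        if (mnem == "sub" and isinstance(val, int) and prev is not None
                and prev[0] == "sub" and prev[1] == dest and isinstance(prev[2], int)):
            out[-1] = ("sub", dest, (prev[2] + val) & MASK)  # arithmetic mod 2**32
        else:
            out.append((mnem, dest, val))
    return out


# First three instructions of the listing above.
HARRIER_HEAD = [
    ("sub", "[ebx]", 0x3C18F184),
    ("sub", "[ebx]", -0x6A & MASK),
    ("add", "[ebx]", "edi"),
]
```

Each merged pair saves one load, one store and one full set of memory access checks per loop iteration.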
Execution modes – 1
• No code generation! 
• Simply simulate the micro-ops
• Advantages:
– Very portable
– Easy to profile
– Easy to debug
• Disadvantages:
– Slow 
PSP 
Execution modes – 2
• Trivial code generation…

• Simply link the micro-functions that simulate the micro-ops
– Most of them are 2-4 x86 instructions
– Compiler generated, so they’re portable
– Need a (very basic) platform-specific linker
– Fast! 
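The linking idea can be illustrated by analogy in Python: each micro-op is turned once into a small closure, and a block becomes a fixed sequence of calls with no decoding or dispatch left. This is only an analogy, since the talk describes linking compiler-generated native micro-functions; all names below are hypothetical, and memory is modeled as a dictionary of 32-bit words.

```python
MASK = 0xFFFFFFFF


def make_add(dst, a, b):
    def f(s): s[dst] = (s[a] + s[b]) & MASK
    return f


def make_sub_imm(dst, a, imm):
    def f(s): s[dst] = (s[a] - imm) & MASK
    return f


def make_xor(dst, a, b):
    def f(s): s[dst] = s[a] ^ s[b]
    return f


def make_load32(dst, addr):
    def f(s): s[dst] = s["mem"].get(s[addr], 0)
    return f


def make_store32(addr, src):
    def f(s): s["mem"][s[addr]] = s[src]
    return f


FACTORIES = {"add": make_add, "sub_imm": make_sub_imm, "xor": make_xor,
             "load32": make_load32, "store32": make_store32}


def link_block(micro_ops):
    """Resolve every micro-op to its function once; return a runnable block."""
    funcs = [FACTORIES[name](*args) for name, *args in micro_ops]

    def run(state):
        for f in funcs:
            f(state)
    return run


def run_until_zero(run, state, reg="esi", limit=1_000_000):
    """Repeat the block while `reg` is non-zero, like the closing jnz."""
    for _ in range(limit):
        run(state)
        if state[reg] == 0:
            return True
    return False


PARITE_BODY = [
    ("add", "mm0", "esi", "edi"),
    ("load32", "tm0", "mm0"),
    ("xor", "tm1", "tm0", "ebx"),
    ("store32", "mm0", "tm1"),
    ("sub_imm", "esi", "esi", 2),
    ("sub_imm", "esi", "esi", 2),
]
```

The operand lookup that an interpreter repeats on every execution happens here only once, at link time.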
Execution modes – 3
• Generate code tailored for the target CPU!
• Advantages
– Fastest!
– We can combine multiple micro-ops into a
single CPU instruction
– Special case: X86->X86
• Disadvantages
– Platform dependent
Speed statistics
[Chart: Speed, vertical axis 0 to 20]

                   Speed
Normal execution   1
uOp simulation     17.44
uOp link           9.11
code gen           5.3
Exit conditions
• We want to quit as early as possible when
analyzing clean files
– Too many GUI calls?
• We want to quit in less than X seconds, no
matter what…
– Inject “time check” code…
• We want to “chew” as much from the file
as possible in those X seconds…
What NOT to do…
• For every basic block!!!
• pushfd / pushad
• call GetTickCount
• sub eax, dword ptr [start_count]
• xor edx, edx
• mov ecx, 0x3e8
• div ecx
• cmp eax, dword ptr [max_seconds]
• jg __out
• popad / popfd
UPX CFG
An idea...
• We have the control-flow-graph…
• Why add “time check” code for every BB?
– We can check only once / cycle in the CFG
– Make sure there’s at least one “time check”
per graph cycle
• Easier way!
– We can add “time check” code only for
“backward” branches 
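The backward-branch rule works because addresses cannot keep increasing all the way around a cycle, so every cycle in the CFG contains at least one edge whose target is not above its source. A minimal sketch, assuming the CFG is a mapping from block start address to successor addresses; the Deadline helper is a hypothetical stand-in for the injected time check.

```python
import time


def blocks_needing_time_check(cfg):
    """Return the blocks that end in a backward (or self) branch.

    `cfg` maps a block's start address to the start addresses of its successors.
    """
    return {src for src, succs in cfg.items() if any(dst <= src for dst in succs)}


class Deadline:
    def __init__(self, seconds: float):
        self.end = time.monotonic() + seconds  # computed once, up front

    def expired(self) -> bool:
        return time.monotonic() >= self.end    # one clock read and a compare


EXAMPLE_CFG = {
    0x100: [0x110],
    0x110: [0x120, 0x130],
    0x120: [0x110],  # loop back edge
    0x130: [],
}
```

Precomputing the deadline also removes the subtraction and division from the per-check path shown on the "What NOT to do" slide.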
Scan conditions
• Old(er) techniques
– Specific APIs
– Common startup code (CRT?)
• New(er) techniques
– Execution from a “dirty” page
– Memory access patterns! (e.g. linear
decryption loops)
– Suspicious branches, purging of decryption
code etc …
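The "dirty page" condition can be sketched by remembering which pages the emulated program has written to and checking the page of every branch target. Clearing the page after a hit, so that one decrypted region triggers one scan, is a design assumption of this sketch, as are the names.

```python
PAGE_SHIFT = 12  # 4 KB pages


class DirtyPageTracker:
    def __init__(self):
        self.dirty = set()  # page numbers written since the last scan

    def on_write(self, addr: int, size: int = 1) -> None:
        first = addr >> PAGE_SHIFT
        last = (addr + size - 1) >> PAGE_SHIFT
        self.dirty.update(range(first, last + 1))

    def should_scan(self, eip: int) -> bool:
        """True when execution reaches a page the program itself has modified."""
        page = eip >> PAGE_SHIFT
        if page in self.dirty:
            self.dirty.discard(page)  # scan once per freshly written page
            return True
        return False
```

Execution from a written page is a strong hint that a decryption or decompression stage has just finished, which is the moment the payload is visible in memory.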
Conclusions
• CPU-intensive packers are here to stay…
• …Ex: VMProtect requires 40 billion
instructions…
• Code optimization is a good way to reduce
analysis time…
• Compiler-like structures are good ways of
solving other difficult AV problems 
Questions?
[email protected]
