Go Speed Tracer
Richard Johnson
RECON 2016
Introduction
• Agenda
– Tracing Applications
– Guided Fuzzing
– Binary Translation
– Hardware Tracing
• Goals
– Understand the attributes required for optimal guided fuzzing
– Identify areas that can be optimized today
– Deliver performant and reusable tracing engines
Applications
• Software Engineering
– Performance Monitoring
– Unit Testing
• Malware Analysis
– Unpacking
– Runtime behavior
– Sandboxing
• Mitigations
– Shadow Stacks
– Memory Safety checkers
Applications
• Software Security
– Corpus distillation
• Minimal set of inputs to reach desired conditions
– Guided fuzzing
• Automated refinement / genetic mutation
– Crash analysis
• Crash bucketing
• Graph slicing
• Root cause determination
– Interactive Debugging
Tracing Engines
• OS Provided APIs
– Debuggers
• ptrace
• dbgeng
• signals
– Hook points
• Linux LTT(ng)
• Linux perf
• Windows Nirvana
• Windows AppVerifier
• Windows Shim Engine
(Check out Alex Ionescu’s RECON 2015 talk)
– Performance counters
• Linux perf
• Windows PDH
Tracing Engines
• Binary Instrumentation
– Compiler plugins
• gcc-gcov
• llvm-cov
– Binary translation
• Valgrind
• DynamoRIO
• Pin
• DynInst
• Frida and others
• ...
Tracing Engines
• Observations
– Academic
• Approach was too closely tied to traditional genetic algorithms
• Not enough attention to performance or real world targets
• Only targeted text protocols
American Fuzzy Lop
• Michal Zalewski 2013
– Bunny The Fuzzer 2007
• Features
– Block coverage via compile time instrumentation
– Simplified approach to genetic algorithm
• Edge transitions are encoded as tuple and tracked in global map
• Includes coverage and frequency
– Uses variety of traditional mutation fuzzing strategies
– Dictionaries of tokens/constants
– First practical high performance guided fuzzer
– Helper tools for minimizing test cases and corpus
– Attempts to be idiot proof
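The edge-tuple encoding above can be sketched in a few lines of C. This is a minimal illustration and not AFL's source: the names `trace_block`, `reset_trace`, `count_edges` and `coverage_map` are hypothetical, the 64 KiB map size is an assumption, and the per-block IDs are assumed to be chosen at instrumentation time.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAP_SIZE (1u << 16) /* assumed map size for this sketch */

uint8_t  coverage_map[MAP_SIZE]; /* global map: one hit counter per edge slot */
uint32_t prev_location;          /* shifted ID of the previously executed block */

/* Hook executed at the start of every instrumented basic block.
 * cur_location is a hypothetical per-block ID below MAP_SIZE. */
void trace_block(uint32_t cur_location)
{
    assert(cur_location < MAP_SIZE);
    /* Fold the (previous, current) pair into one index and count the hit. */
    coverage_map[(cur_location ^ prev_location) % MAP_SIZE]++;
    /* Shift so that A->B and B->A map to different slots. */
    prev_location = cur_location >> 1;
}

/* Clear the map before each run of the target. */
void reset_trace(void)
{
    memset(coverage_map, 0, sizeof coverage_map);
    prev_location = 0;
}

/* Number of distinct edge slots hit in the current run. */
unsigned count_edges(void)
{
    unsigned n = 0;
    for (uint32_t i = 0; i < MAP_SIZE; i++)
        n += (coverage_map[i] != 0);
    return n;
}
```

Each slot holds a counter rather than a flag, so the same map carries both coverage and frequency. Distinct edges can collide in one slot, which the sketch accepts in exchange for a fixed-size, memory-resident map.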
American Fuzzy Lop
• Michal Zalewski 2013
– Bunny The Fuzzer 2007
• Contributions
– Tracks edge transitions
• Not just block entry
– Global coverage map
• Generation tracking
– Fork server
• Reduce fuzz target initialization
– Persistent mode fuzzing
– Builds corpus of unique inputs reusable in other workflows
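A global coverage map is what lets the fuzzer decide whether an input is worth keeping. The sketch below is an assumption-laden illustration, not AFL's implementation: `bucket` and `merge_coverage` are hypothetical names, and the hit-count ranges (1, 2, 3, 4-7, 8-15, 16-31, 32-127, 128+) follow the coarse bucketing commonly described for AFL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Collapse a raw hit count into a single bucket bit so that small
 * changes in loop counts do not register as new behavior. */
uint8_t bucket(uint8_t hits)
{
    if (hits == 0)   return 0;
    if (hits == 1)   return 1;
    if (hits == 2)   return 2;
    if (hits == 3)   return 4;
    if (hits <= 7)   return 8;
    if (hits <= 15)  return 16;
    if (hits <= 31)  return 32;
    if (hits <= 127) return 64;
    return 128;
}

/* Merge one run's trace map into the global map.
 * Returns true if the run hit a new edge or a new hit-count bucket,
 * in which case the input would be added to the corpus. */
bool merge_coverage(uint8_t *global_seen, const uint8_t *trace, size_t size)
{
    bool is_new = false;
    assert(global_seen != NULL && trace != NULL);
    for (size_t i = 0; i < size; i++) {
        uint8_t b = bucket(trace[i]);
        if (b & ~global_seen[i]) { /* bucket bit not seen before */
            global_seen[i] |= b;
            is_new = true;
        }
    }
    return is_new;
}
```

An input that only repeats known edges at known frequencies returns false and is discarded, which keeps the corpus limited to unique behaviors.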
American Fuzzy Lop
• autodafe
– Martin Vuagnoux 2004
– First generation guided fuzzer using pattern matching via API hooks
• Blind Code Coverage Fuzzer
– Joxean Koret 2014
– Uses off-the-shelf components to assemble a guided fuzzer
• radamsa, zzuf, custom mutators
• drcov, COSEINC RunTracer for coverage
• covFuzz
– Atte Kettunen 2015
– Simple node.js server for guided fuzzing
– custom fuzzers, ASanCoverage
Guided Fuzzing
• Required
– Fast tracing engine
• Block based granularity
– Fast logging
• Memory resident coverage map
– Fast evolutionary algorithm
• Minimum of global population map, pool diversity
• Desired
– Portable
– Easy to use
– Helper tools
– Grammar detection
• AFL and Honggfuzz still most practical options
Binary Translation
• Advantages
– Supported on most mainstream OS/archs
– Can be faster than hardware tracing
– Can easily be targeted at certain parts of code
– Can be tuned for specific applications
• Disadvantages
– Performance overhead
• Introduces additional context switch
– ISA compatibility not guaranteed
– Not always robust against detection or escape
Valgrind
• Obligatory slide
• Lots of deep inspection tools
• VEX IR is well suited for security applications
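• Example (Pin API)
// Instrumentation callback: Pin invokes this once per trace it builds.
VOID Trace(TRACE trace, VOID *v)
{
    // Walk every basic block in the trace.
    for (BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
    {
        // Call basic_block_hook on each execution of the block.
        // IARG_FAST_ANALYSIS_CALL requires the hook to be declared
        // with PIN_FAST_ANALYSIS_CALL.
        BBL_InsertCall(bbl, IPOINT_ANYWHERE, AFUNPTR(basic_block_hook),
                       IARG_FAST_ANALYSIS_CALL, IARG_END);
    }
}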
DynamoRIO
• “A connoisseur's DBT”
• Features
– Block level instrumentation
• Blocks are directly copied into code cache
– Direct modification of IL possible
– Portable
• Linux, Windows, Android
• x86/x64, ARM
– C API / BSD Licensed (since 2009)
• Observations
– Much more flexible for block level instrumentation
– Performance is a priority, Windows is a priority
– Powerful tools like Dr Memory
• Shadow memory, taint tracking
• Twice as fast as Valgrind memcheck
DynamoRIO
• Example
static dr_emit_flags_t
event_basic_block(void *drcontext, void *tag, instrlist_t *bb,
                  bool for_trace, bool translating)
{
    instr_t *instr, *first = instrlist_first(bb);
    uint flags;
    /* Our inc can go anywhere, so find a spot where flags are dead. */
    for (instr = first; instr != NULL; instr = instr_get_next(instr))
    {
        flags = instr_get_arith_flags(instr);
        /* OP_inc doesn't write CF but not worth distinguishing */
        if (TESTALL(EFLAGS_WRITE_6, flags) && !TESTANY(EFLAGS_READ_6, flags))
            break;
    }
…
DynamoRIO
• Example
    /* No dead-flags spot found: save the flags around the inserted inc. */
    if (instr == NULL)
        dr_save_arith_flags(drcontext, bb, first, SPILL_SLOT_1);
    instrlist_meta_preinsert(bb,
        (instr == NULL) ? first : instr,
        INSTR_CREATE_inc(drcontext,
            OPND_CREATE_ABSMEM((byte *)&global_count, OPSZ_4)));
    if (instr == NULL)
        dr_restore_arith_flags(drcontext, bb, first, SPILL_SLOT_1);
    return DR_EMIT_DEFAULT;
}
DynInst
• Example
bool insertBBCallback(BPatch_binaryEdit *appBin, BPatch_function *curFunc,
                      char *funcName, BPatch_function *instBBIncFunc, int *bbIndex)
{
    unsigned short randID;
    BPatch_flowGraph *appCFG = curFunc->getCFG();
    BPatch_Set<BPatch_basicBlock *> allBlocks;
    BPatch_Set<BPatch_basicBlock *>::iterator iter;
    /* Populate the block set; without this the loop below visits nothing. */
    if (!appCFG || !appCFG->getAllBasicBlocks(allBlocks))
        return false;
    for (iter = allBlocks.begin(); iter != allBlocks.end(); iter++)
    {
        unsigned long address = (*iter)->getStartAddress();
        /* Random 16-bit ID passed to the callback to identify this block. */
        randID = rand() % USHRT_MAX;
        BPatch_Vector<BPatch_snippet *> instArgs;
        BPatch_constExpr bbId(randID);
        instArgs.push_back(&bbId);
…
DynInst
• Example
…
        /* Insert a call to the callback at the entry of the block. */
        BPatch_point *bbEntry = (*iter)->findEntryPoint();
        BPatch_funcCallExpr instIncExpr(*instBBIncFunc, instArgs);
        BPatchSnippetHandle *handle =
            appBin->insertSnippet(instIncExpr, *bbEntry, BPatch_callBefore,
                                  BPatch_lastSnippet);
        (*bbIndex)++;
    }
    return true;
}
Tuning Binary Translation
• Programmer checklist
– Memory must not be swapped
– Use static variables if necessary
– Must wrap functions with assembly
• disable interrupts
• push all registers
• call interrupt handler
• pop all registers
• iretd
It's a Trap
• Single Stepping
– Enabled by setting the Trap Flag
– After each instruction, CPU checks flag and fires exception if enabled
– Accessible from userspace
– slooooooooow, not applicable
• Branch Trace Flag
– Modifies single step behavior to trap on branch
– Single flag in IA32_DEBUGCTL MSR
– Requires kernel privileges to write to MSR
– Windows includes a mapping from DR7 to set MSR
IA32_DEBUGCTL
– MSR Address 0x1d9
• LBR [0] - Enable Last Branch Record mechanism
• BTF [1] - when enabled with TF in EFLAGS does single stepping on branches
• TR [6] - enables Tracing (sending BTMs to system bus)
• BTS [7] - enables sending BTMs to memory buffer from system bus
• BTINT [8] - full buffer generates interrupt otherwise circular write
• BTS_OFF_OS [9] - skip branches executed at priv. level 0
• BTS_OFF_USR [10] - skip branches executed at priv. level 1,2,3
• FRZ_LBRS_ON_PMI [11] - freeze LBR stack on a PMI
• FRZ_PERFMON_ON_PMI [12] - disable all performance counters on a PMI
• UNCORE_PMI_EN [13] - uncore counter interrupt generation
• SMM_FRZ [14] - event counters are frozen during SMM
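The bit layout above translates directly into masks. The following is a sketch under stated assumptions: the macro and function names are hypothetical, the functions only compute the value to be written, and the actual write to MSR 0x1d9 has to happen in ring 0 (for example with `wrmsr` in a driver).

```c
#include <assert.h>
#include <stdint.h>

#define MSR_IA32_DEBUGCTL 0x1d9u

/* Bit positions as listed on the slide. */
#define DEBUGCTL_LBR         (1ull << 0)
#define DEBUGCTL_BTF         (1ull << 1)
#define DEBUGCTL_TR          (1ull << 6)
#define DEBUGCTL_BTS         (1ull << 7)
#define DEBUGCTL_BTINT       (1ull << 8)
#define DEBUGCTL_BTS_OFF_OS  (1ull << 9)
#define DEBUGCTL_BTS_OFF_USR (1ull << 10)

/* Value for single stepping on branches. The traced thread must
 * also have TF set in EFLAGS for BTF to take effect. */
uint64_t debugctl_for_btf(uint64_t current)
{
    return current | DEBUGCTL_BTF;
}

/* Value for Branch Trace Store into a memory buffer, recording
 * user-mode branches only and raising an interrupt when the buffer fills. */
uint64_t debugctl_for_user_bts(uint64_t current)
{
    uint64_t v = current | DEBUGCTL_TR | DEBUGCTL_BTS |
                 DEBUGCTL_BTINT | DEBUGCTL_BTS_OFF_OS;
    v &= ~DEBUGCTL_BTS_OFF_USR; /* make sure user-mode branches are kept */
    assert((v & DEBUGCTL_BTS_OFF_USR) == 0);
    return v;
}
```

Both helpers take the current MSR value so unrelated bits, such as the PMI freeze controls, are preserved.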
Branch Trace Store
• First generation hardware branch tracing via PMU
• Allows configurable memory buffer for trace storage
• MSR_IA32_DS_AREA MSR defines storage location
struct DS_AREA {
    u64 bts_buffer_base;
    u64 bts_index;
    u64 bts_absolute_maximum;
    u64 bts_interrupt_threshold;
    u64 pebs_buffer_base;
    u64 pebs_index;
    u64 pebs_absolute_maximum;
    u64 pebs_interrupt_threshold;
    u64 pebs_event_reset[4];
};
struct DS_AREA_RECORD {
    u64 flags;
    u64 ip;
    u64 regs[16];
    u64 status;
    u64 dla;
    u64 dse;
    u64 lat;
};
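Consuming the trace amounts to walking the records between `bts_buffer_base` and `bts_index`. The sketch below assumes the 64-bit BTS record layout of three 8-byte fields (branch source, branch target, flags). The names `bts_record`, `ds_area_bts`, `branch_cb` and `bts_drain` are hypothetical, and a real consumer would run in the kernel, typically from the buffer-full interrupt handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t u64;

/* Assumed layout of one BTS record in 64-bit mode. */
struct bts_record {
    u64 from;  /* address of the branch instruction */
    u64 to;    /* address of the branch target */
    u64 flags; /* status flags for the branch */
};

/* The BTS half of the DS area shown on the slide. */
struct ds_area_bts {
    u64 bts_buffer_base;
    u64 bts_index;
    u64 bts_absolute_maximum;
    u64 bts_interrupt_threshold;
};

/* Hypothetical consumer callback, invoked once per recorded branch. */
typedef void (*branch_cb)(u64 from, u64 to, void *ctx);

/* Hand every pending record to cb (if any), then rewind bts_index
 * so the buffer can be reused. Returns the number of records drained. */
size_t bts_drain(struct ds_area_bts *ds, branch_cb cb, void *ctx)
{
    assert(ds->bts_index >= ds->bts_buffer_base);
    const struct bts_record *rec =
        (const struct bts_record *)(uintptr_t)ds->bts_buffer_base;
    const struct bts_record *end =
        (const struct bts_record *)(uintptr_t)ds->bts_index;
    size_t n = 0;
    for (; rec < end; rec++, n++) {
        if (cb != NULL)
            cb(rec->from, rec->to, ctx);
    }
    ds->bts_index = ds->bts_buffer_base;
    return n;
}
```

For guided fuzzing, the callback would typically fold each (from, to) pair into a coverage map instead of storing the raw trace.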
Branch Trace Store
• Features
– Ring -3? Can trace SMM, HyperVisor, Kernel, Userspace [CPL -2 to 3]
– Logs directly to physical memory
• Bypasses CPU cache and eliminates TLB cache misses
• Can be a contiguous segment or a set of ranges
• Ringbuffer snapshot or interrupt mode supported
– Minimal log format
• One bit per conditional branch
• Only indirect branches log dest address
• Interrupts log source and destination
• Decoding log requires original binaries and memory map
– Filter logging based on CR3
– Linux can automatically add log to coredump
– GDB Support
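The "one bit per conditional branch" format corresponds to what Intel Processor Trace calls a taken/not-taken (TNT) packet. The sketch below decodes the one-byte short form under these assumptions: bit 0 is clear, the highest set bit is a stop marker, and the bits between the stop marker and bit 0 are the branch outcomes, oldest first. `decode_short_tnt` is a hypothetical name and this is not a full packet decoder.

```c
#include <assert.h>
#include <stdint.h>

/* Decode one short TNT byte.
 * Writes up to 6 outcomes into out (1 = taken, 0 = not taken),
 * oldest branch first, and returns how many were written.
 * Returns 0 if the byte is not a short TNT packet. */
int decode_short_tnt(uint8_t byte, uint8_t out[6])
{
    if (byte & 1)
        return 0; /* short TNT packets have bit 0 clear */
    if (byte == 0x00 || byte == 0x02)
        return 0; /* padding byte / extended opcode prefix */

    int stop = 7;
    while (!(byte & (1u << stop)))
        stop--; /* the highest set bit is the stop marker */

    int n = 0;
    for (int bit = stop - 1; bit >= 1; bit--)
        out[n++] = (uint8_t)((byte >> bit) & 1);

    assert(n >= 1 && n <= 6);
    return n;
}
```

The outcomes alone say nothing about where execution went, which is why decoding needs the original binaries and memory map: the decoder walks the code and consumes one bit at each conditional branch it encounters.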
Intel Processor Trace