Lecture 02 2022
Thea Rossman
Jan 6, 2022
Logistics
● Today (+ exercises): what are some tools that we can use to find mistakes in
C/C++ code? What are their limitations?
● Next week: How do other languages address some shortcomings of C?
How can we find bugs in a program?
Dynamic Analysis
Dynamic analysis: high-level
● Run the program, watch what it does, and look for problematic behavior
● Can find problems, but only if the program exhibits problematic behavior on
the inputs you use to test. (Separately, some tools only check for certain
types of issues.)
● Commonly combined with techniques to run the program with lots of different
test inputs (e.g. fuzzing), yet this still can’t give us any assurances that code
is bug-free
● Dynamic analysis is great! Test your code! *and* understand the limitations!
Dynamic analysis tool: Valgrind
#include <stdlib.h>

int main() {
    char *buf = (char*)malloc(8);
    buf[16] = 'a';  // out of bounds: the allocation is only 8 bytes
}
Compiler output (the original binary):
mov edi, 8
call malloc
mov QWORD PTR [rbp-8], rax
mov rax, QWORD PTR [rbp-8]
add rax, 16
mov BYTE PTR [rax], 97

What runs under Valgrind (instrumented):
mov edi, 8
call valgrind_malloc
mov QWORD PTR [rbp-8], rax
record memory write ^
mov rax, QWORD PTR [rbp-8]
record memory read ^
add rax, 16
mov BYTE PTR [rax], 97
record memory write ^

Invalid write of size 1 (writing to the heap, but it’s not inside any heap allocation that was previously made)
Valgrind (summary)
● Works with any binary compiled by any compiler (even if you don’t have
source code available!)
● Downside: not a lot of information is available in binaries…
○ E.g. the stack is just a chunk of memory. You might be able to observe
that the stack pointer grows up/down, but no information about how it’s
divided into variables.
■ => cannot detect stack-based buffer overflows!
int main() {
    char buf[8];     // record stack buffer “buf” with size 8
    buf[16] = 'a';   // record write to “buf” with offset 16
}
LLVM Sanitizers
● AddressSanitizer
○ Finds use of improper memory addresses: out of bounds memory accesses, double
free, use after free
● LeakSanitizer
○ Finds memory leaks
● MemorySanitizer
○ Finds use of uninitialized memory
● UndefinedBehaviorSanitizer
○ Finds usage of null pointers, integer/float overflow, etc
● ThreadSanitizer
○ Finds improper usage of threads (second half of CS 110)
● More…
Cool! Let’s sanitize all the code!! 🏎🔥💯
Fundamental limitation of dynamic analysis
● Dynamic analysis can only report bad behavior that actually happened
● If your program worked fine with the input you provided, but it might do bad
things in certain edge cases, dynamic analysis cannot tell you anything about
that
#include <stdio.h>
#include <string.h>

int main() {
    char s[100];
    int i;
    printf("\nEnter a string : ");
    gets(s);
    for (i = 0; s[i] != '\0'; i++) {
        if (s[i] >= 'a' && s[i] <= 'z') {
            s[i] = s[i] - 32;
        }
    }
    printf("\nString in Upper Case = %s", s);
    return 0;
}
How can we find weird edge cases?
Fuzzing
[Diagram: the fuzzing loop]
● Start from an input seed
● Apply a (semi-random) mutation to it
● Run the program and observe its behavior
● Keep the inputs that made the program do new things
● Mutate those inputs again, run the program again, and look for more new behavior
You want to write a tool to help people writing code like this. What do you do?
#include <stdio.h>
#include <string.h>

int main() {
    char s[100];
    int i;
    printf("\nEnter a string : ");
    gets(s);
    for (i = 0; s[i] != '\0'; i++) {
        if (s[i] >= 'a' && s[i] <= 'z') {
            s[i] = s[i] - 32;
        }
    }
    printf("\nString in Upper Case = %s", s);
    return 0;
}
Basic static analysis (“linting”)
Stephen C. Johnson, a computer scientist at Bell Labs, came up with lint in 1978… The term
"lint" was derived from the name of the tiny bits of fiber and fluff shed by clothing, as the
command should act like a dryer machine lint trap, detecting small errors with big effects.
https://en.wikipedia.org/wiki/Lint_(software)
● Linters employ very simple techniques (e.g. ctrl+f) to find obvious mistakes
● The person running the linter can configure a set of rules to enforce
○ Rules are intended to improve the style of the codebase
○ Just because there is a linter error doesn’t mean the code is broken (e.g. it’s possible
to call strcpy() without introducing bugs, but many linters will complain if you call it)
● Common C/C++ linter: clang-tidy
○ Can even auto-fix many of the issues!
You Be the Static Analyzer: Round 2
You want to write a tool to help people writing code like this. What do you do?
void printToUpper(const char *str) {
    char *upper = strdup(str);
    for (int i = 0; str[i] != '\0'; i++) {
        if (str[i] >= 'a' && str[i] <= 'z') {
            upper[i] = str[i] - ('a' - 'A');
        }
    }
    printf("%s\n", upper);
    free(upper);
}

int main(int argc, char *argv[]) {
    printf("Enter a string to make uppercase, or type \"quit\" to quit:\n");
    char input[512];
    // safely read input string
    fgets(input, sizeof(input), stdin);
    char *toMakeUppercase;
    if (strcmp(input, "quit") != 0) {
        toMakeUppercase = input;
    }
    printToUpper(toMakeUppercase);
}
Dataflow analysis
We can trace through how the program might execute, keeping track of possible variable values
You want to write a tool to help people writing code like this. What do you do?
int main(int argc, char *argv[]) {
    // Goal: parse out a string between brackets
    // (e.g. " [target string]" -> "target string")

    char *parsed = strdup(argv[1]);

    // Find open bracket
    char *open_bracket = strchr(parsed, '[');
    if (open_bracket == NULL) {
        printf("Malformed input!\n");
        return 1;
    }

    // Make the output string start after the open bracket
    parsed = open_bracket + 1;

    // Find the close bracket
    char *close_bracket = strchr(parsed, ']');
    if (close_bracket == NULL) {
        printf("Malformed input!\n");
        return 1;
    }

    // Replace the close bracket with a null
    // terminator to end the parsed string there
    *close_bracket = '\0';

    printf("Parsed string: %s\n", parsed);
    free(parsed);
    return 0;
}

Common mistake: early return fails to clean up resources
Dataflow analysis: very powerful!
Liveness analysis: observe when variables go away, and make sure they’re cleaned up appropriately
int main() {
    void *buf = malloc(8);
    freeSometimes(buf);   // at this call: buf = {heap allocation}
    return 0;
}
Dataflow analysis: works across functions
int main() {
    void *buf = malloc(8);
    freeSometimes(buf);
    return 0;             // buf = {heap allocation, freed allocation}
}
Dataflow analysis: works across functions
● False positives
○ Dataflow analysis will follow each branch, even if it’s impossible for some
condition to be true in real life
○ False positives are the Achilles’ heel of static analysis. Need a good
signal/noise ratio or else no one will use your analyzer
● Need to limit scope to get reasonable performance
○ Many static analyzers only analyze a single file at a time: they don’t do
dataflow analysis into/out of functions elsewhere in the codebase
○ If you have a huge codebase, loops, tons of conditions, etc., dataflow
analysis can get unwieldy.
Take CS 243 for more info!
static analysis to the moon 🚀 🚀🌙
Low-hanging fruit #1
🍓 clang-tidy easy.c
(no output here means no issues found)
🍓 cppcheck easy.c
Checking easy.c ...
🍓 scan-build clang-11 -Wall easy.c
scan-build: Using '/usr/local/Cellar/llvm/11.0.0_1/bin/clang-11' for static analysis
scan-build: Analysis run complete.
scan-build: Removing directory '/var/folders/6_/jdc6ljyd5n795x1xl8drptm80000gn/T/scan-build-2021-04-01-002241-43549-1' because it contains no reports.
scan-build: No bugs found.
How do we fix this?
● Okay, I’ll just make sure programs can handle receiving NULL from strchr
● But what if the program is calling strchr on a string that is guaranteed to have
the character they’re looking for? (i.e. strchr will for sure not return NULL)
● And what about all the other functions that can potentially return NULL for
one reason or another?
● And what about…
Low-hanging fruit #2
https://linux.die.net/man/3/strncpy
🍓 clang-tidy easy.c
🍓 cppcheck easy.c
Checking easy.c ...
🍓 scan-build clang-11 -Wall easy.c
scan-build: Using '/usr/local/Cellar/llvm/11.0.0_1/bin/clang-11' for static analysis
scan-build: Analysis run complete.
scan-build: Removing directory '/var/folders/6_/jdc6ljyd5n795x1xl8drptm80000gn/T/scan-build-2021-04-01-002241-43549-1' because it contains no reports.
scan-build: No bugs found.
How do we fix this?
● Okay, I’ll just make sure programs add a null terminator after calling strncpy
● But what if the program actually uses the copied “string” as a character array
instead of a null-terminated string (i.e. the code is actually fine)?
● And how are you going to track down every function that depends on the
string having a null terminator?
● Note: outright banning strncpy() might be a better idea, but there are still
other ways we could end up with a char* that is not a null-terminated string
Fundamental limitations of static analysis
● If you can only look at a few lines of code, it’s hard to tell (without broader context)
whether that code is safe
● Getting broader context is impossible in the general case (see: “the halting problem”)
○ We can guesstimate what values get passed around in a program using dataflow
analysis, and we can guesstimate how they get used, but it breaks down when
code gets complicated
● You can always add more specific things to check for, but there will always be other
ways to mess up
● Begin to think about: is there some way we can make it easier to verify small
snippets of code in isolation, without broader context?
○ More next week: This general idea is a key motivation for Rust!
Takeaways
● If you are writing C/C++, you should absolutely be running sanitizers, fuzzers,
and static analyzers
○ You should understand the limitations of these tools, but…
○ Just because they are limited does not mean they aren’t helpful
● If you are in a position to use languages with more robust protections, you
should!
For next week