Static Analysis of String Manipulations in Critical Embedded C Programs
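$$
\begin{array}{rcll}
 & \mid & cmd^{*} & \text{arbitrary loop} \\
 & \mid & f() & \text{function call} \\
\tau & ::= & & \text{type} \\
 & \mid & \beta & \text{scalar type} \\
 & \mid & \tau[n] & \text{array type} \\
 & \mid & \{\tau_1\ f_1 \ldots \tau_n\ f_n\} & \text{structure type} \\
\beta & ::= & & \text{scalar type} \\
 & \mid & \text{char} & \text{character type} \\
 & \mid & \text{int} & \text{integer type} \\
 & \mid & \tau* & \text{pointer type} \\
lv & ::= & & \text{left value} \\
 & \mid & x & \text{variable} \\
 & \mid & lv.f & \text{field access} \\
 & \mid & lv[e] & \text{array access} \\
 & \mid & *e & \text{pointer deref.} \\
e & ::= & & \text{expression} \\
 & \mid & c & \text{constant} \\
 & \mid & lv & \text{left value} \\
 & \mid & e_1 \odot e_2 & \text{binary operation} \\
 & \mid & \&lv & \text{address of} \\
 & \mid & (\tau*)e & \text{cast} \\
\odot & ::= & & \text{binary operator} \\
 & \mid & +, -, \times, / & \text{integer arithmetic} \\
 & \mid & +, - & \text{pointer arithmetic} \\
\bowtie & ::= & & \text{comparison} \\
 & \mid & = & \text{equality} \\
 & \mid & \neq & \text{difference}
\end{array}
$$
Fig. 1. Syntax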
maintainable. This is not an easy task especially for a language as complex as C.
To attain these goals we adopt the methodology of abstract interpretation [5]:
section 2 presents the subset of C we tackle and its concrete semantics; section
3 describes the abstraction of strings and the sound static analysis algorithm;
section 4 shows how string copy operations are handled, and what checks are
performed by the tool; sections 5 and 6 address the implementation, experiments
and related work.
2 Embedded C programs
2.1 Syntax
The C programming language is inherently complex, which makes the formal
definition of its semantics a difficult task. Fortunately, and for obvious safety
reasons, programming critical embedded applications is subject to severe
constraints. In practice, only a subset of C is allowed. The main limitation
results from the obligation to know at compile time the maximum memory usage of
any piece of software. To achieve this, the use of dynamic allocation (function
malloc()) and of recursive functions are both forbidden. For the sake of
expositional clarity, we set aside some additional features such as the numerous
C scalar types, union types and goto statements. Dealing with these features
brings issues orthogonal to the object of this paper. Some ideas to address
these issues may be found in [7, 21]. In the end, we consider the relatively
small kernel language with the syntax of figure 1. Complex assignments in C are
broken down to simpler assignments between scalar types. All variables declared
in a given scope of the program have distinct names.
2.2 Store
A memory address is a pair $(x, o)$ of a variable identifier in $\mathcal{V}$
and an offset. It denotes the $o$-th byte from the address at which the content
of variable $x$ is stored. Operation $\boxplus$ shifts an address by a given
offset:
$$(x, o) \boxplus i = (x, o + i)$$
Programs manipulate three kinds of basic values: integers in $\mathbb{Z}$,
characters in $\mathbb{C}$ and pointers in $\mathbb{P}$. For the sake of
simplicity, integers are unbounded. The nature of characters is left
unspecified. It is sufficient to say that there is one null character denoted
by 0. A pointer is a triple $\langle a, i, n \rangle$ that references the
$i$-th byte of a buffer that starts from address $a$ and is $n$ bytes long.
The store maps each allocated address to a basic value. The kind of values
stored at a given address never changes. Hence, a store
$\sigma = (\sigma_Z, \sigma_C, \sigma_P)$ defined on allocated addresses
$\mathcal{A} = \mathcal{A}_Z \uplus \mathcal{A}_C \uplus \mathcal{A}_P$ belongs
to the set:
$$\Sigma = (\mathcal{A}_Z \to \mathbb{Z}) \times (\mathcal{A}_C \to \mathbb{C}) \times (\mathcal{A}_P \to \mathbb{P})$$
In this model, any operation that alters the interpretation of data too severely
leads to an error at runtime. For instance, a cast from {int a; int b}* to int*
is valid, whereas a cast from int* to char* is illegitimate.
The layout of memory is given by two functions: $sizeof(\tau)$ returns the size
of a data of type $\tau$ and $offset(f)$ the offset of a field $f$ from the
beginning of its enclosing structure.
2.3 Semantics
We assign a denotational semantics [29] to the kernel language. In the following,
we use notations $lv : \tau$ and $e : \tau$ to retrieve the type of a left value
$lv$ or expression $e$ as computed by a standard C typechecker. A left value
$lv$ evaluates to a set of addresses $\mathcal{L}[\![lv]\!]\sigma$, as
formalized in figure 2. Sets allow us to encode both non-determinism and halting
behaviours. A pointer of type $\tau*$ can be safely dereferenced, as long as
there remains enough space to store an element of type $\tau$. Likewise, an
expression $e$ of integer type evaluates to a set of integers
$\mathcal{R}_Z[\![e]\!]\sigma$. Notice how an access to some address not
allocated in the store of integers halts program execution. We skip the
classical definition of relation $v_1 \odot v_2 \Rightarrow v$, which makes
explicit the meaning of each binary operation. Definitions for expressions of
character or pointer type are completely identical. The last four equations in
figure 2 define pointer creation and cast.
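$$
\begin{array}{rcl}
\mathcal{L}[\![x]\!]\sigma &=& \{(x, 0)\} \\
\mathcal{L}[\![lv.f]\!]\sigma &=& \{a \boxplus offset(f) \mid a \in \mathcal{L}[\![lv]\!]\sigma\} \\
\mathcal{L}[\![lv[e]]\!]\sigma &=& \{a \boxplus i \times sizeof(\tau) \mid a \in \mathcal{L}[\![lv]\!]\sigma \wedge i \in \mathcal{R}_Z[\![e]\!]\sigma \wedge 0 \leq i < n\} \quad \text{where } lv : \tau[n] \\
\mathcal{L}[\![*e]\!]\sigma &=& \{a \boxplus i \mid \langle a, i, n \rangle \in \mathcal{R}_P[\![e]\!]\sigma \wedge 0 \leq i \leq n - sizeof(\tau)\} \quad \text{where } e : \tau* \\
\mathcal{R}_Z[\![c]\!]\sigma &=& \{c\} \\
\mathcal{R}_Z[\![lv]\!]\sigma &=& \{\sigma_Z(a) \mid a \in \mathcal{L}[\![lv]\!]\sigma \cap dom(\sigma_Z)\} \\
\mathcal{R}_Z[\![e_1 \odot e_2]\!]\sigma &=& \{v \mid v_1 \in \mathcal{R}_Z[\![e_1]\!]\sigma \wedge v_2 \in \mathcal{R}_Z[\![e_2]\!]\sigma \wedge v_1 \odot v_2 \Rightarrow v\} \\
\mathcal{R}_P[\![\&x]\!]\sigma &=& \{\langle (x, 0), 0, sizeof(\tau) \rangle\} \quad \text{where } x : \tau \\
\mathcal{R}_P[\![\&lv.f]\!]\sigma &=& \{\langle a, 0, sizeof(\tau) \rangle \mid a \in \mathcal{L}[\![lv.f]\!]\sigma\} \quad \text{where } lv.f : \tau \\
\mathcal{R}_P[\![\&lv[e]]\!]\sigma &=& \{\langle a, i \times sizeof(\tau), sizeof(\tau[n]) \rangle \mid a \in \mathcal{L}[\![lv]\!]\sigma \wedge i \in \mathcal{R}_Z[\![e]\!]\sigma\} \quad \text{where } lv : \tau[n] \\
\mathcal{R}_P[\![(\tau*)e]\!]\sigma &=& \mathcal{R}_P[\![e]\!]\sigma
\end{array}
$$
Fig. 2. Semantics of left values and expressions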
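As an illustration of the array-access and pointer-dereference rules of figure 2, the following C sketch computes the address reached by one concrete access, or reports that execution halts. It is not the tool's implementation: the names `address`, `pointer`, `access_result`, `shift`, `array_access` and `deref` are hypothetical, variables are assumed to be identified by small integers, and the unbounded integers of the model are approximated by `long`.

```c
#include <stdbool.h>

/* Address (x, o): variable identifier and byte offset. */
typedef struct { int var; long off; } address;

/* Pointer <a, i, n>: i-th byte of a buffer of n bytes starting at a. */
typedef struct { address base; long i; long n; } pointer;

/* Result of an access: ok is false when the semantics yields no address,
   which models a halting execution. */
typedef struct { bool ok; address addr; } access_result;

/* Shift an address by a byte offset: (x, o) becomes (x, o + i). */
address shift(address a, long i) {
    a.off += i;
    return a;
}

/* lv[e] where lv has an array type of n elements: needs 0 <= i < n. */
access_result array_access(address a, long i, long n, long elem_size) {
    access_result r = { false, a };
    if (0 <= i && i < n) {
        r.ok = true;
        r.addr = shift(a, i * elem_size);
    }
    return r;
}

/* *e: an element of elem_size bytes must still fit in the buffer,
   that is 0 <= i <= n - elem_size. */
access_result deref(pointer p, long elem_size) {
    access_result r = { false, p.base };
    if (0 <= p.i && p.i <= p.n - elem_size) {
        r.ok = true;
        r.addr = shift(p.base, p.i);
    }
    return r;
}
```

Returning a flag instead of a set keeps the sketch short; the semantics itself works on sets of addresses, where the halting case is simply the empty set.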
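Three atomic commands operate on the store:
$$
\begin{array}{rcl}
\mathcal{C}[\![\tau\ x;\ cmd]\!]\sigma &=& \{\sigma'|_{dom(\sigma)} \mid \sigma_x = \mathcal{I}[\![\tau]\!](x, 0) \wedge \sigma' \in \mathcal{C}[\![cmd]\!](\sigma \oplus \sigma_x)\} \\
 & & \text{where } dom(\sigma) \cap dom(\sigma_x) = \emptyset \\
\mathcal{C}[\![lv = e]\!]\sigma &=& \{\sigma[a \mapsto v] \mid a \in \mathcal{L}[\![lv]\!]\sigma \cap dom(\sigma_Z) \wedge v \in \mathcal{R}_Z[\![e]\!]\sigma\} \\
 & & \text{where } lv, e : \text{int} \\
\mathcal{C}[\![?(lv \bowtie 0)]\!]\sigma &=& \{\sigma \mid \exists v \in \mathcal{R}_C[\![lv]\!]\sigma : v \bowtie 0\}
\end{array}
$$
At variable declaration, a new store fragment $\sigma_x$ is initialized and
concatenated to the existing store; execution then continues until variable $x$
is eventually deleted from the resulting store. The new store fragment is built
by induction on the type of the declared variable:
$$
\begin{array}{rcl}
\mathcal{I}[\![\beta]\!]a &=& [a \mapsto v] \\
\mathcal{I}[\![\tau[n]]\!]a &=& \bigoplus_{0 \leq i < n} \mathcal{I}[\![\tau]\!](a \boxplus i \times sizeof(\tau)) \\
\mathcal{I}[\![\{\tau_1\ f_1 \ldots \tau_n\ f_n\}]\!]a &=& \bigoplus_{0 < i \leq n} \mathcal{I}[\![\tau_i]\!](a \boxplus offset(f_i))
\end{array}
$$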
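where $v$ is any value of type $\beta$ and $\oplus$ joins two disjoint stores.
Assignments come in three flavours, one for each basic type: integer, character
and pointer. Here, we only describe the integer assignment since the other two
are completely similar. Assignment to a non-allocated address brings the program
to a halt. Guards let execution continue when the store satisfies the boolean
condition. We consider only equality or disequality with the null character even
though other kinds of guards may easily be handled.

$$
\begin{array}{ll}
\mathbb{C}^\sharp = \{\bot_C, 0, 1, \top_C\} & \gamma_C(\bot_C) = \emptyset \quad \gamma_C(\top_C) = \mathbb{C} \quad \gamma_C(0) = \{\backslash 0\} \quad \gamma_C(1) = \mathbb{C} \setminus \{\backslash 0\} \\
\mathbb{Z}^\sharp = (\mathbb{Z} \times \mathbb{Z}) \cup \{\bot_Z\} & \gamma_Z(\bot_Z) = \emptyset \quad \gamma_Z([l; u]) = \{i \mid l \leq i \leq u\} \\
\mathbb{A}^\sharp = (\mathcal{V} \times \mathbb{Z}^\sharp) \cup \{\bot_A, \top_A\} & \gamma_A(\bot_A) = \emptyset \quad \gamma_A(\top_A) = \mathbb{A} \quad \gamma_A(x, O) = \{(x, o) \mid o \in \gamma_Z(O)\} \\
\mathbb{P}^\sharp = (\mathbb{A}^\sharp \times \mathbb{Z}^\sharp \times \mathbb{Z}^\sharp) \cup \{\bot_P\} & \gamma_P(\bot_P) = \emptyset \quad \gamma_P(A, I, N) = \{\langle a, i, n \rangle \mid a \in \gamma_A(A) \wedge i \in \gamma_Z(I) \wedge n \in \gamma_Z(N)\}
\end{array}
$$
Fig. 3. Abstract addresses and values

The remaining commands control the flow of execution: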
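$$
\begin{array}{rcl}
\mathcal{C}[\![cmd_1 + cmd_2]\!]\sigma &=& \mathcal{C}[\![cmd_1]\!]\sigma \cup \mathcal{C}[\![cmd_2]\!]\sigma \\
\mathcal{C}[\![cmd_1; cmd_2]\!]\sigma &=& \mathcal{C}[\![cmd_2]\!](\mathcal{C}[\![cmd_1]\!]\sigma) \\
\mathcal{C}[\![cmd^{*}]\!]\sigma_0 &=& \mathrm{lfp}\ F_{\sigma_0} \quad \text{where } F_{\sigma_0}(X) = \{\sigma_0\} \cup \{\sigma' \mid \sigma \in X \wedge \sigma' \in \mathcal{C}[\![cmd]\!]\sigma\} \\
\mathcal{C}[\![f()]\!]\sigma &=& \mathcal{C}[\![cmd]\!]\sigma \quad \text{where } cmd \text{ is the body of function } f
\end{array}
$$
A program $P$ consists of a set of functions and a main command which is
executed in an initially empty store: $[\![P]\!] = \mathcal{C}[\![cmd]\!]\emptyset$.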
3 Static analysis
We wish to automatically verify that all string manipulations in a program
are innocuous. This is, by nature, an undecidable problem. So, we design a
static analysis that computes an approximate but sound representation of all the
stores that result from the execution of a program. Following the methodology
of abstract interpretation [5], an abstraction of sets of stores is first devised. The
analysis algorithm is then systematically derived thanks to this abstraction from
the concrete semantics. The results of the analysis are used to check as many
potentially dangerous memory operations as possible and to emit warnings in
other cases.
3.1 Abstract values, integer and pointer stores
Figure 3 lists the abstract domains and concretization functions used for sets of
addresses and values. These abstractions are all built from well-known standard
domains: integers are represented by ranges [5]; characters by the domain of
equality/disequality with the null character; a pair of a variable identifier
and a range of possible offsets stands for a set of addresses; and abstract
pointers are triples made of an abstract address, followed by two ranges for
possible offsets and sizes. We use the standard set notations for all operations
on ranges: ($\subseteq$, $\cap$, min, max). Moreover, $I_1 \sqcup I_2$ denotes
the smallest range that contains both $I_1$ and $I_2$; $I \setminus n$ the
smallest range that contains all elements in $I$ except $n$; $I + n$
($I - n$) is the range obtained after the addition (subtraction) of $n$ to all
the elements in $I$.
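The range operations just listed can be sketched in C as follows. This is a minimal illustration rather than the tool's code: the `range` type, its encoding of the empty range as `lo > hi`, and the function names are assumptions of this sketch, and bounded `int` values stand in for the unbounded integers of the model.

```c
#include <stdbool.h>

/* A range [lo; hi]; any value with lo > hi encodes the empty range. */
typedef struct { int lo; int hi; } range;

bool range_is_empty(range r) { return r.lo > r.hi; }

/* Smallest range that contains both r1 and r2. */
range range_join(range r1, range r2) {
    if (range_is_empty(r1)) return r2;
    if (range_is_empty(r2)) return r1;
    range r = { r1.lo < r2.lo ? r1.lo : r2.lo,
                r1.hi > r2.hi ? r1.hi : r2.hi };
    return r;
}

/* Intersection of r1 and r2 (possibly empty). */
range range_meet(range r1, range r2) {
    range r = { r1.lo > r2.lo ? r1.lo : r2.lo,
                r1.hi < r2.hi ? r1.hi : r2.hi };
    return r;
}

/* Smallest range that contains all elements of r except n.
   Only a bound can be trimmed: removing an inner point changes nothing. */
range range_remove(range r, int n) {
    if (r.lo == n) r.lo++;
    else if (r.hi == n) r.hi--;
    return r;
}

/* Add n to every element of r; a negative n subtracts. */
range range_shift(range r, int n) {
    if (!range_is_empty(r)) { r.lo += n; r.hi += n; }
    return r;
}
```

Note that `range_remove` is exact only at the bounds, which is all a range can express: removing 3 from [2; 5] still yields [2; 5].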
The abstract domain $(D^\sharp, \sqsubseteq)$ of the analysis is built as the
product of three domains: one for each type of basic value. An abstract store
$S$ is thus a triple $(S_Z, S_P, S_C)$. Abstract integer $S_Z$ and pointer $S_P$
stores map each allocated address to an abstract value of corresponding type. A
fully fledged description of these standard non-relational domains is skipped.
On the other hand, the abstract character store $S_C$, being the object of our
study, is discussed at length in the next section.
3.2 Abstract character store
A string in C is a sequence of characters stored in memory. The first null
character (0) signals the end of the string. If no null character is found
before the end of the allocated area, then the string is not well-formed. Hence
the length of a string stored on a buffer $(a : n)$ of $n$ consecutive bytes
starting at address $a$ in a store $\sigma$ is:
$$strlen_\sigma(a : n) = \min(\{n\} \cup \{l \mid 0 \leq l < n \wedge \sigma(a \boxplus l) = 0\})$$
Now, in order to prove the correctness of string manipulations it is necessary to
at least retain some information about the length of the various strings in the
store.
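The concrete length of a string stored on a buffer, as defined above, can be sketched in C as follows. The function name `buffer_strlen` is hypothetical; unlike the standard `strlen`, it never reads past the `n` bytes of the buffer and returns `n` when the string is not well-formed.

```c
#include <stddef.h>

/* Length of the string stored in the n bytes starting at buf:
   the index of the first null character, or n when the buffer
   holds no null character (the string is then not well-formed). */
size_t buffer_strlen(const char *buf, size_t n) {
    for (size_t l = 0; l < n; l++) {
        if (buf[l] == '\0') {
            return l;
        }
    }
    return n;
}
```

A result equal to `n` is therefore the marker of an ill-formed string, which is the case the analysis has to rule out before a copy.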
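Let $\pi$ be a partition of the set of all allocated addresses, such that each
element in the partition is a connected set (a buffer). The abstract store maps
each buffer in the partition to a range that approximates the possible lengths of
the string stored on that buffer:
$$
\begin{array}{rcl}
D^\sharp_C &=& (\pi \to \mathbb{Z}^\sharp) \cup \{\bot\} \\
\gamma(S) &=& \{\sigma \mid \forall b \in \pi : strlen_\sigma(b) \in \gamma_Z(S(b))\} \\
\gamma(\bot) &=& \emptyset
\end{array}
$$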
Several primitives operate on the domain of character stores. Each primitive
obeys a soundness condition. Normalization returns the empty store as soon as
any buffer is associated with an empty range:
$$
norm(S) = \begin{cases}
\bot & \text{if } \exists b : S(b) = \bot_Z \\
S & \text{otherwise}
\end{cases}
$$
Normalization preserves the meaning of the abstract store, thus:
$\gamma(norm(S)) = \gamma(S)$. From now on, we assume that the store is always
in normal form so that no abstract length can ever be the empty range. A new
abstract store with no information at all may be created using primitive
universe from a partition $\pi$. It is such that for any buffer $(a : n)$ in
$\pi$:
$$universe(\pi)(a : n) = [0; n]$$
It is straightforward to show that:
$(\mathcal{A} \to \mathbb{C}) \subseteq \gamma(universe(\pi))$. Abstract stores
$S_1$ and $S_2$ defined on the same partition can be compared:
$$
\begin{array}{c}
S_1 \sqsubseteq S_2 \iff (S_1 = \bot \vee (S_2 \neq \bot \wedge \forall b \in \pi : S_1(b) \subseteq S_2(b))) \\
S_1 \sqsubseteq S_2 \implies \gamma(S_1) \subseteq \gamma(S_2)
\end{array}
$$
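Abstract join and meet operations are performed pointwise. To deal with
variable declarations, we need to concatenate stores of disjoint domains and
remove all the buffers allocated for a given variable:
$$
\begin{array}{rcl}
S \oplus \bot = \bot \oplus S &=& \bot \\
S_1 \oplus S_2 &=& S_1 \cup S_2 \\
S \setminus x &=& S|_{\{(a : n) \mid a = (y, o) \wedge y \neq x\}}
\end{array}
$$
These operations verify the following set inequalities:
$$
\begin{array}{rcl}
\gamma(S_1) \cup \gamma(S_2) &\subseteq& \gamma(S_1 \sqcup S_2) \\
\gamma(S_1) \cap \gamma(S_2) &\subseteq& \gamma(S_1 \sqcap S_2) \\
\{\sigma_1 \oplus \sigma_2 \mid \sigma_1 \in \gamma(S_1) \wedge \sigma_2 \in \gamma(S_2)\} &\subseteq& \gamma(S_1 \oplus S_2) \\
\{\sigma|_{\{(y, o) \in \mathcal{A} \mid y \neq x\}} \mid \sigma \in \gamma(S)\} &\subseteq& \gamma(S \setminus x)
\end{array}
$$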
Boolean conditions present in if statements, switches and loops must be taken
into account in order to produce sufficiently precise results. Primitive guard
constrains the store according to an equality or disequality comparison with
character 0:
$$\{\sigma \mid \sigma \in \gamma(S) \wedge \exists a \in \gamma_A(A) \cap \mathcal{A} : \sigma(a) \bowtie 0\} \subseteq \gamma(guard(A \bowtie 0, S))$$
Suppose the constraint implies that there is at least one 0 character in a
memory region that spans from address $(x, o_1)$ to address $(x, o_2)$. Suppose
further that this region is contained in a unique buffer $(a : n)$ of the
partition. Then, the length of a string starting in $a$ is necessarily at most
the distance $\delta$ from $a$ to $(x, o_2)$. Hence:
$$guard((x, [o_1; o_2]) = 0, S) = norm(S[a : n \mapsto S(a : n) \cap [0; \delta]])$$
Similarly, suppose now that the value stored at address $(x, o)$ is not the 0
character. If address $(x, o)$ belongs to some buffer $b = (a : n)$ of the
partition and $\delta$ is the distance from $a$ to $(x, o)$, then the length
cannot be exactly $\delta$:
$$guard((x, [o; o]) \neq 0, S) = norm(S[b \mapsto S(b) \setminus \delta])$$
In all other cases, guard simply leaves the store unchanged:
$$guard(A \bowtie 0, S) = S$$
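Restricted to the range attached to a single buffer, the two informative cases of guard can be sketched in C as below. The `range` type, its empty-range encoding and the names `guard_null` and `guard_nonnull` are assumptions of this sketch; `delta` is the distance described above. An empty result means that normalization turns the whole store into the empty store.

```c
#include <stdbool.h>

/* A range [lo; hi]; lo > hi encodes the empty range. */
typedef struct { int lo; int hi; } range;

bool range_is_empty(range r) { return r.lo > r.hi; }

/* Some byte at distance at most delta from the start of the buffer is
   null: the string length is intersected with [0; delta]. */
range guard_null(range len, int delta) {
    if (len.lo < 0) len.lo = 0;
    if (len.hi > delta) len.hi = delta;
    return len;
}

/* The byte at distance delta is not null: the length cannot be exactly
   delta, which is only expressible when delta is a bound of the range. */
range guard_nonnull(range len, int delta) {
    if (len.lo == delta) len.lo++;
    else if (len.hi == delta) len.hi--;
    return len;
}
```

For instance, testing `buf[0] != 0` on a buffer whose length lies in [0; 10] refines the range to [1; 10], while the same test on a buffer known to hold the empty string yields the empty range, meaning the branch is unreachable.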
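Transport structure and store accesses. Operations to read and write in the
store are essential to the analysis. However, they are not easily defined,
mainly because the region in memory that is impacted by the operation does not
necessarily coincide with a particular buffer in the partition. In order to
alleviate this difficulty, we first devise transformations on the abstract store
that allow us to change the underlying partition. Transformation cut
$C_\delta$ splits a buffer $b$ of size $n$ at distance $\delta$ from its start
into a first buffer $b_1$ of size $\delta$ and a second buffer $b_2$ of size
$n - \delta$; transformation glue $G_\delta$ merges two adjacent buffers $b_1$,
of size $\delta$, and $b_2$ back into a single buffer $b$:
$$
C_\delta([b \mapsto L]) = \begin{cases}
[b_1 \mapsto [\delta; \delta];\ b_2 \mapsto L - \delta] & \text{if } \min(L) \geq \delta \\
[b_1 \mapsto L \cap [0; \delta];\ b_2 \mapsto [0; n - \delta]] & \text{otherwise}
\end{cases}
$$
$$
G_\delta([b_1 \mapsto K;\ b_2 \mapsto L]) = \begin{cases}
[b \mapsto (K \setminus \delta) \sqcup (L + \delta)] & \text{if } \delta \in K \\
[b \mapsto K] & \text{otherwise}
\end{cases}
$$
Both operations are sound in that their result includes at least all the concrete
stores originally present:
$$
\begin{array}{rcl}
\gamma(S \oplus [b \mapsto L]) &\subseteq& \gamma(S \oplus C_\delta([b \mapsto L])) \\
\gamma(S \oplus [b_1 \mapsto K;\ b_2 \mapsto L]) &\subseteq& \gamma(S \oplus G_\delta([b_1 \mapsto K;\ b_2 \mapsto L]))
\end{array}
$$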
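The following C sketch shows one way cut and glue can act on the ranges attached to buffers. It is an illustration under assumptions, not the tool's code: the `range` and `range_pair` types and all function names are hypothetical, the sizes of the buffers are passed explicitly, and the glue case removes the value `delta` from the first range before joining, a choice that matches the invariants reported in Example 1.

```c
#include <assert.h>

/* A range [lo; hi]; lo > hi encodes the empty range. */
typedef struct { int lo; int hi; } range;
typedef struct { range first; range second; } range_pair;

static int range_is_empty(range r) { return r.lo > r.hi; }

static range range_join(range a, range b) {
    if (range_is_empty(a)) return b;
    if (range_is_empty(b)) return a;
    range r = { a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi };
    return r;
}

/* cut: split a buffer of n bytes whose string length lies in len
   (assumed non-empty) at distance delta from its start. */
range_pair cut(range len, int n, int delta) {
    assert(0 < delta && delta < n);
    range_pair p;
    if (len.lo >= delta) {
        /* No null character in the first part: it is full. */
        p.first = (range){ delta, delta };
        p.second = (range){ len.lo - delta, len.hi - delta };
    } else {
        /* The first null may be in the first part: nothing is known
           about the second part. */
        p.first = (range){ len.lo, len.hi < delta ? len.hi : delta };
        p.second = (range){ 0, n - delta };
    }
    return p;
}

/* glue: merge a first buffer of delta bytes (length in k) with the
   adjacent buffer that follows it (length in l). */
range glue(range k, range l, int delta) {
    if (k.lo <= delta && delta <= k.hi) {
        range k_short = k;                 /* lengths below delta */
        if (k_short.lo == delta) k_short.lo++;
        else k_short.hi--;
        range l_shift = l;                 /* string continues in 2nd part */
        if (!range_is_empty(l_shift)) { l_shift.lo += delta; l_shift.hi += delta; }
        return range_join(k_short, l_shift);
    }
    return k;                              /* first null is in 1st part */
}
```

With these definitions, cutting a 10-byte buffer of length [1; 10] at distance 4 gives [1; 4] and [0; 6], and gluing a 1-byte buffer of length [1; 1] to a following buffer of length [1; 9] gives [2; 10].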
Building on glue and cut, there is a simple algorithm to move from any partition
$\pi_1$ to another $\pi_2$ (of course, $\pi_1$ and $\pi_2$ must be defined on
the same set of allocated addresses). Starting from $\pi_1$, the first step
consists in splitting buffers until we get to the coarsest partition which is
finer than both $\pi_1$ and $\pi_2$. Then, in a second step, buffers are glued
together to get back to $\pi_2$. Let us introduce two very useful shortcut
notations built on top of this algorithm. In the following, all addresses in
buffer $b = (a : n)$ are allocated (i.e. $b \subseteq \mathcal{A}$):
- $S(b)$ minimally modifies the store so as to include buffer $b$ in the
  resulting partition and then returns the value associated with this buffer.
  More accurately, let $(a_1 : n_1) \ldots (a_k : n_k)$ be the initial
  partition, where all buffers that overlap $b$ are listed in increasing order
  as $(a_1 : n_1)$ to $(a_k : n_k)$. Then the destination partition is
  $(a_1 : \delta)$, $(a : n)$, $(a \boxplus n : \delta')$, where $\delta$ and
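$$
read(S, A) = \begin{cases}
\bot_C & \text{if } S = \bot \vee A = \bot_A \\
eval_{|b|}(S(b)) & \text{if } A = (x, O) \wedge b = tobuff(A) \wedge b \subseteq \mathcal{A} \\
\top_C & \text{otherwise}
\end{cases}
$$
$$
write(S, A, V) = \begin{cases}
\bot & \text{if } S = \bot \vee A = \bot_A \vee V = \bot_C \\
S[b \mapsto update_{|b|}(S(b), V)] & \text{if } A = (x, O) \wedge b = tobuff(A) \wedge b \subseteq \mathcal{A} \\
universe(\pi) & \text{otherwise}
\end{cases}
$$
$$tobuff(x, [o_1; o_2]) = ((x, o_1) : (o_2 - o_1 + 1))$$
$$
eval_n(L) = \begin{cases}
0 & \text{if } n = 1 \wedge L = [0; 0] \\
1 & \text{if } L = [n; n] \\
\top_C & \text{otherwise}
\end{cases}
$$
$$
\begin{array}{rcl}
update_n([l; u], 0) &=& [0; \min(u, n - 1)] \\
update_n([l; u], 1) &=& \begin{cases} [1; 1] & \text{if } n = 1 \\ [l; n] & \text{otherwise} \end{cases} \\
update_n(L, \top_C) &=& [0; n]
\end{array}
$$
Fig. 4. Abstract memory access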
Function eval abstracts the character read from a buffer of size $n$ whose
string length lies in $L$:
- when the buffer contains only one character that is equal to 0, then 0 is
  returned,
- when $L = [n; n]$, the first 0 character is not in the buffer, so the returned
  value is 1,
- in all other cases, there is insufficient information to conclude and
  $\top_C$ is returned.
The intuition that motivates the definition of function update goes as follows:
- After a 0 character is written somewhere in the buffer, we can be sure that
  the length is strictly less than its size $n$. Moreover, previous 0 characters
  remain, so that $update_n([l; u], 0) = [0; \min(u, n - 1)]$.
- If exactly one non-null character is copied in a buffer of size $n = 1$, then
  the first 0 cannot be at index 0, so $update_1(L, 1) = [1; 1]$.
- In the remaining cases when a non-null character is written, it may erase the
  first 0 character in the buffer, so that the length of the string is only
  bounded by the size of the buffer. Since non-null characters are untouched,
  the information about the lower bound on the possible string lengths is kept,
  thus $update_n([l; u], 1) = [l; n]$.
- At last, when an unknown value is copied, all information is lost.
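The functions eval and update of figure 4 translate almost directly into C. The sketch below is illustrative only: the enumeration `abs_char`, the `range` type and the function names are assumptions, and the bottom character is not handled here since write filters it out before calling update.

```c
#include <assert.h>

/* A range [lo; hi] of possible string lengths, assumed non-empty. */
typedef struct { int lo; int hi; } range;

/* Abstract characters: null, non-null, or unknown. */
typedef enum { CH_NULL, CH_NONNULL, CH_TOP } abs_char;

/* eval: abstract character read anywhere in a buffer of n bytes
   whose string length lies in len. */
abs_char eval_char(int n, range len) {
    assert(n >= 1);
    if (n == 1 && len.lo == 0 && len.hi == 0) return CH_NULL;
    if (len.lo == n && len.hi == n) return CH_NONNULL;
    return CH_TOP;
}

/* update: new length range after writing v somewhere in a buffer
   of n bytes whose string length was in len. */
range update(int n, range len, abs_char v) {
    assert(n >= 1);
    switch (v) {
    case CH_NULL:
        /* A null is now present; earlier nulls remain. */
        return (range){ 0, len.hi < n - 1 ? len.hi : n - 1 };
    case CH_NONNULL:
        if (n == 1) return (range){ 1, 1 };
        /* The first null may have been erased; lower bound kept. */
        return (range){ len.lo, n };
    default:
        /* Unknown character: all information is lost. */
        return (range){ 0, n };
    }
}
```

The one-byte case is where update is most precise, which is why the analysis first cuts the partition around the written byte before applying it.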
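The read and write operations are sound with respect to:
$$
\begin{array}{rcl}
\{\sigma(a) \mid \sigma \in \gamma(S) \wedge a \in \gamma_A(A) \cap dom(\sigma)\} &\subseteq& \gamma_C(read(S, A)) \\
\{\sigma[a \mapsto v] \mid \sigma \in \gamma(S) \wedge a \in \gamma_A(A) \cap dom(\sigma) \wedge v \in \gamma_C(V)\} &\subseteq& \gamma(write(S, A, V))
\end{array}
$$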
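$$
\begin{array}{rcl}
\mathcal{R}^\sharp_C[\![c]\!]S &=& \begin{cases} 0 & \text{if } c = \backslash 0 \\ 1 & \text{otherwise} \end{cases} \\
\mathcal{R}^\sharp_C[\![lv]\!]S &=& read(S_C, \mathcal{L}^\sharp[\![lv]\!]S) \\
\mathcal{I}^\sharp[\![\text{char}]\!]a &=& (\emptyset, \emptyset, universe(\{(a : 1)\})) \\
\mathcal{I}^\sharp[\![\text{char}[n]]\!]a &=& (\emptyset, \emptyset, universe(\{(a : n)\})) \\
\mathcal{I}^\sharp[\![\tau[n]]\!]a &=& \bigoplus_{0 \leq i < n} \mathcal{I}^\sharp[\![\tau]\!](a \boxplus i \times sizeof(\tau)) \\
\mathcal{I}^\sharp[\![\{\tau_1\ f_1 \ldots \tau_n\ f_n\}]\!]a &=& \bigoplus_{0 < i \leq n} \mathcal{I}^\sharp[\![\tau_i]\!](a \boxplus offset(f_i)) \\
\mathcal{C}^\sharp[\![\tau\ x;\ cmd]\!]S &=& \mathcal{C}^\sharp[\![cmd]\!](S \oplus \mathcal{I}^\sharp[\![\tau]\!](x, 0)) \setminus x \\
\mathcal{C}^\sharp[\![lv = e]\!](S_Z, S_P, S_C) &=& (S_Z, S_P, write(S_C, \mathcal{L}^\sharp[\![lv]\!]S, \mathcal{R}^\sharp_C[\![e]\!]S)) \quad \text{where } lv, e : \text{char} \\
\mathcal{C}^\sharp[\![?(lv \bowtie \backslash 0)]\!](S_Z, S_P, S_C) &=& (S_Z, S_P, guard(\mathcal{L}^\sharp[\![lv]\!]S \bowtie \backslash 0, S_C)) \\
\mathcal{C}^\sharp[\![cmd_1 + cmd_2]\!]S &=& \mathcal{C}^\sharp[\![cmd_1]\!]S \sqcup \mathcal{C}^\sharp[\![cmd_2]\!]S \\
\mathcal{C}^\sharp[\![cmd_1; cmd_2]\!]S &=& \mathcal{C}^\sharp[\![cmd_2]\!](\mathcal{C}^\sharp[\![cmd_1]\!]S) \\
\mathcal{C}^\sharp[\![cmd^{*}]\!]S &=& \mathrm{lfp}\ F^\sharp_S \quad \text{where } F^\sharp_{S_0}(S) = S_0 \sqcup \mathcal{C}^\sharp[\![cmd]\!]S \\
\mathcal{C}^\sharp[\![f()]\!]S &=& \mathcal{C}^\sharp[\![cmd]\!]S \quad \text{where } cmd \text{ is the body of function } f
\end{array}
$$
Fig. 5. Abstract evaluation, initialization and execution of commands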
3.3 Abstract semantics
Building on the previous primitives, the static analysis computes abstract stores
while mimicking the concrete semantics. Figure 5 presents the definitions that
are specifically related to the handling of characters and strings. The
remaining aspects of the analysis are standard and thus not thoroughly described
here. Let us paraphrase some of the more interesting equations:
- To initialize a zone of memory starting at address $a$ with a single character
  or with an array of $n$ characters, $\mathcal{I}^\sharp[\![\tau]\!]a$ creates
  a store whose partition is reduced to a unique buffer of size 1 or $n$ and
  that contains no information.
- Non-deterministic choice amounts to abstract join and the sequence to function
  composition.
- The abstract store after a loop is the result of an abstract fixpoint
  computation. The constructive version of Tarski's theorem [6] suggests a naive
  algorithm: starting from $\bot$, the successive iterates of $F^\sharp$
Theorem 1.
$$\{\sigma' \mid \sigma \in \gamma(S) \wedge \sigma' \in \mathcal{C}[\![cmd]\!]\sigma\} \subseteq \gamma(\mathcal{C}^\sharp[\![cmd]\!]S)$$
Proof. The proof is done by structural induction on the syntax of commands.
It reduces to the assembly of the various atomic soundness conditions of each
primitive.
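The naive iteration strategy suggested by the constructive version of Tarski's theorem can be sketched in C for a loop whose abstract state is a single range. This is only an illustration under assumptions: the names `loop_fixpoint`, `transfer` and `append_one` are hypothetical, the loop body is modelled by a function on ranges, and the caller supplies a sound fallback value `top` that is returned if the iterates do not stabilize within `max_iter` steps.

```c
#include <stdbool.h>

/* A range [lo; hi]; lo > hi encodes the empty range (bottom). */
typedef struct { int lo; int hi; } range;

/* Abstract effect of one execution of the loop body. */
typedef range (*transfer)(range);

bool range_is_empty(range r) { return r.lo > r.hi; }

bool range_equal(range a, range b) {
    if (range_is_empty(a) || range_is_empty(b))
        return range_is_empty(a) && range_is_empty(b);
    return a.lo == b.lo && a.hi == b.hi;
}

range range_join(range a, range b) {
    if (range_is_empty(a)) return b;
    if (range_is_empty(b)) return a;
    range r = { a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi };
    return r;
}

/* Iterates F(S) = s0 join body(S) from bottom until two successive
   iterates are equal; falls back to top if that never happens. */
range loop_fixpoint(range s0, transfer body, range top, int max_iter) {
    range s = { 1, 0 };                      /* bottom */
    for (int k = 0; k < max_iter; k++) {
        range next = range_join(s0, body(s));
        if (range_equal(next, s)) return s;  /* least fixpoint reached */
        s = next;
    }
    return top;                              /* sound but imprecise */
}

/* Example body: each iteration lengthens the string by one character,
   within a zone where the length cannot exceed 4. */
range append_one(range r) {
    if (range_is_empty(r)) return r;
    r.lo += 1;
    r.hi += 1;
    if (r.hi > 4) r.hi = 4;
    return r;
}
```

Starting from an entry state [0; 0], the iterates with `append_one` are [0; 0], [0; 1], up to [0; 4], where they stabilize. Termination here relies on string lengths being bounded by the buffer size.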
Note that, since our static analysis is built in a modular way, it would be
possible to replace some components to improve either precision or efficiency
and still retain the overall soundness theorem. In particular, any other
non-relational numerical domain can easily be used instead of ranges.
Example 1. Here are the invariants collected by the static analysis with the
character store for a small example. Each invariant holds after the statement
on the same line:
    l0: char buf[10];      (buf, 0) : 10 $\mapsto$ [0; 10]
    l1: buf[0] = 'a';      (buf, 0) : 10 $\mapsto$ [1; 10]
    l2: buf[4] = '\0';     (buf, 0) : 10 $\mapsto$ [1; 4]
    l3: buf[1] = 'b';      (buf, 0) : 10 $\mapsto$ [2; 10]
    l4: buf[2] = '\0';     (buf, 0) : 10 $\mapsto$ [2; 2]
The partition is reduced to one buffer of 10 bytes that starts at (buf, 0). Let
us delve into the details of the computation from label l2 to l3. The tool
reaches label l2 with the knowledge that the length of buf is at least 1:
    (buf, 0) : 10 $\mapsto$ [1; 10]
The partition is split in three around the zone that is being written:
    (buf, 0) : 4 $\mapsto$ [1; 4]
    (buf, 4) : 1 $\mapsto$ [0; 1]
    (buf, 5) : 5 $\mapsto$ [0; 5]
The null character is written in buffer (buf, 4) : 1, using primitive update:
    (buf, 0) : 4 $\mapsto$ [1; 4]
    (buf, 4) : 1 $\mapsto$ [0; 0]
    (buf, 5) : 5 $\mapsto$ [0; 5]
At last, the buffers are glued together to restore the initial partition:
    (buf, 0) : 10 $\mapsto$ [1; 4]
Note that at instruction l3, after character 'b' is written at index 1 of buf,
the upper bound on the length of the string is forgotten. This is indeed
necessary. Consider the concrete store where the first 0 character is exactly at
index 1; since it is overwritten by a non-null character and the tool has no
information about the position of the remaining 0 characters after the first
one, the new length is unknown.
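The four writes of Example 1 can be replayed with a small C sketch that chains the three steps described above: split the partition around the written byte, update the one-byte buffer, then glue the pieces back. Everything here is an assumption made for illustration (the `range` type, the helper names and the function `write_at`, which only handles a precisely known index in a buffer tracked by a single range); the formulas for the individual steps follow figure 4 and the cut and glue transformations.

```c
#include <assert.h>

typedef struct { int lo; int hi; } range;   /* empty when lo > hi */
typedef struct { range first; range second; } range_pair;
typedef enum { CH_NULL, CH_NONNULL } abs_char;

static range range_join(range a, range b) {
    if (a.lo > a.hi) return b;
    if (b.lo > b.hi) return a;
    range r = { a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi };
    return r;
}

/* Split a buffer of n bytes at distance delta from its start. */
static range_pair cut(range len, int n, int delta) {
    range_pair p;
    if (len.lo >= delta) {
        p.first = (range){ delta, delta };
        p.second = (range){ len.lo - delta, len.hi - delta };
    } else {
        p.first = (range){ len.lo, len.hi < delta ? len.hi : delta };
        p.second = (range){ 0, n - delta };
    }
    return p;
}

/* Merge a first buffer of delta bytes with the buffer that follows. */
static range glue(range k, range l, int delta) {
    if (k.lo > delta || delta > k.hi) return k;
    if (k.lo == delta) k.lo++; else k.hi--;       /* k without delta */
    return range_join(k, (range){ l.lo + delta, l.hi + delta });
}

/* Effect of writing v in a one-byte buffer. */
static range update1(range len, abs_char v) {
    (void)len;  /* for one byte, the result depends only on v */
    return v == CH_NULL ? (range){ 0, 0 } : (range){ 1, 1 };
}

/* Abstract effect of buf[k] = v on a buffer of n bytes whose string
   length lies in len. */
range write_at(range len, int n, int k, abs_char v) {
    assert(0 <= k && k < n);
    range pre = { 0, 0 }, rest = len, post = { 0, 0 };
    if (k > 0) {
        range_pair p = cut(len, n, k);
        pre = p.first;
        rest = p.second;
    }
    range cell = rest;
    if (n - k > 1) {
        range_pair p = cut(rest, n - k, 1);
        cell = p.first;
        post = p.second;
    }
    cell = update1(cell, v);
    rest = (n - k > 1) ? glue(cell, post, 1) : cell;
    return (k > 0) ? glue(pre, rest, k) : rest;
}
```

Applied in sequence to [0; 10] with the writes of labels l1 to l4, `write_at` yields [1; 10], [1; 4], [2; 10] and [2; 2], the invariants listed in the example.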
Imagine now that the previous example were ended by a call to strcpy that
copies string buf into a buffer of size strictly larger than 2. Such a call would
be correct and the approximation computed by the tool precise enough to prove
this. The next section covers the analysis of strcpy and the checks that are made
to show the correctness of possibly dangerous memory manipulation operations.
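$$
strlen^\sharp(S, A) = \begin{cases}
\bot_Z & \text{if } S = \bot \vee A = \bot_A \\
S((x, o) : m) \setminus \{m\} & \text{if } A = (x, [o; o]) \wedge m = allocsz_{\mathcal{A}}(x, o) \\
weakstrlen(S, b) & \text{if } A = (x, O) \wedge b = tobuff(A) \\
[0; +\infty] & \text{otherwise}
\end{cases}
$$
$$strcpy^\sharp(S, A, \bot_Z) = \bot$$
$$
strcpy^\sharp(S, A, [l; u]) = \begin{cases}
\bot & \text{if } S = \bot \vee A = \bot_A \\
norm(S[(x, o) : m \mapsto [l; m - 1]]) & \text{if } A = (x, [o; o]) \wedge m = \min(u + 1, allocsz_{\mathcal{A}}(x, o)) \\
norm(weakstrcpy(S, b, [l; u])) & \text{if } A = (x, O) \wedge b = tobuff(A) \wedge b \subseteq \mathcal{A} \\
universe(\pi) & \text{otherwise}
\end{cases}
$$
$$weakstrlen(S, X) = \bigsqcup \{S(a : m) \setminus \{m\} \mid a \in X \wedge m = allocsz_{\mathcal{A}}(a)\}$$
$$weakstrcpy(S, a : n, [l; u]) = S[a : m \mapsto S(a : m) \sqcup [l; m - 1]] \quad \text{where } m = \min(u + n, allocsz_{\mathcal{A}}(a))$$
Fig. 6. Abstract string length and string copy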
4 String copy
4.1 Concrete semantics
The syntax of commands is enriched with strcpy($e_1$, $e_2$). This call copies
the string pointed to by pointer $e_2$ into the buffer starting in $e_1$:
$$
\mathcal{C}[\![strcpy(e_1, e_2)]\!]\sigma = \left\{ \sigma[a_1 \boxplus j \mapsto \sigma(a_2 \boxplus j)]_{0 \leq j \leq l} \;\middle|\;
\begin{array}{l}
a_1 \in \mathcal{L}[\![e_1]\!]\sigma \wedge n_1 = allocsz_{\mathcal{A}}(a_1) \\
a_2 \in \mathcal{L}[\![e_2]\!]\sigma \wedge n_2 = allocsz_{\mathcal{A}}(a_2) \\
l = strlen_\sigma(a_2 : n_2) \wedge l \neq n_2 \wedge l < n_1
\end{array} \right\}
$$
In the previous equation, $allocsz_{\mathcal{A}}(a)$ denotes the number of bytes
that are allocated starting from address $a$. The source buffer denoted by $e_2$
should contain a valid string, i.e. there should be some 0 character before the
end of the allocated source memory zone. In other words, the length
$l = strlen_\sigma(a_2 : n_2)$ of the string should be different from $n_2$.
Additionally, $l$ should be smaller than the size $n_1$ of the destination
buffer. Otherwise, there is not enough space to copy the entire string and,
according to this semantics, the program halts.
4.2 Abstract semantics
In the abstract world, strcpy is performed in two phases:
$$\mathcal{C}^\sharp[\![strcpy(e_1, e_2)]\!]S = (S_Z, S_P, strcpy^\sharp(S_C, \mathcal{L}^\sharp[\![e_1]\!]S, strlen^\sharp(S_C, \mathcal{L}^\sharp[\![e_2]\!]S)))$$
Both phases are defined in figure 6. First, $strlen^\sharp$ retrieves the length
of the source string. When the address $a$ of the source string is exactly
known, it reads the information associated with the buffer that starts from $a$
and goes until the first non-allocated address $a \boxplus m$. The length $m$
represents the case when no null character is found before the end of the
buffer. This case would halt the program and is thus eliminated from the result.
Then, primitive $strcpy^\sharp$ updates the destination buffer with the new
abstract length. When the destination address $a$ is precisely known, the
information is replaced by the new abstract length bounded by the size of the
destination zone. When the possible destination addresses are contained in a
buffer $(a : n)$, weakstrcpy merges the previous length with interval
$[l; u + n - 1]$ bounded by the size of the destination zone. The lower bound
$l$ corresponds to the case when the smallest string is copied to $a$. The upper
bound $u + n - 1$ corresponds to the case when the longest string is copied to
$a \boxplus (n - 1)$. Notice how both primitives make extensive use of the
algorithm to change partitions. They satisfy conditions:
$$
\left\{ l \;\middle|\;
\begin{array}{l}
\sigma \in \gamma(S) \wedge a \in \gamma_A(A) \wedge n = allocsz_{\mathcal{A}}(a) \\
l = strlen_\sigma(a : n) \wedge l \neq n
\end{array} \right\} \subseteq \gamma_Z(strlen^\sharp(S, A))
$$
$$
\left\{ \sigma[a \boxplus j \mapsto c_j]_{0 \leq j \leq l} \;\middle|\;
\begin{array}{l}
\sigma \in \gamma(S) \wedge a \in \gamma_A(A) \wedge l \in \gamma_Z(L) \\
n = allocsz_{\mathcal{A}}(a) \wedge l < n \\
\forall\, 0 \leq j < l : c_j \neq 0 \wedge c_l = 0
\end{array} \right\} \subseteq \gamma(strcpy^\sharp(S, A, L))
$$
This ensures the soundness of the abstract string copy with respect to its concrete
counterpart. Theorem 1 still holds.
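For the case where both the source and the destination addresses are exactly known, the two phases of the abstract string copy can be sketched in C as follows. The names `abs_strlen`, `abs_strcpy` and the `copied` structure are hypothetical, and the sketch only tracks the range attached to the destination buffer, leaving aside the partition changes performed by the analysis.

```c
#include <stdbool.h>

/* A range [lo; hi]; lo > hi encodes the empty range. */
typedef struct { int lo; int hi; } range;

/* Destination buffer written by the copy: its size in bytes and the
   range of possible lengths of the string it now holds. */
typedef struct { int size; range len; } copied;

bool range_is_empty(range r) { return r.lo > r.hi; }

/* First phase: length of the source string, read from the buffer that
   spans the m bytes allocated from the source address. The value m
   (no null character at all) would halt the program and is dropped. */
range abs_strlen(range src, int m) {
    if (src.lo == m) src.lo++;
    else if (src.hi == m) src.hi--;
    return src;
}

/* Second phase: a string whose length lies in len is copied to a
   destination with dst_alloc allocated bytes. The written buffer has
   m = min(u + 1, dst_alloc) bytes and receives the range [l; m - 1].
   An empty result means that every concrete execution halts. */
copied abs_strcpy(range len, int dst_alloc) {
    copied c = { 0, { 1, 0 } };
    if (range_is_empty(len)) return c;
    int m = len.hi + 1 < dst_alloc ? len.hi + 1 : dst_alloc;
    c.size = m;
    c.len = (range){ len.lo, m - 1 };
    return c;
}
```

For example, copying a string of length exactly 9 into a 10-byte destination gives the range [9; 9], whereas a string of length 13 gives the empty range [13; 9]: no execution can complete the copy.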
4.3 Checks
Information gathered by the static analysis is used to check that all potentially
dangerous memory manipulations are safe. A predicate is applied to the abstract
value computed for the arguments of each operation. If the predicate does not
hold, then the tool has insufficient information to conclude the operation is
safe and it emits a warning. We present three such predicates:
- Buffer overflows: when accessing an array of size $n$ at any index in
  $\gamma_Z(I)$, the index should be within bounds:
  $$check_{[]}(I, n) = (0 \leq \min(I)) \wedge (\max(I) < n)$$
- Pointer overflows: when dereferencing a pointer $P$ to a data of type $\tau$,
  the pointer should be within the referenced zone:
check
in $\gamma_A(y, O')$, there should be a null character before the end of the
allocated memory starting in $a$ and there should be at least $l$ bytes of
allocated memory from $a'$:
$$check_{\backslash 0}(S, (x, O), (y, O'), L) = tobuff(x, O) \subseteq \mathcal{A} \wedge a = (x, \max(O)) \wedge m = allocsz_{\mathcal{A}}(a) \wedge m \notin S(a : m) \wedge tobuff(y, O') \ldots$$
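The buffer-overflow predicate given above is simple enough to be written directly in C. The function name `check_array_index` and the `range` type are assumptions of this sketch, which expects a non-empty index range.

```c
#include <stdbool.h>

/* A non-empty range [lo; hi] of possible index values. */
typedef struct { int lo; int hi; } range;

/* True when every index in idx is within the bounds of an array of
   n elements; false means the tool would emit a warning. */
bool check_array_index(range idx, int n) {
    return 0 <= idx.lo && idx.hi < n;
}
```

A `false` result does not prove that an overflow occurs: it only means that the abstract value is not precise enough to rule one out.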
5 Experiments
The static analysis was implemented in OCaml [16]. It uses CIL [18] as front-end.
A simplification phase is applied to the CIL output to get to our kernel
language. The analysis then propagates the abstract store following the
structure of the code. Loops are dealt with by simple fixpoint computation
algorithms. Some loops are unfolded in order to improve precision. Once
computations have stabilized, a final pass checks potentially dangerous
operations and emits warnings. Excluding CIL, the whole source code totals
approximately 4000 lines of code.
The design of this static analysis was constantly led by software most similar
to what is found on actual aeronautical products. It is interesting to note that, in
these case studies, approximately 60% of calls to strcpy have a constant string as
source argument. Another 25% are called with a source buffer that is initialized
with a constant string. Experiments were performed on small benchmarks from
this software base. We sometimes had to manually remove union types, which
are not handled by this analysis. Among others, all 63 calls to strcpy in a
program of 3000 lines of code were successfully checked. Here is a small example
that embodies some of the more difficult cases the tool had to process:
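#include <string.h>

typedef struct {
    char* f;
} s;

char buf[10];

void init(s* x) {
    x[1].f = buf;    /* x aliases &a[1], so this sets a[1][1].f */
}

int main() {
    s a[2][2];
    s* ptr = (s*) &a[1];
    init(ptr);
    ptr = (s*) &a[0];
    strcpy(a[1][1].f, "strcpy ok");      /* 10 bytes with the terminator: fits in buf */
    strcpy(a[1][1].f, "strcpy not ok");  /* 14 bytes: overflows buf, flagged by the tool */
}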
The tool flags the second call to strcpy. Since it knows variable x and &a[1]
are aliased, it deduces that a[1][1].f has size 10 and does not emit any warning
for the first call. This example demonstrates that the integration of several C
features in one tool is necessary to obtain sufficient precision.
6 Related work
The detection of buffer overflows in C programs is an active field of research and
various approaches have been proposed.
Fuzzing is a testing technique that consists in hooking a random generator to
the inputs of a program. If the program crashes then defects may be uncovered.
Smart fuzzing tools take advantage of the network protocols [1, 3] or file formats
[23] expected by the software in order to exercise the code in more depth.
However, testing can usually not be exhaustive. Tools like StackGuard [8],
ProPolice [2], CRED [20] and others [28] are C compiler extensions that
implement runtime protection mechanisms. For instance, StackGuard uses a canary
to detect attacks on the stack. Unfortunately, these techniques incur a
non-negligible overhead, either by slowing down execution or by using up memory.
Lightweight static analyses may remove unnecessary checks and improve
performance [17]. In the end, all these techniques just turn buffer overflows
into denial-of-service attacks.
Static analyses can detect defects before execution of the code. Several tools
[27, 25, 4, 15, 11, 30, 26, 12] sacrifice soundness to scalability or efficiency.
Unsound tools include fast and imprecise lexical analyzers such as ITS4 [25].
BOON [26] and [12] both translate the verification problem into an integer
constraint problem but ignore potential aliasing. Soundness is clearly mandatory
in our context. ASTRÉE [7], Airac [14] and CGS [24] are all sound tools based on
abstract interpretation that aim at detecting all runtime errors in C code.
ASTRÉE focuses on control command software without pointers, Airac on array out
of bounds and CGS on dynamic memory manipulation. These approaches do not
have any special treatment for strings, which is a potential source of
imprecision in our case. CSSV [9] and the analysis of [22] are closest to our
work. Like us, both adopt the abstraction pioneered in [26] of strings by their
possible lengths. Unlike us, they use the expensive numerical domain of
polyhedra. They handle dynamic allocation. Instead of incorporating value and
pointer analysis together, both perform the pointer analysis separately. CSSV
then translates the C program into an integer program. It needs function level
annotations to produce precise results during a whole program analysis. CSSV can
handle union types, albeit in a very imprecise way: each memory location has a
size and any assignment of a value of different size sets the location to
unknown. Interestingly, the abstraction in [22] associates the length of strings
to pointers, rather than to the buffer where the string is stored. It seems
difficult to extend the formalism in order to deal with more language features.
In particular, two pointers are aliased when they have the same base address and
length. This condition is clearly too restrictive and prevents the handling of
multi-dimensional arrays or cast operations.
7 Conclusion
We have designed and implemented a new static analysis to check the correctness
of all memory manipulations in C programs. It integrates several analysis
techniques to handle pointers, structures, multi-dimensional arrays, some kinds
of casts and strings. The analysis of strings is made as simple as possible thanks
to transport operators that let us tune the granularity of the abstraction. First
experimental results are extremely promising, and the abstraction seems adequate
to prove actual case studies correct. Further work will explore the semantics and
abstractions necessary to deal with C union types more precisely.
References
1. Dave Aitel. The advantage of block-based protocol analysis for security testing.
Technical report, Immunity, Inc., 2002.
2. A. Baratloo, N. Singh, and T. Tsai. Transparent run-time defense against stack
smashing attacks. In Proceedings of the USENIX Annual Technical Conference,
2000.
3. Philippe Biondi. Scapy. http://www.secdev.org/projects/scapy/.
4. B. Chess. Improving computer security using extended static checking. In IEEE
Symposium on Security and Privacy, 2002.
5. P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model for static
analysis of programs by construction or approximation of fixpoints. In Conference
Record of the 4th ACM Symposium on Principles of Programming Languages. ACM
Press, 1977.
6. P. Cousot and R. Cousot. Constructive versions of Tarski's fixed point theorems.
Pacific Journal of Mathematics, 81(1), 1979.
7. P. Cousot, R. Cousot, J. Feret, L. Mauborgne, A. Miné, D. Monniaux, and X.
Rival. The ASTR