Static Analysis of String Manipulations in Critical Embedded C Programs


Xavier Allamigeon, Wenceslas Godard, and Charles Hymans
EADS CCR DCR/STI/C
12, rue Pasteur BP 76 92152 Suresnes France
Abstract. This paper describes a new static analysis to show the absence of memory errors, especially string buffer overflows, in C programs. The analysis is specifically designed for the subset of C that is found in critical embedded software. It is based on the theory of abstract interpretation and relies on an abstraction of stores that retains the length of string buffers. A transport structure makes it possible to change the granularity of the abstraction and to concisely define several inherently complex abstract primitives such as destructive update and string copy. The analysis integrates several features of the C language such as multi-dimensional arrays, structures, pointers and function calls. A prototype implementation produces encouraging results in early experiments.
1 Introduction
Programming in C with strings, and more generally with buffers, is risky business. Before any copy, the programmer should make sure that the destination buffer is large enough to accept the source data in its entirety. When this is not the case, random bytes may end up in unexpected memory locations. This scenario is particularly unpleasant when the source data can somehow be forged by an attacker: he may be able to smash [19] the return address on the stack and run his own code instead of the rest of the program. Indeed, buffer overflows account for more than half of the vulnerabilities reported by the CERT [13] and are a popular target for viruses [10].
Needless to say, defects that may hand control of the equipment over to an intruder are unacceptable in the context of embedded software. Since testing is not a proof, we aim at designing a static analysis that shows the absence of memory manipulation errors (buffer, string buffer and pointer overflows) in embedded C software. We expect such a tool to be sound; to yield as few false alarms as possible in practice; to require as little human intervention as possible (manual annotations are unsuitable); and to scale to realistically sized programs. Any software engineer would easily benefit from a tool with all these traits and could rely on its results. Obviously, the analysis should be able to handle all the features of the C language that are used in practice in the embedded world. This requires the smooth integration of several analysis techniques. Simplicity of design is also a crucial point, since the analysis implementation should be bug-free and maintainable. This is not an easy task, especially for a language as complex as C.

To attain these goals we adopt the methodology of abstract interpretation [5]: section 2 presents the subset of C we tackle and its concrete semantics; section 3 describes the abstraction of strings and the sound static analysis algorithm; section 4 shows how string copy operations are handled, and what checks are performed by the tool; sections 5 and 6 address the implementation, experiments and related work.

\[
\begin{array}{rcll}
cmd & ::= & & \text{command} \\
 & | & \tau\ x;\ cmd & \text{variable declaration} \\
 & | & lv = e & \text{assignment} \\
 & | & ?(lv \bowtie \backslash 0) & \text{guard} \\
 & | & cmd_1 + cmd_2 & \text{alternative} \\
 & | & cmd_1;\ cmd_2 & \text{sequence} \\
 & | & cmd^* & \text{arbitrary loop} \\
 & | & f() & \text{function call} \\
\tau & ::= & & \text{type} \\
 & | & \beta & \text{scalar type} \\
 & | & \tau[n] & \text{array type} \\
 & | & \{\tau_1\ f_1 \ldots \tau_n\ f_n\} & \text{structure type} \\
\beta & ::= & & \text{scalar type} \\
 & | & \mathtt{char} & \text{character type} \\
 & | & \mathtt{int} & \text{integer type} \\
 & | & \tau* & \text{pointer type} \\
lv & ::= & & \text{left value} \\
 & | & x & \text{variable} \\
 & | & lv.f & \text{field access} \\
 & | & lv[e] & \text{array access} \\
 & | & {*e} & \text{pointer dereference} \\
e & ::= & & \text{expression} \\
 & | & c & \text{constant} \\
 & | & lv & \text{left value} \\
 & | & e_1 \odot e_2 & \text{binary operation} \\
 & | & \&lv & \text{address of} \\
 & | & (\tau*)e & \text{cast} \\
\odot & ::= & & \text{binary operator} \\
 & | & +, -, \times, / & \text{integer arithmetic} \\
 & | & +, - & \text{pointer arithmetic} \\
\bowtie & ::= & & \text{comparison} \\
 & | & = & \text{equality} \\
 & | & \neq & \text{difference}
\end{array}
\]
Fig. 1. Syntax
2 Embedded C programs
2.1 Syntax
The C programming language is inherently complex, which makes the formal definition of its semantics a difficult task. Fortunately, and for obvious safety reasons, programming critical embedded applications is subject to severe constraints. In practice, only a subset of C is allowed. The main limitation results from the obligation to know at compile time the maximum memory usage of any piece of software. To achieve this, dynamic allocation (function malloc()) and recursive functions are both forbidden. For the sake of clarity, we set aside some additional features such as the numerous C scalar types, union types and goto statements. Dealing with these features brings issues orthogonal to the object of this paper. Some ideas to address these issues may be found in [7, 21]. In the end, we consider the relatively small kernel language with the syntax of figure 1. Complex assignments in C are broken down into simpler assignments between scalar types. All variables declared in a given scope of the program have distinct names.
2.2 Store
A memory address is a pair $(x, o)$ of a variable identifier $x$ in $\mathcal{V}$ and an offset $o$. It denotes the $o$-th byte from the address at which the content of variable $x$ is stored. Operation $\boxplus$ shifts an address by a given offset:
\[ (x, o) \boxplus i = (x, o + i) \]
Programs manipulate three kinds of basic values: integers in $\mathbb{Z}$, characters in $\mathbb{C}$ and pointers in $\mathbb{P}$. For the sake of simplicity, integers are unbounded. The nature of characters is left unspecified. It is sufficient to say that there is one null character, denoted by \0. A pointer is a triple $\langle a, i, n \rangle$ that references the $i$-th byte of a buffer that starts from address $a$ and is $n$ bytes long.

The store maps each allocated address to a basic value. The kind of value stored at a given address never changes. Hence, a store $\sigma = (\sigma_Z, \sigma_C, \sigma_P)$ defined on allocated addresses $\mathcal{A} = \mathcal{A}_Z \cup \mathcal{A}_C \cup \mathcal{A}_P$ belongs to the set:
\[ \Sigma = (\mathcal{A}_Z \to \mathbb{Z}) \times (\mathcal{A}_C \to \mathbb{C}) \times (\mathcal{A}_P \to \mathbb{P}) \]
In this model, any operation that alters the interpretation of data too severely leads to an error at runtime. For instance, a cast from {int a; int b}* to int* is valid, whereas a cast from int* to char* is illegitimate.

The layout of memory is given by two functions: sizeof($\tau$) returns the size of a datum of type $\tau$, and offset($f$) the offset of a field $f$ from the beginning of its enclosing structure.
2.3 Semantics
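We assign a denotational semantics [29] to the kernel language. In the following, we use notations $lv : \tau$ and $e : \tau$ to retrieve the type of a left value $lv$ or expression $e$ as computed by a standard C typechecker. A left value $lv$ evaluates to a set of addresses $\mathcal{L}[\![lv]\!]\sigma$, as formalized in figure 2. Sets make it possible to encode both non-determinism and halting behaviours. A pointer of type $\tau*$ can be safely dereferenced as long as there remains enough space to store an element of type $\tau$. Likewise, an expression $e$ of integer type evaluates to a set of integers $\mathcal{R}_Z[\![e]\!]\sigma$. Notice how an access to some address not allocated in the store of integers halts program execution. We skip the classical definition of relation $v_1 \odot v_2 \Rightarrow v$, which makes explicit the meaning of each binary operation. Definitions for expressions of character or pointer type are completely identical. The last four equations in figure 2 define pointer creation and cast.

\[
\begin{array}{rcl}
\mathcal{L}[\![x]\!]\sigma &=& \{(x, 0)\} \\
\mathcal{L}[\![lv.f]\!]\sigma &=& \{a \boxplus \mathrm{offset}(f) \mid a \in \mathcal{L}[\![lv]\!]\sigma\} \\
\mathcal{L}[\![lv[e]]\!]\sigma &=& \{a \boxplus i \times \mathrm{sizeof}(\tau) \mid a \in \mathcal{L}[\![lv]\!]\sigma \wedge i \in \mathcal{R}_Z[\![e]\!]\sigma \wedge 0 \le i < n\} \quad \text{where } lv : \tau[n] \\
\mathcal{L}[\![{*e}]\!]\sigma &=& \{a \boxplus i \mid \langle a, i, n \rangle \in \mathcal{R}_P[\![e]\!]\sigma \wedge 0 \le i \le n - \mathrm{sizeof}(\tau)\} \quad \text{where } e : \tau* \\
\mathcal{R}_Z[\![c]\!]\sigma &=& \{c\} \\
\mathcal{R}_Z[\![lv]\!]\sigma &=& \{\sigma_Z(a) \mid a \in \mathcal{L}[\![lv]\!]\sigma \cap \mathrm{dom}(\sigma_Z)\} \\
\mathcal{R}_Z[\![e_1 \odot e_2]\!]\sigma &=& \{v \mid v_1 \in \mathcal{R}_Z[\![e_1]\!]\sigma \wedge v_2 \in \mathcal{R}_Z[\![e_2]\!]\sigma \wedge v_1 \odot v_2 \Rightarrow v\} \\
\mathcal{R}_P[\![\&x]\!]\sigma &=& \{\langle (x, 0), 0, \mathrm{sizeof}(\tau) \rangle\} \quad \text{where } x : \tau \\
\mathcal{R}_P[\![\&lv.f]\!]\sigma &=& \{\langle a, 0, \mathrm{sizeof}(\tau) \rangle \mid a \in \mathcal{L}[\![lv.f]\!]\sigma\} \quad \text{where } lv.f : \tau \\
\mathcal{R}_P[\![\&lv[e]]\!]\sigma &=& \{\langle a, i \times \mathrm{sizeof}(\tau), \mathrm{sizeof}(\tau[n]) \rangle \mid a \in \mathcal{L}[\![lv]\!]\sigma \wedge i \in \mathcal{R}_Z[\![e]\!]\sigma\} \quad \text{where } lv : \tau[n] \\
\mathcal{R}_P[\![(\tau*)e]\!]\sigma &=& \mathcal{R}_P[\![e]\!]\sigma
\end{array}
\]
Fig. 2. Semantics of left values and expressions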
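The pointer rules of figure 2 can be illustrated by a minimal C sketch. It assumes a hypothetical encoding in which a variable identifier is a small integer; the names `address`, `pointer`, `shift`, `address_of_element` and `deref_ok` are illustrative and do not come from the paper. Creating a pointer with `&lv[e]` never fails, while dereferencing it requires `0 <= i <= n - sizeof(type)`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Address (x, o): `var` is a hypothetical integer standing for the variable x. */
typedef struct { int var; long off; } address;

/* Pointer <a, i, n>: i-th byte of an n-byte buffer starting at address a. */
typedef struct { address base; long i; long n; } pointer;

/* Shift an address by a given offset: (x, o) shifted by i is (x, o + i). */
address shift(address a, long i) {
    a.off += i;
    return a;
}

/* Pointer built by &lv[e] for an array of `count` elements of `elem_size` bytes. */
pointer address_of_element(address array, long index, size_t elem_size, long count) {
    pointer p = { array, index * (long)elem_size, (long)elem_size * count };
    return p;
}

/* Dereference a pointer to a datum of `size` bytes.
   Returns false (the program halts in the concrete semantics) when the
   datum does not fit in the referenced buffer. */
bool deref_ok(pointer p, size_t size, address *out) {
    if (p.i < 0 || p.i > p.n - (long)size)
        return false;
    *out = shift(p.base, p.i);
    return true;
}
```

With a three-element array of 4-byte integers, the pointer to element 2 can be dereferenced, whereas the pointer one past the end can be created but not dereferenced.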
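Three atomic commands operate on the store:
\[
\begin{array}{rcl}
\mathcal{C}[\![\tau\ x;\ cmd]\!]\sigma &=& \{\sigma'|_{\mathrm{dom}(\sigma)} \mid \sigma_x = \mathcal{I}[\![\tau]\!](x, 0) \wedge \sigma' \in \mathcal{C}[\![cmd]\!](\sigma \oplus \sigma_x)\} \quad \text{where } \mathrm{dom}(\sigma) \cap \mathrm{dom}(\sigma_x) = \emptyset \\
\mathcal{C}[\![lv = e]\!]\sigma &=& \{\sigma[a \mapsto v] \mid a \in \mathcal{L}[\![lv]\!]\sigma \cap \mathrm{dom}(\sigma_Z) \wedge v \in \mathcal{R}_Z[\![e]\!]\sigma\} \quad \text{where } lv, e : \mathtt{int} \\
\mathcal{C}[\![?(lv \bowtie \backslash 0)]\!]\sigma &=& \{\sigma \mid v \in \mathcal{R}_C[\![lv]\!]\sigma \wedge v \bowtie \backslash 0\}
\end{array}
\]
At variable declaration, a new store fragment $\sigma_x$ is initialized and concatenated to the existing store; execution then continues until variable $x$ is eventually deleted from the resulting store. The new store fragment is built by induction on the type of the declared variable:
\[
\begin{array}{rcl}
\mathcal{I}[\![\beta]\!]a &=& [a \mapsto v] \\
\mathcal{I}[\![\tau[n]]\!]a &=& \bigoplus_{0 \le i < n} \mathcal{I}[\![\tau]\!](a \boxplus i \times \mathrm{sizeof}(\tau)) \\
\mathcal{I}[\![\{\tau_1\ f_1 \ldots \tau_n\ f_n\}]\!]a &=& \bigoplus_{0 < i \le n} \mathcal{I}[\![\tau_i]\!](a \boxplus \mathrm{offset}(f_i))
\end{array}
\]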
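where $v$ is any value of type $\beta$ and $\oplus$ joins two disjoint stores. Assignments come in three flavours, one for each basic type: integer, character and pointer. Here, we only describe the integer assignment since the other two are completely similar. Assignment to a non-allocated address brings the program to a halt. Guards let execution continue when the store satisfies the boolean condition. We consider only equality or disequality with the null character, even though other kinds of guards may easily be handled. The remaining commands control the flow of execution:
\[
\begin{array}{rcl}
\mathcal{C}[\![cmd_1 + cmd_2]\!]\sigma &=& \mathcal{C}[\![cmd_1]\!]\sigma \cup \mathcal{C}[\![cmd_2]\!]\sigma \\
\mathcal{C}[\![cmd_1;\ cmd_2]\!]\sigma &=& \mathcal{C}[\![cmd_2]\!](\mathcal{C}[\![cmd_1]\!]\sigma) \\
\mathcal{C}[\![cmd^*]\!]\sigma_0 &=& \mathrm{lfp}\ F_{\sigma_0} \quad \text{where } F_{\sigma_0}(X) = \{\sigma_0\} \cup \bigcup_{\sigma \in X} \mathcal{C}[\![cmd]\!]\sigma \\
\mathcal{C}[\![f()]\!]\sigma &=& \mathcal{C}[\![cmd]\!]\sigma \quad \text{where } cmd \text{ is the body of function } f
\end{array}
\]
A program $P$ consists of a set of functions and a main command $cmd$ which is executed in an initially empty store: $[\![P]\!] = \mathcal{C}[\![cmd]\!]\emptyset$.

\[
\begin{array}{ll}
\mathbb{C}^\sharp = \{\bot_C, 0, 1, \top_C\} & \gamma_C(\bot_C) = \emptyset \qquad \gamma_C(\top_C) = \mathbb{C} \\
 & \gamma_C(0) = \{\backslash 0\} \qquad \gamma_C(1) = \mathbb{C} \setminus \{\backslash 0\} \\
\mathbb{Z}^\sharp = (\mathbb{Z} \times \mathbb{Z}) \cup \{\bot_Z\} & \gamma_Z(\bot_Z) = \emptyset \\
 & \gamma_Z([l; u]) = \{i \mid l \le i \le u\} \\
\mathbb{A}^\sharp = (\mathcal{V} \times \mathbb{Z}^\sharp) \cup \{\bot_A, \top_A\} & \gamma_A(\bot_A) = \emptyset \qquad \gamma_A(\top_A) = \mathbb{A} \\
 & \gamma_A(x, O) = \{(x, o) \mid o \in \gamma_Z(O)\} \\
\mathbb{P}^\sharp = (\mathbb{A}^\sharp \times \mathbb{Z}^\sharp \times \mathbb{Z}^\sharp) \cup \{\bot_P\} & \gamma_P(\bot_P) = \emptyset \\
 & \gamma_P(A, I, N) = \{\langle a, i, n \rangle \mid a \in \gamma_A(A) \wedge i \in \gamma_Z(I) \wedge n \in \gamma_Z(N)\}
\end{array}
\]
Fig. 3. Abstract addresses and values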
3 Static analysis
We wish to automatically verify that all string manipulations in a program are innocuous. This is, by nature, an undecidable problem. So, we design a static analysis that computes an approximate but sound representation of all the stores that result from the execution of a program. Following the methodology of abstract interpretation [5], an abstraction of sets of stores is first devised. The analysis algorithm is then systematically derived from the concrete semantics thanks to this abstraction. The results of the analysis are used to prove as many potentially dangerous memory operations safe as possible and to emit warnings in the other cases.
3.1 Abstract values, integer and pointer stores
Figure 3 lists the abstract domains and concretization functions used for sets of addresses and values. These abstractions are all built from well-known standard domains: integers are represented by ranges [5]; characters by the domain of equality/disequality with the null character; a pair of a variable identifier and a range of possible offsets stands for a set of addresses; and abstract pointers are triples made of an abstract address, followed by two ranges for possible offsets and sizes. We use the standard set notations for operations on ranges (such as $\in$, $\cap$, min and max). Moreover, $I_1 \sqcup I_2$ denotes the smallest range that contains both $I_1$ and $I_2$; $I \setminus n$ the smallest range that contains all elements in $I$ except $n$; $I + n$ ($I - n$) is the range obtained after the addition (subtraction) of $n$ to all the elements in $I$.
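The range operations described above can be sketched in C as follows. This is a minimal illustration, assuming bounded `int` endpoints and an explicit flag for the empty range; the type and function names are hypothetical and not taken from the authors' OCaml implementation.

```c
#include <stdbool.h>

/* A range [lo; hi]; `empty` encodes the bottom element (no integer at all). */
typedef struct { bool empty; int lo, hi; } range;

/* Build [lo; hi]; the result is empty when lo > hi. */
range range_make(int lo, int hi) {
    range r = { lo > hi, lo, hi };
    return r;
}

/* Smallest range that contains both a and b. */
range range_join(range a, range b) {
    if (a.empty) return b;
    if (b.empty) return a;
    return range_make(a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi);
}

/* Intersection of two ranges. */
range range_meet(range a, range b) {
    if (a.empty || b.empty) return range_make(1, 0);
    return range_make(a.lo > b.lo ? a.lo : b.lo, a.hi < b.hi ? a.hi : b.hi);
}

/* Smallest range that contains all elements of r except n.
   Only a bound can be removed; an interior element leaves r unchanged. */
range range_except(range r, int n) {
    if (r.empty) return r;
    if (r.lo == n) return range_make(r.lo + 1, r.hi);
    if (r.hi == n) return range_make(r.lo, r.hi - 1);
    return r;
}

/* Range obtained by adding n to all the elements of r. */
range range_shift(range r, int n) {
    if (r.empty) return r;
    return range_make(r.lo + n, r.hi + n);
}
```

Note that `range_join` over-approximates set union (joining [1; 3] and [7; 9] gives [1; 9]), which is what makes the domain non-relational but cheap.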
The abstract domain $(D^\sharp, \sqsubseteq)$ of the analysis is built as the product of three domains: one for each type of basic value. An abstract store $S$ is thus a triple $(S_Z, S_P, S_C)$. Abstract integer stores $S_Z$ and pointer stores $S_P$ map each allocated address to an abstract value of the corresponding type. A full-fledged description of these standard non-relational domains is skipped. On the other hand, the abstract character store $S_C$, being the object of our study, is discussed at length in the next section.
3.2 Abstract character store
A string in C is a sequence of characters stored in memory. The first null character (\0) signals the end of the string. If no null character is found before the end of the allocated area, then the string is not well-formed. Hence the length of a string stored on a buffer $(a : n)$ of $n$ consecutive bytes starting at address $a$ in a store $\sigma$ is:
\[ \mathrm{strlen}_\sigma(a : n) = \min(\{n\} \cup \{l \mid 0 \le l < n \wedge \sigma(a \boxplus l) = \backslash 0\}) \]
Now, in order to prove the correctness of string manipulations it is necessary to retain at least some information about the length of the various strings in the store.
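The definition of the concrete string length can be read as the following C function, given here as a sketch in which the buffer (a : n) is represented by a plain pointer and a size. The name `buffer_strlen` is illustrative. Unlike the standard `strlen`, it never reads past the n-th byte and returns n when the string is not well-formed.

```c
#include <stddef.h>

/* Length of the string stored in the n-byte buffer starting at buf:
   the index of the first null character, or n if there is none. */
size_t buffer_strlen(const char *buf, size_t n) {
    for (size_t l = 0; l < n; l++) {
        if (buf[l] == '\0')
            return l;
    }
    return n; /* no null character: the string is not well-formed */
}
```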
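Let $\pi$ be a partition of the set of all allocated addresses, such that each element in the partition is a connected set (a buffer). The abstract store maps each buffer in the partition to a range that approximates the possible lengths of the string stored on that buffer:
\[
\begin{array}{rcl}
D^\sharp_C &=& (\pi \to \mathbb{Z}^\sharp) \cup \{\bot\} \\
\gamma(S) &=& \{\sigma \mid \forall b \in \pi : \mathrm{strlen}_\sigma(b) \in \gamma_Z(S(b))\} \\
\gamma(\bot) &=& \emptyset
\end{array}
\]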
Several primitives operate on the domain of character stores. Each primitive obeys a soundness condition. Normalization $\nu$ returns the empty store as soon as any buffer is associated with an empty range:
\[
\nu(S) = \begin{cases} \bot & \text{if } \exists b \in \pi : S(b) = \bot_Z \\ S & \text{otherwise} \end{cases}
\]
Normalization preserves the meaning of the abstract store, thus $\gamma(\nu(S)) = \gamma(S)$. From now on, we assume that the store is always in normal form, so that no abstract length can ever be the empty range. A new abstract store with no information at all may be created from a partition $\pi$ using primitive universe. It is such that for any buffer $(a : n)$ in $\pi$:
\[ \mathrm{universe}(\pi)(a : n) = [0; n] \]
It is straightforward to show that $(\mathcal{A} \to \mathbb{C}) \subseteq \gamma(\mathrm{universe}(\pi))$. Abstract stores $S_1$ and $S_2$ defined on the same partition $\pi$ can be compared:
\[
\begin{array}{c}
S_1 \sqsubseteq S_2 \iff (S_1 = \bot) \vee (S_2 \neq \bot \wedge \forall b \in \pi : S_1(b) \subseteq S_2(b)) \\
S_1 \sqsubseteq S_2 \implies \gamma(S_1) \subseteq \gamma(S_2)
\end{array}
\]
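Abstract join $\sqcup$ and meet $\sqcap$ operations are performed pointwise. To deal with variable declarations, we need to concatenate stores of disjoint domains and to remove all the buffers allocated for a given variable:
\[
\begin{array}{rcl}
\bot \oplus^\sharp S = S \oplus^\sharp \bot &=& \bot \\
S_1 \oplus^\sharp S_2 &=& S_1 \cup S_2 \\
S \setminus^\sharp x &=& S|_{\{(a : n) \mid a = (y, o) \wedge y \neq x\}}
\end{array}
\]
These operations verify the following set inequalities:
\[
\begin{array}{rcl}
\gamma(S_1) \cup \gamma(S_2) &\subseteq& \gamma(S_1 \sqcup S_2) \\
\gamma(S_1) \cap \gamma(S_2) &\subseteq& \gamma(S_1 \sqcap S_2) \\
\{\sigma_1 \oplus \sigma_2 \mid \sigma_1 \in \gamma(S_1) \wedge \sigma_2 \in \gamma(S_2)\} &\subseteq& \gamma(S_1 \oplus^\sharp S_2) \\
\{\sigma|_{\{(y, o) \in \mathcal{A} \mid y \neq x\}} \mid \sigma \in \gamma(S)\} &\subseteq& \gamma(S \setminus^\sharp x)
\end{array}
\]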
Boolean conditions present in if statements, switches and loops must be taken into account in order to produce sufficiently precise results. Primitive guard constrains the store according to an equality or disequality comparison with character \0:
\[ \{\sigma \mid \sigma \in \gamma(S) \wedge a \in \gamma_A(A) \cap \mathcal{A} \wedge \sigma(a) \bowtie \backslash 0\} \subseteq \gamma(\mathrm{guard}(A \bowtie \backslash 0, S)) \]
Suppose the constraint implies that there is at least one \0 character in a memory region that spans from address $(x, o_1)$ to address $(x, o_2)$. Suppose further that this region is contained in a unique buffer $(a : n)$ of the partition. Then, the length of a string starting in $a$ is necessarily at most the distance $\delta$ from $a$ to $(x, o_2)$. Hence:
\[ \mathrm{guard}((x, [o_1; o_2]) = \backslash 0, S) = \nu(S[(a : n) \mapsto S(a : n) \cap [0; \delta]]) \]
Similarly, suppose now that the value stored at address $(x, o)$ is not the \0 character. If address $(x, o)$ belongs to some buffer $b = (a : n)$ of the partition and $\delta$ is the distance from $a$ to $(x, o)$, then:
\[ \mathrm{guard}((x, [o; o]) \neq \backslash 0, S) = \nu(S[b \mapsto S(b) \setminus \delta]) \]
In all other cases, guard simply leaves the store unchanged:
\[ \mathrm{guard}(A \bowtie \backslash 0, S) = S \]
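Transport structure and store accesses. Operations to read and write in the store are essential to the analysis. However, they are not easily defined, mainly because the region in memory that is impacted by the operation does not necessarily coincide with a particular buffer in the partition. To alleviate this difficulty, we first devise transformations on the abstract store that change the underlying partition. Transformation cut $C_\delta$ splits a buffer $b$ of size $n$ into two consecutive buffers $b_1$ and $b_2$ of respective sizes $\delta$ and $n - \delta$; the reverse transformation glue $G_\delta$ lumps together two buffers that are contiguous:
\[
C_\delta([b \mapsto L]) = \begin{cases} [b_1 \mapsto [\delta; \delta];\ b_2 \mapsto L - \delta] & \text{if } \min(L) \ge \delta \\ [b_1 \mapsto L \cap [0; \delta];\ b_2 \mapsto [0; n - \delta]] & \text{otherwise} \end{cases}
\]
\[
G_\delta([b_1 \mapsto K;\ b_2 \mapsto L]) = \begin{cases} [b \mapsto (K \setminus \delta) \sqcup (L + \delta)] & \text{if } \delta \in K \\ [b \mapsto K] & \text{otherwise} \end{cases}
\]
Both operations are sound in that their result includes at least all the concrete stores originally present:
\[
\begin{array}{rcl}
\gamma(S \oplus^\sharp [b \mapsto L]) &\subseteq& \gamma(S \oplus^\sharp C_\delta([b \mapsto L])) \\
\gamma(S \oplus^\sharp [b_1 \mapsto K;\ b_2 \mapsto L]) &\subseteq& \gamma(S \oplus^\sharp G_\delta([b_1 \mapsto K;\ b_2 \mapsto L]))
\end{array}
\]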
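A minimal C sketch of cut and glue on abstract lengths follows. It assumes normalized (non-empty) input ranges with non-negative bounds, and reads the glue rule as removing the value delta from the left length before joining, which is the reading that reproduces the invariants of Example 1. All names are illustrative, not the authors' API.

```c
#include <stdbool.h>

typedef struct { bool empty; int lo, hi; } range;
typedef struct { range left, right; } range_pair;

range range_make(int lo, int hi) {
    range r = { lo > hi, lo, hi };
    return r;
}

/* Smallest range containing both a and b. */
range range_join(range a, range b) {
    if (a.empty) return b;
    if (b.empty) return a;
    return range_make(a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi);
}

/* Smallest range containing all elements of r except n. */
range range_except(range r, int n) {
    if (r.empty) return r;
    if (r.lo == n) return range_make(r.lo + 1, r.hi);
    if (r.hi == n) return range_make(r.lo, r.hi - 1);
    return r;
}

/* cut: split a buffer of size n and abstract length L at offset d (0 < d < n). */
range_pair cut(range L, int n, int d) {
    range_pair p;
    if (L.lo >= d) {
        /* No null character in the first d bytes: the left part is "full". */
        p.left = range_make(d, d);
        p.right = range_make(L.lo - d, L.hi - d);
    } else {
        /* The string may end in the left part: nothing is known on the right. */
        p.left = range_make(L.lo, L.hi < d ? L.hi : d);
        p.right = range_make(0, n - d);
    }
    return p;
}

/* glue: merge a left buffer of size d (length K) with its right neighbour (length L). */
range glue(range K, range L, int d) {
    if (K.empty || d < K.lo || d > K.hi)
        return K;                       /* the string always ends in the left part */
    return range_join(range_except(K, d), range_make(L.lo + d, L.hi + d));
}
```

For a 10-byte buffer with length in [1; 10], cutting at offset 4 yields [1; 4] on the left and [0; 6] on the right; gluing [1; 4] back with a right part of length [0; 0] gives [1; 4].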
Building on glue and cut, there is a simple algorithm to move from any partition $\pi_1$ to another partition $\pi_2$ (of course, $\pi_1$ and $\pi_2$ must be defined on the same set of allocated addresses). Starting from $\pi_1$, the first step consists in splitting buffers until we get to the coarsest partition which is finer than both $\pi_1$ and $\pi_2$. Then, in a second step, buffers are glued together to get back to $\pi_2$. Let us introduce two very useful shortcut notations built on top of this algorithm. In the following, all addresses in buffer $b = (a : n)$ are allocated (i.e. $b \subseteq \mathcal{A}$):
- $S(b)$ minimally modifies the store so as to include buffer $b$ in the resulting partition and then returns the value associated with this buffer. More accurately, let $\ldots, (a_1 : n_1), \ldots, (a_k : n_k), \ldots$ be the initial partition, where all buffers that overlap $b$ are listed in increasing order as $(a_1 : n_1)$ to $(a_k : n_k)$. Then the destination partition is $\ldots, (a_1 : \delta), (a : n), (a \boxplus n : \delta'), \ldots$ where $\delta$ and $\delta'$ are the respective distances from $a_1$ to $a$ and from $a \boxplus n$ to $a_k \boxplus n_k$,
- $S[b \mapsto L]$ transforms the partition to add buffer $b$ as previously explained, updates its value with $L$ and translates back to the initial partition.
Memory accesses can now be described by the equations of figure 4. Let us comment on the case when the abstract address $A$ that is read or written is of the form $(x, [o_1; o_2])$ and all the addresses from $(x, o_1)$ to $(x, o_2)$ are allocated. In this case, the buffer $b$ that corresponds to $A$ starts in $(x, o_1)$ and stretches over $n = (o_2 - o_1 + 1)$ bytes. Thanks to the previously introduced transformations, we can easily convert the abstract store so that buffer $b$ belongs to the partition. Then, to evaluate the value that is read, we apply function $\mathrm{eval}_n$ to the abstract length $L$ associated with $b$. There are three cases:
- when the buffer contains only one character that is equal to \0, then 0 is returned,
- when $L = [n; n]$, the first \0 character is not in the buffer, so the returned value is 1,
- in all other cases, there is insufficient information to conclude and $\top_C$ is returned.

The intuition that motivates the definition of function update goes as follows:
- After a \0 character is written somewhere in the buffer, we can be sure that the length is strictly less than its size $n$. Moreover, previous \0 characters remain, so that $\mathrm{update}_n([l; u], 0) = [0; \min(u, n - 1)]$.
- If exactly one non-null character is copied in a buffer of size $n = 1$, then the first \0 cannot be at index 0, so $\mathrm{update}_1(L, 1) = [1; 1]$.
- In the remaining cases when a non-null character is written, it may erase the first \0 character in the buffer, so that the length of the string may be unbounded. Since non-null characters are untouched, the information about the lower bound on the possible string lengths is kept, thus $\mathrm{update}_n([l; u], 1) = [l; n]$.
- At last, when an unknown value is copied, all information is lost.

\[
\mathrm{read}(S, A) = \begin{cases} \bot_C & \text{if } S = \bot \vee A = \bot_A \\ \mathrm{eval}_{|b|}(S(b)) & \text{if } A = (x, O) \wedge b = \mathrm{tobuff}(A) \wedge b \subseteq \mathcal{A} \\ \top_C & \text{otherwise} \end{cases}
\]
\[
\mathrm{write}(S, A, V) = \begin{cases} \bot & \text{if } S = \bot \vee A = \bot_A \vee V = \bot_C \\ S[b \mapsto \mathrm{update}_{|b|}(S(b), V)] & \text{if } A = (x, O) \wedge b = \mathrm{tobuff}(A) \wedge b \subseteq \mathcal{A} \\ \mathrm{universe}(\pi) & \text{otherwise} \end{cases}
\]
\[ \mathrm{tobuff}(x, [o_1; o_2]) = ((x, o_1) : (o_2 - o_1 + 1)) \]
\[
\mathrm{eval}_n(L) = \begin{cases} 0 & \text{if } n = 1 \wedge L = [0; 0] \\ 1 & \text{if } L = [n; n] \\ \top_C & \text{otherwise} \end{cases}
\]
\[
\begin{array}{rcl}
\mathrm{update}_n([l; u], 0) &=& [0; \min(u, n - 1)] \\
\mathrm{update}_n([l; u], 1) &=& \begin{cases} [1; 1] & \text{if } n = 1 \\ [l; n] & \text{otherwise} \end{cases} \\
\mathrm{update}_n(L, \top_C) &=& [0; n]
\end{array}
\]
Fig. 4. Abstract memory access

These operations are sound with respect to:
\[
\begin{array}{rcl}
\{\sigma(a) \mid \sigma \in \gamma(S) \wedge a \in \gamma_A(A) \cap \mathrm{dom}(\sigma)\} &\subseteq& \gamma_C(\mathrm{read}(S, A)) \\
\{\sigma[a \mapsto v] \mid \sigma \in \gamma(S) \wedge a \in \gamma_A(A) \cap \mathrm{dom}(\sigma) \wedge v \in \gamma_C(V)\} &\subseteq& \gamma(\mathrm{write}(S, A, V))
\end{array}
\]
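The eval and update functions of figure 4 translate almost literally into C. The sketch below assumes a non-empty input range and encodes the abstract characters 0, 1 and top as a hypothetical enumeration; it leaves out the partition handling performed by read and write.

```c
#include <stdbool.h>

typedef struct { bool empty; int lo, hi; } range;

/* Abstract characters: null (0), non-null (1), unknown (top). */
typedef enum { CHAR_NULL, CHAR_NONNULL, CHAR_TOP } abs_char;

/* Value read from a buffer of size n whose abstract length is L. */
abs_char eval(int n, range L) {
    if (n == 1 && L.lo == 0 && L.hi == 0) return CHAR_NULL;
    if (L.lo == n && L.hi == n) return CHAR_NONNULL;
    return CHAR_TOP;
}

/* New abstract length after abstract character v is written somewhere
   in a buffer of size n whose previous abstract length is L. */
range update(int n, range L, abs_char v) {
    range r = { false, 0, n };                /* unknown value: [0; n] */
    switch (v) {
    case CHAR_NULL:                           /* [0; min(u, n - 1)] */
        r.hi = L.hi < n - 1 ? L.hi : n - 1;
        break;
    case CHAR_NONNULL:                        /* [1; 1] if n = 1, else [l; n] */
        if (n == 1) { r.lo = 1; r.hi = 1; }
        else        { r.lo = L.lo; }
        break;
    case CHAR_TOP:
        break;
    }
    return r;
}
```

In the analysis, the written zone is first isolated as a buffer of its own by the partition-changing algorithm, so update is applied to that zone rather than to the whole enclosing buffer.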
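\[
\begin{array}{rcl}
\mathcal{R}^\sharp_C[\![c]\!]S &=& \begin{cases} 0 & \text{if } c = \backslash 0 \\ 1 & \text{otherwise} \end{cases} \\
\mathcal{R}^\sharp_C[\![lv]\!]S &=& \mathrm{read}(S_C, \mathcal{L}^\sharp[\![lv]\!]S) \\
\mathcal{I}^\sharp[\![\mathtt{char}]\!]a &=& (\emptyset, \emptyset, \mathrm{universe}(\{(a : 1)\})) \\
\mathcal{I}^\sharp[\![\mathtt{char}[n]]\!]a &=& (\emptyset, \emptyset, \mathrm{universe}(\{(a : n)\})) \\
\mathcal{I}^\sharp[\![\tau[n]]\!]a &=& \bigoplus^\sharp_{0 \le i < n} \mathcal{I}^\sharp[\![\tau]\!](a \boxplus i \times \mathrm{sizeof}(\tau)) \\
\mathcal{I}^\sharp[\![\{\tau_1 f_1 \ldots \tau_n f_n\}]\!]a &=& \bigoplus^\sharp_{0 < i \le n} \mathcal{I}^\sharp[\![\tau_i]\!](a \boxplus \mathrm{offset}(f_i)) \\
\mathcal{C}^\sharp[\![\tau\ x;\ cmd]\!]S &=& \mathcal{C}^\sharp[\![cmd]\!](S \oplus^\sharp \mathcal{I}^\sharp[\![\tau]\!](x, 0)) \setminus^\sharp x \\
\mathcal{C}^\sharp[\![lv = e]\!](S_Z, S_P, S_C) &=& (S_Z, S_P, \mathrm{write}(S_C, \mathcal{L}^\sharp[\![lv]\!]S, \mathcal{R}^\sharp_C[\![e]\!]S)) \quad \text{where } lv, e : \mathtt{char} \\
\mathcal{C}^\sharp[\![?(lv \bowtie \backslash 0)]\!](S_Z, S_P, S_C) &=& (S_Z, S_P, \mathrm{guard}(\mathcal{L}^\sharp[\![lv]\!]S \bowtie \backslash 0, S_C)) \\
\mathcal{C}^\sharp[\![cmd_1 + cmd_2]\!]S &=& \mathcal{C}^\sharp[\![cmd_1]\!]S \sqcup \mathcal{C}^\sharp[\![cmd_2]\!]S \\
\mathcal{C}^\sharp[\![cmd_1;\ cmd_2]\!]S &=& \mathcal{C}^\sharp[\![cmd_2]\!](\mathcal{C}^\sharp[\![cmd_1]\!]S) \\
\mathcal{C}^\sharp[\![cmd^*]\!]S &=& \mathrm{lfp}\ F^\sharp_S \quad \text{where } F^\sharp_{S_0}(S) = S_0 \sqcup \mathcal{C}^\sharp[\![cmd]\!]S \\
\mathcal{C}^\sharp[\![f()]\!]S &=& \mathcal{C}^\sharp[\![cmd]\!]S \quad \text{where } cmd \text{ is the body of function } f
\end{array}
\]
Fig. 5. Abstract evaluation, initialization and execution of commands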
3.3 Abstract semantics
Building on the previous primitives, the static analysis computes abstract stores while mimicking the concrete semantics. Figure 5 presents the definitions that are specifically related to the handling of characters and strings. The remaining aspects of the analysis are standard and thus not thoroughly described here. Let us paraphrase some of the most interesting equations:
- To initialize a zone of memory starting at address $a$ with a single character or with an array of $n$ characters, $\mathcal{I}^\sharp[\![\tau]\!]a$ creates a store whose partition is reduced to a unique buffer of size 1 or $n$ and that contains no information,
- Non-deterministic choice amounts to abstract join, and sequence to function composition,
- The abstract store after a loop is the result of an abstract fixpoint computation. The constructive version of Tarski's theorem [6] suggests a naive algorithm: starting from $\bot$, the successive iterates of $F^\sharp$ are computed until stabilization. In practice, other more complex algorithms [31], widening, and loop unfolding may be safely applied.
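The naive fixpoint algorithm for loops can be sketched in C on a single range, as a simplified stand-in for a whole abstract store. The loop body `grow_capped` is a made-up transfer function (a length that grows by one per iteration and is capped at 10) chosen so that the iteration stabilizes without widening; none of these names come from the paper.

```c
#include <stdbool.h>

typedef struct { bool empty; int lo, hi; } range;
typedef range (*transfer_fn)(range);

/* Smallest range containing both a and b. */
range range_join(range a, range b) {
    if (a.empty) return b;
    if (b.empty) return a;
    range r = { false, a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi };
    return r;
}

bool range_equal(range a, range b) {
    if (a.empty || b.empty) return a.empty == b.empty;
    return a.lo == b.lo && a.hi == b.hi;
}

/* Least fixpoint of F(S) = s0 join body(S), computed by iterating from
   bottom until stabilization. Termination is only guaranteed when the
   iterates cannot grow forever; a real analyzer would add widening. */
range lfp(range s0, transfer_fn body) {
    range s = { true, 0, 0 };                 /* bottom */
    for (;;) {
        range next = range_join(s0, body(s));
        if (range_equal(next, s))
            return s;
        s = next;
    }
}

/* Hypothetical loop body: the abstract value grows by one, capped at 10. */
range grow_capped(range s) {
    if (s.empty) return s;
    range r = { false, s.lo + 1 < 10 ? s.lo + 1 : 10, s.hi + 1 < 10 ? s.hi + 1 : 10 };
    return r;
}
```

Starting from [0; 0], the iterates are [0; 0], [0; 1], and so on up to [0; 10], at which point one more application of the body changes nothing.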
Theorem 1 (Soundness). The abstract semantics of a command $cmd$ on an abstract store $S$ includes all stores that are obtained by any run of the command starting from some initial store in $\gamma(S)$:
\[ \bigcup_{\sigma \in \gamma(S)} \mathcal{C}[\![cmd]\!]\sigma \subseteq \gamma(\mathcal{C}^\sharp[\![cmd]\!]S) \]
Proof. The proof is done by structural induction on the syntax of commands. It reduces to the assembly of the various atomic soundness conditions of each primitive.

Note that, since our static analysis is built in a modular way, it would be possible to replace some components to improve either precision or efficiency and still retain the overall soundness theorem. In particular, any other non-relational numerical domain can easily be used instead of ranges.
Example 1. Here are the invariants collected by the static analysis with the character store for a small example:
l0: char buf[10];     (buf, 0) : 10 ↦ [0; 10]
l1: buf[0] = 'a';     (buf, 0) : 10 ↦ [1; 10]
l2: buf[4] = '\0';    (buf, 0) : 10 ↦ [1; 4]
l3: buf[1] = 'b';     (buf, 0) : 10 ↦ [2; 10]
l4: buf[2] = '\0';    (buf, 0) : 10 ↦ [2; 2]
The partition is reduced to one buffer of 10 bytes that starts at (buf, 0). Let us delve into the details of the computation from label l2 to l3. The tool reaches label l2 with the knowledge that the length of buf is at least 1:
(buf, 0) : 10 ↦ [1; 10]
The partition is split in three around the zone that is being written:
(buf, 0) : 4 ↦ [1; 4]
(buf, 4) : 1 ↦ [0; 1]
(buf, 5) : 5 ↦ [0; 5]
The null character is written in buffer (buf, 4) : 1, using primitive update:
(buf, 0) : 4 ↦ [1; 4]
(buf, 4) : 1 ↦ [0; 0]
(buf, 5) : 5 ↦ [0; 5]
At last, the buffers are glued together to restore the initial partition:
(buf, 0) : 10 ↦ [1; 4]
Note that at instruction l3, after character 'b' is written at index 1 of buf, the upper bound on the length of the string is forgotten. This is indeed necessary. Consider the concrete store where the first \0 character is exactly at index 1; since it is overwritten by a non-null character and the tool has no information about the position of the remaining \0 characters after the first one, the new length is unknown.
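The whole computation of Example 1 can be replayed with a compact C sketch of a one-character write into a single buffer: cut around the written cell, update it, and glue the pieces back. It assumes normalized ranges with non-negative bounds and the same reading of glue as in the example (the value delta is removed from the left length before joining); function names are illustrative.

```c
#include <stdbool.h>

typedef struct { bool empty; int lo, hi; } range;
typedef enum { CHAR_NULL, CHAR_NONNULL } abs_char;

range range_make(int lo, int hi) { range r = { lo > hi, lo, hi }; return r; }

range range_join(range a, range b) {
    if (a.empty) return b;
    if (b.empty) return a;
    return range_make(a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi);
}

range range_except(range r, int n) {
    if (r.empty) return r;
    if (r.lo == n) return range_make(r.lo + 1, r.hi);
    if (r.hi == n) return range_make(r.lo, r.hi - 1);
    return r;
}

/* cut a buffer of size n and length L at offset d into (left, right). */
void cut(range L, int n, int d, range *left, range *right) {
    if (L.lo >= d) {
        *left = range_make(d, d);
        *right = range_make(L.lo - d, L.hi - d);
    } else {
        *left = range_make(L.lo, L.hi < d ? L.hi : d);
        *right = range_make(0, n - d);
    }
}

/* glue a left buffer of size d (length K) with its right neighbour (length L). */
range glue(range K, range L, int d) {
    if (K.empty || d < K.lo || d > K.hi) return K;
    return range_join(range_except(K, d), range_make(L.lo + d, L.hi + d));
}

/* update for a one-byte buffer: [0; 0] after a null, [1; 1] after a non-null. */
range update1(abs_char v) {
    return v == CHAR_NULL ? range_make(0, 0) : range_make(1, 1);
}

/* Abstract effect of buf[k] = v on a buffer of size n with abstract length L. */
range write_at(range L, int n, int k, abs_char v) {
    range left = L, rest = L, cell = L, right = L;
    if (k > 0) cut(L, n, k, &left, &rest);                 /* bytes before index k */
    if (n - k > 1) cut(rest, n - k, 1, &cell, &right);     /* isolate the cell */
    else cell = rest;
    cell = update1(v);
    if (n - k > 1) cell = glue(cell, right, 1);            /* cell + bytes after */
    return k > 0 ? glue(left, cell, k) : cell;             /* restore the buffer */
}
```

Applied to the four assignments of the example, starting from [0; 10], `write_at` yields [1; 10], [1; 4], [2; 10] and [2; 2] in turn.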
Imagine now that the previous example ended with a call to strcpy that copies string buf into a buffer of size strictly larger than 2. Such a call would be correct, and the approximation computed by the tool is precise enough to prove this. The next section is about the analysis of strcpy and the checks that are made to show the correctness of possibly dangerous memory manipulation operations.
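\[
\mathrm{strlen}(S, A) = \begin{cases} \bot_Z & \text{if } S = \bot \vee A = \bot_A \\ S((x, o) : m) \setminus \{m\} & \text{if } A = (x, [o; o]) \wedge m = \mathrm{allocsz}_{\mathcal{A}}(x, o) \\ \mathrm{weakstrlen}(S, b) & \text{if } A = (x, O) \wedge b = \mathrm{tobuff}(A) \\ [0; +\infty] & \text{otherwise} \end{cases}
\]
\[ \mathrm{strcpy}(S, A, \bot_Z) = \bot \]
\[
\mathrm{strcpy}(S, A, [l; u]) = \begin{cases} \bot & \text{if } S = \bot \vee A = \bot_A \\ \nu(S[(x, o) : m \mapsto [l; m - 1]]) & \text{if } A = (x, [o; o]) \wedge m = \min(u + 1, \mathrm{allocsz}_{\mathcal{A}}(x, o)) \\ \nu(\mathrm{weakstrcpy}(S, b, [l; u])) & \text{if } A = (x, O) \wedge b = \mathrm{tobuff}(A) \wedge b \subseteq \mathcal{A} \\ \mathrm{universe}(\pi) & \text{otherwise} \end{cases}
\]
\[ \mathrm{weakstrlen}(S, X) = \bigsqcup \{S(a : m) \setminus \{m\} \mid a \in X \wedge m = \mathrm{allocsz}_{\mathcal{A}}(a)\} \]
\[ \mathrm{weakstrcpy}(S, a : n, [l; u]) = S[a : m \mapsto S(a : m) \sqcup [l; m - 1]] \quad \text{where } m = \min(u + n, \mathrm{allocsz}_{\mathcal{A}}(a)) \]
Fig. 6. Abstract string length and string copy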
4 String copy
4.1 Concrete semantics
The syntax of commands is enriched with strcpy($e_1$, $e_2$). This call copies the string pointed to by pointer $e_2$ into the buffer starting in $e_1$:
\[
\mathcal{C}[\![\mathrm{strcpy}(e_1, e_2)]\!]\sigma = \left\{ \sigma[a_1 \boxplus j \mapsto \sigma(a_2 \boxplus j)]_{0 \le j \le l} \;\middle|\; \begin{array}{l} a_1 \in \mathcal{L}[\![{*e_1}]\!]\sigma \wedge n_1 = \mathrm{allocsz}_{\mathcal{A}}(a_1) \\ a_2 \in \mathcal{L}[\![{*e_2}]\!]\sigma \wedge n_2 = \mathrm{allocsz}_{\mathcal{A}}(a_2) \\ l = \mathrm{strlen}_\sigma(a_2 : n_2) \wedge l \neq n_2 \wedge l < n_1 \end{array} \right\}
\]
In the previous equation, $\mathrm{allocsz}_{\mathcal{A}}(a)$ denotes the number of bytes that are allocated starting from address $a$. The source buffer denoted by $e_2$ should contain a valid string, i.e. there should be some \0 character before the end of the allocated source memory zone. In other words, the length $l = \mathrm{strlen}_\sigma(a_2 : n_2)$ of the string should be different from $n_2$. Additionally, $l$ should be smaller than the size $n_1$ of the destination buffer. Otherwise, there is not enough space to copy the entire string and, according to this semantics, the program halts.
4.2 Abstract semantics
In the abstract world, strcpy is performed in two phases:
\[ \mathcal{C}^\sharp[\![\mathrm{strcpy}(e_1, e_2)]\!]S = (S_Z, S_P, \mathrm{strcpy}(S_C, \mathcal{L}^\sharp[\![{*e_1}]\!]S, \mathrm{strlen}(S_C, \mathcal{L}^\sharp[\![{*e_2}]\!]S))) \]
Both phases are defined in figure 6. First, strlen retrieves the length of the source string. When the address $a$ of the source string is exactly known, it reads the information associated with the buffer that starts from $a$ and goes until the first non-allocated address $a \boxplus m$. The length $m$ represents the case when no null character is found before the end of the buffer. This case would halt the program and is thus eliminated from the result. Then, primitive strcpy updates the destination buffer with the new abstract length. When the destination address $a$ is precisely known, the information is replaced by the new abstract length bounded by the size of the destination zone. When the possible destination addresses are contained in a buffer $(a : n)$, weakstrcpy merges the previous length with interval $[l; u + n - 1]$ bounded by the size of the destination zone. The lower bound $l$ corresponds to the case when the smallest string is copied to $a$. The upper bound $u + n - 1$ corresponds to the case when the longest string is copied to $a \boxplus (n - 1)$. Notice how both primitives make extensive use of the algorithm to change partitions. They satisfy the conditions:
\[
\{l \mid \sigma \in \gamma(S) \wedge a \in \gamma_A(A) \wedge n = \mathrm{allocsz}_{\mathcal{A}}(a) \wedge l = \mathrm{strlen}_\sigma(a : n) \wedge l \neq n\} \subseteq \gamma_Z(\mathrm{strlen}(S, A))
\]
\[
\left\{ \sigma[a \boxplus j \mapsto c_j]_{0 \le j \le l} \;\middle|\; \begin{array}{l} \sigma \in \gamma(S) \wedge a \in \gamma_A(A) \wedge l \in \gamma_Z(L) \\ n = \mathrm{allocsz}_{\mathcal{A}}(a) \wedge l < n \\ \forall\, 0 \le j < l : c_j \neq \backslash 0 \wedge c_l = \backslash 0 \end{array} \right\} \subseteq \gamma(\mathrm{strcpy}(S, A, L))
\]
This ensures the soundness of the abstract string copy with respect to its concrete counterpart. Theorem 1 still holds.
4.3 Checks
Information gathered by the static analysis is used to check that all potentially dangerous memory manipulations are safe. A predicate is applied to the abstract value computed for the arguments of each operation. If the predicate does not hold, then the tool has insufficient information to conclude that the operation is safe and it emits a warning. We present three such predicates (all abstract arguments are supposed to be different from $\bot$):
- Buffer overflows: when accessing an array of size $n$ at any index in $\gamma_Z(I)$, the index should be within bounds:
\[ \mathrm{check}_{[]}(I, n) = (0 \le \min(I)) \wedge (\max(I) < n) \]
- Pointer overflows: when dereferencing a pointer $P$ to a datum of type $\tau$, the pointer should be within the referenced zone:
\[ \mathrm{check}_{*}(\langle A, I, N \rangle, \tau) = (0 \le \min(I)) \wedge (\max(I) + \mathrm{sizeof}(\tau) \le \min(N)) \]
- String buffer overflows: when copying a string of length $l$ in $\gamma_Z(L)$ from a source address $a$ in $\gamma_A(x, O)$ to some destination address $a'$ in $\gamma_A(y, O')$, there should be a null character before the end of the allocated memory starting in $a$, and there should be at least $l + 1$ bytes of allocated memory from $a'$ (the $l$ characters and the terminating null):
\[
\begin{array}{rcl}
\mathrm{check}_{\backslash 0}(S, (x, O), (y, O'), L) &=& \mathrm{tobuff}(x, O) \subseteq \mathcal{A} \wedge a = (x, \max(O)) \wedge m = \mathrm{allocsz}_{\mathcal{A}}(a) \wedge m \notin S(a : m) \\
 & & \wedge\ \mathrm{tobuff}(y, O') \subseteq \mathcal{A} \wedge \max(L) < \mathrm{allocsz}_{\mathcal{A}}(y, \max(O'))
\end{array}
\]
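The three predicates can be sketched in C on non-empty ranges. The string-copy check below is deliberately simplified: it assumes that source and destination addresses are precisely known and allocated, so the tobuff conditions are omitted and the allocated sizes are passed as plain integers. Names and signatures are illustrative only.

```c
#include <stdbool.h>

/* Non-empty range [lo; hi]. */
typedef struct { int lo, hi; } range;

/* Array access at any index in I of an array of n elements. */
bool check_index(range I, int n) {
    return 0 <= I.lo && I.hi < n;
}

/* Dereference of a pointer <A, I, N> to a datum of `size` bytes. */
bool check_deref(range I, range N, int size) {
    return 0 <= I.lo && I.hi + size <= N.lo;
}

/* Simplified string-copy check.
   src_len   : abstract length stored for the source buffer
   src_alloc : number of bytes allocated from the source address
   copied    : abstract length of the copied string
   dst_alloc : number of bytes allocated from the destination address */
bool check_strcpy(range src_len, int src_alloc, range copied, int dst_alloc) {
    /* The source holds a terminated string iff src_alloc is not a possible length. */
    bool src_terminated = src_alloc < src_len.lo || src_alloc > src_len.hi;
    return src_terminated && copied.hi < dst_alloc;
}
```

For instance, copying a 9-character constant into a 10-byte buffer passes the check, while copying a 13-character constant into the same buffer does not.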
5 Experiments
The static analysis was implemented in OCaml [16]. It uses CIL [18] as front-end. A simplification phase is applied to the CIL output to get to our kernel language. The analysis then propagates the abstract store following the structure of the code. Loops are handled by simple fixpoint computation algorithms. Some loops are unfolded in order to improve precision. Once computations have stabilized, a final pass checks potentially dangerous operations and emits warnings. Excluding CIL, the whole source code totals approximately 4000 lines of code.

The design of this static analysis was constantly led by software very similar to what is found on actual aeronautical products. It is interesting to note that, in these case studies, approximately 60% of calls to strcpy have a constant string as source argument. Another 25% are called with a source buffer that is initialized with a constant string. Experiments were performed on small benchmarks from this software base. We sometimes had to manually remove union types, which are not handled by this analysis. Among others, all 63 calls to strcpy in a 3000-line program were successfully checked. Here is a small example that embodies some of the more difficult cases the tool had to process:
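#include <string.h>   /* strcpy */

typedef struct {
    char* f;
} s;

char buf[10];

/* x points to a row of two structures; the second one is set to buf */
void init(s* x) {
    x[1].f = buf;
}

int main() {
    s a[2][2];
    s* ptr = (s*) &a[1];
    init(ptr);                           /* a[1][1].f now points to buf */
    ptr = (s*) &a[0];
    strcpy(a[1][1].f, "strcpy ok");      /* 9 characters + null fit in buf */
    strcpy(a[1][1].f, "strcpy not ok");  /* 13 characters + null overflow buf */
    return 0;
}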
The tool flags the second call to strcpy. Since it knows that variable x and &a[1] are aliased, it deduces that a[1][1].f has size 10 and doesn't emit any warning for the first call. This example demonstrates that the integration of several C features in one tool is necessary to obtain sufficient precision.
6 Related work
The detection of buer overows in C programs is an active eld of research and
various approaches have been proposed.
Fuzzing is a testing technique that consists in hooking a random generator to
the inputs of a program. If the program crashes then defects may be uncovered.
Smart fuzzing tools take advantage of the network protocols [1, 3] or le formats
[23] expected by the software in order to exercise the code in more depth. How-
ever, testing can usually not be exhaustive. Tools like StackGuard [8], ProPolice
[2], CRED [20] and other [28] are C compilers extension that implement runtime
protection mechanisms. For instance, StackGuard uses a canary to detect attacks
on the stack. Unfortunately, these techniques incur a non negligible overhead:
either by slowing down execution or using up memory. Light static analyzes may
remove unnecessary checks and improve performances [17]. In the end, all these
techniques just turn buer overows into denial-of-service attacks.
Static analyses can detect defects before the code is executed. Several tools
[27, 25, 4, 15, 11, 30, 26, 12] sacrifice soundness for scalability or efficiency.
Unsound tools include fast and imprecise lexical analyzers such as ITS4 [25]. BOON
[26] and [12] both translate the verification problem into an integer constraint
problem but ignore potential aliasing. Soundness is clearly mandatory in our
context. ASTRÉE [7], Airac [14] and CGS [24] are all sound tools based on abstract
interpretation that aim at detecting all runtime errors in C code. ASTRÉE
focuses on control-command software without pointers, Airac on out-of-bounds
array accesses and CGS on dynamic memory manipulation. These approaches do not
have any special treatment for strings, which is a potential source of imprecision
in our case. CSSV [9] and the analysis of [22] are closest to our work.
Like us, both adopt the abstraction of strings by their possible lengths, pioneered
in [26]. Unlike us, they use the expensive numerical domain of polyhedra and they
handle dynamic allocation. Instead of combining value and pointer analysis,
both perform the pointer analysis separately. CSSV then translates
the C program into an integer program. It needs function-level annotations to
produce precise results during a whole-program analysis. CSSV can handle union
types, albeit in a very imprecise way: each memory location has a size, and any
assignment of a value of a different size sets the location to unknown. Interestingly,
the abstraction in [22] associates the length of strings with pointers, rather
than with the buffer where the string is stored. It seems difficult to extend this
formalism in order to deal with more language features. In particular, two pointers
are considered aliased only when they have the same base address and length. This
condition is clearly too restrictive and prevents the handling of multi-dimensional
arrays or cast operations.
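The idea of abstracting a string by its possible lengths can be sketched as follows. This is an illustrative assumption, not the domain used by the tool, CSSV or [22]: a plain interval [len_min, len_max] stands in for the numerical domain, the length is attached to the buffer, and the names abs_buf, check_strcpy and after_strcpy are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical abstract value: a buffer of known size holding a string
   whose length (NUL excluded) lies in [len_min, len_max]. */
typedef struct {
    size_t size;     /* allocated bytes */
    size_t len_min;  /* smallest possible string length */
    size_t len_max;  /* largest possible string length */
} abs_buf;

typedef enum { COPY_SAFE, COPY_MAY_OVERFLOW, COPY_OVERFLOWS } verdict;

/* Abstract check of strcpy(dst, src): the copy writes length + 1 bytes. */
verdict check_strcpy(const abs_buf *dst, const abs_buf *src) {
    assert(src->len_min <= src->len_max);
    if (src->len_max + 1 <= dst->size) return COPY_SAFE;       /* every length fits */
    if (src->len_min + 1 > dst->size)  return COPY_OVERFLOWS;  /* no length fits */
    return COPY_MAY_OVERFLOW;                                  /* only some lengths fit */
}

/* Abstract effect of a copy: dst now holds a string with src's lengths. */
abs_buf after_strcpy(abs_buf dst, const abs_buf *src) {
    dst.len_min = src->len_min;
    dst.len_max = src->len_max;
    return dst;
}
```

A sound analyzer has to report both COPY_MAY_OVERFLOW and COPY_OVERFLOWS. An interval cannot relate the length of one string to another variable, which is the kind of relation a polyhedral domain can express at a higher cost.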
7 Conclusion
We have designed and implemented a new static analysis to check the correctness
of all memory manipulations in C programs. It integrates several analysis
techniques to handle pointers, structures, multi-dimensional arrays, some kinds
of casts and strings. The analysis of strings is kept as simple as possible thanks
to transport operators that make it possible to tune the granularity of the
abstraction. First experimental results are extremely promising, and the abstraction
seems adequate to prove actual case studies correct. Further work will explore the
semantics and abstractions necessary to handle C union types with more precision.
References
1. Dave Aitel. The advantage of block-based protocol analysis for security testing.
Technical report, Immunity, Inc., 2002.
2. A. Baratloo, N. Singh, and T. Tsai. Transparent run-time defense against stack
smashing attacks. In Proceedings of the USENIX Annual Technical Conference,
2000.
3. Philippe Biondi. Scapy. http://www.secdev.org/projects/scapy/.
4. B. Chess. Improving computer security using extended static checking. In IEEE
Symposium on Security and Privacy, 2002.
5. P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model for static
analysis of programs by construction or approximation of fixpoints. In Conference
Record of the 4th ACM Symposium on Principles of Programming Languages. ACM
Press, 1977.
6. P. Cousot and R. Cousot. Constructive versions of Tarski's fixed point theorems.
Pacific Journal of Mathematics, 81(1), 1979.
7. P. Cousot, R. Cousot, J. Feret, L. Mauborgne, A. Miné, D. Monniaux, and X.
Rival. The ASTRÉE Analyser. In Proceedings of the European Symposium on
Programming, volume 3444 of Lecture Notes in Computer Science. Springer, 2005.
8. C. Cowan et al. StackGuard: Automatic adaptive detection and prevention of
buffer-overflow attacks. In Proceedings of the 7th USENIX Security Symposium.
USENIX Association, 1998.
9. Nurit Dor, Michael Rodeh, and Mooly Sagiv. CSSV: towards a realistic tool for
statically detecting all buffer overflows in C. In Proceedings of the ACM SIGPLAN
2003 conference on Programming language design and implementation. ACM Press,
2003.
10. Mark W. Eichin and Jon A. Rochlis. With microscope and tweezers: An analysis of
the internet virus of november 1988. In Proceedings of the 1989 IEEE Symposium
on Security and Privacy. IEEE Computer Society Press, 1989.
11. David Evans and David Larochelle. Improving security using extensible lightweight
static analysis. IEEE Software, 19(1), 2002.
12. Vinod Ganapathy, Somesh Jha, David Chandler, David Melski, and David Vitek.
Buffer overrun detection using linear programming and static analysis. In
Proceedings of the 10th ACM conference on Computer and communications security.
ACM Press, 2003.
13. Erich Haugh and Matthew Bishop. Testing C programs for buffer overflow
vulnerabilities. In Proceedings of the Network and Distributed System Security
Symposium. The Internet Society, 2003.
14. Yungbum Jung, Jaehwang Kim, Jaeho Shin, and Kwangkeun Yi. Taming false
alarms from a domain-unaware C analyzer by a bayesian statistical post analysis.
In Static Analysis, 12th International Symposium, volume 3672 of Lecture Notes
in Computer Science. Springer, 2005.
15. D. Larochelle and D. Evans. Statically detecting likely buffer overflow
vulnerabilities. In Proceedings of the 10th USENIX Security Symposium, 2001.
16. X. Leroy, D. Doligez, J. Garrigue, D. Rémy, and J. Vouillon. The Objective
Caml system release 3.06, documentation and user's manual. Institut National de
Recherche en Informatique et en Automatique (INRIA), 2002.
17. George C. Necula, Jeremy Condit, Matthew Harren, Scott McPeak, and Westley
Weimer. CCured: type-safe retrofitting of legacy software. ACM Transactions
on Programming Languages and Systems, 27(3), 2005.
18. George C. Necula, Scott McPeak, S.P. Rahul, and Westley Weimer. CIL: Inter-
mediate language and tools for analysis and transformation of C programs. In
Proceedings of Conference on Compiler Construction, 2002.
19. Aleph One. Smashing the stack for fun and profit. Phrack, 7(49), 1996.
20. Olatunji Ruwase and Monica S. Lam. A practical dynamic buffer overflow detector.
In Network and Distributed System Security Symposium. The Internet Society,
2004.
21. Michael Siff, Satish Chandra, Thomas Ball, Krishna Kunchithapadam, and
Thomas W. Reps. Coping with type casts in C. In 7th European Software Engi-
neering Conference, volume 1687 of Lecture Notes in Computer Science. Springer,
1999.
22. Axel Simon and Andy King. Analyzing string buffers in C. In Proceedings of the
9th International Conference on Algebraic Methodology and Software Technology.
Springer-Verlag, 2002.
23. Michael Sutton and Adam Greene. The art of file format fuzzing. In Black Hat
USA 2005, 2005.
24. Arnaud Venet and Guillaume Brat. Precise and efficient static array bound
checking for large embedded C programs. In Proceedings of the ACM SIGPLAN 2004
conference on Programming language design and implementation. ACM Press,
2004.
25. John Viega, J. T. Bloch, Y. Kohno, and Gary McGraw. ITS4: A static vulnerability
scanner for C and C++ code. In 16th Annual Computer Security Applications
Conference. IEEE Computer Society, 2000.
26. David Wagner, Jeffrey S. Foster, Eric A. Brewer, and Alexander Aiken. A first
step towards automated detection of buffer overrun vulnerabilities. In Proceedings
of the Network and Distributed System Security Symposium. The Internet Society,
2000.
27. John Wilander and Mariam Kamkar. A comparison of publicly available tools for
static intrusion prevention. In 7th Nordic Workshop on Secure IT Systems, 2002.
28. John Wilander and Mariam Kamkar. A comparison of publicly available tools for
dynamic buffer overflow prevention. In Network and Distributed System Security
Symposium. The Internet Society, 2003.
29. Glynn Winskel. The Formal Semantics of Programming Languages: An Introduc-
tion. The MIT Press, 1993.
30. Yichen Xie, Andy Chou, and Dawson R. Engler. ARCHER: using symbolic, path-
sensitive analysis to detect memory access errors. In Proceedings of the 11th ACM
SIGSOFT Symposium on Foundations of Software Engineering. ACM, 2003.
31. K. Yi. Yet another ensemble of abstract interpreter, higher-order data-flow
equations, and model checking. Technical Memorandum 2001-10, Research on Program
Analysis System, National Creative Research Center, Korea Advanced Institute of
Science and Technology, 2001.
