Christian DeLozier
doi: 10.20944/preprints202311.0089.v1
Copyright: This is an open access article distributed under the Creative Commons
Attribution License which permits unrestricted use, distribution, and reproduction in any
Disclaimer/Publisher’s Note: The statements, opinions, and data contained in all publications are solely those of the individual author(s) and
contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting
from any ideas, methods, instructions, or products referred to in the content.
Article
How Close Is Existing C/C++ Code to a Safe Subset?
Christian DeLozier 1, *,†
1 United States Naval Academy
* Correspondence: [email protected]
† Current address: 105 Maryland Ave, Annapolis, MD, USA.
Abstract: Using a safe subset of C++ is a promising direction for increasing the safety of the
programming language while maintaining its performance and productivity. In this paper, we
examine how close existing C/C++ code is to conforming to a safe subset of C++. We examine
the rules presented in existing safe C++ standards and safe C++ subsets. We analyze the code
characteristics of 5.8 million code samples from the Exebench benchmark suite and 5 modern C++
applications using a static analysis tool. We find that raw pointers, unsafe casts, and unsafe library
functions are used in both C++ code at large and modern C++ applications. In general, C++ code at
large does not differ much from modern C++ code, and continued work will be required to transition
from existing C/C++ code to a safe subset of C++.
0. Introduction
For decades, the lack of memory safety in C and C++ has been the culprit behind a significant
number of software vulnerabilities [2]. Despite numerous available approaches to preventing memory
safety errors in these languages, developers have been hesitant to adopt any of them.
Following the rise in popularity of languages like Rust, multiple organizations have advised
that software developers should transition away from using languages like C and C++ in favor of
memory-safe languages [3,8]. In response, the C++ Standards Committee released a report on the
future direction for ISO C++ [1]. As part of their response, the committee indicated that moving toward
defining a safe subset of C++ could serve the purpose of allowing applications to be written in C++,
gaining the advantages of C++’s performance and productivity, while improving the safety of code
written in C++.
Our goal is to understand how close existing code is to conforming to a safe subset and highlight
the work that would be required to transition from existing C/C++ code to a safe subset of C++. To
accomplish this task, we applied a static analysis tool that identifies potentially problematic code
constructs in existing C/C++ code. We ran this static analysis tool on 5.8 million code samples from the
Exebench benchmark suite [4] and five modern C++ applications. We analyze the data to determine
(Q1) how often raw pointers, void pointers, and smart pointers are used in existing C/C++ code, (Q2)
how commonly unsafe constructs are used in existing C/C++ code, and (Q3) how much "modern" C++
code differs, in terms of using unsafe constructs, from C++ code at large.
This paper makes three main contributions.
• We summarize existing work on safe C++ standards and safe subsets of C++.
• We present a static analysis tool and methodology for identifying potentially problematic code
constructs in existing C/C++ code.
• We analyze data from 5.8 million code samples in the Exebench benchmark suite and 5 modern
C++ applications to determine how close existing C/C++ code is to conforming to a safe subset
of C++.
The remainder of the paper is organized as follows. Section 1 provides background information
on recent arguments over the safety of C/C++ and the motivation to transition toward a safe subset of
C++. Section 2 analyzes and compares a set of existing safe C++ standards. Section 3 discusses prior
work on safe subsets of C/C++. Section 4 introduces our experimental methodology and provides
background information on the Exebench benchmark suite and the modern C++ applications studied
in the remainder of the paper. Section 5 presents the results of our experiments and discussion on the
state of existing C/C++ code in relation to a safe subset of C++. Section 6 discusses the limitations of
our study and potential for future work, and Section 7 concludes the paper.
1. Background
In recent years, there has been a growing chorus of concerns surrounding the use of C and C++
programming languages, primarily centered on issues related to software safety. These apprehensions
have arisen as a response to years of security vulnerabilities and incidents that have underscored
the risks inherent in these languages. In response to these apprehensions, the software development
community has actively sought ways to address safety concerns, resulting in the emergence of safe
C++ standards and safe subsets of C++. This background section will delve into the contemporary
arguments against using C and C++ due to safety concerns, explore the various safe C++ standards that
have been developed, and examine the concept of safe subsets of C++, shedding light on the ongoing
efforts to strike a balance between the power and efficiency of these languages and the imperative of
ensuring software security.
unsigned int size = getInput();
int *values = new int[size + 8];
values[0] = 15;
Figure 1. Buffer overflow in C++ due to integer overflow. If the size calculation wraps around to 0, the
array is allocated with size 0, leading to an overflow, even at index 0.
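One way to avoid the wraparound in Figure 1 is to validate the size arithmetic before allocating. The following is a minimal sketch under stated assumptions: the helper names checked_add and make_values are hypothetical, and std::vector stands in for the raw array so that the element access is bounds-checked.

```cpp
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper: stores size + padding in `out` and returns true,
// or returns false if the unsigned addition would wrap around.
inline bool checked_add(unsigned int size, unsigned int padding,
                        unsigned int &out) {
  if (size > std::numeric_limits<unsigned int>::max() - padding) {
    return false;  // size + padding would overflow
  }
  out = size + padding;
  return true;
}

// Hypothetical replacement for the allocation in Figure 1: the buffer is
// created only when the size computation is known to be valid.
inline std::vector<int> make_values(unsigned int size) {
  unsigned int total = 0;
  if (!checked_add(size, 8u, total)) {
    throw std::overflow_error("size + 8 wraps around");
  }
  std::vector<int> values(total);  // elements are value-initialized to 0
  values.at(0) = 15;               // at() performs a bounds check
  return values;
}
```

With this structure, an input near the maximum unsigned value is rejected before any memory is allocated, rather than producing an undersized array.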
Lifetime errors occur when a program accesses a memory location after the associated object has
been deallocated or when the pointer points to an invalid memory address. Figure 2 demonstrates
a lifetime error caused by storing the address of a stack-allocated variable into a global pointer. The
global pointer has a longer lifetime than the stack-allocated variable, allowing the pointer to refer
to the memory after it has been reallocated for a new purpose. These errors can lead to crashes,
data corruption, and unpredictable behavior in a program. The manual memory management in C
and C++ means that developers must be meticulous in tracking and managing pointers, which can
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 2 November 2023 doi:10.20944/preprints202311.0089.v1
be error-prone and complex. A number of similar issues, including null and uninitialized pointer
dereferences, fall into the same category of errors.
int *p = nullptr;
void foo() {
  int x = 5;
  ...
  p = &x;
}
void bar() {
  int y = 0;
  ...
  *p = 10;
}
Figure 2. Use-after-free in C++ due to assigning a stack-allocated address into a global pointer.
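For contrast with Figure 2, the sketch below shows one possible way to make the lifetime of the referent explicit using the standard std::shared_ptr and std::weak_ptr types. The names g_observer, foo_owned, and bar_checked are hypothetical, and this is only an illustration of the idea of lifetime checking, not a recommendation drawn from the standards discussed in this paper.

```cpp
#include <memory>

// Global observer: it does not extend the lifetime of the object it watches.
std::weak_ptr<int> g_observer;

// Hypothetical analogue of foo(): the object is owned by the returned
// shared_ptr instead of living in foo's stack frame.
inline std::shared_ptr<int> foo_owned() {
  std::shared_ptr<int> x = std::make_shared<int>(5);
  g_observer = x;
  return x;
}

// Hypothetical analogue of bar(): the write happens only if the object is
// still alive when bar runs.
inline bool bar_checked() {
  if (std::shared_ptr<int> p = g_observer.lock()) {
    *p = 10;
    return true;
  }
  return false;  // the object has been destroyed, so the write is skipped
}
```

The cost of this approach is a heap allocation and reference counting, which is one reason a safe subset may prefer static lifetime rules where they suffice.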
Race conditions can exacerbate both of these issues. A lack of proper synchronization in concurrent
code can lead to issues with uninitialized memory, null pointer dereferences, use-after-free, and buffer
overflows. These bugs can be difficult to find and fix in parallel programs.
1.2.2. NSA
In November 2022, the National Security Agency published a "Cybersecurity Information Sheet"
focused on memory safety issues in unsafe languages like C and C++ [3]. This paper highlighted recent
reports from Microsoft [6] and Google [7] regarding memory safety vulnerabilities in their products.
As noted by the NSA, memory safety vulnerabilities are often exploited by attackers to gain remote
code execution capabilities.
The NSA paper highlights four issues with memory safety that can lead to vulnerabilities: buffer
overflows, use-after-free errors, uninitialized variables, and race conditions. These types of bugs
can occur in C and C++ programs due to the lack of array bounds checking, use of manual memory
management, lack of requirements to initialize memory, and use of a weak consistency memory model
for concurrency. Memory safe languages solve these issues through static restrictions and dynamic
checks. However, as noted by the NSA, "memory safety can be costly in performance and flexibility."
The paper goes on to discuss various approaches to defending against memory safety
vulnerabilities, including static and dynamic analyses. Static analyses can be costly in terms of
programmer flexibility and time, and dynamic analyses can negatively impact run-time performance.
Other approaches offer band-aid solutions in the form of anti-exploitation features, but these can often
be bypassed by a clever attacker.
The NSA paper concludes that the "path forward" should be to shift from using languages like C
and C++ to memory safe languages when possible.
1.2.3. NIST
In October 2021, the National Institute of Standards and Technology (NIST) released a document
entitled "Guidelines on Minimum Standards for Developer Verification of Software." Overall, this
document recommends techniques for software verification, including static and dynamic analysis,
and does not directly recommend moving away from languages like C and C++. However, in section
3.1, NIST recommends using compile-time flags that enable run-time buffer overflow checking and
other memory safety protections. In section 3.2, NIST recommends using tools [5,9,10] that enforce
memory safety in C and C++.
immutability (Con), Templates and generic programming (T), C-style programming (CPL), Source files
(SF), and The Standard Library (SL).
The C++ Core Guidelines also includes the C++ Core Profiles for type-safety, bounds-safety, and
lifetime-safety, which provide additional guidance on writing C++ code that preserves these important
properties. The type-safety profile recommends avoiding casts, using dynamic_cast to
downcast, avoiding C-style casts, initializing all variables, avoiding unions, and avoiding varargs.
The bounds safety profile recommends avoiding pointer arithmetic, only indexing into arrays with
constant expressions, avoiding array to pointer decay, and avoiding unsafe library functions. Finally,
the lifetime safety profile recommends avoiding dereferencing potentially invalid pointers.
2.4. CERT
The CERT (Computer Emergency Response Team) C/C++ Coding Standard, developed by the
CERT Division of the Software Engineering Institute (SEI) at Carnegie Mellon University, is a set
of guidelines and best practices for writing secure and reliable C and C++ code [19]. The standard
is designed to help software developers and organizations reduce vulnerabilities and improve the
overall quality of their code. The CERT coding standard consists of 11 categories of recommendations,
including Declarations and Initialization (DCL), Expressions (EXP), Integers (INT), Containers (CTR),
Characters and Strings (STR), Memory Management (MEM), Input Output (FIO), Exceptions and
Error Handling (ERR), Object Oriented Programming (OOP), Concurrency (CON), and Miscellaneous
(MSC).
The HIC standard is categorized into numeric divisions as follows: (1) General, (2) Lexical Conventions,
(3) Basic Concepts, (4) Standard Conversions, (5) Expressions, (6) Statements, (7) Declarations, (8)
Definitions, (9) Classes, (10) Derived Classes, (11) Member Access Control, (12) Special Member
Functions, (13) Overloading, (14) Templates, (15) Exception Handling, (16) Preprocessing, (17) Standard
Library, and (18) Concurrency.
Rules like this improve the maintainability, readability, and cleanliness of the code. Lifetime errors
are less likely in functions with no side effects. Likewise, preventing spatial safety errors may be easier
in code without side effects.
Other rules, especially those relating to spatial and lifetime safety, may be difficult to enforce, as
acknowledged by the standards themselves.
Multiple standards advocate for some form of run-time checks to ensure that safety
requirements are met. The implementation of run-time checks is left up to some combination of
the programmer and tool-chain.
Following a C++ standard will improve the quality of C++ code, but security vulnerabilities like
buffer overflows and lifetime errors cannot be prevented exclusively at compile time. A safe subset
of C++ will likely need to rely on both static enforcement of conformance to a subset and dynamic
checking for run-time errors.
Table 1. Summary of rules in Safe C++ Standards that directly relate to memory safety properties.
3.1. SafeC
SafeC [14] improves the memory safety of the C programming language by replacing pointers
with a SafePtr structure that enables memory safety checks. The SafePtr structure includes the original
pointer, a base pointer, a size, a storage class (Heap, Local, Global), and a capability.
To apply SafeC to existing C code, all pointers must be translated to the safe pointer structure.
Once pointers have been translated, safety checks can be inserted into the code that perform tests
using the additional metadata. Operations on pointers must also be modified to produce new pointer
structures.
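The structure described above can be sketched as follows. This is an illustration only: the field names, the use of an element count for the size, and the helper functions in_bounds, offset, and checked_load are assumptions made for this example and are not taken from the SafeC implementation. The sketch is written without operator overloading, in the spirit of C.

```cpp
#include <cstddef>

// Storage classes named in the description of SafePtr.
enum class StorageClass { Heap, Local, Global };

// Illustrative fat pointer for int referents (hypothetical layout).
struct SafePtrInt {
  int *value;            // the original pointer
  int *base;             // start of the referent object
  std::size_t size;      // size of the referent, in elements (assumption)
  StorageClass storage;  // where the referent lives
  unsigned capability;   // identifies the allocation for temporal checks
};

// Spatial check: does `value` point inside [base, base + size)?
inline bool in_bounds(const SafePtrInt &p) {
  return p.value != nullptr && p.value >= p.base &&
         p.value < p.base + p.size;
}

// Pointer arithmetic produces a new structure that keeps the metadata.
inline SafePtrInt offset(SafePtrInt p, std::ptrdiff_t n) {
  p.value += n;
  return p;
}

// Checked load: reports failure instead of reading out of bounds.
inline bool checked_load(const SafePtrInt &p, int &out) {
  if (!in_bounds(p)) {
    return false;
  }
  out = *p.value;
  return true;
}
```

The temporal check against the capability is omitted here; only the spatial check is shown.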
SafeC introduced the concept of securing C through transformations on pointers, but its
implementation was limited by the lack of template support in C++ at that time. The safe pointer
definition had a template-like definition, but it could not use operator overloading to implement the
safety checks or pointer operations. SafeC highlights the necessity of securing pointer operations to
secure code written in C.
3.2. CCured
Similar to SafeC, CCured [11] adds memory safety guarantees to C programs via program
transformations on pointers. CCured introduces SAFE, SEQ, WILD, and RTTI pointer types to replace
raw pointers in C programs. SAFE pointers do not use pointer arithmetic or casts. SEQ pointers
can use pointer arithmetic but not casts. WILD pointers can perform both pointer arithmetic and
casts. SEQ and WILD pointers must carry additional metadata to secure these additional operations.
RTTI pointers allow downcasts by carrying run-time type information. CCured’s automatic program
transformation identified, via a pointer analysis, the correct type for each pointer in a program.
In a real-world study [12], the authors of CCured examined the types of pointers that were
required to secure C programs. In a set of Apache modules, they found that the expensive WILD and
RTTI pointers were rarely required. Similarly, in a set of system software applications, SAFE and
SEQ pointers made up the majority of pointers required. This study emphasized the importance of
identifying how pointers are used at the source-level of a C program to efficiently enforce memory
safety.
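To illustrate why matching pointer kinds to their required capabilities matters, the sketch below contrasts a SAFE-like pointer, which needs only a null check, with a SEQ-like pointer, which must carry bounds to support arithmetic. The class names SafeLike and SeqLike are hypothetical, and the code is a simplified C++ analogy rather than CCured's actual representation, which is produced by a source-to-source transformation of C.

```cpp
#include <cstddef>
#include <stdexcept>

// SAFE-like pointer: no arithmetic is offered, so a null check suffices.
template <typename T>
class SafeLike {
 public:
  explicit SafeLike(T *p) : ptr_(p) {}
  T &operator*() const {
    if (ptr_ == nullptr) throw std::runtime_error("null dereference");
    return *ptr_;
  }

 private:
  T *ptr_;
};

// SEQ-like pointer: arithmetic is allowed, so bounds metadata is required.
template <typename T>
class SeqLike {
 public:
  SeqLike(T *base, std::size_t count)
      : base_(base), count_(count), index_(0) {}
  SeqLike operator+(std::ptrdiff_t n) const {
    SeqLike result(*this);
    result.index_ += n;  // an index is stored, so no out-of-range pointer is formed
    return result;
  }
  T &operator*() const {
    if (base_ == nullptr || index_ < 0 ||
        static_cast<std::size_t>(index_) >= count_) {
      throw std::out_of_range("out-of-bounds dereference");
    }
    return base_[index_];
  }

 private:
  T *base_;
  std::size_t count_;
  std::ptrdiff_t index_;
};
```

The SEQ-like type is larger and its dereference performs more comparisons, which is the overhead that a pointer-kind analysis tries to avoid for pointers that never use arithmetic.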
3.3. Cyclone
Cyclone [13] restricts unsafe idioms in C and provides extensions to allow programmers to safely
use similar constructs. Cyclone requires that pointers are initialized before use and applies safety
checks to pointer usage. Cyclone restricts casts and provides tagged unions to ensure type safety.
Cyclone restricts the use of control-flow constructs like goto, setjmp, and longjmp. Cyclone introduced
the idea of using static analysis in a compiler to ensure that the code conforms to the safe subset of C.
SaferCPlusPlus [16] is a more thorough implementation of Ironclad C++ that wraps many of the
library features provided by C and C++ to ensure safety.
3.5. What work has been required to translate to previous safe subsets of C++?
As discussed in the prior section, moving from standard C/C++ code to a safe subset of C++
would likely require translating unsafe constructs to safe constructs. Generally, this task can be
accomplished by replacing unsafe constructs in the source code or instrumenting the unsafe code with
a compiler extension. For this analysis, we focus on translation at the source code level because it
better captures the workload required to move from unsafe C/C++ code to a safe subset. We also note
that compiler transformations may have difficulty capturing source-level semantics such as singleton
versus array pointers and class hierarchies that can impact the performance of the translated code. We
assume that backwards compatibility is a requirement for any safe subset and as such do not assume
that existing language constructs can be co-opted into new functionality.
Given that pointers are the main source of unsafety in C/C++, moving to a safe subset will require
replacing raw pointers with types that can provide capabilities such as null checking, bounds checking,
and lifetime checking when necessary. Prior work [12,15] has shown that the run-time performance
overhead of safety checks on pointers can be reduced by matching pointer types to their required
capabilities. With smart pointer classes, or potentially a new set of basic types for pointers, only the
types of pointers would need to be translated to move to a safe subset - operations on pointers could
use the same syntax as they do now.
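The claim that only pointer types would need to change can be illustrated with a small sketch. The checked_ptr class, the Point struct, and the sum_coords function are hypothetical names introduced for this example. The body of sum_coords is identical for a raw pointer and for the checked wrapper because the wrapper overloads the dereference and arrow operators.

```cpp
#include <stdexcept>

// Hypothetical null-checked pointer wrapper.
template <typename T>
class checked_ptr {
 public:
  checked_ptr() : ptr_(nullptr) {}
  explicit checked_ptr(T *p) : ptr_(p) {}
  T &operator*() const { return *checked(); }
  T *operator->() const { return checked(); }

 private:
  // Throws instead of dereferencing a null pointer.
  T *checked() const {
    if (ptr_ == nullptr) throw std::runtime_error("null pointer dereference");
    return ptr_;
  }
  T *ptr_;
};

struct Point {
  int x;
  int y;
};

// The pointer operations use the same syntax for Point* and
// checked_ptr<Point>; only the declared type differs at the call site.
template <typename PointPtr>
int sum_coords(PointPtr p) {
  return p->x + (*p).y;
}
```

Only null checking is shown; bounds and lifetime checks would require additional metadata in the wrapper.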
Functions that create pointers, such as malloc and new, would also need to be modified or
translated. New allocation functions would create new pointer types that carry the required metadata
for safety checks. Likewise, operations like address-of (&) may need special treatment to ensure that
metadata is maintained. Unsafe functions, like strcpy, must be replaced with safe versions of those
functions.
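As an example of what a safe replacement for strcpy might look like, the sketch below copies a string only when it fits in the destination, using the destination array's size as known at compile time. The name copy_string is hypothetical, and the function still assumes that the source is a valid null-terminated string.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical bounded replacement for strcpy. The destination is taken by
// reference to an array so that its size N is not lost to pointer decay.
template <std::size_t N>
bool copy_string(char (&dest)[N], const char *src) {
  if (src == nullptr) {
    return false;
  }
  const std::size_t len = std::strlen(src);  // assumes src is null-terminated
  if (len >= N) {
    return false;  // no room for the characters plus the terminator
  }
  std::memcpy(dest, src, len + 1);  // copies the terminator as well
  return true;
}
```

On failure the destination is left unchanged, so the caller can decide how to handle the truncation instead of overflowing the buffer.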
Unions and unchecked casts would also need to be eliminated or replaced. In general, treating
one type as another type should be checked at run-time or proven at compile-time to be safe.
Existing code may feature language constructs that are difficult to automatically translate to a safe
subset of C++ and would therefore require more programmer effort to realize a safe subset. Replacing
raw pointers with smart pointers can require more implicit conversions in a single conversion sequence
than the C++ language permits. C++ allows an implicit conversion sequence of up to
one standard conversion, up to one user-defined conversion, and up to one standard conversion
after the user-defined conversion [17]. A standard conversion consists of up to one lvalue-to-rvalue,
array-to-pointer, or function-to-pointer conversion, up to one numeric conversion, up to one function
pointer conversion, and up to one qualification conversion. The limit of one user-defined conversion
may be an issue for smart pointer replacements for raw pointers because the constructor for the smart
pointer is a user-defined conversion, and as such, a class that takes a pointer as one of the parameters
to its constructor may now need two implicit user-defined conversions. Figure 3 demonstrates this
issue. In this example, the Slice class has a constructor that takes a char* as an argument. This
constructor allows a Slice to be implicitly constructed from a string literal to match a function or
method definition that takes a Slice as a parameter. If we introduce a smart pointer class and wrap
the char* as a smart_pointer<char>, the code in the if-statement conditional would now require two
user-defined conversions. One of these conversions would need to be made explicit in the source code. Generally,
this issue can be mitigated by either applying an explicit conversion or using a first-class type instead
of a smart pointer class to replace raw pointers.
class Slice {
  char *_str;
  Slice(char *str) : _str(str) {}
  bool equals(Slice &other) {
    ...
  }
};
void CompareSlices() {
  Slice s1("OK"); // Constructor is explicit
  if (s1.equals("NOT")) { // Implicit conversion from String Literal
    ...
  }
}
Figure 3. An example of code that may be impacted by introducing smart pointers due to the limited
number of user-defined casts that can be performed per implicit conversion sequence
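The situation in Figure 3 after the pointer has been wrapped can be sketched as a compilable example. The smart_pointer class here is a hypothetical stand-in with an implicit converting constructor, and const char is used instead of char so that string literals are accepted by a conforming compiler. The static assertions record that a const char* converts implicitly to the wrapper, and the wrapper to a Slice, but a const char* does not convert implicitly to a Slice.

```cpp
#include <cstring>
#include <type_traits>

// Hypothetical wrapper; its converting constructor is a user-defined conversion.
template <typename T>
class smart_pointer {
 public:
  smart_pointer(T *p) : ptr_(p) {}  // intentionally implicit
  T *get() const { return ptr_; }

 private:
  T *ptr_;
};

class Slice {
 public:
  // Converting a smart_pointer to a Slice is a second user-defined conversion.
  Slice(smart_pointer<const char> str) : str_(str) {}
  bool equals(const Slice &other) const {
    return std::strcmp(str_.get(), other.str_.get()) == 0;
  }

 private:
  smart_pointer<const char> str_;
};

// Each single step is an allowed implicit conversion...
static_assert(std::is_convertible<const char *, smart_pointer<const char>>::value,
              "pointer to wrapper: one user-defined conversion");
static_assert(std::is_convertible<smart_pointer<const char>, Slice>::value,
              "wrapper to Slice: one user-defined conversion");
// ...but chaining both in one implicit sequence is not.
static_assert(!std::is_convertible<const char *, Slice>::value,
              "pointer to Slice would need two user-defined conversions");

inline bool compare_slices() {
  Slice s1(smart_pointer<const char>("OK"));
  // s1.equals("NOT") would be rejected by the compiler; making one of the
  // two conversions explicit resolves the problem.
  return s1.equals(smart_pointer<const char>("NOT"));
}
```

A first-class checked pointer type would avoid the problem entirely because converting to it would count as a standard conversion rather than a user-defined one.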
References have also posed a challenging issue for previous safe subsets because the dot operator
cannot be overloaded. However, references are not as difficult to secure as pointers because the
address that a reference refers to cannot be changed after the reference has been initialized. Therefore,
references can partially be secured by static rules rather than via run-time enforcement. Ironclad
C++ [15] secured references by disallowing reference class members and restricting the values that
could be used as a return value by reference. In particular, only the dereference of a smart pointer, a
reference function parameter, the dereference of the this pointer, or a class member could be returned
by reference from a method. Existing safe C++ standards also restrict the values that can be returned
by reference and generally restrict how references can be used in C++ code.
Array operations, especially on two-dimensional or higher arrays, can be difficult to implement
efficiently when bounds checks are required for safety. In C/C++, dynamically allocated two-dimensional
arrays are often represented as a pointer to a pointer. Accessing the elements of such an array therefore
requires two pointer dereferences, which may require two safety checks. This issue can be mitigated
by using a large one-dimensional array to represent a two-dimensional array, but fixing the issue
in this manner negates the productivity benefit of representing the two-dimensional array naturally.
Taking the address of an array element can also be problematic for a safe subset of C++ because the
address-of operator produces a pointer into the array that may lack the supporting metadata from the
original array. Most existing safe C++ standards restrict code to use only a single pointer indirection
and disallow multiple levels of indirection.
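The mitigation of backing a two-dimensional array with a single one-dimensional buffer can be sketched as follows. The Grid class is a hypothetical example; it performs one combined index check and one indexed access per element, and it keeps a two-index interface so that some of the notational convenience is retained. The sketch does not guard against overflow in rows * cols.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical two-dimensional grid stored in one contiguous buffer.
class Grid {
 public:
  Grid(std::size_t rows, std::size_t cols)
      : rows_(rows), cols_(cols), data_(rows * cols) {}

  // Checks both indices once, then performs a single indexed access.
  int &at(std::size_t r, std::size_t c) {
    if (r >= rows_ || c >= cols_) {
      throw std::out_of_range("Grid index out of range");
    }
    return data_[r * cols_ + c];
  }

 private:
  std::size_t rows_;
  std::size_t cols_;
  std::vector<int> data_;  // row-major storage, value-initialized to 0
};
```

Compared with a pointer-to-pointer layout, this avoids a second checked dereference for the row pointer.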
4. Methodology
We developed a static analysis tool to identify relevant features in C/C++ code and applied that
tool to all of the code samples in the Exebench benchmark suite. We also applied the same tool to a
set of larger, modern C++ applications. Our goal in using this static analysis tool was to answer three
main experimental questions.
can be found on GitHub (footnote 1). We used clang’s ASTMatchers library [27] to identify patterns matching the
language features described in the previous section. We highlight relevant rules from existing C++
standards that motivate the static analysis patterns that we chose to identify in the source code.
4.1.1. Pointers
As a first step, the static analysis tool identifies all pointers, void pointers, and smart pointers
in the C/C++ code. Enforcing safety on pointers is a critical task to move toward a safe subset, and
work will likely be required by programmers to ensure that pointers are used safely. This need to
understand how often pointers are used in existing code is motivated by existing rules in safe C++
standards and subsets.
Pointers are identified as any declaration with PointerType. On each pointer match, the tool
checks for Void Pointers, which are identified by the VoidPointerType.
Smart Pointers are identified by searching for declarations with a C++ class type with the names
unique_ptr, shared_ptr, weak_ptr, auto_ptr, and ptr. We note that ptr is not a standard library
smart pointer type, but at least one of the applications that we examined created their own smart
pointer wrappers with this name.
Unsafe Functions are identified by first matching all call expressions with Pointer arguments. The
called function name is then compared against a list of known unsafe functions, such as memcpy,
strcmp, and puts. Calls to Malloc and Free are also identified by function name.
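The name comparison step can be pictured with the sketch below. It is not the tool's code: the function names are hypothetical, the list contains only functions named in this paper rather than the tool's full list, and the matching of call expressions through clang's ASTMatchers is not shown.

```cpp
#include <string>
#include <unordered_set>

// Hypothetical deny-list lookup. The entries are limited to functions that
// this paper names as unsafe; the actual list used by the tool may differ.
inline bool is_unsafe_function(const std::string &name) {
  static const std::unordered_set<std::string> kUnsafe = {
      "memcpy", "strcmp", "puts", "strcpy"};
  return kUnsafe.count(name) != 0;
}

// Allocation and deallocation calls are classified separately by name.
inline bool is_allocation_function(const std::string &name) {
  return name == "malloc" || name == "free";
}
```

A name-based check of this kind cannot tell whether a call is wrapped by a bounds-checking helper, which is consistent with the limitation noted in the results.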
4.1.3. Casts
Unchecked casts can break type safety in C/C++ code. Existing C++ standards highlight the need
to avoid C-style casts and other unchecked casts.
1 https://github.com/crdelozier/subsets
MISRA 5-2-4: C-style casts (other than void casts) and functional notation casts (other than explicit
constructor calls) shall not be used.
As noted in the prior section, implicit casts to constructors may complicate the transition to a safe
subset because smart pointers add an extra user-defined conversion. The C++ Core Guidelines also
recommend avoiding implicit casts to constructors.
Like unsafe casts, unions break type safety in C/C++ code by allowing an implicit conversion
between types. C++ standards recommend avoiding unions and replacing them with tagged unions.
Unsafe Casts are identified as C-style casts and reinterpret_cast. Construct from Implicit Casts
are identified by finding implicit cast expressions with constructor ancestors in the abstract syntax tree.
In this case, we use the ancestor matcher instead of the parent matcher because there may be multiple
casts applied to a single constructor. Unions are identified by type.
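To make the recommendation of tagged unions concrete, the sketch below wraps a plain union in a class that records which member is active and checks the tag on every read. The class name IntOrFloat and its interface are hypothetical. In C++17 and later, std::variant from the standard library offers the same kind of checked access.

```cpp
#include <stdexcept>

// Hypothetical tagged union: the tag records the active member, and reads of
// the inactive member are rejected at run time.
class IntOrFloat {
 public:
  static IntOrFloat from_int(int v) {
    IntOrFloat r;
    r.tag_ = Tag::Int;
    r.value_.i = v;
    return r;
  }
  static IntOrFloat from_float(float v) {
    IntOrFloat r;
    r.tag_ = Tag::Float;
    r.value_.f = v;
    return r;
  }
  bool holds_int() const { return tag_ == Tag::Int; }
  int as_int() const {
    if (tag_ != Tag::Int) throw std::logic_error("active member is not int");
    return value_.i;
  }
  float as_float() const {
    if (tag_ != Tag::Float) throw std::logic_error("active member is not float");
    return value_.f;
  }

 private:
  enum class Tag { Int, Float };
  IntOrFloat() : tag_(Tag::Int) { value_.i = 0; }
  Tag tag_;
  union {
    int i;
    float f;
  } value_;
};
```

The raw union is kept private, so code outside the class cannot reinterpret the stored bytes as the wrong type.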
4.1.4. References
References can pose a uniquely difficult problem to solve for a safe subset because they can cause
similar lifetime and initialization errors to pointers, but they cannot be checked as effectively due to
the inability to check for null references or to overload the dot operator. Existing C++ standards and
subsets specify rules for how references should be used, especially in the context of return by reference.
C++ Core F.43: Never (directly or indirectly) return a pointer or reference to a local object.
JSF AV 111: A function shall not return a pointer or reference to a non-static local object.
MISRA 7-5-1: A function shall not return a reference or a pointer to an automatic variable (including
parameters) defined within the function.
MISRA 7-5-3: A function shall not return a reference or pointer to a parameter that is passed by
reference or const reference.
Reference Class Members are identified as reference type declarations with a class ancestor. Reference
Returns and const Reference Returns are identified by finding function declarations that return a reference
type and further identifying the constant references. Reference to Dereferenced are identified as variable
declarations with a reference type that are initialized with the pointer dereference unary operator.
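The return-by-reference rules quoted above can be illustrated with a short sketch. The Account class and make_label function are hypothetical. Returning a reference to a class member is among the forms that Ironclad C++ permits, while the commented-out function shows the pattern that the quoted rules prohibit.

```cpp
#include <string>
#include <utility>

class Account {
 public:
  explicit Account(std::string owner) : owner_(std::move(owner)) {}

  // Reference to a class member: valid for as long as this object is alive.
  const std::string &owner() const { return owner_; }

 private:
  std::string owner_;
};

// Prohibited pattern (kept as a comment so that the example compiles cleanly):
//   const std::string &bad_label() {
//     std::string s = "tmp";
//     return s;  // reference to a local object that is destroyed on return
//   }

// Returning by value avoids the dangling reference.
inline std::string make_label(const Account &account) {
  std::string label = "owner:" + account.owner();
  return label;
}
```

A purely syntactic matcher for reference returns would flag owner() as well, which is why further analysis is needed to separate acceptable cases from dangerous ones.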
4.1.5. Arrays
A safe subset cannot provide spatial safety without bounds checking on array operations.
Out-of-bounds reads and writes are still some of the most common vulnerabilities found in C/C++
code [2]. We examine two issues with arrays that may affect a safe subset. First, array to pointer decay
may lose information about the bounds of the array. Array to pointer decay can happen in multiple
contexts, including when the address of an array element is taken. Bounds checking on arrays may also
hinder the run-time performance of a safe subset if two-dimensional arrays are not handled carefully.
Address of Array Subscript operations are identified by the address-of operator with an array
subscript operand, and 2D Arrays are identified as array subscript expressions with an array subscript
expression as an ancestor. We note that this may also find further nested arrays, such as three
dimensional arrays.
Table 2. Modern C++ applications studied with the static analysis tool.
5. Results
We present the results collected by running the static analysis tool on 5.8 million code samples
from Exebench and the modern C++ applications listed in Table 2.
bit of work needs to be done to eliminate raw pointers from C++ code to transition to a safe subset that
can provide checked pointer operations.
[Figure 4: bar chart. x-axis: exebench, cereal, fmt, folly, json, redex; y-axis: 0% to 100%.]
Figure 4. Pointers found in Exebench and other C++ applications. Each bar shows the percentage of all
pointer variables that are raw pointers, void pointers, and smart pointers.
These results may be somewhat expected based on the existing safe C++ standards. Only the C++
Core Guidelines advocates strongly for smart pointer use, and its recommendations for using smart
pointers are limited to certain allocation ownership scenarios. On the other hand, previous work on
safe subsets of C/C++ has advocated that smart pointers, or other pointer type replacements, will be
necessary to dynamically enforce memory safety properties.
Likewise, the results for void pointers may be expected based on existing standards. Void
pointers are generally used to implement polymorphism without class hierarchies or to enable storage
of data with any type. We expect that the relatively large ratios of void pointers found in these
modern C++ applications are used to enable storage of data with any type, especially for use cases
such as serialization. Existing standards caution against the use of dynamic_cast, which may hinder
performance and cause unexpected run-time errors when a cast fails. To avoid the use of dynamic_cast,
programmers may err on the side of using void pointers to implement polymorphism.
[Figure 5: bar chart. x-axis: exebench, cereal, fmt, folly, json, redex; y-axis ticks from 10 to 1000000.]
Figure 5. Calls to unsafe functions found in Exebench and other C++ applications
We find that modern C++ code is still using unsafe library functions, despite decades of
recommendations to stop using them. In Exebench, 27% of all code samples use an unsafe library
function. It is possible that these uses of unsafe library functions are wrapped as recommended by
multiple C++ standards, but our static analysis tool cannot effectively check for this appropriate
wrapping. Work will be required to replace or wrap uses of these unsafe library function calls to move
toward a safe subset of C++. As future work, we also intend to identify which unsafe functions are
still widely used and why they are still used instead of safe alternatives.
[Figure 6: chart. y-axis ticks from 100 to 1000000.]
Figure 6. Casts and unions found in Exebench and other C++ applications
5.4. References
Figure 7 shows the total number of reference types used in the contexts of class members, return
values, and references initialized by the dereference of a pointer. We do not include results for the
Exebench suite because we found a negligible number of references used in the function samples. Only
a small number of Exebench code samples used return by reference. On the other hand, the modern
C++ applications examined in this study use references in all of these contexts. We note that not all
references are dangerous, and return by reference may be acceptable if the returned value does not live
beyond its scope. More work will be required on our static analysis tool to narrow down the cases
in which returning by reference may be dangerous, based on the rules presented in the existing C++
standards.
Reference class members and references to dereferenced pointers create the possibility of invalid
references due to initialization or lifetime errors, unless the initialization is checked. It is difficult
to check for initialization in these cases, and more work is required to determine if these cases are
dangerous or not. In general, references are commonplace enough in modern C++ code that a solution
will be required to ensure memory safety for references in a safe subset of C++.
[Figure 7: chart. Series: Reference Class Members, Reference Returns, const Reference Returns, Reference to Dereferenced Pointer; x-axis: cereal, fmt, folly, json, redex; y-axis ticks from 50 to 50000.]
5.5. Arrays
Figure 8 shows the total number of potentially problematic array operations in the C++ samples
and applications. Overall, arrays are not nearly as prevalent in these applications as pointers and
references. Some work may be required to ensure that array metadata is maintained and to prevent
arrays from decaying to pointers. Further, performance optimizations may be required for
two-dimensional and higher-dimensional arrays. However, the effort required to deal with issues
related to arrays seems small compared to handling pointers and references.
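To make the decay problem concrete, the sketch below contrasts a decayed pointer parameter (shown in a comment) with two ways of keeping the array length attached to the data. The function names `sum` and `checked_get` are hypothetical and chosen only for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Decayed form (not compiled): the length is lost and must be passed
// separately, so nothing ties `n` to the real size of the array.
//   int sum_decayed(const int* data, std::size_t n);

// Taking the array by reference keeps its length N in the type.
template <std::size_t N>
int sum(const int (&data)[N]) {
    int total = 0;
    for (std::size_t i = 0; i < N; ++i) {
        total += data[i];
    }
    return total;
}

// std::array carries its size and offers bounds-checked access via at().
inline int checked_get(const std::array<int, 3>& a, std::size_t i) {
    return a.at(i);  // throws std::out_of_range when i >= 3
}

// Usage: both forms know the array length without a separate argument.
inline void array_example() {
    int raw[4] = {1, 2, 3, 4};
    assert(sum(raw) == 10);
    const std::array<int, 3> a = {{7, 8, 9}};
    assert(checked_get(a, 2) == 9);
}
```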
5.6. Summary
To summarize, we return to the three questions posed at the beginning of the methodology. For
(Q1), we find that pointers are common both in C++ code at large and in more modern C++ code.
Although modern C++ code has begun to adopt smart pointers, raw pointers (and even void pointers)
remain much more common than smart pointers. Significant work, either at the source level or within
the compilation toolchain, will be required to replace raw pointers in a safe subset of C++.
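At the source level, one form such a replacement could take is sketched below: an owning raw pointer becomes a `std::unique_ptr`, and non-owning access becomes a reference. The type `Node` and both functions are hypothetical; this is one possible rewrite under those assumptions, not a prescription from any of the surveyed subsets.

```cpp
#include <cassert>
#include <memory>

// Hypothetical type used only for illustration.
struct Node {
    int value;
    explicit Node(int v) : value(v) {}
};

// Raw owning pointer (the pattern to replace, not compiled): the caller
// must remember to delete the result exactly once.
//   Node* make_node_raw(int v) { return new Node(v); }

// Owning smart pointer: the Node is destroyed when the unique_ptr is.
// (std::make_unique<Node>(v) is the usual spelling in C++14 and later.)
inline std::unique_ptr<Node> make_node(int v) {
    return std::unique_ptr<Node>(new Node(v));
}

// Non-owning access is expressed as a reference rather than a raw pointer.
inline int read_value(const Node& n) { return n.value; }

// Usage: ownership is released automatically at the end of the scope.
inline void ownership_example() {
    std::unique_ptr<Node> n = make_node(5);
    assert(n != nullptr);
    assert(read_value(*n) == 5);
}
```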
For (Q2), we find that problematic constructs like unsafe functions, unsafe casts, and unions are
found in both C++ code at large and modern C++ code. Again, work will be required to transition
from existing C/C++ code toward a safe subset of C++ that avoids these unsafe constructs.
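As one example of avoiding an unsafe cast or union, type punning through a union member or `reinterpret_cast` can be replaced by copying the object representation with `std::memcpy` (C++20 also provides `std::bit_cast` for this purpose). The function name `float_bits` is hypothetical, and the values in the usage function assume IEEE 754 single-precision floats.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Unsafe patterns this replaces (not compiled):
//   union { float f; std::uint32_t u; } pun;  pun.f = x;  return pun.u;
//   return *reinterpret_cast<const std::uint32_t*>(&x);

// Copies the bytes of a float into an integer of the same size.
inline std::uint32_t float_bits(float f) {
    static_assert(sizeof(std::uint32_t) == sizeof(float),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Usage (assumes IEEE 754 single precision).
inline void float_bits_example() {
    assert(float_bits(0.0f) == 0u);
    assert(float_bits(1.0f) == 0x3F800000u);
}
```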
For (Q3), we find that modern C++ code is not all that different from C++ code at large in its use of
unsafe constructs. The one main difference appears to be the use of references, as we find little to
no use of references in the Exebench code samples.
7. Conclusions
We analyzed the code characteristics of 5.8 million code samples from the Exebench benchmark
suite and 5 modern C++ applications using a static analysis tool. Our analysis of C++ code, both at
large and in the context of more modern C++ practices, has revealed important insights. We have
found that raw pointers remain prevalent in both categories, despite some adoption of smart pointers
in modern C++ code. This indicates a substantial need for efforts, whether at the source code level or
within the compilation tool-chain, to replace raw pointers with safer alternatives in a safe subset of
C++.
Furthermore, we find that problematic constructs such as unsafe functions, unsafe casts, and
unions persist in both C++ code at large and modern C++ code. This underscores
the necessity of a transition from existing C/C++ code towards a safer subset of C++ that avoids these
hazardous constructs.
Lastly, we observed that modern C++ code does not significantly differ in terms of safety when
compared to C++ code at large. The primary distinguishing factor is the usage of references, which are
notably absent or rarely utilized in the Exebench code samples. In sum, these findings emphasize the
importance of ongoing efforts to enhance the safety and modernization of C++ code-bases.
References
1. Hinnant, H.; Orr, R.; Stroustrup, B.; Vandevoorde, D.; Wong, M. DG Opinion on Safety for ISO C++. In The
C++ Standards Committee; JTC1, SC22, WG21. ISO: 2023; Document Number P2759R0.
2. National Vulnerability Database CWE Over Time. Available online:
https://nvd.nist.gov/general/visualizations/vulnerability-visualizations/cwe-over-time (accessed on 27 July 2023).
3. National Security Agency Cybersecurity Information Sheet. Available online:
https://media.defense.gov/2022/Nov/10/2003112742/-1/-1/0/CSI_SOFTWARE_MEMORY_SAFETY.PDF (accessed on 27 July 2023).
4. Armengol-Estape, J.; Woodruff, J.; Brauckmann, A.; Magalhaes, J.; De Souza, W.; O’Boyle, M. Exebench: An
ML-Scale Dataset of Executable C Functions. In Proceedings of the 6th ACM SIGPLAN International Symposium
on Machine Programming, New York, NY, USA, 2022. MAPS 2022, Association for Computing Machinery, 50–59.
5. Zhou, J.; Criswell, J.; Hicks, M. Fat Pointers for Temporal Memory Safety in C. In Proceedings of the ACM on
Programming Languages. OOPSLA 2023, Association for Computing Machinery, 316–347.
6. Miller, M. Trends, Challenges, and strategic shifts in the software vulnerability mitigation landscape.
BlueHat IL, February 7th, 2019. Available online:
https://github.com/Microsoft/MSRC-Security-Research/blob/master/presentations/2019_02_BlueHatIL/2019_01%20-%20BlueHatIL%20-%20Trends%2C%20challenge%2C%20and%20shifts%20in%20software%20vulnerability%20mitigation.pdf
(accessed on 4 Oct 2023).
7. Taylor, A.; Whalley, A.; Jansens, D.; Nasko, O. An update on Memory Safety in Chrome. Google Security Blog,
September 21, 2021. Available online:
https://security.googleblog.com/2021/09/an-update-on-memory-safety-in-chrome.html (accessed on 4 Oct 2023).
8. Biden, J. Executive Order on Improving the Nation’s Cybersecurity. The White House, May 12, 2021.
Available online:
https://www.whitehouse.gov/briefing-room/presidential-actions/2021/05/12/executive-order-on-improving-the-nations-cybersecurity/
(accessed on 6 Oct 2023).
9. Periklis Akritidis, Manuel Costa, Miguel Castro, and Steven Hand. 2009. Baggy bounds checking: an efficient
and backwards-compatible defense against out-of-bounds errors. In Proceedings of the 18th conference on
USENIX security symposium (SSYM’09). USENIX Association, USA, 51–66.
10. Santosh Nagarakatte, Jianzhou Zhao, Milo M.K. Martin, and Steve Zdancewic. 2009. SoftBound: highly
compatible and complete spatial memory safety for c. SIGPLAN Not. 44, 6 (June 2009), 245–258.
https://doi.org/10.1145/1543135.1542504
11. George C. Necula, Jeremy Condit, Matthew Harren, Scott McPeak, and Westley Weimer. 2005. CCured:
type-safe retrofitting of legacy software. ACM Trans. Program. Lang. Syst. 27, 3 (May 2005), 477–526.
https://doi.org/10.1145/1065887.1065892
12. Jeremy Condit, Matthew Harren, Scott McPeak, George C. Necula, and Westley Weimer. 2003. CCured in
the real world. In Proceedings of the ACM SIGPLAN 2003 conference on Programming language design
and implementation (PLDI ’03). Association for Computing Machinery, New York, NY, USA, 232–244.
https://doi.org/10.1145/781131.781157
13. Trevor Jim, J. Greg Morrisett, Dan Grossman, Michael W. Hicks, James Cheney, and Yanling Wang. 2002.
Cyclone: A Safe Dialect of C. In Proceedings of the General Track of the annual conference on USENIX
Annual Technical Conference (ATEC ’02). USENIX Association, USA, 275–288.
14. Todd M. Austin, Scott E. Breach, and Gurindar S. Sohi. 1994. Efficient detection of all pointer and array
access errors. In Proceedings of the ACM SIGPLAN 1994 conference on Programming language design
and implementation (PLDI ’94). Association for Computing Machinery, New York, NY, USA, 290–301.
https://doi.org/10.1145/178243.178446
15. Christian DeLozier, Richard Eisenberg, Santosh Nagarakatte, Peter-Michael Osera, Milo M.K. Martin, and
Steve Zdancewic. 2013. Ironclad C++: a library-augmented type-safe subset of c++. In Proceedings of
the 2013 ACM SIGPLAN international conference on Object oriented programming systems languages
and applications (OOPSLA ’13). Association for Computing Machinery, New York, NY, USA, 287–304.
https://doi.org/10.1145/2509136.2509550
16. SaferCPlusPlus, Hardened C++ for the internet age. Available online:
http://duneroadrunner.github.io/SaferCPlusPlus/ (accessed on 25 October 2023).
17. International Standard ISO/IEC 14882:2020. Programming Languages – C++. International Organization for
Standardization, 2020.
18. High Integrity C++ Standard. Available online:
https://www.perforce.com/resources/qac/high-integrity-cpp-coding-standard (accessed on 27 Oct 2023).
19. SEI CERT C++ Coding Standard. Available online:
https://wiki.sei.cmu.edu/confluence/pages/viewpage.action?pageId=88046682 (accessed on 29 Oct 2023).
20. Joint Strike Fighter Air Vehicle C++ Coding Standards. Available online:
https://www.stroustrup.com/JSF-AV-rules.pdf (accessed on 29 Oct 2023).
21. C++ Core Guidelines. Available online:
https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines.html (accessed on 29 Oct 2023).
22. AUTOSAR Guidelines for the use of the C++14 language in critical and safety-related systems. Available
online: https://www.autosar.org/fileadmin/standards/R22-11/AP/AUTOSAR_RS_CPP14Guidelines.pdf
(accessed on 29 Oct 2023).
23. MISRA Publications. Available online: https://misra.org.uk/publications/ (accessed on 01 Oct 2023).
24. Anderson Faustino da Silva, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha,
Breno Campos Ferreira Guimarães, and Fernando Magno Quinão Pereira. Anghabench: A suite with one
million compilable c benchmarks for code-size reduction. In 2021 IEEE/ACM International Symposium on
Code Generation and Optimization (CGO), pages 378–390, 2021. doi: 10.1109/CGO51591.2021.9370322.
25. Georgios Gousios, Bogdan Vasilescu, Alexander Serebrenik, and Andy Zaidman. Lean GHTorrent: Github
data on demand. MSR, 2014. doi: 10.1145/2597073.2597126
26. SSH CRC32 attack detection code contains remote integer overflow. Available online:
https://www.kb.cert.org/vuls/id/945216 (accessed on 30 Oct 2023).
27. AST Matcher Reference. Available online: https://clang.llvm.org/docs/LibASTMatchersReference.html
(accessed on 30 Oct 2023).