Softwarex: Francisco Ortin, Javier Escalada
SoftwareX
journal homepage: www.elsevier.com/locate/softx
Article info
Article history:
Received 25 January 2021
Received in revised form 7 May 2021
Accepted 13 May 2021
Keywords: Big code; Mining software repositories; Machine learning; C programming language; Stochastic program generation; Python

Abstract
The Big Code and Mining Software Repositories research lines analyze large amounts of source code to improve software engineering practices. Massive codebases are used to train machine learning models aimed at improving the software development process. One example is decompilation, where C code and its compiled binaries can be used to train machine learning models to improve decompilation. However, obtaining massive codebases of portable C code is not an easy task, since most applications use particular libraries, operating systems, or language extensions. In this paper, we present Cnerator, a Python application that provides the stochastic generation of large amounts of standard C code. It is highly configurable, allowing the user to specify the probability distributions of each language construct, properties of the generated code, and post-processing modifications of the output programs. Cnerator has been successfully used to generate code that, utilized to train machine learning models, has improved the performance of existing decompilers. It has also been used in the implementation of an infrastructure for the automatic extraction of code patterns.
© 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Code metadata
Software metadata
Current software version: v.1.0.1
Permanent link to executables of this version: https://github.com/ComputationalReflection/Cnerator/releases
Legal software license: BSD 3-Clause
Operating system: Linux, Windows and Mac OS
Installation requirements & dependencies: numpy PyPI Python package
User manual: https://github.com/ComputationalReflection/Cnerator/blob/main/user-manual.md
Support email for questions: [email protected]
∗ Corresponding author at: University of Oviedo, Computer Science Department, Federico Garcia Lorca 18, 33007, Oviedo, Spain.
E-mail addresses: [email protected] (Francisco Ortin), [email protected] (Javier Escalada).
URLs: http://www.reflection.uniovi.es/ortin (Francisco Ortin), https://www.javierescalada.es (Javier Escalada).
https://doi.org/10.1016/j.softx.2021.100711
2352-7110/© 2021 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Francisco Ortin and Javier Escalada SoftwareX 15 (2021) 100711

1. Introduction

“Big code” is a recent line of research that brings together big data and source code analysis [1]. It is based on using the source code of millions of programs to build different types of tools to improve software development [2]. Machine learning is used to create useful predictive models that learn common patterns from a large number of source code applications [3,4]. For example, JavaScript programs are used to train a model capable of deobfuscating variable names from their usage [5]; Java and C# files with the same behavior are used to learn automatic translation between these two languages [6]; and vulnerable C source code is used to train models that predict vulnerabilities by analyzing the source code of new programs [7].

Likewise, the Mining Software Repositories (MSR) field analyzes the rich data available in source code repositories to uncover information about software systems and projects [8]. In this case, the source code data is enriched with other information taken from defect tracking systems, archived communications between project personnel, version control systems, and question-and-answer sites [9]. Examples of MSR projects include software repair models that analyze bug fix transactions in software repositories [10], change prediction systems that identify the code prone to change in subsequent releases [11], and the automatic retrieval of help information for source code fragments using question-and-answer websites [12].

One of the languages used to build those machine learning models is the C programming language. Since its creation in the 1970s, C has remained in use, particularly for the development of systems software, embedded system applications, and programs that access specific hardware addresses [13]. Its low demand for runtime system resources and its wide availability have made it a usual candidate to implement language interpreters and computationally intensive programs. According to the Tiobe [14], LangPop [15], and the Transparent Language Popularity Index [16] programming language rankings, C was still the most widely used language in January 2021¹, and it held the fifth position in the PYPL [17], Redmonk [18] and Trendy Skills [19] rankings.

There exist many different variants of the C programming language, which include language extensions and modifications depending on the operating system, compiler and target hardware. Therefore, different ANSI/ISO standardizations of C are defined to facilitate the development of portable software [20]. However, it is still difficult to find applications written in 100% standard C source code that could be compiled with many different compilers. Most of the existing open-source applications have particular dependencies on non-portable code. This is an issue when building predictive models from source code, since a large number of programs is usually required [21].

There exist tools capable of generating random C source code, but they are mainly aimed at testing compilers, rather than creating machine learning models [22]. Therefore, they are not designed to build massive amounts of standard C code, and they do not cover every language construct; we detail them in Section 2 and evaluate them in Section 5.

For this reason, we developed Cnerator, a Python application that generates large amounts of standard ANSI/ISO C source code [20] to train machine learning models. Cnerator is highly customizable to generate all the syntactic constructs of the C language, which is necessary to build accurate predictive models with machine learning algorithms. The code it generates is ready to be compiled by any standard language implementation. Cnerator has been used to improve state-of-the-art decompilers [23] and to implement an infrastructure for the automatic extraction of code patterns [24]. The stochastic generation of source code programs has also been used to detect bugs in existing compilers [25]. Another potential use of Cnerator is testing whether a compiler implements the ANSI/ISO standard specification correctly.

The rest of this article is structured as follows. Section 2 details the related work, and the software functionality and architecture are presented in Section 3. An illustrative example is described in Section 4. Section 5 evaluates Cnerator and compares it with related approaches. Conclusions are presented in Section 6.

¹ These rankings measure the popularity of each programming language, using different criteria. For example, the Tiobe language ranking uses 25 search engines to calculate each language index, depending on the number of searches done by users (https://www.tiobe.com/tiobe-index/programming-languages-definition).

2. Problems and background

As mentioned, there are tools for the random generation of C code. Most of them are aimed at finding bugs in C compilers, rather than at training machine learning models.

Csmith is a well-known generator of random C programs [25] created as a fork of randprog [26]. Its main purpose is the detection of bugs in C compilers. Generated programs conform to the C99 standard, and they avoid the undefined behavior constructs specified in C99. To find compiler bugs, each generated program is compiled by different compilers and executed. If the checksum of the global variables upon program termination differs from that of the other executions, the compiler that produced that binary has an error (i.e., randomized differential testing). Csmith implements different safety mechanisms such as pointer analysis, bounded loop constructs, and different dynamic checks. Csmith has been used to detect more than 325 errors in existing compilers, including the verified CompCert C compiler [27].

ldrgen is a tool for the random generation of C programs to test compilers and program analysis tools [28]. Existing systems generate large amounts of dead code, which the compiler deletes, so the resulting binary contains little relevant code. For this reason, ldrgen implements a liveness analysis algorithm during program generation to avoid producing dead code. It is implemented as a plugin for the Frama-C extensible framework [29]. ldrgen has been used to detect missed compiler optimizations [30].

YARPGen is a random test-case generator for C and C++ compilers, created to find and report compiler bugs [31]. YARPGen was created to overcome the saturation point reached by existing compiler testing methods, where very few bugs are found. This is not because compilers are bug-free, but rather because generators contain biases that make them incapable of testing specific parts of compiler implementations. YARPGen generates programs free of undefined behaviors without dynamic safety checks, unlike Csmith. Its approach is to implement different static analyses to generate code that conservatively avoids undefined behaviors. It also implements generation policies that systematically skew probability distributions to cause certain optimizations to take place more often. YARPGen has found more than 220 bugs in GCC, LLVM, and the Intel C++ compiler. Those bugs were not previously found by other compiler testing tools.

The family of Orange random C code generators is focused on generating arithmetic expressions [32]. Instead of differential testing, they track the expected values of each test, checking after execution whether the obtained values are the expected ones. The programs generated by Orange generators are safe, avoiding the undefined behaviors of the C programming language. Orange code generators do not include important language features such as control flow statements, structs, arrays or pointers.

Quest is a code generator tool aimed at finding compiler bugs related to calling conventions [33]. It generates function declarations randomly, and then generates type-driven test cases that invoke each function. A global variable is generated for each parameter and return value. Assertions are used to check that each value received and returned is the appropriate one. Quest avoids undefined behavior by simply not generating potentially dangerous constructs (e.g., arithmetic expressions). It was used to find 13 bugs in 5 different compilers [33].
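The randomized differential testing scheme attributed to Csmith earlier in this section can be illustrated with a small Python sketch. It is not Csmith's implementation: the "compilers" are hypothetical Python callables that stand in for compiling and running a generated program, `checksum_of` is a toy replacement for the checksum of the global variables, and using a majority vote to decide which result is the odd one out is an assumption made for this example.

```python
from collections import Counter
from typing import Callable, Dict, List


def checksum_of(program: str) -> int:
    """Toy stand-in for the checksum of a program's global variables."""
    return sum(program.encode("utf-8")) % 65536


def correct_backend(program: str) -> int:
    """Hypothetical compiler that always produces the expected checksum."""
    return checksum_of(program)


def faulty_backend(program: str) -> int:
    """Hypothetical compiler that miscompiles programs using a shift operator."""
    value = checksum_of(program)
    return (value + 1) % 65536 if "<<" in program else value


def suspect_compilers(program: str,
                      compilers: Dict[str, Callable[[str], int]]) -> List[str]:
    """Return the compilers whose checksum disagrees with the majority result."""
    observed = {name: run(program) for name, run in compilers.items()}
    # Assumption: the most frequent checksum is taken as the reference value.
    majority, _ = Counter(observed.values()).most_common(1)[0]
    return sorted(name for name, value in observed.items() if value != majority)


# Hypothetical compiler names; each maps to a compile-and-run stand-in.
COMPILERS = {
    "cc_a": correct_backend,
    "cc_b": correct_backend,
    "cc_c": faulty_backend,
}
```

Calling `suspect_compilers` on many randomly generated programs and collecting the non-empty results gives the list of candidate miscompilations to inspect by hand. In a real setting, each callable would invoke an actual compiler and execute the resulting binary.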
Fig. 3. Two example JSON files used to customize program generation with Cnerator. The left-hand side shows a sample probability specification file, and the
right-hand side specifies an example of controlled function generation.
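To make the idea of a probability specification more concrete, the following Python sketch shows one way a JSON document could drive the stochastic choice of language constructs. It is only an illustration under stated assumptions: the JSON keys, the `TEMPLATES` table, and the function names are hypothetical and do not reproduce Cnerator's actual file format or API, which are described in its user manual [34].

```python
import json
import random
from typing import Dict, List

# Hypothetical specification: each key names a choice point in the generator,
# and each value maps the possible outcomes to their probabilities.
SPEC_JSON = """
{
  "statement_kind": {"assignment": 0.5, "if": 0.3, "while": 0.2}
}
"""

# Hypothetical C templates, one per outcome of "statement_kind".
TEMPLATES = {
    "assignment": "x = x + 1;",
    "if": "if (x > 0) { x = 0; }",
    "while": "while (x < 10) { x = x + 1; }",
}


def load_distributions(text: str) -> Dict[str, Dict[str, float]]:
    """Parse the specification and check that each distribution sums to 1."""
    spec = json.loads(text)
    for name, distribution in spec.items():
        if abs(sum(distribution.values()) - 1.0) > 1e-9:
            raise ValueError(f"probabilities of '{name}' do not sum to 1")
    return spec


def sample(distribution: Dict[str, float], rng: random.Random) -> str:
    """Draw one outcome according to the given probabilities."""
    outcomes = list(distribution)
    weights = [distribution[outcome] for outcome in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate_statements(spec: Dict[str, Dict[str, float]],
                        count: int, seed: int) -> List[str]:
    """Generate `count` C statements; a fixed seed makes the run repeatable."""
    rng = random.Random(seed)
    return [TEMPLATES[sample(spec["statement_kind"], rng)] for _ in range(count)]
```

For instance, `generate_statements(load_distributions(SPEC_JSON), 20, seed=7)` returns twenty statements drawn with the specified weights. Changing the probabilities in the JSON text changes the mix of constructs without touching the generator code.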
[6] Karaivanov S, Raychev V, Vechev M. Phrase-based statistical translation of programming languages. In: Proceedings of the 2014 ACM international symposium on new ideas, new paradigms, and reflections on programming & software. Onward! 2014, New York, NY, USA: ACM; 2014, p. 173–84.
[7] Yamaguchi F, Lottmann M, Rieck K. Generalized vulnerability extrapolation using abstract syntax trees. In: Proceedings of the 28th annual computer security applications conference. ACSAC ’12, New York, NY, USA: ACM; 2012, p. 359–68.
[8] Herzig K, Zeller A. Mining your own evidence. In: Making software. O’Reilly Media, Inc.; 2010.
[9] Jung W, Lee E, Wu C. A survey on mining software repositories. IEICE Trans Inf Syst 2012;E95.D(5):1384–406.
[10] Martinez M, Monperrus M. Mining software repair models for reasoning on the search space of automated program fixing. Empir Softw Eng 2015;20(1):176–205.
[11] Malhotra R, Khanna M. Software change prediction: A systematic review and future guidelines. e Inf Softw Eng J 2019;13(1):227–59.
[12] Ponzanelli L, Bavota G, Di Penta M, Oliveto R, Lanza M. Mining stackoverflow to turn the IDE into a self-confident programming prompter. In: Proceedings of the 11th working conference on mining software repositories. MSR 2014, New York, NY, USA: Association for Computing Machinery; 2014, p. 102–11.
[13] Meyerovich LA, Rabkin AS. Empirical analysis of programming language adoption. In: Proceedings of the international conference on object-oriented programming systems languages & applications. OOPSLA ’13, New York, NY, USA: Association for Computing Machinery; 2013, p. 1–18.
[14] Tiobe. Tiobe programming language index for january 2021. 2021, https://www.tiobe.com/tiobe-index.
[15] LangPop. Programming language popularity. 2021, http://65.39.133.14.
[16] SourceForge. The transparent language popularity index. 2021, http://lang-index.sourceforge.net.
[17] PYPL. Popularity of programming languages. 2021, https://pypl.github.io.
[18] Redmonk. The redmonk programming language ranking. 2021, https://redmonk.com/sogrady/2020/07/27/language-rankings-6-20.
[19] Trendy Skills. Extracting skills that employers seek in the IT industry. 2021, https://trendyskills.com.
[20] ISO. ISO/IEC 9899:2018 - information technology – programming languages – c. 2018, https://www.iso.org/standard/74528.html.
[21] Babii H, Janes A, Robbes R. Modeling vocabulary for big code machine learning. 2019, http://arxiv.org/abs/1904.01873.
[22] Chen J, Patra J, Pradel M, Xiong Y, Zhang H, Hao D, Zhang L. A survey of compiler testing. ACM Comput Surv 2020;53(1).
[23] Escalada J, Scully T, Ortin F. Improving type information inferred by decompilers with supervised machine learning. 2021, http://arxiv.org/abs/2101.08116.
[24] Escalada J, Ortin F, Scully T. An efficient platform for the automatic extraction of patterns in native code. Sci Program 2017;2017:1–16.
[25] Yang X, Chen Y, Eide E, Regehr J. Finding and understanding bugs in c compilers. In: Proceedings of the 32nd ACM SIGPLAN conference on programming language design and implementation. PLDI ’11, New York, NY, USA: Association for Computing Machinery; 2011, p. 283–94.
[26] Eide E, Regehr J. Volatiles are miscompiled, and what to do about it. In: Proceedings of the 8th ACM international conference on embedded software. EMSOFT ’08, New York, NY, USA: Association for Computing Machinery; 2008, p. 255–64.
[27] Leroy X. Formal verification of a realistic compiler. Commun ACM 2009;52(7):107–15.
[28] Barany G. Liveness-driven random program generation. In: Fioravanti F, Gallagher JP, editors. Logic-based program synthesis and transformation. Cham: Springer International Publishing; 2018, p. 112–27.
[29] Cuoq P, Kirchner F, Kosmatov N, Prevosto V, Signoles J, Yakobowski B. Frama-c. In: Eleftherakis G, Hinchey M, Holcombe M, editors. Software engineering and formal methods. Berlin, Heidelberg: Springer Berlin Heidelberg; 2012, p. 233–47.
[30] Barany G. Finding missed compiler optimizations by differential testing. In: Proceedings of the 27th international conference on compiler construction. CC 2018, New York, NY, USA: Association for Computing Machinery; 2018, p. 82–92.
[31] Livinskii V, Babokin D, Regehr J. Random testing for c and c++ compilers with YARPGen. In: Proceedings of the ACM on programming languages. Vol. 4, (OOPSLA). New York, NY, USA: Association for Computing Machinery; 2020.
[32] Nagai E, Awazu H, Ishiura N, Takeda N. Random testing of C compilers targeting arithmetic optimization. In: Proceedings of workshop on synthesis and system integration of mixed information technologies, SASIMI ’12, 2012, p. 48–53.
[33] Lindig C. Random testing of c calling conventions. In: Proceedings of the sixth international symposium on automated analysis-driven debugging. AADEBUG’05, New York, NY, USA: Association for Computing Machinery; 2005, p. 3–12.
[34] Ortin F, Escalada J. Cnerator user manual. 2021, https://github.com/ComputationalReflection/Cnerator/blob/main/user-manual.md.
[35] Ortin F, Quiroga J, Redondo JM, Garcia M. Attaining multiple dispatch in widespread object-oriented languages. Dyna 2014;81(186):242–50.
[36] Langa L. Singledispatch python package 3.4.0.3. 2021, https://pypi.org/project/singledispatch/.
[37] Rodriguez-Prieto O, Mycroft A, Ortin F. An efficient and scalable platform for java source code analysis using overlaid graph representations. IEEE Access 2020;8:72239–60.
[38] Erich G, Richard H, Ralph J, John V. Design patterns: elements of reusable object-oriented software. Addison-Wesley Professional Computing Series; 1995.
[39] Ortin F, Escalada J. Cnerator developer manual. 2021, https://computationalreflection.github.io/Cnerator.
[40] Ortin F, López B, Pérez-Schofield JBG. Separating adaptable persistence attributes through computational reflection. IEEE Softw 2004;21(6):41–9.