GPU4S Benchmark
∗ Universitat Politècnica de Catalunya
† Barcelona Supercomputing Center (BSC), Spain
‡ Airbus Defence and Space, Toulouse, France
§ European Space Agency, The Netherlands
Abstract—The constant demand for increased on-board performance in future space missions calls for the exploration and adoption of new hardware architectures, able to provide high performance in a low power envelope. In the GPU4S (GPU for Space) project, funded by the European Space Agency (ESA), we study the applicability of embedded Graphics Processing Units (GPUs) in space, in order to show whether current and future on-board processing algorithms can benefit from them, as well as to select the appropriate embedded GPU which can satisfy the performance needs of future missions. However, in the absence of relevant benchmarking solutions for space applications and GPUs, a new benchmark suite had to be designed. In this work, we describe the design and implementation of the GPU4S Bench benchmark suite, which has been specifically designed to achieve the goals of our project, accompanied by some indicative results.

I. INTRODUCTION AND MOTIVATION

The space industry is facing a dramatic increase in the performance required by future missions, as future spacecraft need to acquire orders of magnitude more data than existing ones, supporting much higher resolutions, precision and sampling frequencies. Moreover, robotic exploration missions such as the Rosalind Franklin ExoMars rover [1] and new types of space missions and concepts like the space tug [2] and active debris removal [3] require highly autonomous operations, which need significant on-board processing capabilities.

Embedded Graphics Processing Units (GPUs) have shown great potential for high performance processing in temperature- and battery-constrained devices, following the widespread and successful use of GPUs in the high-performance domain. The GPU4S (GPU for Space) [4] project, funded by the European Space Agency (ESA), aims at evaluating, for the first time, the potential of embedded GPUs for use in space, and at defining the roadmap of GPU adoption in space. In order to achieve this long term goal, several intermediate steps have to be accomplished.

First, we need to confirm that existing and, mainly, future aerospace software can be effectively parallelised in order to exploit the performance advantage of GPUs. This is because GPUs are known to work well with certain types of algorithms which require massively parallel processing, but also exhibit regular behaviour in memory accesses and branching.

Second, we need to ensure that programming of this new architecture can be mastered with reasonable effort by industry, so that highly efficient software versions can be achieved without excessive investment in development cost. For example, the Cell Broadband Engine (CBE) [5], jointly designed by IBM, Sony and Toshiba, was one of the most powerful and energy efficient architectures of its time, but it proved notoriously difficult to program [6], limiting its industry adoption beyond the gaming sector.

Finally, we need to experimentally test various embedded GPUs to identify the most promising candidates for space use, primarily based on their performance and energy efficiency, as well as their software tools and libraries. This will result in a selection of the most promising IP (intellectual property) GPU solution as well as the most promising COTS (commercial off-the-shelf) GPU solution. The former will be used for the development of a radiation hardened version of an embedded GPU IP based on European technology, which is a leader in embedded GPU designs, to support Europe's non-dependence in the space domain. The latter can be used in the shorter term to enable the fast adoption of GPUs in space.

In order to achieve the aforementioned steps within the project, there is a clear need for a benchmarking suite which can provide all the information required to take the right decisions. However, we notice a considerable lack of standard benchmarking solutions in the space domain, especially regarding GPUs.

In addition to the theoretical MIPS (Million Instructions Per Second) and MFLOPS (Million Floating Point Operations Per Second) metrics of a given processing technology, in the single core domain we notice the use of proprietary representative benchmarks known as ESTEC-mix and Rudstone [7], as well as the Euclid mission software provided by ESA [8]. Multi-core evaluations include microbenchmarks [9], non-space representative parallel benchmarks such as SPLASH [10] and proprietary space applications like GAIA's mission [11].
This gap in benchmarking solutions has been acknowledged by ESA, which addressed it by developing the NGDSP (Next Generation Digital Signal Processing) Benchmark [12], [13]. However, this benchmark suite focuses only on signal processing, which is just one of the relevant space domains.

In general GPU benchmarking, several open benchmark suites exist, such as Rodinia [14] and Parboil [15]; however, their algorithms are not representative of the ones used in space processing. In critical domains the only existing solution is the EEMBC ADASMark [16]; however, it is only available for a single GPU programming model (OpenCL) and it is not representative of on-board processing algorithms either.

In order to cover this gap, which was essential for our project, we have developed an open source GPU benchmark suite focusing on space on-board processing. Our suite, GPU4S Bench, has the following characteristics: it covers all space domains, it is portable to any embedded GPU and it is easily extensible to cover new programming models. Moreover, it is configurable, supporting different endianness and several data types and input sizes, in order to cover the requirements of both existing and future missions. In addition to randomised inputs and reference CPU implementations of each benchmark, it is also accompanied by representative inputs and outputs and a comparison framework, which allow not only to check for the equivalence of the output but also to evaluate the precision of the computation. Finally, at the end of the project its source will be released as open source.

GPU4S Bench differs from other benchmark suites in that, in addition to performance and energy efficiency evaluation, it can also provide additional insights which are crucial for our project, such as the ease of development and optimisation for a given GPU programming model, as well as the programming efficiency compared to optimal implementations provided by GPU vendors.

II. DESIGN PRINCIPLES

The GPU4S Bench design covers these requirements:

Open Source: The benchmarks need to be free of company IP rights. Although the development is funded by ESA, which owns the rights to the software, and the purpose of its funded projects is to use their outcomes for the benefit of its member states, an open source benchmark suite maximises the potential of becoming the de facto means of performance and energy efficiency comparison between embedded GPUs for space, as well as allowing reproducibility and the crowdsourcing of results from new architectures. Being able to directly compare results from the same suite among different targets saves time and reduces costs, while it enables more straightforward decisions for the hardware of future space programmes. Such a benefit has already been observed with the NIR HAWAII-2RG BM algorithm [8], which has been used in several ESA-funded and internal ESA activities for benchmarking numerous platforms. Such an algorithm is much more useful than an advanced and complete but proprietary space processing application, e.g. [11], which cannot be reproduced in future studies performed by different contractors.

Space Relevance and Space Domain Coverage: The benchmarks contained in GPU4S Bench need to be relevant to the space domain and accompanied by representative input data, while covering as many space domains as possible. The benchmarks need to cover not only existing on-board processing, in terms of both algorithms and input configurations, but also future cases envisioned in the different space domains and their performance requirements.

Portability and Fairness of Comparison: The benchmarks should not be limited to a given embedded GPU, e.g. by supporting a single programming model or architecture, but should be able to run on a variety of GPUs. If the benchmark code has to be differentiated to support different platforms, this should be limited to the absolute minimum. The non-platform-dependent portion of the benchmark should be identical on all platforms, in order to allow fair comparison. If support for a new platform needs to be added in the future, the benchmark structure should allow such an extension in an easy manner.

Configurability: The software needs to support multiple configurations in terms of input sizes, execution parameters and data types used in the computation, as well as different endianness, for comparison with existing on-board processors.

Correctness and Computation Precision: The software should be functionally correct, following a standard implementation if available. A reference CPU implementation should be provided. The software should be accompanied by reference outputs in order to check the correctness of the target platform outputs. If the outputs do not match, which can happen in parallel algorithms using floating point arithmetic [17], the precision of the computation should be evaluated.

Reproducibility: The benchmark results should be reproducible. If random input generation is supported, replication of the results should be possible when needed.

Ease of Development and Optimisation Evaluation: The development of the benchmark suite should allow the assessment of the difficulty of software development and optimisation for GPUs. Moreover, it should enable the evaluation of the performance of the optimised version in comparison with the optimal implementation of the software.

III. DESIGN OF GPU4S BENCH

In order to design a highly relevant benchmark suite covering as many space domains as possible, we have performed a survey of on-board software across all Airbus Defence and Space divisions, collecting requirements and types of software used in each space domain for current and future missions. In order to maximise our domain coverage and also comply with our freedom from company IP restrictions, we decided to identify common algorithmic building blocks between different space domains. Table I provides an overview of the identified algorithms which, according to our analysis, span more than one domain and are included in GPU4S Bench. An additional reason for the selection of these particular building blocks is the availability of near optimal implementations from GPU vendors, which allows comparing the ease of development and evaluation as discussed later.
TABLE I
FUNDAMENTAL BUILDING BLOCKS EXTRACTED FROM CURRENT AND FUTURE ON-BOARD APPLICATIONS ACROSS ALL SPACE DOMAINS.

Building Block                 | Compression    | Vision Based Navigation | Image Processing          | Neural Network Processing | Signal Processing
-------------------------------|----------------|-------------------------|---------------------------|---------------------------|------------------------
Fast Fourier Transform         | -              | GENEVIS [18]            | -                         | -                         | ADS-B [19], NGDSP [13]
Finite Impulse Response Filter | -              | -                       | -                         | -                         | ADS-B [19], NGDSP [13]
Discrete Wavelet Transform     | CCSDS 122 [20] | -                       | -                         | -                         | -
Pairwise Orthogonal Transform  | CCSDS 122 [20] | -                       | -                         | -                         | -
Predictor                      | CCSDS 122 [20] | -                       | -                         | -                         | -
Matrix Computation             | -              | GENEVIS [18]            | -                         | Inference                 | -
Convolution Kernel             | -              | OpenCV                  | GO3S [21], GENEVIS [18]   | Inference                 | -
Max Detection                  | -              | -                       | GO3S [21]                 | Inference                 | ADS-B [19]
Synchronization Mechanism      | -              | GENEVIS [18]            | EUCLID NIR [8], GO3S [21] | TensorFlow                | ADS-B [19], NGDSP [13]
Memory Allocation              | CERES [22]     | OpenCV                  | EUCLID NIR [8], GO3S [21] | TensorFlow                | ADS-B [19], NGDSP [13]
In addition to the individual building blocks, we have also selected to implement two space-relevant complex applications, in order to be able to demonstrate additional effects that are only visible when performing a chain of GPU operations. In Section IV we provide detailed information on the GPU4S benchmarks.

In addition to the benchmark selection, we have chosen appropriate parameters, e.g. input sizes, which match the requirements of existing and future missions, and we have defined representative inputs for them. Each algorithm is implemented in a parametric way to support multiple data types: single precision floating point (float), double precision floating point (double) and 32-bit integer (int). For the GPU implementation of the benchmarks we also support half precision (16-bit) floating point (half), in order to evaluate its performance impact compared with the rest of the data types. However, since this data type is not supported natively in CPUs, we do not provide any means for its functional validation.

Furthermore, we added the option of random input generation, to increase the potential test cases and to ensure that the benchmark implementations behave as expected under different inputs. To guarantee reproducibility, we print the random seed used in an experiment with randomised input, and we support setting a specific seed for repeating experiments.

Since we required a standard implementation for each building block, we have chosen to follow the specification of existing widespread software for our reference CPU implementation and our equivalent GPU code. In particular, we follow the implementation of MATLAB/GNU Octave, which are standard tools frequently used for prototyping on-board algorithms and which we have also used for our output validation methodology, explained in Section V. In addition, we follow the implementation of optimised vendor GPU libraries, both for functional validation and for the development and optimisation efficiency comparison discussed further in Section VI, which in all but one case (neural network convolution) match the MATLAB specification.

IV. THE GPU4S BENCH SUITE

The domains we covered in our analysis are the following: Earth and sky observation, as well as science missions, are primarily focused on image processing and analysis for processing the acquired data, and on compression for transmitting them to ground. In telecommunication satellites, on the other hand, the dominant type of computation is signal processing.

Future missions are expected to be highly autonomous, in order to support applications such as Active Debris Removal [3] and robotic exploration. These functionalities can be enabled using Vision-Based Navigation for Guidance, Navigation and Control (GNC) and AI (Artificial Intelligence) inference solutions based on Deep Neural Networks (DNN).

Next we examine the building blocks and complete applications and how they fit in the above domains.

A. Building Blocks

Fast Fourier Transform: The Fast Fourier Transform (FFT) is a ubiquitous algorithm used across several space domains, mainly in telecommunications and, in its 2D form, for image analysis. Moreover, it is an essential part of the ADS-B (Automatic Dependent Surveillance-Broadcast) system [19], a surveillance technology in which an aircraft determines its position via satellite navigation. In the latter case, the FFT is applied in a sliding window of 128 points. For the library implementation of this block we use the cuFFT library for NVIDIA targets and the clFFT library for OpenCL targets, and for validation we use the MATLAB fft function.

Finite Impulse Response: The Finite Impulse Response (FIR) filter is also widely used in signal processing. No vendor-provided library supports this block, so for validation we only use the MATLAB convolution function (conv) between the signal and the filter taps.

Matrix Operations: Matrix operations are used in many domains, especially matrix multiplication, which is one of the most well studied and well understood benchmarks. In particular, it is used in vision-based navigation for perspective corrections and in neural processing, where it is the fundamental way of implementing inference in a fully connected network. We use the MATLAB matrix multiplication functionality for functional validation, and the cuBLAS and clBLAS libraries for the development efficiency evaluation.

Discrete Wavelet Transform, Pairwise Orthogonal Transform and Predictor: These algorithmic blocks are used in compression, within the CCSDS-122 standard [20]. Since these building blocks contribute to a single domain, and it is known from previous ESA studies that this algorithm is difficult to parallelise due to dependencies, we prioritised the implementation of the rest of the algorithms.

Convolution: Convolution is used on images in its 2D form for vision-based navigation and in image processing. It is also fundamental for the implementation of Convolutional Neural Networks (CNN). Note however that the former implements convolution with padding, functionally equivalent to MATLAB, while the latter is implemented without padding, as defined in the cuDNN library from NVIDIA.

Max: Maximum detection is common across several domains: image, neural network and signal processing. It is also included in the softmax and max-pooling operations of the cuDNN library. Other DNN building blocks included in GPU4S Bench are the ReLU and the Local Response Normalisation.

Synchronisation and Memory Allocation: The effects of these operations are visible in complex processing chains; for this reason they are exercised in the complex applications of our suite. They can be found in all space domains.

B. Complex Applications

CIFAR-10: This application performs inference using a neural network with 10 layers, trained with the CIFAR-10 dataset [23]. Each layer is implemented by reusing the neural network building blocks from the individual benchmarks.

Euclid NIR: This application [8] is widely used in ESA projects, therefore we performed its GPU parallelisation.

For both individual building blocks and applications we support a mode in which multiple frames are processed, in order to amortise the additional overhead of GPU memory allocations and transfers.

V. FUNCTIONAL VALIDATION AND PRECISION

Our benchmarks are configurable to use floating point or integer formats. A given GPU may favour the implementation for a certain format, but the results might not be bit-identical. However, we need to know whether the results of a benchmark execution are functionally correct, and also what the precision of the computation on the target GPU is. For the benchmarks with an available library implementation, we validate our benchmark results against the library implementation for its supported data types; e.g. cuFFT supports only single or double precision floating point. For the rest of the cases, we designed a methodology to measure the residual error of the target GPU against a double precision computation on a CPU.

For each benchmark, we include a sequential reference implementation written in C, which we execute in the maximum precision format, e.g. the double precision floating point format (64-bit). The benchmark is paired with a representative reference input data set in 64-bit binary format of the same input type. The output of the reference implementation is also stored in binary form in the IEEE-754 double format, producing the reference output, also known as the golden output or ground truth.

For functional verification, we execute the target implementation and store the results, which are subsequently converted to the IEEE-754 double binary format. Finally, the target binary output is compared to the reference binary output and we select the maximum difference value, which is the residual error.

We selected the binary representation for the input, output and comparison because a decimal representation of an IEEE-754-encoded value does not always carry enough significant digits. For example, two floating point values printed with 15 digits after the decimal point cannot always be distinguished: both values (in IEEE-754 format) 0x327d7168acfae2c8 and 0x327d7168acfae2cc are displayed as 1.747362713315761e-65. For this reason, we have written a MATLAB/GNU Octave program which performs the comparison of the benchmark output against the golden output and displays the residual error. As an indication, the residual error reported by our procedure for the matrix multiplication benchmark using the reference input on several NVIDIA platforms is negligible, 7.4351e-15, which is acceptable considering that much of on-board processing is currently performed in fixed point and with far fewer precision bits. Since both our implementations and the CUDA library provide identical results, this difference from our sequential reference CPU version comes from the parallel execution of floating point instructions, which slightly affects the result [17].

VI. EASE OF DEVELOPMENT AND OPTIMISATION

Our benchmark suite design does not only focus on the comparison of the hardware platforms it is executed on, but also on their software stacks, including their libraries and programming models. In order to support code portability across all embedded GPUs, we designed our benchmarks to allow multiple implementations in different programming models, as described in detail in the next section. Moreover, even within the same programming model, we provide three different implementations: a naive implementation, an optimised version and a version using the vendor-provided library, if available. The naive implementation is a straightforward GPU parallelisation which may be suboptimal, but is portable across different GPUs. The optimised version is parametric and includes common optimisations like thread coarsening, tiling/blocking (use of shared memory in order to reduce main memory bandwidth and allow reuse across threads in the same block/workgroup) and loop unrolling. This optimised version requires tuning on the specific GPU target, in order to find the optimal configuration of the optimisation parameters. Finally, the vendor-provided library version is assumed to be the most optimised version available for a GPU.

All implementations of a given benchmark are undertaken by the same expert GPU developer and the development time of each version is recorded. This allows three different types of evaluation. First, by comparing the performance of the naive and optimised implementations against the vendor optimised library, we can get an indication of how close a handwritten implementation can get to the true performance capabilities of the GPU. If the naive implementation is good enough, there is no need for further optimisation of the code. Otherwise, the optimised version can provide an indication of how close to the optimal performance one can get with reasonable effort.
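To make the tiling/blocking optimisation mentioned above concrete, the following C sketch contrasts a naive matrix multiplication with a tiled one. It is a hypothetical CPU analogue of the GPU shared-memory version, not code from the suite: on a GPU the tile would reside in shared memory and the illustrative `TILE` constant would be tuned to the block/workgroup size.

```c
#include <stddef.h>
#include <string.h>

#define TILE 32 /* illustrative tile size; a tuning parameter in the real optimised versions */

/* Naive row-major multiply, C = A * B, all n x n: the structure of the
   "naive" GPU version, where each output element maps to one thread. */
void matmul_naive(const float *A, const float *B, float *C, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

/* Tiled/blocked multiply: the iteration space is reordered so that a
   TILE x TILE sub-matrix is reused many times while it is hot (in cache
   here, in shared memory on a GPU), reducing main memory traffic. */
void matmul_tiled(const float *A, const float *B, float *C, size_t n) {
    memset(C, 0, n * n * sizeof(float));
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; i++)
                    for (size_t k = kk; k < kk + TILE && k < n; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + TILE && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Both versions compute the same product; only the traversal order changes, which is why such a version still needs per-target tuning of the tile size.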
TABLE II
TIME SPENT IN THE DEVELOPMENT OF EACH BENCHMARK (TOTAL AND PER-VERSION TIME TO DEVELOP, IN HOURS)

Algorithm                            | Total | CPU | CUDA | OpenCL | CUDA Opt | OpenCL Opt | CUDA Lib | OpenCL Lib
-------------------------------------|-------|-----|------|--------|----------|------------|----------|-----------
Matrix Multiplication                | 8.5   | 1   | 1    | 0.5    | 1        | 1          | 2        | 2
FFT                                  | 33    | 8   | 14   | 2      | 5        | 1          | 2        | 1
FFT window (ADS-B)                   | 12    | 2   | 3    | 2      | 1        | 1          | 1        | 2
FFT 2D                               | 11    | 5   | -    | -      | -        | -          | 3        | 3
Convolution 2D                       | 11    | 5   | 1    | 1      | 2        | 1          | 1        | -
Relu                                 | 4     | 0.5 | 0.5  | 0.5    | 1        | 0.5        | 1        | -
Max Pooling                          | 6     | 1   | 1    | 0.5    | 2        | 0.5        | 1        | -
Softmax                              | 10    | 1   | 1    | 2      | 3        | 2          | 1        | -
Local Response Normalization        | 3.5   | 0.5 | 0.5  | 0.5    | 0.5      | 0.5        | 1        | -
Finite Impulse Response              | 4     | 1   | 2    | 1      | -        | -          | -        | -
Neural Network for CIFAR-10          | 40    | -   | 10   | 10     | 8        | 8          | 4        | -
Neural Network for CIFAR-10 Multiple | 15    | -   | 2    | 2      | 4        | 5          | 2        | -
Total                                | 158   | 25  | 36   | 22     | 27.5     | 20.5       | 19       | 8
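Returning to the validation methodology of Section V: the residual-error check reduces to the maximum absolute element-wise difference between the golden and target outputs, both decoded as IEEE-754 doubles. A minimal C sketch follows; the function name and in-memory arrays are illustrative, since the suite itself performs this comparison offline, in MATLAB/GNU Octave, over the binary files.

```c
#include <math.h>
#include <stddef.h>

/* Illustrative comparison step: both arrays hold IEEE-754 doubles, as read
   from the binary golden and target output files. A zero result means the
   outputs are equivalent; a small non-zero result quantifies the precision
   difference of the target computation. */
double residual_error(const double *golden, const double *target, size_t n) {
    double max_diff = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = fabs(golden[i] - target[i]);
        if (d > max_diff)
            max_diff = d;
    }
    return max_diff;
}
```

Binary files are used rather than printed decimals because a double needs up to 17 significant decimal digits to round-trip, more than the 16 produced by printing 15 digits after the decimal point, as the 0x327d7168acfae2c8 / ...c2cc example in Section V illustrates.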
Second, the recording of the development time can provide an indication of the development effort compared with the additional performance gained. For example, if the optimised version comes with a 20% increase in development time but provides a bigger performance increase, e.g. 50%, then the optimisation can be deemed worthwhile. Third, the development time of the algorithm provides useful information about how easy and productive it is to use a certain programming model. If the naive implementation takes too long to develop, or much longer than in another programming model, this might influence the decision of selecting a certain GPU architecture.

Finally, it is worth noting that the handwritten implementations are also important for another reason. In critical systems subject to certification/qualification (which is not the case for all space systems, though), the availability of the source code of all software is critical. However, this is not the case for vendor-provided libraries, which are provided in a black box form, with only their interfaces available.

Table II provides the development time of each of the building blocks of GPU4S. Although OpenCL is a lower-level language than CUDA and therefore more challenging, this is not reflected in the development time of the benchmarks. The reason is that we first implemented the three CUDA versions before moving to the OpenCL ones, and as such we greatly benefited from basing our OpenCL implementations on the CUDA versions. Moreover, we notice that the time it took us to implement the reference CPU implementation is almost equivalent to that of the first (naive) CUDA implementation, which confirms the ease of development.

Although we cannot provide full performance results for all the platforms we have evaluated so far, since the project is still ongoing, we provide some indicative results on two embedded platforms based on NVIDIA GPUs, which are the ones with the most vendor-provided libraries available: the NVIDIA TX2 and the NVIDIA Xavier. For floating point matrix multiplication, our hand-written versions provide very low performance, around 5 and 10% of the provided library. However, for double precision on the Xavier we outperform the NVIDIA library, probably because it is not yet optimised for this platform. Surprisingly, we also outperform both cuFFT and cuDNN basic blocks for our tested sizes. The reason in both cases is that both libraries have a large initialisation cost, which is not amortised at the relatively small sizes (64K elements) required in on-board processing. Moreover, cuDNN is more efficient for the training of neural networks, as opposed to the inference in which we are interested. However, if the initialisation cost of cuFFT is excluded and it is executed repeatedly 30K times, our handwritten versions achieve only 10-15% of its performance.

VII. BENCHMARK STRUCTURE

Our benchmark suite is designed to use multiple programming models, similar to other general purpose GPU suites. However, our main difference is that, in order to guarantee a fair comparison between them, we do not provide separate benchmark implementations which may structurally differ for each programming model, as e.g. in Rodinia [14]; instead, each benchmark is divided into target-agnostic and target-specific portions. Although in the current version of the GPU4S suite we only support CUDA and OpenCL, our design can accommodate other programming models, which we plan to support in the future, such as OpenGL ES 2, OpenGL SC 2, Brook Auto [24] and Vulkan, in order to cover the entire embedded GPU hardware space. Our implementation allows us to share as much code as possible between the different programming model versions. The structure of each benchmark is as follows:

1) Platform initialization: this step performs the necessary actions to select the compute device etc. It is mostly required for OpenCL, since the user needs to explicitly initialize the compute environment.

2) Read reference input files: in this step the application reads the reference input files in the format explained in Section V, and puts them in the host memory of the accelerator. This is a platform independent step.

3) Copy the reference input data to the GPU memory: the benchmark initiates the transfer towards the accelerator. This step uses different calls in CUDA and OpenCL, but it provides a common function interface with a different implementation per programming language.
4) Kernel Invocation: in this step the computation is offloaded to the GPU. Its implementation depends on CUDA or OpenCL, but it is done similarly to the previous step. As explained in Section VI, there are three different versions which can be selected: the naive, the hand-optimised and the vendor-provided library implementation.

5) Copy the reference output back to the host memory: this is the same as step 3 with the opposite direction of transfer. Steps 3 to 5, or only step 4, can be executed multiple times in a loop, in order to increase the execution time of the benchmark so that the measurement is not subject to measurement precision errors. Moreover, this allows simulating realistic use scenarios of the GPU, like repetitive transfers to the GPU between computations, or a single transfer of data followed by several kernel invocations between transfers.

6) Write reference output file: in this step, the benchmark output is saved in a file, using the binary format of its implementation data type. This functionality is platform independent. If the benchmark output type does not match the reference output type, an offline tool written in MATLAB/GNU Octave performs the conversion and the comparison.

Fig. 1. Generic benchmark structure: Main.cpp and benchmark_library.h at the top, with the CPU (lib_cpu.h, lib_cpu.cpp), CUDA (lib_cuda.cu) and OpenCL (lib_opencl.cpp, kernel.cl) implementations below.

The main benchmark body is platform independent, written in C for maximum portability, and it only contains calls to the different steps. The platform independent steps are implemented in a common library, while the platform dependent steps are implemented in separate libraries with the same contracts. Preprocessor directives and different makefile rules ensure that each version uses the appropriate components. The CUDA and OpenCL kernels are implemented in separate files, but they support the same interface. A visual representation of the benchmark structure is provided in Figure 1. The main data types involved in the computations, for both the main benchmark body and the kernel implementations, are written in such a way that they can be changed, so that different benchmark versions are implemented for the data types of interest. This is achieved either with preprocessor directives or with typedefs, depending on the particular needs of each benchmark.

VIII. CONCLUSIONS

In this paper we described the design and implementation of the GPU4S benchmark suite, which targets GPU on-board processing for space applications. We have explained our design principles and design decisions, as well as our methodology. Finally, we have presented some indicative results, showing that the benchmark suite covers all our needs.

REFERENCES

[1] K. McManamon et al., "ExoMars Rover Vehicle Perception System Architecture and Test Results," 12th Symposium on Advanced Space Technologies in Robotics and Automation (ASTRA), 2017.
[2] M. Mammarella et al., "The Lunar Space Tug: A sustainable bridge between low Earth orbits and the Cislunar Habitat," Acta Astronautica, vol. 138, pp. 102-117, 2017.
[3] S. Kawamoto et al., "Current Status of Research and Development on Active Debris Removal at JAXA," 7th European Conference on Space Debris (SDC7), 2017.
[4] L. Kosmidis et al., "GPU4S: Embedded GPUs for Space," in Digital System Design (DSD) Euromicro Conference, 2019.
[5] M. Gschwind et al., "Synergistic Processing in Cell's Multicore Architecture," IEEE Micro, vol. 26, no. 2, pp. 10-24, March 2006.
[6] A. Arevalo et al., Programming the Cell Broadband Engine Architecture: Examples and Best Practices. IBM Red Books, 2008.
[7] J. Gaisler, "Benchmarking of 32-bit processors for space applications," ESA/ESTEC, Tech. Rep. WDI/JG/2105/NL Issue 4, 20-11-1995, 1995. [Online]. Available: http://microelectronics.esa.int/erc32/misc/ERC32-ADA-Benchmarking-Gaisler-1995-11-20.pdf
[8] A. Jung and P.-E. Crouzet, "The H2RG Infrared Detector: Introduction and Results of Data Processing on Different Platforms," European Space Agency (ESA), Presentation, 2012. [Online]. Available: http://www.esa.int/Our_Activities/Space_Engineering_Technology/Onboard_Data_Processing/General_Benchmarking_and_Specific_Algorithms
[9] F. J. Cazorla et al., "Multicore OS Benchmarks," Final Report, ESA-ESTEC, Tech. Rep. RFQ-3-13153/10/NL/JK, 2012. [Online]. Available: http://microelectronics.esa.int/gr740/MulticoreOSBenchmark-FinalReport_v7.pdf
[10] G. Beltrame et al., "Benchmarks for the GINA platform," Final Report, ESA-ESTEC, Tech. Rep., 2008. [Online]. Available: http://microelectronics.esa.int/gr740/GINABench.pdf
[11] D. Hellström and F. Cros, "RTEMS SMP Final Report: Development Environment for Future Leon Multi-core," Final Report, ESA-ESTEC, Tech. Rep. RTEMSSMP-FR-00, 2015. [Online]. Available: http://microelectronics.esa.int/gr740/RTEMS-SMP-FinalReport-CGAislerASD-OAR.pdf
[12] J. Franklin, "NGDSP Benchmarking & SDE Evaluation. Final Report. European DSP Trade-off and Definition Study," ESA-ESTEC, Tech. Rep. 22645/09/NL/LvH, 2012. [Online]. Available: http://spacewire.esa.int/edp-page/events/DSP_Day_Presentation_Astrium_UK_NGDSP-tradeoff-study.PDF
[13] ESA, "On-Board Data Processing - Benchmarks," 2012. [Online]. Available: https://www.esa.int/Enabling_Support/Space_Engineering_Technology/Onboard_Data_Processing/General_Benchmarking_and_Specific_Algorithms
[14] S. Che et al., "Rodinia: A Benchmark Suite for Heterogeneous Computing," in 2009 IEEE International Symposium on Workload Characterization (IISWC), Oct 2009, pp. 44-54.
[15] J. Stratton et al., "Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing," University of Illinois at Urbana-Champaign, Urbana, Tech. Rep. IMPACT-12-01, Mar. 2012.
[16] EEMBC. (2019) The ADASMark Benchmark. [Online]. Available: https://www.eembc.org/adasmark/
[17] N. Whitehead et al., "Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs," NVIDIA, Tech. Rep., 2011.
[18] R. Brochard et al., "Scientific Image Rendering for Space Scenes with the SurRender Software," in 69th International Astronautical Congress (IAC), 2018.
[19] M. Strohmeier, M. Schäfer, V. Lenders, and I. Martinovic, "Realities and Challenges of Nextgen Air Traffic Management: The Case of ADS-B," IEEE Communications Magazine, vol. 52, no. 5, pp. 111-118, 2014.
[20] The Consultative Committee for Space Data Systems, Image Data Compression Recommended Standard CCSDS 122.0-B-2, 2017.
[21] E. Maliet et al., "Geostationary Observation Space Surveillance System (GO3S) Real Time Video From Space," in 65th International Astronautical Congress (IAC), 2014.
[22] CNES, "CERES: Three satellites to boost France's intelligence capabilities," 2019. [Online]. Available: https://ceres.cnes.fr/en/ceres-2
[23] A. Krizhevsky, "Learning Multiple Layers of Features from Tiny Images," University of Toronto, Tech. Rep., 2009.
[24] M. M. Trompouki and L. Kosmidis, "Brook Auto: High-Level Certification-Friendly Programming for GPU-powered Automotive Systems," in 55th Annual Design Automation Conference (DAC), 2018.