
Research Paper

Development of a hardware-accelerated simulation kernel for ultra-high vacuum with Nvidia RTX GPUs

The International Journal of High Performance Computing Applications 2021, Vol. 0(0) 1–12
© The Author(s) 2021
Article reuse guidelines: sagepub.com/journals-permissions
DOI: 10.1177/10943420211056654
journals.sagepub.com/home/hpc

Pascal R Bähr1,2, Bruno Lang2, Peer Ueberholz3, Marton Ady1 and Roberto Kersevan1

1 CERN, Geneva, Switzerland
2 University of Wuppertal, Wuppertal, Germany
3 Niederrhein University of Applied Sciences, Krefeld, Germany

Corresponding author:
Pascal R Bähr, CERN TE-VSC-VSM, Espl. des Particules 1M28400, 1211 Geneva 23, Switzerland.
Email: [email protected]

Abstract
Molflow+ is a Monte Carlo (MC) simulation software for ultra-high vacuum, mainly used to simulate pressure in particle
accelerators. In this article, we present and discuss the design choices arising in a new implementation of its ray-tracing–
based simulation unit for Nvidia RTX Graphics Processing Units (GPUs). The GPU simulation kernel was designed with
Nvidia’s OptiX 7 API to make use of modern hardware-accelerated ray-tracing units, found in recent RTX series GPUs
based on the Turing and Ampere architectures. Even with the challenges posed by switching to 32 bit computations, our kernel runs much faster than Molflow+'s simulation engine does on comparable CPUs, at the expense of a marginal drop in calculation precision.

Keywords
Molflow, vacuum, Monte Carlo, simulations, graphics processing units, RTX, Nvidia

1. Introduction

Molflow+ deploys the Test Particle Monte Carlo (TPMC) method, which is especially useful for complex vacuum components as it converges sufficiently fast with accurate results. In particular, with the introduction of time-dependent simulations (see Ady (2016)), the demand for high-performing simulations has been rising, as ever more Monte Carlo events have to be processed in order to make accurate predictions about a given model.

Designing vacuum components is often a trial-and-error process. In early design phases, where continuous feedback is desirable and queue times for HPC resources are often too long, physicists and engineers often utilize everyday computer hardware in the form of laptops or office computers, so running simulations on high-performance computers is not always an option. In later stages, simulations shift to more powerful hardware, as improved convergence accuracy becomes necessary. Thus, making use of the available, and often affordable, hardware is desirable, and Graphics Processing Units (GPUs) are always a good candidate for computationally demanding tasks. In the following, we will refer to Molflow+ when the C++ implementation of the CPU simulation engine is explicitly meant, and to Molflow (without the suffix) when talking about general aspects of the algorithm.

Fundamentally, Molflow's simulation algorithm is based on ray tracing, which was always thought to benefit greatly from Single Instruction Multiple Data (SIMD) processing via GPUs, due to its independent workload. Surprisingly, a previous study on the GPU compatibility of Molflow concluded that the benefits of a GPU kernel for the simulations are marginal at most, mainly due to the computationally demanding and complex nature of the algorithm (see Kersevan and Pons (2009)). Driven mainly by the needs of 3D computer graphics, such as movie rendering or computer games, there have been constant advancements in more efficient ray-tracing algorithms and dedicated computing units, for example, Deng et al. (2017). Recent advancements in GPU hardware resulted not only in an increase of raw processing power, but also in specially designed ray-tracing units, for example, in Nvidia's RTX GPUs. With the Turing architecture (see NVIDIA (2018)), and also its recent successor Ampere, Nvidia introduced so-called RT cores, hardware units that implement the ray-tracing algorithm directly in hardware. Comparing an RTX 2080 against its predecessor GTX 1080 Ti, based on the Pascal architecture, the new hardware has been proven to be ∼10 times faster in terms of processed rays per second.
In this paper, we revisit the suitability of Molflow's algorithm for GPU processing by investigating the possible performance increase with both a software- and hardware-accelerated GPU ray-tracing engine based on Nvidia's ray-tracing API OptiX 7 (see Parker et al. (2010)). Our approach leverages the capabilities of dedicated ray-tracing units as much as possible and tries to work around the hardware-bound constraints. Considering a common vacuum component with well-known properties, we investigate the potential performance gains, as well as the question whether the reduced precision of the ray-tracing algorithm has a negative impact on the precision of the results that would make it an unfeasible choice for simulations with Molflow.

The remainder of the paper is organized as follows. First, we consider the contributions of other studies that researched the usability of RTX hardware, with a focus on comparable physical MC simulation applications (see section 1.1). Next, in section 2, we briefly discuss the principles of Molflow's algorithm and introduce the OptiX 7 API. In section 3, we discuss the design choices for Molflow's GPU simulation engine. At last, in section 4, we conduct a performance and precision study to evaluate how the GPU code compares to Molflow's CPU implementation for different parameters and geometries, where we use a set of cylindrical vacuum tubes with varying length–radius ratios.
1.1. Previous work

Studies on the use of RTX hardware for applications beyond typical rendering or even ray-tracing tasks have already been undertaken. For example, Wald et al. (2019) applied a point-in-polyhedron picking algorithm to the common ray-tracing problem. They achieved speedups of up to ∼6.5× with an optimized RTX kernel compared to a full CUDA implementation on a TITAN RTX, while on the predecessor TITAN V without RT cores, the CUDA implementation even prevailed.

Molflow is a Monte Carlo simulation software used for vacuum simulations. For similar applications, which are used to study different physical phenomena and are also based on a ray-tracing engine, some initial studies on the usability of RTX hardware have already been conducted. With an RTX compatible extension for OpenMC (see Romano et al. (2015)), speedups of about ∼33× could be observed for triangle meshes compared to comparable CPUs (see Salmon and McIntosh-Smith (2019)). OpenMC is a Monte Carlo particle transport simulation code commonly used for the modeling of nuclear reactors, which is capable of simulating various nuclear reactions for neutrons and photons. For the open-source software Opticks, an OptiX powered GPU engine is used for the simulation of optical photons as an interface for Geant4. Referencing RTX hardware, the author observes speedups of ∼1660× compared to CPU simulations and a speedup of ∼5× for simulations with the RTX feature disabled (see Blyth (2020)). Discrepancies in the results were mainly attributed to be geometry related, where remaining discrepancies were briefly attributed to arithmetic precision. Opticks uses CSG to represent geometries instead of polygon or triangle meshes, which are not further accelerated by RTX hardware.

Given that potential performance benefits using RTX acceleration have been studied in comparable Monte Carlo simulators, we give a further outlook for RTX usage in addition to a performance study by including a comparison of the accuracy of our algorithms and discussing the real-life suitability for such simulations with a well-studied vacuum component. Further, we elaborate a few design choices, such as the handling of random number generation or the usage of thread-local counters, which are necessary for a fast and more robust kernel and which can be applied to similar physical applications.

2. Background

The fundamental algorithm of Molflow's vacuum simulation engine is based on ray tracing, where particles are traced within vacuum components until they hit the boundary, and the accumulated statistics are used to derive various key figures for these components. We introduce the Molflow+ algorithm and give a brief introduction to the features of Nvidia's ray-tracing API OptiX 7, which we utilize for Molflow's GPU kernel.

2.1. Ray tracing in Molflow+

Molflow+ uses the TPMC to simulate free molecular flow in systems in the (ultra-)high vacuum regime (see Ady (2016)). Here, gas molecules are assumed to be independent of each other; hence, no inter-molecular collisions have to be accounted for. 3D models of vacuum components are described via polygons, which we refer to as facets. Importing geometries via formats like STL is possible, and whenever feasible, the given triangle mesh will be converted into a more general polygon mesh. The simulation algorithm then translates into the following steps.

First, a particle is generated according to our desorption model, which considers the particle's source location, the rectilinear trajectory, and the velocity. Individual particles are then traced inside a vacuum component by identifying the location of the collision with the component's boundary. The amount of necessary intersection tests is reduced by grouping polygons together inside axis-aligned bounding boxes (AABB), ultimately forming a tree structure. Traversing this bounding volume hierarchy (BVH) results in a logarithmic amount of simplified ray-box intersections, followed by only a few final ray-polygon intersection tests

when evaluating the BVH leaf nodes. Following a verified hit, statistics are gathered at the hit location depending on the type of the facet. A particle either remains in the system, getting reflected, or it gets pumped.
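As an illustration of this loop, the following self-contained C++ sketch (our simplification, not Molflow code) traces test particles through an idealized cylindrical tube with a perfectly diffuse side wall and absorbing end caps, replacing the facet mesh and BVH of the real kernel with an exact cylinder. With L/R = 10, its estimate should approach the analytical transmission probability of approximately 0.191 that serves as a reference in section 4.

```cpp
// Minimal, self-contained TPMC sketch (our illustration, not Molflow code):
// a cylindrical tube of radius R and length L along the z axis, with a
// perfectly diffuse side wall and absorbing end caps.
#include <cmath>
#include <cstdio>
#include <random>

int main() {
    const double kPi = 3.141592653589793;
    const double R = 1.0, L = 10.0;   // L/R = 10; cf. Table 2 (Wref ~ 0.19094)
    const long   N = 1000000;         // number of desorbed test particles
    std::mt19937_64 rng(42);
    std::uniform_real_distribution<double> U(0.0, 1.0);
    long transmitted = 0;

    for (long i = 0; i < N; ++i) {
        // Desorption: uniform point on the inlet disk at z = 0.
        double r = R * std::sqrt(U(rng)), a = 2.0 * kPi * U(rng);
        double x = r * std::cos(a), y = r * std::sin(a), z = 0.0;
        // Cosine-law direction around the inlet normal (0, 0, 1).
        double th = std::asin(std::sqrt(U(rng))), ph = 2.0 * kPi * U(rng);
        double dx = std::sin(th) * std::cos(ph);
        double dy = std::sin(th) * std::sin(ph);
        double dz = std::cos(th);
        for (;;) {
            // Distance to the absorbing end planes z = 0 and z = L.
            double tEnd = (dz > 0.0) ? (L - z) / dz
                                     : (dz < 0.0 ? -z / dz : 1e300);
            // Distance to the side wall x^2 + y^2 = R^2 (largest root).
            double qa = dx * dx + dy * dy;
            double qb = 2.0 * (x * dx + y * dy);
            double qc = x * x + y * y - R * R;
            double tWall = 1e300;
            if (qa > 0.0)
                tWall = (-qb + std::sqrt(qb * qb - 4.0 * qa * qc)) / (2.0 * qa);
            if (tEnd <= tWall) {              // end plane reached: absorbed
                if (dz > 0.0) ++transmitted;  // left through the far cap
                break;
            }
            x += tWall * dx; y += tWall * dy; z += tWall * dz;  // wall hit
            // Diffuse reflection: cosine law around the inward wall normal.
            double nx = -x / R, ny = -y / R;  // inward normal, nz = 0
            double tx = -ny, ty = nx;         // tangent 1 in the x-y plane
            th = std::asin(std::sqrt(U(rng))); ph = 2.0 * kPi * U(rng);
            double lu = std::sin(th) * std::cos(ph);
            double lv = std::sin(th) * std::sin(ph);
            double ln = std::cos(th);
            dx = lu * tx + ln * nx;           // tangent 2 is the z axis,
            dy = lu * ty + ln * ny;           // so it only feeds dz
            dz = lv;
        }
    }
    std::printf("W = %.5f\n", (double)transmitted / (double)N);
    return 0;
}
```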
2.2. GPU ray tracing with OptiX 7

Not least due to high-level languages such as CUDA or OpenCL, GPUs have long been used to solve general problems outside of rendering or linear algebra. GPU computing is widely used for scientific applications, for example, for data analysis or data reduction in high energy physics, see, for example, Schouten et al. (2015) or vom Bruch (2020). In principle, Molflow+'s ray-tracing algorithm is well suited for the SIMD architecture of GPUs. Given the independent nature of particles in free molecular flow, one can maximize the parallel workload by generating and tracing individual particles per thread. Nvidia's OptiX 7 is a ray-tracing API that provides a simple interface to leverage the capabilities of the newly introduced RT units on compatible GPUs. Other common choices are Microsoft's DXR and the Vulkan API. We decided on utilizing OptiX 7 not only because we expect the best results staying in the Nvidia universe, but also because it is CUDA-centric and provides compatibility for older architectures back to Maxwell. In this study, we mainly focus on the recent RTX hardware because of the promises of the RT cores. Details about how the RT cores work on the Turing architecture can be found in the corresponding white paper (NVIDIA, 2018: 30).

A full ray-tracing kernel in OptiX consists of a ray generation, a ray-primitive intersection test, and a hit program. The ray-generation routine initializes an individual ray with starting parameters and starts the trace. Afterwards, the ray is tested against the geometry with the built-in scene traversal algorithm, as well as the built-in triangle intersection or a custom primitive intersection test, to find a potential hit location. At last, either just the closest intersection or every intersection on a ray's trajectory (for example, when passing through non-opaque primitives) can be processed with a Closest Hit or an Any Hit program, respectively. This step is used to compute the effects corresponding to the primitive type or material of the hit location in relation to the ray payload, which is an arbitrary data structure that encompasses a ray's individual attributes and other necessary information along its path. In computer graphics, this step relates to a shading routine. We like to refer to it as "Trace Processing" for generality. If no valid intersection is found, a Miss program is called. As emphasized in the OptiX Programming Guide (see NVIDIA (2020)), utilizing an Any Hit program negatively impacts the performance. In this study, with the goal to maximize the performance benefits of the RTX technology and with the assumption that it is a feasible strategy, we utilize only a Closest Hit and a Miss program.

Figure 1 shows a sketch of the OptiX 7 pipeline. The green labels highlight the parts that can make use of proprietary algorithms, which are hardware-accelerated on compatible GPUs. Scene traversal always utilizes the built-in method. Further, we can see that for maximum leverage of the RTX hardware, we have to use the built-in intersection routine, if feasible. For a complete study, we also implement Molflow's standard approach, where a ray-polygon intersection algorithm is utilized, and compare it to the RTX capable ray-triangle intersection test. It is possible to launch so called secondary rays recursively from the "Trace-Processing" step, which allows skipping the ray-generation step in the pipeline and is useful for further following particles that are reflected from the boundary of the vacuum volume.

Figure 1. Ray-tracing pipeline with RTX technology, cf. Wald and Parker (2019).

A challenge, which had already posed problems in a previous attempt to port Molflow's simulation engine to the GPU, is to minimize the effects of thread divergence to properly utilize the SIMD architecture. Thread divergence occurs when the post-processing step for a particle differs depending on the interaction with a particular facet type. In Molflow, particles can reflect off or get absorbed by a facet, in a deterministic or probabilistic way, neglecting more advanced techniques for simplicity.

With OptiX 7, we can make use of the so called Shader Binding Table (SBT), which allows us to map specific programs to individual facets or materials (think of function pointers). We assume that OptiX's internal

execution scheduling algorithm in combination with the SBT lowers the negative impact of diverging threads, as it allows for manageable reordering or grouping of functions (see Parker et al. (2010)). These benefits could also be attributed to the RTX technology itself, as it supposedly implements some sort of GPU work creation (see Sanzharov et al. (2019)), allowing for efficient handling of workloads and direct invocation of ray tracing on the GPU.

3. Molflow's GPU implementation

In this section, we present our design for the ray-tracing programs using the OptiX API, as well as choices related to memory and random number generation. We give a brief explanation of the parts that are straightforward ports from the CPU algorithm, which is discussed in greater detail by Ady (2016).

The OptiX API handles the traversal and intersection routines directly with built-in algorithms and utilizes the hardware-acceleration features of Nvidia RTX GPUs; thus, only the ray-generation and trace-processing procedures are the main concern when trying to achieve a high-performance ray-tracing algorithm. For a complete study, we compare the performance of the built-in ray-triangle intersection with a port of Molflow's ray-polygon intersection routine. One major restriction of the API is that the built-in algorithms work only on 32 bit representations of the geometry and the rays. This drawback will be discussed together with the performance and precision study of the kernel (see section 4).

3.1. Geometry

For our study, we utilize a cylindrical tube serving as a vacuum component, for which an analytical solution can be used for the benchmarks (see Gómez-Goñi and Lobo (2003)). The circular inlet (desorption facet) defines a steady influx of particles with an outgassing rate of 10 mbar·l·s⁻¹. Further, both end facets serve as perfect absorbers, removing all particles from the system. The side facets are defined as perfectly diffusive, reflecting all particles. The tube can be defined with various length/radius ratios (L/R). In Molflow, the circular form is approximated with a finite amount of side facets. In our numerical experiments, the L/R ratios and the levels of approximation were varied, as indicated in the respective sections.

Figure 2 shows the geometry in Molflow both as a polygon and as a triangle mesh with an exemplary ratio of L/R = 10. In this example, the circular inlet (green circle) and outlet (red circle) are each approximated with 10 vertices, which results in 10 rectangular facets for the side walls, thus 12 polygons in total. For the GPU kernel, the geometry is further triangulated, resulting in a mesh with 36 triangles in total for the same amount of vertices.

Figure 2. Cylindrical tube with a length/radius ratio of L/R = 10 in Molflow. The top shows an approximation with a polygon mesh, the bottom with a triangle mesh. The green lines show arbitrary particle paths.

3.2. Ray generation

Handling a new particle, the ray-generation program will initially generate the starting parameters according to Molflow's physical model. The minimal set of attributes for a ray-tracing engine consists of the source location (ray origin) and the trajectory (ray direction), where Molflow also accounts for the particle velocity. Next, the actual ray tracing can be initiated. Compared to Molflow's original implementation, the ray-generation program also handles particles that reside inside the system, for example, after a collision. Here, the set of attributes can simply be fetched from memory to initiate the ray tracing. Intuitively, one would directly call the ray-tracing routine again given the new parameters. For OptiX, this is only possible via recursion.

3.2.1. Ray origin. The source location has to be found in two steps, which slightly differ for polygons and triangles,

where the latter are preferred due to their native support on RTX hardware. First, given the local influx rate of particles dNf/dt for each facet, we can determine the probability that a particular facet f will be chosen as the starting location. For a polygon, this can be done by generating a random point inside the bounding rectangle and accepting or rejecting the candidate point with a point-in-polygon test. We select the candidate triangle following the same principle. Coming from polygons, we simply have to account for the triangle area when calculating the appropriate desorption probability from the original influx rate of the overlying facet beforehand.

A random point inside a triangle is then selected according to a formula described by Osada et al. (2002). Given a triangle with vertices A, B, C ∈ ℝ³ and two uniformly distributed random numbers r1, r2 ∈ [0, 1], a point P can be sampled from a uniform distribution with the following equation:

$$P = (1 - \sqrt{r_1}) \cdot A + \sqrt{r_1} \cdot (1 - r_2) \cdot B + \sqrt{r_1} \cdot r_2 \cdot C \quad (1)$$

Here, √r1 gives the percentage from vertex A to the opposing edge connecting vertices B and C, and r2 denotes the percentage along that edge. The relation between the random numbers and the vertices is sketched in Figure 3.

Figure 3. A point P can be sampled with two random numbers from a uniform distribution, cf. Osada et al. (2002).
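Under the assumptions above, the sampling of equation (1) translates into a few lines of CUDA device code; float3 and make_float3 are CUDA vector types, and r1 and r2 are the two uniform random numbers supplied by the caller:

```cpp
// Sketch of equation (1): uniform sampling of a point on triangle (A, B, C).
__device__ float3 SamplePointOnTriangle(float3 A, float3 B, float3 C,
                                        float r1, float r2) {
    float s  = sqrtf(r1);          // sqrt(r1): fraction from A toward edge BC
    float wA = 1.0f - s;           // weight of vertex A
    float wB = s * (1.0f - r2);    // r2: fraction along the edge from B to C
    float wC = s * r2;
    return make_float3(wA * A.x + wB * B.x + wC * C.x,
                       wA * A.y + wB * B.y + wC * C.y,
                       wA * A.z + wB * B.z + wC * C.z);
}
```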
3.2.2. Ray direction. We calculate the particle direction according to Knudsen's cosine law, which assumes that a molecule's inbound and outbound directions are independent. It states that a molecule leaves a surface into the solid angle dω with probability

$$ds = \pi \, d\omega \cdot \cos\theta \quad (2)$$

One can derive the azimuth and polar angles (φ and θ, respectively) as follows:

$$\varphi = 2\pi X, \quad X \in [0, 1] \quad (3)$$

$$\theta = \sin^{-1}\!\left(\sqrt{Y}\right), \quad Y \in [0, 1] \quad (4)$$

where X and Y are uniformly distributed random numbers. For use in a 3-dimensional Cartesian system, one has to transform the derived spherical coordinates (1, θ, φ) further into Cartesian coordinates (x, y, z). First, we derive the local Cartesian coordinates (u, v, n)

$$u = \sin(\theta) \cdot \cos(\varphi) \quad (5)$$

$$v = \sin(\theta) \cdot \sin(\varphi) \quad (6)$$

$$n = \cos(\theta) \quad (7)$$

and translate them with the corresponding facet's orthonormal basis {U, V, N}, where N is the facet normal, into global Cartesian coordinates (x, y, z) for the particle direction:

$$x = u \cdot U_x + v \cdot V_x + n \cdot N_x \quad (8)$$

$$y = u \cdot U_y + v \cdot V_y + n \cdot N_y \quad (9)$$

$$z = u \cdot U_z + v \cdot V_z + n \cdot N_z \quad (10)$$
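Equations (3) to (10) combine into one short device function; again, float3 and make_float3 are CUDA vector types, {U, V, N} is the facet basis, and X, Y are the two uniform random numbers:

```cpp
// Sketch of equations (3)-(10): cosine-distributed direction in the facet
// basis {U, V, N}. X and Y are uniform random numbers in [0, 1).
__device__ float3 SampleCosineDirection(float3 U, float3 V, float3 N,
                                        float X, float Y) {
    const float kPi = 3.14159265358979f;
    float phi   = 2.0f * kPi * X;                    // azimuth, eq. (3)
    float theta = asinf(sqrtf(Y));                   // polar angle, eq. (4)
    float u = sinf(theta) * cosf(phi);               // local coords, eqs (5)-(7)
    float v = sinf(theta) * sinf(phi);
    float n = cosf(theta);
    return make_float3(u * U.x + v * V.x + n * N.x,  // global, eqs (8)-(10)
                       u * U.y + v * V.y + n * N.y,
                       u * U.z + v * V.z + n * N.z);
}
```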
3.2.3. Particle velocity. Assuming an ideal gas at equilibrium in a defined environment, we can describe the speed of molecules with the Maxwell–Boltzmann distribution. For simplicity, we can assume that the average velocity is sufficient to describe most effects inside a vacuum component. In case of the Maxwell–Boltzmann distribution, it is given by

$$\langle v \rangle = \sqrt{\frac{8RT}{\pi m}} \quad (11)$$

where R is the ideal gas constant, T the gas temperature, and m the molar mass (in kg). Given an isothermal system, one can further assume a constant velocity that is not affected by collisions. The preferred method in Molflow, discussed in more detail by Ady (2016), accounts for the effect of faster molecules hitting surfaces more often, resulting in a distribution that is slightly different from the Maxwell–Boltzmann distribution, but that can still be inverted for fast lookup when generating particles. For simplicity, we use formula (11) in our experiments for both CPU and GPU simulations.
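For reference, equation (11) amounts to the following host-side helper (a sketch with constants inlined). For N2, with a molar mass of 0.028 kg/mol at T = 293 K, it yields roughly 470 m/s:

```cpp
// Sketch of equation (11): mean thermal speed of an ideal gas.
#include <cmath>
#include <cstdio>

double MeanThermalSpeed(double T, double m) {   // T in K, m in kg/mol
    const double R  = 8.314462618;              // ideal gas constant, J/(mol K)
    const double pi = 3.14159265358979323846;
    return std::sqrt(8.0 * R * T / (pi * m));
}

int main() {
    std::printf("<v> = %.1f m/s\n", MeanThermalSpeed(293.0, 0.028));
    return 0;
}
```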
3.2.4. Ray launch. The RTX unit works with single-precision floating point values, which can have various negative effects on the ray-tracing results, most simply leading to occasional misclassifications: Given that the RTX intersection algorithm is guaranteed to be watertight, in the sense that a ray can never go between two adjacent triangles which share the same edge, it will choose either one or the other for the hit location. If the ray passes very close to this edge, then the intersection may be attributed to the wrong triangle. For rendering problems, this is rarely noticeable, when the color of a single pixel is slightly darker or brighter than expected. In Molflow, this can lead to more severe problems, as the magnification of these errors can easily influence the whole system.

Further problems arise when we compute the exact hit location and take this value as the next ray origin to relaunch a ray from the corresponding surface: a secondary ray. The new location could be classified on either side of the polygon due to arithmetic error, as discussed in great detail by Wächter and Binder (2019). Potential workarounds differ with the use of back-face culling. A polygon's front is identified by its surface normal, that is, the perpendicular vector corresponding to the surface. Without culling, both faces of a polygon serve as possible intersection points. With back-face culling, an intersection is only valid if the surface normal is facing the ray direction.

Considering we do not deploy back-face culling, there are a few possibilities of what could happen in case the ray origin is found to be outside. In the best case, it will simply get redirected into the body of the mesh. In the worst case, the hit location is on the edge between two non-coplanar triangles and the ray direction is quasi-parallel to the neighboring triangle plane. In this case, the ray will not find a collision point with the neighboring triangle: a miss. In addition to back-face culling, we can reduce the amount of misses of arithmetical origin by deploying an adaptive offset along the facet normal. We chose the strategy presented by Wächter and Binder (2019), as it leads to minimal computational overhead and minimal impact on the accuracy compared to other options. In an experiment with the introduced cylindrical tube approximated with 1000 vertices per end cap, we found that back-face culling by itself resulted in a miss-hit ratio of 1.35 × 10⁻⁷ and the adaptive offset in a ratio of 1.44 × 10⁻⁶. Deploying both solutions reduced the rate to only 5.32 × 10⁻¹¹, and therefore this was used as the default strategy for further simulations.

Possible consequences of a miss for the given geometry can be neglected. If there was a surrounding shell, the ray would hit that surface instead, leading to a series of unexpected hits. Depending on the complexity of a geometry, this can lead to uninterpretable results.
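The offset itself is compact. The sketch below follows the published description of Wächter and Binder (2019): a ULP-based shift of the hit point along the geometric normal, with a small fixed epsilon close to the coordinate origin. The constants are the ones given in their chapter; the values used in our kernel may differ:

```cpp
// Sketch of the adaptive self-intersection offset (cf. Wächter and Binder,
// 2019). Far from the origin the position is nudged by integer ULP steps
// scaled with the normal; near zero a fixed float epsilon is used instead.
#include <cmath>
#include <cstdint>
#include <cstring>

static float OffsetComponent(float p, float n) {
    const float intScale = 256.0f;
    int32_t bits;
    std::memcpy(&bits, &p, sizeof bits);                    // float -> int bits
    bits += (p < 0.0f ? -1 : 1) * (int32_t)(intScale * n);  // ULP-based offset
    float pUlp;
    std::memcpy(&pUlp, &bits, sizeof pUlp);                 // int bits -> float
    const float origin = 1.0f / 32.0f, floatScale = 1.0f / 65536.0f;
    return (std::fabs(p) < origin) ? p + floatScale * n : pUlp;
}

// Shift position p along the geometric normal n before relaunching a ray.
void OffsetRayOrigin(float p[3], const float n[3]) {
    for (int i = 0; i < 3; ++i) p[i] = OffsetComponent(p[i], n[i]);
}
```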
Algorithm 1 Ray-generation algorithm

1: function RAYGENERATION(Particle p)
2:   (p.terminate == true) → return
3:   switch p.state do
4:     case new
5:       INIT_STATE()
6:     case active
7:       LOAD_STATE()
8:   APPLY_OFFSET(p.position)
9:   OPTIXTRACE() ⊳ Launch ray with OptiX API
10: end function

The ray-generation procedure is sketched in algorithm 1. A particle is either initialized with the described model or with its state from a previous launch. After the offset has been applied, the particle is launched with a call to the OptiX API to trace the ray and identify a potential intersection. Initially (see line 2), a check is conducted to see whether a thread can terminate early due to a fulfilled end condition, which is set on the CPU.
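For illustration, algorithm 1 maps onto an OptiX 7 ray-generation program roughly as follows. Only optixGetLaunchIndex() and optixTrace() belong to the OptiX device API; LaunchParams, Particle, the STATE_ constants, and the INIT/LOAD/OFFSET helpers are hypothetical stand-ins for our kernel's own data structures and routines:

```cpp
// Sketch of algorithm 1 as an OptiX 7 ray-generation program (device side).
extern "C" __constant__ LaunchParams params;    // set from the host per launch

extern "C" __global__ void __raygen__molflow() {
    const unsigned int tid = optixGetLaunchIndex().x;
    Particle& p = params.particles[tid];
    if (p.terminate) return;                    // end condition set on the CPU

    if (p.state == STATE_NEW)    INIT_STATE(p); // desorb: origin, direction
    if (p.state == STATE_ACTIVE) LOAD_STATE(p); // resume a particle in flight
    APPLY_OFFSET(p.position);                   // adaptive anti-self-hit offset

    unsigned int payload = tid;                 // index into the particle pool
    optixTrace(params.traversable,              // acceleration structure handle
               p.position, p.direction,
               0.0f, 1e27f, 0.0f,               // tmin, tmax, ray time
               OptixVisibilityMask(255),
               OPTIX_RAY_FLAG_DISABLE_ANYHIT,   // closest-hit + miss only
               0, 1, 0,                         // SBT offset, stride, miss index
               payload);
}
```

The payload carries the particle index, so the Closest Hit (trace-processing) program can address the same particle state.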
3.3. Intersection test

RT cores accelerate BVH traversal as well as ray-triangle intersection in hardware. For our study, we investigate both triangle and polygon meshes. To utilize hardware-accelerated BVH traversal for a polygon mesh, a set of AABBs has to be provided.

For triangles, we use the built-in intersection routine, while we also provide a custom port of the two-step ray-polygon test routine used in Molflow. First, a ray-rectangle intersection is performed, where we solve a system of linear equations via Cramer's rule (see Shirley and Marschner (2009)). This is followed by a point-in-polygon test in 2D space, where the ray-casting algorithm provided by Franklin (2018) gave the best results in single precision, both in performance and in reliability in terms of the false-negative rate.
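The two steps can be sketched as follows, assuming the facet is stored as an origin O with basis vectors U and V spanning its bounding rectangle and a contour given in (u, v) coordinates; this paraphrases the approach rather than reproducing Molflow's exact code. The second routine is Franklin's PNPOLY ray-casting test:

```cpp
// Step 1: solve O + u*U + v*V = o + t*d with Cramer's rule.
// Step 2: classify the (u, v) hit against the polygon contour (PNPOLY).
#include <cmath>

struct Vec3 { double x, y, z; };

static double Det3(const Vec3& a, const Vec3& b, const Vec3& c) {
    return a.x * (b.y * c.z - b.z * c.y)
         - a.y * (b.x * c.z - b.z * c.x)
         + a.z * (b.x * c.y - b.y * c.x);
}

// Returns true and fills (t, u, v) if the ray o + t*d crosses the facet plane.
bool RayFacetPlane(Vec3 o, Vec3 d, Vec3 O, Vec3 U, Vec3 V,
                   double& t, double& u, double& v) {
    Vec3 rhs{o.x - O.x, o.y - O.y, o.z - O.z};   // right-hand side o - O
    double det = Det3(U, V, d);                  // columns U, V, d
    if (std::fabs(det) < 1e-12) return false;    // ray parallel to the facet
    u =  Det3(rhs, V, d) / det;
    v =  Det3(U, rhs, d) / det;
    t = -Det3(U, V, rhs) / det;
    return t > 0.0;
}

// PNPOLY (Franklin 2018): odd number of contour crossings means "inside".
bool InPolygon(int n, const double* pu, const double* pv, double u, double v) {
    bool inside = false;
    for (int i = 0, j = n - 1; i < n; j = i++)
        if (((pv[i] > v) != (pv[j] > v)) &&
            (u < (pu[j] - pu[i]) * (v - pv[i]) / (pv[j] - pv[i]) + pu[i]))
            inside = !inside;
    return inside;
}
```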

3.4. Trace processing

Following the calculation of a potential hit location, the events are accumulated in hit counters for statistical purposes. The gathered statistics correspond not only to the plain amount of collisions or "MC Hits" Nhit, but also to the sum of orthogonal momentum changes of particles ΣI′ and the sum of reciprocals of the orthogonal speed components of particles Σv′⁻¹. From these three counters, the most important values for vacuum components can be derived, for example, the pressure or density, as explained by Ady (2016).

The counters are increased according to the collision event corresponding to the facet type. Besides the per-facet hit counters, the users can also enable more fine-grained counters in the form of textures (N × M) or profiles (1 × N), which divide a facet in two directions or one direction, respectively. Here, N, M ∈ ℕ⁺ describe the spatial resolution. If used, an extra step to calculate the index of the texture element or profile bin has to be deployed. Texture and profile coordinates are given in local coordinates (u, v) ∈ [0,1]² in 2D space, which simplifies indexing of the underlying data structure. The coordinates are already provided by the polygon intersection test, as they are part of the calculation. For a texture mapped onto a triangle, we provide local texture coordinates for each vertex A, B, C ∈ [0,1]² and use the barycentric coordinates α, β, γ ∈ [0, 1] returned by the intersection test to interpolate the exact local hit location p = (u, v):

$$p = \alpha A + \beta B + \gamma C \quad (12)$$
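A sketch of this texture lookup is given below; the clamping at u = 1 or v = 1 and the row-major bin layout are our assumptions for illustration:

```cpp
// Sketch of equation (12) plus the bin index into an N x M facet texture.
struct UV { float u, v; };

// alpha, beta, gamma in [0, 1] with alpha + beta + gamma = 1.
UV LocalHitCoords(UV A, UV B, UV C, float alpha, float beta, float gamma) {
    return { alpha * A.u + beta * B.u + gamma * C.u,
             alpha * A.v + beta * B.v + gamma * C.v };
}

// Row-major bin index of the (u, v) hit inside an N x M texture counter.
int TextureBin(UV p, int N, int M) {
    int iu = (int)(p.u * N); if (iu > N - 1) iu = N - 1;   // clamp u = 1.0
    int iv = (int)(p.v * M); if (iv > M - 1) iv = M - 1;   // clamp v = 1.0
    return iv * N + iu;
}
```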
In our simplified model, we differentiate between two types of collision events based on a target facet's properties. If a particle gets absorbed, a new particle will be generated in the next ray-generation step. In case of a collision with a solid surface, the particle is reflected. The hit location can be reutilized as the source location; thus, only a new ray direction needs to be generated. Instead of postponing this to the next ray-generation step, a new ray can be recursively launched from the shading routine. In rendering applications, this is commonly done for so called secondary rays, simulating light reflections on surfaces. These secondary rays are launched recursively up to a pre-specified recursion depth. Skipping the ray-generation step is cheaper, but comes with some restrictions.

Algorithm 2 Trace-processing algorithm

1: function TRACEPROCESSING(Particle p)
2:   if hit same polygon again? then
3:     state ← active ⊳ Try again with new offset
4:     return
5:   end if
6:   MOVE_PARTICLE()
7:   if particle sticks? then
8:     RECORD_ABSORPTION() ⊳ record statistics
9:     state ← new
10:    return
11:  else if particle reflects? then
12:    ⊳ record statistics and update particle direction
13:    RECORD_REFLECTION()
14:    if do recursion? then
15:      APPLY_OFFSET(p.position)
16:      OPTIXTRACE()
17:      ⊳ Recursively launch ray with OptiX API
18:    else
19:      recursion_depth ← 0
20:    end if
21:  end if
22: end function
In algorithm 2, we show a sketch of the trace-processing algorithm. The corresponding RECORD_ routines account for both global and local statistics.

As particles in Molflow tend to stay inside a given geometry for several hundreds or thousands of collisions on average, full recursion would easily reach the maximum stack size of a thread. Given that some particles could terminate while others could be recursively retraced, some idle time is to be expected. A well-chosen value for a maximum recursion depth will likely have a positive impact on overall performance either way, leading to a hybrid approach, where residual particles (those that have not been absorbed after a ray-tracing step) are reinitialized inside the ray-generation program after the maximum recursion depth has been reached. We benchmark this technique in the next section.

3.5. Random numbers and recursion

A crucial part of a good Monte Carlo simulator is the underlying pseudo-random number generation (PRNG) algorithm, in terms of both credibility and performance. For the best results, we utilize the cuRAND library (see NVIDIA (2019)) from the CUDA toolkit and its default PRNG Xorwow (see Marsaglia (2003)). To deploy the random numbers, there are two fundamental approaches. Most straightforward, each GPU thread has its own random state curandState_t of 48 bytes for the PRNG to generate random numbers on an ad hoc basis. A different approach is to batch-generate multiple random numbers in advance, every few cycles, in a separate CUDA kernel.

The first approach is memory friendly, as we only have to save a random state instead of Nmax × TC random numbers per thread, where Nmax and Nmin are the maximum and minimum amounts of random numbers needed for one ray-tracing cycle, respectively, and TC ∈ [1, ∞) describes the number of cycles to generate numbers for in advance. All random numbers are generated in a static fashion for all threads in a separate CUDA kernel on device memory. An additional counter keeps track of the index of the next random number. Depending on how an interaction with the boundary affects the particle, not all numbers have to be accessed. With batch-wise generation, an additional memory amount of O(Nmax − Nmin) random numbers per thread has to be accounted for, which can remain unused. For example, in case a particle remains inside the system, the particle origin does not have to be recalculated with two random numbers. When using recursion, more random numbers have to be generated in advance. For a full cycle, Molflow's algorithm utilizes up to 8 random numbers; skipping ray generation, only 2 random numbers are needed to compute a new ray direction, leading to Nmax = 8 + recursion_depth × 2 random numbers per cycle and ultimately to TC × Nmax buffered random numbers.

We analyze in section 4.3 how the usage of both approaches for random number generation affects the performance with varying recursion depth limits.
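The two building blocks of the batched approach can be sketched as follows, assuming one Xorwow state per thread and a flat device buffer holding batch = TC × Nmax uniforms per thread; the kernel names and the memory layout are ours:

```cpp
// Sketch of per-thread state setup and batched refill with cuRAND (Xorwow).
#include <curand_kernel.h>

__global__ void InitRng(curandState_t* states, unsigned long long seed, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) curand_init(seed, tid, 0, &states[tid]);  // one subsequence/thread
}

// Refill the slice randoms[tid * batch .. tid * batch + batch - 1].
__global__ void FillRandomBatch(curandState_t* states, float* randoms,
                                int batch, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    curandState_t local = states[tid];        // work in registers, write back
    for (int i = 0; i < batch; ++i)
        randoms[tid * batch + i] = curand_uniform(&local);
    states[tid] = local;
}
```

Between two refills, each thread then reads its slice sequentially, advancing the per-thread index mentioned above.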

3.6. Device memory

Molflow geometries tend to be memory space efficient and are ultimately shared between all threads. We can divide the remaining memory among all particles, which we want to launch in one go. For every thread, the memory demand is usually not fixed and depends on the deployed design solutions. Accounting for about 64 B for the particle state and an additional 48 B for a RNG state, or TC × 32 B + 4 B when using batch generation with 8 single-precision random numbers per cycle and an additional index for the next random number, the remaining memory can be used for the hit buffers. To prevent race conditions or excessive access synchronization, every thread ideally has its own hit buffer per facet, which amounts to 7 × 4 B. In addition, when utilizing textures or profiles for data collection on a more fine-grained scale, we have to add 3 × 4 B per texture element or profile bin. Obviously, having individual counters per thread increases performance, since we do not have to utilize any sort of synchronization, for example, in the form of atomics.

For example, considering a simulation on an Nvidia RTX 2060 as the reference GPU, we found deploying around 128 × 16,384 threads (see section 4.1) to saturate the GPU nicely. Now, given a simple pipe geometry with a total of only 102 facets¹ as an example, we would need to allocate around 128 × 16,384 × (64 + 36) B = 210 MB for the rays' states and additionally around 5989 MB if every thread had its own hit counters, thus easily reaching the full 6 GB VRAM capacity without deploying any additional fine-grained counters. This is obviously not a feasible strategy, as the memory limit would already be reached with the other memory requirements. Therefore, we utilize one shared facet buffer by default, which is modified with atomic operations. In section 4.2, we analyze the option to use multiple buffers shared among groups of threads.
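The arithmetic behind these figures can be reproduced with a few lines (our helper, mirroring the per-thread byte counts quoted above):

```cpp
// Back-of-the-envelope check of the RTX 2060 example from this section.
#include <cstdio>

int main() {
    const long long threads = 128LL * 16384LL;   // 2,097,152 threads
    const long long facets  = 102;               // simple pipe geometry
    const long long state   = 64 + 32 + 4;       // particle + 1-cycle RNG batch
    const long long counter = 7 * 4;             // one hit counter, bytes
    std::printf("ray states        : %lld MB\n",  // ~210 MB
                threads * state / 1000000);
    std::printf("per-thread buffers: %lld MB\n",  // ~5989 MB
                threads * facets * counter / 1000000);
    return 0;
}
```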
4. Performance and precision study

To account for all types of Molflow users, we decided to provide benchmarks for two sets of hardware. The first set accounts for the average Molflow user, where we consider consumer-grade hardware as found in laptops or office computers, on which simulations in early phases are usually run. We utilize a Turing GPU with RTX technology of the first generation. With a stronger emphasis on complex and time-demanding simulations, the second set focuses on high-end hardware.

• Set 1: {CPU i7-8557U, GPU RTX 2060}
• Set 2: {CPU Epyc 7302P, GPU RTX 3090}

The software was compiled with GCC 10.2 and level 3 optimizations on all machines, running an Ubuntu 20.04-based system. The GPU kernel was compiled with OptiX 7.1 and CUDA 10.1. As input geometry, all experiments utilize a cylindrical tube, as previously introduced in section 3.1. Henceforth, we will refer to these pipes as Glr,n for simplicity, where the index lr denotes the length/radius ratio and the index n denotes the amount of side facets used to approximate the circular shape. For most tests, a pipe with a length/radius ratio of L/R = 100 and either 100 or 1000 side facets is used: G100,100 and G100,1000, respectively. This configuration has an analytic solution, which can be used as a benchmark. The inlet (desorption facet) defines a steady influx of particles. Further, both end facets serve as perfect absorbers, removing all particles from the system. The side facets are defined as perfectly diffusive, reflecting all particles. Algorithmically, we utilize both the built-in intersection routine provided by the OptiX API and a custom polygon intersection routine. Using the former, one can see hardware-acceleration effects on both BVH traversal and intersection testing, whereas for the custom routine, only BVH traversal can benefit from these effects. For a fair comparison with the CPU algorithm of Molflow+, we utilize an updated version of the code, with major improvements to the BVH structure and intersection routine.

4.1. Amount of threads

First, we run a simple experiment to find out how different amounts of simultaneously launched threads influence the performance. We use power-of-two multiples of 128 threads, which represents the amount of CUDA cores per streaming multiprocessor on the RTX 3090, scaling up to 2^25 threads in total.

In Figure 4, we can see that the performance on the overall faster RTX 3090 is unsteady for the simulation of the G100,100 geometry. It is an effect of the atomic operations for the hit counters, which we investigate in the next experiment. Further, we can see that the performance increase stagnates at 128 × 8192 = 1,048,576 and 128 × 16,384 = 2,097,152 simultaneous threads for the Nvidia RTX 3090 and the RTX 2060, respectively, for both geometries. This seems to be the sweet spot, where the GPU is saturated enough. We can choose these values for memory-critical simulations.

4.2. Extra buffers

As described in section 3.6, atomic operations can be used to effectively counter race conditions. Considering that certain facets may be hit frequently, using a single hit buffer for each facet can save memory, but negatively impacts the overall performance. We investigate the benefits of utilizing multiple buffers per facet, where the buffers are split equally among all threads in a warp.

For the G100,100 geometry, Figure 5 shows that with increasing performance, atomic operations can become a bottleneck, as the probability of simultaneous access increases. The effect can be reduced by utilizing multiple counters instead of only a single one, where 4 counters are sufficient in this case. A triangle describing a side facet has an average hit chance of

around 0.45%, decreasing the chance to 0.11% for one of four counters. As the effect was hardly visible with a slower GPU (RTX 2060) or with a geometry with more facets (G100,1000), because of the lower hit frequency per facet, it is not necessary for most geometries to utilize multiple counters, as memory is likely more of a concern.
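A reduced sketch of the idea: nBuf copies of a facet counter, with thread tid updating copy tid % nBuf (our simple assignment, for illustration), so that concurrent atomics on a frequently hit facet contend over several memory locations instead of one:

```cpp
// Multiple hit-counter buffers per facet to reduce atomic contention.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void CountHits(unsigned long long* counters, int nBuf,
                          int facet, int hitsPerThread) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long long* slot = &counters[facet * nBuf + tid % nBuf];
    for (int i = 0; i < hitsPerThread; ++i)
        atomicAdd(slot, 1ULL);               // contention spread over nBuf slots
}

int main() {
    const int nBuf = 4, nFacets = 1, nThreads = 1 << 20;
    unsigned long long* d;
    cudaMalloc((void**)&d, nFacets * nBuf * sizeof(unsigned long long));
    cudaMemset(d, 0, nFacets * nBuf * sizeof(unsigned long long));
    CountHits<<<nThreads / 128, 128>>>(d, nBuf, 0, 8);
    unsigned long long h[nBuf];
    cudaMemcpy(h, d, sizeof h, cudaMemcpyDeviceToHost);
    unsigned long long total = 0;
    for (int i = 0; i < nBuf; ++i) total += h[i];  // reduce copies at readout
    std::printf("total hits = %llu\n", total);     // expect 8 * 2^20
    cudaFree(d);
    return 0;
}
```

At readout, the copies of each facet are summed once, so the extra buffers cost memory but no additional work in the hot loop.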
4.3. Recursion and RNG

To evaluate the effects of the different approaches for random number generation (ad hoc and batched) and the possible benefits of utilizing a recursive kernel for the launch of secondary rays in the case of reflections, we compared the results for different geometries (G100,100 and G100,1000) on both GPUs. The simulations are run with the corresponding thread numbers, which were found in the previous experiment: 128 × 8192 = 1,048,576 and 128 × 16,384 = 2,097,152 for the RTX 3090 and the RTX 2060, respectively. For the geometry G100,100, we deploy 16 hit buffers for the simulations on the RTX 3090.

In Figure 6, we see that random number generation in batches does yield better performance in all cases in conjunction with our ray-tracing kernel. Surprisingly, generating more random numbers in advance does not speed up, but instead slightly slows down the simulations. We found that this is mostly cache related, where generation for one cycle already leverages the positive effects and multiple cycles slightly suffer from fewer hits in both L1 and L2 cache. The ad hoc generation of random numbers was inferior in all cases compared to the single-cycle batch generation (TC = 1).

For vacuum geometries, particles usually reside inside the system for a large amount of collision events. This property makes them suitable for any level of recursion that can be deployed within the applicable memory constraints. For example, this is the case with the given test geometry, where on average each particle yields around ∼100 events until it exits from the system. By contrast, a cylindrical tube with a ratio of L/R = 1 instead of L/R = 100 does not benefit a lot from recursion, as particles yield only around ∼2 events on average.
Figure 4. Performance measured in hits per second (1 MHit = 10⁶ hits) in relation to simultaneously launched threads, where the x-axis is log2-scaled. Memory limits are reached for higher thread numbers on the RTX 2060, resulting in 0 MHit/s.

Figure 5. Effect of atomic operations in combination with a variable amount of facet buffers N on the performance, given more simultaneously launched threads, for the G100,100 geometry simulated on an Nvidia RTX 3090.

Table 1. Performance measured in MRay/s (10⁶ rays per second) for the geometry with different approximations (G100,100 and G100,1000) and with (w/) or without (w/o) textures. The GPUs run on either a polygon mesh (Poly) or a triangle mesh (Tri).

                       i7-8775U   2060+Poly   2060+Tri   Epyc 7302P   3090+Poly   3090+Tri
G100,100  w/o tex      6.34       298.62      747.10     34.10        1308.18     2646.48
G100,100  w/ tex       5.74       274.10      575.10     32.27        1168.75     2003.73
G100,1000 w/o tex      4.56       172.48      538.55     10.07        602.03      1819.27
G100,1000 w/ tex       4.02       152.82      374.87     9.97         548.91      1347.90

Figure 6. Performance for the Molflow GPU algorithm for a pipe with different configurations on both RTX GPUs. Results are
generated for ad hoc and batched random number generation with varying cycles TC and recursive depths.

4.4. Performance

We analyze the raw performance of the ray-tracing engine in two experiments. First, we run simulations on the specified geometry without any modifications to highlight the impact of the actual ray tracing on the performance. Next, we include textures on all facets to put a heavier load on the trace-processing kernel, which is closer to the needs of real-life simulations. For the simulations, we utilize the same parameters as before. We generate random numbers with the single-cycle batch generation method (TC = 1) and a recursion depth limit of 16.

We can see in Table 1 that we get good speedups for all test cases. The triangle-based algorithm is in all cases the most performant. The speedup is largest for the consumer-grade hardware (set 1), for the G100,1000 geometry and no textures, showing that geometries with a focus on ray tracing profit more from GPU utilization. Using textures has a bigger negative impact on the triangle-based GPU algorithm compared to the polygon-based algorithms on both GPU and CPU. This is likely related to the extra step in calculating the texture coordinates in 2D space. The ray-tracing algorithm for polygons on the CPU calculates the exact location as part of the intersection test. For the GPU algorithm, the barycentric coordinates returned by the intersection routine are translated with the texture coordinates of the corresponding vertices in an extra step (see section 3.4).

Figure 7. Transmission probability for a L/R = 100 cylindrical tube for the GPU and CPU algorithm (with 32 bit and 64 bit geometry) in relation to an analytical solution Wref = 0.0252763636 (see Gómez-Goñi and Lobo (2003)). Error bars denote the maximal and minimal value of the set, considering 50 simulations per set. The height of the blue and orange bars denotes the corresponding average result. The green bar is the absolute tolerance region εabs = 10⁻⁵, and the red bar is the relative tolerance region εrel = 10⁻⁴ surrounding the analytical value.

4.5. Precision

With some fundamental changes to the ray-tracing algorithm, we also have to consider the possible impact on the accuracy of the simulation. In prior testing, we had analyzed the individual custom kernels (ray generation and post-processing) with a set of isolated simulations to identify a single point of error. We had concluded that only the intersection test, crucial for the calculation of the hit location, could possibly have a big impact.

Table 2. Transmission probabilities W{GPU,CPU} for cylindrical tubes Glr,100 approximated with 100 side facets and varying L/R ratio, as calculated with simulations of 10⁹ desorbed particles with the GPU kernel and Molflow's CPU kernel, respectively. The corresponding Δ values denote the absolute difference from the reference value Wref. The speedup for each set is given as S = PGPU/PCPU, where P denotes the corresponding performance measurement.

L/R   WGPU        ΔGPU          WCPU        ΔCPU          Wref        SSet1    SSet2
1     0.67192400  5.99 × 10⁻⁵   0.67201900  3.51 × 10⁻⁵   0.67198390  63.20    31.09
2     0.51416070  6.98 × 10⁻⁵   0.51406400  1.66 × 10⁻⁴   0.51423050  66.04    30.75
5     0.31044060  1.88 × 10⁻⁴   0.31041900  1.67 × 10⁻⁴   0.31025230  68.32    30.47
8     0.22521630  4.68 × 10⁻⁵   0.22526600  2.90 × 10⁻⁶   0.22526310  126.56   36.68
10    0.19090100  4.14 × 10⁻⁵   0.19087900  6.34 × 10⁻⁵   0.19094240  121.59   40.08
20    0.10928520  3.55 × 10⁻⁵   0.10932500  4.30 × 10⁻⁶   0.10932070  114.16   49.34
50    0.04845371  2.27 × 10⁻⁵   0.04847840  2.00 × 10⁻⁶   0.04847640  76.77    49.69
80    0.03123623  1.11 × 10⁻⁵   0.03123400  1.33 × 10⁻⁵   0.03124730  174.41   49.54
100   0.02526948  6.92 × 10⁻⁶   0.02527330  3.10 × 10⁻⁶   0.02527640  164.04   49.67
200   0.01294084  5.40 × 10⁻⁵   0.01293520  5.96 × 10⁻⁵   0.01299480  118.35   48.78
500   0.00524651  1.65 × 10⁻⁵   0.00524247  2.05 × 10⁻⁵   0.00526300  133.14   49.65

This is likely due to its 32 bit floating point limitation on the dedicated ray-tracing units, which is a common problem for ray tracers (see Haines and Akenine-Möller (2019)). Thus, for further conclusions, we included a modified CPU algorithm in our test set, which uses 32 bit precision for the geometry and ray description to correspond to the RTX hardware limitations. As the CPU algorithm was found to be unstable when run completely with 32 bit floating point precision, the remaining parts were kept in 64 bit precision. We evaluated the transmission probability W after 10⁹ desorptions² for 50 runs³ for the CPU and GPU algorithms and compared them to an analytical solution (see Gómez-Goñi and Lobo (2003)). The transmission probability is the ratio of the amount of particles absorbed at the far end of the tube to the total amount of desorbed particles.
As can be seen in Figure 7, we find that the converged results are sufficiently close on all architectures. Considering the mean values, the results of all methods converge towards the analytical solution, staying at least within the εrel = 10⁻⁴ error margin. Increasing the level of approximation, G100,1000, the results converge with an error margin of at most εabs = 10⁻⁵. Calculations with 32 bit floating point precision do not seem to have a major impact on the precision, and the spans between the minimum and maximum of each set are also comparable between the different architectures. Hence, if there is a demand for higher precision, a better approximation of the geometry has to be considered first.

Further, we show in Table 2 that the behavior is similar for other geometries. We compare the results for simulations on cylindrical tubes with various length–radius ratios. The results do not differ for either configuration by a significant amount beyond the expected MC fluctuations.

5. Conclusion and future work

We developed a GPU kernel powered by Nvidia's OptiX 7 API to utilize the novel ray-tracing cores found in recent GPU architectures. We analyzed different techniques to design the algorithm to achieve peak performance for the given geometry, while also taking memory consumption into account. Our design achieved major speedups on budget hardware (×63 to ×175) as well as on high-end hardware (×30 to ×50), without a heavy influence on the precision.

In this study, we gave a brief outlook on possible issues that can arise for physical simulations as a trade-off for a highly performant GPU kernel fully utilizing new RTX hardware. The adaptive offset mitigates effects such as displaced ray origins, but it does not solve all issues by itself; some of these problems have also been discussed by Wächter and Binder (2019). A marginal, but noticeable, risk of offsetting a ray outside of another facet remains. Possible solutions would be to use a different approach in the form of OptiX's Any Hit routine or to account for neighboring facets that join at a sharp angle. Such techniques are necessary for the extension of the current implementation, when other facet properties and more complex geometries have to be taken into account, as these are more prone to single-precision errors and thus demand more robust methods. A detailed analysis and design study for such problems and the implementation of Molflow's full feature set are left for future work.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Pascal R Bähr https://orcid.org/0000-0001-8579-7386
Bruno Lang https://orcid.org/0000-0001-9197-4836

Notes

1. Note that for a triangle mesh, we map triangles onto their corresponding parent polygons to prevent allocating extra hit buffers per triangle.
2. Note that one has to account for all particles to actually leave the system, as residual particles, especially in a large number like on the GPU, still contribute to the overall results.
3. We consider the mean, min, and max value of the set. The mean equals the probability after 5 × 10¹⁰ desorptions.

References

Ady M (2016) Monte Carlo Simulations of Ultra High Vacuum and Synchrotron Radiation for Particle Accelerators. PhD Thesis, Ecole Polytechnique, Lausanne. https://cds.cern.ch/record/2157666 (presented 03 May 2016).
Blyth S (2020) Meeting the challenge of JUNO simulation with Opticks: GPU optical photon acceleration via NVIDIA OptiX. EPJ Web of Conferences 245: 11003. DOI: 10.1051/epjconf/202024511003.
Deng Y, Ni Y, Li Z, et al. (2017) Toward real-time ray tracing: A survey on hardware acceleration and microarchitecture techniques. ACM Computing Surveys 50(4): 1–41. DOI: 10.1145/3104067.
Franklin WR (2018) PNPOLY – point inclusion in polygon test. wrf.ecse.rpi.edu/Research/Short_Notes/pnpoly.html.
Gómez-Goñi J and Lobo PJ (2003) Comparison between Monte Carlo and analytical calculation of the conductance of cylindrical and conical tubes. Journal of Vacuum Science & Technology A: Vacuum, Surfaces, and Films 21: 1452–1457. DOI: 10.1116/1.1568746.
Haines E and Akenine-Möller T (eds) (2019) Ray Tracing Gems. Berkeley, CA: Apress. http://raytracinggems.com.
Kersevan R and Pons J-L (2009) Introduction to Molflow+: new graphical processing unit-based Monte Carlo code for simulating molecular flows and for calculating angular coefficients in the compute unified device architecture environment. Journal of Vacuum Science & Technology A: Vacuum, Surfaces, and Films 27: 1017–1023. DOI: 10.1116/1.3153280.
Marsaglia G (2003) Random number generators. Journal of Modern Applied Statistical Methods 2: 2–13.
NVIDIA (2018) NVIDIA Turing GPU architecture. White paper WP-09183-001_v01, NVIDIA Corporation. Available at: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf.
NVIDIA (2019) cuRAND library – programming guide, CUDA toolkit v11.1.1. https://docs.nvidia.com/cuda/curand.
NVIDIA (2020) NVIDIA OptiX 7.2 – programming guide. https://raytracing-docs.nvidia.com/optix7/guide/index.html.
Osada R, Funkhouser T, Chazelle B, et al. (2002) Shape distributions. ACM Transactions on Graphics 21(4): 807–832.
Parker SG, Bigler J, Dietrich A, et al. (2010) OptiX: a general purpose ray tracing engine. ACM Transactions on Graphics 29(4): 1–13. DOI: 10.1145/1778765.1778803.
Romano PK, Horelik NE, Herman BR, et al. (2015) OpenMC: a state-of-the-art Monte Carlo code for research and development. Annals of Nuclear Energy 82: 90–97. DOI: 10.1016/j.anucene.2014.07.048.
Salmon J and McIntosh-Smith S (2019) Exploiting hardware-accelerated ray tracing for Monte Carlo particle transport with OpenMC. In: 2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS), pp. 19–29. DOI: 10.1109/PMBS49563.2019.00008.
Sanzharov VV, Gorbonosov AI, Frolov VA, et al. (2019) Examination of the Nvidia RTX. In: Proceedings of GraphiCon 2019, pp. 7–12. DOI: 10.30987/graphicon-2019-2-7-12.
Schouten D, DeAbreu A and Stelzer B (2015) GPUs for Higgs boson data analysis at the LHC using the Matrix Element Method. In: GPU Computing in High-Energy Physics. Germany: Verlag Deutsches Elektronen-Synchrotron, 109–118. DOI: 10.3204/DESY-PROC-2014-05/20.
Shirley P and Marschner S (2009) Fundamentals of Computer Graphics. 3rd edition. Natick, MA: A K Peters, Ltd.
vom Bruch D (2020) Real-time data processing with GPUs in high energy physics. Journal of Instrumentation 15(06): C06010. DOI: 10.1088/1748-0221/15/06/c06010.
Wächter C and Binder N (2019) A fast and robust method for avoiding self-intersection. In: Haines E and Akenine-Möller T (eds) Ray Tracing Gems. Berkeley, CA: Apress.
Wald I and Parker SG (2019) RTX accelerated ray tracing with OptiX. In: ACM SIGGRAPH 2019 Courses, SIGGRAPH '19. New York, NY: Association for Computing Machinery. DOI: 10.1145/3305366.3340297.
Wald I, Usher W, Morrical N, et al. (2019) RTX beyond ray tracing: exploring the use of hardware ray tracing cores for tet-mesh point location. In: High-Performance Graphics – Short Papers. The Eurographics Association. DOI: 10.2312/hpg.20191189.
