(2021) Development of A Hardware-Accelerated Simulation Kernel For Ultra-High Vacuum With Nvidia RTX GPUs
Pascal R. Bähr1,2, Bruno Lang2, Peer Ueberholz3, Marton Ady1 and Roberto Kersevan1
Abstract
Molflow+ is a Monte Carlo (MC) simulation software for ultra-high vacuum, mainly used to simulate pressure in particle
accelerators. In this article, we present and discuss the design choices arising in a new implementation of its ray-tracing–
based simulation unit for Nvidia RTX Graphics Processing Units (GPUs). The GPU simulation kernel was designed with
Nvidia’s OptiX 7 API to make use of modern hardware-accelerated ray-tracing units, found in recent RTX series GPUs
based on the Turing and Ampere architectures. Even with the challenges posed by switching to 32 bit computations, our kernel runs up to two orders of magnitude faster than Molflow's CPU implementation on comparable hardware, at the expense of a marginal drop in calculation precision.
Keywords
Molflow, vacuum, Monte Carlo, simulations, graphics processing units, RTX, Nvidia
the Pascal architecture, the new hardware has been proven to be ∼10 times faster in terms of processed rays per second.

In this paper, we revisit the suitability of Molflow's algorithm for GPU processing by investigating the possible performance increase with both a software- and a hardware-accelerated GPU ray-tracing engine based on Nvidia's ray-tracing API OptiX 7 (see Parker et al. (2010)). Our approach leverages the capabilities of dedicated ray-tracing units as much as possible and tries to work around the hardware-bound constraints. Considering a common vacuum component with well-known properties, we investigate the potential performance gains, as well as the question of whether the reduced precision of the ray-tracing algorithm has a negative impact on the precision of the results that would make it an unfeasible choice for simulations with Molflow.

The remainder of the paper is organized as follows. First, we consider the contributions of other studies that researched the usability of RTX hardware, with a focus on comparable physical MC simulation applications (see section 1.1). Next, in section 2, we briefly discuss the principles of Molflow's algorithm and introduce the OptiX 7 API. In section 3, we discuss the design choices for Molflow's GPU simulation engine. At last, in section 4, we conduct a performance and precision study to evaluate how the GPU code compares to Molflow's CPU implementation for different parameters and geometries, where we use a set of cylindrical vacuum tubes with varying length–radius ratios.

1.1. Previous work

Studies on the use of RTX hardware for applications beyond typical rendering or even ray-tracing tasks have already been undertaken. For example, Wald et al. (2019) applied a point-in-polyhedron picking algorithm to the common ray-tracing problem. They achieved speedups of up to ∼6.5× with an optimized RTX kernel compared to a full CUDA implementation on a TITAN RTX, while on the predecessor TITAN V without RT cores, the CUDA implementation even prevailed.

Molflow is a Monte Carlo simulation software used for vacuum simulations. For similar applications, which are used to study different physical phenomena and are also based on a ray-tracing engine, some initial studies on the usability of RTX hardware have already been conducted. With an RTX-compatible extension for OpenMC (see Romano et al. (2015)), speedups of about ∼33× could be observed for triangle meshes compared to comparable CPUs (see Salmon and McIntosh-Smith (2019)). OpenMC is a Monte Carlo particle transport simulation code commonly used for the modeling of nuclear reactors, which is capable of simulating various nuclear reactions for neutrons and photons. For the open-source software Opticks, an OptiX-powered GPU engine is used for the simulation of optical photons as an interface for Geant4. Referencing RTX hardware, the author observes speedups of ∼1660× compared to CPU simulations and a speedup of ∼5× for simulations with the RTX feature disabled (see Blyth (2020)). Discrepancies in the results were mainly attributed to geometry-related causes, while the remaining discrepancies were briefly attributed to arithmetic precision. Opticks uses CSG to represent geometries instead of polygon or triangle meshes; CSG primitives are not further accelerated by RTX hardware.

Given that potential performance benefits of RTX acceleration have been studied in comparable Monte Carlo simulators, we give a further outlook for RTX usage in addition to a performance study by including a comparison of the accuracy of our algorithms and discussing the real-life suitability for such simulations with a well-studied vacuum component. Further, we elaborate on a few design choices, such as the handling of random number generation or the usage of thread-local counters, which are necessary for a fast and more robust kernel and which can be applied to similar physical applications.

2. Background

The fundamental algorithm of Molflow's vacuum simulation engine is based on ray tracing, where particles are traced within vacuum components until they hit the boundary, and the accumulated statistics are used to derive various key figures for these components. We introduce the Molflow+ algorithm and give a brief introduction to the features of Nvidia's ray-tracing API OptiX 7, which we utilize for Molflow's GPU kernel.

2.1. Ray tracing in Molflow+

Molflow+ uses the Test Particle Monte Carlo (TPMC) method to simulate free molecular flow in systems in the (ultra-)high vacuum regime (see Ady (2016)). Here, gas molecules are assumed to be independent of each other; hence, no inter-molecular collisions have to be accounted for. 3D models of vacuum components are described via polygons, which we refer to as facets. Importing geometries via formats like STL is possible, and whenever feasible, the given triangle mesh will be converted into a more general polygon mesh. The simulation algorithm then translates into the following steps.

First, a particle is generated according to our desorption model, which considers the particle's source location, the rectilinear trajectory, and the velocity. Individual particles are then traced inside a vacuum component by identifying the location of the collision with the component's boundary. The number of necessary intersection tests is reduced by grouping polygons together inside axis-aligned bounding boxes (AABB), ultimately forming a tree structure. Traversing this bounding volume hierarchy (BVH) results in a logarithmic amount of simplified ray-box intersections, followed by only a few final ray-polygon intersection tests
when evaluating the BVH leaf nodes. Following a verified hit, statistics are gathered at the hit location depending on the type of the facet. A particle either remains in the system, getting reflected, or it gets pumped.
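The following sketch condenses these steps into a single particle history; the types and helper routines are hypothetical stand-ins for Molflow's actual desorption, intersection, counting, and reflection code:

#include <vector_types.h>   // float3

struct Rng;                 // opaque RNG state
struct Ray { float3 origin, direction; float velocity; };
struct Hit { int facet; float3 position; };

// Hypothetical helpers standing in for Molflow's actual routines.
Ray  sampleDesorption(Rng& rng);            // desorption model (section 2.1)
Hit  intersect(const Ray& ray);             // BVH traversal + facet test
void recordStatistics(const Hit& hit);      // per-facet counters
bool isAbsorbing(int facet, Rng& rng);      // absorption decision
Ray  sampleDiffuseReflection(const Hit& hit, Rng& rng);

// One TPMC particle history: desorb, then bounce until pumped.
void traceParticle(Rng& rng)
{
    Ray ray = sampleDesorption(rng);
    for (;;) {
        Hit hit = intersect(ray);
        recordStatistics(hit);
        if (isAbsorbing(hit.facet, rng)) break;   // particle gets pumped
        ray = sampleDiffuseReflection(hit, rng);  // particle stays in the system
    }
}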
2.2. GPU ray tracing with OptiX 7

Not least due to high-level languages such as CUDA or OpenCL, GPUs have been used to solve general problems outside of rendering or linear algebra. GPU computing is widely used for scientific applications, for example, for data analysis or data reduction in high energy physics; see, for example, Schouten et al. (2015) or vom Bruch (2020). In principle, Molflow+'s ray-tracing algorithm is well suited for the SIMD architecture of GPUs. Given the independent nature of particles in free molecular flow, one can maximize the parallel workload by generating and tracing individual particles per thread. Nvidia's OptiX 7 is a ray-tracing API that provides a simple interface to leverage the capabilities of the newly introduced RT units on compatible GPUs. Other common choices are Microsoft's DXR and the Vulkan API. We decided on utilizing OptiX 7 not only because we expect the best results staying in the Nvidia universe, but also because it is CUDA-centric and provides compatibility with older architectures back to Maxwell. In this study, we mainly focus on the recent RTX hardware because of the promises of the RT cores. Details about how the RT cores work on the Turing architecture can be found in the corresponding white paper (NVIDIA, 2018: 30).

A full ray-tracing kernel in OptiX consists of a ray generation, a ray-primitive intersection test, and a hit program. The ray-generation routine initializes an individual ray with starting parameters and starts the trace. Afterwards, the ray will be tested against the geometry with the built-in scene traversal algorithm as well as the built-in triangle intersection or a custom primitive intersection test to find a potential hit location. At last, either just the closest intersection or every intersection—for example, when passing through non-opaque primitives—on a ray's trajectory can be processed with a Closest Hit or an Any Hit program, respectively. This step is used to compute the effects corresponding to the primitive type or material of the hit location in relation to the ray payload, which is an arbitrary data structure that encompasses a ray's individual attributes and other necessary information along its path. In computer graphics, this step relates to a shading routine. We like to refer to it as "Trace Processing" for generality. If no valid intersection is found, a Miss program will be called. As emphasized in the OptiX Programming Guide (see NVIDIA (2020)), utilizing an Any Hit program negatively impacts the performance. In this study, with the goal to maximize the performance benefits of the RTX technology and with the assumption that it is a feasible strategy, we utilize only a Closest Hit and a Miss program.
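As a minimal illustration of this program set, the two programs we use besides ray generation could be declared as follows (the program names and the payload convention are ours, not prescribed by the API):

#include <optix.h>

extern "C" __global__ void __closesthit__facet()
{
    // One payload register carries the particle/thread index (our convention).
    const unsigned int tid   = optixGetPayload_0();
    const unsigned int facet = optixGetPrimitiveIndex();
    // "Trace Processing": gather statistics, decide reflection vs. absorption.
}

extern "C" __global__ void __miss__molflow()
{
    // No intersection found: in a closed vacuum component this indicates a
    // leak of arithmetical origin (see section 3.2.1); flag it for analysis.
}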
Figure 1 shows a sketch of the OptiX 7 pipeline. The green labels highlight the parts that can make use of proprietary algorithms, which are hardware-accelerated on compatible GPUs. Scene traversal always utilizes the built-in method. Further, we can see that for maximum leverage of the RTX hardware, we have to use the built-in intersection routine, if feasible. For a complete study, we also implement Molflow's standard approach, where a ray-polygon intersection algorithm is utilized, and compare it to the RTX-capable ray-triangle intersection test. It is possible to launch so-called secondary rays recursively from the "Trace Processing" step, which allows skipping the ray-generation step in the pipeline and is useful for further following particles that are reflected from the boundary of the vacuum volume.

Figure 1. Ray-tracing pipeline with RTX technology, cf. Wald and Parker (2019).

A challenge, which had already posed problems in a previous attempt to port Molflow's simulation engine to the GPU, is to minimize the effects of thread divergence to properly utilize the SIMD architecture. Thread divergence occurs when the post-processing step for a particle differs depending on the interaction with a particular facet type. In Molflow, particles can reflect off or get absorbed by a facet—in a deterministic or probabilistic way—neglecting more advanced techniques for simplicity.

With OptiX 7, we can make use of the so-called Shader Binding Table (SBT), which allows us to map specific programs to individual facets or materials—think of function pointers. We assume that OptiX's internal execution scheduling algorithm in combination with the SBT lowers the negative impact of diverging threads, as it allows for manageable reordering or grouping of functions (see Parker et al. (2010)). These benefits could also be attributed to the RTX technology itself, as it supposedly implements some sort of GPU work creation (see Sanzharov et al. (2019)), allowing for efficient handling of workloads and direct invocation of ray tracing on the GPU.
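A minimal sketch of such a mapping, with a hypothetical FacetData payload stored in each hit-group record, could look as follows:

#include <optix.h>
#include <optix_stubs.h>

// Hypothetical per-facet data carried in each SBT hit-group record; the
// Closest Hit program selected through the record sees it directly.
struct FacetData { int facetId; float sticking; /* absorption probability */ };

template <typename T>
struct SbtRecord
{
    __align__(OPTIX_SBT_RECORD_ALIGNMENT) char header[OPTIX_SBT_RECORD_HEADER_SIZE];
    T data;
};
using HitGroupRecord = SbtRecord<FacetData>;

// Host side: one record per facet (or per material class).
HitGroupRecord makeRecord(OptixProgramGroup hitGroup, int facetId, float sticking)
{
    HitGroupRecord rec;
    optixSbtRecordPackHeader(hitGroup, &rec); // bind the record to its program
    rec.data = FacetData{ facetId, sticking };
    return rec;
}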
3. Molflow's GPU implementation

In this section, we present our design for the ray-tracing programs using the OptiX API, as well as choices related to memory and random number generation. We give a brief explanation of the parts that are straightforward ports of the CPU algorithm, which is discussed in greater detail by Ady (2016).

The OptiX API handles the traversal and intersection routines directly with built-in algorithms and utilizes the hardware-acceleration features of Nvidia RTX GPUs; thus, only the ray-generation and trace-processing procedures are the main concern when trying to achieve a high-performance ray-tracing algorithm. For a complete study, we compare the performance of the built-in ray-triangle intersection with a port of Molflow's ray-polygon intersection routine. One major restriction of the API is that the built-in algorithms work only on 32 bit representations of the geometry and the rays. This drawback will be discussed together with the performance and precision study of the kernel (see section 4).

3.1. Geometry

For our study, we utilize a cylindrical tube serving as a vacuum component, for which an analytical solution can be used for the benchmarks (see Gómez-Goñi and Lobo (2003)). The cylindrical inlet (desorption facet) defines a steady influx of particles with an outgassing rate of 10 mbar l s⁻¹. Further, both end facets serve as perfect absorbers—removing all particles from the system. The side facets are defined as perfectly diffusive—reflecting all particles. The tube can be defined with various length/radius ratios (L/R). In Molflow, the circular form is approximated with a finite amount of side facets. In our numerical experiments, the L/R ratios and the levels of approximation were varied, as indicated in the respective sections.

Figure 2 shows the geometry in Molflow both as a polygon and as a triangle mesh with an exemplary ratio of L/R = 10. In this example, the circular inlet (green circle) and outlet (red circle) are each approximated with 10 vertices, which results in 10 rectangular facets for the side walls, thus 12 polygons in total. For the GPU kernel, the geometry is further triangulated, resulting in a mesh with 36 triangles in total for the same amount of vertices.

Figure 2. Cylindrical tube with a length/radius ratio of L/R = 10 in Molflow. The top shows an approximation with a polygon mesh, the bottom with a triangle mesh. The green lines show arbitrary particle paths.
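In general, a tube approximated with n side facets consists of n + 2 polygons (n rectangular side facets plus the two end caps), while splitting each side rectangle and triangulating each n-gon cap (n − 2 triangles each) yields 2n + 2(n − 2) = 4n − 4 triangles; n = 10 gives the 12 polygons and 36 triangles above, and n = 100 gives the 102 facets of the memory example in section 3.6.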
3.2. Ray generation

Handling a new particle, the ray-generation program will initially generate the starting parameters according to Molflow's physical model. The minimal set of attributes for a ray-tracing engine consists of the source location (ray origin) and the trajectory (ray direction), where Molflow also accounts for the particle velocity. Next, the actual ray tracing can be initiated. Compared to Molflow's original implementation, the ray-generation program also handles particles that reside inside the system, for example, after a collision. Here, the set of attributes can simply be fetched from memory to initiate the ray tracing. Intuitively, one would directly call the ray-tracing routine again given the new parameters. For OptiX, this is only possible via recursion.

3.2.1. Ray origin. The source location has to be found in two steps, which slightly differ for polygons and triangles,
Further problems arise when we compute the exact hit location and take this value as the next ray origin to relaunch a ray from the corresponding surface: a secondary ray. The new location could be classified on either side of the polygon due to arithmetic error, as discussed in great detail by Wächter and Binder (2019). Potential workarounds differ with the use of back-face culling. A polygon's front is identified by its surface normal, that is, the perpendicular vector corresponding to the surface. Without culling, both faces of a polygon serve for possible intersection points. With back-face culling, an intersection is only valid if the surface normal is facing the ray direction.

Considering we do not deploy back-face culling, there are a few possibilities of what could happen in case the ray origin is found to be outside. In the best case, it will simply get redirected into the body of the mesh. In the worst case, the hit location is on the edge between two non-coplanar triangles and the ray direction is quasi-parallel to the neighboring triangle plane. In this case, the ray will not find a collision point with the neighboring triangle: a miss. In addition to back-face culling, we can reduce the amount of misses of arithmetical origin by deploying an adaptive offset along the facet normal. We chose the strategy presented by Wächter and Binder (2019), as it leads to minimal computational overhead and minimal impact on the accuracy compared to other options. In an experiment with the introduced cylindrical tube approximated with 1000 vertices per end cap, we found that back-face culling by itself resulted in a miss-hit ratio of 1.35 × 10⁻⁷ and the adaptive offset in a ratio of 1.44 × 10⁻⁶. Deploying both solutions reduced the rate to only 5.32 × 10⁻¹¹, and therefore, this was used as the default strategy for further simulations.

Possible consequences for the given geometry in case of a miss can be neglected. If there was a surrounding shell, the ray would hit that surface instead, leading to a series of unexpected hits. Depending on the complexity of a geometry, this can lead to uninterpretable results.
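The adaptive offset itself follows Wächter and Binder (2019); a sketch of the published approach, ported to CUDA with the published scaling constants, reads as follows (treat it as an illustration rather than our exact kernel code):

// Offset a secondary-ray origin p along the facet normal n, scaling the
// displacement with the magnitude of p: integer ULP steps for large
// components, a small fixed epsilon near the coordinate origin.
__device__ float3 offsetRayOrigin(const float3 p, const float3 n)
{
    constexpr float origin     = 1.0f / 32.0f;
    constexpr float floatScale = 1.0f / 65536.0f;
    constexpr float intScale   = 256.0f;

    const int ox = int(intScale * n.x), oy = int(intScale * n.y), oz = int(intScale * n.z);
    const float px = __int_as_float(__float_as_int(p.x) + ((p.x < 0.0f) ? -ox : ox));
    const float py = __int_as_float(__float_as_int(p.y) + ((p.y < 0.0f) ? -oy : oy));
    const float pz = __int_as_float(__float_as_int(p.z) + ((p.z < 0.0f) ? -oz : oz));

    return float3{ fabsf(p.x) < origin ? p.x + floatScale * n.x : px,
                   fabsf(p.y) < origin ? p.y + floatScale * n.y : py,
                   fabsf(p.z) < origin ? p.z + floatScale * n.z : pz };
}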
Algorithm 1 Ray-generation algorithm
1: function RAYGENERATION(Particle p)
2:   (p.terminate == true) → return
3:   switch p.state do
4:     case new
5:       INIT_STATE()
6:     case active
7:       LOAD_STATE()
8:   APPLY_OFFSET(p.position)
9:   OPTIXTRACE()  ⊳ Launch ray with OptiX API
10: end function

The ray-generation procedure is sketched in algorithm 1. A particle is either initialized with the described model or with its state from a previous launch. After the offset has been applied, the particle is launched with a call to the OptiX API to trace the ray and identify a potential intersection. Initially (see line 2), a check is conducted to see whether a thread can terminate early due to a fulfilled end condition, which is set on the CPU.
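In OptiX terms, algorithm 1 maps onto a ray-generation program along the following lines; the Params layout, the Particle structure, and initState() are illustrative, and offsetRayOrigin() refers to the sketch in section 3.2.1:

#include <optix.h>

enum ParticleStatus { STATE_NEW, STATE_ACTIVE };

struct Particle
{
    float3 position, direction, normal;  // normal of the last hit facet
    int    state;
    bool   terminate;
};

struct Params
{
    OptixTraversableHandle handle;
    Particle*              particles;   // one slot per thread
    int*                   endFlag;     // end condition, set on the CPU
};
extern "C" { __constant__ Params params; }

__device__ void   initState(Particle& p);                 // hypothetical: desorption model
__device__ float3 offsetRayOrigin(float3 p, float3 n);    // see section 3.2.1

extern "C" __global__ void __raygen__molflow()
{
    const unsigned int tid = optixGetLaunchIndex().x;
    Particle& p = params.particles[tid];
    if (*params.endFlag || p.terminate) return;           // early termination (line 2)

    if (p.state == STATE_NEW)
        initState(p);                                     // sample the desorption model
    // STATE_ACTIVE: attributes were loaded from the previous launch

    p.position = offsetRayOrigin(p.position, p.normal);   // adaptive offset

    unsigned int u0 = tid, u1 = 0u;                       // payload: particle index, depth
    optixTrace(params.handle, p.position, p.direction,
               0.0f, 1e30f, 0.0f, OptixVisibilityMask(255),
               OPTIX_RAY_FLAG_DISABLE_ANYHIT,
               0u, 1u, 0u,                                // SBT offset, stride, miss index
               u0, u1);
}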
3.3. Intersection test

RT cores accelerate BVH traversal as well as ray-triangle intersection in hardware. For our study, we investigate both triangle and polygon meshes. To utilize hardware-accelerated BVH traversal for a polygon mesh, a set of AABBs has to be provided.

For triangles, we use the built-in intersection routine, while we also provide a custom port of the two-step ray-polygon test routine, which is used in Molflow. First, a ray-rectangle intersection is used, where we solve a system of linear equations via Cramer's rule (see Shirley and Marschner (2009)). This is followed by a point-in-polygon test in 2D space, where the ray-casting algorithm provided by Franklin (2018) gave the best results in single precision, both in performance and in reliability in terms of the false-negative rate.
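A single-precision port of this ray-casting test, operating in the facet's local 2D coordinates, can be sketched as follows:

#include <vector_types.h>   // float2

// Point-in-polygon test after Franklin (2018): count edge crossings of a
// horizontal ray through p; an odd count means p lies inside the polygon.
__device__ bool pointInPolygon(const float2* vert, int nVert, float2 p)
{
    bool inside = false;
    for (int i = 0, j = nVert - 1; i < nVert; j = i++) {
        if (((vert[i].y > p.y) != (vert[j].y > p.y)) &&
            (p.x < (vert[j].x - vert[i].x) * (p.y - vert[i].y) /
                       (vert[j].y - vert[i].y) + vert[i].x))
            inside = !inside;
    }
    return inside;
}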
3.4. Trace processing

Following the calculation of a potential hit location, the events are accumulated in hit counters for statistical purposes. The gathered statistics correspond not only to the plain amount of collisions or "MC Hits" Nhit, but also to the sum of orthogonal momentum changes of particles ΣI′ and the sum of reciprocals of the orthogonal speed components of particles Σv′⁻¹. From these three counters, the most important values for vacuum components can be derived, for example, the pressure or density, as explained by Ady (2016).

The counters are increased according to the collision event corresponding to the facet type. Besides the per-facet hit counters, the users can also enable more fine-grained counters in the form of textures (N × M) or profiles (1 × N), which divide a facet in two directions or one direction, respectively. Here, N, M ∈ ℕ⁺ describe the spatial resolution. If used, an extra step to calculate the index of the texture element or profile bin has to be deployed. Texture and profile coordinates are given in local coordinates (u, v) ∈ [0,1]² in 2D space, which simplifies indexing of the underlying data structure. The coordinates are already provided by the polygon intersection test, as they are part of the calculation. For a texture mapped onto a triangle, we provide local texture coordinates for each vertex A, B, C ∈ [0,1]² and use the barycentric coordinates α, β, γ ∈ [0, 1] returned by the intersection test to interpolate the exact local hit location p = (u, v) = αA + βB + γC.
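A sketch of the resulting index calculation for a triangle hit, assuming the per-vertex texture coordinates are stored alongside the mesh, could read:

#include <optix.h>

// Map a triangle hit to its texture element. OptiX returns the barycentrics
// (beta, gamma) of the hit point; alpha = 1 - beta - gamma.
__device__ unsigned int texelIndex(float2 A, float2 B, float2 C,
                                   unsigned int N, unsigned int M)
{
    const float2 bg    = optixGetTriangleBarycentrics();  // (beta, gamma)
    const float  alpha = 1.0f - bg.x - bg.y;
    const float  u     = alpha * A.x + bg.x * B.x + bg.y * C.x;
    const float  v     = alpha * A.y + bg.x * B.y + bg.y * C.y;
    const unsigned int iu = min((unsigned int)(u * N), N - 1u);
    const unsigned int iv = min((unsigned int)(v * M), M - 1u);
    return iv * N + iu;   // row-major texel index (illustrative layout)
}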
launch in one go. For every thread, the memory demand is usually not fixed and depends on the deployed design solutions. Accounting for about 64 B for the particle state and an additional 48 B for an RNG state, or TC × 32 B + 4 B when using batch generation with 8 single-precision random numbers per cycle and an additional index for the next random number, the remaining memory can be used for the hit buffers. To prevent race conditions or excessive access synchronization, every thread ideally has its own hit buffer per facet, which amounts to 7 × 4 B. In addition, when utilizing textures or profiles for data collection on a more fine-grained scale, we have to add 3 × 4 B per texture element or profile bin. Obviously, having individual counters per thread increases performance, since we do not have to utilize any sort of synchronization, for example, in the form of atomics.
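An illustrative layout matching these byte counts is shown below; the exact field layout in our kernel differs, but the sizes line up (a cuRAND XORWOW state, for instance, occupies the quoted 48 B):

#include <vector_types.h>
#include <curand_kernel.h>

struct ParticleState            // 64 B in total (fields illustrative)
{
    float3 position, direction; // 24 B
    float  velocity;            //  4 B
    int    facetId;             //  4 B
    int    state;               //  4 B
    int    pad[7];              // remaining fields / padding up to 64 B
};

struct ThreadState
{
    ParticleState particle;     // 64 B
    curandState   rng;          // 48 B (XORWOW)
};

struct FacetHitCounter          // 7 x 4 B per facet and buffer copy
{
    unsigned int nbMCHit;
    float        sums[6];       // e.g., sums of I', 1/v', ... (illustrative)
};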
For example, considering a simulation on an Nvidia RTX 2060 as the reference GPU, we found that deploying around 128 × 16,384 threads (see section 4.1) saturates the GPU nicely. Now, given a simple pipe geometry with a total of only 102 facets1 as an example, we would need to allocate around 128 × 16,384 × (64 + 36) B = 210 MB for the rays' states and additionally around 5989 MB if every thread had its own hit counters, thus easily reaching the full 6 GB VRAM capacity without deploying any additional fine-grained counters. This is obviously not a feasible strategy, as the memory limit would already be reached with the other memory requirements. Therefore, we utilize one shared facet buffer by default, which is modified with atomic operations. In section 4.2, we analyze the option to use multiple buffers shared among a group of threads.
4. Performance and precision study

To account for all types of Molflow users, we decided to provide benchmarks for two sets of hardware. The first set accounts for the average Molflow user, where we consider consumer-grade hardware as found in laptops or office computers, on which simulations in early phases are usually run. We utilize a Turing GPU with RTX technology of the first generation. With a stronger emphasis on complex and time-demanding simulations, the second set focuses on high-end hardware.

• Set 1: {CPU i7-8557U, GPU RTX 2060}
• Set 2: {CPU Epyc 7302P, GPU RTX 3090}

The software was compiled with GCC 10.2 and level 3 optimizations on all machines, running an Ubuntu 20.04-based system. The GPU kernel was compiled with OptiX 7.1 and CUDA 10.1. As input geometry, all experiments utilize a cylindrical tube, as previously introduced in section 3.1. Henceforth, we will refer to these pipes as Glr,n for simplicity, where index lr denotes the length/radius ratio and index n denotes the amount of side facets used to approximate the circular shape. For most tests, a pipe with a length/radius ratio of L/R = 100 and either 100 or 1000 side facets is used: G100,100 and G100,1000, respectively. This configuration has an analytic solution, which can be used as a benchmark. The inlet (desorption facet) defines a steady influx of particles. Further, both end facets serve as perfect absorbers, removing all particles from the system. The side facets are defined as perfectly diffusive, reflecting all particles. Algorithmically, we utilize both the built-in intersection routine provided by the OptiX API and a custom polygon intersection routine. Using the former, one can see hardware-acceleration effects on both BVH traversal and intersection testing, whereas for the custom routine, only BVH traversal can benefit from these effects. For a fair comparison with the CPU algorithm of Molflow+, we utilize an updated version of the code, with major improvements to the BVH structure and intersection routine.

4.1. Amount of threads

First, we run a simple experiment to find out how different amounts of simultaneously launched threads influence the performance. We use power-of-two multiples of 128 threads, which is the amount of CUDA cores per streaming multiprocessor on the RTX 3090, scaling up to 2²⁵ threads in total.

In Figure 4, we can see that the performance on the overall faster RTX 3090 is unsteady for the simulation of the G100,100 geometry. It is an effect of the atomic operations for the hit counters, which we investigate in the next experiment. Further, we can see that the performance increase stagnates between 128 × 8192 = 1,048,576 and 128 × 16,384 = 2,097,152 simultaneous threads for the Nvidia RTX 3090 and the RTX 2060, respectively, for both geometries. This seems to be the sweet spot where the GPU is saturated enough, and we can choose these values for memory-critical simulations.

4.2. Extra buffers

As described in section 3.6, atomic operations can be used to effectively counter race conditions. Considering that certain facets may be hit frequently, using a single hit buffer for each facet saves memory, but can negatively impact the overall performance. We investigate the benefits of utilizing multiple buffers per facet, where the buffers are split equally among all threads in a warp.

For the G100,100 geometry, Figure 5 shows that atomic operations can become a bottleneck at increased performance, as the probability of simultaneous access increases. The effect can be reduced by utilizing multiple counters instead of only a single one, where 4 counters are sufficient in this case. A triangle describing a side facet has an average hit chance of
around 0.45%, decreasing the chance to 0.11% for one of four counters. As the effect was hardly visible with the slower GPU (RTX 2060) or with a geometry with more facets (G100,1000), because of the lower hit frequency per facet, it is not necessary for most geometries to utilize multiple counters, as memory is likely more of a concern.
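A sketch of this striping, with the seven-value facet counter reduced to three fields for brevity (names illustrative), could look like this:

// Stripe per-facet counters across numBuffers copies: warp lanes map to
// different copies, lowering the probability of simultaneous atomic access.
struct FacetHitCounter { unsigned int nbMCHit; float sumIOrtho; float sumVOrthoInv; };

__device__ void countHit(FacetHitCounter* counters, int numBuffers, int numFacets,
                         int facetId, float iOrtho, float vOrthoInv)
{
    const int lane = threadIdx.x & 31;        // lane id within the warp
    const int copy = lane % numBuffers;       // which buffer copy to use
    FacetHitCounter& c = counters[copy * numFacets + facetId];
    atomicAdd(&c.nbMCHit, 1u);
    atomicAdd(&c.sumIOrtho, iOrtho);
    atomicAdd(&c.sumVOrthoInv, vOrthoInv);
}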
4.3. Recursion and RNG

To evaluate the effects of the different approaches for random number generation—ad hoc and batched—and the possible benefits coming from utilizing a recursive kernel for the launch of secondary rays in the case of reflections, we compared the results for different geometries (G100,100 and G100,1000) on both GPUs. The simulations are run with the corresponding thread numbers found in the previous experiment: 128 × 8192 = 1,048,576 and 128 × 16,384 = 2,097,152 for the RTX 3090 and the RTX 2060, respectively. For the geometry G100,100, we deploy 16 hit buffers for the simulations on the RTX 3090.

In Figure 6, we see that random number generation in batches does yield better performance in all cases in conjunction with our ray-tracing kernel. Surprisingly, generating more random numbers in advance does not speed up, but instead slightly slows down the simulations. We found that this is mostly cache related: generation for one cycle already leverages the positive effects, while multiple cycles slightly suffer from fewer hits in both the L1 and the L2 cache. The ad hoc generation of random numbers was inferior in all cases compared to the single-cycle batch generation (TC = 1).
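Assuming cuRAND as the generator, the single-cycle batch scheme (TC = 1) can be sketched as follows:

#include <curand_kernel.h>

constexpr int TC = 1;                               // cycles per batch, 8 floats each

struct RandBatch { float r[8 * TC]; int next; };    // TC x 32 B + 4 B

__device__ float nextRand(curandState* state, RandBatch* batch)
{
    if (batch->next == 8 * TC) {                    // batch exhausted: refill in one go
        for (int i = 0; i < 8 * TC; ++i)
            batch->r[i] = curand_uniform(state);
        batch->next = 0;
    }
    return batch->r[batch->next++];
}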
For vacuum geometries, particles usually reside inside the system for a large amount of collision events. This property makes them suitable for any level of recursion that can be deployed within the applicable memory constraints. For example, this is the case with the given test geometry, where on average each particle yields around ∼100 events until it exits from the system. By contrast, a cylindrical tube with a ratio of L/R = 1 instead of L/R = 100 does not benefit a lot from recursion, as particles yield only around ∼2 events on average.
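For illustration, a diffuse reflection launched recursively from the Closest Hit program could be sketched as follows; facetNormal(), sampleDiffuseDirection(), and MAX_TRACE_DEPTH are hypothetical, params is the structure from the ray-generation sketch, and offsetRayOrigin() refers to section 3.2.1:

__device__ float3 facetNormal(unsigned int primitiveId);              // hypothetical lookup
__device__ float3 sampleDiffuseDirection(float3 n, unsigned int tid); // hypothetical sampler
__device__ float3 offsetRayOrigin(float3 p, float3 n);                // section 3.2.1
constexpr unsigned int MAX_TRACE_DEPTH = 8;                           // illustrative limit

extern "C" __global__ void __closesthit__reflect()
{
    const unsigned int tid   = optixGetPayload_0();
    const unsigned int depth = optixGetPayload_1();
    // ... gather statistics for the hit facet here ("Trace Processing") ...
    if (depth >= MAX_TRACE_DEPTH) return;   // continue via ray generation instead

    const float3 o = optixGetWorldRayOrigin();
    const float3 d = optixGetWorldRayDirection();
    const float  t = optixGetRayTmax();
    float3 pos{ o.x + t * d.x, o.y + t * d.y, o.z + t * d.z };

    const float3 n = facetNormal(optixGetPrimitiveIndex());
    pos = offsetRayOrigin(pos, n);
    const float3 dir = sampleDiffuseDirection(n, tid);

    unsigned int u0 = tid, u1 = depth + 1;
    optixTrace(params.handle, pos, dir, 0.0f, 1e30f, 0.0f,
               OptixVisibilityMask(255), OPTIX_RAY_FLAG_DISABLE_ANYHIT,
               0u, 1u, 0u, u0, u1);
}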
4.4. Performance

We analyze the raw performance of the ray-tracing engine in two experiments. First, we run simulations on the specified geometry without any modifications to highlight the impact of the actual ray tracing on the performance. Next, we include textures on all facets to put a heavier load on the

Figure 4. Performance measured in hits per second (1 MHit = 10⁶ hits) in relation to simultaneously launched threads, where the x-axis is log2-scaled. Memory limits are reached for higher thread numbers on the RTX 2060, resulting in 0 MHit/s.

Figure 5. Effect of atomic operations in combination with a variable amount of facet buffers N on the performance, given more simultaneously launched threads, for the G100,100 geometry simulated on an Nvidia RTX 3090.
Table 1. Performance measured in MRay/s (10⁶ Ray/s) for the geometry with different approximations (G100,100 and G100,1000) and with (w/) or without (w/o) textures. The GPUs run on either a polygon mesh (Poly) or a triangle mesh (Tri).
Figure 6. Performance for the Molflow GPU algorithm for a pipe with different configurations on both RTX GPUs. Results are
generated for ad hoc and batched random number generation with varying cycles TC and recursive depths.
Table 2. Transmission probabilities W{GPU,CPU} for cylindrical tubes Glr,100 approximated with 100 side facets and varying L/R ratio, as calculated with simulations for 10⁹ desorbed particles with the GPU kernel and Molflow's CPU kernel, respectively. The corresponding Δ values denote the absolute difference from the reference value Wref. The speedup for each set is given as S = P_CPU/P_GPU, where P denotes the corresponding performance measurement.
hit location, could possibly have a big impact. This is likely due to the 32 bit floating point limitation of the dedicated ray-tracing units, which is a common problem for ray tracers (see Haines and Akenine-Möller (2019)). Thus, for further conclusions, we included a modified CPU algorithm in our test set, which uses 32 bit precision for the geometry and ray description to correspond to the RTX hardware limitations. As the CPU algorithm has been found to be unstable when run completely with 32 bit floating point precision, the remaining parts were kept in 64 bit precision. We evaluated the transmission probability W after 10⁹ desorptions2 for 50 runs3 for the CPU and GPU algorithms and compared them to an analytical solution (see Gómez-Goñi and Lobo (2003)). The transmission probability is the ratio of particles that got absorbed on the end facet opposite to the desorption facet to the total amount of desorbed particles.
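Written as a formula, with Ndes desorbed test particles of which Nabs are absorbed on the opposite end facet, this reads W = Nabs/Ndes.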
As can be seen in Figure 7, we find that the converged results are sufficiently close on all architectures. Considering the mean values, the results of all methods converge towards the analytical solution, staying at least within the εrel = 10⁻⁴ error margin. Increasing the level of approximation, G100,1000, the results converge with at most an error margin of εabs = 10⁻⁵. Calculations with 32 bit floating point precision do not seem to have a major impact on the precision, and the spans between the minimum and maximum for each set are also comparable between the different architectures. Hence, if there is a demand for higher precision, a better approximation of the geometry has to be considered first.

Further, we show in Table 2 that the behavior is similar for other geometries. We compare the results for simulations on cylindrical tubes with various length–radius ratios. The results do not differ for either configuration by a significant amount beyond the expected MC fluctuations.

5. Conclusion and future work

We developed a GPU kernel powered by Nvidia's OptiX 7 API to utilize the novel ray-tracing cores found in recent GPU architectures. We analyzed different techniques to design the algorithm to achieve peak performance for the given geometry, while also taking memory into account. Our design achieved major speedups on budget hardware (×63 to ×175) as well as on high-end hardware (×30 to ×50), without a heavy influence on the precision.

In this study, we gave a brief outlook on possible issues that can arise for physical simulations as a trade-off for a highly performant GPU kernel fully utilizing new RTX hardware. The adaptive offset mitigates effects such as displaced ray origins, but it does not solve all issues by itself; some of these problems have also been discussed by Wächter and Binder (2019). A marginal, but noticeable, risk of offsetting a ray outside of another facet remains. Possible solutions would be to use a different approach in the form of OptiX's Any Hit routine or to account for neighboring facets that join at a sharp angle. Such techniques are necessary for the extension of the current implementation, when other facet properties and more complex geometries have to be taken into account, as these are more prone to single-precision errors and thus demand more robust methods. A detailed analysis and design study for such problems and the implementation of Molflow's full feature set are left for future work.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.