
Improving The Performance of A Ray Tracing Algorithm Using A GPU

This document summarizes a paper that presents three GPU implementations of a ray tracing algorithm to improve computational efficiency compared to a CPU implementation. The experimental results showed significant speedups of up to 13x when using a GPU. GPUs are well-suited for ray tracing since each pixel calculation can be done independently in parallel. The paper is structured to first describe GPU and CUDA architectures, then introduce ray tracing, explain the three GPU implementations, and finally analyze performance results.

XXIX International Conference of the Chilean Computer Science Society

Improving the Performance of a Ray Tracing Algorithm using a GPU

Santiago Cioli∗, Gonzalo Ordeix†, Eduardo Fernández‡, Martín Pedemonte§, Pablo Ezzatti¶
Instituto de Computación, Facultad de Ingeniería,
Universidad de la República
Montevideo, Uruguay
Email: ∗ [email protected], † [email protected]
[email protected], § [email protected], ¶ [email protected]

Abstract—This article presents the application of parallel computing techniques using a Graphics Processing Unit (GPU) in order to improve the computational efficiency of a ray tracing algorithm. Three different GPU implementations of the ray tracing algorithm are presented. The experimental evaluation of the proposed methods demonstrates that a significant reduction of the computing time can be obtained when compared with a CPU implementation, making a step forward to the real-time calculation of scene brightness on desktop computers.

Keywords-Ray tracing; GPU; Real-time

I. INTRODUCTION

A scene consists of a collection of objects and light sources seen through a camera. Each object in a scene is a geometric primitive, a simple geometric shape like a polygon, a sphere or a bicubic surface. Additionally, the surface of each object has material properties, textures, etc. All global illumination techniques try to solve the problem of finding a set of photorealistic images for a given scene. These algorithms usually differ in how they handle the lighting of the scene.

Several kinds of global illumination algorithms can be identified, based on the different light elements considered. Radiosity [11], ray tracing [25] and multipass methods (like RADIANCE [24] and photon mapping [13]) produce realistic images in diverse scenarios with several kinds of surfaces. Radiosity works well in scenes with Lambertian surfaces, ray tracing produces good images in scenes with specular surfaces, and the multipass methods are more versatile because of the mixture of methodologies they use. Nowadays, the development of a global illumination algorithm that generates at least twenty images per second (that is, in real time) is a great challenge for the computer graphics research community.

The ray tracing algorithm calculates the brightness of each pixel of an image by throwing rays and evaluating their bounces off the objects of the scene. Each bounce produces one or more rays that hit other objects, and so on. The color of the pixel is composed of the color of the first object considered and the color added with each bounce. This algorithm was one of the first steps into photorealistic rendering, and its success is due to its capacity to generate good quality images using simple code.

Graphics Processing Units (GPUs) are devices originally designed for graphics processing, lightening the workload on the CPU in applications such as video games. Thus, the CPU can be used to perform other computations while most of the graphics processing calculations are performed on the graphics device. GPUs are currently very powerful platforms, provided with tens or hundreds of cores with acceptable clock frequencies (500-600 MHz). Additionally, the computing power of GPUs is enhanced by their intrinsically parallel architecture.

Initially, the progress in the design of GPUs was not matched by an advance in software capabilities. This changed in 2006, when NVIDIA released CUDA (Compute Unified Device Architecture) [7]. CUDA is an architecture for general purpose parallel computing that allows the use of parallel processing in these devices to solve a wide variety of problems more efficiently than is possible with a CPU. The ray tracing algorithm is highly parallelizable since the calculation of lighting in each pixel is an independent process, and it is therefore very suitable for GPUs.

This article studies the implementation of the ray tracing algorithm on a GPU in order to reduce the computation time. Three parallel versions were developed, in order to exploit different characteristics of the ray tracing algorithm and the GPU architecture. An analysis of the performance was conducted, measuring the number of frames that could be calculated per second. The preliminary results show that large improvements can be obtained (up to 13×) using a GPU instead of a standard multicore CPU, such as the ones used in this analysis. The GPU hardware is also convenient in price; for instance, the GPUs used in this paper are currently cheaper than a standard multicore CPU.

The content of the article is structured as follows. The next section describes the main features of modern GPUs and the CUDA architecture. Then, Section III introduces the ray tracing algorithm. Section IV describes the three different GPU implementations of ray tracing presented in this article. The experimental evaluation of the proposed methods is reported in Section V, where the results are also analyzed. Finally, Section VI presents the conclusions of this research and formulates the main lines of future work.
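As a rough illustration of the per-pixel independence mentioned in the introduction, the following C sketch builds the primary ray of a pixel from nothing but the pixel coordinates, the image size, the observer position and the view plane. The `Vec3`, `ViewPlane` and `Ray` types, the `primary_ray` name and the way the view plane is described (one corner plus two edge vectors) are assumptions made for this example; they are not taken from the implementation described in this article.

```c
typedef struct { double x, y, z; } Vec3;

/* Assumed description of the view plane: its top-left corner and the two
   edge vectors that span its full width and its full height. */
typedef struct { Vec3 corner, right, down; } ViewPlane;

typedef struct { Vec3 origin, dir; } Ray;

/* Hypothetical helper: primary ray from the observer (eye) through the
   centre of pixel (px, py). The result depends only on the arguments, so
   each pixel could be handled by its own thread without communication. */
Ray primary_ray(Vec3 eye, ViewPlane vp, int px, int py, int width, int height)
{
    /* Centre of the pixel as fractions of the plane's width and height. */
    double u = (px + 0.5) / width;
    double v = (py + 0.5) / height;

    /* Point on the view plane that corresponds to the pixel centre. */
    Vec3 p;
    p.x = vp.corner.x + u * vp.right.x + v * vp.down.x;
    p.y = vp.corner.y + u * vp.right.y + v * vp.down.y;
    p.z = vp.corner.z + u * vp.right.z + v * vp.down.z;

    /* The ray starts at the observer and points towards that point
       (the direction is left unnormalized to keep the sketch short). */
    Ray r;
    r.origin = eye;
    r.dir.x = p.x - eye.x;
    r.dir.y = p.y - eye.y;
    r.dir.z = p.z - eye.z;
    return r;
}
```

With an observer at the origin and a 2 × 2 view plane centred on the negative z axis at distance 1, the centre pixel of a 3 × 3 image yields a ray that points straight down that axis.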

1522-4902/10 $26.00 © 2010 IEEE
DOI 10.1109/SCCC.2010.42

Authorized licensed use limited to: Carleton University. Downloaded on September 15,2020 at 17:41:47 UTC from IEEE Xplore. Restrictions apply.
II. GRAPHIC PROCESSING UNITS

Based on the facilities provided by CUDA [7] for GPU programming, GPUs can be viewed as a set of shared memory multicore processors. Moreover, GPUs are usually considered many-core processors due to the large number of small cores that they contain. GPUs follow the single-program multiple-data (SPMD) parallel programming paradigm, in which cores execute the same program on multiple parts of the data but do not have to be executing the same instruction at the same time [8]. The number of threads that a current GPU can execute in parallel is in the order of hundreds and is expected to continue increasing rapidly, which makes these devices a powerful and low-cost platform for implementing parallel algorithms.

CUDA [15] consists of a stack of software layers including a hardware driver, a C language application programming interface and the CUDA driver, which is dedicated to transferring data between the GPU and the CPU. It is available for all NVIDIA GeForce 8 series GPUs and later. It is compatible with 32/64-bit Linux operating systems and with 32/64-bit Windows XP and later.

The CUDA architecture is built around a scalable multiprocessor array. Each multiprocessor on GPUs based on the G80 architecture consists of eight scalar processors as well as additional units like a multithreading instruction unit and a shared memory chip. When a part of an application runs many times on different data, it can be isolated in a function, called a kernel function, to be executed on the device through many different threads. For this purpose, the kernel function is compiled using the device instruction set and the resulting program is transferred to the device.

When a kernel function is called, a large number of threads are generated on the GPU. The group of all generated threads is called a grid, which is partitioned into many blocks. Each block groups threads that are executed concurrently on a single multiprocessor of the GPU. There is no fixed order of execution between blocks. If there are enough multiprocessors available on the GPU, the blocks are executed in parallel. Otherwise, a time-sharing strategy is used.

Threads can access data across multiple memory spaces during their execution. GPUs based on the G80 architecture have six different memory spaces: registers, local memory, shared block memory, global memory, constant memory and texture memory. Table I presents the main features of the different GPU memory spaces, which are briefly discussed next.

Registers, which are located on the chip, are the fastest memory on the GPU and are only accessible by each thread. In addition, each thread has its own local memory, but it is one of the slowest memories on the GPU because it is located in the device memory and is not cached. Both memory spaces are entirely managed by the compiler. Each block has a shared memory space that is almost as fast as registers and can be accessed by any thread of the block. The shared memory is located on the chip and its lifetime is equal to the lifetime of the block.

Table I
FEATURES OF THE DIFFERENT GPU MEMORY SPACES.
Memory     Scope   Lifetime     Size        Speed
Registers  Thread  Kernel       Very small  Very fast
Local      Thread  Kernel       Small       Slow
Shared     Block   Kernel       Very small  Very fast
Global     Grid    Application  Big         Slow
Constant   Grid    Application  Very small  Fast
Texture    Grid    Application  Very small  Fast

All the threads executing on the GPU have access to the same global memory, which is located on the GPU. The global memory is one of the slowest memories on the GPU and is not cached. On the other hand, constant memory is fast, although it is read-only for the device. It is located in the device memory, but it is cached. In fact, constant memory can be seen more as a cache of the global memory than as a different memory space. Finally, the texture memory has the same characteristics as constant memory.

Figure 1 presents the CUDA architecture diagram, including the six different memory spaces. The figure shows local memory close to the threads and private to each thread. However, local memory is actually located in the device memory, like the global memory.

Figure 1. CUDA memory model.

III. RAY TRACING ALGORITHM

The ray casting algorithm [3] proposed by A. Appel is based on tracing rays from the observer’s viewpoint to a view plane between the observer and the scene. The ray tracing algorithm [25] extends the idea of ray casting by making the process recursive, generating new rays when a ray intersects an object in its way. Each new ray starts at
the intersection point and follows a direction based on the physical laws of refraction and reflection.

Ray tracing achieves great realism in the generated images even though its implementation is quite simple. However, the simplifications used in the lighting model do not allow the generation of caustics caused by light rays reflected or refracted by curved surfaces. Similarly, the calculation of the “ambient light” color component [14], which is a simplification in the lighting calculation, makes the algorithm unable to produce some effects like “color bleeding” (a phenomenon caused by light reflection, in which the color of a surface spreads over the surrounding surfaces).

The ray tracing algorithm works as follows. For each pixel of the image, a ray (called the primary ray) is traced from the observer’s viewpoint to the pixel. If the ray does not intersect any object in its way, the pixel is painted with the background color of the scene. Otherwise, if the ray intersects an object, the shadows, refraction and reflection are calculated. Figure 2 shows the rays generated by ray tracing from a single primary ray in a scene with a single light source.

Figure 2. Rays generated from a single primary ray.

To calculate the shadows, a “shadow” ray (L1) is traced from the intersection point of the primary ray to each light source in the scene. If any of these rays intersects an object, the amount of light that passes through the object is calculated, depending on the transparency of the object. If the intersected object is opaque (like the smallest object in Figure 2), the intersection point of the primary ray is in the shadow of the object, so that light source is eliminated from the calculation. If the intersected object has some degree of transparency (like the largest object in Figure 2), the illumination contribution of the light source is reduced.

When the object has specular reflection, a ray (R1, called the reflection ray) is reflected from the primary ray at the point of intersection with respect to the normal (N1). This ray makes it possible to obtain the intensity of the light that reaches the intersection point of the primary ray due to the phenomenon of reflection.

When the object is transparent, a ray (T1, called the refraction ray) is traced through the object. This ray makes it possible to obtain the intensity of the light that reaches the intersection point of the primary ray due to the phenomenon of refraction.

Each reflection or refraction ray that intersects an object can generate new “shadow”, reflection and/or refraction rays. Therefore, the steps for the calculation of refraction and reflection effects should be performed recursively. For example, the same steps used to calculate the intensity of the light provided by the primary ray should be used to calculate the intensity of the light provided by R1. Thus, a ray tree is built for each primary ray, as shown in Figure 3.

Figure 3. Ray tree resulting from Figure 2.

A. Acceleration structures: uniform spatial subdivision

Two families of strategies can be used to improve the performance of the ray tracing algorithm: one reduces the number of rays and the other reduces the number of intersection checks performed. The spatial division of the scene helps to reduce the number of intersection checks, since it guarantees that the entire list of objects of the scene does not have to be checked for each ray.

The spatial division method has the advantage of checking for possible intersections only with objects belonging to regions traversed by the ray. As a result, the space division method reduces the amount of unnecessary calculations, depending on the object distribution in the scene.

In the uniform space division method, the scene is divided into a set of uniform regions. Each region has a list of all the objects that it contains, either in whole or in part. This technique requires a preprocessing stage to create a data structure for storing the regions occupied by each object in the scene.

This uniform spatial subdivision strategy can effectively accelerate the calculation of the intersections despite being
simple. While there are other alternatives, like the kd-tree [23], the uniform spatial subdivision improves the performance of ray tracing, is easy to implement and does not add extra issues to the algorithm; therefore, it is a spatial acceleration structure that can be considered in a GPU implementation of the ray tracing algorithm.

B. Related work

The ray tracing algorithm has a high computational cost, especially with the geometrical models used in most 3D applications; therefore, until recently it was not suitable for real-time applications. However, some recent works have achieved real-time ray tracing implementations on CPU architectures, such as the Quake Wars game engine [21] implemented using openRT [18]. A demo of the engine shown in August 2008 runs at between 20 and 35 fps with an image resolution of 1024 by 720 pixels on a Caneland system that includes four Dunnington CPUs, each with six cores.

On the other hand, the application of the computational resources delivered by modern GPUs to ray tracing has resulted in a number of implementations that allow rendering scenes in reasonable times. Researchers have introduced several techniques to speed up the construction of acceleration structures and the traversal of rays through an acceleration structure. The list of related works includes those by Horn et al. [12], Popov et al. [20], Parker et al. [19], Aila and Laine [1] and the 3D engine developed by researchers of the Alexandra Institute [4].

The work of Horn et al. is based on the use of a Bounding Volume Hierarchy (BVH), which is not covered in our work. Meanwhile, the proposal of Aila and Laine is implemented using a combination of Brook [6] and Direct3D [9], involving a different conceptual abstraction of the GPU model. Parker et al. [19] have proposed a general framework to develop ray tracing algorithms, but their work focuses on developing a flexible and adaptable framework rather than on the performance of the resulting algorithm. For these reasons, none of the three previous works was considered for the development of our proposal. On the other hand, the algorithm implemented by the Alexandra Institute is based on the traditional ray tracing algorithm, whereas the work of Popov et al. implements a kd-tree spatial acceleration structure; therefore, both works are closely related to our approach.

A good survey of the state of the art in ray tracing techniques can be found in the works of Parker et al. [19] and McGuire and Luebke [16].

IV. OUR PROPOSAL

The ray tracing algorithm is inherently suitable for parallelization with SPMD techniques, since the calculation of lighting in each pixel is an independent process. This feature makes its implementation on GPU cards possible, using a separate thread to calculate each ray. Three different versions of the ray tracing algorithm were developed in order to exploit different characteristics of the algorithm and the GPU architecture. The versions were implemented following an incremental approach, incorporating in each version a considerable improvement over the previous one.

This section describes the main characteristics of the different versions implemented. First, some general features common to all versions are described. Then, the differences between the versions are detailed.

All versions were implemented using the C language and CUDA (version 2.3) to manage the GPU. The general structure of the different versions of the implemented ray tracing algorithm is presented in Figure 4. It has five different steps, which are discussed below.

Figure 4. Structure of the ray tracing algorithm implemented in CUDA.

In the first step of the algorithm, the data of the scene is loaded from a text file. The file format is based on Wavefront OBJ (version 3.0) [17]. It makes it possible to define the elements of the scene (e.g. vertices, points, lines, polygons, curves, etc.) and the materials of the elements. Also at this stage, a configuration file is loaded, which contains parameters required to execute the algorithm, such as the image resolution, the division size of the acceleration grid, the maximum number of ray bounces, etc.

In the second step, a spatial acceleration structure corresponding to the uniform space subdivision is built; this structure was chosen because of the simplicity of its construction and traversal. The grid construction algorithm has been optimized so that each object in the scene is associated with an axis-aligned box (in grid coordinates) that surrounds it. Then, candidate spatial
regions that overlap with the generated box are obtained. For each candidate region, the region-object overlap is tested and, if it occurs, the object is added to the region.

The third stage involves the data transfer from the memory space of the CPU to the GPU memory. The transferred data are: the view plane, the camera, the image resolution, the list of triangles and their normals, the light sources, the regions of the grid for spatial subdivision and the material properties of each object.

After copying the data to the GPU, the kernel that calculates the primary rays is invoked. The data required to calculate the primary rays are the view plane, the camera and the image resolution.

The primary rays are the input for the calculation of each pixel color, which is the core of the ray tracing algorithm. The data required to calculate the color of each pixel are the list of triangles and their normals, the light sources, the regions of the grid for spatial subdivision and the material properties of each object.

The kernel that calculates the color of each pixel is invoked following a division of the image to render into patches (groups of pixels). The image is divided uniformly, so that each patch has the same number of pixels. Each patch corresponds directly to a block of threads in CUDA, so that each division of the image is processed by a different block. Moreover, since each pixel of the image is processed by a different thread, the number of threads per block is equal to the number of pixels contained in each division. For this reason, the division is completely established once the number of threads per block and the image resolution are fixed. For example, if the image resolution is 640 × 480 pixels and the block size is equal to 16 × 8 threads, the image must necessarily be divided into 40 × 60 patches.

Each thrown ray traverses the spatial acceleration structure, following the reflections and refractions in the objects. The implemented ray tracing algorithm has only one type of element, the triangle. Thus, the intersection algorithm is simple, requiring only a few arithmetic operations.

A relevant aspect is that the ray tracing algorithm is recursive, and current GPUs do not support recursion. As a consequence, the algorithm has to be implemented iteratively. There are two alternatives to achieve this: implementing a stack to store the recursion tree, or simplifying the tree by making it degenerate into a list. The first alternative was ruled out because each thread must have its own stack and the size of the local memory is very limited. The second alternative requires as a precondition that the scene has no objects that reflect and transmit light at the same time. This was the approach followed in our implementation, since the limitation imposed on the scene is acceptable.

Finally, after calculating the color of each pixel, the data generated on the GPU is copied to the CPU to be displayed on the screen.

The differences between the versions are discussed in the next subsections.

A. RT(GPU) version

The first version of the ray tracing algorithm (RT(GPU)) is a GPU analogue of the CPU implementation. In RT(GPU), all the data is stored in global memory.

B. RT(GPU-ml) version

Given the GPU architecture, and particularly the importance of the correct use of the memory levels, this version (RT(GPU-ml)) uses the different memory levels of the GPU accessible through CUDA. In particular, texture and constant memories are used, which improves the performance.

The values that are most frequently used by the algorithm, such as the list of triangles or the boxes of the grid for spatial subdivision, must be stored in a memory level with fast read access. For this reason, the list of triangles and their normals, the light sources, the boxes of the grid for spatial subdivision and the list of material properties of the objects are copied to the texture memory, since these data are accessed frequently and do not need updating. Other data, such as the view plane, the camera and the image resolution, are stored in the constant memory. Reading data from these types of memory is much faster than reading from global memory.

C. RT(GPU-ii) version

The third version (RT(GPU-ii)) improves the procedure for calculating the ray-triangle intersection using the barycentric coordinates method [2]. This method verifies that the ray intersects the plane containing the triangle and then, by a change of coordinates, verifies that the intersection point is within the triangle boundaries.

V. EXPERIMENTAL ANALYSIS

In this section, we present the test cases and hardware platforms used to evaluate the different implemented versions of the ray tracing algorithm. Then, we describe in detail the various experiments conducted to validate the proposal.

In addition to the GPU versions of the ray tracing algorithm described in Section IV, a CPU implementation of ray tracing, RT(CPU-ii), was developed to compare its performance with that of the GPU versions.

A. Test cases

First, we studied the existing strategies for measuring the quality of the generated images. The survey did not find any comprehensive strategy. The choice of the method for measuring the quality of the image depends heavily on the objective of the study. Avcibas et al. [5] and Dirik et al. [10] present good surveys of strategies and discuss their limitations, but none is applicable for the purposes of this study.
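Returning to the ray-triangle test of the RT(GPU-ii) version (Section IV-C), the two checks it describes can be sketched in C as follows: first the ray is intersected with the plane of the triangle, then the hit point is expressed in barycentric coordinates to decide whether it lies inside the triangle. This is one common formulation and is only a sketch; the exact variant from reference [2] and the code used in the article are not reproduced here, and the `Vec3` type, the helper functions, the `ray_triangle` name and the tolerance `eps` are assumptions made for this example.

```c
typedef struct { double x, y, z; } Vec3;

static Vec3 sub(Vec3 a, Vec3 b) { Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z }; return r; }
static double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3 cross(Vec3 a, Vec3 b)
{
    Vec3 r = { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
    return r;
}

/* Returns 1 and writes the ray parameter to *t_out when the ray
   orig + t * dir (t > 0) hits the triangle (v0, v1, v2); returns 0 otherwise. */
int ray_triangle(Vec3 orig, Vec3 dir, Vec3 v0, Vec3 v1, Vec3 v2, double *t_out)
{
    const double eps = 1e-9;               /* assumed tolerance */
    Vec3 e1 = sub(v1, v0), e2 = sub(v2, v0);

    /* Step 1: intersect the ray with the plane containing the triangle. */
    Vec3 n = cross(e1, e2);
    double denom = dot(n, dir);
    if (denom > -eps && denom < eps) return 0;   /* parallel ray or degenerate triangle */
    double t = dot(n, sub(v0, orig)) / denom;
    if (t <= eps) return 0;                      /* plane is behind the ray origin */

    /* Step 2: change of coordinates, writing the hit point as
       v0 + beta * e1 + gamma * e2. */
    Vec3 p = { orig.x + t * dir.x, orig.y + t * dir.y, orig.z + t * dir.z };
    Vec3 w = sub(p, v0);
    double d00 = dot(e1, e1), d01 = dot(e1, e2), d11 = dot(e2, e2);
    double d20 = dot(w, e1), d21 = dot(w, e2);
    double den = d00 * d11 - d01 * d01;          /* squared length of n, nonzero here */
    double beta = (d11 * d20 - d01 * d21) / den;
    double gamma = (d00 * d21 - d01 * d20) / den;

    /* The point is inside the triangle when all three barycentric
       coordinates (1 - beta - gamma, beta, gamma) are non-negative. */
    if (beta < 0.0 || gamma < 0.0 || beta + gamma > 1.0) return 0;
    *t_out = t;
    return 1;
}
```

For the triangle (0,0,0), (1,0,0), (0,1,0), a ray starting at (0.25, 0.25, 1) and pointing along the negative z axis hits the triangle at t = 1, while the same ray started at (1, 1, 1) reaches the plane outside the triangle and is rejected.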

In addition, there are no standardized test cases or benchmarks that could be used to evaluate the different implementations of ray tracing developed in this work. For this reason, a set of images was designed to cover several aspects of the image generation process, in order to help measure different characteristics of the implemented algorithms. The designed test set of images is divided into three different groups. The first group consists of images for evaluating the effect of the distribution of the objects in the scene. The second group consists of images for evaluating the impact of the number of triangles in the scene. Although there are no benchmarks, some images have been widely used by the research community (such as the Bunny from Stanford University). Therefore, the third group includes some of the scenes and images that have been used in studies similar to this one.

1) Test cases with different object distribution: All scenes have the same number of triangles but a different distribution in the scene. Table II presents the main features of the test cases considered.

2) Test cases with different number of primitives: All images in this group are exactly the same, but were discretized using a different number of primitives. Table III presents the main features of the test cases considered.

3) Test cases from similar studies: All images are taken from similar studies or are images commonly used by the research community. Table IV presents the main features of the test cases considered.

Table II
TEST CASES WITH DIFFERENT DISTRIBUTION IN THE SCENE.
Scene name  # Objects  # Lights  # Triangles  Figure
Dist I      9          1         10,338       5(a)
Dist II     9          1         10,338       5(b)
Dist III    9          1         10,338       5(c)

Figure 5. Scenes with different spatial distribution of the objects.

Table III
TEST CASES WITH DIFFERENT NUMBER OF PRIMITIVES.
Scene name  # Objects  # Lights  # Triangles  Figure
Pri I       2          2         194          6(a)
Pri II      2          2         274          N/S
Pri III     2          2         348          N/S
Pri IV      2          2         482          N/S
Pri V       2          2         606          6(b)

Figure 6. Scenes with a different number of primitives.

Table IV
TEST CASES USED FOR COMPARING WITH OTHER RAY TRACING IMPLEMENTATIONS.
Scene name  # Objects  # Lights  # Triangles  Figure
Alexandra   14         1         236          7(a)
Buddha      1          1         100,000      7(b)
Dragon      1          1         100,000      7(c)
Bunny       1          1         69,698       7(d)

Figure 7. Scenes used for comparing with other ray tracing implementations.

B. Hardware platform

Several hardware platforms were employed to evaluate the implemented algorithms. Each platform consists of a Core 2 Duo PC with a GPU of the NVIDIA GeForce series. The main details of the hardware platforms used are presented in Table V. All the PCs were running the Windows operating system.

Table V
HARDWARE PLATFORMS USED FOR EXPERIMENTAL ANALYSIS.
GPU       GPU memory  CPU            RAM memory
9500M GS  512 MB      T7500 2.20GHz  4GB DDR2 667 MHz
9600M GT  512 MB      P8400 2.26GHz  4GB DDR2 667 MHz
GTX 260   896 MB      E7500 2.93GHz  4GB DDR2 667 MHz

Table VI provides more details of each of the GPUs used during the evaluation.

Table VI
GPUS USED FOR EXPERIMENTAL ANALYSIS.
GPU       Multiprocessors  Cores  Clock (MHz)  Shader clock (MHz)  Memory clock (MHz)
9500M GS  4                32     475          950                 400
9600M GT  4                32     500          1250                400
GTX260    27               216    576          1242                999

C. Experimental results

Most of the experiments were conducted with an image resolution of 640 by 480 pixels. However, for the comparison between our work and algorithms implemented by other authors, it was essential to use other resolutions.

For the comparison with the implementation of the Alexandra Institute, an image resolution of 800 by 600 pixels was used. The resolution was determined by their implementation of the ray tracing algorithm, as it could not be modified. For the comparison with the results of Popov et al. [20], an image resolution of 1024 by 1024 pixels was used. The resolution was determined from the experiments described in the article by Popov et al., since the authors worked with that fixed resolution.

Experiments conducted during the evaluation confirmed that the choice of the grid size can increase the performance of image generation. As a first approximation we considered the value suggested by Thrane and Simonsen [22], which indicates that the resolution should be $3\sqrt[3]{N}$ boxes along the shortest axis, where $N$ is the number of triangles in the scene. After several tests, it was found that this division is not always the best, and that a better value for the grid size could be found between $\sqrt[3]{N}$ and $3\sqrt[3]{N}$ along the shortest axis. For each image, the optimal grid size, that is, the one that obtains the best performance within this range of values, must be determined empirically.

1) Evaluation of the different GPU versions implemented: The performance comparison between the different versions of the algorithm implemented on GPU was made using the test cases Pri I, Pri II, Pri III, Pri IV and Pri V. This evaluation has two stages. In the first stage, the optimal grid size for the considered test cases is determined, while in the second stage, each of the GPU versions is executed for each of the test cases using the optimal grid size found in the previous stage.

executing in the PC with a GTX260 are summarized in Table VII. The table shows the number of frames per second (fps) that can be computed for each of the images. We can conclude from these results that the optimal grid size for all test cases considered is obtained by halving each axis (2 × 2 × 2).

Table VII
FPS OF RT(GPU-ii) VERSION FOR DIFFERENT TEST CASES.
Scene    1x1x1  2x2x2  4x4x4  6x6x6  10x10x10  15x15x15
Pri I    18.7   36.7   32.7   29.7   27.3      24.7
Pri II   13.9   27.6   25.4   23.3   21.5      20.1
Pri III  11.3   24.5   22.8   20.9   19.2      18.0
Pri IV   8.4    18.5   18.0   16.5   15.9      15.1
Pri V    6.7    16.6   15.5   14.6   13.8      13.5

Once the optimal grid size for each of the test cases was found, all the versions implemented on GPU were executed for the same test cases. Table VIII presents the performance obtained by the different GPU versions in the PC with a GTX260, measured in frames per second. The results show that the performance improves with each version, and that RT(GPU-ii) achieved the best performance.

Table VIII
FPS COMPARISON BETWEEN THE DIFFERENT GPU VERSIONS.
Scene    Optimal grid size  RT(GPU) (fps)  RT(GPU-ml) (fps)  RT(GPU-ii) (fps)
Pri I    2x2x2              5.8            26.2              36.7
Pri II   2x2x2              5.1            20.2              27.6
Pri III  2x2x2              4.9            18.2              24.5
Pri IV   2x2x2              4.3            14.1              18.5
Pri V    2x2x2              3.9            12.6              16.6

The results obtained show the importance of exploiting the different memory levels of the GPU. In the RT(GPU) version all the data is stored in the global memory, while in the RT(GPU-ml) version most of the data is allocated in the texture and constant memories. For the test cases considered, the use of the different memory levels of the GPU improves the performance due to the reduction in memory access time, making the algorithm on average three and a half times faster than the algorithm that does not use them.

On the other hand, the RT(GPU-ii) version improves the ray-triangle intersection with an algorithm that requires fewer arithmetic operations, thereby reducing the time needed to generate images. From the results, it can be seen that the improved intersection algorithm makes RT(GPU-ii) generate images 30% faster than the RT(GPU-ml) version.

Finally, it can be seen in Table VII as well as in Table VIII that an increase in the number of triangles in a scene increases the time required for generating each image and therefore reduces the fps.
The RT(GPU-ii) version was used to determine the 2) Comparative study with other ray tracing implementa-
optimal grid size for each test case. The results obtained tions: First, it was made a test to compare our proposal with


Authorized licensed use limited to: Carleton University. Downloaded on September 15,2020 at 17:41:47 UTC from IEEE Xplore. Restrictions apply.
the ray tracing implemented in GPU by the Alexandra Institute [4], considering a scene provided with its implementation. The results obtained in the PC with a 9600M GT graphics card showed that RT(GPU-ii) reaches 13.0 fps, while the implementation of the Alexandra Institute obtained 11.7 fps. Figure 8(a) shows the rendering generated by the implementation of the Alexandra Institute and Figure 8(b) shows the rendering generated by our implementation for the same scene. It should be emphasized that our implementation renders generic scenes, while the Alexandra Institute implementation is specially designed to produce images of a single type of scene, managing the memory space efficiently and reducing the number of memory accesses for that type of scene. Another difference between the implementations is that the Alexandra Institute implementation considers the light sources as objects of the scene, while this was not considered in our implementation. In the images generated, subtle differences can be seen, such as the background color, which could not be properly reproduced for the test case. There is also a noticeable difference in the specular brightness of the sphere, caused by the inexact reproduction of the material used in the Alexandra image. On the other hand, the reflection of objects near the sphere on its surface is better seen in the image generated by our implementation. Based on an analysis of the images, it could be concluded that there are no significant differences between the two images.

Figure 8. Render of the image Alexandra with (a) the Alexandra Institute implementation (left) and (b) our implementation (right).

Then, a test was conducted to compare our proposal with a ray tracing implemented in GPU developed by Popov et al. [20] that used the kd-tree structure as spatial acceleration method. In their work, the rendering of the scene Bunny is presented as well as the time required for the image generation. The scene Bunny considered in our experiments was built from the observation of the rendering presented in the work of Popov et al. The scene could not be exactly the same because of the limitations of the construction method used and the omission in the original article of some relevant aspects of the scene. For example, the number of light sources is the same but their position is not identical, and the material properties of the main object could not be accurately reproduced, since the article does not provide this information. The GPU used by Popov et al. is a NVIDIA GeForce 8800 GTX with 112 cores. The GTX260 is the best suited GPU among the ones available for our experiments, but it has superior features to the one used by Popov et al. Figure 9(a) shows the render presented by Popov et al. and Figure 9(b) shows the render of the Bunny scene generated with our implementation.

Figure 9. Render of the image Bunny with (a) the Popov et al. [20] implementation (left) and (b) our implementation (right).

The performance of our implementation of the ray tracing under the conditions described above is 6.1 fps, while the performance of the work of Popov et al. is 5.9 fps. The performance of both implementations for the test case considered is very similar. The results obtained show that the ray tracing implemented is competitive with other ray tracing algorithms implemented in GPU.
3) Comparative study between different platforms: The comparative study of performance between the different platforms consisted in the execution of the RT(GPU-ii) version on the test cases Dragon, Buddha and Alexandra. On each of the platforms, the algorithm is executed for each of the test cases using several grid sizes, determining the optimal grid size for each case and platform.
Tables IX, X and XI show the results obtained for the equipment 9500M GS, 9600M GT and GTX260, respectively.

Table IX
FPS OF RT(GPU-ii) VERSION IN A PC WITH A 9500M GS GRAPHICS CARD.
Dragon               Buddha               Alexandra
Grid size     fps    Grid size     fps    Grid size   fps
20x20x20      1.4    20x20x20      1.3    1x1x1       4.6
46x46x46      2.3    46x46x46      2.4    3x3x3       11.5
92x92x92      2.8    92x92x92      2.5    6x6x6       15.6
138x138x138   2.0    138x138x138   2.1    10x10x10    13.1
180x180x180   1.6    180x180x180   1.7    15x15x15    12.3
230x230x230   1.3    230x230x230   1.3    30x30x30    9.1

The results obtained suggest that the optimal grid size (92 × 92 × 92 in the Dragon and Buddha images, and 6 × 6 × 6 in the Alexandra image) for the test cases considered is the same regardless of the platform used. On the other hand, the experiments confirmed that the generation of images by the ray tracing algorithm implemented is faster when the

GPU has a greater number of cores. In addition to this, the results also show that the algorithm implemented can automatically scale with the number of cores of the GPU. This is an important property that arises as a consequence of the CUDA programming model, which makes it very easy to achieve.

Table X
FPS OF RT(GPU-ii) VERSION IN A PC WITH A 9600M GT GRAPHICS CARD.
Dragon               Buddha               Alexandra
Grid size     fps    Grid size     fps    Grid size   fps
20x20x20      1.7    20x20x20      1.4    1x1x1       5.9
46x46x46      3.3    46x46x46      2.8    3x3x3       14.0
92x92x92      3.4    92x92x92      3.1    6x6x6       20.0
138x138x138   2.6    138x138x138   2.6    10x10x10    18.4
180x180x180   2.0    180x180x180   2.1    15x15x15    17.5
230x230x230   1.6    230x230x230   1.7    30x30x30    12.8

Table XI
FPS OF RT(GPU-ii) VERSION IN A PC WITH A GTX260 GRAPHICS CARD.
Dragon               Buddha               Alexandra
Grid size     fps    Grid size     fps    Grid size   fps
20x20x20      5.0    20x20x20      4.9    1x1x1       28.0
46x46x46      12.2   46x46x46      9.8    3x3x3       49.3
92x92x92      16.8   92x92x92      12.4   6x6x6       71.2
138x138x138   14.2   138x138x138   11.4   10x10x10    67.6
180x180x180   11.4   180x180x180   10.3   15x15x15    62.5
230x230x230   9.3    230x230x230   8.4    30x30x30    49.4

4) Comparative study between CPU and GPU implementations: The experiments for comparing the performance between the implementations for CPU and GPU used the test cases Dist I, Dist II, Dist III and Bunny. The evaluation consisted in the execution of RT(GPU-ii), the most efficient version on GPU, and RT(CPU-ii), the CPU version of the ray tracing algorithm, in the PC with a GTX260 graphics card. Table XII presents the fps obtained by the RT(GPU-ii) and RT(CPU-ii) versions and the acceleration (defined as $\frac{\text{fps of GPU implementation}}{\text{fps of CPU implementation}}$) achieved when using the GPU.

Table XII
FPS OF THE CPU AND GPU IMPLEMENTATIONS.
Scene      Grid size   CPU (fps)   GPU (fps)   Acceleration
Dist I     50x50x50    1.4         17.2        12.29
Dist II    50x50x50    1.5         20.1        13.40
Dist III   50x50x50    1.6         21.1        13.19
Bunny      80x80x80    4.2         23.8        5.67

The results obtained for the test cases Dist I, Dist II, Dist III and Bunny show that the GPU implementation produces images faster than the CPU implementation, and therefore achieves a higher number of frames generated per second. In particular, the GPU implementation is more than 11 times faster on average than the CPU implementation for the four test cases considered in this study.
5) Evaluation of the effect of the object distribution in the scene: The test cases Dist I, Dist II and Dist III were designed for detecting a possible weakness in the spatial acceleration structure chosen when working with scenes in which objects are not evenly distributed. We assumed that the case of Dist III, which has all the items concentrated in the center of the scene, would reach fewer fps than the cases Dist I and Dist II, which are more evenly distributed. However, the results obtained (presented in Table XII) show the opposite. One possible explanation for this behavior is that when the objects are more evenly distributed in the scene, they cast more shadows (as can be seen in Figure 5). Since the calculation of the shadows is computationally expensive, this cost counteracts the benefit gained with a uniform distribution of objects in the scene.

VI. CONCLUSIONS AND FUTURE WORK
This work has presented an initial study on applying GPU computing in order to speed up the execution of the ray tracing algorithm. Three versions of ray tracing were implemented in GPU using CUDA and were evaluated on different platforms using several images.
The experimental analysis showed that the GPU implementation significantly increased the number of frames that could be generated per second over the traditional CPU implementation (the RT(GPU-ii) version obtained an acceleration of up to 13×). These results show the importance of making good use of the different levels of memory on current GPUs.
We can also conclude that the performance achieved by our proposal (RT(GPU-ii) version) is competitive with the state of the art in real-time ray tracing implementations on GPU, such as the one developed by the Alexandra Institute and the proposal of Popov et al. In addition to this, our proposal showed good scalability on the platforms used in this study. This property is very important because GPUs improve their power at a vertiginous rate, which suggests that our implementation of ray tracing will achieve better performance on new GPUs.
The main line for current and future work consists in evaluating the use of a different spatial acceleration structure, with kd-trees being the structure that seems best suited. Also, it is important, for graphical purposes, to propose a technique that completely covers the ray tree. In addition to this, extending this work to a multi-GPU scenario or a hybrid multicore-GPU approach should be addressed in order to attain real-time calculation of the frames in more complex scenes.

REFERENCES

[1] Aila, T. and Laine, S., Understanding the efficiency of ray traversal on GPUs, Proceedings of the Conference on High Performance Graphics (HPG ’09), pp. 145-149, ACM, 2009.

[2] Akenine-Möller, T. and Haines, E., Real-time rendering, A. K. Peters Ltd., 2002.

[3] Appel, A., Some Techniques for Shading Machine Renderings of Solids, Proceedings of the American Federation of Information Processing Societies (AFIPS ’68), pp. 37-45, ACM, 1968.

[4] Alexandra Institute, Computer Graphics Group, Denmark. Available at http://cg.alexandra.dk/tag/gpgpu/. Accessed on February 2011.

[5] Avcibas, I., Sankur, B. and Sayood, K., Statistical Evaluation of Image Quality Measures, Journal of Electronic Imaging, vol. 11, no. 2, pp. 206-223, 2002.

[6] Buck, I., Foley, T., Horn, D., Sugerman, J., Fatahalian, K., Houston, M. and Hanrahan, P., Brook for GPUs: Stream computing on graphics hardware, ACM Transactions on Graphics - Proceedings of ACM SIGGRAPH 2004, vol. 23, no. 3, pp. 777-786, 2004.

[7] CUDA website. Available at http://www.nvidia.com/object/cuda_home_new.html. Accessed on February 2011.

[8] Darema, F., The SPMD Model: Past, Present and Future, Proceedings of the 8th European PVM/MPI Users’ Group Meeting, Lecture Notes in Computer Science 2131, p. 1, 2001.

[9] DIRECTX website. Available at http://developer.nvidia.com/page/directx.html. Accessed on February 2011.

[10] Dirik, A., Bayram, S., Sencar, H. and Memon, N., New Features to Identify Computer Generated Images, Proceedings of the International Conference on Image Processing (ICIP 2007), pp. 433-436, IEEE, 2007.

[11] Goral, C., Torrance, K., Greenberg, D. and Battaile, B., Modeling the Interaction of Light Between Diffuse Surfaces, SIGGRAPH Computer Graphics, vol. 18, no. 3, pp. 213-222, ACM, 1984.

[12] Horn, D., Sugerman, J., Houston, M. and Hanrahan, P., Interactive K-D Tree GPU Raytracing, Proceedings of the Symposium on Interactive 3D Graphics and Games (I3D ’07), pp. 167-174, ACM, 2007.

[13] Jensen, H., Realistic Image Synthesis Using Photon Mapping, A. K. Peters Ltd., 2001.

[14] Kay, T. and Kajiya, J., Ray tracing complex scenes, SIGGRAPH Computer Graphics, vol. 20, no. 4, pp. 269-278, ACM, 1986.

[15] Kirk, D. and Hwu, W., Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.

[16] McGuire, M. and Luebke, D., Hardware-Accelerated Global Illumination by Image Space Photon Mapping, Proceedings of the Conference on High Performance Graphics (HPG ’09), pp. 77-89, ACM, 2009.

[17] Murray, J. and Van Ryper, W., Encyclopedia of Graphics File Formats, O’Reilly Media, 1996.

[18] OpenRT website. Available at http://openrt.de/. Accessed on February 2011.

[19] Parker, S., Bigler, J., Dietrich, A., Friedrich, H., Hoberock, J., Luebke, D., McAllister, D., McGuire, M., Morley, K., Robison, A. and Stich, M., OptiX: A General Purpose Ray Tracing Engine, ACM Transactions on Graphics - Proceedings of ACM SIGGRAPH 2010, vol. 29, no. 4, pp. 1-13, 2010.

[20] Popov, S., Günther, J., Seidel, H.-P., and Slusallek, P., Stackless KD-Tree Traversal for High Performance GPU Ray Tracing, Computer Graphics Forum - Proceedings of Eurographics, vol. 26, no. 3, pp. 415-424, 2007.

[21] Quake Wars: Ray Traced website. Available at http://www.qwrt.de/. Accessed on February 2011.

[22] Thrane, N. and Simonsen, L., A Comparison of Acceleration Structures for GPU Assisted Ray Tracing, Master’s thesis, Department of Computer Science, Faculty of Science, University of Aarhus, Denmark, 2005.

[23] Wald, I., Realtime Ray Tracing and Interactive Global Illumination, PhD. Thesis, Computer Graphics Group, Saarland University, 2004.

[24] Ward, G., Rubinstein, F. and Clear, R., A Ray Tracing Solution for Diffuse Interreflection, Proceedings of the 15th annual conference on Computer graphics and interactive techniques (SIGGRAPH ’88), pp. 85-92, ACM, 1988.

[25] Whitted, T., An Improved Illumination Model for Shaded Display, Communications of the ACM, vol. 23, no. 6, pp. 343-349, 1980.
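
Supplementary sketch. The two quantities that drive the experimental analysis, the grid resolution range derived from Thrane and Simonsen [22] and the acceleration reported in Table XII, can be recomputed with a few lines of code. The Python snippet below is only an illustration and is not part of the implementation evaluated in this work: the function names `grid_resolution_range` and `acceleration` are hypothetical, rounding the cube root to the nearest integer is an assumption of the sketch, and the fps values are copied from Table XII.

```python
"""Illustrative recomputation of two quantities used in the experiments."""


def grid_resolution_range(num_triangles: int) -> tuple[int, int]:
    """Return (lower, upper) grid resolutions along the shortest axis.

    The bounds are cbrt(N) and 3 * cbrt(N), where N is the number of
    triangles. Rounding to the nearest integer (and clamping to at least
    one box) is an assumption of this sketch.
    """
    if num_triangles <= 0:
        raise ValueError("num_triangles must be positive")
    cube_root = num_triangles ** (1.0 / 3.0)
    return max(1, round(cube_root)), max(1, round(3.0 * cube_root))


def acceleration(gpu_fps: float, cpu_fps: float) -> float:
    """Acceleration as defined for Table XII: GPU fps divided by CPU fps."""
    if cpu_fps <= 0:
        raise ValueError("cpu_fps must be positive")
    return gpu_fps / cpu_fps


# (scene, CPU fps, GPU fps, reported acceleration), copied from Table XII.
TABLE_XII = [
    ("Dist I", 1.4, 17.2, 12.29),
    ("Dist II", 1.5, 20.1, 13.40),
    ("Dist III", 1.6, 21.1, 13.19),
    ("Bunny", 4.2, 23.8, 5.67),
]


if __name__ == "__main__":
    for scene, cpu, gpu, reported in TABLE_XII:
        computed = acceleration(gpu, cpu)
        print(f"{scene}: computed {computed:.4f}, reported {reported:.2f}")

    # Mean acceleration over the four test cases of Table XII.
    mean = sum(acceleration(g, c) for _, c, g, _ in TABLE_XII) / len(TABLE_XII)
    print(f"mean acceleration: {mean:.4f}")

    # Candidate grid resolutions for a hypothetical scene of 1000 triangles.
    print(grid_resolution_range(1000))
```

The range returned by `grid_resolution_range` only bounds the search. As stated in the experimental results, the optimal grid size still has to be determined empirically for each image by rendering it with several candidate resolutions inside that range.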
