Robert So
Contact information: [email protected], LBNL, 1 Cyclotron Road, MS 50F, Berkeley, CA 94720
physical interactions of light sources with objects in a virtual environment.

Before creating images, a user first generates a scene file in Radiance format, either by importing one from another package, or by developing a model directly in the Radiance scene language. The scene file consists of a description of the objects in the model, their locations, and their physical properties such as transparency, color and reflectivity.

An octree is built up from this scene file to spatially subdivide the scene for faster ray-surface intersection tests. The octree represents a static version of the scene and includes a compact description of the objects in the scene. A user specifies a desired viewpoint and rays are cast into the scene using a light-backwards ray tracing method [7].

Like other ray-tracing packages, Radiance takes into account direct illumination and specular reflections resulting from reflective surfaces and mirrors. Unlike many rendering packages, however, Radiance attempts to accurately model the diffuse indirect reflections between objects in a scene. The effect of the diffuse indirect reflections is also called ambient lighting. By using a hybrid deterministic/stochastic technique, Radiance strives for a balance between speed and accuracy [7]. The use of a deterministic/stochastic technique distinguishes Radiance from radiosity approaches.

The diffuse indirect component is computed by sampling rays from a hemisphere over points within the scene. The indirect components are cached for re-use by surrounding points to reduce the amount of re-calculation required, and can also be saved in a file to greatly decrease the amount of time required to render further images.

Figure 1 shows the effect of a Radiance calculation without the diffuse indirect component. Figure 2 shows the same image with the diffuse indirect component calculated. One can quickly see the dramatic effect the indirect component has on a scene.

The amount of time required to generate a Radiance image varies. A standard model included with the Radiance distribution describes a mountain cabin of moderate complexity. The model consists of a cabin, furnishings, several light sources, windows and an outdoor environment consisting of trees (Figure 3). Depending on the view, a single rendering of a viewpoint requires the calculation of between 1 and 25 million rays without ambient lighting, and hundreds of millions if ambient lighting is taken into account.

For the cabin model, it typically takes 10-15 minutes on a single processor of a Cray T3E supercomputer to render a single 640x480 32-bit image without ambient lighting. With the incorporation of ambient lighting, an image of the cabin can take hours to produce. Dramatic speedups are thus necessary to achieve multiple frames per second.

The standard distribution of Radiance includes methods of parallelization suitable for a small network of workstations connected on a local area network. In one method, pieces of an image are assigned to different processing elements based on a static partitioning, and in another, different processors render different frames in the sequence. The primary mechanism of data sharing and synchronization in these methods is NFS file sharing and locking [7]. Sharing of information relating to ambient lighting results in significant speedup.

Although this mechanism works well for a small number of workstations, it does not scale to large numbers of processors. As the number of processors increases, file I/O and the file locking mechanism become a major bottleneck.

Other parallelization efforts with Radiance and other lighting simulation software have focused on using a smaller number of processors and decreasing the amount of time necessary to generate a single image. These methods were not tested with a large number of processors, and it is unclear how well they would scale. They also did not employ speedup measures in addition to parallelization [2,4].

Exploiting frame-to-frame coherence is one way of reducing the amount of computation required to generate a frame. During a walkthrough of a scene, the differences between viewpoints in adjacent frames are typically fairly small. Since the viewpoints are very similar, many of the pixels calculated in one frame can be reused in subsequent frames, though aliasing and occlusion problems reduce, by a small amount, the number of pixels that can be reused.

The standard distribution of Radiance includes a method that takes advantage of frame-to-frame coherence, but it is not suitable for movement in arbitrary directions. For accurate rendering, this program requires interpolation between two complete reference frames, one at the start position and the other at the final position. The program interpolates new frames between these two locations. Since our goal was to allow the user to move in any direction, it would be impossible to generate the two reference frames in real time. We have devised another method of using frame-to-frame coherence to decrease rendering time.
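The hemispherical sampling of the diffuse indirect component described in this section can be illustrated with a small Monte Carlo sketch. This is not Radiance's actual code (which uses a hybrid deterministic/stochastic scheme with caching and interpolation); the function names and the "cosine sky" test environment below are illustrative assumptions.

```python
import math
import random

def sample_cosine_hemisphere():
    """Cosine-weighted direction on the hemisphere about the surface
    normal (taken to be +z): pdf(w) = cos(theta) / pi."""
    u1, u2 = random.random(), random.random()
    r = math.sqrt(u1)
    phi = 2.0 * math.pi * u2
    # z = cos(theta); with r = sqrt(u1), z = sqrt(1 - u1)
    return (r * math.cos(phi), r * math.sin(phi), math.sqrt(1.0 - u1))

def estimate_irradiance(radiance, n=20000):
    """Monte Carlo estimate of E = integral of L(w) cos(theta) dw over
    the hemisphere. With cosine-weighted samples, the cos(theta)/pi
    factor cancels the pdf, leaving E ~ (pi / n) * sum of L(w_i)."""
    return math.pi * sum(radiance(sample_cosine_hemisphere())
                         for _ in range(n)) / n

random.seed(42)
# "Cosine sky" test environment: L(w) = cos(theta).
# The exact irradiance is 2*pi/3, about 2.094.
estimate = estimate_irradiance(lambda w: w[2])
```

Cosine-weighted sampling concentrates rays where the cosine term makes contributions largest, which is why the caching of these (expensive) indirect values, described above, pays off so handsomely.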
3. Approach

Our goal was to achieve interactive rates during a walkthrough. To attain this, we needed to implement an efficient, scalable parallelization and exploit frame-to-frame coherence. Scalable parallelization decreased the amount of time required to perform the required calculations, while exploiting frame-to-frame coherence reduced the number of calculations required for each frame.

3.1 Parallelization

We originally parallelized Radiance using a hybrid message passing/shared memory approach for communication, which ran on a 64-processor SGI Origin. We have since moved to a strictly message passing approach, which is applicable to a wider range of architectures, including workstation clusters. Our goal was scalability to 64 or more processors (processing elements or PEs) within a single machine. MPI (Message Passing Interface) [9] was used for communication between PEs and for synchronization. The implementation runs on 64 processors of a Cray T3E.

A client, running on an SGI Onyx 2 connected to the parallel rendering engine via an FDDI network, was used as a front end to display and interact with the visualization. Separating the display from the parallelized ray casting allowed us the freedom to utilize different types of parallel architectures for the ray casting calculations without having to modify the client.

One PE was designated as the master process that controlled communication with the client and synchronization between the PEs. If ambient lighting was incorporated, another PE served as the ambient master process, controlling the sharing of ambient calculation results among processors in order to reduce the number of rays cast. All other PEs, the workers, were dedicated to ray tracing.

In this investigation, we focused on achieving interactive rates for moderately complex scenes. We did not attempt to distribute the scene octree or the ambient cache across the processors. Each processor had a complete copy of both data structures.

Load balancing was accomplished using a dynamic approach. The PE receiving the data (the master) kept track of the location of the next block of pixels to ray trace. When a worker PE finished, it sent the finished data to the master PE using MPI and grabbed the location of the next block from the master. Blocks were kept small to achieve better load balancing, keeping in mind that there was a tradeoff between block size and the amount of inter-processor communication necessary.

If ambient lighting was incorporated, each worker PE kept track of how many ambient values it had generated, and when a certain number of values had been produced, it communicated them to the ambient master processor. The ambient master process then returned to the sending worker any results accumulated from other processors. The sending worker added these new values to its ambient cache. The number of values to accumulate before sending was kept small for better load balancing.

Worker PEs owned a rectangular, statically determined portion of the image to be rendered. A worker performed ray tracing, wrote results into its sub-image, and sent the sub-image to the master processor when all pixels were written. When a worker received the location (in global image coordinates) of a block of pixels to ray trace, it determined whether that location resided within its sub-image. After ray tracing, it either wrote the result into its sub-image or sent it to the worker that owned that result.

Initially, blocks of pixels were assigned sequentially by the master. This caused a performance bottleneck, because the master would send a series of block locations which were within the region of ownership of one worker, and thus all results funneled back into that worker. When the master instead sent block locations with a stride roughly equal to the number of workers, this problem was alleviated.

To reduce the amount of network communication between the computational engine and the client platform, workers compressed their sub-images prior to sending them to the master. The master read these into one buffer before sending it on to the client. The client decompressed each sub-image separately.

3.2 Point Cloud Algorithm

Parallelization by itself was not sufficient to achieve multiple frames per second. The amount of calculation required to generate a single frame precluded image generation at interactive rates. For example, if an image took 12 minutes to render sequentially, the percentage of pixels ray traced each frame must be less than 2 percent to achieve 5 frames per second on 64 PEs. Obviously, for greater frame rates, the percentage must be even less. We needed a method to reduce the amount of per-frame calculation.

One way to reduce the amount of per-frame calculation required is to exploit the frame-to-frame coherency during
walkthroughs and reuse as many pixels as possible. We developed a method for frame-to-frame pixel reuse that we call a point cloud. Pixels generated for an image correspond to points within the scene where a ray first intersected an object. We retained the positions of these points in scene coordinates and transformed the points into the viewing coordinate systems of subsequent frames.

We chose a geometric rather than image-based rendering approach [8], since Radiance requires a geometric model and generates geometric information as part of the rendering process. A walkthrough using the point cloud maintains a high degree of accuracy for a large number of frames, since the viewpoint rarely changes by a large extent. No reference frames are needed beyond the first image at the start of the walkthrough. The number of pixels that can be reused depends on how different the viewpoint is from frame to frame. The default rotations and translations we used in walkthroughs typically resulted in a change in position of approximately one percent of the field of view.

Originally, the point cloud calculation was performed on the client (SGI Onyx 2) in an attempt to limit the amount of data communicated between the client and server. The client retained knowledge of the point cloud and transmitted the locations that needed to be re-calculated to the server. This method, however, required more than one processor on the client side in order to achieve 10 frames a second for a 640x480 image. We did not want to restrict use of parallel Radiance to high-end clients or limit the image size. Our goal was to minimize the computational requirements on the client side so that it did not require any special hardware or excessive computational power.

We will first describe the algorithm for the point cloud and serial performance enhancements, before describing the parallel implementation in the next section. The point cloud was initially populated when a complete image was generated, one point per pixel. Afterwards, the points from the previous frame were mapped onto the screen (image) space using a standard perspective viewing projection and a z-buffer algorithm. After all the pixels had been transformed using the new viewpoint, the image was scanned for missing sections.

Exposure of previously occluded regions resulted in holes in the image, since no current point was projected into those locations. Ray tracing was performed in these areas to fill in the gaps. Other areas also needed to be recomputed (described later in this section). Typically, the pixels that needed ray tracing were only a small percentage of the total image.

Since there are no geometric continuities in the point cloud (it is a set of discrete points in three-dimensional space), enough pixels need to be projected into the image to give the illusion of continuity. Otherwise, the image would look much like a pointillist painting.

In the original implementation of the algorithm, there were as many points in the point cloud as pixels in the image. During the walkthrough, if a point from the previous frame was obscured, it was thrown away. Typically, a contiguous pixel in the image would need to be recomputed because no point projected to it. This approach, however, was inefficient because a hole may have opened up simply due to aliasing, such as on a large flat surface viewed at an angle. The point thrown away very likely could have been used in a subsequent frame. This inefficiency resulted in 10% or more of the pixels needing to be ray traced after only a small number of frames in the walkthrough.

To reduce this inefficiency and reuse more points, we changed the algorithm so that points were not thrown away, even if they were not used in the current frame. This change decreased the percentage of pixels that needed to be ray traced during a walkthrough to 2-3%. Obviously, points must be aged out of the cache at some time; otherwise the amount of information stored would exceed memory capacity. Currently, a point is eliminated if it has not been used in 25 frames or if it projects outside the viewing area. We call this augmented method a supersaturated point cloud.

Although the supersaturated point cloud method reduces the computation required by a factor of three or more, having more points in the point cloud than pixels in the image introduced inaccuracies. A point saved from a previous frame may map to a location that would have been recomputed using the original method because a hole would have appeared there. Determining that this point was actually a hole was problematic. Errors were limited by eliminating a point from the cache if it was obscured during projection by another point a large distance in front of it, or by forcing ray tracing of a pixel if there were greatly different z values at all neighboring points. The larger the change in viewpoint from frame to frame, the more likely this error was to occur.

Ideally, for each pixel in a new frame, a ray should be sent through the octree until it hits an object. The distance to that object would then be compared with the z-buffer value resulting from projecting that point. If the z values differ greatly, the point would need to be ray traced. Unfortunately, this requires too much computation, since the determination of ray-octree intersections is a large percentage of the original ray tracing calculation. In the current implementation, this check was only performed along the edges of the image where new objects were coming into view.
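The core reprojection step of the point cloud method (perspective projection of cached points, z-buffer hidden surface removal, and a scan for holes) can be sketched as follows. This is an illustrative reconstruction, not the paper's actual code: the fixed, unrotated camera looking down the -z axis, the point format, and all names are assumptions made for the example.

```python
import math

def project_point_cloud(points, eye, width, height, fov_deg=45.0):
    """Reproject cached points (x, y, z, color) in scene coordinates
    into a new view; return the pixel grid plus the list of hole
    pixels that must be ray traced."""
    f = (height / 2.0) / math.tan(math.radians(fov_deg) / 2.0)
    zbuf = [[math.inf] * width for _ in range(height)]
    image = [[None] * width for _ in range(height)]
    for (x, y, z, color) in points:
        # View-space coordinates relative to the new eye position.
        vx, vy, vz = x - eye[0], y - eye[1], z - eye[2]
        if vz >= 0:                    # behind the viewpoint
            continue
        depth = -vz
        px = int(width / 2 + f * vx / depth)
        py = int(height / 2 + f * vy / depth)
        if 0 <= px < width and 0 <= py < height and depth < zbuf[py][px]:
            zbuf[py][px] = depth       # z-buffer hidden surface removal
            image[py][px] = color
    # Pixels no point projected to are holes that must be ray traced.
    holes = [(px, py) for py in range(height) for px in range(width)
             if image[py][px] is None]
    return image, holes

# Two cached points along the view axis; the nearer one must win the
# depth test, and every other pixel of the tiny 4x4 image is a hole.
image, redo = project_point_cloud(
    [(0.0, 0.0, -5.0, "far"), (0.0, 0.0, -3.0, "near")],
    eye=(0.0, 0.0, 0.0), width=4, height=4)
```

A real implementation would carry the full view transform (rotation as well as translation) and the neighbor z-value test described above; the sketch keeps only the projection, the depth test, and the hole scan.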
The result of this constraint made backward movement within the scene infeasible until a much faster octree intersection method is implemented.

Another problem occurred if a ray intersected an object that had an associated transparent or reflective material, or both in the case of window glass. In the case of window glass, the z-value associated with the object visible through the glass was assigned to the corresponding point. We plan to assign and project an additional point using a z-value calculated from the distance to an object reflected in the glass. This latter point would be subject to the culling process described in the following paragraph.

The specular component of a lighting visualization is highly view-dependent. Glints and highlights can change drastically with slight changes in viewpoint. The original version of Radiance accurately calculated this component since each pixel was recalculated each frame. Currently, we do not handle this situation correctly using the point cloud method unless the material is a completely reflective surface, such as a mirror. We plan on aging out a point associated with a specular material after the viewpoint has changed enough to make a pixel's color noticeably inaccurate. In practice, a point is often obscured due to a change in viewpoint before its inaccuracy becomes noticeable.

If a material was a mirror, the distance to an object reflected in the mirror was calculated as if the object were physically present behind the mirror. The associated point was assigned this z-value and projected as usual. The point cloud method handled this correctly, except where larger changes in viewpoint caused these distant points to fill holes in the image outside the mirror.

3.3 Point Cloud Parallelization

In the parallel implementation of the point cloud, each worker was assigned a portion of the point cloud. As noted in a previous section, each worker owned a rectangular sub-region of the total image. It was responsible for any points that mapped to that sub-image.

The point cloud was initially populated during generation of a complete image. Recall that if a worker did not own a block of pixels that it ray traced, it sent the results of the ray tracing to the one that did. At this time it also sent the point locations in scene coordinates, as required by the point cloud algorithm.

For each frame in a walkthrough, each worker performed the perspective projection on the points that it owned. Due to the change in viewpoint from frame to frame, after projection the resulting image space coordinates of some points might no longer map into that worker's sub-image. These points were sent to the worker that owned them. That worker then placed them in the proper locations in its sub-image after performing hidden surface removal.

Once all workers finished the projection and redistribution step, each worker examined its sub-image to find areas that needed recalculation, added those pixels' locations to a redo list, and sent its redo list to the master processor. The master then parceled out the redo list to the workers for ray tracing in the same manner as for blocks of pixels in the full image. Although the shipping of the redo lists to the master increased the amount of communication overhead, this was compensated for by the more even distribution of the expensive ray tracing calculations among the workers.

Each worker might be responsible for a different number of points, depending on where holes in the image appear. In practice, the amounts do not differ by more than a few percent from processor to processor. The projection step currently takes so little processing time compared to the ray tracing that the overall load balance is not affected.

When more than 32 processors were used, only 32 of the available workers performed projection. In this case, the remaining workers only did ray tracing and always passed their results to a worker that did projection. For a larger number of processors, the sub-images were so small that the communication overhead dwarfed the gain in parallelization.

4. Results

We timed the original Radiance serial ray tracer and the parallel implementation of Radiance on representative walkthroughs of the cabin model available from the Radiance web site [10]. Figure 3 shows the starting views for the cabin walkthroughs. The results are summarized in Tables 1 and 2.

In the walkthroughs, the view direction was changed by the default rotation amount around the y-axis in each frame. We did timings with and without the use of the point cloud for non-ambient lighting. Image generation times using ambient lighting take on the order of 5 to 20 times longer than the results given in the tables, partly due to the greater number of rays and partly due to scaling problems.

We performed timings using a varying number of processors on a Cray T3E. All results are for the generation of 32-bit, 640x480 images. The speedup calculations include file I/O for the original Radiance
program, and socket I/O for the parallel implementation. The timings were done on non-dedicated machines, so any superlinear results should be viewed with caution.

The speedup was less for the point cloud because fewer pixels were ray traced. The ratio of computation to communication was decreased, and load balancing was more difficult because there were fewer blocks of pixels to distribute.

Frame rate varies with pixel reuse, which was scene and movement dependent. For scenes with a large amount of fine detail, such as the forest outside the cabin, pixel reuse could be as low as 85%. For cabin interior scenes and for the theater, forward movement typically resulted in 94% reuse. All other movement within the cabin or the theater, with default changes in viewpoint or view direction, resulted in 97-98% reuse (as noted in the previous section, backward movement was not allowed).

5. Conclusions and Ongoing Work

Our current results are not yet in the target interactive range of 10 frames per second. We have, however, drastically reduced the amount of time required to calculate a Radiance image. We are working in several areas to increase the frame rate.

The number of pixels rendered at one time by a worker processor affects load balancing and communication costs. We plan to have the master keep timing statistics and dynamically vary this number to improve the frame rate. For very large numbers of processors (greater than 64), more than one master processor may also be useful.

The amount of pixel reuse can be increased by super-sampling during idle periods, for example while the user is paused during a walkthrough or when a processor may be idle towards the end of frame generation. Super-sampling regions where there is a high gradient should also prove useful for anti-aliasing.

We are pursuing serial speedups as well. Radiance currently uses a scene description language that is interpreted during rendering. We have already observed a 20% speedup (included in all timings) by providing C code replacements for some scene modifiers, and plan to provide C versions of all portions of the scene description. The use of C versions does not require any modification to existing Radiance scene descriptions.

We are investigating how to improve scalability when using ambient lighting, and how to handle large changes in viewpoint when using ambient lighting. In the latter case, a new image may take several minutes to render, even with 64 or more processors. In general, it may not be possible to achieve interactive rates while including ambient lighting. A short-term goal is to not incorporate ambient lighting while a user is moving through the scene, and to calculate the ambient lighting and display the more accurate results when the user pauses.

It takes a tenth of a second or more to transmit a 640x480 32-bit image over an FDDI network. This becomes significant at higher rates of image generation. We incorporated JPEG compression to reduce the amount of data sent over the network. More than one processor on an SGI Origin was necessary to decompress an image fast enough to make a difference when an FDDI network was available. JPEG compression was worthwhile in any case if images were transmitted over a wide-area network. We are looking into alternative methods of compression/decompression where less work is necessary on the client side.

Over a wide-area network, issues such as network latency and synchronization become important. When faster frame rates are achieved, we may have to render at a higher frame rate and skip transmission of some images to achieve a smoother animation.

The point cloud method was a useful tool for achieving higher frame rates for architectural walkthroughs. It achieved a speedup of 50 or more on the generation of a single image in a sequence with a high degree of accuracy, and it offers the possibility of true interactivity in walkthroughs in supercomputer environments. This method will also accelerate the production of batch animations on single-processor PCs.

Use of the point cloud method is not limited to architectural walkthroughs. It should provide speedups in most situations where pixel generation time is orders of magnitude larger than point projection time. We are looking into the possibility of using the point cloud method in other scientific visualizations that use other forms of ray casting/tracing.

6. Acknowledgments

This work was supported by the Director, Office of Energy Research, Office of Basic Energy Sciences, of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098. We thank Charles Ehrlich of the Environmental Energy Technologies Division at LBNL for sharing his expertise during this project.
7. References
[1] Pharr, M., Kolb, C., Gershbein, R., and Hanrahan, P., "Rendering Complex Scenes with Memory-Coherent Ray Tracing", Computer Graphics (Proceedings of SIGGRAPH 97), August 1997, pp. 101-108.
[9] http://www.mpi-forum.org/
[10] http://radsite.lbl.gov/radiance/
# of PEs            Cabin Time    Cabin Speedup
Original program    14.32 min     1.0:1
6                   2.82 min      5.1:1
11                  1.41 min      10.2:1
31                  0.46 min      31.1:1
64                  0.22 min      65.1:1
Table 1. Full image generation, initial frames of cabin walkthrough, no ambient lighting.
Table 2. Point cloud, average of 50 frames in cabin walkthrough, no ambient lighting (97% reuse).
Figure 1. Rendering without ambient lighting
Figure 2. Rendering with ambient lighting