
Computer Graphics Forum, Volume 0 (1981), Number 0, pp. 1–12

A parallel SPH implementation on multi-core CPUs

Markus Ihmsen Nadir Akinci Markus Becker Matthias Teschner

University of Freiburg

Abstract

This paper presents a parallel framework for simulating fluids with the Smoothed Particle Hydrodynamics (SPH) method. For low computational costs per simulation step, efficient parallel neighborhood queries are proposed and compared. To further minimize the computing time for entire simulation sequences, strategies for maximizing the time step and the respective consequences for parallel implementations are investigated. The presented experiments illustrate that the parallel framework can efficiently compute large numbers of time steps for large scenarios. In the context of neighborhood queries, the paper presents optimizations for two efficient instances of uniform grids, i. e. spatial hashing and index sort. For implementations on parallel architectures with shared memory, the paper discusses techniques with improved cache-hit rate and reduced memory transfer. The performance of the parallel implementations of both optimized data structures is compared. The proposed solutions focus on systems with multiple CPUs. Benefits and challenges of potential GPU implementations are only briefly discussed.

Categories and Subject Descriptors (according to ACM CCS): Computer Graphics [I.3.7]: Three-Dimensional Graphics and Realism—Animation

© 2010 The Author(s). Journal compilation © 2010 The Eurographics Association and Blackwell Publishing Ltd. Published by Blackwell Publishing, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.

1. Introduction

Physically-based animation techniques are used to produce realistic visual effects for movies, advertisement and computer games. While the animation of fluids is becoming increasingly popular in this context, it is also one of the most computation-intensive problems. This paper focuses on the efficient simulation of fluids with Smoothed Particle Hydrodynamics (SPH) using multiple CPUs in parallel.

In SPH, the dynamics of a material is governed by the local influence of neighboring particles. In fluids, the set of neighboring particles dynamically changes over time. Therefore, the efficient querying and processing of particle neighbors is crucial for the performance of the simulation. The neighborhood problem for SPH fluids is similar to, e. g., collision detection or intersection tests in ray tracing. However, neighborhood queries in SPH are also characterized by unique properties that motivate our investigation of efficient acceleration data structures in this context.

For instance, adjacent cells in the employed spatial data structure have to be accessed efficiently. Further, an efficient data structure should employ the temporal coherence of a fluid between two subsequent simulation steps, but over larger time periods, it should preserve or restore spatial locality. As the acceleration structure has to be updated in each simulation step, query as well as construction times have to be optimized. Therefore, acceleration structures used in static applications, e. g. kd-trees or perfect hashing, might be too expensive to construct. Querying has to be efficient, as the set of neighboring particles has to be processed multiple times in an SPH algorithm for computing various attributes.

On average, a particle interacts with 30 neighbors. Storing the neighbor set for fast reuse is, thus, a natural choice. However, for systems with a low memory limit, this quickly limits the maximum complexity of the scene. In fact, recent SPH implementations on the GPU [HKK07b, ZSP08, GSSP10] do not store interacting particles, but recompute neighbor sets when needed. Consequently, the performance of these systems scales with the number of neighborhood queries in each simulation step. Therefore, either the gas equation [MCG03] or the Tait equation [Mon94, BT07] is employed, as these equations are fast to compute and the influencing particle sets are processed only twice per simulation step. However, the state equations suffer from visible compression artifacts if the time step is not sufficiently small.


The predictive-corrective incompressible SPH method (PCISPH) [SP09] is more expensive to compute, but it can handle significantly larger time steps. In this method, incompressibility is enforced by iteratively predicting and correcting the density fluctuation. Due to the iteration scheme, the neighbor set is processed at least seven times per simulation step, which might limit the performance of current GPU implementations.

Our contribution. We present an efficient system for computing SPH fluid simulations on multi-core CPUs. Since the query and processing of particle pairs (neighbor search) is crucial for the performance, we focus on parallel spatial acceleration structures. We propose compact hashing and Z-index sort as optimizations of the commonly used spatial hashing and index sort schemes.

A low memory transfer is essential for a good performance of parallel systems. We employ techniques that lower the latency when accessing particles and their interacting neighbors. Temporal coherence is exploited in order to reduce grid construction times for both acceleration structures. We show that the performance of the proposed compact hashing is competitive with Z-index sort, while compact hashing is more appropriate for large-scale fluid simulations in arbitrary domains.

In the context of SPH, the performance of compact hashing and Z-index sort is thoroughly analyzed. Both proposed schemes are compared with three existing variants of uniform grids, i. e. the basic uniform grid, spatial hashing and index sort.

We further analyze the two types of SPH algorithms used in Computer Graphics, namely state equation based algorithms (SESPH) and the predictive-corrective SPH algorithm (PCISPH). We discuss the performance aspects of these algorithms. We finally show simulations with up to 12 million fluid particles using compact hashing and PCISPH.

2. Related work

In this work, we focus on an efficient fluid solver using the Smoothed Particle Hydrodynamics method. While SPH is applied to model gas [SF95], hair [HMT01] and deformable objects [DC96, SSP07, BIT09], its main application in Computer Graphics is the simulation of liquids [MCG03]. In this field, research focuses on the simulation of various effects like interaction of miscible and immiscible fluids [MSKG05, SP08], phase transitions [KAG∗05, SSP07], user-defined fluid animations [TKPR06] or the simulation and visualization of rivers [KW06].

Recent research is also concerned with performance aspects of SPH. Adams et al. [APKG07] use an adaptive sampling with varying particle radii. The number of particles inside a volume is reduced, which significantly improves the performance for densely sampled fluids. However, in order to efficiently find interacting particles with different influence radii, a kd-tree is used which has to be rebuilt in every time step. According to the presented timings, the neighborhood search marks the performance bottleneck for this method.

There exist various approaches that address the implementation of SPH on highly parallel architectures, especially on the GPU. While SPH computations are easy to parallelize, the implementation of an efficient parallelized neighborhood search is not straightforward. In [AIY∗04], e. g., the GPU is only used for computing the SPH update but not for the neighborhood query. Later, [HKK07b, ZSP08] presented SPH implementations that run entirely on the GPU. These methods map a three-dimensional uniform grid onto a two-dimensional texture. Particle indices are stored in RGBA channels. Thus, only four particles can be mapped into each grid cell. Due to this issue, a smaller than optimal cell size has to be chosen in order to avoid simulation artifacts.

Generally, even though a uniform grid is fast to construct and the costs of accessing an item are in O(1), it suffers from a low cache-hit rate in the case of SPH fluid simulations. This is due to the fact that the fluid fills the simulation domain in a non-uniform way. Only a small part of the domain is filled, while the fluid also tends to cluster. Consequently, the grid is only sparsely filled and, hence, a significant amount of memory is unnecessarily allocated for empty cells. As stated in [HKK07b], this decreases the performance due to a higher memory transfer. In [HKK07a], an adaptive uniform grid is presented, where the memory consumption scales with the bounding box of the fluid volume.

Index sort is another approach to reduce the memory consumption and transfer of the uniform grid. Index sort first sorts the particles with respect to their cell indices c. Then, indices of the sorted array are stored in each cell. Each cell just stores one reference to the first particle with corresponding cell index. This idea has been described by Purcell et al. [PDC∗03] for a fast photon mapping algorithm on the GPU. In [OD08], a similar idea is applied to a non-parallel SPH simulation. The paper shows that index sort outperforms the spatial hashing scheme for simulations with a low number of particles, i. e. 1300. Index sort is used in NVIDIA's CUDA based SPH fluid solver [Gre08] and is also employed for fast parallel ray tracing of dynamic scenes [LD08, KS09]. In this work, we discuss important issues of the index-sort scheme in the context of a parallel SPH implementation on multi-core CPUs. Thereby, we propose a new variant named Z-index sort which includes SPH-specific enhancements.

In contrast to the uniform grid, the spatial hashing method can represent infinite domains. This method has been introduced for deformable solids [THM∗03] and rigid bodies [GBF03]. For particle-based granular materials, spatial hashing is applied by [BYM05]. We propose a memory-efficient data structure for the spatial hashing method called compact hashing that allows for larger tables and faster queries.


In general, a hash function does not maintain spatial locality of data, which increases the memory transfer. In order to enforce spatial locality, we employ a space-filling Z-curve. A similar strategy is used by Warren and Salmon [WS95]. In this approach, a hashed octree is presented for parallel particle simulations. For the cosmological simulation code GADGET-2, particles are ordered in memory according to a Peano-Hilbert curve [Spr05]. The Peano-Hilbert curve preserves the locality of the data very well, but is relatively expensive to evaluate. Instead, we use the Z-curve which can be efficiently computed by bit-interleaving [PF01].

In addition to the uniform grid, spatial hashing and index sort, various other acceleration structures have been presented. One of the most popular techniques is the Verlet-list method [Ver67, Hie07]. In this method, a list of potential neighbors is stored for each particle. Potential neighbors have a distance which is lower or equal to a predefined range s, where s is significantly larger than the influence radius r. Thereby, the list of potential neighbors needs to be updated only if a particle has moved more than s − r. However, the memory transfer of this method scales with the ratio s/r, as all potential particle pairs are processed in each neighborhood query.

In [KW06], particles are sorted according to staggered grids, one for each dimension. Instead of querying spatially close cells in all dimensions at once, dimensions are processed one after another. As stated in [KW06], this method works well for a low number of particles. For higher resolutions, the performance is lower compared to, e. g., the octree used in [VCC98]. However, as reported in [PDC∗03] and [HKK07b], hierarchical subdivision schemes are not a good choice for fluids with uniform influence radius, since the costs of accessing an item are in O(log n), while for uniform grids they are in O(1). This implies a higher memory transfer, which especially limits the performance of hierarchical data structures in parallel implementations.

The above-mentioned GADGET-2 simulation code is designed for massively parallel architectures with distributed memory. MPI (Message Passing Interface) is used for the parallelization. A parallel library for distributed memory systems is also presented by Sbalzarini et al. [SWB∗06]. Using the library, continuum systems can be simulated with different particle-based methods. [FE08] use orthogonal recursive bisection for decomposing the simulation domain, in order to achieve load-balanced parallel simulations of particle-based fluids. This approach is designed for cluster systems. While these approaches focus on distributed memory systems, our approach addresses shared memory systems, where parallelization can be efficiently realized using OpenMP [Ope05].

3. SPH

The SPH method approximates a function f(x_i) as a smoothed function ⟨f(x_i)⟩ using a finite set of sampling points x_j with mass m_j, density ρ_j, and a kernel function W(x_i − x_j, h) with influence radius h. According to Gingold and Monaghan [GM77] and Lucy [Luc77], the original formulation of the smoothed function is

    ⟨f(x_i)⟩ = ∑_j (m_j / ρ_j) f(x_j) W(x_i − x_j, h).    (1)

The concept of SPH can be applied to animate different kinds of materials including fluids. The underlying continuum equations of the material are discretized using the SPH formulation. Thereby, objects are discretized into a finite set of sampling points, also called particles. The neighborhood of a particle i is defined by the particles j that are located within the support radius of i, i. e. ||x_i − x_j|| ≤ h. Each particle represents the material properties of the object with respect to its current position. In each time step, these properties are updated according to the influence of neighboring particles. Therefore, the particle neighborhood has to be computed in each time step.

In the SPH method, derivatives are calculated by shifting the differential operator to the kernel function [Mon02]. Thereby, the computations are simplified. The most time consuming part of SPH simulations is the neighborhood query and the processing of neighbors, since the number of interacting particles (pairs) is significantly larger than the number of particles. In the following sections, we therefore focus on acceleration techniques and data structures for these methods.

4. Neighborhood search

The neighborhood search can be accelerated using spatial subdivision schemes. As reported in [HKK07b], the construction cost of a hierarchical grid is O(n log n) and the cost of accessing a leaf node is O(log n). In contrast, the construction cost of a uniform grid is O(n), while any item can be accessed in O(1). Therefore, uniform grid approaches are most efficient for SPH simulations with uniform support radius h.

4.1. Basic uniform grid

In the basic uniform grid approach, each particle i with position x = (x, y, z) is inserted into one spatial cell with coordinates (k, l, m). In order to determine the neighborhood of i, only particles that are in the same spatial cell or in one of the neighboring spatial cells within distance h need to be queried.

Obviously, the cell size d has an impact on the number of potential neighbors. The smaller the cell size, the smaller the number of potential pairs. However, with cell sizes smaller than h, the number of cells to query gets bigger. This might slow down the neighborhood query, due to a larger number of memory lookups. In the following, we assume a cell size of h. In this case, 27 cells have to be queried for each particle. We discuss the performance of different cell sizes in Sec. 6.
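To make the combination of Eq. (1) and the 27-cell query of the basic uniform grid concrete, the following C++ fragment is a minimal sketch of a parallel density computation. It is our own illustration, not the authors' implementation: the grid layout (a dense array of per-cell index lists over the AABB with cell size d = h), the poly6 kernel of [MCG03] and all identifiers are assumptions made for the example. For f = ρ, Eq. (1) reduces to ρ_i = ∑_j m_j W(x_i − x_j, h).

    #include <cmath>
    #include <vector>

    struct Vec3 { float x, y, z; };
    struct Particle { Vec3 pos; float mass, density; };

    // One common SPH kernel (poly6 of [MCG03]); any kernel W(r, h) would do here.
    inline float kernelW(float r, float h) {
        if (r > h) return 0.0f;
        float c = 315.0f / (64.0f * 3.14159265f * std::pow(h, 9.0f));
        float t = h * h - r * r;
        return c * t * t * t;
    }

    // cells[c] holds the indices of the particles in cell c; K, L, M are the
    // cell counts per axis and minCorner is the lower corner of the domain AABB.
    void computeDensities(std::vector<Particle>& ps,
                          const std::vector<std::vector<int>>& cells,
                          Vec3 minCorner, int K, int L, int M, float h) {
        #pragma omp parallel for                  // particles are independent here
        for (long i = 0; i < (long)ps.size(); ++i) {
            int k = (int)std::floor((ps[i].pos.x - minCorner.x) / h);
            int l = (int)std::floor((ps[i].pos.y - minCorner.y) / h);
            int m = (int)std::floor((ps[i].pos.z - minCorner.z) / h);
            float rho = 0.0f;
            for (int dk = -1; dk <= 1; ++dk)      // query the 27 neighboring cells
            for (int dl = -1; dl <= 1; ++dl)
            for (int dm = -1; dm <= 1; ++dm) {
                int kk = k + dk, ll = l + dl, mm = m + dm;
                if (kk < 0 || kk >= K || ll < 0 || ll >= L || mm < 0 || mm >= M) continue;
                for (int j : cells[kk + ll * K + mm * K * L]) {   // linear cell index k + l*K + m*K*L
                    float dx = ps[i].pos.x - ps[j].pos.x;
                    float dy = ps[i].pos.y - ps[j].pos.y;
                    float dz = ps[i].pos.z - ps[j].pos.z;
                    rho += ps[j].mass * kernelW(std::sqrt(dx*dx + dy*dy + dz*dz), h);
                }
            }
            ps[i].density = rho;
        }
    }

In an actual SPH step, the same 27-cell traversal is reused for the force computations, which is why the query and the processing of neighbors dominate the cost of the fluid update.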


Figure 1: Index sort data structure. The top row illustrates the uniform grid where the numbers refer to the corresponding cell indices. Each non-empty cell points to the first particle in the sorted particle array with corresponding cell index (bottom row).

Figure 2: The self-similar structure of the Z-curve in 2D is illustrated for a regular grid.

4.2. Index sort

Parallel construction. The parallel construction of the uniform grid is not straightforward since the insertion of particles into the grid might cause race conditions, i. e. two or more threads try to write to the same memory address concurrently.

As suggested in [PDC∗03, KS09], these memory conflicts can be avoided by using sorted lists. The index sort method first sorts the particles in memory according to their cell index c. The cell index c of a position x = (x, y, z) is computed as

    c = k + l·K + m·K·L,    (k, l, m) = ( ⌊(x − x_min)/d⌋, ⌊(y − y_min)/d⌋, ⌊(z − z_min)/d⌋ )    (2)

with d denoting the cell size. K and L may either denote the number of cells in x and y direction of the fluid's AABB or of the whole simulation domain.

In contrast to non-sorted uniform grids, a grid cell no longer stores references to all the particles in this cell. In fact, a cell with index c only stores one reference to the first particle in the sorted array with cell index c. We use a parallel reduction to insert the references into the grid [KS09]. Thereby, each bucket entry B[k] in the sorted particle array B is tested against B[k − 1]. Let c_k denote the cell index of the particle stored at B[k]. If c_k is different from c_{k−1}, a reference to B[k] is stored in the spatial grid cell G[c_k]. This procedure can be performed in parallel since race conditions do not occur. The final data structure is illustrated in Fig. 1. Index sort avoids expensive memory allocations while the memory consumption is constant. However, the whole spatial grid needs to be represented in order to find neighboring cells.

Parallel query. The neighborhood query for parallel architectures is straightforward. The sorted particle array is processed in parallel. For each particle i, all particles in the 27 neighboring cells are tested for interaction. Due to the sorting, particles that are in the same spatial cell are also close in memory. This improves the memory coherence (cache-hit rate) of the query. However, it depends on the indexing scheme whether particles in neighboring cells are also close in memory. In the following section, we discuss important aspects when applying the index sort method on parallel CPU architectures.

4.3. Z-index sort

Indexing scheme. Current shared-memory parallel computers are built with a hierarchical memory system, where a very fast, but small cache memory supplies the processor with data and instructions. Each time new data is loaded from main memory, it is copied into the cache, while blocks of consecutive memory locations are transferred at a time. A new block may displace other blocks of data if the cache is full. Operations can be performed with almost no latency if the required values are available in the cache. Otherwise, there will be a delay. Thus, the performance of an algorithm is improved by reducing the amount of data transferred from main memory to the cache. Consequently, for parallel algorithms we have to enforce that threads share as much data as possible without generating race conditions.

Even though it is not possible to directly control the cache, the program can be structured such that the memory transfer is reduced. The indexing scheme defined in (2) is not spatially compact, since it orders the cells according to their z position first. Thus, particles that are close in space are not necessarily close in memory. In order to further reduce the memory transfer, we suggest to employ a space-filling Z-curve for computing the cell indices.

Rather than sorting an n-dimensional space one dimension after another, a Z-curve orders the space by n-dimensional blocks of 2^n cells. This ordering preserves spatial locality very well due to the self-containing (recursive) block structure (see Fig. 2). Consequently, it leads to a high cache-hit rate while the indices can be computed fast by bit-interleaving [PF01]. In Sec. 6, we show that the Z-curve improves the cache-hit rate and, therefore, improves the performance for the query and processing of particle neighbors.
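The following C++ sketch illustrates how such a Z-curve cell index can be computed by bit interleaving, and how the sorted-handle structure of Fig. 1 can then be built. It is only an illustration under our own assumptions (10-bit cell coordinates, a dense cellStart table covering a power-of-two grid, std::sort instead of the insertion sort discussed below); it is not the authors' code.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Spread the lower 10 bits of v so that two zero bits separate consecutive bits.
    static uint32_t spreadBits(uint32_t v) {
        v &= 0x3ff;
        v = (v | (v << 16)) & 0x030000ff;
        v = (v | (v << 8))  & 0x0300f00f;
        v = (v | (v << 4))  & 0x030c30c3;
        v = (v | (v << 2))  & 0x09249249;
        return v;
    }

    // Z-curve (Morton) index of the integer cell coordinates (k, l, m), cf. [PF01].
    static uint32_t zIndex(uint32_t k, uint32_t l, uint32_t m) {
        return spreadBits(k) | (spreadBits(l) << 1) | (spreadBits(m) << 2);
    }

    struct Handle { uint32_t cell; uint32_t particle; };   // key-value pair: cell index, particle id

    // Sort the handles by cell index and store, for every non-empty cell, a
    // reference to its first handle (parallel reduction over the sorted array [KS09]).
    // cellStart is assumed to be sized to the Z-index range of the grid.
    void buildCellStarts(std::vector<Handle>& handles, std::vector<uint32_t>& cellStart) {
        const uint32_t EMPTY = 0xffffffffu;
        std::sort(handles.begin(), handles.end(),
                  [](const Handle& a, const Handle& b) { return a.cell < b.cell; });
        std::fill(cellStart.begin(), cellStart.end(), EMPTY);
        #pragma omp parallel for
        for (long i = 0; i < (long)handles.size(); ++i)
            if (i == 0 || handles[i].cell != handles[i - 1].cell)
                cellStart[handles[i].cell] = (uint32_t)i;
    }

With Z-curve indices, handles that are adjacent in the sorted array correspond to cells that are close in space, which is what raises the cache-hit rate of the query.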


Sorting. The particles carry several physical attributes, e. g. velocity, position and pressure. When sorting the particles, these values have to be copied several times. Thus, the memory transfer might slow down the sorting significantly. In order to avoid sorting the particle array in every simulation step, we suggest to use a secondary data structure which stores a key-value pair, called handle. Each handle stores a reference to a particle (value) and its corresponding cell index (key). Due to the minimal memory usage of this structure, sorting the handles in each time step is much more efficient than sorting the particle array.

In order to yield high cache-hit rates, the particle array itself should still be reordered. However, we experienced that it is sufficient to reorder the particle array every 100th simulation step since the fluid simulation evolves slowly and coherently. The temporal coherence of the simulation data can be exploited further. According to the Courant-Friedrichs-Lewy (CFL) condition, a particle must not move more than half of its influence radius h in one time step. Thus, the average number of particles for which the cell index changes in a consecutive simulation step is low, i. e. 2%. Therefore, we propose to use insertion sort for reordering the handles. Insertion sort is very fast for almost sorted arrays. Usually, parallel radix sort is used for sorting [Gre08, OD08]. As we show in Sec. 6, the insertion sort outperforms our parallel radix sort implementation on CPUs even for large data sets.

Generally, index sort variants are considered to be the fastest spatial acceleration methods. However, according to [Gre08] there are two limitations. First, the memory consumption scales with the simulation domain. Second, for sorting, the whole particle array has to be reprocessed after computing the cell indices. In the following sections, we discuss an acceleration structure which is able to represent infinite domains.

4.4. Spatial hashing

In order to represent infinite domains with low memory consumption, we employ the spatial hashing procedure introduced in [THM∗03]. In this scheme, the effectively infinite domain is mapped to a finite list. The hash function that maps a position x = (x, y, z) to a hash table of size m has the following form:

    c = ( (⌊x/d⌋ · p1) xor (⌊y/d⌋ · p2) xor (⌊z/d⌋ · p3) ) % m    (3)

with p1, p2, p3 being large prime numbers that are chosen similar to [THM∗03] as 73856093, 19349663 and 83492791, respectively.

Unfortunately, different spatial cells can be mapped to the same hash cell (hash collision), slowing down the neighborhood query. The number of hash collisions can be reduced by increasing the size of the hash table. According to [THM∗03], the hash table should be significantly larger than the number of primitives, in order to minimize the risk of hash collisions. Our experiments indicate that for SPH, a hash table size of two times the number of particles is appropriate.

In order to avoid frequent memory allocations, [THM∗03] reserve memory for a certain number k of entries in all hash cells on initialization. Thereby, a table with m hash cells consumes O(m · k) memory. However, on average, the number of non-empty cells is only 12% of the total number of particles. For SPH fluids, the hash table is generally sparsely filled and a significant amount of memory is unnecessarily preallocated. Furthermore, for most of the hash cells, adjacent cells might be empty, which reduces the cache-hit rate for the insertion and query. In the following section, we present solutions to these problems.

4.5. Compact hashing

In order to reduce the memory consumption, we propose to use a secondary data structure which stores a compact list of non-empty (used) cells. The hash cells only store a handle to their corresponding used cell. Memory for a used cell is allocated if it contains particles and deallocated if the cell gets empty. Thus, this data structure consumes constant memory for the hash table storing the handles and additional memory for the list of used cells. In contrast to the basic uniform grid, the memory consumption scales with the number of particles and not with the simulation domain. The data structure is illustrated in Fig. 3.

Figure 3: Compact hashing. A handle array is allocated for the hash table of size m. The handles point to a compact list which stores the small number of n used cells (yellow shaded). For a used cell, memory for k entries is reserved. The memory consumption is thereby reduced to O(n · k + m). Note that the neighborhood query traverses only the n used cells.
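As an illustration of Eq. (3) and of the layout sketched in Fig. 3, the following C++ fragment shows one possible way to organize the hash table of handles and the compact list of used cells. It is a sketch under our own assumptions (identifiers, interface); the paper does not prescribe this exact structure.

    #include <cmath>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hash of Eq. (3): primes from [THM*03], d = cell size, m = table size
    // (about twice the number of particles, as suggested above).
    inline uint64_t hashIndex(float x, float y, float z, float d, uint64_t m) {
        const int64_t p1 = 73856093, p2 = 19349663, p3 = 83492791;
        int64_t k = (int64_t)std::floor(x / d);
        int64_t l = (int64_t)std::floor(y / d);
        int64_t n = (int64_t)std::floor(z / d);
        return (uint64_t)((k * p1) ^ (l * p2) ^ (n * p3)) % m;
    }

    struct UsedCell {
        std::vector<uint32_t> particles;    // particle ids stored in this used cell
    };

    struct CompactHashGrid {
        std::vector<uint32_t> handles;      // hash table of size m: index into usedCells, or UINT32_MAX
        std::vector<UsedCell> usedCells;    // compact list, one entry per non-empty hash cell

        explicit CompactHashGrid(std::size_t m) : handles(m, UINT32_MAX) {}

        void insert(uint32_t particleId, uint64_t hashedCell) {
            uint32_t& h = handles[hashedCell];
            if (h == UINT32_MAX) {                    // first particle in this hash cell
                h = (uint32_t)usedCells.size();
                usedCells.push_back(UsedCell{});
            }
            usedCells[h].particles.push_back(particleId);
        }
    };

The neighborhood query then traverses only usedCells rather than the full handle table, which is exactly what keeps the memory footprint and the memory transfer proportional to the number of particles.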


Figure 4: Particles are colored according to their location in memory, where red is the last and white is the first position. Spatial locality is not maintained if particles are not reordered (left). Since the hash function abolishes spatial locality, reordering according to the cell index (middle) is a bad choice to reduce the memory transfer. Spatial compactness is enforced using a Z-curve (right).

The compact list of used cells already lowers the expected memory transfer. However, the hash function is not designed to maintain spatial locality, but rather abolishes it. Thus, compared to index sort, the required memory transfer for the query is still comparatively large. This again results in much larger query times, even if there are no hash collisions. In the following, we propose techniques which significantly improve the performance of the construction and query for the spatial hashing method.

Parallel construction. Like for the non-sorted uniform grid, the particles cannot be inserted into the used cells in parallel. On the other hand, temporal coherence can be exploited more efficiently. Therefore, we do not reconstruct the compact list in each time step, but only account for the changes in each cell.

In a first loop, the spatial cell coordinates c_i = (k, l, m) of each particle i are computed. Only if c_i has changed, the new cell index of i is computed with (3) and the particle is added to a list of moving particles. In a second step, the moving particles are removed from their old cell and inserted into the new one. Note that the second step has to be performed serially, but with very low amortized costs since the number of moving particles is generally small, i. e. around 2% on average.

Parallel query. The neighborhood query is straightforward. The compact list of used cells can be processed in parallel without any race conditions. Nevertheless, the query is expected to be comparatively slow for the spatial hashing. This is due to hash collisions and an increased memory transfer, as consecutive used cells are close in memory, but not close in space. We now give solutions to overcome these limitations.

Hash collisions. For a used cell without hash collisions, all particles are in the same spatial cell and, hence, the potential neighbors are the same. However, if there is a hash collision in a used cell, the hash indices of the neighboring spatial cells have to be computed for each particle. This results in a significant computational overhead, particularly since we do not know if there is a hash collision in a cell or not. In order to keep the overhead low, we suggest to store a hash-collision flag in each used cell. Therefore, the hash indices have to be computed only once for cells without hash collisions.

The memory consumption of the proposed data structure scales with the number of particles and not with the hash (handle) table size. Therefore, the hash table can always be set to a size which enforces a very low number of hash collisions. For SPH, this number is usually below 2%. Storing a collision flag significantly speeds up the query since the performance is almost not influenced by the low number of hash collisions.

Spatial locality. Hash functions are designed to map data from a large domain to a small domain. Thereby, the data is scattered and spatial locality is usually not preserved. The list of used cells is, hence, not ordered according to space, which means that consecutive cells are not spatially close (see Fig. 4). Thus, an increased memory transfer is expected for the query since already cached data might not be reused.

In order to reduce the memory transfer, we again sort the particles according to a Z-curve every 100th simulation step. Sorting is performed similarly as described in Sec. 4.3. Instead of sorting the particle array with all attributes, we sort a handle structure. Each handle consists of cell index and reference to a particle. Then, all other attributes of the particle are sorted accordingly.

Note that when sorting the particle array, the pointers in the used cells become invalid. Therefore, the compact used-cell list has to be rebuilt. Used cells are created each time a particle is added to an empty hash cell. Thus, if we simply insert the sorted particles serially into the compact hash table, the order of the used cells is consistent with the Z-curve order of the particles. We employ a similar parallel reduction scheme as in Sec. 4.2, in order to parallelize this step. The computational overhead of reconstructing the compact list of used cells is negligible since reordering every 100th step is sufficient.
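A minimal sketch of the temporally coherent construction described above (a parallel change-detection pass followed by a short serial re-insertion of the moved particles) could look as follows; the grid interface and all identifiers are our own assumptions, not the authors' API.

    #include <vector>

    struct CellCoord {
        int k, l, m;
        bool operator!=(const CellCoord& o) const { return k != o.k || l != o.l || m != o.m; }
    };

    // newCoords[i] holds the spatial cell coordinates of particle i in the current
    // step, oldCoords[i] those of the previous step. Only particles whose cell
    // changed (around 2% on average) are re-inserted, and that part runs serially.
    template <class Grid>
    void updateCompactGrid(Grid& grid, const std::vector<CellCoord>& newCoords,
                           std::vector<CellCoord>& oldCoords) {
        std::vector<int> moving;                        // particles that changed their cell
        #pragma omp parallel
        {
            std::vector<int> local;                     // per-thread list avoids write conflicts
            #pragma omp for nowait
            for (long i = 0; i < (long)newCoords.size(); ++i)
                if (newCoords[i] != oldCoords[i]) local.push_back((int)i);
            #pragma omp critical
            moving.insert(moving.end(), local.begin(), local.end());
        }
        for (int i : moving) {                          // second step: serial, low amortized cost
            grid.remove(i, oldCoords[i]);
            grid.insert(i, newCoords[i]);
            oldCoords[i] = newCoords[i];
        }
    }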


In this section, we described a spatial hashing scheme optimized for SPH simulations on parallel architectures. By allocating memory only for non-empty hash cells, the overall memory consumption is reduced. Furthermore, for the neighborhood query, only the small percentage of used cells is traversed, which significantly improves the performance. Finally, for shared memory systems, the memory transfer is minimized by reordering particles and the compact list according to a space-filling curve.

5. Fluid update

In Computer Graphics, two different algorithms are generally used for updating SPH fluids, namely SESPH [MCG03, BT07] and PCISPH [SP09]. The neighborhood search discussed in Sec. 4 builds the base for these algorithms since physical attributes are updated according to the influence of neighboring particles. This is illustrated in Alg. 1 and Alg. 2, where we marked the steps that process the neighbor set with red and bold letters. In this section, we discuss performance and accuracy issues of both SPH algorithms.

Algorithm 1: SESPH
    foreach particle i do
        compute density;
        compute pressure (4) or (5);
    foreach particle i do
        compute all forces;
        integrate;

Algorithm 2: PCISPH
    foreach particle i do
        compute forces F_i^{υ,st,ext}(t);
        set pressure to 0;
        set pressure force to (0, 0, 0)^T;
    k = 0;
    while (max(ρ*_err,i) > η or k < 3) do
        foreach particle i do
            predict velocity;
            predict position;
        foreach particle i do
            update distances to neighbors;
            predict density variation;
            update pressure;
        foreach particle i do
            compute pressure force;
        k += 1;
    foreach particle i do
        integrate;

5.1. SESPH

The SESPH algorithm loops only two times over all particles in each simulation step (see Alg. 1). Particles and their neighbors are processed in two subsequent loops, in order to update the physical attributes.

For SESPH, the pressure can be computed by using an equation of state, namely the Tait equation [BT07] or the gas equation [MCG03]. The Tait equation relates the pressure p with the density ρ polynomially as

    p = k ( (ρ/ρ0)^7 − 1 )    (4)

where ρ0 denotes the rest density and k is a stiffness parameter that governs the relative density fluctuation ρ − ρ0. While for (4) the pressure grows polynomially with the compression of the fluid, for the gas equation

    p = k (ρ − ρ0)    (5)

it just grows linearly. Consequently, (4) results in larger pressures than (5), which restricts the time step. Furthermore, as reported in [BT07], k should not be set too small in order to avoid compression artifacts. In order to get plausible simulation results with SESPH, small time steps are required.

5.2. PCISPH

In contrast to SESPH, the PCISPH algorithm [SP09] does not use an equation of state for computing the pressure. This method predicts and corrects the density fluctuations in an iterative manner. Thereby, the density error is propagated until the compression is resolved. However, in every simulation step, at least three iterations are required in order to limit temporal fluctuations (see Alg. 2). Thereby, the neighbor set is processed 7 times in total. Thus, the PCISPH method is comparatively expensive to compute.

5.3. Performance comparison

For SPH simulations, various authors [HKK07b, Gre08, GSSP10] give the number of simulation updates per lab second and refer to this as frames per second (fps). In our opinion, this number gives only small insights into the overall performance as long as the time step is not given. In order to assess the performance of our system, we do not only state the simulation updates per second, but also the time step. In the following, ups denotes the number of times the simulation is updated (neighbor search + fluid update) in one lab second.

Accordingly, PCISPH might be more efficient than the SESPH method, despite the computational overhead per fluid update. The reason is that the PCISPH method can handle significantly larger time steps. [SP09] reports up to 150 times larger time steps in comparison to SESPH using the Tait equation [BT07].


Also, SESPH using the gas equation requires much smaller time steps than PCISPH in order to achieve plausible simulation results. For all of our simulations, the time step for PCISPH could be set at least 25 times higher than for SESPH with (5). However, SESPH updates the simulation only 2.5 times faster than PCISPH. Accordingly, PCISPH outperforms both SESPH algorithms at least by a factor of ten.

Storing or recomputing neighbors. For some platforms, dynamic memory management is expensive or challenging. In such cases, recomputing the particle neighbors on-the-fly can be efficient. Consequently, state-of-the-art SPH implementations on the GPU do not store particle pairs, but recompute them when needed. In contrast, for our multi-core CPU systems, we experienced that performing the neighbor search only once per simulation step and querying the pairs from memory is more efficient.

In Table 1, we analyze the performance difference for the fluid update when querying pairs from memory and recomputing the neighborhood on-the-fly. When pairs are stored, neighbors are computed in a first loop and then written to and read from memory. In Table 1, these times are included for store neighbors. On our system, the overhead of writing and reading the data to memory pays off. While the benefit is rather low for SESPH, it is significant for PCISPH. This is due to the fact that PCISPH performs seven neighbor queries if pairs are not stored. Accordingly, for PCISPH, the neighbor query becomes the bottleneck if neighbors need to be recomputed, as for the GPU based systems presented in [HKK07b, ZSP08, Gre08, GSSP10].

    particles   recompute neighbors         store neighbors
                SESPH      PCISPH           SESPH      PCISPH
    130K        47.1       190.1            44.4       105.4
    1700K       662.1      2649.5           572.3      1378.7

Table 1: Performance comparison of storing the neighbors and recomputing them on-the-fly. Here, SESPH and PCISPH times (in milliseconds) include the neighbor query, but not the construction time of the grid. Simulations were performed on an Intel Xeon 7460 using all 24 CPUs with 2.66 GHz.

6. Results

In this section, we evaluate the performance of the investigated data structures. We start with a detailed performance analysis of the five uniform grid variants. By comparing the construction and query times, we show that the proposed Z-index sort and the compact hashing are most efficient. Next, we discuss the performance influence of the cell size and the scaling of our system for a varying number of threads. Finally, we give complete analyses for some example scenes, where we also compare the performance of PCISPH and SESPH. In the following, PCISPH and SESPH times are for reading neighbors from memory. Time required for neighbor search is listed separately.

Our test system has the following specifications:

• CPU: 4x 64-Bit Intel Six Core Xeon 7460 @ 2.66 GHz, 16 MB shared L3 cache, 64 MB Snoop Filter that manages cached data to reduce data traffic
• Memory: 128 GB RAM, Bus Speed 1066 MHz

6.1. Performance analysis

We use one example scene for analyzing the performance of the presented spatial acceleration methods. For this scene, we have chosen a corner breaking dam with 130K particles (CBD 130K) since it is a typical test scene in the field of computational fluid dynamics. We averaged the performance over 20K time steps (see Table 2).

    method               construction   query          total
    basic uniform grid   25.7 (27.5)    38.1 (105.6)   63.8 (133.1)
    index sort [Gre08]   35.8 (38.2)    29.1 (29.9)    64.9 (77.3)
    Z-index sort         16.5 (20.5)    26.6 (29.7)    43.1 (50.2)
    spatial hashing      41.9 (44.1)    86.0 (89.9)    127.9 (134.0)
    compact hashing      8.2 (9.4)      32.1 (55.2)    40.3 (64.6)

Table 2: Performance analysis of different spatial acceleration methods with and without (in brackets) reordering of particles. Timings are given in milliseconds for CBD 130K and include storing of pairs.

Basic uniform grid. The construction of the basic uniform grid is not parallelized due to race conditions. The query step loops over all particles in parallel and computes the neighbors. Thereby, we do not have to traverse the whole, sparsely filled grid. However, if particles are not reordered, the memory transfer is high, which has a significant impact on the performance.

Index sort. In comparison to the basic uniform grid, the query of the index sort is much faster since the query processes the particles in the order of their cell indices. Consequently, the cache-hit rate is high, which improves the performance significantly. However, the construction time is comparatively large due to a slow sorting time. Our parallel radix sort algorithm sorts the 130K key-value pairs (handles) in 25 ms, which is slow compared to the much faster multi-core sort implementation described in [CNL∗08]. Note that for this method, particles are reordered according to their cell index which is computed with (2).


Z-index sort. For Z-index sort, the handles are sorted using insertion sort instead of radix sort. Due to the small changes from one time step to the next, only a small number of particles move to another spatial cell. Thus, the handles are only slightly disordered from one time step to the next. Since insertion sort performs very well on almost sorted lists, the sorting time is significantly reduced to 6 ms. Furthermore, in contrast to index sort, the cell indices are computed on a Z-curve and the particles are reordered accordingly. By mapping spatial locality onto memory, the cache-hit rate is increased and, hence, the construction and query time for Z-index sort is further reduced.

Spatial hashing. We have implemented the spatial hashing method described in [THM∗03]. This method performs the worst due to hash collisions and a high memory transfer invoked by the hash function. Note that even when reordering the particles according to a Z-curve, the improvement is marginal since we traverse the whole hash table and spatially close cells are not close in memory. However, if we loop over the particles and not the whole grid, spatial compactness can be exploited. In this case, the query time reduces to 50 ms when particles are reordered (this number is not given in Table 2).

Compact hashing. The proposed compact hashing exploits temporal coherence in the construction step. Thereby, the insertion of particles into the grid is five times faster than for spatial hashing. Furthermore, the query is nearly three times faster due to the reduced memory transfer invoked by the compact list of used cells and the hash-collision flag.

Reordering. Note that we reorder the particles every 100th step according to a Z-curve in all methods except index sort. By reordering the particles according to a Z-curve, spatially close particles reside close in memory. Thus, particles that are close in memory are very likely spatial neighbors. Since in SPH, most computations are interpolations requiring information of neighbors, the Z-curve increases the cache-hit rate for the neighborhood query and for the fluid update. Reordering significantly reduces the memory transfer and thereby improves the performance (see Table 3).

    method       compact hashing   PCISPH (SESPH)   ups
    no reorder   64.6              171.8 (38.4)     4.2 (9.6)
    reorder      40.3              78.6 (17.6)      8.3 (16.7)

Table 3: Performance improvement when particles are reordered every 100th step. ups denotes simulation updates (neighbor search + fluid update) per lab second. Timings are in milliseconds for CBD 130K.

We summarize that compact hashing is as efficient as Z-index sort. Both methods outperform spatial hashing and the basic uniform grid. For compact hashing, the memory consumption scales with the number of particles, while for Z-index sort it scales with the domain. Therefore, for very large or unrestricted domains, compact hashing should be preferred. On the other hand, if influencing pairs can not be stored, Z-index sort should be used since the query time is lower.

6.2. Cell size

The size of the spatial cells influences the number of potential pairs that have to be tested for interaction. Generally, smaller cell sizes better approximate the shape of the influence radius h, which is a sphere in 3D. If the cell size d is chosen as h, only 27 cells need to be tested for interaction. If d < h, the number of potential cells grows, but fewer particles have to be tested for interaction. However, if more cells are queried, the memory transfer increases. We observed that the increased memory transfer is more expensive than testing more potential neighbors for interaction (see Table 4). Thus, the optimal cell size is the influence radius.

    cell size   pairs   tested pairs   ratio   query
    h           3.6M    25M            6.9     26.6
    0.5h        3.6M    15.5M          4.3     39.6

Table 4: Influence of the cell size on the ratio of tested pairs to influencing pairs and the query time in milliseconds. Test scene CBD 130K.

6.3. Parallel scaling

Optimally, the speed-up from parallelization would be linear. However, the optimal scaling can not be expected due to the parallelization overhead for synchronization and communication between different threads. According to Amdahl's law [Amd67], even a small portion of the problem which cannot be parallelized will limit the possible speed-up. For example, if the sequential portion is 10%, the maximum speed-up is 10.

Rewriting algorithms in order to circumvent data dependencies is fundamental to increase the possible speed-up. Furthermore, the performance of an algorithm is either memory or CPU bound. While for CPU-bound algorithms the performance is easily increased by using an additional or faster CPU, the bottleneck for memory-bound problems is the bandwidth. Therefore, only strategies that reduce the memory transfer improve the efficiency.

The presented strategies and techniques employed in Z-index sort and the compact hashing, as well as the reordering of particles, lower the memory transfer and, thus, improve the performance of the system significantly. The scaling and even the flattening of the proposed system is in very good agreement with Amdahl's law for a problem that is parallelized to 95% (see Fig. 5 and Table 5). In contrast, the scaling of spatial hashing is much worse.

    CPUs   compact hashing   Z-index sort   PCISPH   SESPH
    1      404.1             391.3          791.1    189.5
    24     40.3              43.1           78.6     17.6

Table 5: Parallel scaling of neighbor search and simulation update. For 24 CPUs a speed-up of 10 is achieved. Timings are in milliseconds for CBD 130K.

Figure 5: Total speed-up of our system for fluid update and neighbor search using compact hashing. The scaling is in good agreement with Amdahl's law. For comparison, the scaling of spatial hashing is given.
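As a worked example of the Amdahl's law bound used above (our own arithmetic, not from the paper): for a parallel fraction p and N threads, the expected speed-up is S(N) = 1 / ((1 − p) + p/N). With p = 0.95 the limit for N → ∞ is 1/0.05 = 20, while S(24) = 1 / (0.05 + 0.95/24) ≈ 11.2; the measured factor of roughly 10 on 24 CPUs (Table 5) is therefore close to what the model predicts.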


    scene       # particles   compact hashing [ms]   PCISPH (SESPH) [ms]   ∆t [s]             ups
    Glass       75K (20K)     30.8                   29.8 (7.3)            0.0025 (0.0001)    16.5 (26.2)
    CBD 130K    130K          40.3                   78.6 (17.6)           0.0006 (1.8e-5)    8.3 (16.7)
    CBD large   1.7M          427.8                  1130.0 (271.6)        0.0002 (5.7e-6)    0.6 (1.4)
    River       12M (5M)      5193.6                 11941.8               0.0005             0.06

Table 6: Average performance in milliseconds for different scenes. Compact hashing is used in all scenes. For the glass and river scene, we use boundary particles (numbers are given in brackets). Fluid update times, time step and simulation updates per lab second (ups) are for PCISPH and in brackets for SESPH (similar simulation result).

6.4. Scaling with particles

Finally, we show that our system scales linearly with the number of particles (see Table 6). We therefore set up different small-scale and large-scale simulations. In the glass scene, a glass is filled with 75K particles (see Fig. 7). For this scene, a plausible simulation result is achieved by PCISPH for a time step of 0.0025. Thus, ten real-world seconds were simulated in four minutes. With SESPH we computed a similar result with a 25 times smaller time step in 63 minutes using (5).

Due to the large memory capacity of CPU architectures, our system shows good performance for large-scale simulations with millions of particles. For those setups, querying the particles multiple times has a significant impact on the performance. On average, for the CBD large scene, 341 million pairs are queried for interaction per simulation step and 2.3 billion pairs for the river scene (see Fig. 6 and Fig. 8).

Figure 6: CBD large scene. A corner breaking dam with 1.7 million particles simulated with PCISPH.

Our system updates an SESPH simulation of 130K particles with 17 ups. This performance is similar to the fastest GPU implementation [GSSP10]. However, we present simulations with up to 12 million particles using the more efficient PCISPH method, while [GSSP10] shows simulations with up to 250K particles using SESPH.

7. Conclusion

We presented a parallel CPU-based framework for SPH fluid simulations. Important aspects which are critical for the performance of such a system are discussed. For acceleration structures based on uniform grids, the construction and query times are reduced by lowering the memory transfer. This is achieved by mapping spatial locality onto memory, using compact data structures and exploiting temporal coherence. Furthermore, we showed how the spatial hashing can be optimized for a parallel SPH framework. We thoroughly analyzed the performance aspects of the five presented uniform grid methods and give detailed scaling analyses. Additionally, we investigated the performance of different SPH algorithms, i. e. PCISPH and SESPH.


Figure 7: Glass scene. The ’wine’ is sampled with up to 75K particles, while the glass is sampled with 20K boundary particles.
Interaction is computed with [BTT09].

Figure 8: River scene. The fluid consists of 12.1 million particles and the terrain is sampled with more than 5 million particles. The particles are coded according to acceleration, where white is high and blue is low.

Acknowledgements

This project is supported by the German Research Foundation (DFG) under contract number TE 632/1-1. We thank the reviewers for their helpful comments.

References

[AIY∗04] AMADA T., IMURA M., YASUMORO Y., MANABE Y., CHIHARA K.: Particle-based fluid simulation on GPU.

[Amd67] AMDAHL G.: Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities. In AFIPS Conference Proceedings (1967), vol. 30, pp. 483–485.

[APKG07] ADAMS B., PAULY M., KEISER R., GUIBAS L.: Adaptively sampled particle fluids. In SIGGRAPH '07: ACM SIGGRAPH 2007 papers (New York, NY, USA, 2007), ACM Press, p. 48.

[BIT09] BECKER M., IHMSEN M., TESCHNER M.: Corotated SPH for deformable solids. Eurographics Workshop on Natural Phenomena (2009), 27–34.

[BT07] BECKER M., TESCHNER M.: Weakly compressible SPH for free surface flows. In SCA '07: Proceedings of the 2007 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (Aire-la-Ville, Switzerland, 2007), Eurographics Association, pp. 209–217.

[BTT09] BECKER M., TESSENDORF H., TESCHNER M.: Direct forcing for Lagrangian rigid-fluid coupling. IEEE Transactions on Visualization and Computer Graphics 15, 3 (2009), 493–503.

[BYM05] BELL N., YU Y., MUCHA P. J.: Particle-based simulation of granular materials. In SCA '05: Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (New York, NY, USA, 2005), ACM Press, pp. 77–86.

[CNL∗08] CHHUGANI J., NGUYEN A. D., LEE V. W., MACY W., HAGOG M., KUANG CHEN Y., BARANSI A., DUBEY P.: Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture. In 34th International Conference on Very Large Data Bases (2008), pp. 1313–1324.

[DC96] DESBRUN M., CANI M.-P.: Smoothed Particles: A new paradigm for animating highly deformable bodies. In Eurographics Workshop on Computer Animation and Simulation (EGCAS) (1996), Springer-Verlag, pp. 61–76. Published under the name Marie-Paule Gascuel.

[FE08] FLEISSNER F., EBERHARD P.: Parallel load-balanced simulation for short-range interaction particle methods with hierarchical particle grouping based on orthogonal recursive bisection. International Journal for Numerical Methods in Engineering 74 (2008), 531–553.

[GBF03] GUENDELMAN E., BRIDSON R., FEDKIW R.: Nonconvex rigid bodies with stacking. ACM Trans. Graph. 22, 3 (2003), 871–878.

[GM77] GINGOLD R., MONAGHAN J.: Smoothed particle hydrodynamics: theory and application to non-spherical stars. Monthly Notices of the Royal Astronomical Society 181 (1977), 375–398.


[Gre08] GREEN S.: Particle-based Fluid Simulation. http://developer.download.nvidia.com/presentations/2008/GDC/GDC08_ParticleFluids.pdf, 2008.

[GSSP10] GOSWAMI P., SCHLEGEL P., SOLENTHALER B., PAJAROLA R.: Interactive SPH Simulation and Rendering on the GPU. In SCA '10: Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (2010).

[Hie07] HIEBER S.: Particle-Methods for Flow-Structure Interactions. PhD thesis, Swiss Federal Institute of Technology, 2007.

[HKK07a] HARADA T., KOSHIZUKA S., KAWAGUCHI Y.: Sliced data structure for particle-based simulations on GPUs. In GRAPHITE '07: Proceedings of the 5th International Conference on Computer Graphics and Interactive Techniques in Australia and Southeast Asia (New York, NY, USA, 2007), ACM, pp. 55–62.

[HKK07b] HARADA T., KOSHIZUKA S., KAWAGUCHI Y.: Smoothed Particle Hydrodynamics on GPUs. In Proc. of Computer Graphics International (2007), pp. 63–70.

[HMT01] HADAP S., MAGNENAT-THALMANN N.: Modeling Dynamic Hair as a Continuum. Computer Graphics Forum 20, 3 (2001), 329–338.

[KAG∗05] KEISER R., ADAMS B., GASSER D., BAZZI P., DUTRÉ P., GROSS M.: A Unified Lagrangian Approach to Solid-Fluid Animation. In Proceedings of the Eurographics Symposium on Point-Based Graphics (2005), pp. 125–134.

[KS09] KALOJANOV J., SLUSALLEK P.: A parallel algorithm for construction of uniform grids. In HPG '09: Proceedings of the 1st ACM Conference on High Performance Graphics (New York, NY, USA, 2009), ACM, pp. 23–28.

[KW06] KIPFER P., WESTERMANN R.: Realistic and interactive simulation of rivers. Proceedings of the 2006 Conference on Graphics Interface (2006), 41–48.

[LD08] LAGAE A., DUTRÉ P.: Compact, fast and robust grids for ray tracing. In SIGGRAPH '08: ACM SIGGRAPH 2008 talks (New York, NY, USA, 2008), ACM, pp. 1–1.

[Luc77] LUCY L.: A numerical approach to the testing of the fission hypothesis. The Astronomical Journal 82 (1977), 1013–1024.

[MCG03] MÜLLER M., CHARYPAR D., GROSS M.: Particle-based fluid simulation for interactive applications. In SCA '03: Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (Aire-la-Ville, Switzerland, 2003), Eurographics Association, pp. 154–159.

[Mon94] MONAGHAN J.: Simulating free surface flows with SPH. Journal of Computational Physics 110, 2 (1994), 399–406.

[Mon02] MONAGHAN J.: SPH compressible turbulence. Monthly Notices of the Royal Astronomical Society 335, 3 (2002), 843–852.

[MSKG05] MÜLLER M., SOLENTHALER B., KEISER R., GROSS M.: Particle-based fluid-fluid interaction. In SCA '05: Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (New York, NY, USA, 2005), ACM, pp. 237–244.

[OD08] ONDERIK J., DURIKOVIC R.: Efficient Neighbor Search for Particle-based Fluids. Journal of the Applied Mathematics, Statistics and Informatics (JAMSI) 4, 1 (2008), 29–43.

[Ope05] OPENMP ARCHITECTURE REVIEW BOARD: OpenMP Application Program Interface, Version 2.5, May 2005.

[PDC∗03] PURCELL T. J., DONNER C., CAMMARANO M., JENSEN H. W., HANRAHAN P.: Photon mapping on programmable graphics hardware. In HWWS '03: Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware (Aire-la-Ville, Switzerland, 2003), Eurographics Association, pp. 41–50.

[PF01] PASCUCCI V., FRANK R. J.: Global static indexing for real-time exploration of very large regular grids. In Supercomputing '01: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (CDROM) (New York, NY, USA, 2001), ACM, pp. 2–2.

[SF95] STAM J., FIUME E.: Depicting fire and other gaseous phenomena using diffusion processes. In SIGGRAPH '95: Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques (New York, NY, USA, 1995), ACM Press, pp. 129–136.

[SP08] SOLENTHALER B., PAJAROLA R.: Density Contrast SPH Interfaces. In Proceedings of the 2008 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (2008).

[SP09] SOLENTHALER B., PAJAROLA R.: Predictive-Corrective Incompressible SPH. In SIGGRAPH '09: ACM SIGGRAPH 2009 Papers (2009).

[Spr05] SPRINGEL V.: The cosmological simulation code GADGET-2. Mon. Not. R. Astron. Soc. 364 (2005), 1105–1134.

[SSP07] SOLENTHALER B., SCHLÄFLI J., PAJAROLA R.: A unified particle model for fluid-solid interactions. Computer Animation and Virtual Worlds 18, 1 (2007), 69–82.

[SWB∗06] SBALZARINI I., WALTHER J., BERGDORF M., HIEBER S., KOSALIS E., KOUMOUTSAKOS P.: PPM - A highly efficient parallel particle-mesh library for the simulation of continuum systems. J. Comp. Phys. 215, 2 (2006), 566–588.

[THM∗03] TESCHNER M., HEIDELBERGER B., MÜLLER M., POMERANETS D., GROSS M.: Optimized Spatial Hashing for Collision Detection of Deformable Objects. In Proc. of Vision, Modeling, Visualization (VMV) (2003), pp. 47–54.

[TKPR06] THÜREY N., KEISER R., PAULY M., RÜDE U.: Detail-preserving fluid control. In SCA '06: Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (Aire-la-Ville, Switzerland, 2006), Eurographics Association, pp. 7–13.

[VCC98] VERMURI B., CAO Y., CHEN L.: Fast collision detection algorithms with applications to particle flow. Computer Graphics Forum 17, 2 (1998), 121–134.

[Ver67] VERLET L.: Computer experiments on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules. Phys. Rev. 159, 1 (1967), 98–103.

[WS95] WARREN M., SALMON J.: A portable parallel particle program. Computer Physics Communications 87, 1-2 (1995), 266–290.

[ZSP08] ZHANG Y., SOLENTHALER B., PAJAROLA R.: Adaptive Sampling and Rendering of Fluids on the GPU. In Proceedings Symposium on Point-Based Graphics (2008), pp. 137–146.
