
Computers and Geotechnics 172 (2024) 106408


Research paper

Parallel implementation of LS-DEM with hybrid MPI+OpenMP


Peng Tan, Nicholas Sitar ∗
Department of Civil and Environmental Engineering, University of California, Berkeley, Berkeley, 94704, CA, USA

ARTICLE INFO

Keywords:
Level-set DEM
High performance computing
Particle mechanics

ABSTRACT

The discrete element method (DEM) is ideally suited for the study of the mechanical behavior of particulate media. While serial DEM codes can readily handle models containing thousands of regularly shaped particles, the computational demands increase significantly once the problem size grows to hundreds of thousands or millions of particles. In our study of the mechanical properties of naturally deposited sands we adopted the Level-Set DEM (LS-DEM, Kawamoto et al. (2016)), which captures the kinematics and mechanics of a system of arbitrarily shaped 3D particles using level set functions as the geometric basis. Herein, we present a parallel binning algorithm, implemented and optimized in C++ on top of the existing LS-DEM framework. The binning algorithm effectively reduces the computational complexity from O(n²) to O(n). The code maps the relationship between bins and particles with a linked-list-like data structure and manages MPI communication in two major phases: border/ghost exchange and across-block migration. Performance-critical implementation details are managed to achieve high performance and scalability. The new code shows excellent weak scalability, with a negligible serial fraction and a low parallel overhead, requiring only 5% of the computational resources used by the original LS-DEM code.

1. Introduction

Grain morphology plays an essential role in determining the macroscopic properties of granular assemblies. Recent developments in the characterization of granular systems from X-ray computed tomographic (XRCT) images enable the study of grain morphology, soil fabric, and grain interactions through the numerical reconstruction of three-dimensional, complex-shaped avatars. As a result, accurate grain-level morphology information can be integrated into a numerical method such as the discrete element method (DEM) to understand the link between the microscopic properties of the granular material and its engineering behavior (see e.g. Kawamoto et al., 2018; Nie et al., 2021). However, modeling of granular assemblies on a realistic scale requires significant computational resources. Therefore, there has been considerable interest in developing parallel DEM codes (see e.g. Plimpton, 1995; Henty, 2000; Baugh Jr. and Konduri, 2001; Washington and Meegoda, 2003; Maknickas et al., 2006; Walther and Sbalzarini, 2009; Chorley and Walker, 2010; Kačianauskas et al., 2010; Kloss et al., 2012; Gopalakrishnan and Tafti, 2013; Amritkar et al., 2014). A comprehensive review of the various approaches is given in Yan and Regueiro (2018b, 2019).

In general, most parallel implementations of DEM deal only with spheres rather than complex-shaped grains (Yan and Regueiro, 2018b). In addition, to alleviate the computational cost of DEM, grain shapes are often simplified and non-physical parameters such as rolling resistance are introduced in contact models. Although it is acceptable to sacrifice grain size and shape in the trade-off for a larger problem size in prototype-scale simulations for engineering studies, overly simplified grains cannot reproduce the detailed micromechanical behavior of the grain assembly, as pointed out by Peters et al. (2009). In numerical simulations, complex grain shapes are accounted for mostly by clustering or clumping of spheres (Garcia et al., 2009; Tamadondar et al., 2019; Garcia and Bray, 2019; Wu et al., 2021; Angelidakis et al., 2021; Nie et al., 2021), by using a simplex or polygon as a base geometry (Zhao et al., 2006; Zhao and Zhao, 2021), or by generating realistic grains based on Fourier descriptors or spherical harmonic functions (Garboczi, 2002; Taylor et al., 2006; Mollon and Zhao, 2012; Zhou et al., 2015).

Recently, the idea of using a level set (LS) to encode object shape, as proposed by Osher and Sethian (1988), has drawn substantial attention in the realm of image segmentation, as it can fully capture the complex morphology of natural granular materials with high fidelity. The fundamental idea of LS-based image segmentation is to implicitly represent the boundaries of objects through grid-based interpolation from a space that is one dimension higher. The strength of an LS-based algorithm is that it can track motion on a fixed Eulerian grid, which is handy when dealing with topological changes as the curve evolves. For

∗ Corresponding author.
E-mail address: [email protected] (N. Sitar).

https://doi.org/10.1016/j.compgeo.2024.106408
Received 11 February 2023; Received in revised form 4 May 2024; Accepted 6 May 2024
0266-352X/© 2024 Elsevier Ltd. All rights reserved.
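The level-set representation discussed in the introduction can be made concrete with a few lines of C++. The following is our own minimal sketch, not the authors' implementation: a grain is encoded as a grid of signed distances (negative inside the surface, positive outside), so a penetration test reduces to sampling the grid at a point. All names (`LevelSetGrid`, `makeSphereLS`, `sample`) are hypothetical, and a sphere stands in for a real reconstructed grain.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// A level-set grid stores the signed distance to a grain surface at every
// grid point: negative inside the grain, positive outside.
struct LevelSetGrid {
    int n;                    // grid points per axis
    double h;                 // grid spacing
    std::vector<double> phi;  // signed distance values, size n*n*n

    double at(int i, int j, int k) const { return phi[(i * n + j) * n + k]; }
};

// Build the level set of a sphere of radius r centered in the grid
// (a stand-in for a reconstructed grain avatar).
LevelSetGrid makeSphereLS(int n, double h, double r) {
    LevelSetGrid g{n, h, std::vector<double>(n * n * n)};
    double c = 0.5 * (n - 1) * h;  // center coordinate of the grid
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < n; ++k) {
                double x = i * h - c, y = j * h - c, z = k * h - c;
                g.phi[(i * n + j) * n + k] =
                    std::sqrt(x * x + y * y + z * z) - r;
            }
    return g;
}

// Nearest-grid-point sample: a negative value means the query point
// penetrates the grain surface.
double sample(const LevelSetGrid& g, double x, double y, double z) {
    auto idx = [&](double v) {
        int i = (int)std::lround(v / g.h);
        return i < 0 ? 0 : (i >= g.n ? g.n - 1 : i);
    };
    return g.at(idx(x), idx(y), idx(z));
}
```

In LS-DEM proper the sampling is interpolated rather than nearest-point, and the query points are the surface nodes of a neighboring grain; the sign check is the same idea.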

Fig. 1. (a) SEM image of a tight fabric of a naturally deposited sand; and (b) 900 × 900 pixel LS reconstruction of the fabric from an XRCT scan.

this reason, coupling the LS representation of grain morphology with a DEM simulation is an attractive approach, because the grain interactions are straightforward in the LS framework, as shown by Vlahinić et al. (2013). The LS discrete element method (LS-DEM) was introduced by Kawamoto (2018), who used high-fidelity LS grain reconstruction to investigate the kinematic and mechanical behavior of a system of discrete sand grains. For illustration, Fig. 1 shows a scanning electron microscope (SEM) image of a sample of a naturally deposited sand and a 900 × 900 pixel LS reconstruction of the sand from an XRCT scan. The modeling of such arbitrarily shaped particles typically demands several orders of magnitude greater computational effort than needed for an assemblage of spherical particles. For example, the CPU demand for simulating 122 million spheres is equivalent to that for simulating 488k poly-ellipsoids when using the same inter-grain contact model (Yan and Regueiro, 2018b). Thus, the LS-DEM method encounters a significant computational bottleneck related to the geometric representation of non-spherical grains. Specifically, the surface geometry of each grain is discretized into nodes embedded in an LS table, leading to a larger number of contact checks during force resolution. Moreover, incorporating history-dependent tangential shear between grains introduces additional complexity. While modeling intricate grains and using sophisticated inter-grain contact models may not be challenging in a sequential code, their implementation in a parallel code is significantly more arduous, primarily because tracking the loading–unloading–reloading path and contact histories becomes more cumbersome in parallel algorithms. Consequently, the parallel domain decomposition algorithm must be modified to accommodate various contact models, grain shape complexity, boundary conditions, and computational granularity.

In our approach (Tan and Sitar, 2022; Tan, 2022), we parallelized the LS-DEM code using a hybrid implementation of the message passing interface (MPI) and open multi-processing (OpenMP), utilizing a variant of the binning algorithm and a spatial domain decomposition strategy to enable modeling of complex-shaped grains with a history-dependent contact model. The implementation details that are crucial for achieving high performance and scalability have been optimally managed. While these algorithms have been tailored to integrate with the existing LS-DEM code, they can also be applied in other applications. In this paper, we provide an overview of the binning algorithm implementation in the context of DEM and of the MPI implementation. The resulting performance of the code is then evaluated in terms of speedup, efficiency, and scalability for different problem sizes.

2. Implementation steps

2.1. Contact detection complexity and binning algorithm

A conventional DEM formulation requires a considerable amount of contact detection. Therefore, efficient solutions of large-scale DEM problems necessitate fast and effective contact detection algorithms. There are three common neighbor search algorithms with varying time complexities: O(n²), which entails a complete n-by-n mapping of an assembly of grains; O(n log n), which utilizes multilevel grids derived from a tree-based algorithm (Jagadish et al., 2005; Muja and Lowe, 2009); and O(n) binning (Munjiza and Andrews, 1998; Williams et al., 2004) or link-cell algorithms (Grest et al., 1989). While Yan and Regueiro (2018a) argued that the choice of neighbor search algorithm, whether of complexity O(n²), O(n log n), or O(n), has no effect on contact resolution when dealing with non-trivially shaped grains, it significantly limits overall performance for complex-shaped grains. The most straightforward and commonly employed approach is the binning, or link-cell, algorithm.

2.2. Binning and domain decomposition strategy

The binning algorithm employs a hash function to assign each grain to a bin based on its coordinates, and then determines proximity using fixed relationships between the bins. To create the bins, a domain decomposition strategy is employed whereby the computational domain is divided into sub-domains, and each MPI processor performs calculations on its respective portion of the domain. Each sub-domain is further partitioned into multiple bins to hold the grains of interest, and the bin size is larger than the largest grain diameter to ensure that grains separated by more than one bin cannot be in contact. Fig. 2 illustrates the domain decomposition strategy and the concepts of bin, block, and border layer. One of the main advantages of this strategy is its high scalability for a large number of processors and its usability on both shared- and distributed-memory architectures (Gopalakrishnan and Tafti, 2013). The strategy is not limited to discrete element modeling and has been applied to boundary value problems in structural analysis (Zohdi and Wriggers, 1999) and to physics-based, finite-difference simulations of viscoelastic wave equations in both space and time (McCallen et al., 2021).

2.3. Contact detection and resolution algorithm

The LS-DEM code utilizes reconstructed avatars consisting of hundreds of nodes, and computes their volume, mass, and moments of


Fig. 2. Illustration of the domain decomposition strategy with eight sub-domains; the blue dots are boundary bins that require communication between the processors.

Fig. 3. Illustration of the neighbor search in the binning algorithm with the computational effort reduced by half due to symmetry.

inertia by counting the number of voxels contained within. A marching method is used to efficiently calculate the signed distance function, which is a one-time cost in constructing the rigid body model (Sethian, 1996). To simplify calculations, the avatars are stored with their center of mass at the origin and their axes aligned with the principal axes of inertia, resulting in a diagonal inertia tensor. Interactions between implicit surfaces are determined by checking the sign of the nodes of one surface in the signed distance function of the other surface. However, this approach may miss edge-face collisions when both edge vertices are outside the implicit surface. In a well-resolved mesh with sufficient node density, these errors can be ignored, as they are proportional to the edge length.

The most common particle overlap model used to determine contact force assumes that particles are spheres and relies on the separation distance between particles (Zohdi, 2017). In DEM simulations with complex-shaped grains consisting of hundreds of nodes per grain, the contact detection and resolution phases are often the primary computational bottleneck, especially when nonlinear history-dependent mechanical models are used. A neighbor search phase identifies or estimates objects that are close to the target object using an easy-to-model approximate geometry, such as a bounding box or a bounding sphere. The LS-DEM code excludes non-contacting pairs using the following criterion:

‖c_i − c_j‖ > r_i + r_j  (1)

where c_i denotes the position of the center of mass of grain i and r_i its equivalent radius. The contact resolution phase involves checking the surface of each of the hundreds of discretized nodes of an LS-reconstructed avatar against those of neighboring avatars. This stage is performed sequentially and has limited vectorization potential due to the explicit nature of DEM modeling. This computational cost is a necessary tradeoff for a three-dimensional DEM code capable of simulating grains of any shape. Resolving the contact between two arbitrarily shaped grains is significantly more computationally expensive than resolving the contact between two grains with basic geometrical representations, such as spheres. Additionally, due to the need for numerical precision and robustness, this stage frequently increases the floating-point operations by many orders of magnitude. Although we use the binning algorithm to limit the scope of the neighbor search, the total computing benefit of the neighbor search may be negligible for complex-shaped grains, because the neighbor search accounts for only a small fraction of floating-point operations, whereas the force resolution phase is the most computationally intensive portion of LS-DEM.

2.4. Search complexity of the binning algorithm

The binning algorithm is based on the assumption that grain interactions occur only when two grains are in contact. The node-to-surface contact is explicitly solved and cannot be vectorized. The computational cost of a naive search algorithm for a generalized N-body problem is well known to be O(n²) for an n-by-n search (Gray and Moore, 2000). However, if non-contacting pairs of objects are excluded, the computational complexity can be reduced to O(n) for the binning algorithm or O(n log n) for a tree algorithm. In this particular problem, the binning algorithm sorts each grain into a bin of a prescribed size hashed on the grain's coordinates. The cutoff distance is chosen to be at least the largest equivalent diameter of the grains, so that any two grains that are at least one bin apart cannot interact. Once the grains are sorted into bins, spatial proximity can be evaluated based solely on the fixed relationships of the bins. As illustrated in Fig. 3, the computational effort for neighbor detection is substantially decreased through binning, particularly for spherical particles (Zohdi, 2004a, 2010, 2012). Because grains cannot be arbitrarily densely clustered due to their 3D shape, the average number of grains in each bin can be assumed to be a small constant b, which reduces the computational complexity to O(n). However, for a very small number of grains, the O(n) neighbor search can be inferior to the O(n²) search for two reasons. First, both shared-memory and distributed-memory implementations incur synchronization and communication overheads, which are inevitable and always retard performance, particularly when performance is governed by communication bandwidth and latency. Second, the overall performance improvement from a parallel algorithm may be very limited for complex-shaped grains, because the algorithm only affects the performance of the neighbor estimate rather than the contact resolution, which takes up a large fraction of the floating-point operations in the computation. Therefore, it is acceptable and advisable to use the serial approach when the number of grains is relatively small. However, as the number of grains increases, the serial implementation becomes extremely inefficient.

2.5. Force resolution and numerical integration scheme

Normal contacts in LS-DEM are handled by iterating a node-to-surface contact algorithm, whereby nodes are seeded onto the surface of each grain and the grain shape is implicitly embedded in its LS grid value table. This representation is similar to a triangulated surface mesh, except that LS-DEM does not store connectivity information between nodes and therefore does not consider edge–surface collisions. The density of nodes on a given grain is a matter of choice and is entirely up


to the designer. The number of nodes seeded onto a grain has no effect on the underlying geometry, but it does affect the computational complexity of force resolution. With grain refinement, extremely high-fidelity reconstruction frequently requires unaffordable computation time. Lim et al. (2014) demonstrated that seeding with a maximum node-to-node spacing of less than d/10, where d is the grain diameter, is sufficient to capture grain morphology, and that a further increase in nodal density has a minor impact on the mechanical behavior. The LS-based contact algorithm, the tangential/traction force resolution at grain boundaries, and the numerical integration scheme adopted in the LS-DEM code were developed by Kawamoto et al. (2016). The details of the implementation are presented in Appendices A–C for completeness.

3. Hybrid MPI+OpenMP design considerations

In parallel computing, the two most commonly used APIs are MPI and OpenMP. MPI is the industry standard for message passing across distributed-memory machines and often incorporates the concept of domain decomposition. OpenMP, on the other hand, is a standard API for parallel programming on shared-memory architectures that parallelizes serial code by adding directives instructing the compiler how to distribute the workload at the data level.

To fully leverage the potential of multiprocessor clusters, a hybrid MPI+OpenMP model was implemented. This combination makes use of the data distribution and explicit message passing between cluster nodes offered by MPI, as well as the shared memory and multithreading within nodes supported by OpenMP.

3.1. Data abstraction and motivation for the use of data structure

The binning algorithm utilizes three levels of abstraction: blocks, bins, and grains. First, the computational domain is divided into blocks, with each block "owned" by a computing processor. Each block is then subdivided into equally sized bins with a length equal to or greater than the diameter of the largest grain in the assembly. The number of processors is chosen to minimize the total communication area between the computational sub-domains.

The bin length is expressed as R + ΔR, where R is the largest grain diameter in the assembly and ΔR is determined empirically to balance the bin size against the maximum number of grains that can reside in a bin. Increasing the bin size has the same effect as decreasing the total number of bins required to partition the domain. Each bin, serving as a primitive task unit, contains grains and communicates with its 26 neighbors to detect and resolve contacts and forces. Finally, grains are assigned to a given bin based on the coordinates of their center of mass in relation to the bin and block sizes.

In a clustered system, a fixed number of grains per bin is preferred to minimize border/ghost layer communication overheads. One way to achieve this is to pre-calculate the bin size and estimate the maximum number of grains that can fit within it. However, this method becomes less efficient for assemblies of arbitrarily shaped grains with a range of grain sizes. An alternative approach is to use a linked-list abstraction to map grain–bin associations, where grains are sequentially indexed by a unique number and only the exact total number of grains is maintained. This technique implements the belonging relationship between bins and grains, the neighbor list for each grain, and the ghost bins using four lists with O(1) operations. This approach is more memory-efficient and avoids potential memory management problems. Additionally, it avoids the resource waste caused by having more bins than grains, which can occur when the majority of bins are empty.

3.2. Border/ghost layer communications

The binning algorithm employs a labeling scheme that assigns a unique hash value to each grain based on its coordinates and the chosen bin size. By setting a carefully chosen cutoff distance, each grain is guaranteed to interact only with those in its 26 neighboring bins, as even the largest grain cannot span more than two bins. However, boundary grains may extend into neighboring sub-domains, requiring each sub-domain to maintain a copy of remote grains to correctly account for boundary grain interactions. This creates a halo region: an extended layer of bins outside the boundary bins that records updates from neighboring sub-domains. To update boundary information for sub-domains before proceeding to a new time step, border/halo communication is required. This involves exchanging border information to update boundary grains in the border/halo area with the latest translational velocity, angular velocity, rotation in the global frame, and center-of-mass location. The new code packs this information beforehand and minimizes communication overheads effectively. The amount of data required to update remote data in the halo regions is constant and small, allowing for collective transmission. Note that the shear histories of grains in border/halo layers do not need to be transmitted, as they are computed on their host processors.

3.3. Across-block migration

Across-block migration occurs when a grain enters or leaves its host block. If a grain migrates across a block border, one sub-domain must delete it while another must add it. In contrast to border/ghost communication, where data can be packed and communicated together, across-block migration must be handled individually for different grains, as the size of the message being transferred differs. A naive algorithm analogous to border/ghost communication is to construct a spatially outward extension from each processor's border (Yan and Regueiro, 2018b). The size of the extended layers is independent of the size of the virtual bins and is determined by the velocities of the grains and the time step used in the current time increment. Each processor should check its extended layers to see whether any of their grains have moved into its computational domain. If so, the corresponding processor should send such grains to the interested processor and delete them from its own space. However, this approach is not optimal for across-block migration, as a grain can move to any other processor, and its velocity may be difficult to estimate in some dynamic problems.

In general, three-dimensional DEM simulations fall into two main categories: static/quasi-static and dynamic. While the velocity range in problems of the former category can be reasonably estimated, it is more challenging to determine the position that a grain could reach in dynamic problems. The idea of an extension layer also creates technical difficulties in its implementation. In the binning algorithm, blocks that are mapped onto sub-domains might have different spatial locations, resulting in many edge cases. Furthermore, if a history-dependent tangential contact model is chosen, the shear history must be transmitted as well. It is less efficient to send and receive complex information of varying lengths from specific senders to specific receivers using point-to-point communication APIs (e.g., primitive send and receive functions) in an MPI implementation. This approach is also likely to cause communication deadlocks, which can be addressed using Boost.MPI, which runs on top of an MPI implementation, or using the one-sided communication features of MPI-3. In our new code, we adopt the highly tuned collective functions MPI_Alltoall() and MPI_Alltoallv(). This makes the code more readable, as all processors call the collective functions and the same code is applicable to all processors.
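The division of labor described in Section 3 — MPI ranks own blocks, while OpenMP threads share loop work inside a rank — can be hinted at with a directive-based sketch. This is our own illustration, not the paper's code: the function name is hypothetical and a linear contact law stands in for the real force model. The pragma degrades to a no-op if compiled without OpenMP support.

```cpp
#include <cassert>
#include <vector>

// Thread-level parallelism within one MPI rank: each contact's force update
// is independent, so the loop distributes cleanly across OpenMP threads.
void resolveForces(std::vector<double>& force,
                   const std::vector<double>& overlap, double k) {
    #pragma omp parallel for
    for (long i = 0; i < (long)force.size(); ++i)
        force[i] = k * overlap[i];  // placeholder linear contact law
}
```

In the hybrid model, a loop like this runs inside each rank after the MPI halo exchange has brought the boundary grains up to date.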

4
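The O(1) linked-list bookkeeping described in Section 3.1 is essentially the classic cell-list pattern. Below is a reduced sketch of two of the four lists (our own names, not the original code's): a per-bin head array and a per-grain "next" pointer, which together map every grain to its bin with constant-time insertion and O(n) total memory.

```cpp
#include <cassert>
#include <vector>

// Linked-list-like grain/bin mapping: binHead[b] is the first grain in bin b,
// nextInBin[g] chains to the next grain sharing that bin; -1 terminates a list.
struct CellList {
    std::vector<int> binHead;    // size = number of bins, -1 if bin is empty
    std::vector<int> nextInBin;  // size = number of grains

    CellList(int nBins, int nGrains)
        : binHead(nBins, -1), nextInBin(nGrains, -1) {}

    // O(1): push grain g onto the front of bin b's list.
    void insert(int g, int b) {
        nextInBin[g] = binHead[b];
        binHead[b] = g;
    }

    // Walk bin b's chain and collect the grains currently in it.
    std::vector<int> grainsIn(int b) const {
        std::vector<int> out;
        for (int g = binHead[b]; g != -1; g = nextInBin[g]) out.push_back(g);
        return out;
    }
};
```

The same pattern extends naturally to the neighbor lists and ghost bins mentioned in the text, one extra array per relationship.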
P. Tan and N. Sitar Computers and Geotechnics 172 (2024) 106408

3.4. Dynamic workload distribution 3.6. Limitations of the parallel implementation

In modeling semi-static load problems, such as the triaxial compres- Our parallel implementation treats the computational domain as
sion test, the distribution of workload among processors is relatively a cuboid and divides it into identical sub-domains to enable efficient
uniform due to the compact and uniform arrangement of grains in message passing. The specific configuration of sub-domains heavily
the assembly, despite variations in the number of nodes on their influences the linked list-like data structure, halo communication, and
reconstructed surface. Thus, even though the number of grains residing grain migration. While this implementation is optimized to use any
in a bin may vary, the overall workload in a sub-task is comparable, available processors and determine the best domain granularity, it
and the approximate cost can be estimated beforehand. However, for cannot handle inhomogeneous grain distributions across the domain,
inhomogeneous grain distributions, static domain decomposition may leading to resource waste.
not be as effective due to imbalanced workloads. In highly dynamic Another limitation is that the maximum number of processors that
can be used is limited by the cutoff size of a bin, which must be
simulations, such as debris flows or earthquake models, where grains
larger than the maximum equivalent diameter of a grain assembly.
can move and enter arbitrary sub-domains frequently, the number of
Additionally, the average number of grains assigned to one processor
grains residing in a fixed-size bin can fluctuate significantly.
must be larger than a certain value, which can cause some cores to
To address this issue, an adaptive re-binning scheme is used in the
remain idle at times. Work imbalance among processors is also induced
parallel simulation scheme adopted here, which can be applied to a
by the different amounts of additional interactions with ghost bins
wide range of simulations. This approach decomposes the computa-
in the border layer. The center sub-domain is entirely surrounded by
tional domain and redistributes resources to achieve the best balance
other sub-domains and is influenced by ghost bins from all six faces,
for the current geometrical configuration. This subroutine can be called
while a corner sub-domain has only three ghost bins. The number of
either at each time step or after a specified number of time steps,
bins involved in force resolution decreases as the number of processors
depending on the problem. The re-binning step has the same time increases, and the work ratio decreases as the number of bins per
complexity as the standard step and does not impact performance, as it sub-domain increases.
dynamically adjusts and re-assigns grains at the beginning of each time Finally, this implementation is not universally applicable and needs
step, which also skips border/halo communication. to be modified for different contact models, boundary conditions,
The fundamental idea behind the adaptive re-binning scheme is to and sorting algorithms. The specific sorting algorithm used in the
temporarily treat the boundary sub-domains as infinitely large, so that linked-list-like data structure determines the halo layer communication
any grain about to leave the computational domain will still belong and grain movement. Additionally, different boundary conditions are
to the boundary sub-domains. After a certain number of time steps, treated differently in parallel implementation. Despite these limitations,
the entire computation grid is regenerated. An alternative approach the code can handle many types of problems.
is the quad-tree algorithm, which divides the domain into flexible bin
sizes, counts the number of grains within a bin, and emphasizes load 3.7. Programming environment
balancing of each bin. While the dynamic decomposition approach can
improve load balancing in MPI-implementing simulations, it may not The code design and programming adhere to the principles of
adequately balance the actual workload among processors (Yan and Object-Oriented Programming (OOP), with various classes created to
Regueiro, 2019), as it primarily focuses on balancing the number of model practical concepts and objects present in a DEM simulation
grains. This can result in additional work imbalances among individual system, such as grains, bins, and exchange information packages. To
processors. ensure code robustness and performance, the implementation heavily
utilizes the Standard Template Library (STL), including vector and set.
3.5. Memory management The Eigen library is also utilized to simplify matrix-matrix or matrix–
vector multiplication and enable seamless mathematical operations.
Additionally, highly tuned API functions are utilized to accomplish
One of the simplest approaches in a parallel code implementation is to create a copy of all grains for each processor. This method is effective when the problem size is small, but it is important to optimize simulation speed. By allowing each processor to handle all updates to grain velocities and rotations without reloading the morphology data, data locality is preserved and performance is improved. However, as the number of processors increases for larger problems, several issues arise. First, there is a strain on memory, leaving little space for calculations. Second, in a clustered computing system, multiple processors may have to share a finite amount of memory, leading to under-utilization of computing resources. Third, during the initialization phase, reading all data files multiple times is a significant expense. Finally, this approach goes against the goal of parallelism, which is to partition data between processors when it is too large to fit on a single processor. Thus, to overcome these challenges, it is advantageous to read the morphology files only once and assign grains to processors based on their coordinates. When a grain migrates to another computational sub-domain, the new host processor consults the database and regenerates the avatar using the most recent velocity, rotation, and friction information. The memory associated with a migrating grain is marked as unused instead of being released, reducing the number of times the database needs to be queried. While this procedure is rarely invoked during quasi-static problems, it results in negligible additional overhead in both border/ghost communication and across-block migration when developing MPI functions for three-dimensional LS-DEM.

4. Implementation details and experience

4.1. Minimizing memory consumption and memory leak prevention

In the LS-DEM implementation, a C++ class is utilized to represent a grain, which contains a large LS table comprising the data for the nodes seeded on the reconstructed grain surface, along with their shear histories. However, as most of this data remains constant, and only a small amount of data requires communication among sub-domains, it is more efficient to index a grain object instead of storing it entirely in the bins.

It is worth noting that the number of grains can be smaller than the number of bins, which may seem counterintuitive since bins are a more extensive abstraction than grains. In a mini grain assembly test, 74 grains were distributed across a 100 × 100 × 100 computational domain, creating 125 bin partitions with a cutoff distance of 20. Therefore, instead of building 125 bins and assigning grains hashed on their mass centers, it is more efficient to maintain an array of 74 grains and track their bin indices. Although the contact detection, the resolution algorithm, and the time step operate directly on the grain array, the bin structure is still required to assist message packing in the MPI communication.
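As an illustration of hashing grains on their mass centers, the sketch below (our own naming and helper, not the paper's code) assumes the cubic 100 × 100 × 100 mini-assembly domain with a cutoff distance of 20, which yields the 5 × 5 × 5 = 125 bins mentioned above; bins would then store grain indices rather than full grain objects with their LS tables.

```cpp
#include <cassert>

// Assumed parameters from the mini grain assembly test: a 100^3 domain
// partitioned with a cutoff (bin edge length) of 20, giving 5 bins per axis.
const int kCutoff = 20;
const int kBinsPerAxis = 100 / kCutoff; // = 5, so 125 bins in total

// Hash a mass-center coordinate to a flat bin index (row-major in z, y, x).
int binId(double x, double y, double z) {
    int bx = static_cast<int>(x) / kCutoff;
    int by = static_cast<int>(y) / kCutoff;
    int bz = static_cast<int>(z) / kCutoff;
    return (bz * kBinsPerAxis + by) * kBinsPerAxis + bx;
}
```

With this mapping, a grain array of 74 entries only needs one integer per grain to record its bin, instead of duplicating the grain data inside the bins.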

P. Tan and N. Sitar, Computers and Geotechnics 172 (2024) 106408

Under the rigid body assumption, the information required to update the grain status for the next time step consists solely of the position of the center of mass, the rotation with respect to the global frame, and the translational and angular velocities. These quantities are necessary to update boundary grains in the outermost bins of a sub-domain through border/halo communication. Since the LS table and the node positions remain unchanged during a simulation, another implementation option is to regenerate grains from their morphology files with updated quantities. Thus, after border/ghost communication, a new grain can be generated from the corresponding morphology file with the freshly computed location, quaternion, and velocities. This alternative requires less memory for a large problem but needs to access disk memory more frequently. The use of a history-dependent tangential contact model makes across-block migration more complicated, as the shear history needs to be carried along with the migrating grains, since moving to another processor does not imply that a grain has lost contact with its neighbors.

To avoid memory leakage, containers in the Standard Template Library, such as vector and set, are relied on heavily, and the smart pointer feature is used whenever possible. Although raw pointers cannot be entirely avoided due to MPI's incompatibility with smart pointers, they are created and discarded with special care. Each processor keeps a copy of the full grain list to map bins and grains, which is a memory-efficient strategy. Additionally, the copy constructor and the assignment operator are overridden to prevent memory issues. Eigen and the STL inherently implement deep copies and provide memory protection.

4.2. Code simplification and edge case avoidance

In an MPI implementation, it is critical to ensure that as many processors as possible are utilized to accomplish the same task, thereby enhancing the overall efficiency of the system. To achieve this objective, our LS-DEM code incorporates a padding bin layer around each sub-domain, effectively increasing the number of bins along each dimension by two. This strategy ensures that all bins within a sub-domain are spatially equivalent, with each having a complete set of 26 neighboring bins. Additionally, our code assigns sub-domains of equal size to each processor, thus treating every processor uniformly at the domain decomposition level. The number of processors is chosen to minimize the communication overhead between MPI ranks, with the optimal domain granularity selected dynamically based on both the domain topology and resource availability. To further optimize performance, a minimum number of grains is assigned to each processor to mitigate the communication overhead; this value is determined by prototype test results and depends on the average number of nodes used to discretize the avatars. Furthermore, the code utilizes built-in MPI functions, including MPI_Dims_create(), MPI_Cart_create(), and MPI_Cart_shift(), to arrange the processors into a grid, thereby enabling automatic handling of edge cases during communication with neighbors. For instance, when the left-most processor is instructed to send data to its left neighbor, which does not exist, the MPI library seamlessly manages the scenario.

4.3. Reducing computational effort

To optimize computational efficiency, we leverage Newton's third law and achieve a reduction in computation time by a factor of two. In an ideal scenario where the grain surface is continuous and integrable at any location, the interaction force between master and slave grains is equal in magnitude but opposite in direction. However, due to the discrete representation employed in LS-DEM, the magnitudes of the forces and reaction forces are not identical. Consequently, the code always applies the force exerted by the grain with the lower index to the grain with the higher index, which guarantees that the resulting forces are identical within a sub-domain, but not at the boundaries, where forces are exerted from the halo layer to the inner grains. Nonetheless, this measure mitigates randomness due to thread-level parallelism and domain decomposition, maintaining differences between tests within an acceptable range. Newton's third law mandates iterating over only half of the adjoining bins. As depicted in Fig. 3, our code considers 13 neighboring bins, encompassing all nine bins above the target bin and four bins at the same level as the target bin. The choice of these 13 bins is arbitrary, but we selected them because the additional interactions between grains and boundary grains are reasonably straightforward. The code avoids handling edge cases as a result of the padding. In a shared-memory system, the grain–grain interaction is completely symmetrical because the computational domain and the message exchange are not partitioned. This symmetry does not hold for parallelism on distributed memory machines, as only the 'real' grains are considered and updated when a real grain interacts with grains in the halo region. For instance, a bottom grain in the upper sub-domain should interact with ghost grains in the lower sub-domain, but only the forces exerted by ghost grains on non-ghost grains are taken into account.

4.4. Position reasoning and linked-list data structure

A less sophisticated method for binning involves iterating through all the bins to identify contacts and determine forces. However, a more refined and efficient approach is to utilize linked lists to establish connections between grains and bins. This technique operates with a time complexity of 𝑂(1), allowing for a straightforward encoding and decoding of bin IDs and their corresponding grains. As shown in Fig. 4, this approach involves four interconnected lists.

(1) The first array:

int* bins = new int[num_of_bins];

has a length equal to the number of bins and is initialized to all −1. The value stored in each element of bins is the first grain number in that bin. For instance, bins[23]=13 means the first grain in the 23rd bin is #13, and bins[4]=-1 means there is currently no grain stored in the 4th bin.

(2) The second array

int* binList = new int[num_of_grains];

has a length equal to the number of grains and is initialized to all −1. The value stored in each element of binList is the bin ID that the grain belongs to. For instance, binList[13]=8 means grain #13 is stored in the 8th bin.

(3) The third array

int* grainList = new int[num_of_grains];

has a length equal to the number of grains and is initialized to all −1. The value stored in each entry of grainList is the ID of the next grain stored in the same bin as the current grain. For instance, grainList[13]=17 means the next grain stored in the same bin as grain #13 has index #17; similarly, grainList[17]=-1 means there are no more grains stored after grain #17. As a more comprehensive example, the entries bins[23]=13, grainList[13]=17, grainList[17]=9, and grainList[9]=-1 describe a search path for the grains stored in bin #23: the first grain is #13, followed by #17 and #9. There is no sequential order among the grains; the list merely assists neighbor searching and interaction computation.

(4) The fourth array

int* belongToRank = new int[num_of_particles];

has a length equal to the number of grains and is initialized to all 0. The value stored in each element of belongToRank is in the enumeration {0, 1, 2}: 0 means the grain is not associated with the current rank, 1 means the grain resides inside the associated sub-domain, and 2 means the grain is a ghost grain related to the current rank. For instance, belongToRank[2]=0 implies the 2nd grain does not belong to the current rank.
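The four arrays can be exercised with a small sketch. The insertGrain() helper below is our own illustration (not the paper's code) of the push-to-front update implied by the description; inserting grains 9, 17, and 13 into bin #23 in that order reproduces the example search path bins[23]=13 → 17 → 9 → −1 given above.

```cpp
#include <cassert>
#include <vector>

// Sketch of the linked-list bin structure: bins holds the head grain of each
// bin's list, binList maps a grain to its bin, and grainList chains grains
// that share a bin. All follow the -1 sentinel convention described above.
struct BinLists {
    std::vector<int> bins, binList, grainList;
    BinLists(int numBins, int numGrains)
        : bins(numBins, -1), binList(numGrains, -1), grainList(numGrains, -1) {}

    void insertGrain(int g, int b) {
        grainList[g] = bins[b]; // old head becomes the next grain in the chain
        bins[b] = g;            // new grain becomes the head of bin b
        binList[g] = b;         // remember which bin the grain belongs to
    }

    // O(k) walk over the k grains stored in bin b, starting from the head.
    std::vector<int> grainsInBin(int b) const {
        std::vector<int> out;
        for (int g = bins[b]; g != -1; g = grainList[g]) out.push_back(g);
        return out;
    }
};
```

Both the head lookup and the per-grain chain step are 𝑂(1), which is what makes iterating over grains rather than bins cheap in the force resolution stage.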


Fig. 4. Illustration of the grain search using a linked list data structure.
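The 13-bin half-shell used with Newton's third law in Section 4.3 can be generated programmatically. The predicate below is a sketch of one common choice (a lexicographic test on the offset, our own formulation): it selects the nine bins of the layer above plus four bins in the same layer, and the mirrored half of the shell is covered by the reaction forces.

```cpp
#include <array>
#include <vector>

// Generate the 13 "positive" neighbor offsets out of the 26 surrounding bins.
// Keeping an offset (dx, dy, dz) only when it is lexicographically positive
// (dz first, then dy, then dx) yields 9 offsets with dz = +1, 3 with
// dz = 0, dy = +1, and 1 with dz = 0, dy = 0, dx = +1.
std::vector<std::array<int, 3>> halfShell() {
    std::vector<std::array<int, 3>> out;
    for (int dz = -1; dz <= 1; ++dz)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dx = -1; dx <= 1; ++dx)
                if (dz > 0 || (dz == 0 && (dy > 0 || (dy == 0 && dx > 0))))
                    out.push_back({dx, dy, dz});
    return out;
}
```

Visiting only these 13 neighbors (plus the bin itself) while applying each force and its reaction halves the pair-wise interaction work, as described in Section 4.3.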

5. Comparison between different implementations

A numerical study was conducted to evaluate the effectiveness of the various code versions using the GNU C compiler on the Lenovo NeXtScale nx360m5 nodes of the Savio system at UC Berkeley, featuring two Intel Xeon Haswell processors, each with 12 cores @ 2.3 GHz. Each core contains two 512-bit-wide vector processing units, four hardware threads, and 256 KB of L1 cache, while every four cores share an 8 MB L2 cache. The primary objective of this experiment was to assess the strong and weak scaling of the MPI implementations and to investigate the performance gain achieved by employing 𝑂(𝑛) algorithms and parallel techniques.

5.1. Original implementation of LS-DEM

The LS-DEM code developed by Kawamoto et al. (2016) is implemented using a hybrid MPI+OpenMP architecture and does not utilize binning or tree algorithms to reduce the computational complexity. Instead, it loops through all grains and relies on OpenMP directives to parallelize the computations. The load balance is improved by using a dynamic loop schedule, and each processor receives grain updates through the collective function MPI_Allreduce(). Although the LS-DEM code employs collective communication and the identifier MPI_IN_PLACE to reduce memory demands, the algorithm requires 𝑂(𝑛²) computational complexity and global all-to-all communications involving all grains. While this implementation is fast enough for most applications, it does have significant limitations, including the need for each processor to own a copy of all data, leading to a large memory footprint that limits the number of grains that can be simulated. Additionally, the code formulation is not protected against false sharing or data races, which can occur when two grains simultaneously interact with the same grain and compete to write to its memory.

Despite these limitations, this LS-DEM code can still model a triaxial compression test on roughly 60,000 grains within one day. It is worth noting that LS-DEM's contact algorithm has constant time complexity with respect to grid resolution. However, the large memory footprint may lead to increased computation time due to cache misses, and it can limit the number of grains that can be simulated. For instance, a single 40 × 40 × 40 reconstructed avatar requires storing 64,000 LS values.

5.2. Two binning algorithm implementations

Figs. 5 and 6 illustrate the flowcharts for the two parallel implementations, namely the naive binning algorithm and the binning algorithm implemented in this work. Both implementations consist of four major flow components: memory management, interaction computation, border/halo communication, and across-block migration. These components are not equally important and can occur at different times during the simulation. The computational complexity of both implementations is reduced from 𝑂(𝑛²) to 𝑂(𝑛). Bins in each sub-domain are isolated from other processors and updated using the information available in the host sub-domain, and hence do not participate in the message passing phases. Furthermore, both implementations eliminate false sharing and data race issues by ensuring that each grain is associated with and managed by only one processor at any given time. Our implementation successfully simulated a problem size of approximately 50,000 grains while utilizing only around 5% (27 cores vs. 480 cores) of the computational resources required by the original LS-DEM code. Moreover, our implementation consumed a comparable amount of time (24 h vs. 17 h) for the simulation.

5.3. Border/Ghost communication algorithm

At the start of each time step, the sub-domains' border/halo layers communicate with their respective neighbors. In general, information exchange happens through six surfaces, twelve edges, and eight vertices. However, this step can be simplified and reduced to three sequential steps using the blocking, synchronized MPI_Sendrecv() function. This function is designed to prevent deadlocks and is well-suited for simultaneous message sending and receiving. The border/ghost communication algorithm is illustrated in two dimensions (Figs. 7 and 8), and the 3D algorithm is similar. Two steps are sufficient for a two-dimensional border/ghost communication.

For example, consider a two-dimensional computational domain divided into six sub-domains, each containing a 3 × 3 bin matrix (shown in gray in the plot). Each sub-domain also maintains a copy of the remote processors' boundary bins to form a layer of ghost bins that wraps the original 3 × 3 bin matrix and forms a 5 × 5 matrix. The adopted algorithm shows that two sequential calls of MPI_Sendrecv() are enough to update all ghost bins, and only three sequential calls are required for a three-dimensional simulation.

During the first border/halo communication step (Fig. 7), each processor sends its rightmost non-boundary bins (marked as type-1 bins) to update the left boundary bins of its right neighbor. Simultaneously, the current processor receives a message from its left neighbor and updates its own left boundary bins. MPI_Sendrecv() synchronizes the processors and completes this procedure simultaneously. Similarly, each processor sends its leftmost non-boundary bins (marked as type-2 bins) to update its left neighbor's right boundary bins; the current processor then receives a message from its right neighbor and updates its right boundary bins. The code handles the edge cases where left or right neighbors do not exist by organizing the processors into a Cartesian grid using MPI_Dims_create(), MPI_Cart_create(), and MPI_Cart_shift().

During the second border/ghost communication step (Fig. 8), each processor sends its topmost non-boundary bins (marked as type-3 bins) to update the bottom boundary bins of its top neighbor. Simultaneously, the current processor accepts a message from its bottom neighbor and updates its bottom boundary bins. Once these two sequential steps are completed, all the corner bins in the ghost area are updated. For three-dimensional simulations, a similar algorithm is employed, but with three steps instead of two: the left and right ghost bins are updated first, then the back and front ghost bins, and finally the top and bottom ghost bins. An alternative approach for updating the corner bins could involve sending them to the left and right neighbors in one step and having those neighbors re-direct the corner bins to the top and bottom neighbors in the second step.
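The corner-filling behavior of the two-step exchange can be sketched serially; the Block type and the 2 × 2 layout below are our own illustration, not the paper's MPI code. The key point is that step 2 forwards rows that already contain the ghost columns filled in step 1, so diagonal (corner) neighbors are reached without any dedicated diagonal message.

```cpp
#include <vector>

// Serial mock of the two-step 2D border/ghost exchange. Each block stores an
// (n+2)x(n+2) grid: interior cells 1..n plus a one-cell ghost layer (-1).
struct Block {
    int n;
    std::vector<std::vector<int>> cell;
    Block(int n, int fill) : n(n), cell(n + 2, std::vector<int>(n + 2, -1)) {
        for (int i = 1; i <= n; ++i)
            for (int j = 1; j <= n; ++j) cell[i][j] = fill;
    }
};

// blocks[r][c] is a 2x2 grid of sub-domains, block row 0 on top.
void exchange(std::vector<std::vector<Block>>& blocks) {
    int n = blocks[0][0].n;
    for (int r = 0; r < 2; ++r) {            // step 1: left <-> right columns
        Block& L = blocks[r][0];
        Block& R = blocks[r][1];
        for (int i = 1; i <= n; ++i) {
            L.cell[i][n + 1] = R.cell[i][1]; // neighbor's leftmost interior
            R.cell[i][0] = L.cell[i][n];     // neighbor's rightmost interior
        }
    }
    for (int c = 0; c < 2; ++c) {            // step 2: top <-> bottom rows
        Block& T = blocks[0][c];
        Block& B = blocks[1][c];
        for (int j = 0; j <= n + 1; ++j) {   // ghost columns are included,
            T.cell[n + 1][j] = B.cell[1][j]; // which is what fills corners
            B.cell[0][j] = T.cell[n][j];
        }
    }
}
```

After the two steps, the corner ghost cell of each block holds the value of its diagonal neighbor, mirroring how two MPI_Sendrecv() sweeps (three in 3D) cover all 8 (26) neighbor directions.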


Fig. 5. Flowchart of the naive binning algorithm where bins are the basic elements in the iteration.

Fig. 6. Flowchart of the binning algorithm where the grains are the basic elements in the iteration.

5.4. Iterate over bins vs. iterate over grains

The computation stage involves calculating the interactions between pairs of grains. The conventional approach is to iterate through the bins and compute the interactions between the grains residing in them, starting with those in the same bin and then progressing to those in neighboring bins. However, this method is simple but inefficient, since bins may be empty. To overcome this, we have introduced a novel force resolution strategy that iterates over the grains, providing the optimal amount of work. The linked lists are used to map the grain-to-grain and grain-to-bin relationships with a time complexity of 𝑂(1).
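A minimal sketch of iterating over grains rather than bins is shown below; the array names follow Section 4.4, while the pair-collection helper and the restriction to same-bin partners (neighbor bins are omitted for brevity) are our own simplifications. Each grain finds its bin in 𝑂(1) via binList and walks the chain in grainList, and the lower-index rule from Section 4.3 ensures every pair is resolved exactly once.

```cpp
#include <utility>
#include <vector>

// bins[b]: head grain of bin b; binList[g]: bin of grain g;
// grainList[g]: next grain stored in the same bin (-1 terminates the chain).
struct Cells {
    std::vector<int> bins, binList, grainList;
};

// Collect each same-bin grain pair once by iterating over the grains and
// keeping a pair only when the current grain has the lower index.
std::vector<std::pair<int, int>> sameBinPairs(const Cells& c) {
    std::vector<std::pair<int, int>> pairs;
    for (int g = 0; g < static_cast<int>(c.binList.size()); ++g) {
        if (c.binList[g] < 0) continue; // grain not assigned to any bin
        for (int h = c.bins[c.binList[g]]; h != -1; h = c.grainList[h])
            if (g < h) pairs.push_back({g, h});
    }
    return pairs;
}
```

Because the loop runs over grains, empty bins cost nothing, which is the source of the linear scaling claimed above.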


Fig. 7. Border/ghost communication step one, exchange between left and right layers.

Fig. 8. Border/ghost communication step two, exchange between top and bottom layers.

By iterating over the grains, the strategy significantly reduces the amount of analysis by an order of magnitude. The resulting code scales linearly, and the computations are directly proportional to the number of grains. On the other hand, the naive implementation requires iterating through all the bins to locate the target grain whenever it is queried, leading to computational inefficiency and redundancy.

Moreover, when a grain is about to migrate to a new bin, it is assigned a new bin ID that differs from the current one. However, this operation cannot be executed immediately, since all bins need to be checked first, as the grain could move to unchecked bins and require recalculation. Both implementations that use the data structure abstraction need to consider additional interactions from ghost grains, since a sub-domain does not have ambient information about the clustered system.

5.5. Implementation of the across-block migration

At the third step, the movement of a grain is updated, and it may migrate across blocks at the fourth step, which is implemented using the same algorithm in both the naive binning algorithm and the implementation using the data structure abstraction. This step can be computationally expensive, especially if the history-dependent tangential contact model is applied, as the contact history must also be carried along with the migrated grain. However, by packing the necessary information, such as the grain's translational velocity, angular velocity, mass center location, and rotation, collectively, the communication latency can be reduced. To distribute migrated grains to other processors, collective operations like MPI_Alltoall() and MPI_Alltoallv() are commonly used. If the time step is small enough, the fundamental MPI_Send() and MPI_Recv() functions can be used with only the 26 neighboring processors. However, this approach cannot be generalized, and using the fundamental MPI_Send() and MPI_Recv() in nested loops may lead to deadlocks. The across-block migration approach is illustrated in Figs. 9 and 10, where the common quantities are packed and communicated first, followed by individual communication of grain-specific information such as shear forces, normal shear direction, and contact grain ID.

In this process, a packer is the container storing the necessary grain information, and the underlying data structure of the packer is:

vector<vector<basic_grain_property>> packer(num_of_proc);

Each grain in the system has six basic properties, namely its ID, the number of nodes it has, its position, quaternion, translational velocity, and angular velocity. When a grain migrates from its current sub-domain to another sub-domain, its information is stored in the corresponding packer[new_proc_id]. The contact history is maintained in three containers.
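Before listing the three history containers, here is a sketch of how the per-destination packer maps onto an all-to-all exchange; the struct fields and the flatten() helper are our own naming, not the paper's code. Flattening per-rank vectors into a contiguous buffer plus per-rank counts and displacements is exactly the send-side shape that MPI_Alltoallv() expects.

```cpp
#include <vector>

// Six basic properties of a grain, as listed above (field names are ours).
struct BasicGrainProperty {
    int id, numNodes;
    double position[3], quaternion[4], velocity[3], angularVelocity[3];
};

struct FlatBuffer {
    std::vector<BasicGrainProperty> data; // contiguous send buffer
    std::vector<int> counts, displs;      // per-destination count and offset
};

// Flatten packer[rank] vectors into one buffer with counts/displacements.
FlatBuffer flatten(const std::vector<std::vector<BasicGrainProperty>>& packer) {
    FlatBuffer out;
    out.counts.resize(packer.size());
    out.displs.resize(packer.size());
    int offset = 0;
    for (std::size_t r = 0; r < packer.size(); ++r) {
        out.counts[r] = static_cast<int>(packer[r].size());
        out.displs[r] = offset;
        offset += out.counts[r];
        out.data.insert(out.data.end(), packer[r].begin(), packer[r].end());
    }
    return out;
}
```

In an actual MPI run, the counts and displacements would be passed as the send-side arguments of MPI_Alltoallv(), with a matching derived datatype describing BasicGrainProperty.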


vector<vector<Eigen::Vector3d>> packerNodeShears(num_of_proc);

The first quantity stored in a contact history is the vector of tangential forces on each node. As the number of nodes seeded on a grain is not constant, the length of this vector changes accordingly. The second quantity is the ID of the grain that a node is in contact with, and it is stored separately:

vector<vector<int>> packerNodeContact(num_of_proc);

The third quantity is the unit normal direction of a node in the principal body frame of the last time step. This is stored as:

vector<vector<Eigen::Vector3d>> packerNodeNormals(num_of_proc);

MPI collective functions, such as MPI_Alltoall() and MPI_Alltoallv(), are used to facilitate the data communication, allowing each processor to send distinct data to all other processors. The data is received and placed in the appropriate buffer on the receiving processor. MPI_Alltoallv() offers more flexibility, as it allows specific send and receive buffer displacements to be specified. In effect, the all-to-all operation performed by these functions is equivalent to a matrix transpose operation (see Fig. 11).

Fig. 9. Illustration of data containers for the basic grain information in across-block migration.

Fig. 10. Illustration of data containers for the contact history in across-block migration.

Fig. 11. Illustration of the mechanism of MPI_Alltoall().

6. Numerical experiments

6.1. Numerical experiments on a small dataset

We conducted a series of small numerical experiments to investigate communication overheads in small-sized problems. The problem sizes were multiples of 74, obtained by duplicating existing avatars with shifted positions to avoid overlap. The computational domain in each test was cubic and constructed using 8 copies of the existing 74 grains shaped into a 200 × 200 × 200 domain, resulting in a total of 592 grains. The problem setting was straightforward: domain boundaries were modeled as undeformed planes, and grains were not allowed to leave the domain and would bounce back if they tried to do so. Grains were subjected to a random force at each time step for simplicity and work balance. The bookkeeping and adaptive binning were disabled during performance benchmarking.

Fig. 12 displays the performance speed-up with a varying number of processors. In all cases, the speed-up was measured relative to a serial run, equivalent to an MPI simulation with a 1 × 1 × 1 decomposition. Each processor works on its smaller region and maps to the closest physical memory to maximize memory bandwidth usage and avoid bandwidth limitations. Therefore, a well-balanced MPI implementation maintains high efficiency and low communication costs. However, beyond a certain point, more MPI processors increase communication overheads and stalls, leading to increased MPI traffic that deteriorates scalability. Furthermore, increasing the domain granularity increases the area of the inter-processor interfaces through which border layers are exchanged, and more ghost grains are cloned, creating extra workload. An evenly distributed workload and a suitable domain granularity help to hide the communication overhead; i.e., if the major tasks are completed later than the communication, a good linear speed-up is expected, as the cost of the communication latency is amortized. Generally, increasing the number of grains brings the scalability closer to the idealized speed-up. This is because the CPU is saturated, and the portion of simulation time spent on contact detection, force resolution, and grain updates is more significant than the communication time. This also explains why larger test samples perform better than their smaller counterparts. However, with an increasing number of processors, the MPI traffic due to data migration and the increased synchronization times could add up and dominate. Increasing the number of processors is analogous to reducing the problem size, and a single parameter, defined as the number of grains associated with one processor, is more suitable for optimizing parallel execution.

The data presented in Fig. 13 indicate that increasing the number of processors decreases the run time, resulting in considerable speed-up when the work is evenly distributed. The plot also shows that using parallel techniques reduces the time needed to complete 1,000 time steps from 852 seconds with a single processor to only 20 seconds. However, the running time does not decrease monotonically with the number of processors, as the domain granularity and the bin size also play a role. For example, simulating 1998 grains in a 300 × 300 × 300 domain using 2 processors and a bin size of 100 results in a division of the computational domain into 100 × 300 × 300 and 200 × 300 × 300. This division only saves one third of the computation time, taking 256 seconds with 2 processors versus 356 seconds with a single processor.

In this application, the bin size was chosen to be at least the largest equivalent grain diameter of the specimen. This choice implies that the


bin size is the minimum scale at which the computational domain can be divided. Consequently, using more computational resources does not lead to indefinite parallel performance gains. For example, it is impossible to parallelize 74 grains in a 100 × 100 × 100 domain with a bin size of 100. Similarly, the use of 8, 12, or 16 processors provides the same running time for 592 grains in a 200 × 200 × 200 domain.

Figs. 12 and 13 also provide evidence of the 𝑂(𝑛) computational complexity of the domain decomposition, although the relationship is not perfectly linear due to several factors. The binning algorithm assumes that grain interactions occur only when two grains are sufficiently close to each other, and the contact detection step acts as a filter to eliminate unnecessary calculations. However, the computational cost depends on both the geometry and the spatial distribution of the grain assembly, which can vary greatly. Additionally, the force interaction phase is distributed into tiered steps, which minimizes the amount of calculation but also introduces nonlinearity.

Fig. 12. Comparison between the actual speed-up and idealized speed-up for a series of small numerical experiments; a study of strong scalability.

Fig. 13. Relationship between the run time and the number of processors for a series of small numerical experiments; a study of weak scalability.

Fig. 14. Run time fractions for simulations with 4763 grains as a function of the number of processors.

Fig. 14 displays the breakdown of the run time into four categories: contact detection and force resolution, MPI border/halo exchange, grain migration, and other computations such as numerical integration and grain–wall interactions. The simulations with 4763 grains were chosen for analysis, as this was a sufficient problem size to determine whether a processor was saturated or dominated by communication overheads. The majority of the computational resources were used for contact resolution, which accounted for over 90% of the run time. The border/halo exchange and grain migration components were influenced by factors such as the bin size, the domain granularity, the external forces, the boundary conditions, and the grain densities. However, the run time for both MPI communication phases was insignificant compared to the force interaction computations due to the small magnitude of the external forces in this example. The time taken for synchronization has been included implicitly.

Let us consider the two sections of the code where MPI communication occurs. The border/halo exchange phase is synchronized, since all processors are synchronized at the beginning of a new time step. For synchronization in the across-block migration, barriers are set. The work imbalance appears to arise during the force resolution phase, despite the fact that the MPI_Alltoall() and MPI_Alltoallv() functions are highly optimized and capable of handling peculiar situations, such as transmitting a zero-length message. The main reason is that the MPI collective operations are blocking, meaning that faster processors must wait until all processors have completed the previous phase before proceeding to the next. To address this issue, OpenMP directives are integrated into the code, because the work imbalance can be alleviated in the shared-memory implementation. Thus, hybrid parallelism has the potential to achieve the best of both worlds: fully utilizing the available computing resources and minimizing the work imbalance via a dynamic load scheme.

The breakdown of the runtime may be misleading, since the time for the force calculation is exaggerated due to work imbalance, leading to an increased percentage of force calculation. In this regard, Amdahl's law (Rodgers, 1985) does not consider work imbalance and communication synchronization and assumes that all processors compute for the same amount of time, implying perfect load balancing. When some processors take longer than others, the speed-up decreases, resulting in a proportionately larger serial fraction. Second, Amdahl's law does not include a term for the overhead of synchronizing processors, which increases monotonically with the number of processors. A better metric is the serial fraction, 𝑓, proposed by Karp and Flatt (1990), which is used here to investigate the effect of work imbalance due to the increasing overhead. The increasing overhead reduces the speed-up, and an increasing 𝑓 indicates that the granularity is too fine.

The speed-up measured by running the same program on a varying number of processors is defined as:

𝑠 = 𝑇(1) ∕ 𝑇(𝑝)  (2)

where 𝑇(1) is the elapsed time with 1 processor. The issue of efficiency is related to price/performance and is usually defined as:

𝑒 = 𝑇(1) ∕ (𝑝𝑇(𝑝)) = 𝑠 ∕ 𝑝  (3)

Consider Amdahl's law, which in its simplest form states:

𝑇(𝑝) = 𝑇𝑠 + 𝑇𝑝 ∕ 𝑝  (4)

where 𝑇𝑠 is the time taken by the portion that must be run serially and 𝑇𝑝 is the time in the parallel part. Then:

𝑇(1) = 𝑇𝑠 + 𝑇𝑝  (5)

If the serial fraction is defined as:

𝑓 = 𝑇𝑠 ∕ 𝑇(1)  (6)


Table 1
Serial fraction for simulations with 4763 grains.
Processors Time (s) Speed-up (𝑠) Efficiency (𝑒) Serial fraction (𝑓 )
1 852.359 1.000 1.000 –
2 459.581 1.855 0.927 0.078
4 234.757 3.631 0.908 0.034
8 118.234 7.209 0.901 0.016
16 59.269 14.381 0.899 0.008
32 35.175 24.232 0.757 0.010
64 20.198 42.200 0.659 0.008

and the Amdahl’s law can be re-written as:


𝑇 (1)(1 − 𝑓 )
𝑇 (𝑝) = 𝑇 (1)𝑓 + (7)
𝑝
or in terms of speed up 𝑠:
1 1−𝑓 Fig. 15. Corrected speed up relationship considering actual workload on processors,
=𝑓+ (8)
𝑠 𝑝 study of strong scalability.
The serial fraction can be solved as:
1 1
𝑠
− 𝑝
𝑓= (9)
1
1− 𝑝
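Eqs. (2) and (9) can be checked directly against the measurements in Table 1; the two helpers below are a sketch with our own function names.

```cpp
#include <cmath>

// Speed-up, Eq. (2): s = T(1) / T(p).
double speedup(double t1, double tp) { return t1 / tp; }

// Karp-Flatt serial fraction, Eq. (9): f = (1/s - 1/p) / (1 - 1/p).
double serialFraction(double s, int p) {
    return (1.0 / s - 1.0 / p) / (1.0 - 1.0 / p);
}
```

For 𝑝 = 2, using the times from Table 1 (852.359 s and 459.581 s), this reproduces the tabulated speed-up of 1.855 and serial fraction of 0.078; for 𝑝 = 4 it gives 3.631 and 0.034.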

Table 1 demonstrates how the work imbalance and synchronization affect the performance. The results indicate that when more than 16 processors were employed, the computational resources were not fully utilized, resulting in a significant amount of time being spent on communication overhead. The efficiency also experienced a sharp decline beyond 16 processors. The optimal performance is achieved when all processors have a balanced workload and a suitable domain granularity. Other hardware characteristics, such as cache and memory bandwidth, also impact the overhead.

Fig. 16. Simulation times for problems of different sizes with a fixed workload per processor; a study of weak scalability.

6.2. Limitations (strong scalability) of domain decomposition strategy

High-performance computing is commonly assessed for scalability through two notions: strong scalability and weak scalability. Strong scalability refers to the variation of solution time as a function of the number of processors for a fixed problem size. To evaluate the scalability of our code, we conducted numerical tests simulating a large number of grains. Specifically, we constructed 42,684 grains from the relatively low-resolution images (13–15 μm/pixel), resulting in substantially more avatars and smaller morphology files for each. The surface of each grain was discretized into an average of 279 nodes, and the total amount of computer memory used for this test was 2.5 GB. The grains were subjected to random forces with a mean magnitude about 10 times gravity to induce grain movement. The entire computational domain was 600 × 600 × 600, with the largest possible grain radius of 40 and a bin size of 50.

To minimize the communication area between adjacent processors, we partitioned the computational domain. Although communication time tends to increase as the number of processors increases, it remained below 1% in all runs. However, work imbalance was observed due to different sub-domains interacting with different numbers of ghost bins, depending on their location within the domain. This work imbalance was more severe with larger bin sizes and more processors, as shown in Table 2. The relationship between the total workload and the number of bins is not proportional due to the convoluted and nonlinear contact detection and force resolution processes. Nonetheless, we can assume that this relationship is linear due to the large and dense specimen. Additionally, not all ghost bins are visited equally frequently, as we avoid half of the computation redundancy by accepting that the forces between contacting pairs of grains match in magnitude and oppose each other. To estimate the number of visited ghost bins, we assume that half of the ghost bins are visited (α = 1/2) and that no sub-domain is completely wrapped by neighbors. In this case, we use another multiplier β = 1/13 ∼ 9/13 to account for the determination of α, which is challenging.

In addition to the interactions from the inner bins to the ghost bins, reciprocal interactions from the ghost bins to the inner bins also exist. This is due to the fact that only 13 out of 26 neighboring bins of an inner bin are considered for force resolution. Hence, to achieve completeness, interactions from the bottom ghost bins need to be incorporated when considering interactions between inner bins and their upper neighbors. The number of bins that a ghost bin can affect ranges from a minimum of 1 out of 13 in the corner to a maximum of 9 out of 13 when the ghost bin is located below the bottom and near the center. The measured speed-up shown in Fig. 15 is close to the average speed-up. Super-linearity, which is not unusual in parallel code benchmarking, is observed, as the parallel efficiency is greater than one. This is due to the performance gain from increasing the total cache size being greater than the communication overhead, especially when a small number of processors is used. Despite the inherent load imbalance and the additional workload introduced by the domain decomposition, the communication overhead associated with our code is minimal. It should be noted that domain decomposition is a widely accepted and effective parallel strategy for many problems beyond the scope of DEM modeling.

6.3. Weak scalability of domain decomposition strategy

To evaluate the weak scalability of the code, we conducted a series of numerical experiments by partitioning the domain into identical sub-domains and selecting the number of processors such as to ensure uniform decomposition of the domain. The bin size was set at 50 due to

P. Tan and N. Sitar Computers and Geotechnics 172 (2024) 106408

Table 2
Run time breakdown for simulations of 42,684 grains in a 600 × 600 × 600 domain, with bin size 50; simulations ran for 1000 steps.
Processors   T_force (s)   T_border (s)   T_migrate (s)   T_update (s)   Theoretical bins   Actual bins   Difference
1 67 083.43 0.00 0.00 82.30 1728 1728 0
2 32 727.49 23.74 0.95 51.41 864 1008 144
4 17 836.61 22.54 1.14 33.39 432 588 156
6 14 422.88 21.09 0.94 41.72 288 504 216
8 12 661.43 20.17 0.36 36.25 216 343 127
12 9261.44 16.47 1.13 37.68 144 294 150
16 6525.09 13.68 1.00 22.58 108 245 137
18 8814.76 15.02 1.21 27.10 96 252 156
24 7118.26 15.11 1.33 26.13 72 210 138
27 4896.77 18.87 1.54 20.67 64 216 152
32 4940.85 20.04 2.68 22.46 54 175 121
36 5545.15 13.70 2.35 19.05 48 180 132
48 4495.77 12.45 2.67 19.29 36 150 114
54 4185.16 16.18 1.17 17.90 32 144 112
64 3900.41 17.82 1.71 17.66 27 125 98
72 3431.75 11.82 3.34 27.36 24 120 96
96 3218.85 11.96 4.36 14.36 18 100 82
108 3486.49 11.32 2.36 13.04 16 96 80
144 3135.79 15.74 4.41 12.06 12 80 68
216 2758.50 13.19 6.29 10.02 8 64 56

Table 3
Domain granularity parameters for studies of weak scalability.
Grains   Domain size   Processors   Theoretical #Bins   Actual #Bins   Simulation time (s)
Series #1 sub-domain size: 150 × 150 × 150, 481 grains/processor
481 150 × 150 × 150 1 27 27 195.95
3848 300 × 300 × 300 8 27 64 321.87
12987 450 × 450 × 450 27 27 125 360.88
30784 600 × 600 × 600 64 27 125 386.64
60125 750 × 750 × 750 125 27 125 404.79
103896 900 × 900 × 900 216 27 125 399.62
164983 1050 × 1050 × 1050 343 27 125 398.47
246272 1200 × 1200 × 1200 512 27 125 410.13
Series #2 sub-domain size: 200 × 200 × 200, 1328 grains/processor
1328 200 × 200 × 200 1 64 64 896.66
10624 400 × 400 × 400 8 64 125 955.22
35856 600 × 600 × 600 27 64 216 1081.27
84992 800 × 800 × 800 64 64 216 1185.02
166000 1000 × 1000 × 1000 125 64 216 1139.87
286848 1200 × 1200 × 1200 216 64 216 1133.78
455504 1400 × 1400 × 1400 343 64 216 1155.67
679936 1600 × 1600 × 1600 512 64 216 1172.96
Series #3 sub-domain size: 200 × 200 × 200, 1562 grains/processor
12494 400 × 400 × 400 8 64 125 2501.74
99952 800 × 800 × 800 64 64 216 2755.65
337338 1200 × 1200 × 1200 216 64 216 2984.45
799616 1600 × 1600 × 1600 512 64 216 2991.20

the maximum possible grain radius ranging between 30 and 40 for all simulations, and each simulation ran for 1,000 time steps. The results of the numerical experiments are presented in Table 3 and plotted in Fig. 16.

The plots in Fig. 16 show that the serial implementation requires less time to complete, as it does not involve additional ghost bin interactions, and the simulations with 8 processors exhibit slightly superior performance for similar reasons. However, for all other domain decompositions utilizing 27 or more processors, the computation times are similar and display a slight increase with the increasing number of processors due to communication overheads. These findings suggest that weak scalability is upheld, even though strong scalability is not achievable when the bin size is significant compared to the simulation geometry.

7. Experimental code validation

The primary goal of the numerical modeling effort was to be able to simulate the behavior of naturally deposited sands at high resolution with large numbers of particles. Illustrative example LS-DEM simulations and a comparison with experimental results are presented in Fig. 17. The vacuum triaxial test was performed on a sample of naturally deposited sand as a part of a study of the mechanical behavior of naturally deposited sands using XRCT (Garcia et al., 2022). The physical sample was composed of approximately 120,000 particles. The LS-DEM simulation used a subset of 19,906 sand grains from a high resolution XRCT scan at 3 μm per voxel on a sample from the same deposit.

The grains were discretized with up to 5000 nodes per avatar and the avatars were placed in their respective locations within the sample, to preserve the observed packing and fabric. The same process was followed to create a subset of 4852 spherical particles with dimensions equivalent to the corresponding sand particles. Thus we maintained the same grain size distribution by mass, while the sample void ratio increased since the spherical particles cannot be directly packed to the same density. The coefficient of friction for inter-particle friction in the simulations was 0.6, corresponding to a friction angle of about 31 degrees, which is well within the accepted range typical of a quartz-rich sand (Mitchell and Soga, 2005). Most importantly, the particles were


free to rotate as the sample deformed and a shear band developed. The plots show that the simulation of the sand sample reproduces the high mobilized friction angle observed in the experiments, reflecting the depositional fabric of the sand and the angular nature of the particles. The volume change also matches the observed response, with initial contraction followed by dilation, as would be expected. The figure also shows renderings of the sample at the beginning and at the end of the triaxial test simulation exhibiting the development of a shear plane. In comparison, the simulation with spherical particles shows that the stress–strain response lacks the stiffness and the peak observed in the naturally deposited sample, while the peak mobilized friction angle reflects the interparticle friction angle. The corresponding volume change shows continuous contraction, as would be expected of a relatively loose, spherical particle assembly. The simulation with the 19,906 particles contained roughly 95 million points and took 50 h on 53 cores on an Intel Xeon Gold 6330 @ 2.0 GHz processor with 256 GB RAM.

Fig. 17. Mobilized friction angle and corresponding volume change from a vacuum triaxial test (Garcia et al., 2022) and the LS-DEM simulation with 19,906 high resolution avatars and equivalent spheres with interparticle friction coefficient of 0.6.

8. Conclusions

A C++ implementation of a parallel code for 3D LS-DEM to model granular materials of arbitrary shape has been developed, building on an existing LS-DEM framework developed by Kawamoto et al. (2016). To reduce the computational complexity from O(n²) to O(n), a binning algorithm was introduced. The new code utilizes a linked-list like data structure to map the relationship between bins and grains, and considers MPI communication in two major parts: border/halo exchange and across-block migration. The time complexity of the execution time, the communication time, and the parallel overhead of the new code were analyzed with respect to the amount of the computational resources and the problem size. The results indicate excellent weak scalability numerically and the potential for simulating large-scale DEM problems with complex-shaped grains.

We also conducted a comprehensive benchmarking study of our domain decomposition strategy, specifically designed for our application, by simulating problems with up to 800,000 grains, and we examined both strong and weak scalability. We have meticulously optimized various performance-critical aspects of the code to ensure communication overheads are kept to well below 1% of the total computing wall time, particularly in configurations in which each processor is saturated with a sufficient and balanced workload. In comparison with the original LS-DEM code, the new parallel implementation can simulate problems of comparable size while consuming less than 5% of the computing resources.

Our low communication overhead has resulted in near-perfect weak scalability, with the same computing time required for twice larger problems using twice the resources. Conversely, strong scalability rapidly deteriorates as the number of processors increases. This is due to the sub-domain located at the center of the entire computational space requiring more ghost bins from neighboring sub-domains to complete the halo layer exchange, force resolution, and grain migration than its edge and corner counterparts. In other words, if the bin size is significant compared to the total domain size, the domain decomposition strategy is inherently load imbalanced. However, if we calculate and consider the true workload of each sub-domain, the measured performance gain matches the expected value reasonably well. This issue is unique to our application of simulating a densely packed grain system in a quasi-static setting, where the bin size needs to be larger than the largest grain diameter in an assembly and not insignificant compared to the overall specimen configuration.

Finally, an important advantage of the MPI implementation is that the code can run on a wide variety of parallel systems, including shared-memory computers and clustered systems, thus achieving high portability. The new code also has the potential to achieve a negligible serial fraction and low parallel overhead when executed on modern multiprocessing supercomputers. An illustrative simulation of a triaxial test shows that the code is capable of simulating the deformation of a large, complex grain assembly in a reasonable amount of time. Future efforts to simulate large-deformation problems will require the implementation of an adaptive dynamic mesh or quad-tree algorithm to further exploit the computational speed offered by the parallel code.

CRediT authorship contribution statement

Peng Tan: Writing – original draft, Visualization, Validation, Software, Formal analysis. Nicholas Sitar: Writing – review & editing, Validation, Supervision, Project administration, Methodology, Funding acquisition, Formal analysis, Conceptualization.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgments

We thank Prof. José Andrade and Dr. Reid Kawamoto for sharing their LS-DEM code and for their assistance in mastering its contents. We also thank Dr. Michael Gardner and Dr. Estephan Garcia for their constructive advice in the early phases of the code development. The high resolution XRCT scans used in the simulations were provided by the TESCAN XRE Demo Lab in Ghent, Belgium. Hasitha Wijesuriya generously assisted in performing the triaxial test simulations on Savio HPC at UC Berkeley.

Funding for this research was provided by the National Science Foundation, United States under grant CMMI-1853056, the Edward G. and John R. Cahill Chair, the Berkeley-France Fund, the US Geological Survey Cooperative Agreement G17AC00443, and the Pacific Earthquake Engineering Center (PEER), solicitation TSRP 2018-01. The


opinions, findings, and conclusions expressed in this publication are those of the authors and do not necessarily reflect the views of the study sponsors: the National Science Foundation (NSF), the Pacific Earthquake Engineering Research Center (PEER), the Regents of the University of California, or the US Geological Survey.

Appendix A. Normal contact force resolution

Contact between grains is determined by comparing each node of a master grain to a slave grain for penetration. By embedding the grain in a three-dimensional Cartesian grid with a value indicating the signed distance to the nearest grain surface, the grain surface is implicitly defined by the set of nodes with zero LS value. This framework is quite convenient as it is amenable to calculating the forces between grains with the commonly used penalty-based method. As shown in Fig. A.18, d_k^{j,i} denotes the scalar penetration of the kth node on grain i into the geometry of grain j; φ_j is the LS function of grain j; p_k^j is the position of the kth node on grain i considered in grain j's coordinates; n̂_k^{j,i} is the unit normal direction of the penetration d_k^{j,i}. The amount of penetration can be computed through interpolation from grid values near p; any order of interpolation can be used, and linear interpolation was used here for simplicity and speed.

Fig. A.18. (a) Illustration of two contacting grains, where d_k^{j,i} denotes the scalar penetration of the kth node on grain i into the geometry of grain j, and n̂_k^{j,i} is the unit normal of the penetration d_k^{j,i}. (b) Contact forces between two grains are different. After Kawamoto et al. (2016).

Note that one property of a LS built on the signed distance function is that its gradient has unit magnitude. However, due to the LS function's discrete nature, the magnitude of ∇φ_j(p_k^j) is very close but not equal to unity, and therefore it is normalized. If at least one node p_k^j of master grain i is penetrating a slave grain j, then the two grains are considered to be in contact and inter-grain forces are computed. This process is detailed in Algorithm 1:

Algorithm 1 findPenetrationDirection
INPUT: grain, point P
OUTPUT: flag, penetration depth d, contact normal in principal frame ñ
/* extract LS values near P; Px, Py, Pz are the coordinates of P */
x0 = floor(Px), y0 = floor(Py), z0 = floor(Pz)
x1 = ceil(Px), y1 = ceil(Py), z1 = ceil(Pz)
/* function getGridValue looks up the LS table to extract a value */
P_ijk = getGridValue(x_i, y_j, z_k), where i, j, k = 0, 1
/* find penetration d via trilinear interpolation */
P_x = P100 − P000, P_y = P010 − P000, P_z = P001 − P000
P_xy = −P_x − P010 + P110
P_xz = −P_x − P001 + P101
P_yz = −P_y − P001 + P011
P_xyz = P_xy − P001 − P101 − P011 + P111
Δx = Px − x0, Δy = Py − y0, Δz = Pz − z0
d = P000 + P_x·Δx + P_y·Δy + P_z·Δz + P_xy·Δx·Δy + P_xz·Δx·Δz + P_yz·Δy·Δz + P_xyz·Δx·Δy·Δz
if d < 0 then
    flag = True
    /* the gradient of d gives the contact normal ñ, e.g. for the x component */
    ñ_x = P_x + P_xy·Δy + P_xz·Δz + P_xyz·Δy·Δz (y and z components follow analogously)
return flag, d, ñ

The current code adopts the linear elastic contact model. Thus, the normal contact force contributed from the node p_k^i on grain i is:

F^i_{n,k} = −k_n d_k^{j,i} n̂_k^{j,i}  if d_k^{j,i} < 0,  0 otherwise    (A.1)

Where k_n is the normal contact stiffness. By action and reaction, the contribution of the contact normal force F^j_{n,k} from the node p_k^i on grain j is:

F^j_{n,k} = −F^i_{n,k}    (A.2)

The moment M^i_{n,k} contributed by the normal contact force F^i_{n,k} at the node p_k^i on grain i is:

M^i_{n,k} = (p_k^i − c_i) × F^i_{n,k}    (A.3)

Where c_i is the centroid of grain i. Similarly, the moment M^j_{n,k} contributed by the normal contact force F^j_{n,k} at the node p_k^i on grain j is:

M^j_{n,k} = (p_k^i − c_j) × F^j_{n,k}    (A.4)

It is important to keep in mind that the contact forces between two grains vary slightly depending on which grain is selected as the master grain. This is because the kth node on master grain i might penetrate the slave grain j, while there does not exist a corresponding node on the slave grain j penetrating the master grain i, due to the discrete nature of the LS geometry representation. This does not influence the serial implementation, as the force resolution phase is always iterated from a small index to a large index. Nevertheless, the index order will change after adding or deleting migrated grains from bins, thus we always consider the grain with the smaller index as the master grain in force resolution in the parallel implementation.

To maximize efficiency, each grain is stored with the center of mass at the origin and the axes aligned with the inertia primary axes,


resulting in a diagonal inertia tensor that simplifies many calculations. As a result of the rigid body assumption, the grain's LS function is never altered. When contact is computed, the nodes p_k^i of grain i are temporarily relocated into the reference configuration of grain j's LS function. The contact forces and the moments are then determined (in the reference configuration of grain j) and translated back to the global frame.

Appendix B. Tangential/traction force resolution

The original LS-DEM code (Kawamoto et al., 2016) uses a history-dependent Coulomb friction model similar to those in Cundall and Strack (1979). This model requires that contact histories accompany migrating grains because the tangential displacements are computed in increments until the two objects are separated. Consequently, the grains must be considered individually for cross-block migration. Typically, a history-dependent tangential model is required to simulate physical experiments, since highly simplified contact models are incapable of accurately capturing the physical properties of frictional granular material because they do not account for shear history and do not simulate non-linearity. While the Coulomb friction model is the simplest, more sophisticated models incorporate the rate of shearing, and incorporating such models may result in improved results.

For a given node p_k^i, frictional forces and the related moments only exist if F^i_{n,k} ≠ 0. The relative velocity v_k of node p_k^i with respect to grain j is:

v_k = v_i + ω_i × (p_k^i − c_i) − v_j − ω_j × (p_k^i − c_j)    (B.1)

Where v_i, v_j, ω_i, ω_j are the translational and angular velocities of grain i and grain j. The incremental shear displacement Δs_k is then:

Δs_k = [v_k − (v_k · n̂_k^{j,i}) n̂_k^{j,i}] Δt    (B.2)

The shear force F^i_{s,k} on grain i contributed by node p_k^i is updated as such:

F^i_{s,k} = Z F^i_{s,k} − k_s Δs_k    (B.3)

Where Z is the rotation operation that rotates the normal vector n̂_k^{j,i} at the previous time step to the normal vector at the current time step, and k_s is the shear contact stiffness. This step is necessary because the relative orientation of the two grains changes between time steps. In the code presented herein, we use Rodrigues' rotation formula (Murray et al., 2017):

v_rot = v cos θ + (1 − cos θ)(k · v) k + sin θ (k × v)    (B.4)

Where θ is the angle between the vector of interest at the two time steps, and k is the cross product of the normal vectors at the current and previous time steps. The Coulomb friction law dictates that F^i_{s,k} be capped at a fraction of the normal force F^i_{n,k}:

F^i_{s,k} = min(‖F^i_{s,k}‖, μ‖F^i_{n,k}‖) F^i_{s,k}/‖F^i_{s,k}‖    (B.5)

Where μ is the inter-grain friction coefficient. By action and reaction:

F^j_{s,k} = −F^i_{s,k}    (B.6)

The moment M^i_{s,k} contributed by node p_k^i's shear force on grain i is:

M^i_{s,k} = (m_k^i − c_i) × F^i_{s,k}    (B.7)

Similarly, the M^j_{s,k} contributed by node p_k^i's shear force on grain j is:

M^j_{s,k} = (m_k^i − c_j) × F^j_{s,k}    (B.8)

In the end, the total contact force on grain i is found by summing all nodal contact forces:

F^i_rot = Σ_{k=1}^{N} (F^i_{n,k} + F^i_{s,k})    (B.9)

Where N is the number of nodes on grain i. By action and reaction:

F^j_rot = −F^i_rot    (B.10)

The total contact moment on each grain is found by summing all nodal contact moments:

M^i_rot = Σ_{k=1}^{N} (M^i_{n,k} + M^i_{s,k}),   M^j_rot = Σ_{k=1}^{N} (M^j_{n,k} + M^j_{s,k})    (B.11)

The implementation of the grain interaction model is shown in Algorithm 2 below.

Algorithm 2 findInterGrainForceMoment
INPUT: grain A, grain B
OUTPUT: grainForce F, grainMoment M
for int i = 0; i < num_nodes; i = i + 1 do
    u_AB = u_A − u_B, where u_A is the mass center of grain A
    if |u_AB| < r_A, with r_A = max_k {|u_A^k|}, where u_A^k is the vector from the kth node of grain A to the mass center then
        /* rotate u_A^i into the principal frame of grain B's LS grid */
        ũ_A^i = R^T u_A^i + c_B, where R is the operator rotating a vector from the principal frame to the global frame
        /* check whether ũ_A^i penetrates into grain B */
        flag, d^i, ñ^i = findPenetrationDirection(ũ_A^i, grain B)
        if flag then
            n^i = R ñ^i
            f_n^i = k_n · d^i · n^i
            F = F + f_n^i, M = M + u_A^i × f_n^i
            v_AB^i = v_A − v_B + ω_A × u_A^i − ω_B × u_B^i
            f_t^{t+Δt,i} = Z f_t^{t,i} − v_AB^i · k_t · Δt, where Z is the operator rotating the past shear force direction to the current direction
            /* check that Coulomb's friction criterion is satisfied */
            f^i = min{|f_t^i|, μ|f_n^i|}
            f_t^i := f^i · f_t^i / |f_t^i|
            F = F + f_t^i, M = M + u_A^i × f_t^i
return F, M

Appendix C. Discrete equations of motion

The scheme to update the center of mass and nodes of each grain implemented by Kawamoto et al. (2016) was adapted from the work by Lim and Andrade (2014). The locations, forces, and velocities of the grains are known at the end of each time step, allowing the grain motion to be explicitly updated via Newton's law:

m a_i + C v_i = F_i    (C.1)

Where i = 1, 2, 3 in three dimensions, m is the mass of the grain, and C = ξm is the damping that proportionally scales the linear velocity v_i, with ξ being the global damping parameter. The linear acceleration is given by a_i and is related to the resultant force F_i. A centered finite-difference integration scheme is used to integrate the translational components of motion:

v_i^{n+1/2} = [(1 − ξΔt/2) v_i^{n−1/2} + (Δt/m) F_i] / (1 + ξΔt/2)    (C.2)

x_i^{n+1} = x_i^n + Δt v_i^{n+1/2}    (C.3)

This scheme is second-order explicit and conditionally stable, and there is a family of trapezoidal integration schemes in various forms. Based on the variable metric, the scheme becomes implicit and unconditionally stable, and the coupled system of equations is solved using an adaptive iterative scheme (Zohdi, 2003, 2004b, 2007, 2013).


For complex-shaped objects, the rotational components of motion must parameters (𝑞1 , 𝑞2 , 𝑞3 , 𝑞4 ) can be expressed in terms of the quaternions
also be integrated, and the time derivatives of the angular accelerations and the angular velocities.
in the principal frame are given by Euler’s equations of motion. 1
𝑞̇ 1 = (−𝑞3 𝜔𝑥 − 𝑞4 𝜔𝑦 + 𝑞2 𝜔𝑧 )
𝝎̇ = (𝑀 − 𝝎 × (𝐼𝝎) − 𝜉𝐼𝝎)∕𝐼 (C.4) 2
1
𝑞̇ 2 = (𝑞4 𝜔𝑥 − 𝑞3 𝜔𝑦 − 𝑞1 𝜔𝑧 )
Where 𝝎̇ is the angular acceleration, 𝝎 is the angular velocity, 𝐼 is 2 (C.12)
the (diagonal) moment of inertial tensor in the principal body-fixed 1
𝑞̇ 3 = (𝑞1 𝜔𝑥 + 𝑞2 𝜔𝑦 + 𝑞4 𝜔𝑧 )
frame, and 𝑀 is the torque vector in the principal body-fixed frame. 2
The Euler equations are nonlinear due to the presence of angular 1
𝑞̇ 4 = (−𝑞2 𝜔𝑥 + 𝑞1 𝜔𝑦 − 𝑞3 𝜔𝑧 )
velocities products on both sides. Therefore, to appropriately integrate 2
the rotational components of motion, a predictor–corrector procedure ∑ 4
𝑞𝑖2 = 1 (C.13)
is recommended: 𝑖=1

(1) Estimate the angular velocities at the current time step by as- Above equations can be solved explicitly for the quaternion values
suming constant angular acceleration for an additional half step. at the new time step in terms of the old values, and for the angular
velocities at the midpoint of the time step using time-centered finite
𝑛− 12 difference scheme.
′ 1 𝑛−1
𝜔𝑖 𝑛 = 𝜔𝑖 + 𝛥𝜔 (C.5)
2 𝑖
References
where =𝛥𝜔𝑛−1
𝑖 𝛼𝑖𝑛−1 𝛥𝑡
(2) Calculate angular velocity predictor by using the estimates as Amritkar, A., Deb, S., Tafti, D., 2014. Efficient parallel CFD-DEM simulations using
mentioned earlier. OpenMP. J. Comput. Phys. 256, 501–519. http://dx.doi.org/10.1016/j.jcp.2013.09.
′ ′ ′ ′ 007.
𝛥𝜔1𝑛 = 𝛥𝑡[𝑀1𝑛 + 𝜔2𝑛 𝜔3𝑛 (𝐼2 − 𝐼3 ) − 𝜉𝐼1 𝜔1𝑛 ]∕𝐼1
Angelidakis, V., Nadimi, S., Otsubo, M., Utili, S., 2021. CLUMP: a code library to
′𝑛 ′𝑛 ′𝑛 ′𝑛
𝛥𝜔2 = 𝛥𝑡[𝑀2𝑛 + 𝜔3 𝜔1 (𝐼3 − 𝐼1 ) − 𝜉𝐼2 𝜔2 ]∕𝐼2 (C.6) generate universal multi-sphere particles. SoftwareX 15, 100735.
′ ′ ′ ′ Baugh Jr., J.W., Konduri, R., 2001. Discrete element modelling on a cluster of
𝛥𝜔3𝑛 = 𝛥𝑡[𝑀3𝑛 + 𝜔1𝑛 𝜔2𝑛 (𝐼1 − 𝐼2 ) − 𝜉𝐼3 𝜔3𝑛 ]∕𝐼3 workstations. Eng. Comput. 17 (1), 1–15. http://dx.doi.org/10.1007/PL00007192.
Chorley, M.J., Walker, D.W., 2010. Performance analysis of a hybrid MPI/OpenMP
(3) Predict angular velocities at the current time step. application on multi-core clusters. J. Comput. Sci. 1 (3), 168–174. http://dx.doi.
org/10.1016/j.jocs.2010.05.001.
𝑛− 12 1 ′𝑛
𝜔𝑛𝑖 = 𝜔𝑖 + 𝛥𝜔 (C.7) Cundall, P.A., Strack, O.D., 1979. A discrete numerical model for granular assemblies.
2 𝑖 Geotechnique 29 (1), 47–65. http://dx.doi.org/10.1680/geot.1979.29.1.47.
Evans, D.J., Murad, S., 1977. Singularity free algorithm for molecular dynamics
(4) Calculate angular velocity correctors.
simulation of rigid polyatomics. Mol. Phys. 34 (2), 327–331. http://dx.doi.org/
𝛥𝜔𝑛1 = 𝛥𝑡[𝑀1𝑛 + 𝜔𝑛2 𝜔𝑛3 (𝐼2 − 𝐼3 ) − 𝜉𝐼1 𝜔𝑛1 ]∕𝐼1 10.1080/00268977700101761.
Garboczi, E.J., 2002. Three-dimensional mathematical analysis of particle shape using
𝛥𝜔𝑛2 = 𝛥𝑡[𝑀2𝑛 + 𝜔𝑛3 𝜔𝑛1 (𝐼3 − 𝐼1 ) − 𝜉𝐼2 𝜔𝑛2 ]∕𝐼2 (C.8) X-ray tomography and spherical harmonics: Application to aggregates used in
𝛥𝜔𝑛3 = 𝛥𝑡[𝑀3𝑛 + 𝜔𝑛1 𝜔𝑛2 (𝐼1 − 𝐼2 ) − 𝜉𝐼3 𝜔𝑛3 ]∕𝐼3 concrete. Cem. Concr. Res. 32 (10), 1621–1638. http://dx.doi.org/10.1016/S0008-
8846(02)00836-0.
(5) Update angular velocities by using the correctors. Garcia, E., Ando, E., Viggiani, G., Sitar, N., 2022. Influence of depositional fabric
on mechanical properties of naturally deposited sands. Geotechnique 1–15. http:
𝑛+ 21 𝑛− 12 1 𝑛 //dx.doi.org/10.1680/jgeot.21.00230.
𝜔𝑖 = 𝜔𝑖 + 𝛥𝜔 (C.9)
2 𝑖 Garcia, F.E., Bray, J.D., 2019. Modeling the shear response of granular mate-
rials with discrete element assemblages of sphere-clusters. Comput. Geotech.
For small time steps used to resolve the grain contacts and for 106, 99–107. http://dx.doi.org/10.1016/j.compgeo.2018.10.003, URL https://
quasi-static conditions, in which the angular velocities are small, the www.sciencedirect.com/science/article/pii/S0266352X18302477.
number of iterations is typically small. Usually, between three and five Garcia, X., Latham, J.-P., Xiang, J.-s., Harrison, J., 2009. A clustered overlapping sphere
iterations are required to achieve machine precision tolerance. Orien- algorithm to represent real particles in discrete element modelling. Geotechnique
59 (9), 779–784. http://dx.doi.org/10.1680/geot.8.T.037.
tations for each grain are updated using Evans’ singularity with free
Gopalakrishnan, P., Tafti, D., 2013. Development of parallel DEM for the open source
quaternion approach (Evans and Murad, 1977). For Euler’s equations of code MFIX. Powder Technol. 235, 33–41. http://dx.doi.org/10.1016/j.powtec.2012.
motion and the integration of the quaternions, the torques are specified 09.006.
in the body or principal frame, while the contact detection and force Gray, A., Moore, A., 2000. ‘N-body’ problems in statistical learning. In: Leen, T.,
calculations are performed in a space or global frame. Therefore, the Dietterich, T., Tresp, V. (Eds.), Advances in Neural Information Processing Systems.
Vol. 13, MIT Press, pp. 1–7, URL https://proceedings.neurips.cc/paper/2000/file/
rotation matrix from space to body frame is given by:
7385db9a3f11415bc0e9e2625fae3734-Paper.pdf.
2 2 2 2 Grest, G.S., Dünweg, B., Kremer, K., 1989. Vectorized link cell fortran code for
⎛−𝑞1 + 𝑞2 − 𝑞3 + 𝑞4 −2(𝑞1 𝑞2 − 𝑞3 𝑞4 ) 2(𝑞2 𝑞3 + 𝑞1 𝑞4 ) ⎞ molecular dynamics simulations for a large number of particles. Comput. Phys.

𝑅 = −2(𝑞1 𝑞2 + 𝑞3 𝑞4 ) 𝑞12 − 𝑞22 − 𝑞32 + 𝑞42 −2(𝑞1 𝑞3 − 𝑞2 𝑞4 ) ⎟ Comm. 55 (3), 269–285. http://dx.doi.org/10.1016/0010-4655(89)90125-2.
⎜ ⎟ Henty, D.S., 2000. Performance of hybrid message-passing and shared-memory paral-
⎝ 2(𝑞2 𝑞3 − 𝑞1 𝑞4 ) −2(𝑞1 𝑞3 + 𝑞2 𝑞4 ) −𝑞12 − 𝑞22 + 𝑞32 + 𝑞42 ⎠
lelism for discrete element modeling. In: SC’00: Proceedings of the 2000 ACM/IEEE
(C.10) Conference on Supercomputing. IEEE, p. 10. http://dx.doi.org/10.1109/SC.2000.
10005.
Where the 𝑞’s are the quaternions of Evans and Murad (1977). Jagadish, H.V., Ooi, B.C., Tan, K.-L., Yu, C., Zhang, R., 2005. IDistance: An adaptive
B+-tree based indexing method for nearest neighbor search. ACM Trans. Database
𝜃 𝜓 −𝜙 Syst. 30 (2), 364–397. http://dx.doi.org/10.1145/1071610.1071612.
𝑞1 = sin sin
2 2 Kačianauskas, R., Maknickas, A., Kačeniauskas, A., Markauskas, D., Balevičius, R., 2010.
𝜃 𝜓 −𝜙 Parallel discrete element simulation of poly-dispersed granular material. Adv. Eng.
𝑞2 = sin cos
2 2 (C.11)
Softw. 41 (1), 52–63. http://dx.doi.org/10.1016/j.advengsoft.2008.12.004.
𝜃 𝜓 +𝜙 Karp, A.H., Flatt, H.P., 1990. Measuring parallel processor performance. Commun. ACM
𝑞3 = cos sin 33 (5), 539–543. http://dx.doi.org/10.1145/78607.78614.
2 2
𝜃 𝜓 +𝜙 Kawamoto, R.Y., 2018. The Avatar Paradigm in Granular Materials (Ph.D. thesis).
𝑞4 = cos cos California Institute of Technology, Pasadena, CA.
2 2
Kawamoto, R., Andò, E., Viggiani, G., Andrade, J.E., 2016. Level set discrete element
And 𝜃, 𝜓, 𝜙 are Euler’s angles representing successive rotations about method for three-dimensional computations with triaxial case study. J. Mech. Phys.
the 𝑧, 𝑥′ and 𝑧′ axes. It turns out the time derivatives of the orientation Solids 91, 1–13. http://dx.doi.org/10.1016/j.jmps.2016.02.021.

17
P. Tan and N. Sitar Computers and Geotechnics 172 (2024) 106408

Kawamoto, R., Andò, E., Viggiani, G., Andrade, J.E., 2018. All you need is shape: predicting shear banding in sand with LS-DEM. J. Mech. Phys. Solids 111, 375–392. http://dx.doi.org/10.1016/j.jmps.2017.10.003.
Kloss, C., Goniva, C., Hager, A., Amberger, S., Pirker, S., 2012. Models, algorithms and validation for opensource DEM and CFD–DEM. Progr. Comput. Fluid Dyn., Int. J. 12 (2–3), 140–152.
Lim, K.-W., Andrade, J.E., 2014. Granular element method for three-dimensional discrete element calculations. Int. J. Numer. Anal. Methods Geomech. 38 (2), 167–188. http://dx.doi.org/10.1002/nag.2203.
Lim, K.-W., Krabbenhoft, K., Andrade, J.E., 2014. A contact dynamics approach to the granular element method. Comput. Methods Appl. Mech. Engrg. 268, 557–573. http://dx.doi.org/10.1016/j.cma.2013.10.004.
Maknickas, A., Kačeniauskas, A., Kačianauskas, R., Balevičius, R., Džiugys, A., 2006. Parallel DEM software for simulation of granular media. Informatica (Ljubl.) 17 (2), 207–224. http://dx.doi.org/10.15388/INFORMATICA.2006.134.
McCallen, D., Petersson, A., Rodgers, A., Pitarka, A., Miah, M., Petrone, F., Sjogreen, B., Abrahamson, N., Tang, H., 2021. EQSIM—A multidisciplinary framework for fault-to-structure earthquake simulations on exascale computers part I: Computational models and workflow. Earthq. Spectra 37 (2), 707–735. http://dx.doi.org/10.1177/8755293020970982.
Mitchell, J.K., Soga, K., 2005. Fundamentals of Soil Behavior. Vol. 3, John Wiley & Sons, New York.
Mollon, G., Zhao, J., 2012. Fourier–Voronoi-based generation of realistic samples for discrete modelling of granular materials. Granul. Matter 14 (5), 621–638. http://dx.doi.org/10.1007/s10035-012-0356-x.
Muja, M., Lowe, D.G., 2009. Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1) 2 (331–340), 2. http://dx.doi.org/10.5220/0001787803310340.
Munjiza, A., Andrews, K., 1998. NBS contact detection algorithm for bodies of similar size. Internat. J. Numer. Methods Engrg. 43 (1), 131–149. http://dx.doi.org/10.1002/(SICI)1097-0207(19980915)43:1<131::AID-NME447>3.0.CO;2-S.
Murray, R.M., Li, Z., Sastry, S.S., 2017. A Mathematical Introduction to Robotic Manipulation. CRC Press.
Nie, J.-Y., Zhao, J., Cui, Y.-F., Li, D.-Q., 2021. Correlation between grain shape and critical state characteristics of uniformly graded sands: a 3D DEM study. Acta Geotech. 1–16.
Osher, S., Sethian, J.A., 1988. Fronts propagating with curvature-dependent speed: Algorithms based on Hamilton-Jacobi formulations. J. Comput. Phys. 79 (1), 12–49. http://dx.doi.org/10.1016/0021-9991(88)90002-2.
Peters, J.F., Hopkins, M.A., Kala, R., Wahl, R.E., 2009. A poly-ellipsoid particle for non-spherical discrete element method. Eng. Comput. 26 (6), 645–657. http://dx.doi.org/10.1108/02644400910975441.
Plimpton, S., 1995. Fast parallel algorithms for short-range molecular dynamics. J. Comput. Phys. 117 (1), 1–19.
Rodgers, D.P., 1985. Improvements in multiprocessor system design. ACM SIGARCH Comput. Archit. News 13 (3), 225–231. http://dx.doi.org/10.1145/327070.327215.
Sethian, J.A., 1996. A fast marching level set method for monotonically advancing fronts. Proc. Natl. Acad. Sci. 93 (4), 1591–1595. http://dx.doi.org/10.1073/pnas.93.4.1591.
Tamadondar, M.R., de Martín, L., Rasmuson, A., 2019. Agglomerate breakage and adhesion upon impact with complex-shaped particles. AIChE J. 65 (6), e16581. http://dx.doi.org/10.1002/aic.16581.
Tan, P., 2022. Parallel LS-DEM. URL https://github.com/tanpeng1995/LS-DEM_Parallel_Benchmark.git.
Tan, P., Sitar, N., 2022. Parallel level-set DEM (LS-DEM) development and application to the study of deformation and flow of granular media. Technical Report 2022/06, Pacific Earthquake Engineering Center (PEER), UC Berkeley.
Taylor, M.A., Garboczi, E., Erdogan, S., Fowler, D., 2006. Some properties of irregular 3-D particles. Powder Technol. 162 (1), 1–15. http://dx.doi.org/10.1016/j.powtec.2005.10.013.
Vlahinić, I., Andrade, J., Andò, E., Viggiani, G., 2013. From 3d tomography to physics-based mechanics of geomaterials. In: Computing in Civil Engineering (2013). ASCE, pp. 339–345. http://dx.doi.org/10.1061/9780784413029.043.
Walther, J.H., Sbalzarini, I.F., 2009. Large-scale parallel discrete element simulations of granular flow. Eng. Comput. 26 (6), 688–697. http://dx.doi.org/10.1108/02644400910975478.
Washington, D.W., Meegoda, J.N., 2003. Micro-mechanical simulation of geotechnical problems using massively parallel computers. Int. J. Numer. Anal. Methods Geomech. 27 (14), 1227–1234. http://dx.doi.org/10.1002/nag.317.
Williams, J.R., Perkins, E., Cook, B., 2004. A contact algorithm for partitioning n arbitrary sized objects. Eng. Comput. 21 (2/3/4), 235–248. http://dx.doi.org/10.1108/02644400410519767.
Wu, M., Wang, J., Russell, A., Cheng, Z., 2021. DEM modelling of mini-triaxial test based on one-to-one mapping of sand particles. Géotechnique 71 (8), 714–727. http://dx.doi.org/10.1680/jgeot.19.P.212.
Yan, B., Regueiro, R., 2018a. Comparison between O(n²) and O(n) neighbor search algorithm and its influence on superlinear speedup in parallel discrete element method (DEM) for complex-shaped particles. Eng. Comput. 35 (3), http://dx.doi.org/10.1108/EC-01-2018-0023.
Yan, B., Regueiro, R.A., 2018b. A comprehensive study of MPI parallelism in three-dimensional discrete element method (DEM) simulation of complex-shaped granular particles. Comput. Part. Mech. 5 (4), 553–577. http://dx.doi.org/10.1007/s40571-018-0190-y.
Yan, B., Regueiro, R.A., 2019. Comparison between pure MPI and hybrid MPI-OpenMP parallelism for Discrete Element Method (DEM) of ellipsoidal and poly-ellipsoidal particles. Comput. Part. Mech. 6 (2), 271–295. http://dx.doi.org/10.1007/s40571-018-0213-8.
Zhao, D., Nezami, E.G., Hashash, Y.M., Ghaboussi, J., 2006. Three-dimensional discrete element simulation for granular materials. Eng. Comput. 23 (7), 749–770. http://dx.doi.org/10.1108/02644400610689884.
Zhao, S., Zhao, J., 2021. SudoDEM: Unleashing the predictive power of the discrete element method on simulation for non-spherical granular particles. Comput. Phys. Comm. 259, http://dx.doi.org/10.1016/j.cpc.2020.107670.
Zhou, B., Wang, J., Zhao, B., 2015. Micromorphology characterization and reconstruction of sand particles using micro X-ray tomography and spherical harmonics. Eng. Geol. 184, 126–137. http://dx.doi.org/10.1016/j.enggeo.2014.11.009.
Zohdi, T.I., 2003. Genetic design of solids possessing a random–particulate microstructure. Philos. Trans. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 361 (1806), 1021–1043. http://dx.doi.org/10.1098/rsta.2003.1179.
Zohdi, T.I., 2004a. A computational framework for agglomeration in thermochemically reacting granular flows. Proc. R. Soc. Lond. Ser. A Math. Phys. Eng. Sci. 460 (2052), 3421–3445. http://dx.doi.org/10.1098/rspa.2004.1277.
Zohdi, T.I., 2004b. Staggering error control of a class of inelastic processes in random microheterogeneous solids. Int. J. Non-Linear Mech. 39 (2), 281–297. http://dx.doi.org/10.1016/S0020-7462(02)00188-9.
Zohdi, T.I., 2007. Particle collision and adhesion under the influence of near-fields. J. Mech. Mater. Struct. 2 (6), 1011–1018. http://dx.doi.org/10.2140/jomms.2007.2.1011.
Zohdi, T.I., 2010. Simulation of coupled microscale multiphysical-fields in particulate-doped dielectrics with staggered adaptive FDTD. Comput. Methods Appl. Mech. Engrg. 199 (49–52), 3250–3269. http://dx.doi.org/10.1016/j.cma.2010.06.032.
Zohdi, T.I., 2012. Estimation of electrical heating load-shares for sintering of powder mixtures. Proc. R. Soc. A 468 (2144), 2174–2190. http://dx.doi.org/10.1098/rspa.2011.0755.
Zohdi, T.I., 2013. Numerical simulation of charged particulate cluster-droplet impact on electrified surfaces. J. Comput. Phys. 233, 509–526. http://dx.doi.org/10.1016/j.jcp.2012.09.012.
Zohdi, T.I., 2017. Modeling and Simulation of Functionalized Materials for Additive Manufacturing and 3d Printing: Continuous and Discrete Media: Continuum and Discrete Element Methods. Vol. 60, Springer.
Zohdi, T.I., Wriggers, P., 1999. A domain decomposition method for bodies with heterogeneous microstructure based on material regularization. Int. J. Solids Struct. 36 (17), 2507–2525. http://dx.doi.org/10.1016/S0020-7683(98)00124-3.