
Efficient Homology Computations on Multicore and Manycore Systems

N. Anurag Murty¹, Vijay Natarajan¹٬², Sathish Vadhiyar¹
¹Supercomputer Education and Research Centre
²Department of Computer Science and Automation
Indian Institute of Science, Bangalore, India
[email protected], [email protected], [email protected]

Abstract—Homology computations form an important step in topological data analysis that helps to identify connected components, holes, and voids in multi-dimensional data. Our work focuses on algorithms for homology computations of large simplicial complexes on multicore machines and on GPUs. This paper presents two parallel algorithms to compute homology. A core component of both algorithms is the algebraic reduction of a cell with respect to one of its faces while preserving the homology of the original simplicial complex. The first algorithm is a parallel version of an existing sequential implementation using OpenMP. The algorithm processes and reduces cells within each partition of the complex in parallel while minimizing sequential reductions on the partition boundaries. Cache misses are reduced by ensuring data locality for data in the same partition. We observe a linear speedup on algebraic reductions and an overall speedup of up to 4.9× with 16 cores over sequential reductions. The second algorithm is based on a novel approach for homology computations on manycore/GPU architectures. This GPU algorithm is memory efficient and capable of extremely fast computation of homology for simplicial complexes with millions of simplices. We observe up to 40× speedup in runtime over sequential reductions and up to 4.5× speedup over the REDHOM library, which includes the sequential algebraic reductions together with other advanced homology engines supported in the software.

I. INTRODUCTION

Topology is the study of the connectivity of space and provides useful tools for analyzing datasets by enabling the abstract representation of features in the data. Topological data analysis finds numerous applications in neuroscience, astrophysics, image analysis, and nonlinear dynamics [1–6]. All of these applications are characterised by very large data sizes from which topological data analysis reveals underlying patterns and structure. This structure is extracted in the form of connected components, holes, and voids of higher dimensions along which the data aligns itself in space. The characterization of these connected components, holes, voids, and their higher dimensional equivalents is more formally described by the notion of homology. Computing homology requires the construction of a combinatorial representation of the space such as a simplicial complex.

An interesting application of homology computations is the detection of holes in the coverage of a sensor network[7]. Hole detection is useful in cell-phone communications, beacon navigation and some problems in security and defense. These types of applications require real-time computation of homology. The requirement for real-time computations and increasingly large datasets highlight the need for fast and memory-efficient algorithms for homology computations. This serves as our primary motivation for developing parallel algorithms for homology computations.

We present parallelization strategies for fast computation of homology on multicore and manycore GPU systems. The algorithm we consider for parallelization uses the method of algebraic reductions to reduce the size of the input space while maintaining its homology[8]. For implementation on multicore architectures, the algebraic reduction step in REDHOM is parallelized using OpenMP[9]. We decompose the complex and perform parallel reductions on the different partitions while keeping the boundaries between partitions intact. The next step involves algebraic reduction of the unreduced boundary cells sequentially to compute homology. We obtain up to 4.9× improvements in performance over sequential algebraic reductions.

The above idea does not scale well for higher degrees of parallelism as in the case of GPU architectures. So we describe a different algorithm amenable to massively parallel architectures. Each GPU thread attempts to perform an algebraic reduction but this is possible only when certain conditions involving its neighbours are met. Moreover, it is observed that construction of the entire simplicial complex is not necessary for performing algebraic reductions. This observation speeds up homology computations and leads to a memory-efficient algorithm. Finally, we define a cost function that enables us to perform reductions in a load-balanced manner. Implementation of this algorithm gives up to 40× speedup over sequential algebraic reductions. We also obtain up to 4.5× speedup over REDHOM, which implements among the fastest sequential algorithms for homology computations.

Primary contributions of this work are:
1) A multicore algorithm for fast homology computations.
2) Modifications to sequential algebraic reductions using OpenMP that improve the performance by up to 4.9×.
3) A memory-efficient GPU algorithm based on algebraic reductions that gives up to 40× speedup over the sequential algorithm and up to 4.5× speedup over homology computations in the REDHOM library.
4) A novel cost assignment scheme to ensure load-balanced execution and to ensure that only low-cost reductions are performed in a given iteration.

Fig. 1. (a) A valid simplicial complex of dimension 2. (b) An invalid simplicial complex since A and B do not intersect on an edge or a vertex.

Fig. 2. The torus has one connected component, two tunnels and one void.
Although we focus on computing the rank of homology groups in our description, the algorithms presented could optionally be used to output an incidence matrix of the reduced complex. The Smith normal form algorithm can be applied to compute torsion coefficients also. This extension is relevant only in complexes of dimensions higher than three.

Section II provides the required background and definitions, especially focusing on algebraic reductions. Section III is a literature survey of prior research in this area. Sections IV and V provide detailed descriptions of the proposed algorithms for homology computations on multicore systems and GPUs respectively. Experimental results are presented in Section VI and Section VII presents possible directions for future research.

II. BACKGROUND

In topology, we study the properties of spaces that are invariant under continuous deformations or, more formally, homeomorphisms. A finite representation of topological spaces is required to compute these topological invariants. An example of such a finite representation is a simplicial complex. We present below a few definitions that are required to describe our methodology. For a more mathematical treatment, we refer the reader to the texts by Zomorodian[10] and Munkres[11].

A. Simplicial complexes and simplicial homology

A k-simplex σ is the convex hull of a set A of k + 1 independent points in $\mathbb{R}^d$, for 0 ≤ k ≤ d. We use the terms vertex for 0-simplex, edge for 1-simplex, triangle for 2-simplex and tetrahedron for 3-simplex. A simplex σ′ is a face of a simplex σ if σ′ is contained in σ. A simplicial complex, K, is a finite set of simplices satisfying two properties: (i) if σ ∈ K and τ is a face of σ then τ ∈ K, and (ii) if σ ∈ K and σ′ ∈ K, then σ ∩ σ′ is either ∅ or a face of both σ and σ′. The dimension of K, d(K), is defined as the maximum dimension of a simplex in K. Fig 1(a) shows a valid simplicial complex whereas the collection of simplices in Fig 1(b) does not satisfy property (ii) and is thus not a simplicial complex.

A k-simplex σ can be represented as the set of its vertices [v0, v1, . . . , vk]. For instance, a triangle with vertices A, B, and C can be represented as [A, B, C]. The boundary of a k-simplex is formed by the (k − 1)-simplices bounding it. In our example, the boundary of [A, B, C] consists of the three edges [A, B], [B, C] and [C, A]. The boundary of a k-simplex σ = [v0, v1, . . . , vk] is defined as the formal sum
$\partial\sigma = \sum_{i} (-1)^i \, [v_0, v_1, \ldots, \hat{v}_i, \ldots, v_k]$.
A minus sign in this sum basically means including the same simplex but with the opposite orientation, i.e., with any two of the vertices interchanged. Simplices with opposite orientations cancel each other out. The coboundary of a k-simplex σ′ is the set of all (k + 1)-simplices that have σ′ as a face. If a simplex σ′ lies in the boundary of σ, then σ lies in the coboundary of σ′. For instance, in Figure 1(a), ∂A = 3 + 4 + 5 and ∂B = 5 + 6 + 7. The coboundary of edges 1 and 2 is ∅. The coboundary of edges 3 and 4 is {A}, that of edges 6 and 7 is {B}, and edge 5 has {A, B} as its coboundary.

A fundamental property of boundaries is that the boundary function applied twice is zero. In the above example, ∂∂[A, B, C] = ∂([B, C] + [C, A] + [A, B]) = [B] − [C] + [C] − [A] + [A] − [B] = 0. We define a k-cycle as any formal sum of simplices whose boundary is zero. Due to this property of boundaries, all boundaries are cycles. However, not all cycles bound a higher dimensional simplex. For instance, if our original simplicial complex had the edges [B, C], [C, D] and [D, B] but not the triangle [B, C, D], the boundary of [B, C] + [C, D] + [D, B] is ∂([B, C] + [C, D] + [D, B]) = [B] − [C] + [C] − [D] + [D] − [B] = 0. The edges [B, C], [C, D] and [D, B] form a cycle that is not a boundary of any triangle.

The homology of a simplicial complex deals with counting the number of independent cycles that do not bound any set of simplices in a higher dimension. The homology in orders 0, 1 and 2 represents the number of connected components, tunnels, and voids respectively, and these are represented as algebraic groups. In this paper, we are interested in computing the rank of these groups and we refer to these computations as homology computations. For example, homology computations identify one connected component, two independent tunnels and one void in the simplicial complex that represents a torus in Figure 2. For ease of description, computations are performed modulo 2, which gives us the Z2 homology[10].

B. Algebraic reduction

Consider the simplicial complex in Figure 1(a). It consists of one connected component and contains one tunnel. Clearly, we can construct a smaller sized complex representing one component and containing one tunnel. Reduction algorithms

reduce the size of a simplicial complex in a way such that homology remains unchanged.

We focus on algebraic reductions to reduce the size of the complex[8]. Initially, each dimension d of the simplicial complex consists of the set of all the d-simplices. During the reduction procedure, d-simplices can merge to form d-cells, which can be thought of as more general versions of simplices. For example, vertices are 0-cells, edges are 1-cells, polygons are 2-cells and 3-D polytopes are 3-cells. In any intermediate step of the procedure, dimension d consists of the set of all the d-cells.

For two cells u, v of the same dimension, we define ⟨u, v⟩ to be 1 when u = v and 0 otherwise. After the algebraic reduction of cell b of dimension m with respect to its face a in dimension m − 1, the new boundary maps are given by Equation 1, where addition is performed modulo 2.

$$\partial' v = \begin{cases} \partial v, & \text{if } d(v) \notin \{m, m+1\},\\ \partial v + \langle \partial v, a \rangle\, \partial b, & \text{if } d(v) = m,\\ \partial v + \langle \partial v, b \rangle\, b, & \text{if } d(v) = m+1. \end{cases} \tag{1}$$

After the reduction, b and a are removed from the complex. This reduction operation is guaranteed to preserve the homology of the complex. In the end, the number of cells in dimension d that are irreducible is equal to the homology of order d.
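To make the reduction concrete, here is a small self-contained C++ sketch of a single Z2 reduction following Equation 1. The representation (a Chain as a sorted set of integer cell ids) and all names are our own illustration, not the REDHOM interface; over Z2, adding two chains is a symmetric set difference, so the ⟨∂v, a⟩∂b term either cancels a out of ∂v or leaves ∂v untouched.

#include <algorithm>
#include <iterator>
#include <map>
#include <set>

// A Z2 chain is a set of cell ids; chain addition modulo 2 is the
// symmetric set difference (ids present in both operands cancel).
using Chain = std::set<int>;

Chain addZ2(const Chain& x, const Chain& y) {
    Chain out;
    std::set_symmetric_difference(x.begin(), x.end(), y.begin(), y.end(),
                                  std::inserter(out, out.begin()));
    return out;
}

// Reduce cell b (dimension m) with respect to its face a, per Equation 1:
// every other m-cell v with a on its boundary gets dv <- dv + db, which
// cancels a; every (m+1)-cell w simply drops b from its boundary; then the
// pair (b, a) disappears (a is also erased from its own dimension's cell
// list by the caller).
void reduce(int b, int a, std::map<int, Chain>& bdy,
            std::set<int>& cellsM, const std::set<int>& cellsM1) {
    const Chain db = bdy[b];
    for (int v : cellsM)
        if (v != b && bdy[v].count(a))   // <dv, a> = 1
            bdy[v] = addZ2(bdy[v], db);
    for (int w : cellsM1)
        bdy[w].erase(b);                 // dw <- dw + <dw, b> b over Z2
    cellsM.erase(b);
    bdy.erase(b);
    bdy.erase(a);
}

Applying this repeatedly to the complex of Figure 1(a), for example first reducing B with respect to edge 5, reproduces the step-by-step sequence shown in Figure 3 below.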
Figure 3 illustrates an example algebraic reduction. The input is a simplicial complex with one connected component and one tunnel. Initially, the cells in various dimensions just consist of the simplices. The cells are as follows:

dimension 0: {a, b, c, d, e}
dimension 1: {1, 2, 3, 4, 5, 6, 7}
dimension 2: {A, B}

Fig. 3. Step-by-step algebraic reductions of a simplicial complex [Red(b, a) means the reduction of cell b with respect to face a].

First, face B is reduced with respect to edge 5. After this reduction, face A no longer remains a simplex. It becomes a cell with boundary {3, 4, 6, 7}. Subsequent reductions reduce face A with respect to edge 7, and the edges 6, 2, and 1 with respect to the vertices e, d, and b respectively. In each of these steps, the homology of the initial simplicial complex is preserved. Finally, the edge 3 and vertex a remain. There are no remaining cells of dimension 2. This implies that the homology of order 2 is zero. In dimension 1, edge 3 is irreducible as it is incident on vertex a twice, which means that it has no boundary. So the homology of order 1 is one. Similarly, the vertex a is irreducible as a 0-dimensional cell has a zero boundary. Thus the homology of order 0 is also one.

III. RELATED WORK

The classical method for homology computations given by Munkres[11] has exponential bounds. The Smith normal form used to compute homology requires a polynomial number of steps, but the numbers in the intermediate computations can get very large, thus yielding exponential bounds. Kannan et al.[12] gave the first polynomial time algorithm to compute the Smith normal form of matrices. A probabilistic analysis based on the fact that the boundary matrices are sparse further improved the expected running time to O(n²)[13]. The quadratic complexity is undesirable for very large datasets.

For finite simplicial complexes embedded in R³, Delfinado et al.[14] describe a near linear time algorithm to compute homology. However, this algorithm does not extend to complexes in dimensions greater than three.

Another class of algorithms are the algebraic reduction algorithms that reduce the size of the complex while maintaining its homology. Kaczynski et al.[8] propose a reduction algorithm to compute homology of a finitely generated chain complex. This is the first in a number of algorithms based on algebraic and geometric reductions which are implemented in REDHOM, a software library for efficient computation of homology of sets[15]. REDHOM includes algorithms for computing homology based on coreductions[16], acyclic subspace methods[17] and discrete Morse theory[18]. All these methods postpone the actual homology computations using Smith normal form until the complex size is much smaller. The implementation in REDHOM is sequential and there is scope for parallelization in many of its algorithms.

Lewis et al.[19] have implemented a framework for parallel computation of homology on multicore computers by dividing an input complex into local pieces and performing parallel computations on these. After the parallel computations, the pieces are merged and homology is calculated again to give the final result. The method relies on the property of the initial division that the homology is equal to the sum of homology of the individual pieces. However, this method does not scale to the level of parallelism offered by manycore GPU platforms, an issue we address in our work.
IV. HOMOLOGY COMPUTATIONS ON MULTICORE SYSTEMS

We now propose two approaches to parallelizing the homology computation algorithm. The first approach is suitable for multicore computations and is based on the sequential algorithm implemented in the REDHOM library[15]. The library has efficient implementations of algorithms based on reduction methods such as acyclic subspace construction, elementary reductions, and discrete Morse theory. Each of these techniques is applied in sequence on the input complex to reduce its size and hence compute the homology efficiently. In this work, our focus is only on parallelizing algebraic reductions. We disable all the other steps of REDHOM and restrict our attention to algebraic reductions.

First, we discuss the steps of the sequential algebraic reductions, which are profiled for various datasets in Figure 4. We discuss the datasets mentioned in the figure in more detail in Section VI.

Fig. 4. Timings for various functions in homology computations using sequential algebraic reductions.

A. Sequential algorithm for algebraic reductions

Read and construct simplicial complex. In this step, maximally induced simplices of a simplicial complex are taken as input and the simplices of all dimensions are generated. This step forms the pre-processing step for all the algorithms implemented in REDHOM, including algebraic reductions. As seen in Figure 4, this step has a very low contribution to the execution times of sequential algebraic reductions.

Codes assignment and construction of reducible complex. The simplicial complex constructed in the previous step has to be algebraically reduced for homology computations using Equation 1. Since each step of a reduction modifies the boundaries and coboundaries of simplices, we need a data structure that provides fast access to boundary and coboundary data. For the purpose of creating a map from simplices to their boundaries and coboundaries, integer codes are assigned to all the simplices. Then boundary and coboundary maps which assign a chain to each code are constructed. This set of maps constitutes a reducible complex on which algebraic reductions are performed. This step takes up the highest percentage of the total execution time.

Algebraic reductions. This step performs the actual reductions on the reducible complex that represents the input simplicial complex. Starting from the highest dimension, the cells are reduced with respect to their faces and their boundary maps are modified. For each dimension, the number of remaining irreducible simplices is the homology of that dimension. The modification of these boundaries and coboundaries to compute homology is also a time consuming step in sequential reductions.

Algorithm 1 Algorithm For Multicore Homology
Input: Maximal simplices of simplicial complex K
Output: Homology: β[0], β[1], ..., β[d(K)]
1: Partition the simplicial complex K into (P0, P1, ..., Pk−1)
2: Mark boundary vertices
3: Spawn k threads and assign thread t to Pt
4: (In Parallel) Threads construct the reducible complex for their partition
5: (In Parallel) Threads reduce non-boundary cells in their partition
6: Merge unreduced partitions to get a single reducible chain complex K′
7: Perform algebraic reductions on all reducible cells of K′
8: β[d] is the cardinality of irreducible cells in dimension d
9: return β

B. Multicore algorithm for algebraic reductions

We attempt to parallelize the construction of the reducible complex and the algebraic reductions, as both of these are the major contributors to the execution time of the homology computations using algebraic reductions. Algorithm 1 explains the steps for computation of homology on multicore machines; a minimal OpenMP sketch of this flow is given below. The steps are: decomposition of the complex into partitions, parallel reductions of these partitions, merging of the reduced partitions and sequential reduction of the merged complex.

Decomposition into partitions. Parallelization depends on an initial partition of the input mesh into near-equal sized meshes whose boundaries are as small as possible. These partitions are generated using METIS, a graph partitioning software[20]. Minimizing the number of boundary cells helps in reducing the time spent in the sequential part of our algorithm, while similar sized partitions help in maintaining load balance during the parallel phase. Simplices with all their vertices occurring in two or more partitions are marked as boundary simplices. During the serial read phase, we preallocate contiguous memory for non-boundary simplices from each partition. This ensures spatial locality of simplices from the same partition.
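The OpenMP sketch below shows the control flow of Algorithm 1 under our own simplified stand-ins; Partition and the four helper functions are illustrative stubs, not REDHOM's actual interfaces.

#include <omp.h>
#include <vector>

// Illustrative stand-in for a partition's reducible complex; each partition
// owns its own memory pool, so threads do not touch shared data here.
struct Partition { /* cells, boundary and coboundary maps, ... */ };

void buildReducibleComplex(Partition&) { /* codes and boundary maps */ }
void reduceNonBoundaryCells(Partition&) { /* Equation 1 on interior cells */ }
Partition mergeUnreduced(const std::vector<Partition>&) { return {}; }
std::vector<int> reduceSequentially(Partition&) { return {}; } // Betti numbers

std::vector<int> multicoreHomology(std::vector<Partition>& parts) {
    // Steps 3-5 of Algorithm 1: one thread per METIS partition; boundary
    // cells shared between partitions are left unreduced in this phase.
    #pragma omp parallel for schedule(static)
    for (int t = 0; t < static_cast<int>(parts.size()); ++t) {
        buildReducibleComplex(parts[t]);
        reduceNonBoundaryCells(parts[t]);
    }
    // Steps 6-9: merge the partially reduced partitions, reduce the
    // remaining (mostly boundary) cells serially, and count irreducibles.
    Partition merged = mergeUnreduced(parts);
    return reduceSequentially(merged);
}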

Also, a separate memory space is allocated beforehand for the boundary simplices.

Parallel construction of the reducible complex. After boundary vertices are marked, we spawn one thread per partition. Firstly, the threads split the simplicial complex obtained from the read operation to construct one reducible complex per partition. We allocate memory pools for each partition and ensure that simplices which belong to the same partition are assigned to the same memory pool. As data locality is maintained for the non-boundary simplices of a partition, cache misses incurred in the iterations over these simplices to construct the reducible complex in parallel are greatly reduced. This step is crucial in improving the performance of the multicore algorithm.

For the set of boundary simplices, all the threads have to iterate over the same memory space, but this does not greatly affect performance due to the small size of the boundary compared to the size of the partitions.

The reducible complex represents the boundary and coboundary information of the partition assigned to the thread. Codes are assigned to cells while ensuring that boundary cells, i.e., cells with all their vertices on the boundary, are assigned the same code over all partitions. Using METIS for graph partitioning ensures that we obtain balanced partitions of the mesh, thus keeping these parallel constructions load-balanced.

Parallel algebraic reductions. Algebraic reductions are then performed in parallel on each of these reducible complexes. The reductions are done on all cells with the exception of boundary cells. Boundary cells are shared by two or more partitions and are not reduced in the parallel phase. Our partitions should thus have very small boundary sizes to maximize the parallel reductions. Figure 5 shows some of the intermediate steps in the parallel reduction phase. The threads reduce the different coloured partitions in parallel and leave the boundaries unreduced.

Fig. 5. Intermediate steps in reductions of the partitions by different threads. Different colours represent the partitions reduced by the threads. The boundary elements are shown in red and are not reduced in the parallel phase.

Merge and sequential reductions. Now each thread has a partially reduced chain complex consisting primarily of unreduced boundary cells. All of these complexes are then merged together to form a new reducible complex. As reduction operations preserve homology, the homology of this new complex is the same as the homology of the input mesh. Algebraic reductions are applied to all reducible cells of the merged chain complex. After this step is completed, the homology in each dimension is given by the number of irreducible cells of that dimension. Figure 6 shows the different partitions being merged, after which the complex is reduced sequentially.

Fig. 6. Intermediate steps of the sequential reduction phase. The partitions are merged in this case and sequential reductions are performed subsequently.

V. HOMOLOGY COMPUTATIONS ON MANYCORE/GPU SYSTEMS

As mentioned, the methodology used for multicore homology cannot be directly extended to GPU architectures. We propose an algorithm to compute homology on GPUs. The discussion considers Z2 homology, i.e., addition modulo 2, but can be easily extended to arbitrary fields, albeit with increased space requirements. We assume that the input is in the form of maximal simplices of a simplicial complex and is stored as a set of vertex arrays in the GPU global memory. Algorithm 2 is the homology computation algorithm for GPUs. The important steps in the algorithm are explained below.

Reducing memory requirements for GPU algorithm. In Equation 1, we observe that reducing a cell in dimension m can only modify the boundaries in dimensions m and m + 1. So, if we start from the highest dimension and work our way downwards, the boundaries in only the highest dimension are modified[8]. An algebraic reduction of a cell in dimension m is performed with respect to one of its faces in dimension m − 1. Thus, given the list of all cells we can generate all the faces with which the cells can be paired and reduced. This implies that we just need to transfer the list of simplices of the highest unreduced dimension m to perform algebraic reductions.
This crucial observation helps in improving performance of the algorithm, as all the intermediate data structures are generated on the GPU so that data transfers between the host and device are minimized. Intermediate data structures include the boundary data and the coboundary data of the cells and faces respectively. For lower dimensions, we only need to carry forward the unreduced faces from this dimension. In comparison to the space required for storing the entire simplicial complex, GPU memory requirements are very low when we adopt this approach of constructing the complex per-dimension starting from the highest dimension. In contrast with this, the entire simplicial complex with simplices in all dimensions is constructed in REDHOM as a pre-processing step.

Algorithm 2 Algorithm For GPU Homology
Input: Maximal simplices of simplicial complex K
Output: Homology: β[0], β[1], ..., β[d(K)]
1: for dim = d(K) downto 1 do
2:   Transfer cells of dimension dim to GPU (struct-cells)
3:   Allocate space for faces on GPU (struct-faces)
4:   β[dim] = Reduce-dimension(struct-cells, struct-faces)
5:   Merge unreduced faces in struct-faces with cells of dimension dim − 1
6: end for
7: return β

Data structure for reductions. Algebraic reductions of a cell-face pair require a data structure which allows fast access to cell boundaries and face coboundaries[1]. In REDHOM, this reducible chain complex is generated from the simplicial complex[15]. For our purposes, we never construct the entire simplicial complex. In each iteration in Algorithm 2, we only transfer the unreduced simplices of the highest unreduced dimension to the GPU.

On the device, the reducible complex is generated using the procedure given in Algorithm 3. Initially, all the cells to be reduced are simplices. So, it is straightforward to generate the faces. It is trivial for cells to access the faces on their boundaries and also to compute the boundary sizes, which are equal in the beginning. However, the list of faces has many repeated faces that belong to the boundary of more than one cell. A sort procedure, say a lexicographic sort, helps to collect the repeated faces and to obtain their coboundary cells and coboundary sizes. The indices of the faces change after the sorting step. So, a remapping from old face indices to new ones is performed on the cell boundaries. After this step, all cells and faces are organised into a data structure that enables efficient algebraic reductions, by supporting fast access to cell boundaries and face coboundaries. Cells can access their boundary by using the new maps and faces can directly access their coboundary by using the values of their old locations. For instance, if the face generated from a tetrahedron had an initial index i, it is straightforward to see that the index of the parent tetrahedron of this face is ⌊i/4⌋. Initially we ensure that the boundary face indices and coboundary cell indices are stored in sorted order. This helps in ensuring that the symmetric difference operation carried out for algebraic reductions can be executed in time linear in the size of the boundary/coboundary. A small host-side illustration of this construction is given after Algorithm 3.

Algorithm 3 Procedure Reduce-dimension
1: Reduce-dimension(struct-cells, struct-faces) {
2:   (GPU) Generate faces from cells
3:   (GPU) Assign values to boundary, boundary-size vectors
4:   (GPU) Sort faces in lexicographic order and mark repeated faces
5:   (GPU) Assign values to coboundary, coboundary-size vectors
6:   (GPU) Remap to get newIDs in boundary vectors
7:   (GPU) Remove repeats from face vectors
8:   Initialize variables irreducible, reduced to 0
9:   while (irreducible + reduced < number of cells) do
10:    (GPU) Each cell finds the face with min. cost of reduction
11:    (GPU) Cells with min. costs within a fixed margin do a race-prioritycheck-check to lock the required boundaries and coboundaries
12:    (GPU) Invoke Kernel Reduce-pair
13:    (GPU) Update values of reduced and irreducible
14:  end while
15:  return irreducible
16: }
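The face generation and deduplication of lines 2-7 of Algorithm 3 can be illustrated with a small host-side C++ sketch; the GPU version performs the same steps with data-parallel kernels and a parallel sort, and all names below are our own.

#include <algorithm>
#include <array>
#include <cstdio>
#include <vector>

// Generate the four triangular faces of each tetrahedron, sort them
// lexicographically to group duplicates, and recover each face's parent
// cell as floor(i/4) from its pre-sort index, as described above.
int main() {
    std::vector<std::array<int, 4>> tets = {{0, 1, 2, 3}, {1, 2, 3, 4}};
    std::vector<std::array<int, 3>> faces;   // vertex triples
    for (const auto& t : tets)
        for (int skip = 0; skip < 4; ++skip) {
            std::array<int, 3> f; int k = 0;
            for (int j = 0; j < 4; ++j) if (j != skip) f[k++] = t[j];
            faces.push_back(f);
        }
    // The pre-sort index i identifies the parent tetrahedron as i/4.
    std::vector<int> idx(faces.size());
    for (size_t i = 0; i < idx.size(); ++i) idx[i] = static_cast<int>(i);
    std::sort(idx.begin(), idx.end(), [&](int a, int b) {
        return faces[a] < faces[b];          // lexicographic on triples
    });
    // Equal consecutive faces are one geometric face; the size of each
    // run is that face's coboundary size.
    for (size_t i = 0; i < idx.size(); ) {
        size_t j = i;
        while (j < idx.size() && faces[idx[j]] == faces[idx[i]]) ++j;
        std::printf("face (%d,%d,%d): coboundary size %zu\n",
                    faces[idx[i]][0], faces[idx[i]][1], faces[idx[i]][2],
                    j - i);
        i = j;
    }
}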

Cost of a reduce-pair operation. When reducing the cells of a particular dimension, only one of the many faces of a cell has to be chosen for reduction. In addition to this, not all cells can be reduced simultaneously. We introduce a novel cost function that helps in choosing unique reduction pairs without conflicts. We only allow reduction pairs with costs within a small margin of the smallest cost to proceed with reductions in a given iteration. As the cost of a reduction pair reflects the time taken for that particular cell-face reduction, avoiding high reduction costs ensures that we limit the time spent in a particular iteration.

To derive the cost function and to describe the intuition behind its design, let us consider Equation 1 again. As reductions are performed on the highest unreduced dimension, only the boundaries of dimension m are modified. The value ⟨∂v, a⟩ is non-zero only if v is on the coboundary of face a. So, reduction with respect to a only modifies the boundaries of cells on the coboundary of a. Similarly, if cell b is reduced, only the coboundaries of faces on its boundary are modified.

When working with Z2 homology, the boundaries and coboundaries are essentially sets of faces and cells respectively. For a cell a, Cbdy(a) and Bdy(a) are used to denote the set of cells in the coboundary and boundary respectively. #Cbdy(a) and #Bdy(a) are the cardinalities of these sets. For Z2 homology, merging two boundaries or coboundaries is equivalent to a symmetric set difference operation, as described in Algorithm 4. We define the cost of a pair reduction as the work done in performing the symmetric set difference operations. As the order of complexity of computing the symmetric set difference of two sorted arrays is linear in the sum of the number of elements in these arrays, the cost of reducing cell b with face a is:

$$\mathrm{reduction\_cost}(a, b) = (\#Bdy(b) - 1) \times \#Cbdy(a) + \sum_{g \in Bdy(b) \setminus \{a\}} \#Cbdy(g) + (\#Cbdy(a) - 1) \times \#Bdy(b) + \sum_{u \in Cbdy(a) \setminus \{b\}} \#Bdy(u) \tag{2}$$

In Algorithm 3, the cost function is used to find the face with the minimum cost of reduction for each cell. This cost helps in deciding the maximum cost we are willing to incur for reductions in a particular iteration.
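In C++, with boundary and coboundary sets kept in ordinary containers (our own choice for illustration; the implementation keeps the sizes in device vectors), Equation 2 reads:

#include <map>
#include <set>

using IdSet = std::set<int>;

// Cost of reducing cell b with respect to face a (Equation 2), given the
// current boundary sets Bdy (cell -> faces) and coboundary sets Cbdy
// (face -> cells). Names are ours, not the paper's implementation.
long reductionCost(int a, int b,
                   const std::map<int, IdSet>& Bdy,
                   const std::map<int, IdSet>& Cbdy) {
    const long nb = static_cast<long>(Bdy.at(b).size());   // #Bdy(b)
    const long ca = static_cast<long>(Cbdy.at(a).size());  // #Cbdy(a)
    long cost = (nb - 1) * ca + (ca - 1) * nb;
    for (int g : Bdy.at(b))                 // faces of b other than a
        if (g != a) cost += static_cast<long>(Cbdy.at(g).size());
    for (int u : Cbdy.at(a))                // cofaces of a other than b
        if (u != b) cost += static_cast<long>(Bdy.at(u).size());
    return cost;
}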

Algorithm 4 Kernel Reduce-pair
1: Reduce-pair(cell b, face a) {
2:   for all cells t on the coboundary of a except b do
3:     Bdy(t) = (Bdy(t) ∪ Bdy(b)) \ (Bdy(t) ∩ Bdy(b))
4:     if (#Bdy(t) == 0) then
5:       Mark t as irreducible
6:     end if
7:   end for
8:   for all faces f on the boundary of b except a do
9:     Cbdy(f) = (Cbdy(f) ∪ Cbdy(a)) \ (Cbdy(f) ∩ Cbdy(a))
10:  end for
11:  Mark b, a as reduced
12: }
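Each update in Reduce-pair is a symmetric set difference; on the sorted index arrays prepared by Algorithm 3 it is a single merge pass, linear in the sizes of the two operands, which is exactly the work counted by Equation 2. A minimal C++ rendering of one such update (ours, not the paper's kernel code):

#include <algorithm>
#include <iterator>
#include <vector>

// Z2 update Bdy(t) <- (Bdy(t) u Bdy(b)) \ (Bdy(t) n Bdy(b)) on sorted id
// arrays: one merge pass, linear in #Bdy(t) + #Bdy(b).
std::vector<int> symmetricDifference(const std::vector<int>& x,
                                     const std::vector<int>& y) {
    std::vector<int> out;
    std::set_symmetric_difference(x.begin(), x.end(), y.begin(), y.end(),
                                  std::back_inserter(out));
    return out;
}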
Race-prioritycheck-check and homology computations. The reduction of a cell-face pair needs to modify certain boundaries and coboundaries as described in Algorithm 4. Thus we need to ensure that these boundaries and coboundaries are not modified by more than one thread performing a cell-face reduction. We use the three-phase race-prioritycheck-check technique to ensure that modification of a particular boundary/coboundary is done only by a single thread[21]. In the race step, threads assigned to each unreduced cell use their IDs to lock the required boundaries and coboundaries. In the priority check step, all threads read the lock ID value of the boundaries and coboundaries and modify the lock value if and only if they have a higher priority than the ID assigned in the race phase. Finally, in the check phase, all cells check if they have ownership of all the required boundaries and coboundaries. If the result of the check phase is TRUE, then the thread proceeds with the reduction.
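The following self-contained C++ sketch mimics the three phases with std::atomic words standing in for the per-boundary lock locations in GPU global memory. It is our illustration of the technique from Nasre et al.[21], not the paper's CUDA kernel; priorities are assumed unique per contender, with a larger value encoding higher priority (lower cost plus the random tie-break described next).

#include <atomic>
#include <vector>

// One lock word per boundary/coboundary list that a reduction pair must
// own. Returns true iff this contender finishes the three phases owning
// every required lock.
bool tryAcquire(std::vector<std::atomic<long>>& locks,
                const std::vector<int>& needed, long myPriority) {
    // Race: every contender stamps its priority on the locks it needs.
    for (int i : needed)
        locks[i].store(myPriority, std::memory_order_relaxed);
    // Priority check: overwrite a lock only if our priority beats the
    // value currently stored there.
    for (int i : needed) {
        long cur = locks[i].load(std::memory_order_relaxed);
        while (cur < myPriority &&
               !locks[i].compare_exchange_weak(cur, myPriority)) {
            // cur is refreshed by compare_exchange_weak on failure
        }
    }
    // Check: proceed only if every lock still carries our priority;
    // otherwise a higher-priority neighbour owns it.
    for (int i : needed)
        if (locks[i].load(std::memory_order_relaxed) != myPriority)
            return false;
    return true;
}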

Assignment of priorities in the prioritycheck phase is not directly based on thread IDs. Priorities to reduction pairs are assigned on the basis of the costs defined in Equation 2. The reduction pair with the lower reduction cost is given a higher priority. Ties are broken based on a random number assigned to each reduction pair. We do not directly use thread IDs to break ties between reduction pairs with equal costs. This is because the lexicographic ordering of the cells results in a series of cascading conflicts when we break ties using the thread IDs. As a result, very few pairs are able to obtain the required locks and proceed with the reduction, thus reducing the number of parallel reductions in an iteration.

In some iterations, there is a possibility that the sequence of priorities is such that very few pairs are declared reducible. In the worst case, it is possible that no pairs are reducible in this iteration. In both these cases, random priorities are reassigned to reduction pairs to increase the number of parallel reductions.

Unreduced cells in dimension m are marked irreducible when their boundary size becomes zero. This effectively means there is no face with respect to which these cells can be reduced. When there are no reducible m-cells left, the number of irreducible m-cells is the homology of order m. After the homology for dimension m is computed, the maximal simplices of dimension m − 1 are merged with the unreduced faces of this iteration. The same procedure is repeated until the homology of all orders has been computed.

An illustration for the GPU algorithm is shown in Figure 7. The complex in Figure 7(a) has to be reduced and one GPU thread is assigned to each cell. The minimum reduction cost of each cell with respect to any of its faces is computed on the GPU using Equation 2. Yellow cells represent a minimum reduction cost of 6 with respect to their free face and pink cells have a minimum reduction cost of 20. Random priorities are assigned to each of these cells and the race-prioritycheck-check step identifies the cells which will be reduced in this iteration. In Figure 7(b), purple cells denote the cells being reduced in this iteration as they have locks on their neighbours, denoted by the green cells. Orange cells are the ones which were unable to lock their coboundaries and boundaries and hence will not participate in reductions in this iteration. Finally, at the end of this iteration, we have a reduced complex as shown in Figure 7(c). The brown cell in the centre is formed due to the reduction of the interior cell.

Fig. 7. Algebraic reductions in a single iteration of the GPU algorithm: (a) Yellow cells have low reduction costs and pink cells denote high reduction costs. (b) In this iteration, purple cells will be reduced and the orange cells will not be reduced as they are unable to lock their neighbours. Green cells are the neighbours of cells being reduced. (c) The complex after the reductions in this iteration. The brown cell is formed by the reduction of the interior cell.

VI. EXPERIMENTS AND RESULTS

The inputs to our experiments were mostly tetrahedral meshes obtained from the aim@shape repository[22]. BLUNT represents the blunt fin dataset which consists of ~1 million simplices, POST is the liquid post dataset with ~3 million simplices and BUCKY is the buckyball dataset with ~6 million simplices, all from the aim@shape repository. SYNTH is a synthetically generated dataset with ~10 million simplices.

All the timings presented for both the multicore and the GPU algorithms include data pre-processing times as well. For the GPU algorithm, data transfers to and from the device are included in the reported timings.

A. Multicore

For computation of homology on multicore systems, we use the METIS library for generating load-balanced graph partitions. REDHOM is written in C++ and the parallelization is implemented in OpenMP. The experiments were performed on an x86_64 Linux machine with 16GB of RAM and a 2GHz Intel Xeon Processor E5-2650. We enable hyperthreading to get 16 processing threads over the 8 physical cores.

The time taken by the various functions during sequential algebraic reductions in REDHOM is shown in Figure 4. In all cases, construction of the reducible complex is the most time-consuming operation, followed by algebraic reductions. The complex read and construction of the simplicial complex is done sequentially. Codes are assigned to the simplices and reducible complexes are constructed on the different partitions in parallel. Algebraic reductions are performed on all non-boundary simplices in parallel, following which the unreduced simplices are merged.

The results of this parallelization for different numbers of threads on the SYNTH dataset are presented in Figure 8.

Fig. 8. Parallelization results for dataset SYNTH using multicore reductions.

For all datasets, it is observed that the algebraic reductions step of the sequential algorithm scales linearly with increasing number of cores. We obtain up to 10.7× speedup for this step with 16 cores. We notice that the execution times for construction of the reducible complex decrease with increasing number of cores. With 16 cores, the maximum speedup attained is 8.76× over the sequential execution of this step. Also, the initial read and construction of the simplicial complex is performed sequentially, and the lack of parallelism in this step eventually makes this function a major contributor to the execution time. The total execution times for the various datasets with increasing number of cores are shown in Figure 9. Performance gains of up to 4.9× are obtained over sequential algebraic reductions.

Fig. 9. Total times taken (including pre-processing step) with increasing number of cores.

B. GPU

The GPU algorithm for homology computations was implemented in CUDA for devices with compute capability 2.0 and higher. The -arch=compute_20 and -code=sm_20 flags are used for compiling the code with the Nvidia compiler nvcc 5.0. We evaluate the performance of our algorithm on a 1.15GHz Nvidia Tesla C2070 card. It belongs to the Fermi GPU series and has 6GB of global memory and 448 CUDA cores.

The timings vary over different runs for the same dataset due to the randomized nature of the algorithm. For comparisons, we use the average times over 20 runs of homology computations for each dataset. We list the maximum, minimum, and average times (in seconds) for each dataset in the following table.

Dataset   Max. time   Min. time   Avg. time
BLUNT     0.57s       0.48s       0.51s
POST      1.67s       0.96s       1.32s
BUCKY     2.93s       2.26s       2.71s
SYNTH     4.73s       3.19s       4.18s

The speedups of the average GPU timings over the sequential algebraic reductions are shown in Figure 10. For the SYNTH and POST datasets, we obtain about 40× speedup. In Figure 11, we compare the performance of multicore algebraic reductions and the GPU algorithm for reductions. The GPU algorithm performs up to 9.8× better than multicore algebraic reductions.

Fig. 10. Speedup of average GPU timings with respect to sequential algebraic reductions.

Fig. 11. Comparison of GPU and multicore timings.

We also perform experiments to compare the running times of the GPU algorithm against the optimized version of REDHOM.
The optimized version includes an entire range of algorithms for efficient homology computation in addition to the algebraic reductions. Even when all homology computation engines of REDHOM are switched on, speedups of up to 4.5× are observed with the GPU algorithm, as seen in Figure 12.

We also observe better speedups for the POST and SYNTH datasets compared to the other datasets for both the algorithms. In Figure 4, we notice that for both these datasets the time spent in algebraic reductions forms a high percentage of the total execution time. This relationship between the contribution of the algebraic reduction step to the total execution time and the speedups obtained for the dataset was observed in general for all the datasets on which we tested our algorithm.

Fig. 12. Comparison of average GPU timings with optimized REDHOM, which includes the sequential algebraic reductions together with other advanced homology engines supported in the software.

VII. CONCLUSIONS AND FUTURE WORK

In this work, we have developed algorithms for homology computations on multicore and manycore GPU systems. We observe up to 4.9× speedup with 16 cores over sequential algebraic reductions on multicore systems. A speedup of up to 40× over the sequential algebraic reductions is observed using our GPU algorithm. The GPU algorithm compares favourably with the REDHOM library, which has a series of algorithms for homology computations, giving up to 4.5× performance gains.

We have explored the possibility of parallelization exclusively based on algebraic reductions. There are many other types of reduction algorithms implemented in REDHOM. We plan to extend our work further by identifying algorithms that work at a local level to reduce the size of the simplicial complex and then using a similar approach to parallelize them. Another possible extension could be parallel algorithms for homology computations in a distributed memory environment.

ACKNOWLEDGEMENTS

This work was partially supported by the Department of Science and Technology, India, under Grant SR/S3/EECE/0086/2012.

REFERENCES

[1] T. Kaczynski, K. Mischaikow, and M. Mrozek, Computational Homology. New York: Springer, 2004, vol. 157.
[2] G. Singh, F. Memoli, T. Ishkhanov, G. Sapiro, G. Carlsson, and D. L. Ringach, "Topological analysis of population activity in visual cortex," Journal of Vision, vol. 8, no. 8, 2008.
[3] S. Maadasamy, H. Doraiswamy, and V. Natarajan, "A hybrid parallel algorithm for computing and tracking level set topology," HiPC, 2012.
[4] A. Gyulassy, V. Natarajan, V. Pascucci, P.-T. Bremer, and B. Hamann, "A topological approach to simplification of three-dimensional scalar functions," IEEE Transactions on Visualization and Computer Graphics, vol. 12, no. 4, pp. 474–484, 2006.
[5] H. Edelsbrunner and J. L. Harer, Computational Topology: An Introduction. American Mathematical Soc., 2010.
[6] R. van de Weygaert, G. Vegter, H. Edelsbrunner, B. J. Jones, P. Pranav, C. Park, W. A. Hellwing, B. Eldering, N. Kruithof, E. P. Bos et al., "Alpha, Betti and the megaparsec universe: on the topology of the cosmic web," in Transactions on Computational Science XIV. Springer, 2011, pp. 60–101.
[7] R. Ghrist and A. Muhammad, "Coverage and hole-detection in sensor networks via homology," in Proc. Intl. Symp. Information Processing in Sensor Networks. IEEE Press, 2005, p. 34.
[8] T. Kaczyński, M. Mrozek, and M. Ślusarek, "Homology computation by reduction of chain complexes," Computers & Mathematics with Applications, vol. 35, no. 4, pp. 59–70, 1998.
[9] L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," IEEE Computational Science & Engineering, vol. 5, no. 1, pp. 46–55, 1998.
[10] A. J. Zomorodian, Topology for Computing. Cambridge University Press, 2005.
[11] J. R. Munkres, Elements of Algebraic Topology. Addison-Wesley, Reading, 1984.
[12] R. Kannan and A. Bachem, "Polynomial algorithms for computing the Smith and Hermite normal forms of an integer matrix," SIAM Journal on Computing, vol. 8, no. 4, pp. 499–507, 1979.
[13] B. R. Donald and D. R. Chang, "On the complexity of computing the homology type of a triangulation," in Proc. Annual Symp. Foundations of Computer Science, 1991, pp. 650–661.
[14] C. J. A. Delfinado and H. Edelsbrunner, "An incremental algorithm for Betti numbers of simplicial complexes on the 3-sphere," Computer Aided Geometric Design, vol. 12, no. 7, pp. 771–784, 1995.
[15] "REDHOM," http://redhom.ii.uj.edu.pl/.

[16] M. Mrozek and B. Batko, "Coreduction homology algorithm," Discrete & Computational Geometry, vol. 41, no. 1, pp. 96–118, 2009.
[17] M. Mrozek, P. Pilarczyk, and N. Żelazna, "Homology algorithm based on acyclic subspace," Computers & Mathematics with Applications, vol. 55, no. 11, pp. 2395–2412, 2008.
[18] S. Harker, K. Mischaikow, M. Mrozek, V. Nanda, H. Wagner, M. Juda, and P. Dłotko, "The efficiency of a homology algorithm based on discrete Morse theory and coreductions," in Proc. Intl. Workshop Computational Topology in Image Context (CTIC 2010). Image A, vol. 1, 2010, pp. 41–47.
[19] R. H. Lewis and A. Zomorodian, "Multicore homology," http://comptop.stanford.edu/preprints/, 2012.
[20] "METIS," http://glaros.dtc.umn.edu/gkhome/views/metis.
[21] R. Nasre, M. Burtscher, and K. Pingali, "Morph algorithms on GPUs," in Proc. ACM SIGPLAN Symp. Principles and Practice of Parallel Programming, 2013, pp. 147–156.
[22] "Aim@Shape," http://www.aimatshape.net/.
