A New Algorithm For Parallel Connected-Component Labelling On Gpus
Abstract—Connected-component labelling remains an important and widely-used technique for processing and analysing images and
other forms of data in various application areas. Different data sources produce components with different structural features and may
be more or less suited to certain connected-component labelling algorithms. Although many efficient serial algorithms exist,
determining connected-components on Graphical Processing Units (GPUs) is of interest as many applications use GPUs for
processing other parts of the application and labelling on the GPU can avoid expensive memory transfers. The general problem of
connected-component labelling is discussed and two existing GPU-based algorithms, label-equivalence and
Komura-equivalence, are reviewed. A new GPU-based, parallel component-labelling algorithm is presented that identifies and eliminates redundant
operations in the previous methods for rectilinear two- and three-dimensional datasets. A set of test cases with a range of structural
features and system sizes is presented and used to evaluate the new labelling algorithm on modern NVIDIA GPU devices and
compare it to existing algorithms. The results of the performance evaluation are presented and show that the new algorithm can provide
a meaningful performance improvement over previous methods across a range of test cases.
1 INTRODUCTION
Authorized licensed use limited to: CINVESTAV. Downloaded on February 10,2022 at 20:02:54 UTC from IEEE Xplore. Restrictions apply.
1218 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 29, NO. 6, JUNE 2018
hierarchically merge clusters and [4] which uses a very similar merge function based on atomic operations. These methods are discussed in further detail in Section 3.4.
The present article describes an algorithm that improves on previous GPU-based connected-component labelling methods [35], [40] by eliminating redundant operations for two- and three-dimensional rectilinear datasets. In cases where components are connected together through many different nodes, the previous algorithms may apply redundant operations. The algorithm identifies configurations in the rectilinear datasets where these redundant operations may be applied; each thread can efficiently identify such configurations from local data and eliminate them. The prior knowledge of the structure of a rectilinear dataset allows a redundant connection between nodes to be determined by accessing one additional neighbour in two dimensions and up to three neighbours in three dimensions from known addresses.
A similar approach could be applied to other datasets with regular structures to identify configurations where redundant operations may occur. The benefit the algorithm provides would depend on the number and cost of identifying such configurations and the expense of the redundant operations that can be eliminated. The approach may not be applicable for the general connected-component labelling problem that considers an arbitrary graph. Determining redundant connections between a node’s neighbours in an arbitrary graph would require comparison of their adjacency lists (or other adjacency representation). This would create a tradeoff between the advantage of eliminating redundant operations and the overhead of searching through the adjacencies. For many graphs, the cost of the overhead is likely to outweigh the benefit.
The rest of this article is organised as follows. Section 2 briefly introduces the connected-component labelling problem and GPU algorithm considerations. Relevant GPU labelling algorithms are described in Section 3. The new algorithm is presented in Section 4. A number of implementation and optimisation considerations for these algorithms are discussed in Section 5. A set of two- and three-dimensional test datasets is presented and used to evaluate and compare the different algorithms in Section 6. These results are then discussed in Section 7 and some final conclusions are drawn in Section 8.
2 BACKGROUND
The connected-component labelling problem is an algorithmic application of graph theory that considers a graph of nodes and assigns a unique label for each subset of connected vertices. However, many applications consider datasets with regular structures instead of arbitrary graphs. The connections between nodes are often defined by a function of the node data rather than from a set of edges.
This article considers regular, rectilinear datasets (such as images and volumetric data) where connections between nodes are determined by their data values. Each node represents a pixel in two dimensions or a voxel in three dimensions and in all examples the equality operator is used to determine connections between neighbouring nodes. However, all algorithms discussed will work correctly when any transitive function is used to determine whether two nodes are connected.
The algorithms discussed in this article consider the nodes to be four-connected in two dimensions and six-connected in three dimensions. Similar datasets may also be considered to be connected to additional neighbouring nodes such as eight-connected nodes (Moore neighbourhood) in two dimensions. The lattice structures themselves may also extend to higher dimensions or have other regular lattice structures. In this work the datasets are limited to four- and six-connected rectilinear lattices in two and three dimensions as they represent the most common cases of the connected-component labelling problem.
Two different boundary conditions for the datasets are considered: clamped and periodic. For applications in image processing or medical imaging, boundaries are generally considered to be clamped. Nodes on the boundary of the dataset simply don’t have any neighbours in the direction of the boundary. Periodic boundaries are also considered, where nodes on one boundary are connected to the corresponding node on the other side of the dataset. While these boundaries are generally not applicable for real-world images, they are commonly used in computational simulations [30], [31], [32], [41].
More recent algorithms take advantage of GPU architecture features that were not available on the early Tesla architecture devices. The most significant of these is the introduction and performance optimisation of atomic operations [4], [5], [40], [42]. Atomic operations were initially introduced in the compute capability 1.1 devices and their performance has been significantly improved in subsequent generations of GPUs [42], [43], [44], [45]. These architectures have also introduced a number of special instructions such as register shuffle and more powerful synchronization functions. These capabilities and improved performance enable new GPU algorithms and allow faster implementation of older algorithms.
3 PREVIOUS WORK
The GPU connected-component labelling algorithms discussed in this section are based on the label-equivalence algorithm described for NVIDIA Tesla architecture GPUs in [35]. The label-equivalence algorithm has become a common reference point used to compare subsequent GPU-based algorithms [4], [5], [40]. Subsequent improvements to the implementation of this algorithm for later generation GPUs have been presented in [37], [38] and more significantly the removal of kernel iteration from the algorithm [40] using a reduction method [39]. An updated description of the label-equivalence algorithm incorporating modern optimisations is presented for reference.
3.1 Label-Equivalence Algorithm
The label-equivalence algorithm identifies connected-components using three device kernels: initialisation, scan and analysis. The initialisation kernel sets the label of each node to the linear address of the node in the dataset; this ensures each label is initialised with a unique value that can also be used to identify the node. The scan kernel inspects the node’s neighbours to determine if it is connected to a
PLAYNE AND HAWICK: A NEW ALGORITHM FOR PARALLEL CONNECTED-COMPONENT LABELLING ON GPUS 1219
Fig. 10. Timing results of individual kernels for the two-dimensional datasets for a system size of 8192². The results show the average time (in milliseconds) for each kernel call, with the exception of the label-equivalence implementations, where the analysis and scan (for the direct-type) and border and analysis (for the two-stage) show the total of multiple kernel calls.
Fig. 12. Timing results of individual kernels for the three-dimensional datasets for a system size of 512³. The results show the average time (in milliseconds) for each kernel call, with the exception of the label-equivalence implementations, where the analysis and scan (for the direct-type) and border and analysis (for the two-stage) show the total of multiple kernel calls.
TABLE 1
Average Labelling Time Per Node (in ms) of the Different Labelling Algorithms for Large System Sizes
2D           LE-D          KE-D          PE-D          LE-B          KE-B          KE-B (shfl)   PE-B          PE-B (shfl)
Ising        1.344 ± 5.7%  0.581 ± 1.1%  0.510 ± 1.2%  0.967 ± 2.5%  0.502 ± 2.0%  0.485 ± 2.0%  0.438 ± 1.7%  0.398 ± 1.7%
Percolation  1.724 ± 3.8%  0.579 ± 1.2%  0.549 ± 1.1%  1.116 ± 2.0%  0.414 ± 2.1%  0.416 ± 2.1%  0.373 ± 2.3%  0.353 ± 2.0%
OASIS        0.971 ± 7.9%  0.467 ± 4.6%  0.338 ± 1.7%  0.930 ± 4.2%  0.605 ± 1.7%  0.573 ± 1.8%  0.530 ± 0.9%  0.477 ± 1.1%
USC-SIPI     1.132 ± 14%   0.552 ± 17%   0.370 ± 10%   0.962 ± 6.5%  0.593 ± 3.2%  0.562 ± 3.1%  0.520 ± 2.8%  0.468 ± 2.9%
Spiral       1.358 ± 3.0%  1.153 ± 3.7%  1.162 ± 3.6%  0.774 ± 0.7%  0.467 ± 2.3%  0.456 ± 2.9%  0.447 ± 2.2%  0.418 ± 2.2%
Hilbert      2.506 ± 0.8%  0.667 ± 7.4%  0.673 ± 7.5%  1.323 ± 0.3%  0.420 ± 8.1%  0.423 ± 8.1%  0.393 ± 8.7%  0.377 ± 9.3%
3D           LE-D          KE-D          PE-D          LE-B          KE-B          KE-B (shfl)   PE-B          PE-B (shfl)
Ising        1.935 ± 2.9%  1.314 ± 7.2%  1.017 ± 1.0%  1.398 ± 5.2%  0.732 ± 2.6%  0.721 ± 2.7%  0.645 ± 2.8%  0.617 ± 3.0%
Percolation  2.738 ± 4.6%  1.167 ± 0.3%  1.013 ± 0.3%  1.891 ± 3.4%  0.768 ± 1.2%  0.759 ± 1.2%  0.652 ± 0.9%  0.632 ± 0.9%
OASIS        1.518 ± 21%   0.700 ± 9.8%  0.540 ± 6.1%  1.299 ± 11%   0.754 ± 1.9%  0.699 ± 2.0%  0.686 ± 0.7%  0.610 ± 0.6%
Hilbert      3.497 ± 6.4%  0.829 ± 0.5%  0.800 ± 0.6%  2.284 ± 3.2%  0.734 ± 0.3%  0.730 ± 0.3%  0.610 ± 0.1%  0.624 ± 0.2%
Results show the average labelling time per node averaged across dataset sizes of [4096², 8192²] for two dimensions and [256³, 512³] for three dimensions. The timing data is presented in ms and error estimates show the standard deviation as a percentage for the label-equivalence (LE), Komura-equivalence (KE) and Playne-equivalence (PE) as direct-type (-D) and as two-stage (-B) methods with shuffle operations (shfl) and without.
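To make the comparisons in Table 1 easier to read, the following Python sketch turns a few of the tabulated mean times into relative figures. It is only a worked-arithmetic illustration: the dictionary copies three two-stage columns from the two-dimensional rows of the table, the helper names are hypothetical, and the error estimates are ignored.

```python
# Mean labelling time per node for the two-dimensional rows of Table 1,
# in the units given in the table. Only three of the eight columns are copied.
TABLE_2D = {
    "Ising":       {"LE-B": 0.967, "KE-B (shfl)": 0.485, "PE-B (shfl)": 0.398},
    "Percolation": {"LE-B": 1.116, "KE-B (shfl)": 0.416, "PE-B (shfl)": 0.353},
    "OASIS":       {"LE-B": 0.930, "KE-B (shfl)": 0.573, "PE-B (shfl)": 0.477},
    "USC-SIPI":    {"LE-B": 0.962, "KE-B (shfl)": 0.562, "PE-B (shfl)": 0.468},
    "Spiral":      {"LE-B": 0.774, "KE-B (shfl)": 0.456, "PE-B (shfl)": 0.418},
    "Hilbert":     {"LE-B": 1.323, "KE-B (shfl)": 0.423, "PE-B (shfl)": 0.377},
}


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


def percent_reduction(baseline, candidate):
    """Reduction in time per node relative to the baseline, in percent."""
    return 100.0 * (baseline - candidate) / baseline


def summarise(table, candidate="PE-B (shfl)"):
    """Return {dataset: (speedup over LE-B, % reduction versus KE-B (shfl))}."""
    summary = {}
    for dataset, row in table.items():
        summary[dataset] = (
            speedup(row["LE-B"], row[candidate]),
            percent_reduction(row["KE-B (shfl)"], row[candidate]),
        )
    return summary


if __name__ == "__main__":
    for dataset, (vs_le, vs_ke) in summarise(TABLE_2D).items():
        print(f"{dataset:<12} {vs_le:5.2f}x over LE-B, "
              f"{vs_ke:5.1f}% less time than KE-B (shfl)")
```

For the two-dimensional Ising row, for example, this gives a ratio of roughly 2.4 between LE-B and PE-B (shfl). The other columns and the three-dimensional rows can be compared in the same way by extending the dictionary.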
equivalence algorithm can provide between a 2x and 5x speedup over the label-equivalence algorithm for reasonably large datasets across a range of test cases.
The results also show that the Playne-equivalence algorithm provides a consistent performance benefit over the Komura-equivalence algorithm across a range of test cases and system sizes in both the two- and three-dimensional datasets. The Playne-equivalence algorithm provides the most improvement in test cases that have a higher ratio of redundant connections between components, namely the USC-SIPI and OASIS cases. These real-world datasets tend to consist of larger domains with fewer single-node connections between clusters. The benefit of the algorithm is less pronounced in datasets with almost no redundant connections, such as the spiral and Hilbert datasets. These datasets consist of very complex structures mostly connected by single nodes and present less opportunity to filter out unnecessary connections. The Ising and percolation test cases still show an important improvement as they consist of larger structures but still contain many single-node connections. However, the improved reduction (merge) method still provides a performance benefit in these scenarios.
The results also agree with the conclusion in [40] that the two-stage implementation of the Komura-equivalence algorithm performs better than the direct-type implementation for the Ising model. A similar performance improvement is also present in the percolation, spiral and Hilbert space-filling curve datasets. However, the result is not so straightforward for the USC-SIPI and OASIS datasets. In these cases the direct-type implementations can provide the best performance for some larger system sizes. As these datasets contain larger structures, there are fewer small structures that can be resolved within a single block, and the benefit of labelling and then merging blocks is outweighed by the additional costs of converting between local and global labels and the additional synchronisation required.
Analysis of the individual kernels produces several findings; the two-dimensional case is discussed first, followed by the three-dimensional case. For two-dimensional datasets, the direct-type Playne-equivalence algorithm provides a performance benefit by reducing the run-time of the reduction kernel. This is the only kernel that differs from the Komura-equivalence algorithm and the exact improvement is dependent on the dataset. For the two-stage implementations, the block kernel is directly comparable across all three algorithms, with the label-equivalence performing the slowest and Playne-equivalence the fastest. The Playne-equivalence algorithm also outperforms the Komura-equivalence algorithm in the border kernel as this also benefits from the improved reduction method and configuration filtering. As expected there is no distinguishable difference in the performance of the analysis kernels as they perform the same operations. The label-equivalence border and analysis kernels are not directly comparable as they must be iteratively called multiple times.
In three-dimensional datasets the direct-type Playne-equivalence algorithm also provides a performance improvement by reducing the run-time of the reduction kernel. One unexpected result is the performance of the block kernel, where the label-equivalence algorithm performs faster than the Komura-equivalence algorithm. The Playne-equivalence block kernel does perform slightly faster than the label-equivalence version but the improvement is only small compared to the two-dimensional case. The Komura-equivalence and Playne-equivalence algorithms are both able to merge blocks together significantly faster than the label-equivalence algorithm, with the Playne-equivalence border kernel providing the best performance. However, these kernel results show that the two-stage Komura-equivalence algorithm could be improved by performing the initial labelling of the blocks with the label-equivalence algorithm.
The use of __shfl operations plays a significant role in achieving the best performance for the two-stage Playne-equivalence algorithm. While these shuffle operations can be used to improve the two-stage Komura-equivalence algorithm, the improvement is less significant. This is due to the fact that the Komura-equivalence algorithm can only access a single neighbour through __shfl operations. The Playne-equivalence algorithm accesses additional neighbours, of which two can be accessed through __shfl operations in two dimensions and three in three dimensions. The benefit of avoiding unnecessary reduction operations can be achieved without significantly increasing the memory transactions. The shuffle-based implementation of the two-stage Playne-equivalence algorithm provides a performance benefit over the implementation without shuffle operations for all cases except for
the three-dimensional Hilbert case; the exact reason for this is currently unknown and warrants further investigation.
The algorithm presented takes advantage of the rectilinear structure of the datasets which allows redundant connections to be identified through simple configurations. The address of the local nodes that may provide these redundant connections can be easily calculated and accessed. A similar approach could be applied to other regular datasets such as hexagonal lattices once the appropriate configurations have been identified. It would not be so straightforward to identify such nodes for an arbitrary graph and in this case the cost of accessing many neighbouring nodes may outweigh the benefits. However, this algorithm should prove useful for the many applications that use regular, rectilinear datasets.
8 CONCLUSIONS
A new parallel, GPU-based connected-component labelling algorithm has been described that improves on the method described in [40] by improving the reduction method and only applying it to nodes that match certain configurations. The configurations can be used to identify redundant connections between label clusters and have been described for two- and three-dimensional datasets with both clamped and periodic boundary conditions. Identifying and disregarding these redundant connections is specific to parallel algorithms as most sequential algorithms will already have resolved the connections due to the order of execution.
While the labelling method does require access to additional neighbouring nodes, it has been shown that the cost of accessing this additional information can be mitigated through the use of shared memory and __shfl operations provided by CUDA. For the two-stage implementation of the algorithm, the necessary information can be loaded with no additional memory reads for two-dimensional datasets and only one additional read for three-dimensional datasets. Utilising these special functions enables the algorithm to perform local filtering without increasing memory bandwidth requirements.
A set of two- and three-dimensional test datasets representing common and near worst-case datasets has been presented and used to evaluate and compare the proposed algorithm with the Komura-equivalence and label-equivalence algorithms. The new algorithm provides a performance improvement of approximately 10-20 percent over the Komura-equivalence algorithm and 40-60 percent over the label-equivalence algorithm in most cases. The two-stage implementation provides the best performance in many cases but the direct-type implementation performed better for the real-world datasets consisting of large structures.
The Playne-equivalence algorithm proposed in this article provides a meaningful performance improvement over existing algorithms across a range of datasets and system sizes by identifying and eliminating unnecessary operations.
ACKNOWLEDGMENTS
The USC-SIPI image database is made available by the University of Southern California - Signal and Image Processing Institute.
Open Access Series of Imaging Studies (OASIS) is made available by Dr. Randy Buckner at the Howard Hughes Medical Institute (HHMI) at Harvard University, the Neuroinformatics Research Group (NRG) at Washington University School of Medicine, and the Biomedical Informatics Research Network (BIRN) supported by NIH grants P50 AG05681, P01 AG03991, P20 MH071616, RR14075, RR 16594, U24 RR21382, the Alzheimer’s Association, the James S. McDonnell Foundation, the Mental Illness and Neuroscience Discovery Institute, and HHMI.
REFERENCES
[1] A. Eklund, P. Dufort, M. Villani, and S. LaConte, “BROCCOLI: Software for fast fMRI analysis on many-core CPUs and GPUs,” Frontiers Neuroinformatics, vol. 8, no. 24, pp. 1–19, 2014.
[2] P. Chen, H. L. Zhao, C. Tao, and H. Sang, “Block-run based connected component labelling algorithm for GPGPU using shared memory,” Electron. Lett., vol. 47, no. 24, pp. 1309–1311, Nov. 2011.
[3] A. A. Yildirim and C. Ozdogan, “Parallel wavelet-based clustering algorithm on GPUs using CUDA,” Procedia Comput. Sci., vol. 3, pp. 396–400, 2011.
[4] V. M. A. Oliveira and R. A. Lotufo, “A study on connected component labeling algorithms using GPUs,” in Proc. 23rd Conf. Graph. Patterns Images, 2010.
[5] O. Stava and B. Benes, “Connected component labeling in CUDA,” in GPU Computing Gems Emerald Edition, Amsterdam, Netherlands: Elsevier Inc, 2011, pp. 569–581.
[6] K. Yonehara and K. Aizawa, “A line-based connected component labelling algorithm using GPUs,” in Proc. 3rd Int. Symp. Comput. Netw., Dec. 2015, pp. 341–345.
[7] H. Wenke, S. Kolodzey, and O. Vornberger, “A work-optimal parallel connected-component labelling algorithm for 2D-image-data using pre-contouring,” in Proc. Int. Workshops Elect. Comput. Eng. Subfields, Aug. 22–23, 2014, pp. 154–161.
[8] A. Rasmusson, T. Sørensen, and G. Ziegler, “Connected components labeling on the GPU with generalization to Voronoi diagrams and signed distance fields,” in Proc. Advances Vis. Comput., 2013, vol. 8033, pp. 206–215.
[9] B. Preto, et al., “A GPU-enabled algorithm for 3D image labelling,” in Proc. ENURS 1st Meet. Synchrotron Radiation Users Portugal, 2012, pp. 1–2.
[10] M. Jablonski and M. Gorgon, “Handel-C implementation of classical component labelling algorithm,” in Proc. EUROMICRO Syst. Digital Des., 2004, pp. 387–393.
[11] B. Thornburg and N. Lawal, “Real-time component labelling and feature extraction on FPGA,” in Proc. Int. Symp. Signals Circuits Syst., Jul. 2009, pp. 1–4.
[12] D. Crookes and K. Benkrid, “An FPGA implementation of image component labelling,” Proc. SPIE, 1999, vol. 3844, pp. 17–23.
[13] S. Byna, Prabhat, M. F. Wehner, and K. Wu, “Detecting atmospheric rivers in large climate datasets,” in Proc. 2nd Int. Workshop Petascale Data Analytics: Challenges Opportunities, Nov. 14, 2011, pp. 7–14.
[14] K. H. Karstensen, “Silhouette extraction using graphics processing units,” Master’s thesis, Dept. Informatics, Univ. Oslo, Oslo, Norway, 2012.
[15] T. Lelore and F. Bouchara, “FAIR: A Fast Algorithm for document Image Restoration,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 2039–2048, Aug. 2013.
[16] L. Riha and M. Mareboyana, “GPU accelerated one-pass algorithm for computing minimal rectangles of connected components,” in Proc. IEEE Workshop Appl. Comput. Vis., 2011, pp. 479–484.
[17] A. Dubois and F. Charpillet, “Tracking mobile objects with several Kinect using HMMs and component labelling,” in Proc. Int. Conf. Intell. Robots Syst. Workshop Assistance Service Robot. Human Environment, 2012, pp. 1–7.
[18] A. Abramov, T. Kulvicius, F. Worgotter, and B. Dellen, “Real-time image segmentation on a GPU,” in Facing the Multicore-Challenge, Berlin, Germany: Springer, 2010, no. 6310, pp. 131–142.
[19] A. Korbes and G. B. Vitor, “Advances on watershed processing on GPU architecture,” in Proc. 10th Int. Symp. Math. Morphology Appl. Image Signal Process., 2011, pp. 260–271.
[20] A. N. Moga and M. Gabbouj, “Parallel image component labeling with watershed transformation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 5, pp. 441–450, May 1997.
[21] T. Berka, “The generalized feed-forward loop motif: Definition, detection and statistical significance,” Procedia Comput. Sci., vol. 11, pp. 75–87, 2012.
[22] J. Janes, “Efficient integration of multiscale n-body systems,” Master’s thesis, Instituut voor Informatica, Universiteit Van Amsterdam, Amsterdam, Netherlands, 2011.
[23] M. J. Dineen, M. Khosravani, and A. Probert, “Using OpenCL for implementing simple parallel graph algorithms,” in Proc. Int. Conf. Parallel Distrib. Process. Techn. Appl., 2011, pp. 268–273.
[24] V. C. Barbosa and R. G. Ferreira, “On the phase transitions of graph coloring and independent sets,” Physica A, vol. 343, pp. 401–423, Nov. 2004.
[25] M. C. E. van der Ven, “The edge-bandwidth minimization problem,” Master’s thesis, Department of Econometrics & Operations Research, Tilburg Univ., Tilburg, Netherlands, 2012.
[26] M. Nowakiewicz, “A path non-existence proof in motion planning using CUDA,” in Proc. IEEE/ASME Int. Conf. Adv. Intell. Mechatronics, Jul. 6–9, 2010, pp. 973–979.
[27] H. V. Pham, B. Bhaduri, K. Tangella, C. Best-Popescu, and G. Popescu, “Real time blood testing using quantitative phase imaging,” PLOS One, vol. 8, no. 2, pp. 1–9, Feb. 2013.
[28] M. Weigel, “Connected-component identification and cluster update on graphics processing units,” Physical Review E, vol. 84, no. 3, p. 036709, Sept. 2011.
[29] Y. Komura and Y. Okabe, “Poster: Multi-GPU-based calculation of percolation problem on the TSUBAME 2.0 supercomputer,” in Proc. SC Companion High Perform. Comput., Netw. Storage Anal., 2012, pp. 1369–1369.
[30] Y. Komura and Y. Okabe, “Multi-GPU-based Swendsen-Wang multi-cluster algorithm for the simulation of two-dimensional q-state Potts model,” Comput. Physics Commun., vol. 184, pp. 40–44, 2013.
[31] Y. Komura and Y. Okabe, “GPU-based Swendsen-Wang multi-cluster algorithm for the simulation of two-dimensional classical spin systems,” Comput. Physics Commun., vol. 183, pp. 1155–1161, 2012.
[32] Y. Komura and Y. Okabe, “Large-scale Monte Carlo simulation of two-dimensional classical XY model using multiple GPUs,” J. Phys. Soc. Japan Lett., vol. 81, no. 11, pp. 1–4, 2012.
[33] S. Sah, Y. Roh, K. Chang, and D. Jeong, “Phase map generation for phase shift Moire using CUDA,” in Proc. 12th Int. Conf. Parallel Distrib. Comput. Appl. Technol., 2012, pp. 140–145.
[34] D. L. Arendt, “In search of self-organisation,” Ph.D. dissertation, Department of Computer Science, Virginia Polytechnic Institute, Blacksburg, VA, Mar. 2012.
[35] K. A. Hawick, A. Leist, and D. P. Playne, “Parallel graph component labelling with GPUs and CUDA,” Parallel Comput., vol. 36, no. 12, pp. 655–678, Dec. 2010.
[36] K. Suzuki, I. Horiba, and N. Sugie, “Fast connected-component labeling based on sequential local operations in the course of forward raster scan followed by backward raster scan,” in Proc. 15th Int. Conf. Pattern Recognit., vol. 2, 2000, pp. 434–437.
[37] O. Kalentev, A. Rai, S. Kemnitz, and R. Schneider, “Connected component labelling on a 2D grid using CUDA,” J. Parallel Distrib. Comput., vol. 71, no. 4, pp. 615–620, 2011.
[38] I. Jung and C. Jeong, “Parallel connected-component labeling algorithm for GPGPU applications,” in Proc. Int. Symp. Commun. Inf. Technol., 2010, pp. 1149–1153.
[39] F. Wende and T. Stienke, “Swendsen-Wang multi-cluster algorithm for the 2D/3D Ising model on Xeon Phi and GPU,” in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2013, pp. 1–12.
[40] Y. Komura, “GPU-based cluster-labeling algorithm without the use of conventional iteration: Application to the Swendsen-Wang multi-cluster spin flip algorithm,” Comput. Physics Commun., vol. 194, pp. 54–58, 2015.
[41] A. Leist, D. P. Playne, and K. A. Hawick, “Interactive visualisation of spins and clusters in regular and small-world Ising models with CUDA on GPUs,” J. Comput. Sci., vol. 1, pp. 33–40, 2010.
[42] CUDA C Programming Guide v8.0, Santa Clara, CA, USA: NVIDIA Corporation, January 2017.
[43] NVIDIA’s Next Generation CUDA Compute Architecture: Fermi, 1st ed., Santa Clara, CA, USA: NVIDIA Corporation, 2009.
[44] NVIDIA’s Next Generation CUDA Compute Architecture: Kepler GK110, 1st ed., Santa Clara, CA, USA: NVIDIA Corporation, 2012.
[45] NVIDIA Tesla P100, 1st ed., Santa Clara, CA, USA: NVIDIA Corporation, 2016.
[46] B. Preto, F. Birra, A. Lopes, and P. Medeiros, “Object identification in binary tomographic images using GPGPUs,” Int. J. Creative Interfaces Comput. Graph., vol. 4, no. 2, pp. 40–56, 2013.
[47] M. Giles, E. Laszlo, I. Reguly, J. Appleyard, and J. Demouth, “GPU implementation of finite difference solvers,” in Proc. 7th Workshop High Perform. Comput. Finance, Nov. 2014, pp. 1–8.
[48] R. J. Glauber, “Time-dependent statistics of the ising model,” J. Math. Physics, vol. 4, pp. 294–307, 1963.
[49] E. Ising, “Beitrag zur theorie des ferromagnetismus,” Zeitschrift fur Physik, vol. 31, no. 1, pp. 253–258, 1925.
[50] D. S. Marcus, T. H. Wang, J. Parker, J. G. Csernansky, J. C. Morris, and R. L. Buckner, “Open access series of imaging studies (OASIS): Cross-sectional MRI data in young, middle aged, nondemented, and demented older adults,” J. Cognitive Neuroscience, vol. 19, pp. 1498–1507, 2007.
[51] A. G. Weber, “The USC-SIPI image data base: Version 5,” Univ. Southern California, Los Angeles, CA, Tech. Rep. USC-SIPI-315, 1997.
Daniel Peter Playne received the PhD degree in computer science from Massey University, in 2012. He is a senior lecturer of computer science and director of the Parallel Computing Centre at Massey University in New Zealand. He is a member of the ACM.
Ken Hawick received the PhD degree in computational physics from the University of Edinburgh. He is professor of computer science and director of The Digital Centre at the University of Hull in the United Kingdom. He is a senior member of the ACM, a member of the IEEE, IEEE Computer Society, SIAM, and the IET, and is a fellow of the Institute of Physics, a fellow of the British Computer Society, and a fellow of the Royal Meteorological Society, an associate member of the Chartered Management Institute and is a member of the Royal Society of New Zealand.