
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 29, NO. 6, JUNE 2018
A New Algorithm for Parallel Connected-Component Labelling on GPUs

Daniel Peter Playne and Ken Hawick, Member, IEEE

Abstract—Connected-component labelling remains an important and widely-used technique for processing and analysing images and other forms of data in various application areas. Different data sources produce components with different structural features and may be more or less suited to certain connected-component labelling algorithms. Although many efficient serial algorithms exist, determining connected-components on Graphical Processing Units (GPUs) is of interest as many applications use GPUs for processing other parts of the application and labelling on the GPU can avoid expensive memory transfers. The general problem of connected-component labelling is discussed and two existing GPU-based algorithms are reviewed—label-equivalence and Komura-equivalence. A new GPU-based, parallel component-labelling algorithm is presented that identifies and eliminates redundant operations in the previous methods for rectilinear two- and three-dimensional datasets. A set of test-cases with a range of structural features and system sizes is presented and used to evaluate the new labelling algorithm on modern NVIDIA GPU devices and compare it to existing algorithms. The results of the performance evaluation are presented and show that the new algorithm can provide a meaningful performance improvement over previous methods across a range of test cases.

Index Terms—Connected-component labelling, GPGPU, parallel computing

1 INTRODUCTION

CONNECTED-COMPONENT labelling (CCL) algorithms play an important role in many applications—especially pattern analysis and image processing. The connected-component labelling problem involves assigning a unique label to every element in a connected-component within a dataset. The general problem considers a graph of nodes but for many applications the dataset has a regular structure such as images and volumetric data.

As hardware development drives processing architectures to become more parallel, there is an increasing interest in developing parallel algorithms for solving common problems. Graphical Processing Units (GPUs) are now a common computing architecture due to their high computational throughput for suitable applications. Many image processing methods can be implemented easily on GPUs and provide good performance, but less regular problems, such as CCL, tend to achieve lower performance. However, for some applications, even providing comparable performance to a CPU algorithm can be sufficient for a GPU algorithm as it can eliminate data transfer between the device and host and allows the entire application to be computed by the GPU [1].

A number of algorithms for solving the connected-component labelling problem on Graphical Processing Units have been described for a range of different applications. Related works include: the Block-Run based algorithm [2], the Sliding Window approach [3], the Union-Find algorithm [4], [5], a line-based algorithm [6], Pre-Contouring [7], extensions to Voronoi Diagrams [8] and Three-Dimensional labelling [9]. Other related developments include work on algorithms for Field Programmable Gate Arrays [10], [11], [12]. These algorithms and implementations often focus on particular cases or data patterns.

The connected-component labelling algorithms themselves have a wide range of applications including: Object Detection including Detection of Rivers [13], Silhouette Extraction [14], Document Restoration [15], detection of minimal rectangles [16] and Object Tracking [17], Image Segmentation [18], [19], [20], Graph Analysis [21], [22], [23], [24], [25], proof of Path Non-Existence [26], Medical Imaging Applications [1], [27], Cluster-based updates and analysis of physical models [28], [29], [30], [31], [32], [33] and Self Organisation [34].
In previous work, a connected-component labelling algorithm was presented for Tesla-architecture NVIDIA Graphical Processing Units [35] based on a serial, multi-pass algorithm presented by Suzuki et al. [36]. Improvements on this algorithm for more recent GPU architectures have since been presented [37], [38]. Work by [39], [40] has shown that performance can be improved by replacing the iterative merging of labels with a label reduction method based on atomic operations, but only presents data for patterns arising from the Ising model. The algorithms presented by Komura take a very similar approach to the Union-Find methods discussed in [5], which uses global memory and iteration to hierarchically merge clusters, and [4], which uses a very similar merge function based on atomic operations. These methods are discussed in further detail in Section 3.4.

D. Playne is with the Institute of Natural and Mathematical Sciences, Massey University, Auckland, Wellington, New Zealand. E-mail: [email protected].
K. Hawick is with The Digital Centre, University of Hull, Hull HU6 7RX, United Kingdom. E-mail: [email protected].
Manuscript received 12 Apr. 2017; revised 10 Jan. 2018; accepted 12 Jan. 2018. Date of publication 30 Jan. 2018; date of current version 11 May 2018.
(Corresponding author: Daniel Peter Playne.)
Recommended for acceptance by A. Melo.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TPDS.2018.2799216
1045-9219 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Authorized licensed use limited to: CINVESTAV. Downloaded on February 10,2022 at 20:02:54 UTC from IEEE Xplore. Restrictions apply.

The present article describes an algorithm that improves on previous GPU-based connected-component labelling methods [35], [40] by eliminating redundant operations for two- and three-dimensional rectilinear datasets. In cases where components are connected together through many different nodes, the previous algorithms may apply redundant operations. The new algorithm identifies configurations in the rectilinear datasets where these redundant operations may be applied; each thread can efficiently identify such configurations from local data and eliminate them. The prior knowledge of the structure of a rectilinear dataset allows a redundant connection between nodes to be determined by accessing one additional neighbour in two-dimensions and up to three neighbours in three-dimensions from known addresses.

A similar approach could be applied to other datasets with regular structures to identify configurations where redundant operations may occur. The benefit the algorithm provides would depend on the number and cost of identifying such configurations and the expense of the redundant operations that can be eliminated. The approach may not be applicable to the general connected-component labelling problem that considers an arbitrary graph. Determining redundant connections between a node's neighbours in an arbitrary graph would require comparison of their adjacency lists (or other adjacency representation). This would create a tradeoff between the advantage of eliminating redundant operations and the overhead of searching through the adjacencies. For many graphs, the cost of the overhead is likely to outweigh the benefit.

The rest of this article is organised as follows. Section 2 briefly introduces the connected-component labelling problem and GPU algorithm considerations. Relevant GPU labelling algorithms are described in Section 3. The new algorithm is presented in Section 4. A number of implementation and optimisation considerations for these algorithms are discussed in Section 5. A set of two- and three-dimensional test datasets is presented and used to evaluate and compare the different algorithms in Section 6. These results are then discussed in Section 7 and some final conclusions are drawn in Section 8.

2 BACKGROUND

The connected-component labelling problem is an algorithmic application of graph theory that considers a graph of nodes and assigns a unique label to each subset of connected vertices. However, many applications consider datasets with regular structures instead of arbitrary graphs. The connections between nodes are often defined by a function of the node data rather than from a set of edges.

This article considers regular, rectilinear datasets (such as images and volumetric data) where connections between nodes are determined by their data values. Each node represents a pixel in two-dimensions or a voxel in three-dimensions and in all examples the equality operator is used to determine connections between neighbouring nodes. However, all algorithms discussed will work correctly when any transitive function is used to determine whether two nodes are connected.

The algorithms discussed in this article consider the nodes to be four-connected in two-dimensions and six-connected in three-dimensions. Similar datasets may also be considered to be connected to additional neighbouring nodes, such as eight-connected nodes (Moore neighbourhood) in two-dimensions. The lattice structures themselves may also extend to higher dimensions or take other regular forms. In this work the datasets are limited to four- and six-connected rectilinear lattices in two- and three-dimensions as they represent the most common cases of the connected-component labelling problem.

Two different boundary conditions for the datasets are considered—clamped and periodic. For applications in image processing or medical imaging, boundaries are generally considered to be clamped: nodes on the boundary of the dataset simply don't have any neighbours in the direction of the boundary. Periodic boundaries are also considered, where nodes on one boundary are connected to the corresponding node on the other side of the dataset. While these boundaries are generally not applicable for real-world images, they are commonly used in computational simulations [30], [31], [32], [41].

More recent algorithms take advantage of GPU architecture features that were not available on the early Tesla architecture devices. The most significant of these is the introduction and performance optimisation of atomic operations [4], [5], [40], [42]. Atomic operations were initially introduced in the compute capability 1.1 devices and their performance has been significantly improved in subsequent generations of GPUs [42], [43], [44], [45]. These architectures have also introduced a number of special instructions such as register shuffle and more powerful synchronisation functions. These capabilities and the improved performance enable new GPU algorithms and allow faster implementations of older algorithms.

3 PREVIOUS WORK

The GPU connected-component labelling algorithms discussed in this section are based on the label-equivalence algorithm described for NVIDIA Tesla architecture GPUs in [35]. The label-equivalence algorithm has become a common reference point used to compare subsequent GPU-based algorithms [4], [5], [40]. Subsequent improvements to the implementation of this algorithm for later generation GPUs have been presented in [37], [38] and, more significantly, the removal of kernel iteration from the algorithm [40] using a reduction method [39]. An updated description of the label-equivalence algorithm incorporating modern optimisations is presented for reference.

3.1 Label-Equivalence Algorithm

The label-equivalence algorithm identifies connected-components using three device kernels: initialisation, scan and analysis. The initialisation kernel sets the label of each node to the linear address of the node in the dataset; this ensures each label is initialised with a unique value that can also be used to identify the node. The scan kernel inspects the node's neighbours to determine if it is connected to a node with a lesser label; if so, the smallest label is adopted.
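All of the kernels discussed here assume a linear addressing scheme over the lattice and a neighbour-enumeration function. The following serial Python sketch shows one way these could look for a four-connected 2D lattice with clamped or periodic boundaries; the helper names (`linear_index`, `neighbours_2d`) are illustrative and not taken from the paper.

```python
# Serial sketch of the addressing helpers the labelling kernels assume.
# Names are illustrative, not from the paper's CUDA implementation.

def linear_index(x, y, width):
    """Linear address of node (x, y) in row-major order."""
    return y * width + x

def neighbours_2d(k, width, height, periodic=False):
    """Four-connected neighbour addresses of node k.

    With clamped boundaries, neighbours beyond the edge are omitted;
    with periodic boundaries, indices wrap to the opposite edge.
    """
    x, y = k % width, k // width
    result = []
    for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nx, ny = x + dx, y + dy
        if periodic:
            result.append(linear_index(nx % width, ny % height, width))
        elif 0 <= nx < width and 0 <= ny < height:
            result.append(linear_index(nx, ny, width))
    return result
```

For a 4x4 lattice, for example, the corner node 0 has clamped neighbours [1, 4] but periodic neighbours [3, 1, 12, 4], which is the difference exploited by the boundary handling discussed later.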

Finally, the analysis kernel looks up the node's label to determine if that label is in turn connected to an even smaller label; this process is repeated until the smallest label at the end of the chain is found. The node's label is then updated to this new, smaller label.

As a node may be connected to several other neighbouring nodes, there is no simple way to determine which label will eventually resolve to the smallest value. To correctly label a dataset, the scan and analysis functions must be repeated until the correct labels are found. The three kernels for the label-equivalence algorithm are summarised in Algorithm 1.

Algorithm 1. Label-equivalence kernels. L represents the array of labels, I the dataset, index(thread, block) calculates a thread's linear address from its thread and block ids, neighbours(k) returns the neighbours of node k and thread_id & block_id are provided by CUDA to uniquely identify the thread.

kernel initialisation(L)
    k_i ← index(thread_id, block_id)
    L[k_i] ← k_i
end kernel

kernel analysis(L)
    k_i ← index(thread_id, block_id)
    L[k_i] ← find_root(L, L[k_i])
end kernel

function find_root(L, label)
    while label ≠ L[label] do
        label ← L[label]
    end while
    return label
end function

kernel scan(L, I, changed)
    k_i ← index(thread_id, block_id)
    label_1 ← L[k_i]
    label_2 ← label_1
    for all k_n ∈ neighbours(k_i) do
        if I[k_i] = I[k_n] then
            label_2 ← min(label_2, L[k_n])
        end if
    end for
    if label_2 < label_1 then
        L[label_1] ← label_2
        changed ← true
    end if
end kernel
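As a concrete reference for Algorithm 1, the following is a serial Python sketch of the three label-equivalence kernels applied over a 2D grid until no scan makes a change. It is a sketch of the logic only: the paper's kernels run one thread per node on the GPU, and the helper names here are illustrative.

```python
# Serial Python sketch of the label-equivalence kernels in Algorithm 1.

def neighbours(k, width, height):
    """Four-connected neighbour addresses (clamped boundaries)."""
    x, y = k % width, k // width
    out = []
    if x > 0: out.append(k - 1)
    if x < width - 1: out.append(k + 1)
    if y > 0: out.append(k - width)
    if y < height - 1: out.append(k + width)
    return out

def find_root(L, label):
    """Follow the chain of labels to the smallest label at its end."""
    while label != L[label]:
        label = L[label]
    return label

def label_equivalence(I, width, height):
    n = width * height
    L = list(range(n))                      # initialisation kernel
    changed = True
    while changed:                          # repeat scan + analysis
        changed = False
        for ki in range(n):                 # scan kernel
            label1 = L[ki]
            label2 = min([label1] + [L[kn] for kn in neighbours(ki, width, height)
                                     if I[ki] == I[kn]])
            if label2 < label1:
                L[label1] = label2
                changed = True
        for ki in range(n):                 # analysis kernel
            L[ki] = find_root(L, L[ki])
    return L
```

Running this on a small binary image assigns each component the smallest linear address it contains, which is the same fixed point the GPU kernels converge to.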
The original implementation of this algorithm [35] used texture-memory bound to a read-only CUDA Array to cache neighbouring memory access for input and a separate writeable, linear array of labels. As GPU architectures have improved, the penalty of accessing non-aligned global memory has been greatly reduced (especially since the introduction of L1/L2 cache) [42] and later implementations store labels in a single global memory array [37].

Fig. 1. Example of the Komura-equivalence algorithm - a shows the linear address of each node, b shows the initial labels assigned to each node (with root nodes highlighted), c shows the effect of applying an analysis kernel to b, d highlights the nodes where the reduction operation will be applied, e shows the effect of this reduction and finally f shows the final labelled dataset after the second analysis kernel has been applied.

3.2 Komura-Equivalence Algorithm

Komura proposed a number of improvements to the label-equivalence algorithm, the most significant of which removes the need for repeated calls to the scan and analysis kernels [40]. Instead the algorithm uses three kernels—initialisation, analysis and reduction. The Komura-equivalence algorithm performs a very similar set of operations to the Union-Find labelling methods described in [4], [5].

The initialisation kernel differs from the label-equivalence algorithm in that it initialises the label of each node to the smallest linear address of its connected neighbours (see Fig. 1b). This takes advantage of the fact that the linear address (the initial label according to the label-equivalence algorithm) can be directly calculated. As only smaller labels will be adopted, only neighbours with smaller linear addresses are considered in this kernel. The analysis kernel is the same as in the label-equivalence algorithm and follows the chain of labels to the lowest label. The results of applying this kernel are shown in Figs. 1c and 1f.

The most significant change is the reduction kernel that merges different label clusters by connecting their root nodes (a root node is a node with the same label as its linear address). The reduction kernel detects two different labels that are connected and joins their root nodes using a reduction method that makes use of atomic operations [39], [40]. The root nodes (solid colour) and the nodes where reduction will be applied (the two colours of the root nodes they connect) are illustrated in Fig. 1d. This kernel will merge all clusters in a single kernel call by moving the iteration inside the kernel. The kernels for this algorithm are shown in Algorithm 2.

Algorithm 2. Komura-equivalence algorithm kernels. L represents the labels, I the dataset, index() calculates a thread's linear address from its thread and block ids, neighbours(k) returns the neighbours of k, reduction_neighbours(k) returns the neighbours that a reduction operation may be applied to and thread_id & block_id are provided by CUDA to uniquely identify the thread. The definition of find_root can be found in Algorithm 1.

kernel initialisation(L, I)
    k_i ← index(thread_id, block_id)
    label ← k_i
    for all k_n ∈ neighbours(k_i) do
        if k_n < k_i and I[k_i] = I[k_n] then
            label ← min(label, k_n)
        end if
    end for
    L[k_i] ← label
end kernel

kernel reduction(L, I)
    k_i ← index(thread_id, block_id)
    for all k_n ∈ reduction_neighbours(k_i) do
        reduce(L, k_i, k_n)
    end for
end kernel

function reduce(L, k_1, k_2)
    label_1 ← find_root(L, k_1)
    label_2 ← find_root(L, k_2)
    flag ← true
    if I[k_1] = I[k_2] and label_1 ≠ label_2 then
        flag ← false
    end if
    if label_1 < label_2 then
        swap(label_1, label_2)
    end if
    while flag = false do
        label_3 ← atomicMin(L[label_1], label_2)
        if label_3 = label_2 then
            flag ← true
        else if label_3 > label_2 then
            label_1 ← label_3
        else if label_3 < label_2 then
            label_1 ← label_2
            label_2 ← label_3
        end if
    end while
end function
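The initialisation and reduce steps of Algorithm 2 can be sketched serially in Python. Here `atomic_min` emulates the semantics of CUDA's atomicMin (store the minimum of the old and new values, return the old value) in a single thread, so the sketch shows the algorithm's logic rather than its concurrent behaviour; the helper names are illustrative.

```python
# Serial sketch of the Komura-equivalence initialisation and reduce steps.

def atomic_min(L, i, value):
    """Emulates CUDA atomicMin: stores min(old, value), returns old."""
    old = L[i]
    L[i] = min(old, value)
    return old

def find_root(L, label):
    while label != L[label]:
        label = L[label]
    return label

def initialisation(I, width, height):
    """Each node takes the smallest linear address among itself and its
    connected neighbours with smaller addresses (west and north)."""
    n = width * height
    L = list(range(n))
    for ki in range(n):
        x, y = ki % width, ki // width
        for kn in ([ki - 1] if x > 0 else []) + ([ki - width] if y > 0 else []):
            if I[ki] == I[kn]:
                L[ki] = min(L[ki], kn)
    return L

def reduce(L, I, k1, k2):
    """Merge the clusters containing k1 and k2 at their root nodes."""
    label1 = find_root(L, k1)
    label2 = find_root(L, k2)
    if I[k1] != I[k2] or label1 == label2:
        return
    if label1 < label2:
        label1, label2 = label2, label1
    while True:
        label3 = atomic_min(L, label1, label2)
        if label3 == label2:        # merge confirmed, loop terminates
            return
        if label3 > label2:         # another thread wrote a larger root
            label1 = label3
        else:                       # another thread wrote a smaller root
            label1, label2 = label2, label3
```

Tracing `reduce` on an uncontended merge shows the behaviour criticised in the text: the first `atomic_min` performs the merge but returns the old root, so a second call is needed before the loop observes `label3 == label2` and terminates.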
The Komura-equivalence labelling algorithm provides a significant performance improvement over the label-equivalence algorithm. Results are shown in the original work for the Ising model [40] and have been reproduced and expanded to other datasets in this article (see Section 6). However, there is an element of redundancy in this algorithm where the same operations are repeated by multiple threads.

When the labels are initialised, sites that are connected in both the west and north directions prefer the label from the north direction as it will have a lower initial label. As these nodes can only adopt one of two possible labels, they must be processed to check if they connect together two different clusters, which are then merged together (this is the function of the reduction kernel). However, two clusters may be connected by many different nodes and the reduction method will be applied at all of them, which leads to multiple threads attempting to merge the same two clusters together. The same effect occurs in three-dimensional datasets, where a node may be connected to three neighbours with smaller initial labels.

The reduction method described in Algorithm 2 also contains redundant operations. The kernel will find the root nodes of two label clusters and iterate until the result of the atomicMin function returns the lesser of the two labels. The CUDA atomicMin function sets the current value of the variable to the lower of the old and new values and then returns the old value. The return value of the atomicMin is important as the value (L[label_1]) may have been set to some other value by another thread. If this occurs then the result (label_3) will return this other label, which must then be merged. However, even if there is no interference from other threads, the method will perform unnecessary extra atomic operations.

In a merge operation with no interference from other threads, at least two atomicMin calls will be required. The first will set L[label_1] to the value label_2 and then return the old value label_1. Since this value is not equal to label_2, the loop will continue and perform another atomicMin operation which tries to set L[label_1] to the value label_2 again. This time the old value is already label_2, which is then returned and the loop terminates. A solution to this is discussed in Section 4.

3.3 Direct-Type and Two-Stage Approach

The label-equivalence and Komura-equivalence algorithms have been presented as kernels that can be applied to each node of the dataset in parallel; this is described as the direct-type approach in [40]. On a GPU, in general, the threads that compute these kernels do not directly communicate with each other and only synchronise at the end of a kernel call. However, threads that are scheduled to execute in the same block can communicate through shared memory and synchronise with each other using __sync operations. This functionality can be used to improve performance by performing the entire labelling process on each block of nodes and then merging these blocks together.

The two-stage approach splits the dataset into sections and assigns a block of threads to each section. Each block will load the section of the dataset into shared memory, label that section using the chosen labelling algorithm and then write the labels to global memory. Afterwards, another set of kernels is launched to merge the labels from neighbouring blocks together. This is a useful technique that has been used to improve the performance of a number of labelling algorithms [4], [6], [46].

The two-stage approach requires a method of merging neighbouring blocks together. The label-equivalence algorithm uses a modified scan kernel that is only applied to nodes on the borders of a section, which must be applied iteratively with the analysis until no new connections between labels are found. The Komura-equivalence algorithm will apply a reduction operation to each pair of nodes that are connected across a border; the borders are processed for each dimension separately (see Figs. 2c and 2d for a two-dimensional example). After the borders have been processed, an analysis operation must be applied to every node in the dataset to determine the final labels. The stages of the two-stage, Komura-equivalence algorithm are illustrated in Fig. 2.

Fig. 2. Example of the two-stage, Komura-equivalence algorithm showing a 2D dataset split into four sections. The four sections are labelled separately and written to global memory as shown in b with the root nodes highlighted. The labels are merged by applying reduction operations across the x- and y-borders at the nodes highlighted in c and d respectively. This connects the root nodes together (shown in e) and f shows the final labels found by applying an analysis kernel to every node in the dataset.

3.4 Related Work - Union-Find Algorithm

The Union-Find method described by Oliveira et al. [4] takes a very similar approach to labelling the connected-components as the Komura-equivalence two-stage method. The Union-Find method differs in the initialisation of labels (each node is initialised to its linear address) and then merges all connected labels. The algorithm will thus require two merge operations in two-dimensions and would require three in three-dimensions. There are also some slight differences in the implementation of the merge function (equivalent to the reduction method described by Komura).

4 NEW ALGORITHM - PLAYNE-EQUIVALENCE

The proposed algorithm is based on the Komura-equivalence algorithm and makes two important changes that reduce unnecessary operations that arise when two clusters are connected together by multiple different nodes. Reducing the number of reduction operations can improve performance by eliminating expensive memory transactions and atomic operations.

Fig. 3. Possible neighbourhood configurations of a node's neighbours. White squares denote neighbours that are connected to the node while dark squares denote nodes that are not. Arrows show the neighbouring label assigned to the node in the initialisation phase while a cross shows the node is assigned its own address as a label.

The first change is a modification of the reduction kernel to perform additional comparisons of the two labels during the process of finding the root nodes. If at any point in the process it is discovered that the labels are equal, then the two clusters have already been merged by another thread and the reduction is determined to be unnecessary and is halted. The reduction kernel is also simplified to remove additional flags and if-else statements to reduce thread divergence and to make the kernel (in the opinion of the authors) easier to understand.

The second (and more significant) change is based on identifying unnecessary reductions by examining the local neighbourhood of a node. The neighbourhood examined consists of the three neighbouring nodes in the west (-x), north-west (-x,-y) and north (-y) directions in two-dimensions and six neighbouring nodes in the (-x), (-y), (-z), (-x,-y), (-x,-z) and (-y,-z) directions in three-dimensions. Examining the possible local neighbourhood configurations allows many unnecessary reductions to be eliminated at the relatively small cost of a few additional, local data reads.

4.1 Two-Dimensional Neighbourhood Configurations

All the possible configurations of a node's local neighbourhood in two-dimensions are shown in Fig. 3. From this illustration it can be seen that there are only two configurations that lead to a node being connected to multiple labels (configurations f and h); for all other configurations the correct neighbouring label can be determined by the initialisation kernel. In configuration h, the node does not represent a unique connection between two clusters as the north and west neighbours are also connected through the north-west neighbour. This observation is most relevant for parallel algorithms, as serial algorithms will usually have already assigned labels to the north and west neighbours and the fact that their labels are joined together through another node will already be known. Applying a reduction to these two labels will not be required as they will either already be joined together or will be merged by another thread (this method prefers to delegate the responsibility of merging clusters to the thread with the lower index).

Configuration f cannot be immediately identified as representing a redundant connection as there is no way to determine from the available information whether the west and north labels are connected together by another node in the dataset. In this case the reduction operation should be applied to merge the clusters and to ensure the dataset is properly labelled. Another node may exist in the dataset that connects these two clusters together; however, identifying such a node may require searching a large portion of the dataset and (in most cases) would outweigh the cost of applying the reduction operation.

Fig. 4. Illustration of the root nodes (filled with solid colour) and the reduction nodes (filled half-half with the colours of the root nodes they connect) as identified by the Playne-equivalence algorithm using clamped boundaries.

Configuration f can be identified by Equation (1), where n represents a connection between a node and the neighbour denoted by the subscript of n. The values n(x-1,y), n(x,y-1) and n(x-1,y-1) represent the west, north and north-west neighbours respectively and ∧, ∨ and ¬ denote logical and, or and not operations.

    n(x-1,y) ∧ n(x,y-1) ∧ ¬n(x-1,y-1)    (1)

An example of identifying these nodes is shown in Fig. 4. This illustration can be compared to the Komura-equivalence method shown in Fig. 1d (the other steps a-c and e,f will be the same). Each of the reduction nodes (filled 50-50 with the colours of the root nodes they connect) matches configuration f from Fig. 3 and correctly identifies all connections between labels (in this case there are no duplicate connections). For this example, only the four necessary reduction operations will be applied as opposed to the thirteen applied by the Komura-equivalence algorithm.
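Under the equality-based connectivity used in this article, Equation (1) comes down to three value comparisons per interior node: the node matches its west and north neighbours but not its north-west neighbour. A small Python sketch of this test (illustrative names; clamped interior nodes only, so x > 0 and y > 0 are assumed):

```python
# Serial sketch of the Equation (1) test for interior nodes.
# With equality-based connectivity, the west and north neighbours are
# joined through the north-west neighbour exactly when the north-west
# value also matches, so a reduction is only needed when it does not.

def needs_reduction(I, x, y, width):
    """True when node (x, y) matches configuration f of Fig. 3."""
    k = y * width + x
    west = I[k] == I[k - 1]                  # n(x-1,y)
    north = I[k] == I[k - width]             # n(x,y-1)
    north_west = I[k] == I[k - width - 1]    # n(x-1,y-1)
    return west and north and not north_west
```

Only the north-west value is an extra read compared with the Komura-equivalence initialisation, which is the "one additional neighbour in two-dimensions" referred to in the introduction.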
4.2 Two-Dimensional Boundary Conditions

Nodes on boundaries require separate consideration to interior nodes as they will not be initialised with connections across the boundary. Such nodes occur when a dataset has periodic boundary conditions and also in a two-stage algorithm (see Section 4.6). In datasets with clamped boundary conditions, nodes on a boundary have no connections across the boundary and the correct neighbouring label can be easily identified by the initialisation kernel. These conditions are equivalent to configurations a and e for the north boundary and a and b for the west boundary.

For a dataset with periodic boundary conditions, the threads processing nodes along the north and west borders of the dataset are responsible for merging clusters that are connected across the border. These threads must inspect the node directly across the border and merge the clusters if necessary. As with the interior nodes of the dataset, there may be multiple connections between clusters across a boundary. A similar set of configurations can be identified for these cases, and reduction operations are applied only in these cases. The possible configurations (for a north-south boundary) are shown in Fig. 5.

Fig. 5. Possible configurations of a node's (bottom-right) neighbours. White neighbours are connected to the node while dark squares are not. The dark line denotes the border of the dataset.

Configurations that are not connected across the boundary require no special consideration. Configuration h does not need to apply a reduction as the clusters will be connected through the west (-x) and north-west (-x,-y) neighbours. For configurations b, d and f the reduction operation should be applied to merge the two clusters across the boundary, as the node is connected across the boundary and no other connection can be easily determined. A similar set of configurations can be found for east-west boundary conditions by rotating the boundary, which identifies configurations e, f and g.

The configurations for border nodes that should apply a reduction to the neighbour across the east-west x and north-south y borders can be identified by the expressions in Equations (2) and (3) respectively. In these expressions, the neighbouring indexes of n have a periodic boundary applied to them in the direction of the boundary.

    n(x-1,y) ∧ (¬n(x,y-1) ∨ ¬n(x-1,y-1))    (2)

    n(x,y-1) ∧ (¬n(x-1,y) ∨ ¬n(x-1,y-1))    (3)

Finally there is a special case that should be considered: the (0,0) node that lies on both the north-south and east-west boundaries. This node should apply the reduction operation to both the north and west neighbours (if it is connected to them) as there can be no other node with a lower label that connects them together.

4.3 Three-Dimensional Neighbourhood Configurations

In three-dimensional datasets, each node may be uniquely connected to up to three neighbours (with a lower index): the (-x), (-y) and (-z) neighbours. Each node may be the connection point of up to three different label clusters. As the initialisation step will set the label to the smallest label, each node will be initialised (in order of preference) with the (-z), (-y) or (-x) neighbour depending on the connections. So the reduction stage may need to merge clusters belonging to the (-x) and (-y) neighbours.

As with two-dimensional datasets, the configurations of the local neighbourhood are examined to determine duplicate connections and eliminate unnecessary reductions. The neighbourhood examined includes the following neighbouring nodes: (-x), (-y), (-z), (-x,-y), (-x,-z) and (-y,-z). As the number of possible configurations is far greater in three-dimensions, only relevant configuration patterns are shown in Fig. 6.

The internal nodes where a reduction operation should
periodic boundary are shown in Fig. 5, note that these con- be applied are identified by configurations i,j and k. A
figurations have the same labels as Fig. 3 and simply has the reduction will never need to be applied to the neighbour in
border shown for clarity. the ðzÞ direction (for interior nodes) as this label will be
From Fig. 5 it can be seen (for a north-south boundary) adopted by the initialisation kernel. The cluster from ðyÞ
that in configurations a, c, e and g the node is not connected neighbour should be merged if the node is connected to the

Authorized licensed use limited to: CINVESTAV. Downloaded on February 10,2022 at 20:02:54 UTC from IEEE Xplore. Restrictions apply.
PLAYNE AND HAWICK: A NEW ALGORITHM FOR PARALLEL CONNECTED-COMPONENT LABELLING ON GPUS 1223

return multiple neighbours for three-dimensional datasets


or for the ð0; 0Þ node of a two-dimensional dataset. While a
general implementation such as this can be written in
CUDA, in practice specific implementations are written for
Fig. 6. Important neighbouring configuration patterns for three-dimen-
sional datasets configurations. The first three configuration (i-k) show the different dimensions and boundary conditions.
internal nodes that should apply the reduction operator to the ðxÞ
neighbour (i and j) and the ðyÞ neighbour (k). The configurations (l-o) Algorithm 3. Playne-equivalence algorithm kernels - L
show nodes that should apply a reduction across a border in the ðxÞ
direction, these configurations can be rotated for borders in other
represents the labels, I the dataset, index() calculates a
dimensions. thread’s linear address from its thread and block id,
neighbours(k) returns the neighbours of k and threadid &
neighbour in the ðzÞ direction but these two are not con- blockid are provided by CUDA to uniquely identify the
nected by the ðy; zÞ neighbour. Likewise the ðxÞ neigh- thread. The function configuration_neighbours(k; I) will
bour should be merged if the node is connected to either the examine the local neighbourhood and return the appro-
ðyÞ or the ðzÞ neighbour and these labels are not already priate neighbours if they match one of the configurations
connected through the ðx; yÞ and ðx; zÞ neighbours (depending on the number of dimensions and boundary
respectively. conditions)
Configurations that require a reduction applied to the
ðxÞ neighbour and the ðyÞ neighbour can be identified by kernel initialisation(L; I)
the expressions in Equations (4) and (5) respectively. ki indexðthreadid ; blockid Þ
label ki
nðx1;y;zÞ ^ ððnðx;y;z1Þ ^ :nðx1;y;z1Þ Þ for all kn neighboursðki Þ do
(4)
_ðnðx;y1;zÞ ^ :nðx1;y1;zÞ ÞÞ if kn < ki and I½ki  ¼ I½kn  then
label minðlabel; kn Þ
nðx;y1;zÞ ^ nðx;y;z1Þ ^ :nðx;y1;z1Þ (5) end if
end for
L½ki  label
4.4 Three-Dimensional Boundary Conditions end kernel
The next case is a three-dimensional dataset with boundary kernel reduction(L; I)
conditions, like the two-dimensional case from Section 4.2, ki index(threadid ; blockid )
nodes on the edge of the boundaries are responsible for label1 L½ki 
merging clusters across a boundary. The set of configura- for all kn configuration_neighbours(ki ; I) do
tions responsible for applying this reduction are identified label1 reduce(L; label1 ; L½kn )
by the configurations l, m, n and o for a border in the x end for
dimension and can be rotated for borders in the y and z end kernel
dimensions. The expressions for identifying these configu- function reduce(L; label1 ; label2 )
rations are given in Equations (6), (7), and (8). Put simply, while label1 6¼ label2 and label1 6¼ L½label1  do
these configurations identify nodes that are connected label1 L½label1 
across a border but cannot identify from the given neigh- end while
while label1 6¼ label2 and label2 6¼ L½label2  do
bourhood another pair of nodes with lower indexes that
label2 L½label2 
will also connect the two clusters.
end while
while label1 6¼ label2 do
nðx1;y;zÞ ^ ð:nðx;y1;zÞ _ :nðx1;y1;zÞ Þ
(6) if label1 < label2 then
^ ð:nðx;y;z1Þ _ :nðx1;y;z1Þ Þ swap(label1 ; label2 )
end if
nðx;y1;zÞ ^ ð:nðx1;y;zÞ _ :nðx1;y1;zÞ Þ label3 atomicMin(L½label1 ; label2 )
(7) if label1 ¼ label3 then
^ ð:nðx;y;z1Þ _ :nðx;y1;z1Þ Þ label1 label2
else
nðx;y;z1Þ ^ ð:nðx1;y;zÞ _ :nðx1;y1;zÞ Þ label1 label3
(8) end if
^ ð:nðx;y1;zÞ _ :nðx;y1;z1Þ Þ
end while
return label1
4.5 Algorithm end function
The Playne-equivalence algorithm uses the modified reduc-
tion method and only applies it to nodes that match the
appropriate configurations identified in preceding sections. 4.6 Two-Stage Approach
The general algorithm is presented in Algorithm 3. This The two-stage, Playne-equivalence algorithm follows a very
general representation allows for different dimension data- similar process to the Komura-equivalence algorithm. The
sets and boundary conditions by applying reduction to all dataset is split into sections and locally labelled, this time
the neighbours returned by the function configuration_ using the Playne-equivalence algorithm. When the blocks are
neighbours(k; I). This function returns the neighbours for merged together, the reduction operation is not applied at
nodes that match an appropriate configuration. It may every node along the border (as in the Komura-equivalence
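The reduce function of Algorithm 3 can be exercised with a short sequential sketch. The following Python analogue is a host-side stand-in only (a plain min replaces CUDA's atomicMin, and there is no concurrency); it shows how the two label chains are walked to their roots and then linked:

```python
def reduce_labels(L, label1, label2):
    """Sequential sketch of Algorithm 3's reduce function.

    L is the label array: L[i] points towards the root of i's cluster,
    and a root r satisfies L[r] == r.
    """
    # Follow each label's chain towards its root.
    while label1 != label2 and label1 != L[label1]:
        label1 = L[label1]
    while label1 != label2 and label2 != L[label2]:
        label2 = L[label2]
    # Link the larger root under the smaller until the chains agree.
    while label1 != label2:
        if label1 < label2:
            label1, label2 = label2, label1
        label3 = L[label1]              # the value atomicMin would return
        L[label1] = min(L[label1], label2)
        if label1 == label3:
            label1 = label2
        else:
            label1 = label3
    return label1
```

For example, merging labels 4 and 2 of a freshly initialised array L = [0, 1, 2, 3, 4, 5] sets L[4] = 2, so both nodes subsequently resolve to the same root.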

These configurations are identified using the same rules as for periodic boundaries (see Equations (2) and (3) for 2D and Equations (6), (7), and (8) for 3D). An illustration showing the bordering nodes where a reduction operation will be applied is shown in Fig. 7. These reduction nodes can be compared to the Komura-equivalence algorithm nodes shown in Figs. 2c and 2d; the other stages will be the same as the previous example shown for the Komura-equivalence algorithm in Fig. 2.

Fig. 7. Illustration showing the bordering nodes where reduction operations will be applied in the Playne-equivalence algorithm. This can be compared to the nodes identified by the Komura-equivalence algorithm shown in c and d from Fig. 2. The number of reduction operations in this example is reduced from twelve to five.

5 IMPLEMENTATION

The labelling algorithms described in Section 3 have all been implemented as both direct-type and two-stage methods in two and three dimensions with clamped and periodic boundary conditions. The implementations have been optimised for important performance considerations including thread-block arrangement, memory access and the use of special device functions. All implementations have been written for NVIDIA CUDA devices of the Kepler and later generation architectures using CUDA 8.0 to allow for the use of special device functions, specifically the register shuffle instructions. The minimum requirements for the shuffle instructions are devices of compute capability 3.0 (Kepler architecture) [42] and CUDA 6.5.

All implementations use thread-block configurations of 32 x 4 for two-dimensional and 32 x 4 x 4 for three-dimensional datasets. While the thread-block size has less effect on the direct-type implementations, the two-stage versions have a tradeoff between different kernels. Increasing the block size means more processing must be done by each block in the initial labelling phase but results in fewer boundaries that must be merged together. These thread-block sizes were chosen as they have been found to consistently provide high occupancy and strong performance across a range of test cases.

The datasets are stored in global memory and cached in shared memory for each block and, when applicable, shared between threads in a warp through the use of __shfl operations. The technique of sharing neighbouring data with shuffle operations is similar to the approach used for finite-differencing applications [42], [47]. The labels are accessed through global memory reads and stored in shared memory for the initial labelling phase of the two-stage implementations.

Listing 1. Memory access for the initial labelling phase of the two-stage Playne-equivalence algorithm. Data is cached in shared memory and shared between threads in the same block, while __shfl operations are used to share data between threads in the same warp.

// Thread and Block indexes
const unsigned int tx = threadIdx.x;
const unsigned int ty = threadIdx.y;
const unsigned int bx = blockIdx.x;
const unsigned int by = blockIdx.y;

// Calculate global index
const unsigned int x = ((bx * blockDim.x) + tx);
const unsigned int y = ((by * blockDim.y) + ty);

// Load node data from global memory
const unsigned char pyx = g_image[y*X + x];

// Write to shared memory
s_image[ty*blockDim.x + tx] = pyx;

// Synchronise block
__syncthreads();

// Fetch (-y) neighbour
const unsigned char pym1x = (ty > 0) ?
    s_image[(ty-1)*blockDim.x + tx] : 0;

// Register shuffle to fetch (-x) and (-x,-y) neighbours
const unsigned char pyxm1 = __shfl_up(pyx, 1);
const unsigned char pym1xm1 = __shfl_up(pym1x, 1);

// Determine neighbouring connections
const bool nym1x = (ty > 0) ? (pyx == pym1x) : false;
const bool nyxm1 = (tx > 0) ? (pyx == pyxm1) : false;
const bool nym1xm1 = (ty > 0 && tx > 0) ?
    (pyx == pym1xm1) : false;

In the initial phase of the two-stage algorithm, only the neighbouring nodes within the same thread-block are considered. Each thread will load its node's data from global memory and save it into shared memory. The neighbouring node in the (-y) direction (or (-z) in three dimensions) can be accessed from this shared memory cache. The other neighbours ((-x), (-x,-y) and (-x,-z)) do not require shared memory reads and can instead be obtained with __shfl_up instructions. Each neighbour in the (-x) direction is either the node of another thread in the same warp (given a block width of 32) or the (-y) or (-z) neighbour of that thread. This means that the Playne-equivalence algorithm can obtain the required neighbourhood information with no extra memory reads (as compared to the Komura-equivalence algorithm) for the two-dimensional case and only one additional, aligned read (the (-y,-z) neighbour) for the three-dimensional case. A code listing showing the loading of the necessary neighbourhood can be seen in Listing 1.

The same approach is not directly applicable to the direct-type implementations, which require all threads to access their local neighbourhoods. In such a case the threads on the left border of the thread-block cannot access their neighbours in the (-x) direction using __shfl_up instructions and must perform a separate memory read.


The second phase of the two-stage Playne-equivalence algorithm can also use shuffle operations to reduce memory transactions. Borders in each dimension are processed by separate kernels, which are executed asynchronously to maximise occupancy. Each thread is assigned to a node on a border and must determine from the neighbourhood if a reduction should be applied, and then perform the reduction if necessary. For the purpose of each kernel, the nodes are not considered to be connected across borders in other dimensions, so the threads on the edge of each block do not need to access neighbouring values in these dimensions. This allows __shfl operations to be used to minimise memory transactions. A code snippet showing the loading of a local neighbourhood for a thread processing a border in the y dimension with periodic boundaries can be seen in Listing 2.
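The per-thread decision in these border kernels is an Equation (3)-style configuration test evaluated with wrapped indexing. A small host-side Python sketch (a hypothetical helper, not the CUDA kernel) for a node on the y = 0 row of a dataset with periodic north-south boundaries:

```python
def needs_border_reduction(image, x):
    """Decide whether node (x, 0) must merge with its wrapped (-y)
    neighbour: it is connected across the border, and the clusters are
    not already joined through the lower-indexed west and north-west
    pair of nodes (the Equation (3)-style test)."""
    Y = len(image)
    ym1 = Y - 1                        # periodic wrap of y - 1 at y == 0
    n_y = image[0][x] == image[ym1][x]             # across the border
    n_x = x > 0 and image[0][x] == image[0][x - 1]
    n_xy = x > 0 and image[0][x] == image[ym1][x - 1]
    return n_y and (not n_x or not n_xy)
```

For a dataset whose first and last rows are identical, only the west-most node of each run across the border triggers a reduction; the remaining duplicate connections are filtered out as redundant.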
Listing 2. Memory access for the kernels that merge label clusters across borders. This example shows a kernel that merges clusters across a border in the y dimension and applies periodic boundary conditions. __shfl operations are used to share data between threads in the same warp.

// Thread and Block indexes
const unsigned int tx = threadIdx.x;
const unsigned int ty = threadIdx.y;
const unsigned int bx = blockIdx.x;
const unsigned int by = blockIdx.y;

// Calculate global indexes
const unsigned int x = ((bx*blockDim.x) + tx);
const unsigned int y = ((by*blockDim.y) + ty)*blockDim.y;

// Periodic boundary
const unsigned int ym1 = (y == 0) ? Y-1 : y-1;

// Get image values
const unsigned char pyx = g_image[y *X + x];
const unsigned char pym1x = g_image[ym1*X + x];

// Neighbour values
const unsigned char pyxm1 = __shfl_up(pyx, 1);
const unsigned char pym1xm1 = __shfl_up(pym1x, 1);

// Determine neighbouring connections
const bool nym1x = (pyx == pym1x);
const bool nyxm1 = (tx > 0) ? (pyx == pyxm1) : false;
const bool nym1xm1 = (tx > 0) ? (pyx == pym1xm1) : false;

These memory access optimisations allow the Playne-equivalence algorithm to access the necessary neighbouring nodes and identify redundant connections with only one additional memory transaction in three dimensions and none in two dimensions (as compared to the Komura-equivalence algorithm).

6 RESULTS

The three labelling algorithms have been implemented and optimised as both direct-type and two-stage versions for NVIDIA devices with CUDA. These implementations are compared across a range of different test cases and system sizes for both two- and three-dimensional datasets. These test cases have been selected to provide a thorough comparison of the algorithms' performance across a range of datasets containing a range of structural features—including datasets specifically chosen to represent near worst-case scenarios.

The performance results presented in this article have been gathered on a GPU server running Ubuntu Server 16.04 and equipped with an NVIDIA K20X using driver version 375.20. All code has been compiled and tested with the CUDA 8.0.44 toolkit.

6.1 Two-Dimensional Results

The two-dimensional test cases consist of six different datasets representing common and near worst-case scenarios. The images are prepared by separating the nodes into two domains (using a threshold set at the median pixel value) and for a range of different image sizes [64², 8192²]. The different image sizes are produced for the generated images (Ising, Percolation, Spiral, Hilbert) by generating images of the appropriate size; for the real-world images (USC-SIPI and OASIS), however, the different sizes were produced by scaling the images. Samples from these test images can be seen in Fig. 8 and are discussed below.

Fig. 8. Examples of the two-dimensional test cases. Clockwise from the top-left, the examples are: Ising model, percolation threshold (4-connected), OASIS medical imaging data, USC-SIPI image database, spiral image and Hilbert space-filling curve.

Ising Model. The first set of test cases has been generated using the Ising model [41], [48], [49]. This model of ferromagnetism is represented by a system of atomic spins that can be in one of two states and exhibits a phase transition at a critical temperature Tc. At this critical temperature the system states will exhibit clusters at all length scales. These test cases are used as approximately representative of the performance of the algorithms for component labelling in related statistical models. Each system has been simulated using the Ising model at the critical temperature Tc until the system reaches equilibrium.

Percolation Threshold. The second set of test cases has been generated randomly using the site percolation threshold. Each pixel is randomly initialised as one of two possible values based on a probability p. If p is below the percolation threshold Pt then a giant connected component will not exist, while above Pt there will be one large connected component with a size of the same order as the system. These test cases result in many finely connected components and are among the more difficult test cases.
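A percolation-style test image of the kind described above can be generated in a few lines. This sketch uses Python's random module; the threshold value quoted is the standard approximation pc ≈ 0.5927 for site percolation on a 2D square lattice with four nearest neighbours:

```python
import random

P_T = 0.5927  # approximate 2D site-percolation threshold (4-connected)

def percolation_image(size, p=P_T, seed=0):
    """Generate a size x size binary test image in which each pixel is
    independently set to 1 with probability p, as in the percolation
    test cases described above."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for _ in range(size)]
            for _ in range(size)]
```

The seed parameter makes the generated test image reproducible across runs, which is useful when comparing labelling implementations on identical inputs.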

The algorithms have been tested using images initialised with probability p = Pt for four-nearest neighbours.

OASIS Medical Imaging Data. Medical MRI data has been used for the third test case although, as this article is focusing on the two-dimensional problem, each slice of the data has been considered a separate image. Medical imaging is a common application of connected-component labelling and, although it is more often performed in three dimensions, these images still represent a useful test case. All images in this test dataset have been obtained from the Open Access Series of Imaging Studies (OASIS) project [50] and are originally of size 256².

USC-SIPI Test Images. The fourth test case used to compare the connected-component labelling algorithms is a series of images from the University of Southern California—Signal and Image Processing Institute (USC-SIPI) image database [51]. This is a collection of digitised images that includes overhead aerial photographs, textures and other miscellaneous subjects. All images have been converted to grayscale, normalised and labelled based on a threshold value to determine two domains. The original source images range in size from 256² to 2048².

Spiral Image. The fifth test case consists of a set of images with two large clusters (one of each domain) packed into a large spiral. While this test case does not result in complex components, it has proved useful as a near worst-case scenario for previous connected-component labelling algorithms due to its two large components connected together by single nodes [35].

Hilbert Space-Filling Curve. The Hilbert space-filling curve is a fractal space-filling curve described by David Hilbert. The test images are generated from the discrete approximation with additional space introduced between the curve (actually filling the entire dataset with the curve is not a particularly interesting test case). Results show this is an even more challenging test case than the spiral image.

The labelling algorithm implementations have been compared for the different test datasets and across a range of system sizes. The relative performance results for the two-dimensional cases are presented in Fig. 9. These results show the performance of each algorithm implementation relative to the fastest overall algorithm: these are the direct-type implementation of the Playne-equivalence algorithm for the USC-SIPI and OASIS medical image datasets and the two-stage (with shuffle) implementation of the Playne-equivalence algorithm for the other datasets.

Fig. 9. Performance results comparing the labelling algorithms for the two-dimensional test cases (using clamped boundaries). The results show the performance of the algorithms relative to the fastest algorithm for each test case. The implementations are labelled for the direct-type label-equivalence, Komura-equivalence and Playne-equivalence as LE-D, KE-D and PE-D. The two-stage approaches (without shuffle operations) are labelled LE-B, KE-B and PE-B. Finally, the Komura-equivalence and Playne-equivalence two-stage methods using shuffle operations are labelled KE-B(shfl) and PE-B(shfl).

Fig. 10. Timing results of individual kernels for the two-dimensional datasets for a system size of 8192². The results show the average time (in milliseconds) for each kernel call, with the exception of the label-equivalence implementations, where the analysis and scan (for the direct-type) and border and analysis (for the two-stage) show the total of multiple kernel calls.

Fig. 10 shows a breakdown of the processing time for the kernels of each algorithm across the different test datasets for a fixed system size of 8192².
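The analysis kernel referred to in this breakdown resolves each node's label to the root of its chain, so that every node in a component ends up holding the component's smallest label. A sequential Python sketch of that pass (a hypothetical host-side analogue of the per-thread loop, using the parent-link convention of Algorithm 3):

```python
def analysis(labels):
    """Flatten the label array in place: follow each entry's parent
    links (labels[i] points towards the root) until the root is
    reached, then store the root directly."""
    for i in range(len(labels)):
        label = labels[i]
        while label != labels[label]:   # walk the chain to the root
            label = labels[label]
        labels[i] = label
    return labels
```

After the reductions have linked the chains, a single pass such as this leaves an array like [0, 0, 1, 3, 3, 4] fully resolved to [0, 0, 0, 3, 3, 3].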

The label-equivalence algorithm groups all of the calls to the analysis and the scan kernels together, as these two kernels will be called iteratively until the labels are determined. The Komura-equivalence and Playne-equivalence algorithms list the two calls to the analysis kernel separately, as there will only ever be two calls to this kernel.

6.2 Three-Dimensional Results

Four three-dimensional test cases have been used to evaluate the performance of the algorithms. The four test cases are the two-dimensional cases that can be extended to three dimensions, namely the Ising model, percolation threshold, OASIS medical imaging data and the Hilbert space-filling curve. These datasets have been generated (or, in the case of the OASIS data, scaled) to a range of sizes [8³, 512³]. The Ising model datasets have been generated by simulating the Ising model at critical temperature Tc until the system reaches equilibrium. The percolation threshold has been set to the threshold for three-dimensional, six-nearest neighbours. The OASIS medical imaging data is already a three-dimensional dataset and has been scaled to different sizes. Finally, the Hilbert space-filling curve has been generated for three dimensions (leaving a one-voxel gap) to create a single component filling the space.

Fig. 11 shows the performance of the labelling algorithm implementations relative to the fastest algorithm for each of the four three-dimensional test datasets. The results are shown relative to the Playne-equivalence direct-type algorithm for the OASIS medical imaging test case and the Playne-equivalence (with __shfl operations) two-stage algorithm for the other three cases.

Fig. 11. Performance results comparing the labelling algorithms for the three-dimensional test cases (clamped boundaries). The results present the performance relative to the fastest algorithm for each test case. The implementations are labelled for the direct-type label-equivalence, Komura-equivalence and Playne-equivalence as LE-D, KE-D and PE-D. The two-stage implementations without shuffle operations are labelled LE-B, KE-B and PE-B. The Komura-equivalence and Playne-equivalence two-stage implementations with shuffle operations are labelled as KE-B(shfl) and PE-B(shfl).

The computation times for the kernels in the different implementations are compared across a single system size (512³) and presented in Fig. 12. The label-equivalence algorithm does also require a memory transaction after each iteration to determine if the labelling process has completed.

Fig. 12. Timing results of individual kernels for the three-dimensional datasets for a system size of 512³. The results show the average time (in milliseconds) for each kernel call, with the exception of the label-equivalence implementations, where the analysis and scan (for the direct-type) and border and analysis (for the two-stage) show the total of multiple kernel calls.

The results presented in this section have shown the performance of the algorithms using clamped boundary conditions; implementations have also been tested using periodic boundary conditions. However, these implementations only have a few extra operations on the borders of the datasets and do not show significantly different performance results. For the sake of brevity, the results from these tests are not included. An overview of the algorithms' performance results is presented in Table 1.

TABLE 1
Average Labelling Time Per Node (in ns) of the Different Labelling Algorithms for Large System Sizes

2D           LE-D          KE-D          PE-D          LE-B          KE-B          KE-B (shfl)   PE-B          PE-B (shfl)
Ising        1.344 ± 5.7%  0.581 ± 1.1%  0.510 ± 1.2%  0.967 ± 2.5%  0.502 ± 2.0%  0.485 ± 2.0%  0.438 ± 1.7%  0.398 ± 1.7%
Percolation  1.724 ± 3.8%  0.579 ± 1.2%  0.549 ± 1.1%  1.116 ± 2.0%  0.414 ± 2.1%  0.416 ± 2.1%  0.373 ± 2.3%  0.353 ± 2.0%
OASIS        0.971 ± 7.9%  0.467 ± 4.6%  0.338 ± 1.7%  0.930 ± 4.2%  0.605 ± 1.7%  0.573 ± 1.8%  0.530 ± 0.9%  0.477 ± 1.1%
USC-SIPI     1.132 ± 14%   0.552 ± 17%   0.370 ± 10%   0.962 ± 6.5%  0.593 ± 3.2%  0.562 ± 3.1%  0.520 ± 2.8%  0.468 ± 2.9%
Spiral       1.358 ± 3.0%  1.153 ± 3.7%  1.162 ± 3.6%  0.774 ± 0.7%  0.467 ± 2.3%  0.456 ± 2.9%  0.447 ± 2.2%  0.418 ± 2.2%
Hilbert      2.506 ± 0.8%  0.667 ± 7.4%  0.673 ± 7.5%  1.323 ± 0.3%  0.420 ± 8.1%  0.423 ± 8.1%  0.393 ± 8.7%  0.377 ± 9.3%

3D           LE-D          KE-D          PE-D          LE-B          KE-B          KE-B (shfl)   PE-B          PE-B (shfl)
Ising        1.935 ± 2.9%  1.314 ± 7.2%  1.017 ± 1.0%  1.398 ± 5.2%  0.732 ± 2.6%  0.721 ± 2.7%  0.645 ± 2.8%  0.617 ± 3.0%
Percolation  2.738 ± 4.6%  1.167 ± 0.3%  1.013 ± 0.3%  1.891 ± 3.4%  0.768 ± 1.2%  0.759 ± 1.2%  0.652 ± 0.9%  0.632 ± 0.9%
OASIS        1.518 ± 21%   0.700 ± 9.8%  0.540 ± 6.1%  1.299 ± 11%   0.754 ± 1.9%  0.699 ± 2.0%  0.686 ± 0.7%  0.610 ± 0.6%
Hilbert      3.497 ± 6.4%  0.829 ± 0.5%  0.800 ± 0.6%  2.284 ± 3.2%  0.734 ± 0.3%  0.730 ± 0.3%  0.610 ± 0.1%  0.624 ± 0.2%

Results show the labelling time per node averaged across dataset sizes of [4096², 8192²] for two dimensions and [256³, 512³] for three dimensions. The timing data is presented in ns, and the error estimates show the standard deviation as a percentage, for the label-equivalence (LE), Komura-equivalence (KE) and Playne-equivalence (PE) algorithms as direct-type (-D) and two-stage (-B) methods with shuffle operations (shfl) and without.

7 DISCUSSION

The performance results shown in Section 6 reproduce the results presented in [40] and show that the two-stage Komura-equivalence algorithm provides an improvement of approximately 2x speedup over the two-stage label-equivalence algorithm for the Ising model dataset in both two and three dimensions. The comparison is extended to additional test cases representing other common (and near worst-case) datasets and shows that the Komura-equivalence algorithm can provide between a 2x and 5x speedup over the label-equivalence algorithm for reasonably large datasets across a range of test cases.

equivalence algorithm can provide between a 2x and 5x is dependent on the dataset. For the two-stage implementa-
speedup over the label-equivalence algorithm for reason- tions, the block kernel is directly comparable across all three
ably large datasets across a range of test cases. algorithms with the label-equivalence performing the slow-
The results also show that the Playne-equivalence algo- est and Playne-equivalence the fastest. The Playne-equiva-
rithm provides a consistent performance benefit over the lence algorithm also outperforms the Komura-equivalence
Komura-equivalence algorithm across a range of test cases algorithm in the border kernel as this also benefits from the
and system sizes in both the two- and three-dimensional improved reduction method and configuration filtering. As
datasets. The Playne-equivalence algorithm provides the expected there is no distinguishable difference in the perfor-
most improvement in test cases that have a higher ratio of mance of the analysis kernels as they perform the same oper-
redundant connections between components—the USC-SIPI ations. The label-equivalence border and analysis kernels are
and OASIS cases. These real-world datasets tend to consist not directly comparable as they must be iteratively called
of larger domains with less single-node connections between multiple times.
clusters. The benefit of the algorithm is less pronounced in In three-dimensional datasets the direct-type Playne-
datasets with almost no redundant connections—the spiral equivalence algorithm also provides a performance improve-
and Hilbert datasets. These datasets consist of very complex ment by reducing the run-time of the reduction kernel. One
structures mostly connected by single nodes and present less unexpected result is the performance of the block kernel
opportunity to filter out unnecessary connections. The Ising where the label-equivalence algorithm performs faster than
and percolation test cases still show an important improvement as they consist of larger structures but still contain many single-node connections. However, the improved reduction (merge) method still provides a performance benefit in these scenarios.

The results also agree with the conclusion in [40] that the two-stage implementation of the Komura-equivalence algorithm performs better than the direct-type implementation for the Ising model. A similar performance improvement is also present in the percolation, spiral and Hilbert space-filling curve datasets. However, the result is not so straightforward for the USC-SIPI and OASIS datasets. In these cases the direct-type implementations can provide the best performance for some larger system sizes. As these datasets contain larger structures, there are fewer small structures that can be resolved within a single block, and the benefit of labelling and then merging blocks is outweighed by the additional costs of converting between local and global labels and the additional synchronisation required.

Analysis of the individual kernels produces several findings; the two-dimensional case is discussed first, followed by the three-dimensional case. For two-dimensional datasets, the direct-type Playne-equivalence algorithm provides a performance benefit by reducing the run-time of the reduction kernel. This is the only kernel that differs from the Komura-equivalence algorithm, and the exact improvement […] the Komura-equivalence algorithm. The Playne-equivalence block kernel does perform slightly faster than the label-equivalence version, but the improvement is only small compared to the two-dimensional case. The Komura-equivalence and Playne-equivalence algorithms are both able to merge blocks together significantly faster than the label-equivalence algorithm, with the Playne-equivalence border kernel providing the best performance. However, these kernel results show that the two-stage Komura-equivalence algorithm could be improved by performing the initial labelling of the blocks with the label-equivalence algorithm.

The use of __shfl operations plays a significant role in achieving the best performance for the two-stage Playne-equivalence algorithm. While these shuffle operations can be used to improve the two-stage Komura-equivalence algorithm, the improvement is less significant. This is due to the fact that the Komura-equivalence algorithm can only access a single neighbour through __shfl operations. The Playne-equivalence algorithm accesses additional neighbours, of which two can be accessed through __shfl operations in two dimensions and three in three dimensions. The benefit of avoiding unnecessary reduction operations can be achieved without significantly increasing the number of memory transactions. The implementation of the two-stage Playne-equivalence algorithm provides a performance benefit over the implementation without shuffle operations for all cases except for
the three-dimensional Hilbert case; the exact reason for this is currently unknown and warrants further investigation.

The algorithm presented takes advantage of the rectilinear structure of the datasets, which allows redundant connections to be identified through simple configurations. The address of the local nodes that may provide these redundant connections can be easily calculated and accessed. A similar approach could be applied to other regular datasets, such as hexagonal lattices, once the appropriate configurations have been identified. It would not be so straightforward to identify such nodes for an arbitrary graph, and in this case the cost of accessing many neighbouring nodes may outweigh the benefits. However, this algorithm should prove useful for the many applications that use regular, rectilinear datasets.

8 CONCLUSIONS

A new parallel, GPU-based connected-component labelling algorithm has been described that improves on the method described in [40] by improving the reduction method and only applying it to nodes that match certain configurations. The configurations can be used to identify redundant connections between label clusters and have been described for two- and three-dimensional datasets with both clamped and periodic boundary conditions. Identifying and disregarding these redundant connections is specific to parallel algorithms, as most sequential algorithms will already have resolved the connections due to the order of execution.

While the labelling method does require access to additional neighbouring nodes, it has been shown that the cost of accessing this additional information can be mitigated through the use of shared memory and the __shfl operations provided by CUDA. For the two-stage implementation of the algorithm, the necessary information can be loaded with no additional memory reads for two-dimensional datasets and only one additional read for three-dimensional datasets. Utilising these special functions enables the algorithm to perform local filtering without increasing memory bandwidth requirements.

A set of two- and three-dimensional test datasets representing common and near worst-case datasets has been presented and used to evaluate and compare the proposed algorithm with the Komura-equivalence and label-equivalence algorithms. The new algorithm provides a performance improvement of approximately 10-20 percent over the Komura-equivalence algorithm and 40-60 percent over the label-equivalence algorithm in most cases. The two-stage implementation provides the best performance in many cases, but the direct-type performed better for the real-world datasets consisting of large structures.

The Playne-equivalence algorithm proposed in this article provides a meaningful performance improvement over existing algorithms across a range of datasets and system sizes by identifying and eliminating unnecessary operations.

ACKNOWLEDGMENTS

The USC-SIPI image database is made available by the University of Southern California - Signal and Image Processing Institute.

Open Access Series of Imaging Studies (OASIS) is made available by Dr. Randy Buckner at the Howard Hughes Medical Institute (HHMI) at Harvard University, the Neuroinformatics Research Group (NRG) at Washington University School of Medicine, and the Biomedical Informatics Research Network (BIRN) supported by NIH grants P50 AG05681, P01 AG03991, P20 MH071616, RR14075, RR16594, U24 RR21382, the Alzheimer's Association, the James S. McDonnell Foundation, the Mental Illness and Neuroscience Discovery Institute, and HHMI.

REFERENCES

[1] A. Eklund, P. Dufort, M. Villani, and S. LaConte, "BROCCOLI: Software for fast fMRI analysis on many-core CPUs and GPUs," Frontiers Neuroinformatics, vol. 8, no. 24, pp. 1–19, 2014.
[2] P. Chen, H. L. Zhao, C. Tao, and H. Sang, "Block-run based connected component labelling algorithm for GPGPU using shared memory," Electron. Lett., vol. 47, no. 24, pp. 1309–1311, Nov. 2011.
[3] A. A. Yildirim and C. Ozdogan, "Parallel wavelet-based clustering algorithm on GPUs using CUDA," Procedia Comput. Sci., vol. 3, pp. 396–400, 2011.
[4] V. M. A. Oliveira and R. A. Lotufo, "A study on connected component labeling algorithms using GPUs," in Proc. 23rd Conf. Graph. Patterns Images, 2010.
[5] O. Stava and B. Benes, "Connected component labeling in CUDA," in GPU Computing Gems Emerald Edition, Amsterdam, Netherlands: Elsevier Inc, 2011, pp. 569–581.
[6] K. Yonehara and K. Aizawa, "A line-based connected component labelling algorithm using GPUs," in Proc. 3rd Int. Symp. Comput. Netw., Dec. 2015, pp. 341–345.
[7] H. Wenke, S. Kolodzey, and O. Vornberger, "A work-optimal parallel connected-component labelling algorithm for 2D-image-data using pre-contouring," in Proc. Int. Workshops Elect. Comput. Eng. Subfields, Aug. 22–23, 2014, pp. 154–161.
[8] A. Rasmusson, T. Sørensen, and G. Ziegler, "Connected components labeling on the GPU with generalization to Voronoi diagrams and signed distance fields," in Proc. Advances Vis. Comput., 2013, vol. 8033, pp. 206–215.
[9] B. Preto, et al., "A GPU-enabled algorithm for 3D image labelling," in Proc. ENURS 1st Meet. Synchrotron Radiation Users Portugal, 2012, pp. 1–2.
[10] M. Jablonski and M. Gorgon, "Handel-C implementation of classical component labelling algorithm," in Proc. EUROMICRO Syst. Digital Des., 2004, pp. 387–393.
[11] B. Thornburg and N. Lawal, "Real-time component labelling and feature extraction on FPGA," in Proc. Int. Symp. Signals Circuits Syst., Jul. 2009, pp. 1–4.
[12] D. Crookes and K. Benkrid, "An FPGA implementation of image component labelling," Proc. SPIE, 1999, vol. 3844, pp. 17–23.
[13] S. Byna, Prabhat, M. F. Wehner, and K. Wu, "Detecting atmospheric rivers in large climate datasets," in Proc. 2nd Int. Workshop Petascale Data Analytics: Challenges Opportunities, Nov. 14, 2011, pp. 7–14.
[14] K. H. Karstensen, "Silhouette extraction using graphics processing units," Master's thesis, Dept. Informatics, Univ. Oslo, Oslo, Norway, 2012.
[15] T. Lelore and F. Bouchara, "FAIR: A Fast Algorithm for document Image Restoration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 2039–2048, Aug. 2013.
[16] L. Riha and M. Mareboyana, "GPU accelerated one-pass algorithm for computing minimal rectangles of connected components," in Proc. IEEE Workshop Appl. Comput. Vis., 2011, pp. 479–484.
[17] A. Dubois and F. Charpillet, "Tracking mobile objects with several Kinect using HMMs and component labelling," in Proc. Int. Conf. Intell. Robots Syst. Workshop Assistance Service Robot. Human Environment, 2012, pp. 1–7.
[18] A. Abramov, T. Kulvicius, F. Worgotter, and B. Dellen, "Real-time image segmentation on a GPU," in Facing the Multicore-Challenge, Berlin, Germany: Springer, 2010, no. 6310, pp. 131–142.
[19] A. Korbes and G. B. Vitor, "Advances on watershed processing on GPU architecture," in Proc. 10th Int. Symp. Math. Morphology Appl. Image Signal Process., 2011, pp. 260–271.
[20] A. N. Moga and M. Gabbouj, "Parallel image component labeling with watershed transformation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 5, pp. 441–450, May 1997.
[21] T. Berka, "The generalized feed-forward loop motif: Definition, detection and statistical significance," Procedia Comput. Sci., vol. 11, pp. 75–87, 2012.
[22] J. Janes, "Efficient integration of multiscale n-body systems," Master's thesis, Instituut voor Informatica, Universiteit van Amsterdam, Amsterdam, Netherlands, 2011.
[23] M. J. Dineen, M. Khosravani, and A. Probert, "Using OpenCL for implementing simple parallel graph algorithms," in Proc. Int. Conf. Parallel Distrib. Process. Techn. Appl., 2011, pp. 268–273.
[24] V. C. Barbosa and R. G. Ferreira, "On the phase transitions of graph coloring and independent sets," Physica A, vol. 343, pp. 401–423, Nov. 2004.
[25] M. C. E. van der Ven, "The edge-bandwidth minimization problem," Master's thesis, Department of Econometrics & Operations Research, Tilburg Univ., Tilburg, Netherlands, 2012.
[26] M. Nowakiewicz, "A path non-existence proof in motion planning using CUDA," in Proc. IEEE/ASME Int. Conf. Adv. Intell. Mechatronics, Jul. 6–9, 2010, pp. 973–979.
[27] H. V. Pham, B. Bhaduri, K. Tangella, C. Best-Popescu, and G. Popescu, "Real time blood testing using quantitative phase imaging," PLOS One, vol. 8, no. 2, pp. 1–9, Feb. 2013.
[28] M. Weigel, "Connected-component identification and cluster update on graphics processing units," Physical Review E, vol. 84, no. 3, p. 036709, Sept. 2011.
[29] Y. Komura and Y. Okabe, "Poster: Multi-GPU-based calculation of percolation problem on the TSUBAME 2.0 supercomputer," in Proc. SC Companion High Perform. Comput., Netw. Storage Anal., 2012, pp. 1369–1369.
[30] Y. Komura and Y. Okabe, "Multi-GPU-based Swendsen-Wang multi-cluster algorithm for the simulation of two-dimensional q-state Potts model," Comput. Physics Commun., vol. 184, pp. 40–44, 2013.
[31] Y. Komura and Y. Okabe, "GPU-based Swendsen-Wang multi-cluster algorithm for the simulation of two-dimensional classical spin systems," Comput. Physics Commun., vol. 183, pp. 1155–1161, 2012.
[32] Y. Komura and Y. Okabe, "Large-scale Monte Carlo simulation of two-dimensional classical XY model using multiple GPUs," J. Phys. Soc. Japan Lett., vol. 81, no. 11, pp. 1–4, 2012.
[33] S. Sah, Y. Roh, K. Chang, and D. Jeong, "Phase map generation for phase shift Moire using CUDA," in Proc. 12th Int. Conf. Parallel Distrib. Comput. Appl. Technol., 2012, pp. 140–145.
[34] D. L. Arendt, "In search of self-organisation," Ph.D. dissertation, Department of Computer Science, Virginia Polytechnic Institute, Blacksburg, VA, Mar. 2012.
[35] K. A. Hawick, A. Leist, and D. P. Playne, "Parallel graph component labelling with GPUs and CUDA," Parallel Comput., vol. 36, no. 12, pp. 655–678, Dec. 2010.
[36] K. Suzuki, I. Horiba, and N. Sugie, "Fast connected-component labeling based on sequential local operations in the course of forward raster scan followed by backward raster scan," in Proc. 15th Int. Conf. Pattern Recognit., vol. 2, 2000, pp. 434–437.
[37] O. Kalentev, A. Rai, S. Kemnitz, and R. Schneider, "Connected component labelling on a 2D grid using CUDA," J. Parallel Distrib. Comput., vol. 71, no. 4, pp. 615–620, 2011.
[38] I. Jung and C. Jeong, "Parallel connected-component labeling algorithm for GPGPU applications," in Proc. Int. Symp. Commun. Inf. Technol., 2010, pp. 1149–1153.
[39] F. Wende and T. Steinke, "Swendsen-Wang multi-cluster algorithm for the 2D/3D Ising model on Xeon Phi and GPU," in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2013, pp. 1–12.
[40] Y. Komura, "GPU-based cluster-labeling algorithm without the use of conventional iteration: Application to the Swendsen-Wang multi-cluster spin flip algorithm," Comput. Physics Commun., vol. 194, pp. 54–58, 2015.
[41] A. Leist, D. P. Playne, and K. A. Hawick, "Interactive visualisation of spins and clusters in regular and small-world Ising models with CUDA on GPUs," J. Comput. Sci., vol. 1, pp. 33–40, 2010.
[42] CUDA C Programming Guide v8.0, Santa Clara, CA, USA: NVIDIA Corporation, January 2017.
[43] NVIDIA's Next Generation CUDA Compute Architecture: Fermi, 1st ed., Santa Clara, CA, USA: NVIDIA Corporation, 2009.
[44] NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110, 1st ed., Santa Clara, CA, USA: NVIDIA Corporation, 2012.
[45] NVIDIA Tesla P100, 1st ed., Santa Clara, CA, USA: NVIDIA Corporation, 2016.
[46] B. Preto, F. Birra, A. Lopes, and P. Medeiros, "Object identification in binary tomographic images using GPGPUs," Int. J. Creative Interfaces Comput. Graph., vol. 4, no. 2, pp. 40–56, 2013.
[47] M. Giles, E. Laszlo, I. Reguly, J. Appleyard, and J. Demouth, "GPU implementation of finite difference solvers," in Proc. 7th Workshop High Perform. Comput. Finance, Nov. 2014, pp. 1–8.
[48] R. J. Glauber, "Time-dependent statistics of the Ising model," J. Math. Physics, vol. 4, pp. 294–307, 1963.
[49] E. Ising, "Beitrag zur Theorie des Ferromagnetismus," Zeitschrift für Physik, vol. 31, no. 1, pp. 253–258, 1925.
[50] D. S. Marcus, T. H. Wang, J. Parker, J. G. Csernansky, J. C. Morris, and R. L. Buckner, "Open access series of imaging studies (OASIS): Cross-sectional MRI data in young, middle aged, nondemented, and demented older adults," J. Cognitive Neuroscience, vol. 19, pp. 1498–1507, 2007.
[51] A. G. Weber, "The USC-SIPI image data base: Version 5," Univ. Southern California, Los Angeles, CA, Tech. Rep. USC-SIPI-315, 1997.

Daniel Peter Playne received the PhD degree in computer science from Massey University, in 2012. He is a senior lecturer of computer science and director of the Parallel Computing Centre at Massey University in New Zealand. He is a member of the ACM.

Ken Hawick received the PhD degree in computational physics from the University of Edinburgh. He is professor of computer science and director of The Digital Centre at the University of Hull in the United Kingdom. He is a senior member of the ACM, a member of the IEEE, IEEE Computer Society, SIAM, and the IET, and is a fellow of the Institute of Physics, a fellow of the British Computer Society, and a fellow of the Royal Meteorological Society, an associate member of the Chartered Management Institute and is a member of the Royal Society of New Zealand.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.