
IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 9, NO. 2, FEBRUARY 2014

An Efficient Parallel Approach for Sclera Vein Recognition
Yong Lin, Eliza Yingzi Du, Senior Member, IEEE, Zhi Zhou, Student Member, IEEE,
and N. Luke Thomas
Abstract: Sclera vein recognition has been shown to be a promising method for human identification. However, its matching speed is slow, which could limit its use in real-time applications. To improve the matching efficiency, we propose a new parallel sclera vein recognition method using a two-stage parallel approach for registration and matching. First, we designed a rotation- and scale-invariant Y shape descriptor based feature extraction method to efficiently eliminate most unlikely matches. Second, we developed a weighted polar line sclera descriptor structure to incorporate mask information and reduce GPU memory cost. Third, we designed a coarse-to-fine two-stage matching method. Finally, we developed a mapping scheme to map the subtasks to GPU processing units. The experimental results show that our proposed method achieves a dramatic processing speed improvement without compromising the recognition accuracy.

Index Terms: Sclera vein recognition, sclera feature matching, sclera matching, parallel computing, GPGPU.

I. INTRODUCTION

THE sclera is the opaque, white outer layer of the eye. The blood vessel structure of the sclera is formed randomly and is unique to each person [1, 2], which makes it usable for human identification [3-6]. Several researchers have designed different sclera vein recognition methods and have shown that sclera vein recognition is promising for human identification. In [4], Crihalmeanu and Ross proposed three approaches for feature registration and matching: a Speeded-Up Robust Features (SURF)-based method, minutiae detection, and direct correlation matching. Among these three methods, the SURF-based method achieves the best accuracy. It takes an average of 1.5 seconds¹ using the SURF method to perform a one-to-one matching.

Manuscript received October 28, 2012; revised June 4, 2013 and


September 29, 2013; accepted October 26, 2013. Date of publication
November 14, 2013; date of current version January 7, 2014. The associate
editor coordinating the review of this manuscript and approving it for
publication was Prof. Patrizio Campisi.
Y. Lin is with the School of Computer Science, Xidian University, Xi'an 710071, China, and also with the Department of Computer Science, Ningxia Normal University, Guyuan 756000, China (e-mail:
[email protected]).
E. Y. Du was with Purdue University, Indianapolis, IN 47907 USA.
She is now with Qualcomm, Santa Clara, CA 92121 USA (e-mail:
[email protected]).
Z. Zhou was with Purdue University, Indianapolis, IN 47907 USA. He is
now with Allen Institute for Brain Science, Seattle, WA 98103 USA (e-mail:
[email protected]).
N. L. Thomas is with the Biometrics and Pattern Recognition Laboratory, Department of Electrical and Computer Engineering, Indiana University-Purdue University, Indianapolis, IN 46202 USA (e-mail:
[email protected]).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TIFS.2013.2291314
1 This speed is based on our implementation of their method.

In [3], Zhou et al. proposed a line-descriptor-based method for sclera vein recognition. The matching step (including registration) is the most time-consuming step in this sclera vein recognition system; it costs about 1.2 seconds to perform a one-to-one matching. Both speeds were measured on a PC with an Intel Core 2 Duo 2.4 GHz processor and 4 GB DRAM. Currently, sclera vein recognition algorithms [3, 4] are designed for central processing unit (CPU)-based systems. As discussed in [7], CPU-based systems are designed as sequential processing devices, which may not be efficient for data processing where the data can be parallelized. Because of the large time consumption of the matching step, sclera vein recognition using a sequential method would be very challenging to implement in a real-time biometric system, especially when there is a large number of templates in the database for matching.
General-purpose graphics processing units (GPGPUs) are now widely used for parallel computing to improve computational processing speed and efficiency [8-20]. The highly parallel structure of GPUs makes them more effective than CPUs for data processing where the processing can be performed in parallel. GPUs have been widely used in biometric recognition, for example in speech recognition [8], text detection [9], handwriting recognition [10], and face recognition [14]. In iris recognition [15], a GPU was used to extract the features, construct descriptors, and match templates. GPUs are also used for object retrieval and image search [16-19]. Park et al. [20] designed a performance evaluation of image processing algorithms, such as linear feature extraction and multi-view stereo matching, on GPUs. However, these approaches were designed for their specific biometric recognition applications and feature searching methods. Therefore, they may not be efficient for sclera vein recognition.
Compute Unified Device Architecture (CUDA), the computing engine of NVIDIA GPUs, is used in this research. CUDA exposes the GPU as a highly parallel, multithreaded, many-core processor with tremendous computational power [21]. It supports not only a traditional graphics pipeline but also computation on non-graphical data. More importantly, it offers an easier programming platform, and the GPU outperforms its CPU counterparts in terms of peak arithmetic intensity and memory bandwidth [22].
In this research, the goal is not to develop a unified strategy to parallelize all sclera matching methods, because each method is quite different from the others and would need a customized design; an efficient parallel computing scheme needs different strategies for different sclera vein recognition methods. Rather, the goal is to develop a parallel sclera matching solution for our sequential line-descriptor method [3] using the CUDA GPU architecture. However, the parallelization strategies developed in this research can be applied to design parallel approaches for other sclera vein recognition methods and can help parallelize general pattern recognition methods.
Based on the matching approach in [3], there are three challenges in mapping the task of sclera feature matching to the GPU. 1) Mask files are used to calculate the valid overlapping areas of two sclera templates and to align the templates to the same coordinate system. But the mask files are large and will occupy much of the GPU memory and slow down data transfer. Also, some of the processing on the mask files involves convolution, whose performance is difficult to improve on the scalar processing units of CUDA. 2) The procedure of sclera feature matching consists of a pipeline of several computational stages with different memory and processing requirements; there is no uniform mapping scheme applicable to all these stages. 3) When the scale of the sclera database is far larger than the number of processing units on the GPU, parallel matching on the GPU is still unable to satisfy the requirement of real-time performance, so new designs are necessary to help narrow down the search range. In summary, a naïve parallel implementation of the algorithms would not work efficiently.
Note that it is relatively straightforward to implement our C program for CUDA on an AMD-based GPU using OpenCL. Our CUDA kernels can be directly converted to OpenCL kernels by accounting for the different syntax of various keywords and built-in functions. The mapping strategy is also effective in OpenCL if we regard a thread and a block in CUDA as a work-item and a work-group in OpenCL. Most of our optimization techniques, such as coalesced memory access and prefix sum, work in OpenCL too. Moreover, since CUDA is a data-parallel architecture, the implementation of our approach in OpenCL should be programmed in the data-parallel model.
In this research, we first discuss why the naïve parallel approach would not work (Section III). We then propose the new Y shape sclera feature based efficient registration method to speed up the mapping scheme (Section IV); introduce the weighted polar line (WPL) descriptor, which is better suited for parallel computing and mitigates the mask size issue (Section V); and develop our coarse-to-fine two-stage matching process to dramatically improve the matching speed (Section VI). These new approaches make the parallel processing possible and efficient. However, it is non-trivial to implement these algorithms in CUDA, so we developed implementation schemes to map our algorithms onto CUDA (Section VII). In Section II, we give a brief introduction to sclera vein recognition. In Section VIII, we present experiments using the proposed system. In Section IX, we draw conclusions.
II. BACKGROUND OF SCLERA VEIN RECOGNITION
A. Overview of Sclera Vein Recognition
A typical sclera vein recognition system includes sclera segmentation, feature enhancement, feature extraction, and feature matching (Figure 1).

Fig. 1. The diagram of a typical sclera vein recognition approach.

Sclera image segmentation is the first step in sclera vein recognition. Several methods have been designed for sclera segmentation [3, 4, 23-25]. Crihalmeanu et al. [25] presented a semi-automated system for sclera segmentation that used a clustering algorithm to classify color eye images into three clusters: sclera, iris, and background. Later, Crihalmeanu and Ross [4] designed a segmentation approach based on a normalized sclera index measure, which includes coarse sclera segmentation, pupil region segmentation, and fine sclera segmentation. Zhou et al. [3] developed a skin-tone plus white-color-based voting method for sclera segmentation in color images and an Otsu's-thresholding-based method for gray-scale images.
After sclera segmentation, it is necessary to enhance and extract the sclera features, since the sclera vein patterns often lack contrast and are hard to detect. Zhou et al. [3] used a bank of multi-directional Gabor filters for vascular pattern enhancement. Derakhshani et al. [23] used contrast-limited adaptive histogram equalization (CLAHE) to enhance the green color plane of the RGB image, and a multi-scale region growing approach to identify the sclera veins from the image background. Crihalmeanu and Ross [4] applied a selective enhancement filter for blood vessels to extract features from the green component of a color image.
In the feature matching step, Crihalmeanu and Ross [4] proposed three registration and matching approaches: Speeded-Up Robust Features (SURF), which is based on interest-point detection; minutiae detection, which is based on minutiae points on the vasculature structure; and direct correlation matching, which relies on image registration. Zhou et al. [3] designed a line-descriptor-based feature registration and matching method.
B. Overview of the Line Descriptor-Based Sclera Vein
Recognition Method
The matching stage of the line-descriptor-based method is a bottleneck with regard to matching speed. In this section, we briefly describe the line-descriptor-based sclera vein recognition method.
After segmentation, vein patterns are enhanced by a bank of directional Gabor filters. Binary morphological operations are used to thin the detected vein structure down to a single-pixel-wide skeleton and to remove the branch points. The line descriptor is used to describe the segments in the vein structure [3]. Figure 2 shows a visual description of the line descriptor. Each segment is described by three quantities: the segment's angle $\theta$ to some reference angle at the iris center, the segment's distance $r$ to the iris center, and the dominant angular orientation $\phi$ of the line segment. Thus, the descriptor is $S = (\theta \;\; r \;\; \phi)^{T}$. The individual components of the line descriptor are calculated as

$$\theta = \tan^{-1}\!\left(\frac{y_l - y_i}{x_l - x_i}\right), \quad r = \sqrt{(y_l - y_i)^2 + (x_l - x_i)^2}, \quad \phi = \tan^{-1}\!\left(\frac{d}{dx} f_{line}(x)\right). \tag{1}$$

Fig. 2. The sketch of parameters of segment descriptor [3].

Fig. 3. The weighting image [3].

Here $f_{line}(x)$ is the polynomial approximation of the line segment, $(x_l, y_l)$ is the center point of the line segment, $(x_i, y_i)$ is the center of the detected iris, and $S$ is the line descriptor.
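For concreteness, the minimal host-side sketch below evaluates Eq. (1) for one segment; it is our illustration, not the authors' code. It assumes the segment center, the iris center, and the slope of the fitted polynomial at the center are already available, and it uses atan2 for the first component so that the quadrant is resolved (the formula above leaves this implicit).

```cuda
// Host-side sketch of Eq. (1): building the line descriptor S = (theta, r, phi)
// for one skeleton segment. All inputs are assumed to come from the earlier
// skeletonization and polynomial-fitting steps.
#include <cmath>
#include <cstdio>

struct LineDescriptor {
    float theta;  // angle of the segment center with respect to the iris center
    float r;      // distance of the segment center from the iris center
    float phi;    // dominant orientation, from the slope of f_line at the center
};

LineDescriptor makeDescriptor(float xl, float yl,    // segment center (x_l, y_l)
                              float xi, float yi,    // detected iris center (x_i, y_i)
                              float slopeAtCenter)   // d/dx f_line(x) at the center
{
    LineDescriptor s;
    s.theta = std::atan2(yl - yi, xl - xi);          // quadrant-aware tan^-1
    s.r     = std::sqrt((yl - yi) * (yl - yi) + (xl - xi) * (xl - xi));
    s.phi   = std::atan(slopeAtCenter);
    return s;
}

int main() {
    // Hypothetical numbers, for illustration only.
    LineDescriptor s = makeDescriptor(310.f, 220.f, 260.f, 240.f, 0.35f);
    std::printf("theta=%.3f rad, r=%.1f px, phi=%.3f rad\n", s.theta, s.r, s.phi);
    return 0;
}
```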
In order to register the segments of the vascular patterns, a RANSAC-based algorithm is used to estimate the best-fit parameters for registration between the two sclera vascular patterns. The registration algorithm randomly chooses two points, one from the test template and one from the target template. It also randomly chooses a scaling factor and a rotation value, based on a priori knowledge of the database. Using these values, it calculates a fitness value for the registration with these parameters [3].
After sclera template registration, each line segment in the test template is compared to the line segments in the target template for matches. In order to reduce the effect of segmentation errors, we created the weighting image (Figure 3) from the sclera mask by setting interior pixels in the sclera mask to 1, pixels within some distance of the boundary of the mask to 0.5, and pixels outside the mask to 0.
The matching score for two segment descriptors is calculated by [3]:

$$m(S_i, S_j) = \begin{cases} w(S_i)\, w(S_j), & \text{if } d(S_i, S_j) \le D_{match} \text{ and } |\phi_i - \phi_j| \le \phi_{match}, \\ 0, & \text{else,} \end{cases} \tag{2}$$

where $S_i$ and $S_j$ are two segment descriptors, $m(S_i, S_j)$ is the matching score between segments $S_i$ and $S_j$, $d(S_i, S_j)$ is the Euclidean distance between the segment descriptors' center points (from Eq. 6-8), $D_{match}$ is the matching distance threshold, and $\phi_{match}$ is the matching angle threshold. The total matching score, $M$, is the sum of the individual matching scores divided by the maximum matching score of the minimal set between the test and target templates. That is, one of the test or target templates has fewer points, and thus the sum of its descriptors' weights sets the maximum score that can be attained [3]:



$$M = \frac{\sum_{(i,j) \in Matches} m(S_i, S_j)}{\min\!\left(\sum_{i \in Test} w(S_i),\; \sum_{j \in Target} w(S_j)\right)}. \tag{3}$$

Fig. 4. The module of sclera template matching.

Here, Matches is the set of all matching pairs, Test is the set of descriptors in the test template, and Target is the set of descriptors in the target template.
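The following host-side sketch illustrates Eqs. (2) and (3) under simplifying assumptions: the templates are taken to be already registered, every thresholded pair is counted as a match (the actual method pairs each segment with at most one counterpart), and the threshold values D_MATCH and PHI_MATCH are placeholders, since their numeric values are not given here.

```cuda
// Host-side sketch (our illustration, not the authors' code) of the weighted
// matching score of Eqs. (2)-(3) for two registered templates.
#include <algorithm>
#include <cmath>
#include <vector>

struct Seg { float x, y, phi, w; };  // descriptor center, orientation, weight

static const float D_MATCH   = 10.0f;  // assumed distance threshold (pixels)
static const float PHI_MATCH = 0.2f;   // assumed angle threshold (radians)

float totalMatchingScore(const std::vector<Seg>& test, const std::vector<Seg>& target) {
    float sum = 0.f;
    for (const Seg& si : test)
        for (const Seg& sj : target) {
            float dx = si.x - sj.x, dy = si.y - sj.y;
            float d  = std::sqrt(dx * dx + dy * dy);                 // center distance
            if (d <= D_MATCH && std::fabs(si.phi - sj.phi) <= PHI_MATCH)
                sum += si.w * sj.w;                                  // Eq. (2)
        }
    float wTest = 0.f, wTarget = 0.f;                                // Eq. (3) denominator
    for (const Seg& s : test)   wTest   += s.w;
    for (const Seg& s : target) wTarget += s.w;
    float denom = std::min(wTest, wTarget);
    return denom > 0.f ? sum / denom : 0.f;                          // Eq. (3)
}
```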
III. A NAÏVE IMPLEMENTATION OF PARALLEL PROCESSING

A naïve parallel approach is to directly convert the sequential algorithm to a parallel computation model (Figure 4). Before matching, the mask files should be aligned and the overlap of the masks calculated as a new mask. The descriptors outside the new mask are removed. A binary erosion is performed to generate the boundary area of the new mask. The weight value of each descriptor is calculated according to its position. Most of these common steps, such as mask merging, weight calculation, and descriptor masking, require scanning the mask image pixel by pixel and performing convolutions. Computationally, these operations are time-consuming and create a bottleneck with regard to speed for sclera matching. Furthermore, the size of the mask file is too large to load onto the GPU without computational delay. As a result, this parallel approach is inefficient.
IV. THE PROPOSED Y SHAPE SCLERA FEATURE FOR EFFICIENT REGISTRATION

Currently, the registration of two sclera images during matching is very time consuming. To improve the efficiency, in this research we propose a new descriptor, the Y shape descriptor, which can greatly improve the efficiency of the coarse registration of two images and can be used to filter out non-matching pairs before refined matching.
Within the sclera, there can be several layers of veins. The motion of these different layers can cause the blood vessels of the sclera to show different patterns [26]. But within the same layer, blood vessels keep their form. As shown in Figure 5, the sets of vessel segments that combine to create Y shape branches often belong to the same sclera layer. When the number of branches is more than three, the vessel branches may come from different sclera layers and the pattern will deform as the eye moves. Y shape branches are observed to be a stable feature and can be used as a sclera feature descriptor.

Fig. 5. The Y shape vessel branch in sclera.

Fig. 6. The rotation and scale invariant character of Y shape vessel branch.
To detect the Y shape branches in the original template, we search for the set of nearest neighbors of every line segment within a regular distance and classify the angles among these neighbors. If there are two types of angle values in the line segment set, the set may be inferred to be a Y shape structure, and the line segment angles are recorded as a new feature of the sclera.
There are two ways to measure both the orientation and the relationship of every branch of a Y shape vessel: one is to use the angle of each branch with respect to the x-axis, and the other is to use the angles between the branches and the iris radial direction. The first method needs an additional rotation operation to align the template, so in our approach we employ the second method. As Figure 6 shows, $\theta_1$, $\theta_2$, and $\theta_3$ denote the angles between each branch and the radius from the pupil center. Even when the head tilts, the eye moves, or the camera zooms during image acquisition, $\theta_1$, $\theta_2$, and $\theta_3$ remain quite stable. To tolerate errors from the pupil center calculation in the segmentation step, we also record the center position $(x, y)$ of the Y shape branches as auxiliary parameters. So our rotation-, shift-, and scale-invariant feature vector is defined as $y(\theta_1, \theta_2, \theta_3, x, y)$.
The Y shape descriptor is generated with reference to the iris center. Therefore, it is automatically aligned to the iris centers, and it is a rotation- and scale-invariant descriptor.
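A minimal sketch of how such a descriptor might be stored and computed is shown below; it is our illustration under stated assumptions. The branch orientations and the pupil center are assumed to come from the skeletonization and segmentation steps, and the angle wrapping is our own detail.

```cuda
// Sketch of the Y shape descriptor y(theta1, theta2, theta3, x, y). Branch
// orientations are taken relative to the radial direction from the pupil
// center, which is what makes the descriptor rotation invariant.
#include <cmath>

static const float PI = 3.14159265f;

struct YShapeDescriptor {
    float theta[3];  // angles between the three branches and the radial direction
    float x, y;      // center of the Y shape branch (tolerates pupil-center error)
};

// branchAngle[k] is the image-plane orientation of branch k of a detected
// Y shape junction located at (bx, by); (px, py) is the pupil center.
YShapeDescriptor makeYDescriptor(const float branchAngle[3],
                                 float bx, float by, float px, float py)
{
    YShapeDescriptor yd;
    float radial = std::atan2(by - py, bx - px);   // radial direction at the branch center
    for (int k = 0; k < 3; ++k) {
        float a = branchAngle[k] - radial;         // angle relative to the radius
        while (a >  PI) a -= 2.f * PI;             // wrap to (-pi, pi]
        while (a <= -PI) a += 2.f * PI;
        yd.theta[k] = a;
    }
    yd.x = bx; yd.y = by;
    return yd;
}
```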
V. WPL SCLERA DESCRIPTOR

As discussed in Section II-B, the line descriptor is extracted from the skeleton of the vessel structure in binary images (Figure 7). The skeleton is then broken into smaller segments. For each segment, a line descriptor is created to record the center and orientation of the segment. This descriptor is expressed as $s(x, y, \phi)$, where $(x, y)$ is the position of the center and $\phi$ is its orientation.
Because of the limitations of segmentation accuracy, the descriptors at the boundary of the sclera area might not be accurate and may contain spur edges resulting from the iris, eyelid, and/or eyelashes. To tolerate such errors, the mask file is designed to indicate whether a line segment belongs to the edge of the sclera or not. However, in a GPU application, using the mask is challenging, since the mask files are large and will occupy GPU memory and slow down data transfer. When matching, a RANSAC-type registration algorithm randomly selects corresponding descriptors, and the transform parameters between them are used to generate the template transform affine matrix. After every template transform, the mask data must also be transformed, and a new boundary must be calculated to evaluate the weight of each transformed descriptor. This results in too many convolutions in the processing units.

Fig. 7. The line descriptor of the sclera vessel pattern. (a) An eye image. (b) Vessel patterns in sclera. (c) Enhanced sclera vessel patterns. (d) Centers of line segments of vessel patterns.
To reduce the heavy data transfer and computation, we designed the weighted polar line (WPL) descriptor structure, which includes the mask information and can be automatically aligned. We extract the geometric relationships of the descriptors and store them as a new descriptor. We use a weighting image created by setting weight values according to position: the weights of descriptors outside the sclera are set to 0, those near the sclera boundary to 0.5, and interior descriptors to 1. In our work, the descriptor weights are calculated on their own mask by the CPU only once, and the result is saved as a component of the descriptor. The sclera descriptor then becomes $s(x, y, \phi, w)$, where $w$ denotes the weight of the point and takes the value 0, 0.5, or 1.
To align two templates, when a template is shifted to another location along the line connecting their centers, all the descriptors of that template must be transformed. Alignment is faster if the two templates share a similar reference point. If we use the center of the iris as the reference point, then when two templates are compared, the correspondence is automatically aligned, since they have a similar reference point. Every feature vector of the template is a set of line segment descriptors composed of three variables (Figure 8): the segment's angle to the reference line through the iris center, denoted $\theta$; the distance between the segment's center and the pupil center, denoted $r$; and the dominant angular orientation of the segment, denoted $\phi$. To minimize GPU computation, we also convert the descriptor values from polar coordinates to rectangular coordinates in a CPU preprocessing step. The descriptor vector becomes $s(x, y, r, \theta, \phi, w)$.
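The sketch below illustrates one possible layout of the WPL descriptor and the one-time, CPU-side weight assignment; it is our illustration, and the mask tests (insideMask, nearBoundary) are placeholders for whatever tests the real pipeline provides.

```cuda
// Sketch of the WPL descriptor s(x, y, r, theta, phi, w) and of the one-time
// CPU-side preprocessing: weight from the mask, and polar -> rectangular
// conversion so the GPU kernels never need the mask or a coordinate change.
#include <cmath>

struct WPLDescriptor {
    float x, y;      // segment center in rectangular coordinates (precomputed on CPU)
    float r, theta;  // polar coordinates of the center w.r.t. the iris center
    float phi;       // dominant orientation of the segment
    float w;         // weight: 0 outside, 0.5 near boundary, 1 interior
};

WPLDescriptor makeWPL(float r, float theta, float phi,
                      bool insideMask, bool nearBoundary)
{
    WPLDescriptor s;
    s.r = r; s.theta = theta; s.phi = phi;
    s.x = r * std::cos(theta);   // conversion done once on the CPU
    s.y = r * std::sin(theta);
    s.w = !insideMask ? 0.f : (nearBoundary ? 0.5f : 1.f);
    return s;
}
```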
The left and right parts of the sclera in an eye may have different registration parameters. For example, as an eyeball moves left, the left-part sclera patterns of the eye may be compressed while the right-part sclera patterns are stretched. In parallel matching, these two parts are assigned to threads in different warps to allow different deformations (the multiprocessor in CUDA manages threads in groups of 32 parallel threads called warps). We reorganized the descriptors from the same side and saved them in contiguous addresses, which meets the requirement of coalesced memory access on the GPU.

Fig. 8. The key elements of descriptor vector.

Fig. 9. Simplified sclera matching steps on GPU.

After reorganizing the structure of the descriptors and adding the mask information into the new descriptor, computation on the mask file is no longer needed on the GPU. Matching with this feature is very fast because the templates do not need to be re-registered every time after shifting. Thus the cost of data transfer and computation on the GPU is reduced. Matching on the new descriptor, the shift parameter generator in Figure 4 is simplified as shown in Figure 9.
VI. COARSE-TO-FINE TWO-STAGE MATCHING PROCESS

To further improve the matching process, we propose a coarse-to-fine two-stage matching process. In the first stage, we match two images coarsely using the Y shape descriptors, which is very fast because no registration is needed. The matching result in this stage helps filter out image pairs with low similarity. After this step, some false positive matches may remain. In the second stage, we use the WPL descriptor to register the two images for more detailed descriptor matching, including scale and translation invariance. This stage includes the shift transform, affine matrix generation, and final WPL descriptor matching.
Overall, we partitioned the registration and matching processing into four kernels² in CUDA (Figure 10): matching on the Y shape descriptor, shift transformation, affine matrix generation, and final WPL descriptor matching. Combining these two stages, the matching program runs faster and achieves a more accurate score.

Fig. 10. Two-stage matching scheme.
A. Stage I: Matching With Y Shape Descriptor

Due to the scale and rotation invariance of the Y shape features, registration is unnecessary before matching on the Y shape descriptor. The whole matching algorithm is listed as Algorithm 1.

Algorithm (Kernel) 1 Matching With Y Shape Descriptor
²A kernel in CUDA is a function called from the host that runs on the device.

Here, $y_{te_i}$ and $y_{ta_j}$ are the Y shape descriptors of the test template $T_{te}$ and the target template $T_{ta}$, respectively. $d_{\theta}$ is the Euclidean distance of the angle elements of the descriptor vectors, defined in (5). $d_{xy}$ is the Euclidean distance of two descriptor centers, defined in (6). $n_i$ and $d_i$ are the number of matched descriptor pairs and their center distances, respectively. $t_{\theta}$ is a distance threshold and $t_{xy}$ is the threshold that restricts the search area. We set $t_{\theta}$ to 30 and $t_{xy}$ to 675 in our experiment. Here,

$$d_{\theta}(y_{te_i}, y_{ta_j}) = \sqrt{(\theta_{i0} - \theta_{j0})^2 + (\theta_{i1} - \theta_{j1})^2 + (\theta_{i2} - \theta_{j2})^2}, \tag{5}$$

and

$$d_{xy}(y_{te_i}, y_{ta_j}) = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}. \tag{6}$$

To match two sclera templates, we search the areas near all the Y shape branches. The search area is limited to the corresponding left or right half of the sclera in order to reduce the search range and time. The distance between two branches is defined in (5), where $\theta_{ij}$ is the angle between the $j$th branch and the radial line from the pupil center in descriptor $i$. The number of matched pairs $n_i$ and the distance between the Y shape branch centers $d_i$ are stored as the matching result. We fuse the number of matched branches and the average distance between matched branch centers as in (2); here, a fusion factor, set to 30 in our study, weights the two terms. $N_i$ and $N_j$ are the total numbers of feature vectors in templates $i$ and $j$, respectively. The decision is regulated by the threshold $t$: if the sclera's matching score is lower than $t$, the sclera is discarded; a sclera with a high matching score is passed to the next, more precise matching process.
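As a rough illustration of Stage I (not the authors' kernel), the sketch below assigns one target template to each thread, applies the thresholds tθ = 30 and txy = 675 from the text, and fuses the match count with the average center distance using the factor of 30. The exact fusion formula lives in the Algorithm 1 box, so the one used here is only a guess, and the packed-array layout (offset/count per template) is an assumption.

```cuda
// CUDA sketch of Stage I: one thread compares the test template against one
// target template using Y shape descriptors, in the spirit of Algorithm 1.
#include <cuda_runtime.h>

struct YShape { float theta[3], x, y; };

#define T_THETA 30.0f   // angle-distance threshold from the text
#define T_XY    675.0f  // center-distance threshold from the text
#define ALPHA   30.0f   // fusion factor from the text

__global__ void yShapeMatchKernel(const YShape* test, int nTest,
                                  const YShape* targets, const int* targetOffset,
                                  const int* targetCount, int nTemplates,
                                  float* score)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // one target template per thread
    if (t >= nTemplates) return;

    const YShape* ta = targets + targetOffset[t];
    int nTa = targetCount[t];
    int matched = 0; float distSum = 0.f;

    for (int i = 0; i < nTest; ++i)
        for (int j = 0; j < nTa; ++j) {
            float dx = test[i].x - ta[j].x, dy = test[i].y - ta[j].y;
            float dxy = sqrtf(dx * dx + dy * dy);
            if (dxy > T_XY) continue;                // restrict search area, Eq. (6)
            float dt = 0.f;                          // Eq. (5)
            for (int k = 0; k < 3; ++k) {
                float d = test[i].theta[k] - ta[j].theta[k];
                dt += d * d;
            }
            if (sqrtf(dt) <= T_THETA) { ++matched; distSum += dxy; }
        }
    // Fuse match count and average center distance into one score (our guess
    // at the fusion used in Algorithm 1).
    score[t] = matched > 0 ? matched - (distSum / matched) / ALPHA : 0.f;
}
```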
B. Stage II: Fine Matching Using WPL Descriptor

The line segment WPL descriptor reveals more vessel structure detail of the sclera than the Y shape descriptor. The variation of the sclera vessel pattern is nonlinear because: 1) when acquiring an eye image at a different gaze angle, the vessel structure appears to shrink or extend nonlinearly, because the eyeball is spherical in shape; and 2) the sclera is made up of four layers (episclera, stroma, lamina fusca, and endothelium), and there are slight differences among the movements of these layers. Considering these factors, our registration employs both a single shift transform and a multi-parameter transform that combines shift, rotation, and scale.
1) Shift Parameter Search: As discussed before, segmentation may not be accurate, so the detected iris center may not be very accurate either. The shift transform is designed to tolerate possible errors in pupil center detection in the segmentation step. If there is no deformation, or only very minor deformation, registration with the shift transform alone is adequate to achieve an accurate result. We designed Algorithm 2 to obtain the optimized shift parameters. Here, $T_{te}$ is the test template and $s_{te_i}$ is the $i$th WPL descriptor of $T_{te}$; $T_{ta}$ is the target template and $s_{ta_i}$ is the $i$th WPL descriptor of $T_{ta}$. $d(s_{te_k}, s_{ta_j})$ is the Euclidean distance of descriptors $s_{te_k}$ and $s_{ta_j}$:

$$d(s_{te_i}, s_{ta_j}) = \sqrt{(x_{te_i} - x_{ta_j})^2 + (y_{te_i} - y_{ta_j})^2}. \tag{7}$$

$s_k$ is the shift value of two descriptors, defined as:

$$s_k = (x_{te_k} - x_{ta_{j'}},\; y_{te_k} - y_{ta_{j'}}). \tag{8}$$

Algorithm (Kernel) 2 Shift Parameter Search for Registration

We first randomly select an equal number of segment descriptors $s_{te_k}$ in the test template $T_{te}$ from each quad and find their nearest neighbors $s_{ta_{j'}}$ in the target template $T_{ta}$. The shift offset between them is recorded as a possible registration shift factor $s_k$. The final registration offset $s_{optim}$ is the candidate with the smallest standard deviation among these candidate offsets.
2) Affine Transform Parameter Search: The affine transform is designed to tolerate some deformation of the sclera patterns in the matching step. The affine transform algorithm is shown in Algorithm 3. The shift value in the parameter set is obtained by randomly selecting a descriptor $s_{te}^{(it)}$ and calculating the distance from its nearest neighbor $s_{ta_{j'}}$ in $T_{ta}$. We transform the test template by the corresponding transform matrix, and at the end of each iteration we count the number of matched descriptor pairs between the transformed test template and the target template. A distance factor, set to 20 pixels in our experiment, determines whether a pair of descriptors is matched. After $N$ iterations, the optimized transform parameter set is the one that yields the maximum number of matches $m(it)$.

Algorithm (Kernel) 3 Affine Parameter Search for Registration

Here, $s_{te_i}$, $T_{te}$, $s_{ta_j}$ and $T_{ta}$ are defined as in Algorithm 2. $\theta^{(it)}$, $tr_{shift}^{(it)}$, and $tr_{scale}^{(it)}$ are the rotation, shift, and scale parameters generated in the $it$th iteration, and $R(\theta^{(it)})$, $T(tr_{shift}^{(it)})$, and $S(tr_{scale}^{(it)})$ are the corresponding rotation, translation, and scaling matrices. To search for the optimized transform parameters, we iterate $N$ times to generate these parameters. In our experiment, we set the number of iterations to 512.
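The sketch below illustrates the per-thread parameter trial of Algorithm 3 under stated assumptions: each thread draws one candidate (rotation, scale, shift) set, transforms the test template, and counts descriptors that land within 20 pixels of a target descriptor. The parameter ranges and the tiny per-thread LCG are placeholders for the a priori ranges and for the dynamically created Mersenne Twisters of the actual implementation; the host is assumed to pick the candidate with the largest count afterwards.

```cuda
// CUDA sketch (our illustration) of the Algorithm 3 idea: one random affine
// parameter set is tried per thread, and the number of matched pairs is
// recorded so the host can keep the best candidate.
#include <cuda_runtime.h>

struct WPL { float x, y, phi, w; };

#define MATCH_DIST 20.0f   // 20-pixel match factor from the text

__device__ float lcg(unsigned int& s) {            // uniform value in [0,1)
    s = 1664525u * s + 1013904223u;
    return (s >> 8) * (1.0f / 16777216.0f);
}

__global__ void affineSearchKernel(const WPL* te, int nTe, const WPL* ta, int nTa,
                                   unsigned int seed, int numSets, int* matchCount,
                                   float* rotOut, float* sxOut, float* syOut, float* scOut)
{
    int it = blockIdx.x * blockDim.x + threadIdx.x;  // one candidate parameter set per thread
    if (it >= numSets) return;
    unsigned int s = seed ^ (it * 2654435761u);

    float rot   = (lcg(s) - 0.5f) * 0.2f;            // assumed rotation range (radians)
    float scale = 0.9f + 0.2f * lcg(s);              // assumed scale range
    float shx   = (lcg(s) - 0.5f) * 40.0f;           // assumed shift range (pixels)
    float shy   = (lcg(s) - 0.5f) * 40.0f;

    int count = 0;
    for (int i = 0; i < nTe; ++i) {                  // transform test descriptor i
        float xr = scale * (cosf(rot) * te[i].x - sinf(rot) * te[i].y) + shx;
        float yr = scale * (sinf(rot) * te[i].x + cosf(rot) * te[i].y) + shy;
        for (int j = 0; j < nTa; ++j) {              // any target within 20 px?
            float dx = xr - ta[j].x, dy = yr - ta[j].y;
            if (dx * dx + dy * dy <= MATCH_DIST * MATCH_DIST) { ++count; break; }
        }
    }
    matchCount[it] = count;                          // m(it) in the text
    rotOut[it] = rot; sxOut[it] = shx; syOut[it] = shy; scOut[it] = scale;
}
```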
3) Registration and Matching Algorithm: Using the optimized parameter sets determined by Algorithms 2 and 3, the test template is registered and matched simultaneously. The registration and matching algorithm is listed in Algorithm 4.

Algorithm (Kernel) 4 Registration and Match

Here, $s_{te_i}$, $T_{te}$, $s_{ta_j}$ and $T_{ta}$ are defined as in Algorithms 2 and 3. $\theta^{(optm)}$, $tr_{shift}^{(optm)}$, $tr_{scale}^{(optm)}$, and $s_{optim}$ are the registration parameters obtained from Algorithms 2 and 3, and $R(\theta^{(optm)})\, T(tr_{shift}^{(optm)})\, S(tr_{scale}^{(optm)})$ is the descriptor transform matrix defined in Algorithm 3. $\phi$ is the angle between the segment descriptor and the radial direction. $w$ is the weight of the descriptor, which indicates whether the descriptor is at the edge of the sclera or not. To ensure that the nearest descriptors have a similar orientation, we use a constant factor to check the absolute difference of the two $\phi$ values; in our experiment, we set this factor to 5. The total matching score is the minimum score of the two transformed results divided by the minimal matching score of the test template and the target template.
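A compact sketch of the final matching step is given below; it is our illustration of the Algorithm 4 idea, not the algorithm itself. It transforms each test descriptor with the optimal parameters, gates candidates by the orientation difference (the constant factor of 5, units assumed), excludes already-matched target segments through a flag array, and normalizes the weighted sum as in Eq. (3).

```cuda
// CUDA sketch of the final WPL registration-and-match score for one pair of
// templates, using the optimal transform from Algorithms 2 and 3.
#include <cuda_runtime.h>

struct WPL { float x, y, phi, w; };

#define MATCH_DIST 20.0f   // match distance (pixels), as in the affine search
#define PHI_GATE   5.0f    // orientation gate from the text (units assumed)

__device__ float wplMatchScore(const WPL* te, int nTe, const WPL* ta, int nTa,
                               float rot, float scale, float shx, float shy,
                               bool* used /* nTa flags, zero-initialised */)
{
    float sum = 0.f, wTe = 0.f, wTa = 0.f;
    for (int j = 0; j < nTa; ++j) wTa += ta[j].w;
    for (int i = 0; i < nTe; ++i) {
        wTe += te[i].w;
        float xr = scale * (cosf(rot) * te[i].x - sinf(rot) * te[i].y) + shx;
        float yr = scale * (sinf(rot) * te[i].x + cosf(rot) * te[i].y) + shy;
        int best = -1; float bestD = MATCH_DIST * MATCH_DIST;
        for (int j = 0; j < nTa; ++j) {
            if (used[j] || fabsf(te[i].phi - ta[j].phi) > PHI_GATE) continue;
            float dx = xr - ta[j].x, dy = yr - ta[j].y, d = dx * dx + dy * dy;
            if (d < bestD) { bestD = d; best = j; }     // nearest unused target segment
        }
        if (best >= 0) { used[best] = true; sum += te[i].w * ta[best].w; }
    }
    float denom = fminf(wTe, wTa);                      // Eq. (3)-style normalization
    return denom > 0.f ? sum / denom : 0.f;
}
```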

Fig. 11. The task assignment inside and outside the GPU.

VII. MAPPING THE SUBTASKS TO CUDA

CUDA is a single instruction multiple data (SIMD) system and works as a coprocessor with a CPU. A CUDA device consists of many streaming multiprocessors (SMs); the parallel part of the program should be partitioned into threads by the programmer and mapped onto those threads. There are multiple memory spaces in the CUDA memory hierarchy: registers, local memory, shared memory, global memory, constant memory, and texture memory. Registers and shared memory are on-chip and take little time to access. Only shared memory can be accessed by other threads within the same block; however, shared memory is of limited size. Local memory, global memory, constant memory, and texture memory are off-chip and accessible by all threads, but accessing them is very time consuming. Constant memory and texture memory are read-only and cacheable.
Mapping algorithms to CUDA to achieve efficient processing is not a trivial task. There are several challenges in CUDA programming: 1) If threads in a warp have different control paths, all the branches are executed serially; to improve performance, branch divergence within a warp should be avoided. 2) Global memory is slower to access than on-chip memory. To hide this latency, we should use on-chip memory preferentially rather than global memory; when global memory access does occur, threads in the same warp should access consecutive words to achieve coalescence. 3) Shared memory is much faster than the local and global memory spaces, but it is organized into banks of equal size. If two memory requests from different threads within a warp fall in the same memory bank, the accesses are serialized. To get maximum performance, memory requests should be scheduled to minimize bank conflicts.

A. Mapping Algorithm to Blocks

Because the proposed registration and matching algorithm has four independent modules, each module is converted to a different kernel on the GPU. These kernels differ in computation density, so we map them to the GPU with different mapping strategies to fully utilize the computing power of CUDA.
Figure 11 shows our scheme of CPU-GPU task distribution and the partition among blocks and threads. Algorithm 1 is partitioned into coarse-grained parallel subtasks. We create a number of threads in this kernel equal to the number of templates in the database. As the upper middle column of Figure 11 shows, each target template is assigned to one thread, and one thread performs the comparison of one pair of templates. In our work, we use an NVIDIA C2070 as our GPU. The numbers of threads and blocks are both set to 1024, which means we can match our test template with up to 1024 × 1024 target templates at the same time.
Algorithms 2-4 are partitioned into fine-grained subtasks, in which each thread processes a section of descriptors. As the lower portion of the middle column of Figure 11 shows, we assign a target template to one block. Inside a block, one thread corresponds to a set of descriptors in this template. This partition lets every block execute independently, with no data exchange required between different blocks. When all threads complete their corresponding descriptor fractions, the sum of the intermediate results needs to be computed or compared. A parallel prefix sum algorithm is used to calculate the sum of the intermediate results, as shown on the right of Figure 11. First, all odd-numbered threads compute the sum of consecutive pairs of results. Then, recursively, every first of $i$ ($= 4, 8, 16, 32, 64, \ldots$) threads computes the prefix sum on the new results. The final result is saved at the first address, which has the same variable name as the first intermediate result.
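A minimal sketch of such an in-block reduction is shown below; it sums one partial result per thread in shared memory with a stride-doubling tree and leaves the total in the first slot, mirroring the description above. It is a sketch, not the authors' kernel, and it assumes the block size is a power of two.

```cuda
// CUDA sketch of the in-block sum: each thread deposits its partial result in
// shared memory, then pairs of results are combined in a tree so that the
// block total ends up in buf[0].
#include <cuda_runtime.h>

__global__ void blockSumKernel(const float* partial, float* blockTotal)
{
    extern __shared__ float buf[];              // one slot per thread
    int tid = threadIdx.x;
    buf[tid] = partial[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // stride 1, 2, 4, ...: every 2*stride-th thread adds its right neighbour's sum
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) blockTotal[blockIdx.x] = buf[0];   // final result at the first address
}
```

A launch such as `blockSumKernel<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(partial, totals)` supplies the dynamic shared-memory buffer; in the real kernels this reduction would run after each thread has finished its descriptor fraction.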
B. Mapping Inside Block

In the shift argument search, there are two schemes we can choose to map the task: 1) map one pair of templates to all the threads in a block, so that every thread takes charge of a fraction of the descriptors and cooperates with the other threads; or 2) assign a single possible shift offset to each thread, so that all the threads compute independently except that the final results must be compared across the possible offsets. Due to the great number of sum and synchronization operations in every nearest-neighbor search step, as shown in Figure 11, we chose the second method to parallelize the shift search.
In the affine matrix generator, we map an entire parameter-set search to a thread: every thread randomly generates a set of parameters and tries them independently, and the iterations are distributed across all threads. The challenge of this step is that the randomly generated numbers might be correlated among threads. In the rotation and scale registration generation step, we used the Mersenne Twister pseudorandom number generator because it can use bitwise arithmetic and has a long period. The Mersenne Twister, like most pseudorandom generators, is iterative; therefore it is hard to parallelize a single twister state update step among several execution threads. To make sure that the thousands of threads in the launch grid generate uncorrelated random sequences, many simultaneous Mersenne Twisters need to run in parallel with different initial states. But even very different (by any definition) initial state values do not prevent the emission of correlated sequences by generators sharing identical parameters. To solve this problem and to enable an efficient implementation of the Mersenne Twister on parallel architectures, we used a special offline tool for the dynamic creation of Mersenne Twister parameters, modified from the algorithm developed by Makoto Matsumoto and Takuji Nishimura [27].
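As an alternative illustration (not the authors' implementation), the CURAND device API can give each thread its own subsequence of a single generator, which sidesteps the correlation problem without creating distinct Mersenne Twister parameter sets. A sketch follows; the parameter ranges are placeholders.

```cuda
// Sketch of per-thread random streams with CURAND's XORWOW device generator:
// the same seed with a different subsequence per thread yields statistically
// independent streams for the per-thread affine parameter draws.
#include <curand_kernel.h>

__global__ void initRng(curandState* states, unsigned long long seed)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, /*subsequence=*/id, /*offset=*/0, &states[id]);
}

__global__ void drawAffineParams(curandState* states, float* rot, float* scale)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState local = states[id];                   // work on a register copy
    rot[id]   = (curand_uniform(&local) - 0.5f) * 0.2f;   // assumed rotation range
    scale[id] = 0.9f + 0.2f * curand_uniform(&local);     // assumed scale range
    states[id] = local;                               // save state for later kernels
}
```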
In the registration and matching step, when searching for the nearest neighbor, a line segment that has already been matched with another should not be used again. In our approach, a flag variable denoting whether a line has been matched is stored in shared memory. To share the flags, all the threads in a block would have to synchronize at every query step. Our solution is to use a single thread in a block to process the matching.
C. Memory Management

The bandwidth inside the GPU board is much higher than the bandwidth between host memory and device memory, and data transfer between host and device can lead to long latency. As shown in Figure 11, we load the entire target template set from the database without considering when the templates will be processed. Therefore, there is no data transfer from host to device during the matching procedure.
In global memory, the components of the descriptors $y(\theta_1, \theta_2, \theta_3, x, y)$ and $s(x, y, r, \theta, \phi, w)$ are stored separately. This guarantees that consecutive kernels of Algorithms 2 to 4 can access their data at successive addresses. Although such coalesced access reduces latency, frequent global memory access is still a slow way to get data. In our kernels, we load the test template into shared memory to accelerate memory access. Because Algorithms 2 to 4 execute different numbers of iterations on the same data, bank conflicts do not occur. To maximize our texture memory space, we set the system cache to the lowest value and bound our target descriptors to texture memory. Using this cacheable memory, our data access was accelerated further.
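The sketch below illustrates the shared-memory staging of the test template under the one-target-template-per-block mapping of Section VII-A; it is our illustration, and the fixed match radius and the atomic accumulation are placeholders for the real matching logic and the block reduction (float atomicAdd requires compute capability 2.0 or later, which the C2070 provides).

```cuda
// CUDA sketch: cooperatively copy the test template into shared memory once,
// then let every thread in the block read it repeatedly from on-chip memory.
#include <cuda_runtime.h>

struct WPL { float x, y, phi, w; };

__global__ void matchWithSharedTest(const WPL* testGlobal, int nTest,
                                    const WPL* targets, const int* targetOffset,
                                    const int* targetCount, float* score /* zero-initialised */)
{
    extern __shared__ WPL testShared[];            // sized to nTest at launch time
    for (int i = threadIdx.x; i < nTest; i += blockDim.x)
        testShared[i] = testGlobal[i];             // coalesced, one-time copy
    __syncthreads();

    // One target template per block; each thread handles a slice of its descriptors.
    const WPL* ta = targets + targetOffset[blockIdx.x];
    int nTa = targetCount[blockIdx.x];
    float partial = 0.f;
    for (int j = threadIdx.x; j < nTa; j += blockDim.x)
        for (int i = 0; i < nTest; ++i) {
            float dx = testShared[i].x - ta[j].x, dy = testShared[i].y - ta[j].y;
            if (dx * dx + dy * dy < 400.f)          // placeholder 20-pixel radius
                partial += testShared[i].w * ta[j].w;
        }
    atomicAdd(&score[blockIdx.x], partial);         // stand-in for the block reduction
}
```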
VIII. EXPERIMENTAL RESULTS

We used a computer with an Intel i7 950 3.07 GHz processor and an NVIDIA Tesla C2070 graphics card, which has 448 cores, a 1.15 GHz GPU clock, and a 1.5 GHz memory clock. The sclera image database we used is the UBIRIS database, Session 1. It is a publicly available eye image database acquired in the visible wavelength [26]. There are 1214 images collected from 241 persons in this database (Figure 12). In our study, 46 blurred, blinking, or no-sclera-area images were removed.

Fig. 12. Example image from the UBIRIS database.


Fig. 15. Matching time with various thread numbers per block.

Achieving good performance on a GPU requires keeping the multiprocessors as busy as possible with a suitable number of threads and blocks. The larger the number of threads used per block, the more templates can be compared simultaneously. Threads in a warp start together at the same program address. When one warp is paused, other warps are executed to hide latencies and keep the processing units busy. To switch quickly from one execution context to another, the multiprocessor keeps all warps active by partitioning private registers to every warp. As a result, the number of blocks and warps that can reside on the multiprocessor depends on whether there are enough registers and enough shared memory available on the multiprocessor [29]. If we set the number of threads per block to a multiple of the warp size, the maximum number of threads per block should be set to

$$T = \left\lfloor \frac{R_{block}}{\mathrm{ceil}(W_{size} \cdot R_k,\ G_T)} \right\rfloor \cdot W_{size}, \tag{9}$$

where $T$ is the number of threads per block that keeps all the threads resident in the multiprocessor, $R_{block}$ is the total number of registers for a block, $W_{size}$ is the warp size, $R_k$ is the number of registers used by the kernel, $G_T$ is the thread allocation granularity, and $\mathrm{ceil}(x, y)$ denotes $x$ rounded up to the nearest multiple of $y$.
The number of blocks should guarantee that there are at least two blocks in a multiprocessor. The total amount of shared memory $S_{block}$ for a block should be

$$S_{block} = \mathrm{ceil}(S_k,\ G_s), \tag{10}$$

where $S_k$ is the amount of shared memory used by the kernel and $G_s$ is the shared memory allocation granularity.
We also used the occupancy calculation tool provided by NVIDIA to search for the optimized configuration parameters [21]. Each kernel of Kernel 3 needs 31 registers and 7168 bytes of shared memory. As Figure 13 shows, the maximum occupancy, which is defined as the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps, can be achieved when the number of threads is set to 256, 512, or 1024.

Fig. 13. Occupancy on various thread numbers per block.
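For illustration, the small host program below plugs the kernel's reported register and shared-memory usage into Eqs. (9) and (10) as reconstructed above. The C2070 resource numbers and granularities are assumptions made only for this example; with Rk = 31 the formula yields T = 1024, consistent with the configurations found by the occupancy calculator.

```cuda
// Host-side sketch applying Eqs. (9)-(10) with assumed C2070-class values.
#include <cstdio>

static int ceilTo(int x, int g) { return ((x + g - 1) / g) * g; }  // round x up to a multiple of g

int main() {
    int Rblock = 32768, Wsize = 32, Rk = 31, Gt = 64;   // Eq. (9) inputs (assumed)
    int Sk = 7168, Gs = 128;                            // Eq. (10) inputs (assumed)

    int T = (Rblock / ceilTo(Wsize * Rk, Gt)) * Wsize;  // Eq. (9): threads per block
    int Sblock = ceilTo(Sk, Gs);                        // Eq. (10): shared memory per block

    std::printf("max resident threads per block T = %d\n", T);
    std::printf("shared memory allocated per block = %d bytes\n", Sblock);
    return 0;
}
```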

Fig. 14. ROC curve of parallel matching.

Using the Y shape descriptor, it takes 56.8 seconds to compare 1168 × 1168 template pairs. The equal error rate (EER) of this stage is 9.93%, which is not very accurate. However, it can be used as a filtering method to select the most likely matching templates to compare in the next step, Stage II.
Figure 14 shows our experimental results with the two stages. To balance matching speed and accuracy, we adopted different strategies to select the possible templates after Stage I. Matching using only Stage II achieves the most accurate result; however, it takes the longest time. In the sequential implementation on a CPU, the number of iterations ranges from 100 to 400 depending on the size of the template. In our implementation on a GPU, the iteration count was set to 512; consequently, this extends the registration parameter search range and yields a more accurate matching result. As the percentage of template pairs filtered out after Stage I increases, the accuracy of the parallel matching decreases. Figure 14 shows the ROC curve and the EER of each method, and Figure 15 shows the ROC curve and the GARs when FAR = 0.1% and FAR = 0.01%. The summary of the accuracy and speed is shown in Table I.
For the sequential method, the EER is 3.386%, the area under the curve (AUC) is 97.5%, and GAR = 92.6% and 86.46% at FAR = 0.1% and FAR = 0.01%, respectively. For the parallel computing approach, if all templates are used for the Stage II matching, the parallel approach achieves better recognition accuracy than the sequential method, with EER = 3.052%, AUC = 98.6%, and GAR = 93% (when FAR = 0.1%) and 87.9% (when FAR = 0.01%). At the same time, the proposed parallel computing approach achieves a 769-times speed improvement. If we filter 23.3% of pairs in Stage I, the speedup improves further to 805 times, while the accuracy still beats the sequential method using EER, AUC, and GAR (when FAR = 0.1% and FAR = 0.01%) as measures. If we filter 43.5% of pairs in Stage I, the speedup is 1304 times over the sequential method; the equal error rate is a little higher than that of the sequential method, but the AUC and GAR are better. If we filter 61.6% of pairs in Stage I, the EER is 3.637%, which is about 0.3% higher than the EER of the sequential method, and the AUC is 97.4%, which is about 0.1% lower than the AUC of the sequential method. But the GAR is much better: GAR = 93.8% and 89.7% when FAR = 0.1% and 0.01%, respectively.

TABLE I
PARALLEL M ATCHING C OMPARED W ITH S EQUENTIAL M ATCHING

The speed is 1935 times faster than the sequential method. Note that we used a 448-core GPU in this research, which means the proposed method's efficiency is 4.3 times the number of GPU cores. This shows that the proposed parallel computing method can dramatically improve the speed without compromising the recognition accuracy.
IX. CONCLUSION

In this paper, we proposed a new parallel sclera vein recognition method, which employs a two-stage parallel approach for registration and matching. Even though the research focused on developing a parallel sclera matching solution for the sequential line-descriptor method using the CUDA GPU architecture, the parallel strategies developed in this research can be applied to design parallel solutions for other sclera vein recognition methods and for general pattern recognition methods. We designed the Y shape descriptor to narrow the search range and increase the matching efficiency, which is a new feature extraction method that takes advantage of the GPU structure. We developed the WPL descriptor to incorporate the mask information and make the matching more suitable for parallel computing, which dramatically reduces data transfer and computation. We then carefully mapped our algorithms to GPU threads and blocks, which is an important step in achieving parallel computation efficiency on a GPU. A work flow with high arithmetic intensity to hide memory access latency was designed to partition the computation task across the heterogeneous CPU-GPU system, and even across the threads in the GPU. The proposed method dramatically improves the matching efficiency without compromising recognition accuracy.
ACKNOWLEDGMENT

The authors would like to thank the associate editor, Prof. Patrizio Campisi, and the anonymous reviewers for their constructive comments. We would also like to acknowledge the Department of Computer Science at the University of Beira Interior for providing the UBIRIS database [28].

REFERENCES
[1] C. W. Oyster, The Human Eye: Structure and Function. Sunderland, MA: Sinauer Associates, 1999.
[2] P. Kaufman and A. Alm, "Clinical application," in Adler's Physiology of the Eye, 2003.
[3] Z. Zhou, E. Y. Du, N. L. Thomas, and E. J. Delp, "A new human identification method: Sclera recognition," IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 42, no. 3, pp. 571-583, May 2012.
[4] S. Crihalmeanu and A. Ross, "Multispectral scleral patterns for ocular biometric recognition," Pattern Recognit. Lett., vol. 33, no. 14, pp. 1860-1869, Oct. 2012.
[5] Z. Zhou, E. Y. Du, N. L. Thomas, and E. J. Delp, "A comprehensive multimodal eye recognition," Signal, Image Video Process., vol. 7, no. 4, pp. 619-631, Jul. 2013.
[6] Z. Zhou, E. Y. Du, N. L. Thomas, and E. J. Delp, "A comprehensive approach for sclera image quality measure," Int. J. Biometrics, vol. 5, no. 2, pp. 181-198, 2013.
[7] R. N. Rakvic, B. J. Ulis, R. P. Broussard, R. W. Ives, and N. Steiner, "Parallelizing iris recognition," IEEE Trans. Inf. Forensics Security, vol. 4, no. 4, pp. 812-823, Dec. 2009.
[8] P. R. Dixon, T. Oonishi, and S. Furui, "Harnessing graphics processors for the fast computation of acoustic likelihoods in speech recognition," Comput. Speech Lang., vol. 23, no. 4, pp. 510-526, 2009.
[9] K.-S. Oh and K. Jung, "GPU implementation of neural networks," Pattern Recognit., vol. 37, no. 6, pp. 1311-1314, 2004.
[10] D. C. Cireşan, U. Meier, L. M. Gambardella, and J. Schmidhuber, "Deep, big, simple neural nets for handwritten digit recognition," Neural Comput., vol. 22, no. 12, pp. 3207-3220, 2010.
[11] J. Antikainen, J. Havel, R. Josth, A. Herout, P. Zemcik, and M. Hauta-Kasari, "Nonnegative tensor factorization accelerated using GPGPU," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 7, pp. 1135-1141, Feb. 2011.
[12] C. Cuevas, D. Berjon, F. Moran, and N. Garcia, "Moving object detection for real-time augmented reality applications in a GPGPU," IEEE Trans. Consum. Electron., vol. 58, no. 1, pp. 117-125, Feb. 2012.
[13] Y. Xu, S. Deka, and R. Righetti, "A hybrid CPU-GPGPU approach for real-time elastography," IEEE Trans. Ultrason., Ferroelectr., Freq. Control, vol. 58, no. 12, pp. 2631-2645, Dec. 2011.
[14] G. Poli, J. H. Saito, J. F. Mari, and M. R. Zorzan, "Processing neocognitron of face recognition on high performance environment based on GPU with CUDA architecture," in Proc. 20th Int. Symp. Comput. Archit. High Perform. Comput., 2008, pp. 81-88.
[15] F. Z. Sakr, M. Taher, and A. M. Wahba, "High performance iris recognition system on GPU," in Proc. ICCES, 2011, pp. 237-242.
[16] W. Wenying, Z. Dongming, Z. Yongdong, L. Jintao, and G. Xiaoguang, "Robust spatial matching for object retrieval and its parallel implementation on GPU," IEEE Trans. Multimedia, vol. 13, no. 6, pp. 1308-1318, Dec. 2011.
[17] N. Ichimura, "GPU computing with orientation maps for extracting local invariant features," in Proc. IEEE CVPRW, Jun. 2010, pp. 1-8.
[18] K. Tsz-Ho, S. Hoi, and C. C. L. Wang, "Fast query for exemplar-based image completion," IEEE Trans. Image Process., vol. 19, no. 12, pp. 3106-3115, Dec. 2010.
[19] X. Hongtao, G. Ke, Z. Yongdong, T. Sheng, L. Jintao, and L. Yizhi, "Efficient feature detection and effective post-verification for large scale near-duplicate image search," IEEE Trans. Multimedia, vol. 13, no. 6, pp. 1319-1332, Dec. 2011.
[20] P. In Kyu, N. Singhal, L. Man Hee, C. Sungdae, and C. W. Kim, "Design and performance evaluation of image processing algorithms on GPUs," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 1, pp. 91-104, Jan. 2011.
[21] NVIDIA CUDA C Programming Guide, NVIDIA Corporation, Santa Clara, CA, USA, 2011.
[22] J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, "GPU computing," Proc. IEEE, vol. 96, no. 5, pp. 879-899, May 2008.
[23] R. Derakhshani, A. Ross, and S. Crihalmeanu, "A new biometric modality based on conjunctival vasculature," in Proc. Artif. Neural Netw. Eng., 2006, pp. 1-8.
[24] R. Derakhshani and A. Ross, "A texture-based neural network classifier for biometric identification using ocular surface vasculature," in Proc. Int. Joint Conf. Neural Netw., 2007, pp. 2982-2987.
[25] S. Crihalmeanu, A. Ross, and R. Derakhshani, "Enhancement and registration schemes for matching conjunctival vasculature," in Advances in Biometrics (Proc. 3rd IAPR/IEEE Int. Conf. Biometrics), 2009, pp. 1240-1249.
[26] R. Broekhuyse, "The lipid composition of aging sclera and cornea," Int. J. Ophthalmol., vol. 171, no. 1, pp. 82-85, 1975.
[27] M. Matsumoto and T. Nishimura, "Mersenne twister: A 623-dimensionally equidistributed uniform pseudo-random number generator," ACM Trans. Model. Comput. Simul., vol. 8, no. 1, pp. 3-30, 1998.
[28] H. Proença and L. A. Alexandre, "UBIRIS: A noisy iris image database," in Proc. 13th Int. Conf. Image Anal. Process., 2005, pp. 970-977.
[29] CUDA C Best Practices Guide, NVIDIA Corporation, Santa Clara, CA, USA, 2011.

Yong Lin received the B.S. degree in physics from


Ningxia University, China, in 1995, and the M.S.
degree in computer software from the Institute of
Software, Chinese Academy of Sciences, China,
in 2005. From 2010 to 2012, he was a Visiting
Scholar with Indiana University-Purdue University,
Indianapolis, IN, USA. Since 1995, he has been
with the Department of Computer Science, Ningxia
Teachers University. He is currently pursuing the
Ph.D. degree with the School of Computer Science and Technology, Xidian University, China. His
research interests include computer architecture, high performance computing
for biometrics, parallel computing, and GPUs.


Eliza Yingzi Du (SM'08) received the Ph.D. degree


in electrical engineering from the University of
Maryland, Baltimore County, Baltimore, in 2003,
and the B.S. and M.S. degrees in electrical engineering from the Beijing University of Posts and
Telecommunications, Beijing, China, in 1996 and
1999, respectively. She is currently a Director of
engineering with Qualcomm. From 2005 to 2013,
she was the Founding Director of the Biometrics
and Pattern Recognition Laboratory and a tenured
Professor with the Department of Electrical and
Computer Engineering, Purdue University, Indianapolis (IUPUI), IN, USA.
From 2003 to 2005, she was an Assistant Research Professor with the
Electrical Engineering Department, United States Naval Academy.
Her research interests include image processing, pattern recognition, and
biometrics. Her research has been funded by the Office of Naval Research,
National Institute of Justice, Department of Defense, National Science Foundation, Canada Border Services Agency, Indiana Department of Transportation, and several industry and IUPUI internal grants.
Dr. Du received the Office of Naval Research Young Investigator Award in
2007, the Indiana University Trustee Teaching Award in 2009, the Supervisor
of the Year Award at IUPUI in 2009, and the Best Paper Award with her students at the IEEE Workshop on Computational Intelligence in Biometrics: Theory,
Algorithms, and Applications in 2009. She is a member of the honor societies
Tau Beta Pi and Phi Kappa Phi.

Zhi Zhou (S'08) received the Ph.D. degree in


electrical engineering from Purdue University, West
Lafayette, in 2013, the M.S. degree in electrical
and computer engineering from Indiana University-Purdue University Indianapolis, Indianapolis, IN,
USA, in 2008, and the B.S. degree in electrical engineering from the Beijing University of Technology,
Beijing, China, in 2005. He is currently a Scientist
with the Allen Institute for Brain Science.
His research interests include image processing,
biometrics, image analysis, pattern recognition, data
mining, machine learning, and data visualization of large volume of 3-D
biological imaging data.
Dr. Zhou received the Best Paper Award in the IEEE Workshop on Computational Intelligence in Biometrics: Theory, Algorithms, and Applications in
2009.

N. Luke Thomas received the B.S. degree in electrical engineering and the M.S. degree in electrical and computer engineering from Indiana University-Purdue University Indianapolis, IN, USA, in 2010. He is currently in industry as a Software Engineer working on safety-critical engine control systems.
His research interests include algorithm development, biometrics, and pattern recognition.
