Balasubramanian Raman
Sanjeev Kumar
Partha Pratim Roy
Debashis Sen Editors
Proceedings of
International
Conference on
Computer Vision and
Image Processing
CVIP 2016, Volume 2
Advances in Intelligent Systems and Computing
Volume 460
Series editor
Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland
e-mail: [email protected]
About this Series
The series “Advances in Intelligent Systems and Computing” contains publications on theory,
applications, and design methods of Intelligent Systems and Intelligent Computing. Virtually
all disciplines such as engineering, natural sciences, computer and information science, ICT,
economics, business, e-commerce, environment, healthcare, life science are covered. The list
of topics spans all the areas of modern intelligent systems and computing.
The publications within “Advances in Intelligent Systems and Computing” are primarily
textbooks and proceedings of important conferences, symposia and congresses. They cover
significant recent developments in the field, both of a foundational and applicable character.
An important characteristic feature of the series is the short publication time and world-wide
distribution. This permits a rapid and broad dissemination of research results.
Advisory Board
Chairman
Nikhil R. Pal, Indian Statistical Institute, Kolkata, India
e-mail: [email protected]
Members
Rafael Bello Perez, Universidad Central “Marta Abreu” de Las Villas, Santa Clara, Cuba
e-mail: [email protected]
Emilio S. Corchado, University of Salamanca, Salamanca, Spain
e-mail: [email protected]
Hani Hagras, University of Essex, Colchester, UK
e-mail: [email protected]
László T. Kóczy, Széchenyi István University, Győr, Hungary
e-mail: [email protected]
Vladik Kreinovich, University of Texas at El Paso, El Paso, USA
e-mail: [email protected]
Chin-Teng Lin, National Chiao Tung University, Hsinchu, Taiwan
e-mail: [email protected]
Jie Lu, University of Technology, Sydney, Australia
e-mail: [email protected]
Patricia Melin, Tijuana Institute of Technology, Tijuana, Mexico
e-mail: [email protected]
Nadia Nedjah, State University of Rio de Janeiro, Rio de Janeiro, Brazil
e-mail: [email protected]
Ngoc Thanh Nguyen, Wroclaw University of Technology, Wroclaw, Poland
e-mail: [email protected]
Jun Wang, The Chinese University of Hong Kong, Shatin, Hong Kong
e-mail: [email protected]
Proceedings of International
Conference on Computer
Vision and Image Processing
CVIP 2016, Volume 2
Editors
Balasubramanian Raman
Department of Computer Science and Engineering
Indian Institute of Technology Roorkee
Roorkee, Uttarakhand, India

Partha Pratim Roy
Department of Computer Science and Engineering
Indian Institute of Technology Roorkee
Roorkee, Uttarakhand, India
Preface
the local organizing committee members with their unconditional help, and our
sponsors and endorsers with their timely support.
Finally, we would like to thank Springer for agreeing to publish the proceedings
in their prestigious Advances in Intelligent Systems and Computing (AISC) series.
We hope that the technical contributions made by the authors in these volumes, which present the proceedings of CVIP 2016, will be appreciated by one and all.
About the Editors
Partha Pratim Roy received his Ph.D. degree in Computer Science in 2010 from
Universitat Autònoma de Barcelona, Spain. He worked as postdoctoral research
fellow in the Computer Science Laboratory (LI, RFAI group), France and in
Synchromedia Lab, Canada. He also worked as Visiting Scientist at Indian Sta-
tistical Institute, Kolkata, India in 2012 and 2014. Presently, Dr. Roy is working as
Assistant Professor at Department of Computer Science and Engineering, Indian
Institute of Technology (IIT), Roorkee. His main research area is Pattern Recog-
nition. He has published more than 60 research papers in various international journals and conference proceedings. Dr. Roy has participated in several national and international projects funded by the Spanish and French governments. In 2009, he won the best student paper award at the International Conference on Document
Analysis and Recognition (ICDAR). He has gathered industrial experience while
working as an Assistant System Engineer in TATA Consultancy Services (India)
from 2003 to 2005 and as Chief Engineer in Samsung, Noida from 2013 to 2014.
Debashis Sen is Assistant Professor at the Department of Electronics and Electrical
Communication Engineering in Indian Institute of Technology (IIT) Kharagpur.
Earlier, from September 2014 to May 2015, he was Assistant Professor at the
Department of Computer Science and Engineering in Indian Institute of Technol-
ogy (IIT) Roorkee. Before joining Indian Institute of Technology, he worked as a
postdoctoral research fellow at School of Computing, National University of Sin-
gapore for about 3 years. He received his PhD degree from the Faculty of Engi-
neering, Jadavpur University, Kolkata, India in 2011 and his M.A.Sc. degree from
the Department of Electrical and Computer Engineering, Concordia University,
Montreal, Canada in 2005. He has worked at the Center for Soft Computing
Research of Indian Statistical Institute from 2005 to 2011 as a research scholar, and
at the Center for Signal Processing and Communications and Video Processing and
Communications group of Concordia University as a research assistant from 2003
to 2005. He is currently an associate editor of IET Image Processing journal. He has
co-convened the first international conference on computer vision and image pro-
cessing in 2016, and has served as a reviewer and program committee member of
more than 30 international journals and conferences. Over the last decade, he has
published in high-impact international journals, which are well cited, and has
received two best paper awards. He heads the Vision, Image and Perception group
in IIT Kharagpur. He is a member of Institute of Electrical and Electronics Engi-
neers (IEEE), IEEE Signal Processing Society and Vision Science Society (VSS).
His research interests include vision, image and video processing, uncertainty
handling, bio-inspired computation, eye movement analysis, computational visual
perception and multimedia signal processing.
Fingerprint Image Segmentation Using
Textural Features
1 Introduction
The increasing interest in security in recent years has drawn more and more attention to the recognition of people by means of biometric features. Admission to restricted areas, personal identification for financial transactions, lockers, and forensics are just a few examples of its applications. Though iris, face, fingerprint, voice, gait, etc. can be used as biometric characteristics, the most commonly used one is the fingerprint. Fingerprint sensors are relatively low priced and demand little effort from the user. Moreover, the fingerprint is the only biometric trait left by criminals at a crime scene.
A fingerprint is a pattern formed by the ridges and valleys of a finger. A fingerprint verification system consists of four main steps: acquisition, preprocessing, feature extraction, and matching. A fingerprint image is usually captured using a fingerprint scanner. A captured fingerprint image usually consists of two parts: the foreground and the background (see Fig. 1). The foreground originates from the contact of a fingertip with the sensor [1]. A ridge is a curve-like line segment on the finger, and the region between two adjacent ridges is defined as a valley. Fingerprint matching is performed by extracting feature points from a preprocessed fingerprint image. Minutiae (ridge endings and bifurcations) and singular points (core and delta) are the most commonly used feature points. For accurate matching, it is important that the features are extracted only from the foreground part. Therefore, fingerprint image segmentation plays a crucial role in the reliable extraction of feature points.
Several approaches to fingerprint image segmentation have been reported over the years. Bazen and Gerez [1] used pixel-based features such as local mean, local variance, and coherence for fingerprint segmentation. A linear combination of these features is then taken for segmentation. The coherence feature indicates how well the orientations over a neighborhood point in the same direction. Since the coherence measure is higher in the foreground than in the background, the combination of these features is used to characterize foreground and background. However, this algorithm is not robust to noise and is computationally costly, since it operates on pixel-level features. Gabor features of
the fingerprint image are used by Alonso-Fernandez et al. [2] for segmentation. It
is known that the Gabor response is higher in the foreground region than that in the
background region. Chen et al. [3] proposed a feature called cluster degree (CluD)
which is a block level feature derived from gray-level intensity values. CluD mea-
sures how well the ridge pixels are clustered in a block and they have stated that this
measure will be higher in the foreground. Harris corner point features are used by
Wu et al. [4] to separate foreground and background. The advantage of this approach
is that it is translation and rotation invariant. It has been stated that the strength of a
Harris point is much higher in the foreground area.
In this paper, a fingerprint segmentation algorithm is presented. The proposal is based on the observation that a fingerprint image can be viewed as an oriented texture pattern of ridges and valleys. Four block-level GLCM features, namely Contrast, Correlation, Energy, and Homogeneity, are proposed for fingerprint segmentation.
A linear classifier is trained for classifying each block of fingerprint image into fore-
ground and background. Finally a morphological operator is used to obtain compact
foreground clusters.
The rest of this paper is organized as follows. Section 2 describes the Gray-Level Co-occurrence Matrix (GLCM) and the proposed feature extraction method. Section 3 discusses the classification techniques and the proposed classifier. Section 4 gives the experimental results and discussion, and Sect. 5 draws the conclusion.
2 Fingerprint as Texture
In a fingerprint image, the flow of ridges and valleys can be observed as an oriented texture pattern [5]. Texture is a repeated pattern of local variations in image intensity. An important characteristic is that most textured images exhibit spatial relationships among their gray levels. Mutually distinct textures differ significantly in these relationships and can easily be discriminated using the joint distribution of gray-level co-occurrences over distance and orientation.
The Gray-Level Co-occurrence Matrix (GLCM) is one of the popular methods for computing second-order gray-level features for textural image analysis. GLCM was originally proposed by Haralick [6, 7]. The co-occurrence matrix is used to measure the relative probabilities of the gray-level intensity values present in an image. A co-occurrence matrix can be defined as a function P(i, j, d, θ) that estimates the probability that two neighboring pixel values co-occur in an image, one with gray level i and the other with gray level j, for a given distance d and direction θ. Usually, the distance d is taken as d = 1 or 2, and the direction θ takes the values 0°, 45°, 90°, and 135°. The following equations, given by [7], are used to compute the co-occurrence matrix:
P(i, j, d, 0°) = #{((k, l), (m, n)) ∈ (Lr × Lc) × (Lr × Lc) | k − m = 0, |l − n| = d, I(k, l) = i, I(m, n) = j}   (1)

P(i, j, d, 45°) = #{((k, l), (m, n)) ∈ (Lr × Lc) × (Lr × Lc) | (k − m = d, l − n = −d) or (k − m = −d, l − n = d), I(k, l) = i, I(m, n) = j}   (2)

P(i, j, d, 90°) = #{((k, l), (m, n)) ∈ (Lr × Lc) × (Lr × Lc) | |k − m| = d, l − n = 0, I(k, l) = i, I(m, n) = j}   (3)

P(i, j, d, 135°) = #{((k, l), (m, n)) ∈ (Lr × Lc) × (Lr × Lc) | (k − m = d, l − n = d) or (k − m = −d, l − n = −d), I(k, l) = i, I(m, n) = j}   (4)
where # denotes the number of elements in the set, and Lr and Lc are the spatial domains of the row and column dimensions, respectively.
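For illustration, a minimal NumPy sketch of the directional counts in Eqs. (1)–(4) is given below. The function name, the toy block, and the symmetric counting convention are assumptions made for this sketch, not code from the paper.

```python
import numpy as np

def glcm(img, offset, levels=8):
    """Count symmetric co-occurrences P(i, j) for one direction.

    offset is the (row, col) displacement of the second pixel, e.g.
    (0, d) for 0 deg, (-d, d) for 45 deg, (-d, 0) for 90 deg and
    (-d, -d) for 135 deg."""
    rows, cols = img.shape
    P = np.zeros((levels, levels), dtype=np.int64)
    dr, dc = offset
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                # count the ordered pair and its reverse, matching the
                # symmetric |l - n| = d style definitions of Eqs. (1)-(4)
                P[img[r, c], img[r2, c2]] += 1
                P[img[r2, c2], img[r, c]] += 1
    return P

# toy 8-level block; d = 2 and the four directions used in the paper
block = np.random.randint(0, 8, size=(15, 15))
d = 2
offsets = {0: (0, d), 45: (-d, d), 90: (-d, 0), 135: (-d, -d)}
glcms = {theta: glcm(block, off) for theta, off in offsets.items()}
```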
Haralick et al. [6] proposed 14 statistical features derived from each co-occurrence matrix computed by Eqs. (1)–(4). However, in this paper we use only four features, experimentally determined to successfully separate the foreground and background regions of fingerprint images. These are
Contrast: f1 = Σi,j |i − j|² p(i, j)

Correlation: f2 = Σi,j (i − μi)(j − μj) p(i, j) / (σi σj)

Energy: f3 = Σi,j p(i, j)²

Homogeneity: f4 = Σi,j p(i, j) / (1 + |i − j|)   (5)

where μi, μj are the means and σi, σj are the standard deviations of the rows and columns, respectively, and p(i, j) is the probability of i co-occurring with j at the chosen distance d.
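A short sketch of the four statistics of Eq. (5), computed from a normalized co-occurrence matrix, is given below; the helper name is illustrative.

```python
import numpy as np

def haralick_features(P):
    """Contrast, correlation, energy and homogeneity of a GLCM (Eq. 5)."""
    p = P / P.sum()                      # joint probabilities p(i, j)
    i, j = np.indices(p.shape)
    mu_i, mu_j = (i * p).sum(), (j * p).sum()
    sigma_i = np.sqrt(((i - mu_i) ** 2 * p).sum())
    sigma_j = np.sqrt(((j - mu_j) ** 2 * p).sum())
    contrast = (np.abs(i - j) ** 2 * p).sum()
    correlation = ((i - mu_i) * (j - mu_j) * p).sum() / (sigma_i * sigma_j)
    energy = (p ** 2).sum()
    homogeneity = (p / (1.0 + np.abs(i - j))).sum()
    return contrast, correlation, energy, homogeneity
```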
Fig. 2 Joint distributions of the combination of two block features for foreground and background
of Db_2
lower intensity value will be considered for processing. Zacharias et al. [8] state that the co-occurrence matrix may capture false textural information if an optimal value of the distance parameter d is not used, and would then tend to yield false GLCM features.
GLCM feature extraction is performed as follows. Divide the input fingerprint image into non-overlapping blocks of size W × W. From each block, compute the co-occurrence matrix in four directions with a predefined set of parameters using Eqs. (1)–(4). From each co-occurrence matrix, four GLCM features are computed using Eq. (5), giving a total of 16 features. To reduce the feature size for the classifier, we sum the values of each feature over the four directions to form a four-element feature vector represented as [Contrast Correlation Energy Homogeneity]. In our work, we use a block size W = 15, the fingerprint image is quantized to contain only 8 gray levels, and the distance d is taken as 2 when computing a co-occurrence matrix.
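The block-level pipeline (W = 15, 8 gray levels, d = 2, four directions, per-direction features summed into a 4-element vector) can be sketched with scikit-image, assuming a recent version that provides graycomatrix/graycoprops; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def block_feature_vector(block, levels=8, d=2):
    """[Contrast, Correlation, Energy, Homogeneity] summed over 4 directions."""
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
    P = graycomatrix(block, distances=[d], angles=angles,
                     levels=levels, symmetric=True, normed=True)
    feats = [graycoprops(P, prop)[0].sum()   # sum the 4 directional values
             for prop in ('contrast', 'correlation', 'energy', 'homogeneity')]
    return np.array(feats)

def fingerprint_features(img, W=15, levels=8):
    """Quantize to `levels` gray levels and extract one vector per W x W block."""
    q = (img.astype(np.float64) / 256.0 * levels).astype(np.uint8)  # 0..levels-1
    vectors = []
    for r in range(0, q.shape[0] - W + 1, W):
        for c in range(0, q.shape[1] - W + 1, W):
            vectors.append(block_feature_vector(q[r:r + W, c:c + W]))
    return np.array(vectors)
```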
To show the usefulness of these block features taken from the fingerprint, we have
shown (see Fig. 2) the joint distribution of the combination of two block features
taken from both the foreground and the background area. We have used standard
FVC2002 Db2_b dataset [9] for testing our algorithm.
3 Classification
Fingerprint segmentation is a problem which divides the fingerprint image into fore-
ground and background parts. Essentially this problem can be treated as a classifi-
cation problem to classify a fingerprint image into two classes: class foreground and
class background. For a classification problem, there are two main approaches: supervised learning and unsupervised learning. In the literature, many segmentation algorithms have been reported that use a supervised learning approach [1, 3] or an unsupervised learning approach [10]. In this paper, we use a supervised learning approach, since we already know which features allow the samples to be classified as either foreground or background. Several supervised learning algorithms have been reported in the literature, such as linear and quadratic discriminant functions, neural networks, K-nearest neighbor, decision trees, support vector machines, etc. [9]. However, in our work we use a linear classifier, namely logistic regression, as the segmentation classifier, since it has a low computational cost.
Let a variable y represent the class of a fingerprint sample; y = 0 means that the sample belongs to the background class and y = 1 means that the sample belongs to the foreground class. Let xj represent the jth feature of the sample. A logistic regression model is defined as:
g(z) = 1 / (1 + e^(−z))   (6)

where the function g(z) is known as the logistic function. The variable z is usually defined as

z = θᵀx = θ0 + Σ_{j=1}^{n} θj xj   (7)
where θ0 is called the intercept term and θ1, θ2, ..., θn are called the regression coefficients of x1, x2, ..., xn, respectively. Hence, logistic regression attempts to find a formula that gives the probability p(y = 1 | x, θ) that the sample represents the foreground class. Since only two classes are considered, the probability of the sample representing the background class is 1 − p(y = 1 | x, θ). The logistic regression is modeled as a linear combination of the features:
η = log(p / (1 − p)) = θ0 + θ1 x1 + θ2 x2 + ⋯ + θn xn   (8)

where θ0, θ1, ..., θn are the optimal parameters that minimize the following cost function:

J(θ) = −(1/m) Σ_{i=1}^{m} [ yi log pi + (1 − yi) log(1 − pi) ]   (9)
where m is the total number of samples and yi is the assigned class of sample xi. A new sample x is predicted to have class label 1 if p(y = 1 | x, θ) ≥ 0.5.
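A compact sketch of this logistic-regression classifier (Eqs. (6)–(9) and the decision rule of Eq. (12)) trained by gradient descent is shown below; the learning rate and iteration count are illustrative choices, not values from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # Eq. (6)

def train_logistic(X, y, lr=0.1, iters=5000):
    """X: (m, n) block feature vectors, y: (m,) labels (1 = foreground)."""
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])        # prepend intercept term theta_0
    theta = np.zeros(n + 1)
    for _ in range(iters):
        p = sigmoid(Xb @ theta)                 # p(y = 1 | x, theta)
        grad = Xb.T @ (p - y) / m               # gradient of the cost in Eq. (9)
        theta -= lr * grad
    return theta

def classify(X, theta):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(Xb @ theta) >= 0.5).astype(int)   # decision rule, Eq. (12)
```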
The proposed classifier is a linear classifier that tests a linear combination of the features, given by:
Fig. 3 a Before morphology, b boundary showing the regions classified as foreground (blue) and background (red), c after morphology
ω̂ = ω̂1 if g(z) ≥ 0.5, and ω̂ = ω̂0 if g(z) < 0.5   (12)
4 Experimental Results
The segmentation algorithm was tested on the standard FVC2002 Db2_a dataset [9]. In order to quantitatively measure the performance of the segmentation algorithm, we first manually identified the foreground and background blocks. Evaluation is done by comparing the manual segmentation with the segmentation results given by the classifier. The number of misclassifications is used as a performance measure.
The misclassification is given by:
p(ω̂1 | ω0) = Nbe / Nb

p(ω̂0 | ω1) = Nfe / Nf   (13)

perror = [ p(ω̂1 | ω0) + p(ω̂0 | ω1) ] / 2
where Nbe is the number of background classification errors, Nb is the total number of true background blocks in the image, and p(ω̂1 | ω0) gives the probability that a background block is classified as foreground. Nfe is the number of foreground classification errors, Nf is the total number of true foreground blocks in the image, and p(ω̂0 | ω1) is the probability that a foreground block is classified as background. The probability of error perror is the average of p(ω̂0 | ω1) and p(ω̂1 | ω0).
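The error measures of Eq. (13) reduce to a few lines; in this sketch, `pred` and `truth` are assumed to be block-level label arrays with 1 denoting foreground.

```python
import numpy as np

def segmentation_error(pred, truth):
    """Eq. (13): background/foreground misclassification rates and their mean."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    p_bg_as_fg = np.mean(pred[truth == 0] == 1)   # p(w1_hat | w0)
    p_fg_as_bg = np.mean(pred[truth == 1] == 0)   # p(w0_hat | w1)
    return p_bg_as_fg, p_fg_as_bg, (p_bg_as_fg + p_fg_as_bg) / 2.0
```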
The proposed fingerprint segmentation algorithm was trained on 800 block samples taken from 80 fingerprint images of the FVC2002 Db2_b dataset [9], 10 per image with equal numbers of background and foreground block samples labeled correspondingly, in order to obtain the optimal values of the regression coefficients. The resulting coefficient vector is used for classifying the fingerprint image into foreground and background. Table 1 gives the test results for 10 randomly selected images from the Db2_a dataset with respect to Eq. (13).
The effect of morphological operations as post-processing is quantified in Table 2, which consolidates the total error probabilities before and after morphological operations on the same 10 fingerprint images. The results show that the overall misclassification rate is very low (Fig. 4).
The overall performance is also quantified by how well the foreground region contains the feature parts. Since singular points are among the most important features for fingerprint classification and matching, the performance of the proposed method can be analyzed by finding how well the singular points of the fingerprint images are included by the segmentation algorithm. To conduct this analysis, out of the 800 images in the FVC2002 Db2_a dataset [9], we excluded 11 images in which the singular point is either absent or very near to the image border. Table 3 shows the result of this analysis. Even when the segmentation of a fingerprint image is moderate, the algorithm is able to include the singular point region in the foreground area for 98.6 % of the images. This shows the efficiency of the proposed algorithm, as the singular point is very important for the subsequent stages of fingerprint identification.

Fig. 4 Segmentation results of some fingerprint images from FVC2002 Db2_a dataset
5 Conclusion
References
1. Bazen, A. M., Gerez, S. H.: Segmentation of fingerprint images. In: ProRISC 2001 Workshop
on Circuits, Systems and Signal Processing, (2001)
2. Alonso-Fernandez, F., Fierrez-Aguilar, J., Ortega-Garcia, J.: An enhanced Gabor filter-based
segmentation algorithm for fingerprint recognition systems. In: Proceedings of the 4th Inter-
national Symposium on Image and Signal Processing and Analysis (ISPA 2005), pp. 239–244,
(2005)
3. Chen, X., Tian, J., Cheng, J., Yang, X.: Segmentation of fingerprint images using linear clas-
sifier. EURASIP Journal on Applied Signal Processing. 2004(4), pp. 480–494, (2004)
4. Wu, C., Tulyakov, S., Govindaraju, V.: Robust Point-Based Feature Fingerprint Segmentation
Algorithm. In: Lee, S.-W., Li, S.Z. (eds.) ICB 2007, LNCS, vol. 4642, pp. 1095–1103, Springer,
Heidelberg (2007)
5. Jain, A. K., Ross, A.: Fingerprint Matching Using Minutiae and Texture Features. In: Proceed-
ing of International Conference on Image Processing, pp. 282–285, (2001)
6. Haralick, R. M., Shanmugam, K., Dinstein, I.: Textural features for image classification. IEEE Trans. Syst. Man. Cybern. Vol. SMC-3, pp. 610–621, (1973)
7. Haralick, R. M.: Statistical and Structural Approaches to Texture. In: Proceedings of IEEE,
Vol. 67, No. 5, pp. 768–804, (1979)
12 R.C. Joy and M. Azath
8. Zacharias, G.C., Lal, P.S.: Combining Singular Point and Co-occurrence Matrix for Finger-
print Classification. In: Proceedings of the Third Annual ACM Bangalore Conference, pp.
1–6, (2010)
9. Maltoni, D., Maio, D., Jain, A.K., Prabhakar, S.: Handbook of fingerprint recognition (Second
Edition). Springer, New York (2009)
10. Pal, N.R., Pal, S.K.: A review on image segmentation techniques. Pattern Recognition. Vol.
26, No. 9, pp. 1277–1294, (1993)
11. Gonzalez, R. C., Wintz, P.: Digital Image Processing.2nd Edition. Addison-Wesley, (1987)
12. Soille, P., Morphological Image Analysis: Principles and Applications. Springer-Verlag, pp.
173–174, (1999)
Improved Feature Selection for Neighbor
Embedding Super-Resolution Using Zernike
Moments
Abstract This paper presents a new feature selection method for learning-based single image super-resolution (SR). The performance of learning-based SR strongly depends on the quality of the features. Better features produce a better co-occurrence relationship between low-resolution (LR) and high-resolution (HR) patches, which share the same local geometry in the manifold. In this paper, Zernike moments are used for feature selection. To generate a better feature vector, the luminance norm together with three Zernike moments is considered, which preserves the global structure. Additionally, a global neighborhood selection method is used to overcome the problem of blurring due to over-fitting and under-fitting during the K-nearest neighbor (KNN) search. Experimental analysis shows that the proposed scheme yields better recovery quality during HR reconstruction.
1 Introduction
Visual pattern recognition and analysis play a vital role in image processing and computer vision. However, they have several limitations due to image acquisition under unfavorable conditions. The super-resolution (SR) technique is used to overcome the limitations of sensors and optics [1]. Super-resolution is a useful signal processing technique.
The remainder of the paper is organized as follows. Section 2 describes the prob-
lem statement. Section 3 presents an overall idea about Zernike moment. Section 4
discusses the proposed algorithm for single image super-resolution using Zernike
moment. Experimental results and analysis are discussed in Sect. 5 and the conclud-
ing remarks are outlined in Sect. 6.
2 Problem Statement
Xl = DB(Xh ) , (1)
3 Background
In the field of image processing and pattern recognition, moment-based features play a vital role. The use of Zernike moments in image analysis was introduced by Teague [15]. Zernike moments are basically projections of the image information onto a set of complex polynomials that form a complete orthogonal set over the interior of the unit circle, i.e., √(x² + y²) ≤ 1.
The two-dimensional Zernike moments of an image intensity function f (x, y) of
order n and repetition m are defined as
Znm = ((n + 1)/π) ∬_{√(x² + y²) ≤ 1} f(x, y) V*nm(x, y) dx dy   (2)

where (n + 1)/π is a normalization factor. In discrete form, Znm can be expressed as

Znm = Σx Σy f(x, y) V*nm(x, y),  √(x² + y²) ≤ 1   (3)
The kernel of these moments is a set of orthogonal polynomials, where the complex polynomial Vnm can be expressed in polar coordinates (ρ, θ) as Vnm(ρ, θ) = Rnm(ρ) e^{jmθ} (4), with the radial polynomial Rnm given by

Rnm(ρ) = Σ_{s=0}^{(n−|m|)/2} (−1)^s (n − s)! ρ^{n−2s} / [ s! ((n + |m|)/2 − s)! ((n − |m|)/2 − s)! ]   (5)
The real and imaginary masks are deduced by a circular integral of the complex polynomials. On the whole, edge detection is conducted at the pixel level, and at each edge point the orthogonal moment method is used to accurately calculate the gradient direction. Since higher-order moments are more sensitive to noise, the first three second-order moments are employed for feature selection. The real and imaginary 7 × 7 homogeneous masks of M11 and M20 are deduced by circular integral of V*11 and V*20 [16]. Hence, the three Zernike moments are Z11R, Z11I, and Z20.
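A minimal sketch of Eqs. (3)–(5) that evaluates Z11 and Z20 on an image patch mapped onto the unit disc is given below. The mask-based formulation of [16] is approximated here by direct summation, and the (n + 1)/π normalization of Eq. (2) is retained; both are assumptions made for illustration.

```python
import numpy as np
from math import factorial

def radial_poly(rho, n, m):
    """Radial polynomial R_nm(rho) of Eq. (5)."""
    R = np.zeros_like(rho)
    for s in range((n - abs(m)) // 2 + 1):
        c = ((-1) ** s * factorial(n - s) /
             (factorial(s) * factorial((n + abs(m)) // 2 - s)
              * factorial((n - abs(m)) // 2 - s)))
        R += c * rho ** (n - 2 * s)
    return R

def zernike_moment(patch, n, m):
    """Discrete Z_nm (Eq. 3) over the unit disc inscribed in the patch."""
    N = patch.shape[0]
    y, x = np.mgrid[-1:1:N * 1j, -1:1:N * 1j]
    rho, theta = np.hypot(x, y), np.arctan2(y, x)
    mask = rho <= 1.0
    V_conj = radial_poly(rho, n, m) * np.exp(-1j * m * theta)   # V*_nm
    return (n + 1) / np.pi * np.sum(patch[mask] * V_conj[mask])

patch = np.random.rand(7, 7)
Z11 = zernike_moment(patch, 1, 1)      # Z11R = Z11.real, Z11I = Z11.imag
Z20 = zernike_moment(patch, 2, 0)
```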
In this section, a new feature selection method using Zernike moments is proposed for neighbor embedding based super-resolution. The feature vector is generated by combining the three Zernike moments with the luminance norm. Moreover, the neighborhood size for the K-nearest neighbor (KNN) search is determined by global neighborhood selection [17]. The overall block diagram of the proposed scheme is shown in Fig. 1.
ε^t = min ‖ y_l^t − Σ_{x_l^s ∈ Nt} w_ts x_l^s ‖²   (6)

subject to two constraints, i.e., Σ_{x_l^s ∈ Nt} w_ts = 1 and w_ts = 0 for any x_l^s ∉ Nt. This is generally used for normalization of the optimal weight vector, where Nt is the set of neighbors of y_l^t in the training set XL.

The local Gram matrix Gt plays an important role in calculating the weight w_t associated with y_l^t, and is defined as

Gt = (y_l^t 1ᵀ − X)ᵀ (y_l^t 1ᵀ − X)   (7)

where the ones column vector is used to match the dimensionality of X. The dimension of X is D × K, and its columns are the neighbors of y_l^t. The optimal weight vector Wt for y_l^t holds the weights w_ts of the individual neighbors, ordered by s. The weight is calculated as

w_t = Gt⁻¹ 1 / (1ᵀ Gt⁻¹ 1)   (8)
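A sketch of the weight computation of Eqs. (6)–(8) follows. The small regularization term added to the Gram matrix is a standard numerical-stability practice and not part of the paper.

```python
import numpy as np

def embedding_weights(y_t, neighbors, reg=1e-6):
    """Optimal reconstruction weights w_t for an LR patch y_t (Eqs. 6-8).

    y_t: (D,) feature vector; neighbors: (D, K) matrix whose columns are
    the K nearest LR training patches of y_t."""
    D, K = neighbors.shape
    diff = y_t[:, None] - neighbors            # y_t 1^T - X
    G = diff.T @ diff                          # local Gram matrix, Eq. (7)
    G += reg * np.trace(G) * np.eye(K)         # regularize for stability
    w = np.linalg.solve(G, np.ones(K))         # G^{-1} 1
    return w / w.sum()                         # Eq. (8): weights sum to 1
```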
Here, an efficient feature selection method for neighbor embedding based super-resolution is proposed. In [5, 7, 8], several features are used for better geometry preservation in the manifold, but structural consistency between the embedded neighborhood patches is still an issue. To overcome problems such as sensitivity to noise, recovery quality, and neighborhood preservation among the patches, the Zernike moment descriptor is used as an appropriate feature selection. Owing to the robustness to noise and the orthogonality of Zernike moments, the information is represented well. Basically, the features are selected from the luminance channel because the human visual system is most sensitive to it. The luminance norm is also considered as part of the features because it represents the global structure
of the image. For each pixel, the feature vector has four components, i.e., [LN, Z11R, Z11I, Z20]. As learning-based SR operates on patches, the feature vector of each patch has size 4p², where p is the patch size.
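Assuming the four per-pixel feature maps (luminance norm and the three Zernike responses) have already been computed for the LR image, the 4p²-dimensional patch feature is just their stacked patch values; the map names below are placeholders.

```python
import numpy as np

def patch_feature(feature_maps, r, c, p):
    """Stack a p x p patch from each of the 4 maps into a 4*p*p vector."""
    # feature_maps: list [LN, Z11R, Z11I, Z20], each an (H, W) array
    return np.concatenate([fm[r:r + p, c:c + p].ravel() for fm in feature_maps])
```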
Choosing the neighborhood size for locally linear embedding has a great influence on HR image reconstruction, because the neighborhood size K determines the local and global structure in the embedding space. Moreover, a fixed neighborhood size leads to over-fitting or under-fitting. To preserve the local and global structure, the neighbor embedding method searches for a transformation; hence, a global neighborhood selection method is used. The reason for global neighborhood selection is to preserve the small-scale structures in the manifold. To obtain the best reconstructed HR image, a good representation of the high-dimensional structure is required in the embedding space.
This method was introduced by Kouropteva et al. [17], where the residual variance is used as a quantitative measure that estimates the quality of the input-output mapping in the embedding space. The residual variance [18] is defined as 1 − ρ², where ρ is the standard linear correlation coefficient taken over all entries of the dX and dY matrices; the elements of the m × m matrices dX and dY are the Euclidean distances between pairs of patches in X and Y, respectively. According to [17], the lower the residual variance, the better the high-dimensional data representation. Hence, the optimal neighborhood size K = k_opt is computed hierarchically as the value that minimizes the residual variance.
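A sketch of this global neighborhood selection is given below: for each candidate K the data are embedded (through a placeholder `lle_embed` function that stands in for any LLE implementation) and the residual variance 1 − ρ² between the pairwise-distance matrices is evaluated; the K with the smallest residual variance is kept. The placeholder and the search range are assumptions.

```python
import numpy as np

def pairwise_dists(X):
    """Euclidean distance matrix between the rows of X."""
    sq = np.sum(X ** 2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))

def residual_variance(X, Y):
    dX, dY = pairwise_dists(X), pairwise_dists(Y)
    rho = np.corrcoef(dX.ravel(), dY.ravel())[0, 1]
    return 1.0 - rho ** 2

def best_k(X, lle_embed, k_range=range(1, 16)):
    """Pick the K whose embedding Y = lle_embed(X, K) minimizes 1 - rho^2."""
    scores = {k: residual_variance(X, lle_embed(X, k)) for k in k_range}
    return min(scores, key=scores.get)
```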
5 Experimental Results
To validate the proposed algorithm, simulations are carried out on standard images of different sizes: Parrot, Peppers, Lena, Tiger, Biker, and Lotus. In this experiment, a set of LR and HR pairs is required for training. Hence, LR images are generated from the ideal images by blurring each image with a 5 × 5 Gaussian kernel and decimating with a 3:1 decimation ratio along each axis. A comparative analysis has been made with respect to two performance measures, namely peak signal-to-noise ratio (PSNR) and the feature similarity index (FSIM) [19]. The value of FSIM lies between 0 and 1; larger values of PSNR and FSIM indicate better performance. To evaluate the performance of the proposed scheme, we compare our results with four schemes, namely Bicubic interpolation, EBSR [2], SRNE [5], and NeedFS [8].
Fig. 2 Test images: a Parrot, b Peppers, c Lena, d Tiger, e Biker, f Lotus
Table 1 PSNR and FSIM results for test images with 3× magnification
Images | Metric | Bicubic | EBSR [2] | SRNE [5] | NeedFS [8] | Proposed
Parrot | PSNR | 27.042 | 28.745 | 29.623 | 31.764 | 32.135
Parrot | FSIM | 0.8340 | 0.8458 | 0.8511 | 0.8603 | 0.8693
Peppers | PSNR | 28.756 | 29.137 | 30.969 | 32.111 | 33.249
Peppers | FSIM | 0.8397 | 0.8469 | 0.8582 | 0.8725 | 0.8839
Lena | PSNR | 29.899 | 30.117 | 31.826 | 33.026 | 34.762
Lena | FSIM | 0.8527 | 0.8702 | 0.8795 | 0.8889 | 0.9023
Tiger | PSNR | 24.549 | 25.771 | 26.235 | 27.909 | 28.423
Tiger | FSIM | 0.8239 | 0.8394 | 0.8403 | 0.8519 | 0.8604
Biker | PSNR | 25.009 | 26.236 | 27.169 | 28.669 | 29.973
Biker | FSIM | 0.8331 | 0.8481 | 0.8537 | 0.8601 | 0.8715
Lotus | PSNR | 26.829 | 27.787 | 28.979 | 30.276 | 31.862
Lotus | FSIM | 0.8338 | 0.8501 | 0.8637 | 0.8756 | 0.8904
Fig. 4 Visual comparison for the Lena image: a LR image, b Bicubic interpolation, c EBSR [2], d SRNE [5], e NeedFS [8], f Proposed method, g Ground truth image
The test images are shown in Fig. 2. Table 1 lists the PSNR and FSIM values for all test images; for each image, the first and second rows give the PSNR and FSIM values, respectively. The features generated by the Zernike moments for the Lena image are shown in Fig. 3. The visual comparisons for the Lena and Tiger images are shown in Figs. 4 and 5, respectively. To validate the performance of the proposed scheme, we compare the results with state-of-the-art approaches for different values of K. In SRNE [5], the K value is fixed, which leads to a blurring effect in the estimated HR image, whereas NeedFS [8] uses two different K values depending on whether a patch contains an edge. In our scheme, the K value lies between 1 and 15. Due to global neighborhood selection, our method gives better results in terms of both PSNR and FSIM, as shown in Fig. 6; the curves increase gradually for K values between 5 and 9, whereas the state-of-the-art methods give good results only for a certain K value.
Fig. 5 Visual comparison for the Tiger image: a LR image, b Bicubic interpolation, c EBSR [2], d SRNE [5], e NeedFS [8], f Proposed method, g Ground truth image
6 Conclusion
In this paper, we have proposed a new feature selection method for neighbor embed-
ding based super-resolution. The feature vector is generated by combining three
Zernike moments with the luminance norm of the image. The global neighborhood
size selection technique is used to find the K value for K-nearest neighbor search.
Improved Feature Selection for Neighbor Embedding . . . 23
Both qualitative and quantitative comparisons of the proposed method have been carried out with state-of-the-art methods. The results show that the proposed method is superior to the other methods in terms of PSNR and FSIM values. However, for texture-based images, edge preservation is still an issue, which will be addressed in our future work.
References
1. Park, S.C., Park, M.K., Kang, M.G.: Super-resolution image reconstruction: a technical
overview. IEEE Signal Processing Magazine 20(3) (2003) 21–36
2. Freeman, W.T., Jones, T.R., Pasztor, E.C.: Example-based super-resolution. IEEE Computer
Graphics and Applications 22(2) (2002) 56–65
3. Kim, K.I., Kwon, Y.: Example-based learning for single-image super-resolution. In: Proceed-
ings of the 30th DAGM Symposium on Pattern Recognition. (2008) 456–465
4. Li, D., Simske, S.: Example based single-frame image super-resolution by support vector
regression. Journal of Pattern Recognition Research 1 (2010) 104–118
5. Chang, H., Yeung, D.Y., Xiong, Y.: Super-resolution through neighbor embedding. In: IEEE
Computer Society Conference on Computer Vision and Pattern Recognition. Volume 1. (2004)
275–282
6. Chan, T.M., Zhang, J.: Improved super-resolution through residual neighbor embedding. Jour-
nal of Guangxi Normal University 24(4) (2006)
7. Fan, W., Yeung, D.Y.: Image hallucination using neighbor embedding over visual primitive
manifolds. In: IEEE Conference on Computer Vision and Pattern Recognition. (June 2007)
1–7
8. Chan, T.M., Zhang, J., Pu, J., Huang, H.: Neighbor embedding based super-resolution algo-
rithm through edge detection and feature selection. Pattern Recognition Letters 30(5) (2009)
494–502
9. Liao, X., Han, G., Wo, Y., Huang, H., Li, Z.: New feature selection for neighbor embedding
based super-resolution. In: International Conference on Multimedia Technology. (July 2011)
441–444
10. Mishra, D., Majhi, B., Sa, P.K.: Neighbor embedding based super-resolution using residual
luminance. In: IEEE India Conference. (2014) 1–6
11. Gao, X., Zhang, K., Tao, D., Li, X.: Joint learning for single-image super-resolution via a
coupled constraint. IEEE Transactions on Image Processing 21(2) (2012) 469–480
12. Gao, X., Zhang, K., Tao, D., Li, X.: Image super-resolution with sparse neighbor embedding.
IEEE Transactions on Image Processing 21(7) (2012) 3194–3205
13. Bevilacqua, M., Roumy, A., Guillemot, C., Morel, M.L.A.: Super-resolution using neigh-
bor embedding of back-projection residuals. In: International Conference on Digital Signal
Processing. (2013) 1–8
14. Gao, X., Wang, Q., Li, X., Tao, D., Zhang, K.: Zernike-moment-based image super resolution.
IEEE Transactions on Image Processing 20(10) (2011) 2738–2747
15. Teague, M.R.: Image analysis via the general theory of moments. Journal of the Optical Society
of America 70 (1980) 920–930
16. Xiao-Peng, Z., Yuan-Wei, B.: Improved algorithm about subpixel edge detection based on
zernike moments and three-grayscale pattern. In: International Congress on Image and Signal
Processing. (2009) 1–4
17. Kouropteva, O., Okun, O., Pietikinen, M.: Selection of the optimal parameter value for
the locally linear embedding algorithm. In: International Conference on Fuzzy Systems and
Knowledge Discovery. (2002) 359–363
18. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Sci-
ence 290(5500) (2000) 2323–2326
19. Zhang, L., Zhang, D., Mou, X., Zhang, D.: Fsim: A feature similarity index for image quality
assessment. IEEE Transactions on Image Processing 20(8) (2011) 2378–2386
Target Recognition in Infrared Imagery
Using Convolutional Neural Network
Abstract In this paper, a deep learning based approach is advocated for the automatic recognition of civilian targets in thermal infrared images. The high variability of target signatures and the low contrast ratio of targets to background make target recognition in infrared images challenging, demanding robust, adaptable methods capable of capturing these variations. As opposed to traditional shallow learning approaches, which rely on hand-engineered feature extraction, deep learning based approaches use environmental knowledge to learn and extract the features automatically. We present a convolutional neural network (CNN) based deep learning framework for automatic recognition of civilian targets in infrared images. The performance evaluation is carried out on infrared target clips obtained from the ‘CSIR-CSIO moving object thermal infrared imagery dataset’. The task involves classification into four categories: one category representing the background and three categories of targets - ambassador, auto, and pedestrians. The proposed CNN framework provides a classification accuracy of 88.15 % with all four categories and 98.24 % with only the three target categories.
1 Introduction
With the advancement of computer technology and the availability of high-end computing facilities, research in the area of recognition is gaining momentum across a wide range of applications such as defense [1, 2], underwater mines [3], face recognition, etc. Target recognition is a crucial area of research from a security point of view. A generalized recognition system consists of two stages: a feature extraction stage followed by a classifier stage. The feature extraction stage takes the detected target region and performs computation to extract information in the form of features. This information is fed to the classifier, which categorizes the target into the most relevant target class. The performance of recognition algorithms is highly dependent on the extracted features. Imaging systems that capture data in the visible spectrum fail to perform during night time and under dark conditions; they need strong artificial illumination to capture data [4, 5]. Thermal infrared imaging systems, which work in the infrared band of the electromagnetic spectrum, sense the heat released by objects above absolute zero temperature and form an image [6], and are thereby capable of working in no-light conditions. The heat sensing capability of thermal infrared imaging makes it superior to visible imaging [7, 8]. However, the variability of target infrared signatures due to a number of environment and target parameters poses a challenge to researchers working towards the development of automated recognition algorithms [9]. In this paper, we present a robust recognition framework for target recognition in infrared images.
2 Related Work
Recent trends in recognition show researchers employing neural network based approaches. These approaches learn from experience: similar to the human brain, neural networks extract information from the external environment. They have been widely applied in character recognition [10], horror image recognition [11], face recognition [12], and human activity recognition [13]. These methods provide better performance than the classical methods [14]. They can be broadly classified into shallow learning and deep learning methods.
Commonly used learning based classifiers such as the support vector machine (SVM), radial basis function neural network (RBFNN), k-nearest neighbor method (k-NN), modular neural network (MNN), and tree classifiers are shallow learning methods that consider hand-engineered features obtained using commonly used methods such as the local binary pattern [15], principal component analysis, the scale invariant feature transform (SIFT) [16], and the histogram of oriented gradients (HOG) [17]. Feature selection is a time-consuming process, and one needs to specifically work towards identifying and tuning features that are robust for a particular application. On the other hand, deep learning based methods employ learning based feature extraction using hierarchical layers, providing a single platform for feature extraction and classification [18]. The convolutional neural network, which is a deep learning method, has shown state-of-the-art performance in various applications such as the MNIST handwriting dataset [19], the Large Scale Visual Recognition Challenge 2012 [20], and house number digit classification [21]. Convolutional neural networks have also been shown to provide more tolerance to variable conditions such as pose, lighting, and surrounding clutter [22].
Convolutional neural networks (CNN) are feed-forward neural networks with a hierarchy of layers. They combine the two stages of recognition, the feature extraction and classification stages, in a single architecture. Figure 1 shows a representative architecture of a deep convolutional neural network. The feature extraction stage consists of convolution layers and subsampling layers. Both kinds of layers have multiple planes, which are called feature maps, and a network may have multiple feature extraction stages. A feature map is obtained by processing the input image or the previous layer's output with a kernel; the operation may be convolution (as in a convolution layer) or subsampling (averaging or pooling, as in a subsampling layer). Each pixel in a feature map is a neuron. A neuron in a convolution layer receives the weighted sum of inputs (the convolved result of the local receptive field of the input with the kernel) from the previous layer and a bias, which is then passed through an activation function.
Fig. 1 Representative architecture of a deep convolutional neural network: the input image passes through a feature extraction stage of convolution and subsampling layers, followed by a classification stage with a fully connected layer
ul = v ul−1   (3)
4 Experimental Dataset
‘CSIR-CSIO Moving Object Thermal Infrared Imagery Dataset’ is used for vali-
dating the performance of the proposed deep learning framework [26]. Detected
target regions obtained from the moving target detection method presented in [9]
are used to train and test the system. It was observed that the detection algorithm
presented some false positives where background was also detected as moving
target. To handle this, in this work, we have designed a four class classifier system,
one class representing background and three classes representing targets—ambas-
sador, auto-rickshaw and pedestrian.
The CNN architecture, adapted from the LeNet-5 architecture [23] as shown in Fig. 1, is designed with ten layers. Before training the neural network, every image in the experimental dataset is resized to a square of 200 × 200. The input image pixel values are also normalized to zero mean and unit standard deviation according to Eq. (4); the normalization improves the convergence speed of the network [27].
x(new) = (x(old) − m) / sd   (4)
where x(new) is the preprocessed pixel value, x(old) is the original pixel value, m is the mean of the image pixels, and sd is the standard deviation of the pixels. x(new) is the new zero-mean, unit-standard-deviation value. The original image is first resized to 200 × 200 and then normalized.
The feature maps of each layer are chosen according to Eqs. (1) and (3). The first layer is a dummy layer with sampling rate 1, included only for the symmetry of the architecture. The second layer is a convolutional layer (C2) with 6 feature maps; the kernel size for this layer is 3 × 3, and the size of each feature map in C2 is 198 × 198. The third layer is a subsampling layer (S3) with 6 feature maps and a subsampling rate of 2; the connections from C2 to S3 are one to one, and the size of each feature map is 99 × 99. The fourth layer is a convolutional layer (C4) with 12 feature maps, each of size 96 × 96, with a kernel size of 4 × 4; the connections from S3 to C4 are random. The fifth layer is a subsampling layer (S5) with 12 feature maps and a subsampling rate of 4; the connections from C4 to S5 are one to one, and the size of each feature map in S5 is 24 × 24. The sixth layer is a convolutional layer (C6) with 24 feature maps, each of size 20 × 20, with a kernel size of 5 × 5; the connections from S5 to C6 are random. The seventh layer is a subsampling layer (S7) with 24 feature maps and a subsampling rate of 4; the connections from C6 to S7 are one to one, and the size of each feature map in S7 is 5 × 5. The eighth layer is a convolutional layer (C8) with 48 feature maps, each of size 1 × 1, with a kernel size of 5 × 5; the connections from S7 to C8 are random. The fully connected layer has a variable number of hidden neurons, which is varied during the simulations. The output layer has 4 neurons corresponding to the four categories. A simplified sketch of this architecture is given below.
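The layer sizes above can be reproduced with a compact PyTorch sketch. It is a simplified stand-in that uses full inter-layer connections, plain average pooling, tanh activations instead of the scaled bipolar sigmoid, and an assumed hidden-layer width; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class InfraredTargetCNN(nn.Module):
    """LeNet-5 style network for 1 x 200 x 200 inputs and 4 output categories."""
    def __init__(self, hidden=100, n_classes=4):   # hidden size is illustrative
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=3),    # C2: 6 maps, 198 x 198
            nn.AvgPool2d(2), nn.Tanh(),        # S3: 6 maps, 99 x 99
            nn.Conv2d(6, 12, kernel_size=4),   # C4: 12 maps, 96 x 96
            nn.AvgPool2d(4), nn.Tanh(),        # S5: 12 maps, 24 x 24
            nn.Conv2d(12, 24, kernel_size=5),  # C6: 24 maps, 20 x 20
            nn.AvgPool2d(4), nn.Tanh(),        # S7: 24 maps, 5 x 5
            nn.Conv2d(24, 48, kernel_size=5),  # C8: 48 maps, 1 x 1
            nn.Tanh(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(48, hidden), nn.Tanh(),  # fully connected layer
            nn.Linear(hidden, n_classes),      # 4 output neurons
        )

    def forward(self, x):
        return self.classifier(self.features(x))

net = InfraredTargetCNN()
scores = net(torch.randn(1, 1, 200, 200))      # -> tensor of shape (1, 4)
```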
The kernels (weights) and biases of the convolution layers are initialized randomly. The weights of the subsampling layers are initialized to unit value with zero bias. The activation function of the convolutional-layer neurons is linear, and a scaled bipolar sigmoid is used for all other neurons. The scaled activation function is given in Eq. (5).
F(n) = 1.7159 × ( 2 / (1 + exp(−1.33 n)) − 1 )   (5)
where n is the weighted output of the neuron and F(n) is the output obtained after applying the activation function. The number of neurons in the output layer is equal to the number of categories. The neural network is trained in such a way that the true category neuron corresponds to +1 and the others to −1. The bipolar sigmoid is scaled by 1.7159 and the slope-control constant is set to 1.33, as used in [23]. The scaling improves the convergence of the system.
The network is trained with the Stochastic Diagonal Levenberg-Marquardt method [28]. At the kth learning iteration, each free parameter wk is updated according to the stochastic update rule in Eq. (6):
wk(k + 1) ← wk(k) − εk ∂E^p/∂wk   (6)

where εk is the learning rate, η is the constant step size, which is controlled by the second-order error term, μ is a hand-picked constant, and hkk is the estimate of the second-order derivative of the loss function with respect to the connection weights uij, as given in Eq. (8):

hkk = Σ ∂²E/∂u²ij   (8)
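The per-weight learning rate (the original Eq. (7)) is not reproduced in the extracted text; the sketch below assumes the standard stochastic diagonal Levenberg-Marquardt form εk = η/(μ + hkk) from LeCun et al., so that expression should be treated as an assumption.

```python
import numpy as np

def sdlm_step(w, grad, h_diag, eta=0.001, mu=0.02):
    """One stochastic diagonal Levenberg-Marquardt update (Eq. 6).

    w, grad : flattened weights and dE/dw for the current sample
    h_diag  : running estimate of the diagonal second derivatives h_kk (Eq. 8)
    """
    eps = eta / (mu + h_diag)          # per-weight rate (assumed form of Eq. 7)
    return w - eps * grad
```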
The preliminary results reported in this work demonstrate that a deep learning based automatic feature extraction and classification system can accurately classify civilian targets in infrared imagery. The proposed system could classify the
References
Abstract This paper presents a novel prediction error expansion (PEE) based reversible watermarking scheme that uses the 3 × 3 neighborhood of a pixel. The use of a good predictor is important in this kind of watermarking scheme. In the proposed predictor, the original pixel value is predicted from a selected subset of the eight neighbors of the pixel. Moreover, the value of the prediction error expansion is optimally divided between the current pixel and its top-diagonal neighbor such that distortion remains minimal. Experimental results show that the proposed predictor with optimal embedding outperforms several other existing methods.
1 Introduction
Multimedia has become more popular in modern life with the rapid growth of communication systems. Digital media is widely used in various applications and is shared in various forms with sufficient security. However, advancements in signal processing operations have led to malicious attacks on and alterations of this content. Watermarking is a technique to protect such content, or the ownership of the content, against such malicious threats. A watermark is embedded in the multimedia content to achieve the said purpose. Later, the same watermark is extracted to
Similar to the work in [13], the proposed work presents a novel predictor based on the uniformity of the pixels in a neighborhood. Unlike the approach in [13], however, it considers the eight-neighborhood of a pixel. The neighbors are grouped into four different pairs (horizontal, vertical, diagonal, and anti-diagonal). A strategy has been devised to consider some of these groups to predict the center pixel value: low diversity between the pixels of each group and closeness of the group averages decide which groups are considered for prediction. Coupled with an optimal embedding scheme, the proposed prediction error based reversible watermarking scheme not only outperforms the four-neighbor based method [13], but also scores better than the recent gradient based method in [12].
The outline of the paper is as follows: The proposed prediction scheme based
on a select of the 8-neighborhood is depicted in Sect. 2. Proposed optimal embed-
ding scheme is explained in Sect. 3. Extraction of watermark is described in Sect. 4.
Experimental results are presented in Sect. 5. Finally, the conclusion is drawn in
Sect. 6.
dh = |xm,n−1 − xm,n+1|
dv = |xm−1,n − xm+1,n|
dd = |xm−1,n−1 − xm+1,n+1|
da = |xm−1,n+1 − xm+1,n−1|   (1)

ah = ⌊(xm,n−1 + xm,n+1) / 2⌋
av = ⌊(xm−1,n + xm+1,n) / 2⌋
ad = ⌊(xm−1,n−1 + xm+1,n+1) / 2⌋
aa = ⌊(xm−1,n+1 + xm+1,n−1) / 2⌋   (2)
Only homogeneous (less diverse) groups are considered for predicting the center pixel value. Hence, the group of pixels having the least diversity is considered for estimating the current pixel. Let the four diversity values in Eq. (1) be sorted in non-decreasing order and denoted d1, d2, d3, and d4 (d1 being the smallest of the four). Moreover, let the averages in the four directions (as computed in Eq. (2)) be sorted in non-decreasing order of the diversities in their respective directions, and let the sorted values be a1, a2, a3, and a4, where ai corresponds to the direction having diversity di. These average values act as predicted values in their respective directions. At first, the predicted value a1 of the least diverse group (with diversity value d1) is considered to predict the central value. Additionally, the predictions in the other directions are considered only if the predicted (average) values in those directions are also close enough to a1. A threshold T decides the closeness of these average values to a1 (the value of T is taken as 1 in our experiments). To focus on the groups of less diverse pixels, the closeness of these average values is tested iteratively, starting with the second least diverse group. The complete procedure is given in Algorithm 1, where the iteration has been unrolled into if-else constructs for the three other groups (apart from the least diverse group). Ultimately, if the predicted (i.e., average) values of all four groups are similar enough, then the average of all four prediction values (i.e., of the averages of the individual groups) predicts the center pixel value. A sketch of this prediction procedure follows.
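Since the Algorithm 1 listing itself is not reproduced here, the sketch below follows the prose description: it computes the directional diversities and averages of Eqs. (1)–(2), sorts by diversity, and keeps consecutively the averages that stay within T of the least-diverse one; the exact stopping rule is an interpretation of the text.

```python
import numpy as np

def predict_center(x, m, n, T=1):
    """Predict pixel x[m, n] from its 3 x 3 neighborhood (Eqs. 1-2)."""
    pairs = [
        (x[m, n - 1], x[m, n + 1]),          # horizontal
        (x[m - 1, n], x[m + 1, n]),          # vertical
        (x[m - 1, n - 1], x[m + 1, n + 1]),  # diagonal
        (x[m - 1, n + 1], x[m + 1, n - 1]),  # anti-diagonal
    ]
    d = np.array([abs(int(p) - int(q)) for p, q in pairs])    # diversities
    avg = np.array([(int(p) + int(q)) // 2 for p, q in pairs])  # averages
    order = np.argsort(d, kind='stable')     # non-decreasing diversity
    a_sorted = avg[order]
    keep = 1
    # include further groups while their averages stay within T of a1
    while keep < 4 and abs(int(a_sorted[keep]) - int(a_sorted[0])) <= T:
        keep += 1
    return int(np.floor(a_sorted[:keep].mean()))
```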
This means that the estimated error and the watermark information are directly added to the pixel intensity of the cover image. At detection, the estimate of the watermarked pixel should not change, so that the prediction error can be regenerated from the watermarked pixels. Based on the estimated value of the watermarked pixel,
the prediction error is determined. Then, the watermark bit w is taken as the least significant bit of X − x′, namely

w = (X − x′) − 2 ⌊(X − x′) / 2⌋   (4)

Then, the original cover image pixel x is computed as

x = (X + x′ − w) / 2   (5)
where 0 ≤ L ≤ 1. After inserting the optimal value d into the top-diagonal pixel, the remaining amount (PW1 = PW − d) is added to the original value of the current pixel:

x = x + PW1   (7)
Now the context has been modified due to the addition of the value d to the top-diagonal pixel, giving the modified context Ndx.
To obtain the optimal fraction value L, the optimization of the embedding error is determined as in [15], based on the minimum square error (MSE):

MSE = (x − Xd)² + Σi Σj (Nx(i, j) − Ndx(i, j))²   (10)

where x and Xd are the original and watermarked pixel values, and Nx and Ndx are the original and modified contexts. In the proposed case, the above equation can be rewritten as

MSE = (x − Xd)² + (xm−1,n−1 − (xm−1,n−1 + d))²   (11)
Moreover, as Xd is the new watermarked pixel value, both the top-diagonal and the current pixel are modified during embedding. Based on the above equation, the minimum value of d is obtained as (1/2)PW, or equivalently L = 0.50. Thus, the function f in Eq. (8) splits the data between the present pixel and its context, whereas L controls the optimal fraction of d to be embedded in the top-diagonal pixel. The optimal embedding is used in the pixel locations where the prediction error (PE) falls within
4 Extraction of Watermark
As the embedding is carried out in raster-scan order, the extraction is performed in the opposite order, from lower right to top left. The estimated value X′ is computed using the predictor in Sect. 2 from the context of a pixel in the watermarked image.
The reversibility of the modified scheme then follows immediately. The amount of prediction error recovered at detection is

PE1 = Xd − X′   (12)

w = PE1 − 2 ⌊PE1 / 2⌋   (13)

The optimal fraction value d can then be computed. After regenerating the original context through the function f⁻¹ and computing PW, the original pixel x is computed as

x = Xd − PW + d   (18)
5 Experimental Results
In this section, experimental results for the proposed reversible watermarking based on the eight-neighborhood with optimal embedding are presented. Four standard test images (Lena, Barbara, Mandrill, and Boat) of size 512 × 512 pixels are considered for evaluation; they are shown in Fig. 2. The peak signal-to-noise ratio (PSNR) between the cover image and the watermarked image is used as the evaluation metric. It quantifies the distortion in the watermarked image due to the watermark embedding. The outcome of the proposed method is compared with the outcomes of the extended gradient based selective weighting (EGBSW) method [12] and the rhombus average method [13] for various embedding rates. The proposed method outperforms both of these methods at various embedding rates, as can be observed from the values in Table 1; a higher PSNR value indicates a better result. The comparison among these methods using the PSNR value at various embedding rates is also plotted in Fig. 3. Moreover, it has been observed that the original image can be perfectly restored after extracting the watermark.
Fig. 2 Test Images (Cover Images (a–d), Watermarked Images (e–h) with embedding rate of 0.8
bpp): Lena, Barbara, Mandrill, and Boat
Fig. 3 PSNR [dB] versus embedding bit-rate [bpp] for the Lena, Barbara, Mandrill, and Boat images, comparing Rhombus Average [13], EGBSW [12], and the proposed method
6 Conclusion
References
1 Introduction
which a generic biometric system can be attacked. However, amongst the many identified issues, the stolen biometric scenario, where an imposter is able to spoof by providing a stolen biometric sample of the genuine user, is the current threat to deal with. Database attacks lead to permanent template compromise, where an attacker uses the stored biometric data to obtain illegitimate access. As biometric data is increasingly shared among various applications, cross matching of different databases may be performed to track an individual. Unlike passwords or PINs, biometric templates cannot be revoked on theft. A biometric template is permanently associated with a particular individual and, once compromised, it is lost permanently. Moreover, the same template is stored across different application databases, which can be exploited by a cross-matching attack. Template data, once compromised for one application, is rendered compromised and unsafe for all other applications for the entire lifetime of the user. The concept of cancelable biometrics addresses these concerns. Instead of the original biometrics, it uses transformed versions for storage and matching. In case of an attack, the compromised template can be revoked and new transformed versions can be easily generated.
The objective of this work is to generate biometric templates which can be canceled like passwords while at the same time providing non-repudiation and performing like generic biometric templates. Cancelability is achieved by first subjecting the image to the Hadamard transform (HT) and then projecting it onto a random matrix whose columns are independent vectors taking the values −1, +1, or 0 with probabilities 1/6, 1/6, and 2/3, respectively. The sample is then subjected to the inverse HT, followed by a one-way modulus hashing based on a vector computed in the Hadamard domain.
The organization of the paper is as follows. A formal definition of cancelable biometrics
and related work is provided in Sect. 2. It is followed by the proposed template
transformation approach explained in Sect. 3. The experimental results are covered
in Sect. 4, and finally the work is concluded in Sect. 5.
2 Cancelable Biometrics
Teoh et al. (2004) proposed BioHashing, which salts biometric features by projecting
them on user-specific random matrices (Random Projection) followed by thresholding
to generate binary codes. BioHashing becomes invertible if the binary codes
and the user-specific random matrices are compromised, and a pre-image attack can then be
mounted to recover the original data [2]. Sutcu et al. (2005) proposed a nonlinear
transformation based salting technique known as robust hashing [3]. The technique
is non-invertible, but the hashed templates tend to compromise on discriminability.
Teoh et al. (2006) proposed BioPhasor, which iteratively mixes biometric features
with user-specific random data in a non-invertible fashion without losing discriminability [4].
To address the invertibility of BioHashing, Teoh and Yuang (2007) proposed
salting techniques involving Multispace Random Projections (MRP) [5].
Further, Lumini et al. (2007) combined Multispace Random Projections, variable
thresholding, and score-level fusion to enhance performance [6].
Non-invertible transformations are many-to-one functions that easily transform
biometric data into a new mapping space. Ratha et al. (2007) generated non-invertible
cancelable fingerprint templates by distorting minutiae features using Cartesian,
polar, and surface-folding transformation functions [7]. Tulyakov et al. (2005) distorted
minutiae features using polynomial-based one-way symmetric hash functions
[8]. Ang et al. (2005) generated cancelable minutiae features using a key-dependent
geometric transformation technique [9]. Boult et al. (2007) generated revocable
biometric identity tokens from face and fingerprint templates by using one-way
cryptographic functions. The technique separates the data into two parts, such that the
integer part is used for encryption and the fractional part is used for robust distance
computations [10]. Farooq et al. (2007) and Lee et al. (2009) extracted rotation- and
translation-invariant minutiae triplets to generate cancelable bit-string features [11].
Each of the above-mentioned approaches has its own advantages and disadvantages.
BioHashing and other salting techniques are effective but remain vulnerable
to inversion, and their performance degrades considerably in the stolen-token
scenario. Non-invertible transforms tend to sacrifice the discriminability of the transformed
biometric in order to achieve irreversibility, which degrades performance.
A cancelable biometric technique must therefore balance non-invertibility, discriminability, and
performance. This work is motivated towards designing a transformation approach
such that the templates are easy to revoke, difficult to invert, and maintain performance
in the stolen-token scenario.
3 Template Transformation
Along with the basics of the Hadamard transform and Random Projection, the proposed
template transformation technique is discussed here.
$H_n = H_1 \otimes H_{n-1} \qquad (1)$

$H_1 = \frac{1}{\sqrt{2}} \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \qquad (2)$
Since the elements of the Hadamard matrix H_n are real, containing only +1 and –1
(up to scaling), they are easy to store and compute with. H_n is an orthogonal matrix. HT has
good energy-packing properties, but it cannot be considered a frequency transform
due to its non-sinusoidal nature; the number of sign changes along each row of H_n, called its
sequency, exhibits frequency-like characteristics. HT is fast, as its computation
requires only addition and subtraction operations and can be performed in
O(N log2 N) operations. For a 2-D image I of dimensions N × N, where N = 2^n, the
forward and inverse transforms are performed using Eqs. 3 and 4, respectively.
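Equations 3 and 4 are not reproduced in this excerpt. The sketch below builds the orthonormal Hadamard matrix from Eqs. 1 and 2 and assumes the common convention for the 2-D transform, F = H_n I H_n with the inverse being the same operation (since H_n is orthonormal and symmetric); the function names are ours.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix H_n of size 2^n x 2^n built via Eqs. (1)-(2)."""
    H = np.array([[1.0]])
    H1 = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
    for _ in range(n):
        H = np.kron(H1, H)                 # H_n = H_1 (x) H_{n-1}
    return H

def ht2(I: np.ndarray) -> np.ndarray:
    """2-D forward Hadamard transform of an N x N image, N = 2^n (assumed convention)."""
    H = hadamard(int(np.log2(I.shape[0])))
    return H @ I @ H

def iht2(F: np.ndarray) -> np.ndarray:
    """2-D inverse transform; identical to the forward transform for orthonormal H_n."""
    H = hadamard(int(np.log2(F.shape[0])))
    return H @ F @ H

I = np.random.rand(8, 8)
assert np.allclose(iht2(ht2(I)), I)        # perfect reconstruction
```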
$A(i, j) = \sqrt{3} \begin{cases} +1, & \text{with probability } 1/6; \\ 0, & \text{with probability } 2/3; \\ -1, & \text{with probability } 1/6. \end{cases} \qquad (5)$
This allows the projected data to be computed using simple addition and subtraction
operations and is well suited for database environments. Detailed proofs and deeper
insights into the distance-preservation property of projections using the Achlioptas
matrix can be found in [13, 16, 17].
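A matrix with the distribution of Eq. 5 can be generated in a few lines of NumPy. The sketch below is illustrative only; the shape convention (projecting row vectors from d to k dimensions) and the function name are our assumptions.

```python
import numpy as np

def achlioptas_matrix(d: int, k: int, seed: int = 0) -> np.ndarray:
    """Database-friendly random projection matrix with entries drawn per Eq. (5):
    sqrt(3) * (+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6)."""
    rng = np.random.default_rng(seed)
    entries = rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1 / 6, 2 / 3, 1 / 6])
    return np.sqrt(3.0) * entries

# Project d-dimensional rows of X down to k dimensions with X @ A
X = np.random.rand(100, 256)           # e.g. 100 samples of dimension 256
A = achlioptas_matrix(256, 64)         # 256 -> 64 dimensional projection
X_rp = X @ A
print(X_rp.shape)                      # (100, 64)
```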
The column-wise mean of the projected image matrix I_RP is calculated and stored in
a vector M, M ∈ R^k. The elements of vector M are transformed as
where abs is the absolute value function. Exploiting the energy-compaction property of
HT, the coefficients confined to the upper-left triangle, which carry the basic details
of the image, are retained and the rest are discarded by setting them to zero. On the
resultant image, the inverse HT is performed using Eq. 4 to obtain I_R. The modulus of each ith
row of I_R is computed separately using vector M,
where i varies from 1 to N, the total numbers of rows and columns being k and N,
respectively. After computing the transformed template I_T, the vector M is discarded.
Overall, I_T can be written as
$I_T = \big( \big( H_n \times (H_n \times I \times H_n) \times R \big) \times H_n \big) \bmod M \qquad (9)$
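The following is a rough sketch of the transformation pipeline summarized by Eq. 9, written from the prose description above. Several details are our assumptions rather than the paper's: R is taken as a full-size N × N Achlioptas matrix, the hashing vector M is approximated by the absolute column-wise means shifted by 1 (the exact transformation of M is not reproduced in this excerpt), and the upper-left-triangle retention is applied as a binary mask.

```python
import numpy as np

def cancelable_template(I, H, R):
    """Sketch (with stated assumptions) of the template transformation of Eq. (9)."""
    N = I.shape[0]
    F = H @ I @ H                                 # forward HT (assumed Eq. 3 convention)
    F_rp = F @ R                                  # random projection, I_RP
    M = np.abs(F_rp.mean(axis=0)) + 1.0           # placeholder for the hashing vector M
    mask = np.fliplr(np.triu(np.ones((N, N))))    # keep upper-left triangle, zero the rest
    I_r = H @ (F_rp * mask) @ H                   # inverse HT (assumed Eq. 4 convention)
    I_t = np.mod(I_r, M)                          # one-way modulus hashing against M
    return I_t                                    # M is discarded after this step

H1 = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
H = np.kron(np.kron(H1, H1), H1)                  # 8 x 8 orthonormal Hadamard matrix
I = np.random.rand(8, 8)
R = np.sqrt(3) * np.random.choice([1.0, 0.0, -1.0], size=(8, 8), p=[1 / 6, 2 / 3, 1 / 6])
print(cancelable_template(I, H, R).shape)
```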
The performance is evaluated on two different biometric modalities, i.e., face and
palmprint. To study the functional performance of the proposed system on face
modality, three standard face databases are used: ORL, Extended Yale Face Database B,
and the Indian Face Database. ORL is an expression-variant database consisting
of 40 subjects with 10 images per subject capturing different facial expressions [20].
The Extended Yale face database is an illumination-variant database containing 64 near-frontal
images for each of 38 subjects under various illumination conditions [21]; only
the 10 images per subject with uniform illumination are selected. The Indian face
database is a collection of 61 subjects (39 males and 22 females) with 11 images per
subject, collected by IIT Kanpur for different orientations of the face, eyes, and facial
emotions [22]. For each database, 3 images per subject are randomly selected for training
and 7 for testing. The CASIA and PolyU palmprint databases are
used to study the functional performance of the proposed system on palmprint
templates. CASIA contains 5,239 palmprint images of the left and right palms of 301
subjects, i.e., a total of 602 different palms [23]. The PolyU database includes 600 images
from 100 individuals, with 6 palmprint images per subject [24]. For the palmprint
databases, 2 images per subject are randomly selected for training and 4 for testing
after extracting the region of interest [25].
The performance is determined using the Equal Error Rate (EER) and the Decidability
Index (DI). DI is defined as the normalized distance between the means of the
genuine (𝜇G) and imposter (𝜇I) score distributions; it measures the confidence with which a
given classifier assigns patterns, whose scores may be positive or negative, to the two classes.
Higher values of DI indicate better decidability between the genuine and imposter populations.
DI is calculated as
calculated as
|𝜇G − 𝜇I |
DI = √| | (10)
(𝜎G2 + 𝜎I2 )∕2
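As an illustration, the following sketch computes DI directly from Eq. 10 given arrays of genuine and imposter match scores; the function name and the synthetic score distributions are ours.

```python
import numpy as np

def decidability_index(genuine: np.ndarray, imposter: np.ndarray) -> float:
    """Decidability Index of Eq. (10) from genuine and imposter match scores."""
    mu_g, mu_i = genuine.mean(), imposter.mean()
    var_g, var_i = genuine.var(), imposter.var()
    return abs(mu_g - mu_i) / np.sqrt((var_g + var_i) / 2.0)

# Toy example with synthetic score distributions
rng = np.random.default_rng(1)
genuine = rng.normal(0.8, 0.05, 1000)     # higher scores for genuine comparisons
imposter = rng.normal(0.4, 0.10, 1000)    # lower scores for imposter comparisons
print(f"DI = {decidability_index(genuine, imposter):.2f}")
```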
Fig. 2 ROC curves for matching performance a original domain b transformed domain
It can be observed that the matching performance (EER) of the proposed approach
under the stolen-token scenario is comparable to that of the non-cancelable technique. The
experimental results validate that the proposed approach transforms biometric templates
while effectively preserving their discriminability, and meets the performance
criterion of cancelability under the stolen-token scenario. The genuine and
imposter populations in the transformed domain are well separated; the DI values obtained
from the genuine and imposter means and variances in the stolen-token scenario are
sufficiently high, indicating good separability among transformed templates. In the
legitimate-key scenario, when each subject is assigned a different random matrix R,
the EER is nearly 0 % for all modalities and databases.
Consider the scenario when the transformed template I_T and the projection matrix R are
available simultaneously. The inverse operation (decryption) requires the projection
The next step requires an attacker to have the exact values over which modulus is
computed for each row, i.e., the mean vector M, which is discarded immediately
after transformation. Hence, it cannot be inverted. Yet, we consider a scenario where
the exact vector M is approximated by the attacker using intrusion or hill climbing
attacks. Then the inverse template should be computed as
where Ī1, Ī2 represent the means of templates I1, I2, respectively. The correlation
index (CI) is the mean of all collected Cr values. Table 2 provides CI values between
transformed templates for different modalities and databases. For example, an average
value of CI = 0.121 means that two templates generated from the same biometric sample
using different random matrices share 12.1 % of mutual information and differ from
each other by 87.9 %. It is observed from Table 2 that the CI values are low,
which indicates that the proposed approach offers good revocability and diversity.
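The exact expression for Cr is not reproduced in this excerpt; the sketch below assumes a standard Pearson-style normalized cross-correlation using the template means Ī1 and Ī2, and then averages it over template pairs to obtain CI. Function names are ours.

```python
import numpy as np

def correlation(t1: np.ndarray, t2: np.ndarray) -> float:
    """Normalized cross-correlation between two templates (assumed form of Cr)."""
    d1 = t1.astype(np.float64) - t1.mean()
    d2 = t2.astype(np.float64) - t2.mean()
    return float(np.sum(d1 * d2) / np.sqrt(np.sum(d1 ** 2) * np.sum(d2 ** 2)))

def correlation_index(template_pairs) -> float:
    """CI: mean of the Cr values collected over pairs of transformed templates."""
    return float(np.mean([correlation(a, b) for a, b in template_pairs]))

# Example: templates of the same sample under two different random matrices
pairs = [(np.random.rand(64, 64), np.random.rand(64, 64)) for _ in range(5)]
print(correlation_index(pairs))
```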
5 Conclusion
References
1. Ratha, N.K., Connell, J.H., Bolle, R.M.: Enhancing security and privacy in biometrics-based
authentication systems. IBM systems Journal 40 (2001) 614–634
2. Lacharme, P., Cherrier, E., Rosenberger, C.: Preimage attack on biohashing. In: International
Conference on Security and Cryptography (SECRYPT). (2013)
3. Sutcu, Y., Sencar, H.T., Memon, N.: A secure biometric authentication scheme based on robust
hashing. In: Proceedings of the 7th workshop on Multimedia and security, ACM (2005) 111–
116
4. Teoh, A.B.J., Ngo, D.C.L.: Biophasor: Token supplemented cancellable biometrics. In: Con-
trol, Automation, Robotics and Vision, 2006. ICARCV’06. 9th International Conference on,
IEEE (2006) 1–5
5. Teoh, A., Yuang, C.T.: Cancelable biometrics realization with multispace random projections.
Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 37 (2007) 1096–
1106
6. Lumini, A., Nanni, L.: An improved biohashing for human authentication. Pattern recognition
40 (2007) 1057–1065
7. Ratha, N., Connell, J., Bolle, R.M., Chikkerur, S.: Cancelable biometrics: A case study in fin-
gerprints. In: Pattern Recognition, 2006. ICPR 2006. 18th International Conference on. Vol-
ume 4., IEEE (2006) 370–373
8. Tulyakov, S., Farooq, F., Govindaraju, V.: Symmetric hash functions for fingerprint minutiae.
In: Pattern Recognition and Image Analysis. Springer (2005) 30–38
9. Ang, R., Safavi-Naini, R., McAven, L.: Cancelable key-based fingerprint templates. In: Infor-
mation Security and Privacy, Springer (2005) 242–252
10. Boult, T.E., Scheirer, W.J., Woodworth, R.: Revocable fingerprint biotokens: Accuracy and
security analysis. In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Con-
ference on, IEEE (2007) 1–8
11. Farooq, F., Bolle, R.M., Jea, T.Y., Ratha, N.: Anonymous and revocable fingerprint recognition.
In: Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, IEEE
(2007) 1–7
12. Dasgupta, S., Gupta, A.: An elementary proof of the johnson-lindenstrauss lemma. Interna-
tional Computer Science Institute, Technical Report (1999) 99–006
13. Matoušek, J.: On variants of the johnson–lindenstrauss lemma. Random Structures & Algo-
rithms 33 (2008) 142–156
14. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimen-
sionality. In: Proceedings of the thirtieth annual ACM symposium on Theory of computing,
ACM (1998) 604–613
15. Dasgupta, S.: Learning mixtures of gaussians. In: Foundations of Computer Science, 1999.
40th Annual Symposium on, IEEE (1999) 634–644
16. Achlioptas, D.: Database-friendly random projections. In: Proceedings of the twentieth ACM
SIGMOD-SIGACT-SIGART symposium on Principles of database systems, ACM (2001)
274–281
17. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: applications to
image and text data. In: Proceedings of the seventh ACM SIGKDD international conference
on Knowledge discovery and data mining, ACM (2001) 245–250
18. Bartlett, M.S., Movellan, J.R., Sejnowski, T.J.: Face recognition by independent component
analysis. Neural Networks, IEEE Transactions on 13 (2002) 1450–1464
19. Connie, T., Teoh, A., Goh, M., Ngo, D.: Palmprint recognition with pca and ica. In: Proc.
Image and Vision Computing, New Zealand. (2003)
20. ORL face database: (AT&T Laboratories Cambridge) http://www.cl.cam.ac.uk/.
21. Yale face database: (Center for computational Vision and Control at Yale University) http://
cvc.yale.edu/projects/yalefaces/yalefa/.
22. The Indian face database: (IIT Kanpur) http://vis-www.cs.umas.edu/.
23. CASIA palmprint database: (Biometrics Ideal Test) http://biometrics.idealtest.org/
downloadDB/.
24. PolyU palmprint database: (The Hong Kong Polytechnic University) http://www4.comp.polyu.
edu.hk/biometrics/.
25. Kekre, H., Sarode, T., Vig, R.: An effectual method for extraction of roi of palmprints. In: Com-
munication, Information & Computing Technology (ICCICT), 2012 International Conference
on, IEEE (2012) 1–5
A Semi-automated Method for Object
Segmentation in Infant’s Egocentric Videos
to Study Object Perception
1 Introduction
Infants begin to learn about objects, actions, people and language through many
forms of social interactions. Recent cognitive research highlights the importance of
studying the infant’s visual experiences in understanding early cognitive develop-
ment and object name learning [1–3]. The infant’s visual field is dynamic and char-
acterized by large eye movements and head turns owing to motor development and
bodily instabilities which frequently change the properties of their visual input and
experiences. What infants attend to and how their visual focus on objects is structured
and stabilized during early stages of development has been studied to understand the
underlying mechanism of object name learning and language development in early
growth stages [1, 4–6].
Technological advancement allows researchers to have access to these visual
experiences that are critical to understanding the infant’s learning process first hand
[1, 7]. Head cameras attached to the child's forehead enable researchers to observe
the world from the child's viewpoint by recording their visual input [2, 8]. However,
it becomes very time consuming and impractical for humans to annotate objects in
these high volume egocentric videos manually.
As discussed in [9], egocentric video is an emerging source of data and informa-
tion, the processing of which poses many challenges from a computer vision perspec-
tive. Recently, computer vision algorithms have been proposed to solve the object
segmentation problem in such videos [10, 11]. The nuances of segmentation in ego-
centric videos arise because the child’s view is unfocused and dynamic. Specifically,
the egocentric camera (here, the head camera) is in constant motion, rendering the
relative motion between object and background more spurious than that from a fixed
camera. In addition, the random focus of a child causes the objects to constantly move
in and out of the view and appear in different sizes, and often the child’s hand may
occlude the object. Examples of such views are shown in Fig. 1a, b and c. Finally, if
the child looks towards a light source, the illumination of the entire scene appears
very different, as shown in Fig. 1d.
In this paper, we develop an interactive and easy-to-use tool for segmenting
objects in a child's egocentric video that addresses the above problems. The method
enables cognitive scientists to select the desired object and monitor the segmentation
process. The proposed approach exploits graph-cut segmentation to model the object
and background, and calculates optical flow between frames to predict the object mask in
subsequent frames. We also incorporate domain-specific heuristic rules to maintain
high accuracy when object properties change dramatically.
The method is applied to find binary masks of objects in videos collected by
placing a small head camera on the child as the child engages in toy play with a
Fig. 1 a Size variation. b Entering and leaving. c Occlusion. d Illumination variation. e Orientation
variations
parent. The object masks are then used to study object distribution in child’s view
at progressive ages by generating heat maps of objects for multiple children. We
investigate the potential developmental changes in children’s visual focus on objects.
The rest of the paper is organized as follows: We describe the experimental setup
and data collection process in Sect. 2. The semi-automated segmentation method is
explained in detail in Sect. 3. Results are presented and discussed in Sect. 4. Finally,
Sect. 5 concludes the present study and highlights its main achievements and
contributions.
In this section, we explain our proposed method in detail in three main steps, namely
initialization and modeling of the object and background, object-mask prediction for
the next frame, and a confidence test that decides whether to continue or restart the program. The
flow diagram for the method is shown in Fig. 2. We use a graph based segmentation
approach [12, 13] to take user input, for initialization and also when recommended
by the confidence mechanism (Sect. 3.3). For each user input we model the object and
save the features as a KeyFrame in an active learning pool, which is used as ground
truth data. We then use optical flow [14] to estimate the segmentation mask for the
next frame and subsequently refine it to obtain the final prediction mask (Sect. 3.2).
The obtained segmentation result is then evaluated under our confidence test (Sect.
3.3). If the predicted mask is accepted by the test, the system continues this process
of automatic segmentation. When the confidence test fails, the system quickly goes
back and takes a new user input to maintain the desired accuracy in segmentation.
We begin our prediction with the calculation of dense optical flow between the pre-
vious frame and the current frame. Since the two frames are consecutive frames of
an egocentric video, we can assume there isn’t a drastic change in the characteristics
of the foreground or the background. Using optical flow we predict pixel to pixel
translation of the mask. This initial calculation provides a starting estimate of the
mask.
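The following is a minimal sketch of forward-warping the previous binary mask into the current frame using dense optical flow. The paper cites Horn-Schunck flow [14]; here OpenCV's Farneback flow is used as a stand-in, and the function name and parameter values are ours.

```python
import cv2
import numpy as np

def predict_mask(prev_gray: np.ndarray, curr_gray: np.ndarray,
                 prev_mask: np.ndarray) -> np.ndarray:
    """Forward-warp the previous binary mask into the current frame via dense flow."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_mask.shape
    ys, xs = np.nonzero(prev_mask)                       # pixels belonging to the object
    new_xs = np.clip(np.round(xs + flow[ys, xs, 0]).astype(int), 0, w - 1)
    new_ys = np.clip(np.round(ys + flow[ys, xs, 1]).astype(int), 0, h - 1)
    pred = np.zeros_like(prev_mask)
    pred[new_ys, new_xs] = 1                             # many-to-one: several source pixels
    return pred                                          # may land on the same target pixel
```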
Some refinement of this mask is required to maintain accuracy, for the following
reasons. Firstly, the pixel-to-pixel transformation using optical flow is not one-to-one
but many-to-one, i.e., many pixels in the previous frame may be
translated to the same pixel in the current frame. Secondly, if part of the object is just
entering the frame from one of the boundaries, optical flow by itself cannot predict
whether the boundary pixels belong to the object. Lastly, under occlusion, as in
many frames where the mother is holding the object or the child's hands are
interacting with it, flow estimates the mask fairly well at the onset of occlusion
but fails to recover the object once the occlusion has subsided, leading to
severe under-segmentation (Fig. 3).
To refine this initial estimate, we define a region of uncertainty around the mask,
extending both inwards and outwards. If the object happens to be near one of
the edges of the frame, we define areas along the right, left, top, or bottom edges
as part of the uncertain region, based on the average flow of the object as well as
the local spatial distribution of the object near the edges. We then feed this unlabeled,
uncertain region into the earlier graph-cut stage to label the uncertain pixels as either
foreground or background. This yields a more accurate, refined segmentation mask (Fig. 4).
This segmentation result is then compared against the learnt ground truth from
the user, stored in the active learning KeyFrame Pool, using a confidence test
Fig. 3 Need for refinement. a Error due to optical flow. b–c Occlusion of object by the child’s
hand in successive frames. d Predicted mask
Fig. 4 Steps in segmentation prediction. a User seed. b First estimate of next mask using opti-
cal flow. c The uncertain region. d Final mask using graph cut (Bunny moved downwards, legs
expanded)
(explained in the next section). If the segmentation result is accepted by the confi-
dence test, we go on to predict the mask for the next frame using this mask. If the
result fails the confidence test, we go back and take a new input from the user.
We define Keyframes as those frames in which the segmentation mask has been
obtained from user input. This represents the ground truth data. For each keyframe
we save the following parameters:
These three parameters are utilized to test the validity of the predicted segmentation
mask. Firstly, we check whether the size of the mask has increased or decreased beyond
a certain fractional threshold of the mask size in the most recent keyframe. This
test is introduced as a safeguard to maintain the reliability of segmentation because,
when the child interacts with the object, bringing it closer or moving it away, the object
may suddenly go from occupying the entire frame to just a small number of pixels. We flag
a variation of 39 %, but anywhere between 25 and 50 % can be chosen
depending on the sensitivity required.
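A minimal sketch of this size check follows; the function name and the default threshold placement are ours, with 0.39 matching the fraction used above.

```python
def size_check(pred_mask_area: int, keyframe_mask_area: int,
               max_fraction: float = 0.39) -> bool:
    """Confidence test on mask size: reject the frame if the predicted mask has grown
    or shrunk by more than max_fraction (39 % here; 25-50 % is reasonable) relative
    to the mask in the most recent keyframe."""
    change = abs(pred_mask_area - keyframe_mask_area) / float(keyframe_mask_area)
    return change <= max_fraction      # True -> accept, False -> ask for user input
```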
ter that moves closer to the foreground does so only slightly, as it still must account
for the background pixels, but the length of the projection of its eigenvectors along
the line joining the centers of this background-foreground cluster pair increases.
From the above two distances, the first computed for the keyframe clusters and the
second for the current frame, we can find which background cluster moved closer
to which foreground cluster. We then calculate the projections of the eigenvectors
of the background cluster that moved closest to the foreground cluster onto
the line joining the centers of these two clusters, in both the keyframe and the current
frame.
$P_{Key} = \sum_{k=1}^{3} \left| E.Vec_{k,Key} \cdot CC_{j,Key} \right|, \qquad P_{Curr} = \sum_{k=1}^{3} \left| E.Vec_{k,Curr} \cdot CC_{j,Curr} \right| \qquad (4)$
If this increase is greater than 25 %, we can reliably conclude under-segmentation.
Lastly, the object may appear different under different orientations or illumination
conditions, for which we compare the GMM of the predicted segment
against all the models in the keyframe pool. We do this by checking whether the average distance
between corresponding centers in the two segments is within the acceptable error for
that GMM and, if so, whether the differences in weights of these corresponding centers are also
within a certain acceptable threshold. The latter threshold is set manually depending
on the number of modes and the desired sensitivity. The former is obtained using the
standard deviations of the RGB channels:
$\text{Thresh for avg dist for center } i = \Big( \sum_{k=1}^{3} Std_{i,k} \Big) / 3 \qquad (5)$
where $k \in \{R, G, B\}$ (the 3 channels).
If these criteria are not met, we take a new user input, which becomes a keyframe
in the learning pool.
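A rough sketch of this appearance check is given below. It assumes the modes of the two GMMs are already put in correspondence, reads Eq. 5 as a per-mode distance threshold, and uses an arbitrary weight tolerance; all names and defaults are ours.

```python
import numpy as np

def gmm_match(pred_centers, pred_weights, key_centers, key_weights,
              key_stds, weight_tol=0.1):
    """Sketch of the GMM-based appearance check described above.

    pred_centers, key_centers : (n_modes, 3) RGB means of corresponding modes
    pred_weights, key_weights : (n_modes,) mixture weights
    key_stds                  : (n_modes, 3) per-channel std. devs of keyframe modes
    weight_tol                : manually set tolerance on the weight differences
    """
    center_dist = np.linalg.norm(pred_centers - key_centers, axis=1)   # per-mode distance
    dist_thresh = key_stds.sum(axis=1) / 3.0                           # Eq. (5)
    if np.any(center_dist > dist_thresh):
        return False                              # appearance drifted too far
    if np.any(np.abs(pred_weights - key_weights) > weight_tol):
        return False                              # mode weights changed too much
    return True                                   # accept; otherwise take a new user input
```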
We use the method to extract multiple objects from videos and compare the result-
ing object masks with their corresponding manual annotation provided by experts.
Further, the performance of the algorithm in terms of run time and amount of user
Table 1 Performance measures (Note: values are averaged over 300 frames)
Object | Total time (s) | Optical flow time (s) | Processing time (s) | Image size (px) | % Area | User inputs (per 300 frames) | Accuracy (%)
Bunny | 8.7002 | 8.0035 | 0.6967 | 199860 | 65.00 | 11 | 97.76
Cup | 8.1156 | 7.6534 | 0.4622 | 47261 | 15.38 | 9 | 98.43
Carrot | 7.9513 | 7.5180 | 0.4333 | 32080 | 10.44 | 9 | 99.71
Car | 7.9237 | 7.6320 | 0.2917 | 10872 | 3.54 | 10 | 99.35
Cookie | 7.9421 | 7.6145 | 0.3276 | 27653 | 9.00 | 13 | 98.10
interaction, for each of these objects is reported in Table 1. The run time of the
method consists of the time taken for optical flow calculation and the processing
time required by the proposed method. We see clearly that over a large number of
frames the average total time would easily outperform the time required in man-
ual segmentation. We also observe that the processing time varies directly with the
image area occupied by the object. Lastly, we observe how frequently the algo-
rithm requires user input. We let the method run for a large number of frames (300
frames, a subset of the entire video) and count the number of user input requests. It
is important to note that we have set the method to take user input every 50 frames,
so even in case of no errors we would take 6 user inputs. Hence the additional user
inputs required, due to uncertainty in prediction, are 5, 3, 3, 4, 7 on average, respec-
tively. In any case this is a significant reduction in the amount of user involvement
as only 3 % of the frames require user input on average.
As with any automated approach to segmentation, while user interaction and processing
time are reduced, what is traded off is the accuracy of the automated segmentation
compared to manual segmentation. To evaluate this, we take 30 randomly picked
frames and have them manually annotated by 5 different people. We calculate the
DICE similarity between the automated masks and the manually annotated masks,
and also the DICE similarity between the manually segmented masks for each
frame for each pair of annotators. The means and standard deviations are
noted in Table 2. We see that, on average, we lose only 3.21 % accuracy compared
to manual segmentation. Note: the DICE similarity is the ratio of twice the
number of overlapping pixels to the sum of the numbers of pixels in the two masks.
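The following sketch implements the DICE similarity as defined in the note above, together with the precision and recall measures defined later in this section; the function names are ours.

```python
import numpy as np

def dice(a: np.ndarray, b: np.ndarray) -> float:
    """DICE similarity: twice the overlap divided by the sum of the mask sizes."""
    a, b = a.astype(bool), b.astype(bool)
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def precision_recall(pred: np.ndarray, manual: np.ndarray):
    """Precision: fraction of predicted-mask pixels overlapping the manual mask.
    Recall: fraction of manual-mask pixels covered by the predicted mask."""
    pred, manual = pred.astype(bool), manual.astype(bool)
    overlap = np.logical_and(pred, manual).sum()
    return overlap / pred.sum(), overlap / manual.sum()

pred = np.random.rand(64, 64) > 0.5
manual = np.random.rand(64, 64) > 0.5
print(dice(pred, manual), precision_recall(pred, manual))
```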
In our application, under-segmentation is not tolerable but slight over-segmentation
is. Our approach does not under-segment any worse than manual
segmentation would, as can be seen from the recall values in the two columns. On
the other hand, our algorithm consistently, but not excessively (as the
DICE measurements show), over-segments the object, as can be seen from the precision
values in the two columns. Thus the proposed approach significantly
reduces time and user interaction with little loss in accuracy compared to manual
segmentation. Note: precision is the proportion of predicted-mask pixels that overlap with
the manual segmentation, and recall is the proportion of manual-segmentation pixels
that are part of the predicted mask.
We use the results obtained from segmentation to investigate how objects are distributed
in the child's view at progressive ages. We aim to study potential regularities
in object movement patterns and concentration in the child's view with age. To visualize
the areas where infants focus and fixate on objects, we plot heat maps of the object
for each video. To emphasize the locations occupied by objects most recently,
we give each pixel of the object a weight Wi for each object mask and accumulate
the results over time. The final output stores the following value for each image pixel:
$P_{xy,\,Output} = \sum_{i=1}^{L} W_i \ast P_{xy,\,ObjectMask}, \qquad W_i = i/L \qquad (6)$
where i is the frame number and L is the total number of frames in video (usually
around 9500).
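A minimal sketch of this accumulation per Eq. 6 is shown below; the function name and the short synthetic mask sequence are ours.

```python
import numpy as np

def heat_map(object_masks) -> np.ndarray:
    """Accumulate per-frame binary object masks into a recency-weighted heat map
    following Eq. (6): frame i contributes with weight W_i = i / L."""
    L = len(object_masks)
    heat = np.zeros_like(object_masks[0], dtype=np.float64)
    for i, mask in enumerate(object_masks, start=1):
        heat += (i / L) * mask            # later frames (larger i) weigh more
    return heat

# Example with a short synthetic sequence of 64 x 64 masks
masks = [np.random.rand(64, 64) > 0.9 for _ in range(10)]
H = heat_map(masks)
print(H.max())
```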
From the heat maps, we can see that object movement for 6-month-old infants is large
and does not follow any specific pattern; the object concentration region changes
across infants, and their visual focus on objects is not yet stabilized. After
9 months, however, the object distribution pattern becomes more structured and the object
concentration area moves down toward the middle-bottom part of the visual field.
This appears to be the active region where the child interacts with the object most
of the time. For 18-month-old children, object movements increase and the pattern
changes across children, yet the concentration point remains in the bottom area
very close to the child's eyes. These results may align with a previous psychological
hypothesis that describes an unfocused view in 6-month-old infants and the increasing
participation of the child in shaping his or her visual field with age [15]; 18-month-old infants
are able to make controlled movements and handle objects. This study is still at an early
stage, and further investigation of other factors, such as who is holding the object, is
required to discover how the child's visual focus of attention is shaped and stabilized
over the early developmental stages and who shapes the view at each stage. The
results demonstrate a developmental trend in the child's visual focus with physical and
motor development, which might support a controversial psychological hypothesis
on the existence of a link between physical constraints and language delay in children
suffering from autism (Fig. 5).
Fig. 5 Heatmaps of objects for infants at progressive ages (a, b) 6 months infants (c, d) 9 months
infants (e, f) 12 months infants (g, h) 15 months infants (i, j) 18 months infants
5 Conclusions
References
1. Pereira, A.F., Smith, L.B., Yu, C.: A bottom-up view of toddler word learning. Psychonomic
bulletin & review 21(1), 178–185 (2014)
2. Pereira, A.F., Yu, C., Smith, L.B., Shen, H.: A first-person perspective on a parent-child social
interaction during object play. In: Proceedings of the 31st Annual Meeting of the Cognitive
Science Society (2009)
3. Smith, L.B., Yu, C., Pereira, A.F.: Not your mothers view: The dynamics of toddler visual
experience. Developmental science 14(1), 9–17 (2011)
4. Bambach, S., Crandall, D.J., Yu, C.: Understanding embodied visual attention in child-parent
interaction. In: Development and Learning and Epigenetic Robotics (ICDL), 2013 IEEE Third
Joint International Conference on. pp. 1–6. IEEE (2013)
5. Burling, J.M., Yoshida, H., Nagai, Y.: The significance of social input, early motion expe-
riences, and attentional selection. In: Development and Learning and Epigenetic Robotics
(ICDL), 2013 IEEE Third Joint International Conference on. pp. 1–2. IEEE (2013)
6. Xu, T., Chen, Y., Smith, L.: It’s the child’s body: The role of toddler and parent in selecting
toddler’s visual experience. In: Development and Learning (ICDL), 2011 IEEE International
Conference on. vol. 2, pp. 1–6. IEEE (2011)
7. Yoshida, H., Smith, L.B.: What’s in view for toddlers? Using a head camera to study visual
experience. Infancy 13(3), 229–248 (2008)
8. Smith, L., Yu, C., Yoshida, H., Fausey, C.M.: Contributions of Head-Mounted Cameras to
Studying the Visual Environments of Infants and Young Children. Journal of Cognition and
Development (just-accepted) (2014)
9. Bambach, S.: A Survey on Recent Advances of Computer Vision Algorithms for Egocentric
Video. arXiv preprint arXiv:1501.02825 (2015)
10. Ren, X., Gu, C.: Figure-ground segmentation improves handled object recognition in egocen-
tric video. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on.
pp. 3137–3144. IEEE (2010)
11. Ren, X., Philipose, M.: Egocentric recognition of handled objects: Benchmark and analysis. In:
Computer Vision and Pattern Recognition Workshops, 2009. CVPR Workshops 2009. IEEE
Computer Society Conference on. pp. 1–8. IEEE (2009)
12. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for
energy minimization in vision. Pattern Analysis and Machine Intelligence, IEEE Transactions
on 26(9), 1124–1137 (2004)
13. Boykov, Y., Veksler, O., Zabih, R.: Fast approximate energy minimization via graph cuts. Pat-
tern Analysis and Machine Intelligence, IEEE Transactions on 23(11), 1222–1239 (2001)
14. Horn, B.K., Schunck, B.G.: Determining optical flow. In: 1981 Technical Symposium East.
pp. 319–331. International Society for Optics and Photonics (1981)
15. Yoshida, H., Burling, J.M.: Dynamic shift in isolating referents: From social to self-generated
input. In: Development and Learning and Epigenetic Robotics (ICDL), 2013 IEEE Third Joint
International Conference on. pp. 1–2. IEEE (2013)
A Novel Visual Secret Sharing Scheme Using
Affine Cipher and Image Interleaving
Abstract Recently, an interesting image sharing method for gray-level images using
the Hill cipher and the random grid (RG) method was introduced by Chen [1]. The method does
not involve pixel expansion, and image recovery is lossless. However, the use of the Hill
cipher requires a 2 × 2 integer matrix whose inverse is also an integer matrix.
Further, extending the method to multi-secret sharing requires higher-order
integer matrices, which demands heavy computation, and the choice of matrices is
very restricted due to the integer-entry constraints. In the present paper, we introduce
an RG-based visual secret sharing (VSS) scheme using image interleaving
and an affine cipher. The combined effect of image interleaving and the affine transformation
helps improve the security of the secret images. The parameters of the affine cipher
serve as keys, while the random grid and the encrypted image form the shares. No one can
reveal the secret unless the keys and both shares are known. Further, as opposed to
the method in [1], the present scheme does not require an invertible matrix with an integer
inverse. The scheme is also extended to multi-secret sharing.
1 Introduction
needs heavy computation, and the choice of matrices is also very restricted due to
the integer-entry constraints. In addition, it has been observed in [14] that a small guess
of the diagonal entries of the 2 × 2 matrix reveals the secret, especially when the matrix
is diagonally dominant. This motivated us to study image encryption using other
affine transformations. An RG-based VSS scheme for multi-secret sharing using image
interleaving and an affine cipher is proposed in the present paper. The combined effect
of image interleaving and the affine transformation helps improve the security of
the secret images. The given secret image is divided into four sub-images, and all the
sub-images are packed within each other using interleaving. To enhance security,
an affine cipher is applied to the resulting image, followed by an XOR operation
with a given random grid of the same size. Decryption is performed by applying the
same operations in reverse order. The parameters of the affine cipher serve as keys,
while the random grid and the encrypted image form the shares. No one can reveal the
secret unless the keys and both shares are known. Further, as opposed to the
method by Chen [1], the present scheme does not require an invertible matrix with an
integer inverse. The scheme is also extended to multi-secret sharing. The rest of the paper
is organized as follows. Section 2 introduces the preliminaries. Section 3 is devoted
to the proposed encryption and decryption processes. In Sect. 4, experimental results
are presented and the proposed technique is compared with the Hill cipher based
encryption method proposed in [1].
2 Preliminary
operation is then performed row-wise on the two images S1,2 and S3,4. Let the final
interleaved image be denoted by S1,2,3,4. The image is then transformed into another
intermediate image using the affine cipher discussed in the next section.
where C stands for the ciphertext, P for the plaintext, and N = 256 is the number of
gray levels for 8-bit gray-scale images (pixel values 0-255). The keys K0 and K1 are selected
in the range [0, 255] so that decryption can be ensured. The function C = (K1 P + K0) mod N
defines a valid affine cipher if K1 is relatively prime to 256 and K0 is an integer between
0 and 255 (both values inclusive).
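As a brief illustration of this primitive, the following sketch encrypts and decrypts a single pixel value; the decryption uses the modular inverse of K1 modulo 256, which exists exactly when K1 is odd. The key values 7 and 45 are arbitrary examples, not values from the paper.

```python
def affine_encrypt(p: int, k1: int, k0: int, n: int = 256) -> int:
    """C = (K1*P + K0) mod N for one pixel value; K1 must be coprime with N."""
    return (k1 * p + k0) % n

def affine_decrypt(c: int, k1: int, k0: int, n: int = 256) -> int:
    """P = K1^{-1} * (C - K0) mod N, using the modular inverse of K1."""
    k1_inv = pow(k1, -1, n)          # modular inverse (Python 3.8+)
    return (k1_inv * (c - k0)) % n

# Example: K1 = 7 (odd, hence coprime with 256), K0 = 45
assert all(affine_decrypt(affine_encrypt(p, 7, 45), 7, 45) == p for p in range(256))
```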
3 Proposed Method
Fig. 1 Encryption process for Lena image a divided image, b & c column wise interleaving of
the upper and lower two quadrants respectively, d row wise interleaving b and c, e affine cipher, f
random grid R, g XORed image
Step 4. XOR Operation: Generate a random grid R of the same size as the original
image (2m × 2n) and perform an XOR operation to obtain the final encrypted image
I = E ⊕ R.
Figure 1g shows the final encrypted image obtained after following the above
steps. It is worth mentioning that the security property is satisfied, since the image
cannot be revealed without the keys and the random grid. Further, image interleaving
makes it very difficult to decipher the image content even if the parameters are known.
The decryption process is discussed in the next section.
The recovery of the image requires the following inputs: the encrypted image I, the
random grid R, and the secret keys K0, K1. The first step is an XOR operation
of the random grid with the encrypted image I to obtain the interleaved, affine-encrypted
image E = I ⊕ R (Fig. 6a). We then apply the inverse affine cipher (2) to
each pixel of E. This gives the intermediate image (Fig. 6b), which is then
de-interleaved by reversing the process explained in Step 2 of the encryption.
This reveals the original image (Fig. 6c).
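The following end-to-end sketch puts the steps together. The exact interleaving pattern of Steps 1-3 is only partially visible in this excerpt, so simple alternating-column and alternating-row packing of the four quadrants is assumed; all function names and the key values 7 and 45 are ours.

```python
import numpy as np

def interleave_cols(a, b):
    out = np.empty((a.shape[0], a.shape[1] * 2), dtype=a.dtype)
    out[:, 0::2], out[:, 1::2] = a, b              # assumed alternating-column packing
    return out

def interleave_rows(a, b):
    out = np.empty((a.shape[0] * 2, a.shape[1]), dtype=a.dtype)
    out[0::2, :], out[1::2, :] = a, b              # assumed alternating-row packing
    return out

def encrypt(img, k1, k0, rng_seed=0):
    """Sketch: quadrant interleaving, affine cipher, XOR with a random grid R."""
    h, w = img.shape
    s1, s2 = img[:h // 2, :w // 2], img[:h // 2, w // 2:]
    s3, s4 = img[h // 2:, :w // 2], img[h // 2:, w // 2:]
    s1234 = interleave_rows(interleave_cols(s1, s2), interleave_cols(s3, s4))
    e = (k1 * s1234.astype(np.int64) + k0) % 256    # affine cipher
    r = np.random.default_rng(rng_seed).integers(0, 256, img.shape, dtype=np.int64)
    return (e ^ r).astype(np.uint8), r.astype(np.uint8)   # encrypted share I and grid R

def decrypt(i_enc, r, k1, k0):
    """Reverse the steps: XOR, inverse affine cipher, de-interleave."""
    e = i_enc.astype(np.int64) ^ r.astype(np.int64)
    s1234 = (pow(k1, -1, 256) * (e - k0)) % 256
    s12, s34 = s1234[0::2, :], s1234[1::2, :]
    h, w = i_enc.shape
    img = np.empty((h, w), dtype=np.uint8)
    img[:h // 2, :w // 2], img[:h // 2, w // 2:] = s12[:, 0::2], s12[:, 1::2]
    img[h // 2:, :w // 2], img[h // 2:, w // 2:] = s34[:, 0::2], s34[:, 1::2]
    return img

img = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
enc, grid = encrypt(img, 7, 45)
assert np.array_equal(decrypt(enc, grid, 7, 45), img)    # lossless recovery
```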
The scheme can easily be extended to share multiple secrets by considering the original
image to be composed of more than one secret image. The number of secret images
can be even or odd. For example, to share four secret images of the same size, a new
image is composed with the four quadrants consisting of the secret images (refer to Fig. 4);
to share three secrets, any two quadrants can be selected to hold the first secret
image, and the second and third secret images can be fitted into the remaining two
quadrants (refer to Fig. 5). Hence, the number of images per quadrant can be adjusted as
required to share multiple secrets without any change in the encryption and recovery
procedures.
Fig. 2 Encryption process for Dice image a divided image, b & c column wise interleaving of
the upper and lower two quadrants respectively, d row wise interleaving b and c, e affine cipher,
f random grid R, g XORed image
Fig. 3 Encryption process for Iext image a divided image, b & c column wise interleaving of
the upper and lower two quadrants respectively, d row wise interleaving b and c, e affine cipher,
f random grid R, g XORed image
Fig. 4 Encryption process for four images a divided image, b & c column wise interleaving of
the upper and lower two quadrants respectively, d row wise interleaving b and c, e affine cipher,
f random grid R, g XORed image
Fig. 5 Encryption process for three images a divided image, b & c column wise interleaving of
the upper and lower two quadrants respectively, d row wise interleaving b and c, e affine cipher, f
random grid R, g XORed image
Fig. 6 Recovery of Lena image: a XOR operation with R, b decrypting affine cipher,
c de-interleaved image
Fig. 7 Recovery of Dice image: a XOR operation with R, b decrypting affine cipher,
c de-interleaved image
Fig. 8 Recovery of Text image: a XOR operation with R, b decrypting affine cipher,
c de-interleaved image
Fig. 9 Recovery results obtained after XOR operation for Chen's method: a & b subshares for Lena
image, c & d subshares for Dice image, e & f subshares for Text image
The experimental results show that the interleaving operations in the proposed method
provide an extra layer of security. The main reason for interleaving is to protect the
information even if the encryption keys are compromised. The proposed scheme provides
three layers of security for sharing a secret image. The random grid technique
provides security at the first level and also renders a noisy appearance to the image. In
the case of Chen's scheme, the security of the shared information is breached if the random
grid is available, as the sub-images recovered after decryption tend to reveal the
secret, as shown in Fig. 9. In comparison, the shares recovered after the
XOR operation using the proposed method are much more secure, as shown in Figs. 6, 7
and 8.
The Hill cipher algorithm used by Chen [1] suffers from the limited
key space of matrices that have an integral inverse. To overcome this problem, an
affine cipher is used at the second level to encrypt the information. Various techniques
in the literature suggest that it is possible to partially obtain the information if the attacker
guesses the keys or has partial knowledge of the secret keys. To deal with such attacks,
interleaving provides a third layer of security.
Even if the random grid and the encrypting keys are compromised, the revealed
information will be the interleaved image and hence will not provide any guess about
the original secret image. A comparison of the proposed scheme with some of the
recent multiple secret sharing schemes is provided in Table 1.
6 Conclusion
In this paper, a novel secret sharing scheme is presented based on an affine
cipher, image interleaving, and random grids. The scheme provides a solution to the
security flaws observed in the Hill cipher-based method introduced in [1]. As opposed
to the Hill cipher-based method [1], where two layers of security are proposed, the
present scheme provides three layers of security. Further, the matrix used in the Hill cipher-based
method is required to have an integer inverse, which is a major constraint not
only in the construction but also in extending the method to multi-secret sharing.
The proposed method is easily extended to multi-secret sharing without any
major modifications. The scheme provides lossless recovery and has no
pixel expansion issues. Numerical results demonstrate the robustness of the method.
References
1. Chen, W.K.: Image sharing method for gray-level images. Journal of Systems and Software 86
(2013) 581–585
2. Naor, M., Shamir, A.: Visual cryptography. In: Advances in Cryptology EUROCRYPT’94,
Springer (1995) 1–12
3. Kafri, O., Keren, E.: Encryption of pictures and shapes by random grids. Optics letters 12
(1987) 377–379
4. Shyu, S.J.: Image encryption by multiple random grids. Pattern Recognition 42 (2009) 1582–
1596
5. Chen, T.H., Tsao, K.H.: Threshold visual secret sharing by random grids. Journal of Systems
and Software 84 (2011) 1197–1208
6. Guo, T., Liu, F., Wu, C.: k out of k extended visual cryptography scheme by random grids.
Signal Processing 94 (2014) 90–101
7. Wu, X., Sun, W.: Improved tagged visual cryptography by random grids. Signal Processing 97
(2014) 64–82
8. William, S., Stallings, W.: Cryptography and Network Security, 4/E. Pearson Education India
(2006)
9. De Palma, P., Frank, C., Gladfelter, S., Holden, J.: Cryptography and computer security for
undergraduates. In: ACM SIGCSE Bulletin. Volume 36., ACM (2004) 94–95
10. Shyu, S.J., Huang, S.Y., Lee, Y.K., Wang, R.Z., Chen, K.: Sharing multiple secrets in visual
cryptography. Pattern Recognition 40 (2007) 3633–3651
11. Chen, L., Wu, C.: A study on visual cryptography. Diss. Master Thesis, National Chiao Tung
University, Taiwan, ROC (1998)
12. Feng, J.B., Wu, H.C., Tsai, C.S., Chang, Y.F., Chu, Y.P.: Visual secret sharing for multiple
secrets. Pattern Recognition 41 (2008) 3572–3581
13. Hsu, H.C., Chen, T.S., Lin, Y.H.: The ringed shadow image technology of visual cryptography
by applying diverse rotating angles to hide the secret sharing. In: Networking, Sensing and
Control, 2004 IEEE International Conference on. Volume 2., IEEE (2004) 996–1001
14. Bunker, S.C., Barasa, M., Ojha, A.: Linear equation based visual secret sharing scheme. In:
Advance Computing Conference (IACC), 2014 IEEE International, 2014 IEEE (2014) 406–
410
Comprehensive Representation and Efficient
Extraction of Spatial Information for Human
Activity Recognition from Video Data
Abstract Of late, human activity recognition (HAR) in video has generated much
interest. A fundamental step is to develop a computational representation of interactions.
The human body is often abstracted using minimum bounding rectangles (MBRs)
and approximated as a set of MBRs corresponding to different body parts. Such
approximations treat each MBR as an independent entity, which defeats the idea that
these are parts of a whole body. A representation schema for interactions between
entities, each of which is considered as a set of related rectangles, referred
to as extended objects, holds promise. We propose an efficient representation schema
for extended objects together with a simple recursive algorithm to extract spatial
information. We evaluate our approach and demonstrate that, for HAR, the spatial
information thus extracted leads to better models compared to CORE9 [1], a compact
and comprehensive representation schema for video understanding.
1 Introduction
1 http://www.visint.org.
formalisms, notable for the ability to capture interactive information, are often used
for description of video activities [3]. Topology and direction are common aspects
of space used for qualitative description. Topology deals with relations unaffected
by change of shape or size of objects; it is given as Region Connection Calculus
(RCC-8) relations: Disconnected (DC), Externally Connected (EC), Partially Over-
lapping (PO), Equal (EQ), Tangential Proper Part (TPP) and its inverse (TPPI), and
Non-Tangential Proper Part (nTPP) and its inverse (nTPPI) [9] (Fig. 2). Directional
Relations are one of the 8 cardinal directions: North (N), NorthEast (NE), East (E),
SouthEast (SE), South (S), SouthWest (SW), West (W), NorthWest (NW)—or as a
combination [6]. Figure 3 shows cardinal direction relations for extended objects.
2.1 CORE9
the core is a part of B – A (iv) □ if the core is not a part of A or B (v) 𝜙 if the
core is only a line segment or point. The state of objects A and B in Fig. 4 is the 9-
tuple [A, A, 𝜙, A, AB, B, B, 𝜙, B, B]. From this SI it is possible to infer that the RCC-8
relation between A and B is PO, because there is at least one core that is part of both
A and B.
3 Extended CORE9
Consider a pair of extended objects, say $A$ and $B$, such that $a_1, a_2, \ldots, a_m$ are the $m$
components of $A$ and $b_1, b_2, \ldots, b_n$ are the $n$ components of $B$, i.e., $A = \bigcup_{i=1}^{m} a_i$ and
$B = \bigcup_{i=1}^{n} b_i$. The MBR of a set of rectangles, $MBR(a_1, a_2, \ldots, a_n)$, is defined as the
axis-parallel rectangle with the smallest area covering all the rectangles. To extract
binary spatial information between A and B, we first obtain MBRs of A and B, i.e.
MBR(A) = MBR(a1 , a2 , ..., am ) and MBR(B) = MBR(b1 , b2 , ..., bn ). In Fig. 5, MA is
MBR(A), where the extended object A = a1 ∪ a2; similarly, MB is MBR(B). The nine
cores of the extended objects A and B are obtained from MBR(A) and MBR(B) as
defined in [1]. For each of the nine cores, we store extended state information
(ESI), which tells us whether a particular core has a non-empty intersection with any
of the components of the extended objects.
For each $core_{xy}(A, B)$, $x, y \in \{1 \ldots 3\}$, of the whole MBRs of A and B, the ESI,
$\sigma_{xy}(A, B)$, is defined as
$\sigma_{xy}(A, B) = \bigcup_{k=1}^{m} \big( a_k \cap core_{xy}(A, B) \big) \cup \bigcup_{k=1}^{n} \big( b_k \cap core_{xy}(A, B) \big) \qquad (1)$
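A minimal sketch of computing the ESI of Eq. 1 is shown below. It assumes the nine CORE9 cores are the 3 × 3 grid induced by the sorted x- and y-coordinates of MBR(A) and MBR(B), as in the construction of [1]; rectangles are represented as (x1, y1, x2, y2) tuples, and the function names and toy components are ours.

```python
def mbr(rects):
    """Minimum bounding rectangle of a set of axis-parallel rectangles (x1, y1, x2, y2)."""
    xs1, ys1, xs2, ys2 = zip(*rects)
    return (min(xs1), min(ys1), max(xs2), max(ys2))

def intersects(r, core):
    """True if two axis-parallel rectangles share a region of non-zero area."""
    return (min(r[2], core[2]) > max(r[0], core[0]) and
            min(r[3], core[3]) > max(r[1], core[1]))

def esi(A, B):
    """Extended state information per Eq. (1) over the assumed 3 x 3 core grid."""
    ma, mb = mbr(A.values()), mbr(B.values())
    xs = sorted([ma[0], ma[2], mb[0], mb[2]])
    ys = sorted([ma[1], ma[3], mb[1], mb[3]])
    sigma = [[set() for _ in range(3)] for _ in range(3)]
    for gy in range(3):
        for gx in range(3):
            core = (xs[gx], ys[gy], xs[gx + 1], ys[gy + 1])
            for name, rect in list(A.items()) + list(B.items()):
                if intersects(rect, core):
                    sigma[gy][gx].add(name)
    return sigma

# Toy example: two extended objects with two components each
A = {"a1": (0, 0, 2, 2), "a2": (1, 1, 4, 3)}
B = {"b1": (3, 2, 6, 5), "b2": (5, 0, 7, 2)}
print(esi(A, B))
```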
In a human interaction, component relations are the relations between body parts
of one person with body parts of another person/object. The overall relation between
the interacting person(s)/object is obtained as a function of these component rela-
tions; we term this as whole-relation. Using the ESI, we are interested in computing
the whole relations and inter-entity component relations.
Component relations are the relations between parts of one entity and parts of the
other entity. We give a general recursive algorithm, Algorithm 1, to find all inter-
entity component relations, R(ai , bj ). We focus on topological relations expressed as
RCC8 relations [9] and directional relations expressed as cardinal directions [6]; the
algorithm is valid for both topological and directional relations. The algorithm takes
advantage of the fact that when two components are completely in different cores, the
topological relation between two such components can be immediately inferred to
be DC (a1 and b1 in Fig. 5a). On the other hand, the directional relation between two
such components in different cores can be inferred following the cardinal directions:
Fig. 6 The objects in the first three levels of recursion and the base case
else
return FALSE ⊳ no new relations are computed
return TRUE ⊳ at least one new relation is computed
$\sigma(A, B) = \begin{bmatrix} \{a_3\} & \{a_3\} & \square \\ \{a_1, a_2, a_3\} & \{a_2, a_3, b_1, b_2\} & \{b_1, b_2\} \\ \square & \{b_2\} & \{b_2, b_3\} \end{bmatrix}$
From this ESI, we can infer R(a1 , b1 ), R(a1 , b2 ), R(a1 , b3 ), R(a2 , b3 ), R(a3 , b3 ) are
DC. The rest of the relations are recursively obtained from the new objects A′ = a2 ∪ a3 and
B′ = b1 ∪ b2 (where a2 , a3 , b1 , b2 ∈ core22 (A, B)) as shown in Fig. 6; this happens at
level 1 in Fig. 7. The ESI of A′ and B′ will be:
$\sigma(A', B') = \begin{bmatrix} \{a_3\} & \{a_3\} & \square \\ \{a_2, a_3\} & \{a_2, a_3, b_1\} & \{b_1\} \\ \square & \{b_2\} & \{b_2\} \end{bmatrix}$
From this ESI, we further infer that R(a2 , b2 ), R(a3 , b2 ) are DC. For the rest of
the relations we recursively compute A′′ = a2 ∪ a3 and B′′ = b2 (where a2 , a3 , b2 ∈
core22 (A′ , B′ )) as shown in Fig. 6; this is level 2 in Fig. 7.
$\sigma(A'', B'') = \begin{bmatrix} \{a_3\} & \{a_3\} & \square \\ \{a_2, a_3\} & \{a_2, a_3, b_1\} & \{b_1\} \\ \{a_2\} & \{a_2\} & \square \end{bmatrix}$
At this stage, no new information is obtained using the ESI; hence CORE9 SI is used
to infer R(a3 , b1 ) and R(a2 , b1 ) as PO. This is the base case and level 3 in Fig. 7. The
recursive algorithm ensures that the number of computations is minimal for a given
pair of extended objects.
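The sketch below mirrors the recursion pattern illustrated by the worked example above; it is not the paper's Algorithm 1 verbatim. In particular, it assumes that each level declares DC for component pairs that never share a core, recurses on the components lying in the central core (core22), and falls back to plain CORE9 SI when no new relations can be inferred. All names, including the esi_fn and core9_relation callbacks, are ours.

```python
def extract_relations(A, B, esi_fn, core9_relation):
    """Rough sketch of the recursive ExtCORE9 extraction pattern (stated assumptions).

    A, B           : dicts name -> rectangle for the components of each object
    esi_fn(A, B)   : returns the 3 x 3 ESI grid of sets of component names
    core9_relation : fallback computing the relation of a component pair directly
    """
    relations = {}
    pending = {(a, b) for a in A for b in B}
    curA, curB = dict(A), dict(B)
    while pending and curA and curB:
        sigma = esi_fn(curA, curB)                      # 3 x 3 grid of name sets
        cooccur = {(a, b) for row in sigma for cell in row
                   for a in cell if a in curA for b in cell if b in curB}
        new_dc = {p for p in pending
                  if p[0] in curA and p[1] in curB and p not in cooccur}
        for p in new_dc:
            relations[p] = "DC"                         # never share a core
        pending -= new_dc
        if not new_dc:                                  # no progress: stop recursing
            break
        centre = sigma[1][1]                            # components inside core22
        curA = {k: v for k, v in curA.items() if k in centre}
        curB = {k: v for k, v in curB.items() if k in centre}
    for a, b in pending:                                # base case: plain CORE9 SI
        relations[(a, b)] = core9_relation(A[a], B[b])
    return relations
```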
Theorem 1 Extended CORE9 is linear in the number of overlaps between components
of the two objects.
Proof Algorithm 1 computes the m × n inter-entity component relations opportunistically,
focusing on the ones most important for HAR and requiring an explicit computation only
if there is an overlap between components. When components belong to different cores,
the spatial relation between them can be inferred immediately using the ESI. In the worst
case, if all components of A overlap all components of B, the number of computations
required is mn. ■
For extended objects A and B (with m and n components, say), CORE9 could use either
of the two variants, CORE9W or CORE9C, for representation. In CORE9C, the number
of computations is quadratic in the total number of components of A and B, i.e.,
O((m + n)^2). CORE9W requires a constant number of computations, but at the
expense of information loss (as detailed in Sect. 2.2). In contrast, the number of
computations required to obtain all relations using ExtCORE9 is linear in the number of
overlaps between components of A and B.
3.2 Whole-Relations
We derive whole relations between the extended objects, for both topological and
directional aspects, from the component relations computed previously. The topo-
logical whole relation between A and B (R(A, B)), is obtained as follows:
2 We use I-frames obtained using the tool ffmpeg as keyframes, http://www.ffmpeg.org.
Similar LDA clustering experiments were performed using topological and direc-
tional features obtained using (a) CORE9W (b) CORE9C (c) ExtCORE9W and (d)
ExtCORE9C . A comparison of the f-measures is given in Fig. 8. For all activities
used in the experimentation, the qualitative features obtained using ExtCORE9W
provide a much better feature set for the activity compared to that obtained from
CORE9W . This is because CORE9W fails to recognize many interesting interaction
details at the component level.
ExtCORE9W performs better for most activities compared to CORE9C . In case of
CORE9C , even though all interaction details involving components are considered,
a lot of unimportant intra-entity component-wise relations are incorporated as well,
while losing out on the more interesting inter-entity whole relations. An interesting
Fig. 8 F1-scores of a CORE9W, b CORE9C, c ExtCORE9W, d ExtCORE9C
result is seen in the case of the activity throw, where CORE9C achieves the best performance.
We believe this is because of the nature of the throw activity, in which entities
tend to overlap for the most part at the beginning and then move apart suddenly;
the entity being thrown is soon no longer in the scene, so there is little evidence within
the feature set of the activity regarding the moving-apart phase. However, CORE9C,
utilizing the full set of inter-entity and intra-entity component relations as features,
provides a better description.
A similar case is seen for ExtCORE9C. For most activity classes, the performance
of ExtCORE9W is better; this emphasizes the importance of the
inter-entity whole relations as computed by ExtCORE9W. However, for the activity
class throw, ExtCORE9C performs marginally better. This shows that for activities
in which there is little evidence of interaction among entities, using the inter-entity
whole-relation only worsens the classification results.
5 Final Comments
The part-based model of the human body obtained during tracking is easily seen as
an extended object. ExtCORE9W leads to better interaction models by focusing on
component-wise relations and whole relations of these extended objects. A recur-
sive algorithm is used to opportunistically extract the qualitative relations using as
few computations as possible. ExtCORE9W assumes components are axis-aligned
MBRs. For single-component objects that are not axis-aligned, more accurate rela-
tions can be obtained [12]. Adapting ExtCORE9W such that objects and components
are not axis-aligned is part of ongoing research.
References
1. Cohn, A.G., Renz, J., Sridhar, M.: Thinking inside the box: A comprehensive spatial represen-
tation for video analysis. In: Proc. 13th Int. Conf. on Principles of Knowledge Representation
and Reasoning (KR2012). pp. 588–592. AAAI Press (2012)
2. Aggarwal, J., Ryoo, M.: Human activity analysis: A review. ACM Computing Surveys 43(3),
16:1–16:43 (Apr 2011)
3. Dubba, K.S.R., Bhatt, M., Dylla, F., Hogg, D.C., Cohn, A.G.: Interleaved inductive-abductive
reasoning for learning complex event models. In: ILP. Lecture Notes in Computer Science, vol.
7207, pp. 113–129. Springer (2012)
4. Kusumam, K.: Relational Learning using body parts for Human Activity Recognition in
Videos. Master’s thesis, University of Leeds (2012)
5. Schneider, M., Behr, T.: Topological relationships between complex spatial objects. ACM
Trans. Database Syst. 31(1), 39–81 (2006)
6. Skiadopoulos, S., Koubarakis, M.: On the consistency of cardinal directions constraints. Arti-
ficial Intelligence 163, 91 – 135 (2005)
7. Chen, L., Nugent, C., Mulvenna, M., Finlay, D., Hong, X.: Semantic smart homes: Towards
knowledge rich assisted living environments. In: Intelligent Patient Management, vol. 189, pp.
279–296. Springer Berlin Heidelberg (2009)
8. Cohn, A.G., Hazarika, S.M.: Qualitative spatial representation and reasoning: An overview.
Fundam. Inform. 46(1-2), 1–29 (2001)
9. Randell, D.A., Cui, Z., Cohn, A.G.: A spatial logic based on regions and connection. In:
Proc. of 3rd Int. Conf. on Principles of Knowledge Representation and Reasoning (KR’92).
pp. 165–176. Morgan Kauffman (1992)
10. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent dirichlet allocation. J. Mach. Learn. Res. 3,
993–1022 (2003)
11. al Harbi, N., Gotoh, Y.: Describing spatio-temporal relations between object volumes in video
streams. In: Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence
(2015)
12. Sokeh, H.S., Gould, S., J, J.: Efficient extraction and representation of spatial information
from video data. In: Proc. of the 23rd Int. Joint Conf. on Artificial Intelligence (IJCAI’13).
pp. 1076–1082. AAAI Press/IJCAI (2013)
13. Fei-Fei, L., Perona, P.: A bayesian hierarchical model for learning natural scene categories.
In: IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition (CVPR). vol. 2, pp.
524–531 (2005)
14. Phan, X.H., Nguyen, C.T.: GibbsLDA++: A C/C++ implementation of latent Dirichlet allo-
cation (LDA) (2007)
Robust Pose Recognition Using Deep
Learning
1 Introduction
The current state-of-the-art pose estimation methods are not flexible enough to
model horizontal people, suffer from the double counting phenomenon (when both left
and right legs lie on the same image region) and get confused when objects partially
occlude people. Earlier works on pose estimation [1, 2] impose a stick-man model
on the image of the body and assume that the head lies above the torso. Similarly,
shoulder joints are supposed to be higher than the hip joint and legs. However, these
assumptions are unrealistic and are violated under typical scenarios shown in this
work. As an example, we show how one state-of-the-art approach [1] fails to estimate
the pose correctly for an image taken from the standard PARSE dataset as shown in
Fig. 1a, b. The images of Indian classical dance (ICD) and Yoga too have such com-
plex configuration of body postures where current pose estimation methods fail as
shown in Fig. 1c, d. There exists a set of 108 dance postures named Karanas in the
original Natya Shastra enacted by performers of Bharatnatyam. Yoga too is popular
as a system of physical exercise across the world. Several challenges such as occlu-
sions, change in camera viewpoint, poor lighting etc. exist in the images of body
postures in dance and Yoga. The proposed ICD and Yoga datasets have such complex
scenarios where the head is not necessarily above the torso, or have horizontal or
overlapping people, twisted bodies, or objects that partially occlude people. Hence,
we also tested the approach of Ramanan et al. [1] on the proposed dataset and the results
are depicted in Fig. 1c, d. The results of another recent technique using tree models
for pose estimation proposed by Wang et al. [2] on our dataset are also reported in
Sects. 6.1 and 6.3.
Fig. 1 Failure of the state-of-the-art approach [1] on a few images from PARSE [3] and our datasets.
a and b represent failure results of [1] on the PARSE dataset. c and d represent failure results of [1]
on our ICD and Yoga datasets. The failure cases emphasise the inability of the approach in [1] to model
horizontal people as in (a) and to handle partially occluded people as shown in (b).
The color assignment of parts is depicted in (e)
Deep learning has recently emerged as a powerful approach for complex machine
learning tasks such as object/image recognition [4], handwritten character recogni-
tion [5] etc. The ability of deep learning algorithms to classify images without relying
on hand-crafted features motivated us to use them for pose identification in typical
situations wherein pose estimation algorithms such as [1, 2] fail due to unrealistic
assumptions. Since there is no publicly available dataset on ICD, we created our own
dataset containing images of twelve dance postures collected in laboratory settings
and a dataset of fourteen poses from videos on Youtube. We also created a dataset
of eight Yoga poses to show the efficacy of a trained CNN model and an SAE in
identifying postures in dance and Yoga.
Because of limited labeled data we used data augmentation and transfer learning.
We used a pre-trained model which is trained with a large dataset such as MNIST [5].
Interestingly, we observe a significant reduction in time taken to train a pre-trained
network on our datasets and also improvements in accuracy.
2 Prior Work
There are several works in the literature pertaining to the identification of poses. Mallik
et al. [6] tried to preserve the living heritage of Indian classical dance.
However, unlike our work, they do not identify body postures of the dancer. To clas-
sify ICD, a sparse representation based dictionary learning technique is proposed in
[7]. In the literature there are very few significant works addressing the problem of
recognition of postures in ICD, but a vast literature on general pose identification of
humans exists. An initial work for 2D pose estimation in the image/video domains is
[8]. Entire human shapes have been matched in [9].
A discriminatively trained, multi-scale, deformable part based model for pose esti-
mation is proposed in [10]. This idea is also used for object detection in [11]. Felzen-
szwalb et al. [12] describe a statistical framework for representing the visual appear-
ance of objects composed of rigid parts arranged in a deformable configuration. A
generic approach for human detection and pose estimation based on the pictorial
structures framework is proposed by Andriluka et al. in [13].
Very recently, a deep learning approach using CNNs has been used for estimating
pose in [14] but it does not deal with complex datasets like ICD and Yoga as in this
work. Recently several models which incorporated higher order dependencies while
remaining efficient in [15] have been proposed. A state-of-the-art method for pose
estimation using tree models is given in [2]. A new hierarchical spatial model that
can capture an exponential number of poses with a compact mixture representation
is given in [16]. Dantone et al. [17] estimate 2D human pose in still images by
proposing novel, nonlinear joint regressors. A method for automatic generation of
training examples from an arbitrary set of images and a new challenge of joint
detection and pose estimation of multiple articulated people in cluttered sport
scenes are proposed by Pishchulin et al. [18]. The method of Eichner et al. [19] is capable of esti-
mating upper body pose in highly challenging uncontrolled images, without prior
knowledge of background, clothing, lighting, or the location and scale of the person.
A learning based method for recovering 3D human body pose from single images
and monocular image sequences is given by [20]. An efficient method to accurately
predict human pose from a single depth image is proposed by Shotton et al. [21].
The general architecture of the proposed CNN is shown in Fig. 2a. Apart from the
input and the output layers, it consists of two convolution and two pooling layers.
The input is a 32 × 32 pixels image of a dance posture.
As shown in Fig. 2a, the input image of 32 × 32 pixels is convolved with 10 filter
maps of size 5 × 5 to produce 10 output maps of 28 × 28 in layer 1. The output
convolutional maps are downsampled with max-pooling of 2 × 2 regions to yield 10
output maps of 14 × 14 in layer 2. The 10 output maps of layer 2 are convolved with
each of the 20 kernels of size 5 × 5 × 10 to obtain 20 maps of size 10 × 10. These
maps are further downsampled by a factor of 2 by max-pooling to produce 20 output
maps of size 5 × 5 of layer 4. The output maps from layer 4 are concatenated to form
a single vector during training and fed to the next layer. The number of neurons in
the final output layer depends upon the number of classes in the database.
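As a concrete illustration of the layer dimensions described above, the following is a minimal sketch in PyTorch (the framework is an assumption, since the paper uses the toolbox of [23]; the activation function is not specified in the text, so tanh is assumed):

```python
import torch
import torch.nn as nn

class PoseCNN(nn.Module):
    """Sketch of the described CNN: 32x32 input, two conv/pool stages, dense output layer."""
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5),   # 32x32 input -> 10 maps of 28x28 (layer 1)
            nn.Tanh(),                         # activation is an assumption
            nn.MaxPool2d(2),                   # max-pooling over 2x2 -> 10 maps of 14x14 (layer 2)
            nn.Conv2d(10, 20, kernel_size=5),  # 5x5x10 kernels -> 20 maps of 10x10 (layer 3)
            nn.Tanh(),
            nn.MaxPool2d(2),                   # -> 20 maps of 5x5 (layer 4)
        )
        # the 20 x 5 x 5 = 500 outputs of layer 4 are concatenated into a single vector
        self.classifier = nn.Linear(20 * 5 * 5, num_classes)

    def forward(self, x):                      # x: (batch, 1, 32, 32)
        x = self.features(x)
        return self.classifier(x.flatten(1))

model = PoseCNN(num_classes=12)                # e.g. 12 synthetic ICD pose classes
logits = model(torch.randn(4, 1, 32, 32))      # a batch of 4 grayscale 32x32 images
```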
Fig. 2 a Architecture of the proposed CNN model used for pose and Yoga classification. b Detailed
block diagram of the proposed SAE architecture used for pose and Yoga classification
(neural network) to classify the input [22]. The architecture of the proposed SAE
is shown in Fig. 2b. In SAE the image inputs are fed to the hidden layer to extract
features as seen in Fig. 2b. Then the features are fed to the output layer of SAE to
reconstruct back the original input. Output of the last layer is treated as input to
a classifier. We use a neural network as a classifier, training it to map the features
extracted to the output labels. We used an input layer of 784 nodes followed by a hidden
layer of 100 nodes and an output layer of 784 nodes. This SAE is followed by a
neural network having 784 input nodes, 100 hidden nodes and a number of output nodes equal
to the number of classes, as shown in Fig. 2b.
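A minimal PyTorch sketch of this SAE pipeline is given below; the framework, the sigmoid activations and the reuse of the trained encoder weights to initialise the classifier's hidden layer are assumptions, since the text only specifies the 784-100-784 autoencoder followed by a 784-100-classes neural network:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """784-100-784 autoencoder: encode a 28x28 image and reconstruct it."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(784, 100)   # 784 input nodes -> 100 hidden features
        self.decoder = nn.Linear(100, 784)   # reconstruct the original 784-dim input

    def forward(self, x):
        h = torch.sigmoid(self.encoder(x))
        return torch.sigmoid(self.decoder(h)), h

class SAEClassifier(nn.Module):
    """784-100-num_classes network; hidden layer optionally initialised from the encoder."""
    def __init__(self, num_classes, pretrained_encoder=None):
        super().__init__()
        self.hidden = nn.Linear(784, 100)
        self.output = nn.Linear(100, num_classes)
        if pretrained_encoder is not None:
            # initialise the hidden layer with the autoencoder's learned encoder weights
            self.hidden.load_state_dict(pretrained_encoder.state_dict())

    def forward(self, x):
        return self.output(torch.sigmoid(self.hidden(x)))

ae = AutoEncoder()
# ... train `ae` to reconstruct 784-dimensional (28x28) pose images ...
clf = SAEClassifier(num_classes=12, pretrained_encoder=ae.encoder)
logits = clf(torch.rand(4, 784))
```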
4 Data Augmentation
It has been shown in [4] that data augmentation boosts the performance of CNNs.
We performed data augmentation for the Yoga dataset so as to increase the amount
of labeled data. We did not augment the synthetic and Youtube based pose databases
since the number of images was substantial. The Yoga pose database has only 8
classes with 50 images per class. We performed data augmentation of the training
data by five-fold cropping and resizing the images to the original size. Of the 50 images per
class, we used 40 images per class for training, which we augmented 5 times to 200
images per class. The test images were 10 per class. Hence, we obtained a total of
1600 training images and 80 test images for all 8 classes.
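A minimal sketch of the five-fold crop-and-resize augmentation is given below, assuming Pillow for image handling; the exact crop geometry (four corners plus centre, with a 10-pixel margin) is an assumption, since the text only states that each training image was cropped five times and resized back to the original size:

```python
from PIL import Image

def five_fold_augment(img, crop_margin=10):
    """Return five crops (four corners + centre) resized back to the original size.

    The corner/centre crop layout and the margin are assumptions.
    """
    w, h = img.size
    m = crop_margin
    boxes = [
        (0, 0, w - m, h - m),                        # top-left crop
        (m, 0, w, h - m),                            # top-right crop
        (0, m, w - m, h),                            # bottom-left crop
        (m, m, w, h),                                # bottom-right crop
        (m // 2, m // 2, w - m // 2, h - m // 2),    # centre crop
    ]
    return [img.crop(box).resize((w, h)) for box in boxes]

# 40 training images per Yoga class x 5 crops each -> 200 images per class
img = Image.open("yoga_pose.jpg").convert("L").resize((100, 100))
augmented = five_fold_augment(img)
```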
5 Transfer Learning
Because of limited labeled training data, the proposed CNN is pre-trained from ran-
domly initialized weights using MNIST [5] which contains 50,000 labeled training
images of hand-written digits. The CNN is trained for 100 epochs with this data
yielding an MSE of 0.0034 and a testing accuracy of 99.08 % over 10,000 images.
The converged weights of the trained network are used to initialize the weights of
the CNN model to which our dance pose and Yoga datasets were given
as input. We obtained much faster convergence during training with a pre-trained
network and improved accuracies on the test data.
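The transfer-learning step can be sketched as below, reusing the hypothetical PoseCNN class from the earlier sketch; resizing the 28 x 28 MNIST digits to 32 x 32 before pre-training and replacing only the output layer before fine-tuning are assumptions:

```python
import copy
import torch.nn as nn

# MNIST has 10 digit classes; digits are assumed to be resized/padded to 32x32 beforehand.
pretrained = PoseCNN(num_classes=10)
# ... pre-train `pretrained` on MNIST for 100 epochs ...

# Initialise the pose network from the converged MNIST weights,
# then swap the output layer to match the number of pose classes.
pose_net = copy.deepcopy(pretrained)
pose_net.classifier = nn.Linear(20 * 5 * 5, 12)   # e.g. 12 synthetic ICD pose classes
# ... fine-tune `pose_net` on the pose dataset (it converges in far fewer epochs) ...
```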
6 Experimental Results
Training Phase: Synthetic Case The constrained database used for training the pro-
posed CNN architecture described in subsection 3.1 consists of 864 images which
were captured using a Kinect camera, originally at 640 × 480 pixels resolution. We
used images of 12 different poses as shown in Fig. 3a enacted 12 times by 6 different
Fig. 4 a Mean squared error (MSE) versus epochs for the CNN trained on the synthetic pose
dataset. b MSE versus epochs plot for pose data from Youtube videos. c Effect of pre-training on
synthetic pose data using a CNN pre-trained with MNIST data. d Effect of pre-training on real
world pose data using a pre-trained model
of 144 images. There is no overlap between the training and the test datasets. All
images are down-sampled to 32 × 32 pixels before feeding to the CNN.
The weights of the proposed CNN are trained by the conventional back-
propagation method using the package in [23]. The total number of learnable para-
meters in the proposed CNN architecture is 6282. We have chosen batch size as 4
and constant learning rate 𝛼 = 1 throughout all the layers. The network is trained
for 300 epochs using random initialization of weights on a 3.4 GHz Intel Core i7
processor with 16 GB of RAM. The variation of the mean square error (MSE) ver-
sus epochs during the training phase is shown in Fig. 4a and the final MSE during
training is 1.53 % in 300 epochs. Interestingly, by using a pre-trained MNIST model
to initialize the weights of a CNN we could achieve an MSE of 1.74 % in only 15
epochs as represented in Fig. 4c.
Testing Phase For the testing phase, we give as input to the trained CNN model
images from the test dataset. Given a test image, the output label with maximum
score is chosen at the output layer of the CNN. The accuracy for 144 images is
97.22 %. By using transfer learning we could achieve an improved accuracy of
98.26 % in only 15 epochs as compared to 97.22 % with 300 epochs in case of random
initialization of weights as shown in Table 1.
Table 1 Performance of the proposed CNN method on our proposed pose dataset of synthetic
pose (ICD), real-world pose (ICD) and Yoga dataset

Data | No. of classes | Training set | Testing set | 𝛼 | Batch size | Epochs | MSE | Proposed approach (%) | Transfer learning (%)
Synthetic pose (ICD) | 12 | 720 | 144 | 0.5 | 5 | 300 | 0.0153 | 97.22 | 98.26 (15 epochs)
Real-world pose (ICD) | 14 | 1008 | 252 | 0.5 | 4 | 200 | 0.0258 | 93.25 | 99.72 (2 epochs)
Yoga data | 8 | 1600 | 80 | 0.5 | 5 | 500 | 0.0062 | 90.0 | –
Training Phase: Real-World Data We downloaded some dance videos from
Youtube. Each extracted frame is re-sized to 100 × 200 pixels. We created a
dataset of such real-world images for 14 different poses performed by 6 different
dancers, extracting 15 frames per pose for each dancer. A snapshot of the 14 postures
is depicted in Fig. 3b. To create the training set, we used 12 frames per pose for each
of the 6 performers, leading to 1008 images. The testing set consisted of the remaining 252
images. Similar to the synthetic case, there is no overlap between the training and
testing sets. All images were further re-sized to 32 × 32 pixels before feeding to the
CNN.
The CNN model was trained for 200 epochs using random initial weights with
batch size as 4 and constant learning rate 𝛼 = 0.5 throughout all the layers. The
variation of the mean square error (MSE) versus epochs during the training phase is
shown in Fig. 4b. By using a pre-trained MNIST model to initialize the weights of a
CNN we could achieve an MSE of 0.37 % in only 2 epochs as represented in Fig. 4d.
The first layer filter kernels for an image from the Youtube pose dataset (in Fig. 5a)
are shown in Fig. 5b and the convolved outputs at the first layer are shown in Fig. 5c.
Testing Phase The test set containing 252 images is input to the trained CNN and
yields an overall accuracy of 93.25 %. By using transfer learning we could achieve
an improved accuracy of 99.72 % in only 2 epochs as compared to 93.25 % with
200 epochs for the random initialization of weights as shown in Table 1. The exist-
ing state-of-the-art methods for pose estimation [1, 2] work well for the standard
datasets, but fail to perform on our proposed dataset due to the complexity in our
dataset involving illumination, clothing and clutter. The failure cases of the state-of-
the-art approaches [1, 2] on the proposed dataset are shown in Fig. 6a, b. The strong
performance of the proposed CNN architecture shows that it is an apt machine learn-
ing algorithm for identifying dance postures (Karanas).
Fig. 5 a Original input image to a CNN. b First layer filter kernels in the Youtube pose dataset of
a CNN. c First layer convolved output in Youtube pose dataset. d The input Yoga pose. e First layer
filters of the SAE for the Yoga data. f The reconstructed output of the SAE for the Yoga pose in (d)
Fig. 6 Comparison with state-of-the-art: a Some images of Karanas from our proposed dataset
where the approach proposed by Ramanan et al. [1] fails. b Failure results of Wang et al. [2] due to
the complexity of our dataset with regard to illumination, clutter in the background, clothing etc.
Table 2 Performance of the proposed SAE method on our proposed pose dataset of synthetic pose,
real-world pose and Yoga dataset

Data | No. of classes | Training set | Testing set | 𝛼, Batch size, Epochs of autoencoder | 𝛼, Batch size, Epochs of neural network | Testing accuracy (%)
Synthetic pose (ICD) | 12 | 720 | 144 | 0.5, 4, 1000 | 0.5, 4, 1000 | 86.11
Real-world pose (ICD) | 14 | 1008 | 252 | 0.5, 4, 200 | 0.5, 4, 200 | 97.22
Yoga data | 8 | 1600 | 80 | 0.09, 5, 500 | 0.09, 5, 500 | 70.0
As explained earlier, our SAE consisting of three layers along with a neural network
is used to classify images of both the ICD and Yoga datasets. The accuracy obtained by
using a stacked auto encoder for the synthetic pose data is 86.11 % and for the real-
world pose data is 97.22 %. The details of using the stacked auto encoder are reported
in Table 2.
Training Phase We downloaded 50 images per class for 8 Yoga postures and re-sized
them to 100 × 100 pixels. A snapshot of these 8 Yoga postures is depicted in Fig. 7.
To create the training set, we used 40 images per pose. The testing set consisted of
the remaining 10 images per pose. There is no overlap between the training and testing
sets. Then we performed data augmentation by cropping successively and resizing
to original size. All images were further re-sized to 32 × 32 pixels before feeding to
the CNN. The CNN model was trained for 500 epochs from random initial weights
with batch size as 5 and constant learning rate 𝛼 = 0.5 throughout all the layers.
Testing Phase The test set containing 80 images is input to the trained CNN and
yields an overall accuracy of 90 %. The existing state-of-the-art methods for pose
estimation [1, 2] fail to perform on our proposed dataset due to poor illumination,
Fig. 8 A snapshot of Yoga poses extracted from our proposed dataset a where the state-of-the-art
approach proposed by Ramanan et al. [1] fails and b where Wang et al. [2] fails due to the complexity
associated with our dataset, i.e. twisted body, horizontal body, etc.
clothing on body parts and clutter in the background. Importantly, note that the
assumption of the head being above the torso always is not satisfied for these images.
The failure of the existing state-of-the-art methods of [1, 2] on the complex Yoga
dataset is represented in Fig. 8a, b respectively.
SAE Model: Yoga Data Auto encoders were stacked, as in the case of the pose data, to
initialise the weights of a deep network which was followed by a neural
network to classify the poses. An image depicting a Yoga posture input to the SAE
is shown in Fig. 5d. The 100 filters in the first layer of the SAE are shown in Fig. 5e.
The reconstructed output of the SAE for a single Yoga posture
is shown in Fig. 5f. The accuracy obtained by using a stacked auto encoder for the
Yoga dataset is 70 %. The details regarding the proposed SAE are reported in Table 2.
7 Conclusions
The state-of-the-art approaches [1, 2] are not robust enough for estimating poses in
conditions such as bad illumination, clutter, flowing dresses and twisted bodies, commonly
found in the images of the proposed ICD and Yoga datasets. Hence, a deep learn-
ing framework is presented here to classify the poses which violate the assumptions
made by state-of-the-art approaches such as the constraint that the head has to be
above the torso which is not necessarily maintained in ICD or Yoga. The proposed
CNN and SAE models have been demonstrated to be able to recognize body postures
to a high degree of accuracy on both ICD and Yoga datasets. There are several chal-
lenges in the problem addressed here such as occlusions, varying viewpoint, change
of illumination etc. Both ICD and Yoga have various dynamic poses which we aim to
classify by analyzing video data in our future work.
References
1 Introduction
its online counterparts due to the availability of less information. Although several
studies [1–5] of this problem can be found in the literature, it still remains an open
field of research. That is why several handwriting text line segmentation contests
have been held recently in conjunction with a few reputed conferences [6].
In this article, we present a novel and simple method based on a divide and
conquer strategy for line segmentation of textual documents irrespective of the
script. We simulated the proposed approach on a standard dataset and the recognition
results are comparable with the state-of-the-art approaches.
The remaining part of this paper is organized as follows: Sect. 2 provides a brief
survey of the existing works. The proposed approach has been described in Sect. 3.
Results of our experimentation have been provided in Sect. 4. Conclusion is drawn
in Sect. 5.
2 Previous Works
The proposed scheme is based on a divide and conquer strategy, where the input
document is first divided into several vertical strips and text lines in each strip are
identified. Next, the individual text lines of a strip are associated with the corre-
sponding text lines (if any) of the adjacent strip to the right side. This association
process starts from the two consecutive leftmost strips of the document and is ter-
minated at the two consecutive rightmost strips. Finally, the text lines of the entire
document get segmented. The overall flow of the process is shown in Fig. 1. Specific
strategies are employed for (i) consecutive text lines which vertically overlap within
a strip or (ii) touching texts in vertically adjacent lines.
3.1 Preprocessing
The input raw image is first subjected to a few preprocessing operations. These
include mean filtering with window size 3 × 3 followed by binarization using a
recently proposed method [17]. The minimum bounding rectangle of the binarized
image is computed for further processing. The text portions of the processed image
are black against a white background.
Since neither the words in a text line nor the consecutive text lines in a handwritten
document are expected to be properly aligned, the binarized document is first verti-
cally divided into several strips. The width of these strips is estimated
so that it is neither too small nor too large. The horizontal projection profile plays a
major role in the proposed approach. If the width is too large, the separation between
two consecutive lines inside a strip often may not be signalled by its horizontal pro-
jection profile. On the other hand, if the width is too small, the horizontal projection
profile may frequently indicate false line breaks. Here we estimate the width of the
vertical strips as follows.
Step 1: Divide the document into a few (here, 10) vertical strips of equal width.
Step 2: Compute horizontal projection profile of each strip.
Step 3: Identify the connected components of horizontal projection profile in each
strip.
Step 4: Decide the horizontal segment bounded by the upper and lower boundaries
of each such connected component as a text line inside the strip.
Step 5: Compute the average height (Havg ) of all such text lines in the input docu-
ment.
Step 6: Similarly, compute the average height (LSavg ) of the gaps between two con-
secutive text lines.
Step 7: Obtain the Strip Width estimate Sw = 3 ∗ (Havg + LSavg ).
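A minimal NumPy sketch of this strip-width estimation is given below; the convention that text pixels are 1 in the binary array, and the run-length analysis of the horizontal projection profile, are assumptions consistent with Steps 1-7:

```python
import numpy as np

def estimate_strip_width(binary, n_strips=10):
    """Sketch of the strip-width estimate S_w = 3 * (H_avg + LS_avg).

    `binary` is a 2-D array with text pixels = 1 and background = 0 (an assumed convention).
    """
    h, w = binary.shape
    line_heights, gap_heights = [], []
    for s in range(n_strips):
        strip = binary[:, s * w // n_strips:(s + 1) * w // n_strips]
        profile = strip.sum(axis=1) > 0              # horizontal projection profile (row occupied?)
        changes = np.diff(profile.astype(int))       # runs of occupied rows = text lines
        starts = np.where(changes == 1)[0] + 1
        ends = np.where(changes == -1)[0] + 1
        if profile[0]:
            starts = np.r_[0, starts]
        if profile[-1]:
            ends = np.r_[ends, len(profile)]
        line_heights += list(ends - starts)          # heights of profile components
        gap_heights += list(starts[1:] - ends[:-1])  # gaps between consecutive lines
    h_avg = np.mean(line_heights)
    ls_avg = np.mean(gap_heights) if gap_heights else 0
    return int(3 * (h_avg + ls_avg)), h_avg

# strip_width, h_avg = estimate_strip_width(binary_page)
```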
Next, we obtain a rough estimate of the text lines in each individual strip of
width Sw as described in the following section.
Step 1: Consider the next strip and verify its initial segmented lines from top to
bottom until there is no more strip.
Step 2: If the height of the next line in the current strip is less than 2Havg , then we
accept it as a single line and move to the next line in the strip until we reach the
bottom of the strip when we go to Step 1. If the height of a line exceeds the above
threshold, we move to the next step (Step 3).
Fig. 2 Initial separation of text lines: a Part of a handwritten manuscript of poet Rabindranath
Tagore, b horizontal line segments are drawn at the top and bottom of each profile component
inside individual vertical strips, c estimated line segments barring the line of small height, d initial
vertical strip-wise separation of text lines of the image shown in (a)
Step 3: Find the connected components in the current segmented line and if the
height of all such components is less than 2Havg , we move to Step 4. Otherwise,
we decide that this component consists of touching characters of two vertically con-
secutive lines. We use the projection profile component of this line and find its min-
imum valley around the middle of the region. We segment the component at a point
where the horizontal straight line segment through this valley intersects the compo-
nent. As illustrated in Fig. 3 the initial line is now segmented into two lines above
and below this horizontal straight line segment. Next, move to Step 2.
Step 4: This is a case of vertically overlapping lines and the leftmost strip of Fig. 3a
shows an example. Here we find the valley region at the middle of the projection
profile of the current segmented line. Usually, in similar situations, a small contigu-
ous part of the profile can be easily identified as the valley instead of a single valley
point. We consider the horizontal line through the middle of this valley region and
the components, major parts of which lie above this horizontal line are considered to
belong to the upper line and other components are considered to belong to the lower
line. Figure 3c illustrates this and the segmentation result is shown in Fig. 3d. Next,
move to Step 2.
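The valley-based split used in Steps 3 and 4 can be sketched as below; the size of the search window around the middle of the region is an assumption, since the text only states that the minimum valley is sought around the middle:

```python
import numpy as np

def split_tall_line(strip, top, bottom, search_frac=0.25):
    """Sketch of the valley-based split for touching/overlapping lines.

    `strip` is a binary sub-image (text = 1); the segmented line spans rows
    [top, bottom). The valley is searched around the middle of the region;
    the search-window fraction is an assumption.
    """
    profile = strip[top:bottom].sum(axis=1)          # horizontal projection of the line
    mid = len(profile) // 2
    half = max(1, int(len(profile) * search_frac))
    window = profile[mid - half:mid + half]
    valley = top + mid - half + int(np.argmin(window))   # row of the minimum valley
    # the initial line is now split into two lines above and below the valley row
    return (top, valley), (valley, bottom)

# (upper_top, upper_bot), (lower_top, lower_bot) = split_tall_line(strip_img, t, b)
```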
The lines of individual strips have already been identified in Sect. 3.4. Now, it is
required to associate the lines in a vertical strip with the corresponding lines of the
adjacent strips, if any. Here, at any time we consider a pair of adjacent strips. We
Fig. 3 Illustration of segmentation of vertically overlapping and touching lines: a Part of a hand-
written manuscript; its leftmost and rightmost strips respectively contain vertically overlapping and
touching lines, b the minimum valley around the middle region of the horizontal projection pro-
file component corresponding to the touching line is identified and shown by a blue circle, c the
overlapping region around the valley of its projection profile is shown by a dotted dark red col-
ored rectangle and the segmentation site of the touching component is shown by a blue colored
oval shape, d the initial line of each of the leftmost and rightmost strips is now segmented into two
separate lines
start with the leftmost two strips and finish at the rightmost pair. The strategy used
here is described below in a stepwise fashion.
Step 1: Set i = 1.
Step 2: Consider the pair of i-th and (i + 1)-th strips until there is no more strip.
Step 3: Consider the next line of (i + 1)-th strip. If there is no more line, increase i
by 1 and go to Step 2, else move to the next Step.
Step 4: If the current line consists of no component which has a part belonging to a
line of the i-th strip, then move to the next Step. Otherwise, associate the current line
of (i + 1)-th strip with the line of i-th strip which accommodates a part of one of its
components. If there are more than one such component common to both the strips
and they belong to different lines of i-th strip, then we associate the present line with
the line of the i-th strip corresponding to the larger component. Go to Step 3.
Step 5: We associate the current line with the line of the i-th strip having the maxi-
mum vertical overlap. If there is no such line in the i-th strip, then we look for similar
overlap with another strip further to the left. If any such strip is found, then
the two lines are associated; otherwise, the current line is considered as a new
line. Go to Step 3.
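A simplified sketch of this association strategy is given below; it implements only the maximum-vertical-overlap rule of Step 5 and, for brevity, omits the shared-component rule of Step 4 and the look-further-left fallback:

```python
def vertical_overlap(a, b):
    """Row overlap between two line intervals a = (top, bottom) and b = (top, bottom)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def associate_strips(left_lines, right_lines):
    """Associate each line of the (i+1)-th strip with a line of the i-th strip.

    Returns, for each right-strip line, the index of the matched left-strip line,
    or None when the line has no vertical overlap and is treated as a new line.
    """
    matches = []
    for r in right_lines:
        overlaps = [vertical_overlap(r, l) for l in left_lines]
        best = max(range(len(left_lines)), key=lambda k: overlaps[k]) if left_lines else None
        matches.append(best if best is not None and overlaps[best] > 0 else None)
    return matches

# matches = associate_strips(lines_of_strip_i, lines_of_strip_i_plus_1)
```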
The above strategy of association of text lines in adjacent strips is further illus-
trated in Fig. 4 using part of the document shown in Fig. 2.
Fig. 4 Association of text lines of adjacent strips: a Initial segmentation into different lines of the
left strip is shown in color, b lines of initial segmentation of the 2nd strip are associated with the
lines of 1st strip, c lines of initial segmentation of the 3rd strip are associated with the lines of 2nd
strip, d segmented lines of the 4th strip are associated with the lines of 3rd strip
3.6 Postprocessing
During initial segmentation of text lines in individual strips described in Sect. 3.3, we
ignore all text components with profile height less than Havg/3. Here, we consider
the above components and associate them to the lines nearest to them. In the example
shown in Fig. 2, there were 3 such small components which are now associated with
their respective lines. The final segmentation result of this example is shown in Fig. 5.
4 Experimental Results
Fig. 6 Comparative performance evaluation result of the proposed method provided by the ICDAR
2013 line segmentation contest
Fig. 8 A few line segmented handwritten documents of different scripts by the proposed algorithm
in this contest. The result of this comparison is shown in Fig. 6. The comparison is
provided by the Performance Metric (PM) defined in terms of the measures Detec-
tion Rate (DR) and Recognition Accuracy (RA) as follows:
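The formula itself did not survive extraction here; the following is a reconstruction based on the standard measures reported by the ICDAR handwriting segmentation contests, included as an assumption consistent with the definitions of o2o, N and M below:

DR = \frac{o2o}{N}, \qquad RA = \frac{o2o}{M}, \qquad PM = \frac{2 \times DR \times RA}{DR + RA}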
where o2o is the number of one-to-one matches between the result image and the ground
truth, and N and M are respectively the counts of ground truth and result elements.
From Fig. 6, it can be seen that the accuracy of the proposed method on ICDAR
2013 dataset is 97.99 %. Examples of a few difficult situations where the proposed
algorithm performed efficiently are shown in Fig. 7.
In Fig. 8, we show some more results of line segmentation on Devanagari, Ben-
gali, Greek and English handwritten documents.
5 Conclusions
In this article, we presented a novel method based on a simple strategy for line seg-
mentation of handwritten documents of different Indian scripts. Its performance on
“ICDAR 2013 Line Segmentation Contest” dataset is quite impressive. The method
works equally efficiently on different types of scripts and can handle various peculiar
situations of handwritten manuscripts. The only situation where we observed con-
sistent failure of the present algorithm is the use of a caret to insert a line just above
another line of the input document. A few examples of such situations are shown in
Fig. 9.
References
1. Mullick, K., Banerjee, S., and Bhattacharya, U.: An Efficient Line Segmentation Approach
for Handwritten Bangla Document Image. Eighth International Conference on Advances in
Pattern Recognition (ICAPR), 1–6 (2015)
2. Alaei, A., Pal, U., and Nagabhushan, P.: A New Scheme for Unconstrained Handwritten Text-
Line Segmentation. Pattern Recognition. 44(4), 917–928, (2011)
3. Papavassiliou, V., Stafylakis, T., Katsouros, V., Carayannis, G.: Handwritten document image
segmentation into text lines and words. Pattern Recognition. 147, 369–377 (2010)
4. Shi, Z., Seltur, S., and Govindaraju, V.: A Steerable Directional Local Profile Technique for
Extraction of Handwritten Arabic Text Lines. Proceedings of 10th International Conference
on Document Analysis and Recognition, 176–180, (2009)
5. Louloudis, G., Gatos, B., and Halatsis, C: Text Line and Word Segmentation of Handwritten
Documents. Pattern Recognition, 42(12):3169–3183, (2009)
6. Stamatopoulos, N., Gatos, B., Louloudis, G, Pal, U., Alaei, A.: ICDAR 2013 Handwritten
Segmentation Contest. 12th International Conference on Document Analysis and Recognition,
1402–1406 (2013)
7. Likforman-Sulem, L., Zahour, A., and Taconet, B.: Text Line Segmentation of Historical Doc-
uments: a Survey. International Journal of Document Analysis and Recognition: 123–138,
(2007)
8. Antonacopoulos, A., Karatzas, D.: Document Image analysis for World War II personal
records, International Workshop on Document Image Analysis for Libraries. DIAL, 336–341
(2004)
9. Li, Y., Zheng, Y., Doermann, D., and Jaeger, S.: A new algorithm for detecting text line in
handwritten documents. International Workshop on Frontiers in Handwriting Recognition, 35–
40 (2006)
10. Louloudis, G. Gatos, B., Pratikakis, I., Halatsis, K., Alaei, A.: A Block Based Hough Trans-
form Mapping for Text Line Detection in Handwritten Documents. Proceedings of the Tenth
International Workshop on Frontiers in Handwriting Recognition, 515–520 (2006)
11. Tsuruoka, S., Adachi, Y., and Yoshikawa, T.: Segmentation of a Text-Line for a Handwrit-
ten Unconstrained Document Using Thinning Algorithm, Proceedings of the 7th International
Workshop on Frontiers in Handwriting Recognition:505–510, (2000)
12. Luthy, F., Varga, T., and Bunke, H.,: Using Hidden Markov Models as a Tool for Handwritten
Text Line Segmentation. Ninth International Conference on Document Analysis and Recogni-
tion. 9, 630–632 (2007)
13. Li, Y., Zheng, Y.: Script-Independent Text Line Segmentation in Freestyle Handwritten Doc-
uments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(8), 1313–1329
(2008)
14. Yin, F., Liu, C: A Variational Bayes Method for Handwritten Text Line Segmentation. Inter-
national Conference on Document Analysis and Recognition. 10, 436–440 (2009)
15. Brodic, D., and Milivojevic, Z.: Text Line Segmentation by Adapted Water Flow Algorithm.
Symposium on Neural Network Applications in Electrical Engineering. 10, 225–229 (2010)
16. Dinh, T. N., Park, J., Lee, G.: Voting Based Text Line Segmentation in Handwritten Docu-
ment Images. International Conference on Computer and Information Technology. 10, 529–
535 (2010)
17. Biswas, B., Bhattacharya, U., and Chaudhuri, B.B.: A Global-to-Local Approach to Binariza-
tion of Degraded Document Images. 22nd International Conference on Pattern Recognition,
3008–3013 (2014)
Palmprint Recognition Based on Minutiae
Quadruplets
1 Introduction
Due to the growing demand of human identification for many ID services, biomet-
rics has become a more attractive research area. Fingerprint recognition systems are
convenient and accurate. Palmprints can be considered as a variant of finger-
prints which shares a similar feature extraction and matching methodology. The palm
consists of friction ridges and flexion creases as its main features. The flexion creases
are formed due to folding of the palm. The palmprint has three regions,
namely hypothenar, thenar and interdigital (see Fig. 1).
In many instances, examination of hand prints like fingerprints and palmprints was
the method of differentiating illiterate people from one another, as they were not able to
write. The first known automated palmprint identification system (APIS) [1] devel-
oped to support palmprint identification was built by a Hungarian company. There
are mainly two different approaches for matching high resolution palm-
prints, namely minutiae based [2] and ridge feature based [3]. Minutiae based palmprint
matching methods find the number of minutiae matches between the input palmprint
(probe) and the enrolled palmprint (gallery). This is the most popular and widely
used approach. In ridge feature-based palmprint matching, features of the palm-
print ridge pattern like local ridge orientation, frequency and shape are extracted
for comparison. These features may be more reliable than minutiae features for
comparison in low-quality palmprint images. The output of a palmprint match-
ing algorithm is correct when there are genuine matches (true accepts) and genuine
rejects (true non-matches), and wrong when there are impostor matches
(false accepts) and impostor non-matches (false rejects).
2 Related Work
3 Feature Extraction
Fig. 2 Various stages of palmprint feature extraction: a original image, b smoothed image, c bina-
rized image, d quality map of image, e thinned image, f minutiae interpolation
2. Binarization: In this step, the image is converted to complete black and white
pixels from gray scale.
3. Thinning: The ridges of the binarized image are converted to one pixel thickness,
which is useful for extracting minutiae.
4. Minutiae Extraction: Minutiae extraction is done on the thinned image. While tra-
versing the thinned image pixel by pixel, a ridge pixel with one neighbor is an
end point and one with three neighbors is a bifurcation point.
5. Spurious Minutiae Removal: This is the final stage of feature extraction, where the
spurious minutiae due to ridge cuts, borders, bridges and lakes are removed.
Figure 2 shows the various phases involved in feature extraction and their cor-
responding palmprint output images.
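A minimal NumPy sketch of the neighbour-count rule used in the minutiae extraction step is given below; the 8-neighbourhood convention and the ridge-pixel value of 1 are assumptions:

```python
import numpy as np

def extract_minutiae(thinned):
    """Sketch of minutiae detection on a thinned ridge image (ridge pixels = 1).

    A ridge pixel with exactly one 8-neighbour is treated as an end point and one
    with three 8-neighbours as a bifurcation, following the rule described above.
    """
    ends, bifurcations = [], []
    rows, cols = thinned.shape
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            if thinned[y, x] != 1:
                continue
            neighbours = thinned[y - 1:y + 2, x - 1:x + 2].sum() - 1  # exclude the pixel itself
            if neighbours == 1:
                ends.append((x, y))
            elif neighbours == 3:
                bifurcations.append((x, y))
    return ends, bifurcations

# ends, bifs = extract_minutiae(thinned_palmprint)
```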
In this section, the proposed palmprint matching algorithm is explained. The quadru-
plet details are given first and then the k-nearest neighbor matching and global minu-
tia matching using quadruplets is described.
4.1 Quadruplets
Let A be the set of palmprint minutiae; the n-quadruplets can be computed as
follows: the k-nearest neighbors from the set A are computed for every m ∈ A in order
to find all n-quadruplets formed by m and three of its nearest minutiae, which is
tolerant to low image quality. Figures 3 and 4 illustrate the sample quadruplet represen-
tation of minutiae points. In Fig. 3, ab, bc, cd, ad, bd and ac are the Euclidean distances
between each pair of minutiae. Each minutia point has mainly three characteristics: x,
y and direction. Figure 4 illustrates the features of each minutiae pair used for matching: ab is the
Euclidean distance, a is the direction at minutia A and b is the direction at minutia B.
Fig. 3 Quadruplet
representation of minutiae
Fig. 4 Characteristics of
minutiae pair
This step finds the similar mates from the gallery template and the probe template using k-
nearest neighbor local minutiae matching techniques. G and P are the palmprint
minutiae feature vectors. The proposed minutiae-based method considers mainly
three features from each minutia $m = (x, y, \theta)$, where $(x, y)$ is the location and $\theta$ is the direc-
tion. Let $G = \{m_1, m_2, \ldots, m_m\}$, $m_i = (x_i, y_i, \theta_i)$, $i = 1, \ldots, m$, and $P = \{m_1, m_2, \ldots, m_n\}$,
$m_i = (x_i, y_i, \theta_i)$, $i = 1, \ldots, n$, where m and n denote the number of minutiae in the gallery and
probe templates respectively. Equations (1) and (2) denote the Euclidean distance
and angle between minutiae a and b, respectively.
Dist_{ab} = \sqrt{(X_a - X_b)^2 + (Y_a - Y_b)^2}   (1)

Dir_{ab} = \arctan \frac{Y_a - Y_b}{X_a - X_b}   (2)
\sum_{k=1}^{KNN} \sum_{l=1}^{KNN} \left(P_i^k, G_j^l\right)   (3)

such that  Dist_{P_i}^{k} - Dist_{G_j}^{l} < DistThr  and  Dir_{P_i}^{k} - Dir_{G_j}^{l} < DirThr.
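A minimal NumPy sketch of Eqs. (1)-(3) is given below; the use of arctan2 (for numerical robustness) and the exact form of the neighbour-pair counting are assumptions, and the thresholds follow the values reported later (DistThr = 12, direction threshold = 30):

```python
import numpy as np

def dist_dir(a, b):
    """Euclidean distance and direction (Eqs. 1-2) between minutiae a, b = (x, y, theta)."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    return np.hypot(dx, dy), np.degrees(np.arctan2(dy, dx))

def knn_structure(minutiae, idx, k=6):
    """Distances and directions from minutia `idx` to its k nearest neighbours."""
    m = np.asarray(minutiae, dtype=float)
    d = np.hypot(m[:, 0] - m[idx, 0], m[:, 1] - m[idx, 1])
    nn = np.argsort(d)[1:k + 1]                      # skip the minutia itself
    return [dist_dir(m[idx], m[j]) for j in nn]

def local_match(probe_struct, gallery_struct, dist_thr=12, dir_thr=30):
    """Count neighbour pairs satisfying the distance/direction thresholds (Eq. 3 sketch)."""
    count = 0
    for dp, ap in probe_struct:
        for dg, ag in gallery_struct:
            if abs(dp - dg) < dist_thr and abs(ap - ag) < dir_thr:
                count += 1
    return count

# score = local_match(knn_structure(P, i), knn_structure(G, j))
```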
This step considers the short-listed minutiae from the k-nearest neighbor stage. Each minu-
tiae pair is used as a reference pair for finding quadruplets. The following three conditions
should be considered to determine whether two minutiae in a quadruplet are matched,
in order to be tolerant to distortions and rotations. In order to qualify a
quadruplet as a mate, the four edges of each quadruplet should satisfy the following three
conditions:
1. The Euclidean distance between two minutiae < DistThr.
2. The difference between minutia directions < DirThr.
3. Minutiae relative direction with edge < RelThr.
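The three edge conditions can be sketched as below, interpreting them as differences between corresponding probe and gallery edge features; this interpretation and the helper names are assumptions:

```python
import numpy as np

def edge_features(a, b):
    """Features of a quadruplet edge between minutiae a, b = (x, y, theta): edge length,
    difference of the two minutia directions, and each minutia's direction relative to the edge."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    edge_dir = np.degrees(np.arctan2(dy, dx))
    return np.hypot(dx, dy), abs(a[2] - b[2]), abs(a[2] - edge_dir), abs(b[2] - edge_dir)

def edges_match(e_probe, e_gallery, dist_thr=12, dir_thr=30, rel_thr=30):
    """Two corresponding quadruplet edges match if all three conditions hold."""
    ok_len = abs(e_probe[0] - e_gallery[0]) < dist_thr           # condition 1: edge length
    ok_dir = abs(e_probe[1] - e_gallery[1]) < dir_thr            # condition 2: minutia directions
    ok_rel = (abs(e_probe[2] - e_gallery[2]) < rel_thr and       # condition 3: relative direction
              abs(e_probe[3] - e_gallery[3]) < rel_thr)
    return ok_len and ok_dir and ok_rel
```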
The experiments are conducted on the standard palmprint benchmark data sets FVC
ongoing competition test data [13] and the Tsinghua university data [12, 14, 15]. The FVC test data
consists of 10 people, 1 palm with 8 instances each. The Tsinghua university data set consists of
80 people, 2 palms with 8 instances each. These experiments were carried out on an Intel Core
i3 machine with 4 GB RAM and a 1.70 GHz processor. Figures 5 and 6 show the ROC
curves over the standard databases FVC Ongoing and Tsinghua THUPALMLAB
data sets.
Table 1 shows the databases with the number of persons and the genuine and impostor
attempts. Table 2 shows the standard databases, the number of nearest neighbors considered
and the Equal Error Rate (EER).
Table 1 Databases
Data set No of persons Instances Genuine Impostor
FVC ongoing 10 8 280 2880
THUPALMLAB 80 8 4480 20000
The accuracy of the algorithm is good when the number of nearest neighbors considered is 6
on the two databases. A DistThr of 12, DirDiff of 30
and RelDiff of 30 are used in all the experiments. Table 3 shows, for the standard
databases and numbers of nearest neighbors considered, the space and time taken for each verification.
The proposed algorithm achieved an EER of 0.12 % on the THUPALMLAB data set,
whereas the EERs of [11, 12] on the THUPALMLAB data set are 4.8 and 7 % respec-
tively.
Table 3 Space and times taken on FVC and THUPALMLAB data sets
# of NNs | FVC data: Space (Kb) | FVC data: Time (ms) | THUPALMLAB: Space (Kb) | THUPALMLAB: Time (ms)
5 | 32524 | 2089 | 32480 | 2535
6 | 32572 | 2347 | 32564 | 4017
7 | 32488 | 2388 | 32512 | 4435
8 | 32536 | 2711 | 32504 | 5154
6 Conclusion
The existing minutiae based matching algorithms have a few limitations which arise
mainly from segmentation. The accuracy of these algorithms is affected by the dif-
fering qualities of the palmprint regions. The proposed palmprint matching algorithm
uses a new representation of minutiae points based on quadruplets, and the matching
is done without segmenting the palmprint. The experiments have shown that the
proposed matching algorithm achieves very good accuracy on existing standard
data sets. The proposed algorithm achieved an EER of 3.87 % on the FVC palm test data
and an EER of 0.12 % on the THUPALMLAB data set.
Acknowledgements We are sincerely thankful to FVC and Tsinghua university for providing data
sets for research. The first author is thankful to Technobrain India Pvt Limited, for providing support
in his research.
References
1. FBI: https://www.fbi.gov/about-us/cjis/fingerprints_biometrics/biometric-center-of-excellence/
files/palm-print-recognition.pdf
2. Liu N, Yin Y, Zhang H: Fingerprint Matching Algorithm Based On Delaunay Triangulation
Net. In: Proc. of the 5th International Conference on Computer and information Technology,
591–595 (2005)
3. Jain A, Chen Y, Demirkus M: Pores and Ridges: Fingerprint Matching Using level 3 features.
In Proc. of 18th International Conference on Pattern Recognition (ICPR’06), 477–480 (2006)
4. Awate, I. and Dixit, B.A.: Palm Print Based Person Identification. In Proc. of Computing Com-
munication Control and Automation (ICCUBEA), 781–785 (2015)
5. Ito, K. and Sato, T. and Aoyama, S. and Sakai, S. and Yusa, S. and Aoki, T.: Palm region
extraction for contactless palmprint recognition. In Proc. of Biometrics (ICB), 334–340 (2015)
6. George, A. and Karthick, G. and Harikumar, R.: An Efficient System for Palm Print Recogni-
tion Using Ridges. In Proc. of Intelligent Computing Applications (ICICA), 249–253 (2014)
7. D. Zhang, W.K. Kong, J. You, and M. Wong: Online Palmprint Identification. IEEE Trans.
Pattern Analysis and Machine Intelligence 25(9), 1041–1050 (2003)
8. W. Li, D. Zhang, and Z. Xu: Palmprint Identification by Fourier Transform. Pattern Recogni-
tion and Artificial Intelligence 16(4), 417–432 (2002)
9. J. You, W. Li, and D. Zhang: Hierarchical Palmprint Identification via Multiple Feature Extrac-
tion. Pattern Recognition 35(4), 847–859 (2002)
10. N. Duta, A.K. Jain, and K. Mardia: Matching of Palmprints. Pattern Recognition Letters 23(4),
477–486 (2002)
11. A.K. Jain and J. Feng: Latent Palmprint Matching. IEEE Trans. Pattern Analysis and Machine
Intelligence 31(6), 1032–1047 (2009)
12. J. Dai and J. Zhou: Multifeature-Based High-Resolution Palmprint Recognition. IEEE Trans.
Pattern Analysis and Machine Intelligence 33(5), 945–957 (2011)
13. B. Dorizzi, R. Cappelli, M. Ferrara, D. Maio, D. Maltoni, N. Houmani, S. Garcia-Salicetti
and A. Mayoue: Fingerprint and On-Line Signature Verification Competitions at ICB 2009. In
Proc. of International Conference on Biometrics (ICB), 725–732 (2009)
14. THUPALMLAB palmprint database. http://ivg.au.tsinghua.edu.cn/index.php?n=Data.Tsinghua500ppi
15. Dai, Jifeng and Feng, Jianjiang and Zhou, Jie: Robust and efficient ridge-based palmprint
matching. IEEE Trans. Pattern Analysis and Machine Intelligence 34(8), 1618–1632 (2012)
Human Action Recognition for Depth
Cameras via Dynamic Frame Warping
1 Introduction
of low-cost depth sensors such as Microsoft Kinect, the advantages of depth sensors
(such as illumination and color invariance) have been realized to better understand
and address problems such as gesture recognition, action recognition, object
recognition etc.
In general, the problem of action recognition using depth video sequences involves
two significant questions. The first is about effective representation of RGBD data,
so as to extract useful information from RGBD videos of complex actions. The sec-
ond question concerns developing approaches to model and recognize the actions
represented by the suitable feature representation.
For the video representation, we use an existing approach of skeleton joints rep-
resentation, that of Eigen Joints [1]. The advantage of using this is that most of the
existing works rely on video level features, whereas with the Eigen Joints feature rep-
resentation we are able to work with frame level features, which provide more
information and flexibility.
Unlike the traditional RGB camera based approaches, the classification algorithm
for the depth stream should be robust enough to work without a huge amount
of training data, and handle large intra-class variations. In this respect, we explore
a recently proposed work on Dynamic frame warping framework for RGB based
action recognition [2], for the task of depth based action recognition. This framework
is an extension to Dynamic time warping framework to handle the large amount of
intra class variations which cannot be captured by normal Dynamic time warping
algorithm. Unlike in [2], we do not use RGB features, but the skeleton joint features
mentioned above.
To the best of our knowledge, such a dynamic frame warping framework on
depth data has not been attempted till now. Such an adaptation from the technique
proposed in [2], for action recognition in depth videos, is not obvious as depth data
for action recognition brings with it its own challenges.
For instance, skeleton features involved in RGBD sequences are often inaccurate
with respect to the joint positions and involve some amount of noise. Furthermore,
the complexity in the actions is further enhanced in the case of 3D action recognition
as it involves more information available for a single frame which is needed to be
captured by a good classifier. Also, some actions in depth representations are typi-
cally more similar to each other which makes the problem harder. With more subjects
performing same action in different environments in some different ways, it becomes
evidently important to come up with a more robust technique to deal with high
intra-class variations. Our experiments involve different subsets of data, which high-
light the above mentioned cases of complex actions and similar actions, and demon-
strate superior performance of the proposed approach over the state-of-the-art.
We also note that, in conjunction with frame-level features, such a framework has
another advantage over discriminative models like Support Vector Machines (SVM).
It can be further extended as a dynamic programming framework, which can work
for continuous action recognition rather than isolated action recognition. Continu-
ous action recognition involves unknown number of actions being performed with
unknown transition boundaries in a single video sequence. (While, in this work, we
With the advent of real-time depth cameras, and availability of depth video datasets,
there is now a considerable work on the problem of human action recognition from
RGBD images or from 3D positions (such as skeleton joints) on the human body.
Li et al. [3] developed a bag of words model using 3D points for the purpose
of human action recognition using RGBD data. They used a set of 3D points from
the human body to represent the posture information of human in each frame. The
evaluation of their approach on the benchmark MSR-Action3D dataset [3] shows
that it outperforms some state of the art methods. However, because the approach
involves a large amount of 3D data, it is computationally intensive.
Xia et al. [4] proposed a novel Histogram of 3D Joint Locations (HOJ3D) rep-
resentation. The authors use spherical coordinate system to represent each skeleton
and thus also achieve view-invariance, and employ Hidden Markov models (HMMs)
for classification.
In the work reported in [1], the authors proposed an Eigen Joints feature rep-
resentation, which involves pairwise differences of skeleton joints. Their skeleton
representation consists of static posture of the skeleton, motion property of the skele-
ton, and offset features with respect to neutral pose in each frame. They use a Naive
Bayes classifier to compute video to class distance. An important advantage with this
representation is that it involves frame level features, which not only capture tempo-
ral information better but also adapt to a continuous action recognition
framework. Moreover, these features are also simple and efficient in their computa-
tion.
The approaches reported in [5–7] also have been shown to perform well on the
MSR Action 3D dataset. However, these works use video level features instead of
frame level features as we use in our work. We reiterate that with frame level fea-
tures, this work can be extended to a continuous action recognition module, which
is difficult with video level features.
In [2], the dynamic frame warping (DFW) framework was proposed to solve the
problem of continuous action recognition using RGB videos. Like the traditional
DTW, this framework has the ability to align varying length temporal sequences.
Moreover, an important advantage of this approach over DTW is that it can better
capture intra-class variations, and as a result, is more robust.
Our proposed approach also uses the Eigen Joint feature representation, but in a
modified Dynamic time warping framework as proposed in [2]. The major advan-
tage of such a dynamic programming framework is that it can work with frame level
features, so it can arguably understand the temporal sequences of frames better than
a Naive Bayes nearest neighbour classifier such as in [1]. In addition, we demonstrate
that it is able to work without a large amount of training data required as in case of
HMM (such as in [4]), as also indicated in [2].
The rest of the paper is organized as follows. In Sect. 2 we explain our approach
in depth, describing the Eigen Joints representation technique and the DFW algo-
rithm. We present the experimental evaluations and their comparisons
in Sect. 3. We provide our conclusions in Sect. 4.
2 Proposed Approach
As mentioned earlier, we employ the Eigen Joints features [1] which are based on
the differences of skeleton joints. The overall Eigen Joints feature characterizes three
types of information in the frames of an action sequence, including static posture,
motion property, and overall dynamics.
The three dimensional coordinates of 20 joints can be generated using the human
skeleton estimation algorithm proposed in [8], for all frames: $X = \{x_1, x_2, \ldots, x_{20}\}$,
$X \in \Re^{3 \times 20}$. Based on the skeletal joint information, three types of pair-wise features
are computed.
Differences between skeleton joints for the current frame: These features capture
the posture of skeleton joints within a frame:
$f_{cc} = \{x_i - x_j \mid i, j = 1, 2, \ldots, 20;\ i \neq j\}$.
Skeleton joint differences between the current frame-c and its previous frame-p:
These features take into account the motion from the previous to the current frame:
$f_{cp} = \{x_i^c - x_j^p \mid x_i^c \in X_c;\ x_j^p \in X_p\}$.
Skeleton joint differences between frame-c and frame-i (the initial frame which con-
tains the neutral posture of the joints): These features capture the offset of an interme-
diate posture with respect to a neutral one:
$f_{ci} = \{x_i^c - x_j^i \mid x_i^c \in X_c;\ x_j^i \in X_i\}$.
The concatenation of the above mentioned feature channels forms the final feature
representation for each frame: $f_c = [\, f_{cc}, f_{cp}, f_{ci}\,]$. Feature rescaling is used to scale the
features to the range [−1, +1] to deal with the inconsistency in the coordinates. In
each frame, 20 joints are used, which results in a huge feature dimension, i.e. (190 +
400 + 400) × 3 = 2970, as these differences are along three coordinates; after feature
rescaling this gives us $f_{norm}$. Finally, PCA is applied over the feature vectors to reduce
redundancy and noise from $f_{norm}$, where we use the leading 128 eigenvectors to reduce
the dimensionality.
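A minimal NumPy/scikit-learn sketch of this per-frame feature computation is given below; the unordered-pair construction of f_cc (which yields the stated 190 pairs), the min-max rescaling to [−1, +1] and the use of scikit-learn for PCA are assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

def eigenjoints_frame(Xc, Xp, Xi):
    """Sketch of the per-frame EigenJoints feature [1] for 20 joints in R^3.

    Xc, Xp, Xi: 20x3 arrays of joint positions for the current, previous and
    initial (neutral) frames. Returns the concatenated f_cc, f_cp, f_ci vector.
    """
    iu = np.triu_indices(20, k=1)                          # 190 unordered joint pairs
    f_cc = (Xc[iu[0]] - Xc[iu[1]]).ravel()                 # static posture (190 x 3)
    f_cp = (Xc[:, None, :] - Xp[None, :, :]).reshape(-1)   # motion w.r.t. previous frame (400 x 3)
    f_ci = (Xc[:, None, :] - Xi[None, :, :]).reshape(-1)   # offset w.r.t. neutral pose (400 x 3)
    f = np.concatenate([f_cc, f_cp, f_ci])                 # 2970-dimensional raw feature
    # rescale to [-1, +1]; the exact normalisation scheme is an assumption
    return 2 * (f - f.min()) / (f.max() - f.min() + 1e-8) - 1

# Stack per-frame features of the training data and reduce to 128 dimensions with PCA:
# frames = np.stack([eigenjoints_frame(c, p, seq[0]) for c, p in zip(seq[1:], seq[:-1])])
# pca = PCA(n_components=128).fit(frames)
# reduced = pca.transform(frames)
```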
Such a feature representation on depth videos is much more robust (in terms of
invariances) than ordinary color based features on the RGB counterparts of such videos,
and also provides structural information in addition to spatio-temporal interest points.
Rabiner and Juang [9] and Mueller [10] proposed the Dynamic time warping (DTW)
framework to align two temporal sequences $P_{1:T_P}$ and $Q_{1:T_Q}$ of unequal lengths.
In this algorithm, the frame-to-frame assignments help to match the two temporal
sequences:

A(P, Q) = \{(l_1, l_1'), \ldots, (l_i, l_i'), \ldots, (l_{|A|}, l_{|A|}')\}   (1)
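A minimal NumPy sketch of the DTW alignment cost is given below; backtracking through the cumulative cost matrix would recover the frame-to-frame assignment A(P, Q) of Eq. (1):

```python
import numpy as np

def dtw(P, Q):
    """Align two sequences of frame features, P (T_P x d) and Q (T_Q x d), and
    return the cumulative DTW alignment cost."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    TP, TQ = len(P), len(Q)
    cost = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=2)   # pairwise frame distances
    D = np.full((TP + 1, TQ + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, TP + 1):
        for j in range(1, TQ + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[TP, TQ]

# d = dtw(features_seq1, features_seq2)   # e.g. sequences of 128-D EigenJoints features
```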
As a variant of the traditional DTW algorithm, the Dynamic frame warping (DFW)
framework was introduced in [2]. This concept of DFW involves two main compo-
nents: the action template, represented by $Y^l$, and the class template, represented by $\tilde{Y}^l$, for
each action class l. Here, the closest match to all the training samples of class l is
found, $X_{i^*}^l \in \{X_n^l\}_{n=1}^{N_l}$. The closest match for each class l is defined as the action tem-
plate of class l. Solving the minimization in (3) yields the index of the sequence which
is selected as the action template of each class:
i^* = \arg\min_i \sum_{j \neq i} \mathrm{DTW}\left(X_i^l, X_j^l\right)   (3)
Finally, denoting the action template of class l as $Y^l$, each training example $X_j^l$ is
aligned with $Y^l$ using DTW as in Eq. (3). This provides the class template:

\tilde{Y}^l = \left(\tilde{y}_1^l, \ldots, \tilde{y}_{t'}^l, \ldots, \tilde{y}_{T_{Y^l}}^l\right)   (4)
Each frame $\tilde{y}_{t'}^l$ of the class template is a metaframe, grouping the aligned train-
ing frames which are elements of that metaframe. As, typically, only a small number
of training frames within a metaframe would be similar to the test frame, a sparse
solution is computed for the weights of the linear combination $w_{t'}$, by solving
Eq. (5) from [11].
Finally, the frame-to-metaframe distance is expressed in the following Eq. (6),
reproduced from [2], which can be solved using [12]:

\tilde{d}\left(z_t, \tilde{y}_{t'}^l\right) = \min_w \left\| z_t - \frac{\tilde{y}_{t'}^l w}{\|\tilde{y}_{t'}^l w\|_2} \right\|_2^2 \quad \text{s.t.} \quad \sum_i w_i = 1.   (6)
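The action-template selection of Eq. (3) can be sketched as below, assuming a DTW distance function such as the earlier sketch:

```python
import numpy as np

def action_template(sequences, dtw_fn):
    """Pick, for one class, the training sequence with the smallest summed DTW
    distance to all other sequences of that class (sketch of Eq. 3).

    `sequences` is a list of (T_n x d) feature arrays for one class; `dtw_fn` is a
    DTW distance such as the sketch above. Returns the index of the action template.
    """
    n = len(sequences)
    totals = np.zeros(n)
    for i in range(n):
        totals[i] = sum(dtw_fn(sequences[i], sequences[j]) for j in range(n) if j != i)
    return int(np.argmin(totals))

# i_star = action_template(class_l_sequences, dtw)
# Y_l = class_l_sequences[i_star]          # action template of class l
# Aligning every training sequence to Y_l then yields the metaframes of Eq. (4).
```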
We reiterate that the existing work using the Dynamic frame warping framework for
human action recognition is only on RGB videos with color based features. Our
approach extends this work to depth sequences with arguably more robust features
which provide structural information apart from the spatio-temporal interest points.
As indicated above, we believe that such an exploration of the Dynamic frame warp-
ing framework with skeleton joint features is important from the aspects of complex
actions, large intra-class variability (actions performed by different subject or in dif-
ferent environments) and performance under noisy joint features, which are more
specific to depth videos.
We evaluate our work on the benchmark MSR-Action3D dataset [3] with cross subject
evaluation settings as used in previous works. We show detailed comparisons of our
work with the existing Eigen Joints approach [1], Bag of 3D Points [3] and HOJ3D
[4]. Figure 1 shows some of the frames of different actions from the MSR-Action3D
dataset.
We evaluate the proposed approach with the cross subject evaluation used in other works
to make comparisons more accurate. So, we use the action sequences of 5 subjects, i.e.
1, 3, 5, 7 and 9, for training each action and the other 5, i.e. 2, 4, 6, 8 and 10, for testing.
These tests are done separately on three subsets of the total 20 actions as listed in the
table depicted in Fig. 2.
Fig. 1 Frames depicting the MSR-Action3D dataset out of the 20 actions performed (Reproduced
from Li et al. [3])
These three subsets have been made exactly as they have been used in the previous
works to reduce computational complexity. Another important reason for this parti-
tioning is based on the fact that Subset 3 contains very complex but distinctive actions
in terms of human motion, whereas Subsets 1 and 2 contain simpler actions but with
quite similar human motion (i.e. with more overlap). More specifically, a complex
action (in Subset 3) is an action which is a combination of multiple actions (e.g.
bend and throw constitute pickup and throw) and also actions like jogging involving
periodic repetitive motion.
The recognition rates of the proposed approach for each subset of the MSR-Action3D
dataset are illustrated by means of three confusion matrices for the respective subsets in
Fig. 3. It is clearly visible that the proposed approach works very well on the complex
actions of Subset 3, as the joint motion is quite different in all the actions in this
subset. The recognition accuracy is relatively low in Subsets 1 and 2 as they mainly
comprise actions which have quite similar motions. Another reason for the somewhat
reduced recognition rate in Subsets 1 and 2 is that they contain sequences with noisier
skeleton information in training as well as testing sets.
Fig. 3 Confusion matrix of our proposed approach in different subsets under cross subject evalu-
ation settings. Each element of the matrix gives the recognition results
Table 1 Depicting recognition rates in %age, of our approach in comparison to the state of the art
techniques for all the subsets of MSR-Action3D dataset
3D Silhouettes [3] HOJ3D [4] EigenJoints [1] Ours
Subset 1 72.9 87.98 74.5 80.18
Subset 2 71.9 85.48 76.1 82.14
Subset 3 79.2 63.46 96.4 96.40
Having discussed our absolute performance, the relative comparisons with state-of-the-art approaches show very encouraging results, highlighting the efficacy of the proposed approach. The comparison with 3D Silhouettes, HOJ3D and EigenJoints under cross-subject evaluation for the respective subsets of the MSR-Action3D dataset is listed in Table 1. Clearly, for all subsets of data the proposed approach outperforms 3D Silhouettes and EigenJoints (except in Subset 3, where EigenJoints performs similarly to ours). This highlights that our frame-level features within the dynamic frame warping framework are able to handle inter- and intra-class variations better. For Subset 3, which consists of complex actions, our performance is much higher than that of the HMM-based classification of HOJ3D. Note that the HMM-based classification [4] also uses skeleton-joint-based features. Considering this, along with the difference in performance for Subset 3, it appears that the training data is not sufficient for the HMM-based approach in the case of complex actions. Thus, the result for Subset 3 clearly indicates that our approach can perform well even with fewer training samples.
Finally, Table 2 compares the overall performance of the proposed approach with various state-of-the-art human action recognition approaches, and clearly shows that, as a whole, the proposed approach gives better results than all the other existing approaches while using a simpler feature representation. This indicates that the proposed approach can better handle the trade-off between inter-class variation, intra-class variation and noisy features.
Table 2 Comparison of recognition rates (%) of our approach with some state-of-the-art techniques on the MSR-Action3D dataset under cross-subject evaluation (5 subjects for training and 5 subjects for testing)

Method               Accuracy (%)
DTW [10]             54
HMM [15]             63
3D Silhouettes [3]   74.67
HOJ3D [4]            78.97
HOG3D [14]           82.78
EigenJoints [1]      83.3
HDG [13]             83.70
Ours                 86.24
References
1. X. Yang, & Y. Tian. Effective 3d action recognition using eigenjoints. Journal of Visual Com-
munication and Image Representation, 25(1), 2014, pp. 2–11.
2. K. Kulkarni, G. Evangelidis, J. Cech, & R. Horaud. Continuous action recognition based on
sequence alignment. International Journal of Computer Vision, 112(1), 2015, pp. 90–114.
3. W. Li, Z. Zhang, & Z. Liu. Action recognition based on a bag of 3d points. IEEE Computer
Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2010),
2010, pp. 9–14.
4. L. Xia, C. C. Chen, & J. K. Aggarwal. View invariant human action recognition using his-
tograms of 3d joints. IEEE Computer Society Conference on Computer Vision and Pattern
Recognition Workshops (CVPRW 2012), 2012, pp. 20–27.
5. J. Wang, Z. Liu, Y. Wu, & J. Yuan. Mining actionlet ensemble for action recognition with depth
cameras. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2012), 2012.
pp. 1290–1297.
6. O. Oreifej, & Z. Liu. HON4D: Histogram of oriented 4d normals for activity recognition from
depth sequences. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2013),
2013. pp. 716–723.
7. C. Chen, K. Liu & N. Kehtarnavaz. Real-time human action recognition based on depth motion
maps. Journal of Real-Time Image Processing. 2013. pp.1–9.
8. J. Shotton, T. Sharp, A. Kipman, A. Fitzgibbon, M. Finocchio, A. Blake, & R. Moore. Real-
time human pose recognition in parts from single depth images. IEEE Conference on Computer
Vision and Pattern Recognition (CVPR 2011), 2011, pp. 116–124.
9. L. Rabiner, & B. H. Juang. Fundamentals of speech recognition. Salt Lake: Prentice hall 1993.
10. M. Mueller. Dynamic time warping. Information retrieval for music and motion, Berlin: Springer 2007, pp. 69–84.
11. S. S. Chen, D. L. Donoho, & M. A. Saunders. Atomic decomposition by basis pursuit. SIAM
journal on scientific computing, 20(1), 1998, pp. 33–61.
12. G. D. Evangelidis, & E. Z. Psarakis. Parametric image alignment using enhanced correlation
coefficient maximization. IEEE Transactions on Pattern Analysis and Machine Intelligence,
30(10), 2008, pp. 1858–1865.
13. H. Rahmani, A. Mahmood, D. Q. Huynh, & A. Mian. Real time human action recognition using
histograms of depth gradients and random decision forests. IEEE Winter Conference on Appli-
cations of Computer Vision (WACV 2014), 2014, pp. 626–633.
1 Introduction
Images can be securely shared using well-known techniques in steganography and cryptography [2, 7, 8]. Since raw image data are huge in size, in general they have to be compressed before sharing. A natural way to share images securely would be to compress them first, followed by scrambling or encrypting. On the other end, a user would need to decrypt and decode in order to retrieve the encoded image. Essentially, in the above scheme, image encoding and decoding is a two-step process. In this paper, we propose an image encoding scheme, similar to fractal encoding [3], which simultaneously achieves compression and encryption in a single step.
A few earlier works have attempted secure image sharing using fractal codes in two separate steps: compression and encryption. Shiguo Lian [5] encrypts some of the fractal parameters during fractal encoding to produce the encrypted and encoded data. In [6], Lock et al., modifying Li et al. [4], perform compression and symmetric-key encryption in a two-step process wherein the fractal codes of an image are multiplied with an equal-sized Mandelbrot image generated through an equation. Before the multiplication, the fractal code and Mandelbrot image matrices are permuted. The product matrix is transmitted along with a few other parameters. Decryption is done through inverse permutation and some matrix manipulations.
In contrast to the above approaches, we propose a single-step reference-based image encoding wherein an image is encoded using another "reference" image in such a way that decoding is possible only by a receiver or user having the same reference image. In other words, the reference image serves as a key for secure image sharing. To begin with, in Sect. 2 we give a brief overview of fractal encoding and decoding, followed by a description of our proposed reference-based approach, and highlight important differences between the two encoding methods. In Sect. 3 we describe how the PatchMatch algorithm [1] is used to reduce the encoding time. Experiments and results are provided in Sect. 4, along with some insights into the behaviour of PatchMatch in the encoder. Finally, Sect. 5 concludes the paper.
Since the proposed reference-based encoding is very similar to fractal encoding, for the sake of completeness we provide a brief description of fractal image encoding and decoding in what follows. For more elaborate details on the theoretical and implementation aspects, please see [3]. Let f be the image to be encoded. Fractal encoding is essentially an inverse problem in which an Iterated Function System (IFS) W is sought such that the fixed point of W is the image to be encoded, f. An approximate solution to this problem is to partition f into M disjoint sub-blocks ("range blocks") f_i such that f = ∪_{i=1}^{M} f_i; for each f_i, we find a block of twice the size ("domain block") elsewhere in the same image which best matches f_i after resizing, spatial transformation (isometry) and intensity modification (brightness and contrast). The mapping between f_i and its corresponding matching block is a contractive mapping w_i. This process of searching for self-similar blocks can be seen as having two copies of the original image: one is called the range image and the other the domain image. The range image contains the non-overlapping range blocks f_i, and the domain image contains overlapping domain blocks D_j of twice the size. The relationship between f_i and the best matching block D_j is represented through the contractive mapping w_i. It can be shown that W = ∪_{i=1}^{M} w_i is also a contractive mapping, and in practice the fixed point of W is very close to the original f, which allows the image f to be encoded as the parameters of w_i, i = 1, …, M. Unlike the encoding, fractal decoding is a much simpler process. Since the encoder is based on the concept of contractive mappings, starting from any initial image, repeated application of the contractive mappings w_i, i = 1, …, M, will converge to the same fixed point, which will be an
approximation of the encoded image. Image compression is achieved because the parameters of the mappings w_i require far less storage than the raw image pixels.
Unlike fractal encoding described above, where the same image is used as both the domain and range image, in our proposed scheme we use the reference image as the domain image and the original image as the range image. For encoding, each non-overlapping range block is matched against the overlapping domain blocks of equal size in the domain (reference) image, and the best matching block is found. With these two simple modifications, the proposed encoding method meets our objective. In other words, for decoding, the same domain image used for encoding is necessary as a key to recover the original (range) image. To keep the reference-based encoder simple, no spatial transformations are applied to the domain blocks while searching for a matching domain block. Given a range block f_i and a candidate domain block, we find only the optimal "contrast" and "brightness" which bring the domain block and range block "closer". Equations for computing the optimal values of contrast and brightness are given in [3], p. 21. Although the encoded data has compression properties similar to a fractal encoder, the encoding quality depends on finding suitable domain blocks in the reference image for the range blocks. Block diagrams comparing reference-based and fractal encoding are shown in Fig. 1, and the pseudo-code for the reference-based encoding is given in Algorithm 1.
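The optimal contrast and brightness referenced above amount to a least-squares fit of a scale s and offset o between a range block and a candidate domain block. Since Algorithm 1 is not reproduced here, the following Python sketch illustrates one way the reference-based encoder described above could be implemented; the function names, block size and exhaustive search are assumptions for illustration, not the authors' exact pseudo-code.

```python
import numpy as np

def scale_offset(domain, rng):
    """Least-squares contrast (s) and brightness (o) so that s*domain + o ~ rng."""
    d = domain.ravel().astype(np.float64)
    r = rng.ravel().astype(np.float64)
    n = d.size
    denom = n * np.dot(d, d) - d.sum() ** 2
    s = 0.0 if denom == 0 else (n * np.dot(d, r) - d.sum() * r.sum()) / denom
    o = (r.sum() - s * d.sum()) / n
    err = np.sum((s * d + o - r) ** 2)
    return s, o, err

def reference_encode(image, reference, block=4):
    """Encode each non-overlapping range block of `image` against equal-sized
    (overlapping) domain blocks of `reference`; one code per range block."""
    H, W = image.shape
    codes = []
    for y in range(0, H - block + 1, block):
        for x in range(0, W - block + 1, block):
            rng = image[y:y + block, x:x + block]
            best = None
            # Exhaustive search shown only for clarity; Sect. 3 replaces this
            # step with a PatchMatch-style search to reduce encoding time.
            for dy in range(reference.shape[0] - block + 1):
                for dx in range(reference.shape[1] - block + 1):
                    dom = reference[dy:dy + block, dx:dx + block]
                    s, o, err = scale_offset(dom, rng)
                    if best is None or err < best[-1]:
                        best = (y, x, dy, dx, s, o, err)
            codes.append(best[:-1])  # store positions, scale and offset only
    return codes
```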
Given the reference-based image codes, decoding involves a single step wherein the mappings w_i are applied to the reference image used during encoding. Note that, unlike fractal decoding, no iterative process is involved in our proposed method. In a fractal encoder, the domain block is twice the size of the range block, which is important to make the mapping contractive, because contractivity is a necessary requirement for the convergence of fractal decoding. The pseudo-code for the reference-based decoding is given in Algorithm 2.
Algorithm 2: Reference-based image decoding
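The body of Algorithm 2 is not reproduced above. As a hedged illustration of the single-pass decoding just described, a decoder consistent with the encoder sketch given earlier might look as follows (the function name and block size are assumptions):

```python
import numpy as np

def reference_decode(codes, reference, shape, block=4):
    """Single-pass decoding: apply each stored scale/offset to the matched
    domain block of the reference image; no iteration is required."""
    out = np.zeros(shape, dtype=np.float64)
    for (y, x, dy, dx, s, o) in codes:
        dom = reference[dy:dy + block, dx:dx + block].astype(np.float64)
        out[y:y + block, x:x + block] = s * dom + o
    return np.clip(out, 0, 255)
```

Decoding with any image other than the original reference yields a meaningless result, which is exactly the key-like behaviour illustrated later in Fig. 7.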
PatchMatch [1] is an iterative randomized algorithm for finding the best match between image patches across two different images A and B. Initially, for a patch centred at (x, y) in image A, a random matching patch (nearest neighbor) is assigned at an offset f(x, y), denoted 𝐯, in image B. Let D(𝐯) denote the error distance between the patches at (x, y) in A and at ((x, y) + 𝐯) in B. In every iteration, the algorithm refines the nearest neighbor for every patch of image A in two phases: (1) propagation and (2) random search.
Propagation Phase: Assuming that the offsets of the neighbors of (x, y) are likely to be the same, the nearest neighbor for a patch at (x, y) is improved by "propagating" the known offsets of the neighbors of (x, y), but only if the patch distance error D(𝐯) improves. Propagation uses different neighbors during odd and even iterations: the updated offset 𝐯 in odd iterations is 𝐯 = arg min{D(f(x, y)), D(f(x − 1, y)), D(f(x, y − 1))}, and in even iterations 𝐯 = arg min{D(f(x, y)), D(f(x + 1, y)), D(f(x, y + 1))}.
Random Search: If 𝐯0 is the updated offset after the propagation phase, further improvement is sought by constructing a window around 𝐯0 and searching through a sequence of offsets {𝐮i} at exponentially decreasing distances from 𝐯0. Let w be the maximum search distance around 𝐯0, and 𝐑i a sequence of uniformly distributed random points in [−1, 1] × [−1, 1]. The sequence of random offsets is given by 𝐮i = 𝐯0 + w α^i 𝐑i, where α = 1/2 and i = 1, 2, …. The number of random offsets searched is determined by w, with the condition that the last search radius w α^i is less than 1 pixel. In this phase, the updated offset is 𝐯 = arg min{D(𝐯0), D(𝐮1), D(𝐮2), …}; in other words, the offset is updated only if the patch distance error reduces.
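As a concrete illustration of the random-search schedule 𝐮i = 𝐯0 + w α^i 𝐑i described above, the following sketch enumerates candidate offsets until the search radius falls below one pixel; the patch distance function D is assumed to be supplied by the caller, and the function name is an assumption.

```python
import numpy as np

def random_search(v0, D, w, alpha=0.5, rng=None):
    """Refine offset v0 by testing offsets at exponentially decreasing radii."""
    rng = rng or np.random.default_rng()
    v0 = np.asarray(v0, dtype=np.float64)
    best_v, best_d = v0, D(v0)
    i = 1
    while w * alpha ** i >= 1.0:             # stop once the radius is below 1 pixel
        R = rng.uniform(-1.0, 1.0, size=2)   # random point in [-1, 1] x [-1, 1]
        u = v0 + w * alpha ** i * R          # candidate offset u_i
        d = D(u)
        if d < best_d:                       # keep only improving offsets
            best_v, best_d = u, d
        i += 1
    return best_v
```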
In order to reduce the search time for finding a matching domain block for a range
block, PatchMatch is modified in the following ways and incorporated in the refer-
ence encoder.
∙ As an initial step, every range block f_i in the original image (A) is randomly assigned a domain block in the domain image (reference, or B). An image code is generated for this range block, consisting of the domain block position, the range block position, the scale and the offset. In contrast to PatchMatch, where matching patches in image B are found for all (overlapping) patches in A, in our modification matching domain blocks need to be found only for the range blocks, which are non-overlapping.
∙ Block matching in PatchMatch involves computing the Euclidean distance between blocks and finding the block in B with the least Euclidean distance. We, however, change the distance error by finding two parameters for intensity modification, contrast (scale) and brightness (offset), which minimize the matching error. The expressions for the optimal brightness and contrast are the same as in the reference coder (and the fractal encoder [3]).
Apart from the above two differences, block matching is similar to PatchMatch, including the two phases of propagation and random search, which iteratively improve the matching domain block for a range block. The pseudo-code for the PatchMatch reference encoder is given in Algorithm 3.
The decoding process is the same as described in Sect. 2.1 wherein the original
image is recovered in a single step by applying the image codes on the same domain
image that was used for encoding.
Fig. 2 Reference-based image encoding: a range image, b reference (domain) image, c decoded
image
Fig. 3 PatchMatch: a original image, b decoded image, c matching distance of range blocks
[Fig. 4: plots of the number of updates and the proportion of updates versus iteration; caption not recovered]
Fig. 5 PatchMatch encoding of an image with many uniform regions: a original image, b decoded
image, c matching distance
Fig. 6 a No of updates versus iteration for image with uniform region, b random updates versus
iteration
Fig. 7 a Original image, b decoding using reference image, c decoding without the same reference
image
For the texture image in Fig. 7a, the decoded result using "lena" as the reference is shown in Fig. 7b. When an image with all pixels equal to 255 is used as the reference while decoding, the resulting output, shown in Fig. 7c, indicates that the decoder fails if the appropriate reference is not used.
The tables below summarize the key differences between three types of image encoders: (1) fractal, (2) reference image-based and (3) PatchMatch-based. Table 1 gives the major differences between the original PatchMatch algorithm and the reference-based encoder, while Table 2 highlights some of the key differences between fractal encoding and the PatchMatch encoder. Table 3 compares the performance of the three techniques in terms of PSNR and encoding time. Fractal encoding yields the best PSNR, but with the highest encoding time compared to the reference-based and PatchMatch encoders. Since the reference-based image encoder does not use spatial transforms while searching for matching blocks, its encoding time is lower than that of the fractal encoder. For the "lena" image (Fig. 8a), the decoded results from the three different encoders are shown in Fig. 8b–d.
Table 3 Performance comparison of encoding techniques on the "lena" image (256 × 256) with 4 × 4 range blocks

              Fractal   Reference   PatchMatch
Iterations    NA        NA          15
PSNR (dB)     41.75     38.59       34.90
Time (s)      2560      1350        963
5 Conclusions
In this paper we have proposed a method to share images in a secure way using a single-step encoding process which combines compression and encryption. Results show that a user is able to decode a meaningful image only when the "reference image" is available. The proposed work, though not identical to fractal encoding, adopts some of the key features of fractal encoding, a lossy compression technique. PatchMatch has been leveraged in order to speed up the encoding process; however, the results obtained are inferior to those of the reference-based encoder. One future direction to improve the PSNR could be selecting an "appropriate" reference image from an image set.
References
1. Connelly Barnes, Eli Shechtman, Adam Finkelstein, and Dan B Goldman. PatchMatch: A Ran-
domized Correspondence Algorithm for Structured Image Editing. pages 24:1–24:11. ACM,
2009.
2. Chin-Chen Chang, Min-Shian Hwang, and Tung-Shou Chen. A New Encryption Algorithm for
Image Cryptosystems. Journal of Systems and Software, 58(2):83–91, 2001.
3. Yuval Fisher. Fractal Image Compression: Theory and Application. Springer Verlag, 1995.
Reference Based Image Encoding 149
4. Xiaobo Li, Jason Knipe, and Howard Cheng. Image Compression and Encryption Using Tree
Structures. Pattern Recognition Letters, 18(11):1253–1259, 1997.
5. Shiguo Lian. Secure fractal image coding. CoRR, abs/0711.3500, 2007.
6. A. J. J. Lock, Chong Hooi Loh, S.H. Juhari, and A Samsudin. Compression-Encryption Based on
Fractal Geometric. In Computer Research and Development, 2010 Second International Con-
ference on, pages 213–217, May 2010.
7. Debasis Mazumdar, Apurba Das, and Sankar K Pal. MRF Based LSB Steganalysis: A New
Measure of Steganography Capacity. In Pattern Recognition and Machine Intelligence, volume
5909, pages 420–425. Springer Berlin Heidelberg, 2009.
8. Ren Rosenbaum and Heidrun Schumann. A Steganographic Framework for Reference Colour
Based Encoding and Cover Image Selection. In Information Security, volume 1975, pages
30–43. Springer Berlin Heidelberg, 2000.
Improving Face Detection in Blurred
Videos for Surveillance Applications
Abstract The performance of face detection systems drops drastically when blur is present in surveillance video. Motivated by this problem, the proposed method deblurs facial images to improve the detection of faces degraded by blur in scenarios such as banks and ATMs, where a sparse crowd is present. The prevalent Viola-Jones technique detects faces, but fails in the presence of blur. Hence, to overcome this, the target frame is first decomposed using the Discrete Wavelet Transform (DWT) into LL, LH, HL and HH bands. The LL band is processed using the Lucy-Richardson algorithm, which removes blur using a Point Spread Function (PSF). Then the enhanced, de-blurred frame without ripples is given to the Viola-Jones algorithm. It has been observed and validated experimentally that the detection rate of the Viola-Jones algorithm is improved by 47 %. Experimental results illustrate the effectiveness of the proposed algorithm.
1 Introduction
Most of the available surveillance cameras are of low resolution. Hence, face detection in surveillance systems is more challenging than normal face detection in photo images. The challenges in face detection often involve low resolution images, blurred images, occlusion of facial features by beards, moustaches and glasses, facial expressions such as surprise or crying, poses such as frontal and side views, illumination and poor lighting conditions as in video surveillance cameras, and limited image quality and size as in passport or visa control; complex backgrounds also make it extremely hard to detect faces. To detect and recognize a face in an intensity image, holistic methods can be used: Principal Component Analysis (PCA) [1], which recognizes a person by comparing the characteristics of the face to those of known individuals, FLDA, and Local Binary Patterns (LBP) [2], which summarize the local structures of images efficiently by comparing each pixel with its neighboring pixels. These methods work well under low resolution but fail when there is a large variation in pose and illumination. The alternative is the feature-based approach, which uses Gabor filters and the bunch graph method. These methods work well under ideal conditions, but their disadvantages are computational difficulty and the need for automatic detection.
Nowadays, face detection and improving its detection rate in surveillance video play a major role in recognizing the faces of individuals to identify culprits involved in crime scenes. The various approaches used to detect faces in surveillance video are based on the Crossed Face Detection Method, which instantly detects low resolution faces in still images or video frames [3], the Howel and Buxter method, Gabor wavelet analysis and, most commonly, the Viola-Jones detector [4]. Face recognition can be done using algorithms such as reverse rendering and the Exemplar algorithm. These methods for face detection and recognition work well under low resolution and cluttered backgrounds, but they require enhancement techniques.
2 Related Work
From the literature review, it is found that not much research has concentrated on improving the face detection rate in surveillance videos for face recognition to specifically authenticate a person. Many factors affect the efficacy and credibility of surveillance videos, such as blur, occlusion, masking, illumination and other environmental factors. This paper addresses scenarios where blur is a major concern. Though there is some ongoing research on the removal of noise and on overcoming illumination changes, not much is focused on blur removal. Hence, it is vital to develop a face detection algorithm that is robust to blur, which will help a smart surveillance system to recognize the person.
As per the survey related to the topic, face detection and recognition in blurred videos is extremely difficult, and detection rates can be inaccurate in the presence of blur. Some of the approaches used for de-blurring in face recognition and detection are as follows. In [5], a combination of image-formation models and differential geometric tools is used to recover the space spanned by the blurred versions of an image. The joint blind image restoration and recognition approach de-blurs and recognizes the face image jointly; it is based on sparse representation and addresses the challenging task of face recognition from low quality images in a blind setting [6].
3 Proposed Method
At present, most surveillance videos are rendered almost useless: the volumes of data captured, stored and time-stamped do not help to solve crimes, and the major problem is blurred video. Thus, the proposed method involves removing blur from the surveillance video. The video is first converted into the required number of frames containing the target person. The target frames are first fed as input to the most prevalent face detection algorithm in computer vision research, the Viola-Jones algorithm. Though it can be trained for general object class detection, face detection is its primary motivation. Then, the same frames are taken and a series of preprocessing techniques is applied. The blurred frame is transformed using the DWT. The foremost reason for using the DWT is that it captures both frequency and location information (location in time), whereas the Fourier transform provides only frequency information without temporal localization. The DWT is used to extract the LL band information from the target frame, which contains the smooth approximation content. The proposed method takes this LL-band information and performs de-blurring on that image. The deblurring algorithm chosen is Lucy-Richardson (L-R). The Richardson-Lucy algorithm, also known as Lucy-Richardson deconvolution, is an iterative procedure for recovering a latent image that has been blurred by a known PSF.
[Fig. 1 Block diagram of the proposed method: Input Video → DWT → Lucy-Richardson Algorithm → Inverse DWT → Viola-Jones Algorithm → Performance Measure]
3.1 Methodology
The Discrete Wavelet Transform (DWT) is the discrete variant of the wavelet transform. The image is processed by the DWT using an appropriate bank of filters, and the transformed image involves D levels based on a tree structure. Based on the criteria for extracting strings of image samples, two approaches are followed. The first approach generates a string by queuing the image lines and then executes the decomposition over D levels, after which D strings are generated by queuing the columns of the resulting sub-images, and the decomposition is again done on each string. A simplified version of the resulting decomposition, extended up to the third level, is shown in Fig. 2.
[Fig. 2 Tree-structured wavelet decomposition of the input into approximation and detail sub-images at levels L1 and L2]
Since nonlinear iterative methods often yield better results than linear methods, the Lucy-Richardson (L-R) algorithm, a nonlinear iterative restoration method, is chosen. The L-R algorithm arises from a maximum likelihood formulation in which the image is modelled with Poisson statistics. Maximizing the likelihood function of the model yields an equation that is satisfied when the following iteration converges:

f̂_{k+1}(x, y) = f̂_k(x, y) · [ h(−x, −y) ∗ ( g(x, y) / (h(x, y) ∗ f̂_k(x, y)) ) ]

where g is the degraded image, h is the PSF and ∗ denotes convolution. A good solution is obtained depending on the size and complexity of the PSF matrix; hence, it is difficult to specify the number of iterations in advance. With a small PSF matrix, the algorithm usually reaches a stable solution in a few steps, which makes the image smoother. Increasing the number of iterations increases the computational complexity, amplifies noise and also produces a ringing effect. Hence, a good quality restored image is obtained by determining the optimal number of iterations manually for every image. The optimal number is obtained by alternating one decomposition by rows and another by columns, iterating only on the low-pass sub-image according to the PSF size.
In the proposed method, the DWT of the degraded image is taken. The target frame is decomposed into four sub-bands by the DWT: three high frequency parts (HL, LH and HH) and one low frequency part (LL). The high frequency parts may contain fringe information, while the low frequency part carries most of the strength of the target frame and is more stable. Therefore, the Lucy-Richardson algorithm is applied to the LL sub-band. The steps involved in this process are as follows (a minimal code sketch of these steps is given after the list):
(1) A non-blurred image f(x, y) is chosen.
(2) Gaussian or motion blur is added to produce the blurred image.
(3) Gaussian noise is added to the blurred image to produce the degraded image.
(4) The degraded image is decomposed into four sub-bands LL, HL, LH and HH using the DWT.
(5) The L-R algorithm is applied to the LL sub-band to produce the restored low frequency band (LL modified by L-R).
(6) A threshold is applied to the remaining sub-bands (HL, LH and HH).
(7) Finally, the restored image is obtained by applying the inverse DWT to the restored low frequency band (LL modified by L-R) and the thresholded high frequency bands (HL, LH and HH).
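A minimal sketch of steps (4)–(7), assuming the PSF is known, is given below. PyWavelets is used for the DWT, and the Lucy-Richardson update is written out explicitly; the wavelet family, threshold rule and iteration count are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np
import pywt
from scipy.signal import fftconvolve

def richardson_lucy(blurred, psf, iterations=10):
    """Basic Lucy-Richardson deconvolution with a known PSF."""
    estimate = np.full_like(blurred, blurred.mean(), dtype=np.float64)
    psf_mirror = psf[::-1, ::-1]
    for _ in range(iterations):
        conv = fftconvolve(estimate, psf, mode='same') + 1e-12
        estimate *= fftconvolve(blurred / conv, psf_mirror, mode='same')
    return estimate

def deblur_frame(degraded, psf, wavelet='haar'):
    """Steps (4)-(7): DWT, deblur the LL band, threshold the detail bands,
    then reconstruct with the inverse DWT."""
    LL, (LH, HL, HH) = pywt.dwt2(degraded.astype(np.float64), wavelet)
    LL_restored = richardson_lucy(LL, psf, iterations=10)
    thr = np.median(np.abs(HH)) / 0.6745            # illustrative threshold choice
    LH, HL, HH = (pywt.threshold(b, thr, mode='soft') for b in (LH, HL, HH))
    return pywt.idwt2((LL_restored, (LH, HL, HH)), wavelet)
```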
The features used in the detection framework involve sums of the pixels of an image within rectangular areas. These features resemble the Haar basis functions, which have been employed in the field of image-based object detection. The value of a given feature is obtained by subtracting the total sum of the pixels inside the clear rectangles from the total sum of the pixels inside the shaded rectangles; in a feature, each rectangular area is always adjacent to at least one other rectangle. Consequently, any two-rectangle feature can be computed using six array references, any three-rectangle feature using eight, and any four-rectangle feature using just nine array references. The evaluation of the strong classifiers is not fast enough to run in real time. For this reason, the strong classifiers are arranged in a cascade in order of complexity, and each consecutive classifier is trained only on those samples which pass through the preceding classifiers. If at any step in the cascade a classifier rejects the sub-window under inspection, no further processing is performed and the search continues with the next sub-window. Hence, the cascade has a degenerate tree structure. In the case of faces, to obtain approximately a 0 % false negative rate and a 40 % false positive rate, the first classifier in the cascade, called the attentional operator, uses only two features. The effect of this single classifier is to halve the number of times the entire cascade is evaluated. Some Haar features are shown in Fig. 3.
A feature value is obtained from the Haar features; if this value is above a learned threshold, the corresponding area is detected as a face, and if it is below that level it is deemed a non-face region. Thus, to match the false positive rates typically achieved by other detectors, each individual classifier can get away with surprisingly poor performance.
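The "six array references" mentioned above come from evaluating rectangle sums on an integral image. The following sketch is an illustration of that idea rather than the exact detector implementation; the function names are assumptions.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row/column padded at the top-left."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of the h x w rectangle with top-left corner (y, x): four references."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, y, x, h, w):
    """Vertical two-rectangle feature: left (clear) half minus right (shaded) half.
    The two rectangles share an edge, so only six distinct corners are needed."""
    left = rect_sum(ii, y, x, h, w // 2)
    right = rect_sum(ii, y, x + w // 2, h, w - w // 2)
    return left - right
```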
Experiments are carried out on the benchmark UMIST [9] dataset and on a college surveillance video of 16 min duration, with a resolution of 811 × 508, a frame rate of 11 frames/s and 11339 frames. The camera is mounted at an appropriate location on the wall to cover the intended area and the passing objects.
First, Viola-Jones face detection is applied on both datasets to test its performance. It is found that the detection accuracy decreases in the presence of blur. Hence, the frames are de-blurred and the same Viola-Jones algorithm is applied; after deblurring, the detection rate improves. Figure 4 shows the benchmark dataset, which is used as the default database. The Viola-Jones face detection algorithm is applied to an image to check its performance. Out of 78 faces in the dataset (50 profile faces and 28 frontal faces), Viola-Jones, the most popular algorithm, is able to detect only 31 frontal faces and 28 profile faces, a total of 45 faces (12 faces are detected both as frontal and profile faces). The accuracy rate is just 58 %.
The TCE college dataset is used to test the performance of Viola-Jones on a real-time surveillance video, in which there are two frontal, non-occluded faces which could have been detected by Viola-Jones. However, it fails to detect these faces, as shown in Fig. 5. The reason is the presence of low resolution and blur in the video, which is the case in most real-time applications. When the L-R method is applied to the frames, the Viola-Jones algorithm is then able to detect the non-occluded, de-blurred faces.
Also, Gaussian blur is manually introduced into the benchmark dataset to test the performance of Viola-Jones. Gaussian blur with a filter size of 20 and a standard deviation of 3.5 is applied, and the performance drops sharply: out of 78 faces in the dataset, only 30 faces are detected, as shown in Fig. 6 (5 faces detected as both frontal and profile faces), with a minimum accuracy of 38 %.
Fig. 5 a–d Face detection using Viola-Jones (college dataset, frame size 704 × 576); e–h face detection in the de-blurred frames (college dataset, frame size 704 × 576)
Fig. 6 Face detection in the blurred image (UMIST dataset, image size 1367 × 652)
Fig. 7 Face detection in the de-blurred image (benchmark UMIST dataset, image size 1367 × 652)
In this benchmark dataset, it is seen that after deblurring the accuracy has greatly increased and has nearly reached that of the ideal, non-blurred image. Out of 78 faces in the dataset, 30 frontal faces and 28 profile faces are detected, a total of 43 faces, as shown in Fig. 7 (10 faces detected as both frontal and profile faces). The accuracy percentage increases to 55 %. The hit rate or face detection rate for the above dataset is calculated as the number of correctly detected faces divided by the total number of faces, expressed as a percentage.
Performance on surveillance video has been analyzed and the results are shown
in Fig. 8.
Fig. 8 Performance comparison of the Viola-Jones method and the proposed method
5 Conclusion
Thus, the face detection accuracy in surveillance video can be greatly improved by preprocessing the frames, i.e. deblurring them, before applying the existing Viola-Jones detector. It has been shown that the proposed method increases the detection accuracy by over 47 % compared to the Viola-Jones algorithm alone. Future work includes building a fully automated face recognition system invariant to blur and illumination. Also, PSF estimation can be automated, thereby removing blur for all frames in a video. Other parameters such as noise and occlusion can also be taken into consideration so that a robust face detection and recognition system can be built for a smart surveillance system.
Acknowledgements This work has been supported under DST Fast Track Young Scientist
Scheme for the project entitled, Intelligent Video Surveillance System for Crowd Density Esti-
mation and Human Abnormal Analysis, with reference no. SR/FTP/ETA-49/2012. Also, it has
been supported by UGC under Major Research Project Scheme entitled, Intelligent Video
Surveillance System for Human Action Analysis with reference F.No.41-592/2012(SR).
References
1. Turk, Matthew, and Alex Pentland: Eigenfaces for recognition. In: Journal of cognitive
neuroscience, vol. 3, Issue 1, pp. 71–86. (1991).
2. Di Huang, Caifeng Shan; Ardabilian, M., Yunhong Wang: Local Binary Patterns and Its
application to Facial Image Analysis: A Survey. In: IEEE Transactions on Systems, Man, and
Cybernetics—Part C: Applications and Reviews, vol. 41, Issue 6, pp. 765–781. IEEE (2011).
3. Amr El, Maghraby Mahmoud, Abdalla Othman, Enany Mohamed, El Nahas, Y.: Hybrid Face
Detection System using Combination of Viola - Jones Method and Skin Detection. In:
International Journal of Computer Applications, vol. 71, Issue 6, pp. 15–22. IJCA Journal
(2013).
4. Yi-Qing Wang: An Analysis of the Viola-Jones Face Detection Algorithm. In Image
Processing On Line. vol. 2, pp. 1239–1009, (2013).
5. Raghuraman Gopalan, Sima Taheri, Pavan Turaga, Rama Chellappa: A blur robust descriptor
with applications to face recognition. In: IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 34, Issue 6, pp. 1220–1226, IEEE (2013).
6. Haichao Zhang, Jianchao Yang, Yanning Zhang, Nasser M. Nasrabadi and Thomas S. Huang.:
Close the Loop: Joint Blind Image Restoration and Recognition with Sparse Representation
Prior. In: ICCV Proceedings of IEEE international conference on computer vision., pp.
770–777, Barcelona, Spain (2011).
7. Swati Sharma, Shipra Sharma, Rajesh Mehra: Image Restoration using Modified Lucy
Richardson Algorithm in the Presence of Gaussian and Motion Blur. In: Advance in Electronic
and Electric Engineering. vol. 3, Issue 8, pp. 1063–1070, (2013)
8. Harry C. Andrews, Hunt, B.R.: Digital Image Restoration. Book Digital Image Restoration,
Prentice Hall Professional Technical Reference. (1977).
9. D. Graham. The UMIST Face Database, 2002. URL http://images.ee.umist.ac.uk/danny/
database.html. (URL accessed on December 10, 2002).
Support Vector Machine Based Extraction
of Crime Information in Human Brain Using
ERP Image
1 Introduction
Crime rates have increased considerably in the past 2–3 years all over the world. As per the National Crime Records Bureau, India [1], cognizable crime in India has increased steadily from 1953 till date. During 2012, a total of 6041559 cognizable crimes, comprising 2387188 penal code crimes and 3654371 special and local laws crimes, were reported.
This research work is funded by Centre on Advanced Systems Engineering, Indian Institute of
Technology, Patna.
The conviction rate was fairly low in 2012 in many states of India. A technique which can identify concealed crime information in the brain will be helpful to validate the authenticity of such information. Techniques available today target physiological parameters such as heart rate, electrodermal activity and blood oxygen saturation for the validation of crime information. Psychological parameters include emotion assessment, voice change during the test, and questionnaires [2]. Charlotte [3] used eye fixation as a parameter to detect concealed information in the brain, finding longer fixation durations on concealed information than on non-target pictures. Uday Jain [4] used facial thermal imaging for crime knowledge identification; the maximum classification rate achieved was 83.5 %. Brain wave frequency change was also explored for crime knowledge detection, with a maximum match of 79 %. Wavelet analysis was also used to decompose the EEG signal and analyze changes in frequency band activity for crime knowledge detection; the EEG signal was recorded during a question-answer session, and the beta frequency band showed the maximum variation [5]. Vahid et al. [6] used the ERP as a tool for crime knowledge identification; the features extracted were morphological, frequency and wavelet features, and linear discriminant analysis was used for classification. Farwell and Richardson conducted a study to detect the brain response to stimuli in four different fields: real life events, real crime with substantial consequences, knowledge unique to FBI agents and knowledge unique to explosive experts [7]. The paradigm for this work was designed based on the work of Farwell and Richardson. The present research is unique in addressing the problem of false eye witness identification; the feature extracted from the ERP image is unique, and the data collection method is a modified version of the Farwell and Richardson paradigm, adapted to the requirements of the problem. The research work presented in this paper focuses on a novel combination of existing techniques for classifying an ERP response as crime-related or not more efficiently (Fig. 1).
2 Data Collection
An eye witness claims to have information about a crime, and the authors have designed a test to verify the authenticity of the eye witness. In this research work, stimuli related and unrelated to the crime are shown to the participants. A false eye witness can identify the target (crime-unrelated) stimuli but will fail to recognize stimuli from that particular crime. A total of 10 participants aged 18–22 years participated voluntarily in the research. A crime topic not known to anyone was selected. The participants were divided into two groups: an information-present group, given information about the crime, and an information-absent group, having no information about the crime.
A Nexus 10 neuro-feedback system was used for data collection from the Cz and Pz electrode positions, following the 10–20 electrode placement system. Participants were asked to observe the stimuli and to indicate whether they recognized each stimulus or not by pressing the laptop right arrow key. A total of 3 sets of picture stimuli were shown: crime instruments, crime place and victim name. Each stimulus set consists of 20 images, 16 target and 4 probe stimuli. A target stimulus is a stimulus not from that particular crime but related to some other similar crime, and a probe stimulus is a stimulus from that particular crime. Five trials were performed per image set.
3 Proposed Approach
3.1 Pre-processing
The signal was recorded with a sampling frequency of 256 Hz. A fourth order Butterworth band pass filter with cut-off frequencies of 0.2–35 Hz was used to keep data up to the beta band and remove the rest. The EEG signal was segmented based on the markers and averaged to extract the ERP signal.
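A minimal sketch of this pre-processing stage is given below, using SciPy for the Butterworth band-pass filter; the marker handling and the epoch window are assumptions for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 256  # sampling frequency in Hz

def bandpass(eeg, low=0.2, high=35.0, order=4):
    """Fourth-order Butterworth band-pass filter, applied forward-backward."""
    b, a = butter(order, [low, high], btype='bandpass', fs=FS)
    return filtfilt(b, a, eeg)

def extract_erp(eeg, markers, pre=0.1, post=0.8):
    """Cut epochs around each stimulus marker (sample indices) and average them."""
    filtered = bandpass(eeg)
    n_pre, n_post = int(pre * FS), int(post * FS)
    epochs = [filtered[m - n_pre:m + n_post] for m in markers
              if m - n_pre >= 0 and m + n_post <= len(filtered)]
    return np.mean(epochs, axis=0)  # the averaged epoch is the ERP estimate
```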
Grey system theory can be applied to problems involving small samples and poor information. Similar situations arise many times in the real world, and so grey theory is well suited to the present problem.
Definition 1 Let x_i = (x_i(1), x_i(2), … , x_i(n)) be a behavioral time sequence and let D_1 be a sequence operator; then
x_i D_1 = (x_i(1) d_1, x_i(2) d_1, … , x_i(n) d_1),
where x_i(k) d_1 = x_i(k) / x_i(1), x_i(1) ≠ 0, k = 1, 2, … , n.
Then D_1 is called the initial value operator, and x_i D_1 is the image of x_i under D_1.
Theorem 1 Let x_o = (x_o(1), x_o(2), … , x_o(n)) be the reference sequence, x_i = (x_i(1), x_i(2), … , x_i(n)), i = 1, 2, … , n, the comparison sequences, and ξ ∈ (0, 1) the distinguishing coefficient. Define the point-wise grey incidence coefficient

γ(x_o(k), x_i(k)) = (min_i min_k |x_o(k) − x_i(k)| + ξ max_i max_k |x_o(k) − x_i(k)|) / (|x_o(k) − x_i(k)| + ξ max_i max_k |x_o(k) − x_i(k)|)   (1)

and the grey incidence degree

γ(x_o, x_i) = (1/n) Σ_{k=1}^{n} γ(x_o(k), x_i(k))   (2)
The wavelet decomposition was chosen according to the frequency bands of the EEG signal, and the signal was decomposed up to 6 levels. Noisy components were removed by setting a hard threshold, selected based on the grey incidence degree (GID) value of the approximation and detail coefficients [8]:

Threshold = σ · γ · √(2 · log l)   (5)
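A sketch of this denoising step using PyWavelets is shown below. The GID computation itself is not reproduced; here γ is passed in as a precomputed value, and the wavelet family and noise estimate are assumptions.

```python
import numpy as np
import pywt

def gid_wavelet_denoise(signal, gamma, wavelet='db4', level=6):
    """Hard-threshold the detail coefficients using Eq. (5):
    threshold = sigma * gamma * sqrt(2 * log(l)), with sigma estimated
    from the finest detail band (assumption) and l the signal length."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745   # robust noise estimate
    thr = sigma * gamma * np.sqrt(2.0 * np.log(len(signal)))
    denoised = [coeffs[0]] + [pywt.threshold(c, thr, mode='hard') for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[:len(signal)]
```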
Event related potentials vary in latency and amplitude across trials, so it is difficult to select a common window to differentiate the signals based on amplitude alone. Representing the ERP in image form can help to find the common activation region across trials. Let s be the total average trial set for each block, defined as s = {s1, s2, s3}. An ERP image was constructed for each block (crime instrument, crime place and victim name stimulus sets) by plotting time on the x axis and trial on the y axis for each block response [9], with color representing the amplitude variation of the signal. Figures 6, 7, 8 and 9 show ERP images of crime related and unrelated stimuli.
The structural similarity index algorithm assesses three terms between two images x and y: luminance l(x, y), contrast c(x, y) and structure s(x, y). Mathematically, these can be defined as

l(x, y) = (2 μ_x μ_y + c_1) / (μ_x² + μ_y² + c_1)   (6)

c(x, y) = (2 σ_x σ_y + c_2) / (σ_x² + σ_y² + c_2)   (7)

s(x, y) = (σ_xy + c_3) / (σ_x σ_y + c_3)   (8)

where c_1 = (k_1 l)², c_2 = (k_2 l)² and c_3 = c_2/2; μ_x and μ_y are the mean values of images x and y, σ_x and σ_y are the standard deviations of x and y, and σ_xy is the covariance between x and y. Here l is the dynamic range of the pixel values, and k_1 ≪ 1 and k_2 ≪ 1 are scalar constants. The constants c_1, c_2 and c_3 provide spatial masking properties and ensure stability when the denominators approach zero. Combining the three terms gives the structural similarity index SSIM(x, y) = l(x, y) · c(x, y) · s(x, y).
As there is a variation in the ERP response between probe and target stimuli, similarity comparison between the two responses is a good approach for differentiating persons with and without crime knowledge. In this paper, image similarity was calculated between the brain responses to the probe and target stimuli. A window of 200 ms was selected, and the similarity index was calculated for each 200 ms block. For comparison between the different groups, the 400–600 ms window was selected, as maximum variation is observed in that window.
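A sketch of this windowed comparison using scikit-image's structural_similarity is given below; the assumption that the ERP image columns correspond to time samples at 256 Hz, and that the image has at least 3 trial rows, is made for illustration and may differ from the authors' exact setup.

```python
import numpy as np
from skimage.metrics import structural_similarity

FS = 256  # samples per second along the time (column) axis

def windowed_ssim(probe_img, target_img, win_ms=200):
    """SSIM between probe and target ERP images over consecutive 200 ms windows."""
    win = int(win_ms / 1000 * FS)
    ws = min(7, probe_img.shape[0])
    ws -= (1 - ws % 2)                      # SSIM window size must be odd
    scores = []
    for start in range(0, probe_img.shape[1] - win + 1, win):
        p = probe_img[:, start:start + win].astype(float)
        t = target_img[:, start:start + win].astype(float)
        rng = max(p.max() - p.min(), t.max() - t.min(), 1e-12)
        scores.append(structural_similarity(p, t, data_range=rng, win_size=ws))
    return np.array(scores)   # e.g. the 400-600 ms windows are used as features
```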
ERP signal features were also extracted to compare their effectiveness with that of the image processing approach. The extracted features are power spectral entropy, energy, average ERP peak and magnitude squared coherence. Some of the feature equations are given below:

H = − Σ_{i=1}^{n} p_i ln(p_i)   (10)

coh(f) = |P_xy(f)|² / (P_xx(f) P_yy(f))   (11)

where P_xy is the cross spectral density of x(t) and y(t), and P_xx and P_yy are the auto spectral densities of x(t) and y(t), respectively.
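A sketch of how these two features could be computed with SciPy is given below; the use of Welch's method as the spectral estimator is an assumption, since the authors do not specify one.

```python
import numpy as np
from scipy.signal import welch, coherence

FS = 256

def power_spectral_entropy(x):
    """Eq. (10): entropy of the normalized power spectral density."""
    _, pxx = welch(x, fs=FS, nperseg=min(256, len(x)))
    p = pxx / pxx.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def magnitude_squared_coherence(x, y):
    """Eq. (11): magnitude squared coherence between two ERP signals,
    averaged over frequency to give a single scalar feature."""
    _, cxy = coherence(x, y, fs=FS, nperseg=min(256, len(x)))
    return float(np.mean(cxy))
```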
In this research work, a Support Vector Machine (SVM) classifier is used because this is a two-class problem and the SVM is one of the most efficient and widely used classifiers in neuroscience research. The idea of the SVM is to find an optimal hyperplane which separates the two classes. Support vectors are the points closest to the separating hyperplane. The margin between the two classes is found from the lines passing through the closest points [10, 11]; the margin length is 2/‖w‖, where w is the vector perpendicular to the margin.
When the classes cannot be linearly separated, an optimal hyperplane can be found either by allowing some error in the linear separation or by converting the data into a linearly separable set by transforming it to a higher dimension. The function used for converting the data into a higher dimension is called the kernel function. Here w is the vector perpendicular to the hyperplane and b is the bias; the equation wx + b = 0 represents the hyperplane, and wx + b = 1 and wx + b = −1 represent the margins of separation of the two classes.
L = (1/2) Σ_{n=1}^{N} Σ_{m=1}^{M} α_n α_m y_n y_m x_n^T x_m − Σ_{n=1}^{N} α_n, subject to α_n ≥ 0 and y^T α = 0   (12)

Here k(x_n, x_m) is the kernel matrix obtained after transforming the data into a higher dimension. The function above has to be minimized, under the conditions shown, to obtain the values of α. The values of w and b are then found from

w = Σ_{n=1}^{N} α_n k(x_n, x′) y_n   (13)

b = y_n − Σ_{α_n > 0} α_n y_n k(x_n, x′)   (14)

Here a kernel-based SVM classifier was used; a radial basis function (RBF) kernel was used to classify the data into persons with or without crime knowledge. The kernel is mathematically represented as

k(x_n, x_m) = exp(−‖x_n − x_m‖² / (2σ²))   (15)
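A sketch of this classification stage using scikit-learn's SVC with an RBF kernel is shown below; the feature matrix layout, the feature scaling step and the hyper-parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_crime_classifier(features, labels):
    """features: (n_samples, n_features) SSIM/ERP feature vectors;
    labels: 1 = information present, 0 = information absent."""
    clf = make_pipeline(StandardScaler(),
                        SVC(kernel='rbf', C=1.0, gamma='scale'))
    clf.fit(np.asarray(features), np.asarray(labels))
    return clf

# Usage: predictions = train_crime_classifier(X_train, y_train).predict(X_test)
```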
4 Results
Wavelet denoising using the grey incidence degree based threshold denoises the signal while keeping its original character intact, whereas universal threshold based wavelet denoising results in an over-smoothed signal. Since amplitude plays a critical role in ERP interpretation, GID based wavelet denoising is proposed in this research. Familiarity with a stimulus is indicated if the person shows higher P300 ERP activation to the probe stimulus. For exact assessment of the difference in the P300 ERP component, the probe and target ERP
Fig. 2 Grand average ERP of person with crime information for Cz electrode (dotted line repre-
sents ERP for stimulus not related to crime, solid line represents ERP for stimulus related to crime)
Fig. 3 Grand average ERP of person with crime information for Pz electrode (dotted line represents
ERP for stimulus not related to crime, solid line represents ERP for stimulus related to crime)
components were subtracted for persons both with and without crime knowledge. Figures 2, 3, 4 and 5 show the comparison between the subtracted probe and target responses for persons with and without crime knowledge; a clearer response can be seen at the Pz electrode. Figures 6, 7, 8 and 9 show the ERP images for probe and target stimuli for persons with and without crime information. It is observed that the ERP image of a person with crime information shows higher activation for the probe stimulus than for the target stimulus, whereas the ERP image of a person without crime information shows equal activation for both stimuli.
Fig. 4 Grand average ERP of person without crime information for Cz electrode (dotted line rep-
resents ERP for stimulus not related to crime, solid line represents ERP for stimulus related to
crime)
Fig. 5 Grand average ERP of person without crime information for Pz electrode (dotted line rep-
resents ERP for stimulus not related to crime, solid line represents ERP for stimulus related to
crime)
The research work presented is unique in its data collection method and uses a unique combination of existing techniques to analyze the ERP signal for detecting the validity of an eye witness. The grey incidence degree based wavelet denoising method is more effective than wavelet denoising with a universal threshold. Participants with crime knowledge had higher brain activation to crime related stimuli compared to participants without crime information. The data was classified into the information present
Fig. 6 ERP of person with crime information in image form for crime related stimulus (Pz elec-
trode) (x axis- time, y axis- number of trials. Three trials are ERP responses to crime instrument,
crime place and victim name, color represents amplitude variation)
Fig. 7 ERP of person with crime information in image form for stimulus not related to crime (Pz
electrode) (x axis- time, y axis- number of trials. Three trials are ERP responses to crime instrument,
crime place and victim name, color represents amplitude variation)
and information absent groups by extracting the structural similarity index of the ERP images, which resulted in a maximum accuracy of 87.50 %, a significantly high value for this type of research work (Table 1).
Fig. 8 ERP of person without crime information in image form for crime related stimulus (Pz
electrode) (x axis- time, y axis- number of trials-Three trials are ERP responses to crime instrument,
crime place and victim name, color represents amplitude variation)
Fig. 9 ERP of person without crime information in image form for stimulus not related to crime (Pz
electrode) (x axis- time, y axis- number of trials-Three trials are ERP responses to crime instrument,
crime place and victim name, color represents amplitude variation)
In this paper, a support vector machine classifier based crime information detection system is proposed. In preprocessing, grey incidence degree based wavelet denoising is used, which proved to be more efficient than wavelet denoising with a universal threshold. For feature extraction, the structural similarity of the ERP images was used, which resulted in the maximum accuracy. The results obtained are encouraging and can be extended further for criminal identification. In future, more rigorous research will be conducted with a larger number of subjects, and probability based classifiers such as the Hidden Markov Model [12, 13] will be tested. ERP visualization of real time brain activation can give practical insight into how a person responds to each stimulus.
References
1. United Nations, World crime trends and emerging issues and responses in the field of crime
prevention and criminal justice, Report, vol. 14, pp. 12–13, (2014)
2. Wang Zhiyu, Based on physiology parameters to design lie detector, International Conference
on Computer Application and System Modeling, vol. 8, pp. 634–637, (2010)
3. Schwedes, Charlotte, and Dirk Wentura, The revealing glance: Eye gaze behavior to concealed
information, Memory & cognition vol. 40.4, pp 642–651, (2012)
4. Rajoub, B.A., Zwiggelaar R., Thermal Facial Analysis for Deception Detection, IEEE Trans-
actions on Information Forensics and Security, vol. 9, pp. 1015–1023, (2014)
5. Merzagora, Anna Caterina, et al. Wavelet analysis for EEG feature extraction in deception
detection. Engineering in Medicine and Biology Society, (2006)
6. Abootalebi, Vahid, Mohammad Hassan Moradi, and Mohammad Ali Khalilzadeh. A new
approach for EEG feature extraction in P300-based lie detection. Computer methods and pro-
grams in biomedicine, vol. 94.1, pp 48–57, (2009)
7. Lawrence A Farwell, Drew C Richardson, and Graham M Richardson, Brain fingerprinting
field studies comparing p300-mermer and p300 brainwave responses in the detection of con-
cealed information, Cognitive neurodynamics, vol. 7, pp. 263–299, (2013)
8. Wei Wen-chang, Cai Jian-li, and Yang Jun-jie, A new wavelet threshold method based on the
grey incidence degree and its application, International Conference on Intelligent Networks
and Intelligent Systems, pp. 577–580, (2008)
9. Ming-Jun Chen and Alan C Bovik, Fast structural similarity index algorithm, Journal of Real-
Time Image Processing, vol. 6, no. 4, pp. 281–287, (2011)
10. A. Kumar and M.H. Kolekar, Machine learning approach for epileptic seizure detection using
wavelet analysis of EEG signals, International Conference on Medical Imaging, m-Health and
Emerging Communication Systems, pp. 412–416, (2014)
11. Maheshkumar H Kolekar, Deba Prasad Dash, A nonlinear feature based epileptic seizure detec-
tion using least square support vector machine classifier, IEEE Region 10 Conference, pp. 1–6,
(2015)
12. Maheshkumar H. Kolekar, S. Sengupta, Semantic Indexing of News Video Sequences: A Mul-
timodal Hierarchical Approach Based on Hidden Markov Model, IEEE Region 10 Conference,
pp. 1–6, (2005)
13. Maheshkumar H Kolekar and Somnath Sengupta. Bayesian Network-Based Customized High-
light Generation for Broadcast Soccer Videos., IEEE Transactions on Broadcasting, vol. 61,
no. 2, pp. 195–209, (2015)
View Invariant Motorcycle Detection
for Helmet Wear Analysis in Intelligent
Traffic Surveillance
Keywords Background subtraction ⋅ Histogram of Oriented Gradients (HOG) ⋅ Center-Symmetric Local Binary Pattern (CS-LBP) ⋅ K-Nearest Neighbor (KNN)
1 Introduction
Recently, detecting and classifying moving objects from video sequences has become an active research topic, with applications in a wide range of circumstances. Segmenting and classifying four types of moving objects, namely bicycles, motorcycles, pedestrians and cars, in a view-invariant manner in a video sequence is a challenging task. Objects can be detected both in motion and at rest, depending on the application. Despite its significance, classification of objects in wide-area surveillance videos is challenging for the following reasons. As the capability of conventional surveillance cameras [1] is limited, the Region of Interest (ROI) in videos may be of low resolution; as a result, the information supplied by these regions is very limited. Also, the intra-class variation for each category is very large: objects have diverse appearances and may vary significantly because of lighting, different view angles and environments. The potential for object classification in real-time applications is great, so its performance has to be improved; however, the above mentioned issues reduce the accuracy of object classification algorithms. Helmets are essential for motorcyclists' safety in deadly accidents. The inability of police forces in many countries to enforce helmet laws results in reduced usage of motorcycle helmets, which becomes the cause of head injuries in accidents. The goal of this work is to develop an integrated and automated system for identifying motorcycle riders who are not wearing a helmet.
2 Related Work
Motorcycles have always been a significant focus of traffic monitoring research based on computer vision. This requires some form of camera calibration, which can greatly affect the accuracy of a traffic monitoring system. Chiu and Ku et al. [2, 3] developed algorithms for detecting occluded motorcycles using the pixel ratio, visual width and visual length, based on the assumption that motorcycle riders always wear helmets. However, these studies do not focus on detecting helmets; the helmet is used only as a cue to identify a motorcycle. Among the studies focusing on helmet detection, Liu et al. [4] proposed a technique to find a full-face helmet using a circle fitted on a Canny edge image. Wen et al. [5, 6] introduced similar techniques that detect helmets based on the Circle Hough Transform; these techniques are used in surveillance systems in banks and at ATM machines. These algorithms are compatible with full-face helmets that have extractable circles and circular arcs. However, these papers do not consider different view angles.
3 Proposed Method
3.1 Methodology
The first step of the system is to detect and extract any moving object in the scene. This involves extracting shape features using the Histogram of Oriented Gradients. Using a K-Nearest Neighbor (KNN) classifier, the extracted object is classified as a motorcycle or another object. Subsequently, with the help of background subtraction, the foreground is extracted and the rider's head region is obtained; features are derived from it for further classification. Finally, a KNN classifier is used to classify whether the extracted head is wearing a helmet or not. The features used here are the circularity of the head region and the average intensity and hue of each head quadrant. Figure 1 gives an overview of the proposed method.
Vehicle classification is the process by which vehicles are detected in the frame and classified as objects of interest. Vehicle identification can be done based on different parameters such as shape, motion, color and texture. Color or texture based classification does not yield much information about the detected vehicle in traffic surveillance [7]; hence, shape based classification is used for vehicle detection. After the vehicles are detected in a video sequence, the next step is to identify whether each vehicle is a motorcycle or not.
[Fig. 1 Overview of the proposed method; if no helmet is detected, automatic license plate detection is used for sending a warning message]
Features such as the apparent aspect ratio of the blob bounding box and the image blob area are given as input features. Classification is performed for each blob at every frame, and the results are stored in the form of a histogram.
Feature extraction using HOG. The Histogram of Oriented Gradients (HOG) is a feature descriptor used in image processing and computer vision to capture the shapes of objects [8]. The HOG descriptor counts the occurrences of gradient orientations in localized portions of an image detection window. HOG features are used to detect objects such as humans and vehicles because they capture the overall shape of the object; for instance, in the visualization of the HOG features (Fig. 2), the outline of the motorcycle is prominent. The HOG is computed from the gradient magnitude and orientation:

$|G| = \sqrt{I_x^{2} + I_y^{2}}$   (1)

$\theta = \arctan\left(I_x / I_y\right)$   (2)

where $I_x$ and $I_y$ are the image gradients in the horizontal and vertical directions. In this paper, the HOG feature vector is given as input to the KNN classifier to classify each detected vehicle as a motorcycle or not.
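As an illustrative sketch only (not the authors' original implementation), the HOG-plus-KNN motorcycle classification step could be prototyped in Python with scikit-image and scikit-learn; the training patch and label arrays below are placeholders.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.neighbors import KNeighborsClassifier

def hog_features(gray_patch, size=(128, 64)):
    """Resize a detected blob to a fixed size and compute its HOG descriptor."""
    patch = resize(gray_patch, size, anti_aliasing=True)
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm='L2-Hys')

# train_patches / train_labels are assumed placeholders holding labelled blob
# images (label 1 = motorcycle, 0 = other object).
def train_vehicle_classifier(train_patches, train_labels, k=5):
    X = np.array([hog_features(p) for p in train_patches])
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X, train_labels)
    return clf

def is_motorcycle(clf, blob_patch):
    return clf.predict([hog_features(blob_patch)])[0] == 1
```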
Head Extraction
If the object detected in the previous step is a motorcycle, the following procedure is applied. Background subtraction is performed to obtain the foreground, which is the motorcycle itself, for further analysis. The region of interest, here the motorcyclist's head portion, is then detected. Following that, features extracted from the head region are used to classify whether it wears a helmet or not; again, a KNN classifier is used for this classification.
Background subtraction and morphological operation. A video sequence may be separated into background and foreground. If the background data is removed from a video frame, the remaining data is considered foreground and contains the objects of interest. Better accuracy can be achieved if the background is already known; for example, with stationary surveillance cameras, as in road traffic monitoring, the background is constant and the road remains in the same position with respect to the camera. Background subtraction detects moving objects with a simple algorithm by subtracting the current frame from the background frame. The extracted foreground image contains some unwanted information caused by shadows and illumination changes, so morphological operations (opening, closing) are performed to remove it. The closing operation enlarges the boundaries of foreground (bright) regions in the image and shrinks background-colored holes in such regions, while the opening operation removes foreground pixels that arise from illumination changes.
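A minimal sketch of this step with OpenCV is given below; the threshold value and kernel size are illustrative assumptions rather than the paper's exact settings.

```python
import cv2

def foreground_mask(frame_gray, background_gray, thresh=30, kernel_size=5):
    """Background subtraction followed by morphological opening and closing."""
    diff = cv2.absdiff(frame_gray, background_gray)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    # Opening removes small foreground specks caused by illumination changes.
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    # Closing fills small holes inside the detected foreground regions.
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask
```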
The head of a motorcyclist lies in the upper part of the motorcycle blob. Hence, the Region of Interest (ROI) for extracting the rider's head is the top 25 % of the height of the motorcycle blob. A total of four different features are derived from the four quadrants of the head region.
1. Feature 1: Arc circularity
2. Feature 2: Average intensity
3. Feature 3: Average hue
4. Feature 4: Texture feature extraction
Arc circularity. The similarity measure between an arc and a circle is given by

$c = \mu_r / \sigma_r$   (4)

where $\sigma_r$ and $\mu_r$ are the standard deviation and mean of the distance r from the head centroid to the head contour. This feature is extracted because a head wearing a helmet is more circular than a head without one, which is reflected in a high circularity of the head contour.
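The arc circularity of Eq. 4 could be computed from a binary head mask as in the following sketch (assuming OpenCV 4.x and a single dominant contour).

```python
import cv2
import numpy as np

def arc_circularity(head_mask):
    """Circularity c = mean/std of distances from the head centroid to its contour (Eq. 4)."""
    contours, _ = cv2.findContours(head_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea).reshape(-1, 2).astype(float)
    centroid = contour.mean(axis=0)
    r = np.linalg.norm(contour - centroid, axis=1)
    return r.mean() / r.std()
```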
Average intensity of Head portion. The average intensity of the head region is computed as

$\mu_I = \frac{1}{N}\sum_{i=0}^{N-1} I_i$   (5)

where $I_i$ is the intensity of the ith pixel and N is the number of pixels in the head region. This feature is employed since the top and back of a head without a helmet are mostly dark; this assumption is made according to the Indian scenario. The feature is normalized by the maximum gray scale intensity.
Average hue of Head portion. The average hue of the head region is another important feature, computed as

$\mu_H = \frac{1}{N}\sum_{i=0}^{N-1} H_i$   (6)

where $H_i$ is the hue of the ith pixel and N is the number of pixels in the head region. This feature is useful because a helmet covers a large portion of the rider's face and changes the average hue, whereas a rider without a helmet exhibits the typical average hue of skin color.
Texture feature extraction by Centre Symmetric Local Binary Pattern (CS-LBP). The Centre Symmetric Local Binary Pattern (CS-LBP) compares center-symmetric pairs of pixels, which reduces the number of comparisons for the same number of neighbors. For 8 neighbors, only 16 distinct binary patterns are produced by CS-LBP, so it is used as a texture feature for helmet detection even under different illumination changes [9].

$CS\text{-}LBP_{P,R}(c) = \sum_{i=0}^{(P/2)-1} s\!\left(g_i - g_{i+P/2}\right) 2^{i}$   (7)

where $g_i$ and $g_{i+P/2}$ are the gray values of the center-symmetric pairs of the P pixels equally spaced on a circle of radius R.
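A hedged NumPy sketch of CS-LBP for P = 8, R = 1 is shown below; the integer-offset neighbour sampling and the 16-bin histogram output are simplifying assumptions.

```python
import numpy as np

def cs_lbp_8(patch):
    """CS-LBP with P=8, R=1 (Eq. 7), using integer-offset neighbours as a simplification."""
    p = patch.astype(float)
    h, w = p.shape
    # Four centre-symmetric neighbour pairs (dy, dx) around each interior pixel.
    pairs = [((-1, -1), (1, 1)), ((-1, 0), (1, 0)), ((-1, 1), (1, -1)), ((0, 1), (0, -1))]
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    for bit, ((dy1, dx1), (dy2, dx2)) in enumerate(pairs):
        g1 = p[1 + dy1:h - 1 + dy1, 1 + dx1:w - 1 + dx1]
        g2 = p[1 + dy2:h - 1 + dy2, 1 + dx2:w - 1 + dx2]
        codes |= ((g1 - g2) > 0).astype(np.uint8) << bit
    # A 16-bin histogram of the codes is used as the texture feature.
    hist, _ = np.histogram(codes, bins=16, range=(0, 16))
    return hist / max(hist.sum(), 1)
```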
Classification using K-Nearest Neighbor classifier. The head is classified as "wearing a helmet" or "not wearing a helmet" based on the majority vote of its nearest neighbors, which are taken from head samples whose correct class is known and labeled. For helmet classification, the arc circularity, average hue, average intensity and CS-LBP texture features are computed and given as input to the KNN classifier. Finally, the classifier output is displayed as 'helmet detected' or 'no helmet'. This can be used to warn a motorcyclist who does not wear a helmet, or to provide the authorities with statistics on motorcyclists with and without helmets, and may thereby help to reduce fatal accidents.
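Putting the four features together, a possible KNN training routine is sketched below; the helper functions arc_circularity and cs_lbp_8 from the earlier sketches, as well as the training arrays, are assumed.

```python
import cv2
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def head_feature_vector(head_bgr, head_mask):
    """Circularity, average intensity, average hue and CS-LBP histogram of the head ROI."""
    gray = cv2.cvtColor(head_bgr, cv2.COLOR_BGR2GRAY)
    hsv = cv2.cvtColor(head_bgr, cv2.COLOR_BGR2HSV)
    feats = [arc_circularity(head_mask),
             gray[head_mask > 0].mean() / 255.0,          # Eq. 5, normalized intensity
             hsv[..., 0][head_mask > 0].mean() / 179.0]   # Eq. 6, OpenCV hue range 0-179
    return np.concatenate([feats, cs_lbp_8(gray)])

# heads / masks / labels (1 = helmet, 0 = no helmet) are placeholder training data.
def train_helmet_classifier(heads, masks, labels, k=5):
    X = np.array([head_feature_vector(h, m) for h, m in zip(heads, masks)])
    return KNeighborsClassifier(n_neighbors=k).fit(X, labels)
```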
The proposed system involves two tasks: motorcycle detection and helmet-wearing classification. The two tasks are tested separately, so the results of each test are independent and no error propagates from the previous algorithm. The KNN classifications were performed with approximately 100 training frames and 20 testing frames. The view angle and resolution of each experiment are listed in Table 1.
The sample frames of benchmark datasets are shown in Fig. 3.
The first step of the proposed method is to extract HOG features from the given input frame, as shown in Fig. 4. Subsequently, the extracted HOG feature vectors are given as input to the KNN classifier, which detects and classifies whether the object is a motorcycle or not (Fig. 5).
Fig. 3 The sample frames of benchmark datasets. a TCE dataset. b IIT dataset. c Mirpur dataset.
d IISC. e Bangalore dataset
Fig. 7 Morphological closing and opening for the TCE dataset
Fig. 9 Motorcyclist wears helmet or not classification using KNN classifier for TCE dataset
5 Conclusion
From the survey of various shape based feature extraction algorithms that are robust to different view angles, it is found that the HOG descriptor provides better results than algorithms such as SURF, SIFT, LBP and RIFT. The proposed method therefore uses the HOG descriptor with a KNN classifier for motorcycle classification under challenging conditions (different view angles) in an intelligent traffic surveillance system. After motorcycle detection, the head region is extracted and described with features such as arc circularity, average intensity, average hue and CS-LBP texture. Finally, these features are used to detect whether the motorcyclist wears a helmet, even under the illumination changes that are common in real-time scenarios. The proposed algorithm segments motorcycles on public roads, which is an important task, and can be used to motivate two-wheeler riders to wear helmets by raising awareness; it can also support accident statistics for riders with and without helmets, speed computation and vehicle tracking. Future work includes handling the average intensity feature for riders without helmets who have white hair or are bald, since the intensity varies in such cases, and extending the proposed method with automatic license plate detection to send a warning message when a motorcyclist without a helmet is detected.
References
1. Zhaoxiang Zhang, Yunhong Wanga: Automatic object classification using motion blob based
local feature fusion for traffic scene surveillance. In: International Journal of Advanced
Research in Computer Science and Software Engineering, vol. 6, Issue. 5, pp. 537–546. SP
Higher Education Press (2012).
2. C.C. Chiu, M.Y. Ku, and H.T. Chen: Motorcycle Detection and tracking system with
occlusion segmentation. In: WIAMIS’07 Proceedings of the Eight International Workshop on
Image Analysis for Multimedia Interactive Services, pp. 32. IEEE Computer Society
Washington, DC, USA (2007).
3. M.Y. Ku, C.C. Chin, H.T. Chen and S.H. Hong: Visual Motorcycle Detection and Tracking
Algorithm. In: WSEAS Trans. Electron, vol. 5, Issue. 4, pp. 121–131, IEEE (2008).
4. C.C. Liu, J.S. Liao, W.Y. Chen and J.H. Chen: The Full Motorcycle Helmet Detection
Scheme Using Canny Detection. In: 18th IPPR Conf, pp. 1104–1110. CVGIP (2005).
5. C.Y. Wen, S.H. Chiu, J.J. Liaw and C.P. Lu: The Safety Helmet Detection for ATM’s
Surveillance System via the Modified Hough transform. In: Security Technology, IEEE 37th
Annual International Carnahan Conference, pp. 364–369. IEEE (2003).
6. C.Y. Wen: The Safety Helmet Detection Technology and its Application to the Surveillance
System. In: Journal of Forensic Sciences, Vol. 49, Issue. 4, pp. 770–780. ASTM international,
USA (2004).
7. Damian Ellwart, Andrzej Czyzewski: Viewpoint independent shape-based object classification for video surveillance. In: 12th International Workshop on Image Analysis for Multimedia Interactive Services, TU Delft, Delft, The Netherlands (2011).
8. Chung-Wei Liang, Chia-Feng Juang: Moving object classification using local shape and HOG features in wavelet-transformed space with hierarchical SVM classifiers. In: Applied Soft Computing, vol. 28, Issue C, pp. 483–497. Elsevier Science Publishers B. V., Amsterdam, The Netherlands (2015).
9. Zhaoxiang Zhang, Kaiqi Huang, Yunhong Wang, Min Li: View independent object classification by exploring scene consistency information for traffic scene surveillance. In: Journal of Neurocomputing, Vol. 99, pp. 250–260. Elsevier Science Publishers B. V., Amsterdam, The Netherlands (2013).
10. C. Tangnoi, N. Bundon, V. Timtong, and R. Waranusast: A Motorcycle safety helmet
detection system using svm classifier. In: IET Intelligent Transport System, vol. 6, Issue. 3,
pp. 259–269. IET (2012).
Morphological Geodesic Active Contour
Based Automatic Aorta Segmentation in
Thoracic CT Images
1 Introduction
Cardiovascular diseases (CVDs) were the primary cause for death of around 788,000
people in 2010 in western countries [1]. Moreover, cardiovascular disease related
deaths in eastern countries are growing at an alarming rate [2]. Aortic abnormalities
such as calcification, dissection etc., are the most common cardiovascular diseases.
Thus, the detection and analysis of aorta is of medical importance. At present, the
available imaging modalities for manifestation of CVDs are lung computed tomogra-
phy, cardiac computed tomography, magnetic resonance (MR) etc. Aortic abberiva-
tions can be identified in the thoracic CT image which is the widely used non-invasive
imaging technique. The manual annotation and assessment of those CT images could
be tedious and inherently inaccurate even for highly trained professionals. To obviate
such difficulties, an automated aorta quantification system is of utmost importance
which requires accurate automatic aorta localization and segmentation.
Automated assessment of the aorta has been reported and evaluated on both contrast-enhanced and non-contrast-enhanced cardiac CT and MR images [3–10]. Multi-atlas based aorta segmentation for low-dose non-contrast CT has been proposed by Išgum et al. [7]; however, this method relies on multiple registrations of manually labelled images to obtain the final segmented output. Kurkure et al. [4] first proposed an automated technique to localize and segment the aorta from cardiac CT images using dynamic programming, and the authors of [4] later formulated an entropy based cost function in [5] for improved automatic segmentation of the aorta. Automatic aorta detection in non-contrast cardiac CT images using a Bayesian tracking algorithm has been proposed by Zheng et al. [9]. Kurugol et al. [10] first reported an aorta segmentation algorithm for thoracic CT images using a 3D level set approach. Xie et al. [8] reported an automated aorta segmentation method for low-dose thoracic CT images which makes use of pre-computed anatomy label maps (ALM); however, the ALM may not always be available with CT images.
Inspired by previous work towards the automation of aorta quantification, in this paper we propose an automated active contour based two-stage approach for aorta segmentation in CT images of the thorax. In the first stage, a suitable slice is chosen automatically to find the seed points for the active contour. It is experimentally found that in the slice in which the trachea bifurcates, the aorta (both ascending and descending) takes a nearly circular shape, so the slice containing the trachea bifurcation is detected using image processing and taken as the suitable slice to localize the aorta. Once the suitable slice is chosen, two seed points (the centers of the two circles with the lowest variances among the circles detected by the Circle Hough Transform, CHT) are selected automatically as the ascending and descending aorta. In the second stage, the aortic surface is determined by upward and downward segmentation of the ascending and descending aorta. This segmentation algorithm builds upon the morphological geodesic active contour [11, 12].
The key contribution of the proposed algorithm is a fully automatic method for locating and segmenting the aortic surface using the morphological geodesic active contour. The proposed algorithm can be seamlessly applied to contrast-enhanced as well as non-contrast-enhanced thoracic CT images. Unlike other methods [4, 5], the algorithm proposed in this paper does not need any prior knowledge of the span of thoracic CT slices to be processed; it automatically finds and segments the start and end of the aorta in thoracic CT images. Results produced by the proposed algorithm are compared with annotations prepared by experts for quantitative validation.
The rest of the paper is organized as follows: Sect. 2 presents the detailed descrip-
tion of the proposed technique. Section 3 provides quantitative and qualitative results
and finally Sect. 4 concludes the paper.
2 Methodology
In order to localize the ascending and descending aorta, a suitable slice of the axial 3D CT volume I needs to be determined in which the ascending and descending aorta are well defined in terms of their geometry. It has been found that, in human anatomy, in the axial slice in which the trachea bifurcates (the carina), the ascending and descending aorta are almost circular in shape. However, accurate detection of the trachea bifurcation is not needed: a margin of ±2 slices does not incur any loss in performance.
Fig. 1 The block diagram of the proposed automated aorta segmentation method (aorta localization: 3D CT stack, trachea bifurcation detection, circle detection, false circle reduction, aorta detection; followed by aorta segmentation to obtain the segmented aorta)
In order to detect the carina location, the CT image stack is first median filtered with a window of size 3 × 3. Then the stack is thresholded at −700 HU, which retains the lower intensity air-filled regions (including the surrounding air) in the CT image stack. A morphological binary erosion with a disk-shaped structuring element of radius 1 pixel is then applied to the thresholded CT stack I; the main purpose of this operation is to remove all the undesirable small regions present in the CT slices. In order to remove the surrounding air and detect the bifurcation, connected component analysis is performed on the eroded binary CT stack and the trachea region is extracted from the labelled connected components. Let the preprocessed CT stack be represented by $\hat{I} \in \mathbb{Z}_2^{\,m \times n \times p}$.
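A hedged sketch of this preprocessing chain with SciPy and scikit-image follows; it assumes the DICOM slices have already been converted to Hounsfield units.

```python
import numpy as np
from scipy.ndimage import median_filter
from skimage.morphology import binary_erosion, disk
from skimage.measure import label

def preprocess_ct_stack(ct_hu):
    """Median filter, -700 HU threshold and slice-wise erosion (preprocessing of Sect. 2)."""
    binary = np.zeros_like(ct_hu, dtype=bool)
    selem = disk(1)
    for z in range(ct_hu.shape[0]):           # process each axial slice
        sl = median_filter(ct_hu[z], size=3)  # 3x3 median filter
        sl = sl < -700                        # keep low-intensity (air-filled) regions
        binary[z] = binary_erosion(sl, selem) # remove small spurious regions
    return binary

def labelled_components(binary_slice):
    """Connected components of one preprocessed slice, used to isolate the trachea."""
    return label(binary_slice, connectivity=1)
```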
Once the preprocessing is done, the trachea needs to be located. Regions with an area between 100 and 400 pixels and a circularity ($\mathrm{Perimeter}^{2} / (4\pi \times \mathrm{Area})$) between 1 and 2 are considered to be the trachea in the preprocessed stack $\hat{I}$. Once the initial location of the trachea is detected, it is tracked via its centroid through the subsequent slices. This procedure is applied progressively until either the circularity becomes greater than 2 or the connectivity is lost.
Once the slice containing the carina region is detected, the Canny edge detector [13] is applied, followed by the CHT [14] for circle detection. Despite being a robust and powerful method, the CHT often suffers from the noise inherently present in CT images and consequently produces false object boundaries along with the true circular objects. To remove the false objects, we construct circles with radii of 8–12 pixels around the centers detected by the CHT and choose the two circles with the lowest variances, whose centers are taken as the seed points of the ascending and descending aorta. We then construct two circles of radius 12 pixels around these seed points, which act as the initial contours for the Morphological Geodesic Active Contour (MGAC) that segments the ascending and descending aorta in each slice of the CT stack.
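The circle detection and false circle reduction could be sketched with scikit-image as below; the radius range follows the text, while the Canny sigma and the number of candidate peaks are illustrative assumptions.

```python
import numpy as np
from skimage.feature import canny
from skimage.transform import hough_circle, hough_circle_peaks

def aorta_seed_points(carina_slice, n_candidates=10):
    """Detect candidate circles with the CHT and keep the two with lowest intensity variance."""
    edges = canny(carina_slice.astype(float), sigma=2.0)
    radii = np.arange(8, 13)                       # 8-12 pixel radii, as in the text
    accum = hough_circle(edges, radii)
    _, cx, cy, r = hough_circle_peaks(accum, radii, total_num_peaks=n_candidates)

    yy, xx = np.mgrid[0:carina_slice.shape[0], 0:carina_slice.shape[1]]
    variances = []
    for x, y, rad in zip(cx, cy, r):
        inside = (xx - x) ** 2 + (yy - y) ** 2 <= rad ** 2
        variances.append(carina_slice[inside].var())
    best = np.argsort(variances)[:2]               # two lowest-variance circles
    return [(cx[i], cy[i]) for i in best]
```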
Active contour based segmentation methods have been used in medical image processing research for many years. The geodesic active contour (GAC) is one of the most popular contour evolution methods [15, 16]. The GAC tries to separate the foreground (object) from the background with the help of the image intensity and gradient, and it solves a partial differential equation (PDE) to evolve the curve towards the object boundary. Let $u : \mathbb{R}^{+} \times \mathbb{R}^{2} \to \mathbb{R}$ be an implicit representation of the curve C such that
$C(t) = \{(x, y) \mid u(t, (x, y)) = 0\}$. The curve evolution equation of the GAC can be represented in implicit form as

$\frac{\partial u}{\partial t} = g(I)\,|\nabla u|\left(\nu + \mathrm{div}\!\left(\frac{\nabla u}{|\nabla u|}\right)\right) + \nabla g(I)\cdot\nabla u,$   (1)

where ν is the balloon force parameter, $\mathrm{div}(\nabla u/|\nabla u|)$ is the curvature of the curve, and the stopping function is defined as $g(I) = 1/\sqrt{1 + \alpha|\nabla G * I|}$, with α typically set to 0.15. The stopping function attains its minima at the boundary of the object, thus reducing the velocity of the curve evolution near the border.
The GAC contour evolution equation comprises three forces: (a) the balloon force, (b) the smoothing force and (c) the attraction force. However, solving PDEs involves computationally expensive numerical algorithms.
In this paper morphological operators are used to solve the PDE of the GAC, as proposed in [11, 12]. Let the contour at the nth iteration be represented by $u^{n}(x)$. The balloon force ($g(I)|\nabla u|\nu$) can be solved for the (n+1)th iteration using a threshold θ together with binary erosion ($E_h$) and dilation ($D_h$) operations as

$u^{n+\frac{1}{3}}(x) = \begin{cases} (D_h\, u^{n})(x), & \text{if } g(I)(x) > \theta \text{ and } \nu > 0,\\ (E_h\, u^{n})(x), & \text{if } g(I)(x) > \theta \text{ and } \nu < 0,\\ u^{n}(x), & \text{otherwise.} \end{cases}$   (2)
The attraction force ($\nabla g(I)\cdot\nabla u$) can be discretized intuitively, since its main purpose is to attract the curve C towards the edges:

$u^{n+\frac{2}{3}}(x) = \begin{cases} 1, & \text{if } \left(\nabla u^{n+\frac{1}{3}}\cdot\nabla g(I)\right)(x) > 0,\\ 0, & \text{if } \left(\nabla u^{n+\frac{1}{3}}\cdot\nabla g(I)\right)(x) < 0,\\ u^{n+\frac{1}{3}}(x), & \text{otherwise.} \end{cases}$   (3)

In order to solve the smoothing term ($g(I)|\nabla u|\,\mathrm{div}(\nabla u/|\nabla u|)$), Alvarez et al. [11, 12] defined two morphological operators, sup-inf ($SI_h$) and inf-sup ($IS_h$). In binary images, both the $SI_h$ and $IS_h$ operators look for small straight lines (3 pixels long) in four possible directions (see Fig. 2). If no straight line is found, the pixel is made inactive or active, respectively. The difference between $SI_h$ and $IS_h$ is that the first operates on active pixels (i.e. pixels having value 1) and the second on inactive pixels (i.e. pixels having value 0).
Fig. 2 The structuring elements B for the 2D discrete operator SIh ◦ISh
It can be proved that the mean curvature $\mathrm{div}(\nabla u/|\nabla u|)$ can be obtained using the composition of these operators ($SI_h \circ IS_h$). So, the smoothing force with smoothing constant μ can be written as

$u^{n+1}(x) = \begin{cases} \left((SI_h \circ IS_h)^{\mu}\, u^{n+\frac{2}{3}}\right)(x), & \text{if } g(I)(x) > \theta,\\ u^{n+\frac{2}{3}}(x), & \text{otherwise.} \end{cases}$   (4)
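scikit-image provides an implementation of morphological geodesic active contours; the per-slice segmentation could look like the following sketch, where the iteration count and the smoothing, balloon and threshold settings are illustrative rather than the paper's tuned values.

```python
import numpy as np
from skimage.segmentation import (morphological_geodesic_active_contour,
                                  inverse_gaussian_gradient)
from skimage.draw import disk as disk_coords

def segment_aorta_slice(ct_slice, seed_xy, init_radius=12, iterations=200):
    """Evolve an MGAC from a circular initial level set centred on the detected seed."""
    gimage = inverse_gaussian_gradient(ct_slice.astype(float))  # stopping function g(I)
    init_ls = np.zeros(ct_slice.shape, dtype=np.int8)
    rr, cc = disk_coords((seed_xy[1], seed_xy[0]), init_radius, shape=ct_slice.shape)
    init_ls[rr, cc] = 1                                         # circular initial contour
    return morphological_geodesic_active_contour(
        gimage, iterations, init_level_set=init_ls,
        smoothing=2, balloon=1, threshold='auto')
```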
The proposed algorithm was applied to 30 (26 contrast-enhanced and 4 non-contrast-enhanced) randomly selected cases taken from the widely used LIDC-IDRI public dataset [17]. On average, the dataset contains 187 slices per CT scan of 512 × 512 resolution, with a spacing of 0.5469–0.8828 mm in the x, y directions and 0.6250–2.5 mm in the z direction. To make the CT data isotropic in all directions (x, y, z), each CT stack was resampled before further processing, as suggested by [18].
The proposed methodology was evaluated by following the same protocol as described in [8]. Each of the thirty cases has 26 manually annotated images (5 images for the ascending aorta, 3 images for the aortic arch, 10 images for the descending aorta and 8 images covering all three parts). In total, the proposed methodology was evaluated on 780 images, compared to [8], which was evaluated on 630 images.
It was observed from the data that the images were acquired mainly using two types of CT machines: (a) GE LightSpeed Plus and (b) GE LightSpeed 16. For the first type, the value of θ in Eq. 2 was chosen as the 50th percentile of g(I) for upward ascending aorta segmentation and the 55th percentile of g(I) for the downward ascending aorta and the whole descending aorta (in both the upward and downward directions); the standard deviation, lower threshold and higher threshold of the Canny edge detector were 0.5, 100 and 400 respectively. For the second type, the 45th percentile of g(I) was used for upward segmentation and the 55th percentile of g(I) for downward segmentation of both the ascending and descending aorta as the value of parameter θ; the corresponding Canny parameters were 0.2, 200 and 400. The values of the smoothing parameter μ and the balloon force parameter ν were set to 2 and 1 respectively for all experiments.
Figure 3 shows the result of each steps involved in localization of ascending and
descending aorta which are marked in green and red respectively in Fig. 3c. Figure 4
shows 3D visualization of two correctly segmented aorta as well as one partially
segmented aorta where ascending aorta could not be segmented near the heart region.
The quality of the segmentation was evaluated in terms of the Dice Similarity Coefficient (DSC), defined as $\mathrm{DSC} = \frac{2|GT \cap S|}{|GT| + |S|}$, where GT and S represent the ground-truth image and the segmented image respectively, and |A| denotes the total number of active pixels in an image A.
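A one-function sketch of the DSC computation on binary masks:

```python
import numpy as np

def dice_coefficient(gt_mask, seg_mask):
    """DSC = 2|GT ∩ S| / (|GT| + |S|) for two binary masks."""
    gt = gt_mask.astype(bool)
    seg = seg_mask.astype(bool)
    denom = gt.sum() + seg.sum()
    return 2.0 * np.logical_and(gt, seg).sum() / denom if denom else 1.0
```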
Fig. 3 Aorta localization: a Trachea bifurcation detection, b Circle detection using CHT,
c Detected ascending and descending aorta (green and red point) after false circle reduction
Fig. 4 3D visualization of
accurately segmented (left
and middle) aorta and an
aorta (right) in which
segmentation stopped early
in heart region
Table 1 Quantitative evaluation of the proposed algorithm in terms of the average Dice similarity coefficient (DSC) for (a) whole aorta, (b) ascending aorta, (c) descending aorta and (d) aortic arch

Statistics      Whole aorta   Ascending aorta   Descending aorta   Aortic arch
Mean            0.8845        0.8926            0.9141             0.8327
Std. dev. σ     0.0584        0.0639            0.0223             0.1192
4 Conclusion
A novel fully automated aorta segmentation algorithm has been developed for analyzing the aorta in thoracic CT images. The algorithm proposed in this paper does not need any prior information regarding the span of the CT scan, and the aorta is localized and segmented without any user intervention. It employs the CHT on the slice in which the trachea bifurcates (the carina region) to localize circular regions; since the CHT generates many false positives, the two circles having the lowest variances are taken as the ascending and descending aorta.
The algorithm was tested on 30 randomly sampled cases from the LIDC-IDRI dataset. In some cases the proposed algorithm fails to stop segmenting the aorta near the heart region due to adjacent regions with similar intensities, and more work will be needed to address this issue. Future work should also involve testing the algorithm on a larger number of cases and releasing ground truths for benchmarking aorta segmentation.
References
1. National Heart Lung and Blood Institute, “Disease statistics,” in NHLBI Fact Book, Fiscal Year
2012. NHLBI, 2012, p. 35.
2. R Gupta, P Joshi, V Mohan, KS Reddy, and S Yusuf, “Epidemiology and causation of coronary
heart disease and stroke in India,” Heart, vol. 94, no. 1, pp. 16–26, 2008.
3. Shengjun Wang, Ling Fu, Yong Yue, Yan Kang, and Jiren Liu, “Fast and automatic segmen-
tation of ascending aorta in msct volume data,” in 2nd International Congress on Image and
Signal Processing, 2009. CISP’09. IEEE, 2009, pp. 1–5.
4. Uday Kurkure, Olga C Avila Montes, Ioannis Kakadiaris, et al., “Automated segmentation of
thoracic aorta in non-contrast ct images,” in 5th IEEE International Symposium on Biomedical
Imaging: From Nano to Macro, ISBI. IEEE, 2008, pp. 29–32.
5. Olga C Avila-Montes, Uday Kukure, and Ioannis A Kakadiaris, “Aorta segmentation in non-
contrast cardiac ct images using an entropy-based cost function,” in SPIE Medical Imaging.
International Society for Optics and Photonics, 2010, pp. 76233J–76233J.
6. Olga C Avila-Montes, Uday Kurkure, Ryo Nakazato, Daniel S Berman, Debabrata Dey, Ioannis
Kakadiaris, et al., “Segmentation of the thoracic aorta in noncontrast cardiac ct images,” IEEE
Journal of Biomedical and Health Informatics, vol. 17, no. 5, pp. 936–949, 2013.
7. Ivana Išgum, Marius Staring, Annemarieke Rutten, Mathias Prokop, Max Viergever, Bram van Ginneken, et al., "Multi-atlas-based segmentation with local decision fusion: application to cardiac and aortic segmentation in ct scans," IEEE Transactions on Medical Imaging, vol. 28, no. 7, pp. 1000–1010, 2009.
8. Yiting Xie, Jennifer Padgett, Alberto M Biancardi, and Anthony P Reeves, “Automated aorta
segmentation in low-dose chest ct images,” International journal of computer assisted radiol-
ogy and surgery, vol. 9, no. 2, pp. 211–219, 2014.
9. Mingna Zheng, J Jeffery Carr, and Yaorong Ge, “Automatic aorta detection in non-contrast 3d
cardiac ct images using bayesian tracking method,” in Medical Computer Vision. Large Data
in Medical Imaging, pp. 130–137. Springer, 2014.
10. Sila Kurugol, Raul San Jose Estepar, James Ross, and George R Washko, “Aorta segmentation
with a 3d level set approach and quantification of aortic calcifications in non-contrast chest
ct,” in Annual International Conference of the IEEE on Engineering in Medicine and Biology
Society (EMBC). IEEE, 2012, pp. 2343–2346.
11. L. Alvarez, L. Baumela, P. Henriquez, and P. Marquez-Neila, “Morphological snakes,” in IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), June 2010, pp. 2197–2202.
12. Pablo Marquez-Neila, Luis Baumela, and Luis Alvarez, “A morphological approach to
curvature-based evolution of curves and surfaces,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 36, no. 1, pp. 2–17, 2014.
13. John Canny, “A computational approach to edge detection,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, no. 6, pp. 679–698, 1986.
14. Dana H Ballard, “Generalizing the hough transform to detect arbitrary shapes,” Pattern recog-
nition, vol. 13, no. 2, pp. 111–122, 1981.
15. Vicent Caselles, Ron Kimmel, and Guillermo Sapiro, “Geodesic active contours,” in Fifth
International Conference on Computer Vision,. IEEE, 1995, pp. 694–699.
16. Vicent Caselles, Ron Kimmel, and Guillermo Sapiro, “Geodesic active contours,” Interna-
tional journal of computer vision, vol. 22, no. 1, pp. 61–79, 1997.
17. Samuel G Armato III, Geoffrey McLennan, Luc Bidaut, Michael F McNitt-Gray, Charles R
Meyer, Anthony P Reeves, Binsheng Zhao, Denise R Aberle, Claudia I Henschke, Eric A Hoff-
man, et al., “The lung image database consortium (lidc) and image database resource initiative
(idri): a completed reference database of lung nodules on ct scans,” Medical physics, vol. 38,
no. 2, pp. 915–931, 2011.
18. William J Kostis, Anthony P Reeves, David F Yankelevitz, Claudia Henschke, et al., “Three-
dimensional segmentation and growth-rate estimation of small pulmonary nodules in helical
ct images,” IEEE Transactions on Medical Imaging, vol. 22, no. 10, pp. 1259–1274, 2003.
19. ELCAP Public Lung Image Database, http://www.via.cornell.edu/lungdb.html
Surveillance Video Synopsis While
Preserving Object Motion Structure
and Interaction
Abstract With the rapid growth of surveillance cameras and sensors, the need for smart video analysis and monitoring systems for browsing and storing large amounts of data is steadily increasing. Traditional video analysis methods generate a summary of day-long videos, but maintaining the motion structure and the interaction between objects is of great concern to researchers. This paper presents an approach to produce a video synopsis while preserving motion structure and object interactions. While condensing the video, the appearance of each object in the spatial domain is maintained by considering a weight that preserves important activity portions and condenses data related to regular events. The approach is evaluated in terms of the condensation ratio while maintaining the interaction between objects. Experimental results over three video sequences show a high condensation rate of up to 11 %.
1 Introduction
These days an enormous number of cameras are installed around the world, and the information produced by these devices is too abundant for humans to manually extract the knowledge present in the videos. A lot of data mining effort is needed to process surveillance videos for browsing and retrieval of a specific event. Video synopsis condenses a video by simultaneously showing activities that happened at different times.
Fig. 1 Representation of different video synopsis approaches a Original video, b Synopsis video
with time shift only, c Synopsis with time as well as space shift, d Synopsis using proposed approach
preserving interaction between object 4 and 5
2 Related Work
Techniques of video motion analysis in the literature are mainly divided into two groups: static methods generate a short account of all activities in the original video in the form of an image, while dynamic methods produce the summary as a video based on the content of the original video.
In the static methods, each shot is represented by key frames, which are selected to generate a representative image. Examples of static image based summarization are the video mosaic, in which regions of interest extracted from video frames are joined together to form the result, and the video collage, in which a single image is generated by arranging regions of interest on a given canvas; storyboards and narratives are other basic forms of image based summarization. Static methods produce an abstract of the video in a small space, but they do not capture the temporal relations between notable events, and a summary in the form of a video is more appealing than watching static images.
An example of the dynamic methods is video synopsis. Video synopsis compresses the activities in a video in both the spatial and temporal dimensions and generates a compact video that is easier to browse quickly. Video synopsis has some limitations: it requires a large amount of memory to keep the background image of the scene along with the segmented moving object trajectories, and although it saves space in the final video, it loses the information about interactions that occur between objects when they come into proximity. Also, the length of the synopsis video determines the pleasing effect of the final result. Other examples of dynamic methods are video fast-forward, video skimming, space-time video montage and the video narrative, where selected frames are arranged to form an extremely condensed video.
The overall framework for generating a video synopsis using energy minimization is given by Pritch et al. [1]. Fu et al. [2] measure a sociological proximity distance to find interactions between objects. Lee et al. [3] proposed a method to generate a video synopsis by discovering important objects in egocentric video.
$P(X_t) = \sum_{i=1}^{K} \omega_{i,t}\; \eta\!\left(X_t, \mu_{i,t}, \Sigma_{i,t}\right)$   (1)

where K denotes the number of Gaussian distributions, taken as 3–5 depending on the available memory, and $\omega_{i,t}$, $\mu_{i,t}$ and $\Sigma_{i,t}$ represent the estimated weight, mean and covariance matrix assigned to the ith Gaussian in the model at time t, respectively. The parameter η is a Gaussian probability density defined as follows:

$\eta(X_t, \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}\; e^{-\frac{1}{2}(X_t - \mu_t)^{T}\Sigma^{-1}(X_t - \mu_t)}$   (2)
Temporal differencing (TD) segments moving objects in a video by taking the difference between corresponding pixels of two or more consecutive frames, as given in Eq. 3.
The primary consideration for this technique is how to determine a suitable threshold value. As both GMM and TD produce a mask of moving objects (i.e. a binary image), we simply perform a binary OR between the corresponding pixels of both mask images, as shown in Fig. 2:

$\mathrm{Mask} = \mathrm{GMM} \oplus \mathrm{TD}.$   (5)

The resulting mask image and the foreground segmented using background subtraction combined with temporal differencing are shown in Fig. 3.
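A hedged sketch combining OpenCV's MOG2 background subtractor (a per-pixel Gaussian mixture model) with simple temporal differencing, as in Eq. 5; the difference threshold and history length are assumptions.

```python
import cv2

# MOG2 is OpenCV's Gaussian-mixture background model (a few Gaussians per pixel).
mog = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
prev_gray = None

def combined_mask(frame_bgr, diff_thresh=25):
    """Binary OR of the GMM foreground mask and the temporal-difference mask (Eq. 5)."""
    global prev_gray
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gmm_mask = mog.apply(frame_bgr)
    if prev_gray is None:
        prev_gray = gray
    _, td_mask = cv2.threshold(cv2.absdiff(gray, prev_gray), diff_thresh, 255,
                               cv2.THRESH_BINARY)
    prev_gray = gray
    return cv2.bitwise_or(gmm_mask, td_mask)
```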
Although the above method of foreground segmentation is sufficient in most real-world situations, in some challenging situations it may assign pixels to the foreground that do not correspond to moving objects, such as tree leaves, the water surface, etc. An efficient system needs to eliminate these falsely detected pixels, also labeled as noise; this is done by median filtering the mask over a window w, where k and l define the size of the window w centred around the pixel (i, j). After segmentation, a label is assigned to each foreground region to manage its identity, and this label must be maintained as long as the object is present in the video. The assignment of a label to an object across the frame sequence is termed tracking. Tracking is used to separate the tubes of distinct moving objects and to generate their motion structure.
Due to missed detections or noise produced by the object detection phase, the system may generate fragmented trajectories, which makes object tracking a challenging task; Yilmaz et al. [14] give a complete description of object tracking. A detected region is assigned a label by finding the minimum distance between the centroids of the current detections and the centroid predicted by a Kalman filter [15], whose state model is given in Eq. 7:

$x_i = \Phi_i\, x_{i-1} + w_i$   (7)

where $x_i$ represents the location at the current step and $x_{i-1}$ that at the prior step, $\Phi_i$ is the state transition matrix relating the states of the previous and current time steps, and $w_i$ is a random variable representing the normally distributed process noise. Figure 4 shows the result of tracking a superpixel area across the frame sequence.
The motion structure of a moving object, also denoted as a tube in this paper, is represented by a three-tuple structure; in video analysis, the tubes of distinct moving objects are the primary processing elements. Each tube $A_i$ is the union of the bounding boxes $b_i$ belonging to object i from frame j to frame k, as given in Eq. 8:

$A_i = \bigcup_{f=j}^{k} T(i, f, b_i)$   (8)
where $B_k(i, j)$ are the pixels belonging to the kth background, $B_{k-1}(i, j)$ the pixels belonging to the (k−1)th background, and $I_{k-1}$ is the (k−1)th frame of the original video sequence.
The difference between spatially overlapping tubes is used here as an indication of interaction between objects in the original video. We find the interaction between tubes by measuring the difference between them, as given in Eq. 10:

$I_t(i, j) = \begin{cases} 0, & \text{if } d_t(i, j) > k,\\ k - d_t(i, j), & \text{otherwise} \end{cases}$   (10)

where $d_t(i, j) = T_i^t - T_j^t$ is the distance between tube i and tube j at time t, and the constant k is the minimum distance for interaction. Here k is taken as 5, which means that two tubes are considered to interact if they are within 5 pixels of each other. Tubes having $I_t(i, j)$ other than 0 are merged to form a tube set.
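A small sketch of the interaction test of Eq. 10, interpreting the tube difference d_t as the per-frame centroid distance between two tubes (an assumption, since the paper does not spell out the distance measure):

```python
import numpy as np

def interaction(tube_i_centroids, tube_j_centroids, k=5):
    """Eq. 10: per-frame interaction score between two tubes from their centroid distance."""
    scores = []
    for ci, cj in zip(tube_i_centroids, tube_j_centroids):
        d = float(np.linalg.norm(np.asarray(ci) - np.asarray(cj)))
        scores.append(0.0 if d > k else k - d)
    return scores

def should_merge(tube_i_centroids, tube_j_centroids, k=5):
    """Tubes with any non-zero interaction are merged into one tube set."""
    return any(s > 0 for s in interaction(tube_i_centroids, tube_j_centroids, k))
```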
Energy minimization [1, 2, 16] is a widely used and popular technique for generating video synopsis. The objective of energy minimization is to define a function that assigns a cost to every possible solution and to find the solution with the lowest cost. While shifting the input pixels to synopsis pixels with a time shift M, we formulate the energy function from the activity loss and occlusion costs as follows:

$E(M) = E_a(M) + \alpha\, E_o(M)$   (11)

where $E_a(M)$ and $E_o(M)$ represent the activity loss and the occlusion across the frames respectively, and α assigns a relative weight to the occlusion. The activity loss of an object is the difference between the pixels belonging to the object tubes in the input video and in the synopsis video. The occlusion cost represents the area that is shared by tubes in a frame of the synopsis video.
Activity cost: As the length of an object tube depends upon its appearance in the video, the tubes do not participate equally in the synopsis video. While calculating the activity loss, a weighted average of the pixels belonging to the input video and the synopsis video is considered, as given in Eq. 12:

$E_a(i) = \frac{\sum_{t=\mathrm{SFrame}_i}^{\mathrm{EFrame}_i} \left((x_i, y_i, t)_o - (x_i, y_i, t)_s\right)}{\mathrm{Length}_i}$   (12)

where $(x_i, y_i, t)_o$ and $(x_i, y_i, t)_s$ represent the superpixel region belonging to object i in the original video and in the synopsis video respectively.
Collision cost: While condensing the activity into an even shorter video, it is necessary for tubes to share some pixels. The collision cost is computed by counting the pixels belonging to an object that share space in consecutive frames of the synopsis video, as given in Eq. 13:

$C_i = \begin{cases} 1, & \text{if } (x_i, y_i, t)_s = (x_i, y_i, t+1)_s,\\ 0, & \text{otherwise} \end{cases}$   (13)
It also allows the user to define the number of pixels of the tubes that may overlap. The collision cost is normalized by the length of the object tube so that each object participates equally, as given in Eq. 14:

$E_o(i) = \frac{\sum_{j \in Q}\sum_{t=1}^{S} C_i}{\mathrm{Length}_i}$   (14)
A higher value of $E_o$ results in pixel overlap between two objects, which can affect the smoothness of object appearance, whereas a smaller value keeps objects well separated but generates a longer synopsis video. The procedure for summarizing the activity in the form of a synopsis is explained in Algorithm 1, and Table 1 explains the notations used in this paper.
Fig. 5 Resulting synopsis frame for testing videos a Video5, b Person, c Car
6 Experimental Evaluation
7 Conclusion
References
1. Rav, A., Alex, R., Pritch, Y., Peleg, S.: Making a Long Video Short: Dynamic video synopsis.
In: Computer Vision and Pattern Recognition, IEEE, pp. 435–441 (2006)
2. Fu, W., Wang, J.,Gui, L., Lu, H., Ma, S.: Online Video Synopsis of Structured Motion. In:
Neurocomputing, Vol. 135.5, pp. 155–162 (2014)
3. Lee, Y., Ghosh, J., Grauman, K.: Discovering Important People and Objects for Egocentric
Video Summarization. In: Computer Vision and Pattern Recognition (CVPR) pp. 1346–1353
(2012)
4. Liyuan, L., Huang, W., Irene, Y., Tian, Q.: Statistical Modeling of Complex Backgrounds for
Foreground Object Detection. In: IEEE Transaction on Image Processing, IEEE Vol. 13.11,
pp. 1459–1472 (2004)
5. Horn, Berthold, K., Schunck, Brian, G.: Determining Optical Flow. In: Artificial Intelligence,
17, pp. 185–203 (1981)
6. Suganyadevi, K., Malmurugan N., Sivakumar R.: Efficient Foreground Extraction Based On
Optical Flow And Smed for Road Traffic Analysis. In: International Journal Of Cyber-Security
And Digital Forensics. pp. 177–182 (2012)
7. Stauffer, C., Grimson, W.E.L.: Learning Patterns of Activity Using Real-Time Tracking. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, pp. 747–757 (2000)
8. Karasulu, B.: Review and Evaluation of Well-Known Methods for Moving Object Detection
and Tracking in Videos. In: Journal Of Aeronautics and Space Technologies, 4, pp 11–22
(2012)
9. Kim, K., Chalidabhongse, T.H., Harwood, D., Davis, L.: Real-Time Foreground-Background
Segmentation using Codebook Model. In: Real Time Imaging 11, Vol. 3, pp 172–185 (2005)
10. Badal, T., Nain, N., Ahmed, M., Sharma, V.: An Adaptive Codebook Model for Change Detec-
tion with Dynamic Background. In: 11th International Conference on Signal Image Technology
& Internet-Based Systems, pp. 110–116. IEEE Computer Society, Thailand (2015)
11. Badal, T., Nain, N., Ahmed, M.: Video partitioning by segmenting moving object trajectories.
In: Proc. SPIE 9445, Seventh International Conference on Machine Vision (ICMV 2014), vol
9445, SPIE, Milan, pp. 94451B–94451B-5 (2014).
12. Chen, W., Wang, K., Lan, J.: Moving Object Tracking Based on Background Subtraction Com-
bined Temporal Difference. In: International Conference on Emerging Trends in Computer and
Image Processing (ICETCIP’2011) Bangkok, pp 16–19 (2011)
13. Bastian, L., Leonardis, A., Schiele, B.: Robust Object Detection with Interleaved Catego-
rization and Segmentation. In: International Journal of Computer Vision (IJCV), Vol. 77,
pp. 259–289 (2008)
14. Yilmaz, A., Javed, O., Shah, M.: Object tracking: A survey. In: Acm computing surveys
(CSUR). ACM, Vol. 38 pp. 4–13 (2006)
15. Fu, Z., Han, Y.: Centroid Weighted Kalman Filter for Visual Object Tracking. In: Elsevier
Journal of Measurement, pp. 650–655 (2012)
16. Pritch, Y., Alex, R., Peleg, S.: Nonchronological Video Synopsis and Indexing. In: IEEE Trans-
action on Pattern Analysis and Machine Intelligence, Vol. 30, NO. 11, pp. 1971–1984 (2008)
17. Blunsden, S., Fisher, R.: The BEHAVE Video Dataset: Ground Truthed Video for Multi-Person
Behavior Classification. In: Annals of the BMVA, Vol. 4, pp. 1–12 (2010)
Face Expression Recognition Using
Histograms of Oriented Gradients
with Reduced Features
Nikunja Bihari Kar, Korra Sathya Babu and Sanjay Kumar Jena
Abstract Facial expression recognition has been an emerging research area over the last two decades. This paper proposes a new hybrid system for automatic facial expression recognition. The proposed method utilizes the histograms of oriented gradients (HOG) descriptor to extract features from expressive facial images. Feature reduction techniques, namely principal component analysis (PCA) and linear discriminant analysis (LDA), are applied to obtain the most important discriminant features. Finally, the discriminant features are fed to a back-propagation neural network (BPNN) classifier to determine the underlying emotion in each expressive facial image. The Extended Cohn-Kanade (CK+) dataset is used to validate the proposed method. Experimental results indicate that the proposed system provides better results than state-of-the-art methods in terms of accuracy, with a substantially smaller number of features.
1 Introduction
Facial expression recognition is one of the most powerful, natural, and immediate means for people to communicate their emotions, sentiments and intentions [1]. Facial expressions contain non-verbal communication cues, which help to identify the intended meaning of the spoken words in face-to-face communication.
2 Related Work
Geometric features represent facial components with a set of fiducial points. The drawback of this approach is that the fiducial points must be set manually, which involves a complex procedure; in this technique, the recognition accuracy increases with the number of facial feature points.
Tsai et al. [8] proposed a face emotion recognition system using shape and texture features. Haar-like features (HFs) and a self-quotient image (SQI) filter are used to detect the face area in the image, and the angular radial transform (ART), discrete cosine transform (DCT) and Gabor filters (GF) are used for feature extraction. Their model adopts ART features with 35 coefficients, SQI, Sobel and DCT with 64 features, and GF features with 40 texture change elements. An SVM classifier was employed to classify images into eight categories comprising seven face expressions and non-face. Chen et al. [9] proposed hybrid features that include facial feature point displacement and local texture displacement between the neutral and peak face expression images; the resultant feature vector contains 42-dimensional geometric features and 21-dimensional texture features, and a multiclass SVM was deployed to recognise seven facial expressions. Valstar and Pantic [10] located 20 facial fiducial points using a face detector based on a Gabor-feature-based boosted classifier. These fiducial points are tracked through a series of images using particle filtering with factorized likelihoods, and action unit (AU) recognition is performed with a combination of GentleBoost, SVM, and hidden Markov models.
Hsieh et al. [11] used six semantic features, which can be acquired using directional gradient operators like GFs and the Laplacian of Gaussian (LoG). An active shape model (ASM) is trained to detect the human face and calibrate facial components; Gabor and LoG edge detection are then used to extract the semantic features. Chen et al. [12] applied HOG to face components (eyes, eyebrows and mouth) to extract features: the HOG features extracted from each component are concatenated into a feature vector of size 5616. Gritti et al. [6] proposed HOG, LBP and LTP descriptors for facial representation; their experiments show that the HOG, LBP-Overlap and LTP-Overlap descriptors result in 18,954, 8437 and 16,874 features respectively. A linear SVM with a ten-fold cross-validation testing scheme was used in their recognition experiments.
The literature review reveals that local features like HOG, LBP, LTP and Gabor wavelets have been used to represent the face with quite a large number of features, which slows down the face expression recognition process. Thus, there is scope to reduce the feature vector size, which in turn minimizes the computational overhead.
3 Proposed Work
The proposed technique includes four principal steps: preprocessing of the facial images, feature extraction, feature length reduction, and classification. Figure 1 shows the block diagram of the proposed system. A detailed description of each block of the proposed system is given below.
Fig. 1 Block diagram of proposed system for classification of face expression images
3.1 Preprocessing
At first, the face images are converted to gray scale. Then the contrast of each image is adjusted so that 1 % of the pixel values are saturated at the low and high intensities. After the contrast adjustment, the face is detected using the popular Viola and Jones face detection algorithm [7]. The detected face region is cropped from the original image and resized to 128 × 128.
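A hedged sketch of the preprocessing stage using OpenCV's bundled Haar cascade (a Viola-Jones detector); the percentile-based contrast stretch is an assumed stand-in for the 1 % saturation adjustment.

```python
import cv2
import numpy as np

# OpenCV ships Haar cascades trained with the Viola-Jones framework.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')

def preprocess_face(image_bgr, out_size=128):
    """Gray conversion, 1% contrast stretch, Viola-Jones detection, crop and resize."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    lo, hi = np.percentile(gray, (1, 99))               # saturate 1% at both ends
    gray = np.clip((gray - lo) * 255.0 / max(hi - lo, 1), 0, 255).astype(np.uint8)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest detection
    return cv2.resize(gray[y:y + h, x:x + w], (out_size, out_size))
```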
Dalal and Triggs [13] proposed the HOG descriptor for human detection; it has since been widely used for various computer vision problems like pedestrian detection, face recognition, and face expression recognition. In HOG, images are represented by the directions of the edges they contain. Gradient orientations and magnitudes are computed by applying a gradient operator across the image.
Initially, the image is divided into a number of cells, and a local 1-D histogram of gradient directions over the pixels is extracted for each cell. The image is represented by combining the histograms of all cells. Contrast normalization of the local histograms is necessary for better invariance to illumination, shadowing, etc., so the local histograms are combined over larger spatial regions, called blocks, and the cells within each block are normalized together. The feature length increases when the blocks overlap. The normalized blocks are concatenated to form the HOG descriptor.
LDA has been effectively applied to various classification problems like face recognition, speech recognition, cancer detection, multimedia information retrieval, etc. [14]. The fundamental goal of LDA is to find a projection F that maximizes the ratio of the between-class scatter $S_b$ to the within-class scatter $S_W$:

$\arg\max_{F} \frac{\left|F S_b F^{T}\right|}{\left|F S_W F^{T}\right|}$   (1)
For very high dimensional data, the LDA algorithm faces various challenges. In the proposed work, the HOG descriptor is used to extract features from the preprocessed face image. First, the image is divided into blocks of 16 × 16 pixels with 50 % overlap; each block contains 2 × 2 cells of size 8 × 8, giving 15 × 15 = 225 blocks in total. For each cell, the gradient orientation histogram is computed with nine bins spread over 0°–180° (signed gradient). This yields a feature vector of 225 × 4 × 9 = 8100 dimensions.
The scatter matrices are therefore of size 8100 × 8100 (about 65 million entries), which is computationally challenging to handle. Moreover, the matrices are always singular, because the number of samples would need to be at least as large as the feature dimension for them to be non-degenerate. This is known as the small sample size (SSS) problem, since the size of the sample set is smaller than the dimension of the original feature space. To avoid these issues, another criterion is applied before LDA to reduce the dimension of the feature vector: PCA is used to reduce the dimension so that $S_W$ is no longer degenerate, after which the LDA step can proceed without any trouble.
$RFV = \mathrm{LDA}(\mathrm{PCA}(F))$   (2)

where F is the feature vector to be reduced and RFV is the reduced feature vector obtained after applying the PCA+LDA criterion. PCA is a technique used for applications such as dimensionality reduction, lossy data compression, feature extraction, and data visualization [15]. Here, PCA is used to reduce the dimension to X − 1, where X is the total number of samples, so the projected feature matrix is of size X × (X − 1). Then, LDA is employed to decrease the dimension further, producing the reduced feature vector. This feature vector, along with a vector holding the class labels of all samples, is fed as input to the classifier.
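The PCA+LDA reduction could be sketched with scikit-learn as follows; the component counts follow the text (X − 1 PCA components, then LDA down to at most C − 1 = 6 dimensions for seven classes).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def reduce_features(hog_features, labels):
    """PCA to X-1 dimensions, then LDA down to (number of classes - 1) dimensions."""
    X = np.asarray(hog_features)           # shape (X_samples, 8100)
    n_samples = X.shape[0]
    pca = PCA(n_components=min(n_samples - 1, X.shape[1]))
    X_pca = pca.fit_transform(X)
    lda = LinearDiscriminantAnalysis()     # keeps at most n_classes - 1 components
    X_lda = lda.fit_transform(X_pca, labels)
    return X_lda, (pca, lda)
```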
3.4 Classification
An artificial neural network with the back-propagation (BP) algorithm has been used to solve various classification and forecasting problems. Although BP convergence is slow, it is guaranteed. A BPNN with one input, one hidden and one output layer is used here; the network employs sigmoid neurons in the hidden layer and linear neurons in the output layer, and the training samples are presented to the network in batch mode. The network configuration is $I_M \times H_Y \times Z$, i.e., M input features, Y hidden neurons, and Z output neurons, which indicate the emotions. The network structure is depicted in Fig. 2.
The input layer consists of six neurons, corresponding to the six features selected after applying the PCA+LDA criterion. The number of hidden neurons Y is calculated as per Eq. 3:

$Y = \frac{M + Z}{2}$   (3)
The back-propagation algorithm with the steepest descent learning rule is the most frequently used training algorithm for classification problems, and it is also used in this work. Back-propagation learning consists of two phases, a forward pass and a backward pass [16]. Initially, the input features are presented to the input nodes and their outputs propagate from one layer to the next; all the network weights and biases are fixed during the forward pass. The difference between the actual output and the desired output is treated as an error signal. In the backward pass, the weights and biases are updated by propagating the error signal backward through the network. The learning performance is measured by the root mean square error (RMSE).
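As a stand-in for the paper's BPNN (which was implemented in MATLAB), a comparable network can be evaluated with scikit-learn's MLPClassifier under 5-fold stratified cross-validation; the learning rate and iteration limit are illustrative assumptions.

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.neural_network import MLPClassifier

def evaluate_bpnn(reduced_features, labels, hidden_neurons=6):
    """6-Y-7 network (six reduced features, seven emotions) with 5-fold stratified CV."""
    clf = MLPClassifier(hidden_layer_sizes=(hidden_neurons,),
                        activation='logistic',   # sigmoid hidden layer
                        solver='sgd',            # gradient-descent training
                        learning_rate_init=0.01, max_iter=2000, random_state=0)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scores = cross_val_score(clf, reduced_features, labels, cv=cv)
    return scores.mean(), scores.std()
```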
4 Experiments
The experiments are carried out on a PC with a 3.40 GHz Core i7 processor and 4 GB of RAM, running the Windows 8 operating system. The proposed system is implemented using MATLAB. The summary of the proposed scheme is presented in Algorithm 1.
4.1 Dataset
One standard dataset, CK+ [17], was used to validate the proposed method. The CK+ dataset captures the facial expressions of 210 adults. The participants were 18–50 years old; 69 % were female, 81 % Euro-American, 13 % Afro-American, and 6 % from other groups. CK+ contains 593 posed facial expression sequences from 123 subjects, of which 327 were labeled with one of seven basic emotion categories. We selected 414 images from the CK+ dataset for the experiments, comprising 105 neutral images and 309 peak expressive images; the contempt face expression images were excluded. The preprocessed sample images from CK+ used in the experiments are shown in Fig. 3.
Fig. 4 Classification accuracy (%) versus the number of features
The feature extraction stage is implemented by HOG with the following settings: a signed gradient with nine orientation bins evenly spread over 0°–180°, a cell size of 8 × 8, 4 cells per block, 50 % block overlap, L2-Hys (L2-norm followed by clipping) block normalization, and a [−1, 0, 1] gradient filter with no smoothing.
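With scikit-image, this configuration can be reproduced as in the sketch below; for a 128 × 128 input it yields the 8100-dimensional descriptor described next.

```python
from skimage.feature import hog

def extract_hog_8100(face_128):
    """HOG with 9 orientation bins over 0-180 deg, 8x8 cells, 2x2-cell blocks, L2-Hys norm."""
    features = hog(face_128,
                   orientations=9,
                   pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2),     # 16x16-pixel blocks with 50% overlap
                   block_norm='L2-Hys')
    # 15 x 15 blocks x 4 cells x 9 bins = 8100 features for a 128x128 face image.
    return features
```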
All the images in the dataset are of size 640 × 490. After face detection, the images are cropped and resized to 128 × 128, and the features are extracted using HOG with the above-mentioned settings. The extracted feature dimension of each image is 1 × 8100 (225 blocks × 4 cells per block × 9 bins = 8100). The PCA+LDA approach is then used to reduce the dimension of the feature vector from 8100 to 6. Figure 4 depicts the plot of accuracy against the number of features; it is observed that with only 6 features the proposed system achieves the highest classification accuracy. The six features, together with a target vector containing all class labels, form the resultant dataset.
The resultant dataset is fed to the BPNN classifier. The network consists of three layers, with six nodes (representing the six features) in the input layer, six nodes in the hidden layer, and seven nodes (representing the seven emotions, i.e. anger, disgust, fear, happy, sad, surprise, neutral) in the output layer. The training error for the dataset is
0.0419. Table 1 shows the results of the 5-fold stratified cross-validation (CV) procedure, and the confusion matrix of the proposed facial expression recognition system is given in Table 2.
A comparative analysis of the proposed method against state-of-the-art methods in terms of classification accuracy and number of features is listed in Table 3. All the existing methods have been validated on the same dataset. It is evident that the proposed scheme yields higher classification accuracy with a smaller number of features than the other methods; the use of so few features reduces the computational overhead and makes the classifier's task more feasible. For a test image, the execution time for preprocessing and feature extraction is 0.133 s, whereas feature reduction and classification take 0.008 s and 0.001 s respectively. The time needed for training the classifier is not included in this analysis.
5 Conclusion
This paper proposes a hybrid system for facial expression recognition. At first, the face is detected using the Viola and Jones face detection algorithm. In order to maintain a uniform dimension for all face images, the detected face region is cropped and resized. The system then uses HOG to extract features from the preprocessed face image, and a PCA+LDA approach is harnessed to select the most significant features from the high dimensional HOG features. Finally, a BPNN classifier is used to build an automatic and accurate facial expression recognition system. Simulation results show the superiority of the proposed scheme compared to state-of-the-art methods on the CK+ dataset; the proposed scheme achieves a recognition accuracy of 99.51 % with only six features. In future, other machine learning techniques can be explored to enhance the performance of the system, and the contempt face images of the CK+ dataset can be taken into consideration.
References
1. Tian, Y.l., Brown, L., Hampapur, A., Pankanti, S., Senior, A., Bolle, R.: Real world real-time
automatic recognition of facial expressions. In: Proceedings of IEEE workshop on Performance
Evaluation of Tracking and Surveillance (PETS) (2003)
2. Pantic, M., Rothkrantz, L.J.: Automatic analysis of facial expressions: The state of the art.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12), 1424–1445 (2000)
3. Bettadapura, V.: Face expression recognition and analysis: the state of the art. arXiv preprint
arXiv:1203.6722 (2012)
4. Tian, Y.L., Kanade, T., Cohn, J.F.: Facial expression analysis. In: Handbook of face recogni-
tion, pp. 247–275. Springer (2005)
5. Bartlett, M.S., Littlewort, G., Frank, M., Lainscsek, C., Fasel, I., Movellan, J.: Fully automatic
facial action recognition in spontaneous behavior. In: 7th International Conference on Auto-
matic Face and Gesture Recognition. pp. 223–230. (2006)
6. Gritti, T., Shan, C., Jeanne, V., Braspenning, R.: Local features based facial expression recog-
nition with face registration errors. In: 8th IEEE International Conference on Automatic Face
& Gesture Recognition, pp. 1–8. (2008)
7. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In:
IEEE Conference on Computer Vision and Pattern Recognition (CVPR). vol. 1, pp. I–511.
(2001)
8. Tsai, H.H., Lai, Y.S., Zhang, Y.C.: Using svm to design facial expression recognition for
shape and texture features. In: International Conference on Machine Learning and Cybernetics
(ICMLC). vol. 5, pp.2697–2704. (2010)
Face Expression Recognition Using Histograms . . . 219
9. Chen, J., Chen, D., Gong, Y., Yu, M., Zhang, K., Wang, L.: Facial expression recognition using
geometric and appearance features. In: 4th International Conference on Internet Multimedia
Computing and Service. pp. 29–33. (2012)
10. Valstar, M.F., Pantic, M.: Fully automatic recognition of the temporal phases of facial actions.
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 42(1), 28–43 (2012)
11. Hsieh, C.C., Hsih, M.H., Jiang, M.K., Cheng, Y.M., Liang, E.H.: Effective semantic features
for facial expressions recognition using svm. Multimedia Tools and Applications pp. 1–20
(2015)
12. Chen, J., Chen, Z., Chi, Z., Fu, H.: Facial expression recognition based on facial components
detection and hog features. In: International Workshops on Electrical and Computer Engineer-
ing Subfields (2014)
13. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Confer-
ence on Computer Vision and Pattern Recognition, (CVPR). vol. 1, pp. 886–893. IEEE (2005)
14. Yu, H., Yang, J.: A direct LDA algorithm for high-dimensional datawith application to face
recognition. Pattern recognition 34(10), 2067–2070 (2001)
15. Bishop, C.M.: Pattern recognition and machine learning. Springer (2006)
16. Haykin, S., Network, N.: A comprehensive foundation. Neural Networks 2 (2004)
17. Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I.: The extended cohn-
kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression.
In: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). pp.
94–101. (2010)
18. Saeed, A., Al-Hamadi, A., Niese, R., Elzobi, M.: Frame-based facial expression recognition
using geometrical features. Advances in Human-Computer Interaction (2014)
Dicentric Chromosome Image
Classification Using Fourier Domain
Based Shape Descriptors and Support
Vector Machine
1 Introduction
than one (Fig. 1). Telescoring is the technique of counting such and other aberrant chromosomes (at multiple geographical locations if needed, say in the case of a radiological accident or attack) and, in the process, assessing the actual and effective radiation damage that the organism has suffered, beyond what might be assessable from the exposure dose information alone.
Although chromosomes are minuscule subcellular structures, metaphase is a phase of cell division in which the chromosomes are at a very condensed stage and are hence quite conducive to microscopic imaging. The input images were acquired with a Metafer 4 slide scanning microscope at our institute. This is an automated imaging and analysis system comprising a microscope (model Axio Imager M2, from Zeiss, Germany) and the DC score software for metaphase slide analysis. Examples of the input microscope images are depicted in Fig. 2 (scaled down to fit the page). Our aim was to devise a method to classify the constituent individual chromosome images into normal and dicentric chromosomes.
Fig. 2 a A good quality chromosomes image. b An image with a nucleus in the background and
other debris from surrounding
2 Method
Shape has been shown to be an important component in visual scene understanding by humans. Among the extracted chromosomes, the distinguishing feature of the dicentric chromosomes is two additional constrictions along their body length. Further, the chromosomes exhibit variability in size and shape even amongst normal chromosomes. Another observation is that the chromosomes appear in different orientations. Shape feature methods are widely used over and above segmentation in image analysis [7]. The shape features for our use had to be robust to scale, rotation and the boundary start point. Various methods of shape representation and feature extraction are found in the literature, such as those based on geometry [8, 9]. A Fourier based shape descriptor was adopted, as many desirable invariance properties are achievable in the Fourier space representation [10–12]. Some researchers have also used contour re-sampling to counter the variation in object size [13, 14] and to sample a fixed number of points. The method used in [15] was adapted to arrive at the Fourier shape descriptor as detailed below.
After boundary extraction, the shape is represented by its boundary contour as f = g(x(t), y(t)). The boundary can be taken to represent a periodic signal with a period of 2π when the function f is defined as in Eq. (1). The function g can be the identity function or any other mapping of the boundary contour, and when this function f is expanded into a Fourier series, the Fourier coefficients can be taken as an approximation of the shape. The boundary can be represented either as a collection of contour points as in [10, 15], as a multidimensional representation as in [12], or as a scalar as approached in this paper. The complex domain representation of the boundary can be achieved by computing z(t) as shown in Eq. (2).
z(t) = \sum_{k=0}^{N-1} a_k \exp(2\pi i k t / N)    (3)

a_k = \frac{1}{N} \sum_{t=0}^{N-1} z(t) \exp(-2\pi i k t / N)    (4)
A shape signature can be calculated from the 2-dimensional contour points rather
than working directly on the boundary contour points. The shape signature used
was the centroid distance function. The centroid of the shape boundary is given by
the Eq. (5).
(x_0, y_0) = \frac{1}{N} \sum_{k=0}^{N-1} (x_k, y_k)    (5)
By its very nature this shape signature is rotation invariant. The Fourier trans-
form of the function is computed as given by the Eq. (7) (Fig. 4):
a_k = \frac{1}{N} \sum_{t=0}^{N-1} r(t) \exp(-2\pi i k t / N)    (7)
And further invariance of scale and boundary start point is arrived at by phase
normalization as depicted by Eq. (8):
a_k^{new} = \frac{|a_k|}{|a_0|} \exp(i s \alpha_k - i k \alpha_s), \qquad \alpha_k = \arg(a_k)    (8)
Here, 0 is the index of the coefficient a_0, which is the mean of the data points, while s is the index of the coefficient with the second largest magnitude. Further, the log-scale representation of these coefficients was taken and a subset of the coefficients was used as the final set of Fourier boundary descriptors. The cardinality of the subset was kept the same for each chromosome. This was done both to include only the most relevant discerning Fourier features and to make the final feature vector uniform in size irrespective of the chromosome. After experimentation, 50 features, comprising the first 25 coefficients (sparing the mean) and the last 25, produced acceptable classification accuracy.
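A compact sketch of the descriptor computation (Eqs. (5), (7) and (8) plus the log-scale truncation) is given below. It is an illustration rather than the authors' implementation, and it assumes the chromosome boundary is available as an ordered array of more than 51 contour points.

```python
import numpy as np

def fourier_shape_descriptor(boundary_xy, n_keep=25):
    """boundary_xy: (N, 2) ordered boundary points of one chromosome.
    Returns 2*n_keep log-scaled, normalized Fourier coefficients."""
    pts = np.asarray(boundary_xy, dtype=float)
    centroid = pts.mean(axis=0)                       # Eq. (5)
    r = np.linalg.norm(pts - centroid, axis=1)        # centroid distance signature
    a = np.fft.fft(r) / len(r)                        # Eq. (7)
    mag, alpha = np.abs(a), np.angle(a)
    s = np.argsort(mag[1:])[-1] + 1                   # second largest magnitude
                                                      # (a_0, the mean, is the largest)
    k = np.arange(len(a))
    a_new = (mag / mag[0]) * np.exp(1j * (s * alpha - k * alpha[s]))   # Eq. (8)
    logmag = np.log(np.abs(a_new) + 1e-12)            # log-scale representation
    # first n_keep coefficients (sparing the mean) plus the last n_keep: 50 features
    return np.concatenate([logmag[1:n_keep + 1], logmag[-n_keep:]])
```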
The Support Vector Machine [16] is a robust and popular supervised machine learning algorithm for classification. SVM has been employed for cytogenetic image classification [17]. In our work, two-class image classification was carried out using the custom designed feature set described above.
3 Results
A total of 141 chromosome images were used in this study. Out of these, 105 were
the binary extracted images of normal chromosomes and 36 were the binary
extracted images of dicentric chromosomes. The image dataset was randomly
divided into training and testing datasets. At all times, 70 normal chromosome
images and 20 dicentric chromosome images were used for the training phase and
the remaining images were used for the test classification.
Linear and radial basis function kernels were tried. The best classification accuracy of 90.1961 % was achieved using a linear kernel with the SVM, in which 46 of the 51 test images were correctly classified. Matlab along with LIBSVM [18] was used for the implementation of this work.
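The original classifier was implemented in Matlab with LIBSVM; an equivalent sketch using scikit-learn (whose SVC also wraps LIBSVM) is shown below. The training-set sizes are taken from the text, while the random split and the default SVM parameters are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

def train_and_test(features, labels, n_train_normal=70, n_train_dicentric=20, seed=0):
    """features: (141, 50) Fourier descriptors; labels: 0 = normal, 1 = dicentric."""
    rng = np.random.default_rng(seed)
    idx_n = rng.permutation(np.flatnonzero(labels == 0))
    idx_d = rng.permutation(np.flatnonzero(labels == 1))
    train = np.concatenate([idx_n[:n_train_normal], idx_d[:n_train_dicentric]])
    test = np.concatenate([idx_n[n_train_normal:], idx_d[n_train_dicentric:]])
    clf = SVC(kernel='linear').fit(features[train], labels[train])
    return clf, clf.score(features[test], labels[test])   # accuracy on the 51 test images
```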
References
6. Shervin Minaee, Mehran Fotouhi, Babak Hossein Khalaj, “A Geometric Approach to Fully
Automatic Chromosome Separation”, Signal Processing in Medicine and Biology Symposium
(SPMB), 2014 IEEE.
7. Dengsheng Zhang, Guojun Lu, “Review of shape representation and description techniques”,
2003 Pattern Recognition Society.
8. E. Poletti, E. Grisan, A. Ruggeri, “Automatic classification of chromosomes in Q-band
images”, 30th Annual Intl. IEEE EMBS Conf. Aug 2008.
9. H Ling, D.W. Jacobs, “Shape Classification Using the Inner-Distance”, IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 29, no. 2, Feb 2007.
10. D. Zhang, G. Lu, “A Comparative Study on Shape Retrieval Using Fourier Descriptors with
Different Shape Signatures”, Monash Univ.
11. D Zhang, G Lu, “Study and evaluation of different Fourier methods for image retrieval”,
Image Vis. Comput. 23, 33–49, 2005.
12. I. Kunttu, L. Lepisto, J. Rauhamaa, A. Visa, “Multiscale Fourier Descriptor for Shape
Classification”, Proc of the 12th Intl. Conf. on Image Analysis and Processing, 2003.
13. Volodymyr V. Kindratenko, Pierre J. M. Van Espen, Classification of Irregularly Shaped
Micro-Objects Using Complex Fourier Descriptors, Proc of 13th Intl Conf on Pattern
Recognition, vol.2, pp. 285–289, 1996.
14. J. Matas, Z. Shao, J. Kittler, “Estimation of Curvature and Tangent Direction by Median
Filtered Differencing”, 8th Int. Conf. on Image Analysis and Processing, San Remo, Sep 1995.
15. Christoph Dalitz, Christian Brandt, Steffen Goebbels, David Kolanus, “Fourier Descriptors
for Broken Shapes”, EURASIP Journ. On Advances in Signal Proc, 2013.
16. Cortes, C.; Vapnik, V. (1995). “Support-vector networks”, Machine Learning 20 (3): 273.
doi:10.1007/BF00994018.
17. Christoforos Markou, Christos Maramis, Anastasios Delopoulos, “Automatic Chromosome
Classification using Support Vector Machines”.
18. Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available
at http://www.csie.ntu.edu.tw/∼cjlin/libsvm.
An Automated Ear Localization Technique
Based on Modified Hausdorff Distance
Abstract Localization of ear in the side face images is a fundamental step in the
development of ear recognition based biometric systems. In this paper, a well-known
distance measure termed as modified Hausdorff distance (MHD) is proposed for
automatic ear localization. We introduce the MHD to decrease the effect of outliers and to make the measure more suitable for detection of the ear in side face images. The MHD uses coordinate pairs of edge pixels derived from the ear template and the skin regions of the side face image to locate the ear portion. To detect ears of various shapes, the ear template is created by considering different ear structures and is resized automatically for the probe image to find the exact location of the ear. The CVL and UND-E databases, whose side face images exhibit different poses, inconsistent backgrounds and poor illumination, are utilized to analyse the effectiveness of the proposed algorithm. Experimental results reveal that the proposed technique is invariant to pose, shape, occlusion, and noise.
1 Introduction
In recent years, ear biometrics has gained much attention and become an emerging area for innovation in the field of biometrics. Ear biometrics is gaining popularity because, unlike the face, ears are not affected by aging, mood, health, and posture.
Automatic ear localization in the 2D side face image is a difficult task and the per-
formance of ear localization influences the efficiency of the ear recognition system.
The use of the ear as a biometric trait for human recognition was first documented by the French criminologist Alphonse Bertillon [1]. More than a century ago, Alfred Iannarelli [2] demonstrated a manual ear recognition system. He examined more than ten thousand ears and observed that no two ears are exactly identical. The first technique for ear detection was introduced by Berge et al. [3]. It depends on building a neighborhood graph from the edges of the ear, but its main disadvantages are that, first, user interaction is needed for contour initialization and, second, the system cannot discriminate true ear edges from non-ear edge contours.
Choras [4] used geometric features for contour detection, but this approach also suffered from the same problem of selecting erroneous curves. Hurley et al. [5] proposed a force field technique to detect the ear. Although the reported algorithm was tested only on ear images with small backgrounds, the results are rather encouraging. Alvarez et al. [6] detected the ear from 2D face images using an ovoid and active contour (snake) model. In this algorithm an initial approximate ear contour is needed for ear localization. Ansari et al. [7] described an ear detection technique that depends on the edges of the outer ear helix; it may therefore fail when the outer helix edges are not clear. Yuan and Mu [8] applied skin-color and contour information to detect the ear. Their method assumes an elliptical shape for the ear and searches for the ellipse on the edges of the side face to obtain the location of the ear. However, an elliptical ear shape may not be proper for all individuals and cannot be used for detecting the ear universally. Sana et al. [9] suggested an ear
detection technique based on template matching. In this work, ear templates of different sizes are maintained to locate ears in side faces at different scales. In real world applications, ears occur in various sizes and the off-line templates cannot manage all the cases. Islam et al. [10] proposed a cascaded AdaBoost based ear detection technique. The results of this approach are rather promising for a small database, but it needs more training time for large sets of images. Prakash et al. [11, 12] proposed an efficient distance transform and template based technique for ear localization. This approach is not robust to illumination variations and noisy images. In a recent paper, Prakash et al. [13] used the edge map of the side face image and a convex hull criterion to construct a graph of connected components, taking the largest connected component as the ear region. Here the experimental results depend on the quality of the input image and on proper illumination conditions.
This paper presents a new efficient scheme for automatic ear localization from side face images based on a similarity measure between the ear edge template and the skin regions of the side face using the modified Hausdorff distance. As the Hausdorff distance measure does not depend on pixel intensity, and as the ear template is a representative of ears of different shapes, the proposed approach is invariant to illumination variations, poses, shape, and occlusion. The remainder of this paper is organized as follows. Section 2 presents skin-color segmentation and the Hausdorff distance. Then Sect. 3 describes the proposed ear detection technique. Experimental results are discussed in Sect. 4 and finally, we draw some conclusions in Sect. 5.
2 Technical Background
The proposed approach includes a color based skin segmentation method to detect only the skin area of the side face images. Since the ear is a part of the skin region, the search space for localizing the ear is reduced by excluding the non-skin region. The skin color model suggested in [14] can be used for segmenting the skin region from the non-skin region. In our work, the YCbCr color space [15] has been used to represent images: excluding luminance from the blue and red components concentrates the skin colors in a small region, which is why the skin color model exploits the YCbCr color space. In the skin detection method, an image is first converted from the RGB color space to the YCbCr color space, and then the likelihood of each pixel is computed using a Gaussian model N(μ, Σ). Each pixel of the image is represented using a color vector c = (Cb, Cr)^T and the likelihood P(r, b) value can be calculated as follows:
P(r, b) = \frac{1}{\sqrt{2\pi|\Sigma|}} \exp\left[-\frac{1}{2}(c - \mu)\,\Sigma^{-1}(c - \mu)^{T}\right]    (1)
The skin-likelihood values obtained using Eq. (1) convert the gray image into a skin-likelihood image. The skin region is then segmented from the non-skin region by thresholding the skin-likelihood image to obtain a binary image. Finally, the binary side face image is dilated using a morphological operator and multiplied with the input color image to obtain the skin-segmented image.
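A minimal sketch of the skin-likelihood computation and segmentation is given below. It is illustrative only: it assumes OpenCV and NumPy, that the Gaussian parameters μ and Σ have been estimated beforehand from training skin pixels, and it drops the normalization constant of Eq. (1) since only a threshold on the rescaled likelihood image matters.

```python
import cv2
import numpy as np

def skin_mask(bgr_image, mu, sigma, threshold=0.5):
    """mu: mean (Cb, Cr) of skin pixels; sigma: their 2x2 covariance.
    Returns a binary skin mask following the steps described in the text."""
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb).astype(np.float64)
    c = ycrcb[..., [2, 1]] - np.asarray(mu, dtype=np.float64)   # c = (Cb, Cr) per pixel
    mahal = np.einsum('...i,ij,...j->...', c, np.linalg.inv(sigma), c)
    likelihood = np.exp(-0.5 * mahal)        # Gaussian likelihood of Eq. (1), unnormalized
    likelihood /= likelihood.max()           # rescale to [0, 1] before thresholding
    mask = (likelihood > threshold).astype(np.uint8)
    # fill holes caused by noise using morphological dilation, as in the text
    return cv2.dilate(mask, np.ones((5, 5), np.uint8), iterations=1)
```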
The Hausdorff distance (HD) [16] is a promising similarity measure in many image matching applications. It measures the degree of resemblance between two binary images: the smaller the Hausdorff distance between the edge point sets of two images, the greater the degree of similarity. Let Img and Tmp be two sets of points in the input image and the template, Img = {p_1, p_2, ..., p_{N_p}} and Tmp = {q_1, q_2, ..., q_{N_q}}, where
each point p_i or q_j is the 2D pixel coordinate of an edge point extracted from the object of interest. The Hausdorff distance for the two point sets is defined as

H(Img, Tmp) = \max\{ d(Img, Tmp),\; d(Tmp, Img) \}    (2)

where d is the directed distance between the two point sets Img and Tmp. It is obtained from the distance of a pixel p_i to all the points of the set Tmp, i.e.

d(p_i, Tmp) = \min_{q_j \in Tmp} d(p_i, q_j)    (3)

Similarly, the reverse distance from a pixel q_j to all the points of the set Img is computed as

d(q_j, Img) = \min_{p_i \in Img} d(q_j, p_i)    (4)

The modified Hausdorff distance [17] replaces the maximum over the points of a set by an average, which reduces the effect of outliers:

d_{mh}(Img, Tmp) = \max\!\left( \frac{1}{N_p} \sum_{p_i \in Img} \min_{q_j \in Tmp} d(p_i, q_j),\; \frac{1}{N_q} \sum_{q_j \in Tmp} \min_{p_i \in Img} d(q_j, p_i) \right)    (5)
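Equation (5) maps directly onto a few lines of code; the sketch below (an illustration, using SciPy's pairwise distance routine) computes the modified Hausdorff distance between two edge-point sets.

```python
import numpy as np
from scipy.spatial.distance import cdist

def modified_hausdorff(img_pts, tmp_pts):
    """img_pts, tmp_pts: (N, 2) arrays of edge-pixel coordinates; implements Eq. (5)."""
    d = cdist(img_pts, tmp_pts)          # pairwise Euclidean distances
    forward = d.min(axis=1).mean()       # average distance from Img points to Tmp
    reverse = d.min(axis=0).mean()       # average distance from Tmp points to Img
    return max(forward, reverse)
```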
3 Proposed Technique
The present technique includes four major parts: skin-color segmentation; edge detection and pruning of most non-ear edges; comparison of edge image patches of the skin region with the ear template to locate the ear; and finally, validation of the true ear candidate using normalized cross-correlation (NCC). The proposed technique is illustrated in Fig. 1.
In this section, the main objective is to detect the skin region in a side face image. The first step of skin-color segmentation is to establish a skin color model, as discussed in Sect. 2.1. Using an appropriate threshold, the gray scale likelihood image is segmented into a binary image. However, different people have different skin likelihood values, hence an adaptive thresholding process [14] is used to obtain the optimal threshold value. The optimal threshold is used to transform the skin likelihood image into a binary image. The binary image possesses holes due to the presence of noise, which are filled using a morphological dilation operation. Figure 2a shows a sample side face color image; its minimal edge set is obtained through various intermediate processing stages, which are illustrated in Fig. 2b–f.
Fig. 2 Stages of skin segmentation: a input side face image, b gray scale image of skin-likelihood values, c skin binary image with holes, d skin binary image after dilation operation, e gray scale skin image, f skin edge image
In this work, the Canny edge detector [18] is used to obtain the edge map of the skin regions. The schemes used to prune non-ear edges are discussed below. Figure 3 shows the stages of pruning non-ear edges.
1. Pruning spurious edges
The input edge image is scanned for edge junctions and edge lists are established accordingly. Subsequently, edges whose length is shorter than a threshold are treated as spurious edges and removed. Generally, small spurious edges appear in the edge image due to noise or the presence of hair. Let E be the set of all the edges present in the edge image and E_l be the set of edges of E that remain after pruning edges whose length is smaller than the threshold, as given by:
Fig. 3 An example of the edge pruning process: a side face edge image, b using minimum edge length and c using curvature measured in pixels
where Img is the side face skin edge image, length(e) represents the length of
edge e and Tl is the threshold of allowable edge length.
2. Piecewise linear representation of edges and pruning linear edges
Most of the non-ear edges are linear in nature; in this work linear edges are removed to reduce the number of edges in the edge image and to speed up the matching operation. In the previous step, not all the pixels of an edge belonging to the set E_l are equally important or essential to represent the ear. In order to remove non-ear linear edges, line segments are fitted to the edge points. Linear edges are then represented using only two pixels, whereas non-linear edges are represented using line segments having a larger number of pixels. The line-segment fitting algorithm receives the edge points of the set E_l, determines the locations of maximum deviation for each edge, and approximates lines between pairs of points. Finally, a new set E_ls is established in which each edge is approximated using line segments. Hence the edges represented by only two points are non-ear edges and can be eliminated from the set E_ls. After pruning linear edges the set E_C is established, which is defined as follows:
where the function distance(e) counts the number of times the maximum deviation occurred and returns the count value for each edge e.
This section describes the ear localization, which mainly comprises three parts: (1) ear template creation, (2) resizing of the ear template, and (3) edge based template matching using the modified Hausdorff distance.
1. Ear template generation
The main objective of ear template generation is to obtain an ear template which is a good representation of the ear candidates available in the database. Surya Prakash et al. [11, 12] mention that human ears can broadly be grouped into four kinds: triangular, round, oval, and rectangular. Taking the above-mentioned types of ear shapes into consideration, a set of ear images is selected manually off-line from the database. The ear template Tmp is generated by averaging the pixel intensity values of all the selected ear images and is defined as follows:
Tmp(i, j) = \frac{1}{N_{Img}} \sum_{k=1}^{N_{Img}} Img_k(i, j)    (8)

where N_{Img} is the number of ear images selected manually for ear template creation and Img_k is the k-th ear image. Img_k(i, j) and Tmp(i, j) represent the (i, j)-th pixel values of the k-th ear image (Img_k) and the template (Tmp) image respectively.
w_i^e = \frac{w_r^e}{w_r^f}\, w_i^f    (9)
where w_r^f and w_i^f are the widths of the reference face image and the input side face image respectively. Similarly, w_r^e, the width of the reference ear, is the same as the standard ear template width. Furthermore, measuring the height of the side face is difficult and inaccurate because of the extra skin pixels of the person's neck. In our work, an experiment was conducted to find the relationship between the width and height of the ear using many cropped ear images. It was observed that the ratio between the width and height of the human ear is greater than 0.5 (it varies from 0.5 to 0.7) and depends on the ratio between the widths of the input side face and the reference image. Hence, knowing the width of the ear from Eq. (9) and the previous ratio value, the height of the ear in the input side face image is estimated; this was found effective for most of the side face images shown in Figs. 5 and 6. This feature makes the proposed approach fully automated for ear localization.
3. Localization of the ear using MHD
After pruning the non-ear edges, the set E_C defines the new edge map of the side face image. The edge image of the ear template is compared with a same-sized overlapping window of the input edge image using the modified Hausdorff distance, and the process is repeated by moving the overlapping window over the skin region of the input edge image. Among all the block distances, the block with the minimum Hausdorff distance is selected and the corresponding region of the profile face is extracted. This region is expected to be the true ear, and the claim is verified using the NCC criterion. When verification fails, the block with the next smallest Hausdorff distance is taken as the expected ear region and the claim is verified again. This process carries on until the ear is localized.
4. Ear verification using NCC
This step validates the selected ear region as the true ear candidate using the normalized cross-correlation coefficient (NCC) by verifying the correlation between the localized ear portion and the ear template created off-line (a sketch of this matching-and-verification loop is given after this list). The NCC equation is written as
NCC = \frac{\sum_x \sum_y [Img(x, y) - \overline{Img}]\,[Tmp(x - x_c, y - y_c) - \overline{Tmp}]}{\sqrt{\sum_x \sum_y [Img(x, y) - \overline{Img}]^2}\;\sqrt{\sum_x \sum_y [Tmp(x - x_c, y - y_c) - \overline{Tmp}]^2}}    (10)
where Img and Tmp are the images of the localized ear portion and the ear template respectively, and \overline{Img} and \overline{Tmp} are their average brightness values. The NCC value lies between −1.0 and 1.0; a value closer to 1 indicates a better match between the localized ear and the template. When the NCC value is above a predefined threshold, the localization is termed a true localization; otherwise, it is a false localization.
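A sketch of the sliding-window matching and NCC verification described in items 3 and 4 above is given below. It is illustrative only: the window stride, the NCC threshold value and the helper signature are assumptions, and the MHD helper from Sect. 2 is repeated so that the sketch is self-contained.

```python
import numpy as np
from scipy.spatial.distance import cdist

def modified_hausdorff(a, b):                       # Eq. (5), as sketched in Sect. 2
    d = cdist(a, b)
    return max(d.min(axis=1).mean(), d.min(axis=0).mean())

def locate_ear(skin_edges, tmpl_edges, tmpl_gray, face_gray, ncc_threshold=0.4, stride=4):
    """Rank windows of the skin edge map by MHD against the ear-template edges and
    accept the first candidate whose NCC with the intensity template (Eq. (10))
    exceeds ncc_threshold."""
    th, tw = tmpl_edges.shape
    tmp_pts = np.argwhere(tmpl_edges > 0)
    candidates = []
    for y in range(0, skin_edges.shape[0] - th, stride):
        for x in range(0, skin_edges.shape[1] - tw, stride):
            win_pts = np.argwhere(skin_edges[y:y + th, x:x + tw] > 0)
            if len(win_pts):
                candidates.append((modified_hausdorff(win_pts, tmp_pts), y, x))
    t = tmpl_gray.astype(float) - tmpl_gray.mean()
    for _, y, x in sorted(candidates):              # smallest MHD first
        p = face_gray[y:y + th, x:x + tw].astype(float)
        p -= p.mean()
        ncc = (p * t).sum() / (np.sqrt((p ** 2).sum()) * np.sqrt((t ** 2).sum()) + 1e-12)
        if ncc > ncc_threshold:                     # true localization
            return y, x, ncc
    return None                                     # no window passed verification
```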
We tested the proposed technique using two databases, namely the CVL face database [19] and the University of Notre Dame database, Collection E [20]. The CVL face database was collected by the Computer Vision Laboratory, University of Ljubljana. It contains face images of 114 persons with 7 images per subject. The images include views of both the left and right ears of each person, so a total of 456 side face images are contained in the database. In this work, 100 right side faces are used for experimentation. All images are of resolution 640 × 480 in JPEG format, captured by a Sony Digital Mavica under uniform illumination and with a projection screen in the background. The database contains 90 % male and 10 % female side face images. Next, Collection E (UND-E) of the University of Notre Dame database was tested with the proposed scheme. It contains 462 side face images of 118 subjects with 2–6 samples per subject. In the Collection E database, images were captured on different days with various poses under dissimilar background and lighting conditions. The proposed approach was evaluated on 100 right profile face images chosen from the CVL database, with results illustrated in Fig. 4. Similarly, Fig. 5 illustrates the ear localization results on the 462 profile face images of the UND-E database. The authors in [11, 12] applied template matching techniques for ear localization based on pixel correspondences; in turn, the performance of these
Fig. 4 Illustration of some ear localization outputs of the test images in the experiments using
proposed approach based on modified Hausdorff distance
Fig. 5 Illustration of some ear localization outputs of the UND-E test images simulated using
proposed approach based on modified Hausdorff distance
Results are reported in terms of accuracy for the two databases. The accuracy for the CVL face database was found to be 91 % and that for the Collection E database 94.54 %. It has been observed that the accuracy obtained for the CVL face database is not satisfactory
Fig. 6 Illustration of some partial ear detections in the presence of few ear edges
because of poor illumination and the similarity of the background and hair color with the skin color. Similarly, the accuracy for the Collection E profile face database is encouraging even when the side faces are partially occluded by hair, although performance degrades especially when the hair color is similar to the skin color of the side face in the test samples.
5 Conclusion
In this paper, we presented an automated ear localization technique to detect the ear in human side face images. The main contributions are twofold: first, separating the skin-color region from the non-skin region using a skin color model, and second, locating the ear within that skin region using the modified Hausdorff distance. Experiments were conducted on the CVL and UND-E databases, and extensive evaluations show that the results are promising. The proposed approach is simple and robust for ear detection without any user intervention. This performance should encourage future research on ear localization using variants of the Hausdorff distance measure.
References
9. A. Sana, P. Gupta, and R. Purkait, “Ear biometric: A new approach,” In Proceedings of ICAPR,
pp. 46–50, (2007).
10. S.M.S. Islam, M. Bennamoun, and R. Davies, “Fast and fully automatic ear detection using
cascaded adaboost,” In Proceedings of IEEE Workshop on Applications of Computer Vision
(WACV’ 08), pp. 1–6, (2008).
11. Surya Prakash, J. Umarani, and P. Gupta, “Ear localization from side face images using distance
transform and template matching,” in Proceedings of IEEE Int’l Workshop on Image Proc.
Theory, Tools and Application, (IPTA), Sousse, Tunisia, pp. 1–8, (2008).
12. Surya Prakash, J. Umarani, and P. Gupta, “A skin-color and template based technique for auto-
matic ear detection,” in Proceedings of ICAPR, India, pp. 213–216, (2009).
13. Surya Prakash, J. Umarani, and P. Gupta, “Connected Component Based Technique for Auto-
matic Ear Detection,” in Proceedings of the 16th IEEE Int’l Conference of Image Processing
(ICIP), Cairo, Egypt, pp. 2705–2708, (2009).
14. J. Cai, and A. Goshtasby, “Detecting human faces in color images,” Image and Vision Com-
puting, 18(1), pp. 63–75, (1999).
15. G. Wyszecki and W.S. Styles, “Color Science: Concepts and Methods, Quantitative Data and
Formulae,” second edition, John Wiley & Sons, New York (1982).
16. D.P. Huttenlocher, G.A. Klanderman, W.J. Rucklidge, “Comparing images using the Hausdorff
distance,” IEEE Trans. Pattern Anal. Mach. Intell. 850–863, (1993).
17. M.P. Dubuisson and A.K. Jain, “A modified Hausdorff distance for object matching,” In
ICPR94, Jerusalem, Israel, pp. A:566–568, (1994).
18. J. Canny, “A computational approach to edge detection,” IEEE Transactions on Pattern Analy-
sis and Machine Intelligence 8(6), 679–698, (1986).
19. Peter Peer, “CVL Face Database,” Available: http://www.lrv.fri.uni-lj.si/facedb.html.
20. University of Notre Dame Profile Face Database, Collection E, http://www.nd.edu/~cvrl/
CVRL/DataSets.html.
Sclera Vessel Pattern Synthesis Based
on a Non-parametric Texture Synthesis
Technique
Abstract This work proposes a sclera vessel texture pattern synthesis technique. Sclera texture was synthesized by a non-parametric texture regeneration technique. A small number of classes from the UBIRIS version 1 dataset were employed as primitive images. Appreciable results were achieved, which demonstrate the successful synthesis of sclera texture patterns. It is difficult to obtain a large collection of real sclera data, and hence such synthetic data will be useful to researchers.
1 Introduction
The sclera is the white region with blood vessel patterns around the eyeball. Recently, as with other ocular biometric traits, sclera biometrics has gained in popularity [1–11]. Some recent investigations on multi-modal eye recognition (using iris and sclera) show that fusing iris information with the sclera can enhance the applicability of iris biometrics for off-angle or off-axis eye gaze. To establish this concept, it is first necessary to assess the biometric usefulness of the sclera trait independently (Fig. 1).
Moreover, the research conducted on this subject is very limited and has not been studied extensively across a large proportion of the population. Therefore, to date the literature related to sclera biometrics is still in its infancy and little is known regarding its usefulness for personal identity establishment in large populations. It can be inferred from recent developments in the literature that a number of independent research efforts have explored the sclera biometric, and several datasets have been proposed, which are either publicly available or proprietary. Efforts were also made to address the various challenges that reside in processing the sclera trait towards personal identity establishment. It can be noted from the literature that the datasets developed contain a limited population of at most 241 individuals. Growing the population is a tough task and, moreover, it also depends on the availability of volunteers. On some occasions, volunteers may be available for a particular session and not for the next session, which again can bring inconsistency to the dataset. Such instances can be found in the datasets proposed in the literature. Hence, establishing this trait for a larger population and generating a larger representative dataset is an open research problem.
In the biometric literature, data synthesis is a proposed solution for data collection over larger populations. Data synthesis refers to the artificial regeneration of traits by means of some computer vision or pattern synthesis based regeneration technique. Several such techniques have been proposed in the literature for iris biometrics [12] (the texture pattern on the eyeball employed for biometrics). In order to mitigate the above mentioned problem in sclera biometrics, similarly to the iris biometric, we propose a sclera biometric synthesis technique. The sclera contains the white area of the eye along with vessel patterns that appear as a texture pattern. Therefore we have applied texture regeneration theory for this application.
The organization of the rest of the paper is as follows: The concept of our
proposed method is presented in Sect. 2. In Sect. 3 our experimental results along
with the dataset are described, as well as a preliminary discussion on our experi-
ments. Conclusions and future scope are presented in Sect. 4.
2 Proposed Technique
Step 12: The positions of the neighbor elements are listed after sorting their values in descending order (Fig. 6).
Step 13: Elements of SI having the same positions as those obtained from Step 12 are considered.
Step 14: A window (illustrated here with a size of 3 × 3, although our experiments use 39 × 39) is placed on every considered element in such a way that the element is the middle element of the window (Fig. 7).
Step 15: After placing the window on every element of SI, the elements within the window are matched position-wise with the corresponding elements of PI, and an element is randomly chosen where the match error is below the maximum match error threshold. We considered 0.1 as the value of the maximum error threshold.
Step 16: Let element q(x1, y1) of SI be an element that satisfies Step 15; then we assign,
Where,
Fig. 6 a A neighbor element of an SI-2 matrix along with its row/column position, b,
c Respective row, column position list of neighboring elements in the SI-2 matrix
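Steps 12–16 follow the non-parametric sampling idea of Efros and Leung [13]: the neighbourhood of a pixel of the synthesized image (SI) is compared against every neighbourhood of the primitive image (PI), and a value is drawn at random from the close matches. A minimal sketch of this matching step is given below; it is an illustration only, the exact way the 0.1 threshold of Step 15 is applied is an assumption (a relative tolerance), and the specific ordering of neighbour positions in Steps 12–13 is simplified away.

```python
import numpy as np

def synthesize_pixel(primitive, synth, filled, y, x, win=39, err_threshold=0.1, rng=None):
    """Choose a value for synth[y, x] by matching its win x win neighbourhood
    (already-filled pixels only) against every neighbourhood of the primitive image."""
    rng = rng if rng is not None else np.random.default_rng()
    half = win // 2
    si = np.pad(synth.astype(float), half, mode='constant')
    vm = np.pad(filled.astype(float), half, mode='constant')
    patch = si[y:y + win, x:x + win]       # neighbourhood of the target SI pixel
    valid = vm[y:y + win, x:x + win]       # 1 where the SI pixel is already filled
    pim = primitive.astype(float)
    errs, vals = [], []
    for i in range(half, primitive.shape[0] - half):
        for j in range(half, primitive.shape[1] - half):
            cand = pim[i - half:i + half + 1, j - half:j + half + 1]
            errs.append(np.sum(((cand - patch) ** 2) * valid) / max(valid.sum(), 1.0))
            vals.append(primitive[i, j])
    errs = np.asarray(errs)
    ok = np.flatnonzero(errs <= errs.min() * (1.0 + err_threshold))   # close matches
    return vals[int(rng.choice(ok))]       # random choice among acceptable candidates
```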
3 Experimental Details
Details about the implementation and the setup of the experiments are discussed
here, before presenting the detailed results.
3.1 Dataset
In order to evaluate the performance of the proposed method, the UBIRIS version 1
database [14] was utilized in these experiments. This database consists of 1877 RGB images from 241 identities, taken in two distinct sessions (1205 images in session 1 and 672 images in session 2). The database contains blurred images and images with blinking eyes. Both high resolution (800 × 600) and low resolution (200 × 150) images are provided, all in JPEG format. A few examples from session 1 are given in Fig. 8.
For our experiments, the first 10 identities from session 1 were considered. Five samples from each identity were selected and their sclera vessel patterns were manually cropped, as shown in Fig. 9. For each real image, a synthesized
Fig. 9 Manually cropped vessel patterns from eye images of session 1 UBIRIS version1
Fig. 10 Synthetic sclera vessel patterns from eye images of session 1 from UBIRIS version 1
image was generated. The synthesised images generated from the corresponding
images in Fig. 9 as primitive images are shown in Fig. 10.
system was trained with 5 synthetic images and tested with 5 primitive images, and vice versa. For the first set of experiments, 10 × 2 scores for the FRR and 10 × 9 × 2 scores for the FAR statistics were obtained, whereas 10 × 5 FRR scores and 10 × 9 × 5 FAR scores were obtained for the second set of experiments.
3.3 Results
3.4 Discussion
This work is an initial investigation into sclera pattern synthesis. Here the images from the original (or primitive) version are cropped manually and used for sclera vessel pattern synthesis. Although satisfactory results were achieved in the proposed experimental setup, the time complexity of the implementation was found to be high. The experiments were performed on a cluster server machine in a Linux environment using Matlab 2015. It can therefore be expected that generating the complete vessel pattern of the eye will take even more time. Minimizing this time will be an open research area for this field.
References
1. Derakhshani, R., A. Ross, A., Crihalmeanu, S.: A New Biometric Modality Based on
Conjunctival Vasculature. Artificial Neural Networks in Enineering (2006) 1–6.
2. Das, A., Pal, U., Ballester, M., F., A., Blumenstein, M.: A New Method for Sclera Vessel
Recognition using OLBP. Chinese Conference on Biometric Recognition, LNCS 8232 (2013)
370–377.
3. Das, A., Pal, U., Ballester, M., F., Blumenstein, M.: Sclera Recognition Using D-SIFT. 13th
International Conference on Intelligent Systems Design and Applications (2013) 74–79.
4. Das, A., Pal, U., Blumenstein M., Ballester, M., F.: Sclera Recognition—A Survey.
Advancement in Computer Vision and Pattern Recognition (2013) 917 –921.
5. Das, A., Pal, U., Ballester, M., F., Blumenstein, M.: Fuzzy Logic Based Sclera Recognition.
FUZZ-IEEE (2014) 561–568.
6. Das, A., Pal, U., Ballester M., F., Blumenstein, M.: Multi-angle Based Lively Sclera
Biometrics at a Distance. IEEE Symposium Series on Computational Intelligence (2014) 22–
29.
7. Das, A., Pal, U., Ballester, M., A., F., Blumenstein, M.: A new efficient and adaptive sclera
recognition system. Computational Intellig. in Biometrics and Identity Management. IEEE
Symposium (2014) 1–8.
8. Das, A., Pal, U., Blumenstein, M., Ballester, M., A., F.: Sclera Segmentation Benchmarking
Competition (2015). http://www.ict.griffith.edu.au/conferences/btas2015.
9. Crihalmeanu, S., Ross., A.: Multispectral scleral patterns for ocular biometric recognition.
Pattern Recognition Letters, Vol. 33 (2012) 1860–1869.
250 A. Das et al.
10. Zhou, Z., Du, Y., Thomas, N., L., Delp, E., J.: Quality Fusion Based Multimodal Eye
Recognition. IEEE International Conference on Systems, Man, and Cybernetics (2012) 1297–
1302.
11. Das, A., Kunwer, R., Pal, U., Ballester, M., A., F., Blumenstein, M.: An online learning-based adaptive biometric system. Adaptive Biometric Systems: Recent Advances and Challenges (2015) 73–95.
12. Galbally, J., Ross, A., Gomez-Barrero, M., Fierrez, J., Ortega-Garcia, J.,: Iris image
reconstruction from binary templates: An efficient probabilistic approach based on genetic
algorithms, Computer Vision and Image Understanding, Vol. 117, n. 10, (2013) 1512–1525.
13. Efros, A., Leung, T.,: Texture Synthesis by Non-Parametric Sampling. In Proceedings of
International Conference on Computer Vision, (1999), 1033–1038.
14. Proença, H., Alexandre, L., A.: UBIRIS: A noisy iris image database, Proceed. of ICIAP 2005
—Intern. Confer. on Image Analysis and Processing, 1: (2005) 970–977.
Virtual 3-D Walkthrough for Intelligent
Emergency Response
1 Introduction
Virtual reality technology has introduced a new spatial metaphor with very interesting applications in intelligent navigation, social behavior in virtual worlds, full body interaction, virtual studios, etc. [2]. With the evolution of modern graphics processors, memory bandwidth capabilities and advanced optimization techniques, it is now possible to add realism to real-time 3D graphics.
With the advance of science and technology, the nature of threats keeps changing, so in today's scenario planning and training are of paramount importance. In any such adverse situation, the first and foremost concern is to protect human life, and hence evacuation is the first and most crucial logical step. The local security authorities should not only be familiar with the architectural topology of the campus and its buildings, along with all possible entry/exit points including emergency and makeshift entries, but should also be able to communicate this information to outside agencies in the quickest and most efficient manner.
Generally, important installations across the globe are protected from fire, terrorist attacks and other dire conditions by means of various physical protection systems, such as different types of fire hydrants, CCTV cameras, security personnel, and logging and access control mechanisms for incoming or outgoing individuals or vehicles. These systems seem to be reasonable measures for combating emergency situations, but when access to these systems is also cut off by the prevailing conditions, it becomes very difficult to find a way out. In hostile conditions such as full or partial restriction of normal entry to the premises, the usual practice, and many a time the only option, for the security personnel and decision-makers is to prepare the further course of action for restoring normalcy by referring to 2D plan-elevation layouts of the targeted premises. Sometimes the security personnel are outside agencies that are called in for rescue operations and are unaware of the topology of the campus. In such a context, referring to 2D plan-elevation layouts does not really give much insight; many a time it may even lead to false presumptions.
Let us understand this problem through a scenario. Assume a few people are working on a floor consisting of corridors and rooms in some building of the campus. Figure 1 shows its 2D layout, where three people in red (marked 1), blue (marked 2) and yellow (marked 3) are shown working; it appears that these people can observe each other. Figure 2, however, shows the actual setup, where the red person cannot observe the other two. These figures, in which relevant information is lost in 2D owing to its deficiency of one dimension in comparison to 3D, make a good case for the better usability of 3D layouts over 2D layouts.
The usability of 3D over 2D layouts clearly motivates building a 3D solution for the security personnel and decision-makers so that they can effectively carry out their training, strategic and operational tasks in case of emergency or otherwise. Our basic objective of developing a 3D walkthrough is one such solution provided by virtual reality which, considering natural calamities, sabotage and terrorist threats in today's global scenario, is the need of the hour.
There are various papers on applications of virtual reality, ranging from different physical, biological and chemical simulation systems to 3D walkthroughs. The paper [3] studies virtual natural landscape walkthrough technology, realizes a natural landscape walkthrough using Unity 3D, and describes some common methods of creating the sky, landforms, trees, flowers and water in the virtual walkthrough. The main focus of that paper was on rendering natural shapes rather than man-made objects, which have conventional modeling methods. Rendering natural shapes relies heavily on particle systems and other optimization techniques such as billboarding.
The paper [4] introduces how to realize a driving training simulation system using computer software aimed at car driving training. The paper states that the system can partly replace actual operational training; the designers hope to improve training efficiency and reduce training cost. The main focus of that paper was simulation besides modeling. It also uses a physics library for simulating the interaction between rigid bodies and hence enhances the scope of training efficiency at a lower cost.
In another paper [5], emphasis was placed on the creation of worlds that represent real places or buildings, where the user can access various kinds of information by interacting with objects or avatars while travelling in the virtual space. That paper presented the architecture and implementation of a virtual environment based on the campus of Guangxi University of Technology, using the Java language for the user interfaces, VRML for the 3D scripting and visualization, and HTML for all other multimedia pages (2D visualization) of the system. The main focus of that paper was a 3D walkthrough inside a campus. It presented a web based and platform independent solution but did not say much about the vegetation and terrain of the campus.
By and large, 3D walkthrough applications have been developed to focus either on the larger scenery with the exteriors of campus buildings using 3D modeling, or only on realistic interiors using panoramic photographs. Also, we have not come across any paper addressing their relevance from a security perspective. Therefore, in our approach, we present both interiors and exteriors with near-actual surroundings.
3 Methodology
We have developed a hybrid approach to creating the virtual walkthrough for the premises, in which not only the buildings' exterior facades but also the interiors of the buildings, including the corridors and rooms, are given due importance by modeling the entire architectural composition. The inputs for the development work are the
These models, along with the surrounding vegetation, are placed on the terrain at their actual geographical positions (obtained from Google Maps) in a Unity 3D scene. This scene contains the panoramic rooms in the buildings of the campus along with the major electrical, mechanical and security related works and equipment housed in the respective buildings. It also has a first person controller with which the end-user can navigate throughout the virtual plant using just the mouse and keyboard/joystick.
Software
Modern 3D modeling tools are user-friendly and rich in features, which makes the development of 3D models (of almost all physical entities) reasonably smooth. In our case, we used the Blender software, which is a free and open source 3D animation suite; it is cross-platform and runs equally well on Linux, Windows and Mac computers. Its interface uses OpenGL to provide a consistent experience [6].
Recent game development tools have enabled rapid development of virtual applications with near real-time rendering power. The Unity engine is far and away the dominant development platform for creating games and interactive 3D and 2D experiences such as training simulations and medical and architectural visualizations, across mobile, desktop, web, console and other platforms [7]. Unity comes in two variants: one is free with somewhat limited features, and the other, called Unity Pro, is paid and has comprehensive features [8]. The free version of Unity 3D was selected for the development of our 3D walkthrough.
The built-in materials and scripting in Unity 3D can support most virtual reality application requirements and numerous three-dimensional file formats, such as 3DS MAX, Maya and Cinema 4D, making virtual scene production convenient and quick. DirectX and OpenGL provide a highly optimized graphics rendering pipeline, which makes the simulation results look true to life, and the built-in physics engine can better support the design of driving simulations [4]. With improving software and hardware performance, developers can truly focus on representing the scene when making a landscape walkthrough using Unity 3D [3].
Hardware
4 Results
The application has been tested on various systems having different hardware and software configurations. Figure 7 displays one scene from the application rendered on one of the test workstations. The systems under test were observed to render 40–60 frames per second depending upon their hardware/software configuration, which is considered good rendering performance by any standard.
In this paper, we described the development of a virtual tour product that may be used as a supplement to an actual visit of the building and its campus. In cases where an actual visit is inconvenient or prohibited, it may act as a substitute. The basic objective of developing the 3D walkthrough is to enable offsite security response forces to respond to any emergency situation and to effectively carry out their operational tasks or evacuation planning at such times.
Besides evacuation planning, many intelligent scenarios, based on experience, feedback and actual events, can be built into subsequent versions of this 3D walkthrough. Building models rendered under different lighting conditions, at different times of the day and in different seasons, as well as virtual flood, fire, hostage and radioactivity-release conditions, can be simulated, and the worth of the training can be increased with experience and technological improvement.
References
1. J. Lee, S. Zlatanova, ‘A 3D data model and topological analyses for emergency response in
urban areas’.
2. Nikos A, Spyros V and Themis P, ‘Using virtual reality techniques for the simulation of
physics experiments’, Proceeding of the 4th Systemics, Cybernetics and Informatics
International Conference, Orlando, Florida, USA, 2000, pp 611.
260 N. Saxena and V. Diwan
3. Kuang Yang, Jiang Jie, Shen Haihui, ‘Study on the Virtual Natural Landscape Walkthrough by
Using Unity 3D’, IEEE International Symposium on Virtual Reality Innovation 2011 19–20
March, Singapore.
4. Kuang Yang, Jiang Jie, ‘The designing of training simulation system based on unity 3D’, Int.
Con. on Intelligent Computation Technology and Automation (ICICTA), 2011, Issue Date: 28–
29 March 2011.
5. Ziguang Sun, Qin Wang, Zengfang Zhang, ‘Interactive walkthrough of the virtual campus
based on vrml’.
6. http://www.blender.org/about/.
7. http://unity3d.com/public-relations.
8. http://unity3d.com/unity/licenses.
9. http://www.gearthblog.com/blog/archives/2014/04/google-earth-imagery.html.
Spontaneous Versus Posed Smiles—Can We
Tell the Difference?
Abstract A smile is an irrefutable expression that shows the state of the mind in both true and deceptive ways. Generally, it indicates a happy state of mind; however, smiles can be deceptive: people may smile when they feel happy, and they might also smile (in a different way) when they feel pity for others. This work aims to distinguish spontaneous (felt) smile expressions from posed (deliberate) smiles by extracting and analyzing both the global (macro) motion of the face and subtle (micro) changes in the facial expression features, through tracking a series of facial fiducial markers as well as using dense optical flow. Specifically, the eye and lip features are captured and used for analysis. The aim is to automatically classify all smiles into either 'spontaneous' or 'posed' categories using support vector machines (SVM). Experimental results on the large UvA-NEMO smile database show promising results as compared to other relevant methods.
1 Introduction
People believe that the human face is a mirror showing the internal emotional state of the body as it responds to the external world. This means that what an individual thinks, feels or understands deep inside the brain gets imitated into the outside world through the face [7]. Facial smile expressions undeniably play a huge and pivotal role [1, 11, 25] in understanding social interactions within a community. People often smile in imitation of the internal state of the body. For example, generally, people smile when they are happy or when sudden humorous
The block diagram of our proposed method is shown in Fig. 1. Given smile video sequences of various subjects, we apply facial feature detection and tracking of the fiducial points over the entire smile video clip. Using D-markers, 25 important parameters (such as duration, amplitude, speed and acceleration) are extracted from two important regions of the face: the eyes and the lips. Smile discriminative features are extracted using dense optical flow along the temporal domain from the global (macro) motion and local (micro) motion of the face. All this information is fused, and a support vector machine (SVM) is then used as a classifier on these parameters to distinguish posed and spontaneous smiles.
We use the facial tracking algorithm developed by Nguyen et al. [17] to obtain the fiducial points on the face. The 21 tracking markers are labeled and placed following the convention shown in Fig. 2a. The markers are manually annotated in the first frame of each video by user input, and thereafter the tracker automatically follows them through the remaining frames of the smile video; it has good accuracy and precision compared to other facial tracking software [2]. The markers are placed on important facial feature points such as the eyelids and the corners of the lips for each subject. The convention followed in our approach for selecting the fiducial markers is shown in Fig. 2a.
To reduce inaccuracy due to the subject’s head motion in the video that can cause
change in angle with respect to roll, yaw and pitch rotations, we use the face nor-
malization procedure described in [5]. Let l_i represent each of the feature points used to align the faces, as shown in Fig. 2. Three non-collinear points (the eye centers and the nose tip) are used to form a plane ρ. The eye centers are defined as c_1 = (l_1 + l_3)/2 and
Fig. 2 a Shows the tracked points on the 1st frame, b shows the tracked points on 30th frame,
c shows the tracked points on 58th frame and d shows the tracked points on 72nd frame on one
subject. (Best viewed when zoomed in.)
c_2 = (l_4 + l_6)/2. Angles between the positive normal vector N_ρ of ρ and unit vectors U along the
X (horizontal), Y (vertical), and Z (perpendicular) axes give the relative head pose
as follows:
\theta = \arccos \frac{U \cdot N_\rho}{\|U\|\,\|N_\rho\|}, \quad \text{where } N_\rho = \overrightarrow{l_g c_2} \times \overrightarrow{l_g c_1}.    (1)
\overrightarrow{l_g c_2} and \overrightarrow{l_g c_1} denote the vectors from point l_g to points c_2 and c_1, respectively. \|U\| and \|N_\rho\| represent the magnitudes of the U and N_\rho vectors respectively. Using the
human face configuration, (1) can estimate the exact roll (𝜃z ) and yaw (𝜃y ) angles of
the face with respect to the camera. If we start with the frontal face, the pitch angles
(𝜃x′ ) can be computed by subtracting the initial value. Using the estimated head pose,
tracked fiducial points are normalized with respect to rotation, scale and translation
as follows:
l_i' = \left[ l_i - \frac{c_1 + c_2}{2} \right] R_x(-\theta_x') R_y(-\theta_y) R_z(-\theta_z)\, \frac{100}{\epsilon(c_1, c_2)},    (2)
where li′ is the aligned point. Rx , Ry and Rz denote the 3D rotation matrices for the
given angles. 𝜖() is the Euclidean distance measure. Essentially (1) constructs a nor-
mal vector perpendicular to the plane of the face using three points (nose tip and eye
centers), then calculate the angle formed between X, Y and Z axis with regards to the
normal vector of face plane. Thereafter, (2) process and normalize each and every
point of the frame accordingly and set the interocular distance to 100 pixels with the
middle point acting as the new origin of the face center.
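A minimal sketch of this normalization follows, assuming the tracked fiducial points are available as 3D coordinates and that l_g denotes the nose-tip marker; the axis-to-angle assignment and rotation conventions here are illustrative and may differ from [5].

```python
import numpy as np

def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def normalize_face(points, l_g, theta_x_init=0.0):
    """points: (21, 3) tracked fiducial markers; l_g: nose-tip coordinates (3,)."""
    c1 = (points[0] + points[2]) / 2.0          # eye center from markers l1, l3
    c2 = (points[3] + points[5]) / 2.0          # eye center from markers l4, l6
    n = np.cross(c2 - l_g, c1 - l_g)            # face-plane normal, Eq. (1)

    def angle_with(u):
        return np.arccos(np.dot(u, n) / (np.linalg.norm(u) * np.linalg.norm(n)))

    theta_y = angle_with(np.array([1.0, 0.0, 0.0]))                  # yaw
    theta_z = angle_with(np.array([0.0, 1.0, 0.0]))                  # roll
    theta_x = angle_with(np.array([0.0, 0.0, 1.0])) - theta_x_init   # pitch

    R = rot_x(-theta_x) @ rot_y(-theta_y) @ rot_z(-theta_z)
    origin = (c1 + c2) / 2.0
    scale = 100.0 / np.linalg.norm(c1 - c2)     # interocular distance -> 100 px, Eq. (2)
    return (points - origin) @ R.T * scale
```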
In the first part of our strategy, we focus on extracting the subject's eyelid and lip features. We first construct an amplitude signal based on the facial feature markers in the eyelid regions. We compute the amplitude of eyelid and lip-end movements during a smile using the procedure described in [21]. Eyelid amplitude signals are computed using the eyelid aperture ($D_{eyelid}$) displacement at time t, given by:
where $\kappa(l_i, l_j)$ denotes the relative vertical location function, which equals −1 if $l_j$ is located (vertically) below $l_i$ on the face, and 1 otherwise. The equation above uses the eyelid markers, namely 1–6 as shown in Fig. 2, to construct the amplitude signal that gives the eyelid aperture size in each frame t. The amplitude signal $D_{eyelid}$ is then further processed to obtain a series of features. In addition to the amplitudes, speed and acceleration signals are also extracted by computing the first and second derivatives of the amplitudes.
Smile amplitude is estimated as the mean amplitude of right and left lip corners,
normalized by the length of the lip. Let Dlip (t) be the value of the mean amplitude
signal of the lip corners in the frame t. It is estimated as
$$D_{lip}(t) = \frac{\epsilon\!\left(\frac{l_{10}^t + l_{11}^t}{2},\, l_{10}^t\right) + \epsilon\!\left(\frac{l_{10}^t + l_{11}^t}{2},\, l_{11}^t\right)}{2\,\epsilon\!\left(l_{10}^t,\, l_{11}^t\right)} \qquad (4)$$
where $l_i^t$ denotes the 2D location of the ith point in frame t. For each video of a subject, we acquire a 25-dimensional feature vector based on the eyelid markers and another based on the lip corner points. The onset phase is defined as the longest continuous increase in $D_{lip}$. Similarly, the offset phase is detected as the longest continuous decrease in $D_{lip}$. The apex is defined as the phase between the last frame of the onset and the first frame of the offset. The displacement signals of the eyelids and lip corners can then be calculated using the tracked points. Onset, apex and offset phases of the smile are estimated using the maximum continuous increase and decrease of the mean displacement of the eyelids and lip corners. The D-Marker is then able to extract 25 descriptive features each for the eyelids and the lip corners, so a vector of 50 features is obtained from each frame (using two frames at a time). The features are then concatenated and passed to the SVM for training and classification.
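A small sketch of how the onset, apex and offset phases could be detected from the lip amplitude signal as the longest continuous increase and decrease; the function names and the strict-monotonicity criterion are assumptions.

```python
import numpy as np

def longest_monotone_run(signal, increasing=True):
    """Return (start, end) frame indices of the longest strictly monotone run."""
    best, start = (0, 0), 0
    for t in range(1, len(signal)):
        ok = signal[t] > signal[t - 1] if increasing else signal[t] < signal[t - 1]
        if not ok:
            start = t
        if t - start > best[1] - best[0]:
            best = (start, t)
    return best

def smile_phases(d_lip):
    """d_lip: 1D array of D_lip(t) values over the smile video."""
    onset = longest_monotone_run(d_lip, increasing=True)
    offset = longest_monotone_run(d_lip, increasing=False)
    apex = (onset[1], offset[0])   # between end of onset and start of offset
    return onset, apex, offset
```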
In the second phase of the feature extraction, we use our own proposed dense optical flow [19] to capture both global and local motions appearing in the smile videos. Our approach is divided into four distinct stages that are fully automatic and do not require any human intervention. The first step is to detect, in each frame, whether the face is present. We use our previously developed face detector together with the integration of sketch and graph patterns (ISG) eye and mouth detectors, originally designed for face recognition on wearable devices and human-robot interaction [14, 23]. This yields the region of interest (ROI) for the face (as shown in Fig. 3, left, yellow ROI) with 100 % accuracy on the entire UvA-NEMO smile database [5]. In the second step, we determine the areas corresponding to the
Fig. 3 Left Face, eyes and mouth detections. Yellow ROI for face detection, red ROI for eyes
detection and blue ROI for mouth detection. Middle Two consecutive frames of a subject’s smile
video and Right their optical flows in x- and y-directions. (Best viewed in color and zoomed in.)
right eye and left eye (red ROIs) and the mouth (blue ROI), for which we obtain 96.9 % accuracy on the entire database.
In the third step, the optical flow is computed between the image at time t and the image at time t + 1 of the video sequence (see Fig. 3, middle). The two components of the optical flow are illustrated in Fig. 3, right, which shows the optical flow along the x-axis and along the y-axis. Because we are using a dense optical flow algorithm, the time to process one image is relatively long. To speed up the processing, we compute the optical flow only in the three ROI regions: right eye, left eye and mouth. The optical flow computed in our approach is a pyramidal differential dense algorithm that is based on the following constraint:
where the data attachment term is based on the thresholding method of [24] and the regularization (smoothness) term is based on the method developed by Meyer in [16]; 𝛽 is a weight controlling the ratio between the data attachment term and the regularization term. Ouarti et al. [19] proposed to use a regularization that does not use a usual wavelet but a non-stationary wavelet packet [18], which generalizes the concept of wavelets for extracting optical flow information. We extend this idea to extract fine-grained information for both micro and macro motion variations in smile videos, as shown in Fig. 4. Figure 5 shows the dense optical flows with spontaneous and posed smile variations. In the fourth step, for each of the three ROIs, the median of the optical flow is determined, which gives a cue to the global motion of the area. A 10-bin histogram of the optical flow is then computed, and the top three bins in terms of cardinality are kept. A linear regression is then applied to find the major axis of the point group for each of the three selected bins. In the end, for each ROI we obtain the median value of bins 1, 2 and 3, together with the intercept and slope of the regression line fitted to the points of bins 1, 2 and 3. This results in 60 features for each frame (using two consecutive frames of a smile video). An SVM is then used on these features to classify the posed and spontaneous smiles.
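An illustrative sketch of these per-ROI flow statistics (median flow as a global-motion cue, top-three histogram bins, and a line fit per bin); the exact quantity that is histogrammed, the per-bin medians and the resulting feature count are assumptions, since only the overall recipe is described above.

```python
import numpy as np

def roi_flow_features(flow_x, flow_y, n_bins=10, top_k=3):
    """flow_x, flow_y: optical-flow components inside one ROI (H x W arrays)."""
    mag = np.hypot(flow_x, flow_y)
    feats = [np.median(flow_x), np.median(flow_y)]        # global-motion cue of the ROI
    counts, edges = np.histogram(mag, bins=n_bins)
    top_bins = np.argsort(counts)[::-1][:top_k]           # top-3 bins by cardinality
    ys, xs = np.indices(mag.shape)
    for b in top_bins:
        mask = (mag >= edges[b]) & (mag <= edges[b + 1])
        feats.append(np.median(mag[mask]) if mask.any() else 0.0)
        if mask.sum() >= 2:                               # major axis via linear regression
            slope, intercept = np.polyfit(xs[mask], ys[mask], 1)
        else:
            slope, intercept = 0.0, 0.0
        feats.extend([slope, intercept])
    return np.asarray(feats)                              # 2 + 3*3 = 11 features per ROI here
```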
The major advantage of this approach is that we can obtain useful smile discriminative features through a fully automatic analysis of the videos; no markers need to be annotated by an operator/user. Moreover, rather than attempting to classify the raw optical flow, we design some processing to obtain a sparse representation of the optical
Fig. 4 Original images and their dense optical flows with their corresponding micro and macro
motion variations of a subject. (Best viewed in color and zoomed in.)
Fig. 5 Original images and their dense optical flows with their corresponding spontaneous and
posed smiles variations of a subject. (Best viewed in color and zoomed in.)
flow signal. This representation helps classification by extracting only the useful information in low dimensions and speeds up the SVM computation. Finally, the extracted information is not tightly tied to the exact positioning of the different ROIs, since this positioning may vary from one frame to another and depends on depth and on the individual. A processing scheme too closely tied to the exact choice of ROI would therefore lead to inconsistent results.
3 Experimental Results
We test our proposed algorithm on the UvA-NEMO Smile Database [5], which is the largest and most extensive smile database (both posed and spontaneous), with videos from a total of 400 subjects (185 female, 215 male) aged between 8 and 76 years, giving a total of 1240 individual videos. Each video consists of a short segment of 3–8 s. The videos are extracted into frames at 50 frames per second. The extracted frames are also converted to grayscale and downsized to 480 × 270. In all the experiments, we split the database so that 80 % is used as training samples and the remaining 20 % as testing samples. A binary SVM classifier with a radial basis function kernel and default parameters as in LIBSVM [12] is used to form a hyperplane based on the training samples. When a new testing sample is passed to the SVM, it uses the hyperplane to determine which class the new sample falls under. This process is repeated 5 times using a 5-fold cross-validation method. To measure the subtle differences between spontaneous and posed smiles, we compute the confusion matrices between the two smile classes so as to report the accuracy for actual versus classified samples of each class separately. The results from all 5 folds are averaged and shown in Tables 1, 2, 3, 4 and 5, and compared with other methods in Table 6.
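A minimal sketch of this protocol using scikit-learn's SVC in place of the LIBSVM binary (same RBF kernel and default parameters); X is the fused feature matrix and y the posed/spontaneous labels.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

def evaluate(X, y):
    """5-fold cross-validated accuracy and row-normalized confusion matrix (in %)."""
    clf = SVC(kernel="rbf")                       # default C and gamma, as in LIBSVM
    y_pred = cross_val_predict(clf, X, y, cv=5)   # 5-fold cross validation
    cm = confusion_matrix(y, y_pred, normalize="true") * 100.0
    acc = (y_pred == y).mean() * 100.0
    return acc, cm
```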
Table 1  The overall accuracy (%) in classifying spontaneous and posed smiles using only the eyes features is 71.14 %. Values in brackets (·) show the accuracy using only the lips features, 73.44 %

Actual         Classified as spontaneous   Classified as posed
Spontaneous    60.1 (67.5)                 39.9 (32.5)
Posed          17.5 (20.4)                 82.5 (79.6)
Table 2  The overall accuracy (%) in classifying spontaneous and posed smiles using the combined features from eyes and lips is 74.68 % (rows are actual classes, columns are classified results)

Actual         Classified as spontaneous   Classified as posed
Spontaneous    65.3                        34.7
Posed          16.3                        83.7
Table 3  The accuracy (%) in classifying spontaneous and posed smiles using our proposed X-direction dense optical flow is 59 %. Values in brackets (·) show the accuracy using our proposed Y-direction, 63.8 %

Actual         Classified as spontaneous   Classified as posed
Spontaneous    57.8 (58.3)                 42.2 (41.7)
Posed          39.8 (30.8)                 60.2 (69.2)
Table 4  The accuracy (%) in classifying spontaneous and posed smiles using our proposed fully automatic system with X- and Y-directions of dense optical flow is 56.6 %

Actual         Classified as spontaneous   Classified as posed
Spontaneous    58.0                        42.0
Posed          45.1                        54.9
Table 1 shows the accuracy rates in distinguishing spontaneous smiles from posed ones using the eyes features, with the corresponding rates for the lips features in brackets (·). The results show that the eye features play a very crucial role in detecting posed smiles, whereas the lip features are more important for spontaneous smiles. Overall, we obtain accuracies of 71.14 and 73.44 % using the eyes and lips features, respectively. Table 2 shows the classification performance using the combined features from eyes and lips. It is evident from the table that, using these facial component features, posed smiles can be classified better than spontaneous ones.
Table 5  The accuracy (%) in classifying spontaneous and posed smiles using our proposed fused approach, comprising both features from facial components and dense optical flow, is 80.4 %

Actual         Classified as spontaneous   Classified as posed
Spontaneous    83.6                        16.4
Posed          22.9                        77.1
We use the dense optical flow features described in Sect. 2.3; the movement in both X- and Y-directions is recorded between every pair of consecutive frames of each video. The confusion matrices are shown in Table 3 (with Y-direction results in brackets (·)) and Table 4. It can be seen from the tables that the performance of optical flow is lower than that of the component based approach. However, the facial component based feature extraction method requires user initialization to find and track the fiducial points, whereas the dense optical flow features are fully automatic. They do not require any user intervention, so they are more useful for practical applications such as first-person-view (FPV) or egocentric vision on wearable devices like Google Glass for improving real-time social interactions [9, 14].
We combine all the features obtained from the facial component based parameters and the dense optical flow into a single vector and apply the SVM. Table 5 shows the resulting confusion matrix for spontaneous and posed smiles. It can be seen that the classification performance for spontaneous smiles improves when the dense optical flow features are added. The experimental results in Table 5 show that both the facial component features and the dense optical flow features are important for improving the overall accuracy. Features from facial components (as shown in Table 2) are useful for encoding information arising from the muscle artifacts within a face, whereas the regularized dense optical flow features help in encoding fine-grained information for both micro and macro motion variations in smile videos. Combining them therefore improves the overall accuracy.
Correct classification rates (%) using various methods on UvA-NEMO are shown in Table 6. It is evident from the table that our proposed approach is quite competitive compared with the other state-of-the-art methodologies.
4 Conclusions
References
1. Ambadar, Z., Cohn, J., Reed, L.: All smiles are not created equal: Morphology and timing of smiles perceived as amused, polite, and embarrassed/nervous. Journal of Nonverbal Behavior 33, 17–34 (2009)
2. Asthana, A., Zafeiriou, S., Cheng, S., Pantic, M.: Incremental face alignment in the wild. In:
CVPR. Columbus, Ohio, USA (2014)
3. Cohn, J., Schmidt, K.: The timing of facial motion in posed and spontaneous smiles. Intl J.
Wavelets, Multiresolution and Information Processing 2, 1–12 (2004)
4. Dibeklioglu, H., Valenti, R., Salah, A., Gevers, T.: Eyes do not lie: Spontaneous versus posed
smiles. In: ACM Multimedia. pp. 703–706 (2010)
5. Dibeklioglu, H., Salah, A.A., Gevers, T.: Are you really smiling at me? spontaneous versus
posed enjoyment smiles. In: IEEE ECCV. pp. 525–538 (2012)
6. Ekman, P.: Telling lies: Cues to deceit in the marketplace, politics, and marriage. WW. Norton
& Company, New York (1992)
7. Ekman, P., Hager, J., Friesen, W.: The symmetry of emotional and deliberate facial actions.
Psychophysiology 18, 101–106 (1981)
8. Ekman, P., Rosenberg, E.: What the Face Reveals: Basic and Applied Studies of Spontaneous
Expression Using the Facial Action Coding System. Second ed. Oxford Univ. Press (2005)
9. Gan, T., Wong, Y., Mandal, B., Chandrasekhar, V., Kankanhalli, M.: Multi-sensor self-
quantification of presentations. In: ACM Multimedia. pp. 601–610. Brisbane, Australia (Oct
2015)
10. He, M., Wang, S., Liu, Z., Chen, X.: Analyses of the differences between posed and sponta-
neous facial expressions. Humaine Association Conference on Affective Computing and Intel-
ligent Interaction pp. 79–84 (2013)
11. Hoque, M., McDuff, D., Picard, R.: Exploring temporal patterns in classifying frustrated and
delighted smiles. IEEE Trans. Affective Computing 3, 323–334 (2012)
12. Hsu, C., Chang, C., Lin, C.: A practical guide to support vector classification (2010)
13. Huijser, M., Gevers, T.: The influence of temporal facial information on the classification of
posed and spontaneous enjoyment smiles. Tech. rep., Univ. of Amsterdam (2014)
14. Mandal, B., Ching, S., Li, L., Chandrasekha, V., Tan, C., Lim, J.H.: A wearable face recogni-
tion system on google glass for assisting social interactions. In: 3rd International Workshop on
Intelligent Mobile and Egocentric Vision, ACCV. pp. 419–433 (Nov 2014)
15. Mandal, B., Eng, H.L.: Regularized discriminant analysis for holistic human activity recogni-
tion. IEEE Intelligent Systems 27(1), 21–31 (2012)
16. Meyer, Y.: Oscillating patterns in image processing and in some nonlinear evolution equations.
The Fifteenth Dean Jacquelines B. Lewis Memorial Lectures, American Mathematical Society
(2001)
17. Nguyen, T., Ranganath, S.: Tracking facial features under occlusions and recognizing facial
expressions in sign language. In: International Conference on Automatic Face & Gesture
Recognition. vol. 6, pp. 1–7 (2008)
18. Ouarti, N., Peyre, G.: Best basis denoising with non-stationary wavelet packets. In: International Conference on Image Processing. vol. 6, pp. 3825–3828 (2009)
19. Ouarti, N., SAFRAN, A., LE, B., PINEAU, S.: Method for highlighting at least one moving
element in a scene, and portable augmented reality (Aug 22 2013), http://www.google.com/
patents/WO2013121052A1?cl=en, wO Patent App. PCT/EP2013/053,216
20. Pfister, T., Li, X., Zhao, G., Pietikainen, M.: Differentiating spontaneous from posed facial
expressions within a generic facial expression recognition framework. In: ICCV Workshop.
pp. 868–875 (2011)
21. Schmidt, K., Bhattacharya, S., Denlinger, R.: Comparison of deliberate and spontaneous facial movement in smiles and eyebrow raises. Journal of Nonverbal Behavior 33, 35–45 (2009)
22. Valstar, M., Pantic, M.: How to distinguish posed from spontaneous smiles using geometric
features. In: In Proceedings of ACM ICMI. pp. 38–45 (2007)
23. Yu, X., Han, W., Li, L., Shi, J., Wang, G.: An eye detection and localization system for natural
human and robot interaction without face detection. TAROS pp. 54–65 (2011)
24. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime tv-l1 optical flow. In: In
Ann. Symp. German Association Patt. Recogn. pp. 214–223 (2007)
25. Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A survey of affect recognition methods:
Audio, visual, and spontaneous expressions. PAMI 31(1), 39–58 (2009)
Handling Illumination Variation:
A Challenge for Face Recognition
Abstract Though impressive recognition rates have been achieved with various techniques under controlled face-image capturing environments, making recognition reliable under uncontrolled environments is still a great challenge. Security and surveillance images, captured in open uncontrolled environments, are likely subject to extreme lighting conditions, such as underexposed and overexposed areas, that reduce the amount of useful detail available in the collected face images. This paper explores two different preprocessing methods and compares the effect of enhancement on recognition results using Orthogonal Neighbourhood Preserving Projection (ONPP) and Modified ONPP (MONPP), which are subspace based methods. Note that subspace based face recognition techniques have been highly sought after in recent times. Experimental results on preprocessing techniques followed by face recognition using ONPP and MONPP are presented.
1 Introduction
Although face recognition algorithms perform exceptionally well under controlled illumination environments, it is still a major challenge to reliably recognize a face under pose, expression, age and illumination variations. Illumination variation is the most common variation while capturing face images. Lighting conditions, camera position and face position all lead to changes in illumination. Such illumination changes are a major source of recognition errors, especially for appearance based techniques. The task of a face recognition algorithm is to identify an individual accurately despite such illumination variations. Two face images of the same person can appear visually very different under various illumination intensities and directions [1]. It has been shown that the variations between two face images of the same person captured under different illumination conditions can be larger than the variations between face images of two different persons, which makes face recognition under illumination variation a difficult task. To handle such cases, several approaches are used, such as preprocessing and normalization, invariant feature extraction or face modeling [2]. In preprocessing based methods, several image processing techniques are applied to an image to nullify illumination effects to some extent. Gamma correction, histogram equalization [3, 4] and logarithm transforms [5] are some of these image processing techniques.
In this paper, two different preprocessing techniques are applied to compensate for illumination variation in face images taken under extreme lighting conditions, and the recognition performance is compared using Orthogonal Neighbourhood Preserving Projection (ONPP) [6] and Modified ONPP (MONPP) [7]. ONPP and MONPP are mainly dimensionality reduction techniques which learn the data manifold using subspace analysis. Recently, both ONPP and MONPP have been used effectively for the face recognition task [6, 7]. Detailed experiments on preprocessing to nullify the illumination variation for face recognition have been performed on benchmark face databases such as the extended Yale-B database [8] and the CMU PIE face database [9]. Face recognition results of ONPP and MONPP are compared and presented.
In the next section, the preprocessing techniques are explained in detail, followed by the dimensionality reduction algorithm MONPP in Sect. 3. Section 4 consists of experimental results, followed by the conclusion in Sect. 5.
2 Preprocessing
For better face recognition under uncontrolled and varying lighting conditions, the features useful for discrimination between two different faces need to be preserved. The shadows created in face images due to different lighting directions result in a loss of facial features useful for recognition. A preprocessing method must increase the intensity in areas that are under-exposed (poorly illuminated) and lower the intensity in areas that are over-exposed (highly illuminated), while keeping moderately illuminated areas intact. The following two subsections discuss two different preprocessing techniques.
An enhancement technique for colour images proposed in [10] takes care of such extreme illumination conditions using a series of operations and a nonlinear intensity transformation performed on the images to enhance a colour image for better visual perception. The intensity transformation function is based on previous research suggested in [11].
In this paper, we have tested the nonlinear intensity transformation based on an inverse sine function on grayscale face images having high illumination irregularities, for better recognition. This nonlinear enhancement technique is a pixel-by-pixel approach where the enhanced intensity value is computed using the inverse sine function with a locally tunable parameter based on the neighbourhood pixel values. The intensity range of the image is rescaled to [0, 1], followed by a nonlinear transfer function (Eq. 1):

$$I_{enh}(x, y) = \frac{2}{\pi} \sin^{-1}\!\left(I_n(x, y)^{\,q/2}\right) \qquad (1)$$
where $I_n(x, y)$ is the normalized intensity value at pixel location (x, y) and q is the locally tunable control parameter. In darker areas, where the intensity needs to be increased, the value of q should be less than 1, while in over-bright areas, where the intensity needs to be suppressed, the value of q should be greater than 1. Figure 1 shows the transformation function for values of q ranging from 0.2 to 5 over the intensity range [0, 1]. The red curve shows the transformation for q equal to 1, the green curves show q less than 1, which enhances darker regions of the image, and the blue curves show q greater than 1, which suppresses higher intensities in over-exposed regions of the image. The curve of the transformation function used for a pixel is decided by the value of q based on its neighbourhood.
The value of q is decided by a tangent function based on the mean normalized intensity values, which are determined by averaging three Gaussian-filtered smooth images. These smooth images are obtained using three different Gaussian kernels of size $M_i \times N_i$. Normalized Gaussian kernels are created as follows:

$$kernel_i(n_1, n_2) = \frac{h_g(n_1, n_2)}{\sum_{n_1}\sum_{n_2} h_g(n_1, n_2)} \qquad (2)$$

$$h_g(n_1, n_2) = e^{-\frac{n_1^2 + n_2^2}{2\sigma^2}} \qquad (3)$$
where the ranges of $n_1$ and $n_2$ are $[-\lfloor M_i/2 \rfloor, \lfloor M_i/2 \rfloor]$ and $[-\lfloor N_i/2 \rfloor, \lfloor N_i/2 \rfloor]$, respectively. Here, the window size $M_i \times N_i$ is set to 6 × 6, 10 × 10 and 14 × 14, experimentally. The symbol i indicates which Gaussian kernel is being used and $\sigma$ is set to $0.3(M_i/2 - 1) + 0.8$.
The Gaussian mean intensity at pixel (x, y) for kernel i is calculated by convolving the image with the normalized kernel:

$$I_{G_i}(x, y) = \sum_{n_1 = -\lfloor M_i/2 \rfloor}^{\lfloor M_i/2 \rfloor}\; \sum_{n_2 = -\lfloor N_i/2 \rfloor}^{\lfloor N_i/2 \rfloor} kernel_i(n_1, n_2)\, I(x + n_1, y + n_2) \qquad (4)$$
The mean intensity image $I_{Mn}$ is then obtained by averaging these three filtered images. The mean intensity value is normalized to the range [0, 1] and, based on the intensity value of $I_{Mn}$ at location (x, y), the tunable parameter q is determined using

$$q = \begin{cases} \tan\!\left(\frac{\pi}{C_1} I_{Mn}(x, y)\right) + C_2, & I_{Mn}(x, y) \geq 0.3 \\[4pt] \frac{1}{C_3} \ln\!\left(\frac{1}{0.3} I_{Mn}(x, y)\right) + C_4, & I_{Mn}(x, y) < 0.3 \end{cases} \qquad (5)$$
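A rough sketch of this enhancement pipeline (Gaussian mean image, locally tuned q, inverse-sine transfer of Eq. 1), assuming the piecewise form of Eq. 5 reconstructed above; the constants C1–C4 are placeholders, since their values are not given in this excerpt.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ltisn_enhance(image, C1=4.0, C2=0.5, C3=5.0, C4=0.5):
    """image: 2D grayscale array. C1-C4 are illustrative placeholder constants."""
    I = image.astype(float)
    I_n = (I - I.min()) / (I.max() - I.min() + 1e-12)         # rescale to [0, 1]

    # Mean of three Gaussian-smoothed images (window sizes 6, 10 and 14),
    # with sigma = 0.3*(M/2 - 1) + 0.8 as stated in the text.
    sigmas = [0.3 * (m / 2 - 1) + 0.8 for m in (6, 10, 14)]
    I_mn = np.mean([gaussian_filter(I_n, s) for s in sigmas], axis=0)
    I_mn = (I_mn - I_mn.min()) / (I_mn.max() - I_mn.min() + 1e-12)

    # Locally tunable parameter q (Eq. 5)
    q = np.where(I_mn >= 0.3,
                 np.tan(np.pi / C1 * I_mn) + C2,
                 np.log(np.clip(I_mn, 1e-6, None) / 0.3) / C3 + C4)

    # Inverse-sine transfer function (Eq. 1)
    return (2.0 / np.pi) * np.arcsin(np.clip(I_n ** (q / 2.0), 0.0, 1.0))
```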
Spurious edge-like features caused by shading introduce false information for recognition. Band-pass filtering can help to retain useful information in face images while getting rid of unwanted information or misleading spurious edge-like features due to shadows. DoG filtering is a suitable way to achieve such a band-pass behavior. As the name suggests, DoG is basically a difference of two 2D Gaussian filters $G_{\sigma_1}$ and $G_{\sigma_2}$ with different variances (the outer mask is normally 2–3 times broader than the inner mask). The inner Gaussian $G_{\sigma_2}$ is typically quite narrow (usually variance $\sigma_2 \leq 1$ pixel, essentially acting as a high-pass filter), while the outer Gaussian $G_{\sigma_1}$ is 2–4 pixels wider, depending on the spatial frequency at which low-frequency information becomes misleading rather than informative. The values of $\sigma_1$ and $\sigma_2$ are set to 1 and 2, respectively, based on experiments carried out on face databases [12].
The DoG as an operator or convolution kernel is defined as

$$DoG \cong G_{\sigma_1} - G_{\sigma_2} = \frac{1}{\sqrt{2\pi}}\left(\frac{1}{\sigma_1} e^{-\frac{x^2 + y^2}{2\sigma_1^2}} - \frac{1}{\sigma_2} e^{-\frac{x^2 + y^2}{2\sigma_2^2}}\right)$$
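A small sketch of DoG-based preprocessing with the σ values quoted above (1 and 2 pixels); the final rescaling step is an assumption made so the filtered image can be fed to the recognition stage like the original one.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_preprocess(image, sigma1=1.0, sigma2=2.0):
    """Difference-of-Gaussians band-pass filtering of a grayscale face image."""
    I = image.astype(float)
    dog = gaussian_filter(I, sigma1) - gaussian_filter(I, sigma2)  # narrow minus broad
    # Rescale to [0, 1] before feature extraction and recognition.
    return (dog - dog.min()) / (dog.max() - dog.min() + 1e-12)
```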
MONPP is a two-step algorithm where, in the first step nearest neighbours for
each data point are sought and the data point is expressed as a linear combination
of these neighbors. In the second step, the data compactness is achieved through a
minimization problem in the projected space.
Let x1 , x2 , ...., xn be the given data points in m-dimensional space (xi ∈ Rm ). So,
the data matrix is X = [x1 , x2 , ...., xn ] ∈ Rm×n . The basic task of subspace based
methods is to find an orthogonal/non-orthogonal projection matrix Vm×d such that
Y = V T X, where Y ∈ Rd×n is the embedding of X in lower dimension as d is assumed
to be less than m.
For each data point $x_i$, nearest neighbors are selected in either of two ways: (1) k neighbors are selected by the Nearest Neighbor (NN) technique, where k is a suitably chosen parameter (k nearest neighbors), or (2) all neighbors within a distance $\varepsilon$ of the data point are selected ($\varepsilon$ neighbors). Let $\mathcal{N}_i$ denote the set of k nearest neighbors of $x_i$. In the first step, the data point $x_i$ is expressed as a linear combination of its k neighbors. Let $\sum_{j=1}^{k} w_{ij} x_j$ be the linear combination of the neighbors $x_j \in \mathcal{N}_i$ of $x_i$. The weights $w_{ij}$ are calculated by minimizing the reconstruction error, i.e., the error between $x_i$ and its reconstruction from the linear combination of the neighbours $x_j \in \mathcal{N}_i$:
$$\arg\min_{W} \; \frac{1}{2}\sum_{i=1}^{n} \Big\| x_i - \sum_{j=1}^{k} w_{ij} x_j \Big\|^2 \qquad (7)$$

subject to $w_{ij} = 0$ if $x_j \notin \mathcal{N}_i$ and $\sum_{j=1}^{k} w_{ij} = 1$.
In traditional ONPP, a closed-form solution of Eq. 7 can be obtained by solving a least-squares problem, resulting in linear weights for each of the nearest neighbours. MONPP incorporates a nonlinear weighting scheme using the following equation:

$$w_{i:} = \frac{Z e}{e^{T} Z e} \qquad (8)$$
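A sketch of the closed-form least-squares weights used in traditional ONPP/LLE (the linear solution of Eq. 7, not the nonlinear weighting of Eq. 8, whose matrix Z is not defined in this excerpt); the regularization term is an assumption for numerical stability.

```python
import numpy as np

def reconstruction_weights(X, k=10, reg=1e-3):
    """X: (m, n) data matrix, one sample per column. Returns the (n, n) weight matrix W."""
    m, n = X.shape
    W = np.zeros((n, n))
    d2 = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)   # pairwise squared distances
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]                     # k nearest neighbours (skip self)
        Z = X[:, nbrs] - X[:, [i]]                            # neighbours centred on x_i
        G = Z.T @ Z                                           # local Gram matrix
        G += reg * np.trace(G) * np.eye(k)                    # regularize for stability
        w = np.linalg.solve(G, np.ones(k))                    # least-squares weights
        W[i, nbrs] = w / w.sum()                              # enforce sum-to-one constraint
    return W
```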
The effect of LTISN and DoG based enhancement on recognition rates using ONPP and MONPP is compared with the recognition rates obtained without any preprocessing and reported in this section. For unbiased results, the experiments are carried out on 10 different realizations from the extended Yale-B [8] and CMU-PIE [9] face databases. The experiment is performed on 2432 frontal face images of 38 subjects, each with 64 illumination conditions. Images are resized to 60 × 40 to reduce computation. Figure 2 shows face images of a person under 24 different illumination directions along with the preprocessed images using LTISN enhancement and DoG enhancement, respectively.
Figure 3 (left and right) shows the average recognition results of the LTISN and DoG based enhancement techniques, respectively, combined with ONPP and MONPP for varying numbers of nearest neighbours (k = 10, 15 and 20). The best recognition result, achieved with MONPP + LTISN, is 99.84 % at 110 dimensions, as listed in Table 1.
Fig. 2 Face images from the Yale-B database (left), enhanced images using LTISN (middle), enhanced images using DoG (right)
Fig. 3 Results of recognition accuracy (in %) using LTISN (left) and DoG (right) with ONPP and MONPP on extended Yale-B
Fig. 4 Results of recognition accuracy (in %) using LTISN (left) and DoG (right) with ONPP and MONPP on CMU-PIE
5 Conclusion
References
1. Y Adini, Y Moses, and S Ullman. Face recognition: The problem of compensating for changes
in illumination direction. Pattern Analysis and Machine Intelligence, IEEE Transactions on,
19(7):721–732, 1997.
2. W Chen, M J Er, and S Wu. Illumination compensation and normalization for robust face recog-
nition using discrete cosine transform in logarithm domain. Systems, Man, and Cybernetics,
Part B: Cybernetics, IEEE Transactions on, 36(2):458–466, 2006.
3. S M Pizer, E P Amburn, J D Austin, R Cromartie, A Geselowitz, T Greer, B Romeny, J B
Zimmerman, and K Zuiderveld. Adaptive histogram equalization and its variations. Computer
vision, graphics, and image processing, 39(3):355–368, 1987.
4. S Shan, W Gao, B Cao, and D Zhao. Illumination normalization for robust face recognition
against varying lighting conditions. In Analysis and Modeling of Faces and Gestures. IEEE
International Workshop on, pages 157–164. IEEE, 2003.
5. M Savvides and BVK V Kumar. Illumination normalization using logarithm transforms for
face authentication. In Audio-and Video-Based Biometric Person Authentication, pages 549–
556. Springer, 2003.
6. E Kokiopoulou and Y Saad. Orthogonal neighborhood preserving projections: A projection-
based dimensionality reduction technique. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 29(12):2143–2156, 2007.
7. P Koringa, G Shikkenawis, S K Mitra, and SK Parulkar. Modified orthogonal neighborhood
preserving projection for face recognition. In Pattern Recognition and Machine Intelligence,
pages 225–235. Springer, 2015.
8. A S Georghiades, P N Belhumeur, and D J Kriegman. From few to many: Illumination cone
models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. Mach.
Intelligence, 23(6):643–660, 2001.
9. T Sim, S Baker, and M Bsat. The cmu pose, illumination, and expression database. In Automatic
Face and Gesture Recognition. Proceedings. Fifth IEEE International Conference on, pages
46–51. IEEE, 2002.
10. E Krieger, VK Asari, and S Arigela. Color image enhancement of low-resolution images
captured in extreme lighting conditions. In SPIE Sensing Technology + Applications, pages
91200Q–91200Q. International Society for Optics and Photonics, 2014.
11. S Arigela and VK Asari. Self-tunable transformation function for enhancement of high contrast
color images. Journal of Electronic Imaging, 22(2):023010–023010, 2013.
12. X Tan and B Triggs. Enhanced local texture feature sets for face recognition under difficult
lighting conditions. Image Processing, IEEE Transactions on, 19(6):1635–1650, 2010.
Bin Picking Using Manifold Learning
Abstract Bin picking using vision based sensors requires accurate estimation of the location and pose of the object for positioning the end effector of the robotic arm. The computational burden and complexity depend upon the parametric model adopted for the task. Learning based techniques that implement the scheme using low dimensional manifolds offer computationally more efficient alternatives. In this paper we have employed Locally Linear Embedding (LLE) and Deep Learning (with autoencoders) for manifold learning in the visual domain as well as for the parameters of the robotic manipulator for visual servoing. Images of clusters of cylindrical pellets were used as the training data set in the visual domain. The corresponding parameters of the six-degrees-of-freedom robot for picking a designated cylindrical pellet formed the training dataset in the robotic configuration space. The correspondence between the weight coefficients of the LLE manifold in the visual domain and the robotic domain is established through regression. Autoencoders in conjunction with feed forward neural networks were used for learning the correspondence between the high dimensional visual space and the low dimensional configuration space. We have compared the results of the two implementations on the same dataset and found that manifold learning using autoencoders resulted in better performance. The eye-in-hand configuration used with a KUKA KR5 robotic arm and a Basler camera offers a potentially effective and efficient solution to the bin picking problem through learning based visual servoing.
1 Introduction
estimation are some of the areas in which the use of learning based techniques has been found to yield significant results. Identification based on the appearance of an object under varying illumination and pose is an important problem in visual servoing. Towards this, manifold based representation of data plays a significant role in solving this problem. One of the early implementations of this idea was by Nayar and Murase [5]. In these manifold based approaches, a large data set of images of the object is created. Image features are extracted or, alternatively, images are compressed to lower dimensions. The dataset thus obtained is represented as a manifold. The test input (image of the unknown object) is projected onto the manifold after dimensionality reduction/feature extraction. Its position on the manifold helps identification and pose estimation. There are separate manifolds for object identification and pose estimation. A major limitation of this approach is its large computational complexity. Interpolation between two points on a manifold is also an added complexity. A bunch based method with a shape descriptor model, establishing low dimensional pose manifolds capable of distinguishing similar poses of different objects into the corresponding classes with a neural network-based solution, has been proposed by Kouskouridas et al. [13]. LLE was considered a potentially appropriate algorithm for manifold learning in our problem as it discovers nonlinear structure in high dimensional data by exploiting the local symmetries of linear reconstructions. Besides, the use of deep learning with autoencoders is considered an efficient approach for non-linear dimensionality reduction. The two algorithms have been used as diverse techniques of manifold learning for an open loop system and for comparison of results.
3 Manifold Learning
Manifold learning in computer vision has of late been regarded as one of the most efficient techniques for learning based applications that involve dimensionality reduction, noise handling, etc. The main aim of non-linear dimensionality reduction is to embed data that originally lies in a high dimensional space into a lower dimensional space, while preserving characteristic properties of the original data. Some of the popular algorithms for manifold modelling are Principal Component Analysis (PCA), Classical Multi-Dimensional Scaling (CMDS), Isometric Mapping (ISOMAP) and Locally Linear Embedding (LLE). In this paper we have also considered deep learning techniques as a potentially efficient algorithm for manifold learning.
Fig. 1 Dimensionality
reduction with autoencoders
The training image set in the visual domain consists of N images, each normalized to size m × n. Every training image is converted to a D-dimensional (D = mn) vector. These N vectors (each of dimension D) are concatenated to form a D × N matrix. In order to find the K nearest neighbours for each of the N vectors, the Euclidean distance to each of the other data points is calculated. The data points with the K shortest Euclidean distances are picked as the nearest neighbours. Reconstruction weights $W_{ij}$ are computed by minimizing the cost function given by Eq. 1. The two constraints for the computation of $W_{ij}$ are that $W_{ij} = 0$ if $x_j$ is not among the K nearest neighbours of $x_i$, and

$$\sum_{j} W_{ij} = 1 \qquad (4)$$
For every bin picking action, the six-degrees-of-freedom robot parameters can be expressed as a six dimensional vector,

$$P = [X \; Y \; Z \; A \; B \; C]^{T} \qquad (5)$$

$$Q = [A_1 \; A_2 \; A_3 \; A_4 \; A_5 \; A_6]^{T} \qquad (6)$$

Here $A_1, A_2, \ldots, A_6$ are the six joint angles of the six-DOF robot system. In our experiment, with reference to the N pellet-clusters expressed as n data points, there would be N different positions of the robot end-effector for picking up a pellet.
Hence, in the robot domain the data set would consist of N six-dimensional vectors. These vectors in the robot domain can also be mapped onto a locally linear embedding in terms of K nearest neighbours, as in the case of the visual domain.
The basic premise in this case is the correspondence between the selected data-point in the visual domain and the corresponding parameters in the robot domain. As shown in the flowchart in Fig. 3, in the learning phase of the algorithm, the K nearest neighbours found for each data-point in the visual domain are also applied to the corresponding data-points in the robot domain. However, the reconstruction weights in the two domains could differ. Therefore, reconstruction weights in the robot domain were computed based on the nearest neighbours of the visual domain. In order to establish the correspondence between the reconstruction weights in the visual domain and those in the robot domain, a Support Vector Regression (SVR) based learning algorithm [14] has been used. In the testing phase, the candidate image is mapped onto the LLE manifold to find its K nearest neighbours and the corresponding weights. The weight vector is then used as input to the SVR algorithm to predict the corresponding reconstruction weights in the robot domain. The resultant end-effector coordinates in the robot domain pertaining to the test image are computed by applying these reconstruction weights to the corresponding nearest neighbours in the training data-set of the robot domain.
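A hedged sketch of this testing-phase mapping, using scikit-learn's SVR (one regressor per output weight) in place of the LIBSVM-based implementation of [14]; the variable names and the re-normalization step are assumptions.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

def train_weight_mapping(W_visual, W_robot):
    """Learn visual-domain LLE weights -> robot-domain LLE weights.

    W_visual, W_robot: (n_samples, K) arrays of reconstruction weights."""
    model = MultiOutputRegressor(SVR(kernel="rbf"))
    model.fit(W_visual, W_robot)
    return model

def predict_end_effector(model, w_visual, neighbour_ids, robot_params):
    """robot_params: (N, 6) training vectors [X Y Z A B C]; neighbour_ids: K indices."""
    w_robot = model.predict(w_visual.reshape(1, -1)).ravel()
    w_robot /= w_robot.sum()                        # keep the sum-to-one constraint
    return w_robot @ robot_params[neighbour_ids]    # weighted combination of neighbours
```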
The objective of training the autoencoder is to learn these parameters for representing the input data through h and reconstructing it from h.
In our experiments with deep learning, as in the case of LLE, the input data consisted of 500 images. Each low resolution image was of size 39 × 33 pixels, thus forming a 1287-dimensional vector when transformed into vector form. Thus, the input layer of the feed forward neural network consisted of a 1287-dimensional vector. The end-effector coordinates of the robot were taken as a two-dimensional output (the X and Y world coordinates, as discussed below).
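A compact, NumPy-only sketch of an autoencoder of this kind (a single hidden layer h rather than a stacked deep model); the 1287-dimensional input follows the text, while the hidden size, learning rate and training loop are assumptions. The learned code h would then be fed to a feed-forward regressor for the end-effector coordinates.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class TinyAutoencoder:
    """Single hidden layer h of size d, trained with plain gradient descent."""
    def __init__(self, n_in=1287, d=64, lr=0.1):
        self.W1 = rng.normal(0.0, 0.01, (n_in, d))
        self.b1 = np.zeros(d)
        self.W2 = rng.normal(0.0, 0.01, (d, n_in))
        self.b2 = np.zeros(n_in)
        self.lr = lr

    def encode(self, X):
        return sigmoid(X @ self.W1 + self.b1)        # low-dimensional code h

    def fit(self, X, epochs=50):
        for _ in range(epochs):
            H = self.encode(X)
            R = sigmoid(H @ self.W2 + self.b2)       # reconstruction of the input
            dA2 = (R - X) * R * (1 - R)              # grad of 0.5*||R - X||^2 w.r.t. output pre-activation
            dA1 = (dA2 @ self.W2.T) * H * (1 - H)    # backpropagated to the hidden pre-activation
            self.W2 -= self.lr * H.T @ dA2 / len(X)
            self.b2 -= self.lr * dA2.mean(axis=0)
            self.W1 -= self.lr * X.T @ dA1 / len(X)
            self.b1 -= self.lr * dA1.mean(axis=0)
        return self
```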
We have used 2000 images of pellet clusters created for our experiments, corresponding to our experimental setup with the KUKA KR5 and Basler camera. The main objective of the algorithm was the computation of the X and Y coordinates, in the world coordinate system of the robot, of the centre of each pellet to be picked up. During manifold learning with LLE, out of the data-set of 2000 images, 800 images were used for manifold modelling, 800 were used for training and 100 data samples for testing. While working with deep learning, 1600 data samples were used for learning the network and 400 samples for testing. The plots of the robot end-effector coordinates pertaining to the training and test data for LLE based manifold learning and autoencoder based manifold learning are presented in Figs. 5 and 6, respectively. Statistically, the test data with the LLE manifold had a localization accuracy of 89.6 %, while the autoencoder based manifold learning had an accuracy of 91.7 %. The complexity of the LLE algorithm ranges from cubic O(DKN³) to O(N log N) for its various steps [9]. Computationally, the deep learning algorithm is less efficient compared to parametric algorithms [13]. Apparently, the results of deep learning were more accurate due to better learning of the characteristics of the data and its underlying structure.
7 Conclusion
References
1. Ghita Ovidiu and Whelan Paul F. 2008. A Systems Engineering Approach to Robotic Bin
Picking. Stereo Vision, Book edited by: Dr. Asim Bhatti, pp. 372.
2. Kelley, B.; Birk, J.R.; Martins, H. & Tella R. 1982. A robot system which acquires cylindrical
workpieces from bins, IEEE Trans. Syst. Man Cybern., vol. 12, no. 2, pp. 204–213.
3. Faugeras, O.D. & Hebert, M. 1986. The representation, recognition and locating of 3-D
objects, Intl. J. Robotics Res., vol. 5, no. 3, pp. 27–52.
4. Edwards, J. 1996. An active, appearance-based approach to the pose estimation of complex
objects, Proc. of the IEEE Intelligent Robots and Systems Conference, Osaka, Japan,
pp. 1458–1465.
5. Murase, H. & Nayar, S.K. 1995. Visual learning and recognition of 3-D objects from
appearance, Intl. Journal of Computer Vision, vol. 14, pp. 5–24.
6. Ghita O. & Whelan, P.F. 2003. A bin picking system based on depth from defocus, Machine
Vision and Applications, vol. 13, no. 4, pp. 234–244.
7. Mittrapiyanuruk, P.; DeSouza, G.N. & Kak, A. 2004. Calculating the 3D-pose of rigid objects
using active appearance models, Intl. Conference in Robotics and Automation, New Orleans,
USA.
8. Ghita, O.; Whelan, P.F.; Vernon D. & Mallon J. 2007. Pose estimation for objects with planar
surfaces using eigen image and range data analysis, Machine Vision and Applications, vol. 18,
no. 6, pp. 355–365.
9. Saul Lawrence K and Roweis Sam T. 2000, An Introduction to Locally Linear Embedding,
https://www.cs.nyu.edu/∼roweis/lle/papers/lleintro.pdf.
10. Deep Learning, An MIT Press book in preparation Yoshua Bengio, Ian Goodfellow and
Aaron Courville, http://www.iro.umontreal.ca/∼bengioy/dlbook,2015.
11. Deep Learning Tool Box, Prediction as a candidate for learning deep hierarchical models of
data, Rasmus Berg Palm, https://github.com/rasmusbergpalm/DeepLearnToolbox.
12. Léonard, Simon, and Martin Jägersand. “Learning based visual servoing.” Intelligent Robots
and Systems, 2004. (IROS 2004). Proceedings. 2004 IEEE/RSJ International Conference on.
Vol. 1. IEEE, 2004.
13. Rigas Kouskouridas, Angleo Amanatiadis and Antonios Gasteratos, “Pose Manifolds for
Efficient Visual Servoing”, http://www.iis.ee.ic.ac.uk/rkouskou/Publications/Rigas_IST12b.
pdf.
14. Chih-Chung Chang and Chih-Jen Lin, LIBSVM: a library for support vector machines. ACM
Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available
at http://www.csie.ntu.edu.tw/∼cjlin/libsvm.
15. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., & Manzagol, P. A. (2010). Stacked
denoising autoencoders: Learning useful representations in a deep network with a local
denoising criterion. The Journal of Machine Learning Research, 11, 3371–3408.
Motion Estimation from Image Sequences:
A Fractional Order Total Variation Model
Abstract In this paper, a fractional order total variation model is introduced for the estimation of the motion field. In particular, the proposed model generalizes integer order total variation models. The motion estimation is carried out in terms of optical flow. The presented model is formed using a quadratic data term and a total variation regularization term. This mathematical formulation makes the model robust against outliers and preserves discontinuities. However, it is difficult to solve the presented model due to the non-differentiable nature of the total variation term. For this purpose, the Grünwald-Letnikov derivative is used as a discretization scheme for the fractional order derivative. The resulting formulation is solved by an efficient iterative algorithm. Experimental results on various datasets verify the validity of the proposed model.
1 Introduction
P. Kumar (✉)
Department of Mathematics, Indian Institute of Technology Roorkee,
Roorkee 247667, India
e-mail: [email protected]
B. Raman
Department of Computer Science & Engineering,
Indian Institute of Technology Roorkee, Roorkee 247667, India
e-mail: [email protected]
© Springer Science+Business Media Singapore 2017 297
B. Raman et al. (eds.), Proceedings of International Conference on Computer Vision
and Image Processing, Advances in Intelligent Systems and Computing 460,
DOI 10.1007/978-981-10-2107-7_27
ness patterns in an image sequence [10]". In general, optical flow can be described as a two dimensional velocity vector field which arises either due to the motion of the camera/observer or of objects in the scene. It is actively used in many vision applications such as robot navigation, surveillance, human-computer interaction, medical diagnosis and 3D reconstruction. Optical flow estimation is considered an ill-posed problem; therefore, further a priori assumptions are required for accurate estimation. Researchers have proposed several variational models in the literature to determine the optical flow, starting from the seminal works [10, 11].
In the past two decades, differential variational models have been quite popular among optical flow estimation techniques. The reason is their simplicity in modeling the problem and the quality of the estimated optical flow [4, 5, 14, 15, 19]. In order to improve the estimation accuracy of optical flow models, different constraints have been imposed in the variational models, obtaining impressive performance. Recent models such as [1, 4, 5] include additional constraints such as a convex robust data term and the gradient constancy assumption in order to make the model robust against outliers and to avoid local minima. Some variational models, such as [4, 20], propose motion segmentation or parametric models to obtain a piecewise optical flow. Moreover, the variational models proposed in [9, 21, 22] are based on L1, L2 norms and total variation (TV) regularization terms. These models offer significant robustness against illumination changes and noise. All these models are based on integer order differentiation. A modification of these differential variational models, which generalizes their differentiation from integer order to fractional order, has obtained more attention from researchers. These can be categorized into the class of fractional order variational models. Fractional order differentiation based methods are now quite popular in various image processing applications [7, 8, 16]. However, very little attention has been given to the use of fractional order variational models for the optical flow estimation problem. The first fractional order variational model for motion estimation was proposed by Chen et al. [8]. But it is based on [10], which is more sensitive to noise.
The core idea of fractional order differentiation was introduced at the end of the seventeenth century and further developed and published in the nineteenth century [13]. Fractional order differentiation deals with differentiation of arbitrary order. The prominent difference between fractional and integer order differentiation is that fractional derivatives can be found even if the function is not continuous (as in the case of images), whereas integer order derivatives fail. Thus, fractional derivatives efficiently capture the discontinuity information about texture and edges in the optical flow field [8, 13]. In some real life applications, fractional derivatives also reduce the computational complexity [13].
2 Contribution
In this paper, we introduce a fractional order total variation model for motion estimation in image sequences. The novelty of the proposed model is that it generalizes the existing integer order total variation models to fractional order. The variational functional is formed using a quadratic data term and a total variation regularization term. Due to the presence of fractional order derivatives, the proposed model efficiently handles texture and edges. The numerical implementation of the fractional order derivative is carried out using the Grünwald-Letnikov derivative definition. The resulting variational functional is decomposed into a more suitable scheme and numerically solved by an iterative method. The problem of large motion is handled by using coarse-to-fine and warping techniques. The validity of the model is tested on various datasets and compared with some existing models.
The rest of the paper is organized as follows: Section 3 describes the proposed fractional order total variation model followed by the minimization scheme. Section 4 describes the experimental datasets and evaluation metrics followed by the experimental results. Finally, the paper is concluded in Sect. 5 with remarks on future work.
In order to determine the optical flow $\mathbf{u} = (u, v)$, the model proposed by Horn and Schunck [10] minimizes the following variational functional:

$$E(\mathbf{u}) = \int_{\Omega} \left[ \lambda\, (r(u, v))^2 + \left(|\nabla u|^2 + |\nabla v|^2\right) \right] dx\, dy \qquad (1)$$

The total variation variant of this model [9] replaces the quadratic regularizer by $|\nabla u| + |\nabla v|$:

$$E(\mathbf{u}) = \int_{\Omega} \left[ \lambda\, (r(u, v))^2 + \left(|\nabla u| + |\nabla v|\right) \right] dX \qquad (2)$$

We generalize (2) by replacing the first order derivatives with fractional order derivatives:

$$E(\mathbf{u}) = \int_{\Omega} \left[ \lambda\, (r(u, v))^2 + \left(|D^{\alpha} u| + |D^{\alpha} v|\right) \right] dX \qquad (3)$$

where $D^{\alpha} := (D_x^{\alpha}, D_y^{\alpha})^T$ denotes the fractional order derivative operator [17] and $|D^{\alpha} u| = \sqrt{(D_x^{\alpha} u)^2 + (D_y^{\alpha} u)^2}$. The fractional order is $\alpha \in \mathbb{R}^{+}$; when $\alpha = 1$, the proposed fractional order total variation model (3) takes the form of (2) [9]. In a similar way, when $\alpha = 2$, the derivative in (3) reduces to the second order integer derivative. Thus, the proposed model (3) generalizes the typical total variation model (2) from integer to fractional order.
In order to minimize the proposed fractional order total variation model (3), it is decomposed into the following forms according to [6]:

$$E_{TV-1} = \int_{\Omega} \left[ \lambda\, (r(\hat{u}, \hat{v}))^2 + \frac{1}{2\theta}(u - \hat{u})^2 + \frac{1}{2\theta}(v - \hat{v})^2 \right] dx\, dy \qquad (4)$$

$$E_{TV-u} = \int_{\Omega} \left[ \frac{1}{2\theta}(u - \hat{u})^2 + |D^{\alpha} u| \right] dx\, dy \qquad (5)$$

$$E_{TV-v} = \int_{\Omega} \left[ \frac{1}{2\theta}(v - \hat{v})^2 + |D^{\alpha} v| \right] dx\, dy \qquad (6)$$

where $\theta$ is a small constant that acts as a coupling threshold between $(\hat{u}, \hat{v})$ and $(u, v)$. For TV-1, $(u, v)$ are considered fixed and $(\hat{u}, \hat{v})$ have to be determined. The variational functionals given in (5) and (6) are treated in the same manner as the image denoising model of Rudin et al. [18].
According to the Euler-Lagrange method, minimization of (4) with respect to $(\hat{u}, \hat{v})$ results in the following equations:

$$\left(1 + 2\lambda\theta\, (I_{2w}^{x})^2\right)\hat{u} + 2\lambda\theta\, I_{2w}^{x} I_{2w}^{y}\, \hat{v} = u - 2\lambda\theta\, r_o I_{2w}^{x}$$
$$2\lambda\theta\, I_{2w}^{x} I_{2w}^{y}\, \hat{u} + \left(1 + 2\lambda\theta\, (I_{2w}^{y})^2\right)\hat{v} = v - 2\lambda\theta\, r_o I_{2w}^{y} \qquad (7)$$

where $r_o = I_t - u_o I_{2w}^{x} - v_o I_{2w}^{y}$ and $I_t = I_{2w} - I_1$.
Let $D$ be the determinant of the system of equations given in (7); then

$$D = 1 + 2\lambda\theta \left((I_{2w}^{x})^2 + (I_{2w}^{y})^2\right)$$

Solving the system for $\hat{u}$ gives

$$-D\hat{u} = 2\lambda\theta\, (I_{2w}^{x} I_{2w}^{y})\, v - \left(1 + 2\lambda\theta\, (I_{2w}^{y})^2\right) u + 2\lambda\theta\, r_o I_{2w}^{x} \qquad (8)$$

Similarly, for $\hat{v}$,

$$-D\hat{v} = 2\lambda\theta\, (I_{2w}^{x} I_{2w}^{y})\, u - \left(1 + 2\lambda\theta\, (I_{2w}^{x})^2\right) v + 2\lambda\theta\, r_o I_{2w}^{y} \qquad (9)$$
After solving (8) and (9), we obtain the iterative expressions for $\hat{u}$ and $\hat{v}$. The fractional order derivatives are discretized using the Grünwald-Letnikov definition as

$$D_x^{\alpha} u_{i,j} = \sum_{p=0}^{W-1} w_p^{(\alpha)}\, u_{i+p,\,j} \quad \text{and} \quad D_y^{\alpha} u_{i,j} = \sum_{p=0}^{W-1} w_p^{(\alpha)}\, u_{i,\,j+p} \qquad (12)$$

where

$$w_p^{(\alpha)} = (-1)^p S_p^{\alpha} \quad \text{and} \quad S_p^{\alpha} = \frac{\Gamma(\alpha + 1)}{\Gamma(p + 1)\, \Gamma(\alpha - p + 1)}$$
where $A_q^{(\alpha)} \in \mathbb{R}^{N \times 2}$. Thus, the discrete version of the variational functional (5) can be written as

$$E_{TV-u} = \sum_{q=1}^{N} \left\| A_q^{(\alpha)} X \right\| + \frac{1}{2\theta}\left\| X - Y \right\|^2 \qquad (15)$$
Now, the solution of the above formulation (15) is determined by using the following update of the primal-dual algorithm described in [7]:

$$u^{p+1} = \frac{u^{p} - \tau_p\, \mathrm{div}^{\alpha} d^{p+1} + \frac{\tau_p}{\theta}\, \hat{u}}{1 + \frac{\tau_p}{\theta}} \qquad (16)$$

where $d$ is the solution of the dual problem of (15). In the same way, we can determine the solution of (6). A summary of the proposed algorithm for estimating the optical flow is given in Algorithm 1.
Algorithm 1: Proposed algorithm
Step 1: Input: I1 , I2 , 𝜆, 𝛼, 𝜃 and iterations
Step 2: Compute û and v̂ from (7)
Step 3: Compute u and v from (5) and (6) using (16)
Step 4: Output: optical flow vector u = (u,v)
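A small sketch of the Grünwald-Letnikov discretization in (12): the weights w_p^(α) are computed from the Gamma function and applied along each image axis. The mask width W and the circular handling of the border are assumptions.

```python
import numpy as np
from scipy.special import gamma

def gl_weights(alpha, W):
    """w_p^(alpha) = (-1)^p * Gamma(alpha+1) / (Gamma(p+1) * Gamma(alpha-p+1))."""
    p = np.arange(W)
    return (-1.0) ** p * gamma(alpha + 1) / (gamma(p + 1) * gamma(alpha - p + 1))

def fractional_gradient(u, alpha, W=5):
    """Discrete D^alpha_x u and D^alpha_y u as in Eq. (12); np.roll wraps at the border."""
    w = gl_weights(alpha, W)
    dx = np.zeros_like(u, dtype=float)
    dy = np.zeros_like(u, dtype=float)
    for p, wp in enumerate(w):
        dx += wp * np.roll(u, -p, axis=0)   # picks up u_{i+p, j}
        dy += wp * np.roll(u, -p, axis=1)   # picks up u_{i, j+p}
    return dx, dy
```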
4.1 Datasets
Fig. 1 Optical flow results: sample images (first row), optimal fractional order plots (second row),
vector plots of the estimated optical flow (third row), estimated optical flow color maps (fourth row)
and ground truth plots of the optical flow in bottom row [2]
Angular error (AE) is used for evaluating the performance of the proposed algorithm. The AE is the angle between the correct flow vector $(u_c, v_c, 1)$ and the estimated flow vector $(u_e, v_e, 1)$ (see [3] for details). It is defined as
Fig. 2 Optical flow results: sample images (first row), optimal fractional order plots (second row),
vector plots of the estimated optical flow (third row), estimated optical flow color maps (fourth row)
and ground truth plots of the optical flow in bottom row [2]
$$AE = \cos^{-1}\!\left(\frac{u_c u_e + v_c v_e + 1}{\sqrt{(u_c^2 + v_c^2 + 1)(u_e^2 + v_e^2 + 1)}}\right) \qquad (17)$$
Statistics: The validity of the proposed model is also assessed using the average angular error (AAE) and the standard deviation (STD) of the angular error. These terms are briefly defined in [3].
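A direct sketch of these evaluation metrics (the per-pixel angular error of Eq. 17 and its mean and standard deviation), assuming dense ground-truth and estimated flow fields as NumPy arrays.

```python
import numpy as np

def angular_error_stats(u_c, v_c, u_e, v_e):
    """Per-pixel angular error (Eq. 17) plus AAE and STD, all in degrees."""
    num = u_c * u_e + v_c * v_e + 1.0
    den = np.sqrt((u_c**2 + v_c**2 + 1.0) * (u_e**2 + v_e**2 + 1.0))
    ae = np.degrees(np.arccos(np.clip(num / den, -1.0, 1.0)))
    return ae, ae.mean(), ae.std()
```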
[Bar charts of AAE for Algorithms 1–3 on the Grove, RubberWhale, Urban and Venus sequences.]
Fig. 3 Comparisons of quantitative results: a Proposed model (Algorithm 1) with model [10]
(Algorithm 2), and b Proposed model (Algorithm 1) with total variation model [9] (Algorithm 3)
[Bar charts of STD for Algorithms 1–3 on the Army and Mequon sequences.]
Fig. 4 Comparisons of quantitative results: a Proposed model (Algorithm 1) with model [10]
(Algorithm 2), and b Proposed model (Algorithm 1) with total variation model [9] (Algorithm 3)
tion components such as texture, several independent motions and motion blur. The estimated color maps of the optical flow are compared with the ground truth color maps in Figs. 1 and 2. These results demonstrate that the proposed model efficiently handles textures and edges. The color maps of the optical flow are dense, which is corroborated by the vector plots of the optical flow. The quantitative results of the model are compared with the total variation model [9] in Figs. 3 and 4. This comparison shows that the fractional order total variation model gives comparatively better results. Additionally, we compare our quantitative results with the model of [10] in Figs. 3 and 4, which shows that the proposed model significantly outperforms it.
A fractional order total variation model has been presented for motion estimation from image frames. For α = 1, the proposed model reduces to the integer order total variation model [9]. The optimal fractional order, for which the solution is stable, has been provided for each image sequence through graphs. Experimental results on different datasets validate that the proposed model efficiently handles texture and edges and provides dense flow. As future work, the proposed fractional order total variation model can be extended to a fractional order total variation-L1 model.
Acknowledgements The author Pushpendra Kumar gratefully acknowledges the financial support provided by the Council of Scientific and Industrial Research (CSIR), New Delhi, India to carry out this work.
References
1. Alvarez, L., Weickert, J., Sánchez, J.: Reliable estimation of dense optical flow fields with large
displacements. International Journal of Computer Vision 39(1), 41–56 (2000)
2. Baker, S., Scharstein, D., Lewis, J.P., Roth, S., Black, M.J., Szeliski, R.: A database and evalu-
ation methodology for optical flow. International Journal of Computer Vision 92, 1–31 (2011)
3. Barron, J.L., Fleet, D.J., Beauchemin, S.: Performance of optical flow techniques. International
Journal of Computer Vision 12, 43–77 (1994)
4. Black, M.J., Anandan, P.: The robust estimation of multiple motions: Parametric and piecewise
smooth flow. Computer Vision and Image Understanding 63(1), 75–104 (1996)
5. Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based
on a theory for warping. Computer Vision - ECCV 4, 25–36 (2004)
6. Chambolle, A.: An algorithm for total variation minimization and applications. Journal of
Mathematical imaging and vision 20(1–2), 89–97 (2004)
7. Chen, D., Chen, Y., Xue, D.: Fractional-order total variation image restoration based on primal-
dual algorithm. Abstract and Applied Analysis 2013 (2013)
8. Chen, D., Sheng, H., Chen, Y., Xue, D.: Fractional-order variational optical flow model for
motion estimation. Philosophical Transactions of the Royal Society of London A: Mathemat-
ical, Physical and Engineering Sciences 371(1990), 20120148 (2013)
9. Drulea, M., Nedevschi, S.: Total variation regularization of local-global optical flow. In: 14th
International Conference on Intelligent Transportation Systems (ITSC). pp. 318–323 (2011)
10. Horn, B., Schunck, B.: Determining optical flow. Artificial Intelligence 17, 185–203 (1981)
11. Lucas, B., Kanade, T.: An iterative image registration technique with an application to
stereo vision. In: Seventh International Joint Conference on Artificial Intelligence, Vancou-
ver, Canada. vol. 81, pp. 674–679 (1981)
12. Miller, K.S.: Derivatives of noninteger order. Mathematics magazine pp. 183–192 (1995)
13. Miller, K.S., Ross, B.: An introduction to the fractional calculus and fractional differential
equations. Wiley New York (1993)
14. Motai, Y., Jha, S.K., Kruse, D.: Human tracking from a mobile agent: optical flow and kalman
filter arbitration. Signal Processing: Image Communication 27(1), 83–95 (2012)
15. Niese, R., Al-Hamadi, A., Farag, A., Neumann, H., Michaelis, B.: Facial expression recogni-
tion based on geometric and optical flow features in colour image sequences. IET computer
vision 6(2), 79–89 (2012)
16. Pu, Y.F., Zhou, J.L., Yuan, X.: Fractional differential mask: a fractional differential-based
approach for multiscale texture enhancement. IEEE Transactions on Image Processing 19(2),
491–511 (2010)
17. Riemann, B.: Versuch einer allgemeinen auffassung der integration und differentiation. Gesam-
melte Werke 62 (1876)
18. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms.
Physica D: Nonlinear Phenomena 60(1), 259–268 (1992)
19. Schneider, R.J., Perrin, D.P., Vasilyev, N.V., Marx, G.R., Pedro, J., Howe, R.D.: Mitral annulus
segmentation from four-dimensional ultrasound using a valve state predictor and constrained
optical flow. Medical image analysis 16(2), 497–504 (2012)
20. Weickert, J.: On discontinuity-preserving optic flow. In: Proceeding of Computer Vision and
Mobile Robotics Workshop (1998)
21. Werlberger, M., Trobin, W., Pock, T., Wedel, A., Cremers, D., Bischof, H.: Anisotropic huber-
l1 optical flow. In: BMVC. vol. 1, p. 3 (2009)
22. Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime tv-l 1 optical flow. In:
Pattern Recognition, pp. 214–223. Springer (2007)
Script Identification in Natural Scene
Images: A Dataset and Texture-Feature
Based Performance Evaluation
M. Verma (✉)
Mathematics Department, IIT Roorkee, Roorkee, India
e-mail: [email protected]
N. Sood
University Institute of Engineering and Technology,
Panjab University, Chandigarh, India
e-mail: [email protected]
P.P. Roy ⋅ B. Raman
Computer Science and Engineering Department, IIT Roorkee, Roorkee, India
e-mail: [email protected]
B. Raman
e-mail: [email protected]
1 Introduction
2 Data Collection
The availability of a standard database is one of the most important issues for any pattern
recognition research work. To date, no standard database covering all official Indic
scripts is available. A total of 500 images were collected from different sources, 100 for
each of the five scripts.
Initially, the database contained scenic images, from which the yellow board regions
were segmented by selecting the four corner points of the desired yellow board. These
images were then converted into grayscale and binarized using a threshold value. The
binary image was then segmented into words of each script. For images that could not
be segmented perfectly in this way, vertical segmentation followed by horizontal
segmentation was carried out; where this also failed, the script images were segmented
manually (Fig. 1).
Challenges: Script recognition is challenging for several reasons. The first and
most obvious is that there are many script categories. The second is viewpoint
variation: the same board can look different from different angles. The third is
illumination, where lighting can make the same object appear different. The fourth
is background clutter, where the classifier cannot distinguish the board from its
background. Other challenges include scale, deformation, occlusion, and intra-class
variation. Some of the images from the database are shown in Fig. 2.
2.1 Scripts
India is an incredibly diverse country in which the language changes from one region
to another. Moving a few kilometres north, south, east or west brings a significant
variation in language: both the dialect and the script change, along with the accent.
Each region of the country differs from the rest, and language is a large part of this
difference.
Languages such as Hindi, Punjabi, Bengali, Urdu and English each have their own
long and unique history. Hindi is written in the Devanagari script, Punjabi in Gurumukhi,
Bengali in Bangla, Urdu in the Persian script with its typical Nasta'liq style, and English
in the Roman script, and these writing systems differ from one another in many ways.
Scripts are thus an important component when studying variation across languages, and
to a large extent they shape the development of a language. In the following, a brief
outline of the English, Hindi, Odia, Urdu and Telugu languages is provided.
1. Roman Script: It is used to write English, an international language belonging to
the Indo-European language family. About 328 million people in India use English
as a medium of communication.
2. Devanagari Script: Hindi, one of the most widely spoken languages in India, uses
this script. Hindi also belongs to the Indo-European language family. In India,
about 182 million people, mainly residing in the northern part of the country, use it
as their medium of communication.
3. Odia: Odia is the language of the Indian state of Odisha. It is also spoken in other
Indian states, e.g., Jharkhand, West Bengal and Gujarat. It is an Indo-Aryan language
used by about 33 million people.
4. Urdu Script: Urdu is written using the Urdu alphabet in right-to-left order, with
38 letters and no distinct letter cases; the alphabet is usually written in the
calligraphic Nasta'liq style.
5. Telugu Script: The Telugu script, which belongs to the Brahmic family of scripts,
is used to write the Telugu language. Telugu is the language of the Andhra Pradesh
and Telangana states and is spoken by the people of these states along with a few
other neighbouring states.
3 Feature Extraction
Ojala et al. proposed the local binary pattern (LBP) [5], in which each pixel of the image is
considered as a center pixel for the calculation of a pattern value. A neighborhood around
each center pixel is considered and the local binary pattern value is computed. The formu-
lation of LBP for a given center pixel Ic and neighboring pixel In is as follows:
$$LBP_{P,R}(x_1, x_2) = \sum_{n=0}^{P-1} 2^n \times T_1(I_n - I_c) \qquad (1)$$

$$T_1(a) = \begin{cases} 1 & a \ge 0 \\ 0 & \text{else} \end{cases}$$

$$H(L)\mid_{LBP} = \sum_{x_1=1}^{m} \sum_{x_2=1}^{n} T_2\big(LBP(x_1, x_2), L\big); \quad L \in [0, (2^P - 1)] \qquad (2)$$

$$T_2(a_1, b_1) = \begin{cases} 1 & a_1 = b_1 \\ 0 & \text{else} \end{cases} \qquad (3)$$
LBP_{P,R}(x1, x2) computes the local binary pattern of pixel Ic, where the number of
neighboring pixels and the radius of the circle used for computation are denoted by P
and R, and (x1, x2) are the coordinates of pixel Ic. H(L) computes the histogram of the local
binary pattern map, where m × n is the image size (Eq. 2).
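To make Eqs. (1)–(2) concrete, the following is a minimal NumPy sketch of LBP histogram extraction for P = 8, R = 1, approximating the circular neighbourhood by the eight adjacent pixels. It is an illustration rather than the authors' implementation; the function name and the final normalisation are our own choices.

```python
import numpy as np

def lbp_histogram(img, P=8):
    """Minimal LBP (Eqs. 1-2) for P = 8, R = 1 on a grayscale image (2-D numpy array)."""
    h, w = img.shape
    # Offsets of the 8 neighbours on a radius-1 ring (clockwise).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    lbp_map = np.zeros((h - 2, w - 2), dtype=np.int32)
    center = img[1:h-1, 1:w-1].astype(np.int32)
    for n, (dy, dx) in enumerate(offsets):
        neigh = img[1+dy:h-1+dy, 1+dx:w-1+dx].astype(np.int32)
        # T1(I_n - I_c): 1 if the neighbour is >= the centre pixel, else 0.
        lbp_map += (2 ** n) * (neigh >= center)
    # H(L): histogram of the pattern map over L in [0, 2^P - 1].
    hist, _ = np.histogram(lbp_map, bins=2 ** P, range=(0, 2 ** P))
    return hist.astype(np.float32) / max(hist.sum(), 1)
```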
The center-symmetric local binary pattern (CSLBP) is a modified form of LBP that calculates the
pattern based on the difference of pixels in four different directions. Mathematically,
CSLBP can be represented as follows:
$$CSLBP_{P,R} = \sum_{n=0}^{(P/2)-1} 2^n \times T_1\big(I_n - I_{n+(P/2)}\big) \qquad (4)$$

$$H(L)\mid_{CSLBP} = \sum_{x_1=1}^{m} \sum_{x_2=1}^{n} T_2\big(CSLBP(x_1, x_2), L\big); \quad L \in [0, (2^{P/2} - 1)] \qquad (5)$$
where In and In+(P/2) correspond to the intensities of center-symmetric pixel pairs on
a circle of radius R with P neighboring pixels. The radius is set to 1 and
the number of neighborhood pixels is taken as 8. More information about CSLBP
can be found in [3].
The directional local extrema patterns (DLEP) compute the relationship
of each image pixel with its neighboring pixels in specific directions [4]. DLEP has
been proposed to capture edge information in the 0°, 45°, 90° and 135° directions.
$$I'(I_n) = I_n - I_c \quad \forall\, n = 1, 2, \ldots, 8 \qquad (6)$$

$$D_\theta(I_c) = T_3\big(I'_j, I'_{j+4}\big) \quad \forall\, \theta = 0^\circ, 45^\circ, 90^\circ, 135^\circ;\ j = (1 + \theta/45)$$

$$T_3\big(I'_j, I'_{j+4}\big) = \begin{cases} 1 & I'_j \times I'_{j+4} \ge 0 \\ 0 & \text{else} \end{cases} \qquad (7)$$

$$DLEP_{pat}(I_c)\big|_\theta = \big\{ D_\theta(I_c);\ D_\theta(I_1);\ D_\theta(I_2);\ \ldots;\ D_\theta(I_8) \big\}$$

$$DLEP(I_c)\big|_\theta = \sum_{n=0}^{8} 2^n \times DLEP_{pat}(I_c)\big|_\theta(n) \qquad (8)$$

$$H(L)\mid_{DLEP(\theta)} = \sum_{x_1=1}^{m} \sum_{x_2=1}^{n} T_2\big(DLEP(x_1, x_2)\mid_\theta, L\big); \quad L \in [0, 511]$$
where DLEP(Ic)|θ is the DLEP map of a given image and H(L)|DLEP(θ) is the histogram
of the extracted DLEP map.
For all three features (LBP, CSLBP and DLEP), the final histogram of the pattern map
serves as the feature vector of the image. The feature vectors of all scripts in the training
and testing data were then computed for the experiments.
4 Classifiers
After feature extraction, a classifier is used to separate the scripts into different classes.
In the proposed work, script classification has been carried out using two well-known clas-
sifiers, i.e., the k-NN and SVM classifiers.
In k-NN, distance measures of a testing image from each training image are computed and
sorted. Based on the sorted distances, the k nearest distance measures are selected as good
matches.
Sigmoid kernel:
$$K(z_i, z_j) = \tanh(\gamma z_i^T z_j + r) \qquad (13)$$
Here $\gamma$, $r$ and $d$ are kernel parameters.
We evaluate our algorithm on the 'Station boards' dataset. The results using texture-
based features with the k-NN and SVM classifiers are studied in this section.
Initially, the images were available as station boards, from which each
word of the different scripts was extracted. Then, features such as CS-LBP, LBP and DLEP
were extracted from these images. These feature vectors have been
used to train and test the images with a support vector machine (SVM) or k-
nearest neighbour (k-NN) classifier to identify the script type.
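As an illustration of this train/test setup (not the authors' code), the following scikit-learn sketch classifies texture histograms with an RBF-kernel SVM and a k-NN classifier; the random arrays are placeholders for the real feature matrices, and the hyper-parameter values (C, gamma, k) are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data: X_* would hold the LBP / CS-LBP / DLEP histograms, y_* the script labels.
X_train = np.random.rand(450, 256); y_train = np.random.randint(0, 5, 450)
X_test = np.random.rand(50, 256);   y_test = np.random.randint(0, 5, 50)

# Multi-class SVM with an RBF (Gaussian) kernel, as used in the paper.
svm = SVC(kernel='rbf', C=10.0, gamma='scale').fit(X_train, y_train)
print('SVM accuracy :', svm.score(X_test, y_test))

# k-NN with a Euclidean distance measure.
knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print('k-NN accuracy:', knn.score(X_test, y_test))
```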
We compare the results of SVM and k-NN for the identification of 5 scripts. In the exper-
iments, cross validation with a 9:1 ratio has been adopted: a testing set of 10
images and a training set of 90 images have been taken for each script. During the experiments,
different sets of testing images were chosen and the average result over all testing sets
was computed. In each experiment, 50 images are used as test images and 450
images as training images. We use a multi-class SVM classifier with different
kernels to obtain better results. Among the SVM kernels, the Gaussian kernel with Radial Basis
Function gave better performance than the others. The main reason for the poor
accuracy of k-NN is the small number of training samples, as k-NN requires a large train-
ing database to achieve good accuracy.
In k-NN, we compute the distance between the feature vectors of the training and test-
ing data using various distance measures, whereas SVM requires more computation,
such as kernel processing and matching feature vectors under different parameter
settings. k-NN and SVM represent different approaches to learning, each implying a
different model of the underlying data: SVM assumes that there exists
a hyperplane separating the data points (quite a restrictive assumption), while k-NN
attempts to approximate the underlying distribution of the data in a non-parametric
fashion (a crude approximation of the Parzen-window estimator).
The SVM classifier gave 84 % accuracy for LBP and 80.5 % for DLEP features, whereas
k-NN gave 64.5 % accuracy for the LBP feature. Results for both k-NN and SVM are
given in Table 1. The reason for the poorer results of k-NN compared to SVM is
that the extracted features lack regularity between text patterns and are not
good enough to handle broken segments [10].
Some images for which the proposed system identified the script correctly are shown
in Fig. 3. The first and fourth images are very noisy and hard to read. The second
image is not horizontal and is tilted by a small angle. Our proposed method worked
well for these kinds of images. A few more images, for which the system could not
identify the script accurately, are shown in Fig. 4. Most of the images in this category
are very small in size; hence, the proposed method does not work well for very small
images. The confusion matrix for all scripts in this database is shown in Table 2.
It shows that the texture feature based method worked very well for the English and Urdu
scripts and gave average results for the other scripts.
6 Conclusion
References
1. Chanda, S., Franke, K., Pal, U.: Text independent writer identification for oriya script. In:
Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on. pp. 369–
373. IEEE (2012)
2. Ghosh, D., Dube, T., Shivaprasad, A.P.: Script recognition–a review. Pattern Analysis and
Machine Intelligence, IEEE Transactions on 32(12), 2142–2161 (2010)
3. Heikkilä, M., Pietikäinen, M., Schmid, C.: Description of interest regions with local binary
patterns. Pattern recognition 42(3), 425–436 (2009)
4. Murala, S., Maheshwari, R., Balasubramanian, R.: Directional local extrema patterns: a new
descriptor for content based image retrieval. International Journal of Multimedia Information
Retrieval 1(3), 191–203 (2012)
5. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant tex-
ture classification with local binary patterns. Pattern Analysis and Machine Intelligence, IEEE
Transactions on 24(7), 971–987 (2002)
6. Pal, U., Sinha, S., Chaudhuri, B.: Multi-script line identification from indian documents. In:
Proceedings of Seventh International Conference on Document Analysis and Recognition. pp.
880–884. IEEE (2003)
7. Phan, T.Q., Shivakumara, P., Ding, Z., Lu, S., Tan, C.L.: Video script identification based on
text lines. In: International Conference on Document Analysis and Recognition (ICDAR). pp.
1240–1244. IEEE (2011)
8. Shi, B., Yao, C., Zhang, C., Guo, X., Huang, F., Bai, X.: Automatic script identification in the
wild. In: Proceedings of ICDAR. No. 531–535 (2015)
9. Shijian, L., Tan, C.L.: Script and language identification in noisy and degraded document
images. Pattern Analysis and Machine Intelligence, IEEE Transactions on 30(1), 14–24 (2008)
10. Shivakumara, P., Yuan, Z., Zhao, D., Lu, T., Tan, C.L.: New gradient-spatial-structural features
for video script identification. Computer Vision and Image Understanding 130, 35–53 (2015)
11. Singhal, V., Navin, N., Ghosh, D.: Script-based classification of hand-written text docu-
ments in a multilingual environment. In: Proceedings of 13th International Workshop on
Research Issues in Data Engineering: Multi-lingual Information Management (RIDE-MLIM).
pp. 47–54. IEEE (2003)
12. Sun, Q.Y., Lu, Y.: Text location in scene images using visual attention model. International
Journal of Pattern Recognition and Artificial Intelligence 26(04), 1–22 (2012)
13. Ullrich, C.: Support vector classification. In: Forecasting and Hedging in the Foreign Exchange
Markets, pp. 65–82. Springer (2009)
Posture Recognition in HINE Exercises
Abdul Fatir Ansari, Partha Pratim Roy and Debi Prosad Dogra
1 Introduction
Computer vision guided automatic or semi-automatic systems are one of the major
contributors in medical research. Images and videos recorded through X-Ray,
Ultrasound (USG), Magnetic Resonance (MR), Electrocardiography (ECG), or
Electroencephalography (EEG) are often analysed using computers to help the
physicians in the diagnosis process. The above modalities are mainly used to under-
stand the state of the internal structures of the human body. On the other hand, external
imaging systems or sensors can act as important monitoring or diagnostic utilities.
For instance, external imaging can be used in human gait analysis [1], infant [2] or
old person monitoring systems [3], pedestrian detection [4], patient surveillance [5],
etc.
Image and video analysis based algorithms are also being used to develop
automatic and semi-automatic systems that assist in detection and diagnosis during
medical examinations. Research has, for instance, shown that experts con-
ducting Hammersmith Infant Neurological Examinations (HINE) [6] can take
the help of computer vision guided tools. HINE is used for the assessment of neuro-
logical development in infants. These examinations include assessment of posture,
cranial nerve function, tone, movements, reflexes and behaviour.
Examinations are carried out by visually observing the reaction of the baby and
assessing each test separately. Hence, these examinations often turn out to be
subjective. Therefore, there is a need to automate some of the critical tests of HINE,
namely adductors and popliteal angle measurement, ankle dorsiflexion estimation,
observation of head control, testing of ventral and vertical suspension, and grading
head movement during pulled-to-sit and lateral tilting to bring objectivity in the
evaluation process. Significant progress has already been made in this context. For
example, Dogra et al. have proposed image and video guided techniques to assess
three tests of HINE set, namely Adductors Angle Measurement [7], Pulled to Sit
[8], and Lateral Tilting [9] as depicted in Fig. 1.
However, existing solutions are not generic and hence cannot be used directly for
the remaining set of tests. In this paper, we attempt to solve the problem of posture
recognition of the baby which can be used in automating a large set of tests. We
classify a given video sequence of HINE exercise into one of the following classes:
Fig. 1 Four exercises of HINE set, a Adductors Angle. b Pulled to Sit. c Lateral Tilting and
d Vertical Suspension
Sitting (as in Pulled to Sit), Lying along Y-axis (as in Adductors Angle Mea-
surement), Lying along X-axis and Suspending (as in Vertical Suspension Test).
The rest of the paper is organized as follows. Proposed method is explained in
detail in Sect. 2. Results and performance of our system are demonstrated in
Sect. 3. Conclusions and future work are presented in Sect. 4.
2 Proposed Methodology
In the proposed method, we use pixel-color based skin detection method that can
classify every pixel as a skin or a non-skin pixel. After testing with various color
spaces (e.g. HSV, normalized RGB, and YCrCb), we found YCrCb to be the most
suitable for our application. This is because the luminance component (i.e. Y)
doesn’t affect the classification, and skins of different complexions can be detected
using the same bounds in this color space.
The image is smoothed using a mean filter before conversion into YCrCb color
space. Our algorithm relies on setting the bounds of Cr and Cb values explicitly
which were tested rigorously on various types of human skins. Histograms of Cr
and Cb components in skin pixels available in HINE videos are shown in Fig. 3.
The YCrCb components of a pixel from RGB values can be obtained as:
As the videos in our dataset were taken under constant illumination with fixed camera
settings, and a good contrast between the baby's body and the background objects was main-
tained, we did not employ time-consuming classification and motion tracking
algorithms to detect the body. Such methods would have been resource intensive
and would have required a large amount of training data, thereby defeating the purpose of our
algorithm. The output of the above algorithm is a binary image (denoted by
Iseg).
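A minimal OpenCV sketch of this skin-detection step is given below, assuming one BGR video frame. It uses the Cr/Cb bounds reported later in the paper (Cr in [140, 173], Cb in [77, 127]) and an assumed 5 × 5 mean-filter size, so it is an illustration rather than the exact implementation.

```python
import cv2
import numpy as np

def segment_skin(frame_bgr):
    """Return a binary skin mask (Iseg) for one video frame."""
    # Smooth with a mean filter before the colour-space conversion (kernel size assumed).
    blurred = cv2.blur(frame_bgr, (5, 5))
    # OpenCV orders the channels as Y, Cr, Cb.
    ycrcb = cv2.cvtColor(blurred, cv2.COLOR_BGR2YCrCb)
    # Bounds reported in the paper: Cr in [140, 173], Cb in [77, 127];
    # Y is left unconstrained so that illumination does not affect the decision.
    lower = np.array([0, 140, 77], dtype=np.uint8)
    upper = np.array([255, 173, 127], dtype=np.uint8)
    return cv2.inRange(ycrcb, lower, upper)
```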
2.2 Skeletonization
Iseg comprises blobs of skin regions (the body of the baby) and of non-skin regions.
Morphological operations (erosion and dilation) were applied to make the body
area prominent. The contours of the resulting binary image were determined. Blobs
with size less than a threshold area AT (experimentally set to 2000) were removed
as spurious regions.
The output image was then skeletonized (thinned) as described by Guo and Hall
[10]. Three thinning methodologies, namely morphological thinning, the Zhang-Suen
algorithm [11] and the Guo-Hall algorithm [10], were tested. The thinning algorithm of
Guo and Hall [10] provided the best results, with the least number of redundant pixels,
within affordable time. Contours of the thinned image were then determined and the
largest one (area-wise) was kept, discarding the others. This works well for our analysis,
as the largest contour is the most probable candidate for the baby's skeleton. The image
generated in this phase is denoted by Iskel (Fig. 4).
The junction points and end points of the baby's skeleton (Iskel) were then
searched for. For each white pixel, the number of white pixels in its eight-neighborhood was
counted. Depending on the number of white pixels (including the current pixel) in
the neighborhood (denoted by count), a pixel was classified as a body point, junction
point or end point.
• If count is equal to 2, then the pixel is classified as an end point.
• If count is greater than 4, then the pixel is classified as a junction point.
• In all other cases, the pixel is assumed as a normal body point.
Spurious junction points in the vicinity of actual junctions, caused by redundant
pixels or skeleton defects, may be detected by the above heuristics. These junctions
were removed by iterating over all the junction points and removing those
whose Euclidean distance from any other junction point was less than a threshold
DT (experimentally set to 2.5).
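The count-based classification and the distance-based pruning described above can be sketched as follows; this is an illustrative NumPy version, not the authors' code, and it assumes Iskel is a binary 0/1 skeleton array.

```python
import numpy as np

def classify_skeleton_points(iskel, dist_thresh=2.5):
    """Label end points and junction points in a binary skeleton (0/1 array)."""
    ys, xs = np.nonzero(iskel)
    end_pts, junc_pts = [], []
    h, w = iskel.shape
    for y, x in zip(ys, xs):
        # count = number of white pixels in the 3x3 neighbourhood (centre included).
        y0, y1 = max(0, y - 1), min(h, y + 2)
        x0, x1 = max(0, x - 1), min(w, x + 2)
        count = int(iskel[y0:y1, x0:x1].sum())
        if count == 2:
            end_pts.append((y, x))
        elif count > 4:
            junc_pts.append((y, x))
    # Prune spurious junctions closer than dist_thresh to an already kept junction.
    kept = []
    for p in junc_pts:
        if all(np.hypot(p[0] - q[0], p[1] - q[1]) >= dist_thresh for q in kept):
            kept.append(p)
    return end_pts, kept
```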
The bounding box and center of mass of the baby's skeleton were found from Iskel,
and the following 6 features were extracted from every frame of the video
sequence.
1. F1: Width (marked W in Fig. 5) of the rectangle bounding the baby’s skeleton.
2. F2: Height (marked H in Fig. 5) of the rectangle bounding the baby’s skeleton.
3. F3: Aspect Ratio (W/H) of the rectangle bounding the baby’s skeleton.
4. F4: The Euclidean distance (marked D in Fig. 5) between the farthest junction
points in the baby’s skeleton.
After extracting the 6 features from each frame of the video, we apply a Hidden
Markov Model (HMM) based sequential classifier to classify the video
sequence into one of the four classes, namely Sitting, Lying along X axis, Lying
along Y axis and Suspending. An HMM can be defined by its initial state probabilities,
a state transition matrix A = [a_ij], i, j = 1, 2, ..., N, where a_ij denotes the transition
probability from state i to state j, and output probabilities b_j(O_k). The density function
is written as b_j(x), where x represents a k-dimensional feature vector. Recognition is
performed using the Viterbi algorithm. For the implementation of the HMM, the
HTK toolkit was used.
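The paper uses HTK; purely as an illustration of the same idea, the sketch below trains one 4-state, 4-mixture GMM-HMM per posture class with the hmmlearn package and classifies a sequence by maximum log-likelihood. The package choice, function names and iteration count are our assumptions, not the authors' setup.

```python
import numpy as np
from hmmlearn.hmm import GMMHMM

def train_class_hmms(sequences_by_class, n_states=4, n_mix=4):
    """Train one GMM-HMM per posture class from lists of (frames x 6) feature arrays."""
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack(seqs)                      # all frames of this class, stacked
        lengths = [len(s) for s in seqs]         # per-video frame counts
        m = GMMHMM(n_components=n_states, n_mix=n_mix,
                   covariance_type='diag', n_iter=50)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, seq):
    """Assign the class whose HMM gives the highest log-likelihood for the sequence."""
    return max(models, key=lambda label: models[label].score(seq))
```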
Fig. 7 The Cb and Cr values for two skin types shown in Fig. 6
The ranges that we found most suitable for a pixel to be classified as a skin pixel
after testing on several HINE videos are, Cr = [140, 173] and Cb = [77, 127].
Therefore, we slightly modified the bounds given by Chai and Ngan [12]. This
range has been proven to be robust against different skin types present in our dataset
(Fig. 6). In Fig. 7, we present the results of skin segmentation on babies with two
different complexions along with their Cb versus Cr plots. It is evident from the
plots that the values of Cb and Cr for skin pixels are indeed clustered within a
narrow range.
The results of skin segmentation, morphological operations and skeletonization
for each of the four classes, (a) Sitting, (b) Lying Y-axis, (c) Lying X-axis, and
(d) Suspending, have been tabulated in Fig. 8.
Classification of videos was done after training using HMM. Testing was per-
formed in Leave-one-out cross-validation (LOOCV) on videos of different exer-
cises. The HMM parameters namely number of states and Gaussian Mixture
Models were fixed based on the validation set. Best results were achieved with 4
states and 4 Gaussian Mixture Models. Plots of variation of accuracy with number
of states and number of Gaussian Mixture Models are shown in Fig. 9. An overall
accuracy of 78.26 % was obtained from the dataset when compared against the
ground truths (Table 1).
Challenges and Error Analysis: As there are multiple stages in the algorithm,
the error propagates through these stages and often accumulates. Explicitly
defining the boundaries of Cb and Cr for skin segmentation sometimes leads to
spurious detections or missed detections of the baby's actual body region, due to
interference from the body parts and movements of the physician conducting the
test. During the exercises, as the perspective changes with respect to the still
camera, certain parts of the infant's body may be removed when blobs are discarded
based on the area threshold. The thinning algorithm at times leads to skeleton defects and
the detection of extra junction points. These steps add to the error in the
feature extraction step. Classification using HMM requires a lot of training data to
improve accuracy.
References
1. R. Zhang, C. Vogler, and D. Metaxas. Human gait recognition at sagittal plane. Image and
Vision Computing, 25(3):321–330, 2007.
2. S. Singh and H. Hsiao. Infant Telemonitoring System. Engineering in Medicine and Biology
Society, 25th International Conference of the IEEE, 2:1354–1357, 2003.
3. J. Wang, Z. Zhang, B. Li, S. Lee, and R. Sherratt. An enhanced fall detection system for
elderly person monitoring using consumer home networks. IEEE Transactions on Consumer
Electronics, 60(1):23–29, 2014.
4. P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and
appearance. Ninth IEEE International Conference on Computer Vision: 734–741, 2003.
5. S. Fleck and W. Strasser. Smart camera based monitoring system and its application to
assisted living. Proceedings of the IEEE, 96(10):1698–1714, 2008.
6. L. Dubowitz and V. Dubowitz and E. Mercuri. The Neurological Assessment of the Preterm
and Full Term Infant. Clinics in Developmental Medicine, London, Heinemann, 9, 2000.
7. D. P. Dogra, A. K. Majumdar, S. Sural, J. Mukherjee, S. Mukherjee, and A. Singh. Automatic
adductors angle measurement for neurological assessment of post-neonatal infants during
follow up. Pattern Recognition and Machine Intelligence, Lecture Notes in Computer Science,
6744:160–166, 2011.
8. D. P. Dogra, A. K. Majumdar, S. Sural, J. Mukherjee, S. Mukherjee, and A. Singh. Toward
automating hammersmith pulled-to-sit examination of infants using feature point based video
object tracking. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 20
(1):38–47, 2012.
9. D. P. Dogra, V. Badri, A. K. Majumdar, S. Sural, J. Mukherjee, S. Mukherjee, and A. Singh.
Video analysis of hammersmith lateral tilting examination using kalman filter guided
multi-path tracking. Medical & biological engineering & computing, 52(9):759–772, 2014.
10. Z. Guo and R. W. Hall. Parallel thinning with two-subiteration algorithms. Communications
of the ACM, 32(3):359–373, 1989.
11. T.Y. Zhang and C.Y. Suen. A Fast Parallel Algorithm for Thinning Digital Patterns.
Communications of the ACM, 27(3):236–239, 1984.
12. D. Chai and K. Ngan. Face segmentation using skin-color map in videophone applications.
IEEE Trans. on Circuits and Systems for Video Technology, 9(4):551–564, 1999.
Multi-oriented Text Detection from Video
Using Sub-pixel Mapping
A. Mittal (✉)
Department of Civil Engineering, Indian Institute of Technology Roorkee,
Roorkee 247667, India
e-mail: [email protected]
P.P. Roy ⋅ B. Raman
Department of Computer Science and Engineering,
Indian Institute of Technology Roorkee, Roorkee 247667, India
e-mail: [email protected]
B. Raman
e-mail: [email protected]
1 Introduction
With the evolution of mobile devices and the emergence of new concepts like augmented
reality, text detection has become a trending topic in recent years. The increase in mobile
devices and their applications [4], including Android and iPhone platforms
that can translate text into different languages in real time, has stimulated
renewed interest in the problem. Text is one of the most expressive means of communication,
and it can be embedded into scenes or documents as a means of communicating information.
The collection of huge amounts of street view data is one of the driving applications.
To recognize text information from scene images or video data, the text needs to be
segmented before being fed to an OCR engine. OCR typically achieves recognition accuracy
higher than 99 % on printed and scanned documents [5], but text detection and
recognition remain difficult on low-quality or degraded data. Variations of text layout,
cluttered backgrounds, illumination, different font styles, low resolution and
multilingual content present a greater challenge than clean, well-formatted
documents.
Given the large number of applications of text detection, text segmentation becomes
an important component, and multi-oriented, low resolution text remains a problem.
The method proposed in this paper deals with segmenting multi-oriented, low resolution
text into its fundamental units.
2 Proposed Methodology
The problem of detecting text in images and video frames has been addressed by
Chen et al. [6]. The authors proposed connected component analysis based text
detection, but in the presence of low resolution, multi-oriented and multi-size
text, the recognition performance dropped to 0.46 [7]. The performance
improved with a neural network classifier, but due to the lack of rotational
invariance it loses the ability to detect multilingual text. Inspired by these problems,
we propose an algorithm for detecting text and non-text pixel clusters and for
segmenting text clusters into individual units for recognition. As proposed by
Khare et al. [3], instead of HOG we use HOM as a discriminator for
detecting possible text in a frame, as HOM uses both spatial and intensity values.
HOM was modified such that the moment and rotation are marked at the centroid
of each individual block, regardless of the centroid of the connected
component. For low resolution videos, we use sub-pixel mapping based on
CIE colorimetry [1]. Figure 1 shows the stages of our algorithm, with each indi-
vidual stage explained later.
2.1 Preprocessing
Edges are a physical property of objects in an image that distinguishes one
object from another. Text objects generally have high contrast with the
background (although other objects may also have high contrast), so considering
edges in segmentation increases accuracy. Enhancing edges using morphological
operations ensures that small gaps of approximately 3 px are closed. This process
ensures smooth contours (edges) around each object. For a given video, we parse
the frames and enhance the edges by sharpening the images (Fig. 2b).
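A hedged OpenCV sketch of this preprocessing (sharpening, edge detection and a morphological closing that bridges gaps of roughly 3 px) is shown below; the kernel sizes and Canny thresholds are assumptions, not the paper's exact values.

```python
import cv2
import numpy as np

def enhance_edges(frame_bgr):
    """Sharpen a frame, extract edges, and close small (~3 px) gaps in the contours."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # Unsharp masking to sharpen the image before edge detection.
    blur = cv2.GaussianBlur(gray, (0, 0), sigmaX=3)
    sharp = cv2.addWeighted(gray, 1.5, blur, -0.5, 0)
    edges = cv2.Canny(sharp, 100, 200)
    # Morphological closing with a 3x3 kernel bridges gaps of roughly 3 px.
    kernel = np.ones((3, 3), np.uint8)
    return cv2.morphologyEx(edges, cv2.MORPH_CLOSE, kernel, iterations=1)
```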
Fig. 2 Output for each stage: a Input image. b Edge detection. c Stroke width transformation.
d Output
Contours are detected and a hierarchy is formed, i.e., parent and child contours
are defined: for any contour bounding another contour, the former is called the
parent and the latter the child. To remove unnecessary noise, a parent contour
having more than one child contour is removed, keeping its children in the system.
Since our primary target is to detect text in a video frame, we define parameters such as
solidity, aspect ratio and area of the contour region. Contours within the threshold values
are preserved and obscure contours are eradicated, leaving only regions with a high
probability of being text objects. The Stroke Width Transform (SWT), as noted and used by
Epshtein et al. [8], is a transform that can be used to differentiate between text and
non-text with a very high level of confidence. Chen et al. [6] introduce a stroke
width operator, which uses the distance transform to compute the SWT.
The SWT value at any point in the image is the width of the stroke to which the point
most probably belongs.
Since text in images almost always has a uniform stroke width, this step removes noise
to a large extent (Fig. 2c).
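The shape-based filtering can be sketched as follows with OpenCV (assuming OpenCV version 4 or later for the findContours signature); the thresholds are illustrative assumptions and the parent/child hierarchy rule is omitted for brevity.

```python
import cv2

def candidate_text_contours(edge_img, min_area=50, max_aspect=10.0, min_solidity=0.3):
    """Keep contours whose area, aspect ratio and solidity suggest a text component."""
    contours, hierarchy = cv2.findContours(edge_img, cv2.RETR_TREE,
                                           cv2.CHAIN_APPROX_SIMPLE)
    keep = []
    for cnt in contours:
        area = cv2.contourArea(cnt)
        if area < min_area:
            continue
        x, y, w, h = cv2.boundingRect(cnt)
        aspect = max(w, h) / max(1.0, min(w, h))
        hull_area = cv2.contourArea(cv2.convexHull(cnt))
        solidity = area / hull_area if hull_area > 0 else 0.0
        if aspect <= max_aspect and solidity >= min_solidity:
            keep.append(cnt)
    return keep
```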
In this paper we propose a new way of segmenting low resolution video
frames by super-resolving the relevant portion of the masked image for segmentation of the cluster.
Fig. 3 Chromaticity plot for input and processes image showing error inclusion during up scaling
and subsequent removal after processing: a Sample input. b Up scaled image. c Processed image
Fig. 4 Output of super resolution stage: a Input image (Text region from the first iteration).
b Super resolved image. c Segmented image
Fig. 5 Output for HOM classifier. a Input region. b Output values (Green = Positive
Red = Negative). c Orientation of moments
Using CIE XYZ colorimetry converted to the xyY color space, as in [1], we map each
test pixel to the nearby original pixel that is closest to it, replace it with the
original pixel at least distance, and segment the enhanced image again until a single
segmented object is obtained. Each segmented part is up-scaled and super-resolved
using the CIE XYZ to xyY colorimetry based on Euclidean distance (Fig. 3). This ensures
that no additional information, i.e. noise, is introduced. It increases the recall very
significantly and hence improves our accuracy manifold. See Figs. 3 and 4.
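For illustration, the sketch below converts RGB values to CIE xyY (using the standard sRGB/D65 matrix, which may differ from the colorimetry of [1], and ignoring gamma for brevity) and snaps each up-scaled pixel to the closest original pixel by Euclidean distance.

```python
import numpy as np

# Standard sRGB (D65) to CIE XYZ matrix; the paper's exact colorimetry may differ.
RGB2XYZ = np.array([[0.4124, 0.3576, 0.1805],
                    [0.2126, 0.7152, 0.0722],
                    [0.0193, 0.1192, 0.9505]])

def rgb_to_xyY(rgb):
    """Convert an (N, 3) array of RGB values in [0, 1] to CIE xyY coordinates."""
    xyz = rgb @ RGB2XYZ.T
    s = xyz.sum(axis=1, keepdims=True)
    s[s == 0] = 1e-9                       # avoid division by zero for black pixels
    xy = xyz[:, :2] / s
    return np.hstack([xy, xyz[:, 1:2]])    # (x, y, Y)

def snap_to_originals(test_rgb, original_rgb):
    """Map each up-scaled (test) pixel to the closest original pixel in xyY space."""
    t, o = rgb_to_xyY(test_rgb), rgb_to_xyY(original_rgb)
    d = np.linalg.norm(t[:, None, :] - o[None, :, :], axis=2)
    return original_rgb[np.argmin(d, axis=1)]
```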
Fig. 6 Output script independent text segmentation. a Input image. b Edge detection. c Stroke
width transformation. d Initial text regions. e Output image
Fig. 7 Output summary for video text segmentation. a Initial input frame. b Edge detection and
enhanced. c Possible text region and removing noises. d Separating background from foreground
using MOG. e Intermediate output frame. f Possible text region from stroke width transform.
g Super resolving text region. h Segmentation of super resolved masked image. i Text regions after
removal of noises using SWT and SVM classifier. j Final output after SVM and SWT denoising
with green color as stable text and blue color as moving text
Fig. 8 Comparison of results obtained from Chen et al. [6] ’s method, images obtained by Stroke
Width Transform [2] and our method. Images are of ICDAR datasets
The components that survive the previous stage have a very high probability of being
text. However, some common patterns get past the previous filtering stage by virtue
of being extremely text-like, for example, arrows on sign boards or repeating urban
patterns (Fig. 9, input 4). These components are discarded using a classifier trained to
differentiate between text and non-text. The feature vector used in the proposed method
consists of a scaled-down version of the component (to fit in a constant small area),
the orientation of moments using the HOM descriptor proposed by Khare et al. [3],
the xy values (where (x, y) is the position of each foreground pixel), and the aspect
ratio. For our experiments, we used a Support Vector Machine (SVM) classifier with a
Radial Basis Function (RBF) kernel. The dataset used was taken from ICDAR 2013 [9].
Figure 5 shows the orientation of moments for text as well as non-text clusters: green
lines show orientation towards the centroid and red lines show orientation away from
the centroid.
Fig. 9 Output for multi-oriented script independent text segmentation. a Input image. b Text
regions. c Output Image
3 Experimental Results
In this paper we have proposed a novel algorithm for multi-oriented text seg-
mentation from low resolution images and video frames using iterative super resolution
and the Stroke Width Transform (Figs. 8 and 9). Super resolution of regions with a
high probability of containing text increases the efficiency with only a marginal increase in
running time (tested on Ubuntu 15.1, Core i5, 6 GB RAM at 20 frames per second).
References
1. Liu, Jiang Hao, and Shao Hong Gao. “Research on Chromaticity Characterization Methods of
the Ink Trapping.” Applied Mechanics and Materials. Vol. 262. 2013.
2. B. Epshtein, E. Ofek, and Y. Wexler. “ Detecting text in natural scenes with stroke width
transform.” In CVPR, 2010.
3. Vijeta Khare, Palaiahnakote Shivakumara, Paramesran Raveendran “A new Histogram
Oriented Moments descriptor for multi-oriented moving text detection in video” Volume 42,
Issue 21, 30 November 2015, Pages 7627–7640.
4. C. Liu, C. Wang, and R. Dai, “Text detection in images based on unsupervised classification
of edge-based features,” in Proc. IEEE Int. Conf. Doc. Anal. Recognit., 2005, pp. 610–614.
Abstract This paper presents a novel and efficient approach to improve the performance
of recognizing human actions from video by using an unorthodox combination of
stage-level approaches. Feature descriptors obtained from dense trajectories, i.e. HOG,
HOF and MBH, are known to be successful in representing videos. In this work, a
Fisher Vector encoding with reduced dimension is obtained separately for each of
these descriptors, and all of them are concatenated to form one super vector repre-
senting each video. To limit the dimension of this super vector, we include only the first
order statistics, computed by the Gaussian Mixture Model, in the individual Fisher
Vectors. Finally, we use the elements of this super vector as inputs to a
Deep Belief Network (DBN) classifier. The performance of this setup is evaluated
on the KTH and Weizmann datasets. Experimental results show a significant improve-
ment on these datasets: accuracies of 98.92 and 100 % have been obtained on the KTH
and Weizmann datasets respectively.
1 Introduction
P. Dhar (✉)
Department of CSE, IEM Kolkata, India
e-mail: [email protected]
J.M. Alvarez
NICTA, Sydney, Australia
P.P. Roy
Department of CSE, IIT, Roorkee, India
extracting features from the video. (ii) Aggregating the extracted feature descrip-
tors and obtaining an encoding for each video, for the task of action localization.
(iii) Training a system based on the encoding of the training videos and using it to
classify the encoding of the test videos.
There exist many works on action recognition [1, 5, 6, 18]. We propose here an
efficient pipeline that uses dense trajectory features for action recognition, encod-
ing each descriptor in the form of a super vector with reduced dimension and using a
DBN to classify these encodings. We test our framework with every possible combination
of feature descriptors, to find the combination which yields the best recog-
nition accuracy. Our pipeline consists of all of the above-mentioned stages. Here, we
compare several approaches used in previous work for each stage, and use the best
known approach in the pipeline. The contributions of this paper are:
1. An efficient pipeline, which contains a fusion of the best known stage-level
approaches. Such stage-level approaches have never been amalgamated in any
known previous experiment.
2. Usage of Fisher Vectors without including second order statistics, for the task of
feature encoding. This helps to bring down the computation cost for the classifier,
and achieve competitive recognition accuracy.
2 Related Work
In this section we review the most relevant approaches to our pipeline. The selection
of related approaches for each stage of our pipeline is based on empirical results. We
do the following comparisons in order to select the best known stage level approaches
in our pipeline.
Local features vs Trajectory features: In this comparison, we tend to decide the
type of features which are to be extracted from videos. Using local features has
become a popular way for representing videos but, as mentioned in [20], there is
a considerable difference between 2D space field and 1D time field. Hence, it would
be unwise to detect interest points in a knotted 3D field.
KLT trajectory vs Dense trajectory: Having decided to include trajectory features
in our framework, we now compare results obtained by KLT and Dense Trajecto-
ries, in previous experiments. In some of the previous works trajectories were often
obtained by using KLT tracker which tracked sparse interest points. But in a recent
experiment by Wang et al. [21], dense sampling proved to outperform sparse interest
points, for recognizing actions.
Fisher Vector vs Other encoding approaches: For action recognition and localiza-
tion Fisher Vectors have recently been investigated, and shown to yield state-of-the-
art performance. In a recent work by Sun et al. [17], using Fisher Vectors produced
better results as compared to Bag-of-Words on a test dataset which contained dissim-
ilar videos of different quality. It was also established here that the usage of Fisher
Fig. 1 For selecting an approach for a single stage of the pipeline, we compare several stage level
approaches used in previous frameworks, and select the best known approach
3 Proposed Framework
The detailed version of our pipeline is shown in Fig. 2. Here, firstly the low-level
features are extracted and after that feature pre-processing is done. This is required
for codebook generation. It is used for feature encoding. Here, a generative model is
used to capture the probability distribution of the features. Gaussian Mixture Model
(GMM) is used for this task. The feature encodings are obtained using Fisher Vectors,
after which the obtained super-vectors are classified by DBN. The steps have been
discussed in detail below.
In this experiment, we extract all feature points and their dense trajectory aligned
feature descriptors from all the videos, using the public available code. Approxi-
mately, 3000 feature points are obtained from each video. Each feature point can be
represented using these descriptors: HOG, HOF, MBHx, MBHy. HOG deploys the
angular-binning of gradient orientations of an image cell. It saves static information
of the image. HOF and MBH provide details regarding motion using optical flow.
HOF quantifies the direction of flow vectors. MBH divides the optical flow into its
horizontal and vertical components (MBHx and MBHy), and discretizes the deriva-
tives of every component. The dimensions of these descriptors are 96 for HOG, 108
for HOF and 192 for MBH (96 for MBHx and 96 for MBHy). After that, we apply
Principal Component Analysis (PCA) to each of these descriptors to halve its dimen-
sion. The number of dimensions retained after PCA was chosen so that almost 95 % of
the total information was captured (i.e., the minimum number of dimensions whose
cumulative energy content is at least 0.95). So the dimensions of the HOG, HOF, MBHx
and MBHy descriptors representing a single feature point become 48, 54, 48 and 48, respectively.
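A minimal scikit-learn sketch of this dimensionality selection (keep the fewest principal components whose cumulative energy reaches 95 %) is given below; the random array only stands in for real HOG descriptors, for which the paper reports roughly half the dimensions being retained.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_descriptors(descriptors, energy=0.95):
    """Project descriptors (N x D) onto the fewest components keeping `energy` variance."""
    pca = PCA().fit(descriptors)
    cum = np.cumsum(pca.explained_variance_ratio_)
    k = int(np.searchsorted(cum, energy) + 1)   # smallest k with cumulative energy >= 0.95
    return PCA(n_components=k).fit_transform(descriptors), k

# Placeholder input; in the paper, a 96-dimensional descriptor reduces to about 48 dimensions.
hog = np.random.rand(3000, 96)
reduced, k = reduce_descriptors(hog)
```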
the features are used to create fingerprints of each video. For this task we use the
Fisher Vector (FV) encoding.
For each mode k, we consider the mean and covariance deviation vectors
$$u_{jk} = \frac{1}{N\sqrt{\pi_k}} \sum_{i=1}^{N} q_{ik}\, \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}}, \qquad v_{jk} = \frac{1}{N\sqrt{2\pi_k}} \sum_{i=1}^{N} q_{ik} \left[ \left( \frac{x_{ji} - \mu_{jk}}{\sigma_{jk}} \right)^2 - 1 \right] \qquad (2)$$
In this experiment, we aim to form a separate Fisher Vector Encoding for each
descriptor. In each case, we train a Gaussian Mixture Model (GMM) to obtain the
codebook. For Fisher encoding, we choose the number of modes as k = 64 and sam-
ple a subset of 64,000 features from the training videos to estimate the GMM.
After the training is finished the next task is to encode the feature descriptors into
one fisher Vector per video. We choose to include only the derivatives with respect
to Gaussian mean (i.e. only uk ) in the FV, which results in an FV of size D ∗ 64,
where D is the dimensionality of the feature descriptor. Including only the first order
statistics in the Fisher Vector limits its dimension to half of the dimension of a con-
ventional Fisher Vector. Thus, we conduct FV encoding for each kind of descriptor
independently and the resulting super vectors are normalized by power and L2 nor-
malization. Finally, these normalized Fisher Vectors are concatenated to form a super
vector which represents the motion information for the training video. Hence we have
one super vector representing for a single training video. It must be noted that if con-
ventional Fisher Vectors were used, the dimension of the super vector would have
been twice as large. Using the same PCA matrix and GMM parameters that were
obtained from training data, we obtain four FV encodings with reduced dimension
(each representing one descriptor), for each test video also. Finally, we concatenate
these four fisher encodings (in a similar fashion). So, now we have one super vec-
tor representing any single video (both test and train). The dimension of each Super
Vector is (48 + 54 + 48 + 48) ∗ 64 = 12672. It should be noted that, for the new
super vector of every single video, the first 3072 (48 ∗ 64) values represent the HOG
descriptor, the next 3456 (54 ∗ 64) values represent HOF, the next 3072 (48 ∗ 64)
values represent MBHx and the next 3072 (48 ∗ 64) represent MBHy descriptor.
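The reduced (first-order-only) Fisher Vector of Eq. (2), followed by power and L2 normalisation, can be sketched as below; the GMM comes from scikit-learn and the random arrays are placeholders for the PCA-reduced descriptors, so this is an illustration rather than the authors' implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def first_order_fisher_vector(gmm, feats):
    """Reduced FV: only the mean derivatives u_jk of Eq. (2), power + L2 normalised."""
    q = gmm.predict_proba(feats)                         # (N, K) soft assignments q_ik
    mu, var, pi = gmm.means_, gmm.covariances_, gmm.weights_
    N = feats.shape[0]
    diff = (feats[:, None, :] - mu[None, :, :]) / np.sqrt(var)[None, :, :]
    u = (q[:, :, None] * diff).sum(axis=0) / (N * np.sqrt(pi)[:, None])   # (K, D)
    fv = u.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))               # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)             # L2 normalisation

# Placeholder data: 64-mode GMM on PCA-reduced (48-D) descriptors, then one FV per video.
gmm = GaussianMixture(n_components=64, covariance_type='diag')
gmm.fit(np.random.rand(10000, 48))
video_fv = first_order_fisher_vector(gmm, np.random.rand(3000, 48))   # length 48 * 64
```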
The obtained testing and training super vectors are then fed to the classifiers. The
classification has been performed by a Deep Belief Network. A deep belief network
(DBN) is a generative graphical model consisting of multiple layers of hidden units;
connections exist between the layers, but not between the units within a layer. Such
models learn to extract an ordered representation of the training data.
Detailed information about the LSTM-RNN is presented in Sect. 4.2. More
details about the evaluation process are provided in Sect. 4.3.
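Since there is no single standard DBN implementation in common Python libraries, the following is only a rough, hedged stand-in: two stacked Bernoulli RBMs (70 hidden units each, matching the paper's configuration) followed by a logistic output layer in a scikit-learn pipeline. A true DBN would additionally perform generative pre-training with fine-tuning, and the placeholder data and hyper-parameters here are assumptions.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Rough stand-in for the DBN: two stacked RBMs (70 hidden units each, as in the paper)
# followed by a logistic output layer.
dbn_like = Pipeline([
    ('scale', MinMaxScaler()),                      # RBMs expect inputs in [0, 1]
    ('rbm1', BernoulliRBM(n_components=70, learning_rate=0.05, n_iter=20)),
    ('rbm2', BernoulliRBM(n_components=70, learning_rate=0.05, n_iter=20)),
    ('clf', LogisticRegression(max_iter=1000)),
])

# X_* hold the 12672-dimensional super vectors, y_* the action labels (placeholders).
X_train, y_train = np.random.rand(383, 12672), np.random.randint(0, 6, 383)
X_test, y_test = np.random.rand(216, 12672), np.random.randint(0, 6, 216)
dbn_like.fit(X_train, y_train)
print('accuracy:', dbn_like.score(X_test, y_test))
```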
4 Experimental Results
4.1 Datasets
The KTH dataset was introduced by Schuldt et al. [16] in 2004 and has been
commonly used to evaluate models where handcrafted feature extraction is required.
Every video contains exactly one of 6 actions: walking, jogging, running, box-
ing, hand-waving and hand-clapping, performed by 25 subjects in 4 different scenar-
ios. Each person performs the same action 3 or 4 times in the same video, with
some empty frames between consecutive action sequences. The dataset con-
tains 599 videos. 383 videos (performed by 9 subjects: 2, 3, 5, 6, 7, 8, 9, 10, and 22)
are used for training and the remaining 216 videos (performed by the rest of the subjects)
are used for testing.
The Weizmann dataset includes 10 actions: running, walking, skipping, jumping-
jack, jumping-forward-on-two-legs, jumping-in-place-on-two-legs, galloping side-
ways, waving-two-hands, waving-one-hand and bending, performed by nine subjects
[3]. It contains 93 videos in all, each of which contains exactly one action sequence.
In our work, we train a DBN classifier with two hidden layers for classification of the
testing super vectors. Each value in the training super vector acts as an input, which
is fed to the DBN for training. The input dimension of the DBN is varied depending
on the feature (or combination of features) under consideration. We choose to use 70
hidden units in each of the hidden layers.
As mentioned earlier, the KTH training set consists of 383 videos; each of them is
represented by an FV of length 12672. We trained the DBN for several iterations over
these 383 training vectors and tested it on 216 testing vectors. In case of Weizmann
dataset, we apply Leave-one-out validation approach. Here, a single subject is used
for testing, while all other subjects are used to train the DBN. Each of the subjects
in Weizmann dataset is used for testing once. The configuration of the DBN is kept
the same for both datasets.
As a comparison, we have also tested our ensemble using the LSTM-RNN classifier
[7]. Recurrent Neural Networks (RNN) are commonly used for analysis of sequen-
tial data, because of the usage of time-steps. A time step is a hidden state of RNN,
whose value is a function of the previous hidden state and the current input vector.
Long Short Term Memory (LSTM) is a special variant of RNN, where along with
the hidden states, special cell states are also defined for every time step. The value of
the current cell state is a function of the input vector and the previous cell state. The
output value of the hidden state is a function of the current cell state. LSTM-RNNs
provide additive interactions over conventional RNNs, which help to tackle the van-
ishing gradient problem, while backpropagating. We use an LSTM-RNN architec-
ture with one hidden layer of LSTM cells. Here also, the input dimension of the RNN
is varied depending on the feature (or combination of features) under consideration.
There exists a full connection between LSTM cells and input layer. Also the LSTM
cells have recurrent connections with all the LSTM cells. The softmax output layer
is connected to the LSTM outputs at each time step. We experimented by vary-
ing the number of hidden LSTM cells in the network; a configuration of 70 LSTM cells was
found to be optimal for classification. The backpropagation algorithm was used to train
the LSTM-RNN model. The configuration of LSTM-RNN used for both datasets is
the same.
Table 1 Comparison of action-wise recognition accuracy for KTH dataset obtained by using LSTM-RNN and DBN classifiers and by using different features
or combination of features. Best results are obtained when a combination of HOG, HOF and MBH is used
ACTION HOG HOF MBH HOG+HOF HOG+MBH HOF+MBH HOG+HOF+MBH
_ RNN DBN RNN DBN RNN DBN RNN DBN RNN DBN RNN DBN RNN DBN
Boxing 97.69 95.37 98.15 99.53 98.61 100.00 97.11 99.53 98.15 100.00 98.61 100.00 97.69 100.00
Walking 97.22 99.07 99.54 100.00 99.54 100.00 99.30 100.00 98.61 100.00 98.15 100.00 98.84 100.00
Running 94.91 96.30 94.44 97.22 96.30 98.15 96.30 96.30 96.30 97.69 95.37 97.22 97.22 97.22
Jogging 94.44 97.22 96.30 96.76 97.22 99.53 96.52 100.00 96.76 100.00 97.22 98.61 96.76 100.00
Handclapping 89.81 94.90 94.44 98.61 94.44 98.15 94.91 97.22 94.44 98.61 95.14 98.61 95.14 98.61
Handwaving 93.52 93.06 94.91 96.76 95.37 97.22 96.76 96.76 97.22 96.76 96.53 98.15 96.06 97.69
Accuracy 94.60 95.99 96.29 98.15 96.92 98.84 96.82 98.30 96.91 98.84 96.84 98.76 96.95 98.92
Fig. 3 Performance evaluation of proposed framework using different features (or combination of
features), using different classifiers on KTH dataset
We evaluate the results obtained by using different features (or different combinations
of features) for both the KTH and Weizmann datasets. We report the accuracy averaged
across 10 trials. The values of the training super vectors are used as inputs to the
Deep Belief Network (DBN) classifier. We also test the setup with every possible
combination of feature descriptors, so as to detect the combination which yields the best
recognition accuracy. Following this, the entire setup is again tested using the
LSTM-RNN classifier and the results obtained are compared with those obtained
with the DBN classifier.
KTH The recognition results for the KTH dataset are shown in Table 1. It is observed
that, for both the LSTM-RNN and DBN classifiers, maximum accuracy is obtained when
a combination of all the dense trajectory features is used. In this case, we obtain a
recognition accuracy of 98.92 % using the DBN classifier and 96.95 % using the LSTM-RNN
classifier. Since running and jogging are very similar actions, the
system wrongly classifies about 3 % of the testing data for running as jogging.
It can also be inferred from the graph in Fig. 3 that, for both classifiers, the relative
order of recognition accuracy obtained for a particular feature or a combination of
features is the same.
Weizmann In the case of the Weizmann dataset, no variation in recognition accuracy is
observed for the different features (or their combinations) when using
the DBN classifier: for all combinations of features, we obtain a perfect accuracy
of 100 %. When using the LSTM-RNN classifier, a negligible variance of results is
observed and an average accuracy of 98.96 % is achieved.
Hence, we conclude that the best recognition accuracy is obtained when a combi-
nation of HOG, HOF and MBH is used with a Deep Belief Network classifier.
For this setup, we report an average accuracy of 98.92 % for KTH and 100 % for
Table 2 Comparison of recognition accuracy of (a) KTH and (b) Weizmann dataset obtained by
our pipeline with those obtained in previous work
METHOD Accuracy (%)
(a)
Our method 98.92
Baccouche et al. [1] 94.39
Kovashka et al. [9] 94.53
Gao et al. [6] 95.04
Liu et al. [11] 93.80
Bregonzio et al. [4] 93.17
Sun et al. [19] 94.0
Schindler and Gool [15] 92.70
Liu and Shah [12] 94.20
(b)
Our method 100
Bregonzio et al. [4] 96.6
Sun et al. [18] 100.00
Ikizler et al. [8] 100
Weinland et al. [23] 93.6
Fathi et al. [5] 100
Sadek et al. [14] 97.8
Lin et al. [10] 100
Wang et al. [22] 100
Fig. 4 Running and Jogging actions from KTH dataset. Since both the actions have similar trajec-
tories, a considerable amount of test data for running is classified incorrectly as jogging
Weizmann dataset. Using the best average accuracy obtained, we compare
our results with some of the previous work on the same datasets (using the
same number of samples) in Table 2a, b (Fig. 4).
In this paper, we have proposed a framework, where for each stage, the best known
approach has been used. Such a fusion of the best possible approaches has proved to
be highly successful in the task of action classification. Experimental results show
that the proposed pipeline gives competitive results, both on KTH (98.92 %) and
Weizmann (100 %). As future work, we will examine a similar setup, where for fea-
ture encoding, only second order statistics would be encoded in the respective Fisher
Vector of each descriptor. This would help us to evaluate the relative importance
of Gaussian mean and co-variance, computed by Gaussian Mixture Model. Also,
recent works are shifting towards other challenging video datasets, which contain
in-the-wild videos. Therefore, we aim to confirm the generality of our approach by
evaluating it on recent datasets, e.g. UCF sports, UCF-101, J-HMDB etc. Also, in
the near future, we aim to investigate deep learning methods to classify actions in
videos, in order to automate the process of feature learning.
References
1. Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Sequential deep learning for
human action recognition. In: Human Behavior Understanding, pp. 29–39. Springer (2011)
2. Bengio, Y., LeCun, Y., et al.: Scaling learning algorithms towards ai. Large-scale kernel
machines 34(5) (2007)
3. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes.
In: The Tenth IEEE International Conference on Computer Vision (ICCV’05). pp. 1395–1402
(2005)
4. Bregonzio, M., Gong, S., Xiang, T.: Recognising action as clouds of space-time interest points.
In: IEEE Conference on Computer Vision and Pattern Recognition, 2009. CVPR 2009. pp.
1948–1955. IEEE (2009)
5. Fathi, A., Mori, G.: Action recognition by learning mid-level motion features. In: IEEE Confer-
ence on Computer Vision and Pattern Recognition, 2008. CVPR 2008. pp. 1–8. IEEE (2008)
6. Gao, Z., Chen, M.Y., Hauptmann, A.G., Cai, A.: Comparing evaluation protocols on the kth
dataset. In: Human Behavior Understanding, pp. 88–100. Springer (2010)
7. Gers, F.A., Schraudolph, N.N., Schmidhuber, J.: Learning precise timing with lstm recurrent
networks. The Journal of Machine Learning Research 3, 115–143 (2003)
8. Ikizler, N., Duygulu, P.: Histogram of oriented rectangles: A new pose descriptor for human
action recognition. Image and Vision Computing 27(10), 1515–1526 (2009)
9. Kovashka, A., Grauman, K.: Learning a hierarchy of discriminative space-time neighborhood
features for human action recognition. In: 2010 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR). pp. 2046–2053. IEEE (2010)
10. Lin, Z., Jiang, Z., Davis, L.S.: Recognizing actions by shape-motion prototype trees. In: 2009
IEEE 12th International Conference on Computer Vision,. pp. 444–451. IEEE (2009)
11. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos in the wild. In: IEEE
Conference on Computer Vision and Pattern Recognition, 2009. CVPR 2009. pp. 1996–2003.
IEEE (2009)
12. Liu, J., Shah, M.: Learning human actions via information maximization. In: IEEE Conference
on Computer Vision and Pattern Recognition, 2008. CVPR 2008. pp. 1–8. IEEE (2008)
13. Oneata, D., Verbeek, J., Schmid, C.: Action and event recognition with fisher vectors on a
compact feature set. In: 2013 IEEE International Conference on Computer Vision (ICCV). pp.
1817–1824. IEEE (2013)
14. Sadek, S., Al-Hamadi, A., Michaelis, B., Sayed, U.: An action recognition scheme using fuzzy
log-polar histogram and temporal self-similarity. EURASIP Journal on Advances in Signal
Processing 2011(1), 540375 (2011)
15. Schindler, K., Van Gool, L.: Action snippets: How many frames does human action recognition
require? In: IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR 2008.
pp. 1–8. IEEE (2008)
16. Schüldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local svm approach. In:
Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004.
vol. 3, pp. 32–36. IEEE (2004)
17. Sun, C., Nevatia, R.: Large-scale web video event classification by use of fisher vectors. In:
2013 IEEE Workshop on Applications of Computer Vision (WACV). pp. 15–22. IEEE (2013)
18. Sun, C., Junejo, I., Foroosh, H.: Action recognition using rank-1 approximation of joint self-
similarity volume. In: 2011 IEEE International Conference on Computer Vision (ICCV). pp.
1007–1012. IEEE (2011)
19. Sun, X., Chen, M., Hauptmann, A.: Action recognition via local descriptors and holistic fea-
tures. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Workshops, 2009. CVPR Workshops 2009. pp. 58–65. IEEE (2009)
20. Wang, H., Kläser, A., Schmid, C., Liu, C.L.: Action recognition by dense trajectories. In:
2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),. pp. 3169–3176.
IEEE (2011)
21. Wang, H., Ullah, M.M., Klaser, A., Laptev, I., Schmid, C.: Evaluation of local spatio-
temporal features for action recognition. In: BMVC 2009-British Machine Vision Conference.
pp. 124–1. BMVA Press (2009)
22. Wang, Y., Mori, G.: Human action recognition by semilatent topic models. IEEE Transactions
on Pattern Analysis and Machine Intelligence 31(10), 1762–1774 (2009)
23. Weinland, D., Boyer, E.: Action recognition using exemplar-based embedding. In: IEEE Con-
ference on Computer Vision and Pattern Recognition, 2008. CVPR 2008. pp. 1–7. IEEE (2008)
Detection Algorithm for Copy-Move Forgery
Based on Circle Block
Abstract Today, many software tools are available that can easily manipulate images and alter their original content. A technique commonly used to tamper with an image without leaving any microscopic evidence is copy-move forgery. Many techniques exist to detect image tampering, but their computational complexity is high. Here we present a robust and effective technique to find the tampered region. Initially, the given image is divided into fixed-size blocks and DCT is applied on each block for feature extraction. A circle is used to represent each transformed block with two feature vectors. In this way we reduce the dimension of the blocks for extracting the feature vectors. Lexicographical sorting is then applied to the extracted feature vectors, and a matching algorithm is applied to detect the tampered regions. Results show that our algorithm is robust and has lower computational complexity than existing methods.
1 Introduction
These days many software packages (e.g. Photoshop) and applications are available that can be used to edit a picture comfortably and modify it without any noticeable evidence. That is why it is difficult to discern whether a given image is original or forged. This causes many problems, for example in insurance claims, courtroom evidence and scientific scams. One famous example is shown in Fig. 1: one of the cosmonauts, Grigoriy Nelyubov, from the Russian team that completed an orbit of the Earth for the first time in 1961, led by Yuri Gagarin, was removed from a photo of the team taken after their journey. He was later removed from the team after being found guilty of misbehaving [1].
Fig. 1 The 1961 cosmonauts photograph: the left image is original and the right one is tampered
There are many types of image tampering, among which copy-move forgery is widely used, where a region of the image is copied and pasted to cover another object or area of the same image. Previously, many methods have been developed for image forgery detection based on square block matching. DCT-based features were used by Fridrich [5] for forgery detection, but the method is sensitive to variation in the duplicated region when there is additive noise in the image. Later, Huang et al. [6] reduced the feature dimension to improve performance, but their method does not succeed in detecting multiple copy-move forgeries. Popescu et al. [10] proposed a method using PCA-based features, which can detect the forgery in the presence of additive noise, but the detection accuracy is not adequate. Luo et al. [7] proposed a method in which colour features and a block intensity ratio are used, and showed the robustness of their method. Bayram et al. [2] used the Fourier-Mellin transform (FMT) to extract features for each block and projected the feature vector onto one dimension. Luo et al. [7] and Mahdian et al. [8] used blur moment invariants to locate the forged region. Pan et al. [9] used SIFT features to detect duplicated regions, which is highly robust. The methods discussed above have high computational complexity as they use quantized square blocks for matching. As the dimension of the feature vectors is high, detection efficiency is affected, especially when the image resolution and size are large.
Here we come up with an efficient and robust detection method based on enhanced DCT. Compared with the existing methods, the prime features of the proposed method are as follows:
∙ Feature vectors are reduced in dimension.
∙ Robustness against various attacks (Gaussian blurring, noise contamination, multiple copy-move).
∙ Computational complexity is low.
The rest of this paper is organized as follows. In Sect. 2 the proposed method is described in detail. Results and discussion are presented in Sect. 3, and the proposed technique is concluded in Sect. 4.
2 Proposed Method
In a copy-move forged image there are normally two identical regions. An exception is when two naturally large uniform regions, such as blue sky, are present; these are not considered. The task of a forgery detection method is to find whether the input image contains any duplicated region. Since the shape and size of the duplicated region are unknown, it is hardly possible to check every possible pair of regions with distinct shapes and sizes. It is therefore sufficient to divide the input image into fixed-size overlapping blocks and apply a matching process to locate the duplicated region. In this matching process, we first represent the blocks by their features. To make the algorithm robust and effective, a good feature extraction method is needed. Once all blocks are properly represented by features, block matching is performed. The features of matching blocks will be the same, and the feature vectors are sorted lexicographically to make the matching process more effective. In this way, the computational complexity of the proposed detection algorithm is reduced compared to existing methods.
Step 1: Let the input image I be a gray-scale image of size M × N; if the image is colour, convert it using the standard formula I = 0.299R + 0.587G + 0.114B. The input image is then split into overlapping fixed-size sub-blocks of b × b pixels, where adjacent blocks differ by one row or column. Every sub-block is denoted as bij, where i and j indicate the starting row and column of the block, respectively.
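A minimal sketch of this step (Python with NumPy is assumed here; the block size b and the one-pixel stride follow the description above, and the function names are illustrative):

```python
import numpy as np

def to_gray(rgb):
    # Standard luminance conversion I = 0.299R + 0.587G + 0.114B
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def overlapping_blocks(gray, b=8):
    # Yield every overlapping b x b sub-block b_ij together with its
    # top-left position (i, j); adjacent blocks differ by one row/column
    M, N = gray.shape
    for i in range(M - b + 1):
        for j in range(N - b + 1):
            yield (i, j), gray[i:i + b, j:j + b]
```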
Step 2: Apply the DCT on each block. This yields a coefficient matrix of the same size, which represents the corresponding block.
Step 3: Assume that the block bi, where i = 1, 2, ..., Nblocks, is of size 8 × 8; hence the coefficient matrix is also 8 × 8 and has 64 elements. Owing to the nature of the DCT, the energy is concentrated in the low-frequency coefficients, so the high-frequency coefficients are pruned. The low-frequency coefficients, extracted in zigzag order, occupy about one quarter of the DCT coefficients. For this reason, a circle block is used to represent the coefficient matrix. The circle block is divided into two semicircles along the horizontal and vertical directions, as shown in Fig. 3.
If r is the radius of the circle, the ratio between the area of the circle and the area of the block is given by

ratio = Area of Circle / Area of Block = πr²/(4r²) ≈ 0.7853    (3)

This shows that the circle block can be used to represent the block, as it covers most of the coefficients and leaves out only a few. So, in Case 1, the circle is divided into two semicircles C1 and C2 along the horizontal direction as shown in Fig. 3a, and the feature of each semicircle is obtained as in Eq. (4).
Here vi is the mean of the coefficient values belonging to each Ci. In this way two features are obtained, which can be collectively represented as a feature vector of size 1 × 2 as

V = [v1, v2]    (5)

Thus the dimension reduction is greater than in the other methods [5, 6, 10], which use feature vectors of size 1 × 64, 1 × 16 and 1 × 32 respectively, as shown in Table 2.
Case 2: Similarly, the circle is divided into two semicircles C1′ and C2′ along the vertical direction as shown in Fig. 3b. The features of C1′ and C2′, denoted v′1 and v′2, are obtained in the same way, and the resulting feature vector is represented as

V′ = [v′1, v′2]    (6)
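The circle-block feature extraction of Steps 2 and 3 can be sketched as follows (SciPy's DCT is assumed; the exact placement of the circle centre and the assignment of boundary pixels to a semicircle are assumptions not fixed by the text):

```python
import numpy as np
from scipy.fftpack import dct

def block_dct(block):
    # 2-D type-II DCT of a b x b block
    return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

def circle_features(coeff, r=4):
    # Represent the coefficient matrix by a circle of radius r and return the
    # mean coefficient of each semicircle: horizontal split V = [v1, v2]
    # and vertical split V' = [v1', v2']
    b = coeff.shape[0]
    y, x = np.mgrid[0:b, 0:b]
    cy = cx = (b - 1) / 2.0
    inside = (y - cy) ** 2 + (x - cx) ** 2 <= r ** 2
    v1 = coeff[inside & (y <= cy)].mean()   # upper semicircle
    v2 = coeff[inside & (y > cy)].mean()    # lower semicircle
    v1p = coeff[inside & (x <= cx)].mean()  # left semicircle
    v2p = coeff[inside & (x > cx)].mean()   # right semicircle
    return np.array([v1, v2]), np.array([v1p, v2p])
```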
To check whether the feature vector is robust, post-processing operations such as additive white Gaussian noise (AWGN) and Gaussian blurring are applied and analysed. Gaussian blurring affects mainly the high-frequency components and causes only a small change in the low-frequency components.
To justify the robustness of the feature vectors, we take a standard image (e.g. Baboon), randomly select an 8 × 8 block, and apply post-processing operations such as AWGN and Gaussian blurring with different parameters, as shown in Table 1.
The correlation between the original and post-processed feature data is calculated. From Table 1, it is observed that the correlation is 1.0, which indicates that the feature vectors are robust. It also indicates that the dimension reduction is successful.
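A sketch of this robustness check (SciPy/NumPy assumed; it reuses block_dct and circle_features from the sketch above, and both the noise/blur parameters and the use of a plain Pearson correlation over the concatenated features are assumptions rather than the exact protocol behind Table 1):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def block_features(block, r=4):
    v, vp = circle_features(block_dct(block), r)   # sketches above
    return np.concatenate([v, vp])

def robustness_check(block, sigma_blur=1.0, snr_db=20, seed=0):
    # Correlate the features of a block with those of its blurred / noisy versions
    block = np.asarray(block, dtype=float)
    rng = np.random.default_rng(seed)
    blurred = gaussian_filter(block, sigma=sigma_blur)
    noise_var = block.var() / (10 ** (snr_db / 10.0))        # AWGN at a given SNR
    noisy = block + rng.normal(0.0, np.sqrt(noise_var), block.shape)
    f0 = block_features(block)
    return {name: np.corrcoef(f0, block_features(p))[0, 1]
            for name, p in (('blur', blurred), ('awgn', noisy))}
```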
Step 4: The extracted feature vectors are arranged in a matrix P of dimension (M − b + 1)(N − b + 1) × 2. The matrix P is then sorted in lexicographical order. As each element is a vector, the sorted set is denoted P̂. The Euclidean distance m_match(P̂i, P̂i+j) between adjacent pairs of P̂ is computed and compared with the preset threshold Dsimilar (explained in detail in Sect. 3.1). If the distance m_match is smaller than the threshold Dsimilar, the region is marked as tampered. Mathematically, this can be represented as

m_match(P̂i, P̂i+j) = √( ∑k=1..2 (vk,i − vk,i+j)² ) < Dsimilar    (7)
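The sorting and matching step can be sketched as follows (NumPy assumed). Only immediately adjacent rows of the sorted list are compared here, and the use of Nd as a minimum spatial offset between matched blocks is an assumption about its role, not a statement of the paper's exact procedure:

```python
import numpy as np

def detect_duplicates(features, positions, d_similar=0.0015, n_d=120):
    # features:  (num_blocks, 2) array of feature vectors V = [v1, v2]
    # positions: (num_blocks, 2) array of block top-left coordinates
    order = np.lexsort((features[:, 1], features[:, 0]))    # lexicographical sort
    F, P = features[order], positions[order]
    matches = []
    for i in range(len(F) - 1):
        if np.linalg.norm(F[i] - F[i + 1]) < d_similar:      # Eq. (7)
            # discard trivially close blocks (overlapping neighbours)
            if np.linalg.norm(P[i].astype(float) - P[i + 1]) > n_d:
                matches.append((tuple(P[i]), tuple(P[i + 1])))
    return matches
```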
The experiments are performed in Matlab R2013a. All images used in this experiment are of size 256 × 256 pixels in JPG format [4] (Table 2).
Let D1 be the duplicated region and D2 the detected duplicated region. T1 and T2 are the altered region and the detected altered region, respectively. The Detection Accuracy Rate (DAR) and False Positive Rate (FPR) are calculated by the following equations:

DAR = (|D1 ∩ D2| + |T1 ∩ T2|) / (|D1| + |T1|),   FPR = (|D2 − D1| + |T2 − T1|) / (|D2| + |T2|)    (9)
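Given binary masks of the ground-truth and detected regions, Eq. (9) can be computed directly; a small sketch (NumPy, boolean masks assumed):

```python
import numpy as np

def dar_fpr(d1, d2, t1, t2):
    # d1, t1: ground-truth duplicated / altered regions (boolean masks)
    # d2, t2: detected duplicated / altered regions (boolean masks)
    dar = ((d1 & d2).sum() + (t1 & t2).sum()) / float(d1.sum() + t1.sum())
    fpr = ((d2 & ~d1).sum() + (t2 & ~t1).sum()) / float(d2.sum() + t2.sum())
    return dar, fpr
```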
Overlapping sub-blocks and the circle representation method are used for extracting the features. To set the threshold parameters, the radius of the circle is selected very carefully. For this, we take different images with duplicated regions and vary the radius of the circle from 2 to 6 in unit increments, and then choose a set of values for b, Dsimilar and Nd. When the circle radius r is set to 4 for colour images, b = 8, Dsimilar = 0.0015 and Nd = 120 are obtained. The results obtained are shown in Fig. 4. The technique is tested on a standard database [4]; all images are of size 256 × 256 for detection of the tampered region.
The results for DAR, FPR and FNR on the database [4] are averaged and shown in Table 3; the graphical representation is shown in Fig. 5. It is observed that the average DAR is more than 75 %, which signifies that the proposed technique is robust and efficient. In a few cases only 50 % of the tampered region is detected, but this can still be a positive clue for verifying the originality of the image. The FPR in most cases is zero, and hence we can conclude that the proposed technique is efficient at detecting forgeries.
Fig. 4 The forgery detection results (original image, tampered image, detection image)
It is also observed that in a few cases the FPR is positive, but the detection rate is 100 % in those cases. The FNR in most cases is less than 25 %, which shows that the detection of forgeries is accurate and efficient.
4 Conclusion
In this paper a robust and effective algorithm for copy-move tamper detection is pro-
posed. It is a passive method for tamper detection that means a priori knowledge
about the tested is not required. The feature reduction is achieved up to the mark as
compared to the existing methods [5, 6, 10] as shown in Table 2. It is also observed
that the DAR is more than 75 % in average case indicating the efficiency of the pro-
posed algorithm. The robustness of feature vectors is tested on AWGN for various
SNR levels and Gaussian blurring and it is observed that the correlation coefficient
is 1. This indicates the robustness of the proposed technique. hence, we believe that
our method is efficient and robust enough to detect the image forgery.
References
1. http://www.fourandsix.com/photo-tampering-history/tag/science
2. Bayram, S., Sencar, H.T., Memon, N.: An efficient and robust method for detecting copy-move
forgery. In: Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International
Conference on. pp. 1053–1056. IEEE (2009)
3. Cao, Y., Gao, T., Fan, L., Yang, Q.: A robust detection algorithm for copy-move forgery in
digital images. Forensic science international 214(1), 33–43 (2012)
4. Christlein, V., Riess, C., Jordan, J., Riess, C., Angelopoulou, E.: An evaluation of popular copy-
move forgery detection approaches. Information Forensics and Security, IEEE Transactions on
7(6), 1841–1854 (2012)
5. Fridrich, A.J., Soukal, B.D., Lukáš, A.J.: Detection of copy-move forgery in digital images. In:
in Proceedings of Digital Forensic Research Workshop. Citeseer (2003)
6. Huang, Y., Lu, W., Sun, W., Long, D.: Improved dct-based detection of copy-move forgery in
images. Forensic science international 206(1), 178–184 (2011)
7. Luo, W., Huang, J., Qiu, G.: Robust detection of region-duplication forgery in digital image. In:
Pattern Recognition, 2006. ICPR 2006. 18th International Conference on. vol. 4, pp. 746–749.
IEEE (2006)
8. Mahdian, B., Saic, S.: Detection of copy–move forgery using a method based on blur moment
invariants. Forensic science international 171(2), 180–189 (2007)
9. Pan, X., Lyu, S.: Detecting image region duplication using sift features. In: Acoustics Speech
and Signal Processing (ICASSP), 2010 IEEE International Conference on. pp. 1706–1709.
IEEE (2010)
10. Popescu, A., Farid, H.: Exposing digital forgeries by detecting duplicated image region [tech-
nical report]. 2004-515. Hanover, Department of Computer Science, Dartmouth College. USA
(2004)
FPGA Implementation of GMM
Algorithm for Background Subtractions
in Video Sequences
1 Introduction
processing algorithm can be applied to each individual frame. Besides, the contents of consecutive frames are visually closely related. The visual perception of a human can be modeled as a hierarchy of abstractions. At the first level are the raw pixels with RGB or brightness information. Further processing yields features such as lines, curves, colours, edges and corner regions. A next abstraction layer may interpret and combine these features as objects and their facets. At the highest level of abstraction are the human-level concepts involving one or more objects and the relationships among them. Object detection in a sequence of frames involves verifying the presence of an object in each frame and locating it precisely for recognition. There are two methods that realize object detection: one detects changes at the pixel level and the other is based on feature comparison. In the first method, a few visual features, such as a specific colour, are used to represent the object, and it is fairly easy to identify all pixels with the same colour as the object. This method is simple and fast at detecting any kind of change in the video sequence. In the latter approach, objects are difficult to detect and identify accurately because of the perceptual details of a specific person, such as different poses and illumination.
On the other hand, detecting a moving object has important significance in video object detection and tracking. Compared with object detection, moving object detection complicates the problem by adding temporal change requirements. Methods for moving object detection can be roughly classified into groups. The first group uses thresholding over the inter-frame difference. These approaches are based on the detection of temporal changes at either the pixel or block level; using a predefined threshold, the difference map is binarized to obtain the motion detection. The second group is based on statistical tests constrained to pixelwise independent decisions. In this approach detection masks and filters have been considered, but the masks cannot provide change detection that is invariant to size and illumination. The third group is based on global energy frameworks, where detection is performed using stochastic or deterministic algorithms such as mean field or simulated annealing. FPGAs are widely used for high-speed parallel data processing such as video processing, and VLSI techniques for video processing can be realized on an FPGA. Newer FPGAs provide a VGA port and an HDMI port for high-definition video processing, and camera and high-definition video source interfaces are also available (Fig. 1).
Video processing in real time is time consuming. Implementing video processing algorithms in a hardware description language offers parallelism and thus significantly reduces the processing time. The video sequences are processed frame by frame.
The paper is organized as follows. Section 2 reviews the related work done in
this area. The system model is described in Sect. 3. Results and Discussions are
presented in Sect. 4 and Conclusions are given in Sect. 5. Future work has been
discussed in Sect. 6.
2 Related Work
Many research works have been done on video processing and moving object detection. Zhang Yunchu et al. described object detection techniques for images captured at night, which have low contrast and SNR and poor distinction between the object and the background, posing additional challenges to moving object detection. A moving object detection algorithm for night surveillance based on dual-scale Approximate Median Filter background models was proposed. The moving object detection was robust at night under adverse illumination conditions, with high spatial resolution, resistance to noise and low computational complexity [1]. Bo-Hao Chen et al. proposed a moving object detection algorithm for intelligent transport systems, in which the principal component was a radial basis function used for the detection of moving objects [2]. The algorithm was developed for both high-bit-rate and low-bit-rate video streams.
Eun-Young Kang et al. proposed a bimodal Gaussian approximation method using colour features and motion for moving object detection [3]. They introduced a compound moving object detection scheme by combining motion and colour features. The accuracy of the detected objects was increased by statistical optimization in high-resolution video sequences with a moving camera and camera vibrations. The motion analysis uses motion vector information obtained from an H.264 decoder and a moving-edge map. Dedicated integrated circuits are used for real-time moving object detection in video analysis; such designed circuits are shown in the works of [4, 5].
Processing the information determines whether it contains a specific object and its exact location. This work is computationally intensive, and several attempts have been made to design hardware-based object detection algorithms. The majority of the proposed works target field-programmable gate-array (FPGA) implementations; others target application-specific integrated circuits (ASICs) or operate on images of relatively small sizes in order to achieve real-time response [6]. Bouwmans et al. proposed statistical background modeling for moving object detection [7]. They assume the background in the video is static, which implies that the video is captured by a stationary camera, like a common surveillance camera.
An implementation of the OpenCV version of the Gaussian mixture algorithm is shown in [8]. To reduce circuit complexity, the authors utilized compressed ROM and binary multipliers; an enhancement of their work was presented in [9]. The background and foreground pixels were identified based on the pixel intensity.
Xiaoyin et al. proposed a fixed-point object detection methodology using the histogram of oriented gradients (HOG) [10]. HOG delivers just 1 FPS (frame per second) on a high-end CPU but achieves high accuracy; the fixed-point detection reduces circuit complexity.
Hongbo Zhu et al. proposed row-parallel and pixel-parallel architectures for extracting motion features from moving images in real time. These architectures are based on digital pixel sensor technology. To minimize the chip area, the directional edge filtering of the input image was carried out with row-parallel processing; as a result, self-adaptive motion feature extraction was established [11]. Most of these algorithms were assessed with software implementations on a general purpose processor (GPP) [12], which is usually sufficient for verifying the function of the algorithm. However, difficulty arises in real time at the high data throughput of video analysis. Hongtu Jiang et al. describe an embedded automated video surveillance system with a video segmentation unit [13]. The segmentation algorithm is explored for potential improvements in segmentation results and hardware efficiency. A. Yilmaz et al. presented an extensive survey on object tracking methods and also reviewed methods for object detection and tracking [14], describing the context of use, degree of applicability, evaluation criteria, and qualitative comparisons of the tracking algorithms.
3 System Model
In the GMM algorithm, the moving objects are detected. The algorithm has been implemented in a hardware description language. The input frames are stored in block RAM. For each frame the background subtraction is performed using the updated parameters, and the corresponding output frames are obtained as shown in Fig. 2. For real-time object detection, the input data is captured from a camera; the captured video sources are processed and the Gaussian mixture algorithm is performed.
Figs. 2 and 3 System block diagram: CMOS camera sensors (VMOD-CAM) feed a memory and camera interface controller on the FPGA, where the GMM algorithm (RAW to YCrCb conversion and background identification) is performed before output over HDMI to the display device/monitor
The moving objects are detected via background subtraction, and the result is displayed on the display device (Fig. 3).
The GMM algorithm proposed by Stauffer and Grimson [15], which performs statistical background subtraction using a mixture of Gaussian distributions, has been modified here. The modified background subtraction requires only a minimum number of operations, which reduces the design complexity in the hardware description language. A short description of the GMM algorithm follows.
Statistical model
The statistical model of the video sequence is composed, for each pixel, of K Gaussian distributions with parameters mean, variance, weight and matchsum. The Gaussian parameters differ for each Gaussian of each pixel and change with every frame of the video sequence.
Parameters Update
When a frame is acquired, the K Gaussian distributions of each pixel are sorted in decreasing order of a parameter named fitness, Fk,t. A pixel is considered to match a Gaussian if

|pixel − μk,t| < λ · σk,t    (2)

where λ represents the threshold value, equal to 2.5 as in the OpenCV library. Equation (2) establishes whether the pixel can be considered part of the background. A pixel can satisfy (1) for more than one Gaussian. The Gaussian that matches the pixel (Mk = 1) with the highest fitness value Fk,t is considered the "matched" one, and its parameters are updated as follows.
The parameter αw is the learning rate for the weight; from αw the learning rate for the mean and variance, denoted αk,t, is calculated as in the following equation. When the pixel does not match any of the Gaussian functions, a specific "no match" procedure is executed and the Gaussian distribution with the smallest fitness value Fk,t is updated.
matchsumk,t+1 = 1    (11)

where the fixed initialization value is represented by vinit and msumtot is the sum of the matchsum values of the k − 1 Gaussians with the highest fitness. The weights of the k − 1 Gaussians with highest Fk,t are reduced as in Eq. (12), while their variances and means are unchanged.
Background Identification
The background identification is performed using the following equation; the algorithm for background subtraction in [6] is modified as follows:

B = 0, if |pixel − μk,t+1| ≤ max(Th, λ · σk,t+1)
B = 1, otherwise    (14)

where σk,t+1 is the standard deviation calculated from the variance. The set of Gaussian distributions that satisfy the equation represents the background: if B = 0, a pixel that matches one of these Gaussians is classified as a background pixel. If the algorithm returns B = 1, the "no match" condition occurs and the pixel is classified as foreground.
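A software sketch of the per-pixel update and background test described above (a simplified Stauffer-Grimson style model for a single pixel; the derivation of the per-Gaussian learning rate from αw, the reinitialisation values used in the no-match case and the threshold constants are assumptions, and the matchsum bookkeeping of the hardware design is omitted):

```python
import numpy as np

K, LAMBDA, ALPHA_W, TH = 3, 2.5, 0.01, 20.0   # assumed constants

def update_pixel(pixel, mean, var, weight):
    # Sort the K Gaussians by fitness (weight / sigma), decreasing
    order = np.argsort(-(weight / np.sqrt(var)))
    mean, var, weight = mean[order], var[order], weight[order]
    matched = np.abs(pixel - mean) < LAMBDA * np.sqrt(var)          # Eq. (2)
    if matched.any():
        k = int(np.argmax(matched))          # matched Gaussian with highest fitness
        alpha = min(1.0, ALPHA_W / max(weight[k], 1e-6))   # simplified alpha_{k,t}
        mean[k] += alpha * (pixel - mean[k])
        var[k] += alpha * ((pixel - mean[k]) ** 2 - var[k])
        weight = (1.0 - ALPHA_W) * weight
        weight[k] += ALPHA_W
        background = abs(pixel - mean[k]) <= max(TH, LAMBDA * np.sqrt(var[k]))  # Eq. (14)
    else:
        # "No match": reinitialise the lowest-fitness Gaussian on the new pixel
        mean[-1], var[-1], weight[-1] = pixel, 15.0 ** 2, 0.05
        background = False
    weight /= weight.sum()
    return mean, var, weight, 0 if background else 1   # B = 0 background, 1 foreground
```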
From the video sequences, the variance, mean, weight and pixel values of each frame are calculated. These values are fed as input to the VLSI circuit, and the parameter values are then updated for the next frame. The background subtraction is performed after processing consecutive frames.
The fitness values are sorted in decreasing order. The highest-order fitness is used to update the parameters in the matched condition, and the lower-order fitness is used to update the parameters in the unmatched condition. The number of fitness calculations depends on the GMM parameters. The output variables are updated depending on the selected line. Figure 4 shows the parameter-updating block, where the three parameters mean, variance and weight are updated with the calculated fitness value and learning rate.
Figs. 4–7 Parameter update datapath: the sorted fitness values F1–F4 select between the matched, unmatched and no-match update paths for the mean, variance and weight of each Gaussian, with the pixel value, vinit and the learning rate as inputs
For the unmatched case (i.e., Mk = 0) the mean, weight and variance retain their previous values. For the no-match block, the mean and variances are updated as shown in Fig. 7.
The background identification is performed using Eq. (14). The pixels are classified as foreground or background and the moving objects are detected. The flow for background identification is shown in Fig. 8.
The algorithm has been tested offline by storing the collected video sequences in the FPGA memory. The collected video consists of 1250 frames, each of size 360 × 240 with 24-bit depth and a horizontal and vertical resolution of 69 dpi. Sample frames are shown in Fig. 9.
Figure 10 shows the simulation results obtained from the background identification block; the sample pixels for the corresponding background-subtracted frames are shown. The pixel format is 8 bits and each byte represents the intensity of the pixel: typically 0 represents black and 255 white. After the parameters are updated, the resulting values are stored as pixels in RAM.
The detected objects are verified in Matlab by processing the results obtained from the text file. The verified object detection frames are shown in Fig. 11.
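The same kind of verification can be sketched outside Matlab; a hypothetical Python example (the file name and the text-file layout of one intensity value per line, frame after frame, are assumptions about the exported data, not the actual format used in the design):

```python
import numpy as np
import matplotlib.pyplot as plt

W, H = 360, 240   # frame size used above

def load_frames(path):
    # Assumed layout: one 8-bit intensity per line, frame after frame
    values = np.loadtxt(path, dtype=np.uint8)
    return values.reshape(-1, H, W)

frames = load_frames('background_identification_output.txt')   # hypothetical file
plt.imshow(frames[0], cmap='gray', vmin=0, vmax=255)            # 0 = black, 255 = white
plt.title('Background-identified frame 0')
plt.show()
```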
5 Conclusion
In this proposed method, a moving object detection algorithm has been developed and the hardware circuit for Gaussian Mixture Modeling has been designed. The algorithm is modeled with three parameters: mean, variance and weight. After updating the parameters for consecutive frames, the background subtraction is performed. A fixed-point representation is used to reduce the hardware complexity. The algorithm performs efficiently with a vertical and horizontal resolution of 96 dpi.
6 Future Work
The GMM algorithm can be implemented in real time in the future. It is planned to interface a CMOS camera with the FPGA for real-time moving object detection.
Acknowledgments The authors wish to express humble gratitude to the Management and
Principal of Mepco Schlenk Engineering College, for the support in carrying out this research
work.
References
1. Zhang Yunchu, Li Yibin, and Zhang Jianbin, “Moving object detection in the low
illumination night scene,” IET International Conference on Information Science and Control
Engineering 2012 (ICISCE 2012), Dec. 2012, pp. 1–4.
2. Bo-Hao Chen, Shih-Chia Huang, “An Advanced Moving Object Detection Algorithm for
Automatic Traffic Monitoring in Real-World Limited Bandwidth Networks,” IEEE Trans-
actions on Multimedia, vol. 16. no. 3, pp. 837–847, April 2014.
3. V. Mejia, Eun-Young Kang, “Automatic moving object detection using motion and color
features and bi-modal Gaussian approximation,” IEEE International Conference on Systems,
Man, and Cybernetics, Oct. 2011, pp. 2922–2927.
4. Hongtu Jiang, Hakan Ardo, and Viktor Owall, “Hardware Accelerator Design for Video
Segmentation with Multimodal Background Modelling,” IEEE International Symposium on
Circuits and Systems, vol 2. May 2005, pp. 1142–1145.
5. Tomasz Kryjak, Mateusz Komorkiewicz, and Marek Gorgon, “Real-time Moving Object
Detection For Video Surveillance System In FPGA,” Conference on Design and Architecture
for signal and Image Processings, pp. 1–8, Nov 2011.
6. Mariangela Genovese and Ettore Napoli, “ASIC AND FPGA Implementation Of The
Gaussian Mixture Model Algorithm For Real-time Segmentation Of High Definition Video,”
IEEE Transactions on very large scale integration (VLSI) systems, vol. 22, no. 3, March
2014, pp. 537–547.
7. T. Bouwmans, F. El Baf and B. Vachon, “Statistical Background Modelling for Foreground
Detection: A Survey,” in Handbook of Pattern Recognition and Computer Vision, World
Scientific Publishing, 2010, pp. 181–199.
376 S. Arivazhagan and K. Kiruthika
8. Mariangela Genovese and Ettore Napoli, “An Fpga - based Real-time Background
Identification Circuit For 1080p Video,” 8th International Conference on Signal Image
Technology and Internet Based Systems, Nov 2012, pp. 330–335.
9. Ge Guo, Mary E. Kaye, and Yun Zhang, “Enhancement of Gaussian Background Modelling
Algorithm for Moving Object Detection & Its Implementation on FPGA,” Proceeding of the
IEEE 28th Canadian Conference on Electrical and Computer Engineering Halifax, Canada,
May 3–6, 2015, pp. 118–122.
10. Xiaoyin Ma, Walid A. Najjar, and Amit K. Roy-Chowdhury, “Evaluation And Acceleration
Of High-throughput Fixed-point Object Detection On FPGA’S,” IEEE Transactions On
Circuits And Systems For Video Technology, Vol. 25, No. 6, June 2015, pp. 1051–1062.
11. H. Jiang, V. Öwall and H. Ardö, "Real-Time Video Segmentation with VGA Resolution and
Memory Bandwidth Reduction,” IEEE International Conference on Video and Signal Based
Surveillance, 2006. AVSS’06., pp. 104–109, Nov. 2006.
12. M. Genovese, E. Napoli and N. Petra,”OpenCV compatible real time processor for
background, foreground identification,” Microelectronics (ICM), 2010 International Confer-
ence, pp. 487–470, Dec. 2010.
13. H. Jiang, H. Ardö, and V. Öwall, “A hardware architecture for real-time video segmentation
utilizing memory reduction techniques,”IEEE Trans. Circuit Syst. Video Technol., vol. 19, no.
2, pp. 226–236,Feb. 2009.
14. A. Yilmaz, O. Javed and M. Shah,”Object tracking: A survey,” ACM Comput. Surv., vol. 38,
no. 4, Dec. 2006.
15. http://dparks.wikidot.com/background-subtraction.
Site Suitability Evaluation for Urban
Development Using Remote Sensing, GIS
and Analytic Hierarchy Process (AHP)
Abstract Accurate and authentic data are a prerequisite for proper planning and management. For the proper identification and mapping of urban development sites for any city, accurate and authentic data on geomorphology, transport network, land use/land cover and ground water become paramount. Satellite remote sensing and geographic information system (GIS) techniques have proved their potential for obtaining such data in time. The importance of these techniques, coupled with the Analytic Hierarchy Process (AHP), in site suitability analysis for urban development site selection is established and accepted worldwide, as is their use in assessing the present status of environmental impact in the surroundings of an urban development site. Remote sensing, GIS, GPS and the AHP method are vital tools for the identification, comparison and multi-criteria decision analysis required for the proper planning and management of urban development sites. Keeping in view the availability of high-resolution data, IKONOS, Cartosat and IRS 1C/1D LISS-III data have been used for the preparation of various thematic layers of Lucknow city and its environs. The study describes detailed information on the site suitability analysis for urban development site selection. The final maps of the study area, prepared using GIS software and the AHP method, can be widely applied to compile and analyze data on site selection for proper planning and management. It is necessary to generate digital data on site suitability for urban development sites for local bodies/development authorities in a GIS and AHP environment, in which the data are reliable and comparable.
Anugya (✉)
IIT Roorkee, Roorkee, Uttarakhand, India
e-mail: [email protected]
V. Kumar
Remote Sensing Application Centre, Lucknow, U.P, India
K. Jain
IIT Roorkee, Roorkee, Uttarakhand, India
1 Introduction
In recent years the rate of urbanization has been so fast, with very little planned expansion, that it has led to poor management practices, which have created many urban problems such as slums, traffic congestion, sewage and transportation network issues. Consequently, it has had an immense deleterious impact on existing land and the environment, eventually affecting man himself. A high population growth rate combined with unplanned urban areas has placed pressure on our invaluable land and water resources and poses a severe threat to fertile land. This haphazard growth of urban areas is particularly seen in
developing countries like India. India has close to 7,933 municipalities as per the Census of India 2011, and the population of Lucknow city is around 29.00 lakhs.
The applicability of GIS to suitability analyses is vast, and new methods of spatial analysis facilitate suitability assessment and land use allocation [3, 9]. However, quick growth and sprawl bring new needs and demands for planners and designers, including the consideration of new growth and new alternatives for land use [2]. GIS can also be used to evaluate plans for Smart Growth Developments (SGDs) and Transit Oriented Developments (TODs). Many cities in India suffer from this condition; these undefined areas are increasing rapidly and are occupied by large numbers of people [6, 12]. There is a rapid increase in the peri-urban boundary, which in turn results in an increase in urban area and a decrease in agricultural land [12]. Due to such haphazard growth of the city, land is also wasted. The increase in urban growth is also responsible for land transformation.
Study Area
Lucknow city and its environs were selected for the study. An additional area within a 10–12 km radius of the present boundary has been taken into consideration; the study area extends from 26°41′11.12″N to 26°56′59.05″N latitude and 80°54′55.55″E to 80°48′0.57″E longitude. The geographical area of Lucknow district is about 3,244 sq. km and the city area is approximately 800 sq. km, with a population of around 36,81,416 and a density of 1,456 persons per sq. km.
Data Sources
To meet the objectives of the study, Survey of India (SOI) topographical map sheets no. 63B/13 and 63B/14 on 1:50,000 scale, IRS-1C/1D LISS-III satellite imagery at 23.5 m resolution on 1:50,000 scale acquired in 2001–02, and IKONOS 1 m resolution data of 2014 have been used for the preparation of the multi-criteria layers, i.e., land use/land cover, geomorphology and road/transport network; the ground water table data was collected from the State/Central Ground Water Department.
Method
The Expert Choice software was used for finding the relative weights. Four important steps were applied in this process:
1. determining the suitable factors for the site selection procedure,
2. assigning weights to all the parameters,
3. generating the various land suitability thematic maps for urban development,
4. finally, determining the most suitable area for urban development.
The AHP model has been used for the criteria pertaining to geomorphology (Fig. 1), land use/land cover (Fig. 2), accessibility of the road/transport network (Fig. 3) and ground water.
The land use and land cover layer is used to protect prime agricultural land from urban expansion; salt-affected/waste lands are preferable for urban expansion [7].
The number of pairwise comparisons is calculated using Eq. (1):

n(n − 1)/2    (1)

The consistency of the comparisons is checked using

CI = (λmax − n)/(n − 1)    (2)

CR = CI/RI    (3)

where CR = Consistency Ratio, CI = Consistency Index, and RI = Random Consistency Index.
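A compact sketch of the AHP computation (NumPy assumed; the pairwise comparison matrix below is purely hypothetical, and the standard Saaty values are used for the random consistency index RI):

```python
import numpy as np

# Saaty's random consistency index RI for matrix sizes 1..10
RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
      6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}

def ahp_weights(A):
    # A: n x n pairwise comparison matrix on Saaty's 1-9 scale
    n = A.shape[0]
    eigvals, eigvecs = np.linalg.eig(A)
    k = int(np.argmax(eigvals.real))
    w = np.abs(eigvecs[:, k].real)
    w /= w.sum()                              # priority weights
    ci = (eigvals.real[k] - n) / (n - 1)      # Eq. (2)
    cr = ci / RI[n]                           # Eq. (3); CR < 0.1 is usually acceptable
    return w, cr

# Hypothetical comparison of geomorphology, land use/land cover,
# transport network and ground water
A = np.array([[1.0, 1/2, 2.0, 3.0],
              [2.0, 1.0, 3.0, 4.0],
              [1/2, 1/3, 1.0, 2.0],
              [1/3, 1/4, 1/2, 1.0]])
weights, cr = ahp_weights(A)
```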
Urban Development Suitability Assessment
In this process the overall composite weights are calculated and the land suitability maps for urban development are generated, based on the linear combination of each factor's suitability score [14, 16], as shown in Eq. (4). The AHP method is applied to determine the importance of each parameter.
SI = ∑i Wi Xi    (4)
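Eq. (4) is a weighted linear combination of the criterion rasters; a minimal sketch (NumPy assumed, with each layer already rescaled to a common suitability score):

```python
import numpy as np

def suitability_index(layers, weights):
    # layers:  list of equally sized rasters of suitability scores X_i
    # weights: AHP-derived weights W_i; returns SI = sum_i W_i * X_i  (Eq. 4)
    si = np.zeros_like(layers[0], dtype=float)
    for w, x in zip(weights, layers):
        si += w * x
    return si
```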
The urban site selection model involves three steps to identify the most suitable alternatives for urban development: preliminary analysis, MCDM evaluation and identification of the most suitable site. The preliminary analysis involves the creation of the various criterion maps as input raster data layers.
The geomorphological map of the study area has been obtained from the available reports and maps of the Land-Use and Urban Survey Division, UP-RSAC, on the basis of the regional geomorphological classification shown in a previous study. Two classes, namely older alluvial plain and younger alluvial plain, have been grouped and mapped for the study area as shown in Fig. 1. It is observed that two of the four sites, namely Site-1 and Site-2, are located in the same class (younger alluvial plain), while Site-3 and Site-4 are located in the older alluvial plain. Site-3 and Site-4 are considered suitable for siting an urban development (Table 2).
The land-use/land-cover map has been prepared using IKONOS data. Fourteen different land use classes have been identified in the study area, namely agricultural land, built-up land, industrial area, brick kiln, fallow land, open scrub, orchard and plantation, forest, rural built-up, waste land/sodic land, river/drain, water bodies, water-logged area and others, as shown in Fig. 2. Waste lands are degraded lands, including sodic (salt-affected) lands and scrub lands, which are more suitable for an urban site because they lack the soil moisture, minerals and other qualities of fertile land. Scrub lands often appear like fallow land and can look like crop land, but in essence they are also categorized as waste land and considered moderately suitable for urban development. Consequently, it is noticed that Site-3 and Site-4 are the more suitable areas for urban development, while Site-1 and Site-2 are moderately suitable (Table 3).
The transportation network map has been digitized from the 1:25,000/1:50,000 scale topographical maps and the IKONOS satellite data, and the roads are grouped into five classes including railway line, national highway (NH), city major roads (CMjR) and city minor roads (CMnR), as shown in Fig. 3. According to [15], a distance of less than 500 km from the NH, SH and CMjR should be avoided. At the same time, the urban site should be located near existing road networks to avoid high construction costs. Using Fig. 3, the distances of the different sites from the existing transportation network have been measured, and Table 3, giving the proximity of the transport network with respect to each site, has been prepared. It is seen that Site-3 and Site-4 are suitable because they are near the NH, CMjR and CMnR, whereas Site-1 and Site-2 are unsuitable sites, with distances beyond the required distance from the NH, CMjR and CMnR (Table 4).
The ground water data was collected with the help of a field survey of the various sites, from which it was found that Site-1, Site-2 and Site-3 have water at 130–150 ft, 80–100 ft and 50 ft respectively, and Site-4 at 85 ft (Tables 5 and 6).
Table 5 Water table of the various sites

Sl. no.    Site      Water table (in ft.)
2          Site-1    130 to 150
3          Site-2    80 to 100
4          Site-3    50
6          Site-4    85
In this study all criteria were analyzed through a literature review. Four different parameters were identified to select a suitable site for urban development, and thematic maps of all four parameters were prepared as input map layers. These criteria are geomorphology, land use/land cover, transport network and ground water table. After the preparation of the output criterion maps, four possible sites for urban development were identified in the study area according to data availability: Mubarakpur (Site-1), Rasoolpur Kayastha (Site-2), Chinhat (Site-3) and Aorava (Site-4). These sites were identified on the basis of the flexibility of the different criterion features, and were confirmed after field investigation according to the urban development site suitability prerequisites (Fig. 4 and Table 7).
Pair-wise comparison for the determination of weights is more suitable than direct assignment of the weights, because one can check the consistency of the weights by calculating the consistency ratio; in direct assignment, the weights depend only on the preference of the decision maker [11, 17]. All the selected sites in the study area are ranked in terms of relative weight according to their suitability for urban development. The suitability index has the value 0.442 for Site-4 (Aorava) and 0.346 for Site-3 (near Chinhat). Thus, Site-4 (Aorava) and Site-3 (near Chinhat), having the higher suitability indices, are identified as potential urban development sites in the study area. Site-1 and Site-2 are found to have the lowest suitability.
References
1. A G-O YEH (1999), Urban planning and GIS, Geographical information system, Second
edition, Volume2 Management issues and applications, 62, pp 877–888.
2. Brueckner J K (2000) Urban Sprawl: Diagnosis and Remedies, International Regional
Science Review 23(2): 160–171.
3. Brueckner, Jan K. “Strategic interaction among governments: An overview of empirical
studies.” International regional science review 26.2 (2003): 175–188.
4. Dutta, Venkatesh. “War on the Dream–How Land use Dynamics and Peri-urban Growth
Characteristics of a Sprawling City Devour the Master Plan and Urban Suitability?.” 13th
Annual Global Development Conference, Budapest, Hungary. 2012.
5. FAO, 1976. A framework for land evaluation. Food and Agriculture Organization of the
United Nations, Soils Bulletin No. 32, FAO: Rome.
6. Kayser B (1990) La Renaissance rurale. Sociologie des campagnes du monde occidental,
Paris: Armand Colin.
7. Malczewski, J., 1997. Propogation of errors in multicriteria location analysis: a case study, In:
fandel, G., Gal, T. (eds.) Multiple Criteria Decision Making, Springer-Verlag, Berlin, 154–155.
8. Malczewski, J., 1999. GIS and Multicriteria Decision Analysis, John Wiley & Sons, Canada,
392 p.
9. Merugu Suresh, Arun Kumar Rai, Kamal Jain, (2015), Subpixel level arrangement of spatial
dependences to improve classification accuracy, IV International Conference on Advances in
Computing, Communications and Informatics (ICACCl), 978–1-4799-8792-4/15/$31.00
©2015 IEEE, pp. 779–876.
10. Merugu Suresh, Kamal Jain, 2015, “Semantic Driven Automated Image Processing using the
Concept of Colorimetry”, Second International Symposium on Computer Vision and the
Internet (VisionNet’15), Procedia Computer Science 58 (2015) 453–460, Elsevier.
11. Merugu Suresh, Kamal Jain, 2014, “A Review of Some Information Extraction Methods,
Techniques and their Limitations for Hyperspectral Dataset” International Journal of
Advanced Research in Computer Engineering & Technology (IJARCET), Volume 3 Issue
3, March 2014, ISSN: 2278–1323, pp 2394–2400.
12. McGregor D, Simon D and Thompson D (2005) The Peri-Urban Interface: Approaches to
Sustainable Natural and Human Resource Use (eds.), Royal Holloway, University of London,
UK, 272.
13. Myers, Adrian. “Camp Delta, Google Earth and the ethics of remote sensing in archaeology.”
World Archaeology 42.3 (2010): 455–467.
14. Mieszkowski P and E S Mills (1993) The causes of metropolitan suburbanization, Journal of
Economic Perspectives 7: 135–47.
15. Mu, Yao. “Developing a suitability index for residential land use: A case study in Dianchi
Drainage Area.” (2006).
16. Saaty, T.L., 1990, How to make a decision: The Analytic Hierarchy Process. European
Journal of Operational Research, 48, 9–26.
17. Saaty, T.L. 1994. Highlights and Critical Points in the Theory and Application of the Analytic
Hierarchy Process, European Journal of Operational Research, 74: 426–447.
A Hierarchical Shot Boundary Detection
Algorithm Using Global and Local Features
1 Introduction
M. Verma (✉)
Mathematics Department, IIT Roorkee, Roorkee, India
e-mail: [email protected]
B. Raman
Computer Science and Engineering Department, IIT Roorkee, Roorkee, India
e-mail: [email protected]
system. It is nearly impossible to process a video for retrieval or analysis tasks without key frame detection. Key frame detection serves to reduce a large amount of data from the video, which eases further processing.
A video shot transition happens in two ways, i.e., abrupt and gradual transition. An abrupt transition is caused by a hard cut, while a gradual transition includes shot dissolves and fades. Many algorithms have been proposed to detect abrupt and gradual shot transitions in video sequences [2]. A hierarchical shot detection algorithm was proposed that handles abrupt and gradual transitions in different stages [3]. Wolf and Yu presented a hierarchical shot detection method based on the analysis of different shot transitions using multi-resolution analysis; their hierarchical approach detects transitions such as cut, dissolve, wipe-in and wipe-out [12]. Local and global feature descriptors have been used for feature extraction in shot boundary detection; Apostolidis et al. used local SURF features and global HSV colour histograms for gradual and abrupt transitions in shot segmentation [1].
In image analysis, only spatial information needs to be extracted; for video analysis, however, temporal information must be captured along with spatial information. Temporal information describes the activity and the transition from one frame to another. Rui et al. proposed a keyframe detection algorithm using a colour histogram and an activity measure: spatial information is analyzed using the colour histogram, the activity measure is used to detect temporal information, and similar shots are grouped later for better segmentation [9]. A two-stage video segmentation technique was proposed using a sliding window: a segment of frames is used to detect shot boundaries in the first stage, and in the second stage the 2-D segments are propagated across the window of frames in both the spatial and temporal directions [8]. Tippaya et al. proposed a shot detection algorithm using the RGB histogram and the edge change ratio, with three different dissimilarity measures used to compute the difference between frame feature vectors [10].
Event detection and video content analysis have been performed based on shot detection and keyframe selection algorithms [4]. Similar scene detection has been done using a clustering approach, and a story line has been constructed from a long video [11]. Event detection in sports video has been analyzed using long, medium and close-up shots, and play breaks are extracted for summarization of the video [5]. A shot detection technique has been implemented based on the visual and audio content of the video, with the wavelet transform domain utilized for feature extraction [6].
stage, the spatial information of the key frames extracted in the first stage is analyzed, and redundant keyframes are excluded.
The paper is organized as follows: Sect. 1 gives a brief introduction to video shots and keyframes along with a literature survey; Sect. 2 describes the technique of the proposed method; Sect. 3 describes the framework of the proposed algorithm; Sect. 4 presents the experimental results; and finally, the work is concluded in Sect. 5.
Shot detection is a very common problem in video processing. Processing a full video at once and extracting shot boundaries may yield several similar shots. Frames of a video are shown in Fig. 1. There are ten different shots in the video, in which shots 3, 5 and 7, and shots 4, 6 and 8, are of a similar kind. Hence, keyframes extracted from these shots would be similar, and redundant information would be extracted from the video. This is a small example, and the same can happen in a long video. To resolve this problem, a hierarchical scheme has been adopted for keyframe extraction from a video.
For abrupt shot boundary detection, we have used the RGB colour histogram, which provides the global distribution of the three colour bands in RGB space. A quantized histogram of 8 bins is created for each colour channel: each channel is first quantized into 8 intensity levels and the histogram is then generated using the following equation:

HistC(L) = ∑a=1..m ∑b=1..n F(I(a, b, C), L),  where C = 1, 2, 3 for the R, G, B colour bands    (1)

F(a, b) = 1 if a = b, 0 otherwise    (2)

where the size of the image is m × n and L indexes the histogram bins; I(a, b, C) is the intensity of colour channel C at position (a, b).
For temporal information in a video sequence, each frame of the video is extracted and its RGB colour histogram is generated. The difference between each frame and the next is computed using the following distance measure:

Dis(DBn, Q) = ∑s=1..L |Fdb^n(s) − Fq(s)|    (3)
If the measured distance between two frames is greater than a fixed threshold value, the frames are separated into different clusters. This process is applied to each consecutive pair of frames in the video sequence, yielding clusters of similar frames. After obtaining the clusters, we extract one key frame from each cluster. For keyframe extraction, the entropy of each frame in a cluster is calculated using Eq. (4), and the maximum-entropy frame is chosen as the keyframe for that cluster.

Ent(I) = −∑i (pi × log2(pi))    (4)
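A sketch of this first stage (NumPy assumed; the histogram-difference threshold is a free parameter and the frames are assumed to be RGB arrays):

```python
import numpy as np

def rgb_hist(frame, bins=8):
    # Quantized 8-bin histogram of each colour channel (Eq. 1), concatenated
    return np.concatenate([np.histogram(frame[..., c], bins=bins,
                                        range=(0, 256))[0] for c in range(3)])

def entropy(frame):
    # Entropy of the intensity distribution of a frame (Eq. 4)
    hist, _ = np.histogram(frame, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def first_stage_keyframes(frames, threshold):
    # Split the sequence at abrupt transitions (L1 histogram distance, Eq. 3)
    # and keep the maximum-entropy frame of every resulting cluster
    keyframes, cluster = [], [frames[0]]
    for prev, cur in zip(frames, frames[1:]):
        if np.abs(rgb_hist(prev).astype(int) - rgb_hist(cur).astype(int)).sum() > threshold:
            keyframes.append(max(cluster, key=entropy))
            cluster = []
        cluster.append(cur)
    keyframes.append(max(cluster, key=entropy))
    return keyframes
```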
LBPp,r = ∑l=0..p−1 2^l × S1(Il − Ic)    (5)

S1(x) = 1 if x ≥ 0, 0 otherwise

Hist(L)|LBP = ∑a=1..m ∑b=1..n F(LBP(a, b), L),  L ∈ [0, (2^p − 1)]    (6)
where p and r represent the number of neighbouring pixels and the radius, respectively. After calculating the LBP pattern using Eq. (5), the histogram of the LBP pattern is created using
Eq. (6). The LBP is extracted from each of the keyframes obtained from the above process. Next, the distance between every pair of keyframes is calculated using Eq. (3), as shown in Fig. 2: the distance of frame 1 is calculated with frames 2, 3, ..., n; the distance of frame 2 with frames 3, 4, ..., n; and so on, until the distance of frame n − 1 is calculated with frame n. An upper-triangular matrix is built from all these distance measures. If the distance between two or more frames is less than a fixed threshold, all those frames are grouped into one cluster. In this way, even non-consecutive similar keyframes are clustered, and completely non-redundant data in different clusters are obtained. Again, the entropy of each frame in each cluster is calculated and the maximum-entropy frame is chosen as the final keyframe. Finally, we get a reduced number of final key frames without any redundant information.
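The second stage can be sketched as follows (NumPy assumed; keyframes are converted to grayscale for the LBP, entropy() from the previous sketch picks the representative of each cluster, and the LBP distance threshold is a free parameter):

```python
import numpy as np

def lbp_histogram(frame, p=8, r=1):
    # Basic 8-neighbour LBP code (Eq. 5) and its 2^p-bin histogram (Eq. 6)
    gray = frame.mean(axis=2) if frame.ndim == 3 else frame
    g = gray.astype(int)
    centre = g[r:-r, r:-r]
    code = np.zeros_like(centre)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for l, (dy, dx) in enumerate(offsets):
        neigh = g[r + dy:g.shape[0] - r + dy, r + dx:g.shape[1] - r + dx]
        code += (1 << l) * (neigh >= centre)
    return np.histogram(code, bins=2 ** p, range=(0, 2 ** p))[0]

def second_stage(keyframes, threshold):
    # Group mutually similar keyframes (L1 distance between LBP histograms)
    # and keep the maximum-entropy frame of every group
    hists = [lbp_histogram(kf) for kf in keyframes]
    final, used = [], set()
    for i in range(len(hists)):
        if i in used:
            continue
        cluster = [i] + [j for j in range(i + 1, len(hists))
                         if j not in used and np.abs(hists[i] - hists[j]).sum() < threshold]
        used.update(cluster)
        final.append(max((keyframes[k] for k in cluster), key=entropy))
    return final
```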
3.1 Algorithm
4 Experimental Results
For experimental purposes, we have used three different kinds of videos: news, advertisement and a movie clip. General details about all three videos are given in Table 1. The three videos differ in duration and frame size.
In the news video, an anchor and a guest are present at first. The camera moves from anchor to guest and back many times in the video; hence, in shot detection, many shots contain similar kinds of frames (either anchor or guest). Further, other events are shown repeatedly in the video, one after another shot. All these redundant shots are separated initially and key frames are selected. In the second phase of the algorithm, redundant key frames are clustered and the keyframes of maximum entropy are extracted as the final key frames. Initially, 63 key frames are extracted, and after applying the hierarchical process only 12 key frames remain at the end. The hierarchical process has thus removed a significant number of redundant key frames before further processing.
The second video clip used in the experiment is a short clip from an animation movie called 'Ice Age'. The same hierarchical process is applied to the clip: initially 11 key frames are extracted, and then, using LBP for spatial information, 6 final non-redundant key frames are obtained. Keyframes of the initial and final phases are shown in Fig. 3.
The third video taken for the experiment is a Tata Sky advertisement. The proposed method is applied to the video and the keyframes of both phases are collected; they are shown in Fig. 4. It is clearly visible that using the hierarchical method the number of key frames has been reduced significantly and redundant key frames have been removed.
Information regarding the extracted key frames in phases one and two is given in Table 2. The summary of reduced keyframes shows that the proposed algorithm removes the repeated frames among the keyframes detected by the colour histogram method. Further, in phase two, using LBP we obtain an optimal number of key frames which summarize the video effectively.
The extracted keyframes can be saved as a database and used for video retrieval tasks. Keyframe extraction is an offline process, while retrieving videos using keyframes is a real-time online process. The proposed keyframe extraction method is a two-step process and hence time consuming; however, it is performed offline and can be used to build the keyframe database once. The number of keyframes is reduced by the proposed method, and these can be used in the real-time video retrieval process. Since there are fewer keyframes, video retrieval will be less time consuming.
5 Conclusions
In the proposed work, the shot boundary detection problem has been discussed and key frames have been obtained. A hierarchical approach is adopted for the final keyframe selection, which helps in reducing similar keyframes in non-consecutive shots. Initially, a colour histogram technique is used for temporal analysis and abrupt transitions are obtained. Based on the abrupt transitions, shots are separated and keyframes are selected. Spatial analysis is then performed on the obtained keyframes using the local binary pattern, and finally redundant keyframes are removed. In this process, a significant number of redundant keyframes are removed. The proposed method is applied to three videos (news reading, a movie clip and a TV advertisement) for the experiments, which show that the proposed algorithm helps in removing redundant keyframes.
Analysis of Comparators for Binary
Watermarks
H. Agarwal (✉)
Department of Mathematics, Jaypee Institute of Information Technology,
Sector 62, Noida, Uttar Pradesh, India
e-mail: [email protected]
B. Raman
Indian Institute of Technology Roorkee, Roorkee, India
e-mail: [email protected]
P.K. Atrey
State University of New York, Albany, NY, USA
e-mail: [email protected]
M. Kankanhalli
National University of Singapore, Singapore, Singapore
e-mail: [email protected]
1 Introduction
2 Comparator
A comparator has two elements: a function that measures the level of similarity between two watermarks, and a threshold. Two watermarks are said to be matched if the level of similarity is more than the threshold; otherwise, the watermarks are not matched.
The mathematical formulation of a general comparator is as follows:
C(𝜏, sim)(x1, x2) = { match, if sim(x1, x2) ≥ 𝜏; no match, otherwise },   (1)
where, x1 and x2 are two watermarks, sim(x1 , x2 ) is an arbitrary function that mea-
sures similarity between two watermarks and 𝜏 is a threshold value.
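A minimal Python sketch of this general comparator, with the similarity function passed in as a parameter (the names below are illustrative, not from the paper):

def comparator(sim, tau):
    # Decision rule of Eq. (1): 'match' iff sim(x1, x2) >= tau
    def decide(x1, x2):
        return "match" if sim(x1, x2) >= tau else "no match"
    return decide

# Example with a toy similarity (fraction of agreeing bits):
agree = lambda x1, x2: sum(a == b for a, b in zip(x1, x2)) / len(x1)
C = comparator(agree, tau=0.8)
print(C([1, 0, 1, 1], [1, 0, 1, 0]))  # 'no match' (similarity 0.75 < 0.8)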
NHS(x1, x2) = 1 − (1 ∕ N) ∑_{i=1}^{N} x1(i) ⊕ x2(i),   (2)

NCC(x1, x2) = ∑_{i=1}^{N} x1(i)·x2(i) ∕ ( √(∑_{i=1}^{N} x1(i)²) · √(∑_{i=1}^{N} x2(i)²) ),   (4)

MSNCC(x1, x2) = ∑_{i=1}^{N} (x1(i) − x̄1)·(x2(i) − x̄2) ∕ ( √(∑_{i=1}^{N} (x1(i) − x̄1)²) · √(∑_{i=1}^{N} (x2(i) − x̄2)²) ),   (5)
x̄1 and x̄2 are the mean (average) values of the watermarks x1 and x2 respectively. The respective ranges of the functions are [0, 1], [0.5, 1], [0, 1], [−1, 1] and [0, 1]. For identical watermarks, the value of each function is 1; for a negative pair of watermarks, the respective values of the functions are 0, 1, 0, −1 and 1. Moreover, two fundamental
properties of NHS are as follows:
Note that NCC fails if either watermark is a pure black image, and MSNCC and AMSNCC fail if either watermark is a uniform intensity image.
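A small numpy sketch of these similarity measures; the zero-denominator guards correspond to the failure cases noted above (function names are illustrative):

import numpy as np

def nhs(x1, x2):
    # Eq. (2): 1 - fraction of differing bits
    x1, x2 = np.asarray(x1, bool), np.asarray(x2, bool)
    return 1.0 - np.mean(x1 ^ x2)

def ncc(x1, x2):
    # Eq. (4): undefined if either watermark is all-zero (pure black)
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    den = np.sqrt((x1 ** 2).sum()) * np.sqrt((x2 ** 2).sum())
    return np.nan if den == 0 else float((x1 * x2).sum() / den)

def msncc(x1, x2):
    # Eq. (5): undefined if either watermark has uniform intensity (zero variance)
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    d1, d2 = x1 - x1.mean(), x2 - x2.mean()
    den = np.sqrt((d1 ** 2).sum()) * np.sqrt((d2 ** 2).sum())
    return np.nan if den == 0 else float((d1 * d2).sum() / den)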
The first condition for obtaining the ideal point is that the set of watermarks (X) must not contain any pair of identical or negative watermarks. This condition implies that the maximum symmetric normalized Hamming similarity (MSNHS) of X must be strictly less than one. Mathematically,
MSNHS(X) = max { SNHS(xi, xj) : xi, xj ∈ X, i ≠ j } < 1.   (9)
By the definition of MSNHS(X), it is clear that for any two different watermarks xi and xj of X,

1 − MSNHS(X) ≤ NHS(xi, xj) ≤ MSNHS(X).   (10)
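A short numpy sketch for computing MSNHS(X) over a watermark set, using SNHS(xi, xj) = 0.5 + |NHS(xi, xj) − 0.5| as in step 8 of Algorithm 1 (function names are illustrative):

import numpy as np
from itertools import combinations

def nhs(x1, x2):
    return 1.0 - np.mean(np.asarray(x1, bool) ^ np.asarray(x2, bool))

def snhs(x1, x2):
    # Symmetric NHS, range [0.5, 1]
    return 0.5 + abs(nhs(x1, x2) - 0.5)

def msnhs(X):
    # Eq. (9): maximum SNHS over all distinct pairs of watermarks in X
    return max(snhs(xi, xj) for xi, xj in combinations(X, 2))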
The first condition is necessary for the second condition, and the second condition is sufficient for the existence of the ideal point. The second condition is

P > (1 + MSNHS(X)) ∕ 2,   (11)

where the term P represents the minimum similarity of any embedded watermark with respect to the corresponding extracted watermark for a given watermarking system. The range of P is [0.5, 1]. A method to compute P is provided in Algorithm 1.
Algorithm 1: An algorithm to compute P
Input: X, H, t, M = (Memb, Mext) (for details, refer Sect. 1)
Output: P
1. Create a matrix of size |X| × |H| with each element equal to 0. Store this matrix as SNHS = zeros(|X|, |H|).
2. for i = 1 ∶ 1 ∶ |X|
3.   for j = 1 ∶ 1 ∶ |H|
4.     Select xi ∈ X, hj ∈ H.
5.     Embed watermark xi in hj by using the watermark embedding algorithm Memb to obtain the watermarked image ĥij.
6.     Apply noise/attack t on ĥij to obtain the noisy/attacked watermarked image ĥ̂ij (for details of attack t, refer Tables 2 and 3).
7.     Extract watermark x̂ij from ĥ̂ij by using the watermark extraction algorithm Mext.
8.     Compute SNHS(i, j) = 0.5 + |NHS(x̂ij, xi) − 0.5|.
9.   end
10. end
11. Find the minimum value of the matrix SNHS to obtain P.
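A direct Python transcription of Algorithm 1; Memb, Mext and the attack t are assumed to be user-supplied callables, since the paper treats them as inputs:

import numpy as np

def nhs(x1, x2):
    return 1.0 - np.mean(np.asarray(x1, bool) ^ np.asarray(x2, bool))

def compute_P(X, H, attack, embed, extract):
    # X: list of binary watermarks, H: list of host images,
    # embed(x, h) -> watermarked image, attack(img) -> attacked image,
    # extract(img) -> extracted watermark (all assumed callables)
    snhs = np.zeros((len(X), len(H)))
    for i, x in enumerate(X):
        for j, h in enumerate(H):
            wm = embed(x, h)               # step 5
            wm_attacked = attack(wm)       # step 6
            x_ext = extract(wm_attacked)   # step 7
            snhs[i, j] = 0.5 + abs(nhs(x_ext, x) - 0.5)  # step 8
    return float(snhs.min())               # step 11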
The threshold is then chosen in the range

(1 + MSNHS) ∕ 2 < 𝜏 ≤ P.   (12)
The formula (12) provides a sufficient threshold range for the ideal point; that is, if the threshold (𝜏) is in the range provided by (12), then the watermarking system is tuned at the ideal point. However, the converse need not be true: if the watermarking system is tuned at the ideal point, then the threshold may or may not be in the range given by (12). The proof that (12) is sufficient is discussed next.
where 𝜏 satisfies (12). Thus, we have the following four possible cases.
case (i) NHS(x, xi ) ≥ 𝜏 and NHS(x, xj ) ≥ 𝜏.
case (ii) NHS(x, xi ) ≤ 1 − 𝜏 and NHS(x, xj ) ≤ 1 − 𝜏.
case (iii) NHS(x, xi ) ≥ 𝜏 and NHS(x, xj ) ≤ 1 − 𝜏.
case (iv) NHS(x, xi ) ≤ 1 − 𝜏 and NHS(x, xj ) ≥ 𝜏. We now discuss these cases one by one.
case (i): NHS(x, xi) ≥ 𝜏 and NHS(x, xj) ≥ 𝜏 ⇒ (by (12)) NHS(x, xi) ≥ 𝜏 > (1 + MSNHS) ∕ 2 and NHS(x, xj) ≥ 𝜏 > (1 + MSNHS) ∕ 2 ⇒
case (ii): NHS(x, xi) ≤ 1 − 𝜏 and NHS(x, xj) ≤ 1 − 𝜏 ⇒ (by (12)) NHS(x, xi) ≤ 1 − 𝜏 < (1 − MSNHS) ∕ 2 and NHS(x, xj) ≤ 1 − 𝜏 < (1 − MSNHS) ∕ 2 ⇒
case (iii):
case (iv): This proof is similar to the proof for case (iii).
Table 2 Ideal point threshold range for various watermarking systems using ROC curve. Com-
parators are based on SNHS and AMSNCC. NGF: Negative + Gaussian Filter, NGN: Negative +
Gaussian Noise, NR: Negative + Rotation, 𝜎: variance, Q: Quality Factor, r: rotation in degree,
counter clockwise, DNE: Does Not Exist
Watermarking scheme Data-set Attack (t) Threshold range for
AMSNCC SNHS
Wong and Memon [16] D1 No attack [0.39, 1.00] [0.74, 1.00]
Wong and Memon [16] D1 Negative attack [0.39, 1.00] [0.74, 1.00]
Wong and Memon [16] D2 No attack [0.23, 1.00] [0.68, 1.00]
Wong and Memon [16] D2 Negative attack [0.23, 1.00] [0.68, 1.00]
Wong and Memon [16] D3 No attack [0.39, 1.00] [0.74, 1.00]
Wong and Memon [16] D3 Negative attack [0.39, 1.00] [0.74, 1.00]
Wong and Memon [16] D4 No attack [0.90, 1.00] [0.96, 1.00]
Wong and Memon [16] D4 Negative attack [0.90, 1.00] [0.96, 1.00]
Wong and Memon [16] D5 No attack [0.38, 1.00] [0.74, 1.00]
Wong and Memon [16] D5 Negative attack [0.38, 1.00] [0.74, 1.00]
Wong and Memon [16] D1 NGF, 𝜎 = 0.3 [0.39, 0.97] [0.74, 0.99]
Wong and Memon [16] D1 NGF, 𝜎 = 0.7 DNE DNE
Wong and Memon [16] D1 NGN, 𝜎 = 10⁻⁵ [0.28, 0.62] [0.68, 0.84]
Wong and Memon [16] D1 JPEG, Q = 95 [0.15, 0.28] [0.60, 0.66]
Wong and Memon [16] D1 NR, r = 0.2 [0.39, 0.99] [0.74, 1.00]
Bhatnagar and Raman [4] D′1 No attack [0.43, 1.00] [0.73, 0.99]
Bhatnagar and Raman [4] D′2 No attack [0.27, 0.97] [0.69, 0.98]
Bhatnagar and Raman [4] D′3 No attack [0.43, 0.97] [0.73, 0.98]
Bhatnagar and Raman [4] D′4 No attack [0.91, 0.93] {0.96}
Bhatnagar and Raman [4] D′5 No attack [0.43, 0.82] [0.65, 0.91]
Bhatnagar and Raman [4] D′1 NGF, 𝜎 = 0.3 [0.42, 0.99] [0.73, 0.99]
Bhatnagar and Raman [4] D′1 NGN, 𝜎 = 10⁻⁴ [0.43, 0.99] [0.73, 0.99]
Bhatnagar and Raman [4] D′1 JPEG, Q = 95 [0.43, 0.99] [0.73, 0.99]
Bhatnagar and Raman [4] D′1 NR, r = 0.2 DNE DNE
∙ Verify the analytically proved formula (12) with ROC curve for different water-
marking systems.
Ten data-sets and two watermarking schemes have been used in the experiments. Each data-set consists of a set of host images (original images) and a set of watermarks. The host images are eight-bit gray scale images and the watermarks are binary images. A further description of each data-set is provided in Table 1. The data-sets D1, D2, …, D5 are compatible with the watermarking scheme of [16] and the data-sets D′1, D′2, …, D′5 are compatible with the watermarking scheme of [4]. Various attacks, such as Gaussian filter, Gaussian noise, JPEG compression, rotation, and the negative operation, have been applied on the watermarked images.
Table 3 Verification of formula (12) for various watermarking systems using ROC curve. Comparator is based on SNHS. NGF: Negative + Gaussian Filter,
NGN: Negative + Gaussian Noise, NR: Negative + Rotation, 𝜎: variance, Q: Quality Factor, r: rotation in degree, counter clockwise, DNE: Does Not Exist
Watermarking scheme Data-set Attack (t) P MSNHS Ideal point threshold range Verified
ROC curve Formula (12)
Wong and Memon [16] D1 No attack 1 0.7347 [0.74, 1.00] (0.86, 1.00] Yes
Wong and Memon [16] D2 No attack 1 0.6799 [0.68, 1.00] (0.83, 1.00] Yes
Wong and Memon [16] D3 No attack 1 0.7347 [0.74, 1.00] (0.86, 1.00] Yes
Wong and Memon [16] D4 No attack 1 0.9501 [0.96, 1.00] (0.97, 1.00] Yes
Wong and Memon [16] D5 No attack 1 0.6371 [0.64, 1.00] (0.81, 1.00] Yes
Wong and Memon [16] D1 NGF, 𝜎 = 0.3 0.9904 0.7347 [0.74, 0.99] (0.86, 0.99] Yes
Wong and Memon [16] D1 NGN, 𝜎 = 10−3 0.5 0.7347 DNE DNE Yes
Wong and Memon [16] D1 JPEG, Q = 95 0.6664 0.7347 [0.60, 0.66] DNE Yes
Wong and Memon [16] D1 NR, r = 0.2 1 0.7347 [0.74, 1.00] (0.86, 1.00] Yes
Bhatnagar and Raman [4] D′1 No attack 0.9963 0.7212 [0.73, 0.99] (0.86, 0.99] Yes
Bhatnagar and Raman [4] D′2 No attack 0.9880 0.6819 [0.69, 0.98] (0.84, 0.98] Yes
Bhatnagar and Raman [4] D′3 No attack 0.9880 0.7212 [0.73, 0.98] (0.86, 0.98] Yes
Bhatnagar and Raman [4] D′4 No attack 0.9673 0.9526 {0.96} DNE Yes
Bhatnagar and Raman [4] D′5 No attack 0.9133 0.6399 [0.65, 0.91] (0.81, 0.91] Yes
Bhatnagar and Raman [4] D′1 NGF, 𝜎 = 0.3 0.9963 0.7212 [0.73, 0.99] (0.86, 0.99] Yes
Bhatnagar and Raman [4] D′1 NGN, 𝜎 = 10−4 0.9963 0.7212 [0.73, 0.99] (0.86, 0.99] Yes
Bhatnagar and Raman [4] D′1 JPEG, Q = 95 0.9954 0.7212 [0.73, 0.99] (0.86, 0.99] Yes
Bhatnagar and Raman [4] D′1 NR, r = 0.2 0.5039 0.7212 DNE DNE Yes
The performance of the NHS, SNHS, NCC, MSNCC and AMSNCC based comparators is examined using ROC curves. One main observation is that if the extracted watermark is not the negative of the embedded watermark, then all the comparators have the same performance. However, if the extracted watermark is a negative of the embedded watermark, then the SNHS and AMSNCC based comparators have outstanding performance. Using the ROC curves, we have obtained the ideal point threshold range of the SNHS and AMSNCC based comparators for various watermarking systems. Highlights of the results are shown in Table 2. The important observations from the experiment are as follows:
∙ For watermarking systems based on the Wong and Memon [16] scheme, the AMSNCC and SNHS based comparators are better than the NHS, NCC and MSNCC based comparators against the negative attack on watermarked images.
∙ For watermarking systems based on the Bhatnagar and Raman [4] scheme, all the comparators have the same performance.
∙ The performance of the AMSNCC and SNHS based comparators is very close.
∙ The length of the ideal point threshold range decreases with increasing attack level, and finally the range vanishes.
∙ The performance of the watermarking systems degrades with the attack level.
The analytic formula (12) for the ideal point threshold range consists of two parameters, P and MSNHS. In the experiment, P and MSNHS are computed for several watermarking systems to find the ideal point threshold range using formula (12). This threshold range is compared with the ideal point threshold range obtained from the ROC curve of the corresponding watermarking system. Highlights of the results are given in Table 3. The important observations from the experiments are as follows:
∙ The ideal point threshold range obtained by formula (12) is a subinterval of the threshold range obtained by the ROC curve. This verifies formula (12).
∙ If P is greater than (1 + MSNHS) ∕ 2, then the ideal point threshold range obtained by the ROC curve is [MSNHS, P].
∙ If P is less than (1 + MSNHS) ∕ 2, then the upper bound of the ideal point threshold range obtained by the ROC curve is P.
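A small helper, under the same notation, that turns P and MSNHS into the sufficient threshold range of formula (12) and reports the 'DNE' cases seen in Tables 2 and 3 (the function name is illustrative):

def ideal_threshold_range(P, msnhs):
    # Sufficient range of Eq. (12): (1 + MSNHS)/2 < tau <= P, empty when P is too small
    lower = (1.0 + msnhs) / 2.0
    return (lower, P) if P > lower else None  # None corresponds to 'DNE'

print(ideal_threshold_range(1.0, 0.7347))    # approximately (0.867, 1.0), cf. Table 3
print(ideal_threshold_range(0.5, 0.7347))    # None (DNE)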
6 Conclusions
attack are the best when the SNHS and AMSNCC based comparators are used. However, the performance of the Bhatnagar and Raman [4] scheme is independent of the choice of comparator. The ideal point threshold range is found by using ROC curves for the NHS, SNHS, NCC, MSNCC and AMSNCC based comparators. Further, formula (12), which finds the ideal point threshold range for the SNHS based comparator, is theoretically proved. This formula is verified for several watermarking systems and is computationally more efficient than the ROC curve for finding the ideal point threshold range. Since formula (12) is sufficient, the ideal point threshold range found by using (12) is a sub-interval of the range found by the ROC curve.
Acknowledgements The author, Himanshu Agarwal, acknowledges the grants of the University
Grant Commission (UGC) of New Delhi, India under the JRF scheme and Canadian Bureau
for International Education under the Canadian Commonwealth Scholarship Program. He also
acknowledges research support of the Maharaja Agrasen Technical Education Society of India and
Jaypee Institute of Information Technology of India.
References
1. Agarwal, H., Atrey, P. K. and Raman, B. Image watermarking in real oriented wavelet transform
domain. Multimedia Tools and Applications, 74(23):10883–10921, 2015.
2. Agarwal, H., Raman, B. and Venkat, I. Blind reliable invisible watermarking method in wavelet
domain for face image watermark. Multimedia Tools and Applications, 74(17):6897–6935,
2015.
3. Bender, W., Butera, W., Gruhl, D., et al. Applications for data hiding. IBM Systems Journal,
39(3.4):547–568, 2000.
4. Bhatnagar, G. and Raman, B. A new robust reference watermarking scheme based on DWT-
SVD. Computer Standards & Interfaces, 31(5):1002–1013, 2009.
5. Cox, I. J., Kilian, J., Leighton, F. T. and Shamoon, T. Secure spread spectrum watermarking
for multimedia. IEEE Transactions on Image Processing, 6(12):1673–1687, 1997.
6. Kundur, D. and Hatzinakos, D. Digital watermarking for telltale tamper proofing and authen-
tication. Proceedings of the IEEE, 87(7):1167–1180, 1999.
7. Linnartz, J. P., Kalker, T. and Depovere, G. Modelling the false alarm and missed detection
rate for electronic watermarks. In Information Hiding, pages 329–343, 1998.
8. Memon, N. and Wong, P. W. Protecting digital media content. Communications of the ACM,
41(7):35–43, 1998.
9. Miller, M. L. and Bloom, J. A. Computing the probability of false watermark detection. In
Information Hiding, pages 146–158, 2000.
10. Pandey, P., Kumar, S. and Singh, S. K. Rightful ownership through image adaptive DWT-
SVD watermarking algorithm and perceptual tweaking. Multimedia Tools and Applications,
72(1):723–748, 2014.
11. Rani, A., Raman, B., Kumar, S. A robust watermarking scheme exploiting balanced neural
tree for rightful ownership protection. Multimedia Tools and Applications, 72(3):2225–2248,
2014.
12. Rawat, S. and Raman, B. A blind watermarking algorithm based on fractional Fourier trans-
form and visual cryptography. Signal Processing, 92(6):1480–1491, 2012.
13. Tefas, A., Nikolaidis, A., Nikolaidis, N., et al. Statistical analysis of markov chaotic sequences
for watermarking applications. In IEEE International Symposium on Circuits and Systems,
number 2, pages 57–60, Sydney, NSW, 2001.
14. Tian, J., Bloom, J. A. and Baum, P. G. False positive analysis of correlation ratio watermark
detection measure. In IEEE International Conference on Multimedia and Expo, pages 619–
622, Beijing, China, 2007.
15. Vatsa, M., Singh, R. and Noore, A. Feature based RDWT watermarking for multimodal bio-
metric system. Image and Vision Computing, 27(3):293–304, 2009.
16. Wong, P. W. and Memon, N. Secret and public key image watermarking schemes for
image authentication and ownership verification. IEEE Transactions on Image Processing,
10(10):1593–1601, 2001.
17. Xiao, J. and Wang, Y. False negative and positive models of dither modulation watermarking.
In IEEE Fourth International Conference on Image and Graphics, pages 318–323, Sichuan,
2007.
On Sphering the High Resolution Satellite
Image Using Fixed Point Based ICA
Approach
Abstract On sphering the satellite data, several authors have obtained classified images by trying to reduce the mixing effect among image classes with the help of different Independent Component Analysis (ICA) based approaches. In these cases, multispectral images are limited by the small spectral variation among heterogeneous classes. For better classification, high spectral variance among different classes and low spectral variance within a particular class should be exhibited. Considering this issue, a Fixed Point (FP) based Independent Component Analysis (ICA) method is utilized to obtain better classification accuracy for the existing mixed classes that exhibit similar spectral behavior. This FP-ICA method identifies the objects from mixed classes having similar spectral characteristics by sphering high resolution satellite images (HRSI). It also helps to reduce the effect of similar spectral behavior between different image classes. The independent components of the non-gaussian distributed data (image) are estimated, and the performance of this approach is optimized with the help of a nonlinearity that exploits the low variance between spectrally similar classes. The method is quite robust, computationally simple and has a high convergence rate, even though the spectral distributions of satellite images are hard to classify. Hence, this FP-ICA approach plays a key role in classifying image objects such as buildings, grassland areas, roads, and vegetation.
1 Introduction
In the current scenario, rapid changes are found in the environment, but accurate classification is still a challenging task when distinct image information has to be provided for the respective image classes in an automated manner. One of the causes can be the little spectral variation among the different land classes. In this regard, the existing classification approaches stress identifying accurate class information from the mixed classes present in high resolution satellite images (HRSI). This intricate problem has been resolved in different ways only up to some level, because the occurrence of mixed classes leads to less accurate classification. In such a scenario, the existence of edge information is reasonably useful for segregating the image objects.
The conservation of image information and the diminution of noise are not two interdependent parts of image processing; hence, a single technique is not quite valuable for all applications. On the basis of the spectral resolution of HRSI and the field of restoration, many premises have been recommended for different noise reduction methods. Quasi-Newton methods and related higher order cumulants are quite suitable for finding the saddle points in these theories for segregating spectrally similar classes. Such approaches have played an important role in reducing the mixed-class problem in classification. A state-space based approach was utilized to resolve the difficulties occurring in blind source separation, and thereafter a deconvolution method was included to improve the existing classes [1–4].
Mohammadzadeh et al. [5] utilized particle swarm optimization (PSO) to optimize a proposed fuzzy based mean estimation method. It helps to evaluate a better mean value for road detection in a specific spectral band, owing to the improvement of the fuzzy cost function by PSO. Singh and Garg [6] calculated the threshold value in an adaptive manner for extracting the road network while considering the illumination factor. In addition, artificial intelligence (AI) related methods are used in image-processing techniques to classify buildings automatically [7, 8].
Li et al. [9] designed a snake model for automatic building extraction by utilizing region growing techniques and mutual information. Singh and Garg [10] developed a hybrid technique for classification of HRSI, comprising a nonlinear derivative method and an improved watershed transform, to extract impervious surfaces (buildings and roads) from different urban areas.
The problem of mixed classes arises when the spectral behavior of pixels acts as a Gaussian random variable with zero mean and unit variance among the classes. However, such a limitation is another issue that weakens this kind of approximation, in particular due to the non-robustness encountered with negentropy. To decrease the mutual information among the variables (image classes), the negentropy is increased. This reflects better independence among the mixed image classes, since negentropy measures the non-gaussianity of the independent components. A larger value of non-gaussianity indicates better separated classes in images [11, 12]. Hyvarinen [13] provided a new approximation to resolve the issues related to
where the diagonal matrix Dde^(−1∕2) is evaluated by using an easy component-wise operation as
At this instant, applying the Newton method to each independent component of Eq. (4) delivers Eq. (5) with the Jacobian matrix JF(wsi),

JF(wsi) = E{xsi xsi^T g′(wsi^T xsi)} − 𝛽I.   (5)

The first term of Eq. (5) is approximated to make the inversion of this matrix easy:

E{xsi xsi^T g′(wsi^T xsi)} ≈ E{xsi xsi^T} E{g′(wsi^T xsi)}   (6)
= E{g′(wsi^T xsi)} I,   (7)
and
wsi* = wsi⁺ ∕ ‖wsi⁺‖,   (10)
where wsi* describes the new value of wsi for sphering the satellite image data. FP-ICA algorithms estimate many independent components with a number of units (e.g., neurons) and their corresponding weight vectors w1, …, wn for the entire satellite image. During the iterations, a decorrelation step is applied to the image outputs w1^T xsi, …, wn^T xsi, which helps to prevent distinct vectors from converging to the same maximum (a single image class). An easy way to achieve decorrelation is to estimate the independent components 'c' one by one in a sequence: once w1, …, wc have been estimated, a one-unit fixed-point algorithm is performed to find wc+1, and in the next iteration the projections (wc+1^T wj) wj onto the previously estimated vectors, j = 1, …, c, are subtracted from wc+1. Thereafter, wc+1 is renormalized.
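A minimal numpy sketch of this deflation scheme, using the standard one-unit fixed-point update of Hyvarinen and Oja [11, 13] with g(u) = tanh(u); the data are assumed to be already centred and whitened (sphered):

import numpy as np

def fp_ica(X, c, n_iter=200, tol=1e-6):
    # X: whitened data of shape (features, samples); c: number of components
    d, _ = X.shape
    W = np.zeros((c, d))
    rng = np.random.default_rng(0)
    for k in range(c):
        w = rng.standard_normal(d)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            wx = w @ X
            g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
            w_new = (X * g).mean(axis=1) - g_prime.mean() * w  # one-unit fixed-point update
            w_new -= W[:k].T @ (W[:k] @ w_new)                 # deflation: remove projections on w_1..w_k
            w_new /= np.linalg.norm(w_new)                     # renormalize, cf. Eq. (10)
            if abs(abs(w_new @ w) - 1.0) < tol:
                w = w_new
                break
            w = w_new
        W[k] = w
    return W  # estimated un-mixing vectors, one per row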
The proposed approach utilizes the FP-ICA method to classify the image objects. It also determines and suppresses the mixed-class problem, which helps to maintain the principle of mutual exclusion between different classes. Initially, the HRSI is taken as input in matrix form 'xsi', whose dimensions are x (sources) and s (samples). The different spectral values of the pixels and the image pixels form the columns and rows of the image matrix, respectively. The other possible input argument is the number of independent components (c), which is an essential factor for extracting the image classes (default c = 1), together with the weighting coefficients (wsi) used to identify distinct objects from the mixed classes.
The proposed structure of the FP-ICA approach for the classification of satellite images is shown in Fig. 1. First, a preprocessing step performs the whitening or sphering process on the HRSI. Thereafter, non-gaussianity processing and negentropy evaluation are carried out on the sphered data before the FP-ICA method is utilized for classification. Non-gaussianity indicates lower similarity between image classes, and its maximum values reveal the separations between the mixed classes. These classes are generally less non-gaussian in comparison to the existing classes, which are
Fig. 1 Block diagram of the proposed FP-ICA approach: whitening transformation (sphering process on the image pixels of the high resolution satellite imagery), measurement of non-gaussianity and negentropy calculation, fixed-point independent component analysis (FP-ICA), and performance evaluation
Figure 2a and b show the different input satellite images SI1 and SI2 respectively.
The classification results for input satellite images SI1 and SI2 are shown in Fig. 2c
and d respectively. The lower values of performance index (PI) illustrate that the
Fig. 2 FP-ICA based classified results: a, b input image; c, d result of the proposed approach
The PI values are used for the qualitative analysis of the proposed FP-ICA algorithm. The value of PI is defined as follows [15]:

PI = (1 ∕ (n(n − 1))) ∑_{k=1}^{n} [ ( ∑_{m=1}^{n} |gs_km| ∕ max_l |gs_kl| − 1 ) + ( ∑_{m=1}^{n} |gs_mk| ∕ max_l |gs_lk| − 1 ) ],   (11)

where gs_kl denotes the (k, l) component of the matrix of the entire image,

GS = WH,   (12)

the maximum value of the components in the kth row vector of GS is expressed as max_l |gs_kl|, and the maximum value of the components in the kth column vector of GS is expressed as max_l |gs_lk|. W and H stand for the separating matrix and the mixing matrix respectively. If the PI value is almost zero or very small, it signifies that a perfect separation has been acquired at the kth extraction of the processing unit.
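A short numpy sketch of this performance index for a given global matrix GS = W·H (a standard Amari-style index; the function name is illustrative):

import numpy as np

def performance_index(W, H):
    # Eq. (11): closer to zero means better separation (GS closer to a scaled permutation)
    GS = np.abs(W @ H)
    n = GS.shape[0]
    row_term = (GS / GS.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    col_term = (GS / GS.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return float((row_term.sum() + col_term.sum()) / (n * (n - 1)))

# Example: a perfect (permuted, scaled) separation gives PI = 0
print(performance_index(np.array([[0.0, 2.0], [3.0, 0.0]]), np.eye(2)))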
Table 1 shows the performance measurement of the proposed approach for image classification. The given number of iterations is needed for the method to converge within the mentioned elapsed time. Figure 2c and d demonstrate the convergence of the FP-ICA method for the ES1 and ES2 images, which took 2.03 and 2.14 s respectively to classify the satellite image into different classes. Further, PI values of 0.0651 and 0.0745 are obtained for the ES1 and ES2 satellite images respectively, which illustrate the good separation of the different classes in each image. The suggested iterations are required for this method to converge in the desired time. The comparison of the calculated Kappa coefficient (𝜅) and overall accuracy (OA) for image classification with the help of the proposed FP-ICA method is shown in Table 2. Tables 3 and 4 illustrate the producer's accuracy (PA) and user's accuracy (UA) respectively for the classification of HRSI.
4 Conclusion
In this paper, the FP-ICA based approach achieves a good level of accuracy for satellite image classification into different classes. Therefore, the classified image makes it possible to recognize the existing class information distinctly. The classified image results demonstrate the existing objects, such as buildings, grasslands, roads, and vegetation, in the HRSI of an emerging urban area. The uncorrelated image classes are identified as dissimilar objects by using this FP-ICA approach. The proposed approach is reasonably effective in decreasing the issue of mixed classes in satellite images by incorporating the whitening process, which suppresses the effect of spectral similarities among different classes. Thus, the proposed approach plays a major role in the classification of satellite images into four major classes. The experimental outcomes of the satellite image classification evidently show that the proposed approach achieves a good level of accuracy and convergence speed. It is also suitable for resolving the satellite image classification problem in the presence of mixed classes. Moreover, the extension of ICA with other techniques can also be tested.
References
1. Amari, S.: Natural gradient work efficiently in learning. Neural Comput. 10(2), 251–276
(1998).
2. Cichocki, A., Unbehauen, R., Rummert, E.: Robust learning algorithm for blind separation of
signals. Electronics Letters 30(17), 1386–1387 (1994).
3. Zhang, L., Amari, S., Cichocki, A.: Natural Gradient Approach to Blind Separation of Over-
and Under-complete Mixtures. In Proceedings of ICA’99, Aussois, France, January 1999,
pp. 455–460 (1999).
4. Zhang, L., Amari, S., Cichocki, A.: Equi-convergence Algorithm for blind separation of
sources with arbitrary distributions. In Bio-Inspired Applications of Connectionism IWANN
2001. Lecture notes in computer science, vol. 2085, pp. 826–833 (2001).
5. Mohammadzadeh, A., ValadanZoej, M.J., Tavakoli, A.: Automatic main road extraction from
high resolution satellite imageries by means of particle swarm optimization applied to a fuzzy
based mean calculation approach. Journal of Indian society of Remote Sensing 37(2), 173–
184 (2009).
6. Singh, P.P., Garg, R.D.: Automatic Road Extraction from High Resolution Satellite Image
using Adaptive Global Thresholding and Morphological Operations. J. Indian Soc. of Remote
Sens. 41(3), 631–640 (2013).
7. Benediktsson, J.A., Pesaresi, M., and Arnason, K.: Classification and feature extraction for
remote sensing images from urban areas based on morphological transformations. IEEE
Transactions on Geoscience and Remote Sensing 41, 1940–1949 (2003).
8. Segl, K., Kaufmann, H.: Detection of small objects from high-resolution panchromatic
satellite imagery based on supervised image segmentation. IEEE Transactions on Geoscience
and Remote Sensing 39, 2080–2083 (2001).
9. Li, G., Wan,Y., Chen, C.: Automatic building extraction based on region growing, mutual
information match and snake model. Information Computing and Applications, Part II, CCIS,
vol. 106, pp. 476–483 (2010).
10. Singh, P.P., Garg, R.D.: A Hybrid approach for Information Extraction from High Resolution
Satellite Imagery. International Journal of Image and Graphics 13(2), 1340007(1–16) (2013).
11. Hyvarinen, A., Oja, E.: A Fast Fixed-Point Algorithm for Independent Component Analysis.
Neural Computation 9(7), 1483–1492 (1997).
12. Hyvarinen, A., Oja, E.: Independent Component Analysis: Algorithms and Applications.
Neural Networks 13(4–5), 411-430 (2000).
13. Hyvarinen, A.: Fast and Robust Fixed-Point Algorithms for Independent Component
Analysis. IEEE Transactions on Neural Networks 10(3), 626–634 (1999).
14. Luenberger, D.: Optimization by Vector Space Methods, Wiley (1969).
15. Singh, P.P., Garg, R.D.: Fixed Point ICA Based Approach for Maximizing the
Non-gaussianity in Remote Sensing Image Classification. Journal of Indian Society of
Remote Sensing 43(4), 851–858 (2015).
A Novel Fuzzy Based Satellite Image
Enhancement
1 Introduction
N. Sharma (✉)
Maharaja Agrasen Institute of Technology, Rohini, Delhi, India
e-mail: [email protected]
O.P. Verma
Delhi Technological University, Delhi, India
e-mail: [email protected]
details of an image. SVE, SVD, DCT and DWT are other approaches that are also applied to enhance satellite images [4]. Researchers have applied corrections either on the whole image or on any one component of the image. These pixel operations applied on the whole image do not preserve the edge information [5–7]; sometimes they also enhance the noisy pixels. The edge information is preserved if we apply the approaches in the lower region of the image, such as applying gamma correction on the LL component of the discrete wavelet transformed satellite image [8]. Due to the presence of different regions in the same satellite image, the dynamic range of the other components is not increased. Various fuzzy algorithms have been studied in the literature to handle the uncertainty in image enhancement [9–14]. A modified fuzzy based approach is proposed here to increase the contrast as well as the dynamic range of the satellite image. The proposed approach removes the uncertainty in the different spectral regions of satellite images and enhances the quality of the image.
The organization of the paper is as follows: Sect. 2 introduces the concept of SVD in contrast enhancement; this decomposition increases the illumination of the image. Section 3 presents the fuzzy based contrast enhancement. In Sect. 4, we define the proposed approach used to increase the contrast of the satellite image. The analysis based on the performance measures is given in Sect. 5. The results and conclusions are drawn in Sects. 6 and 7 respectively.
The SVD technique is used in image enhancement for improving the illumination of the given image. It is well known that the SVD has optimal decorrelation and sub-rank approximation properties. This technique matches each block on the basis of singular vectors. The use of SVD in image processing applications is described in [5]. The singular value decomposition of X is as follows:

X = U D V^T,   (1)

where X is the closest rank approximation matrix, and U, V and D are the left, right and diagonal matrices respectively. Among U, V and D, we concentrate on the diagonal eigenvalues only. These values are employed in the correction step, which gives the rank approximation. For example, a 4 × 4 block input to the SVD block will have a total of 4 − k eigenvalues, where k denotes the number of different diagonal elements. The significant eigenvalues are considered, after arranging the eigenvalues in descending order along the diagonal of the 'D' matrix, for applying correction factors to the input sample image.
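A minimal numpy illustration of this decomposition and of a singular-value based correction factor; the specific correction rule (scaling by the ratio of the largest singular values of the channel and of a reference image, e.g. an equalized version) is a common choice and only an assumption here:

import numpy as np

def svd_correction(channel, reference):
    # Decompose both images as X = U D V^T and scale the singular values
    U, D, Vt = np.linalg.svd(channel.astype(float), full_matrices=False)
    _, D_ref, _ = np.linalg.svd(reference.astype(float), full_matrices=False)
    xi = D_ref.max() / D.max()      # assumed correction factor (ratio of largest singular values)
    return (U * (xi * D)) @ Vt      # reconstruct the illumination-corrected channel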
In this paper, the HSV color model (Hue, Saturation and gray level Value) is considered for satellite image enhancement, because performing the enhancement directly on the Red, Green and Blue components of the satellite image results in color artifacts. Color image enhancement must preserve the original color information of the satellite image and increase the pixel intensity in such a way that it does not exceed the maximum value of the image; it must also be free from the gamut problem. Fuzzy set theory is one of the useful tools for reducing uncertainty and ambiguity. For satellite image enhancement, one important step is the creation of "IF…, Then…, Else" fuzzy rules. A set of neighbourhood pixels yields the antecedent and consequent clauses that act as the fuzzy rule for the pixel to be enhanced; these rules give decisions similar to the reasoning of human beings. The fuzzy approach mentioned above is employed on the luminance part of the color satellite image. The reason is that, during the enhancement operation, the luminance values of pixels get over-enhanced or under-enhanced because they do not stay in sync with the range of the histogram. Fuzzy satellite image processing involves image fuzzification, modification of the membership values and defuzzification. The selection of the membership function is based on the application.
The grayscale image of size M × N having the intensity levels Imn in the range [0, L − 1] is described as

I = ⋃ {μ(Imn)} = ⋃ {μmn ∕ Imn}, m = 1, 2, …, M and n = 1, 2, …, N,   (2)

where μ(Imn) represents the membership of Imn, the intensity at the (m, n)th pixel. An image can be classified into the low, mid and high intensity regions according to the three fuzzy rules defined in Table 1.
The above rules are applied on the different multispectral regions of the color satellite image. The image is split into three different intensity regions, each centred on the mean value of that region. The modified sigmoidal membership function used to fuzzify the low and high intensity regions of the satellite image is as follows:

μX(x) = 1 ∕ (1 + e^(−x)),   (3)

where x represents the gray level of the low and high intensity regions.
Table 1 List of rules: If the intensity is low, then the region belongs to underexposed; if the intensity is mid, then the region is moderate; if the intensity is high, then the region belongs to overexposed
The mid intensity region is fuzzified using a Gaussian membership function,

μXg(x) = (1 ∕ (σ√(2π))) e^(−(x − xavg)² ∕ (2σ²)),   (4)

where x indicates the gray level intensity value of the mid intensity region, xavg is the average (mean) gray level value in the given image and σ represents the deviation from the mean gray level intensity value. These membership functions transform the image intensity from the spatial domain to the fuzzy domain; their values lie in the range 0–1. The sigmoid membership function modifies the underexposed and overexposed regions of the given satellite image, while the mid intensity region is modified using the Gaussian membership function. Finally, the output ro is defuzzified by using a constant membership function given by

ro = ∑_{i=1}^{3} μi ri ∕ ∑_{i=1}^{3} μi.   (5)
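A compact numpy sketch of this fuzzification/defuzzification step on a normalized intensity channel; the exact region boundaries and the modification applied to the membership values are assumptions, since the paper only fixes the membership functions of Eqs. (3)–(5):

import numpy as np

def sigmoid_membership(x):
    # Eq. (3): used for the low and high intensity (under/overexposed) regions
    return 1.0 / (1.0 + np.exp(-x))

def gaussian_membership(x, x_avg, sigma):
    # Eq. (4): used for the mid intensity region
    return np.exp(-((x - x_avg) ** 2) / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

def defuzzify(memberships, outputs):
    # Eq. (5): weighted average of the region outputs r_i by their memberships mu_i
    mu = np.asarray(memberships, dtype=float)
    r = np.asarray(outputs, dtype=float)
    return (mu * r).sum(axis=0) / np.maximum(mu.sum(axis=0), 1e-12)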
4 Proposed Methodology
Table 2 Fuzzy based satellite color image enhancement algorithm
1. Read a color satellite image and obtain the three color channels, i.e. X = {R, G, B}
2. Transform the color channels X into the HSV color space
3. Obtain the normalized value of each channel (H̄, S̄, Ī)
4. The intensity channel I is fuzzified using the three membership functions defined in Eqs. (2)–(4)
5. Obtain the modified membership values for the underexposed and overexposed regions
6. Defuzzification is done using Eq. (5)
7. Finally, apply the correction factor to the degraded channel after performing SVD using Eq. (1)
5 Performance Measures
The contrast assessment function (CAF) is used to measure the quality of the proposed methodology. This function is calculated by computing the brightness, entropy and contrast; these measures evaluate the quality of the enhanced image.

CAF = IE^α + C̄^β,   (6)

where IE and C̄ represent the average entropy value of the given image and the average color contrast in the image respectively [15].
Table 3 Comparison of luminance, entropy, average contrast, CAF values using proposed
approach, histogram equalization and DCT based methods over different images
Images Luminance L̄ Average Entropy IE Average contrast C̄ CAF
1 Test image 79.9986 6.4551 65.6653 18.3753
Proposed approach 141.8108 7.0530 111.8474 22.9368
HE based method 79.9941 6.4722 66.0319 18.4498
DCT based method 89.4299 6.6564 73.0664 19.4610
2 Test image 65.4602 7.0769 99.3299 22.3417
Proposed approach 102.2152 7.4782 148.6010 26.1097
HE based method 65.5456 7.0912 96.0337 22.1985
DCT based method 85.2260 7.3169 127.5119 24.5875
3 Test image 71.6810 6.6087 54.6698 17.9702
Proposed approach 126.4009 7.5229 103.4835 23.9940
HE based method 67.5417 6.6926 67.5481 19.1867
DCT based method 101.3466 7.1918 81.3107 21.5960
4 Test image 39.8276 5.8954 56.0258 16.1292
Proposed approach 91.5350 6.7875 149.6766 23.7411
HE based method 39.9452 5.9362 55.8990 16.2316
DCT based method 85.5918 6.7959 144.7823 23.5737
7 Conclusions
Fuzzy based satellite image enhancement has been implemented by fuzzifying the color intensity property of an image. An image may be categorized into overexposed and underexposed regions, and suitable modified membership functions are employed for the fuzzification of the different regions. The results of the proposed fuzzy based contrast enhancement approach have been compared with recently used approaches. The proposed approach successfully yields better results in terms of luminance, entropy, average contrast and CAF value.
References
1. Gonzalez, C. Rafael, and E. Richard. “Woods, digital image processing.” ed: Prentice Hall
Press, ISBN 0-201-18075-8, 2002.
2. Gillespie, R. Alan, B. Anne, Kahle, and E. Richard Walker. “Color enhancement of highly
correlated images. I. Decorrelation and HSI contrast stretches.” Remote Sensing of
Environment 20, vol. 3, pp. 209–235, 1986.
3. P. Dong-Liang and X. An-Ke, “Degraded image enhancement with applications in robot
vision,” in Proc. IEEE Int. Conf. Syst., Man, Cybern., Oct. 2005, vol. 2, pp. 1837–1842.
4. H. Ibrahim and N. S. P. Kong, “Brightness preserving dynamic histogram equalization for
image contrast enhancement,” IEEE Trans. Consum. Electron., vol. 53, no. 4, pp. 1752–1758,
Nov. 2007.
1 Introduction
The ease with which digital images can be modified to alter their content and
meaning of what is represented in them has been increasing with the advancing
technology. The context in which these images are involved could be a tabloid, an
advertising poster, and also a court of law where the image could be legal evidence.
Many algorithms are now being developed that take advantage of making a machine learn to classify datasets into various classes. This classification is basically done by identifying a set of features that differ between PIM and PRCG images. These features are then used to train a classifier on a set of sample images, and a threshold is set by the trained classifier, based on which testing is done.
There remains controversy in both the forensics and human vision fields, as there are no agreed standards for measuring the realism of computer generated graphics. Three varieties exist in computer graphics, which mainly differ in the level of visual coding.
1. Physical realism: same visual stimulation as that of the scene.
2. Photo realism: same visual response as the scene.
3. Functional realism: same visual information as the scene.
Among the three, the first kind of realism is hard to achieve in real applications. Computer images which have the last kind of realism are usually presented as cartoons and sketches and hence can be easily classified as computer generated images. The second kind of realism forms the crucial part of computer generated images, as such images
look as real as a photograph. Farid [1] has categorized the various tools used for
image forgery detection into five different categories of techniques used:
1. Detection of statistical anomalies at pixel level.
2. Detection of statistical correlation introduced by lossy compression.
3. Detection of the artefacts introduced by the camera lens, sensor or
post-processing operations.
4. Detection of anomalies in object interaction with light and sensor in 3D domain.
5. Detection of real world measurements.
Ng and Chang [2], developed a classifier using 33D power spectrum features,
24D local patch features, and 72D higher order wavelet statistic features, which
come under category 1 (Cat-1) defined in [1]. Lyu and Farid [3] proposed a 214D
feature vector based on first four order statistics of wavelet sub-band coefficients and
the error between the original and predicted sub-band coefficients. More recently,
Wang et al. [4] have proposed a 70D feature vector based on statistical features
extracted from the co-occurrence matrices of differential matrices based on con-
tourlet transform of the image and homomorphically filtered image and texture
similarity of these two images and their contourlet difference matrices (Cat-1). Based
on the physics of an imaging and image processing of a real world natural image, Ng
et al. [5] have proposed a 192D feature vector utilizing the physical differences in
PIM and PRCG images, viz., local patch statistics (Cat-1), local fractal dimension,
surface gradient, quadratic geometry, and Beltrami flow (Cat-4&5). Since every PIM is obtained using a specific camera, Dehnie et al. [6] differentiated PIM and PRCG images based on the camera response or sensor noise, which is specific to a given digital camera and would be absent in a PRCG image. Gallagher and
Chen [7] later used features extracted based on Bayer’s color filter array
(CFA) (Cat-3). More recently, methods that can discriminate computer generated
and natural human faces (Cat-1, 4&5) have been developed [8].
2 Proposed Method
This difference map is divided into 8 × 8 blocks, K1, K2, …, Kn, where n is the number of blocks. As the difference at the block borders is higher than that inside the block, this operation gives, for each block, information about the JPEG blocking artefacts in the image. This operation also reveals the interpolation information in an image, assuming a Bayer array and linear interpolation [7]. Then the mean of the (i, j)th pixel over all corresponding blocks is found. This mean, the Blocking Artefacts Matrix (BAM),
Fig. 1 a Calculation of β: first order difference, BAM and BAV, DFT, peak analysis. b Calculation of the features in [7]: HP filter (second order difference), positional variance (diagonal averages), DFT, peak analysis. We see that, except for the second level, the rest of the analysis to calculate the features is more or less the same
which is an 8 × 8 matrix, would have high values at the block boundaries for a tampered image:

BAM(i, j) = (1 ∕ n) ∑_{c=1}^{n} |Kc(i, j)|,   1 ≤ i, j ≤ 8.   (2)
BAM is then converted to Blocking Artefacts Vector (BAV), a vector of size 64.
Let the magnitude spectrum of BAV be P(w),
Now because there are sharp boundaries (transitions) at the edges of the BAM,
the DFT of BAV would result in clearly distinguishable peaks at (normalized
frequency) w = m/8, m = 1, 2, …, 7 as mentioned in [10]. But in experiments, it is
observed that these peaks are more pronounced in the case of a PIM (Fig. 2a) than
in the case of a PRCG (Fig. 2b). A PIM is generated with the help of CFA pattern,
because of which interpolation occurs and thus the peaks P(w) are more pronounced
for a PIM, than for a PRCG image. Since a PRCG image is created using a
software, it would not show the artefacts related to CFA interpolation.
In Fig. 2a, b distinct peaks can be observed at w = m/8, m = 1, 2, …, 7 for both
PIM and PRCG, but they are stronger in the case of a PIM than in the case of a PRCG image.
Thus, it is possible to differentiate a PIM and PRCG based on β, computed as:
β = log( P(1∕8) · P(2∕8) · P(3∕8) + ε ) − log( P(0) + ε ).   (4)
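A compact numpy sketch of this feature on a grayscale image (a minimal illustration of Eqs. (2)–(4); the use of a horizontal first order difference and the value of ε are assumptions):

import numpy as np

def beta_feature(img, eps=1e-8):
    # img: 2-D grayscale array; first order (horizontal) difference map is assumed
    diff = np.abs(np.diff(img.astype(float), axis=1))
    h, w = diff.shape
    diff = diff[: h - h % 8, : w - w % 8]
    # Eq. (2): mean of |K_c(i, j)| over all 8x8 blocks -> 8x8 Blocking Artefacts Matrix
    blocks = diff.reshape(diff.shape[0] // 8, 8, diff.shape[1] // 8, 8)
    bam = np.abs(blocks).mean(axis=(0, 2))
    bav = bam.reshape(-1)                      # Blocking Artefacts Vector of size 64
    P = np.abs(np.fft.fft(bav))                # magnitude spectrum P(w), w = k/64
    # Eq. (4): peaks at normalized frequencies m/8 correspond to DFT bins 8*m for length 64
    return float(np.log(P[8] * P[16] * P[24] + eps) - np.log(P[0] + eps))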
Gallagher and Chen [7] have also calculated a feature to identify the CFA
interpolation artefacts using a similar algorithm.
Similarly, another feature, the block measure factor 'α', is calculated [10] based on JPEG resampling; it is determined using the row (column) averages of the DFT of the second order difference image [10]:
E_DA(w) = (1 ∕ M) ∑_{m=1}^{M} |DFT(E(m, n))|,   (6)
α = log( ∑_w (E_DA(w) + ε) ∕ ∑_w log(E_DA(w) + ε) ).   (7)
Two different steps of testing were performed in order to evaluate the effectiveness of the proposed approach in differentiating PIM and PRCG images. The first step aims at measuring the accuracy of the proposed classifier and at comparing it with the state of the art, in particular with [4, 7]. In the second step, the robustness of the proposed classifier is tested and its benefits over [4, 7] are detailed.
To train and test the proposed 2-D feature vector classifier, we have trained an SVM classifier using the pre-built libraries of LIBSVM integrated with MATLAB. Images of different formats and conditions are used, viz., 350 uncompressed TIFF images from the Dresden Image Database [12], 1900 JPEG images taken with 23 different digital cameras from the Dresden Image Database, 350 uncompressed TIFF images from the UCID database [13], and 2500 computer generated images from the ESPL database [14, 15]. In addition, 800 JPEG compressed PIM images and 800 JPEG compressed PRCG images from the Columbia Dataset [9] are considered for comparison with the algorithms proposed in [4, 7].
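A minimal sketch of the training step; the paper uses LIBSVM under MATLAB, so the scikit-learn call below is only an illustrative stand-in, with the two features per image being the (α, β) pair described above:

import numpy as np
from sklearn.svm import SVC

def train_classifier(features, labels):
    # features: array of shape (num_images, 2) holding [alpha, beta] per image
    # labels: 1 for PIM (camera image), 0 for PRCG (computer generated)
    clf = SVC(kernel="rbf")          # kernel choice is an assumption
    clf.fit(np.asarray(features), np.asarray(labels))
    return clf

# Usage sketch (with precomputed feature vectors):
# clf = train_classifier(train_feats, train_labels)
# predictions = clf.predict(test_feats)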
The various values of β shown in Fig. 3 demonstrate clusters, different for
different cameras and a cluster for PRCG images. This shows that the tampering
based features are specific to a given digital camera. The magnitudes of β are
significantly larger for a PRCG image, because of very small AC component values
(Fig. 2b). This shows that there is less variation in pixel intensities from pixel to pixel in PRCG images than in PIM images, as mentioned in Sect. 2. From the
similarities mentioned in Sect. 3 and the experimental results, it can be observed
that the source specific feature that has been estimated using these tampering based
features is the CFA interpolation pattern.
These 5000 images are divided randomly into two sets of 2500 images each, one for training and the other for testing. This process is repeated for ten different random selections; the results are shown in Table 1, and on average there is a 7.8 % false positive rate for a 94.2 % true detection rate of camera generated images.
From Table 1 the average accuracy of detection rate can be observed to be
93.2 %. To validate the proposed approach the obtained results are compared with
those of [4, 7] reported by Ng et al. [16]. The proposed classifier was tested on
Columbia Image dataset with 800 PRCG and 800 PIM images using the same
testing procedure. For PIM, 98.67 % true detection rate and 9.8 % false positive
rate, with an average accuracy of about 95.82 % was observed. The best accuracy
results of [4, 7] are shown in Table 2.
3.2 Robustness
Fig. 4 a PRCG Images from various websites. b PIM from Dresden Database. c (left to right and
top to bottom) Original example from ESPL Database, aliased, with banding artefacts, blurred,
Gaussian noise added, JPEG compressed with q = 5, image with ringing artefacts, and color
saturated
Fig. 5 ROC for Dresden [12] and ESPL [14, 15] database
Table 3 Robustness analysis for various quality factors in JPEG compression
Quality factor Average detection accuracy rate (%)
90 96.31
70 96.93
50 96.93
20 96.62
5 94.94
This can also be justified as follows: "a PRCG image is a completely tampered image", and thus any other tampering or content hiding manipulations applied to the image are nullified.
4 Conclusion
References
1. Farid, Hany. “Image forgery detection.” Signal Processing Magazine, IEEE 26.2 (2009): 16–
25.
2. Ng, Tian-Tsong, and Shih-Fu Chang. “Classifying photographic and photorealistic computer
graphic images using natural image statistics,” ADVENT Technical report, 2004.
3. Lyu, Siwei, and Hany Farid. “How realistic is photorealistic?” Signal Processing, IEEE
Transactions on 53.2 (2005): 845–850.
4. Wang, Xiaofeng, et al. “A statistical feature based approach to distinguish PRCG from
photographs,” Computer Vision and Image Understanding 128 (2014): 84–93.
5. T. Ng, S. Chang, J. Hsu, L. Xie, MP. Tsui, “Physics motivated features for distinguishing
photographic images and computer graphics,” in: Proceedings of MULTIMEDIA 05, New
York, NY, USA, ACM2005, pp. 239–248.
6. S. Dehnie, H.T. Sencar, N.D. Memon, “Digital image forensics for identifying computer
generated and digital camera images”, ICIP, IEEE, Atlanta, USA (2006), pp. 2313–2316.
7. Andrew C. Gallagher, Tsuhan Chen, “Image authentication by detecting traces of
demosaicing,” in: Proceedings of the CVPR WVU Workshop, Anchorage, AK, USA, June
2008, pp. 1–8.
8. H. Farid, M.J. Bravo, “Perceptual discrimination of computer generated and photographic
faces,” Digital Invest. 8 (3–4) (2012) 226 (10).
9. Tian-Tsong Ng, Shih-Fu Chang, Jessie Hsu, Martin Pepeljugoski, “Columbia Photographic
Images and Photorealistic Computer Graphics Dataset,” ADVENT Technical Report
#205-2004-5, Columbia University, Feb 2005.
10. Zuo, J., Pan, S., Liu, B., & Liao, X., “Tampering detection for composite images based on
re-sampling and JPEG compression,” In Pattern Recognition (ACPR), 2011 First Asian
Conference on (pp. 169–173). IEEE.
11. Chih-Chung Chang, Chih-Jen Lin, “LIBSVM: A library for support vector machines,” ACM
Transactions on Intelligent Systems and Technology (TIST), v.2 n.3, pp. 1–27, April 2011.
12. Gloe, T., & Böhme, R., “The ‘Dresden Image Database’ for benchmarking digital image
forensics,” In Proceedings of the 25th Symposium on Applied Computing (ACM SAC 2010)
(Vol. 2, pp. 1585–1591).
13. G. Schaefer and M. Stich (2003) “UCID - An Uncompressed Colour Image Database”,
Technical Report, School of Computing and Mathematics, Nottingham Trent University, U.K.
14. D. Kundu and B. L. Evans, “Full-reference visual quality assessment for synthetic images: A
subjective study,” in Proc. IEEE Int. Conf. on Image Processing., Sep. 2015, accepted,
September 2015.
15. Kundu, D.; Evans, B.L., “Spatial domain synthetic scene statistics,” Signals, Systems and
Computers, 2014 48th Asilomar Conference on, vol., no., pp. 948–954, 2–5 Nov. 2014.
16. Tian-Tsong Ng, Shih-Fu Chang, “Discrimination of computer synthesized or recaptured
images from real images,” Digital Image Forensics, 2013, p. 275– 309.
A Novel Chaos Based Robust Watermarking
Framework
1 Introduction
Recently, there is a rapid growth in the area of internet and communication tech-
nology. As a result, one can easily transmit, store and modify digital data, such as
image, audio and video, which essentially leads to the issues of illegal distribution,
copy, editing, etc. of digital data. Therefore, there is a need for security measures to prevent these issues. The technology of watermarking has recently been identified as one of the possible solutions [1, 2]. The basic idea of watermarking is to embed a mark or information related to the digital data into the digital data. This mark could be copyright information, a time-stamp or any other useful information, which can be extracted later for different purposes.
In recent years, a range of watermarking schemes has been proposed for diverse purposes. These techniques can be broadly classified into frequency and spatial domain schemes [3–6]. For the frequency domain schemes, the information is hidden in the coefficients obtained from various transforms such as the Discrete Fourier Transform (DFT) [7], the Discrete Cosine Transform (DCT) [8], the Wavelet Transform (DWT) and others [9, 10]. In contrast, the basic idea of spatial domain watermarking is to directly modify the values of the digital data. That brings some advantages, such as a simple structure, lower computational complexity and easy implementation, but also results in lower security and weaker robustness to signal processing attacks.
In this paper, a novel spatial domain image watermarking scheme is proposed
based on the pseudorandom noise (PN) sequence and non-linear chaotic map. The
basic idea is to first generate a chaotic sequence considering the length of the watermark and an initial seed. This sequence is then used to construct a feature vector using modular arithmetic. This feature vector is used to generate a PN sequence, followed by the generation of a feature image by stacking the PN sequence into an array. For each pixel in the watermark, a feature image is constructed by applying a circular shift to the obtained feature image. The obtained feature images are then used for embedding and extraction of the watermark, and the reverse process is applied for validation of the watermark. Finally, the attack analysis demonstrates the efficiency of the proposed scheme against a number of intentional/unintentional attacks.
The remaining paper is organized as follows. In Sect. 2, non-linear chaotic map
and PN sequence are briefly described. Next, the detailed description of the proposed
watermarking scheme is introduced in Sect. 3. The detailed experimental results and
discussions are given in Sect. 4 followed by the conclusions in Sect. 5.
2 Preliminaries
In this section, the basic concepts used in the development of the proposed watermarking technique are discussed. These are as follows.
2.1 Piece-Wise Non-Linear Chaotic Map
In this work, a piece-wise non-linear chaotic map is used in order to generate a feature vector. A piece-wise non-linear chaotic map (PWNLCM) F : [0, 1] → [0, 1] can be defined as [11]

y_{k+1} = F(y_k) =
  ( 1/(J_{l+1} − J_l) + b_l )(y_k − J_l) − ( b_l/(J_{l+1} − J_l) )(y_k − J_l)²,  if y_k ∈ [J_l, J_{l+1})
  0,                                                                            if y_k = 0.5
  F(y_k − 0.5),                                                                 if y_k ∈ (0.5, 1]          (1)
where y_k ∈ [0, 1], 0 = J_0 < J_1 < ⋯ < J_{m+1} = 0.5, and b_l ∈ (−1, 0) ∪ (0, 1) is the tuning parameter for the l-th interval, satisfying

∑_{l=0}^{m} (J_{l+1} − J_l) b_l = 0          (2)
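As an illustration, the map can be iterated to produce a chaotic sequence from an initial seed. The following minimal Python sketch assumes the reconstructed form of Eq. (1); the partition J and the tuning parameters b are illustrative values chosen only so that Eq. (2) holds, not values prescribed by the scheme.

```python
import numpy as np

def pwnlcm_sequence(seed, length, J=(0.0, 0.25, 0.5), b=(0.5, -0.5)):
    """Iterate the piece-wise non-linear chaotic map of Eq. (1).
    J partitions [0, 0.5] and b holds the tuning parameters b_l;
    the defaults satisfy the constraint of Eq. (2)."""
    def f(y):
        if y > 0.5:
            return f(y - 0.5)                     # fold (0.5, 1] back onto [0, 0.5]
        if y == 0.5:
            return 0.0
        l = max(i for i in range(len(J) - 1) if J[i] <= y)   # interval containing y
        w = J[l + 1] - J[l]
        return (1.0 / w + b[l]) * (y - J[l]) - (b[l] / w) * (y - J[l]) ** 2

    seq, y = [], seed
    for _ in range(length):
        y = f(y)
        seq.append(y)
    return np.array(seq)

# e.g. pwnlcm_sequence(0.37, 1024) yields a chaotic sequence of length 1024
```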
2.2 Pseudo-Noise (PN) Sequences
3 Proposed Watermarking Scheme
In this section, a robust logo watermarking framework, which explores the characteristics of the piece-wise non-linear chaotic map and the PN sequence, is proposed. Let H be the gray-scale host image of size M × N and W be the binary watermark image of size m × n. The proposed framework can be formalized as follows.
7. Embed the transformed image into the host image to get the watermarked image H_W as follows:

H_W = H + β ∗ F′          (5)
T = (1/256) · [ w_1 ∑_{i=1}^{l_1} C_1(i) + w_2 ∑_{j=1}^{l_2} C_2(j) ] / (w_1 + w_2)          (6)
where w_1 and w_2 are the weights for the negative and positive correlation coefficients, whereas l_1 and l_2 are the lengths of the segments C_1 and C_2 respectively.
6. Construct a binary sequence as follows:

b_ext(i) = 1, if C(i) ≥ T;  0, otherwise          (7)

7. The extracted watermark W_ext is finally formed by stacking the binary sequence (b_ext) into an array.
[Figure panels (iv)–(vi): Gaussian blur (3 × 3), salt and pepper noise (10 %), JPEG compression (70:1)]
where ω and ϖ denote the original and extracted watermark images, while μ_ω and μ_ϖ are their respective means. The principal range of ρ is from −1 to 1, and a value of ρ close to 1 indicates better similarity between the images ω and ϖ. Here, the results are shown visually only for the Elaine image because it attains the maximum PSNR. In contrast, the correlation coefficient values are given for all images and are tabulated in Table 1.
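The formula for ρ is not reproduced in this excerpt; the sketch below uses the standard mean-removed normalized correlation, which is consistent with the description above (means μ_ω and μ_ϖ, range −1 to 1) but should be read as an illustration rather than the authors' exact definition.

```python
import numpy as np

def correlation_coefficient(w_orig, w_ext):
    """Normalized correlation between original and extracted watermarks."""
    a = np.asarray(w_orig, dtype=float).ravel()
    b = np.asarray(w_ext, dtype=float).ravel()
    a -= a.mean()                                  # remove the respective means
    b -= b.mean()
    return float((a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + 1e-12))
```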
The transmission of digital images over insecure networks introduces degradation in image quality and increases the possibility of different kinds of attacks. These attacks generally distort the statistical properties of the image and, as a result, affect the quality of the extracted watermark. Therefore, every watermarking scheme should be robust against these distortions. Hence, robustness is a key factor representing the efficiency of the technique.
The most common distortions in digital imaging are blurring and noise addition. For blurring, a Gaussian blur with filter size 3 × 3 is considered, whereas 10 % salt and pepper noise is added to the watermarked image before watermark extraction. JPEG compression, a lossy encoding scheme used to store data efficiently, is another common distortion in real-life applications. Therefore, the proposed scheme is further investigated for JPEG compression with a compression ratio of 70:1. The performance of the proposed scheme is also evaluated for the most common geometric distortions such as cropping and resizing. There are several ways to crop an image; essentially, it is done by a function which maps a region of the image to zero. For cropping, 50 % of the information in the watermarked image is mapped to zero. In contrast, the effect of resizing is created by reducing the size of the watermarked image to 256 × 256 and then up-sizing it to the original size. The proposed scheme is further examined for general image processing attacks like histogram equalization, sharpening, contrast adjustment and gamma correction. For sharpening and contrast adjustment, the sharpness and contrast of the watermarked image are increased/decreased by 60 %/100 % respectively, whereas for gamma correction the watermarked image is corrected using a gamma of 5. The visual results for all these distortions are depicted in Fig. 4.
5 Conclusion
In this paper, a novel spatial domain watermarking scheme is proposed in which a visually meaningful binary image is embedded in the host image. The core idea is to first obtain a feature vector based on a non-linear chaotic map and then use it to generate a PN sequence. This sequence is used to obtain feature images using circular shifts, followed by the embedding of the watermark using these feature images. In the proposed framework, the feature images are an integral part of the security, since the watermark cannot be extracted without their exact knowledge. Finally, a detailed investigation is performed to validate the efficacy of the proposed scheme.
Acknowledgements This research was supported by the Science and Engineering Research Board,
DST, India.
References
1. Katzenbeisser S. and Petitcolas, F.A.P., Information Hiding Techniques for Steganography and
Digital Watermarking. Artech House, Boston (2002).
2. Cox I. J., Miller M. and Bloom J., Digital Watermarking, Morgan Kaufmann (2002).
3. Mandhani N. and Kak S., Watermarking Using Decimal Sequences, Cryptologia 29(2005) 50–
58.
4. Langelaar G., Setyawan I., and Lagendijk R.L., Watermarking Digital Image and Video Data,
IEEE Signal Processing Magazine, 17 (2009) 20–43.
5. Wenyin Z. and Shih F.Y., Semi-fragile spatial watermarking based on local binary pattern
operators, Opt. Commun., 284 (2011) 3904–3912.
6. Botta M., Cavagnino D. and Pomponiu V., A modular framework for color image watermark-
ing, Signal Processing, 119 (2016) 102–114.
7. Ruanaidh O. and Pun T., Rotation, scale and translation invariant spread spectrum digital image watermarking, Signal Processing, 66 (1998) 303–317.
8. Aslantas V., Ozer S. and Ozturk S., Improving the performance of DCT-based fragile watermarking using intelligent optimization algorithms, Opt. Commun., 282 (2009) 2806–2817.
9. Yu D., Sattar F. and Binkat B., Multiresolution fragile watermarking using complex chirp signals for content authentication, Pattern Recognition, 39 (2006) 935–952.
10. Bhatnagar G., Wu Q.M.J. and Atrey P.K., Robust Logo Watermarking using Biometrics
Inspired key Generation, Expert Systems with Applications, 41 (2014) 4563–4578.
11. Tao S., Ruli W. and Yixun Y., Generating Binary Bernoulli Sequences Based on a Class of
Even-Symmetric Chaotic Maps, IEEE Trans. on Communications, 49 (2001) 620–623.
12. Khojasteh M.J., Shoreh M.H. and Salehi J.A., Circulant Matrix Representation of PN-
sequences with Ideal Autocorrelation Property, Iran Workshop on Communication and Infor-
mation Theory (IWCIT), (2015) 1–5.
Deep Gesture: Static Hand Gesture
Recognition Using CNN
1 Introduction
The human visual system (HVS) is very effective at rapidly recognising a large number of diverse objects in cluttered backgrounds. Computer vision researchers aim to emulate this aspect of the HVS by estimating the saliency [1, 2] of different parts of the visual stimuli in conjunction with machine learning algorithms. Hand gesture recognition is an active area of research in the vision community because of its wide range of applications in areas like sign language recognition, human computer interaction (HCI), virtual reality, and human robot interaction.
Expressive, meaningful body motions involving physical movements of the fin-
gers, hands, arms, head, face, or body [3] are called gestures. Hand gestures are either
static or dynamic. In this work we focus on the recognition of static hand gestures.
There are relatively few hand posture databases available in the literature, and so we
have shown results on three databases [4–6].
2 Prior Work
In the literature, basically two approaches have been proposed for the recognition of hand gestures. The first approach [7] makes use of external hardware such as gloves, magnetic sensors, acoustic trackers, or inertial trackers. These external devices are cumbersome and may make the person using them uncomfortable. Vision-based approaches do not need any physical apparatus for capturing information about human hand gestures and have gained popularity in recent times. However, these methods are challenged when the image background has clutter or is complex.
Skin detection is the initial step to detect and recognise human hand postures [5]. However, different ethnicities have different skin features, which makes skin-based hand detection difficult. In addition, skin models are sensitive to illumination changes. Lastly, if some skin-like regions exist in the background, this also leads to erroneous results. Recently, Pisharady et al. [5] proposed a skin-based method to detect hand areas within images. Furthermore, in [5] an SVM classifier was used to identify the hand gestures.
Various methods have been proposed to detect and recognise hand postures without using skin detection. Based on the object recognition method proposed by Viola and Jones [8], Kölsch and Turk [9] recognise six hand postures. Bretzner et al. [10] use a hierarchical model, consisting of the palm and the fingers, and multi-scale colour features to represent hand shape. Ong and Bowden [11] proposed a boosted classifier tree-based method for hand detection and recognition. View-independent recognition of hand postures is demonstrated in [12]. Kim and Fellner [13] detect fingertip locations in a dark environment. Chang et al. [14] recognise hand postures under a simple dark environment. Triesch and Von der Malsburg [15] address the complex background problem in hand posture recognition using elastic graph matching. A self-organizing neural network is used by Flores et al. [16], in which the network topology determines
hand postures. In this work, we overcome the limitations of methods using hand-crafted features.
Existing methods can recognize hand gestures in uniform or complex backgrounds with clutter only after elaborate pre-processing [4–6, 17]. In this work, we propose a deep learning based method using convolutional neural networks (CNNs) to identify static hand gestures in both uniform and cluttered backgrounds. We show the superiority of our CNN-based framework over approaches using handcrafted features [5, 17], and we do not need elaborate, complex feature extraction algorithms for hand gesture images. We show that the CNN-based framework is much simpler and more accurate on the challenging NUS-II dataset [5] containing clutter in the background. In contrast to the proposed method, earlier works on hand gesture recognition using CNNs, e.g. [18, 19], have not reported results on challenging datasets such as the Marcel dataset [4] and the NUS-II clutter dataset [5].
There are very few publicly available hand gesture datasets with clutter in the background. Therefore, we choose to present the results of our CNN-based framework on [4, 6] and NUS-II [5], which we describe below. The hand gesture database proposed by Jochen Triesch [6] has 10 different classes in uniform light, dark and complex backgrounds. In this paper, we focus our work on the uniform dark background dataset to show the effectiveness of the proposed approach on both uniform as well as complex backgrounds. A snapshot of the dataset [6] is shown in Fig. 1. The dataset has 24 images per class and a total of 10 classes.
The hand gesture database proposed by Marcel [20] has 6 different classes separated into training and testing data. The testing data also has separate data for the uniform and complex cases. There are 4872 training images in total and 277 testing images for the complex background. A snapshot of the Marcel dataset is shown in Fig. 2.
Fig. 1 Images from the Triesch dataset [6] with uniform dark background
Recently, a challenging static hand gesture dataset has been proposed in [5]. The NUS-II [5] dataset has 10 hand postures in complex natural backgrounds with varying hand shapes, sizes and ethnicities. The postures were collected from 40 subjects, comprising both male and female participants in the age group of 22 to 56 years from various ethnicities. The subjects showed each pose 5 times. The NUS-II [5] dataset has 3 subsets. The first subset is a dataset of hand gestures in the presence of clutter, consisting of 2000 hand posture color images (40 subjects, 10 classes, 5 images per class per subject). Each image is of size 160 × 120 pixels with a complex background. Some sample images from the NUS-II dataset with clutter are shown in Fig. 3. In this paper, we focus our work on the subset of the NUS-II dataset with only non-human clutter in the background.
Convolutional neural nets originally proposed by LeCun [21] have been shown to be
accurate and versatile for several challenging real-world machine learning problems
[21, 22]. According to LeCun [21, 23], CNNs can be effectively trained to recognize
objects directly from their images with robustness to scale, shape, angle, noise etc.
This motivates us to use CNNs in our problem since in real-world scenarios hand
gesture data will be affected by such variations.
4.1 Architecture
The general architecture of the proposed CNN is shown in Fig. 4. Apart from the
input and the output layers, it consists of two convolution and two pooling layers.
The input is a 32 × 32 pixels image of hand gesture.
As shown in Fig. 4, the input image of 32 × 32 pixels is convolved with 10 filter
maps of size 5 × 5 to produce 10 output maps of 28 × 28 in layer 1. These output
maps may be operated upon with a linear or non-linear activation function followed
by an optional dropout layer. The output convolutional maps are downsampled with
max-pooling of 2 × 2 regions to yield 10 output maps of 14 × 14 in layer 2. The 10
output maps of layer 2 are convolved with each of the 20 kernels of size 5 × 5 to
obtain 20 maps of size 10 × 10. These maps are further downsampled by a factor of
2 by max-pooling to produce 20 output maps of size 5 × 5 of layer 4.
The output maps from layer 4 are concatenated to form a single vector during training and fed to the next layer. The number of neurons in the final output layer depends upon the number of classes in the image database. The output neurons are fully connected by weights to the previous layer. Akin to the neurons in the convolutional layers, the responses of the output layer neurons are also modulated by a non-linear sigmoid function to produce the resultant score for each class [22].
Fig. 4 Architecture of the proposed CNN model used for hand gesture classification
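A minimal PyTorch sketch of the layer sizes described above is given below. It mirrors the 32 × 32 input, the two 5 × 5 convolution stages with 10 and 20 maps, 2 × 2 max-pooling, dropout after the activations and the sigmoid-modulated output, but it is only an illustration of the architecture, not the authors' MATLAB implementation.

```python
import torch
import torch.nn as nn

class DeepGestureCNN(nn.Module):
    """32x32 input, two conv + max-pool stages, fully connected sigmoid output."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 10, kernel_size=5),   # 32x32 -> 10 maps of 28x28
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.MaxPool2d(2),                   # -> 10 maps of 14x14
            nn.Conv2d(10, 20, kernel_size=5),  # -> 20 maps of 10x10
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.MaxPool2d(2),                   # -> 20 maps of 5x5
        )
        self.classifier = nn.Linear(20 * 5 * 5, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)                # concatenate maps into one vector
        return torch.sigmoid(self.classifier(x))  # sigmoid-modulated class scores

# quick shape check on a dummy 32x32 grayscale image
if __name__ == "__main__":
    scores = DeepGestureCNN(num_classes=10)(torch.randn(1, 1, 32, 32))
    print(scores.shape)  # torch.Size([1, 10])
```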
It has been shown in [22] that data augmentation boosts the performance of CNNs. The dataset of Triesch et al. [6] has 240 images in each of three different conditions: uniform light, dark and complex backgrounds. Therefore, we augmented these images by successively cropping (decreasing the size by 1 pixel along the horizontal and vertical directions) 5 times and then resizing back to the original size, to obtain 1200 images in each category. For the Marcel [4] dataset there are 4872 training images and 277 testing images with complex backgrounds. We combined them to have a total of 5149 images. We did not augment this database since the number of images was substantial. The NUS-II dataset has images with two different backgrounds, i.e. with inanimate objects cluttering the background or with humans in the backdrop. The NUS-II [5] dataset with inanimate objects comprising the cluttered background has 10 hand postures with 200 images per class. We separated the 200 images per class into a set of 120 training images and a set of 80 test images. Then we augmented the training and test data 5 times to obtain 6000 training images and 4000 test images.
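One plausible reading of the cropping-based augmentation is sketched below using Pillow; the crop geometry (here, dropping k pixels from the right and bottom at step k before resizing back) is an assumption, since the text does not specify it.

```python
from PIL import Image

def augment_by_cropping(img, steps=5):
    """Return `steps` extra variants: crop a little more at each step,
    then resize back to the original size."""
    w, h = img.size
    variants = []
    for k in range(1, steps + 1):
        cropped = img.crop((0, 0, w - k, h - k))   # drop k pixels from right/bottom
        variants.append(cropped.resize((w, h)))    # resize back to original size
    return variants

# e.g. augment_by_cropping(Image.open("gesture.png")) returns 5 augmented images
```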
4.4 Dropout
Deep networks have a large number of parameters, which makes them slower to train and difficult to tune. Hence, dropout is used to make the network train faster and to avoid over-fitting by randomly dropping some nodes and their connections [24]. The impact of overfitting on the performance of the network for the Triesch dataset [6] with dark background and the NUS-II dataset [5] with clutter in the background is shown in Table 3. We used a dropout of 50 % in the two dropout layers that follow the ReLu layers present after the convolutional layers in Fig. 4.
5 Experimental Results
We have trained a CNN with images of hand gestures from three standard datasets
[4–6] considering the cases of both plain/uniform and complex backgrounds. The
architecture of the CNN shown in Fig. 4 remains invariant for all our experiments.
The weights of the proposed CNN are trained by the conventional back-propagation
method using the package in [25]. The total number of learnable parameters in the
proposed CNN architecture is 6282.
First, we trained the proposed CNN model on the original datasets of [4–6] without any data augmentation. The details of the training procedure, such as the values of the learning rate α, batch size and number of epochs, are provided in Table 1. The variation of the mean square error with epochs while training the CNN is plotted in Fig. 5a,
Fig. 5 The confusion matrix of un-augmented NUS-II dataset with inanimate clutter in the back-
ground with ReLu and dropout. Since we consider 80 test images per class so the sum of all elements
of a row is 80
Fig. 6 Training phase: Variation of Mean squared error (MSE) for a the CNN trained on the
Triesch dark [6] dataset. b the CNN trained on the Marcel [4] dataset. c the NUS-II [5] dataset
containing clutter without augmentation
b, c for the three cases: the Triesch dataset with dark background [6], the Marcel dataset with complex background [4], and NUS-II [5] with clutter. In Fig. 6, we show image visualizations of the 10 filter kernels obtained after training the proposed CNN model on a sample input image from the NUS-II [5] (with inanimate clutter) dataset. From Fig. 6c we observe that the trained filter kernels automatically extract appropriate features from the input images, emphasizing edges and discontinuities. Note that we tune the values of the parameters for optimal performance in all the experiments conducted on all datasets [4–6].
Table 1 Performance of the proposed CNN model on the Triesch and Von der Malsburg [6], Marcel [4] and NUS [5] datasets with sigmoid activation function, without dropout and augmentation

Data | No. of classes | Training set | Testing set | α | Batch size | Epochs | MSE | Result without ReLu and dropout (%)
Triesch et al. [6] (dark background) | 10 | 200 | 40 | 0.5 | 10 | 2000 | 0.02 | 77.50
Marcel [4] (complex case) | 6 | 3608 | 1541 | 0.2 | 5 | 500 | 0.014 | 85.98
NUS-II [5] (with clutter) | 10 | 1200 | 800 | 0.3 | 5 | 3000 | 0.0217 | 84.75
As shown in Table 1, for the un-augmented dataset of Triesch et al. [6] containing hand gesture images with a dark background, using parameters α = 0.5 and batch size = 10, even after 2000 epochs the accuracy of the proposed CNN model was only 77.50 %.
The efficacy of the CNN for images of hand gestures with complex backgrounds on a large dataset is demonstrated by the accuracy obtained on the test data in [4]. We obtained an accuracy of 85.59 %, with 222 out of 1541 images being misclassified on this challenging dataset. The accuracy obtained is higher than the state-of-the-art result of 76.10 % reported on the Marcel dataset [4]. Similarly good accuracy
art result of 76.10 % reported on the Marcel dataset [4]. Similarly good accuracy
was seen for the NUS-II dataset for the cluttered background with values as high as
84.75 % as shown in Table 1. The performance further improved by using ReLu as
the non-linear activation function and using dropout to avoid overfitting. A dropout
of 50 % was used in two layers following the ReLu layer that follows the convolu-
tional layer in the CNN architecture depicted in Fig. 4. By using ReLu and dropout
on the un-augmented datasets of Triesch [6] and NUS-II with cluttered background
[5] an improved accuracy of 82.5 and 89.1 % respectively was achieved as shown in
Table 2. We observed that the accuracy obtained using ReLu and dropout was higher
than the accuracy obtained using a sigmoidal activation function, without dropout
as reported in Table 1. The confusion matrix for the un-augmented NUS-II dataset
with clutter in the background [5] having 80 images per class for testing is shown
in Fig. 7. For the Triesch dataset [6] the probability that the predicted class of a test
image is among the top 2 predictions made for the image is 93.6 %. The decrease
in training and validation error for the un-augmented NUS-II dataset with clutter in the background, obtained using MatConvNet [26], is shown in Fig. 8. The testing accuracy further improved to 88.5 % by augmenting the Triesch dataset [6] five times, successively cropping by one pixel and resizing back to the original size. The performance on the augmented data is reported in Table 3.
Table 2 Performance of the proposed CNN model on the Triesch et al. [6] and the NUS [5] datasets with ReLu activation function and dropout, without augmentation

Data | No. of classes | Training set | Testing set | α | Batch size | Epochs | MSE | Result with ReLu and dropout (%)
Triesch et al. (dark background) [6] | 10 | 200 | 40 | 9 × 10⁻⁷ | 10 | 500 | 0.005 | 82.5
NUS-II [5] with clutter | 10 | 1200 | 800 | 5 × 10⁻⁶ (up to 500), 1 × 10⁻⁶ (up to 2000) | 10 | 2000 | 0.015 | 89.1
Fig. 8 The variation of training and validation error a for the un-augmented Triesch dataset [6].
b for un-augmented NUS-II [5] dataset with clutter in the background. c for the augmented Triesch
dataset [6]
Table 3 Performance of the proposed CNN model on the Triesch et al. [6] dataset with data augmentation, using dropout and ReLu as the activation function

Data | Classes | Training set | Testing set | α | Batch size | Epochs | MSE | State of the art result | Proposed approach
Triesch and Von der Malsburg [6] (dark background) | 10 | 1000 | 200 | 5 × 10⁻⁶ | 10 | 500 | 0.001 | 95.83 % | 88.5 %
6 Conclusions
The proposed CNN model has been demonstrated to recognize hand gestures with a high degree of accuracy on the challenging, popular datasets [4, 5]. Unlike other state-of-the-art methods, we do not need to either segment the hand in the input image or extract features explicitly. In fact, with the simple architecture of the proposed CNN model, we are able to obtain superior accuracy on the dataset in [4] (complex background) and comparable performance on the NUS-II dataset [5]. We believe that the proposed method can find application in areas such as sign language recognition and HCI. In the future, we plan to extend this work to handle dynamic hand gestures as well.
References
1. L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid scene
analysis,” IEEE Transactions on Pattern Analysis & Machine Intelligence, no. 11, pp. 1254–
1259, 1998.
2. A. Borji and L. Itti, “State-of-the-art in visual attention modeling,” Pattern Analysis and
Machine Intelligence, IEEE Transactions on, vol. 35, no. 1, pp. 185–207, 2013.
3. S. Mitra and T. Acharya, “Gesture recognition: A survey,” Systems, Man, and Cybernetics,
Part C: Applications and Reviews, IEEE Transactions on, vol. 37, no. 3, pp. 311–324, 2007.
4. S. Marcel, “Hand posture recognition in a body-face centered space,” in CHI ’99 Extended
Abstracts on Human Factors in Computing Systems, ser. CHI EA ’99. New York, NY, USA:
ACM, 1999, pp. 302–303. [Online]. Available: http://doi.acm.org/10.1145/632716.632901.
5. P. K. Pisharady, P. Vadakkepat, and A. P. Loh, “Attention based detection and recognition of
hand postures against complex backgrounds,” International Journal of Computer Vision, vol.
101, no. 3, pp. 403–419, 2013.
6. J. Triesch and C. Von Der Malsburg, “Robust classification of hand postures against complex
backgrounds,” in Proc. Int. Conf. on Automatic Face and Gesture Recognition (FG). IEEE, 1996, p. 170.
7. T. S. Huang, Y. Wu, and J. Lin, “3d model-based visual hand tracking,” in Multimedia and
Expo, 2002. ICME’02. Proceedings. 2002 IEEE International Conference on, vol. 1. IEEE,
2002, pp. 905–908.
8. P. Viola and M. Jones, “Robust real-time object detection,” International Journal of Computer
Vision, vol. 4, pp. 51–52, 2001.
9. M. Kölsch and M. Turk, “Robust hand detection.” in FGR, 2004, pp. 614–619.
10. L. Bretzner, I. Laptev, and T. Lindeberg, “Hand gesture recognition using multi-scale colour
features, hierarchical models and particle filtering,” in Automatic Face and Gesture Recogni-
tion, 2002. Proceedings. Fifth IEEE International Conference on. IEEE, 2002, pp. 423–428.
11. E.-J. Ong and R. Bowden, “A boosted classifier tree for hand shape detection,” in Automatic
Face and Gesture Recognition, 2004. Proceedings. Sixth IEEE International Conference on.
IEEE, 2004, pp. 889–894.
12. Y. Wu and T. S. Huang, “View-independent recognition of hand postures,” in Computer Vision
and Pattern Recognition, 2000. Proceedings. IEEE Conference on, vol. 2. IEEE, 2000, pp. 88–
94.
13. H. Kim and D. W. Fellner, “Interaction with hand gesture for a back-projection wall,” in Com-
puter Graphics International, 2004. Proceedings. IEEE, 2004, pp. 395–402.
14. C.-C. Chang, C.-Y. Liu, and W.-K. Tai, “Feature alignment approach for hand posture recogni-
tion based on curvature scale space,” Neurocomputing, vol. 71, no. 10, pp. 1947–1953, 2008.
15. J. Triesch and C. Von Der Malsburg, “A system for person-independent hand posture recogni-
tion against complex backgrounds,” IEEE Transactions on Pattern Analysis & Machine Intel-
ligence, no. 12, pp. 1449–1453, 2001.
16. F. Flórez, J. M. García, J. García, and A. Hernández, “Hand gesture recognition following
the dynamics of a topology-preserving network,” in Automatic Face and Gesture Recognition,
2002. Proceedings. Fifth IEEE International Conference on. IEEE, 2002, pp. 318–323.
17. P. P. Kumar, P. Vadakkepat, and A. P. Loh, “Hand posture and face recognition using a fuzzy-
rough approach,” International Journal of Humanoid Robotics, vol. 7, no. 03, pp. 331–356,
2010.
18. P. Barros, S. Magg, C. Weber, and S. Wermter, “A multichannel convolutional neural network
for hand posture recognition,” in Artificial Neural Networks and Machine Learning–ICANN
2014. Springer, 2014, pp. 403–410.
19. J. Nagi, F. Ducatelle, G. Di Caro, D. Cireşan, U. Meier, A. Giusti, F. Nagi, J. Schmidhuber,
L. M. Gambardella et al., “Max-pooling convolutional neural networks for vision-based hand
gesture recognition,” in Signal and Image Processing Applications (ICSIPA), 2011 IEEE Inter-
national Conference on. IEEE, 2011, pp. 342–347.
20. S. Marcel and O. Bernier, “Hand posture recognition in a body-face centered space,” in
Gesture-Based Communication in Human-Computer Interaction. Springer, 1999, pp. 97–100.
21. Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document
recognition,” in Proceedings of the IEEE, 1998, pp. 2278–2324.
22. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional
neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
23. Y. Lecun, F. J. Huang, and L. Bottou, “Learning methods for generic object recognition with
invariance to pose and lighting,” in CVPR. IEEE Press, 2004.
24. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A
simple way to prevent neural networks from overfitting,” The Journal of Machine Learning
Research, vol. 15, no. 1, pp. 1929–1958, 2014.
25. R. B. Palm, “Prediction as a candidate for learning deep hierarchical models of data,” Master’s
thesis, 2012. [Online]. Available: https://github.com/rasmusbergpalm/DeepLearnToolbox.
26. A. Vedaldi and K. Lenc, “Matconvnet-convolutional neural networks for matlab,” arXiv
preprint arXiv:1412.4564, 2014.
A Redefined Codebook Model for Dynamic
Backgrounds
1 Introduction
Moving object detection is one of the key steps in video surveillance. Objects in motion are detected in the video and are then used for tracking and activity analysis of the scene. It has found use in areas such as security, safety, entertainment and efficiency improvement applications. Typically, motion detection is the primary processing step in activity recognition, which is used in traffic analysis, restricted vehicle movements, vehicle parking slots, multi-object interaction, etc.
Background subtraction methods are widely used techniques for moving object detection. They are based on the difference of the current frame from a background model. A background model is maintained, which is a representation of the background during the training frames. The difference of each incoming frame from this reference model gives the objects in motion in the video sequence.
In static background situations, where we have a fixed background, it is relatively easy to segment moving objects in the scene. Many methods have been developed which are able to successfully extract foreground objects in video sequences. However, many techniques fail in challenging situations such as non-static backgrounds, waving trees, sleeping and walking persons, etc., as illustrated in [1].
In this paper, we propose an improved multi-layered codebook method to deal with dynamic backgrounds. It removes the false positives detected as ghost regions, caused by the area left behind when a moving object that has dissolved into the background starts moving again. We also place a memory limit on the length of each pixel's codeword list to avoid an unbounded number of unnecessary codewords in the model without affecting the detection rate, thus improving the speed.
Section 2 discusses widely followed methods for background subtraction available in the literature. Section 3 gives an introduction to the basic codebook model used in our algorithm. Section 4 elaborates the problem of ghost regions encountered in dynamic backgrounds. The proposed approach is described in Sect. 5 and a detailed algorithmic description is presented in Sect. 6. Section 7 evaluates four object detection algorithms using a similarity measure. Finally, Sect. 8 concludes the paper.
2 Related Work
Many methods have been proposed in the literature [2–6] in which the background is modelled during a training period, background subtraction is performed and the background model is then updated over time. Wren et al. [2] proposed an approach where the background is represented as a single Gaussian. For dynamic backgrounds, the approach was extended by Grimson [3] to represent the background not as a single Gaussian but as a mixture of Gaussians; it also assigned a weight to each Gaussian and learnt them adaptively. Many researchers [4, 5] have also followed the above
approach. However, it has limitations, as it cannot handle shadows completely. It also needs to estimate the parameters of the Gaussians in uncontrolled environments with varying illumination conditions. Apart from parametric representations, non-parametric approaches have also been used by many researchers. The background is modelled using kernel density estimation [6], where a probability density function (pdf) of the background is estimated for detecting background and foreground regions, but this is not appropriate for all dynamic environments. A codebook model was proposed by Kim et al. [7] for modelling the background. The codeword contains the complete information of the background, including the color and intensity information of the pixels.
The method proposed in this paper deals with removing the ghost regions produced in dynamic backgrounds when a non-permanent background object suddenly starts to move. It also improves the memory requirement, as we place a limit on the length of each pixel's codeword model without affecting the detection results.
3 Codebook Model
The codebook model, as illustrated in [7], uses a codebook for each pixel, which is a set of codewords. A codeword is a quantized representation of background values. A pixel is classified as foreground or background based on a color distance metric and brightness bounds. A codebook for a pixel is represented as:
C = {c_1, c_2, …, c_L}

where each codeword c_i consists of an RGB vector v_i and a tuple aux_i:

v_i = (R̄_i, Ḡ_i, B̄_i)
aux_i = ⟨Ǐ_i, Î_i, f_i, λ_i, p_i, q_i⟩

where Ǐ_i, Î_i are the minimum and maximum brightness of all pixels assigned to the codeword during the training period, f_i is the frequency of occurrence of the codeword, λ_i (MNRL, Maximum Negative Run Length) is the maximum interval of non-occurrence of the codeword, and p_i and q_i are the first and last access times of the codeword respectively.
The color distance of a pixel x_t = (R, G, B) to a codeword v_i = (R̄_i, Ḡ_i, B̄_i) is calculated as:

color_dist(x_t, v_i) = sqrt( (R² + G² + B²) − (R̄_i R + Ḡ_i G + B̄_i B)² / (R̄_i² + Ḡ_i² + B̄_i²) )          (1)
A pixel x_t is matched to a codeword c_m if
∙ color_dist(x_t, c_m) ≤ ε
where ε is the detection threshold. Also, for layered modelling, an addition is made to the codebook model in the form of a cache codebook H. It contains the set of foreground codewords that might become background, thus supporting background changes.
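As an illustration of the matching test, the sketch below implements the color distance of Eq. (1) and a simple background check against a pixel's codewords; the threshold value is arbitrary and the brightness bounds used in the full model are omitted.

```python
import numpy as np

def color_dist(pixel, codeword_rgb):
    """Color distance of Eq. (1): distance of the pixel from its projection
    onto the codeword's RGB direction."""
    x = np.asarray(pixel, dtype=float)
    v = np.asarray(codeword_rgb, dtype=float)
    p2 = np.dot(x, v) ** 2 / np.dot(v, v)          # squared projection length
    return np.sqrt(max(np.dot(x, x) - p2, 0.0))

def is_background(pixel, codewords, eps=10.0):
    """A pixel matches the background if its color distance to some codeword
    is within the detection threshold eps (brightness bounds omitted here)."""
    return any(color_dist(pixel, cw) <= eps for cw in codewords)
```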
4 Ghost Regions in Dynamic Backgrounds
Dynamic backgrounds are a major challenge for motion detection. Many background subtraction algorithms, as discussed in Sect. 2, perform well for static backgrounds but lead to false detections for changing backgrounds. Dynamic backgrounds may occur when moving objects that were not part of the background model come to a halt and become part of the background itself. In motion detection algorithms, such objects should then be treated as background and be included in the background model.
For such dynamic backgrounds, the codebook model proposes a layered modelling and detection approach in [7–9]. Apart from the original codebook M for each pixel, it introduces another cache codebook H. H holds the codewords for a pixel which may become part of the background. When a codeword is not matched to the background model M, it is added to the cache codebook H. Then, for the incoming pixels of the following frames, the codewords are updated as illustrated in [7]. Codewords which stay in H for a long time are added to the codebook M, since they correspond to a moving object which has come to a stop and is now part of the background.
According to the algorithm in [7], when an object such as a car is parked in an area, it is dissolved into the background. Later, when the car leaves the parking area, a ghost region is left behind for some time frames till the model learns the background. This is because, according to the codebook model in [7], the background codewords
not accessed for a long time are deleted routinely from the set of codewords of a pixel in M. This also deletes the codewords belonging to the actual background, which were not accessed during the time the car was parked.
For example, suppose that initially, after the training period, a pixel contains on average 3 codewords:

M = {c_1, c_2, c_3}

When a car is parked, it becomes part of the background and the codewords for the car are added to the codebook M for that pixel. When the background values not accessed for a long time are deleted from the background model M, we are left with only the car codewords:

M = {c_f1, c_f2}

Just when the car leaves, the codewords belonging to the actual background will not be present in the background model, and the uncovered area will be classified as foreground. These codewords will be part of the cache H for some learning time frames, and for this duration a ghost region will be present in the detection result. This is illustrated in Fig. 1.
5 Proposed Approach
The approach proposed in this paper focuses on removing the ghost region left behind due to a non-static background. It eliminates the background learning time during which false positives are detected, by retaining the background codewords in the codebook M of each pixel. As a result, as soon as the object leaves the area, the uncovered pixels are immediately classified as background.
We redefine the codeword tuple MNRL (Maximum Negative Run Length), which was the maximum amount of time for which a codeword has not recurred, as the LNRL (Latest Negative Run Length), which is the latest amount of time for which a codeword has not been accessed. The LNRL is obtained as the difference between the time a codeword was last accessed and the current time.
Removing the MNRL does not affect the temporal filtering step [7] or the layered modelling step [7], since we can use the frequency component of the codewords alone for
filtering without the need of MNRL.
In the temporal filtering step [7], the MNRL is used to remove, from the background model built during training, the set of codewords belonging to the foreground, defined as those with MNRL greater than some threshold, usually taken as N/2, where N is the number of training samples. Instead, we use the frequency component of a codeword c_m to perform the same thresholding:

f_{c_m} < N/2          (3)

This removes the codewords of foreground objects and only keeps the codewords staying for at least half the training time frames.
Additionally, since the MNRL gives only the maximum duration of non-occurrence, it alone gives no information about the most recent access of the codeword, which is provided by the LNRL; hence the LNRL is used instead of the MNRL. For example, two different codewords C_1 and C_2 representing different objects can have the same frequencies, f_{c_1} = f_{c_2}, and the same MNRL, Ψ_{c_1} = Ψ_{c_2}, irrespective of the fact that codeword C_2 has been accessed recently for a considerable duration of time. Thus, this codeword information alone cannot tell which of the two objects was accessed more recently. However, the LNRL defined in this paper records the latest information about active and inactive codewords in the codebook.
For the multi-layered approach, we need to set a memory limit on the length of the codebook model M. It has been observed that the number of codewords for a pixel is on average 3 to 5 [7]. Thus, we may set the maximum limit to max = 5 to 7, which is enough to support multiple layers of background, where the range also accommodates an average number of non-permanent background codewords. The complete approach is illustrated in Algorithm 1.
6 Algorithm Description
After the training using N sample frames, as illustrated in [7], the LNRL (Ψ) is calculated as:

Ψ_{c_m} = N − q_{c_m}          (4)

where N is the total number of training frames and q_{c_m} is the last access time of codeword c_m. For background codewords in M, LNRL (Ψ) is close to 0.
Considering the scenario of a dynamic background, initially the codebook model after training, M, for a pixel x is:

M_x = {c_1, c_2, c_3}
Fig. 2 Updation of codebooks M and H in improved multi-layered codebook model for ghost
elimination
Now, when a car gets parked, its codewords c_f1 and c_f2 are added to the background model:

M_x = {c_1, c_2, c_3, c_f1, c_f2}

When the car leaves the parking area, the set of background codewords {c_1, c_2, c_3} is still part of the background model M, and the area left behind is recognised as background. For multi-layer scenarios, when a car leaves the parking area and another car occupies it and becomes background, we need the first car's codewords to be removed from the model and the second car's codewords (c_f1′ and c_f2′ in our case) to be included in the model. This is done by filtering the codebook M using the LNRL (Ψ). After the first car leaves and before the second car arrives, the actual background is visible for some time frames. This sets the LNRL (Ψ) of the background codewords to 0.
As mentioned in step II of Algorithm 1, the codeword with the maximum LNRL (Ψ) is deleted from the model. Since the background codewords now have LNRL (Ψ) close to 0, the first car's codewords are deleted from the model, as they now have the maximum LNRL (Ψ). Thus we have:

M_x = {c_1, c_2, c_3, c_f1′, c_f2′}

This also removes objects which are non-permanent and currently not part of the background. The complete scenario is illustrated in Fig. 2.
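The LNRL-based maintenance of a pixel's codebook described above can be sketched as follows; the dictionary field name q (last access time) and the update interface are hypothetical, intended only to show the eviction rule.

```python
def update_codebook(codebook, matched_index, current_time, max_len=7):
    """Refresh the matched codeword's last access time and, if the memory
    limit is exceeded, evict the codeword with the largest LNRL."""
    codebook[matched_index]["q"] = current_time              # refresh last access time
    if len(codebook) > max_len:
        lnrl = [current_time - cw["q"] for cw in codebook]   # latest negative run length
        codebook.pop(lnrl.index(max(lnrl)))                  # evict the stalest codeword
    return codebook
```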
7 Results
Fig. 3 Results of object detection. a Original image without moving object. b Image with a moving
car. c Result of GMM. d ViBe. e Codebook. f Proposed approach
Qualitative Evaluation:
Figure 3 shows results using a frame sequence in which a car moves out of a parking area and uncovers an empty background region. Figure 3c–e (GMM, ViBe and codebook) show that the state-of-the-art methods create a ghost region where the car has left, while the proposed approach, as shown in Fig. 3f, adapts to changes in the background and successfully detects the uncovered background region, providing a better result in comparison to the other methods.
Quantitative Evaluation:
We used the similarity measure proposed in [12] for evaluating the results of foreground segmentation. Let A be a detected region and B be the corresponding ground truth; then the similarity measure between the regions A and B is defined as:

S(A, B) = (A ∩ B) / (A ∪ B)          (5)

where S(A, B) has a maximum value of 1.0 if A and B are the same, and is 0 when the two regions are least similar.
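Equation (5) is the intersection-over-union of the detected and ground-truth foreground masks; a one-function sketch:

```python
import numpy as np

def similarity(detected, ground_truth):
    """S(A, B) of Eq. (5) for two boolean foreground masks."""
    a = np.asarray(detected).astype(bool)
    b = np.asarray(ground_truth).astype(bool)
    union = np.logical_or(a, b).sum()
    # empty masks are treated as identical
    return np.logical_and(a, b).sum() / union if union else 1.0
```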
Table 1 shows the similarity measures of the above three methods and the proposed method on three dynamic background video sets from the CDnet dataset [13], namely parking, abandonedBox and sofa, along with their number of testing frames. By comparing the similarity values in Table 1, it can be seen that the proposed method recognises the background region more accurately than the other three methods for dynamic backgrounds, as its similarity measure is higher than the other three, with values varying from 0.529 to 0.734.
8 Conclusion
This paper proposed an approach that deals with regions detected falsely as foreground in dynamic backgrounds, when a moving object that has dissolved into the background starts to move again. The ghost region left due to false detections in the basic codebook model is eliminated by retaining the actual permanent background codewords. The results are also compared with other motion detection approaches, and our approach removes the false positives and shows improved results. Quantitative evaluation and
comparison with 3 existing methods have shown that the proposed method provides improved performance in detecting background regions in dynamic environments, with the similarity measure ranging from 0.529 to 0.734. It also reduces the memory requirement significantly by keeping a memory limit of 5 to 7 codewords, which deletes the non-permanent background codewords without affecting the results.
References
Reassigned Time Frequency Distribution Based Face Recognition

Abstract In this work, we have designed a local descriptor based on the reassigned Stankovic time frequency distribution. The Stankovic distribution is one of the improved extensions of the well known Wigner-Ville distribution. The reassignment of the Stankovic distribution allows us to obtain a more resolute distribution and hence describe the region of interest in a better manner. The suitability of the Stankovic distribution for describing regions of interest is studied by considering the face recognition problem. For a given face image, we obtain key points using a box filter response scale space, and scale dependent regions around these key points are represented using the reassigned Stankovic time frequency distribution. Our experiments on the ORL, UMIST and YALE-B face image datasets have shown the suitability of the proposed descriptor for the face recognition problem.
1 Introduction
Recently, Zheng et al. [4] used the Discrete Wavelet Transform for face recognition. They fused the transform coefficients of the three detail sub-bands and combined them with the approximation sub-band coefficients to develop their descriptor. Ramasubramanian et al. [8] used the Discrete Cosine Transform for face recognition. In their work, they proposed to retain only those coefficients of the transform which contribute significantly to face recognition. Using principal component analysis on these retained coefficients, they computed "cosine faces", similar to eigenfaces, to form their descriptor.
On the other hand, there has been excellent growth in the signal processing domain, where researchers have come out with improved versions of the basic transforms. Recently, Auger and Flandrin [2] extended the concept of time frequency reassignment. Using the reassignment method, they proved that it is possible to obtain improved time frequency distributions from their less resolute counterparts. They derived expressions for the time frequency reassignment of the (smoothed/pseudo) Wigner-Ville distribution, the Margenau-Hill distribution, etc. These reassigned time frequency distributions possess improvements over their original distributions. Flandrin et al. [7] came up with the reassigned Gabor spectrogram. Along similar lines, Stankovic and Djurovic [3] derived a reassigned version of the Stankovic time frequency distribution. These improved distributions and their ability to capture a more accurate frequency distribution motivated us to investigate their suitability for the face recognition problem.
The principle of time frequency reassignment may be explained using the relation between the well known Wigner-Ville time frequency distribution (WVD) and the spectrogram (Fig. 1). The WVD of a signal s at time t and frequency f is defined as [2]:

WVD_s(t, f) = ∫_{−∞}^{+∞} s(t + τ/2) s*(t − τ/2) e^{−i2πfτ} dτ          (1)
Here s* stands for the conjugate of s and i is the imaginary unit. The spectrogram of a signal s at time t and frequency f is defined as:

Spec_s(t, f) = |STFT_s(t, f)|²          (2)

where

STFT_s(t, f) = ∫_{−∞}^{+∞} w(τ − t) s(τ) e^{−i2πfτ} dτ          (3)
Fig. 1 The time frequency reassignment procedure being executed on the Wigner-Ville distribution of a typical signal
is the Short Term Fourier Transform (STFT) of the signal s(t), with w(t) being the window function. Given two signals s(t) and w(t), according to the unitarity property of the WVD [14], they are related by

|∫_{−∞}^{+∞} s(t) w(t) dt|² = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} WVD_s(t, f) WVD_w(t, f) dt df          (4)

where WVD_s and WVD_w are the WVD functions of s(t) and w(t) respectively. Suppose the function w(t) is shifted in time by τ and in frequency by ν as follows:

w_{τ,ν}(t) = w(t − τ) e^{i2πνt}          (5)

Then the above relation becomes

Spec_s(t, f) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} WVD_w(τ − t, ν − f) WVD_s(τ, ν) dτ dν          (6)

The transformation of Eq. 4 into Eq. 6 may be proved using the time and frequency covariance properties of the WVD [11]. Equation 6 is the well known Cohen's class representation of the spectrogram of a signal s. The spectrogram in Eq. 6
may now be viewed as the smoothing of the WVD of the signal (WVDs ), by the WVD
of the spectrogram smoothing function w(t) (i.e. WVDw ).
The method of improving the resolution of the spectrogram by time frequency reassignment may be found by analyzing Eq. 6. WVD_s localizes the frequency components of the given signal very well, but its time frequency distribution contains many cross-interference terms (between the frequency components present in the signal, as seen in Fig. 2d). Though the smoothing operation in Eq. 6 reduces these interference terms, the sharp localization property of WVD_s is lost (the frequency components plotted in WVD_s get blurred or spread out) and hence the spectrogram is of poor resolution. The reason for this may be explained as follows.
It can be observed from Eq. 6 that the window function WVD_w runs all over the time frequency distribution WVD_s during convolution. Convolution at a particular location (t, f) can be seen in Fig. 1. The rectangular region is the WVD_s of the signal s. The yellow coloured region shows a region of the WVD_s distribution with significant values (indicating the strong presence of signals corresponding to the frequencies and times of that region). Other such regions are not shown for simplicity. The oval shaped region shown is the window function WVD_w in action at (t, f). The weighted sum of the distribution values of WVD_s (weighted by the WVD_w values) within the boundary of the window function is assigned to the location (t, f), which here is the center of the window. WVD_s, which had no value at (t, f) (implying the absence of a signal corresponding to (t, f)) before convolution, gets a value (wrongly implying the presence of a signal corresponding to (t, f)), which is an incorrect assignment of distribution values. All such incorrect assignments lead to a spread out, low resolution spectrogram. One possible correction is a more sensible method called time frequency reassignment. That is, during reassignment, the result of the above convolution is assigned to a newly located position (t′, f′), which is the center of gravity
Fig. 2 TFD of a synthetic signal made of 4 freq components, shown as 4 bright spots in a. A low
resolution Gabor time frequency distribution. b. Reassigned Gabor distribution. c. The reassigned
Stankovic time frequency distribution. d. The WVD. e. The pseudo WVD f. The Stankovic time
frequency distribution
of the region of the distribution within the window function. Reassigning the convolution value to the center of gravity gives us a more correct representation of the actual time frequency distribution. Thus, the reassigned spectrogram will be more accurate than the original. The equations that give the offsets of the centre of gravity (t′, f′) from (t, f) are given by [2]

CGt_off(t, f) = [ ∫∫_{−∞}^{+∞} τ WVD_w(τ − t, ν − f) WVD_s(τ, ν) dτ dν ] / [ ∫∫_{−∞}^{+∞} WVD_w(τ − t, ν − f) WVD_s(τ, ν) dτ dν ]          (8)

and

CGf_off(t, f) = [ ∫∫_{−∞}^{+∞} ν WVD_w(τ − t, ν − f) WVD_s(τ, ν) dτ dν ] / [ ∫∫_{−∞}^{+∞} WVD_w(τ − t, ν − f) WVD_s(τ, ν) dτ dν ]          (9)
where CGt_off and CGf_off are the offsets along the t-axis and the f-axis of the time frequency distribution, as shown in Fig. 1. The Eq. 8 may also be expressed in terms of the Rihaczek distribution [2] as

CGt_off(t, f) = [ ∫∫_{−∞}^{+∞} τ Ri*_w(τ − t, ν − f) Ri_s(τ, ν) dτ dν ] / [ ∫∫_{−∞}^{+∞} Ri*_w(τ − t, ν − f) Ri_s(τ, ν) dτ dν ]          (10)
where

Ri_s(τ, ν) = s(τ) S(ν) e^{−i2πντ},   Ri_w(τ, ν) = w(τ) W(ν) e^{−i2πντ}          (11)

and S(ν), W(ν) are the Fourier transforms of the signal and window functions respectively.
Expressing Eq. 8 in terms of the Rihaczek distribution, expanding it using Eq. 11 and rearranging the integrals of the resulting equation gives an expression for CGt_off in terms of STFTs,

where the superscript τw implies the computation of the STFT with a window τw, which is the product of the traditional window function w(t) (used in the STFT) with τ. Similarly, the equation for CGf_off may be derived in terms of STFTs,
where the superscript dw implies the computation of STFT with a window dw,
which is the first derivative of the traditional window function used in STFT. Hence
the reassigned location of the convolution in Eq. 6 is given by 𝐭 ′ = 𝐭 − 𝐂𝐆𝐭𝐨𝐟𝐟 and
𝐟 ′ = 𝐟 + 𝐂𝐆𝐟𝐨𝐟𝐟 . Thus the result of convolution within the smoothing window is
assigned to (𝐭 ′ , 𝐟 ′ ) instead of (𝐭, 𝐟 ). This computation will happen at all pixels (𝐭, 𝐟 )
of the time frequency distribution WVDs in Fig. 1, to get a more resolute and true,
reassigned time frequency distribution.
It is better to arrive at the equation for the Stankovic distribution starting with the pseudo Wigner-Ville distribution [12], which is given by

pWVD(t, ω) = (1/2π) ∫_{−∞}^{+∞} w(τ/2) w(−τ/2) s(t + τ/2) s(t − τ/2) e^{−iωτ} dτ          (14)
We can see from the above equation that a narrow window function w(t) limits the auto-correlation of the signal s and hence performs a smoothing of the WVD in the frequency domain: the white interference strip present in Fig. 2d (WVD) is missing in Fig. 2e (pseudo WVD), i.e. it has improved over the WVD due to this smoothing. Equation 14 may also be expressed as

pWVD(t, ω) = (1/2π) ∫_{−∞}^{+∞} STFT_s(t, ω + ν/2) STFT_s(t, ω − ν/2) dν          (15)
where B performs a smoothing operation in the time domain: the white interference strip present in Fig. 2e (pseudo WVD) is missing in Fig. 2f (Stankovic distribution), i.e. it has improved over the pseudo WVD due to this smoothing. The Cohen's class representation of the Stankovic distribution shown above may be written, using Eqs. 16 and 17, as [2, 14]

S(t, ω) = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} b(t) WVD_w(ν − f) WVD_s(t, f) dt df          (18)
where b is the Fourier inverse of the window function B. We can view this equation (like Eq. 6) as a convolution of the WVD of the signal s (WVD_s) with the WVD of a window function w (WVD_w). The reassigned Stankovic TFD is then obtained by applying the reassignment of Eqs. 19 and 20, as derived in [3].
The reassigned Stankovic TFD based descriptor is computed as follows. Using the difference of box filter scale space [13], scale normalized interest points are detected in the face images, and scale dependent square regions around the interest points are represented using the reassigned Stankovic distribution as follows. Each region is scaled down to size 24 × 24 and further divided into 16 subregions of size 6 × 6. The Stankovic distribution of each of these subregions is computed using Eq. 18 and reassigned using Eqs. 19 and 20 (the size of this distribution will be 36 × 36 and, due to symmetry, we may neglect the distribution corresponding to the negative frequencies). The reassigned distribution of each subregion is converted into a row vector. Hence, for each square region around an interest point, we get 16 row vectors, which are stacked one above the other to form a 2D matrix. Applying PCA, we reduce the dimension of this 2D matrix to 16 × 160 to obtain the descriptor of each 24 × 24 square region around the interest point.
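To illustrate the structure of the descriptor (not the authors' implementation), the sketch below splits a 24 × 24 region into sixteen 6 × 6 subregions, describes each with a time frequency distribution, stacks the flattened distributions and applies PCA. The stankovic_tfd placeholder uses a zero-padded 2-D FFT magnitude only to keep the sketch runnable, and training the PCA basis over the pooled rows of all regions (at least ten regions are needed for 160 components) is an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

def stankovic_tfd(subregion):
    """Placeholder TFD of a 6x6 subregion: zero-padded 2-D FFT magnitude,
    keeping only the non-negative frequencies (36 x 18 values)."""
    spec = np.abs(np.fft.fft2(subregion, s=(36, 36)))
    return spec[:, :18]

def region_matrix(patch24):
    """16 x 648 matrix: one flattened subregion distribution per row
    (patch24 is a 24 x 24 NumPy array)."""
    rows = []
    for i in range(0, 24, 6):
        for j in range(0, 24, 6):
            rows.append(stankovic_tfd(patch24[i:i + 6, j:j + 6]).ravel())
    return np.vstack(rows)

def build_descriptors(patches, n_components=160):
    """Fit a PCA basis over the pooled subregion rows of all patches and
    project each patch's 16 rows onto it, giving a 16 x 160 descriptor."""
    pooled = np.vstack([region_matrix(p) for p in patches])
    pca = PCA(n_components=n_components).fit(pooled)
    return [pca.transform(region_matrix(p)) for p in patches]
```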
2.4 Classification
In our experiments on the face datasets that have pose variation (ORL and UMIST), we form interest point based descriptors as explained in Sect. 2.3. The gallery image descriptors are put together to form a descriptor pool. Each test face descriptor votes for the gallery subject whose descriptor matches the test face
descriptor the most. Once all the test face descriptors have completed their voting, the gallery subject which earned the most votes is declared as the subject to which the test face image belongs. We use the Chi-square distance metric for descriptor comparison.
For experiments on face datasets without pose variation (YALE), we form fixed position block based descriptors. The size of these blocks is also 24 × 24, and the descriptor for each block is formed as explained in Sect. 2.3. For classification we use fixed position block matching and the Chi-square distance for descriptor matching.
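A short sketch of this voting scheme with the Chi-square distance is given below; the gallery layout is hypothetical, and the Chi-square distance is normally defined for non-negative, histogram-like descriptors, so the small epsilon only guards the division.

```python
import numpy as np

def chi_square(d1, d2, eps=1e-10):
    """Chi-square distance between two (flattened) descriptors."""
    a, b = np.ravel(d1), np.ravel(d2)
    return 0.5 * float(np.sum((a - b) ** 2 / (a + b + eps)))

def classify(test_descriptors, gallery):
    """Each test descriptor votes for the subject owning the closest gallery
    descriptor; the subject with the most votes wins.
    `gallery` is a list of (subject_id, descriptor) pairs."""
    votes = {}
    for td in test_descriptors:
        subject = min(gallery, key=lambda g: chi_square(td, g[1]))[0]
        votes[subject] = votes.get(subject, 0) + 1
    return max(votes, key=votes.get)
```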
3 Experimental Results
Fig. 3 The CMC curves of our method. The plot on the left shows the results on the UMIST dataset (with red, blue, green curves standing for 3, 4, and 5 images per subject for training) and the plot on the right shows results on the YALE dataset (with red, blue, green curves standing for 8:56, 16:48, and 32:32 training to testing image ratios)
Reassigned Time Frequency Distribution Based Face Recognition 483
Fig. 4 The ROC curves (a), (b), (c) of our method on the UMIST dataset and d, e on the YALE
dataset. (Each of the plot a, b, c contain 2 curves and ROC curves d, e contain 3 curves according
to the ratios in color text)
Fig. 5 The ROC curves (a), (b), (c) and CMC curve (d) of our method on the ORL dataset. (Each
of the plot a, b, c contain 3 curves and d contains 3 curves according to the ratios in color text)
text/curves in Fig. 3). The results are also shown in Table 1 and Fig. 3. Also we have
plotted the ROC curves taking client to imposter subject ratio of 10:10 and 15:5 (as
mentioned in color text/curves in Fig. 4). In each of these cases, we have varied the
training samples per subject as 3, 4 and 5 (see Fig. 4).
Table 1 Recognition rate (RR) of our method on various datasets (T:T stands for train:test ratio)

Dataset | T:T | RR (%)
ORL | 3:7 | 96.8
ORL | 4:6 | 98.75
ORL | 5:5 | 99
UMIST | 3:rest | 94.25
UMIST | 4:rest | 98.76
UMIST | 5:rest | 99.6
YALE | 8:56 | 93.74
YALE | 16:48 | 98.67
YALE | 32:32 | 99.5
Table 2 Comparative analysis of our method with state of the art methods (T:T stands for train:test ratio)

Dataset (T:T) | Method | Recognition rate (%)
ORL (3:7) | ThBPSO [5] | 95
ORL (3:7) | SLGS [1] | 78
ORL (3:7) | KLDA [9] | 92.92
ORL (3:7) | 2D-NNRW [6] | 91.90
ORL (3:7) | Our method | 96.8
UMIST (4:rest all) | ThBPSO [5] | 95.1
UMIST (4:rest all) | Our method | 98.76
YALE (32:32) | GRRC [10] | 99.1
YALE (32:32) | Our method | 99.5
4 Conclusion
References
1 Introduction
Nowadays, in medical imaging applications, high spatial and spectral information
from a single image is required for monitoring and diagnosis during the treatment process.
This information can be obtained by multimodal image registration. Different
imaging modalities give complementary information about the tissues and
organs of the human body. Depending on the application, the imaging techniques
mostly used are CT, MRI, fMRI, SPECT, and PET. A computed tomography (CT)
image reveals bone injuries, whereas MRI delineates the soft tissues of organs
such as the brain and lungs. CT and MRI provide high-resolution images with biological
information, while functional imaging techniques such as PET, SPECT, and fMRI
give low spatial resolution with basic information. Obtaining complete and detailed
information from a single modality is a challenging task, which necessitates registration
to combine multimodal images [1]. The registered image is more suitable
for radiologists for further image analysis tasks.
Image registration has several applications, such as remote sensing and machine
vision. Several researchers have discussed and proposed different registration
techniques in the literature [2]. Image registration techniques can be divided into
intensity-based and feature-based techniques [3]. Intensity-based techniques operate
on pixel values, whereas feature-based techniques consider features
such as lines, points, and textures.
The steps of the registration technique are as follows:
∙ Feature detection: Here, the salient or distinctive objects such as edges, contours,
corners are detected automatically. For further processing, these features can be
represented by their point representatives i.e. centers of gravity, line endings, dis-
tinctive points, which are called control points (CPs).
∙ Feature matching: In this step, the correspondence between the features detected
in the floating image and those detected in the reference image is established.
Different similarity measures along with spatial relationships among the features
are used for matching.
∙ Transform model estimation: The mapping function parameters of the floating
image with respect to the reference image are estimated. These parameters are
computed by means of the established feature correspondence.
∙ Image resampling and transformation: Finally, the floating image is transformed
by means of the mapping functions. The non-integer coordinates of the images are
computed by interpolation technique.
Wavelet transform (WT) based low-level features provide a unique representation
of the image and are highly suitable for characterizing image textures [4].
As the WT does not inherently support directionality and anisotropy, these limitations
were overcome by a new theory, Multi-scale Geometric Analysis (MGA). Different
MGA tools, such as the Ridgelet, Curvelet, Bandlet and Contourlet transforms, were
proposed for high-dimensional signals [5–10]. The principles and methods of fusion
are described in [11]. Manu et al. proposed a new statistical fusion rule based on the
Weighted Average Merging Method (WAMM) in the Non-Subsampled Contourlet
Transform (NSCT) domain and compared it with the wavelet domain [12].
Alam et al. proposed an entropy-based image registration method using the curvelet
transform [13].
In this paper, we propose a new registration technique for multimodal medical
images with the help of the ripplet transform. The ripplet transform (RT) is
detailed in Sect. 2. In Sect. 3, the proposed method is described. Performance evaluation
is given in Sect. 4. Experimental results are discussed in Sect. 5, with the conclusion
in Sect. 6.
2 Ripplet Transform
where $R(a, \vec{b}, \theta)$ are the ripplet coefficients. When the ripplet function intersects with
curves in the image, the corresponding coefficients have large magnitude, and the
coefficients decay rapidly along the direction of singularity as $a \to 0$.
For digital image processing, the discrete transform is used more often than the continuous
transform; hence, a discretization of the ripplet transform is defined. The discrete
ripplet transform of an $M \times N$ image $f(n_1, n_2)$ is of the form

$$R_{j\vec{k}l} = \sum_{n_1=0}^{M-1} \sum_{n_2=0}^{N-1} f(n_1, n_2)\, \rho_{j\vec{k}l}(n_1, n_2) \quad (2)$$
where $R_{j\vec{k}l}$ are the ripplet coefficients. The image can be recovered by the inverse
discrete ripplet transform:

$$\tilde{f}(n_1, n_2) = \sum_{j} \sum_{\vec{k}} \sum_{l} R_{j\vec{k}l}\, \rho_{j\vec{k}l}(n_1, n_2) \quad (3)$$
3 Proposed Methodology
During the acquisition of brain images, changes in brain shape and position with
respect to the skull occur over the short periods between scans. These images require
correction for the small amount of subject motion during the imaging procedure. Some
anatomical structures appear with more contrast in one image than in another, and
having these structures in various modalities provides more information about them.
Brain MR images are more sensitive to contrast changes. Processing the local neighboring
coefficients of wavelet-like transforms of an image gives better results than processing
the coefficients of the entire sub-band; Alam et al. used the local neighboring
coefficients of the transform in the approximate sub-band [13]. In this paper, the
cost function is computed based on the probability distribution function (PDF) of the
ripplet transform. The distortion parameters of the mapping function are obtained
by minimizing this cost function. The cost function is derived from the conditional
entropy between the neighboring ripplet transforms of the reference image and the floating
image. The joint PDF of the local neighboring ripplet coefficients in the approximate
band of the reference and floating images is modeled as a bivariate Gaussian
PDF. The conditional entropies can be calculated as the difference between the joint and
marginal entropies. At the minimum conditional entropy, the floating image is geometrically
aligned to the reference. The cost function for minimization of the conditional
entropies can be expressed as a weighted combination of the two conditional entropies,
where $\xi_r$ and $\xi_f$ are the random variables with conditional dependencies and $\alpha$ is the
weight parameter with $0 < \alpha < 1$.
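As a rough illustration of this idea, the sketch below computes a weighted sum of the two conditional entropies under the bivariate-Gaussian assumption stated above; the specific weighted form is an assumption here, since the paper's exact cost-function expression is not reproduced in this excerpt:

```python
import numpy as np

def conditional_entropy_gaussian(x, y):
    """H(x|y) in nats under a bivariate Gaussian model of the joint PDF.
    x, y are 1-D arrays of corresponding approximate-band coefficients."""
    var_x = np.var(x)
    rho = np.corrcoef(x, y)[0, 1]
    return 0.5 * np.log(2 * np.pi * np.e * var_x * (1.0 - rho ** 2))

def registration_cost(coef_ref, coef_float, alpha=0.5):
    """Assumed weighted combination of the two conditional entropies between
    the neighboring ripplet coefficients of the reference and floating images."""
    h_rf = conditional_entropy_gaussian(coef_ref, coef_float)
    h_fr = conditional_entropy_gaussian(coef_float, coef_ref)
    return alpha * h_rf + (1.0 - alpha) * h_fr
```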
4 Performance Evaluation
For statistical analysis of the proposed scheme, several performance measures have
been carried out in the simulation.
The Root Mean Square Error (RMSE) between the registered image and the
reference image represents the error in mean intensity between the two images. It can be
stated as

$$\mathrm{RMSE} = \sqrt{\frac{1}{m \times n} \sum_{x} \sum_{y} \left[ I_{ref}(x, y) - I_{reg}(x, y) \right]^2} \quad (6)$$
The standard deviation measures the variation of the registered image; an image with high
dissimilarity yields a high STD value. It can be defined as

$$\mathrm{STD} = \sqrt{\frac{1}{mn} \sum_{x,y} \left( I_{reg}(x, y) - \mathrm{mean} \right)^2} \quad (7)$$
Mutual information (MI) measures the amount of dependency between the two images; a higher
MI value indicates a better registration.
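A minimal numpy sketch of two of these measures (RMSE of Eq. 6 and a histogram-based MI estimate) is given below; the bin count for MI is an arbitrary choice:

```python
import numpy as np

def rmse(i_ref, i_reg):
    """Root mean square error between reference and registered images (Eq. 6)."""
    return np.sqrt(np.mean((i_ref.astype(float) - i_reg.astype(float)) ** 2))

def mutual_information(i_ref, i_reg, bins=64):
    """Histogram-based mutual information estimate between two images (in nats)."""
    hist, _, _ = np.histogram2d(i_ref.ravel(), i_reg.ravel(), bins=bins)
    pxy = hist / hist.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0                                   # avoid log(0)
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz]))
```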
5 Experimental Results
As CT and MRI provide high-resolution images with biological information, this
combination of brain images is considered for the evaluation of the proposed technique. For
parametric analysis, several sets of images have been taken; among them, 6 sets of CT
and MRI images are shown in this paper. The existing and proposed techniques were
evaluated on these images. The proposed technique is implemented using MATLAB
version 13.
Four sets of images with the corresponding registered images are presented in Fig. 1.

Fig. 1 1st row: floating images, 2nd row: reference images, 3rd row: registered images using the ripplet transform

The performance values of the registration process using the ripplet transform and the curvelet
transform are tabulated in Table 1. The obtained performance values (STD, MI, RMSE, and
PSNR) of the ripplet-based registration technique are compared with those of the
curvelet-based technique. The plots of the same performance measures for the 6 sets of
images are shown in Fig. 2. From the graphs, it can be seen that the proposed scheme
outperforms the existing method.
Fig. 2 Performance plot for a mutual information, b standard deviation of all sets of images
6 Conclusion
References
1. Pradhan, S., Patra, D. RMI based nonrigid image registration using BF-QPSO optimization
and P-spline, AEU-International Journal of Electronics and Communications, 69 (3), 609–621
(2015).
2. Mani, V.R.S and Rivazhagan, S. Survey of Medical Image Registration, Journal of Biomedical
Engineering and Technology, 1 (2), 8–25 (2013).
3. Oliveira, F. P., Tavares, J.M.R. Medical image registration: a review, Computer methods in
biomechanics and biomedical engineering, 17 (2), 73–93 (2014).
4. Acharyya, M., Kundu, M.K., An adaptive approach to unsupervised texture segmentation using
M-band wavelet tranform, Signal Processing, 81(7), 1337–1356, (2001).
5. Starck, J.L., Candes, E.J., Donoho, D.L., The curvelet transform for image denoising, IEEE
Transactions on Image Processing 11, 670–684 (2002).
6. Candes, E.J., Donoho, D., Continuous curvelet transform: II. Discretization and frames,
Applied and Computational Harmonic Analysis 19, 198–222 (2005).
7. Candes, E.J., Donoho, D., Ridgelets: a key to higher-dimensional intermittency, Philosoph-
ical Transactions: Mathematical, Physical and Engineering Sciences 357 (1760) 2495–2509
(1999).
8. Do, M.N., Vetterli, M., The finite Ridgelet transform for image representation, IEEE Transac-
tions on Image Processing 12 (1), 16–28 (2003).
9. Do, M.N., Vetterli, M., The contourlet transform: an efficient directional multiresolution image
representation, IEEE Transactions on Image Processing 14 (12), 2091–2106 (2005).
10. Pennec, E. Le, Mallat, S.: Sparse geometric image representations with bandelets, IEEE Trans-
actions on Image Processing 14 (4), 423–438 (2005).
11. Flusser, J., Sroubek, F., Zitova, B., Image Fusion:Principles, Methods, Lecture Notes Tutorial
EUSIPCO (2007).
12. Manu, V. T., Simon P., A novel statistical fusion rule for image fusion and its comparison in
non-subsampled contourlet transform domain and wavelet domain, The International Journal
of Multimedia and Its Applications, (IJMA), 4 (2), 69–87 (2012).
13. Alam, Md., Howlader, T., Rahman S.M.M., Entropy-based image registration method using
the curvelet transform, SIViP, 8, 491505, (2014).
14. Xu, J., Yang, L., Wu, D., A new transform for image processing, J. Vis. Commun. Image Rep-
resentation, 21, 627–639 (2010).
15. Chowdhury, M., Das, S., Kundu, M. K., CBIR System Based on Ripplet Transform Using
Interactive Neuro-Fuzzy Technique, Electronic Letters on Computer Vision and Image Analy-
sis 11(1), 1–13, (2012).
16. Das, S., Kundu, M. K., Medical image fusion using ripplet transform type-1, Progress in elec-
tromagnetic research B, 30, 355–370, (2011).
3D Local Transform Patterns: A New
Feature Descriptor for Image Retrieval
Abstract In this paper, the authors propose a novel approach for image retrieval in the
transform domain using the 3D local transform pattern (3D-LTraP). Existing
spatial-domain techniques such as the local binary pattern (LBP), local ternary
pattern (LTP), local derivative pattern (LDP) and local tetra pattern (LTrP)
encode the spatial relationship between neighbors and their center pixel in the
image plane. A first attempt in 3D was made using the spherical symmetric three-dimensional
local ternary pattern (SS-3D-LTP). However, the performance of
SS-3D-LTP depends on the proper selection of the threshold value for the ternary pattern
calculation, and multiscale and color information are missing in the SS-3D-LTP
method. In the proposed 3D-LTraP method, the first problem is overcome by
using a binary approach. Similarly, the other shortcomings are avoided by using the wavelet
transform, which provides directional as well as multiscale information, while color
features are embedded in the feature generation process itself. Two different databases,
covering natural and biomedical images (the Corel 10K and OASIS databases),
are used for the experiments. The experimental results demonstrate a
1 Introduction
1.1 Introduction
As technology grows rapidly and percolates into society, new means of
image acquisition (cameras, mobile phones) enter our day-to-day activities, and
consequently digital images have become an integral part of our lives. This exponential
growth in the amount of available visual information, coupled with our inherent
tendency to organize things, resulted in text-based image retrieval. That
method has limitations related to image annotation and human perception. Later,
a more potent alternative, content-based image retrieval (CBIR), came into the picture to
address these problems. All CBIR systems are primarily a two-step
process: the first step is feature extraction and the second is query matching. Feature
extraction constructs a feature vector that represents an abstract
form of the image; the feature vector characterizes the image using
global features like color or local descriptors like shape and texture.
The second step, query matching, is based on similarity measurement,
which uses the distance of the query from each image in the database to find the closest
image. A thorough and lucid summary of the literature on existing CBIR
techniques is presented in [1–6].
As texture based CBIR has matured over the years, texture extraction centered
about local patterns have emerged as frontrunners because of their relatively sim-
plistic yet potent texture descriptor and its relative invariance to intensity changes.
The underlining philosophy of all these descriptors is depicting the relationship
between a pixel and its neighbors. The trailblazer of this method was Ojala et al.
who proposed local binary pattern (LBP) [7]. Zhang et al. [8] modified the LBP
method by introducing the concept of derivative and suggested local derivative
patterns (LDP) method for face recognition. The problems related to variation in
illumination for face recognition in LBP and LDP is sorted out in local ternary
pattern (LTP) [9].
Several local patterns have been proposed by Subrahmanyam et al. including:
local maximum edge patterns (LMEBP) [10], local tetra patterns (LTrP) [11] and
directional local extrema patterns (DLEP) [12] for natural/texture image retrieval
and directional binary wavelet patterns (DBWP) [13], local mesh patterns (LMeP)
[14] and local ternary co-occurrence patterns(LTCoP) [15] for natural, texture and
biomedical image retrieval applications.
Volume LBP (VLBP) is proposed by Zhao and Pietikainen [16] for dynamic
texture (DT) recognition. They have collected the joint distribution of the gray
levels for three consecutive frames of video for calculating the feature. Also, they
used 3D-LBP features for lip-reading recognition [17]. Subrahmanyam and Wu
[18] proposed the spherical symmetric 3-D LTP for retrieval purpose.
The SS-3D-LTP [18] inspired us to propose the 3D-LTraP method. The main
contributions of the 3D-LTraP method are as follows: (i) the performance of LTP
depends largely on the proper selection of the threshold value for the ternary pattern
calculation, and an attempt has been made to resolve this problem;
(ii) 3D-LTraP uses five different planes at different scales with directional
information, whereas in SS-3D-LTP both types of information are missing; (iii) color
information is incorporated in the proposed method, which is absent in
SS-3D-LTP.
This paper is organized as follows: Sect. 1 gives an overview of content-based
image retrieval for different applications. Section 2 gives information about 2D
LBP, SS-3D-LTP and 3D-LTraP. Section 3 explains the proposed algorithm and the
different performance measures. Sections 4 and 5 show experimental outcomes for the
natural and biomedical image databases. Section 6 is dedicated to the conclusion.
2 Local Patterns
Ojala et al. [7] devised the LBP method for face recognition application. For a given
gray value of center pixel, LBP value is calculated using Eqs. 1 and 2 as presented
in Fig. 1.
$$LBP_{P,R} = \sum_{i=1}^{P} 2^{(i-1)} \times f_1(g_i - g_c) \quad (1)$$

$$f_1(x) = \begin{cases} 1, & x \ge 0 \\ 0, & \text{else} \end{cases} \quad (2)$$
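A compact numpy sketch of the LBP operator of Eqs. (1)–(2) for P = 8, R = 1 is shown below (border pixels are cropped to avoid wrap-around artifacts):

```python
import numpy as np

def lbp_8_1(image):
    """LBP code map for P = 8, R = 1 following Eqs. (1)-(2)."""
    img = image.astype(float)
    out = np.zeros_like(img, dtype=int)
    # neighbour offsets around the centre pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    for i, (dy, dx) in enumerate(offsets):
        shifted = np.roll(np.roll(img, -dy, axis=0), -dx, axis=1)   # neighbour g_i aligned with g_c
        out += ((shifted - img) >= 0).astype(int) << i              # f1(g_i - g_c) * 2^(i-1)
    return out[1:-1, 1:-1]                                          # drop wrapped border
```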
Subrahmanyam and Wu [18] used the VLBP concept to define the spherical
symmetric 3-D LTP for image retrieval application. From a given image, three
multiresolution images are generated using 2D Gaussian filter bank. With these
three multiresolution images, a 3D grid is constructed with five spherical symmetric
directions. Next, neighbors are collected for each direction and 3D-LTP features are
obtained. Detailed information of SS-3D-LTP is available in [18].
With a specific direction α, the 3D-LTraP values are calculated using Eqs. 4 and 5,
considering the relationship between the middle pixel $I_c(G_T)$ and its surrounding pixels, as
given below:
$$V_{\alpha}\big|_{P=8} = \begin{cases}
f(I_0(G_T){-}I_c(G_T)),\, f(I_1(G_T){-}I_c(G_T)),\, f(I_2(G_T){-}I_c(G_T)),\, f(I_3(G_T){-}I_c(G_T)),\, f(I_4(G_T){-}I_c(G_T)),\, f(I_5(G_T){-}I_c(G_T)),\, f(I_6(G_T){-}I_c(G_T)),\, f(I_7(G_T){-}I_c(G_T)); & \alpha = 1\\
f(I_2(R_T){-}I_c(G_T)),\, f(I_c(R_T){-}I_c(G_T)),\, f(I_6(R_T){-}I_c(G_T)),\, f(I_6(G_T){-}I_c(G_T)),\, f(I_6(B_T){-}I_c(G_T)),\, f(I_c(B_T){-}I_c(G_T)),\, f(I_2(B_T){-}I_c(G_T)),\, f(I_2(G_T){-}I_c(G_T)); & \alpha = 2\\
f(I_5(R_T){-}I_c(G_T)),\, f(I_c(R_T){-}I_c(G_T)),\, f(I_1(R_T){-}I_c(G_T)),\, f(I_1(G_T){-}I_c(G_T)),\, f(I_1(B_T){-}I_c(G_T)),\, f(I_c(B_T){-}I_c(G_T)),\, f(I_5(B_T){-}I_c(G_T)),\, f(I_5(G_T){-}I_c(G_T)); & \alpha = 3\\
f(I_4(R_T){-}I_c(G_T)),\, f(I_c(R_T){-}I_c(G_T)),\, f(I_0(R_T){-}I_c(G_T)),\, f(I_0(G_T){-}I_c(G_T)),\, f(I_0(B_T){-}I_c(G_T)),\, f(I_c(B_T){-}I_c(G_T)),\, f(I_4(B_T){-}I_c(G_T)),\, f(I_4(G_T){-}I_c(G_T)); & \alpha = 4\\
f(I_3(R_T){-}I_c(G_T)),\, f(I_c(R_T){-}I_c(G_T)),\, f(I_7(R_T){-}I_c(G_T)),\, f(I_7(G_T){-}I_c(G_T)),\, f(I_7(B_T){-}I_c(G_T)),\, f(I_c(B_T){-}I_c(G_T)),\, f(I_3(B_T){-}I_c(G_T)),\, f(I_3(G_T){-}I_c(G_T)); & \alpha = 5
\end{cases} \quad (4)$$

$$3D\text{-}LTraP_{\alpha,P,R} = \sum_{i=0}^{P-1} 2^{i}\, V_{\alpha}(i) \quad (5)$$
In LBP, the feature vector length is $2^{P}$ [7]. Further, the feature vector length is
reduced to $P(P-1)+2$ using the uniform patterns suggested by Guo et al. [19].

$$H_S(l) = \frac{1}{N_1 \times N_2} \sum_{j=1}^{N_1} \sum_{k=1}^{N_2} f_1\big(3D\text{-}LTraP(j,k),\, l\big); \quad l \in [0,\, P(P-1)+2] \quad (6)$$

$$f_1(x, y) = \begin{cases} 1, & x = y \\ 0, & x \ne y \end{cases} \quad (7)$$
3 Proposed Algorithm
The block schematic of proposed method is presented in Fig. 4 and its algorithm is
given below:
Algorithm:
$H_S(l)$  3D-LTraP histogram
α  spherical symmetric directions, $\alpha \in \{1, 2, \ldots, N\}$
J  total decomposition levels in the wavelet transform, $J \in \{1, 2, \ldots, M\}$
Input: Color image; Output: 3D-LTraP feature vector
1. Read the input color image.
2. Apply wavelet transform to each R, G and B plane for J number of decompo-
sition levels.
3. For J = 1 to M.
• Create the volumetric representation (as shown in Fig. 3) using RT , GT and
BT transform images.
• Generate 3D LTraP patterns in α spherical symmetric directions.
• Generate histogram for each 3D LTraP patterns using Eq. 7.
• Concatenate the histograms.
4. End of J.
5. Finally, 3D LTraP feature vectors are constructed by concatenating the his-
tograms obtained at step 3 for each decomposition level.
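A rough Python sketch of this feature-construction loop is given below; it assumes the PyWavelets package and a hypothetical callable `compute_3d_ltrap_histogram` implementing Eqs. (4)–(7), which is not defined here:

```python
import numpy as np
import pywt   # PyWavelets, assumed available

def ltrap_feature_vector(rgb_image, compute_3d_ltrap_histogram, levels=2):
    """Sketch: wavelet-decompose each colour plane, build the (R_T, G_T, B_T)
    volumetric representation per level and concatenate the per-level
    3D-LTraP histograms returned by the hypothetical helper."""
    planes = [rgb_image[:, :, c].astype(float) for c in range(3)]
    feats = []
    for j in range(1, levels + 1):
        transformed = []
        for p in planes:
            coeffs = pywt.wavedec2(p, 'db1', level=j)
            transformed.append(coeffs[0])           # approximation band at level j
        r_t, g_t, b_t = transformed
        feats.append(compute_3d_ltrap_histogram(r_t, g_t, b_t))
    return np.concatenate(feats)
```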
Similarity measurement is used to select similar images that look like the query
image. 3D- LTraP features are collected from the given database as per the pro-
cedure discussed in Sect. 3.1. Query image feature vector is compared with the
feature vector of images in the test database using the distance D (Eq. 10).
The distance is calculated as:
$$D(Q, I) = \sum_{i=1}^{N} \left| \frac{f_{I,i} - f_{Q,i}}{1 + f_{I,i} + f_{Q,i}} \right| \quad (10)$$
where
Q Query image;
N Feature vector length;
I Images in database;
fI, i ith feature of Ith image in the database;
fQ, i ith feature of query image Q
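A direct numpy sketch of this distance is given below (the absolute value follows the usual form of this distance measure):

```python
import numpy as np

def d1_distance(f_query, f_db):
    """Distance of Eq. (10) between a query feature vector and a database feature vector."""
    f_query = np.asarray(f_query, dtype=float)
    f_db = np.asarray(f_db, dtype=float)
    return np.sum(np.abs(f_db - f_query) / (1.0 + f_db + f_query))
```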
$$ARP = \frac{1}{N_q} \sum_{k=1}^{N_q} P(Q_k) \quad (13)$$

$$ARR = \frac{1}{N_q} \sum_{k=1}^{N_q} R(Q_k) \quad (14)$$
where
Nq Number of queries
Qk kth image in the database
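These measures reduce to simple averages over the queries; a minimal sketch is:

```python
import numpy as np

def precision_recall(retrieved_labels, query_label, n_relevant):
    """P(Q) and R(Q) for one query, given the labels of the top retrieved images
    and the number of relevant images in the database."""
    hits = sum(1 for lbl in retrieved_labels if lbl == query_label)
    return hits / len(retrieved_labels), hits / n_relevant

def arp_arr(per_query_precisions, per_query_recalls):
    """ARP and ARR of Eqs. (13)-(14): averages over all N_q queries."""
    return float(np.mean(per_query_precisions)), float(np.mean(per_query_recalls))
```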
Fig. 5 a Average retrieval precision versus category. b Average retrieval rate versus category
c Average retrieval precision versus number of top matches. d Average retrieval rate versus
number of top matches
Table 1 Comparative results of average retrieval precision and average retrieval rate on natural database
Database Performance Method
CS_LBP LEPSEG LEPINV BLK_LBP LBP DLEP LTP SS-3D-LTP SS-3D-LTPu2 3D-LTraP 3D-LTrapu2
Corel–10 K Precision (%) 26.4 34 28.9 38.1 37.6 40 42.95 46.25 44.97 52.46 51.71
Recall (%) 10.1 13.8 11.2 15.3 14.9 15.7 16.62 19.64 19.09 21.90 21.21
Fig. 6 Average retrieval precision versus number of top matches on biomedical image database
6 Conclusions
References
1. M. L. Kherfi, D. Ziou and A. Bernardi. Image Retrieval from the World Wide Web: Issues,
Techniques and Systems. ACM Computing Surveys, 36 35–67, 2004.
2. Ke Lu and Jidong Zhao, Neighborhood preserving regression for image retrieval.
Neurocomputing 74 1467–1473, 2011.
3. Tong Zhaoa, Lilian H. Tang, Horace H.S. Ip, Feihu Qi, On relevance feedback and similarity
measure for image retrieval with synergetic neural nets. Neurocomputing 51 105−124, 2003.
4. Kazuhiro Kuroda, Masafumi Hagiwara, An image retrieval system by impression words and
specific object names-IRIS. Neurocomputing 43 259–276, 2002.
5. Jing Li, Nigel M. Allinson, A comprehensive review of current local features for computer
vision. Neurocomputing 71 1771–1787, 2008.
6. Akgül C. B., Rubin D. L., Napel S., Beaulieu C. F., Greenspan H. and Acar B. Content-Based
Image Retrieval in Radiology: Current Status and Future Directions. Digital Imaging, 24, 2
208–222, 2011.
7. T. Ojala, M. Pietikainen, D. Harwood. A comparative study of texture measures with
classification based on feature distributions. Pattern Recognition, 29 51–59, 1996.
8. B. Zhang, Y. Gao, S. Zhao, J. Liu. Local derivative pattern versus local binary pattern: Face
recognition with higher-order local pattern descriptor. IEEE Trans. Image Process., 19,
2 533–544, 2010.
9. X. Tan and B. Triggs. Enhanced local texture feature sets for face recognition under difficult
lighting conditions. IEEE Trans. Image Process., 19, 6 1635–1650, 2010.
10. Subrahmanyam Murala, R. P. Maheshwari, R. Balasubramanian. Local Maximum Edge
Binary Patterns: A New Descriptor for Image Retrieval and Object Tracking. Signal
Processing, 92 1467–1479, 2012.
11. Subrahmanyam Murala, Maheshwari RP, Balasubramanian R. Local tetra patterns: a new
feature descriptor for content based image retrieval. IEEE Trans. Image Process 2012; 21(5):
2874–86.
12. Subrahmanyam Murala, R. P. Maheshwari, R. Balasubramanian. Directional local extrema
patterns: a new descriptor for content based image retrieval. Int. J. Multimedia Information
Retrieval, 1, 3 191–203, 2012.
P. Ananth Raj
1 Introduction
Quaternions are extensions of complex numbers that consist of one real part
and three imaginary parts. A quaternion number with zero real part is called a pure
quaternion. The quaternion number system was introduced by the mathematician
Hamilton [23] in 1843; Sangwine [1, 23] then applied it to color image
representation. A quaternion number q is written as

$$q = a + bi + cj + dk \quad (1)$$

where a, b, c and d are real numbers and i, j and k are orthogonal unit axis vectors
that satisfy the following rules:

$$i^2 = j^2 = k^2 = -1, \quad ij = -ji = k, \quad jk = -kj = i, \quad ki = -ik = j \quad (2)$$
From these equations one can see that quaternion multiplication is not commutative.
The conjugate and modulus of a quaternion number q are

$$\bar{q} = a - bi - cj - dk, \qquad |q| = \sqrt{a^2 + b^2 + c^2 + d^2}$$
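A small Python sketch of these basic quaternion operations (Hamilton product, conjugate, modulus) is given below:

```python
import numpy as np

def q_mul(p, q):
    """Hamilton product of quaternions given as tuples (a, b, c, d) = a + bi + cj + dk."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

def q_conj(q):
    """Quaternion conjugate."""
    a, b, c, d = q
    return (a, -b, -c, -d)

def q_mod(q):
    """Quaternion modulus."""
    return float(np.sqrt(sum(x * x for x in q)))
```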
For any two quaternion numbers p and q, we have $\overline{p\,q} = \bar{q}\,\bar{p}$. The quaternion
representation of a pixel in a color image is

$$f(r, \theta) = f_R(r, \theta)\, i + f_G(r, \theta)\, j + f_B(r, \theta)\, k \quad (4)$$

In this expression $f_R(r, \theta)$, $f_G(r, \theta)$ and $f_B(r, \theta)$ denote the red, green and blue
components of the polar representation of the image.
Let $f(r, \theta)$ be the polar representation of a gray-level image of size $N \times M$; then
the general expression for the circularly orthogonal moments $E_{nm}$ of order n and
repetition m of the polar image $f(r, \theta)$ is
$$E_{nm} = \frac{1}{Z} \int_{r=0}^{1} \int_{\theta=0}^{2\pi} f(r, \theta)\, T_n(r)\, \exp(-jm\theta)\, r\, dr\, d\theta \quad (5)$$

$$E_{nm} = \frac{1}{2\pi a_n} \int_{r=\pi}^{\pi+1} \int_{\theta=0}^{2\pi} f(r, \theta)\, T_n(r)\, \exp(-jm\theta)\, r\, d\theta\, dr$$
Polar harmonic:       $T_n(r) = e^{-j2\pi n r^{2}}$,   $Z = \frac{\pi (M^{2}+N^{2})}{4}$
Exponent Fourier-II:  $T_n(r) = \sqrt{\frac{2}{r}}\, e^{-j2\pi n r}$,   $Z = \pi (M^{2}+N^{2})$
$$E_{nm} = \frac{1}{2\pi} \int_{r=k}^{k+1} \int_{\theta=0}^{2\pi} f(r-k, \theta)\, T_n(r)\, \exp(-jm\theta)\, r\, d\theta\, dr$$

$$f(r-k, \theta) = \sum_{n=1}^{N_{max}} \sum_{m=1}^{M_{max}} E_{nm}\, T_n^{*}(r)\, \exp(jm\theta)$$
Fig. 1 Real part of the radial function Tn(r) for n = 0, 1, 2, 3, 4, 5. The x axis denotes the 'r' values (1
to 2, in steps of 0.001) and the y axis the real part of the radial function. Colors: n = 0 blue, n = 1 green,
n = 2 red, n = 3 black, n = 4 cyan, n = 5 magenta
$$E_{nm} = \frac{1}{Z} \int_{r=0}^{1} \int_{\theta=0}^{2\pi} f(r, \theta)\, r^{-1/4} \sin\big((n+1)\pi r\big)\, \exp(-jm\theta)\, r\, dr\, d\theta \quad (6)$$
$$E_{nm} = \frac{1}{Z} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} f(x_i, y_j)\, T_n(r_{ij})\, e^{-jm\theta_{ij}} \quad (7)$$

where $Z = \frac{1}{2}\pi (M^2 + N^2)$, $r_{ij} = \sqrt{x_i^2 + y_j^2}$ and $\theta_{ij} = \tan^{-1}\!\big(\tfrac{y_j}{x_i}\big)$.
More details can be seen in paper [19]. Next section presents Quaternion cir-
cularly semi orthogonal moments.
$$E_{nm}^{R} = \frac{1}{2\pi} \int_{r=0}^{1} \int_{\theta=0}^{2\pi} f(r, \theta)\, T_n(r)\, \exp(-\mu m\theta)\, r\, d\theta\, dr$$

$$E_{nm}^{L} = \frac{1}{2\pi} \int_{r=0}^{1} \int_{\theta=0}^{2\pi} \exp(-\mu m\theta)\, f(r, \theta)\, T_n(r)\, r\, d\theta\, dr \quad (8)$$

In this work we consider only the right-side expression and drop the superscript R.
The relationship between these two expressions is $E_{nm}^{L} = -E_{n,-m}^{R}$, which can be derived
using the conjugate property. Next, we derive an expression for the implementation of
$E_{nm}$. Substituting Eq. (4) into Eq. (8), we get
$$E_{nm} = \frac{1}{2\pi} \int_{r=0}^{1} \int_{\theta=0}^{2\pi} T_n(r)\, \big[ f_R(r, \theta)\, i + f_G(r, \theta)\, j + f_B(r, \theta)\, k \big] \exp(-\mu m\theta)\, r\, d\theta\, dr \quad (9)$$
Let

$$A_{nm} = \frac{1}{2\pi} \left[ \int_{r=0}^{1} \int_{\theta=0}^{2\pi} f_R(r, \theta)\, T_n(r)\, \exp(-\mu m\theta)\, r\, d\theta\, dr \right]$$

$$B_{nm} = \frac{1}{2\pi} \left[ \int_{r=0}^{1} \int_{\theta=0}^{2\pi} f_G(r, \theta)\, T_n(r)\, \exp(-\mu m\theta)\, r\, d\theta\, dr \right]$$

$$C_{nm} = \frac{1}{2\pi} \left[ \int_{r=0}^{1} \int_{\theta=0}^{2\pi} f_B(r, \theta)\, T_n(r)\, \exp(-\mu m\theta)\, r\, d\theta\, dr \right]$$
$A_{nm}$, $B_{nm}$ and $C_{nm}$ are complex valued; hence, the above equation can be
expressed as

$$E_{nm} = i\big(A_{nm}^{R} + \mu A_{nm}^{I}\big) + j\big(B_{nm}^{R} + \mu B_{nm}^{I}\big) + k\big(C_{nm}^{R} + \mu C_{nm}^{I}\big)$$

Substituting $\mu = \frac{i+j+k}{\sqrt{3}}$ and simplifying the above expression using Eq. 2, we get

$$E_{nm} = i\left(A_{nm}^{R} + \frac{(i+j+k)}{\sqrt{3}}\, A_{nm}^{I}\right) + j\left(B_{nm}^{R} + \frac{(i+j+k)}{\sqrt{3}}\, B_{nm}^{I}\right) + k\left(C_{nm}^{R} + \frac{(i+j+k)}{\sqrt{3}}\, C_{nm}^{I}\right)$$

$$E_{nm} = -\frac{1}{\sqrt{3}}\big(A_{nm}^{I} + B_{nm}^{I} + C_{nm}^{I}\big) + i\left(A_{nm}^{R} + \frac{1}{\sqrt{3}}\big(B_{nm}^{I} - C_{nm}^{I}\big)\right) + j\left(B_{nm}^{R} + \frac{1}{\sqrt{3}}\big(C_{nm}^{I} - A_{nm}^{I}\big)\right) + k\left(C_{nm}^{R} + \frac{1}{\sqrt{3}}\big(A_{nm}^{I} - B_{nm}^{I}\big)\right).$$
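A minimal sketch that assembles the quaternion moment from the three per-channel complex moments according to this expansion is shown below; `A`, `B`, `C` stand for the complex values $A_{nm}$, $B_{nm}$, $C_{nm}$ computed separately on the R, G and B planes:

```python
import numpy as np

def qcso_from_channel_moments(A, B, C):
    """Assemble E_nm = (real, i, j, k parts) from the complex single-channel
    moments A, B, C of the R, G and B planes, with mu = (i + j + k) / sqrt(3)."""
    s = 1.0 / np.sqrt(3.0)
    real = -s * (A.imag + B.imag + C.imag)
    e_i = A.real + s * (B.imag - C.imag)
    e_j = B.real + s * (C.imag - A.imag)
    e_k = C.real + s * (A.imag - B.imag)
    return real, e_i, e_j, e_k
```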
$$f(r, \theta) = \sum_{n=0}^{L} \sum_{m=-L}^{L} E_{nm}\, T_n(r)\, e^{\mu m\theta} \quad (10)$$

$$f(r, \theta) = \sum_{n=0}^{L} \sum_{m=-L}^{L} (S1 + iS2 + jS3 + kS4)\, T_n(r)\, e^{\mu m\theta}$$
Substituting $\mu = \frac{i+j+k}{\sqrt{3}}$ in the above expression and simplifying, we get the
expression for the inverse QCSO moments. In this expression, real(·) and imag(·) denote the
real and imaginary parts of the value within the bracket. Each term represents the
reconstruction matrix of s1, s2, s3 and s4, respectively, and they are determined using

$$s1 = \sum_{n=0}^{L} \sum_{m=-L}^{L} S1\, T_n(r)\, e^{jm\theta}, \quad s2 = \sum_{n=0}^{L} \sum_{m=-L}^{L} S2\, T_n(r)\, e^{jm\theta}$$

$$s3 = \sum_{n=0}^{L} \sum_{m=-L}^{L} S3\, T_n(r)\, e^{jm\theta}, \quad s4 = \sum_{n=0}^{L} \sum_{m=-L}^{L} S4\, T_n(r)\, e^{jm\theta}$$
Let $f(r, \theta)$ and $f(r, \theta - \varphi)$ be the unrotated and rotated (by an angle $\varphi$) images
expressed in polar form; then the QCSO moments of the rotated image are

$$E_{nm}^{r} = \frac{1}{2\pi} \int_{r=0}^{1} \int_{\theta=0}^{2\pi} f(r, \theta - \varphi)\, T_n(r)\, \exp(-\mu m\theta)\, r\, d\theta\, dr$$

$$\big\| E_{nm}^{r}(f) \big\| = \big\| E_{nm}(f) \big\|$$
Rotation of an image by an angle $\varphi$ does not change the magnitude; only the
phase changes from $-\mu m\theta$ to $-(\mu m\theta + \mu m\varphi)$. Hence the magnitude is invariant to
rotation. Translation invariance is achieved by using the common centroid $(x_c, y_c)$
obtained from the R, G and B images. This procedure was suggested by Flusser [24]
and employed by a number of researchers such as Chen et al. [18] and Nisrine Das et al. [7].
The procedure consists of fixing the origin of the coordinates at the color image centroid
obtained as

$$x_c = \frac{m_{1,0}^{R} + m_{1,0}^{G} + m_{1,0}^{B}}{m_{0,0}}, \quad y_c = \frac{m_{0,1}^{R} + m_{0,1}^{G} + m_{0,1}^{B}}{m_{0,0}}, \quad m_{0,0} = m_{00}^{R} + m_{00}^{G} + m_{00}^{B},$$

where $m_{00}^{R}$, $m_{10}^{R}$, $m_{01}^{R}$ are the geometric moments of the R image, and the G and B
superscripts denote the green and blue images. Using the above coordinates, the QCSO
moments invariant to translation are given by
$$E_{nm} = \frac{1}{2\pi} \int_{r=0}^{1} \int_{\theta=0}^{2\pi} f(\bar{r}, \bar{\theta})\, T_n(\bar{r})\, \exp(-\mu m\bar{\theta})\, \bar{r}\, d\bar{r}\, d\bar{\theta} \quad (11)$$

where $\bar{r} = \sqrt{(x - x_c)^2 + (y - y_c)^2}$ and $\bar{\theta} = \tan^{-1}\!\left(\frac{y - y_c}{x - x_c}\right)$.
Moments calculated using the above expression are invariant to translation. In
most applications, such as image retrieval, images are scaled only moderately, so the
scale-invariance property is fulfilled automatically because the QCSO moments are
defined on the unit circle using Eq. 6a [10].
Another useful property concerns flipping an image either vertically or horizontally. Let
$f(r, \theta)$ be the original image, and $f(r, -\theta)$ and $f(r, \pi - \theta)$ the vertically and horizontally
flipped images. One of the color images flipped vertically and horizontally is
shown in Fig. 2. We derive its QCSO moments. The QCSO moments of the vertically flipped
image are

$$E_{nm}^{V} = \frac{1}{2\pi} \int_{\theta=0}^{2\pi} \int_{r=0}^{\infty} f(r, -\theta)\, T_n(r)\, e^{-\mu m\theta}\, r\, dr\, d\theta$$
Fig. 2 c Rotated image, 2 deg clockwise; d translated image, x = 2, y = 1 units
$$E_{nm}^{h} = \frac{1}{2\pi} \int_{\theta=0}^{2\pi} \int_{r=0}^{\infty} f(r, \pi - \theta)\, T_n(r)\, e^{-\mu m\theta}\, r\, dr\, d\theta$$

$$E_{nm}^{h} = -E_{nm}\, e^{-\frac{(i+j+k)}{\sqrt{3}}\, m\pi}$$
Hence, one can compute the flipped image moments using the above equation.
Some of the properties like invariance to contrast changes can be verified by
normalizing moments by E00.
7 Simulation Results
In order to verify the proposed quaternion circularly semi-orthogonal moments for
both reconstruction capability and invariance to rotation, translation and flipping, we
selected four color images, namely a lion image, a vegetable image, a parrot image and a
painted Mona Lisa image, downloaded from the Amsterdam Library of Object Images,
and computed their QCSO moments; these color images are shown in Fig. 2. The images
were reconstructed using only moments of order 40 (L = 40 in Eq. 10), and the obtained
results are shown in Fig. 3. High-frequency information such as edges is well
preserved. These images (the lion image is shown in Fig. 2) were rotated by 2 degrees
clockwise using the IMROTATE function available in MATLAB 2010, the magnitudes of
a few QCSO moments were computed, and the results are reported in Table 2. From
these results we note that the QCSO moments before and after rotation are almost equal.
We also verified the translation property by translating the lion and vegetable images by
2 units in the x direction (dx = 2) and 1 unit (dy = 1) in the y direction. The results
(magnitude of Enm) computed using Eq. 11 are shown in Table 3; the difference between
the moments before and after translation is very small. Finally, the moments (magnitude
of Enm) calculated for the vertically flipped images are reported in Table 4.
8 Conclusions
Acknowledgements Author would like to thank Dr. B. Rajendra Naik, Present ECE Dept Head
and Dr. P. Chandra Sekhar, former ECE Dept Head for allowing me to use the Department
facilities even after my retirement from the University.
References
14. Zhuhong Sha, Huazhong Shu, Jiasong wu, Beijing Chen, Jean Louis, Coatrieux, “Quaternion
Bessel Fourier moments and their invariant descriptors for object reconstruction and
recognition”, Pattern Recognit, no 5, vol 47, 2013.
15. Hongqing Zho, Yan Yang, Zhiguo Gui, Yu Zhu and Zhihua Chen, “Image Analysis by
Generalized Chebyshev Fourier and Generalized Pseudo Jacobi-fourier Moments, Pattern
recognition (accepted for publication).
16. Bin Xiao, J. Feng Ma, X. Wang, “ Image analysis by Bessel Fourier moments”, Pattern
Recognition, vol 43, 2010.
17. E. Karakasis, G. Papakostas, D. Koulouriotis and V. Tourassis, “A Unified methodology for
computing accurate quaternion color moments and moment invariants”, IEEE Trans. on
Image Processing vol 23 no 1, 2014.
18. B. Chen, H. Shu, G. Coatrieux, G. Chen X. Sun, and J.L. Coatrieux, “Color image analysis by
quaternion type moments”, journal of Mathematical Imaging and Vision, vol 51, no 1, 2015.
19. Hai-tao-Hu, Ya-dong Zhang Chao Shao, Quan Ju, “Orthogonal moments based on Exponent
Fourier moments”, Pattern Recognition, vol 47, no 8, pp. 2596–2606, 2014.
20. Bin Xiao, Wei-sheng Li, Guo-yin Wang, “ Errata and comments on orthogonal moments
based on exponent functions: Exponent-Fourier moments”, Recognition vol 48, no 4,
pp 1571–1573, 2015.
21. Hai-tao Hu, Quan Ju, Chao Shao, “Errata and comments on ‘Errata and comments on orthogonal
moments based on exponent functions: Exponent-Fourier moments’”, Pattern Recognition
(accepted for publication).
22. Xuan Wang, Tengfei Yang Fragxia Guo, “Image analysis by Circularly semi orthogonal
moments”, Pattern Recognition, vol 49, Jan 2016.
23. W.R. Hamilton, Elements of Quaternions, London, U.K. Longman 1866.
24. T. Suk and J. Flusser, “Affine Moment Invariants of Color Images”, Proc CAIP 2009,
LNCS5702, 2009.
Study of Zone-Based Feature for Online
Handwritten Signature Recognition
and Verification in Devanagari Script
Abstract This paper presents one zone-based feature extraction approach for
online handwritten signature recognition and verification of one of the major Indic
scripts–Devanagari. To the best of our knowledge no work is available for signature
recognition and verification in Indic scripts. Here, the entire online image is divided
into a number of local zones. In this approach, named Zone wise Slopes of
Dominant Points (ZSDP), the dominant points are detected first from each stroke
and next the slope angles between consecutive dominant points are calculated and
features are extracted in these local zones. Next, these features are supplied to two
different classifiers; Hidden Markov Model (HMM) and Support Vector Machine
(SVM) for recognition and verification of signatures. An exhaustive experiment in a
large dataset is performed using this zone-based feature on original and forged
signatures in Devanagari script and encouraging results are found.
1 Introduction
R. Ghosh (✉)
Department of Computer Science & Engineering, National Institute of Technology,
Patna, India
e-mail: [email protected]
P.P. Roy
Department of Computer Science & Engineering, Indian Institute of Technology,
Roorkee, India
e-mail: [email protected]
possess both static as well as some dynamic features [1]. Dynamic features include
elevation and pressure signals which make each person’s signature unique. Even if
skilled forgers are able to produce the same shape of the original signature, it is
unlikely that they will also be able to produce the dynamic properties of the original
one. In this paper, a zone-based feature extraction approach [2] is used for
recognition and verification of online handwritten Devanagari signatures. Zone-based
features [2] have previously shown efficient results for online character
recognition.
Several studies are available [1–12] for online handwritten signature recognition
and verification in non-Indic scripts, but to the best of our knowledge no work is
available for signature recognition and verification in Indic scripts. In our system,
we perform preprocessing such as interpolating missing points, smoothing, size
normalization and resampling on each stroke of the signature. Then each online
stroke information of a signature is divided into a number of local zones by dividing
each stroke into a number of equal cells. Next, using the present approach, named
ZSDP, dominant points are detected for each stroke and next the slope angles
between consecutive dominant points are calculated separately for the portion of the
stroke lying in each of the zones. These features are next fed to classifiers for
recognition and verification of signatures. We have compared SVM and HMM
based results in this paper.
The rest of the paper is organized as follows. Section 2 details the related works.
In Sect. 3 we discuss about the data collection. Section 4 details the preprocessing
techniques used and the proposed approaches of feature extraction methods. Sec-
tion 5 details the experimental results. Finally, conclusion of the paper is given in
Sect. 6.
2 Literature Survey
To the best of our knowledge, no study is available for online handwritten signature
recognition and verification in Indic scripts. Some of the related studies available in
non-Indic scripts are discussed below.
Plamondon et al. [3] reported an online handwritten signature verification
scheme where signature features related to temporal and spatial aspects of the
signature, are extracted. Several methods have been proposed for using local fea-
tures in signature verification [4]. The most popular method uses elastic matching
concept by Dynamic Warping (DW) [5, 6]. In the literature, several hundreds of
parameters have been proposed for signature recognition and verification. Among
these, the parameters like position, displacement, speed, acceleration [7, 8], number
of pen ups and pen downs [8], pen down time ratio [7], Wavelet transform [9],
Fourier transform [10] have been extensively used. Dimauro et al. [11] proposed a
function-based approach where online signatures are analysed using local properties
Devanagari (or simply Nagari) is the script used to write languages such as Sanskrit,
Hindi, Nepali, Marathi, Konkani and many others. Generally, in Devanagari script,
words or signatures are written from left to right and the concept of upper-lower
case alphabet is absent in this script. Most of the words or signatures of Devanagari
script have a horizontal line (shirorekha) at the upper part. Figure 1 shows two
different online handwritten signatures in Devanagari script where shirorekha is
drawn in the upper part of both the signatures.
Online data acquisition captures the trajectory and strokes of signatures. In
online data collection, the rate of sampling of each stroke remains fixed for all
signature samples. As a consequence, the number of points in the series of
co-ordinates for a particular sample does not remain fixed and depends on the time
taken to write the sample on the pad.
For our data collection, a total of 100 native Hindi writers belonging to different
age groups contributed handwritten signature samples. Each writer was prompted to
provide five genuine samples of each signature in Devanagari script. So, a total of
500 samples have been collected for each genuine signature in Devanagari script.
The training and testing data for genuine signatures are in 4:1 ratio. Each writer was
also prompted to provide five forged signatures of five other people. So, a total of
500 samples have been collected of forged signatures.
4 Feature Extraction
Before extracting the features from strokes, a set of preprocessing tasks is per-
formed on the raw data collected for each signature sample. Preprocessing includes
several steps like interpolation, smoothing, resampling and size normalization [13].
Figure 2 shows the images of one online handwritten signature in Devanagari script
before and after smoothing. The detailed discussion about these preprocessing steps
may be found in [13].
During feature extraction phase, the features that will be able to distinguish one
signature from another, are extracted. The feature extractions are done on the entire
signature image, irrespective of the number of strokes it contains. We discuss below
the proposed zone-based feature extraction approach for signature recognition and
verification.
Zone wise Slopes of Dominant Points (ZSDP): The whole signature image is
divided into a number of local zones of r rows × c columns. However, instead of
extracting local features directly, we first detect dominant points in the portion of the
strokes lying in each zone. Dominant points are those points where the online
stroke changes its slope drastically. To extract this feature, slopes are first
calculated between consecutive points of the trajectory portion lying in
each local zone, and the slope angles are quantized uniformly into 8 levels. Let the
resultant quantized slope vector for a particular zone be Q = {q1, q2, …, qn}.
Formally, a point $p_i$ is said to be a dominant point if the following condition is
satisfied:

$$|q_{i+1} - q_i| \,\%\, k \ge C_T$$
so on. We have tested other bin divisions, but the one using π/4 gives the best
accuracy. The histograms of feature values are normalized, and we get an 8-dimensional
feature vector for each zone. So, the total dimension for 9 zones is
8 × 9 = 72.
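A minimal sketch of the dominant-point detection for one stroke, following the quantized-slope condition above, could look as follows (the angle quantization into 8 uniform bins is taken from the description; the exact preprocessing is omitted):

```python
import numpy as np

def dominant_points(stroke_xy, levels=8, ct=2):
    """Quantise consecutive-point slopes into `levels` bins and keep points
    where the quantised slope jumps by at least `ct` (|q_{i+1} - q_i| % k >= C_T)."""
    stroke_xy = np.asarray(stroke_xy, dtype=float)
    dx = np.diff(stroke_xy[:, 0])
    dy = np.diff(stroke_xy[:, 1])
    angles = np.mod(np.arctan2(dy, dx), 2 * np.pi)
    q = np.floor(angles / (2 * np.pi / levels)).astype(int)   # quantised slopes q_i
    jump = np.abs(np.diff(q)) % levels
    return stroke_xy[1:-1][jump >= ct]
```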
Support Vector Machine (SVM) and Hidden Markov Model (HMM) classifiers are
used for our online signature recognition and verification system. Support Vector
Machine (SVM) has been applied successfully for pattern recognition and regres-
sion tasks [14, 15].
We apply Hidden Markov Model (HMM) based stochastic sequential classifier
for recognizing and verifying online signatures. The HMM is used because of its
capability to model sequential dependencies [16]. HMM classifier has been
implemented through HTK toolkit [17].
The experimental testing of the proposed approach was carried out using online
handwritten Devanagari genuine and forged signatures. The feature vectors of
genuine signatures are used to build the training set which is used to build a model
for validating the authenticity of test signatures. Two separate testing sets are
created—one for genuine signatures and another for forged signatures.
Results using SVM: Using the current approach, the system has been tested using
different kernels of SVM and by dividing the entire signature image into different
zones. Using this approach, best accuracy is obtained using the combination of 16
zone division, CT = 2 and linear kernel of SVM. The detailed result analysis, using
ZSDP approach, is shown in Table 1.
Table 1 Signature recognition results using SVM with different kernels for ZSDP approach
ZSDP
Zones CT RBF kernel (%) Linear kernel (%) Polynomial kernel (%)
9 (3 × 3) 2 87.03 92.57 88.23
9 (3 × 3) 3 84.89 89.89 85.17
9 (3 × 3) 4 81.85 86.85 82.11
16 (4 × 4) 2 93.42 98.36 94.73
16 (4 × 4) 3 90.23 95.23 91.47
16 (4 × 4) 4 87.76 92.76 88.20
Results using HMM: The testing datasets for HMM based experimentation are
same as used in SVM based experimentation. Table 2 shows the recognition
accuracies using ZSDP approach. In our experiment, we have tried different
Gaussian mixtures and state number combinations. We noted that with 32 Gaussian
mixtures and 3 states, HMM provided the maximum accuracies. Figure 4 shows the
signature recognition results using ZSDP approach based on different top choices
for both SVM and HMM.
To validate the authenticity of each genuine signature, forged signatures are used as
test samples. For signature verification, generally two different measurement
techniques are employed to measure the performance of the verification system—
False Acceptance Rate (FAR) and False Rejection Rate (FRR). The first one
indicates the rate of accepting forgeries as genuine signatures and the second one
indicates the rate of rejecting forged signatures. For a good signature verification
system, the value of FAR should be very low and FRR should be high. Table 3
shows the signature verification results through FAR and FRR.
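As an illustration, a score-threshold-based sketch of these two measures under the common convention (FAR: forgeries accepted as genuine; FRR: genuine signatures rejected) is given below; the threshold and the score convention are assumptions, not part of the experimental protocol above:

```python
def far_frr(genuine_scores, forged_scores, threshold):
    """FAR and FRR for a score-based verifier where scores above `threshold`
    are accepted as genuine."""
    far = sum(s > threshold for s in forged_scores) / len(forged_scores)
    frr = sum(s <= threshold for s in genuine_scores) / len(genuine_scores)
    return far, frr
```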
Fig. 4 Signature recognition results for ZSDP approach based on different Top choices using
SVM and HMM
Among the existing studies in the literature, to the best of our knowledge, no work
exists on online handwritten signature verification system in Devanagari script. So,
the present work cannot be compared with any of the existing studies.
6 Conclusion
In this paper, we have described one approach of feature extraction for online
handwritten signature recognition and verification in Devanagari script. In our
dataset we considered five samples each for signatures of 100 different persons in
Devanagari script. The experimental evaluation of the proposed approach yields
encouraging results. This work will be helpful for the research towards online
recognition and verification of handwritten signatures of other Indian scripts as well
as for Devanagari.
References
1 Introduction
Identifying a plant is a difficult task even for experienced botanists, due to huge num-
ber of plant species existing in the world. This task is important for a large number
of applications such as agriculture, botanical medicine (ayurvedic treatment), cos-
metics and also for biodiversity conservation [1]. Plant identification is generally
done based on the observation of the morphological characteristics of the plant such
as structure of stems, roots and leaves and flowers followed by the consultation of
a guide or a known database. Flowers and fruits are present only for a few weeks
whereas leaves are present for several months and they also contain taxonomic iden-
tity of a plant. This is why many plant identification methods work on leaf image
databases [2–4]. A leaf image can be characterized by its color, texture, and shape.
Since the color of a leaf varies with the seasons and climatic conditions and also
most plants have similar leaf color, this feature may not be useful as discriminating
feature for the species recognition. Hence only shape and texture features are used as
discriminating features for plant species recognition. In this paper, we use shape and
texture features to identify the plant species. Shape descriptors are used to calculate shape
features in many pattern recognition tasks, e.g., object detection, and can be broadly
categorized into contour-based and region-based descriptors. The shape feature captures
the global shape of the leaf, but a complete description of the leaf's interior structure
is also important. The interior structure is extracted using the histogram of oriented
gradients (HOG) and Gabor filters. The combined shape and texture features are given to a
Support Vector Machine (SVM) for classification.
Leaf identification is a major research area in the field of computer vision and
pattern recognition because of its application in many areas. Many methods have
been proposed in the literature for leaf identification. Yahiaoui et al. [3] used Direc-
tional Fragment Histogram (DFH) [5] as a shape descriptor for plant species iden-
tification. They applied this method on plant leaves database containing scanned
images which gave an accuracy of 72.65 %. Wu et al. [6] proposed a leaf recogni-
tion using probabilistic neural network (PNN) with image and data processing tech-
niques. They also created the Flavia database [6], which consists of 32 classes of
leaf species and total of 1907 images. Kumar et al. [4] used Histograms of Curva-
ture over Scale (HoCS) features for identifying leaves. They obtained an accuracy
of 96.8 % when performed on dataset containing 184 tree species of Northeastern
United States. Bhardwaj et al. [7] used moment invariant and texture as features for
plant identification. This method obtained an accuracy of 91.5 % when performed
on a database containing 320 leaves of 14 plant species. Recently, Tsolakidis et al.
[8] used Zernike moments and histogram of oriented gradients as features for leaf
image, for which they obtained an accuracy of 97.18 % on the Flavia database.
The Zernike moments, being continuous moments suffer from the discretization
error and hence discrete orthogonal moments, like Krawtchouk moments (KM) [9]
are proven to be better alternatives. Gabor filters [10] are widely used to
extract texture features because of their superior performance. This paper pro-
poses to investigate the following:
1. To investigate the superiority of the Krawtchouk moments over Zernike moments
as leaf-shape descriptors.
2. To propose Gabor features as alternative leaf texture features and study their per-
formance with and without integration with the shape features.
The rest of the paper is organized as follows. Section 2 gives a brief introduc-
tion about the orthogonal moments, then Sect. 3 presents HOG and Gabor filters.
Our proposed KMs and Gabor filter methods are presented in Sect. 4, followed by
Simulation results in Sect. 5 and finally the paper is concluded in Sect. 6.
2 Orthogonal Moments
Moments are used as shape descriptors in many computer vision applications. Teague
[11] introduced moments with orthogonal basis functions. Orthogonal moments have
minimum information redundancy, and they are generated using continuous and
discrete orthogonal polynomials.
$$K_n(x;\, p, N) = \sum_{k=0}^{N} a_{k,n,p}\, x^k = {}_2F_1\!\left(-n, -x;\, -N;\, \frac{1}{p}\right) \quad (1)$$

$${}_2F_1(a, b;\, c;\, z) = \sum_{k=0}^{\infty} \frac{(a)_k (b)_k}{(c)_k} \frac{z^k}{k!} \quad (2)$$

$$(a)_k = a(a+1)(a+2) \cdots (a+k-1) = \frac{\Gamma(a+k)}{\Gamma(a)} \quad (3)$$
$$\sum_{x=0}^{N} w(x;\, p, N)\, K_n(x;\, p, N)\, K_m(x;\, p, N) = \rho_n(p, N)\, \delta_{nm} \quad (5)$$

The computational complexity can be reduced using the following recurrence relation [9], where

$$A = \sqrt{\frac{(1-p)(n+1)}{p(N-n)}}, \qquad B = \sqrt{\frac{(1-p)^2 (n+1)\, n}{p^2 (N-n)(N-n+1)}}$$

and

$$\bar{K}_1(x;\, p, N) = \left(1 - \frac{x}{pN}\right) \sqrt{w(x;\, p, N)}$$
Similarly, the weight function can be calculated recursively using

$$w(x+1;\, p, N) = \left(\frac{N-x}{x+1}\right) \frac{p}{1-p}\, w(x;\, p, N) \quad (8)$$

with

$$w(0;\, p, N) = (1-p)^N = e^{N \ln(1-p)}$$
$$Q_{nm} = \sum_{x=0}^{N-1} \sum_{y=0}^{M-1} \hat{K}_n(x;\, p1, N-1)\, \hat{K}_m(y;\, p2, M-1)\, f(x, y) \quad (9)$$

where $0 < p1, p2 < 1$ are constants; they are used for region-of-interest feature
extraction with KMs.
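For illustration, the following Python sketch evaluates the Krawtchouk polynomials via the terminating series of Eqs. (1)–(3) and the 2-D moments of Eq. (9); the binomial weight and the norm used for the weighted polynomials follow the standard forms in [9], since Eq. (4) is not reproduced in this excerpt:

```python
import numpy as np
from scipy.special import comb

def krawtchouk_poly(n, x, p, N):
    """K_n(x; p, N) via the terminating hypergeometric series of Eqs. (1)-(3)."""
    total, term = 0.0, 1.0
    for k in range(n + 1):
        if k > 0:
            # ratio of consecutive series terms with a=-n, b=-x, c=-N, z=1/p
            term *= (-n + k - 1) * (-x + k - 1) / ((-N + k - 1) * k * p)
        total += term
    return total

def krawtchouk_weight(x, p, N):
    """Binomial weight w(x; p, N) = C(N, x) p^x (1 - p)^(N - x)."""
    return comb(N, x) * p ** x * (1 - p) ** (N - x)

def weighted_krawtchouk(n, x, p, N):
    """Weighted polynomial K_hat = K_n * sqrt(w / rho), with the standard norm
    rho(n; p, N) = ((1 - p)/p)^n / C(N, n) (assumed from the literature [9])."""
    rho = ((1 - p) / p) ** n / comb(N, n)
    return krawtchouk_poly(n, x, p, N) * np.sqrt(krawtchouk_weight(x, p, N) / rho)

def krawtchouk_moments(image, order, p1=0.5, p2=0.5):
    """2-D Krawtchouk moments Q_nm of Eq. (9), with image indexed as f(x, y)."""
    N, M = image.shape
    Kx = np.array([[weighted_krawtchouk(n, x, p1, N - 1) for x in range(N)] for n in range(order + 1)])
    Ky = np.array([[weighted_krawtchouk(m, y, p2, M - 1) for y in range(M)] for m in range(order + 1)])
    return Kx @ image @ Ky.T
```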
3 Texture Features
where ∗ denotes the convolution operator, ∇fX (x, y) and ∇fY (x, y) are horizontal and
vertical gradients.
The magnitude and orientation of the gradients are given by
$$G = \sqrt{\nabla f_X(x, y)^2 + \nabla f_Y(x, y)^2} \quad (12)$$

$$\theta = \arctan\left(\frac{\nabla f_Y(x, y)}{\nabla f_X(x, y)}\right) \quad (13)$$
respectively.
The input image is divided into patches of 128 × 128 pixels, giving a total of 16
patches. Each pixel within a patch casts a weighted vote for an orientation θ, based on
the gradient magnitude G, over 9 bins evenly spaced over 0–180°. By taking 50 % overlap
between image patches, we get an output vector of size 7 × 7 × 9 = 441. The histogram is
normalized using the L2 norm, i.e.

$$f = \frac{v}{\sqrt{\|v\|_2^2 + e}} \quad (14)$$
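As an illustration, a minimal numpy sketch of the per-patch weighted orientation histogram with L2 normalization (Eqs. 12–14) could look as follows; the bin count and the small constant `e` follow the description above, while the gradient operator is a generic numpy gradient rather than a specific derivative filter:

```python
import numpy as np

def patch_orientation_histogram(patch, bins=9, e=1e-6):
    """Weighted orientation histogram of one image patch: each pixel votes with
    its gradient magnitude into one of `bins` bins over 0-180 degrees."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.sqrt(gx ** 2 + gy ** 2)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    hist, _ = np.histogram(ang, bins=bins, range=(0, 180), weights=mag)
    return hist / np.sqrt(np.sum(hist ** 2) + e)      # L2 normalization of Eq. (14)
```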
The 2-D Gabor filter is modelled according to the simple cells in the mammalian visual cortex [10,
15, 16]. The filter is given by

$$\psi(x, y;\, f_0, \theta) = \frac{f_0^2}{\pi \gamma \eta}\, e^{-\left(\frac{f_0^2}{\gamma^2} x'^2 + \frac{f_0^2}{\eta^2} y'^2\right)}\, e^{j 2\pi f_0 x'} \quad (15)$$

where

$$x' = x \cos\theta + y \sin\theta, \qquad y' = -x \sin\theta + y \cos\theta,$$

$f_0$ is the central frequency of the filter, $\theta$ is the rotation angle of the major axis of the ellipse, $\gamma$
is the sharpness along the major axis and $\eta$ is the sharpness along the minor axis.
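A sketch of a single Gabor kernel sampled on a square grid, directly following Eq. (15), is shown below (the grid size is an assumption):

```python
import numpy as np

def gabor_kernel(size, f0, theta, gamma=1.0, eta=1.0):
    """Complex Gabor kernel of Eq. (15), sampled on a size x size grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xp = x * np.cos(theta) + y * np.sin(theta)
    yp = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-((f0 ** 2 / gamma ** 2) * xp ** 2 + (f0 ** 2 / eta ** 2) * yp ** 2))
    return (f0 ** 2 / (np.pi * gamma * eta)) * envelope * np.exp(1j * 2 * np.pi * f0 * xp)
```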
4 Proposed Methods
In this section, we explain our proposed method for leaf identification with KMs
and HOG. The block diagram is shown in Fig. 1. The steps of the proposed method are
as follows.
Pre-processing Stage
The given RGB image is converted to a gray-scale image, and the gray-scale
image is further converted to a binary image.
Fig. 2 a Original image, b gray image, c binary image, d scale-invariant image, e shift-invariant image, f rotation-invariant image
Scale Normalization
Scale invariance is achieved by enlarging or reducing each shape such that the object
area (i.e., the zeroth-order moment $m_{00}$) is set to a predetermined value. The scale-normalized
image is shown in Fig. 2d.
Translation Normalization
Translation invariance is achieved by transforming the object such that its centroid is
moved to the origin. The translated image is shown in Fig. 2e.
Rotation Normalization
The image is rotated such that the major axis is vertical or horizontal, as
shown in Fig. 2f.
Feature Extraction and Classification
The KMs up to order 12 are computed on the pre-processed image, and the HOG features are
computed on the rotation-normalized image. Both features are concatenated and normalized.
The normalized feature vectors are given to the SVM for classification.
The steps of our proposed Gabor-feature-based system are shown in Fig. 3. In this
method we use 5 scales (frequencies) and 8 orientations; thus, a total of 40 Gabor
filters are used to obtain the texture features of the leaves. The RGB image is converted
to a gray image and then resized to 512 × 512 pixels. Low-pass filtering is performed to
remove the DC content of the image, and the smoothed image is then convolved with the
Gabor filter bank. The feature vector is obtained by dividing each filtered image into 64 blocks
and taking the local maximum of each block. Thus, from each filtered image
we get a 64-dimensional vector, and from all 40 Gabor filters we get a 2560-dimensional
vector.
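The block-maximum pooling described above can be sketched as follows, assuming a 512 × 512 gray image and a list `kernels` of 40 complex Gabor kernels (for example, 5 frequencies × 8 orientations built with the kernel sketch above):

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_texture_features(gray_512, kernels):
    """For each kernel: convolve, split the 512x512 response magnitude into an
    8x8 grid of 64x64 blocks and keep the local maximum of each block."""
    feats = []
    for k in kernels:
        resp = np.abs(fftconvolve(gray_512, k, mode='same'))
        blocks = resp.reshape(8, 64, 8, 64).max(axis=(1, 3))   # 64 local maxima
        feats.append(blocks.ravel())
    return np.concatenate(feats)          # 40 kernels -> 2560-dimensional vector
```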
4.3 Classification
The features obtained above are given to a SVM classifier for classification. We have
used the One-Against-All SVM [17, 18] technique for multiclass classification.
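A minimal scikit-learn sketch of this one-against-all SVM stage is given below; the 80/20 split and linear kernel follow the experimental setup reported in Sect. 5, while the feature matrix `X` and labels `y` are assumed to come from the feature-extraction steps above:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def train_one_vs_all_svm(X, y, test_size=0.2):
    """One-against-all linear SVM on leaf feature vectors X with class labels y."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              stratify=y, random_state=0)
    clf = OneVsRestClassifier(SVC(kernel='linear'))
    clf.fit(X_tr, y_tr)
    return clf, accuracy_score(y_te, clf.predict(X_te))
```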
5 Simulation Results
The combined KMs+HOG feature is tested on the Flavia database, which consists
of 32 classes with 65 leaf images per class. We obtained an accuracy
of 97.12 % using 80 % of the images for training and 20 % for testing. We also implemented
the ZMs+HOG method of [8] and obtained an accuracy of 96.86 %
on the Flavia database (Fig. 4).
Since the Krawtchouk moments overcome the discretization error of the Zernike
moments, the combined KMs+HOG feature gave an accuracy of 97.12 % compared
to 96.86 % for the ZMs+HOG features. The Gabor filter bank alone
gave an accuracy of 97.64 %. We also tested the combined ZMs+Gabor and
KMs+Gabor features, but the accuracy did not improve.
The simulation results obtained by our proposed method are listed in Table 1, and our
results are compared with different existing methods in Table 2.
6 Conclusions
This paper investigated the leaf species classification problem. The shape features
are computed using Krawtchouk moments. The texture features are computed using
histogram of oriented gradients and Gabor features. The Gabor filter bank outputs
give better texture description over HOG, because of multiple scales and orienta-
tions. For extracting the complete description of the leaf, we have combined shape
features with texture features. The simulation results show that our proposed method
outperforms the existing methods.
References
1. Arun Priya, C. T. Balasaravanan, and Antony Selvadoss Thanamani, “An efficient leaf recogni-
tion algorithm for plant classification using support vector machine,” 2012 International Con-
ference on Pattern Recognition, Informatics and Medical Engineering (PRIME), IEEE, 2012.
2. Z. Wang, Z. Chi, and D. Feng, “Shape based leaf image retrieval,” Vision, Image and Signal
Processing, IEE Proceedings-. Vol. 150. No. 1. IET, 2003.
3. Itheri Yahiaoui, Olfa Mzoughi, and Nozha Boujemaa, “Leaf shape descriptor for tree species
identification,” 2012 IEEE International Conference on Multimedia and Expo (ICME), IEEE,
2012.
4. Neeraj Kumar, Peter N. Belhumeur, Arijit Biswas, David W. Jacobs, W. John Kress, Ida Lopez
and Joao V. B. Soares, “Leafsnap: A computer vision system for automatic plant species iden-
tification,” Computer Vision ECCV 2012, Springer Berlin Heidelberg, 2012, 502–516.
5. Itheri Yahiaoui, Nicolas Herv, and Nozha Boujemaa, “Shape-based image retrieval in botani-
cal collections,” Advances in Multimedia Information Processing-PCM 2006. Springer Berlin
Heidelberg, 2006. 357–364.
6. Stephen Gang Wu, Forrest Sheng Bao, Eric You Xu, Yu-Xuan Wang, Yi-Fan Chang and Qiao-
Liang Xiang, “A leaf recognition algorithm for plant classification using probabilistic neural
network,” 2007 IEEE International Symposium on Signal Processing and Information Tech-
nology, IEEE, 2007.
7. Anant Bhardwaj, Manpreet Kaur, and Anupam Kumar, “Recognition of plants by Leaf Image
using Moment Invariant and Texture Analysis,” International Journal of Innovation and
Applied Studies 3.1 (2013): 237–248.
8. Dimitris G. Tsolakidis, Dimitrios I. Kosmopoulos, and George Papadourakis, “Plant Leaf
Recognition Using Zernike Moments and Histogram of Oriented Gradients,” Artificial Intelli-
gence: Methods and Applications, Springer International Publishing, 2014, 406–417.
9. Pew-Thian Yap, Raveendran Paramesran, and Seng-Huat Ong, “Image analysis by Krawtchouk
moments,” Image Processing, IEEE Transactions on 12.11 (2003): 1367–1377.
10. Tai Sing Lee, “Image representation using 2D Gabor wavelets,” Pattern Analysis and Machine
Intelligence, IEEE Transactions on 18.10 (1996): 959–971.
11. Michael Reed Teague, “Image analysis via the general theory of moments,” JOSA 70.8 (1980):
920–930.
12. R. Mukundan, S. H. Ong, and P. A. Lee, “Discrete vs. continuous orthogonal moments for
image analysis,” (2001).
13. M. Krawtchouk, “On interpolation by means of orthogonal polynomials,” Memoirs Agricul-
tural Inst. Kyiv 4 (1929): 21–28.
14. S. Padam Priyal, and Prabin Kumar Bora, “A robust static hand gesture recognition system
using geometry based normalizations and Krawtchouk moments,” Pattern Recognition 46.8
(2013): 2202–2219.
15. Joni-Kristian Kamarainen, Ville Kyrki, and Heikki Kälviäinen, “Invariance properties of Gabor
filter-based features-overview and applications,” Image Processing, IEEE Transactions on 15.5
(2006): 1088–1099.
16. Mohammad Haghighat, Saman Zonouz, and Mohamed Abdel-Mottaleb “CloudID: Trustwor-
thy cloud-based and cross-enterprise biometric identification,” Expert Systems with Applica-
tions 42.21 (2015): 7905–7916.
17. Fereshteh Falah Chamasemani, and Yashwant Prasad Singh, “Multi-class Support Vector
Machine (SVM) Classifiers-An Application in Hypothyroid Detection and Classification,”
2011 Sixth International Conference on Bio-Inspired Computing: Theories and Applications
(BIC-TA). IEEE, 2011.
18. Chih-Wei Hsu, and Chih-Jen Lin, “A comparison of methods for multiclass support vector
machines,” Neural Networks, IEEE Transactions on 13.2 (2002): 415–425.
19. Kulkarni, A. H. Rai, H. M. Jahagirdar, and K. A. Upparamani, “A leaf recognition technique for
plant classification using RBPNN and Zernike moments,” International Journal of Advanced
Research in Computer and Communication Engineering, 2(1), 984–988.
20. Alireza Khotanzad, and Yaw Hua Hong, “Invariant image recognition by Zernike moments,”
Pattern Analysis and Machine Intelligence, IEEE Transactions on 12.5 (1990): 489–497.
21. Navneet Dalal, and Bill Triggs “Histograms of oriented gradients for human detection,” Com-
puter Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference
on. Vol. 1. IEEE, 2005.
Depth Image Super-Resolution: A Review
and Wavelet Perspective
1 Introduction
With the increasing need for HR images, the size of image sensors has increased.
To accommodate more pixels in the same sensor area, the pixel size has started
shrinking. As the pixel size decreases, the sensor may not register the image
accurately due to shot noise; thus there is a limit to the pixel size beyond which it
affects the imaging. Commonly available optical cameras offer resolutions of the
order of 8 megapixels (MP), whereas the commercially available depth cameras
(Mesa Swiss Ranger, CanestaVision) offer fewer than 0.02 megapixels, which leads
to LR depth images. In applications like robot navigation, human-machine interfaces
(HMI), video surveillance, etc., HR images are desirable for decision making. As the
power consumed by a depth sensor camera increases with the sensor's resolution,
power consumption can become a critical constraint in mobile applications like robot
navigation. It is therefore convenient to capture depth images with LR depth cameras
and super-resolve them remotely using hardware/software solutions.
SR is a class of techniques that enhances image quality; the quality of an image is
determined by its resolution, which is defined by how closely separated lines can
still be distinguished. SR is an ill-posed problem: it has many solutions for a given
input image. Resolution can be classified into spatial, temporal, spectral, and
radiometric; in this paper we discuss the spatial resolution enhancement of depth
images unless otherwise specified.
The principle behind classical SR methods is to fuse multiple LR images of the
same static scene, viewed by a single camera and displaced from each other by
sub-pixel shifts. Each LR image contributes to the final reconstruction of the HR
image. The super-resolved output must contain the high-frequency details (e.g.
edges, textures) and should be plausible to the human eye. Various methods have
been developed on this principle to enhance intensity images, but they are limited to
magnification factors of about ×4. Similar methods have been applied to super-resolve
depth images, where they give better results at higher magnification factors
(e.g. ×4, ×8, or ×16).
A depth image contains information about the distance of objects from the camera
position. Depth images are used in many applications: in machine vision, where
inspection and logistics systems use them for detecting objects and measuring
volumes; in robotics, for obstacle detection; in industrial applications, for assembly
and monitoring; in medicine, for endoscopic surgery and patient monitoring; and in
surveillance, to determine the presence, location, and number of persons. Since many
applications require depth images, and since the resolution of depth camera sensors
is typically low, there is a need for a solution that produces an HR image, without
disturbing the image integrity, for better visibility. One option is to increase the
number of pixels on the same sensor size, but the resulting dense pixel grid
introduces shot noise during imaging; the other is to keep enough distance between
the pixels and increase the sensor size to hold more pixels, but this makes the camera
costlier and heavier. The number of pixels on a given sensor is therefore not
something the user can control. Thus, an appropriate solution is to super-resolve the
captured LR images off-line, which maintains the commercially available pricing of
the cameras and reduces the computational cost of the system.
Depth image super-resolution (DISR) methods are mainly concerned with preserving
the discontinuity information (e.g. edges). Since a depth image does not contain
texture, it can be super-resolved to a higher magnification factor (beyond ×4) than
other images such as natural images, medical images, or domain-specific images
(e.g. face images).
The available methods are classified into two groups: DISR without an intensity
image, and DISR with a registered intensity image as a cue (also called RGB-D
methods). We review the work on RGB-D methods, which is our first contribution,
and we propose a method for DISR using the DWT, which is our second contribution.
DWT has been widely used in super-resolving intensity images, but it has not been
tried for depth image super-resolution. The proposed method uses interpolation of
the DWT coefficients (LL, LH, HL, and HH) in super-resolving the depth image. The
proposed method introduces an intermediate step to recover the high-frequency
information by normalizing the input image to the value range of the LL image, and
then finally combining all the sub-bands (except the interpolated LL sub-band) with
the normalized input image to generate the HR depth image using the inverse
discrete wavelet transform (IDWT).
Section 2 discusses depth imaging techniques. Section 3 describes the available
methods for depth image super-resolution, concentrating mainly on DISR methods
with an HR color image as a cue. The proposed DWT method for DISR and its
comparative results are discussed in detail in Sect. 4, followed by the conclusions
drawn in Sect. 5.
2 Depth Imaging
Depth images can be estimated or captured in various ways. A common way of
estimating depth is the stereo imaging set-up. This is a passive technique suitable
for static scenes. In a stereo imaging set-up there are two optical cameras facing the
same scene with their optical axes making a finite angle. An angle of 0° means their
optical axes are parallel, and 180° means they are facing each other; to obtain a
proper depth estimate the axes should make a finite angle between 0° and 180°,
which makes their image planes non-coplanar. A smaller angle means less common
scene between the two views, and thus a smaller search area for matching; a larger
angle means more common area between the views for matching. These images need
to be brought to the same plane (rectification) before reconstructing the 3D scene.
The cameras are placed apart by a distance called the baseline. A small baseline
gives a less precise distance estimate, and a large baseline gives a more precise
estimate but at the cost of matching accuracy [1]. Since the scene is viewed from two
different locations, the pair conveys the depth information of the scene. The depth
map obtained by this method has the same size as the intensity images captured by
the stereo cameras. A review of most point-correspondence stereo methods can be
found in [2].
The other way of obtaining a depth image is to use a depth camera, which uses the
time-of-flight (ToF) principle to find the distance of an object from the camera
position. These are active sensors suitable for dynamic scenes at higher frame rates.
Several depth cameras based on this principle are available on the market, e.g.
Microsoft Kinect (640 × 480), Swiss Ranger SR4000 (176 × 144), Photonic Mixer
Devices (PMD), and CanestaVision. An IR light-emitting diode (LED) emits IR light,
and the IR light reflected from the object is received by the IR camera. The system
calculates the distance of the object by measuring the phase of the returned IR light.
The ToF distance calculation is based on Eq. (1),

$D_{max} = \frac{1}{2}\, c\, t_0$,  (1)

where $D_{max}$ is the maximum distance measurable from the camera position, c is the
speed of light (2.99792458 × 10^8 m/s), and $t_0$ is the pulse width in ns.
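As a quick numerical illustration of Eq. (1), with an arbitrarily chosen pulse width:

```python
# Maximum unambiguous ToF distance for a pulse of width t0 (Eq. 1).
c = 2.99792458e8     # speed of light in m/s
t0 = 33e-9           # example pulse width of 33 ns (arbitrary choice)
D_max = 0.5 * c * t0
print(f"D_max = {D_max:.2f} m")   # about 4.95 m
```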
In this section we describe how a high-resolution RGB image can help in
super-resolving the LR depth image. The task is to super-resolve the input LR depth
image (captured by an LR depth camera) to the size of the high-resolution RGB
image (captured by an HR optical camera) of the same scene. The assumption is that
the two input images are co-aligned, such that a point in the LR depth image
coincides with the same point in the HR intensity image.
The Markov random field (MRF) framework [3] has been used in many image
processing and computer vision tasks, among them image segmentation, image
registration, stereo matching, and super-resolution. The MRF approach has been
used for range image super-resolution [4], where the MRF graphically models the
integration of the LR range image with the HR camera image. The mode of the
probability distribution defined by the MRF provides the HR range image, and a fast
conjugate-gradient optimization algorithm is applied to the MRF inference model to
find this mode. In this formulation, x and z are the variables corresponding to the
image pixels and the range measurements, respectively, and y represents the
reconstructed range image, whose density is the same as that of the camera image.
Variables u and w are the image gradient and range discontinuity respectively. The
MRF is defined through the following potentials:
The depth measurement potential,
$\Psi = \sum_{i \in L} k\,(y_i - z_i)^2$  (2)
where L is the set of indexes for which depth measurement is available, k is the
constant weight, y is estimated range in HR grid, and z is the measured range which
is on LR grid.
The depth smoothness prior,
$\Phi = \sum_{i} \sum_{j \in N(i)} w_{ij}\,(y_i - y_j)^2$  (3)
where N(i) is the neighborhood of i, wij is the weighting factor between the center
pixel i and its neighbor j. The resulting MRF is defined as the conditional probability
over variable y which is defined through the constraints 𝛹 and 𝛷,
$p(y \mid x, z) = \frac{1}{Z} \exp\!\left(-\frac{1}{2}(\Phi + \Psi)\right)$  (4)
where Z is normalizer (partition function).
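A minimal NumPy sketch of the energy in Eqs. (2)–(4) with a plain gradient-descent minimizer is shown below; the weights w_h, w_v, the constant k, the step size, and the initialization are illustrative placeholders, and this is not the conjugate-gradient solver used in [4].

```python
import numpy as np

def mrf_energy(y, z, mask, w_h, w_v, k=1.0):
    """Phi + Psi of Eqs. (2)-(3): data term on measured pixels (mask) plus a
    weighted quadratic smoothness term over horizontal/vertical neighbours."""
    psi = k * np.sum(mask * (y - z) ** 2)
    phi = np.sum(w_h * (y[:, 1:] - y[:, :-1]) ** 2) + \
          np.sum(w_v * (y[1:, :] - y[:-1, :]) ** 2)
    return psi + phi

def mrf_grad(y, z, mask, w_h, w_v, k=1.0):
    """Gradient of the energy with respect to the HR range estimate y."""
    g = 2.0 * k * mask * (y - z)
    dh = 2.0 * w_h * (y[:, 1:] - y[:, :-1])
    dv = 2.0 * w_v * (y[1:, :] - y[:-1, :])
    g[:, 1:] += dh
    g[:, :-1] -= dh
    g[1:, :] += dv
    g[:-1, :] -= dv
    return g

def mrf_upsample(z_on_hr_grid, mask, w_h, w_v, iters=200, step=0.1):
    """Approximate the mode of Eq. (4) by minimising the energy with gradient descent.
    z_on_hr_grid: LR measurements placed on the HR grid; mask: 1 where measured."""
    y = z_on_hr_grid.copy()
    for _ in range(iters):
        y -= step * mrf_grad(y, z_on_hr_grid, mask, w_h, w_v)
    return y
```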
With the advent of bilateral filtering (BF) [5], many researchers have tried adapting
it to realistic scenarios for depth images (such as depth images with heavy noise).
BF is a non-linear, non-iterative, local, and simple method for smoothing an image
while preserving its edges. It is called bilateral because it combines domain filtering
(based on closeness) and range filtering (based on similarity). The filtering is
essentially a weighted average of the pixels in the neighborhood. The usual intuition
of letting the filter weights fall off slowly over space as we move away from the
center pixel assumes that the image varies slowly over space; this assumption breaks
down at edges, which is why purely spatial filtering blurs them. Bilateral filtering of
a range image therefore averages image values with weights that decay with
dissimilarity.
The filter applied to an image f(x) produces an output image h(x). Combining domain
and range filtering enforces both geometric and photometric locality:

$\mathbf{h}(\mathbf{x}) = k^{-1}(\mathbf{x}) \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \mathbf{f}(\xi)\, c(\xi, \mathbf{x})\, s(\mathbf{f}(\xi), \mathbf{f}(\mathbf{x}))\, d\xi$  (5)
where c(𝜉, 𝐱) and s(𝐟(𝜉), 𝐟(𝐱)) are the geometric closeness and photometric
similarity, respectively, between the neighborhood center x and a nearby point 𝜉, and
k(𝐱) is the normalizing factor. In smooth regions this bilateral filter acts as a standard
domain filter, but at boundaries the similarity term adapts to the center pixel,
maintaining good filtering while preserving the edge information.
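A brute-force, didactic sketch of Eq. (5) over a finite window is given below; the Gaussian kernels, window radius, and sigma values are assumptions, not prescribed by [5].

```python
import numpy as np

def bilateral_filter(img, radius=3, sigma_s=3.0, sigma_r=0.1):
    """Brute-force bilateral filter (Eq. 5): weight = spatial closeness c * range similarity s."""
    H, W = img.shape
    pad = np.pad(img, radius, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))          # c(xi, x)
    for i in range(H):
        for j in range(W):
            patch = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            rng_w = np.exp(-((patch - img[i, j])**2) / (2 * sigma_r**2))  # s(f(xi), f(x))
            w = spatial * rng_w
            out[i, j] = np.sum(w * patch) / np.sum(w)               # k^-1(x) normalisation
    return out
```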
Bilateral filtering was further extended to image upsampling [6]. The authors showed
that this approach is useful for applications like stereo depth, image colorization,
tone mapping, and graph-cut based image composition, which need a global solution.
Recently, joint bilateral filters have been introduced, in which the range filter is
applied to a second guidance image Ĩ. Thus,
$J_p = \frac{1}{k_p}\sum_{q \in \Omega} I_q\, f(\|p - q\|)\, g(\|\tilde{I}_p - \tilde{I}_q\|)$  (6)
Here the only difference is that the range filter uses Ĩ instead of I. Using this joint
bilateral filter, a method called joint bilateral upsampling (JBU) [6] has been
proposed to upsample an LR solution S given an HR image Ĩ. The upsampled
solution S̃ is obtained as

$\tilde{S}_p = \frac{1}{k_p}\sum_{q\downarrow \in \Omega} S_{q\downarrow}\, f(\|p\downarrow - q\downarrow\|)\, g(\|\tilde{I}_p - \tilde{I}_q\|)$  (7)

where p↓ and q↓ denote the (possibly fractional) coordinates of the HR pixels p and q
on the LR grid.
The noise-aware filter for depth upsampling (NAFDU) [7] extends this formulation by
blending the guidance-based range kernel with a depth-based range kernel:

$\tilde{S}_p = \frac{1}{k_p}\sum_{q\downarrow \in \Omega} I_{q\downarrow}\, f(\|p\downarrow - q\downarrow\|)\,\big[\alpha(\Delta_\Omega)\, g(\|\tilde{I}_p - \tilde{I}_q\|) + (1-\alpha(\Delta_\Omega))\, h(\|I_{p\downarrow} - I_{q\downarrow}\|)\big]$  (8)
where f (⋅), g(⋅) and h(⋅) are the Gaussian functions, and 𝛼(𝛥𝛺 ) is a blending function.
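Under the same notation, an unoptimized sketch of joint bilateral upsampling, Eq. (7), might look as follows; the upsampling factor, window radius, and Gaussian sigmas are illustrative choices.

```python
import numpy as np

def joint_bilateral_upsample(S_lr, I_hr, factor, radius=2, sigma_s=1.0, sigma_r=0.1):
    """JBU (Eq. 7): for each HR pixel p, average LR depth samples S(q_down) using a
    spatial kernel f on LR coordinates and a range kernel g on the HR guidance I."""
    Hh, Wh = I_hr.shape
    out = np.zeros((Hh, Wh))
    for py in range(Hh):
        for px in range(Wh):
            py_d, px_d = py / factor, px / factor            # p-down in LR coordinates
            num, den = 0.0, 0.0
            for qy in range(int(py_d) - radius, int(py_d) + radius + 1):
                for qx in range(int(px_d) - radius, int(px_d) + radius + 1):
                    if not (0 <= qy < S_lr.shape[0] and 0 <= qx < S_lr.shape[1]):
                        continue
                    f = np.exp(-((py_d - qy)**2 + (px_d - qx)**2) / (2 * sigma_s**2))
                    # guidance value at the HR position corresponding to q-down
                    Iq = I_hr[min(qy * factor, Hh - 1), min(qx * factor, Wh - 1)]
                    g = np.exp(-(I_hr[py, px] - Iq)**2 / (2 * sigma_r**2))
                    num += S_lr[qy, qx] * f * g
                    den += f * g
            out[py, px] = num / max(den, 1e-12)
    return out
```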
Another work [8] increases the range image resolution of a time-of-flight (ToF)
camera using a multi-exposure data acquisition technique and Projection onto
Convex Sets (POCS) reconstruction. Their method can also be used for other
modalities capable of capturing range images (e.g. Light Detection and Ranging,
LADAR). The problem with the assumption of using an HR color image to
super-resolve the LR range image is that it does not always hold: shading and
illumination variations can produce false depth discontinuities, and noise in a region
can introduce texture copy artifacts. The novel idea is to vary the integration time
(exposure) during range image acquisition for depth super-resolution. A low
integration time is suitable for capturing the near field but boosts noise in the far
field, while a high integration time captures objects in the far field but causes depth
saturation in the near field. The idea of multi-exposure is therefore to merge the
useful depth information from the different levels and eliminate saturation and
noise. Modeling the depth map as a function of various parameters, the image
formation becomes
𝐃i = f (𝛼i 𝐇i 𝐪 + 𝜂i ), (9)
where 𝐃i is the ith low-resolution depth image, 𝐪 is the high-resolution ground-truth
surface, 𝐇i is a linear mapping (comprising motion, blurring, and downsampling), 𝛼i is
the time-dependent exposure duration, and f(⋅) is the opto-electronic conversion
function that converts the phase correlation at each spatial location to a depth value.
Often, the estimated HR depth image, when subsampled, does not lead back to the
initial LR depth image used as input, which means the HR solution drifts slightly
away from the ground truth. Thus, [9] combines the advantages of guided image
filtering [10] with reconstruction constraints. The range image F is modeled as a
locally linear function of the camera image I,

$F_i \approx a_k^{T} I_i + b_k, \quad \forall i \in w_k$  (10)

where $w_k$ is a local window centered at pixel k. The solution of the resulting
optimization problem can be computed using the steepest descent method.
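A compact box-filter sketch of the local linear model of Eq. (10), in the spirit of the guided filter [10], is shown below; the reconstruction-constraint refinement of [9] is omitted, and the radius and eps values are placeholders.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(I, F, radius=4, eps=1e-3):
    """Guided filtering of depth F with camera image I (local model F_i ~ a_k I_i + b_k)."""
    size = 2 * radius + 1
    mean_I = uniform_filter(I, size)
    mean_F = uniform_filter(F, size)
    corr_II = uniform_filter(I * I, size)
    corr_IF = uniform_filter(I * F, size)
    var_I = corr_II - mean_I * mean_I
    cov_IF = corr_IF - mean_I * mean_F
    a = cov_IF / (var_I + eps)          # a_k for each local window w_k
    b = mean_F - a * mean_I             # b_k for each local window w_k
    mean_a = uniform_filter(a, size)    # average coefficients over overlapping windows
    mean_b = uniform_filter(b, size)
    return mean_a * I + mean_b
```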
A fusion of weighted median filtering and bilateral filtering for range image
upsampling was proposed in [11] and named the bilateral weighted median filter.
The weighted median filter finds the value b minimizing the sum of weighted
absolute errors over the given data,

$\arg\min_{b} \sum_{\mathbf{y} \in N(\mathbf{x})} W(\mathbf{x}, \mathbf{y})\,|b - I_{\mathbf{y}}|$,  (13)
where W(𝐱, 𝐲) is the weight assigned to pixel 𝐲 inside a local region centered at pixel
𝐱. The bilateral weighted median filter corresponds to the following minimization
problem,

$\arg\min_{b} \sum_{\mathbf{y} \in N(\mathbf{x})} f_S(\mathbf{x}, \mathbf{y}) \cdot f_R(I_{\mathbf{x}}, I_{\mathbf{y}})\,|b - I_{\mathbf{y}}|$  (14)
where $f_S(\cdot)$ and $f_R(\cdot)$ are the spatial and range filter kernels, respectively.
This method alleviates the texture transfer problem, is more robust to misalignment,
and also removes depth bleeding artifacts.
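A direct (slow but explicit) sketch of the bilateral weighted median of Eq. (14) is given below: for each pixel, the value minimizing the weighted absolute error is the weighted median of its neighborhood. The kernel shapes and parameters are assumptions.

```python
import numpy as np

def weighted_median(values, weights):
    """Value minimising sum_i w_i |b - v_i|, i.e. the weighted median."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    c = np.cumsum(w)
    return v[np.searchsorted(c, 0.5 * c[-1])]

def bilateral_weighted_median(D, I, radius=3, sigma_s=3.0, sigma_r=0.1):
    """Eq. (14): weights are the product of a spatial kernel f_S and a range kernel f_R on I."""
    H, W = D.shape
    out = np.zeros_like(D, dtype=np.float64)
    for y in range(H):
        for x in range(W):
            y0, y1 = max(0, y - radius), min(H, y + radius + 1)
            x0, x1 = max(0, x - radius), min(W, x + radius + 1)
            yy, xx = np.mgrid[y0:y1, x0:x1]
            fs = np.exp(-((yy - y)**2 + (xx - x)**2) / (2 * sigma_s**2))
            fr = np.exp(-(I[y0:y1, x0:x1] - I[y, x])**2 / (2 * sigma_r**2))
            out[y, x] = weighted_median(D[y0:y1, x0:x1].ravel(), (fs * fr).ravel())
    return out
```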
Lu and Forsyth [12] proposed a method that fully exploits the relationship between
RGB image segmentation boundaries and depth boundaries. Each segment has its own
depth field, which is reconstructed independently using a novel smoothing method.
They show results for super-resolution factors from ×4 up to ×100, and they also
contribute a novel dataset. A notable strength of this work is that it can recover a
high-resolution depth image that is close to the ground truth even from aggressively
subsampled input.
The DWT has been widely used in image processing tasks; it gives insight into an
image's spatial and frequency characteristics. One DWT operation decomposes the
input image into four sets of coefficients: one of approximation coefficients (the
low-low (LL) sub-band) and three of detail coefficients (the low-high (LH), high-low
(HL), and high-high (HH) sub-bands).
The Haar wavelet basis has shown an intrinsic relationship with the super-resolution
task; thus, the Haar basis is used throughout this paper unless otherwise stated.
DWT-based approaches have been used for super-resolving intensity images. They
prove useful for applications like satellite imaging, where the cameras mounted on
satellites are of low resolution. For intensity image super-resolution, [13–15]
proposed methods for the multi-frame super-resolution problem: [14] uses the
Daubechies db4 filter to super-resolve by a factor of ×4, [13] uses a batch algorithm
for image alignment and reconstruction with a wavelet-based iterative algorithm, and
[15] combines Fourier and wavelet processing for deconvolution and de-noising in
multi-frame super-resolution.
Our work is in line with [16, 17], where a DWT-based method is used with
intermediate steps for obtaining the high-frequency information from the available
LR image and its corresponding interpolated coefficients (LL, LH, HL, and HH). In
the available methods, the LL sub-band of the DWT of the input LR depth image is
left untouched, and the LR input image is used directly together with the estimated
intermediate high-frequency sub-bands for super-resolution. Our method differs from
the existing methods in two respects: first, we apply the DWT approach to depth
image super-resolution, which has not been tried before; and secondly, we modify
the input depth image with respect to the interpolated LL sub-band and then combine
it with the interpolated high-frequency sub-bands (LH, HL, and HH), as shown in
Fig. 1.
Fig. 1 Block diagram of the proposed algorithm for depth image super-resolution using the DWT: the low-resolution input image (m × n) is decomposed into LL, LH, HL, and HH sub-bands; all sub-bands are interpolated; the input image is normalized against the interpolated LL sub-band to extract edge information; and the IDWT of the combined sub-bands gives the high-resolution image (αm × αn)
The proposed method uses the Haar wavelet. The Haar wavelet is a square-shaped,
rescaled sequence that together forms a wavelet family or wavelet basis. The Haar
basis was chosen because of its advantageous property of capturing sudden
transitions, which can be exploited in a super-resolution method for edge
preservation.
Figure 1 shows the block diagram of the proposed method for depth image
super-resolution using the DWT. The input image, captured by the LR depth camera,
is decomposed into the approximation sub-band and the detail sub-bands. All these
sub-bands are interpolated by a factor of ×2 using bicubic interpolation, since
bicubic interpolation is more sophisticated than the other standard interpolation
methods. As we super-resolve the depth image by a factor of ×2, the input image is
decomposed only once (level-1 decomposition). Once the sub-band images have been
interpolated, the original LR depth input image and the interpolated LL sub-band
image are used to extract the high-frequency detail: the input is first normalized to
the range [0, 1] and then rescaled to lie between the minimum and the maximum of
the interpolated LL sub-band. The output of this process is then combined with the
interpolated high-frequency detail sub-band images (LH, HL, and HH), and the
IDWT is applied to obtain the high-resolution image.
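A minimal PyWavelets-based sketch of the pipeline of Fig. 1, under our reading of the description above; the use of OpenCV's bicubic resize and the exact normalization step are assumptions and may differ in detail from the authors' implementation.

```python
import numpy as np
import pywt
import cv2

def dwt_depth_sr(lr_depth):
    """x2 depth SR: level-1 Haar DWT, bicubic interpolation of all sub-bands,
    replacement of the interpolated LL by the input image rescaled to its range,
    then IDWT of the combined sub-bands."""
    lr = lr_depth.astype(np.float64)
    h, w = lr.shape
    LL, (LH, HL, HH) = pywt.dwt2(lr, "haar")   # approximation and detail sub-bands

    def up(band):
        big = cv2.resize(band, (band.shape[1] * 2, band.shape[0] * 2),
                         interpolation=cv2.INTER_CUBIC)
        return big[:h, :w]                      # crop in case of odd input dimensions

    LLi, LHi, HLi, HHi = up(LL), up(LH), up(HL), up(HH)

    # Normalize the input to [0, 1], then stretch it to the value range of the
    # interpolated LL sub-band, so it can replace LLi in the synthesis step.
    norm = (lr - lr.min()) / max(lr.max() - lr.min(), 1e-12)
    base = LLi.min() + norm * (LLi.max() - LLi.min())

    return pywt.idwt2((base, (LHi, HLi, HHi)), "haar")   # HR image, ~2x the input size

# Example usage: hr = dwt_depth_sr(lr_depth_image)
```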
Fig. 2 Test images from the Middlebury and Tsukuba stereo datasets with their original resolutions: a Sawtooth 380 × 434, b Cone 375 × 450, c Teddy 375 × 450, d Art 370 × 463, e Aloe 370 × 427, f Tsukuba 480 × 640 (all images are the left image of the stereo pair)
Table 1 PSNR (SSIM) of the proposed DWT method for depth image super-resolution on the Middlebury and Tsukuba datasets, magnification factor ×2

Images     Bilinear        Bicubic         Proposed
Sawtooth   38.72 (0.98)    38.51 (0.98)    39.64 (0.98)
Cone       29.06 (0.94)    28.78 (0.94)    30.47 (0.94)
Teddy      28.49 (0.95)    28.22 (0.95)    29.55 (0.95)
Art        31.21 (0.96)    30.95 (0.96)    32.10 (0.98)
Aloe       31.59 (0.97)    31.29 (0.97)    32.36 (0.99)
Tsukuba    42.00 (0.99)    41.70 (0.99)    43.31 (0.99)
Fig. 4 Cropped region 300 × 200 of Art from the proposed DWT method
5 Conclusions
After reviewing the existing methods for the ill-posed problem of depth image
super-resolution, we find that there is still scope for super-resolving depth images
captured by low-resolution depth sensor cameras using, as a cue, the high-resolution
RGB image available at low cost from optical cameras with many more megapixels.
We have seen methods like the joint bilateral filter, which upsamples the LR depth
image using an RGB image of the same scene but suffers from texture transfer in the
presence of heavy noise. To overcome this, a noise-aware filter for depth upsampling
(NAFDU) was proposed, which eliminates the texture copy problem. NLM filtering
has also been used for upsampling, but it results in depth bleeding at fine edges; a
combination of weighted median and bilateral filtering solves the depth bleeding
problem. With the recent development of sparse methods, even aggressively
subsampled depth images can be recovered to nearly ground-truth quality. Looking
at these methods and their results, it appears that many of these algorithms could be
combined to produce better results.
We have proposed an intermediate step for obtaining the high-frequency details,
which normalizes the input image with respect to the LL sub-band of the DWT of the
input LR depth image. The proposed method has been tested on the widely used
Middlebury and Tsukuba datasets and performs better than the conventional
interpolation methods, showing an average improvement of 1.33 dB in PSNR over
the selected test images compared with bicubic interpolation for a ×2 magnification
factor.
References
1. Okutomi, M., Kanade, T.: A multiple-baseline stereo. IEEE Transactions on Pattern Analysis
and Machine Intelligence 15(4), 353–363 (1993)
2. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspon-
dence algorithms. International journal of computer vision 47(1–3), 7–42 (2002)
3. Geman, S., Geman, D.: Stochastic relaxation, gibbs distributions, and the bayesian restora-
tion of images. IEEE Transactions on Pattern Analysis and Machine Intelligence (6), 721–741
(1984)
4. Diebel, J., Thrun, S.: An application of markov random fields to range sensing. In: NIPS. vol. 5,
pp. 291–298 (2005)
5. Tomasi, C., Manduchi, R.: Bilateral filtering for gray and color images. In: Sixth International
Conference on Computer Vision, 1998. pp. 839–846. IEEE (1998)
6. Kopf, J., Cohen, M.F., Lischinski, D., Uyttendaele, M.: Joint bilateral upsampling. In: ACM
Transactions on Graphics (TOG). vol. 26, p. 96. ACM (2007)
7. Chan, D., Buisman, H., Theobalt, C., Thrun, S.: A noise-aware filter for real-time depth
upsampling. In: Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and
Applications-M2SFA2 2008 (2008)
8. Gevrekci, M., Pakin, K.: Depth map super resolution. In: 2011 18th IEEE International Con-
ference on Image Processing (ICIP), pp. 3449–3452. IEEE (2011)
Depth Image Super-Resolution: A Review and Wavelet Perspective 555
9. Yang, Y., Wang, Z.: Range image super-resolution via guided image filter. In: Proceedings of
the 4th International Conference on Internet Multimedia Computing and Service. pp. 200–203.
ACM (2012)
10. He, K., Sun, J., Tang, X.: Guided image filtering. IEEE Transactions on Pattern Analysis and
Machine Intelligence 35(6), 1397–1409 (2013)
11. Yang, Q., Ahuja, N., Yang, R., Tan, K.H., Davis, J., Culbertson, B., Apostolopoulos, J., Wang,
G.: Fusion of median and bilateral filtering for range image upsampling. IEEE Transactions on
Image Processing 22(12), 4841–4852 (2013)
12. Lu, J., Forsyth, D.: Sparse depth super resolution. In: Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition. pp. 2245–2253 (2015)
13. Ji, H., Fermuller, C.: Robust wavelet-based super-resolution reconstruction: theory and algo-
rithm. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(4), 649–660 (2009)
14. Nguyen, N., Milanfar, P.: A wavelet-based interpolation-restoration method for superresolution
(wavelet superresolution). Circuits, Systems and Signal Processing 19(4), 321–338 (2000)
15. Robinson, M.D., Toth, C., Lo, J.Y., Farsiu, S., et al.: Efficient fourier-wavelet super-resolution.
IEEE Transactions on Image Processing 19(10), 2669–2681 (2010)
16. Demirel, H., Anbarjafari, G.: Discrete wavelet transform-based satellite image resolution
enhancement. IEEE Transactions on Geoscience and Remote Sensing 49(6), 1997–2004
(2011)
17. Demirel, H., Anbarjafari, G.: Image resolution enhancement by using discrete and stationary
wavelet decomposition. IEEE Transactions on Image Processing 20(5), 1458–1460 (2011)
18. Scharstein, D., Szeliski, R.: High-accuracy stereo depth maps using structured light. In: 2003
IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Pro-
ceedings. vol. 1, pp. I–195. IEEE (2003)
19. Peris, M., Maki, A., Martull, S., Ohkawa, Y., Fukui, K.: Towards a simulation driven stereo
vision system. In: 2012 21st International Conference on Pattern Recognition (ICPR), pp.
1038–1042. IEEE (2012)
On-line Gesture Based User Authentication
System Robust to Shoulder Surfing
1 Introduction
The process by which a system recognizes a user, or verifies the identity of a user
trying to access it, is known as authentication. Installing a robust authentication
technique that prevents impersonation is of utmost importance for any personalized
system, since it plays a major role in defending against unauthorized access. The
procedure for establishing the identity of a user can be broadly branched into three
categories [1]:
1. Proof by Knowledge—A user’s identity can be authenticated with the help of
information which is known only to the actual user. (e.g. Password)
2. Proof by Possession—Here the authentication is done with the help of an object
specific to and in possession of the real user. (e.g. Smart Card)
3. Proof by Property—The user’s identity is validated by measuring certain prop-
erties (e.g. Biometrics) and comparing these against the claimed user’s original
properties (e.g. Fingerprint).
The majority of research in this domain focuses on proof by knowledge, where
validation relies on passwords, PINs, or pattern-based techniques. These
authentication schemes are particularly vulnerable to shoulder surfing, as shown in
Fig. 1, which is a form of spying used to gain knowledge of a person's password or
identity information. The forger or imposter may observe the password, PIN, or
pattern being entered during authentication and later use it to impersonate a valid
user. Extensive research is ongoing in this field to aid applications such as
authentication to prevent e-financial incidents [7]. Most of these applications use
keystroke patterns [8], biometrics [9, 11], or password entry [10] for authentication.
The visual feedback provided by the above-mentioned techniques makes them
vulnerable to identity theft. A possible remedy is to exploit the fact that the field of
view of the valid user differs from that of the impersonator during shoulder surfing.
Combining this with the minimization of visual feedback is likely to produce a
robust system resistant to user impersonation.
This paper proposes a novel authentication technique to avoid identity theft caused
mainly by shoulder surfing. We use a pattern-based authentication technique without
visual feedback (unlike the pattern-based authentication used in touch-enabled
devices), where a Leap Motion device serves as the sensor to capture the input
signal. The device's interface is used to create patterns with the help of on-air
gestures. The Leap Motion sensor1 is a recent release by Leap Motion Inc. It can
capture real-time movement of the fingers and can track precise movements of the
hand and fingers in three-dimensional space, with a tracking accuracy of 0.01 mm.
The device is currently used for various gesture-based applications like serious
gaming [13], human-computer interfaces, augmented reality, and physical
rehabilitation [12]. It is a low-cost device of small size, supports a number of
frameworks, and is fairly accurate. These features of the device make it a good
choice compared to other similar devices such as Microsoft's Kinect or Intel's
RealSense.
1 https://www.leapmotion.com/.
Fig. 1 An instance portraying authentication by a legitimate user while an imposter is applying shoulder surfing
For proper tracking of the hand or fingers, a user should place his/her hand in the
field of view of the device. Its range is about 150° with the distance constrained to
less than a meter. The device comprises a pair of infra-red cameras and three LEDs,
providing a frame rate varying from 20 to 200 fps. Information regarding the position
of the fingers, the palm, or the frame time-stamp can be obtained from each frame.
We have developed a methodology to use this device for authentication on
personalized devices. We start by partitioning the 2D screen or display into
non-overlapping rectangular blocks and mapping it to the 3D field of view of the
device. Assuming each of these blocks represents one character or symbol of the
alphabet, users are asked to draw patterns in the air. During the process, we do not
provide any visual feedback to the user; therefore, no cursor movement is seen on the
screen. The task of recognizing these patterns can then be performed by classifiers
such as the Hidden Markov Model (HMM) [5], Support Vector Machine (SVM), or
Conditional Random Field (CRF). Here we use the HMM as the classifier because of
its ability to model sequential dependencies and its robustness to intra-user
variations. We train an independent HMM for each distinct pattern in the training set,
and a given sequence is then verified against all trained models; the model with the
maximum likelihood is taken as the best choice.
The rest of the paper is organized as follows. In Sect. 2, the proposed methodology
is presented. Results obtained using a large set of samples collected in the laboratory,
involving several volunteers, are presented in Sect. 3. We conclude in Sect. 4 by
highlighting some possible future extensions of the present work.
This section describes the signal acquisition, the field-of-view mapping, and the
training and testing of the authentication methodology.
First, we divide the whole screen or display into non-overlapping rectangular boxes
and label each box. For example, the screen can be arranged as a 4 × 4 matrix
labelled “A” to “P”, as depicted in Fig. 2. Using the finger and hand tracking utility
of the Leap Motion device, we track the movement of a user's index finger while the
authentication gesture is performed. Initially, we provide visual feedback to the user
in the form of a visible cursor, which helps the user get an idea of the initial position
of his/her finger with respect to the screen. Before drawing the authentication
pattern, the user therefore first executes a predefined gesture (e.g. a circle gesture)
that is used as a marker to start the authentication pattern; thereafter we hide the
cursor, the visual feedback is removed, and the user draws the pattern through sense
and anticipation. We tested various gestures such as swipe, screen-tap, key-tap, and
circle to find the best choice for the start marker. The circle gesture was found to be
the most suitable and comfortable by the volunteers; based on their feedback, and on
the requirement that executing the gesture should give the user knowledge of the
finger position on screen before the cursor is hidden, the circle gesture was adopted.
Fig. 2 Partitioning of the display into non-overlapping blocks and assignment of symbols or alphabets
Next, we present the method for mapping the field of view of the Leap Motion device
to the display screen (e.g. a computer screen). Since the display screen is
rectangular, instead of mapping the device's entire inverted-pyramid 3D interaction
space to the 2D screen, we create an interaction box within the field of view to ease
the movement and mapping of the fingers. The height of the interaction box can be
set according to the user's preferred interaction height. The respective coordinate
systems of the display screen and the Leap Motion device are shown in Fig. 3. From
the figure it is evident that we need to flip the Y-axis of the Leap Motion device to
map the coordinates correctly to the display screen. We normalize the real-world
position of the finger so that the coordinates lie between 0 and 1, and then translate
the coordinates to the screen position as described in (1) and (2). This lets us localize
the position of the finger-tip on the segment of the display screen towards which the
finger is pointing. We do not include the Z-axis of the real-world position of the
finger (with respect to the device), since we want to portray the movement on the 2D
screen.

$X_s = X_n W_s$  (1)

$Y_s = (1 - Y_n) H_s$  (2)
where $X_s$ and $Y_s$ represent the X and Y coordinates of the finger position mapped
onto the screen, $X_n$ and $Y_n$ represent the normalized X and Y coordinates of the
finger-tip within the field of view of the device, and $W_s$ and $H_s$ represent the width
and height of the screen.
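A small sketch of the mapping in (1)–(2), followed by a hypothetical helper that assigns the resulting screen point to one of the 4 × 4 block labels “A”–“P” of Fig. 2 (row-major labelling is assumed); the normalized Leap Motion coordinates are assumed to come from the SDK's interaction-box normalization.

```python
def to_screen(xn, yn, screen_w, screen_h):
    """Map normalised device coordinates (xn, yn in [0, 1]) to screen coordinates."""
    xs = xn * screen_w              # Eq. (1)
    ys = (1.0 - yn) * screen_h      # Eq. (2): Leap Motion Y-axis is flipped w.r.t. the screen
    return xs, ys

def to_block(xs, ys, screen_w, screen_h, rows=4, cols=4):
    """Return the block label ('A'..'P') of the 4x4 grid containing the screen point."""
    col = min(int(xs / (screen_w / cols)), cols - 1)
    row = min(int(ys / (screen_h / rows)), rows - 1)
    return chr(ord("A") + row * cols + col)

# Example: a finger near the top-left of the interaction box maps to block "A"
xs, ys = to_screen(0.10, 0.85, 1920, 1080)
print(to_block(xs, ys, 1920, 1080))
```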
Next, the acquisition of authentication patterns under the above mapping is
described. Suppose a user wants to draw the pattern “AEIJKL” as depicted in Fig. 4.
The user needs to move his/her finger over the device's field of view so as to traverse
the labelled boxes in the order A, E, I, J, K, L.

Fig. 4 A sample pattern “AEIJKL” drawn over the field of view of the device and its corresponding 2D mapping onto the screen

To accomplish this, the user brings his/her finger within the device's field of view
and points at box “A”. After making a small circle gesture on box “A” (as described
earlier), the user traverses the other boxes in the above order. Although there is no
visual feedback, the position of the finger-tip in each frame is recorded, and this
information is used for generating the pattern. A pattern of such movement can be
represented as

$p = (x_1, y_1), \ldots, (x_k, y_k)$  (3)
where p represents the pattern under consideration and $(x_k, y_k)$ represents the
coordinate of the finger-tip with respect to the screen space in the kth frame.
Figure 5 depicts some of the patterns used in this experiment.
The task of recognition is carried out using the Viterbi decoding algorithm [2–4].
We assume that the observation variable depends only on the present state;
therefore, a first-order left-to-right Markov model is presumed to be appropriate in
the present context. The estimation of the maximum-likelihood parameters is carried
out using the Baum-Welch training algorithm, which uses the EM technique for
maximization of the likelihood, where 𝜃 = (A, b_j, 𝜋) describes the hidden Markov
chain. The algorithm finds a local maximum of 𝜃 for a given observation sequence Y,
as depicted in (4):

$\theta^{*} = \arg\max_{\theta} P(Y \mid \theta)$  (4)

More on the method can be found in Rabiner's pioneering tutorial on HMMs [5].
The parameter 𝜃 that maximizes the probability of the observations can be used to
predict the state sequence for a given vector [6]. We compute the probability of
observing a particular pattern ($p_j \in S$) using (5), where $\theta_i$ represents the
parameters of the ith HMM learned through training and X denotes the hidden state
sequence. Finally, given a test pattern, we classify it into one of the classes using (6),
assuming there are C such distinct patterns in the dataset.
$P(p_j, \theta_i) = \sum_{X} P(p_j \mid X, \theta_i)\, P(X, \theta_i)$  (5)
Using normalized coordinate vectors to represent all samples, including training and
testing patterns, makes the approach fairly robust to intra-user variations. In
addition, since HMMs are scale-invariant in nature, the recognition process works
well regardless of the length of the coordinate vector. The procedure is summarized
in Algorithm 1.
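As an illustration only (the paper does not name a particular HMM implementation), the per-pattern training and maximum-likelihood classification could be sketched with the hmmlearn library; the number of states and the diagonal-covariance Gaussian emissions are assumptions, and hmmlearn's default ergodic topology is used here for brevity rather than the left-to-right topology described above.

```python
# Hypothetical sketch with hmmlearn: one Gaussian HMM per pattern class, trained
# with Baum-Welch (fit) and scored with the forward algorithm (score).
import numpy as np
from hmmlearn import hmm

def train_models(patterns_by_class, n_states=6):
    """patterns_by_class: dict label -> list of (k, 2) arrays of normalised (x, y) points."""
    models = {}
    for label, seqs in patterns_by_class.items():
        X = np.vstack(seqs)                    # concatenated observations
        lengths = [len(s) for s in seqs]       # per-sequence lengths
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=100)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, seq):
    """Return the class whose trained model gives the highest log-likelihood for seq."""
    return max(models, key=lambda label: models[label].score(seq))
```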
3 Results
This section presents the results of experiments involving 10 unbiased volunteers.
To test the robustness of the proposed system, we selected 10 different authentication
patterns (simple as well as complex). Users were asked to mimic these patterns. Each
volunteer took part in the data acquisition phase after a short demonstration to make
them familiar with the Leap Motion device. A total of 1000 patterns were collected;
80 % of this data was used for training and the remaining 20 % for testing.
A total of 10 models were created, one for each of the 10 unique patterns
(essentially representing 10 distinct users). These HMMs were trained following the
procedure described in Algorithm 1. Out of the 1000 samples, 800 patterns were used
for training and 200 for testing. The confusion matrix of the classification is
presented in Table 1. It is evident from the results that the accuracy is quite high for
the majority of the patterns. For example, a single instance each of patterns 7 and 9
was confused with patterns 9 and 6, respectively; in the latter case, 9 (“HGKL”) was
recognized as 6 (“DGKP”). This is because, while trying to draw the “HGKL”
pattern, the user may have traversed the path representing “DGKP”, as depicted in
Fig. 6. Therefore, unintentionally visiting nearby blocks during the gesture may
cause failure in the logging-in procedure. However, our experiments reveal a
mismatch in only two cases; the remaining cases were detected correctly, giving an
overall accuracy of 99 %.
4 Conclusion
This paper proposes a novel technique for personalized device authentication via
patterns drawn without visual feedback. We conclude that eliminating visual
feedback during authentication makes the process more robust. Existing touch-less
or touch-based systems rely on visual feedback; in contrast, the proposed Leap
Motion based interface is robust against shoulder surfing attacks, owing to the
difference between the fields of view of the authentic user and the imposter.
The proposed system can be used to design robust authentication schemes for
personalized electronic devices, mitigating some of the limitations of existing
contact-based or visual-feedback-based authentication mechanisms. However, the
proposed system still needs to be tested against real imposter attacks, and
experiments need to be carried out to assess its protection potential against such
attacks.
References
1. Jansen, W.: Authenticating users on handheld devices. In: Proceedings of the Canadian Infor-
mation Technology Security Symposium, pp. 1–12 (2003)
2. Iwai, Y., Shimizu, H., Yachida, M.: Real-time context-based gesture recognition using HMM
and automaton. In: International Workshop on Recognition, Analysis, and Tracking of Faces
and Gestures in Real-Time Systems, pp. 127–134 (1999)
3. Rashid, O., Al-Hamadi, A., Michaelis, B.: A framework for the integration of gesture and
posture recognition using HMM and SVM. In: IEEE International Conference on Intelligent
Computing and Intelligent Systems, vol. 4, pp. 572–577 (2009)
4. Shrivastava, R.: A hidden Markov model based dynamic hand gesture recognition system using
OpenCV. In: 3rd IEEE International Conference on Advance Computing, pp. 947–950 (2013)
5. Rabiner, L.: A tutorial on hidden Markov models and selected applications in speech recogni-
tion. In: Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286 (1989)
6. Yamato, J., Ohya, J., Ishii, K.: Recognizing human action in time-sequential images using hid-
den Markov model. In: IEEE Computer Society Conference on Computer Vision and Pattern
Recognition, pp. 379–385 (1992)
7. Seo, H., Kang Kim, H.: User Input Pattern-Based Authentication Method to Prevent Mobile
E-Financial Incidents. In: Ninth IEEE International Symposium on Parallel and Distributed
Processing with Applications Workshops (ISPAW), pp. 382–387 (2011)
8. Sheng, Y., Phoha, V. V., Rovnyak, S. M.: A parallel decision tree-based method for user authen-
tication based on keystroke patterns. In: IEEE Transactions on Systems, Man, and Cybernetics,
Part B: Cybernetics, vol. 35, no. 4, pp. 826–833 (2005)
9. Mengyu, Q., Suiyuan, Z., Sung, A. H., Qingzhong, L.: A Novel Touchscreen-Based Authenti-
cation Scheme Using Static and Dynamic Hand Biometrics. In: 39th Annual IEEE conference
on Computer Software and Applications, vol. 2, pp. 494–503 (2015)
10. Syed, Z., Banerjee, S., Qi, C., Cukic, B.: Effects of User Habituation in Keystroke Dynamics on
Password Security Policy. In: IEEE 13th International Symposium on High-Assurance Systems
Engineering (HASE), pp. 352–359 (2011)
11. Frank, M., Biedert, R., Ma, E., Martinovic, I.; Song, D.: Touchalytics: On the Applicability of
Touchscreen Input as a Behavioral Biometric for Continuous Authentication. In: IEEE Trans-
actions on Information Forensics and Security, vol. 8, no. 1, pp. 136–148 (2013)
12. Vamsikrishna, K., Dogra, D. P., Desarkar, M. S.: Computer Vision Assisted Palm Rehabilita-
tion With Supervised Learning. In: IEEE Transactions on Biomedical Engineering,
DOI: 10.1109/TBME.2015.2480881 (2015)
13. Rahman, M., Ahmed, M., Qamar, A., Hossain, D., Basalamah, S.: Modeling therapy rehabil-
itation sessions using non-invasive serious games. In: Proceedings of the IEEE International
Symposium on Medical Measurements and Applications, pp. 1–4 (2014)
Author Index

B
Balasubramanian, R., 495
Blumenstein, Michael, 241

C
Chaudhury, Nabo Kumar, 221
Chaudhury, Santanu, 285

D
Das, Abhijit, 241
Diwan, Vikas, 251
Dogra, Debi Prosad, 321

F
Ferrer, Miguel A., 241

G
Garg, R.D., 411
Ghosh, Rajib, 523
Ghosh, Ripul, 25
Gonde, Anil Balaji, 495

J
Jain, Kamal, 377

K
Kiruthika, K., 365
Kumar, Ashutosh, 285
Kumar, Satish, 25

N
Nagananthini, C., 151

P
Pal, Umapada, 241
Pankajakshan, Vinod, 429
Prakash, Sachin, 221

R
Raman, Balasubramanian, 331
Revathi, G., 175
Roy, Partha Pratim, 321, 331, 523

S
Saravanaperumaal, S., 175
Sardana, H.K., 25
Saxena, Nikhil, 251
Sharma, Nitin, 421
Singh, Arshdeep, 25
Singh, Pankaj Pratap, 411
Srivastava, J.B., 285

V
Verma, Om Prakash, 421
Vipparthi, Santosh Kumar, 495

Y
Yogameena, B., 151, 175