
J. Vis. Commun. Image R. 30 (2015) 252–265

Modeling and recognizing action contexts in persons using sparse representation

Hui-Fen Chiang a,b, Jun-Wei Hsieh a,*, Chi-Hung Chuang c, Kai-Ting Chuang a, Yilin Yan a

a Department of Computer Science and Engineering, National Taiwan Ocean University, No. 2, Beining Rd., Keelung 202, Taiwan
b Dep. of D. M. Des., Taipei C. S. U. of S. T., Beitou, 112 Taipei, Taiwan
c Dept. of Learning and Digital Technology, Fo Guang University, No. 160, Linwei Rd., Jiaosi, Yilan 26247, Taiwan

* Corresponding author.

Article info

Article history:
Received 9 November 2014
Accepted 20 April 2015
Available online 29 April 2015

Keywords:
Sparse coding
Sparse reconstruction error
Occlusions
R transform
Interaction action analysis
Behavior analysis
Person-to-person action recognition
Person-to-object action recognition

Abstract

This paper proposes a novel dynamic sparsity-based classification scheme to analyze various interaction actions between persons. To address the occlusion problem, this paper represents an action in an over-complete dictionary so that errors (caused by lighting changes or occlusions) appear sparsely in the training library if the error cases are well collected. Because of this sparsity, the scheme is robust to occlusions and lighting changes. In addition, a novel Hamming distance classification (HDC) scheme is proposed to classify action events into various types. Because the nature of the Hamming code is highly tolerant to noise, the HDC scheme is also robust to environmental changes. The difficulty of complicated action modeling can be easily tackled by adding more examples to the over-complete dictionary. More importantly, the HDC scheme is very efficient and suitable for real-time applications because no minimization process is involved in calculating the reconstruction error.

© 2015 Elsevier Inc. All rights reserved.

1. Introduction

Human action analysis [1,2] is an important task in various application domains like video surveillance [1–38], video retrieval [3], human–computer interaction systems, and so on. Characterization of human action is equivalent to dealing with a sequence of video frames that contain both spatial and temporal information. The challenge in human action analysis is how to properly characterize spatial–temporal information and then facilitate subsequent comparison/recognition tasks. To treat this challenge, some approaches build various action syntactic primitives to represent and recognize events. For example, in [4], Park and Aggarwal used the "blob" concept to model and segment a human body into different body parts from which human events were analyzed using dynamic Bayesian networks. Instead of body parts, Wang et al. [9] used the R transform to extract contour features from different frames and then proposed an HMM-based recognition scheme to analyze human behaviors. The advantage of the R transform is its high tolerance to noise and incomplete contours of foreground objects. Furthermore, Kratz and Nishino [13] used a set of local spatio-temporal motion volumes to represent actions and then proposed an HMM-based framework to analyze the overall behaviors in an extremely crowded scene. Some approaches decompose actions into sequences of key atomic action units which are referred to as atoms. For example, in [36], Gaidon et al. proposed an atom sequence model (ASM) to represent the temporal structure of actions and then recognized actions in videos using a sequence of "atoms" which are obtained by manual annotations. In addition to videos, humans can recognize activities based on only still images. For example, in [42], Yao and Fei-Fei proposed a data-mining scheme to discover human-object interactions by encoding a set of highly related patches into grouplets according to their appearances, locations, and spatial relations. Maji et al. [35] trained thousands of poselets to form a poselet activation vector so that various human actions in still images can be recognized. However, the prerequisite that body parts or poses must be well estimated makes this scheme inappropriate for real-time analysis.

To avoid the difficulty of action primitive or body part extraction, some approaches extract feature points of interest and their motion flows to represent and recognize actions. For example, Fathi and Mori [37] detected and tracked corner features to obtain their motion trajectories for event representation, and then classified events to different layers using the Adaboost algorithm. In [38], Ju et al. used the SIFT detector and a tracking technique to extract different point trajectories with which different event
classifiers can then be trained using the SVM classifier. In [16], Laptev et al. generalized the concept of interest points (STIP) from images to flow volumes and then extracted various key frames to represent action events. The success of the above feature-flow methods strongly depends on a large set of well tracked points to analyze action events from videos.

In addition to event features, another key problem in event analysis is how to model the temporal and spatial dynamics of events. To treat this problem, Mahajan et al. [14] proposed a layer concept to divide the recognition task into three layers, i.e., physical, logical, and event layers, which correspond to feature extraction, action representation, and event analysis, respectively. In [15], Laxton et al. used dynamic Bayesian networks to model the interaction events between objects. Then, a Viterbi-like algorithm is adopted to seek the best path for event interpretation. In [10], Cong et al. used the sparse representation scheme to model and identify abnormal events from normal ones. The hidden Markov model (HMM) is another commonly used scheme to model event dynamics. For example, Oliver et al. [12] used HMMs to classify human interactions into different event types like meeting, approaching, walking, and so on. Furthermore, Messing et al. [17] tracked a set of corners to obtain their velocity histories and then used an HMM to learn activity event models. Kim and Grauman [44] proposed a space–time MRF model to model the distributions of local optical flows within different regions and then estimated the degrees of event normality at each region to detect abnormal events. In addition, Nguyen et al. [39] developed an HMM-based surveillance system to recognize human behaviors in various multiple-camera environments such as a corridor, a staff room, or a vision lab. The challenges related to HMMs involve how to specify and learn the HMM model structure.

A particular action between two objects can vary significantly under different conditions such as camera views, a person's clothing, object appearances, and so on. Thus, it is more challenging to analyze human events happening between two objects because of their complicated interaction relations. In addition, occlusions between objects often lead to the failure of action recognition. In [47], Filipovych and Ribeiro used a probabilistic graphical model to recognize several primitive actor-object interaction events like "grasping" or "touching" a fork (or a spoon, a cup). In [4], Ryoo and Aggarwal used a context-free-grammar framework that integrates object recognition, motion estimation, and semantic-level recognition to hierarchically analyze various interactions between a human and an object. In [46], Ping et al. extracted SIFT points and their trajectories to describe events and then used a kernel learning scheme to analyze interaction events caused by two objects. In addition to videos, some previous works address joint modeling of human poses, objects, and their relations from still images. For example, in [40,41], Yao and Fei-Fei proposed a random field model to encode the mutual connections of components in the analyzed object, the human pose, and the body parts to recognize human-object interaction activities in still images. However, as described in [43], until now, reliable estimation of body configurations for persons in arbitrary poses remains a very challenging problem.

This paper addresses the problem of action analysis between persons (or human-object interactions) by using the sparse representation. As described before, the complicated interaction changes and the occlusion problem between two objects pose many challenges in action recognition. To treat the above problems, this paper proposes a novel dynamic sparse representation scheme to analyze interaction actions between persons. As we know, actors often perform the same action type at different speeds. From the nature of sparse representation, this temporal variation problem can be easily tackled by creating a new code in the training library to capture the speed change. To our knowledge, this is the first work using the sparse representation to analyze interaction actions between persons. The sparse property can make errors (caused by different environmental changes like lighting or occlusions) appear sparsely in the training library and thus increase the robustness of our scheme to environmental changes. In the literature [57], the sparse reconstruction cost (SRC) is usually adopted to classify data into different categories. The calculation of the SRC is very time-consuming and unsuitable for real-time applications. Therefore, the second contribution of this paper is to propose a Hamming distance classification (HDC) scheme to classify data without calculating the SRC. Thus, our method is more efficient and suitable for real-time applications. In addition to the efficiency improvement, the HDC scheme is also robust to occlusions because the nature of the Hamming code is highly tolerant to noise. The main contributions of this work include:

1. A novel sparsity-based framework is proposed to analyze person-to-person interaction behaviors. Even though the interaction relations are complicated, our method still works successfully to recognize them.
2. A new Hamming distance classification scheme classifies events into multiple classes. It is robust to environmental changes due to the nature of the Hamming code.
3. The proposed sparse representation is very efficient and suitable for real-time applications because no minimization process to calculate the SRC is involved.

The remainder of the paper is organized as follows. Section 2 discusses some related work. An overview of our system is introduced in Section 3. Details of feature extraction are described in Section 4. Section 5 discusses the techniques of sparse representation and library learning. The framework of person-to-person interaction action recognition using sparse representation is proposed in Section 6. The experimental results are given in Section 7. We then present our conclusions in Section 8.

2. Related work

There have been many approaches [8,9,13–18,42,44] proposed in the literature for video-based human movement analysis. For example, in [52], Fengjun and Nevatia used a posture-based scheme to convert frames to strings from which actions were then classified into different types using the action nets. Weinland and Boyer [53] selected a set of key postures to convert an action sequence to a vector. Then, a Bayesian classifier is designed to classify the extracted vectors into different categories. In [54], Hsieh et al. proposed a centroid context descriptor to convert postures to different symbols and then detected abnormal events directly from videos. Messing et al. [55] used the velocity history of tracked key-points to represent actions and then combined other features, such as appearance, position, and high-level semantic features, to recognize activities of daily living. In [56], Fan et al. modified the Viterbi algorithm to detect repetitive sequential event units and then recognized the predominant retail cashier activities at the checkout station.

To model the interactions between objects, in [45], Wu et al. integrated RFID and video streams to analyze the events happening in a kitchen via a dynamic Bayesian network. In [43], Delaitre et al. proposed a statistical model which integrates order-less person-object interaction responses to recognize actions in still images by calculating spatial co-occurrences of individual body parts and objects. To achieve the property of view invariance, Karthikeyan et al. [8] computed the R transforms on action silhouettes from multiple views and then proposed a probabilistic subspace similarity technique to recognize actions by learning their
inter-action and intra-action models. However, when recognizing an action, the scheme requires that all action sequences captured at different views be collected together simultaneously.

As for sparse representation, it is a technique to build an over-complete dictionary matrix for target representation and recognition. It has been applied to many applications, e.g., noise removal [19,20], image recovery [21], text detection [22], object tracking [23], object classification [24,25], face recognition [26–30], action recognition [6,31–33], event analysis [10,11,34], and so on. For example, Elad and Aharon [19] addressed the image denoising problem by learning an over-complete dictionary from image patches. In [20], Shang used an extended non-negative sparse coding scheme to remove noise from data. Dong et al. [21] restored a degraded image by optimizing a set of optimal sparse coding coefficients from a dictionary. As for object detection, Zhao et al. [22] used a sparse representation with discriminative dictionaries to develop a classification-based scheme for text detection. Chen et al. [23] combined an appearance modeling scheme and sparse representation to evaluate the distance between a target and the learned appearance model for object tracking. Furthermore, in [24], Zhang et al. proposed a joint dynamic sparse representation to recognize objects from multiple views. Yuan and Yan [25] proposed a multi-task joint sparse representation scheme to reconstruct a test sample from a few training subjects with multiple features and then applied it to face recognition and object classification. Wright et al. [26] proposed a sparse representation to recognize faces even under occlusions by evaluating which class leads to the minimal reconstruction error. Zhang and Li [27] incorporated the classification error into an objective function during the learning process and then proposed the discriminative K-SVD method to recognize faces. Zhang and Huang [28] addressed the challenging task of face recognition with large pose differences between gallery and probe faces by exploiting the sparse property of faces across poses. Ptucha and Savakis [29] proposed a dimensionality reduction technique by adding the locality preserving constraint into sparse representation to find the best manifold projection for face recognition. Wei et al. [30] also added the locality preserving constraint into sparse representation but removed the norm-bounded constraint on the learned library for improving feature representation and classification. As for action analysis, Qiu et al. [6] proposed a Gaussian process model to learn a compact and discriminative dictionary of action attributes for action recognition by incorporating the class distribution and appearance information into an objective function. Lu and Peng [31] proposed a spectral embedding approach for extracting latent semantic high-level features with structured sparse representation to analyze actions. Wang et al. [32] modified the sparse model to minimize not only the reconstruction error but also a similarity constraint to capture the correlations between similar actions. Zhang et al. [33] proposed a manifold projection scheme for dimension reduction by adding the locality preserving constraint into sparse representation and then finding the best manifold projection for action recognition. Cong et al. [10,11] proposed a dictionary selection model with a low rank constraint for event representation and then designed a new SRC to measure the outlier for abnormal event detection in crowded scenes. Zhao et al. [34] proposed a framework to detect unusual events in videos from an automatically learned dictionary by using the sparse coding scheme with online re-constructability. For all the above approaches, a sparse reconstruction cost is often adopted to classify data into different classes. However, this cost is inefficient for real-time applications because an optimization process is needed to iteratively calculate the SRC. In addition, the above sparsity-based methods cannot recognize interaction actions between persons. The novelty of this paper is to provide an HDC framework to improve the performance of sparse representation not only in its efficiency but also in its robustness and extensibility in interaction action analysis.

3. Flowchart of the proposed system

This paper proposes a novel recognition system to analyze action events between multiple persons using the sparse representation. Three stages are involved in this system, that is, feature extraction, dictionary learning, and action event recognition. Fig. 1 shows the flowchart of our system. Firstly, if a static camera is used, two different features are extracted from each analyzed object, i.e., the coefficients of the R transform and the HOG descriptor (see Fig. 1(a)). The former is used to capture the outer property of the object and the latter is used to describe its inner property. As for a moving camera, because the object contour is no longer available, the HOF (Histogram of Flow) descriptor is used to replace the feature extracted from the R transform (see Fig. 1(b)). With these features, a novel sparse representation is proposed to capture the spatial and temporal changes of interaction actions between persons. Then, the K-SVD algorithm [5] is applied to learn an over-complete dictionary to represent action events more effectively and accurately. With the learned dictionary, two classification schemes, i.e., sparse reconstruction cost (SRC) [57] and Hamming distance coding (HDC), are then developed and compared for person-to-person action analysis. The major advantages of this representation are the adaptability to deal with different occlusion conditions, the efficiency to analyze actions in real time, and the extensibility to analyze multiple action events between persons. In what follows, details of the feature extraction task are first discussed.

4. Feature extraction

In traditional surveillance systems, it is common to capture video sequences with a stationary camera. However, for actions in TV or movies, a moving camera is usually adopted to capture action videos. For a stationary camera, the background of the captured video sequence can be reconstructed using a mixture of Gaussian models or the codebook technique [51]. This allows us to detect and extract foreground objects by subtracting the background. We then apply some simple morphological operations to remove noise. After the connected component analysis, each foreground object can then be well detected for action analysis. Then, the R transform and the HOG descriptor are adopted in this paper to describe its outer and inner properties (see Sections 4.1 and 4.2). As for a moving camera, the person-centered descriptor [49] is adopted for action analysis and described in Section 4.3.

4.1. Radon transform and R transform

Let I(x, y) denote the intensity function of the detected object. From I(x, y), this section uses the R transform to extract its contour features. The R transform is an improved form of the Radon transform. The Radon transform in two dimensions is the integral transform consisting of the integral of an image over a straight line. Let L denote this straight line with the form x cos θ + y sin θ = t. Then, the Radon transform of I(x, y) along L is defined by

Radon{I} = P(\theta, t) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} I(x, y)\, \delta(x \cos\theta + y \sin\theta - t)\, dx\, dy,   (1)

where θ ∈ [0, π], t ∈ (−∞, ∞), and δ(x) is a Dirac delta function. Fig. 2 shows the projection example of I(x, y) along L using this transform.
Fig. 1. Flowchart of the proposed system. (a) Static camera: the R transform and the HOG descriptor are extracted, coded against the training dictionary, and classified via the sparse reconstruction error or the Hamming distance (e.g., "walking"). (b) Moving camera: the HOF descriptor replaces the R transform.

Fig. 2. Radon transform of an image I(x, y) along the line x cos θ + y sin θ = t.

For the scaling, translation, and rotation changes in I(x, y), this transform has the following properties. For a scaling factor a,

Radon{I(ax, ay)} = \frac{1}{a} P(\theta, at).   (2)

For a translation (x_0, y_0),

Radon{I(x - x_0, y - y_0)} = P(\theta, t - x_0 \cos\theta - y_0 \sin\theta).   (3)

For a rotation θ_0,

Radon{I_{\theta_0}(x, y)} = P(\theta + \theta_0, t).   (4)

In Eq. (1), the result of the Radon transform is a two-dimensional signal. It can be converted to a 1D signal by the improved form of the Radon transform [9]:

R(\theta) = \int_{-\infty}^{\infty} P^2(\theta, t)\, dt.   (5)

Eq. (5) is the R transform of I(x, y). For geometry transformations such as scaling, translation, and rotation, similar to the Radon transform, the R transform has the following properties. For a scaling factor a,

\int_{-\infty}^{\infty} \frac{1}{a^2} P^2(\theta, at)\, dt = \frac{1}{a^3} \int_{-\infty}^{\infty} P^2(\theta, v)\, dv = \frac{1}{a^3} R(\theta).   (6)

For a translation (x_0, y_0),

\int_{-\infty}^{\infty} P^2(\theta, t - x_0 \cos\theta - y_0 \sin\theta)\, dt = \int_{-\infty}^{\infty} P^2(\theta, v)\, dv = R(\theta).   (7)

For a rotation θ_0,

\int_{-\infty}^{\infty} P^2(\theta + \theta_0, t)\, dt = R(\theta + \theta_0).   (8)

Although the R transform is rotation-invariant, this property is useless for action classification and is not adopted in this paper. Fig. 3 shows the results of the R transform after translation and scaling changes. The R transform is invariant to translation changes. A scaling change does not change the shape of the R transform but scales its amplitude accordingly. If I(x, y) is normalized to a fixed size, the R transform is invariant under translation and scaling. In our experiments, for the R transform, the angle ranges from 0° to 180° and is sampled at 4° intervals. Thus, the feature dimension of this transform is 45.
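For illustration, the 45-dimensional R-transform descriptor can be sketched as follows. This is a minimal sketch, not the authors' implementation; it assumes a binary foreground silhouette normalized to a fixed size and uses scikit-image's `radon` routine for the projections P(θ, t) of Eq. (1), then applies Eq. (5).

```python
import numpy as np
from skimage.transform import radon

def r_transform(mask, step_deg=4):
    """R transform of a normalized binary silhouette I(x, y).

    mask     : 2-D array with the foreground silhouette.
    step_deg : angular sampling step; 4 degrees over [0, 180) gives a 45-D feature.
    """
    angles = np.arange(0, 180, step_deg)                      # theta sampled in [0, 180)
    sinogram = radon(mask.astype(float), theta=angles,
                     circle=False)                            # P(theta, t), one column per angle
    r = np.sum(sinogram ** 2, axis=0)                         # R(theta) = sum_t P^2(theta, t), Eq. (5)
    return r / (np.linalg.norm(r) + 1e-12)                    # normalize to reduce scale sensitivity
```

For a centred, fixed-size silhouette this vector changes little under translation and scaling, consistent with Eqs. (6) and (7).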
Fig. 3. Results of the R transform under geometrical changes. (a) Original frame. (c), (e), (g) Translated, scaled, and rotated versions of (a), respectively. (b), (d), (f), and (h) Results of the R transform of (a), (c), (e), and (g), respectively.

Fig. 4. Flowchart of the HOG method: compute gradients on the input image, accumulate magnitude-weighted votes for gradient orientation over spatial cells, normalize contrast within overlapping blocks of cells, and collect the HOGs over the detection window into a feature vector f = [..., ..., ...].

4.2. Histogram of oriented gradients

In addition to the R transform, this paper also uses the histogram of oriented gradients (HOG) [18] to describe an object's inner properties. Fig. 4 shows the flowchart of the HOG method. Let I_x and I_y denote the central differences at a point (x, y), given by
I_x = I(x + 1, y) - I(x - 1, y)  and  I_y = I(x, y + 1) - I(x, y - 1).   (9)

Then, the gradient magnitude M(x, y) and its orientation θ(x, y) can be computed by

M(x, y) = \sqrt{I_x^2 + I_y^2}  and  θ(x, y) = \tan^{-1}(I_y / I_x).   (10)

Based on Eqs. (9) and (10), HOG counts occurrences of gradient orientation in localized portions of an image to describe the inner visual characteristics of an object. This paper divides the angle into 18 bins, where each bin accumulates the number of edge points whose angles θ(x, y) fall in that angle bin; in other words, each bin covers 20° of the range from 0° to 360°. The contribution of an edge point to the HOG is weighted by its gradient magnitude M(x, y). In the real implementation, the analyzed object is further divided into four grids. Then, an ensemble of HOG descriptors with 72 bins can be formed for action analysis (72 = 4 grids × 18 bins).
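The grid-wise HOG described above can be sketched as follows (an illustrative minimal implementation, not the authors' code): gradients are taken by central differences as in Eq. (9), each point votes into an 18-bin orientation histogram weighted by its magnitude as in Eq. (10), and the four grid histograms are concatenated into a 72-dimensional descriptor.

```python
import numpy as np

def hog_descriptor(patch, n_bins=18, grid=(2, 2)):
    """Magnitude-weighted orientation histograms over a grid of cells.

    patch : 2-D grayscale array of the detected object (resized to a fixed size).
    Returns a grid[0] * grid[1] * n_bins vector (4 x 18 = 72-D with the defaults).
    """
    patch = patch.astype(float)
    ix = np.zeros_like(patch); iy = np.zeros_like(patch)
    ix[:, 1:-1] = patch[:, 2:] - patch[:, :-2]          # Ix = I(x+1, y) - I(x-1, y)
    iy[1:-1, :] = patch[2:, :] - patch[:-2, :]          # Iy = I(x, y+1) - I(x, y-1)
    mag = np.sqrt(ix ** 2 + iy ** 2)                    # M(x, y)
    ang = np.mod(np.arctan2(iy, ix), 2 * np.pi)         # orientation in [0, 2*pi)

    h, w = patch.shape
    feats = []
    for gy in range(grid[0]):
        for gx in range(grid[1]):
            ys = slice(gy * h // grid[0], (gy + 1) * h // grid[0])
            xs = slice(gx * w // grid[1], (gx + 1) * w // grid[1])
            hist, _ = np.histogram(ang[ys, xs], bins=n_bins,
                                   range=(0, 2 * np.pi),
                                   weights=mag[ys, xs])  # magnitude-weighted votes
            feats.append(hist)
    f = np.concatenate(feats)
    return f / (np.linalg.norm(f) + 1e-12)
```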

4.3. Person-centered descriptor

If a moving camera is used to capture action sequences, this paper adopts the person-centered descriptor [49] to model and analyze interaction actions between persons. This descriptor uses HOG and a linear SVM classifier to train an upper-body detector to look for upper bodies in 128 × 128 and 64 × 64 windows. Then, a KLT tracker is adopted to track each detected upper body. In addition, a head pose classifier is also trained in [49] to classify the head orientation into five discrete orientations: profile-left, frontal-left, frontal-right, profile-right, and backwards. The detected upper body is further divided into different grids. In each grid, the histograms of oriented gradients (HOG) and optical flow (HOF) are computed. For the HOG feature, the edge orientations are divided into five bins, i.e., horizontal, vertical, two diagonal orientations, and a no-gradient bin. For the HOF feature, the optical flow is discretized into five bins: no-motion, left, right, up, and down. Thus, the local descriptor of each grid is ten-dimensional. In [49,50], the head orientation is divided into five directions to create a compact and higher-dimensional representation from which a different classifier for each discrete head orientation can be learned. However, in our proposed method, the variations of head orientation can be easily embedded in the sparse representation. Thus, the head pose classifier is not needed in our sparse representation scheme. In addition, different from [49,50], we divide an upper body into 4 × 6 grids rather than 8 × 8 grids to describe each person for action analysis. After concatenation, the local context descriptor of each upper body has dimension 240.
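A rough sketch of this 240-dimensional local context descriptor is given below. It is illustrative only and assumes the upper-body box has already been detected and tracked, and that the per-pixel gradient orientations and optical flow have already been quantized into the five bins described above; the function name and inputs are hypothetical.

```python
import numpy as np

def local_context_descriptor(grad_bins, flow_bins, grid=(4, 6)):
    """Person-centered descriptor: 4 x 6 grids, 10 bins each -> 240-D.

    grad_bins : 2-D int array over the upper-body box; each pixel holds its
                quantized gradient orientation (0..4: horizontal, vertical,
                two diagonals, no-gradient).
    flow_bins : 2-D int array of quantized optical flow
                (0..4: no-motion, left, right, up, down).
    """
    h, w = grad_bins.shape
    feats = []
    for gy in range(grid[0]):
        for gx in range(grid[1]):
            ys = slice(gy * h // grid[0], (gy + 1) * h // grid[0])
            xs = slice(gx * w // grid[1], (gx + 1) * w // grid[1])
            hog = np.bincount(grad_bins[ys, xs].ravel(), minlength=5)[:5]   # 5 HOG bins
            hof = np.bincount(flow_bins[ys, xs].ravel(), minlength=5)[:5]   # 5 HOF bins
            feats.append(np.concatenate([hog, hof]))
    f = np.concatenate(feats).astype(float)          # 4 * 6 * 10 = 240 dimensions
    return f / (np.linalg.norm(f) + 1e-12)
```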
5. Sparse representation

Sparse representation [5–7] is a technique to build an over-complete dictionary to represent a target. With this dictionary, which contains prototype signal-atoms, signals are described by sparse linear combinations of these atoms. It has been applied to various applications including compression, regularization in inverse problems, feature extraction, object detection, denoising, and more. In this paper, we utilize the sparse representation and dictionary learning techniques to design a novel framework to analyze interaction actions happening between persons.

5.1. Dictionary initialization and matching pursuit

Let x denote an n-dimensional feature vector, i.e., x ∈ R^n. We say that it admits a sparse approximation over a dictionary D in R^{n×K}, where each column vector is referred to as an atom. Consider a training set of signals X = [x_1, x_2, ..., x_N] ∈ R^{n×N}. Then, one can find a linear combination of a "few" atoms from D that is "close" to the set X of signals x, i.e.,

X ≈ DA,   (11)

where A = [a_1, ..., a_N] ∈ R^{K×N} is the set of combination coefficients in the sparse decomposition. For the task of dictionary learning, this paper adopts the K-SVD algorithm [5] for dictionary learning and construction because of its efficiency. Given X, the K-SVD algorithm maintains the best representation of each signal with strict sparsity constraints to learn the over-complete dictionary D. It is an iterative scheme alternating between sparse coding of the training signals with respect to the current dictionary and an updating process for the dictionary atoms so as to better fit the training signals. The learning process can be formulated as a joint optimization problem with respect to D and A of the sparse decomposition:

\arg\min_{D, A} \| X - DA \|_2^2  s.t.  \forall i, \| a_i \|_0 \le T,   (12)

where D = [d_1, ..., d_K] ∈ R^{n×K}, A = [a_1, ..., a_N] ∈ R^{K×N}, T is the maximum desired number of non-zero coefficients, and ‖a_i‖_0 is the l_0-norm which counts the number of nonzeros in the vector a_i. Eq. (12) can be formulated as another equivalent problem:

\arg\min_{D, a_i} \| a_i \|_0  s.t.  \| x_i - D a_i \|_2^2 \le \varepsilon,   (13)

where ε is an error tolerance of reconstruction. Solving this problem was proved to be NP-hard [24]. However, many approximation techniques for this task have been proposed. One common method to solve a_i is Orthogonal Matching Pursuit (OMP). OMP is a greedy method to iteratively solve a_i for each x_i while fixing D, i.e.,

a_i = \arg\min_{a} \| x_i - D a \|_2^2  s.t.  \| a \|_0 \le T.   (14)

At each stage, OMP selects the atom from D that best resembles the residual. After each such selection, the signal is back-projected onto the set of chosen atoms, and the new residual signal is calculated (see Fig. 5).

Fig. 5. Person-centered descriptor. A grid-based structure is used to capture the dominant gradient orientation and flow changes around the person. Grids with significant motion are shown in red.

5.2. Dictionary learning via K-SVD

This paper applies the K-SVD algorithm to solve Eq. (14) by iteratively executing an optimization process to optimize D and A. Each iteration consists of a sparse coding stage that optimizes the coefficients in A and a dictionary update stage that improves the atoms in D. As shown in Fig. 6, during the sparse coding stage, D is held fixed while each a_i is optimized by solving Eq. (14) via the OMP scheme, allowing each coefficient vector to have no more than T non-zero elements. During the dictionary update stage, each column d_k in D is updated sequentially so that its coefficients can better represent X. The update process is the key step of K-SVD which accelerates the optimization of Eq. (12) while maintaining the sparsity requirement. Let d_k denote the kth column in D to be updated. In addition, we denote the kth row in A by a_k^{row}, the set of coefficients corresponding to d_k. Then, the cost function in Eq. (12) can be rewritten as follows:

\| X - DA \|_F^2 = \left\| X - \sum_{j=1}^{K} d_j a_j^{row} \right\|_F^2 = \left\| \left( X - \sum_{j \ne k} d_j a_j^{row} \right) - d_k a_k^{row} \right\|_F^2 = \| n_k - d_k a_k^{row} \|_F^2.

The updated values of d_k and a_k^{row} can be obtained by solving

\arg\min_{d_k, a_k^{row}} \| n_k - d_k a_k^{row} \|_F^2.   (15)

The K-SVD scheme suggests the use of the SVD to find alternative d_k and a_k^{row}. The SVD finds the closest rank-1 matrix (in Frobenius norm) that approximates n_k. If the SVD of n_k is expressed as USV^T, d_k is updated by the first basis vector u_1 and the non-zero values in a_k^{row} are adjusted to the product of the first singular value S(1, 1) and the first column of V. The two stages are iteratively performed until convergence.
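The two alternating stages can be sketched compactly as follows. This is a simplified illustration under the definitions of Eqs. (12)–(15), not the authors' implementation; an off-the-shelf OMP solver (e.g., scikit-learn's OrthogonalMatchingPursuit) could equally be used for the sparse-coding step.

```python
import numpy as np

def omp(D, x, T):
    """Greedy Orthogonal Matching Pursuit: approximate Eq. (14) for one signal x."""
    residual, support = x.copy(), []
    a = np.zeros(D.shape[1])
    coef = np.zeros(0)
    for _ in range(T):
        j = int(np.argmax(np.abs(D.T @ residual)))        # atom most correlated with the residual
        if j not in support:
            support.append(j)
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef               # back-project and update the residual
    a[support] = coef
    return a

def ksvd(X, K, T, n_iter=20):
    """Learn an over-complete dictionary D (Eq. (12)) by alternating sparse coding (OMP)
    and column-by-column rank-1 SVD updates (Eq. (15)). Assumes X has at least K columns."""
    n, N = X.shape
    rng = np.random.default_rng(0)
    D = X[:, rng.choice(N, K, replace=False)].astype(float)   # initialize with training samples
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    for _ in range(n_iter):
        A = np.column_stack([omp(D, X[:, i], T) for i in range(N)])   # sparse coding stage
        for k in range(K):                                            # dictionary update stage
            users = np.flatnonzero(A[k, :])
            if users.size == 0:
                continue
            E_k = X[:, users] - D @ A[:, users] + np.outer(D[:, k], A[k, users])
            U, S, Vt = np.linalg.svd(E_k, full_matrices=False)
            D[:, k] = U[:, 0]                     # d_k <- first left singular vector
            A[k, users] = S[0] * Vt[0, :]         # a_k^row <- S(1,1) times first column of V
    return D, A
```

In the setting of this paper, each column of X would be one normalized event vector E from Section 6.1 and T is the sparsity level of Eq. (12).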

Fig. 6. Flowchart of the K-SVD scheme: initialize D, then alternate between sparse coding (using matching pursuit) and a column-by-column dictionary update (via SVD).

Fig. 7. Sparse representation for analyzing single-person action events. (a) Feature vector of an object at each frame for a static camera (R transform + HOG). (b) Feature vector of an object for a moving camera (HOF + HOG). (c) Matrix to represent a single-person action event.

6. Person-to-person action recognition using sparse representation

This section deals with different challenges in human action analysis between persons, that is, how to properly characterize spatial–temporal information and how to perform the subsequent comparison/recognition tasks.

6.1. Action event representation

To characterize spatial–temporal information of an action event, this paper uses the R transform, the HOF descriptor, and the HOG descriptor to describe each frame. Let h_R^X, h_hof^X, and h_hog^X denote the feature vectors extracted from the R transform, the HOF, and the HOG of an actor X, respectively. If a static camera is adopted to capture X, only the R transform and the HOG feature are adopted to describe X. As shown in Fig. 7(a), a new feature vector F^X = (h_R^X, h_hog^X)^T will be formed to represent X. On the other hand, if a moving camera is adopted, we use the HOF descriptor h_hof^X to replace h_R^X and then construct a new feature vector F^X = (h_hof^X, h_hog^X)^T (see Fig. 7(b)) for action representation.

Furthermore, we use X_t to denote the version of X observed at the tth frame, and X_k to denote the kth example. Let E^X denote an action event performed by X. If n_f frames are collected to represent E^X, a concatenation operator can be used to construct E^X, i.e.,

E^X = F^{X_1} \oplus \cdots \oplus F^{X_{n_f}},   (16)

where the concatenation operator '⊕' to concatenate the features F^X and F^Y is defined by

F^X \oplus F^Y = (h_R^X, h_hog^X, h_R^Y, h_hog^Y)^T if a static camera is adopted; (h_hof^X, h_hog^X, h_hof^Y, h_hog^Y)^T if a moving camera is adopted.   (17)

After normalization, E^X is a unit vector satisfying ‖E^X‖ = 1. In the sparse representation, if an action class D_i is represented by K_i codes, D_i can be represented by a matrix form:

D_i = (E^{X_1}, ..., E^{X_k}, ..., E^{X_{K_i}}),   (18)

where D_i ∈ R^{n×K_i} and n is the dimension of E^X. Fig. 7(b) shows the matrix structure of D_i performed by a single person. Assume that there are L action types to be recognized. Then, the dictionary D in the sparse representation can be constructed by the form

D = (D_1, ..., D_L),   (19)

where D ∈ R^{n×K} and K = \sum_{i=1}^{L} K_i. Eqs. (16)–(19) are used to analyze single-person action events. To analyze person-to-person action events, one challenge is the occlusion problem. In the following, a novel sparse representation will be proposed to treat this problem.

Assume the analyzed interaction events are performed by two persons X and Y. The method to model their interaction relations needs to be adjustable and robust to occlusions. As shown in Fig. 8(a), if X and Y are not occluded, a new feature descriptor F^XY is constructed to describe X and Y as follows:

F^{XY} = F^X \oplus F^Y \oplus (m) = (h_R^X, h_hog^X, h_R^Y, h_hog^Y, m)^T for a static camera; (h_hof^X, h_hog^X, h_hof^Y, h_hog^Y, m)^T for a moving camera,   (20)

where m is the spatial feature (or relative distance) between X and Y. On the other hand, if X and Y are occluded together (see Fig. 8(b)), we replace X and Y with their occluded version O since there is no information to discriminate X from Y. Then, the descriptor to represent X and Y is constructed as follows:

F^{XY} = (h_R^O, h_hog^O, h_R^O, h_hog^O, m)^T for a static camera; (h_hof^O, h_hog^O, h_hof^O, h_hog^O, m)^T for a moving camera,   (21)

where h_R^O, h_hog^O, and h_hof^O are the R transform, the HOG descriptor, and the HOF descriptor of O, respectively, and m is set to zero. Let E^{XY} denote the action event performed by X and Y. If n_f frames are collected to represent E^{XY}, it can be constructed with the following sparse representation:

E^{XY} = F^{XY_1} \oplus \cdots \oplus F^{XY_t} \oplus \cdots \oplus F^{XY_{n_f}},   (22)

where E^{XY} is a column feature vector. Each column in Fig. 8(b) shows the structure of E^{XY}. Let n_XY denote the dimension of E^{XY}. Similar to Eq. (18), in the two-person case, if an action class D_i is represented by K_i codes, D_i can be represented by a matrix form

D_i = (E^{XY_1}, ..., E^{XY_{K_i}}),   (23)

where XY_k denotes the kth code and D_i ∈ R^{n_XY × K_i}. Assume there are L action types to be recognized. Then, the library D in the sparse representation can be integrated as the form

D = (D_1, ..., D_L),   (24)

where D ∈ R^{n_XY × K}. With D, two classification schemes, i.e., SRC and HDC, will be proposed to classify actions into different category types.
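The construction of the interaction-event vectors and the library of Eqs. (20)–(24) can be sketched as follows. This is illustrative only; the per-person descriptors are assumed to come from the features of Section 4, and the function names are hypothetical.

```python
import numpy as np

def frame_descriptor(feat_x, feat_y, occluded, distance):
    """F^XY of one frame, Eqs. (20)-(21).

    feat_x, feat_y : per-person feature vectors (R transform + HOG for a static
                     camera, or HOF + HOG for a moving camera). When the two
                     persons are occluded, both are the descriptor of the merged
                     blob O, and the spatial feature m is set to zero.
    distance       : relative distance (or speed feature) m between X and Y.
    """
    m = 0.0 if occluded else float(distance)
    return np.concatenate([feat_x, feat_y, [m]])

def event_vector(frame_descriptors):
    """E^XY: concatenate the n_f frame descriptors (Eq. (22)) and normalize to unit length."""
    e = np.concatenate(frame_descriptors)
    return e / (np.linalg.norm(e) + 1e-12)

def build_library(events_per_class):
    """Stack the K_i event codes of every class into D = (D_1, ..., D_L), Eq. (24).

    events_per_class : list (length L) of lists of event vectors.
    Returns the library D (n_XY x K) and the class label of each column.
    """
    columns, labels = [], []
    for i, events in enumerate(events_per_class):
        for e in events:
            columns.append(e)
            labels.append(i)
    return np.column_stack(columns), np.array(labels)
```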
Fig. 8. Sparse representation for analyzing interaction actions. (a) Non-occluded frame. (b) Occluded frame. (c) Matrix representation to analyze person-to-person actions: each column stacks the per-frame descriptors (X, Y, m) or their occluded version (O, O, m) for one behavior type.

6.2. Event classification using sparse representation

After obtaining D by the K-SVD algorithm, the action classification task can be formulated as a signal reconstruction problem. Given an input signal x ∈ R^n, we consider x as a linear combination of column vectors in D, i.e.,

x = a_1 d_1 + \cdots + a_K d_K,

where d_k ∈ D. Let a = (a_1, ..., a_K). The sparse solution a can be obtained by solving the following minimization problem:

\arg\min_{a} \| a \|_1  s.t.  \| x - Da \|_2^2 \le \varepsilon.

This optimization problem can be efficiently solved via second-order cone programming [26]. Assume there are L action types to be recognized. Then, we can separate D into L classes, i.e., D = (D_1, ..., D_L). Each D_i is learned by the K-SVD algorithm. Then, using only the set a_i of coefficients associated with class i, we compute the residual r_i(x) between x and the approximated one:

r_i(x) = \| x - D_i a_i \|_2.   (25)

Then, x is assigned to its corresponding action type that minimizes the residual by the form

Type(x) = \arg\min_i r_i(x).   (26)

Table 1 summarizes the complete classification algorithm for action event classification. The major disadvantage of this SRC scheme is its inefficiency in the minimization process to find a_i. To improve the efficiency of the SRC scheme, in Section 6.3, a novel Hamming distance classification (HDC) scheme will be proposed for real-time action analysis.

Table 1
Sparse representation-based classification scheme.
Algorithm for sparse representation classification
1. Input: a matrix of training samples D = [D_1, D_2, ..., D_i, ..., D_L] ∈ R^{n×K} for L classes, and a test sample x ∈ R^n.
2. Solve the l_1-minimization problem: a_i = \arg\min_a \| x - D_i a \|_2^2.
3. Compute the residuals r_i(x) = \| x - D_i a_i \|_2.
4. Output: EventType(x)_SRC = \arg\min_i r_i(x).

6.3. Event classification and analysis using Hamming distance

To improve the inefficiency of the SRC method, a novel HDC scheme is proposed to analyze action events between persons. In fact, if D is well learned, an input signal x can be sufficiently represented using only the training samples from the same class. It means that if x belongs to class i, most of the coefficients will be concentrated on this same class. This representation is naturally sparse. The sparser the recovered a is, the more easily the identity of x can be accurately determined.

Given two vectors x and y, the Euclidean distance is used to measure their distance

e(x, y) = (x - y)^t (x - y) = \sum_{l=1}^{n} (x_l - y_l)^2,   (27)

where x and y ∈ R^n. Let d_{ij} denote the jth column vector in the ith class D_i of D. Then, the max distance g_i between any pair of elements in D_i can be calculated by

g_i = \max_{1 \le j, k \le K_i} e(d_{i,j}, d_{i,k}).   (28)

With d_{ij}, its corresponding bit to code x is b_{i,j}(x) and can be determined by

b_{i,j}(x) = 1 if e(x, d_{i,j}) \le g_i; 0 otherwise.   (29)

When considering all elements in D, a code word B(x) can be constructed to represent x:

B(x) = [B_1, B_2, ..., B_i, ..., B_L],

where B_i = [b_{i,1}, b_{i,2}, ..., b_{i,j}, ..., b_{i,K_i}]. Fig. 9 shows an example of this coding technique, where L = 2 and K_i = 5. For the ith class, each element d_{ij} in D_i is associated with a bin c_{i,j} whose value is one when the ith action event is coded. After concatenating all the bins c_{i,j}, the code word C_i to represent the ith event type is formed as

C_i = [0_{1,1}, 0_{1,2}, ..., 0_{i-1,K_{i-1}}, 1_{i,1}, 1_{i,2}, ..., 1_{i,K_i}, 0_{i+1,1}, ..., 0_{L,K_L}].

Then, the corresponding event type of x can be decided by finding its minimum Hamming distance among the L event categories, i.e.,

Type(x) = \arg\min_{1 \le i \le L} HamDis(B(x), C_i),   (30)

where HamDis(B(x), C_i) is the Hamming distance between B(x) and C_i. Fig. 10 shows the results of the Hamming distance calculation of Fig. 9. Compared with the SRC method, the HDC method does not require an optimization process to obtain the reconstruction error and thus is more efficient in action analysis. Actually, the SRC method and the HDC scheme can be integrated to classify action events more effectively. Since the max value of HamDis(·) is K and the range of r_i(x) is [0, 1], a normalization term is added to the HDC scheme to obtain the integration form:

identity(x) = \arg\min_i \left\{ r_i(x) + \frac{HamDis(B(x), C_i)}{K} \right\},   (31)

where K = \sum_{i=1}^{L} K_i. The performance comparisons among the SRC scheme, the HDC method, and the integration scheme will be discussed in Section 7.
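The HDC classifier, and its integration with the SRC residual of Eq. (31), can be sketched as follows. This is a minimal illustration under the paper's definitions; the optional sparse-coding step here would reuse a greedy OMP routine (such as the earlier sketch) as a stand-in for the second-order cone solver used for the l_1 problem.

```python
import numpy as np

def hdc_train(D, labels):
    """Per-class thresholds g_i (Eq. (28)) and class code words C_i."""
    thresholds, codewords = {}, {}
    for i in np.unique(labels):
        Di = D[:, labels == i]
        dists = [np.sum((Di[:, j] - Di[:, k]) ** 2)              # e(d_ij, d_ik), Eq. (27)
                 for j in range(Di.shape[1]) for k in range(Di.shape[1])]
        thresholds[i] = max(dists)                               # g_i
        codewords[i] = (labels == i).astype(int)                 # C_i: ones on class i's columns
    return thresholds, codewords

def hdc_code(x, D, labels, thresholds):
    """Binary code word B(x), Eq. (29): bit j is 1 if x lies within g_i of atom d_ij."""
    b = np.zeros(D.shape[1], dtype=int)
    for j in range(D.shape[1]):
        if np.sum((x - D[:, j]) ** 2) <= thresholds[labels[j]]:
            b[j] = 1
    return b

def classify(x, D, labels, thresholds, codewords, sparse_coder=None):
    """HDC decision (Eq. (30)); if a sparse coder is supplied, add the SRC residual (Eq. (31))."""
    B = hdc_code(x, D, labels, thresholds)
    K = D.shape[1]
    a = sparse_coder(D, x) if sparse_coder is not None else None
    scores = {}
    for i, C in codewords.items():
        scores[i] = np.sum(B != C) / K                           # normalized Hamming distance
        if a is not None:                                        # optional SRC term r_i(x)
            mask = (labels == i)
            scores[i] += np.linalg.norm(x - D[:, mask] @ a[mask])
    return min(scores, key=scores.get)
```

A typical call for the integrated classifier of Eq. (31) would be classify(x, D, labels, thresholds, codewords, sparse_coder=lambda D, x: omp(D, x, T)).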
effectiveness of our methods under real conditions; that is, kicking,
7. Experimental results

To evaluate the performance of our proposed method, a real-time system to analyze different action events between two persons under different lighting conditions was implemented. Two datasets were adopted in this paper for examining the effectiveness of our method, i.e., synthetic and real videos. Four kinds of action types were created in this dataset, i.e., waving, handshaking, running, and walking. For each action type, one hundred action videos were created for training and testing, where fifty videos were collected for training and another set of fifty videos were used for testing. For the real dataset, thirty-two videos for each type were created for testing. Their training data was taken from the same synthetic dataset. In addition to the four types, three extra action types were added in the real dataset to evaluate the effectiveness of our methods under real conditions, that is, kicking, punching, and soccer-juggling, respectively. Fig. 11 shows examples of synthetic data for the four action types, i.e., "Handshaking", "Waving", "Walking", and "Running". Fig. 12 shows examples of real data for the seven action types. The dimension of each video frame is 320 × 240 pixels. The frame rate that our system can achieve is about 12 fps for the HDC method. The platform we used is a general PC with an Intel P-4 2.33 GHz CPU and 1 GB RAM, running under Windows 7. The training code was implemented in Matlab and the testing code was implemented in Visual C++ 6.0 for efficiency considerations. To make fair comparisons, three methods [16,48,49] were also compared in this paper based on the UT interaction dataset [58] and the TV interaction dataset [59].

Fig. 13 shows the results of person-to-person interaction recognition on the synthetic dataset. All the action types in this figure were correctly recognized. Table 2 shows the confusion matrix of the performance evaluation on the synthetic dataset using the SRC method. In this table, the "walking" action type is easily misclassified to the "running" type because they are very similar except for their speeds. The "handshaking" action type was sometimes misclassified to the "walking" type because their visual features are similar before the handshaking action is performed. The SRC method needs an optimization process to calculate the reconstruction error r_i(x) when classifying action events. Thus, the frame rate of the SRC method is only about 0.3 fps. The average accuracy of the SRC method is about 86%.

Table 3 shows the confusion matrix of the HDC method on the synthetic dataset. The average accuracy of the HDC method is 88.5%, which is higher than that of the SRC method. Especially for the "running" type, the accuracy of the HDC method is considerably better than the SRC method. The frame rate of the HDC method is about 12.5 fps. Since no optimization process is required, the HDC method performs more efficiently than the SRC method. Actually, the two methods can be integrated to form an ensemble action event classifier (see Eq. (31)). Table 4 shows the confusion matrix of this ensemble classifier. The average accuracy of this method is 91%. The frame rate of this integration scheme is 0.25 fps. If effectiveness is more important than efficiency, we suggest adopting the ensemble method; otherwise, the HDC method is suggested.

In Eq. (20), the parameter m of F^XY is set to the relative distance between objects X and Y. This feature lacks the distinguishability to identify the "walking" action type from the "running" action type. Thus, in Tables 3 and 4, there are many misclassifications between the two types. To treat this problem, the parameter m is changed to the speeds of X and Y. Table 5 shows the accuracy comparisons among the three methods when the speed feature was added. Clearly, the accuracy improvements in the "walking" and "running" categories are very significant. Their average accuracies are 90%, 93%, and 94%, respectively.

As for the real dataset, seven action types were recognized. The first six types focus on person-to-person action recognition and the last one is to recognize person-to-object interaction events. Fig. 14 shows the results of action type recognition in indoor environments. Fig. 15 shows the results of action type recognition in outdoor environments. Fig. 16 shows the result of recognizing a person-to-object action event. Table 6 shows the confusion matrix of the SRC method on real data. In this table, for the first four action types, their corresponding event classifiers were trained from the

Fig. 9. Examples of the Hamming coding technique.

Fig. 10. Results of the Hamming distance calculation of Fig. 9: HamDis(B(x), C_1) = 2 and HamDis(B(x), C_2) = 8.

Fig. 11. Examples of synthetic data for four action types: (a) handshaking, (b) waving, (c) walking, (d) running.

Fig. 12. Examples of real data for seven action types.

Fig. 13. Results of action recognition on the synthetic dataset.

Table 2
Confusion matrix of the SRC method on the synthetic dataset.
Action types   Handshaking (%)  Greeting (%)  Walking (%)  Running (%)
Handshaking    94               0             0            6
Greeting       0                100           0            0
Walking        16               0             80           4
Running        0                0             30           70

Table 3
Confusion matrix of the HDC method on the synthetic dataset.
Action types   Handshaking (%)  Greeting (%)  Walking (%)  Running (%)
Handshaking    84               4             4            8
Greeting       0                100           0            0
Walking        4                0             84           12
Running        0                0             14           86

Table 4
Confusion matrix of the "SRC+HDC" method on the synthetic dataset.
Action types   Handshaking (%)  Greeting (%)  Walking (%)  Running (%)
Handshaking    84               0             4            12
Greeting       0                100           0            0
Walking        0                0             92           8
Running        0                0             12           88

Table 5
Accuracy improvements and comparisons among the SRC method, the HDC method, and the ensemble one on the synthetic dataset after adding the speed feature.
Methods   Handshaking (%)  Greeting (%)  Walking (%)  Running (%)  Average (%)
SRC       94               100           86           78           90
HDC       86               100           92           92           93
SRC+HDC   86               100           96           94           94

Fig. 14. Results of action recognition on the real dataset in indoor environments.

Fig. 15. Results of action recognition on the real dataset in outdoor environments.

Fig. 16. Result of recognizing a person-to-object event.

synthetic dataset. For the last three types, their event classifiers were trained by using another set of real training data. Clearly, the average accuracy of the event classifiers trained on a real dataset is lower than those trained on the synthetic dataset because of the difficulty of foreground object detection. The worst accuracy was obtained for the "soccer juggling" category because the soccer ball is small and easily confused with its background. The average accuracy of the SRC method is 80.54%. Table 7 shows the confusion matrix of the HDC method on the real dataset. Its average accuracy is 80.99%, which is slightly higher than the SRC method. However, the HDC method is more efficient than the SRC method. Table 8 shows the confusion matrix of the ensemble method. Its average accuracy is 81.90%.

To make fair comparisons, three methods [16,48,49] were compared in this paper based on the UT dataset [58]. The UT database contains six classes of human–human interactions, including hand-shaking, hugging, kicking, pointing, punching, and pushing, respectively. The interactions involve a number of non-periodic and atomic-level actions, including stretch-arm, withdraw-arm, stretch-leg, lower-leg, and lean-forward. 120 shorter video clips (each containing a single interaction) are included in this dataset and divided into two different sets according to their capturing environments. As suggested in [48], a 10-fold leave-one-out cross validation was adopted for the performance evaluations. Table 9 shows the accuracy comparisons among [16,48,49], and our method in the first set of the UT database. For the third and fourth rows, the symbols "LCD" and "FULL" denote the methods suggested in [49] when using only the local context descriptor and the full structure, respectively. The "FULL" method performs better but is more time-consuming. Our method performs poorly in the "punch" category because some clips were often misclassified to "push". Compared with other methods, our method is better than [16,49], and comparable to [48]. Our method is more efficient than these methods because no optimization process is involved in our HDC scheme. Table 10 shows another set of comparison results when using the second set of the UT database. The accuracy of our method is the best.

In addition to the UT database, the TV human interaction dataset [59] is also adopted and compared, where four interactions are included, i.e., hand shake, high five, hug, and kiss. It is composed of 300 video clips (200 videos are labeled "positive" and 100 are negative examples) compiled from 23 different TV shows. From the 200 positive videos, we selected 80 for training and the other 120 for testing. In this dataset, the local context descriptor with 4 × 6 grids is adopted to represent each person. Table 11 lists the accuracy comparisons among [16,49], and our method in this dataset. Compared with the UT dataset, it is more challenging to deal with the TV human interaction dataset because highly dynamic backgrounds are included in each video clip. Thus, it is not surprising that lower accuracies were obtained from this TV dataset than from the UT dataset. The "kiss" type tends to be misclassified to "hug" due to their similar appearances and spatial relations. Thus, the worst case was obtained for this type. The worst result was got from

Table 6
Confusion matrix of the SRC method on real dataset.

Action types Handshaking Greeting Walking Running Kicking Punching S-juggling


Handshaking 81.25% 0% 3.25% 0% 6.25% 9.25% 0%
Greeting 0% 93.75% 0% 0% 0% 6.25% 0%
Walking 6.25% 0% 78.125% 15.625 0% 0% 0%
Running 0% 0% 25% 75% 0% 0% 0%
Kicking 0% 0% 0% 0% 84.375% 15.625% 0%
Punching 6.25% 0% 0% 0% 12.5% 81.25% 0%
Soccer-juggling 3.4% 6.9% 6.9% 6.9% 3.4% 3.4% 69%

Table 7
Confusion matrix of the HDC method on real dataset.

Action types Handshaking Greeting Walk Run Kick Punch S-juggling


Handshaking 81.25% 3.125% 6.25% 0% 3.125% 6.25% 0%
Greeting 0% 93.75% 3.125% 0% 0% 3.125% 0%
Walk 3.125% 0% 81.25% 9.375% 3.125% 3.125% 0%
Run 3.125% 0% 9.375% 84.375% 3.125% 0% 0%
Kick 0% 0% 6.25% 3.125% 84.375% 6.25% 0%
Punch 6.25% 3.125% 3.125% 0% 3.125% 75.125% 0%
Soccer-juggling 6.9% 6.9% 10.344% 6.9% 3.4% 6.9% 62.07%

Table 8
Confusion matrix of the ensemble method ‘‘SRC+HDC’’ on real dataset.

Action types Handshaking Greeting Walk Run Kick Punch S-juggling


Handshaking 84.375% 0% 6.25% 0% 3.125% 9.375% 0%
Greeting 0% 93.75% 3.125% 0% 0% 3.125% 0%
Walk 3.125% 0% 81.25% 9.375% 3.125% 3.125% 0%
Run 3.125% 0% 9.375% 84.375% 3.125% 0% 0%
Kick 0% 0% 3.125% 3.125% 84.375% 9.375% 0%
Punch 6.25% 3.125% 3.125% 0% 9.375% 75.125% 0%
Soccer-juggling 3.4% 6.9% 10.344% 6.9% 0% 3.125% 69%

Table 9
Accuracy comparisons among [16,48,49] and our method in the first set of UT-Interaction dataset [58].

Methods Action types


Shaking Hug Kick Point Punch Push AVG
Latpev [16] 50% 80% 70% 80% 60% 70% 68.3%
[48] 70% 100% 100% 100% 70% 90% 88%
LCD [49] 90% 50% 50% X 60% 100% 70%
FULL [49] 100% 100% 80% X 60% 80% 84%
Our method 90% 90% 90% 100% 70% 90% 88%

Table 10
Accuracy comparisons among [16,48,49] and our method in the second set of UT-Interaction dataset [58].

Methods Action types


Handshaking Hugging Kicking Pointing Punching Pushing AVG
Latpev [16] 50% 70% 80% 90% 50% 50% 65%
[48] 50% 90% 100% 100% 80% 40% 77%
LCD [49] 80% 100% 50% x 10% 90% 66%
FULL [49] 90% 90% 90% x 70% 90% 86%
Our method 80% 90% 90% 100% 80% 90% 88%

the STIP-based approach because there are not enough STIP features extracted for action representation. As for [49], it performs better than our approach in the "kiss" category. Table 12 shows the efficiency comparisons among [16,49], and our method on the UT-Interaction dataset [58] and the TV Human Interaction Dataset [59]. The method proposed in [49] encodes each action by projecting all possible combinations of body parts onto a nonlinear cost function. Then, the best label is obtained by solving a complicated nonlinear optimization problem after many iterations. Thus, it is very inefficient and not suitable for real-time applications. It is the most time-consuming and inefficient among the three methods. As for our method, on the UT dataset, its frame rate is about
264 H.-F. Chiang et al. / J. Vis. Commun. Image R. 30 (2015) 252–265

Table 11
Accuracy comparisons among [16,49], and our method in TV Human Interaction Dataset [59].

Methods Action types

Handshaking High Five Hug Kiss AVG

[49] 39.35% 45.82% 46.99% 37.60% 42.44%
STIP [16] 36.68% 21.71% 24.85% 26.68% 27.48%
Our method 43.33% 46.67% 46.67% 36.67% 43.33%

Table 12
Efficiency comparisons among [16,49], and our method in UT-Interaction dataset [58] and TV Human Interaction Dataset [59], respectively.

Methods Datasets

UT-Interaction dataset TV Human Interaction Dataset

[49] <0.1 fps <0.1 fps
STIP [16] 1.1 fps 0.93 fps
Our method 12.1 fps 2.4 fps
8. Conclusions

This paper has presented a novel dynamic sparse representation-based classification scheme to recognize various interaction actions between persons. Various datasets and methods have been used and compared to evaluate its performance. The three major contributions of this work are as follows:

(a) A sparse representation scheme was proposed to analyze interaction events between objects. It is the first work to analyze complicated interaction behaviors between persons using sparse representation.
(b) A Hamming distance classification scheme was proposed to improve the efficiency of sparsity-based action classification. Because of the nature of the Hamming code, it is robust to environmental changes.
(c) A real-time classification scheme was proposed to analyze person-to-person interaction events. It is well suited to real-time applications since no minimization process is involved in calculating the reconstruction error.

Experimental results have proved the superiority of the proposed system in human interaction behavior recognition using sparse representation.
References

[1] D. Weinland, R. Ronfard, E. Boyer, A survey of vision-based methods for action representation, segmentation, and recognition, Comput. Vis. Image Underst. 115 (2) (2011) 224–241.
[2] R. Poppe, A survey on vision-based human action recognition, Image Vis. Comput. 28 (6) (2010) 976–990.
[3] S. Park, J. Park, J.K. Aggarwal, Video retrieval of human interactions using model-based motion tracking and multi-layer finite state automata, Lecture Notes in Computer Science, vol. 2728, Springer, 2003, pp. 394–403.
[4] M.S. Ryoo, J.K. Aggarwal, Hierarchical recognition of human activities interacting with objects, in: 2nd International Workshop on Semantic Learning Applications in Multimedia, 2007.
[5] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Trans. Signal Process. (2006) 4311–4322.
[6] Q. Qiu, Z. Jiang, R. Chellappa, Sparse dictionary-based representation and recognition of action attributes, in: IEEE Conference on Computer Vision, 2011.
[7] J. Mairal, F. Bach, J. Ponce, G. Sapiro, Online learning for matrix factorization and sparse coding, J. Mach. Learn. Res. 11 (2010) 19–60.
[8] S. Karthikeyan, U. Gaur, B.S. Manjunath, Probabilistic subspace-based learning of shape dynamics modes for multi-view action recognition, in: IEEE International Conference on Computer Vision, 2011.
[9] Y. Wang, K. Huang, T. Tan, Human activity recognition based on R transform, in: IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[10] Y. Cong, J. Yuan, J. Liu, Sparse reconstruction cost for abnormal event detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[11] Y. Cong, J.S. Yuan, J. Liu, Abnormal event detection in crowded scenes using sparse representation, Pattern Recogn. 46 (7) (2013) 1851–1864.
[12] N.M. Oliver, B. Rosario, A.P. Pentland, A Bayesian computer vision system for modeling human interactions, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 831–843.
[13] L. Kratz, K. Nishino, Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1446–1453.
[14] D. Mahajan, N. Kwatra, S. Jain, P. Kalra, A framework for activity recognition and detection of unusual activities, in: International Conference on Graphic and Image Processing, 2004.
[15] B. Laxton, L. Jongwoo, D. Kriegman, Leveraging temporal, contextual and ordering constraints for recognizing complex activities in video, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[16] I. Laptev, P. Perez, Retrieving actions in movies, in: International Conference on Computer Vision, 2007.
[17] R. Messing, C. Pal, H. Kautz, Activity recognition using the velocity histories of tracked keypoints, in: International Conference on Computer Vision, 2009.
[18] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2005, pp. 886–893.
[19] M. Elad, M. Aharon, Image denoising via sparse and redundant representations over learned dictionaries, IEEE Trans. Image Process. 15 (12) (2006) 3736–3745.
[20] L. Shang, Non-negative sparse coding shrinkage for image denoising using normal inverse Gaussian density model, Image Vis. Comput. 26 (8) (2008) 1137–1147.
[21] W. Dong, L. Zhang, G. Shi, Centralized sparse representation for image restoration, in: IEEE International Conference on Computer Vision (ICCV), 2011.
[22] M. Zhao, S. Li, J. Kwok, Text detection in images using sparse representation with discriminative dictionaries, Image Vis. Comput. 28 (12) (2010) 1590–1599.
[23] F. Chen, Q. Wang, S. Wang, W.D. Zhang, W.L. Xu, Object tracking via appearance modeling and sparse representation, Image Vis. Comput. 29 (11) (2011) 787–796.
[24] H. Zhang, N.M. Nasrabadi, Y. Zhang, T.S. Huang, Multi-observation visual recognition via joint dynamic sparse representation, in: IEEE International Conference on Computer Vision (ICCV), 2011.
[25] X.-T. Yuan, S. Yan, Visual classification with multi-task joint sparse representation, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[26] J. Wright, A. Yang, A. Ganesh, S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009).
[27] Q. Zhang, B. Li, Discriminative K-SVD for dictionary learning in face recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[28] H.H. Zhang, N.M. Nasrabadi, Y.N. Zhang, T.S. Huang, Joint dynamic sparse representation for multi-view face recognition, Pattern Recogn. 45 (4) (2012) 1290–1298.
[29] R. Ptucha, A. Savakis, Manifold based sparse representation for facial understanding in natural images, Image Vis. Comput. 31 (5) (2013) 365–378.
[30] C.-P. Wei, Y.-W. Chao, Y.-R. Yeh, Y.-C. Frank Wang, Locality-sensitive dictionary learning for sparse representation based classification, Pattern Recogn. 45 (5) (2013) 1277–1287.
[31] Z.W. Lu, Y.X. Peng, Latent semantic learning with structured sparse representation for human action recognition, Pattern Recogn. 46 (7) (2013) 1799–1809.
[32] H.R. Wang, C.F. Yuan, W.M. Hu, C.Y. Sun, Supervised class-specific dictionary learning for sparse modeling in action recognition, Pattern Recogn. 45 (11) (2012) 3902–3911.
[33] X.G. Zhang, Y. Yang, L.C. Jiao, F. Dong, Manifold-constrained coding and sparse representation for human action recognition, Pattern Recogn. 46 (7) (2013) 1819–1831.
[34] B. Zhao, L. Fei-Fei, E.P. Xing, Online detection of unusual events in videos via dynamic sparse coding, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[35] S. Maji, L. Bourdev, J. Malik, Action recognition from a distributed representation of pose and appearance, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[36] A. Gaidon, Z. Harchaoui, C. Schmid, Actom sequence models for efficient action detection, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011.

[37] A. Fathi, G. Mori, Action recognition by learning mid-level motion features, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[38] S. Ju, et al., Hierarchical spatio-temporal context modeling for action recognition, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009, pp. 2004–2011.
[39] N.T. Nguyen, H.H. Bui, S. Venkatesh, G. West, Recognition and monitoring high-level behaviours in complex spatial environments, in: IEEE International Conference on Computer Vision and Pattern Recognition, vol. 2, Madison, Wisconsin, USA, 2003, pp. 620–625.
[40] B. Yao, L. Fei-Fei, Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses, IEEE Trans. Pattern Anal. Mach. Intell. 34 (9) (2012) 1691–1703.
[41] B. Yao, X. Jiang, A. Khosla, A.L. Lin, L.J. Guibas, L. Fei-Fei, Human action recognition by learning bases of action attributes and parts, in: International Conference on Computer Vision (ICCV), Barcelona, Spain, 2011.
[42] B. Yao, L. Fei-Fei, Grouplet: a structured image representation for recognizing human and object interactions, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2010.
[43] V. Delaitre, J. Sivic, I. Laptev, Learning person-object interactions for action recognition in still images, Adv. Neural Inf. Process. Syst. (2011).
[44] J. Kim, K. Grauman, Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2009.
[45] J.-X. Wu, et al., A scalable approach to activity recognition based on object use, in: International Conference on Computer Vision, 2007, pp. 1–8.
[46] W. Ping, D.A. Gregory, M.R. James, Quasi-periodic event analysis for social game retrieval, in: International Conference on Computer Vision, 2009.
[47] R. Filipovych, E. Ribeiro, Recognizing primitive interactions by exploring actor-object states, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–7.
[48] M.S. Ryoo, J.K. Aggarwal, Spatio-temporal relationship match: video structure comparison for recognition of complex human activities, in: IEEE International Conference on Computer Vision, 2009.
[49] A. Patron-Perez, M. Marszalek, I. Reid, A. Zisserman, Structured learning of human interactions in TV shows, IEEE Trans. Pattern Anal. Mach. Intell. 34 (12) (2012) 2441–2453.
[50] A. Patron-Perez, M. Marszalek, A. Zisserman, I. Reid, High five: recognising human interactions in TV shows, in: British Machine Vision Conference, 2010.
[51] K. Kim, T.H. Chalidabhongse, D. Harwood, L. Davis, Real-time foreground-background segmentation using codebook model, Real Time Imag. 11 (3) (2005) 172–185.
[52] L. Fengjun, R. Nevatia, Single view human action recognition using key pose matching and Viterbi path searching, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[53] D. Weinland, E. Boyer, Action recognition using exemplar-based embedding, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–7.
[54] J.-W. Hsieh, Y.-T. Hsu, H.-Y. Mark Liao, Video-based human movement analysis and its application to surveillance systems, IEEE Trans. Multimedia 10 (3) (2008) 372–384.
[55] R. Messing, C. Pal, H. Kautz, Activity recognition using the velocity histories of tracked keypoints, in: International Conference on Computer Vision, 2009.
[56] Q. Fan, et al., Recognition of repetitive sequential human activity, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[57] H.Y. Zhou, J.G. Zhang, L. Wang, Z.Y. Zhang, Lisa M. Brown, Pattern recognition special issue: sparse representation for event recognition in video surveillance, Pattern Recogn. 46 (7) (2013).
[58] UT-Interaction dataset: <http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html>.
[59] TV-Interaction dataset: <http://www.robots.ox.ac.uk/vgg/data/tv_human_interactions/>.
