H.-F. Chiang et al. / J. Vis. Commun. Image R. 30 (2015) 252–265
Article history: Received 9 November 2014; accepted 20 April 2015; available online 29 April 2015.

Keywords: Sparse coding; Sparse reconstruction error; Occlusions; R transform; Interaction action analysis; Behavior analysis; Person-to-person action recognition; Person-to-object action recognition.

Abstract: This paper proposes a novel dynamic sparsity-based classification scheme to analyze various interaction actions between persons. To address the occlusion problem, this paper represents an action in an over-complete dictionary so that errors (caused by lighting changes or occlusions) appear sparsely in the training library if the error cases are well collected. Because of this sparsity, the scheme is robust to occlusions and lighting changes. In addition, a novel Hamming distance classification (HDC) scheme is proposed to classify action events into various types. Because the Hamming code is by nature highly tolerant to noise, the HDC scheme is also robust to environmental changes. The difficulty of complicated action modeling can be easily tackled by adding more examples to the over-complete dictionary. More importantly, the HDC scheme is very efficient and suitable for real-time applications because no minimization process is involved in calculating the reconstruction error.

© 2015 Elsevier Inc. All rights reserved.
http://dx.doi.org/10.1016/j.jvcir.2015.04.012
classifiers can then be trained by using the SVM classifier. In [16], Laptev et al. extended the concept of interest points (STIP) from images to spatio-temporal volumes and then extracted various key frames to represent action events. The success of the above feature-flow methods strongly depends on a large set of well-tracked points for analyzing action events from videos.

In addition to event features, another key problem in event analysis is how to model the temporal and spatial dynamics of events. To treat this problem, Mahajan et al. [14] proposed a layered concept that divides the recognition task into three layers, i.e., the physical, logical, and event layers, which correspond to feature extraction, action representation, and event analysis, respectively. In [15], Laxton et al. used dynamic Bayesian networks to model the interaction events between objects; a Viterbi-like algorithm is then adopted to seek the best path for event interpretation. In [10], Cong et al. used the sparse representation scheme to model and identify abnormal events from normal ones. The hidden Markov model (HMM) is another commonly used scheme to model event dynamics. For example, Oliver et al. [12] used HMMs to classify human interactions into different event types like meeting, approaching, walking, and so on. Furthermore, Messing et al. [17] tracked a set of corners to obtain their velocity histories and then used HMMs to learn activity event models. Kim and Grauman [44] proposed a space–time MRF model to model the distributions of local optical flows within different regions and then estimated the degree of event normality in each region to detect abnormal events. In addition, Nguyen et al. [39] developed an HMM-based surveillance system to recognize human behaviors in various multiple-camera environments such as a corridor, a staff room, or a vision lab. The challenges related to HMMs involve how to specify and learn the HMM model structure.

A particular action between two objects can vary significantly under different conditions such as camera views, persons' clothing, object appearances, and so on. Thus, it is more challenging to analyze human events happening between two objects because of their complicated interaction relations. In addition, occlusions between objects often lead to the failure of action recognition. In [47], Filipovych and Ribeiro used a probabilistic graphical model to recognize several primitive actor-object interaction events like "grasping" or "touching" a fork (or a spoon, or a cup). In [4], Ryoo and Aggarwal used a context-free-grammar framework that integrates object recognition, motion estimation, and semantic-level recognition to hierarchically analyze various interactions between a human and an object. In [46], Ping et al. extracted SIFT points and their trajectories to describe events and then used a kernel learning scheme to analyze interaction events caused by two objects. In addition to videos, some previous works address the joint modeling of human poses, objects, and their relations in still images. For example, in [40,41], Yao and Fei-Fei proposed a random field model that encodes the mutual connections among the analyzed object, the human pose, and the body parts to recognize human-object interaction activities in still images. However, as described in [43], reliable estimation of body configurations for persons in arbitrary poses remains a very challenging problem.

This paper addresses the problem of action analysis between persons (or human-object interactions) by using the sparse representation. As described before, the complicated interaction changes and the occlusion problem between two objects pose many challenges for action recognition. To treat these problems, this paper proposes a novel dynamic sparse representation scheme to analyze interaction actions between persons. Actors often perform the same action type at different speeds; owing to the nature of sparse representation, this temporal variation problem can be easily tackled by creating a new code in the training library to capture the speed change. To our knowledge, this is the first work using the sparse representation to analyze interaction actions between persons. The sparse property makes errors (caused by different environmental changes like lighting or occlusions) appear sparsely in the training library and thus increases the robustness of our scheme to environmental changes. In the literature [57], the sparse reconstruction cost (SRC) is usually adopted to classify data into different categories; however, the calculation of the SRC is very time-consuming and unsuitable for real-time applications. Therefore, the second contribution of this paper is a Hamming distance classification (HDC) scheme that classifies data without calculating the SRC, which makes our method more efficient and suitable for real-time applications. In addition to the efficiency improvement, the HDC scheme is also robust to occlusions because the Hamming code is by nature highly tolerant to noise. The main contributions of this work include:

1. A novel sparsity-based framework is proposed to analyze person-to-person interaction behaviors. Even though the interaction relations are complicated, our method still recognizes them successfully.
2. A new Hamming distance classification scheme classifies events into multiple classes. It is robust to environmental changes due to the nature of the Hamming code.
3. The proposed sparse representation is very efficient and suitable for real-time applications because no minimization process to calculate the SRC is involved.

The remainder of the paper is organized as follows. Section 2 discusses related work. An overview of our system is introduced in Section 3. Details of feature extraction are described in Section 4. Section 5 discusses the techniques of sparse representation and library learning. The framework of person-to-person interaction action recognition using sparse representation is proposed in Section 6. The experimental results are given in Section 7. We then present our conclusions in Section 8.

2. Related work

There have been many approaches [8,9,13–18,42,44] proposed in the literature for video-based human movement analysis. For example, in [52], Fengjun and Nevatia used a posture-based scheme to convert frames into strings, from which actions were then classified into different types using action nets. Weinland and Boyer [53] selected a set of key postures to convert an action sequence into a vector; a Bayesian classifier is then designed to classify the extracted vectors into different categories. In [54], Hsieh et al. proposed a centroid context descriptor to convert postures into different symbols and then detected abnormal events directly from videos. Messing et al. [55] used the velocity histories of tracked key-points to represent actions and then combined other features, such as appearance, position, and high-level semantic features, to recognize activities of daily living. In [56], Fan et al. modified the Viterbi algorithm to detect repetitive sequential event units and then recognized the predominant retail cashier activities at the checkout station.

To model the interactions between objects, in [45], Wu et al. integrated RFID and video streams to analyze the events happening in a kitchen via a dynamic Bayesian network. In [43], Delaitre et al. proposed a statistical model that integrates order-less person-object interaction responses to recognize actions in still images by calculating spatial co-occurrences of individual body parts and objects. To achieve view invariance, Karthikeyan et al. [8] computed the R transforms on action silhouettes from multiple views and then proposed a probabilistic subspace similarity technique to recognize actions by learning their
inter-action and intra-action models. However, when recognizing an action, the scheme requires that all action sequences captured at different views be collected together simultaneously.

As for sparse representation, it is a technique that builds an over-complete dictionary matrix for target representation and recognition. It has been applied to many applications, e.g., noise removal [19,20], image recovery [21], text detection [22], object tracking [23], object classification [24,25], face recognition [26–30], action recognition [6,31–33], event analysis [10,11,34], and so on. For example, Elad and Aharon [19] addressed the image denoising problem by learning an over-complete dictionary from image patches. In [20], Shang used an extended non-negative sparse coding scheme to remove noise from data. Dong et al. [21] restored a degraded image by optimizing a set of optimal sparse coding coefficients from a dictionary. As for object detection, Zhao et al. [22] used a sparse representation with discriminative dictionaries to develop a classification-based scheme for text detection. Chen et al. [23] combined an appearance modeling scheme and sparse representation to evaluate the distance between a target and the learned appearance model for object tracking. Furthermore, in [24], Zhang et al. proposed a joint dynamic sparse representation to recognize objects from multiple views. Yuan and Yan [25] proposed a multi-task joint sparse representation scheme that reconstructs a test sample from a few training subjects with multiple features and applied it to face recognition and object classification. Wright et al. [26] proposed a sparse representation to recognize faces even under occlusions by evaluating which class leads to the minimal reconstruction error. Zhang and Li [27] incorporated the classification error into an objective function during the learning process and then proposed the discriminative K-SVD method to recognize faces. Zhang and Huang [28] addressed the challenging task of face recognition under large pose differences between gallery and probe faces by exploiting the sparse property of faces across poses. Ptucha and Savakis [29] proposed a dimensionality reduction technique that adds a locality-preserving constraint to sparse representation to find the best manifold projection for face recognition. Wei et al. [30] also added the locality-preserving constraint to sparse representation but removed the norm-bounded constraint on the learned library to improve feature representation and classification. As for action analysis, Qiu et al. [6] proposed a Gaussian process model to learn a compact and discriminative dictionary of action attributes for action recognition by incorporating the class distribution and appearance information into an objective function. Lu and Peng [31] proposed a spectral embedding approach for extracting latent semantic high-level features with structured sparse representation to analyze actions. Wang et al. [32] modified the sparse model to minimize not only the reconstruction error but also a similarity constraint that captures the correlations between similar actions. Zhang et al. [33] proposed a manifold projection scheme for dimension reduction by adding the locality-preserving constraint to sparse representation and then finding the best manifold projection for action recognition. Cong et al. [10,11] proposed a dictionary selection model with a low-rank constraint for event representation and then designed a new SRC to measure the outlier degree for abnormal event detection in crowded scenes. Zhao et al. [34] proposed a framework that detects unusual events in videos from an automatically learned dictionary by using the sparse coding scheme with online reconstructability. In all the above approaches, a sparse reconstruction cost is adopted to classify data into different classes. However, this cost is inefficient for real-time applications because an optimization process is needed to iteratively calculate the SRC. In addition, the above sparsity-based methods cannot recognize interaction actions between persons. The novelty of this paper is an HDC framework that improves the performance of sparse representation not only in efficiency but also in robustness and extensibility for interaction action analysis.

3. Flowchart of the proposed system

This paper proposes a novel recognition system to analyze action events between multiple persons using the sparse representation. Three stages are involved in this system, that is, feature extraction, dictionary learning, and action event recognition. Fig. 1 shows the flowchart of our system. Firstly, if a static camera is used, two different features are extracted from each analyzed object, i.e., the coefficients of the R transform and the HOG descriptor (see Fig. 1(a)). The former captures the outer property of the object, and the latter describes its inner property. As for a moving camera, because the object contour is no longer available, the HOF (histogram of flow) descriptor is used to replace the feature extracted from the R transform (see Fig. 1(b)). With these features, a novel sparse representation is proposed to capture the spatial and temporal changes of interaction actions between persons. Then, the K-SVD algorithm [5] is applied to learn an over-complete dictionary that represents action events more effectively and accurately. With the learned dictionary, two classification schemes, i.e., the sparse reconstruction cost (SRC) [57] and Hamming distance classification (HDC), are developed and compared for person-to-person action analysis. The major advantages of this representation are its adaptability to different occlusion conditions, its efficiency in analyzing actions in real time, and its extensibility to multiple action events between persons. In what follows, details of the feature extraction task are first discussed.

Fig. 1. Flowchart of the proposed system. (a) Static camera: R transform and HOG features are extracted, sparse reconstruction coefficients and the reconstruction error are computed over the training dictionary, and Hamming distance coding yields the action label (e.g., "walking"). (b) Moving camera: the same pipeline with the HOF descriptor in place of the R transform.

4. Feature extraction

In traditional surveillance systems, it is common to capture video sequences with a stationary camera; for actions in TV or movies, however, a moving camera is usually adopted. For a stationary camera, the background of the captured video sequence can be reconstructed using a mixture of Gaussian models or the codebook technique [51]. This allows us to extract foreground objects by subtracting the background. We then apply some simple morphological operations to remove noise. After connected component analysis, each foreground object can be well detected for action analysis. The R transform and the HOG descriptor are then adopted to describe its outer and inner properties (see Sections 4.1 and 4.2). As for a moving camera, the person-centered descriptor [49] is adopted for action analysis, as described in Section 4.3.
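To make this pipeline concrete, the following is a minimal sketch (not the authors' implementation) of the foreground extraction step using OpenCV; the mixture-of-Gaussians model stands in for the background reconstruction of [51], and the parameter values are assumptions:

```python
import cv2
import numpy as np

# Mixture-of-Gaussians background model; the codebook technique [51] is an alternative.
back_sub = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

def extract_foreground_objects(frame, min_area=400):
    """Return bounding boxes of foreground objects detected in one frame."""
    mask = back_sub.apply(frame)                           # background subtraction
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)  # simple morphological denoising
    n, _, stats, _ = cv2.connectedComponentsWithStats(mask)
    boxes = []
    for i in range(1, n):                                  # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:                               # drop small noise components
            boxes.append((x, y, w, h))
    return boxes
```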
4.1. Radon transform and R transform

Let $I(x,y)$ denote the intensity function of the detected object. From $I(x,y)$, this section uses the R transform to extract its contour features. The R transform is an improved form of the Radon transform. The two-dimensional Radon transform is the integral transform given by the integral of an image over a straight line. Let $L$ denote this straight line with the form $x\cos\theta + y\sin\theta = t$. Then, the Radon transform of $I(x,y)$ along $L$ is defined by

$$\mathrm{Radon}\{I\} = P(\theta,t) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} I(x,y)\,\delta(x\cos\theta + y\sin\theta - t)\,dx\,dy, \qquad (1)$$

where $\theta \in [0,\pi]$, $t \in (-\infty,\infty)$, and $\delta(\cdot)$ is the Dirac delta function. Fig. 2 shows the projection of $I(x,y)$ along $L$ under this transform. For scaling, translation, and rotation changes of $I(x,y)$, the transform has the following properties:

For a scaling factor $a$,
For a rotation $\theta_0$,

$$\int_{-\infty}^{\infty} P^2(\theta + \theta_0, t)\,dt = R(\theta + \theta_0). \qquad (8)$$
Fig. 3. Results of the R transform under geometrical changes. (a) Original frame; (c), (e), (g) translated, scaled, and rotated versions of (a), respectively; (b), (d), (f), (h) results of the R transform of (a), (c), (e), and (g), respectively.
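As a concrete sketch, the R transform of a detected silhouette can be computed with any Radon implementation; here scikit-image's radon is assumed, and $R(\theta) = \int P^2(\theta,t)\,dt$ follows the form appearing in Eq. (8):

```python
import numpy as np
from skimage.transform import radon

def r_transform(silhouette, n_angles=180):
    """R transform of a binary silhouette: R(theta) = integral over t of P(theta, t)^2."""
    thetas = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = radon(silhouette.astype(float), theta=thetas)  # P(theta, t); t varies along rows
    r = (sinogram ** 2).sum(axis=0)                           # square, then integrate over t
    return r / (np.linalg.norm(r) + 1e-12)                    # normalization for scale robustness
```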
4.2. HOG descriptor

To describe the inner property of the detected object, the HOG descriptor is computed from the image gradients:

$$I_x = I(x+1,y) - I(x-1,y) \quad \text{and} \quad I_y = I(x,y+1) - I(x,y-1). \qquad (9)$$

Then, the gradient magnitude $M(x,y)$ and its orientation $\theta(x,y)$ can be computed by

$$M(x,y) = \sqrt{I_x^2 + I_y^2} \quad \text{and} \quad \theta(x,y) = \tan^{-1}(I_y/I_x). \qquad (10)$$

Based on Eqs. (9) and (10), HOG counts occurrences of gradient orientation in localized portions of an image to describe the inner visual characteristics of an object. This paper divides the angle into 18 bins, where each bin accumulates the number of edge points whose angles $\theta(x,y)$ fall into it; in other words, each bin spans 20° of the range from 0° to 360°. The contribution of an edge point to the HOG is weighted by its gradient magnitude $M(x,y)$. In the real implementation, the analyzed object is further divided into four grids. Then, an ensemble of HOG descriptors with 72 bins can be formed for action analysis (72 = 4 grids × 18 bins).

4.3. Person-centered descriptor

In the person-centered descriptor, the gradient orientations of each grid are divided into five bins, i.e., horizontal, vertical, two diagonal orientations, and a no-gradient bin. For the HOF feature, the optical flow is discretized into five bins: no-motion, left, right, up, and down. Thus, the local descriptor of each grid is ten-dimensional. In [49,50], the head orientation is divided into five directions to create a compact and higher-dimensional representation from which a different classifier for each discrete head orientation can be learned. However, in our proposed method, the variations of head orientation can be easily embedded in the sparse representation, so the head pose classifier is not needed in our sparse representation scheme. In addition, different from [49,50], we divide an upper body into 4 × 6 grids rather than 8 × 8 grids to describe each person for action analysis. After concatenation, the local context descriptor of each upper body has dimension 240 (24 grids × 10 bins).
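The 72-bin ensemble HOG of Section 4.2 can be sketched as follows (a didactic version, assuming a grayscale object patch and a 2 × 2 grid layout):

```python
import numpy as np

def hog72(patch):
    """Ensemble HOG of Section 4.2: 4 grids x 18 orientation bins (20 degrees each)."""
    I = patch.astype(float)
    Ix = np.zeros_like(I); Iy = np.zeros_like(I)
    Ix[:, 1:-1] = I[:, 2:] - I[:, :-2]            # Eq. (9): I(x+1, y) - I(x-1, y)
    Iy[1:-1, :] = I[2:, :] - I[:-2, :]            # Eq. (9): I(x, y+1) - I(x, y-1)
    M = np.sqrt(Ix ** 2 + Iy ** 2)                # Eq. (10): gradient magnitude
    theta = np.degrees(np.arctan2(Iy, Ix)) % 360  # Eq. (10): orientation in [0, 360)
    bins = (theta // 20).astype(int)              # 18 bins covering 360 degrees
    h, w = I.shape
    feat = []
    for rows in (slice(0, h // 2), slice(h // 2, h)):      # four grids (2 x 2 assumed)
        for cols in (slice(0, w // 2), slice(w // 2, w)):
            hist = np.bincount(bins[rows, cols].ravel(),
                               weights=M[rows, cols].ravel(), minlength=18)
            feat.append(hist)                     # magnitude-weighted orientation histogram
    f = np.concatenate(feat)                      # 72 = 4 grids x 18 bins
    return f / (np.linalg.norm(f) + 1e-12)
```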
5. Sparse representation

5.1. Sparse coding

Given a set of training signals $X = [x_1, \ldots, x_N] \in \mathbb{R}^{n \times N}$, dictionary learning seeks a dictionary $D$ and sparse coefficients $A$ such that each signal is well approximated by a few atoms:

$$\arg\min_{D,A} \sum_{i=1}^{N} \|x_i - D\alpha_i\|_2^2 \quad \text{s.t.} \quad \|\alpha_i\|_0 \le T \;\; \forall i, \qquad (12)$$

where $D = [d_1, \ldots, d_K] \in \mathbb{R}^{n \times K}$, $A = [\alpha_1, \ldots, \alpha_N] \in \mathbb{R}^{K \times N}$, $T$ is the maximum desired number of non-zero coefficients, and $\|\alpha_i\|_0$ is the $\ell_0$-norm, which counts the number of nonzeros in the vector $\alpha_i$. Eq. (12) can be formulated as another equivalent problem:

$$\arg\min_{D,\alpha_i} \|\alpha_i\|_0 \quad \text{s.t.} \quad \|x_i - D\alpha_i\|_2^2 \le \epsilon, \qquad (13)$$

where $\epsilon$ is an error tolerance for the reconstruction. Solving this problem was proved to be NP-hard [24]; however, many approximation techniques have been proposed. One common method to solve for $\alpha_i$ is Orthogonal Matching Pursuit (OMP). OMP is a greedy method that iteratively solves for $\alpha_i$ for each $x_i$ while fixing $D$, i.e.,

$$\alpha_i = \arg\min_{\alpha} \|x_i - D\alpha\|_2^2 \quad \text{s.t.} \quad \|\alpha\|_0 \le T. \qquad (14)$$

At each stage, OMP selects the atom of $D$ that best resembles the current residual. After each such selection, the signal is back-projected onto the set of chosen atoms, and the new residual is calculated (see Fig. 5).
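A compact sketch of this greedy pursuit (a re-implementation for illustration, assuming unit-norm dictionary atoms):

```python
import numpy as np

def omp(D, x, T):
    """Orthogonal Matching Pursuit for Eq. (14); assumes unit-norm columns in D."""
    alpha = np.zeros(D.shape[1])
    residual = x.copy()
    support = []
    for _ in range(T):
        k = int(np.argmax(np.abs(D.T @ residual)))        # atom best matching the residual
        if k not in support:
            support.append(k)
        # Back-project x onto the chosen atoms (least squares on the support).
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef               # updated residual
    alpha[support] = coef
    return alpha
```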
5.2. Dictionary learning via K-SVD

This paper applies the K-SVD algorithm to solve Eq. (14) by iteratively executing a two-stage optimization process to optimize $D$ and $A$: a sparse coding stage (using matching pursuit) followed by a column-by-column dictionary update. In the update stage, for each atom $d_k$, let $n_k$ denote the residual error matrix of the signals that use $d_k$, computed with $d_k$ removed; the atom and its coefficient row $\alpha_k^{\mathrm{row}}$ are then updated by solving

$$\arg\min_{d_k,\ \alpha_k^{\mathrm{row}}} \left\| n_k - d_k\,\alpha_k^{\mathrm{row}} \right\|_F^2. \qquad (15)$$

The K-SVD scheme suggests the use of the SVD to find the new $d_k$ and $\alpha_k^{\mathrm{row}}$: the SVD gives the closest rank-1 matrix (in Frobenius norm) that approximates $n_k$. If the SVD of $n_k$ is expressed as $USV^T$, then $d_k$ is updated to the first basis vector $u_1$, and the non-zero values in $\alpha_k^{\mathrm{row}}$ are adjusted to the product of the first singular value $S(1,1)$ and the first column of $V$. The two stages are performed iteratively until convergence.
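The dictionary-update stage can be sketched as below; `A` holds the sparse codes from the pursuit stage, and the matrix named `Nk` plays the role of $n_k$ in Eq. (15):

```python
import numpy as np

def ksvd_dictionary_update(D, A, X):
    """One K-SVD sweep: update each atom d_k and its coefficient row via Eq. (15)."""
    for k in range(D.shape[1]):
        omega = np.nonzero(A[k, :])[0]        # training signals that currently use atom k
        if omega.size == 0:
            continue                           # unused atom: leave as-is
        D_k = D.copy()
        D_k[:, k] = 0.0
        Nk = X[:, omega] - D_k @ A[:, omega]   # n_k: error of selected signals without atom k
        U, S, Vt = np.linalg.svd(Nk, full_matrices=False)
        D[:, k] = U[:, 0]                      # d_k <- first left singular vector u_1
        A[k, omega] = S[0] * Vt[0, :]          # coefficient row <- S(1,1) times first column of V
    return D, A
```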
[Fig. 5. Sparse representation flow: per-frame features (R transform or HOF, plus HOG) are collected for each behavior type $D_i$; the dictionary is initialized and then refined by alternating sparse coding (matching pursuit) and a column-by-column dictionary update via the SVD.]

6. Person-to-person action recognition using sparse representation

This section deals with two challenges in human action analysis between persons: how to properly characterize the spatial–temporal information, and how to perform the subsequent comparison/recognition tasks.

6.1. Action event representation

To characterize the spatial–temporal information of an action event, this paper uses the R transform, the HOF descriptor, and the HOG descriptor to describe each frame. Let $h_R^X$, $h_{hof}^X$, and $h_{hog}^X$ denote the R transform, HOF, and HOG descriptors of an analyzed person $X$, respectively. The per-frame feature concatenates the descriptors of the two analyzed persons $X$ and $Y$:
$$F^X \oplus F^Y = \begin{cases} (h_R^X,\, h_{hog}^X,\, h_R^Y,\, h_{hog}^Y)^T, & \text{if a static camera is adopted;} \\ (h_{hof}^X,\, h_{hog}^X,\, h_{hof}^Y,\, h_{hog}^Y)^T, & \text{if a moving camera is adopted.} \end{cases} \qquad (17)$$

After normalization, $E^X$ is a unit vector satisfying $\|E^X\| = 1$. In the sparse representation, if an action class $D_i$ is represented by $K_i$ codes, $D_i$ can be written in matrix form:

$$D_i = (E^{X_1}, \ldots, E^{X_k}, \ldots, E^{X_{K_i}}), \qquad (18)$$

where $D_i \in \mathbb{R}^{n \times K_i}$ and $n$ is the dimension of $E^X$. Fig. 7(b) shows the matrix structure of $D_i$ for an action performed by a single person. Assume that there are $L$ action types to be recognized. Then, the dictionary $D$ in the sparse representation can be constructed in the form

$$D = (D_1, \ldots, D_L), \qquad (19)$$

where $D \in \mathbb{R}^{n \times K}$ and $K = \sum_{i=1}^{L} K_i$. Eqs. (16)–(19) are used to analyze single-person action events. To analyze person-to-person action events, one challenge is the occlusion problem. In the following, a novel sparse representation is proposed to treat this problem.

Assume the analyzed interaction events are performed by two persons $X$ and $Y$. The method used to model their interaction relations needs to be adjustable and robust to occlusions. As shown in Fig. 8(a), if $X$ and $Y$ are not occluded, a new feature descriptor $F^{XY}$ is constructed to describe $X$ and $Y$ as follows:

$$F^{XY} = F^X \oplus F^Y \oplus (m) = \begin{cases} (h_R^X,\, h_{hog}^X,\, h_R^Y,\, h_{hog}^Y,\, m)^T, & \text{for a static camera;} \\ (h_{hof}^X,\, h_{hog}^X,\, h_{hof}^Y,\, h_{hog}^Y,\, m)^T, & \text{for a moving camera,} \end{cases} \qquad (20)$$

where $m$ is the spatial feature (the relative distance) between $X$ and $Y$. On the other hand, if $X$ and $Y$ occlude each other (see Fig. 8(b)), we replace $X$ and $Y$ with their occluded version $O$, since there is no information to discriminate $X$ from $Y$. Then, the descriptor representing $X$ and $Y$ is constructed as follows:

$$F^{XY} = \begin{cases} (h_R^O,\, h_{hog}^O,\, h_R^O,\, h_{hog}^O,\, m)^T, & \text{for a static camera;} \\ (h_{hof}^O,\, h_{hog}^O,\, h_{hof}^O,\, h_{hog}^O,\, m)^T, & \text{for a moving camera,} \end{cases} \qquad (21)$$

where $h_R^O$, $h_{hog}^O$, and $h_{hof}^O$ are the R transform, the HOG descriptor, and the HOF descriptor of $O$, respectively, and $m$ is set to zero. Let $E^{XY}$ denote the action event performed by $X$ and $Y$. If $n_f$ frames are collected to represent $E^{XY}$, it can be constructed with the following sparse representation:

$$E^{XY} = \left( F^{XY}_1 \oplus \cdots \oplus F^{XY}_t \oplus \cdots \oplus F^{XY}_{n_f} \right), \qquad (22)$$

where $E^{XY}$ is a column feature vector. Each column in Fig. 8(b) shows the structure of $E^{XY}$. Let $n_{XY}$ denote the dimension of $E^{XY}$. Similar to Eq. (18), in the two-person case, if an action class $D_i$ is represented by $K_i$ codes, $D_i$ can be written in matrix form:

$$D_i = (E^{XY_1}, \ldots, E^{XY_{K_i}}), \qquad (23)$$

where $XY_k$ denotes the $k$th code and $D_i \in \mathbb{R}^{n_{XY} \times K_i}$. Assume there are $L$ action types to be recognized. Then, the library $D$ in the sparse representation can be integrated in the form

$$D = (D_1, \ldots, D_L), \qquad (24)$$

where $D \in \mathbb{R}^{n_{XY} \times K}$. With $D$, two classification schemes, i.e., SRC and HDC, will be proposed to classify actions into different category types.
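To illustrate Eqs. (20)–(22), the sketch below stacks the per-person descriptors and the spatial term $m$ frame by frame; the helper names are hypothetical, and `h_x` stands for the already-concatenated (R transform or HOF, plus HOG) descriptor of one person:

```python
import numpy as np

def frame_descriptor(h_x, h_y, m, occluded=False, h_o=None):
    """F_XY for one frame: Eq. (20) when X and Y are separate, Eq. (21) when occluded."""
    if occluded:
        # X and Y are replaced by their occluded version O, and m is set to zero.
        return np.concatenate([h_o, h_o, [0.0]])
    return np.concatenate([h_x, h_y, [m]])     # m: relative distance (or speed) term

def event_descriptor(frame_features):
    """E_XY (Eq. (22)): stack the n_f frame descriptors and normalize to a unit vector."""
    e = np.concatenate(frame_features)
    return e / (np.linalg.norm(e) + 1e-12)
```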
[Fig. 8. Structure of the two-person event representation: for each behavior type, every frame contributes a stacked column $(X, Y, m)^T$ when the persons are separate, or $(O, O, m)^T$ when they occlude each other.]
6.2. Event classification using sparse representation

After obtaining $D$ by the K-SVD algorithm, the action classification task can be formulated as a signal reconstruction problem. Given an input signal $x \in \mathbb{R}^n$, we consider $x$ as a linear combination of the column vectors in $D$, i.e.,

$$x = a_1 d_1 + \cdots + a_K d_K,$$

where $d_k \in D$. Let $a = (a_1, \ldots, a_K)$. The sparse solution $a$ can be obtained by solving the following minimization problem:

$$\arg\min_{a} \|a\|_1 \quad \text{s.t.} \quad \|x - Da\|_2^2 \le \epsilon.$$

This optimization problem can be efficiently solved via second-order cone programming [26]. Assume there are $L$ action types to be recognized. Then, we can separate $D$ into $L$ classes, i.e., $D = (D_1, \ldots, D_L)$, where each $D_i$ is learned by the K-SVD algorithm. Then, using only the set $a_i$ of coefficients associated with class $i$, we compute the residual $r_i(x)$ between $x$ and its approximation:

$$r_i(x) = \|x - D_i a_i\|_2. \qquad (25)$$

Then, $x$ is assigned to the action type that minimizes the residual:

$$\mathrm{Type}(x) = \arg\min_{i} r_i(x). \qquad (26)$$

Table 1 summarizes the complete algorithm for action event classification. The major disadvantage of this SRC scheme is the inefficiency of the minimization process used to find $a_i$. To improve the efficiency of the SRC scheme, a novel Hamming distance classification (HDC) scheme suitable for real-time action analysis is proposed in Section 6.3.
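A sketch of the SRC rule of Eqs. (25) and (26), substituting scikit-learn's OMP for the second-order cone solver used in the paper:

```python
import numpy as np
from sklearn.linear_model import orthogonal_mp

def classify_src(D, atom_class, x, n_nonzero=10):
    """Eqs. (25)-(26): reconstruct x class by class and pick the minimal residual."""
    a = orthogonal_mp(D, x, n_nonzero_coefs=n_nonzero)  # sparse code over the full dictionary
    labels = np.unique(atom_class)
    residuals = [np.linalg.norm(x - D[:, atom_class == i] @ a[atom_class == i])
                 for i in labels]                        # r_i(x) = ||x - D_i a_i||_2
    return labels[int(np.argmin(residuals))]
```

Here `atom_class` is a hypothetical array giving the class index of every dictionary column.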
6.3. Event classification and analysis using Hamming distance

To remedy the inefficiency of the SRC method, a novel HDC scheme is proposed to analyze action events between persons. In fact, if $D$ is well learned, an input signal $x$ can be sufficiently represented using only the training samples from its own class: if $x$ belongs to class $i$, most of the coefficients will be concentrated on that class. This representation is naturally sparse, and the sparser the recovered $a$ is, the more accurately the identity of $x$ can be determined.

Given two vectors $x$ and $y$, the Euclidean distance is used to measure their distance:

$$e(x,y) = (x-y)^t (x-y) = \sum_{l=1}^{n} (x_l - y_l)^2, \qquad (27)$$

where $x, y \in \mathbb{R}^n$. Let $d_{i,j}$ denote the $j$th column vector in the $i$th class $D_i$ of $D$. Then, the maximum distance $g_i$ between any pair of elements in $D_i$ can be calculated by

$$g_i = \max_{1 \le j,k \le K_i} e(d_{i,j}, d_{i,k}). \qquad (28)$$

With $d_{i,j}$, the corresponding bit $b_{i,j}(x)$ used to code $x$ is determined by

$$b_{i,j}(x) = \begin{cases} 1, & \text{if } e(x, d_{i,j}) \le g_i; \\ 0, & \text{otherwise.} \end{cases} \qquad (29)$$

When considering all elements in $D$, a code word $B(x)$ can be constructed to represent $x$:

$$B(x) = [B_1, B_2, \ldots, B_i, \ldots, B_L],$$

where $B_i = [b_{i,1}, b_{i,2}, \ldots, b_{i,j}, \ldots, b_{i,K_i}]$. Fig. 9 shows an example of this coding technique, where $L = 2$ and $K_i = 5$. For the $i$th class, each element $d_{i,j}$ in $D_i$ is associated with a bin $c_{i,j}$ whose value is one when the $i$th action event is coded. After concatenating all the bins $c_{i,j}$, the code word $C_i$ representing the $i$th event type is formed as

$$C_i = [0_{1,1}, 0_{1,2}, \ldots, 0_{i-1,K_{i-1}}, 1_{i,1}, 1_{i,2}, \ldots, 1_{i,K_i}, 0_{i+1,1}, \ldots, 0_{L,K_L}].$$

Then, the event type of $x$ can be decided by finding the minimum Hamming distance among the $L$ event categories, i.e.,

$$\mathrm{Type}(x) = \arg\min_{1 \le i \le L} \mathrm{HamDis}(B(x), C_i), \qquad (30)$$

where $\mathrm{HamDis}(B(x), C_i)$ is the Hamming distance between $B(x)$ and $C_i$. Fig. 10 shows the Hamming distance calculation for the example of Fig. 9. Compared with the SRC method, the HDC method does not require an optimization process to obtain the reconstruction error and is thus more efficient for action analysis. Actually, the SRC method and the HDC scheme can be integrated to classify action events even more effectively. Since the maximum value of $\mathrm{HamDis}(\cdot)$ is $K$ and the range of $r_i(x)$ is $[0,1]$, a normalization term is added to the HDC scheme to obtain the integrated form:

$$\mathrm{identity}(x) = \arg\min_{i} \left[ r_i(x) + \frac{\mathrm{HamDis}(B(x), C_i)}{K} \right], \qquad (31)$$

where $K = \sum_{i=1}^{L} K_i$. The performance comparisons among the SRC scheme, the HDC method, and the integrated scheme are discussed in Section 7.
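Because HDC needs only distance thresholding and bit counting, it avoids any optimization at test time. A minimal sketch of Eqs. (27)–(30), with the per-class thresholds $g_i$ precomputed via Eq. (28) and `atom_class` a hypothetical per-column class index:

```python
import numpy as np

def hamming_code(D, atom_class, g, x):
    """B(x): bit b_ij = 1 iff e(x, d_ij) <= g_i, with e the squared distance of Eq. (27)."""
    dists = np.sum((D - x[:, None]) ** 2, axis=0)       # e(x, d_ij) for every atom
    thresholds = np.array([g[c] for c in atom_class])   # per-class bound g_i from Eq. (28)
    return (dists <= thresholds).astype(np.uint8)

def classify_hdc(D, atom_class, g, code_words, x):
    """Eq. (30): choose the class whose code word C_i is nearest in Hamming distance."""
    b = hamming_code(D, atom_class, g, x)
    return min(code_words, key=lambda i: int(np.count_nonzero(b != code_words[i])))
```

The integrated rule of Eq. (31) is then a one-line combination: add $r_i(x)$ from the SRC computation to $\mathrm{HamDis}(B(x), C_i)/K$ before taking the minimum.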
7. Experimental results

To evaluate the performance of our proposed method, a real-time system analyzing different action events between two persons under different lighting conditions was implemented. Two datasets were adopted to examine the effectiveness of our method, i.e., synthetic and real videos. Four action types were created in the synthetic dataset, i.e., waving, handshaking, running, and walking. For each action type, one hundred action videos were created for training and testing: fifty videos were collected for training and another fifty were used for testing. For the real dataset, thirty-two videos per type were created for testing; their training data was taken from the same synthetic dataset. In addition to the four types, three extra action types were added to the real dataset to evaluate the effectiveness of our methods under real conditions, that is, kicking, punching, and soccer-juggling. Fig. 11 shows examples of synthetic data for the four action types, i.e., "Handshaking", "Waving", "Walking", and "Running". Fig. 12 shows examples of real data for the seven action types. The dimension of each video frame is 320 × 240 pixels. The frame rate our system achieves is about 12 fps for the HDC method. The platform is a general PC with an Intel Pentium 4 CPU at 2.33 GHz and 1 GB RAM, running Windows 7. The training code was implemented in Matlab, and the testing code was implemented in Visual C++ 6.0 for efficiency. To make fair comparisons, three methods [16,48,49] were also compared in this paper based on the UT interaction dataset [58] and the TV interaction dataset [59].

Table 1
Sparse representation-based classification scheme.

Algorithm for sparse representation classification
1. Input: a matrix of training samples $D = [D_1, D_2, \ldots, D_i, \ldots, D_L] \in \mathbb{R}^{n \times K}$ for $L$ classes, and a test sample $x \in \mathbb{R}^n$.
2. Solve the $\ell^1$-minimization problem: $a = \arg\min_a \|a\|_1$ s.t. $\|x - Da\|_2^2 \le \epsilon$.
3. Compute the residuals $r_i(x) = \|x - D_i a_i\|_2$.
4. Output: $\mathrm{Type}(x)_{SRC} = \arg\min_i r_i(x)$.

Fig. 13 shows the results of person-to-person interaction recognition on the synthetic dataset. All the action types in this figure were correctly recognized. Table 2 shows the confusion matrix of the performance evaluation on the synthetic dataset using the SRC method. In this table, the "walking" action type is easily misclassified as the "running" type because the two are very similar except for their speeds. The "handshaking" action type was sometimes misclassified as the "walking" type because their visual features are similar before the handshaking action is performed. The SRC method needs an optimization process to calculate the reconstruction error $r_i(x)$; thus, its frame rate is only about 0.3 fps. The average accuracy of the SRC method is about 86%.

Table 2
Confusion matrix of the SRC method on the synthetic dataset.

Action types   Handshaking (%)   Greeting (%)   Walking (%)   Running (%)
Handshaking    94                0              0             6
Greeting       0                 100            0             0
Walking        16                0              80            4
Running        0                 0              30            70

Table 3 shows the confusion matrix of the HDC method on the synthetic dataset. The average accuracy of the HDC method is 88.5%, which is higher than that of the SRC method; especially for the "running" type, the accuracy of the HDC method is considerably better. The frame rate of the HDC method is about 12.5 fps. Since no optimization process is required, the HDC method is more efficient than the SRC method.

Table 3
Confusion matrix of the HDC method on the synthetic dataset.

Action types   Handshaking (%)   Greeting (%)   Walking (%)   Running (%)
Handshaking    84                4              4             8
Greeting       0                 100            0             0
Walking        4                 0              84            12
Running        0                 0              14            86

Actually, the two methods can be integrated to form an ensemble action event classifier (see Eq. (31)). Table 4 shows the confusion matrix of this ensemble classifier. Its average accuracy is 91%, and the frame rate of the integrated scheme is 0.25 fps. If effectiveness matters more than efficiency, we suggest adopting the ensemble method; otherwise, the HDC method is suggested.

Table 4
Confusion matrix of the "SRC+HDC" method on the synthetic dataset.

Action types   Handshaking (%)   Greeting (%)   Walking (%)   Running (%)
Handshaking    84                0              4             12
Greeting       0                 100            0             0
Walking        0                 0              92            8
Running        0                 0              12            88

In Eq. (20), the parameter $m$ of $F^{XY}$ is set to the relative distance between the objects $X$ and $Y$. This feature lacks the distinguishability to separate the "walking" action type from the "running" type; thus, in Tables 3 and 4, there are many misclassifications between the two types. To treat this problem, the parameter $m$ is changed to the speeds of $X$ and $Y$. Table 5 shows the accuracy comparisons among the three methods when the speed feature is added. Clearly, the accuracy improvements in the "walking" and "running" categories are very significant; the average accuracies become 90%, 93%, and 94%, respectively.

As for the real dataset, seven action types were recognized. The first six types focus on person-to-person action recognition, and the last one recognizes person-to-object interaction events. Fig. 14 shows the results of action type recognition in indoor environments, Fig. 15 shows the results in outdoor environments, and Fig. 16 shows the result of recognizing a person-to-object action event. Table 6 shows the confusion matrix of the SRC method on the real data. In this table, the event classifiers of the first four action types were trained from the synthetic dataset; for the last three types, the event classifiers were trained using another set of real training data. Clearly, the average accuracy of an event classifier trained on real data is lower than that of one trained on the synthetic dataset because of the difficulty of foreground object detection. The worst accuracy was obtained in the "soccer-juggling" category because the soccer ball is small and easily confused with its background. The average accuracy of the SRC method is 80.54%. Table 7 shows the confusion
matrix of the HDC method on the real dataset. Its average accuracy is 80.99%, slightly higher than that of the SRC method; moreover, the HDC method is more efficient than the SRC method. Table 8 shows the confusion matrix of the ensemble method, whose average accuracy is 81.90%.

Table 5
Accuracy improvements and comparisons among the SRC method, the HDC method, and the ensemble method on the synthetic dataset after adding the speed feature.

Methods    Handshaking (%)   Greeting (%)   Walking (%)   Running (%)   Average (%)
SRC        94                100            86            78            90
HDC        86                100            92            92            93
SRC+HDC    86                100            96            94            94

To make fair comparisons, three methods [16,48,49] were compared based on the UT dataset [58]. The UT database contains six classes of human–human interactions: hand-shaking, hugging, kicking, pointing, punching, and pushing. The interactions involve a number of non-periodic and atomic-level actions, including stretch-arm, withdraw-arm, stretch-leg, lower-leg, and lean-forward. 120 short video clips (each containing a single interaction) are included in this dataset and divided into two different sets according to their capturing environments. As suggested in [48], 10-fold leave-one-out cross validation was adopted for the performance evaluations. Table 9 shows the accuracy comparisons among [16,48,49] and our method on the first set of the UT database. For the third and fourth rows, the symbols "LCD" and "FULL" denote the methods suggested in [49] when using only the local context descriptor and the full structure, respectively. The "FULL" method performs better but is more time-consuming. Our method performs poorly in the "punch" category because some clips were often misclassified as "push". Compared with the other methods, ours is better than [16,49] and comparable to [48], while being more efficient because no optimization process is involved in our HDC scheme. Table 10 shows another set of comparison results using the second set of the UT database; there, the accuracy of our method is the best.

In addition to the UT database, the TV human interaction dataset [59] is also adopted for comparison; it covers four interactions, i.e., hand shake, high five, hug, and kiss. It is composed of 300 video clips (200 labeled "positive" and 100 negative examples) compiled from 23 different TV shows. From the 200 positive videos, we selected 80 for training and the other 120 for testing. In this dataset, the local context descriptor with 4 × 6 grids is adopted to represent each person. Table 11 lists the accuracy comparisons among [16,49] and our method on this dataset. Compared with the UT dataset, the TV human interaction dataset is more challenging because highly dynamic backgrounds are included in each video clip; thus, it is not surprising that lower accuracies were obtained on this TV dataset than on the UT dataset. The "kiss" type tends to be misclassified as "hug" due to their similar appearances and spatial relations, so the worst case occurred for this type. The worst result overall was obtained by
Table 6
Confusion matrix of the SRC method on the real dataset.

Table 7
Confusion matrix of the HDC method on the real dataset.

Table 8
Confusion matrix of the ensemble method "SRC+HDC" on the real dataset.

Table 9
Accuracy comparisons among [16,48,49] and our method on the first set of the UT-Interaction dataset [58].

Table 10
Accuracy comparisons among [16,48,49] and our method on the second set of the UT-Interaction dataset [58].
the STIP-based approach [16], because not enough STIP features are extracted for action representation. As for [49], it performs better than our approach in the "kiss" category. Table 12 shows the efficiency comparisons among [16,49] and our method on the UT-Interaction dataset [58] and the TV human interaction dataset [59]. The method proposed in [49] encodes each action by projecting all possible combinations of body parts onto a nonlinear cost function; the best label is then obtained by solving a complicated nonlinear optimization problem over many iterations. Thus, it is very inefficient and unsuitable for real-time applications, and it is the most time-consuming of the three methods. As for our method, on the UT dataset its frame rate is about
[37] A. Fathi, G. Mori, Action recognition by learning mid-level motion features, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–8.
[38] S. Ju, et al., Hierarchical spatio-temporal context modeling for action recognition, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009, pp. 2004–2011.
[39] N.T. Nguyen, H.H. Bui, S. Venkatesh, G. West, Recognition and monitoring high-level behaviours in complex spatial environments, in: IEEE International Conference on Computer Vision and Pattern Recognition, vol. 2, Madison, Wisconsin, USA, 2003, pp. 620–625.
[40] B. Yao, L. Fei-Fei, Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses, IEEE Trans. Pattern Anal. Mach. Intell. 34 (9) (2012) 1691–1703.
[41] B. Yao, X. Jiang, A. Khosla, A.L. Lin, L.J. Guibas, L. Fei-Fei, Human action recognition by learning bases of action attributes and parts, in: International Conference on Computer Vision (ICCV), Barcelona, Spain, 2011.
[42] B. Yao, L. Fei-Fei, Grouplet: a structured image representation for recognizing human and object interactions, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2010.
[43] V. Delaitre, J. Sivic, I. Laptev, Learning person-object interactions for action recognition in still images, Adv. Neural Inf. Process. Syst. (2011).
[44] J. Kim, K. Grauman, Observe locally, infer globally: a space-time MRF for detecting abnormal activities with incremental updates, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2009.
[45] J.-X. Wu, et al., A scalable approach to activity recognition based on object use, in: International Conference on Computer Vision, 2007, pp. 1–8.
[46] W. Ping, D.A. Gregory, M.R. James, Quasi-periodic event analysis for social game retrieval, in: International Conference on Computer Vision, 2009.
[47] R. Filipovych, E. Ribeiro, Recognizing primitive interactions by exploring actor-object states, in: IEEE Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–7.
[48] M.S. Ryoo, J.K. Aggarwal, Spatio-temporal relationship match: video structure comparison for recognition of complex human activities, in: IEEE International Conference on Computer Vision, 2009.
[49] A. Patron-Perez, M. Marszalek, I. Reid, A. Zisserman, Structured learning of human interactions in TV shows, IEEE Trans. Pattern Anal. Mach. Intell. 34 (12) (2012) 2441–2453.
[50] A. Patron-Perez, M. Marszalek, A. Zisserman, I. Reid, High five: recognising human interactions in TV shows, in: British Machine Vision Conference, 2010.
[51] K. Kim, T.H. Chalidabhongse, D. Harwood, L. Davis, Real-time foreground-background segmentation using codebook model, Real Time Imag. 11 (3) (2005) 172–185.
[52] L. Fengjun, R. Nevatia, Single view human action recognition using key pose matching and Viterbi path searching, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8.
[53] D. Weinland, E. Boyer, Action recognition using exemplar-based embedding, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2008, pp. 1–7.
[54] J.-W. Hsieh, Y.-T. Hsu, H.-Y. Mark Liao, Video-based human movement analysis and its application to surveillance systems, IEEE Trans. Multimedia 10 (3) (2008) 372–384.
[55] R. Messing, C. Pal, H. Kautz, Activity recognition using the velocity histories of tracked keypoints, in: International Conference on Computer Vision, 2009.
[56] Q. Fan, et al., Recognition of repetitive sequential human activity, in: IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[57] H.Y. Zhou, J.G. Zhang, L. Wang, Z.Y. Zhang, L.M. Brown, Pattern recognition special issue: sparse representation for event recognition in video surveillance, Pattern Recogn. 46 (7) (2013).
[58] UT-Interaction dataset: <http://cvrc.ece.utexas.edu/SDHA2010/Human_Interaction.html>.
[59] TV-Interaction dataset: <http://www.robots.ox.ac.uk/vgg/data/tv_human_interactions/>.