Chapter 4
Behavior Analysis of Individuals
4.1 Introduction
4.2.1.1 Preprocessing
In this step, we collect consecutive blobs for a single person from which the contours
of a human blob can be extracted. As can be seen in Figure 4.1, the centroid (xc , yc )
of the human blob is determined by the following equations.
xc = (1/N) ∑_{i=1}^{N} xi ,  yc = (1/N) ∑_{i=1}^{N} yi    (4.1)
1 Portions reprinted, with permission, from Xinyu Wu, Yongsheng Ou, Huihuan Qian, and Yangsheng Xu, A Detection System for Human Abnormal Behavior, IEEE International Conference on Intelligent Robots and Systems, 2005. © IEEE.
2 Portions reprinted, with permission, from Yufeng Chen, Guoyuan Liang, Ka Keung Lee, and
where (xc , yc ) is the average contour pixel position, (xi , yi ) represent the points on
the human blob contour, and there are a total of N points on the contour. The distance
di from the centroid to the contour points is calculated by
di = √((xi − xc)² + (yi − yc)²)    (4.2)
The distances for the human blob are then transformed into coefficients by means
of DFT. Twenty primary coefficients are selected from these coefficients and thus
80 coefficients are chosen from four consecutive blobs. We then perform PCA for
feature extraction.
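The preprocessing above can be sketched in a few lines of NumPy. The centroid, distances, 20 DFT coefficients per blob, and 80-dimensional vector from four consecutive blobs follow the text; the function names and the use of coefficient magnitudes are illustrative assumptions:

```python
import numpy as np

def contour_features(contour, n_coeffs=20):
    """Distance-to-centroid signature of one blob contour, compressed by DFT."""
    contour = np.asarray(contour, dtype=float)
    centroid = contour.mean(axis=0)                    # (xc, yc), Eq. (4.1)
    d = np.linalg.norm(contour - centroid, axis=1)     # di, Eq. (4.2)
    spectrum = np.fft.fft(d)
    # Keep the magnitudes of the first n_coeffs DFT coefficients.
    return np.abs(spectrum[:n_coeffs])

def blob_sequence_features(contours):
    """Concatenate four consecutive blobs into one 80-dimensional vector."""
    return np.concatenate([contour_features(c) for c in contours])
```

The resulting vectors are what PCA is applied to in the feature extraction step.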
Suppose that we have two sets of training samples: A and B. The number of train-
ing samples in each set is N. Φi represents each eigenvector produced by PCA,
as illustrated in [54]. Each of the training samples, including positive and negative
samples, can be projected onto an axis extended by the corresponding eigenvec-
tor. By analyzing the distribution of the projected 2N points, we can roughly select
the eigenvectors that have more motion information. The following gives a detailed
description of the process.
1. For a certain eigenvector Φi, compute its mapping result for the two sets of training samples. The result can be described as λi,j (1 ≤ i ≤ M, 1 ≤ j ≤ 2N).
2. Train a classifier fi, using a simple method such as a perceptron or another simple algorithm, that separates λi,j into two groups, normal and abnormal behavior, with a minimum error E(fi).
3. If E(fi) ≥ θ, then we delete this eigenvector from the original set of eigenvectors, retaining only the eigenvectors along which the two classes can be separated with low error.
M is the number of eigenvectors, and 2N is the total number of training samples.
θ is the predefined threshold.
It is possible to select too few or even no good eigenvectors in a single PCA
process. We propose the following approach to solve this problem. We assume that
the number of training samples, 2N, is sufficiently large. We then randomly select
training samples from the two sets. The number of selected training samples in each
set is less than N/2. We then perform supervised PCA using these samples. By
repeating this process, we can collect a number of good features. This approach is
inspired by the bootstrap method, the main idea of which is to emphasize some good
features by reassembling data, which allows the features to stand out more easily.
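The resampling loop above can be sketched as follows. A simple 1-D threshold stump stands in for the perceptron mentioned in Step 2, and all parameter names (theta, rounds) are our own illustrative choices, not the chapter's:

```python
import numpy as np

def best_threshold_error(values, labels):
    """Error of the best 1-D threshold separating two classes (stand-in
    for the simple classifier fi in Step 2)."""
    order = np.argsort(values)
    labels = labels[order]
    n = len(labels)
    best = n
    for cut in range(n + 1):
        # predict class 1 to the right of the cut (and its mirror image)
        err = np.sum(labels[:cut] == 1) + np.sum(labels[cut:] == 0)
        best = min(best, err, n - err)
    return best / n

def select_eigenvectors(X, y, theta=0.2, rounds=10, rng=None):
    """Bootstrap-style supervised PCA feature selection (illustrative sketch).

    X: (2N, D) samples, y: 0/1 labels. Repeatedly subsample fewer than N/2
    samples per class, run PCA on the subsample, and keep eigenvectors along
    which the stump separates the full sample set with error below theta.
    """
    rng = np.random.default_rng(rng)
    kept = []
    for _ in range(rounds):
        idx = np.concatenate([
            rng.choice(np.where(y == c)[0],
                       size=max(2, np.sum(y == c) // 2), replace=False)
            for c in (0, 1)])
        sub = X[idx] - X[idx].mean(axis=0)
        # PCA of the centered subsample via SVD
        _, _, vt = np.linalg.svd(sub, full_matrices=False)
        for v in vt:
            if best_threshold_error(X @ v, y) < theta:
                kept.append(v)
    return np.array(kept)
```

Repeating the loop accumulates the good eigenvectors that individual PCA runs might miss.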
4.2 Learning-based Behavior Analysis
Our goal is to separate behavior into two classes, abnormal and normal, according
to a group of features. Many types of learning algorithms can be used for binary
classification problems, including SVMs , radial basis function networks (RBFNs),
the nearest neighbor algorithm, and Fisher’s linear discriminant, among others. We
choose an SVM as our training algorithm because it has a stronger theoretical foundation and better generalization performance than the other approaches.
The SVM method is a technique from the field of statistical learning theory [53], [55], [56], and can be considered a linear approach in a high-dimensional feature space. Using kernels, all input data are mapped nonlinearly into a high-dimensional feature space. Separating hyperplanes are then constructed with maximum margins, which yield a nonlinear decision boundary in the input space. Using appropriate kernel functions, it is possible to compute the separating hyperplanes without explicitly carrying out the mapping into the feature space.
In this subsection, we briefly introduce the SVM method as a new framework for
action classification. The basic idea is to map the data X into a high-dimensional
feature space F via a nonlinear mapping Φ , and to conduct linear regression in
this space.
Rreg[f] = Remp[f] + λ‖ω‖² = ∑_{i=1}^{l} C(f(xi) − yi) + λ‖ω‖²    (4.4)
where l denotes the sample size (x1 , . . . , xl ), C(·) is a loss function, and λ is a
regularization constant. For a large set of loss functions, Equation (4.4) can be min-
imized by solving a quadratic programming problem, which is uniquely solvable.
It is shown that the vector ω can be written in terms of the data points
ω = ∑_{i=1}^{l} (αi − αi*) Φ(xi)    (4.5)
Taking Equation (4.5) into account, we are able to rewrite the whole problem in terms of dot products in the low-dimensional input space as

f(x) = ∑_{i=1}^{l} (αi − αi*) (Φ(xi) · Φ(x)) + b = ∑_{i=1}^{l} (αi − αi*) K(xi, x) + b    (4.6)
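Once the dual coefficients (αi − αi*) and the bias b are known, evaluating Equation (4.6) reduces to a weighted kernel sum. A minimal sketch with an RBF kernel; the kernel choice, names, and all values are illustrative, not the chapter's trained model:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2); gamma is the kernel parameter."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

def decision_function(x, support_vectors, dual_coeffs, bias, gamma=0.5):
    """Evaluate f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b, Eq. (4.6)."""
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coeffs)) + bias
```

Only the support vectors (points with nonzero dual coefficients) contribute to the sum.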
4.2.1.4 Experiments
Our approach differs from previous detection methods in that (1) it can detect many
types of abnormal behavior using one method and the range of abnormal behavior
can be changed if needed; and (2) rather than detecting body parts ([53]) on the con-
tour, 20 primary components are extracted by DFT. In our system, we do not utilize
the position and velocity of body parts for learning, because the precise position and
velocity of body parts cannot always be obtained, which causes a high rate of false
alarms. In the training process, for example, we put the blobs of normal behavior, such as standing and walking, into one class and the blobs of abnormal behavior, such as running, into the other. PCA is then employed to select features, and an SVM is applied to classify the behavior as running or normal. In the same way, we obtain classifiers for bending down and for carrying a bar. The three SVM classifiers are then hierarchically connected.
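The hierarchical connection of the three classifiers can be sketched as a simple cascade; the interface, labels, and ordering here are illustrative assumptions, not the chapter's exact design:

```python
def classify_behavior(feature_vec, classifiers):
    """Cascade of binary classifiers, hierarchically connected.

    `classifiers` is an ordered list of (label, predict_fn) pairs, e.g.
    [("running", f1), ("bending down", f2), ("carrying a bar", f3)];
    each predict_fn returns True when its abnormal behavior is detected.
    The first classifier that fires determines the label.
    """
    for label, predict_fn in classifiers:
        if predict_fn(feature_vec):
            return label
    return "normal"
```

If no classifier in the chain fires, the behavior is labeled normal.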
After the SVM training, we test the algorithm using a series of videos, as il-
lustrated in Figure 4.2, which shows that a running person is detected while other
people are walking nearby; Figure 4.3, which shows that a person bending down is
detected; and Figure 4.4, which shows that a person carrying a bar in a crowd is
detected.
The algorithm is tested using 625 samples (each sample contains four consecu-
tive frames). The success rate for behavior classification is shown in Table 4.1.
Before a feature can be extracted, we need to decide where the most prominent region is. In the analysis of human behavior, information about motion is the most
important. In this section, we use the mean shift method to search for the region that
has a concentration of motion information.
50 4 Behavior Analysis of Individuals
The process of the basic continuously adaptive mean shift (CAMSHIFT), which
was introduced by Bradski [190], is as follows.
1. Choose the initial location of the search window and a search window size.
2. Compute the mean location in the search window and store the zeroth moment.
3. Center the search window at the mean location computed in the previous step.
4. Repeat Steps 2 and 3 until convergence (or until the mean location moves less
than a preset threshold).
5. Set the search window size equal to a function of the zeroth moment found in
Step 2.
6. Repeat Steps 2-5 until convergence (or until the mean location moves less than a
preset threshold).
In our problem formulation, attention should be paid to the window location
and size searching steps. Because a single feature is insufficient for the analysis of
human behavior, more features are retrieved based on the body’s structure. The head
and body size can be estimated according to the method explained earlier, and the region of interest can be defined as shown in Figure 4.6.
Fig. 4.6 Feature selection: The white rectangle is the detected body location, and the four small colored rectangles are the confined regions for context feature searching.
Here, four main feature regions are selected according to human movement char-
acteristics, and more detailed regions can be adopted depending on the requirements.
The locations of the search windows are restricted to the corresponding regions, and
the initial size is set to be the size of a human head. For Step 2, the centroid of the
moving part within the search window can be calculated from zero and first-order
moments [191]: M00 = ∑x ∑y I(x, y), M10 = ∑x ∑y xI(x, y), and M01 = ∑x ∑y yI(x, y),
where I(x, y) is the value of the difference image at the position (x, y).
The centroid is located at xc = M10 / M00 and yc = M01 / M00.
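Steps 2-4 of the search, using the moments above, can be sketched as follows; the window conventions and convergence threshold are illustrative assumptions:

```python
import numpy as np

def window_centroid(diff_img, x0, y0, w, h):
    """Centroid of motion inside a search window via image moments (Step 2).

    diff_img is the difference image I(x, y); the window is (x0, y0, w, h),
    with x indexing columns and y indexing rows.
    """
    patch = diff_img[y0:y0 + h, x0:x0 + w]
    m00 = patch.sum()                              # zeroth moment M00
    if m00 == 0:
        return x0 + w // 2, y0 + h // 2            # no motion: stay put
    ys, xs = np.mgrid[y0:y0 + h, x0:x0 + w]
    xc = (xs * patch).sum() / m00                  # M10 / M00
    yc = (ys * patch).sum() / m00                  # M01 / M00
    return xc, yc

def mean_shift(diff_img, x0, y0, w, h, tol=1.0, max_iter=20):
    """Steps 2-4: recenter the window on the centroid until convergence."""
    for _ in range(max_iter):
        xc, yc = window_centroid(diff_img, x0, y0, w, h)
        nx0 = min(max(int(round(xc - w / 2)), 0), diff_img.shape[1] - w)
        ny0 = min(max(int(round(yc - h / 2)), 0), diff_img.shape[0] - h)
        if abs(nx0 - x0) < tol and abs(ny0 - y0) < tol:
            break
        x0, y0 = nx0, ny0
    return x0, y0
```

The CAMSHIFT extension additionally rescales the window from M00 (Step 5) before repeating.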
The MHI method is based on the foreground image, and an appropriate threshold
is needed to transform the image into a binary one. If the difference is less than the
threshold, then it is set to zero. The first step is to update the MHI template T using
the foreground at different time stamps τ :
T(x, y) = { τ,        if I(x, y) = 1
          { 0,        if T(x, y) < τ − δ
          { T(x, y),  otherwise
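A minimal NumPy sketch of this update rule; the array conventions are our own:

```python
import numpy as np

def update_mhi(T, foreground, tau, delta):
    """One MHI template update at timestamp tau.

    T: current template; foreground: binary image I(x, y). Pixels moving
    now are stamped tau; stamps older than tau - delta decay to zero;
    everything else is left unchanged.
    """
    T = np.where(foreground == 1, tau, T)   # T <- tau where I(x, y) = 1
    T = np.where(T < tau - delta, 0, T)     # forget stale motion
    return T
```

Stamping before decaying ensures freshly updated pixels are never zeroed in the same step.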
In this part, to simplify the optimization process, we take the radial basis function (RBF) as the kernel for Equation (4.6):

K(xi, x) = exp(−γ ‖xi − x‖²)

where γ is the kernel parameter. The optimal model can then be found by searching over the kernel parameter γ and the loss function C(·).
Once the function is obtained, the classification line and the probability of the data can be directly calculated. For each action type, we establish an SVM model that gives the probability P(x | Hi) that evidence x ∈ X belongs to the hypothesized action type Hi.
Generally, the hypothesis that has the maximum probability among all of the models
is selected as the recognized action type. If all of the probabilities are smaller than
a certain threshold value, then the action is recognized as an abnormal behavior.
However, recognizing actions directly from the maximum probability over all of the hypotheses is limited, as some of the poses in different action sequences are very similar. Therefore, the relationship between context frames should be incorporated into the analysis for better results. Ideally, the probability would be conditioned on the entire motion history, but it is impossible to take all of the motion history information into account. Therefore, we simplify the probability as follows:
where

P(Hi | x) = P(Hi) P(x | Hi) / P(x) = P(Hi) P(x | Hi) / ∑_j P(x | Hj) P(Hj)
Therefore, given the starting hypothesis Hi0 and transfer matrix P(Hit |Hit−1 ), the
probability can be induced from the previous results and those of the SVM model.
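One recursive step of this scheme can be sketched as follows, with the transfer matrix P(Hit | Hit−1) and the per-action SVM likelihoods P(x | Hi) supplied as arrays; all numerical values are illustrative:

```python
import numpy as np

def bayes_update(prev_posterior, likelihoods, transfer):
    """One recursive step: predict a prior from the transfer matrix,
    then apply Bayes' rule with the SVM likelihoods.

    prev_posterior: P(H_j^{t-1} | ...) over action types
    likelihoods:    P(x_t | H_i) from the per-action SVM models
    transfer[i, j]: P(H_i^t | H_j^{t-1})
    """
    prior = transfer @ prev_posterior    # predicted P(H_i^t)
    joint = likelihoods * prior          # P(x | H_i) P(H_i)
    return joint / joint.sum()           # normalize (Bayes' rule)
```

Iterating this over frames propagates context, so similar poses are disambiguated by the preceding action sequence.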
4.2.2.6 Experiments
An action recognition experiment is carried out to evaluate the method. The exper-
iment is based on a database of five different action types, namely, hitting, kicking,
walking, waving, and double waving. The SVM model is trained with about 400 samples, and over 280 samples are tested. The data are recognized as one of six action types, including the type NOT, which means that the model does not know the action type to which the sample belongs. The recognition results for the training and testing data are
shown in Tables 4.2 and 4.3, respectively. The numbers in the tables are the numbers
of action data (row) that are classified as different actions (column). It can be seen
from the tables that the generalization of the method is acceptable, as the difference
between the testing and the training results is small; however, their accuracy needs
to be improved.
The improved results obtained using the Bayesian framework on the same training and testing data are shown in Tables 4.4 and 4.5, respectively. The
recognition rates of different methods under different situations are shown in Figure
4.7. The figure shows that the Bayesian framework-based method largely improves
the performance of SVM recognition.
The experiment is carried out on a PC running at 1.7 GHz. Taking advantage of
the MHI-based context feature, a speed of about 15 fps can be achieved, which is
efficient for real-time surveillance.
Fig. 4.7 Recognition rate comparison. The x-axis represents the different action types: walking (1), waving (2), double waving (3), hitting (4), and kicking (5). The y-axis is the recognition rate (%). Four curves are plotted: training, testing, training improved, and testing improved.
Figure 4.9 shows the detection of abnormal behavior, which includes people
falling down, bending down, and running in a household environment, based on
the tracking results.
Luo [166] built a security robot that can detect dangerous situations and provide
a timely alert, which focuses on fire, power, and intruder detection. Nikos [167]
presented a decision-theoretic strategy for surveillance as a first step towards au-
tomating the planning of the movements of an autonomous surveillance robot.
The overview of the system is shown in Figure 4.11. In the initialization stage,
two methods are employed to detect moving objects. One is to pan the camera step
by step and employ the frame differencing method to detect moving targets during
the static stage. The other method uses the passive acoustic location device to direct
the camera at the moving object, keeping the camera static and employing the frame
differencing method to detect foreground pixels. The foreground pixels are then
clustered and labeled, and the center of each labeled region is calculated as the target feature,
which is used to measure similarity in the particle filter tracking. In the tracking
process, the robot camera tracks the moving target using a particle filter tracking
algorithm, and updates the tracking target model at appropriate times. To detect
abnormal behavior, upper body (which is more rigid) tracking is implemented in
a way that focuses on the vertical position and speed of the target. At the same
time, with the help of a learning algorithm, the robot can detect abnormal audio
information, such as crying or groaning, even in other rooms.
Fig. 4.11 System overview: auto scan detection and passive acoustic location initialize particle filter tracking, which feeds behavior analysis for abnormal behavior detection; in parallel, MFCC and PCA feature extraction feed an SVM classifier for abnormal audio information detection, and the results are combined at the output.
In contrast to video surveillance, audio surveillance does not require that the
scene be “watched” directly, nor is its effectiveness affected by occlusion, which can
result in false alarms being reported by video surveillance systems. In many houses or storehouses, some areas may be occluded by moving or static objects. Also, the robot and the human operator may not be in the same room if a house has several rooms.
We propose a supervised learning-based approach to achieve audio surveillance
in a household environment [183]. Figure 4.15 shows the training framework of our
approach. First, we collect a sound effect dataset (see Table 4.6), which includes
many sound effects collected from household environments. Second, we manually
label these sound effects as abnormal (e.g., screaming, gunfire, glass breaking, bang-
ing) or normal (e.g., speech, footsteps, shower running, phone ringing) samples.
Third, an MFCC feature is extracted from a 1.0 s waveform from each sound effect
sample. Finally, we use an SVM to train a classifier. When a new 1.0 s waveform
is received, the MFCC feature is extracted from the waveform; then, the classifier is
employed to determine whether the sound sample is normal or abnormal.
Many audio feature extraction methods have been proposed for different audio
classification applications. The spectral centroid, zero-crossing rate, percentage of
low-energy frames, and spectral flux [172] methods have been used for speech and
music discrimination tasks. The spectral centroid represents the balancing point of
the spectral power distribution; the zero-crossing rate measures the dominant fre-
quency of a signal; the percentage of low-energy frames describes the skewness
of the energy distribution; and the spectral flux measures the rate of change of
the sound. Timbral, rhythmic, and pitch [173] features, which describe the timbral,
rhythmic, and pitch characteristics of music, respectively, have been proposed for
automatic genre classification.
In our approach, the MFCC feature is employed to represent audio signals. Its use
is motivated by perceptual and computational considerations. As it captures some of
the crucial properties of human hearing, it is ideal for general audio discrimination.
The MFCC feature has been successfully applied to speech recognition [174], music
modeling [175], and audio information retrieval [176]. More recently, it has been
used in audio surveillance [177].
The steps to extract the MFCC feature from the waveform are as follows.
Step 1: Normalize the waveform to the range [−1.0, 1.0] and apply a hamming win-
dow to the waveform.
Step 2: Divide the waveform into N frames, i.e., 1000/N ms per frame for a 1.0 s waveform.
Step 3: Take the fast Fourier transform (FFT) of each frame to obtain its frequency
information.
Step 4: Convert the FFT data into filter bank outputs. Since the lower frequencies are perceptually more important than the higher frequencies, the 13 filters allocated below 1000 Hz are linearly spaced (133.33 Hz between center frequencies) and the 27 filters allocated above 1000 Hz are spaced logarithmically (separated by a factor of 1.0711703 in frequency). Figure 4.16 shows the frequency response of the triangular filters.
Step 5: As the perceived loudness of a signal has been found to be approximately
logarithmic, we take the log of the filter bank outputs.
Step 6: Take the cosine transform to reduce dimensionality. As the filter bank out-
puts that are calculated for each frame are highly correlated, we take the cosine
transform that approximates PCA to decorrelate the outputs and reduce dimension-
ality. Thirteen (or so) cepstral features are obtained for each frame using this trans-
form. If we divide the waveform into 10 frames, then the total dimensionality of the
MFCC feature for a 1.0 s waveform is 130.
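The six steps can be sketched end to end in NumPy. This is a simplified illustration: the filter-bank center spacing only approximates the centers described in Step 4, and the frame overlap and liftering used by real MFCC toolkits are omitted:

```python
import numpy as np

def mfcc(waveform, sr=16000, n_frames=10, n_ceps=13):
    """Simplified MFCC of a 1.0 s waveform following Steps 1-6 (sketch)."""
    x = np.asarray(waveform, dtype=float)
    x = x / (np.max(np.abs(x)) + 1e-12)                  # Step 1: normalize
    frame_len = len(x) // n_frames                       # Step 2: N frames
    frames = x[:frame_len * n_frames].reshape(n_frames, frame_len)
    frames = frames * np.hamming(frame_len)              # Step 1: window
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # Step 3: FFT power
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    # Step 4: 40 triangular filters -- 13 linearly spaced centers below
    # ~1 kHz, then log-spaced centers separated by a factor of 1.0711703.
    pts = [133.33 + 66.665 * i for i in range(13)]
    while len(pts) < 42:                                 # 42 edges -> 40 filters
        pts.append(pts[-1] * 1.0711703)
    fbank = np.zeros((40, len(freqs)))
    for k in range(40):
        lo, mid, hi = pts[k], pts[k + 1], pts[k + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fbank[k] = np.clip(np.minimum(rising, falling), 0.0, None)
    log_fb = np.log(spec @ fbank.T + 1e-10)              # Step 5: log outputs
    # Step 6: DCT-II to decorrelate and keep n_ceps coefficients per frame
    n_filt = fbank.shape[0]
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None]
                 * (np.arange(n_filt) + 0.5) / n_filt)
    ceps = log_fb @ dct.T                                # (n_frames, n_ceps)
    return ceps.reshape(-1)                              # 130-dim for 10 frames
```

With 10 frames and 13 cepstral coefficients per frame, the output matches the 130-dimensional feature described above.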
After extracting the MFCC features from the waveforms, we employ an SVM-
trained classifier to determine whether this sound is normal or not [178][179][180].
Our goal is to separate sounds into two classes, normal and abnormal, according to a group of features. Many types of learning algorithms can be used for binary classification problems, including SVMs, RBFNs, the nearest neighbor algorithm, Fisher's linear discriminant, and so on. We choose an SVM as the audio classifier because it has a stronger theoretical foundation and better generalization performance than neural network approaches. Compared with neural network classifiers, SVMs have three
distinct characteristics. First, they estimate a classification using a set of linear func-
tions that are defined in a high-dimensional feature space. Second, they carry out
classification estimation by risk minimization, where risk is measured using Vap-
nik’s ε -insensitive loss function. Third, they are based on the structural risk mini-
mization (SRM) principle, which minimizes the risk function, which consists of the
empirical error and a regularized term.
4.4 Application: Household Surveillance Robot
To evaluate our approach, we collect 169 sound effect samples from the Internet
[184], including 128 normal and 41 abnormal samples, most of which are collected
in household environments.
For each sample, the first 1.0 s waveform is used for training and testing. The rigorous jack-knifing cross-validation procedure, which reduces the risk of overstating the results, is used to estimate classification performance. For a dataset with M samples, we choose M − 1 samples to train the classifier and then test performance using the remaining sample. This procedure is repeated M times, and the final estimation result is obtained by averaging the M accuracy rates. To train a classifier using an SVM, we apply a polynomial kernel with kernel parameter d = 2, and the adjusting parameter C in the loss function is set to 1.
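The jack-knifing (leave-one-out) procedure can be sketched as follows. A nearest-centroid classifier stands in for the polynomial-kernel SVM purely for illustration; the procedure itself is unchanged:

```python
import numpy as np

def nearest_centroid_predict(X_train, y_train, x):
    """Stand-in classifier; the chapter itself uses a polynomial-kernel SVM."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    return int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))

def jackknife_accuracy(X, y, predict=nearest_centroid_predict):
    """Leave-one-out cross-validation: train on M - 1 samples, test on the
    held-out sample, repeat M times, and average the accuracies."""
    M = len(X)
    hits = 0
    for i in range(M):
        mask = np.arange(M) != i
        hits += predict(X[mask], y[mask], X[i]) == y[i]
    return hits / M
```

Because every sample serves once as the test case, the estimate uses all M samples without ever testing on training data.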
Table 4.7 shows the accuracy rates using the MFCC feature trained with differ-
ent frame sizes, while Table 4.8 shows those using MFCC and PCA features trained
with different frame sizes. PCA of the dataset is employed to reduce the number of
dimensions. For example, the number is reduced from 637 to 300 when the frame
size is 20 ms and from 247 to 114 when the frame size is 50 ms. Comparing the
results in Tables 4.7 and 4.8, we find that it is unnecessary to use PCA. The best ac-
curacy rate obtained is 88.17% with a 100 ms frame size. The number of dimensions
of this frame size is 117 and thus dimension reduction is unnecessary.
Nowadays, abnormal behavior detection is a popular research topic, and many
studies have presented methods to detect abnormal behaviors. Our surveillance
robot is for use in household environments, so it needs to detect abnormal behaviors
such as falling down or running.
Table 4.7 Accuracy rates (%) using the MFCC feature trained with 20, 50, 100, and 500 ms frame
sizes.
Table 4.8 Accuracy rates (%) using the MFCC and PCA features trained with 20, 50, 100, and 500 ms frame sizes.
It is not easy to track the whole body of a person because of the large range of possible body gestures, which can lead to tracking failures. To solve this problem, we propose a method that tracks only the upper body of a person (Figure 4.17), which does not vary much with gesture changes. We take the upper half of the bounding rectangle as the upper body of the target. It may contain part of the legs or miss part of the upper body; a cleaner upper-body region can be obtained using the clustering method mentioned before. Based on this robust tracking system, we can obtain the speed, height, and width of the target. From the speed of the upper body and thresholds selected by experiments, running movements can be successfully detected. Similarly, based on the height and width of the target, falling-down movements can be detected through spatial analysis.
Figures 4.18 (a)-(d) show the robot moving in the proper direction to detect the
target upon receiving a direction signal from the passive acoustic location device.
Figures 4.18 (a) and (b) are blurry because the camera is moving very quickly,
whereas Figures 4.18 (e) and 4.18 (f) are clear as the robot tracks the target and
keeps it in the center of the camera’s field of view.
Fig. 4.19 The process of abnormal behavior detection utilizing video and audio information: (a) the initial state of the robot; (b) the robot turns in the direction where the abnormality occurs; (c) the initial image captured by the robot; (d) the image captured by the robot after turning toward the abnormality.
Fig. 4.20 The robot sends the image to the mobile phone of its master.
4.5 Conclusion
In this chapter, we present the behavior analysis of the actions of individuals and groups. These cases have one thing in common: the environment is not crowded, so clear human blobs are available for detailed behavior analysis. A learning-based
approach is demonstrated to be effective in the behavior analysis of individuals.
We develop both a contour- and a motion-based method of behavior analysis.
A rule-based method, because of its simplicity, also works for the classification of
unsophisticated behaviors.
In this chapter, we describe a household surveillance robot that can detect abnor-
mal events by combining video and audio surveillance techniques. Our robot first
detects a moving target by sound localization, and then tracks it across a wide field of view using a pan/tilt camera platform. The robot can detect abnormal behaviors
in a cluttered environment, such as a person suddenly running or falling down on
the floor. It can also detect abnormal audio information and employs a camera to
confirm events.
This research offers three main contributions: a) an innovative strategy for the
detection of abnormal events that utilizes video and audio information; b) an initial-
ization process that employs a passive acoustic location device to help a robot detect
moving targets; and c) an update mechanism to regularly update the target model.