
Chapter 4

Behavior Analysis of Individuals¹,²

4.1 Introduction

After successful segmentation and tracking, the temporal-spatial information of
each blob sequence can be indexed for further behavior analysis. In this chapter, we
elaborate two kinds of approaches for the behavior analysis of individuals: learning-
based and rule-based.

4.2 Learning-based Behavior Analysis

4.2.1 Contour-based Feature Analysis

4.2.1.1 Preprocessing

In this step, we collect consecutive blobs for a single person from which the contours
of a human blob can be extracted. As can be seen in Figure 4.1, the centroid (xc , yc )
of the human blob is determined by the following equations.

x_c = (1/N) Σ_{i=1}^{N} x_i ,    y_c = (1/N) Σ_{i=1}^{N} y_i        (4.1)

¹ Portions reprinted, with permission, from Xinyu Wu, Yongsheng Ou, Huihuan Qian, and
Yangsheng Xu, A Detection System for Human Abnormal Behavior, IEEE International
Conference on Intelligent Robot Systems, 2005. © IEEE.
² Portions reprinted, with permission, from Yufeng Chen, Guoyuan Liang, Ka Keung Lee, and
Yangsheng Xu, Abnormal Behavior Detection by Multi-SVM-Based Bayesian Network,
International Conference on Information Acquisition, 2007. © IEEE.


where (x_c, y_c) is the average contour pixel position, (x_i, y_i) are the points on
the human blob contour, and there are a total of N points on the contour. The distance
d_i from the centroid to the contour points is calculated by

d_i = √((x_i − x_c)² + (y_i − y_c)²)        (4.2)

The distances for the human blob are then transformed into coefficients by means of
the discrete Fourier transform (DFT). Twenty primary coefficients are selected from
these coefficients, and thus 80 coefficients are obtained from four consecutive blobs.
We then perform principal component analysis (PCA) for feature extraction.

Fig. 4.1 Target preprocessing.
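In Python/NumPy terms, the preprocessing chain of Equations (4.1) and (4.2) plus the
DFT compression might look like the following sketch; the contour format (an N × 2
array of points) and the use of coefficient magnitudes are assumptions, as the text
does not fix them.

    import numpy as np

    def contour_signature(contour, n_coeffs=20):
        """Distance-to-centroid signature of one blob contour (Eqs. 4.1-4.2),
        compressed to its first DFT coefficients."""
        xs, ys = contour[:, 0], contour[:, 1]
        xc, yc = xs.mean(), ys.mean()                  # centroid, Eq. (4.1)
        d = np.sqrt((xs - xc) ** 2 + (ys - yc) ** 2)   # distances, Eq. (4.2)
        spectrum = np.fft.fft(d)
        return np.abs(spectrum[:n_coeffs])             # 20 primary coefficients

    def blob_sequence_feature(contours):
        """Concatenate signatures of four consecutive blobs into an 80-D vector."""
        return np.concatenate([contour_signature(c) for c in contours])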

4.2.1.2 Supervised PCA for Feature Generation

Suppose that we have two sets of training samples: A and B. The number of train-
ing samples in each set is N. Φi represents each eigenvector produced by PCA,
as illustrated in [54]. Each of the training samples, including positive and negative
samples, can be projected onto an axis extended by the corresponding eigenvec-
tor. By analyzing the distribution of the projected 2N points, we can roughly select
the eigenvectors that have more motion information. The following gives a detailed
description of the process.
1. For a certain eigenvector Φ_i, compute its mapping result according to the two sets
of training samples. The result can be described as λ_{i,j} (1 ≤ i ≤ M, 1 ≤ j ≤ 2N).
2. Train a classifier f_i, using a simple method such as the perceptron or another
simple algorithm, that separates λ_{i,j} into two groups, normal and abnormal
behavior, with a minimum error E(f_i).
3. If E(f_i) > θ, i.e., even the best simple classifier cannot separate the projections
well, then we delete this eigenvector from the original set of eigenvectors.
Here, M is the number of eigenvectors, 2N is the total number of training samples,
and θ is a predefined threshold.
A single PCA pass may select too few or even no good eigenvectors. We propose the
following approach to solve this problem. Assuming that the number of training
samples, 2N, is sufficiently large, we randomly select fewer than N/2 training samples
from each of the two sets and perform supervised PCA using these samples. By
repeating this process, we can collect a number of good features. This approach is
inspired by the bootstrap method, the main idea of which is to emphasize good
features by resampling the data, which allows those features to stand out more easily.
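A minimal sketch of this supervised PCA selection with bootstrap-style resampling is
given below; the perceptron classifier, the threshold θ = 0.2, and the number of
rounds are illustrative placeholders rather than values from the text.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import Perceptron

    def select_good_eigenvectors(A, B, theta=0.2, rounds=10):
        """Keep eigenvectors whose 1-D projections separate the two classes
        with error below theta, accumulated over several bootstrap rounds."""
        X = np.vstack([A, B])
        y = np.hstack([np.zeros(len(A)), np.ones(len(B))])
        good = []
        rng = np.random.default_rng(0)
        for _ in range(rounds):
            # draw roughly N/2 samples per class, as in the text
            ia = rng.choice(len(A), size=len(A) // 2, replace=False)
            ib = rng.choice(len(B), size=len(B) // 2, replace=False) + len(A)
            idx = np.concatenate([ia, ib])
            pca = PCA().fit(X[idx])
            for phi in pca.components_:                 # each eigenvector Phi_i
                lam = X[idx] @ phi                      # projections lambda_{i,j}
                f = Perceptron().fit(lam.reshape(-1, 1), y[idx])
                err = 1.0 - f.score(lam.reshape(-1, 1), y[idx])
                if err < theta:                         # keep discriminative axes
                    good.append(phi)
        return np.array(good)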

4.2.1.3 SVM classifiers

Our goal is to separate behavior into two classes, abnormal and normal, according
to a group of features. Many types of learning algorithms can be used for binary
classification problems, including SVMs, radial basis function networks (RBFNs),
the nearest neighbor algorithm, and Fisher's linear discriminant, among others. We
choose an SVM as our training algorithm because it has a stronger theoretical
interpretation and better generalization performance than the other approaches.
The SVM method is a new technique in the field of statistical learning the-
ory [53], [55], [56], and can be considered a linear approach for high-dimensional
feature spaces. Using kernels, all input data are mapped nonlinearly into a high-
dimensional feature space. Separating hyperplanes are then constructed with maxi-
mum margins, which yield a nonlinear decision boundary in the input space. Using
appropriate kernel functions, it is possible to compute the separating hyperplanes
without explicitly mapping the feature space.
In this subsection, we briefly introduce the SVM method as a new framework for
action classification. The basic idea is to map the data X into a high-dimensional
feature space F via a nonlinear mapping Φ , and to conduct linear regression in
this space.

f(x) = (ω · Φ(x)) + b,    Φ: R^N → F,  ω ∈ F        (4.3)

where b is the threshold. Thus, linear regression in a high-dimensional feature space
corresponds to nonlinear regression in the low-dimensional input space R^N. Note
that the dot product in Equation (4.3) between ω and Φ(x) would have to be computed
in this high-dimensional space (which is usually intractable) were we not able to use
a kernel, which eventually leaves us with dot products that can be implicitly evaluated
in the low-dimensional input space R^N. As Φ is fixed, we determine ω from the data
by minimizing the sum of the empirical risk R_emp[f] and a complexity term ‖ω‖²,
which enforces flatness in the feature space.

R_reg[f] = R_emp[f] + λ‖ω‖² = Σ_{i=1}^{l} C(f(x_i) − y_i) + λ‖ω‖²        (4.4)

where l denotes the sample size (x1 , . . . , xl ), C(·) is a loss function, and λ is a
regularization constant. For a large set of loss functions, Equation (4.4) can be min-
imized by solving a quadratic programming problem, which is uniquely solvable.
It is shown that the vector ω can be written in terms of the data points as

ω = Σ_{i=1}^{l} (α_i − α_i^∗) Φ(x_i)        (4.5)

with α_i, α_i^∗ being the solution of the aforementioned quadratic programming
problem. α_i and α_i^∗ have an intuitive interpretation as forces pushing and pulling
the estimate f(x_i) towards the measurements y_i. Taking Equations (4.5) and (4.3)

into account, we are able to rewrite the whole problem in terms of dot products in
the low-dimensional input space as

f(x) = Σ_{i=1}^{l} (α_i − α_i^∗)(Φ(x_i) · Φ(x)) + b = Σ_{i=1}^{l} (α_i − α_i^∗) K(x_i, x) + b        (4.6)

where α_i and α_i^∗ are Lagrangian multipliers, and the x_i are the support vectors.
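Equation (4.6) can be evaluated directly once the dual coefficients (α_i − α_i^∗), the
support vectors, and b are available, e.g., from a trained SVM; a small illustrative
sketch, with the RBF kernel and γ value as assumed placeholders:

    import numpy as np

    def svm_decision(x, support_vectors, dual_coefs, b, kernel):
        """Evaluate f(x) = sum_i (alpha_i - alpha_i^*) K(x_i, x) + b (Eq. 4.6)."""
        return sum(a * kernel(xi, x)
                   for a, xi in zip(dual_coefs, support_vectors)) + b

    # Example kernel: K(xi, x) = exp(-gamma * ||xi - x||^2)
    rbf = lambda xi, x, gamma=0.5: np.exp(-gamma * np.sum((xi - x) ** 2))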

Fig. 4.2 A running person

4.2.1.4 Experiments

Our approach differs from previous detection methods in that (1) it can detect many
types of abnormal behavior with one method, and the range of abnormal behaviors
can be changed if needed; and (2) rather than detecting body parts on the contour
([53]), 20 primary components are extracted by DFT. In our system, we do not use
the position and velocity of body parts for learning, because the precise position and
velocity of body parts cannot always be obtained, which causes a high rate of false
alarms. In the training process, for example, we put the blobs of normal behavior,
such as standing and walking, into one class and the blobs of abnormal behavior,
such as running, into another. PCA is then employed to select features, and an SVM
is applied to classify the behavior as running or normal. In the same way, we obtain
classifiers for bending down and carrying a bar. The three SVM classifiers are then
hierarchically connected.

Fig. 4.3 A person bending down.

Fig. 4.4 A person carrying a bar.
After the SVM training, we test the algorithm using a series of videos, as il-
lustrated in Figure 4.2, which shows that a running person is detected while other
people are walking nearby; Figure 4.3, which shows that a person bending down is
detected; and Figure 4.4, which shows that a person carrying a bar in a crowd is
detected.
The algorithm is tested using 625 samples (each sample contains four consecu-
tive frames). The success rate for behavior classification is shown in Table 4.1.

Table 4.1 Results of the classification.

Behavior                                    Success rate
Human abnormal behavior: running            82% (107 samples)
Human abnormal behavior: bending down       86% (113 samples)
Human abnormal behavior: carrying a bar     87% (108 samples)
Human normal behavior                       97% (297 samples)
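The text leaves the hierarchical connection of the three classifiers unspecified; one
plausible cascade, assuming each binary SVM outputs 1 for its abnormal class, is
sketched below.

    def classify_behavior(feature, run_svm, bend_svm, bar_svm):
        """Hierarchically connected binary SVMs: try each abnormal class in
        turn; a sample rejected by all three is treated as normal behavior."""
        for label, clf in [("running", run_svm),
                           ("bending down", bend_svm),
                           ("carrying a bar", bar_svm)]:
            if clf.predict([feature])[0] == 1:   # 1 = the abnormal class
                return label
        return "normal"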

4.2.2 Motion-based Feature Analysis

4.2.2.1 Mean Shift-based Motion Feature Searching

Before a feature can be extracted, we need to decide where the most prominent
one is. In the analysis of human behavior, information about motion is the most
important. In this section, we use the mean shift method to search for the region that
has a concentration of motion information.

People generally tend to devote greater attention to regions where more movement
is occurring. This motion information can be defined as the difference between the
current and the previous image. Information about the motion location and action
type can be seen in Figure 4.5.

Fig. 4.5 Motion information of human action

The process of the basic continuously adaptive mean shift (CAMSHIFT), which
was introduced by Bradski [190], is as follows.
1. Choose the initial location of the search window and a search window size.
2. Compute the mean location in the search window and store the zeroth moment.
3. Center the search window at the mean location computed in the previous step.
4. Repeat Steps 2 and 3 until convergence (or until the mean location moves less
than a preset threshold).
5. Set the search window size equal to a function of the zeroth moment found in
Step 2.
6. Repeat Steps 2-5 until convergence (or until the mean location moves less than a
preset threshold).
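The core of Steps 2–4, recentering the window on the motion centroid until
convergence, can be sketched as follows; the window-size adaptation of Step 5 is
omitted, and the zeroth moment is returned so a size function can be applied to it.

    import numpy as np

    def mean_shift_window(diff_img, x0, y0, w, h, eps=1.0, max_iter=20):
        """Recenter a w-by-h search window on the centroid of the difference
        image until the shift falls below eps pixels (Steps 2-4)."""
        x, y = x0, y0
        m00 = 0.0
        for _ in range(max_iter):
            win = diff_img[int(y):int(y) + h, int(x):int(x) + w]
            m00 = win.sum()                       # zeroth moment M00
            if m00 == 0:
                break
            ys, xs = np.mgrid[0:win.shape[0], 0:win.shape[1]]
            xc = (xs * win).sum() / m00           # M10 / M00
            yc = (ys * win).sum() / m00           # M01 / M00
            nx, ny = x + xc - w / 2, y + yc - h / 2
            if abs(nx - x) < eps and abs(ny - y) < eps:
                break
            x, y = nx, ny                         # Step 3: move the window
        return x, y, m00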
In our problem formulation, attention should be paid to the window location and size
searching steps. Because a single feature is insufficient for the analysis of human
behavior, more features are retrieved based on the body's structure. The head and
body size can be estimated according to the method explained earlier, and the region
of interest can be defined as shown in Figure 4.6.

Fig. 4.6 Feature selection: the white rectangle is the detected body location, and the four small
colored rectangles are the context-confined regions for feature searching.

Here, four main feature regions are selected according to the characteristics of human
movement, and more detailed regions can be adopted depending on the requirements.
The locations of the search windows are restricted to the corresponding regions, and
the initial window size is set to the size of a human head. For Step 2, the centroid of
the moving part within the search window can be calculated from the zeroth- and
first-order moments [191]: M_00 = Σ_x Σ_y I(x, y), M_10 = Σ_x Σ_y x I(x, y), and
M_01 = Σ_x Σ_y y I(x, y), where I(x, y) is the value of the difference image at
position (x, y). The centroid is located at x_c = M_10/M_00 and y_c = M_01/M_00.

More information can be inferred from higher-order moments, such as the direction
and the eigenvalues (major length and width). The general form of the moments can
be written as M_ij = Σ_x Σ_y x^i y^j I(x, y). The object orientation is then

θ = (1/2) arctan(2b / (a − c))

where a = M_20/M_00 − x_c², b = M_11/M_00 − x_c y_c, and c = M_02/M_00 − y_c².
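A compact NumPy sketch of these moment computations, using arctan2 to resolve
the quadrant of the orientation:

    import numpy as np

    def region_orientation(I):
        """Object orientation from the image moments of a difference image I:
        theta = 0.5 * arctan(2b / (a - c)), with a, b, c as defined above."""
        ys, xs = np.mgrid[0:I.shape[0], 0:I.shape[1]]
        M00 = I.sum()
        xc, yc = (xs * I).sum() / M00, (ys * I).sum() / M00
        a = (xs ** 2 * I).sum() / M00 - xc ** 2        # M20/M00 - xc^2
        b = (xs * ys * I).sum() / M00 - xc * yc        # M11/M00 - xc*yc
        c = (ys ** 2 * I).sum() / M00 - yc ** 2        # M02/M00 - yc^2
        return 0.5 * np.arctan2(2 * b, a - c)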

4.2.2.2 Motion History Image-based Analysis

The difference image alone is of limited use for human motion understanding. For
the recognition of action sequences, information about the direction of movement is
more important than that of static poses. In this section, we introduce a real-time
computer-visual representation of human movement known as the motion history
image (MHI) [192, 193], which can help generate the direction and the gradient from
a series of foreground images.

The MHI method is based on the foreground image, and an appropriate threshold is
needed to transform the image into a binary one: if the difference is less than the
threshold, it is set to zero. The first step is to update the MHI template T using the
foreground at different time stamps τ:

T(x, y) = τ          if I(x, y) = 1
        = 0          if T(x, y) < τ − δ
        = T(x, y)    otherwise

where I is the binarized difference image, and δ is the time window to be considered.


The second step is to find the gradient on this MHI template:

θ(x, y) = arctan(G_y(x, y) / G_x(x, y))

where G_x(x, y) and G_y(x, y) are the results of convolving the MHI template with
the Sobel operator in the x and y directions, respectively. The scale of the filter can
vary over a large range according to the requirements. However, the gradient on the
border of the foreground should not be considered, as the information there is
incomplete and using it may lead to errors. The regional orientation of the MHI can
then be obtained from the weighted contribution of the gradient at each point, where
the weight depends on the time stamp at that point.
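Both MHI steps can be sketched in a few lines; the binarization threshold is an
assumed placeholder, and SciPy's Sobel filter stands in for whatever convolution
scale the application requires.

    import numpy as np
    from scipy import ndimage

    def update_mhi(T, diff_img, tau, delta, thresh=30):
        """One MHI update: stamp moving pixels with the current time tau and
        clear entries older than the time window delta."""
        moving = diff_img > thresh                 # binarize the difference image
        T[moving] = tau
        T[~moving & (T < tau - delta)] = 0
        return T

    def mhi_gradient(T):
        """Per-pixel motion direction theta = arctan(Gy / Gx) from Sobel
        convolutions of the MHI template."""
        gx = ndimage.sobel(T.astype(float), axis=1)
        gy = ndimage.sobel(T.astype(float), axis=0)
        return np.arctan2(gy, gx)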

4.2.2.3 Framework Analysis

The automatic recognition of human action is a challenge, as human actions are
complex and vary across situations. Some simple actions can be defined statically
for the recognition of basic poses, which is an efficient way to provide basic
recognition of action types. However, if no temporal information is taken into
consideration, it is difficult to describe more complex actions. Some researchers
have tried to take the temporal dimension of actions into account. Most such work
is centered on specific action sequences, such as the analysis of predefined gestures
using hidden Markov models (HMMs) and dynamic programming. In the analysis
of human actions, such models are very limited in their effectiveness at recognizing
complex action sequences.
In maintaining the accuracy of a recognition system, there is a trade-off between
the complexity of action and the range of action allowed. In contrast to typical ges-
ture recognition applications, intelligent surveillance concentrates on detecting ab-
normal movements among all sorts of normal movements in the training data rather
than accurately recognizing several complex action sequences.
It is not always easy to define what behaviors are abnormal. Generally, two ap-
proaches are used: one is to model all normal behaviors, so that any behavior that
does not belong to the model will be recognized as abnormal, and the other is to
directly model abnormal behaviors. As it is not practical to list all abnormal behav-
iors that may happen in a scene, we prefer the first approach. In this way, a limited
number of typical normal behaviors can be added dynamically into the database one
after another.

Given a feature space F and the feature evidence X, X ⊆ F, we try to find a normal
model M to express the probability P(H^0 | x) of hypothesis H^0, where H^0 is
defined as the case that the candidate feature x (x ∈ X) belongs to abnormal
behavior, and its complement as the case that it belongs to normal behavior. In
addition, types of action H^i, i ∈ N, can be recognized using different kinds of
action models M_i, i ∈ N.

4.2.2.4 SVM-based Learning

In this part, to simplify the optimization process, we take the radial basis function
(RBF) as the kernel in Equation (4.6):

K(x_i, x) = exp(−γ ‖x_i − x‖²),  γ > 0        (4.7)

where γ is the kernel parameter. The optimal results can then be sought in the space
extended by the parameter γ and the loss function C(·).
Once the function is obtained, the classification line and the probability of the data
can be directly calculated. For each action type, we establish an SVM model to give
the probability P(x | H_i) that evidence x, x ∈ X, belongs to the hypothetical action
type H_i.
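A sketch of the per-action SVM models using scikit-learn as a stand-in;
probability=True enables Platt scaling, which approximates rather than reproduces
the probability outputs described here, and the γ and C values are placeholders to
be tuned.

    from sklearn.svm import SVC

    def train_action_models(data_by_action):
        """One RBF-kernel SVM per action type. data_by_action maps an action
        name to (X, y), where y is 1 for that action and 0 otherwise."""
        models = {}
        for action, (X, y) in data_by_action.items():
            models[action] = SVC(kernel="rbf", gamma=0.1, C=1.0,
                                 probability=True).fit(X, y)
        return models

    def action_posteriors(models, x):
        """P(H_i | x)-style score for each hypothesis from its SVM model."""
        return {a: m.predict_proba([x])[0, 1] for a, m in models.items()}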

4.2.2.5 Recognition using a Bayesian Network

Generally, the hypothesis that has the maximum probability among all of the models
is selected as the recognized action type. If all of the probabilities are smaller than
a certain threshold value, then the action is recognized as an abnormal behavior.
However, recognizing actions directly from the maximum probability over all of the
hypotheses is limited, as some of the poses in different action sequences are very
similar. Therefore, the relationship between context frames should be incorporated
into the analysis for better results. The ideal probability is

P(H_i^t | x_t, x_{t−1}, …)

but it is impossible to take all of the motion history information into account.
Therefore, we simplify the probability as follows:

P(H_i^t | x_t, x_{t−1}, …) ≈ P(H_i^t | H_i^{t−1}, x_t)

All of the motion history information is represented by the previous hypothesis
H_i^{t−1}, which is independent of the current evidence x_t. Then,

P(H_i^t | H_i^{t−1}, x_t) = P(H_i^t | H_i^{t−1}) P(H_i^t | x_t)



where

P(H_i | x) = P(H_i) P(x | H_i) / P(x) = P(H_i) P(x | H_i) / Σ_j P(H_j) P(x | H_j)

Therefore, given the starting hypothesis H_i^0 and the transfer matrix
P(H_i^t | H_i^{t−1}), the probability can be induced from the previous results and
those of the SVM model.
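One step of this recursion can be written as follows; note that, as a sketch, it
propagates a full belief vector through the transfer matrix, which is the standard
filtering generalization of fixing the single previous hypothesis.

    import numpy as np

    def temporal_update(prev_post, transition, svm_post):
        """One step of P(H_i^t | H_i^{t-1}, x_t) = P(H_i^t | H_i^{t-1}) P(H_i^t | x_t).

        prev_post:  belief over the K hypotheses at time t-1, shape (K,)
        transition: transfer matrix, transition[i, j] = P(H^t = j | H^{t-1} = i)
        svm_post:   per-frame posteriors P(H_i^t | x_t) from the SVM models
        """
        predicted = transition.T @ prev_post   # propagate the previous belief
        post = predicted * svm_post            # fuse with the SVM evidence
        return post / post.sum()               # renormalize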

4.2.2.6 Experiments

An action recognition experiment is carried out to evaluate the method. The
experiment is based on a database of five action types: hitting, kicking, walking,
waving, and double waving. The SVM models are trained with about 400 samples,
and more than 280 samples are used for testing. The samples are classified into six
action types, including the type NOT, which means that the models cannot determine
the action type. The recognition results for the training and testing data are shown
in Tables 4.2 and 4.3, respectively. The numbers in the tables are the numbers of
action samples (rows) that are classified as the different actions (columns). It can be
seen from the tables that the generalization of the method is acceptable, as the
difference between the testing and the training results is small; however, the accuracy
needs to be improved.
The improved results obtained using the Bayesian framework with similar training
and testing data are shown in Tables 4.4 and 4.5, respectively. The recognition rates
of the different methods under different situations are shown in Figure 4.7. The
figure shows that the Bayesian framework-based method largely improves the
performance of SVM recognition.
The experiment is carried out on a PC running at 1.7 GHz. Taking advantage of the
MHI-based context feature, a speed of about 15 fps can be achieved, which is
sufficient for real-time surveillance.

Table 4.2 Training data recognition result

Test \ Result   NOT   Walk   Wave   DWave   Hit   Kick
Walk              0     46      0       0     0      2
Wave              2      0     75      10    17      5
DWave             0      0     13      81     5      0
Hit               0      0      6      12    77      5
Kick              0      0      2       0     9     40

Table 4.3 Testing data recognition result

Test \ Result   NOT   Walk   Wave   DWave   Hit   Kick
Walk              0     38      1       0     5      0
Wave              0      0     28       6     9      7
DWave             0      0      6      39     9      4
Hit               0      0      6      10    68      8
Kick              0      0      4       0     6     31

Table 4.4 Training data recognition result, improved

Test \ Result   NOT   Walk   Wave   DWave   Hit   Kick
Walk              0     48      0       0     0      0
Wave              2      0     87       7    10      3
DWave             0      0     11      85     3      0
Hit               0      0      3       2    91      4
Kick              0      0      0       0     6     45

Table 4.5 Testing data recognition result, improved

Test \ Result   NOT   Walk   Wave   DWave   Hit   Kick
Walk              0     41      0       0     3      0
Wave              0      0     36       4     6      4
DWave             0      0      4      46     3      5
Hit               0      0      4       0    80      8
Kick              0      0      8       0     0     33

4.3 Rule-based Behavior Analysis

In some cases, rule-based behavior analysis, although simple, demonstrates robust
real-time performance.
It is not easy to track the whole body of a person because of the large range of
possible body gestures, which can lead to false tracking. To solve this problem, we
propose a rule-based method based on tracking only the upper body of a person
(Figure 4.17), which does not vary much with changes in gesture. We take the upper
half of the tracking rectangle as the upper body of the target. It may contain part of
the legs or miss part of the upper body; we can isolate the upper body using the
clustering method mentioned above. Based on this robust tracking system, we can
obtain the speed, height, and width of the target. With the speed of the upper body
and thresholds selected by experiment, running can be detected successfully. Also,
based on the height and width of the target, we can detect falling through shape
analysis.
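The resulting rules reduce to two threshold tests; the numerical thresholds below are
placeholders, since the text selects them by experiment and does not report values.

    def detect_abnormal(speed, height, width,
                        speed_thresh=2.0, aspect_thresh=0.8):
        """Rule-based checks on the tracked upper body; both thresholds are
        assumed placeholders to be tuned experimentally."""
        if speed > speed_thresh:             # fast motion -> running
            return "running"
        if height / width < aspect_thresh:   # blob wider than tall -> fallen
            return "falling"
        return "normal"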


Fig. 4.7 Recognition rate comparison. The x-axis represents the different action types: walking (1),
waving (2), double waving (3), hitting (4), and kicking (5). The y-axis is the recognition rate (%).

Figure 4.9 shows the detection of abnormal behavior, which includes people
falling down, bending down, and running in a household environment, based on
the tracking results.

4.4 Application: Household Surveillance Robot

Sometimes, abnormal events occur in a room inside a house or in locations that are
outside the field of view of the camera. These events require a different detection
approach. How can they be detected? In contrast to video surveillance, audio
surveillance does not require that a scene be "watched" directly, and its effectiveness
is not influenced by the occlusions that can cause failures in video surveillance
systems. Especially in houses or storehouses, many areas may be occluded by
moving or static objects.
In this chapter, we present a surveillance system installed on a household robot
that detects abnormal events utilizing video and audio information. Moving targets
are detected by the robot using a passive acoustic location device. The robot then
tracks the targets using a particle filter algorithm. For adaptation to different lighting
conditions, the target model is updated regularly based on an update mechanism. To
ensure robust tracking, the robot detects abnormal human behaviors by tracking the
upper body of a person. For audio surveillance, Mel-frequency cepstral coefficients
(MFCCs) are used to extract features from the audio signal. These features are input
into an SVM classifier for analysis. The experimental results show that the robot can
detect abnormal behavior such as falling down and running. In addition, an 88.17%
accuracy rate is achieved in the detection of abnormal audio information such as
crying, groaning, and gunfire. To lower the incidence of false alarms by the abnormal
sound detection system, the passive acoustic location device directs the robot to the
scene where an abnormal event has occurred, and the robot employs its camera to
confirm the occurrence of the event. Finally, the robot sends the captured image to
the mobile phone of its master.

Fig. 4.8 Action recognition results.

Fig. 4.9 Abnormal behavior detection results.

The testing prototype of the surveillance robot is composed of a pan/tilt camera
platform with two cameras and a robot platform (Figure 4.10). One camera is
employed to track targets and detect abnormal behaviors. The other is used to detect
and recognize faces, which is not discussed in this chapter. Our robot can detect a
moving target by sound localization and then track it across a large field of vision
using the pan/tilt camera platform. It can detect abnormal behavior in a cluttered
environment, such as a person suddenly running or falling down on the floor.
After the robot is taught the difference between normal and abnormal sound
information, the computational models built inside the trained support vector
machines can automatically identify whether newly received audio information is
normal. If abnormal audio information is detected, then the robot, directed by the
passive acoustic location device, can employ its camera to confirm that the event has
taken place.
A number of video surveillance systems for detecting and tracking multiple people
have been developed, including W4 in [162], TI's system in [163], and the system
in [164]. Good tracking often depends on correct segmentation; however, for all of
the foregoing systems, occlusion is a significant obstacle. Furthermore, none of them
is designed with abnormal behavior detection as its main function.
Radhakrishnan et al. [165] presented a systematic framework for the detection of
abnormal sounds that might occur in elevators.

Luo [166] built a security robot that can detect dangerous situations and provide a
timely alert, focusing on fire, power, and intruder detection. Nikos [167] presented
a decision-theoretic strategy for surveillance as a first step towards automating the
planning of the movements of an autonomous surveillance robot.

Fig. 4.10 The testing prototype of the surveillance robot.

The overview of the system is shown in Figure 4.11. In the initialization stage,
two methods are employed to detect moving objects. One is to pan the camera step
by step and employ the frame differencing method to detect moving targets during
the static stage. The other method uses the passive acoustic location device to direct
the camera at the moving object, keeping the camera static and employing the frame
differencing method to detect foreground pixels. The foreground pixels are then
clustered into labels, and the center of each label is calculated as the target feature,
which is used to measure similarity in the particle filter tracking. In the tracking
process, the robot camera tracks the moving target using a particle filter tracking
algorithm, and updates the tracking target model at appropriate times. To detect
abnormal behavior, upper body (which is more rigid) tracking is implemented in
a way that focuses on the vertical position and speed of the target. At the same
time, with the help of a learning algorithm, the robot can detect abnormal audio
information, such as crying or groaning, even in other rooms.

[Figure 4.11 shows the processing pipeline: auto-scan detection or passive acoustic
location feeds moving target detection, followed by clustering and feature extraction,
particle filter tracking, behavior analysis, and abnormal behavior detection; in
parallel, audio information passes through MFCC and PCA feature extraction and
an SVM classifier for abnormal audio information detection, and both branches
drive the output.]

Fig. 4.11 Block diagram of the system.

4.4.1 System Implementation

In many surveillance systems, the background subtraction method is used to build
a background model of the scene so that moving objects in the foreground can be
detected by simply subtracting the background from the frames. However, our
surveillance robot cannot use this method because its camera is mobile, and we must
therefore use a slightly different approach. When a person speaks or makes noise,
therefore use a slightly different approach. When a person speaks or makes noise,
we can locate the position of the person with a passive acoustic location device and
rotate the camera to the correct direction. The frame differencing method is then
employed to detect movement. If the passive acoustic location device does not detect
any sound, then the surveillance robot turns the camera 30 degrees and employs
the differencing method to detect moving targets. If the robot does not detect any
moving targets, then the process is repeated until the robot finds a moving target or
the passive acoustic device gives it a location signal.
The household robot system is illustrated in Figure 4.12.

Fig. 4.12 System overview of the household robot.

4.4.2 Combined Surveillance with Video and Audio

An object producing an acoustic wave is located and identified by the passive
acoustic location device. Figure 4.13 shows the device, which comprises four
microphones installed in an array. The device uses the time-delay estimation method,
which is based on the differences in the arrival time of sound at the various
microphones in the sensor array. The position of the sound source is then calculated
from the time delays and the geometric positions of the microphones. To obtain this
spatial information, three independent time delays are needed; therefore, the four
microphones are set at different positions on the plane. Once the direction has been
obtained, the pan/tilt platform moves so that the moving object is included in the
camera's field of view.
The precision of the passive acoustic location device depends on the distances
between the microphones and the precision of the time-delay estimates. In our
testing, we find that the passive acoustic location error is about 10 degrees in the x-y
plane. The camera's field of view is about 90 degrees, much greater than the location
error (see Figure 4.14). After the passive acoustic location device provides a
direction, the robot turns the camera and keeps the target in the center of the
camera's field of view. Thus, the location errors can be ignored.

Fig. 4.13 Passive acoustic location device.

In contrast to video surveillance, audio surveillance does not require that the scene
be "watched" directly, nor is its effectiveness affected by occlusion, which can result
in false alarms from video surveillance systems. In many houses or storehouses,
some areas may be occluded by moving or static objects. Also, the robot and its
human operator may not be in the same room if a house has several rooms.
We propose a supervised learning-based approach to achieve audio surveillance
in a household environment [183]. Figure 4.15 shows the training framework of our
approach. First, we collect a sound effect dataset (see Table 4.6), which includes
many sound effects collected from household environments. Second, we manually
label these sound effects as abnormal (e.g., screaming, gunfire, glass breaking, bang-
ing) or normal (e.g., speech, footsteps, shower running, phone ringing) samples.
Third, an MFCC feature is extracted from the first 1.0 s waveform of each sound
effect sample. Finally, we use an SVM to train a classifier. When a new 1.0 s
waveform is received, its MFCC feature is extracted, and the classifier is employed
to determine whether the sound is normal or abnormal.

Fig. 4.15 Training framework.

4.4.2.1 MFCC Feature Extraction

To distinguish normal from abnormal sounds, a meaningful acoustic feature must
be extracted from the waveforms of sounds.

Table 4.6 The sound effects dataset

Normal sound effects    Abnormal sound effects
normal speech           gunshot
boiling water           glass breaking
dish-washer             screaming
door closing            banging
door opening            explosions
door creaking           crying
door locking            kicking a door
fan                     groaning
hair dryer
phone ringing
pouring liquid
shower
...                     ...

Many audio feature extraction methods have been proposed for different audio
classification applications. The spectral centroid, zero-crossing rate, percentage of
low-energy frames, and spectral flux [172] methods have been used for speech and
music discrimination tasks. The spectral centroid represents the balancing point of
the spectral power distribution; the zero-crossing rate measures the dominant fre-
quency of a signal; the percentage of low-energy frames describes the skewness
of the energy distribution; and the spectral flux measures the rate of change of
the sound. Timbral, rhythmic, and pitch [173] features, which describe the timbral,
rhythmic, and pitch characteristics of music, respectively, have been proposed for
automatic genre classification.
In our approach, the MFCC feature is employed to represent audio signals. Its use
is motivated by perceptual and computational considerations. As it captures some of
the crucial properties of human hearing, it is ideal for general audio discrimination.
The MFCC feature has been successfully applied to speech recognition [174], music
modeling [175], and audio information retrieval [176]. More recently, it has been
used in audio surveillance [177].
The steps to extract the MFCC feature from the waveform are as follows.
Step 1: Normalize the waveform to the range [−1.0, 1.0] and apply a Hamming
window to the waveform.
Step 2: Divide the waveform into N frames, i.e., 1000/N ms per frame.
Step 3: Take the fast Fourier transform (FFT) of each frame to obtain its frequency
content.
Step 4: Convert the FFT data into filter bank outputs. Since the lower frequencies
are perceptually more important than the higher frequencies, the 13 filters allocated
below 1000 Hz are linearly spaced (133.33 Hz between center frequencies) and the
27 filters allocated above 1000 Hz are spaced logarithmically (separated by a factor
of 1.0711703 in frequency). Figure 4.16 shows the frequency response of the
triangular filters.
Step 5: As the perceived loudness of a signal has been found to be approximately
logarithmic, we take the log of the filter bank outputs.

Fig. 4.16 Frequency response of the triangular filters.

Step 6: Take the cosine transform to reduce dimensionality. As the filter bank outputs
calculated for each frame are highly correlated, we take the cosine transform, which
approximates PCA, to decorrelate the outputs and reduce dimensionality. Thirteen
(or so) cepstral features are obtained for each frame by this transform. If we divide
the waveform into 10 frames, then the total dimensionality of the MFCC feature for
a 1.0 s waveform is 130.
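The whole chain can be approximated with an off-the-shelf library; librosa's
Slaney-style mel filter bank is likewise linear below 1 kHz and logarithmic above,
though its exact filter parameters differ from those quoted in Step 4, so this is a
stand-in rather than the book's exact front end.

    import numpy as np
    import librosa

    def mfcc_feature(path, sr=16000, n_frames=10, n_mfcc=13):
        # Load the first 1.0 s of the sample; Step 1: normalize + Hamming window.
        y, sr = librosa.load(path, sr=sr, duration=1.0)
        y = y / (np.max(np.abs(y)) + 1e-12)          # normalize to [-1.0, 1.0]
        frame = len(y) // n_frames                   # Step 2: 1000/N ms per frame
        # Steps 3-6 (FFT, mel filter bank, log, DCT) are handled internally.
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=frame,
                                    hop_length=frame, window="hamming")
        return mfcc[:, :n_frames].T.flatten()        # 10 frames x 13 = 130-D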

4.4.2.2 Support Vector Machine

After extracting the MFCC features from the waveforms, we employ an SVM-trained
classifier to determine whether a sound is normal or not [178], [179], [180].
Our goal is to separate sounds into two classes, normal and abnormal, according to
a group of features. Many types of learning algorithms can be used for binary
classification problems, including SVMs, RBFNs, the nearest neighbor algorithm,
Fisher's linear discriminant, and so on. We choose an SVM as the audio classifier
because it has a stronger theoretical interpretation and better generalization
performance than the alternatives. SVM classifiers have three distinct characteristics.
First, they estimate the classification using a set of linear functions defined in a
high-dimensional feature space. Second, they carry out classification estimation by
risk minimization, where risk is measured using Vapnik's ε-insensitive loss function.
Third, they are based on the structural risk minimization (SRM) principle, which
minimizes a risk function consisting of the empirical error and a regularized term.

4.4.3 Experimental Results

To evaluate our approach, we collect 169 sound effect samples from the Internet
[184], including 128 normal and 41 abnormal samples, most of which are collected
in household environments.
For each sample, the first 1.0 s of the waveform is used for training and testing. A
rigorous jack-knifing (leave-one-out) cross-validation procedure, which reduces the
risk of overstating the results, is used to estimate classification performance. For a
dataset with M samples, we choose M − 1 samples to train the classifier and then
test performance on the remaining sample. This procedure is repeated M times, and
the final estimate is obtained by averaging the M accuracy rates. To train a classifier
using the SVM, we apply a polynomial kernel with kernel parameter d = 2, and the
adjusting parameter C in the loss function is set to 1.
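Using scikit-learn as a stand-in, the leave-one-out protocol with these kernel settings
looks like the following; the feature matrix here is a random placeholder with the
dataset's shape, not the real MFCC features.

    import numpy as np
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.svm import SVC

    # X: one 130-D MFCC vector per sound sample; y: 0 = normal, 1 = abnormal.
    # Shapes mirror the dataset in the text (169 samples: 128 normal, 41
    # abnormal); the values are random stand-ins for the real features.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(169, 130))
    y = np.array([0] * 128 + [1] * 41)

    clf = SVC(kernel="poly", degree=2, C=1.0)   # d = 2, C = 1 as in the text
    scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
    print(f"leave-one-out accuracy: {scores.mean():.4f}")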
Table 4.7 shows the accuracy rates using the MFCC feature trained with differ-
ent frame sizes, while Table 4.8 shows those using MFCC and PCA features trained
with different frame sizes. PCA of the dataset is employed to reduce the number of
dimensions. For example, the number is reduced from 637 to 300 when the frame
size is 20 ms and from 247 to 114 when the frame size is 50 ms. Comparing the
results in Tables 4.7 and 4.8, we find that it is unnecessary to use PCA. The best ac-
curacy rate obtained is 88.17% with a 100 ms frame size. The number of dimensions
of this frame size is 117 and thus dimension reduction is unnecessary.
Nowadays, abnormal behavior detection is a popular research topic, and many
studies have presented methods to detect abnormal behaviors. Our surveillance
robot is for use in household environments, so it needs to detect abnormal behaviors
such as falling down or running.

Table 4.7 Accuracy rates (%) using the MFCC feature trained with 20, 50, 100, and 500 ms frame
sizes.

Frame size   20 ms   50 ms   100 ms   500 ms
Accuracy     86.98   86.39   88.17    79.88

Table 4.8 Accuracy rates (%) using the MFCC and PCA features trained with 20, 50, 100, and
500 ms frame sizes.

Frame size   20 ms   50 ms   100 ms   500 ms
Accuracy     84.62   78.70   82.84    76.33



It is not easy to track the whole body of a person because of the large range of
possible body gestures, which can lead to false tracking. To solve this problem, we
propose a method that tracks only the upper body of a person (Figure 4.17), which
does not vary much with gesture changes. We take the upper half of the tracking
rectangle as the upper body of the target. It may contain part of the legs or miss part
of the upper body; we can isolate the upper body using the clustering method
mentioned before. Based on this robust tracking system, we can obtain the speed,
height, and width of the target. With the speed of the upper body and thresholds
selected by experiment, running can be detected successfully. Also, based on the
height and width of the target, we can detect falling through shape analysis.
Figures 4.18 (a)-(d) show the robot moving in the proper direction to detect the
target upon receiving a direction signal from the passive acoustic location device.
Figures 4.18 (a) and (b) are blurry because the camera is moving very quickly,
whereas Figures 4.18 (e) and 4.18 (f) are clear as the robot tracks the target and
keeps it in the center of the camera’s field of view.

Fig. 4.17 Clustering results on the upper body.

Fig. 4.18 Initialization and tracking results.



We test our abnormal sound detection system in a real-world work environment. The
sounds include normal speech, a door opening, the tapping of keys on a keyboard,
and so forth, which are all normal sounds. The system gives eight false alarms in
one hour. A sound box is used to play abnormal sounds, such as gunfire and crying.
The accuracy rate is 83% over the 100 abnormal sound samples, which is lower than
the rate obtained in the earlier experiments because of the noise in the real-world
environment.
To lower the incidence of false alarms by the abnormal sound detection system,
the passive acoustic location device directs the robot to the scene where an abnor-
mal event has occurred and the robot employs its camera to confirm the occurrence
of the event. In Figure 4.19, we can see the process of abnormal behavior detec-
tion utilizing video and audio information when two persons are fighting in another
room. First, the abnormal sound detection system detects abnormal sounds (fighting
and shouting). At the same time, the passive acoustic location device obtains the
direction. The robot then turns towards the direction whence the abnormal sound
originates and captures images to check whether abnormal behavior is occurring. To
detect fighting, we employ the optical flow method, which we have discussed in the
previous chapter [182].

Fig. 4.19 The process of abnormal behavior detection utilizing video and audio information: (a)
the initial state of the robot; (b) the robot turns towards the direction where the abnormality occurs;
(c) the initial image captured by the robot; (d) the image captured by the robot after turning
towards the abnormality.

Fig. 4.20 The robot sends the image to the mobile phone of its master.
Some abnormal cases cannot be detected correctly by our system, for example, when
a person has already fallen before the robot turns to the correct direction. To address
this problem, the robot sends the captured image to the mobile phone of its master
(Figure 4.20).

4.5 Conclusion

In this chapter, we present the behavior analysis of the actions of individuals and
groups. These cases have one thing in common: the environment is not crowded, so
clear human blobs are available for detailed behavior analysis. A learning-based
approach is demonstrated to be effective in the behavior analysis of individuals. We
develop both a contour-based and a motion-based method of behavior analysis. A
rule-based method, because of its simplicity, also works for the classification of
unsophisticated behaviors.
We also describe a household surveillance robot that can detect abnormal events by
combining video and audio surveillance techniques. The robot first detects a moving
target by sound localization, and then tracks it across a large field of vision using a
pan/tilt camera platform. It can detect abnormal behaviors in a cluttered environment,
such as a person suddenly running or falling down on the floor. It can also detect
abnormal audio information and employs a camera to confirm events.
This research offers three main contributions: a) an innovative strategy for the
detection of abnormal events that utilizes video and audio information; b) an initial-
ization process that employs a passive acoustic location device to help a robot detect
moving targets; and c) an update mechanism to regularly update the target model.
