

Depth-Based Human Fall Detection via Shape Features and Improved Extreme Learning Machine

Xin Ma, Member, IEEE, Haibo Wang, Bingxia Xue, Mingang Zhou, Bing Ji, and Yibin Li, Member, IEEE

Abstract—Falls are one of the major causes of injury to elderly people. Using wearable devices for fall detection has a high cost and may cause inconvenience to the daily lives of the elderly. In this paper, we present an automated fall detection approach that requires only a low-cost depth camera. Our approach combines two computer vision techniques—shape-based fall characterization and a learning-based classifier—to distinguish falls from other daily actions. Given a fall video clip, we extract curvature scale space (CSS) features of human silhouettes at each frame and represent the action by a bag of CSS words (BoCSS). Then, we utilize the extreme learning machine (ELM) classifier to identify the BoCSS representation of a fall from those of other actions. In order to eliminate the sensitivity of ELM to its hyperparameters, we present a variable-length particle swarm optimization algorithm to optimize the number of hidden neurons, corresponding input weights, and biases of ELM. Using a low-cost Kinect depth camera, we build an action dataset that consists of six types of actions (falling, bending, sitting, squatting, walking, and lying) from ten subjects. Experiments with the dataset show that our approach can achieve up to 91.15% sensitivity, 77.14% specificity, and 86.83% accuracy. On a public dataset, our approach performs comparably to state-of-the-art fall detection methods that need multiple cameras.

Index Terms—Curvature scale space (CSS), extreme learning machine (ELM), fall detection, particle swarm optimization, shape contour.

Manuscript received July 13, 2013; revised November 22, 2013 and January 25, 2014; accepted January 29, 2014. Date of publication February 3, 2014; date of current version November 3, 2014. This work was supported in part by the National Natural Science Foundation of China under Grants 61240052, 61203279, and 61233014, in part by the Natural Science Foundation of Shandong Province, China (No. ZR2012FM036), and in part by the Independent Innovation Foundation of Shandong University, China (No. 2012JC005). X. Ma and H. Wang contributed equally to this work. X. Ma and Y. Li are corresponding authors of this paper.

The authors are with the School of Control Science and Engineering, Shandong University, Jinan, Shandong, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/JBHI.2014.2304357

I. INTRODUCTION

AGING has become a worldwide issue that significantly raises healthcare expenditure. As the World Health Organization has reported [1], about 28–35% of people aged 65 and older fall each year, and the percentage goes up to 32–42% for people over 70. Since more and more elderly people live alone, automated fall detection has become a useful technique for prompt fall alarms.

There are many device-based solutions to automatic fall detection [2], [3], such as those using wearable accelerometers or gyroscopes embedded in garments [4], [5]. Although a wearable device is insensitive to the environment, wearing it for a long time causes inconvenience to one's daily activities. Smart home solutions [6], [7] attempt to unobtrusively detect falls without using wearable sensors. They install ambient devices, such as vibration and sound sensors, infrared motion detectors, and pressure sensors, at multiple positions in a room to record daily activities. By fusing the information from these sensors, falls can be detected and alarmed with high accuracy [8]–[11]. However, using multiple sensors significantly raises the cost of the solution. Moreover, placing many sensors in a room may have side effects on the wellbeing of the elderly.

Recent advances in computer vision have made it possible to automatically detect human falls with a commodity camera [12]. Such a camera has a low price and, when installed remotely, does not disturb the normal life of the monitored person. Compared to other sensing modalities, a camera provides richer semantic information about the person as well as his or her surrounding environment. It is therefore possible to simultaneously extract multiple visual cues, including human location, gait pattern, walking speed, and posture, from a single camera. Vision-based fall detection requires a distinctive feature representation to characterize the fall action and a classification model to distinguish falls from other daily activities. Despite many attempts, vision-based fall detection remains an open problem.

Human shape is the first clue one would explore for fall detection: a fall can be identified by analyzing how the human shape changes over a short period. Existing shape-based fall detectors approximate the human silhouette by a regular bounding box or an ellipse and extract geometric attributes such as aspect ratio, orientation [13], [14], or edge points [15] as representative fall attributes. However, the accuracy of these methods is limited by the coarseness of such shape attributes. A fall lasts a relatively short time compared to other actions [16], so motion analysis has also been used for fall detection, for example by encoding motion contours in the motion energy image [17] or in the integrated spatiotemporal energy map [18]. However, falls of elderly people may last longer than those of young people, so different motion models are often required to study the fall behaviors of the two groups.

3-D shape information is more robust to viewpoint and partial occlusion than a 2-D silhouette. Using a multicamera system, 3-D volume distributions are extracted for fall detection in [19]: if a majority of the distribution comes close to the floor within a short time, a fall alarm is triggered. In [20], the centroid and orientation of the person's voxel volume are computed for fall detection. However, reconstructing 3-D human shape with a multicamera system is computationally demanding, which offsets the gain of using 3-D shape.
The recently emerged low-cost depth camera—Microsoft Kinect—makes it feasible to use 3-D depth information at little computational cost. The depth image acquired from Kinect encodes the distance from a visible subject to the Kinect sensor. Because it senses with infrared light, Kinect is invariant to visible light and works day and night. Moreover, the depth image masks the identity of the person being monitored. With depth maps from Kinect, 3-D skeletons [21], [22], 3-D head detections [23], and 3-D gait information [24] can all be readily extracted for action recognition. However, to the best of our knowledge, there have been no attempts that specifically use Kinect to detect human falls.

State-of-the-art classification models have been widely explored for fall detection. A fuzzy neural network classifier was used in [25] for recognizing standing, bending, sitting, and lying; a fall is reported if the body posture changes from standing to lying within a short period. However, the performance of neural networks is limited by the lack of a large-scale human fall dataset. In [26], the k-nearest neighbor classifier is used for fall detection with simple silhouette features. Fuzzy logic has also been used to recognize static postures (lying, squatting, sitting, and standing) [27], with a reported recognition accuracy of 74.29%. The computationally efficient support vector machine (SVM) classifier has also been applied: a binary SVM separating falls from normal activities can yield a 4.8% error rate [16], and by fusing shape and context information, a directed acyclic graph SVM [28] achieves up to 97.08% fall detection accuracy and a 0.8% false detection rate in a simulated home environment. Since these works use different datasets and features, it is difficult to accurately rank the performance of the classifiers.

The recently developed extreme learning machine (ELM) [29] has performance comparable to SVM, yet it 1) is extremely fast to compute and 2) can deal with highly noisy data. When using ELM for fall classification, variations of pose, viewpoint, and background can potentially be handled well. Moreover, ELM has a closed-form solution and therefore does not suffer from the common issues of gradient-based optimization, such as low learning rates and local minima. The drawback of ELM lies in its sensitivity to input parameters, including the number of hidden neurons, the input weights, and the hidden-layer bias values [30].

In this paper, we present a new vision-based approach for automated fall detection in an indoor environment. The approach relies on a Kinect depth camera to extract human silhouette information; we intentionally do not use color images in order to protect the privacy of the monitored person. We extract curvature scale space (CSS) features [31], [32] of the extracted silhouette at each frame, and a bag-of-words model (BoW) [33] built upon the CSS features (BoCSS) is used to characterize the fall action. The CSS feature is invariant to translation, rotation, scaling, and local shape deformation, while BoCSS is invariant to the length of the fall action; our approach is therefore able to detect both rapid and slow falls. After that, we use the ELM model to classify falls against other daily activities. In order to eliminate the sensitivity of ELM to the number of hidden neurons, input weights, and bias values, we propose a variable-length particle swarm optimization (VPSO) algorithm to simultaneously optimize the three parameters. For clarity, we term the parameter-optimized ELM VPSO–ELM. To validate the efficiency of VPSO–ELM, we build an action dataset using a Kinect camera. Since it is difficult to capture real falls of elderly people, ten young men and women intentionally fall down. We also capture five other types of actions (walking, sitting, squatting, bending, and lying) for evaluation purposes. In short, the main contributions of the paper are:

1) A novel shape-based fall representation approach, which is invariant to human translation, rotation, scaling, and action length.
2) A novel classifier—VPSO–ELM, which uses particle swarm optimization to tune the hyperparameters of the ELM classifier so as to improve its generalization ability.
3) A dataset of six activity categories (falling, bending, sitting, squatting, walking, and lying) from ten subjects for public fall detection benchmarking.

The rest of the paper is organized as follows. In Section II, we present the details of the proposed fall detection approach. In Section III, we describe the datasets and experimental settings. Section IV provides experimental results and discussion. Finally, we draw conclusions in Section V.

Fig. 1. Workflow of the proposed fall detection approach. CSS features are extracted at each video frame. BoW is then applied to represent the video action. At training time, the ELM classifier is learned with the optimal parameters found by VPSO.

II. PROPOSED METHOD

Fig. 1 shows the workflow of the proposed fall detection approach. Briefly, the approach involves three main modules. First, CSS features are extracted at each frame. Second, a BoW model is built to represent the video action. Third, VPSO optimizes the hyperparameters of ELM. In the following, we describe the details of the three modules.

A. Silhouette Extraction From Depth Image

The first step is to extract the human silhouette from the depth image. We first apply the adaptive Gaussian mixture model (GMM) [34] to segment the human from the background. Since Kinect is robust against visible light, the GMM model is only smoothly updated
over time. Following the segmentation, we detect the silhouette using the Canny edge detector [35].
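To make the pipeline concrete, the following is a minimal Python sketch of this silhouette extraction step, assuming OpenCV's MOG2 background subtractor as the adaptive GMM [34]; the smoothing kernel size and Canny thresholds are illustrative assumptions, not the settings used in the paper.

```python
import cv2

# Adaptive GMM background model [34]; OpenCV's MOG2 implements this variant.
# A small learning rate mirrors the "smoothly updated" model described above:
# the background statistics change slowly because the depth stream is
# insensitive to lighting changes.
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)

def extract_silhouette(depth_frame):
    """Segment the person in an 8-bit depth frame and return the silhouette edges."""
    fg_mask = subtractor.apply(depth_frame, learningRate=0.001)
    fg_mask = cv2.medianBlur(fg_mask, 5)   # suppress depth speckle noise (illustrative)
    return cv2.Canny(fg_mask, 50, 150)     # silhouette contour via Canny [35]
```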

B. Extracting Curvature Scale Space Features

The second step is to find a distinctive and compact feature to represent the detected human silhouette. We use the CSS-based representation [31], [32] because of its robustness to translation, rotation, scaling, and local deformation. To build the CSS feature, we convolve the extracted silhouette with a set of Gaussian functions. Given a planar curve Γ(x, y) with (x, y) in Cartesian coordinates, we first reparameterize it by its arc length u: Γ(u) = (x(u), y(u)). Then, we convolve Γ(u) with a set of Gaussian functions {g(u, σ); σ = 1, 2, . . . , n}, X(u, σ) = x(u) ∗ g(u, σ) and Y(u, σ) = y(u) ∗ g(u, σ), to obtain a set of convolved curves {Γ_σ = (X(u, σ), Y(u, σ))}. Here, ∗ denotes the convolution operator. As σ increases, the ordered sequence of convolved curves {Γ_σ}, also referred to as the evolution of Γ, is generated. The curvature κ of each Γ_σ is defined as

$$\kappa(u, \sigma) = \frac{X_u(u, \sigma)\, Y_{uu}(u, \sigma) - X_{uu}(u, \sigma)\, Y_u(u, \sigma)}{\left(X_u^2(u, \sigma) + Y_u^2(u, \sigma)\right)^{3/2}} \tag{1}$$

where

$$X_u(u, \sigma) = x(u) \ast g_u(u, \sigma), \qquad Y_u(u, \sigma) = y(u) \ast g_u(u, \sigma)$$
$$X_{uu}(u, \sigma) = x(u) \ast g_{uu}(u, \sigma), \qquad Y_{uu}(u, \sigma) = y(u) \ast g_{uu}(u, \sigma)$$

and $g_u$ and $g_{uu}$ are the first and second derivatives of g(u, σ) with respect to u, respectively.

The CSS image of Γ is defined at κ(u, σ) = 0, the zero-crossing (ZP) points of all Γ_σ. There are two types of ZP: ZP+ is the start point of a concavity arc, where κ(u, σ) changes from negative to positive, and ZP− is the start point of a convexity arc, where κ(u, σ) changes from positive to negative. On a closed curve, ZP+ and ZP− always appear in pairs. The arc between a pair of ZP+ and ZP− is either concave (ZP+, ZP−) or convex (ZP−, ZP+). Since it is extracted from the curvatures at multiple scales, ZP is invariant to rotation, translation, and uniform scaling. To make it further robust against noise and local deformation, we resample the CSS features by curve interpolation. Fig. 2 demonstrates how CSS features are created. During the curve evolution, the σ value keeps increasing until Γ_σ shrinks to a circle-like convex contour where all ZPs disappear. On a CSS image, the (u, σ) coordinates of all ZPs form a set of continuous curves. The maximum point of each curve is the most distinctive. Therefore, the (u, σ) coordinates of all maxima points are used as our CSS feature vector. Note that u is length-normalized.

Fig. 2. Illustration of extracting CSS features. The leftmost column shows the extracted silhouettes of various actions. As σ increases from 0 to 10, the silhouettes shrink and eventually evolve into circle-like shapes. Zero-crossing points on the evolved curves constitute a CSS image (rightmost column).
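The computation of (1) and the zero crossings can be sketched with Gaussian derivative filters, as below. For brevity this sketch collects all zero crossings at each scale instead of tracking the maxima of the CSS curves, and the scale range is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def css_points(x, y, sigmas=range(1, 11)):
    """Zero crossings of the curvature, Eq. (1), of a closed contour (x, y).

    Returns (u, sigma) pairs with u normalized by arc length. Simplified:
    the paper keeps only the maxima of the resulting CSS curves.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, points = len(x), []
    for s in sigmas:
        # Convolving with Gaussian derivatives gives X_u, Y_u, X_uu, Y_uu;
        # mode='wrap' treats the contour as closed.
        xu = gaussian_filter1d(x, s, order=1, mode='wrap')
        yu = gaussian_filter1d(y, s, order=1, mode='wrap')
        xuu = gaussian_filter1d(x, s, order=2, mode='wrap')
        yuu = gaussian_filter1d(y, s, order=2, mode='wrap')
        kappa = (xu * yuu - xuu * yu) / (xu**2 + yu**2) ** 1.5  # Eq. (1)
        for u in np.flatnonzero(np.diff(np.sign(kappa))):       # ZP points
            points.append((u / n, s))                           # length-normalized u
    return np.asarray(points)
```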

C. Action Representation via BoW

By extracting CSS features at each frame, a video clip is transformed into a collection of framewise CSS features. To represent an action with these CSS features, we use the BoW model [33]. In the first stage of BoW, K-means clustering is applied over all feature vectors to generate a codebook. Each cluster center is a codeword, a representative of similar feature vectors. Then, by mapping the collected vectors of a video clip onto the codebook, we obtain a histogram of occurrence counts of the words, which is the BoW representation of the video action. Since each CSS feature is a 2-D vector, generating the BoW model is real-time. The K value is critical for the K-means clustering. As shown in Fig. 4(c), we empirically find that K = 40 produces the best fall detection performance.
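A possible realization of the BoCSS step with scikit-learn's KMeans is sketched below; K = 40 follows Fig. 4(c), while the clustering seed and histogram normalization are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

K = 40  # codebook size, the best-performing value in Fig. 4(c)

def build_codebook(all_css_points):
    """Cluster the pooled 2-D CSS features of the training videos into K codewords."""
    return KMeans(n_clusters=K, n_init=10, random_state=0).fit(all_css_points)

def bocss_histogram(codebook, clip_css_points):
    """Map one clip's CSS features to codewords and return the normalized histogram."""
    words = codebook.predict(clip_css_points)
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / max(hist.sum(), 1.0)  # normalization makes BoCSS length-invariant
```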
D. VPSO–ELM for Fall Detection

ELM [29] is a single-hidden-layer feedforward neural network. Given a set of samples {x_j} and their labels {t_j}, an ELM with L hidden neurons is modeled by

$$\sum_{i=1}^{L} \beta_i\, g(\omega_i \cdot x_j + b_i) = t_j, \qquad j = 1, \ldots, N \tag{2}$$

where g(x) is the activation function of ELM, and ω_i, b_i, and β_i represent the input weights, bias, and output weights of the ith hidden neuron, respectively. Rewriting (2) in matrix form yields Hβ = T, where T = [t_1^T, ..., t_N^T]^T and H is the hidden-layer output matrix

$$H = \begin{bmatrix} g(\omega_1 \cdot x_1 + b_1) & \cdots & g(\omega_L \cdot x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(\omega_1 \cdot x_N + b_1) & \cdots & g(\omega_L \cdot x_N + b_L) \end{bmatrix}_{N \times L}. \tag{3}$$

The ith column of H is the output of the ith hidden neuron with respect to the inputs x_1, ..., x_N. By assigning random values to {ω_i} and {b_i}, ELM solves for β via least squares.

Fig. 3. (a) Testing ELM with 20 hidden neurons and random input weights and biases over 200 runs. (b) Testing ELM with the number of hidden neurons varying from 20 to 100 over 200 runs. Details about the experiments can be found in Section III.
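The closed-form training of (2)–(3) can be sketched in a few lines of NumPy. The sigmoid activation and the pseudoinverse solve are standard ELM choices, but the exact activation used in the paper is not stated, so treat this as an assumption.

```python
import numpy as np

def train_elm(X, T, L, seed=0):
    """Fit an ELM: random (omega, b), then solve H beta = T by least squares."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    omega = rng.uniform(-1, 1, size=(L, n))        # input weights from [-1, +1]
    b = rng.uniform(-1, 1, size=L)                 # hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ omega.T + b)))   # hidden-layer outputs, Eq. (3)
    beta = np.linalg.pinv(H) @ T                   # closed-form output weights
    return omega, b, beta

def predict_elm(X, omega, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ omega.T + b)))
    return H @ beta   # argmax over columns gives the class for one-hot labels T
```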
p∗i (t + 1) =
As shown in Fig. 3(a), the input weights and biases dramati- pi (t + 1) if f (pi (t + 1)) ≥ f (p∗i (t)).
cally affect the performance of ELM (Details about the experi-
After then, the globally optimal particle p∗g is found by
ments can be found in Section III). Moreover, as Fig. 3(b) shows,
the accuracy of ELM classifier reaches maximum with 60–70 p∗g = arg max

(f (p∗1 ), f (p∗2 ), . . . , f (p∗m )). (5)
hidden neurons. Less neurons are inadequate to deal with sam- pi

ple variations while more leads to overfitting. These experiments


motivate us to use particle swarm optimization to optimize the By treating an ELM as a particle, with its input weights
ELM parameters. and biases as the unknowns, the particle variable p can be
1) PSO–ELM: Before introducing VPSO–ELM, we first in- expressed as a matrix
troduce PSO–ELM, which uses standard particle swarm ⎡ ⎤
w11 w12 · · · w1n b1
optimization (PSO) to optimize the parameters of ELM. w22 · · · w2n b2 ⎥
⎢w
PSO was introduced to simulate the social behaviors of p = ⎣ 21 ⎦ (6)
···
bird flocking on stochastic optimization [36]. A PSO wL 1 wL 2 · · · wL n bL L ×(n +1)
model consists of a swarm of particles, each of which
is encoded by a variable vector pi = (pi1 , . . . , pid ), and where L is the number of hidden neurons, (wi1 wi2 · · ·
its velocity vi = (vi1 , . . . , vid ). PSO randomly initializes win bi ) denote the input weights and biases of ith hidden
each particle and iteratively updates their values in terms neuron, i = 1, . . . , L. n is the feature dimension. The op-
of optimizing a fitness function f . For ELM optimization, timal values of input weights and biases correspond to the
f is referred to as the validation accuracy of ELM. At position of the globally optimal particle.
each iteration, each particle knows its own best position Note that in (4) and (6), each particle (an ELM classi-
p∗i as well as the position of the optimal particle p∗g in the fier) is required to have the same size, i.e., the number
swarm. The update strategy to pi and vi is of hidden neurons for each ELM classifier must be the
same. It follows that the optimal number of hidden neu-
vi (t + 1) = ωvi (t) + c1 r1 (p∗i (t) − pi (t)) rons in PSO–ELM must be determined before optimizing
the input weights and biases of ELM. In the following, we
+ c2 r2 (p∗g (t) − pi (t))
introduce a variable-length PSO–ELM that allows the par-
pi (t + 1) = pi (t) + vi (t + 1) (4) ticles in PSO–ELM having different lengths, from which
we then find the optimal particle length.
where ω is an inertia factor, r1 ∈ [0, 1] and r2 ∈ [0, 1] are 2) VPSO–ELM: To determine the optimal number of hidden
used to maintain the diversity of the population. c1 > 0 neurons in PSO, we propose variable-length PSO (VPSO),
is the coefficient of the self-recognition component, while which allows particles to have different lengths. For the
c2 > 0 is the coefficient of the social component. In (4), a purpose, as opposed to (4), we define new update equations
particle can decide where to move, based on its own best that can handle particles with different sizes. In VPSO, be-
position as well as the experience of the optimal particle fore its update, each particle in the swarm is first compared
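A single PSO iteration per (4)–(5), with the decaying inertia factor, might look as follows; the particles are flattened to fixed-dimension vectors here, and the function and parameter names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def pso_step(p, v, p_best, f_best, f, t, iter_max,
             w_max=1.2, w_min=0.73, c1=1.5, c2=1.5):
    """One update of all m particles, Eq. (4); f maps a particle to validation accuracy.

    p, v, p_best: (m, d) arrays; f_best: (m,) array of best fitness values.
    """
    g_best = p_best[int(np.argmax(f_best))]             # global best, Eq. (5)
    w = w_max - (w_max - w_min) / iter_max * t          # linearly decaying inertia
    r1, r2 = rng.random(p.shape), rng.random(p.shape)
    v = w * v + c1 * r1 * (p_best - p) + c2 * r2 * (g_best - p)
    p = p + v
    for i in range(len(p)):                             # refresh personal bests
        fi = f(p[i])
        if fi > f_best[i]:
            p_best[i], f_best[i] = p[i].copy(), fi
    return p, v, p_best, f_best
```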
2) VPSO–ELM: To determine the optimal number of hidden neurons, we propose variable-length PSO (VPSO), which allows particles to have different lengths. For this purpose, as opposed to (4), we define new update equations that can handle particles of different sizes. In VPSO, before its update, each particle in the swarm is first compared against the globally optimal particle. If the size of the particle differs from that of the optimal particle, the particle is updated according to a new update strategy; otherwise, the algorithm moves to the next particle. The details are as follows:

a) If the row number nr_i of the ith particle p_i(t) is larger than that of the best particle p*_g(t), nr_g, we randomly select nr_g rows from the nr_i rows of p_i(t), v_i(t), and p*_i(t), respectively, and set

$$\tilde{p}_i(t) = p_i(t)\big|_{nr_i}^{nr_g}, \quad \tilde{v}_i(t) = v_i(t)\big|_{nr_i}^{nr_g}, \quad \tilde{p}_i^*(t) = p_i^*(t)\big|_{nr_i}^{nr_g} \tag{7}$$

where $\cdot\,\big|_{nr_i}^{nr_g}$ denotes the submatrix formed by the selected nr_g of the nr_i rows. Afterwards, $\tilde{p}_i(t)$, $\tilde{v}_i(t)$, and $\tilde{p}_i^*(t)$ are updated according to (4). The unselected entries $\hat{p}_i(t)$, $\hat{v}_i(t)$, and $\hat{p}_i^*(t)$ are updated according to

$$\hat{v}_i(t+1) = \omega \hat{v}_i(t) + c_1 r_1 \big(\hat{p}_i^*(t) - \hat{p}_i(t)\big)$$
$$\hat{p}_i(t+1) = \hat{p}_i(t) + \hat{v}_i(t+1). \tag{8}$$

Finally, we restack $\tilde{p}_i(t+1)$ and $\hat{p}_i(t+1)$, and $\tilde{v}_i(t+1)$ and $\hat{v}_i(t+1)$. Such an update scheme guarantees that the particle size remains unchanged during the swarm evolution.

b) If nr_i is smaller than nr_g, we first randomly select nr_i rows from the nr_g rows of p*_g(t), set

$$\tilde{p}_g^i(t) = p_g^*(t)\big|_{nr_g}^{nr_i} \tag{9}$$

and then update p_i(t) and v_i(t) according to (4), with $\tilde{p}_g^i(t)$ in place of p*_g(t).

In summary, VPSO–ELM proceeds as follows (see the sketch after this list):
1) Create an initial population with each particle being an ELM; the number of hidden neurons of each particle is randomly selected from [20, 100].
2) Calculate the fitness function for each particle.
3) Generate a new population using the update strategy described in (7)–(9).
4) If the termination criterion is satisfied, stop and output the best weights, biases, and number of hidden neurons; otherwise, return to step 2.
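The following sketch condenses steps 1)–4) under stated simplifying assumptions: particles are L × (n + 1) matrices, size mismatches are handled by randomly aligning rows in the spirit of (7)–(9), and the self-only update (8) for unselected rows is omitted for brevity. fitness is any callable that trains an ELM with the particle's weights and biases and returns the validation accuracy; it is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def vpso_elm(fitness, n, m=50, iters=200, c1=1.5, c2=1.5):
    """Variable-length PSO over ELM particles (a simplified sketch of steps 1-4).

    A particle is an L x (n+1) matrix of input weights and biases with
    L drawn from [20, 100]; fitness(p) is the ELM validation accuracy.
    """
    parts = [rng.uniform(-1, 1, (rng.integers(20, 101), n + 1)) for _ in range(m)]
    vels = [np.zeros_like(p) for p in parts]
    bests = [p.copy() for p in parts]
    best_f = [fitness(p) for p in parts]
    for t in range(iters):
        g = bests[int(np.argmax(best_f))]          # globally optimal particle
        w = 1.2 - (1.2 - 0.73) * t / iters         # decaying inertia weight
        for i, p in enumerate(parts):
            L = min(len(p), len(g))                # common number of rows
            pr = np.sort(rng.choice(len(p), L, replace=False))  # rows of p to update
            gr = np.sort(rng.choice(len(g), L, replace=False))  # matching rows of g
            r1, r2 = rng.random((L, n + 1)), rng.random((L, n + 1))
            vels[i][pr] = (w * vels[i][pr] + c1 * r1 * (bests[i][pr] - p[pr])
                           + c2 * r2 * (g[gr] - p[pr]))
            p[pr] += vels[i][pr]                   # unselected rows are left untouched
            f = fitness(p)
            if f > best_f[i]:
                bests[i], best_f[i] = p.copy(), f
    return bests[int(np.argmax(best_f))]           # best weights, biases, and L
```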
III. EXPERIMENTS

A. Dataset

SDUFall Dataset: We built a depth action dataset with a low-cost Kinect camera. The dataset has been released at our project site (http://www.sucro.org/homepage/wanghaibo/SDUFall.html) for public fall detection benchmarking. Ten young men and women participated in the study. Each subject performed the following six actions 30 times: falling down, bending, squatting, sitting, lying, and walking. Since it is difficult to capture real falls, the subjects fell down intentionally. The 30 trials of each action differ in that each time we randomly changed one or more of the following conditions: carrying or not carrying a large object, turning the light on or off, changing the room layout, and changing the direction and position relative to the camera. A Kinect camera was installed 1.5 m high to record the actions. A total of 6 × 10 × 30 = 1800 video clips were collected. Video frames are of size 320 × 240, recorded at 30 fps in AVI format. The average video length is about 8 s, which serves as the baseline for unit action segmentation. Examples of the six actions (falling, lying, walking, sitting, squatting, and bending) from different subjects are shown in Fig. 4(a).

Fig. 4. (a) SDUFall dataset. Shown here are six action categories (rows) from eight subjects (columns): falling, lying, walking, sitting, squatting, and bending. (b) The insensitivity of silhouette extraction to illumination. Whether the light is turned off (a) or on (d), the human silhouette [(b) and (e)] is clearly visible since Kinect is insensitive to visible light; as (c) and (f) show, silhouettes can be successfully extracted in both situations. (c) The choice of K for K-means clustering. The highest fall detection accuracy is obtained when K equals 40.

MultiCam Dataset: The MultiCam dataset was originally used in [20]. In this dataset, a healthy subject performed 24 realistic scenarios under an eight-camera configuration. The first 22 scenarios contain a fall and confounding events; the last two contain only confounding events (11 crouching positions, nine sitting positions, and four lying-on-a-sofa positions). The study area has a size of 7 m × 4 m, with a table, a chair, and a sofa to simulate a real living room. Video streams have a size of 720 × 480 and were recorded at 30 fps.

B. Experimental Settings

Feature extraction was implemented in C++, while VPSO–ELM was implemented in MATLAB. All experiments were conducted with MATLAB 7.10 (R2010a) on a PC with an Intel(R) Core(TM) i3-2120 CPU and 2.00 GB RAM.

On the SDUFall dataset, we perform six-class classification and then estimate fall-versus-nonfall classification metrics. Using the same BoCSS features, we compare different classifiers: one-versus-all multiclass SVM [37], ELM with random initialization, ELM with PSO-optimized weights and biases (PSO–ELM), and ELM with VPSO-optimized weights, biases, and number of hidden neurons (VPSO–ELM). For SVM, we use a linear kernel with parameter C = 1. Each experiment runs 100 times with leave-one-subject-out cross validation. PSO–ELM and VPSO–ELM optimize their hyperparameters by further splitting the training data into training and validation sets, again with leave-one-subject-out cross validation. On the MultiCam dataset, we perform fall-versus-nonfall binary classification 100 times with leave-one-out cross validation, and compare the proposed VPSO–ELM approach against the reported results of [19], which needs multiple cameras.
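With scikit-learn, the leave-one-subject-out protocol might be set up as below; X holds the BoCSS histograms, y the action labels, and subjects the subject IDs. The LinearSVC stand-in corresponds to the one-versus-all linear SVM baseline with C = 1; the ELM variants would be dropped in at the same place. This is an assumed reimplementation, not the paper's MATLAB code.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import LinearSVC

def loso_accuracy(X, y, subjects):
    """Leave-one-subject-out evaluation: train on nine subjects, test on the held-out one."""
    accs = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
        clf = LinearSVC(C=1).fit(X[train_idx], y[train_idx])  # linear SVM, C = 1
        accs.append(clf.score(X[test_idx], y[test_idx]))
    return np.mean(accs), np.std(accs)
```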
IV. RESULTS AND DISCUSSION

A. Influence of Illumination on Silhouette Extraction

Since Kinect uses infrared light for depth sensing, it is insensitive to visible light, as illustrated in Fig. 4(b): human silhouettes are successfully extracted even when the room goes fully dark. This is a clear advantage over methods using color images [19].

B. Assessing Hyperparameters

The number of K-means clusters is critical to building the BoW model. In Fig. 4(c), we evaluate the influence of K on fall detection on the SDUFall dataset. K = 40 yields the highest classification accuracy, which indicates that 40 clusters represent the overall training CSS features well. Therefore, K = 40 is used throughout our experiments.

The input weights and biases for ELM were randomly initialized from [−1, +1]. For each ELM particle, the number of hidden neurons was randomly initialized from [20, 100]. VPSO finds the optimal number, which is then used for training a final ELM. In our experiments, the parameters of VPSO have no big influence on the results. Therefore, we empirically fix them as follows: maximum inertia weight ω_max = 1.2, minimum inertia weight ω_min = 0.73, maximum iteration count iter_max = 200, self-recognition coefficient c_1 = 1.5, social coefficient c_2 = 1.5, and population size 50.

C. Performance on SDUFall Dataset

Fig. 6 shows the classification results on SDUFall. VPSO–ELM has smaller misclassification errors than the other methods. The improvement of VPSO–ELM over PSO–ELM reflects the influence of the number of hidden neurons. One-versus-all SVM is the method most likely to misclassify other activities as falls, mainly because multiclass SVM is not as natural a multiclass classifier as ELM. The most confounding activity is lying, since about 14% of falls are misclassified as lying. This reflects the similarity of the two activities, as also illustrated in Fig. 5: regardless of time order, the silhouettes of falling appear similar to those of lying. There are also 5.3% of falls miscategorized as walking; this happens because some subjects fall down and get up again quickly.

Fig. 5. Motion evolutions of bending, falling forward, sitting, and lying. Regardless of time order, lying is the most confounding activity to falling, which is consistent with the confusion matrices in Fig. 6. (a) Bending. (b) Falling forward. (c) Sitting. (d) Lying.

Fig. 6. Confusion matrices of action classification with one-versus-all SVM, ELM, PSO–ELM, and VPSO–ELM, respectively. VPSO–ELM obtains the highest classification accuracies. While one-versus-all SVM is better than VPSO–ELM in fall detection accuracy, it has significantly higher false negatives. Lying is the most confounding activity to falling, as can also be seen in Fig. 5. (a) One-versus-all SVM. (b) ELM. (c) PSO–ELM. (d) VPSO–ELM.

By treating the five daily activities (sitting, walking, squatting, lying, and bending) as a nonfall category, we can measure the fall-versus-nonfall errors. Table I shows these results, based on the confusion matrices in Fig. 6. VPSO–ELM is about 2% more accurate than ELM and 1% more accurate than PSO–ELM, with similar improvements in the sensitivity and specificity measurements. Note, however, that since the standard deviations of the accuracy estimates are larger than the differences in accuracy, this result has to be verified in the future with more data and different datasets.

TABLE I: FALL-VERSUS-NONFALL CLASSIFICATION RESULTS (±std) ON SDUFALL
and the population size is 50.
D. Performance on MultiCam Dataset

TABLE II: FALL-VERSUS-NONFALL CLASSIFICATION RESULTS (±std) ON MULTICAM

Table II shows the classification results on the MultiCam dataset. Compared to [19], our approach is better than the three-camera solution but worse than the four-camera one. Since we need only a single camera, the cost of our approach is lower than that of [19].

E. Time Cost

TABLE III: TIME COST OF CLASSIFICATION ON SDUFALL

On the SDUFall dataset, we compared the time costs of SVM, ELM, PSO–ELM, and VPSO–ELM. As shown in Table III, VPSO–ELM needs the longest time to optimize the parameters of ELM.
However, since training is needed only once, this is affordable in practice. Once learned, the running times of ELM, PSO–ELM, and VPSO–ELM show almost no difference. While SVM is the fastest to train, its running time is also the longest. The BoCSS feature building runs at about 30 fps, so for a unit action that typically lasts 8 s, our approach needs only about 8 s to recognize it. Table IV compares the time cost of our approach with that of [19] on MultiCam; our approach is about two times faster than the GPU implementation of [19].

TABLE IV: TIME COST ON MULTICAM

V. CONCLUSION

In this paper, we presented a new vision-based fall detection approach that uses only one low-cost Kinect depth camera. There are two critical components in our approach. First, we represent a video action by a bag-of-words model built upon distinctive CSS features of the human silhouette. Second, for fall classification, we propose VPSO–ELM, which utilizes particle swarm optimization to optimize the model parameters of ELM. By design, VPSO–ELM can choose optimal values for the input weights, biases, and number of hidden neurons to which ELM is sensitive. Extensive experiments on a self-captured dataset show that fusing CSS features and VPSO–ELM can achieve up to 86.83% fall detection accuracy with a single depth camera. The SDUFall dataset contains five daily activities and intentional falls from ten subjects. In the future, we will acquire more daily activities and increase the number of subjects. Moreover, the study is limited by the lack of real falls. Since real falls are difficult to acquire, we will seek to collaborate with nursing homes to capture real falls of elderly people.

REFERENCES

[1] WHO Ageing and Life Course Unit, WHO Global Report on Falls Prevention in Older Age. Geneva, Switzerland: World Health Organization, 2008.
[2] M. Mubashir, L. Shao, and L. Seed, "A survey on fall detection: Principles and approaches," Neurocomputing, vol. 100, pp. 144–152, 2013.
[3] M. Skubic, G. Alexander, M. Popescu, M. Rantz, and J. Keller, "A smart home application to eldercare: Current status and lessons learned," Technol. Health Care, vol. 17, no. 3, pp. 183–201, 2009.
[4] T. Shany, S. Redmond, M. Narayanan, and N. Lovell, "Sensors-based wearable systems for monitoring of human movement and falls," IEEE Sensors J., vol. 12, no. 3, pp. 658–670, Mar. 2012.
[5] G. Zhao, Z. Mei, D. Liang, K. Ivanov, Y. Guo, Y. Wang, and L. Wang, "Exploration and implementation of a pre-impact fall recognition method based on an inertial body sensor network," Sensors, vol. 12, no. 11, pp. 15338–15355, 2012.
[6] G. Demiris and B. Hensel, "Technologies for an aging society: A systematic review of smart home applications," Yearb. Med. Inform., vol. 32, pp. 33–40, 2008.
[7] N. Suryadevara, A. Gaddam, R. Rayudu, and S. Mukhopadhyay, "Wireless sensors network based safe home to care elderly people: Behaviour detection," Sensors Actuators A: Phys., vol. 186, pp. 277–283, 2012.
[8] A. Ariani, S. Redmond, D. Chang, and N. Lovell, "Simulated unobtrusive falls detection with multiple persons," IEEE Trans. Biomed. Eng., vol. 59, no. 11, pp. 3185–3196, Nov. 2012.
[9] C. Doukas and I. Maglogiannis, "Emergency fall incidents detection in assisted living environments utilizing motion, sound, and visual perceptual components," IEEE Trans. Inf. Technol. Biomed., vol. 15, no. 2, pp. 277–289, Mar. 2011.
[10] Y. Li, K. Ho, and M. Popescu, "A microphone array system for automatic fall detection," IEEE Trans. Biomed. Eng., vol. 59, no. 5, pp. 1291–1301, May 2012.
[11] Y. Zigel, D. Litvak, and I. Gannot, "A method for automatic fall detection of elderly people using floor vibrations and sound—proof of concept on human mimicking doll falls," IEEE Trans. Biomed. Eng., vol. 56, no. 12, pp. 2858–2867, Dec. 2009.
[12] O. Popoola and K. Wang, "Video-based abnormal human behavior recognition—A review," IEEE Trans. Syst., Man, Cybern., Part C: Appl. Rev., vol. 42, no. 6, pp. 865–878, Nov. 2012.
[13] V. Vishwakarma, C. Mandal, and S. Sural, "Automatic detection of human fall in video," in Proc. Int. Conf. Pattern Recogn. Mach. Intell., 2007, pp. 616–623.
[14] C. Rougier, J. Meunier, A. St-Arnaud, and J. Rousseau, "Fall detection from human shape and motion history using video surveillance," in Proc. 21st Int. Conf. Adv. Inf. Netw. Appl. Workshops, 2007, vol. 2, pp. 875–880.
[15] C. Rougier, J. Meunier, A. St-Arnaud, and J. Rousseau, "Robust video surveillance for fall detection based on human shape deformation," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 5, pp. 611–622, May 2011.
[16] B. Mirmahboub, S. Samavi, N. Karimi, and S. Shirani, "Automatic monocular system for human fall detection based on variations in silhouette area," IEEE Trans. Biomed. Eng., vol. 60, no. 2, pp. 427–436, Feb. 2013.
[17] H. Qian, Y. Mao, W. Xiang, and Z. Wang, "Recognition of human activities using SVM multi-class classifier," Pattern Recogn. Lett., vol. 31, no. 2, pp. 100–111, 2010.
[18] Y. T. Liao, C.-L. Huang, and S.-C. Hsu, "Slip and fall event detection using Bayesian belief network," Pattern Recogn., vol. 45, no. 1, pp. 24–32, 2012.
[19] E. Auvinet, F. Multon, A. Saint-Arnaud, J. Rousseau, and J. Meunier, "Fall detection with multiple cameras: An occlusion-resistant method based on 3-D silhouette vertical distribution," IEEE Trans. Inf. Technol. Biomed., vol. 15, no. 2, pp. 290–300, Mar. 2011.
[20] M. Yu, S. Naqvi, A. Rhuma, and J. Chambers, "One class boundary method classifiers for application in a video-based fall detection system," IET Comput. Vision, vol. 6, no. 2, pp. 90–100, 2012.
[21] R. Planinc and M. Kampel, "Introducing the use of depth data for fall detection," Personal Ubiquitous Comput., vol. 17, no. 6, pp. 1063–1072, 2011.
[22] G. Mastorakis and D. Makris, "Fall detection system using Kinect's infrared sensor," J. Real-Time Image Process., pp. 1–12, Mar. 2012.
[23] A. T. Nghiem, E. Auvinet, and J. Meunier, "Head detection using Kinect camera and its application to fall detection," in Proc. 11th Int. Conf. Inf. Sci., Signal Process. Appl., 2012, pp. 164–169.
[24] G. Parra-Dominguez, B. Taati, and A. Mihailidis, "3D human motion analysis to detect abnormal events on stairs," in Proc. Second Int. Conf. 3D Imaging, Modeling, Process., Visualization Transmiss., 2012, pp. 97–103.
[25] C.-F. Juang and C.-M. Chang, "Human body posture classification by a neural fuzzy network and home care system application," IEEE Trans. Syst., Man Cybern., Part A: Syst. Humans, vol. 37, no. 6, pp. 984–994, Nov. 2007.
[26] C.-L. Liu, C.-H. Lee, and P.-M. Lin, "A fall detection system using k-nearest neighbor classifier," Expert Syst. Appl., vol. 37, no. 10, pp. 7174–7181, 2010.
[27] D. Brulin, Y. Benezeth, and E. Courtial, "Posture recognition based on fuzzy logic for home monitoring of the elderly," IEEE Trans. Inf. Technol. Biomed., vol. 16, no. 5, pp. 974–982, Sep. 2012.
[28] M. Yu, A. Rhuma, S. Naqvi, L. Wang, and J. Chambers, "A posture recognition-based fall detection system for monitoring an elderly person in a smart home environment," IEEE Trans. Inf. Technol. Biomed., vol. 16, no. 6, pp. 1274–1286, Nov. 2012.
[29] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: A new learning scheme of feedforward neural networks," in Proc. IEEE Int. Joint Conf. Neural Netw., 2004, vol. 2, pp. 985–990.
[30] Q. Zhu, A. Qin, P. Suganthan, and G. Huang, "Evolutionary extreme learning machine," Pattern Recogn., vol. 38, no. 10, pp. 1759–1763, 2005.
[31] F. Mokhtarian and A. K. Mackworth, "A theory of multiscale, curvature-based shape representation for planar curves," IEEE Trans. Pattern Anal. Mach. Intell., vol. 14, no. 8, pp. 789–805, Aug. 1992.
[32] F. Mokhtarian, "Silhouette-based isolated object recognition through curvature scale space," IEEE Trans. Pattern Anal. Mach. Intell., vol. 17, no. 5, pp. 539–544, May 1995.
[33] L. Fei-Fei and P. Perona, "A Bayesian hierarchical model for learning natural scene categories," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recogn., 2005, vol. 2, pp. 524–531.
[34] C. Stauffer and W. E. L. Grimson, "Adaptive background mixture models for real-time tracking," in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recogn., Fort Collins, CO, USA, 1999, vol. 2.
[35] L. Ding and A. Goshtasby, "On the Canny edge detector," Pattern Recogn., vol. 34, no. 3, pp. 721–725, 2001.
[36] J. Kennedy and R. Eberhart, "Particle swarm optimization," in Proc. IEEE Int. Conf. Neural Netw., 1995, vol. 4, pp. 1942–1948.
[37] R. Rifkin and A. Klautau, "In defense of one-versus-all classification," J. Mach. Learning Res., vol. 5, pp. 101–141, 2004.

Xin Ma (M'10) received the B.S. degree in industrial automation and the M.S. degree in automation from Shandong Polytech University (now Shandong University), Shandong, China, in 1991 and 1994, respectively. She received the Ph.D. degree in aircraft control, guidance, and simulation from Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 1998. She is currently a Professor at Shandong University. Her current research interests include artificial intelligence, machine vision, human–robot interaction, and mobile robots.

Haibo Wang received the B.S. degree in logistics engineering from the School of Control Science and Engineering, Shandong University, Shandong, China. He was a cotutelle M.S.–Ph.D. student in the LIAMA and Laboratoire d'Informatique Fondamentale de Lille (LIFL) labs from 2005 to 2010. He received the Ph.D. degree in computer science from LIFL/INRIA Lille-Nord-Europe, Université des Sciences et Technologies de Lille, Lille, France, and the Ph.D. degree in pattern recognition and intelligent systems from LIAMA/National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China, in 2010. He is currently a Lecturer at Shandong University. His research interests include computer vision, pattern recognition, and their applications to virtual reality.

Bingxia Xue received the B.S. degree in automation from Shandong University, Shandong, China, in 2012. She is currently working toward the Master's degree in control science and engineering at Shandong University. Her research interests include artificial intelligence and pattern recognition.

Mingang Zhou received the B.S. degree in automation and the Master's degree in control science and engineering from Shandong University, Shandong, China, in 2010 and 2013, respectively. His research interests include machine vision and pattern recognition.

Bing Ji received the B.S. degree in electronic engineering and the Master's degree in signal and information processing from Xidian University, Xi'an, China, in 2007 and 2009, respectively, and the Ph.D. degree in medical engineering from the University of Hull, Hull, U.K., in 2012. He is currently a Lecturer at Shandong University, Shandong, China. His research interests include cloud robotics and computational simulation of biological systems.

Yibin Li (M'10) received the B.S. degree in automation from Tianjin University, Tianjin, China, in 1982, the Master's degree in electrical automation from Shandong University of Science and Technology, Shandong, China, in 1990, and the Ph.D. degree in automation from Tianjin University, China, in 2008. From 1982 to 2003, he worked with Shandong University of Science and Technology, China. Since 2003, he has been the Director of the Center for Robotics, Shandong University. His research interests include robotics, intelligent control theories, and computer control systems.
