
Date of publication xxxx 00, 0000, date of current version xxxx 00, 0000.

Digital Object Identifier 10.1109/ACCESS.2017.DOI

Multi Visual Modality Fall Detection Dataset
STEFAN DENKOVSKI1,2, SHEHROZ S. KHAN1,2, BRANDON MALAMIS2, SAE YOUNG MOON1,2, BING YE1,2, ALEX MIHAILIDIS1,2
1 KITE Research Institute, Toronto Rehabilitation Institute – University Health Network, Toronto, ON M5G 2A2, Canada
2 Institute of Biomedical Engineering, University of Toronto, Toronto, ON M5G 2A2, Canada
arXiv:2206.12740v1 [cs.CV] 25 Jun 2022

Corresponding author: Stefan Denkovski (e-mail: [email protected]).


This research was supported by NCE: AGE-WELL, Fund 499052.

ABSTRACT Falls are one of the leading causes of injury-related deaths among the elderly worldwide.
Effective detection of falls can reduce the risk of complications and injuries. Fall detection can be performed
using wearable devices or ambient sensors; these methods may struggle with user compliance issues or
false alarms. Video cameras provide a passive alternative; however, regular RGB cameras are impacted by
changing lighting conditions and privacy concerns. From a machine learning perspective, developing an
effective fall detection system is challenging because of the rarity and variability of falls. Many existing fall
detection datasets lack important real-world considerations, such as varied lighting, continuous activities
of daily living (ADLs), and camera placement. The lack of these considerations makes it difficult to
develop predictive models that can operate effectively in the real world. To address these limitations, we
introduce a novel multi-modality dataset (MUVIM) that contains four visual modalities: infra-red, depth,
RGB and thermal cameras. These modalities offer benefits such as obfuscated facial features and improved
performance in low-light conditions. We formulated fall detection as an anomaly detection problem, in
which a customized spatio-temporal convolutional autoencoder was trained only on ADLs so that a fall
would increase the reconstruction error. Our results showed that infra-red cameras provided the highest
level of performance (AUC ROC=0.94), followed by thermal (AUC ROC=0.87), depth (AUC ROC=0.86)
and RGB (AUC ROC=0.83). This research provides a unique opportunity to analyze the utility of camera
modalities in detecting falls in a home setting while balancing performance, passiveness, and privacy.

INDEX TERMS fall detection, multi-modal, autoencoder, anomaly detection, deep learning, computer
vision.

I. INTRODUCTION
Falls are one of the leading causes of injury-related deaths among the elderly worldwide [1, 2] and they are a major cause of both death and injury in people over 65 years of age [3, 4]. The faster an individual receives help after a fall, the lower the risk of complications arising from the fall [5, 6, 7]. Fall detection systems improve the ability of older adults to live independently and "age in place" by ensuring they receive support when required. However, fall detection is a challenging problem both from predictive modeling and practical considerations (such as a low false alarm rate and privacy) [8].

Predictive modeling challenges faced in fall detection include the rarity and diversity of fall events [9]. Previous studies by Stone et al. [10] found 454 falls in 3339 days worth of data, and Debard et al. [11] found 24 falls in 1440 days worth of data. Therefore, it is very challenging and time-consuming to run studies over a long duration, and even then they may still contain too few falls to build robust classifiers [12]. Another challenge is that fall events last only for short intervals in comparison with normal activities of daily living (ADLs) [9]. Finally, each rare fall event can vary greatly from one another, making it difficult to strictly define a fall class or to capture all possible variations in a dataset [12, 13, 14]. Many possibilities exist for the practical implementation of fall-detection systems.


However, because solutions are ultimately intended to be implemented in a person's daily life, they should be easy to live with. Thus, passive systems are ideal, as they require no input from the user. They can monitor the environment before detecting a fall. These systems are preferable because a user may be unresponsive after a fall, or they may not be wearing their device [15] [16]. However, older adults express privacy concerns regarding passive systems such as cameras. In addition to these concerns, systems must balance high sensitivity to falls with a low false alarm rate. Missed falls are potentially dangerous to users, who may rely on the system to detect falls. In contrast, a system with a high false alarm rate can cause many issues. If a system automatically calls an ambulance or even notifies loved ones, a high false alarm rate would result in large bills or potentially ignored cases of real falls and eventual rejection of the system.

Therefore, the key factors to consider when designing a system are that it should be (i) passive, (ii) privacy-protecting, and (iii) able to achieve a high fall detection rate with a low false alarm rate. In this paper, we introduce a novel multi-camera and multi-modal fall detection dataset containing falls and ADLs from 30 healthy adults and ADLs (with no falls) from 10 older adults. This dataset was collected in a semi-naturalistic setting inside a designed home. The experiments were designed to emulate real-world scenarios and sequences of events. It contains six visual modalities mounted on the ceiling and four additional wearable modalities. The six visual modalities consisted of two infra-red cameras, two depth cameras, a thermal camera and an RGB camera. The four wearable modalities were accelerometer, PPG, GSR and temperature. We performed comprehensive experiments on detecting falls using a customized 3D convolutional autoencoder and showed that infra-red modalities performed the best among others, including depth, thermal and RGB cameras.

A. LITERATURE REVIEW
A wide range of modalities, methods of capturing information, and subsequent systems have been explored for fall detection. They can be divided into various groups, such as wearables, ambient sensors and computer vision-based sensors [17] [18]. Khan and Hoey [9] presented a review of fall detection methods based on the availability of fall data during model training.

Wearable systems often incorporate accelerometers, gyroscopes, and inertial measurement units (IMUs) to detect falls. However, other modalities may also be used, such as an EEG or a barometric sensor [19] [20]. Systems that use accelerometers or IMUs are relatively affordable and accurate [15] [17] [21]. However, these systems are invasive and require the user to constantly wear and charge the device. This can lead to many missed falls, as one may not wear the device all the time, e.g., while charging or bathing. Chaudhuri et al. [16] found that the wearable device was not worn for two-thirds of fall events. Additionally, older adults may be hesitant to wear a device because of the stigma it may imply regarding independence [22]. Despite the benefits of wearables, the real-world shortcomings of this type of device may limit its successful deployment.

Ambient or environmental systems, such as pressure mats, radar/Doppler, microphones, and motion sensors, typically use a wide range of sensors within the home to determine the user's activities [23] [19]. These systems may be difficult to install or have high false-alarm rates owing to environmental noise [17] [24].

Computer vision systems have traditionally relied on classical machine learning techniques to determine when a fall has occurred [25, 26]. They often use RGB or depth camera modalities; however, other visual modalities have also been used. These systems are generally passive, low-cost, and can achieve high performance. However, they struggle to maintain user privacy and may be limited in the area they can cover once installed. Imaging modalities such as depth and thermal cameras may help alleviate this issue by obscuring identifiable features, though a person may still be identified. The techniques used to analyze images vary [26]; 3D bounding boxes and background subtraction are among the most popular traditional approaches. Recently, deep learning techniques, including CNN-LSTM models and 3D CNNs, have been applied and have shown promising results [25].

We now present a review of some of the fall detection datasets. Our review of fall detection datasets and multi-modal fall detection datasets is limited to those that contain at least one visual modality.

B. UNIMODAL DATASETS
A review of existing publicly available fall datasets containing only a single visual modality is outlined below.

Charfi et al. [13] released an unnamed dataset containing 197 falls and 57 wall-mounted videos of normal activities. It comprised four background scenes, varying actors, and different lighting conditions. The directions of activities relative to the camera are varied to reduce any impact that the directional camera location may have. All activities and falls started with a person in the frame and were segmented into short video clips of specific actions. To determine falls, manually generated bounding boxes were used to extract features, and an SVM was used to classify falls.

The multi-camera fall dataset collected by Auvinet et al. [27] contained eight IP cameras placed around the room. It included only 24 scenarios, 22 of which contained falls after a short activity. These videos were segmented into clips lasting less than 30 seconds. All rooms were well lit, with minimal furniture placed in the center of the room. The small size of the dataset and other limiting factors could limit the development of generalized models for real-world settings. Methods included background subtraction to obtain a silhouette, from which the vertical volume distribution ratio was used to classify falls.

The Kinect Infrared dataset published by Mastorakis et al. [?] used one camera placed at eye level in the middle of a wall. The dataset included three variations of camera angles (backward, forward and sideways), totaling 48 video falls. The direction of the fall is important because of the eye-level placement of the camera. In addition to falls, activities performed while sitting, lying on the floor and "picking up an item from the floor" were performed by eight different participants. Additionally, two participants were asked to perform activities slowly to replicate the movements of older adults. OpenNI and depth data from the Kinect sensor were used to generate a 3D bounding box whose first-derivative parameters were analyzed to classify falls.

The EDF/OCCU datasets contain two viewpoints of a Kinect camera [28]. The data were divided into two datasets: a non-occlusion dataset, EDF, of 40 falls and 30 actions, and an occlusion dataset, OCCU, of 30 occluded falls and 80 actions. Both viewpoints were from cameras placed at eye level, and thus 'directional falls' were performed. Room furniture and fall variations were minimal, with most variations related to the direction of falls. Occlusion was introduced by a single bed positioned to block the view of the bottom half of a fall. Actions in the dataset without occlusions included picking things off the floor, sitting on the floor and lying on the floor. These same actions were performed in the occluded dataset, with the further addition of tying shoelaces. All actions, except for lying on the floor, were occluded by the bed.

The SDU dataset comprises ten young adults performing six simple actions captured with a depth camera [23]. These include falling, bending, squatting, sitting, lying down and walking. These actions were repeated 30 times each under the following conditions: carrying or not carrying a large object, lighting on or off, room layout, camera direction, and position relative to the camera. Despite the large variation in each repetition of the action, the eye-level placement of the camera means that the direction of the fall is still important. Additionally, all actions were still short and segmented, with an average length of eight seconds per clip.

The KUL simulated fall detection dataset was designed to address many of the shortcomings that these datasets face when real-world factors are considered [29]. Five IP cameras, 55 fall scenarios, and 17 ADL videos were used. The dataset includes realistic environments (in terms of furnishings) and longer videos instead of short segmented clips of various activities. Five goals were highlighted to improve the real-world effectiveness of the dataset: realistic settings, fall scenarios, improved balance of ADL to fall activities, inclusion of real-world challenges (occlusions, partial falls, lighting, etc.), and continuous recording of the data. However, cameras mounted on the walls were used, and no older adults were included in the dataset.

The thermal fall detection dataset was designed to replicate the KUL simulated fall dataset but with a thermal camera [30]. This dataset contains only nine video segments with ADL activities and 35 segments with fall scenarios. However, this dataset only uses a single eye-level mounted thermal camera with a slightly more constrained field of view. The same limitations as the KUL simulated fall dataset also apply, as it replicates that dataset's setup.

Computer vision datasets can vary along several main categories: camera type, location, lighting, occlusions in the scene, and recorded participant activities.

Different camera types or modalities struggle with different considerations. RGB cameras can capture clear images in general lighting conditions; however, they may not work well under poor lighting conditions, such as dark or night-time scenarios. RGB cameras also do not offer any level of privacy. Depth cameras, such as the defunct Microsoft Kinect camera, are a popular alternative to an RGB camera because they can provide a light-independent image and protect the privacy of individuals [23, 19, 18]. Vision modalities such as thermal and depth cameras obscure identifying features while still providing silhouettes of individuals. In addition, they may perform better in certain scenarios, such as those with poor lighting or visually busy environments, due to lighting independence and silhouette segmentation.

The location of the camera in the room can also affect the results. Most of the previous datasets commonly have a camera mounted on a wall at or above eye level, as shown in Table 1. The problem with this placement is that a fall looks very different depending on its direction (across or in line with the field of view). As such, these datasets include strict definitions of the orientation of falls relative to the camera. This limits the variety of falls to a short list of possible variations [19]. In addition to being affected by the fall direction, cameras mounted at eye level are more susceptible to occlusions blocking the view of the participant and affecting the cameras' ability to detect falls (e.g., behind furniture). Some datasets make efforts to include furniture/occlusions in the dataset; however, they may be limited to a single chair or bed (the Multiple Camera [27], EDF/OCCU [28] and SDU [23] datasets), while others have no furniture in frame (Mastorakis et al. [?]). To mitigate these limitations (i.e., orientation of falls and occlusions), cameras can be placed on the ceiling. This helps provide a similar view of every fall, and removes furniture that may be between the subject and camera [31] [32].

TABLE 1: Fall Detection Datasets with Unimodal Camera(s).

Datasets | Modalities | Number of Cameras | Varied Lighting | Contextual Activities | Wall Mounted Cameras | Occlusions | Privacy Protecting
Charfi et al. [13] | RGB | 1 | Yes | No | Yes | Yes | No
Multi Cameras Fall Dataset [27] | RGB | 8 | Yes | No | Yes | Yes | No
Kinect Infrared [?] | Depth | 1 | No | No | Yes | No | Yes
EDF/OCCU [28] | Depth | 2 | No | No | Yes | Yes | Yes
SDU [23] | Depth | 1 | No | No | Yes | Limited | Yes
Thermal Fall [30] | Thermal | 1 | Yes | Yes | Yes | Yes | Yes
KUL Simulated Fall [29] | RGB | 5 | Yes | Yes | Yes | Yes | No

As with falls, the ADLs performed by the participants in the datasets varied widely. Activities are often segmented into very short and specific motions such as a single squat, picking up an object, taking a seat, or lying down. These short and specific videos, lasting from five to 60 seconds, may limit a system's ability to generalize to the real world. This is because of the oversimplification and trivialization of the diversity and problems that can arise in activities and falls. In response to the lack of real-world considerations, Baldewijns et al. [29] attempted to create a dataset that considers these factors. However, that dataset still lacks ceiling-mounted cameras, varying environments, and older adult participants. A summary of these "real-world" factors is outlined in Table 1.

C. MULTIMODAL DATASETS
In recent years, the ability to merge a variety of sensors to improve performance has become of interest in fall detection systems, and several multi-modal datasets have emerged. This is because a multi-modal approach provides different sources of information that can compensate for each other's deficiencies.

Combining modalities may be complementary from a technical perspective, but it may further impair practical considerations. More modalities mean more sensors or cameras, increasing costs and potentially causing more inconvenience to the user. Because a wide range of modalities is used for fall detection, a wide range of combinations is possible. Thus, it is important to select modalities that are complementary to each other without increasing practical costs to the user. These practical or real-world considerations for multi-modal datasets are highlighted in Table 2.

The UP fall detection dataset contained two cameras at eye level with frontal and lateral views [19]. Other modalities were captured through five IMU sensors, one EEG headset and six infrared motion sensors placed in a grid. The activities were limited to six simple motions and five fall variations. Activities varied from 10 to 60 seconds and were segmented from each other. The limited fall directions and short segmented activities limit the real-world implications of this dataset. In addition, many of the chosen modalities were also impractical. Five IMU sensors are difficult to implement; selecting only the most relevant sensors may be more plausible, which was performed by the authors in a follow-up paper [33] [34]. However, motion sensors still struggle with accuracy and with occlusions or furniture placed in a room. EEG headsets are also extremely difficult to use in the real world.

The URFD dataset was recorded with two Kinect cameras. One was at eye level and one was ceiling-mounted for fall sequences [35]. However, only a single eye-level camera was used to record the ADL activities. The accelerometer was worn on the lower back using an elastic belt. This sensor location is not ideal because a special device would need to be worn. The dataset contained only 30 falls and 40 activities of daily living. Along with the limited dataset size, it also contained short, segmented activities and falls with limited variation.

The CMDFALL dataset [14] focuses on multi-view as well as multi-modal capture. Seven overlapping Kinect sensors were used, and two accelerometers were worn on the right hand and hip. The room was well lit, with a minimal amount of furniture unless required for the fall scenario. It contained eight falls and 12 actions. The eight falls were based on three main groups: walking, lying on the bed, or sitting on a chair. The dataset shares many of the same issues, such as trimmed videos and limited fall styles. It includes a limited variety of simple single-action activities and fall styles, and the environment is always well lit, with minimal furniture and occlusions in the space.

These existing multi-modal datasets lack consideration for lighting, furniture, fall variety, variety of ADLs, and camera placement. This may impact real-world performance. Not only were these technical considerations lacking, but also practical considerations, with many multi-modal datasets requiring multiple wearable devices to be worn.

D. INTRODUCED MULTI-MODAL DATASET (MUVIM)
To circumvent the problems associated with previous unimodal and multi-modal datasets, we present the Multi Visual Modality Fall Detection Dataset (MUVIM), a novel multi-camera multi-modality fall detection dataset. The MUVIM dataset places a larger emphasis on different types of camera-based modalities. Multiple camera modalities were selected because of their practical implications compared to wearable or other types of modalities. Additionally, this allows for a direct comparison between privacy-protecting and non-privacy-protecting modalities.

TABLE 2: Highlighting various multi-modal datasets, their modalities and considerations.

Datasets | Number of Modalities | Number of Fall Types | Number of Activities | Participants | Trials | Length of ADL (secs)
UR Fall Detection Dataset | 3 | 3 | 8 | 5 Adults | 70 | 10
UP Fall Detection Dataset | 4 | 5 | 6 | 17 Adults | 70 | 10 - 60
CMDFALL | 3 | 8 | 12 | 50 Adults | 20 | 22.5
Activity Recognition for Indoor Fall Detection Using Convolutional Neural Network | 2 | 8 | 5 | 10 Adults | 20 | 30 - 60
Our Dataset | 8 | NA | 25 | 30 Adults, 10 Older Adults | 400 | 180 - 240

In MUVIM, wearable modalities were also included for completeness and to allow researchers to explore various combinations and their performance. However, they were removed from further analysis in this paper. Only camera modalities, involving thermal, depth, infra-red and RGB cameras, are considered to develop models to detect falls.

The RGB camera is the most commonly available camera and provides a good baseline for comparison with the other modalities. Infra-red cameras capture only grayscale images, but the image quality is still high enough to discern identifying features. They are low-cost cameras that operate well in low-light scenarios with a wide field of view. Depth cameras are less common, but are still relatively low in cost. They are light-independent and do not capture identifying features. They can struggle with creating a consistent image due to "holes" that occur in the depth map because of sensor limitations. In addition, they struggle with scenarios in which participants, backgrounds or occlusions are at similar distances.

For access to the dataset, email [email protected]. Please set the subject line to "Fall detection data access request", and include your title, email address, work address, and affiliation. In addition, a data privacy waiver will have to be completed.

FIGURE 1: Start of the fall as indicated by manually produced labels for each camera. (a) Center FLIR Thermal; (b) Hikvision IP; (c) Orbbec Depth; (d) Orbbec IR; (e) Stereolabs ZED Depth; (f) Stereolabs ZED RGB.

II. DATASET COLLECTION


A. DESCRIPTION OF THE DATASET
Six (6) vision-based sensors were mounted on the ceiling of a floor at the Intelligent Assistive Technology and Systems Lab (IATSL), University of Toronto. This study was approved by the Research Ethics Board at the university. The participants gave their written consent to publish their images for research purposes. The layout of the lab is presented in Figure 2.

Four camera types were used:
(i) One Hikvision IP network camera, a dome-style security camera that uses IR illumination. It captures 20 fps video at a resolution of 704 x 480 pixels.
(ii) A StereoLabs ZED depth camera, which captures 3D depth and RGB images. It relies on ambient illumination from the room and captures video at 30 fps and 1280 x 720 pixels.
(iii) One Orbbec Astra Pro camera, which captures 3D depth and IR images using near-IR illumination. It captured video at 30 fps and at 640 x 840 pixels.
(iv) Three FLIR ONE Gen 3 thermal cameras. Multiple units were used in order to capture an entire view of the room due to their narrow field of view. They work as an attachment to a smartphone and capture video at 8.7 fps and 1440 x 1080 pixels.

All cameras were mounted in the middle of the ceiling except for the thermal cameras, which had two more units placed separately on each side of the room in order to cover the entire room. An Empatica wristband was worn by each participant to monitor movement and physiological signals. An overview of all modalities used in this study can be found in Table 3. Data was collected simultaneously from all sensors during each trial.

TABLE 3: Highlighting devices used and the data collected from each.

Device | Description | Installation | Data Collected
Hikvision IP network camera (1) | Dome-style IP security camera (IR illumination); FOV: 106 degrees | Ceiling (centre) | IR video (grayscale); File format: mp4; Framerate: 20 fps; Resolution: 704 x 480 pixels
StereoLabs ZED depth camera (1) | 3D depth camera (ambient illumination); FOV: 90 x 60 degrees | Ceiling (centre) | RGB video, depth video; File format: avi; Framerate: 30 fps; Resolution: 1280 x 720 pixels
Orbbec Astra Pro (1) | 3D depth camera (near-IR illumination); FOV: 60 x 49.5 degrees | Ceiling (centre) | IR video, depth video; File format: avi; Framerate: 30 fps; Resolution: 640 x 840
FLIR ONE Gen 3 (3); Model number: SM-G532M (IATSL 278, IATSL 279, IATSL 280) | Thermal camera attachment for smartphones; FOV: 50 x 38 degrees | Ceiling (left, centre, right) | Thermal video; File format: mp4; Framerate: 8.7 fps; Resolution: 1440 x 1080 pixels
Empatica E4 (1) | Wristband for monitoring of movement and physiological signals | Wearable | Plain text data; File format: csv

B. PARTICIPANTS AND ACTIVITIES
The study was divided into two phases, each with a different participant population. In Phase 1, data were collected from 30 healthy younger adults between 18 and 30 years of age (mean age = 24; number of females = 14, or 46.7%). Inclusion criteria for this phase included: must be aged 18-30, be clear of any health complications that may hinder balance or performance in the study, must be able to understand and speak English, must be able to move around safely in a furnished room without eyeglasses, and must be able to travel and attend sessions on site. In Phase 2, data were collected from 10 healthy older adults who were at least 70 years of age (mean age = 76.4; number of females = 5, or 50%). Participants in Phase 1 were asked to simulate falls onto a 4" thick crash mat. Participants in Phase 2 did not require the use of a crash mat, since no falls were simulated by this population.

Prior to a session, participants were provided with a consent form detailing the protocol of this study. Once consent was provided, the researchers confirmed the eligibility of the participant by conducting a brief screening questionnaire. Two versions of this questionnaire were used, since the populations for Phase 1 and Phase 2 differed. In Phase 2, questions pertained to the participants' mobility and vision issues to ensure safety, and older adults were asked if they had experienced a fall within the last year to help ensure they did not have any mobility issues.

Each session was approximately one (1) hour in length, including consent, preparation time, data collection, and intermissions to change the setting of the room. The setting of the room (including furniture and the crash mat) was randomized for each participant based on five (5) pre-made arrangements. The furniture and crash mat were stationary throughout each session, but were moved to a new configuration between participant sessions. This was done to avoid building a trivial classifier that detects the crash mat as the cue for a fall in a scene. These five room settings were created in order to include as many props and furniture as possible within the field of view of all modalities.

Each session contained ten (10) trials: five "day-time" trials that were well lit and five "night-time" trials with poor lighting. Blackout curtains were drawn and the main overhead lights were turned off in night-time trials, but an incandescent lamp was left on in order to provide enough illumination to allow participants to move around safely. Each trial required the participant to act out a scripted "story" while interacting with various furniture and props in the scene. For example, one story involves the participant returning home from an outing, putting down his or her bags, taking off his or her shoes and jacket, making tea, and sitting down at a computer. The format of each story was intended to put participants at ease and increase the realism of the data that was collected. The collection of stories used in this study covered a broad range of normal activities that might occur in a younger/older adult's living room. In addition, the stories allowed the participants to move around the room as they interacted with the furniture and props placed throughout. The order of completion for all ten trials was randomized between participants to prevent researcher bias (e.g., non-significant trends or presumptions due to the order of trials) and allowed for better comparison between trials.

At the beginning of each session, a research assistant (outside the view of the camera) would provide cues to the participant to perform certain activities. These cues were based on a pre-determined script that contained different scenarios (as discussed above). The participant would follow the cues (e.g., sit on the sofa, walk around, pick an object up from the floor, work on the laptop, etc.) and perform these activities the way they would like to do them. In each session, a person carries out normal activities based on the given cues, and each session would end with a fall on the crash mat. The types of falls were kept diverse across each session and for different persons to capture different types of falls.

FIGURE 2: One example of the arrangements of the five furniture pieces in the room. Furniture pieces are as follows: black - mat; brown - shelf; olive - chair; grey - desk + chair; blue - table; red - sofa + end table; purple - lamp.

Five general representative fall types of older adults were selected:
• Tripping or stumbling forward, and falling when unable to catch oneself.
• Standing or walking, losing one's balance, and falling backwards or sideways. Note that a backward fall ends in a sitting or a lying position.
• Falling slowly/incrementally (the person loses balance, catches themselves on some object, but continues to fall slowly when they do not have the strength to recover).
• Attempting to sit, but falling when the chair slides out from underneath.
• Attempting to stand, putting weight on a chair, table or walker, but the supporting object moves away and fails to provide support.
• Sitting in a chair, leaning forward to reach an item on the floor, putting on shoes, or attempting other activities, and toppling out of the chair.

Each story was assigned one or more fall types and recovery methods that could logically fit at the end of the story. If multiple fall types were possible for that story, one was determined at random. A higher number of normal activities relative to fall events was deliberately performed to represent actual scenarios; a disproportionately high number of falls may over-simplify or bias the predictive models. In summary, a trial consisted of enacting a story, simulating a fall, and engaging in a recovery, all of which were chosen randomly from a pre-defined script. Participants in Phase 2 were also asked to complete ten stories but were not tasked with simulating a fall or a recovery.

Participants were prompted through the steps of each trial. However, these prompts were not given with specific details, to allow the participant to complete the assigned tasks based on their own interpretation. This helps in building generalized classifiers that are not based on a particular sequence of activities.

The Empatica E4 wristband was turned on at the beginning of the session. Trials were defined through the pressing of the main button by the participant at the beginning and end of each trial. The Astra Pro, FLIR ONE, ZED, and IP cameras were manually controlled by the researchers. Recordings were started at the beginning of each trial, and stopped at the end of each trial.

C. DATA PRE-PROCESSING AND CONSOLIDATION
1) Data Structure
Immediately after each session, the researchers transferred all collected data to a secure network server, into a folder labelled with the participant's ID. Each participant's folder contained subfolders for each type of modality used during the trial, including data from the Empatica E4 wristband. See Figure 3 for a breakdown of the directory structure of the public dataset. Each modality folder contained ten folders, one for each trial (i.e., Day 1-5, Night 1-5). Once data transfer was complete, the researchers converted each of the camera videos into individual frames in JPEG format.

FIGURE 3: Directory tree used to store trials and cameras. FD001 is the folder for participant 1, containing all cameras. Each camera has its own folder, containing 10 sub-folders, one for each trial (labelled day or night). The FLIR thermal camera contains three additional folders, one for each of the three thermal cameras used (Note: 278, 279 and 280 are generated labels). For example:

FD001
- Empatica
- FLIR
  - FLIR 278
    - Day - Trial 1
    - Day - Trial 2
    - ...
    - Night - Trial 5
  - FLIR 279
  - FLIR 280
- IP
  - Day - Trial 1
  - Day - Trial 2
  - ...
  - Night - Trial 5
- Orbbec
- ZED

2) Labelling Procedure
Once all data was transferred, the researchers labelled the beginning and end frame numbers for each fall that occurred during all recorded sessions. All labels were noted using an Excel spreadsheet. Two researchers were recruited for this task to reduce sample bias and increase labelling accuracy. The start of a fall was marked as the frame in which the participant started to lose balance. The end of a fall was marked as the frame in which the participant was at rest on the ground. The observed fall types are those listed above. Certain trials contained more than one fall; in this case, both falls were documented in the spreadsheet.
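To illustrate how the released directory tree and the label spreadsheet fit together, the following is a hypothetical sketch of pairing the JPEG frames of one trial with per-frame fall labels. The spreadsheet column names (participant, trial, fall_start_frame, fall_end_frame) and the helper name are assumptions for illustration, not part of the released dataset.

```python
# Hypothetical sketch: pair the frames of one trial with per-frame fall labels.
# Folder names follow Figure 3; spreadsheet column names are assumptions.
import os
import pandas as pd

def load_trial_labels(root, participant, camera, trial, labels_xlsx):
    """Return (frame_paths, is_fall_flags) for one trial of one camera."""
    trial_dir = os.path.join(root, participant, camera, trial)   # e.g. FD001/IP/Day - Trial 1
    frames = sorted(os.listdir(trial_dir))                       # one JPEG file per frame
    labels = pd.read_excel(labels_xlsx)
    rows = labels[(labels["participant"] == participant) & (labels["trial"] == trial)]
    is_fall = [0] * len(frames)
    for _, fall in rows.iterrows():                              # a trial may contain >1 fall
        for i in range(int(fall["fall_start_frame"]), int(fall["fall_end_frame"]) + 1):
            if i < len(frames):
                is_fall[i] = 1
    return [os.path.join(trial_dir, f) for f in frames], is_fall
```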

FIGURE 4: 3D Autoencoder Structure

3) Limitations
One major limitation involved the positioning of the FLIR cameras. The cameras were often not able to capture the complete duration of the fall clearly. For example, when the start of a fall was unclear in the footage, the first visible frame was used instead. If the end of the fall was clearly visible, it was still recorded in the main label table. This led to the incomplete labelling of falls for certain trials. Another limitation was that the recording software for some modalities would freeze during the trial, particularly with the thermal and Orbbec cameras, and thus was unable to capture the fall. However, this did not happen often and was not a major issue, with all missing videos noted in the Appendix. For any trial that was missing the start or end of fall frames, the researcher made a note but did not record the frame into the data sheet. A further limitation appeared due to the lack of detail produced by recordings from the FLIR ONE cameras. Their low-resolution video made it difficult to observe the exact start and end frames for certain falls. Specifically, we observed that when participants remained in one position for a long time, they left a heat residue, which made it difficult to differentiate between the residue and the participant who had fallen.

III. FALL DETECTION
To compare the performance of the cameras, a baseline modality performance was required. This allows the comparison between camera modalities on the MUVIM fall detection dataset.

Fall detection was approached as a one-class classification or anomaly detection problem owing to the rarity of fall events [36]. Previous work has established that 3D CNN auto-encoders can achieve high performance by learning spatial and temporal features [37]. In this approach, a spatio-temporal autoencoder is trained only on video clips of normal activities, which are available in abundance. During testing, the autoencoder should be able to reconstruct unseen normal activities with low reconstruction error. However, the autoencoder will reconstruct unseen falls with a higher reconstruction error. Thus, using a 3D autoencoder, a fall can be effectively detected as an anomaly [37]. We adapted the DeepFall [37] model and re-implemented it in PyTorch. The source code of the implementation is available at GitHub [link address].

A 3D CNN auto-encoder is formed through stacked layers of 3D convolutional and 3D max-pooling layers. Model parameters and design, including stride, padding, activation functions and kernel size, were adapted from previous work [37]. Similar preprocessing steps were performed. All videos were resized to 64 x 64 pixels and interpolated or extrapolated to change the frame rate of each video to 8 fps. Interpolation was performed by duplicating existing frames. Depth videos were in-painted in order to fill in black sections of the video. These black sections occur from poor reflection of the near-infrared light required to measure "time-of-flight" in order to determine depth. In-painting was done using a Navier-Stokes based in-painting function provided in the OpenCV Python package. Model parameters were changed from the original DeepFall model to remove dropout layers from the auto-encoders, as they simply added noise to the encoder and did not provide any improved performance. Training was performed with 20 epochs, as further training did not improve results. The MaxUnpool3d operation was used in PyTorch, which sets all non-maximal unpooled values to zero. The architecture of the 3D CNN autoencoder models for fall detection is shown in Figure 4. However, the batch size was increased to 128 frames and the frame rate was lowered to eight fps. This decrease in frame rate effectively increased the temporal window of the model. Experiments changing the window size down to four did not significantly alter results.

Training was performed on all videos of normal activities of daily living, and testing was performed on videos that contain falls. A notable difference between the data splits is that the ADL videos only contained older participants, while the fall videos only contain younger adults. This results in 100 videos of older adults for training and 300 for testing of younger

adults. However, due to data loss during capture, the number of videos for each modality varies. Since the thermal cameras have a narrow field of view, many falls were not captured within the central camera's field of view. When controlling for videos that are common across all modalities, only 182 falls for testing and 62 videos of ADL for training are available. Performance is reported for these videos.
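As a rough illustration of the pipeline described above, the following is a minimal PyTorch sketch of the preprocessing and of a spatio-temporal convolutional autoencoder trained only on ADL windows. It is not the authors' released implementation: the layer widths, the transposed-convolution decoder (the adapted DeepFall model uses MaxUnpool3d), the normalization, and the placeholder training batch are illustrative assumptions.

```python
import cv2
import numpy as np
import torch
import torch.nn as nn

def preprocess_frame(frame, is_depth=False):
    """Resize to 64 x 64; in-paint depth 'holes' with the Navier-Stokes method."""
    if is_depth:
        holes = (frame == 0).astype(np.uint8)             # missing depth readings
        frame = cv2.inpaint(frame, holes, 3, cv2.INPAINT_NS)
    frame = cv2.resize(frame, (64, 64)).astype(np.float32)
    return frame / max(float(frame.max()), 1.0)           # scale to [0, 1] (illustrative)

def resample_to_8fps(frames, src_fps):
    """Duplicate (or drop) frames so that every video is read at 8 fps."""
    idx = np.round(np.arange(0, len(frames), src_fps / 8.0)).astype(int)
    return [frames[i] for i in idx if i < len(frames)]

class SpatioTemporalAE(nn.Module):
    """Stacked 3D convolution / 3D max-pooling encoder with a mirrored decoder."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),                               # halves time and space
            nn.Conv3d(16, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(8, 16, kernel_size=2, stride=2), nn.ReLU(),
            nn.ConvTranspose3d(16, 1, kernel_size=2, stride=2), nn.Sigmoid(),
        )

    def forward(self, x):                                  # x: (batch, 1, 8, 64, 64)
        return self.decoder(self.encoder(x))

# Train only on windows of normal ADLs; falls are never seen during training.
model = SpatioTemporalAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
adl_windows = torch.rand(128, 1, 8, 64, 64)                # placeholder batch of 8-frame windows
for epoch in range(20):                                    # 20 epochs, as described above
    optimizer.zero_grad()
    loss = loss_fn(model(adl_windows), adl_windows)
    loss.backward()
    optimizer.step()
```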
As done in DeepFall, two main methods of determining the reconstruction error can be used [37]. The first operates on a per-frame basis (cross-context) and the second on a per-window basis (within-context). These are outlined in Figure 5.
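Both methods start from the reconstruction error produced by the autoencoder. The paper does not give the formula explicitly; a standard mean-squared-error formulation consistent with the description here would be

\[
e_i = \frac{1}{T H W}\sum_{t=1}^{T}\sum_{h=1}^{H}\sum_{w=1}^{W}\left(x_{i,t,h,w}-\hat{x}_{i,t,h,w}\right)^{2},
\qquad
e_{i,t} = \frac{1}{H W}\sum_{h=1}^{H}\sum_{w=1}^{W}\left(x_{i,t,h,w}-\hat{x}_{i,t,h,w}\right)^{2},
\]

where window \(i\) contains \(T = 8\) frames of \(H \times W = 64 \times 64\) pixels, \(x\) is the input window and \(\hat{x}\) its reconstruction. The per-window score \(e_i\) is used within-context, while the per-frame scores \(e_{i,t}\) are aggregated cross-context.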
In cross-context scoring, a reconstruction error is calculated for each individual frame. However, as a single frame is repeated in multiple windows, it has multiple reconstruction errors. The mean and standard deviation of the reconstruction error for a frame are then calculated across these windows. As data are taken from across multiple windows, it is called cross-context. The AUC ROC is then calculated for each test video (per video) and the mean performance is reported (see Table 4). This allows for more specific tuning on a per-video basis to a person and their specific activities. Alternatively, all reconstruction scores can be concatenated together, after which one AUC ROC can be found across all videos (global). Global threshold results are reported in Table 5. This method may show a more generalizable level of performance.
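A rough sketch of this cross-context aggregation and of the two evaluation settings (per-video and global) is given below, assuming per-frame reconstruction errors have already been computed for every sliding window with a stride of one; the use of scikit-learn metrics and the function names are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def cross_context_scores(window_errors, num_frames, win_len=8):
    """window_errors[i][t] is the reconstruction error of frame (i + t) in window i."""
    per_frame = [[] for _ in range(num_frames)]
    for i, errors in enumerate(window_errors):
        for t, e in enumerate(errors):
            per_frame[i + t].append(e)                       # a frame appears in several windows
    mean_score = np.array([np.mean(v) for v in per_frame])   # mean error per frame
    std_score = np.array([np.std(v) for v in per_frame])     # std of error per frame
    return mean_score, std_score

def per_video_auc(videos):
    """Table 4 style: one AUC ROC per test video, then the average across videos."""
    aucs = [roc_auc_score(labels, scores) for labels, scores in videos if labels.any()]
    return float(np.mean(aucs))

def global_auc(videos):
    """Table 5 style: concatenate scores from all videos before computing AUC ROC / AUC PR."""
    labels = np.concatenate([labels for labels, _ in videos])
    scores = np.concatenate([scores for _, scores in videos])
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)
```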
TABLE 4: Cross-context results when averaging performance on a per-video basis, where (σ) uses the mean reconstruction error and (µ) uses the standard deviation of the reconstruction error for each frame across all windows it appears in.

Modalities | AUC ROC (σ) | AUC ROC (µ) | AUC PR (σ) | AUC PR (µ)
IP | 0.950 (0.049) | 0.937 (0.051) | 0.261 (0.280) | 0.199 (0.246)
Orbbec IR | 0.923 (0.085) | 0.927 (0.077) | 0.213 (0.224) | 0.195 (0.212)
ZED Depth | 0.874 (0.127) | 0.894 (0.119) | 0.123 (0.145) | 0.143 (0.172)
Thermal | 0.874 (0.112) | 0.883 (0.088) | 0.119 (0.110) | 0.124 (0.115)
Orbbec Depth | 0.839 (0.133) | 0.872 (0.115) | 0.093 (0.138) | 0.101 (0.138)
ZED RGB | 0.859 (0.125) | 0.828 (0.123) | 0.082 (0.114) | 0.059 (0.097)

Secondly, a within-context anomaly score can be generated. This operates on a per-window basis, where the mean reconstruction error of each sliding window is used. However, in order to determine a window's label, a hyper-parameter on the number of fall frames must be set. For example, if a window contains eight frames, the hyper-parameter may be set to four, so that at least four of the eight frames must contain a fall for the window to be classified as a fall. As such, within-context results are reported for all possible values of this parameter given a window size of eight. The results are presented in Figure 7.
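The following sketch illustrates this within-context scheme, assuming the same per-window reconstruction errors and per-frame fall labels as above; the default threshold and the returned precision-recall baseline (the fraction of positive windows) are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def within_context_eval(window_errors, frame_labels, win_len=8, min_fall_frames=4):
    """One score per sliding window; a window is a 'fall' window if it holds
    at least `min_fall_frames` labelled fall frames."""
    scores, labels = [], []
    for i, errors in enumerate(window_errors):             # window i covers frames i .. i+win_len-1
        scores.append(float(np.mean(errors)))              # mean reconstruction error of the window
        n_fall = int(np.sum(frame_labels[i:i + win_len]))
        labels.append(1 if n_fall >= min_fall_frames else 0)
    labels, scores = np.array(labels), np.array(scores)
    auc_roc = roc_auc_score(labels, scores)
    auc_pr = average_precision_score(labels, scores)
    baseline_pr = labels.mean()                            # random-scorer AUC PR for this threshold
    return auc_roc, auc_pr, baseline_pr
```

Raising min_fall_frames leaves fewer positive windows, which is why the precision-recall baseline shrinks as the fall threshold increases (compare the baseline row of Table 6).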
FIGURE 5: Within-context versus cross-context scoring methods. (A) Within-context reconstruction scoring method: in this example, for illustration purposes, three frames out of five must contain a fall for the window to be considered a fall; the reconstruction error is averaged across each window. (B) Cross-context reconstruction scoring method: the reconstruction error for a frame is averaged across all windows containing the frame.

TABLE 5: Cross-context results when using a global receiver operating characteristic curve across all videos.

Modalities | AUC ROC (σ) | AUC ROC (µ) | AUC PR (σ) | AUC PR (µ)
IP | 0.889 | 0.913 | 0.076 | 0.067
Orbbec IR | 0.884 | 0.902 | 0.058 | 0.066
ZED Depth | 0.780 | 0.843 | 0.032 | 0.042
Orbbec Depth | 0.759 | 0.831 | 0.022 | 0.031
Thermal | 0.802 | 0.814 | 0.050 | 0.056
ZED RGB | 0.619 | 0.657 | 0.017 | 0.016


IV. RESULTS AND DISCUSSION
We observe good performance across all modalities; the mean cross-context performance for each modality is reported in Table 4. The best performance is seen for the IP infra-red camera, achieving an average AUC ROC(σ) of 0.950 (0.049) using the cross-context mean reconstruction error. Performance was comparable between the mean reconstruction error and the standard deviation of the reconstruction error; however, most modalities performed slightly better using the standard deviation of the reconstruction error. Precision-recall performance is relative to the ratio of true positives in the dataset divided by the number of samples. Our dataset contained roughly 885 fall frames out of a total of 92,709 frames for each modality. This means that the baseline of a random classifier would achieve an AUC PR of 0.0096. The best AUC PR performance is seen with the IP modality, achieving an AUC PR(σ) of 0.261 (standard deviation 0.280).

Instead of averaging performance across all videos, the reconstruction error from each video can be concatenated into a single vector; AUC ROC and AUC PR can then be calculated globally. Table 5 presents these results. We observe a decrease in performance across all modalities. However, the strongest performing modalities had the least decline, with the IP camera still achieving an AUC ROC (µ) of 0.913 (down from 0.937 (0.051)). Weaker modalities saw a larger decrease, specifically the RGB camera, which decreased from an AUC ROC (µ) of 0.828 (0.123) to 0.657. When using global thresholding, a large decrease in the AUC PR performance is also seen. The best AUC PR performance decreased from the IP camera's AUC PR (σ) of 0.261 (0.280) to 0.076. However, since the baseline for AUC PR is 0.0096 (due to the large class imbalance), performance is still better than a random classifier. This decrease in AUC PR is most likely because an optimal threshold for the reconstruction error can no longer be tuned per video, but must instead generalize across all videos.

The within-context results using the mean reconstruction error are shown in Figure 6. Similar performance is achieved with the within-context scoring method. Performance increases as the number of fall frames required to classify a window as a fall window increases, peaking at five and then beginning to slightly decrease. However, the precision-recall performance is best at a lower number. As a larger number of frames is required to classify a window as containing a fall, fewer windows are considered to contain a fall. Thus, the ratio of true positive labels to total labels decreases. As the baseline performance is too small to show clearly on the graph, it is reported in Table 6.

FIGURE 6: Mean within-context AUC ROC scores for various numbers of frames that must contain a fall in order for the window to be considered to contain a fall.

FIGURE 7: Mean within-context AUC PR scores for various numbers of frames that must contain a fall in order for the window to be considered to contain a fall.

TABLE 6: Within-context precision-recall results for a global threshold across all videos.

Fall Threshold | 1 | 2 | 3 | 4 | 5 | 6 | 7
Baseline | 0.018 | 0.016 | 0.013 | 0.011 | 0.008 | 0.006 | 0.004
IP | 0.282 | 0.288 | 0.288 | 0.283 | 0.259 | 0.202 | 0.153
Orbbec IR | 0.243 | 0.242 | 0.239 | 0.219 | 0.200 | 0.163 | 0.119
ZED Depth | 0.185 | 0.188 | 0.187 | 0.181 | 0.161 | 0.130 | 0.114
Thermal | 0.171 | 0.165 | 0.157 | 0.147 | 0.135 | 0.117 | 0.099
ZED RGB | 0.149 | 0.146 | 0.144 | 0.137 | 0.128 | 0.107 | 0.088
Orbbec Depth | 0.145 | 0.136 | 0.125 | 0.111 | 0.086 | 0.064 | 0.048

IR is the best performing modality (see Table 4), with the IP camera achieving an AUC ROC (σ) of 0.950 (0.049) and an AUC ROC (µ) of 0.937 (0.051), and the Orbbec camera achieving an AUC ROC (σ) of 0.923 (0.085) and an AUC ROC (µ) of 0.927 (0.077). IR cameras benefit from high visual clarity while also being able to perform well in a range of lighting conditions. This creates a low amount of noise within the image and less challenging conditions for the autoencoder to reconstruct. These factors can help improve the performance of the classifier. We also see a negligible difference between the IP camera and the Orbbec IR camera, indicating that the achieved performance is due to modality differences rather than camera variables such as recording frame rate or field of view.

Thermal video also performed very well, achieving an AUC ROC (σ) of 0.874 (0.112) and an AUC ROC (µ) of 0.883 (0.088) (see Table 4). This was despite the lower frame

rate and narrower field of view. This modality tends to obscure facial features and details in the background, and creates a silhouette of participants. This creates a strong separation between an individual and their surroundings. However, when an individual leaves or enters the frame, the gradient in the image changes to represent the hottest and coldest parts of the image, as seen by comparing the image without a participant (Figure ??) and with a participant (Figure 8). This may have resulted in decreased performance due to increased variability between scenes and as participants enter and leave frames. Additionally, this modality included interpolated frames in order to increase the frame rate from 4 to 8 fps.

The ZED depth camera achieved an AUC ROC (µ) of 0.894 (0.119). The Orbbec depth camera achieved an AUC ROC (σ) of 0.839 (0.133) and an AUC ROC (µ) of 0.872 (0.115). Performance decreased for the depth modality with the use of a global threshold compared to a per-video basis, with an AUC ROC (µ) of 0.843 and 0.831 for the ZED and Orbbec cameras, respectively. This may be because depth cameras often have depth errors at an object's edges and corners, requiring in-painting. Although in-painting improves performance, it is an estimation of missing values and thus introduces additional noise into the image.

The RGB camera had the worst performance amongst all modalities, with an AUC ROC (σ) of 0.859 (0.125) and an AUC ROC (µ) of 0.828 (0.123). The field of view, camera placement and image quality were similar to those of other datasets. Varied lighting conditions, both within a single image and across videos, may have greatly affected performance.

As we approached one-class classification through measuring reconstruction error, having a lower amount of noise can help isolate the reconstruction error due to fall activities. However, strong performance is still observed across various modalities and camera types, indicating that the signal strength is strong.

V. CONCLUSIONS AND FUTURE WORK
In this paper, we present a multi-modal fall detection dataset with real-world considerations. It contains six vision-based modalities and four physiological modalities, and considers environmental factors, a variety of complex activities and privacy. Performance across multiple visual modalities was analyzed within an anomaly detection framework.

Infra-red modalities outperformed other modalities, followed by thermal and then depth, with traditional RGB performing the worst. This order of results was maintained across the different reconstruction scoring methods (cross-context and within-context). The difference became more apparent when a global threshold was used to classify the results. It is also encouraging that strongly privacy-protecting modalities such as thermal and depth had competitive performance. This provides a path forward to creating strong-performing and privacy-protecting passive fall detection systems.

Future work would expand on this analysis through the use of multi-modal fusion. Combining multiple modalities and their respective strengths may improve overall performance and reduce false positives. Additionally, the objective or loss functions of the autoencoder may be altered in order to improve its reconstructive ability. Future work may also explore other deep learning methods, such as applying contrastive learning or attention through the use of visual transformers.

We believe MUVIM will provide a new benchmark dataset and help drive the development of real-world fall detection systems that can be deployed effectively in people's homes.

VI. CONFLICT OF INTEREST STATEMENT
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

VII. ACKNOWLEDGMENTS
The authors would like to thank all participants for their time and efforts in the creation of this dataset. We would also like to thank Paris Roserie for his support in the collection of the dataset.

REFERENCES
[1] "Falls," April 2021. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/falls
[2] E. Kramarow, L.-H. Chen, H. Hedegaard, and M. Warner, "Deaths from unintentional injury among adults aged 65 and over: United States, 2000-2013," NCHS Data Brief, no. 199, 2015.
[3] Centers for Disease Control and Prevention et al., "Fatalities and injuries from falls among older adults—United States, 1993–2003 and 2001–2005," MMWR: Morbidity and Mortality Weekly Report, vol. 55, no. 45, pp. 1221–1224, 2006.
[4] L. D. Gillespie, M. C. Robertson, W. J. Gillespie, C. Sherrington, S. Gates, L. Clemson, and S. E. Lamb, "Interventions for preventing falls in older people living in the community," Cochrane Database of Systematic Reviews, no. 9, 2012.
[5] A. Stinchcombe, N. Kuran, and S. Powell, "Seniors' falls in Canada: Second report: Key highlights," Tech. Rep. 2-3, 2014.
[6] L. Z. Rubenstein and K. R. Josephson, "The epidemiology of falls and syncope," Clinics in Geriatric Medicine, vol. 18, no. 2, pp. 141–158, 2002.
[7] M. E. Tinetti, W.-L. Liu, and E. B. Claus, "Predictors and prognosis of inability to get up after falls among elderly persons," JAMA, vol. 269, no. 1, pp. 65–70, 1993.
[8] R. Igual, C. Medrano, and I. Plaza, "Challenges, issues and trends in fall detection systems," Biomedical Engineering Online, vol. 12, no. 1, pp. 1–24, 2013.
[9] S. S. Khan and J. Hoey, "Review of fall detection techniques: A data availability perspective," Medical Engineering & Physics, vol. 39, pp. 12–22, 2017.
[10] E. E. Stone and M. Skubic, "Fall detection in homes of older adults using the Microsoft Kinect," IEEE Journal of Biomedical and Health Informatics, vol. 19, no. 1, pp. 290–301, 2014.
[11] G. Debard, P. Karsmakers, M. Deschodt, E. Vlaeyen, E. Dejaeger, K. Milisen, T. Goedemé, B. Vanrumste, and T. Tuytelaars, "Camera-based fall detection on real world data," in Outdoor and Large-Scale Real-World Scene Analysis. Springer, 2012, pp. 356–375.
[12] S. Khan, "Classification and decision-theoretic framework for detecting and reporting unseen falls," Ph.D. thesis, University of Waterloo, Ontario, Canada, 2016.
[13] I. Charfi, J. Miteran, J. Dubois, M. Atri, and R. Tourki, "Definition and performance evaluation of a robust SVM based fall detection solution," in 2012 Eighth International Conference on Signal Image Technology and Internet Based Systems. IEEE, 2012, pp. 218–224.
[14] T.-H. Tran, T.-L. Le, D.-T. Pham, V.-N. Hoang, V.-M. Khong, Q.-T. Tran, T.-S. Nguyen, and C. Pham, "A multi-modal multi-view dataset for human fall analysis and preliminary investigation on modality," in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 1947–1952.
[15] S. Chaudhuri, H. Thompson, and G. Demiris, "Fall detection devices and their use with older adults: a systematic review," Journal of Geriatric Physical Therapy (2001), vol. 37, no. 4, p. 178, 2014.
[16] S. Chaudhuri, D. Oudejans, H. J. Thompson, and G. Demiris, "Real world accuracy and use of a wearable fall detection device by older adults," Journal of the American Geriatrics Society, vol. 63, no. 11, p. 2415, 2015.
[17] M. Mubashir, L. Shao, and L. Seed, "A survey on fall detection: Principles and approaches," Neurocomputing, vol. 100, pp. 144–152, Jan. 2013.
[18] X. Wang, J. Ellul, and G. Azzopardi, "Elderly fall detection systems: A literature survey," Frontiers in Robotics and AI, vol. 7, p. 71, 2020.
[19] L. Martínez-Villaseñor, H. Ponce, J. Brieva, E. Moya-Albor, J. Núñez-Martínez, and C. Peñafort-Asturiano, "UP-fall detection dataset: A multimodal approach," Sensors (Switzerland), vol. 19, no. 9, 2019.
[20] P. Pierleoni, A. Belli, L. Maurizi, L. Palma, L. Pernini, M. Paniccia, and S. Valenti, "A wearable fall detector for elderly people based on AHRS and barometric sensor," pp. 6733–6744, 2016.
[21] T. Lee and A. Mihailidis, "An intelligent emergency response system: Preliminary development and testing of automated fall detection," Journal of Telemedicine and Telecare, vol. 11, no. 4, pp. 194–198, 2005.
[22] S. Chaudhuri, L. Kneale, T. Le, E. Phelan, D. Rosenberg, H. Thompson, and G. Demiris, "Older adults' perceptions of fall detection devices," Journal of Applied Gerontology, vol. 36, no. 8, pp. 915–930, Aug. 2017.
[23] X. Ma, H. Wang, B. Xue, M. Zhou, B. Ji, and Y. Li, "Depth-based human fall detection via shape features and improved extreme learning machine," IEEE Journal of Biomedical and Health Informatics, vol. 18, no. 6, pp. 1915–1922, 2014.
[24] F. Riquelme, C. Espinoza, T. Rodenas, J.-G. Minonzio, and C. Taramasco, "eHomeSeniors dataset: An infrared thermal sensor dataset for automatic fall detection research," Sensors, vol. 19, no. 20, p. 4565, 2019.
[25] J. Gutiérrez, V. Rodríguez, and S. Martin, "Comprehensive review of vision-based fall detection systems," Sensors, vol. 21, no. 3, p. 947, 2021.
[26] A. Ramachandran and A. Karuppiah, "A survey on recent advances in wearable fall detection systems," BioMed Research International, vol. 2020, 2020.
[27] E. Auvinet, F. Multon, A. Saint-Arnaud, J. Rousseau, and J. Meunier, "Fall detection with multiple cameras: An occlusion-resistant method based on 3-D silhouette vertical distribution," IEEE Transactions on Information Technology in Biomedicine, vol. 15, no. 2, pp. 290–300, 2010.
[28] Z. Zhang, C. Conly, and V. Athitsos, "Evaluating depth-based computer vision methods for fall detection under occlusions," in International Symposium on Visual Computing. Springer, 2014, pp. 196–207.
[29] G. Baldewijns, G. Debard, G. Mertes, B. Vanrumste, and T. Croonenborghs, "Bridging the gap between real-life data and simulated data by providing a highly realistic fall dataset for evaluating camera-based fall detection algorithms," Healthcare Technology Letters, vol. 3, no. 1, pp. 6–11, 2016.
[30] S. Vadivelu, S. Ganesan, O. R. Murthy, and A. Dhall, "Thermal imaging based elderly fall detection," in Asian Conference on Computer Vision. Springer, 2016, pp. 541–553.
[31] X. Luo, T. Liu, J. Liu, X. Guo, and G. Wang, "Design and implementation of a distributed fall detection system based on wireless sensor networks," EURASIP Journal on Wireless Communications and Networking, vol. 2012, no. 1, pp. 1–13, 2012.
[32] S. Gasparrini, E. Cippitelli, S. Spinsante, and E. Gambi, "A depth-based fall detection system using a Kinect® sensor," Sensors (Switzerland), vol. 14, no. 2, pp. 2756–2775, 2014.
[33] L. Martinez-Villaseñor and H. Ponce, "Design and analysis for fall detection system simplification," JoVE (Journal of Visualized Experiments), no. 158, p. e60361, 2020.
[34] H. Ponce, L. Martínez-Villaseñor, and J. Nuñez-Martínez, "Sensor location analysis and minimal deployment for fall detection system," IEEE Access, vol. 8, pp. 166678–166691, 2020.
[35] B. Kwolek and M. Kepski, "Improving fall detection by the use of depth sensor and accelerometer," Neurocomputing, vol. 168, pp. 637–645, 2015.
[36] S. S. Khan and M. G. Madden, "One-class classification: Taxonomy of study and review of techniques," The Knowledge Engineering Review, vol. 29, no. 3, pp. 345–374, 2014.
[37] J. Nogas, S. S. Khan, and A. Mihailidis, "DeepFall: Non-invasive fall detection with deep spatio-temporal convolutional autoencoders," Journal of Healthcare Informatics Research, vol. 4, no. 1, pp. 50–70, Mar. 2020. [Online]. Available: https://link.springer.com/article/10.1007/s41666-019-00061-4

VIII. APPENDIX
Participants with incomplete data: 9/30
• FD003:
  - Orbbec: Day 2, 4 (video stopped recording prematurely, does not contain fall)
  - FLIR 279: Day 2, 3 (video missing from drive, not transferred properly after trial)
  - FLIR 280: Day 1, 2, 3 (video missing from drive, not transferred properly after trial)
• FD006:
  - Orbbec: Day 3 (video missing from drive, not transferred properly after trial)
• FD007:
  - Orbbec: Day 4 (video missing from drive, not transferred properly after trial)
• FD009:
  - FLIR 280: Day 1 (video is corrupted on drive)
• FD010:
  - Orbbec: Night 1, 2, 3, 4, 5 (invalid video on drive, i.e. very small video size; not transferred properly after trial)
• FD029:
  - FLIR 279: ALL (video missing from drive, not transferred properly after trial)
• FD030:
  - FLIR 279: ALL (wrong participant recorded in video, not transferred properly after trial)
• FD031:
  - Day 1, 2, 3, 4, 5, Night 1, 2, 3 (invalid video on drive, i.e. very small video size; not transferred properly after trial)
• FD033:
  - Orbbec: Day 3, 4, 5 (invalid video on drive, i.e. very small video size; not transferred properly after trial)


FIGURE 8: Additional possible room furniture layouts used in the study protocol. Furniture pieces are as follows: black - mat; brown - shelf; olive - chair; grey - desk + chair; blue - table; red - sofa + end table; purple - lamp.

