MUVIM: Multi Visual Modality Fall Detection Dataset
ABSTRACT Falls are one of the leading causes of injury-related deaths among the elderly worldwide.
Effective detection of falls can reduce the risk of complications and injuries. Fall detection can be performed
using wearable devices or ambient sensors; these methods may struggle with user compliance issues or
false alarms. Video cameras provide a passive alternative; however, regular RGB cameras are impacted by
changing lighting conditions and privacy concerns. From a machine learning perspective, developing an
effective fall detection system is challenging because of the rarity and variability of falls. Many existing fall
detection datasets lack important real-world considerations, such as varied lighting, continuous activities
of daily living (ADLs), and camera placement. The lack of these considerations makes it difficult to
develop predictive models that can operate effectively in the real world. To address these limitations, we
introduce a novel multi-modality dataset (MUVIM) that contains four visual modalities: infra-red, depth,
RGB and thermal cameras. These modalities offer benefits such as obfuscated facial features and improved
performance in low-light conditions. We formulated fall detection as an anomaly detection problem, in
which a customized spatio-temporal convolutional autoencoder was trained only on ADLs so that a fall
would increase the reconstruction error. Our results showed that infra-red cameras provided the highest
level of performance (AUC ROC=0.94), followed by thermal (AUC ROC=0.87), depth (AUC ROC=0.86)
and RGB (AUC ROC=0.83). This research provides a unique opportunity to analyze the utility of camera
modalities in detecting falls in a home setting while balancing performance, passiveness, and privacy.
INDEX TERMS fall detection, multi-modal, autoencoder, anomaly detection, deep learning, computer
vision.
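The anomaly-detection formulation summarized in the abstract can be sketched in code: a spatio-temporal (3D) convolutional autoencoder is trained only on ADL windows, so fall windows reconstruct poorly. The sketch below is illustrative only, not the authors' exact architecture; the layer sizes, the 8-frame 64x64 window, and the use of PyTorch are assumptions.

```python
# Hedged sketch: a minimal 3D convolutional autoencoder for anomaly-based
# fall detection. NOT the paper's exact architecture; shapes are assumptions.
import torch
import torch.nn as nn

class SpatioTemporalAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, stride=2, padding=1),   # halve T, H, W
            nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1),  # halve again
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose3d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def reconstruction_error(model, windows):
    """Mean squared error per window; higher values suggest anomalous motion."""
    with torch.no_grad():
        recon = model(windows)
    return torch.mean((windows - recon) ** 2, dim=(1, 2, 3, 4))

model = SpatioTemporalAE()
batch = torch.rand(2, 1, 8, 64, 64)   # (batch, channel, frames, height, width)
print(reconstruction_error(model, batch).shape)   # torch.Size([2])
```

At inference time, a window whose reconstruction error exceeds a threshold (set per video or globally) would be flagged as a fall.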
I. INTRODUCTION
Falls are one of the leading causes of injury-related deaths among the elderly worldwide [1, 2], and they are a major cause of both death and injury in people over 65 years of age [3, 4]. The faster an individual receives help after a fall, the lower the risk of complications arising from the fall [5, 6, 7]. Fall detection systems improve the ability of older adults to live independently and "age in place" by ensuring they receive support when required. However, fall detection is a challenging problem from both predictive modeling and practical perspectives (such as a low false alarm rate and privacy) [8].

Predictive modeling challenges faced in fall detection include the rarity and diversity of fall events [9]. Previous studies by Stone et al. [10] found 454 falls in 3339 days' worth of data, and Debard et al. [11] found 24 falls in 1440 days' worth of data. It is therefore very challenging and time-consuming to run studies over a long duration, and even then they may still contain too few falls to build robust classifiers [12]. Another challenge is that fall events last only for short intervals in comparison with normal activities of daily living (ADLs) [9]. Finally, each rare fall event can vary greatly from one another, making it difficult to strictly define a well-defined class or to capture all possible variations in a dataset [12, 13, 14].
Many possibilities exist for the practical implementation of fall-detection systems. However, because solutions are ultimately intended to be implemented in a person's daily life, they should be easy to live with. Thus, passive systems are ideal, as they require no input from the user: they can monitor the environment before detecting a fall. These systems are preferable because a user may be unresponsive after a fall, or may not be wearing their device [15] [16]. However, older adults express privacy concerns regarding passive systems such as cameras. In addition to these concerns, systems must balance high sensitivity to falls with a low false alarm rate. Missed falls are potentially dangerous to users, who may rely on the system to detect falls. In contrast, a system with a high false alarm rate can cause many issues: if a system automatically calls an ambulance or even notifies loved ones, a high false alarm rate would result in large bills or potentially ignored cases of real falls and eventual rejection of the system.

Therefore, the key factors to consider when designing a system are that it be (i) passive, (ii) privacy protecting, and (iii) able to detect falls at a high rate with a low false alarm rate. In this paper, we introduce a novel multi-camera, multi-modal fall detection dataset containing falls and ADLs from 30 healthy adults and ADLs (with no falls) from 10 older adults. This dataset was collected in a semi-naturalistic setting inside a designed home. The experiments were designed to emulate real-world scenarios and sequences of events. The dataset contains six visual modalities mounted on the ceiling and four additional wearable modalities. The six visual modalities consisted of two infra-red cameras, two depth cameras, a thermal camera and an RGB camera. The four wearable modalities were accelerometer, PPG, GSR and temperature. We performed comprehensive experiments on detecting falls using a customized 3D convolutional autoencoder and showed that the infra-red modalities performed best, ahead of the depth, thermal and RGB cameras.

A. LITERATURE REVIEW
A wide range of modalities, methods of capturing information, and subsequent systems have been explored for fall detection. They can be divided into various groups, such as wearables, ambient sensors and computer vision-based sensors [17] [18]. Khan and Hoey [9] presented a review of fall detection methods based on the availability of fall data during model training.

Wearable systems often incorporate accelerometers, gyroscopes, and inertial measurement units (IMUs) to detect falls, though other modalities, such as an EEG or a barometric sensor, may also be used [19] [20]. Systems that use accelerometers or IMUs are relatively affordable and accurate [15] [17] [21]. However, these systems are invasive and require the user to constantly wear and charge the device. This can lead to many missed falls, as one may not wear the device all the time, e.g. while charging or bathing. Chaudhuri et al. [16] found that the wearable device was not worn for two-thirds of fall events. Additionally, older adults may be hesitant to wear a device because of the stigma it may imply regarding independence [22]. Despite the benefits of wearables, the real-world shortcomings of this type of device may limit its successful deployment.

Ambient or environmental systems, such as pressure mats, radar/Doppler, microphones, and motion sensors, typically use a wide range of sensors within the home to determine the user's activities [23] [19]. These systems may be difficult to install or have high false-alarm rates owing to environmental noise [17] [24].

Computer vision systems have traditionally relied on classical machine learning techniques to determine when a fall has occurred [25, 26]. They often use RGB or depth camera modalities; however, other visual modalities have also been used. These systems are generally passive, low-cost, and can achieve high performance. However, they struggle to maintain user privacy and may be limited to the area in which they are installed. Imaging modalities such as depth and thermal cameras may help alleviate the privacy issue by obscuring identifiable features, though a person may still be identified. The techniques used to analyze images vary [26]; 3D bounding boxes and background subtraction are among the most popular traditional approaches. Recently, deep learning techniques have been applied to improve results: modeling techniques including CNN-LSTM models and 3D CNNs have shown promising results [25].

We now present a review of some of the fall detection datasets. Our review of fall detection datasets and multi-modal fall detection datasets is limited to those that contain at least one visual modality.

B. UNIMODAL DATASETS
A review of existing publicly available fall datasets containing only a single visual modality is outlined below.

Charfi et al. [13] released an unnamed dataset containing 197 falls and 57 wall-mounted videos of normal activities. It comprised four background scenes, varying actors, and different lighting conditions. The directions of activities relative to the camera were varied to reduce any impact that the directional camera location may have. All activities and falls started with a person in the frame and were segmented into short video clips of specific actions. To determine falls, manually generated bounding boxes were used to extract features, and an SVM was used to classify falls.
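The classical pipeline described for Charfi et al. (hand-crafted features from a per-frame bounding box, fed to an SVM) can be illustrated as follows. The specific features used here (box aspect ratio and vertical drop) and the synthetic training data are assumptions for illustration, not the paper's actual feature set.

```python
# Hedged sketch of a bounding-box + SVM fall classifier. Features and data
# are illustrative assumptions, not those of the cited work.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def box_features(box_t, box_prev):
    """box = (x, y, w, h) in pixels; y grows downward."""
    x, y, w, h = box_t
    _, y_prev, _, h_prev = box_prev
    aspect = w / h                      # falls tend to flatten the box
    drop = (y + h) - (y_prev + h_prev)  # downward motion of the box bottom
    return [aspect, drop]

# Synthetic examples: upright ADL boxes vs. flattened, fast-dropping fall boxes.
adl = [box_features((50, 100 + rng.integers(-2, 3), 40, 120), (50, 100, 40, 120))
       for _ in range(50)]
fall = [box_features((50, 180 + rng.integers(10, 30), 120, 45), (50, 100, 40, 120))
        for _ in range(50)]
X = np.array(adl + fall)
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="rbf").fit(X, y)
pred = clf.predict([box_features((50, 210, 130, 40), (50, 100, 40, 120))])
print(pred)  # → [1] (a flattened, dropped box is classified as a fall)
```

Real systems would compute such features from detected bounding boxes on every frame rather than from synthetic values.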
The multi-camera fall dataset collected by Auvinet et al. [27] contained eight IP cameras placed around the room. It included only 24 scenarios, 22 of which contained falls after a short activity. These videos were segmented into clips lasting less than 30 seconds. All rooms were well lit, with minimal furniture placed in the center of the room. The small size of the dataset and other limiting factors could limit the development of generalized models for real-world settings. Methods included background subtraction to obtain a silhouette, from which the vertical volume distribution ratio was used to classify falls.

The Kinect Infrared dataset published by Mastorakis et al. [? ] used one camera placed at eye level in the middle of a wall. The dataset included three variations of camera angles (backward, forward and sideways), totaling 48 video falls. The direction of the fall is important because of the eye-level placement of the camera. In addition to falls, activities performed while sitting, lying on the floor, and "picking up an item from the floor" were performed by eight different participants. Additionally, two participants were asked to perform activities slowly to replicate the movements of older adults. OpenNI and depth data from the Kinect sensor were used to generate a 3D bounding box whose first-derivative parameters were analyzed to classify falls.

The EDF/OCCU datasets contain two viewpoints from a Kinect camera [28]. The data were divided into two datasets: a non-occlusion dataset, EDF, of 40 falls and 30 actions, and an occlusion dataset, OCCU, of 30 occluded falls and 80 actions. Both viewpoints were from cameras placed at eye level, and thus 'directional falls' were performed. Room furniture and fall variations were minimal, with most variations related to the direction of falls. Occlusion was introduced by a single bed positioned to block the view of the bottom half of a fall. Actions in the dataset without occlusions included picking things off the floor, sitting on the floor and lying on the floor. These same actions were performed in the occluded dataset, with the further addition of tying shoelaces. All actions, except for lying on the floor, were occluded by the bed.

The SDU dataset comprises ten young adults performing six simple actions captured with a depth camera [23]: falling, bending, squatting, sitting, lying down and walking. These actions were repeated 30 times each under the following conditions: carrying or not carrying a large object, lighting on or off, room layout, camera direction, and position relative to the camera. Despite the large variation in each repetition of the action, the eye-level placement of the camera means that the fall direction is still important. Additionally, all actions were short and segmented, with an average length of eight seconds per clip.

The KUL simulated fall detection dataset was designed to address many of the shortcomings these datasets face when real-world factors are considered [29]. Five IP cameras, 55 fall scenarios, and 17 ADL videos were used. The dataset includes realistic environments (in terms of furnishings) and longer videos, instead of short segmented clips of various activities. Five goals were highlighted to improve the real-world effectiveness of the dataset: realistic settings, realistic fall scenarios, an improved balance of ADL to fall activities, the inclusion of real-world challenges (occlusions, partial falls, lighting, etc.), and continuous recording of the data. However, cameras mounted on the walls were used, and no older adults were included in the dataset.

The thermal fall detection dataset was designed to replicate the KUL simulated fall dataset but with a thermal camera [30]. This dataset contains only nine video segments with ADL activities and 35 segments with fall scenarios. It uses a single eye-level-mounted thermal camera with a slightly more constrained field of view. The same limitations as the KUL simulated fall dataset also apply, as it replicated that dataset's setup.

Computer vision datasets vary along several main dimensions: camera type, location, lighting, occlusions in the scene, and recorded participant activities.

Different camera types or modalities struggle with different considerations. RGB cameras can capture clear images in general lighting conditions; however, they may not work well under poor lighting conditions, such as dark or night-time scenarios. RGB cameras also do not offer any level of privacy. Depth cameras, such as the defunct Microsoft Kinect camera, are a popular alternative to RGB cameras because they can provide a light-independent image and protect the privacy of individuals [23? , 19, 18]. Vision modalities such as thermal and depth cameras obscure identifying features while still providing silhouettes of individuals. In addition, they may perform better in certain scenarios, such as those with poor lighting or visually busy environments, owing to lighting independence and silhouette segmentation.

The location of the camera in the room can also affect the results. Most previous datasets have a camera mounted on a wall at or above eye level, as shown in Table 1. The problem with this placement is that a fall looks very different depending on its direction (across or in line with the field of view). As such, these datasets include strict definitions of the orientation of falls relative to the camera. This limits the variety of falls to a short list of possible variations [19]. In addition to being affected by the fall direction, cameras mounted at eye level are more susceptible to occlusions blocking the view of the participant (e.g. behind furniture) and affecting the cameras' ability to detect falls. Some datasets make efforts to include furniture/occlusions, but they may be limited to a single chair or bed (the Multiple Camera [27], EDF/OCCU [28], and SDU [23] datasets), while others have no furniture in frame (Mastorakis et al. [? ]). To mitigate these limitations (i.e., orientation of falls and occlusions), cameras can be placed on the ceiling. This provides a similar view of every fall and removes furniture that may be between the subject and the camera [31] [32].

As with falls, the ADLs performed by the participants in
the datasets varied widely. Activities are often segmented into very short and specific motions, such as a single squat, picking up an object, taking a seat, or lying down. These short and specific videos, lasting from five to 60 seconds, may limit a system's ability to generalize to the real world because of the oversimplification and trivialization of the diversity and problems that can arise in activities and falls. In response to the lack of real-world considerations, Baldewijns et al. [29] attempted to create a dataset that accounts for these factors. However, that dataset still lacks ceiling-mounted cameras, varying environments, and older adult participants. These "real-world" factors are outlined in Table 1.

C. MULTIMODAL DATASETS
In recent years, merging a variety of sensors to improve performance has become of interest in fall detection systems, and several multi-modal datasets have emerged. A multi-modal approach provides different sources of information that can compensate for each other's deficiencies.

Combining modalities may be complementary from a technical perspective, but it may further impair practical considerations. More modalities mean more sensors or cameras, increasing costs and potentially inconveniencing the user. Because a wide range of modalities is used for fall detection, a wide range of combinations is possible. It is therefore important to select modalities that complement each other without increasing the practical costs to the user. These practical, real-world considerations for multi-modal datasets are highlighted in Table 2.

The UP fall detection dataset contained two cameras at eye level with frontal and lateral views [19]. Other modalities were captured through five IMU sensors, one EEG headset, and six infrared motion sensors placed in a grid. The activities were limited to six simple motions and five fall variations. Activities varied from 10 to 60 seconds and were segmented from each other. The limited fall directions and short segmented activities limit the real-world implications of this dataset. In addition, many of the chosen modalities were impractical. Five IMU sensors are difficult to implement; selecting only the most relevant sensors may be more plausible, as the authors did in a follow-up paper [33] [34]. However, motion sensors still struggle with accuracy and with occlusions or furniture placed in a room. EEG headsets are also extremely difficult to use in the real world.

The URFD dataset was recorded with two Kinect cameras: one at eye level and one ceiling-mounted for fall sequences [35]. However, only a single eye-level camera was used to record the ADL activities. The accelerometer was worn on the lower back using an elastic belt; this sensor location is not ideal because a special device would need to be worn. The dataset contained only 30 falls and 40 activities of daily living. Along with the limited dataset size, it also contained short, segmented activities and falls with limited variation.

The CMDFALL dataset [14] focuses on multi-view as well as multi-modal capture. Seven overlapping Kinect sensors were used, and two accelerometers were worn on the right hand and hip. The room was well lit, with a minimal amount of furniture unless required for the fall scenario. It contained eight falls and 12 actions; the eight falls were based on three main groups: walking, lying on a bed, or sitting on a chair. This dataset shares many of the same issues, such as trimmed videos, a limited variety of simple single-action activities, and limited fall styles. The environment is also always well lit, with minimal furniture and occlusion in the space.

These existing multi-modal datasets lack consideration of lighting, furniture, fall variety, variety of ADLs, and camera placement, which may impact real-world performance. Beyond these technical considerations, practical considerations are also lacking, with many multi-modal datasets requiring multiple wearable devices to be worn.

D. INTRODUCED MULTI-MODAL DATASET (MUVIM)
To circumvent the problems associated with previous unimodal and multi-modal datasets, we present the Multi Visual Modality Fall Detection Dataset (MUVIM), a novel multi-camera, multi-modality fall detection dataset. MUVIM places a larger emphasis on different types of camera-based modalities. Multiple camera modalities were selected because of their practical advantages compared with wearable or other types of modalities. Additionally, this allows a direct comparison between privacy-protecting and non-privacy-protecting modalities.
TABLE 3: Devices used and the data collected from each.

Device | Description | Installation | Data Collected
Hikvision IP network camera (1) | Dome-style IP security camera (IR illumination) | Ceiling (centre) | IR video (grayscale); File Format: mp4; Framerate: 20 fps; FOV: 106 degrees; Resolution: 704 x 480 pixels
StereoLabs ZED depth camera (1) | 3D depth camera (ambient illumination) | Ceiling (centre) | RGB video, Depth video; File Format: avi; Framerate: 30 fps; FOV: 90 x 60 degrees; Resolution: 1280 x 720 pixels
Orbbec Astra Pro (1) | 3D depth camera (near-IR illumination) | Ceiling (centre) | IR video, Depth video; File Format: avi; Framerate: 30 fps; FOV: 60 x 49.5 degrees; Resolution: 640 x 840
FLIR ONE Gen 3 (3) (Model number: SM-G532M; IATSL 278, IATSL 279, IATSL 280) | Thermal camera attachment for smartphones | Ceiling (left, centre, right) | Thermal video; File Format: mp4; Framerate: 8.7 fps; FOV: 50 x 38 degrees; Resolution: 1440 x 1080 pixels
Empatica E4 (1) | Wristband for monitoring of movement and physiological signals | Wearable | Plain text data; File Format: csv
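Because the cameras in Table 3 record at different rates (e.g. 30 fps for the ZED versus 8.7 fps for the FLIR ONE), frames must be temporally aligned before any cross-modal comparison. Nearest-timestamp matching is one simple option, sketched below; MUVIM's own synchronization procedure is not specified here, so this is an illustrative assumption.

```python
# Hedged sketch: aligning two camera streams with different frame rates by
# nearest-timestamp matching. Timelines are synthetic.
import numpy as np

def nearest_frame_index(src_times, target_time):
    """Index of the source frame captured closest to target_time (seconds)."""
    return int(np.argmin(np.abs(src_times - target_time)))

zed_times = np.arange(0, 10, 1 / 30)      # 30 fps depth timeline
flir_times = np.arange(0, 10, 1 / 8.7)    # 8.7 fps thermal timeline

# For each thermal frame, find the depth frame captured closest in time.
pairs = [(i, nearest_frame_index(zed_times, t)) for i, t in enumerate(flir_times)]
print(pairs[:3])  # → [(0, 0), (1, 3), (2, 7)]
```

Interpolation or hardware-triggered capture would be alternatives when tighter synchronization is required.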
B. PARTICIPANTS AND ACTIVITIES
The study was divided into two phases, each with a different participant population. In Phase 1, data were collected from 30 healthy younger adults between 18 and 30 years of age (mean age = 24; number of females = 14, or 46.7%). Inclusion criteria for this phase were: must be aged 18-30, be clear of any health complications that may hinder balance or performance in the study, be able to understand and speak English, be able to move around safely in a furnished room without eyeglasses, and be able to travel to and attend sessions on site. In Phase 2, data were collected from 10 healthy older adults at least 70 years of age (mean age = 76.4; number of females = 5, or 50%). Participants in Phase 1 were asked to simulate falls onto a 4" thick crash mat. Participants in Phase 2 did not require the use of a crash mat, since no falls were simulated by this population.

Prior to a session, participants were provided with a consent form detailing the protocol of this study. Once consent was provided, the researchers confirmed the eligibility of the participant by conducting a brief screening questionnaire. Two versions of this questionnaire were used, since the populations for Phase 1 and Phase 2 differed. In Phase 2, questions pertained to the participants' mobility and vision issues to ensure safety, and older adults were asked whether they had experienced a fall within the last year, to help ensure they did not have any mobility issues.

Each session was approximately one (1) hour in length, including consent, preparation time, data collection, and intermissions to change the setting of the room. The setting of the room (including furniture and the crash mat) was randomized for each participant based on five (5) pre-made arrangements. The furniture and crash mat were stationary throughout each session but were moved to a new configuration between participant sessions. This was done to avoid building a trivial classifier that detects the crash mat as the cue for a fall in a scene. These five room settings were created so as to include as many props and pieces of furniture as possible within the field of view of all modalities.

Each session contained ten (10) trials: five "day-time" trials that were well lit and five "night-time" trials with poor lighting. Blackout curtains were drawn and the main overhead lights were turned off in night-time trials, but an incandescent lamp was left on to provide enough illumination for participants to move around safely. Each trial required the participant to act out a scripted "story" while interacting with various furniture and props in the scene. For example, one story involves the participant returning home from an outing, putting down his or her bags, taking off his or her shoes and jacket, making tea, and sitting down at a computer. The format of each story was intended to put participants at ease and increase the realism of the data collected. The collection of stories used in this study covered a broad range of normal activities that might occur in a younger or older adult's living room. In addition, the stories allowed the participants to move around the room as they interacted with the furniture and props placed throughout. The order of completion of the ten trials was randomized between participants to prevent researcher bias (e.g. non-significant trends or presumptions due to the order of trials) and to allow better comparison between trials.

At the beginning of each session, a research assistant (outside the view of the cameras) would provide cues to the participant to perform certain activities. These cues were based on a pre-determined script that contained different scenarios (as discussed above). The participant would follow the cues (e.g. sit on the sofa, walk around, pick an object up from the floor, work on the laptop, etc.) and perform these activities in the way they would like to. In each session, a person carried out normal activities based on the given cues, and each session would end with a fall on the crash mat. The types of falls were kept diverse across sessions and across persons, to capture different types of falls.
FIGURE 2: One example of the arrangements of the five furniture pieces in the room. Furniture pieces are as follows: black - mat; brown - shelf; olive - chair; grey - desk + chair; blue - table; red - sofa + end table; purple - lamp.

Five general representative fall types of older adults were selected:

• Tripping or stumbling forward, and falling when unable to catch oneself.
• Standing or walking, losing one's balance, and falling backwards or sideways. Note that a backward fall ends in a sitting or a lying position.
• Falling slowly/incrementally. (The person loses balance, catches themselves on some object, but continues to fall slowly when they do not have the strength to recover.)
• Attempting to sit, but falling when the chair slides out from underneath.
• Attempting to stand, putting weight on a chair, table or walker, but the supporting object moves away and fails to provide support.
• Sitting in a chair, leaning forward to reach an item on the floor, putting on shoes, or attempting other activities, and toppling out of the chair.

Each story was assigned one or more fall types and recovery methods that could logically fit at the end of the story. If multiple fall types were possible for that story, one was determined at random. A higher number of normal activities relative to fall events was deliberately performed, to represent actual scenarios; a disproportionately high number of falls may over-simplify or bias the predictive models. In summary, a trial consisted of enacting a story, simulating a fall, and engaging in a recovery, all of which were chosen randomly from a pre-defined script. Participants in Phase 2 were also asked to complete ten stories but were not tasked with simulating a fall or a recovery.

Participants were prompted through the steps of each trial. However, these prompts were not given with specific details, to allow the participant to complete the assigned tasks based on their own interpretation. This helps in building classifiers that generalize rather than depend on a particular sequence of activities.

The Empatica E4 wristband was turned on at the beginning of the session. Trials were delimited by the participant pressing the main button at the beginning and end of each trial. The Astra Pro, FLIR ONE, ZED, and IP cameras were manually controlled by the researchers: recordings were started at the beginning of each trial and stopped at the end of each trial.

C. DATA PRE-PROCESSING AND CONSOLIDATION
1) Data Structure
Immediately after each session, the researchers transferred all collected data to a secure network server, into a folder labelled with the participant's ID. Each participant's folder contained subfolders for each type of modality used during the trial, including data from the Empatica E4 wristband. See Figure 3 for a breakdown of the directory structure of the public dataset. Each modality folder contained ten folders, one for each trial (i.e. Day 1-5, Night 1-5). Once data transfer was complete, the researchers converted each of the camera videos into individual frames in JPEG format.

FIGURE 3: Directory tree used to store trials and cameras. FD001 is the folder for participant 1, containing all cameras. Each camera has its own folder, containing 10 sub-folders, one for each trial (labeled day or night). The FLIR thermal camera contains three additional folders, one for each of the three thermal cameras used (note: 278, 279 and 280 are generated labels).

2) Labelling Procedure
Once all data were transferred, the researchers labelled the beginning and end frame numbers of each fall that occurred during all recorded sessions. All labels were noted in an Excel spreadsheet. Two researchers were recruited for this task to reduce sample bias and increase labelling accuracy. The start of a fall was marked as the frame at which the participant started to lose balance. The end of a fall was marked as the frame at which the participant was at rest on the ground. The types of falls observed are listed above. Certain trials contained more than one fall; in this case, both falls were documented in the spreadsheet.
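Given start and end frame labels recorded in a spreadsheet, a trial's extracted frames can be split into normal and fall segments. The column names below are assumptions for illustration, not the dataset's actual schema.

```python
# Hedged sketch: turning spreadsheet fall labels into per-frame labels.
# Column names ("trial", "fall_start_frame", "fall_end_frame") are assumed.
import pandas as pd

labels = pd.DataFrame({
    "trial": ["FD001/Day1", "FD001/Day1"],
    "fall_start_frame": [400, 910],   # frame where balance is first lost
    "fall_end_frame": [430, 950],     # frame where the person is at rest
})

def frame_label(trial, frame_idx, labels):
    """1 if the frame lies inside any labelled fall window of the trial."""
    rows = labels[labels["trial"] == trial]
    inside = ((rows["fall_start_frame"] <= frame_idx)
              & (frame_idx <= rows["fall_end_frame"]))
    return int(inside.any())

print(frame_label("FD001/Day1", 415, labels))  # → 1 (inside the first fall)
print(frame_label("FD001/Day1", 500, labels))  # → 0 (normal activity)
```

Such per-frame labels are what an anomaly detector's reconstruction errors would be evaluated against.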
and a narrower frame of view. This modality tends to obscure facial features and background details and to create a silhouette of participants, producing a strong separation between an individual and their surroundings. However, when an individual leaves or enters the frame, the gradient in the image changes to represent the hottest and coldest parts of the image, as seen between the images in Figure ?? (without a participant) and Figure 8 (with a participant). This may have resulted in decreased performance due to increased variability between scenes and as participants enter and leave the frame. Additionally, this modality included interpolated frames to increase the frame rate from 4 to 8 FPS.

The ZED depth camera achieved an AUC ROC(µ) of 0.894 (0.119). The Orbbec depth camera achieved 0.839 (0.133) and an AUC ROC(µ) of 0.872 (0.115). Performance decreased for the depth modality with the use of a global threshold, as compared with a per-video basis, with an AUC ROC(µ) of 0.843 and 0.831 for the ZED and Orbbec cameras, respectively. This may be because depth cameras often have depth errors at object edges and corners, requiring in-painting. Although in-painting improves performance, it is an estimation of missing values and thus introduces additional noise into the image.

The RGB camera had the worst performance among all datasets, with an AUC ROC(σ) of 0.859 (0.125) and an AUC ROC(µ) of 0.828 (0.123). The field of view, camera placement and image quality were similar to those of the other datasets. Varied lighting conditions, both within a single image and across videos, may have greatly affected performance.

As we approached one-class classification through measuring reconstruction error, a lower amount of noise can help isolate the reconstruction error due to fall activities. However, strong performance is still observed across various modalities and camera types, indicating that the signal is strong.

Future work would expand on this analysis through multi-modal fusion. Combining multiple modalities and their respective strengths may improve overall performance with fewer false positives. Additionally, the objective or loss function of the autoencoder may be altered to improve its reconstructive ability. Future work may also explore other deep learning methods, such as applying contrastive learning or attention through the use of visual transformers.

We believe MUVIM will provide a new benchmark dataset and help drive the development of real-world fall detection systems that can be deployed effectively in people's homes.

VI. CONFLICT OF INTEREST STATEMENT
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
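As an illustration of the evaluation protocol discussed in the results, frame-level reconstruction errors can be scored with AUC ROC either per video or pooled across all videos (a "global" threshold sweep). The error values below are synthetic, and this is not the authors' evaluation code; it only shows why pooling videos with different error baselines tends to lower the AUC.

```python
# Hedged sketch: per-video vs. global AUC ROC over reconstruction errors.
# Values are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

videos = []
for base in (0.02, 0.05):                  # two videos, different error baselines
    errors = rng.normal(base, 0.005, 200)  # per-frame reconstruction error
    frame_labels = np.zeros(200, dtype=int)
    frame_labels[90:110] = 1               # a 20-frame fall
    errors[90:110] += 0.03                 # fall frames reconstruct poorly
    videos.append((errors, frame_labels))

# Per-video AUC: each video's errors are ranked only against themselves.
per_video = float(np.mean([roc_auc_score(l, e) for e, l in videos]))

# Global AUC: all frames pooled, so differing baselines blur the separation.
pooled_errors = np.concatenate([e for e, _ in videos])
pooled_labels = np.concatenate([l for _, l in videos])
global_auc = roc_auc_score(pooled_labels, pooled_errors)

print(round(per_video, 3), round(global_auc, 3))
```

Here the per-video score is near perfect while the pooled score drops, mirroring the per-video versus global-threshold gap reported for the depth cameras.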