
Hindawi

Mathematical Problems in Engineering


Volume 2022, Article ID 9962666, 15 pages
https://doi.org/10.1155/2022/9962666

Research Article
Lightweight Fall Detection Algorithm Based on AlphaPose
Optimization Model and ST-GCN

Hongtao Zheng and Yan Liu


School of Information and Electrical Engineering, Zhejiang University City College, Hangzhou 310000, China

Correspondence should be addressed to Yan Liu; [email protected]

Received 9 May 2022; Accepted 9 June 2022; Published 11 July 2022

Academic Editor: Xuefeng Shao

Copyright © 2022 Hongtao Zheng and Yan Liu. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is
properly cited.
Falls cause great harm to people, and the more mature fall detection algorithms cannot be migrated well to embedded platforms because of their heavy computational load, which limits their practical application. This paper proposes a lightweight fall detection algorithm based on an optimized AlphaPose model and ST-GCN. First, starting from YOLOv4, the GhostNet structure replaces the CSPDarknet53 backbone network, the path aggregation network is converted into BiFPN (bidirectional feature pyramid network), and DSC (depthwise separable convolution) replaces the standard convolution of the spatial pyramid pooling, BiFPN, and YOLO head networks. The TensorRT acceleration engine is then used to accelerate the improved YOLO algorithm. In addition, a new Mosaic data enhancement algorithm is used to improve the training of the pedestrian detection algorithm. Second, the TensorRT acceleration engine is used to optimize the AlphaPose pose estimation model, which speeds up the inference of the pose joint points. Finally, the spatiotemporal graph convolutional network (ST-GCN) is applied to detect and recognize actions such as falls, which enables effective fall detection in different scenarios. The experimental results show that, on the Jetson Nano embedded platform with an image resolution of 416 × 416, the detection frame rate of this method is stable at about 8.33 frames per second. The accuracy of the algorithm on the UR dataset and the Le2i dataset reaches 97.28% and 96.86%, respectively. The proposed method has good real-time performance and reliable accuracy, and it can be applied on embedded platforms to detect the fall state of people in real time.

1. Introduction

Falls can cause all kinds of trauma, which can be life-threatening in severe cases. Studies also show that nearly half of all falls worldwide lead to medical attention, decreased functioning, impaired social or physical activity, and even death [1, 2]. Medical surveys have shown that if timely treatment is given after a fall, the risk of death can be reduced by 80% and the survival rate can be significantly improved. However, all actions taken after a fall are less important than detecting a person's posture before they fall. Therefore, it is of great significance to quickly detect the occurrence of falls [3].

At present, the research on fall detection can be divided into three main categories: (1) Detection methods based on environmental equipment [4–6], which detect falls from the environmental signals produced when the human body falls, e.g., by sensing changes in pressure and sound. However, this method has a higher false positive rate and is less likely to be adopted. (2) Detection methods based on wearable sensors [7–10], e.g., accelerometers and gyroscopes. However, wearing sensors for a long time affects people's comfort and increases the physical burden, and the false positive rate is also higher for complex activities. (3) Detection methods based on visual recognition [11–15], which can be divided into two categories. One is the traditional machine vision method, which extracts effective fall features. Its hardware requirements for the running platform are low, but its robustness is weak and it is easily disturbed. The other is the artificial intelligence method, which uses the images captured by the image sensor for the training and inference of a convolutional neural network, and its recognition accuracy can reach a high level. However, this method also requires a demanding training environment configuration, which greatly limits its application and promotion.

Meanwhile, many embedded devices have appeared in recent years, such as the Jetson Nano, Jetson NX, and Jetson TX2. These relatively cheap and small embedded devices also have considerable computing power, which makes it possible to migrate and deploy artificial intelligence algorithms. Most of the methods currently available cannot run well on embedded devices. Hence, this paper proposes a fall detection algorithm to solve this problem.

The specific improvements of the algorithm in this paper are as follows:

(1) In the early stage, to enhance the generalization ability of the dataset, the original mosaic data enhancement algorithm was improved and optimized, and a new mosaic data enhancement method was proposed.

(2) To reduce the structural complexity of the target detection algorithm while ensuring good recognition accuracy for people in scenes of different complexity, this paper improves the structure of YOLOv4 and proposes a novel object detection algorithm structure.

(3) To speed up the YOLO algorithm further, this paper uses the TensorRT acceleration engine.

(4) To ensure the accuracy of the detection algorithm, the joint detection algorithm selected in this paper is AlphaPose. Considering the need to migrate AlphaPose to embedded devices, this paper also proposes an optimization method for the AlphaPose detection model.

(5) A spatiotemporal graph convolution algorithm is introduced to perform the actual detection of the fall state.

2. Related Work

At present, the most common and generally effective fall detection algorithm is the vision-based detection algorithm. Generally speaking, a vision-based detection algorithm first uses a target detection algorithm to detect the pedestrians in the image, then inputs the detection results into a joint point detection algorithm, such as AlphaPose or OpenPose, and finally combines the joint point coordinates with the behavioral state at the time of a fall to determine whether a fall has occurred.

2.1. Object Detection Algorithm Based on Pedestrian Detection. Traditional pedestrian detection methods mainly extract features manually. Tian et al. [16] proposed a novel multiplex classifier model composed of two multiplex cascade parts: a Haar-like cascade classifier and a shapelet cascade classifier. Reference [17] proposed the histogram of oriented gradients (HOG), which exploits the directionality of edges to describe the overall appearance of pedestrians. However, the steps of this extraction method are cumbersome, and the calculation of the recognition algorithm is complicated, resulting in poor real-time performance.

Pedestrian detection has progressed rapidly because of recent developments in deep learning research. At present, target detection algorithms based on deep learning can be roughly divided into two categories: (1) two-stage detection algorithms, represented by R-FCN (region-based fully convolutional network) [18], and (2) single-stage detection methods, represented by YOLO (you only look once) [19]. The two-stage detection method has high accuracy but poor real-time performance. The single-stage detection method has slightly lower accuracy but good real-time performance and fast detection speed.

The two-stage detection method uses a cascade structure: the amount of network computation increases and the accuracy improves correspondingly, but the detection speed is sacrificed and real-time requirements cannot be met. Despite considerable effort to make up for this shortcoming, the problem has not been solved well. Regarding the single-stage detection method, Redmon et al. proposed YOLO [19] in 2016, which is the first single-stage detection method based on deep learning. It creatively combines candidate regions with target recognition, which solves the low efficiency of two-stage target detection algorithms. Redmon and Farhadi then went on to propose YOLOv2 [20] and YOLOv3 [21], which significantly improved the detection performance and enabled the YOLO family of methods to be widely used in various tasks. In 2020, Bochkovskiy improved the network structure of YOLOv3 and proposed YOLOv4, which greatly improves detection accuracy while maintaining speed. More recently, Jocher proposed YOLOv5, which brings together other state-of-the-art technologies. Although the performance of YOLOv5 is slightly worse than that of YOLOv4, it is more flexible and faster and has certain advantages in rapidly deploying models.

2.2. Development of Joint Detection Algorithms. In human pose detection, there are two main methods of joint point detection: bottom-up and top-down. The bottom-up approach is represented by OpenPose [22], an end-to-end detection algorithm and open-source library based on convolutional neural networks and supervised learning, developed with Caffe as the framework. It can estimate human motion, facial expression, movement, and so on, and it is robust for both single-person and multi-person scenes. The algorithm first detects all human body joint points in the image and then determines which human body each joint point belongs to through the relationship between the joint points. Although this method is fast, it is easily disturbed by nonhuman objects. The top-down method is represented by AlphaPose [23], which is a multistage detection method. First, target detection is performed to identify the human targets in the image and mark a rectangle around each human body area to exclude nonhuman interference. The detection of joint points within each human body area is then very accurate, and the calculation speed is also fast.

2.3. Other Recommendations. Reference [24] proposed a framework based on a multilayer dual LSTM network for multimodal sensor fusion to perceive and classify patterns of daily activities and highly shared events. Reference [25] proposed an optically anonymous image sensing system, which uses convolutional neural networks and autoencoders for feature extraction and classification to detect abnormal behaviors and largely protects the privacy of the elderly. Reference [26] uses two-dimensional image data to extract an effective image background through the frame difference method, Kalman filter, etc., and uses it as the input of a KNN (K-nearest neighbor) classifier. It achieves an accuracy rate of 96% but is susceptible to variable factors. Reference [27] uses two-dimensional image data to calculate optical flow information and sends it to VGG (visual geometry group) for feature extraction and classification to detect falls. In reference [28], the feature information extracted by the CNN convolutional layers and the fully connected layer is sent to a long short-term memory (LSTM) network, which is trained to extract the temporal correlation of human spatial actions and identify human behavior. However, LSTM needs to dynamically store and update data, which limits its real-time performance.

3. Materials and Methods

The basic flow of the fall detection algorithm in this paper is as follows. (1) For the preliminary training of the weight files, the pedestrian dataset is collected with ordinary cameras and augmented with the new mosaic data enhancement method, and the target detection model and the joint point detection model are trained separately. (2) In the running process of the overall algorithm, the camera connected to the Jetson Nano captures real-time pedestrian images, and the improved YOLOv4 algorithm, accelerated by the TensorRT engine, detects the targets. The detection result is then converted to the tensor data structure to serialize the target images, which are fed into the AlphaPose joint point detection algorithm optimized by the model. Finally, the spatiotemporal graph convolutional neural network ST-GCN takes the coordinates of the key points of the human skeleton extracted by AlphaPose as the model input and constructs a spatiotemporal graph with the joints as graph nodes and with the natural connections of human bones and the temporal relationships of the same joint as graph edges, so that the information is integrated in the temporal and spatial domains. The final result is obtained in combination with motion state analysis. The specific algorithm structure flow chart is shown in Figure 1.

3.1. Object Detection Algorithm Based on Pedestrian Detection. The mosaic method was first proposed in the YOLOv4 paper. It is a data enhancement algorithm that extends the CutMix (cutting and mixing) [29] method. The two blue paths in Figure 2 are mentioned in the YOLOv4 paper: $m_1$ represents the original image input, and $m_4$ represents the four-in-one image input. The innovation of the mosaic algorithm in this paper is that an input form $m_9$, which represents the nine-in-one image input, is added to these two paths. The specific generation flow chart is shown in Figure 3. Compared with $m_4$, $m_9$ greatly enriches the background of the detected objects. In the BN calculation, the data of 9 pictures can be calculated at a time, which lowers the hardware resource requirements during training.

The specific operation is as follows. The first step is to take the width and length $(w, h)$ of the input image as boundary values. Then, the image is scaled, where the x-axis and y-axis are scaled by the multiples $k_x$ and $k_y$, respectively, whose formulas are as follows:

$$k_x = \mathrm{Rand}(k_w,\ k_w + \Delta k_w), \tag{1}$$

$$k_y = \mathrm{Rand}(k_h,\ k_h + \Delta k_h). \tag{2}$$

Here, $k_w$ and $k_h$ are the minimum values of the width and length scaling multiples, respectively, and $\Delta k_w$ and $\Delta k_h$ are the lengths of the random intervals of the width and length scaling multiples. All four are hyperparameters. Rand is a random function.

The coordinates of the upper left corner and the lower right corner of the $i$-th image after scaling are $(A_i, B_i)$ and $(c_i, d_i)$, and these four unknowns are obtained by the following formulas:

$$
\begin{aligned}
A_i &= \begin{cases} 0, & i = 1, 2, 3, \\ w \times k_1, & i = 4, 5, 6, \\ w \times k_2, & i = 7, 8, 9, \end{cases} \\
B_i &= \begin{cases} 0, & i = 1, 4, 7, \\ h \times k_3, & i = 2, 5, 8, \\ h \times k_4, & i = 3, 6, 9, \end{cases} \\
c_i &= A_i + w \times k_w, \\
d_i &= B_i + h \times k_h.
\end{aligned}
\tag{3}
$$

Here, $k_1$ and $k_2$ are the ratios, relative to the total width, of the x-axis distances between the origin and the upper left corners of the two groups of images that do not start at the origin. Similarly, $k_3$ and $k_4$ are the ratios, relative to the total length, of the corresponding y-axis distances. In the figure, the vertical dotted lines mark the picture width scale, each interval being one-tenth of the picture width, and the horizontal dotted lines mark the picture length scale, each interval being one-tenth of the picture length. All nine photos have the same scale: their width and length are $k_w$ and $k_h$ times the original.
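The scaling and placement rules of equations (1) to (3) can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `mosaic9_boxes`, all default hyperparameter values, and the use of a uniform random draw for Rand are assumptions introduced here.

```python
import random


def mosaic9_boxes(w, h, k_w=0.4, k_h=0.4, dk_w=0.0, dk_h=0.0,
                  k1=0.3, k2=0.6, k3=0.3, k4=0.6, rng=None):
    """Return (k_x, k_y, boxes) for the nine-in-one layout of equations (1)-(3).

    All default values are hypothetical; the paper treats them as hyperparameters.
    """
    rng = rng or random.Random()

    # Equations (1) and (2): random scaling multiples along x and y.
    # Rand is assumed here to be a uniform draw over the given interval.
    k_x = rng.uniform(k_w, k_w + dk_w)
    k_y = rng.uniform(k_h, k_h + dk_h)

    # Upper left offsets: A_i depends on the group {1,2,3}, {4,5,6}, {7,8,9};
    # B_i depends on the group {1,4,7}, {2,5,8}, {3,6,9}.
    x_offsets = [0.0, w * k1, w * k2]
    y_offsets = [0.0, h * k3, h * k4]

    boxes = []
    for i in range(1, 10):
        a = x_offsets[(i - 1) // 3]
        b = y_offsets[(i - 1) % 3]
        # Equation (3) as printed uses k_w and k_h for the lower right corner.
        c = a + w * k_w
        d = b + h * k_h
        boxes.append((a, b, c, d))
    return k_x, k_y, boxes


k_x, k_y, boxes = mosaic9_boxes(416, 416, rng=random.Random(0))
for index, box in enumerate(boxes, start=1):
    print(index, box)
```

With other hyperparameter choices, the tiles in the last row and column can extend beyond the $w \times h$ canvas; that excess is what the subsequent cropping step removes.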
Figure 1: Flow chart of the overall algorithm structure. Training stage: data collection, new mosaic data enhancement, training of the target detection model, and TensorRT model acceleration; AlphaPose detection model training and detection model optimization. Running stage: camera real-time shooting, improved YOLOv4 algorithm for detection, result tensor data structure transformation, AlphaPose joint detection algorithm, ST-GCN combined with motion state analysis, and output of the outcome.

Figure 2: Improved mosaic data enhancement method (primary input and improved input).

Figure 3: Mosaic nine-in-one data enhancement flowchart (steps 1 to 3, showing images 1 to 9 placed on the w × h canvas).

In step 2, the 9 photos from the previous stage are flipped, color-gamut transformed, and stitched. The bounding box limits the size of the stitched picture, and the excess is cropped. There will be overlapping images. According to the schematic diagram of step 1 in Figure 3, the position of each small area needs to be reassigned, as shown in the following formula:

$$
\begin{aligned}
c'_i &= \begin{cases} c_i, & c_i < w, \\ w, & c_i \ge w, \end{cases} \\
d'_i &= \begin{cases} d_i, & d_i < h, \\ h, & d_i \ge h. \end{cases}
\end{aligned}
\tag{4}
$$

After the edge is cropped, eight dashed lines (as shown in step 2) are used to enclose four square areas, which serve as the random areas for segmentation; $k_1$, $k_2$, $k_3$, and $k_4$ are the ratios of the distances of the segmentation lines from the origin to the distance between the origin and the boundary. In the third stage, the inner overlapping part is cut for the second time, and the coordinate $S_i$ of each dividing line is obtained by the following formula:

$$S_i = \mathrm{Rand}(k_i,\ k_i + \Delta k_i), \quad i = 1, 2, 3, 4. \tag{5}$$

After cropping, the $m_9$ image stitching is completed. Since some content is lost in scaling and splicing, targets at the edges of the original images may be cropped. Hence, the ground-truth boxes of these targets need to be cropped accordingly to meet the needs of target detection.

3.2. Structure Optimization of Human Object Detection Algorithm. The original AlphaPose human target detection algorithm uses YOLOv3. However, YOLOv4, proposed in recent years, has significantly surpassed YOLOv3 in terms of detection accuracy and detection speed, and it can cope with more complex detection environments (such as complex light and occlusion). Because of its large amount of calculation, however, it is not suitable for migration to embedded devices. Therefore, the human target detection algorithm in this paper is improved on the basis of the YOLOv4 algorithm structure, which ensures high pedestrian detection accuracy and faster recognition of frames.

The specific structural improvements are as follows: (1) The structure of GhostNet [29] replaces the CSPDarknet53 backbone network in the YOLOv4 network structure, which simplifies the network while maintaining the accuracy. (2) The path aggregation network is converted into BiFPN (bidirectional feature pyramid network) [30] to shorten the path from low-level information to high-level information, and the residual structure of the feature pyramid network is built to integrate richer semantic features and preserve spatial information. (3) DSC (depthwise separable convolution) [31] replaces the standard convolution of the spatial pyramid pooling, BiFPN, and YOLO head networks, which greatly reduces the amount of computation and improves network performance. The improved YOLOv4 algorithm structure is shown in Figure 4.

3.2.1. Human Feature Extraction Based on GhostNet. Since the CSPDarknet53 structure in YOLOv4 requires a large amount of computation while efficiently extracting image features, this paper chooses the lightweight GhostNet network structure. The core idea of GhostNet is to use operations with lower computational cost to generate similar feature maps. Many feature maps within a network layer are similar to one another, and this redundancy may be an important part of the features. Hence, GhostNet preserves the redundant information while obtaining it at a lower computational cost.

The convolution block of GhostNet is the Ghost module, whose function is to replace ordinary convolution. It divides ordinary convolution into two parts. First, a 1 × 1 ordinary convolution is performed. For example, where a convolution with 32 channels would normally be used, GhostNet uses a 16-channel convolution. The function of this 1 × 1 convolution is similar to feature integration: it generates a condensed version of the input feature layer. Then, a depthwise convolution is performed, that is, a layer-by-layer (channel-wise) convolution applied to the condensed features from the previous step to generate the ghost feature maps.

The network structure combined with GhostNet is shown in Figure 4, in which GBN denotes GhostNetBottleNeck, a component of GhostNet. The GhostNetBottleNeck layer consists of two Ghost modules. The first expands the number of channels, and the second reduces the number of channels so that it matches the number of channels of the shortcut connection from the input. When the input is 416 × 416, the construction of GhostNet is shown in Table 1. When a picture is input into GhostNet, a 16-channel ordinary 3 × 3 convolution block (convolution + normalization + activation function) is applied first, as listed in Table 1. After that, the ghost bottlenecks are stacked. With the ghost bottlenecks, a 7 × 7 × 160 feature layer is finally obtained (when the input is 224 × 224 × 3). Then, a 1 × 1 convolution block adjusts the number of channels to give a 7 × 7 × 960 feature layer. After that, global average pooling is performed, and another 1 × 1 convolution block adjusts the number of channels to give a 1 × 1 × 1280 feature layer. Then, after flattening, the fully connected layer performs the classification.

The operation of generating $n$ feature maps in any convolutional layer can be expressed as follows:

$$Y_0 = X \ast f + b, \tag{6}$$

where $X \in \mathbb{R}^{c \times h \times w}$ is the input, $f \in \mathbb{R}^{c \times k \times k \times n}$ is the convolution kernel of this layer, $\ast$ represents the convolution operation, and $b$ is the bias term. The output feature map is then

$$Y_0 \in \mathbb{R}^{h' \times w' \times n}. \tag{7}$$

The required number of floating-point operations is $n \times h' \times w' \times c \times k \times k$. Assume that the Ghost module contains one identity mapping and $m \times (s - 1) = (n/s) \times (s - 1)$ linear transformation operations, each with a kernel of size $d \times d$. The theoretical speedup of replacing the ordinary convolution with the Ghost module is as follows:
Figure 4: Improved structure of YOLOv4 algorithm. GhostNet backbone: Input (416, 416, 3); Conv2D (208, 208, 16); GBN (208, 208, 16) × 2; GBN (104, 104, 24) × 2; GBN (52, 52, 40) × 2; GBN (26, 26, 112) × 6; GBN (13, 13, 160) × 4; followed by CBL × 3 and SPP (max pooling 5, 9, 13). BiFPN: Concat + CBL × 5, CBL × 1 + UpSampling, and DownSampling blocks. YOLO head: CBL × 1 + Conv. In CBL1, Conv = DSC Conv.

Table 1: GhostNet construction method.

Input         Operator        #exp   Out    SE   Stride
416² × 3      Conv2d 3 × 3    -      16     -    2
208² × 16     GBN             16     16     -    1
208² × 16     GBN             48     24     -    2
104² × 24     GBN             72     24     -    1
104² × 24     GBN             72     40     1    2
52² × 40      GBN             120    40     1    1
52² × 40      GBN             240    80     -    2
26² × 80      GBN             200    80     -    1
26² × 80      GBN             184    80     -    1
26² × 80      GBN             184    80     -    1
26² × 80      GBN             480    112    1    1
26² × 112     GBN             672    112    1    1
26² × 112     GBN             672    160    1    2
13² × 160     GBN             960    160    -    1
13² × 160     GBN             960    160    1    1
13² × 160     GBN             960    160    -    1
13² × 160     GBN             960    160    1    1
13² × 160     Conv2d 1 × 1    -      960    -    1
13² × 960     AvgPool 7 × 7   -      -      -    -
1² × 960      Conv2d 1 × 1    -      1280   -    1
1² × 1280     FC              -      1000   -    -

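$$
\begin{aligned}
r_s &= \frac{n \times h' \times w' \times c \times k \times k}{\frac{n}{s} \times h' \times w' \times c \times k \times k + (s - 1) \times \frac{n}{s} \times h' \times w' \times d \times d} \\
&= \frac{c \times k \times k}{\frac{1}{s} \times c \times k \times k + \frac{s - 1}{s} \times d \times d} \approx \frac{s \times c}{s + c - 1} \approx s.
\end{aligned}
\tag{8}
$$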
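As a numerical check of equation (8), the following Python sketch compares the operation count of an ordinary convolution with that of a Ghost module. It is a minimal sketch, not the authors' code: the function names are hypothetical, the layer sizes are illustrative values chosen here rather than measurements from the paper, and it assumes that $s$ divides $n$.

```python
def conv_flops(n, h_out, w_out, c, k):
    """Operation count of an ordinary convolution producing n feature maps."""
    return n * h_out * w_out * c * k * k


def ghost_flops(n, h_out, w_out, c, k, s, d):
    """Operation count of a Ghost module.

    A primary convolution produces n/s feature maps, and (s - 1) * n/s
    cheap linear operations with d x d kernels produce the remaining ones.
    """
    m = n // s  # assumes s divides n
    primary = m * h_out * w_out * c * k * k
    cheap = (s - 1) * m * h_out * w_out * d * d
    return primary + cheap


def speedup(n, h_out, w_out, c, k, s, d):
    """Theoretical speedup ratio r_s of equation (8)."""
    return conv_flops(n, h_out, w_out, c, k) / ghost_flops(n, h_out, w_out, c, k, s, d)


# Illustrative (hypothetical) layer: 112 input channels, 160 output maps, s = 2.
ratio = speedup(n=160, h_out=13, w_out=13, c=112, k=3, s=2, d=3)
approx = 2 * 112 / (2 + 112 - 1)  # s * c / (s + c - 1)
print(ratio, approx)
```

When $d = k$, the exact ratio coincides with $s \times c / (s + c - 1)$, and for $c$ much larger than $s$ it approaches $s$, which is why the speedup is described as approximately $s$.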
In equation (8), $d \times d$ and $k \times k$ are of similar magnitude. The theoretical parameter compression ratio is as follows:

$$r_c = \frac{n \times c \times k \times k}{\frac{n}{s} \times c \times k \times k + \frac{s - 1}{s} \times n \times d \times d} \approx \frac{s \times c}{s + c - 1} \approx s. \tag{9}$$

The theoretical parameter compression ratio of replacing ordinary convolution with the Ghost module is approximately equal to the theoretical speedup ratio.

3.2.2. Improving PANet with Reference to BiFPN. BiFPN (bidirectional feature pyramid network) was first proposed in the EfficientDet paper [31], whose authors stated that its purpose was to pursue a more efficient multiscale fusion method.

YOLOv4's original PANet adds a bottom-up channel to FPN, and its CNN backbone provides a long path from the bottom to the top through more than 100 layers. In BiFPN, the input nodes and output nodes of the same layer can be connected across layers so that more features are incorporated without adding much cost. This algorithm performs cross-layer connections on the same level of PANet (the three orange lines in Figure 4). In this way, the path from low-level information to high-level information is shortened, and their semantic features are combined. In BiFPN, adjacent layers can also be merged in series. In this paper, the adjacent layers of PANet are merged in series (the two blue lines in Figure 4).

The improved PANet has the characteristics of bidirectional cross-scale connection and weighted feature fusion, which improves the feature fusion ability and further increases the feature extraction ability.

3.2.3. DSC Replaces Standard Convolution. In the algorithm of this paper, the 1 × 1 standard convolutional network in the CBL1 module of the YOLOv4 head is replaced with DSC (depthwise separable convolution), which further reduces the network computing cost in practical applications. The modified part of CBL1 is shown in Figure 5. The standard convolutional network calculation uses a weight matrix to realize the joint mapping of spatial dimension features and channel dimension features, at the cost of high computational complexity, high memory overhead, and many weight coefficients.

Figure 5: The modified part of CBL1 on the head of small-YOLOv4 (original and improved parts of the YOLOv4 head; in the improved part, the Conv in CBL1 is replaced with DSC).

DSC divides the traditional convolution operation into two steps. Assuming that the original convolution is 3 × 3, DSC first convolves the $M$ feature maps one-to-one with $M$ 3 × 3 convolution kernels, which directly generates $M$ results without summing. Then, the $M$ results are convolved in the normal way with $N$ 1 × 1 convolution kernels and summed, which finally generates $N$ results. Therefore, the literature [17] divides DSC into two steps, as shown in Figure 6. One step is called depthwise convolution, which is (b) in Figure 6, and the other step is pointwise convolution, which is (c) in Figure 6.

Assume that the size of the input feature map is $D_F \times D_F$ with $M$ channels, that the size of the filter is $D_k \times D_k$ with $N$ output channels, that the padding is 1, and that the stride is 1. The original convolution operation then requires $D_k \times D_k \times M \times N \times D_F \times D_F$ operations, and the convolution kernel has $D_k \times D_k \times M \times N$ parameters. DSC requires $D_k \times D_k \times M \times D_F \times D_F + M \times N \times D_F \times D_F$ operations, and its convolution kernels have $D_k \times D_k \times M + N \times M$ parameters. Since the convolution process is mainly a process of reducing the spatial dimension and increasing the channel dimension, namely $N > M$, the number of convolution kernel parameters of standard convolution is larger than that of DSC. The ratio of the DSC parameter quantity to the standard convolution parameter quantity is $(D_k \times D_k \times M + N \times M)/(D_k \times D_k \times M \times N) = 1/N + 1/D_k^2$. For a convolution kernel of size 3 × 3, the $1/D_k^2$ term is 1/9, so when $N$ is large the computation is reduced to about 11.1% of the standard convolution.

3.3. Structure Optimization of Human Object Detection Algorithm. Commonly used model compression methods include network pruning, knowledge distillation, and model quantization. Since the network structure used in this paper has already been replaced by the lightweight GhostNet network, further pruning would very likely destroy the integrity of the model and have a greater impact on the accuracy. Therefore, this paper uses model quantization to further reduce the number of parameters and the model size.

Quantization methods are divided into quantization-aware training and post-training quantization. Post-training quantization is further divided into hybrid quantization, 8-bit integer quantization, and half-precision floating-point quantization. Post-training quantization directly quantizes the model after ordinary training. The process is simple, and there is no need to consider quantization during the training process.
Figure 6: Structure diagram of DSC. (a) Standard convolution filters. (b) Depthwise convolution filters. (c) Depthwise separable convolution.
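The operation and parameter counts given for standard convolution and DSC in Section 3.2.3 can be compared numerically with a short Python sketch. The function names and the layer sizes below are hypothetical values chosen only for illustration; they are not taken from the paper's network.

```python
def standard_conv_cost(d_k, m, n, d_f):
    """Operations and parameters of a standard D_k x D_k convolution
    with M input channels, N output channels, and a D_F x D_F feature map."""
    ops = d_k * d_k * m * n * d_f * d_f
    params = d_k * d_k * m * n
    return ops, params


def dsc_cost(d_k, m, n, d_f):
    """Operations and parameters of DSC: a depthwise D_k x D_k convolution
    per input channel followed by a pointwise 1 x 1 convolution."""
    ops = d_k * d_k * m * d_f * d_f + m * n * d_f * d_f
    params = d_k * d_k * m + n * m
    return ops, params


# Illustrative (hypothetical) layer: 3 x 3 kernel, M = 256, N = 512, 26 x 26 map.
std_ops, std_params = standard_conv_cost(3, 256, 512, 26)
dsc_ops, dsc_params = dsc_cost(3, 256, 512, 26)

ops_ratio = dsc_ops / std_ops          # equals 1/N + 1/D_k**2
params_ratio = dsc_params / std_params  # same ratio for the parameters
print(f"ops ratio: {ops_ratio:.4f}, params ratio: {params_ratio:.4f}")
```

For this example both ratios are about 11.3%, and they approach 1/9 (about 11.1%) as $N$ grows, which matches the figure quoted for a 3 × 3 kernel.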

However, some accuracy is lost for a model with a large amount of parameter redundancy.

This paper uses the TensorRT acceleration engine to convert the model weight file into an INT8 trt file with the post-training quantization method and performs overall optimization through a series of operations, such as tensor fusion, kernel adjustment, and multistream execution. Figure 7 is a schematic diagram of the overall TensorRT optimization.

3.4. Structure Optimization of Human Object Detection Algorithm. After the detection result is obtained through the target detection algorithm, it is converted into a two-dimensional tensor data structure, whose specific form is shown in equation (10):

$$T_d = [[x_1, y_1, w_1, h_1, c_1], [x_2, y_2, w_2, h_2, c_2], \ldots, [x_i, y_i, w_i, h_i, c_i]]. \tag{10}$$


Figure 7: Schematic diagram of TensorRT optimization (trained model; accuracy calibration, tensor fusion, kernel autotuning, dynamic tensor fusion, and multistream execution; optimized and accelerated trt model file).

where $[x_i, y_i, w_i, h_i, c_i]$ represents the structured data of the $i$th pedestrian, and $(x_i, y_i)$ represents the upper left corner of the target prediction box.

The original image $T_m$ is transformed into floating-point 32-bit tensor data $T_t$. Formula (11) represents a normalization operation on $T_t$, where $T_t[0]$ is the R channel data, $T_t[1]$ is the G channel data, and $T_t[2]$ is the B channel data.

$$\begin{cases} T_t[0] \mathrel{+}= -0.416, \\ T_t[1] \mathrel{+}= -0.461, \\ T_t[2] \mathrel{+}= -0.479. \end{cases} \tag{11}$$

According to $T_d$, the human body area images are cut out from the original image and arranged in descending order of confidence to obtain a serialized image list, which realizes the serialization of human body images and improves the data interaction efficiency between the target detection model and the human joint point detection model.

3.5. Optimization of Algorithm Model for Pose Joint Point Detection. The original AlphaPose algorithm uses a Fast_ResNet50-based network, and the optimization method is shown in Figure 8.

First, the dimensions of a dummy network input layer are initialized for the pose joint point detection model: the dummy input is set to a tensor of shape $(1, 3, H_{dummy}, W_{dummy})$, where 1 means that the batch size is 1, 3 is the number of image channels, and $W_{dummy}$, $H_{dummy}$ indicate the normalized size of the network input image. In this paper, $W_{dummy} = 160$ and $H_{dummy} = 224$. Next, the input and output network layers of the dimensionally initialized model are customized: the input layer is set to input, and the output layer is set to output. Then, a calculation graph of the detection model is created, and the input dimension of the calculation graph is set to $(1, 3, W_d, H_d)$, where 1 means that the batch size is 1, 3 is the number of image channels, and $W_d$, $H_d$ indicate the normalized size of the network input image. $W_d = 160$ and $H_d = 224$ in this paper. Finally, the model conversion optimizer is loaded to generate the optimized pose joint detection model AlphaPose-trt.

Figure 8: Optimization steps of AlphaPose detection model. (Boxes shown: pose detection model; dummy network layer dimension initialization; input and output network layer customization and calculation graph creation; onnx node creation; model input dimension set to (1, 3, H, W); model transformation optimizer creation; generation of an optimized model for pose joint detection.)

3.6. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. The spatiotemporal graph convolutional network ST-GCN [32] takes the coordinates of the human skeleton key points output by the AlphaPose algorithm as model input and constructs a spatiotemporal graph with joint points as graph nodes and with the natural connections of the human skeleton and the temporal relationship of the same joints as graph edges, so that information is integrated in the temporal and spatial domains.

The spatiotemporal graph convolutional neural network is divided into spatial graph convolution and temporal graph convolution. Spatial graph convolution is constructed within each frame according to the natural connectivity of human joint points, and the spatial graph can be recorded as $G_S = (V_S, E_S)$, where $V_S = \{v_{ti} \mid i = 1, 2, \ldots, N_S\}$ represents all the joint points in a skeleton, and $E_S = \{v_{ti} v_{tj} \mid (i, j) \in H\}$ represents the connections between the joint points. Each node is described by a feature vector $F(v_i)$, from which the spatial graph convolution obtains the spatial features. Temporal graph convolution connects the same nodes in consecutive multiframe images on the spatial graph to form the spatial-temporal graph of the skeleton sequence, denoted as $G_T = (V_T, E_T)$. $V_T = \{v_{ti} \mid t = 1, 2, \ldots, N_t\}$ represents the joint point sequence of the same part, and $E_T = \{v_{ti} v_{(t+1)i}\}$ represents the connections between them, as shown in Figure 9.

The spatiotemporal graph convolution algorithm draws on motion analysis research to divide the spatial graph into three subsets, which represent the features of centripetal motion, centrifugal motion, and rest, respectively. The root node is the selected skeleton joint point itself and carries the static features. The neighbor nodes closer to the center of gravity of the skeleton than the root node carry the centripetal motion features. The neighbor nodes farther from the center of gravity of the skeleton than the root node carry the centrifugal motion features. The convolution results of the three subsets express action features at different scales, respectively.

The spatiotemporal graph convolutional neural network model takes the joint coordinate vectors of the graph nodes as input and extracts deeper features through the 9-layer ST-GCN convolution module. The feature dimension of each node is 256, and the key frame dimension is 38. Then, the obtained tensors are globally pooled, and backpropagation is used to train the model end-to-end. Finally, the SoftMax classifier obtains the probability of each action category and outputs the action with the highest probability. Each ST-GCN layer adopts the ResNet structure to enhance gradient propagation and adds a dropout strategy to reduce overfitting. The overall flow of the model is shown in Figure 10.

4. Experiments and Analysis

4.1. Dataset Analysis. The datasets used for training in this experiment mainly include the 20-category VOC2007 and VOC2012 datasets and 10,000 images of people that the authors randomly collected. A program filters VOC2012 and VOC2007 so that only the label information of the person category is retained. The 10,000 images of people collected by the authors are divided into the training set, validation set, and test set according to the ratio of 6 : 2 : 2. The final number of images is shown in Table 2.

4.2. Anchor Box. To better suit the person category, the prior boxes in the improved target detection algorithm in this paper are obtained by K-means clustering on the dataset. The input image size in this paper is 416 × 416, and the clustering runs for 73 iterations. The intersection over union between the boxes and the prior boxes reaches 78.91%, and nine a priori boxes are obtained, as shown in Table 3.

4.3. Training and Operation Environment. The model training platform in our laboratory uses an RTX 3090 GPU with 24 GB of video memory. The specific parameters are shown in Table 4. The network model is trained on the deep learning framework of Tensorflow2.5 based on GhostNet and CSPDarknet53. All input images are of size 416 × 416. The subsequent verification and testing platform of the experiment is the Jetson Nano.

4.4. Evaluation Criteria. We use FPS, precision, mAP, accuracy, F-score, sensitivity, specificity, and other indicators to evaluate our proposed method. The test set is divided into two categories: positive samples and negative samples. TP is the number of positive samples predicted as positive samples. FP is the number of negative samples predicted as positive samples. FN is the number of positive samples predicted as negative samples. TN is the number of negative samples predicted as negative samples.

4.4.1. FPS (Frames per Second). The evaluation standard of detection speed used in this paper is FPS, which refers to the number of frames processed per second. The larger the FPS value, the smoother the displayed image and the better the real-time requirements of human body detection are met.

4.4.2. mAP (Mean Average Precision). The definition of the mAP is shown in equation (12), which represents the average value of the average precision $AP_i$ of the $n$ types of targets, and $n = 1$ in this experiment.

$$\mathrm{mAP} = \frac{\sum AP}{N(\mathrm{Class})} = \sum AP. \tag{12}$$

4.4.3. Accuracy. Accuracy is a commonly used evaluation index. Generally speaking, the higher the accuracy rate, the better the classifier.

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}. \tag{13}$$

4.4.4. Precision. Precision measures the accuracy of object detection and is defined as shown in equation (14) below.

$$\mathrm{precision} = \frac{TP}{TP + FP}. \tag{14}$$
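As a minimal sketch of how the indicators in this section follow from the four confusion counts, the Python function below evaluates equations (13) and (14) together with the F-score, sensitivity, and specificity defined in Sections 4.4.5 to 4.4.7. The function name `fall_metrics` and the example counts are hypothetical and are not taken from the experiments in this paper.

```python
def fall_metrics(tp, fp, fn, tn):
    """Return the confusion-matrix indicators as fractions in [0, 1]."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,           # equation (13)
        "precision": tp / (tp + fp),             # equation (14)
        "f_score": 2 * tp / (2 * tp + fp + fn),  # equation (15)
        "sensitivity": tp / (tp + fn),           # equation (16), equal to recall
        "specificity": tn / (tn + fp),           # equation (17)
    }


# Hypothetical counts for 200 test clips (100 falls, 100 non-falls).
metrics = fall_metrics(tp=95, fp=4, fn=5, tn=96)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The sketch does not guard against empty classes; a denominator of zero (for example, no positive predictions) would need separate handling in practice.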

Figure 9: Construction of the spatio-temporal map of human joint points. (a) Bone space map sequence of N frames. (b) Skeleton space time-diagram (arrows indicate time series edges).
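The graph in Figure 9 can be sketched in a few lines of Python. The example below builds the spatial edges inside each frame from a list of naturally connected joint pairs, builds the temporal edges that link the same joint in consecutive frames, and splits the neighbors of a root joint into the three subsets (root, centripetal, centrifugal) described in Section 3.6. The five-joint toy skeleton `TOY_BONES`, the distances in `TOY_DIST`, and the function names are hypothetical; the actual joint layout output by AlphaPose is not reproduced here, and grouping equally distant neighbors with the root is an assumption of this sketch.

```python
# Toy skeleton with 5 joints: 0 = neck, 1/2 = shoulders, 3/4 = hips (hypothetical layout).
TOY_BONES = [(0, 1), (0, 2), (0, 3), (0, 4)]
# Hypothetical distance of each joint to the skeleton's center of gravity.
TOY_DIST = [0, 1, 1, 1, 1]


def build_spatiotemporal_edges(num_frames, num_joints, bones):
    """Return (spatial_edges, temporal_edges); a node is a (frame, joint) pair."""
    spatial_edges = []
    temporal_edges = []
    for t in range(num_frames):
        # Spatial edges: natural connections between joints inside frame t.
        for i, j in bones:
            spatial_edges.append(((t, i), (t, j)))
        # Temporal edges: the same joint in frame t and frame t + 1.
        if t + 1 < num_frames:
            for i in range(num_joints):
                temporal_edges.append(((t, i), (t + 1, i)))
    return spatial_edges, temporal_edges


def partition_neighbors(root, bones, dist_to_center):
    """Split the root joint and its neighbors into the three subsets."""
    subsets = {"root": [root], "centripetal": [], "centrifugal": []}
    for i, j in bones:
        if root not in (i, j):
            continue
        neighbor = j if i == root else i
        if dist_to_center[neighbor] < dist_to_center[root]:
            subsets["centripetal"].append(neighbor)   # closer to the center
        elif dist_to_center[neighbor] > dist_to_center[root]:
            subsets["centrifugal"].append(neighbor)   # farther from the center
        else:
            subsets["root"].append(neighbor)          # assumption: ties join the root subset
    return subsets


spatial, temporal = build_spatiotemporal_edges(num_frames=3, num_joints=5, bones=TOY_BONES)
print(len(spatial), len(temporal))
print(partition_neighbors(1, TOY_BONES, TOY_DIST))
```

For three frames of the toy skeleton this yields 12 spatial edges (4 bones per frame) and 10 temporal edges (5 joints across 2 frame transitions).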

Figure 10: The overall framework of ST-GCN. (Components shown: input video; pose joint detection; ST-GCN with BN, stacked ATM, GCN, and TCN units, Pool, and FC; output action scores, for example Falling Down. ATM: attention mechanism; GCN: Graph Convolutional Networks; TCN: Temporal Convolutional Networks.)
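The last stage in Figure 10 turns the pooled features into action scores. As a hedged illustration only, the sketch below applies global average pooling over frames and joints, a fully connected layer, and SoftMax, and then returns the most probable action. The label set `ACTIONS`, the tiny feature tensor, and the weights are made up for the example; they are not the trained parameters or the class list used in this paper.

```python
import math

ACTIONS = ["Standing", "Walking", "Falling Down"]  # hypothetical label set


def global_average_pool(features):
    """Average a [frames][joints][channels] nested list over frames and joints."""
    channels = len(features[0][0])
    count = len(features) * len(features[0])
    return [
        sum(frame[j][c] for frame in features for j in range(len(frame))) / count
        for c in range(channels)
    ]


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(features, weights, bias):
    pooled = global_average_pool(features)
    # Fully connected layer: one row of weights per action class.
    scores = [sum(w * x for w, x in zip(row, pooled)) + b for row, b in zip(weights, bias)]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ACTIONS[best], probs


# Two frames, two joints, two channels of made-up features.
features = [[[1.0, 0.0], [3.0, 2.0]], [[2.0, 1.0], [2.0, 1.0]]]
weights = [[0.1, 0.0], [0.0, 0.5], [1.0, 1.0]]
bias = [0.0, 0.0, -1.0]
label, probs = classify(features, weights, bias)
print(label, [round(p, 3) for p in probs])
```

With these made-up numbers the pooled feature vector is [2.0, 1.0], the class scores are 0.2, 0.5, and 2.0, and the returned label is the class with the largest SoftMax probability.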

Table 2: The number of various data sets.
Types of data sets    Total number
Training set          14265
Test set              4765
Validation set        4755

4.4.5. F-Score. The F-score combines the precision and recall results. Its value ranges from 0 to 1, where 1 represents the best output of the model, and it is defined as shown in equation (15) below.

$$F\text{-score} = \frac{2TP}{2TP + FP + FN}. \tag{15}$$
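To make explicit how equation (15) combines precision and recall, the short derivation below starts from the usual harmonic-mean form of the F-score and reduces it to the count-based form. Here $P$ denotes precision (equation (14)) and $R$ denotes recall, which Section 4.4.6 equates with sensitivity; the symbols $P$ and $R$ are shorthand introduced only for this derivation.

```latex
\begin{aligned}
P &= \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \\
F\text{-score} &= \frac{2PR}{P + R}
  = \frac{2\,TP^2}{TP\,(TP + FN) + TP\,(TP + FP)}
  = \frac{2\,TP}{2\,TP + FP + FN}.
\end{aligned}
```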
Table 3: A priori frame size.
Size       Anchor box
13 × 13    (234, 280), (260, 379), (377, 354)
26 × 26    (107, 270), (135, 351), (171, 190)
52 × 52    (27, 44), (60, 121), (75, 230)

Table 4: Software and hardware configuration.
Component                   Configuration
Operating system            Ubuntu 18.04
Memory                      64
GPU                         Nvidia GeForce RTX 3090
GPU acceleration library    CUDA 11.2, cuDNN v8.2.1
Deep learning framework     Tensorflow2.5
Programming language        Python3.9

4.4.6. Sensitivity. Sensitivity represents the predictive ability for positive examples (the higher, the better). It is numerically equal to the recall rate and is defined as shown in equation (16) below.

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}. \tag{16}$$

4.4.7. Specificity. Specificity represents the predictive ability for negative examples (the higher, the better), and the specific definition is shown in equation (17) below.

Table 5: The influence of the mosaic data enhancement method on target recognition accuracy under different proportions.
Algorithm type             m1 : m4 : m9    Dim light mAP (%)    Chaotic environment mAP (%)    Human body occlusion mAP (%)
Algorithm in this paper    1 : 0 : 0       72.28                71.54                          75.55
                           0 : 1 : 0       60.11                65.58                          66.18
                           0 : 0 : 1       41.74                50.69                          50.10
                           1 : 1 : 0       72.45                71.99                          75.10
                           1 : 0 : 1       70.71                70.24                          73.56
                           1 : 1 : 1       72.19                71.58                          76.01
                           2 : 2 : 1       76.28                76.68                          78.10
                           2 : 1 : 1       72.27                70.44                          73.56
                           3 : 2 : 1       72.78                72.57                          76.15
                           4 : 2 : 1       73.35                72.16                          74.20
                           4 : 3 : 2       72.80                72.01                          75.98
                           5 : 3 : 2       74.01                74.48                          77.52
YOLOv4                     1 : 1 : 0       75.90                75.56                          76.14

$$\mathrm{Specificity} = \frac{TN}{TN + FP}. \tag{17}$$

4.5. Evaluation Criteria

4.5.1. A Novel Mosaic Data Augmentation Method. The new Mosaic data enhancement method in this paper is used to enhance the dataset, and the image input ratio of the three paths m1, m4, and m9 in Figure 2 that maximizes the accuracy of identifying complex situations needs to be determined. Table 5 shows the influence of different input ratios of m1, m4, and m9 on the accuracy of human recognition in three complex situations in the dataset: dim light, chaotic environment, and human occlusion. It can be seen from this table that the effect of data enhancement is most obvious when the ratio of m1, m4, and m9 is 2 : 2 : 1.

4.5.2. Target Detection Algorithm Network Improvement Effectiveness. To verify the impact of the improvements to the target detection algorithm on the performance of the YOLOv4 model, ablation experiments on the above three improved methods were run on the Jetson Nano for a more adequate comparison, thus demonstrating the necessity and effectiveness of the proposed method. In Table 6, “+” indicates that the improved method is used in the experiment, “−” indicates that the method is not used, and the test indicators refer to the detection of the human body in the test set of this paper. As can be seen from Table 6, after replacing the backbone network with GhostNet, the mAP value for the person category is slightly reduced (from 87.27% to 86.38%), but the running frame rate is significantly improved (from 2.35 to 6.32 FPS). After the introduction of BiFPN, the running frame rate is basically unchanged (2.23 FPS), while the mAP value improves (to 87.40%). Using the depthwise separable convolution to replace the ordinary convolution in the original YOLOv4 head improves the running frame rate (to 3.12 FPS) while the mAP value is slightly reduced (to 86.51%). Compared with YOLOv4, the improved network structure has a slight decrease in the mAP value for the detection of Person (from 87.27% to 86.81%); at the same time, the running frame rate is significantly improved (from 2.35 to 8.02 FPS), which meets the basic requirement for running on embedded devices. Finally, we chose the TensorRt framework for acceleration. After using the TensorRt framework, the running frame rate is greatly improved (from 8.02 to 24.33 FPS), while the mAP value remains unchanged at 86.81%.

4.5.3. Comparison of Optimization Effectiveness of AlphaPose Algorithm Model. To verify the effectiveness of the AlphaPose algorithm model optimization method in this paper, we compare three models: OpenPose, AlphaPose, and AlphaPose-trt. The mAP value here is the human detection result on the test set of this paper. The results of running on the Jetson Nano are shown in Table 7. It can be seen from Table 7 that both the frame rate and the mAP value of OpenPose (3.66 FPS, 71.11%) are lower than those of AlphaPose (7.72 FPS, 82.12%). Compared with the original model (AlphaPose), the optimized model (AlphaPose-trt) keeps a stable mAP value (82.05%) and greatly improves the running frame rate (13.13 FPS).

4.5.4. Comparison of Effectiveness of Fall Detection Algorithms. To further demonstrate the overall advantages of the algorithm in this paper in detection accuracy and running frame rate, we compare it with other computer vision algorithms of the same type. However, many of the more popular algorithms are not open source, so they cannot be migrated to the Jetson Nano and run there. Hence, no accurate running frame rate is available for the selected comparison algorithms. After analyzing the structure of these algorithms, however, it can be concluded that they are computationally complex, require a large number of calculations, and do not have the ability to migrate to embedded devices. The final results are shown in Table 8, which reports the evaluation data for human fall detection on the Le2i fall and UR fall datasets, respectively. Compared with this paper, [33] achieves a better F1-score because it employs a two-pass ensemble of two classifiers, random forest (RF) and multilayer perceptron (MLP), to identify falls; however, this leads to more

Table 6: Ablation study on the people dataset.
Method        GhostNet    Bi-FPN    DSC    TensorRt    mAP (%)    FPS
YOLOv4        −           −         −      −           87.27      2.35
              +           −         −      −           86.38      6.32
              −           +         −      −           87.40      2.23
              −           −         +      −           86.51      3.12
              +           +         −      −           87.08      6.13
              +           +         +      −           86.81      8.02
Our method    +           +         +      +           86.81      24.33
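The relative changes behind the ablation in Table 6 can be checked with simple arithmetic. The short Python sketch below recomputes the mAP change and the frame-rate speedup of each configuration relative to the YOLOv4 baseline, using only the numbers listed in Table 6; the dictionary keys are shorthand labels chosen for this example.

```python
# (mAP %, FPS) values copied from Table 6.
TABLE6 = {
    "YOLOv4": (87.27, 2.35),
    "GhostNet": (86.38, 6.32),
    "Bi-FPN": (87.40, 2.23),
    "DSC": (86.51, 3.12),
    "GhostNet+Bi-FPN": (87.08, 6.13),
    "GhostNet+Bi-FPN+DSC": (86.81, 8.02),
    "GhostNet+Bi-FPN+DSC+TensorRt": (86.81, 24.33),
}


def relative_to_baseline(table, baseline="YOLOv4"):
    """Return {name: (mAP change in points, FPS speedup factor)}."""
    base_map, base_fps = table[baseline]
    return {
        name: (round(m - base_map, 2), round(fps / base_fps, 2))
        for name, (m, fps) in table.items()
    }


for name, (delta_map, speedup) in relative_to_baseline(TABLE6).items():
    print(f"{name}: mAP change {delta_map:+.2f} points, speedup x{speedup:.2f}")
```

With these numbers, the full configuration runs about 10.35 times faster than the YOLOv4 baseline at a cost of 0.46 mAP points.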

Table 7: Performance comparison of pose estimation models.
Pose estimation model    Frame rate (FPS)    mAP (%)    Resolution of the image
OpenPose                 3.66                71.11      416 × 416
AlphaPose                7.72                82.12      416 × 416
AlphaPose-trt            13.13               82.05      416 × 416

Table 8: Comparison of different fall detection algorithms.
Algorithm type        Dataset      Accuracy (%)    Precision (%)    Sensitivity (%)    Specificity (%)    F-score    FPS
Wang et al. [33]      Le2i fall    96.91           97.65            96.51              97.37              97.08      -
Chamle et al. [35]    Le2i fall    79.31           79.41            83.47              73.07              81.39      -
Our method            Le2i fall    96.86           97.01            96.71              96.81              96.77      8.33
Wang et al. [33]      UR fall      97.33           97.78            97.78              96.67              97.78      -
Harrou et al. [34]    UR fall      96.66           94               100                94.93              96.91      -
Our method            UR fall      97.28           97.15            97.43              97.30              97.29      8.33

Figure 11: Effect diagram of experimental results.



computational complexity. It may also take more time to go from the classifier outputs to the ensemble result, which leads to poor real-time performance and transferability of the detection method. In contrast, the F1-score of the algorithm in this paper is only slightly lower than that of [33], while its real-time performance and transferability are excellent. Compared with the methods of [34, 35] on the same datasets, the algorithm in this paper also has advantages in transferability and real-time performance, and it achieves a better balance between the two indicators of sensitivity and specificity. The results on the two validation datasets are similar, which further demonstrates the stability of the algorithm in this paper. Figure 11 shows the detection results of the fall detection algorithm in this paper.

5. Conclusions and Future Work

This paper mainly studies a fall detection method based on computer vision technology. The method combines YOLO, AlphaPose, and ST-GCN. Through YOLO and AlphaPose, the key points and position information of the human body are obtained, and the recognition result is then output through the spatiotemporal graph convolutional network. ST-GCN takes the output coordinates of the key points of the human skeleton as model input and constructs a spatiotemporal graph with joint points as graph nodes and with the natural connections of the human skeleton and the temporal relationship of the same joints as graph edges, so that information is integrated in the temporal and spatial domains.

The experimental results show that the method is transferable. In this paper, the improvement and optimization of the YOLOv4 algorithm and the effectiveness of the detection model optimization of AlphaPose are verified by tests on VOC07 + 12 and the self-made dataset. In addition, a comparison with fall detection algorithms that have been popular in recent years on the UR Fall dataset shows that the algorithm in this paper achieves a high running frame rate while its detection accuracy is not much different from that of the other algorithms, and that it has better portability and adaptability on embedded devices.

In the future, we will focus more on complex fall detection and multiperson detection, such as outdoor fall detection and crowd trampling. At the same time, given the high applicability of embedded devices, we will integrate the algorithms into real-life applications, such as fall detection and monitoring systems. Many details of the running performance of the algorithm in this paper still need to be improved, and we will continue to work on them.

Data Availability

The datasets used to support the findings of this study are available from the authors upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work received support from the Industry University Research Innovation Fund of Science and Technology Development Center of the Ministry of Education (No. 2021JQR004), Public Welfare Projects in Zhejiang Province (No. LGF20F030002), Project of Hangzhou Science and Technology Bureau (No. 20201203B96), 2021 National Innovation Training Project for College Students (No. 202113021008), The Ministry of Education Industry-University Cooperation Collaborative Education Project (202102019039), and Zhejiang University City College Scientific Research Cultivation Fund Project (J-202223). It is also supported by the Zhejiang University Student Science and Technology Innovation Activity Plan (Xinmiao Talent Plan), project number: 2021R437010.

References

[1] J. Gutiérrez, V. Rodríguez, and S. Martin, “Comprehensive review of vision-based fall detection systems,” Sensors, vol. 21, no. 3, p. 947, 2021.
[2] M. Mubashir, L. Shao, and L. Seed, “A survey on fall detection: principles and approaches,” Neurocomputing, vol. 100, pp. 144–152, 2013.
[3] P. Pierleoni, A. Belli, L. Palma, M. Pellegrini, L. Pernini, and S. Valenti, “A high reliability wearable device for elderly fall detection,” IEEE Sensors Journal, vol. 15, no. 8, pp. 4544–4553, 2015.
[4] M. Alwan, P. J. Rajendran, and S. Kell, “A smart and passive floor-vibration based fall detector for elderly,” in Proceedings of the Information & Communication Technologies, Ictta. IEEE, Berkeley, CA, USA, May 2006.
[5] Y. Zigel, D. Litvak, and I. Gannot, “A method for automatic fall detection of elderly people using floor vibrations and sound-proof of concept on human mimicking doll falls,” IEEE Transactions on Biomedical Engineering, vol. 56, no. 12, pp. 2858–2867, 2009.
[6] S. M. Khan, M. Yu, P. Feng, L. Wang, and J. Chambers, “An unsupervised acoustic fall detection system using source separation for sound interference suppression,” Signal Processing: The Official Publication of the European Association for Signal Processing (EURASIP), vol. 110, 2015.
[7] J.-S. Lee and H.-H. Tseng, “Development of an enhanced threshold-based fall detection system using smartphones with built-in accelerometers,” IEEE Sensors Journal, vol. 19, no. 18, pp. 8293–8302, 2019.
[8] X. Xi, W. Jiang, and L. ü Zhong, “Daily activity monitoring and fall detection based on surface electromyography and plantar pressure,” Complexity, vol. 2020, Article ID 9532067, 12 pages, 2020.
[9] O. Kerdjidj, N. Ramzan, and K. Ghanem, “Fall detection and human activity classification using wearable sensors and compressed sensing,” Journal of Ambient Intelligence and Humanized Computing, vol. 11, 2019.
[10] S. Angela and J. José Vargas-Bonilla, “Real-life/real-time elderly fall detection with a triaxial accelerometer,” Sensors, vol. 18, no. 4, p. 1101, 2018.
[11] S. G. Miaou, P. H. Sung, and C. Y. Huang, “A customized human fall detection system using omni-camera images and personal information,” in Proceedings of the Conference on Distributed Diagnosis & Home Healthcare, IEEE, Arlington, Virginia, April 2006.

[12] F. Merrouche and N. Baha, “Depth camera based fall detection using human shape and movement,” in Proceedings of the IEEE International Conference on Signal & Image Processing, IEEE, Beijing, China, September 2017.
[13] K. H. Chen, Y. W. Hsu, and J. J. Yang, “Enhanced characterization of an accelerometer-based fall detection algorithm using a repository,” Instrumentation Science & Technology: Designs and applications for chemistry, biotechnology, and environmental science, vol. 45, 2017.
[14] W. N. Lie, A. T. Le, and G. H. Lin, “Human fall-down event detection based on 2D skeletons and deep learning approach,” in Proceedings of the 2018 International Workshop on Advanced Image Technology (IWAIT), IEEE, Chiang Mai, Thailand, January 2018.
[15] A. Lotfi, S. Albawendi, H. Powell, K. Appiah, and C. Langensiepen, “Supporting independent living for older adults; employing a visual based fall detection through analysing the motion and shape of the human body,” IEEE Access, vol. 6, pp. 70272–70282, 2018.
[16] H. Tian, Z. Duan, and A. Abraham, “A novel multiplex cascade classifier for pedestrian detection,” Pattern Recognition Letters, vol. 34, no. 14, pp. 1687–1693, 2013.
[17] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), pp. 886–893, IEEE, San Diego, CA, USA, June 2005.
[18] J. Dai, Y. Li, and K. He, “Object detection via region-based fully convolutional networks,” Advances in Neural Information Processing System, vol. 29, 2016.
[19] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: unified, real-time object detection,” in Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788, IEEE, Las Vegas, NV, USA, June 2016.
[20] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271, Honolulu, HI, USA, July 2017.
[21] J. Redmon and A. Farhadi, “Yolov3: an incremental improvement,” 2018, https://arxiv.org/abs/1804.02767.
[22] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, “RMPE: regional multi-person pose estimation,” in Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2353–2362, Venice, Italy, October 2017.
[23] Z. Cao, T. Simon, S. Wei, and Y. Sheikh, “Realtime multi-person 2D pose estimation using Part Affinity fields,” in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1302–1310, Honolulu, HI, USA, July 2017.
[24] H. Li, A. Shrestha, and H. Heidari, “Bi-LSTM network for multimodal continuous human activity recognition and fall detection,” IEEE Sensors Journal, vol. 20, no. 3, pp. 1191–1201, 2019.
[25] C. Ma, A. Shimada, H. Uchiyama, H. Nagahara, and R.-i. Taniguchi, “Fall detection using optical level anonymous image sensing system,” Optics & Laser Technology, vol. 110, pp. 44–61, 2019.
[26] K. De Miguel, A. Brunete, M. Hernando, and E. Gambao, “Home camera-based fall detection system for the elderly,” Sensors, vol. 17, no. 12, p. 2864, 2017.
[27] A. Núñez-Marcos, G. Azkune, and I. Arganda-Carreras, “Vision-based fall detection with convolutional neural networks,” Wireless Communications and Mobile Computing, vol. 2017, no. 1, pp. 1–16, 2017.
[28] H. Gammulle, S. Denman, and S. Sridharan, “Two stream LSTM: a deep fusion framework for human action recognition,” in Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision, pp. 24–31, Piscataway: IEEE, Honolulu, HI, USA, July 2017.
[29] Y. Yang, G. Xie, and Y. Qu, “Real-time detection of aircraft objects in remote sensing images based on improved YOLOv4,” in Proceedings of the 2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC), vol. 5, pp. 1156–1164, Chongqing, China, March 2021.
[30] M. Tan, R. Pang, and Q. V. Le, “EfficientDet: scalable and efficient object detection,” in Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10778–10787, Seattle, Washington, USA, June 2020.
[31] F. Chollet, “Xception: deep learning with depthwise separable convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258, Honolulu, HI, USA, July 2017.
[32] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” 2018, https://arxiv.org/abs/1801.07455.
[33] B.-H. Wang, J. Yu, K. Wang, X.-Y. Bao, and K.-M. Mao, “Fall detection based on dual-channel feature integration,” IEEE Access, vol. 8, pp. 103443–103453, 2020.
[34] F. Harrou, N. Zerrouki, Y. Sun, and A. Houacine, “An integrated vision-based approach for efficient human fall detection in a home environment,” IEEE Access, vol. 7, pp. 114966–114974, 2019.
[35] M. Chamle, K. G. Gunale, and K. K. Warhade, “Automated unusual event detection in video surveillance,” in Proceedings of the International Conference on Inventive Computation Technologies (ICICT), pp. 1–4, IEEE, Bangkok, Thailand, August 2016.
