A Water Behavior Dataset For An Image-Based Drowning Solution

1st Saifeldin Hasan, 2nd John Joy, 3rd Fardin Ahsan
Electrical Engineering Department
2021 IEEE Green Energy and Smart Systems Conference (IGESSC) | 978-1-6654-3456-0/21/$31.00 ©2021 IEEE | DOI: 10.1109/IGESSC53124.2021.9618700
Abstract—Drowning is responsible for an estimated 320,000 deaths annually worldwide, and roughly 25% of those deaths occur in swimming pools. This is likely because a drowning person, to the untrained eye, appears to be playing or floating normally in the water. While drowning, a person is unable to call for help, as the nervous system focuses on gathering oxygen for the lungs. To assist lifeguards with their rescue mission, we propose a water behavior dataset curated to support the design of image-based methods for drowning detection. The dataset includes three major water activity behaviors (swim, drown, idle) captured by overhead and underwater cameras. Moreover, we develop and test two methods to detect and recognize drowning behavior using the proposed dataset. Both methods use deep learning and aim to support a fast and smart pool rescue system by watching for the early signs of drowning rather than looking for a drowned person. The results show a high performance of the presented methods, validating our dataset, which is the first public water behavior dataset and the main contribution of this work.

Index Terms—water behavior dataset, drowning, computer vision, early rescue

I. INTRODUCTION

The burden of drowning for children has become a leading public health problem [1]. The high rates of drowning are an impediment to achieving reductions of early childhood mortality [2]. Many of these death incidents are attributed to poor adult supervision. Various drowning scenarios involve newly mobile toddlers who wander off, preschoolers who discover ungated swimming pools, in addition to older children and adults at unguarded public swimming areas, such as beaches, rivers, and residential swimming pools. Effective interventions that mitigate drowning risk will improve health outcomes. Yet, interventions such as fencing around pools, lifeguards, and flotation devices are not always feasible. A recent study shows a small but alarming number of drowning deaths at public swimming areas that are guarded by professional lifeguards [3]. There is strong evidence that humans simply are not very good at noticing rare events while completing a boring, repetitive task [4], [5]. Although lifeguards are usually highly alert, their task is extremely difficult, resulting in egregious examples of inattention. The aquatics industry is acutely aware of the challenges it faces to prevent drownings in lifeguarded swimming areas. Addressing the question of how one can help lifeguards complete the highly challenging task of identifying rare events (drownings) while performing a repetitive scanning task is a multifaceted problem that requires a multifaceted solution.

Recently, many research works have been devoted to understanding the signs of drowning behavior. Lu and Tan [6] presented a vision-based approach to detecting drowning incidents in swimming pools at the earliest possible stage, using a number of video clips of simulated drowning. The approach detects and tracks swimmers and parses observation sequences of swimmer features for possible drowning behavioral signs. In [7], a real-time drowning detection method uses an HSV thresholding mechanism along with contour detection to detect a drowning person in indoor swimming pools and sends an alarm to the lifeguard if the previously detected person is missing for a specific amount of time. A real-time vision system operating at an outdoor swimming pool is presented in [8]. The system is designed to automatically recognize different swimming activities and to detect the occurrence of early drowning incidents. To learn unique traits of different swimming behaviors, the authors simulate and collect unique traits of early drowning behaviors and numerous swimming styles.

The industry has made a shift in the past few years to address the drowning crisis by moving toward more safeguarded pools. This has also brought in a generation of drowning prevention products. Many of these products provide the lifeguard with a deeper view through a 3D monitoring screen [9], or equip the swimmer with a wearable that tracks how long the swimmer's face is submerged [10]. Other surveillance technologies have advanced to the point of being able to answer a distress call
using artificial intelligence software to monitor the swimmer's activity and detect potential problems when the swimmer has been struggling for more than seven seconds [11].

According to the previous surveys, most of the research has been focused on detecting the swimmer's location to identify submerged swimmers with no movement. This is not applicable to water activity recognition as needed in our case, i.e., recognizing activities that describe specific behaviors of swimmers. On the other hand, these recognition works applied their models to either simulated swimming scenarios or their own private real-time video frames. Based on these considerations, this paper first proposes a water activity behavior dataset of videos that are captured above water and under water. The videos illustrate three types of water activities that describe the behaviors of swimming, drowning and staying idle (resting or playing). Moreover, we present two image-based methods that train a deep learning model on the proposed dataset.

The remainder of the paper is organized as follows. Section II presents the two drowning recognition methods. The experimental setup describing the proposed dataset and the results are presented in Section III. Finally, the conclusions are drawn in Section IV.

II. DROWNING RECOGNITION

We utilize existing deep neural networks (DNNs) pre-trained on the large ImageNet dataset [12] and adapt them for water behavior recognition with the sole purpose of identifying the early signs of drowning. The pre-trained feature representations provide a starting point for creating robust classifiers for drowning detection. We consider two scenarios for incorporating pre-trained neural networks. First, we re-train pre-trained DNNs by fine-tuning their parameters. Next, we detect the different body keypoints with a Deep High-Resolution Network (HRNet) [13] and use them to train a generic DNN. All the models are trained on an NVIDIA GTX 1070 GPU.

A. Method 1: Scene Classification

The first method is applied on the proposed dataset to perform a standard video scene classification. The method trains a deep neural network (DNN) using transfer learning in order to perform predictive labelling on the video frames and classify the different human activities in water. We test three different DNN architectures: ResNet50 [14], VGG16 [15], and MobileNet [16]. These networks have been pre-trained on the ImageNet dataset, which includes over 1.2 million images for 1,000 object classes. We then re-train the DNNs by fine-tuning the parameters of the neural networks. Fine-tuning is essentially training the network for several more iterations on a new dataset. This process adapts the generic filters trained on the ImageNet dataset to the drowning recognition problem. We train the networks for 15 epochs using stochastic gradient descent. At around 15 epochs, all of the network architectures achieve near 100% accuracy on the training set, so no further improvement in training can be achieved.

B. Method 2: Body Pose Estimation

While the features from the pre-trained DNNs can be useful for human activity recognition in the water, another network, the High-Resolution Network (HRNet), can be trained to detect and compute the body pose keypoints. A BGR-to-RGB conversion is first applied to the frames before passing them to the DNN to improve the network's performance. The HRNet consists of parallel high-to-low resolution subnetworks with repeated information exchange across multi-resolution subnetworks (multi-scale fusion). The model has been pre-trained on the COCO train2017 dataset [17], which contains over 50,000 images and 150,000 person instances labeled with 17 keypoints. Finally, the detected keypoints are classified using a DNN of 5 layers with (50, 50, 50) neurons in the hidden layers to classify the three different human water activities (swim, drown, idle).

III. EXPERIMENTAL SETUP AND RESULTS

A. Water Behavior Dataset

To train and test the previously described models, a dataset of short videos displaying human water activities is curated. For this purpose, two different cameras are used above water and under water. The DJI OSMO+ is used to film above water, with a resolution of 1920x1080 at 30 fps, and a GoPro HERO7 is used for the underwater scenes, with a resolution of 1920x1440 at 30 fps.

For practicality purposes, every video shows only one person performing one of three different activities: swimming, drowning, and staying idle. Every video in the dataset is presented and labeled as "activity_id_person.mp4". The underscore ('_') delimiter is used to separate the fields of interest for better visibility. The first field is the activity type (swim, drown, idle), followed by 'id' to indicate the number of the video, and finally 'person' to indicate that we are labelling videos of a person. Individual frames are generated every 0.0333 seconds (i.e., at 30 fps) from their respective videos. The video frames are resized to 640x144 for the scene classification method and to 640x368 for the pose estimation method. The dataset includes a total of 91 videos of an average of 57 minutes each. They are split into overhead and underwater videos, with sub-categories for the three activities. There are 47 videos for the overhead scenes and 44 videos for the underwater scenes. The subjects in the dataset are all males ranging from 18 to 21 years of age, mostly of Middle Eastern ethnicity. Fig. 1 and Fig. 2 show sample images from both overhead and underwater videos for each water activity. Table I presents a summary of the dataset characteristics, displaying the number of videos and frames for each activity.

TABLE I: Summary of the overhead and underwater video characteristics for each of the three water activities.

Category   No. of videos   No. of frames   No. of videos    No. of frames
           (overhead)      (overhead)      (underwater)     (underwater)
Swim       19              9,384           19               8,388
Drown      18              6,080           14               7,363
Idle       10              9,265           11               6,259
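The dataset's file-naming convention described in Section III-A can be parsed programmatically when loading the videos. Below is a minimal sketch; the helper name and the example filename are hypothetical illustrations, not part of the released dataset:

```python
from pathlib import Path

ACTIVITIES = {"swim", "drown", "idle"}

def parse_video_label(filename):
    """Split an 'activity_id_person.mp4' name into (activity, video number).

    The underscore delimiter separates the activity type, the video id,
    and the literal 'person' field.
    """
    activity, vid_id, subject = Path(filename).stem.split("_")
    if activity not in ACTIVITIES or subject != "person":
        raise ValueError(f"unexpected filename: {filename}")
    return activity, int(vid_id)

# Hypothetical example: the 7th drowning video
print(parse_video_label("drown_07_person.mp4"))  # -> ('drown', 7)
```

Keeping the label inside the filename this way means no separate annotation file is needed for whole-video activity labels.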
Fig. 1: Sample images from the overhead videos for each water activity. (a) Overhead - Swim.

Fig. 2: Sample images from the underwater videos for each water activity. (a) Underwater - Swim.
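The keypoint classifier of Method 2 (Section II-B) is a 5-layer DNN with (50, 50, 50) neurons in the hidden layers. A plain forward pass over such a network can be sketched as follows; the input size of 34 (17 COCO keypoints times two coordinates), the ReLU activations, and the omission of bias terms are illustrative assumptions, not details specified in the paper:

```python
import random

# Assumed dimensions: 17 COCO keypoints x (x, y) = 34 inputs, three hidden
# layers of 50 neurons as stated in the paper, 3 output classes.
LAYER_SIZES = [34, 50, 50, 50, 3]

def init_mlp(sizes, seed=0):
    """Random weights for a fully connected network (illustrative, untrained)."""
    rng = random.Random(seed)
    return [[[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]

def relu(v):
    return v if v > 0 else 0.0

def forward(weights, x):
    """Forward pass: ReLU on hidden layers, raw class scores at the output."""
    for i, layer in enumerate(weights):
        x = [sum(w * xi for w, xi in zip(neuron, x)) for neuron in layer]
        if i < len(weights) - 1:      # no activation on the output layer
            x = [relu(v) for v in x]
    return x

scores = forward(init_mlp(LAYER_SIZES), [0.5] * 34)  # one score per class
```

In practice the keypoint coordinates would first be normalized (e.g., relative to the detected person's bounding box) before classification, though the paper does not specify this step.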
TABLE II: Accuracy (%) of models trained and tested on the proposed water behavior dataset for the three activity classes.

DNN         Overhead        Underwater      Average
            Accuracy (%)    Accuracy (%)    Accuracy (%)
ResNet50    98.70           95.00           96.85
MobileNet   89.90           76.60           83.25
VGG16       98.30           95.10           96.70
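The per-class precision, recall and F1-score reported in the results can be computed directly from true and predicted activity labels. A minimal self-contained sketch (the function name and toy label lists are hypothetical):

```python
def per_class_metrics(y_true, y_pred, classes=("swim", "drown", "idle")):
    """Return {class: (precision, recall, F1)} from parallel label lists."""
    metrics = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = (precision, recall, f1)
    return metrics
```

Library implementations such as scikit-learn's `precision_recall_fscore_support` compute the same quantities; the sketch above just makes the definitions explicit.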
The pose estimation method performs well for the three activity classes in both the overhead and underwater cases, slightly outperforming the scene classification method (Table II and Table III).

Next, we compute the F1-score, precision and recall for each of the three activity classes for both above water and under water, as shown in Table VII and Table VIII. Here too, the pose estimation method proves to be efficient at recognizing the different water behavior activities. Again, comparing Table VII and Table VIII to Table IV and Table V respectively, we notice that the pose estimation method performs better than the video scene classification method.

We observe that the pose estimation method is independent of the scene variations and relies solely on the body pose and the swimmer's behavior in the water. This is in contrast to the video scene classification method, which relies on the features of the overall frame to learn the water activity class. The latter might be affected by several scene factors that constrain the classification and result in a less robust recognition method compared to the pose estimation method.

IV. CONCLUSION

We proposed a water behavior dataset of videos that were captured above water and under water for drowning recognition. The videos illustrated three types of water activities to describe the behaviors of swimming, drowning and staying idle. Moreover, we presented two image-based methods that trained different deep learning models on the proposed dataset. Both methods, scene classification and pose estimation, proved to be efficient at recognizing each of the water behavior activities. Within the first method, ResNet50 performed the best. However, the pose estimation method slightly outperformed the scene classification method, as the former depended less on the scene variations. Its recognition process relied on keypoint features that described the body pose, which better related to the concept of human behavior in the water.

ACKNOWLEDGMENT

The authors would like to thank Luke Cunningham and Kim Beasley of BlueGuard - Al Wasl Swimming Academy, who supported the creation of the proposed water behavior dataset.

REFERENCES

[1] D. You, G. Jones, K. Hill, T. Wardlaw and M. Chopra, "Levels and trends in child mortality, 1990-2009," Lancet, vol. 376, no. 9745, pp. 931–933, September 2010.
[2] M. Peden, K. Oyegbite, J. Ozanne-Smith and A. A. Hyder, "World Report on Child Injury Prevention: Summary," Geneva, Switzerland: World Health Organization, 2008.
[3] Redwoods Group, "Teen dies in accident at YMCA," available at: http://www.redwoodsgroup.com/YMCAs/RiskManagement/AquaticsAlerts.html. Accessed July 23, 2020.
[4] J. Duncan and G. W. Humphreys, "Visual search and stimulus similarity," Psychological Review, vol. 96, pp. 433–458, July 1989.
[5] J. M. Wolfe, T. S. Horowitz and N. M. Kenner, "Rare items often missed in visual searches," Nature, vol. 435, pp. 439–440, May 2005.
[6] W. Lu and Y. P. Tan, "A vision-based approach to early detection of drowning incidents in swimming pools," IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 2, pp. 159–178, March 2004.
[7] N. Salehi, M. Keyvanara and S. A. Monadjemmi, "An automatic video-based drowning detection system for swimming pools using active contours," International Journal of Image, Graphics and Signal Processing, vol. 8, no. 8, p. 1, August 2016.
[8] H. L. Eng, K. A. Toh, W. Y. Yau and J. Wang, "DEWS: A live visual surveillance system for early drowning detection at pool," IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 2, pp. 196–210, March 2008.
[9] https://www.aqua-conscience.com/
[10] https://www.wavedds.com/
[11] https://www.angeleye.tech/en/en-lifeguard/
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla and M. Bernstein, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, April 2015.
[13] K. Sun, B. Xiao, D. Liu and J. Wang, "Deep high-resolution representation learning for human pose estimation," arXiv:1902.09212 [cs], February 2019. [Online]. Available: https://arxiv.org/abs/1902.09212
[14] A. Krizhevsky, I. Sutskever and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
[15] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations, San Diego, May 2015, pp. 1–14.
[16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," arXiv preprint arXiv:1704.04861, April 2017.
[17] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision, September 2014.