ResNet 152
by
Pranay Mandadapu

A Thesis Submitted in
Partial Fulfillment of the
Requirements for the Degree of
Master of Science
in Computer Science
at
The University of Wisconsin-Milwaukee
December 2023
ABSTRACT
by
Pranay Mandadapu
This thesis explores deep learning methods for Human Activity Recognition (HAR) to automate the annotation of human activities in videos. The research is particularly relevant for continuous monitoring in healthcare settings such as nursing homes and hospitals. The innovative part of the approach lies in using YOLO models to first detect humans in video frames and then isolate them from the rest of the image for activity recognition, which leads to an improvement in accuracy. The study employs pre-trained deep residual networks, such as ResNet50, ResNet152V2, and Inception-ResNetV2, which were found to work better than custom CNN-based models. The methodology involved extracting frames at one-minute intervals from 12-hour-long videos of 18 subjects and using this data for training and testing the models for human activity recognition. This thesis contributes to HAR research by demonstrating the effectiveness of combining deep learning with advanced image processing, suggesting new directions for healthcare monitoring applications.
© Copyright by Pranay Mandadapu, 2023
All Rights Reserved
TABLE OF CONTENTS
LIST OF FIGURES ...................................................................................................................... vi
LIST OF TABLES ....................................................................................................................... vii
LIST OF ABBREVIATIONS ..................................................................................................... viii
ACKNOWLEDGEMENTS .......................................................................................................... ix
CHAPTER 1 ................................................................................................................................... 1
1 INTRODUCTION ....................................................................................................................... 1
1.1 Background and Research Challenge ......................................................................... 1
1.2 Significance of Research ............................................................................................. 2
1.3 Objectives and Methodology ....................................................................................... 2
1.4 Hypothesis Testing and Model Development .............................................................. 3
CHAPTER 2 ................................................................................................................................... 4
2 LITERATURE REVIEW ............................................................................................................ 4
CHAPTER 3 ................................................................................................................................... 7
3 METHODOLOGY AND MATERIALS ..................................................................................... 7
3.1 Data Source ....................................................................................................................... 7
3.2 Machine Learning and Deep Learning Techniques .......................................................... 8
3.2.1 Classification .............................................................................................................. 8
3.2.2 Neural Networks ......................................................................................................... 9
3.2.3 Convolutional Neural Networks ............................................................................... 10
3.2.4 Pre-trained Image Processing Models ...................................................................... 10
3.2.4.1 ResNet50 ........................................................................................................... 10
3.2.4.2 ResNet152V2 .................................................................................................... 11
3.2.4.3 Inception-ResNetV2 ......................................................................................... 12
LIST OF FIGURES
Figure 3.1: Collage of different subjects doing different activities ............................................... 8
Figure 3.2: YOLO model object and human detection with probabilities ................................... 15
Figure 3.3: Data distribution of uncropped images among different classes across different subjects ......................................................................................................................................... 21
Figure 3.4: From top to bottom: image frame from the video, human detected with YOLOv8, and cropped human subject .......................................................................................................... 22
Figure 3.6: Architecture overview ............................................................................................... 23
Figure 4.1: Test subject 1031 – sitting position ............................................................................ 31
Figure 4.2: Train subject 1002 – sitting position .......................................................................... 32
Figure 4.3: Test subject 1073 – standing position ........................................................................ 33
Figure 4.4: Train subject 1025 – standing position ...................................................................... 33
LIST OF TABLES
Table 3.1: Pre-trained models' performance on the ImageNet dataset ........................................ 13
Table 3.2: Data distribution of uncropped images among different classes and subjects ........... 20
Table 4.1: Confusion matrix of Inception-ResNetV2 without YOLO image pre-processing ..... 26
Table 4.2: YOLO detection rate from the original dataset ........................................................... 28
Table 4.3: Confusion matrix of Inception-ResNetV2 with YOLO image pre-processing .......... 29
Table 4.4: Subject-wise accuracy without YOLO image pre-processing .................................... 30
Table 4.5: Subject-wise accuracy with YOLO image pre-processing ......................................... 34
LIST OF ABBREVIATIONS
HAR Human Activity Recognition
CNN Convolutional Neural Network
YOLO You Only Look Once
IoHT Internet of Healthcare Things
IoT Internet of Things
ML Machine Learning
ResNet Residual Network
ACKNOWLEDGEMENTS
I extend my heartfelt thanks to my advisor, Prof. Rohit J. Kate, for his invaluable guidance and
support throughout my thesis research. His endless patience, encouragement, and dedication have
shaped my research journey. His mentorship has been instrumental in the completion of my work.
I am also grateful to Prof. Scott Strath and the Department of Kinesiology at the University
of Wisconsin-Milwaukee for their generosity in providing the experimental data for this study.
Thanks to Prof. Jun Zhang and Prof. Scott Strath for their willingness to serve on my thesis
committee.
Lastly, my most profound appreciation goes to my parents. Their constant love,
unwavering support, and encouragement have been the bedrock of my academic pursuits. I am
eternally grateful for their guidance, faith in me, and all the sacrifices they have made on my behalf.
Chapter 1
1 Introduction
1.1 Background and Research Challenge
This thesis explores the use of deep learning models to annotate human activities in videos automatically. The central research motivation is the inefficiency and lack of scalability of manual annotation for video datasets. For instance, in our dataset, human annotators meticulously labeled every second of 12-hour-long videos for each of the 18 subjects. These annotations span diverse activities, including sitting, walking, standing, lying, crouching/kneeling/squatting, and other less frequent postures like stepping and dark/obscured/off-frame (oof) scenarios. This manual process is time-consuming, labor-intensive, and costly, thus highlighting the need for an automated solution.

The motivation for this research is deeply rooted in the desire to enhance the efficiency and accuracy of activity recognition in settings where continuous monitoring is crucial. One of the driving inspirations behind this work is the potential application of automated HAR systems in nursing homes and hospitals [1]. In such environments, continuous monitoring is vital for patient safety and care, yet resource constraints and the impracticality of round-the-clock manual observation often hinder it. By automating the activity recognition process, this research aims to provide a scalable solution that could significantly improve patient monitoring, ensuring timely intervention and care.
1.2 Significance of Research
The novelty of this research is in going beyond the conventional use of Convolutional Neural
Networks (CNNs) in Human Activity Recognition (HAR). While employing CNNs and pre-trained
models like ResNet50 [6] in HAR is not novel, this research introduces a unique
application of these deep-learning techniques. The novelty lies in the integration of advanced
image processing using YOLOv8 [13] to detect and isolate humans in the video frames before
activity recognition.

Specifically, this study employs two separate models: one trained on the
original, unaltered dataset and another trained on a subset in which humans are isolated from their
environment. This bifurcated approach is designed to enhance the accuracy and efficiency
of activity recognition. The choice of model is made dynamically: when a human is detected in a
frame, the model trained on the isolated subjects is used, allowing for a more focused and precise
annotation of human activities; otherwise, the model trained on the unaltered dataset is employed.
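To make the dispatch concrete, the following is a minimal sketch of this two-model routing, not the thesis's actual code. The model filenames, the 224x224 input size, and the 0-1 scaling are illustrative assumptions; the YOLOv8 calls follow the Ultralytics API.

```python
# Sketch of the two-model dispatch described above.
# Assumptions (not from the thesis): model filenames, 224x224 input size,
# and simple 0-1 scaling as preprocessing.
import cv2
import numpy as np
from tensorflow import keras
from ultralytics import YOLO  # YOLOv8

detector = YOLO("yolov8n.pt")                                  # pre-trained YOLOv8 detector
cropped_model = keras.models.load_model("har_cropped.h5")      # trained on YOLO-cropped humans
fullframe_model = keras.models.load_model("har_fullframe.h5")  # trained on unaltered frames

def prep(img):
    """Resize to the classifier's input size and scale to [0, 1]."""
    x = cv2.resize(img, (224, 224)).astype("float32") / 255.0
    return x[np.newaxis]  # add batch dimension

def classify_frame(frame):
    """Route a frame to the classifier matching the detection outcome."""
    boxes = detector(frame)[0].boxes
    persons = [b for b in boxes if int(b.cls) == 0]  # class 0 is 'person' in COCO
    if persons:
        # Human detected: crop it and use the model trained on isolated subjects.
        x1, y1, x2, y2 = map(int, persons[0].xyxy[0])
        return cropped_model.predict(prep(frame[y1:y2, x1:x2]))
    # No human detected: fall back to the model trained on unaltered frames.
    return fullframe_model.predict(prep(frame))
```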
Such an approach has not been extensively explored in existing HAR research.

1.3 Objectives and Methodology
The primary objective of this research is to develop an accurate model capable of automatically
annotating human activities from video frames. This study utilizes a dataset comprising 18
subjects, each captured in extensive 12-hour-long video sessions performing activities of daily living in
a metabolic chamber. The methodology involves initially extracting frames from these videos at
one-minute intervals. These frames are then used for training and evaluating the deep learning
models for human activity recognition.
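The one-frame-per-minute sampling step can be sketched as follows. This is a minimal sketch only; the thesis does not specify its extraction tooling, so the use of OpenCV and the output file naming are assumptions.

```python
# Sketch of sampling one frame per minute from a long video.
# Assumptions (not from the thesis): OpenCV as the extraction tool and
# the output file naming scheme.
import cv2

def extract_frames(video_path, out_dir, interval_s=60):
    """Save one frame every interval_s seconds of video time."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    step = max(1, int(round(fps * interval_s)))  # frames between samples
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```

For a 12-hour video, this yields roughly 720 sampled frames per subject, which matches the one-minute interval described above.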