Project 1
1.INTRODUCTION
Computer vision is a branch of computer science that focuses on replicating some of the complexities of human vision, enabling computers to recognize and process objects in images and videos in much the same way that humans do. Thanks to advances in artificial intelligence and innovations in deep learning and neural networks, this field has flourished in recent years, even outperforming humans in certain tasks related to detecting and labeling objects.
So, the first task is to acquire a dataset that matches the requirements mentioned in the SRS report. The dataset can then be fine-tuned so that it incorporates the additional actions that the stakeholder requires. The loaded dataset is partitioned into two parts: one for training and one for testing. The video that needs to be monitored for actions has to be segmented. Segmentation is the process of partitioning a video sequence into a disjoint set of consecutive frames. The incoming video is segmented further to extract key features from it. Using feature selection and feature extraction, the critical features of each frame can be retained. In feature extraction the system transforms arbitrary data into numerical features without losing information, while feature selection keeps the relevant features and eliminates noisy ones. The frames thus generated are then used to train the model with an appropriate algorithm so that the resulting model achieves the overall objective of the project. The system detects the action performed by the human using the LRCN algorithm, which aims to learn a viewpoint-invariant representation for action recognition and action detection.
The LRCN algorithm, also known as the Long-term Recurrent Convolutional Network, is a deep learning architecture that combines the power of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to process and understand sequential data, especially in the context of video analysis.
Convolutional Neural Network (CNN) layer: The LRCN first processes each frame of the video through a CNN designed to extract spatial features. The CNN layers help capture spatial information and identify important objects, shapes, and patterns in each frame.
Temporal modeling using a Recurrent Neural Network (RNN): Once a frame is encoded by the CNN, the extracted features are passed to the RNN layer. The RNN takes into account the temporal relationship between frames and can capture long-term dependencies in videos.
In LRCNs, an LSTM (Long Short-Term Memory) network is used rather than a plain RNN because it can model dependencies over long periods of time.
Video classification or captioning: The output of the RNN layer can be used for different tasks depending on the purpose. For video classification, a softmax layer is usually added on top of the RNN layer to predict the class label or category of the video. This allows the LRCN to recognize videos and classify them into different categories.
The LRCN algorithm can be trained end-to-end using backpropagation through time. During
training, the model is presented with labeled video data, and the parameters of both the CNN
and RNN layers are optimized to minimize the loss function.
Overall, the LRCN algorithm combines the strengths of CNNs in spatial feature extraction
and RNNs in capturing temporal dynamics to perform video analysis tasks such as
classification and captioning. It has shown promising results in various video-related
applications, including action recognition, video summarization, and video captioning.
After the dataset is loaded into the system, it passes through several rounds of training, each of which is referred to as an epoch. In the context of machine learning and deep learning, an epoch is one complete pass through the entire training dataset during the training phase of a model. It is an important concept related to the iteration and optimization process of training a model.
During the training phase, the dataset is divided into smaller batches, and each batch is fed to
the model for forward propagation, followed by the computation of the loss and subsequent
backpropagation to update the model's parameters. An epoch is completed when all batches
in the training dataset have been processed once.
1. Forward Propagation: Each batch of training data is fed through the model, and the inputs
are processed to generate predictions or output values.
2. Loss Computation: The model's predictions are compared to the corresponding ground
truth labels in the batch, and a loss function is calculated. The loss function measures the
dissimilarity between the predicted values and the actual values.
3. Backpropagation: The gradients of the loss with respect to the model's parameters are
computed using the backpropagation algorithm. These gradients indicate the direction and
magnitude of the parameter updates necessary to minimize the loss.
4. Parameter Update: The model's parameters (weights and biases) are adjusted based on the
computed gradients. This update step is typically performed using an optimization algorithm
such as stochastic gradient descent (SGD) or one of its variants.
5. Repeat for all Batches: Steps 1 to 4 are repeated for all batches in the training dataset,
allowing the model to learn from different subsets of the data.
Once all the batches have been processed, one epoch is completed. The number of epochs
determines how many times the model will iterate over the entire training dataset. Training
for more epochs allows the model to further refine its parameters and improve its
performance, as it gets exposed to the entire dataset multiple times.
The choice of the number of epochs depends on various factors, including the complexity of
the problem, the size of the dataset, and the convergence behavior of the model. It is often
determined through experimentation and monitoring the model's performance on a separate
validation dataset. Early stopping techniques can also be employed to automatically stop
training when the model's performance plateaus or starts to degrade, thus preventing
overfitting.
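As a minimal sketch of these ideas, the snippet below trains a Keras model for a fixed number of epochs with early stopping on the validation loss; the variable names (features_train, labels_train) and the hyper-parameter values (50 epochs, batch size 4, patience 10) are illustrative assumptions rather than the project's exact settings.

from tensorflow.keras.callbacks import EarlyStopping

# Stop training when the validation loss stops improving, to avoid overfitting.
early_stopping = EarlyStopping(monitor='val_loss', patience=10,
                               restore_best_weights=True)

# One epoch = one full pass over features_train; batch_size controls how many
# samples are processed before each parameter update.
history = model.fit(features_train, labels_train,
                    epochs=50, batch_size=4, shuffle=True,
                    validation_split=0.2, callbacks=[early_stopping])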
Once the training phase of the model is completed, the frames generated from an input video can show different actions spread across the sequence. It therefore becomes important that the action which appears the maximum number of times across the frames is prioritized. This is where the Softmax function comes into the picture.
Action recognition is typically a multi-class classification problem, where a video can belong
to one of several action classes. The Softmax function is well-suited for multi-class
classification tasks as it transforms the model's output into a valid probability distribution. It
ensures that the predicted probabilities for each class sum up to 1, making it easier to
compare and rank different action categories.
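For reference, the softmax function converts a vector of raw class scores into a probability distribution. A minimal NumPy sketch is shown below; the three scores are made-up values for hypothetical action classes.

import numpy as np

def softmax(scores):
    # Subtract the maximum score for numerical stability before exponentiating.
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

# Example: raw scores for three hypothetical classes; the output sums to 1.
print(softmax(np.array([2.0, 1.0, 0.1])))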
Aim:
Objectives:
2) To resize and normalize the input video in order to make it suitable to be fed as an input to the model.
3) To train and develop the model using an algorithm that results in high accuracy and a low loss value.
4) The model which is built needs to recognize the performed action and label the action onto the video frames accordingly.
Scope:
a) Surveillance Cameras:
Cameras installed in public areas such as banks, airports, hospitals, etc. help in detecting the activities of people and objects so that suspicious activities such as stealing can be monitored and acted upon in real time.
For example, if a surveillance camera is present in a bank, it monitors the people in the area; if any action of a person is found suspicious, the system alerts the administrator.
Limitation:
1. The video sample should be related to a real-world setting, with only one positive sample provided at a time; there should not be any forking path or more than one positive case.
Week 6-7: Research various algorithms for developing the model
Week 8-9: Finalize the algorithm and start the development phase
Week 14: Focus on preparing key diagrams that describe the process of the project
Week 16-18: Using the LRCN algorithm, extract key features from the dataset and decide the actions that need to be recognized
Week 19-21: Continuously monitor the accuracy and loss of the model and experiment with the number of epochs, keeping in mind the threat of overfitting
Week 22-25: Test the model with different videos that do not exceed 30 seconds in length and note the performance of the model
Week 26-27: Prepare the required report with a holistic view of the project and the required results
Project initiation:
Define the project objectives, scope, and deliverables.
Identify the stakeholders and their requirements.
Create a project charter and gain approval from the stakeholders.
Form a project team with the necessary skills and expertise.
Planning:
Develop a project management plan that outlines the project approach, timeline, budget, and
resource allocation.
Conduct a risk assessment and develop a risk management plan.
Define the project requirements, including the AI algorithms, natural language processing,
etc.
Create a detailed project schedule and task list.
Identify the technical and infrastructure requirements, including hardware, software, and data
storage.
Establish a communication plan to ensure that stakeholders are kept informed of progress and
changes.
Execution:
Finalize and develop the AI algorithms and natural language processing capabilities.
Develop the ML model using the dataset.
Use an early stopping function to avoid overfitting.
Make efficient use of the number of epochs and control the batch size.
Project closure:
Obtain sign-off from stakeholders that the project objectives have been met.
Archive project documentation and data.
Conduct a post-project review to identify lessons learned and areas for improvement.
Release the virtual assistant to users, if applicable.
COCOMO Model:
The COCOMO (Constructive Cost Model) is a regression model used to estimate the effort
and development time for software projects. It is based on the size of the software product,
measured in Kilo Lines of Code (KLOC). For an Embedded project with 1400 Lines of Code
(LOC), we first convert LOC to KLOC by dividing by 1000:
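Using the standard basic COCOMO coefficients for the embedded mode (a = 3.6, b = 1.20, c = 2.5, d = 0.32), a rough calculation for 1.4 KLOC is sketched below; the resulting figures are illustrative, not an official estimate.

kloc = 1400 / 1000            # 1.4 KLOC
effort = 3.6 * kloc ** 1.20   # about 5.4 person-months
time = 2.5 * effort ** 0.32   # about 4.3 months of development time
staff = effort / time         # about 1.3 persons on average
print(effort, time, staff)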
These are rough estimates and actual values may vary depending on various factors
specific to our project.
2.Literature Survey
- This paper introduced the two-stream convolutional networks, consisting of spatial and temporal streams, for action recognition. The spatial stream utilizes frame-level appearance information, while the temporal stream captures motion information.
- This work proposed the Temporal Segment Networks (TSN) architecture, which samples
multiple snippets from a video to model temporal dynamics effectively. TSN achieved state-
of-the-art results on several benchmark datasets.
- The I3D architecture extended 2D CNNs to 3D CNNs by inflating the 2D filters along the
temporal dimension. It achieved state-of-the-art performance on various action recognition
benchmarks, including Kinetics and UCF101.
- This work adapted the Bidirectional Encoder Representations from Transformers (BERT)
model for video action recognition. By leveraging pretraining and self-supervised learning, it
achieved state-of-the-art results on various benchmarks.
- This paper introduces a method for prognosing human activity by utilizing action
forecasting and a structured database, enabling accurate predictions of ongoing activities
based on completed action sequences and knowledge stored in the database.
- This paper presents a method for long-term trajectory prediction of the human hand and
duration estimation of human actions, enabling accurate anticipation of hand movements and
action durations in human-robot collaborative tasks.
7. “Peeking into the Future: Predicting Future Person Activities and Locations in
Videos” by Junwei Liang, Lu Jiang, Juan Carlos Niebles, Alexander Hauptmann And
Li Fei-Fei (2020):
- This paper presents a novel approach for predicting future person activities and
locations in videos, enabling the ability to anticipate human behavior and movements. The
proposed method utilizes spatio-temporal modeling and deep learning techniques to forecast
future actions and spatial trajectories, demonstrating promising results in predicting future
person activities and locations.
- This paper presents a method for early recognition of ongoing activities from
streaming videos, enabling real-time prediction of human activity. The proposed approach
leverages deep learning techniques and temporal modeling to accurately anticipate ongoing
activities, demonstrating effective results in predicting human actions from streaming video
data.
10. “Human Activity Recognition and Prediction” by David Jardim, Luís Miguel Nunes,
and Miguel Sales Dias:
12. “Human activity recognition: A review” by Ong Chin Ann And Bee Theng Lau
(2015):
13. “Human Action Recognition and Prediction: A Survey” by Yu Kong And Yun Fu”:
- This paper presents a survey on human action recognition and prediction, covering
various methodologies and techniques used in the field, providing an overview of the
advancements and challenges in the area of recognizing and forecasting human actions.
14. “Long Term and Key Intentions for Trajectory Prediction” by Harshayu Girase ,
Haiming Gang, Srikanth Malla, Jiachen Li ,Akira Kanehara, Karttikeya Mangalam
And Chiho Choi (2021) :
15. “Cross-Domain Human Action Recognition” by Wei Bian, Dacheng Tao And Yong
Rui:
16. “Forecasting future action sequences with attention: a new approach to weakly
supervised action forecasting” by Yan Bin Ng And Basura Fernando:
- This paper presents a method for predicting human activity by discovering temporal
sequence patterns, enabling accurate anticipation of future activities based on learned patterns
and temporal dependencies in the data.
These papers represent a selection of influential works in the field of human action
recognition on common benchmark datasets. They showcase various approaches, including
two-stream networks, trajectory-based methods, 3D CNNs, attention mechanisms, and
leveraging pretraining techniques, all aimed at improving action recognition performance.
3.Requirement Analysis
1. Define Scope: Clearly define the scope of the project, including any specifications that need to be implemented; the type of human behavior to be recognized should be defined and a performance measure or criterion set.
4. Algorithm Selection: The LRCN algorithm is suggested for action recognition. LRCN combines a convolutional neural network (CNN) for spatial feature extraction with a recurrent neural network (RNN) to capture the temporal dynamics of action sequences. Specify any changes or modifications to the basic LRCN model based on the project's needs.
5. Performance Requirements: Specify the performance requirements for the recognition task. These may include metrics such as accuracy, precision, recall and the loss value, or specific constraints on run time and computational cost.
6. Hardware and Software Requirements: Define the hardware and software infrastructure required to train and deploy the system. Consider computing resources, memory requirements, and the specific software or libraries required, such as TensorFlow, PyTorch, or Keras.
9. Documentation and Maintenance: Plan for appropriate system documentation, including the algorithms, data preprocessing steps, training process, and any software code or scripts. Consider future maintenance and possible system updates.
By following these steps and considering these scenarios, the requirements for building a human action recognition system using the LRCN algorithm on a common benchmark dataset can be specified.
1.Functional Requirements:
⮚ The system must be able to read the video as input.
⮚ The system must be able to extract each frame from the video input for processing.
⮚ The system must be able to compare each frame with the learned weights.
⮚ After comparison, the system should be able to classify the input sequence into the correct class with acceptable accuracy.
2.Nonfunctional Requirements:
a) Security
The system should not allow third parties to modify content without permission.
b) Availability
⮚ Self-study support should be available.
⮚ The system should be smart enough to suggest appropriate steps as you continue to use the
system.
⮚ The system should be able to recognize the many tasks humans can do.
⮚ There should be no limit to the types of input video streams the system can handle.
c) Reliability
⮚ The system should be able to recover itself in a timely manner.
⮚ The system must be able to handle all exceptions.
Admin –
Load Test data
Train the model
Test the Performance
Deploy Model
Action Recognition
End User –
Alert Generation
Hardware/Software and Cost:
Computer system with i5 10th generation or above, 8GB or above RAM, 128GB SSD: 50000
NVIDIA 1650 GPU: 10000
Python IDE to run machine learning modules: Open Source
Python 2.7; TensorFlow 1.10.0: Open Source
Electricity: 560
Internet: 3000
4.System Design
Action Detection
[Flowchart 1: Start ... Human Detected ... End]
[Flowchart 2: Start, Detect video, Classify object, Object detected, End]
[Flowchart 3: Start, Detect video, End]
a) DFD Level 0:
c) DFD Level 1:
Figure 4.6: Component Diagram
5.Implementation:
1. Software:
2. Hardware:
Desktop/Laptop: 1
RAM: 16GB
Graphics Card: 8GB
Detailed implementation:
1. Dataset Description:
The UCF50 dataset, as used in this project, contains about 100 clips for each activity, with the selected clips representing 6 activities of interest. Each clip has about 600 frames and the videos are shot at 25 fps. The Kaggle database contains more than 100 videos, extracted from movies and YouTube videos, that can be used for training.
The UCF50 dataset's 50 action categories, collected from YouTube, are: Baseball Pitch, Basketball Shooting, Bench Press, Biking, Billiards Shot, Breaststroke, Clean and Jerk, Diving, Drumming, Fencing, Golf Swing, Playing Guitar, High Jump, Horse Race, Horse Riding, Hula Hoop, Javelin Throw, Juggling Balls, Jump Rope, Jumping Jack, Kayaking, Lunges, Military Parade, Mixing Batter, Nunchucks, Playing Piano, Pizza Tossing, Pole Vault, Pommel Horse, Pull Ups, Punch, Push Ups, Rock Climbing Indoor, Rope Climbing, Rowing, Salsa Spins, Skate Boarding, Skiing, Skijet, Soccer Juggling, Swing, Playing Tabla, TaiChi, Tennis Swing, Throw Discus, Trampoline Jumping, Playing Violin, Volleyball Spiking, Walking with a Dog, and Yo Yo.
2. Data processing:
a) Reading videos and labels: the OpenCV library is used to read the videos from the class folders, and the class labels of the videos are stored in NumPy arrays.
b) Splitting into frames to create a sequence: each video is read using the OpenCV library and only 30 evenly spaced frames are read to create a sequence of 30 frames.
c) Resizing: all frames are resized to a width of 64 px and a height of 64 px so that every frame has the same dimensions.
d) Normalization: normalization helps the learning algorithm learn faster and capture the relevant features of the image, so each resized frame is divided by 255 to bring every pixel value into the range 0 to 1.
e) Storing in NumPy arrays: the sequence of 30 resized and normalized frames is stored in a NumPy array and passed on for model building.
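A minimal sketch of this preprocessing step is given below; the function name and the constants are assumptions chosen to match the description above (30 evenly spaced frames, 64x64 resolution, pixel values divided by 255), not the project's exact code.

import cv2
import numpy as np

SEQUENCE_LENGTH, IMAGE_HEIGHT, IMAGE_WIDTH = 30, 64, 64

def frames_extraction(video_path):
    # Read a video, sample evenly spaced frames, then resize and normalize them.
    frames = []
    reader = cv2.VideoCapture(video_path)
    total = int(reader.get(cv2.CAP_PROP_FRAME_COUNT))
    skip = max(total // SEQUENCE_LENGTH, 1)   # spread the samples across the video
    for i in range(SEQUENCE_LENGTH):
        reader.set(cv2.CAP_PROP_POS_FRAMES, i * skip)
        ok, frame = reader.read()
        if not ok:
            break
        frame = cv2.resize(frame, (IMAGE_WIDTH, IMAGE_HEIGHT))
        frames.append(frame / 255.0)          # scale pixel values to [0, 1]
    reader.release()
    return np.asarray(frames)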
3. Train Test Split Data:
75% of Data Used for Training
25% of Data Used for Testing
Libraries used include:
import os
import cv2
import pafy
import math
import random
import pandas as pd
import numpy as np
import datetime as dt
import tensorflow as tf
import matplotlib.pyplot as plt
%matplotlib inline
cv2 library: the cv2 module comes from the OpenCV library and is used for computer vision tasks such as capturing video and processing its frames.
pafy library: it is used to extract metadata and download YouTube videos from their URLs.
datetime library: it supplies classes for working with dates and times.
In this model we used the UCF50 dataset published by CRCV. This dataset contains 50 video action classes.
We have fine-tuned it by adding two more actions, which are Fire and StreetFighting.
After this we load the dataset and split it into training and testing sets in the ratio of 75% and 25% respectively.
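As a sketch of this split (assuming the preprocessed frame sequences are stored in features and the one-hot encoded labels in one_hot_labels, names chosen here for illustration), the code could look like this:

from sklearn.model_selection import train_test_split

# Hold out 25% of the sequences for testing; shuffle before splitting.
features_train, features_test, labels_train, labels_test = train_test_split(
    features, one_hot_labels, test_size=0.25, shuffle=True, random_state=27)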
import zipfile

# Open the uploaded dataset archive (the archive path shown here is an assumption).
zip_ref = zipfile.ZipFile('/content/UCF50.zip', 'r')
zip_ref.extractall('/content/dataset')
zip_ref.close()
dataset_path = '/content/dataset'
Here the videos that will be used for training the model are preprocessed by resizing and normalizing them.
Each frame is resized to 64x64 and divided by 255 so that every pixel value lies between 0 and 1.
The list of classes on which the model will be trained is also defined.
SEQUENCE_LENGTH = 40
DATASET_DIR = "/content/dataset/UCF50/UCF50"
CNN:
The CNN learns to recognize spatial patterns in each frame, such as edges, textures,
and object features.
RNN:
It takes the output features from the CNN and processes them sequentially through
recurrent layers, such as LSTM (Long Short-Term Memory).
The RNN captures the temporal dynamics of the video by modeling the
dependencies between consecutive frames and learning long-term dependencies.
Passing suitable parameters to the model helps to build an efficient model.
model = Sequential ()
model.add(TimeDistributed(MaxPooling2D((4, 4))))
model.summary()
Early stopping is also used so that the model fits well, neither overfitting nor underfitting.
The RMSprop optimizer is used for training; it maintains a moving average of squared gradients to scale the learning rate for each parameter.
RMSprop is known for its ability to converge quickly and handle sparse gradients effectively.
Model Creation:
A deep learning network, LRCN, is used in our proposed system for video surveillance.
The main idea behind LRCNs is to use a combination of CNNs to learn the spatial properties of images and LSTMs to map sequences of images to labels, sentences, events or whatever is required. As seen above, the input is processed by the CNN, whose output is fed into a stack of sequential models.
LSTM networks are well suited for classifying, processing, and forecasting time-series data, where significant events may be separated by long time lags. LSTMs were developed to solve the vanishing gradient problem that can be encountered when training RNNs.
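Building on the model = Sequential() excerpt shown earlier, a minimal LRCN sketch under the constants defined above is given below; the layer sizes and the five-class output are illustrative assumptions rather than the project's final architecture.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (TimeDistributed, Conv2D, MaxPooling2D,
                                     Dropout, Flatten, LSTM, Dense)

model = Sequential([
    # CNN applied to every frame of the sequence to extract spatial features.
    TimeDistributed(Conv2D(16, (3, 3), padding='same', activation='relu'),
                    input_shape=(SEQUENCE_LENGTH, 64, 64, 3)),
    TimeDistributed(MaxPooling2D((4, 4))),
    TimeDistributed(Dropout(0.25)),
    TimeDistributed(Conv2D(32, (3, 3), padding='same', activation='relu')),
    TimeDistributed(MaxPooling2D((4, 4))),
    TimeDistributed(Flatten()),
    # LSTM models the temporal dependencies across the frame sequence.
    LSTM(32),
    # Softmax output: one probability per action class (5 classes assumed here).
    Dense(5, activation='softmax'),
])
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
              metrics=['accuracy'])
model.summary()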
Description:
a) This module processes the frames from the video to detect a human object.
b) It uses the LRCN algorithm and pretrained weights to identify humans in the video.
c) The CNN extracts the visual features and the LSTM retains information across frames for a longer time so that all the frames are taken into account.
Module Input: Objects or items edge-detected from the frames of the videos.
Description:
a) This function collects the frames and loads them into the trained model.
b) It then applies the deep learning method by passing the frames through the hidden layers of the model, where different weights are applied to them.
c) By comparing the objects or items with frames or images of videos already available in the dataset, it classifies whether the object is a nunchuck, a building, trees, or boxing gloves.
Module Input: The human and object detected from the frames of the videos.
Description:
a) This module collects the frames from the video and loads them into the trained model. It then compares them with the trained dataset and recognizes the action performed.
b) It checks for a similar action in the dataset and displays the result on the video frame by frame, for actions such as Fire, Punch, TaiChi, Nunchucks and StreetFighting.
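A minimal sketch of how the recognized action could be overlaid on the output video is shown below; it assumes the trained model and the frames_extraction helper sketched earlier, and a hypothetical CLASSES_LIST holding the class names in training order.

import cv2
import numpy as np

CLASSES_LIST = ['Fire', 'Punch', 'TaiChi', 'Nunchucks', 'StreetFighting']  # assumed order

def predict_and_label(video_path, output_path):
    # Predict a single action for the video and write the label onto every frame.
    frames = frames_extraction(video_path)
    probabilities = model.predict(np.expand_dims(frames, axis=0))[0]
    label = CLASSES_LIST[int(np.argmax(probabilities))]

    reader = cv2.VideoCapture(video_path)
    fps = reader.get(cv2.CAP_PROP_FPS)
    width = int(reader.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(reader.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*'mp4v'),
                             fps, (width, height))
    while True:
        ok, frame = reader.read()
        if not ok:
            break
        cv2.putText(frame, label, (10, 30), cv2.FONT_HERSHEY_SIMPLEX,
                    1, (0, 255, 0), 2)   # draw the predicted action label
        writer.write(frame)
    reader.release()
    writer.release()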
6.2) Testing:
Unit Testing:
1. Human Identification
2. Object Detection
3. Action Detection
System Testing:
7. Performance Analysis:
The above graph describes the training accuracy versus the validation accuracy for the LRCN algorithm. It shows that the accuracy stabilizes after about 25 epochs, reaching 100% training accuracy. The validation accuracy closely follows the curve of the training accuracy, indicating that the model is not overfitted and that the loss is minimized, which is one of the objectives of the project.
The above graph represents the steady decrease in the loss as training approaches 30 epochs, after which it stabilizes. The validation loss also stabilizes as the number of epochs increases.
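For completeness, a minimal sketch of how such curves can be produced from the history object returned by model.fit (metric names follow Keras defaults) is given below.

import matplotlib.pyplot as plt

def plot_metric(history, metric, val_metric, title):
    # Plot a training metric against its validation counterpart per epoch.
    epochs = range(len(history.history[metric]))
    plt.plot(epochs, history.history[metric], label=metric)
    plt.plot(epochs, history.history[val_metric], label=val_metric)
    plt.title(title)
    plt.xlabel('epoch')
    plt.legend()
    plt.show()

plot_metric(history, 'accuracy', 'val_accuracy', 'Accuracy vs Validation Accuracy')
plot_metric(history, 'loss', 'val_loss', 'Loss vs Validation Loss')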
8. Future Scope:
More efficient software could understand and analyze long videos on a daily basis. Although many comprehensive review articles have been published on the general topic of HAR, the rapid development of the field, along with its multidisciplinary nature, encourages the need for focused reviews. In fact, many computer vision applications rely on HAR, including human-computer interaction, virtual reality, security, video surveillance, and home monitoring.
This creates new and important challenges in the development cycle of HAR models. Here we present a unique insight into current work and research used to detect human movement. Comparing individual human motions across similar models is difficult because of the lack of well-established methods for representing similarity, and the results depend on the image dataset used. This will be useful for researchers' future work in this area. A system that generates an alert by classifying the action according to intent and notifying the end user if the recognized action is dangerous can be an extended part of the project.
9. Applications:
Advances in today's technology offer us new ways to improve the quality of life of the
elderly and disabled. Housing services and operations use HAR technology and analytics to
monitor residents and help keep them safe.
The smart home is an environment filled with sensors that improve the safety and well-being
of residents while monitoring their health. Thus, such buildings with HAR systems
contribute to the independence and quality of life of people who need physical and mental
support. Basically, in a smart building, the data collected by the sensors is analyzed and the
behavior of the inhabitants and their interaction with the environment are monitored.
Installation:
2) Install Anaconda Navigator or Visual Studio Code or use Google Colab directly.
imageio-->2.31.0
User Manual:
The above snap shows that the system detected the Fire from the Video Input
The above snap shows that the system detected the Action of Nunchucks with 99% accuracy.
The above snap shows that the system detected the action of Punching.
The above snap shows that the system detected the action of TaiChi with 98% accuracy.
12.Ethics:
1. Surf the internet for personal interest and non-class related purposes during classes
7. Buy software with a single user license and then install it on multiple Computers
13.References:
1. Vibekananda Dutta and Teresa Zielinska, "Prognosing Human Activity Using Actions Forecast and Structured Database", IEEE journal paper, Volume 8, pp. 6098-6116, 3 January 2020.
6. David Jardim, Luís Miguel Nunes, and Miguel Sales Dias, "Human Activity Recognition and Prediction", Microsoft Language and Development Center, Lisbon, Portugal; Instituto Universitário de Lisboa (ISCTE-IUL), Lisbon, Portugal; IT - Instituto de Telecomunicações, Lisbon, Portugal; ISTAR-IUL, Lisbon, Portugal.
8. Ong Chin Ann and Bee Theng Lau, "Human activity recognition: A review", IEEE conference paper, March 2015.
9. Yu Kong and Yun Fu, "Human Action Recognition and Prediction: A Survey", International Journal of Computer Vision, 28 March 2022.
10. Harshayu Girase, Haiming Gang, Srikanth Malla, Jiachen Li, Akira Kanehara, Karttikeya Mangalam, and Chiho Choi, "Long Term and Key Intentions for Trajectory Prediction", IEEE International Conference on Computer Vision, October 2021.