
ARMY PUBLIC COLLEGE OF MANAGEMENT & SCIENCES

FINAL YEAR PROJECT THESIS FORMAT

Department of Computer Sciences

Human Action Recognition using 3D point clouds

A project submitted
in partial fulfillment of the requirements for the degree of
Bachelor of Science in Computer Science

by
Misha Karim (UET-16f-BSCS-02)
Nisar Bahoo (UET-16f-BSCS-84)
Muhammad Junaid Khalid (UET-16f-BSCS-92)

Supervised by
Dr. Muhammad Sajid Khan

Army Public College of Management & Sciences


Rawalpindi, Pakistan
Department of Computer Sciences

Affiliated with UET (Taxila)

Session 2016 – 2020

ACKNOWLEDGEMENT

We cannot express enough thanks to our supervisor, Dr. Sajid Khan, for his continued
support and encouragement; we offer our genuine gratitude for the learning
opportunities he has given us.

The completion of this project could not have been accomplished without the
support of one another. To the friend who lent us his GPU-equipped laptop for our
implementation: thank you for allowing us the time away to research and write.
Thanks to our parents as well; the occasions on which you helped us manage our rushed
timetables will not be forgotten.

Finally, to our outstanding and supportive institution, APCOMS: our most
profound appreciation. Your encouragement when times got rough is much
appreciated and duly noted. It was a great comfort and relief to know that you were
willing to provide the supervision and support needed to complete this thesis. A heartfelt
thank you.

UNDERTAKING

This is to declare that the project entitled “Human Action Recognition using
3D point clouds” is an original work done by the undersigned, in partial fulfillment of the
requirements for the degree of “Bachelor of Science in Computer Science” at the Computer
Science / Software Engineering Department, Army Public College of Management &
Sciences, affiliated with UET Taxila, Pakistan.

All the analysis, design and system development have been accomplished by
the undersigned. Moreover, this project has not been submitted to any other college or
university.

Date:

Student 1: Misha Karim

Student 2: Nisar Bahoo

Student 3: Muhammad Junaid Khalid

CERTIFICATE

This is to certify that the project titled


“____________________________________________” is the bona fide work carried
out by ________________, student of Bachelor of Science in Computer Science
(BSCS), of Army Public College of Management and Sciences, affiliated to
University of Engineering and Technology, Taxila (Pakistan) during the academic
year __________, in partial fulfillment of the requirements for the award of the
degree of Bachelor of Science in Computer Science (BSCS), and that the project has
not formed the basis for the award previously of any other degree, diploma,
fellowship or any other similar title.

Signature of the Supervisor

Date:

ABSTRACT

The ability of automated technologies to correctly identify a human’s actions
provides considerable scope for systems that make use of human-machine interaction. Thus,
automatic 3D Human Action Recognition is an area that has seen significant research effort. In
the work described here, a human’s everyday 3D actions recorded in the NTU RGB+D dataset
are identified using a novel structured-tree neural network. The nodes of the tree represent the
skeleton joints, with the spine joint represented by the root. The connection from a
child node to its parent is known as the incoming edge, while the reciprocal connection is known
as the outgoing edge. The use of a tree structure leads to a system that maps intuitively to
human movements. The classifier uses the change in displacement of joints and the change in the
angles between incoming and outgoing edges as features for classifying the actions
performed.

Table of Contents
1. Introduction:........................................................................................................................................
1.1 Objective:....................................................................................................................
1.2 Motivation:..................................................................................................................
1.3 Problem Definition:....................................................................................................
1.4 Scope:...........................................................................................................................
1.5 Problem Solution:...................................................................................................
1.6 Existing systems..........................................................................................................
1.7 Project breakdown structure.....................................................................................
1.8 Block diagram.............................................................................................................
1.9 Applications.................................................................................................................
2. Literature Review................................................................................................................................
2.1 Related Work..............................................................................................................
2.2 Analysis........................................................................................................................
2.2.1 Analytical Graph:...............................................................................................
2.3 State-of-the-Art:..........................................................................................................
2.4 Problem Solution:.......................................................................................................
3. Methodology.........................................................................................................................................
3.1 Introduction:...............................................................................................................
3.1.1 Coordinate System..............................................................................................
3.1.2 Orthographic Environment................................................................................
3.2 Inputs to System (Dataset).........................................................................................
3.3 System Requirements.................................................................................................
3.4 Overview of System.....................................................................................................
3.4.1 3D Video Inputs...................................................................................................
3.4.2 Pre Processing.....................................................................................................
3.4.3 Human Body detection and getting skeleton.....................................................
3.4.4 Skeleton Extraction:...........................................................................................
3.4.5 Neuron Tree Model:............................................................................................
3.4.6 Movements of joints:...........................................................................................
3.4.7 Deforming of edge...............................................................................................
3.5 Proposed Approach (Structure Tree Neural Network)............................................
3.6 Pseudo Code................................................................................................................

3.7 Scenario of Processing................................................................................................
3.8 Expected Output.........................................................................................................
3.9 Summary.....................................................................................................................
4.1 Development Tool........................................................................................................................
4.2 Implementation issues.................................................................................................................
4.3 Configuration Management........................................................................................................
4.4 Framework Section for development.........................................................................................
4.5 Deployment factors......................................................................................................................
4.5.1 PyTorch...................................................................................................................
4.5.2 Pandas......................................................................................................................
4.5.3 MATLAB.................................................................................................................
4.6 Summary......................................................................................................................................
5. System Testing.....................................................................................................................................
5.1 Introduction.................................................................................................................
5.2 Output..........................................................................................................................
5.3 Automatic Testing.......................................................................................................
5.4 Statistical and Graphical Analysis.............................................................................
5.5 Complete User Interface.............................................................................................
5.5.1 Layout 1:..............................................................................................................
5.5.2 Layout 2:..............................................................................................................
5.5.3 Layout 3:..............................................................................................................
5.5.4 Layout 4:..............................................................................................................
5.5.5 Layout 5:..............................................................................................................
5.5.6 Layout 6:..............................................................................................................
5.5.7 Layout 7:..............................................................................................................
5.6 Summary.....................................................................................................................
6. Conclusion........................................................................................................................................
Major Contributions...............................................................................................................................
7. References............................................................................................................................................

Figure No. | Figure Name | Page No.
Figure 1.1 | Point cloud representation of humans | 13
Figure 1.2 | Extracting skeleton points from point clouds | 14
Figure 1.1.1 | Point cloud extraction | 15
Figure 1.2.1 | Human Action Recognition is dependent on multiple disciplines | 16
Figure 1.6.1 | Point clouds and skeleton points of a hand | 18
Figure 1.6.2 | Coordinates for a depth image | 18
Figure 1.6.3 | Frame-by-frame processing of an RGB video | 19
Figure 1.6.4 | 3D grid around a human body | 19
Figure 1.6.5 | Two different clusters of human actions | 20
Figure 1.7.1 | Project breakdown structure | 20
Figure 1.8.1 | Block diagram of system | 21
Figure 2.1 | Point clouds formed | 23
Figure 2.2 | Change in shapes | 23
Figure 2.1.1 | Silhouette sequences | 24
Figure 2.1.2 | Skeletal points | 25
Figure 2.1.3 | Flow of events | 26
Figure 2.1.4 | Two different views of 3D point clouds | 26
Figure 2.1.5 | Directed acyclic graph used in the Directed Graph Neural Network | 28
Figure 3.2.1 | Orthographic projection demonstration | 34
Figure 3.3.1 | Block diagram of system | 36
Figure 3.4.2.1 | Application of median filter | 37
Figure 3.4.4.1 | Extracted skeleton of clap action | 39
Figure 3.4.5.1 | Skeleton (left) and a possible tree of skeleton (right) | 39
Figure 3.5.1.2 | Flow of the proposed approach, Structured Tree Neural Networks | 41
Figure 3.5.1.3 | Class prediction in Structured Tree Neural Networks | 41
Figure 4.4.1 | Flow of the proposed approach, Structured Tree Neural Networks | 47
Figure 5.2.1 | Depiction of layers of STNN | 50
Figure 5.2.2 | Accuracy confusion matrix | 51
Figure 5.3.1 | All tests passed in automatic testing | 51
Figure 5.5.1.1 | Most initial layout | 53
Figure 5.5.2.1 | Selection of video by the user by clicking the Select Video Input button | 54
Figure 5.5.3.1 | Output of the skeletons displayed | 55
Figure 5.5.4.1 | Output of the extracted skeleton | 56
Figure 5.5.5.1 | Output of the structured tree | 57
Figure 5.5.6.1 | Output of the extracted features from the tree | 58
Figure 5.5.7.1 | Output of the performed action along with the feature points used | 59

Table No. Table Name Page No.
Table 2.2.1 Comparative summary of existing systems 29-30

Table 3.2.1 Some most commonly used Action Datasets 34

Table 3.4.2.1 Two Laplace filters 38

Table 5.4.1 Statistical analysis 52

Graph No. Graph Name Page No.
Graph 2.2.1.1 Accuracies of existing systems 31

Graph 5.4.1 Graphical analysis 52

1. Introduction:

Human action recognition (HAR) is a vigorous and challenging research topic
that aims to observe and analyze the actions of a person from video observation data,
in this case using 3D datasets. The field is gaining importance as computer vision
systems increasingly take over tasks that were previously carried out manually; for
example, surveillance has largely moved to live CCTV cameras, which can now support
the recognition of illegal actions. HAR systems collect and process contextual
environmental, spatial, and temporal data to understand human behavior. In general,
the HAR process involves several steps, starting from extracting information about
human behavior from raw sensor data and ending with a conclusion about the
currently performed actions.

“Recognition is the most inexpensive, easy-to-use motivational technique available to management.”
- Jim Clemmer
Nowadays, point clouds are becoming the emerging representation in Human Action
Recognition. Earlier 3D systems widely used depth map sequences, whose greatest
problems were long response times and difficult feature extraction. We use point
clouds because they are easier to detect and manipulate, and most recent 3D scanners
and imaging tools produce point clouds. The points in a point cloud represent the
surface of an object; they do not contain any information about internal features,
color, materials, and so on [1].

Figure 1.1 Point cloud representation of Humans

Various point cloud datasets help researchers explore tasks such as image
classification, object detection and semantic segmentation. A point cloud is a
collection of 3D positions in a 3D coordinate system, commonly defined by x, y and
z coordinates, that represents the surface of an object or a space (it contains no data
about internal features, color, materials, etc.). Point clouds are becoming the basis for
highly accurate 3D models, used for measurements and calculations directly in or on
the object, e.g. distances, diameters, curvatures or cubatures. They are therefore a
great source of information not only for 3D feature and object recognition, but also
for deformation analysis of surfaces.

In most Human Action Recognition systems, input test data is compared
with reference data in a database using various methods, followed by
classification and other techniques for recognizing the data [2].

For reference, the recent framework proposed by [3] uses several Kinect
datasets (the MSR-3D and UTKinect-3D-Action datasets are sufficient for training
and testing to demonstrate its capabilities) and a deep CNN (Convolutional Neural
Network) architecture that exploits spatial arrangement. The currently prevailing
strategy is to use machine learning methods such as space partitioning. Using depth
image features and skeleton points, 3D posture can be recognized from point clouds
in a multi-camera scenario [3].

With this evolution, views from 3D point clouds made observing and analyzing human
actions more reliable, and 3D point cloud systems are more accurate than depth-sequence
systems [4]. Initially the task of monitoring and detecting crimes under surveillance was
carried out by human operators, but with the increasing number of camera views and
technical monitoring devices this task has become not only more challenging for the
operators but also more cost-intensive. Human Action Recognition systems play a wide
role in different applications such as video surveillance (capturing illegal actions
performed in front of CCTV cameras), video retrieval (automatically retrieving a specific
portion of a recording by recognizing the actions in it), healthcare monitoring (constantly
monitoring a patient's physiological parameters) and robot vision (understanding a human
action and responding accordingly). Although these applications have shown tremendous
progress in action recognition, they still exhibit problems with response time and
performance because they are trained on only a limited set of actions.

Figure 1.2. Extracting skeleton points from point clouds
In the past few years robot vision has become an emerging field, and enabling robots to
understand and respond to human actions requires a Human Action Recognition system.

1.1 Objective:

The main objective of the system is to recognize the actions performed by a
human in a given 3D video. This involves the following goals:

1. To take 3D video input from the UTKinect-3D Action dataset
2. To recognize a list of actions: walking, running, lying, sitting and waving
3. To implement the proposed algorithm to recognize the desired actions
4. To focus on the recognition rate and performance of the system
5. To minimize the response time of the system

Figure 1.1.1. Point cloud extraction

1.2 Motivation:
Our aim is to recognize actions using 3D point clouds because they are cheaper to
capture, easier to manipulate and more accurate than depth-based alternatives.

Figure 1.2.1. Human Action Recognition is dependent on multiple disciplines.

If the figure is observed closely, it shows that work such as 3D action
recognition does not depend on a single concept; rather, it is a mesh of Statistics,
Machine Learning and Artificial Intelligence combined.
“In the arena of human life, the honors and rewards fall to those
who show their good qualities in action.”
- Aristotle
This quote from Aristotle is relevant, as it also ties rewards to actions that are
recognized. With the emergence of 3D point clouds, the challenge of representing
actions has eased: 2D images could not capture an action well from a single
viewpoint, whereas a 3D view covers various viewpoints and gives a better look at
the perceived action.

1.3 Problem Definition:

“To develop a computer system that recognizes and infers human
actions based on complete executions using 3D point clouds”

Many existing systems have already worked on this problem but were only able
to identify a few specific actions such as walking, lying, sitting or running.

1.4 Scope:

“To detect and recognize the human actions by using point clouds in an
orthographic environment and classifying the actions according to their category
while improving the overall system’s efficiency”

1.5 Problem Solution:

Our proposed algorithm will follow certain steps to reach the final output.
The procedure for solving the problem usually falls into the following steps (a
minimal code sketch follows the list):

(1) Pre-processing of the raw data from sensor streams to handle unwanted
distortions or features, for example eliminating noise and redundancy or performing
data aggregation and normalization, which is important for further processing;

(2) Segmentation: highlighting the most significant data segments in order to
simplify how an image is represented and analyzed;

(3) Feature extraction: extracting features such as temporal and spatial
information from the data obtained in step (2);

(4) Dimensionality reduction: selecting only a few appropriate features to
increase their quality and reduce the computational effort needed for
classification; and

(5) Classification, the core and most important step of all: determining the
given activity, i.e. labelling images into their respective predefined categories.
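The following is a minimal Python sketch of this five-step pipeline, written only for illustration; the function bodies are placeholder assumptions (per-frame normalization, thresholding, variance-based selection) and do not correspond to the project's actual implementation.

import numpy as np

def preprocess(frames):
    # Step 1: remove noise / normalize each frame (placeholder: per-frame normalization).
    return [(f - f.mean()) / (f.std() + 1e-8) for f in frames]

def segment(frames):
    # Step 2: keep only the most significant regions (placeholder: simple thresholding).
    return [np.where(np.abs(f) > 1.0, f, 0.0) for f in frames]

def extract_features(frames):
    # Step 3: spatial and temporal features (placeholder: per-frame mean plus its change over time).
    spatial = np.array([f.mean() for f in frames])
    temporal = np.diff(spatial, prepend=spatial[0])
    return np.stack([spatial, temporal], axis=1)

def reduce_dimensions(features, k=2):
    # Step 4: keep the k most informative columns (placeholder: highest-variance columns).
    order = np.argsort(features.var(axis=0))[::-1]
    return features[:, order[:k]]

def classify(features, model):
    # Step 5: label the sequence with a pre-trained classifier (any object with a predict() method).
    return model.predict(features.mean(axis=0, keepdims=True))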

1.6 Existing systems

Hand PointNet: 3D Hand Pose Estimation using Point Sets

In this paper the authors use 3D point clouds to create a 3D model of the hand.
They detect the shape and position of the hand using regression in CNNs
(Convolutional Neural Networks), locate the fingertips and the center of the hand to
estimate the pose, and extract features from the detected points. They then classify
histograms built from these features and output an estimate of the detected pose. [5]

Figure 1.6.1 Point clouds and skeleton points of a hand

HOPC: Histogram of Oriented Principal Components of 3D Point clouds


for Action Recognition

In this paper the authors propose a technique that processes point clouds
directly instead of converting them to depth images. They employ a key-point
extractor and a novel descriptor, build histograms from the extracted key points,
and classify these histograms to predict the detected action [6].

Fig 1.6.2. Coordinates for a depth image.

Human Activity Recognition from Unstructured Feature Points

In this paper the researchers do not perform pose estimation at all. Instead, they
use a visual attention module that learns the activity from the video frames and
produces interest points, which are then classified to give an activity prediction. The
interest points are gathered by monitoring glimpses frame by frame [7].

Fig 1.6.3 Frame by frame processing of an RGB video

3D Point Cloud Descriptor for Posture Recognition

The algorithm proposed in this paper preserves spatial contextual
information about a 3D object in a video sequence. The authors capture point clouds
using a regular spatial-partitioning 3D grid and describe the pose by checking the
occupancy of every cell of the grid; unsupervised K-means clustering is then used to
differentiate between the actions. All of this is done without joint estimation [8].

Fig 1.6.4 3D grid around a human body

Figure 1.6.5. Two different clusters of human actions.

1.7 Project breakdown structure

Figure 1.7.1 Project breakdown structure

1.8 Block diagram

Figure 1.8.1 Block diagram of system

1.9 Applications

There are several application domains where HAR can be deployed, including the
following categories:

1. Active and assisted living (AAL) systems: the use of information and
   communication technologies (ICT) in a person's daily living and working
   environment to enable them to stay active longer, remain socially connected
   and live independently into old age.
2. Healthcare monitoring applications, with the aim of constantly monitoring a
   patient's physiological parameters.
3. Monitoring and surveillance systems for both indoor and outdoor activities.
4. Tele-immersion (TI) applications that enable users in different geographic
   locations to perform similar actions through simulation.
5. Human-computer interaction.
6. Robot vision: mimicking human actions and performing similar actions.

2. Literature Review
“Action” is derived from “act”, described as a fact or a process of doing
something. A human action is the result of joint movements of the human body, such
as the elbows for standing up, the ankles and knees for sitting down or walking, and
the hips and shoulders for throwing, or of any other muscle action. Recognizing human
actions under conditions of wrong pose orientation, varying orientation towards the
camera and half-body input, as required in interactive entertainment and robotics, is a
challenging task. Such actions are represented by a series of motions, and although
similar motions can correspond to different actions, this makes the study all the more
interesting.

Human Action Recognition is gaining more and more importance in
surveillance-related applications. It addresses one of the great challenges of computer
vision: moving traditional manual tasks to human-robot interaction. In the 1970s, when
Gunnar Johansson explained how humans recognize actions [9], this field became the
center of attention of many computer scientists.

In earlier times, the task of detecting and recognizing human actions used only
2D datasets (images and videos). These datasets could highlight features such as
edges, the silhouette of the human body, 2D optical flow and 3D spatio-temporal
volumes. Effective local representations include silhouette representation by
Histogram-of-Oriented-Gradients (HOG) or Histogram-of-Optical-Flow (HOF)
descriptors [10]. However, these datasets were unable to provide a reliable setup for
retrieving 3D information. Later, with the evolution of 3D scanners in the 1960s [5],
scientists gained the ability to retrieve 3D images and related features from natural
environments in the form of point clouds. Point clouds are defined as points in space
comprising the three essential coordinates (x, y, z) of any object in orthogonal
directions [11], which makes them highly beneficial for obtaining precise data with
exceptional accuracy. Point clouds also provide effective scans for measuring
dimensions or interpreting the spatial relation between different parts of a structure.
The history of point clouds relates directly to the history of 3D scanners, major
examples of which include LiDAR, the PX-80 and the Microsoft Kinect.

In conventional 2D feature-extraction models for human action recognition,
multiple camera orientations were still insufficient to provide the information required
for processing. The data points obtained from 3D environments and scanners,
however, provide sufficient information. For Human Action Recognition, a suitable
skeleton is marked by joints obtained from 3D point clouds (Figure 2.1). Monitoring
the sequences and changes in the position and other features of these skeletal points is
a difficult task (Figure 2.2).

Figure 2.1: Point clouds formed[12]

Figure 2.2: Change in shapes[12]

2.1 Related Work


The major motive behind any human action recognition system is to improve its
accuracy, performance, efficiency and response time. Rusu, R.B., Bandouch, J.,
Marton, Z.C., Blodow, N. and Beetz, M. proposed a solution for human action
recognition in intelligent environments using a spherical coordinate system [5]. An
intelligent environment is capable of calculating the difference between silhouette
image sequences frame by frame. The 2D silhouette images cover only two
dimensions: one spatial and one temporal. Implementing a cube algorithm on the 2D
input converts it into 3D by adding another spatial dimension for recognizing actions.
Techniques such as background subtraction and Gaussian spatial filtering segment the
data, and silhouette images are obtained. For differentiating actions from gestures,
K-D trees are used.

Figure 2.1.1. Silhouette sequences [13]

3D skeleton joints also help in recognizing actions. Li, M. and Leung, H. use
skeletal points as the input to their system. Skeletal points are placed at every joint of
the body, as shown in Figure 2.1.2. They directly extract features such as the Relative
Variance of Joint Relative Distance and a Temporal Pyramid covariance descriptor.
The extracted data is represented as a Joint Spatial Graph (JSG), which is then
analyzed with a newly proposed technique called the JSG Kernel. The JSG Kernel
parallelizes edge-attribute similarity checking and vertex-attribute similarity checking,
which makes it easier for the classifier to label the actions. They used three datasets,
the MSR-3D-Action dataset [14], the UTKinect-3D-Action dataset [15] and the
Florence-3D-Action dataset [16], for training, testing and evaluation [17].

Figure 2.1.2. Skeletal points[17]

Using the same skeleton-point approach, Yang and Tian [18] proposed a
technique in which 3D skeleton points are extracted from 3D depth data captured by
RGB-D cameras. The skeleton joints, also known as EigenJoints, combine action
information including static posture, motion properties and the overall dynamics of
each point, with different values for each frame. In preprocessing, an Accumulated
Motion Energy (AME) technique [19] selects informative frames from the input
video; removing noisy frames reduces the computational cost and improves the
performance of the proposed system. After AME preprocessing, a non-parametric
Naïve-Bayes-Nearest-Neighbor (NBNN) [20] algorithm classifies the performed
actions by comparing them with the actions on which the system was trained. The
input to the system is a 3D video dataset (the MSR Action3D dataset) [14]. Different
experiments compare the performance and accuracy of the systems, and results on
multiple datasets show that the approach outperforms state-of-the-art methods. An
investigation of the frame count showed that the first 30-40% of frames already
achieve the desired results, so the rest of each MSR Action3D [14] video does not
need to be processed [18].

Figure 2.1.3. Flow of Events.[18]

An autonomous system for real-time action recognition using 3D motion flow
estimation (Munaro, M., Ballin, G., Michieletto, S. and Menegatti) recognizes human
actions online in real time. For visualizing the point clouds, different colors highlight
the essential information in the images. A 3D sensor, the Microsoft Kinect, senses the
view and provides values along all three axes. A 3D grid-based descriptor joins
multiple point clouds to form a shape to be recognized, and a nearest-neighbor
algorithm classifies the performed actions against those stored in the system. The
performance and efficiency of the system were tested on this dataset, and it
recognized 90% of the actions [10].

Figure 2.1.4. Two different views of 3D point clouds[10]

The development of advanced new low-cost depth sensors such as the Kinect
is very helpful for recognizing human actions. Depth sensors capture body postures
and human silhouettes better. RGB-D images provide feature sources such as Motion
History Images (MHI) [21], Motion Energy Images (MEI) [19] and Depth Motion
Maps (DMM) [22] in different views. Each depth image is projected onto three
orthogonal planes, and a representation classifier is employed for action recognition.
The methods proposed in the early stages of the human action recognition field
showed poor performance.

The data is first split for training and testing. Noisy frames, which still exist in
public data due to the segmentation of irregular data, are filtered out. An angular
spatio-temporal descriptor, which is stable and cheap to compute, combines the
kinematic parameters of humans. For the recognition process a KNN classifier [13] is
preferred. The method also includes computing the descriptor of each frame and
estimating the time label of the frame [23].

To achieve high accuracy and performance in this field, Lei Shi, Yifan Zhang,
Jian Cheng and Hanqing Lu proposed the Directed Graph Neural Network [24],
which treats the skeleton as a directed acyclic graph. A summary of how it works
follows.

Typically, the raw skeleton data are a sequence of frames, each of which
contains a set of joint coordinates. Given a skeletons sequence, we first extract the
bone information according to the 2D or 3D coordinates of the joints. Then, the joints
and bones (the spatial information) in each frame are represented as the vertexes and
edges within a directed acyclic graph, which is fed into the directed graph neural
network (DGNN) to extract features for action recognition. Finally, the motion
information, which is represented with the same graph structure that used for spatial
information, is extracted and combined with the spatial information in a two-stream
framework to further improve the performance.

Figure 2.1.5. Directed Acyclic Graph used in Directed Graph Neural Network

Skeleton data is represented as a directed acyclic graph (DAG) with the joints
as vertexes and the bones as edges. The direction of each edge is determined by the
distance between the vertex and the root vertex: the vertex closer to the root points to
the vertex farther from it. Here the neck joint is considered the root joint, as shown in
Fig. 2.1.5.

This algorithm is an improved version of the Spatio-Temporal Graph
Convolutional Network. It also extracts attribute information of edges and vertices in
the form of vectors. The algorithm works in layers: the input to each layer consists of
the vertices, the edges and the attributes (or the attributes updated by the previous
layer). Bottom layers are responsible for aggregating nearer/adjacent vertices, and top
layers are responsible for aggregating farther vertices.

2.2 Analysis
Working in the challenging research area of action recognition, many
researchers have done their best, but each approach has its drawbacks. For instance,
in 2009 the extraction of point cloud features in intelligent environments suffered
from poor response time. In 2013, three-dimensional action recognition carried a
massive processing load that hurt performance. Within a year (2014), researchers
approached action recognition again through 3D EigenJoints, which relied on a
reliable classifier but used only a single dataset. In 2017 two major studies were
conducted: one was based on graphs, which responded quickly on multiple datasets
but recognized only a small number of actions; the other, based on kinematic
similarity in real time, obtained good results but used complex, time-consuming
algorithms. Later, in 2019, the Directed Graph Neural Network was proposed, which
produced great performance and accuracy but required very large storage space. This
timeline shows that with the advancement of technology and the discovery of new
algorithms, action recognition became easier and performance improved to around
90%, but the target of 100% is still out of reach.
Table 2.2.1 Comparative summary of existing systems

Authors | Title of paper | Year | Methodology | Recognized actions | Advantages | Drawbacks
Radu Bogdan Rusu, Jan Bandouch, Zoltan Csaba Marton, Nico Blodow, Michael Beetz | Action Recognition in Intelligent Environments using Point Cloud Features Extracted from Silhouette Sequences [13] | 2009 | K-D trees | Opening door, closing door, unscrewing, drinking | First use of 3D point clouds | Slow processing response time
Matteo Munaro, Gioia Ballin, Stefano Michieletto | 3D Flow Estimation for Human Action [10] | 2013 | KNN classifier | Get up, kick, pick up, sit down, walk, turn around | Color point clouds were used | Slow performance due to triple workload
Xiaodong Yang, YingLi Tian | Effective 3D action recognition using EigenJoints [18] | 2014 | Naïve-Bayes-Nearest-Neighbor (NBNN) | Step back, step forward, step aside, kick, run, twist, climb | Better performance under different light conditions | Single dataset used
Meng Li, Howard Leung | Graph-based approach for 3D human skeletal action recognition [17] | 2017 | Joint spatial graphs | Forward punch, forward kick, walk, pick up, clap, bow | Quick response compared to previous work, on 3 different datasets | Limited actions recognized
Qingqiang Wu, Guanghua Xu, Longting Chen, Ailing Luo, Sicong Zhang | Human Action Recognition Based on Kinematic Similarity in Real Time [23] | 2017 | Depth motion maps with KNN classifier | Basketball shoot, boxing, bowling, push, jog, walk, tennis serve, tennis swing | Suitable feature extraction results in high accuracy | Slow performance and difficult feature extraction
Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu | Skeleton-Based Action Recognition with Directed Graph Neural Networks [24] | 2019 | Directed graph neural networks | - | Great accuracy over a large dataset | Very large storage space required

2.2.1 Analytical Graph:


After this analysis of existing systems, the task of defining an efficient Human
Action Recognition system becomes easier. The major concern in this kind of project
is to improve efficiency in terms of response time, which is directly tied to the
performance rate. Before drawing a conclusion, we need to study the accuracy rates
of the existing systems. Graph 2.2.1.1 shows their accuracy over time.

[Bar graph: accuracy (%) of existing systems by publication year, 2009-2019]

Graph 2.2.1.1 Accuracies of existing systems

2.3 State-of-the-Art:
After analyzing this research, the work with the best results among all is the
Directed Graph Neural Network. Techniques such as global average pooling for
feature extraction and a DGNN classifier for classifying the different actions produce
highly efficient results. Even though its complex algorithms affect processing
performance, it achieves more than 95% accuracy.

2.4 Problem Solution:


To resolve the drawbacks of these existing systems, the following steps are
considered:

1. Enhancing the input frames through different digital image processing
   techniques
2. Detecting and highlighting the skeleton points of the body using standard
   libraries such as the Point Cloud Library (PCL), OpenPose or VideoPose3D
3. Extracting only the skeleton points, for accurate results
4. Extracting other features, such as geometrical shape coordinates or volume,
   along with classification for labeling the action

The steps are discussed in detail in the coming chapter

3. Methodology
3.1 Introduction:
Recognizing human actions plays an important foundational role in robot vision.
Robots cooperate with humans by accommodating their actions and mimicking them,
particularly in industrial settings where a person cannot easily work, such as lifting
heavy objects or precisely positioning objects above one's shoulders, or where an
ageing workforce can be supported by a robotic assistant rather than a fully manual
process. The essential idea is to combine the cognitive capabilities of humans with the
physical strength and efficiency of robots/machines. Humans have perception and
cognitive functions and are able to act and react with regard to a given scenario;
robots can learn this behavior and act accordingly [25].

The human actions relevant to such applications are specified using human
skeletal points. Previously noted techniques, including silhouette sequences, optical
views and depth map sequences, offered reasonable quality and response time;
nonetheless, with the advancement of technology there was a need for a stronger
approach. Since the 1960s [25], 3D scanners have captured point clouds: points in
space comprising the three essential coordinates (x, y, z) of any object in orthogonal
directions. In this work the results are rendered with orthographic projection, a
method of representing three-dimensional objects in two dimensions; it is a type of
parallel projection in which all the projection lines are orthogonal to the projection
plane.

3.1.1 Coordinate System


In mathematics, a 3D coordinate system consists of three mutually
perpendicular number lines. A camera coordinate system is used in this project.

Camera (view) space is the result of transforming world-space coordinates to
coordinates that are in front of the user's view. View space is the scene as seen from
the camera's point of view, and it is typically obtained with a combination of
translations and rotations that translate/rotate the scene. These combined
transformations are usually stored in a view matrix that transforms world coordinates
to view space [26].

3.1.2 Orthographic Environment

Orthographic projection is a way to represent three-dimensional objects in
two dimensions. In this projection the projection lines are orthogonal to the
projection plane, as shown in Figure 3.2.1. This project will produce its output in
orthographic projection [27].

Figure. 3.2.1. Orthographic projection demonstration
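As a rough illustration of camera (view) space and orthographic projection, the sketch below transforms 3D joint positions from world space into a camera's view space and then projects them orthographically by dropping the depth axis. It is a generic example under our own assumptions, not code taken from the project.

import numpy as np

def view_matrix(eye, target, up=(0.0, 1.0, 0.0)):
    # Build a 4x4 view matrix for a camera at `eye` looking at `target`.
    f = np.asarray(target, float) - np.asarray(eye, float)
    f /= np.linalg.norm(f)                          # forward axis
    r = np.cross(f, np.asarray(up, float))
    r /= np.linalg.norm(r)                          # right axis
    u = np.cross(r, f)                              # true up axis
    m = np.eye(4)
    m[0, :3], m[1, :3], m[2, :3] = r, u, -f         # rotation part
    m[:3, 3] = -m[:3, :3] @ np.asarray(eye, float)  # translation part
    return m

def orthographic_project(points_world, view):
    # Transform Nx3 world-space points to view space, then drop z (parallel projection).
    n = points_world.shape[0]
    homogeneous = np.hstack([points_world, np.ones((n, 1))])
    points_view = (view @ homogeneous.T).T[:, :3]
    return points_view[:, :2]                       # orthographic: keep x, y only

# Example: project three joints as seen by a camera 3 m in front of the subject.
joints = np.array([[0.0, 1.0, 0.0], [0.2, 1.5, 0.1], [-0.2, 1.5, 0.1]])
print(orthographic_project(joints, view_matrix(eye=(0, 1, 3), target=(0, 1, 0))))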

3.2 Inputs to System (Dataset)


The data is collected in the form of data points from various datasets; a data
set (or dataset) is a collection of data. Many characteristics define a dataset's
structure and properties, including the number and kinds of attributes or variables
and the statistical measures applicable to them, such as variance. Several publicly
available datasets are widely used for recognition of daily human actions and
activities. Some of the most commonly used datasets are shown in Table 3.2.1.

Table 3.2.1 Some most commonly used Action Datasets

Dataset Videos Classes Subjects Views


MSR-Action3D[28] 567 20 10 1
CAD-60[29] 60 12 4 -
RGBD-HuDaAct[30] 1189 13 30 1
MSR-DailyActivity3D[31] 320 16 10 1
UT-Kinect[15] 200 10 10 4
Act42[32] 6844 14 24 4
CAD-120[33] 120 10+10 4 -
3D Action Pairs[34] 360 12 10 1
Multiview 3D Event[35] 3815 8 8 3
Northwestern UCLA[36] 1475 10 10 3
UWA3D multiview[6] ~900 30 10 1
Office activity[37] 1180 20 10 3
UTD-MHAD[38] 861 27 8 1
UWA3D multiview II[39] 1015 30 10 5
M2I[40] ~1800 22 22 2
SYSU-3DHOI[41] 480 12 40 1
NTU RGBD 120[42, 43] 114480 120 106 155

3.3 System Requirements


Software requirements

 Python 3.5+
 Windows 10 x64

Dependencies:

 PyTorch
 Numpy
 Scipy
 OpenCV

 Tqdm
 TensorboardX

Hardware requirements

 100GB storage space


 8GB RAM
 512MB VRAM
 i3 4th generation processor

3.4 Overview of System


In this system a 3D RGB-D action video is given as input. Noise is then removed
to clarify the frames, and sharpening techniques are applied to obtain a better
visualization of the body for the extraction of skeleton points. After extraction, the
tree is formed using attention networks, and features are extracted from the
information at each node. Finally, the features are fed to a classifier to obtain the
predicted output class.

Figure 3.3.1: Block Diagram of system

3.4.1 3D Video Inputs


NTU RGB+D [43, 44] is currently the most widely used dataset for
skeleton-based action recognition; it contains 56,880 videos, each containing one
action. There are a total of 60 classes, including single-person actions, e.g. drinking
water, and two-person actions, e.g. kicking another person. The dataset contains four
different data modalities: RGB videos, depth map sequences, 3D skeleton data and
infrared videos, captured by a Microsoft Kinect V2 at 30 fps. The actions are
performed by 40 volunteers aged from 10 to 35. Three cameras record every action,
set at the same height but aimed from different horizontal angles: -45°, 0° and 45°.
The camera provides the 3D locations of 25 joints. Two benchmarks are
recommended: 1) Cross-subject (CS): the persons in the training and validation sets
are different; the training set contains 40,320 videos and the validation set contains
16,560 videos. 2) Cross-view (CV): the horizontal camera angles used in the training
and validation sets are different.
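A minimal sketch of how such skeleton data could be wrapped for training is shown below. It assumes the skeleton sequences have already been converted to NumPy arrays of shape (frames, 25, 3) and saved as .npy files whose names encode the subject and action IDs; the file-naming scheme is hypothetical, and the subject split should be verified against the dataset release.

import os
import numpy as np
import torch
from torch.utils.data import Dataset

# Cross-subject training subject IDs as commonly reported for NTU RGB+D; verify against the dataset paper.
TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19, 25, 27, 28, 31, 34, 35, 38}

class SkeletonDataset(Dataset):
    """Loads (frames, 25, 3) skeleton arrays and their action labels."""

    def __init__(self, root, train=True):
        self.samples = []
        for name in sorted(os.listdir(root)):
            if not name.endswith(".npy"):
                continue
            # Hypothetical naming scheme: S{subject:03d}A{action:03d}.npy
            subject = int(name[1:4])
            action = int(name[5:8]) - 1
            if (subject in TRAIN_SUBJECTS) == train:
                self.samples.append((os.path.join(root, name), action))

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        joints = np.load(path).astype(np.float32)   # (T, 25, 3)
        return torch.from_numpy(joints), label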

3.4.2 Pre Processing


Salt-and-pepper noise [45] is added to the NTU RGB+D dataset [44] for better
training and prediction; to remove this noise a median filter [46] is used.

The median filter is very effective at removing this kind of noise. It is
applied to each frame of the input video by replacing the gray level of each pixel
with the median of the gray levels in a neighborhood of that pixel; the size of the
neighborhood is chosen according to the noise level, as shown in Fig. 3.4.2.1.

Figure 3.4.2.1 Application of median filter[47]

The next step is sharpening the video. Sharpening clarifies the edges of objects
for easier detection, and the Laplacian filter is widely used for this purpose.

The Laplacian filter enhances the edges of the video frames; this enhancement
is known as sharpening the image. Sharpening the edges also helps with feature
extraction and skeleton shaping, so the resulting skeleton points clearly mark the
subject's body joints. Two commonly used Laplacian kernels are shown in Table
3.4.2.1.

Table 3.4.2.1 Two Laplace filters

 0  -1   0          -1  -1  -1
-1   4  -1          -1   8  -1
 0  -1   0          -1  -1  -1

The Laplacian filter uses the function L(x, y), shown below, to compute the second
derivatives of the image intensities I:

L(x, y) = ∂²I/∂x² + ∂²I/∂y²
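As a brief illustration of this preprocessing step, the sketch below applies a median filter followed by Laplacian sharpening to the frames of a video using OpenCV; the kernel size, weighting factor and file name are assumptions for illustration, not values fixed by the project.

import cv2
import numpy as np

def preprocess_frame(frame_gray, ksize=3, strength=1.0):
    # Remove salt-and-pepper noise with a median filter (ksize x ksize neighborhood).
    denoised = cv2.medianBlur(frame_gray, ksize)
    # cv2.Laplacian uses a negative-center kernel, so subtracting it sharpens the edges.
    lap = cv2.Laplacian(denoised, cv2.CV_64F, ksize=3)
    sharpened = denoised.astype(np.float64) - strength * lap
    return np.clip(sharpened, 0, 255).astype(np.uint8)

# Example: process every frame of an input video (hypothetical file name).
cap = cv2.VideoCapture("input_video.avi")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    clean = preprocess_frame(gray)
cap.release()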

3.4.3 Human Body detection and getting skeleton


For human body detection the standard Point Cloud Library (PCL) [48] is used.
PCL is an open-source, cross-platform library for performing operations on point
clouds. In this context the library creates the point cloud of the image and then detects
the human body through a standard algorithm. After detection, the joints of the human
body are highlighted to form a skeleton; these points are called skeleton points.

3.4.4 Skeleton Extraction:


To extract the skeleton, the skeleton points formed in the previous step are
plotted on a new 3D plane for better visualization and processing. Extracting the
skeleton onto a separate plane also greatly decreases the processing payload, because
the whole image no longer needs to be processed; only the 25 skeleton points remain.

Figure 3.4.4.1 Extracted skeleton of clap action
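A minimal sketch of this re-plotting step is shown below; it assumes the extracted joints of one frame are already available as a (25, 3) NumPy array and simply scatters them on an empty 3D axis with matplotlib.

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3d projection on older matplotlib)

def plot_skeleton_frame(joints):
    # joints: (25, 3) array of x, y, z coordinates for one frame.
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    ax.scatter(joints[:, 0], joints[:, 1], joints[:, 2], c="r", s=20)
    for i, (x, y, z) in enumerate(joints):
        ax.text(x, y, z, str(i + 1), fontsize=7)   # label joints 1..25
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.set_zlabel("z")
    plt.show()

plot_skeleton_frame(np.random.rand(25, 3))  # replace with the real extracted joints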

3.4.5 Neuron Tree Model:


Skeleton points are used for the construction of the tree. Skeleton points are the
points placed over the joints of the human body. Points 4, 21, 2 and 1 are the most
important for construction of the tree, as they are usually chosen as the root; in Fig.
3.4.5.1, point 2 is used as the root. The tree is formed by building two networks over
the skeleton, a Global Long-sequence Attention Network and a Sub Short-sequence
Attention Network [49]. These networks give the best possible formation of the tree
without losing the relative information between joints and edges or the
spatio-temporal information. Each node of the tree contains information about the
angle between its incoming and outgoing edges, the position of the node, and the
velocity of the joint.

Figure 3.4.5.1 Skeleton (left) and a possible tree of skeleton (right)
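For illustration, such a tree can be stored as parent and children maps built from a bone list; the sketch below roots the tree at joint 2 (the spine), as in Fig. 3.4.5.1. The bone list covers only the torso and right arm and follows the usual Kinect V2 joint numbering as an assumption; the project's full 25-joint tree may differ.

from collections import defaultdict, deque

# Partial, illustrative (child, parent) bone list for the torso and right arm only.
BONES = [(1, 2), (21, 2), (3, 21), (4, 3), (9, 21), (10, 9), (11, 10), (12, 11)]

def build_tree(bones, root=2):
    # Return parent[] and children[] maps rooted at `root` via breadth-first search.
    adjacency = defaultdict(list)
    for a, b in bones:
        adjacency[a].append(b)
        adjacency[b].append(a)
    parent = {root: None}
    children = defaultdict(list)
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in adjacency[node]:
            if nxt not in parent:           # not visited yet
                parent[nxt] = node          # edge parent -> nxt is nxt's incoming edge
                children[node].append(nxt)  # outgoing edges of `node`
                queue.append(nxt)
    return parent, children

parent, children = build_tree(BONES)
print(parent[4])   # 3: the head's parent is the neck in this partial hierarchy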

3.4.6 Movements of joints:


The difference between the positions of a joint in consecutive frames is extracted
as a feature for predicting actions. The position of the joint is recorded in each frame,
and the difference from the previous frame gives the displacement of the joint and
the speed at which it moves.

The movement is calculated as M_vt = v_(t+1) - v_t, where v_t is the position of the
joint in the current frame and v_(t+1) is its position in the next frame; subtracting the
two gives the movement of the joint, M_vt.

3.4.7 Deforming of edge


Deformation of an edge is the movement and change in position of an edge of
the skeleton. It is calculated by determining the position of the edge in each frame
and subtracting the previous frame's position from the current one, which gives the
direction and magnitude of the edge's movement. The deformation is calculated as
M_et = e_(t+1) - e_t, where e_t is the position of the edge in the current frame and
e_(t+1) is its position in the next frame; subtracting the two gives the displacement of
the edge between frames, M_et.
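A compact sketch of these two motion features is given below, assuming the skeleton sequence is stored as a NumPy array of shape (T, 25, 3); the bone pairs passed in the example are placeholders.

import numpy as np

def motion_features(joints, bones):
    # joints: (T, 25, 3) joint positions; bones: list of (child, parent) 1-based index pairs.
    # Joint movement: M_vt = v_(t+1) - v_t for every joint.
    joint_motion = joints[1:] - joints[:-1]                  # (T-1, 25, 3)
    # Edge (bone) vectors in every frame, then their frame-to-frame deformation M_et.
    child = np.array([b[0] - 1 for b in bones])
    parent = np.array([b[1] - 1 for b in bones])
    edges = joints[:, child] - joints[:, parent]             # (T, B, 3)
    edge_deformation = edges[1:] - edges[:-1]                # (T-1, B, 3)
    return joint_motion, edge_deformation

# Example with random data and two placeholder bones.
seq = np.random.rand(30, 25, 3)
jm, ed = motion_features(seq, bones=[(4, 3), (3, 21)])
print(jm.shape, ed.shape)   # (29, 25, 3) (29, 2, 3)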

3.5 Proposed Approach (Structure Tree Neural Network)


The purpose is to extract the spatio-temporal information from the tree using the
newly introduced Structured Tree Neural Network. The main idea is to take a
multi-layered network as input and to output updated node attributes of the tree.
These attributes are the properties that define the position of a tree node and the
angle it forms with the previous node, computed for each node by the functions h_c()
and h_p(), which process the child nodes and parent nodes respectively, where

h_c(n_i) = n_(i+1).info - n_i.info

h_p(n_i) = n_i.info - n_(i-1).info

However, these functions cover only adjacent nodes, which is not sufficient for
the overall processing; information from farther nodes is also needed. The computed
adjacency matrix is therefore mapped onto an attention map P, where A_O is the
original adjacency matrix and

A = P A_O

For stable training,

A = A_O + P

Our methodology comprises two parts: extracting spatial information, such as
the position of each node, and temporal information, such as velocity. Up to this
point all of the spatial information has been processed. For the temporal information
the system uses a temporal convolutional block, which applies a 1D convolution
along the temporal dimension; the 1D convolutional layer is followed by a ReLU
layer. A global average pooling layer followed by a softmax layer is added at the end
for class prediction. Fig. 3.5.1.2 shows the overall flow and Fig. 3.5.1.3 shows the
class prediction involved in Structured Tree Neural Networks.
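The following is a minimal PyTorch sketch of such a classification head under stated assumptions: node features of shape (batch, channels, joints, frames) are convolved along the temporal dimension, passed through ReLU, pooled globally and classified with a softmax. The channel sizes, kernel length and class count are illustrative choices, not the project's exact configuration.

import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    # Temporal convolution + ReLU + global average pooling + softmax classifier.
    def __init__(self, in_channels=3, hidden=64, num_classes=60, kernel=9):
        super().__init__()
        # A Conv2d with a (1, kernel) kernel acts as a 1D convolution along the frame axis.
        self.tconv = nn.Conv2d(in_channels, hidden, kernel_size=(1, kernel),
                               padding=(0, kernel // 2))
        self.relu = nn.ReLU()
        self.pool = nn.AdaptiveAvgPool2d(1)     # global average pooling over joints and frames
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                        # x: (batch, channels, joints, frames)
        x = self.relu(self.tconv(x))
        x = self.pool(x).flatten(1)              # (batch, hidden)
        return torch.softmax(self.fc(x), dim=1)  # class probabilities

probs = TemporalHead()(torch.randn(2, 3, 25, 30))
print(probs.shape)                               # torch.Size([2, 60])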

Figure 3.5.1.2 Flow of the proposed approach, Structured Tree Neural Networks.

Figure 3.5.1.3 Class prediction involved in Structured Tree Neural Networks.

3.6 Pseudo Code


Generate joint data for training using raw skeleton files
Input: skeleton files of the dataset
Output: joint data of the skeleton
1. Place raw skeleton files in the dataset directory
2. setDefaultDirectories()
3. foreach b in benchmarks
4.     foreach p in parts
5.         check outpath
6.         print path and benchmark
7.         gendata()
8.     endfor
9. endfor

Generate bone data for training using raw skeleton files
Input: joint data of the skeleton
Output: bone data of the skeleton
1. if joint data exists then
2.     load()
3.     foreach d in datasets
4.         foreach s in sets
5.             copyTensors()
6.             joinBones()
7.         endfor
8.     endfor
9. else
10.     print("Please generate joint data first")
11. endif

Generate motion data using bone and joint data
Input: joint and bone data of the skeleton
Output: motion data of the skeleton
1. loadBoneData()
2. loadJointData()
3. insertMotionDifferences()

Training
Input: motion data of the skeleton
Output: trained model
1. feeders_train()
2. feeders_test()
3. model_creation()
4. optimization()
5. training()

Testing
Input: motion data, trained model, RGB-D video
Output: recognized class
1. feeders()
2. loadModel()
3. test()
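As a rough Python counterpart to the bone and motion generation steps above, the core operations reduce to a subtraction along bones and a temporal difference. The array layout (N, C, T, V), the partial bone list and the file names are assumptions for illustration; the project's actual generation scripts may differ.

import numpy as np

# Partial, illustrative (child, parent) bone pairs; the full 25-joint list may differ.
BONE_PAIRS = [(1, 2), (21, 2), (3, 21), (4, 3)]

def generate_bone_data(joint_data):
    # joint_data: (N, C, T, V) = samples, coordinates, frames, joints (pairs above are 1-based).
    bone_data = np.zeros_like(joint_data)
    for child, parent in BONE_PAIRS:
        bone_data[:, :, :, child - 1] = (joint_data[:, :, :, child - 1]
                                         - joint_data[:, :, :, parent - 1])
    return bone_data

def generate_motion_data(data):
    # Frame-to-frame difference of joint or bone data; the last frame is padded with zeros.
    motion = np.zeros_like(data)
    motion[:, :, :-1, :] = data[:, :, 1:, :] - data[:, :, :-1, :]
    return motion

joints = np.load("train_joint.npy")   # hypothetical output file of the joint-data step
np.save("train_bone.npy", generate_bone_data(joints))
np.save("train_bone_motion.npy", generate_motion_data(generate_bone_data(joints)))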

3.7 Scenario of Processing


Initially, a 3D video from NTU RGB+D, recorded for different human subjects by
three separate action-view cameras, is given to the system as input. Preprocessing
techniques break the video into multiple frames so that detailed information about the
subjects can be extracted while maintaining quality; noise and deformed subjects
(persons) are removed step by step for accurate and successful subject detection. The
noise is removed with a median filter and the image is sharpened with a Laplacian
filter. Using the standard PCL library, skeletons of the detected human bodies are
formed. At this point the image still contains a lot of unnecessary information, so the
skeleton is moved to a new, empty 3D plane to give a clear visualization of the
movements. Certain skeletal features are extracted through a body pose evolution
map, late fusion and the cross-setup protocol. Mainly two features are extracted: the
movement of nodes and the deformation of bones across frames. These features are
obtained by processing frame differences and passing the collected data through BN
and ReLU layers. The features are then fed into global average pooling and softmax
layers to give the recognized class as the output.

3.8 Expected Output
After the features are obtained from the softmax layer, they are compared with
the trained model of the system. For the final output, the labeled action class is
shown, provided that the similarity between the extracted features and the trained
features exceeds 70%, i.e. at least 10 out of 15 skeleton points should match the
trained features. The output is a video displaying the detected subject inside a
rectangular bounding box with the recognized action labeled in the corner.

3.9 Summary
In this chapter a novel methodology for human action recognition has been
introduced. A graphical tree skeleton structure provides the joint information from
which features are extracted, and these features later serve as the basis for
classification. The recognized actions are classified and labeled into different
categories.

4. Implementation

4.1 Development Tool

The development tool used is Python 3.7.5 (64-bit). Python was chosen because it is an
interpreted, object-oriented, high-level programming language with dynamic semantics whose
syntax emphasizes readability, which minimizes the cost of program maintenance.
Maintenance is one of Python's strong points, because it supports numerous packages and
modules, which encourages program modularity and code reuse. The other tool in use is the
Command Prompt (CMD), which is used to install the project's dependencies and to run the
project.

4.2 Implementation issues

Human action recognition faces a major bottleneck in accommodating anthropometric variations without degrading the execution rate; these variations make the system more complex. The main technical issue was the integration of the CUDA toolkit and the cuDNN deep neural network library, which provide the GPU-accelerated primitives needed to process the many nodes of the tree together. An Nvidia GPU would greatly accelerate this processing; without one, the workload demands a great deal of processing speed and power. The CPU usage measured during processing exceeded 95%, which heated up and slowed down the machine. To counter this problem, a GPU cloud instance was used. Another problem was TensorFlow itself: with the release of TensorFlow 2.0, Google deprecated many attributes and functions, and migrating the code to TensorFlow 2.0 was a considerable challenge, as illustrated by the sketch below.
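As a rough illustration of the migration issue, the sketch below shows the common compatibility pattern for running TensorFlow 1.x-style code under TensorFlow 2.x via tf.compat.v1; the shapes (25 joints times 3 coordinates, 60 classes) are assumptions used only for the example.

import tensorflow as tf

tf1 = tf.compat.v1
tf1.disable_eager_execution()   # run in the old graph/session mode

# Deprecated 1.x-style constructs kept working through the compat module.
x = tf1.placeholder(tf.float32, shape=(None, 25 * 3), name="joints")
w = tf1.get_variable("w", shape=(25 * 3, 60))
logits = tf1.matmul(x, w)

with tf1.Session() as sess:
    sess.run(tf1.global_variables_initializer())
    out = sess.run(logits, feed_dict={x: [[0.0] * 75]})
    print(out.shape)   # (1, 60)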

The system's high computational cost comes from the training time required by the neural network. One might ask: why neural networks at all? The relationship between a system and a human is defined by how efficiently the system can learn from the human; this is what makes it intelligent. As discussed before, human behavior has always been an important factor for a better understanding of computer vision. Until the system can recognize all possible human actions, it cannot become a strong, intelligent machine.

The STNN algorithm, however, classifies actions using an acyclic structure, i.e. a tree. Each child node is restricted to a single parent, so no cycles can form between nodes. This saves computational time, since the tree can be traversed without revisiting nodes. A deep neural network combined with a structural tree does add complexity, as the layers involved are inter-linked, but the relation between any pair of nodes is definite, with no possibility of repetition. Overall this makes the structure less complicated than a comparable directed graph network (DGNN), so we consider it a minor issue.

Another major issue was acquiring a dataset for training. The system uses the NTU RGB+D dataset published by NTU Singapore, and formal approval from a senior authority is needed to access it. The difficulty with such a large dataset is handling all of its aspects within a reasonable response time: training on it took almost 10 days.

4.3 Configuration Management

There are mainly three components to configure and maintain for the STNN implementation. The first is TensorFlow, an open-source library provided by Google for implementing machine learning, computer vision, natural language processing and many other algorithms. To configure this library, run a single command at the command prompt, i.e.

pip install tensorflow

In this project this library is used to train the system. The second is NumPy, a Python library that supports large multi-dimensional arrays and matrices. To configure this library, run the following command at the command prompt:

pip install numpy

The extracted features are stored on disk using this library. The last component is OpenPose, a library used to fit a skeleton onto the human body; for configuration, download the CPU-only binary of the library.

The plotting of the skeleton was done with the matplotlib library. This is a plotting library for the Python programming language; it follows an object-oriented approach, which makes it very easy for developers to plot graphs (a small sketch follows the install command below). To configure it, run the command:

pip install matplotlib
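A minimal sketch of plotting a skeleton on a 3D grid with matplotlib is given below; the joint coordinates and bone list are made-up illustrations rather than the NTU RGB+D joint layout.

import matplotlib.pyplot as plt
import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)

# Illustrative joint positions (x, y, z) and the bones connecting them.
joints = np.array([[0.0, 0.0, 1.0],    # head
                   [0.0, 0.0, 0.7],    # spine (root)
                   [-0.3, 0.0, 0.8],   # left hand
                   [0.3, 0.0, 0.8],    # right hand
                   [-0.1, 0.0, 0.0],   # left foot
                   [0.1, 0.0, 0.0]])   # right foot
bones = [(0, 1), (1, 2), (1, 3), (1, 4), (1, 5)]

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(joints[:, 0], joints[:, 1], joints[:, 2], c="red")      # joints as points
for a, b in bones:                                                  # bones as lines
    ax.plot([joints[a, 0], joints[b, 0]],
            [joints[a, 1], joints[b, 1]],
            [joints[a, 2], joints[b, 2]], c="blue")
plt.show()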

4.4 Framework Section for development

Figure 4.4.1 demonstrates the flow of the proposed Structured Tree Neural Network approach.

Initially, a skeleton tree is constructed from the OpenPose output, taking the spine joint as the root node, as in the sketch that follows this paragraph. Every node in the tree consists of two portions: a data portion and a children portion (references to children). The data portion holds the x, y, z coordinates and the angle between the node's incoming and outgoing edges. The children portion holds three references to child nodes (left, right and centre), since the tree is not a binary tree. After the frame difference for each node is calculated, the node is updated with its new position and angle, and the change (difference) is saved in memory. The changes and updated nodes are then mapped onto an attention map to obtain details about the nodes that are not directly connected. To form the input signal for feature extraction, 1D convolution and global pooling are applied to extract the two features, node difference and angle difference; a ReLU layer then removes redundant features. Finally, a softmax layer selects the best-matching class and sends it to the output.
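The sketch below is one possible Python rendering of the node structure just described, with a data portion (coordinates and edge angle) and a children portion holding left, centre and right references; the class and field names are illustrative assumptions, not the project's actual identifiers.

from dataclasses import dataclass
from typing import Optional

@dataclass
class SkeletonNode:
    x: float
    y: float
    z: float
    angle: float = 0.0                        # angle between incoming and outgoing edges
    left: Optional["SkeletonNode"] = None     # children portion: three references
    centre: Optional["SkeletonNode"] = None
    right: Optional["SkeletonNode"] = None

    def update(self, new_x: float, new_y: float, new_z: float, new_angle: float):
        """Apply a frame difference: return the change and move to the new pose."""
        diff = (new_x - self.x, new_y - self.y, new_z - self.z, new_angle - self.angle)
        self.x, self.y, self.z, self.angle = new_x, new_y, new_z, new_angle
        return diff                            # the saved change later used as a feature

# Hypothetical usage: the spine joint is the root of the tree.
spine = SkeletonNode(0.0, 0.0, 0.9)
spine.left = SkeletonNode(-0.2, 0.0, 0.9)      # e.g. a shoulder joint
print(spine.update(0.0, 0.05, 0.9, 0.1))       # (0.0, 0.05, 0.0, 0.1)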

4.5 Deployment factors

The STNN is implemented using the TensorFlow library. TensorFlow was chosen as the standard library because it is an open-source machine-learning library that, since version 2.0, supports eager execution and dynamic computation graphs. The STNN could also be implemented with other ML libraries such as PyTorch or pandas, or even with other tools such as MATLAB.

4.5.1 PyTorch

PyTorch is an open-source ML library developed at the Facebook AI Research (FAIR) lab; its primary applications are computer vision and natural language processing. The main reasons for not using PyTorch here are that it has fewer supporting resources and, in our experience, ran more slowly than TensorFlow for this task; its GPU acceleration is tied to Nvidia GPUs; and its practical memory requirement (around 16 GB of RAM) is much higher than what typical laptops, usually shipped with 4 GB to 8 GB, provide.

4.5.2 Pandas

In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It was not used here because it is oriented towards data analysis rather than deep learning, and in our tests its response time was much longer and its supporting resources far fewer than TensorFlow's.

4.5.3 MATLAB

MATLAB is a programming platform designed specifically for engineers and scientists. The heart of MATLAB is the MATLAB language, a matrix-based language allowing the most natural expression of computational mathematics. It was not used here because, compared with TensorFlow, it demands considerably more RAM and CPU.

4.6 Summary

A graphical tree skeleton structure provides the joint information used to extract features, which in turn serve as the basis for classifying actions. These actions are classified and labeled into different categories. Our aim, however, is to identify these actions with minimal performance load and maximum accuracy.

5. System Testing
5.1 Introduction
Earlier chapters of this project covered the completion of the implementation, i.e. the libraries, dependencies, environment and modular integration. This chapter covers system testing, which deals with the correctness and precision of the system; its purpose is to validate and verify the system's functionality against the defined scope. An important part of testing this project is the discussion of the system's accuracy and its comparison with previous techniques. The system uses trees to classify human actions; the inspiration for using trees comes from the natural shape of the human body: if only the joints are taken and joined with lines, the result has a tree-like shape. With this in mind, the Structured Tree Neural Network technique was proposed.

5.2 Output
In this project, the STNN has four layers: input, output, and upper and lower hidden layers. The input layer consists of 25 nodes, each representing a skeletal joint. The upper hidden layer handles feature extraction; its output is passed to the lower hidden layer, which removes nodes with redundant features. The remaining nodes are passed to the output layer, which determines the performed action, as depicted in figure 5.2.1. A loose sketch of this layered structure is given below the figure.

Figure 5.2.1: Depiction of the layers of the STNN
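A loose Keras sketch of this four-layer arrangement is shown below. It is an approximation for illustration only: dropout stands in for the redundancy-removal role of the lower hidden layer, and the layer widths and the 60-class output are assumptions.

import tensorflow as tf

NUM_JOINTS, NUM_CLASSES = 25, 60   # 25 skeletal joints; class count is an assumption

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_JOINTS, 3)),               # input layer: one node per joint
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),               # upper hidden layer: feature extraction
    tf.keras.layers.Dropout(0.5),                                # lower hidden layer: discard redundant features
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),    # output layer: performed action
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()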

The benefit of this layered approach, together with the simplicity of the data structure, is high accuracy: the project achieves an accuracy of 96.3%, and the corresponding confusion matrix is shown in figure 5.2.2.

Figure 5.2.2 Accuracy confusion matrix.

5.3 Automatic Testing


Automatic testing means that a tool checks whether the system produces the desired output: the expected output is fed into the tool in advance, and the tool then inspects the actual output and reports whether each test passed. Since this project is implemented in Python, the automatic-testing tool is Python-specific; unittest is used. The unittest framework was originally inspired by JUnit and has a similar flavor to the major unit-testing frameworks in other languages. It supports test automation, sharing of setup and shutdown code for tests, aggregation of tests into collections, and independence of the tests from the reporting framework [50]. Every unit of the system is tested and every test passes, as shown in figure 5.3.1. A small sketch of such a test case is shown after the figure.

Figure 5.3.1: All tests passed in automatic testing
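Below is a small sketch of a unittest test case in the style used for this project; the function under test is a hypothetical stand-in for one of the system's units, not the project's actual code.

import unittest

def accept_label(matched_points: int, min_matches: int = 10) -> bool:
    # Hypothetical stand-in: the 10-of-15 skeleton-point matching rule from section 3.8.
    return matched_points >= min_matches

class TestRecognition(unittest.TestCase):
    def test_enough_points_match(self):
        self.assertTrue(accept_label(12))

    def test_too_few_points_match(self):
        self.assertFalse(accept_label(7))

if __name__ == "__main__":
    unittest.main()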

5.4 Statistical and Graphical Analysis


Table 5.4.1: Statistical analysis

Technique                                            Year    Accuracy
Silhouette sequences [13]                            2009    82.3%
3D Flow Estimation [10]                              2013    87.9%
Eigen joints [18]                                    2014    90.9%
Joint spatial graphs [17]                            2017    94.5%
Kinematic similarities with depth motion maps [23]   2017    98.5%
Directed Graph Neural Network [24]                   2019    96.1%
Structured Tree Neural Network (proposed)            2020    96.3%

Graph 5.4.1: Graphical analysis of recognition accuracy (%) for the techniques in Table 5.4.1, plotted by year (2009 to 2020)

5.5 Complete User Interface


5.5.1 Layout 1:
For the implementation of this project, all of the coding is done in Python, following the steps described earlier and producing output at each step. Python 3.7 is used to code the logic, while the GUI of the system is built with the MATLAB R2018b tool. The GUI contains six buttons, each referring to one step from taking the input to recognition, an axes area used to display the step-wise output, and a table used to display the feature points extracted from the tree. Fig 5.5.1.1 shows the empty layout.

Figure 5.5.1.1: The initial, empty layout.

5.5.2 Layout 2:
In this step the user selects the input video by clicking the Select Video Input button, which opens an input dialog. Once the system has read the selected video, it is displayed in the axes portion, as shown in figure 5.5.2.1.

Figure 5.5.2.1: Selection of the input video via the Select Video Input button.

5.5.3 Layout 3:
In this layout, skeletons are drawn on the humans in the selected video by clicking the Make Skeleton button. Fig 5.5.3.1 displays the output video in which the skeleton is drawn on the human body.

Figure 5.5.3.1: Output of the skeletons displayed.

5.5.4 Layout 4:
In this step the skeleton is extracted onto a 3D grid; the user clicks the Extract Skeleton button. Fig 5.5.4.1 shows the output of the extracted skeleton on the 3D grid.

Figure 5.5.4.1: Output of the extracted skeleton.

5.5.5 Layout 5:
In this step the trees are formed from the extracted skeleton: a tree is built, the nodes with frequent changes in value are identified, and those nodes are separated into subtrees that are traversed in the next step. The user clicks the Structuring Tree button. Fig 5.5.5.1 displays the formed tree along with its subtrees; circles and lines are used to draw the tree.

Figure 5.5.5.1: Output of the Structured tree.

5.5.6 Layout 6:

In this step the feature points are extracted from the subtrees and displayed in the table; these features are used in the next step to recognize the action performed in the video. The user clicks the Extract Features button. Fig 5.5.6.1 shows the extracted features in the table.

Figure 5.5.6.1: Output of the extracted Features from the tree.

5.5.7 Layout 7:
In this step the extracted features are passed to the classifier, which recognizes the performed action by matching the features against the trained model. The user clicks the Recognition button. Fig 5.5.7.1 shows the final output video with the action label at the top-left corner.

Figure 5.5.7.1: Output of the performed action along with the feature points used.

5.6 Summary

The ability of automated technologies to correctly identify a human's actions provides considerable scope for systems that make use of human-machine interaction, so automatic 3D human action recognition is an area that has seen significant research effort. In the work described here, everyday 3D human actions recorded in the NTU RGB+D dataset are identified using a novel structured-tree neural network. The nodes of the tree represent the skeleton joints, with the spine joint as the root. The connection between a child node and its parent is the incoming edge, while the reciprocal connection is the outgoing edge. The use of a tree structure leads to a system that maps intuitively onto human movements. The classifier uses the change in displacement of the joints and the change in the angles between incoming and outgoing edges as features for classifying the performed actions.

6. Conclusion
We live in a three-dimensional physical world, and computer vision is critical for understanding that 3D world and interacting with it. It is therefore important to teach computers to see and understand the world the way a human being does: humans rely heavily on their eyes to carry out daily activities and accomplish tasks, and a machine that achieves the same capability through computer vision techniques would be enormously useful. From an engineering point of view, computer vision aims to build autonomous systems that perform some of the tasks the human visual system can perform, and in many cases even surpass it. Many vision tasks involve extracting 3D and temporal information from time-varying 2D data, in other words from videos, and the two goals are intimately related.
Computer vision contains many sub-fields. Human action recognition is one of its most prominent because of its wide range of applications, and it has a great impact on human-robot interaction. To enable computers to recognize human actions we need low-cost, highly efficient and fast algorithms, which is why this research area has been a major focus of researchers over the last few years. Inspired by this, the present contribution proposes an approach that uses trees to manipulate the human skeleton and recognize human actions.
Before moving to the proposed approach, consider the previous approaches in this research area. In 2009, silhouette sequences were used to classify human actions; the technique became popular because of its simplicity, but the subject's posture relative to the camera affected its performance: when a person faced the camera while performing an action the results were very good, but when the person did not face the camera they were not. To resolve this issue, depth motion maps were introduced. These are real 3D data, perform well under different lighting conditions, and are insensitive to the subject's posture relative to the camera, but they are very large. To reduce the data size, skeleton points were introduced; these are also real 3D data with exactly the same advantages as depth motion maps but at a significantly smaller size. Today they are ideal for human action recognition because of their simplicity and small size, and they can be represented by many data structures, such as graphs and trees. Skeleton points rapidly became the most widely used representation for human action recognition.
Now, what is a human action? A human action is the result of joint movements of the human body, such as the elbows and hands for waving, or the ankles and knees for running or walking. Joining these joints forms a skeleton, and viewed from a computer scientist's point of view a skeleton forms a tree. A tree structure constructed from the skeleton simplifies processing and reduces the performance load compared with graphs, because in graphs the edges must also be traversed and processed, whereas in a tree only the nodes are traversed. The actions are classified using two features: joint movement and bone deformation. Joint movement captures how much a joint moves, and along which axis, from frame to frame; bone deformation captures how much the angle between the incoming and outgoing edges of a node changes. Since tree edges do not carry this angle explicitly, the slopes of the two edges are computed, the angle between those slopes is found, and the change in this angle is used as a feature for classification, as sketched below. The technique uses a layered approach: the first (input) layer takes skeleton data and forms a tree, the second layer extracts the features of every node in the tree, the third layer removes nodes with redundant features, and finally the output layer classifies the action and attaches the action label. This layered approach and the simplicity of the structure lead to 96.3% accuracy on the NTU RGB+D dataset.
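As a worked illustration of the bone-deformation feature, the sketch below computes the angle between a node's incoming and outgoing edges from joint coordinates, and then the change of that angle across two frames; the coordinates and function name are assumptions used only for illustration.

import numpy as np

def edge_angle(parent: np.ndarray, node: np.ndarray, child: np.ndarray) -> float:
    """Angle (radians) between the incoming and outgoing edges at a node."""
    incoming = node - parent
    outgoing = child - node
    cos_theta = np.dot(incoming, outgoing) / (
        np.linalg.norm(incoming) * np.linalg.norm(outgoing))
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Bone deformation: the change of this angle between two consecutive frames.
frame_t  = edge_angle(np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 0.7]),
                      np.array([0.2, 0.0, 0.5]))
frame_t1 = edge_angle(np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 0.7]),
                      np.array([0.3, 0.0, 0.6]))
deformation = frame_t1 - frame_t
print(round(deformation, 4))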

7. Major Contributions

All of the candidates involved, together with Dr. Sajid, planned and arranged the experiments. Misha, Nisar and Junaid carried out the analyses. Nisar and Junaid arranged and completed the implementation. Nisar contributed to the interpretation of the results. Misha took the lead in writing the thesis manuscript. All authors provided critical feedback and helped shape the research, the analysis and the complete thesis.

8. References
[1] H. S. Robotics. (2015, 26/09/2019). Point Cloud. Available:
https://www.sciencedirect.com/topics/engineering/point-cloud
[2] S. Zhang, Z. Wei, J. Nie, L. Huang, S. Wang, and Z. Li, "A review on human activity recognition using vision-based method," Journal of Healthcare Engineering, 2017.
[3] A. Shafaei and J. J. Little, "Real-time human motion capture with multiple depth
cameras," in 2016 13th Conference on Computer and Robot Vision (CRV), 2016, pp.
24-31: IEEE.
[4] S. Berretti, M. Daoudi, P. Turaga, and A. Basu, "Representation, analysis, and recognition of 3D humans: A survey," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 14, no. 1s, p. 16, 2018.
[5] L. Ge, Y. Cai, J. Weng, and J. Yuan, "Hand PointNet: 3d hand pose estimation using
point sets," in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2018, pp. 8417-8426.
[6] H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian, "HOPC: Histogram of
oriented principal components of 3D pointclouds for action recognition," in
European conference on computer vision, 2014, pp. 742-757: Springer.
[7] F. Baradel, C. Wolf, J. Mille, and G. W. Taylor, "Glimpse clouds: Human activity
recognition from unstructured feature points," in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2018, pp. 469-478.
[8] M. Khokhlova, C. Migniot, and A. Dipanda, "3D Point Cloud Descriptor for Posture
Recognition," in VISIGRAPP (5: VISAPP), 2018, pp. 161-168.
[9] G. Johansson, "Visual perception of biological motion and a model for its analysis," Perception & Psychophysics, vol. 14, no. 2, pp. 201-211, 1973.
[10] M. Munaro, G. Ballin, S. Michieletto, and E. Menegatti, "3D flow estimation for human action recognition from colored point clouds," Biologically Inspired Cognitive Architectures, vol. 5, pp. 42-51, 2013.
[11] M. T. SRINI DHARMAPURI. (2018). Evolution of Point Cloud. Available:
https://lidarmag.com/2018/07/16/evolution-of-point-cloud/
[12] P. C. L. (PCL). (2017). People module for detecting people in unconventional poses.
Available: http://pointclouds.org/gsoc/
[13] R. B. Rusu, J. Bandouch, Z. C. Marton, N. Blodow, and M. Beetz, "Action
recognition in intelligent environments using point cloud features extracted from
silhouette sequences," in RO-MAN 2008-The 17th IEEE International Symposium on
Robot and Human Interactive Communication, 2008, pp. 267-272: IEEE.
[14] J. Zhang, W. Li, P. O. Ogunbona, P. Wang, and C. Tang, "RGB-D-based action recognition datasets: A survey," Pattern Recognition, vol. 60, pp. 86-105, 2016.
[15] L. Xia, C.-C. Chen, and J. K. Aggarwal, "View invariant human action recognition
using histograms of 3d joints," in 2012 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition Workshops, 2012, pp. 20-27: IEEE.
[16] L. Seidenari, V. Varano, S. Berretti, A. Bimbo, and P. Pala, "Recognizing actions
from depth cameras as weakly aligned multi-part bag-of-poses," in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013,
pp. 479-485.
[17] M. Li and H. Leung, "Graph-based approach for 3D human skeletal action recognition," Pattern Recognition Letters, vol. 87, pp. 195-202, 2017.
[18] X. Yang and Y. Tian, "Effective 3D action recognition using eigenjoints," Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 2-11, 2014.

[19] G. T. Papadopoulos, V. Mezaris, I. Kompatsiaris, and M. G. Strintzis, "Accumulated
motion energy fields estimation and representation for semantic event detection," in
Proceedings of the 2008 international conference on Content-based image and video
retrieval, 2008, pp. 221-230: ACM.
[20] S. McCann and D. G. Lowe, "Local naive bayes nearest neighbor for image
classification," in 2012 IEEE Conference on Computer Vision and Pattern
Recognition, 2012, pp. 3650-3656: IEEE.
[21] M. A. R. Ahad, Motion history images for action recognition and understanding.
Springer Science & Business Media, 2012.
[22] X. Yang, C. Zhang, and Y. Tian, "Recognizing actions using depth motion maps-
based histograms of oriented gradients," in Proceedings of the 20th ACM
international conference on Multimedia, 2012, pp. 1057-1060: ACM.
[23] Q. Wu, G. Xu, L. Chen, A. Luo, and S. Zhang, "Human action recognition based on kinematic similarity in real time," PLoS ONE, vol. 12, no. 10, p. e0185719, 2017.
[24] L. Shi, Y. Zhang, J. Cheng, and H. Lu, "Skeleton-Based Action Recognition with
Directed Graph Neural Networks," in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2019, pp. 7912-7921.
[25] S. C. Akkaladevi and C. Heindl, "Action recognition for human robot interaction in
industrial applications," in 2015 IEEE International Conference on Computer
Graphics, Vision and Information Security (CGVIS), 2015, pp. 94-99: IEEE.
[26] L. OpenGL. Coordinate Systems. Available: https://learnopengl.com/Getting-
started/Coordinate-Systems
[27] I. Wikimedia Foundation. Orthographic Projection. Available:
https://en.wikipedia.org/wiki/Orthographic_projection
[28] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3d points," in
2010 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition-Workshops, 2010, pp. 9-14: IEEE.
[29] J. Sung, C. Ponce, B. Selman, and A. Saxena, "Human activity detection from RGBD
images," in Workshops at the twenty-fifth AAAI conference on artificial intelligence,
2011.
[30] B. Ni, G. Wang, and P. Moulin, "Rgbd-hudaact: A color-depth video database for
human daily activity recognition," in 2011 IEEE international conference on
computer vision workshops (ICCV workshops), 2011, pp. 1147-1153: IEEE.
[31] J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Mining actionlet ensemble for action
recognition with depth cameras," in 2012 IEEE Conference on Computer Vision and
Pattern Recognition, 2012, pp. 1290-1297: IEEE.
[32] Z. Cheng, L. Qin, Y. Ye, Q. Huang, and Q. Tian, "Human daily action analysis with
multi-view and color-depth data," in European Conference on Computer Vision,
2012, pp. 52-61: Springer.
[33] H. S. Koppula, R. Gupta, and A. Saxena, "Learning human activities and object affordances from RGB-D videos," The International Journal of Robotics Research, vol. 32, no. 8, pp. 951-970, 2013.
[34] O. Oreifej and Z. Liu, "Hon4d: Histogram of oriented 4d normals for activity
recognition from depth sequences," in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2013, pp. 716-723.
[35] P. Wei, Y. Zhao, N. Zheng, and S.-C. Zhu, "Modeling 4d human-object interactions
for event and object recognition," in Proceedings of the IEEE International
Conference on Computer Vision, 2013, pp. 3272-3279.
[36] J. Wang, X. Nie, Y. Xia, Y. Wu, and S.-C. Zhu, "Cross-view action modeling,
learning and recognition," in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2014, pp. 2649-2656.
[37] K. Wang, X. Wang, L. Lin, M. Wang, and W. Zuo, "3d human activity recognition
with reconfigurable convolutional neural networks," in Proceedings of the 22nd ACM
international conference on Multimedia, 2014, pp. 97-106: ACM.

[38] C. Chen, R. Jafari, and N. Kehtarnavaz, "Utd-mhad: A multimodal dataset for human
action recognition utilizing a depth camera and a wearable inertial sensor," in 2015
IEEE International conference on image processing (ICIP), 2015, pp. 168-172:
IEEE.
[39] H. Rahmani, A. Mahmood, D. Huynh, and A. Mian, "Histogram of oriented principal components for cross-view action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 12, pp. 2430-2443, 2016.
[40] N. Xu, A. Liu, W. Nie, Y. Wong, F. Li, and Y. Su, "Multi-modal & multi-view &
interactive benchmark dataset for human action recognition," in Proceedings of the
23rd ACM international conference on Multimedia, 2015, pp. 1195-1198: ACM.
[41] J.-F. Hu, W.-S. Zheng, J. Lai, and J. Zhang, "Jointly learning heterogeneous features
for RGB-D activity recognition," in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2015, pp. 5344-5352.
[42] J. Liu et al., "NTU RGB+ D 120: A Large-Scale Benchmark for 3D Human Activity
Understanding," 2019.
[43] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "Ntu rgb+ d: A large scale dataset for
3d human activity analysis," in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2016, pp. 1010-1019.
[44] J. Liu, A. Shahroudy, M. L. Perez, G. Wang, L.-Y. Duan, and A. K. Chichung, "NTU
RGB+ D 120: A Large-Scale Benchmark for 3D Human Activity Understanding,"
IEEE transactions on pattern analysis and machine intelligence, 2019.
[45] R. H. Chan, C.-W. Ho, and M. Nikolova, "Salt-and-pepper noise removal by median-
type noise detectors and detail-preserving regularization," IEEE Transactions on
image processing, vol. 14, no. 10, pp. 1479-1485, 2005.
[46] T. Chen, K.-K. Ma, and L.-H. Chen, "Tri-state median filter for image denoising,"
IEEE Transactions on Image processing, vol. 8, no. 12, pp. 1834-1838, 1999.
[47] ScienceDirect. Median Filter. Available:
https://www.sciencedirect.com/topics/computer-science/median-filter
[48] R. B. Rusu and S. Cousins, "3d is here: Point cloud library (pcl)," in 2011 IEEE
international conference on robotics and automation, 2011, pp. 1-4: IEEE.
[49] Z. Yang, Y. Li, J. Yang, and J. Luo, "Action recognition with spatio-temporal visual attention on skeleton image sequences," IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[50] unittest — Unit testing framework. Available:
https://docs.python.org/3/library/unittest.html#module-unittest

