Human Action Recognition using 3D point clouds
A project submitted
in partial fulfillment of the requirements for the degree of
Bachelor of Science in Computer Science
by
Misha Karim (UET-16f-BSCS-02)
Nisar Bahoo (UET-16f-BSCS-84)
Muhammad Junaid Khalid (UET-16f-BSCS-92)
Supervised by
Dr. Muhammad Sajid Khan
ACKNOWLEDGEMENT
We cannot express enough thanks to our supervisor, Dr. Sajid Khan, for his continued support and encouragement; we offer our sincere gratitude for the learning opportunities he provided.
The completion of this project could not have been accomplished without the support of one another. To the friend who lent us his GPU-equipped laptop for the implementation: thank you for the time you allowed us to research and write. Thanks to our parents as well; the occasions on which you accommodated our rushed timetables will not be forgotten.
UNDERTAKING
This is to declare that the project entitled "Human Action Recognition using 3D Point Clouds" is an original work done by the undersigned, in partial fulfillment of the requirements for the degree "Bachelor of Science in Computer Science" at the Computer Science/Software Engineering Department, Army Public College of Management & Sciences, affiliated with UET Taxila, Pakistan.
All the analysis, design and system development have been accomplished by the undersigned. Moreover, this project has not been submitted to any other college or university.
Date:
CERTIFICATE
Date:
ABSTRACT
Table of Contents

1. Introduction
   1.1 Objective
   1.2 Motivation
   1.3 Problem Definition
   1.4 Scope
   1.5 Problem Solution
   1.6 Existing Systems
   1.7 Project Breakdown Structure
   1.8 Block Diagram
   1.9 Applications
2. Literature Review
   2.1 Related Work
   2.2 Analysis
       2.2.1 Analytical Graph
   2.3 State-of-the-Art
   2.4 Problem Solution
3. Methodology
   3.1 Introduction
       3.1.1 Coordinate System
       3.1.2 Orthographic Environment
   3.2 Inputs to System (Dataset)
   3.3 System Requirements
   3.4 Overview of System
       3.4.1 3D Video Inputs
       3.4.2 Pre-Processing
       3.4.3 Human Body Detection and Getting Skeleton
       3.4.4 Skeleton Extraction
       3.4.5 Neuron Tree Model
       3.4.6 Movements of Joints
       3.4.7 Deforming of Edge
   3.5 Proposed Approach (Structured Tree Neural Network)
   3.6 Pseudo Code
   3.7 Scenario of Processing
   3.8 Expected Output
   3.9 Summary
4. Implementation
   4.1 Development Tool
   4.2 Implementation Issues
   4.3 Configuration Management
   4.4 Framework Section for Development
   4.5 Deployment Factors
       4.5.1 PyTorch
       4.5.2 Pandas
       4.5.3 MATLAB
   4.6 Summary
5. System Testing
   5.1 Introduction
   5.2 Output
   5.3 Automatic Testing
   5.4 Statistical and Graphical Analysis
   5.5 Complete User Interface
       5.5.1 Layout 1
       5.5.2 Layout 2
       5.5.3 Layout 3
       5.5.4 Layout 4
       5.5.5 Layout 5
       5.5.6 Layout 6
       5.5.7 Layout 7
   5.6 Summary
6. Conclusion
7. Major Contributions
8. References
List of Figures

Figure 1.1: Point cloud representation of humans
Figure 1.2: Extracting skeleton points from point clouds
Figure 1.1.1: Point cloud extraction
Figure 1.2.1: Human Action Recognition is dependent on multiple disciplines
List of Tables

Table 2.2.1: Comparative summary of existing systems
List of Graphs

Graph 2.2.1.1: Accuracies of existing systems
1. Introduction:
Figure 1.1 Point cloud representation of Humans
Various point cloud datasets help researchers explore tasks such as image classification, object detection, and semantic segmentation. A point cloud is a collection of 3D positions in a 3D coordinate system, commonly defined by x, y, and z coordinates, that represents the surface of an object or a space; it does not contain data about internal features, color, materials, and so on. Point clouds are becoming the basis for highly accurate 3D models, used for measurements and calculations directly in or on the object, e.g. distances, diameters, curvatures, or cubature. They are therefore not only a great source of information for 3D feature and object recognition, but also for deformation analysis of surfaces.
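As a small illustration (not part of the original system), a point cloud can be stored as a simple N x 3 NumPy array of (x, y, z) positions and queried for basic measurements; the coordinate values below are made up:

```python
import numpy as np

# Hypothetical example: a tiny point cloud stored as an N x 3 array of (x, y, z) positions.
points = np.array([
    [0.10, 1.52, 2.30],
    [0.12, 1.50, 2.31],
    [0.09, 1.48, 2.29],
], dtype=np.float32)

# Axis-aligned bounding box extent and centroid, two simple measurements
# that can be computed directly on the raw points.
bbox_min, bbox_max = points.min(axis=0), points.max(axis=0)
centroid = points.mean(axis=0)
print("extent:", bbox_max - bbox_min, "centroid:", centroid)
```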
For reference, a recent framework proposed by [3] uses several Kinect datasets (the MSR-3D and UTKinect-3D-Action datasets are sufficient for training and testing to demonstrate its capabilities) and a deep CNN (Convolutional Neural Network) architecture that exploits spatial arrangement. The currently prevailing strategy is to use machine learning methods such as space partitioning. Using depth-image features and skeleton points, 3D posture can be recognized from point clouds in a multi-camera scenario [3].
As the field evolved, views derived from 3D point clouds made observing and analyzing human actions more reliable, and 3D point cloud systems are more accurate than depth sequences [4]. Initially the task was carried out by human operators, who had to monitor surveillance footage and detect crimes along with many other duties; with the increasing number of camera views and technical monitoring devices, however, the task became not only more challenging for the operators but also cost-intensive, given the response times required. The Human Action Recognition system plays a wide role in different applications such as video surveillance systems (used to capture illegal actions performed in front of CCTV cameras), video retrieval (used to automatically retrieve a specific portion of a recording by recognizing actions), healthcare monitoring applications (used to constantly monitor a patient's physiological parameters), and robot vision (understanding a human action and responding accordingly). Although these applications have shown tremendous progress in action recognition, they still exhibit problems with response time and performance because only a specified set of actions is trained.
1.1 Objective:
1.2 Motivation:
Our aim is to recognize actions using 3D point clouds because, compared with purely 2D approaches, they are more accurate, relatively cheap to capture with modern depth sensors, and more robust to changes in viewpoint.
Figure 1.2.1. Human Action Recognition is dependent on multiple disciplines.
If the figure is observed closely, it shows that work such as 3D action recognition does not depend on a single concept; rather, it is a blend of statistics, machine learning, and artificial intelligence combined.
“In the arena of human life, the honors and rewards fall to those
who show their good qualities in action.”
- Aristotle
This quote of Aristotle is relevant, as it relates two different dimensions of the same problem: action recognition. With the emergence of 3D point clouds, the challenge of representing images has eased, since 2D images could not capture properties such as depth and viewpoint. With a 3D view, action recognition systems can examine the perceived action more thoroughly, because the data covers various viewpoints.
“To develop a computer system that recognizes and infers human actions, based on their complete execution, using 3D point clouds”
Many existing systems have already attempted to solve this problem, but they were only able to identify a few specific actions such as walking, lying, sitting, or running.
1.4 Scope:
“To detect and recognize the human actions by using point clouds in an
orthographic environment and classifying the actions according to their category
while improving the overall system’s efficiency”
Our proposed algorithm will follow certain steps to reach the final output. The procedure for solving the problem usually falls into the following steps:
(1) Pre-processing of the raw data from the sensor streams to handle unwanted distortions or features, for example eliminating noise and redundancy or performing data aggregation and normalization, which is important for further processing.
(5) Classification, the core and most important step of all: determining the given activity, i.e. labelling images into their respective predefined categories.
In this paper the authors used 3D point clouds to create a 3D model of the hand. They detected the shape and position of the hand using regression in CNNs (Convolutional Neural Networks), located the fingertips and the center of the hand to estimate the pose, and then extracted features from the detected points. Finally, they classified histograms built from those features and produced an estimate of the detected pose [5].
In this paper the authors proposed a new technique that processes point clouds directly instead of converting them into depth images. They employed a key-point extractor and a novel descriptor for this purpose; histograms are built directly from the extracted key points and then classified to predict the detected action [6].
Human Activity Recognition from Unstructured Feature Points
In this paper the researchers do not perform or rely on pose estimation at all. Instead, they use a visual attention module that learns the activity from the video frames and produces interest points, which are then classified to give the activity prediction. The interest points are gathered by monitoring glimpses frame by frame [7].
Figure 1.6.5. Two different clusters of human actions.
1.8 Block diagram
1.9 Applications
There are several application domains where HAR can be deployed, including the following categories:
1. Active and assisted living (AAL) systems: the use of information and communication technologies (ICT) in a person's daily living and working environment to enable them to stay active longer, remain socially connected, and live independently into old age.
2. Healthcare monitoring applications, with the aim of constantly monitoring a patient's physiological parameters.
3. Monitoring and surveillance systems for both indoor and outdoor activities.
4. Tele-immersion (TI) applications, which enable users in different geographic locations to perform similar actions through simulation.
5. Human-computer interaction.
6. Robot vision: mimicking human actions and performing similar actions.
2. Literature Review
“Action” is derived from “act”, described as a fact or a process of doing something. Human action is the outcome of the joint movements of the human body, such as the elbows for standing up, the ankles and knees for sitting down or walking, and the hips and shoulders for throwing, or of any other muscle action. Recognizing human actions under conditions such as wrong pose orientation, unfavorable orientation towards the camera, and half-body input, in interactive entertainment and robotics, is a challenging task. Such actions are represented by a series of motions, and although similar motions can correspond to different actions, this makes the study all the more interesting.
Human Action Recognition systems are gaining more and more importance in surveillance-related problems. Human Action Recognition addresses one of the great challenges of computer vision by shifting traditional tasks towards human-robot interaction. In the 1970s, when Gunnar Johansson explained how humans recognize actions [9], this field became the center of attention of many computer scientists.
In earlier times, the task of detecting and recognizing human actions relied only on 2D datasets (images and videos). These datasets were able to highlight features such as edges, the silhouette of the human body, 2D optical flow, and 3D spatio-temporal volumes. Effective local representations include silhouette representation through Histogram-of-Oriented-Gradients (HOG) or Histogram-of-Optical-Flow (HOF) descriptors [10]. However, such datasets could not provide a reliable setup for retrieving 3D information. Later, with the evolution of 3D scanners beginning in the 1960s [5], scientists gained the ability to retrieve 3D images and related features from natural scenes in the form of point clouds. Point clouds are defined as points in space comprising the three essential coordinates (x, y, z) of any object in orthogonal directions [11], which makes them highly beneficial for obtaining precise data with exceptional accuracy. Point cloud scans are also effective for measuring dimensions and interpreting the spatial relations between different parts of a structure. The history of point clouds relates directly to the history of 3D scanners, major examples of which include LiDAR, the PX-80, and the Microsoft Kinect.
A team of researchers, Rusu, R.B., Bandouch, J., Marton, Z.C., Blodow, N., and Beetz, M., proposed a solution for human action recognition in intelligent environments in which a spherical coordinate system is used [5]. An intelligent environment is capable of computing the difference between silhouette image sequences frame by frame. The 2D silhouette images cover only two dimensions, space and time; applying a cube algorithm to the 2D input converts it into 3D by adding a further spatial dimension for recognizing actions. Techniques such as background subtraction and a Gaussian spatial filter segment the data, and silhouette images are obtained. To distinguish actions from gestures, K-D trees are used.
3D skeleton joints also help in recognizing actions. Li, M., and Leung, H. use skeletal points as input to their system, forming a skeletal point for every joint of the body as shown in Figure 2.1.2. They then directly extract features such as the Relative Variance of Joint Relative Distance and a temporal pyramid covariance descriptor. The extracted data is represented using a Joint Spatial Graph (JSG), and the JSGs are analyzed with a newly proposed technique named the JSG Kernel. The JSG Kernel parallelizes the processes of edge-attribute similarity checking and vertex-attribute similarity checking, which makes it easier for the classifier to label the actions. Three datasets, the MSR-3D Action dataset [14], the UTKinect-3D Action dataset [15], and the Florence-3D Action dataset [16], were used for training, testing, and evaluation [17].
Figure 2.1.2. Skeletal points[17]
Using the same idea of skeleton points, Yang and Tian [18] proposed a technique in which 3D skeleton points are extracted from 3D depth data captured with RGB-D cameras. The skeleton joints, known as EigenJoints, combine action information including static posture, motion properties, and the overall dynamics of each point, with different values for each frame. In pre-processing, an Accumulated Motion Energy (AME) technique [19] is applied to select informative frames from the input video; removing noisy frames during feature extraction reduces the computational cost and improves the performance of the proposed system. After AME pre-processing, a non-parametric Naïve-Bayes-Nearest-Neighbor (NBNN) [20] algorithm classifies the performed actions by comparing them with the actions the system was trained on. The input to this system is a 3D video dataset (the MSR Action3D dataset) [14]. Different experiments compared the performance and accuracy of the systems, and the results obtained on multiple datasets showed that the proposed approach outperforms state-of-the-art methods. The frame rate was also investigated: processing only the first 30-40% of the frames of a video from the MSR Action3D dataset [14] was enough to achieve the desired results [18].
Figure 2.1.3. Flow of Events.[18]
An autonomous system for real-time action recognition using 3D motion flow estimation (Munaro, M., Ballin, G., Michieletto, S., and Menegatti, E.) recognizes human actions online in real time. For visualizing the point clouds, different colors highlight the essential information in the images. A 3D sensing device, the Microsoft Kinect, captures the view and provides values along all three axes. A 3D grid-based descriptor joins multiple point clouds to form a shape to be recognized, and a nearest-neighbor algorithm classifies the performed actions against those stored in the system. The performance and efficiency of the system were evaluated on their dataset, on which it recognized about 90% of the actions [10].
The development of advanced new depth sensors with low cost, such as the Kinect, is very helpful for recognizing human actions. Depth sensors provide better body postures and human body silhouettes. RGB-D images provide feature sources such as Motion History Images (MHI) [21], Motion Energy Images (MEI) [19], and Depth Motion Maps (DMM) [22] in different views. Each depth image is projected onto three orthogonal planes, and a representation classifier is employed for action recognition. The methods proposed in the early stages of the human action recognition field, however, showed poor performance.
The data is first split for training and testing. Noisy frames that still exist in public data, due to the segmentation of irregular data, are filtered out. An angular spatio-temporal descriptor, which is stable and cheap to compute, combines the kinematic parameters of the human body. For the recognition process, a KNN classifier [13] is preferred. The method also includes computing the descriptor of each frame and estimating the time label of the frame [23].
To achieve high accuracy and performance in this field, Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu proposed the Directed Graph Neural Network [24], which treats the skeleton as a directed acyclic graph. A summary of how it works follows.
Typically, the raw skeleton data are a sequence of frames, each of which
contains a set of joint coordinates. Given a skeleton sequence, we first extract the
bone information according to the 2D or 3D coordinates of the joints. Then, the joints
and bones (the spatial information) in each frame are represented as the vertexes and
edges within a directed acyclic graph, which is fed into the directed graph neural
network (DGNN) to extract features for action recognition. Finally, the motion
information, which is represented with the same graph structure as that used for the spatial
information, is extracted and combined with the spatial information in a two-stream
framework to further improve the performance.
Figure 2.1.5. Directed Acyclic Graph used in Directed Graph Neural Network
Skeleton data is represented as a directed acyclic graph (DAG) with the joints
as vertexes and bones as edges. The direction of each edge is determined by the
distance between the vertex and the root vertex, where the vertex closer to the root
vertex points to the vertex farther from the root vertex. Here, the Neck joint is
considered as the root joint, as shown in Fig. 2.1.5.
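As a rough sketch of this construction (our own illustration rather than the authors' code), bone vectors can be derived from joint coordinates and oriented away from a chosen root joint; the joint count and parent list below are hypothetical:

```python
import numpy as np

# Hypothetical joint set; the indices and parent list are illustrative only.
joints = np.random.rand(5, 3)          # 5 joints with (x, y, z) coordinates
parents = [-1, 0, 1, 1, 0]             # joint 0 (e.g. the neck) acts as the root

# Each bone is a directed edge from the joint closer to the root (the parent)
# to the joint farther from the root (the child), represented by their difference.
bones = np.zeros_like(joints)
for child, parent in enumerate(parents):
    if parent >= 0:
        bones[child] = joints[child] - joints[parent]
```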
2.2 Analysis
Many researchers have taken on the challenging area of action recognition, but each approach has had its drawbacks. For instance, in 2009 the extraction of point cloud features in intelligent environments suffered from a poor response time. In 2013, three-dimensional action recognition carried a heavy computational load that hurt performance. Within a year (2014), researchers approached action recognition again through 3D EigenJoints, which relied on a reliable classifier but was evaluated on only a single dataset. In 2017, two major studies were conducted: one was based on the use of graphs, which provided a quick response on multiple datasets but still recognized only a small number of actions; the other, from the same year (2017), was based on kinematic similarity in real time, which obtained good results but used complex, time-consuming algorithms. Later, in 2019, the Directed Graph Neural Network was proposed, which produced great performance and accuracy but required very large storage space. This timeline shows that with the advancement of technology and the discovery of new algorithms, the task of action recognition has become easier and accuracy has reached about 90%, but the target of 100% is still out of reach.
Table 2.2.1: Comparative summary of existing systems

| Name of Authors | Title of paper | Year | Methodology | Recognized actions | Advantages | Drawbacks |
|---|---|---|---|---|---|---|
| Radu Bogdan Rusu, Jan Bandouch, Zoltan Csaba Marton, Nico Blodow, Michael Beetz | Action Recognition in Intelligent Environments using Point Cloud Features Extracted from Silhouette Sequences [13] | 2009 | K-D trees | Opening door, closing door, unscrewing, drinking | First use of 3D point clouds | Slow processing response time |
| Matteo Munaro, Gioia Ballin, Stefano Michieletto | 3D Flow Estimation for Human Action [10] | 2013 | KNN classifier | Get up, kick, pick up, sit down, walk, turn around | Colored point clouds were used | Slow performance due to triple workload |
| Xiaodong Yang, YingLi Tian | Effective 3D Action Recognition using EigenJoints [18] | 2014 | Naïve-Bayes-Nearest-Neighbor (NBNN) | Step back, step forward, step aside, kick, run, twist, climb | Better performance under different light conditions | Single dataset used |
| Meng Li, Howard Leung | Graph-based Approach for 3D Human Skeletal Action Recognition [17] | 2017 | Joint spatial graphs | Forward punch, forward kick, walk, pick up, clap, bow | Quick response compared to previous ones on 3 different datasets | Limited actions recognized |
| Qingqiang Wu, Guanghua Xu, Longting Chen, Ailing Luo, Sicong Zhang | Human Action Recognition Based on Kinematic Similarity in Real Time [23] | 2017 | Depth motion maps with KNN classifier | Basketball shoot, boxing, bowling, push, jog, walk, tennis serve, tennis swing | Suitable feature extraction results in high accuracy | Slow performance and difficult feature extraction |
| Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu | Skeleton-Based Action Recognition with Directed Graph Neural Networks [24] | 2019 | Directed graph neural networks | - | Great accuracy over a large dataset | Very large storage space required |
Graph 2.2.1.1: Accuracies of the existing systems, plotted by publication year (2009, 2013, 2014, 2017, 2017, 2019); the accuracy axis ranges from 70% to 100%.
2.3 State-of-the-Art:
After analyzing these studies, the research with the best results among them is the Directed Graph Neural Network. Techniques such as global average pooling for feature extraction and the DGNN classifier for classifying the different actions produce a highly efficient result. Even though it uses complex algorithms whose processing cost affects performance, it achieves more than 95% accuracy.
The steps are discussed in detail in the coming chapter.
3. Methodology
3.1 Introduction:
Recognizing human actions plays a foundational role in robotic vision. Robots accommodate human behavior in order to mimic human actions, particularly in industrial settings where a person's reach is not feasible, such as lifting heavy objects or precisely positioning objects above one's shoulders, or where an ageing workforce is supported by providing a robotic assistant instead of a fully manual process. The essential idea is to combine the cognitive capabilities of humans with the physical strength and efficiency of robots and machines. Humans have perception and cognitive functions and are able to act and react to a given scenario; robots can learn this behavior and act accordingly [25].
3.1.2 Orthographic Environment
Table 3.2.1: Some commonly used action datasets
- Python 3.5+
- Windows 10 x64

Dependencies:
- PyTorch
- Numpy
- Scipy
- OpenCV
- Tqdm
- TensorboardX

Hardware requirements
Figure 3.3.1: Block Diagram of system
The median filter is very effective at removing noise. This filter is applied to each frame of the input video by replacing the gray level of each pixel with the median of the gray levels in a neighborhood of that pixel; the size of the neighborhood is chosen according to the level of noise, as shown in Fig. 3.4.2.1.
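A minimal sketch of this step using OpenCV's built-in median filter; the file name and the 5x5 kernel size are illustrative choices, not values taken from the project:

```python
import cv2

# Read one frame of the input video and apply a median filter.
# A 5x5 neighborhood is used here; a noisier frame would call for a larger kernel.
frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file name
denoised = cv2.medianBlur(frame, 5)
cv2.imwrite("frame_denoised.png", denoised)
```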
The next step is sharpening the video frames. Sharpening means making the edges of objects clearer so that they are easier to detect, and the Laplacian filter is widely used for this purpose.
The Laplacian filter enhances the edges of the video frames; this enhancement is known as sharpening the image. Sharpening the edges also helps in feature extraction and skeleton shaping, so the resulting skeleton points clearly show the subject's body joints. Two commonly used Laplacian kernels are shown in Table 3.4.2.1.
Kernel 1:          Kernel 2:
 0 -1  0           -1 -1 -1
-1  4 -1           -1  8 -1
 0 -1  0           -1 -1 -1
The Laplacian filter uses the function L(x, y), shown below, computed from the second derivatives of the pixel intensities I of an image:

L(x, y) = ∂²I/∂x² + ∂²I/∂y²
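A small sketch of Laplacian-based sharpening with OpenCV, using the first kernel from Table 3.4.2.1; the file names are placeholders and the clipping step is an assumption about how the result is stored:

```python
import cv2
import numpy as np

frame = cv2.imread("frame_denoised.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# First Laplacian kernel from Table 3.4.2.1.
kernel = np.array([[ 0, -1,  0],
                   [-1,  4, -1],
                   [ 0, -1,  0]], dtype=np.float32)

# Convolve to obtain the Laplacian response, then add it back to the frame
# so that the edges are emphasized (sharpened).
laplacian = cv2.filter2D(frame, -1, kernel)
sharpened = np.clip(frame + laplacian, 0, 255).astype(np.uint8)
cv2.imwrite("frame_sharpened.png", sharpened)
```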
Figure 3.4.4.1 Extracted skeleton of clap action
The position of each joint is recorded in each frame, and the difference from the previous frame gives the displacement of the joint as well as the speed at which the joint moves. The movement is calculated as Mvt = Vt+1 − Vt, where Vt is the position of the joint in the current frame and Vt+1 is its position in the next frame; subtracting the two gives the movement of the joint, Mvt.
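In code this reduces to a simple array difference, assuming the joint positions of a frame are stored as an N x 3 array (the joint count and frame rate below are illustrative):

```python
import numpy as np

def joint_movement(v_next, v_current, fps=30.0):
    """Return per-joint displacement M = V(t+1) - V(t) and an approximate speed."""
    m = v_next - v_current                   # displacement of every joint on each axis
    speed = np.linalg.norm(m, axis=1) * fps  # distance moved per second
    return m, speed

# Hypothetical usage with two consecutive frames of 25 joints.
v_t  = np.random.rand(25, 3)
v_t1 = np.random.rand(25, 3)
movement, speed = joint_movement(v_t1, v_t)
```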
The difference in node information is computed as

hᵖ(nᵢ) = nᵢ.info − nᵢ₋₁.info

However, this provides information about adjacent nodes only, which is not sufficient for the overall processing, since farther nodes must also be predicted. The adjacency matrix is therefore mapped onto an attention map P, where A_O denotes the original adjacency matrix, giving

A = P A_O

or, in the additive form,

A = A_O + P
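A minimal NumPy sketch of combining a fixed adjacency matrix with an attention map, assuming both are square matrices over the skeleton joints; the joint count, adjacency pattern, and random values standing in for the learned map are all illustrative:

```python
import numpy as np

num_joints = 25
A_O = np.eye(num_joints, k=1) + np.eye(num_joints, k=-1)  # hypothetical original adjacency
P = np.random.rand(num_joints, num_joints) * 0.1          # stands in for a learned attention map

A_weighted = P * A_O   # element-wise re-weighting of the existing edges
A_additive = A_O + P   # additive variant, which can also introduce new connections
```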
Figure 3.5.1.2 demonstrates the flow of the proposed approach, Structured Tree Neural Network.
Figure 3.5.1.3 Demonstrates the Class Prediction involved in Structured Tree Neural
Networks.
3. insertMotionDifferences()
Training
Input: Motion Data of skeleton
Output: Trained model
1. feeders_train()
2. feeders_test()
3. model_creation()
4. optimization()
5. training()
Testing
Input: Motion data, Train Model, RGB-D Video
Output: Recognized class
1. feeders()
2. loadModel()
3. test()
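For illustration, the training steps above (feeders, model creation, optimization, training) can be condensed into a PyTorch-style sketch; the dataset, model, and hyper-parameters are placeholders rather than the project's actual code:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder feeders: 100 random motion samples of 25 joints x 3 coordinates.
features = torch.randn(100, 75)
labels = torch.randint(0, 10, (100,))
train_loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)

# Placeholder model standing in for the structured-tree network.
model = nn.Sequential(nn.Linear(75, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):                      # training()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)       # forward pass and loss
        loss.backward()                     # back-propagation
        optimizer.step()                    # optimization()

torch.save(model.state_dict(), "trained_model.pt")  # later reloaded by loadModel()/test()
```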
3.8 Expected Output
After obtaining the features from the softmax layer, they are compared with the trained model of the system. For the final output, a labeled action class is shown only if the similarity of the extracted features with the trained features exceeds roughly 70%, i.e. at least 10 out of the 15 skeleton points must match the trained features. The output is a video displaying the detected subject inside a rectangular bounding box, with the recognized action labeled in the corner.
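The matching rule described above can be sketched as a simple threshold check; the per-point similarity test (a 5 cm distance) is a made-up stand-in for the real feature comparison:

```python
import numpy as np

def action_matches(extracted, trained, threshold=10 / 15):
    """Return True if enough skeleton points match the trained features."""
    # Hypothetical per-point similarity: points closer than 5 cm count as a match.
    distances = np.linalg.norm(extracted - trained, axis=1)
    matched = np.count_nonzero(distances < 0.05)
    return matched / len(extracted) >= threshold  # e.g. at least 10 of the 15 points

extracted = np.random.rand(15, 3)
trained = extracted + np.random.normal(0, 0.01, (15, 3))
print(action_matches(extracted, trained))
```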
3.9 Summary
In this approach, a novel methodology for human action recognition has been introduced. A graphical tree skeleton structure provides the joint information used for feature extraction, which in turn serves as the basis for classification. The actions are then classified and labeled into different categories.
4. Implementation
4.1 Development Tool
The development tool used is Python 3.7.5 (64-bit). Python was chosen because it is an interpreted, object-oriented, high-level programming language with dynamic semantics whose syntax emphasizes readability, which minimizes the cost of program maintenance. Maintainability is one of Python's vital strengths, as it supports numerous packages and modules, which encourages program modularity and code reuse. The other tool in use is the common Command Prompt, also known as CMD, which is used to install the project's dependencies and to run the project.
The reason our system has a high computational time is the training time required by neural networks. One might ask why neural networks are used at all. The relation between a system and a human is defined by how efficiently the system can learn from the human; this is what makes it intelligent. As discussed before, human behavior has always been an important factor in better understanding computer vision. Until a system can recognize all possible human actions, it cannot become a strong and intelligent machine.
The STNN algorithm, however, classifies the actions using an acyclic structure, or simply a tree. In this structure the child nodes are restricted to exactly one parent, which means no cycles between nodes are formed. This saves computational time in our framework, since the STNN acts as a non-recursive data structure. Admittedly, a deep neural network combined with a structural tree adds complexity to the system, as the many layers involved are inter-linked, but the relation defined between any two nodes is definite, i.e. there is no possibility of repetition. Overall this makes the approach less complicated than DGNN, so we consider it a minor issue.
Another major issue is acquiring a dataset for training. NTU RGB-D, published by NTU Singapore, is the dataset used for this system, and formal approval from a senior authority is needed to access it. The problem with such a large dataset is dealing with all of its aspects, in particular the training time: the response time in this case is very slow, with training taking almost 10 days.
There are mainly three CM systems used for the configuration management and maintenance of the STNN algorithm. The first is TensorFlow, an open-source library provided by Google for implementing machine learning, computer vision, natural language processing, and many other algorithms; it is configured by running a single command on the command prompt. In this project this library is used to train the system. The second is NumPy, a Python library that supports large multi-dimensional arrays and matrices, which is likewise configured with a single command on the command prompt; the extracted features are stored using this library. The last one is OpenPose, a library used to fit a skeleton onto the human body; to configure it, the CPU-only binary of the library is downloaded.
The plotting of the skeleton was done using the matplotlib library. This is a plotting library for the Python programming language that uses an object-oriented approach, making it very easy for developers to plot graphs; it is also configured with a command on the command prompt.
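The exact commands are not reproduced in the text; assuming the standard pip-based setup run from CMD, they would typically look like the following (OpenPose is configured separately by downloading its CPU-only binary):

```
pip install tensorflow
pip install numpy
pip install matplotlib
```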
4.4 Framework Section for development
Figure 4.4.1 demonstrates the flow of the proposed approach, Structured Tree Neural Network.
Initially, a skeleton tree is constructed from the OpenPose skeleton by taking the spine joint as the root node. Every node in the tree consists of two portions: a data portion and a children portion (references to the children). The data portion holds the x, y, z coordinate information and the angle between the incoming and outgoing edges. The children portion holds three references to child nodes, i.e. left, right and center, since the tree is not a binary tree. After the frame difference of each node is calculated, the node is updated with its new position and angle, and the change (difference) is also saved in memory. The changes and updated nodes are then mapped onto an attention map to obtain details about the nodes that are not directly connected. To form the input signal for feature extraction, a 1D convolution and global pooling are applied to extract the features, i.e. the node differences and angle differences; a ReLU layer is then applied to remove redundant features. Finally, a softmax layer selects the best-matching class and sends it to the output.
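A minimal sketch of the node structure described above; the class and field names are our own and the actual project code may differ:

```python
class SkeletonTreeNode:
    """One joint in the skeleton tree: a data portion plus three child references."""

    def __init__(self, x, y, z, angle=0.0):
        # Data portion: 3D coordinates and the angle between incoming and outgoing edges.
        self.x, self.y, self.z = x, y, z
        self.angle = angle
        # Children portion: left, right and center references (the tree is not binary).
        self.left = None
        self.right = None
        self.center = None

    def update(self, x, y, z, angle):
        """Apply the frame difference: return the change and store the new state."""
        delta = (x - self.x, y - self.y, z - self.z, angle - self.angle)
        self.x, self.y, self.z, self.angle = x, y, z, angle
        return delta

# Hypothetical usage: the spine joint as root with one child attached.
root = SkeletonTreeNode(0.0, 1.0, 0.0)
root.center = SkeletonTreeNode(0.0, 1.3, 0.0)
```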
4.5 Deployment factors
STNN is implemented using the TensorFlow library. It was chosen as the standard library because it is an open-source ML library known for building computation graphs that can be updated. STNN could also be implemented using other ML libraries such as PyTorch or Pandas, or even other tools such as MATLAB.
4.5.1 PyTorch
4.5.2 Pandas
4.5.3 MATLAB
4.6 Summary
5. System Testing
5.1 Introduction
Earlier chapters of this project covered the implementation, i.e. the libraries, dependencies, environment, and integration of the modules. It is now time for system testing, which deals with the correctness and precision of the system; it also means validating and verifying the system's functionality against the defined scope. An important part of testing this project is discussing the accuracy of the system and comparing it with previous techniques. This system uses trees to classify human actions; the inspiration for using trees comes from the natural shape of the human body. If only the joints are taken and joined by lines, a tree-like shape emerges, and with this in mind the Structured Tree Neural Network technique was proposed.
5.2 Output
In this project the STNN has four layers: input, output, and upper and lower hidden layers. The input layer consists of 25 nodes, each representing a skeletal joint. The upper hidden layer handles feature extraction, whose output is passed to the lower hidden layer, which is responsible for removing nodes with redundant features. The remaining nodes are passed to the output layer, which determines the performed action, as depicted in Figure 5.2.1.
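A schematic PyTorch sketch of this four-layer arrangement is shown below; apart from the 25 input joints, the layer widths and class count are illustrative, and the real network operates on tree-structured features rather than a flat vector:

```python
import torch
from torch import nn

class LayeredActionClassifier(nn.Module):
    """Illustrative four-layer stack: input, two hidden layers, output."""

    def __init__(self, num_joints=25, num_classes=60):
        super().__init__()
        self.input_layer = nn.Linear(num_joints * 3, 128)  # one node per skeletal joint (x, y, z)
        self.upper_hidden = nn.Linear(128, 64)             # stands in for the feature-extraction layer
        self.lower_hidden = nn.Linear(64, 32)              # stands in for the redundancy-removal layer
        self.output_layer = nn.Linear(32, num_classes)     # decides the performed action

    def forward(self, x):
        x = torch.relu(self.input_layer(x))
        x = torch.relu(self.upper_hidden(x))
        x = torch.relu(self.lower_hidden(x))
        return self.output_layer(x)

model = LayeredActionClassifier()
scores = model(torch.randn(1, 75))  # one skeleton of 25 joints flattened to 75 values
```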
The benefit of this layered approach and the simplicity of the data structure is high accuracy: this project reaches an accuracy of 96.3%, and the corresponding confusion matrix is shown in Figure 5.2.2.
Figure 5.3.1 All test passed in automatic testing
Accuracy (%) of the existing systems by year (2009, 2013, 2014, 2017, 2017, 2019) compared with the proposed system (2020); the accuracy axis ranges from 70% to 100%.
Figure 5.5.1.1: Initial layout of the application
5.5.2 Layout 2:
In this step the user selects the video input by clicking the Select Video Input button. An input dialog opens and the user chooses the video; the selected video is then displayed in the axes area once the system has read it, as shown in Figure 5.5.2.1.
Figure 5.5.2.1: Selection of video by user by clicking Select Video Input button.
5.5.3 Layout 3:
In this layout the skeleton is drawn on the human in the selected video. The user needs to click the Make Skeleton button to get the desired output; Figure 5.5.3.1 displays the resulting video with the skeleton drawn on the human body.
Figure 5.5.3.1: Output of the skeletons displayed.
5.5.4 Layout 4:
In this step the skeleton is extracted onto a 3D grid. The user has to click the Extract Skeleton button; Figure 5.5.4.1 shows the output of the extracted skeleton on the 3D grid.
Figure 5.5.4.1: Output of the extracted skeleton.
5.5.5 Layout 5:
In this step the trees are formed from the extracted skeleton: a tree is built, the nodes whose values change most frequently are identified, and those nodes are separated into a subtree that is traversed in the next step to produce the output. The user has to click the Structuring Tree button. Figure 5.5.5.1 displays the resulting tree along with the subtree; circles and lines are used to draw the tree.
Figure 5.5.5.1: Output of the Structured tree.
5.5.6 Layout 6:
In this step the feature points are extracted from the subtree and displayed in a table; these features are used in the next step to recognize the action performed in the video. The user needs to click the Extract Features button to obtain them. Figure 5.5.6.1 shows the layout with the extracted features in the table.
Figure 5.5.6.1: Output of the extracted Features from the tree.
5.5.7 Layout 7:
In this step the extracted features are passed to the classifier, which classifies the performed action by matching the features against the trained model. The user needs to click the Recognition button to get the result. Figure 5.5.7.1 shows the final output of the video with the action labeled at the top-left corner.
Figure 5.5.7.1: Output of the performed action along with the feature points used.
5.6 Summary
The connection between a child node and its parent is known as the incoming edge, while the reciprocal connection is known as the outgoing edge. The use of a tree structure leads to a system that maps intuitively to human movements. The classifier uses the change in displacement of the joints and the change in the angles between incoming and outgoing edges as the features for classifying the performed actions.
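The angle feature can be computed directly from the two edge vectors at a node, as in this small sketch; the vectors below are placeholders:

```python
import numpy as np

def edge_angle(incoming, outgoing):
    """Angle (in degrees) between a node's incoming and outgoing edge vectors."""
    cos_theta = np.dot(incoming, outgoing) / (np.linalg.norm(incoming) * np.linalg.norm(outgoing))
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Hypothetical edge vectors for one joint in two consecutive frames.
angle_prev = edge_angle(np.array([0.0, 1.0, 0.0]), np.array([0.3, 0.9, 0.0]))
angle_curr = edge_angle(np.array([0.0, 1.0, 0.0]), np.array([0.5, 0.8, 0.0]))
deformation = angle_curr - angle_prev  # change in angle used as the classification feature
```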
6. Conclusion
We live in a 3D physical world, and computer vision is critical to understanding and interacting with it. It is therefore important to teach computers to see and understand the world the way a human being does. Humans rely heavily on their eyes to carry out daily activities and accomplish tasks, so if a machine can achieve the same capability, computer vision techniques can be of great value. From an engineering point of view, computer vision aims to build autonomous systems that perform some of the tasks the human visual system can perform, and even surpass it in many cases. Many vision tasks involve extracting 3D and temporal information from time-varying 2D data, in other words videos, and of course the two goals are intimately related.
Computer vision contains many subfields. Human action recognition is one of the most prominent because of its wide range of applications, and it has a great impact on human-robot interaction. To make computers able to recognize human actions, we require low-cost, highly efficient, and fast algorithms; this is why the area has been a major focus of researchers in recent years. Inspired by this, our contribution to the field is an approach that uses trees to manipulate the human skeleton and recognize human actions.
Before turning to the proposed approach, consider the previous approaches in this research area. In 2009, silhouette sequences were used to classify human actions; this technique became popular over time due to its simplicity, but the subject's posture relative to the camera affected its performance. If a person faced the camera while performing an action, the performance was very good, but if the person was not facing the camera, it was not. To resolve this issue, depth motion maps were introduced; these are real 3D data, they perform well in different lighting conditions, and the posture relative to the camera does not affect performance, but they are very large in size. To decrease the size of the data, skeleton points were introduced. These are also real 3D data with exactly the same advantages as depth motion maps but with a significantly smaller size. Today they are ideal for human action recognition due to their simplicity and small size; another attractive property of skeleton points is that they can be represented by many data structures, such as graphs and trees. Skeleton points rapidly became the most widely used representation for human action recognition.
Now consider what a human action is. A human action is the result of the joint movements of the human body, such as the elbows and hands for waving, or the ankles and knees for running or walking. If the joints are connected, a skeleton is formed, and viewed from a computer scientist's point of view this skeleton forms a tree. A tree structure constructed from the skeleton simplifies processing and reduces the computational load compared with graphs, because in graphs both edges and nodes are traversed and processed, whereas in a tree only the nodes are traversed and processed. The actions are classified using two features, joint movement and bone deformation. Joint movement measures how much, and along which axis, a joint moves from frame to frame; bone deformation measures how much the angle between the incoming and outgoing edges of a node changes. Since the tree does not store explicit edge attributes, the slopes of both edges are computed, the angle between these two slopes is found, and the change in this angle is used as a classification feature. The technique uses a layered approach for classification: the first, input layer takes the skeleton data and forms a tree; the second layer extracts the features of every node in the tree; the third layer removes the nodes with redundant features; and finally the output layer classifies the action and assigns the action label. This layered approach and the simplicity of the structure led to 96.3% accuracy on the NTU RGB-D dataset.
7. Major Contributions
All of the team members, including Dr. Sajid, planned and arranged the experiments. Misha, Nisar, and Junaid completed the analyses. Nisar and Junaid arranged and carried out the implementation work. Nisar contributed to the interpretation of the results. Misha took the lead in writing the thesis manuscript. All authors provided critical feedback and helped shape the research, the analysis, and the complete thesis.
8. References
[1] H. S. Robotics. (2015, 26/09/2019). Point Cloud. Available:
https://www.sciencedirect.com/topics/engineering/point-cloud
[2] S. Zhang, Z. Wei, J. Nie, L. Huang, S. Wang, and Z. Li, "A Review on Human Activity Recognition Using Vision-Based Method," Journal of Healthcare Engineering, 2017.
[3] A. Shafaei and J. J. Little, "Real-time human motion capture with multiple depth
cameras," in 2016 13th Conference on Computer and Robot Vision (CRV), 2016, pp.
24-31: IEEE.
[4] S. Berretti, M. Daoudi, P. Turaga, and A. Basu, "Representation, analysis, and recognition of 3D humans: A survey," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 14, no. 1s, p. 16, 2018.
[5] L. Ge, Y. Cai, J. Weng, and J. Yuan, "Hand PointNet: 3d hand pose estimation using
point sets," in Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 2018, pp. 8417-8426.
[6] H. Rahmani, A. Mahmood, D. Q. Huynh, and A. Mian, "HOPC: Histogram of
oriented principal components of 3D pointclouds for action recognition," in
European conference on computer vision, 2014, pp. 742-757: Springer.
[7] F. Baradel, C. Wolf, J. Mille, and G. W. Taylor, "Glimpse clouds: Human activity
recognition from unstructured feature points," in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2018, pp. 469-478.
[8] M. Khokhlova, C. Migniot, and A. Dipanda, "3D Point Cloud Descriptor for Posture
Recognition," in VISIGRAPP (5: VISAPP), 2018, pp. 161-168.
[9] G. Johansson, "Visual perception of biological motion and a model for its analysis," Perception & Psychophysics, vol. 14, no. 2, pp. 201-211, 1973.
[10] M. Munaro, G. Ballin, S. Michieletto, and E. Menegatti, "3D flow estimation for human action recognition from colored point clouds," Biologically Inspired Cognitive Architectures, vol. 5, pp. 42-51, 2013.
[11] M. T. SRINI DHARMAPURI. (2018). Evolution of Point Cloud. Available:
https://lidarmag.com/2018/07/16/evolution-of-point-cloud/
[12] P. C. L. (PCL). (2017). People module for detecting people in unconventional poses.
Available: http://pointclouds.org/gsoc/
[13] R. B. Rusu, J. Bandouch, Z. C. Marton, N. Blodow, and M. Beetz, "Action
recognition in intelligent environments using point cloud features extracted from
silhouette sequences," in RO-MAN 2008-The 17th IEEE International Symposium on
Robot and Human Interactive Communication, 2008, pp. 267-272: IEEE.
[14] J. Zhang, W. Li, P. O. Ogunbona, P. Wang, and C. Tang, "RGB-D-based action recognition datasets: A survey," Pattern Recognition, vol. 60, pp. 86-105, 2016.
[15] L. Xia, C.-C. Chen, and J. K. Aggarwal, "View invariant human action recognition
using histograms of 3d joints," in 2012 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition Workshops, 2012, pp. 20-27: IEEE.
[16] L. Seidenari, V. Varano, S. Berretti, A. Bimbo, and P. Pala, "Recognizing actions
from depth cameras as weakly aligned multi-part bag-of-poses," in Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2013,
pp. 479-485.
[17] M. Li and H. Leung, "Graph-based approach for 3D human skeletal action recognition," Pattern Recognition Letters, vol. 87, pp. 195-202, 2017.
[18] X. Yang and Y. Tian, "Effective 3D action recognition using EigenJoints," Journal of Visual Communication and Image Representation, vol. 25, no. 1, pp. 2-11, 2014.
[19] G. T. Papadopoulos, V. Mezaris, I. Kompatsiaris, and M. G. Strintzis, "Accumulated
motion energy fields estimation and representation for semantic event detection," in
Proceedings of the 2008 international conference on Content-based image and video
retrieval, 2008, pp. 221-230: ACM.
[20] S. McCann and D. G. Lowe, "Local naive bayes nearest neighbor for image
classification," in 2012 IEEE Conference on Computer Vision and Pattern
Recognition, 2012, pp. 3650-3656: IEEE.
[21] M. A. R. Ahad, Motion history images for action recognition and understanding.
Springer Science & Business Media, 2012.
[22] X. Yang, C. Zhang, and Y. Tian, "Recognizing actions using depth motion maps-
based histograms of oriented gradients," in Proceedings of the 20th ACM
international conference on Multimedia, 2012, pp. 1057-1060: ACM.
[23] Q. Wu, G. Xu, L. Chen, A. Luo, and S. Zhang, "Human action recognition based on kinematic similarity in real time," PLoS ONE, vol. 12, no. 10, p. e0185719, 2017.
[24] L. Shi, Y. Zhang, J. Cheng, and H. Lu, "Skeleton-Based Action Recognition with
Directed Graph Neural Networks," in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2019, pp. 7912-7921.
[25] S. C. Akkaladevi and C. Heindl, "Action recognition for human robot interaction in
industrial applications," in 2015 IEEE International Conference on Computer
Graphics, Vision and Information Security (CGVIS), 2015, pp. 94-99: IEEE.
[26] L. OpenGL. Coordinate Systems. Available: https://learnopengl.com/Getting-
started/Coordinate-Systems
[27] I. Wikimedia Foundation. Orthographic Projection. Available:
https://en.wikipedia.org/wiki/Orthographic_projection
[28] W. Li, Z. Zhang, and Z. Liu, "Action recognition based on a bag of 3d points," in
2010 IEEE Computer Society Conference on Computer Vision and Pattern
Recognition-Workshops, 2010, pp. 9-14: IEEE.
[29] J. Sung, C. Ponce, B. Selman, and A. Saxena, "Human activity detection from RGBD
images," in Workshops at the twenty-fifth AAAI conference on artificial intelligence,
2011.
[30] B. Ni, G. Wang, and P. Moulin, "Rgbd-hudaact: A color-depth video database for
human daily activity recognition," in 2011 IEEE international conference on
computer vision workshops (ICCV workshops), 2011, pp. 1147-1153: IEEE.
[31] J. Wang, Z. Liu, Y. Wu, and J. Yuan, "Mining actionlet ensemble for action
recognition with depth cameras," in 2012 IEEE Conference on Computer Vision and
Pattern Recognition, 2012, pp. 1290-1297: IEEE.
[32] Z. Cheng, L. Qin, Y. Ye, Q. Huang, and Q. Tian, "Human daily action analysis with
multi-view and color-depth data," in European Conference on Computer Vision,
2012, pp. 52-61: Springer.
[33] H. S. Koppula, R. Gupta, and A. Saxena, "Learning human activities and object affordances from RGB-D videos," The International Journal of Robotics Research, vol. 32, no. 8, pp. 951-970, 2013.
[34] O. Oreifej and Z. Liu, "Hon4d: Histogram of oriented 4d normals for activity
recognition from depth sequences," in Proceedings of the IEEE conference on
computer vision and pattern recognition, 2013, pp. 716-723.
[35] P. Wei, Y. Zhao, N. Zheng, and S.-C. Zhu, "Modeling 4d human-object interactions
for event and object recognition," in Proceedings of the IEEE International
Conference on Computer Vision, 2013, pp. 3272-3279.
[36] J. Wang, X. Nie, Y. Xia, Y. Wu, and S.-C. Zhu, "Cross-view action modeling,
learning and recognition," in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2014, pp. 2649-2656.
[37] K. Wang, X. Wang, L. Lin, M. Wang, and W. Zuo, "3d human activity recognition
with reconfigurable convolutional neural networks," in Proceedings of the 22nd ACM
international conference on Multimedia, 2014, pp. 97-106: ACM.
[38] C. Chen, R. Jafari, and N. Kehtarnavaz, "Utd-mhad: A multimodal dataset for human
action recognition utilizing a depth camera and a wearable inertial sensor," in 2015
IEEE International conference on image processing (ICIP), 2015, pp. 168-172:
IEEE.
[39] H. Rahmani, A. Mahmood, D. Huynh, and A. Mian, "Histogram of oriented principal components for cross-view action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 12, pp. 2430-2443, 2016.
[40] N. Xu, A. Liu, W. Nie, Y. Wong, F. Li, and Y. Su, "Multi-modal & multi-view &
interactive benchmark dataset for human action recognition," in Proceedings of the
23rd ACM international conference on Multimedia, 2015, pp. 1195-1198: ACM.
[41] J.-F. Hu, W.-S. Zheng, J. Lai, and J. Zhang, "Jointly learning heterogeneous features
for RGB-D activity recognition," in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2015, pp. 5344-5352.
[42] J. Liu et al., "NTU RGB+ D 120: A Large-Scale Benchmark for 3D Human Activity
Understanding," 2019.
[43] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, "Ntu rgb+ d: A large scale dataset for
3d human activity analysis," in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2016, pp. 1010-1019.
[44] J. Liu, A. Shahroudy, M. L. Perez, G. Wang, L.-Y. Duan, and A. K. Chichung, "NTU
RGB+ D 120: A Large-Scale Benchmark for 3D Human Activity Understanding,"
IEEE transactions on pattern analysis and machine intelligence, 2019.
[45] R. H. Chan, C.-W. Ho, and M. Nikolova, "Salt-and-pepper noise removal by median-
type noise detectors and detail-preserving regularization," IEEE Transactions on
image processing, vol. 14, no. 10, pp. 1479-1485, 2005.
[46] T. Chen, K.-K. Ma, and L.-H. Chen, "Tri-state median filter for image denoising,"
IEEE Transactions on Image processing, vol. 8, no. 12, pp. 1834-1838, 1999.
[47] ScienceDirect. Median Filter. Available:
https://www.sciencedirect.com/topics/computer-science/median-filter
[48] R. B. Rusu and S. Cousins, "3d is here: Point cloud library (pcl)," in 2011 IEEE
international conference on robotics and automation, 2011, pp. 1-4: IEEE.
[49] Z. Yang, Y. Li, J. Yang, and J. Luo, "Action recognition with spatio-temporal visual attention on skeleton image sequences," IEEE Transactions on Circuits and Systems for Video Technology, 2018.
[50] unittest — Unit testing framework. Available:
https://docs.python.org/3/library/unittest.html#module-unittest