Action and Pose Estimation
Computer Vision
Dr. Tripty Singh
Pose Estimation
• There are two main techniques by which pose estimation models can detect human poses.
1. 2D Pose Estimation: In this type of pose estimation, you estimate the locations of the body joints in 2D space relative to the input data (i.e., an image or video frame). The location of each keypoint is represented with X and Y coordinates.
2. 3D Pose Estimation: In this type of pose estimation, you transform a 2D image into a 3D object by estimating an additional Z-dimension for the prediction. 3D pose estimation enables us to predict the accurate spatial positioning of a represented person or object.
3D pose estimation is a significant challenge for machine learning engineers because of the complexity entailed in building datasets and algorithms that must account for several factors, such as an image's or video's background scene and lighting conditions.
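To make the two representations concrete, the following minimal Python sketch shows one common way to store keypoints; the class and field names and example values are illustrative assumptions, not part of any particular library.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Keypoint:
    """One body joint; names and fields are illustrative."""
    name: str                   # e.g. "left_knee"
    x: float                    # horizontal image coordinate
    y: float                    # vertical image coordinate
    z: Optional[float] = None   # estimated depth; only set for 3D estimates
    confidence: float = 1.0     # detector confidence for this joint

# 2D estimate: location in the image plane only (X, Y)
kp_2d = Keypoint("left_knee", x=312.0, y=480.5, confidence=0.92)

# 3D estimate: the same joint with an additional Z-dimension
kp_3d = Keypoint("left_knee", x=312.0, y=480.5, z=1.73, confidence=0.88)
```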
Types of Human Pose Estimation Models
• There are three main types of human pose estimation models used to represent the human body in 2D and 3D planes.
• #1. Skeleton-based model: also called the kinematic model, this representation includes a set of keypoints (joints) such as ankles, knees, shoulders, elbows, and wrists, together with limb orientations, and is used for both 2D and 3D pose estimation.
• This flexible and intuitive human body model captures the body's skeletal structure and is frequently applied to model the relations between different body parts (a minimal sketch follows after this list).
• #2. Contour-based model: also called the planar model, it is used for 2D pose estimation and consists of the contour and rough width of the torso and limbs. It represents the appearance and shape of a human body, where body parts are displayed as boundaries and rectangles of a person's contour.
• #3. Volume-based model: also called the volumetric model, it is used for 3D pose estimation and represents the body with 3D geometric shapes or meshes.
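As referenced in the skeleton-model bullet above, here is a minimal sketch of a kinematic model as data: a set of named joints plus the limb connections between them. The joint names follow the common COCO-style convention; treat the exact list as an illustrative assumption.

```python
# Joints (keypoints) of a kinematic model; COCO-style names, illustrative.
JOINTS = [
    "nose", "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Limb orientations are captured by the edges connecting pairs of joints.
SKELETON_EDGES = [
    ("left_shoulder", "left_elbow"), ("left_elbow", "left_wrist"),
    ("right_shoulder", "right_elbow"), ("right_elbow", "right_wrist"),
    ("left_hip", "left_knee"), ("left_knee", "left_ankle"),
    ("right_hip", "right_knee"), ("right_knee", "right_ankle"),
    ("left_shoulder", "right_shoulder"), ("left_hip", "right_hip"),
    ("left_shoulder", "left_hip"), ("right_shoulder", "right_hip"),
]
```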
Categories
Walking, hammering, dancing, skiing, sitting
down, standing up, jumping
Poses
[Figure: example poses, from http://users.isr.ist.utl.pt/~etienne/mypubs/Auvinetal06PETS.pdf]
Interfaces
[Figure: gesture-based interface, 2011]
Supervised Learning Data Representation
• Before we dive into the specific neural networks that can be used for human activity recognition, we need to talk about data preparation; a common sliding-window approach is sketched below.
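A hedged sketch of that preparation step: a long multi-channel sensor recording is cut into fixed-size, overlapping windows, each paired with one activity label. The window length (128 samples) and 50% overlap are illustrative choices, not values mandated by the text.

```python
import numpy as np

def sliding_windows(signal, labels, window=128, step=64):
    """signal: (n_samples, n_channels) array; labels: (n_samples,) array."""
    X, y = [], []
    for start in range(0, len(signal) - window + 1, step):
        end = start + window
        X.append(signal[start:end])
        # Label each window with the majority activity occurring inside it.
        values, counts = np.unique(labels[start:end], return_counts=True)
        y.append(values[np.argmax(counts)])
    return np.stack(X), np.array(y)
```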
Convolutional neural network (CNN) models have proven very effective on challenging computer vision problems when trained at scale, for tasks such as identifying and localizing objects in images and automatically describing the content of images.
They are models composed of two main types of elements: convolutional layers and pooling layers.
Convolutional layers read the input and project it onto feature maps using learned filters. Pooling layers take the feature map projections and distill them to their most essential elements, such as by using a signal-averaging or signal-maximizing process.
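A minimal Keras sketch of these two element types applied to windowed sensor data; the input shape (128 time steps x 3 channels), filter counts, and 6 activity classes are illustrative assumptions.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # Convolutional layer: reads the window and projects it onto feature maps.
    layers.Conv1D(64, kernel_size=3, activation="relu", input_shape=(128, 3)),
    # Pooling layer: distills each feature map via a signal-maximizing process.
    layers.MaxPooling1D(pool_size=2),
    layers.Flatten(),
    # Fully connected layer: maps the internal representation to a class value.
    layers.Dense(6, activation="softmax"),
])
```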
Convolutional Neural Network Models
The convolution and pooling layers can be repeated at depth, providing multiple layers of
abstraction of the input signals.
The output of these networks is often one or more fully connected layers that interpret what
has been read and map this internal representation to a class value.
The CNN model learns to map a given window of signal data to an activity: the model reads across each window of data and prepares an internal representation of the window.
When applied to time series classification such as HAR, CNNs have two advantages over other models: local dependency and scale invariance. Local dependency means that nearby signals in HAR are likely to be correlated, while scale invariance means the learned features are robust to different paces or frequencies of the same activity.
The first important work applying CNNs to HAR was by Ming Zeng, et al. in their 2014 paper
“Convolutional Neural Networks for Human Activity Recognition using Mobile Sensors.”
In the paper, the authors develop a simple CNN model for accelerometer data, where each
axis of the accelerometer data is fed into separate convolutional layers, pooling layers, then
concatenated before being interpreted by hidden fully connected layers.
A figure in the paper clearly shows the topology of the model. It provides a good template for how a CNN may be used for HAR problems and time series classification in general.
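Since the figure itself is not reproduced here, the following hedged Keras sketch approximates the described topology: each accelerometer axis is fed into its own convolution and pooling branch, and the branches are concatenated before hidden fully connected layers. Filter counts, window length, and class count are illustrative assumptions, not the paper's exact configuration.

```python
from tensorflow.keras import Input, layers, models

def axis_branch():
    """One convolution + pooling branch for a single accelerometer axis."""
    inp = Input(shape=(128, 1))
    x = layers.Conv1D(32, 5, activation="relu")(inp)
    x = layers.MaxPooling1D(2)(x)
    x = layers.Flatten()(x)
    return inp, x

(in_x, f_x), (in_y, f_y), (in_z, f_z) = axis_branch(), axis_branch(), axis_branch()
merged = layers.concatenate([f_x, f_y, f_z])            # join the three branches
hidden = layers.Dense(100, activation="relu")(merged)   # hidden fully connected layer
output = layers.Dense(6, activation="softmax")(hidden)  # activity classes
model = models.Model(inputs=[in_x, in_y, in_z], outputs=output)
```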
How is a CNN used for human activity recognition?
• There are many ways to model HAR problems
with CNNs.
• One interesting example was by Heeryon Cho and Sang Min Yoon in their 2018 paper titled "Divide and Conquer-Based 1D CNN Human Activity Recognition Using Test Data Sharpening."
• In it, they divide activities into those that involve movement, called "dynamic," and those where the subject is stationary, called "static," then develop a CNN model to discriminate between these two main classes. Then, within each class, models are developed to discriminate between activities of that type, such as "walking" for dynamic and "sitting" for static; the two-stage logic is sketched below.
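A minimal sketch of that two-stage logic, assuming three already-trained classifiers with a scikit-learn-style predict() method (the model names are hypothetical):

```python
def classify(window, abstract_model, dynamic_model, static_model):
    """Two-stage divide-and-conquer prediction for one data window."""
    # Stage 1: coarse classification into "dynamic" vs. "static".
    if abstract_model.predict([window])[0] == "dynamic":
        # Stage 2a: discriminate among dynamic activities (e.g. walking).
        return dynamic_model.predict([window])[0]
    # Stage 2b: discriminate among static activities (e.g. sitting).
    return static_model.predict([window])[0]
```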
• Quite large CNN models were developed, which in turn
allowed the authors to claim state-of-the-art results on
challenging standard human activity recognition datasets.
• The authors claim that the removal of the pooling layers is a critical part of their
model architecture, where the use of pooling layers after the convolutional layers
interferes with the convolutional layers’ ability to learn to downsample the raw
sensor data.
• In the literature, CNN frameworks often include convolutional and pooling layers
successively, as a measure to reduce data complexity and introduce translation
invariant features. Nevertheless, such an approach is not strictly part of the
architecture, and in the time series domain […] DeepConvLSTM does not include
pooling operations because the input of the network is constrained by the sliding
window mechanism […] and this fact limits the possibility of downsampling the
data, given that DeepConvLSTM requires a data sequence to be processed by the
recurrent layers. However, without the sliding window requirement, a pooling
mechanism could be useful to cover different sensor data time scales at deeper
layers.
[Figure: depiction of the CNN-LSTM model for activity recognition, taken from "Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition"]
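In that spirit, a hedged Keras sketch of a DeepConvLSTM-style stack: convolutions without pooling (so the sequence is not downsampled before the recurrent layers), followed by LSTM layers and a softmax output. Layer sizes and the input shape are illustrative, not the paper's exact configuration.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv1D(64, 5, activation="relu", input_shape=(128, 9)),
    layers.Conv1D(64, 5, activation="relu"),  # no pooling between conv layers
    # The recurrent layers consume the full, non-downsampled sequence.
    layers.LSTM(128, return_sequences=True),
    layers.LSTM(128),
    layers.Dense(6, activation="softmax"),    # activity classes
])
```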
Interfaces
W. T. Freeman and C. Weissman, "Television control by hand gestures," International Workshop on Automatic Face- and Gesture-Recognition, IEEE Computer Society, Zurich, Switzerland, June 1995, pp. 179-183. MERL-TR94-24.
How can we identify actions?
Motion, pose, held objects, and nearby objects.
Representing Motion
Corner detectors in space-time: space-time interest points (Laptev 2005).
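A rough numpy/scipy sketch in the spirit of space-time interest points: the Harris corner response extended to a video volume via a 3x3 structure tensor of spatial (x, y) and temporal (t) gradients. The smoothing scale and the constant k are illustrative assumptions, not Laptev's exact formulation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def spacetime_harris(video, sigma=2.0, k=0.005):
    """video: (t, y, x) grayscale volume; returns a corner-response volume."""
    gt, gy, gx = np.gradient(video.astype(float))
    # Entries of the 3x3 second-moment matrix, smoothed over space-time.
    xx, yy, tt, xy, xt, yt = (
        gaussian_filter(p, sigma)
        for p in (gx * gx, gy * gy, gt * gt, gx * gy, gx * gt, gy * gt)
    )
    # det(M) and trace(M) of the symmetric structure tensor at every voxel.
    det = (xx * (yy * tt - yt * yt)
           - xy * (xy * tt - xt * yt)
           + xt * (xy * yt - yy * xt))
    trace = xx + yy + tt
    return det - k * trace ** 3  # large values mark space-time interest points
```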
Representing Motion
Space-Time Volumes
Remember image categorization…
[Diagram: Training — training images + training labels → image features → classifier training → trained classifier. Testing — test image → image features → trained classifier → prediction.]
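A minimal scikit-learn sketch of that pipeline; extract_features is a hypothetical stand-in for any image descriptor, and the linear SVM is just one reasonable classifier choice.

```python
from sklearn.svm import SVC

def train(train_images, train_labels, extract_features):
    """Training: images + labels -> features -> trained classifier."""
    X = [extract_features(img) for img in train_images]
    classifier = SVC(kernel="linear")
    classifier.fit(X, train_labels)
    return classifier

def predict(classifier, image, extract_features):
    """Testing: image -> features -> trained classifier -> prediction."""
    return classifier.predict([extract_features(image)])[0]
```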
Corner detectors in space-time
[Figure: "talk on phone" sequences, with vs. without keyframe detection]
Spatio-Temporal Binning of Interest Points
Results
[Figure: recognition results]
Action Recognition using Pose and Objects
Integrated reasoning
• Human pose estimation
[Figure: body-part layout — head, torso, right arm, left arm, right leg, left leg]
Integrated reasoning
• Human pose estimation
• Object detection
[Figure: detected tennis racket]
Integrated reasoning
• Human pose estimation
• Object detection
• Action categorization
[Figure: body-part layout (head, torso, right/left arm, right/left leg) together with the detected tennis racket]
Self-occlusion makes human pose estimation challenging; a detected object can facilitate pose estimation.
Small, low-resolution, partially occluded objects make object detection challenging; an estimated pose can facilitate object detection.
Mutual Context
[Figure: example actions — cricket shot, tennis forehand]
Classification accuracy: our model 83.3%; Gupta et al., 2009: 78.9%; bag-of-words (SIFT+SVM): 52.5%.