

2D, 3D Activity and Pose Recognition
Computer Vision
Dr Tripty Singh
Pose Estimation
• There are two main techniques by which pose estimation models detect human poses.
1. 2D Pose Estimation: the model estimates the locations of the body joints in 2D space relative to the input data (i.e., an image or video frame). Each key point's location is represented by X and Y coordinates.
2. 3D Pose Estimation: the model transforms a 2D image into a 3D representation by estimating an additional Z-dimension for each prediction, which enables accurate spatial positioning of the represented person or object (see the sketch below).
3D pose estimation is a significant challenge for machine learning engineers because of the complexity of building datasets and algorithms that must estimate several factors of an input image or video at once.
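Below is a minimal Python sketch of how 2D and 3D poses are commonly stored, assuming a COCO-style 17-joint skeleton; the joint list and sample coordinates are illustrative, not tied to any particular library.

    import numpy as np

    # COCO-style 17-joint skeleton (illustrative; this order follows the usual COCO convention).
    JOINTS = ["nose", "left_eye", "right_eye", "left_ear", "right_ear",
              "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
              "left_wrist", "right_wrist", "left_hip", "right_hip",
              "left_knee", "right_knee", "left_ankle", "right_ankle"]

    # 2D pose: one (x, y) pixel coordinate per joint -> array of shape (17, 2).
    pose_2d = np.zeros((len(JOINTS), 2), dtype=np.float32)

    # 3D pose: an additional Z (depth) value per joint -> array of shape (17, 3).
    pose_3d = np.zeros((len(JOINTS), 3), dtype=np.float32)

    # Example values (made up): the left wrist in pixels, and in metres from the camera.
    pose_2d[JOINTS.index("left_wrist")] = [412.0, 230.5]
    pose_3d[JOINTS.index("left_wrist")] = [0.21, -0.05, 1.87]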
Types of Human Pose Estimation Models
• There are three main types of human pose estimation models used to represent the human body in 2D and 3D planes.
• #1. Skeleton-based model: also called the kinematic model, this representation includes a set of key points (joints) such as ankles, knees, shoulders, elbows, and wrists, together with limb orientations, and is used for both 2D and 3D pose estimation.
• This flexible and intuitive model captures the human body's skeletal structure and is frequently used to model the relations between different body parts.
• #2. Contour-based model: also called the planar model, it is used for 2D pose estimation and consists of the contour and rough width of the torso and limbs. It represents the appearance and shape of the human body, with body parts displayed as boundaries and rectangles within the person's contour.
• A well-known example is the Active Shape Model (ASM), which captures the full human body graph and silhouette deformations using principal component analysis (PCA).
• #3. Volume-based model: also called the volumetric model, it is used for 3D pose estimation. It consists of popular 3D human body models and poses represented by geometric meshes and shapes, generally used for deep learning-based 3D human pose estimation.
Bottom-Up vs. Top-Down Methods of Pose Estimation
• All methods for human pose estimation can be classified into two primary approaches, bottom-up and top-down, as sketched below.
1. Bottom-up methods detect every body joint first and then group the joints into individual poses.
2. Top-down methods run a person detector first and then estimate the body joints within each detected bounding box.
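The two pipelines can be summarized in a short Python sketch. This is hedged pseudocode: detect_people, estimate_keypoints_in_box, detect_all_joints, and group_joints_into_people are hypothetical stand-ins for real detection and grouping models, not functions from any particular library.

    # Hypothetical helpers: detect_people, estimate_keypoints_in_box, detect_all_joints,
    # and group_joints_into_people stand in for real detection and grouping models.

    def top_down(image):
        """Person detector first, then per-person pose; cost grows with the number of people."""
        poses = []
        for box in detect_people(image):                          # 1. detect each person
            poses.append(estimate_keypoints_in_box(image, box))   # 2. pose inside the box
        return poses

    def bottom_up(image):
        """All joints first, then grouping; one pass regardless of crowd size."""
        joints = detect_all_joints(image)          # 1. every candidate joint in the image
        return group_joints_into_people(joints)    # 2. assemble joints into individual poses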
Pose Estimation
• Pose estimation works by locating key points on the body, such as the elbows, knees, and wrists. A classification algorithm can then determine what the person is doing based on how those points move.
• Pose estimation is used in many applications, including fitness and rehabilitation therapy. In rehabilitation, for example, pose estimation can track a patient's movements during exercises.
• Algorithms used in pose estimation include DensePose and PoseNet; the SURF feature extraction algorithm has also been used in pose estimation pipelines.
How Does Human Pose Estimation Work?
• Now that you know what pose estimation is, why it is essential, and the differences between the various methods, models, and techniques, it is time to look at how it works. This section is divided into three parts:
• Basic Structure
• Model Architecture Overview
• Different Approaches for Human Pose Estimation
How Does Human Pose Estimation Work?
• Several solutions have been proposed for the problem of human pose estimation. On the whole, existing methods can be subcategorized into three groups: absolute pose estimation, relative pose estimation, and combined pose estimation, which uses both.
• The first, absolute pose estimation, is based on satellite navigation signals, navigation beacons, active and passive landmarks, and heatmap matching. The second, relative pose estimation, is based on dead reckoning, which incrementally updates the pose by estimating the displacement from a known joint, i.e., the person's initial position and orientation.
• Fundamentally, most algorithms use human pose and orientation to predict a person's location with respect to the background. A common two-step framework first identifies human bounding boxes and then evaluates the pose within each box.
• Next, it estimates the key points for each person, i.e., the joints such as the elbows, knees, and wrists. Poses can be estimated for a single person or for multiple people, depending on the application.
• In single-pose estimation, the model estimates the pose of one person in a given scene. In contrast, in multi-pose estimation, the model estimates the poses of multiple people in the given input sequence.
Model Architecture Overview
• We cannot cover every specific neural network architecture in a single article, but a few robust, reliable ones make good starting points.
• Human pose estimation models come in the two varieties mentioned above, bottom-up and top-down. The most common architecture begins with an encoder that takes an input image and extracts features using a series of narrowing convolution blocks. What comes after the encoder depends on the pose estimation method.

• The most conceptually simplistic system practices a regressor to


final output predictions of each keypoint location by accepting an
input image and outputs X, Y, and Z coordinates for each key point
you’re attempting to predict. However, practically this architecture
is not used as it does not produce accurate results without further
refinement.
• A somewhat more complex approach uses an encoder-decoder architecture. Instead of regressing joint coordinates directly, the encoder output is fed into a decoder that generates heatmaps. Each heatmap represents the likelihood that a joint is present in a given region of the input image.
• The precise coordinates are chosen during post-processing by selecting the heatmap locations with the highest joint likelihood. In multi-pose estimation, a heatmap can contain multiple regions of high keypoint likelihood (for instance, two or more left hands in an image), and each such location must be assigned to a specific human model.
• The architectures discussed above apply equally to 2D and 3D pose estimation; a minimal sketch of the heatmap decoding step follows.
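Here is a minimal NumPy sketch of the post-processing step just described, decoding a stack of heatmaps into keypoint coordinates; the 0.3 confidence threshold is an assumed value.

    import numpy as np

    def decode_heatmaps(heatmaps, threshold=0.3):
        """heatmaps: array of shape (K, H, W), one likelihood map per keypoint.
        Returns an array of shape (K, 3): x, y in heatmap coordinates plus the peak score."""
        K, H, W = heatmaps.shape
        keypoints = np.zeros((K, 3), dtype=np.float32)
        for k in range(K):
            idx = np.argmax(heatmaps[k])           # location of highest joint likelihood
            y, x = np.unravel_index(idx, (H, W))
            score = heatmaps[k, y, x]
            if score >= threshold:                 # assumed cut-off; drop low-confidence joints
                keypoints[k] = (x, y, score)
        return keypoints

In the multi-pose case, the single argmax per map would be replaced by finding all sufficiently strong local peaks and assigning each one to an individual person.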
There are many public datasets available for both 2D and 3D pose estimation.
• 2D Pose Estimation Datasets
• MPII Human Pose Dataset
• Leeds Sports Pose
• Frames Labeled in Cinema
• Frames Labeled in Cinema Plus
• YouTube Pose (VGG)
• BBC Pose (VGG)
• COCO Keypoints
• 3D Pose Estimation Datasets
• DensePose
• UP-3D
• Human3.6m
• 3D Poses in the Wild
• HumanEva
• Total Capture
• SURREAL (Synthetic hUmans foR REAL tasks)
• JTA Dataset
• MPI-INF-3DHP
• 1. Human Activity & Movement Estimation
• One of the most obvious applications of pose estimation is tracking and measuring human activity and movement.
• Architectures like OpenPose, PoseNet, and DensePose are often used for action, gesture, or gait recognition. Some examples of human activity tracking are:
– AI-powered sports coaches and personal gym trainers
– Sitting gesture detection
– Workplace activity monitoring
– Sign language communication for the disabled
– Traffic police signal detection
– Cricket umpire signal detection
– Dance technique detection
– Movement monitoring for security and surveillance
– Crowd counting and tracking for retail outlets
2. Augmented Reality & Virtual Reality (AR/VR)
Combined with augmented and virtual reality applications, human pose estimation makes it possible to create more realistic and responsive experiences. For example, you can learn games like tennis or golf from virtual tutors whose poses are illustrated by pose estimation. The US Army has also implemented AR programs in combat that help soldiers distinguish between enemy and friendly troops.
3. Robotics
Traditional industrial robots are based on 2D vision
systems with many limitations. In place of manually
programming robots to learn movements, a 3D pose
estimation technique can be employed. This approach
creates more responsive, flexible, and true-to-life
robotics systems. It enables robots to understand
actions and movements by following the tutor’s
posture, look, or appearance.
4. Animation & Gaming
Modern advancements in pose estimation and motion capture technology make character animation a streamlined and automated process. For example, Microsoft's Kinect depth camera captures human motion in real time using IR sensor data and uses it to render characters' actions in the gaming environment. Likewise, capturing animations for immersive video game experiences can be automated by various pose estimation architectures.

Pose estimation is a captivating computer vision component used across multiple domains, including technology, healthcare, and gaming. I hope this comprehensive guide on human pose estimation has explained the basics, its working principles, and how it can be applied in the real world.
What is an action?

Action: a transition from one state to another
• Who is the actor?
• How is the state of the actor changing?
• What (if anything) is being acted on?
• How is that thing changing?
• What is the purpose of the action (if any)?
Human activity in video

No universal terminology, but approximately:
• "Actions": atomic motion patterns -- often gesture-like; a single clear-cut trajectory; a single nameable behavior (e.g., sit, wave arms)
• "Activity": series or composition of actions (e.g., interactions between people)
• "Event": combination of activities or actions (e.g., a football game, a traffic accident)
Adapted from Venu Govindaraju
How do we represent actions?

Categories
Walking, hammering, dancing, skiing, sitting down, standing up, jumping

Poses

Nouns and Predicates
<man, swings, hammer>
<man, hits, nail, w/ hammer>
What is the purpose of action recognition?
Human Action Recognition (HAR) aims to understand human behavior and assign a label to each action. It has a wide range of applications and has therefore been attracting increasing attention in the field of computer vision.

Deep learning models for human activity recognition: As of today, neural networks have proven to be the most effective at activity recognition. In particular, two approaches, Convolutional Neural Network models and Recurrent Neural Network models, are the most widely used for this task.
• Activity recognition is the problem of predicting the movement of a person, often indoors, based on sensor data, such as an accelerometer in a smartphone.
• Streams of sensor data are often split into sub-sequences called windows, and each window is associated with a broader activity; this is called a sliding window approach.
• Convolutional neural networks and long short-term memory networks, and perhaps both together, are best suited to learning features from raw sensor data and predicting the associated activity.
What is the purpose of action recognition?
In computer vision, action recognition is the task of identifying when a person in an image or video is performing a given action. AI models can be trained to recognize a variety of actions, from running and sleeping to drinking, falling, or riding a bike.

Human activity recognition, or HAR, is a challenging time series classification task.
It involves predicting the movement of a person based on sensor data and traditionally requires deep domain expertise and signal processing methods to correctly engineer features from the raw data before fitting a machine learning model.
Recently, deep learning methods such as convolutional neural networks and recurrent neural networks have proven capable, even achieving state-of-the-art results, by automatically learning features from the raw sensor data.
Human Activity Recognition
Traditionally, methods from the field of signal processing were used to analyze and distill the collected sensor data.
Such methods were used for feature engineering: creating domain-specific, sensor-specific, or signal processing-specific features and views of the original data. Statistical and machine learning models were then trained on the processed version of the data.
A limitation of this approach is the signal processing and domain expertise required to analyze the raw data and engineer the features needed to fit a model. This expertise is required for each new dataset or sensor modality; in essence, it is expensive and does not scale.
• Ideally, learning methods could automatically learn the features required to make accurate predictions directly from the raw data. This would allow new problems, new datasets, and new sensor modalities to be adopted quickly and cheaply.
• Deep neural network models have started delivering on this promise of feature learning and are achieving state-of-the-art results for human activity recognition. They are capable of performing automatic feature learning from the raw sensor data and outperform models fit on hand-crafted features.
Human Activity Recognition
Feature extraction and model building are often performed simultaneously in deep learning models: the features are learned automatically through the network instead of being manually designed. In addition, a deep neural network can extract high-level representations in its deeper layers, which makes it better suited to complex activity recognition tasks.
Two main neural network approaches are appropriate for time series classification and have been demonstrated to perform well on activity recognition using sensor data from commodity smartphones and fitness tracking devices: Convolutional Neural Network models and Recurrent Neural Network models.
RNNs and LSTMs are recommended for recognizing short activities that have a natural order, while CNNs are better at inferring long-term repetitive activities. The reason is that RNNs can exploit the time-order relationship between sensor readings, while CNNs are better at learning deep features contained in recursive patterns.
Surveillance

http://users.isr.ist.utl.pt/~etienne/mypubs/Auvinetal06PETS.pdf
Supervised Learning Data Representation
• Before we dive into the specific neural networks that can be used for human activity recognition, we need to talk about data preparation.
• Both types of neural networks suitable for time series classification require that data be prepared in a specific manner to fit a model, that is, in a supervised-learning form that allows the model to associate signal data with an activity class.
• A straightforward data preparation approach, used both for classical machine learning methods on hand-crafted features and for neural networks, divides the input signal data into windows, where a given window may contain one to a few seconds of observation data. This is often called a 'sliding window.'
• Human activity recognition aims to infer the actions of one or more persons from a set of observations captured by sensors. Usually this follows a fixed-length sliding window approach for feature extraction, where two parameters have to be fixed, the size of the window and the shift, as sketched below.
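Here is a minimal NumPy sketch of that windowing step, assuming a (T, C) sensor stream with one integer activity label per timestep. The window of 128 samples with a shift of 64 mirrors a common convention (2.56-second windows with 50% overlap at 50 Hz), but both values are assumptions to tune per dataset.

    import numpy as np

    def sliding_windows(signal, labels, window=128, shift=64):
        """signal: (T, C) raw sensor stream; labels: (T,) per-timestep activity ids.
        Returns fixed-length windows paired with each window's majority label."""
        X, y = [], []
        for start in range(0, len(signal) - window + 1, shift):
            X.append(signal[start:start + window])
            y.append(np.bincount(labels[start:start + window]).argmax())  # majority vote
        return np.array(X), np.array(y)

    # Usage (hypothetical 50 Hz tri-axial accelerometer stream):
    # X, y = sliding_windows(accel, activity_ids)   # X: (n_windows, 128, 3), y: (n_windows,)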
Convolutional Neural Network Models
Convolutional Neural Network models, or CNNs for short, are a type of deep neural network originally developed for image data, e.g., handwriting recognition.

They have proven very effective on challenging computer vision problems when trained at scale, for tasks such as identifying and localizing objects in images and automatically describing the content of images.

They are comprised of two main types of elements: convolutional layers and pooling layers.

Convolutional layers read an input, such as a 2D image or a 1D signal, using a kernel that reads small segments at a time and steps across the entire input field. Each read projects the input onto a feature map, representing an internal interpretation of the input.

Pooling layers take the feature map projections and distill them to their most essential elements, for example using a signal averaging or signal maximizing process.
Convolutional Neural Network Models
The convolution and pooling layers can be repeated in depth, providing multiple layers of abstraction of the input signals.

The output of these networks often passes through one or more fully connected layers that interpret what has been read and map this internal representation to a class value.

The CNN model learns to map a given window of signal data to an activity: the model reads across each window of data and prepares an internal representation of the window.

When applied to time series classification such as HAR, CNNs have two advantages over other models: local dependency and scale invariance. Local dependency means that nearby signals in HAR are likely to be correlated, while scale invariance refers to invariance across different paces or frequencies.

The first important work applying CNNs to HAR was by Ming Zeng et al. in their 2014 paper "Convolutional Neural Networks for Human Activity Recognition using Mobile Sensors."

In the paper, the authors develop a simple CNN model for accelerometer data, in which each axis of the accelerometer is fed into separate convolutional and pooling layers, which are then concatenated before being interpreted by fully connected hidden layers.

The figure from the paper clearly shows the topology of the model and provides a good template for how a CNN may be used for HAR problems and time series classification in general.
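To make the shape of such a model concrete, here is a minimal Keras sketch of a 1D CNN over windowed sensor data. It is a simplified single-branch illustration, not the exact per-axis topology of Zeng et al.; the window length (128 steps), channel count (3), and class count (6) are assumed values.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv1D, MaxPooling1D, Dropout, Flatten, Dense

    n_timesteps, n_channels, n_classes = 128, 3, 6   # assumed window, sensor, and label sizes

    model = Sequential([
        Conv1D(64, kernel_size=3, activation="relu",
               input_shape=(n_timesteps, n_channels)),   # kernel steps across the window
        Conv1D(64, kernel_size=3, activation="relu"),
        Dropout(0.5),
        MaxPooling1D(pool_size=2),                       # distills feature maps to essentials
        Flatten(),
        Dense(100, activation="relu"),                   # fully connected interpretation
        Dense(n_classes, activation="softmax"),          # one probability per activity
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(X, y, epochs=10, batch_size=32)  # X: (n_windows, 128, 3), y: integer labels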
How is a CNN used for human activity recognition?
• There are many ways to model HAR problems with CNNs.
• One interesting example was by Heeryon Cho and Sang Min Yoon in their 2018 paper titled "Divide and Conquer-Based 1D CNN Human Activity Recognition Using Test Data Sharpening."
• In it, they divide activities into those that involve movement, called "dynamic," and those where the subject is stationary, called "static," and develop a CNN model to discriminate between these two main classes. Then, within each class, separate models are developed to discriminate between activities of that type, such as "walking" for dynamic and "sitting" for static.
• Quite large CNN models were developed, which in turn allowed the authors to claim state-of-the-art results on challenging standard human activity recognition datasets.
• Another interesting approach was proposed by Wenchao Jiang and Zhaozheng Yin in their 2015 paper titled "Human Activity Recognition Using Wearable Sensors by Deep Convolutional Neural Networks."
• Instead of using 1D CNNs on the signal data, they combine the signals to create "images," which are then fed to a 2D CNN and processed as image data, with convolutions along the time axis of the signals and across the signal variables, specifically accelerometer and gyroscope data.
Recurrent Neural Network Models
• Recurrent neural networks, or RNNs for short, are a type
of neural network that was designed to learn from
sequence data, such as sequences of observations over
time, or a sequence of words in a sentence.
• A specific type of RNN called the long short-term memory
network, or LSTM for short, is perhaps the most widely
used RNN as its careful design overcomes the general
difficulties in training a stable RNN on sequence data.
• LSTMs have proven effective on challenging sequence
prediction problems when trained at scale for such tasks
as handwriting recognition, language modeling, and
machine translation.
• A layer in an LSTM model is comprised of special units
that have gates that govern input, output, and recurrent
connections, the weights of which are learned. Each LSTM
unit also has internal memory or state that is
accumulated as an input sequence is read and can be
used by the network as a type of local variable or
memory register.
• Like a CNN reading across an input sequence, the LSTM reads a sequence of input observations and develops its own internal representation of the input sequence. Unlike the CNN, the LSTM is trained in a way that pays specific attention to the observations and prediction errors made over the time steps of the input sequence, using backpropagation through time.
• The LSTM learns to map each window of sensor data to an activity; the observations in the input sequence are read one at a time, and each time step may comprise one or more variables (e.g., parallel sequences).
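Here is a minimal Keras sketch of an LSTM mapping a window of sensor data to an activity, under the same assumed shapes as the CNN sketch above.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dropout, Dense

    n_timesteps, n_channels, n_classes = 128, 3, 6   # same assumed shapes as above

    model = Sequential([
        LSTM(100, input_shape=(n_timesteps, n_channels)),  # reads the window one step at a time
        Dropout(0.5),
        Dense(100, activation="relu"),
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])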
Depiction of an LSTM RNN for activity recognition, taken from "Deep Recurrent Neural Networks for Human Activity Recognition."
• It may be more common to use an LSTM in conjunction with a CNN on HAR problems, in a CNN-LSTM or ConvLSTM model.
• Here a CNN model extracts features from each subsequence of raw sample data, and the CNN's output features for each subsequence are then interpreted by an LSTM in aggregate (sketched below).
• An example of this is the 2016 paper by Francisco Javier Ordonez and Daniel Roggen titled "Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition."
• "We introduce a new DNN framework for wearable activity recognition, which we refer to as DeepConvLSTM. This architecture combines convolutional and recurrent layers. The convolutional layers act as feature extractors and provide abstract representations of the input sensor data in feature maps. The recurrent layers model the temporal dynamics of the activation of the feature maps."
• A deep network architecture is used, with four convolutional layers and no pooling layers, followed by two LSTM layers that interpret the extracted features over multiple time steps.
• The authors claim that removing the pooling layers is a critical part of their model architecture, because pooling layers after the convolutional layers interfere with the convolutional layers' ability to learn to downsample the raw sensor data.
• "In the literature, CNN frameworks often include convolutional and pooling layers successively, as a measure to reduce data complexity and introduce translation invariant features. Nevertheless, such an approach is not strictly part of the architecture, and in the time series domain [...] DeepConvLSTM does not include pooling operations because the input of the network is constrained by the sliding window mechanism [...] and this fact limits the possibility of downsampling the data, given that DeepConvLSTM requires a data sequence to be processed by the recurrent layers. However, without the sliding window requirement, a pooling mechanism could be useful to cover different sensor data time scales at deeper layers."
Depiction of a CNN-LSTM model for activity recognition, taken from "Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition."
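To make the CNN-LSTM idea concrete, here is a hedged Keras sketch: a CNN extracts features from each subsequence of the window, and an LSTM interprets the per-subsequence features in order. This is a simplified illustration, not the DeepConvLSTM architecture itself; the 4x32 subsequence split and layer sizes are assumptions.

    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import TimeDistributed, Conv1D, Flatten, LSTM, Dropout, Dense

    n_subseq, subseq_len, n_channels, n_classes = 4, 32, 3, 6  # a 128-step window split into 4

    model = Sequential([
        # The CNN runs on each 32-step subsequence independently...
        TimeDistributed(Conv1D(64, kernel_size=3, activation="relu"),
                        input_shape=(n_subseq, subseq_len, n_channels)),
        TimeDistributed(Conv1D(64, kernel_size=3, activation="relu")),
        TimeDistributed(Flatten()),
        # ...and the LSTM interprets the per-subsequence features in temporal order.
        LSTM(100),
        Dropout(0.5),
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # Note: no pooling layers here, echoing the DeepConvLSTM design choice quoted above.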
Interfaces
W. T. Freeman and C. Weissman, "Television control by hand gestures," International Workshop on Automatic Face- and Gesture-Recognition, IEEE Computer Society, Zurich, Switzerland, June 1995, pp. 179-183. MERL-TR94-24
How can we identify actions?
Cues include motion, pose, held objects, and nearby objects.
Representing Motion
• Optical flow with motion history (Bobick & Davis, 2001)
• Optical flow with split channels (Efros et al., 2003)
• Tracked points (Matikainen et al., 2009)
• Space-time interest points: corner detectors in space-time (Laptev, 2005)
• Space-time volumes (Blank et al., 2005)
Examples of Action Recognition Systems
• Feature-based classification
• Recognition using pose and objects
Action recognition as classification
Retrieving actions in movies, Laptev and Perez, 2007
Remember image categorization…
Training: training images -> image features -> classifier training (with training labels) -> trained classifier
Testing: test image -> image features -> trained classifier -> prediction (e.g., "Outdoor")
Remember spatial pyramids…
Compute a histogram in each spatial bin
Features for Classifying Actions
1. Spatio-temporal pyramids (14x14x8 bins) over:
– Image gradients
– Optical flow
2. Spatio-temporal interest points: corner detectors in space-time, with descriptors based on Gaussian derivative filters over x, y, and time
Searching the video for an action
1. Detect keyframes using a trained HOG detector in each frame
2. Classify detected keyframes as positive (e.g., "drinking") or negative ("other")
Accuracy in searching video
Results are reported with and without keyframe detection for actions such as "talk on phone" and "get out of car."

Learning realistic human actions from movies, Laptev et al. 2008
Approach
• Space-time interest point detectors
• Descriptors: HOG, HOF
• Pyramid histograms (3x3x2)
• SVMs with a chi-squared kernel (see the sketch after this list)
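As a minimal scikit-learn sketch of that last step, here is an SVM with a precomputed chi-squared kernel over histogram features; the toy data and dimensions are illustrative assumptions, not values from the paper.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics.pairwise import chi2_kernel

    # Toy non-negative histogram features standing in for spatio-temporal pyramid histograms.
    rng = np.random.default_rng(0)
    X_train = rng.random((40, 3 * 3 * 2 * 8))     # e.g. 3x3x2 space-time bins x 8 orientations
    y_train = rng.integers(0, 2, size=40)         # toy binary action labels
    X_test = rng.random((5, X_train.shape[1]))

    K_train = chi2_kernel(X_train, X_train)       # chi-squared Gram matrix over histograms
    clf = SVC(kernel="precomputed").fit(K_train, y_train)

    K_test = chi2_kernel(X_test, X_train)         # rows: test samples, cols: training samples
    pred = clf.predict(K_test)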

(Figures: spatio-temporal binning of interest points; results)
Action Recognition using Pose and Objects
Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities, B. Yao and L. Fei-Fei, 2010
Slide Credit: Yao/Fei-Fei
Human-Object Interaction
Holistic image based classification vs. integrated reasoning:
• Human pose estimation (body parts: head, torso, left/right arms, left/right legs)
• Object detection (e.g., tennis racket)
• Action categorization (HOI activity: tennis forehand)
Slide Credit: Yao/Fei-Fei
Human pose estimation & Object detection
Human pose estimation is challenging: difficult part appearance, self-occlusion, and image regions that look like body parts.
• Felzenszwalb & Huttenlocher, 2005
• Ren et al., 2005
• Ramanan, 2006
• Ferrari et al., 2008
• Yang & Mori, 2008
• Andriluka et al., 2009
• Eichner & Ferrari, 2009
Slide Credit: Yao/Fei-Fei
Human pose estimation & Object detection
Given that the object is detected, pose estimation is facilitated.
Slide Credit: Yao/Fei-Fei
Human pose estimation & Object detection
Object detection is challenging: objects can be small, low-resolution, or partially occluded, and image regions can look similar to the detection target.
• Viola & Jones, 2001
• Lampert et al., 2008
• Divvala et al., 2009
• Vedaldi et al., 2009
Slide Credit: Yao/Fei-Fei
Human pose estimation & Object detection
Given that the pose is estimated, object detection is facilitated.
Slide Credit: Yao/Fei-Fei
Human pose estimation & Object detection
Mutual context: each task facilitates the other.
Slide Credit: Yao/Fei-Fei
Mutual Context Model Representation
A: activity, e.g., tennis forehand, croquet shot, volleyball smash
O: object, e.g., tennis racket, croquet mallet, volleyball
H: human pose, composed of body parts P1, P2, ..., PN. There can be more than one H for each A (intra-class variations), and H is unobserved during training.
P: each body part has a location lP, an orientation θP, and a scale sP.
f: image evidence for the object (fO) and for each body part (f1, ..., fN), described with shape context features [Belongie et al., 2002].
Slide Credit: Yao/Fei-Fei
Activity Classification Results
Classification accuracy (bar chart; example classes shown include "cricket shot" and "tennis forehand"):
• Our model: 83.3%
• Gupta et al., 2009: 78.9%
• Bag-of-Words (SIFT+SVM): 52.5%
Slide Credit: Yao/Fei-Fei


Take-home messages
• Action recognition is an open problem.
– How to define actions?
– How to infer them?
– What are good visual cues?
– How do we incorporate higher level reasoning?
Take-home messages
• Some work has been done, but it is just the beginning of exploring the problem. So far…
– Actions are mainly categorical
– Most approaches are classification using simple features (spatio-temporal histograms of gradients or flow, s-t interest points, SIFT in images)
– Only a couple of works address how to incorporate pose and objects
– There is not much idea of how to reason about long-term activities or how to describe video sequences
