
Densepose: Dense Human Pose Estimation in The Wild: Seminar: Vision Systems Ma-Inf 4208

This document summarizes a paper on DensePose, a method for dense human pose estimation. DensePose establishes dense correspondences between pixels in an RGB image of a person and the person's 3D surface model. The authors introduced the DensePose-COCO dataset with over 5 million manually annotated correspondences. DensePose-RCNN was also introduced, which combines DenseReg architecture with Mask RCNN to predict dense correspondences. The DensePose method operates at multiple frames per second by using a 2D parametrization of the human body shape.


DensePose: Dense Human Pose Estimation in the Wild
Seminar: Vision Systems MA-INF 4208

Trajkovic, Jelena

Universität Bonn
[email protected], Matrikelnummer: 3304211

Abstract. This is an analysis of the paper "DensePose: Dense Human Pose
Estimation in the Wild". Human pose estimation is one of the central tasks
in computer vision, and it has advanced rapidly with the progress of neural
networks in the era of deep learning. The paper aims at a surface-based
representation of the human body, obtained by establishing correspondences
between all pixels of an RGB image that belong to humans and a 3D surface
model. The authors introduce the DensePose-COCO dataset by collecting more
than 5 million manually annotated correspondences. To predict dense
correspondences, they introduce DensePose-RCNN, which combines the DenseReg
architecture with Mask R-CNN. Human pose estimation has many applications,
such as sentiment analysis, full-body tracking and augmented reality.

1 Introduction

Human analysis is one of the most important tasks in computer vision. It includes
several sub-problems, such as classification (is there a person in the image?),
person detection (placing a bounding box around every person in the image), and
part and instance segmentation (associating image regions with body parts). The
method proposed in this work treats these problems as prerequisites and builds
on top of their outcome.
DensePose is an end-to-end trainable architecture that operates at multiple
frames per second. To achieve this, a 2D parametrization of the human body
surface is used. Because the human body is highly articulated, a good
parametrization requires dividing it into smaller and simpler body parts
(patches). The body surface is divided into 24 patches, each isometric to a
plane. By flattening each of these patches through UV unwrapping, every node of
the surface is associated with a local UV coordinate (Fig. 1). The basic idea is
the following: consider a single pixel of the image. In the first step, the
network classifies the pixel as belonging either to the background or to one of
the 24 patches, which gives a rough estimate of the surface coordinates. In the
next step, a regression model predicts the exact coordinates of the pixel within
the patch. By repeating this procedure for all pixels of the image, we obtain a
dense correspondence between the image and the surface. How can the system be
trained to perform such a task? Before answering that question, we briefly
discuss some relevant topics.

Fig. 1. 3D human template and UV unwrapping.

1.1 Relevant topics

In recent years, CNNs have brought a major breakthrough in image recognition.
CNNs extract features from images. The features are learned in the hidden
layers during training on a set of images; the more layers, the greater the
complexity of the learned features. Classical convolutional neural networks are
used for the classification task. R-CNN addresses the problem of detecting
multiple objects in an image. It uses a region proposal algorithm, for example
selective search, to extract up to 2000 region proposals. Each region is warped
to a fixed size and fed to a pre-trained CNN, whose features are passed to SVMs
that make the classification decision. In addition to category labels, R-CNN
predicts for each proposal a correction to the box produced at the region
proposal stage. Evaluating each of the 2000 warped regions with the CNN is
clearly computationally expensive.
Fast R-CNN addresses this problem in the following way: instead of generating
CNN feature vectors for each region proposal, it extracts a set of feature maps
from the entire image, just once. In parallel, a set of ROIs (regions of
interest) is found using selective search or another method. For each region
proposal, the corresponding part of the feature map is extracted. Because fully
connected layers expect fixed-size input, this part is resized to a fixed size
by the ROI pooling layer. Generating region proposals with selective search is
still very expensive.
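
To make the ROI pooling step more concrete, the following is a minimal sketch
in plain Python, not the Fast R-CNN implementation. It assumes a single-channel
feature map stored as a list of rows and an ROI given in feature-map cells; the
function name roi_max_pool and the 2 x 2 output size are illustrative choices.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool one region of a 2D feature map into an out_h x out_w grid.

    feature_map: list of rows (H x W) of numbers, a single channel.
    roi: (top, left, bottom, right) in feature-map cells; bottom and right
         are exclusive. The ROI is assumed to lie inside the feature map.
    """
    top, left, bottom, right = roi
    height, width = bottom - top, right - left
    pooled = []
    for by in range(out_h):
        # Integer bin edges; every bin covers at least one cell.
        y0 = top + (by * height) // out_h
        y1 = max(y0 + 1, top + ((by + 1) * height) // out_h)
        row = []
        for bx in range(out_w):
            x0 = left + (bx * width) // out_w
            x1 = max(x0 + 1, left + ((bx + 1) * width) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(y0, y1)
                           for x in range(x0, x1)))
        pooled.append(row)
    return pooled


# A 4 x 6 feature map whose cell (r, c) holds the value r * 6 + c.
feature_map = [[r * 6 + c for c in range(6)] for r in range(4)]
whole = roi_max_pool(feature_map, (0, 0, 4, 6))
inner = roi_max_pool(feature_map, (1, 1, 4, 4))
```

Regions of different sizes (here 4 x 6 and 3 x 3 cells) both come out as 2 x 2
grids, which is what allows the fully connected layers to be shared across
proposals.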
Faster R-CNN addresses this problem with a region proposal network (RPN), so
that the network predicts its own region proposals. The input image is run
through the convolutional layers to obtain a convolutional feature map
representing the entire high-resolution image. A separate region proposal
network works on top of the convolutional features and predicts region
proposals inside the network. Once the region proposals are predicted, the rest
looks like Fast R-CNN. The RPN has a classifier (the probability that a proposal
contains a target object) and a regressor (which regresses the coordinates of
the object).
Mask R-CNN is a method for instance segmentation that looks a lot like Faster
R-CNN. The input image goes through a convolutional network and a learned
region proposal network. The learned region proposals are then projected onto
the convolutional feature map. Now, rather than only making a classification
and a bounding box regression decision for each of those boxes, we also want to
predict a segmentation mask. After ROIAlign there are two branches (Fig. 2). The
branch on the right looks like Faster R-CNN: it predicts classification scores
(the category corresponding to the region proposal) and bounding box
coordinates. The branch on the left looks like a small semantic segmentation
network.

Fig. 2. Mask R-CNN vs Faster R-CNN

2 Related work

Human pose estimation is commonly interpreted as the problem of localizing the
joints of the human body. OpenPose (Cao et al., 2018) is a very popular
bottom-up approach to multi-person human pose estimation. It first detects the
keypoints of every person in the image, and then uses a non-parametric
representation to learn the association between body parts and persons in the
image.
Instead of working with 10-20 joints, the authors of this paper interpret
an image with a mesh consisting of thousands of nodes, by establishing dense
correspondences between the 2D image and the 3D body shape.
In (Bristow et al., 2015), the authors aimed at recovering dense
correspondences between pairs of RGB images. In (Zhou et al., 2016), the authors
tackled the problem of establishing dense correspondences between distinct
objects. Both works took an unsupervised approach, for general categories.
The work of the authors is inspired by the progress in describing 2D images
with 3D deformable models of the human body, such as the SMPL model (Loper et
al., 2015). In (Bogo et al., 2016), a CNN is used to predict the positions of
the individual joints, and a faithful representation of human body pose and
shape is then created by recovering the SMPL parameters through optimization. In
(Kanazawa et al., 2017) and (Pavlakos et al., 2018), human shape is estimated by
regressing the SMPL parameters with a CNN.
The authors do not rely on the SMPL model at test time; they use it only to
define the dense correspondence task.

3 Methods
3.1 DensePose-COCO Dataset
There exist many image datasets for training computer vision models that
contain annotated images and labeled objects: ImageNet, LSUN, MS COCO, etc.
However, the idea is not just to label the person, but to relate the person to
a 3D surface model.
DensePose is a strongly supervised approach. The authors introduced the first
manually collected ground-truth dataset for this task by gathering dense
correspondences between the SMPL model and the persons appearing in the COCO
dataset. This requires human annotators to manually relate 2D images to a
surface-based representation of the human body. A two-stage annotation pipeline
is used to make gathering annotations more efficient.
The first stage segments the images into semantic part regions. The annotators
are asked to segment the body into 24 parts, ignoring hair, clothes and other
accessories.

Fig. 3. Step1: part segmentation

For the second stage, the authors sample every part region with a set of
roughly equidistant points obtained via k-means. The number of sampled points
depends on the region size, with up to 14 points per part. The annotators are
asked to bring each point into correspondence with the body surface. This task
is simplified by providing 6 pre-rendered views of the specific body part. The
annotator can place the landmark on any of them, while having 6 different
perspectives of the annotated point.
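
The sampling step can be illustrated with a small k-means sketch in plain
Python. This is not the authors' annotation tool: the rule that ties the number
of points to the region size (pixels_per_point) is a hypothetical assumption,
and only the cap of 14 points per part comes from the text above.

```python
import random


def sample_part_points(pixels, max_points=14, pixels_per_point=50,
                       iters=20, seed=0):
    """Pick roughly evenly spread points inside one part region via k-means.

    pixels: list of (x, y) coordinates that belong to the part region.
    pixels_per_point: hypothetical density rule (not from the paper).
    """
    k = max(1, min(max_points, len(pixels) // pixels_per_point))
    k = min(k, len(pixels))
    rng = random.Random(seed)
    centers = rng.sample(pixels, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for (x, y) in pixels:
            # Assign the pixel to its nearest center.
            j = min(range(k), key=lambda c: (x - centers[c][0]) ** 2
                    + (y - centers[c][1]) ** 2)
            clusters[j].append((x, y))
        # Move each center to its cluster mean; keep it if the cluster is empty.
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
    # Snap every center to the nearest region pixel so samples lie in the part.
    return [min(pixels, key=lambda p: (p[0] - cx) ** 2 + (p[1] - cy) ** 2)
            for cx, cy in centers]


large_region = [(x, y) for x in range(40) for y in range(40)]
small_region = [(x, y) for x in range(10) for y in range(10)]
large_samples = sample_part_points(large_region)
small_samples = sample_part_points(small_region)
```

With these assumed settings the 40 x 40 region reaches the cap of 14 points,
while the 10 x 10 region receives only 2.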

Fig. 4. Step2: marking correspondences

Such annotations are collected for 50,000 persons in the COCO dataset. Fig. 5
shows a visualization of annotations from COCO, colored according to the U and V
coordinates of the annotated points. For the U coordinates the color changes
horizontally, and for the V coordinates it changes vertically.
How well did the annotators perform?

Fig. 5. Visualization of annotations: image (left), U (middle) and V (right)
coordinates of the annotated points.

3.2 Accuracy of human annotators


To measure the labeling accuracy, the authors use synthetic images for which
the ground-truth coordinates are known. The accuracy is computed from the
geodesic distance between the true position and the position estimated by the
annotators. The geodesic distance is the length of the shortest path along the
surface. For every image k, the geodesic distance between the correct surface
point i and the estimated surface point $\hat{i}_k$ is computed:

$d_{i,k} = g(i, \hat{i}_k)$

The error is estimated on the set of sampled surface points and interpolated
over the remainder of the surface. Then the error is averaged over multiple
images. As we can see in Fig. 7, the annotators are less accurate on larger,
uniform areas (torso, upper legs). Two types of evaluation measures are
considered to evaluate correspondence accuracy over the whole human body.

Fig. 6. Synthetic image with sampled and collected points from the torso.

Fig. 7. Average human annotation error as the function of surface position.

Pointwise evaluation: in this approach, the ratio of correct points (RCP) is
considered as a function f(t) of the geodesic error threshold t. The area under
the curve, $AUC_a$, summarizes the correspondence accuracy:

$AUC_a = \frac{1}{a} \int_0^a f(t)\, dt$
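
As a rough illustration of the pointwise measure, the sketch below computes the
RCP at a threshold and approximates $AUC_a$ with the trapezoidal rule. The
function names, the number of integration steps and the example errors are
assumptions made for this example, not part of the paper.

```python
def ratio_of_correct_points(errors, threshold):
    """Fraction of points whose geodesic error is at most `threshold`."""
    return sum(e <= threshold for e in errors) / len(errors)


def auc(errors, a, steps=1000):
    """Approximate AUC_a = (1/a) * integral_0^a f(t) dt (trapezoidal rule)."""
    dt = a / steps
    values = [ratio_of_correct_points(errors, i * dt) for i in range(steps + 1)]
    integral = sum((values[i] + values[i + 1]) / 2 * dt for i in range(steps))
    return integral / a


# Hypothetical geodesic errors in metres.
perfect = auc([0.0, 0.0, 0.0], a=0.1)
half = auc([0.05, 0.05], a=0.1)
missed = auc([0.5, 0.7], a=0.1)
```

Dividing by a keeps the score between 0 and 1, so results for different
thresholds a are directly comparable.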

Per-instance evaluation: for this type of evaluation, the authors define the
geodesic point similarity (GPS) measure:

$GPS_j = \frac{1}{|P_j|} \sum_{p \in P_j} \exp\left( \frac{-g(i_p, \hat{i}_p)^2}{2\kappa^2} \right)$

With the formula given above, GPS is calculated for every person instance j,
where $P_j$ is the set of ground-truth points annotated on that person and
$g(i_p, \hat{i}_p)$ is the geodesic distance between the corresponding surface
points. The normalizing parameter $\kappa$ is set to 0.255 m, so that a single
point has a GPS value of 0.5 if its geodesic distance is approximately 0.3 m.
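
The GPS formula can be checked numerically with a few lines of Python. This is
a sketch that assumes the per-point geodesic distances are already available in
metres; the function name is illustrative.

```python
import math


def geodesic_point_similarity(geodesic_distances, kappa=0.255):
    """GPS of one person instance from per-point geodesic distances (metres)."""
    scores = [math.exp(-(d ** 2) / (2 * kappa ** 2)) for d in geodesic_distances]
    return sum(scores) / len(scores)


exact_match = geodesic_point_similarity([0.0])
at_30_cm = geodesic_point_similarity([0.3])
```

A perfect match yields a GPS of 1, and a single point that is 0.3 m away yields
a value very close to 0.5, which agrees with the choice of the normalizing
parameter described above.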
With a large-scale, clean dataset collected, the next step is to train a deep
neural network to predict dense image-to-surface correspondences.

3.3 Learning DensePose Human Estimation

The authors have experimented with a fully convolutional network (FCN) and a
region-based network.
FCN approach: classification and regression tasks are combined. Each pixel is
first classified as belonging to the background or to one of the 24 body parts,
which is a 25-class classification:

$c^* = \arg\max_c P(c \mid i)$

Here $c^*$ is the class to which pixel i belongs with the highest probability.
After determining the class, a regression is performed to find the U, V
parametrization of the exact point of the pixel on the 3D surface model:

$[U, V] = R_{c^*}(i)$

Since 24 body parts are treated, each with its own local coordinates, 24
regression functions $R_c$ are trained.
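
The two equations above can be read as a simple per-pixel decoding rule. The
sketch below assumes that the network outputs for one pixel are already given
as plain lists; the index layout (0 for background, 1 to 24 for body parts) and
the function name are assumptions made for illustration.

```python
def decode_pixel(class_probs, uv_per_part):
    """Return (part, u, v) for one pixel, or None if it is background.

    class_probs: 25 probabilities; index 0 is background (assumed layout),
                 indices 1..24 are the body parts.
    uv_per_part: 24 (u, v) pairs, the output R_c(i) of each part regressor.
    """
    # c* = argmax_c P(c | i)
    best = max(range(len(class_probs)), key=lambda c: class_probs[c])
    if best == 0:
        return None
    # [U, V] = R_{c*}(i): only the regressor of the winning part is used.
    u, v = uv_per_part[best - 1]
    return best, u, v


# Hypothetical outputs for two pixels.
uv_outputs = [(part / 100, part / 50) for part in range(1, 25)]
body_probs = [0.01] * 25
body_probs[3] = 0.76
background_probs = [0.9] + [0.1 / 24] * 24
body_pixel = decode_pixel(body_probs, uv_outputs)
background_pixel = decode_pixel(background_probs, uv_outputs)
```

Only the regressor of the winning part contributes to the result, which is why
the coordinates are local to each of the 24 parts.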
Region-based network approach: it is not fruitful to load one single network
with too many tasks, including part segmentation and pixel localization. The
authors therefore introduce DensePose-RCNN, which has exactly the same
architecture as Mask R-CNN up to ROIAlign, followed by a fully convolutional
network dedicated to the classification and regression tasks.

Fig. 8. On the left side is the Mask R-CNN architecture. The mask branch is
replaced with the DensePose branch.

In Fig. 8, Patch is the classification head that provides the part assignment
for each pixel (background or one of 24 body parts), and (U,V) is the regression
head that provides the part coordinate predictions.
The use of cascading strategies further improves the performance of the
system. Through cascading, information from related tasks, such as keypoint
estimation and instance segmentation, is used.
The annotators annotate only a sparse subset of the pixels in every training
sample, around 100-150 per person. This does not prevent training, because
pixels without annotations are excluded when the per-pixel loss is computed.
However, the authors found improved performance when they interpolated
annotations for the remaining pixels from the annotated ones. They first train
a "teacher network" on the manually collected supervision signal; it learns to
interpolate between the annotated point correspondences, resulting in a dense
set of point correspondences per body part. To make the teacher as accurate as
possible, they remove the background using the ground-truth mask. The resulting
dense supervision signal is used to train the region-based system.
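
How unannotated pixels drop out of the per-pixel loss can be sketched as
follows. The smooth L1 penalty and the function name are assumptions made for
this example; the text above only states that pixels without annotations are
excluded from the loss.

```python
def masked_uv_loss(pred_uv, target_uv, annotated):
    """Mean regression loss over annotated pixels only.

    pred_uv, target_uv: lists of (u, v) pairs, one per pixel.
    annotated: list of booleans, True where a ground-truth (u, v) exists.
    """
    def smooth_l1(x):
        # Assumed penalty: quadratic near zero, linear for large errors.
        x = abs(x)
        return 0.5 * x * x if x < 1.0 else x - 0.5

    total, count = 0.0, 0
    for (pu, pv), (tu, tv), has_label in zip(pred_uv, target_uv, annotated):
        if has_label:
            total += smooth_l1(pu - tu) + smooth_l1(pv - tv)
            count += 1
    return total / count if count else 0.0


pred = [(0.5, 0.5), (0.9, 0.1)]
target = [(0.5, 0.5), (0.0, 0.0)]
sparse_loss = masked_uv_loss(pred, target, [True, False])
dense_loss = masked_uv_loss(pred, target, [True, True])
```

In the sparse case the second pixel has no label and is ignored, so its
prediction does not influence the loss. Once a teacher has filled in dense
targets, every pixel of the part contributes.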

Fig. 9. Benefits from information on similar tasks - keypoints/masks.

Fig. 10. "Teacher Network".

4 Results

First, the authors emphasize the importance of strong supervision through a
comparison with model-fitting approaches. To further point out the significance
of strong supervision, they compare the performance of DensePose using different
kinds of supervision signals during training. The DensePose dataset leads to
predictions that are superior to the ones obtained from the surrogate datasets.

Fig. 11. On the left side: improvement w.r.t. SMPLify-14 and UP-SMPLify-91. On
the right side: the effect of different kinds of supervision signal during
training.

Different architecture choices have a different impact on the performance of
the system. We can observe a significant improvement from techniques such as
distillation (using the "teacher network" to inpaint the dense supervision
signal) and cascading. DP* uses the ground-truth mask to remove the effects of
the background.

Fig. 12. Impact of different architecture choices on the system performance.

Experiments with a ResNet-101 backbone show little improvement over the
ResNet-50 backbone, which justifies the use of the latter in the experiments.

Fig. 13. Multi-task experiments are based on the ResNet-50 architecture, due to
the insignificant improvement of the ResNet-101 backbone.

The qualitative evaluation shows that the system is able to successfully
estimate body shape, excluding clothes, while dealing with large amounts of
occlusion, scale and pose variation in the appearance of multiple persons.
Although the system operates on a frame-by-frame basis on a single GPU, the
real-time results on videos are highly accurate.

5 Conclusion

The paper introduces a solution for establishing dense correspondences between
a single RGB image and a surface-based representation of the human body.
However, intuitively, it may have difficulties with rare poses. Increasing the
field of view may reduce the inference speed. DensePose works on a
frame-by-frame basis, so the predictions can be somewhat inconsistent over time.
The proposed model is based on fully supervised learning, and the success of
DensePose is rooted in manual annotations. The obtained dataset, although
undeniably valuable, is limited in correctness. The DensePose task can be solved
more accurately by better understanding the underlying uncertainty, which is
addressed in (Neverova et al., 2019a). On the other hand, one can consider more
cost-effective methods for data annotation, as discussed in (Neverova et al.,
2019b).

Fig. 14. Qualitative evaluation.

DensePose is one step further in understanding human images by means of
surface-based models. The dataset and source code are open and can be reused for
future research. The experimental results are promising. This work opens a wide
variety of possibilities for augmented reality applications, activity
recognition and much more.

References
Bogo, F. et al. (2016). “Keep it SMPL: automatic estimation of 3d human pose
and shape from a single image”. In: ECCV.
Bristow, H. et al. (2015). “Dense semantic correspondence where every pixel is
a classifier”. In: ICCV.
Cao, Zhe et al. (2018). “Realtime Multi-Person 2D Human Pose Estimation using
Part Affinity Fields”. In: arXiv.
Kanazawa, A. et al. (2017). “End-to-end recovery of human shape and pose”. In:
arXiv.
Loper, Matthew et al. (2015). “SMPL: A Skinned Multi-Person Linear Model”.
In: ACM Transactions on Graphics (TOG).
Neverova, Natalia et al. (2019a). “Correlated Uncertainty for Learning Dense
Correspondences from Noisy Labels”. In: NIPS.
Neverova, Natalia et al. (2019b). “Slim DensePose: Thrifty Learning from Sparse
Annotations and Motion Cues”. In: arXiv.
Pavlakos, Georgios et al. (2018). “Learning to Estimate 3D Human Pose and
Shape from a Single Color Image”. In: arXiv.
Zhou, T. et al. (2016). “Learning dense correspondence via 3d-guided cycle con-
sistency.” In: CVPR.
