DensePose: Dense Human Pose Estimation in the Wild
Seminar: Vision Systems MA-INF 4208
Trajkovic, Jelena
Universität Bonn
[email protected], Matrikelnummer: 3304211
1 Introduction
Human analysis is one of the most important tasks in computer vision. It includes
several sub-problems, such as classification (is there a person in the image?),
person detection (placing a bounding box around every person in the image), and
part and instance segmentation (associating image regions with body parts). The
method proposed in this work takes these problems as prerequisites and builds on
top of their outcomes.
DensePose is an end-to-end trainable architecture that operates at multiple
frames per second. To achieve this, a 2D parametrization of the human body
surface is used. Because the human body is highly articulated, a good
parametrization is obtained by dividing the body into smaller and simpler
parts (patches). The body surface is divided into 24 patches, each isometric
to a plane. By flattening each of these patches through UV unwrapping, every
node of the surface is associated with a local UV coordinate. The basic idea is
the following: consider a single pixel in the image. In the first step, the
network classifies the pixel as belonging either to the background or to one of
the 24 patches, which gives a rough estimate of the surface coordinates. Then,
in the next step, a regression model predicts the exact coordinates of the
pixel within the patch. By repeating this
procedure for all the pixels in the image, we obtain a dense correspondence
between the image and the surface. How can the system be trained to perform such
a task? Before answering that question, we briefly discuss some relevant
topics.
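The two-step idea described above (coarse patch classification followed by per-patch coordinate regression) can be illustrated with a minimal NumPy sketch. The array layout (25 channels, with channel 0 standing for the background) and the function name `decode_dense_pose` are assumptions made for this illustration, not the layout used by the original system.

```python
import numpy as np

NUM_CLASSES = 25  # assumed layout: channel 0 = background, channels 1..24 = patches


def decode_dense_pose(patch_logits, u_maps, v_maps):
    """Turn per-pixel outputs into (patch index, U, V) maps.

    patch_logits: (25, H, W) classification scores per pixel
    u_maps, v_maps: (25, H, W) regressed coordinates, one channel per class
    """
    # Step 1: coarse estimate, i.e. the most likely class for every pixel.
    patch = patch_logits.argmax(axis=0)  # (H, W)

    # Step 2: read the U and V value from the channel of the predicted patch.
    u = np.take_along_axis(u_maps, patch[None], axis=0)[0]
    v = np.take_along_axis(v_maps, patch[None], axis=0)[0]

    # Background pixels have no surface coordinates; set them to zero.
    foreground = patch > 0
    u = np.where(foreground, u, 0.0)
    v = np.where(foreground, v, 0.0)
    return patch, u, v
```

Applying this function to the full output tensors yields one patch label and one (U, V) pair per pixel, which is the dense correspondence between image and surface.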
In recent years, CNNs have brought a huge breakthrough in image recognition.
CNNs extract features from images. The features are learned by the hidden
layers during training on a set of images; the more layers, the greater the
complexity of the learned features. Classical convolutional neural networks are
used for the classification task. R-CNN addresses the problem of detecting
multiple objects in an image. R-CNN uses a region proposal algorithm, for
example selective search, to extract up to 2000 region proposals. Each region is
warped to a fixed size and fed to a pre-trained CNN. The CNN features are then
passed to an SVM, which makes the classification decision. In addition to the
category label, for each proposal the network also predicts a correction to the
box that was predicted at the region proposal stage. It is easy to see that
evaluating each of the 2000 warped region proposals with the CNN is
computationally expensive.
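The box correction mentioned above can be made concrete with a short sketch. It uses the parametrization from the R-CNN paper, where a proposal is given by its center, width and height, and the predicted offsets shift the center and rescale the size. The function name `apply_box_deltas` is chosen for this illustration only.

```python
import math


def apply_box_deltas(proposal, deltas):
    """Refine a region proposal with predicted offsets.

    proposal: (cx, cy, w, h), box center and size in pixels
    deltas:   (dx, dy, dw, dh), offsets predicted by the regressor
    """
    cx, cy, w, h = proposal
    dx, dy, dw, dh = deltas
    # The center is shifted by a fraction of the box size;
    # width and height are rescaled in log-space.
    return (cx + w * dx, cy + h * dy, w * math.exp(dw), h * math.exp(dh))
```

Zero offsets leave the proposal unchanged, so the regressor only needs to learn small corrections around the proposal.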
Fast R-CNN addresses this problem in the following way: instead of generating
CNN feature vectors for each region proposal, it extracts a set of feature maps
from the entire image, just once. In parallel, a set of ROIs (regions of
interest) is found using Selective Search or another method. For each region
proposal, the corresponding part of the feature map is extracted. Because the
fully connected layers expect a fixed-size input, this part is resized to a
fixed size by the ROI pooling layer. Generating region proposals with Selective
Search is still very expensive.
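To illustrate how ROI pooling produces a fixed-size output from regions of different sizes, here is a minimal NumPy sketch. It assumes that the ROI is given in integer feature-map cells and is at least as large as the output grid; the function name `roi_max_pool` and the bin-splitting rule are simplifications chosen for this example.

```python
import numpy as np


def roi_max_pool(feature_map, roi, out_size=(2, 2)):
    """Max-pool one ROI of a feature map to a fixed spatial size.

    feature_map: (C, H, W) array
    roi: (x0, y0, x1, y1) in feature-map cells, x1 and y1 exclusive
    """
    x0, y0, x1, y1 = roi
    region = feature_map[:, y0:y1, x0:x1]
    out_h, out_w = out_size
    if region.shape[1] < out_h or region.shape[2] < out_w:
        raise ValueError("ROI is smaller than the output grid")

    # Split the region into out_h x out_w bins of roughly equal size.
    ys = np.linspace(0, region.shape[1], out_h + 1).astype(int)
    xs = np.linspace(0, region.shape[2], out_w + 1).astype(int)

    out = np.empty((feature_map.shape[0], out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cell = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            out[:, i, j] = cell.max(axis=(1, 2))  # strongest response per bin
    return out
```

Regardless of the ROI size, the result always has shape (C, out_h, out_w), which is what the fully connected layers require.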
Faster R-CNN addresses this problem by using an RPN (region proposal network),
so that the network predicts its own region proposals. The idea is to run the
input image through the convolutional layers to get a convolutional feature map
representing the entire high-resolution image. A separate region proposal
network works on top of the convolutional features and predicts region
proposals inside the network. Once we have these predicted region proposals,
the rest looks like Fast R-CNN. RPN has a classifier (probability of
2 Related work
joints, and a faithful representation of human body pose and shape is created in
the process of recovering SMPL parameters by means of optimization. In (Kanazawa
et al., 2017) and (Pavlakos et al., 2018), human shape is estimated by
regressing SMPL parameters with a CNN.
The DensePose authors do not rely on the SMPL model at test time. They use it
only to define the dense correspondence task.
3 Methods
3.1 DensePose-COCO Dataset
There are many image datasets for computer vision training that contain
annotated images and labeled objects: ImageNet, LSUN, MS COCO, etc. However,
the idea is not just to label the person, but to create a 3D model of one.
DensePose is a strongly supervised approach. The authors introduced the first
manually collected ground-truth dataset for this task by gathering dense
correspondences between the SMPL model and the persons appearing in the COCO
dataset. This task involves humans manually creating annotations that relate 2D
images to a surface-based representation of the human body. A two-stage
annotation pipeline is used in order to improve the efficiency of gathering
annotations.
The first stage consists of segmenting the images into semantic part regions.
The annotators are asked to segment the body into 24 parts, ignoring hair,
clothes and other accessories.
For the second stage, the authors sampled every part region with a set of
roughly equidistant points obtained via k-means. The number of sampled points
depends on the region size, with up to 14 points per part. The annotators are
asked to bring each point into correspondence with the body surface. This task
is simplified by providing 6 pre-rendered views of the specific body part. The
annotator can place the landmark on any of them and then sees the annotated
point from all 6 perspectives.
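The point sampling of the second stage can be sketched with a small k-means implementation in NumPy. The text only states that the number of points depends on the region size and is capped at 14; the concrete rule used here (one point per `pixels_per_point` pixels) and the function name `sample_part_points` are hypothetical choices for illustration.

```python
import numpy as np

MAX_POINTS = 14  # upper bound per part, as stated in the text


def sample_part_points(mask, pixels_per_point=200, iters=20, seed=0):
    """Place roughly equidistant points inside a binary part mask via k-means.

    pixels_per_point is a hypothetical rule for choosing the number of points.
    Returns an array of shape (k, 2) with (row, col) cluster centers.
    """
    coords = np.argwhere(mask).astype(float)  # pixel coordinates of the part
    if len(coords) == 0:
        return np.empty((0, 2))

    k = int(np.clip(len(coords) // pixels_per_point, 1, MAX_POINTS))
    rng = np.random.default_rng(seed)
    centers = coords[rng.choice(len(coords), size=k, replace=False)]

    for _ in range(iters):
        # Assign every pixel to its nearest center.
        dists = np.linalg.norm(coords[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned pixels.
        for c in range(k):
            members = coords[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    return centers
```

Because k-means centers spread out over the region they partition, the resulting points cover the part roughly uniformly, which is the property the annotation pipeline needs.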
Such annotations are collected for 50,000 persons in the COCO dataset. Fig. 4
shows a visualisation of COCO images, colored according to the U and V
coordinates of the annotated points. For the U coordinates the color changes
horizontally, and for the V coordinates it changes vertically.
How well did the annotators perform?
\[ d_{i,k} = g(i, \hat{i}_k) \]
Fig. 6. Synthetic image with sampled and collected points from the torso.
\[ \mathrm{AUC}_a = \frac{1}{a} \int_0^a f(t)\,dt \]
\[ \mathrm{GPS}_j = \frac{1}{|P_j|} \sum_{p \in P_j} \exp\left( \frac{-g(i_p, \hat{i}_p)^2}{2k^2} \right) \]
With the formula given above, GPS is calculated for every person instance j,
where P_j is the set of ground-truth points annotated on that instance. The
normalizing parameter k is set to 0.255 m, so that a single point has a GPS
value of 0.5 if the distance is approximately 0.3 m.
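The two measures can be checked numerically with a short sketch. It assumes that the geodesic distances g(i_p, \hat{i}_p) have already been computed and are given in meters, and that f(t) denotes the fraction of points whose error is at most t; the latter is an assumption, since f is not defined in the text above. The function names are chosen for this example.

```python
import numpy as np

K = 0.255  # normalizing parameter in meters, as given in the text


def gps(geodesic_distances, k=K):
    """Geodesic point similarity of one person instance."""
    d = np.asarray(geodesic_distances, dtype=float)
    return float(np.mean(np.exp(-d ** 2 / (2.0 * k ** 2))))


def auc(geodesic_distances, a, num_steps=1000):
    """AUC_a = (1/a) * integral of f(t) over [0, a], via the trapezoidal rule.

    f(t) is assumed to be the fraction of points with error <= t.
    """
    d = np.asarray(geodesic_distances, dtype=float)
    t = np.linspace(0.0, a, num_steps + 1)
    f = (d[None, :] <= t[:, None]).mean(axis=1)
    integral = np.sum((f[1:] + f[:-1]) * np.diff(t)) / 2.0
    return float(integral / a)
```

For a single point at a distance of 0.3 m, the exponent is -0.09 / (2 * 0.255^2), which is about -0.69, so the GPS value is about 0.5, in agreement with the choice of k described above.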
After collecting a large-scale and clean dataset, it is time to train a deep
neural network to predict dense image-to-surface correspondences.
The authors experimented with a fully convolutional network and a region-based
network.
FCN approach: classification and regression tasks are combined. Classify
the pixel as belonging to the background, or any of the 24 body parts. The
Fig. 8. On the left side is the Mask R-CNN architecture. The mask branch is
replaced with the DensePose branch.
In Fig. 7, Patch is the classification head that provides the part assignment
for each pixel (background or one of the 24 body parts). (U,V) is the regression
head that provides the part coordinate predictions.
The use of cascading strategies further improves the performance of the system.
Through cascading, information from related tasks, such as keypoint estimation
and instance segmentation, is exploited.
During annotation, the annotators label only a sparse subset of the pixels in
every training sample, around 100-150 per human. This does not prevent
training, because pixels that are not annotated are excluded when the per-pixel
loss is computed. However, the authors found improved performance when the
annotations of the remaining pixels were interpolated from the annotated ones.
They first train a "teacher network" on the manually collected supervision
signal; it learns to interpolate between the user-defined point
correspondences, resulting in a dense set of point correspondences per body
part. To make the teacher as accurate as possible, they remove the background
using the ground-truth mask. The resulting dense supervision signal is used to
train the region-based system.
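The remark that unannotated pixels are left out of the per-pixel loss can be illustrated with a small sketch. A squared error is used here purely for simplicity and is an assumption, not necessarily the regression loss of the original system; the function name `masked_uv_loss` is likewise chosen for this example.

```python
import numpy as np


def masked_uv_loss(pred_uv, target_uv, annotated):
    """Mean squared UV error over annotated pixels only.

    pred_uv, target_uv: (2, H, W) arrays holding the U and V maps
    annotated: (H, W) boolean mask, True where a ground-truth point exists
    """
    mask = np.asarray(annotated, dtype=bool)
    n = int(mask.sum())
    if n == 0:
        return 0.0  # no supervision available for this sample

    sq_err = (np.asarray(pred_uv) - np.asarray(target_uv)) ** 2
    # Boolean indexing keeps only the annotated pixels, shape (2, n).
    return float(sq_err[:, mask].sum() / (2 * n))
```

With only 100-150 annotated pixels per person, most entries of the mask are False; the densified signal produced by a teacher network would correspond to a mask with many more True entries.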
4 Results
First, the authors emphasize the importance of strong supervision through a
comparison with model-fitting approaches. To further point out the significance
of strong supervision, they compare the performance of DensePose using
different kinds of supervision signals during training. The DensePose dataset
leads to predictions that are superior to the ones obtained from the surrogate
datasets.
Fig. 11. On the left side: improvement w.r.t. SMPLify-14 and UP-SMPLify-91. On
the right side: the effect of different kinds of supervision signals during
training.
Different architecture choices have a different impact on the performance of
the system. A significant improvement can be observed with techniques such as
distillation (using a "teacher network" to inpaint the dense supervision signal)
and cascading. DP* uses the ground-truth mask to factor out the effects of the
background.
The qualitative evaluation shows that the system is able to successfully
estimate body shape, excluding clothes, while dealing with large amounts of
occlusion, scale and pose variation in the appearance of multiple persons.
Although the system operates on a frame-by-frame basis on a single GPU, the
real-time results on videos are highly accurate.
5 Conclusion
References
Bogo, F. et al. (2016). “Keep it SMPL: automatic estimation of 3d human pose
and shape from a single image”. In: ECCV.
Bristow, H. et al. (2015). “Dense semantic correspondence where every pixel is
a classifier”. In: ICCV.
Cao, Zhe et al. (2018). “Realtime Multi-Person 2D Human Pose Estimation using
Part Affinity Fields”. In: arXiv.
Kanazawa, A. et al. (2017). “End-to-end recovery of human shape and pose”. In:
arXiv.
Loper, Matthew et al. (2015). “SMPL: A Skinned Multi-Person Linear Model”.
In: ACM Transactions on Graphics (TOG).
Neverova, Natalia et al. (2019a). “Correlated Uncertainty for Learning Dense
Correspondences from Noisy Labels”. In: NIPS.
Neverova, Natalia et al. (2019b). “Slim DensePose: Thrifty Learning from Sparse
Annotations and Motion Cues”. In: arXiv.
Pavlakos, Georgios et al. (2018). “Learning to Estimate 3D Human Pose and
Shape from a Single Color Image”. In: arXiv.
Zhou, T. et al. (2016). “Learning dense correspondence via 3d-guided cycle
consistency”. In: CVPR.