Review: Deepmask (Instance Segmentation) : An Instance Segment Proposal Method Driven by Convolutional Neural Networks
1. Model Architecture
Model Architecture (Top), Positive Samples (Green, Left Bottom), Negative Samples (Red,
Right Bottom)
When yk=1, the ground truth mask mk has positive values for the pixels that belong to the
single object located at the centre of the image patch.
Right Bottom: Negative Samples
Otherwise, a label yk=-1 is given for a negative sample, even if an object is partially present.
When yk=-1, the mask is not used.
2. Joint Learning
2.1. Loss Function
The network is trained to jointly learn the pixel-wise segmentation map fsegm(xk) at each location
(i,j) and the predicted object score fscore(xk), using a single loss function over both outputs.
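For reference, the joint loss in the DeepMask paper can be written roughly as follows. The symbols w^o and h^o (width and height of the output mask), λ (a weight on the score term) and θ (the network parameters) do not appear elsewhere in this review and are introduced here only for the formula:

```latex
\[
\mathcal{L}(\theta) = \sum_k \left(
  \frac{1 + y_k}{2\, w^o h^o} \sum_{ij}
    \log\!\left(1 + e^{-m_k^{ij}\, f_{segm}^{ij}(x_k)}\right)
  + \lambda \log\!\left(1 + e^{-y_k\, f_{score}(x_k)}\right)
\right)
\]
```

The factor (1 + yk)/2 equals 1 for a positive sample and 0 for a negative one, which is what switches the segmentation term on and off.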
In brief, the loss function is a sum of binary logistic regression losses, one for each location of
the segmentation network fsegm(xk) and one for the object score fscore(xk). The first term implies
that we only backpropagate the error over the segmentation path if yk=1.
If yk=-1, i.e. a negative sample, the first term becomes 0 and does not contribute to the loss. Only
the second term contributes to the loss.
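To make the gating by yk concrete, here is a minimal NumPy sketch of the per-sample loss. The function name `deepmask_loss`, the argument names, and the default weight `lam=1/32` are my own assumptions for illustration, not the authors' code. It assumes the mask entries are in {-1, +1} and that the segmentation term is averaged over all mask locations.

```python
import numpy as np

def deepmask_loss(seg_logits, score_logit, mask, y, lam=1.0 / 32):
    """Joint loss for one training sample (illustrative sketch).

    seg_logits  : (h, w) raw segmentation outputs, fsegm(xk)
    score_logit : scalar raw object score, fscore(xk)
    mask        : (h, w) ground truth mask mk with entries in {-1, +1}
    y           : sample label yk, either +1 or -1
    lam         : weight of the score term (assumed value)
    """
    seg_logits = np.asarray(seg_logits, dtype=float)
    mask = np.asarray(mask, dtype=float)
    # logaddexp(0, -z) evaluates log(1 + exp(-z)) in a numerically stable way.
    seg_term = np.logaddexp(0.0, -mask * seg_logits).mean()
    score_term = np.logaddexp(0.0, -y * score_logit)
    # (1 + y) / 2 is 1 for positives and 0 for negatives,
    # so the segmentation term vanishes when y = -1.
    return (1 + y) / 2.0 * seg_term + lam * score_term
```

For a negative sample the result depends only on `score_logit`, so no gradient would flow through the segmentation path, matching the description above.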
For the sake of data balance, an equal number of positive and negative samples is used.
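One simple way to get such a balanced batch is to draw half of it from each pool. The helper below is a hypothetical sketch (the name `balanced_batch` and the list-based pools are assumptions), not the sampling procedure from the paper.

```python
import random

def balanced_batch(positives, negatives, batch_size, rng=None):
    """Return a shuffled list of (sample, label) pairs, half labeled +1, half -1."""
    if rng is None:
        rng = random.Random()
    half = batch_size // 2
    # Sample without replacement from each pool and attach the label yk.
    batch = [(x, 1) for x in rng.sample(positives, half)]
    batch += [(x, -1) for x in rng.sample(negatives, half)]
    rng.shuffle(batch)
    return batch
```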
4. Results
4.1. MS COCO (Boxes & Segmentation Masks)
80,000 images, with a total of nearly 500,000 segmented objects, are used for training. The first
5,000 images of MS COCO 2014 are used for validation.
Average Recall (AR) for Detection Boxes (Left) and Segmentation Masks (Right) on MS COCO
Validation Set (AR@n: the AR when n region proposals are generated. AUCx: the AUC for objects
of size x)
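As a rough illustration of the AR metric in the caption above, the sketch below averages recall over a range of IoU thresholds. It assumes that, for each ground truth object, the best IoU among the top n proposals has already been computed. The threshold grid (0.5 to 0.95 in steps of 0.05) is an assumption in the COCO style and may differ from the exact protocol behind the reported numbers.

```python
import numpy as np

def average_recall(best_ious, thresholds=None):
    """Average recall over IoU thresholds (illustrative sketch).

    best_ious : for each ground truth object, the highest IoU achieved
                by any of the top-n proposals
    """
    if thresholds is None:
        thresholds = np.linspace(0.5, 0.95, 10)  # assumed threshold grid
    best_ious = np.asarray(best_ious, dtype=float)
    # Recall at threshold t = fraction of objects covered with IoU >= t.
    recalls = [(best_ious >= t).mean() for t in thresholds]
    return float(np.mean(recalls))
```

AR@n is then this value computed with the proposal list truncated to the n highest-scoring proposals per image.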
• DeepMask20: Trained only with objects belonging to one of the 20 PASCAL categories. AR
is low compared to DeepMask, which means the network does not generalize well to unseen
classes (it gives low scores to unseen classes).
• DeepMask20*: Similar to DeepMask20, but the scoring path is taken from the original DeepMask.
• DeepMaskZoom: An additional smaller scale is used to boost AR, at the cost of increased
inference time.
• DeepMaskFull: The two FC layers on the path for predicting the segmentation mask are
replaced by one FC layer that maps directly from the 512×14×14 feature maps to the 56×56
segmentation maps. The whole architecture has over 300M parameters. It is slightly
inferior to DeepMask and much slower.
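The parameter count quoted for DeepMaskFull can be sanity-checked with a little arithmetic. The sketch below counts only the single FC layer from the 512×14×14 features to the 56×56 mask, assuming it is a plain dense layer with a bias; the helper name `fc_params` is hypothetical.

```python
def fc_params(in_features, out_features, bias=True):
    """Number of parameters in a dense layer: weights plus optional biases."""
    return in_features * out_features + (out_features if bias else 0)

in_features = 512 * 14 * 14   # flattened feature maps
out_features = 56 * 56        # flattened segmentation map
print(f"{fc_params(in_features, out_features):,}")
```

That single layer alone already accounts for more than 300M parameters, which is consistent with the figure given in the bullet above.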
Average Recall (AR) for Detection Boxes on PASCAL VOC 2007 Test Set
• Region proposals are generated based on the predicted segmentation masks, and can be
used as the first step of an object detection task.
• Fast R-CNN using DeepMask outperforms the original Fast R-CNN using Selective Search as
well as other state-of-the-art approaches.
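Turning a predicted mask into a box proposal can be as simple as taking the tightest box around the mask. The sketch below assumes a binary 2D mask and returns inclusive pixel coordinates. The function name `mask_to_box` and the (x_min, y_min, x_max, y_max) convention are my own choices, not necessarily what the authors do.

```python
import numpy as np

def mask_to_box(mask):
    """Tightest bounding box (x_min, y_min, x_max, y_max) around a binary mask.

    Returns None if the mask is empty. Coordinates are inclusive pixel indices.
    """
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        return None
    rows = np.flatnonzero(mask.any(axis=1))  # rows containing mask pixels
    cols = np.flatnonzero(mask.any(axis=0))  # columns containing mask pixels
    return int(cols[0]), int(rows[0]), int(cols[-1]), int(rows[-1])
```

The resulting boxes could then be fed to a detector such as Fast R-CNN in place of Selective Search proposals.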
The DeepMask implementation on GitHub has since been updated, with the VGGNet backbone replaced by ResNet.
After DeepMask, FAIR also introduced SharpMask. I hope to cover it later as well.