Review: DeepMask (Instance Segmentation): An Instance Segment Proposal Method Driven by Convolutional Neural Networks

This document summarizes the DeepMask instance segmentation method. DeepMask takes an image patch as input and uses a convolutional neural network with two branches to 1) predict a segmentation mask and 2) score how likely it is to contain an object. It is trained jointly on these two tasks. During inference, it densely applies the model at multiple locations and scales to generate segmentation proposals for each object. Evaluation on MS COCO and PASCAL VOC shows it outperforms other proposal generation methods and approaches for instance segmentation and object detection.

This time, DeepMask, by Facebook AI Research (FAIR), is reviewed. Since AlexNet showed that a
convolutional neural network (CNN) can reach high accuracy in image classification, numerous CNN
approaches have been developed for other tasks such as object detection, semantic segmentation,
and instance segmentation.

Semantic Segmentation vs Instance Segmentation

• Image Classification: Classify the main object category within an image.
• Object Detection: Identify the object category and locate the position using a bounding box
for every known object within an image.
• Semantic Segmentation: Identify the object category of each pixel for every known object
within an image. Labels are class-aware.
• Instance Segmentation: Identify which object instance each pixel belongs to, for every known
object within an image. Labels are instance-aware.

Some Differences from Semantic Segmentation

• A finer understanding of the individual instances.
• Reasoning about occlusion.
• Essential to tasks such as counting the number of objects.
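
To make the class-aware versus instance-aware distinction concrete, here is a small sketch on a toy label map. The arrays and the helper name `count_labels` are made up for illustration and are not from the DeepMask paper.

```python
import numpy as np

# Toy 4x6 scene with two separate objects of the same category (class 1).
# 0 is background in both maps.
semantic = np.array([
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
])

# Instance-aware labels give each object its own id.
instance = np.array([
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
])

def count_labels(label_map):
    """Number of distinct non-background labels in a label map."""
    return int(np.count_nonzero(np.unique(label_map)))

num_classes = count_labels(semantic)   # 1: only the category is known
num_objects = count_labels(instance)   # 2: the two objects are separated
print(num_classes, num_objects)
```

The semantic map alone cannot say how many objects there are, which is why counting needs instance-aware labels.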

Some Differences from Object Detection

• A bounding box is a very coarse object boundary: many pixels irrelevant to the detected
object are also included in the bounding box.
• Non-Maximum Suppression (NMS) can suppress occluded or slanted objects, because their
boxes overlap heavily with those of neighbouring objects.
Thus, instance segmentation is one level higher in difficulty.
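
The NMS issue can be illustrated with a minimal greedy NMS sketch. The boxes, scores, and the 0.5 IoU threshold are illustrative assumptions, and `box_iou` and `greedy_nms` are hypothetical helper names rather than part of any detector discussed here.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep the best-scoring box, drop boxes that overlap it too much, repeat."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(int(best))
        order = [i for i in order
                 if box_iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Boxes 0 and 1 belong to two different objects, one partly hidden behind
# the other, so their boxes overlap heavily. Box 2 is far away.
boxes = [(10, 10, 110, 210), (30, 10, 130, 210), (300, 50, 360, 120)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
print(kept)  # [0, 2]: the occluded object's box is suppressed
```

Boxes 0 and 1 have an IoU of about 0.67, so the lower-scoring one is removed even though it covers a distinct object.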
DeepMask is a 2015 NIPS paper with more than 300 citations. Although it was published in 2015,
it is one of the earliest papers to use a CNN for instance segmentation, so it is worth studying
to understand how deep-learning-based instance segmentation developed. (Sik-Ho Tsang @ Medium)
Since a region proposal can be generated from the predicted segmentation mask, the object
detection task can also be performed.
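
One plausible way to turn a predicted mask into a box proposal is to take the tightest box around the mask pixels. The sketch below assumes a binary mask and uses a hypothetical helper `mask_to_box`; it is not taken from the DeepMask code.

```python
import numpy as np

def mask_to_box(mask):
    """Tightest (x1, y1, x2, y2) box around the non-zero pixels of a mask.

    x2 and y2 are exclusive. Returns None for an empty mask.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:7] = True   # main body of the object
mask[5, 4] = True       # a small protrusion below it

box = mask_to_box(mask)
x1, y1, x2, y2 = box
# Fraction of the box that actually belongs to the object.
fill = mask.sum() / ((x2 - x1) * (y2 - y1))
print(box, fill)  # (3, 2, 7, 6) 0.8125
```

The fill ratio below 1 also shows the earlier point that a box includes pixels that do not belong to the object.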

What Is Covered

1. Model Architecture
2. Joint Learning
3. Full Scene Inference
4. Results

1. Model Architecture

Model Architecture (Top), Positive Samples (Green, Left Bottom), Negative Samples (Red,
Right Bottom)

Left Bottom: Positive Samples

A label yk=1 is given to the k-th sample when it is positive. To be a positive sample, two
criteria need to be satisfied:
• The patch contains an object roughly centered in the input patch.
• The object is fully contained in the patch and in a given scale range.

When yk=1, the ground truth mask mk has positive values for the pixels which belong to the
single object located in the centre of the image patch.

Right Bottom: Negative Samples

Otherwise, a label yk=-1 is given and the patch is a negative sample, even if an object is
partially present. When yk=-1, the mask is not used.
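
The labelling rule above can be sketched as a small function. The review does not give the tolerances, so `center_tol` and `size_range` below are illustrative assumptions, and `label_patch` is a hypothetical helper name.

```python
def label_patch(obj_box, patch_size=224, center_tol=16, size_range=(108, 152)):
    """Return +1 if the object satisfies both criteria for this patch, else -1.

    obj_box is (x1, y1, x2, y2) in patch coordinates. center_tol and
    size_range are assumed values, not taken from the review.
    """
    x1, y1, x2, y2 = obj_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half = patch_size / 2.0

    # Criterion 1: the object is roughly centered in the patch.
    centered = abs(cx - half) <= center_tol and abs(cy - half) <= center_tol

    # Criterion 2: the object is fully contained and within the scale range.
    contained = x1 >= 0 and y1 >= 0 and x2 <= patch_size and y2 <= patch_size
    longest_side = max(x2 - x1, y2 - y1)
    in_scale = size_range[0] <= longest_side <= size_range[1]

    return 1 if (centered and contained and in_scale) else -1

print(label_patch((48, 60, 176, 170)))    # 1: centered, contained, right size
print(label_patch((150, 60, 260, 170)))   # -1: only partially inside the patch
print(label_patch((102, 102, 122, 122)))  # -1: centered but far too small
```

Note that the second patch still contains part of an object but is labelled negative, matching the rule stated above.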

Top, Model Architecture: Main Branch

As shown above, the input image patch x first passes through VGGNet for feature extraction. The
fully connected (FC) layers of the original VGGNet are removed, and so is its last max pooling
layer, so the output before the split into two paths is 1/16 of the input size. In the example
above, the input is 3×224×224 (3 is the number of channels in the input image, i.e. RGB), so the
output at the end of the main branch is (224/16)×(224/16) = 14×14, with 512 feature maps after
convolution.
There are two paths after VGGNet:
• The first path is to predict the class-agnostic segmentation mask, i.e. fsegm(x).
• The second path is to assign a score corresponding to how likely the patch is to contain
an object, i.e. fscore(x).
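
The 1/16 factor follows from the four remaining stride-2 pooling layers (VGGNet has five, and the last one is removed). A quick arithmetic check, with a hypothetical helper name:

```python
def vgg_trunk_output_size(input_size, num_pool_layers=4):
    """Spatial size after the given number of stride-2 poolings.

    The padded 3x3 convolutions in VGGNet keep the spatial size, so only
    the pooling layers shrink the feature map.
    """
    size = input_size
    for _ in range(num_pool_layers):
        size //= 2
    return size

side = vgg_trunk_output_size(224)
feature_shape = (512, side, side)
print(feature_shape)  # (512, 14, 14)
```

With all five pooling layers kept, the same input would give a 7×7 map instead.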

Top, First Path: Predicting Segmentation Map

A 1×1 convolution is performed first without changing the number of feature maps, i.e. a
non-linear mapping without dimension reduction. After that, two FC layers are applied. (Note
that there is no ReLU between these two FC layers!)
Unlike in semantic segmentation, the network must output a mask for a single object even
when multiple objects are present. (Just like the elephant in the centre of the input image as
shown above.)
Finally, a 56×56 segmentation map is generated, and simple bilinear interpolation is used to
upsample it to 224×224.
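
The upsampling step can be sketched in plain NumPy. This is one common bilinear convention (pixel centres aligned, edges clamped). The review does not say which convention DeepMask uses, so treat that choice and the helper name `bilinear_resize` as assumptions.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resize a 2D array with bilinear interpolation."""
    in_h, in_w = img.shape
    # Source coordinates of each output pixel centre, clamped to the image.
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :]   # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

rng = np.random.default_rng(0)
coarse = rng.random((56, 56))          # stand-in for the 56x56 mask output
fine = bilinear_resize(coarse, 224, 224)
print(fine.shape)  # (224, 224)
```

Because every output value is a weighted average of input values, the upsampled map never leaves the value range of the 56×56 map.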

Top, Second Path: Predicting Object Score

A 2×2 max pooling is followed by two FC layers. Finally, a single value, the predicted object
score fscore(x), is obtained. Since positive samples are defined by the two criteria mentioned
above, fscore(x) predicts whether the input patch satisfies these two criteria.
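
To follow the tensor shapes through the two paths, here is a toy forward pass in NumPy with random weights. It is a shape walk-through only: the channel count is shrunk from 512 to 8 to keep it light, the hidden width `HID` is hypothetical, and the ReLU between the two FC layers of the score path is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
C, S = 8, 14          # toy channel count (512 in the figure) and 14x14 map
HID, OUT = 16, 56     # hypothetical hidden width; 56x56 output mask

def relu(x):
    return np.maximum(x, 0)

def linear(x, n_out):
    """Fully connected layer with random weights, used only to track shapes."""
    w = rng.standard_normal((x.size, n_out)) * 0.01
    return x.reshape(-1) @ w

features = rng.standard_normal((C, S, S))   # stand-in for the trunk output

# Segmentation path: 1x1 conv (a per-pixel linear map over channels) + ReLU,
# then two FC layers with no ReLU in between.
w_1x1 = rng.standard_normal((C, C)) * 0.1
seg = relu(np.einsum('oc,chw->ohw', w_1x1, features))
seg = linear(seg, HID)
mask_logits = linear(seg, OUT * OUT).reshape(OUT, OUT)

# Score path: 2x2 max pooling, then two FC layers down to a single value.
pooled = features.reshape(C, S // 2, 2, S // 2, 2).max(axis=(2, 4))
score = linear(relu(linear(pooled, HID)), 1)[0]

print(mask_logits.shape, pooled.shape, float(score))
```

The mask path ends in a 56×56 map and the score path in one scalar, as in the description above.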

2. Joint Learning
2.1. Loss Function
The network is trained to jointly learn the pixel-wise segmentation map fsegm(xk) at each location
(i,j) and the predicted object score fscore(xk).
In brief, the loss function is a sum of binary logistic regression losses, one for each location of
the segmentation network fsegm(xk) and one for the object score fscore(xk). The segmentation term
is weighted so that the error is backpropagated over the segmentation path only if yk=1.
If yk=-1, i.e. a negative sample, the segmentation term becomes 0 and does not contribute to the
loss; only the score term contributes.
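
For reference, the loss described above can be written out as follows. This is reconstructed from the description and from memory of the DeepMask paper, so treat the exact normalization and the weight λ as assumptions. Here the mask values m_k^{ij} are in {−1, +1}, and w^o × h^o is the size of the output mask.

```latex
\mathcal{L}(\theta) = \sum_k \left(
    \frac{1 + y_k}{2\, w^o h^o} \sum_{i,j}
        \log\!\left(1 + e^{-m_k^{ij}\, f_{segm}^{ij}(x_k)}\right)
    + \lambda \log\!\left(1 + e^{-y_k\, f_{score}(x_k)}\right)
\right)
```

The factor (1 + y_k)/2 equals 1 for a positive sample and 0 for a negative one, which is what switches the segmentation term off when yk=-1.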
For the sake of data balance, equal numbers of positive and negative samples are used.
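
A minimal NumPy sketch of the joint loss and of balanced sampling, under stated assumptions: masks are encoded in {−1, +1}, the segmentation term is averaged over mask pixels, and the score weight `lam` defaults to 1/32 (recalled from the paper, not stated in this review). The function names are hypothetical.

```python
import numpy as np

def deepmask_loss(seg_logits, score_logit, mask_pm1, y, lam=1.0 / 32):
    """Joint loss for one sample.

    seg_logits: (H, W) mask logits; mask_pm1: (H, W) ground truth in {-1, +1}
    (ignored when y == -1); y: patch label, +1 or -1.
    """
    seg_term = 0.0
    if y == 1:  # the mask only contributes for positive samples
        # logaddexp(0, z) is a numerically stable log(1 + exp(z)).
        seg_term = np.mean(np.logaddexp(0.0, -mask_pm1 * seg_logits))
    score_term = np.logaddexp(0.0, -y * score_logit)
    return float(seg_term + lam * score_term)

def balanced_batch(pos_ids, neg_ids, batch_size, rng):
    """Draw half of the batch from positives and half from negatives."""
    half = batch_size // 2
    pos = rng.choice(pos_ids, size=half, replace=False)
    neg = rng.choice(neg_ids, size=half, replace=False)
    return np.concatenate([pos, neg])

rng = np.random.default_rng(0)
logits = np.zeros((56, 56))
target = np.ones((56, 56))
loss_pos = deepmask_loss(logits, 0.0, target, y=1)
loss_neg = deepmask_loss(logits, 0.0, None, y=-1)
batch = balanced_batch(np.arange(100), np.arange(100, 300), 32, rng)
print(loss_pos, loss_neg, len(batch))
```

With all-zero logits every logistic term equals log 2, so the positive sample costs log 2 · (1 + 1/32) and the negative sample only log 2 / 32.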

2.2. Other Details

A batch size of 32 is used, and the network is initialized from a model pretrained on ImageNet.
There are 75M parameters in total. The model takes around 5 days to train on an Nvidia Tesla K40m.

3. Full Scene Inference

3.1. Multiple Locations and Scales
During inference (testing), the model is applied densely at multiple locations with a stride of 16
pixels, and at multiple scales from 1/4 to 2, each scale being a factor of √2 (the square root of
2) larger than the previous one. This ensures that there is at least one tested image patch that
fully contains each object in the image.
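
The scale pyramid and the dense grid can be enumerated as below. This is a conceptual sketch: the helper names are hypothetical, and an actual implementation would normally run the network convolutionally over the whole image rather than crop every window.

```python
def inference_scales(min_exp=-2.0, max_exp=1.0, step_exp=0.5):
    """Scales 2^-2 ... 2^1 in steps of 2^0.5, i.e. from 1/4 to 2 by sqrt(2)."""
    n = int(round((max_exp - min_exp) / step_exp)) + 1
    return [2.0 ** (min_exp + k * step_exp) for k in range(n)]

def window_origins(height, width, patch=224, stride=16):
    """Top-left corners of every full patch on a regular stride grid."""
    ys = range(0, max(height - patch, 0) + 1, stride)
    xs = range(0, max(width - patch, 0) + 1, stride)
    return [(y, x) for y in ys for x in xs]

scales = inference_scales()
print(len(scales), scales[0], scales[-1])  # 7 0.25 2.0

# Number of 224x224 windows on a 480x640 image at scale 1.
print(len(window_origins(480, 640)))       # 459
```

Seven scales times several hundred locations per scale is why inference is done densely in one pass instead of patch by patch.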

3.2. Fine Stride Max Pooling

Since the input test image is larger than the training input patch, the scoring branch must
output a 2D score map rather than a single score. An interleaving trick is used before
the last max pooling layer of the scoring branch, i.e. the Fine Stride Max Pooling proposed in
OverFeat.
In brief, max pooling is applied to the feature map several times, with a pixel shift before
each pass, and the pooled outputs are interleaved into one denser map.
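
A minimal sketch of the shift-and-interleave idea for a 2×2 pooling window. The window size and the offsets {0, 1} are assumptions chosen to match the 2×2 pooling of the scoring branch, and the function names are hypothetical.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling; an odd trailing row/column is dropped."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def fine_stride_max_pool(x):
    """Pool at each of the four (dy, dx) pixel shifts and interleave the results."""
    h, w = x.shape
    out = np.empty((h - 1, w - 1), dtype=x.dtype)
    for dy in (0, 1):
        for dx in (0, 1):
            # The pass shifted by (dy, dx) fills every second output position.
            out[dy::2, dx::2] = max_pool_2x2(x[dy:, dx:])
    return out

rng = np.random.default_rng(0)
x = rng.integers(0, 100, size=(9, 8))
dense = fine_stride_max_pool(x)
print(max_pool_2x2(x).shape, dense.shape)  # (4, 4) (8, 7)
```

The interleaved result is the same as 2×2 max pooling with stride 1, so the map keeps its resolution instead of being halved by the pooling layer.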

4. Results
4.1. MS COCO (Boxes & Segmentation Masks)
80,000 images with a total of nearly 500,000 segmented objects are used for training, and the first
5,000 images of MS COCO 2014 are used for validation.
Average Recall (AR) for Detection Boxes (Left) and Segmentation Masks (Right) on the MS COCO
Validation Set (AR@n: the AR when n region proposals are generated. AUCx: x is the size of
objects)
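
To make AR@n concrete, here is a small sketch that computes mask IoU and averages recall over several IoU thresholds. The review does not define the thresholds, so the COCO-style range 0.5 to 0.95 is an assumption, and the function names and toy masks are made up.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def average_recall(gt_masks, proposals, n, thresholds=np.linspace(0.5, 0.95, 10)):
    """AR@n: recall of ground-truth objects by the top-n proposals,
    averaged over IoU thresholds. Proposals are assumed sorted by score."""
    top = proposals[:n]
    best = np.array([max((mask_iou(g, p) for p in top), default=0.0)
                     for g in gt_masks])
    return float(np.mean([(best >= t).mean() for t in thresholds]))

def row_mask(row, start, stop, shape=(4, 10)):
    m = np.zeros(shape, dtype=bool)
    m[row, start:stop] = True
    return m

gt = [row_mask(0, 0, 4), row_mask(2, 0, 6)]
props = [row_mask(0, 0, 4),   # matches the first object exactly (IoU 1.0)
         row_mask(2, 1, 8)]   # overlaps the second object (IoU 5/8)

ar_at_1 = average_recall(gt, props, n=1)
ar_at_2 = average_recall(gt, props, n=2)
print(ar_at_1, ar_at_2)
```

Allowing more proposals can only raise recall, which is why AR is reported at several values of n.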
• DeepMask20: Trained only with objects belonging to one of the 20 PASCAL categories. AR
is lower than for DeepMask, which suggests the network does not fully generalize to unseen
classes (unseen classes receive low scores).
• DeepMask20*: Similar to DeepMask20, but the scoring path is taken from the original DeepMask.
• DeepMaskZoom: An additional smaller scale is used to boost AR, at the cost of increased
inference time.
• DeepMaskFull: The two FC layers in the path for predicting the segmentation mask are
replaced by one FC layer mapping directly from the 512×14×14 feature maps to the 56×56
segmentation map. The whole architecture has over 300M parameters. It is slightly
inferior to DeepMask and much slower.
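
The "over 300M parameters" figure for DeepMaskFull can be checked with simple arithmetic: the single FC layer alone already exceeds it. The hidden width of 512 used for the two-layer variant is an assumption (it is not stated in this review).

```python
feat = 512 * 14 * 14     # flattened feature map: 100352 values
out = 56 * 56            # flattened mask: 3136 values

# DeepMaskFull: one FC layer straight from features to mask (weights + biases).
full_fc = feat * out + out

# Two stacked FC layers with an assumed 512-wide bottleneck in between.
hidden = 512
two_fc = (feat * hidden + hidden) + (hidden * out + out)

print(full_fc)  # 314707008
print(two_fc)   # 52989504
```

Under these assumptions the two-layer version needs roughly one sixth of the parameters of the direct mapping.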

4.2. PASCAL VOC 2007 (Boxes)

Average Recall (AR) for Detection Boxes on PASCAL VOC 2007 Test Set
• Region proposals are generated based on the predicted segmentation masks, and can be
used as the first step of an object detection pipeline.
• Fast R-CNN using DeepMask proposals outperforms the original Fast R-CNN using Selective
Search, as well as other state-of-the-art approaches.

4.3. Inference Time

• The inference time on MS COCO is 1.6s per image.
• The inference time on PASCAL VOC 2007 is 1.2s per image.
• Inference time can be further reduced by about 30% by parallelizing all scales in a single
batch.

The DeepMask implementation on GitHub has since been updated to replace the VGGNet backbone with
ResNet. After DeepMask, FAIR also proposed SharpMask. Hope I can cover it later as well.
