Lecture 07 Seg2


Image Segmentation 2

(Course timeline: we are here, just before the Christmas break.)
CNN → feature map (5 × 5 × 1024); a 1×1 convolution (instead of GAP + fully connected layers) produces a class distribution (5 × 5 × num. classes).
Output stride 32 Output stride 16 Output stride 8

Long, Shelhamer, Darrell. “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015, PAMI 2016.

Dilated (atrous) convolution: dilation = 1, dilation = 2, dilation = 3.

Effective kernel size of a dilated convolution with kernel size K and dilation D:

D(K − 1) + 1
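As a quick sanity check of this formula, here is a minimal PyTorch sketch (the 3×3 kernel and 32×32 input are illustrative assumptions): without padding, a dilated convolution shrinks each spatial dimension by exactly the effective kernel size minus one.

```python
import torch
import torch.nn as nn

K = 3                                   # kernel size
x = torch.randn(1, 1, 32, 32)           # toy input

for D in (1, 2, 3):                     # dilation rates
    conv = nn.Conv2d(1, 1, kernel_size=K, dilation=D, padding=0)
    y = conv(x)
    effective = D * (K - 1) + 1         # effective kernel size
    shrink = x.shape[-1] - y.shape[-1]  # equals effective - 1
    print(f"dilation={D}: effective kernel {effective}, "
          f"output {y.shape[-1]}x{y.shape[-1]}, shrink {shrink}")
```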
Dilated convolutions are typically applied at the “bottleneck” section of the network.

Atrous Spatial Pyramid Pooling (DeepLab-ASPP): the input feature map (Pool5) is processed by four parallel branches, each with a 3×3 atrous convolution (Fc6) at rate 6, 12, 18, or 24, followed by 1×1 convolutions (Fc7, Fc8); the branch outputs are combined by sum-fusion.

Chen et al., “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs” (2016).
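A minimal sketch of such an ASPP head in PyTorch, assuming 1024-channel input features and 21 output classes (both assumptions, following the original DeepLab setup on PASCAL VOC): each branch applies a 3×3 atrous convolution followed by two 1×1 convolutions, and the per-rate score maps are summed.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling head, DeepLab-style (a sketch; the
    channel sizes and number of classes are assumptions)."""
    def __init__(self, in_ch=1024, num_classes=21, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, 1024, 3, padding=r, dilation=r),  # atrous "Fc6"
                nn.ReLU(inplace=True),
                nn.Conv2d(1024, 1024, 1), nn.ReLU(inplace=True),   # "Fc7"
                nn.Conv2d(1024, num_classes, 1),                   # "Fc8"
            )
            for r in rates
        )

    def forward(self, x):
        # sum-fusion of the per-rate score maps
        return sum(branch(x) for branch in self.branches)

scores = ASPP()(torch.randn(1, 1024, 33, 33))
print(scores.shape)  # torch.Size([1, 21, 33, 33])
```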
Replace stride-2 operations with dilation-2 operations (keeps spatial resolution).

Reduced resolution (encoder) → upsampling (decoder).
Upsampling

Convolution in the im2col representation:

X' = W · X,   with W: [1 × K] (kernel), X: [K × n] (im2col representation), X': [1 × n].

Transposed convolution multiplies by the transposed kernel:

X = W^T · X',   with W^T: [K × 1], X': [1 × n], X: [K × n].

(Remark: for CNNs, also apply the inverse of im2col to obtain the 2D grid representation – it sums up overlapping values.)
Transposed convolution example: input 3×3 → output 5×5.
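A small PyTorch sketch of this case (the all-ones kernel and input are assumptions made for readability): each input value paints a 3×3 kernel into the output and overlaps are summed, so a 3×3 input becomes a 5×5 output.

```python
import torch
import torch.nn as nn

# Transposed convolution, kernel 3, stride 1: out = in + kernel - 1 = 5.
x = torch.ones(1, 1, 3, 3)
up = nn.ConvTranspose2d(1, 1, kernel_size=3, stride=1, bias=False)
nn.init.constant_(up.weight, 1.0)

y = up(x)
print(y.shape)   # torch.Size([1, 1, 5, 5])
print(y[0, 0])   # overlap counts: corners receive 1 contribution, the center 9
```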
stride = 2, kernel = 3: the kernel footprints overlap unevenly across the output, producing checkerboard artifacts.

stride = 2, kernel = 4: the kernel size is divisible by the stride, so every output position receives the same number of contributions.

https://distill.pub/2016/deconv-checkerboard/
Original image vs. nearest neighbor, bilinear, and bicubic interpolation.
Image reconstruction example: using transposed convolution vs. resize-convolution.

https://distill.pub/2016/deconv-checkerboard/
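A minimal sketch of a resize-convolution block (channel sizes are arbitrary assumptions): bilinear upsampling followed by an ordinary convolution, used in place of a transposed convolution to avoid checkerboard artifacts.

```python
import torch
import torch.nn as nn

class ResizeConv(nn.Module):
    """Resize-convolution: upsample, then apply a regular convolution."""
    def __init__(self, in_ch, out_ch, scale=2):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(self.up(x))

y = ResizeConv(64, 32)(torch.randn(1, 64, 14, 14))
print(y.shape)   # torch.Size([1, 32, 28, 28])
```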
Skip connections: encoder feature maps are copied to the corresponding decoder stage and appended (concatenated) to the upsampled features.

O. Ronneberger et al. “U-Net: Convolutional Networks for Biomedical Image Segmentation”. MICCAI 2015
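A minimal sketch of one U-Net-style decoder step (the block name and channel sizes are illustrative assumptions): upsample the decoder features, append the encoder features from the matching resolution via the skip connection, then convolve.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One decoder stage with a skip connection (a sketch, not the exact U-Net)."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                    # e.g. 128 ch @ 16x16 -> 64 ch @ 32x32
        x = torch.cat([x, skip], dim=1)   # skip connection: append encoder features
        return self.conv(x)

out = UpBlock(128, 64, 64)(torch.randn(1, 128, 16, 16), torch.randn(1, 64, 32, 32))
print(out.shape)   # torch.Size([1, 64, 32, 32])
```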
Badrinarayanan et al. “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation”. TPAMI 2016
Semantic segmentation: label every pixel, including the background (sky, grass, road); does not differentiate between the pixels from objects (instances) of the same class.

Instance segmentation: do not label pixels coming from uncountable objects (“stuff”), e.g. “sky”, “grass”, “road”; differentiates between the pixels coming from instances of the same class.
Proposal-based: 1. proposals (e.g. bounding boxes); 2. segment and classify.

Proposal-free: 1. semantic segmentation (optional); 2. group pixels into instances.
R-CNN family

2014: R-CNN → 2015: Fast R-CNN → 2016: Faster R-CNN → 2017: Mask R-CNN (ICCV 2017)
Faster R-CNN: image → CNN → Region Proposal Network → bounding box regression head + classification head.

Mask R-CNN adds a mask head: image → CNN → Region Proposal Network → bounding box regression head + classification head + mask head.
Faster R-CNN provides the object recognition head; a segmentation head is added on top. Most of the features are shared.
+ mask loss: cross-entropy per pixel

Faster R-CNN + a conv-only mask head. New: RoIAlign.

He et al. “Mask R-CNN”. ICCV 2017
RoIAlign: the predicted bounding box generally does not align with the CNN feature grid. Assume 2×2 bins: within each bin, sample points at fixed positions and compute each sampled value by bilinear interpolation from the features at the four surrounding grid points (the features in the corners), then pool within each bin.
He et al. “Mask R-CNN”. ICCV 2017
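A small sketch of the bilinear sampling used inside RoIAlign (the helper function is an illustration, not the library implementation): the feature value at a real-valued point is a weighted average of the four surrounding grid cells.

```python
import torch

def bilinear_sample(feat, x, y):
    """Bilinearly interpolate a feature grid feat[C, H, W] at a real-valued
    point (x, y) lying strictly inside the grid."""
    x0, y0 = int(x), int(y)          # top-left neighboring grid point
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0          # fractional offsets
    return ((1 - wx) * (1 - wy) * feat[:, y0, x0] +
            wx * (1 - wy) * feat[:, y0, x1] +
            (1 - wx) * wy * feat[:, y1, x0] +
            wx * wy * feat[:, y1, x1])

feat = torch.arange(16.0).reshape(1, 4, 4)   # one-channel 4x4 feature grid
print(bilinear_sample(feat, 1.5, 2.5))       # tensor([11.5])
```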

(Qualitative results: Mask R-CNN instance segmentation on COCO images; each detected instance is labeled with its class and confidence score, e.g. person 1.00, sheep 0.99, car 0.93.)
Mask R-CNN + PointRend

The mask head predicts a coarse 28×28 mask; bilinear upsampling (28×28 → 56×56 → 112×112 → 224×224) produces the full-resolution upsampled mask.

Kirillov et al., “PointRend: Image Segmentation as Rendering” (2020)
(Figure: Mask R-CNN head architectures — Faster R-CNN vs. Faster R-CNN w/ FPN; RoI features feed the class/box heads and a small convolutional mask head that outputs a 28×28 mask, acting as the upsampling decoder.)
28×28 → 56×56

Kirillov et al., “PointRend: Image Segmentation as Rendering” (2020)
Mask R-CNN + PointRend: the mask is represented by a point function f_θ(x, y) that can be queried at arbitrary coordinates, including non-integer positions, e.g. f_θ(3, 8) = 0, f_θ(14, 14) = 1, f_θ(27.1, 15.7) = 1.

28×28 → 56×56 → 112×112 → 224×224

Kirillov et al., “PointRend: Image Segmentation as Rendering” (2020)
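A sketch of a PointRend-style point head under simple assumptions (the shapes, channel sizes, and tiny MLP are illustrative, not the paper's exact configuration): fine-grained features are bilinearly sampled at the query coordinates and each point is classified independently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, P = 256, 64                                   # feature channels, points per RoI (assumptions)
fine_features = torch.randn(1, C, 28, 28)        # fine-grained per-RoI feature map
points = torch.rand(1, P, 2) * 2 - 1             # query coords in [-1, 1] (grid_sample convention)

# bilinear sampling of features at the query points -> [1, C, P]
sampled = F.grid_sample(fine_features, points.unsqueeze(2),
                        mode="bilinear", align_corners=False).squeeze(3)

point_head = nn.Sequential(                      # f_theta: a small per-point MLP
    nn.Conv1d(C, 256, 1), nn.ReLU(inplace=True),
    nn.Conv1d(256, 1, 1),                        # foreground/background logit per point
)
logits = point_head(sampled)                     # [1, 1, P]
print(logits.shape)
```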
Training-time point selection: a trade-off between uniform sampling and sampling near boundaries (uncertain predictions), defined by the hyperparameters k and β:

a) regular grid, b) uniform (k = 1, β = 0.0), c) mildly biased (k = 3, β = 0.75), d) heavily biased (k = 10, β = 0.75)

Kirillov et al., “PointRend: Image Segmentation as Rendering” (2020)
One iteration of adaptive subdivision: starting from the initial mask (e.g. 4×4), upsample 2× (to 8×8), select the most uncertain points, and refine them with point predictions — queries to f_θ(x, y) — to obtain the refined 8×8 mask.

Kirillov et al., “PointRend: Image Segmentation as Rendering” (2020)
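A rough sketch of one such subdivision step (an illustration of the idea with a placeholder point predictor, not the reference implementation): upsample the coarse mask logits 2×, pick the points whose probability is closest to 0.5, and overwrite them with point predictions.

```python
import torch
import torch.nn.functional as F

def refine_once(mask_logits, point_fn, num_points=16):
    """One adaptive-subdivision step: upsample 2x, find the num_points most
    uncertain positions, and replace their logits with point predictions."""
    up = F.interpolate(mask_logits, scale_factor=2, mode="bilinear", align_corners=False)
    uncertainty = -(up.sigmoid() - 0.5).abs()             # higher = more uncertain
    _, idx = uncertainty.flatten(2).topk(num_points, dim=2)
    flat = up.flatten(2)
    flat.scatter_(2, idx, point_fn(idx))                  # overwrite the selected logits
    return flat.view_as(up)

coarse = torch.randn(1, 1, 4, 4)                          # initial 4x4 mask logits
# point_fn is a placeholder; in PointRend it would be the point head f_theta
refined = refine_once(coarse, point_fn=lambda idx: torch.randn(idx.shape))
print(refined.shape)   # torch.Size([1, 1, 8, 8])
```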
Mask R-CNN + PointRend: coarse 28×28 mask upsampled directly to 224×224 vs. progressive refinement 28×28 → 56×56 → 112×112 → 224×224.

Kirillov et al., “PointRend: Image Segmentation as Rendering” (2020)
Proposal-based: 1. proposals (e.g. bounding boxes); 2. segment and classify.

Proposal-free: 1. semantic segmentation (optional); 2. group pixels into instances.
Long et al. (2015)
Y = KX

Pixelwise class scores Y: [C × HW] = layer parameters (1×1 conv) K: [C × D] × features X: [D × HW]
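A quick numerical check in PyTorch (sizes are arbitrary assumptions): a 1×1 convolution is exactly this matrix product applied at every pixel.

```python
import torch
import torch.nn.functional as F

C, D, H, W = 21, 256, 8, 8           # num. classes, feature dim, spatial size (assumptions)
X = torch.randn(1, D, H, W)          # feature map
K = torch.randn(C, D)                # 1x1 conv weights = per-pixel linear classifier

# 1x1 convolution ...
y_conv = F.conv2d(X, K.view(C, D, 1, 1))

# ... equals the matrix product Y = K X applied to every pixel
y_mat = (K @ X.view(D, H * W)).view(1, C, H, W)

print(torch.allclose(y_conv, y_mat, atol=1e-5))   # True (up to floating-point error)
```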
A fully convolutional network with two branches: a kernel branch predicts a grid of convolution kernels G: S × S × D, and a feature branch predicts a feature map F: H × W × E from the input image I; each predicted kernel is convolved (*) with F to produce an instance mask.
Convolution: M_{i,j} = G_{i,j} * F — the kernel predicted at grid cell (i, j) is convolved with the feature map F to produce the mask M_{i,j}.
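A sketch of this dynamic convolution with assumed shapes (S, E, H, W are placeholders): the kernel predicted for one grid cell is reshaped into a 1×1 convolution weight and applied to the shared feature map.

```python
import torch
import torch.nn.functional as F

S, E, H, W = 12, 64, 56, 56
G = torch.randn(S, S, E)              # predicted kernels (1x1, so D = E)
feat = torch.randn(1, E, H, W)        # feature branch output F

i, j = 4, 7                           # one grid cell
kernel = G[i, j].view(1, E, 1, 1)     # reshape into a 1x1 conv weight
mask_logits = F.conv2d(feat, kernel)  # M_{i,j}: [1, 1, H, W]
print(mask_logits.shape)
```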
The kernel dimensionality D depends on the kernel size (1×1 kernels work well, so D = E).
The kernel branch also predicts an S×S grid with a class distribution per cell. Kernels with an uncertain class prediction are disregarded, and NMS removes duplicate masks.
(Plot: (a) speed vs. accuracy — COCO Mask AP vs. inference time (ms) for Mask R-CNN, TensorMask, YOLACT, PolarMask, BlendMask, SOLO, and SOLOv2, with a real-time threshold marked on the time axis; (b) detail comparison.)
Panoptic segmentation

Semantic segmentation (e.g. FCN, DeepLab) + instance segmentation (e.g. Mask R-CNN) = panoptic segmentation (e.g. Panoptic FPN)
It gives labels to uncountable objects called “stuff” (sky, road, etc.), similar to FCN-like networks.

It differentiates between pixels coming from different instances of the same class (countable objects), called “things” (cars, pedestrians, etc.).
“Image parsing” (Tu et al., 2005); “Holistic scene understanding” (Yao et al., 2012)
Key components in a panoptic segmentation method: input image → feature extractor (ResNet-50) → semantic segmentation (CNN) + instance segmentation (CNN) → merging (combine using heuristics) → panoptic output.

Adapted from [de Geus et al., 2018].
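A rough sketch of such a merging heuristic (the rule and threshold here are assumptions for illustration, not a specific paper's procedure): paste instance masks in order of confidence, drop heavily occluded ones, and fill the remaining pixels with the semantic prediction for stuff classes.

```python
import numpy as np

def merge_panoptic(sem_seg, instances, thing_ids, overlap_thr=0.5):
    """Merge instance masks and a semantic map into a panoptic map (sketch)."""
    panoptic = np.zeros_like(sem_seg, dtype=np.int32)         # 0 = unassigned
    next_id = 1
    for mask, score, _cls in sorted(instances, key=lambda t: -t[1]):
        free = mask & (panoptic == 0)                         # pixels not yet claimed
        if mask.sum() == 0 or free.sum() / mask.sum() < overlap_thr:
            continue                                          # mostly occluded -> drop
        panoptic[free] = next_id
        next_id += 1
    stuff = (panoptic == 0) & ~np.isin(sem_seg, list(thing_ids))
    panoptic[stuff] = -sem_seg[stuff]                         # negative ids mark stuff classes
    return panoptic

# toy example: one "thing" instance (class 5) on a 4x4 image, the rest is "sky" (class 1)
sem = np.ones((4, 4), dtype=np.int32)
inst = np.zeros((4, 4), dtype=bool)
inst[1:3, 1:3] = True
print(merge_panoptic(sem, [(inst, 0.9, 5)], thing_ids={5}))
```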
Panoptic FPN: (a) a Feature Pyramid Network serves as the shared feature extractor in this pipeline.

Kirillov et al., “Panoptic Feature Pyramid Networks”. CVPR 2019
(b) Instance segmentation branch and (c) semantic segmentation branch are built on top of the shared FPN features.

Kirillov et al., “Panoptic Feature Pyramid Networks”. CVPR 2019
(c) Semantic segmentation branch: each FPN level (256 channels at 1/32, 1/16, 1/8, and 1/4 scale) passes through repeated conv → 2× upsampling stages (three, two, one, and zero upsamplings, respectively) to reach 128 channels at 1/4 scale; the per-level outputs are merged, and a final conv → 4× upsampling produces the C-channel prediction at full resolution (C × 1).

Kirillov et al., “Panoptic Feature Pyramid Networks”. CVPR 2019
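A sketch of this branch under the assumptions stated on the slide (128 intermediate channels, bilinear 2× upsampling, a toy 128×128 image, 19 classes; the paper's normalization layers are omitted):

```python
import torch
import torch.nn as nn

def semantic_stage(n_up, in_ch=256, mid_ch=128):
    """One FPN level: n_up repetitions of conv -> 2x bilinear upsampling bring
    the level to 1/4 scale with 128 channels; the 1/4-scale level gets one conv."""
    layers, ch = [], in_ch
    for _ in range(n_up):
        layers += [nn.Conv2d(ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
                   nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)]
        ch = mid_ch
    if n_up == 0:
        layers += [nn.Conv2d(ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)

# FPN levels at 1/32, 1/16, 1/8, 1/4 scale of a toy 128x128 image (256 channels each)
feats = [torch.randn(1, 256, s, s) for s in (4, 8, 16, 32)]
stages = [semantic_stage(n) for n in (3, 2, 1, 0)]
merged = sum(stage(f) for stage, f in zip(stages, feats))        # 128 channels at 1/4 scale
num_classes = 19                                                 # assumption (e.g. Cityscapes)
head = nn.Sequential(nn.Conv2d(128, num_classes, 1),
                     nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False))
out = head(merged)                                               # C channels at full resolution
print(merged.shape, out.shape)   # [1, 128, 32, 32]  [1, 19, 128, 128]
```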
The instance and semantic outputs are merged (combined using heuristics) into the panoptic output.

Kirillov et al., “Panoptic Feature Pyramid Networks”. CVPR 2019
Trade-off hyperparameter λ_s:

L = L_c + L_b + L_m + λ_s · L_s

L_c + L_b + L_m: instance segmentation branch loss; L_s: semantic segmentation branch loss.

Kirillov et al., “Panoptic Feature Pyramid Networks”. CVPR 2019
In the pipeline, the semantic segmentation branch is trained with L_s and the instance segmentation branch with L_c + L_b + L_m.
Recall the kernel branch / feature branch design: a kernel branch predicts G: S × S × D and a feature branch (FCN) predicts F: H × W × E; masks are obtained by convolving the predicted kernels with F.
Panoptic FPN vs. Panoptic FCN (Fully Convolutional Network)

Results on MS-COCO (val).

Li et al., “Fully Convolutional Networks for Panoptic Segmentation”, CVPR 2021.
Panoptic FCN: qualitative examples.
