Lecture 07 Seg2
We are here
Christmas break
Class distribution
(5x5xnum. classes)
Feature map
(5 x 5 x 1024)
GAP conv
CNN
1x1 convolution
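As a rough illustration of the 1x1 convolution step above, the sketch below maps every spatial location of a feature map to a class distribution. The function name `conv1x1_softmax`, the toy channel count, and the softmax at the end are assumptions made for illustration; they are not taken from the slides.

```python
import math

def conv1x1_softmax(feature_map, weights, bias):
    """Apply a 1x1 convolution followed by a softmax at every location.

    feature_map: H x W x C_in nested lists
    weights:     num_classes x C_in (one weight vector per class)
    bias:        num_classes
    Returns an H x W x num_classes map of class probabilities.
    """
    out = []
    for row in feature_map:
        out_row = []
        for feat in row:
            # A 1x1 convolution is the same linear layer applied per location.
            logits = [sum(w * x for w, x in zip(w_c, feat)) + b_c
                      for w_c, b_c in zip(weights, bias)]
            m = max(logits)  # subtract the max for numerical stability
            exps = [math.exp(z - m) for z in logits]
            total = sum(exps)
            out_row.append([e / total for e in exps])
        out.append(out_row)
    return out

# Toy example: a 5 x 5 x 4 feature map and 3 classes
# (the slide uses 5 x 5 x 1024; 4 channels keep the example small).
fmap = [[[0.1 * (i + j + c) for c in range(4)] for j in range(5)] for i in range(5)]
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
b = [0.0, 0.0, 0.0]
probs = conv1x1_softmax(fmap, W, b)
print(len(probs), len(probs[0]), len(probs[0][0]))  # 5 5 3
```

Because the same weights are used at every location, the spatial size of the output equals that of the feature map; only the channel dimension changes.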
Output stride 32 Output stride 16 Output stride 8
Long, Shelhamer, Darrell. “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015, PAMI 2016.
N N−1
Kernel size $K$, dilation $D$
Effective kernel size: $D(K-1)+1$
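The effective kernel size $D(K-1)+1$ can be checked with a small 1D example. This is a minimal sketch; the helper names `effective_kernel_size` and `dilated_conv1d` are made up for illustration.

```python
def effective_kernel_size(k, d):
    """Extent covered by a kernel of size k with dilation d: d * (k - 1) + 1."""
    return d * (k - 1) + 1

def dilated_conv1d(x, w, d):
    """'Valid' 1D dilated convolution (cross-correlation, as used in CNNs)."""
    span = effective_kernel_size(len(w), d)
    return [sum(w[j] * x[i + j * d] for j in range(len(w)))
            for i in range(len(x) - span + 1)]

x = [1, 2, 3, 4, 5, 6, 7]
w = [1, 0, -1]
print(effective_kernel_size(3, 1))  # 3
print(effective_kernel_size(3, 2))  # 5
print(dilated_conv1d(x, w, 1))      # [-2, -2, -2, -2, -2]
print(dilated_conv1d(x, w, 2))      # [-4, -4, -4]
```

With dilation 2 the same three weights span five input positions, so the receptive field grows without adding parameters.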
Typically at the “bottleneck” section
Sum-Fusion
Chen et al., “DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs” (2016).
Replace stride-2 OPs with dilation-2 OPs
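One way to picture the replacement of stride-2 operations with dilation-2 operations is as a plan over network stages. The sketch below is a simplification under stated assumptions: each stage is described only by its downsampling stride, and the convolutions of a stage whose stride is removed are assumed to use the accumulated dilation. The function name and the stage-level granularity are hypothetical.

```python
def replace_stride_with_dilation(strides, target_output_stride):
    """Return a (stride, dilation) pair per stage and the resulting output stride.

    Once the accumulated stride reaches target_output_stride, later strides
    are set to 1 and the dilation is multiplied instead.
    """
    current_stride = 1
    dilation = 1
    plan = []
    for s in strides:
        if current_stride >= target_output_stride:
            dilation *= s          # keep the resolution, widen the kernel
            plan.append((1, dilation))
        else:
            current_stride *= s    # regular downsampling
            plan.append((s, dilation))
    return plan, current_stride

stages = [2, 2, 2, 2, 2]  # five stride-2 stages give output stride 32
print(replace_stride_with_dilation(stages, 32))
print(replace_stride_with_dilation(stages, 16))
print(replace_stride_with_dilation(stages, 8))
```

For a target output stride of 8, the last two stages keep stride 1 and use dilations 2 and 4, so the feature map stays at 1/8 of the input resolution.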
reduced resolution (encoder)
upsampling (decoder)
Upsampling
im2col representation
kernel
Convolution: $X' = WX$, with $W$: $[1 \times K]$, $X$: $[K \times n]$, $X'$: $[1 \times n]$
Transposed convolution: $X = W^{T} X'$, with $X$: $[K \times n]$, $W^{T}$: $[K \times 1]$, $X'$: $[1 \times n]$
(Remark: For CNNs, also apply the inverse of im2col.)
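The im2col view of convolution and its transpose can be sketched in 1D as follows. The helper names are hypothetical, and the "inverse of im2col" is implemented here as the usual scatter-add (often called col2im), which sums overlapping entries rather than inverting them exactly.

```python
def im2col_1d(x, k):
    """K x n matrix whose column i is the window x[i:i+k]."""
    n = len(x) - k + 1
    return [[x[i + r] for i in range(n)] for r in range(k)]

def conv_as_matmul(w, cols):
    """X' = W X with W of shape [1 x K] and X of shape [K x n]."""
    n = len(cols[0])
    return [sum(w[r] * cols[r][i] for r in range(len(w))) for i in range(n)]

def transposed_as_matmul(w, y):
    """X = W^T X' with W^T of shape [K x 1] and X' of shape [1 x n]."""
    return [[w_r * y_i for y_i in y] for w_r in w]

def col2im_1d(cols, length):
    """Add every column entry back at the position it was taken from."""
    out = [0] * length
    for r, row in enumerate(cols):
        for i, v in enumerate(row):
            out[i + r] += v
    return out

x = [1, 2, 3, 4]
w = [1, 10, 100]
cols = im2col_1d(x, 3)       # [[1, 2], [2, 3], [3, 4]]
y = conv_as_matmul(w, cols)  # [321, 432]
up = col2im_1d(transposed_as_matmul(w, y), len(x))
print(y, up)                 # [321, 432] [321, 3642, 36420, 43200]
```

The transposed operation maps the two outputs back to a signal of the original length 4, which is why it can be used for upsampling.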
Input 3x3
Output 5x5
stride = 2, kernel = 3
input
output
https://distill.pub/2016/deconv-checkerboard/
Kernel size: 3
Stride: 2
stride = 2, kernel = 4
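The difference between the two settings (stride 2 with kernel 3 versus stride 2 with kernel 4) can be seen by counting how many input positions write to each output position of a transposed convolution. This is a small 1D sketch without padding; the function name `overlap_counts` is an assumption for illustration.

```python
def overlap_counts(n_in, kernel, stride):
    """Number of input positions contributing to each output position
    of a 1D transposed convolution (no padding)."""
    n_out = (n_in - 1) * stride + kernel
    counts = [0] * n_out
    for i in range(n_in):
        for k in range(kernel):
            counts[i * stride + k] += 1
    return counts

print(overlap_counts(3, kernel=3, stride=1))  # [1, 2, 3, 2, 1]
print(overlap_counts(4, kernel=3, stride=2))  # [1, 1, 2, 1, 2, 1, 2, 1, 1]
print(overlap_counts(4, kernel=4, stride=2))  # [1, 1, 2, 2, 2, 2, 2, 2, 1, 1]
```

When the kernel size is not divisible by the stride (3 and 2), the overlap alternates between 1 and 2, which is the source of checkerboard patterns; with kernel 4 and stride 2 the interior overlap is uniform.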
Original image
Image reconstruction example
Using transposed convolution
Resize-convolution
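A resize-convolution first resizes the signal and then applies an ordinary stride-1 convolution. The 1D sketch below assumes nearest-neighbour resizing and zero padding; the helper names are made up for illustration.

```python
def upsample_nearest_1d(x, factor):
    """Repeat every sample `factor` times."""
    return [v for v in x for _ in range(factor)]

def conv1d_same(x, w):
    """Stride-1 convolution with zero padding (odd kernel length, same size)."""
    pad = len(w) // 2
    xp = [0] * pad + list(x) + [0] * pad
    return [sum(w[j] * xp[i + j] for j in range(len(w))) for i in range(len(x))]

def resize_conv1d(x, w, factor=2):
    """Upsample first, then convolve: every output sees the same overlap."""
    return conv1d_same(upsample_nearest_1d(x, factor), w)

print(upsample_nearest_1d([1, 2, 3], 2))      # [1, 1, 2, 2, 3, 3]
print(resize_conv1d([1, 2, 3], [1, 1, 1]))    # [2, 4, 5, 7, 8, 6]
```

Since the convolution runs with stride 1 on the resized signal, every interior output position receives the same number of contributions.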
skip connection
O. Ronneberger et al. “U-Net: Convolutional Networks for Biomedical Image Segmentation”. MICCAI 2015
append
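The "append" step of a skip connection can be read as channel-wise concatenation of encoder features with decoder features of the same spatial size. The sketch below uses nested lists in H x W x C layout; the function name and layout are assumptions for illustration.

```python
def concat_channels(decoder_feat, encoder_feat):
    """Append the encoder channels to the decoder channels at every pixel."""
    if len(decoder_feat) != len(encoder_feat):
        raise ValueError("feature maps must have the same height")
    return [[d_px + e_px for d_px, e_px in zip(d_row, e_row)]
            for d_row, e_row in zip(decoder_feat, encoder_feat)]

decoder = [[[1], [2]]]           # 1 x 2 x 1 (upsampled decoder features)
encoder = [[[3, 4], [5, 6]]]     # 1 x 2 x 2 (features from the encoder)
merged = concat_channels(decoder, encoder)
print(merged)                    # [[[1, 3, 4], [2, 5, 6]]]
```

The spatial size is unchanged, while the channel count becomes the sum of both inputs.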
Badrinarayanan et al. “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation”. TPAMI 2016
Semantic segmentation:
Label every pixel, including the background (sky, grass, road).
Does not differentiate between the pixels from objects (instances) of the same class.
Instance segmentation:
Do not label pixels coming from uncountable objects (“stuff”), e.g. “sky”, “grass”, “road”.
Differentiates between the pixels coming from instances of the same class.
Proposal-based:
1. Proposals (e.g. bounding boxes)
2. Segment and classify
Proposal-free:
1. Semantic segmentation (optional)
2. Group pixels into instances
R-CNN family
2014 2015 2016 2017
(ICCV 2017)
Bounding box regression head
Image CNN
Image CNN
Mask head
Segmentation head
Most of the features are shared
+ mask loss:
cross-entropy per pixel
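A per-pixel cross-entropy mask loss can be sketched as below. The sketch assumes the mask head outputs one foreground probability per pixel and averages a binary cross-entropy over all pixels; the function name and the `eps` clamp are illustrative choices, not details from the slides.

```python
import math

def mask_loss(pred_probs, target_mask, eps=1e-7):
    """Average binary cross-entropy over all pixels of one mask."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred_probs, target_mask):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count

target = [[1, 0], [0, 1]]
good = [[0.9, 0.1], [0.1, 0.9]]
bad = [[0.1, 0.9], [0.9, 0.1]]
print(mask_loss(good, target) < mask_loss(bad, target))  # True
```

This term is added to the other losses of the detector, so the mask is trained jointly with classification and box regression.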
CNN
Feature grid
He et al. “Mask R-CNN”. ICCV 2017
[Figure: Mask R-CNN example results; detected instances are labeled with class and confidence score, e.g. person 1.00, sheep .99, umbrella .98, wine glass .97.]
Mask R-CNN + PointRend
Bilinear upsampling
upsampling decoder
28x28 28×28
28×28 56×56
Kirillov et al., “PointRend: Image Segmentation as Rendering” (2020)
$f(14, 14) = 1$
224×224
$f_\theta(x, y)$
$f_\theta(27.1, 15.7) = 1$
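Evaluating a function at a real-valued location such as (27.1, 15.7) requires reading grid data between pixel centres. The sketch below shows plain bilinear interpolation as one ingredient of such a query; it is not the PointRend point head itself, and the function name and clamping behaviour are assumptions.

```python
import math

def bilinear_sample(grid, x, y):
    """Read a 2D grid (list of rows) at a real-valued location (x, y).

    x indexes columns and y indexes rows; both are clamped to the grid.
    """
    h, w = len(grid), len(grid[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * grid[y0][x0] + ax * grid[y0][x1]
    bottom = (1 - ax) * grid[y1][x0] + ax * grid[y1][x1]
    return (1 - ay) * top + ay * bottom

grid = [[0.0, 1.0], [2.0, 3.0]]
print(bilinear_sample(grid, 0.5, 0.5))  # 1.5
print(bilinear_sample(grid, 1.0, 0.0))  # 1.0
```

At integer coordinates the function returns the stored grid value, so a regular grid of queries reproduces the original mask.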
Trade-off defined by hyperparameters
a) regular grid  b) uniform  c) mildly biased  d) heavily biased
Queries to $f_\theta(x, y)$
Long et al., (2015)
$Y = KX$, with $X$: $[D \times HW]$ and $K$: $[C \times D]$
[Figure: input I → FCN → kernel branch (G: S × S × D) and feature branch (F: H × W × E), combined by convolution (*).]
Convolution: $M_{i,j} = G_{i,j} \ast F$
Dimensionality D depends on the kernel size (1x1 works well, so D = E)
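For 1x1 kernels, convolving the kernel predicted at grid cell (i, j) with the feature map F reduces to a dot product at every pixel, which is why D = E in that case. The sketch below illustrates this with toy sizes; the function name and all numbers are hypothetical.

```python
def dynamic_mask(kernel, features):
    """M_ij = G_ij * F for a 1x1 kernel: a dot product at every pixel.

    kernel:   list of D weights (the vector stored at G[i][j])
    features: H x W x E nested lists with E == D
    """
    return [[sum(g * f for g, f in zip(kernel, px)) for px in row]
            for row in features]

# Toy sizes (assumed): S = 2 grid cells per side, D = E = 2, H = W = 2.
G = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 1.0], [1.0, -1.0]]]
F = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]
masks = {(i, j): dynamic_mask(G[i][j], F) for i in range(2) for j in range(2)}
print(masks[(0, 0)])  # [[1.0, 3.0], [5.0, 7.0]]
print(masks[(1, 1)])  # [[-1.0, -1.0], [-1.0, -1.0]]
```

Each grid cell yields its own H x W mask from the shared feature map, so the number of candidate masks is S x S.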
SxS grid with class distribution
NMS
[Figure: (a) Speed vs. accuracy: COCO Mask AP against inference time (ms) for SOLOv2, SOLO, Mask R-CNN, TensorMask, YOLACT, PolarMask and BlendMask, with the real-time range marked; (b) Detail comparison.]
Panoptic segmentation
Semantic segmentation + Instance segmentation = Panoptic segmentation
It gives labels to uncountable objects called "stuff" (sky, road, etc.), similar to FCN-like networks.
“Image parsing”
(Tu et al., 2005):
“Holistic scene understanding”
(Yao et al., 2012):
Key components in a panoptic
segmentation method
Instance
Segmentation
(CNN)
Input image
Feature extractor (ResNet-50)
Semantic Segmentation (CNN)
Instance Segmentation (CNN)
Merging: combine using heuristics
Panoptic output
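The slides only state that the semantic and instance outputs are combined using heuristics. The sketch below shows one simple heuristic as an assumption: higher-scoring instances claim pixels first, and the remaining pixels take the semantic label if it belongs to a stuff class. The function name, data layout and tie-breaking are all hypothetical.

```python
def merge_panoptic(semantic, instances, stuff_classes):
    """Combine semantic and instance predictions into one panoptic map.

    semantic:  H x W grid of class ids
    instances: list of (score, class_id, mask), mask being an H x W grid of 0/1
    Returns an H x W grid of (class_id, instance_id); instance_id 0 marks
    stuff, and class_id None marks pixels left unlabeled.
    """
    h, w = len(semantic), len(semantic[0])
    out = [[None] * w for _ in range(h)]
    ranked = sorted(instances, key=lambda inst: -inst[0])
    for inst_id, (_score, cls, mask) in enumerate(ranked, start=1):
        for y in range(h):
            for x in range(w):
                if mask[y][x] and out[y][x] is None:
                    out[y][x] = (cls, inst_id)  # first come, first served
    for y in range(h):
        for x in range(w):
            if out[y][x] is None:
                cls = semantic[y][x]
                out[y][x] = (cls, 0) if cls in stuff_classes else (None, 0)
    return out

semantic = [[0, 0, 1], [0, 1, 1]]  # class 0 = stuff, class 1 = thing
instances = [(0.8, 1, [[0, 1, 1], [0, 0, 0]]),
             (0.9, 1, [[0, 0, 1], [0, 1, 1]])]
panoptic = merge_panoptic(semantic, instances, stuff_classes={0})
print(panoptic)
```

In this toy case the overlapping pixel at row 0, column 2 goes to the instance with score 0.9, and the two left-most pixels keep the stuff label.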
(a) Feature Pyramid Network
(c) Semantic Segmentation Branch
1/32: conv→2×→conv→2×→conv→2× → 128 × 1/4
256 × 1/16: conv→2×→conv→2× → 128 × 1/4
256 × 1/8: conv→2× → 128 × 1/4
256 × 1/4: conv → 128 × 1/4
conv→4× → C × 1
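The number of conv→2× stages per pyramid level in the semantic segmentation branch follows from the scale gap to the common 1/4 resolution. A minimal sketch, assuming scales are powers of two (the function name is made up):

```python
def upsampling_stages(scale_denominator, target_denominator=4):
    """Number of (conv -> 2x upsampling) stages from 1/scale to 1/target."""
    ratio = scale_denominator // target_denominator
    return ratio.bit_length() - 1  # log2 for powers of two

for s in (32, 16, 8, 4):
    print(f"1/{s}: {upsampling_stages(s)} upsampling stage(s)")
```

The 1/4 level needs no upsampling and only passes through a convolution, which matches the plain "conv" entry for that level.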
Trade-off hyperparameter
$L = L_c + L_b + L_m + \lambda_s L_s$
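The combined training loss is a sum of the three instance-branch terms and the semantic term scaled by the trade-off hyperparameter. A minimal sketch follows; the default value of `lambda_s` is an arbitrary placeholder, not a value from the slides.

```python
def panoptic_loss(l_c, l_b, l_m, l_s, lambda_s=0.5):
    """L = L_c + L_b + L_m + lambda_s * L_s.

    l_c, l_b, l_m: classification, box and mask losses (instance branch)
    l_s:           semantic segmentation loss
    lambda_s:      trade-off hyperparameter (placeholder default)
    """
    return l_c + l_b + l_m + lambda_s * l_s

print(panoptic_loss(1.0, 2.0, 3.0, 4.0, lambda_s=0.5))  # 8.0
print(panoptic_loss(1.0, 2.0, 3.0, 4.0, lambda_s=0.0))  # 6.0
```

Setting `lambda_s` to zero trains only the instance branch, while larger values shift the shared features towards the semantic task.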
Kirillov et al., “Panoptic Feature Pyramid Networks”. CVPR 2019
Semantic Segmentation (CNN): $L_s$
Instance Segmentation (CNN): $L_c + L_b + L_m$
[Figure: input I → FCN → kernel branch (G: S × S × D) and feature branch (F: H × W × E), combined by convolution (*).]
Panoptic FPN Panoptic FCN (Fully Convolutional Network)