
Semantic Understanding of Scenes through the ADE20K Dataset

Bolei Zhou · Hang Zhao · Xavier Puig · Tete Xiao · Sanja Fidler · Adela Barriuso · Antonio Torralba
arXiv:1608.05442v2 [cs.CV] 16 Oct 2018

Abstract Semantic understanding of visual scenes is one of the holy grails of computer vision. Despite the efforts of the community in data collection, there are still few image datasets covering a wide range of scenes and object categories with pixel-wise annotations for scene understanding. In this work, we present the densely annotated dataset ADE20K, which spans diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. In total there are 25k images of complex everyday scenes containing a variety of objects in their natural spatial context. On average there are 19.5 instances and 10.5 object classes per image. Based on ADE20K, we construct benchmarks for scene parsing and instance segmentation. We provide baseline performances on both benchmarks and re-implement state-of-the-art models for open source. We further evaluate the effect of synchronized batch normalization and find that a reasonably large batch size is crucial for semantic segmentation performance. We show that the networks trained on ADE20K are able to segment a wide variety of scenes and objects.¹

Keywords Scene understanding · Semantic segmentation · Instance segmentation · Image dataset · Deep neural networks

B. Zhou
Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong.

H. Zhao, X. Puig, A. Barriuso, A. Torralba
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA.

T. Xiao
School of Electronic Engineering and Computer Science, Peking University, China.

S. Fidler
Department of Computer Science, University of Toronto, Canada.

¹ The dataset is available at http://groups.csail.mit.edu/vision/datasets/ADE20K. Pretrained models and code are released at https://github.com/CSAILVision/semantic-segmentation-pytorch.

1 Introduction

Semantic understanding of visual scenes is one of the holy grails of computer vision. The emergence of large-scale image datasets like ImageNet [29], COCO [18] and Places [38], along with the rapid development of deep convolutional neural network (CNN) approaches, has brought great advances in visual scene understanding. Nowadays, given a visual scene of a living room, a robot equipped with a trained CNN can accurately predict the scene category. However, to freely navigate in the scene and manipulate the objects inside, the robot needs to extract far more information from the input image: it must recognize and localize not only objects like the sofa, table, and TV, but also their parts, e.g., the seat of a chair or the handle of a cup, to allow proper manipulation, and it must segment stuff like the floor, walls and ceiling for spatial navigation.

Recognizing and segmenting objects and stuff at the pixel level remains one of the key problems in scene understanding. Going beyond image-level recognition, pixel-level scene understanding requires a much denser annotation of scenes with a large set of objects. However, current datasets have a limited number of objects (e.g., COCO [18], Pascal [10]), and in many cases those objects are not the most common objects one encounters in the world (like frisbees or baseball bats), or the datasets only cover a limited set of scenes (e.g., Cityscapes [7]). Some notable exceptions are Pascal-Context [22] and the SUN database [34]. However, Pascal-Context still contains scenes primarily focused on 20 object classes, while SUN has noisy labels at the object level.

Fig. 1 Images in the ADE20K dataset are densely annotated in detail with objects and parts. The first row shows sample images, the second row shows the annotation of objects, and the third row shows the annotation of object parts. The color scheme encodes both object categories and object instances: different object categories have large color differences, while different instances of the same object category have small color differences (e.g., different person instances in the first image have slightly different colors).

The motivation of this work is to collect a dataset that has densely annotated images (every pixel has a semantic label) with a large and unrestricted open vocabulary. The images in our dataset are manually segmented in great detail, covering a diverse set of scene, object and object part categories. The challenge in collecting such annotations is finding reliable annotators, as well as the fact that labeling is difficult if the class list is not defined in advance. On the other hand, open-vocabulary naming suffers from naming inconsistencies across different annotators. In contrast, our dataset was annotated by a single expert annotator, providing extremely detailed and exhaustive image annotations. On average, our annotator labeled 29 annotation segments per image, compared to the 16 segments per image labeled by external annotators (like workers from Amazon Mechanical Turk). Furthermore, the data consistency and quality are much higher than those of external annotators. Fig. 1 shows examples from our dataset.

The preliminary result of this work was published in [39]. Compared to the previous conference paper, we include a more detailed description of the dataset, more baseline results on the scene parsing benchmark, the introduction of the new instance segmentation benchmark and its baseline results, as well as the effect of synchronized batch normalization and the joint training of objects and parts. We also include the contents of the Places Challenges we hosted at ECCV'16 and ICCV'17 and the analysis of the challenge results.

The sections of this work are organized as follows. In Sec. 2 we describe the construction of the ADE20K dataset and its statistics. In Sec. 3 we introduce the two pixel-wise scene understanding benchmarks we build upon ADE20K: scene parsing and instance segmentation. We train and evaluate several baseline networks on the benchmarks. We also re-implement and open-source several state-of-the-art scene parsing models and evaluate the effect of the batch normalization size. In Sec. 4 we introduce the Places Challenges at ECCV'16 and ICCV'17 based on the ADE20K benchmarks, as well as the qualitative and quantitative analysis of the challenge results. In Sec. 5 we train a network to jointly segment objects and their parts. Sec. 6 explores the applications of the scene parsing networks to hierarchical semantic segmentation and automatic scene content removal. Sec. 7 concludes this work.

1.1 Related work

Many datasets have been collected for the purpose of semantic understanding of scenes. We review the datasets according to the level of detail of their annotations, then briefly go through previous work on semantic segmentation networks.

Object classification/detection datasets. Most of the large-scale datasets typically only contain labels at the image level or provide bounding boxes. Examples include ImageNet [29], Pascal [10], and KITTI [11]. ImageNet has the largest set of classes, but contains relatively simple scenes. Pascal and KITTI are more challenging and have more objects per image; however, their classes and scenes are more constrained.

Semantic segmentation datasets. Existing datasets with pixel-level labels typically provide annotations only for a subset of foreground objects (20 in PASCAL VOC [10] and 91 in Microsoft COCO [18]). Collecting dense annotations where all pixels are labeled is much more challenging.

Such efforts include Pascal-Context [22], NYU Depth V2 [23], the SUN database [34], the SUN RGB-D dataset [31], the Cityscapes dataset [7], and OpenSurfaces [2, 3]. Recently the COCO stuff dataset [4] provides stuff segmentation complementary to the 80 object categories of the COCO dataset, while the COCO attributes dataset [26] annotates attributes for some objects in COCO. Such progressive enhancement of a dataset with diverse annotations over the years has contributed greatly to the modern development of image datasets.

Datasets with objects, parts and attributes. Two datasets were released that go beyond the typical labeling setup by also providing pixel-level annotation for object parts, i.e., the Pascal-Part dataset [6], or material classes, i.e., OpenSurfaces [2, 3]. We advance this effort by collecting high-resolution imagery of a much wider selection of scenes, containing a large set of object classes per image. We annotated both stuff and object classes, for which we additionally annotated their parts, and parts of these parts. We believe that our dataset, ADE20K, is one of the most comprehensive datasets of its kind. We provide a comparison between datasets in Sec. 2.6.

Semantic segmentation models. With the success of convolutional neural networks (CNNs) for image classification [17], there is growing interest in semantic pixel-wise labeling using CNNs with dense output, such as the fully convolutional network [20], deconvolutional neural networks [25], the encoder-decoder SegNet [1], multi-task network cascades [9], and DilatedVGG [5, 36]. They are benchmarked on the Pascal dataset with impressive performance on segmenting the 20 object classes. Some of them [20, 1] are evaluated on Pascal-Context [22] or the SUN RGB-D dataset [31] to show their capability to segment more object classes in scenes. Joint stuff and object segmentation is explored in [8], which uses pre-computed superpixels and feature masking to represent stuff. A cascade of instance segmentation and categorization has been explored in [9]. A multi-scale pyramid pooling module was proposed to improve scene parsing in [37]. More recently, the multi-task segmentation network UPerNet was proposed to segment visual concepts from different levels [35].

2 ADE20K: Fully Annotated Image Dataset

In this section, we describe the construction of our ADE20K dataset and analyze its statistics.

2.1 Image annotation

For our dataset, we are interested in having a diverse set of scenes with dense annotations of all the visual concepts present. A visual concept can be 1) a discrete object, which is a thing with a well-defined shape, e.g., car, person; 2) stuff, which covers amorphous background regions, e.g., grass, sky; or 3) an object part, which is a component of an existing object instance with some functional meaning, such as a head or a leg. Images come from the LabelMe [30], SUN [34], and Places [38] datasets and were selected to cover the 900 scene categories defined in the SUN database. Images were annotated by a single expert worker using the LabelMe interface [30]. Fig. 2 shows a snapshot of the annotation interface and one fully segmented image. The worker provided three types of annotations: object segments with names, object parts, and attributes. All object instances are segmented independently so that the dataset can be used to train and evaluate detection or segmentation algorithms.

Given that the objects appearing in the dataset are fully annotated, even in regions where they are occluded, there are multiple areas where the polygons from different regions overlap. In order to convert the annotated polygons into a segmentation mask, we sort the objects in an image by depth layers. Background classes like 'sky' or 'wall' are set as the farthest layers. The depths of the remaining objects are set as follows: when a polygon is fully contained inside another polygon, the object of the inner polygon is given a closer depth layer; when objects only partially overlap, we look at the region of intersection between the two polygons and set as the closest object the one whose polygon has more points in the region of intersection. Once objects have been sorted, the segmentation mask is constructed by iterating over the objects in decreasing depth, ensuring that object parts never occlude whole objects and no object is occluded by its parts. A simplified sketch of this procedure is given at the end of this subsection.

Datasets such as COCO [18], Pascal [10] or Cityscapes [7] start by defining a set of object categories of interest. However, when labeling all the objects in a scene, working with a predefined list of objects is not possible, as new categories appear frequently (see Fig. 6.d). Here, the annotator created a dictionary of visual concepts to which new classes were added constantly to ensure consistency in object naming.

Object parts are associated with object instances. Note that parts can have parts too, and we label these associations as well. For example, the 'rim' is a part of a 'wheel', which in turn is part of a 'car'. A 'knob' is a part of a 'door' that can be part of a 'cabinet'. The part hierarchy in Fig. 3 has a depth of 3.
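The following is a minimal sketch of the polygon-to-mask conversion described above, painting segments from the farthest depth layer to the closest. For brevity, the depth ordering here uses polygon area as a proxy for the containment/intersection rule (larger regions are painted first, so contained, closer objects overwrite their containers); the segment dictionary format, the background list and the helper names are illustrative, not the actual annotation tooling.

```python
import numpy as np
from matplotlib.path import Path

BACKGROUND = {"sky", "wall"}  # treated as the farthest layers

def polygon_mask(points, height, width):
    """Rasterize one polygon (a list of (x, y) vertices) into a boolean mask."""
    yy, xx = np.mgrid[0:height, 0:width]
    pixels = np.stack([xx.ravel() + 0.5, yy.ravel() + 0.5], axis=1)
    return Path(points).contains_points(pixels).reshape(height, width)

def polygons_to_mask(segments, height, width):
    """segments: list of dicts with keys 'name', 'class_id', 'points'."""
    masks = [polygon_mask(s["points"], height, width) for s in segments]

    def depth_key(i):
        # Background first (farthest), then larger polygons before smaller
        # ones, so contained (closer) objects overwrite their containers.
        is_foreground = segments[i]["name"] not in BACKGROUND
        return (is_foreground, -masks[i].sum())

    label_map = np.zeros((height, width), dtype=np.int32)  # 0 = unlabeled
    for i in sorted(range(len(segments)), key=depth_key):
        label_map[masks[i]] = segments[i]["class_id"]
    return label_map
```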

Fig. 2 The annotation interface, showing the list of objects and their associated parts in the image.
Fig. 3 Section of the relation tree of objects and parts for the dataset. Each number indicates the number of instances for each object. The full
relation tree is available at the dataset webpage.

2.2 Dataset summary

After annotation, there are 20,210 images in the training set, 2,000 images in the validation set, and 3,000 images in the testing set. In total 3,169 class labels are annotated, among which 2,693 are object and stuff classes and 476 are object part classes. All the images are exhaustively annotated with objects, and many objects are also annotated with their parts. For each object there is additional information about whether it is occluded or cropped, along with other attributes. The images in the validation set are exhaustively annotated with parts, while the part annotations are not exhaustive over the images in the training set. Sample images and annotations from the ADE20K dataset are shown in Fig. 1.

2.3 Annotation consistency

Defining a labeling protocol is relatively easy when the labeling task is restricted to a fixed list of object classes; it becomes challenging when the class list is open-ended. As the goal is to label all the objects within each image, the list of classes grows unbounded. Many object classes appear only a few times across the entire collection of images. However, those rare object classes cannot be ignored, as they might be important elements for the interpretation of the scene. Labeling in these conditions becomes difficult because we need to keep a growing list of all the object classes in order to have consistent naming across the entire dataset. Despite the best effort of the annotator, the process is not free from noise.


Fig. 4 Analysis of annotation consistency. Each column shows an image and two segmentations done by the same annotator at different times.
Bottom row shows the pixel discrepancy when the two segmentations are subtracted, while the number at the bottom shows the percentage of
pixels with the same label. On average across all re-annotated images, 82.4% of pixels got the same label. In the example in the first column the
percentage of pixels with the same label is relatively low because the annotator labeled the same region as ‘snow’ and ‘ground’ during the two
rounds of annotation. In the third column, there were many objects in the scene and the annotator missed some between the two segmentations.

To analyze the annotation consistency we took a subset of 61 randomly chosen images from the validation set and asked our annotator to annotate them again, with a time difference of six months between the two rounds. One expects some differences between the two annotations. A few examples are shown in Fig. 4. On average, 82.4% of the pixels got the same label. The remaining 17.6% of pixels had errors, which we grouped into three types as follows:

– Segmentation quality: variations in the quality of segmentation and outlining of the object boundary. One typical source of error arises when segmenting complex objects such as buildings and trees, which can be segmented with different degrees of precision. This type of error emerges in 5.7% of the pixels.
– Object naming: differences in object naming due to ambiguity or similarity between concepts, for instance calling a big car a 'car' in one segmentation and a 'truck' in the other, or a 'palm tree' a 'tree'. This naming issue emerges in 6.0% of the pixels. These errors can be reduced by defining a very precise terminology, but this becomes much harder with a large, growing vocabulary.
– Segmentation quantity: missing objects in one of the two segmentations. There is a very large number of objects in each image and some images might be annotated more thoroughly than others. For example, in the third column of Fig. 4 the annotator missed some small objects in different annotations. Missing labels account for 5.9% of the error pixels. A similar issue existed in segmentation datasets such as the Berkeley Image Segmentation dataset [21].

The median error values for the three error types are 4.8%, 0.3% and 2.6%, showing that the mean values are dominated by a few images and that the most common type of error is segmentation quality.

To further compare the annotation done by our single expert annotator with AMT-like annotators, 20 images from the validation set were annotated by two invited external annotators, both with prior experience in image labeling. The first external annotator had 58.5% of pixels inconsistent with the segmentation provided by our annotator, and the second external annotator had 75% inconsistent pixels. Many of these inconsistencies are due to the poor quality of the segmentations provided by the external annotators (as has been observed with AMT, which requires multiple verification steps for quality control [18]). For the best external annotator (the first one), 7.9% of pixels have inconsistent segmentations (just slightly worse than our annotator), 14.9% have inconsistent object naming, and 35.8% of the pixels correspond to missing objects, which is due to the much smaller number of objects annotated by the external annotator in comparison with our expert annotator. The external annotators labeled on average 16 segments per image, while our annotator provided 29 segments per image.
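As an illustration of the consistency measure used above, the following sketch computes the fraction of pixels that receive the same label in two annotation rounds of the same image, together with a rough breakdown of the disagreeing pixels. The inputs are integer label maps of equal shape; treating 0 as "unlabeled" is an illustrative convention, not the dataset's actual encoding.

```python
import numpy as np

def pixel_agreement(seg_a, seg_b):
    """Fraction of pixels with identical labels in the two annotation rounds."""
    assert seg_a.shape == seg_b.shape
    return float((seg_a == seg_b).mean())

def disagreement_breakdown(seg_a, seg_b):
    """Split disagreeing pixels into 'labeled in only one round' vs 'labeled differently'."""
    differ = seg_a != seg_b
    missing = differ & ((seg_a == 0) | (seg_b == 0))   # object missing in one round
    renamed = differ & ~missing                        # boundary or naming differences
    n = seg_a.size
    return {"same": 1.0 - differ.sum() / n,
            "missing_in_one_round": missing.sum() / n,
            "different_label": renamed.sum() / n}
```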

2.4 Dataset statistics

Fig. 5.a shows the distribution of ranked object frequencies. The distribution is similar to a Zipf's law and is typically found when objects are exhaustively annotated in images [32, 34]. It differs from datasets such as COCO or ImageNet, where the distribution is more uniform as a result of manual balancing.

Fig. 5.b shows the distributions of annotated parts grouped by the objects to which they belong and sorted by frequency within each object class. Most object classes also have a non-uniform distribution of part counts. Fig. 5.c and Fig. 5.d show how objects are shared across scenes and how parts are shared by objects. Fig. 5.e shows the variability in the appearances of the part 'door'.

The mode of the object segmentations is shown in Fig. 6.a and contains the four objects (from top to bottom): 'sky', 'wall', 'building' and 'floor'. When simply using the mode to segment the images, it gets, on average, 20.9% of the pixels of each image right. Fig. 6.b shows the distribution of images according to the number of distinct classes and instances. On average there are 19.5 instances and 10.5 object classes per image, larger than in other existing datasets (see Table 1). Fig. 6.c shows the distribution of parts.

As the list of object classes is not predefined, new classes appear over the course of annotation. Fig. 6.d shows the number of object (and part) classes as the number of annotated instances increases. Fig. 6.e shows the probability that instance n + 1 belongs to a new class after labeling n instances. The more segments we have, the smaller the probability that we will see a new class. At the current state of the dataset, we get one new object class every 300 segmented instances.

2.5 Object-part relationships

We analyze the relationships between the objects and object parts annotated in ADE20K. In the dataset, 76% of the object instances have associated object parts, with an average of 3 parts per object. The class with the most parts is building, with 79 different parts. On average, 10% of the pixels correspond to object parts. A subset of the relation tree between objects and parts can be seen in Fig. 3.

The information about objects and their parts provides interesting insights. For instance, we can measure in what proportion one object is part of another to reason about how strongly tied they are. For the object tree, the most common parts are trunk or branch, whereas the least common are fruit, flower or leaves.

The object-part relationships can also be used to measure similarities among objects and parts, providing information about objects that tend to appear together or share similar affordances. We measure the similarity between two parts by the common objects each one is part of. The most similar part to knob is handle, sharing objects such as drawer, door or desk. Objects can similarly be compared by the parts they have in common. As such, chair's most similar objects are armchair, sofa or stool, sharing parts such as rail, leg or seat base.

2.6 Comparison with other datasets

We compare ADE20K with existing datasets in Table 1. Compared to the largest annotated datasets, COCO [18] and ImageNet [29], our dataset comprises much more diverse scenes, where the average number of object classes per image is 3 and 6 times larger, respectively. With respect to SUN [34], ADE20K is roughly 35% larger in terms of images and object instances. However, the annotations in our dataset are much richer, since they also include segmentation at the part level. Such annotation is only available for the Pascal-Context/Part dataset [22, 6], which contains 40 distinct part classes across 20 object classes. Note that we merged some of their part classes to be consistent with our labeling (e.g., we mark both left leg and right leg as the same semantic part leg). Since our dataset contains part annotations for a much wider set of object classes, the number of part classes is almost 9 times larger in our dataset.

An interesting fact is that any image in ADE20K contains at least 5 objects, and the maximum number of object instances per image reaches 273, or 419 instances when counting parts as well. This shows the high annotation complexity of our dataset.

3 Pixel-wise Scene Understanding Benchmarks

Based on the data of ADE20K, we construct two benchmarks for pixel-wise scene understanding: scene parsing and instance segmentation.

– Scene parsing. Scene parsing is to segment the whole image densely into semantic classes, where each pixel is assigned a class label such as the region of tree or the region of building.
– Instance segmentation. Instance segmentation is to detect the object instances inside an image and further generate precise segmentation masks of the objects. Its difference from scene parsing is that scene parsing has no instance concept for the segmented regions, whereas in instance segmentation, if there are three persons in the scene, the network is required to segment each person region separately.

We introduce the details of each task and the baseline models we train below.
Fig. 5 a) Object classes sorted by frequency. Only the top 270 classes with more than 100 annotated instances are shown; 68 classes have more than a thousand segmented instances. b) Frequency of parts grouped by objects. There are more than 200 object classes with annotated parts; only objects with 5 or more parts are shown in this plot (we show at most 7 parts for each object class). c) Objects ranked by the number of scenes they are part of. d) Object parts ranked by the number of objects they are part of. e) Examples of objects with doors. The bottom-right image is an example where the door does not behave as a part.

Fig. 6 a) Mode of the object segmentations, containing 'sky', 'wall', 'building' and 'floor'. b) Histogram of the number of segmented object instances and classes per image. c) Histogram of the number of segmented part instances and classes per object. d) Number of classes as a function of segmented instances (objects and parts). The squares represent the current state of the dataset. e) Probability of seeing a new object (or part) class as a function of the number of instances.

Table 1 Comparison with existing semantic segmentation datasets.

Images Obj. inst. Obj. classes Part inst. Part classes Obj. classes per image
COCO 123,287 886,284 91 0 0 3.5
ImageNet∗ 476,688 534,309 200 0 0 1.7
NYU Depth V2 1,449 34,064 894 0 0 14.1
Cityscapes 25,000 65,385 30 0 0 12.2
SUN 16,873 313,884 4,479 0 0 9.8
OpenSurfaces 22,214 71,460 160 0 0 N/A
PascalContext 10,103 ∼104,398∗∗ 540 181,770 40 5.1
ADE20K 22,210 434,826 2,693 175,961 476 9.9

∗ ImageNet has only bounding boxes (no pixel-level segmentation) and sparse annotations.
∗∗ The PascalContext dataset does not have instance segmentation. In order to estimate the number of instances, we find connected components (of at least 150 pixels) for each class label.

3.1 Scene parsing benchmark

We select the top 150 categories ranked by their total pixel ratios² in the ADE20K dataset and build a scene parsing benchmark of ADE20K, termed SceneParse150. Among the 150 categories, there are 35 stuff classes (e.g., wall, sky, road) and 115 discrete object classes (e.g., car, person, table). The annotated pixels of the 150 classes occupy 92.75% of all the pixels in the dataset; the stuff classes occupy 60.92% and the discrete object classes occupy 31.83%.

² As the original images in the ADE20K dataset have various sizes, for simplicity we rescale the large images so that their minimum height or width is 512 in the SceneParse150 benchmark.

We map each object name to a WordNet synset, then build a WordNet tree through the hypernym relations of the 150 categories, shown in Fig. 7. We can see that these objects form several semantic clusters in the tree, such as the furniture synset node containing cabinet, desk, pool table, and bench; the conveyance node containing car, truck, boat, and bus; and the living thing node containing shrub, grass, flower, and person. Thus, the structured object annotation given in the dataset bridges the image annotations to a wider knowledge base.

As baseline networks for scene parsing on our benchmark, we train several semantic segmentation networks: SegNet [1], FCN-8s [20], DilatedVGG and DilatedResNet [5, 36], and the two cascade networks proposed in [39], whose backbone models are SegNet and DilatedVGG. We train these models on NVIDIA Titan X GPUs.

Results are reported in four metrics commonly used for semantic segmentation [20] (a small sketch of how they can be computed is given after this list):

– Pixel accuracy indicates the proportion of correctly classified pixels;
– Mean accuracy indicates the proportion of correctly classified pixels averaged over all the classes;
– Mean IoU indicates the intersection-over-union between the predicted and ground-truth pixels, averaged over all the classes;
– Weighted IoU indicates the IoU weighted by the total pixel ratio of each class.
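The following is a minimal sketch of these four metrics computed from a confusion matrix accumulated over the dataset (rows correspond to ground truth, columns to predictions). It follows the standard definitions in [20]; the function names are illustrative and not the benchmark toolbox's API.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Accumulate a num_classes x num_classes confusion matrix from label maps."""
    valid = (gt >= 0) & (gt < num_classes)          # ignore unlabeled pixels
    idx = num_classes * gt[valid].astype(int) + pred[valid].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def scene_parsing_metrics(conf):
    tp = np.diag(conf).astype(float)                # correctly classified pixels per class
    gt_per_class = conf.sum(axis=1)                 # ground-truth pixels of each class
    pred_per_class = conf.sum(axis=0)               # pixels predicted as each class
    iou = tp / np.maximum(gt_per_class + pred_per_class - tp, 1)
    freq = gt_per_class / conf.sum()                # total pixel ratio of each class
    return {
        "pixel_acc": tp.sum() / conf.sum(),
        "mean_acc": float(np.mean(tp / np.maximum(gt_per_class, 1))),
        "mean_iou": float(iou.mean()),
        "weighted_iou": float((freq * iou).sum()),
    }
```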

Since some classes like wall and floor occupy far more pixels of the images, pixel accuracy is biased towards those few large classes. Mean IoU, in contrast, reflects how accurately the model classifies each individual class in the benchmark. The scene parsing data and the development toolbox are released on the Scene Parsing Benchmark website³.

³ http://sceneparsing.csail.mit.edu

Table 2 Baseline performance on the validation set of SceneParse150.
Networks            Pixel Acc.  Mean Acc.  Mean IoU  Weighted IoU
FCN-8s              71.32%      40.32%     0.2939    0.5733
SegNet              71.00%      31.14%     0.2164    0.5384
DilatedVGG          73.55%      44.59%     0.3231    0.6014
DilatedResNet-34    76.47%      45.84%     0.3277    0.6068
DilatedResNet-50    76.40%      45.93%     0.3385    0.6100
Cascade-SegNet      71.83%      37.90%     0.2751    0.5805
Cascade-DilatedVGG  74.52%      45.38%     0.3490    0.6108

The segmentation performance of the baseline networks on SceneParse150 is listed in Table 2. Among the baselines, the networks based on dilated convolutions achieve better results in general than FCN and SegNet. Using the cascade framework, the performance further improves: in terms of mean IoU, Cascade-SegNet and Cascade-DilatedVGG outperform SegNet and DilatedVGG by 6% and 2.5%, respectively.

Qualitative scene parsing results on the validation set are shown in Fig. 8. We observe that all the baseline networks give correct predictions for the common, large object and stuff classes; the difference in performance comes mostly from small, infrequent objects and how well they handle details. We further plot the per-category IoU of the baseline DilatedResNet-50 model on all 150 categories in Fig. 9. The best segmented categories are stuff such as sky, building and road, while the worst segmented categories are objects that are usually small and have few pixels, such as blanket, tray and glass.
Fig. 7 Wordnet tree constructed from the 150 objects in the SceneParse150 benchmark. Clusters inside the wordnet tree represent various hierar-
chical semantic relations among objects.
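As an illustration of the synset mapping behind Fig. 7, the sketch below links an object name to the WordNet hierarchy through its hypernym relations using NLTK. Taking the first noun synset for a name is a simplification for illustration; it is not necessarily the exact sense mapping used for the benchmark.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def hypernym_chain(object_name):
    """Return one path from the WordNet root down to the object's synset."""
    synsets = wn.synsets(object_name.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return []
    path = synsets[0].hypernym_paths()[0]   # ordered root -> ... -> leaf
    return [s.name() for s in path]

# e.g. hypernym_chain("car") ->
# ['entity.n.01', 'physical_entity.n.01', ..., 'motor_vehicle.n.01', 'car.n.01']
```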

3.2 Open-sourcing state-of-the-art scene parsing models

Since the introduction of SceneParse150 in 2016, it has become a standard benchmark for evaluating new semantic segmentation models. However, the state-of-the-art models are implemented in different libraries (Caffe, PyTorch, TensorFlow), and the training code of some models has not been released, which makes it hard to reproduce the results reported in the original papers. To benefit the research community, we re-implement several state-of-the-art models in PyTorch and open-source them⁴. In particular, we implement (1) the plain dilated segmentation network, which uses dilated convolutions [36]; (2) PSPNet, proposed in [37], which introduces a Pyramid Pooling Module (PPM) to aggregate multi-scale contextual information in the scene; and (3) UPerNet, proposed in [35], which adopts a Feature Pyramid Network (FPN)-like architecture [19] to incorporate multi-scale context more efficiently. Table 3 shows results on the validation set of SceneParse150. Compared to the plain DilatedResNet, the PPM and UPerNet architectures improve mean IoU by 3-7% and pixel accuracy by 1-2%. This superior performance shows the importance of context in the scene parsing task.

⁴ Re-implementations of the state-of-the-art models are released at https://github.com/CSAILVision/semantic-segmentation-pytorch

Table 3 Re-implementation of state-of-the-art models on the validation set of SceneParse150. PPM refers to the Pyramid Pooling Module.
Networks                       Pixel Acc.  Mean IoU
DilatedResNet-18               77.41%      0.3534
DilatedResNet-50               77.53%      0.3549
DilatedResNet-18 + PPM [37]    78.64%      0.3800
DilatedResNet-50 + PPM [37]    80.23%      0.4204
DilatedResNet-101 + PPM [37]   80.91%      0.4253
UPerNet-50 [35]                80.23%      0.4155
UPerNet-101 [35]               81.01%      0.4266

3.3 Effect of batch normalization for scene parsing

An overwhelming majority of semantic segmentation models are fine-tuned from a network trained on ImageNet [29], as are most object detection models [28, 19, 13]. There has been work [27] exploring the effect of the batch size used for batch normalization (BN) [15]. The authors discovered that, if a network is trained with BN, only a sufficiently large BN batch size allows the network to achieve state-of-the-art performance. We conduct controlled experiments on ADE20K to explore this issue for semantic segmentation. Our experiments show that a reasonably large batch size is essential for matching the highest scores of the state-of-the-art models, while a small batch size such as 2 in Table 4 lowers the score of the model significantly, by 5%. Thus training with a single GPU with limited RAM, or with multiple GPUs under unsynchronized BN, is unable to reproduce the best reported numbers. The likely reason is that the BN statistics, i.e., the mean and standard deviation of the activations, may not be accurate when the batch size is insufficient.

Table 4 Comparison of models trained with various batch normalization settings. The framework used is a Dilated ResNet-50 with Pyramid Pooling Module.
BN Status       Batch Size  BN Size  Pixel Acc.  Mean IoU
Synchronized    16          16       79.73%      0.4126
Synchronized    8           8        80.05%      0.4158
Synchronized    4           4        79.71%      0.4119
Synchronized    2           2        75.26%      0.3355
Unsynchronized  16          2        75.28%      0.3403
Frozen          16          N/A      78.32%      0.3809
Frozen          8           N/A      78.29%      0.3793
Frozen          4           N/A      78.34%      0.3833
Frozen          2           N/A      78.81%      0.3856


Fig. 8 Ground truths and scene parsing results given by the baseline networks. All networks give correct predictions for the common, large object and stuff classes; the differences in performance come mostly from small, infrequent objects and how well the networks handle details.

Fig. 9 Plot of scene parsing performance (IoU) on the 150 categories achieved by DilatedResNet-50 model. The best segmented categories are
stuff, and the worst segmented categories are objects that are usually small and have few pixels.

Our baseline framework is PSPNet with a dilated ResNet-50 as the backbone. Besides the BN layers in the ResNet, BN is also used in the PPM. The baseline framework is trained with 8 GPUs and 2 images on each GPU. We adopt synchronized BN for the baseline network, i.e., the BN size equals the batch size. Besides the synchronized BN setting, we also have an unsynchronized BN setting and a frozen BN setting: in the former, the BN size is the number of images on each GPU; in the latter, the BN layers are frozen in the backbone network and removed from the PPM. The training iterations and learning rate are set to 100k and 0.02 for the baseline, respectively. For networks trained under the frozen BN setting, the learning rate for the network with batch size 16 is set to 0.004 to prevent gradient explosion; for networks with batch size smaller than 16, we linearly decrease the learning rate and increase the training iterations according to previous work [12]. Different from Table 3, the results are obtained without multi-scale testing.

We report the results in Table 4. In general, we empirically find that using BN layers with a sufficient BN size leads to better performance. The model with batch size and BN size of 16 (line 2) outperforms the one with batch size 16 and frozen BN (line 7) by 1.41% and 3.17% in terms of Pixel Acc. and Mean IoU, respectively. We witness negligible changes in performance when the batch (and BN) size varies in the range from 4 to 16 under the synchronized BN setting (lines 2-4). However, when the BN size drops to 2, the performance degrades significantly (line 5). Thus a BN size of 4 is the inflection point in our experiments. This finding differs from the finding for object detection [27], in which the inflection point is at a BN size of 16. We conjecture that this is because images for semantic segmentation are densely annotated, unlike the bounding-box annotations used for object detection, so it is easier for semantic segmentation networks to obtain accurate BN statistics with fewer images.

When we experiment with the unsynchronized BN setting, i.e., we increase the batch size but do not change the BN size (line 6), the model yields an almost identical result compared with the one with the same BN size but a smaller batch size (line 5). Also, when we freeze the BN layers during fine-tuning, the models are not sensitive to the batch size (lines 7-10). These two sets of experiments indicate that, for semantic segmentation models, it is the BN size that matters rather than the batch size. We do note, however, that a smaller batch size leads to longer training time, because we need to increase the number of training iterations for models with small batch sizes.

3.4 Instance segmentation

To benchmark the performance of instance segmentation, we select 100 foreground object categories from the full dataset and term this benchmark InstSeg100. The number of instances per object category in InstSeg100 is plotted in Fig. 10. The total number of object instances is 218K; on average there are 2.2K instances per object category and 10 instances per image, and all the objects except ship have more than 100 instances.

We use Mask R-CNN [13] models as baselines for InstSeg100. The models use FPN-50 as the backbone network, initialized from ImageNet; other hyper-parameters strictly follow those used in [13]. Two variants are presented, one with single-scale training and the other with multi-scale training; their performance on the validation set is shown in Table 5. We report an overall metric, mean Average Precision (mAP), along with metrics for different object scales, denoted mAPS (objects smaller than 32 × 32 pixels), mAPM (between 32 × 32 and 96 × 96 pixels) and mAPL (larger than 96 × 96 pixels). The numbers suggest that (1) multi-scale training greatly improves the average performance (∼0.04 in mAP); (2) instance segmentation of small objects on our dataset is extremely challenging: with multi-scale training, small objects do not improve as much (∼0.02) as large objects (∼0.07). Qualitative results of the Mask R-CNN model are presented in Fig. 11. We can see that it is a strong baseline, giving correct detections and accurate object boundaries. A typical error mode is segmenting object reflections in mirrors, as shown in the bottom-right example.

Table 5 Baseline performance on the validation set of InstSeg100.
Networks                 mAPS   mAPM   mAPL   mAP
Mask R-CNN single-scale  .0542  .1737  .2883  .1832
Mask R-CNN multi-scale   .0733  .2256  .3584  .2241

Table 6 Scene parsing performance before and after fusing the outputs of the instance segmentation model Mask R-CNN.
Networks                      Pixel Acc. (before / after)  Mean IoU (before / after)
DilatedResNet-50 + PPM [37]   80.23% / 80.21%              0.4204 / 0.4256
DilatedResNet-101 + PPM [37]  80.91% / 80.91%              0.4253 / 0.4290

3.5 How does scene parsing performance improve with instance information?

In the previous sections, we trained and tested the semantic and instance segmentation tasks separately. Given that instance segmentation is trained with additional instance information compared to scene parsing, we further analyze how instance information can assist scene parsing.
Fig. 10 Instance number per object category in the instance segmentation benchmark. All the objects except ship have more than 100 instances.

Fig. 11 Images, ground truths, and instance segmentation results given by the multi-scale Mask R-CNN model.

Fig. 12 Scene Parsing Track results, ranked by pixel accuracy and mean IoU.

Instead of re-modeling, we study this problem by fusing the results from our trained state-of-the-art models: PSPNet for scene parsing and Mask R-CNN for instance segmentation. Concretely, we first take the Mask R-CNN outputs and threshold the predicted instances by confidence (≥ 0.95); we then overlay the instance masks onto the PSPNet predictions; if one pixel belongs to multiple instances, it takes the semantic label with the highest confidence. Note that instance segmentation only covers the 100 foreground object categories as opposed to 150 categories, so stuff predictions come from the scene parsing model. Quantitative results are shown in Table 6: overall the fusion improves scene parsing performance; pixel accuracy stays around the same, and mean IoU improves by around 0.4-0.5%. This experiment demonstrates that instance-level information is useful for the non-instance-aware scene parsing task. A sketch of the fusion step is given below.
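The following is a minimal sketch of the fusion described above, pasting high-confidence Mask R-CNN instances over the PSPNet prediction. The 0.95 threshold follows the text; the array layout and variable names are illustrative.

```python
import numpy as np

def fuse_predictions(semantic_pred, instances, conf_thresh=0.95):
    """
    semantic_pred: (H, W) int array of 150-class scene parsing labels.
    instances: list of dicts with 'mask' ((H, W) bool), 'class_id' (a foreground
               category mapped into the 150-class label space) and 'score'.
    """
    fused = semantic_pred.copy()
    # Track the best instance score seen at each pixel so that, when masks
    # overlap, the label with the highest confidence wins.
    best_score = np.zeros(semantic_pred.shape, dtype=np.float32)
    for inst in instances:
        if inst["score"] < conf_thresh:
            continue
        winner = inst["mask"] & (inst["score"] > best_score)
        fused[winner] = inst["class_id"]
        best_score[winner] = inst["score"]
    return fused
```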
4 Places Challenges

In order to foster new models for pixel-wise scene understanding, we organized the Places Challenge in 2016 and 2017, including a scene parsing track and an instance segmentation track.

4.1 Scene Parsing Track

Scene parsing submissions were ranked based on the average of the mean IoU and the pixel-wise accuracy on the benchmark test set.

The Scene Parsing Track received 75 submissions from 22 teams in 2016 and 27 submissions from 11 teams in 2017. The top performing teams for both years are shown in Table 7 and Table 8. The winning team in 2016, which proposed PSPNet [37], still holds the highest score. Fig. 14 shows some qualitative results from the top performing models of each year.

Table 7 Top performing models in Scene Parsing for Places Challenge 2016.
Team                      Pixel Acc.  Mean IoU  Score
SenseCUSceneParsing [37]  74.73%      .3968     .5720
Adelaide [33]             74.49%      .3898     .5673
360+MCG-ICT-CAS_SP        73.67%      .3746     .5556

Table 8 Top performing models in Scene Parsing for Places Challenge 2017.
Team            Pixel Acc.  Mean IoU  Score
CASIA_IVA_JD    73.40%      .3754     .5547
WinterIsComing  73.46%      .3741     .5543
Xiaodan Liang   72.22%      .3672     .5447

In Fig. 13 we compare the top models against the proposed baselines and human performance (approximately measured as the annotation consistency in Sec. 2.3), which can be taken as an upper bound. As an interesting comparison, if we use the image mode generated in Fig. 6 as the prediction on the testing set, it achieves 20.30% pixel accuracy, which can be taken as a lower bound for all the models.

Fig. 13 Top scene parsing models compared with human performance and baselines in terms of pixel accuracy. Scene parsing based on the image mode has a 20.30% pixel accuracy.

Some error cases are shown in Fig. 15. We can see that models usually fail on concepts in images that have occlusions or require high-level context reasoning. For example, the boat in the first image is not shown from a typical viewpoint, so the models fail; in the last image, the muddy car is missed by all the top performing networks because of its muddy camouflage.

4.2 Instance Segmentation Track

For instance segmentation, we used the mean Average Precision (mAP), following COCO's evaluation metrics.

The Instance Segmentation Track, introduced in Places Challenge 2017, received 12 submissions from 5 teams. Two teams beat the strong Mask R-CNN baseline by a good margin; their best model performances are shown in Table 9 together with the Mask R-CNN baseline trained by ourselves. The performances for small, medium and large objects are also reported, following Sec. 3.4. Fig. 16 shows qualitative results from the teams' best models.

Table 9 Top performing models in Instance Segmentation for Places Challenge 2017.
Team                  mAPS   mAPM   mAPL   mAP
Megvii (Face++) [27]  .1386  .3015  .4119  .2977
G-RMI                 .0980  .2523  .3858  .2415
Baseline Mask R-CNN   .0733  .2256  .3584  .2241


Fig. 14 Scene Parsing results given by top methods for Places Challenge 2016 and 2017.


Fig. 15 Ground truths and predictions given by top methods for scene parsing. The mistaken regions are labeled. We can see that models make mistakes on objects in non-canonical views, such as the boat in the first example, and on objects that require high-level reasoning, such as the muddy car in the last example.


Fig. 16 Instance Segmentation results given by top methods for Places Challenge 2017.

The Megvii (Face++) submission seems to have a particular advantage over G-RMI on small objects, probably due to its use of contextual information: its mAP on small objects shows a relative improvement over G-RMI of 41%, compared to 19% and 6% for medium and large objects.
This effect can be seen qualitatively in Fig. 16. While both methods perform similarly well in finding large object classes such as people or tables, Megvii (Face++) is able to detect small paintings (rows 1 and 3) or lights (row 5) that occupy small regions.

4.3 Take-aways from the Challenge

Looking at the challenge results, there are several peculiarities that make ADE20K challenging for instance segmentation. First, ADE20K contains plenty of small objects. It is hard for most instance segmentation frameworks to distinguish small objects from the background, and even harder to recognize and classify them into the correct categories. Second, ADE20K is highly diverse in terms of scenes and objects, requiring models with strong capacity to perform well across various scenes. Third, scenes in ADE20K are generally crowded. Inter-class and intra-class occlusions create problems for object detection as well as instance segmentation. This can be seen in Fig. 16, where the models struggle to detect some of the boxes in the cluttered areas (row 2, left) or the counter in row 4, which is covered by multiple people.
To gain further insight, we invited the leading author of the winning entry of the instance segmentation track in the Places Challenge to summarize their method, as follows.
Following a top-down instance segmentation framework, [27] starts with a module that generates object proposals and then classifies each pixel within each proposal. However, unlike the RoI Align used in Mask R-CNN [13], they use Precise RoI Pooling [16] to extract the features of each proposal. Precise RoI Pooling avoids the sampling of pivot points used in RoI Align by regarding the discrete feature map as a continuously interpolated feature map and directly computing a double integral over each bin. The better feature alignment brings a good improvement for object detection and an even higher gain for instance segmentation. To improve the recognition of small objects, they make use of contextual information by combining, for each proposal, the features of the previous and following layers. Given that top-down instance segmentation relies heavily on object detection, the model also ensembles multiple sets of object bounding boxes before feeding them into a mask generator.
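To make the pooling idea concrete, the sketch below is a numerical illustration written by us, not the implementation of [16]: it treats the discrete feature map as a continuous function through bilinear interpolation and approximates the bin integral by dense sampling, whereas Precise RoI Pooling evaluates that integral in closed form.

import numpy as np

def bilinear(feat, y, x):
    # Evaluate the bilinearly interpolated feature map at continuous coordinates (y, x).
    h, w = feat.shape
    y = np.clip(y, 0, h - 1); x = np.clip(x, 0, w - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def prroi_pool_bin(feat, y_start, y_end, x_start, x_end, samples=32):
    # Average the interpolated map over the bin: a dense-sampling approximation of
    # (1 / area) * the double integral of feat over [y_start, y_end] x [x_start, x_end].
    ys = np.linspace(y_start, y_end, samples)
    xs = np.linspace(x_start, x_end, samples)
    return np.mean([bilinear(feat, y, x) for y in ys for x in xs])

Because the pooled value depends on every feature value under the bin rather than on a few sampled points, gradients flow to the whole region, which is the property the closed-form integral exploits.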
We also find that the models cannot avoid predicting objects that appear in mirrors, which indicates that current models are still incapable of high-level reasoning in parallel with low-level visual cues.

5 Object-Part Joint Segmentation

Since ADE20K contains part annotations for various object classes, we further train a network to jointly segment objects and parts. Among the 150 object classes, 59 contain parts; some examples can be found in Fig. 3. In total, 153 part classes are included. We use UPerNet [35] to jointly train object and part segmentation. During training, we include a non-part class and calculate the softmax loss only within the set of part classes of the ground-truth object class, as sketched below.
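The following is a minimal sketch of this masked part loss. It uses our own conventions (a hypothetical dictionary parts_of_object mapping each object class to the indices of its valid parts, with index 0 reserved for the non-part class) and is not the exact UPerNet training code.

import torch
import torch.nn.functional as F

def part_loss(part_logits, part_gt, obj_gt, parts_of_object):
    # part_logits: (N, P+1, H, W) scores for P part classes plus a non-part class (index 0).
    # part_gt:     (N, H, W) ground-truth part labels (0 = non-part).
    # obj_gt:      (N, H, W) ground-truth object labels.
    # parts_of_object: dict mapping an object class to the list of its valid part indices.
    losses = []
    for obj_cls, part_ids in parts_of_object.items():
        pix = (obj_gt == obj_cls)
        if pix.sum() == 0:
            continue
        valid = torch.tensor([0] + list(part_ids), device=part_logits.device)
        # Restrict the logits of the selected pixels to the parts of this object class.
        logits = part_logits.permute(0, 2, 3, 1)[pix][:, valid]
        # Remap ground-truth part labels into the restricted index space.
        gt = part_gt[pix]
        remap = torch.zeros(part_logits.size(1), dtype=torch.long, device=gt.device)
        remap[valid] = torch.arange(len(valid), device=gt.device)
        losses.append(F.cross_entropy(logits, remap[gt]))
    return torch.stack(losses).mean() if losses else part_logits.sum() * 0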
Fig. 17 Object and part joint segmentation results predicted by UPerNet (test image, semantic segmentation, part segmentation, and part ground truth). Object parts are segmented on top of the corresponding object segmentation masks.

During inference, we first pick out a predicted object class and then obtain the predicted part classes from its corresponding part set, so the prediction is organized in a cascaded way. We show qualitative results of UPerNet in Fig. 17, and the quantitative part segmentation performance for several selected objects in Fig. 18.
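The cascaded inference can be sketched as follows, reusing the hypothetical parts_of_object mapping from the training sketch above.

import torch

def cascaded_part_prediction(obj_logits, part_logits, parts_of_object):
    # obj_logits: (C, H, W) object scores; part_logits: (P+1, H, W) part scores (index 0 = non-part).
    obj_pred = obj_logits.argmax(0)                 # per-pixel object class
    part_pred = torch.zeros_like(obj_pred)          # default: non-part
    for obj_cls, part_ids in parts_of_object.items():
        pix = (obj_pred == obj_cls)
        if pix.any():
            valid = torch.tensor([0] + list(part_ids), device=obj_logits.device)
            # Restrict part scores to the parts of the predicted object, then take the argmax.
            restricted = part_logits[valid][:, pix].argmax(0)
            part_pred[pix] = valid[restricted]
    return obj_pred, part_pred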
6 Applications

Accurate scene parsing leads to wider applications. Here we take hierarchical semantic segmentation and automatic scene content removal as exemplar applications of the scene parsing networks.
Hierarchical semantic segmentation. Given the WordNet tree constructed on the object annotations, shown in Fig. 7, the 150 categories are hierarchically connected through hyponym relations. We can therefore gradually merge the objects into their hypernyms, so that classes with similar semantics are merged at the early levels. In this way, we generate a hierarchical semantic segmentation of the image, shown in Fig. 19. The tree also provides a principled way to segment more general visual concepts: for example, to detect all furniture in a scene, we can simply merge the hyponyms associated with that synset, such as chair, table, bench, and bookcase.
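To illustrate the merging step, suppose we have a mapping from each label to its hypernym at a given level of the tree; the dictionary below is a made-up fragment rather than the actual WordNet mapping, and relabeling a parsed map then reduces to a table lookup.

import numpy as np

# Hypothetical fragment of a label -> hypernym mapping at one level of the tree.
HYPERNYM_LV1 = {'chair': 'furniture', 'table': 'furniture', 'bench': 'furniture',
                'bookcase': 'furniture', 'tree': 'vegetation', 'grass': 'vegetation'}

def merge_to_level(label_map, class_names, hypernym):
    # label_map: (H, W) array of class indices; class_names: list mapping index -> name.
    merged_names = sorted(set(hypernym.get(n, n) for n in class_names))
    lut = np.array([merged_names.index(hypernym.get(n, n)) for n in class_names])
    return lut[label_map], merged_names  # relabeled map and the merged class list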
Automatic image content removal. Image content removal methods typically require the user to annotate the precise boundary of the target object to be removed. Here, based on the predicted object probability map from the scene parsing networks, we automatically identify the image regions of the target objects. After cropping out the target objects using the predicted object score maps, we simply use image completion/inpainting methods to fill the holes in the image. Fig. 20 shows some examples of automatic image content removal. It can be seen that, with the object score maps, we are able to crop the objects out of an image precisely. The image completion technique used is described in [14].
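A minimal sketch of this pipeline is given below. As a stand-in for the planar-structure-guided completion of [14], it substitutes OpenCV's generic inpainting, and the threshold and dilation size are arbitrary choices of ours.

import cv2
import numpy as np

def remove_object(image_bgr, score_map, thresh=0.5, dilate_px=7):
    # score_map: (H, W) per-pixel probability of the target object from the parsing network.
    mask = (score_map > thresh).astype(np.uint8) * 255
    # Dilate slightly so that the fill also covers the object boundary.
    mask = cv2.dilate(mask, np.ones((dilate_px, dilate_px), np.uint8))
    return cv2.inpaint(image_bgr, mask, 5, cv2.INPAINT_TELEA)

# Example usage (paths and variables are placeholders):
# img = cv2.imread('scene.jpg'); filled = remove_object(img, person_score_map)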
Fig. 18 Part segmentation performance (in pixel accuracy) for several selected objects (e.g., person, lamp, table, chair, sofa, car, building, bed), predicted by UPerNet.

Fig. 19 Examples of the hierarchical semantic segmentation (levels Lv.0 to Lv.3). Objects with similar semantics, such as furniture and vegetation, are merged at early levels following the WordNet tree.

Fig. 20 Automatic image content removal using the predicted object score maps given by the scene parsing network. We are not only able to remove individual objects such as person, tree, and car, but also groups of them or even all the discrete objects. For each row, the first image is the original image, the second is the object score map, and the third is the filled-in image.

Fig. 21 Scene synthesis. Given annotation masks, images are synthesized by coupling the scene parsing network and the image synthesis method proposed in [24].

Scene synthesis. Given a scene image, the scene parsing network can predict a semantic label mask. Furthermore, by coupling the scene parsing network with the recent image synthesis technique proposed in [24], we can also synthesize a scene image given a semantic label mask. The general idea is to optimize the input code of a deep image generator network to produce an image that highly activates the pixel-wise output of the scene parsing network. Fig. 21 shows three synthesized image samples given the semantic label mask in each row.

As a comparison, we also show the original image associated with each semantic label mask. Conditioned on a semantic mask, the deep image generator network is able to synthesize an image with a similar spatial configuration of visual concepts.
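The optimization loop can be sketched as follows. Here generator and parsing_net stand for any pre-trained image generator and scene parsing network (neither is specified), and the code dimension, number of steps and learning rate are made-up values, so this is a schematic sketch rather than the exact procedure of [24].

import torch
import torch.nn.functional as F

def synthesize_from_mask(generator, parsing_net, target_mask,
                         code_dim=4096, steps=200, lr=0.05):
    # target_mask: (H, W) long tensor of semantic labels the synthesized image should produce.
    code = torch.zeros(1, code_dim, requires_grad=True)   # the generator input code to optimize
    opt = torch.optim.Adam([code], lr=lr)
    for _ in range(steps):
        image = generator(code)                            # (1, 3, H, W) synthesized image
        logits = parsing_net(image)                        # (1, C, H, W) pixel-wise class scores
        # Encourage the parsing network to predict the target label at every pixel.
        loss = F.cross_entropy(logits, target_mask.unsqueeze(0))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return generator(code).detach()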
7 Conclusion

In this work we introduced the ADE20K dataset, a densely annotated dataset with instances of stuff, objects, and parts, covering a diverse set of visual concepts in scenes. The dataset was carefully annotated by a single annotator to ensure precise object boundaries within each image and consistent object naming across images. Benchmarks for scene parsing and instance segmentation were constructed based on the ADE20K dataset. We further organized challenges and evaluated the state-of-the-art models on our benchmarks. All the data and pre-trained models are released to the public.

Acknowledgements This work was supported by Samsung and NSF grant No. 1524817 to AT. SF acknowledges the support from NSERC. BZ is supported by a Facebook Fellowship.

References
tems
1. Badrinarayanan V, Kendall A, Cipolla R (2017) Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans on Pattern Analysis and Machine Intelligence
2. Bell S, Upchurch P, Snavely N, Bala K (2013) OpenSurfaces: A richly annotated catalog of surface appearance. ACM Transactions on Graphics (TOG)
3. Bell S, Upchurch P, Snavely N, Bala K (2015) Material recognition in the wild with the materials in context database. In: Proc. CVPR
4. Caesar H, Uijlings J, Ferrari V (2017) Coco-stuff: Thing and stuff classes in context
5. Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL (2016) Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915
6. Chen X, Mottaghi R, Liu X, Cho NG, Fidler S, Urtasun R, Yuille A (2014) Detect what you can: Detecting and representing objects using holistic models and body parts. In: Proc. CVPR
7. Cordts M, Omran M, Ramos S, Rehfeld T, Enzweiler M, Benenson R, Franke U, Roth S, Schiele B (2016) The cityscapes dataset for semantic urban scene understanding. In: Proc. CVPR
8. Dai J, He K, Sun J (2015) Convolutional feature masking for joint object and stuff segmentation. In: Proc. CVPR
9. Dai J, He K, Sun J (2016) Instance-aware semantic segmentation via multi-task network cascades. In: Proc. CVPR
10. Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The pascal visual object classes (voc) challenge. Int'l Journal of Computer Vision
11. Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? The kitti vision benchmark suite. In: Proc. CVPR
12. Goyal P, Dollár P, Girshick R, Noordhuis P, Wesolowski L, Kyrola A, Tulloch A, Jia Y, He K (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677
13. He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-cnn. In: Proc. ICCV
14. Huang JB, Kang SB, Ahuja N, Kopf J (2014) Image completion using planar structure guidance. ACM Transactions on Graphics (TOG)
15. Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
16. Jiang B, Luo R, Mao J, Xiao T, Jiang Y (2018) Acquisition of localization confidence for accurate object detection. In: Proc. ECCV
17. Krizhevsky A, Sutskever I, Hinton GE (2012) Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems
18. Lin TY, Maire M, Belongie S, Hays J, Perona P, Ramanan D, Dollár P, Zitnick CL (2014) Microsoft coco: Common objects in context. In: Proc. ECCV
19. Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proc. CVPR
20. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proc. CVPR
21. Martin D, Fowlkes C, Tal D, Malik J (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: Proc. ICCV
22. Mottaghi R, Chen X, Liu X, Cho NG, Lee SW, Fidler S, Urtasun R, Yuille A (2014) The role of context for object detection and semantic segmentation in the wild. In: Proc. CVPR
23. Silberman N, Hoiem D, Kohli P, Fergus R (2012) Indoor segmentation and support inference from rgbd images. In: Proc. ECCV
24. Nguyen A, Dosovitskiy A, Yosinski J, Brox T, Clune J (2016) Synthesizing the preferred inputs for neurons in neural networks via deep generator networks

25. Noh H, Hong S, Han B (2015) Learning deconvolution


network for semantic segmentation. In: Proc. ICCV
26. Patterson G, Hays J (2016) Coco attributes: Attributes
for people, animals, and objects. In: Proc. ECCV
27. Peng C, Xiao T, Li Z, Jiang Y, Zhang X, Jia K, Yu G,
Sun J (2018) Megdet: A large mini-batch object detec-
tor. In: Proc. CVPR, pp 6181–6189
28. Ren S, He K, Girshick R, Sun J (2015) Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems
29. Russakovsky O, Deng J, Su H, Krause J, Satheesh S,
Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M,
Berg AC, Fei-Fei L (2015) ImageNet Large Scale Vi-
sual Recognition Challenge. Int’l Journal of Computer
Vision 115(3):211–252
30. Russell BC, Torralba A, Murphy KP, Freeman WT
(2008) Labelme: a database and web-based tool for im-
age annotation. Int’l Journal of Computer Vision
31. Song S, Lichtenberg SP, Xiao J (2015) Sun rgb-d: A
rgb-d scene understanding benchmark suite. In: Proc.
CVPR
32. Spain M, Perona P (2010) Measuring and predicting ob-
ject importance. International Journal of Computer Vi-
sion
33. Wu Z, Shen C, van den Hengel A (2016) Wider or deeper: Revisiting the resnet model for visual recognition. CoRR abs/1611.10080
34. Xiao J, Hays J, Ehinger KA, Oliva A, Torralba A
(2010) Sun database: Large-scale scene recognition
from abbey to zoo. In: Proc. CVPR
35. Xiao T, Liu Y, Zhou B, Jiang Y, Sun J (2018) Unified
perceptual parsing for scene understanding. In: Proc.
ECCV
36. Yu F, Koltun V (2016) Multi-scale context aggregation
by dilated convolutions
37. Zhao H, Shi J, Qi X, Wang X, Jia J (2017) Pyramid
scene parsing network. In: Proc. CVPR
38. Zhou B, Lapedriza A, Xiao J, Torralba A, Oliva A (2014) Learning deep features for scene recognition using places database. In: Advances in Neural Information Processing Systems
39. Zhou B, Zhao H, Puig X, Fidler S, Barriuso A, Tor-
ralba A (2017) Scene parsing through ade20k dataset.
In: Proc. CVPR
