
Rethinking Semantic Segmentation Evaluation for Explainability and Model Selection

Yuxiang Zhang, Sachin Mehta, and Anat Caspi

Abstract—Semantic segmentation aims to robustly predict coherent class labels for entire regions of an image. It is a scene understanding task that powers real-world applications (e.g., autonomous navigation). One important application, the use of imagery for automated semantic understanding of pedestrian environments, provides remote mapping of accessibility features in street environments. This application (and others like it) requires detailed geometric information of geographical objects. Semantic segmentation is a prerequisite for this task since it maps contiguous regions of the same class as single entities. Importantly, semantic segmentation uses like ours are not pixel-wise outcomes; however, most of their quantitative evaluation metrics (e.g., mean Intersection Over Union) are based on pixel-wise similarities to a ground-truth, which fails to emphasize over- and under-segmentation properties of a segmentation model. Here, we introduce a new metric to assess region-based over- and under-segmentation. We analyze and compare it to other metrics, demonstrating that the use of our metric lends greater explainability to semantic segmentation model performance in real-world applications.

Index Terms—Image Segmentation, Semantic Segmentation, Performance Evaluation.

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Y. Zhang, S. Mehta, and A. Caspi are with the University of Washington, Seattle, WA, 98115 USA. E-mail: {yz325, sacmehta, caspian}@cs.washington.edu.

I. INTRODUCTION

SEMANTIC segmentation is an important computer vision task that clusters together parts of images belonging to the same object class. It is used in many real-world visual scene understanding applications (e.g., autonomous driving, robotics, medical diagnostics). Substantial improvements in semantic segmentation performance were driven by recent advances in machine learning (ML), especially convolutional neural networks (CNNs) [1], [2], coupled with advancements in hardware technology [3]–[6]. Moreover, these developments have made some time-sensitive domains (e.g., autonomous driving [7]) practically feasible.

Motivation. In the area of autonomous driving, extensive research and mapping efforts have produced detailed maps of automobile roads. However, street-side environments that serve pedestrians have not received merited attention. Real-time, efficient inference about street-side infrastructure (i.e., mapping the pedestrian environment) is essential for path planning in delivery robots and autonomous wheelchairs. Importantly, similarly detailed maps of path networks can provide data feeds to mobility apps, closing informational gaps experienced by pedestrians with disabilities by answering where paths are, how paths are connected, and whether paths have the amenities and landmarks necessary to traverse them efficiently and safely. Providing street-side maps will promote sustainable, equal access for people of all abilities to navigate urban spaces. A recent study of semantic segmentation in imagery data of pedestrian environments, conducted to enable real-time mapping of accessibility features in street environments [8], identifies challenges of deploying CNNs in real-time, compute-sparse environments. In this application space, an ideal model should predict connected regions of pixels that represent: (1) pedestrian pathways, e.g., sidewalks, footways, and road crossings in a scene, and (2) common object classes in the pedestrian environments, e.g., poles, fire hydrants, and benches. For these applications, removing under- or over-segmentation is often more important than achieving high pixel-wise accuracy. Questions like, "How many paths are there in the current scene, and how are they connected?" can be answered only through segmentation results with region-wise fidelity.

Model evaluation for explainability. Semantic segmentation is performed through pixel-wise classification; however, as an important subtask of image understanding, it contributes to the discovery of groups of objects and the identification of semantically meaningful distributions and patterns in input image data. In [9], segmentation is framed as a graph partitioning problem, and a normalized cut criterion is proposed for measuring the total similarity and dissimilarity of partitions. In [10], superpixels (effectively regions consisting of multiple pixels) are used to advocate for evaluating segmentation in practical applications. Semantic segmentation decomposes a surrounding scene into meaningful semantic regions that are useful for decision-making in downstream applications. For instance, when mapping in the pedestrian environment, it is important to identify which connected regions are correctly mapped to real pathways. In sum, uses of semantic segmentation are not pixel- but region-based, as evidenced in their application to aggregate semantically rich information about the surroundings used to assess navigation options for people with visual disabilities (e.g., Mehta et al. [11] and Yang et al. [12]).

Most quantitative evaluation metrics (e.g., mean Intersection Over Union, mIOU) currently used to evaluate performance or error in semantic segmentation algorithms are based on pixel-wise similarity between ground-truth and predicted segmentation masks. Though these metrics effectively capture certain properties about clustering or classification performance, they fail to provide a proper meter of region-based agreement with a given ground-truth or to provide an expectation about broken regions that model segmentation may yield.
[Figure 1 appears here: two examples, (a) over-segmentation and (b) under-segmentation, each showing, left to right, the RGB image, ground-truth, prediction, and error metric bars (IOU error vs. ROM/RUM).]

Fig. 1: Examples that illustrate over- and under-segmentation issues in semantic segmentation. (Left to right: RGB image, ground-truth, prediction, and error metrics.) In (a), the back of the bus is over-segmented (one region is segmented into three). In (b), people are under-segmented (three groups are segmented as one). Though the IOU error (1 - IOU score) is low in both cases, the IOU metric does not reflect these distinctions. However, these issues can be explained using the proposed metrics for over- (ROM) and under-segmentation (RUM).

We assert that quantitative measures of region segmentation similarity (rather than pixel-wise or boundary similarity) will prove useful to perform model selection and to measure the consistency of predicted regions with a ground-truth in a manner that is invariant to region refinement or boundary ambiguity. Such measures will prove practically useful in applications of semantic segmentation where performance is affected by over- and under-segmentation. For clarity, a predicted region is said to contribute to over-segmentation when the prediction overlaps a ground-truth region that another predicted region also overlaps. A predicted region is under-segmented when the prediction overlaps two or more ground-truth regions. Examples of over- and under-segmentation are shown in Figure 1.

Contributions. This work introduces quantitative evaluation metrics for semantic segmentation that provide granular accounting for over- and under-segmentation. We express the empirical probability that a predicted segmentation mask is over- or under-segmenting and also penalize model segmentations that repeatedly over-segment the same region, causing large variations between model prediction and ground-truth. We demonstrate how these issues affect currently used segmentation algorithms across a variety of segmentation datasets and models, and that currently available metrics do not differentiate between models since they do not measure these issues directly.

II. RELATED WORK

Most segmentation models deployed today are CNN-based models that adopt encoder-decoder architectures (e.g., [3]–[5], [13], [14]). To enable real-world, low-latency deployments on resource-constrained devices, recent efforts in the field are proposing light-weight segmentation models (e.g., [15]–[18]). Most of these models are trained with strong pixel-level supervision, using error metrics such as IOU (Intersection Over Union) error to assess model consistency with annotated image segments [19]. However, in practice, these metrics are ill-suited for discriminating between effective and ineffective region-based segmentation consistency.

Here, we survey measures of similarity popular in segmentation literature and discuss their relevance and associated disadvantages as performance metrics for semantic segmentation. In particular, we highlight two questions that help us characterize and differentiate metrics that are used to quantify segmentation performance: (1) Does the metric correlate with the level of agreement between model segmentation and ground-truth? and (2) Is the metric robust to region refinements or small ambiguities in region boundaries that naturally arise in images?

IOU and other pixel-wise similarity metrics. IOU evaluates pixel-wise similarity irrespective of how pixels are grouped in bounded regions. It is the most common metric for evaluating classification accuracy and is the performance metric of choice for models trained with strong pixel-level supervision. Given a segmentation mask $S_I$ and the ground-truth mask $G_I$ for image $I$, the IOU is calculated as $\frac{|S_I \cap G_I|}{|S_I \cup G_I|}$. This metric provides a measure of similarity ranging from 0 (when there is no similar pixel labeling between $G_I$ and $S_I$) to 1 (when there is full concordance between the two pixel assignments $G_I$ and $S_I$).

IOU gives a measure of pixel-wise assignment similarity that is associated, but not always correlated, with the accuracy of region classification in the single class case. Due to this relationship, the metric is also used as a surrogate for assessing segmentation performance in various segmentation datasets (e.g., PASCAL VOC [24] and MS-COCO [22]). For instance, IOU correctly identifies the difference in overlap between predicted regions and ground-truth in synthetic examples in Figures 2c (lowest IOU error) and 2a (highest IOU error). Additionally, it appropriately penalizes and differentiates models that tend to false-predict regions.
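As a concrete reference for this discussion, the following is a minimal sketch (our illustration, not code from the paper) of per-class IOU over integer label maps:

```python
import numpy as np

def iou_score(pred_mask: np.ndarray, gt_mask: np.ndarray, cls: int) -> float:
    """Pixel-wise IOU for one class: |S ∩ G| / |S ∪ G|."""
    s = pred_mask == cls                 # binary prediction plane for the class
    g = gt_mask == cls                   # binary ground-truth plane for the class
    union = np.logical_or(s, g).sum()
    if union == 0:                       # class absent from both masks;
        return 1.0                       # treating this as agreement is our convention
    return float(np.logical_and(s, g).sum() / union)

# The IOU error reported in Figure 2 and Table I is 1 - IOU.
```

Note that the computation never asks how the pixels of a class are grouped into regions, which is exactly the blind spot discussed above.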
[Figure 2 appears here: sixteen synthetic panels, (a) through (p), each overlaying ground-truth regions (green) and predicted regions (purple). The accompanying metric values are:]

Metric           a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p
IOU Error        1.00 0.53 0.07 0.11 0.58 0.51 0.29 0.58 0.13 0.16 0.61 0.54 0.33 0.60 0.17 0.20
GCE              1.00 0.27 0.07 0.12 0.34 0.38 0.29 0.26 0.12 0.14 0.40 0.42 0.33 0.32 0.17 0.20
PE-OS            1.00 0.53 0.07 0.06 0.78 0.61 0.49 0.78 0.33 0.58 0.78 0.61 0.49 0.78 0.33 0.58
1-AP (IOU=.50)   1.00 0.00 0.00 0.33 0.50 0.50 0.33 0.67 0.50 0.60 0.67 0.67 0.50 0.75 0.60 0.67
1-AP (IOU=.75)   1.00 0.00 0.00 0.33 1.00 1.00 1.00 1.00 0.75 1.00 1.00 1.00 1.00 1.00 0.80 1.00
1-PQ             1.00 0.67 0.51 0.64 0.76 0.75 0.71 0.82 0.69 0.74 0.84 0.82 0.78 0.86 0.76 0.79
ROM (Ours)       0.00 0.00 0.00 0.00 0.46 0.46 0.96 0.76 0.64 0.99 0.32 0.32 0.91 0.64 0.54 0.98

Fig. 2: Synthetic examples that illustrate and compare metrics. While these primarily address over-segmentation, they can be generalized to under-segmentation by inverting the color interpretation of segmentation and ground-truth. Each panel represents an image overlay, with ground-truth regions (denoted in green) and predicted regions (denoted in purple). Panels are designed to represent various situations: (2a) no positive class prediction at all; (2b) near perfect prediction for one ground-truth region; (2c) near perfect prediction; panels (2e) through (2j) different aspects of over-segmentation, with panel (2j) representing the worst over-segmentation case among those; panels (2k) through (2p) correspond to panels (2e) through (2j), respectively, but with an additional false-positive predicted region. These metrics include IOU error, GCE [20], Persello's error (PE) [21], average precision (AP) [22], panoptic quality (PQ) [23], and ROM (ours). Note that IOU error is the representative pixel-wise measure, while GCE, PE, AP and ROM are region-wise measures for semantic segmentation; PQ is a metric for the related instance segmentation task, not semantic segmentation.

However, the correlation between accurate pixel classification and proper segmentation extends only in cases where segmented regions are contiguously uninterrupted, i.e., where it is appropriate to concatenate regions in which geometrically adjacent pixels are similarly labeled. The horse leg segmentation (second example in Figure 4) and the towel class segmentation (third example in Figure 4) apply exactly in this instance, where IOU does not dovetail with region agreement between segmentation mask and ground-truth. As shown in the synthetic examples in Figures 2b, 2e, and 2h, the metric is also not perturbed by significant region-wise refinements.

Other pixel-wise metrics similar to IOU (e.g., pixel accuracy and Dice score [3], [25]) also measure pixel-level correspondence between the predicted segmentation mask and ground-truth. These metrics account for the correctness of pixel-based prediction at a global level but again fail to capture region-based agreement between predicted regions and ground-truth. They also fail to robustly capture or differentiate models that provide improvements in region boundary agreement with ground-truth (see Figure 2k and Figure 2n).

GCE. Martin et al. [20] proposed region-based metrics for evaluating segmentation performance. The global consistency error (GCE) and local consistency error (LCE) were introduced to measure the region-wise inconsistency between a prediction mask, $S_I$, and a ground-truth, $G_I$. For a given pixel $x_i$, let the class assignment of $x_i$ in prediction mask $S_I$ be denoted by $C(S_I, x_i)$. Likewise, the class assignment for $x_i$ in the ground-truth, $G_I$, is represented by $C(G_I, x_i)$. The Local Refinement Error (LRE) is defined at pixel $x_i$ as:

$$LRE(S_I, G_I, x_i) = \frac{|C(G_I, x_i) \setminus C(S_I, x_i)|}{|C(G_I, x_i)|} \quad (1)$$

Since the LRE is one-directional and some inconsistencies arise where the set difference $|C(S_I, x_i) \setminus C(G_I, x_i)|$ accounts for greater divergence between segmentations, the Global Consistency Error (GCE) was defined for an entire image with $N$ pixels as:

$$GCE(S_I, G_I) = \frac{1}{N} \min\left(\sum_{i=1}^{N} LRE(S_I, G_I, x_i),\ \sum_{i=1}^{N} LRE(G_I, S_I, x_i)\right) \quad (2)$$

GCE more stringently accounts for consistencies between the two segmentation masks than LCE. GCE, like pixel-based measures, still captures global pixel classification error (similar high and low measures are observed for Figures 2a and 2c, respectively). It also offers some additional penalty for false positive predictions (note the difference between GCE and IOU for Figures 2e and 2k). Unlike pixel-wise methods, this metric amplifies disagreement between prediction and ground-truth regions. However, GCE still fails to capture the difference between predictions that refine region segmentation (for example, the GCE for Figure 2j is worse than for 2i, but not by much).
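Equations (1) and (2) can be evaluated without a per-pixel loop by histogramming region overlaps. The sketch below is our illustration, under the assumption that both inputs are 2-D integer maps whose values identify regions (not classes):

```python
import numpy as np

def gce(seg: np.ndarray, gt: np.ndarray) -> float:
    """Global Consistency Error (Eq. 2) between two region-label maps."""
    n = seg.size
    # Joint histogram: inter[i, j] = #pixels lying in gt-region i and seg-region j.
    _, g_idx = np.unique(gt.ravel(), return_inverse=True)
    _, s_idx = np.unique(seg.ravel(), return_inverse=True)
    inter = np.zeros((g_idx.max() + 1, s_idx.max() + 1))
    np.add.at(inter, (g_idx, s_idx), 1)
    gt_sizes = inter.sum(axis=1, keepdims=True)    # |C(G, x_i)| per gt region
    seg_sizes = inter.sum(axis=0, keepdims=True)   # |C(S, x_i)| per seg region
    # Each pixel in cell (i, j) contributes (|region| - overlap) / |region|.
    lre_sg = (inter * (gt_sizes - inter) / gt_sizes).sum()    # sum of LRE(S, G, x_i)
    lre_gs = (inter * (seg_sizes - inter) / seg_sizes).sum()  # sum of LRE(G, S, x_i)
    return min(lre_sg, lre_gs) / n
```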
Partition distance. In [26], an error measure was proposed based on partition distance, which counts the number of pixels that must be removed or moved from the prediction for it to match the ground-truth. The partition distance penalizes for under- and over-segmentation to some extent, but partitioning a region in either the prediction or ground-truth into multiple subsets will render the same partition distance. The partition distance tracks much like the GCE for that reason, so we do not calculate it here.
Persello's error. In [21], PE was proposed to measure the local error for over-segmentation (PE-OS) and under-segmentation (PE-US) based on the ratio of the area of the largest overlapping prediction region and the area of a ground-truth region. The measure identifies region agreement and penalizes for larger sizes of discrepancy. For each ground-truth region, the error is computed based on the ratio of the area of the largest overlapping predicted region and the area of the ground-truth region. PE-OS increases with over-segmentation (e.g., Figure 2b has a better score than either Figure 2e or Figure 2h). However, it provides the same score for Figures 2e and 2h because they have the same-sized largest predicted region for the given ground-truth region.
Average precision. AP is the evaluation metric used in the MS-COCO object detection and instance segmentation challenge [22]. Instead of directly measuring pixel-wise concordance, AP uses IOU as a threshold and measures region-wise concordance. If a given prediction region's IOU with a ground-truth region exceeds the threshold, it is counted as a true positive (TP); otherwise, it is counted as a false positive (FP). The precision $\frac{TP}{TP+FP}$ is computed and averaged over multiple IOU values, specifically from 0.50 to 0.95 with 0.05 increments. We computed AP at IOU=.50 (same as the PASCAL VOC metric) and AP at IOU=.75 (strict metric) for the synthetic examples in Figure 2. AP has advantages over pure pixel-wise measures when assessing error instance-wise, but it will fail when two segmentation masks have similar precision values but suffer from different degrees of over-segmentation. For example, AP at IOU=.50 gives the same error measure for Figures 2h, 2k, 2l and 2p, but Figure 2h is considered a worse over-segmentation case than 2l and 2k, while 2p represents the worst over-segmentation case among them. In terms of the stricter measure, AP at IOU=.75 indicates that the majority of these examples have an equal error of 1 without differentiating which one is worse.

Panoptic quality. PQ, which combines the semantic segmentation and instance segmentation tasks [23], is a relevant metric but requires having ground-truth annotations for both semantic and instance segmentation. It provides a unified score that measures pixel-wise concordance and penalizes false positive (FP) and false negative (FN) regions. PQ reflects the gross pixel-wise and region-wise error jointly in a predicted segmentation, but it does not differentiate between under- and over-segmentation. Although FP and FN regions indirectly relate to the over- and under-segmentation issues, PQ penalizes these errors in the same direction and may give the same score to a prediction that over-segments as to one that under-segments.

III. INTERPRETABLE REGION-BASED MEASURES FOR OVER- AND UNDER-SEGMENTATION ERRORS

We now introduce two measures that combine the desirable properties of: (1) accounting for model disagreement with ground-truth in over- and under-segmentation cases, and (2) displaying sensitivity for local over- and under-segmentation refinements to model predictions or refinements to ground-truth based on ambiguous semantic segmentation boundaries. These properties are relevant primarily to quantify the consistency of segmentation results (see Section III-C).

A. Region-wise metric estimations

We now define metrics that isolate errors due to over- and under-segmentation in region-based tasks like semantic segmentation.

Region-wise over-segmentation measure (ROM). Let $I$ be an RGB image, $G_I$ be the ground-truth segmentation mask for $I$, and $S_I$ be the model-predicted segmentation mask for $I$ in dataset $D$ (validation or test set), each with spatial dimensions of $w \times h$, where $w$ and $h$ correspond to width and height, respectively. Assume there are $K$ semantic classes in $D$. A valid label assignment (also referred to as a segmentation) in $G_I$ and $S_I$ maps each pixel $x_{r,c}$ to a single integer label $k \in [0, \cdots, K-1]$ representing the class assignment for that pixel. For simplicity, we assume that the background class label is 0. For each non-background class $k \in [1, \cdots, K-1]$, we convert $G_I$ and $S_I$ to their binary formats, $G_{Ib}$ and $S_{Ib}$, as follows:

$$G_{Ib}[k, r, c] = \begin{cases} 1, & G_I[r, c] == k \\ 0, & \text{otherwise} \end{cases} \quad (3)$$

$$S_{Ib}[k, r, c] = \begin{cases} 1, & S_I[r, c] == k \\ 0, & \text{otherwise} \end{cases} \quad (4)$$

where $r \in [0, h-1]$ and $c \in [0, w-1]$ correspond to row and column indices of a pixel in the image.

Each spatial plane in $G_{Ib}$, i.e., $G_{Ib}[k]$, consists of $N = \|G_{Ib}[k]\|$ separate contiguous regions, where $\|\cdot\|$ is an operator that counts the number of separate contiguous regions in $G_{Ib}[k]$. Therefore, we can represent each spatial plane $G_{Ib}[k]$ as a set of connected regions $G_{Ib}[k] = \{g_1, g_2, ..., g_N\}$, $k \in [1, K-1]$. We assert that $g_i \cap g_j = \emptyset$, $\forall i \neq j$. Similarly, we represent each spatial plane in $S_{Ib}$, i.e., $S_{Ib}[k]$, as a set of $M$ connected regions $S_{Ib}[k] = \{s_1, s_2, ..., s_M\}$, $k \in [1, K-1]$, where $s_i \cap s_j = \emptyset$, $\forall i \neq j$. We refer to the complete set of binary planes that satisfy these constraints as a valid segmentation ground-truth pair and look for measures of the form $d(S_{Ib}, G_{Ib})$.

We begin by evaluating the performance of the prediction mask by making a detailed accounting of over-segmented foreground regions. A model-predicted region contributes to the over-segmentation count when the prediction region overlaps with a ground-truth region that itself overlaps with more than one model-predicted region. Fig. 1a shows an example of over-segmentation.

We denote regions in $S_{Ib}$ contributing to over-segmentation as $S_O^I = \{s_i \in S_{Ib}\}$, where $s_i \cap g_l \neq \emptyset \wedge s_j \cap g_l \neq \emptyset$, $i \in [1, \cdots, M]$, $j \in [1, \cdots, M]$, $l \in [1, \cdots, N]$, $i \neq j$. $S_O^I$ marks model-predicted region $s_i \in S_{Ib}$ as included in the over-segmentation count; it must overlap with at least one ground-truth region $g_l \in G_{Ib}$, while the ground-truth region $g_l$ must overlap with at least one other prediction region $s_j \in S_{Ib}$. The total number of regions that contribute to over-segmentation with respect to ground-truth regions is $\|S_O^I\|$.
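Equations (3) and (4), together with the connected-region sets $\{g_1, ..., g_N\}$ and $\{s_1, ..., s_M\}$, translate directly into a connected-component pass. The following is a minimal sketch of ours; the choice of 4-connectivity is an assumed implementation detail the paper does not fix:

```python
import numpy as np
from scipy import ndimage

def class_regions(mask: np.ndarray, cls: int):
    """Connected regions of one class plane (cf. Eqs. 3-4).

    `mask` is an integer label map; the binary plane is mask == cls.
    Returned regions are boolean masks and pairwise disjoint by construction."""
    plane = mask == cls                   # G_Ib[k] or S_Ib[k]
    labeled, n = ndimage.label(plane)     # default structure: 4-connected components
    return [labeled == i for i in range(1, n + 1)]
```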
Similar accounting identifies ground-truth regions involved in over-segmentation (i.e., overlapping more than one predicted region). We denote regions in $G_{Ib}$ that are involved in over-segmentation as $G_O^I = \{g_i \in G_{Ib}\}$, such that $g_i \cap s_j \neq \emptyset \wedge g_i \cap s_l \neq \emptyset$, $j \in [1, \cdots, M]$, $l \in [1, \cdots, M]$, and $j \neq l$. This definition asserts that a ground-truth region $g_i \in G_{Ib}$ is counted towards the over-segmentation count if it overlaps with at least two different model-predicted regions, $s_j \in S_{Ib}$ and $s_l \in S_{Ib}$. $\|G_O^I\|$ denotes the total number of ground-truth regions that are involved in over-segmentation.

We are interested in a measure that combines desirable statistical properties with the ability to count disagreements between $S_{Ib}$ and $G_{Ib}$ and accommodate model prediction refinement. Specifically, we should not use any thresholds or fixed pixel-percent requirements to determine the overlap between regions in the segmentation mask and ground-truth. This is important in the context of many methods that are used to merge or fuse segmentation results based on geometric proximity or filtering methods [27].

Given the ground-truth labeling, $G_{Ib}$, the probability that a model-predicted region $s_i \in S_{Ib}$ contributes to over-segmentation can be represented as $\frac{\|S_O^I\|}{\|S_{Ib}\|}$, and the probability that a ground-truth region $g_i \in G_{Ib}$ is over-segmented can be represented as $\frac{\|G_O^I\|}{\|G_{Ib}\|}$. Our goal is to express the empirical probability that a predicted segmentation mask is over-segmenting. We define this index as the region-wise over-segmentation ratio (ROR):

$$ROR = \frac{\|G_O^I\| \, \|S_O^I\|}{\|G_{Ib}\| \, \|S_{Ib}\|}. \quad (5)$$

This measure takes values between 0 and 1 and indicates the percentage of elements in $G_{Ib}$ and $S_{Ib}$ that relate to over-segmentation. In the worst case, every single model-predicted region and ground-truth region is either contributing to or involved in over-segmentation, giving ROR a value of 1.

ROR may fail to differentiate between cases where elements in $S_O^I$ are further split into subsets (e.g., Figures 2e and 2h). Therefore, we seek to penalize the ROR score based on the total number of over-segmenting prediction regions. Let $S_{g_i}^I = \{s_j \in S_{Ib}\}$, such that $s_j \cap g_i \neq \emptyset$, denote the set of all predicted regions that overlap with a given $g_i \in G_{Ib}$. To penalize the ROR score, we compute the over-segmentation aggregate term ($m_o$) from $S_{g_i}^I$ for all $g_i \in G_{Ib}$ as:

$$m_o = \sum_{g_i} \max(\|S_{g_i}^I\| - 1,\ 0). \quad (6)$$

Thus, $m_o$ expresses an aggregate penalty structure accounting for each ground-truth region $g_i \in G_{Ib}$ that is overlapped with at least one model-predicted region. To compute the final region-wise over-segmentation measure (ROM), we multiply ROR by $m_o$. The resultant value will never be negative because both ROR and $m_o$ are greater than or equal to 0. Since the resultant value can be very large, we scale it between 0 and 1 using a tanh function (other scaling functions, such as sigmoid, can be applied; we choose tanh over sigmoid because it is centered at 0). Doing so confers the property of taking on a wider range of values over [0, 1), increasing the sensitivity of the measure to classes that habitually over-segment and allowing us to compare with existing metrics.

$$ROM = \tanh(ROR \times m_o) \quad (7)$$

A ROM value of zero indicates that there is no over-segmentation. In this case, $\|G_O^I\| = \|S_O^I\| = 0$, indicating that each $g_i \in G_{Ib}$ overlaps with at most one $s_j \in S_{Ib}$. A ROM value approaching 1 indicates that the predicted segmentation contains abundant over-segmented regions. In the extreme case, $\|G_O^I\| = \|G_{Ib}\|$, $\|S_O^I\| = \|S_{Ib}\|$, and $m_o \to \infty$. This means that every single prediction/ground-truth region contributes to over-segmentation, and each ground-truth region overlaps with an infinite number of prediction regions.

Region-wise under-segmentation measure (RUM). A similar argument follows for evaluating $S_I$ from the under-segmentation perspective. A predicted region that contributes to under-segmentation (see example in Fig. 1b) is represented as $S_U^I = \{s_i \in S_{Ib}\}$, where $\exists k, l \in [1..N]$, $k \neq l$, $s_i \cap g_k \neq \emptyset \wedge s_i \cap g_l \neq \emptyset$. The total count of model-segmented regions that contribute to under-segmentation is denoted by $\|S_U^I\|$. This representation identifies a model-predicted region as under-segmenting when it overlaps with at least two different ground-truth regions $g_k \in G_{Ib}$ and $g_l \in G_{Ib}$. Similarly, a ground-truth region is involved in under-segmentation when it overlaps with a prediction region that in turn overlaps with at least two ground-truth regions. This is represented as $G_U^I = \{g_i \in G_{Ib}\}$, s.t. $\exists j \in [1..N]$, $\exists l \in [1..M]$, $i \neq j$, $s_l \cap g_i \neq \emptyset \wedge s_l \cap g_j \neq \emptyset$. This representation counts ground-truth regions $g_i \in G_{Ib}$ that overlap with at least one prediction region $s_l \in S_{Ib}$, while the prediction region $s_l$ overlaps with at least one other ground-truth region $g_j \in G_{Ib}$. The total number of ground-truth regions involved in under-segmentation is denoted by $\|G_U^I\|$.

The analog probabilistic terms for the region-wise under-segmentation ratio (RUR), under-segmentation multiplier ($m_u$), and region-wise under-segmentation measure (RUM) are defined as:

$$RUR = \frac{\|G_U^I\| \, \|S_U^I\|}{\|G_{Ib}\| \, \|S_{Ib}\|} \quad (8)$$

$$m_u = \sum_{s_i} \max(\|G_{s_i}^I\| - 1,\ 0) \quad (9)$$

$$RUM = \tanh(RUR \times m_u) \quad (10)$$

where $G_{s_i}^I$ denotes the set of ground-truth regions that overlap with a given $s_i \in S_{Ib}$, the analog of $S_{g_i}^I$.
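Since the overlap tests above use no thresholds, ROM and RUM reduce to boolean bookkeeping over region pairs. The following is our illustrative sketch for one class of one image, assuming region lists such as those produced by the class_regions helper sketched earlier:

```python
import numpy as np

def rom_rum(gt_regions, pred_regions):
    """Sketch of ROM and RUM (cf. Eqs. 5-10) for one class of one image."""
    if not gt_regions or not pred_regions:
        return 0.0, 0.0
    # overlap[i, j] = True iff gt region i and predicted region j share a pixel
    overlap = np.array([[np.logical_and(g, s).any() for s in pred_regions]
                        for g in gt_regions], dtype=bool)
    n_gt, n_pred = overlap.shape
    preds_per_gt = overlap.sum(axis=1)   # ||S_gi|| for each ground-truth region
    gts_per_pred = overlap.sum(axis=0)   # ||G_si|| for each predicted region

    # Over-segmentation: gt regions touched by >= 2 predictions (G_O),
    # and the predictions touching any such gt region (S_O).
    g_o = preds_per_gt >= 2
    s_o = overlap[g_o].any(axis=0)
    ror = g_o.sum() * s_o.sum() / (n_gt * n_pred)        # Eq. (5)
    m_o = np.maximum(preds_per_gt - 1, 0).sum()          # Eq. (6)
    rom = np.tanh(ror * m_o)                             # Eq. (7)

    # Under-segmentation: the mirror-image accounting.
    s_u = gts_per_pred >= 2
    g_u = overlap[:, s_u].any(axis=1)
    rur = g_u.sum() * s_u.sum() / (n_gt * n_pred)        # Eq. (8)
    m_u = np.maximum(gts_per_pred - 1, 0).sum()          # Eq. (9)
    rum = np.tanh(rur * m_u)                             # Eq. (10)
    return float(rom), float(rum)
```

By construction, splitting one ground-truth region across more predictions raises both the ratio and the multiplier, so the measure grows monotonically with further splitting, matching the behavior described above.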
B. Region-wise confidence measure

Many segmentation models confer a prediction class for each pixel and an associated probability (confidence) for each pixel prediction. Instead of looking at the confidence of each pixel individually (as was done in the PASCAL VOC object detection evaluation [24]), we propose to use the confidence of each predicted contiguous region, when available. We represent the confidence of a predicted region as the numerical mean of the confidences of all pixels enclosed in that region. When evaluating with ROM (or RUM), all pixels in a region whose confidence is lower than a certain threshold are mapped to the class unknown. We experiment with the effect of different thresholds in Section V.
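One possible implementation of this region-confidence rule is sketched below (our illustration; the `unknown=-1` label value is a placeholder of ours):

```python
def mask_low_confidence(pred_mask, conf_map, regions, thr, unknown=-1):
    """Map whole regions to `unknown` when their mean confidence < thr.

    `regions` are boolean region masks; `conf_map` holds the per-pixel
    prediction probabilities emitted by the model."""
    out = pred_mask.copy()
    for r in regions:
        if conf_map[r].mean() < thr:   # region confidence = mean pixel confidence
            out[r] = unknown           # excluded from ROM/RUM accounting
    return out
```

Sweeping thr over [0, 1] corresponds to the region confidence threshold axis in Figure 3.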
C. Qualitative assessment

Here, we qualitatively demonstrate that the ROM/RUM metrics can: (1) differentiate between models that produce fewer or more numerous over-/under-segmentation inconsistencies, and (2) adequately quantify improvements to model predictions while providing robust tolerance to small perturbations in ground-truth regions.

We highlight that ROM (or RUM) cannot be used as measures of classification accuracy that account for gross error. For example, panels in Figure 2a through Figure 2d are equivalent from the perspective of over-segmentation error (ROM is 0 for all), indicating the lack of over-segmentation errors in all of these examples (whether due to mis-classification or not). As demonstrated in Section II, widely available pixel-wise metrics can be used in conjunction with ROM and RUM to address these concerns. Future work might consider how to combine these metrics in a principled way.

The advantage of the ROM metric is that it isolates disagreement and correlates with discrepancies between model and ground-truth due to region-wise over-segmentation. The analog is true for RUM with respect to under-segmentation. Specifically, accounting only for over-segmentation, panels in Figure 2e and Figure 2f are equivalent. Moreover, the metric is useful in accounting for over-segmentation in the average model-predicted region, therefore indicating higher over-segmentation in panels in Figure 2e and Figure 2f versus panels in Figure 2k and Figure 2l. This demonstrates the ability to quantify model differences that produce fewer or more numerous over-segmentation inconsistencies. Finally, as ROM error increases across panels in Figures 2b, 2e and 2h, we see that ROM differentiates and grows monotonically with over-segmentation errors in model predictions.

Notably, our approach yields a supervised objective evaluation and highlights discrepancies in overall region segmentation results, comparing them to a ground-truth. Calculations for ROM and RUM specifically avoid penalizing pixel-wise non-concordance where $g_i \cap s_j = \emptyset$ because such information has been handled by other metrics (like IOU). ROM and RUM help solely with evaluation from the over- and under-segmentation perspective, respectively.

IV. EXPERIMENTAL SET-UP

To demonstrate the effectiveness of our proposed metrics, we evaluate and compare them with existing metrics across different semantic segmentation datasets and different segmentation models.

Baseline metrics. We use the following metrics for comparison: (1) mIOU error, (2) Dice error, (3) pixel error, (4) global consistency error (GCE), (5) Persello's error (PE-OS and PE-US for over- and under-segmentation), and (6) average precision error (1-AP at IOU=.50:.05:.95). These metrics report a similarity score between 0 and 1, where 0 means that predicted mask $S_I$ and ground-truth mask $G_I$ for image $I$ are the same, while 1 means they are not similar.

Semantic segmentation models and datasets. We study our metric using state-of-the-art semantic segmentation models on three standard datasets (PASCAL VOC 2012 [24], ADE20K [28], and Cityscapes [29]). These models were selected based on network design decisions (e.g., light- vs. heavy-weight), public availability, and segmentation performance (IOU), while the datasets were selected based on different visual recognition tasks (PASCAL VOC 2012: foreground-background segmentation; ADE20K: scene parsing; Cityscapes: urban scene understanding). In this paper, we focus on semantic segmentation datasets because of their wide applicability across different domains, including medical imaging (e.g., [30]). However, our metric is generic and can be easily applied to other segmentation tasks (e.g., panoptic segmentation) to assess over- and under-segmentation performance.

For PASCAL VOC 2012, we chose Deeplabv3 [6] as our heavy-weight model and two recent efficient networks (ESPNetv2 [17] and MobileNetv2 [18]) as light-weight models. For ADE20K, we chose PSPNet [5] (with ResNet-18 and ResNet-101 [31]) as heavy-weight networks and MobileNetv2 [18] as the light-weight network. For Cityscapes, we chose DRN [32] and ESPNetv2 [17] as heavy- and light-weight networks, respectively.

TABLE I: Performance comparison using different metrics. Our metrics quantify over- and under-segmentation errors, thus explaining the source of error in semantic segmentation performance.

(a) PASCAL VOC 2012

Error metrics    DeepLabv3  ESPNetv2  MobileNetv2
mIOU Error       0.18       0.25      0.24
Dice Error       0.13       0.20      0.18
Pixel Error      0.11       0.13      0.13
GCE              0.07       0.09      0.10
PE-OS            0.31       0.37      0.35
PE-US            0.41       0.45      0.44
AP Error         0.67       0.70      0.70
ROM (Ours)       0.07       0.08      0.11
RUM (Ours)       0.32       0.30      0.30

(b) Cityscapes

Error metrics    DRN   ESPNetv2
mIOU Error       0.30  0.38
Dice Error       0.22  0.30
Pixel Error      0.17  0.18
GCE              0.13  0.16
PE-OS            0.46  0.55
PE-US            0.50  0.62
AP Error         0.81  0.86
ROM (Ours)       0.26  0.25
RUM (Ours)       0.35  0.40

(c) ADE20K

Error metrics    PSPNet (ResNet-18)  PSPNet (ResNet-101)  MobileNetv2
mIOU Error       0.62                0.58                 0.65
Dice Error       0.43                0.40                 0.46
Pixel Error      0.21                0.19                 0.24
GCE              0.19                0.17                 0.21
PE-OS            0.53                0.50                 0.56
PE-US            0.52                0.49                 0.54
AP Error         0.73                0.69                 0.76
ROM (Ours)       0.11                0.10                 0.16
RUM (Ours)       0.23                0.23                 0.22
[Figure 3 appears here. For each dataset, three plots are shown per model: ROM vs. region confidence threshold, RUM vs. region confidence threshold, and RUM vs. ROM. Panels: (a) The PASCAL VOC dataset; (b) The Cityscapes dataset; (c) The ADE20K dataset.]

Fig. 3: ROC curves for over- (ROM) and under-segmentation (RUM) on different datasets. Note the rightmost plot in each sub-figure is between ROM and RUM errors. Therefore, a lower area under the ROC curve means better performance.

V. RESULTS AND DISCUSSION

We describe the utility of our approach through extensive use of examples. Figure 3, Figure 4, and Table I show results for different datasets and segmentation models. Qualitatively, our metric can quantify over- and under-segmentation issues (Figure 4). For instance, the example in the first row of Figure 4 shows a three-class task. The model prediction was almost perfect for this example, i.e., it correctly segmented each object. Our metrics correctly assigned ROM = 0 and RUM = 0, meaning there was no over- or under-segmentation in this example. However, all other metrics indicated that non-negligible errors occurred in this example. The fifth row of Figure 4 shows an example of over-segmentation for the person class. The torso of the leftmost person was over-segmented into two regions that were far apart, delivering a false prediction that two people were at the same location. A high value for ROM reflected this discrepancy. The third row of Figure 4 shows under-segmentation for the towel class, where accurate pixel classification and proper region segmentation disagreed. Pixel-wise measures (IOU, Dice score, and pixel error) indicated little error for the towel class. However, the number of model-predicted regions was only half the number of ground-truth regions. A high value for RUM reflected this disagreement.

Quantitatively, all models performed well on the PASCAL VOC dataset, as reflected by the low error scores in Table Ia. The light-weight models ESPNetv2 and MobileNetv2 had similar mIOU errors, but ROM shows that ESPNetv2 made fewer mistakes in terms of over-segmentation. The receiver operating characteristic (ROC) curves in Figure 3a indicate that all models made fewer mistakes when region confidence thresholds increased. At the threshold of 0.6, MobileNetv2 showed a reduction in over-segmentation error (achieving similarity to the other two models) without worsening under-segmentation error.
[Figure 4 appears here: six examples, each showing, left to right, the RGB image, ground truth, prediction, and per-class error-metric bars.]

Fig. 4: Qualitative results illustrating over- and under-segmentation. The example in the top row is a near-perfect segmentation. The second row shows an under-segmentation example on the horse class. The third row shows an under-segmentation example on the towel class. The fourth row shows an under-segmentation example on the bottle class. The fifth row shows an over-segmentation example on the person class. The last row shows an over-segmentation example on the chair class. The error metric legend is shown in the bar graphs (bottom). See Supplementary Material for more examples and detailed discussions.
In the context of quotidian goals of weighing trade-offs between heavy- and light-weight models for particular segmentation tasks, we can make several observations about ROM/RUM. Table Ib compares overall performance of different models on the Cityscapes validation set. Comparing the heavy-weight (DRN) and light-weight (ESPNetv2) models, we expected to see ESPNetv2 generally perform worse, which is reflected by an 8 point difference in IOU, Dice score, and pixel accuracy. In addition to these pixel-wise measures, our metrics help explain the source of performance discrepancy. DRN and ESPNetv2 had nearly the same value for ROM but a significant difference of 5 points in RUM. This indicates that performance degradation in ESPNetv2 in this dataset mainly emanated from under-, not over-, segmentation. Similarly, in Table Ic, we can explain that the lower performance of MobileNetv2 (a light-weight model) compared to PSPNet with ResNet-101 (a heavy-weight model) on the ADE20K dataset is primarily due to over-segmentation issues. A researcher may elect to tolerate the increase in under- or over-segmentation in exchange for computational efficiency, depending on downstream uses. Conversely, for navigation tasks, under-segmenting light poles may undercut downstream wayfinding and mapping uses, making it an undesirable outcome [8].

VI. CONCLUSION

This work reviewed measures of similarity popular in computer vision for assessing semantic segmentation fidelity to a ground-truth and discussed their shortcomings in accounting for over- and under-segmentation. While IOU and similar evaluation metrics can be applied to measure pixel-wise correspondence between model-predicted segmentation and ground-truth, the evaluation can suffer from inconsistencies due to different region boundaries and notions of significance with respect to a particular class of objects. We proposed an approach to segmentation metrics that decomposes to explainable terms and is sensitive to over- and under-segmentation errors. This new approach confers additional desirable properties, like robustness to boundary changes (since both annotations and natural-image semantic boundaries are non-deterministic). We contextualized the application of our metrics in current model selection problems that arise in practice when attempting to match the context of use to region-based segmentation performance in supervised datasets.

ACKNOWLEDGMENT

This work was sponsored by the State of Washington Department of Transportation (WSDOT) research grant T1461-47, in cooperation with the Taskar Center for Accessible Technology at the Paul G. Allen School at the University of Washington.

REFERENCES

[1] Y. LeCun, Y. Bengio et al., "Convolutional networks for images, speech, and time series," The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995, 1995.
[2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in neural information processing systems, 2012, pp. 1097–1105.
[3] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs," IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[5] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
[6] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," arXiv preprint arXiv:1706.05587, 2017.
[7] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang et al., "End to end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.
[8] Y. Zhang and A. Caspi, "Stereo imagery based depth sensing in diverse outdoor environments: Practical considerations," in Proceedings of the 2nd ACM/EIGSCC Symposium on Smart Cities and Communities, 2019, pp. 1–9.
[9] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on pattern analysis and machine intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[10] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "Slic superpixels," Tech. Rep., 2010.
[11] S. Mehta, H. Hajishirzi, and L. Shapiro, "Identifying most walkable direction for navigation in an outdoor environment," arXiv preprint arXiv:1711.08040, 2017.
[12] K. Yang, K. Wang, L. M. Bergasa, E. Romera, W. Hu, D. Sun, J. Sun, R. Cheng, T. Chen, and E. López, "Unifying terrain awareness for the visually impaired through real-time semantic segmentation," Sensors, vol. 18, no. 5, p. 1506, 2018.
[13] V. Badrinarayanan, A. Kendall, and R. Cipolla, "Segnet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[14] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
[15] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi, "Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation," in Proceedings of the european conference on computer vision (ECCV), 2018, pp. 552–568.
[16] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, "Erfnet: Efficient residual factorized convnet for real-time semantic segmentation," IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp. 263–272, 2017.
[17] S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi, "Espnetv2: A light-weight, power efficient, and general purpose convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9190–9200.
[18] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, "Mobilenetv2: Inverted residuals and linear bottlenecks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
[19] W. Shimoda and K. Yanai, "Weakly supervised semantic segmentation using distinct class specific saliency maps," in Computer Vision and Image Understanding, vol. 191, 2018.
[20] D. Martin, C. Fowlkes, D. Tal, J. Malik et al., "A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics," in Proc. ICCV, Vancouver, 2001.
[21] C. Persello and L. Bruzzone, "A novel protocol for accuracy assessment in classification of very high resolution images," IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 3, pp. 1232–1244, 2009.
[22] X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick, "Microsoft coco captions: Data collection and evaluation server," arXiv preprint arXiv:1504.00325, 2015.
[23] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollár, "Panoptic segmentation," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 9404–9413.
[24] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (voc) challenge," International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
[25] A. A. Taha and A. Hanbury, "Metrics for evaluating 3d medical image segmentation: analysis, selection, and tool," BMC medical imaging, vol. 15, no. 1, p. 29, 2015.
[26] M. Polak, H. Zhang, and M. Pi, "An evaluation metric for image segmentation of multiple objects," Image and Vision Computing, vol. 27, no. 8, pp. 1223–1227, 2009.
[27] A. Borji, M.-M. Cheng, H. Jiang, and J. Li, "Salient object detection: A benchmark," IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5706–5722, 2015.
[28] B. Zhou, H. Zhao, X. Puig, T. Xiao, S. Fidler, A. Barriuso, and A. Torralba, "Semantic understanding of scenes through the ade20k dataset," International Journal of Computer Vision, vol. 127, no. 3, pp. 302–321, 2019.
[29] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The cityscapes dataset for semantic urban scene understanding," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3213–3223.
[30] B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, Y. Burren, N. Porz, J. Slotboom, R. Wiest et al., "The multimodal brain tumor image segmentation benchmark (brats)," IEEE transactions on medical imaging, vol. 34, no. 10, pp. 1993–2024, 2014.
[31] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[32] F. Yu, V. Koltun, and T. Funkhouser, "Dilated residual networks," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 472–480.

Supplementary Material:
Rethinking Semantic Segmentation Evaluation for
Explainability and Model Selection
Yuxiang Zhang, Sachin Mehta, and Anat Caspi

I. OUTLINE

This document accompanies the paper "Rethinking Semantic Segmentation Evaluation for Explainability and Model Selection." We illustrate our evaluation analysis of the ROM/RUM metrics with results using additional image examples.

In Section II, we provide additional examples for comparing different metrics in various scenarios. Extended discussions based on the additional examples are made in Section III.

II. CONJOINT USE OF RUM/ROM WITH MIOU TO INFORM MODEL SELECTION AND EVALUATION

Our paper proposes that region-based error metrics are not necessarily correlated with pixel-wise error metrics. As alternative sources of information about the model's performance in segmentation settings, and in particular, specific class performance, it is important to understand how pixel-based and region-based metrics might interact to better inform model selection or focus on improvements in any particular class. In this section, we demonstrate that our suggested metrics are orthogonal to traditional segmentation metrics and the manner in which the metrics can be used conjointly to inform model selection and evaluation. We integrate additional segmentation examples, with their class-wise scores and RUM/ROM scores. The examples are drawn from the ADE20K dataset and the Cityscapes dataset. The ADE20K dataset provides a variety of indoor/outdoor scenes with 150 object classes. The wide range of scenes in this dataset allows us to locate examples for different segmentation scenarios. The Cityscapes dataset contains urban scenes captured from a car dashboard's perspective, with 19 object classes commonly seen in the urban road environment. This dataset is important to tasks involving autonomous vehicles and outdoor environment mapping. The examples drawn from these two datasets are assorted into different categories as follows. The color code representing different metrics is shown in Figure 1.

A. Detailed Interrogation Into Low Pixel-wise Error and Region-based Error Metrics

Pixel-wise measures (IOU, Dice score, pixel accuracy) are indeed informative when addressing questions surrounding pixel classification. However correlated these measures are, they are not surrogate metrics for near-perfect semantic segmentation. Importantly, pixel-wise measurements do not capture any region-wise information, nor do they indicate whether there are any over-/under-segmentation issues in the prediction. The following examples show that despite similar pixel-wise error metrics, the quality of semantic segmentation in the prediction can diverge.

Low ROM, Low RUM: Exploring the category of jointly low pixel-wise errors, low over-segmentation error (ROM), and low under-segmentation error (RUM), we anticipate high model-prediction fidelity to the ground truth. Looking at these metrics conjointly, we expect that the model predicts a near-perfect segmentation. As an example, Figure 2 shows an excellent prediction on a simple 2-class image with low pixel-wise error metrics and no error along ROM/RUM. Without the need to inspect the results by eye, we can expect few or no region-wise issues in this example and others like it.

High RUM: Demonstrating that pixel-wise and region-wise metrics are not always correlated, it is important to note cases in which a result yields low pixel-wise errors but high RUM, indicating that the number of regions predicted for a particular class by the model is less than the number of regions in the ground truth. As an example, in Figure 3, the model only predicts 2 regions of people out of 6 in the ground truth. The pixel distance between regions classified as people is small, and therefore region under-segmentation does not significantly impact pixel-wise error. Nevertheless, the region-wise discrepancy is reflected by the high RUM. By far, this is the most prevalent type of conjoint error we see in the Cityscapes dataset with the ESPNetv2 model predictions.

High ROM: High ROM in conjunction with low pixel-wise errors may also occur, indicative of model predictions that demonstrate low per-pixel classification error for an object class despite segmenting more than one distinct region per single-class object in the ground truth. For example, in Figure 4, the chair legs are segmented into multiple separate regions, creating over-segmentation issues and delivering the false inference that there are multiple chairs at the same location. Note that this is an interesting example of the interplay between different classes' segmentations: when one chair is over-segmented into multiple disconnected regions, the result impacts another class' region segmentation, i.e., the chair background is then connected into one large region for the wall class. The wall is then under-segmented as a result of the chair over-segmentation.

Fig. 1: Legend for different metrics

Fig. 2: Low IOU Error, Low ROM/RUM

Fig. 3: Low IOU Error, High RUM (person)

Fig. 4: Low IOU Error, High ROM (chair)

This demonstrates how models might be interrogated for intricate, nuanced relationships among classes. In this case, looking at the entire model (ResNet101+PSPNet) performance on this dataset (ADE20K), the overall mIOU error is 0.58, whereas for the chair class the mIOU error is 0.54, RUM is 0.09, and ROM is 0.14, interplayed with the wall class (mIOU error is 0.32, RUM is 0.31, and ROM is 0.10).

B. Detailed Interrogation Into High Pixel-wise Error and Region-based Error Metrics

There is no question that significant pixel-wise classification errors cannot yield very good segmentation. However, the examples we looked at are encouraging in terms of the potential uses of light-weight models (with occasionally higher pixel-wise error) in certain semantic segmentation scenarios. We looked at images that tended to have high pixel-wise errors in conjunction with ROM/RUM. Identifying images with specific region-wise metric attributes can provide insight on region-wise segmentation qualities and offer clues about the types of images or scenes in which pixel-wise predictions may fail.
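The conjoint scenarios examined in Sections II-A and II-B can be summarized by a simple per-class bucketing rule. The sketch below is ours, and its thresholds are illustrative placeholders, not values used in the paper:

```python
def conjoint_category(iou_err, rom, rum, pix_thr=0.3, reg_thr=0.2):
    """Bucket one class into the conjoint categories discussed in this section."""
    pix = "high" if iou_err > pix_thr else "low"
    reg = [name for name, v in (("high ROM", rom), ("high RUM", rum)) if v > reg_thr]
    return f"{pix} pixel error, " + (", ".join(reg) if reg else "low ROM/RUM")
```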
Fig. 5: High IOU Error, Low RUM/ROM (person, pole)

Fig. 6: High IOU Error, High RUM (bottle)

Low ROM, Low RUM: In this class of images, the predicted pixel classification may create false-negative/false-positive regions that degrade pixel-wise accuracy. However, those classes yielding regions that are correctly predicted by the model correspond well with ground-truth, as reflected by the low ROM/RUM metrics. For example, in Figure 5, although the model fails to predict some persons and poles, and consequently makes a few false-positive predictions, those regions that are predicted correctly are matched to the corresponding objects in the ground truth. As no over-/under-segmentation occurred in those particular predicted regions, they are assigned a 0 value for ROM/RUM. Again, this notion can be used to interrogate model predictions in a more nuanced manner to understand how the interplay between different classes (with potentially different backgrounds) may account for gross pixel-wise classification errors with certain models.

High RUM: In this category, the pixel-wise accuracy errors are high, and a high RUM value indicates that these errors mainly come from under-segmentation. As shown in Figure 6, the bottles on the shelf are falsely grouped into one large object. Meanwhile, the pixels in between each bottle are falsely predicted into the bottle class, which impairs the overall pixel-wise accuracy.

High ROM: In this category, the pixel-wise accuracy errors are high, and a high ROM value explains that these errors occur when a single object in the ground truth is subdivided into several regions in the prediction. For example, in Figure 7, a chair is falsely segmented into multiple regions that are far apart. This is concurrent with the majority of pixels in the middle of the chair being predicted as other classes, resulting in low pixel-wise accuracy.

High ROM, High RUM: Rarely, there are cases where both ROM and RUM are high, e.g., the plant class in Figure 8. This happens when one or more regions in the ground truth are over-segmented, while other regions are under-segmented.

III. DISCUSSION

From the examples shown in Section II, we demonstrate that ROM and RUM have the following attributes and advantages.

1) ROM/RUM accounts for model disagreement with ground-truth in over- and under-segmentation cases, revealing over-/under-segmentation issues that are not reflected in other metrics. For example, as shown in Figure 3, the prediction may have high pixel-wise accuracy while suffering from severe under-segmentation issues. In downstream applications where under-segmentation is the primary concern, RUM needs to be considered together with other pixel-wise metrics for model selection.

2) ROM/RUM can differentiate among different degrees of over-segmentation and under-segmentation issues. This is important because it allows researchers to quantify the severity of the over-/under-segmentation issues in a particular model as it pertains to a specific class or classes of objects. For example, in Figure 4, both the table class and the chair class are over-segmented, but the chair class receives a higher ROM because it has more over-segmented regions per ground truth region.
Fig. 7: High IOU Error, High ROM (chair)

Fig. 8: High IOU Error, High ROM/RUM (plant)

3) ROM/RUM offers additional information for evaluating and selecting a segmentation model alongside the pixel-wise metrics. When pixel-wise accuracy is high, ROM/RUM can assist in validating the model prediction's region-wise performance. All examples in Figures 2, 3, and 4 have relatively low pixel-wise errors for certain object classes, but the pixel-wise metrics alone do not allow for a proper or complete interpretation of the model's region-wise performance. As illustrated in Figure 2, a prediction should receive 0 values for pixel-wise error, ROM, and RUM if and only if it predicts perfect region-wise segmentation. A non-zero value of ROM/RUM indicates there are over-/under-segmentation issues within the prediction even if the pixel-wise errors are low.

4) When pixel-wise accuracy is low, ROM/RUM can explain the source of performance degradation and provide useful information for model selection. As illustrated in Figure 6 and Figure 7, the pixel-wise accuracy of predictions can be penalized to a similar degree while creating contrasting region-wise segmentation qualities: one may have major under-segmentation issues, while the other creates over-segmentation issues.

Overall, we demonstrated certain common model interrogation scenarios in which combining ROM/RUM with other pixel-wise measures allows researchers to select the most appropriate model for specific datasets or applications. Furthermore, interrogating models while using metrics conjointly may give rise to a more nuanced understanding of model performance for certain classes alone, or for certain models in multi-class correlated scenarios. In future work, we intend to explore how ROM/RUM can be used during the learning process of a model, in order to tailor a model for a specific use case. We intend to further study the interrelation between ROM and RUM, and to combine ROM/RUM with pixel-wise metrics applied in multi-class correlated settings. A principled approach in this direction can allow for comprehensive, simultaneous evaluation and interpretation of model performance in semantic segmentation.