
Stacked Conditional Generative Adversarial Networks for Jointly Learning Shadow Detection and Shadow Removal

Jifeng Wang∗, Xiang Li∗, Le Hui, Jian Yang
Nanjing University of Science and Technology
[email protected], {xiang.li.implus, le.hui, csjyang}@njust.edu.cn
∗ co-first author

arXiv:1712.02478v1 [cs.CV] 7 Dec 2017

Abstract

Understanding shadows from a single image spontaneously derives into two types of task in previous studies: shadow detection and shadow removal. In this paper, we present a multi-task perspective, not embraced by any existing work, to jointly learn both detection and removal in an end-to-end fashion that aims at enjoying the mutually improved benefits of each task. Our framework is based on a novel STacked Conditional Generative Adversarial Network (ST-CGAN), which is composed of two stacked CGANs, each with a generator and a discriminator. Specifically, a shadow image is fed into the first generator, which produces a shadow detection mask. That shadow image, concatenated with its predicted mask, goes through the second generator in order to recover its shadow-free image. In addition, the two corresponding discriminators are very likely to model higher-level relationships and global scene characteristics for the detected shadow region and for the reconstruction via shadow removal, respectively. More importantly, for multi-task learning, our stacked design provides a novel view that is notably different from the commonly used multi-branch version. To fully evaluate the performance of our proposed framework, we construct the first large-scale benchmark with 1870 image triplets (shadow image, shadow mask image, and shadow-free image) under 135 scenes. Extensive experimental results consistently show the advantages of ST-CGAN over several representative state-of-the-art methods on two large-scale publicly available datasets and our newly released one.

Figure 1. We propose an end-to-end stacked joint learning architecture for two tasks: shadow detection and shadow removal.

1. Introduction

Both shadow detection and shadow removal reveal their respective advantages for scene understanding. The accurate recognition of shadow areas (i.e., shadow detection) provides adequate clues about the light sources [25], illumination conditions [38, 39, 40], object shapes [37] and geometry information [19, 20]. Meanwhile, removing the presence of shadows in images (i.e., shadow removal) is of great interest for downstream computer vision tasks, such as efficient object detection and tracking [3, 32]. To this end, existing research basically follows one of the following pipelines for understanding shadows:

Detection only. In the history of shadow detection, a series of data-driven statistical learning approaches [15, 26, 49, 56, 22, 48] have been proposed. Their main objective is to find the shadow regions, in the form of an image mask that separates shadow and non-shadow areas.

Removal only. A list of approaches [7, 5, 55, 10, 46, 1, 52, 29, 43] simply skips the potential information gained from the discovery of shadow regions and directly produces the illumination attenuation effects on the whole image, also denoted as a shadow matte [43], to recover the image with shadows removed naturally.

Two stages for removal. Many shadow removal methods [11, 12, 23, 8, 50] generally include two separated steps: shadow localization, and shadow-free reconstruction that exploits the intermediate results in the awareness of shadow regions.

It is worth noting that the two targets, the shadow mask in detection and the shadow-free image in removal, essentially share a fundamental characteristic.
Figure 2. The architecture of the proposed ST-CGAN. It consists of two stacked CGANs, one for shadow detection and another for shadow removal, marked in different colors in the original diagram. The intermediate outputs are concatenated together as the subsequent components' input. The diagram's legend marks channel-wise concatenations, skip connections between mirror layers, and the fake and real pairs/triplets fed to each discriminator D1/D2 for the real/fake decision.
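To make the data flow of Figure 2 concrete, the following is a minimal PyTorch-style sketch of the stacked forward pass. The module names (G1, G2, D1, D2) mirror the figure; the layer configurations are given later in Tables 2 and 3, so the networks here are placeholders, and the snippet illustrates only the wiring rather than the authors' exact implementation.

```python
import torch

def st_cgan_forward(x, G1, G2, D1, D2):
    """Stacked forward pass of Figure 2 (illustrative wiring only).

    x: input shadow image, shape (N, 3, H, W).
    """
    mask = G1(x)                          # predicted shadow mask, (N, 1, H, W)
    x_mask = torch.cat([x, mask], dim=1)  # channel-wise concatenation -> 4 channels
    free = G2(x_mask)                     # predicted shadow-free image, (N, 3, H, W)

    # D1 judges (image, mask) pairs; D2 judges (image, mask, shadow-free) triplets.
    d1_fake = D1(x_mask)                              # 4-channel input
    d2_fake = D2(torch.cat([x, mask, free], dim=1))   # 7-channel input
    return mask, free, d1_fake, d2_fake
```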

As shown in Figure 1, the shadow mask is posed as a binary map that segments the original image into two types of region, whereas shadow removal mainly focuses on one of those types and needs to discover the semantic relationship between the two areas. This indicates strong correlations and possible mutual benefits between the two tasks.

Besides, most of the previous methods, for both shadow detection [15, 26, 49, 56, 22, 48] and removal [8, 52, 1], are heavily based on local region classification or low-level feature representations, failing to reason about the global scene semantic structure and illumination conditions. Consequently, a recent study [36] in shadow detection introduced a Conditional Generative Adversarial Network (CGAN) [33], which proved effective for global consistency. For shadow removal, Qu et al. [43] also proposed a multi-context architecture trained in an end-to-end manner, which maintains a global view of feature extraction.

Since no existing approach has explored the joint learning aspect of these two tasks, in this work we propose a STacked Conditional Generative Adversarial Network (ST-CGAN) framework that tackles the shadow detection and shadow removal problems simultaneously in an end-to-end fashion. Besides making full use of the potential mutual promotion between the two tasks, the global perceptions are well preserved through the stacked adversarial components. Further, our design of stacked modules is not only meant to achieve a multi-task purpose, but is also inspired by the connectivity pattern of DenseNet [14]: outputs of all preceding tasks are used as inputs for all subsequent tasks. Specifically, we construct ST-CGAN by stacking two generators along with two discriminators. In Figure 2, each generator takes every prior target of the tasks (including the input) and stacks them as its input. Similarly, each discriminator attempts to distinguish the concatenation of all the previous tasks' targets from the corresponding real ground-truth pairs or triplets.

Importantly, the design of the proposed stacked components offers a novel perspective for multi-task learning in the literature. Different from the commonly used multi-branch paradigm (e.g., Mask R-CNN [13], in which each individual task is assigned a branch), we stack all the tasks so that the network not only focuses on one task at a time in different stages, but also shares mutual improvements through forward/backward information flows. The multi-branch version, instead, aims to learn a shared embedding across tasks by simply aggregating the supervision from each individual task.

To validate the effectiveness of the proposed framework, we further construct a new large-scale Dataset with Image Shadow Triplets (ISTD), consisting of shadow, shadow mask and shadow-free images, to match the demands of multi-task learning. It contains 1870 image triplets under 135 distinct scenarios, of which 1330 are assigned for training and 540 for testing.

Extensive experiments on two large-scale publicly available benchmarks and our newly released dataset show that ST-CGAN performs favorably on both the detection and removal aspects, compared to several state-of-the-art methods. Further, we empirically demonstrate the advantages of our stacked joint formulation over the widely used multi-branch version for shadow understanding. To conclude, the main contributions of this work are listed as follows:

• It is the first end-to-end framework which jointly learns shadow detection and shadow removal, with superior performance on various datasets and on both tasks.

• A novel STacked Conditional Generative Adversarial Network (ST-CGAN) with a unique stacked joint learning paradigm is proposed to exploit the advantages of multi-task training for shadow understanding.

• The first large-scale shadow dataset containing image triplets of shadow, shadow mask and shadow-free images is publicly released.

2. Related Work

Shadow Detection. To improve the robustness of shadow detection on consumer photographs and web-quality images, a series of data-driven approaches [15, 26, 56] have been taken and proved effective. Recently, Khan et al. [22] first introduced deep Convolutional Neural Networks (CNNs) [45] to automatically learn features for shadow regions/boundaries, significantly outperforming the previous state-of-the-art. A multikernel model for shadow region classification was proposed by Vicente et al. [48]; it is efficiently optimized based on least-squares SVM leave-one-out estimates. More recent work by Vicente et al. [49] used a stacked CNN with separated steps, first generating an image-level shadow prior and then training a patch-based CNN that produces shadow masks for local patches. Nguyen et al. [36] presented the first application of adversarial training to shadow detection and developed a novel conditional GAN architecture with a tunable sensitivity parameter.

Shadow Removal. Early works are motivated by physical models of illumination and color. For instance, Finlayson et al. [5, 7] provide illumination-invariant solutions that work well only on high-quality images. Many existing approaches for shadow removal generally include two steps. For the removal part of these two-stage solutions, the shadow is erased either in the gradient domain [6, 35, 2] or in the image intensity domain [1, 11, 12, 8, 23]. On the contrary, a few works [46, 53, 42] recover the shadow-free image by intrinsic image decomposition and preclude the need for shadow prediction in an end-to-end manner; however, these methods suffer from altering the colors of the non-shadow regions. Qu et al. [43] further propose a multi-context architecture consisting of three levels of embedding networks (global localization, appearance modeling and semantic modeling) to explore shadow removal in an end-to-end and fully automatic framework.

CGAN and Stacked GAN. CGANs have achieved impressive results in various image-to-image translation problems, such as image super-resolution [27], image inpainting [41], style transfer [28] and domain adaptation/transfer [18, 57, 30]. The key to CGANs is the introduction of an adversarial loss with an informative conditioning variable, which forces the generated images to be of high quality and indistinguishable from real images. Besides, recent research has proposed variants of GAN that mainly explore stacked schemes. Zhang et al. [54] first put forward StackGAN to progressively produce photo-realistic image synthesis at considerably high resolution. Huang et al. [16] design a top-down stack of GANs, each learned to generate lower-level representations conditioned on higher-level representations, for the purpose of generating higher-quality images. Our proposed stacked form is distinct in essence from all the above variants.

Multi-task Learning. The learning hypothesis is biased to prefer a shared embedding learnt across multiple tasks. The widely adopted architecture of multi-task formulations is a shared component with multi-branch outputs, one for each individual task. For example, Mask R-CNN [13] and MultiNet [47] utilize three parallel branches for object classification, bounding-box regression and semantic segmentation, respectively. Misra et al. [34] propose the "cross-stitch" unit to learn shared representations from multiple supervisory tasks. In Multi-task Network Cascades [4], all tasks share convolutional features, whereas a later task also depends on the output of a preceding one.

3. A New Dataset with Image Shadow Triplets – ISTD

Existing publicly available datasets are all limited from the viewpoint of multi-task settings. Among them, SBU [51] and UCF [56] are prepared for shadow detection only, whilst SRD [43], UIUC [12] and LRSS [10] are constructed for the purpose of shadow removal.

Dataset        Amount   Content of Images                Type
SRD [43]       3088     shadow/shadow-free               pair
UIUC [12]      76       shadow/shadow-free               pair
LRSS [10]      37       shadow/shadow-free               pair
SBU [51]       4727     shadow/shadow mask               pair
UCF [56]       245      shadow/shadow mask               pair
ISTD (ours)    1870     shadow/shadow mask/shadow-free   triplet

Table 1. Comparisons with other popular shadow-related datasets. Ours is unique in content and type, whilst being of the same order of magnitude as the largest datasets in amount.

To facilitate the evaluation of shadow understanding methods, we have constructed a large-scale Dataset with Image Shadow Triplets called ISTD¹. It contains 1870 triplets of shadow, shadow mask and shadow-free images under 135 different scenarios.

¹The ISTD dataset is available at https://drive.google.com/file/d/1I0qw-65KBA6np8vIZzO6oeiOvcDBttAY/view?usp=sharing
Figure 3. An illustration of several shadow, shadow mask and shadow-free image triplets in ISTD.
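As an illustration of how such triplets can be consumed for multi-task training, below is a minimal PyTorch Dataset sketch. The directory names (shadow/, mask/, free/) and the matched-filename layout are assumptions made for the example, not the released dataset's actual structure.

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class ISTDTriplets(Dataset):
    """Loads (shadow, mask, shadow-free) triplets from parallel folders.

    Assumed layout (hypothetical): root/shadow/*.png, root/mask/*.png,
    root/free/*.png, with matching file names across the three folders.
    """
    def __init__(self, root, transform=None):
        self.root = root
        self.names = sorted(os.listdir(os.path.join(root, "shadow")))
        self.transform = transform  # joint transform applied to all three images

    def __len__(self):
        return len(self.names)

    def __getitem__(self, i):
        name = self.names[i]
        shadow = Image.open(os.path.join(self.root, "shadow", name)).convert("RGB")
        mask = Image.open(os.path.join(self.root, "mask", name)).convert("L")
        free = Image.open(os.path.join(self.root, "free", name)).convert("RGB")
        if self.transform is not None:
            shadow, mask, free = self.transform(shadow, mask, free)
        return shadow, mask, free
```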

To the best of our knowledge, ISTD is the first large-scale benchmark for the simultaneous evaluation of shadow detection and shadow removal. Detailed comparisons with previous popular datasets are listed in Table 1.

In addition, our proposed dataset covers a variety of properties in the following aspects:

• Illumination: The illumination difference between a shadow image and its shadow-free counterpart is minimized. When constructing the dataset, we set up a camera with a fixed exposure parameter to capture the shadow image, where the shadow is cast by an object. The occluder is then removed in order to obtain the corresponding shadow-free image. More evidence is given in the 1st and 3rd rows of Figure 3.

• Shapes: Various shapes of shadows are produced by different objects, such as umbrellas, boards, persons, twigs and so on. See the 2nd row of Figure 3.

• Scenes: 135 different types of ground materials (e.g., the 6th-8th columns of Figure 3) are utilized to cover as many complex backgrounds and different reflectances as possible.

4. Proposed Method

We propose STacked Conditional Generative Adversarial Networks (ST-CGANs), a novel stacked architecture that enables the joint learning of shadow detection and shadow removal, as shown in Figure 2. In this section, we first describe the formulation with its loss functions and training procedure, then present the network details of ST-CGAN, followed by a discussion.

4.1. STacked Conditional Generative Adversarial Networks

A Generative Adversarial Network (GAN) [9] consists of two players: a generator G and a discriminator D. The two players compete in a zero-sum game, in which the generator G aims to produce a realistic image given an input z sampled from a certain noise distribution. The discriminator D is forced to classify whether a given image is generated by G or is indeed a real one from the dataset. Hence, the adversarial competition progressively improves both players, whilst making G's generations hard for D to differentiate from the real data. Conditional Generative Adversarial Networks (CGANs) [33] extend GANs by introducing an additional observed input, named the conditioning variable, to both the generator G and the discriminator D.

Our ST-CGAN consists of two Conditional GANs, of which the second is stacked upon the first. For the first CGAN of ST-CGAN in Figure 2, both the generator G1 and the discriminator D1 are conditioned on the input RGB shadow image x. G1 is trained to output the corresponding shadow mask G1(z, x), where z is a randomly sampled noise vector. We denote the ground-truth shadow mask of x as y, to which G1(z, x) is supposed to be close. As a result, G1 needs to model the distribution p_data(x, y) of the dataset. The objective function for the first CGAN is:

\mathcal{L}_{\mathrm{CGAN}_1}(G_1, D_1) = \mathbb{E}_{x,y \sim p_{\mathrm{data}}(x,y)}[\log D_1(x, y)] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x),\, z \sim p_z(z)}[\log(1 - D_1(x, G_1(z, x)))].   (1)

We further eliminate the random variable z to obtain a deterministic generator G1, and thus Equation (1) simplifies to:

\mathcal{L}_{\mathrm{CGAN}_1}(G_1, D_1) = \mathbb{E}_{x,y \sim p_{\mathrm{data}}(x,y)}[\log D_1(x, y)] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log(1 - D_1(x, G_1(x)))].   (2)

Besides the adversarial loss, the classical data loss is adopted, which encourages a direct and accurate regression of the target:

\mathcal{L}_{\mathrm{data}_1}(G_1) = \mathbb{E}_{x,y \sim p_{\mathrm{data}}(x,y)} \|y - G_1(x)\|.   (3)

Further, for the second CGAN of Figure 2, applying similar formulations, we have:

\mathcal{L}_{\mathrm{data}_2}(G_2|G_1) = \mathbb{E}_{x,r \sim p_{\mathrm{data}}(x,r)} \|r - G_2(x, G_1(x))\|,   (4)

\mathcal{L}_{\mathrm{CGAN}_2}(G_2, D_2|G_1) = \mathbb{E}_{x,y,r \sim p_{\mathrm{data}}(x,y,r)}[\log D_2(x, y, r)] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log(1 - D_2(x, G_1(x), G_2(x, G_1(x))))],   (5)

where r denotes x's corresponding shadow-free image, G2 takes the combination of x and G1(x) as input, and D2 differentiates the concatenation of the outputs of G1 and G2, conditioned on x, from the real triplets.
Layer        #C in   #C out   before   after   link
Cv0          3/4     64       –        –       output → CvT11
Cv1          64      128      LReLU    BN      output → CvT10
Cv2          128     256      LReLU    BN      output → CvT9
Cv3          256     512      LReLU    BN      output → CvT8
Cv4 (×3)     512     512      LReLU    BN      output → CvT7
Cv5          512     512      LReLU    –       –
CvT6         512     512      ReLU     BN      –
CvT7 (×3)    1024    512      ReLU     BN      input ← Cv4
CvT8         1024    256      ReLU     BN      input ← Cv3
CvT9         512     128      ReLU     BN      input ← Cv2
CvT10        256     64       ReLU     BN      input ← Cv1
CvT11        128     1/3      ReLU     Tanh    input ← Cv0

Table 2. The architecture of generator G1/G2 of ST-CGAN. Cvi denotes a classic convolutional layer, whilst CvTi stands for a transposed convolutional layer that upsamples a feature map. Cv4 (×3) indicates that the Cv4 block is replicated twice more, i.e., three in total. "#C in" and "#C out" denote the number of input and output channels respectively; the entries "3/4" and "1/3" give the channel counts for G1/G2 respectively (G1 maps a 3-channel image to a 1-channel mask; G2 maps a 4-channel image-plus-mask input to a 3-channel image). "before" shows the layer immediately preceding a block and "after" gives the one directly following it. "link" lists the skip connections of the U-Net architecture [44], where the arrow gives the direction of connectivity; e.g., Cv0 → CvT11 bridges the output of Cv0, concatenated to the input of CvT11. LReLU is short for Leaky ReLU activation [31] and BN is an abbreviation of Batch Normalization [17].
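For concreteness, here is a hedged PyTorch sketch of one pair of mirrored encoder/decoder blocks from Table 2, under the usual pix2pix-style assumptions (4×4 kernels, stride 2) that Table 2 itself does not spell out; only the channel counts, activation order and skip link are taken from the table.

```python
import torch
import torch.nn as nn

# One encoder step of Table 2, e.g. Cv1: LReLU before, BN after.
# Kernel size and stride are assumptions (pix2pix-style 4x4, stride 2).
def cv_block(c_in, c_out):
    return nn.Sequential(
        nn.LeakyReLU(0.2),
        nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
    )

# One decoder step, e.g. CvT10: ReLU before, BN after, upsampling by 2.
def cvt_block(c_in, c_out):
    return nn.Sequential(
        nn.ReLU(),
        nn.ConvTranspose2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
    )

# The "link" column: the encoder output is concatenated channel-wise
# to the decoder input of its mirror layer (Cv1 -> CvT10 in Table 2).
enc = cv_block(64, 128)            # Cv1: 64 -> 128 channels
dec = cvt_block(256, 64)           # CvT10: 256 -> 64 channels

x = torch.randn(1, 64, 128, 128)   # feature map entering Cv1
skip = enc(x)                      # (1, 128, 64, 64)
up = torch.randn(1, 128, 64, 64)   # feature map arriving from CvT9
out = dec(torch.cat([up, skip], dim=1))  # (1, 64, 128, 128)
```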

Layer   #C in   #C out   before   after
Cv0     4/7     64       –        –
Cv1     64      128      LReLU    BN
Cv2     128     256      LReLU    BN
Cv3     256     512      LReLU    BN
Cv4     512     1        LReLU    Sigmoid

Table 3. The architecture of discriminator D1/D2 of ST-CGAN. Annotations are kept the same as in Table 2; the entry "4/7" gives the input channel counts for D1/D2 respectively.

Figure 4. An illustration of the information flows that indicate the mutual promotion between tasks in the proposed stacked scheme: a forward flow from task A to task B, and a backward (backpropagation) flow from task B to task A.
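A matching sketch of the discriminator in Table 3, again assuming 4×4 stride-2 convolutions (an assumption; Table 3 only fixes the channels and activations). D1 takes the 4-channel (image, mask) concatenation and D2 the 7-channel (image, mask, shadow-free) concatenation.

```python
import torch
import torch.nn as nn

def make_discriminator(c_in):
    """Five Cv layers as in Table 3; c_in = 4 for D1, 7 for D2."""
    return nn.Sequential(
        nn.Conv2d(c_in, 64, 4, stride=2, padding=1),   # Cv0
        nn.LeakyReLU(0.2),
        nn.Conv2d(64, 128, 4, stride=2, padding=1),    # Cv1
        nn.BatchNorm2d(128),
        nn.LeakyReLU(0.2),
        nn.Conv2d(128, 256, 4, stride=2, padding=1),   # Cv2
        nn.BatchNorm2d(256),
        nn.LeakyReLU(0.2),
        nn.Conv2d(256, 512, 4, stride=2, padding=1),   # Cv3
        nn.BatchNorm2d(512),
        nn.LeakyReLU(0.2),
        nn.Conv2d(512, 1, 4, stride=2, padding=1),     # Cv4
        nn.Sigmoid(),
    )

D1 = make_discriminator(4)   # judges (image, mask) pairs
D2 = make_discriminator(7)   # judges (image, mask, shadow-free) triplets
score = D1(torch.randn(1, 4, 256, 256))  # patch-wise real/fake scores
```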
We can now assemble the entire objective of the joint learning task, which results in solving a mini-max problem whose optimization seeks a saddle point:

\min_{G_1, G_2} \max_{D_1, D_2} \; \mathcal{L}_{\mathrm{data}_1}(G_1) + \lambda_1 \mathcal{L}_{\mathrm{data}_2}(G_2|G_1) + \lambda_2 \mathcal{L}_{\mathrm{CGAN}_1}(G_1, D_1) + \lambda_3 \mathcal{L}_{\mathrm{CGAN}_2}(G_2, D_2|G_1).   (6)

This is regarded as a two-player zero-sum game. The first player is a team consisting of the two generators (G1, G2); the second player is a team containing the two discriminators (D1, D2). To defeat the second player, the members of the first team are encouraged to produce outputs that are close to their corresponding ground-truths.

4.2. Network Architecture and Training Details

Generator. The generator is inspired by the U-Net architecture [44], which was originally designed for biomedical image segmentation. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. The detailed structure of G1/G2, similar to [18], is listed in Table 2.

Discriminator. D1 receives a pair of images as input, an original RGB scene image and a shadow mask image, concatenated into a 4-channel feature map. The channel dimensionality increases to 7 for D2, as it additionally accepts a shadow-free image. Table 3 gives more details of the two discriminators.

Training/Implementation settings. Our code is based on PyTorch [21]. We train ST-CGAN with the Adam solver [24], applying an alternating gradient update scheme: we first take a gradient ascent step to update D1, D2 with G1, G2 fixed, and then apply a gradient descent step to update G1, G2 with D1, D2 fixed. We initialize all the weights of ST-CGAN by sampling from a zero-mean normal distribution with standard deviation 0.2. During training, data augmentation is adopted by cropping (image size 286 → 256) and horizontal flipping. A practical setting of λ1 = 5, λ2 = 0.1, λ3 = 0.1 is used. The Binary Cross-Entropy (BCE) loss is assigned to the image mask regression objective, and the L1 loss is utilized for the shadow-free image reconstruction.

4.3. Discussion

The stacked term. The commonly used form of multi-task learning is the multi-branch version, which aims to learn a shared representation that is further utilized by each task in parallel. Figure 4 implies that our stacked design differs considerably: we conduct multi-task learning in such a way that each task can focus on its individual feature embedding, instead of a shared embedding across tasks, whilst the tasks still enhance each other through the stacked connections, in the form of forward/backward information flows. The following experiments also confirm the effectiveness of our architecture on the two tasks compared with the multi-branch one, as shown in Table 8.

The adversarial term. Moreover, Conditional GANs (CGANs) are able to effectively enforce higher-order consistencies and learn a joint distribution of image pairs or triplets. This confers an additional advantage on our method: unlike nearly all previous approaches, we implement our basic component as a CGAN and feed a stacked input into the adversarial networks.
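The alternating update of Eq. (6) can be sketched as follows. The loss weights and the BCE/L1 split follow the settings above; the label convention (1 = real, 0 = fake) and the assumption that generator and discriminator outputs lie in [0, 1] are implementation choices of this sketch, not details fixed by the paper.

```python
import torch
import torch.nn.functional as F

lam1, lam2, lam3 = 5.0, 0.1, 0.1  # weights from Section 4.2

def train_step(x, y, r, G1, G2, D1, D2, opt_g, opt_d):
    """One alternating update of Eq. (6). x: shadow image, y: ground-truth
    mask, r: ground-truth shadow-free image. Assumes G1's mask output and
    all discriminator outputs are already in [0, 1] (Table 2 ends in Tanh,
    so a rescale may be needed in practice)."""
    ones = lambda t: torch.ones_like(t)
    zeros = lambda t: torch.zeros_like(t)

    # --- Discriminator step: ascend the CGAN terms with G1, G2 fixed ---
    with torch.no_grad():
        mask = G1(x)
        free = G2(torch.cat([x, mask], 1))
    d1_real = D1(torch.cat([x, y], 1))
    d1_fake = D1(torch.cat([x, mask], 1))
    d2_real = D2(torch.cat([x, y, r], 1))
    d2_fake = D2(torch.cat([x, mask, free], 1))
    loss_d = (F.binary_cross_entropy(d1_real, ones(d1_real)) +
              F.binary_cross_entropy(d1_fake, zeros(d1_fake)) +
              F.binary_cross_entropy(d2_real, ones(d2_real)) +
              F.binary_cross_entropy(d2_fake, zeros(d2_fake)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # --- Generator step: data losses plus adversarial terms, D1, D2 fixed ---
    mask = G1(x)
    free = G2(torch.cat([x, mask], 1))
    adv1 = F.binary_cross_entropy(D1(torch.cat([x, mask], 1)), ones(d1_fake))
    adv2 = F.binary_cross_entropy(D2(torch.cat([x, mask, free], 1)), ones(d2_fake))
    loss_g = (F.binary_cross_entropy(mask, y) +   # L_data1: BCE for the mask
              lam1 * F.l1_loss(free, r) +         # L_data2: L1 for the image
              lam2 * adv1 + lam3 * adv2)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```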
Test set       Aspect (%)   StackedCNN [51]   cGAN [36]   scGAN [36]   ours
SBU [51]       Shadow       11.29             24.07       9.1          9.02
               Non-shadow   20.49             13.13       17.41        13.66
               BER          15.94             18.6        13.26        11.34
UCF [56]       Shadow       10.56             23.23       9.09         8.77
               Non-shadow   27.58             15.61       23.74        23.59
               BER          18.67             19.42       16.41        16.18
ISTD           Shadow       7.96              10.81       3.22         2.14
               Non-shadow   9.23              8.48        6.18         5.55
               BER          8.6               9.64        4.7          3.85

Table 4. Detection results (models trained on ISTD), quantified by BER and per-class error rates; smaller is better. For our proposed architecture, we use the image triplets of the ISTD training set. The models are tested on three datasets. The best and second best results are marked in red and blue colors, respectively.
Test set       Aspect (%)   StackedCNN [51]   cGAN [36]   scGAN [36]   ours
SBU [51]       Shadow       9.6               20.5        7.8          3.75
               Non-shadow   12.5              6.9         10.4         12.53
               BER          11.0              13.6        9.1          8.14
UCF [56]       Shadow       9.0               27.06       7.7          4.94
               Non-shadow   17.1              10.93       15.3         17.52
               BER          13.0              18.99       11.5         11.23
ISTD           Shadow       11.33             19.93       9.5          4.8
               Non-shadow   9.57              4.92        8.46         9.9
               BER          10.45             12.42       8.98         7.35

Table 5. Detection results (models trained on SBU), quantified by BER and per-class error rates; smaller is better. For our proposed architecture, we use the image pairs of the SBU training set together with shadow-free images roughly generated by the method of Guo et al. [12] to form training triplets. The best and second best results are marked in red and blue colors, respectively.

5. Experiments

To comprehensively evaluate the performance of our proposed method, we perform extensive experiments on a variety of datasets and evaluate ST-CGAN with both detection and removal measures.

5.1. Datasets

We mainly utilize two large-scale publicly available datasets², SBU [51] and UCF [56], along with our newly collected dataset ISTD.

SBU [51] has 4727 pairs of shadow and shadow mask images. Among them, 4089 pairs are used for training and the rest for testing.

UCF [56] has 245 shadow and shadow mask pairs in total, all of which are used for testing in the following experiments.

ISTD is our newly released dataset consisting of 1870 triplets, which is suitable for multi-task training. It is randomly divided into 1330 triplets for training and 540 for testing.

²Note that we do not include the large-scale SRD dataset in this work because it is currently unavailable for the authors' [43] personal reasons.

5.2. Compared Methods and Metrics

For the detection part, we compare ST-CGAN with the state-of-the-art StackedCNN [51], cGAN [36] and scGAN [36]. To evaluate shadow detection performance quantitatively, we follow the commonly used protocol [36] and compare the provided ground-truth masks with the predicted ones using the main evaluation metric, the Balance Error Rate (BER):

\mathrm{BER} = 1 - \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right),   (7)

along with separate per-pixel error rates for each class (shadow and non-shadow).

For the removal part, we use the publicly available source codes [12, 53, 8] as our baselines. In order to perform a quantitative comparison, we follow [12, 43] and use the root mean square error (RMSE) in LAB color space between the ground-truth shadow-free image and the recovered image as the measurement, evaluating the results on the whole image as well as on the shadow and non-shadow regions separately.

5.3. Detection Evaluation

For detection, we adopt the cross-dataset shadow detection schedule, similar to [36], to evaluate our method. We first train our proposed ST-CGAN on the ISTD training set.
6
Removal aspect   Original   Guo et al. [12]   Yang et al. [53]   Gong et al. [8]   ours
Shadow           32.67      18.95             19.82              14.98             10.33
Non-shadow       6.83       7.46              14.83              7.29              6.93
All              10.97      9.3               15.63              8.53              7.47

Table 6. Removal results on ISTD, quantified by RMSE; smaller is better. The original difference between the shadow and shadow-free images is reported in the "Original" column. We perform multi-task training on ISTD and compare against three state-of-the-art methods. The best and second best results are marked in red and blue colors, respectively.
Task type       Aspect       Ours    Ours (-D1)   Ours (-D2)   Ours (-G1-D1)   Ours (-G2-D2)
Removal         Shadow       10.33   10.36        10.38        12.12           –
                Non-shadow   6.93    6.96         7.03         7.45            –
                All          7.47    7.51         7.56         8.19            –
Detection (%)   Shadow       2.14    2.62         2.49         –               3.4
                Non-shadow   5.55    6.18         6.03         –               5.1
                BER          3.85    4.4          4.26         –               4.25

Table 7. Component analysis of ST-CGAN on ISTD, using RMSE for removal and BER for detection; smaller is better. Metrics for the shadow and non-shadow parts are also provided. The best and second best results are marked in red and blue colors, respectively.

The evaluations are then conducted on three datasets against three state-of-the-art approaches in Table 4. As can be seen, ST-CGAN outperforms StackedCNN and cGAN by a large margin. In terms of BER, we obtain a significant 14.4% error reduction on SBU and 18.1% on ISTD, respectively, compared to scGAN.

Next, we switch the training set to SBU's training data. Since our framework requires image triplets, which SBU cannot offer, we add a pre-processing step: to obtain the corresponding shadow-free images, we use the shadow removal code of [12] to generate them as coarse labels. We then test the trained models on the three datasets. Despite the inaccurate shadow-free ground-truths, our proposed framework still significantly improves the overall performance. Specifically, on the SBU test set, ST-CGAN achieves an obvious improvement, with a 10.5% error reduction in BER over the previous best record set by scGAN.

In Figure 5, we compare the detection results qualitatively. As shown in Figure 5 (a) and 5 (b), ST-CGAN is not easily fooled by low-brightness areas of the scene, compared to cGAN and scGAN. Our method is also precise in detecting shadows cast on bright areas, such as the line mark in Figure 5 (c) and 5 (d). The proposed ST-CGAN is able to detect more fine-grained shadow details (e.g., shadows of leaves) than the other methods, as shown in Figure 5 (e) and 5 (f).

5.4. Removal Evaluation

For removal, we compare our proposed ST-CGAN with three state-of-the-art methods on the ISTD dataset, as shown in Table 6, reporting RMSE values. We evaluate the performance of the different methods on the shadow regions, the non-shadow regions, and the whole image. The proposed ST-CGAN achieves the best performance among all the compared methods by a large margin. Notably, its error on the non-shadow region is very close to the original difference, which indicates a strong ability to distinguish the non-shadow part of an image. The advantage in removal also partially comes from the joint learning scheme, where the well-trained detection block provides clearer clues about shadow and shadow-free areas.

We also compare the removal results qualitatively. As shown in Figure 5, although Yang [53] can recover a shadow-free image, it alters the colors of both the shadow and non-shadow regions. Guo [11] and Gong [8] fail to detect shadows accurately, so both of their predictions are incomplete, especially in shadow regions. Moreover, due to the difficulty of determining the environmental illumination and global consistency, all the compared baseline models produce unsatisfactory results on the semantic regions.

5.5. Component Analysis of ST-CGAN

To illustrate the effects of the different components of ST-CGAN, we run a series of ablation experiments in which parts of the model are progressively removed. According to both the removal and detection performances in Table 7, we find that each individual component is necessary and indispensable for the final predictions. Moreover, the last two columns of Table 7 also demonstrate that, without the stacked joint learning, a single module consisting of one generator and one discriminator consistently performs worse. This further implies the effectiveness of our multi-task architecture for both shadow detection and shadow removal.

5.6. Stacked Joint vs. Multi-branch Learning

We further modify our architecture into a multi-branch version, where each branch is designed for one task respectively.
Figure 5. Comparison of shadow detection and removal results of different methods on the ISTD dataset. Rows (a)-(f) show different scenes; the columns give the input image, ground-truth mask, cGAN, scGAN and ours for detection, followed by the ground truth, Guo, Yang, Gong and ours for removal. Note that our proposed ST-CGAN simultaneously produces the detection and removal results, whilst the others address either shadow detection or shadow removal only.

Task type       Aspect       Multi-branch   Ours
Removal         Shadow       11.54          10.33
                Non-shadow   7.13           6.93
                All          7.84           7.47
Detection (%)   Shadow       2.34           2.14
                Non-shadow   7.2            5.55
                BER          4.77           3.85

Table 8. Comparisons between stacked joint learning (ours) and multi-branch learning, with removal and detection results on the ISTD dataset.

Figure 6. Illustrations of our stacked joint learning (top), with forward and backward information flows between task A and task B, and common multi-branch learning from a shared embedding (bottom).
Such a framework therefore aims to learn a shared embedding supervised by the two tasks, as shown at the bottom of Figure 6; for a clear explanation, an illustration comparing our scheme with the multi-branch one is also given there. With all other training settings fixed, we fairly compare our proposed ST-CGAN with the multi-branch version quantitatively on both the detection and removal measurements on the ISTD dataset. Table 8 reports that our stacked joint learning paradigm consistently outperforms the multi-branch version in every single aspect of the metrics.

6. Conclusion

In this paper, we have proposed the STacked Conditional Generative Adversarial Network (ST-CGAN) to jointly learn shadow detection and shadow removal. Our framework has at least four unique advantages: 1) it is the first end-to-end approach that tackles shadow detection and shadow removal simultaneously; 2) we design a novel stacked mode which densely connects all the tasks for the purpose of multi-task learning, which proves its effectiveness and suggests future extensions to other types of multiple tasks; 3) the stacked adversarial components are able to preserve the global scene characteristics hierarchically, leading to a fine-grained and natural recovery of shadow-free images; 4) ST-CGAN consistently improves the overall performance on both the detection and removal of shadows. Moreover, as an additional contribution, we publicly release the first large-scale dataset containing shadow, shadow mask and shadow-free image triplets.
References

[1] E. Arbel and H. Hel-Or. Shadow removal using intensity surfaces and texture anchor points. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33(6):1202–1216, 2011.
[2] J. T. Barron and J. Malik. Shape, illumination, and reflectance from shading. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 37(8):1670–1687, 2015.
[3] R. Cucchiara, C. Grana, M. Piccardi, A. Prati, and S. Sirotti. Improving shadow suppression in moving object detection with HSV color information. In IEEE Intelligent Transportation Systems (ITSC), pages 334–339, 2001.
[4] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[5] G. D. Finlayson, M. S. Drew, and C. Lu. Entropy minimization for shadow removal. International Journal of Computer Vision (IJCV), 85(1):35–57, 2009.
[6] G. D. Finlayson, S. D. Hordley, and M. S. Drew. Removing shadows from images. In European Conference on Computer Vision (ECCV), pages 823–836. Springer, 2002.
[7] G. D. Finlayson, S. D. Hordley, C. Lu, and M. S. Drew. On the removal of shadows from images. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 28(1):59–68, 2006.
[8] H. Gong and D. Cosker. Interactive shadow removal and ground truth for variable scene categories. In British Machine Vision Conference (BMVC). University of Bath, 2014.
[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.
[10] M. Gryka, M. Terry, and G. J. Brostow. Learning to remove soft shadows. ACM Transactions on Graphics (TOG), 34(5):153, 2015.
[11] R. Guo, Q. Dai, and D. Hoiem. Single-image shadow detection and removal using paired regions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2033–2040, 2011.
[12] R. Guo, Q. Dai, and D. Hoiem. Paired regions for shadow detection and removal. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(12):2956–2967, 2013.
[13] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2017.
[14] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
[15] X. Huang, G. Hua, J. Tumblin, and L. Williams. What characterizes a shadow boundary under the sun and sky? In IEEE International Conference on Computer Vision (ICCV), pages 898–905, 2011.
[16] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. arXiv preprint arXiv:1612.04357, 2016.
[17] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.
[18] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004, 2016.
[19] I. N. Junejo and H. Foroosh. Estimating geo-temporal location of stationary cameras using shadow trajectories. In European Conference on Computer Vision (ECCV), pages 318–331. Springer, 2008.
[20] K. Karsch, V. Hedau, D. Forsyth, and D. Hoiem. Rendering synthetic objects into legacy photographs. ACM Transactions on Graphics (TOG), 30(6):157, 2011.
[21] N. Ketkar. Introduction to PyTorch. In Deep Learning with Python, pages 195–208. Springer, 2017.
[22] S. H. Khan, M. Bennamoun, F. Sohel, and R. Togneri. Automatic feature learning for robust shadow detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1939–1946, 2014.
[23] S. H. Khan, M. Bennamoun, F. Sohel, and R. Togneri. Automatic shadow detection and removal from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(3):431–446, 2016.
[24] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[25] J.-F. Lalonde, A. A. Efros, and S. G. Narasimhan. Estimating natural illumination from a single outdoor image. In IEEE International Conference on Computer Vision (ICCV), pages 183–190, 2009.
[26] J.-F. Lalonde, A. A. Efros, and S. G. Narasimhan. Detecting ground shadows in outdoor consumer photographs. In European Conference on Computer Vision (ECCV), pages 322–335. Springer, 2010.
[27] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. arXiv preprint arXiv:1609.04802, 2016.
[28] C. Li and M. Wand. Precomputed real-time texture synthesis with Markovian generative adversarial networks. In European Conference on Computer Vision (ECCV), pages 702–716. Springer, 2016.
[29] F. Liu and M. Gleicher. Texture-consistent shadow removal. In European Conference on Computer Vision (ECCV), pages 437–450. Springer, 2008.
[30] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. arXiv preprint arXiv:1703.00848, 2017.
[31] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013.
[32] I. Mikic, P. C. Cosman, G. T. Kogut, and M. M. Trivedi. Moving shadow and object detection in traffic scenes. In International Conference on Pattern Recognition (ICPR), volume 1, pages 321–324. IEEE, 2000.
[33] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[34] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[35] A. Mohan, J. Tumblin, and P. Choudhury. Editing soft shadows in a digital photograph. IEEE Computer Graphics and Applications, 27(2):23–31, 2007.
[36] V. Nguyen, T. F. Yago Vicente, M. Zhao, M. Hoai, and D. Samaras. Shadow detection with conditional generative adversarial networks. In IEEE International Conference on Computer Vision (ICCV), pages 4510–4518, 2017.
[37] T. Okabe, I. Sato, and Y. Sato. Attached shadow coding: Estimating surface normals from shadows under unknown reflectance and lighting conditions. In IEEE International Conference on Computer Vision (ICCV), pages 1693–1700, 2009.
[38] A. Panagopoulos, D. Samaras, and N. Paragios. Robust shadow and illumination estimation using a mixture model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 651–658, 2009.
[39] A. Panagopoulos, C. Wang, D. Samaras, and N. Paragios. Illumination estimation and cast shadow detection through a higher-order graphical model. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 673–680, 2011.
[40] A. Panagopoulos, C. Wang, D. Samaras, and N. Paragios. Simultaneous cast shadows, illumination and geometry inference using hypergraphs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 35(2):437–449, 2013.
[41] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros. Context encoders: Feature learning by inpainting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2536–2544, 2016.
[42] L. Qu, J. Tian, Z. Han, and Y. Tang. Pixel-wise orthogonal decomposition for color illumination invariant and shadow-free image. Optics Express, 23(3):2220–2239, 2015.
[43] L. Qu, J. Tian, S. He, Y. Tang, and R. W. Lau. DeshadowNet: A multi-context embedding deep network for shadow removal. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[44] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pages 234–241. Springer, 2015.
[45] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[46] M. F. Tappen, W. T. Freeman, and E. H. Adelson. Recovering intrinsic images from a single image. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 27(9):1459–1472, 2005.
[47] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun. MultiNet: Real-time joint semantic reasoning for autonomous driving. arXiv preprint arXiv:1612.07695, 2016.
[48] Y. Vicente, F. Tomas, M. Hoai, and D. Samaras. Leave-one-out kernel optimization for shadow detection. In IEEE International Conference on Computer Vision (ICCV), pages 3388–3396, 2015.
[49] Y. Vicente, F. Tomas, M. Hoai, and D. Samaras. Noisy label recovery for shadow detection in unfamiliar domains. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3783–3792, 2016.
[50] Y. Vicente, F. Tomas, M. Hoai, and D. Samaras. Leave-one-out kernel optimization for shadow detection and removal. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), PP(99):1–1, 2017.
[51] Y. Vicente, F. Tomas, L. Hou, C.-P. Yu, M. Hoai, and D. Samaras. Large-scale training of shadow detectors with noisily-annotated shadow examples. In European Conference on Computer Vision (ECCV), pages 816–832. Springer, 2016.
[52] T.-P. Wu, C.-K. Tang, M. S. Brown, and H.-Y. Shum. Natural shadow matting. ACM Transactions on Graphics (TOG), 26(2):8, 2007.
[53] Q. Yang, K.-H. Tan, and N. Ahuja. Shadow removal using bilateral filtering. IEEE Transactions on Image Processing (TIP), 21(10):4361–4368, 2012.
[54] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.
[55] L. Zhang, Q. Zhang, and C. Xiao. Shadow remover: Image shadow removal based on illumination recovering optimization. IEEE Transactions on Image Processing (TIP), 24(11):4623–4636, 2015.
[56] J. Zhu, K. G. Samuel, S. Z. Masood, and M. F. Tappen. Learning to recognize shadows in monochromatic natural images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 223–230, 2010.
[57] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
