Two at Once: Enhancing Learning and Generalization Capacities Via IBN-Net
1 Introduction
Fig. 1. (a) visualizes two example images (left) and their segmentation maps (right) selected from Cityscapes [2] and GTA5 [23] respectively. These samples have similar categories and scene configurations, as their segmentation maps show, but their images come from different domains, i.e. reality and virtuality. (b) shows simple appearance variations, while complex appearance variations are shown in (c). (d) shows that Instance Normalization (IN) is able to filter out complex appearance variance. The style transfer network used here is AdaIN [14]. (Best viewed in color)
is due to the appearance gap between the images of these two datasets, as shown
in Fig.1 (a).
A natural way to bridge the appearance gap is transfer learning. For instance, by finetuning a CNN pretrained on Cityscapes with data from GTA5, we can adapt the features learned on Cityscapes to GTA5 and increase accuracy. But even so, the appearance gap is not eliminated: when the finetuned CNN is applied back to Cityscapes, its accuracy degrades significantly. How to address a large diversity of appearances through the design of deep architectures is a key challenge in computer vision.
The answer is to induce appearance invariance into CNNs. This solution is obvious but non-trivial. For example, there are many ways to obtain spatial invariance in deep networks, such as max pooling [17] and deformable convolution [3], which are invariant to spatial variations like poses, viewpoints, and scales, but not to variations of image appearance. As shown in Fig. 1(b), when the appearance variations between two datasets are simple and known beforehand, such as lighting and infrared, they can be reduced by explicit data augmentation. However, as shown in Fig. 1(c), when the appearance variations are complex and unknown, such as arbitrary image styles and virtuality, CNNs have to learn to reduce them, which requires introducing new components into their deep architectures.
To this end, we present IBN-Net, a novel convolutional architecture that learns to capture and eliminate appearance variance while maintaining the discrimination of the learned features.
Fig. 2. (a) Feature divergence calculated from image sets with appearance difference
(blue) and content difference (orange). We show the results of the 17 features after the
residual blocks of ResNet50. The detailed definition of feature divergence is given in
Section 4.3. The orange bars are enlarged 10 times for better visualization.
2 Related Works
Previous work related to IBN-Net is described in three aspects: invariance in CNNs, network architectures, and methods for domain adaptation and generalization.
Invariance in CNNs. Several modules [17,3,25,30,15] have been proposed to improve a CNN's modeling capacity or to reduce overfitting and thus enhance its generalization on a single domain. These methods typically achieve this by introducing specific kinds of invariance into the architectures of CNNs. For example, max pooling [17] and deformable convolution [3] introduce spatial invariance, increasing robustness to spatial variations such as affine, distortion, and viewpoint transformations. Dropout [25] and batch normalization (BN) [15] can be treated as regularizers that reduce the effect of sample noise during training. As for image appearance, simple variations such as color or brightness shifts can be removed by normalizing each RGB channel of an image with its mean and standard deviation. For more complex appearance transforms such as style transformations, recent studies have found that such information can be encoded in the mean and variance of the hidden feature maps [5,14]. Therefore, the instance normalization (IN) [30] layer shows potential to eliminate such appearance differences.
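To make this concrete, the following is a minimal sketch (in PyTorch, purely illustrative and not taken from any particular implementation) of what an IN-style normalization computes on a batch of feature maps: each sample's per-channel mean and variance, which according to [5,14] encode contrast and style, are removed.

```python
# Minimal sketch of instance normalization on feature maps (illustrative only).
import torch

def instance_normalize(x, eps=1e-5):
    # x: feature maps of shape (N, C, H, W); statistics are per sample and per channel
    mean = x.mean(dim=(2, 3), keepdim=True)
    var = x.var(dim=(2, 3), keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps)

x = torch.randn(4, 64, 32, 32)
y = instance_normalize(x)
print(y.mean(dim=(2, 3)).abs().max())  # close to 0: per-instance statistics removed
```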
CNN Architectures. Since CNNs have shown compelling modeling capacity over traditional methods, their architectures have gone through a number of developments. Among the most widely used is the residual network (ResNet) [8], which uses shortcut connections to alleviate the training difficulties of very deep networks. Since then a number of ResNet variants have been proposed. Compared to ResNet, ResNeXt [31] improves modeling capacity by increasing the 'cardinality' of ResNet, implemented with group convolutions; in practice, increasing cardinality increases runtime in modern deep learning frameworks. The squeeze-and-excitation network (SENet) [12] introduces channel-wise attention into ResNet and achieves better performance on ImageNet, but it also increases the number of parameters and computations. The recently proposed densely connected network (DenseNet) [13] uses concatenation to replace shortcut connections and was shown to be more efficient than ResNet.
However, these CNN architectures share two limitations. First, the limited set of basic modules prevents them from gaining more appealing properties: all of them are composed of convolutions, BNs, ReLUs, and poolings, and differ only in how these modules are organized, yet the composition of these layers is inherently vulnerable to appearance variations. Second, their design goal is strong modeling capacity on a single task in a single domain, while their capacity to generalize to new domains remains limited.
In the field of image style transfer, some methods employ IN to help remove image contrast [30,5,14], which helps the models transfer images to different styles. However, the appearance-invariance property of IN has not been successfully introduced to the aforementioned CNNs, especially for high-level tasks such as image classification or semantic segmentation. This is because IN drops useful content information present in the hidden features, impeding modeling capacity, as shown in [30].
Improving Performance across Domains. Alleviating the performance drop caused by the appearance gap between domains is an important problem. One natural approach is transfer learning, e.g. finetuning the model on the target domain. However, this requires human annotations of the target domain, and the performance of the finetuned model drops when it is applied back to the source domain. A number of domain adaptation approaches use the statistics of the target domain to facilitate adaptation. Most of these works reduce the feature divergence between the two domains through carefully designed loss functions, such as maximum mean discrepancy (MMD) [29,19], correlation alignment (CORAL) [26], and adversarial losses [28,11]. Besides, [24] and [10] use generative adversarial networks (GAN) to transfer images between two domains to help adaptation, but require independent models for the two domains. AdaBN [18] provides a simple approach for domain adaptation by adjusting the statistics of all BN
layers using data from the target domain. Our method does not rely on any
specific target domain and there is no need to adjust any parameters. There are
two main limitations in transfer learning and domain adaptation. First, in real
applications it is expensive and difficult to collect data that covers all possible
scenarios in the target domain. Second, most state-of-the-art methods employ
different model weights for the source and target domains in order to improve
performance. But the ideal case is that one model could adapt to all domains.
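For illustration, the sketch below outlines the AdaBN idea [18] mentioned above; the procedure shown is our assumption of a typical realization, not the original authors' code.

```python
# Rough sketch of AdaBN-style adaptation (assumed procedure): re-estimate every BN
# layer's running statistics on target-domain data while keeping learned weights fixed.
import torch
import torch.nn as nn

def adapt_bn_statistics(model, target_loader, device="cpu"):
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.reset_running_stats()          # forget the source-domain statistics
    model.train()                            # BN updates running stats in train mode
    with torch.no_grad():                    # forward passes only, no weight updates
        for images, _ in target_loader:
            model(images.to(device))
    model.eval()
```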
Another paradigm for this problem is domain generalization, which aims to acquire knowledge from a number of related source domains and apply it to a new target domain whose statistics are unknown during training. Existing methods typically learn domain-agnostic representations or design models that capture common aspects across domains, such as [16,20,6]. However, for real applications it is often hard to acquire data from a number of related source domains, and the performance depends strongly on the chosen source domains.
In this work, we increase modeling capacity and generalization across domains by designing a new CNN architecture, IBN-Net. Unlike existing domain adaptation and generalization methods, we require neither target domain data nor multiple related source domains. The improved generalization across domains is achieved by designing architectures with built-in appearance invariance. Our method is particularly useful when target domain data are unobtainable, where traditional domain adaptation cannot be applied. For a more detailed comparison of our method with related works, please refer to the supplementary material.
3 Method
3.1 Background
Batch normalization [15] enables larger learning rates and faster convergence by reducing the internal covariate shift during the training of CNNs. It uses the mean and variance of a mini-batch to normalize each feature channel during training, while at inference BN normalizes features with global statistics. Experiments have shown that BN significantly accelerates training and can also improve the final performance. It has become a standard component in most prevalent CNN architectures such as Inception [27], ResNet [8], and DenseNet [13].
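For concreteness, the following brief sketch (standard PyTorch usage, not specific to IBN-Net) shows the two phases just described.

```python
# BN uses mini-batch statistics during training and stored global statistics at inference.
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64)
x = torch.randn(8, 64, 16, 16)

bn.train()
_ = bn(x)        # normalizes with this mini-batch's mean/variance and
                 # updates running_mean / running_var
bn.eval()
y = bn(x)        # normalizes with the accumulated global statistics instead
```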
Unlike batch normalization, instance normalization [30] uses the statistics of an individual sample rather than the mini-batch to normalize features. Another important difference between IN and BN is that IN applies the same normalization procedure for both training and inference. Instance normalization has mainly been used in the style transfer field [30,5,14]. The reason for IN's success in style transfer and similar tasks is that these tasks try to change image appearance while preserving content, and IN allows filtering out instance-specific contrast
information from the content. Despite these successes, IN has not shown benefits for high-level vision tasks such as image classification and semantic segmentation. Ulyanov et al. [30] made a preliminary attempt to adopt IN for image classification, but obtained worse results than CNNs with BN.
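By contrast, the following sketch (again standard PyTorch behavior) illustrates that an IN layer applies the same per-sample normalization in both phases.

```python
# IN keeps using per-sample statistics at inference, unlike BN above.
import torch
import torch.nn as nn

inorm = nn.InstanceNorm2d(64)           # no running statistics are tracked by default
x = torch.randn(8, 64, 16, 16)

inorm.train(); y_train = inorm(x)
inorm.eval();  y_eval = inorm(x)
print(torch.allclose(y_train, y_eval))  # True: identical normalization in both phases
```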
In short, batch normalization preserves discrimination between individual samples but also makes CNNs vulnerable to appearance transforms, while instance normalization eliminates individual contrast but diminishes useful information at the same time. Both methods have their limitations. In order to introduce appearance invariance to CNNs without hurting feature discrimination, we carefully unify them in a single deep hierarchy.
First, as [9] pointed out, a clean identity path is essential for optimizing ResNet, so we add IN to the residual path rather than the identity path. Second, in the residual learning function y = F(x, {Wi}) + x, the residual function F(x, {Wi}) is learned to align with x in the identity path; therefore IN is applied to the first normalization layer rather than the last to avoid misalignment. Third, the half-BN, half-IN scheme follows our second design rule discussed above. This gives rise to our instance-batch normalization network (IBN-Net).
This design pursues model capacity. On one hand, the IN layers enable the model to learn appearance-invariant features so that it can better utilize images with high appearance diversity within one dataset. On the other hand, the IN layers are added in a moderate way so that content-related information is well preserved. We denote this model as IBN-Net-a. To make full use of IN's potential for generalization, in this work we also study another version, IBN-Net-b. Since appearance information can be preserved in either the residual path or the identity path, we add IN right after the addition operation, as shown in Fig. 3(c). To avoid deteriorating the optimization of ResNet, we only add three IN layers, after the first convolution layer (conv1) and the first two convolution groups (conv2_x, conv3_x).
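The sketch below illustrates, in PyTorch-style code, one way to realize these two designs; the layer widths and the surrounding bottleneck layout are illustrative assumptions rather than an exact copy of the released models.

```python
# Illustrative sketch of the IBN-Net building blocks (layer widths and block layout
# are assumptions for clarity, not the exact released configuration).
import torch
import torch.nn as nn

class IBN(nn.Module):
    """IBN-a style layer: IN on the first half of the channels, BN on the rest."""
    def __init__(self, planes):
        super().__init__()
        self.half = planes // 2
        self.IN = nn.InstanceNorm2d(self.half, affine=True)
        self.BN = nn.BatchNorm2d(planes - self.half)

    def forward(self, x):
        a, b = torch.split(x, [self.half, x.size(1) - self.half], dim=1)
        return torch.cat([self.IN(a), self.BN(b)], dim=1)

class BottleneckIBNa(nn.Module):
    """Residual bottleneck whose first normalization layer is IBN (IBN-Net-a).
    IBN-Net-b instead keeps plain BNs inside the block and places an IN right
    after the addition (and only after conv1, conv2_x, conv3_x)."""
    def __init__(self, planes):
        super().__init__()
        self.conv1 = nn.Conv2d(planes * 4, planes, kernel_size=1, bias=False)
        self.ibn = IBN(planes)                       # half IN, half BN
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)
        self.conv3 = nn.Conv2d(planes, planes * 4, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes * 4)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.ibn(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + x)                    # identity path stays clean
```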
Variants of IBN-Net.
The two types of IBN-Net described above are not the only ways to utilize IN and BN in CNNs. In the experiments we also study some interesting variants, as shown in Fig. 4. For example, to keep both generalizable and discriminative features, another natural idea is to feed the feature to both IN and BN layers and then concatenate their outputs, as in Fig. 4(a), but this introduces more parameters. The idea of keeping two kinds of features can also be applied to IBN-b, giving rise to Fig. 4(b). We may also combine these schemes, as in Fig. 4(c)(d). These variants are discussed in the experiments section.
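As an illustration of the concatenation idea in Fig. 4(a), the following sketch (details assumed; not the exact variant configuration) passes the same feature through IN and BN in parallel and concatenates the results, which doubles the normalized channels and hence adds parameters downstream.

```python
# Rough sketch of the concatenation variant (assumed realization).
import torch
import torch.nn as nn

class ConcatINBN(nn.Module):
    def __init__(self, planes):
        super().__init__()
        self.IN = nn.InstanceNorm2d(planes, affine=True)
        self.BN = nn.BatchNorm2d(planes)

    def forward(self, x):
        return torch.cat([self.IN(x), self.BN(x)], dim=1)  # (N, 2*planes, H, W)
```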
Table 1. Results on ImageNet validation set with appearance transforms. The perfor-
mance drops are given in brackets.
4 Experiments
We evaluate IBN-Net on classification and semantic segmentation, using the ImageNet and Cityscapes-GTA5 datasets respectively. In both tasks, we study our models' modeling capacity within one dataset and their generalization under appearance transforms.
Table 2. Results of IBN-Net over other CNNs on ImageNet validation set. The perfor-
mance gains are shown in the brackets. More detailed descriptions of these IBN-Nets
are provided in the supplementary material.
Table 3. Results of IBN-Net variants on ImageNet validation set and Monet style set.
Model         | origin top1/top5 err. | Monet top1/top5 err.
ResNet50      | 24.26/7.08 | 54.51/29.32 (30.24/22.24)
IBN-Net50-a   | 22.54/6.32 | 51.57/27.15 (29.03/20.83)
IBN-Net50-b   | 23.64/6.86 | 50.45/25.22 (26.81/18.36)
IBN-Net50-c   | 22.78/6.32 | 51.83/27.09 (29.05/20.77)
IBN-Net50-d   | 22.86/6.48 | 50.80/26.16 (27.94/19.68)
IBN-Net50-a&d | 22.89/6.48 | 51.27/26.64 (28.38/20.16)
IBN-Net50-a×2 | 22.81/6.46 | 51.95/26.98 (29.14/20.52)
Table 6. Results on Cityscapes-GTA dataset. Mean IoU for both within domain eval-
uation and cross domain evaluation is reported.
Table 7. Comparison with domain adaptation methods. Note that our method does
not use target data to help adaptation.
performance is increased by 8.5% from Cityscapes to GTA5 and 7.5% for the
opposite direction.
Comparison with domain adaptation methods. It should be mentioned that our method operates under a different setting from domain adaptation works. Domain adaptation is target-domain oriented and requires target domain data during training, while our method does not. Even so, we show that the performance gain of our method is comparable to those of domain adaptation methods, as Table 7 shows. Our approach takes an important step towards more generalizable models, since we introduce built-in appearance invariance into the model instead of forcing it to fit a specific data domain.
Finetune on Cityscapes. Another common approach to applying a model to a new data domain is to finetune it with a small amount of target domain annotations. Here we show that with our more generalizable model, the data required for finetuning can be significantly reduced. We finetune the models pretrained on the GTA5 dataset with different amounts of Cityscapes data and labels. The initial learning rate and the number of epochs are set to 0.003 and 80, respectively. As Table 8 shows, with only 30% of the Cityscapes training data, IBN-Net50-a outperforms ResNet50 finetuned on all the data.
Denote D(F_i^A || F_i^B) as the symmetric KL divergence of the i-th channel; then the average feature divergence of the layer is

D(L_A \| L_B) = \frac{1}{C} \sum_{i=1}^{C} D(F_i^A \| F_i^B)    (3)
where C is the number of channels in the layer. This metric measures the distance between the feature distributions of domain A and domain B.
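For concreteness, the following numerical sketch evaluates Eq. (3) under the illustrative assumption that each channel's activations are summarized by a one-dimensional Gaussian, so that the symmetric KL divergence has a closed form; the precise definition remains the one given in Section 4.3.

```python
# Numerical sketch of Eq. (3), assuming per-channel 1-D Gaussian statistics.
import numpy as np

def sym_kl_gaussian(mu_a, var_a, mu_b, var_b):
    kl_ab = np.log(np.sqrt(var_b / var_a)) + (var_a + (mu_a - mu_b) ** 2) / (2 * var_b) - 0.5
    kl_ba = np.log(np.sqrt(var_a / var_b)) + (var_b + (mu_b - mu_a) ** 2) / (2 * var_a) - 0.5
    return kl_ab + kl_ba

def feature_divergence(feats_a, feats_b):
    # feats_*: activations of one layer, shape (N, C, H, W), collected for each domain
    mu_a, var_a = feats_a.mean(axis=(0, 2, 3)), feats_a.var(axis=(0, 2, 3))
    mu_b, var_b = feats_b.mean(axis=(0, 2, 3)), feats_b.var(axis=(0, 2, 3))
    return sym_kl_gaussian(mu_a, var_a, mu_b, var_b).mean()   # average over C channels
```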
To capture the effects of instance normalization on appearance information and content information, we consider three groups of domains. The first two groups are "Cityscapes-GTA5" and "photo-Monet", which differ in complex appearance. To build two domains with different contents, we split the ImageNet-1k validation set into two parts, the first containing images of 500 object categories and the second containing those of the other 500 categories. We then calculate the feature divergence of the 17 ReLU layers on the main path of ResNet50 and IBN-Net50. The results are shown in Fig. 5.
It can be seen from Fig. 5(a)(b) that in our IBN-Net, the feature divergence caused by appearance difference is significantly reduced. For IBN-Net-a the divergence decreases moderately, while for IBN-Net-b it drops sharply after the IN layers at positions 2, 4, and 8. This effect persists into deep layers where no IN is added, which implies that the variance encoding appearance is reduced in deep features, so that its interference with classification is reduced. On the other hand, the feature divergence caused by content difference does not drop in IBN-Net, as Fig. 5(c) shows, indicating that the content information in the features is well preserved by the BN layers.
Fig. 5. Feature divergence caused by (a) real-virtual appearance gap, (b) style gap, (c) object class difference.
Discussions. These results give us an intuition of how IBN-Net gains stronger generalization. By introducing IN layers to CNNs in a clever and moderate way,
they could work in a manner that helps to filter out the appearance variance
within features. In this way the models’ robustness to appearance transforms is
improved, as shown in our experiments.
Note that generalization and modeling capacity are not independent properties. On one hand, appearance invariance intuitively helps the model adapt to training data with high appearance diversity and extract their common aspects. On the other hand, even within one dataset an appearance gap exists between the training and testing sets, in which case stronger generalization also improves performance. These could be the reasons for the stronger modeling capacity of IBN-Net.
5 Conclusions
In this work we propose IBN-Net, which carefully unifies instance normalization and batch normalization layers in a single deep network to increase both modeling and generalization capacity. We show that IBN-Net achieves consistent improvements over a number of classic CNNs, including VGG, ResNet, ResNeXt, and SENet, on the ImageNet dataset. Moreover, the built-in appearance invariance introduced by IN helps our model generalize across image domains even without the use of target domain data. Our work clarifies the roles of IN and BN layers in CNNs: IN introduces appearance invariance and improves generalization, while BN preserves content information in discriminative features.
References
1. Chen, L., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.: Deeplab: Semantic
image segmentation with deep convolutional nets, atrous convolution, and fully
connected crfs. TPAMI (2017)
2. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,
Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene
understanding. CVPR (2016)
3. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convo-
lutional networks. ICCV (2017)
4. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale
hierarchical image database. CVPR (2009)
5. Dumoulin, V., Shlens, J., Kudlur, M.: A learned representation for artistic style.
ICLR (2017)
6. Ghifary, M., Bastiaan Kleijn, W., Zhang, M., Balduzzi, D.: Domain generalization
for object recognition with multi-task autoencoders. ICCV (2015)
7. Gross, S., Wilber, M.: Training and investigating residual nets. https://github.com/facebook/fb.resnet.torch (2016)
8. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
CVPR (2016)
9. He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks.
ECCV (2016)
10. Hoffman, J., Tzeng, E., Park, T., Zhu, J.Y., Isola, P., Saenko, K., Efros, A.A., Dar-
rell, T.: Cycada: Cycle-consistent adversarial domain adaptation. arXiv preprint
arXiv:1711.03213 (2017)
11. Hoffman, J., Wang, D., Yu, F., Darrell, T.: Fcns in the wild: Pixel-level adversarial
and constraint-based adaptation. arXiv preprint arXiv:1612.02649 (2016)
12. Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. arXiv preprint
arXiv:1709.01507 (2017)
13. Huang, G., Liu, Z., Weinberger, K.Q., van der Maaten, L.: Densely connected
convolutional networks. CVPR (2017)
14. Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance
normalization. ICCV (2017)
15. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by
reducing internal covariate shift. ICML (2015)
16. Khosla, A., Zhou, T., Malisiewicz, T., Efros, A.A., Torralba, A.: Undoing the
damage of dataset bias. ECCV (2012)
17. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep con-
volutional neural networks. NIPS (2012)
18. Li, Y., Wang, N., Shi, J., Liu, J., Hou, X.: Revisiting batch normalization for
practical domain adaptation. arXiv preprint arXiv:1603.04779 (2016)
19. Long, M., Cao, Y., Wang, J., Jordan, M.: Learning transferable features with deep
adaptation networks. ICML (2015)
20. Muandet, K., Balduzzi, D., Schölkopf, B.: Domain generalization via invariant
feature representation. ICML (2013)
21. Pan, X., Shi, J., Luo, P., Wang, X., Tang, X.: Spatial as deep: Spatial cnn for
traffic scene understanding. AAAI (2018)
22. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object de-
tection with region proposal networks. NIPS (2015)
23. Richter, S.R., Vineet, V., Roth, S., Koltun, V.: Playing for data: Ground truth
from computer games. ECCV (2016)
24. Sankaranarayanan, S., Balaji, Y., Jain, A., Lim, S.N., Chellappa, R.: Unsuper-
vised domain adaptation for semantic segmentation with gans. arXiv preprint
arXiv:1711.06969 (2017)
25. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.:
Dropout: A simple way to prevent neural networks from overfitting. The Jour-
nal of Machine Learning Research (2014)
26. Sun, B., Saenko, K.: Deep coral: Correlation alignment for deep domain adaptation.
ECCV (2016)
27. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D.,
Vanhoucke, V., Rabinovich, A., et al.: Going deeper with convolutions. CVPR
(2015)
28. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain
adaptation. CVPR (2017)
29. Tzeng, E., Hoffman, J., Zhang, N., Saenko, K., Darrell, T.: Deep domain confusion:
Maximizing for domain invariance. arXiv preprint arXiv:1412.3474 (2014)
30. Ulyanov, D., Vedaldi, A., Lempitsky, V.: Improved texture networks: Maximiz-
ing quality and diversity in feed-forward stylization and texture synthesis. CVPR
(2017)
31. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations
for deep neural networks. CVPR (2017)
32. Zhang, Y., David, P., Gong, B.: Curriculum domain adaptation for semantic seg-
mentation of urban scenes. ICCV (2017)
33. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation
using cycle-consistent adversarial networks. ICCV (2017)