Robustly Disentangled Causal Mechanisms

Department of Computer Science, ETH Zurich, Switzerland; MPI for Intelligent Systems, Tübingen, Germany. Correspondence to: Raphael Suter <[email protected]>, Stefan Bauer <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
Abstract

The ability to learn disentangled representations that split underlying sources of variation in high dimensional, unstructured data is important for data efficient and robust use of neural networks. While various approaches aiming towards this goal have been proposed in recent times, a commonly accepted definition and validation procedure is missing. We provide a causal perspective on representation learning which covers disentanglement and domain shift robustness as special cases. Our causal framework allows us to introduce a new metric for the quantitative evaluation of deep latent variable models. We show how this metric can be estimated from labeled observational data and further provide an efficient estimation algorithm that scales linearly in the dataset size.

1. Introduction

Learning deep representations in which different semantic aspects of data are structurally disentangled is of central importance for training robust machine learning models. Separating independent factors of variation could pave the way for successful transfer learning and domain adaptation (Bengio et al., 2013). Imagine the example of a robot learning multiple tasks by interacting with its environment. For data efficiency, the robot can learn a generic representation architecture that maps its high dimensional sensory data to a collection of general, compact features describing its surroundings. For each task, only a subset of features will be required. If the robot is instructed to grasp an object, it must know the shape and the position of the object; its color, however, is irrelevant. On the other hand, when pointing to all red objects is demanded, only position and color are required. Having a disentangled representation, where each feature captures only one factor of variation, allows the robot to build separate (simple) models for each task based on only a relevant and stable subselection of these generically learned features. We argue that robustness of the learned representation is a crucial property when this is attempted in practice. It has been proposed that features should be selected based on their robustness or invariance across tasks (e.g., Rojas-Carulla et al., 2018); we hence do not want them to be affected by changes in any other factor. In our example, the robot assigned the grasping task should be able to build a model using features that describe the shape and position of the object well. For this model to be robust, however, these features must not be affected by changing color (or any other nuisance factor).

It is striking that despite the recent popularity of disentangled representation learning approaches, a commonly accepted definition and validation metric is missing (Higgins et al., 2018). We view disentanglement as a property of a causal process (Spirtes et al., 1993; Pearl, 2009) responsible for the data generation, as opposed to only a heuristic characteristic of the encoding. Concretely, we call a causal process disentangled when the parents of the generated observations do not affect each other (i.e., there is no total causal effect between them (Peters et al., 2017, Definition 6.12)). We call these parents elementary ingredients. In the example above, we view color and shape as elementary ingredients, as both can be changed without affecting each other. Still, there can be dependencies between them if, for example, our experimental setup is confounded by the capabilities of the 3D printers that are used to create the objects (e.g., certain shapes can only be printed in some colors).

Combining these disentangled causal processes with the encoding allows us to study interventional effects on feature representations and estimate them from observational data. This is of interest when benchmarking disentanglement approaches based on ground truth data (Locatello et al., 2018) or when evaluating the robustness of a deep representation w.r.t. known nuisance factors (e.g., domain changes). In the example of robotics, knowledge about the generative factors (e.g., the color, shape, weight, etc. of an object to grasp) is often available and can be controlled in experiments.
We will start by giving an overview of previous work on finding disentangled representations and how they have been validated in Section 2. In Section 3 we introduce our framework for the joint treatment of the disentangled causal process and its learned representation. We introduce our notion of interventional effects on encodings and the resulting interventional robustness score in Section 4 and show how this score can be estimated from observational data with an efficient $O(N)$ algorithm in Section 5. Section 6 provides experimental evidence on a standard disentanglement benchmark dataset supporting the need for a robustness based disentanglement criterion.

OUR CONTRIBUTIONS:

• We introduce a unifying causal framework of disentangled generative processes and consequent feature encodings. This perspective allows us to introduce a novel validation metric, the interventional robustness score.
• We show how this metric can be estimated from observational data and provide an efficient algorithm that scales linearly in the dataset size.
• Our extensive experiments on a standard benchmark dataset show that our robustness based validation is able to discover vulnerabilities of deep representations that have remained undetected by existing work.
• Motivated by this metric, we additionally present a new visualisation technique which provides an intuitive understanding of dependency structures and robustness of learned encodings.

NOTATION:

We denote the generative factors of high dimensional observations $X$ as $G$. The latent variables learned by a model, e.g., a variational auto-encoder (VAE) (Kingma & Welling, 2014), are denoted as $Z$. We use the notation $E(\cdot)$ to describe the encoding, which in the case of VAEs corresponds to the posterior mean of $q_\phi(z|x)$. Capital letters denote random variables, and lower case letters observations thereof. Subindices $Z_J$ for a set $J$ or $Z_j$ for a single index $j$ denote the selected components of a multidimensional variable. A backslash $Z_{\setminus J}$ denotes all components except those in $J$.

2. Related Work

In the framework of variational auto-encoders (VAEs) (Kingma & Welling, 2014) the (high dimensional) observations $x$ are modelled as being generated from latent features $z$ with a chosen prior $p(z)$ according to the probabilistic model $p_\theta(x|z)p(z)$. The generative model $p_\theta(x|z)$ as well as the approximate posterior $q_\phi(z|x)$ can be estimated using neural networks by maximizing the variational lower bound (ELBO) of $\log p(x_1, \ldots, x_N)$:

$$\mathcal{L}_{\mathrm{VAE}} = \sum_{i=1}^{N} \mathbb{E}_{q_\phi(z|x^{(i)})}\big[\log p_\theta(x^{(i)}|z)\big] - D_{\mathrm{KL}}\big(q_\phi(z|x^{(i)}) \,\|\, p(z)\big). \quad (1)$$

This objective function a priori does not encourage much structure on the latent space (except some similarity to the chosen prior $p(z)$, which is usually an isotropic Gaussian). More precisely, for a given encoder $E$ and decoder $D$, any bijective transformation $g$ of the latent space $z = E(x)$ yields the same reconstruction $\hat{x} = D(g^{-1}(g(E(x)))) = D(E(x))$.

Various proposals for regularization that imposes more structure have been made, either with some form of supervision (e.g. Siddharth et al., 2017; Bouchacourt et al., 2017; Liu et al., 2017; Mathieu et al., 2016; Cheung et al., 2014) or completely unsupervised (e.g. Higgins et al., 2017; Kim & Mnih, 2018; Chen et al., 2018; Kumar et al., 2018; Esmaeili et al., 2018). Higgins et al. (2017) proposed the $\beta$-VAE, which penalizes the Kullback-Leibler divergence (KL) term in the VAE objective (1) more strongly and thereby encourages similarity to the factorized prior distribution. Others used techniques to encourage statistical independence between the different components of $Z$, e.g., FactorVAE (Kim & Mnih, 2018) or $\beta$-TCVAE (Chen et al., 2018), similar to independent component analysis (e.g. Comon, 1994). With disentangling the inferred prior (DIP-VAE), Kumar et al. (2018) proposed encouraging factorization of $q_\phi(z) = \int q_\phi(z|x)\,p(x)\,dx$.

A special form of structure in the latent space which has gained a lot of attention in recent times is referred to as disentanglement (Bengio et al., 2013). This term encompasses the understanding that each learned feature in $Z$ should represent structurally different aspects of the observed phenomena (i.e., capture different sources of variation).

Various methods to validate a learned representation for disentanglement based on known ground truth generative factors $G$ have been proposed (e.g. Eastwood & Williams, 2018; Ridgeway & Mozer, 2018; Chen et al., 2018; Kim & Mnih, 2018). While a universal definition of disentanglement is missing, the most widely accepted notion is that one feature $Z_i$ should capture information about only one generative factor (Eastwood & Williams, 2018; Ridgeway & Mozer, 2018). This has for example been expressed via the mutual information of a single latent dimension $Z_i$ with the generative factors $G_1, \ldots, G_K$ (Ridgeway & Mozer, 2018), where in the ideal case each $Z_i$ has some mutual information with one generative factor $G_k$ but none with all the others. Similarly, Eastwood & Williams (2018) trained predictors (e.g., Lasso or random forests) for a generative factor $G_k$ based on the representation $Z$. In a disentangled model, each dimension $Z_i$ is only useful (i.e., has high feature importance) for predicting one of those factors (see Appendix D for details).
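To make the predictor-based validation concrete, here is a minimal sketch (not the authors' implementation, and only loosely related to the metric detailed in Appendix D) that builds such a feature-importance matrix with scikit-learn random forests; the arrays `Z` (encodings), `G` (ground truth factors) and the toy example are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def importance_matrix(Z, G, n_estimators=50, seed=0):
    """Relative importance of each latent Z_i for predicting each factor G_k.

    A random forest is trained to predict G_k from the full representation Z;
    its feature importances form column k of the matrix, in the spirit of the
    predictor-based validation of Eastwood & Williams (2018).
    """
    n_latents, n_factors = Z.shape[1], G.shape[1]
    R = np.zeros((n_latents, n_factors))
    for k in range(n_factors):
        rf = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
        rf.fit(Z, G[:, k])
        R[:, k] = rf.feature_importances_
    return R

# Toy check: a latent space that is a noisy permutation of the factors
# should yield an (approximately) permutation-like importance matrix.
rng = np.random.default_rng(0)
G = rng.uniform(size=(2000, 3))                           # ground truth factors
Z = G[:, [2, 0, 1]] + 0.05 * rng.normal(size=(2000, 3))   # disentangled encoding
print(np.round(importance_matrix(Z, G), 2))
```

In a disentangled representation each row of this matrix concentrates its mass on a single column; entangled representations spread importance across several factors.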
Validation without known generative factors is still an open research question, and so far it is not possible to quantitatively validate disentanglement in an unsupervised way. The community has been using "latent traversals" (i.e., changing one latent dimension and subsequently re-generating the image) for visual inspection when supervision is not available (see e.g. Chen et al., 2018). This can be used to uncover physically meaningful interpretations of each dimension.

3. Causal Model

We will first consider assumptions for the causal process underlying the data generating mechanism. Following this, we discuss the consequences of trying to match encodings $Z$ with causal factors $G$ in a deep latent variable model.

3.1. Disentangled Causal Model

As opposed to previous approaches that defined disentanglement heuristically as a property of the learned latent space, we take a step back and first introduce a notion of disentanglement on the level of the true causal mechanism (or data generation process). Subsequently, we can use this definition to better understand a learned probabilistic model for latent representations and evaluate its properties.

We assume to be given a set of observations from a (potentially high dimensional) random variable $X$. In our model, the data generating process is described by $K$ causes of variation (generative factors) $G = [G_1, \ldots, G_K]$ (i.e., $G \to X$) that do not cause each other. These factors $G$ are generally assumed to be unobserved and are the objects of interest when doing deep representation learning. In particular, knowledge about $G$ could be used to build lower dimensional predictive models that do not rely on the (unstructured) $X$ itself. This could be classic prediction of a label $Y$, often in the "confounded" direction (i.e., predicting effects from other effects) if $G \to (X, Y)$, or in the anti-causal direction if $Y \to G \to X$. It is also relevant in a domain change setting when we know that the domain $S$ has an impact on $X$, i.e., $(S, G) \to X$.

Having these potential use cases in mind, we assume the generative factors themselves to be confounded by a (multidimensional) $C$, which can for example include a potential label $Y$ or source $S$. Hence, the resulting causal model $C \to G \to X$ allows for statistical dependencies between latent variables $G_i$ and $G_j$, $i \neq j$, when they are both affected by a certain label, i.e., $G_i \leftarrow Y \to G_j$.

However, a crucial assumption of our model is that these latent factors should represent elementary ingredients of the causal mechanism generating $X$ (to be defined below), which can be thought of as descriptive features of $X$ that can be changed without affecting each other (i.e., there is no causal effect between them). A similar assumption on the underlying model is likewise a key requirement for the recent extension of identifiability results for non-linear ICA (Hyvarinen et al., 2018). We formulate this assumption of a disentangled causal model as follows (see also Figure 1):

Definition 1 (Disentangled Causal Process). Consider a causal model for $X$ with generative factors $G$, described by the mechanisms $p(x|g)$, where $G$ could generally be influenced by $L$ confounders $C = (C_1, \ldots, C_L)$. This causal model for $X$ is called disentangled if and only if it can be described by a structural causal model (SCM) (Pearl, 2009) of the form

$$C \leftarrow N_c$$
$$G_i \leftarrow f_i(PA_i^C, N_i), \quad PA_i^C \subset \{C_1, \ldots, C_L\}, \quad i = 1, \ldots, K$$
$$X \leftarrow g(G, N_x)$$

with functions $f_i$, $g$ and jointly independent noise variables $N_c, N_1, \ldots, N_K, N_x$. Note that $\forall i \neq j: G_i \not\to G_j$.

In practice we assume that the dimensionality of the confounding $L$ is significantly smaller than the number of factors $K$.

Figure 1. Disentangled Causal Mechanism: This graphical model encompasses our assumptions on a disentangled causal model. $C$ stands for a confounder, $G = (G_1, G_2, \ldots, G_K)$ are the generative factors (or elementary ingredients) and $X$ the observed quantity. In general, there can be multiple confounders affecting a range of elementary ingredients each.

This definition reflects our understanding of the elementary ingredients $G_i$, $i = 1, \ldots, K$, of the causal process. Each ingredient should work on its own and be changeable without affecting the others. This reflects the independent mechanisms (IM) assumption (Schölkopf et al., 2012). Independent mechanisms as components of causal models allow intervention on one mechanism without affecting the other modules and thus correspond to the notion of independently controllable factors in reinforcement learning (Thomas et al., 2017). Our setting is broader, describing any causal process and inheriting the generality of the notion of IM, pertaining to autonomy, invariance and modularity (Peters et al., 2017).

Based on this view of the data generation process, we can prove (see Appendix B) the following observations, which will help us discuss notions of disentanglement and deep latent variable models:
In general, the generative factors are statistically dependent,

$$G_i \not\perp\!\!\!\perp G_j, \quad i \neq j.$$

Only if we condition on the confounders in the data generating process are they independent themselves,

$$G_i \perp\!\!\!\perp G_j \mid C \quad \forall i \neq j.$$

(c) Knowing which observation of $X$ we obtained renders the different latent causes dependent, i.e.,

$$G_i \not\perp\!\!\!\perp G_j \mid X.$$

(d) The latent factors $G$ already contain all information about the confounders $C$ that is relevant for $X$, i.e.,

$$I(X; G) = I(X; (G, C)) \geq I(X; C),$$

where $I$ denotes the mutual information.

(e) There is no total causal effect from $G_j$ to $G_i$ for $j \neq i$; i.e., intervening on $G_j$ does not change $G_i$:

$$\forall g_j^\triangle: \quad p\big(g_i \mid do(G_j \leftarrow g_j^\triangle)\big) = p(g_i) \;\big(\neq p(g_i \mid g_j^\triangle)\big).$$

(f) The remaining components of $G$, i.e., $G_{\setminus j}$, are a valid adjustment set (Pearl, 2009) to estimate interventional effects from $G_j$ to $X$ based on observational data, i.e.,

$$p\big(x \mid do(G_j \leftarrow g_j^\triangle)\big) = \int p(x \mid g_j^\triangle, g_{\setminus j})\, p(g_{\setminus j}) \, dg_{\setminus j}.$$

(g) If there is no confounding, conditioning is sufficient to obtain the post-interventional distribution of $X$:

$$p\big(x \mid do(G_j \leftarrow g_j^\triangle)\big) = p(x \mid g_j^\triangle).$$

Figure 2. We assume that the data are generated by a process involving a set of unknown independent mechanisms $G_i$ (which may be confounded by other processes, see Figure 1). In the simplest case, disentangled representation learning aims to recover variables $Z_i$ that capture the independent mechanisms $G_i$ in the sense that they (i) represent the information contained in the $G_i$ and (ii) respect the causal generative structure of $G \to X$ in an interventional sense: in particular, for any $i$, localized interventions on another cause $G_j$ ($j \neq i$) should not affect $Z_i$. In practice, there need not be a direct correspondence between $G_i$ and $Z_i$ variables (e.g., multiple latent variables may jointly represent one cause), hence our definitions deal with sets of factors rather than individual ones. Note that in the unsupervised setting, we know neither $G$ nor the mapping from $G$ to $X$ (we do know, however, the "decoder" mapping from $Z$ to $X$, not shown in this picture). In experimental evaluations of disentanglement, however, such knowledge is usually assumed.

3.2. Disentangled Latent Variable Model

We can now understand generative models with latent variables (e.g., the decoder $p_\theta(x|z)$ in VAEs) as models for the causal mechanism in (a), and the inferred latent space through $q_\phi(z|x)$ as a proxy for the generative factors $G$. Property (d) gives hope that under an adequate information bottleneck we can indeed recover information about the causal parents and not the confounders. Ideally, we would hope for a one-to-one correspondence of $Z_i$ to $G_i$ for all $i = 1, \ldots, K$. In some situations it might be useful to learn multiple latent dimensions for one causal factor for a more natural description, e.g., describing an angle $\theta$ as $\cos(\theta)$ and $\sin(\theta)$ (Ridgeway & Mozer, 2018). Hence, we will generally allow the encodings $Z$ to be $K'$ dimensional, where usually $K' \geq K$. The $\beta$-VAE (Higgins et al., 2017) encourages factorization of $q_\phi(z|x)$ through penalization of the KL to its prior $p(z)$. Due to property (c), other approaches were introduced that make use of statistical independence (Kim & Mnih, 2018; Chen et al., 2018; Kumar et al., 2018). Esmaeili et al. (2018) allow dependence within groups of variables in a hierarchical model (i.e., with some form of confounding, where property (b) becomes an issue) by specifically modelling groups of dependent latent encodings. In contrast to the above mentioned approaches, this requires prior knowledge of the generative structure. We will make use of property (f) to solve the task of using observational data to evaluate deep latent variable models for disentanglement and robustness. Figure 2 illustrates our causal perspective on representation learning, which encompasses the data generating process ($G \to X$) as well as the subsequent encoding through $E(\cdot)$ ($X \to Z$). Based on this viewpoint, we define the interventional effect of a group of generative factors $G_J$ on the implied latent space encodings $Z_L$ with proxy posterior $q_\phi(z|x)$ from a VAE, where $J \subset \{1, \ldots, K\}$ and $L \subset \{1, \ldots, K'\}$, as

$$p\big(z_L \mid do(G_J \leftarrow g_J^\triangle)\big) := \int q_\phi(z_L \mid x)\, p\big(x \mid do(G_J \leftarrow g_J^\triangle)\big)\, dx.$$

This definition is consistent with the above graphical model, as it implies that $p(z_L \mid x, do(G_J \leftarrow g_J^\triangle)) = q_\phi(z_L \mid x)$.
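To illustrate how such post-interventional quantities can be read off labeled observational data, the following is a rough sketch (not the paper's estimator) of the mean encoding $E[Z \mid do(G_J \leftarrow g_J)]$ in the unconfounded case of property (g), where the do-distribution reduces to conditioning on $G_J = g_J$; the array layout, the discrete factor values and the exact-match selection are simplifying assumptions for this toy setting.

```python
import numpy as np

def post_interventional_mean(Z, G, J, g_J):
    """Estimate E[Z | do(G_J <- g_J)] by conditioning; valid without confounding (property (g)).

    Z : (N, K') array of encodings E(x) for each observation.
    G : (N, K)  array of ground-truth generative factors (discrete here).
    J : indices of the intervened factors; g_J : the values they are set to.
    """
    mask = np.all(G[:, J] == np.asarray(g_J), axis=1)
    if not mask.any():
        raise ValueError("no observations with G_J = g_J in the dataset")
    return Z[mask].mean(axis=0)

# Toy usage: two discrete factors, a two-dimensional disentangled encoding.
rng = np.random.default_rng(1)
G = rng.integers(0, 3, size=(5000, 2))
Z = np.stack([G[:, 0] + 0.1 * rng.normal(size=5000),    # Z_0 tracks G_0
              G[:, 1] + 0.1 * rng.normal(size=5000)],   # Z_1 tracks G_1
             axis=1)
print(post_interventional_mean(Z, G, J=[1], g_J=[2]))
# roughly [1.0, 2.0]: Z_0 is unaffected by the intervention, Z_1 follows it.
```

With confounding, the same quantity would instead require the adjustment over $G_{\setminus J}$ given by property (f) rather than plain conditioning.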
We can use PIDA to evaluate the robustness of a selected feature set $Z_L$ against such domain shifts. In particular,

$$\mathrm{IRS}(L \mid \{1, \ldots, K\} \setminus \{S\}, \{S\})$$

quantifies how robust $Z_L$ is when changes in $G_S$ occur. If we are building a model predicting a label $Y$ based on some (to be selected) feature set $L$, we can use this score to make a trade-off between robustness and predictive power. For example, we could use the best performing set of features among all those that satisfy a given robustness threshold.

14: initialize mpida(k) ← 0.0
15: for l = 1, ..., N_{I,J}^{(k)} do
16:     meanint ← E[Z_L | do(G_I ← g_I^{(k)}, G_J ← g_J^{(l)})] using Eq. (7) and samples D_{I,J}^{(k,l)} for estimation
17:     compute pida ← d(mean, meanint)
18:     update mpida(k) ← max(mpida(k), pida)
19: end for
20: end for
21: Return empida ← Σ_{k=1}^{N_I} (|D_I^{(k)}| / |D|) · mpida(k)
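For intuition, the following sketch mirrors the aggregation in the algorithm fragment above under simplifying assumptions: the distance d is taken to be the Euclidean norm between post-interventional mean encodings, those means are estimated by plain conditioning (i.e., assuming no confounding), and all names (`empida`, `Z`, `G`, the index sets) are illustrative rather than the paper's exact Eq. (7) estimator.

```python
import numpy as np

def empida(Z, G, I, J, L):
    """Worst-case post-interventional disagreement, averaged over realizations of G_I.

    For each observed realization g_I of the factors in I, compare the mean encoding
    of Z_L with the means obtained when the nuisance factors in J are additionally set
    to each of their observed values, keep the maximum distance (cf. lines 14-19 above),
    and average weighted by the empirical frequency of g_I (cf. line 21).
    """
    def cond_mean(mask):
        return Z[mask][:, L].mean(axis=0)

    score, N = 0.0, len(Z)
    for g_I in np.unique(G[:, I], axis=0):
        mask_k = np.all(G[:, I] == g_I, axis=1)
        mean_ref = cond_mean(mask_k)
        mpida = 0.0
        for g_J in np.unique(G[mask_k][:, J], axis=0):
            mask_kl = mask_k & np.all(G[:, J] == g_J, axis=1)
            mpida = max(mpida, np.linalg.norm(mean_ref - cond_mean(mask_kl)))
        score += mask_k.sum() / N * mpida
    return score

# Toy usage: Z_0 is robust to the nuisance factor G_1, a "leaky" Z_1 is not.
rng = np.random.default_rng(2)
G = rng.integers(0, 4, size=(8000, 2))
Z = np.stack([G[:, 0] + 0.05 * rng.normal(size=8000),
              G[:, 0] + 0.5 * G[:, 1]], axis=1)
print(empida(Z, G, I=[0], J=[1], L=[0]))   # small: Z_0 ignores G_1
print(empida(Z, G, I=[0], J=[1], L=[1]))   # large: Z_1 shifts with G_1
```

The per-realization maximum is what lets worst-case (rare or cumulative) effects surface, in contrast to averaged quantities such as mutual information.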
6. Experiments
Our evaluations involve five different state-of-the-art unsupervised disentanglement techniques (classic VAE, $\beta$-VAE, DIP-VAE, FactorVAE and $\beta$-TCVAE), each learning 10 features.
Figure 4. Visualising Interventional Robustness: Plots of $E[Z_l \mid g_{i^*}, do(G_j \leftarrow g_j^\triangle)]$ as a function of $g_j^\triangle$, with a different $G_j$ per column, as explained in Section 6.3. The upper row is an example of good, robust disentanglement ($Z_3$ from the DIP model discussed in Figure 6). The lower row illustrates $Z_6$, which is classified as well disentangled according to FI (top 18%) and MI (top 33%) but still has a low robustness score (bottom 4%). This stems from the fact that even though $Z_6$ is very informative about scale (almost a linear function in expectation), its value can still be changed remarkably by switching any of posX, posY or orientation. These additional dependencies are not discovered by mutual information (or feature importance) due to the higher noise in these relationships (see Figure 7) and because they are partly hidden in cumulated effects.
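As a loose illustration of this visualisation (not the authors' plotting code), the sketch below draws one such panel: the estimated $E[Z_l \mid g_{i^*}, do(G_j \leftarrow g_j)]$ against the intervened value $g_j$, with one line per realization of $G_{i^*}$; as in the earlier sketches, interventions are estimated by simple conditioning on discrete factors, and all names are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_interventional_curves(Z, G, l, i_star, j, ax):
    """One panel: E[Z_l | g_{i*}, do(G_j <- g_j)] vs. g_j, one line per value of G_{i*}."""
    for g_i in np.unique(G[:, i_star]):
        xs, ys = [], []
        for g_j in np.unique(G[:, j]):
            mask = (G[:, i_star] == g_i) & (G[:, j] == g_j)
            if mask.any():
                xs.append(g_j)
                ys.append(Z[mask, l].mean())
        ax.plot(xs, ys, marker="o", label=f"G_{i_star} = {g_i}")
    ax.set_xlabel(f"intervened value of G_{j}")
    ax.set_ylabel(f"E[Z_{l}]")

# Toy usage: a feature that tracks G_0 and ignores G_1 gives horizontal lines.
rng = np.random.default_rng(3)
G = rng.integers(0, 4, size=(6000, 2))
Z = (G[:, 0] + 0.05 * rng.normal(size=6000)).reshape(-1, 1)
fig, ax = plt.subplots()
plot_interventional_curves(Z, G, l=0, i_star=0, j=1, ax=ax)
ax.legend()
plt.show()
```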
This visualisation is motivated by interventional robustness and illustrates how robust a learned feature is with respect to changes in nuisance factors. Figure 4 illustrates this approach on two features learned by the DIP model. Each row corresponds to a different feature $Z_l$. The upper row corresponds to a well disentangled and robust feature ($Z_3$) which gets classified as such by all three metrics. The lower one ($Z_6$) also obtains a high FI and MI score; however, IRS correctly discovers that this feature is not robust. This illustrates a case where having a robustness perspective on disentanglement is important. The columns correspond to the different generative factors $G_j$ (shape, scale, orientation, posX, posY) which potentially influence $Z_l$. For each latent variable $Z_l$ we first find the generative factor $G_{i^*}$ which is most related to it by choosing the maximizer of Eq. (3) (i.e., the factor that renders $Z_l$ most invariant). In column $i^*$ we then plot the estimate of $E[Z_l \mid g_{i^*}]$ together with its confidence bound in order to visualize the informativeness of $Z_l$ about $G_{i^*}$. For example, the upper row in Figure 4 corresponds to $Z_3$ in the DIP model and mostly relates to posY. This is why we plot the dependence of $Z_3$ on posY in the fifth column. The remaining columns then illustrate how $Z_3$ changes when interventions on the other generative factors are made, even though posY is kept at a fixed value. Each line with a different color corresponds to a particular value posY can take on. More generally speaking, in the $j$th column we plot $E[Z_l \mid g_{i^*}, do(G_j \leftarrow g_j^\triangle)]$ as a function of $g_j^\triangle$ for all possible realizations $g_{i^*}$ of $G_{i^*}$. All values with constant $g_{i^*}$ are connected with a line. For a robustly disentangled feature, we would expect all of these colored lines to be horizontal (i.e., there is no more dependency on any $G_j$ after accounting for $G_{i^*}$). As such visualizations can provide a much more in-depth understanding of learned representations than single numbers, we provide the full plots of various models in Appendix F.

7. Conclusion

We have proposed a framework for assessing disentanglement in deep representation learning which combines the generative process responsible for high dimensional observations with the subsequent feature encoding by a neural network. This perspective leads to a natural validation method, the interventional robustness score. We show how it can be estimated from observational data using an efficient algorithm that scales linearly in the dataset size. As special cases, the proposed measure captures robust disentanglement and domain shift stability. Extensive evaluations showed that existing metrics do not capture the effects that rare events or cumulative influences from multiple generative factors can have on feature encodings, while our robustness based validation metric discovers such vulnerabilities.

We envision that the notion of interventional effects on encodings may give rise to the development of novel, robustly disentangled representation learning algorithms, for example in the interactive learning environment (Thomas et al., 2017) or when weak forms of supervision are available (Bouchacourt et al., 2017; Locatello et al., 2019). The exploration of those ideas, especially including confounding, is left for future research.
References

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

Besserve, M., Sun, R., and Schölkopf, B. Counterfactuals uncover the modular structure of deep generative models. arXiv preprint arXiv:1812.03253, 2018.

Bouchacourt, D., Tomioka, R., and Nowozin, S. Multi-level variational autoencoder: Learning disentangled representations from grouped observations. arXiv preprint arXiv:1705.08841, 2017.

Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599, 2018.

Chen, T. Q., Li, X., Grosse, R., and Duvenaud, D. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.

Cheung, B., Livezey, J. A., Bansal, A. K., and Olshausen, B. A. Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583, 2014.

Comon, P. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.

Eastwood, C. and Williams, C. K. I. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.

Esmaeili, B., Wu, H., Jain, S., Narayanaswamy, S., Paige, B., and Van de Meent, J.-W. Hierarchical disentangled representations. arXiv preprint arXiv:1804.02086, 2018.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2017.

Higgins, I., Amos, D., Pfau, D., Racaniere, S., Matthey, L., Rezende, D., and Lerchner, A. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Koller, D., Friedman, N., and Bach, F. Probabilistic graphical models: principles and techniques. MIT Press, 2009.

Kumar, A., Sattigeri, P., and Balakrishnan, A. Variational inference of disentangled latent concepts from unlabeled observations. In International Conference on Learning Representations, 2018.

Liu, Y.-C., Yeh, Y.-Y., Fu, T.-C., Chiu, W.-C., Wang, S.-D., and Wang, Y.-C. F. Detach and adapt: Learning cross-domain disentangled deep representation. arXiv preprint arXiv:1705.01314, 2017.

Locatello, F., Bauer, S., Lucic, M., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.

Locatello, F., Tschannen, M., Bauer, S., Rätsch, G., Schölkopf, B., and Bachem, O. Disentangling factors of variation using few labels. arXiv preprint arXiv:1905.01258, 2019.

Mathieu, M. F., Zhao, J. J., Zhao, J., Ramesh, A., Sprechmann, P., and LeCun, Y. Disentangling factors of variation in deep representation using adversarial training. In Neural Information Processing Systems, pp. 5040–5048, 2016.

Pearl, J. Causality. Cambridge University Press, 2009.

Peters, J., Janzing, D., and Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017.

Ridgeway, K. and Mozer, M. C. Learning deep disentangled embeddings with the f-statistic loss. arXiv preprint arXiv:1802.05312, 2018.

Rojas-Carulla, M., Schölkopf, B., Turner, R., and Peters, J. Invariant models for causal transfer learning. Journal of Machine Learning Research, 2018.