
Robustly Disentangled Causal Mechanisms:

Validating Deep Representations for Interventional Robustness

Raphael Suter 1   Đorđe Miladinović 1   Bernhard Schölkopf 2   Stefan Bauer 2

1 Department of Computer Science, ETH Zurich, Switzerland. 2 MPI for Intelligent Systems, Tübingen, Germany. Correspondence to: Raphael Suter <[email protected]>, Stefan Bauer <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

The ability to learn disentangled representations that split underlying sources of variation in high dimensional, unstructured data is important for data efficient and robust use of neural networks. While various approaches aiming towards this goal have been proposed in recent times, a commonly accepted definition and validation procedure is missing. We provide a causal perspective on representation learning which covers disentanglement and domain shift robustness as special cases. Our causal framework allows us to introduce a new metric for the quantitative evaluation of deep latent variable models. We show how this metric can be estimated from labeled observational data and further provide an efficient estimation algorithm that scales linearly in the dataset size.

1. Introduction

Learning deep representations in which different semantic aspects of data are structurally disentangled is of central importance for training robust machine learning models. Separating independent factors of variation could pave the way for successful transfer learning and domain adaptation (Bengio et al., 2013). Imagine the example of a robot learning multiple tasks by interacting with its environment. For data efficiency, the robot can learn a generic representation architecture that maps its high dimensional sensory data to a collection of general, compact features describing its surroundings. For each task, only a subset of features will be required. If the robot is instructed to grasp an object, it must know the shape and the position of the object; its color, however, is irrelevant. On the other hand, when pointing to all red objects is demanded, only position and color are required.

Having a disentangled representation, where each feature captures only one factor of variation, allows the robot to build separate (simple) models for each task based on only a relevant and stable subselection of these generically learned features. We argue that robustness of the learned representation is a crucial property when this is attempted in practice. It has been proposed that features should be selected based on their robustness or invariance across tasks (e.g., Rojas-Carulla et al., 2018); we hence do not want them to be affected by changes in any other factor. In our example, the robot assigned the grasping task should be able to build a model using features that describe the shape and position of the object well. For this model to be robust, however, these features must not be affected by changing color (or any other nuisance factor).

It is striking that, despite the recent popularity of disentangled representation learning approaches, a commonly accepted definition and validation metric is missing (Higgins et al., 2018). We view disentanglement as a property of a causal process (Spirtes et al., 1993; Pearl, 2009) responsible for the data generation, as opposed to only a heuristic characteristic of the encoding. Concretely, we call a causal process disentangled when the parents of the generated observations do not affect each other, i.e., there is no total causal effect between them (Peters et al., 2017, Definition 6.12). We call these parents elementary ingredients. In the example above, we view color and shape as elementary ingredients, as both can be changed without affecting each other. Still, there can be dependencies between them if, for example, our experimental setup is confounded by the capabilities of the 3D printers that are used to create the objects (e.g., certain shapes can only be printed in some colors).

Combining these disentangled causal processes with the encoding allows us to study interventional effects on feature representations and estimate them from observational data. This is of interest when benchmarking disentanglement approaches based on ground truth data (Locatello et al., 2018) or trying to evaluate robustness of deep representations w.r.t. known nuisance factors (e.g., domain changes). In the example of robotics, knowledge about the generative factors (e.g., the color, shape, weight, etc. of an object to grasp) is often available and can be controlled in experiments.

We will start by giving an overview of previous work on finding disentangled representations and how they have been validated in Section 2. In Section 3 we introduce our framework for the joint treatment of the disentangled causal process and its learned representation. We introduce our notion of interventional effects on encodings and the resulting interventional robustness score in Section 4, and show how this score can be estimated from observational data with an efficient O(N) algorithm in Section 5. Section 6 provides experimental evidence on a standard disentanglement benchmark dataset supporting the need for a robustness based disentanglement criterion.

OUR CONTRIBUTIONS:

• We introduce a unifying causal framework of disentangled generative processes and consequent feature encodings. This perspective allows us to introduce a novel validation metric, the interventional robustness score.
• We show how this metric can be estimated from observational data and provide an efficient algorithm that scales linearly in the dataset size.
• Our extensive experiments on a standard benchmark dataset show that our robustness based validation is able to discover vulnerabilities of deep representations that have been undetected by existing work.
• Motivated by this metric, we additionally present a new visualisation technique which provides an intuitive understanding of dependency structures and robustness of learned encodings.

NOTATION:

We denote the generative factors of high dimensional observations X as G. The latent variables learned by a model, e.g., a variational auto-encoder (VAE) (Kingma & Welling, 2014), are denoted as Z. We use the notation E(·) to describe the encoding, which in the case of VAEs corresponds to the posterior mean of q_φ(z|x). Capital letters denote random variables, and lower case letters observations thereof. Subindices Z_J for a set J or Z_j for a single index j denote the selected components of a multidimensional variable. A backslash, Z_{\J}, denotes all components except those in J.

2. Related Work

In the framework of variational auto-encoders (VAEs) (Kingma & Welling, 2014) the (high dimensional) observations x are modelled as being generated from some latent features z with chosen prior p(z) according to the probabilistic model p_θ(x|z)p(z). The generative model p_θ(x|z) as well as the proxy posterior q_φ(z|x) can be estimated using neural networks by maximizing the variational lower bound (ELBO) of log p(x^(1), ..., x^(N)):

$$\mathcal{L}_{VAE} = \sum_{i=1}^{N} \Big( \mathbb{E}_{q_\phi(z|x^{(i)})}\big[\log p_\theta(x^{(i)}|z)\big] - D_{KL}\big(q_\phi(z|x^{(i)}) \,\|\, p(z)\big) \Big). \tag{1}$$

This objective function a priori does not encourage much structure on the latent space (except some similarity to the chosen prior p(z), which is usually isotropic Gaussian). More precisely, for a given encoder E and decoder D, any bijective transformation g of the latent space z = E(x) yields the same reconstruction $\hat{x} = D(g^{-1}(g(E(x)))) = D(E(x))$.

Various proposals for more structure-imposing regularization have been made, either with some sort of supervision (e.g. Siddharth et al., 2017; Bouchacourt et al., 2017; Liu et al., 2017; Mathieu et al., 2016; Cheung et al., 2014) or completely unsupervised (e.g. Higgins et al., 2017; Kim & Mnih, 2018; Chen et al., 2018; Kumar et al., 2018; Esmaeili et al., 2018). Higgins et al. (2017) proposed the β-VAE, penalizing the Kullback-Leibler divergence (KL) term in the VAE objective (1) more strongly, which encourages similarity to the factorized prior distribution. Others used techniques to encourage statistical independence between the different components of Z, e.g., FactorVAE (Kim & Mnih, 2018) or β-TCVAE (Chen et al., 2018), similar to independent component analysis (e.g. Comon, 1994). With disentangling the inferred prior (DIP-VAE), Kumar et al. (2018) proposed encouraging factorization of $q_\phi(z) = \int q_\phi(z|x)\, p(x)\, dx$.
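The β-VAE objective just described simply re-weights the KL term in Eq. (1). The following is a minimal sketch (ours, not the authors' implementation), assuming a Gaussian encoder q_φ(z|x) = N(μ, diag(σ²)), a standard normal prior, and a hypothetical `decoder_log_likelihood` standing in for log p_θ(x|z); beta = 1 recovers Eq. (1), beta > 1 gives the β-VAE objective.

```python
# Minimal sketch of the (beta-)VAE objective with a Gaussian encoder and an
# isotropic Gaussian prior p(z) = N(0, I). Illustrative only.
import torch

def beta_vae_loss(x, mu, log_var, z_sample, decoder_log_likelihood, beta=1.0):
    """Negative ELBO per batch; beta=1.0 recovers Eq. (1), beta>1 the beta-VAE."""
    # One-sample Monte Carlo estimate of E_{q_phi(z|x)}[log p_theta(x|z)].
    recon = decoder_log_likelihood(x, z_sample)
    # Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dims.
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1.0, dim=-1)
    return (beta * kl - recon).mean()
```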
A special form of structure in the latent space which has gained a lot of attention in recent times is referred to as disentanglement (Bengio et al., 2013). This term encompasses the understanding that each learned feature in Z should represent structurally different aspects of the observed phenomena (i.e., capture different sources of variation).

Various methods to validate a learned representation for disentanglement based on known ground truth generative factors G have been proposed (e.g. Eastwood & Williams, 2018; Ridgeway & Mozer, 2018; Chen et al., 2018; Kim & Mnih, 2018). While a universal definition of disentanglement is missing, the most widely accepted notion is that one feature Z_i should capture information of only one generative factor (Eastwood & Williams, 2018; Ridgeway & Mozer, 2018). This has for example been expressed as the mutual information of a single latent dimension Z_i with the generative factors G_1, ..., G_K (Ridgeway & Mozer, 2018), where in the ideal case each Z_i has some mutual information with one generative factor G_k but none with all the others. Similarly, Eastwood & Williams (2018) trained predictors (e.g., Lasso or random forests) for a generative factor G_k based on the representation Z. In a disentangled model, each dimension Z_i is only useful (i.e., has high feature importance) to predict one of those factors (see Appendix D for details).

Validation without known generative factors is still an open research question, and so far it is not possible to quantitatively validate disentanglement in an unsupervised way. The community has been using "latent traversals" (i.e., changing one latent dimension and subsequently re-generating the image) for visual inspection when supervision is not available (see e.g. Chen et al., 2018). This can be used to uncover physically meaningful interpretations of each dimension.

3. Causal Model

We will first consider assumptions for the causal process underlying the data generating mechanism. Following this, we discuss consequences for trying to match encodings Z with causal factors G in a deep latent variable model.

3.1. Disentangled Causal Model

As opposed to previous approaches that defined disentanglement heuristically as a property of the learned latent space, we take a step back and first introduce a notion of disentanglement on the level of the true causal mechanism (or data generation process). Subsequently, we can use this definition to better understand a learned probabilistic model for latent representations and evaluate its properties.

We assume to be given a set of observations from a (potentially high dimensional) random variable X. In our model, the data generating process is described by K causes of variation (generative factors) G = [G_1, ..., G_K] (i.e., G → X) that do not cause each other. These factors G are generally assumed to be unobserved and are objects of interest when doing deep representation learning. In particular, knowledge about G could be used to build lower dimensional predictive models, not relying on the (unstructured) X itself. This could be classic prediction of a label Y, often in "confounded" direction (i.e., predicting effects from other effects) if G → (X, Y) or in anti-causal direction if Y → G → X. It is also relevant in a domain change setting when we know that the domain S has an impact on X, i.e., (S, G) → X.

Having these potential use cases in mind, we assume the generative factors themselves to be confounded by (multidimensional) C, which can for example include a potential label Y or source S. Hence, the resulting causal model C → G → X allows for statistical dependencies between latent variables G_i and G_j, i ≠ j, when they are both affected by a certain label, i.e., G_i ← Y → G_j.

However, a crucial assumption of our model is that these latent factors should represent elementary ingredients of the causal mechanism generating X (to be defined below), which can be thought of as descriptive features of X that can be changed without affecting each other (i.e., there is no causal effect between them). A similar assumption on the underlying model is likewise a key requirement for the recent extension of identifiability results of non-linear ICA (Hyvarinen et al., 2018). We formulate this assumption of a disentangled causal model as follows (see also Figure 1):

Definition 1 (Disentangled Causal Process). Consider a causal model for X with generative factors G, described by the mechanisms p(x|g), where G could generally be influenced by L confounders C = (C_1, ..., C_L). This causal model for X is called disentangled if and only if it can be described by a structural causal model (SCM) (Pearl, 2009) of the form

$$C \leftarrow N_c$$
$$G_i \leftarrow f_i(\mathrm{PA}_i^C, N_i), \quad \mathrm{PA}_i^C \subset \{C_1, \ldots, C_L\}, \quad i = 1, \ldots, K$$
$$X \leftarrow g(G, N_x)$$

with functions f_i, g and jointly independent noise variables N_c, N_1, ..., N_K, N_x. Note that $\forall i \neq j$: $G_i \not\rightarrow G_j$.

In practice we assume that the dimensionality of the confounding L is significantly smaller than the number of factors K.

[Figure 1: graphical model with a confounder C pointing to the generative factors G_1, G_2, ..., G_K, which in turn point to the observation X.]

Figure 1. Disentangled Causal Mechanism: This graphical model encompasses our assumptions on a disentangled causal model. C stands for a confounder, G = (G_1, G_2, ..., G_K) are the generative factors (or elementary ingredients) and X the observed quantity. In general, there can be multiple confounders affecting a range of elementary ingredients each.
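As an illustration of Definition 1 (ours, not part of the paper), the sketch below samples from a toy disentangled causal process with a single confounder: each G_i depends only on C and its own noise, never on another G_j, while X is produced from all factors jointly. All mechanisms f_i and g are arbitrary placeholder choices.

```python
# Toy sampler for a disentangled causal process C -> G -> X (Definition 1).
# Mechanisms are illustrative placeholders; intervening on one G_i leaves the others unchanged.
import numpy as np

rng = np.random.default_rng(0)

def sample_disentangled_scm(n):
    c = rng.normal(size=n)                       # confounder C <- N_c
    g1 = np.tanh(c) + 0.1 * rng.normal(size=n)   # G_1 <- f_1(C, N_1)
    g2 = 0.5 * c + 0.1 * rng.normal(size=n)      # G_2 <- f_2(C, N_2): G_1 and G_2 are
                                                 # dependent via C, but neither causes the other
    g3 = rng.uniform(size=n)                     # G_3 <- f_3(N_3): an unconfounded factor
    G = np.stack([g1, g2, g3], axis=1)
    x = np.concatenate([G, G ** 2], axis=1) + 0.05 * rng.normal(size=(n, 6))  # X <- g(G, N_x)
    return G, x

G, X = sample_disentangled_scm(1000)
```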
This definition reflects our understanding of elementary ingredients G_i, i = 1, ..., K, of the causal process. Each ingredient should work on its own and be changeable without affecting the others. This reflects the independent mechanisms (IM) assumption (Schölkopf et al., 2012). Independent mechanisms as components of causal models allow intervention on one mechanism without affecting the other modules and thus correspond to the notion of independently controllable factors in reinforcement learning (Thomas et al., 2017). Our setting is broader, describing any causal process and inheriting the generality of the notion of IM, pertaining to autonomy, invariance and modularity (Peters et al., 2017).

Based on this view of the data generation process, we can prove (see Appendix B) the following observations, which will help us discuss notions of disentanglement and deep latent variable models.

Proposition 1 (Properties of a Disentangled Causal Process). A disentangled causal process as introduced in Definition 1 fulfills the following properties:

(a) p(x|g) describes a causal mechanism invariant to changes in the distributions p(g_i).

(b) In general, the latent causes can be dependent, $G_i \not\perp\!\!\!\perp G_j$ for i ≠ j. Only if we condition on the confounders in the data generation are they independent: $G_i \perp\!\!\!\perp G_j \mid C$ for all i ≠ j.

(c) Knowing what observation of X we obtained renders the different latent causes dependent, i.e., $G_i \not\perp\!\!\!\perp G_j \mid X$.

(d) The latent factors G already contain all information about the confounders C that is relevant for X, i.e.,

$$I(X; G) = I(X; (G, C)) \geq I(X; C),$$

where I denotes the mutual information.

(e) There is no total causal effect from G_j to G_i for j ≠ i; i.e., intervening on G_j does not change G_i:

$$\forall g_j^\triangle: \quad p\big(g_i \mid do(G_j \leftarrow g_j^\triangle)\big) = p(g_i) \;\; \big(\neq p(g_i \mid g_j^\triangle) \text{ in general}\big).$$

(f) The remaining components of G, i.e., G_{\j}, are a valid adjustment set (Pearl, 2009) to estimate interventional effects from G_j to X based on observational data, i.e.,

$$p\big(x \mid do(G_j \leftarrow g_j^\triangle)\big) = \int p\big(x \mid g_j^\triangle, g_{\setminus j}\big)\, p(g_{\setminus j})\, dg_{\setminus j}.$$

(g) If there is no confounding, conditioning is sufficient to obtain the post interventional distribution of X:

$$p\big(x \mid do(G_j \leftarrow g_j^\triangle)\big) = p\big(x \mid g_j^\triangle\big).$$

[Figure 2: graphical model in which the generative factors G_1, ..., G_K generate the observation X, which is then encoded into latent variables Z_1, ..., Z_{K'}.]

Figure 2. We assume that the data are generated by a process involving a set of unknown independent mechanisms G_i (which may themselves be confounded by other processes, see Figure 1). In the simplest case, disentangled representation learning aims to recover variables Z_i that capture the independent mechanisms G_i in the sense that they (i) represent the information contained in the G_i and (ii) respect the causal generative structure of G → X in an interventional sense: in particular, for any i, localized interventions on another cause G_j (j ≠ i) should not affect Z_i. In practice, there need not be a direct correspondence between G_i and Z_i variables (e.g., multiple latent variables may jointly represent one cause), hence our definitions deal with sets of factors rather than individual ones. Note that in the unsupervised setting, we do not know G nor the mapping from G to X (we do know, however, the "decoder" mapping from Z to X, not shown in this picture). In experimental evaluations of disentanglement, however, such knowledge is usually assumed.

3.2. Disentangled Latent Variable Model

We can now understand generative models with latent variables (e.g., the decoder p_θ(x|z) in VAEs) as models for the causal mechanism in (a) and the inferred latent space through q_φ(z|x) as a proxy to the generative factors G. Property (d) gives hope that under an adequate information bottleneck we can indeed recover information about the causal parents and not the confounders. Ideally, we would hope for a one-to-one correspondence of Z_i to G_i for all i = 1, ..., K. In some situations it might be useful to learn multiple latent dimensions for one causal factor for a more natural description, e.g., describing an angle θ as cos(θ) and sin(θ) (Ridgeway & Mozer, 2018). Hence, we will generally allow the encodings Z to be K′-dimensional, where usually K′ ≥ K. The β-VAE (Higgins et al., 2017) encourages factorization of q_φ(z|x) through penalization of the KL to its prior p(z). Due to property (c), other approaches were introduced making use of statistical independence (Kim & Mnih, 2018; Chen et al., 2018; Kumar et al., 2018). Esmaeili et al. (2018) allow dependence within groups of variables in a hierarchical model (i.e., with some form of confounding where property (b) becomes an issue) by specifically modelling groups of dependent latent encodings. In contrast to the above mentioned approaches, this requires prior knowledge of the generative structure. We will make use of property (f) to solve the task of using observational data to evaluate deep latent variable models for disentanglement and robustness. Figure 2 illustrates our causal perspective on representation learning, which encompasses the data generating process (G → X) as well as the subsequent encoding through E(·) (X → Z). Based on this viewpoint, we define the interventional effect of a group of generative factors G_J on the implied latent space encodings Z_L with proxy posterior q_φ(z|x) from a VAE, where J ⊂ {1, ..., K} and L ⊂ {1, ..., K′}, as:

$$p\big(z_L \mid do(G_J \leftarrow g_J^\triangle)\big) := \int q_\phi(z_L \mid x)\; p\big(x \mid do(G_J \leftarrow g_J^\triangle)\big)\, dx.$$

This definition is consistent with the above graphical model as it implies that $p(z_L \mid x, do(G_J \leftarrow g_J^\triangle)) = q_\phi(z_L \mid x)$.
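For intuition on how this interventional effect can be estimated from purely observational data, the sketch below (ours, not the paper's Eq. (7) from the supplement) combines the adjustment formula of Proposition 1 (f) with the encoder: for discrete factors, E[Z_L | do(G_J ← g_J)] is a weighted average of within-cell encoding means, weighted by the observed marginal frequencies of the remaining factors. All variable names are illustrative.

```python
# Sketch: estimate E[Z_L | do(G_J <- g_J)] from labeled observational data using
# the adjustment formula of Proposition 1 (f), assuming discrete generative factors.
import numpy as np

def interventional_mean(z, g, L, J, g_J):
    """z: (N, K') encodings E(x); g: (N, K) generative factors; L, J: index lists;
    g_J: values the factors G_J are set to by the intervention."""
    rest = [k for k in range(g.shape[1]) if k not in J]
    mask_J = np.all(g[:, J] == np.asarray(g_J), axis=1)
    total = np.zeros(len(L))
    weight = 0.0
    # E[Z_L | do(G_J <- g_J)] = sum_{g_rest} E[Z_L | g_J, g_rest] p(g_rest),
    # summing over realizations of the adjustment set G_{\J}.
    for g_rest in np.unique(g[:, rest], axis=0):
        in_rest = np.all(g[:, rest] == g_rest, axis=1)
        cell = mask_J & in_rest
        if cell.any():
            p_rest = in_rest.mean()                     # empirical p(g_rest)
            total += p_rest * z[cell][:, L].mean(axis=0)
            weight += p_rest
    return total / weight    # renormalize over the cells actually observed
```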

4. Interventional Robustness

Building on the definition of interventional effects on deep feature representations introduced at the end of Section 3.2, we now derive a robustness measure of encodings with respect to changes in certain generative factors.

Let L ⊂ {1, ..., K′} and I, J ⊂ {1, ..., K} with I ∩ J = ∅ be groups of indices in the latent space and the generative space. For generality, we will henceforth talk about robustness of groups of features Z_L with respect to interventions on groups of generative factors G_J. We believe that this general formulation, allowing disagreements between groups of latent dimensions and generative factors, provides more flexibility, for example when multiple latent dimensions are used to describe one phenomenon (Esmaeili et al., 2018) or when some sort of supervision is available through groupings in the dataset according to the generative factors J (Bouchacourt et al., 2017). Below, we will also discuss special cases of how these sets can be chosen.

If we assume that the encoding Z_L captures information about the causal factors G_I and we would like to build a predictive model that only depends on those factors, we might be interested in knowing how robust our encoding is with respect to nuisance factors G_J, where I ∩ J = ∅. To quantify this robustness for specific realizations of g_I and g_J^△ we make the following definition:

Definition 2 (Post Interventional Disagreement). For any given set of feature indices L ⊂ {1, ..., K′}, g_I and g_J^△, we call

$$\mathrm{PIDA}(L \mid g_I, g_J^\triangle) := d\Big(\mathbb{E}\big[Z_L \mid do(G_I \leftarrow g_I)\big],\; \mathbb{E}\big[Z_L \mid do(G_I \leftarrow g_I, G_J \leftarrow g_J^\triangle)\big]\Big)$$

the post interventional disagreement (PIDA) in Z_L due to g_J^△ given g_I. Here, d is a suitable distance function (e.g., the ℓ2-norm).

The above definition is on its own likewise a contribution to the defined but unused notion of extrinsic disentanglement in Besserve et al. (2018). PIDA quantifies the shift in our inferred features Z_L that we experience when the generative factors G_J are externally changed to g_J^△ while the generative factors that we are actually interested in capturing with Z_L (i.e., G_I) remain at the predefined setting g_I. Using expected values after intervention on the generative factors (i.e., Pearl's do-notation), as opposed to regular conditioning, allows for interpretation of the score also when factors are dependent due to confounding. The do-notation represents setting these generative values by external intervention. It thus isolates the causal effect that a generative factor has, which in general is not possible using standard conditioning (Pearl, 2009). This neglects the history that might have led to the observations in the collection phase of the observational dataset. For example, when a robot is trained with various objects of different colors, it might be the case that certain shapes occur more often in specific colors (e.g., due to 3D printer capabilities). If we were to condition the feature encoding on a specific color, the observed effects might as well be due to a change in object shape. The interventional distribution, on the other hand, measures by definition the change the features experience due to externally setting the color while all other generative factors remain the same. If there is no confounding in the generative process, this definition is equivalent to regular conditioning (see Proposition 1 (g)).

For robustness reasons, we are interested in the worst case effect any change in the nuisance parameters g_J^△ might have. We call this the maximal post interventional disagreement (MPIDA): $\mathrm{MPIDA}(L \mid g_I, J) := \sup_{g_J^\triangle} \mathrm{PIDA}(L \mid g_I, g_J^\triangle)$. This metric is still computed for a specific realization of G_I. Hence, we weight this score according to the occurrence probabilities of g_I, which leads us to the expected MPIDA: $\mathrm{EMPIDA}(L \mid I, J) := \mathbb{E}_{g_I}\big[\mathrm{MPIDA}(L \mid g_I, J)\big]$. EMPIDA is now an (unnormalized) measure in [0, ∞) quantifying the worst-case shifts in the inferred Z_L we have to expect due to changes in G_J even though our generative factors of interest G_I remain the same. This is for example of interest when the robot in our introductory example learns a generic feature representation Z of its environment from which it wants to select a subset of features Z_L in order to perform a grasping task. For this model to work well, the generative factors of the object I = {shape, weight} are important, whereas the factor J = {color} is not. Now, the robot can evaluate how robustly its features Z_L perform at the task requiring I but not J.

We propose to normalize this quantity by EMPIDA(L | ∅, {1, ..., K}), which represents the expected maximal deviation from the mean encoding of Z_L without fixed generative factors, as it is often useful to have a normalized score for comparisons. Hence, we define:

Definition 3 (Interventional Robustness Score).

$$\mathrm{IRS}(L \mid I, J) := 1 - \frac{\mathrm{EMPIDA}(L \mid I, J)}{\mathrm{EMPIDA}(L \mid \emptyset, \{1, \ldots, K\})} \tag{2}$$

This score yields 1.0 for perfect robustness (i.e., no harm is done by changes in G_J) and 0.0 for no robustness. Note that IRS has a similar interpretation to an R² value in regression: instead of measuring the captured variance, it looks at worst case deviations of the inferred values.
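To make Definitions 2 and 3 concrete, the following sketch (ours, with illustrative names) computes PIDA, MPIDA, EMPIDA and IRS for discrete factors, assuming a helper post_do_mean(L, interventions) that returns E[Z_L | do(...)] for a dictionary of factor-index-to-value interventions, plus precomputed lists of factor realizations and occurrence probabilities p_gI.

```python
# Sketch of Definitions 2-3: PIDA, MPIDA, EMPIDA and the IRS score.
# post_do_mean and the realization lists are assumed to be supplied by the caller.
import numpy as np

def pida(post_do_mean, L, do_I, do_J):
    """Definition 2: d(E[Z_L | do(G_I <- g_I)], E[Z_L | do(G_I <- g_I, G_J <- g_J)])."""
    return np.linalg.norm(post_do_mean(L, do_I) - post_do_mean(L, {**do_I, **do_J}))

def empida(post_do_mean, L, gI_realizations, p_gI, gJ_realizations):
    """EMPIDA(L | I, J) = E_{g_I}[ sup_{g_J} PIDA(L | g_I, g_J) ]."""
    return sum(
        p * max(pida(post_do_mean, L, do_I, do_J) for do_J in gJ_realizations)
        for do_I, p in zip(gI_realizations, p_gI)
    )

def irs(post_do_mean, L, gI_realizations, p_gI, gJ_realizations, all_realizations):
    """Definition 3 / Eq. (2): 1 - EMPIDA(L|I,J) / EMPIDA(L | {}, {1,...,K})."""
    num = empida(post_do_mean, L, gI_realizations, p_gI, gJ_realizations)
    den = empida(post_do_mean, L, [{}], [1.0], all_realizations)
    return 1.0 - num / den
```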
Special Case: Disentanglement  One important special case is the setting where L = {l}, I = {i} and J = {1, ..., i−1, i+1, ..., K}. This corresponds to the degree to which Z_l is robustly isolated from any extraneous causes (assuming Z_l captures G_i), which can be interpreted as the concept of disentanglement in the framework of Eastwood & Williams (2018).

We define

$$D_l := \max_{i \in \{1, \ldots, K\}} \mathrm{IRS}\big(\{l\} \mid \{i\}, \{1, \ldots, K\} \setminus \{i\}\big) \tag{3}$$

as the disentanglement score of Z_l. The maximizing i* is interpreted as the generative factor that Z_l captures predominantly. Intuitively, we have robust disentanglement when a feature Z_l reliably captures information about the generative factor G_{i*}, where reliable means that the inferred value is always the same when g_{i*} stays the same, regardless of what the other generative factors G_{\i*} are doing.

In our evaluations of disentanglement, we also plot the full dependency matrix R̂ with R̂_{li} = IRS({l} | {i}, {1, ..., K}\{i}) (see for example Figure 6) next to providing the values D_l and their weighted average.

Special Case: Domain Shift Robustness  If we understand one (or multiple) generative factor(s) G_S as indicating source domains which we would like to generalize over, we can use PIDA to evaluate the robustness of a selected feature set Z_L against such domain shifts. In particular,

$$\mathrm{IRS}\big(L \mid \{1, \ldots, K\} \setminus \{S\}, \{S\}\big)$$

quantifies how robust Z_L is when changes in G_S occur. If we are building a model predicting a label Y based on some (to be selected) feature set L, we can use this score to make a trade-off between robustness and predictive power. For example, we could use the best performing set of features among all those that satisfy a given robustness threshold.

5. Estimation and Benchmarking Disentanglement

In the supplementary material A we provide the derivation of our estimation procedure for EMPIDA(L | I, J). Here we only present the specific algorithm for how EMPIDA can be estimated from a generic observational dataset D, given as Algorithm 1. The main ingredient for this estimation to work is provided by our constrained causal model (i.e., a disentangled process), which implies that the backdoor criterion can be applied, as we showed in Proposition 1.

Algorithm 1 EMPIDA Estimation
1: Input:
2:   dataset D = {(x^(i), g^(i))}_{i=1,...,N}
3:   trained encoder E
4:   subsets of factors L ⊂ {1, ..., K′} and I, J ⊂ {1, ..., K}
5: Preprocessing:
6:   encode all samples to obtain {z^(i) = E(x^(i)) : i = 1, ..., N}
7:   estimate p(g^(i)) and p(g^(i)_{\(I∪J)}) for all i from relative frequencies in D
8: Estimation:
9:   find all realizations of G_I in D: {g_I^(k), k = 1, ..., N_I}
10:  partition the dataset according to those realizations: D_I^(k) := {(x, g) ∈ D s.t. g_I = g_I^(k)}
11:  for k = 1, ..., N_I do
12:    estimate mean ← E[Z_L | do(G_I ← g_I^(k))] using Eq. (7) and samples D_I^(k)
13:    partition D_I^(k) according to realizations of G_J: D_{I,J}^(k,l) := {(x, g) ∈ D_I^(k) s.t. g_J = g_J^(l)}
14:    initialize mpida(k) ← 0.0
15:    for l = 1, ..., N_{I,J}^(k) do
16:      mean_int ← E[Z_L | do(G_I ← g_I^(k), G_J ← g_J^(l))] using Eq. (7) and samples D_{I,J}^(k,l) for estimation
17:      compute pida ← d(mean, mean_int)
18:      update mpida(k) ← max(mpida(k), pida)
19:    end for
20:  end for
21:  Return empida ← Σ_{k=1}^{N_I} (|D_I^(k)| / |D|) · mpida(k)
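As a reading aid (ours, not the authors' code), the following is a minimal Python sketch of Algorithm 1 for the unconfounded case, where the post-interventional means of steps 12 and 16 reduce to within-partition averages of the encodings (Proposition 1 (g)); in the confounded case they would instead be computed with the adjustment-based estimator the paper refers to as Eq. (7) in the supplement.

```python
# Sketch of Algorithm 1 (EMPIDA estimation) for discrete generative factors,
# assuming no confounding so that conditional means estimate interventional means.
import numpy as np

def estimate_empida(z, g, L, I, J):
    """z: (N, K') encodings E(x^(i)); g: (N, K) generative factors; L, I, J: index lists."""
    N = len(z)
    empida = 0.0
    # Realizations of G_I and the corresponding partitions D_I^(k).
    for g_I in (np.unique(g[:, I], axis=0) if I else [np.empty(0)]):
        in_k = np.all(g[:, I] == g_I, axis=1) if I else np.ones(N, dtype=bool)
        mean = z[in_k][:, L].mean(axis=0)            # ~ E[Z_L | do(G_I <- g_I^(k))]
        mpida = 0.0
        # Realizations of G_J within the partition: D_{I,J}^(k,l).
        for g_J in np.unique(g[in_k][:, J], axis=0):
            in_kl = in_k & np.all(g[:, J] == g_J, axis=1)
            mean_int = z[in_kl][:, L].mean(axis=0)   # ~ E[Z_L | do(G_I <- g_I, G_J <- g_J)]
            mpida = max(mpida, np.linalg.norm(mean - mean_int))
        empida += (in_k.sum() / N) * mpida           # weight by |D_I^(k)| / |D|
    return empida

def estimate_irs(z, g, L, I, J):
    """Eq. (2): normalize by EMPIDA(L | {}, {1, ..., K})."""
    K = g.shape[1]
    return 1.0 - estimate_empida(z, g, L, I, J) / estimate_empida(z, g, L, [], list(range(K)))
```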

Even though the sampling procedure might look non-trivial at first sight, Algorithm 1 for estimating EMPIDA(L | I, J) has O(N) complexity, as indicated by the following result:

Proposition 2 (Computational Complexity). The EMPIDA estimation algorithm described in Algorithm 1 scales as O(N) in the dataset size N = |D|.

The proof of Proposition 2 can be found in Appendix C.

Note that a dataset capturing all possible variations generally grows exponentially in the number of generative factors. While this is a general issue for all validation approaches and care needs to be taken when collecting such datasets in practice, we merely point out that due to the generally large size of N it is particularly important to have such an efficient validation procedure. In many benchmark datasets for disentanglement (e.g. dSprites) the observations are obtained noise-free and the dataset contains all possible combinations of generative factors exactly once. This makes the estimation of the disentanglement score even easier, as we have |D_{I={i}, J={1,...,K}\{i}}^{(k,l)}| = 1. Furthermore, since no confounding is present, we can use conditioning to estimate the interventional effect, i.e., p(x | do(G_i ← g_i)) = p(x | g_i), as seen in Proposition 1 (g). The disentanglement score of Z_l, as discussed in Eq. (3), then follows (see A.1 for details) as:

$$D_l = \max_{i \in \{1, \ldots, K\}} \left(1 - \frac{\mathrm{EMPIDA}_{li}}{\sup_{\tilde{x} \in D} d\big(\mathbb{E}[Z_l], E(\tilde{x})\big)}\right).$$
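On such a noise-free, complete grid the score above can be read directly off the EMPIDA matrix. A small sketch (ours) of this final normalization step, assuming empida_matrix[l, i] holds EMPIDA({l} | {i}, {1,...,K}\{i}) and z holds the encodings of all samples:

```python
# Sketch: dependency matrix R_hat and per-feature disentanglement scores D_l
# from a precomputed EMPIDA matrix (noise-free benchmark case).
import numpy as np

def disentanglement_scores(empida_matrix, z):
    """empida_matrix: (K', K) with entries EMPIDA({l}|{i}, rest); z: (N, K') encodings."""
    # Normalizer per latent: worst-case deviation of any encoding from its mean.
    norm = np.abs(z - z.mean(axis=0)).max(axis=0)          # shape (K',)
    r_hat = 1.0 - empida_matrix / norm[:, None]            # R_hat[l, i] = IRS({l}|{i}, rest)
    return r_hat, r_hat.max(axis=1)                        # D_l = max_i R_hat[l, i]
```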

6. Experiments

Our evaluations involve five different state of the art unsupervised disentanglement techniques (classic VAE, β-VAE, DIP-VAE, FactorVAE and β-TCVAE), each learning 10 features.

6.1. Methods Comparison

In Table 1 we provide a compact summary of our evaluation. Our objective is the analysis of various kinds of learned latent spaces and their characteristics, not primarily evaluating which methods work best under some metric. In particular, we used each method with the parameter settings that were indicated in the original publications (details are given in Appendix D) and did not tune them further in order to achieve a better robustness score, which is certainly feasible. Rather, we are interested in evaluating latent spaces as a whole, which encompasses both the method and its settings in combination. We can for example observe that β-TCVAE achieves a relatively low feature importance based measure by Eastwood & Williams (2018). This is due to the fact that Chen et al. (2018) did not consider shape to be a generative factor in their tuning (which also leads to a lower informativeness score in our evaluation, which includes this factor), and also because their model ends up with only few active dimensions. The treatment of such inactive components can make a difference when averaging the disentanglement scores of the single Z_i to an overall score. FI uses a simple average, MI weights the components by their overall feature importance, and we weight them according to worst case deviation from the mean (i.e., the normalization of the IRS).

Table 1. Metrics Overview: IRS (ours), FI (Eastwood & Williams, 2018), MI (Ridgeway & Mozer, 2018), INFO: informativeness score (Eastwood & Williams, 2018) (higher is better). The number in parentheses indicates the rank according to a particular metric. Experimental details are given in Section D.

Model            IRS       FI        MI        Info
VAE              0.33 (5)  0.23 (4)  0.90 (3)  0.82 (1)
Annealed β-VAE   0.57 (2)  0.35 (2)  0.86 (5)  0.79 (4)
DIP-VAE          0.43 (4)  0.39 (1)  0.89 (4)  0.82 (1)
FactorVAE        0.51 (3)  0.31 (3)  0.92 (1)  0.79 (4)
β-TCVAE          0.72 (1)  0.16 (5)  0.92 (1)  0.74 (5)

Believing that it is most insightful to look at scores for each dimension separately, which indicate the quality of a single feature, we include the full evaluations, including plots of correspondence matrices (as in Figure 6), in Appendix E. For future extensions and applications, our work is added to the disentanglement_lib of Locatello et al. (2018).

6.2. Robustness as Complementary Metric

As we could already see in Table 1, different metrics do not always agree with each other about which model disentangles best. This is consistent with the recent large scale evaluation provided by Locatello et al. (2018). In Figure 3 we further illustrate the dependency between the MI score and our IRS at the finer granularity of the metrics of individual features (instead of the full latent space). There seems to be a clear positive correlation between the two evaluation metrics. However, there are features classified as well disentangled according to MI, but not robustly so according to IRS. These features are marked with the red rectangle in Figure 3. We explore one typical such example in more detail in Figures 4, 6 and 7 in the appendix, for the case of the DIP model.

[Figure 3 shows a scatter plot of all learned features with the MI disentanglement score on the x axis and the interventional robustness score (IRS) on the y axis.]

Figure 3. Relationship Metrics: Visualization of all learned features Z_i in our universe (5 models with 10 dimensions each) based on their MI disentanglement score on the x axis and interventional robustness (IRS) on the y axis. The red box indicates the features that obtained a high disentanglement score according to mutual information (i.e., they share high mutual information with only one generative factor), but still provide low robustness according to IRS. These are the cases where the robustness perspective delivers additional insight into disentanglement quality.

When there are rare events that still have a major impact on the features, or when there is a cumulative effect from several generative factors (e.g., in Figure 4), pairwise information based methods (such as MI or FI) cannot capture this vulnerability of deeply learned features. IRS, on the other hand, looks specifically at these cases. For a well rounded view on disentanglement quality, we propose to use both types of measures in a manner that is complementary and use-case specific. Specifically, when critical applications are designed on top of deep representations, quantifying their robustness can be decisive.

6.3. Visualising Interventional Robustness

We further introduce a new visualization technique for latent space models based on ground truth factors, which is motivated by interventional robustness and illustrates how robust a learned feature is with respect to changes in nuisance factors. Figure 4 illustrates this approach on two features learned by the DIP model. Each row corresponds to a different feature Z_l. The upper row corresponds to a well disentangled and robust feature (Z_3) which gets classified as such by all three metrics. The lower one (Z_6) also obtains a high FI and MI score; however, IRS correctly discovers that this feature is not robust. This illustrates a case where having a robustness perspective on disentanglement is important. The columns correspond to the different generative factors G_j (shape, scale, orientation, posX, posY) which potentially influence Z_l. For each latent variable Z_l we first find the generative factor G_{i*} which is most related to it by choosing the maximizer of Eq. (3) (i.e., the factor that renders Z_l most invariant). In the column i* we then plot the estimate of E[Z_l | g_{i*}] together with its confidence bound in order to visualize the informativeness of Z_l about G_{i*}. For example, the upper row of Figure 4 corresponds to Z_3 in the DIP model, which mostly relates to posY. This is why we plot the dependence of Z_3 on posY in the fifth column. The remaining columns then illustrate how Z_3 changes when interventions on the other generative factors are made, even though posY is being kept at a fixed value. Each line with a different color corresponds to a particular value posY can take on. More generally speaking, we plot in the j-th column E[Z_l | g_{i*}, do(G_j ← g_j^△)] as a function of g_j^△ for all possible realizations g_{i*} of G_{i*}. All values with constant g_{i*} are connected with a line. For a robustly disentangled feature, we would expect all of these colored lines to be horizontal (i.e., there is no more dependency on any G_j after accounting for G_{i*}). As such visualizations can provide a much more in-depth understanding of learned representations than single numbers, we provide the full plots of various models in Appendix F.

[Figure 4 shows, for two latent features of the DIP model (rows), plots of E[Z_l | g_{i*}, do(G_j ← g_j^△)] against g_j^△ for each generative factor G_j (columns).]

Figure 4. Visualising Interventional Robustness: Plots of E[Z_l | g_{i*}, do(G_j ← g_j^△)] as a function of g_j^△ for a different G_j per column, as explained in Section 6.3. The upper row is an example of good, robust disentanglement (Z_3 from the DIP model discussed in Figure 6). The lower row illustrates Z_6, which is classified as well disentangled according to FI (top 18%) and MI (top 33%) but still has a low robustness score (bottom 4%). This stems from the fact that even though Z_6 is very informative about scale (almost a linear function in expectation), its value can still be changed remarkably by switching any of posX, posY or orientation. These additional dependencies are not discovered by mutual information (or feature importance) due to the higher noise in these relationships (see Figure 7) and because they are partly hidden in cumulative effects.
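A minimal sketch (ours) of how such a panel could be produced for one latent dimension, assuming discrete factors on a complete grid (as in dSprites), precomputed encodings z and factor labels g, and no confounding so that conditional means stand in for the post-interventional means; factor and variable names are illustrative.

```python
# Sketch: for latent Z_l, plot E[Z_l | g_i*, do(G_j <- g_j)] against g_j for every
# other factor j, with one colored line per value of the main factor G_i*.
import numpy as np
import matplotlib.pyplot as plt

def plot_interventional_robustness(z, g, l, i_star, factor_names):
    others = [j for j in range(g.shape[1]) if j != i_star]
    fig, axes = plt.subplots(1, len(others), figsize=(4 * len(others), 3), sharey=True)
    axes = np.atleast_1d(axes)
    for ax, j in zip(axes, others):
        for gi in np.unique(g[:, i_star]):               # one line per realization of G_i*
            vals_j = np.unique(g[:, j])
            means = [z[(g[:, i_star] == gi) & (g[:, j] == gj), l].mean() for gj in vals_j]
            ax.plot(vals_j, means, marker="o", label=f"{factor_names[i_star]}={gi}")
        ax.set_xlabel(factor_names[j])
        ax.set_title(f"Z_{l} vs do({factor_names[j]})")
    axes[0].set_ylabel(f"E[Z_{l}]")
    axes[-1].legend(title=factor_names[i_star], fontsize="small")
    fig.tight_layout()
    return fig
```

Horizontal lines across all panels then indicate a robustly disentangled feature, while sloped or crossing lines reveal the kind of vulnerability seen in the lower row of Figure 4.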
7. Conclusion

We have proposed a framework for assessing disentanglement in deep representation learning which combines the generative process responsible for the high dimensional observations with the subsequent feature encoding by a neural network. This perspective leads to a natural validation method, the interventional robustness score. We show how it can be estimated from observational data using an efficient algorithm that scales linearly in the dataset size. As special cases, the proposed measure captures robust disentanglement and domain shift stability. Extensive evaluations showed that existing metrics do not capture the effects that rare events or cumulative influences from multiple generative factors can have on feature encodings, while our robustness based validation metric discovers such vulnerabilities.

We envision that the notion of interventional effects on encodings may give rise to the development of novel, robustly disentangled representation learning algorithms, for example in the interactive learning environment (Thomas et al., 2017) or when weak forms of supervision are available (Bouchacourt et al., 2017; Locatello et al., 2019). The exploration of those ideas, especially including confounding, is left for future research.

Acknowledgments

We thank Andreas Krause for helpful discussions and support. This research was partially supported by the Max Planck ETH Center for Learning Systems.

References

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

Besserve, M., Sun, R., and Schölkopf, B. Counterfactuals uncover the modular structure of deep generative models. arXiv preprint arXiv:1812.03253, 2018.

Bouchacourt, D., Tomioka, R., and Nowozin, S. Multi-level variational autoencoder: Learning disentangled representations from grouped observations. arXiv preprint arXiv:1705.08841, 2017.

Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.

Chen, T. Q., Li, X., Grosse, R., and Duvenaud, D. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.

Cheung, B., Livezey, J. A., Bansal, A. K., and Olshausen, B. A. Discovering hidden factors of variation in deep networks. arXiv preprint arXiv:1412.6583, 2014.

Comon, P. Independent component analysis, a new concept? Signal Processing, 36(3):287–314, 1994.

Eastwood, C. and Williams, C. K. I. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.

Esmaeili, B., Wu, H., Jain, S., Narayanaswamy, S., Paige, B., and Van de Meent, J.-W. Hierarchical disentangled representations. arXiv preprint arXiv:1804.02086, 2018.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. β-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.

Higgins, I., Amos, D., Pfau, D., Racaniere, S., Matthey, L., Rezende, D., and Lerchner, A. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.

Hyvarinen, A., Sasaki, H., and Turner, R. E. Nonlinear ICA using auxiliary variables and generalized contrastive learning. arXiv preprint arXiv:1805.08651, 2018.

Kim, H. and Mnih, A. Disentangling by factorising. In International Conference on Machine Learning, pp. 2649–2658, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Koller, D., Friedman, N., and Bach, F. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.

Kumar, A., Sattigeri, P., and Balakrishnan, A. Variational inference of disentangled latent concepts from unlabeled observations. In International Conference on Learning Representations, 2018.

Liu, Y.-C., Yeh, Y.-Y., Fu, T.-C., Chiu, W.-C., Wang, S.-D., and Wang, Y.-C. F. Detach and adapt: Learning cross-domain disentangled deep representation. arXiv preprint arXiv:1705.01314, 2017.

Locatello, F., Bauer, S., Lucic, M., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.

Locatello, F., Tschannen, M., Bauer, S., Rätsch, G., Schölkopf, B., and Bachem, O. Disentangling factors of variation using few labels. arXiv preprint arXiv:1905.01258, 2019.

Mathieu, M. F., Zhao, J. J., Zhao, J., Ramesh, A., Sprechmann, P., and LeCun, Y. Disentangling factors of variation in deep representation using adversarial training. In Neural Information Processing Systems, pp. 5040–5048, 2016.

Pearl, J. Causality. Cambridge University Press, 2009.

Peters, J., Janzing, D., and Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press, 2017.

Ridgeway, K. and Mozer, M. C. Learning deep disentangled embeddings with the f-statistic loss. arXiv preprint arXiv:1802.05312, 2018.

Rojas-Carulla, M., Schölkopf, B., Turner, R., and Peters, J. Invariant models for causal transfer learning. Journal of Machine Learning Research, 2018.

Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning, pp. 1255–1262, New York, NY, USA, 2012. Omnipress.

Siddharth, N., Paige, B., van de Meent, J.-W., Desmaison, A., Wood, F. D., Goodman, N. D., Kohli, P., and Torr, P. H. S. Learning disentangled representations with semi-supervised deep generative models. In Neural Information Processing Systems, 2017.

Spirtes, P., Glymour, C., and Scheines, R. Causation, Prediction, and Search. Springer-Verlag, 1993 (2nd edition MIT Press, 2000).

Thomas, V., Pondard, J., Bengio, E., Sarfati, M., Beaudoin, P., Meurs, M.-J., Pineau, J., Precup, D., and Bengio, Y. Independently controllable features. arXiv preprint arXiv:1708.01289, 2017.
