Nonlinear Source Separation - Luis B. Almeida
Nonlinear
Source Separation
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other—except for brief quotations in printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00016ED1V01Y200602SPR002
First Edition
Nonlinear
Source Separation
Luis B. Almeida
Instituto das Telecomunicações, Lisboa, Portugal
Morgan & Claypool Publishers
ABSTRACT
The purpose of this lecture book is to present the state of the art in nonlinear blind source
separation, in a form appropriate for students, researchers and developers. Source separation
deals with the problem of recovering sources that are observed in a mixed condition. When we
have little knowledge about the sources and about the mixture process, we speak of blind source
separation. Linear blind source separation is a relatively well studied subject. Nonlinear blind
source separation is still in a less advanced stage, but has seen several significant developments
in the last few years.
This publication reviews the main nonlinear separation methods, including the separation
of post-nonlinear mixtures, and the MISEP, ensemble learning and kTDSEP methods for
generic mixtures. These methods are studied in significant depth. A historical overview is also presented, mentioning most of the relevant results on nonlinear blind source separation that have been presented over the years.
KEYWORDS
Signal Processing, Source Separation, Nonlinear blind source separation, Independent
component analysis, Nonlinear ICA.
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 The Scatter Plot as a Tool To Depict Joint Distributions . . . . . . . . . . . . . . 3
1.1.2 Separation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2. Linear Source Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Statement of the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 INFOMAX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Choice of the Output Nonlinearities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Maximum Entropy Estimation of the Nonlinearities . . . . . . . . . . . . . . . . 12
2.2.3 Estimation of the Score Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Exploiting the Time-Domain Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Other methods: JADE and FastICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.1 JADE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.2 FastICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3. Nonlinear Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Post-Nonlinear Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.1 Separation Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.2 Application Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Unconstrained Nonlinear Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 MISEP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.2 Nonlinear ICA Through Ensemble Learning . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.3 Kernel-Based Nonlinear Separation: kTDSEP . . . . . . . . . . . . . . . . . . . . . . 66
3.2.4 Other Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4. Final Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
A. Statistical Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
A.1 Passing a Random Variable Through its Cumulative Distribution Function . . . 83
A.2 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
A.2.1 Entropy of a Transformed Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
A.3 Kullback–Leibler Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
A.4 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Acknowledgments
José Moura, the editor of this series, invited me to write this book. Without him I would not have considered it. Harri Valpola was very influential on the form in which the ensemble learning method is finally described, having led me to learn quite a bit about the MDL approach to the method. He also kindly provided the examples of application of ensemble learning, and very carefully reviewed the manuscript, having made many useful comments. Stefan Harmeling provided the examples for the TDSEP and kTDSEP methods, and made several useful suggestions regarding the description of those methods. Mariana Almeida made several useful comments on the ensemble learning section. Andreas Ziehe commented on the description of the TDSEP method. Aapo Hyvärinen helped to clarify some aspects of score function estimation. The anonymous reviewers made many useful comments. Joel Claypool, from Morgan and Claypool Publishers, was very supportive, with his attention to the progress of the manuscript and with his very positive comments. I am grateful to them all. Any errors or inaccuracies that may remain are my responsibility, not that of the people who so kindly offered their help.
Notation
In this book we shall adopt the following notational conventions, unless otherwise noted:
Table 1 provides a summary of the symbols and acronyms used in the book.
Preface
Source separation deals with the problem of recovering sources that are observed in a mixed
condition. When we have little knowledge about the sources and about the mixture process,
we speak of blind source separation. Linear blind source separation is a relatively well studied
subject, and there are some good review books on it. Nonlinear blind source separation is still
in a less advanced stage, but has seen several significant developments in the last few years.
The purpose of this book is to present the state of the art in nonlinear blind source
separation, in a form appropriate for students, researchers and developers. The book reviews
the main nonlinear separation methods, including the separation of post-nonlinear mixtures,
and the MISEP, ensemble learning and kTDSEP methods for generic mixtures. These methods
are studied in significant depth. A historical overview is also presented, mentioning most
of the relevant results on nonlinear blind source separation that have been presented over the
years, and giving pointers to the literature.
The book tries to be relatively self-contained. It includes an initial chapter on linear
source separation, focusing on those separation methods that are useful for the ensuing study
of nonlinear separation. An extensive bibliography is included. Many of the references contain
pointers to freely accessible online versions of the publications.
The prerequisites for understanding the book consist of a basic knowledge of mathematical
analysis and of statistics. Some more advanced concepts of statistics are treated in an appendix.
A basic knowledge of neural networks (multilayer perceptrons and backpropagation) is needed
for understanding some parts of the book.
The writing style is intended to afford an easy reading without sacrificing rigor. Where
necessary the reader is pointed to the relevant literature, for a discussion of some more detailed
aspects of the methods that are studied. Several examples of application of the studied methods
are included, and an appendix provides pointers to online code and data for linear and nonlinear
source separation.
CHAPTER 1
Introduction
It is common, in many practical situations, to have access to observations that are mixtures of
some “original” signals, and to be interested in recovering those signals. For example, when
trying to obtain an electrocardiogram of the fetus, in a pregnant woman, the fetus’ signals will
be contaminated by the much stronger signals from the mother’s heart. When recording speech
in a noisy environment, the signals recorded through the microphone(s) will be contaminated
with noise. When acquiring the image of a document in a scanner, the image sometimes gets
superimposed with the image from the opposite page, especially if the paper is thin. In all these
cases we obtain signals that are contaminated by other signals, and we would like to get rid of
the contamination to recover the original signals.
The recovery of the original signals is normally called source separation. The original
signals are normally called sources, and the contaminated signals are considered to be mixtures
of those sources. In cases such as those presented above, if there is little knowledge about the
sources and about the details of the mixture process, we normally speak of blind source separation
(BSS). In many situations, such as most of those involving biomedical or acoustic signals, the
mixture process is known to be linear, to a good approximation. This allows us to perform
source separation through linear operations, and we then speak of linear source separation. On
the other hand, the document imaging situation that we mentioned above is an example of a
situation in which the mixture is nonlinear, and the corresponding separation process also has
to be nonlinear.
Linear source separation has been the object of much study in recent years, and the
corresponding theory is rather well developed. It has also been the subject of some good overview
books, such as [27, 55]. Nonlinear source separation, on the other hand, has been the object
of research only more recently, and until now there was, to our knowledge, no overview book
specifically addressing it (a former overview paper is [59]).
This book attempts to fill this gap, by providing a comprehensive overview of the state of
the art in nonlinear source separation. It is intended to be used as an introduction to the topic
of nonlinear source separation for scientists and students, as well as for applications-oriented
people. It has been intentionally limited in size, so as to be easy to read. For most of the
1.1 BASIC CONCEPTS

There is an unobserved random vector S, whose components are called sources. We observe a mixture vector X which, in the linear case, is related to S by

X = AS, (1.1)
where A is a matrix. Often the numbers of components of S and X are assumed to be the
same, implying that matrix A is square, and we speak of a square separation problem. In source
separation one is interested in recovering the source vector S, and possibly also the mixture
matrix A.
Clearly, the problem cannot be solved without some additional knowledge about S and/or
A. The assumption that is most commonly made is that the components of S are statistically in-
dependent from one another, but sometimes some other assumptions are made, either explicitly
or implicitly. This is clarified ahead.
Several variants of the basic source separation problem exist. First of all, the number of
components of the mixture X may be smaller than the number of sources, and we speak of an
overcomplete or underdetermined problem, or it may be larger than the number of sources, and
we speak of an undercomplete or overdetermined problem.
Another variant involves the presence of additive noise. Equation (1.1) is then replaced
with
X = AS + N,

where N is a random vector of noise components. Yet another variant is the nonlinear mixture, which is the main subject of this book. In that case the mixture model becomes

X = M(S),
where M(·) represents a nonlinear mapping. We observe X and wish to recover S. Again, this
cannot be done without some further assumptions, which will be clarified ahead.
As in the linear case, the nonlinear mixture may have noise, and may be nonstationary
or noninstantaneous. However, the formal treatment of these variants is not much developed
to date, and in this book we shall limit ourselves, almost always, to stationary, instantaneous,
noiseless mixtures. Also, in most situations, we shall consider the mixture to be square; i.e., the
sizes of X and S will be the same, although there will be a few cases in which we shall consider
different situations.
FIGURE 1.1: Examples of linear and nonlinear mixtures. The top row shows mixtures of two supergaussian sources and the bottom row shows mixtures of two uniformly distributed sources
In the upper example the sources have densities that are strongly peaked (they are much more peaked than Gaussian distributions with the same variance, and for this reason they are called supergaussian).¹
The lower example shows uniformly distributed sources.
Note that, in both cases, a horizontal section through the distribution of the sources—
Figs. 1.1(a) and 1.1(d)—will yield a density which has the same shape (apart from an amplitude
scale factor) irrespective of where the section is made. The same happens with vertical sections.
These sections, once normalized, correspond to conditional distributions, and the fact that their
shape is independent of where the section is made means that the conditional distribution of
one of the sources given the other is independent of that other source. For example, p(s 1 | s 2 ) is
independent of s 2 . This means that the sources are independent from each other.
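To make this concrete, the following minimal sketch (the sources, mixing matrix, nonlinearity and dependence statistic are illustrative choices, not taken from the book) generates a supergaussian and a uniform source, mixes them linearly and nonlinearly, and checks numerically that mixing destroys independence:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000

# Two independent sources, in the spirit of Fig. 1.1: a supergaussian
# (Laplacian) one and a uniformly distributed one.
s1 = rng.laplace(size=n)
s2 = rng.uniform(-1, 1, size=n)
S = np.vstack([s1, s2])

# A square linear mixture X = A S.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S

# A simple nonlinear mixture (illustrative): a linear mixture followed by a
# componentwise nonlinearity, which curves originally straight lines.
Xnl = np.tanh(0.8 * X)

def dependence_hint(u, v):
    """Correlation between squares of u and v: (approximately) zero for
    independent variables, typically nonzero for dependent ones."""
    return np.corrcoef(u**2, v**2)[0, 1]

print("sources          :", dependence_hint(S[0], S[1]))   # ~ 0
print("linear mixture   :", dependence_hint(X[0], X[1]))   # clearly nonzero
print("nonlinear mixture:", dependence_hint(Xnl[0], Xnl[1]))
```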
The middle column of the figure shows examples of linear mixtures. In such mixtures,
lines that were originally straight remain straight, and lines that were originally parallel to each other remain parallel to each other. Linear operations can only perform scaling, rotation, and shear transformations, which do not affect the straightness or the parallelism of lines. But note that in these mixtures the two components are no longer independent from each other. This can be seen by the fact that sections (either horizontal or vertical) through the distribution yield densities that depend on where the section was made.

¹ More specifically, a random variable is called supergaussian if its kurtosis is positive, and subgaussian if its kurtosis is negative (the kurtosis of a Gaussian random variable is zero). The kurtosis of a random variable is defined as its fourth-order cumulant (see Section 2.4.1).
The rightmost column shows two examples of nonlinear mixtures. In the upper one, lines
that were originally straight now appear curved. In the lower one, lines that were originally
parallel to each other (opposite edges of the square) do not remain parallel to each other, even
though they remain straight. Both of these are telltale signs of nonlinear mixtures. Whenever
they occur we know that the mixture that is being considered is nonlinear. And once again, we
can see that the two random variables in each mixture are not independent from each other:
The densities corresponding to horizontal or vertical sections through the distribution depend
on where these sections were made.
1.2 SUMMARY
Source separation deals with the recovery of sources that are observed in a mixed condition.
Blind source separation (BSS) refers to source separation in situations in which there is little
knowledge about the sources and about the mixing process.
Most of the work done to date on source separation concerns linear mixtures. Variants
that have been studied include over- and undercomplete mixtures, noisy, nonstationary, and
convolutive (noninstantaneous) mixtures.
This book is mainly concerned with nonlinear mixtures. These may also be noisy, non-
stationary, and/or noninstantaneous, but we shall normally restrict ourselves to the basic case of
noise-free, stationary instantaneous mixtures, because other variants have still been an object of
very little study in the nonlinear case. The book analyzes with some detail the main nonlinear
separation methods, and gives an overview of other methods that have been proposed.
An assumption that is frequently made in blind source separation is that the sources are
independent random variables. The mutual dependence of random variables is often measured
by means of their mutual information.
CHAPTER 2

Linear Source Separation
In this chapter we shall make a brief overview of linear source separation in order to introduce
several concepts and methods that will be useful later for the treatment of nonlinear separation.
We shall only deal with the basic form of the linear separation problem (same number of
observed mixture components as of sources, no noise, stationary instantaneous mixture) since
this is, almost exclusively, the form with which we shall be concerned later, when dealing with
nonlinear separation.
2.1 STATEMENT OF THE PROBLEM

In the linear setting there is an unobserved random vector S, whose components are the sources, and an observed mixture vector X given by

X = AS,
where the sizes of S and X are equal, and A is a square, invertible matrix, which is usually called
the mixture matrix or mixing matrix. We observe N exemplars of X, generated by identically
distributed (but possibly nonindependent) exemplars of S. These exemplars of S, however, are
not observed. In the blind separation setting, we assume that we do not know the mixing matrix,
and that we have relatively little knowledge about S. We wish to recover the sources, i.e. the
components of S.
If we knew the mixing matrix, we could simply invert it and compute the sources by
means of the inverse
S = A −1 X.
However, in the blind source separation setting we do not know A, and therefore we have to
use some other method to estimate the sources. As we have said above, the assumption that
is most commonly made is that the sources (the components of S) are mutually statistically
independent.
In accordance with this assumption, one of the most widely used methods to recover the
sources consists of estimating a square matrix W such that the components of Y, given by
Y = W X, (2.1)
are as independent from one another as possible. If the independence is achieved (and if at most one of the sources is Gaussian), it can be shown that

Y = PDS,
where P is a permutation matrix² and D is a diagonal matrix. This means that the components
of Y are the components of S, possibly subject to a permutation and to arbitrary scalings.
This use of the independence criterion leads to the so-called independent component analysis
(ICA) techniques, which analyze mixtures into sets of components that are as independent from
one another as possible, according to some mutual dependence measure.
Note that simply imposing that the components of Y be uncorrelated with one another
does not suffice for separation. There is an infinite number of solutions of (2.1) in which
the components of Y are mutually uncorrelated, but in which each component still contains
contributions from more than one source.³ Also note that, for the same random vector X,
principal component analysis (PCA) [39] and independent component analysis normally yield
very different results. Although both methods yield components that are mutually uncorrelated,
the components extracted by PCA normally are not statistically independent from one another,
and normally contain contributions from more than one source, contrary to what happens with
ICA.
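A small numerical sketch of this point (the sources, mixture and whitening step are illustrative choices): after a PCA-like whitening the components have an identity covariance matrix, yet a fourth-order cross statistic, which would vanish for independent zero-mean components, shows that they are still dependent.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Independent, zero-mean, subgaussian (uniform) sources and a linear mixture.
S = rng.uniform(-1, 1, size=(2, n))
A = np.array([[1.0, 0.8],
              [0.3, 1.0]])
X = A @ S

# Whitening: the result has (approximately) an identity covariance matrix...
d, E = np.linalg.eigh(np.cov(X))
V = E @ np.diag(d**-0.5) @ E.T        # symmetric whitening matrix
Z = V @ X
print(np.cov(Z).round(3))             # ~ identity: uncorrelated components

# ...but the components are generally not independent: this fourth-order
# cross statistic would be ~0 for independent zero-mean variables.
print(np.mean(Z[0]**2 * Z[1]**2) - np.mean(Z[0]**2) * np.mean(Z[1]**2))
```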
There are several practical methods for estimating the matrix W based on the indepen-
dence criterion (see [27, 55] for overviews). In the next sections we shall briefly describe some
of them, focusing on those that will be useful later in the study of nonlinear separation.
2.2 INFOMAX
INFOMAX, also often called the Bell–Sejnowski method, is a method for performing linear
ICA; i.e., it attempts to transform the mixture X, according to (2.1), into components Yi which
are as independent from one another as possible.
The INFOMAX method is based on the structure depicted in Fig. 2.1. In the left-hand
side of the figure we recognize the implementation of the separation operation (2.1). The result
of the separation is Y. The ψi blocks and the Zi outputs are auxiliary, being used only during
the optimization process.
² A permutation matrix has exactly one element per row and one element per column equal to 1, all other elements being equal to zero.
³ However, decorrelation, in a different setting, can be a criterion for separation, as we shall see in Section 2.3.
FIGURE 2.1: Structure used by INFOMAX. The W block performs a product by a matrix, and is what
performs the separation proper. The separated outputs are Yi . The ψi blocks, implementing nonlinear
increasing functions, are auxiliary, being used only during optimization
In the paper that introduced it [19], INFOMAX was justified on the basis of an informa-
tion maximization criterion (hence its name). However, the method has later been interpreted
as a maximum likelihood method [83] and also as a method based on the minimization of
the mutual information I (Y ) [55] (recall that mutual information was seen, in Section 1.1.2,
to be a measure of statistical dependence). Here we shall use the interpretation based on the
minimization of mutual information, because this will be the most useful approach when we
deal with nonlinear separation methods, later on.
For presenting the mutual information interpretation, we shall start by assuming that each
of the ψi functions is equal to the cumulative distribution function (CDF) of the corresponding
random variable Yi (we shall denote that cumulative function by FYi ). Then, all of the Zi will
be uniformly distributed in (0, 1) (see Appendix A.1). Therefore, p(zi ) will be 1 in (0, 1) and
zero elsewhere, and the entropy of each of the Zi will be zero, H(Zi) = 0.
As shown in Appendix A.4, the mutual information is not affected by performing in-
vertible, possibly nonlinear, transformations on the individual random variables. In the above
setting, since the ψi functions are invertible, this means that I (Y ) = I (Z ). Therefore,
I(Y) = I(Z) = Σ_i H(Z_i) − H(Z) = −H(Z).    (2.2)
This result shows that minimizing the mutual information of the components of Y can be
achieved by maximizing the output entropy, H(Z ).4 This is an advantage, since the direct min-
imization of the mutual information is generally difficult, and maximizing the output entropy
is easier, as we shall see next.
The output entropy of the system of Fig. 2.1 is related to the input entropy by⁵

H(Z) = H(X) + E[log |det J|],

⁴ We are considering continuous random variables, and therefore H(·) represents the differential entropy, throughout this and the next chapter (see Appendices A.2 and A.4). We shall designate it simply as entropy, for brevity.
⁵ See Appendix A.2.1.
where J is the Jacobian of the transformation from X to Z. Since H(X) does not depend on the system's parameters, maximizing H(Z) amounts to maximizing E[log |det J|], which in practice is estimated by an average over the training set. Denoting by J_m the Jacobian corresponding to the mth exemplar of X in the training set (the mth training pattern), and by M the number of exemplars in the training set, that average, J, is what shall effectively be used as the objective function to be maximized.
The maximization is usually done by means of gradient-based methods. We have
J = (1/M) Σ_{m=1}^{M} L_m    (2.3)
with
L_m = log |det J_m|.    (2.4)
For simplicity we shall drop, from now on, the subscript m, which refers to the training pattern
being considered. The determinant of the Jacobian in the preceding equation is given by

det J = det W ∏_j ψ′_j(y_j).
Maximization of J through the relative (natural) gradient leads to the update rule

ΔW ∝ (I + ξ yᵀ) W,

where ξ is the column vector with components ξ_i = ψ″_i(y_i)/ψ′_i(y_i).
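As an illustration of an update of this general form, the sketch below uses the classical fixed logistic output nonlinearity, for which ξ = 1 − 2z with z the logistic function of y (the adaptive, maximum-entropy estimation of the nonlinearities is discussed below); the sources, mixing matrix, step size and number of iterations are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Two supergaussian (Laplacian) sources and a square linear mixture.
S = rng.laplace(size=(2, n))
A = np.array([[1.0, 0.5],
              [0.7, 1.0]])
X = A @ S

W = np.eye(2)
lr = 0.1
for it in range(1000):
    Y = W @ X
    Z = 1.0 / (1.0 + np.exp(-Y))          # logistic output nonlinearities
    xi = 1.0 - 2.0 * Z                    # psi''/psi' for the logistic case
    # Relative/natural-gradient ascent on the output entropy (batch average):
    W += lr * (np.eye(2) + (xi @ Y.T) / n) @ W

print(np.round(W @ A, 3))   # should be close to a scaled permutation matrix
```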
The distribution of Y has been assumed to be fixed, and therefore I (Y ) is constant. Maxi-
mization of H(Z) thus corresponds to the maximization of Σ_i H(Z_i). Since the parameter
vectors wi can be varied independently from one another, this corresponds to the simultaneous
maximization of all the H(Zi ) terms.
Given that ψ̂i has codomain (0, 1), Zi is limited to that interval. We therefore have,
for each Zi , the maximization of the entropy of a continuous random variable, constrained
to the interval (0, 1). The maximum entropy distribution under this constraint is the uniform
distribution (see Appendix A.2). Therefore the maximization will lead each Zi to become as
close as possible to a uniformly distributed variable in (0, 1), subject only to the restrictions
imposed by the limitations of the family of functions ψ̂i.⁶
Since, at the maximum, Zi is approximately uniformly distributed in (0, 1), and since
ψ̂i (yi , wi ) is, by construction, a nondecreasing function of yi , it must be approximately equal
to the CDF FYi , as desired (see Appendix A.1). Therefore, maximization of H(Z ) leads the
ψ̂i (yi , wi ) functions to approximate the corresponding CDFs, subject only to the limitations of
the family of approximators ψ̂i (yi , wi ).
Let us now drop the assumption that W is constant, and assume instead that it is also being
optimized by maximization of H (Z ). Consider what happens at a maximum of H(Z). Each
of the H (Zi ) must be maximal, because otherwise H (Z ) could still be increased further, by
increasing that specific H(Zi ) while keeping the other Z j and Y fixed—see Eq. (2.8). Therefore,
at a maximum of H(Z ), the Zi variables must be as uniform as possible, and the ψ̂i functions
must be the closest possible approximations of the corresponding CDFs.
This shows that the maximization of the output entropy will lead to the desired estima-
tion of the CDFs by the ψ̂i functions while at the same time optimizing W. In practice, during
⁶ We are disregarding the possibility of stopping at a local maximum during the optimization. That would be a contingency of the specific optimization procedure being used, and not of the basic method itself.
the optimization process, when W changes rapidly, the ψ̂i functions will follow the corre-
sponding CDFs with some lag, because the CDFs are changing rapidly. During the asymptotic
convergence to the maximum of H(Z), W will change progressively slower, and the ψ̂i func-
tions will approximate the corresponding CDFs progressively more closely, tending to the best
possible approximations.
In summary, the maximization of the output entropy leads to the minimization of the
mutual information I (Y), with the simultaneous adaptive estimation of the ψi nonlinearities.
Before proceeding to discuss the practical implementation of this method for the esti-
mation of the nonlinearities, we note that it is based on the optimization of the same objective
function that is used for the estimation of the separating matrix. Therefore, the whole opti-
mization is based on a single objective function. There are maximization methods (e.g. methods
based on gradient ascent [4]) which are guaranteed to converge when there is a single function
to be optimized, which is differentiable and upper-bounded, as in this case. Therefore there is
no risk of instability, contrary to what happens if the ψi nonlinearities are estimated based on
some other criterion.
⁷ We shall designate each linear block by the same letter that denotes the corresponding matrix, because this will not cause any confusion.
FIGURE 2.2: Structure of the INFOMAX system with adaptive estimation of the nonlinearities. The
part marked ψ corresponds to all the ψi blocks taken together. See the text for further explanation
The three rightmost blocks represent all the ψi blocks of Fig. 2.1 taken together. Block
B̄ performs a product by matrix B̄, which is the weight matrix of the hidden units of all the
ψi blocks, taken as forming a single hidden layer. As in ȳ, the overbar in B̄ indicates that this
matrix incorporates the elements relative to the bias terms of the hidden units. The output of
this block consists of the vector of input activations of the units of the global hidden layer of all
the ψi blocks.
Block Φ is nonlinear and applies to its input vector, on a per-component basis, the
nonlinear activation functions of the units of that hidden layer. We shall denote these activation
functions by φi . Usually these activation functions are the same for all units, normally having a
sigmoidal shape. A common choice is the hyperbolic tangent function.
The output of the Φ block is the vector of output activations of the global hidden layer.
Block C multiplies this vector, on the left, by matrix C, which is the weight matrix of the linear
output units of the ψi blocks. The output of the C block is the output vector z.
Of course, since there are no interconnections between the various ψi blocks, matrices B̄
and C have a special structure, with a large number of elements equal to zero. This does not
affect our further analysis, except for the fact that those elements are never changed during the
optimization, being always kept equal to zero, to keep the correct network structure.
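To make the block structure of B̄ and C concrete, the following sketch (with arbitrary dimensions) builds the corresponding masks; during the optimization the weight updates are simply multiplied by these masks, so that the nonexistent interconnections stay exactly at zero.

```python
import numpy as np

n = 2        # number of separated components y_i
h = 10       # hidden units in each psi_i block

# Bbar maps [y; 1] (bias included) to the n*h hidden input activations:
# only the column of y_i, plus the bias column, feeds the hidden units of psi_i.
Bbar_mask = np.zeros((n * h, n + 1))
C_mask = np.zeros((n, n * h))
for i in range(n):
    rows = slice(i * h, (i + 1) * h)
    Bbar_mask[rows, i] = 1.0      # weights from y_i to the hidden units of psi_i
    Bbar_mask[rows, n] = 1.0      # bias weights of those hidden units
    C_mask[i, rows] = 1.0         # output weights of psi_i

# A gradient step would then look like:
#   Bbar += lr * dBbar * Bbar_mask
#   C    += lr * dC    * C_mask
print(Bbar_mask.astype(int))
print(C_mask.astype(int))
```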
We wish to compute the gradient of the output entropy H(Z ) relative to the weights
of this network, so that we can use gradient-based maximization methods. We shall not give
explicit expressions of the components of the gradient here, because they are somewhat complex
(and would become even more complex later, in the nonlinear MISEP method). Instead, we
shall derive a method for computing these gradient components, based on the backpropagation
method for multilayer perceptrons [4, 39]. As normally happens with backpropagation, this
method is both more compact and more efficient than a direct computation of each of the
individual components of the gradient.
Let us take Eq. (2.4), which we reproduce here:

L_m = log |det J_m|.
J = C Φ′ B W,    (2.9)

where Φ′ is a diagonal matrix whose diagonal elements are φ′_i, i.e. the derivatives of the activation functions of the corresponding hidden units, for the specific input activations that they receive for the input pattern being considered. B is the weight matrix B̄ stripped of the column corresponding to the bias weights (those weights disappear from the equations when differentiating relative to x).
The network that computes J according to this equation is shown in Fig. 2.3. The lower
part is what computes (2.9) proper. It propagates matrices (this is depicted in the figure by the
“3-D arrows”). Its input is the identity matrix I of size n × n (n being the number of sources and
of mixture components). Block W performs a product, on the left, by matrix W, yielding matrix
W itself at the output (this might seem unnecessary but is useful later, when backpropagating,
to allow the computation of derivatives relative to the elements of W ). The following blocks
also perform products, on the left, by the corresponding matrices.
It is clear that the lower chain computes J as per (2.9). The upper part of the network
is needed because the derivatives in the diagonal matrix Φ′ depend on the input pattern. The
two leftmost blocks of the upper part compute the input activations of the nonlinear hidden
units, and transmit these activations, through the gray-shaded arrow, to block Φ′, allowing it
to compute the activation function derivatives. Supplying this information is the reason for the
presence of the upper part of the network.
FIGURE 2.3: Network for computing the Jacobian. The lower part is what computes the Jacobian
proper, and is essentially a linearized version of the network of Fig. 2.2. The upper part is identical to
the network of Fig. 2.2. The two rightmost blocks, shown in light gray, are not necessary for computing
J , and are shown only for better correspondence with Fig. 2.2. See the text for further explanation
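The sketch below carries out the computation expressed by (2.9) for one input pattern, and numerically checks the quantity ∂L/∂J that is fed to the backpropagation pass below (weights, dimensions and the tanh activations are illustrative, and the block structure of B̄ and C is ignored here for brevity).

```python
import numpy as np

rng = np.random.default_rng(3)
n, h = 2, 10                      # components, and hidden units per psi_i block

W = rng.normal(size=(n, n))
Bbar = np.abs(rng.normal(size=(n * h, n + 1))) * 0.3   # positive initialization
C = np.abs(rng.normal(size=(n, n * h))) / np.sqrt(n * h)
B = Bbar[:, :n]                   # Bbar without the bias column

x = rng.normal(size=n)            # one mixture pattern

# Upper chain of Fig. 2.3: y = W x, then the hidden input activations s.
y = W @ x
s = Bbar @ np.append(y, 1.0)

# Eq. (2.9): J = C Phi' B W, with Phi' holding the activation derivatives.
Phi_prime = np.diag(1.0 - np.tanh(s) ** 2)             # tanh'(s)
J = C @ Phi_prime @ B @ W
L = np.log(abs(np.linalg.det(J)))

# Finite-difference check that dL/dJ = J^{-T}.
eps = 1e-6
num = np.zeros_like(J)
for i in range(n):
    for j in range(n):
        Jp = J.copy()
        Jp[i, j] += eps
        num[i, j] = (np.log(abs(np.linalg.det(Jp))) - L) / eps
print(np.abs(num - np.linalg.inv(J).T).max())   # ~ 0 (finite-difference error)
```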
We can, therefore, use the standard backpropagation procedure.⁸ The inputs to the backpropagation network are

∂L/∂J = J^{−T},

where the −T superscript denotes the transpose of the inverse of the matrix.

⁸ The fact that the network's outputs form a matrix instead of a vector is unimportant: we can form a vector by arranging all the matrix elements into a single column.
The backpropagation method to be used is rather standard, except for two aspects that
we detail here:
• All the blocks that appear in the network of Fig. 2.3 are standard neural network
blocks, with the exception of block Φ′. We shall now see how to backpropagate through
this block. A unit of this block, i.e. a unit that performs the product by the derivative of a hidden unit's activation function, is shown in Fig. 2.4(a). The unit receives an input (that we denote by g_ij) from block B, and receives the activation value s_i (which is necessary to compute the derivative) from the upper part of the network, through the gray-shaded arrow of Fig. 2.3. This unit produces the output (that we denote by h_ij)

h_ij = φ′(s_i) g_ij.

FIGURE 2.4: Backpropagation through the Φ′ block. (a) Forward unit. (b) Backpropagation unit. Each box denotes a product by the value indicated inside the box
• The activation functions of the hidden layer’s units are chosen to be sigmoids with
values in (−1, 1), such as tanh functions.
• The vector of weights leading to each linear output unit is normalized, after each gradient update, to a Euclidean norm of 1/√h, where h is the number of hidden units feeding that output unit.
This guarantees that the MLPs’ outputs will always be in the interval (−1, 1). The constraint
to nondecreasing functions can also be implemented in several ways. The one that has proved to be best is in fact a soft constraint:
• The hidden units’ sigmoids are all chosen to be increasing functions (e.g. tanh).
• All the MLP’s weights are initialized to positive values (with the exception of the hidden
units’ biases, which may be negative).
In strict terms, this only guarantees that, upon initialization, the MLPs will implement increas-
ing functions. However, during the optimization process, the weights will tend to stay positive
because the optimization maximizes the output entropy, and any sign change in a weight would
tend to decrease the output entropy. In practice it has been found that negative weights occur
very rarely in the optimizations, and when they occur they normally switch back to positive
after a few iterations.
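The following sketch builds one of the ψ̂i networks with the constraints just described (tanh hidden units, positive initialization except for the hidden biases, and output weights normalized to Euclidean norm 1/√h); the training loop is omitted, and the sizes and initialization scales are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
h = 10                                        # hidden units

w_hid = np.abs(rng.normal(size=h)) * 0.5      # weights y -> hidden (positive)
b_hid = rng.normal(size=h) * 0.5              # hidden biases (may be negative)
w_out = np.abs(rng.normal(size=h))            # output weights (positive)

def normalize_output_weights(w):
    """Renormalize to Euclidean norm 1/sqrt(h), keeping the output in (-1, 1)."""
    return w / (np.linalg.norm(w) * np.sqrt(h))

w_out = normalize_output_weights(w_out)       # to be repeated after each update

def psi_hat(y):
    """Increasing, bounded nonlinearity implemented by the constrained MLP."""
    return np.tanh(np.outer(y, w_hid) + b_hid) @ w_out

print(psi_hat(np.linspace(-3, 3, 7)))         # increasing values inside (-1, 1)
```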
An Example. We shall now present an example of the use of INFOMAX, with entropy-based estimation of the nonlinearities, for the separation of two sources. One of the sources was
strongly supergaussian, while the other was uniformly distributed (and therefore subgaussian).
Figure 2.5(a) shows the joint distribution of the sources.
FIGURE 2.5: Scatter plots corresponding to the example of linear separation by INFOMAX with
maximum entropy estimation of the output nonlinearities
Figure 2.5(b) shows the mixture distribution. Figure 2.5(c) shows the result of separation, and
we can see that it was virtually perfect. This can be confirmed by checking the product of the
separating matrix that was estimated, W, with the mixing matrix
WA = [  8.688   0.002
       −0.133   8.114 ].
This product is very close to a diagonal matrix, confirming the quality of the separation.
FIGURE 2.6: (a) Function 1. (b) Function 2
The MLPs that were used to estimate the cumulative distributions had 10 hidden units
each. Figure 2.6 shows the functions that were estimated by these networks. We see that they
match well (qualitatively, at least) the cumulative distributions of the corresponding sources.
The quality of the estimation can be checked better by observing the distribution of the auxiliary
output z, shown in Fig. 2.5(d). We can see that this distribution is nearly perfectly uniform
within a square. On the one hand, this means that the cumulative functions were quite well
estimated. On the other hand, it also confirms that the extracted components were virtually
independent from each other, because otherwise z could not have a distribution with this
form.
This example illustrates both the effectiveness of INFOMAX in performing linear sepa-
ration and the effectiveness of the maximum entropy estimation of the cumulative distributions.
As we mentioned above, INFOMAX with this estimation method corresponds to the linear
version of MISEP, whose extension to nonlinear separation will be studied in Section 3.2.1.
2.2.3 Estimation of the Score Functions

In the gradient expressions of INFOMAX, the nonlinearities ψ_j intervene only through the ratio

ϕ̂_j(y_j) = ψ″_j(y_j) / ψ′_j(y_j).
We can therefore work directly with the functions ϕ̂ j instead of the functions ψ j . Furthermore
we know that, ideally, ψ j (y j ) = FY j (y j ), where FY j is the CDF of Y j . Therefore the objective
is to approximate the functions

ϕ_j(y_j) = F″_{Y_j}(y_j) / F′_{Y_j}(y_j) = p′(y_j) / p(y_j) = (d/dy_j) log p(y_j).    (2.11)
These ϕ j functions play an important role in ICA, and are called score functions. Since their
definition involves pdfs, one would expect that their estimation would also involve an estima-
tion of the probability densities. However, in [97] Taleb and Jutten proposed an interesting
estimation method that involves the pdfs only indirectly, through expected values that can
easily be estimated by averaging on the training set. This is the method that we shall now
study.
For the derivation of this method we shall drop, for simplicity, the index j , referring to
the component under consideration. The method assumes that we have a parameterized form
of our estimate, ϕ̂(y, w), where w is a vector of parameters (Taleb and Jutten proposed using
a multilayer perceptron with one input and one output for implementing ϕ̂; then, w would
be the vector of weights of the perceptron). The method uses a minimum mean squared error
criterion, defining the mean squared error as
ε = (1/2) E{[ϕ̂(Y, w) − ϕ(Y)]²},    (2.12)
where E(·) denotes statistical expectation, and the factor 1/2 is for later convenience. The minimization of ε is performed through gradient descent. The gradient of ε relative to the parameter vector is
∂ε/∂w = E{[ϕ̂(Y, w) − ϕ(Y)] ∂ϕ̂(Y, w)/∂w}
      = ∫ p(y) [ϕ̂(y, w) − p′(y)/p(y)] ∂ϕ̂(y, w)/∂w dy
      = ∫ p(y) ϕ̂(y, w) ∂ϕ̂(y, w)/∂w dy − ∫ p′(y) ∂ϕ̂(y, w)/∂w dy.

Integrating the last term by parts, and assuming that p(y) vanishes at infinity, that term becomes −∫ p(y) ∂²ϕ̂(y, w)/(∂y ∂w) dy, so that the gradient involves the pdf only through expectations:

∂ε/∂w = E[ϕ̂(Y, w) ∂ϕ̂(Y, w)/∂w + ∂²ϕ̂(Y, w)/(∂y ∂w)],    (2.13)

which can be estimated by averaging over the training set.
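A sketch of this estimation idea for a toy model that is linear in its parameters (the basis functions y and y³, the sample, the step size and the number of iterations are illustrative choices), using the expectation-only form of the gradient derived above; for a standard Gaussian sample the estimate should approach the true score ϕ(y) = −y.

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(size=20000)          # sample of one separated component

# phi_hat(y, w) = w[0]*y + w[1]*y**3
F = np.vstack([y, y**3])                                        # d phi_hat / d w
Fp_mean = np.vstack([np.ones_like(y), 3 * y**2]).mean(axis=1)   # E[d2 phi_hat / dy dw]

w = np.zeros(2)
lr = 0.05
for it in range(2000):
    grad = (F * (w @ F)).mean(axis=1) + Fp_mean     # expectation-only gradient
    w -= lr * grad

print(np.round(w, 2))   # approximately [-1, 0]: the score of a standard
                        # Gaussian is phi(y) = -y
```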
• What is the best way to interleave optimization iterations for each of the objective func-
tions? A simple solution is to alternate a step of estimation of the unmixing matrix with
one of estimation of the score functions. However, it could make more sense to perform
several successive steps of optimization of the score functions for each step of optimiza-
tion of the matrix, so that the optimization of the matrix would be done with rather good
estimates of the score functions. Another possibility that has been proposed [97] is to
use a stochastic optimization for the score function estimates, in which the expectation
is dropped from Eq. (2.13), and the resulting stochastic gradient is used to update the
score function estimate once for each pattern that is processed during the optimization.
• More important from a theoretical viewpoint is that this kind of interleaved optimization of two or more objective functions does not normally have a guaranty of convergence.

2.3 EXPLOITING THE TIME-DOMAIN STRUCTURE

Some separation methods exploit the time structure of the sources through the time-lagged correlation matrices⁹

Cτ(t) = E[Y(t) Yᵀ(t + τ)],
where E(·) denotes statistical expectation. If the components of Y are independent, then Cτ (t)
will be diagonal for all values of t and τ . The methods that we are studying implicitly assume
that the sources are jointly ergodic. This implies that they are jointly stationary, and therefore
Cτ = ⟨Y(t) Yᵀ(t + τ)⟩,

where ⟨·⟩ denotes an average over the time variable t. If the components are independent, the Cτ matrices should be diagonal for all τ. The ICA methods that use the time domain structure enforce that Cτ be diagonal for more than one value of τ, and differ from one another in the details of how they do so. In practice, Cτ is replaced with an approximation, obtained by computing an average over a finite amount of time.

⁹ Actually it is their covariance that must be zero. But it is common to assume that the means of the random variables have been subtracted out, and then the correlation and the covariance are equal. In the remainder of this chapter we shall assume that the means have been subtracted out.
The method of Molgedey and Schuster enforces the simultaneous diagonality of C0 (i.e.
of the correlation with no delay) and of Cτ for one specific, user-chosen value of the delay τ .
These authors showed that this simultaneous diagonality condition can be transformed into
a generalized eigenvalue problem, which can be solved by standard linear algebra methods,
yielding the separating matrix W. A sufficient condition for separation is that the sources have distinct normalized autocorrelations for the chosen delay τ.

The TDSEP method enforces the approximate joint diagonality of Cτ for a whole set of delays. It starts by prewhitening the mixture,

Z = BX,

where B is a prewhitening matrix, chosen so that the components of Z are uncorrelated and have unit variance.
Matrix B can be found by one of several methods, one of which is PCA. Once a prewhitening
matrix is found, it can be shown that the separating matrix is related to it by
W = QB,

where Q is an orthogonal matrix.
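A sketch of the Molgedey–Schuster idea in this form (the AR(1) sources, the choice of lag, and the symmetrization of the lagged correlation matrix are illustrative implementation choices):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(6)
T, tau = 20000, 1

def ar1(a, T):
    """A source with time structure: first-order autoregressive process."""
    s = np.zeros(T)
    e = rng.normal(size=T)
    for t in range(1, T):
        s[t] = a * s[t - 1] + e[t]
    return s

S = np.vstack([ar1(0.9, T), ar1(0.3, T)])     # different autocorrelations
A = np.array([[1.0, 0.8515],
              [0.5525, 1.0]])
X = A @ S
X = X - X.mean(axis=1, keepdims=True)         # means subtracted out

C0 = (X @ X.T) / T
Ct = (X[:, :-tau] @ X[:, tau:].T) / (T - tau)
Ct = (Ct + Ct.T) / 2                          # symmetrized lagged correlation

# Simultaneous diagonalization of C0 and C_tau as a generalized eigenproblem.
vals, vecs = eigh(Ct, C0)
W = vecs.T                                    # rows = separating vectors
print(np.round(W @ A, 3))    # close to a scaled permutation matrix
```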
2.3.1 An Example
We show an example of the separation, by TDSEP, of a mixture of two sources: a speech and a
music signal. The source signals are shown in Fig. 2.7(a).
These signals were mixed using the mixture matrix

A = [ 1.0000   0.8515
      0.5525   1.0000 ].
The components of the mixture are shown in Fig. 2.7(b). TDSEP was applied to this mixture,
using the set of delays {0, 1, 2, 3}. The separation results are shown in Fig. 2.7(c). They are
virtually identical to the original sources. This can be confirmed by computing the product of the separating and mixture matrices,

WA = [ 10.2127   −0.0288
       −0.0286    5.9523 ],

which is very close to a diagonal matrix.

¹⁰ The reflection is never needed in practice, because it would just change the signs of some components, and the scale indeterminacy of ICA already includes a sign indeterminacy.
¹¹ Givens rotations of a multidimensional space are rotations that involve only two coordinates at a time, leaving the remaining coordinates unchanged.
2.4 OTHER METHODS: JADE AND FASTICA

2.4.1 JADE
Some criteria that are frequently used in linear ICA are based on cumulants. The cumulants of a
distribution are related to the coefficients of the expansion of the cumulant-generating function
as a power series. More specifically, for a univariate distribution, they are defined implicitly as
the κi coefficients in the power series expansion
∞
( j ω)i
ψ(ω) = κi ,
i=0
i!
where j is the imaginary unit and ψ is the cumulant-generating function, which is defined as
the logarithm of the moment-generating function: ψ(ω) = log ϕ(ω). The moment-generating
function is the Fourier transform of the probability density function,
ϕ(ω) = ∫ p(x) e^{jωx} dx.
The first four cumulants correspond to relatively well-known statistical quantities. The
first cumulant of a distribution is its mean, and the second is its variance. The third cumulant
is the so-called skewness, which is an indicator of the distribution’s asymmetry. The fourth
cumulant is called kurtosis, and is an indicator of the distribution’s deviation from Gaussianity.
For more details see [55], for example.
Cumulants can also be defined, in a similar way, for joint distributions of several variables.
For example, for two random variables Y1 and Y2 , the joint cumulants κkl , of orders k and l
relative to Y1 and to Y2 respectively, are given implicitly by the series expansion
∞
( j ω1 )k ( j ω2 )l
ψ(ω1 , ω2 ) = κkl ,
k,l=0
k! l!
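An empirical check of the sign of the kurtosis for the three kinds of distribution mentioned above (the sample size and the particular distributions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200000

def kurtosis(x):
    """Fourth-order cumulant of (zero-mean) data: E[x^4] - 3 (E[x^2])^2."""
    x = x - x.mean()
    return np.mean(x**4) - 3 * np.mean(x**2)**2

print(kurtosis(rng.laplace(size=n)))          # > 0: supergaussian
print(kurtosis(rng.normal(size=n)))           # ~ 0: Gaussian
print(kurtosis(rng.uniform(-1, 1, size=n)))   # < 0: subgaussian
```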
2.4.2 FastICA
Another criterion that is used for linear ICA is based on negentropy, a concept that we shall
define next. Recall that, among all distributions with a given variance, the Gaussian distribution
is the one with the largest entropy (see Appendix A.2). Consider a random variable Y , and let
Z be a Gaussian random variable with the same variance. The quantity

J(Y) = H(Z) − H(Y)

is called the negentropy of Y. It is always nonnegative, and is zero only if Y is Gaussian.
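As a concrete reference point, the sketch below applies the well-known one-unit FastICA fixed-point iteration on prewhitened data, with the common nonlinearity g = tanh; it is a simplified sketch (sources, mixture and iteration count are illustrative), not a full FastICA implementation.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 20000

# A whitened mixture of two independent non-Gaussian sources.
S = np.vstack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
V = E @ np.diag(d**-0.5) @ E.T
Z = V @ X                                     # prewhitened data

# One-unit fixed-point iteration, g = tanh, g' = 1 - tanh^2.
w = rng.normal(size=2)
w /= np.linalg.norm(w)
for it in range(50):
    wz = w @ Z
    w_new = (Z * np.tanh(wz)).mean(axis=1) - (1 - np.tanh(wz)**2).mean() * w
    w = w_new / np.linalg.norm(w_new)

print(np.round(w @ V @ A, 3))   # one entry dominates: one source is extracted
```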
2.5 SUMMARY
We have made a short overview of linear source separation and linear ICA, giving a brief
introduction to some of the methods that are used to perform them. Among these we can
distinguish two classes. Methods such as INFOMAX, JADE, and FastICA are based (explicitly
or implicitly) on statistics of order higher than 2, and demand that at most one source be
Gaussian. This means that their performance will be poor if more than one source is close to
Gaussian. However, most of the sources that are of practical interest have distributions that
deviate markedly from Gaussian (for example, speech is strongly supergaussian, while images
CHAPTER 3
Nonlinear Separation
Nonlinear source separation, as discussed in this book, deals with the following problem: There
is an unobserved random vector S whose components are called sources and are mutually
statistically independent. We observe a mixture vector X, which is related to S by
X = M(S ),
M being a vector-valued invertible function, which is, in general, nonlinear. Unless stated
otherwise, the size of X is assumed to be equal to the size of S; i.e., we consider a so-called
square problem. We wish to find a transformation
Y = F(X ),
where F is a vector-valued function and the size of Y is the same as those of X and S, such
that the components of Y are equal to the components of S, up to some indeterminacies to be
clarified later. These indeterminacies will include, at least, permutation and scaling, as in the
linear case. As we can see, we are considering only the simplest nonlinear setting, in which there
is no noise and the mixing is both instantaneous and invariant.
There is a crucial difference between linear and nonlinear separation, which we should
immediately emphasize. While in a linear setting, the independence of the components of
Y suffices, under rather general conditions, to guarantee the recovery of the sources (up to
permutation and scaling), there is no such guaranty in the nonlinear setting. This has been
shown by several authors [30, 55, 73]. We shall show it for the case of two sources, since the
generalization to more sources is then straightforward. Assume that we have a two-dimensional
random vector X and that we form the first component of the output vector Y as some function
of the mixture components,
Y1 = f 1 (X1 , X2 ).
The function f 1 is arbitrary, subject only to the conditions of not being constant and of Y1
having a well-defined statistical distribution. Now define an auxiliary random variable Z as
Z = f 2 (X1 , X2 ),
and form the second output component as

Y₂ = F_{Z|Y₁}(Z | Y₁).
For each value of Y1 , Y2 is the CDF of Z given that value of Y1 . Therefore Y2 will be uniformly
distributed in (0, 1) for all values of Y1 (see Appendix A.1). As a consequence, Y2 will be
independent from Y1 , and Y will have independent components.
Given the large arbitrariness in the choice of both f 1 and f 2 , we must conclude that there
is a very large variety of ways to obtain independent components. This variety is much larger
than the variety of solutions in which the sources are separated. Therefore, in most cases, each
of the components of Y will depend on both S1 and S2 . This means that the sources will not be
separated at the output, even though the output components will be independent.
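The sketch below illustrates this construction in a case where the conditional CDF has a closed form (jointly Gaussian X, obtained here by linearly mixing Gaussian sources; the example is only meant to show the mechanism, since Gaussian sources are anyway a degenerate case for ICA).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
n = 50000

# A linear mixture X of two independent sources (Gaussian, only so that the
# conditional CDF below has a simple closed form).
S = rng.normal(size=(2, n))
A = np.array([[1.0, 0.5],
              [0.8, 1.0]])
X1, X2 = A @ S

Y1, Z = X1, X2                     # Y1 = f1(X1, X2), Z = f2(X1, X2)

# Y2 = F_{Z|Y1}(Z | Y1): for jointly Gaussian variables the conditional
# distribution is Gaussian with known mean and variance.
rho = np.corrcoef(Y1, Z)[0, 1]
s1, s2 = Y1.std(), Z.std()
Y2 = norm.cdf((Z - rho * (s2 / s1) * Y1) / (s2 * np.sqrt(1 - rho**2)))

# Y1 and Y2 are (essentially) independent...
print(np.corrcoef(Y1**2, Y2**2)[0, 1])                  # ~ 0
# ...yet Y2 still depends on both sources: nothing has been separated.
print(np.corrcoef(Y2, S[0])[0, 1], np.corrcoef(Y2, S[1])[0, 1])
```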
This raises an important difficulty: if we perform nonlinear ICA, i.e. if we extract inde-
pendent components from X, we have no guaranty that each component will depend on a single
source. ICA becomes ill-posed, and ICA alone is not a means for performing source separation,
in the nonlinear setting.
Researchers have addressed this difficulty in three main ways:
• One way has been to restrict the range of allowed nonlinearities, so that ICA becomes
well-posed again. When we restrict the range of allowed nonlinearities, an important
consideration is whether there are practical situations in which the nonlinearities are
(at least approximately) within our restricted range. The case that has met a significant
practical applicability is the restriction to post-nonlinear (PNL) mixtures, described
ahead.
• Another way to address the ill-posedness of nonlinear ICA has been the use of regu-
larization. This corresponds to placing some “soft” extra assumptions on the mixture
process and/or on the source signals, and trying to smoothly enforce those assumptions
in the separation process. The kind of assumption that has most often been used is that
the mixture is only mildly nonlinear. This is the approach taken in the MISEP and
ensemble learning methods, to be studied in Sections 3.2.1 and 3.2.2 respectively.
• The third way to eliminate the ill-posedness of nonlinear ICA has been to use extra
structure (temporal or spatial) that may exist in the sources. Most sources of practical
interest have temporal or spatial structure that can be used for this purpose. This is the
approach taken in the kTDSEP method, to be studied in Section 3.2.3. This approach
is also used, together with regularization, in the second application example of the
ensemble learning method (p. 62).
In the next sections we shall study, in some detail, methods that use each of these approaches,
starting with the constraint to PNL mixtures.
3.1 POST-NONLINEAR MIXTURES

In a post-nonlinear (PNL) mixture the sources first undergo a linear mixing stage,

U = AS.

Each component of U then goes through a nonlinearity, yielding the observations

Xi = fi(Ui),
where the functions f i are invertible. As before, we shall assume that the sizes of S, U, and X
are the same.
This is the structure of the so-called PNL mixture process. Its interest resides mainly in
the fact that it corresponds to a well-identified practical situation. PNL mixtures arise whenever,
after a linear mixing process, the signals are acquired by sensors that are nonlinear. This is a
situation that sometimes occurs in practice. The other fact that makes these mixtures interesting
is that they present almost the same indeterminations as linear mixtures, as we shall see below.
FIGURE 3.1: Post-nonlinear mixture process. Block A performs a linear mixture (i.e. a product by a
matrix). The f i blocks implement nonlinear invertible functions
FIGURE 3.2: Separating structure for post-nonlinear mixtures. The g i functions should, desirably, be
the inverses of the corresponding f i in the mixture process. Block W is linear, performing a product by
a matrix
Naturally, the separation of PNL mixtures uses a structure that is a “mirror” of the mixing
structure (Fig. 3.2). Each mixture component is first linearized by going through a nonlinearity,
whose purpose is to invert the nonlinearity present in the mixture process, and then the linearized
mixture goes through a standard linear separation stage. Formally,
Vi = gi(Xi),
Y = WV.
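The sketch below runs data through the PNL mixing process of Fig. 3.1 and then through the separating structure of Fig. 3.2; the sensor nonlinearities are illustrative, and the true inverses gi = fi⁻¹ and the true matrix W = A⁻¹ are used only to show the structure (a blind method has to estimate them from X alone).

```python
import numpy as np

rng = np.random.default_rng(10)
n = 10000

# Linear mixing stage: U = A S.
S = np.vstack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])
U = A @ S

# Componentwise invertible sensor nonlinearities: X_i = f_i(U_i).
X = np.vstack([np.tanh(U[0]), U[1] + 0.1 * U[1]**3])

# Mirror separating structure: V_i = g_i(X_i), then Y = W V.
V0 = np.arctanh(np.clip(X[0], -1 + 1e-15, 1 - 1e-15))
u_grid = np.linspace(-20, 20, 200001)          # numeric inverse of u + 0.1 u^3
V1 = np.interp(X[1], u_grid + 0.1 * u_grid**3, u_grid)
Y = np.linalg.inv(A) @ np.vstack([V0, V1])

print(np.abs(Y - S).max())   # ~ 0: sources recovered through the PNL structure
```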
The indeterminations that exist in PNL separation are only slightly wider than those
of linear ICA. It has been shown [97] that, under a set of conditions indicated ahead, if the
components of Y are independent, they will obey
Y = PDS + t,
in which P is a permutation matrix, D is a diagonal matrix, and t is a vector. This means that
the sources are recovered up to an unknown permutation (represented by P ), unknown scalings
(represented by D), and unknown translations (represented by t).
The unknown translations represent an additional indetermination relative to the linear
case. However, this indetermination is often not too serious a problem because most sources can
be translated back to their original levels after separation, using prior knowledge. For example,
speech and other acoustic signals normally have a mean of zero, and in images we can often
assume that the lowest intensity corresponds to black.
The aforementioned result is valid if
• the mixture matrix A is invertible and has at least two nonzero elements per row and/or
per column;
• the functions f i are invertible and differentiable; and
• the pdf of each source is zero in one point at least.
The last condition precludes the possibility of having Gaussian sources, but is not too restrictive
since most distributions can, in practice, be considered bounded.12 The fact that the sources can
be recovered up to permutation, scaling, and translation implies that each g i must be the inverse
of the corresponding f i , up to unknown translation and scaling, and thus that the nonlinearities
of the mixture can be estimated, up to unknown translations and scalings.
The need for the first condition (that A must have at least two nonzero elements per
row and/or per column) can be understood in the following way: If A had just one nonzero
element per row and per column, it would correspond just to a permutation and a scaling.
The Ui components would be independent from one another, and so would be the Xi , for any
nonlinearities f i . Therefore it would be impossible to estimate and invert the nonlinearities f i
on the basis of the independence criterion alone. One would only be able to recover the sources
up to a permutation and an arbitrary invertible nonlinear transformation per component: the
nonlinearities could not be compensated for. For it to be possible to estimate and invert the
nonlinearities there must be some mixing in A.
PNL mixtures have been subject to extensive study, e.g. [1, 2, 15, 95–97]. We shall review
the basic separation method.
3.1.1 Separation Method

The basic separation method estimates W and the parameters θi of the nonlinearities gi by minimizing the mutual information I(Y) through gradient-based optimization. In the corresponding gradient expressions:

• W^{−T} = (W^{−1})ᵀ.
• V is the vector whose components are defined by (3.1).
• E(·) denotes statistical expectation.
• g′i(Yi, θi) = ∂gi(Yi, θi)/∂Yi.
• wki is the element ki of matrix W.
• ϕ is the column vector [ϕ1 (Y1 ), ϕ2 (Y2 ), . . . , ϕn (Yn )]T , n being the size of Y, i.e. the
number of sources and of extracted components.
In practice, the score function estimates ϕ̂i (Yi , wi ) are used instead of the actual score
functions ϕi (Yi ) in these equations. Also, instead of the statistical expectations E(·), what is
used in practice are the means computed in the training set.
This method is, essentially, an extension of INFOMAX to the PNL case. In particular,
note that the expression of the gradient relative to the separation matrix, (3.1), is essentially
the same as in INFOMAX, (2.7), if we replace the statistical expectation E(·) with the mean
in the training set, ⟨·⟩.13 And, as in INFOMAX, the relative/natural gradient method (see
Section 2.2) has been used for the estimation of this matrix.
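To make the linear stage concrete, the sketch below performs one relative (natural) gradient update of W of the standard INFOMAX type, using tanh as a crude score-function estimate for supergaussian sources. This is only an illustration of the kind of update involved; the exact expressions used in the PNL algorithm, and its score-function estimates ϕ̂_i, may differ in their details.

import numpy as np

def relative_gradient_step(W, V, step=0.01):
    """One relative (natural) gradient update of the separating matrix W.
    V is an (n, T) array holding the linearized mixture samples V_i = g_i(X_i)."""
    Y = W @ V
    phi = np.tanh(Y)                     # crude score-function estimate (supergaussian sources)
    n, T = Y.shape
    G = np.eye(n) - (phi @ Y.T) / T      # I - <phi(Y) Y^T>, mean taken over the training set
    return W + step * G @ W              # relative/natural gradient step

# usage, with V coming from the linearization stage:
# W = np.eye(2)
# for _ in range(200):
#     W = relative_gradient_step(W, V)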
This PNL separation method is also subject to the issues concerning the interleaved
optimization of two different objective functions, which we mentioned in Section 2.2.3, since
the estimation of the score functions is done through a criterion different from the one used for
the estimation of W and of the θi . Once again, however, these issues do not seem to raise great
difficulties in practice.
Several variants of the basic PNL separation algorithm have appeared in the literature.
Some of them have to do with different ways to estimate the source densities (or equivalently,
13. There is a sign difference, because here we use as objective function the mutual information, which is to be minimized, and in INFOMAX we used the output entropy, which was to be maximized. Recall that the two are equivalent; see Eq. (2.2).
the score functions) [94, 97], while other ones have to do with different ways to represent the
inverse nonlinearities g i [2, 96, 97]. An interesting development was the proposal, in [98], of
a nonparametric way to represent the nonlinearities,14 in which each of them is represented
by a table of the values of g i (xi ) for all the elements of the training set (using interpolation to
compute intermediate values, if necessary). This appears to be one of the most flexible and most
powerful ways to represent these functions.
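A minimal sketch of such a table-based representation is shown below. The ordinate values used here are merely illustrative stand-ins for the values that the algorithm of [98] would actually learn; the point is only the data structure (one table entry per training sample, with interpolation in between).

import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=500)    # training values of one mixture component

# One table entry per training sample: abscissas are the sorted training values,
# ordinates are the corresponding values of g_i (illustrative numbers here).
abscissas = np.sort(x_train)
ordinates = np.tanh(3 * abscissas)        # hypothetical "learned" g_i values

def g_table(x):
    """Evaluate the tabulated g_i at arbitrary points by linear interpolation."""
    return np.interp(x, abscissas, ordinates)

print(g_table(np.array([-0.5, 0.0, 0.7])))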
An efficient way to obtain estimates of the g i nonlinearities is to use the fact that mixtures
tend to have distributions that are close to Gaussians. This idea has been used in [92, 114–
116]. The last three of these references used the TDSEP method (cf. Section 2.3), instead of
INFOMAX, to perform the linear part of the separation.
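As a rough sketch of this Gaussianization idea, each mixture component can be mapped to an approximately Gaussian variable through its empirical ranks, and the resulting transform taken as a first estimate of g_i. The cited methods are considerably more elaborate; this is only meant to convey the idea.

import numpy as np
from scipy.stats import norm, rankdata

def gaussianize(x):
    """Map the samples of one mixture component to an approximately
    standard-Gaussian variable via their empirical ranks."""
    u = (rankdata(x) - 0.5) / len(x)      # empirical CDF values in (0, 1)
    return norm.ppf(u)                    # Gaussian quantile transform

# Example: undo a strong componentwise distortion of a near-Gaussian mixture
rng = np.random.default_rng(1)
linear_mix = rng.standard_normal(2000)    # stands in for one linearly mixed component
x = np.tanh(3 * linear_mix)               # post-nonlinear distortion
v = gaussianize(x)                        # approximately proportional to linear_mix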
Other variants of PNL methods, involving somewhat different concepts, have been
proposed in [16,17,67]. In addition, a method based on ensemble learning (discussed ahead, in
Section 3.2.2) has been proposed in [57]. It can separate mixtures in which some of the individual
nonlinearities are not invertible, but in which there are more mixture components than sources.
In [60] it was shown that when the sources have temporal structure, incorporating temporal
decorrelation into the PNL separation algorithm can achieve better separation. A method using
genetic algorithms in the optimization process was proposed in [86]. A quadratic dependence
measure has been proposed in [3], and has been used for the separation of PNL mixtures.
14
“Nonparametric,” in this context, does not mean that there are no parameters. It means that there is not a fixed
number of parameters, but instead that this number is of the order of magnitude of the number of training data,
and grows with it.
15
This example was worked out by us using software made publicly available by its authors (see Appendix B).
FIGURE 3.3: Separation of a post-nonlinear mixture: source, mixture, and separated signals
The mixture nonlinearities were
f_1(u_1) = tanh(5 u_1)
f_2(u_2) = (u_2)^3.
They are plotted in Fig. 3.4(a). The mixture components are shown in Fig. 3.3(b), where we
can see the strong distortions caused by the nonlinearities. The 1000-sample segments shown
in the figure were used as training set.
The separation was performed through the algorithm indicated above, with the following
specifics:
FIGURE 3.4: Separation of a post-nonlinear mixture: (a) the mixture nonlinearities f_i; (b) the compensated nonlinearities g_i[f_i(·)]
Fig. 3.3(c) shows the separation results. The source signals were recovered with a relatively
small amount of error. Fig. 3.4(b) shows the compensated nonlinearities, i.e. the functions
g i [ f i (·)]. We can see that the compensation had a rather good accuracy. The good recovery of
the source signals implies that the separating matrix was well estimated. Its product with the
mixing matrix was
WA = [1.14 −0.16; 0.07 1.28].
FIGURE 3.5: Separation of a post-nonlinear mixture: scatter plots. From left to right: sources, mixture
components, separated components
3.2.1 MISEP
MISEP (for Mutual Information-based SEParation) is an extension of the INFOMAX linear
ICA method to the nonlinear separation framework, and was proposed by L. Almeida [5, 9].
It is based on the structure of Fig. 3.6. This structure is very similar to the one of INFOMAX
(Fig. 2.1), the only difference being that the separation block is now nonlinear. Like INFOMAX,
the method uses the minimization of the mutual information of the extracted components, I (Y ),
as an optimization criterion. Also as in INFOMAX, this minimization is transformed into the
maximization of the output entropy H (Z ) through the reasoning of Eqs. (2.2), assuming again
that each ψi function equals the CDF of the corresponding Yi variable.
For implementing the nonlinear separation block F, the method can use essentially
any nonlinear parameterized system. In practice, the kind of system that has been used most
FIGURE 3.6: Structure used by MISEP. F is a generic nonlinear block, and is what performs the
separation proper. The separated outputs are Yi . The ψi blocks, implementing nonlinear increasing
functions, are auxiliary, being used only during optimization
frequently is an MLP.16 For this reason, we shall present the method using as example a
structure based on an MLP. To make the presentation simple, we shall use a very simple
MLP structure with a single hidden layer of nonlinear units and a layer of linear output units,
and with connections only between successive layers. We note, however, that more complex
structures can be used, and have actually been used in the examples that we shall present
ahead.
MISEP uses the maximum entropy method for estimation of the ψi functions, as ex-
plained in Section 2.2.2. And in fact, as we mentioned in that section, once that estimation
method is understood, the extension to nonlinear MISEP is quite straightforward. In what fol-
lows we shall assume that the contents of that section have been well understood. If the reader
has only glossed over that section, we recommend reading it carefully now, before proceeding
with nonlinear MISEP.
Recall that L is one of the additive terms forming the objective function and is relative to one
of the training patterns (for simplicity we have again dropped the subscript referring to the
training patterns). Also recall that J = ∂z/∂ x is the Jacobian of the transformation made by
the system.
We need to compute the gradient of L relative to the network’s weights to be able to
use gradient-based optimization methods. As in Section 2.2.2, we shall compute that gradient
through backpropagation. We shall use a network that computes J and shall backpropagate
through it to find the desired gradient.
A Network That Computes the Jacobian The network that computes J is constructed using
the same reasoning that was used in Section 2.2.2. In fact, the MLP that implements the system
of Fig. 3.6 is very similar to the one that implements the system of Fig. 2.1. The only difference
is that the linear block W of Fig. 2.1 is now replaced with an MLP, which has a hidden layer
of nonlinear units followed by a layer of linear output units.
16. In [7] a system based on radial basis functions was used. In [11] a system using an MLP together with product units was used.
FIGURE 3.7: Network for computing the Jacobian. The lower part is what computes the Jacobian
proper and is essentially a linearized version of the network of Fig. 3.6. The upper part is identical
with the network of Fig. 3.6, but is drawn in a different way. The two upper-right blocks, shown
in gray, are not necessary for computing J and are shown for reference only. See the text for further
explanation
We have already seen, in Section 2.2.2, how to deal with a layer of nonlinear hidden units
in constructing the network that computes the Jacobian. The structure of the network, for the
current case, is shown in Fig. 3.7. In this figure, the lower part is what computes the Jacobian
proper. The upper part is identical to the network of Fig. 3.6, but is drawn in a different way.
We shall start by describing this part.
• The input vector x̄ is the input pattern x, augmented with a component equal to 1,
which is useful for implementing the bias terms of the hidden units of block F.
• Blocks D̄, Φ1 , and E form block F:
– Block D̄ implements the product by the weight matrix (which we also call D̄) that
connects the input to the hidden layer. This matrix includes a row of elements
corresponding to the bias terms (this is indicated by the overbar in the matrix’s
symbol). The output of this block is formed by the vector of input activations of the
hidden units of F.
– Φ1 is a nonlinear block that applies to this vector of input activations, on a per-
component basis, the nonlinear activation functions of the hidden units of F. This
block’s output is the vector of output activations of that hidden layer.
– Block E implements the product by the weight matrix (also designated by E ) con-
necting the hidden layer to the output layer of F. The block’s output is the vector of
extracted components, y. In the figure it is shown as ȳ because it is considered to be
augmented with a component equal to 1, which is useful for implementing the bias
terms of the next layer of nonlinear units.
• As before, in Section 2.2.2, the three rightmost blocks represent all the ψi blocks of
Fig. 3.6 taken together:
– Block B̄ performs a product by matrix B̄, which is the weight matrix of the hidden
units of all the ψi blocks, taken as forming a single hidden layer. The overbar in
B̄ indicates that this matrix incorporates elements relative to the bias terms of the
hidden units. The output of this block consists of the vector of input activations of
the units of the hidden layer of all the ψi blocks taken together.
– Block Φ2 is nonlinear and applies to this vector of input activations, on a per-
component basis, the nonlinear activation functions of the hidden units. The output
of this block is the vector of output activations of the global hidden layer of the ψi
blocks.
– Block C performs a product by matrix C, which is the weight matrix of the linear
output units of the ψi blocks. The output of the C block is the output vector z.
As before, and since there are no interconnections between the various ψi blocks, matrices B̄
and C have a special structure, with a large number of elements equal to zero. This does not
affect our further analysis, except for the fact that during the optimization these elements are
not updated, being always kept equal to zero.
We shall now describe the lower part of the network, which is the part that computes the
Jacobian proper. The Jacobian is given by
J = C Φ2′ B E Φ1′ D, (3.2)
in which
• Matrices D, E, B, and C are as defined above, the absence of an overbar indicating that
the matrices have been stripped of the elements relating to bias terms (which disappear
when differentiating relative to x).
• Block Φ1′ performs a product by a diagonal matrix whose diagonal elements are the
derivatives of the activation functions of block Φ1, for the input activations that these
activation functions receive for the current input pattern.
• Block Φ2′ is similar to Φ1′ but performs a product by the derivatives of the activation
functions of block Φ2.
To compute the values of the activation functions' derivatives, blocks Φ1′ and Φ2′ need to receive
the input activations of the corresponding units. This information is supplied by the upper part
of the network through the gray-shaded arrows. The reason for the presence of the upper part
of the network is precisely to supply this information.
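As a sanity check of Eq. (3.2), the following sketch builds a small random instance of the structure of Fig. 3.7 (with tanh hidden units in F and logistic hidden units in the ψ blocks, which is our own choice of activation functions) and compares the analytical Jacobian with a finite-difference estimate.

import numpy as np

rng = np.random.default_rng(2)
n, h1, h2 = 2, 5, 6                       # inputs/outputs, hidden sizes of F and of the psi blocks
D, d0 = rng.standard_normal((h1, n)), rng.standard_normal(h1)   # F's hidden layer (D-bar)
E = rng.standard_normal((n, h1))                                 # F's linear output layer
B, b0 = rng.standard_normal((h2, n)), rng.standard_normal(h2)   # psi hidden layer (B-bar)
C = rng.standard_normal((n, h2))                                 # psi linear output layer

sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

def forward(x):
    a1 = D @ x + d0          # input activations of F's hidden units
    y = E @ np.tanh(a1)      # extracted components
    a2 = B @ y + b0          # input activations of the psi hidden units
    z = C @ sigm(a2)         # network outputs
    return z, a1, a2

def jacobian(x):
    _, a1, a2 = forward(x)
    Phi1p = np.diag(1.0 - np.tanh(a1) ** 2)        # derivatives of F's activations
    Phi2p = np.diag(sigm(a2) * (1.0 - sigm(a2)))   # derivatives of the psi activations
    return C @ Phi2p @ B @ E @ Phi1p @ D           # Eq. (3.2)

x = rng.standard_normal(n)
J = jacobian(x)
eps = 1e-6
J_num = np.column_stack([(forward(x + eps * e)[0] - forward(x - eps * e)[0]) / (2 * eps)
                         for e in np.eye(n)])
print(np.allclose(J, J_num, atol=1e-5))            # should print True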
The procedure that we have described allows us to compute the gradient of the objective
function and therefore permits the use of any gradient-based optimization method. Among
these, the simplest is plain gradient descent, possibly augmented with the use of momentum for
better performance [4, 39]. While this method, with suitably chosen step size and momentum
coefficients, will normally be fast enough to perform linear separation, as described in Section
2.2.2, nonlinear separation usually leads to much more complex optimization problems, which
can only be efficiently handled by means of accelerated optimization methods.
To obtain good results in a reasonable amount of time it is therefore essential to use a fast
optimization procedure. Among these, we mention the use of gradient descent with adaptive
step sizes (see [4], Sections C.1.2.4.2 and C.1.2.4.3). This method is simple to implement
and has been used with MISEP with great success. Another important set of fast optimization
methods that are good candidates for use with MISEP (but which have not been used yet, to
our knowledge) is the class of conjugate gradient methods [39].
This reasoning is correct, but does not take into account that to deal with the indetermi-
nation of nonlinear ICA, we need to assume that the mixture is mildly nonlinear, and we need
to ensure that the separator also performs a mildly nonlinear transformation. In the scenario
presented in the last paragraph, the extracted components Yi would have approximately uniform
distributions. If the sources that were being handled did not have close-to-uniform distribu-
tions (and many real-life sources are not close to uniform), the F block would have to perform
a strongly nonlinear transformation to make them uniform, and it would not be possible to
keep F mildly nonlinear. With the structure of Fig. 3.6, we can keep F mildly nonlinear, while
allowing the ψi to be strongly nonlinear, so that the Zi components become close to uniform.
It is even possible, if desired, to apply explicit regularization to F to force it to stay close to
linear. In the application examples presented ahead, regularization was applied to F to ensure
a correct separation of the sources.
Artificial Mixtures We show two examples of mixtures of two sources. In the first case
both sources were supergaussian, and in the second case one was supergaussian and the other
subgaussian (and bimodal). Fig. 3.8(a) shows the joint distributions of the sources for the two
cases.
The nonlinear part of the mixtures was of the form
x̂_1 = s_1 + a (s_2)^2
x̂_2 = s_2 + a (s_1)^2,
with a suitably chosen value of a. The mixture components were then obtained by rotating the
vector x̂ by 45°:
x = A x̂,
with
A = (1/√2) [1 1; −1 1].
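The following sketch shows how such a mixture can be generated; the source distributions and the value of a are illustrative choices of ours, not the ones used in the experiments reported here.

import numpy as np

rng = np.random.default_rng(3)
T = 5000
s = np.vstack([rng.laplace(size=T),                  # a supergaussian source
               np.sign(rng.standard_normal(T))       # a bimodal (subgaussian) source
               + 0.2 * rng.standard_normal(T)])
s /= np.abs(s).max(axis=1, keepdims=True)            # keep the amplitudes moderate

a = 0.3                                              # strength of the nonlinearity (illustrative)
x_hat = np.vstack([s[0] + a * s[1] ** 2,
                   s[1] + a * s[0] ** 2])
A = np.array([[1.0, 1.0],
              [-1.0, 1.0]]) / np.sqrt(2)             # 45-degree rotation
x = A @ x_hat                                        # mixture components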
Regarding processing time for nonlinear separation, the first example (the one with
two supergaussian sources) took 1000 training epochs to converge, corresponding approximately
to 3 min on a 1.6 GHz Pentium-M (Centrino) processor programmed in Matlab. The second
example (with a supergaussian and a subgaussian source) took 500 epochs, corresponding to
approximately 1.5 min on the same processor. This shows that MISEP is relatively fast, at least
for situations similar to those presented in these examples.
Results similar to these were presented in [8] for artificial mixtures of four sources (two
supergaussian and two bimodal) and in [9] for mixtures of 10 supergaussian sources. In the latter
case it was found that the number of epochs for convergence did not depend significantly on the
number of sources and that the computation time per epoch increased approximately linearly
with the number of sources for this range of problem dimensionalities. Therefore MISEP
remained relatively fast in all these situations.
Real-Life Mixtures We now present results of the application of MISEP to a real-life image
separation problem. When we acquire an image of a paper document (using a scanner, for exam-
ple), the image printed on the back page sometimes shows through due to partial transparency
FIGURE 3.8: Separation of artificial mixtures: (a) sources; (b) mixtures. Left column: two supergaussian sources. Right column: a supergaussian and a bimodal source
of the paper. The mixture that is thus obtained is nonlinear. Since it is possible to acquire both
sides of the paper, we can obtain a two-component nonlinear mixture, which is a good candidate
for nonlinear separation. Here we present a difficult instance of this problem, in which the paper
is of the “onion skin” type, yielding a strong, rather nonlinear mixture.
Fig. 3.9(a) shows the two sources. These images were printed on opposite sides of a sheet of
onion skin paper. Both sides of the paper were then acquired with a desktop scanner. Fig. 3.9(b)
shows the images that were acquired. These images constitute the mixture components.
Fig. 3.10(a) shows the joint distribution of the sources, while Fig. 3.10(b) shows the mixture
distribution. It is easy to see that the mixture was nonlinear because the source distribution was
contained within a square, and a linear mixture would have transformed this square into a paral-
lelogram. It is clear that the outline of the mixture distribution is far from a parallelogram. The
distortion of the original square outline gives an idea of the degree of nonlinearity of the mixture.
The separation of this mixture was performed both with a linear ICA system and with a non-
linear, MISEP-based system. Fig. 3.11 shows the results. We can see that nonlinear separation
yielded significantly better results than linear separation. The distributions of the separated com-
ponents are shown in Figs. 3.10(c) and 3.10(d) for the linear and nonlinear cases respectively.
These distributions also confirm the better separation achieved by the nonlinear system.
The quality of separation was also assessed by means of several objective quality measures,
again showing the advantage of nonlinear separation. One of these measures, the mean signal-
to-noise ratio of the separated sources, for a set of 10 separation tests, is shown in Table 3.1.
We see that nonlinear separation yielded an improvement of 4.1 dB in one of the sources, and
of 3.4 dB in the other one, relative to linear separation.
In these separation tests, the F block had the same structure as the one used in the
previous artificial mixture examples. The ψ blocks had 20 hidden units each, to be able to
approximate well the somewhat complex cumulative distributions of the sources. The training
set was formed by the intensities from 5000 pixel pairs, randomly chosen from the mixture
images. Separation was achieved in 400 training epochs, which took approximately 9 min on a
1.6 GHz Pentium-M (Centrino) processor programmed in Matlab.
TABLE 3.1: Mean Signal-to-Noise Ratios of the Separated Sources
                          SOURCE 1    SOURCE 2
Linear separation          5.2 dB     10.5 dB
Nonlinear separation       9.3 dB     13.9 dB
• The F network was constrained to be symmetrical (i.e. such that exchanging the inputs
would result just in an exchange of the outputs). This constraint was effective because
the mixture was known to be symmetrical to a good approximation.
The presentation of this example, here, was necessarily brief. Further details, as well as
additional examples, can be found in [10].
x = M(s, θ) + n, (3.4)
where the mixture vector x results from a nonlinear mixture M of the sources (the components
of s ), with modeling error n (in the Bayesian framework n is viewed as noise, instead of modeling
error). θ is a vector that gathers all the parameters of the mixture model.
We shall denote by x̄ the ordered set of all observation vectors x i that we have available
(with an arbitrarily chosen order, e.g. the order in which they were obtained), and by s̄ and n̄
the correspondingly ordered sets of source vectors s i and modeling error vectors ni respectively.
We’ll extend the concept of model to these sets of vectors, by writing
x̄ = M(s̄, θ) + n̄. (3.5)
where the base of the logarithms defines the unit of measurement of the coding length (e.g.
base 2 for bits). This expression is valid if the resolution is fine enough for the density p(z) to
be approximately constant within intervals of length ε Z .
If the random variable is multidimensional, we have
L(z) = − log p(z) − Σ_i log ε_{Z_i}, (3.7)
where ε Zi is the resolution of the encoding of the i-th component of Z. Whenever it is necessary
to define the coding length for a variable z, we will do so by specifying a probability density p(z),
which defines the coding length through (3.7).17 This will allow us to use only valid coding
lengths without having to specify the coding scheme [34, 35], and will also allow us to use the
tools of probability theory to more easily manipulate coding lengths. In what follows we shall
represent − Σ_i log ε_{Z_i} by r_Z, for brevity.
17. In principle we would also need to define the resolutions ε_{Z_i}. However, as we’ve said before, those resolutions will become relatively immaterial to our discussion.
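For concreteness, here is a small numeric instance of (3.7), assuming a factorized standard-Gaussian coding density and a resolution of 10^−3 per component (both of which are arbitrary choices made only for illustration).

import numpy as np
from scipy.stats import norm

z = np.array([0.3, -1.2])                  # the value to be encoded
eps = np.array([1e-3, 1e-3])               # resolutions of the two components
p = norm.pdf(z)                            # factorized coding density evaluated at z
L_bits = -np.sum(np.log2(p)) - np.sum(np.log2(eps))   # Eq. (3.7), base 2
print(L_bits)                              # about 24 bits for this density and resolution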
An aspect that we have glossed over is that (3.7) may yield a non-integer value, while
any actual encoding will involve an integer number of symbols. One way to view Eq. (3.7) is as
a continuously valued approximation of that integer length. The continuous approximation has
the advantage of being amenable to a more powerful analytical treatment than a discrete one.
There are, however, other ways to justify the use of (3.7) (see [29], Section 3.2).
If the data that we are trying to represent come from sampling a random variable, there is
an important connection between the true pdf of that variable and the pdf that we use to define
the variable’s coding length. Let us assume that the random variable Z has pdf p(z), and that
we encode it with a length defined by the density q(z). The expected value of the coding length
of Z will be
E[L_q(Z)] = − ∫ p(z) log q(z) dz + r_Z.
If we use a length defined by a pdf q(z) ≠ p(z), the average difference in coding length,
relative to L_p, is given by
E[L_q(Z)] − E[L_p(Z)] = − ∫ p(z) log q(z) dz + ∫ p(z) log p(z) dz
                      = ∫ p(z) log [p(z)/q(z)] dz
                      = KLD(p, q).
Therefore, the Kullback-Leibler divergence between p and q measures the average excess
length resulting from using a coding length defined by q on a random variable whose pdf is p.
As we know, this KLD is always positive except if q = p, in which case it is zero. Consequently,
the optimal encoding, in terms of average coding length, is defined by the variable’s own pdf.
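This excess-length interpretation is easy to check numerically. In the sketch below, p and q are taken to be two one-dimensional Gaussians (an arbitrary choice, made only for illustration), and a Monte Carlo estimate of the excess length is compared with the closed-form KLD between the two densities.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
p = norm(loc=0.0, scale=1.0)               # true density of Z
q = norm(loc=0.5, scale=1.5)               # density used to define the coding length

z = p.rvs(size=200_000, random_state=rng)
excess = np.mean(-np.log(q.pdf(z)) + np.log(p.pdf(z)))   # E[L_q(Z)] - E[L_p(Z)], in nats

# Closed-form KLD between the two Gaussians, for comparison
kld = np.log(1.5 / 1.0) + (1.0 + 0.5 ** 2) / (2 * 1.5 ** 2) - 0.5
print(excess, kld)                         # the two values should be close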
If z is a vector whose components are i.i.d. samples from a distribution with pdf p(z),
the empirical distribution formed by the components of z will be close to p(z), and the coding
length of z will be close to minimal if we choose the components’ coding lengths according to
p(z). Therefore, the choice of the pdf that defines the coding length has a strong relationship
with the statistics of the data that are to be encoded.
The representation of coding lengths in terms of probability distributions provides a
strong link to the Bayesian approach to ensemble learning. In that approach, the probability
distributions of S̄, Θ and N̄ correspond to priors over those random variables, and represent
our beliefs about those variables’ values. In the MDL approach these priors are just convenient
Since the coding length of the pair (z, w) is the sum of the coding lengths of the two
parameters, this corresponds to coding the two separately from each other. If z and w were
obtained from random variables Z and W that are not independent from each other, some
coding efficiency would be lost, because some common information would be coded in both variables
simultaneously. Consequently, the MDL criterion will tend to favor representations in which
the two parameters are statistically independent. Therefore, although the equality p(z, w) =
p(z) p(w), above, was not a statement about the independence of the two random variables, but
rather a statement about the form in which the variables are coded, it does favor solutions in
which the variables are independent.
L(ξ) is defined by the prior p(ξ), and L(n̄) by the prior p(n̄). Equation (3.8) can be
written
Once ξ is coded, what remains, to code x̄, is coding n̄. Therefore, we can write L(x̄|ξ) =
L(n̄). Furthermore, from (3.5), we see that the resolutions at which the components of x̄ are
represented are the same as those at which the components of n̄ are represented, and consequently
r_x̄ = r_n̄. Therefore, it is natural to define
p(x̄|ξ) = p(n̄), with n̄ = x̄ − M(s̄, θ),  (3.10)
so that
L(x̄|ξ) = L(n̄)
        = − log p(n̄) + r_n̄
        = − log p(x̄|ξ) + r_x̄.
We wish to choose the representation of x̄ with minimum length L(x̄). The most obvious
solution would correspond to finding the minimum of (3.11), subject to the condition that
the modeling equation (3.5) is satisfied for the given x̄. There is, however, a form of coding
that generally yields a shorter coding length. It is based on the so-called bits-back coding
method [48,110], which we shall briefly examine. It uses the fact that (3.5) allows for using any
value of ξ for coding any given set of observations x̄ (of course, bad choices of ξ will result in
large modeling errors).
For a simple example of bits-back coding, assume that we are using a redundant code,
in which a certain x̄ is represented by two different bit strings, both with the same length of
l bits. Of course, choosing the shortest code will yield a coding length of l bits. However, a
cleverer scheme is to choose one of the two available code strings according to some other
binary information that we need to transmit. In this way we can transmit x̄, plus one extra bit of
information, in the l bits. This means that, for x̄, we will effectively be using l − 1 bits, because
we will get back the extra bit at the receiver (hence the name of bits-back coding). In practice,
of course, we don’t need to send any extra information, or even to encode the data. We only
need to know the coding length, and this reasoning shows that the effective coding length of x̄
would be l − 1 bits.
In our case we can use any value of ξ for coding any given x̄, possibly at the cost of getting
large modeling errors. We’ll assume that the actual ξ to be used will be chosen at random, with
a pdf q (ξ), and that its components will be represented with resolutions corresponding to r ξ .
From (3.11), the average coding length will be
∫ q(ξ) [− log p(ξ) − log p(x̄|ξ)] dξ + r_ξ + r_n̄,
while the information that we get back at the receiver, corresponding to the random choice of ξ, is − ∫ q(ξ) log q(ξ) dξ + r_ξ.
Therefore the effective bits-back coding length of x̄, which we represent by L̂_q(x̄), is given by
L̂_q(x̄) = ∫ q(ξ) [− log p(ξ) − log p(x̄|ξ)] dξ + ∫ q(ξ) log q(ξ) dξ + r_n̄.  (3.12)
Note that, in this equation, the term r ξ , relative to the resolution of the representation
of ξ, has been canceled out. This means that bits-back coding allows us to encode the model
parameters with as fine a resolution as we wish, without incurring any penalty in terms of coding
length.
The latter equation can be transformed as
L̂_q(x̄) = ∫ q(ξ) log [ q(ξ) / (p(ξ) p(x̄|ξ)) ] dξ + r_n̄        (3.13)
        = ∫ q(ξ) log [ q(ξ) / (p(ξ|x̄) p(x̄)) ] dξ + r_n̄        (3.14)
        = KLD[q(ξ), p(ξ|x̄)] − log p(x̄) + r_n̄,                 (3.15)
where p(x̄) and p(ξ|x̄) are defined, consistently with probability theory, as
p(x̄) = ∫ p(ξ) p(x̄|ξ) dξ,
p(ξ|x̄) = p(ξ) p(x̄|ξ) / p(x̄).
3.2.2.3 Practical aspects
We are now in a position to give an overview of the ensemble learning method of nonlinear
source separation. The overview will cover the most important aspects, but several details will
have to be skimmed over. The reader is referred to [102, 104] for more complete presentations.
The method involves a number of choices and approximations, most of which are intended
to make it computationally tractable:
• The priors are taken to factor into products over individual scalar variables,
p(ξ) = Π_k p(θ_k) Π_{i,j} p(s_ij),    p(n̄) = Π_{l,m} p(n_lm),  (3.16)
where s_ij is the j-th sample of the i-th source, and a similar convention applies to n_lm.18
• The priors of the parameters and of the noise samples, p(θk ) and p(nlm ) respectively,
are taken to be Gaussian. The means and variances of these Gaussians are not fixed
a priori and, instead, are also encoded. For this reason these means and variances are
called hyperparameters. They are further discussed ahead. From (3.10), p(x̄|ξ) factors
as a product of Gaussians, because p(n̄) does.
• The priors p(s i j ) can be simply chosen as Gaussians. In this case, only linear com-
binations of the sources are normally found. This method is called nonlinear factor
analysis (NFA), and is often followed by an ordinary linear ICA operation, to extract
the independent sources from these linear combinations.
The source priors can also be chosen as mixtures of Gaussians, in order to have a
good flexibility in modeling the source distributions, in an attempt to directly perform
nonlinear ICA. However, this increased flexibility often does not lead to a full separation
of the sources, as explained ahead. For that reason, it is more common to use the
approach of performing NFA (using simple Gaussian priors) followed by linear ICA,
rather than trying to directly perform nonlinear ICA, using mixtures of Gaussians for
the priors.
The parameters of the Gaussians (means and variances) or of the mixtures
(weights, means and variances of the mixture components) are hyperparameters.
18. In the second example presented ahead, with a dynamical model of the sources, the factorization of p(s̄) doesn’t apply, being replaced by the dynamical model.
• The approximating density q(ξ) is taken to factor in a similar way,
q(ξ) = Π_k q(θ_k) Π_{i,j} q(s_ij),  (3.17)
where the q(·) in both factors of the right hand side are Gaussians.19 Therefore one
only needs to estimate their means and variances.
Since q (ξ) ≈ p(ξ|x̄), the factorization in (3.17) will tend to make the sources mutually
independent, given x̄. This bias for independence is necessary to make the model
tractable.20
• Those means and variances are estimated by minimizing the coding length L̂q (x̄),
given by (3.13). The resolutions that are chosen for representing the modeling error’s
components affect this length only through the additive term r n̄ , which doesn’t affect the
position of the minimum. That term can, therefore, be dropped from the minimization.
19. The densities q(s_ij) are taken as mixtures of Gaussians if one is doing nonlinear ICA, and not just NFA.
20. This independence bias has some influence on the final result of separation, sometimes yielding a linear combination of the sources instead of completely separated ones, even if one uses mixtures of Gaussians for the source priors [57]. This is why some authors prefer to use ensemble learning to perform only nonlinear factor analysis (NFA), extracting only linear combinations of the sources, and then to use a standard linear ICA method to separate the sources from those linear combinations. This is what is done in the first application example, presented ahead.
Consequently, the objective function that is actually used is
C = ∫ q(ξ) log [ q(ξ) / (p(ξ) p(x̄|ξ)) ] dξ.  (3.18)
As a side note, and since the KLD in (3.15) is non-negative, we conclude from (3.13)
and (3.15) that the objective function C gives an upper bound for − log p(x̄). In the
Bayesian framework, p(x̄) is the probability density that the observed data would be
generated by the model (3.5), for the specific form of M and for the specific priors that
were chosen. This probability is often called the model evidence, and the value of the
objective function can be used to compute a lower bound for it.
• The approximator q(ξ) and the prior p(ξ) factor into products of large numbers of
simple terms – see (3.17) and (3.16). The density p(x̄|ξ) is approximated by a product
of Gaussians. The factorization and the presence of the logarithms in (3.18) lead the objective
function to become a sum of a large number of relatively simple terms which can all be
computed, either exactly or approximately.
These choices and approximations make it possible to compute closed form expressions of
the partial derivatives of the objective function relative to the parameters (means and standard
deviations) of the approximator q (ξ). The partial derivatives are set equal to zero, and this
leads to equations that are simple enough to be solved directly (for some of the equations, only
approximate solutions can be found). This allows the estimation of the parameters, for one subset
of the q (ξi ) at a time, taking the parameters for the other ξi as constant. The procedure iterates
through these estimations until convergence. This procedure, although computationally heavy,
is nevertheless faster than gradient descent on the KLD. Conjugate gradient optimization [39]
also gives good results.21
Once the minimum is found, the density q(ξ) ≈ p(ξ|x̄)
yields an estimate of the posterior distribution of the sources. Due to the factorization of
q (ξ), one can directly obtain the approximate posterior of the sources, q (s̄ ), from the complete
posterior q (ξ), by keeping only the terms relative to s̄ .
21. H. Valpola, private communication, 2005.
Since p(x̄), r ξ and r n̄ don’t depend on ξ, they can be dropped from the optimization. Further-
more,
Therefore, the minimum length encoding will correspond (within these approximations)
to the s̄ and θ that maximize the corresponding terms in (3.22).22 If q (s̄ ) is represented as a
product of Gaussians on the various s i j , as is normally done, the MDL estimate of s̄ will simply
correspond to the estimated means of those Gaussians.
Within a Bayesian framework, p(s̄ |x̄) is interpreted as the true posterior of the sources,
in the statistical sense, and q (s̄ ) is an approximation to it. Therefore it can be used, for example,
for computing means or MAP estimates, or for taking decisions that depend on s̄ , as is normally
done with statistical distributions.
22. Incidentally, these are also the maximum a posteriori (MAP) estimates of s̄ and θ, within the approximation (3.21).
in [48]. In that example, a relatively small amount of training data led FastICA to find spurious
spiky sources, due to overfitting, while ensemble learning found relatively good approximations
of the actual sources.
Non-dynamical model In this case the sources were eight random signals, four of which had
supergaussian distributions, the other four being subgaussian. These signals were nonlinearly
mixed by a multilayer perceptron with one hidden layer, with sinh−1 activation functions. This
FIGURE 3.12: Results of the first example of separation using ensemble learning. In each scatter plot,
the horizontal axis corresponds to a true source, and the vertical axis to an extracted source. The sources
in the top row were supergaussian, and those in the lower row were subgaussian.
mixing MLP had 30 hidden units and 20 output units, and its weights were randomly chosen.
Note that, since 20 mixtures were observed and there were only 8 sources, this was not a square
mixing situation, unlike other ones that we have seen in this book. Gaussian noise, with an
SNR of 20 dB, was added to the mixture components.
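The following sketch shows how test data of this kind can be generated; the weight scales and the specific source distributions are illustrative choices of ours, not necessarily those of the original experiment.

import numpy as np

rng = np.random.default_rng(5)
n_src, n_hid, n_obs, T = 8, 30, 20, 1000

# Four supergaussian and four subgaussian sources
s = np.vstack([rng.laplace(size=(4, T)),
               rng.uniform(-np.sqrt(3), np.sqrt(3), size=(4, T))])

# Random mixing MLP with one hidden layer of arcsinh (sinh^-1) units
W1 = 0.5 * rng.standard_normal((n_hid, n_src))
W2 = 0.5 * rng.standard_normal((n_obs, n_hid))
clean = W2 @ np.arcsinh(W1 @ s)

# Additive Gaussian noise at 20 dB SNR
noise = rng.standard_normal(clean.shape)
noise *= np.sqrt(clean.var() / (100 * noise.var()))   # 20 dB corresponds to a power ratio of 100
x = clean + noise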
The analysis was made using nonlinear factor analysis (i.e. Gaussian priors were used
for the sources), using eight sources and a mixture model consisting of an MLP with a single
hidden layer of 50 units, with tanh activation functions. Since the source priors were Gaussian,
the sources themselves were not extracted in this step, but only linear combinations of them. In
a succeeding step, the FastICA method of linear ICA (see Section 2.4.2) was used to perform
separation from these linear combinations. Figure 3.12 shows the scatter plots of the extracted
components (after the FastICA step) versus the corresponding sources, and confirms that the
sources were extracted to a very good accuracy. This is also shown by the average SNR of the
extracted sources relative to the true ones, which was 19.6 dB. The extraction was slow, though,
having taken 100,000 training epochs.
Separation with a dynamical model Our second example corresponds to the dynamical ex-
tension of ensemble learning [106]. In this example, the mixture observations are still modeled
through (3.5), but the sources are not directly coded. Instead, they are assumed to be observations
that are sequential in time, and are coded through a dynamical model
s(t) = G[s(t − 1), φ] + m(t),
FIGURE 3.13: Separation by ensemble learning with a dynamical model: The eight sources.
where G is a nonlinear function parameterized through φ.23 Therefore, s (t) is coded through φ,
m(t) and s (1). The m(t) process is often called the innovation process, since it is what drives the
dynamical model. The variables to be encoded are the parameters of the nonlinear mappings
(θ and φ), the innovation process m(t), the initial value s (1) and the modeling error n(t).
All of these variables were separated into their scalar components, and each component had a
Gaussian prior. These Gaussians involved some hyperparameters that were given very broad
distributions, as usual.
For generating the test data for this example, the source dynamical processes consisted
of two independent chaotic Lorenz processes [70] (with two different sets of parameters) and
a harmonic oscillator. A Lorenz process has three state variables and an oscillator has two,
meaning that the system had a total of eight state variables, forming three independent dynamical
processes. Figure 3.13 shows a plot of the eight source state variables.
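The following sketch shows how the eight source state variables can be generated; the integration step, the Lorenz parameters and the oscillator frequency are illustrative choices of ours.

import numpy as np

def lorenz(n, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0, x0=(1.0, 1.0, 1.0)):
    """Euler integration of a Lorenz process; returns an (n, 3) array of states."""
    x = np.array(x0, dtype=float)
    out = np.empty((n, 3))
    for t in range(n):
        dx = np.array([sigma * (x[1] - x[0]),
                       x[0] * (rho - x[2]) - x[1],
                       x[0] * x[1] - beta * x[2]])
        x = x + dt * dx
        out[t] = x
    return out

n = 1000
t = np.arange(n)
osc = np.column_stack([np.sin(0.05 * t), np.cos(0.05 * t)])     # harmonic oscillator (2 states)
sources = np.hstack([lorenz(n),                                  # first Lorenz process (3 states)
                     lorenz(n, rho=35.0, x0=(-2.0, 1.0, 5.0)),   # second, with different parameters
                     osc])                                       # total: 8 state variables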
For producing the mixture observations, these eight state variables were first linearly
projected into a five-dimensional space. Due to this projection into a lower-dimensional space,
the behavior of the system could not be reconstructed without learning its dynamics, because five
variables do not suffice to represent the eight-dimensional state space. These five projections
were then nonlinearly mixed, by means of an MLP with a single hidden layer with sinh−1
23. We have slightly changed the notation, in this equation, using s(t) instead of s̄, and m(t) instead of m̄, to emphasize the temporal aspect of the dynamical model.
FIGURE 3.14: Separation by ensemble learning with a dynamical model: The ten nonlinear mixture
components.
to a “continuation” of the process, using only the dynamical model s(t) = G[s(t − 1), φ], with
no innovation. The fact that the correct dynamical behavior of the sources has been obtained
in this continuation shows that a very good dynamical model was learned. In fact, in chaotic
processes, even small deviations from the correct model often lead to a rather different attractor,
and therefore to a rather different continuation.
For comparing with more classical methods, several attempts were made to learn a dy-
namical model operating directly on the observations, and using MLPs of several different
configurations for implementing its nonlinear dynamics. None of these attempts was able to
yield the correct behavior, in a continuation of the dynamical process.
This example is necessarily rather complex, and here we could only give an overview. One
aspect that we have glossed over, is that only a nonlinearly transformed state space was learned
by the method. This space was then mapped, by means of an MLP, to the original state space,
in order for the recovery of the correct dynamics to be checked. What is shown in Fig. 3.15
is the result of that mapping of the learned dynamics into the original state space. However,
even in the state space learned by the method, the three dynamical processes were separated.
And the fact that a transformed state space was learned has little relevance, since there is not a
particular space that can claim to be the “main” state space of a nonlinear dynamical process.
For further details see [106].
FIGURE 3.15: Separation by ensemble learning with a dynamical model: Extracted sources. The first
1000 samples correspond to sources estimated using mixture observations. The last 1000 samples corre-
spond to a continuation, by iteration of the dynamical model s(t) = G[s(t − 1), φ].
products of vectors of that space. Also assume that, for any two vectors of X , x 1 and x 2 , the
inner product of their images in X̂ can be expressed as a relatively simple function of the original
vectors x 1 and x 2 :
x̂_1 · x̂_2 = k(x_1, x_2). (3.24)
The function k(·, ·) is then called the kernel of the mapping from X into X̂ . If (3.24) holds, the
whole algorithm that we wish to implement can be performed in the low-dimensional space X
by replacing the inner products in X̂ with kernel evaluations in X . This will avoid performing
operations in the high-dimensional space X̂ .
This may seem too far-fetched at first sight. After all, most nonlinear mappings will not
have a corresponding kernel that obeys (3.24). However, what is done in practice is to choose
only mappings that have a corresponding kernel. In fact, the issue is often reversed: one chooses
the kernel, and that implicitly defines the nonlinear mapping that is used. The important point
is that it is not hard to find kernels corresponding to mappings that yield very wide classes of
nonlinear operations in X , when linear operations are performed in X̂ .
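The idea is easy to verify for a small polynomial kernel. In the sketch below, the feature map for the degree-2 polynomial kernel in two dimensions is written out explicitly (the √2 factors are the usual bookkeeping that makes the identity exact), and its inner products are checked against direct kernel evaluations.

import numpy as np

def phi(x):
    """Explicit feature map whose inner products reproduce k(a, b) = (a.b + 1)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def k(a, b):
    return (a @ b + 1.0) ** 2

a = np.array([0.3, -1.2])
b = np.array([2.0, 0.7])
print(phi(a) @ phi(b), k(a, b))   # the two numbers coincide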
The most straightforward application of these ideas to nonlinear ICA would consist of
performing linear ICA in the feature space X̂ , and then mapping the obtained components
back to the original space X . This is not possible, however, because it has not been possible
to express linear ICA only in terms of inner products, and these are the only operations that
can be efficiently performed in X̂ . And even if this could be done, there would probably be
the additional problem that working (albeit indirectly) in such a high-dimensional space could
easily lead to badly conditioned systems, raising numerical instability issues.
For these reasons, the approach that is taken in kernel-based nonlinear ICA is differ-
ent: we first linearly project the data from the feature space X̂ into a medium-dimensional
intermediate space X̃ , and then perform linear ICA in that space. Normally, linear ICA in this
lower dimensional space already is numerically tractable. The dimension of X̃ (call it d ) is often
chosen in a way that depends on the existing data, but for a typical two-source problem it could
be around 20.
The projection from the feature space into the intermediate space should be made in such
a way that the most important information from the feature space is kept. This projection can
be performed in several ways. The one that immediately comes to mind is the use of PCA in X̂ .
This amounts to using the kernel-PCA method that we mentioned earlier [90]. That method
uses only inner products in X̂ , and can therefore be efficiently implemented by means of the
kernel trick. Other alternatives that also use only inner products include the use of random
sampling or of clustering [38].
Once the data have been projected into the intermediate space X̃ , it is possible to perform
linear ICA in that space through several standard methods. The method that has been used in
3.2.3.1 Examples
We shall present two examples of the separation of nonlinear mixtures through kTDSEP.
First Example The source signals were two sinusoids with different frequencies (2000 samples
of each). Fig. 3.16(a) shows these signals, and Fig. 3.17(a) shows the corresponding scatter plot.
These signals were nonlinearly mixed according to
x_1 = e^{s_1} − e^{s_2}
x_2 = e^{−s_1} + e^{−s_2}. (3.25)
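The following sketch shows how this mixture can be generated; the two frequencies are illustrative choices of ours, since the text only requires them to be different.

import numpy as np

T = 2000
t = np.arange(T)
s1 = np.sin(2 * np.pi * 0.005 * t)        # first sinusoidal source
s2 = np.sin(2 * np.pi * 0.013 * t)        # second sinusoid, with a different frequency

x1 = np.exp(s1) - np.exp(s2)              # Eq. (3.25)
x2 = np.exp(-s1) + np.exp(-s2)
x = np.vstack([x1, x2])                   # nonlinear mixture fed to kTDSEP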
Fig. 3.16(b) shows the mixture signals, and Fig. 3.17(b) shows the corresponding scatter plot.
We can see that the mixture performed by (3.25) was significantly nonlinear.
Figs. 3.16(c) and 3.17(c) show the result of linear ICA. Both figures show that it was not
able to perform a good separation, as expected. Fig. 3.17(c) shows that linear ICA just rotated
the joint distribution. Naturally, it could not undo the nonlinearities introduced by the mixture.
FIGURE 3.16: Signals corresponding to the first example of separation through kTDSEP
The kernel that was used was
k(a_1, a_2) = (a_1^T a_2 + 1)^9.
This kernel generates a feature space X̂ spanned by all the monomials of the form (x1 )m (x2 )n ,
with m + n ≤ 9, where x1 and x2 are the two mixture components. This space has a total of
54 dimensions. This space was then reduced, through a clustering technique, to 20 dimen-
sions, thus forming the intermediate space X̃ . Linear ICA was performed on this intermediate
space by the TDSEP technique, yielding 20 components. From these, the ones corresponding
to the sources were extracted, as indicated above, by rerunning the algorithm on the set of
20 components and selecting the two that were least modified by this second separation. The
results are shown in Figs. 3.16(d) and 3.17(d). We can see that a virtually perfect separation was
achieved.
FIGURE 3.17: Scatter plots from the first example of separation through kTDSEP
Second Example The sources were two speech signals, each with a length of 20 000 samples.
Fig. 3.18(a) shows these signals, and Fig. 3.19(a) shows their joint distribution. The latter figure
shows that the sources were strongly supergaussian, as normally happens with speech signals.
For performing the nonlinear mixture, the signals were first both scaled to the interval
[−1, 1] and were then mixed according to
x_1 = −(s_2 + 1) cos(π s_1)
x_2 = (s_2 + 1) sin(π s_1).
We can describe this mixture in the following way: The mixture space is generated in polar
coordinates. Source s 1 controls the angle, while (s 2 + 1) controls the distance from the center.
Figs. 3.18(b) and 3.19(b) show the mixture results.24
24. Due to the shape of the mixture scatter plot, this has become commonly known as the “euro mixture.”
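The following sketch shows how the “euro mixture” can be generated; random supergaussian surrogates stand in here for the speech signals.

import numpy as np

rng = np.random.default_rng(6)
T = 20_000
s = rng.laplace(size=(2, T))                # surrogate supergaussian "speech" signals
s /= np.abs(s).max(axis=1, keepdims=True)   # scale both sources to [-1, 1]

x1 = -(s[1] + 1) * np.cos(np.pi * s[0])     # s1 controls the angle,
x2 = (s[1] + 1) * np.sin(np.pi * s[0])      # (s2 + 1) the distance from the center
x = np.vstack([x1, x2])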
FIGURE 3.18: Signals corresponding to the second example of separation through kTDSEP
The kernel that was used was the Gaussian kernel
k(a_1, a_2) = e^{−||a_1 − a_2||^2}.
FIGURE 3.19: Scatter plots from the second example of separation through kTDSEP
the signal-to-noise ratios of the results of both linear and nonlinear separation, relative to the
original sources. Table 3.2 shows the results and confirms the very large improvement obtained
with kTDSEP, relative to linear ICA.
We have shown examples of the separation of nonlinear mixtures of just two sources.
However, kTDSEP has been shown to be able to efficiently separate mixtures of up to seven
sources [38].
TABLE 3.2: Signal-to-Noise Ratios of the Results of Linear ICA and kTDSEP
                 SOURCE 1    SOURCE 2
Linear ICA        5.4 dB     −7.0 dB
kTDSEP           13.2 dB     18.1 dB
3.2.4 Other Methods
In this section we shall give an overview of other methods that have been proposed for performing
nonlinear source separation, and we shall simultaneously try to give a historical perspective of
the field. The number of different methods that we shall mention is large, and therefore we can
only make a brief reference to each of them.
A very early result on nonlinear source separation was published by Darmois in 1953 [30].
He showed the essential ill-posedness of unconstrained nonlinear ICA; i.e., that it has an infinite
number of solutions that are not related to one another in a simple way.
One of the first nonlinear ICA methods was proposed by Schmidhuber in 1992 [88]. It
was based on the idea of extracting, from the observed mixture, a set of components such that
each component would be as unpredictable as possible from the set of all other components.
This justified the method’s name, predictability minimization. The components were extracted
by an MLP, and the attempted prediction was also performed by MLPs. While the basic idea
was sound, the method was computationally heavy, and hard to apply in practice.
In the same year, another nonlinear ICA method was proposed by Burel [23]. It was
based on the minimization of a cost function that was a smoothed version of the quadratic
error between the true distribution p(y) and the product of the marginals, Π_i p(y_i). The
method used an MLP as a separator. The cost function was expressed as a series involving the
moments of the extracted components. The series was then truncated and the moments were
estimated on the training set. Backpropagation was used to compute the gradient of the cost
function for minimization. The method was demonstrated to work on an artificially generated
nonlinear mixture, both without and with noise. While no explicit regularization conditions
were mentioned, the smoothing of the error, the series truncation, and the fact that a very small
MLP was used (with one hidden layer with just two units) were implicit regularizing constraints
that allowed the method to cope with the indetermination of nonlinear ICA for the mixture
that was considered in the examples.
In 1995, Deco and Brauer proposed a method based on the minimization of the mutual
information of the extracted components [31]. The method was restricted to volume-conserving
nonlinear mixtures (also called information-preserving mixtures), and the separation was
performed by an MLP with a special structure. The probability densities needed for the esti-
mation of the mutual information were approximated by truncated series based on cumulants.
The indetermination inherent to nonlinear ICA was handled by the restriction to volume-
conserving mixtures and also by the inherent regularization corresponding to the truncation of
the cumulant-based series. Variants of the method were proposed in [81, 82].
Also in 1995, Hecht–Nielsen proposed a method (replicator neural networks) to find a
mapping from observed data to data that are uniformly distributed within a hypercube [41].
Coordinates parallel to the hypercube’s edges (which were designated natural coordinates of the
Also in 1998, Hochreiter and Schmidhuber [43–45] proposed the so-called LO-
COCODE method, which was based on a philosophy similar to (although less complete
than) the MDL one used later to develop ensemble learning: it tried to find a low-complexity
auto-encoder for the data. Under appropriate conditions, this led the auto-encoder’s internal
representation of the data to consist of the original sources. The main difference in philoso-
phy, relative to ensemble learning, was that it did not take into account the complexity of the
extracted sources and of the modeling error. The method was demonstrated on an artificial
nonlinear separation task, both with noiseless and with noisy data.
In 1999, Marques and L. Almeida proposed a method, called pattern repulsion, based on
a physical analogy with electrostatic repulsion among output patterns [72]. The method was
shown to be equivalent to the maximization of the second-order differential Renyi entropy of
the output, defined as [112]
H_2(Y) = − log ∫ p_Y^2(y) dy.
As with Shannon’s entropy, this entropy maximization led to a uniform distribution of the
outputs within a hypercube, and thus to independent outputs. The method used regularization
to handle the indetermination of nonlinear ICA. This method was extended by Hochreiter and
Mozer [42], by including both “repulsion” and “attraction,” allowing it to deal with nonuniform
source distributions. A further theoretical analysis of the method was made in [100].
Also in 1999, Palmieri [80] proposed a method based on the maximization of the output
entropy, using as separator an MLP with the restriction that it had, in each hidden layer, the
same number of units as the number of mixture components. The possible ill-posedness of the
problem was not addressed in this work. The same separator structure (but restricted to two
hidden layers) was considered in [73] with a different learning algorithm.
In the same year, Hyvärinen and Pajunen [56] showed that a nonlinear mixture of two
independent sources is uniquely separable if the mixture is conformal25 and the sources’ distri-
butions have known, bounded supports.
Still in 1999, Lappalainen (presently Valpola) and Giannakopoulos proposed the ensem-
ble learning method, studied in Section 3.2.2.
In 2000, L. Almeida proposed the MISEP method, studied in Section 3.2.1. Also in
2000, Fyfe and Lai [33] proposed a method that combines the use of kernels with the canonical
correlation analysis technique from statistics, and demonstrated the separation of two sinusoids
from a nonlinear mixture. The method seems to retain at least some of the indetermination of
nonlinear ICA because it does not provide information on which components, among those
that are extracted, do correspond to actual sources.
25. A conformal mapping is a mapping that locally preserves orthogonality.
In 2005, the so-called denoising source separation method, originally proposed for linear
source separation [87], was extended to nonlinear separation by M. Almeida et al., and was
demonstrated on the image separation problem that we examined in Section 3.2.1.4 [12]. The
method is not based on an independence criterion. Instead, it uses some prior information about
the sources and/or the mixture process to perform a partial separation of the sources. With an
application of this partial separation within an appropriate iterative structure, the method is
able to achieve a rather complete separation.
As we have seen, a large number of nonlinear source separation methods have been
proposed in the literature. Some of them are limited to specific kinds of nonlinear mixtures,
either explicitly or due to the restricted kind of separator that they use, but other ones are
relatively generic. In our view, this variety of methods reflects both the youth and the difficulty
of the topic: it has not stabilized into a well-defined set of methods yet. In Sections 3.2.1 to
3.2.3 we presented in some detail the methods that, in our opinion, have the greatest potential
for yielding useful application results.
3.3 CONCLUSION
If we compare the results obtained by kTDSEP with those obtained with MISEP, there is a
difference that is striking. The kTDSEP method can successfully separate mixtures involving
nonlinearities that are much stronger than those that can be handled by MISEP. A good example
is the “euro mixture.” MISEP, applied to this mixture, yields independent components, but these
do not correspond to the original sources. MISEP is not able to deal with the ill-posedness
of nonlinear separation in this case, because it uses the assumption that the mixture is mildly
nonlinear to be able to perform regularization, and that assumption is not valid in this case.
The question that immediately comes to mind is how kTDSEP can deal with the ill-
posedness, even without regularization. The intuition is that its use of the temporal structure
of the sources in the separation process greatly reduces the indetermination of nonlinear ICA.
There is some evidence to support this idea. In a series of unpublished experiments, jointly
performed by Harmeling and L. Almeida, a variant of kTDSEP was used to try to separate
a mixture of sources with no time structure (i.e. each of them had independent, identically
distributed samples). Of course, TDSEP could not be used for the linear separation step.
Therefore another method (INFOMAX) was used for this step. Even though the mixtures
that were considered were only mildly nonlinear, all tests failed to recover the original sources.
This seems to confirm the idea that the use of temporal structure is what allows kTDSEP to
successfully deal with the indetermination. Hosseini and Jutten have analyzed in [53] a number
of cases that also suggest that temporal structure can be used for this purpose. We conjecture
that temporal (or spatial) structure of the sources will become a very important element in
dealing with the indetermination of nonlinear source separation in the future.
CHAPTER 4
Final Comments
We have given an overview of nonlinear source separation and have studied in some detail the
main methods that are currently used to perform this kind of operation. While these methods
yield useful results in a number of cases, they are still somewhat hard to apply, and still need
much care from the user, both in tuning them and in assessing the quality of the results. Simply
put, these methods are still very far from a “black box” use, in which one would just grab a
separation routine, supply the data, and immediately get useful results.
The reader will have noticed that there are still few examples of the application of nonlinear
separation to real-life data. Some people have suggested that there are few naturally occurring
nonlinear mixtures. We think this is not so. It is true that many naturally occurring mixtures are
approximately linear. This is often the case with acoustic, biomedical, and telecommunications
signals, for example, and it is a fortunate situation, because it allows us to deal with them using
the well-developed linear separation methods. However, we think that we often do not identify
nonlinear mixtures as such, simply because we still do not have powerful enough methods to
deal with them. Imagine, for example, a complex set of real-life data such as the stock values
from some financial market. We would like to be able to extract the fundamental “sources”
that drive these data. These could perhaps be variables such as the investor’s confidence, the
interest rate, the market liquidity and volatility, and probably also other variables that we are
unable to think of. These “sources” are related to the observed data in a nonlinear way. The
same could be said about countless other sets of data. If we were able to efficiently perform
nonlinear source separation in complex data sets, we would have a very powerful data mining
tool, applicable to a very large number of situations. It is therefore worth investing a significant
effort into the development of more powerful nonlinear separation methods.
In our view, the main difficulty of nonlinear source separation resides in the ill-posedness
of nonlinear ICA, which we have emphasized a number of times throughout this book. While the
use of the temporal or spatial structure of signals may provide a significant help in this respect,
as illustrated by the kTDSEP method, knowledge about how this structure should be used, and
about the capabilities and shortcomings of its use, is still very limited. To our knowledge, the
use of signal structure in nonlinear separation is currently limited to kTDSEP, to the method
presented in [21], and to ensemble learning with a dynamical model (as in the second example
to develop methods that are able to exploit elaborate and powerful measures of complexity and
that, at the same time, are efficient in computational terms.
The field of nonlinear source separation will keep steadily advancing, with the develop-
ment of methods that are progressively more powerful, more efficient, easier to use and applicable
to a wider range of situations. The benefits to be gained are very large. Linear separation, which
is much more advanced today, has already shown the potential of these methods to reveal the
“hidden truth” that lies behind complex signals, making them much easier to understand and
to process, and giving us access to information that could not be obtained in any other way. A
similar potential awaits us as we learn to perform nonlinear separation more effectively.
We conclude that Z will be uniformly distributed in (0, 1), whatever the distribution of
the original random variable Y. The function F_Y is the nondecreasing function that transforms
Y into a random variable that is uniformly distributed in (0, 1). This fact is used by some ICA
methods, namely INFOMAX and MISEP.
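As a small numerical illustration (ours, not part of those methods), the following Python sketch applies this transformation to an exponentially distributed variable, whose distribution function is F_Y(y) = 1 − exp(−y), and checks that the result is approximately uniform in (0, 1):

    import numpy as np

    rng = np.random.default_rng(0)

    # Sample Y from an exponential distribution with rate 1, whose
    # distribution function is F_Y(y) = 1 - exp(-y).
    y = rng.exponential(scale=1.0, size=100000)
    z = 1.0 - np.exp(-y)               # Z = F_Y(Y)

    # If Z is uniform in (0, 1), the empirical frequency in each of
    # ten equal-width bins should be close to 0.1.
    freq, _ = np.histogram(z, bins=10, range=(0.0, 1.0))
    print(freq / z.size)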
A.2 ENTROPY
The entropy of a discrete random variable X, as defined by Shannon [29, 91], is
H(X) = −Σ_i P(x_i) log P(x_i)
     = −E[log P(X)],   (A.1)

where P(x_i) is the probability that X takes the value x_i, and the sum spans all the possible values
of X. The entropy measures the amount of information that is contained, on average, in X. It
can also be interpreted as the minimum number of symbols needed to encode X, on average.
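For concreteness, here is a minimal Python sketch (ours) that evaluates (A.1) for simple discrete distributions; the choice of base-2 logarithms, which gives the entropy in bits, is just an illustrative assumption:

    import numpy as np

    def entropy(probs, base=2.0):
        # Shannon entropy H(X) = -sum_i P(x_i) log P(x_i), here in base-2
        # logarithms (bits); terms with P(x_i) = 0 contribute nothing.
        p = np.asarray(probs, dtype=float)
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)) / np.log(base))

    print(entropy([0.5, 0.5]))    # a fair coin: 1 bit per outcome
    print(entropy([0.9, 0.1]))    # a biased coin: about 0.47 bits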
For a continuous random variable X, Shannon's differential entropy is defined as

H(X) = −∫ p(x) log p(x) dx = −E[log p(X)],
where p(X) is the probability density function of X, and the convention 0 log 0 = 0 is adopted.
The differential entropy does not have the same interpretation as the entropy in terms of coding
length, but is useful in relative terms, for comparing different random variables. Its definition
extends, in a straightforward way, to multidimensional variables.
Following are two important facts about differential entropy (see [29, Chapter 11] or [91]); a
numerical check of the second fact is sketched after the list:
• Among all distributions with support within a given bounded region, the one with the
largest differential entropy is the uniform distribution within that region. This is true
both for single-dimensional and for multidimensional distributions.
• Among all single-dimensional distributions with zero mean and with a given variance,
the one with the largest differential entropy is the Gaussian distribution with the given
variance. Among all multidimensional distributions with zero mean and a given covari-
ance matrix, the one with the largest differential entropy is the Gaussian distribution
with the given covariance matrix. Among all multidimensional distributions with zero
mean and a given variance, the one with the largest differential entropy is the spherically
symmetric Gaussian distribution with the given variance.
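As a numerical illustration of the second fact (our sketch, not from the original text), the closed-form differential entropies of a uniform and of a Gaussian distribution with the same variance can be compared directly:

    import numpy as np

    sigma = 1.0   # common standard deviation of the two distributions

    # Uniform distribution with variance sigma^2: its support has width
    # sqrt(12)*sigma, and the differential entropy of a uniform density
    # on an interval of width w is log(w) (in nats).
    h_uniform = np.log(np.sqrt(12.0) * sigma)

    # Gaussian distribution with the same variance.
    h_gaussian = 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

    print(h_uniform, h_gaussian)   # approx. 1.24 < 1.42: the Gaussian wins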
where J = ∂z/∂x is the Jacobian of the transformation T. Therefore,

H(Z) = −E[log p(z)]
     = −E[log p(x)] + E[log |det J|]
     = H(X) + E[log |det J|],
where, for the case of continuous random variables, H represents Shannon’s differential entropy,
and for the case of discrete variables it represents Shannon’s entropy.
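This relation can be checked in closed form for a linear transformation of a Gaussian vector, for which both entropies are known; the following sketch is ours, and the covariance matrix C and transformation matrix A are arbitrary illustrative choices:

    import numpy as np

    def gaussian_entropy(cov):
        # Differential entropy (in nats) of a Gaussian with covariance 'cov'.
        n = cov.shape[0]
        return 0.5 * np.log((2.0 * np.pi * np.e) ** n * np.linalg.det(cov))

    # X is a two-dimensional Gaussian with covariance C; Z = T(X) = A X is a
    # linear transformation, so J = A everywhere and E[log |det J|] = log |det A|.
    C = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
    A = np.array([[1.0, 0.5],
                  [0.2, 2.0]])

    h_x = gaussian_entropy(C)
    h_z = gaussian_entropy(A @ C @ A.T)

    print(h_z, h_x + np.log(abs(np.linalg.det(A))))   # the two values coincide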
The mutual information I(Y) of the components of a random vector Y can be written as
I(Y) = Σ_i H(Y_i) − H(Y). I(Y) is a nonnegative quantity. It is zero if and only if the components
Y_i are mutually independent. This agrees with the intuitive concept that we mentioned above:
the information shared by the components Y_i is never negative, and is zero only if these
components are mutually independent.
I (Y ) is equal to the Kullback–Leibler divergence between the density of Y and the product
of the marginal densities of the components Yi :
I(Y) = KLD( p(y), Π_i p(y_i) ).
This is proved, for example, in [29,39], for the case of a two-dimensional random vector Y, and
the proof can easily be extended to more than two dimensions.
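For a discrete two-dimensional example, this equality can be verified directly; the joint probability table below is an arbitrary illustrative choice (the sketch is ours, not from [29] or [39]):

    import numpy as np

    # Joint probability table of (Y_1, Y_2); the marginals are its row and
    # column sums.
    p_joint = np.array([[0.30, 0.10],
                        [0.15, 0.45]])
    p1 = p_joint.sum(axis=1)
    p2 = p_joint.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    # Mutual information as the sum of marginal entropies minus the joint entropy.
    mi = entropy(p1) + entropy(p2) - entropy(p_joint.ravel())

    # Kullback-Leibler divergence between the joint distribution and the
    # product of the marginals.
    kld = float(np.sum(p_joint * np.log(p_joint / np.outer(p1, p2))))

    print(mi, kld)   # the two values coincide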
But the Jacobian ∂Z/∂Y is a diagonal matrix, its determinant being given by

det(∂Z/∂Y) = Π_i ψ′_i(Y_i).

Therefore,

E[log det(∂Z/∂Y)] = E[log Π_i ψ′_i(Y_i)] = Σ_i E[log ψ′_i(Y_i)].
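A quick numerical check of this step (our sketch; the choice of tanh as a stand-in for the component-wise nonlinearities ψ_i is purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def psi_prime(y):
        # Derivative of tanh, standing in for the derivatives psi'_i.
        return 1.0 - np.tanh(y) ** 2

    y = rng.normal(size=(10000, 3))    # samples of a three-dimensional Y

    # The Jacobian dZ/dY is diagonal, so its determinant at each sample is
    # the product of the component derivatives psi'_i(Y_i)...
    det_jac = np.prod(psi_prime(y), axis=1)

    # ...and therefore E[log det(dZ/dY)] equals the sum over i of
    # E[log psi'_i(Y_i)].
    lhs = np.mean(np.log(det_jac))
    rhs = np.sum(np.mean(np.log(psi_prime(y)), axis=0))

    print(lhs, rhs)                    # the two estimates coincide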
• ICA Central—http://www.tsi.enst.fr/icacentral
• Paris Smaragdis’ page—http://web.media.mit.edu/∼paris/ica.html
• ICA Research Network—http://www.elec.qmul.ac.uk/icarn/software.html
http://www.lx.it.pt/∼lbalmeida/ica/seethrough/index.html
http://www.lx.it.pt/∼lbalmeida/ica/seethrough/code/jmlr05/
References
[1] S. Achard and C. Jutten, “Identifiability of post nonlinear mixtures,” IEEE Signal Pro-
cessing Letters, vol. 12, no. 5, pp. 423–426, 2005. doi:10.1109/LSP.2005.845593
[2] S. Achard, D. Pham, and C. Jutten, “Blind source separation in post nonlinear mix-
tures,” in Proc. Int. Workshop Independent Component Analysis and Blind Signal Separa-
tion, San Diego, CA, 2001, pp. 295–300. [Online]. Available: http://www-lmc.imag.fr/
lmc-sms/Sophie.Achard/Recherche/ICA2001.pdf
[3] S. Achard, D. Pham, and C. Jutten, “Quadratic dependence measure for nonlinear blind
sources separation,” in Proc. Int. Workshop Independent Component Analysis and Blind
Signal Separation, Nara, Japan, 2003. [Online]. Available: http://www.kecl.ntt.co.jp/icl/
signal/ica2003/cdrom/data/0098.pdf
[4] L. Almeida, “Multilayer perceptrons,” in Handbook of Neural Computation, E. Fiesler
and R. Beale, Eds. Bristol, U.K.: Institute of Physics. 1997. [Online]. Available:
http://www.lx.it.pt/∼lbalmeida/papers/AlmeidaHNC.pdf
[5] L. Almeida, “Linear and nonlinear ICA based on mutual information,” in Proc. Symp.
2000 Adaptive Systems for Signal Processing, Communications, and Control, Lake Louise,
Alberta, Canada, 2000. [Online]. Available: http://www.lx.it.pt/∼lbalmeida/papers/
AlmeidaASSPCC00.ps.zip
[6] L. Almeida, “Simultaneous MI-based estimation of independent components and of
their distributions,” in Proc. Second Int. Workshop Independent Component Analysis and
Blind Signal Separation, Helsinki, Finland, 2000, pp. 169–174. [Online]. Available:
http://www.lx.it.pt/∼lbalmeida/papers/AlmeidaICA00.ps.zip
[7] L. Almeida, “Faster training in nonlinear ICA using MISEP,” in Proc. Int. Workshop
Independent Component Analysis and Blind Signal Separation, Nara, Japan, 2003, pp. 113–
118. [Online]. Available: http://www.lx.it.pt/∼lbalmeida/papers/AlmeidaICA03.pdf
[8] L. Almeida, “MISEP—Linear and nonlinear ICA based on mutual information,” Jour-
nal of Machine Learning Research, vol. 4, pp. 1297–1318, 2003. [Online]. Available:
http://www.jmlr.org/papers/volume4/almeida03a/almeida03a.pdf
[9] L. Almeida, “Linear and nonlinear ICA based on mutual information—
the MISEP method,” Signal Processing, vol. 84, no. 2, pp. 231–245, 2004.
[Online]. Available: http://www.lx.it.pt/∼lbalmeida/papers/AlmeidaSigProc03.pdf
doi:10.1016/j.sigpro.2003.10.008
Separation, Series Lecture Notes in Artificial Intelligence, no. 3195, C. G. Puntonet
and A. Prieto, Eds. Springer-Verlag, 2004, pp. 742–749. [Online]. Available: http://itb
.biologie.hu-berlin.de/∼blaschke/publications/isfa.pdf
[22] R. Boscolo, H. Pan, and V. Roychowdhury, “Independent component analysis based
on nonparametric density estimation,” IEEE Transactions on Neural Networks, vol. 15,
no. 1, pp. 55–65, January 2004. [Online]. Available: http://www.ee.ucla.edu/faculty/
papers/vwani trans-neural jan04.pdf doi:10.1109/TNN.2003.820667
[23] G. Burel, “Blind separation of sources: A nonlinear neural algorithm,” Neural Networks,
vol. 5, no. 6, pp. 937–947, 1992. doi:10.1016/S0893-6080(05)80090-5
[24] C. Burges, “A tutorial on support vector machines for pattern recogni-
tion,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167,
1998. [Online]. Available: http://www.kernel-machines.org/papers/Burges98.ps.gz
doi:10.1023/A:1009715923555
[25] J.-F. Cardoso, “The invariant approach to source separation,” in Proc. NOLTA, 1995,
pp. 55–60. [Online]. Available: http://www.tsi.enst.fr/∼cardoso/Papers.PDF/nolta95
.pdf
[26] J.-F. Cardoso and A. Souloumiac, “Blind beamforming for non Gaussian signals,” IEE
Proceedings-F, vol. 140, no. 6, pp. 362–370, 1993. [Online]. Available: http://www.tsi
.enst.fr/∼cardoso/Papers.PDF/iee.pdf
[27] A. Cichocki and S.-I. Amari, Adaptive Blind Signal and Image Processing—Learning
Algorithms and Applications. New York, NY: Wiley, 2002.
[28] P. Comon, “Independent component analysis—a new concept?” Signal Processing,
vol. 36, pp. 287–314, 1994. doi:10.1016/0165-1684(94)90029-9
[29] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY: Wiley,
1991.
[30] G. Darmois, “Analyse générale des liaisons stochastiques,” Rev. Inst. Internat. Stat.,
vol. 21, pp. 2–8, 1953.
[31] G. Deco and W. Brauer, “Nonlinear higher-order statistical decorrelation by volume-
conserving neural architectures,” Neural Networks, vol. 8, pp. 525–535, 1995.
doi:10.1016/0893-6080(94)00108-X
[32] J. Fisher and J. Principe, “Entropy manipulation of arbitrary nonlinear mappings,” in
Proc. IEEE Workshop Neural Networks for Signal Processing, Amelia Island, FL, 1997, pp.
14–23. [Online]. Available: http://www.cnel.ufl.edu/bib/pdf papers/fisher nnsp97.pdf
[33] C. Fyfe and P. Lai, “ICA using kernel canonical correlation analysis,” in Proc. Int. Work-
shop Independent Component Analysis and Blind Signal Separation, Helsinki, Finland,
2000, pp. 279–284. [Online]. Available: http://www.cis.hut.fi/ica2000/proceedings/
0279.pdf
[45] S. Hochreiter and J. Schmidhuber, “LOCOCODE performs nonlinear ICA without
knowing the number of sources,” in Proc. First Int. Workshop Independent Component
Analysis and Signal Separation, J. F. Cardoso, C. Jutten, and P. Loubaton, Eds., Aussois,
France, 1999, pp. 277–282.
[46] A. Honkela, “Speeding up cyclic update schemes by pattern searches,” in Proc. Ninth
Int. Conf. Neural Information Processing, Singapore, 2002, pp. 512–516.
[47] A. Honkela, S. Harmeling, L. Lundqvist, and H. Valpola, “Using kernel PCA for
initialisation of variational Bayesian nonlinear blind source separation method,” in Proc.
Int. Workshop Independent Component Analysis and Blind Signal Separation, Granada,
Spain, 2004, pp. 790–797.
[48] A. Honkela and H. Valpola, “Variational learning and bits-back coding: An information-
theoretic view to Bayesian learning,” IEEE Transactions on Neural Networks, vol. 15,
no. 4, pp. 800–810, 2004. doi:10.1109/TNN.2004.828762
[49] A. Honkela and H. Valpola, “Unsupervised variational bayesian learning of nonlin-
ear models,” in Advances in Neural Information Processing Systems, vol. 17, L. K. Saul,
Y. Weiss, and L. Bottou, Eds., 2005, pp. 593–600. [Online]. Available: http://books.nips
.cc/papers/files/nips17/NIPS2004 0322.pdf
[50] A. Honkela, H. Valpola, and J. Karhunen, “Accelerating cyclic update algorithms for pa-
rameter estimation by pattern searches,” Neural Processing Letters, vol. 17, no. 2, pp. 191–
203, 2003. doi:10.1023/A:1023655202546
[51] S. Hosseini and Y. Deville, “Blind separation of linear-quadratic mixtures of real
sources,” in Proc. IWANN, vol. 2, Mao, Menorca, Spain, 2003, pp. 241–248.
[52] S. Hosseini and Y. Deville, “Blind maximum likelihood separation of a linear-quadratic
mixture,” in Proc. Int. Workshop Independent Component Analysis and Blind Signal Sepa-
ration, Series Lecture Notes in Artificial Intelligence, no. 3195. Springer-Verlag, 2004.
[Online]. Available: http://webast.ast.obs-mip.fr/people/ydeville/papers/ica04 1.pdf
[53] S. Hosseini and C. Jutten, “On the separability of nonlinear mixtures of temporally
correlated sources,” IEEE Signal Processing Letters, vol. 10, no. 2, pp. 43–46, February
2003. doi:10.1109/LSP.2002.807871
[54] A. Hyvärinen, “Fast and robust fixed-point algorithms for independent component
analysis,” IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.
[Online]. Available: http://www.cs.helsinki.fi/u/ahyvarin/papers/TNN99new.pdf
doi:10.1109/72.761722
[55] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. New York,
NY: Wiley, 2001.
[56] A. Hyvärinen and P. Pajunen, “Nonlinear independent component analysis: Exis-
tence and uniqueness results,” Neural Networks, vol. 12, no. 3, pp. 429–439, 1999.
Springer-Verlag, 2004, pp. 710–717. [Online]. Available: http://www.dice.ucl.ac.be/
∼verleyse/papers/ica04jl.pdf
[66] T.-W. Lee, M. Girolami, and T. Sejnowski, “Independent component analysis using
an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources,”
Neural Computation, vol. 11, pp. 417–441, 1999. doi:10.1162/089976699300016719
[67] T. W. Lee, B. Koehler, and R. Orglmeister, “Blind source separation of nonlinear mixing
models,” in Proc. Neural Networks for Signal Processing, 1997, pp. 406–415. [Online].
Available: http://www.cnl.salk.edu/∼tewon/Public/nnsp97.ps.gz
[68] J. Lin, D. Grier, and J. Cowan, “Source separation and density estimation by faithful
equivariant SOM,” in Advances in Neural Information Processing Systems. Cambridge,
MA: MIT Press, 1997, pp. 536–542.
[69] J. Lin, D. Grier, and J. Cowan, “Faithful representation of separable input distributions,”
Neural Computation, vol. 9, pp. 1305–1320, 1997.
[70] E. Lorenz, “Deterministic nonperiodic flow,” Journal of Atmospheric Sciences, vol. 20,
pp. 130–141, 1963. doi:10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2
[71] G. Marques and L. Almeida, “An objective function for independence,” in Proc. Int.
Conf. Neural Networks, Washington, DC, 1996, pp. 453–457.
[72] G. Marques and L. Almeida, “Separation of nonlinear mixtures using pattern repulsion,”
in Proc. First Int. Workshop Independent Component Analysis and Signal Separation, J. F.
Cardoso, C. Jutten, and P. Loubaton, Eds., Aussois, France, 1999, pp. 277–282. [On-
line]. Available: http://www.lx.it.pt/∼lbalmeida/papers/MarquesAlmeidaICA99.ps.zip
[73] R. Martín-Clemente, S. Hornillo-Mellado, J. Acha, F. Rojas, and C. Puntonet, “MLP-
based source separation for MLP-like nonlinear mixtures,” in Proc. Int. Workshop Inde-
pendent Component Analysis and Blind Signal Separation, Nara, Japan, 2003. [Online].
Available: http://www.kecl.ntt.co.jp/icl/signal/ica2003/cdrom/data/0114.pdf
[74] T. Mitchell, Machine Learning. New York, NY: McGraw Hill, 1997.
[75] L. Molgedey and H. Schuster, “Separation of a mixture of independent signals using
time delayed correlations,” Physical Review Letters, vol. 72, pp. 3634–3636, 1994.
[Online]. Available: http://www.theo-physik.uni-kiel.de/thesis/molgedey94.ps.gz
doi:10.1103/PhysRevLett.72.3634
[76] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, “An introduction
to kernel-based learning algorithms,” IEEE Transactions on Neural Networks, vol. 12,
no. 2, pp. 181–201, May 2001. [Online]. Available: http://mlg.anu.edu.au/∼raetsch/
ps/review.pdf doi:10.1109/72.914517
[77] P. Pajunen, “Nonlinear independent component analysis by self-organizing maps,” in
Artificial Neural Networks—ICANN 96, Proc. 1996 International Conference on Artificial
Neural Networks, Bochum, Germany, 1996, pp. 815–819.
[90] B. Schölkopf, A. Smola, and K.-R. Müller, “Nonlinear component analysis as a
kernel eigenvalue problem,” Neural Computation, vol. 10, pp. 1299–1319, 1998.
[Online]. Available: http://users.rsise.anu.edu.au/∼smola/papers/SchSmoMul98.pdf
doi:10.1162/089976698300017467
[91] C. Shannon, “A mathematical theory of communication,” Bell System Technical Jour-
nal, vol. 27, pp. 379–423 and 623–656, July and October 1948. [Online]. Available:
http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
[92] J. Solé, C. Jutten, and D. T. Pham, “Fast approximation of nonlinearities for improving
inversion algorithms of pnl mixtures and Wiener systems,” Signal Processing, 2004.
[93] J. Solé, C. Jutten, and A. Taleb, “Parametric approach to blind deconvolution of
nonlinear channels,” Neurocomputing, vol. 48, pp. 339–355, 2002. doi:10.1016/S0925-2312(01)00651-8
[94] A. Taleb and C. Jutten, “Entropy optimization, application to blind source separation,”
in Proc. Int. Conf. Artificial Neural Networks, Lausanne, Switzerland, 1997, pp. 529–
534.
[95] A. Taleb and C. Jutten, “Nonlinear source separation: The post-nonlinear mixtures,”
in Proc. 1997 Eur. Symp. Artificial Neural Networks, Bruges, Belgium, 1997, pp. 279–
284.
[96] A. Taleb and C. Jutten, “Batch algorithm for source separation in post-nonlinear mix-
tures,” in Proc. First Int. Workshop Independent Component Analysis and Signal Separation,
Aussois, France, 1999, pp. 155–160.
[97] A. Taleb and C. Jutten, “Source separation in post-nonlinear mixtures,” IEEE Transac-
tions on Signal Processing, vol. 47, pp. 2807–2820, 1999. doi:10.1109/78.790661
[98] A. Taleb, J. Solé, and C. Jutten, “Quasi-nonparametric blind inversion of Wiener sys-
tems,” IEEE Transactions on Signal Processing, vol. 49, no. 5, pp. 917–924, 2001.
doi:10.1109/78.917796
[99] Y. Tan, J. Wang, and J. Zurada, “Nonlinear blind source separation using a radial basis
function network,” IEEE Transactions on Neural Networks, vol. 12, no. 1, pp. 124–134,
2001. doi:10.1109/72.896801
[100] F. Theis, C. Bauer, C. Puntonet, and E. Lang, “Pattern repulsion revisited,” in
Proc. IWANN, Series Lecture Notes in Computer Science, no. 2085. New York,
NY: Springer-Verlag, 2001, pp. 778–785. [Online]. Available: http://homepages.uni-
regensburg.de/∼thf11669/publications/theis01patternrep IWANN01.pdf
[101] F. Theis, C. Puntonet, and E. Lang, “Nonlinear geometric ICA,” in Proc. Int. Work-
shop Independent Component Analysis and Blind Signal Separation, Nara, Japan, 2003,
pp. 275–280. [Online]. Available: http://homepages.uni-regensburg.de/∼thf11669/
publications/theis03nonlineargeo ICA03.pdf
[114] A. Ziehe, M. Kawanabe, S. Harmeling, and K.-R. Müller, “Separation of post-nonlinear
mixtures using ACE and temporal decorrelation,” in Proc. Int. Conf. Independent Com-
ponent Analysis and Blind Source Separation, San Diego, CA, 2001, pp. 433–438.
[115] A. Ziehe, M. Kawanabe, S. Harmeling, and K.-R. Müller, “Blind separation of post-
nonlinear mixtures using gaussianizing transformations and temporal decorrelation,”
in Proc. Int. Workshop Independent Component Analysis and Blind Signal Separa-
tion, Nara, Japan, 2003, pp. 269–274. [Online]. Available: http://www.kecl.ntt.co.jp/
icl/signal/ica2003/cdrom/data/0208.pdf
[116] A. Ziehe, M. Kawanabe, S. Harmeling, and K.-R. Müller, “Blind separation of post-
nonlinear mixtures using linearizing transformations and temporal decorrelation,” Jour-
nal of Machine Learning Research, vol. 4, pp. 1319–1338, December 2003. [Online].
Available: http://www.jmlr.org/papers/volume4/ziehe03a/ziehe03a.pdf
[117] A. Ziehe and K.-R. Müller, “TDSEP—An efficient algorithm for blind separation using
time structure,” in Proc. Int. Conf. Artificial Neural Networks, Skövde, Sweden, 1998,
pp. 675–680. [Online]. Available: http://wwwold.first.gmd.de/persons/Mueller.Klaus-
Robert/ICANN tdsep.ps.gz
Biography
Luis B. Almeida is a full professor of Signals and Systems, and of Neural Networks and
Machine Learning, at Instituto Superior Técnico, Technical University of Lisbon, and a re-
searcher at the Telecommunications Institute, Lisbon, Portugal. He holds a Ph.D. in Signal
Processing from the Technical University of Lisbon. He formerly taught Systems Theory,
Telecommunications, Digital Systems, and Mathematical Analysis, among other subjects.
Luis B. Almeida's current research focuses on nonlinear source separation. He previously
performed research on speech modeling and coding, Fourier and time-frequency analysis of
signals, and training algorithms for neural networks. Some highlights of his work include the
sinusoidal model for voiced speech, currently in use in INMARSAT and IRIDIUM telephones
(developed with F.M. Silva and J.S. Marques), work on the Fractional Fourier Transform, the
development of recurrent backpropagation, and the development of the MISEP method of
nonlinear source separation.
Luis B. Almeida has been a founding Vice-President of the European Neural Network
Society and the founding President of INESC-ID (a nonprofit research institute associated
with the Technical University of Lisbon).