Nonlinear Source Separation - Luis B. Almeida
Nonlinear
Source Separation
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other—except for brief quotations in printed reviews, without the prior permission of the publisher.
DOI 10.2200/S00016ED1V01Y200602SPR002
First Edition
Nonlinear
Source Separation
Luis B. Almeida
Instituto das Telecomunicações, Lisboa, Portugal
Morgan & Claypool Publishers
ABSTRACT
The purpose of this lecture book is to present the state of the art in nonlinear blind source
separation, in a form appropriate for students, researchers and developers. Source separation
deals with the problem of recovering sources that are observed in a mixed condition. When we
have little knowledge about the sources and about the mixture process, we speak of blind source
separation. Linear blind source separation is a relatively well studied subject. Nonlinear blind
source separation is still in a less advanced stage, but has seen several significant developments
in the last few years.
This publication reviews the main nonlinear separation methods, including the separation
of post-nonlinear mixtures, and the MISEP, ensemble learning and kTDSEP methods for
generic mixtures. These methods are studied in significant depth. A historical overview is also presented, mentioning most of the relevant results on nonlinear blind source separation that have been presented over the years.
KEYWORDS
Signal Processing, Source Separation, Nonlinear blind source separation, Independent
component analysis, Nonlinear ICA.
Contents
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Basic Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.1 The Scatter Plot as a Tool To Depict Joint Distributions . . . . . . . . . . . . . . 3
1.1.2 Separation Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2. Linear Source Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Statement of the Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 INFOMAX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Choice of the Output Nonlinearities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.2 Maximum Entropy Estimation of the Nonlinearities . . . . . . . . . . . . . . . . 12
2.2.3 Estimation of the Score Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Exploiting the Time-Domain Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3.1 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.4 Other methods: JADE and FastICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.1 JADE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.2 FastICA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3. Nonlinear Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1 Post-Nonlinear Mixtures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.1.1 Separation Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.1.2 Application Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Unconstrained Nonlinear Separation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.1 MISEP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.2 Nonlinear ICA Through Ensemble Learning . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.3 Kernel-Based Nonlinear Separation: kTDSEP . . . . . . . . . . . . . . . . . . . . . . 66
3.2.4 Other Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4. Final Comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
A. Statistical Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
A.1 Passing a Random Variable Through its Cumulative Distribution Function . . . 83
A.2 Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
A.2.1 Entropy of a Transformed Variable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
A.3 Kullback–Leibler Divergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
A.4 Mutual Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
Acknowledgments
José Moura, the editor of this series, invited me to write this book. Without him I would not have considered it. Harri Valpola was very influential on the form in which the ensemble learning method is finally described, having led me to learn quite a bit about the MDL approach to the method. He also kindly provided the examples of application of ensemble learning, and very carefully reviewed the manuscript, having made many useful comments. Stefan Harmeling provided the examples for the TDSEP and kTDSEP methods, and made several useful suggestions regarding the description of those methods. Mariana Almeida made several useful comments on the ensemble learning section. Andreas Ziehe commented on the description of the TDSEP method. Aapo Hyvärinen helped to clarify some aspects of score function estimation. The anonymous reviewers made many useful comments. Joel Claypool, from Morgan and Claypool Publishers, was very supportive, with his attention to the progress of the manuscript and with his very positive comments. I am grateful to them all. Any errors or inaccuracies that may remain are my responsibility, not that of the people who so kindly offered their help.
Notation
In this book we shall adopt the following notational conventions, unless otherwise noted:
Table 1 provides a summary of the symbols and acronyms used in the book.
Preface
Source separation deals with the problem of recovering sources that are observed in a mixed
condition. When we have little knowledge about the sources and about the mixture process,
we speak of blind source separation. Linear blind source separation is a relatively well studied
subject, and there are some good review books on it. Nonlinear blind source separation is still
in a less advanced stage, but has seen several significant developments in the last few years.
The purpose of this book is to present the state of the art in nonlinear blind source
separation, in a form appropriate for students, researchers and developers. The book reviews
the main nonlinear separation methods, including the separation of post-nonlinear mixtures,
and the MISEP, ensemble learning and kTDSEP methods for generic mixtures. These methods
are studied in significant depth. A historical overview is also presented, mentioning most
of the relevant results on nonlinear blind source separation that have been presented over the
years, and giving pointers to the literature.
The book tries to be relatively self-contained. It includes an initial chapter on linear
source separation, focusing on those separation methods that are useful for the ensuing study
of nonlinear separation. An extensive bibliography is included. Many of the references contain
pointers to freely accessible online versions of the publications.
The prerequisites for understanding the book consist of a basic knowledge of mathematical
analysis and of statistics. Some more advanced concepts of statistics are treated in an appendix.
A basic knowledge of neural networks (multilayer perceptrons and backpropagation) is needed
for understanding some parts of the book.
The writing style is intended to afford an easy reading without sacrificing rigor. Where
necessary the reader is pointed to the relevant literature, for a discussion of some more detailed
aspects of the methods that are studied. Several examples of application of the studied methods
are included, and an appendix provides pointers to online code and data for linear and nonlinear
source separation.
CHAPTER 1
Introduction
It is common, in many practical situations, to have access to observations that are mixtures of
some “original” signals, and to be interested in recovering those signals. For example, when
trying to obtain an electrocardiogram of the fetus, in a pregnant woman, the fetus’ signals will
be contaminated by the much stronger signals from the mother’s heart. When recording speech
in a noisy environment, the signals recorded through the microphone(s) will be contaminated
with noise. When acquiring the image of a document in a scanner, the image sometimes gets
superimposed with the image from the opposite page, especially if the paper is thin. In all these
cases we obtain signals that are contaminated by other signals, and we would like to get rid of
the contamination to recover the original signals.
The recovery of the original signals is normally called source separation. The original
signals are normally called sources, and the contaminated signals are considered to be mixtures
of those sources. In cases such as those presented above, if there is little knowledge about the
sources and about the details of the mixture process, we normally speak of blind source separation
(BSS). In many situations, such as most of those involving biomedical or acoustic signals, the
mixture process is known to be linear, to a good approximation. This allows us to perform
source separation through linear operations, and we then speak of linear source separation. On
the other hand, the document imaging situation that we mentioned above is an example of a
situation in which the mixture is nonlinear, and the corresponding separation process also has
to be nonlinear.
Linear source separation has been the object of much study in recent years, and the
corresponding theory is rather well developed. It has also been the subject of some good overview
books, such as [27, 55]. Nonlinear source separation, on the other hand, has been the object
of research only more recently, and until now there was, to our knowledge, no overview book
specifically addressing it (a former overview paper is [59]).
This book attempts to fill this gap, by providing a comprehensive overview of the state of
the art in nonlinear source separation. It is intended to be used as an introduction to the topic
of nonlinear source separation for scientists and students, as well as for applications-oriented
people. It has been intentionally limited in size, so as to be easy to read. For most of the
1.1 BASIC CONCEPTS

There is an unobserved random vector S, whose components are called sources. We observe a mixture vector X which, in the linear case, is related to S by

X = AS, (1.1)
where A is a matrix. Often the numbers of components of S and X are assumed to be the
same, implying that matrix A is square, and we speak of a square separation problem. In source
separation one is interested in recovering the source vector S, and possibly also the mixture
matrix A.
Clearly, the problem cannot be solved without some additional knowledge about S and/or
A. The assumption that is most commonly made is that the components of S are statistically in-
dependent from one another, but sometimes some other assumptions are made, either explicitly
or implicitly. This is clarified ahead.
Several variants of the basic source separation problem exist. First of all, the number of
components of the mixture X may be smaller than the number of sources, and we speak of an
overcomplete or underdetermined problem, or it may be larger than the number of sources, and
we speak of an undercomplete or overdetermined problem.
Another variant involves the presence of additive noise. Equation (1.1) is then replaced
with
X = AS + N,

where N is a random vector of noise components. Yet another variant is the nonlinear mixture, which is the main subject of this book. In that case the mixture model becomes

X = M(S),
where M(·) represents a nonlinear mapping. We observe X and wish to recover S. Again, this
cannot be done without some further assumptions, which will be clarified ahead.
As in the linear case, the nonlinear mixture may have noise, and may be nonstationary
or noninstantaneous. However, the formal treatment of these variants is not much developed
to date, and in this book we shall limit ourselves, almost always, to stationary, instantaneous,
noiseless mixtures. Also, in most situations, we shall consider the mixture to be square; i.e., the
sizes of X and S will be the same, although there will be a few cases in which we shall consider
different situations.
FIGURE 1.1: Examples of linear and nonlinear mixtures. The top row shows mixtures of two supergaussian sources and the bottom row shows mixtures of two uniformly distributed sources
In the upper example the sources have densities that are strongly peaked (they are much more peaked than Gaussian distributions with the same variance, and for this reason they are called supergaussian).¹
The lower example shows uniformly distributed sources.
Note that, in both cases, a horizontal section through the distribution of the sources—
Figs. 1.1(a) and 1.1(d)—will yield a density which has the same shape (apart from an amplitude
scale factor) irrespective of where the section is made. The same happens with vertical sections.
These sections, once normalized, correspond to conditional distributions, and the fact that their
shape is independent of where the section is made means that the conditional distribution of
one of the sources given the other is independent of that other source. For example, p(s 1 | s 2 ) is
independent of s 2 . This means that the sources are independent from each other.
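To make this concrete, the following minimal sketch (the sources, mixing matrix, nonlinearity and dependence statistic are illustrative choices, not taken from the book) generates a supergaussian and a uniform source, mixes them linearly and nonlinearly, and checks numerically that mixing destroys independence:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10000

# Two independent sources, in the spirit of Fig. 1.1: a supergaussian
# (Laplacian) one and a uniformly distributed one.
s1 = rng.laplace(size=n)
s2 = rng.uniform(-1, 1, size=n)
S = np.vstack([s1, s2])

# A square linear mixture X = A S.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S

# A simple nonlinear mixture (illustrative): a linear mixture followed by a
# componentwise nonlinearity, which curves originally straight lines.
Xnl = np.tanh(0.8 * X)

def dependence_hint(u, v):
    """Correlation between squares of u and v: (approximately) zero for
    independent variables, typically nonzero for dependent ones."""
    return np.corrcoef(u**2, v**2)[0, 1]

print("sources          :", dependence_hint(S[0], S[1]))   # ~ 0
print("linear mixture   :", dependence_hint(X[0], X[1]))   # clearly nonzero
print("nonlinear mixture:", dependence_hint(Xnl[0], Xnl[1]))
```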
The middle column of the figure shows examples of linear mixtures. In such mixtures,
lines that were originally straight remain straight, and lines that were originally parallel to each other remain parallel to each other. Linear operations can only perform scaling, rotation, and shear transformations, which do not affect the straightness or the parallelism of lines. But note that in these mixtures the two components are no longer independent from each other. This can be seen by the fact that sections (either horizontal or vertical) through the distribution yield densities that depend on where the section was made.

¹ More specifically, a random variable is called supergaussian if its kurtosis is positive, and subgaussian if its kurtosis is negative (the kurtosis of a Gaussian random variable is zero). The kurtosis of a random variable is defined as its fourth-order cumulant (see Section 2.4.1).
The rightmost column shows two examples of nonlinear mixtures. In the upper one, lines
that were originally straight now appear curved. In the lower one, lines that were originally
parallel to each other (opposite edges of the square) do not remain parallel to each other, even
though they remain straight. Both of these are telltale signs of nonlinear mixtures. Whenever
they occur we know that the mixture that is being considered is nonlinear. And once again, we
can see that the two random variables in each mixture are not independent from each other:
The densities corresponding to horizontal or vertical sections through the distribution depend
on where these sections were made.
1.2 SUMMARY
Source separation deals with the recovery of sources that are observed in a mixed condition.
Blind source separation (BSS) refers to source separation in situations in which there is little
knowledge about the sources and about the mixing process.
Most of the work done to date on source separation concerns linear mixtures. Variants
that have been studied include over- and undercomplete mixtures, noisy, nonstationary, and
convolutive (noninstantaneous) mixtures.
This book is mainly concerned with nonlinear mixtures. These may also be noisy, non-
stationary, and/or noninstantaneous, but we shall normally restrict ourselves to the basic case of
noise-free, stationary instantaneous mixtures, because other variants have still been an object of
very little study in the nonlinear case. The book analyzes with some detail the main nonlinear
separation methods, and gives an overview of other methods that have been proposed.
An assumption that is frequently made in blind source separation is that the sources are
independent random variables. The mutual dependence of random variables is often measured
by means of their mutual information.
CHAPTER 2

Linear Source Separation
In this chapter we shall make a brief overview of linear source separation in order to introduce
several concepts and methods that will be useful later for the treatment of nonlinear separation.
We shall only deal with the basic form of the linear separation problem (same number of
observed mixture components as of sources, no noise, stationary instantaneous mixture) since
this is, almost exclusively, the form with which we shall be concerned later, when dealing with
nonlinear separation.
2.1 STATEMENT OF THE PROBLEM

In the linear setting there is an unobserved random vector S, whose components are the sources, and an observed mixture vector X given by

X = AS,
where the sizes of S and X are equal, and A is a square, invertible matrix, which is usually called
the mixture matrix or mixing matrix. We observe N exemplars of X, generated by identically
distributed (but possibly nonindependent) exemplars of S. These exemplars of S, however, are
not observed. In the blind separation setting, we assume that we do not know the mixing matrix,
and that we have relatively little knowledge about S. We wish to recover the sources, i.e. the
components of S.
If we knew the mixing matrix, we could simply invert it and compute the sources by
means of the inverse
S = A −1 X.
However, in the blind source separation setting we do not know A, and therefore we have to
use some other method to estimate the sources. As we have said above, the assumption that
is most commonly made is that the sources (the components of S) are mutually statistically
independent.
In accordance with this assumption, one of the most widely used methods to recover the
sources consists of estimating a square matrix W such that the components of Y, given by
Y = W X, (2.1)
are as independent from one another as possible. If the independence is achieved (and if at most one of the sources is Gaussian), it can be shown that

Y = PDS,
where P is a permutation matrix² and D is a diagonal matrix. This means that the components
of Y are the components of S, possibly subject to a permutation and to arbitrary scalings.
This use of the independence criterion leads to the so-called independent component analysis
(ICA) techniques, which analyze mixtures into sets of components that are as independent from
one another as possible, according to some mutual dependence measure.
Note that simply imposing that the components of Y be uncorrelated with one another
does not suffice for separation. There is an infinite number of solutions of (2.1) in which
the components of Y are mutually uncorrelated, but in which each component still contains
contributions from more than one source.³ Also note that, for the same random vector X,
principal component analysis (PCA) [39] and independent component analysis normally yield
very different results. Although both methods yield components that are mutually uncorrelated,
the components extracted by PCA normally are not statistically independent from one another,
and normally contain contributions from more than one source, contrary to what happens with
ICA.
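A small numerical sketch of this point (the sources, mixture and whitening step are illustrative choices): after a PCA-like whitening the components have an identity covariance matrix, yet a fourth-order cross statistic, which would vanish for independent zero-mean components, shows that they are still dependent.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# Independent, zero-mean, subgaussian (uniform) sources and a linear mixture.
S = rng.uniform(-1, 1, size=(2, n))
A = np.array([[1.0, 0.8],
              [0.3, 1.0]])
X = A @ S

# Whitening: the result has (approximately) an identity covariance matrix...
d, E = np.linalg.eigh(np.cov(X))
V = E @ np.diag(d**-0.5) @ E.T        # symmetric whitening matrix
Z = V @ X
print(np.cov(Z).round(3))             # ~ identity: uncorrelated components

# ...but the components are generally not independent: this fourth-order
# cross statistic would be ~0 for independent zero-mean variables.
print(np.mean(Z[0]**2 * Z[1]**2) - np.mean(Z[0]**2) * np.mean(Z[1]**2))
```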
There are several practical methods for estimating the matrix W based on the indepen-
dence criterion (see [27, 55] for overviews). In the next sections we shall briefly describe some
of them, focusing on those that will be useful later in the study of nonlinear separation.
2.2 INFOMAX
INFOMAX, also often called the Bell–Sejnowski method, is a method for performing linear
ICA; i.e., it attempts to transform the mixture X, according to (2.1), into components Yi which
are as independent from one another as possible.
The INFOMAX method is based on the structure depicted in Fig. 2.1. In the left-hand
side of the figure we recognize the implementation of the separation operation (2.1). The result
of the separation is Y. The ψi blocks and the Zi outputs are auxiliary, being used only during
the optimization process.
² A permutation matrix has exactly one element per row and one element per column equal to 1, all other elements being equal to zero.
³ However, decorrelation, in a different setting, can be a criterion for separation, as we shall see in Section 2.3.
FIGURE 2.1: Structure used by INFOMAX. The W block performs a product by a matrix, and is what
performs the separation proper. The separated outputs are Yi . The ψi blocks, implementing nonlinear
increasing functions, are auxiliary, being used only during optimization
In the paper that introduced it [19], INFOMAX was justified on the basis of an informa-
tion maximization criterion (hence its name). However, the method has later been interpreted
as a maximum likelihood method [83] and also as a method based on the minimization of
the mutual information I (Y ) [55] (recall that mutual information was seen, in Section 1.1.2,
to be a measure of statistical dependence). Here we shall use the interpretation based on the
minimization of mutual information, because this will be the most useful approach when we
deal with nonlinear separation methods, later on.
For presenting the mutual information interpretation, we shall start by assuming that each
of the ψi functions is equal to the cumulative distribution function (CDF) of the corresponding
random variable Yi (we shall denote that cumulative function by FYi ). Then, all of the Zi will
be uniformly distributed in (0, 1) (see Appendix A.1). Therefore, p(zi ) will be 1 in (0, 1) and
zero elsewhere, and the entropy of each of the Zi will be zero, H(Zi) = 0.
As shown in Appendix A.4, the mutual information is not affected by performing in-
vertible, possibly nonlinear, transformations on the individual random variables. In the above
setting, since the ψi functions are invertible, this means that I (Y ) = I (Z ). Therefore,
I(Y) = I(Z) = Σ_i H(Z_i) − H(Z) = −H(Z).    (2.2)
This result shows that minimizing the mutual information of the components of Y can be
achieved by maximizing the output entropy, H(Z ).4 This is an advantage, since the direct min-
imization of the mutual information is generally difficult, and maximizing the output entropy
is easier, as we shall see next.
The output entropy of the system of Fig. 2.1 is related to the input entropy by⁵

H(Z) = H(X) + E[log |det J|],

⁴ We are considering continuous random variables, and therefore H(·) represents the differential entropy, throughout this and the next chapter (see Appendices A.2 and A.4). We shall designate it simply as entropy, for brevity.
⁵ See Appendix A.2.1.
where J is the Jacobian of the transformation from X to Z. Since H(X) does not depend on the system's parameters, maximizing H(Z) amounts to maximizing E[log |det J|], which in practice is estimated by an average over the training set. Denoting by J_m the Jacobian corresponding to the mth exemplar of X in the training set (the mth training pattern), and by M the number of exemplars in the training set, that average, J, is what shall effectively be used as the objective function to be maximized.
The maximization is usually done by means of gradient-based methods. We have
J = (1/M) Σ_{m=1}^{M} L_m    (2.3)
with
L_m = log |det J_m|.    (2.4)
For simplicity we shall drop, from now on, the subscript m, which refers to the training pattern
being considered. The determinant of the Jacobian in the preceding equation is given by

det J = det W ∏_j ψ′_j(y_j).
Maximization of J through the relative (natural) gradient leads to the update rule

ΔW ∝ (I + ξ yᵀ) W,

where ξ is the column vector with components ξ_i = ψ″_i(y_i)/ψ′_i(y_i).
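As an illustration of an update of this general form, the sketch below uses the classical fixed logistic output nonlinearity, for which ξ = 1 − 2z with z the logistic function of y (the adaptive, maximum-entropy estimation of the nonlinearities is discussed below); the sources, mixing matrix, step size and number of iterations are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Two supergaussian (Laplacian) sources and a square linear mixture.
S = rng.laplace(size=(2, n))
A = np.array([[1.0, 0.5],
              [0.7, 1.0]])
X = A @ S

W = np.eye(2)
lr = 0.1
for it in range(1000):
    Y = W @ X
    Z = 1.0 / (1.0 + np.exp(-Y))          # logistic output nonlinearities
    xi = 1.0 - 2.0 * Z                    # psi''/psi' for the logistic case
    # Relative/natural-gradient ascent on the output entropy (batch average):
    W += lr * (np.eye(2) + (xi @ Y.T) / n) @ W

print(np.round(W @ A, 3))   # should be close to a scaled permutation matrix
```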
The distribution of Y has been assumed to be fixed, and therefore I (Y ) is constant. Maxi-
mization of H(Z) thus corresponds to the maximization of Σ_i H(Z_i). Since the parameter
vectors wi can be varied independently from one another, this corresponds to the simultaneous
maximization of all the H(Zi ) terms.
Given that ψ̂i has codomain (0, 1), Zi is limited to that interval. We therefore have,
for each Zi , the maximization of the entropy of a continuous random variable, constrained
to the interval (0, 1). The maximum entropy distribution under this constraint is the uniform
distribution (see Appendix A.2). Therefore the maximization will lead each Zi to become as
close as possible to a uniformly distributed variable in (0, 1), subject only to the restrictions
imposed by the limitations of the family of functions ψ̂i.⁶
Since, at the maximum, Zi is approximately uniformly distributed in (0, 1), and since
ψ̂i (yi , wi ) is, by construction, a nondecreasing function of yi , it must be approximately equal
to the CDF FYi , as desired (see Appendix A.1). Therefore, maximization of H(Z ) leads the
ψ̂i (yi , wi ) functions to approximate the corresponding CDFs, subject only to the limitations of
the family of approximators ψ̂i (yi , wi ).
Let us now drop the assumption that W is constant, and assume instead that it is also being
optimized by maximization of H (Z ). Consider what happens at a maximum of H(Z). Each
of the H (Zi ) must be maximal, because otherwise H (Z ) could still be increased further, by
increasing that specific H(Zi ) while keeping the other Z j and Y fixed—see Eq. (2.8). Therefore,
at a maximum of H(Z ), the Zi variables must be as uniform as possible, and the ψ̂i functions
must be the closest possible approximations of the corresponding CDFs.
This shows that the maximization of the output entropy will lead to the desired estima-
tion of the CDFs by the ψ̂i functions while at the same time optimizing W. In practice, during
⁶ We are disregarding the possibility of stopping at a local maximum during the optimization. That would be a contingency of the specific optimization procedure being used, and not of the basic method itself.
the optimization process, when W changes rapidly, the ψ̂i functions will follow the corre-
sponding CDFs with some lag, because the CDFs are changing rapidly. During the asymptotic
convergence to the maximum of H(Z), W will change progressively slower, and the ψ̂i func-
tions will approximate the corresponding CDFs progressively more closely, tending to the best
possible approximations.
In summary, the maximization of the output entropy leads to the minimization of the
mutual information I (Y), with the simultaneous adaptive estimation of the ψi nonlinearities.
Before proceeding to discuss the practical implementation of this method for the esti-
mation of the nonlinearities, we note that it is based on the optimization of the same objective
function that is used for the estimation of the separating matrix. Therefore, the whole opti-
mization is based on a single objective function. There are maximization methods (e.g. methods
based on gradient ascent [4]) which are guaranteed to converge when there is a single function
to be optimized, which is differentiable and upper-bounded, as in this case. Therefore there is
no risk of instability, contrary to what happens if the ψi nonlinearities are estimated based on
some other criterion.
⁷ We shall designate each linear block by the same letter that denotes the corresponding matrix, because this will not cause any confusion.
FIGURE 2.2: Structure of the INFOMAX system with adaptive estimation of the nonlinearities. The
part marked ψ corresponds to all the ψi blocks taken together. See the text for further explanation
The three rightmost blocks represent all the ψi blocks of Fig. 2.1 taken together. Block
B̄ performs a product by matrix B̄, which is the weight matrix of the hidden units of all the
ψi blocks, taken as forming a single hidden layer. As in ȳ, the overbar in B̄ indicates that this
matrix incorporates the elements relative to the bias terms of the hidden units. The output of
this block consists of the vector of input activations of the units of the global hidden layer of all
the ψi blocks.
Block Φ is nonlinear and applies to its input vector, on a per-component basis, the
nonlinear activation functions of the units of that hidden layer. We shall denote these activation
functions by φi . Usually these activation functions are the same for all units, normally having a
sigmoidal shape. A common choice is the hyperbolic tangent function.
The output of the Φ block is the vector of output activations of the global hidden layer.
Block C multiplies this vector, on the left, by matrix C, which is the weight matrix of the linear
output units of the ψi blocks. The output of the C block is the output vector z.
Of course, since there are no interconnections between the various ψi blocks, matrices B̄
and C have a special structure, with a large number of elements equal to zero. This does not
affect our further analysis, except for the fact that those elements are never changed during the
optimization, being always kept equal to zero, to keep the correct network structure.
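To make the block structure of B̄ and C concrete, the following sketch (with arbitrary dimensions) builds the corresponding masks; during the optimization the weight updates are simply multiplied by these masks, so that the nonexistent interconnections stay exactly at zero.

```python
import numpy as np

n = 2        # number of separated components y_i
h = 10       # hidden units in each psi_i block

# Bbar maps [y; 1] (bias included) to the n*h hidden input activations:
# only the column of y_i, plus the bias column, feeds the hidden units of psi_i.
Bbar_mask = np.zeros((n * h, n + 1))
C_mask = np.zeros((n, n * h))
for i in range(n):
    rows = slice(i * h, (i + 1) * h)
    Bbar_mask[rows, i] = 1.0      # weights from y_i to the hidden units of psi_i
    Bbar_mask[rows, n] = 1.0      # bias weights of those hidden units
    C_mask[i, rows] = 1.0         # output weights of psi_i

# A gradient step would then look like:
#   Bbar += lr * dBbar * Bbar_mask
#   C    += lr * dC    * C_mask
print(Bbar_mask.astype(int))
print(C_mask.astype(int))
```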
We wish to compute the gradient of the output entropy H(Z ) relative to the weights
of this network, so that we can use gradient-based maximization methods. We shall not give
explicit expressions of the components of the gradient here, because they are somewhat complex
(and would become even more complex later, in the nonlinear MISEP method). Instead, we
shall derive a method for computing these gradient components, based on the backpropagation
method for multilayer perceptrons [4, 39]. As normally happens with backpropagation, this
method is both more compact and more efficient than a direct computation of each of the
individual components of the gradient.
Let us take Eq. (2.4), which we reproduce here:

L_m = log |det J_m|.
J = C Φ′ B W,    (2.9)

where Φ′ is a diagonal matrix whose diagonal elements are φ′_i, i.e. the derivatives of the activation functions of the corresponding hidden units, for the specific input activations that they receive for the input pattern being considered. B is the weight matrix B̄ stripped of the column corresponding to the bias weights (those weights disappear from the equations when differentiating relative to x).
The network that computes J according to this equation is shown in Fig. 2.3. The lower
part is what computes (2.9) proper. It propagates matrices (this is depicted in the figure by the
“3-D arrows”). Its input is the identity matrix I of size n × n (n being the number of sources and
of mixture components). Block W performs a product, on the left, by matrix W, yielding matrix
W itself at the output (this might seem unnecessary but is useful later, when backpropagating,
to allow the computation of derivatives relative to the elements of W ). The following blocks
also perform products, on the left, by the corresponding matrices.
It is clear that the lower chain computes J as per (2.9). The upper part of the network
is needed because the derivatives in the diagonal matrix Φ′ depend on the input pattern. The
two leftmost blocks of the upper part compute the input activations of the nonlinear hidden
units, and transmit these activations, through the gray-shaded arrow, to block Φ′, allowing it
to compute the activation function derivatives. Supplying this information is the reason for the
presence of the upper part of the network.
FIGURE 2.3: Network for computing the Jacobian. The lower part is what computes the Jacobian
proper, and is essentially a linearized version of the network of Fig. 2.2. The upper part is identical to
the network of Fig. 2.2. The two rightmost blocks, shown in light gray, are not necessary for computing
J , and are shown only for better correspondence with Fig. 2.2. See the text for further explanation
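The sketch below carries out the computation expressed by (2.9) for one input pattern, and numerically checks the quantity ∂L/∂J that is fed to the backpropagation pass below (weights, dimensions and the tanh activations are illustrative, and the block structure of B̄ and C is ignored here for brevity).

```python
import numpy as np

rng = np.random.default_rng(3)
n, h = 2, 10                      # components, and hidden units per psi_i block

W = rng.normal(size=(n, n))
Bbar = np.abs(rng.normal(size=(n * h, n + 1))) * 0.3   # positive initialization
C = np.abs(rng.normal(size=(n, n * h))) / np.sqrt(n * h)
B = Bbar[:, :n]                   # Bbar without the bias column

x = rng.normal(size=n)            # one mixture pattern

# Upper chain of Fig. 2.3: y = W x, then the hidden input activations s.
y = W @ x
s = Bbar @ np.append(y, 1.0)

# Eq. (2.9): J = C Phi' B W, with Phi' holding the activation derivatives.
Phi_prime = np.diag(1.0 - np.tanh(s) ** 2)             # tanh'(s)
J = C @ Phi_prime @ B @ W
L = np.log(abs(np.linalg.det(J)))

# Finite-difference check that dL/dJ = J^{-T}.
eps = 1e-6
num = np.zeros_like(J)
for i in range(n):
    for j in range(n):
        Jp = J.copy()
        Jp[i, j] += eps
        num[i, j] = (np.log(abs(np.linalg.det(Jp))) - L) / eps
print(np.abs(num - np.linalg.inv(J).T).max())   # ~ 0 (finite-difference error)
```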
We can, therefore, use the standard backpropagation procedure.⁸ The inputs to the backpropagation network are

∂L/∂J = J^{−T},

where the −T superscript denotes the transpose of the inverse of the matrix.

⁸ The fact that the network's outputs form a matrix instead of a vector is unimportant: we can form a vector by arranging all the matrix elements into a single column.
The backpropagation method to be used is rather standard, except for two aspects that
we detail here:
• All the blocks that appear in the network of Fig. 2.3 are standard neural network
blocks, with the exception of block Φ′. We shall now see how to backpropagate through
this block. A unit of this block, i.e. a unit that performs the product by the derivative of a hidden unit's activation function, is shown in Fig. 2.4(a). The unit receives an input (that we denote by g_ij) from block B, and receives the activation value s_i (which is necessary to compute the derivative) from the upper part of the network, through the gray-shaded arrow of Fig. 2.3. This unit produces the output (that we denote by h_ij)

h_ij = φ′(s_i) g_ij.

FIGURE 2.4: Backpropagation through the Φ′ block. (a) Forward unit. (b) Backpropagation unit. Each box denotes a product by the value indicated inside the box
• The activation functions of the hidden layer’s units are chosen to be sigmoids with
values in (−1, 1), such as tanh functions.
• The vector of weights leading to each linear output unit is normalized, after each gradient update, to a Euclidean norm of 1/√h, where h is the number of hidden units feeding that output unit.
This guarantees that the MLPs’ outputs will always be in the interval (−1, 1). The constraint
to nondecreasing functions can also be implemented in several ways. The one that has proved to be best is in fact a soft constraint:
• The hidden units’ sigmoids are all chosen to be increasing functions (e.g. tanh).
• All the MLP’s weights are initialized to positive values (with the exception of the hidden
units’ biases, which may be negative).
In strict terms, this only guarantees that, upon initialization, the MLPs will implement increas-
ing functions. However, during the optimization process, the weights will tend to stay positive
because the optimization maximizes the output entropy, and any sign change in a weight would
tend to decrease the output entropy. In practice it has been found that negative weights occur
very rarely in the optimizations, and when they occur they normally switch back to positive
after a few iterations.
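The following sketch builds one of the ψ̂i networks with the constraints just described (tanh hidden units, positive initialization except for the hidden biases, and output weights normalized to Euclidean norm 1/√h); the training loop is omitted, and the sizes and initialization scales are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
h = 10                                        # hidden units

w_hid = np.abs(rng.normal(size=h)) * 0.5      # weights y -> hidden (positive)
b_hid = rng.normal(size=h) * 0.5              # hidden biases (may be negative)
w_out = np.abs(rng.normal(size=h))            # output weights (positive)

def normalize_output_weights(w):
    """Renormalize to Euclidean norm 1/sqrt(h), keeping the output in (-1, 1)."""
    return w / (np.linalg.norm(w) * np.sqrt(h))

w_out = normalize_output_weights(w_out)       # to be repeated after each update

def psi_hat(y):
    """Increasing, bounded nonlinearity implemented by the constrained MLP."""
    return np.tanh(np.outer(y, w_hid) + b_hid) @ w_out

print(psi_hat(np.linspace(-3, 3, 7)))         # increasing values inside (-1, 1)
```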
An Example. We shall now present an example of the use of INFOMAX, with entropy-based estimation of the nonlinearities, for the separation of two sources. One of the sources was
strongly supergaussian, while the other was uniformly distributed (and therefore subgaussian).
Figure 2.5(a) shows the joint distribution of the sources.
FIGURE 2.5: Scatter plots corresponding to the example of linear separation by INFOMAX with
maximum entropy estimation of the output nonlinearities
Figure 2.5(b) shows the mixture distribution. Figure 2.5(c) shows the result of separation, and
we can see that it was virtually perfect. This can be confirmed by checking the product of the
separating matrix that was estimated, W, with the mixing matrix
WA = [  8.688   0.002
       −0.133   8.114 ].
This product is very close to a diagonal matrix, confirming the quality of the separation.
FIGURE 2.6: (a) Function 1. (b) Function 2
The MLPs that were used to estimate the cumulative distributions had 10 hidden units
each. Figure 2.6 shows the functions that were estimated by these networks. We see that they
match well (qualitatively, at least) the cumulative distributions of the corresponding sources.
The quality of the estimation can be checked better by observing the distribution of the auxiliary
output z, shown in Fig. 2.5(d). We can see that this distribution is nearly perfectly uniform
within a square. On the one hand, this means that the cumulative functions were quite well
estimated. On the other hand, it also confirms that the extracted components were virtually
independent from each other, because otherwise z could not have a distribution with this
form.
This example illustrates both the effectiveness of INFOMAX in performing linear sepa-
ration and the effectiveness of the maximum entropy estimation of the cumulative distributions.
As we mentioned above, INFOMAX with this estimation method corresponds to the linear
version of MISEP, whose extension to nonlinear separation will be studied in Section 3.2.1.
2.2.3 Estimation of the Score Functions

In the gradient expressions of INFOMAX, the nonlinearities ψ_j intervene only through the ratio

ϕ̂_j(y_j) = ψ″_j(y_j) / ψ′_j(y_j).
We can therefore work directly with the functions ϕ̂ j instead of the functions ψ j . Furthermore
we know that, ideally, ψ j (y j ) = FY j (y j ), where FY j is the CDF of Y j . Therefore the objective
is to approximate the functions

ϕ_j(y_j) = F″_{Y_j}(y_j) / F′_{Y_j}(y_j) = p′(y_j) / p(y_j) = (d/dy_j) log p(y_j).    (2.11)
These ϕ j functions play an important role in ICA, and are called score functions. Since their
definition involves pdfs, one would expect that their estimation would also involve an estima-
tion of the probability densities. However, in [97] Taleb and Jutten proposed an interesting
estimation method that involves the pdfs only indirectly, through expected values that can
easily be estimated by averaging on the training set. This is the method that we shall now
study.
For the derivation of this method we shall drop, for simplicity, the index j , referring to
the component under consideration. The method assumes that we have a parameterized form
of our estimate, ϕ̂(y, w), where w is a vector of parameters (Taleb and Jutten proposed using
a multilayer perceptron with one input and one output for implementing ϕ̂; then, w would
be the vector of weights of the perceptron). The method uses a minimum mean squared error
criterion, defining the mean squared error as
ε = (1/2) E{[ϕ̂(Y, w) − ϕ(Y)]²},    (2.12)
where E(·) denotes statistical expectation, and the factor 1/2 is for later convenience. The minimization of ε is performed through gradient descent. The gradient of ε relative to the parameter vector is
∂ε/∂w = E{[ϕ̂(Y, w) − ϕ(Y)] ∂ϕ̂(Y, w)/∂w}
      = ∫ p(y) [ϕ̂(y, w) − p′(y)/p(y)] ∂ϕ̂(y, w)/∂w dy
      = ∫ p(y) ϕ̂(y, w) ∂ϕ̂(y, w)/∂w dy − ∫ p′(y) ∂ϕ̂(y, w)/∂w dy.

Integrating the last term by parts, and assuming that p(y) vanishes at infinity, that term becomes −∫ p(y) ∂²ϕ̂(y, w)/(∂y ∂w) dy, so that the gradient involves the pdf only through expectations:

∂ε/∂w = E[ϕ̂(Y, w) ∂ϕ̂(Y, w)/∂w + ∂²ϕ̂(Y, w)/(∂y ∂w)],    (2.13)

which can be estimated by averaging over the training set.
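A sketch of this estimation idea for a toy model that is linear in its parameters (the basis functions y and y³, the sample, the step size and the number of iterations are illustrative choices), using the expectation-only form of the gradient derived above; for a standard Gaussian sample the estimate should approach the true score ϕ(y) = −y.

```python
import numpy as np

rng = np.random.default_rng(5)
y = rng.normal(size=20000)          # sample of one separated component

# phi_hat(y, w) = w[0]*y + w[1]*y**3
F = np.vstack([y, y**3])                                        # d phi_hat / d w
Fp_mean = np.vstack([np.ones_like(y), 3 * y**2]).mean(axis=1)   # E[d2 phi_hat / dy dw]

w = np.zeros(2)
lr = 0.05
for it in range(2000):
    grad = (F * (w @ F)).mean(axis=1) + Fp_mean     # expectation-only gradient
    w -= lr * grad

print(np.round(w, 2))   # approximately [-1, 0]: the score of a standard
                        # Gaussian is phi(y) = -y
```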
• What is the best way to interleave optimization iterations for each of the objective func-
tions? A simple solution is to alternate a step of estimation of the unmixing matrix with
one of estimation of the score functions. However, it could make more sense to perform
several successive steps of optimization of the score functions for each step of optimiza-
tion of the matrix, so that the optimization of the matrix would be done with rather good
estimates of the score functions. Another possibility that has been proposed [97] is to
use a stochastic optimization for the score function estimates, in which the expectation
is dropped from Eq. (2.13), and the resulting stochastic gradient is used to update the
score function estimate once for each pattern that is processed during the optimization.
• More important from a theoretical viewpoint is that this kind of interleaved optimization of two or more objective functions does not normally have a guaranty of convergence.

2.3 EXPLOITING THE TIME-DOMAIN STRUCTURE

Some separation methods exploit the time structure of the sources through the time-lagged correlation matrices⁹

Cτ(t) = E[Y(t) Yᵀ(t + τ)],
where E(·) denotes statistical expectation. If the components of Y are independent, then Cτ (t)
will be diagonal for all values of t and τ . The methods that we are studying implicitly assume
that the sources are jointly ergodic. This implies that they are jointly stationary, and therefore
Cτ = ⟨Y(t) Yᵀ(t + τ)⟩,

where ⟨·⟩ denotes an average over the time variable t. If the components are independent, the Cτ matrices should be diagonal for all τ. The ICA methods that use the time domain structure enforce that Cτ be diagonal for more than one value of τ, and differ from one another in the details of how they do so. In practice, Cτ is replaced with an approximation, obtained by computing an average over a finite amount of time.

⁹ Actually it is their covariance that must be zero. But it is common to assume that the means of the random variables have been subtracted out, and then the correlation and the covariance are equal. In the remainder of this chapter we shall assume that the means have been subtracted out.
The method of Molgedey and Schuster enforces the simultaneous diagonality of C0 (i.e.
of the correlation with no delay) and of Cτ for one specific, user-chosen value of the delay τ .
These authors showed that this simultaneous diagonality condition can be transformed into
a generalized eigenvalue problem, which can be solved by standard linear algebra methods,
yielding the separating matrix W. A sufficient condition for separation is that the sources have distinct normalized autocorrelations for the chosen delay τ.

The TDSEP method enforces the approximate joint diagonality of Cτ for a whole set of delays. It starts by prewhitening the mixture,

Z = BX,

where B is a prewhitening matrix, chosen so that the components of Z are uncorrelated and have unit variance.
Matrix B can be found by one of several methods, one of which is PCA. Once a prewhitening
matrix is found, it can be shown that the separating matrix is related to it by
W = QB,

where Q is an orthogonal matrix.
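A sketch of the Molgedey–Schuster idea in this form (the AR(1) sources, the choice of lag, and the symmetrization of the lagged correlation matrix are illustrative implementation choices):

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(6)
T, tau = 20000, 1

def ar1(a, T):
    """A source with time structure: first-order autoregressive process."""
    s = np.zeros(T)
    e = rng.normal(size=T)
    for t in range(1, T):
        s[t] = a * s[t - 1] + e[t]
    return s

S = np.vstack([ar1(0.9, T), ar1(0.3, T)])     # different autocorrelations
A = np.array([[1.0, 0.8515],
              [0.5525, 1.0]])
X = A @ S
X = X - X.mean(axis=1, keepdims=True)         # means subtracted out

C0 = (X @ X.T) / T
Ct = (X[:, :-tau] @ X[:, tau:].T) / (T - tau)
Ct = (Ct + Ct.T) / 2                          # symmetrized lagged correlation

# Simultaneous diagonalization of C0 and C_tau as a generalized eigenproblem.
vals, vecs = eigh(Ct, C0)
W = vecs.T                                    # rows = separating vectors
print(np.round(W @ A, 3))    # close to a scaled permutation matrix
```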
2.3.1 An Example
We show an example of the separation, by TDSEP, of a mixture of two sources: a speech and a
music signal. The source signals are shown in Fig. 2.7(a).
These signals were mixed using the mixture matrix

A = [ 1.0000   0.8515
      0.5525   1.0000 ].
The components of the mixture are shown in Fig. 2.7(b). TDSEP was applied to this mixture,
using the set of delays {0, 1, 2, 3}. The separation results are shown in Fig. 2.7(c). They are
virtually identical to the original sources. This can be confirmed by computing the product of the separating and mixture matrices,

WA = [ 10.2127   −0.0288
       −0.0286    5.9523 ],

which is very close to a diagonal matrix.

¹⁰ The reflection is never needed in practice, because it would just change the signs of some components, and the scale indeterminacy of ICA already includes a sign indeterminacy.
¹¹ Givens rotations of a multidimensional space are rotations that involve only two coordinates at a time, leaving the remaining coordinates unchanged.
2.4 OTHER METHODS: JADE AND FASTICA

2.4.1 JADE
Some criteria that are frequently used in linear ICA are based on cumulants. The cumulants of a
distribution are related to the coefficients of the expansion of the cumulant-generating function
as a power series. More specifically, for a univariate distribution, they are defined implicitly as
the κi coefficients in the power series expansion
∞
( j ω)i
ψ(ω) = κi ,
i=0
i!
where j is the imaginary unit and ψ is the cumulant-generating function, which is defined as
the logarithm of the moment-generating function: ψ(ω) = log ϕ(ω). The moment-generating
function is the Fourier transform of the probability density function,
ϕ(ω) = ∫ p(x) e^{jωx} dx.
The first four cumulants correspond to relatively well-known statistical quantities. The
first cumulant of a distribution is its mean, and the second is its variance. The third cumulant
is the so-called skewness, which is an indicator of the distribution’s asymmetry. The fourth
cumulant is called kurtosis, and is an indicator of the distribution’s deviation from Gaussianity.
For more details see [55], for example.
Cumulants can also be defined, in a similar way, for joint distributions of several variables.
For example, for two random variables Y1 and Y2 , the joint cumulants κkl , of orders k and l
relative to Y1 and to Y2 respectively, are given implicitly by the series expansion
∞
( j ω1 )k ( j ω2 )l
ψ(ω1 , ω2 ) = κkl ,
k,l=0
k! l!
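An empirical check of the sign of the kurtosis for the three kinds of distribution mentioned above (the sample size and the particular distributions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200000

def kurtosis(x):
    """Fourth-order cumulant of (zero-mean) data: E[x^4] - 3 (E[x^2])^2."""
    x = x - x.mean()
    return np.mean(x**4) - 3 * np.mean(x**2)**2

print(kurtosis(rng.laplace(size=n)))          # > 0: supergaussian
print(kurtosis(rng.normal(size=n)))           # ~ 0: Gaussian
print(kurtosis(rng.uniform(-1, 1, size=n)))   # < 0: subgaussian
```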
2.4.2 FastICA
Another criterion that is used for linear ICA is based on negentropy, a concept that we shall
define next. Recall that, among all distributions with a given variance, the Gaussian distribution
is the one with the largest entropy (see Appendix A.2). Consider a random variable Y , and let
Z be a Gaussian random variable with the same variance. The quantity

J(Y) = H(Z) − H(Y)

is called the negentropy of Y. It is always nonnegative, and is zero only if Y is Gaussian.
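As a concrete reference point, the sketch below applies the well-known one-unit FastICA fixed-point iteration on prewhitened data, with the common nonlinearity g = tanh; it is a simplified sketch (sources, mixture and iteration count are illustrative), not a full FastICA implementation.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 20000

# A whitened mixture of two independent non-Gaussian sources.
S = np.vstack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = A @ S
X = X - X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
V = E @ np.diag(d**-0.5) @ E.T
Z = V @ X                                     # prewhitened data

# One-unit fixed-point iteration, g = tanh, g' = 1 - tanh^2.
w = rng.normal(size=2)
w /= np.linalg.norm(w)
for it in range(50):
    wz = w @ Z
    w_new = (Z * np.tanh(wz)).mean(axis=1) - (1 - np.tanh(wz)**2).mean() * w
    w = w_new / np.linalg.norm(w_new)

print(np.round(w @ V @ A, 3))   # one entry dominates: one source is extracted
```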
2.5 SUMMARY
We have made a short overview of linear source separation and linear ICA, giving a brief
introduction to some of the methods that are used to perform them. Among these we can
distinguish two classes. Methods such as INFOMAX, JADE, and FastICA are based (explicitly
or implicitly) on statistics of order higher than 2, and demand that at most one source be
Gaussian. This means that their performance will be poor if more than one source is close to
Gaussian. However, most of the sources that are of practical interest have distributions that
deviate markedly from Gaussian (for example, speech is strongly supergaussian, while images
CHAPTER 3
Nonlinear Separation
Nonlinear source separation, as discussed in this book, deals with the following problem: There
is an unobserved random vector S whose components are called sources and are mutually
statistically independent. We observe a mixture vector X, which is related to S by
X = M(S ),
M being a vector-valued invertible function, which is, in general, nonlinear. Unless stated
otherwise, the size of X is assumed to be equal to the size of S; i.e., we consider a so-called
square problem. We wish to find a transformation
Y = F(X ),
where F is a vector-valued function and the size of Y is the same as those of X and S, such
that the components of Y are equal to the components of S, up to some indeterminacies to be
clarified later. These indeterminacies will include, at least, permutation and scaling, as in the
linear case. As we can see, we are considering only the simplest nonlinear setting, in which there
is no noise and the mixing is both instantaneous and invariant.
There is a crucial difference between linear and nonlinear separation, which we should
immediately emphasize. While in a linear setting, the independence of the components of
Y suffices, under rather general conditions, to guarantee the recovery of the sources (up to
permutation and scaling), there is no such guaranty in the nonlinear setting. This has been
shown by several authors [30, 55, 73]. We shall show it for the case of two sources, since the
generalization to more sources is then straightforward. Assume that we have a two-dimensional
random vector X and that we form the first component of the output vector Y as some function
of the mixture components,
Y1 = f 1 (X1 , X2 ).
The function f 1 is arbitrary, subject only to the conditions of not being constant and of Y1
having a well-defined statistical distribution. Now define an auxiliary random variable Z as
Z = f 2 (X1 , X2 ),
and form the second output component as

Y₂ = F_{Z|Y₁}(Z | Y₁).
For each value of Y1 , Y2 is the CDF of Z given that value of Y1 . Therefore Y2 will be uniformly
distributed in (0, 1) for all values of Y1 (see Appendix A.1). As a consequence, Y2 will be
independent from Y1 , and Y will have independent components.
Given the large arbitrariness in the choice of both f 1 and f 2 , we must conclude that there
is a very large variety of ways to obtain independent components. This variety is much larger
than the variety of solutions in which the sources are separated. Therefore, in most cases, each
of the components of Y will depend on both S1 and S2 . This means that the sources will not be
separated at the output, even though the output components will be independent.
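The sketch below illustrates this construction in a case where the conditional CDF has a closed form (jointly Gaussian X, obtained here by linearly mixing Gaussian sources; the example is only meant to show the mechanism, since Gaussian sources are anyway a degenerate case for ICA).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(9)
n = 50000

# A linear mixture X of two independent sources (Gaussian, only so that the
# conditional CDF below has a simple closed form).
S = rng.normal(size=(2, n))
A = np.array([[1.0, 0.5],
              [0.8, 1.0]])
X1, X2 = A @ S

Y1, Z = X1, X2                     # Y1 = f1(X1, X2), Z = f2(X1, X2)

# Y2 = F_{Z|Y1}(Z | Y1): for jointly Gaussian variables the conditional
# distribution is Gaussian with known mean and variance.
rho = np.corrcoef(Y1, Z)[0, 1]
s1, s2 = Y1.std(), Z.std()
Y2 = norm.cdf((Z - rho * (s2 / s1) * Y1) / (s2 * np.sqrt(1 - rho**2)))

# Y1 and Y2 are (essentially) independent...
print(np.corrcoef(Y1**2, Y2**2)[0, 1])                  # ~ 0
# ...yet Y2 still depends on both sources: nothing has been separated.
print(np.corrcoef(Y2, S[0])[0, 1], np.corrcoef(Y2, S[1])[0, 1])
```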
This raises an important difficulty: if we perform nonlinear ICA, i.e. if we extract inde-
pendent components from X, we have no guaranty that each component will depend on a single
source. ICA becomes ill-posed, and ICA alone is not a means for performing source separation,
in the nonlinear setting.
Researchers have addressed this difficulty in three main ways:
• One way has been to restrict the range of allowed nonlinearities, so that ICA becomes
well-posed again. When we restrict the range of allowed nonlinearities, an important
consideration is whether there are practical situations in which the nonlinearities are
(at least approximately) within our restricted range. The case that has met a significant
practical applicability is the restriction to post-nonlinear (PNL) mixtures, described
ahead.
• Another way to address the ill-posedness of nonlinear ICA has been the use of regu-
larization. This corresponds to placing some “soft” extra assumptions on the mixture
process and/or on the source signals, and trying to smoothly enforce those assumptions
in the separation process. The kind of assumption that has most often been used is that
the mixture is only mildly nonlinear. This is the approach taken in the MISEP and
ensemble learning methods, to be studied in Sections 3.2.1 and 3.2.2 respectively.
• The third way to eliminate the ill-posedness of nonlinear ICA has been to use extra
structure (temporal or spatial) that may exist in the sources. Most sources of practical
interest have temporal or spatial structure that can be used for this purpose. This is the
approach taken in the kTDSEP method, to be studied in Section 3.2.3. This approach
is also used, together with regularization, in the second application example of the
ensemble learning method (p. 62).
In the next sections we shall study, in some detail, methods that use each of these approaches,
starting with the constraint to PNL mixtures.
3.1 POST-NONLINEAR MIXTURES

In a post-nonlinear (PNL) mixture the sources first undergo a linear mixing stage,

U = AS.

Each component of U then goes through a nonlinearity, yielding the observations

Xi = fi(Ui),
where the functions f i are invertible. As before, we shall assume that the sizes of S, U, and X
are the same.
This is the structure of the so-called PNL mixture process. Its interest resides mainly in
the fact that it corresponds to a well-identified practical situation. PNL mixtures arise whenever,
after a linear mixing process, the signals are acquired by sensors that are nonlinear. This is a
situation that sometimes occurs in practice. The other fact that makes these mixtures interesting
is that they present almost the same indeterminations as linear mixtures, as we shall see below.
FIGURE 3.1: Post-nonlinear mixture process. Block A performs a linear mixture (i.e. a product by a
matrix). The f i blocks implement nonlinear invertible functions
FIGURE 3.2: Separating structure for post-nonlinear mixtures. The g i functions should, desirably, be
the inverses of the corresponding f i in the mixture process. Block W is linear, performing a product by
a matrix
Naturally, the separation of PNL mixtures uses a structure that is a “mirror” of the mixing
structure (Fig. 3.2). Each mixture component is first linearized by going through a nonlinearity,
whose purpose is to invert the nonlinearity present in the mixture process, and then the linearized
mixture goes through a standard linear separation stage. Formally,
Vi = gi(Xi),
Y = WV.
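The sketch below runs data through the PNL mixing process of Fig. 3.1 and then through the separating structure of Fig. 3.2; the sensor nonlinearities are illustrative, and the true inverses gi = fi⁻¹ and the true matrix W = A⁻¹ are used only to show the structure (a blind method has to estimate them from X alone).

```python
import numpy as np

rng = np.random.default_rng(10)
n = 10000

# Linear mixing stage: U = A S.
S = np.vstack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])
A = np.array([[1.0, 0.6],
              [0.5, 1.0]])
U = A @ S

# Componentwise invertible sensor nonlinearities: X_i = f_i(U_i).
X = np.vstack([np.tanh(U[0]), U[1] + 0.1 * U[1]**3])

# Mirror separating structure: V_i = g_i(X_i), then Y = W V.
V0 = np.arctanh(np.clip(X[0], -1 + 1e-15, 1 - 1e-15))
u_grid = np.linspace(-20, 20, 200001)          # numeric inverse of u + 0.1 u^3
V1 = np.interp(X[1], u_grid + 0.1 * u_grid**3, u_grid)
Y = np.linalg.inv(A) @ np.vstack([V0, V1])

print(np.abs(Y - S).max())   # ~ 0: sources recovered through the PNL structure
```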
The indeterminations that exist in PNL separation are only slightly wider than those
of linear ICA. It has been shown [97] that, under a set of conditions indicated ahead, if the
components of Y are independent, they will obey
Y = PDS + t,
in which P is a permutation matrix, D is a diagonal matrix, and t is a vector. This means that
the sources are recovered up to an unknown permutation (represented by P ), unknown scalings
(represented by D), and unknown translations (represented by t).
The unknown translations represent an additional indetermination relative to the linear
case. However, this indetermination is often not too serious a problem because most sources can
be translated back to their original levels after separation, using prior knowledge. For example,
speech and other acoustic signals normally have a mean of zero, and in images we can often
assume that the lowest intensity corresponds to black.
The aforementioned result is valid if
• the mixture matrix A is invertible and has at least two nonzero elements per row and/or
per column;
• the functions f i are invertible and differentiable; and
• the pdf of each source is zero in one point at least.
The last condition precludes the possibility of having Gaussian sources, but is not too restrictive
since most distributions can, in practice, be considered bounded.12 The fact that the sources can
be recovered up to permutation, scaling, and translation implies that each g i must be the inverse
of the corresponding f i , up to unknown translation and scaling, and thus that the nonlinearities
of the mixture can be estimated, up to unknown translations and scalings.
The need for the first condition (that A must have at least two nonzero elements per
row and/or per column) can be understood in the following way: If A had just one nonzero
element per row and per column, it would correspond just to a permutation and a scaling.
The Ui components would be independent from one another, and so would be the Xi , for any
nonlinearities f i . Therefore it would be impossible to estimate and invert the nonlinearities f i
on the basis of the independence criterion alone. One would only be able to recover the sources
up to a permutation and an arbitrary invertible nonlinear transformation per component: the
nonlinearities could not be compensated for. For it to be possible to estimate and invert the
nonlinearities there must be some mixing in A.
PNL mixtures have been subject to extensive study, e.g. [1, 2, 15, 95–97]. We shall review
the basic separation method.
3.1.1 Separation Method

The basic separation method estimates W and the parameters θi of the nonlinearities gi by minimizing the mutual information I(Y) through gradient-based optimization. In the corresponding gradient expressions:

• W^{−T} = (W^{−1})ᵀ.
• V is the vector whose components are defined by (3.1).
• E(·) denotes statistical expectation.
• g′i(Yi, θi) = ∂gi(Yi, θi)/∂Yi.
• wki is the element ki of matrix W.
• ϕ is the column vector [ϕ1 (Y1 ), ϕ2 (Y2 ), . . . , ϕn (Yn )]T , n being the size of Y, i.e. the
number of sources and of extracted components.
In practice, the score function estimates ϕ̂i (Yi , wi ) are used instead of the actual score
functions ϕi (Yi ) in these equations. Also, instead of the statistical expectations E(·), what is
used in practice are the means computed in the training set.
This method is, essentially, an extension of INFOMAX to the PNL case. In particular,
note that the expression of the gradient relative to the separation matrix, (3.1), is essentially
the same as in INFOMAX, (2.7), if we replace the statistical expectation E(·) with the mean
in the training set, ⟨·⟩.13 And, as in INFOMAX, the relative/natural gradient method (see
Section 2.2) has been used for the estimation of this matrix.
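To make the linear stage concrete, the sketch below performs one relative (natural) gradient update of W of the standard INFOMAX type, using tanh as a crude score-function estimate for supergaussian sources. This is only an illustration of the kind of update involved; the exact expressions used in the PNL algorithm, and its score-function estimates ϕ̂_i, may differ in their details.

import numpy as np

def relative_gradient_step(W, V, step=0.01):
    """One relative (natural) gradient update of the separating matrix W.
    V is an (n, T) array holding the linearized mixture samples V_i = g_i(X_i)."""
    Y = W @ V
    phi = np.tanh(Y)                     # crude score-function estimate (supergaussian sources)
    n, T = Y.shape
    G = np.eye(n) - (phi @ Y.T) / T      # I - <phi(Y) Y^T>, mean taken over the training set
    return W + step * G @ W              # relative/natural gradient step

# usage, with V coming from the linearization stage:
# W = np.eye(2)
# for _ in range(200):
#     W = relative_gradient_step(W, V)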
This PNL separation method is also subject to the issues concerning the interleaved
optimization of two different objective functions, which we mentioned in Section 2.2.3, since
the estimation of the score functions is done through a criterion different from the one used for
the estimation of W and of the θi . Once again, however, these issues do not seem to raise great
difficulties in practice.
Several variants of the basic PNL separation algorithm have appeared in the literature.
Some of them have to do with different ways to estimate the source densities (or equivalently,
13. There is a sign difference, because here we use as objective function the mutual information, which is to be minimized, and in INFOMAX we used the output entropy, which was to be maximized. Recall that the two are equivalent; see Eq. (2.2).
the score functions) [94, 97], while other ones have to do with different ways to represent the
inverse nonlinearities g i [2, 96, 97]. An interesting development was the proposal, in [98], of
a nonparametric way to represent the nonlinearities,14 in which each of them is represented
by a table of the values of g i (xi ) for all the elements of the training set (using interpolation to
compute intermediate values, if necessary). This appears to be one of the most flexible and most
powerful ways to represent these functions.
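A minimal sketch of such a table-based representation is shown below. The ordinate values used here are merely illustrative stand-ins for the values that the algorithm of [98] would actually learn; the point is only the data structure (one table entry per training sample, with interpolation in between).

import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=500)    # training values of one mixture component

# One table entry per training sample: abscissas are the sorted training values,
# ordinates are the corresponding values of g_i (illustrative numbers here).
abscissas = np.sort(x_train)
ordinates = np.tanh(3 * abscissas)        # hypothetical "learned" g_i values

def g_table(x):
    """Evaluate the tabulated g_i at arbitrary points by linear interpolation."""
    return np.interp(x, abscissas, ordinates)

print(g_table(np.array([-0.5, 0.0, 0.7])))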
An efficient way to obtain estimates of the g i nonlinearities is to use the fact that mixtures
tend to have distributions that are close to Gaussians. This idea has been used in [92, 114–
116]. The last three of these references used the TDSEP method (cf. Section 2.3), instead of
INFOMAX, to perform the linear part of the separation.
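As a rough sketch of this Gaussianization idea, each mixture component can be mapped to an approximately Gaussian variable through its empirical ranks, and the resulting transform taken as a first estimate of g_i. The cited methods are considerably more elaborate; this is only meant to convey the idea.

import numpy as np
from scipy.stats import norm, rankdata

def gaussianize(x):
    """Map the samples of one mixture component to an approximately
    standard-Gaussian variable via their empirical ranks."""
    u = (rankdata(x) - 0.5) / len(x)      # empirical CDF values in (0, 1)
    return norm.ppf(u)                    # Gaussian quantile transform

# Example: undo a strong componentwise distortion of a near-Gaussian mixture
rng = np.random.default_rng(1)
linear_mix = rng.standard_normal(2000)    # stands in for one linearly mixed component
x = np.tanh(3 * linear_mix)               # post-nonlinear distortion
v = gaussianize(x)                        # approximately proportional to linear_mix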
Other variants of PNL methods, involving somewhat different concepts, have been
proposed in [16,17,67]. In addition, a method based on ensemble learning (discussed ahead, in
Section 3.2.2) has been proposed in [57]. It can separate mixtures in which some of the individual
nonlinearities are not invertible, but in which there are more mixture components than sources.
In [60] it was shown that when the sources have temporal structure, incorporating temporal
decorrelation into the PNL separation algorithm can achieve better separation. A method using
genetic algorithms in the optimization process was proposed in [86]. A quadratic dependence
measure has been proposed in [3], and has been used for the separation of PNL mixtures.
14
“Nonparametric,” in this context, does not mean that there are no parameters. It means that there is not a fixed
number of parameters, but instead that this number is of the order of magnitude of the number of training data,
and grows with it.
15
This example was worked out by us using software made publicly available by its authors (see Appendix B).
FIGURE 3.3: Separation of a post-nonlinear mixture: source, mixture, and separated signals
The mixture nonlinearities were
f_1(u_1) = tanh(5 u_1)
f_2(u_2) = (u_2)^3.
They are plotted in Fig. 3.4(a). The mixture components are shown in Fig. 3.3(b), where we
can see the strong distortions caused by the nonlinearities. The 1000-sample segments shown
in the figure were used as training set.
The separation was performed through the algorithm indicated above, with the following
specifics:
FIGURE 3.4: Separation of a post-nonlinear mixture: (a) the mixture nonlinearities f_i; (b) the compensated nonlinearities g_i[f_i(·)]
Fig. 3.3(c) shows the separation results. The source signals were recovered with a relatively
small amount of error. Fig. 3.4(b) shows the compensated nonlinearities, i.e. the functions
g i [ f i (·)]. We can see that the compensation had a rather good accuracy. The good recovery of
the source signals implies that the separating matrix was well estimated. Its product with the
mixing matrix was
WA = [1.14 −0.16; 0.07 1.28].
FIGURE 3.5: Separation of a post-nonlinear mixture: scatter plots. From left to right: sources, mixture
components, separated components
3.2.1 MISEP
MISEP (for Mutual Information-based SEParation) is an extension of the INFOMAX linear
ICA method to the nonlinear separation framework, and was proposed by L. Almeida [5, 9].
It is based on the structure of Fig. 3.6. This structure is very similar to the one of INFOMAX
(Fig. 2.1), the only difference being that the separation block is now nonlinear. Like INFOMAX,
the method uses the minimization of the mutual information of the extracted components, I (Y ),
as an optimization criterion. Also as in INFOMAX, this minimization is transformed into the
maximization of the output entropy H (Z ) through the reasoning of Eqs. (2.2), assuming again
that each ψi function equals the CDF of the corresponding Yi variable.
For implementing the nonlinear separation block F, the method can use essentially
any nonlinear parameterized system. In practice, the kind of system that has been used most
FIGURE 3.6: Structure used by MISEP. F is a generic nonlinear block, and is what performs the
separation proper. The separated outputs are Yi . The ψi blocks, implementing nonlinear increasing
functions, are auxiliary, being used only during optimization
frequently is an MLP.16 For this reason, we shall present the method using as example a
structure based on an MLP. To make the presentation simple, we shall use a very simple
MLP structure with a single hidden layer of nonlinear units and a layer of linear output units,
and with connections only between successive layers. We note, however, that more complex
structures can be used, and have actually been used in the examples that we shall present
ahead.
MISEP uses the maximum entropy method for estimation of the ψi functions, as ex-
plained in Section 2.2.2. And in fact, as we mentioned in that section, once that estimation
method is understood, the extension to nonlinear MISEP is quite straightforward. In what fol-
lows we shall assume that the contents of that section have been well understood. If the reader
has only glossed over that section, we recommend reading it carefully now, before proceeding
with nonlinear MISEP.
Recall that L is one of the additive terms forming the objective function and is relative to one
of the training patterns (for simplicity we have again dropped the subscript referring to the
training patterns). Also recall that J = ∂z/∂ x is the Jacobian of the transformation made by
the system.
We need to compute the gradient of L relative to the network’s weights to be able to
use gradient-based optimization methods. As in Section 2.2.2, we shall compute that gradient
through backpropagation. We shall use a network that computes J and shall backpropagate
through it to find the desired gradient.
A Network That Computes the Jacobian The network that computes J is constructed using
the same reasoning that was used in Section 2.2.2. In fact, the MLP that implements the system
of Fig. 3.6 is very similar to the one that implements the system of Fig. 2.1. The only difference
is that the linear block W of Fig. 2.1 is now replaced with an MLP, which has a hidden layer
of nonlinear units followed by a layer of linear output units.
16. In [7] a system based on radial basis functions was used. In [11] a system using an MLP together with product units was used.
FIGURE 3.7: Network for computing the Jacobian. The lower part is what computes the Jacobian
proper and is essentially a linearized version of the network of Fig. 3.6. The upper part is identical
with the network of Fig. 3.6, but is drawn in a different way. The two upper-right blocks, shown
in gray, are not necessary for computing J and are shown for reference only. See the text for further
explanation
We have already seen, in Section 2.2.2, how to deal with a layer of nonlinear hidden units
in constructing the network that computes the Jacobian. The structure of the network, for the
current case, is shown in Fig. 3.7. In this figure, the lower part is what computes the Jacobian
proper. The upper part is identical to the network of Fig. 3.6, but is drawn in a different way.
We shall start by describing this part.
• The input vector x̄ is the input pattern x, augmented with a component equal to 1,
which is useful for implementing the bias terms of the hidden units of block F.
• Blocks D̄, Φ1 , and E form block F:
– Block D̄ implements the product by the weight matrix (which we also call D̄) that
connects the input to the hidden layer. This matrix includes a row of elements
corresponding to the bias terms (this is indicated by the overbar in the matrix’s
symbol). The output of this block is formed by the vector of input activations of the
hidden units of F.
– Φ1 is a nonlinear block that applies to this vector of input activations, on a per-
component basis, the nonlinear activation functions of the hidden units of F. This
block’s output is the vector of output activations of that hidden layer.
– Block E implements the product by the weight matrix (also designated by E ) con-
necting the hidden layer to the output layer of F. The block’s output is the vector of
extracted components, y. In the figure it is shown as ȳ because it is considered to be
augmented with a component equal to 1, which is useful for implementing the bias
terms of the next layer of nonlinear units.
• As before, in Section 2.2.2, the three rightmost blocks represent all the ψi blocks of
Fig. 3.6 taken together:
– Block B̄ performs a product by matrix B̄, which is the weight matrix of the hidden
units of all the ψi blocks, taken as forming a single hidden layer. The overbar in
B̄ indicates that this matrix incorporates elements relative to the bias terms of the
hidden units. The output of this block consists of the vector of input activations of
the units of the hidden layer of all the ψi blocks taken together.
– Block Φ2 is nonlinear and applies to this vector of input activations, on a per-
component basis, the nonlinear activation functions of the hidden units. The output
of this block is the vector of output activations of the global hidden layer of the ψi
blocks.
– Block C performs a product by matrix C, which is the weight matrix of the linear
output units of the ψi blocks. The output of the C block is the output vector z.
As before, and since there are no interconnections between the various ψi blocks, matrices B̄
and C have a special structure, with a large number of elements equal to zero. This does not
affect our further analysis, except for the fact that during the optimization these elements are
not updated, being always kept equal to zero.
We shall now describe the lower part of the network, which is the part that computes the
Jacobian proper. The Jacobian is given by
J = C Φ2′ B E Φ1′ D, (3.2)
in which
• Matrices D, E, B, and C are as defined above, the absence of an overbar indicating that
the matrices have been stripped of the elements relating to bias terms (which disappear
when differentiating relative to x).
• Block Φ1′ performs a product by a diagonal matrix whose diagonal elements are the
derivatives of the activation functions of block Φ1, for the input activations that these
activation functions receive for the current input pattern.
• Block Φ2′ is similar to Φ1′ but performs a product by the derivatives of the activation
functions of block Φ2.
To compute the values of the activation functions' derivatives, blocks Φ1′ and Φ2′ need to receive
the input activations of the corresponding units. This information is supplied by the upper part
of the network through the gray-shaded arrows. The reason for the presence of the upper part
of the network is precisely to supply this information.
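As a sanity check of Eq. (3.2), the following sketch builds a small random instance of the structure of Fig. 3.7 (with tanh hidden units in F and logistic hidden units in the ψ blocks, which is our own choice of activation functions) and compares the analytical Jacobian with a finite-difference estimate.

import numpy as np

rng = np.random.default_rng(2)
n, h1, h2 = 2, 5, 6                       # inputs/outputs, hidden sizes of F and of the psi blocks
D, d0 = rng.standard_normal((h1, n)), rng.standard_normal(h1)   # F's hidden layer (D-bar)
E = rng.standard_normal((n, h1))                                 # F's linear output layer
B, b0 = rng.standard_normal((h2, n)), rng.standard_normal(h2)   # psi hidden layer (B-bar)
C = rng.standard_normal((n, h2))                                 # psi linear output layer

sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

def forward(x):
    a1 = D @ x + d0          # input activations of F's hidden units
    y = E @ np.tanh(a1)      # extracted components
    a2 = B @ y + b0          # input activations of the psi hidden units
    z = C @ sigm(a2)         # network outputs
    return z, a1, a2

def jacobian(x):
    _, a1, a2 = forward(x)
    Phi1p = np.diag(1.0 - np.tanh(a1) ** 2)        # derivatives of F's activations
    Phi2p = np.diag(sigm(a2) * (1.0 - sigm(a2)))   # derivatives of the psi activations
    return C @ Phi2p @ B @ E @ Phi1p @ D           # Eq. (3.2)

x = rng.standard_normal(n)
J = jacobian(x)
eps = 1e-6
J_num = np.column_stack([(forward(x + eps * e)[0] - forward(x - eps * e)[0]) / (2 * eps)
                         for e in np.eye(n)])
print(np.allclose(J, J_num, atol=1e-5))            # should print True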
The procedure that we have described allows us to compute the gradient of the objective
function and therefore permits the use of any gradient-based optimization method. Among
these, the simplest is plain gradient descent, possibly augmented with the use of momentum for
better performance [4, 39]. While this method, with suitably chosen step size and momentum
coefficients, will normally be fast enough to perform linear separation, as described in Section
2.2.2, nonlinear separation usually leads to much more complex optimization problems, which
can only be efficiently handled by means of accelerated optimization methods.
To obtain good results in a reasonable amount of time it is therefore essential to use a fast
optimization procedure. Among these, we mention the use of gradient descent with adaptive
step sizes (see [4], Sections C.1.2.4.2 and C.1.2.4.3). This method is simple to implement
and has been used with MISEP with great success. Another important set of fast optimization
methods that are good candidates for use with MISEP (but which have not been used yet, to
our knowledge) is the class of conjugate gradient methods [39].
This reasoning is correct, but does not take into account that to deal with the indetermi-
nation of nonlinear ICA, we need to assume that the mixture is mildly nonlinear, and we need
to ensure that the separator also performs a mildly nonlinear transformation. In the scenario
presented in the last paragraph, the extracted components Yi would have approximately uniform
distributions. If the sources that were being handled did not have close-to-uniform distribu-
tions (and many real-life sources are not close to uniform), the F block would have to perform
a strongly nonlinear transformation to make them uniform, and it would not be possible to
keep F mildly nonlinear. With the structure of Fig. 3.6, we can keep F mildly nonlinear, while
allowing the ψi to be strongly nonlinear, so that the Zi components become close to uniform.
It is even possible, if desired, to apply explicit regularization to F to force it to stay close to
linear. In the application examples presented ahead, regularization was applied to F to ensure
a correct separation of the sources.
Artificial Mixtures We show two examples of mixtures of two sources. In the first case
both sources were supergaussian, and in the second case one was supergaussian and the other
subgaussian (and bimodal). Fig. 3.8(a) shows the joint distributions of the sources for the two
cases.
The nonlinear part of the mixtures was of the form
x̂_1 = s_1 + a (s_2)^2
x̂_2 = s_2 + a (s_1)^2,
with a suitably chosen value of a. The mixture components were then obtained by rotating the
vector x̂ by 45°:
x = A x̂,
with
A = (1/√2) [1 1; −1 1].
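The following sketch shows how such a mixture can be generated; the source distributions and the value of a are illustrative choices of ours, not the ones used in the experiments reported here.

import numpy as np

rng = np.random.default_rng(3)
T = 5000
s = np.vstack([rng.laplace(size=T),                  # a supergaussian source
               np.sign(rng.standard_normal(T))       # a bimodal (subgaussian) source
               + 0.2 * rng.standard_normal(T)])
s /= np.abs(s).max(axis=1, keepdims=True)            # keep the amplitudes moderate

a = 0.3                                              # strength of the nonlinearity (illustrative)
x_hat = np.vstack([s[0] + a * s[1] ** 2,
                   s[1] + a * s[0] ** 2])
A = np.array([[1.0, 1.0],
              [-1.0, 1.0]]) / np.sqrt(2)             # 45-degree rotation
x = A @ x_hat                                        # mixture components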
Regarding processing time for nonlinear separation, the first example (the one with
two supergaussian sources) took 1000 training epochs to converge, corresponding approximately
to 3 min on a 1.6 GHz Pentium-M (Centrino) processor programmed in Matlab. The second
example (with a supergaussian and a subgaussian source) took 500 epochs, corresponding to
approximately 1.5 min on the same processor. This shows that MISEP is relatively fast, at least
for situations similar to those presented in these examples.
Results similar to these were presented in [8] for artificial mixtures of four sources (two
supergaussian and two bimodal) and in [9] for mixtures of 10 supergaussian sources. In the latter
case it was found that the number of epochs for convergence did not depend significantly on the
number of sources and that the computation time per epoch increased approximately linearly
with the number of sources for this range of problem dimensionalities. Therefore MISEP
remained relatively fast in all these situations.
Real-Life Mixtures We now present results of the application of MISEP to a real-life image
separation problem. When we acquire an image of a paper document (using a scanner, for exam-
ple), the image printed on the back page sometimes shows through due to partial transparency
FIGURE 3.8: Separation of artificial mixtures: (a) sources; (b) mixtures. Left column: two supergaussian sources. Right column: a supergaussian and a bimodal source
of the paper. The mixture that is thus obtained is nonlinear. Since it is possible to acquire both
sides of the paper, we can obtain a two-component nonlinear mixture, which is a good candidate
for nonlinear separation. Here we present a difficult instance of this problem, in which the paper
is of the “onion skin” type, yielding a strong, rather nonlinear mixture.
Fig. 3.9(a) shows the two sources. These images were printed on opposite sides of a sheet of
onion skin paper. Both sides of the paper were then acquired with a desktop scanner. Fig. 3.9(b)
shows the images that were acquired. These images constitute the mixture components.
Fig. 3.10(a) shows the joint distribution of the sources, while Fig. 3.10(b) shows the mixture
distribution. It is easy to see that the mixture was nonlinear because the source distribution was
contained within a square, and a linear mixture would have transformed this square into a paral-
lelogram. It is clear that the outline of the mixture distribution is far from a parallelogram. The
distortion of the original square outline gives an idea of the degree of nonlinearity of the mixture.
The separation of this mixture was performed both with a linear ICA system and with a non-
linear, MISEP-based system. Fig. 3.11 shows the results. We can see that nonlinear separation
yielded significantly better results than linear separation. The distributions of the separated com-
ponents are shown in Figs. 3.10(c) and 3.10(d) for the linear and nonlinear cases respectively.
These distributions also confirm the better separation achieved by the nonlinear system.
The quality of separation was also assessed by means of several objective quality measures,
again showing the advantage of nonlinear separation. One of these measures, the mean signal-
to-noise ratio of the separated sources, for a set of 10 separation tests, is shown in Table 3.1.
We see that nonlinear separation yielded an improvement of 4.1 dB in one of the sources, and
of 3.4 dB in the other one, relative to linear separation.
In these separation tests, the F block had the same structure as the one used in the
previous artificial mixture examples. The ψ blocks had 20 hidden units each, to be able to
approximate well the somewhat complex cumulative distributions of the sources. The training
set was formed by the intensities from 5000 pixel pairs, randomly chosen from the mixture
images. Separation was achieved in 400 training epochs, which took approximately 9 min on a
1.6 GHz Pentium-M (Centrino) processor programmed in Matlab.
TABLE 3.1: Mean Signal-to-Noise Ratios of the Separated Sources
                          SOURCE 1    SOURCE 2
Linear separation          5.2 dB     10.5 dB
Nonlinear separation       9.3 dB     13.9 dB
• The F network was constrained to be symmetrical (i.e. such that exchanging the inputs
would result just in an exchange of the outputs). This constraint was effective because
the mixture was known to be symmetrical to a good approximation.
The presentation of this example, here, was necessarily brief. Further details, as well as
additional examples, can be found in [10].
x = M(s, θ) + n, (3.4)
where the mixture vector x results from a nonlinear mixture M of the sources (the components
of s ), with modeling error n (in the Bayesian framework n is viewed as noise, instead of modeling
error). θ is a vector that gathers all the parameters of the mixture model.
We shall denote by x̄ the ordered set of all observation vectors x i that we have available
(with an arbitrarily chosen order, e.g. the order in which they were obtained), and by s̄ and n̄
the correspondingly ordered sets of source vectors s i and modeling error vectors ni respectively.
We’ll extend the concept of model to these sets of vectors, by writing
x̄ = M(s̄, θ) + n̄. (3.5)
where the base of the logarithms defines the unit of measurement of the coding length (e.g.
base 2 for bits). This expression is valid if the resolution is fine enough for the density p(z) to
be approximately constant within intervals of length ε Z .
If the random variable is multidimensional, we have
L(z) = − log p(z) − Σ_i log ε_{Z_i}, (3.7)
where ε Zi is the resolution of the encoding of the i-th component of Z. Whenever it is necessary
to define the coding length for a variable z, we will do so by specifying a probability density p(z),
which defines the coding length through (3.7).17 This will allow us to use only valid coding
lengths without having to specify the coding scheme [34, 35], and will also allow us to use the
tools of probability theory to more easily manipulate coding lengths. In what follows we shall
represent − Σ_i log ε_{Z_i} by r_Z, for brevity.
17. In principle we would also need to define the resolutions ε_{Z_i}. However, as we’ve said before, those resolutions will become relatively immaterial to our discussion.
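For concreteness, here is a small numeric instance of (3.7), assuming a factorized standard-Gaussian coding density and a resolution of 10^−3 per component (both of which are arbitrary choices made only for illustration).

import numpy as np
from scipy.stats import norm

z = np.array([0.3, -1.2])                  # the value to be encoded
eps = np.array([1e-3, 1e-3])               # resolutions of the two components
p = norm.pdf(z)                            # factorized coding density evaluated at z
L_bits = -np.sum(np.log2(p)) - np.sum(np.log2(eps))   # Eq. (3.7), base 2
print(L_bits)                              # about 24 bits for this density and resolution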
An aspect that we have glossed over is that (3.7) may yield a non-integer value, while
any actual encoding will involve an integer number of symbols. One way to view Eq. (3.7) is as
a continuously valued approximation of that integer length. The continuous approximation has
the advantage of being amenable to a more powerful analytical treatment than a discrete one.
There are, however, other ways to justify the use of (3.7) (see [29], Section 3.2).
If the data that we are trying to represent come from sampling a random variable, there is
an important connection between the true pdf of that variable and the pdf that we use to define
the variable’s coding length. Let us assume that the random variable Z has pdf p(z), and that
we encode it with a length defined by the density q(z). The expected value of the coding length
of Z will be
E[L_q(Z)] = − ∫ p(z) log q(z) dz + r_Z.
If we use a length defined by a pdf q(z) ≠ p(z), the average difference in coding length,
relative to L_p, is given by
E[L_q(Z)] − E[L_p(Z)] = − ∫ p(z) log q(z) dz + ∫ p(z) log p(z) dz
                      = ∫ p(z) log [p(z)/q(z)] dz
                      = KLD(p, q).
Therefore, the Kullback-Leibler divergence between p and q measures the average excess
length resulting from using a coding length defined by q on a random variable whose pdf is p.
As we know, this KLD is always positive except if q = p, in which case it is zero. Consequently,
the optimal encoding, in terms of average coding length, is defined by the variable’s own pdf.
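This excess-length interpretation is easy to check numerically. In the sketch below, p and q are taken to be two one-dimensional Gaussians (an arbitrary choice, made only for illustration), and a Monte Carlo estimate of the excess length is compared with the closed-form KLD between the two densities.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
p = norm(loc=0.0, scale=1.0)               # true density of Z
q = norm(loc=0.5, scale=1.5)               # density used to define the coding length

z = p.rvs(size=200_000, random_state=rng)
excess = np.mean(-np.log(q.pdf(z)) + np.log(p.pdf(z)))   # E[L_q(Z)] - E[L_p(Z)], in nats

# Closed-form KLD between the two Gaussians, for comparison
kld = np.log(1.5 / 1.0) + (1.0 + 0.5 ** 2) / (2 * 1.5 ** 2) - 0.5
print(excess, kld)                         # the two values should be close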
If z is a vector whose components are i.i.d. samples from a distribution with pdf p(z),
the empirical distribution formed by the components of z will be close to p(z), and the coding
length of z will be close to minimal if we choose the components’ coding lengths according to
p(z). Therefore, the choice of the pdf that defines the coding length has a strong relationship
with the statistics of the data that are to be encoded.
The representation of coding lengths in terms of probability distributions provides a
strong link to the Bayesian approach to ensemble learning. In that approach, the probability
distributions of S̄, Θ and N̄ correspond to priors over those random variables, and represent
our beliefs about those variables’ values. In the MDL approach these priors are just convenient
Since the coding length of the pair (z, w) is the sum of the coding lengths of the two
parameters, this corresponds to coding the two separately from each other. If z and w were
obtained from random variables Z and W that are not independent from each other, some
coding efficiency would be lost, because some common information would be coded in both variables
simultaneously. Consequently, the MDL criterion will tend to favor representations in which
the two parameters are statistically independent. Therefore, although the equality p(z, w) =
p(z) p(w), above, was not a statement about the independence of the two random variables, but
rather a statement about the form in which the variables are coded, it does favor solutions in
which the variables are independent.
L(ξ) is defined by the prior p(ξ), and L(n̄) by the prior p(n̄). Equation (3.8) can be
written
Once ξ is coded, what remains, to code x̄, is coding n̄. Therefore, we can write L(x̄|ξ) =
L(n̄). Furthermore, from (3.5), we see that the resolutions at which the components of x̄ are
represented are the same as those at which the components of n̄ are represented, and consequently
r_x̄ = r_n̄. Therefore, it is natural to define
p(x̄|ξ) = p(n̄), with n̄ = x̄ − M(s̄, θ),  (3.10)
so that
L(x̄|ξ) = L(n̄)
        = − log p(n̄) + r_n̄
        = − log p(x̄|ξ) + r_x̄.
We wish to choose the representation of x̄ with minimum length L(x̄). The most obvious
solution would correspond to finding the minimum of (3.11), subject to the condition that
the modeling equation (3.5) is satisfied for the given x̄. There is, however, a form of coding
that generally yields a shorter coding length. It is based on the so-called bits-back coding
method [48,110], which we shall briefly examine. It uses the fact that (3.5) allows for using any
value of ξ for coding any given set of observations x̄ (of course, bad choices of ξ will result in
large modeling errors).
For a simple example of bits-back coding, assume that we are using a redundant code,
in which a certain x̄ is represented by two different bit strings, both with the same length of
l bits. Of course, choosing the shortest code will yield a coding length of l bits. However, a
cleverer scheme is to choose one of the two available code strings according to some other
binary information that we need to transmit. In this way we can transmit x̄, plus one extra bit of
information, in the l bits. This means that, for x̄, we will effectively be using l − 1 bits, because
we will get back the extra bit at the receiver (hence the name of bits-back coding). In practice,
of course, we don’t need to send any extra information, or even to encode the data. We only
need to know the coding length, and this reasoning shows that the effective coding length of x̄
would be l − 1 bits.
In our case we can use any value of ξ for coding any given x̄, possibly at the cost of getting
large modeling errors. We’ll assume that the actual ξ to be used will be chosen at random, with
a pdf q (ξ), and that its components will be represented with resolutions corresponding to r ξ .
From (3.11), the average coding length will be
∫ q(ξ) [− log p(ξ) − log p(x̄|ξ)] dξ + r_ξ + r_n̄,
while the information that we get back at the receiver, corresponding to the random choice of ξ, is − ∫ q(ξ) log q(ξ) dξ + r_ξ.
Therefore the effective bits-back coding length of x̄, which we represent by L̂_q(x̄), is given by
L̂_q(x̄) = ∫ q(ξ) [− log p(ξ) − log p(x̄|ξ)] dξ + ∫ q(ξ) log q(ξ) dξ + r_n̄.  (3.12)
Note that, in this equation, the term r ξ , relative to the resolution of the representation
of ξ, has been canceled out. This means that bits-back coding allows us to encode the model
parameters with as fine a resolution as we wish, without incurring any penalty in terms of coding
length.
The latter equation can be transformed as
L̂_q(x̄) = ∫ q(ξ) log [ q(ξ) / (p(ξ) p(x̄|ξ)) ] dξ + r_n̄        (3.13)
        = ∫ q(ξ) log [ q(ξ) / (p(ξ|x̄) p(x̄)) ] dξ + r_n̄        (3.14)
        = KLD[q(ξ), p(ξ|x̄)] − log p(x̄) + r_n̄,                 (3.15)
where p(x̄) and p(ξ|x̄) are defined, consistently with probability theory, as
p(x̄) = ∫ p(ξ) p(x̄|ξ) dξ,
p(ξ|x̄) = p(ξ) p(x̄|ξ) / p(x̄).
3.2.2.3 Practical aspects
We are now in a position to give an overview of the ensemble learning method of nonlinear
source separation. The overview will cover the most important aspects, but several details will
have to be skimmed over. The reader is referred to [102, 104] for more complete presentations.
The method involves a number of choices and approximations, most of which are intended
to make it computationally tractable:
• The priors are taken to factor into products over individual scalar variables,
p(ξ) = Π_k p(θ_k) Π_{i,j} p(s_ij),    p(n̄) = Π_{l,m} p(n_lm),  (3.16)
where s_ij is the j-th sample of the i-th source, and a similar convention applies to n_lm.18
• The priors of the parameters and of the noise samples, p(θk ) and p(nlm ) respectively,
are taken to be Gaussian. The means and variances of these Gaussians are not fixed
a priori and, instead, are also encoded. For this reason these means and variances are
called hyperparameters. They are further discussed ahead. From (3.10), p(x̄|ξ) factors
as a product of Gaussians, because p(n̄) does.
• The priors p(s i j ) can be simply chosen as Gaussians. In this case, only linear com-
binations of the sources are normally found. This method is called nonlinear factor
analysis (NFA), and is often followed by an ordinary linear ICA operation, to extract
the independent sources from these linear combinations.
The source priors can also be chosen as mixtures of Gaussians, in order to have a
good flexibility in modeling the source distributions, in an attempt to directly perform
nonlinear ICA. However, this increased flexibility often does not lead to a full separation
of the sources, as explained ahead. For that reason, it is more common to use the
approach of performing NFA (using simple Gaussian priors) followed by linear ICA,
rather than trying to directly perform nonlinear ICA, using mixtures of Gaussians for
the priors.
The parameters of the Gaussians (means and variances) or of the mixtures
(weights, means and variances of the mixture components) are hyperparameters.
18. In the second example presented ahead, with a dynamical model of the sources, the factorization of p(s̄) doesn’t apply, being replaced by the dynamical model.
• The approximating density q(ξ) is taken to factor in a similar way,
q(ξ) = Π_k q(θ_k) Π_{i,j} q(s_ij),  (3.17)
where the q(·) in both factors of the right hand side are Gaussians.19 Therefore one
only needs to estimate their means and variances.
Since q (ξ) ≈ p(ξ|x̄), the factorization in (3.17) will tend to make the sources mutually
independent, given x̄. This bias for independence is necessary to make the model
tractable.20
• Those means and variances are estimated by minimizing the coding length L̂q (x̄),
given by (3.13). The resolutions that are chosen for representing the modeling error’s
components affect this length only through the additive term r n̄ , which doesn’t affect the
position of the minimum. That term can, therefore, be dropped from the minimization.
19. The densities q(s_ij) are taken as mixtures of Gaussians if one is doing nonlinear ICA, and not just NFA.
20. This independence bias has some influence on the final result of separation, sometimes yielding a linear combination of the sources instead of completely separated ones, even if one uses mixtures of Gaussians for the source priors [57]. This is why some authors prefer to use ensemble learning to perform only nonlinear factor analysis (NFA), extracting only linear combinations of the sources, and then to use a standard linear ICA method to separate the sources from those linear combinations. This is what is done in the first application example, presented ahead.
Consequently, the objective function that is actually used is
C = ∫ q(ξ) log [ q(ξ) / (p(ξ) p(x̄|ξ)) ] dξ.  (3.18)
As a side note, and since the KLD in (3.15) is non-negative, we conclude from (3.13)
and (3.15) that the objective function C gives an upper bound for − log p(x̄). In the
Bayesian framework, p(x̄) is the probability density that the observed data would be
generated by the model (3.5), for the specific form of M and for the specific priors that
were chosen. This probability is often called the model evidence, and the value of the
objective function can be used to compute a lower bound for it.
• The approximator q(ξ) and the prior p(ξ) factor into products of large numbers of
simple terms – see (3.17) and (3.16). The density p(x̄|ξ) is approximated by a product
of Gaussians. The factorization and the presence of the logarithms in (3.18) lead the objective
function to become a sum of a large number of relatively simple terms which can all be
computed, either exactly or approximately.
These choices and approximations make it possible to compute closed form expressions of
the partial derivatives of the objective function relative to the parameters (means and standard
deviations) of the approximator q (ξ). The partial derivatives are set equal to zero, and this
leads to equations that are simple enough to be solved directly (for some of the equations, only
approximate solutions can be found). This allows the estimation of the parameters, for one subset
of the q (ξi ) at a time, taking the parameters for the other ξi as constant. The procedure iterates
through these estimations until convergence. This procedure, although computationally heavy,
is nevertheless faster than gradient descent on the KLD. Conjugate gradient optimization [39]
also gives good results.21
Once the minimum is found, the density q(ξ) ≈ p(ξ|x̄)
yields an estimate of the posterior distribution of the sources. Due to the factorization of
q (ξ), one can directly obtain the approximate posterior of the sources, q (s̄ ), from the complete
posterior q (ξ), by keeping only the terms relative to s̄ .
21. H. Valpola, private communication, 2005.
Since p(x̄), r ξ and r n̄ don’t depend on ξ, they can be dropped from the optimization. Further-
more,
Therefore, the minimum length encoding will correspond (within these approximations)
to the s̄ and θ that maximize the corresponding terms in (3.22).22 If q (s̄ ) is represented as a
product of Gaussians on the various s i j , as is normally done, the MDL estimate of s̄ will simply
correspond to the estimated means of those Gaussians.
Within a Bayesian framework, p(s̄ |x̄) is interpreted as the true posterior of the sources,
in the statistical sense, and q (s̄ ) is an approximation to it. Therefore it can be used, for example,
for computing means or MAP estimates, or for taking decisions that depend on s̄ , as is normally
done with statistical distributions.
22. Incidentally, these are also the maximum a posteriori (MAP) estimates of s̄ and θ, within the approximation (3.21).
in [48]. In that example, a relatively small amount of training data led FastICA to find spurious
spiky sources, due to overfitting, while ensemble learning found relatively good approximations
of the actual sources.
Non-dynamical model In this case the sources were eight random signals, four of which had
supergaussian distributions, the other four being subgaussian. These signals were nonlinearly
mixed by a multilayer perceptron with one hidden layer, with sinh−1 activation functions. This
FIGURE 3.12: Results of the first example of separation using ensemble learning. In each scatter plot,
the horizontal axis corresponds to a true source, and the vertical axis to an extracted source. The sources
in the top row were supergaussian, and those in the lower row were subgaussian.
mixing MLP had 30 hidden units and 20 output units, and its weights were randomly chosen.
Note that, since 20 mixtures were observed and there were only 8 sources, this was not a square
mixing situation, unlike other ones that we have seen in this book. Gaussian noise, with an
SNR of 20 dB, was added to the mixture components.
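The following sketch shows how test data of this kind can be generated; the weight scales and the specific source distributions are illustrative choices of ours, not necessarily those of the original experiment.

import numpy as np

rng = np.random.default_rng(5)
n_src, n_hid, n_obs, T = 8, 30, 20, 1000

# Four supergaussian and four subgaussian sources
s = np.vstack([rng.laplace(size=(4, T)),
               rng.uniform(-np.sqrt(3), np.sqrt(3), size=(4, T))])

# Random mixing MLP with one hidden layer of arcsinh (sinh^-1) units
W1 = 0.5 * rng.standard_normal((n_hid, n_src))
W2 = 0.5 * rng.standard_normal((n_obs, n_hid))
clean = W2 @ np.arcsinh(W1 @ s)

# Additive Gaussian noise at 20 dB SNR
noise = rng.standard_normal(clean.shape)
noise *= np.sqrt(clean.var() / (100 * noise.var()))   # 20 dB corresponds to a power ratio of 100
x = clean + noise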
The analysis was made using nonlinear factor analysis (i.e. Gaussian priors were used
for the sources), using eight sources and a mixture model consisting of an MLP with a single
hidden layer of 50 units, with tanh activation functions. Since the source priors were Gaussian,
the sources themselves were not extracted in this step, but only linear combinations of them. In
a succeeding step, the FastICA method of linear ICA (see Section 2.4.2) was used to perform
separation from these linear combinations. Figure 3.12 shows the scatter plots of the extracted
components (after the FastICA step) versus the corresponding sources, and confirms that the
sources were extracted to a very good accuracy. This is also shown by the average SNR of the
extracted sources relative to the true ones, which was 19.6 dB. The extraction was slow, though,
having taken 100,000 training epochs.
Separation with a dynamical model Our second example corresponds to the dynamical ex-
tension of ensemble learning [106]. In this example, the mixture observations are still modeled
through (3.5), but the sources are not directly coded. Instead, they are assumed to be observations
that are sequential in time, and are coded through a dynamical model
s(t) = G[s(t − 1), φ] + m(t),
FIGURE 3.13: Separation by ensemble learning with a dynamical model: The eight sources.
where G is a nonlinear function parameterized through φ.23 Therefore, s (t) is coded through φ,
m(t) and s (1). The m(t) process is often called the innovation process, since it is what drives the
dynamical model. The variables to be encoded are the parameters of the nonlinear mappings
(θ and φ), the innovation process m(t), the initial value s (1) and the modeling error n(t).
All of these variables were separated into their scalar components, and each component had a
Gaussian prior. These Gaussians involved some hyperparameters that were given very broad
distributions, as usual.
For generating the test data for this example, the source dynamical processes consisted
of two independent chaotic Lorenz processes [70] (with two different sets of parameters) and
a harmonic oscillator. A Lorenz process has three state variables and an oscillator has two,
meaning that the system had a total of eight state variables, forming three independent dynamical
processes. Figure 3.13 shows a plot of the eight source state variables.
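The following sketch shows how the eight source state variables can be generated; the integration step, the Lorenz parameters and the oscillator frequency are illustrative choices of ours.

import numpy as np

def lorenz(n, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0, x0=(1.0, 1.0, 1.0)):
    """Euler integration of a Lorenz process; returns an (n, 3) array of states."""
    x = np.array(x0, dtype=float)
    out = np.empty((n, 3))
    for t in range(n):
        dx = np.array([sigma * (x[1] - x[0]),
                       x[0] * (rho - x[2]) - x[1],
                       x[0] * x[1] - beta * x[2]])
        x = x + dt * dx
        out[t] = x
    return out

n = 1000
t = np.arange(n)
osc = np.column_stack([np.sin(0.05 * t), np.cos(0.05 * t)])     # harmonic oscillator (2 states)
sources = np.hstack([lorenz(n),                                  # first Lorenz process (3 states)
                     lorenz(n, rho=35.0, x0=(-2.0, 1.0, 5.0)),   # second, with different parameters
                     osc])                                       # total: 8 state variables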
For producing the mixture observations, these eight state variables were first linearly
projected into a five-dimensional space. Due to this projection into a lower-dimensional space,
the behavior of the system could not be reconstructed without learning its dynamics, because five
variables do not suffice to represent the eight-dimensional state space. These five projections
were then nonlinearly mixed, by means of an MLP with a single hidden layer with sinh−1
23. We have slightly changed the notation, in this equation, using s(t) instead of s̄, and m(t) instead of m̄, to emphasize the temporal aspect of the dynamical model.
FIGURE 3.14: Separation by ensemble learning with a dynamical model: The ten nonlinear mixture
components.
to a “continuation” of the process, using only the dynamical model s(t) = G[s(t − 1), φ], with
no innovation. The fact that the correct dynamical behavior of the sources has been obtained
in this continuation shows that a very good dynamical model was learned. In fact, in chaotic
processes, even small deviations from the correct model often lead to a rather different attractor,
and therefore to a rather different continuation.
For comparing with more classical methods, several attempts were made to learn a dy-
namical model operating directly on the observations, and using MLPs of several different
configurations for implementing its nonlinear dynamics. None of these attempts was able to
yield the correct behavior, in a continuation of the dynamical process.
This example is necessarily rather complex, and here we could only give an overview. One
aspect that we have glossed over, is that only a nonlinearly transformed state space was learned
by the method. This space was then mapped, by means of an MLP, to the original state space,
in order for the recovery of the correct dynamics to be checked. What is shown in Fig. 3.15
is the result of that mapping of the learned dynamics into the original state space. However,
even in the state space learned by the method, the three dynamical processes were separated.
And the fact that a transformed state space was learned has little relevance, since there is not a
particular space that can claim to be the “main” state space of a nonlinear dynamical process.
For further details see [106].
FIGURE 3.15: Separation by ensemble learning with a dynamical model: Extracted sources. The first
1000 samples correspond to sources estimated using mixture observations. The last 1000 samples corre-
spond to a continuation, by iteration of the dynamical model s(t) = G[s(t − 1), φ].
products of vectors of that space. Also assume that, for any two vectors of X , x 1 and x 2 , the
inner product of their images in X̂ can be expressed as a relatively simple function of the original
vectors x 1 and x 2 :
x̂_1 · x̂_2 = k(x_1, x_2). (3.24)
The function k(·, ·) is then called the kernel of the mapping from X into X̂ . If (3.24) holds, the
whole algorithm that we wish to implement can be performed in the low-dimensional space X
by replacing the inner products in X̂ with kernel evaluations in X . This will avoid performing
operations in the high-dimensional space X̂ .
This may seem too far-fetched at first sight. After all, most nonlinear mappings will not
have a corresponding kernel that obeys (3.24). However, what is done in practice is to choose
only mappings that have a corresponding kernel. In fact, the issue is often reversed: one chooses
the kernel, and that implicitly defines the nonlinear mapping that is used. The important point
is that it is not hard to find kernels corresponding to mappings that yield very wide classes of
nonlinear operations in X , when linear operations are performed in X̂ .
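The idea is easy to verify for a small polynomial kernel. In the sketch below, the feature map for the degree-2 polynomial kernel in two dimensions is written out explicitly (the √2 factors are the usual bookkeeping that makes the identity exact), and its inner products are checked against direct kernel evaluations.

import numpy as np

def phi(x):
    """Explicit feature map whose inner products reproduce k(a, b) = (a.b + 1)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def k(a, b):
    return (a @ b + 1.0) ** 2

a = np.array([0.3, -1.2])
b = np.array([2.0, 0.7])
print(phi(a) @ phi(b), k(a, b))   # the two numbers coincide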
The most straightforward application of these ideas to nonlinear ICA would consist of
performing linear ICA in the feature space X̂ , and then mapping the obtained components
back to the original space X . This is not possible, however, because it has not been possible
to express linear ICA only in terms of inner products, and these are the only operations that
can be efficiently performed in X̂ . And even if this could be done, there would probably be
the additional problem that working (albeit indirectly) in such a high-dimensional space could
easily lead to badly conditioned systems, raising numerical instability issues.
For these reasons, the approach that is taken in kernel-based nonlinear ICA is differ-
ent: we first linearly project the data from the feature space X̂ into a medium-dimensional
intermediate space X̃ , and then perform linear ICA in that space. Normally, linear ICA in this
lower dimensional space already is numerically tractable. The dimension of X̃ (call it d ) is often
chosen in a way that depends on the existing data, but for a typical two-source problem it could
be around 20.
The projection from the feature space into the intermediate space should be made in such
a way that the most important information from the feature space is kept. This projection can
be performed in several ways. The one that immediately comes to mind is the use of PCA in X̂ .
This amounts to using the kernel-PCA method that we mentioned earlier [90]. That method
uses only inner products in X̂ , and can therefore be efficiently implemented by means of the
kernel trick. Other alternatives that also use only inner products include the use of random
sampling or of clustering [38].
Once the data have been projected into the intermediate space X̃ , it is possible to perform
linear ICA in that space through several standard methods. The method that has been used in
3.2.3.1 Examples
We shall present two examples of the separation of nonlinear mixtures through kTDSEP.
First Example The source signals were two sinusoids with different frequencies (2000 samples
of each). Fig. 3.16(a) shows these signals, and Fig. 3.17(a) shows the corresponding scatter plot.
These signals were nonlinearly mixed according to
x_1 = e^{s_1} − e^{s_2}
x_2 = e^{−s_1} + e^{−s_2}. (3.25)
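The following sketch shows how this mixture can be generated; the two frequencies are illustrative choices of ours, since the text only requires them to be different.

import numpy as np

T = 2000
t = np.arange(T)
s1 = np.sin(2 * np.pi * 0.005 * t)        # first sinusoidal source
s2 = np.sin(2 * np.pi * 0.013 * t)        # second sinusoid, with a different frequency

x1 = np.exp(s1) - np.exp(s2)              # Eq. (3.25)
x2 = np.exp(-s1) + np.exp(-s2)
x = np.vstack([x1, x2])                   # nonlinear mixture fed to kTDSEP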
Fig. 3.16(b) shows the mixture signals, and Fig. 3.17(b) shows the corresponding scatter plot.
We can see that the mixture performed by (3.25) was significantly nonlinear.
Figs. 3.16(c) and 3.17(c) show the result of linear ICA. Both figures show that it was not
able to perform a good separation, as expected. Fig. 3.17(c) shows that linear ICA just rotated
the joint distribution. Naturally, it could not undo the nonlinearities introduced by the mixture.
FIGURE 3.16: Signals corresponding to the first example of separation through kTDSEP
The kernel that was used was
k(a_1, a_2) = (a_1^T a_2 + 1)^9.
This kernel generates a feature space X̂ spanned by all the monomials of the form (x1 )m (x2 )n ,
with m + n ≤ 9, where x1 and x2 are the two mixture components. This space has a total of
54 dimensions. This space was then reduced, through a clustering technique, to 20 dimen-
sions, thus forming the intermediate space X̃ . Linear ICA was performed on this intermediate
space by the TDSEP technique, yielding 20 components. From these, the ones corresponding
to the sources were extracted, as indicated above, by rerunning the algorithm on the set of
20 components and selecting the two that were least modified by this second separation. The
results are shown in Figs. 3.16(d) and 3.17(d). We can see that a virtually perfect separation was
achieved.
FIGURE 3.17: Scatter plots from the first example of separation through kTDSEP
Second Example The sources were two speech signals, each with a length of 20 000 samples.
Fig. 3.18(a) shows these signals, and Fig. 3.19(a) shows their joint distribution. The latter figure
shows that the sources were strongly supergaussian, as normally happens with speech signals.
For performing the nonlinear mixture, the signals were first both scaled to the interval
[−1, 1] and were then mixed according to
x_1 = −(s_2 + 1) cos(π s_1)
x_2 = (s_2 + 1) sin(π s_1).
We can describe this mixture in the following way: The mixture space is generated in polar
coordinates. Source s 1 controls the angle, while (s 2 + 1) controls the distance from the center.
Figs. 3.18(b) and 3.19(b) show the mixture results.24
24. Due to the shape of the mixture scatter plot, this has become commonly known as the “euro mixture.”
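The following sketch shows how the “euro mixture” can be generated; random supergaussian surrogates stand in here for the speech signals.

import numpy as np

rng = np.random.default_rng(6)
T = 20_000
s = rng.laplace(size=(2, T))                # surrogate supergaussian "speech" signals
s /= np.abs(s).max(axis=1, keepdims=True)   # scale both sources to [-1, 1]

x1 = -(s[1] + 1) * np.cos(np.pi * s[0])     # s1 controls the angle,
x2 = (s[1] + 1) * np.sin(np.pi * s[0])      # (s2 + 1) the distance from the center
x = np.vstack([x1, x2])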
FIGURE 3.18: Signals corresponding to the second example of separation through kTDSEP
The kernel that was used was the Gaussian kernel
k(a_1, a_2) = e^{−||a_1 − a_2||^2}.
FIGURE 3.19: Scatter plots from the second example of separation through kTDSEP
the signal-to-noise ratios of the results of both linear and nonlinear separation, relative to the
original sources. Table 3.2 shows the results and confirms the very large improvement obtained
with kTDSEP, relative to linear ICA.
We have shown examples of the separation of nonlinear mixtures of just two sources.
However, kTDSEP has been shown to be able to efficiently separate mixtures of up to seven
sources [38].
TABLE 3.2: Signal-to-Noise Ratios of the Results of Linear ICA and kTDSEP
                 SOURCE 1    SOURCE 2
Linear ICA        5.4 dB     −7.0 dB
kTDSEP           13.2 dB     18.1 dB
3.2.4 Other Methods
In this section we shall give an overview of other methods that have been proposed for performing
nonlinear source separation, and we shall simultaneously try to give a historical perspective of
the field. The number of different methods that we shall mention is large, and therefore we can
only make a brief reference to each of them.
A very early result on nonlinear source separation was published by Darmois in 1953 [30].
He showed the essential ill-posedness of unconstrained nonlinear ICA; i.e., that it has an infinite
number of solutions that are not related to one another in a simple way.
One of the first nonlinear ICA methods was proposed by Schmidhuber in 1992 [88]. It
was based on the idea of extracting, from the observed mixture, a set of components such that
each component would be as unpredictable as possible from the set of all other components.
This justified the method’s name, predictability minimization. The components were extracted
by an MLP, and the attempted prediction was also performed by MLPs. While the basic idea
was sound, the method was computationally heavy, and hard to apply in practice.
In the same year, another nonlinear ICA method was proposed by Burel [23]. It was
based on the minimization of a cost function that was a smoothed version of the quadratic
error between the true distribution p(y) and the product of the marginals, Π_i p(y_i). The
method used an MLP as a separator. The cost function was expressed as a series involving the
moments of the extracted components. The series was then truncated and the moments were
estimated on the training set. Backpropagation was used to compute the gradient of the cost
function for minimization. The method was demonstrated to work on an artificially generated
nonlinear mixture, both without and with noise. While no explicit regularization conditions
were mentioned, the smoothing of the error, the series truncation, and the fact that a very small
MLP was used (with one hidden layer with just two units) were implicit regularizing constraints
that allowed the method to cope with the indetermination of nonlinear ICA for the mixture
that was considered in the examples.
In 1995, Deco and Brauer proposed a method based on the minimization of the mutual
information of the extracted components [31]. The method was restricted to volume-conserving
nonlinear mixtures (also called information-preserving mixtures), and the separation was
performed by an MLP with a special structure. The probability densities needed for the esti-
mation of the mutual information were approximated by truncated series based on cumulants.
The indetermination inherent to nonlinear ICA was handled by the restriction to volume-
conserving mixtures and also by the inherent regularization corresponding to the truncation of
the cumulant-based series. Variants of the method were proposed in [81, 82].
Also in 1995, Hecht–Nielsen proposed a method (replicator neural networks) to find a
mapping from observed data to data that are uniformly distributed within a hypercube [41].
Coordinates parallel to the hypercube’s edges (which were designated natural coordinates of the
Also in 1998, Hochreiter and Schmidhuber [43–45] proposed the so-called LO-
COCODE method, which was based on a philosophy similar to (although less complete
than) the MDL one used later to develop ensemble learning: it tried to find a low-complexity
auto-encoder for the data. Under appropriate conditions, this led the auto-encoder’s internal
representation of the data to consist of the original sources. The main difference in philoso-
phy, relative to ensemble learning, was that it did not take into account the complexity of the
extracted sources and of the modeling error. The method was demonstrated on an artificial
nonlinear separation task, both with noiseless and with noisy data.
In 1999, Marques and L. Almeida proposed a method, called pattern repulsion, based on
a physical analogy with electrostatic repulsion among output patterns [72]. The method was
shown to be equivalent to the maximization of the second-order differential Renyi entropy of
the output, defined as [112]
H_2(Y) = − log ∫ p_Y^2(y) dy.
As with Shannon’s entropy, this entropy maximization led to a uniform distribution of the
outputs within a hypercube, and thus to independent outputs. The method used regularization
to handle the indetermination of nonlinear ICA. This method was extended by Hochreiter and
Mozer [42], by including both “repulsion” and “attraction,” allowing it to deal with nonuniform
source distributions. A further theoretical analysis of the method was made in [100].
Also in 1999, Palmieri [80] proposed a method based on the maximization of the output
entropy, using as separator an MLP with the restriction that it had, in each hidden layer, the
same number of units as the number of mixture components. The possible ill-posedness of the
problem was not addressed in this work. The same separator structure (but restricted to two
hidden layers) was considered in [73] with a different learning algorithm.
In the same year, Hyvärinen and Pajunen [56] showed that a nonlinear mixture of two
independent sources is uniquely separable if the mixture is conformal25 and the sources’ distri-
butions have known, bounded supports.
Still in 1999, Lappalainen (presently Valpola) and Giannakopoulos proposed the ensem-
ble learning method, studied in Section 3.2.2.
In 2000, L. Almeida proposed the MISEP method, studied in Section 3.2.1. Also in
2000, Fyfe and Lai [33] proposed a method that combines the use of kernels with the canonical
correlation analysis technique from statistics, and demonstrated the separation of two sinusoids
from a nonlinear mixture. The method seems to retain at least some of the indetermination of
nonlinear ICA because it does not provide information on which components, among those
that are extracted, do correspond to actual sources.
25. A conformal mapping is a mapping that locally preserves orthogonality.
In 2005, the so-called denoising source separation method, originally proposed for linear
source separation [87], was extended to nonlinear separation by M. Almeida et al., and was
demonstrated on the image separation problem that we examined in Section 3.2.1.4 [12]. The
method is not based on an independence criterion. Instead, it uses some prior information about
the sources and/or the mixture process to perform a partial separation of the sources. With an
application of this partial separation within an appropriate iterative structure, the method is
able to achieve a rather complete separation.
As we have seen, a large number of nonlinear source separation methods have been
proposed in the literature. Some of them are limited to specific kinds of nonlinear mixtures,
either explicitly or due to the restricted kind of separator that they use, but other ones are
relatively generic. In our view, this variety of methods reflects both the youth and the difficulty
of the topic: it has not stabilized into a well-defined set of methods yet. In Sections 3.2.1 to
3.2.3 we presented in some detail the methods that, in our opinion, have the greatest potential
for yielding useful application results.
3.3 CONCLUSION
If we compare the results obtained by kTDSEP with those obtained with MISEP, there is a
difference that is striking. The kTDSEP method can successfully separate mixtures involving
nonlinearities that are much stronger than those that can be handled by MISEP. A good example
is the “euro mixture.” MISEP, applied to this mixture, yields independent components, but these
do not correspond to the original sources. MISEP is not able to deal with the ill-posedness
of nonlinear separation in this case, because it uses the assumption that the mixture is mildly
nonlinear to be able to perform regularization, and that assumption is not valid in this case.
The question that immediately comes to mind is how kTDSEP can deal with the ill-
posedness, even without regularization. The intuition is that its use of the temporal structure
of the sources in the separation process greatly reduces the indetermination of nonlinear ICA.
There is some evidence to support this idea. In a series of unpublished experiments, jointly
performed by Harmeling and L. Almeida, a variant of kTDSEP was used to try to separate
a mixture of sources with no time structure (i.e. each of them had independent, identically
distributed samples). Of course, TDSEP could not be used for the linear separation step.
Therefore another method (INFOMAX) was used for this step. Even though the mixtures
that were considered were only mildly nonlinear, all tests failed to recover the original sources.
This seems to confirm the idea that the use of temporal structure is what allows kTDSEP to
successfully deal with the indetermination. Hosseini and Jutten have analyzed in [53] a number
of cases that also suggest that temporal structure can be used for this purpose. We conjecture
that temporal (or spatial) structure of the sources will become a very important element in
dealing with the indetermination of nonlinear source separation in the future.
CHAPTER 4
Final Comments
We have given an overview of nonlinear source separation and have studied in some detail the
main methods that are currently used to perform this kind of operation. While these methods
yield useful results in a number of cases, they are still somewhat hard to apply, and still need
much care from the user, both in tuning them and in assessing the quality of the results. Simply
put, these methods are still very far from a “black box” use, in which one would just grab a
separation routine, supply the data, and immediately get useful results.
The reader will have noticed that there are still few examples of the application of nonlinear
separation to real-life data. Some people have suggested that there are few naturally occurring
nonlinear mixtures. We think this is not so. It is true that many naturally occurring mixtures are
approximately linear. This is often the case with acoustic, biomedical, and telecommunications
signals, for example, and it is a fortunate situation, because it allows us to deal with them using
the well-developed linear separation methods. However, we think that we often do not identify
nonlinear mixtures as such, simply because we still do not have powerful enough methods to
deal with them. Imagine, for example, a complex set of real-life data such as the stock values
from some financial market. We would like to be able to extract the fundamental “sources”
that drive these data. These could perhaps be variables such as the investor’s confidence, the
interest rate, the market liquidity and volatility, and probably also other variables that we are
unable to think of. These “sources” are related to the observed data in a nonlinear way. The
same could be said about countless other sets of data. If we were able to efficiently perform
nonlinear source separation in complex data sets, we would have a very powerful data mining
tool, applicable to a very large number of situations. It is therefore worth investing a significant
effort into the development of more powerful nonlinear separation methods.
In our view, the main difficulty of nonlinear source separation resides in the ill-posedness
of nonlinear ICA, which we have emphasized a number of times throughout this book. While the
use of the temporal or spatial structure of signals may provide a significant help in this respect,
as illustrated by the kTDSEP method, knowledge about how this structure should be used, and
about the capabilities and shortcomings of its use, is still very limited. To our knowledge, the
use of signal structure in nonlinear separation is currently limited to kTDSEP, to the method
presented in [21], and to ensemble learning with a dynamical model (as in the second example
to develop methods that are able to exploit elaborate and powerful measures of complexity and
that, at the same time, are efficient in computational terms.
The field of nonlinear source separation will keep steadily advancing, with the develop-
ment of methods that are progressively more powerful, more efficient, easier to use and applicable
to a wider range of situations. The benefits to be gained are very large. Linear separation, which
is much more advanced today, has already shown the potential of these methods to reveal the
“hidden truth” that lies behind complex signals, making them much easier to understand and
to process, and giving us access to information that could not be obtained in any other way. A
similar potential awaits us as we learn to perform nonlinear separation more effectively.
We conclude that Z will be uniformly distributed in (0, 1), whatever the distribution of
the original random variable Y. The function F_Y is the nondecreasing function that transforms
Y into a random variable that is uniformly distributed in (0, 1). This fact is used by some ICA
methods, namely INFOMAX and MISEP.
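As a small numerical illustration (ours, not part of those methods), the following Python sketch applies this transformation to an exponentially distributed variable, whose distribution function is F_Y(y) = 1 − exp(−y), and checks that the result is approximately uniform in (0, 1):

    import numpy as np

    rng = np.random.default_rng(0)

    # Sample Y from an exponential distribution with rate 1, whose
    # distribution function is F_Y(y) = 1 - exp(-y).
    y = rng.exponential(scale=1.0, size=100000)
    z = 1.0 - np.exp(-y)               # Z = F_Y(Y)

    # If Z is uniform in (0, 1), the empirical frequency in each of
    # ten equal-width bins should be close to 0.1.
    freq, _ = np.histogram(z, bins=10, range=(0.0, 1.0))
    print(freq / z.size)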
A.2 ENTROPY
The entropy of a discrete random variable X, as defined by Shannon [29, 91], is
H(X) = −Σ_i P(x_i) log P(x_i)
     = −E[log P(X)],   (A.1)

where P(x_i) is the probability that X takes the value x_i, and the sum spans all the possible values
of X. The entropy measures the amount of information that is contained, on average, in X. It
can also be interpreted as the minimum number of symbols needed to encode X, on average.
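For concreteness, here is a minimal Python sketch (ours) that evaluates (A.1) for simple discrete distributions; the choice of base-2 logarithms, which gives the entropy in bits, is just an illustrative assumption:

    import numpy as np

    def entropy(probs, base=2.0):
        # Shannon entropy H(X) = -sum_i P(x_i) log P(x_i), here in base-2
        # logarithms (bits); terms with P(x_i) = 0 contribute nothing.
        p = np.asarray(probs, dtype=float)
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)) / np.log(base))

    print(entropy([0.5, 0.5]))    # a fair coin: 1 bit per outcome
    print(entropy([0.9, 0.1]))    # a biased coin: about 0.47 bits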
For a continuous random variable X, Shannon's differential entropy is defined as

H(X) = −∫ p(x) log p(x) dx = −E[log p(X)],
where p(X) is the probability density function of X, and the convention 0 log 0 = 0 is adopted.
The differential entropy does not have the same interpretation as the entropy in terms of coding
length, but is useful in relative terms, for comparing different random variables. Its definition
extends, in a straightforward way, to multidimensional variables.
Following are two important facts about differential entropy (see [29, Chapter 11] or [91]); a
numerical check of the second fact is sketched after the list:
• Among all distributions with support within a given bounded region, the one with the
largest differential entropy is the uniform distribution within that region. This is true
both for single-dimensional and for multidimensional distributions.
• Among all single-dimensional distributions with zero mean and with a given variance,
the one with the largest differential entropy is the Gaussian distribution with the given
variance. Among all multidimensional distributions with zero mean and a given covari-
ance matrix, the one with the largest differential entropy is the Gaussian distribution
with the given covariance matrix. Among all multidimensional distributions with zero
mean and a given variance, the one with the largest differential entropy is the spherically
symmetric Gaussian distribution with the given variance.
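As a numerical illustration of the second fact (our sketch, not from the original text), the closed-form differential entropies of a uniform and of a Gaussian distribution with the same variance can be compared directly:

    import numpy as np

    sigma = 1.0   # common standard deviation of the two distributions

    # Uniform distribution with variance sigma^2: its support has width
    # sqrt(12)*sigma, and the differential entropy of a uniform density
    # on an interval of width w is log(w) (in nats).
    h_uniform = np.log(np.sqrt(12.0) * sigma)

    # Gaussian distribution with the same variance.
    h_gaussian = 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

    print(h_uniform, h_gaussian)   # approx. 1.24 < 1.42: the Gaussian wins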
where J = ∂z/∂x is the Jacobian of the transformation T. Therefore,

H(Z) = −E[log p(z)]
     = −E[log p(x)] + E[log |det J|]
     = H(X) + E[log |det J|],
where, for the case of continuous random variables, H represents Shannon’s differential entropy,
and for the case of discrete variables it represents Shannon’s entropy.
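This relation can be checked in closed form for a linear transformation of a Gaussian vector, for which both entropies are known; the following sketch is ours, and the covariance matrix C and transformation matrix A are arbitrary illustrative choices:

    import numpy as np

    def gaussian_entropy(cov):
        # Differential entropy (in nats) of a Gaussian with covariance 'cov'.
        n = cov.shape[0]
        return 0.5 * np.log((2.0 * np.pi * np.e) ** n * np.linalg.det(cov))

    # X is a two-dimensional Gaussian with covariance C; Z = T(X) = A X is a
    # linear transformation, so J = A everywhere and E[log |det J|] = log |det A|.
    C = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
    A = np.array([[1.0, 0.5],
                  [0.2, 2.0]])

    h_x = gaussian_entropy(C)
    h_z = gaussian_entropy(A @ C @ A.T)

    print(h_z, h_x + np.log(abs(np.linalg.det(A))))   # the two values coincide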
The mutual information I(Y) of the components of a random vector Y can be written as
I(Y) = Σ_i H(Y_i) − H(Y). I(Y) is a nonnegative quantity. It is zero if and only if the components
Y_i are mutually independent. This agrees with the intuitive concept that we mentioned above:
the information shared by the components Y_i is never negative, and is zero only if these
components are mutually independent.
I (Y ) is equal to the Kullback–Leibler divergence between the density of Y and the product
of the marginal densities of the components Yi :
I(Y) = KLD( p(y), Π_i p(y_i) ).
This is proved, for example, in [29,39], for the case of a two-dimensional random vector Y, and
the proof can easily be extended to more than two dimensions.
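For a discrete two-dimensional example, this equality can be verified directly; the joint probability table below is an arbitrary illustrative choice (the sketch is ours, not from [29] or [39]):

    import numpy as np

    # Joint probability table of (Y_1, Y_2); the marginals are its row and
    # column sums.
    p_joint = np.array([[0.30, 0.10],
                        [0.15, 0.45]])
    p1 = p_joint.sum(axis=1)
    p2 = p_joint.sum(axis=0)

    def entropy(p):
        p = p[p > 0]
        return float(-np.sum(p * np.log(p)))

    # Mutual information as the sum of marginal entropies minus the joint entropy.
    mi = entropy(p1) + entropy(p2) - entropy(p_joint.ravel())

    # Kullback-Leibler divergence between the joint distribution and the
    # product of the marginals.
    kld = float(np.sum(p_joint * np.log(p_joint / np.outer(p1, p2))))

    print(mi, kld)   # the two values coincide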
But the Jacobian ∂Z/∂Y is a diagonal matrix, its determinant being given by

det(∂Z/∂Y) = Π_i ψ′_i(Y_i).

Therefore,

E[log det(∂Z/∂Y)] = E[log Π_i ψ′_i(Y_i)] = Σ_i E[log ψ′_i(Y_i)].
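A quick numerical check of this step (our sketch; the choice of tanh as a stand-in for the component-wise nonlinearities ψ_i is purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def psi_prime(y):
        # Derivative of tanh, standing in for the derivatives psi'_i.
        return 1.0 - np.tanh(y) ** 2

    y = rng.normal(size=(10000, 3))    # samples of a three-dimensional Y

    # The Jacobian dZ/dY is diagonal, so its determinant at each sample is
    # the product of the component derivatives psi'_i(Y_i)...
    det_jac = np.prod(psi_prime(y), axis=1)

    # ...and therefore E[log det(dZ/dY)] equals the sum over i of
    # E[log psi'_i(Y_i)].
    lhs = np.mean(np.log(det_jac))
    rhs = np.sum(np.mean(np.log(psi_prime(y)), axis=0))

    print(lhs, rhs)                    # the two estimates coincide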
• ICA Central—http://www.tsi.enst.fr/icacentral
• Paris Smaragdis’ page—http://web.media.mit.edu/∼paris/ica.html
• ICA Research Network—http://www.elec.qmul.ac.uk/icarn/software.html
http://www.lx.it.pt/∼lbalmeida/ica/seethrough/index.html
http://www.lx.it.pt/∼lbalmeida/ica/seethrough/code/jmlr05/
References
[1] S. Achard and C. Jutten, “Identifiability of post nonlinear mixtures,” IEEE Signal Pro-
cessing Letters, vol. 12, no. 5, pp. 423–426, 2005. doi:10.1109/LSP.2005.845593
[2] S. Achard, D. Pham, and C. Jutten, “Blind source separation in post nonlinear mix-
tures,” in Proc. Int. Workshop Independent Component Analysis and Blind Signal Separa-
tion, San Diego, CA, 2001, pp. 295–300. [Online]. Available: http://www-lmc.imag.fr/
lmc-sms/Sophie.Achard/Recherche/ICA2001.pdf
[3] S. Achard, D. Pham, and C. Jutten, “Quadratic dependence measure for nonlinear blind
sources separation,” in Proc. Int. Workshop Independent Component Analysis and Blind
Signal Separation, Nara, Japan, 2003. [Online]. Available: http://www.kecl.ntt.co.jp/icl/
signal/ica2003/cdrom/data/0098.pdf
[4] L. Almeida, “Multilayer perceptrons,” in Handbook of Neural Computation, E. Fiesler
and R. Beale, Eds. Bristol, U.K.: Institute of Physics. 1997. [Online]. Available:
http://www.lx.it.pt/∼lbalmeida/papers/AlmeidaHNC.pdf
[5] L. Almeida, “Linear and nonlinear ICA based on mutual information,” in Proc. Symp.
2000 Adaptive Systems for Signal Processing, Communications, and Control, Lake Louise,
Alberta, Canada, 2000. [Online]. Available: http://www.lx.it.pt/∼lbalmeida/papers/
AlmeidaASSPCC00.ps.zip
[6] L. Almeida, “Simultaneous MI-based estimation of independent components and of
their distributions,” in Proc. Second Int. Workshop Independent Component Analysis and
Blind Signal Separation, Helsinki, Finland, 2000, pp. 169–174. [Online]. Available:
http://www.lx.it.pt/∼lbalmeida/papers/AlmeidaICA00.ps.zip
[7] L. Almeida, “Faster training in nonlinear ICA using MISEP,” in Proc. Int. Workshop
Independent Component Analysis and Blind Signal Separation, Nara, Japan, 2003, pp. 113–
118. [Online]. Available: http://www.lx.it.pt/∼lbalmeida/papers/AlmeidaICA03.pdf
[8] L. Almeida, “MISEP—Linear and nonlinear ICA based on mutual information,” Jour-
nal of Machine Learning Research, vol. 4, pp. 1297–1318, 2003. [Online]. Available:
http://www.jmlr.org/papers/volume4/almeida03a/almeida03a.pdf
[9] L. Almeida, “Linear and nonlinear ICA based on mutual information—
the MISEP method,” Signal Processing, vol. 84, no. 2, pp. 231–245, 2004.
[Online]. Available: http://www.lx.it.pt/∼lbalmeida/papers/AlmeidaSigProc03.pdf
doi:10.1016/j.sigpro.2003.10.008
Separation, Series Lecture Notes in Artificial Intelligence, no. 3195, C. G. Puntonet
and A. Prieto, Eds. Springer-Verlag, 2004, pp. 742–749. [Online]. Available: http://itb
.biologie.hu-berlin.de/∼blaschke/publications/isfa.pdf
[22] R. Boscolo, H. Pan, and V. Roychowdhury, “Independent component analysis based
on nonparametric density estimation,” IEEE Transactions on Neural Networks, vol. 15,
no. 1, pp. 55–65, January 2004. [Online]. Available: http://www.ee.ucla.edu/faculty/
papers/vwani trans-neural jan04.pdf doi:10.1109/TNN.2003.820667
[23] G. Burel, “Blind separation of sources: A nonlinear neural algorithm,” Neural Networks,
vol. 5, no. 6, pp. 937–947, 1992. doi:10.1016/S0893-6080(05)80090-5
[24] C. Burges, “A tutorial on support vector machines for pattern recogni-
tion,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167,
1998. [Online]. Available: http://www.kernel-machines.org/papers/Burges98.ps.gz
doi:10.1023/A:1009715923555
[25] J.-F. Cardoso, “The invariant approach to source separation,” in Proc. NOLTA, 1995,
pp. 55–60. [Online]. Available: http://www.tsi.enst.fr/∼cardoso/Papers.PDF/nolta95
.pdf
[26] J.-F. Cardoso and A. Souloumiac, “Blind beamforming for non Gaussian signals,” IEE
Proceedings-F, vol. 140, no. 6, pp. 362–370, 1993. [Online]. Available: http://www.tsi
.enst.fr/∼cardoso/Papers.PDF/iee.pdf
[27] A. Cichocki and S.-I. Amari, Adaptive Blind Signal and Image Processing—Learning
Algorithms and Applications. New York, NY: Wiley, 2002.
[28] P. Comon, “Independent component analysis—a new concept?” Signal Processing,
vol. 36, pp. 287–314, 1994. doi:10.1016/0165-1684(94)90029-9
[29] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY: Wiley,
1991.
[30] G. Darmois, “Analyse générale des liaisons stochastiques,” Rev. Inst. Internat. Stat.,
vol. 21, pp. 2–8, 1953.
[31] G. Deco and W. Brauer, “Nonlinear higher-order statistical decorrelation by volume-
conserving neural architectures,” Neural Networks, vol. 8, pp. 525–535, 1995.
doi:10.1016/0893-6080(94)00108-X
[32] J. Fisher and J. Principe, “Entropy manipulation of arbitrary nonlinear mappings,” in
Proc. IEEE Workshop Neural Networks for Signal Processing, Amelia Island, FL, 1997, pp.
14–23. [Online]. Available: http://www.cnel.ufl.edu/bib/pdf papers/fisher nnsp97.pdf
[33] C. Fyfe and P. Lai, “ICA using kernel canonical correlation analysis,” in Proc. Int. Work-
shop Independent Component Analysis and Blind Signal Separation, Helsinki, Finland,
2000, pp. 279–284. [Online]. Available: http://www.cis.hut.fi/ica2000/proceedings/
0279.pdf
[45] S. Hochreiter and J. Schmidhuber, “LOCOCODE performs nonlinear ICA without
knowing the number of sources,” in Proc. First Int. Workshop Independent Component
Analysis and Signal Separation, J. F. Cardoso, C. Jutten, and P. Loubaton, Eds., Aussois,
France, 1999, pp. 277–282.
[46] A. Honkela, “Speeding up cyclic update schemes by pattern searches,” in Proc. Ninth
Int. Conf. Neural Information Processing, Singapore, 2002, pp. 512–516.
[47] A. Honkela, S. Harmeling, L. Lundqvist, and H. Valpola, “Using kernel PCA for
initialisation of variational Bayesian nonlinear blind source separation method,” in Proc.
Int. Workshop Independent Component Analysis and Blind Signal Separation, Granada,
Spain, 2004, pp. 790–797.
[48] A. Honkela and H. Valpola, “Variational learning and bits-back coding: An information-
theoretic view to Bayesian learning,” IEEE Transactions on Neural Networks, vol. 15,
no. 4, pp. 800–810, 2004. doi:10.1109/TNN.2004.828762
[49] A. Honkela and H. Valpola, “Unsupervised variational bayesian learning of nonlin-
ear models,” in Advances in Neural Information Processing Systems, vol. 17, L. K. Saul,
Y. Weiss, and L. Bottou, Eds., 2005, pp. 593–600. [Online]. Available: http://books.nips
.cc/papers/files/nips17/NIPS2004 0322.pdf
[50] A. Honkela, H. Valpola, and J. Karhunen, “Accelerating cyclic update algorithms for pa-
rameter estimation by pattern searches,” Neural Processing Letters, vol. 17, no. 2, pp. 191–
203, 2003. doi:10.1023/A:1023655202546
[51] S. Hosseini and Y. Deville, “Blind separation of linear-quadratic mixtures of real
sources,” in Proc. IWANN, vol. 2, Mao, Menorca, Spain, 2003, pp. 241–248.
[52] S. Hosseini and Y. Deville, “Blind maximum likelihood separation of a linear-quadratic
mixture,” in Proc. Int. Workshop Independent Component Analysis and Blind Signal Sepa-
ration, Series Lecture Notes in Artificial Intelligence, no. 3195. Springer-Verlag, 2004.
[Online]. Available: http://webast.ast.obs-mip.fr/people/ydeville/papers/ica04 1.pdf
[53] S. Hosseini and C. Jutten, “On the separability of nonlinear mixtures of temporally
correlated sources,” IEEE Signal Processing Letters, vol. 10, no. 2, pp. 43–46, February
2003. doi:10.1109/LSP.2002.807871
[54] A. Hyvärinen, “Fast and robust fixed-point algorithms for independent component
analysis,” IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 626–634, 1999.
[Online]. Available: http://www.cs.helsinki.fi/u/ahyvarin/papers/TNN99new.pdf
doi:10.1109/72.761722
[55] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. New York,
NY: Wiley, 2001.
[56] A. Hyvärinen and P. Pajunen, “Nonlinear independent component analysis: Exis-
tence and uniqueness results,” Neural Networks, vol. 12, no. 3, pp. 429–439, 1999.
Springer-Verlag, 2004, pp. 710–717. [Online]. Available: http://www.dice.ucl.ac.be/
∼verleyse/papers/ica04jl.pdf
[66] T.-W. Lee, M. Girolami, and T. Sejnowski, “Independent component analysis using
an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources,”
Neural Computation, vol. 11, pp. 417–441, 1999. doi:10.1162/089976699300016719
[67] T. W. Lee, B. Koehler, and R. Orglmeister, “Blind source separation of nonlinear mixing
models,” in Proc. Neural Networks for Signal Processing, 1997, pp. 406–415. [Online].
Available: http://www.cnl.salk.edu/∼tewon/Public/nnsp97.ps.gz
[68] J. Lin, D. Grier, and J. Cowan, “Source separation and density estimation by faithful
equivariant SOM,” in Advances in Neural Information Processing Systems. Cambridge,
MA: MIT Press, 1997, pp. 536–542.
[69] J. Lin, D. Grier, and J. Cowan, “Faithful representation of separable input distributions,”
Neural Computation, vol. 9, pp. 1305–1320, 1997.
[70] E. Lorenz, “Deterministic nonperiodic flow,” Journal of Atmospheric Sciences, vol. 20,
pp. 130–141, 1963. doi:10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2
[71] G. Marques and L. Almeida, “An objective function for independence,” in Proc. Int.
Conf. Neural Networks, Washington, DC, 1996, pp. 453–457.
[72] G. Marques and L. Almeida, “Separation of nonlinear mixtures using pattern repulsion,”
in Proc. First Int. Workshop Independent Component Analysis and Signal Separation, J. F.
Cardoso, C. Jutten, and P. Loubaton, Eds., Aussois, France, 1999, pp. 277–282. [On-
line]. Available: http://www.lx.it.pt/∼lbalmeida/papers/MarquesAlmeidaICA99.ps.zip
[73] R. Martín-Clemente, S. Hornillo-Mellado, J. Acha, F. Rojas, and C. Puntonet, “MLP-
based source separation for MLP-like nonlinear mixtures,” in Proc. Int. Workshop Inde-
pendent Component Analysis and Blind Signal Separation, Nara, Japan, 2003. [Online].
Available: http://www.kecl.ntt.co.jp/icl/signal/ica2003/cdrom/data/0114.pdf
[74] T. Mitchell, Machine Learning. New York, NY: McGraw Hill, 1997.
[75] L. Molgedey and H. Schuster, “Separation of a mixture of independent signals using
time delayed correlations,” Physical Review Letters, vol. 72, pp. 3634–3636, 1994.
[Online]. Available: http://www.theo-physik.uni-kiel.de/thesis/molgedey94.ps.gz
doi:10.1103/PhysRevLett.72.3634
[76] K.-R. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, “An introduction
to kernel-based learning algorithms,” IEEE Transactions on Neural Networks, vol. 12,
no. 2, pp. 181–201, May 2001. [Online]. Available: http://mlg.anu.edu.au/∼raetsch/
ps/review.pdf doi:10.1109/72.914517
[77] P. Pajunen, “Nonlinear independent component analysis by self-organizing maps,” in
Artificial Neural Networks—ICANN 96, Proc. 1996 International Conference on Artificial
Neural Networks, Bochum, Germany, 1996, pp. 815–819.
[90] B. Schölkopf, A. Smola, and K.-R. Müller, “Nonlinear component analysis as a
kernel eigenvalue problem,” Neural Computation, vol. 10, pp. 1299–1319, 1998.
[Online]. Available: http://users.rsise.anu.edu.au/∼smola/papers/SchSmoMul98.pdf
doi:10.1162/089976698300017467
[91] C. Shannon, “A mathematical theory of communication,” Bell System Technical Jour-
nal, vol. 27, pp. 379–423 and 623–656, July and October 1948. [Online]. Available:
http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf
[92] J. Solé, C. Jutten, and D. T. Pham, “Fast approximation of nonlinearities for improving
inversion algorithms of pnl mixtures and Wiener systems,” Signal Processing, 2004.
[93] J. Solé, C. Jutten, and A. Taleb, “Parametric approach to blind deconvolution of
nonlinear channels,” Neurocomputing, vol. 48, pp. 339–355, 2002. doi:10.1016/S0925-2312(01)00651-8
[94] A. Taleb and C. Jutten, “Entropy optimization, application to blind source separation,”
in Proc. Int. Conf. Artificial Neural Networks, Lausanne, Switzerland, 1997, pp. 529–
534.
[95] A. Taleb and C. Jutten, “Nonlinear source separation: The post-nonlinear mixtures,”
in Proc. 1997 Eur. Symp. Artificial Neural Networks, Bruges, Belgium, 1997, pp. 279–
284.
[96] A. Taleb and C. Jutten, “Batch algorithm for source separation in post-nonlinear mix-
tures,” in Proc. First Int. Workshop Independent Component Analysis and Signal Separation,
Aussois, France, 1999, pp. 155–160.
[97] A. Taleb and C. Jutten, “Source separation in post-nonlinear mixtures,” IEEE Transac-
tions on Signal Processing, vol. 47, pp. 2807–2820, 1999. doi:10.1109/78.790661
[98] A. Taleb, J. Solé, and C. Jutten, “Quasi-nonparametric blind inversion of Wiener sys-
tems,” IEEE Transactions on Signal Processing, vol. 49, no. 5, pp. 917–924, 2001.
doi:10.1109/78.917796
[99] Y. Tan, J. Wang, and J. Zurada, “Nonlinear blind source separation using a radial basis
function network,” IEEE Transactions on Neural Networks, vol. 12, no. 1, pp. 124–134,
2001. doi:10.1109/72.896801
[100] F. Theis, C. Bauer, C. Puntonet, and E. Lang, “Pattern repulsion revisited,” in
Proc. IWANN, Series Lecture Notes in Computer Science, no. 2085. New York,
NY: Springer-Verlag, 2001, pp. 778–785. [Online]. Available: http://homepages.uni-
regensburg.de/∼thf11669/publications/theis01patternrep IWANN01.pdf
[101] F. Theis, C. Puntonet, and E. Lang, “Nonlinear geometric ICA,” in Proc. Int. Work-
shop Independent Component Analysis and Blind Signal Separation, Nara, Japan, 2003,
pp. 275–280. [Online]. Available: http://homepages.uni-regensburg.de/∼thf11669/
publications/theis03nonlineargeo ICA03.pdf
[114] A. Ziehe, M. Kawanabe, S. Harmeling, and K.-R. Müller, “Separation of post-nonlinear
mixtures using ACE and temporal decorrelation,” in Proc. Int. Conf. Independent Com-
ponent Analysis and Blind Source Separation, San Diego, CA, 2001, pp. 433–438.
[115] A. Ziehe, M. Kawanabe, S. Harmeling, and K.-R. Müller, “Blind separation of post-
nonlinear mixtures using gaussianizing transformations and temporal decorrelation,”
in Proc. Int. Workshop Independent Component Analysis and Blind Signal Separa-
tion, Nara, Japan, 2003, pp. 269–274. [Online]. Available: http://www.kecl.ntt.co.jp/
icl/signal/ica2003/cdrom/data/0208.pdf
[116] A. Ziehe, M. Kawanabe, S. Harmeling, and K.-R. Müller, “Blind separation of post-
nonlinear mixtures using linearizing transformations and temporal decorrelation,” Jour-
nal of Machine Learning Research, vol. 4, pp. 1319–1338, December 2003. [Online].
Available: http://www.jmlr.org/papers/volume4/ziehe03a/ziehe03a.pdf
[117] A. Ziehe and K.-R. Müller, “TDSEP—An efficient algorithm for blind separation using
time structure,” in Proc. Int. Conf. Artificial Neural Networks, Skövde, Sweden, 1998,
pp. 675–680. [Online]. Available: http://wwwold.first.gmd.de/persons/Mueller.Klaus-
Robert/ICANN tdsep.ps.gz
Biography
Luis B. Almeida is a full professor of Signals and Systems, and of Neural Networks and
Machine Learning, at Instituto Superior Técnico, Technical University of Lisbon, and a re-
searcher at the Telecommunications Institute, Lisbon, Portugal. He holds a Ph.D. in Signal
Processing from the Technical University of Lisbon. He formerly taught Systems Theory,
Telecommunications, Digital Systems, and Mathematical Analysis, among other subjects.
Luis B. Almeida's current research focuses on nonlinear source separation. He previously
performed research on speech modeling and coding, Fourier and time-frequency analysis of
signals, and training algorithms for neural networks. Some highlights of his work include the
sinusoidal model for voiced speech, currently in use in INMARSAT and IRIDIUM telephones
(developed with F.M. Silva and J.S. Marques), work on the Fractional Fourier Transform, the
development of recurrent backpropagation, and the development of the MISEP method of
nonlinear source separation.
Luis B. Almeida has been a founding Vice-President of the European Neural Network
Society and the founding President of INESC-ID (a nonprofit research institute associated
with the Technical University of Lisbon).