
3

Canonical Correlation Analysis

Alles Gescheite ist schon gedacht worden, man muß nur versuchen, es noch einmal
zu denken (Johann Wolfgang von Goethe, Wilhelm Meisters Wanderjahre, 1749–1832.)
Every clever thought has been thought before, we can only try to recreate these thoughts.

3.1 Introduction
In Chapter 2 we represented a random vector as a linear combination of uncorrelated vec-
tors. From one random vector we progress to two vectors, but now we look for correlation
between the variables of the first and second vectors, and in particular, we want to find out
which variables are correlated and how strong this relationship is.
In medical diagnostics, for example, we may meet multivariate measurements obtained
from tissue and plasma samples of patients, and the tissue and plasma variables typically
differ. A natural question is: What is the relationship between the tissue measurements and
the plasma measurements? A strong relationship between a combination of tissue variables
and a combination of plasma variables typically indicates that either set of measurements
could be used for a particular diagnosis. A very weak relationship between the plasma and
tissue variables tells us that the sets of variables are not equally appropriate for a particular
diagnosis.
On the share market, one might want to compare changes in the price of industrial shares
and mining shares over a period of time. The time points are the observations, and for each
time point, we have two sets of variables: those arising from industrial shares and those
arising from mining shares. We may want to know whether the industrial and mining shares
show a similar growth pattern over time. In these scenarios, two sets of observations exist,
and the data fall into two parts; each part consists of the same n objects, such as people,
time points, or locations, but the variables in the two parts or sets differ, and the number of
variables is typically not the same in the two parts.
To find high correlation between two sets of variables, we consider linear combinations
of variables and then ask the questions:

1. How should one choose these combinations?


2. How many such combinations should one consider?

We do not ask, ‘Is there a relationship?’ Instead, we want to find the combinations of vari-
ables separately for each of the two sets of variables which exhibit the strongest correlation
between the two sets of variables. The second question can now be re-phrased: How many
such combinations are correlated enough?

If one of the two sets of variables consists of a single variable, and if the single variable is
linearly related to the variables in the other set, then Multivariate Linear Regression can be
used to determine the nature and strength of the relationship. The two methods, Canonical
Correlation Analysis and Linear Regression, agree for this special setting. The two methods
are closely related but differ in a number of aspects:
• Canonical Correlation Analysis exhibits a symmetric relationship between the two sets
of variables, whereas Linear Regression focuses on predictor variables and response
variables, and the roles of predictors and responses cannot be reversed in general.
• Canonical Correlation Analysis determines optimal combinations of variables simulta-
neously for both sets of variables and finds the strongest overall relationship. In Linear
Regression with vector-valued responses, relationships between each scalar response
variable and combinations of the predictor variables are determined separately for each
component of the response.
The pioneering work of Hotelling (1935, 1936) paved the way in this field. Hotelling
published his seminal work on Principal Component Analysis in 1933, and only two years
later his next big advance in multivariate analysis followed!
In this chapter we consider different subsets of the variables, and parts of the data will
refer to the subsets of variables. The chapter describes how Canonical Correlation Analysis
finds combinations of variables of two vectors or two parts of data that are more strongly
correlated than the original variables. We begin with the population case in Section 3.2 and
consider the sample case in Section 3.3. We derive properties of canonical correlations in
Section 3.4. Transformations of variables are frequently encountered in practice. We inves-
tigate correlation properties of transformed variables in Section 3.5 and distinguish between
transformations resulting from singular and non-singular matrices. Maximum Covariance
Analysis is also mentioned in this section. In Section 3.6 we briefly touch on asymptotic
results for multivariate normal random vectors and then consider hypothesis tests of the
strength of the correlation. Section 3.7 compares Canonical Correlation Analysis with Lin-
ear Regression and Partial Least Squares and shows that these three approaches are special
instances of the generalised eigenvalue problems. Problems pertaining to the material of this
chapter are listed at the end of Part I.

A Word of Caution. The definitions I give in this chapter use the centred (rather than the
raw) data and thus differ from the usual treatment. Centring the data is a natural first step in
many analyses. Further, working with the centred data will make the theoretical underpin-
nings of this topic more obviously compatible with those of Principal Component Analysis.
The basic building blocks are covariance matrices, and for this reason, my approach does
not differ much from that of other authors. In particular, properties relating to canonical
correlations are not affected by using centred vectors and centred data.

3.2 Population Canonical Correlations


The main goal of this section is to define the key ideas of Canonical Correlation Analysis
for random vectors. Because we are dealing with two random vectors rather than a single
vector, as in Principal Component Analysis, the matrices which link the two vectors are
important. We begin with properties of matrices based on Definition 1.12 and Result 1.13 of
Section 1.5.3 and exploit an important link between the singular value decomposition and
the spectral decomposition.
Proposition 3.1 Let A be a p × q matrix with rank r and singular value decomposition
A = E Λ Fᵀ. Put B = A Aᵀ and K = Aᵀ A. Then
1. the matrices B and K have rank r;
2. the spectral decompositions of B and K are

      B = E D Eᵀ   and   K = F D Fᵀ,

   where D = Λ²; and
3. the eigenvectors ek of B and fk of K satisfy

      A fk = λk ek   and   Aᵀ ek = λk fk,

   where the λk are the diagonal elements of Λ, for k = 1, . . ., r.


Proof Because r is the rank of A, the p × r matrix E consists of the left eigenvectors of A,
and the q × r matrix F consists of the right eigenvectors of A. The diagonal matrix Λ is of
size r × r. From this and the definition of B, it follows that

   B = A Aᵀ = E Λ Fᵀ F Λ Eᵀ = E Λ² Eᵀ

because Λ = Λᵀ and Fᵀ F = Ir×r. By the uniqueness of the spectral decomposition, it fol-
lows that E is the matrix of the first r eigenvectors of B, and D = Λ² is the matrix of
eigenvalues λk² of B, for k ≤ r. A similar argument applies to K. It now follows that B and
K have the same rank and the same eigenvalues. The proof of part 3 is considered in the
Problems at the end of Part I.
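Proposition 3.1 is easy to verify numerically. The following is a minimal NumPy sketch, with a randomly generated matrix A used purely for illustration; it checks that the eigenvalues of B and K are the squared singular values of A and that the relations of part 3 hold.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))                      # a p x q matrix, here p = 5, q = 3

E, lam, Ft = np.linalg.svd(A, full_matrices=False)   # A = E diag(lam) F^T
F = Ft.T

B = A @ A.T
K = A.T @ A

# Eigenvalues of B and K, restricted to the rank of A, equal the squared singular values.
eigB = np.sort(np.linalg.eigvalsh(B))[::-1][: len(lam)]
eigK = np.sort(np.linalg.eigvalsh(K))[::-1][: len(lam)]
print(np.allclose(eigB, lam**2), np.allclose(eigK, lam**2))

# Part 3: A f_k = lambda_k e_k and A^T e_k = lambda_k f_k, checked column by column.
print(np.allclose(A @ F, E * lam), np.allclose(A.T @ E, F * lam))
```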
We consider two random vectors X[1] and X[2] such that X[ρ] ∼ (μρ, Σρ), for ρ = 1, 2.
Throughout this chapter, X[ρ] will be a dρ-dimensional random vector, and dρ ≥ 1. Unless
otherwise stated, we assume that the covariance matrix Σρ has rank dρ. In the random-
sample case, which I describe in the next section, the observations Xi[ρ] (for i = 1, . . ., n) are
dρ-dimensional random vectors, and their sample covariance matrix has rank dρ. We write

   X = ( X[1]
         X[2] ) ∼ (μ, Σ)

for the d-dimensional random vector X, with d = d1 + d2, mean

   μ = ( μ1
         μ2 )

and covariance matrix

   Σ = ( Σ1     Σ12
         Σ12ᵀ   Σ2  ).                                                  (3.1)

To distinguish between the different submatrices of Σ in (3.1), we call the d1 × d2 matrix
Σ12 the between covariance matrix of the vectors X[1] and X[2]. Let r be the rank of Σ12,
so r ≤ min(d1, d2). In general, d1 ≠ d2, as the dimensions reflect specific measurements,
and there is no reason why the number of measurements in X[1] and X[2] should be the same.

We extend the notion of projections onto direction vectors from Principal Component
Analysis to Canonical Correlation Analysis, which has pairs of vectors. For ρ = 1, 2, let
X̃[ρ] = Σρ^{−1/2} (X[ρ] − μρ) be the sphered vectors. For k = 1, . . ., r, let ak ∈ R^{d1} and bk ∈ R^{d2}
be unit vectors, and put

   Uk = akᵀ X̃[1]   and   Vk = bkᵀ X̃[2].                                 (3.2)
The aim is to find direction vectors ak and bk such that the dependence between Uk and Vk
is strongest for the pair (U1 , V1 ) and decreases with increasing index k. We will measure the
strength of the relationship by the absolute value of the covariance, so we require that
| cov(U1 , V1 )| ≥ | cov(U2 , V2 )| ≥ · · · ≥ | cov(Ur , Vr )| > 0.

The between covariance matrix Σ12 of (3.1) is the link between the vectors X[1] and X[2].
It turns out to be more useful to consider a standardised version of this matrix rather than
the matrix itself.
Definition 3.2 Let X[1] ∼ (μ1, Σ1) and X[2] ∼ (μ2, Σ2), and assume that Σ1 and Σ2 are
invertible. Let Σ12 be the between covariance matrix of X[1] and X[2]. The matrix of
canonical correlations or the canonical correlation matrix is

   C = Σ1^{−1/2} Σ12 Σ2^{−1/2},                                          (3.3)

and the matrices of multivariate coefficients of determination are

   R[C,1] = C Cᵀ   and   R[C,2] = Cᵀ C.                                  (3.4)


In Definition 3.2 we assume that the covariance matrices Σ1 and Σ2 are invertible. If
Σ1 is singular with rank r, then we may want to replace Σ1 by its spectral decomposition
Γ1,r Λ1,r Γ1,rᵀ and Σ1^{−1/2} by Γ1,r Λ1,r^{−1/2} Γ1,rᵀ, and similarly for Σ2.
A little reflection reveals that C is a generalisation of the univariate correlation coeffi-
cient to multivariate random vectors. Theorem 2.17 of Section 2.6.1 concerns the matrix of
correlation coefficients R, which is the covariance matrix of the scaled vector Xscale . The
entries of this d × d matrix R are the correlation coefficients arising from pairs of variables
of X. The canonical correlation matrix C compares each variable of X[1] with each variable
of X[2] because the focus is on the between covariance or the correlation between X[1] and
X[2] . So C has d1 × d2 entries.
We may interpret the matrices R[C,1] and R[C,2] as two natural generalisations of the
coefficient of determination in Linear Regression; R[C,1] = Σ1^{−1/2} Σ12 Σ2^{−1} Σ12ᵀ Σ1^{−1/2} is of
size d1 × d1, whereas R[C,2] = Σ2^{−1/2} Σ12ᵀ Σ1^{−1} Σ12 Σ2^{−1/2} is of size d2 × d2.
The matrix C is the object which connects the two vectors X[1] and X[2]. Because the
dimensions of X[1] and X[2] differ, C is not a square matrix. Let

   C = P ϒ Qᵀ

be its singular value decomposition. By Proposition 3.1, R[C,1] and R[C,2] have the spectral
decompositions

   R[C,1] = P ϒ² Pᵀ   and   R[C,2] = Q ϒ² Qᵀ,
where ϒ² is diagonal with diagonal entries υ1² ≥ υ2² ≥ · · · ≥ υr² > 0. For k ≤ r, we write pk
for the eigenvectors of R[C,1] and qk for the eigenvectors of R[C,2], so

   P = ( p1  p2  · · ·  pr )   and   Q = ( q1  q2  · · ·  qr ).

The eigenvectors pk and qk satisfy

   C qk = υk pk   and   Cᵀ pk = υk qk,                                   (3.5)

and because of this relationship, we call them the left and right eigenvectors of C. See
Definition 1.12 in Section 1.5.3.
Throughout this chapter I use the submatrix notation (1.21) of Section 1.5.2 and so will
write Q m for the m × r submatrix of Q, where m ≤ r , and similarly for other submatrices.
We are now equipped to define the canonical correlations.
Definition 3.3 Let X[1] ∼ (μ1, Σ1) and X[2] ∼ (μ2, Σ2). For ρ = 1, 2, let X̃[ρ] be the sphered
vector

   X̃[ρ] = Σρ^{−1/2} (X[ρ] − μρ).

Let Σ12 be the between covariance matrix of X[1] and X[2]. Let C be the matrix of canonical
correlations of X[1] and X[2], and write C = P ϒ Qᵀ for its singular value decomposition.
Consider k = 1, . . ., r.
1. The kth pair of canonical correlation scores or canonical variates is

      Uk = pkᵀ X̃[1]   and   Vk = qkᵀ X̃[2];                              (3.6)

2. the k-dimensional pair of vectors of canonical correlations or vectors of canonical
   variates is

      U(k) = ( U1, . . ., Uk )ᵀ   and   V(k) = ( V1, . . ., Vk )ᵀ;       (3.7)

3. and the kth pair of canonical (correlation) transforms is

      ϕk = Σ1^{−1/2} pk   and   ψk = Σ2^{−1/2} qk.                       (3.8)

For brevity, I sometimes refer to the pair of canonical correlation scores or vectors as CC
scores, or vectors of CC scores, or simply as canonical correlations.
I remind the reader that our definitions of the canonical correlation scores and vectors
use the centred data, unlike other treatments of this topic, which use uncentred vectors.
It is worth noting that the canonical correlation transforms of (3.8) are, in general, not
unit vectors because they are linear transforms of unit vectors, namely, the eigenvectors
pk and qk .
The canonical correlation scores are also called the canonical (correlation) variables.
Mardia, Kent, and Bibby (1992) use the term canonical correlation vectors for the ϕ k and
ψ k of (3.8). To distinguish the vectors of canonical correlations U(k) and V(k) of part 2
of the definition from the ϕk and ψk, I prefer the term transforms for the ϕk and ψk
because these vectors are transformations of the directions pk and qk and result in the pair of
scores

   Uk = ϕkᵀ (X[1] − μ1)   and   Vk = ψkᵀ (X[2] − μ2)   for k = 1, . . ., r.   (3.9)

At times, I refer to a specific pair of transforms, but typically we are interested in the first p
pairs with p ≤ r. We write

   Φ = ( ϕ1  ϕ2  · · ·  ϕr )   and   Ψ = ( ψ1  ψ2  · · ·  ψr )

for the matrices of canonical correlation transforms. The entries of the vectors ϕk and
ψk are the weights of the variables of X[1] and X[2] and so show which variables con-
tribute strongly to correlation and which might be negligible. Some authors, including
Mardia, Kent, and Bibby (1992), define the CC scores as in (3.9). Naively, one might think
of the vectors ϕk and ψk as sphered versions of the eigenvectors pk and qk, but this
is incorrect; Σ1 is the covariance matrix of X[1], and pk is the kth eigenvector of the
non-random C Cᵀ.
I prefer the definition (3.6) to (3.9) for reasons which are primarily concerned with the
interpretation of the results, namely,
1. the vectors pk and qk are unit vectors, and their entries are therefore easy to interpret,
and
2. the scores are given as linear combinations of uncorrelated random vectors, the sphered
   vectors X̃[1] and X̃[2]. Uncorrelated variables are more amenable to an interpreta-
   tion of the contribution of each variable to the correlation between the pairs Uk
   and Vk.
Being eigenvectors, the pk and qk play a natural role as directions, and they are
some of the key quantities when dealing with correlation for transformed random vectors
in Section 3.5.2 as well as in the variable ranking based on the correlation matrix which I
describe in Section 13.3.
The canonical correlation scores (3.6) and vectors (3.7) play a role similar to the PC
scores and vectors in Principal Component Analysis, and the vectors pk and qk remind us of
the eigenvector η k . However, there is a difference: Principal Component Analysis is based
on the raw or scaled data, whereas the vectors pk and qk relate to the sphered data. This
difference is exhibited in (3.6) but is less apparent in (3.9). The explicit nature of (3.6) is
one of the reasons why I prefer (3.6) as the definition of the scores.
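As a minimal illustration of Definitions 3.2 and 3.3, the following NumPy sketch computes C, its singular value decomposition and the canonical transforms from covariance blocks Σ1, Σ12 and Σ2. The numerical values of the blocks are made up purely for illustration; the sketch is a sketch only, not a general implementation.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix S."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

# Hypothetical covariance blocks with d1 = 3 and d2 = 2 (illustrative values only).
Sigma1 = np.array([[2.0, 0.5, 0.3],
                   [0.5, 1.5, 0.2],
                   [0.3, 0.2, 1.0]])
Sigma2 = np.array([[1.0, 0.4],
                   [0.4, 2.0]])
Sigma12 = np.array([[0.6, 0.2],
                    [0.3, 0.5],
                    [0.1, 0.4]])

C = inv_sqrt(Sigma1) @ Sigma12 @ inv_sqrt(Sigma2)       # canonical correlation matrix (3.3)
P, upsilon, Qt = np.linalg.svd(C, full_matrices=False)
Q = Qt.T

Phi = inv_sqrt(Sigma1) @ P      # canonical transforms phi_k, eq. (3.8)
Psi = inv_sqrt(Sigma2) @ Q      # canonical transforms psi_k
print("canonical correlations:", upsilon)
```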
Before we leave the population case, we compare the between covariance matrix Σ12 and
the canonical correlation matrix in a specific case.

Example 3.1 The car data is a subset of the 1983 ASA Data Exposition of
Ramos and Donoho (1983). We use their five continuous variables: displacement, horse-
power, weight, acceleration and miles per gallon (mpg). The first three variables correspond
to physical properties of the cars, whereas the remaining two are performance-related. We
combine the first three variables into one part and the remaining two into the second part.
The random vectors Xi[1] have the variables displacement, horsepower, and weight, and the
random vectors Xi[2] have the variables acceleration and mpg. We consider the sample vari-
ance and covariance matrix of the X[ρ] in lieu of the respective population quantities. We
obtain the 3 × 2 matrices

   Σ12 = ( −157.0    −657.6
            −73.2    −233.9
           −976.8   −5517.4 )

and

   C = ( −0.3598   −0.1131
         −0.5992   −0.0657
         −0.1036   −0.8095 ).                                            (3.10)

Both matrices have negative entries, but the entries are very different: the between covari-
ance matrix 12 has entries of arbitrary size. In contrast, the entries of C are correlation
coefficients and are in the interval [ − 1, 0] in this case and, more generally, in [ − 1, 1]. As
a consequence, C explicitly reports the strength of the relationship between the variables.
Weight and mpg are most strongly correlated with an entry of −5,517.4 in the covariance
matrix, and −0.8095 in the correlation matrix. Although −5,517.4 is a large negative value,
it does not lead to a natural interpretation of the strength of the relationship between the
variables.
An inspection of C shows that the strongest absolute correlation exists between the vari-
ables weight and mpg. In Section 3.4 we examine whether a combination of variables will
lead to a stronger correlation.

3.3 Sample Canonical Correlations


In Example 3.1, I calculate the covariance matrix from data because we do not know the
true population covariance structure. In this section, I define canonical correlation concepts
for data. At the end of this section, we return to Example 3.1 and calculate the CC scores.
The sample definitions are similar to those of the preceding section, but because we are
dealing with a sample and do not know the true means and covariances, there are important
differences. Table 3.1 summarises the key quantities for both the population and the sample.
We begin with some notation for the sample. For ρ = 1, 2, let
 
   X[ρ] = ( X1[ρ]  X2[ρ]  · · ·  Xn[ρ] )

be dρ × n data which consist of n independent dρ-dimensional random vectors Xi[ρ]. The data
X[1] and X[2] usually have a different number of variables, but measurements on the same n
objects are carried out for X[1] and X[2]. This fact is essential for the type of comparison we
want to make.
We assume that the Xi[ρ] have sample mean X̄ρ and sample covariance matrix Sρ.
Sometimes it will be convenient to consider the combined data. We write

   X = ( X[1]
         X[2] ) ∼ Sam(X̄, S),

so X is a d × n matrix with d = d1 + d2, sample mean

   X̄ = ( X̄1
         X̄2 )

and sample covariance matrix

   S = ( S1     S12
         S12ᵀ   S2  ).

Here S12 is the d1 × d2 (sample) between covariance matrix of X[1] and X[2] defined by

   S12 = 1/(n − 1) ∑_{i=1}^{n} (Xi[1] − X̄1)(Xi[2] − X̄2)ᵀ.               (3.11)

Unless otherwise stated, in this chapter, Sρ has rank dρ for ρ = 1, 2, and r ≤ min(d1, d2) is
the rank of S12.
Definition 3.4 Let X[1] ∼ Sam(X̄1, S1) and X[2] ∼ Sam(X̄2, S2), and let S12 be their sample
between covariance matrix. Assume that S1 and S2 are non-singular. The matrix of sample
canonical correlations or the sample canonical correlation matrix is

   Ĉ = S1^{−1/2} S12 S2^{−1/2},

and the pair of matrices of sample multivariate coefficients of determination are

   R̂[C,1] = Ĉ Ĉᵀ   and   R̂[C,2] = Ĉᵀ Ĉ.                                (3.12)


As in (1.9) of Section 1.3.2, we use the subscript cent to refer to the centred data. With
this notation, the d1 × d2 matrix

   Ĉ = ( Xcent[1] (Xcent[1])ᵀ )^{−1/2} ( Xcent[1] (Xcent[2])ᵀ ) ( Xcent[2] (Xcent[2])ᵀ )^{−1/2},   (3.13)

and Ĉ has the singular value decomposition

   Ĉ = P̂ ϒ̂ Q̂ᵀ,

where

   P̂ = ( p̂1  p̂2  · · ·  p̂r )   and   Q̂ = ( q̂1  q̂2  · · ·  q̂r ),

and r is the rank of S12 and hence also of Ĉ. In the population case I mentioned that we
may want to replace Σ1^{−1/2} of a singular Σ1, with rank r < d1, by Γ1,r Λ1,r^{−1/2} Γ1,rᵀ. We may
want to make an analogous replacement in the sample case.
In the population case, we define the canonical correlation scores Uk and Vk of the vectors
X[1] and X[2]. The sample CC scores will be vectors of size n – similar to the PC sample
scores – with one value for each observation.


Definition 3.5 Let X[1] ∼ Sam(X̄1, S1) and X[2] ∼ Sam(X̄2, S2). For ρ = 1, 2, let XS[ρ] be the
sphered data. Let S12 be the sample between covariance matrix and Ĉ the sample canonical
correlation matrix of X[1] and X[2]. Write P̂ ϒ̂ Q̂ᵀ for the singular value decomposition of Ĉ.
Consider k = 1, . . ., r, with r the rank of S12.
1. The kth pair of canonical correlation scores or canonical variates is

      U•k = p̂kᵀ XS[1]   and   V•k = q̂kᵀ XS[2];

2. the k-dimensional canonical correlation data or data of canonical variates consist of
   the first k pairs of canonical correlation scores:

      U(k) = ( U•1 )              ( V•1 )
             (  ⋮  )   and  V(k) = (  ⋮  );
             ( U•k )              ( V•k )

3. and the kth pair of canonical (correlation) transforms consists of

      ϕ̂k = S1^{−1/2} p̂k   and   ψ̂k = S2^{−1/2} q̂k.
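The quantities of Definitions 3.4 and 3.5 can be computed directly from the two data matrices. The NumPy sketch below assumes the dρ × n data convention used in this chapter and works on synthetic (made-up) data; it returns the singular values of Ĉ, that is, the sample canonical correlations, together with the n pairs of CC scores. The function name sample_cca is my own label for this sketch.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def sample_cca(X1, X2):
    """Sample canonical correlations for d1 x n and d2 x n data (in the spirit of Definition 3.5)."""
    n = X1.shape[1]
    X1c = X1 - X1.mean(axis=1, keepdims=True)        # centred data
    X2c = X2 - X2.mean(axis=1, keepdims=True)
    S1 = X1c @ X1c.T / (n - 1)                       # sample covariance matrices
    S2 = X2c @ X2c.T / (n - 1)
    S12 = X1c @ X2c.T / (n - 1)                      # between covariance matrix, eq. (3.11)
    Chat = inv_sqrt(S1) @ S12 @ inv_sqrt(S2)         # sample canonical correlation matrix
    P, upsilon, Qt = np.linalg.svd(Chat, full_matrices=False)
    U = P.T @ inv_sqrt(S1) @ X1c                     # CC scores U_{.k}, one row per k
    V = Qt @ inv_sqrt(S2) @ X2c                      # CC scores V_{.k}
    return upsilon, U, V

# Usage with made-up data: 3 + 2 variables observed on the same n = 200 objects.
rng = np.random.default_rng(1)
X1 = rng.standard_normal((3, 200))
X2 = 0.5 * X1[:2] + rng.standard_normal((2, 200))    # induce some correlation
upsilon, U, V = sample_cca(X1, X2)
print(upsilon)                                       # sample canonical correlations
```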


When we go from the population to the sample case, the size of some of the objects
changes: we go from a d-dimensional random vector to data of size d × n. For the param-
eters of the random variables, such as the mean or the covariance matrix, no such change
in dimension occurs because the sample parameters are estimators of the true parameters.
Similarly, the eigenvectors of the sample canonical correlation matrix Ĉ are estimators of
the corresponding population eigenvectors and have the same dimension as the population
eigenvectors. However, the scores are defined for each observation, and we therefore obtain
n pairs of scores for data compared with a single pair of scores for the population.
For a row vector of scores of n observations, we write

   V•k = ( V1k  V2k  · · ·  Vnk ),

which is similar to the notation for the PC vector of scores in (2.8) of Section 2.3. As in
the PC case, V•k is a vector whose first subscript, •, runs over all n observations and thus
contains the kth CC score of all n observations of X[2]. The second subscript, k, refers to the
index or numbering of the scores. The kth canonical correlation matrices U(k) and V(k) have
dimensions k × n; that is, the jth row summarises the contributions of the jth CC scores for
all n observations. Using parts 1 and 2 of Definition 3.5, U(k) and V(k) are

   U(k) = P̂kᵀ XS[1]   and   V(k) = Q̂kᵀ XS[2],                            (3.14)

where P̂k contains the first k vectors of the matrix P̂, and Q̂k contains the first k vectors of
the matrix Q̂.
The relationship between pairs of canonical correlation scores and canonical transforms
[ρ]
for the sample is similar to that of the population – see (3.9). For the centred data Xcent and
k ≤ r , we have
T
U•k = ϕ
T [1]
 k Xcent and  X[2] .
V•k = ψ (3.15)
k cent

Table 3.1 summarises the population and sample quantities for the CC framework.
Small examples can be useful in seeing how a method works. We start with the five-
dimensional car data and then consider the canonical correlation scores for data with more
variables.
Table 3.1 Relationships of population and sample canonical correlations: dρ = d1
for the first vector/data and dρ = d2 for the second vector/data.

                        Population               Random Sample
Random variables        X[1], X[2]   dρ × 1      X[1], X[2]    dρ × n
kth CC scores           Uk, Vk       1 × 1       U•k, V•k      1 × n
CC vector/data          U(k), V(k)   k × 1       U(k), V(k)    k × n

Example 3.2 We continue with the 392 observations of the car data, and as in Example 3.1,
X[1] consists of the first three variables and X[2] of the remaining two. The sample between
covariance matrix and the sample canonical correlation matrix are shown in (3.10), and we
now refer to them as S12 and Ĉ, respectively. The singular values of Ĉ are 0.8782 and 0.6328,
and the left and right eigenvectors of Ĉ are

   P̂ = ( p̂1  p̂2  p̂3 ) = ( 0.3218    0.3948   −0.8606
                           0.4163    0.7574    0.5031
                           0.8504   −0.5202    0.0794 )

and

   Q̂ = ( q̂1  q̂2 ) = ( −0.5162   −0.8565
                       −0.8565    0.5162 ).

The sample canonical transforms, which are obtained from the eigenvectors, are

   Φ̂ = ( ϕ̂1  ϕ̂2  ϕ̂3 ) = ( 0.0025    0.0048   −0.0302
                            0.0202    0.0409    0.0386
                            0.0000   −0.0027    0.0020 )

and

   Ψ̂ = ( ψ̂1  ψ̂2 ) = ( −0.1666   −0.3637
                       −0.0916    0.1078 ).                              (3.16)

The rank of Ĉ is two. Because X[1] has three variables, MATLAB gives a third eigenvector,
which – together with the other two eigenvectors – forms a basis for R³. Typically, we do
not require this additional vector because the rank determines the number of pairs of scores
we consider.
The signs of the vectors are identical for P̂ and Φ̂ (and Q̂ and Ψ̂, respectively), but the
entries of P̂ and Φ̂ (and similarly of Q̂ and Ψ̂) differ considerably. In this case the entries of
Φ̂ and Ψ̂ are much smaller than the corresponding entries of the eigenvector matrices.
The X[2] data are two-dimensional, so we obtain two pairs of vectors and CC scores.
Figure 3.1 shows two-dimensional scatterplots of the scores, with the first scores in the left
panel and the second scores in the right panel. The x-axis displays the scores for X[1] , and the
y-axis shows the scores for X[2] . Both scatterplots have positive relationships. This might
appear contrary to the negative entries in the between covariance matrix and the canoni-
cal correlation matrix of (3.10). Depending on the sign of the eigenvectors, the CC scores
could have a positive or negative relationship. It is customary to show them with a positive
Figure 3.1 Canonical correlation scores of Example 3.2: (left panel) first scores; (right panel)
second scores with X[1] values on the x-axis, and X[2] values on the y-axis.

relationship. By doing so, comparisons are easier, but we have lost some information: we
are no longer able to tell from these scatterplots whether the original data have a positive or
negative relationship.
We see from Figure 3.1 that the first scores in the left panel form a tight curve, whereas
the second scores are more spread out and so are less correlated. The sample correlation
coefficients for the first and second scores are 0.8782 and 0.6328, respectively. The first value,
0.8782, is considerably higher than the largest entry, 0.8095, of the canonical correlation
matrix (3.10). Here the best combinations are

   0.0025 ∗ (displacement − 194.4) + 0.0202 ∗ (horsepower − 104.5) + 0.000025 ∗ (weight − 2977.6)

and

   −0.1666 ∗ (acceleration − 15.54) − 0.0916 ∗ (mpg − 23.45).

The coefficients are those obtained from the canonical transforms in (3.16). Although the
last entry in the first column of the matrix Φ̂ is zero to the first four decimal places, the
actual value is 0.000025.
The analysis shows that strong correlation exists between the two data sets. By consid-
ering linear combinations of the variables in each data set, the strength of the correlation
between the two parts can be further increased, which shows that the combined physi-
cal properties of cars are very strongly correlated with the combined performance-based
properties.

The next example, the Australian illicit drug market data, naturally split into two parts
and are thus suitable for a canonical correlation analysis.

Example 3.3 We continue with the illicit drug market data for which seventeen different
series have been measured over sixty-six months. Gilmour and Koch (2006) show that the
data split into two distinct groups, which they call the direct measures and the indirect mea-
sures of the illicit drug market. The two groups are listed in separate columns in Table 3.2.
The series numbers are given in the first and third columns of the table.
The direct measures are less likely to be affected by external forces such as health or
law enforcement policy and economic factors but are more vulnerable to direct effects on
the markets, such as successful intervention in the supply reduction and changes in the
Table 3.2 Direct and indirect measures of the illicit drug market
Series Direct measures X[1] Series Indirect measures X[2]
1 Heroin possession offences 4 Prostitution offences
2 Amphetamine possession offences 7 PSB reregistrations
3 Cocaine possession offences 8 PSB new registrations
5 Heroin overdoses (ambulance) 12 Robbery 1
6 ADIS heroin 17 Steal from motor vehicles
9 Heroin deaths
10 ADIS cocaine
11 ADIS amphetamines
13 Amphetamine overdoses
14 Drug psychoses
15 Robbery 2
16 Break and enter dwelling
Note: From Example 3.3. ADIS refers to the Alcohol and Drug Information
Service, and ADIS heroin/cocaine/amphetamine refers to the number of calls to
ADIS by individuals concerned about their own or another’s use of the stated
drug. PSB registrations refers to the number of individuals registering for
pharmacotherapy.

availability or purity of the drugs. The twelve variables of X[1] are the direct measures of
the market, and the remaining five variables are the indirect measures and make up X[2] .
In this analysis we use the raw data. A calculation of the correlation coefficients between
all pairs of variables in X[1] and X[2] yields the highest single correlation of 0.7640 between
amphetamine possession offences (series 2) and steal from motor vehicles (series 17). The
next largest coefficient of 0.6888 is obtained for ADIS heroin (series 6) and PSB new regis-
trations (series 8). Does the overall correlation between the two data sets increase when
we consider combinations of variables within X[1] and X[2] ? The result of a canonical
correlation analysis yields five pairs of CC scores with correlation coefficients
   0.9543   0.8004   0.6771   0.5302   0.3709.
The first two of these coefficients are larger than the correlation between amphetamine
possession offences and steal from motor vehicles. The first CC score is based almost equally
on ADIS heroin and amphetamine possession offences from among the X[1] variables. The
X[2] variables with the largest weights for the first CC score are, in descending order,
PSB new registrations, steal from motor vehicles and robbery 1. The other variables have
much smaller weights. As in the previous example, the weights are those of the canonical
transforms ϕ̂1 and ψ̂1.
It is interesting to note that the two variables amphetamine possession offences and steal
from motor vehicles, which have the strongest single correlation, do not have the high-
est absolute weights in the first CC score. Instead, the pair of variables ADIS heroin and
PSB new registrations, which have the second-largest single correlation coefficient, are the
strongest contributors to the first CC scores. Overall, the correlation increases from 0.7640
to 0.9543 if the other variables from both data sets are taken into account.
The scatterplots corresponding to the CC data U(5) and V(5) are shown in Figure 3.2 starting
with the most strongly correlated pair (U•1, V•1) in the top-left panel. The last subplot in
the bottom-right panel shows the scatterplot of amphetamine possession offences versus
Figure 3.2 Canonical correlation scores of Example 3.3. CC scores 1 to 5 and best single
variables in the bottom right panel. X[1] values are shown on the x-axis and X[2] values on the
y-axis.

steal from motor vehicles. The variables corresponding to X[1] are displayed on the x-axis,
and the X[2] variables are shown on the y-axis. The progression of scatterplots shows the
decreasing strength of the correlation between the combinations as we go from the first to the
fifth pair.
The analysis shows that there is a very strong, almost linear relationship between the direct
and indirect measures of the illicit drug market, which is not expected from an inspection
of the correlation plots of pairs of variables (not shown here). This relationship far exceeds
that of the ‘best’ individual variables, amphetamine possession offences and steal from motor
vehicles, and shows that the two parts of the data are very strongly correlated.

Remark. The singular values of C and Ĉ, being positive square roots of their respective
multivariate coefficients of determination, reflect the strength of the correlation between
the scores. If we want to know whether the combinations of variables are positively or
negatively correlated, we have to calculate the actual correlation coefficient between the
two linear combinations.

3.4 Properties of Canonical Correlations


In this section we consider properties of the CC scores and vectors which include optimal-
ity results for the CCs. Our first result relates the correlation coefficients, such as those
calculated in Example 3.3, to properties of the matrix of canonical correlations, C.

Theorem 3.6 Let X[1] ∼ (μ1, Σ1) and X[2] ∼ (μ2, Σ2). Let Σ12 be the between covariance
matrix of X[1] and X[2], and let r be its rank. Let C be the matrix of canonical correlations
with singular value decomposition C = P ϒ Qᵀ. For k, ℓ = 1, . . ., r, let U(k) and V(ℓ) be the
canonical correlation vectors of X[1] and X[2], respectively.
1. The mean and covariance matrix of the stacked vector ( U(k); V(ℓ) ) are

      E ( U(k) ) = ( 0k )        and      var ( U(k) ) = ( Ik×k    ϒk×ℓ )
        ( V(ℓ) )   ( 0ℓ )                     ( V(ℓ) )   ( ϒk×ℓᵀ   Iℓ×ℓ ),

   where ϒk×ℓ is the k × ℓ submatrix of ϒ which consists of the ‘top-left corner’ of ϒ.
2. The variances and covariances of the canonical correlation scores Uk and Vℓ are

      var (Uk) = var (Vℓ) = 1   and   cov (Uk, Vℓ) = ±υk δkℓ,

   where υk is the kth singular value of ϒ, and δkℓ is the Kronecker delta function.

This theorem shows that the object of interest is the submatrix ϒk×ℓ. We explore this
matrix further in the following corollary.

Corollary 3.7 Assume that for ρ = 1, 2, the X[ρ] satisfy the conditions of Theorem 3.6. Then
the covariance matrix of U(k) and V(ℓ) is

   cov ( U(k), V(ℓ) ) = cor ( U(k), V(ℓ) ),

where cor ( U(k), V(ℓ) ) is the matrix of correlation coefficients of U(k) and V(ℓ).

Proof of Theorem 3.6 For k, ℓ = 1, . . ., r, let Pk be the submatrix of P which consists of the
first k left eigenvectors of C, and let Qℓ be the submatrix of Q which consists of the first ℓ
right eigenvectors of C.
For part 1 of the theorem and k, recall that U(k) = Pkᵀ Σ1^{−1/2} (X[1] − μ1), so

   E U(k) = Pkᵀ Σ1^{−1/2} E (X[1] − μ1) = 0k,

and similarly, E V(ℓ) = 0ℓ.
The calculations for the covariance matrix consist of two parts: the separate variance
calculations for U(k) and V(ℓ) and the between covariance matrix cov ( U(k), V(ℓ) ). All vectors
have zero means, which simplifies the variance calculations. Consider V(ℓ). We have

   var ( V(ℓ) ) = E [ V(ℓ) (V(ℓ))ᵀ ]
               = E [ Qℓᵀ Σ2^{−1/2} (X[2] − μ2)(X[2] − μ2)ᵀ Σ2^{−1/2} Qℓ ]
               = Qℓᵀ Σ2^{−1/2} Σ2 Σ2^{−1/2} Qℓ = Iℓ×ℓ.

In these equalities we have used the fact that var (X[2]) = E [ X[2] (X[2])ᵀ ] − μ2 μ2ᵀ = Σ2 and
that the matrix Qℓ consists of ℓ orthonormal vectors.

The between covariance matrix of U(k) and V(ℓ) is

   cov ( U(k), V(ℓ) ) = E [ U(k) (V(ℓ))ᵀ ]
                      = E [ Pkᵀ Σ1^{−1/2} (X[1] − μ1)(X[2] − μ2)ᵀ Σ2^{−1/2} Qℓ ]
                      = Pkᵀ Σ1^{−1/2} E [ (X[1] − μ1)(X[2] − μ2)ᵀ ] Σ2^{−1/2} Qℓ
                      = Pkᵀ Σ1^{−1/2} Σ12 Σ2^{−1/2} Qℓ
                      = Pkᵀ C Qℓ = Pkᵀ P ϒ Qᵀ Qℓ
                      = Ik×r ϒ Ir×ℓ = ϒk×ℓ.

In this sequence of equalities we have used the singular value decomposition P ϒ Qᵀ of
C and the fact that Pkᵀ P = Ik×r. A similar relationship is used in the proof of Theorem 2.6
in Section 2.5. Part 2 follows immediately from part 1 because ϒ is a diagonal matrix with
non-zero entries υk for k = 1, . . ., r.
The theorem is stated for random vectors, but an analogous result applies to random data:
the CC vectors become CC matrices, and the matrix ϒk×ℓ is replaced by the sample covari-
ance matrix ϒ̂k×ℓ. This is an immediate consequence of dealing with a random sample. I
illustrate Theorem 3.6 with an example.

Example 3.4 We continue with the illicit drug market data and use the direct and indirect
measures of the market as in Example 3.3. The singular value decomposition Ĉ = P̂ ϒ̂ Q̂ᵀ
yields the five singular values

   υ̂1 = 0.9543,   υ̂2 = 0.8004,   υ̂3 = 0.6771,   υ̂4 = 0.5302,   υ̂5 = 0.3709.

The five singular values of Ĉ agree with (the moduli of) the correlation coefficients cal-
culated in Example 3.3, thus confirming the result of Theorem 3.6, stated there for the
population. The biggest singular value is very close to 1, which shows that there is a very
strong relationship between the direct and indirect measures.

Theorem 3.6 and its corollary state properties of the CC scores. In the next proposition
we examine the relationship between the X[ρ] and the vectors U and V.
Proposition 3.8 Let X[1] ∼ (μ1, Σ1), X[2] ∼ (μ2, Σ2). Let C be the canonical correlation
matrix with rank r and singular value decomposition P ϒ Qᵀ. For k, ℓ ≤ r, let U(k) and V(ℓ)
be the k- and ℓ-dimensional canonical correlation vectors of X[1] and X[2]. The random
vectors and their canonical correlation vectors satisfy

   cov ( X[1], U(k) ) = Σ1^{1/2} Pk   and   cov ( X[2], V(ℓ) ) = Σ2^{1/2} Qℓ.

The proof of Proposition 3.8 is deferred to the Problems at the end of Part I.
In terms of the canonical transforms, by (3.8) the equalities stated in Proposition 3.8 are

   cov ( X[1], U(k) ) = Σ1 Φk   and   cov ( X[2], V(ℓ) ) = Σ2 Ψℓ.        (3.17)

The equalities (3.17) look very similar to the covariance relationship between the random
vector X and its principal component vector W(k) which we considered in Proposition 2.8,
namely,

   cov ( X, W(k) ) = Σ Γk = Γk Λk,                                       (3.18)

with Σ = Γ Λ Γᵀ. In (3.17) and (3.18), the relationship between the random vector X[ρ]
or X and its CC or PC vector is described by a matrix, which is related to eigenvectors
and the covariance matrix of the appropriate random vector. A difference between the two
expressions is that the columns of Γk are the eigenvectors of Σ, whereas the columns of Φk
and Ψℓ are multiples of the left and right eigenvectors of the between covariance matrix Σ12
which satisfy

   Σ1^{−1} Σ12 ψk = υk ϕk   and   Σ2^{−1} Σ12ᵀ ϕk = υk ψk,               (3.19)

where υk is the kth singular value of C.


The next corollary looks at uncorrelated random vectors with covariance matrices Σρ = I
and non-trivial between covariance matrix Σ12.
Corollary 3.9 If X[ρ] ∼ (0dρ, Idρ×dρ), for ρ = 1, 2, the canonical correlation matrix C
reduces to the covariance matrix Σ12, and the matrices P and Q agree with the matrices of
canonical transforms Φ and Ψ respectively.
This result follows from the variance properties of the random vectors because

   C = Σ1^{−1/2} Σ12 Σ2^{−1/2} = Id1×d1 Σ12 Id2×d2 = Σ12.

For random vectors with the identity covariance matrix, the corollary tells us that C and Σ12
agree, which is not the case in general. Working with the matrix C has the advantage that
its entries are scaled and therefore more easily interpretable. In some areas or applications,
including climate research and Partial Least Squares, the covariance matrix Σ12 is used
directly to find the canonical transforms.
So far we have explored the covariance structure of pairs of random vectors and their
CCs. The reason for constructing the CCs is to obtain combinations of X[1] and X[2] which
are more strongly correlated than the individual variables taken separately. In our examples
we have seen that correlation decreases for the first few CC scores. The next result provides
a theoretical underpinning for these observations.
Theorem 3.10 For ρ = 1, 2, let X[ρ] ∼ (μρ, Σρ), with dρ the rank of Σρ. Let Σ12 be the
between covariance matrix and C the canonical correlation matrix of X[1] and X[2]. Let r
be the rank of C, and for j ≤ r, let (pj, qj) be the left and right eigenvectors of C. Let X̃[ρ]
be the sphered vector derived from X[ρ]. For unit vectors u ∈ R^{d1} and v ∈ R^{d2}, put

   c(u,v) = cov ( uᵀ X̃[1], vᵀ X̃[2] ).

1. It follows that c(u,v) = uᵀ C v.


2. If c(u∗,v∗) maximises the covariance c(u,v) over all unit vectors u ∈ R^{d1} and v ∈ R^{d2}, then

      u∗ = ±p1,   v∗ = ±q1   and   c(u∗,v∗) = υ1,

   where υ1 is the largest singular value of C, which corresponds to the eigenvectors p1
   and q1.
3. Fix 1 < k ≤ r. Consider unit vectors u ∈ R^{d1} and v ∈ R^{d2} such that
   (a) uᵀ X̃[1] is uncorrelated with pjᵀ X̃[1], and
   (b) vᵀ X̃[2] is uncorrelated with qjᵀ X̃[2],
   for j < k. If c(u∗,v∗) maximises c(u,v) over all such unit vectors u and v, then

      u∗ = ±pk,   v∗ = ±qk   and   c(u∗,v∗) = υk.
Proof To show part 1, consider unit vectors u and v which satisfy the assumptions of the
theorem. From the definition of the canonical correlation matrix, it follows that

   c(u,v) = cov ( uᵀ Σ1^{−1/2} (X[1] − μ1), vᵀ Σ2^{−1/2} (X[2] − μ2) )
          = E [ uᵀ Σ1^{−1/2} (X[1] − μ1)(X[2] − μ2)ᵀ Σ2^{−1/2} v ]
          = uᵀ Σ1^{−1/2} Σ12 Σ2^{−1/2} v = uᵀ C v.
To see why part 2 holds, consider unit vectors u and v as in part 1. For j = 1, . . ., d1, let pj
be the left eigenvectors of C. Because Σ1 has full rank,

   u = ∑j αj pj   with αj ∈ R and ∑j αj² = 1,

and similarly,

   v = ∑k βk qk   with βk ∈ R and ∑k βk² = 1,

where the qk are the right eigenvectors of C. From part 1, it follows that

   uᵀ C v = ( ∑j αj pj )ᵀ C ( ∑k βk qk )
          = ∑j,k αj βk pjᵀ C qk
          = ∑j,k αj βk υk pjᵀ pk          by (3.5)
          = ∑j αj βj υj.

The last equality follows because the pj are orthonormal, so pjᵀ pk = δjk, where δjk is the
Kronecker delta function. Next, we use the fact that the singular values υj are positive and
ordered, with υ1 the largest. Observe that

   | uᵀ C v | = | ∑j αj βj υj | ≤ υ1 | ∑j αj βj |
             ≤ υ1 ∑j |αj βj| ≤ υ1 ( ∑j |αj|² )^{1/2} ( ∑j |βj|² )^{1/2} = υ1.

Here we have used Hölder's inequality, which links the sum ∑j |αj βj| to the product of the
norms ( ∑j |αj|² )^{1/2} and ( ∑j |βj|² )^{1/2}. For details, see, for example, theorem 5.4 in Pryce
(1973). Because u and v are unit vectors and the pj and qj are orthonormal, the norms are one.
The maximum is attained when α1 = ±1, β1 = ±1, and αj = βj = 0 for j > 1.
Table 3.3 Variables of the Boston housing data from Example 3.5, where + indicates that the
quantity is calculated as a centred proportion

Environmental and social measures X[1]:
  Per capita crime rate by town
  Proportion of non-retail business acres per town
  Nitric oxide concentration (parts per 10 million)
  Weighted distances to Boston employment centres
  Index of accessibility to radial highways
  Pupil-teacher ratio by town
  Proportion of blacks by town+

Individual measures X[2]:
  Average number of rooms per dwelling
  Proportion of owner-occupied units built prior to 1940
  Full-value property-tax rate per $10,000
  Median value of owner-occupied homes in $1000s

This choice of coefficients implies that u∗ = ±p1 and v∗ = ±q1 , as desired.


For a proof of part 3, one first shows the result for k = 2. The proof is a combination of
the proof of part 2 and the arguments used in the proof of part 2 of Theorem 2.10 in Section
2.5.2. For k > 2, the proof works in almost the same way as for k = 2.
It is not possible to demonstrate the optimality of the CC scores in an example, but
examples illustrate that the correlation of the first pair of CC scores is larger than the
correlation coefficients of individual variables.
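Although optimality cannot be proved by an example, it can be checked numerically. The following Monte Carlo sketch uses the canonical correlation matrix of the car data, rounded as reported in (3.10), and confirms that no randomly drawn pair of unit vectors gives a covariance |uᵀCv| above the largest singular value, which for the rounded matrix is approximately 0.878, in line with Example 3.2.

```python
import numpy as np

# Canonical correlation matrix of the car data, rounded values from (3.10).
C = np.array([[-0.3598, -0.1131],
              [-0.5992, -0.0657],
              [-0.1036, -0.8095]])
upsilon1 = np.linalg.svd(C, compute_uv=False)[0]       # approximately 0.878

rng = np.random.default_rng(0)
U = rng.standard_normal((100_000, 3))
U /= np.linalg.norm(U, axis=1, keepdims=True)          # random unit vectors u
V = rng.standard_normal((100_000, 2))
V /= np.linalg.norm(V, axis=1, keepdims=True)          # random unit vectors v

covs = np.abs(np.einsum('ij,jk,ik->i', U, C, V))       # |u^T C v| for each random pair
print(covs.max() <= upsilon1 + 1e-12, covs.max(), upsilon1)
```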

Example 3.5 The Boston housing data of Harrison and Rubinfeld (1978) consist of 506
observations with fourteen variables. The variables naturally fall into two categories: those
containing information regarding the individual, such as house prices, and property tax; and
those which deal with environmental or social factors. For this reason, the data are suitable
for a canonical correlation analysis. Some variables are binary; I have omitted these in this
analysis. The variables I use are listed in Table 3.3. (The variable names in the table are
those of Harrison and Rubinfeld, rather than names I would have chosen.)
There are four variables in X[2] , so we obtain four CC scores. The scatterplots of the CC
scores are displayed in Figure 3.3 starting with the first CC scores on the left. For each
scatterplot, the environmental/social variables are displayed on the x-axis and the individual
measures on the y-axis.
The four singular values of the canonical correlation matrix are

   υ̂1 = 0.9451,   υ̂2 = 0.6787,   υ̂3 = 0.5714,   υ̂4 = 0.2010.

These values decrease quickly. Of special interest are the first CC scores, which are shown
in the left subplot of Figure 3.3. The first singular value is higher than any correlation
coefficient calculated from the individual variables.
A singular value as high as 0.9451 expresses a very high correlation, which seems at first
not consistent with the spread of the first scores. There is a reason for this high value: The
first pair of scores consists of two distinct clusters, which behave like two points. So the
large singular value reflects the linear relationship of the two clusters rather than a tight
fit of the scores to a line. It is not clear without further analysis whether there is a strong
positive correlation within each cluster. We also observe that the cluster structure is only
present in the scatterplot of the first CC scores.
Figure 3.3 Canonical correlation scores of Example 3.5. CC scores 1 to 4 with environmental
variables on the x-axis.

3.5 Canonical Correlations and Transformed Data


It is a well-known fact that random variables of the form aX + b and cY + d (with ac ≠ 0)
have the same absolute correlation coefficient as the original random variables X and Y .
We examine whether similar relationships hold in the multivariate context. For this purpose,
we derive the matrix of canonical correlations and the CCs for transformed random vectors.
Such transformations could be the result of exchange rates over time on share prices or a
reduction of the original data to a simpler form.
Transformations based on thresholds could result in binary data, and indeed, Canonical
Correlation Analysis works for binary data. I will not pursue this direction but focus on
linear transformations of the data and Canonical Correlation Analysis for such data.

3.5.1 Linear Transformations and Canonical Correlations


Let ρ = 1, 2. Consider the random vectors X[ρ] ∼ (μρ, Σρ). Let Aρ be κρ × dρ matrices
with κρ ≤ dρ, and let aρ be fixed κρ-dimensional vectors. We define the transformed random
vector T by

   T = ( T[1]
         T[2] )   with   T[ρ] = Aρ X[ρ] + aρ   for ρ = 1, 2.             (3.20)

We begin with properties of transformed vectors.


Theorem 3.11 Let ρ = 1, 2 and X[ρ] ∼ (μρ, Σρ). Let Σ12 be the between covariance matrix
of X[1] and X[2]. Let Aρ be κρ × dρ matrices with κρ ≤ dρ, and let aρ be fixed κρ-dimensional
vectors. Put T[ρ] = Aρ X[ρ] + aρ and

   T = ( T[1]
         T[2] ).

1. The mean of T is

      E T = ( A1 μ1 + a1
              A2 μ2 + a2 ).

2. The covariance matrix of T is

      var (T) = ( A1 Σ1 A1ᵀ      A1 Σ12 A2ᵀ
                  A2 Σ12ᵀ A1ᵀ    A2 Σ2 A2ᵀ  ).

3. If, for ρ = 1, 2, the matrices Aρ Σρ Aρᵀ are non-singular, then the canonical correlation
   matrix CT of T[1] and T[2] is

      CT = ( A1 Σ1 A1ᵀ )^{−1/2} ( A1 Σ12 A2ᵀ ) ( A2 Σ2 A2ᵀ )^{−1/2}.
Proof Part 1 follows by linearity, and the expressions for the covariance matrices follow
from Result 1.1 in Section 1.3.1. The calculation of the covariance matrices of T[1] and T[2]
and the canonical correlation matrix are deferred to the Problems at the end of Part I.
As for the original vectors X[ρ], we construct the canonical correlation scores from
the matrix of canonical correlations; so for the transformed vectors we use CT. Let
CT = PT ϒT QTᵀ be the singular value decomposition, and assume that the matrices ΣT,ρ =
Aρ Σρ Aρᵀ are invertible for ρ = 1, 2. The canonical correlation scores are the projections of
the sphered vectors onto the left and right eigenvectors pT,k and qT,k of CT:

   UT,k = pT,kᵀ ΣT,1^{−1/2} (T[1] − E T[1])   and   VT,k = qT,kᵀ ΣT,2^{−1/2} (T[2] − E T[2]).   (3.21)

Applying Theorem 3.6 to the transformed vectors, we find that

   cov ( UT,k, VT,ℓ ) = ±υT,k δkℓ,

where υT,k is the kth singular value of ϒT, and k, ℓ ≤ min{κ1, κ2}.
If the matrices Aρ are singular, the singular values of ϒ and ϒT, and similarly, the
CC scores of the original and transformed vectors, can differ, as the next example shows.
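The effect of a dimension-reducing transformation can be illustrated with Theorem 3.11, part 3: the canonical correlations of T[1] and T[2] are obtained from the transformed covariance blocks Aρ Σρ Aρᵀ and A1 Σ12 A2ᵀ. The sketch below is a minimal illustration only; the data and the row-reducing matrices Aρ with κρ < dρ are made up, and, as in Example 3.6, the canonical correlations of the reduced data may differ from those of the original data.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def canonical_correlations(S1, S12, S2):
    """Singular values of the canonical correlation matrix built from covariance blocks."""
    C = inv_sqrt(S1) @ S12 @ inv_sqrt(S2)
    return np.linalg.svd(C, compute_uv=False)

# Made-up data with 3 + 2 correlated variables observed on the same n = 500 objects.
rng = np.random.default_rng(3)
Z = rng.standard_normal((5, 500))
X = np.vstack([Z[:3] + 0.2 * rng.standard_normal((3, 500)),
               Z[:2] + 0.2 * rng.standard_normal((2, 500))])
S = np.cov(X)
S1, S12, S2 = S[:3, :3], S[:3, 3:], S[3:, 3:]

A1 = rng.standard_normal((2, 3))      # kappa_1 = 2 < d_1 = 3: a dimension-reducing map
A2 = rng.standard_normal((1, 2))      # kappa_2 = 1 < d_2 = 2
print(canonical_correlations(S1, S12, S2))                                  # original
print(canonical_correlations(A1 @ S1 @ A1.T, A1 @ S12 @ A2.T, A2 @ S2 @ A2.T))  # transformed
```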

Example 3.6 We continue with the direct and indirect measures of the illicit drug
market data which have twelve and five variables in X[1] and X[2], respectively. Figure 3.2
of Example 3.3 shows the scores of all five pairs of CCs.
To illustrate Theorem 3.11, we use the first four principal components of X[1] and X[2] as
our transformed data. So, for ρ = 1, 2,

   X[ρ] −→ T[ρ] = Γ̂ρ,4ᵀ (X[ρ] − X̄ρ).
The scores of the four pairs of CCs of the T[ρ] are shown in the top row of Figure 3.4 with
the scores of T[1] on the x-axis. The bottom row of the figure shows the first four CC pairs
of the original data for comparison. It is interesting to see that all four sets of transformed
scores are less correlated than their original counterparts.
Table 3.4 contains the correlation coefficients for the transformed and raw data for a more
quantitative comparison. The table confirms the visual impression obtained from Figure 3.4:
the correlation strength of the transformed CC scores is considerably lower than that of the
original CC scores. Such a decrease does not have to happen, but it is worth reflecting on
why it might happen.
The direction vectors in a principal component analysis of X[1] and X[2] are chosen
to maximise the variability within these data. This means that the first and subsequent
eigenvectors point in the direction with the largest variability. In this case, the variables
break and enter (series 16) and steal from motor vehicles (series 17) have the highest PC1
weights for X[1] and X[2] , respectively, largely because these two series have much higher
values than the remaining series, and a principal component analysis will therefore find
Table 3.4 Correlation coefficients of the scores in Figure 3.4, from Example 3.6.
Canonical correlations
Transformed 0.8562 0.7287 0.3894 0.2041 —
Original 0.9543 0.8004 0.6771 0.5302 0.3709

Figure 3.4 Canonical correlation scores of Example 3.6. CC scores 1 to 4 of transformed data
(top row) and raw data (bottom row).

these variables first. In contrast, the canonical transforms maximise a different criterion: the
between covariance matrix of X[1] and X[2] . Because the criteria differ, the direction vectors
differ too; the canonical transforms are best at exhibiting the strongest relationships between
different parts of data.

The example shows the effect of transformations on the scores and the strength of their
correlations. We have seen that the strength of the correlations decreases for the PC data.
Principal Component Analysis effectively reduces the dimension, but by doing so, important
structure in the data may be lost or obscured. In the preceding example, the drop in the highest
correlation between the two parts of the reduced data illustrates the loss in correlation strength
that the transformation has caused.
As we will see in later chapters, too, Principal Component Analysis is an effective
dimension-reduction method, but structure may be obscured in the process. Whether this
structure is relevant needs to be considered in each case.

3.5.2 Transforms with Non-Singular Matrices


The preceding section demonstrates that the CCs of the original data can differ from those
of the transformed data for linear transformations with singular matrices. In this section we
focus on non-singular matrices. Theorem 3.11 remains unchanged, but we are now able to
explicitly compare the CCs of the original and the transformed vectors. The key properties of
the transformed CCs are presented in Theorem 3.12. The results are useful in their own right
but also show some interesting features of CCs, the associated eigenvectors and canonical
transforms. Because Theorem 3.12 is of particular interest for data, I summarise the data
results and point out relevant changes from the population case.
Theorem 3.12 For ρ = 1, 2, let X[ρ] ∼ (μρ, Σρ), and assume that the Σρ are non-singular
with rank dρ. Let C be the canonical correlation matrix of X[1] and X[2], and let r be the
rank of C and P ϒ Qᵀ its singular value decomposition. Let Aρ be non-singular matrices of
size dρ × dρ, and let aρ be dρ-dimensional vectors. Put

   T[ρ] = Aρ X[ρ] + aρ.

Let CT be the canonical correlation matrix of T[1] and T[2], and write CT = PT ϒT QTᵀ for
its singular value decomposition. The following hold:
1. CT and C have the same singular values, and hence ϒT = ϒ.
2. For k, ℓ ≤ r, the kth left and the ℓth right eigenvectors pT,k and qT,ℓ of CT and the cor-
   responding canonical transforms ϕT,k and ψT,ℓ of the T[ρ] are related to the analogous
   quantities of the X[ρ] by

      pT,k = ( A1 Σ1 A1ᵀ )^{1/2} (A1ᵀ)^{−1} Σ1^{−1/2} pk   and   qT,ℓ = ( A2 Σ2 A2ᵀ )^{1/2} (A2ᵀ)^{−1} Σ2^{−1/2} qℓ,

      ϕT,k = (A1ᵀ)^{−1} ϕk   and   ψT,ℓ = (A2ᵀ)^{−1} ψℓ.

3. The kth and ℓth canonical correlation scores of T are

      UT,k = pkᵀ Σ1^{−1/2} (X[1] − μ1)   and   VT,ℓ = qℓᵀ Σ2^{−1/2} (X[2] − μ2),

   and their covariance matrix is

      var ( UT(k) ) = ( Ik×k    ϒk×ℓ )
          ( VT(ℓ) )   ( ϒk×ℓᵀ   Iℓ×ℓ ).

The theorem states that the strength of the correlation is the same for the original and
transformed data. The weights which combine the raw or transformed data may, how-
ever, differ. Thus the theorem establishes the invariance of canonical correlations under
non-singular linear transformations, and it shows this invariance by comparing the singular
values and CC scores of the original and transformed data. We find that
• the singular values of the canonical correlation matrices of the random vectors and the
  transformed vectors are the same,
• the canonical correlation scores of the random vectors and the transformed random
  vectors are identical (up to a sign), that is,

     UT,k = Uk   and   VT,ℓ = Vℓ,

  for k, ℓ = 1, . . ., r, and
• consequently, the covariance matrix of the CC scores remains the same, namely,

     cov ( UT,k, VT,ℓ ) = cov ( Uk, Vℓ ) = υk δkℓ.
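The invariance stated in Theorem 3.12 is easy to check numerically. In the sketch below the covariance blocks and the non-singular matrices Aρ are randomly generated purely for illustration; the singular values of C and CT agree up to rounding error.

```python
import numpy as np

def inv_sqrt(S):
    """Symmetric inverse square root of a positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

rng = np.random.default_rng(4)
d1, d2 = 4, 3
M = rng.standard_normal((d1 + d2, d1 + d2))
Sigma = M @ M.T                                        # a random positive definite joint covariance
S1, S12, S2 = Sigma[:d1, :d1], Sigma[:d1, d1:], Sigma[d1:, d1:]

A1 = rng.standard_normal((d1, d1))                     # non-singular with probability 1
A2 = rng.standard_normal((d2, d2))

C  = inv_sqrt(S1) @ S12 @ inv_sqrt(S2)
CT = inv_sqrt(A1 @ S1 @ A1.T) @ (A1 @ S12 @ A2.T) @ inv_sqrt(A2 @ S2 @ A2.T)

print(np.linalg.svd(C, compute_uv=False))
print(np.linalg.svd(CT, compute_uv=False))             # same singular values (up to rounding)
```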


Before we look at a proof of Theorem 3.12, we consider what changes occur when we
deal with transformed data

T[ρ] = Aρ X[ρ] + aρ for ρ = 1, 2.

Going from the population to the sample, we replace the true parameters by their estimators.
So the means are replaced by their sample means, the covariance matrices Σ by the sample
covariance matrices S and the canonical correlation matrix CT by ĈT. The most noticeable
difference is the change from the pairs of scalar canonical correlation scores to pairs of
vectors of length n when we consider data.
I present the proof of Theorem 3.12, because it reveals important facts and relationships.
To make the proof more transparent, I begin with some notation and then prove two lemmas.
For X[ρ], T[ρ], C and CT as in Theorem 3.12, put

   R[C] = C Cᵀ      and   K  = [ var (X[1]) ]^{−1/2} R[C]  [ var (X[1]) ]^{1/2};
   RT[C] = CT CTᵀ   and   KT = [ var (T[1]) ]^{−1/2} RT[C] [ var (T[1]) ]^{1/2}.   (3.22)

A comparison with (3.4) shows that I have omitted the second superscript ‘1’ in R[C]. In
the current proof we refer to C Cᵀ and so only make the distinction when necessary. The
sequence

   CT ←→ RT[C] ←→ KT ←→ K ←→ R[C] ←→ C                                   (3.23)

will be useful in the proof of Theorem 3.12 because the theorem makes statements about the
endpoints CT and C in the sequence. As we shall see in the proofs, relationships about KT
and K are the starting points because we can show that they are similar matrices.

Lemma 1 Assume that the X[ρ] satisfy the assumptions of Theorem 3.12. Let υ1 > υ2
> · · · > υr be the singular values of C. The following hold.
1. The matrices R[C], K, RT[C] and KT as in (3.22) have the same eigenvalues

      υ1² > υ2² > · · · > υr².

2. The singular values of CT coincide with those of C.

Proof To prove the statements about the eigenvalues and singular values, we will make
repeated use of the fact that similar matrices have the same eigenvalues; see Result 1.8 of
Section 1.5.1. So our proof needs to establish similarity relationships between matrices.
The aim is to relate the singular values of CT and C. As it is not easy to do this directly,
we travel along the path in (3.23) and exhibit relationships between the neighbours in (3.23).
By Proposition 3.1, R^[C] has positive eigenvalues, and the singular values of C are the positive square roots of the eigenvalues of R^[C]. A similar relationship holds for R_T^[C] and C_T. These relationships deal with the two ends of the sequence (3.23).
The definition of K implies that it is similar to R^[C], so K and R^[C] have the same eigenvalues. An analogous result holds for K_T and R_T^[C]. It remains to establish the similarity of K and K_T. This last similarity will establish that the singular values of C and C_T are identical.
We begin with K. We substitute the expression for R^[C] and re-write K as follows:

K = Σ_1^{-1/2} R^[C] Σ_1^{1/2}
  = Σ_1^{-1/2} Σ_1^{-1/2} Σ_12 Σ_2^{-1/2} Σ_2^{-1/2} Σ_12^T Σ_1^{-1/2} Σ_1^{1/2}
  = Σ_1^{-1} Σ_12 Σ_2^{-1} Σ_12^T.   (3.24)
A similar expression holds for K_T. It remains to show that K and K_T are similar. To do this, we go back to the definition of K_T and use the fact that A_ρ and var(T) are invertible. Now

K_T = [var(T[1])]^{-1} cov(T[1], T[2]) [var(T[2])]^{-1} cov(T[1], T[2])^T
    = (A_1 Σ_1 A_1^T)^{-1} (A_1 Σ_12 A_2^T) (A_2 Σ_2 A_2^T)^{-1} (A_2 Σ_12^T A_1^T)
    = (A_1^T)^{-1} Σ_1^{-1} A_1^{-1} A_1 Σ_12 A_2^T (A_2^T)^{-1} Σ_2^{-1} A_2^{-1} A_2 Σ_12^T A_1^T
    = (A_1^T)^{-1} Σ_1^{-1} Σ_12 Σ_2^{-1} Σ_12^T A_1^T
    = (A_1^T)^{-1} K A_1^T.   (3.25)
The second equality in (3.25) uses the variance results of part 2 of Theorem 3.11. To show
the last equality, use (3.24). The sequence of equalities establishes the similarity of the two
matrices.
So far we have shown that the four terms in the middle of the sequence (3.23) are similar matrices and so have the same eigenvalues. This proves part 1 of the lemma. Because the singular values of C_T are the square roots of the eigenvalues of R_T^[C], C_T and C have the same singular values.
Lemma 2 Assume that the X[ρ] satisfy the assumptions of Theorem 3.12. Let υ > 0 be a singular value of C with corresponding left eigenvector p. Define R^[C], K and K_T as in (3.22).

1. If r is the eigenvector of R^[C] corresponding to υ, then r = p.
2. If s is the eigenvector of K corresponding to υ, then

s = Σ_1^{-1/2} p / ‖Σ_1^{-1/2} p‖   and   p = Σ_1^{1/2} s / ‖Σ_1^{1/2} s‖.

3. If s_T is the eigenvector of K_T corresponding to υ, then

s_T = (A_1^T)^{-1} s / ‖(A_1^T)^{-1} s‖   and   s = A_1^T s_T / ‖A_1^T s_T‖.

Proof Part 1 follows directly from Proposition 3.1 because the left eigenvectors of C are
the eigenvectors of R [C] . To show part 2, we establish relationships between appropriate
eigenvectors of objects in the sequence (3.23).
We first exhibit relationships between eigenvectors of similar matrices. For this purpose,
let B and D be similar matrices which satisfy B = E D E −1 for some matrix E. Let λ be an
eigenvalue of D and hence also of B. Let e be the eigenvector of B which corresponds to λ.


We have
B e = λ e = E D E^{-1} e.

Pre-multiplying by the matrix E^{-1} leads to

E^{-1} B e = λ E^{-1} e = D E^{-1} e.

Let η be the eigenvector of D which corresponds to λ. The uniqueness of the eigenvalue–eigenvector decomposition implies that E^{-1} e is a scalar multiple of the eigenvector η of D. This last fact leads to the relationships

η = (1/c_1) E^{-1} e   or equivalently   e = c_1 E η   for some real c_1,   (3.26)

and E therefore is the link between the eigenvectors. Unless E is an isometry, c1 is required
because eigenvectors in this book are vectors of norm 1.
We return to the matrices R^[C] and K. Fix k ≤ r, the rank of C. Let υ be the kth eigenvalue of R^[C] and hence also of K, and consider the eigenvectors p of R^[C] and s of K which correspond to υ. Because K = [var(X[1])]^{-1/2} R^[C] [var(X[1])]^{1/2}, (3.26) implies that

p = c_2 [var(X[1])]^{1/2} s = c_2 Σ_1^{1/2} s,

for some real c_2. Now p has unit norm, so c_2^{-1} = ‖Σ_1^{1/2} s‖, and the result follows. A similar calculation leads to the results in part 3.
We return to Theorem 3.12 and prove it with the help of the two lemmas.
Proof of Theorem 3.12 Part 1 follows from Lemma 1. For part 2, we need to find rela-
tionships between the eigenvectors of C and CT . We obtain this relationship via the
sequence (3.23) and with the help of Lemma 2. By part 1 of Lemma 2 it suffices to consider
the sequence
R_T^[C] ←→ K_T ←→ K ←→ R^[C].

We start with the eigenvectors of R_T^[C]. Fix k ≤ r. Let υ^2 be the kth eigenvalue of R_T^[C] and hence also of K_T, K and R^[C]. Let p_T and p be the corresponding eigenvectors of R_T^[C] and R^[C], and s_T and s those of K_T and K, respectively. We start with the pair (p_T, s_T). From the definitions (3.22), we obtain
p_T = c_1 [var(T[1])]^{1/2} s_T
    = c_1 c_2 [var(T[1])]^{1/2} (A_1^T)^{-1} s
    = c_1 c_2 c_3 [var(T[1])]^{1/2} (A_1^T)^{-1} Σ_1^{-1/2} p

by parts 2 and 3 of Lemma 2, where the constants c_i are appropriately chosen. Put c = c_1 c_2 c_3. We find the value of c by calculating the norm of p̃ = [var(T[1])]^{1/2} (A_1^T)^{-1} Σ_1^{-1/2} p. In the
next calculation, I omit the subscript and superscript 1 in T, A and Σ. Now,

‖p̃‖^2 = p^T Σ^{-1/2} A^{-1} (A Σ A^T)^{1/2} (A Σ A^T)^{1/2} (A^T)^{-1} Σ^{-1/2} p
      = p^T Σ^{-1/2} A^{-1} (A Σ A^T) (A^T)^{-1} Σ^{-1/2} p
      = p^T Σ^{-1/2} Σ Σ^{-1/2} p = ‖p‖^2 = 1

follows from the definition of var(T) and the fact that (A Σ A^T)^{1/2} (A Σ A^T)^{1/2} = A Σ A^T. The calculations show that c = ±1, thus giving the desired result.
For the eigenvectors q_T and q, we base the calculations on R^[C,2] = C^T C and recall that the eigenvectors of C^T C are the right eigenvectors of C. This establishes the relationship between q_T and q. The results for canonical transforms follow from the preceding calculations and the definition of the canonical transforms in (3.8).
Part 3 is a consequence of the definitions and the results established in parts 1 and 2. I
now derive the results for T[2] . Fix k ≤ r . I omit the indices k for the eigenvector and the
superscript 2 in T[2], X[2] and the matrices A and Σ. From (3.6), we find that

V_T = q_T^T [var(T)]^{-1/2} (T − E T).   (3.27)
We substitute the expressions for the mean and covariance matrix, established in Theorem 3.11, and the expression for q_T from part 2 of the current theorem, into (3.27). It follows that

V_T = q^T Σ^{-1/2} A^{-1} (A Σ A^T)^{1/2} (A Σ A^T)^{-1/2} (A X + a − A μ − a)
    = q^T Σ^{-1/2} A^{-1} A (X − μ)
    = q^T Σ^{-1/2} (X − μ) = V,

where V is the corresponding CC score of X. Of course, V_T = −q^T Σ^{-1/2} (X − μ) is also a solution because eigenvectors are unique only up to a sign. The remainder follows from Theorem 3.6 because the CC scores of the raw and transformed vectors are the same.
In the proof of Theorem 3.12 we explicitly use the fact that the transformations Aρ are
non-singular. If this assumption is violated, then the results may no longer hold. I illustrate
Theorem 3.12 with an example.

Example 3.7 The income data are an extract from a survey in the San Francisco Bay Area
based on more than 9,000 questionnaires. The aim of the survey is to derive a prediction of
the annual household income from the other demographic attributes. The income data are
also used in Hastie, Tibshirani, and Friedman (2001).
Some of the fourteen variables are not suitable for our purpose. We consider the nine
variables listed in Table 3.5 and the first 1,000 records, excluding records with missing
data. Some of these nine variables are categorical, but in the analysis I will not distinguish
between the different types of variables. The purpose of this analysis is to illustrate the
effect of transformations of the data, and we are not concerned here with interpretations or
effect of individual variables. I have split the variables into two groups: X[1] are the personal
attributes, other than income, and X[2] are the household attributes, with income as the first
variable. The raw data are shown in the top panel of Figure 3.5, with the variables shown
Table 3.5 Variables of the income data from Example 3.7

Personal X[1]          Household X[2]
Marital status         Income
Age                    No. in household
Level of education     No. under 18
Occupation             Householder status
                       Type of home

Figure 3.5 Income data from Example 3.7: (top) raw data; (bottom) transformed data.

on the x-axis, starting with the variables of X[1], followed by those of X[2] in the order they are listed in Table 3.5.
It is not easy to understand or interpret the parallel coordinate plot of the raw data. The
lack of clarity is a result of the way the data are coded: large values for income repre-
sent a large income, whereas the variable occupation has a ‘one’ for ‘professional’, and its
largest positive integer refers to ‘unemployed’; hence occupation is negatively correlated
with income. A consequence is the criss-crossing of the lines in the top panel.
We transform the data in order to disentangle this crossing over. Put a = 0 and

A = diag( 2.0, 1.4, 1.6, −1.2, 1.1, 1.1, 1.1, −5.0, −2.5 ).
The transformation X → AX scales the variables and changes the sign of variables such as
occupation. The transformed data are displayed in the bottom panel of Figure 3.5. Vari-
ables 4, 8, and 9 have smaller values than the others, a consequence of the particular
transformation I have chosen.
The matrix of canonical correlations has singular values 0.7762, 0.4526, 0.3312, and 0.1082, and these coincide with the singular values of the transformed canonical correlation
matrix. The entries of the first normalised canonical transforms for both raw and transformed
data are given in Table 3.6. The variable age has the highest weight for both the raw and
transformed data, followed by education. Occupation has the smallest weight and opposite
signs for the raw and transformed data. The change in sign is a consequence of the negative
entry in A for occupation. Householder status has the highest weight among the X[2] vari-
ables and so is most correlated with the X[1] data. This is followed by the income variable.
Table 3.6 First raw and transformed normalised canonical transforms from Example 3.7

X[1]              φ̂ raw    φ̂ trans    X[2]                 ψ̂ raw    ψ̂ trans
Marital status    0.4522    0.3461    Income              −0.1242   −0.4565
Age              −0.6862   −0.7502    No. in household     0.1035    0.3802
Education        −0.5441   −0.5205    No. under 18         0.0284    0.1045
Occupation        0.1690   −0.2155    Householder status   0.9864   −0.7974
                                      Type of home        −0.0105    0.0170

Figure 3.6 Contributions of CC scores along first canonical transforms from Example 3.7: (top row) raw data; (bottom row) transformed data. The X[1] plots are shown in the left panels and the X[2] plots on the right.

Again, we see that the signs of the weights change for negative entries of A, here for the
variables householder status and type of home.
Figure 3.6 shows the information given in Table 3.6, and in particular highlights the change in sign of the weights of the canonical transforms. The figure shows the contributions of the first CC scores in the direction of the first canonical transforms, that is, parallel coordinate plots of φ̂_1 U_{•1} for X[1] and ψ̂_1 V_{•1} for X[2] with the variable numbers on the x-axis. The X[1] plots are displayed in the left panels and the corresponding X[2] plots in the right panels. The top row shows the raw data, and the bottom row shows the transformed data.
The plots show clearly where a change in sign occurs in the entries of the canonical
transforms: the lines cross over. The sign change between variables 3 and 4 of φ̂ is apparent
in the raw data but no longer exists in the transformed data. Similar sign changes exist for
the X[2] plots. Further, because of the larger weights of the first two variables of the X[2]
transformed data, these two variables have much more variability for the transformed data.
It is worth noting that the CC scores of the raw and transformed data agree because the
matrices Sρ and A are invertible. Hence, as stated in part 3 of Theorem 3.12, the CC scores
are invariant under this transformation.
For the income data, we applied a transformation to the data, but in other cases the data
may only be available in transformed form. Example 3.7 shows the differences between
the analysis of the raw and transformed data. If the desired result is the strength of the
correlation between combinations of variables, then the transformation is not required. If
a more detailed analysis is appropriate, then the raw and transformed data allow different
insights into the data. The correlation analysis only shows the strength of the relationship
and not the sign, and the decrease rather than an increase of a particular variable could be
important.
The transformation of Example 3.6 is based on a singular matrix, and as we have seen
there, the CCs are not invariant under the transformation. In Example 3.7, A is non-singular,
and the CCs remain the same. Thus the simple univariate case does not carry across to the
multivariate scenario in general, and care needs to be taken when working with transformed
data.

3.5.3 Canonical Correlations for Scaled Data


Scaling of a random vector or data decreases the effect of variables whose scale is much
larger than that of the other variables. In Principal Component Analysis, variables with large
values dominate and can hide important information in the process. Scaling such variables
prior to a principal component analysis is often advisable.
In this section we explore scaling prior to a Canonical Correlation Analysis. Scaling is a
linear transformation, and Theorems 3.11 and 3.12 therefore apply to scaled data.
For ρ = 1, 2, let X[ρ] ∼ (μ_ρ, Σ_ρ), and let Σ_{diag,ρ} be the diagonal matrix as in (2.16) of Section 2.6.1. Then

X_scale^[ρ] = Σ_{diag,ρ}^{-1/2} ( X[ρ] − μ_ρ )

is the scaled vector. Similarly, for data X[ρ] ∼ Sam(X̄_ρ, S_ρ) and diagonal matrix S_{diag,ρ}, the scaled data are

X_scale^[ρ] = S_{diag,ρ}^{-1/2} ( X[ρ] − X̄_ρ ).

Using the transformation set-up, the scaled vector is

T[ρ] = Σ_{diag,ρ}^{-1/2} ( X[ρ] − μ_ρ ),   (3.28)

with

A_ρ = Σ_{diag,ρ}^{-1/2}   and   a_ρ = −Σ_{diag,ρ}^{-1/2} μ_ρ.

If the covariance matrices Σ_ρ of X[ρ] are invertible, then, by Theorem 3.12, the CC scores of the scaled vector are the same as those of the original vector, but the eigenvectors p_1 of C and p̂_{T,1} of Ĉ_T differ, as we shall see in the next example. In the Problems at the end of Part I we derive an expression for the canonical correlation matrix of the scaled data and interpret it.
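A short Python sketch (synthetic data only, not the illicit drug market data; the variable names are mine) illustrates the effect of the scaling transformation (3.28): the canonical correlations remain the same, while the first left eigenvectors of the canonical correlation matrices of the raw and the scaled data differ.

    import numpy as np

    def cc_matrix(X1, X2):
        # sample canonical correlation matrix C = S1^(-1/2) S12 S2^(-1/2)
        d1 = X1.shape[0]
        S = np.cov(np.vstack([X1, X2]))
        S1, S2, S12 = S[:d1, :d1], S[d1:, d1:], S[:d1, d1:]
        def inv_sqrt(M):
            w, V = np.linalg.eigh(M)
            return V @ np.diag(w ** -0.5) @ V.T
        return inv_sqrt(S1) @ S12 @ inv_sqrt(S2)

    rng = np.random.default_rng(1)
    X1 = rng.standard_normal((5, 300)) * np.array([1, 10, 0.1, 2, 5])[:, None]   # very different scales
    X2 = X1[:3] + rng.standard_normal((3, 300))                                  # correlated second part

    # scaling corresponds to A_rho = (S_diag,rho)^(-1/2), a diagonal, non-singular matrix
    Z1 = X1 / X1.std(axis=1, ddof=1, keepdims=True)
    Z2 = X2 / X2.std(axis=1, ddof=1, keepdims=True)

    P, v, _ = np.linalg.svd(cc_matrix(X1, X2))
    PT, vT, _ = np.linalg.svd(cc_matrix(Z1, Z2))
    print(np.allclose(v, vT))    # True: the canonical correlations are unchanged
    print(P[:, 0])               # first left eigenvector, raw data
    print(PT[:, 0])              # first left eigenvector, scaled data: different weights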

Example 3.8 We continue with the direct and indirect measures of the illicit drug market
data and focus on the weights of the first CC vectors for the raw and scaled data. Table 3.7
Table 3.7 First left and right eigenvectors of the canonical correlation matrix from Example 3.8

Variable no.       1      2      3      4      5      6
Raw: p̂_1         0.30  −0.58  −0.11   0.39   0.47   0.02
Scaled: p̂_{T,1}  0.34  −0.54  −0.30   0.27   0.38   0.26

Variable no.       7      8      9     10     11     12
Raw: p̂_1        −0.08  −0.31  −0.13  −0.10  −0.24   0.07
Scaled: p̂_{T,1} −0.19  −0.28  −0.26   0.01  −0.18   0.03

Variable no.       1      2      3      4      5      —
Raw: q̂_1        −0.25  −0.22   0.49  −0.43  −0.68    —
Scaled: q̂_{T,1} −0.41  −0.28   0.50  −0.49  −0.52    —

lists the entries of the first left and right eigenvectors p̂_1 and q̂_1 of the original data and p̂_{T,1} and q̂_{T,1} of the scaled data. The variables in the table are numbered 1 to 12 for the direct measures X[1] and 1 to 5 for the indirect measures X[2]. The variable names are given in Table 3.2.
For X[2] , the signs of the eigenvector weights and their ranking (in terms of absolute
value) are the same for the raw and scaled data. This is not the case for X[1] .
The two pairs of variables with the largest absolute weights deserve further comment. For
the X[1] data, variable 2 (amphetamine possession offences) has the largest absolute weight,
and variable 5 (ADIS heroin) has the second-largest weight, and this order is the same for
the raw and scaled data. The two largest absolute weights for the X[2] data belong to variable
5 (steal from motor vehicles) and variable 3 (PSB new registrations). These four variables
stand out in our previous analysis in Example 3.3: amphetamine possession offences and
steal from motor vehicles have the highest single correlation coefficient, and ADIS heroin
and PSB new registrations have the second highest. Further, the highest contributors to the
canonical transforms φ̂_1 and ψ̂_1 are also these four variables, but in this case in opposite order, as we noted in Example 3.3. These observations suggest that these four variables are
jointly responsible for the correlation behaviour of the data.
A comparison with the CC scores obtained in Example 3.6 leads to interesting observa-
tions. The first four PC transformations of Example 3.6 result in different CC scores and
different correlation coefficients from those obtained in the preceding analysis. Further, the
two sets of CC scores obtained from the four-dimensional PC data differ depending on
whether the PC transformations are applied to the raw or scaled data. If, on the other hand, a
canonical correlation analysis is applied to all dρ PCs, then the derived CC scores are related
to the sphered PC vectors by an orthogonal transformation. We derive this orthogonal matrix E in the Problems at the end of Part I.

In light of Theorem 3.12 and the analysis in Example 3.8, it is worth reflecting on the
circumstances under which a canonical correlation analysis of PCs is advisable. If the main
focus of the analysis is the examination of the relationship between two parts of the data,
then a prior partial principal component analysis could decrease the effect of variables which
do not contribute strongly to variance but which might be strongly related to the other part of
the data. On the other hand, if the original variables are ranked as described in Section 13.3,
then a correlation analysis of the PCs can decrease the effect of noise variables in the
analysis.

3.5.4 Maximum Covariance Analysis


In geophysics and climatology, patterns of spatial dependence between different types of
geophysical measurements are the objects of interest. The observations are measured on a
number of quantities from which one wants to extract the most coherent patterns. This type
of problem fits naturally into the framework of Canonical Correlation Analysis. Tradition-
ally, however, the geophysics and related communities have followed a slightly different
path, known as Maximum Covariance Analysis.
We take a brief look at Maximum Covariance Analysis without going into details. As
in Canonical Correlation Analysis, in Maximum Covariance Analysis one deals with two
distinct parts of a vector or data and aims at finding the strongest relationship between
the two parts. The fundamental object is the between covariance matrix, which is analysed
directly.
For ρ = 1, 2, let X[ρ] be d_ρ-variate random vectors. Let Σ_12 be the between covariance matrix of the two vectors, with singular value decomposition Σ_12 = E D F^T and rank r. We define r-variate coefficient vectors A and B by

A = E^T X[1]   and   B = F^T X[2]

for suitable matrices E and F. The vectors A and B are analogous to the canonical cor-
for suitable matrices E and F. The vectors A and B are analogous to the canonical cor-
relation vectors and so could be thought of as ‘covariance scores’. The pair ( A1 , B1 ), the
first entries of A and B, are most strongly correlated. Often the coefficient vectors A and B
are normalised. The normalised pairs of coefficient vectors are further analysed and used to
derive patterns of spatial dependence.
For data X[1] and X[2] , the sample between covariance matrix S12 replaces the between
covariance matrix 12 , and the coefficient vectors become coefficient matrices whose
columns correspond to the n observations.
The basic difference between Maximum Covariance Analysis and Canonical Correlation
Analysis lies in the matrix which drives the analysis: Σ_12 is the central object in Maximum Covariance Analysis, whereas C = Σ_1^{-1/2} Σ_12 Σ_2^{-1/2} is central to Canonical Correlation
Analysis. The between covariance matrix contains the raw quantities, whereas the matrix
C has an easier statistical interpretation in terms of the strengths of the relationships. For
more information on and interpretation of Maximum Covariance Analysis in the physical
and climate sciences, see von Storch and Zwiers (1999).
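The contrast between the two approaches can be made concrete with a few lines of Python; the sketch below uses synthetic data and freely chosen names and simply compares the singular values of the between covariance matrix S_12 with those of the canonical correlation matrix C.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 500
    X1 = rng.standard_normal((6, n))                     # first part, variables in rows
    X2 = 0.8 * X1[:4] + rng.standard_normal((4, n))      # second part, correlated with X1

    d1 = X1.shape[0]
    S = np.cov(np.vstack([X1, X2]))
    S1, S2, S12 = S[:d1, :d1], S[d1:, d1:], S[:d1, d1:]

    # Maximum Covariance Analysis: SVD of the between covariance matrix S12 = E D F^T
    E, D, Ft = np.linalg.svd(S12, full_matrices=False)
    A = E.T @ X1                                         # 'covariance scores' for the first part
    B = Ft @ X2                                          # 'covariance scores' for the second part

    # Canonical Correlation Analysis: SVD of C = S1^(-1/2) S12 S2^(-1/2)
    def inv_sqrt(M):
        w, V = np.linalg.eigh(M)
        return V @ np.diag(w ** -0.5) @ V.T
    C = inv_sqrt(S1) @ S12 @ inv_sqrt(S2)

    print(D)                                             # singular values of S12: raw covariances
    print(np.linalg.svd(C, compute_uv=False))            # canonical correlations: values in [0, 1]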

3.6 Asymptotic Considerations and Tests for Correlation


An asymptotic theory for Canonical Correlation Analysis is more complex than that for
Principal Component Analysis, even for normally distributed data. The main reason for
the added complexity is the fact that Canonical Correlation Analysis involves the singular
values and pairs of eigenvectors of the matrix of canonical correlations C. In the sample
 is the product of functions of the covariance matrices S1 and S2 and the
case, the matrix C
between covariance matrix S12 . In a Gaussian setting, the matrices S1 and S2 , as well as the
combined covariance matrix

S = ( S_1  S_12 ; S_12^T  S_2 ),

converge to the corresponding true covariance matrices. Convergence of Ĉ to C further
requires the convergence of inverses of the covariance matrices and their products. We will
not pursue these convergence ideas. Instead, I only mention that under the assumptions of
normality of the random samples, the singular values and the eigenvectors of C  converge to
the corresponding true population parameters for large enough sample sizes. For details, see
Kshirsagar (1972, pp. 261ff).
The goal of Canonical Correlation Analysis is to determine the relationship between two
sets of variables. It is therefore relevant to examine whether such a relationship actually
exists, that is, we want to know whether the correlation coefficients differ significantly from
0. We consider two scenarios:

H_0: Σ_12 = 0   versus   H_1: Σ_12 ≠ 0   (3.29)

and

H_0: υ_j = υ_{j+1} = ··· = υ_r = 0   versus   H_1: υ_j > 0,   (3.30)

for some j, where the υ_j are the singular values of C = Σ_1^{-1/2} Σ_12 Σ_2^{-1/2}, and the υ_j are listed in decreasing order. Because the singular values appear in decreasing order, the second scenario is described by a sequence of tests, one for each j:

H_0^j: υ_{j+1} = 0   versus   H_1^j: υ_{j+1} > 0.

If the covariance matrices Σ_1 and Σ_2 are invertible, then the scenario (3.29) can be cast in terms of the matrix of canonical correlations C instead of Σ_12. In either case,
there is no dependence relationship under the null hypothesis, whereas in the tests of (3.30),
non-zero correlation exists, and one tests how many of the correlations differ significantly
from zero. The following theorem, given without proof, addresses both test scenarios (3.29)
and (3.30). Early results and proofs, which are based on normal data and on approximations
of the likelihood ratio, are given in Bartlett (1938) for part 1 and Bartlett (1939) for part 2,
and Kshirsagar (1972) contains a comprehensive proof of part 1 of Theorem 3.13.
To test the hypotheses (3.29) and (3.30), we use the likelihood ratio test statistic. Let L be the likelihood of the data X, and let θ be the parameter of interest. We consider the likelihood ratio test statistic Λ for testing H_0 against H_1:

Λ(X) = sup_{θ ∈ H_0} L(θ | X) / sup_{θ} L(θ | X).

For details of the likelihood ratio test statistic Λ, see chapter 8 of Casella and Berger (2001).
 
Theorem 3.13 Let ρ = 1, 2, and let X[ρ] = ( X_1^[ρ] ··· X_n^[ρ] ) be samples of independent d_ρ-dimensional random vectors such that

X = ( X[1] ; X[2] ) ∼ N(μ, Σ)   with   Σ = ( Σ_1  Σ_12 ; Σ_12^T  Σ_2 ).

Let r be the rank of Σ_12. Let S_ℓ and S be the sample covariance matrices corresponding to Σ_ℓ and Σ, for ℓ = 1, 2 and 12, and assume that Σ_1 and Σ_2 are invertible. Let υ̂_1 ≥ ··· ≥ υ̂_r be the singular values of Ĉ listed in decreasing order.

1. Let Λ_1 be the likelihood ratio statistic for testing

H_0: C = 0   versus   H_1: C ≠ 0.

Then the following hold.
(a) −2 log Λ_1(X) = n log [ det(S_1) det(S_2) / det(S) ],
(b) −2 log Λ_1(X) ≈ −( n − (d_1 + d_2 + 3)/2 ) log ∏_{j=1}^{r} ( 1 − υ̂_j^2 ).   (3.31)
(c) Further, the distribution of −2 log Λ_1(X) converges to a χ^2 distribution in d_1 × d_2 degrees of freedom as n → ∞.

2. Fix k ≤ r. Let Λ_{2,k} be the likelihood ratio statistic for testing

H_0^k: υ_1 ≠ 0, ..., υ_k ≠ 0 and υ_{k+1} = ··· = υ_r = 0

versus

H_1^k: υ_j ≠ 0 for some j ≥ k + 1.

Then the following hold.
(a) −2 log Λ_{2,k}(X) ≈ −( n − (d_1 + d_2 + 3)/2 ) log ∏_{j=k+1}^{r} ( 1 − υ̂_j^2 ),   (3.32)
(b) −2 log Λ_{2,k}(X) has an approximate χ^2 distribution in (d_1 − k) × (d_2 − k) degrees of freedom as n → ∞.
In practice, the tests of part 2 of the theorem are more common and typically they are also
applied to non-Gaussian data. In the latter case, care may be required in the interpretation of
the results. If C = 0 is rejected, then at least the largest singular value is non-zero. Obvious
starting points for the individual tests are therefore either the second-largest singular value
or the smallest. Depending on the decision of the test H01 and, respectively, H0r−1 , one may
continue with further tests. Because the tests reveal the number of non-zero singular values,
the tests can be employed for estimating the rank of C.
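The approximate statistics (3.31) and (3.32) are easy to compute once the singular values of Ĉ are available. The following Python sketch is one way of doing so; the function name and the use of scipy.stats.chi2 for an approximate p-value are my own choices, and the illustration uses the singular values found for the income data in Example 3.7 with d_1 = 4 personal and d_2 = 5 household variables.

    import numpy as np
    from scipy.stats import chi2

    def bartlett_cc_test(sv, n, d1, d2, k=0):
        # approximate likelihood ratio test of H0^k: the first k canonical correlations
        # are non-zero and the remaining ones are zero; k = 0 corresponds to (3.31),
        # k > 0 to (3.32); sv are the singular values of the sample matrix C_hat
        sv = np.asarray(sv, dtype=float)
        stat = -(n - (d1 + d2 + 3) / 2) * np.sum(np.log(1 - sv[k:] ** 2))
        dof = (d1 - k) * (d2 - k)
        return stat, dof, chi2.sf(stat, dof)

    sv = [0.7762, 0.4526, 0.3312, 0.1082]                  # singular values from Example 3.7
    print(bartlett_cc_test(sv, n=1000, d1=4, d2=5, k=0))   # test of C = 0, 20 degrees of freedom
    print(bartlett_cc_test(sv, n=1000, d1=4, d2=5, k=3))   # test of the smallest correlation, 2 df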

Example 3.9 We continue with the income data and test for non-zero canonical correla-
tions. Example 3.7 finds the singular values
0.7762, 0.4526, 0.3312, 0.1082
for the first 1,000 records. These values express the strength of the correlation between pairs
of CC scores.
Figure 3.7 Kernel density estimates of the CC scores of Example 3.9 with U_{•1} in the left panel and V_{•1} in the right panel.

An inspection of the CC scores U•1 and V•1 , in the form of kernel density estimates in
Figure 3.7, shows that these density estimates deviate considerably from the normal density.
Because the CC scores are linear combinations of the variables of X[1] and X[2] , respec-
tively, normality of the X[1] and X[2] leads to normal CC scores. Thus, strictly speaking, the
hypothesis tests of Theorem 3.13 do not apply because these tests are based on Gaussian
assumptions. We apply them to the data, but the interpretation of the test results should be
treated with caution.
In the tests, I use the significance level of 1 per cent. We begin with the test of part 1 of
Theorem 3.13. The value of the approximate likelihood ratio statistic, calculated as in (3.31),
is 1,271.9, which greatly exceeds the critical value 37.57 of the χ 2 distribution in 20 degrees
of freedom. So we reject the null hypothesis and conclude that the data are consistent with a
non-zero matrix C.
We next test whether the smallest singular value could be zero; to do this, we apply the
test of part 2 of the theorem with the null hypothesis H03 , which states that the first three
singular values are non-zero, and the last equals zero. The value of the approximate test
statistic is 11.85, which still exceeds the critical value 9.21 of the χ 2 distribution in 2 degrees
of freedom. Consequently, we conclude that υ4 could be non-zero, and so all pairs of CC
scores could be correlated.
As we know, the outcome of a test depends on the sample size. In the initial tests, I
considered all 1,000 observations. If one considers a smaller sample, the null hypothesis
is less likely to be rejected. For the income data, we now consider separately the first and
second half of the first 1,000 records. Table 3.8 contains the singular values of the com-
plete data (the 1,000 records) as well as those of the two parts. The singular values of
the two parts differ from each other and from the corresponding value of the complete
data. The smallest singular value of the first 500 observations in particular has decreased
considerably.
The test for C = 0 is rejected for both subsets, so we look at tests for individual singular
values. Both parts of the data convincingly reject the null hypothesis H02 with test statistics
above 50 and a corresponding critical value of 16.81. However, in contrast to the test on all
1000 records, the null hypothesis H03 is accepted by both parts at the 1 per cent level, with a
test statistic of 1.22 for the first part and 8.17 for the second part. The discrepancy in the deci-
sions of the test between the 1000 observations and the two parts is a consequence of (3.32),
which explicitly depends on n. For these data, n = 500 is not large enough to reject the null
hypothesis.
Table 3.8 Singular values of the matrix Ĉ from Example 3.9 for the complete data and subsets of the data

Records       υ̂_1      υ̂_2      υ̂_3      υ̂_4
1–1,000      0.7762    0.4526    0.3312    0.1082
1–500        0.8061    0.4827    0.3238    0.0496
501–1,000    0.7363    0.4419    0.2982    0.1281

As we have seen in the example, the two subsets of the data can result in different test
decisions from that obtained for the combined sample. There are a number of reasons why
this can happen.
• The test statistic is approximately proportional to the sample size n.
• For non-normal data, a larger value of n is required before the test statistic is approxi-
mately χ 2 .
• Large values of n are more likely to reject a null hypothesis.

What we have seen in Example 3.9 may be a combination of all of these.

3.7 Canonical Correlations and Regression


In Canonical Correlation Analysis, the two random vectors X[1] and X[2] or the two data
sets X[1] and X[2] play a symmetric role. This feature does not apply in a regression setting.
Indeed, one of the two data sets, say X[1] , plays the role of the explanatory or predictor
variables, whereas the second, X[2] , acquires the role of the response variables. Instead of
finding the strongest relationship between the two parts, in regression, we want to predict
the responses from the predictor variables, and the roles are not usually reversible.
In Section 2.8.2, I explain how ideas from Principal Component Analysis are adapted
to a Linear Regression setting: A principal component analysis reduces the predictor vari-
ables, and the lower-dimensional PC data are used as the derived predictor variables. The
dimension-reduction step is carried out entirely among the predictors without reference to
the response variables. Like Principal Component Analysis, Canonical Correlation Analysis
is related to Linear Regression, and the goal of this section is to understand this relationship
better.
We deviate from the CC setting of two symmetric objects and return to the notation of
Section 2.8.2: we use X instead of X[1] for the predictor variables and Y instead of X[2]
for the responses. This notation carries over to the covariance matrices. For data, we let X = ( X_1 X_2 ··· X_n ) be the d-variate predictor variables and Y = ( Y_1 Y_2 ··· Y_n ) the q-variate responses with q ≥ 1. In Linear Regression, one assumes that q ≤ d, but this restriction is not necessary in our setting. We assume a linear relationship of the form

Y = β_0 + B^T X,   (3.33)

where β_0 ∈ R^q, and B is a d × q matrix. If q = 1, B reduces to the vector β of (2.35) in


Section 2.8.2. Because the focus of this section centres on the estimation of B, unless other-
wise stated, in the remainder of Section 3.7 we assume that the predictors X and responses

Y are centred, so

X ∈ Rd , Y ∈ Rq , X ∼ (0,  X ) and Y ∼ (0, Y ).

3.7.1 The Canonical Correlation Matrix in Regression


We begin with the univariate relationship Y = β_0 + β_1 X and take β_0 = 0. For the standardised variables Y and X, the correlation coefficient measures the strength of the correlation, and the regression relationship becomes

Y = cor(X, Y) X.   (3.34)

The matrix of canonical correlations generalises the univariate correlation coefficient to a multivariate setting. If the covariance matrices of X and Y are invertible, then it is natural to explore the multivariate relationship

Y = C^T X   (3.35)

between the random vectors X and Y. Because X and Y are centred, (3.35) is equivalent to

Y = Σ_XY^T Σ_X^{-1} X.   (3.36)
In the Problems at the end of Part I, we consider the estimation of Y by the data version of (3.36), which yields

Ŷ = Y X^T ( X X^T )^{-1} X.   (3.37)

The variables of X that have low correlation with Y do not contribute much to the relationship (3.35) and may therefore be omitted or weighted down. To separate the highly correlated combinations of variables from those with low correlation, we consider approximations of C. Let C = P Υ Q^T be the singular value decomposition of C, and let r be its rank. For κ ≤ r, we use the submatrix notation (1.21) of Section 1.5.2; thus P_κ and Q_κ are the submatrices of P and Q which consist of their first κ columns, Υ_κ is the κ × κ diagonal submatrix of Υ, which consists of the first κ diagonal elements of Υ, and

C^T ≈ Q_κ Υ_κ P_κ^T.

Substituting the last approximation into (3.35), we obtain the equivalent expressions

Y ≈ Q_κ Υ_κ P_κ^T X   and   Y ≈ Σ_Y^{1/2} Q_κ Υ_κ P_κ^T Σ_X^{-1/2} X = Σ_Y Ψ_κ Υ_κ Φ_κ^T X,

where we have used the relationship (3.8) between the eigenvectors p and q of C and the canonical transforms φ and ψ, respectively. Similarly, the estimator Ŷ for data, based on κ ≤ r predictors, is

Ŷ = S_Y Ψ̂_κ Υ̂_κ Φ̂_κ^T X
  = (1/(n−1)) Y Y^T Ψ̂_κ Υ̂_κ Φ̂_κ^T X
  = (1/(n−1)) Y V^{(κ)T} Υ̂_κ U^{(κ)},   (3.38)

where the sample canonical correlation scores U^{(κ)} = Φ̂_κ^T X and V^{(κ)} = Ψ̂_κ^T Y are derived from (3.15). For the special case κ = 1, (3.38) reduces to a form similar to (2.44), namely,

Ŷ = (υ̂_1/(n−1)) Y V^{(1)T} U^{(1)}.   (3.39)
This last equality gives an expression for an estimator of Y which is derived from the first canonical correlation scores alone. Clearly, this estimator will differ from an estimator derived for κ > 1. However, if the first canonical correlation is very high, this estimator may convey most of the relevant information.
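The following Python sketch is a minimal illustration of (3.38) on synthetic centred data with observations in columns; all names are of my own choosing and κ = 2 is picked arbitrarily.

    import numpy as np

    rng = np.random.default_rng(3)
    n, d, q = 400, 6, 3
    X = rng.standard_normal((d, n))
    Y = rng.standard_normal((q, d)) @ X / d + 0.3 * rng.standard_normal((q, n))
    X -= X.mean(axis=1, keepdims=True)                    # centred predictors
    Y -= Y.mean(axis=1, keepdims=True)                    # centred responses

    def inv_sqrt(M):
        w, V = np.linalg.eigh(M)
        return V @ np.diag(w ** -0.5) @ V.T

    S = np.cov(np.vstack([X, Y]))
    SX, SY, SXY = S[:d, :d], S[d:, d:], S[:d, d:]
    C = inv_sqrt(SX) @ SXY @ inv_sqrt(SY)                 # sample canonical correlation matrix
    P, ups, Qt = np.linalg.svd(C, full_matrices=False)

    kappa = 2
    Phi = inv_sqrt(SX) @ P[:, :kappa]                     # canonical transforms of X
    Psi = inv_sqrt(SY) @ Qt.T[:, :kappa]                  # canonical transforms of Y
    U = Phi.T @ X                                         # scores U^(kappa), a kappa x n matrix
    V = Psi.T @ Y                                         # scores V^(kappa), a kappa x n matrix

    Y_hat = Y @ V.T @ np.diag(ups[:kappa]) @ U / (n - 1)  # rank-kappa prediction, as in (3.38)
    print(np.mean((Y - Y_hat) ** 2))                      # mean squared fitting error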
The next example explores the relationship (3.38) for different combinations of predictors
and responses.

Example 3.10 We continue with the direct and indirect measures of the illicit drug market
data in a linear regression framework. We regard the twelve direct measures as the pre-
dictors and consider three different responses from among the indirect measures: PSB new
registrations, robbery 1, and steal from motor vehicles. I have chosen PSB new registrations
because an accurate prediction of this variable is important for planning purposes and policy
decisions. Regarding robbery 1, commonsense tells us that it should depend on many of the
direct measures. Steal from motor vehicles and the direct measure amphetamine possession
offences exhibit the strongest single correlation, as we have seen in Example 3.3, and it is
therefore interesting to consider steal from motor vehicles as a response.
All calculations are based on the scaled data, and correlation coefficient in this example
means the absolute value of the correlation coefficient, as is common in Canonical Corre-
lation Analysis. For each of the response variables, I calculate the correlation coefficient
based on the derived predictors
1. the first PC score W(1) = η̂_1^T ( X[1] − X̄[1] ) of X[1], and
2. the first CC score U(1) = p̂_1^T X_S^[1] of X[1],

where η̂_1 is the first eigenvector of the sample covariance matrix S_1 of X[1], and p̂_1 is the first left eigenvector of the canonical correlation matrix Ĉ. Table 3.9 and Figure 3.8 show the results. Because it is interesting to know which variables contribute strongly to W(1) and U(1), I list the variables and their weights for which the entries of η̂_1 and p̂_1 exceed 0.4.
The correlation coefficient, however, is calculated from W(1) or U(1) as appropriate. The last
column of the table refers to Figure 3.8, which shows scatterplots of W(1) and U(1) on the
x-axis and the response on the y-axis.
The PC scores W(1) are calculated without reference to the response and thus lead to the
same combination (weights) of the predictor variables for the three responses. There are
four variables, all heroin-related, with absolute weights between 0.42 and 0.45, and all other
weights are much smaller. The correlation coefficient of W(1) with PSB new registration is
0.7104, which is slightly higher than 0.6888, the correlation coefficient of PSB new registra-
tion and ADIS heroin. Robbery 1 has its strongest correlation of 0.58 with cocaine possession
offences, but this variable has the much lower weight of 0.25 in W(1) . As a result the corre-
lation coefficient of robbery 1 and W(1) has decreased compared with the single correlation
of 0.58 owing to the low weight assigned to cocaine possession offences in the linear com-
bination W(1) . A similar remark applies to steal from motor vehicles; the correlation with
Table 3.9 Strength of correlation for the illicit drug market data from Example 3.10

Response          Method  Predictor variables    Eigenvector  Corr.    Figure
                                                 weights      coeff.   position
PSB new reg.      PC      Heroin poss. off.      −0.4415      0.7104   Top left
                          ADIS heroin            −0.4319
                          Heroin o/d             −0.4264
                          Heroin deaths          −0.4237
PSB new reg.      CC      ADIS heroin            −0.5796      0.8181   Bottom left
                          Drug psych.            −0.4011
Robbery 1         PC      Heroin poss. off.      −0.4415      0.4186   Top middle
                          ADIS heroin            −0.4319
                          Heroin o/d             −0.4264
                          Heroin deaths          −0.4237
Robbery 1         CC      Robbery 2               0.4647      0.8359   Bottom middle
                          Cocaine poss. off.      0.4001
Steal m/vehicles  PC      Heroin poss. off.      −0.4415      0.3420   Top right
                          ADIS heroin            −0.4319
                          Heroin o/d             −0.4264
                          Heroin deaths          −0.4237
Steal m/vehicles  CC      Amphet. poss. off.      0.7340      0.8545   Bottom right

Figure 3.8 Scatterplots for Example 3.10. The x-axis shows the PC predictors W(1) in the top row and the CC predictors U(1) in the bottom row against the responses PSB new registrations (left), robbery 1 (middle) and steal from motor vehicles (right) on the y-axis.

W(1) has decreased from the best single correlation of 0.764 to 0.342, a consequence of the low weight of 0.2 for amphetamine possession offences.
Unlike the PC scores, the CC scores depend on the response variable. The weights of
the relevant variables are higher than the highest weights of PC scores, and the correlation
coefficient for each response is higher than that obtained with the PC scores as predictors.
This difference is particularly marked for steal from motor vehicles.
The scatterplots of Figure 3.8 confirm the results shown in the table, and the table together
with the plots show the following

1. Linear Regression based on the first CC scores exploits the relationship between the
response and the original data.
2. In Linear Regression with the PC1 predictors, the relationship between response and
original predictor variables may not have been represented appropriately.

The analysis shows that we may lose valuable information when using PCs as predictors in
Linear Regression.

For the three response variables in Example 3.10, the CC-based predictors result in
much stronger relationships. It is natural to ask: Which approach is better, and why? The
two approaches maximise different criteria and hence solve different problems: in the PC
approach, the variance of the predictors is maximised, whereas in the CC approach, the cor-
relation between the variables is maximised. If we want to find the best linear predictor, the
CC scores are more appropriate than the first PC scores. In Section 13.3 we explore how
one can combine PC and CC scores to obtain better regression predictors.

3.7.2 Canonical Correlation Regression


In Section 3.7.1 we explored Linear Regression based on the first pair of CC scores as
a single predictor. In this section we combine the ideas of the preceding section with the
more traditional estimation of regression coefficients (2.38) in Section 2.8.2. I refer to
this approach as Canonical Correlation Regression by analogy with Principal Component
Regression.
Principal Component Regression applies to multivariate predictor variables and univariate
responses. In Canonical Correlation Regression it is natural to allow multivariate responses.
Because we will be using derived predictors instead of the original predictor variables, we
adopt the notation of Koch and Naito (2010) and define estimators B̂ for the coefficient matrix B of (3.33) in terms of derived data X̃. We consider specific forms of X̃ below and require that ( X̃ X̃^T )^{-1} exists. Put

B̂ = ( X̃ X̃^T )^{-1} X̃ Y^T   and   Ŷ = B̂^T X̃ = Y X̃^T ( X̃ X̃^T )^{-1} X̃.   (3.40)

The dimension of X̃ is generally smaller than that of X, and consequently, the dimension of B̂ is decreased, too.
As in Principal Component Regression, we replace the original d-dimensional data X by lower-dimensional data. Let r be the rank of Ĉ. We project X onto the left canonical transforms, so for κ ≤ r,

X −→ X̃ = U^{(κ)} = P̂_κ^T X_S = Φ̂_κ^T X.   (3.41)

The derived data X̃ are the CC data in κ variables. By Theorem 3.6, the covariance matrix of the population canonical variates is the identity, and the CC data satisfy

U^{(κ)} U^{(κ)T} = (n − 1) I_{κ×κ}.
Substituting X̃ = U^{(κ)} into (3.40) leads to

B̂_U = ( U^{(κ)} U^{(κ)T} )^{-1} U^{(κ)} Y^T = (1/(n−1)) P̂_κ^T X_S Y^T = (1/(n−1)) Φ̂_κ^T X Y^T

and

Ŷ = B̂_U^T U^{(κ)} = (1/(n−1)) Y X_S^T P̂_κ P̂_κ^T X_S = (1/(n−1)) Y U^{(κ)T} U^{(κ)}.   (3.42)
The expression (3.42) is applied to the prediction of Ŷ_new from a new datum X_new by putting

Ŷ_new = (1/(n−1)) Y U^{(κ)T} P̂_κ^T S^{-1/2} X_new.

In the expression for Ŷ_new, I assume that X_new is centred. In Section 9.5 we focus on training and testing. I will give an explicit expression, (9.16), for the predicted Y-value of a new datum.
The two expressions (3.38) and (3.42) look different, but they agree for fixed κ. This
is easy to verify for κ = 1 and is trivially true for κ = r . The general case follows from
Theorem 3.6.
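Continuing the sketch given after (3.39), and assuming the arrays X, Y, Phi, U and the sample size n defined there, the prediction of the response for a new, already centred, observation can be written in a few lines; x_new and B_U are illustrative names only, and this fragment is not self-contained on its own.

    # coefficient matrix B_U of (3.42), using U = Phi^T X from the earlier sketch
    B_U = U @ Y.T / (n - 1)                  # kappa x q
    x_new = X[:, [0]]                        # a 'new' centred observation; here simply the first training point
    y_new = B_U.T @ (Phi.T @ x_new)          # predicted response for the new observation
    print(y_new.ravel())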
A comparison of (3.42) with (2.43) in Section 2.8.2 shows the similarities and subtle dif-
ferences between Canonical Correlation Regression and Principal Component Regression.
In (3.42), the data X are projected onto directions which take into account the between
covariance matrix S XY of X and Y, whereas (2.43) projects the data onto directions purely
based on the covariance matrix of X. As a consequence, variables with low variance contri-
butions will be down-weighted in the regression (2.43). A disadvantage of (3.42) compared
with (2.43) is, however, that the number of components is limited by the dimension of the
response variables. In particular, for univariate responses, κ = 1 in the prediction based on
(3.42).
I illustrate the ideas of this section with an example at the end of the next section and then
also include Partial Least Squares in the data analysis.

3.7.3 Partial Least Squares


In classical multivariate regression d < n and XXT is invertible. If d > n, then XXT is
singular, and (3.37) does not apply. Wold (1966) developed an approach to regression
which circumvents the singularity of XXT in the d > n case. Wold’s approach, which can
be regarded as reversing the roles of n and d, is called Partial Least Squares or Partial
Least Squares Regression. Partial Least Squares was motivated by the requirement to extend
multivariate regression to the d > n case but is not restricted to this case.
The problem statement is similar to that of Linear Regression: for predictors X with a singular covariance matrix Σ_X and responses Y which satisfy Y = B^T X for some matrix B, construct an estimator B̂ of B. Wold (1966) proposed an iterative approach to constructing the estimator B̂. We consider two such approaches, Helland (1988) and Rosipal and Trejo (2001), which are modifications of the original proposal of Wold (1966). The key idea is
to exploit the covariance relationship between the predictors and responses. Partial Least
Squares enjoys popularity in the social sciences, marketing and in chemometrics; see
Rosipal and Trejo (2001).
The population model of Wold (1966) consists of a d-dimensional predictor X, a q-dimensional response Y, a κ-dimensional unobserved T with κ < d and unknown linear transformations A_X and A_Y of size d × κ and q × κ, respectively, such that

X = A_X T   and   Y = A_Y T.   (3.43)

The aim is to estimate T and Ŷ = Y G(T), where G is a function of the unknown T.
For the sample, we keep the assumption that the X and Y are centred, and we put

X = A_X T   and   Y = A_Y T

for some κ × n matrix T and A_X and A_Y as in (3.43).
The approaches of Helland (1988) and Rosipal and Trejo (2001) differ in the way they
construct the row vectors t1 , . . ., tκ of T. Algorithm 3.1 outlines the general idea for con-
structing a partial least squares solution which is common to both approaches. So, for given
X and Y, the algorithm constructs T and the transformations A X and AY . Helland (1988)
proposes two solution paths and shows the relationship between the two solutions, and
Helland (1990) presents some population results for the set-up we consider below, which
contains a comparison with Principal Component Analysis. Helland (1988, 1990) deal with
univariate responses only. I restrict attention to the first solution in Helland (1988) and then
move on to the approach of Rosipal and Trejo (2001).

Algorithm 3.1 Partial Least Squares Solution


Construct κ row vectors t_1, ..., t_κ of size 1 × n iteratively starting from

X_0 = X,   Y_0 = Y   and some 1 × n vector t_0.   (3.44)

• In the kth step, obtain the triplet (t_k, X_k, Y_k) from (t_{k−1}, X_{k−1}, Y_{k−1}) as follows:
  1. Construct the row vector t_k, and add it to the collection T = ( t_1, ..., t_{k−1} )^T.
  2. Update
     X_k = X_{k−1} ( I_{n×n} − t_k^T t_k ),
     Y_k = Y_{k−1} ( I_{n×n} − t_k^T t_k ).   (3.45)
• When T = ( t_1, ..., t_κ )^T, put
  Ŷ = Y G(T)   for some function G.   (3.46)

The construction of the row vector tk in each step and the definition of the n × n matrix
of coefficients G distinguish the different approaches.

Helland (1988): t_k for univariate responses Y. Assume that for 1 < k ≤ κ, we have constructed (t_{k−1}, X_{k−1}, Y_{k−1}).

H1. Put t_{k−1,0} = Y_{k−1} / ‖Y_{k−1}‖.
H2. For ℓ = 1, 2, ..., calculate
    w_{k−1,ℓ} = X_{k−1} t_{k−1,ℓ−1}^T,
    t_{k−1,ℓ} = w_{k−1,ℓ}^T X_{k−1}   and   t_{k−1,ℓ} = t_{k−1,ℓ} / ‖t_{k−1,ℓ}‖.   (3.47)
H3. Repeat the calculations of step H2 until the sequence {t_{k−1,ℓ} : ℓ = 1, 2, ...} has converged. Put
    t_k = lim_ℓ t_{k−1,ℓ},
    and use t_k in (3.45).
H4. After the κth step of Algorithm 3.1, put
    G(T) = G(t_1, ..., t_κ) = ∑_{k=1}^{κ} t_k^T ( t_k t_k^T )^{-1} t_k
    and
    Ŷ = Y G(T) = Y ∑_{k=1}^{κ} t_k^T ( t_k t_k^T )^{-1} t_k.   (3.48)
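The following Python sketch is one possible implementation of Algorithm 3.1 with Helland's choice of t_k for a univariate response; the data are synthetic, and the names, tolerances and dimensions are of my own choosing.

    import numpy as np

    def pls_helland(X, Y, kappa, tol=1e-10, max_iter=500):
        # X is d x n, Y is 1 x n, both centred; returns the fitted values of (3.48)
        Xk, Yk = X.copy(), Y.copy()
        components = []
        for _ in range(kappa):
            t = Yk / np.linalg.norm(Yk)               # H1: starting value t_{k-1,0}
            for _ in range(max_iter):                 # H2/H3: iterate until convergence
                w = Xk @ t.T                          # w = X_{k-1} t^T
                t_new = w.T @ Xk
                t_new = t_new / np.linalg.norm(t_new)
                if np.linalg.norm(t_new - t) < tol:
                    t = t_new
                    break
                t = t_new
            components.append(t)
            deflate = np.eye(X.shape[1]) - t.T @ t    # update (3.45)
            Xk, Yk = Xk @ deflate, Yk @ deflate
        # H4: each t_k has unit norm, so t_k^T (t_k t_k^T)^{-1} t_k reduces to t_k^T t_k
        G = sum(t.T @ t for t in components)
        return Y @ G, components

    rng = np.random.default_rng(4)
    n, d = 150, 10
    X = rng.standard_normal((d, n)); X -= X.mean(axis=1, keepdims=True)
    Y = rng.standard_normal((1, d)) @ X + 0.2 * rng.standard_normal((1, n))
    Y -= Y.mean(axis=1, keepdims=True)

    Y_hat, _ = pls_helland(X, Y, kappa=3)
    print(np.mean((Y - Y_hat) ** 2))                  # fitting error decreases as kappa grows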

Rosipal and Trejo (2001): tk for q-variate responses Y. In addition to the tk , vectors uk
are constructed and updated in this solution, starting with t0 and u0 . Assume that for 1 <
k ≤ κ, we have constructed (tk−1 , uk−1 , Xk−1 , Yk−1 ).
RT1. Let u_{k−1,0} be a random vector of size 1 × n.
RT2. For ℓ = 1, 2, ..., calculate
     w = X_{k−1} u_{k−1,ℓ−1}^T,
     t_{k−1,ℓ} = w^T X_{k−1}   and   t_{k−1,ℓ} = t_{k−1,ℓ} / ‖t_{k−1,ℓ}‖,
     and
     v = Y_{k−1} t_{k−1,ℓ}^T,
     u_{k−1,ℓ} = v^T Y_{k−1}   and   u_{k−1,ℓ} = u_{k−1,ℓ} / ‖u_{k−1,ℓ}‖.   (3.49)
RT3. Repeat the calculations of step RT2 until the sequences of vectors t_{k−1,ℓ} and u_{k−1,ℓ} have converged. Put
     t_k = lim_ℓ t_{k−1,ℓ}   and   u_k = lim_ℓ u_{k−1,ℓ},
     and use t_k in (3.45).
RT4. After the κth step of Algorithm 3.1, define the κ × n matrices
     T = ( t_1 ; ... ; t_κ )   and   U = ( u_1 ; ... ; u_κ ),
     and put
     G(T) = G(t_1, ..., t_κ) = T^T ( U X^T X T^T )^{-1} U X^T X,
     B̂^T = Y T^T ( U X^T X T^T )^{-1} U X^T
     and
     Ŷ = B̂^T X = Y G(T).   (3.50)
A comparison of (3.48) and (3.50) shows that both solutions make use of the covariance relationship between X and Y (and of the updates X_{k−1} and Y_{k−1}) in the calculation of w in (3.47) and of v in (3.49). The second algorithm calculates two sets of vectors t_k and u_k, whereas Helland's solution only requires the t_k. The second algorithm applies to multivariate responses; Helland's solution has no obvious extension to q > 1.
For multiple regression with univariate responses and κ = 1, write Ŷ_H and Ŷ_RT for the estimators defined in (3.48) and (3.50), respectively. Then

Ŷ_H = Y t_1^T t_1 / ‖t_1‖^2   and   Ŷ_RT = Y t_1^T ( u_1 X^T X t_1^T )^{-1} u_1 X^T X.   (3.51)

The more complex expression for Ŷ_RT could be interpreted as the cost paid for starting with a random vector u. There is no clear winner between these two solutions; for univariate responses Helland's algorithm is clearly the simpler, whereas the second algorithm answers the needs of multivariate responses.
Partial Least Squares methods are based on all variables rather than on a reduced number of derived variables, as done in Principal Component Analysis and Canonical Correlation Analysis. The iterative process, which leads to the κ components t_j and u_j (with j ≤ κ), stops when the updated matrix X_k = X_{k−1} ( I_{n×n} − t_k^T t_k ) is the zero matrix.
In the next example I compare the two Partial Least Squares solutions to Principal Com-
ponent Analysis, Canonical Correlation Analysis and Linear Regression. In Section 3.7.4,
I indicate how these four approaches fit into the framework of generalised eigenvalue
problems.

Example 3.11 For the abalone data, the number of rings allows the experts to estimate the
age of the abalone. In Example 2.19 in Section 2.8.2, we explored Linear Regression with
PC1 as the derived predictor of the number of rings. Here we apply a number of approaches
to the abalone data. To assess the performance of each approach, I use the mean sum of
squared errors
MSSE = (1/n) ∑_{i=1}^{n} ( Y_i − Ŷ_i )^2.   (3.52)
I will use only the first 100 observations in the calculations and present the results in
Tables 3.10 and 3.11. The comparisons include classical Linear Regression (LR), Principal
Component Regression (PCR), Canonical Correlation Regression (CCR) and Partial Least
Squares (PLS). Table 3.10 shows the relative importance of the seven predictor variables,
here given in decreasing order and listed by the variable number, and Table 3.11 gives the
MSSE as the number of terms or components increases.
For LR, I order the variables by their significance obtained from the traditional p-values.
Because the data are not normal, the p-values are approximate. The dried shell weight,
variable 7, is the only variable that is significant, with a p-value of 0.0432. For PCR and
CCR, Table 3.10 lists the ordering of variables induced by the weights of the first direction
vector. For PCR, this vector is the first eigenvector η̂_1 of the sample covariance matrix S of X, and for CCR, it is the first canonical transform φ̂_1. LR and CCR pick the same variable
as most important, whereas PCR selects variable 4, whole weight. This difference is not
surprising because variable 4 is chosen merely because of its large effect on the covariance
matrix of the predictor data X. The PLS components are calculated iteratively and are based
Table 3.10 Relative importance of variables for prediction by method for the abalone data from Example 3.11

Method   Order of variables
LR       7 5 1 4 6 2 3
PCR      4 5 7 1 6 2 3
CCR      7 1 5 2 4 6 3

Table 3.11 MSSE for different prediction approaches and number of components for the abalone data from Example 3.11

                  Number of variables or components
Method      1        2        3        4        5        6        7
LR        5.6885   5.5333   5.5260   5.5081   5.4079   5.3934   5.3934
PCR       5.9099   5.7981   5.6255   5.5470   5.4234   5.4070   5.3934
CCR       5.3934     —        —        —        —        —        —
PLSH      5.9099   6.0699   6.4771   6.9854   7.8350   8.2402   8.5384
PLSRT     5.9029   5.5774   5.5024   5.4265   5.3980   5.3936   5.3934

on all variables, so they cannot be compared conveniently in this way. For this reason, I do
not include PLS in Table 3.10.
The MSSE results for each approach are given in Table 3.11. The column headed ‘1’
shows the MSSE for one variable or one derived variable, and later columns show the MSSE
for the number of (derived) variables shown in the top row. For CCR, there is only one
derived variable because the response is univariate. The PLS solutions are based on all
variables rather than on subsets, and the number of components therefore has a different
interpretation. For simplicity, I calculate the MSSE based on the first component, first two
components, and so on and include their MSSE in the table.
A comparison of the LR and PCR errors shows that all errors – except the last – are higher
for PCR. When all seven variables are used, the two methods agree. The MSSE of CCR is
the same as the smallest error for LR and PCR, and the CCR solution has the same weights
as the LR solution with all variables. For the abalone data, PLSH performs poorly compared
with the other methods, and the MSSE increases if more components than the first are used,
whereas the performance of PLSRT is similar to that of LR.

The example shows that in the classical scenario of a single response and many more
observations than predictor variables, Linear Regression does at least as well as the competi-
tors I included. However, Linear Regression has limitations, in particular, d ≪ n is required, and it is therefore important to have methods that apply when these conditions are no longer
satisfied. In Section 13.3.2 we return to regression and consider HDLSS data which require
more sophisticated approaches.

3.7.4 The Generalised Eigenvalue Problem


I conclude this chapter with an introduction to Generalised Eigenvalue Problems, a
topic which includes Principal Component Analysis, Canonical Correlation Analysis,
Partial Least Squares (PLS) and Multiple Linear Regression (MLR) as special cases. The
ideas I describe are presented in Borga, Landelius, and Knutsson (1997) and extended in
De Bie, Christianini, and Rosipal (2005).
Definition 3.14 Let A and B be square matrices of the same size, and assume that B
is invertible. The task of the generalised eigen(value) problem is to find eigenvalue–
eigenvector solutions (λ, e) to the equation
Ae = λBe or equivalently B −1 Ae = λe. (3.53)


Problems of this type, which involve two matrices, arise in physics and the engineering sciences. For A, B and a vector e, (3.53) is related to the Rayleigh quotient, named after the physicist Rayleigh, which is defined by

e^T A e / e^T B e.
We restrict attention to those special cases of the generalised eigenvalue problem we have
met so far. Each method is characterised by the role the eigenvectors play and by the choice
of the two matrices. In each case the eigenvectors optimise specific criteria. Table 3.12 gives
explicit expressions for the matrices A and B and the criteria the eigenvectors optimise. For
details, see Borga, Knutsson, and Landelius (1997).
The setting of Principal Component Analysis is self-explanatory: the eigenvalues and eigenvectors of the generalised eigenvalue problem are those of the covariance matrix Σ of the random vector X, and by Theorem 2.10 in Section 2.5.2, the eigenvectors η of Σ maximise e^T Σ e.
For Partial Least Squares, two random vectors X[1] and X[2] and their covariance matrix Σ_12 are the objects of interest. In this case, the singular values and the left and right eigenvectors of Σ_12 solve the problem. Maximum Covariance Analysis (MCA), described in Section 3.5.4, shares these properties with PLS.
For Canonical Correlation Analysis, we remain with the pair of random vectors X[1] and X[2] but replace the covariance matrix of Partial Least Squares by the matrix of canonical correlations C. The vectors listed in Table 3.12 are the normalised canonical transforms of
(3.8). To see why these vectors are appropriate, observe that the generalised eigen problem arising from A and B is the following:

( Σ_12 e_2 ; Σ_12^T e_1 ) = λ ( Σ_1 e_1 ; Σ_2 e_2 ),   (3.54)

which yields

Σ_1^{-1} Σ_12 Σ_2^{-1} Σ_12^T e_1 = λ^2 e_1.   (3.55)
A comparison of (3.55) and (3.24) reveals some interesting facts: the matrix on the left-hand
side of (3.55) is the matrix K , which is similar to R = CC T . The matrix similarity implies
that the eigenvalues of K are squares of the singular values υ of C. Further, the eigenvector
e_1 of K is related to the corresponding eigenvector p of R by

e_1 = c Σ_1^{-1/2} p,   (3.56)
Table 3.12 Special cases of the generalised eigenvalue problem

Method    A                          B                       Eigenvectors     Comments
PCA       Σ                          I                       Maximise Σ       See Section 2.2
PLS/MCA   ( 0  Σ_12 ; Σ_12^T  0 )    ( I  0 ; 0  I )         Maximise Σ_12    See Section 3.7.3
CCA       ( 0  Σ_12 ; Σ_12^T  0 )    ( Σ_1  0 ; 0  Σ_2 )     Maximise C       See Section 3.2
LR        ( 0  Σ_12 ; Σ_12^T  0 )    ( Σ_1  0 ; 0  I )       Minimise LSE     —

for some c > 0. The vector p is a left eigenvector of C, and so the eigenvector e1 of K is
nothing but the normalised canonical transform ϕ/||ϕ|| because the eigenvectors have norm
1. A similar argument, based on the matrix C T C instead of CC T , establishes that the second
eigenvector equals ψ/||ψ||.
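The CCA row of Table 3.12 can be checked numerically with a generalised symmetric eigensolver. The sketch below, in Python with NumPy and SciPy, builds A and B from arbitrarily chosen covariance blocks and compares the generalised eigenvalues of (3.53) with the singular values of C.

    import numpy as np
    from scipy.linalg import eigh

    rng = np.random.default_rng(5)
    d1, d2 = 4, 3
    M = rng.standard_normal((d1 + d2, 2 * (d1 + d2)))
    Sigma = M @ M.T / M.shape[1]                      # an arbitrary positive definite covariance
    S1, S2, S12 = Sigma[:d1, :d1], Sigma[d1:, d1:], Sigma[:d1, d1:]

    # A e = lambda B e with A and B as in the CCA row of Table 3.12, compare (3.54)
    A = np.block([[np.zeros((d1, d1)), S12], [S12.T, np.zeros((d2, d2))]])
    B = np.block([[S1, np.zeros((d1, d2))], [np.zeros((d2, d1)), S2]])
    lam = eigh(A, B, eigvals_only=True)               # generalised eigenvalues

    def inv_sqrt(S):
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T
    C = inv_sqrt(S1) @ S12 @ inv_sqrt(S2)
    ups = np.linalg.svd(C, compute_uv=False)

    print(np.round(np.sort(lam), 4))   # pairs +/- upsilon_j, padded with zeros when d1 != d2
    print(np.round(ups, 4))            # canonical correlations: the positive eigenvalues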
Linear Regression treats the two random vectors asymmetrically. This can be seen in the
expression for B. We take X[1] to be the predictor vector and X[2] the response vector. The
generalised eigen equations amount to
 
( Σ_12 e_2 ; Σ_12^T e_1 ) = λ ( Σ_1 e_1 ; e_2 ),   (3.57)

and hence one needs to solve the equations

Σ_1^{-1} Σ_12 Σ_12^T e_1 = λ^2 e_1   and   Σ_12^T Σ_1^{-1} Σ_12 e_2 = λ^2 e_2.

The matrix Σ_1^{-1} Σ_12 Σ_12^T is not symmetric, so it has a singular value decomposition with left and right eigenvectors, whereas Σ_12^T Σ_1^{-1} Σ_12 has a spectral decomposition with a unique set of eigenvectors.
Generalised eigenvalue problems are, of course, not restricted to these cases. In
Section 4.3 we meet Fisher’s discriminant function, which is the solution of another gener-
alised eigenvalue problem, and in Section 12.4 we discuss approaches based on two scatter
matrices which also fit into this framework.
