
Correlation analysis 1: Canonical correlation analysis

Ryan Tibshirani
Data Mining: 36-462/36-662

February 14 2013

1
Review: correlation
Given two random variables X, Y ∈ R, the (Pearson) correlation
between X and Y is defined as

    Cor(X, Y) = Cov(X, Y) / (√Var(X) √Var(Y))

Recall that

    Cov(X, Y) = E[(X − E[X])(Y − E[Y])]

and

    Var(X) = E[(X − E[X])²] = Cov(X, X)

This measures the linear association between X and Y. Properties:

- −1 ≤ Cor(X, Y) ≤ 1
- X, Y independent ⇒ Cor(X, Y) = 0 (Homework 2)
- Cor(X, Y) = 0 ⇏ X, Y independent (Homework 2)

More on this later ...
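As a quick illustration of the last property (a simulated sketch in R,
not from the slides): take X standard normal and Y = X². Then Y is
completely determined by X, yet Cov(X, Y) = E[X³] = 0:

set.seed(1)
x = rnorm(100000)   # draws of X ~ N(0, 1)
y = x^2             # Y is a deterministic function of X
cor(x, y)           # sample correlation is near 0 despite total dependence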
2
Review: sample correlation
Given centered x, y ∈ Rn, the sample correlation between x and y
is defined as

    cor(x, y) = xᵀy / (√(xᵀx) √(yᵀy))

Note the analogy to the definition on the last slide—we just
replace everything by its sample version. I.e., if we write cov and
var for the sample covariance and variance, then

    cor(x, y) = cov(x, y) / (√var(x) √var(y))

Note: if x, y ∈ Rn are centered unit vectors, then cor(x, y) = xᵀy

This measures the linear association between x and y. Properties:

- −1 ≤ cor(x, y) ≤ 1
- cor(x, y) = 0 ⇔ x, y are orthogonal
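A quick numerical check of the note above (an illustrative R sketch):
centering x and y and scaling them to unit norm makes cor(x, y)
exactly the inner product:

set.seed(1)
x = rnorm(20)
y = rnorm(20)
xc = (x - mean(x)) / sqrt(sum((x - mean(x))^2))   # centered unit vector
yc = (y - mean(y)) / sqrt(sum((y - mean(y))^2))   # centered unit vector
cor(x, y)      # base R's sample correlation
sum(xc * yc)   # the inner product xc^T yc: identical value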
3
Canonical correlation analysis
Principal component analysis attempts to answer the question:
“which directions account for much of the observed variance in a
data set?” Given a centered matrix X ∈ Rn×p, we first find the
direction v1 ∈ Rp to maximize the sample variance of Xv:

    v1 = argmax_{‖v‖₂=1} var(Xv)

Canonical correlation analysis is similar, but instead attempts to
answer: “which directions account for much of the covariance
between two data sets?” Now we are given two centered matrices
X ∈ Rn×p, Y ∈ Rn×q, and we seek the two directions α1 ∈ Rp,
β1 ∈ Rq that maximize the sample covariance of Xα and Yβ:

    α1, β1 = argmax_{‖Xα‖₂=1, ‖Yβ‖₂=1} cov(Xα, Yβ)

Subject to the constraints, this is equivalent to maximizing
cor(Xα, Yβ). (Why?)
4
Canonical directions and variates
The first canonical directions α1 ∈ Rp, β1 ∈ Rq are given by

    α1, β1 = argmax_{‖Xα‖₂=1, ‖Yβ‖₂=1} (Xα)ᵀ(Yβ)

The vectors Xα1, Yβ1 ∈ Rn are called the first canonical variates,
and ρ1 = (Xα1)ᵀ(Yβ1) ∈ R is called the first canonical correlation

Given the first k − 1 directions, the kth canonical directions
αk ∈ Rp, βk ∈ Rq are defined as

    αk, βk = argmax (Xα)ᵀ(Yβ)
    subject to ‖Xα‖₂ = 1, ‖Yβ‖₂ = 1,
               (Xα)ᵀ(Xαj) = 0, j = 1, ..., k − 1,
               (Yβ)ᵀ(Yβj) = 0, j = 1, ..., k − 1

The vectors Xαk, Yβk ∈ Rn are called the kth canonical variates,
and ρk = (Xαk)ᵀ(Yβk) ∈ R is called the kth canonical correlation

5
Example: scores data
Example: n = 88 students took tests in each of 5 subjects:
mechanics, vectors, algebra, analysis, statistics. (From Mardia et
al. (1979) “Multivariate analysis”.) Each test is out of 100 points

The tests on mechanics and vectors were closed book, and those on
algebra, analysis, and statistics were open book. There's clearly
some correlation between these two sets of scores:

         alg    ana    sta
mec    0.547  0.409  0.389
vec    0.610  0.485  0.436

Canonical correlation analysis attempts to explain this phenomenon
using the variables in each set jointly. Here X contains the closed
book test scores and Y contains the open book test scores, so
X ∈ R88×2 and Y ∈ R88×3

6
The first canonical directions (multiplied by 10³):

    α1 = (2.770, 5.517)ᵀ          (mec, vec)
    β1 = (8.782, 0.860, 0.370)ᵀ   (alg, ana, sta)
The first canonical correlation is ρ1 = 0.663, and the variates:

[Scatterplot of the first canonical variates: Yβ1 against Xα1]

The second directions are more surprising, but ρ2 = 0.041

7
How many canonical directions are there?

We have X ∈ Rn×p and Y ∈ Rn×q. How many pairs of canonical
directions (α1, β1), (α2, β2), ... are there?

We know that any n orthogonal (hence linearly independent) vectors
in Rn form a basis for Rn. Therefore there cannot be more than p
orthogonal vectors of the form Xα, α ∈ Rp, and q orthogonal
vectors of the form Yβ, β ∈ Rq. (Why?)

Hence there are exactly r = min{p, q} canonical directions
(α1, β1), ..., (αr, βr)¹

¹ This assumes that n ≥ p and n ≥ q. In general, there are actually
only r = min{rank(X), rank(Y)} canonical directions
8
Transforming the problem

If A ∈ Rp×p, B ∈ Rq×q are invertible, then computing

    α̃1, β̃1 = argmax_{‖XAα̃‖₂=1, ‖Y Bβ̃‖₂=1} (XAα̃)ᵀ(Y Bβ̃)

is equivalent to the first step of canonical correlation analysis. In
particular, the first canonical directions are given by α1 = Aα̃1 and
β1 = Bβ̃1. The same is also true of further directions

I.e., we can transform our data matrices to be X̃ = XA, Ỹ = Y B
for any invertible A, B, solve the canonical correlation problem
with X̃, Ỹ, and then back-transform to get our desired answers

Why would we ever do this? Because there is a transformation
A, B that makes the computational problem simpler

9
Sphering
For any symmetric positive definite matrix A ∈ Rn×n, there is a
matrix A^(1/2) ∈ Rn×n, called the (symmetric) square root of A,
such that A^(1/2) A^(1/2) = A

We write the inverse of A^(1/2) as A^(−1/2). Note that
A^(−1/2) A A^(−1/2) = I. (Why?)

Given centered matrices X ∈ Rn×p and Y ∈ Rn×q,² we define
VX = XᵀX ∈ Rp×p and VY = YᵀY ∈ Rq×q. Then

    X̃ = X VX^(−1/2) ∈ Rn×p and Ỹ = Y VY^(−1/2) ∈ Rn×q

are called the sphered versions of X and Y.³ Note that the sample
covariances of X̃ and Ỹ are

    cov(X̃) = I/n and cov(Ỹ) = I/n

² Here we are assuming that rank(X) = p and rank(Y) = q
³ Alternatively, for sphering we would sometimes define VX = (XᵀX)/n
and VY = (YᵀY)/n, so that the transformed sample covariances are
exactly I
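A minimal sketch of this step in R (sym_inv_sqrt is our own helper,
not a built-in), assuming X is centered with rank(X) = p:

sym_inv_sqrt = function(V) {
  # symmetric inverse square root via the eigendecomposition V = U diag(lam) U^T
  e = eigen(V, symmetric = TRUE)
  e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}
Xtilde = X %*% sym_inv_sqrt(crossprod(X))   # sphered X; crossprod(X) = X^T X
round(crossprod(Xtilde), 10)                # the identity, so cov(Xtilde) = I/n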
10
Transforming the problem (continued)
As suggested by the previous slide, we will take X̃ = X VX^(−1/2)
and Ỹ = Y VY^(−1/2), and we'll solve the problem

    α̃1, β̃1 = argmax_{‖X̃α̃‖₂=1, ‖Ỹβ̃‖₂=1} (X̃α̃)ᵀ(Ỹβ̃)

Recall that then α1 = VX^(−1/2) α̃1 and β1 = VY^(−1/2) β̃1.

So why is this simpler? Note that the constraint says

    1 = (X̃α̃)ᵀ(X̃α̃) = α̃ᵀ VX^(−1/2) XᵀX VX^(−1/2) α̃ = α̃ᵀα̃

i.e., ‖α̃‖₂ = 1. Similarly, ‖β̃‖₂ = 1. Hence our problem can be
rewritten as:

    α̃1, β̃1 = argmax_{‖α̃‖₂=1, ‖β̃‖₂=1} α̃ᵀMβ̃

where M = X̃ᵀỸ = VX^(−1/2) XᵀY VY^(−1/2) ∈ Rp×q. The same is
true for further directions
11
Computing canonical directions and variates
Now comes the singular value decomposition to the rescue
(again!). Let r = min{p, q}. Then we can decompose

    M = U D Vᵀ

where U ∈ Rp×r, V ∈ Rq×r have orthonormal columns, and
D = diag(d1, ..., dr) ∈ Rr×r with d1 ≥ ... ≥ dr ≥ 0. Further:

- The transformed canonical directions α̃1, ..., α̃r ∈ Rp and
  β̃1, ..., β̃r ∈ Rq are the columns of U and V, respectively
- The canonical directions α1, ..., αr ∈ Rp and β1, ..., βr ∈ Rq
  are the columns of VX^(−1/2) U and VY^(−1/2) V, respectively
- The canonical variates Xα1, ..., Xαr ∈ Rn and
  Yβ1, ..., Yβr ∈ Rn are the columns of X VX^(−1/2) U ∈ Rn×r
  and Y VY^(−1/2) V ∈ Rn×r, respectively
- The canonical correlations ρ1 ≥ ... ≥ ρr are equal to
  d1 ≥ ... ≥ dr, the diagonal entries of D
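Putting these steps together, here is a minimal R sketch of the whole
computation (cca_svd is our own illustrative function, not a library
routine; it assumes X and Y are centered with full column rank):

cca_svd = function(X, Y) {
  # symmetric inverse square root, via the eigendecomposition V = U diag(lam) U^T
  inv_sqrt = function(V) {
    e = eigen(V, symmetric = TRUE)
    e$vectors %*% diag(1 / sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
  }
  VXi = inv_sqrt(crossprod(X))          # V_X^(-1/2), with crossprod(X) = X^T X
  VYi = inv_sqrt(crossprod(Y))          # V_Y^(-1/2)
  M = VXi %*% crossprod(X, Y) %*% VYi   # M = V_X^(-1/2) X^T Y V_Y^(-1/2)
  s = svd(M)                            # M = U D V^T
  list(alpha = VXi %*% s$u,             # canonical directions for X
       beta  = VYi %*% s$v,             # canonical directions for Y
       rho   = s$d)                     # canonical correlations d_1 >= ... >= d_r
}

For out = cca_svd(X, Y), the canonical variates are then
X %*% out$alpha and Y %*% out$beta; the correlations should agree
with cancor's, and the directions should agree up to sign flips.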
12
Example: olive oil data
Example: n = 572 olive oils, with p = 9 features (the olives data
set from the R package classifly):

1. region
2. palmitic
3. palmitoleic
4. stearic
5. oleic
6. linoleic
7. linolenic
8. arachidic
9. eicosenoic

Variable 1 takes values in {1, 2, 3}, indicating the region (in Italy)
of origin. Variables 2-9 are continuous valued and measure the
percentage composition of 8 different fatty acids

13
We are interested in the correlations between the region of origin
and the fatty acid measurements. Hence we take X ∈ R572×8 to
contain the fatty acid measurements, and Y ∈ R572×3 to be an
indicator matrix, i.e., each row of Y indicates the region with a 1
and otherwise has 0s. This might look like:

    Y = [ 1 0 0
          1 0 0
          0 0 1
          0 1 0
          ...   ]

(In this case, canonical correlation analysis actually does the exact
same thing as linear discriminant analysis, an important tool that
we will learn later for classification)
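A sketch of this setup in R (the region and fatty acid column names
below are taken from the list two slides back, and are assumed to
match those in the classifly package):

library(classifly)   # provides the olives data set
data(olives)
acids = c("palmitic", "palmitoleic", "stearic", "oleic",
          "linoleic", "linolenic", "arachidic", "eicosenoic")
X = as.matrix(olives[, acids])                 # 572 x 8 fatty acid measurements
Y = model.matrix(~ factor(olives$region) - 1)  # 572 x 3 indicator matrix of region
cc = cancor(X, Y)   # cancor centers X and Y by default
cc$cor              # canonical correlations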

14
The first two canonical X variates, with the points colored by
region:

[Scatterplot of the second canonical x variate against the first
canonical x variate, with points colored by Region 1, Region 2,
Region 3]

15
Canonical correlation analysis in R

Canonical correlation analysis is implemented by the cancor
function in R's base distribution (the stats package). E.g.,

cc = cancor(x, y)      # cancor also centers x and y by default
alpha = cc$xcoef       # canonical directions for x
beta = cc$ycoef        # canonical directions for y
rho = cc$cor           # canonical correlations
xvars = x %*% alpha    # canonical variates for x (with x centered)
yvars = y %*% beta     # canonical variates for y (with y centered)

16
Recap: canonical correlation analysis

In canonical correlation analysis we are looking for pairs of
directions, one in each of the feature spaces of two data sets
X ∈ Rn×p, Y ∈ Rn×q, that maximize the covariance (or correlation)

We defined the pairs of canonical directions (α1, β1), ..., (αr, βr),
where r = min{p, q}, and αj ∈ Rp, βj ∈ Rq. We also defined the
pairs of canonical variates (Xα1, Yβ1), ..., (Xαr, Yβr), where
Xαj ∈ Rn and Yβj ∈ Rn. Finally, we defined the canonical
correlations ρ1, ..., ρr ∈ R

We saw that transforming the problem leads to a simpler form. From
this simpler form we can compute the canonical directions,
correlations, and variates using the singular value decomposition

17
Next time: measures of correlation
A lot of work has been done, but there's still a lot of interest ...

[Images of correlation research spanning 1888 to 2012]
18
