Lesson 13: Canonical Correlation Analysis

Overview
Canonical correlation analysis is a method for exploring the relationships between two multivariate sets
of variables (vectors), all measured on the same individual.

Consider, as an example, variables related to exercise and health. On one hand, you have variables
associated with exercise, observations such as the climbing rate on a stair stepper, how fast you can
run a certain distance, the amount of weight lifted on bench press, the number of push-ups per minute,
etc. On the other hand, you have variables that attempt to measure overall health, such as blood
pressure, cholesterol levels, glucose levels, body mass index, etc. Two types of variables are measured
and the relationships between the exercise variables and the health variables are of interest.

As a second example, consider variables measured on environmental health and environmental toxins.
A number of environmental health variables, such as frequencies of sensitive species, species
diversity, total biomass, productivity of the environment, etc., may be measured, along with a second
set of variables on environmental toxins, such as the concentrations of heavy metals, pesticides,
dioxin, etc.

For a third example consider a group of sales representatives, on whom we have recorded several sales
performance variables along with several measures of intellectual and creative aptitude. We may wish
to explore the relationships between the sales performance variables and the aptitude variables.

One approach to studying relationships between the two sets of variables is canonical correlation
analysis, which describes the relationship between the first set of variables and the second set of
variables. We do not necessarily think of one set of variables as independent and the other as
dependent, though that is another possible approach.

Objectives
Upon completion of this lesson, you should be able to:

Carry out a canonical correlation analysis using SAS (Minitab does not have this functionality);
Assess how many canonical variate pairs should be considered;
Interpret canonical variate scores;
Describe the relationships between variables in the first set with variables in the second set.

13.1 - Setting the Stage for Canonical Correlation Analysis

What motivates canonical correlation analysis?


It is possible to create pairwise scatter plots with variables in the first set (e.g., exercise variables)
and variables in the second set (e.g., health variables). But if the dimension of the first set is p and
that of the second set is q, there will be pq such scatter plots. It may be difficult, if not impossible,
to look at all of these graphs together and interpret the results.

Similarly, you could compute all correlations between variables from the first set (e.g., exercise
variables) and variables in the second set (e.g., health variables). However, interpretation is difficult
when pq is large.

Canonical correlation analysis allows us to summarize the relationships in a smaller number of
statistics while preserving the main facets of the relationships. In a way, the motivation for canonical
correlation is very similar to that of principal component analysis. It is another dimension reduction
technique.

Canonical Variates
Let's begin with the notation:

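We have two sets of variables, $\mathbf{X}$ and $\mathbf{Y}$.

Suppose we have p variables in set 1:

$$\mathbf{X} = (X_1, X_2, \dots, X_p)'$$

and suppose we have q variables in set 2:

$$\mathbf{Y} = (Y_1, Y_2, \dots, Y_q)'$$

We select X and Y based on the number of variables that exist in each set so that $p \le q$. This is done
for computational convenience.

We look at linear combinations of the data, similar to principal components analysis. We define a set of
linear combinations named U and V. U corresponds to the linear combinations from the first set of
variables, X, and V corresponds to the second set of variables, Y. Each member of U is paired with a
member of V. For example, $U_1$ below is a linear combination of the p X variables and $V_1$ is the
corresponding linear combination of the q Y variables. Similarly, $U_2$ is a linear combination of the p X
variables, and $V_2$ is the corresponding linear combination of the q Y variables. And so on:

$$\begin{aligned}
U_1 &= a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p \\
U_2 &= a_{21}X_1 + a_{22}X_2 + \dots + a_{2p}X_p \\
&\;\;\vdots \\
U_p &= a_{p1}X_1 + a_{p2}X_2 + \dots + a_{pp}X_p \\[6pt]
V_1 &= b_{11}Y_1 + b_{12}Y_2 + \dots + b_{1q}Y_q \\
V_2 &= b_{21}Y_1 + b_{22}Y_2 + \dots + b_{2q}Y_q \\
&\;\;\vdots \\
V_p &= b_{p1}Y_1 + b_{p2}Y_2 + \dots + b_{pq}Y_q
\end{aligned}$$

Thus define

$$(U_i, V_i)$$

as the $i^{th}$ canonical variate pair. $(U_1, V_1)$ is the first canonical variate pair, similarly
$(U_2, V_2)$ would be the second canonical variate pair and so on. With $p \le q$ there are p canonical
variate pairs.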

We hope to find linear combinations that maximize the correlations between the members of each
canonical variate pair.

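We compute the variance of the $U_i$ variables with the following expression:

$$\text{var}(U_i) = \sum_{k=1}^{p}\sum_{l=1}^{p} a_{ik}a_{il}\,\text{cov}(X_k, X_l)$$

The coefficients $a_{i1}$ through $a_{ip}$ that appear in the double sum are the same coefficients that
appear in the definition of $U_i$. The covariances between the $k^{th}$ and $l^{th}$ X-variables are
multiplied by the corresponding coefficients $a_{ik}$ and $a_{il}$ for the variate $U_i$.

Similar calculations can be made for the variance of $V_j$ as shown below:

$$\text{var}(V_j) = \sum_{k=1}^{q}\sum_{l=1}^{q} b_{jk}b_{jl}\,\text{cov}(Y_k, Y_l)$$

The covariance between $U_i$ and $V_j$ is:

$$\text{cov}(U_i, V_j) = \sum_{k=1}^{p}\sum_{l=1}^{q} a_{ik}b_{jl}\,\text{cov}(X_k, Y_l)$$

The correlation between $U_i$ and $V_j$ is calculated using the usual formula. We take the covariance
between the two variables and divide it by the square root of the product of the variances:

$$\frac{\text{cov}(U_i, V_j)}{\sqrt{\text{var}(U_i)\,\text{var}(V_j)}}$$

The canonical correlation is a specific type of correlation. The canonical correlation for the $i^{th}$
canonical variate pair is simply the correlation between $U_i$ and $V_i$:

$$\rho^*_i = \frac{\text{cov}(U_i, V_i)}{\sqrt{\text{var}(U_i)\,\text{var}(V_i)}}$$

This is the quantity to maximize. We want to find linear combinations of the X's and linear combinations
of the Y's that maximize the above correlation.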
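
As a hedged illustration of the formulas above, the following Python sketch computes the variance of each linear combination, their covariance, and the resulting correlation directly from covariance blocks. The covariance values, the coefficient vectors, and the function names (`quad_form`, `combo_correlation`) are made up for illustration and are not taken from the lesson's data.

```python
import math

def quad_form(a, S, b):
    """Double sum over k and l of a[k] * b[l] * S[k][l]."""
    return sum(a[k] * b[l] * S[k][l]
               for k in range(len(a)) for l in range(len(b)))

def combo_correlation(a, b, Sxx, Syy, Sxy):
    """Correlation between U = a'X and V = b'Y."""
    var_u = quad_form(a, Sxx, a)    # var(U)
    var_v = quad_form(b, Syy, b)    # var(V)
    cov_uv = quad_form(a, Sxy, b)   # cov(U, V)
    return cov_uv / math.sqrt(var_u * var_v)

# Hypothetical covariance blocks for p = 2 X-variables and q = 2 Y-variables
Sxx = [[1.0, 0.5], [0.5, 1.0]]
Syy = [[1.0, 0.3], [0.3, 1.0]]
Sxy = [[0.6, 0.2], [0.3, 0.4]]  # Sxy[k][l] = cov(X_k, Y_l)

# Two arbitrary choices of coefficients give two different correlations
print(combo_correlation([1.0, 0.0], [1.0, 0.0], Sxx, Syy, Sxy))
print(combo_correlation([1.0, 1.0], [1.0, 1.0], Sxx, Syy, Sxy))
```

Canonical correlation analysis searches over all coefficient vectors for the pair that makes this correlation as large as possible.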

Canonical Variates Defined


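Let us look at each of the p canonical variate pairs one by one.

First canonical variate pair: $(U_1, V_1)$

The coefficients $a_{11}, a_{12}, \dots, a_{1p}$ and $b_{11}, b_{12}, \dots, b_{1q}$ are selected to
maximize the canonical correlation $\rho^*_1$ of the first canonical variate pair. This is subject to the
constraint that the variances of the two canonical variates in that pair are equal to one:

$$\text{var}(U_1) = \text{var}(V_1) = 1$$

This is required to obtain unique values for the coefficients.

Second canonical variate pair: $(U_2, V_2)$

Similarly, we want to find the coefficients $a_{21}, a_{22}, \dots, a_{2p}$ and
$b_{21}, b_{22}, \dots, b_{2q}$ that maximize the canonical correlation $\rho^*_2$ of the second canonical
variate pair, $(U_2, V_2)$. Again, we will maximize this canonical correlation subject to the constraints
that the variances of the individual canonical variates are both equal to one. Furthermore, we require
the additional constraints that $(U_1, U_2)$ and $(V_1, V_2)$ are uncorrelated. In addition, the
combinations $(U_1, V_2)$ and $(U_2, V_1)$ must be uncorrelated. In summary, our constraints are:

$$\begin{aligned}
\text{var}(U_2) &= \text{var}(V_2) = 1, \\
\text{cov}(U_1, U_2) &= \text{cov}(V_1, V_2) = 0, \\
\text{cov}(U_1, V_2) &= \text{cov}(U_2, V_1) = 0.
\end{aligned}$$

Basically, we require that all of the remaining correlations equal zero.

This procedure is repeated for each pair of canonical variates. In general, ...

$i^{th}$ canonical variate pair: $(U_i, V_i)$

We want to find the coefficients $a_{i1}, a_{i2}, \dots, a_{ip}$ and $b_{i1}, b_{i2}, \dots, b_{iq}$ that
maximize the canonical correlation $\rho^*_i$ subject to the constraints that

$$\begin{aligned}
\text{var}(U_i) &= \text{var}(V_i) = 1, \\
\text{cov}(U_1, U_i) &= \text{cov}(V_1, V_i) = 0, \\
&\;\;\vdots \\
\text{cov}(U_{i-1}, U_i) &= \text{cov}(V_{i-1}, V_i) = 0, \\
\text{cov}(U_1, V_i) &= \text{cov}(U_i, V_1) = 0, \\
&\;\;\vdots \\
\text{cov}(U_{i-1}, V_i) &= \text{cov}(U_i, V_{i-1}) = 0.
\end{aligned}$$

Again, we require all of the remaining correlations to be equal to zero.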
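
For readers who want to see where the coefficients come from, the constrained maximization above has a well-known closed-form solution. It is summarized here as background, not as part of this lesson's derivation. The symbols $\Sigma_{XX}$, $\Sigma_{YY}$ and $\Sigma_{XY} = \Sigma_{YX}'$ for the covariance blocks of X and Y, and the coefficient vectors $\mathbf{a}_i$ and $\mathbf{b}_i$, are notation assumed for this sketch.

```latex
\begin{aligned}
\Sigma_{XX}^{-1}\Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\,\mathbf{a}_i
  &= (\rho^*_i)^2\,\mathbf{a}_i, \\
\mathbf{b}_i &\propto \Sigma_{YY}^{-1}\Sigma_{YX}\,\mathbf{a}_i, \\
\mathbf{a}_i'\,\Sigma_{XX}\,\mathbf{a}_i &= \mathbf{b}_i'\,\Sigma_{YY}\,\mathbf{b}_i = 1 .
\end{aligned}
```

In words, the squared canonical correlations are the eigenvalues of the p × p matrix in the first line, ordered from largest to smallest, and the last line is the unit-variance constraint. In practice the sample covariance matrices replace the population ones.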

Next, let's see how this is carried out in SAS...

13.2 - Example: Sales Data

Example 13-1: Sales


The example data come from a firm that surveyed a random sample of n = 50 of its employees in an
attempt to determine which factors influence sales performance. Two collections of variables were
measured:

Sales Performance:
    Sales Growth
    Sales Profitability
    New Account Sales
Test Scores as a Measure of Intelligence:
    Creativity
    Mechanical Reasoning
    Abstract Reasoning
    Mathematics

There are p = 3 variables in the first group relating to Sales Performance and q = 4 variables in the
second group relating to Test Scores.

Download the text file containing the data here: sales.txt [1]

[2]

1. Using SAS

Canonical correlation analysis is carried out in SAS using a canonical correlation procedure that is
abbreviated as cancorr. Let's look at how this is carried out in the SAS program below.

Download the SAS program here: sales.sas [3]
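options ls=78;
title "Canonical Correlation Analysis - Sales Data";

/* Read the three sales variables and the four test scores */
data sales;
  infile "D:\Statistics\STAT 505\data\sales.txt";
  input growth profit new create mech abs math;
run;

/* VAR lists the sales set, WITH lists the test-score set.
   Canonical variate scores (sales1-sales3, scores1-scores3) are saved to canout. */
proc cancorr data=sales out=canout vprefix=sales vname="Sales Variables"
             wprefix=scores wname="Test Scores";
  var growth profit new;
  with create mech abs math;
run;

/* Plot the first canonical variate pair with a fitted regression line (i=r) */
proc gplot data=canout;
  axis1 length=3 in;
  axis2 length=4.5 in;
  plot sales1*scores1 / vaxis=axis1 haxis=axis2;
  symbol v=J f=special h=2 i=r color=black;
run;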

View the video explanation of the SAS code:

https://www.youtube.com/watch/4WksKAFD_o0 [4]

13.3 - Test for Relationship Between Canonical Variate Pairs

Let's first determine if there is any relationship between the two sets of variables at all. Perhaps the two
sets of variables are completely unrelated to one another and independent!

To test for independence between the Sales Performance and the Test Score variables, first consider a
multivariate multiple regression model where we predict the Sales Performance variables from the Test
Score variables. In this general case, we have p multiple regressions, each multiple regression
predicting one of the variables in the first group ( X variables) from the q variables in the second group
(Y variables).

In our example, we have multiple regressions predicting the p = 3 sales variables from the q = 4 test
score variables. We wish to test the null hypothesis that these regression coefficients (except for the
intercepts) are all equal to zero. This would be equivalent to the null hypothesis that the first set of
variables is independent from the second set of variables.

This is carried out using Wilks lambda. The results of this are found on page 1 of the output of the SAS
Program.
Test of H0: The canonical correlations in the current row and all that follow are zero

     Likelihood    Approximate
     Ratio         F Value        Num DF    Den DF    Pr > F
1    0.00214847    87.39          12        114.06    <.0001
2    0.19524127    18.53           6         88       <.0001
3    0.85284669     3.88           2         45       0.0278

SAS reports Wilks lambda $\Lambda = 0.00215$; $F = 87.39$; d.f. = 12, 114.06; $p < 0.0001$. Wilks
lambda is a ratio of the determinants of two variance-covariance matrices (raised to a certain power).
A small value of $\Lambda$ corresponds to a large F value and a small p-value, in which case we reject
the null hypothesis. In our example, we reject the null hypothesis that there is no relationship between
the two sets of variables and conclude that the two sets of variables are dependent. Note also that the
above null hypothesis is equivalent to the null hypothesis that all p canonical variate pairs are
uncorrelated, or

$$H_0: \rho^*_1 = \rho^*_2 = \dots = \rho^*_p = 0$$

Because Wilks lambda is significant and the canonical correlations are ordered from largest to
smallest, we can conclude that at least $\rho^*_1 \ne 0$.

We may also wish to test the hypothesis that the second or the third canonical variate pairs are
correlated. We can do this in successive tests. Next, test whether the second and third canonical
variate pairs are correlated...

We can look again at the SAS output above. In the second row for the likelihood ratio test statistic we
find $\Lambda = 0.19524$; $F = 18.53$; d.f. = 6, 88; $p < 0.0001$. From this test we can conclude that
the second canonical variate pair is correlated, $\rho^*_2 \ne 0$.

Finally, we can test the significance of the third canonical variate pair.

The third row of the SAS output contains the likelihood ratio test statistic
$\Lambda = 0.85285$; $F = 3.88$; d.f. = 2, 45; $p = 0.0278$. This is also significant and so we conclude
that the third canonical variate pair is correlated.

All three canonical variate pairs are significantly correlated and dependent on one another. This
suggests that we may summarize all three pairs. In practice, these tests are carried out successively
until you find a non-significant result. Once a non-significant result is found, you stop. If this happens
with the first canonical variate pair, then there is not sufficient evidence of any relationship between the
two sets of variables and the analysis may stop.

If the first pair shows significance, then you move on to the second canonical variate pair. If this second
pair is not significantly correlated then stop. If it was significant you would continue to the third pair,
proceeding in this iterative manner through the pairs of canonical variates testing until you find non-
significant results.
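
The sequential tests above can be sketched in a few lines of Python. The likelihood ratio in each row is the product of $(1 - r_i^2)$ over the canonical correlations in that row and all that follow. The lesson does not state which F approximation SAS uses, so this sketch assumes Rao's F approximation; the function name `wilks_tests` is hypothetical, and the inputs are the estimated canonical correlations from the SAS output for the sales data (listed in Section 13.4).

```python
import math

def wilks_tests(canon_corrs, n, p, q):
    """Sequential Wilks' lambda tests with Rao's F approximation (assumed)."""
    m = n - 1.5 - (p + q) / 2.0
    rows = []
    for k in range(len(canon_corrs)):
        # Lambda for H0: canonical correlations k+1, k+2, ... are all zero
        lam = math.prod(1.0 - r * r for r in canon_corrs[k:])
        pk, qk = p - k, q - k  # dimensions left after removing k pairs
        denom = pk**2 + qk**2 - 5
        s = math.sqrt((pk**2 * qk**2 - 4) / denom) if denom > 0 else 1.0
        num_df = pk * qk
        den_df = m * s - num_df / 2.0 + 1.0
        root = lam ** (1.0 / s)
        f_value = (1.0 - root) / root * den_df / num_df
        rows.append((lam, f_value, num_df, den_df))
    return rows

# Estimated canonical correlations from the SAS output; n = 50, p = 3, q = 4
rows = wilks_tests([0.994483, 0.878107, 0.383606], n=50, p=3, q=4)
for i, (lam, f_value, num_df, den_df) in enumerate(rows, start=1):
    print(i, round(lam, 8), round(f_value, 2), num_df, round(den_df, 2))
```

Under these assumptions the printed rows should agree with the likelihood ratios, F values, and degrees of freedom in the SAS table up to rounding. The p-values are not computed here because they require the F distribution, which is outside the Python standard library.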
13.4 - Obtain Estimates of Canonical Correlation

Now that we rejected the hypotheses of independence, the next step is to obtain estimates of canonical
correlation.

The estimated canonical correlations are found at the top of page 1 in the SAS output as shown below:

Canonical Correlation Analysis

                    Adjusted       Approximate    Squared
     Canonical      Canonical      Standard       Canonical
     Correlation    Correlation    Error          Correlation
1    0.994483       0.994021       0.001572       0.988996
2    0.878107       0.872097       0.032704       0.771071
3    0.383606       0.366795       0.121835       0.147153

The squared canonical correlations, found in the last column, can be interpreted in much the same way
as $r^2$ values are interpreted.

We see that 98.90% of the variation in $U_1$ is explained by the variation in $V_1$, and 77.11% of the
variation in $U_2$ is explained by $V_2$, but only 14.72% of the variation in $U_3$ is explained by
$V_3$. The first two canonical correlations are very high, which suggests that only the first two
canonical variate pairs are important.

One can actually see this from the plots that SAS generates. The first canonical variate for sales is
plotted against the first canonical variate for scores in the scatter plot for the first canonical variate
pair:

[Figure: scatter plot of sales1 against scores1 with fitted regression line, titled "Canonical
Correlation Analysis - Sales Data"]

The regression line shows how well the data fit. The plot of the second canonical variate pair is a bit
more scattered, but is still a reasonably good fit:

[Figure: scatter plot of the second canonical variate pair, titled "Canonical Correlation Analysis -
Sales Data"]

A plot of the third pair would show little of the same kind of fit. From this point on we refer only to
the first two canonical variate pairs, based on the observation that the third squared canonical
correlation is so small.

13.5 - Obtain the Canonical Coefficients

Page 2 of the SAS output provides the estimated canonical coefficients for the sales variables:

Canonical Correlation Analysis

Raw Canonical Coefficients for the Sales Variables

            sales1           sales2           sales3
growth      0.0623778783    -0.174070306     -0.377152934
profit      0.020925642      0.2421640883     0.1035150082
new         0.0782581746    -0.23829403       0.3834150736

Using the coefficient values in the first column, the first canonical variable for sales is determined using
the following formula:

$$U_1 = 0.0624\,X_{growth} + 0.0209\,X_{profit} + 0.0783\,X_{new}$$

Likewise, the estimated canonical coefficients for the test scores are located in the next table in
the SAS output:

Raw Canonical Coefficients for the Test Scores

            scores1          scores2          scores3
create      0.0697481411    -0.192391323      0.2465565859
mech        0.0307382997     0.201574382     -0.141895279
abs         0.0895641768    -0.495763258     -0.280224053
math        0.0628299739     0.0683160677     0.0113325936

Using the coefficient values in the first column, the first canonical variable for test scores is determined
using a similar formula:

$$V_1 = 0.0697\,Y_{create} + 0.0307\,Y_{mech} + 0.0896\,Y_{abs} + 0.0628\,Y_{math}$$

In both cases, the magnitudes of the coefficients give the contributions of the individual variables to the
corresponding canonical variable. However, just like in principal components analysis, these
magnitudes also depend on the variances of the corresponding variables. Unlike principal components
analysis, however, standardizing the data has no impact on the canonical correlations.
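
As a small illustration of how the raw coefficients are used, the sketch below applies the first column of each coefficient table to a single observation. The employee's values and the function name `canonical_score` are hypothetical, and the sketch applies the coefficients to the raw values exactly as in the formulas above; scores reported by software may differ by a constant if the variables are centered first.

```python
# First-column raw canonical coefficients copied from the SAS output
raw_coef_sales1 = {"growth": 0.0623778783, "profit": 0.020925642, "new": 0.0782581746}
raw_coef_scores1 = {"create": 0.0697481411, "mech": 0.0307382997,
                    "abs": 0.0895641768, "math": 0.0628299739}

def canonical_score(coefs, observation):
    """Linear combination of the observed values with the given coefficients."""
    return sum(coefs[name] * observation[name] for name in coefs)

# Hypothetical employee (values invented for illustration only)
employee = {"growth": 100.0, "profit": 105.0, "new": 98.0,
            "create": 10.0, "mech": 12.0, "abs": 9.0, "math": 30.0}

u1 = canonical_score(raw_coef_sales1, employee)   # first canonical variate for sales
v1 = canonical_score(raw_coef_scores1, employee)  # first canonical variate for test scores
print(round(u1, 3), round(v1, 3))
```

Because the raw coefficients depend on the measurement scale of each variable, their magnitudes alone do not rank the variables by importance.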
13.6 - Interpret Each Component

To interpret each component, we must compute the correlations between each variable and the
corresponding canonical variate.

a. The correlations between the sales variables and the canonical variables for Sales Performance
are found at the top of the fourth page of the SAS output in the following table:

Correlations Between the Sales Variables and Their Canonical Variables

            sales1     sales2     sales3
growth      0.9799     0.0006    -0.1996
profit      0.9464     0.3229     0.0075
new         0.9519    -0.1863     0.2434

Looking at the first canonical variable for sales, we see that all correlations are uniformly large.
Therefore, you can think of this canonical variate as an overall measure of Sales Performance. For
the second canonical variable for Sales Performance, none of the correlations is particularly
large, and so this canonical variable yields little information about the data. Again, we had
decided earlier not to look at the third canonical variate pair.

A similar interpretation can take place with the Test Scores.

b. The correlations between the test scores and the canonical variables for Test Scores are also
found in the SAS output:

Correlations Between the Test Scores and Their Canonical Variables

            scores1    scores2    scores3
create      0.6383    -0.2157     0.6514
mech        0.7212     0.2376    -0.677
abs         0.6472    -0.5013    -0.5742
math        0.9441     0.1975    -0.0942

Because all correlations are large for the first canonical variable, this can be thought of as an
overall measure of test performance as well. However, it is most strongly correlated with
mathematics test scores. Most of the correlations with the second canonical variable are small.
There is some suggestion that this variable may be negatively correlated with abstract reasoning.

c. Putting (a) and (b) together, we see that the best predictor of sales performance is
mathematics test scores as this indicator stands out the most.

13.7 - Reinforcing the Results


These results are further reinforced by looking at the correlations between each set of variables and the
opposite group of canonical variates.

a. The correlations between the sales variables and the canonical variates for test scores are
found on page 4 of the SAS output:

Correlations Between the Sales Variables and the Canonical Variables of the Test Scores

            scores1    scores2    scores3
growth      0.9745     0.0006    -0.0766
profit      0.9412     0.2835     0.0029
new         0.9466    -0.1636     0.0934

We can see that all three of these correlations are strong and show a pattern similar to that with
the canonical variate for sales. The reason for this is obvious: The first canonical correlation is
very high.

b. The correlations between the test scores and the canonical variates for sales are also in the
SAS output:

Correlations Between the Test Scores and the Canonical Variables of the Sales Variables

            sales1     sales2     sales3
create      0.6348    -0.1894     0.2499
mech        0.7172     0.2086    -0.0260
abs         0.6437    -0.4402    -0.2203
math        0.9389     0.1735    -0.0361

Note! These also show a pattern similar to that with the canonical variate for test scores. Again,
this is because the first canonical correlation is very high.

c. These results confirm that sales performance is best predicted by mathematics test scores.
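
The reason the two sets of tables look so alike can be made concrete. A standard property of canonical variates, not stated in the lesson, is that the correlation of a variable with the opposite set's $i^{th}$ canonical variate equals its correlation with its own set's $i^{th}$ canonical variate multiplied by the $i^{th}$ canonical correlation. The sketch below checks this for the sales variables using the values reported above; the function name `cross_loadings` is hypothetical.

```python
# Values copied from the SAS output tables in this lesson
canon_corr = [0.994483, 0.878107, 0.383606]
own_corr = {  # sales variables with their own canonical variates (13.6)
    "growth": [0.9799, 0.0006, -0.1996],
    "profit": [0.9464, 0.3229, 0.0075],
    "new": [0.9519, -0.1863, 0.2434],
}
reported_cross = {  # sales variables with the test-score canonical variates (13.7)
    "growth": [0.9745, 0.0006, -0.0766],
    "profit": [0.9412, 0.2835, 0.0029],
    "new": [0.9466, -0.1636, 0.0934],
}

def cross_loadings(own, corrs):
    """Scale each own-set correlation by the matching canonical correlation."""
    return [r * c for r, c in zip(own, corrs)]

implied = {name: cross_loadings(own, canon_corr) for name, own in own_corr.items()}
for name in own_corr:
    print(name, [round(x, 4) for x in implied[name]], reported_cross[name])
```

With a first canonical correlation of 0.994, the first column is scaled by almost exactly one, which is why the first-column patterns are nearly identical across the two tables.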

13.8 - Summary

In this lesson we learned about:

How to test for independence between two sets of variables
How to determine the number of significant canonical variate pairs
How to compute the canonical variates from the data
How to interpret each member of a canonical variate pair using its correlations with the member variables
How to use the results of canonical correlation analysis to describe the relationships between two sets of variables

Source: https://online.stat.psu.edu/stat505/lesson/13

Links:

1. https://online.stat.psu.edu/onlinecourses/sites/stat505/files/data/sales.txt
2. https://online.stat.psu.edu/stat505#tablist-cke_1-tab-pane-1
3. https://online.stat.psu.edu/onlinecourses/sites/stat505/files/sas/sales.sas
4. https://www.youtube.com/watch/4WksKAFD_o0
