Lesson 13 - Canonical Correlation Analysis
Overview
Canonical correlation analysis is a method for exploring the relationships between two multivariate sets
of variables (vectors), all measured on the same individual.
Consider, as an example, variables related to exercise and health. On one hand, you have variables
associated with exercise, observations such as the climbing rate on a stair stepper, how fast you can
run a certain distance, the amount of weight lifted on bench press, the number of push-ups per minute,
etc. On the other hand, you have variables that attempt to measure overall health, such as blood
pressure, cholesterol levels, glucose levels, body mass index, etc. Two types of variables are measured
and the relationships between the exercise variables and the health variables are of interest.
As a second example, consider a group of sales representatives, on whom we have recorded several sales
performance variables along with several measures of intellectual and creative aptitude. We may wish
to explore the relationships between the sales performance variables and the aptitude variables.
One approach to studying relationships between the two sets of variables is canonical correlation
analysis, which describes the relationship between the first set of variables and the second set of
variables. We do not necessarily think of one set of variables as independent and the other as
dependent, though that is another possible approach.
Objectives
Upon completion of this lesson, you should be able to:
Carry out a canonical correlation analysis using SAS (Minitab does not have this functionality);
Assess how many canonical variate pairs should be considered;
Interpret canonical variate scores;
Describe the relationships between variables in the first set with variables in the second set.
One simple approach is to compute all correlations between the p variables in the first set (e.g.,
exercise variables) and the q variables in the second set (e.g., health variables); however,
interpretation is difficult when the number of correlations, pq, is large.
Canonical correlation analysis allows us to summarize the relationships in a smaller number of
statistics while preserving the main facets of the relationships. In a way, the motivation for canonical
correlation is very similar to that for principal component analysis: it is another dimension reduction
technique.
Canonical Variates
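Let's begin with the notation. We have two sets of variables, X and Y, with p variables in the first
set and q variables in the second set:

$$\mathbf{X} = (X_1, X_2, \dots, X_p)', \qquad \mathbf{Y} = (Y_1, Y_2, \dots, Y_q)'$$

We select X and Y based on the number of variables that exist in each set so that $p \le q$. This is done
for computational convenience.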
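We look at linear combinations of the data, similar to principal components analysis. We define a set of
linear combinations named U and V. U corresponds to the linear combinations from the first set of
variables, X, and V corresponds to the second set of variables, Y. Each member of U is paired with a
member of V. For example, $U_1$ below is a linear combination of the p X variables and $V_1$ is the
corresponding linear combination of the q Y variables. Similarly, $U_2$ is a linear combination of the p X
variables, and $V_2$ is the corresponding linear combination of the q Y variables. And so on:

\begin{align*}
U_1 &= a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p \\
U_2 &= a_{21}X_1 + a_{22}X_2 + \dots + a_{2p}X_p \\
&\;\;\vdots \\
U_p &= a_{p1}X_1 + a_{p2}X_2 + \dots + a_{pp}X_p
\end{align*}

\begin{align*}
V_1 &= b_{11}Y_1 + b_{12}Y_2 + \dots + b_{1q}Y_q \\
V_2 &= b_{21}Y_1 + b_{22}Y_2 + \dots + b_{2q}Y_q \\
&\;\;\vdots \\
V_p &= b_{p1}Y_1 + b_{p2}Y_2 + \dots + b_{pq}Y_q
\end{align*}

Thus define $(U_i, V_i)$ as the $i$th canonical variate pair. $(U_1, V_1)$ is the first canonical variate
pair, similarly $(U_2, V_2)$ would be the second canonical variate pair, and so on. With $p \le q$ there
are p canonical variate pairs.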
We hope to find linear combinations that maximize the correlations between the members of each
canonical variate pair.
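The variance of $U_i$ is computed with a double sum over the X variables:

$$\text{var}(U_i) = \sum_{k=1}^{p}\sum_{l=1}^{p} a_{ik}a_{il}\,\text{cov}(X_k, X_l)$$

The coefficients $a_{i1}$ through $a_{ip}$ that appear in the double sum are the same coefficients that
appear in the definition of $U_i$. The covariances between the $k$th and $l$th X-variables are multiplied
by the corresponding coefficients $a_{ik}$ and $a_{il}$ for the variate $U_i$. Similar calculations give
the variance of $V_j$ and the covariance between $U_i$ and $V_j$:

$$\text{var}(V_j) = \sum_{k=1}^{q}\sum_{l=1}^{q} b_{jk}b_{jl}\,\text{cov}(Y_k, Y_l), \qquad \text{cov}(U_i, V_j) = \sum_{k=1}^{p}\sum_{l=1}^{q} a_{ik}b_{jl}\,\text{cov}(X_k, Y_l)$$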
The correlation between $U_i$ and $V_j$ is calculated using the usual formula. We take the covariance
between the two variables and divide it by the square root of the product of the variances:

$$\frac{\text{cov}(U_i, V_j)}{\sqrt{\text{var}(U_i)\,\text{var}(V_j)}}$$
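The canonical correlation is a specific type of correlation. The canonical correlation for the $i$th
canonical variate pair is simply the correlation between $U_i$ and $V_i$:

$$\rho^*_i = \frac{\text{cov}(U_i, V_i)}{\sqrt{\text{var}(U_i)\,\text{var}(V_i)}}$$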
This is the quantity to maximize. We want to find linear combinations of the X's and linear combinations
of the Y's that maximize the above correlation.
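As a small illustration of the quantity being maximized, the sketch below evaluates the correlation between two linear combinations directly from the double sums. The covariance values and the coefficient vectors a and b are hypothetical toy numbers, not taken from any data set in this lesson, and the code only evaluates the correlation for one given choice of coefficients; it does not perform the maximization.

```python
import math


def lin_comb_cov(a, b, cov):
    """Covariance of sum_k a[k]*X_k and sum_l b[l]*Y_l, computed as a double sum."""
    return sum(a[k] * b[l] * cov[k][l]
               for k in range(len(a))
               for l in range(len(b)))


def lin_comb_corr(a, b, s_xx, s_yy, s_xy):
    """Correlation between U = a'X and V = b'Y."""
    var_u = lin_comb_cov(a, a, s_xx)   # variance of U
    var_v = lin_comb_cov(b, b, s_yy)   # variance of V
    cov_uv = lin_comb_cov(a, b, s_xy)  # covariance between U and V
    return cov_uv / math.sqrt(var_u * var_v)


# Hypothetical toy covariances with p = q = 2 (for illustration only)
s_xx = [[1.0, 0.5], [0.5, 1.0]]   # cov(X_k, X_l)
s_yy = [[1.0, 0.0], [0.0, 1.0]]   # cov(Y_k, Y_l)
s_xy = [[0.3, 0.1], [0.2, 0.4]]   # cov(X_k, Y_l)

a = [1.0, 1.0]   # hypothetical coefficients for U
b = [1.0, 0.0]   # hypothetical coefficients for V

corr_uv = lin_comb_corr(a, b, s_xx, s_yy, s_xy)
print(round(corr_uv, 4))
```

Canonical correlation analysis searches over all choices of a and b for the pair that makes this value as large as possible.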
This procedure is repeated for each pair of canonical variates. In general, ...
Sales Performance:
Sales Growth
Sales Profitability
New Account Sales
Test Scores as a Measure of Intelligence:
Creativity
Mechanical Reasoning
Abstract Reasoning
Mathematics
There are p = 3 variables in the first group relating to Sales Performance and q = 4 variables in the
second group relating to Test Scores.
Download the text file containing the data here: sales.txt [1]
1. Using SAS
Canonical correlation analysis is carried out in SAS using a canonical correlation procedure that is
abbreviated as cancorr. Let's look at how this is carried out in the SAS program below.
Download the SAS program here: sales.sas [3]
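options ls=78;
title "Canonical Correlation Analysis - Sales Data";

/* Read the three sales variables and the four test scores */
data sales;
  infile "D:\Statistics\STAT 505\data\sales.txt";
  input growth profit new create mech abs math;
run;

/* Canonical correlation analysis; canonical variate scores are saved in canout */
proc cancorr data=sales out=canout vprefix=sales vname="Sales Variables"
             wprefix=scores wname="Test Scores";
  var growth profit new;
  with create mech abs math;
run;

/* Plot the first canonical variate pair with a fitted regression line (i=r) */
proc gplot data=canout;
  axis1 length=3 in;
  axis2 length=4.5 in;
  plot sales1*scores1 / vaxis=axis1 haxis=axis2;
  symbol v=J f=special h=2 i=r color=black;
run;
quit;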
Let's first determine if there is any relationship between the two sets of variables at all. Perhaps the two
sets of variables are completely unrelated to one another and independent!
To test for independence between the Sales Performance and the Test Score variables, first consider a
multivariate multiple regression model where we predict the Sales Performance variables from the Test
Score variables. In this general case, we have p multiple regressions, each multiple regression
predicting one of the variables in the first group ( X variables) from the q variables in the second group
(Y variables).
In our example, we have multiple regressions predicting the p = 3 sales variables from the q = 4 test
score variables. We wish to test the null hypothesis that these regression coefficients (except for the
intercepts) are all equal to zero. This would be equivalent to the null hypothesis that the first set of
variables is independent from the second set of variables.
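Written out, the regression model and the null hypothesis take the form below. The symbols $\beta_{kj}$ and $\epsilon_k$ are notation introduced here for illustration; they do not appear elsewhere in this lesson.

```latex
\begin{align*}
X_k &= \beta_{k0} + \beta_{k1} Y_1 + \beta_{k2} Y_2 + \dots + \beta_{kq} Y_q + \epsilon_k,
  \qquad k = 1, \dots, p \\
H_0 &\colon \beta_{kj} = 0 \quad \text{for all } k = 1, \dots, p \text{ and } j = 1, \dots, q
\end{align*}
```

For the sales data, p = 3 and q = 4, so the null hypothesis sets 12 slope coefficients to zero, which matches the numerator degrees of freedom of the first test.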
This is carried out using Wilks lambda. The results of this are found on page 1 of the output of the SAS
Program.
Test of H0: The canonical correlations in the current row and all that follow are zero

     Likelihood    Approximate
          Ratio        F Value    Num DF    Den DF    Pr > F
1    0.00214847          87.39        12    114.06    <.0001
2    0.19524127          18.53         6        88    <.0001
3    0.85284669           3.88         2        45    0.0278
Because Wilks lambda is significant and the canonical correlations are ordered from largest to
smallest, we can conclude that at least the first canonical correlation is nonzero, $\rho^*_1 \neq 0$.
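The approximate F values and degrees of freedom in the table can be checked by hand. The sketch below uses Rao's F approximation to Wilks lambda, which appears to reproduce the reported values. The sample size n = 50 is an assumption: it is not stated in this section and is inferred here from the reported denominator degrees of freedom. The function name rao_f is hypothetical.

```python
import math


def rao_f(wilks, p, q, n, k=0):
    """Rao's F approximation for H0: canonical correlations k+1, ..., p are all zero.

    wilks : likelihood ratio (Wilks lambda) for the test
    p, q  : number of variables in the first and second set
    n     : sample size (ASSUMED to be 50 for the sales data)
    k     : number of leading canonical variate pairs already removed
    """
    p_rem, q_rem = p - k, q - k            # dimensions remaining in each set
    df1 = p_rem * q_rem                    # numerator degrees of freedom
    denom = p_rem ** 2 + q_rem ** 2 - 5
    s = math.sqrt((df1 ** 2 - 4) / denom) if denom > 0 else 1.0
    m = n - 1.5 - (p + q) / 2
    df2 = m * s - df1 / 2 + 1              # denominator degrees of freedom
    root = wilks ** (1 / s)
    f_value = (1 - root) / root * df2 / df1
    return f_value, df1, df2


# Likelihood ratios from the three rows of the SAS output
lambdas = [0.00214847, 0.19524127, 0.85284669]
results = [rao_f(lam, p=3, q=4, n=50, k=k) for k, lam in enumerate(lambdas)]

for row, (f_value, df1, df2) in enumerate(results, start=1):
    print(f"row {row}: F = {f_value:.2f}, Num DF = {df1}, Den DF = {df2:.2f}")
```

When one of the remaining sets has a single variable, as in the third row, the approximation reduces to the exact F statistic.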
We may also wish to test the hypothesis that the second or the third canonical variate pairs are
correlated. We can do this in successive tests. Next, test whether the second and third canonical
variate pairs are correlated...
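We can look again at the SAS output above. In the second row for the likelihood ratio test statistic we
find $\Lambda = 0.19524$; $F = 18.53$; d.f. = 6, 88; $p < 0.0001$. From this test we can conclude that the
second canonical variate pair is correlated, that is, the second canonical correlation is nonzero,
$\rho^*_2 \neq 0$.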
Finally, we can test the significance of the third canonical variate pair.
The third row of the SAS output contains the likelihood ratio test statistic
$\Lambda = 0.8528$; $F = 3.88$; d.f. = 2, 45; $p = 0.0278$. This is also significant and so we conclude
that the third canonical variate pair is correlated.
All three canonical variate pairs are significantly correlated and dependent on one another. This
suggests that we may summarize all three pairs. In practice, these tests are carried out successively
until you find a non-significant result. Once a non-significant result is found, you stop. If this happens
with the first canonical variate pair, then there is not sufficient evidence of any relationship between the
two sets of variables and the analysis may stop.
If the first pair shows significance, then you move on to the second canonical variate pair. If this second
pair is not significantly correlated, then stop. If it is significant, you continue to the third pair,
proceeding in this iterative manner through the pairs of canonical variates until you find a
non-significant result.
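The stopping rule described above can be sketched in a few lines. The significance level of 0.05 is an assumption (the lesson does not state one), the function name is hypothetical, and the two p-values reported as "<.0001" are entered as their upper bound 0.0001.

```python
def count_significant_pairs(p_values, alpha=0.05):
    """Count the leading canonical variate pairs kept by sequential testing.

    Tests are examined in order; counting stops at the first
    non-significant result, and later tests are ignored.
    """
    retained = 0
    for p_value in p_values:
        if p_value >= alpha:
            break          # first non-significant result: stop testing
        retained += 1
    return retained


# p-values from the three rows of the SAS output ("<.0001" entered as 0.0001)
p_values = [0.0001, 0.0001, 0.0278]
n_pairs = count_significant_pairs(p_values)
print(n_pairs)
```

Note that a significant test appearing after a non-significant one is not counted, because the procedure never reaches it.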
13.4 - Obtain Estimates of Canonical Correlation
Now that we have rejected the hypothesis of independence, the next step is to obtain estimates of the
canonical correlations.
The estimated canonical correlations are found at the top of page 1 in the SAS output as shown below:
The squared canonical correlations, found in the last column, can be interpreted in much the same way
as $r^2$ values are interpreted in regression.
We see that 98.9% of the variation in $U_1$ is explained by the variation in $V_1$, and 77.11% of the
variation in $U_2$ is explained by $V_2$, but only 14.72% of the variation in $U_3$ is explained by $V_3$.
The first two are very high canonical correlations and suggest that only the first two canonical
correlations are important.
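These percentages can be cross-checked against the likelihood ratio column of the test table. The likelihood ratio in row i is the product of (1 - r_j^2) over the canonical correlations from row i onward, so dividing consecutive rows isolates each squared canonical correlation. The sketch below applies that relationship; the function name is hypothetical.

```python
def squared_canonical_correlations(wilks):
    """Recover squared canonical correlations from sequential Wilks lambdas.

    wilks[i] is the likelihood ratio for testing that canonical correlations
    i+1, i+2, ... are all zero, i.e. the product of (1 - r_j^2) for j > i.
    """
    r_squared = []
    for i, lam in enumerate(wilks):
        # The last row has no successor, so its ratio is taken against 1
        next_lam = wilks[i + 1] if i + 1 < len(wilks) else 1.0
        r_squared.append(1 - lam / next_lam)
    return r_squared


# Likelihood ratios from the three rows of the SAS output
lambdas = [0.00214847, 0.19524127, 0.85284669]
r_squared = squared_canonical_correlations(lambdas)

for i, r2 in enumerate(r_squared, start=1):
    print(f"pair {i}: squared canonical correlation = {r2:.4f}")
```

The three values agree, after rounding, with the 98.9%, 77.11%, and 14.72% quoted above.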
One can actually see this from the plots that SAS generates. The first canonical variate for sales is
plotted against the first canonical variate for scores in the scatter plot for the first canonical variate
pair:
Page 2 of the SAS output provides the estimated canonical coefficients for the sales variables:
Using the coefficient values in the first column, the first canonical variable for sales is determined using
the following formula:
Likewise, the estimated canonical coefficients for the test scores are located in the next table in
the SAS output:
Using the coefficient values in the first column, the first canonical variable for test scores is determined
using a similar formula:
In both cases, the magnitudes of the coefficients give the contributions of the individual variables to the
corresponding canonical variable. However, just like in principal components analysis, these
magnitudes also depend on the variances of the corresponding variables. Unlike principal components
analysis, however, standardizing the data has no impact on the canonical correlations.
13.6 - Interpret Each Component
To interpret each component, we must compute the correlations between each variable and the
corresponding canonical variate.
a. The correlations between the sales variables and the canonical variables for Sales Performance
are found at the top of the fourth page of the SAS output in the following table:
Looking at the first canonical variable for sales, we see that all correlations are uniformly large.
Therefore, you can think of this canonical variate as an overall measure of Sales Performance. For
the second canonical variable for Sales Performance, none of the correlations are particularly
large, and so this canonical variable yields little information about the data. Again, we had
decided earlier not to look at the third canonical variate pair.
b. The correlations between the test scores and the canonical variables for Test Scores are also
found in the SAS output:
Because all correlations are large for the first canonical variable, this can be thought of as an
overall measure of test performance as well; however, it is most strongly correlated with
mathematics test scores. Most of the correlations with the second canonical variable are small.
There is some suggestion that this variable may be negatively correlated with abstract reasoning.
c. Putting (a) and (b) together, we see that the best predictor of sales performance is
mathematics test scores as this indicator stands out the most.
These results are further reinforced by looking at the correlations between each set of variables and the
opposite group of canonical variates.
a. The correlations between the sales variables and the first canonical variate for test scores are
found on page 4 of the SAS output:
We can see that all three of these correlations are strong and show a pattern similar to that with
the canonical variate for sales. The reason for this is obvious: The first canonical correlation is
very high.
b. The correlations between the test scores and the first canonical variate for sales are also in the
SAS output:
Correlations Between the Test Scores and the Canonical Variables of the
Sales Variables
c. These results confirm that sales performance is best predicted by mathematics test scores.
13.8 - Summary
In this lesson we learned about:
Source: https://online.stat.psu.edu/stat505/lesson/13
Links:
1. https://online.stat.psu.edu/onlinecourses/sites/stat505/files/data/sales.txt
2. https://online.stat.psu.edu/stat505#tablist-cke_1-tab-pane-1
3. https://online.stat.psu.edu/onlinecourses/sites/stat505/files/sas/sales.sas
4. https://www.youtube.com/watch/4WksKAFD_o0