Canonical partial least squares - a unified PLS approach to classification and regression problems

U. G. Indahl,* K. H. Liland and T. Næs

* Correspondence to: U. G. Indahl, Department of Mathematical Sciences and Technology and Center for Integrative Genetics, Norwegian University of Life Sciences, N-1432 Ås, Norway. E-mail: [email protected]
U. G. Indahl: Department of Mathematical Sciences and Technology and Center for Integrative Genetics, Norwegian University of Life Sciences, N-1432 Ås, Norway
K. H. Liland: Section for Biostatistics, Norwegian University of Life Sciences, N-1432 Ås, Norway
T. Næs: Nofima Mat AS, Oslovegen 1, NO-1430 Ås, Norway
Received: 31 October 2008, Revised: 19 March 2009, Accepted: 13 April 2009, Published online in Wiley Interscience: 18 May 2009
Keywords: canonical correlation analysis; partial least squares; regression with several responses; discriminant analysis; powered partial least squares
1. INTRODUCTION: DATA COMPRESSION FOR CLASSIFICATION AND MULTI-RESPONSE REGRESSION PROBLEMS

This paper presents a collection of new ideas and possibilities in the ongoing research and development of PLS methodology. The underlying motivation of our work is twofold. We believe that for modeling with latent variables, more emphasis should be put on the possibilities of

1. using additional information such as
• weights (individual weights or weighting of groups of observations related to, for instance, their relative frequencies),
• additional measurements (e.g., reference measurements, design factors, etc.) not necessarily available for prediction of future samples;
2. deriving alternative approaches to the extraction of components so that
• fewer components are required for good predictions,
• interpretations of the associated models are simplified.

Both aspects have been emphasized in recent work; the difference between multi-response PLS (PLS2) with dummy coded responses indicating group membership and the PLS discriminant analysis (PLS-DA) discussed in References [1,2] and [3] is an important example emphasizing the weighting aspect. In these papers the idea of using PLS for discriminant analysis is put in a framework where more natural and theoretically satisfying weights are assigned to the different groups of observations, as compared to the straightforward application of PLS2 with dummy coded responses. The second aspect is an important part of the powered PLS (PPLS) methodology (see References [4] and [5]). PPLS is a modification of PLS useful for providing more parsimonious models in terms of both the number of components needed to obtain good predictions and the complexity of these components.

In the present paper we will propose an alternative multi-response PLS methodology called canonical PLS (CPLS), which combines classification and regression in a joint framework and which emphasizes both aspects introduced above. The method combines PLS and canonical correlation analysis (CCA), and it will be demonstrated to have additional and favorable properties related to

• incorporating information from additional variables (not to be considered as predictors or responses) to improve predictions or interpretations,
• simultaneous utilization of several available responses for the purpose of predicting one particular of these responses as well as possible.

Applications to several data sets indicate that CPLS (in comparison to existing PLS methodology) is able to extract
more information in the first few components. CPLS is also advantageous since it provides a theoretical framework encompassing a number of PLS based methods. After presenting the basics of CPLS we introduce some extensions and generalizations with direct reference to the underlying motivation described above. Generalization to the framework of PPLS (see References [4] and [5]) provides a class of methods called canonical PPLS (CPPLS), encompassing CPLS, PPLS, and single response PLS as sub-methodologies.

2. BACKGROUND: PLS AND CANONICAL CORRELATION ANALYSIS

2.1. Notational conventions

In the following, scalars will be denoted by lower case italicized characters, e.g., c ∈ R. With p and q being positive integers, vectors will be denoted by lower case bold italic characters, i.e., the p-dimensional u ∈ R^p. Matrices will be denoted by upper case bold roman characters, i.e., the p × q matrix W ∈ R^{p×q}.

2.2. The present PLS approaches to regression and classification problems

Given the n × p matrix X = [x_1 x_2 ... x_p] of predictors and the n × q matrix Y = [y_1 y_2 ... y_q] of responses, a PLS2 component is found by maximization of the covariance between X and Y. More precisely, for the regression situation (continuous response variables) we seek unit vectors u ∈ R^p and v ∈ R^q so that the expression

f_1(u, v) = u^t X^t Y v = u^t W v    (1)

is maximized. A solution to this problem is provided by the dominant left and right singular vectors (with unit length) obtained by singular value decomposition (SVD) of

W = X^t Y    (2)

where both the predictor matrix X and the response matrix Y are assumed to be centered. A pair of dominant unit vectors a ∈ R^p and b ∈ R^q maximizing Equation (1), together with the associated maximal singular value s, corresponds to a rank 1 SVD approximation W_(1) = s a b^t of W. The function f_1(u, v) is a scaled version of the covariance between the vectors Xu and Yv, i.e., cov(Xu, Yv) = (1/n) f_1(u, v), and its relationship to the CCA problem will be explained further below.
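To make the extraction above concrete, the following minimal NumPy sketch (our illustration, not code from the paper) computes the first PLS2 component from Equations (1) and (2), assuming X and Y are already column-centered:

```python
import numpy as np

def pls2_first_component(X, Y):
    """First PLS2 component from the dominant singular triplet of
    W = X'Y (Equations (1)-(2)); X (n x p) and Y (n x q) centered."""
    W = X.T @ Y                                    # p x q cross-product matrix
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    a, b = U[:, 0], Vt[0, :]                       # unit vectors maximizing Eq. (1)
    t = X @ a                                      # PLS2 score vector
    return a, b, s[0], t                           # s[0] is the maximal value of f_1
```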
PLS2 aims at simultaneous prediction of several responses based on a joint set of latent variables (see Reference [6]). Despite the fact that single response PLS models (PLS1) often require fewer latent variables for prediction of a particular response variable compared to a PLS2 model, PLS2 is the recommended approach when a common context is required for interpretation of the prediction models (see References [7] and [8]).

PLS associated with classification problems usually includes an n × g dummy coded group membership matrix Y (a matrix of zeros and ones arranged so that each column indicates membership of the corresponding group), where g ≥ 2 is the number of groups considered. Rather than maximizing Equation (1), Barker and Rayens [1] and Nocairi et al. [2] showed that a slightly different expression is the logical choice for PLS modeling of classification problems. Maximization of covariance corresponds to finding the dominant unit eigenvector u of the associated between-groups sum of squares and cross-products matrix B_P = n X̄_g^t P X̄_g, where P is a g × g diagonal weighting matrix with diagonal entries p_k = n_k/n. This is equivalent to maximization of the function

f_2(u, v) = u^t X^t Y (Y^t Y)^{-1/2} v = u^t W̃ v    (3)

restricted to unit vectors, where

W̃ = W (Y^t Y)^{-1/2}    (4)

The factor X̄_g = (Y^t Y)^{-1} Y^t X is a (g × p)-matrix of group means. According to Indahl et al. [3], the dominant left singular unit vector a obtained by SVD of W = X^t Y is also a dominant eigenvector of the weighted between-groups sum of squares and cross-products matrix B_Q = n X̄_g^t Q X̄_g. Here Q is a g × g diagonal weighting matrix with diagonal entries q_k that are non-negative and proportional to n_k^2 (the square of the group sizes n_k, k = 1, ..., g), scaled so that sum_{k=1}^g q_k = 1. The dominant left singular unit vector u of the scaled version W̃ coincides with a dominant eigenvector of B_P defined above. In the following sections we refer to the latter choice of weighting as PLS-DA, short for PLS discriminant analysis.

Implementations of PLS are often based on the NIPALS algorithm or the SIMPLS algorithm (see References [9] and [10]). SIMPLS extracts components based on the dominant left singular unit vector of W = X^t Y, followed by deflation of W before extraction of the next component. NIPALS also extracts the dominant left singular unit vector of W, but deflates the entire X matrix before recomputing W according to Equation (2). de Jong [9] and Burnham and Viveros [10] give detailed descriptions of the most popular PLS algorithms. As noted in Reference [9], NIPALS and SIMPLS lead to similar but not identical models when the Y-matrix has two or more columns (multiple responses). In the case of PLS-DA, either algorithm can be applied with W̃ of Equation (4) replacing W of Equation (2).
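For dummy coded Y, the matrix Y^t Y is simply the diagonal matrix of group sizes, so the rescaling in Equation (4) divides column k of W by sqrt(n_k). A minimal sketch (our illustration, not from the paper):

```python
import numpy as np

def plsda_loading_weights(X, Ydummy):
    """PLS-DA loading weights from W-tilde = W (Y'Y)^(-1/2), Equation (4).
    Ydummy (n x g) is an uncentered dummy coded membership matrix,
    so Y'Y = diag(n_1, ..., n_g); X is assumed column-centered."""
    W = X.T @ Ydummy                          # Equation (2)
    group_sizes = Ydummy.sum(axis=0)          # n_1, ..., n_g
    W_tilde = W / np.sqrt(group_sizes)        # W (Y'Y)^(-1/2)
    u = np.linalg.svd(W_tilde, full_matrices=False)[0][:, 0]
    return u                                  # dominant eigenvector of B_P
```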
2.3. Canonical correlation analysis

CCA addresses the problem of maximizing the correlation between the n × p matrix X and the n × q matrix Y, in the sense of finding vectors a ∈ R^p and b ∈ R^q so that the correlation between Xa and Yb becomes as large as possible (see Mardia et al. [11]). Assuming that X and Y are centered, CCA maximizes

corr(Xu, Yv) = u^t X^t Y v / (sqrt(u^t X^t X u) · sqrt(v^t Y^t Y v))    (5)

over all possible choices of u ∈ R^p and v ∈ R^q. It is straightforward to show that maximization of Equation (5) is equivalent to maximization of the function

f_3(r, t) = r^t (X^t X)^{-1/2} X^t Y (Y^t Y)^{-1/2} t    (6)

over unit vectors r and t. The problem is solved by choosing r = r_0 and t = t_0, where r_0 and t_0 are the unit vectors corresponding to the largest singular value (s_0) in the SVD of the matrix (X^t X)^{-1/2} X^t Y (Y^t Y)^{-1/2}. The unit vector r_0 obtained by maximization of f_3 is also an eigenvector of the matrix T^{-1/2} B T^{-1/2} corresponding
to its dominant eigenvalue λ = s_0^2, with the definitions T = X^t X and B = X^t Y (Y^t Y)^{-1} Y^t X. By defining a = T^{-1/2} r_0 and b = (Y^t Y)^{-1/2} t_0, a corresponding maximum of Equation (5) is obtained.
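Numerically, this amounts to an SVD of the whitened cross-product matrix followed by a back-transformation of the singular vectors. A minimal sketch of Equations (5) and (6) (our illustration; the optional ridge term is our addition to guard against ill-conditioned X^t X or Y^t Y, and is not part of the paper):

```python
import numpy as np

def _inv_sqrt(S):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(S)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def cca_first_pair(X, Y, ridge=0.0):
    """First canonical pair (a, b) and canonical correlation s_0;
    X (n x p) and Y (n x q) are assumed centered."""
    Ti = _inv_sqrt(X.T @ X + ridge * np.eye(X.shape[1]))   # T^(-1/2)
    Si = _inv_sqrt(Y.T @ Y + ridge * np.eye(Y.shape[1]))   # (Y'Y)^(-1/2)
    K = Ti @ (X.T @ Y) @ Si                                # kernel of f_3, Eq. (6)
    R, s, Tt = np.linalg.svd(K, full_matrices=False)
    a = Ti @ R[:, 0]                                       # a = T^(-1/2) r_0
    b = Si @ Tt[0, :]                                      # b = (Y'Y)^(-1/2) t_0
    return a, b, s[0]                                      # s_0 = corr(Xa, Yb)
```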
Note that a is also a dominant eigenvector of the matrix T^{-1} B, with λ as its associated eigenvalue. For classification problems where Y is the uncentered dummy coded group membership matrix, the above definition of B corresponds to the between-groups sum of squares and cross-products matrix (see Indahl et al. [3]). When X is centered, the computations of both B and T^{-1} B, as well as the eigenvalues and eigenvectors of the latter, are unaffected by whether Y is centered or not (because the factor Y (Y^t Y)^{-1} Y^t of B corresponds to the projection mapping onto the column space Col(Y), and the centered columns of X are already orthogonal to the associated subspace spanned by the constant vector 1 ∈ Col(Y)). The dominant eigenvector a defines the canonical loadings and the corresponding dominant canonical variate z = Xa of Fisher's canonical discriminant analysis (FCDA), and maximizes the two associated Rayleigh quotients

r_1(u) = u^t B u / u^t V u  and  r_2(u) = u^t B u / u^t T u

where V = T − B is the within-groups sum of squares and cross-products matrix associated with the optimization of FCDA (see Reference [3]). This important relationship between CCA and FCDA, well known from the literature, was first recognized by Bartlett [12].

Finally we note that with X^t X proportional to the p × p identity matrix, maximization of f_3 in Equation (6) is equivalent to maximization of f_2 in Equation (3), and if also Y^t Y is proportional to the q × q identity matrix, maximization of f_3 simplifies to solving the original PLS problem, i.e., maximization of f_1 in Equation (1).

3. NEW DEVELOPMENTS

3.1. Canonical PLS (CPLS)

From Reference [3] (Section 3.1) it follows that maximization of the expressions u^t X^t Y v and u_0^t Z^t Y v, where Z = XW, are equivalent. This equivalence follows directly by definition of the dominant right singular vector v = b as the dominant unit eigenvector in the eigendecomposition of W^t W, with s^2 as the corresponding maximal eigenvalue. The associated dominant left singular unit vector u = a is given by a = s^{-1} W b. Thus, optimization of the X, Y-covariance and optimization of the Z, Y-covariance are equivalent, and the solution of the latter can be found by SVD (or eigendecomposition) of

Z^t Y = W^t X^t Y = W^t W

For any n × p predictor matrix X, presumably with p ≫ n, and an n × q response matrix Y with q ≪ n, the dimensionality and/or potential multi-collinearity problems associated with modeling directly on the X data are largely avoided when X is replaced by Z = XW of dimension n × q. In particular, according to ordinary PLS theory, each column of Z corresponds to the direction of maximum sample covariance with the corresponding column of Y. Note that this direction also corresponds to the dominant component of ordinary principal component analysis (PCA) applied to W^t W. Finding the PLS2 component by SVD of W (or, equivalently, PCA of W^t W) does not take any further advantage of the available Y-information: it acts unsupervised on W = X^t Y with respect to the computation of a linear combination of the W-columns. CCA, on the other hand, relates Z to Y by considering the Y-information a second time. In this respect the PLS2 covariance maximization based on Equation (1) may be considered an unnecessarily modest optimization criterion.

As an improvement we suggest CPLS, where maximization of the Z, Y covariance is replaced by maximization of the Z, Y canonical correlation according to Equation (5) with Z replacing X, and the associated maximizing vectors denoted by c, d ∈ R^q. We define the unit vector w of CPLS loading weights as

w = Wc / ||Wc||

with the corresponding score vector t = Xw. For extraction of subsequent CPLS components we suggest deflation of the X-matrix (NIPALS) or the W-matrix (SIMPLS) based on the definitions of w and t. Other modifications are not required. Thus the largest possible number of extracted components coincides with the rank of the centered X-matrix.
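One CPLS component therefore only requires a CCA in q variables on top of the usual PLS machinery. A minimal sketch (ours, not the authors' code), reusing cca_first_pair from the sketch in Section 2.3:

```python
import numpy as np

def cpls_component(X, Y):
    """One CPLS component (Section 3.1); X and Y assumed centered."""
    W = X.T @ Y                         # one PLS1-type direction per response
    Z = X @ W                           # n x q: small, well-conditioned problem
    c, d, rho = cca_first_pair(Z, Y)    # maximize corr(Zc, Yd), Eq. (5)
    w = W @ c
    w /= np.linalg.norm(w)              # w = Wc / ||Wc||
    t = X @ w                           # CPLS score vector
    return w, t, rho
```

Since Z has only q columns, the CCA step remains cheap even when p is very large.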
Note that in comparison to PLS2, CPLS can be considered as finding a supervised linear combination of the columns of W. In particular, the CCA finds vectors c, d ∈ R^q so that the correlation between Zc (= XWc) and Yd is maximized. The corresponding CPLS loading weights w = Wc can be normalized without affecting the optimized correlation. Thus, compared to PLS2 and PLS-DA, the loading weights defined by CPLS aim more aggressively toward prediction of the Y-data, in the same sense that linear regression aims more aggressively toward prediction when compared to PCA. Hence, compared to the traditional PLS solutions, one should expect that the resulting CPLS models require fewer components and lead to simplified interpretations of the two-dimensional substructures associated with the model. Because solving the CCA problem in multi-variate regression with dummy responses is equivalent to solving the FCDA problem in classification, maximization of Equation (5) with Z replacing X avoids the PLS2/PLS-DA inconsistency.

3.2. Extensions

3.2.1. Weighted CPLS

Indahl et al. [3] suggested inclusion of prior probabilities in the calculation of PLS-DA components, and used this to motivate weighted generalizations of PLS-DA and FCDA. A similar weighted extension of ordinary CCA is straightforward. If the n × n diagonal weighting matrix D assigns a non-negative weight to each of the n individual observations associated with the Y data, a slight modification of Equation (5) implies maximization of the weighted correlation, i.e.,

wcorr(Zu, Yv, D) = u^t Z^t D Y v / (sqrt(u^t Z^t D Z u) · sqrt(v^t Y^t D Y v))    (7)

Maximization of Equation (7) is equivalent to maximization of the function

f_4(u, v) = u^t (Z^t D Z)^{-1/2} Z^t D Y (Y^t D Y)^{-1/2} v    (8)
restricted to unit vectors u and v. From a maximizing pair a_0, b_0 of Equation (8), the corresponding maximizing pair a, b of Equation (7) is given by a = (Z^t D Z)^{-1/2} a_0 and b = (Y^t D Y)^{-1/2} b_0. Here we assume that Z and Y are centered by subtraction of weighted means according to the diagonal elements of D.
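In code, the weighted first canonical pair can be obtained exactly as in the unweighted sketch of Section 2.3, with the weighted cross-products substituted. A minimal sketch (our illustration; d holds the diagonal of D, Z and Y are assumed centered with the d-weighted means, and _inv_sqrt is defined in the earlier CCA sketch):

```python
import numpy as np

def weighted_cca_first_pair(Z, Y, d):
    """First canonical pair under observation weights (Equations (7)-(8))."""
    DZ, DY = Z * d[:, None], Y * d[:, None]        # rows scaled by the weights
    Zi = _inv_sqrt(Z.T @ DZ)                       # (Z'DZ)^(-1/2)
    Yi = _inv_sqrt(Y.T @ DY)                       # (Y'DY)^(-1/2)
    K = Zi @ (Z.T @ DY) @ Yi                       # kernel of f_4, Eq. (8)
    A, s, Bt = np.linalg.svd(K, full_matrices=False)
    a = Zi @ A[:, 0]                               # back-transform a_0 -> a
    b = Yi @ Bt[0, :]                              # back-transform b_0 -> b
    return a, b, s[0]
```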
For estimation of the loading weights of weighted CPLS, we maximize the weighted canonical correlation between Z = XW and Y in Equation (7), analogously to the maximization of the ordinary canonical correlation in CPLS. The weighted CPLS is appropriate for:

• Classification problems with group priors (corresponding to identical weights for the samples within each group, but different weights associated with different groups).
• Regression and classification problems with individual weighting of the observations (possibly unique weights for all n observations).
3.2.2. CPLS with mixed responses

When applying PLS in situations where the response matrix Y contains a mixture of categorical and continuous columns (variables), a theoretical problem emerges because of the PLS2/PLS-DA inconsistency regarding maximization of covariance (see Section 2.2). Consequently, either choice between the two variants leaves the practitioner with an ad hoc strategy for extraction of components (except in the special cases of equally sized groups for the categorical data, or if all the columns of Y are individually standardized). With CPLS (weighted or unweighted), a mixture of categorical and continuous columns will not cause such problems, because scaling of the responses is not required (scaling has no effect on maximization of the canonical correlation).

3.2.3. The general formulation: CPLS with primary and additional responses

Although modeling and prediction of the selected responses from the predictors associated with X is our primary goal, additional information associated with each sample is often available as a by-product of the data generation process. By assumption, we consider this information not to be available for prediction of future samples. Examples of additional information may include additional reference measurements, the design factors from an experimental design used to generate the data, or something more exotic such as the fitted values of a model predicting the responses based on a set of predictors not supposed to be included in X.

Assume that the n × q_1 matrix Y_prim represents our primary response data (to be modeled for later predictions based on X-data), and that a corresponding set of additional information is available as the n × q_2 matrix Y_add. Although (by assumption) the variables corresponding to Y_add will not be available for prediction of future samples, we can still take advantage of this information when building the desired model. Combine the primary and additional information into the n × q (q = q_1 + q_2) super matrix Y = [Y_prim, Y_add], where prediction of Y_prim is the main task. With Y composed of the primary and additional blocks, we compute W = X^t Y and the corresponding transformed data Z = XW, followed by maximization of the canonical correlation between Z and Y_prim.

Thus the background information represented in Y_add is used to add extra columns to Z. To the extent that Y_add contains information relevant for prediction of Y_prim that is also present in the X-predictors, Y_add will contribute more emphasis on this information, enabling it to be more efficiently extracted into the CPLS components. Algorithm 1 shows the general formulation of CPLS including additional responses:

1. Calculate W = X^t Y with Y = [Y_prim, Y_add].
2. Transform the X-data to Z = XW.
3. With Z replacing X and Y_prim replacing Y, obtain the unit vectors a ∈ R^q, b ∈ R^{q_1} from maximization of Equation (5), or Equation (7) if a weighting matrix D is available, and calculate the optimal unit loading weight vector w = Wa/||Wa||.
4. Use the calculated loading weight vector w to find scores, p-loadings and do other required computations according to the preferred algorithm (such as SIMPLS, NIPALS etc.).
5. (a) Stop, or (b) deflate the data set before calculation of the next component (repeating steps 1–5).

Algorithm 1: The general formulation of CPLS.
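A compact implementation sketch of Algorithm 1 with NIPALS-style X-deflation (our illustration under the stated assumptions, not the authors' code; cca_first_pair is the CCA sketch from Section 2.3):

```python
import numpy as np

def cpls(X, Y_prim, Y_add=None, n_comp=5):
    """Extract n_comp CPLS components following Algorithm 1; returns
    loading weights (p x n_comp) and scores (n x n_comp)."""
    Xd = X - X.mean(axis=0)                        # centered working copy of X
    Y = Y_prim if Y_add is None else np.hstack([Y_prim, Y_add])
    Y = Y - Y.mean(axis=0)                         # Y = [Y_prim, Y_add], centered
    Yp = Y_prim - Y_prim.mean(axis=0)
    Ws, Ts = [], []
    for _ in range(n_comp):
        W = Xd.T @ Y                               # step 1: W = X'Y
        Z = Xd @ W                                 # step 2: Z = XW
        a, b, _ = cca_first_pair(Z, Yp)            # step 3: CCA of Z versus Y_prim
        w = W @ a
        w /= np.linalg.norm(w)                     # w = Wa / ||Wa||
        t = Xd @ w                                 # step 4: score vector
        p_load = Xd.T @ t / (t @ t)                # x-loadings
        Xd = Xd - np.outer(t, p_load)              # step 5(b): NIPALS deflation
        Ws.append(w)
        Ts.append(t)
    return np.column_stack(Ws), np.column_stack(Ts)
```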
possible.’
Although modeling and prediction of the selected responses
from the predictors associated with X is our primary goal, 3.2.4. Canonical PPLS
additional information associated with each sample is often
available as a by-product the data generation process. By The PPLS methods introduced in References [4] and [5] extend
assumption, we consider this information not to be available for the ordinary PLS methodology. An important feature of PPLS
prediction of future samples. Examples of additional information is its ability to reduce or eliminate the influence of predictors
may include additional reference measurements, the design less important to prediction. We complete our description of
factors from an experimental design used to generate the data, extensions based on canonical correlation by presenting the
or something more exotic such as the fitted values of a model canonical PPLS (CPPLS).
predicting the responses based on a set of predictors not in Reference [4], flexible trade-offs between the element wise
supposed to be included in X. correlations and variances defining the loading weights were
Assume that the n × q1 matrix Yprim represent our primary defined by
response data (to be modeled for later predictions based on X- 1−
data), and that a corresponding set of additional information is
w( ) = k · s1 corr x 1 , y 1− · std (x 1 ) ,...,
available as the n × q2 matrix Yadd . Although (by assumption)
the variables corresponding to Yadd will not be available for 1− t
prediction of future samples, we can still take advantage of sp corr x p , y 1− · std x p (9)
this information when building the desired model. Combine the
primary and additional information into the n × q (q = q1 + q2 ) where the power parameter is ranging from 0 to 1, sk denotes
super matrix Y = [Yprim , Yadd ] where prediction of Yprim is the the sign of the kth correlation and k is a scaling constant assuring
main task. With Y composed of the primary and additional blocks unit length of w( ). For a corresponding parameterization of CPLS
we compute W = Xt Y and the corresponding transformed data we need a generalized reformulation of the transformation matrix
Z = XW followed by maximization of the canonical correlation W = Xt Y.
between Z and Yprim . The columns of W essentially correspond to PLS1 loading
Thus the background information represented in Yadd is used weights whose directions maximize the covariance between X
to add extra columns to Z. To the extent that Yadd contains and the associated columns in Y. Hence we can factorize W =
information relevant for prediction of Yprim that is also present Sx CP, where Sx is a diagonal (p × p) matrix containing the column
498
S_x is a diagonal (p × p) matrix containing the column-wise standard deviations of X, C is a (p × q) matrix containing the pairwise correlations between the columns of X and Y, and P is a (q × q) diagonal scaling matrix with entries proportional to the column-wise standard deviations of Y. Because the factor P contributes to Z = XW only by scaling the columns of X S_x C, i.e., span(Z) = span(X S_x C), it is sufficient to consider the canonical correlation between Z_0 = X S_x C and Y_prim.

Accordingly, a parametric version of the simplified W_0 = S_x C that corresponds to Equation (9) is given by

W_0(γ) = S_x(γ) C(γ)    (10)

where

S_x(γ) = diag(std(x_1)^{(1−γ)/γ}, ..., std(x_p)^{(1−γ)/γ})    (11)

and C(γ) is the (p × q) matrix with entries

c_kj(γ) = s_kj |corr(x_k, y_j)|^{γ/(1−γ)}    (12)
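A sketch of this weight matrix with the power parameterization as reconstructed in Equations (10)-(12) (our illustration; gamma must lie strictly inside (0, 1) so that both exponents stay finite):

```python
import numpy as np

def cppls_weight_matrix(X, Y, gamma):
    """W_0(gamma) = S_x(gamma) C(gamma), Equations (10)-(12);
    X and Y are assumed centered, 0 < gamma < 1."""
    p = X.shape[1]
    sd = X.std(axis=0, ddof=1)                     # column std devs of X
    # p x q block of pairwise correlations corr(x_k, y_j)
    C = np.corrcoef(X, Y, rowvar=False)[:p, p:]
    Sx_g = sd ** ((1.0 - gamma) / gamma)           # diagonal of S_x(gamma)
    C_g = np.sign(C) * np.abs(C) ** (gamma / (1.0 - gamma))
    return Sx_g[:, None] * C_g                     # W_0(gamma)
```

Feeding W_0(γ) into the CPLS steps in place of W, with γ optimized separately for each component, gives the CPPLS components used in the examples below.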
4. EXAMPLES

4.1. Speech data

The phoneme data (log-periodograms of spoken phonemes; see Hastie et al. [13] and Indahl et al. [3]) contain five groups:

• group 1: 'aa' (695 samples)
• group 2: 'ao' (1022 samples)
• group 3: 'dcl' (757 samples)
• group 4: 'iy' (1163 samples)
• group 5: 'sh' (872 samples)

As in Indahl et al. [3], we impose an 'artificial' context for this data set by introducing the prior distribution

π = {π_1 = 0.47, π_2 = 0.47, π_3 = 0.02, π_4 = 0.02, π_5 = 0.02}    (13)

The priors are chosen to increase focus on the two phonemes that are hardest to distinguish from one another. Accordingly, the data set is split into a training set of 4009 samples and a test set of 500 samples. The test set contains 235 samples from each of groups 1 and 2, and 10 samples from each of groups 3, 4, and 5, to reflect the specified prior distribution of Equation (13). From the training data we compute models with up to 15 components according to the following four different strategies:

1. PLS-DA using empirical priors.
2. PLS-DA using the specified priors.
3. CPLS using the specified priors.
4. CPPLS using the specified priors.
[Figure 1: two panels of % correctly classified (45–80%) versus component number (0–15).]

Figure 1. [Speech data] Classification results for components extracted by PLS-DA using empirical priors (dashed), PLS-DA using specified priors (dotted), CPLS using specified priors (solid), and CPPLS using specified priors (dot-dashed).
• group 1: ‘Soybean’ (30 training samples and 12 test samples) component modeling. Figure 3 shows the different associated
• group 2: ‘Sunflower’ (18 training samples and 6 test samples) score plots.
• group 3: ‘Canola’ (15 training samples and 9 test samples) In Figure 4, the first two vectors of loading weights for
• group 4: ‘Olive’ (12 training samples and 12 test samples) PLS-DA and CPPLS without and with additional responses are
• group 5: ‘Corn’ (24 training samples and 0 test samples) shown. The first loading weight vector from both the powered
• group 6: ‘Grapeseed’ (21 training samples and 3 test samples) models show distinct peaks around 1700 nm and between
2200 and 2400 nm. The PLS-DA loadings on the other hand
In the modeling, we have compared the LDA-success rates based
indicate no distinct focus on particular predictors. The improved
on up to 20 components obtained by:
focus of the CPPLS models corresponds to the fact that -
1. PLS-DA with empirical priors (without inclusion of additional values quite close to 0 (emphasizing wavelengths of large
responses). variance) dominated the computation of the first component.
2. CPLS with empirical priors (without inclusion of additional Figure 5 illustrates the s found to be optimal for the different
responses). components of the two CPPLS approaches. The -values close
3. CPLS with empirical priors (including the five design to 0 or 1 correspond to the more focused vectors of loading
parameters as additional responses). weights. Regarding model interpretation, the more focused
4. CPPLS with empirical priors (without inclusion of additional loadings resulting from the powered methodology may lead
responses). to significant simplifications compared to the traditional PLS
5. CPPLS with empirical priors (including the five design methods.
parameters as additional responses).
4.3. Regression with several continuous responses
The results of the five strategies are shown in Figures 2–5.
Both with and without inclusion of additional responses, To illustrate modeling with several continuous responses we
the initial CPLS-components discriminate much better than the analyze a data set where the predictors are raw NIR measurements
components derived by PLS-DA. With five CPLS-components (700 wavelengths, 1100–2498 nm in steps of 2 nm) measured on
including additional responses, we obtain cross-validated and biscuit dough. The calibration set has N = 40 samples of p = 700
test set success rates of 96.7 and 100%, respectively. Similar variables. For each sample, four response variables representing
success rates were found for CPLS without additional responses percentages of fat, sucrose, flour, and water, respectively, have
(95.8 and 100% after nine components, and for PLS-DA 96.7 and been measured. A corresponding set of 32 samples is reserved for
100% after thirteen components). testing of the candidate models. The two sets have been created
In contrast to the phoneme results, components extracted from and measured on different occasions. Further descriptions and
the CPPLS approaches simplifies the classification significantly. modeling based on this data set are reported in Brown et al. [15].
Without additional responses, the cross-validated and test set We compare the Root Mean Squared Error of Cross-Validation
success rates are 94.2 and 100% with five component modeling. (10-fold cross-validation) and Prediction (RMSECV and RMSEP) for
With inclusion of additional responses we obtain cross-validated different modeling strategies including up to 20 components for
and test set success rates of 95.8 and 100%, respectively, by two each of the response variables. The modeling strategies are:
[Figure 2: two panels of % correctly classified (0–90%) versus component number (0–20).]

Figure 2. [Mayonnaise data] Classification results for components extracted by PLS-DA (dashed), CPLS without additional responses (solid), CPLS including additional responses (dotted), CPPLS without additional responses (dot-dashed), and CPPLS including additional responses (solid with dots). All methods use the empirical priors.
[Figure 3: score plots of the second component versus the first component, five panels, symbols for the six oil groups.]

Figure 3. [Mayonnaise data] Score plots for components extracted by PLS-DA, CPLS without additional responses, CPLS including additional responses, CPPLS without additional responses, and CPPLS including additional responses. All methods use the empirical priors.
[Figure 4: loading weights versus wavelength (1200–2400 nm); panels for PLS-DA and CPPLS/PPLS-DA.]

Figure 4. [Mayonnaise data] Loading plots for the first two components extracted by PLS-DA, canonical powered PLS without additional responses, and canonical powered PLS including additional responses. All methods use the empirical priors.
[Figure 5: optimal γ-values (0–1) versus component number (1–20).]

Figure 5. [Mayonnaise data] γ-values from the models for canonical powered PLS without additional responses (solid) and canonical powered PLS including additional responses (dashed). Both methods use the empirical priors.
The modeling strategies are:

1. PLS1 with separate modeling for each of the four response variables.
2. CPLS with separate modeling for each of the four response variables as the primary response and the other three response variables as additional responses.
3. CPPLS with separate modeling for each of the four response variables as the primary response and the other three response variables as additional responses. (To put focus on the predictors most correlated to the response, the domain for optimization of the power parameter γ is restricted to the interval [0.9, 1].)

Prediction results and regression coefficients for the three strategies are shown in Figures 6 and 7. The results confirm the impression that CPLS including additional responses in the modeling identifies good models with fewer components than ordinary PLS1. Furthermore, it avoids the instabilities indicated by the test data for the two-component PLS1 models. With an exception for the fat response, CPPLS performs slightly better than CPLS in prediction of the test data. The regression coefficients (see Figure 7) of CPPLS are also more focused on particular predictors (from 1900 to 2100 nm and 2000 to 2200 nm) compared to the other two methods, for all the responses. This is caused by restricting the γ-domain to [0.9, 1] in the CPPLS modeling. The associated powers of the standard deviation block and the correlation block are then < 1/9 and > 9, respectively, and these restrictions force a sharpened focus on the variables in X with the largest correlations to the response variable.
[Figure 6: RMSECV (left column) and RMSEP (right column) versus component number (0–20), one row per response (fat, sucrose, flour, water).]

Figure 6. [Dough data] Prediction results for each of the four responses. Components extracted by PLS1 (dashed), CPLS with one primary response and inclusion of the other three reference variables as additional responses (solid), and CPPLS with one primary response and inclusion of the other three reference variables as additional responses (dotted).
[Figure 7: regression coefficients versus wavelength (1200–2400 nm); panels for fat, sucrose, flour, and water (3-component models).]

Figure 7. [Dough data] Regression coefficients for each of the four responses corresponding to the test data in Figure 6. Components extracted by PLS1 (dashed), CPLS with one primary response and inclusion of the other three reference variables as additional responses (solid), and CPPLS with one primary response and inclusion of the other three reference variables as additional responses (dotted).
5. DISCUSSION/CONCLUSIONS

In summary, we consider the following aspects of the CPLS methodology to be of particular importance: (1) CPLS as a generalization of the traditional single response PLS. (2) The ability of CPLS to extract good components for both regression and classification problems. (3) The possibility of exploiting additional responses for more powerful modeling when such data are available. (4) Further extensions from CPLS to CPPLS.

(1) CPLS is a genuine generalization of single response PLS (PLS1). With a single primary response (continuous or two-group categorical) and no additional responses, the mathematical formulation of CPLS simplifies to the classical PLS1, because the former is based on forming linear combinations of the columns in W = X^t Y, and the number of columns in W and Y is in this case identical to 1.

(2) The temporary loading weights matrix W = X^t Y defines directions maximizing the covariances between the X data and the individual Y responses according to classical PLS1. As linear combinations of the columns in W, the CPLS components (and the associated loading weights) retain a close relationship to the established PLS methodology. The transformation Z = XW temporarily maps X to a subspace where serious collinearity problems are avoided. The computations required for optimization can be executed efficiently, because the canonical loading weights of CPLS are obtained by scaling w = Wa to unit length, where the coefficient vector a is found by solving a canonical correlation problem with not more than q variables in either of the involved matrices. The examples analyzed above indicate that CPLS is more effective than the traditional PLS methods in the sense that simpler models (fewer components) are required to obtain good predictions. Because the method provides optimal components for both continuous and categorical response variables, it can also be applied without modifications when the columns of Y contain a mixture of the two types.

(3) Inclusion of additional responses for the purpose of predicting a set of primary responses is unique to CPLS. With CPLS it is possible to directly include a broader context of the particular prediction problem (such as background reference measurements, design factors from experimental designs, fitted values from prediction models of Y_prim based on measurements not included in X, etc.) in the model building. The examples of classification including additional responses and regression with several continuous responses indicate that using additional information from the data generation process has the potential of contributing to the building of simpler and more stable models. Note that if the matrix of available additional responses has many columns compared to the number of rows (observations), some action (PCA or other suitable data compression techniques) must be taken to prevent collinearity problems. Finally, we stress that the additional information (possibly unavailable or too 'expensive') is not required for samples only to be used for testing or application of a model based on the CPLS methodology. Only (X, Y_prim)-data are required for model testing, and for prediction only X-measurements are required.

(4) The extension to CPPLS incorporates the basics of the PPLS methodology for computation of prediction models. For the mayonnaise data, CPPLS including additional responses gave the most efficient models with respect to the number of required components (2) and the complexity of the associated loading weights. It should also be noted that, with appropriate restrictions of the most general formulation including additional responses, each of the following:
– PLS1 (fixed γ = 0.5 and Y = Y_prim = y)
– CPLS (fixed γ = 0.5)
– PPLS (Y = Y_prim = y)
– PPLS-DA (Y = Y_prim = [y_1 ... y_g])
can be considered as a special case of CPPLS.
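The PLS1 special case is easy to check numerically with the earlier sketches: with a single centered response and no additional block, W = X^t y has one column, the CCA step becomes trivial, and the first CPLS loading weight vector is proportional to X^t y. A toy verification on synthetic data (our illustration; cpls is the Algorithm 1 sketch from Section 3.2.3):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 50))          # toy predictor block
y = rng.standard_normal((40, 1))           # one continuous response

W_cpls, T_cpls = cpls(X, y, Y_add=None, n_comp=1)

Xc, yc = X - X.mean(axis=0), y - y.mean(axis=0)
w_pls1 = (Xc.T @ yc).ravel()
w_pls1 /= np.linalg.norm(w_pls1)           # classical PLS1 loading weights

# agreement up to an arbitrary sign flip
print(np.allclose(np.abs(W_cpls[:, 0]), np.abs(w_pls1)))   # True
```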
The indicated differences between the various PLS approaches considered in this paper are partly empirical. More research is required to establish valid theoretical guidelines for selection between the alternatives. A particular issue of interest is the consequence of modifying the various PLS algorithms (deflation strategies) according to CPLS. Simulation results indicate the possibility that the CPLS versions of SIMPLS and NIPALS (with multiple responses) may lead to equivalent solutions.

REFERENCES
1. Barker M, Rayens W. Partial least squares for discrimination. J. Chemometr. 2003; 17: 166–173.
2. Nocairi H, Qannari EM, Vigneau E, Bertrand D. Discrimination on latent components with respect to patterns. Application to multicollinear data. Comput. Stat. Data Anal. 2005; 48: 139–147.
3. Indahl U, Martens H, Næs T. From dummy regression to prior probabilities in PLS-DA. J. Chemometr. 2007; 21: 529–536.
4. Indahl U. A twist to partial least squares regression. J. Chemometr. 2005; 19: 32–44.
5. Liland KH, Indahl UG. Powered PLS discriminant analysis. J. Chemometr. 2009; 23(1): 7–18.
6. Martens H, Næs T. Multivariate Calibration. John Wiley and Sons: Chichester, UK, 1989.
7. Martens H, Martens M. Multivariate Analysis of Quality. An Introduction. John Wiley and Sons: Chichester, UK, 2001.
8. Wold S, Sjöström M, Eriksson L. PLS-regression: a basic tool of chemometrics. Chemometr. Intell. Lab. Syst. 2001; 58: 109–130.
9. de Jong S. SIMPLS: an alternative approach to partial least squares regression. Chemometr. Intell. Lab. Syst. 1993; 18: 251–263.
10. Burnham AJ, Viveros R. Frameworks for latent variable multivariate regression. J. Chemometr. 1996; 10: 31–45.
11. Mardia KV, Kent JT, Bibby JM. Multivariate Analysis. Academic Press: New York, 1979.
12. Bartlett MS. Further aspects of the theory of multiple regression. Proceedings of the Cambridge Philosophical Society 1938; 34: 33–40.
13. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning. Springer: New York, 2001.
14. Indahl U, Sahni NS, Kirkhus B, Næs T. Multivariate strategies for classification based on NIR-spectra - with application to mayonnaise. Chemometr. Intell. Lab. Syst. 1999; 49: 19–31.
15. Brown PJ, Fearn T, Vannucci M. Bayesian wavelet regression on curves with application to a spectroscopic calibration problem. J. Am. Stat. Assoc. 2001; 96: 398–408.