The Geometry of PLS1 Explained Properly
Received: 6 August 2013, Revised: 11 December 2013, Accepted: 11 December 2013, Published online in Wiley Online Library
The insights from, and conclusions of, this paper motivate efficient and numerically robust 'new' variants of algorithms for solving the single response partial least squares regression (PLS1) problem. Prototype MATLAB code for these variants is included in the Appendix. The analysis of, and conclusions regarding, PLS1 modelling are based on a rich and nontrivial application of numerous key concepts from elementary linear algebra. The investigation starts with a simple analysis of the nonlinear iterative partial least squares (NIPALS) PLS1 algorithm variant computing orthonormal scores and weights.
A rigorous interpretation of the squared P-loadings as the variable-wise explained sum of squares is presented.
We show that the orthonormal row-subspace basis of W-weights can be found from a recurrence equation. Conse-
quently, the NIPALS deflation steps of the centered predictor matrix can be replaced by a corresponding sequence of
Gram–Schmidt steps that compute the orthonormal column-subspace basis of T-scores from the associated
non-orthogonal scores.
The transitions between the non-orthogonal and orthonormal scores and weights (illustrated by an easy-to-grasp
commutative diagram), respectively, are both given by QR factorizations of the non-orthogonal matrices. The
properties of singular value decomposition combined with the mappings between the alternative representations of
the PLS1 ‘truncated’ X data (including Pt W) are taken to justify an invariance principle to distinguish between the PLS1
truncation alternatives. The fundamental orthogonal truncation of PLS1 is illustrated by a Lanczos bidiagonalization
type of algorithm where the predictor matrix deflation is required to be different from the standard NIPALS deflation.
A mathematical argument concluding the PLS1 inconsistency debate (published in 2009 in this journal) is also
presented. Copyright © 2014 John Wiley & Sons, Ltd.
Keywords: PLS1 algorithms; bidiagonalization; orthogonal and non-orthogonal weights; scores and projections; change of
coordinates and bases; truncation; QR factorization; singular value decomposition; reorthogonalization
(d) a mathematical conclusion of the PLS1 inconsistency debate by considering the key orthogonal and non-orthogonal projections involved in PLS1 model building.

The main source of inspiration for the work presented subsequently was found in the two publications [12,18] by Ergon.

2. THE NIPALS PLS1 ALGORITHM WITH ORTHONORMAL SCORES

The widely applied NIPALS PLS1 algorithm with orthogonal (but not normalized) scores [1–3] is usually considered the benchmark for comparison of other algorithmic approaches to PLS1 modelling. According to [20], the NIPALS PLS1 is relatively slow but numerically stable in most practical situations. The main reason for its lack of speed is the extensive data matrix deflation, which requires computation of the outer products between each extracted component and the corresponding loadings. In the typical applications of PLS, where the number of predictors is large compared to the number of observations, deflation of the predictor matrix is a computationally expensive way to extract the desired sets of orthogonal PLS1 scores (and weights).

To make the mathematics and interpretations as transparent as possible, we will focus on the version of the NIPALS PLS1 algorithm (algorithm 1) that computes orthonormal scores (orthogonal unit vectors). As usual, we will assume that X0 is the mean-centered version of the n × m predictor matrix X and that y is an n-dimensional response vector whose entries are associated with the corresponding rows of X.

According to the conventional PLS1 terminology and properties, the matrices of scores (T) and weights (W) are both orthogonal, that is, Tt T = Wt W = IA (the A × A identity matrix). Hence, the associated vectors represent orthonormal bases for the PLS1 column and row subspaces, respectively, with TTt and, last but not least, WWt as the associated orthogonal projections. The column vectors of P and the entries of q are referred to as the corresponding X- and y-loadings of the associated PLS1 model.

By application of elementary linear algebra to the various parts of the NIPALS algorithm described above, we are going to establish a sequence of fundamental PLS1 modelling properties listed as notes in the following sections.
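Since the boxed listing of algorithm 1 is not reproduced here, and its numbered steps are referenced repeatedly in the notes below (step 1 computes the weight direction $v_a = X_{a-1}^t y$, and step 6 performs the deflation), the following MATLAB sketch is offered purely as a reading aid consistent with those references. The function name and the exact ordering of steps 2–5 are illustrative assumptions, not the paper's own listing:

    % Sketch of NIPALS PLS1 with orthonormal scores (reading aid only).
    % X0: centered n-by-m predictors; y: n-by-1 response; A: components.
    function [T, W, P, q] = pls1_nipals(X0, y, A)
        [n, m] = size(X0);
        T = zeros(n, A); W = zeros(m, A); P = zeros(m, A); q = zeros(A, 1);
        Xa = X0;                  % deflated predictor matrix X_{a-1}
        for a = 1:A
            v = Xa' * y;          % step 1: weight direction v_a = X_{a-1}^t y
            w = v / norm(v);      % step 2: orthonormal weight w_a
            tau = Xa * w;         % step 3: score direction tau_a
            t = tau / norm(tau);  % step 4: orthonormal score t_a
            p = Xa' * t;          % step 5: X-loadings p_a (= X0^t t_a, note 1)
            Xa = Xa - t * p';     % step 6: deflation (the costly part)
            T(:, a) = t; W(:, a) = w; P(:, a) = p; q(a) = y' * t;
        end
    end

With orthonormal scores, the regression coefficients then follow as b = W*((P'*W)\q) (compare note 8 below).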
3. PLS1 WITHOUT THE DEFLATION STEP OF NIPALS

Note 1: explained variance by P-loadings

The X-loadings identity P = Xt0 T follows from

$$ X_0^t t_a = X_{a-1}^t t_a + (X_0 - X_{a-1})^t t_a = X_{a-1}^t t_a = p_a \quad (1) $$

because the column space of the matrix $(X_0 - X_{a-1})$ is spanned by the orthonormal subset $\{t_1, \ldots, t_{a-1}\}$ of scores that are all orthogonal to $t_a$.

Each row of P represents the coordinates of the corresponding X0-column w.r.t. the orthonormal column-subspace basis associated with T. Hence, P can also be referred to as the projected variables coordinate matrix.

The projection of the i-th column $x_i$ of X0 onto the direction of the score basis vector $t_a$ equals $x_{ia} = t_a t_a^t x_i = (x_i^t t_a)\, t_a$, and the coordinate value $p_a(i) = x_i^t t_a$ relates directly to the amount of $x_i$-variance $[\,= \frac{1}{n} x_i^t x_{ia}]$ accounted for by the a-th PLS1 component (i.e. $t_a$), that is,

$$ \frac{1}{n}\, x_{ia}^t x_{ia} = \frac{1}{n} \left( x_i^t t_a \right)^2 = \frac{1}{n}\, p_a(i)^2 \quad (2) $$

Hence, with orthonormal scores, the explained sum of squares corresponding to the i-th variable ($x_i$, 1 ≤ i ≤ m) accounted for by the PLS1 model is found by squaring and adding all (A) entries in the i-th row of P.
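The interpretation in note 1 is easy to check numerically; a minimal sketch (assuming T and P from a run of the pls1_nipals sketch above):

    Xtr = T * P';                     % PLS1-truncated version of X0
    ss_explained = sum(Xtr.^2, 1);    % explained sum of squares per variable
    ss_loadings  = sum(P.^2, 2)';     % row-wise sums of squared P-loadings
    disp(max(abs(ss_explained - ss_loadings)))   % ~ machine precision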
Note 2: W-weights recurrence equation gives NIPALS alternative without deflation

By step 1 in algorithm 1, the vectors $v_a^t = y^t X_{a-1}$ (a = 1, ..., A). Left multiplication by $y^t$ in step 6 of algorithm 1 gives the following identities and implications:

$$ y^t X_a = y^t X_{a-1} - y^t t_a p_a^t \;\Rightarrow\; v_{a+1} = v_a - \frac{w_a^t v_a}{\|\tau_a\|}\, p_a = v_a - \frac{\|v_a\|}{\|\tau_a\|}\, p_a $$

$$ \Rightarrow\; v_{a+1} = \|v_a\| \left( w_a - \frac{1}{\|\tau_a\|}\, p_a \right) \quad (3) $$

By normalization of $v_{a+1}$, we obtain $w_{a+1} = \frac{1}{\|v_{a+1}\|}\, v_{a+1}$. Hence, (3) followed by normalization defines a recurrence equation for computing the orthonormal PLS1 weights. The associated nested sequence of vector equations can be entirely solved from the starting vectors $w_1, t_1, p_1$ (a = 1) and norms $\|v_1\|$, $\|\tau_1\|$ that are all available before execution of the first deflation step in the NIPALS algorithm. Note that Equation (3) with the succeeding normalization is equivalent to the content of lemma 2 in [21].

For a > 1, a trivial projection argument shows that the score vector $t_a = \frac{1}{\|\tau_a\|}\, \tau_a$ is obtained from a Gram–Schmidt (GS) orthogonalization step of the vector $\tau_a = X_0 w_a$ with respect to the preceding orthonormal scores $\{t_1, \ldots, t_{a-1}\}$ that form a basis for the column space of $(X_0 - X_{a-1})$. Because the corresponding loading vector can be found directly by $p_a = X_0^t t_a$ according to note 1, $w_{a+1}$ can (by induction) be found without executing the deflation of X0.

From notes 1 and 2, we can now establish a mathematically equivalent PLS1 algorithm where deflation of X0 is replaced with a GS step to obtain the orthonormal scores:
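The paper's prototype of algorithm 2 is the one listed in Appendix A.1 (not reproduced here). The following MATLAB sketch only mirrors the description in notes 1 and 2, with the recurrence (3) replacing the NIPALS weight computation and a GS step replacing the deflation; variable names are illustrative:

    % Sketch of algorithm 2: deflation-free PLS1 (mirrors notes 1 and 2).
    function [T, W, P, q] = pls1_nodeflation(X0, y, A)
        [n, m] = size(X0);
        T = zeros(n, A); W = zeros(m, A); P = zeros(m, A); q = zeros(A, 1);
        v = X0' * y;                       % starting vector v_1
        for a = 1:A
            nv = norm(v);
            w = v / nv;                    % orthonormal weight w_a
            tau = X0 * w;                  % non-orthogonal (Martens) score
            if a > 1                       % step 4: GS orthogonalization
                tau = tau - T(:, 1:a-1) * (T(:, 1:a-1)' * tau);
            end
            ntau = norm(tau);
            t = tau / ntau;                % step 5: normalization
            p = X0' * t;                   % loadings directly from X0 (note 1)
            v = nv * (w - p / ntau);       % recurrence (3): next weight direction
            T(:, a) = t; W(:, a) = w; P(:, a) = p; q(a) = y' * t;
        end
    end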
Note 3: the W-weights recurrence equation shows that Pt W is bidiagonal

By a re-arrangement of Equation (3), the X-loadings

$$ p_a = \|\tau_a\| \left( w_a - \frac{\|v_{a+1}\|}{\|v_a\|}\, w_{a+1} \right), \quad a = 1, \ldots, A < m \quad (4) $$

appear as linear combinations of exactly two successive weights. These vector equations of (4) are compactly expressed by the matrix product

$$ P = W_+ B_2 \quad (5) $$

where the coordinate matrix

$$ B_2 = \begin{bmatrix}
\|\tau_1\| & & & 0 \\
-\|\tau_1\| \frac{\|v_2\|}{\|v_1\|} & \|\tau_2\| & & \\
 & -\|\tau_2\| \frac{\|v_3\|}{\|v_2\|} & \ddots & \\
 & & \ddots & \|\tau_A\| \\
0 & & & -\|\tau_A\| \frac{\|v_{A+1}\|}{\|v_A\|}
\end{bmatrix} \quad (6) $$

of P is $(A+1) \times A$ lower bidiagonal, and the corresponding orthonormal basis vectors are arranged in the augmented $m \times (A+1)$ weight matrix $W_+ = [W \;\; w_{A+1}]$. Consequently,

$$ P^t W = B_2^t W_+^t W = \begin{bmatrix}
\|\tau_1\| & -\|\tau_1\| \frac{\|v_2\|}{\|v_1\|} & & & 0 \\
 & \|\tau_2\| & -\|\tau_2\| \frac{\|v_3\|}{\|v_2\|} & & \\
 & & \ddots & \ddots & \\
 & & & \|\tau_{A-1}\| & -\|\tau_{A-1}\| \frac{\|v_A\|}{\|v_{A-1}\|} \\
0 & & & & \|\tau_A\|
\end{bmatrix} \quad (7) $$

is upper bidiagonal and of size A × A. Note that Pt W equals the transposed coordinates of P truncated to the A basis (column) vectors of W. According to (7), Pt W can therefore be found directly from the normalizing constants of the orthogonal scores (T) and weights (W) only.
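Equation (7) is easy to verify numerically; a small sketch (assuming T, W and P from either of the sketches above):

    B = P' * W;                                  % should be upper bidiagonal
    Bbi = diag(diag(B)) + diag(diag(B, 1), 1);   % keep diagonal + superdiagonal
    disp(norm(B - Bbi, 'fro'))                   % ~ 0 up to rounding
    % diag(B) holds the ||tau_a||; diag(B,1) the -||tau_a||*||v_{a+1}||/||v_a||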
Note 4: some computational issues in algorithms 1 and 2

The main difference between algorithms 1 and 2 is that the computationally 'expensive' deflation (step 6) in algorithm 1 (involving the computation and subtraction of A large vector outer products) is accounted for by a considerably less expensive GS orthogonalization (step 4) in algorithm 2 (involving the indicated computationally 'moderate' matrix–vector products only). Consequently, an implementation of algorithm 2 executes considerably faster than the NIPALS PLS1. The GS orthogonalization step replacing the deflation step of algorithm 1 indicates that the numerical robustness in proper implementations of the two algorithms should be similar.

The competitiveness of algorithm 2 with the fastest 'stable' PLS1 algorithms discussed in [20] will not be investigated in detail in the present paper, but some comments on speed and precision are given in Section 5. A prototype MATLAB implementation of algorithm 2 is given in Appendix A.1.

4. ADDITIONAL NOTES ON MATHEMATICS AND ALGORITHMIC FACETS OF PLS1

The vectors $\tau_a$ of algorithm 2 coincide with the non-orthogonal scores found by Martens' alternative PLS1 algorithm (see frame 3.5 in [3]) that was shown by Helland [7] to be mathematically equivalent to the NIPALS algorithm: both algorithms share the same set of orthogonal weights.

Note 5: alternative coordinates and interpretations by elementary linear algebra

By inspection of algorithm 2, the orthonormal scores $T = [t_1\; t_2\; \ldots\; t_A]$ are successively derived from the non-orthogonal scores $T_* = X_0 W = [\tau_1\; \tau_2\; \ldots\; \tau_A]$ by the GS-orthonormalization steps (4 and 5) establishing an orthonormal basis for the subspace spanned by the non-orthogonal scores.

It should not be ignored that the rows of the non-orthogonal score matrix T* represent coordinates of the observations with respect to the orthonormal row-subspace basis $W = [w_1\; w_2\; \ldots\; w_A]$ of PLS1 weights (the coordinate interpretation of the T* entries is valid because we compute inner products
between the X0 observations and the W basis vectors). From the latter interpretation (focusing on the rows of observations), it makes sense to refer to T* as the Martens coordinate matrix w.r.t. the basis W.

The associated orthogonal projection of the centered data matrix X0 onto the row subspace spanned by the orthonormal basis W results in the truncation X0tr given by

$$ X_0^{tr} = T_* W^t = X_0 W W^t \quad (8) $$

with respect to the original coordinates. Note that $T_* = X_0 W = X_0 W W^t W = X_0^{tr} W$. Hence, the GS steps for deriving the orthonormal scores from the Martens coordinate matrix imply the existence of an invertible upper triangular (A × A) matrix D2 that when paired with T represents a QR factorization of T* (compare with section 3 in [18]),

$$ T_* \; [= X_0 W] = T D_2 \quad (9) $$

Left multiplication by $T^t$ in Equation (9) solves for D2 as follows:

$$ T^t T_* = T^t X_0 W = T^t T D_2 = D_2 \quad (10) $$

Thus, with respect to the orthonormal column-subspace basis T, an alternative set of (untruncated) coordinates for the non-orthogonal scores (the columns of T*) is given by the corresponding columns of

$$ D_2 = T^t X_0 W = P^t W \quad (11) $$

which is a bidiagonal matrix according to note 3. Consequently, in PLS1 modelling, the matrix product Pt W has two different coordinate interpretations.

Right multiplication of T* by $D_2^{-1}$ in the preceding QR factorization (9) gives

$$ T = T_* D_2^{-1} = X_0 W D_2^{-1} = X_0 W_* \quad (12) $$

where

$$ W_* = W D_2^{-1} = W (T^t T_*)^{-1} = W (P^t W)^{-1} \quad (13) $$

is the matrix of corresponding non-orthogonal weights. W* coincides with the (non-orthogonal) weights matrix directly computed by the mathematically equivalent SIMPLS algorithm [22] (in the case of a single response vector y). Finally, a left multiplication of Equation (13) by $W^t$ shows that

$$ D_2^{-1} = W^t W_* \quad (14) $$

The basic algebraic properties of PLS1 (just pointed out in notes 4 and 5) are illustrated by the commutative diagram shown in Figure 1.

[Figure 1. Commutative diagram showing the elementary linear algebra of PLS1 modelling. Arrows indicate multiplication from the right by the corresponding matrices. (All directed paths in the diagram with the same start and endpoints lead to the same result by composition.)]
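The transitions (9)–(14) can be traced with MATLAB's economy-size QR factorization; a sketch (assuming X0, y, A and the pls1_nodeflation sketch above; the sign correction handles the usual sign indeterminacy of QR):

    [T, W, P, q] = pls1_nodeflation(X0, y, A);
    Tstar = X0 * W;            % Martens coordinate matrix, eq. (9)
    D2 = T' * Tstar;           % upper bidiagonal, equals P'*W, eqs (10)-(11)
    Wstar = W / D2;            % non-orthogonal (SIMPLS) weights, eq. (13)
    [Q, R] = qr(Tstar, 0);     % QR factorization of the non-orthogonal scores
    s = sign(diag(R));
    Q = Q * diag(s); R = diag(s) * R;   % now Q matches T and R matches D2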
Note 6: QR factorization of the non-orthogonal W*-weights

The inverse of an upper triangular matrix is always upper triangular. Because W is orthonormal (Wt W = IA), the first identity $W_* = W D_2^{-1}$ of Equation (13) represents a QR factorization of the non-orthogonal weight matrix W* because $D_2^{-1}$ is upper triangular.

Note 7: everything can be calculated if you know W or W*

Any PLS1 algorithm (mathematically equivalent to the NIPALS algorithm) computing either the orthonormal weights W or the corresponding non-orthogonal weights W* can easily be modified to compute all the remaining PLS1 model quantities.
Writing the upper bidiagonal matrix Pt W of (7) in the generic form

$$ U = \begin{bmatrix}
d_1 & -b_1 & & & 0 \\
 & d_2 & -b_2 & & \\
 & & \ddots & \ddots & \\
 & & & d_{A-1} & -b_{A-1} \\
0 & & & & d_A
\end{bmatrix} \quad (17) $$

we have $\tilde{u}_1^t = \begin{bmatrix} \frac{1}{d_1} & 0 & \cdots & 0 \end{bmatrix}$ and, for $1 < a \le A$,

$$ \tilde{u}_a = \frac{b_{a-1}}{d_a}\, \tilde{u}_{a-1} + \begin{bmatrix} 0 & \cdots & \tfrac{1}{d_a} & \cdots & 0 \end{bmatrix}^t $$

that is, the a-th (and only nonzero) entry of the last vector is equal to $\frac{1}{d_a}$; see [25]. Hence, by taking $d_a = \|\tau_a\|$ and $b_{a-1} = \|\tau_{a-1}\| \frac{\|v_a\|}{\|v_{a-1}\|}$, the desired columns of the upper triangular inverse $(P^t W)^{-1}$ follow successively.
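The recurrence is immediately implementable; the helper below is an illustrative sketch (names assumed) and not the listing of Appendix A.3:

    % Columns of inv(U) for upper bidiagonal U with diagonal d and
    % superdiagonal -b, built successively as in the recurrence above.
    function Uinv = bidiag_inv(d, b)
        A = numel(d);
        Uinv = zeros(A);
        Uinv(1, 1) = 1 / d(1);
        for a = 2:A
            Uinv(:, a) = (b(a-1) / d(a)) * Uinv(:, a-1);  % recycled column
            Uinv(a, a) = 1 / d(a);                        % new nonzero entry
        end
    end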
MATLAB code for an algorithm including this modification is given in Appendix A.3. The relationships to the PLS1 versions of Appendix A.1 and A.2 are evident.

It should be noted that when using the code in Appendix A.3, one runs into the same kind of trouble as was reported for the Bidiag2 in [20].

5. DISCUSSION

5.1. Notes 1–3

Extraction of orthonormal bases for the subspaces of interest simplifies later computations and interpretations, and is common practice in applied linear algebra. The advantages of normalization seem, however, to have been partly overlooked by the earliest pioneers of PLS modelling. Although mathematically equivalent, the original NIPALS PLS1 algorithm, with extraction of orthogonal but not normalized scores, disturbs the interpretation of the squared P-loading entries as variable-wise explained sums of squares that was shown in Equation (2) of note 1.

The missing normalization is perhaps also the reason why the simple relationship between the weights and loadings as described in Equation (3) of note 2 was overlooked when formulating the original NIPALS PLS1 algorithm. Another possible explanation is that the NIPALS PLS was initially designed to also handle multivariate responses (Y), where the possibilities of avoiding the X-deflation were not so obvious. This problem has, however, later been solved by de Jong's SIMPLS algorithm [22]. A minor modification of the direct scores PLS1 algorithm of Andersson [20] (according to its close relationship to the SIMPLS algorithm) will also solve the problem for multivariate responses without X-deflation.

The bidiagonal form of Pt W is of course 'old news' in the context of PLS1 modelling. The purpose of including note 3 is to demonstrate that the bidiagonal form has a very simple and transparent interpretation as coordinates. From Equation (4), it is clear that each loading is a linear combination of not more than two distinct weights and hence has at most two nonzero coordinates with respect to both of the bases W+ and (the projection onto the subspace spanned by) W.

5.2. Notes 4 and 5

The main challenges of numerical problem solving are related to numerical precision and computational efficiency. A brief comparison of the two algorithms indicates that their major difference is the deflation step (6 in algorithm 1) and the GS step (4 in algorithm 2); application of MATLAB's profiling tool to the corresponding prototype code confirms this. Later, we will briefly explain the reasons for and the possible solutions to this problem.

For de Jong's SIMPLS algorithm [22] with a single response vector (y), computation of the non-orthogonal W*-weights (in [22], the notation R is used for the non-orthogonal weights) is driven by the orthogonality requirement for the scores:

$$ I_A = T^t T = T^t X_0 W_* = P^t W_* \quad (18) $$

Hence, requiring orthonormality between the scores is equivalent to requiring orthogonality between the P-loadings and W*-weights not corresponding to the same component. De Jong found this requirement to be satisfied exactly by the residual weight vector obtained after projecting $w = X_0^t y$ onto the subspace spanned by the P-vectors found 'so far'. Appropriate scaling of the desired residual weight vector was chosen to obtain normalization of the corresponding score vector. Thus, the SIMPLS algorithm has its focus on computation of the (non-orthogonal) weights required to obtain an orthonormal basis for the desired PLS1 column subspace that exactly corresponds to the column subspace basis found by algorithm 2.

5.3. Notes 6 and 7

Although Pt W is bidiagonal only in the PLS1 case, the commutative diagram of Figure 1 is valid for any right projection of X0 where the rows of $T_* = X_0 W$ are coordinates with respect to an orthonormal subspace basis W of the m-dimensional Euclidean space. By the QR factorization $T_* = T R$, we obtain the orthonormal column-subspace basis T and the upper triangular matrix $R = P^t W = T^t T_*$ describing the coordinate relationships of both P with respect to W and T* with respect to T. In particular, this describes the situation for the multiresponse case (PLS2) as well as the various modifications of PLS1 based on extra requirements in the computation of the orthonormal weights.

For any number A of components, all algorithms mathematically equivalent to the NIPALS PLS1 must necessarily compute a set of basis vectors for the subspace spanned by the columns of $W = [w_1, \ldots, w_A]$ (or W*). If W is not found directly, the associated orthonormal basis can always be obtained uniquely (up to the sign of each basis vector) by a normalized GS post-processing. Thereafter, the entire collection of scores, weights, loadings and regression coefficients can easily be calculated. Consequently, interpretations of the resulting model are not restricted by the particular choice of algorithm. The main concerns when choosing between the mathematically equivalent PLS1 algorithms should therefore be directed towards numerical precision and computational efficiency.
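In practice, the normalized GS post-processing amounts to an economy-size QR factorization; a sketch (Wstar as in the note 5 sketch above):

    [Wgs, Rgs] = qr(Wstar, 0);     % Wstar = Wgs * Rgs, Wgs orthonormal
    s = sign(diag(Rgs));
    Wgs = Wgs * diag(s);           % fix column signs; Wgs reproduces W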
5.4. Note 8

As pointed out earlier, the orthonormal weights W together with the centered data X0 are sufficient to describe the entire PLS1 model. The processing of any observation $x_r$ (a row vector either present in X0 or new) according to the PLS1 model can therefore be based on its orthogonal projection $x_r^{tr} = x_r H_{tr} = x_r W W^t$ onto the modelled row subspace and the corresponding residual $x_r^{res} = x_r - x_r^{tr}$. The right multiplication of the skew truncation $T P^t$ by $W W^t$ in Equation (15) is exactly the post-processing solution, suggested by Ergon [18], to the alleged inconsistency problem of the PLS1 model space outlined by Pell et al. [15] (and later discussed in Journal of Chemometrics (2009; 23, pages 67–77), see [16–19]). The truncation alternative (corresponding to a right multiplication by the skew projection $H_{str} = W_* P^t$) advocated by Wold et al. [16] is illustrated in the extended commutative diagram of Figure 2.

[Figure 2. The extended commutative diagram showing the elementary linear algebra of PLS1 modelling with both types of truncation. Arrows indicate multiplication from the right.]

Consistent use of the skew truncation requires the non-orthogonal projection $x_r^{str} = x_r H_{str}$ to be considered together with the residual $x_r^{sres} = x_r - x_r^{str}$. Note that the vector of regression coefficients $b = W (P^t W)^{-1} q$ is an eigenvector corresponding to the eigenvalue $\lambda = 1$ for both $H_{tr}$ and $H_{str}$:

$$ H_{tr}\, b = W W^t \left[ W (P^t W)^{-1} q \right] = W (W^t W)(P^t W)^{-1} q = W (P^t W)^{-1} q = b \quad (19) $$

and

$$ H_{str}\, b = W (P^t W)^{-1} P^t \left[ W (P^t W)^{-1} q \right] = W (P^t W)^{-1} (P^t W)(P^t W)^{-1} q = W (P^t W)^{-1} q = b \quad (20) $$

Consequently,

$$ \hat{y}_r = x_r b = x_r^{tr} b = x_r^{str} b \quad (21) $$

demonstrates that both projections of $x_r$ are consistent with application of the regression coefficients b.
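The eigenvector identities (19)–(21) are easy to confirm numerically; a sketch (W, P, q from any of the sketches above, and X0 the centered data):

    b    = W * ((P' * W) \ q);     % regression coefficients b = W(P'W)^{-1}q
    Htr  = W * W';                 % orthogonal (truncation) projection
    Hstr = (W / (P' * W)) * P';    % skew projection W_* P^t
    disp([norm(Htr * b - b), norm(Hstr * b - b)])   % both ~ 0, eqs (19)-(20)
    xr = X0(1, :);                                  % any observation row
    disp([xr * b, (xr * Htr) * b, (xr * Hstr) * b]) % identical, eq. (21)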
According to [15], the minimum norm regression coefficients for the NIPALS truncation $T P^t$ of X0 are

$$ \beta = (T P^t)^{+}\, y = P (P^t P)^{-1} q \quad (22) $$

and do not share the properties reflected by Equation (21), that is,

$$ x_r^{tr} \beta \neq \hat{y}_r = x_r^{str} \beta \neq x_r \beta \quad (23) $$

However, from the identity $T P^t = X_0 H_{str}$, it follows that the corresponding X0-regression coefficients b are obtained by applying the skew projection $H_{str}$ to $\beta$:

$$ H_{str}\, \beta = W (P^t W)^{-1} P^t P (P^t P)^{-1} q = W (P^t W)^{-1} q = b \quad (24) $$

Consequently, the regression vector part of the inconsistency debate can be ended by concluding that we can consistently navigate between the alternative vectors of regression coefficients by using the projection matrices $H_{str}$ and G according to Equations (24) and (25).

In traditional PLS1 modelling, preference for the skew truncation $T P^t$ of X0 seems to be mainly justified by the deflation step of the NIPALS PLS1 algorithm. It is, however, important to keep in mind that the particular choice of deflation strategy is not at all theoretically critical. In particular, application of the right orthogonal projections (equivalent to double projections) in the corresponding deflations

$$ X_a = X_{a-1} - X_{a-1} w_a w_a^t = X_{a-1} - \tau_a w_a^t \quad (26) $$

inside the Bidiag2 algorithm (demonstrated in the MATLAB code of Appendix A.4) is one way of stabilizing this algorithm. By noting that (i) this type of deflation is identical to the deflation step in Martens' alternative PLS1 algorithm [3] and (ii) the NIPALS type of deflation will not work correctly inside these algorithms, it is clear that sacrificing orthogonality (between the row-subspace residuals and the corresponding part of the PLS model) is an unnecessary price to pay. Actions should therefore be taken to review our PLS modelling tools accordingly (in particular the latent variable model view of PLS advocated in [16]). Putting the pieces together after computing the singular value decomposition (SVD) of Pt W (i.e. with the orthonormal T and W) is all that is needed to get it right.
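In code, the stabilizing deflation (26) is one right projection per component; a two-line sketch (Xa the current deflated matrix, wa the current orthonormal weight, m the number of variables; names illustrative):

    tau = Xa * wa;           % non-orthogonal score, as in (26)
    Xa  = Xa - tau * wa';    % equivalently Xa * (eye(m) - wa*wa')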
5.5. Notes 9 and 10