
Special Issue Article

Received: 28 October 2009, Revised: 1 December 2009, Accepted: 3 December 2009, Published online in Wiley InterScience: 3 February 2010

(www.interscience.wiley.com) DOI: 10.1002/cem.1279

Robust and classical PLS regression compared


Bettina Liebmann*, Peter Filzmoser and Kurt Varmuza
Classical PLS regression is a well-established technique in multivariate data analysis. Since classical PLS is known to be severely affected by the presence of outliers in the data or deviations from normality, several PLS regression methods with robust behavior towards data contamination have been proposed. We compare the performance of the classical SIMPLS approach with partial robust M regression (PRM). Both methods are applied to three different data sets containing intentionally created outliers. A simulated data set with known true model parameters allows insight into the modeling performance with increasing data contamination. QSPR data are modified with a cluster of outlying observations. A third data set from near infrared (NIR) spectroscopy is likely to include noise and experimental errors already in the original variables, and is further contaminated with outliers. To provide a sound comparison of the considered methods we apply repeated double cross validation. This validation procedure judiciously optimizes the model complexity (number of PLS components) and estimates the models' prediction performance based on test-set predicted errors. All studied robust regression models outperform the classical PLS models when outlying observations are present in the data. For uncontaminated data, the prediction performances of both the classical and the robust models are in the same range. Copyright © 2010 John Wiley & Sons, Ltd.
Keywords: partial robust M-regression (PRM); PLS regression; repeated double cross validation (rdCV); outliers; R

1. INTRODUCTION

Partial Least Squares (PLS) regression is a well-known and often successfully applied technique in multivariate data analysis. The typical task of regression is to model a response y by means of a set of explanatory variables (features) x1, ..., xm.

Many different PLS algorithms have been developed over the last 25 years, and have been implemented in various software products. For the 'classical PLS' method as referred to in this work, we choose the SIMPLS algorithm introduced by de Jong [1]. Essential advantages of the PLS approach are its ability to deal with collinear variables and numerous x-variables, and that it allows optimizing the model's complexity [2]. These properties are especially useful with modern analytical instruments such as spectrometers, where many and strongly correlated x-variables are recorded.

However, the classical PLS procedures are known to be severely affected by the presence of outliers in the data or deviations from normality [3]. The non-robustness of PLS was justified theoretically in Reference [4]. Outliers are different from the majority of the data, but they are not necessarily incorrect. Often the outlying observations were made under exceptional circumstances or they belong to another statistical population. In general, classical methods usually fail in identifying the outliers. Consequently, the resulting model may fit the outlying observations, thus 'masking' their erroneous nature (masking effect). By contrast, some good data points might show up as outliers (swamping effect).

Several robust alternatives to classical PLS have been proposed. Their common goal is to detect data contamination and estimate a regression model that primarily fits the 'good' data. Outliers can then be identified easily by their residuals from this robust fit.

The two main strategies for robust PLS regression are (1) downweighting of outliers and (2) robust estimation of the covariance matrix. The early approaches for robust regression by downweighting of outliers are considered semi-robust: they had, for instance, non-robust initial weights [5], or the weights were not resistant to leverage points [6]. Based on the second strategy, a robust covariance estimation, the robust SIMPLS method [7] provides resistance to all types of outliers including leverage points. The latter also applies to the 'Partial Robust M Regression' (PRM), introduced in 2005 by Serneels et al. [8]. As the name suggests, it is a partial version of robust M-regression. In an iterative scheme, weights ranging between zero and one are calculated to reduce the influence of deviating observations in the y space as well as in the space of the regressor variables. PRM is very efficient in terms of computational cost and statistical properties, and is therefore the robust method of choice in this paper.

The objective is to compare the predictive performance of classical and robust PLS regression by a judicious validation method. We apply the repeated double cross validation (rdCV) procedure to both regression types, and study the models resulting from three data sets with different characteristics.

* Correspondence to: B. Liebmann, Laboratory for Chemometrics, Institute of Chemical Engineering, Vienna University of Technology, Getreidemarkt 9/166, A-1060 Vienna, Austria. E-mail: [email protected]

a) B. Liebmann, K. Varmuza: Laboratory for Chemometrics, Institute of Chemical Engineering, Vienna University of Technology, Getreidemarkt 9/166, A-1060 Vienna, Austria

b) P. Filzmoser: Institute of Statistics and Probability Theory, Vienna University of Technology, Wiedner Hauptstrasse 8-10, A-1040 Vienna, Austria

2. METHODS

2.1. Partial robust M-regression
The PRM approach is 'partial' because it follows the idea of dimensionality reduction by using a few latent variables. The original regressor variables are replaced with orthogonal latent variables with maximum covariance with y, as in classical PLS regression. Suppose that observations xi = (xi1, ..., xim) and yi, for i = 1, ..., n, are available, forming the (n x m) matrix X and the vector y, respectively. For simplicity, we assume mean-centered y. Then the original regression problem

$$ y_i = x_i\, b + \varepsilon_i \qquad (1) $$

with the coefficients b = (b1, ..., bm)^T and the error term ei is reduced to the latent variables regression model

$$ y_i = t_i\, g + d_i \qquad (2) $$

with the new regression coefficients g = (g1, ..., ga)^T and the error term di. The new model is of lower dimension a < m, and it is in fact a regression on the score vectors ti, which are to be determined.

In general, two types of outliers can be influential to the estimation of the regression coefficients: leverage points, which are multivariate outliers in the space of the regressor variables, and vertical outliers, which are not atypical in the regressor space but have large residuals. PRM offers good robustness properties by taking into account both types of outliers: a weight wix is responsible for dealing with leverage points, while a weight wir is relevant for vertical outliers. For each observation, these continuous weights are iteratively adjusted in order to diminish the negative influence of outlying objects on the regression model. The total weight to multiply each object with is then defined as

$$ w_i = \sqrt{w_i^{x}\, w_i^{r}} \qquad (3) $$

Note that all PLS regressions are performed on weighted observations wi xi and wi yi.

A brief description of the PRM algorithm:

Step 1: Compute robust starting values for weights from data xi and yi.
Step 2: Perform classical PLS on weighted observations wi xi and wi yi.
Step 3: Recompute weights wir (from residuals), wix (from PLS scores) and total weights wi.
Step 4: Iterate steps 2 and 3 until convergence of the regression coefficients.
Step 5: Obtain final regression coefficients bPRM directly from the last PLS step.

In the first and crucial step, the weights are initialized in a robust manner. Therefore, 'robust autoscaling' is applied to the X matrix as well as to the y vector. Instead of the usually applied measures mean and standard deviation, their robust counterparts median and median absolute deviation (MAD) are used [3]. The data are centered to the median, and then divided by the MAD. Robust autoscaling in y thus results in the intermediate distances hi

$$ h_i = \frac{y_i - y_{\mathrm{median}}}{\operatorname{median}_i\,\lvert y_i - y_{\mathrm{median}}\rvert} \qquad (4) $$

The robust center of X can be calculated by a multidimensional median estimator such as the column-wise median or the L1-median x~ [9]. Then we obtain each object's Euclidean distance ||xi - x~|| to the robust center x~. The robust autoscaling in X results in the intermediate distances gi

$$ g_i = \frac{\lVert x_i - \tilde{x}\rVert}{\operatorname{median}_i\,\lVert x_i - \tilde{x}\rVert} \qquad (5) $$

Note that in all subsequent steps of the algorithm, the distances gi are computed in the score space, i.e. according to the scores ti.

By passing the intermediate distances hi and gi to a weight function, they are transformed to values between 0 and 1. Observations with large distances to the data majority receive a weight close to zero, so as to have reduced influence on the regression model. Observations among the data majority get a weight close to 1. We choose the 'Fair' weight function [6] with a tuning constant set to 4, which is reported to have good performance properties. The residual weight wir and the leverage weight wix for object i are then calculated as

$$ w_i^{r} = \frac{1}{\left(1 + \lvert h_i/4\rvert\right)^{2}} \qquad \text{and} \qquad w_i^{x} = \frac{1}{\left(1 + \lvert g_i/4\rvert\right)^{2}} \qquad (6) $$

The transforming character of the Fair function is shown in Figure 1. By choosing continuous weights, the dilemma of an all-or-nothing decision (the object is an outlier: yes or no) can be avoided. The weight given to each object corresponds to its degree of outlyingness.

Figure 1. Weight wx (or wr) to account for outliers, calculated by the Fair function. The origin can be considered the robust data center (median). Observations far from the origin are downweighted by weights much smaller than 1.
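To make Step 1 concrete, the following R sketch computes the starting weights of Equations (3)-(6). It is only an illustration under the assumption that the column-wise median serves as the robust center of X (the L1-median could be used instead); the function names are made up, and the full algorithm is available as the function 'prm' in the package 'chemometrics'.

# Minimal sketch of the robust starting weights, Equations (4)-(6);
# not the authors' implementation.
fair <- function(z, tc = 4) 1 / (1 + abs(z / tc))^2       # 'Fair' weight function

prm_start_weights <- function(X, y) {
  h <- (y - median(y)) / median(abs(y - median(y)))        # Equation (4)
  x_center <- apply(X, 2, median)                          # robust center of X (column-wise median)
  d <- sqrt(rowSums(sweep(X, 2, x_center)^2))              # Euclidean distances to the center
  g <- d / median(d)                                       # Equation (5)
  wr <- fair(h)                                            # residual weights, Equation (6)
  wx <- fair(g)                                            # leverage weights, Equation (6)
  list(wr = wr, wx = wx, w = sqrt(wr * wx))                # total weights, Equation (3)
}

The rows of X and the entries of y would then be multiplied by these total weights before the first SIMPLS fit in Step 2.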
Once the robust starting weights are computed, we construct a model using classical SIMPLS on the weighted rows of X and the weighted y. This analysis yields a first estimate of the regression coefficients g and the PLS scores ti. Note that the resulting scores have to be corrected by division by the total weight wi. At this point the residuals ri are computed:

$$ r_i = y_i - t_i\, g \qquad (7) $$

The residual weights wir are then updated according to Equations (4) and (6), by substituting yi with the residuals ri. For an update of the leverage weights wix, we replace the original x-variables in Equation (5) by the current set of score vectors ti and apply the Fair function given in Equation (6). The original data matrix X as well as the vector y is reweighted with the updated total weights, and the next classical SIMPLS regression step is performed, until convergence of the regression coefficients g. If the difference between the regression coefficients of two consecutive PLS steps is smaller than a certain threshold value, here 10^-2, the iterative procedure is terminated. From the last regression step, the robust PLS model is obtained.

2.2. Performance criteria

2.2.1. SEP

We estimate the prediction performance of the models based on many test-set predicted errors (residuals), that is, the differences between the experimental value yi and the predicted (modeled) value ŷi for an object i. The standard deviation of these prediction errors, usually abbreviated to standard error of prediction (SEP), is defined by

$$ \mathrm{SEP} = \sqrt{\frac{1}{n_{\mathrm{SEP}}-1}\sum_{i=1}^{n_{\mathrm{SEP}}}\left(y_i - \hat{y}_i - \mathrm{bias}\right)^{2}} \qquad (8) $$

$$ \mathrm{bias} = \frac{1}{n_{\mathrm{SEP}}}\sum_{i=1}^{n_{\mathrm{SEP}}}\left(y_i - \hat{y}_i\right) \qquad (9) $$

The SEP used within this work is equivalent to SEPTEST, because all predicted ŷi values are derived from test set objects. Applying the rdCV approach (see Section 2.3), the number of available ŷ-values, nSEP, is the number of objects times the number of repetitions. The bias is the arithmetic mean of the prediction errors; especially for large nSEP it is near zero.

2.2.2. SEPTRIM

The SEP criterion becomes misleading when applied to robust models fitted to contaminated data. A good robust fit leads to large residuals for the outlying objects, whilst a classical model tends to describe outliers better, sometimes even better than the regular observations. Since we intend to assess the robust model's performance in fitting the good data but not the outliers, a robust SEP measure is necessary [3]. The exclusion of a certain percentage of unusually large (absolute) residuals leads to an acceptable robust performance criterion, SEPTRIM. We choose a trimming constant of 20%. This choice was also made in other papers [10], as with real data the percentage of outliers is unknown. Note that for data sets where only few outliers are expected, a smaller trimming constant can prevent too optimistic estimates of the prediction performance.

2.2.3. MSE

Another statistical error criterion based on residuals computed from (repeated double) cross validation is the mean squared error (MSE).

$$ \mathrm{MSE} = \frac{1}{n_{\mathrm{MSE}}}\sum_{i=1}^{n_{\mathrm{MSE}}}\left(y_i - \hat{y}_i\right)^{2} \qquad (10) $$

A trimmed version of this measure, MSETRIM, is easily computed by excluding the 20% largest squared residuals. In the robust repeated double cross validation algorithm, MSETRIM is used for the estimation of the optimum number of PLS components.

2.2.4. RMSE of regression coefficients

We expect a good regression method to find the true underlying linear function relating X and y. Given a defined set of true model coefficients b and a random data set X, the calculation of a perfectly corresponding response y is straightforward. It is an easy task to solve the linear regression problem of perfectly related X and y data, obtaining estimated regression coefficients b̂ that are (almost) identical with b. In case of data contamination or noise, however, the estimated coefficients will deviate from b. The RMSE indicates to what extent the predefined coefficients b are correctly estimated by the considered method.

The RMSE of the estimated regression parameters b̂ = (b̂1, ..., b̂m)^T is introduced as

$$ \mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{j=1}^{m}\left(\hat{b}_j - b_j\right)^{2}} \qquad (11) $$

where b denotes the true model coefficients. Ideally, the RMSE value is close to zero.
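For illustration, the criteria of Equations (8)-(11) can be written as small R functions; the 20% trimming follows the text, while the function names and the exact trimming scheme are assumptions of this sketch rather than code from the 'chemometrics' package.

# res: vector of test-set predicted residuals (y - y_hat) collected over all
# objects and rdCV repetitions; b_hat, b_true: estimated and true coefficients.
sep <- function(res) {
  bias <- mean(res)                                        # Equation (9)
  sqrt(sum((res - bias)^2) / (length(res) - 1))            # Equation (8)
}
sep_trim <- function(res, trim = 0.20) {                   # robust SEP: drop the 20%
  keep <- order(abs(res))[1:floor((1 - trim) * length(res))]  # largest absolute residuals
  sep(res[keep])
}
mse_trim <- function(res, trim = 0.20) {                   # trimmed MSE, cf. Equation (10)
  sq <- sort(res^2)
  mean(sq[1:floor((1 - trim) * length(sq))])
}
rmse_coef <- function(b_hat, b_true) sqrt(mean((b_hat - b_true)^2))   # Equation (11)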
2.3. Repeated double cross validation (rdCV)

Model evaluation is of high importance in chemometrics. For that purpose, we use repeated double cross validation [11]. This procedure allows a reasonable estimation of the optimum model complexity (number of PLS components) as well as of the prediction performance. A randomly chosen subset of the data, the calibration set, is subjected to a k-fold cross validation loop, yielding a first suggestion for the optimum model complexity. Subsequently, a model for the entire calibration set is constructed and applied to the left-out test data. Due to the repetitive nature of rdCV, the variability of the optimum model complexity as well as the variability of the test-set predicted errors with different data subsets is accessible. The rdCV procedure combined with classical PLS is published in detail in Reference [11], and an application is presented in Reference [12]. A fair comparison of classical and robust PLS models requires a comparable validation technique. Therefore, we implemented the robust PRM method into the three nested loops of the rdCV procedure. The pseudo-code is as follows:

Repetition loop: FOR r = 1 TO nREP
  (1) Split all n objects randomly into SEGTEST segments (typically 3-10) of approximately equal size.
  (2) Outer loop: FOR t = 1 TO SEGTEST
      (a) Test set = segment with number t (nTEST objects)
      (b) Calibration set = other SEGTEST - 1 segments (nCALIB objects)
      (c) Split calibration set into SEGCALIB segments (typically 4-10) of approximately equal size.
      (d) Inner loop: FOR k = 1 TO SEGCALIB
          (i) Validation set = segment with number k (nVAL objects)
          (ii) Training set = other SEGCALIB - 1 segments (nTRAIN objects)
          (iii) Make PRM models from the training set, with a = 1, ..., aMAX components
          (iv) Apply the PRM models to the validation set, resulting in ŷCV for the objects in segment k for a = 1, ..., aMAX
          NEXT k
      (e) Estimate the optimum number of components, aOPT, from ŷCV of the calibration set by the 'one-standard error' method (see below), giving aOPT(t) for this outer loop.
      (f) Make PRM models from the whole calibration set for a = 1, ..., aMAX components.
      (g) Apply the models to the current test set, resulting in test-set predicted ŷ for the nTEST test set objects and a = 1, ..., aMAX components.
      NEXT t
  (3) After completing the outer loop, we have one test-set predicted ŷ for each of the n objects for aMAX different model complexities.
NEXT r

(A) After completing the repetition loops, a three-dimensional data array consisting of a test-set predicted ŷ for each object, every repetition and all considered numbers of components is available. The calculation of the corresponding prediction errors is straightforward.
(B) The choice of a final optimum number of PLS components, aFINAL, is based on all (SEGTEST times nREP) available values for aOPT and picks the value with the highest frequency. Note that aFINAL is determined without using test-set predicted values.
(C) Eventually, the residuals at aFINAL for all repetitions are summarized in the performance criterion SEPTRIM.

Steps (f) and (g) are primarily important for model diagnostic plots, and could be omitted to speed up calculations. Having finished the rdCV procedure, a final regression model for future use is built from all objects with aFINAL components. Unless this robust model is applied to new samples that are from a very different data population, the future prediction errors can be expected within the range of ±2 SEPTRIM.

Note: The optimum number of PLS components is based on a 'one-standard error' rule [3,13]. The error criterion used to estimate the optimum number of components, aOPT, is MSETRIM. With, for instance, seven segments in the inner rdCV loop (SEGCALIB), there are seven MSETRIM values available. For each model complexity a, the mean as well as the standard deviation of MSETRIM is computed. The least complex model within one standard error of the best is chosen as the optimum. The pseudo-code given for PRM also applies to PLS, except that non-trimmed MSE values are used within the one-standard error rule for PLS models.
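A compact sketch of this selection rule is given below; the input 'msetrim' (one row per inner segment, one column per model complexity) and the use of the standard error of the segment means are assumptions made for illustration, not details fixed by the paper.

# msetrim: matrix of MSETRIM values, e.g. 7 rows (inner segments) x aMAX columns.
select_a_opt <- function(msetrim) {
  m  <- colMeans(msetrim)                                  # mean error per complexity
  se <- apply(msetrim, 2, sd) / sqrt(nrow(msetrim))        # standard error per complexity
  best <- which.min(m)                                     # complexity with the lowest mean error
  min(which(m <= m[best] + se[best]))                      # least complex model within one SE of the best
}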
3. SOFTWARE AND DATA

3.1. Software

The free software R offers an environment with a focus on statistical data analysis and graphical representation [14]. It is licensed under the GNU General Public License (GPL) and available from the Comprehensive R Archive Network (CRAN). R is an open source programming language that is easily extended by freely available collections of functions (packages). The package 'pls', for instance, provides routines for principal component regression (PCR) and the partial least-squares regression (PLS) used within our rdCV functions [15]. We employ the rdCV function 'mvr_dcv' for classical PLS as available in the R package 'chemometrics' [16]. The function manages the data and passes the settings for the nested loops in rdCV, e.g. the number of segments for creation of test, calibration and validation sets. The package 'chemometrics' also provides the partial robust M regression algorithm in the function 'prm'. Additionally, a single k-fold cross-validation procedure for PRM is implemented in 'prm_cv'. We developed 'prm_dcv', the robust counterpart to 'mvr_dcv', employing the above-mentioned subfunctions for the robust method in the repetitive validation scheme. A typical call of both the classical and the robust PLS evaluated by rdCV is

library(chemometrics)   # load package 'chemometrics'
data(PAC)               # load PAC data set

class.result <- mvr_dcv(y ~ X, data = PAC, ncomp = 20, method = "simpls")
rob.result   <- prm_dcv(y ~ X, data = PAC, ncomp = 20)

By default, no scaling of the data is provided; the number of repetitions is 100; the data set is split into four segments in the outer and seven segments in the inner loop. This call will consider models with up to 20 PLS components. Further parameters of 'mvr_dcv' and 'prm_dcv' are explained in their help files.

In this work, all classical PLS models are run with 100 repetition loops. To reduce computational cost, the repetitions for all robust PRM models are reduced to 25. A comparison of computation time will be given in the Results section.

3.2. Data

The performance of robust and classical PLS models is compared by three different types of data that are intentionally contaminated with outliers.

3.2.1. ART

This artificial data set is intended to compare the modeling capability of PLS and PRM. The data set contains 500 samples described by 10 x-variables and a defined underlying model with fixed coefficients b. The x-variables contain random numbers from a uniform distribution U(0,10) in the range of 0-10. The model coefficients are randomly drawn from the uniform distribution U(-50,50). The corresponding response y results from a linear relationship defined by b (without intercept) and yields values between 0 and 1539.
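An ART-like data set with the stated properties can be simulated in a few lines of R; the seed, object names and list structure below are arbitrary choices for illustration and not taken from the paper.

# 500 samples, 10 uniformly distributed x-variables, fixed coefficients b,
# and an exact linear response without intercept.
set.seed(1)                                                # arbitrary seed
n <- 500; m <- 10
X <- matrix(runif(n * m, min = 0, max = 10), nrow = n)     # x-variables from U(0,10)
b <- runif(m, min = -50, max = 50)                         # true model coefficients from U(-50,50)
y <- as.vector(X %*% b)                                    # noise-free response
ART <- list(y = y, X = X)                                  # keep response and regressors together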
3.2.2. PAC

The second data set is associated with quantitative structure-property relationships (QSPR). It is available in the R package 'chemometrics' by calling data(PAC) and contains data for 209 polycyclic aromatic compounds. Each compound is characterized by 467 numerical descriptors of its approximated three-dimensional molecular structure (x-variables), calculated by the software Dragon [17,18]. The goal is to model the gas chromatographic retention index as the dependent y-variable [19]. As the molecular descriptors are calculated from the molecular structures, they are not prone to experimental error. Outliers are more likely to appear in the response variable y, the experimentally determined retention index, which ranges from about 200 to 500.
3.2.3. NIR

The third data set is from 166 mash samples withdrawn from bioethanol fermentation experiments that varied with respect to enzymatic pretreatment and type of feedstock (wheat, corn, or rye). The samples span the range of 22-88 g/L ethanol concentration, which was determined by HPLC and serves as the property of interest (y-variable). The first derivatives of near infrared (NIR) absorbance spectra in the wavelength range of 1100-2300 nm provide 235 x-variables for each sample [12]. In these data, experimental errors are possible in both X and y.

3.3. Creating outliers

Preliminary regression tests showed the absence of strong outliers in the above-mentioned data sets. Hence, some perturbing observations are artificially generated.

Leverage points are constructed by substituting a fraction of the original x-variables with a considerably higher value (X outlier). Therefore, the maximum of an x-variable xj is calculated. According to Equation (12), xj,max is then multiplied with a random value chosen from the uniform distribution U(3,10).

$$ x_{ij,\mathrm{out}} = x_{j,\max}\cdot U(3,10) \qquad (12) $$

Typically, the first 3% of the objects are contaminated; for instance, the first 15 samples in the artificial data set ART, which has 500 objects in total. For generating the outlying observations, always and only the first three x-variables (j = 1, 2, 3) were contaminated. This choice is rather arbitrary for the PAC and NIR data sets; however, with the ART data it gives some interesting insights for assessing the final model.

Vertical outliers are not contaminated in the X-space, but their original values in y are changed (y outlier) according to Equation (13), which worked well for the used data.

$$ y_{i,\mathrm{out}} = y_i \pm U(y_{\max},\,2\,y_{\mathrm{mean}}) \qquad (13) $$

An outlying observation yi,out is calculated by randomly adding or subtracting a value drawn from a uniform distribution ranging between the maximum, ymax, and twice the mean, ymean. In case of a resulting negative value of yi,out, it is set to zero instead.

A third type of outlier, belonging to the class of leverage points, is introduced by a combination of data contamination in X and y. The creation of these three types of outliers is shown schematically in Figure 2 for ART data.

Figure 2. Scheme for creating outliers, shown for the ART data set. The first three x-variables are changed for creation of leverage points, while the y-value is unchanged (samples 1-15). Additional 15 observations are contaminated in the y-value (vertical outliers, samples 16-30). Samples 31-45 have outliers in both xi and yi.
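Assuming an ART-like object as sketched in Section 3.2.1, the contamination scheme of Equations (12) and (13) could be implemented as follows; the helper name and the row indices are illustrative only.

# Leverage points (Equation (12)) in the first three x-variables and
# vertical outliers (Equation (13)) in y; here applied to the first 45 samples.
contaminate <- function(X, y, x_rows, y_rows, x_vars = 1:3) {
  for (j in x_vars) {                                      # X outliers: inflate selected rows
    X[x_rows, j] <- max(X[, j]) * runif(length(x_rows), min = 3, max = 10)
  }
  rng   <- sort(c(max(y), 2 * mean(y)))                    # bounds of U(ymax, 2*ymean)
  shift <- runif(length(y_rows), min = rng[1], max = rng[2])
  sgn   <- sample(c(-1, 1), length(y_rows), replace = TRUE)  # add or subtract at random
  y[y_rows] <- pmax(y[y_rows] + sgn * shift, 0)            # negative values are set to zero
  list(X = X, y = y)
}

# Samples 1-15: X outliers; 16-30: y outliers; 31-45: outliers in both X and y
ART_out <- contaminate(ART$X, ART$y, x_rows = c(1:15, 31:45), y_rows = c(16:30, 31:45))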


4. RESULTS

4.1. ART

The artificial data set consists of 500 samples described by 10 regressor variables and one response. As long as no outlying observations contaminate the data, the classical PLS model performs equally to the robust PRM model. This is primarily reflected in the RMSE value for the estimated regression coefficients, which is zero for the classical PLS model and close to zero for PRM. With respect to the prediction performance, the classical model performs perfectly (SEP = 0), while the robust model is slightly worse with a SEPTRIM of 2 (Table I).

If we add leverage points with errors in the first three x-variables, xi1, xi2 and xi3, the classical PLS method estimates regression coefficients with considerable deviations from the real model parameters b (see Figure 3). Evidently, the regression coefficients having the largest deviations are associated with the first three x-variables. The RMSE of the estimated regression coefficients is 10.1 for classical PLS and 1.5 for the robust PRM method (Table I). Hence, the robust method succeeds in downweighting the influence of erroneous x data, even if the outliers are not completely excluded from the regression.

Table I. Regression results for ART data (n = 500, m = 10) with classical PLS and robust PRM regression. The data are contaminated with a varying number of outliers in X, y, and both X and y. RMSE indicates to what extent the considered method estimates the real model parameters correctly; ideally, RMSE is 0. The models' prediction performance is assessed by SEP and SEPTRIM, respectively, based on test-set predicted errors. The optimum model complexity, aFINAL, is determined by rdCV.

No. of outliers in      RMSE            aFINAL         SEP      SEPTRIM
X     y     X&y         PLS     PRM     PLS    PRM     PLS      PLS     PRM
0     0     0           0.0     0.1     8      4       0        0       2
25    0     0           10.1    1.5     4      3       124      67      11
0     25    0           7.2     2.0     3      3       190      19      6
25    25    0           10.3    1.9     3      3       231      71      16
15    15    15          11.0    0.7     3      3       247      76      8
Figure 3. Five per cent of outlying observations in the first three x-variables of the simulated data set ART severely influence the estimated regression coefficients (b1, b2, b3) of the classical PLS model. Deviations from the real model parameters b are considerably lower for the robust PRM model.

Once we introduce not only leverage points but also vertical outliers, the prediction performance of classical PLS deteriorates drastically, with a SEP of 247. Comparing the trimmed SEP values of both considered methods, the classical model yields about ten times higher values (SEPTRIM = 76) than the robust model (SEPTRIM = 8). In the investigated data set, 9% of the samples are outliers: 15 samples with errors in x, another 15 with errors in y, and an additional 15 samples with errors in both x and y. In Figure 4, test-set predicted y versus experimental (simulated) y-values for classical PLS (4a, b) and robust PRM (4c, d) are shown. All models discussed result from a repeated double cross validation procedure with an optimum model complexity determined by applying the one-standard error rule (see Section 2.3). For the classical PLS models, the validation yields 100 test-set predicted y-values for every object. Due to the iterative reweighting loops in the robust method, the computational effort of PRM within the rdCV algorithm is elevated. Therefore, the number of repetitions for the validation of the robust PLS models is reduced to 25, giving 25 test-set predicted y-values for each sample. With the given rdCV settings, the classical method takes 30 seconds, whilst the robust method takes 6 minutes for the ART data.

The predicted values ŷi for all repetitions are included in Figures 4a and c as gray crosses, and give a picture of the variability of the predicted responses. The mean of all predicted values for each object is denoted by a black cross. In Figure 4a, the data points are notably spread, and the cloud containing the majority of data is systematically distorted from the 45° line. It would be difficult to select candidates for outliers based on this regression result. In contrast, the robust model in Figure 4c gives superior results with a distinct fit of the majority of data to the optimum 45° line. Furthermore, the variation of predicted values with every repetition is smaller. In Figures 4b (PLS) and 4d (PRM) only the mean of the predicted values for each object is shown, marked individually according to the type of outlier. It can be seen that the robust method computes exceptionally large residuals for some outlying observations, while the good data are fitted almost perfectly. The dangerous influence of outliers on classical PLS models, namely 'pulling' a model towards their direction, can be observed in Figure 4b.

Figure 4. Test-set predicted y versus experimental (simulated) y-values for classical PLS (a, b) and robust PRM (c, d) for the simulated data set ART. The predicted y for every repetition is included in (a) (100 repetitions) and (c) (25 repetitions) as a gray cross; the mean of all repetitions is denoted by black crosses. Equivalently, these mean values with good data and outliers marked individually are shown in (b, d). In all plots, a 45° line is included as the target line.
4.2. PAC

The second data set is composed of 467 molecular descriptors (x-variables) calculated from the three-dimensional structure of 209 polycyclic aromatic compounds. As molecular descriptors will never be prone to experimental error, we assume error sources such as data manipulation errors, the inclusion of a partly wrong molecular structure, or modeling errors caused by choosing the wrong descriptor model. All outlying observations are computed according to the concept presented in Section 3.3. The outliers are allowed to be physical impossibilities, because in this work the focus is on observing the effects of outliers rather than interpreting their physical meaning. The contamination of this data set is exceptional in that it is designed to affect only samples with low values of the gas chromatographic retention index, here used as the dependent y-variable.

Evidently, the prediction performance of the classical PLS models, measured by SEP, deteriorates with increasing number of outliers (Table II). While the original data set gives a SEP of 11, the strongly contaminated data give a ten times higher value. The robust model also becomes slightly worse by adding more outliers, but it is still in the same range of prediction quality as the classical model without outlying observations.

It is notable that the optimum number of PLS components decreases from 11 to 1 for the classical models once outliers are present. One might claim a higher model complexity for a better prediction performance; for fairness, even a value comparable to the robust PRM models' optimum complexity (aFINAL = 5 and 7, respectively). To confirm, we consult the available diagnostic plot of SEP as a function of the number of PLS components (Figure 5). The optimum complexity is marked with the vertical dashed line at one PLS component. With higher model complexity, e.g. a = 4, the mean value of SEP (black line) of all 100 repetitions (gray lines) is slightly lower. The price to be paid is drastically larger variations in SEP for different test sets (repetitions), which indicates overfitting.

Figure 5. SEP of the classical PLS model as a function of the number of PLS components for PAC data with 18 outlying observations. Gray lines are for each of 100 repetitions. The black line is the mean of the 100 repetitions. The dashed lines indicate the computed optimum at one PLS component with SEP = 115. A higher number of components yields much larger variation in the prediction errors for different repetitions.

The included 12 leverage objects as well as six vertical outliers strongly affect the classical PLS model. First, data points denoted as X-outliers in Figure 6a seem well fitted to the rest of the data. However, the regression line is definitely twisted by the leverage objects, especially because they form a strong cluster in the low value range of y and mask each other. A further effect of the outliers is that the relationship between estimated and predicted y-values even indicates nonlinearity in the data. The best achievable classical model has one PLS component. Ninety-five per cent of the prediction errors for the gas chromatographic retention index are expected in the range of ±2 SEP = 230, which is about 70% of the mean retention index.

The robust method reveals all leverage objects and downweights the flawed observations with weights below 0.3 (Figure 7). These samples have only a small influence on the modeling process, while the 'good' data prevail and allow for a good regression model. Consequently, the test-set predicted values for outliers have large residuals (Figure 6b). Excluding these large absolute residuals from the performance criterion, we get a 95% error range for future prediction errors of ±2 SEPTRIM = 30, which equals 8% of the mean retention index y.

Figure 6. PAC data with 18 outliers: test-set predicted y versus experimental y-values for classical PLS (a) and robust PRM (b). The predicted y-values are means of 100 repetitions for PLS and 25 repetitions for PRM, respectively. Good data and different types of outliers are marked individually.

Figure 7. Total weight wi assigned to each PAC sample by the robust PRM method. The first 18 samples are outliers created intentionally; they are unveiled and downweighted by PRM. Three y outliers, however, are found close to the data majority.

Table II. Regression results for PAC data (n = 209, m = 467) with classical PLS and robust PRM regression, using the original data as well as data including vertical outliers and leverage points. The optimum model complexity, aFINAL, is determined by rdCV. The models' prediction performance is assessed by SEP and SEPTRIM, respectively, based on test-set predicted errors.

No. of outliers in      aFINAL         SEP      SEPTRIM
X     y     X&y         PLS    PRM     PLS      PLS     PRM
0     0     0           11     14      11       6       6
0     10    0           1      5       108      29      14
6     6     6           1      7       115      28      15

4.3. NIR

The third data set used comes from near-infrared spectroscopy measurements in 166 different liquid fermentation samples. Apart from a large range of ethanol concentrations covered by the samples, they differ from each other with respect to the feedstock used and the enzymatic pre-treatment applied in the production process. Consequently, the original data may include observations from different statistical populations, which might show outlying behavior in the regression. Additionally, 15 outliers are created on purpose. As for the PAC data, the optimum model complexity decreases with increasing number of outliers, in particular for the classical PLS model (Table III). Using more than two PLS components for the classical method promotes over-fitting of the outliers present in the calibration data, and gives even worse results for the prediction quality. In Figure 8a, the X-outliers are well fitted to the data majority, while observations with errors in y and errors in both X and y are easily detectable. A systematic deviation of points perpendicular to the 45° line can be observed for 'good' samples with high values in experimental y. As they are not conspicuous in regression results of the original data (without contamination added), we encounter the so-called swamping effect: because of the presence of outliers, good data are incorrectly fitted.
The total weights assigned to each NIR sample are displayed in Figure 9. The robust method detects most of the introduced outliers (samples 1-15). In contrast to the results presented for the other two data sets, the influence of X-outliers in the NIR data is reduced only moderately. These outliers are influenced by a group of samples with a slightly different multivariate data structure (samples 16-52), which are withdrawn from experiments with the particular feedstock rye.

4.4. Summary

We compared a classical PLS regression method with the partial robust M regression method by application of repeated double cross validation. The rdCV procedure is published in a previous work with a focus on classical PLS (SIMPLS) [11], and it is freely available in the package 'chemometrics' for the R programming environment. For this study, we extended the rdCV algorithm to robust PLS regression (PRM) to provide a common ground for a fair and careful comparison of both considered methods.

Even if the main settings for the rdCV procedure (e.g. number of segments for creating test, calibration and validation sets; maximum number of considered PLS components) are the same for both regression methods, the robust method PRM needs more time for computation due to the iterative adjustment of weights. The compromise chosen in this work is to reduce the number of repetitions from 100 to 25 for PRM. The computational effort for a typical data set in chemometrics, such as the NIR data set, is 2 min for the classical PLS model and 30 min for the robust PLS model. Since PRM yields better prediction results in all investigated data sets with outliers present, the higher computation time is justified. However, the rdCV procedure can be accelerated by omitting some calculation steps only necessary for model diagnostic plots.

Repeated double cross validation is a reliable validation technique that provides a realistic estimation of the models' prediction performance, and allows optimizing the model complexity. Apart from profound model diagnostic plots made available by rdCV, the weights plot calculated by PRM is useful for data inspection, in particular for the detection of outliers. We present three data sets with different characteristics. The artificial data set ART is simulated following a perfect linear relationship between x and y variables, and contains neither errors nor noise. The PAC data are likely to contain experimental errors and/or noise in the y-variables only. The most realistic chemical data set, NIR, is prone to experimental error and noise for both x- and y-variables.

Table III. Regression results for NIR data (n = 166, m = 235) with classical PLS and robust PRM regression. The original data might contain experimental error and noise, and are further contaminated with 3% each of outliers in X, y and both X and y. The optimum model complexity, aFINAL, is determined by rdCV. The models' prediction performance is assessed by SEP and SEPTRIM, respectively, based on test-set predicted errors.

No. of outliers in      aFINAL         SEP      SEPTRIM
X     y     X&y         PLS    PRM     PLS      PLS     PRM
0     0     0           14     15      2        1       1
5     5     5           2      5       20       5       4
Figure 8. NIR data with 15 outliers: test-set predicted y versus experimental y-values for classical PLS (a) and robust PRM (b). The predicted y-values are means of 100 repetitions for PLS and 25 repetitions for PRM, respectively. Good data and different types of outliers are marked individually.

5. CONCLUSIONS

Whenever outliers are probable in the data, the application of a robust method should be considered. The main advantage of the presented robust PRM method, including repeated double cross validation, is that no outlier detection is necessary prior to model creation, and a realistic estimation of the model's future performance is made available. If no outliers are present in the data, the robust method is practically as good as the classical method. It is shown for artificial data that the true underlying model parameters are estimated correctly by PRM; in particular, with aberrant observations present in the calibration data the robust methods clearly outperform classical PLS. Consequently, the robust models give better prediction results for non-outliers than the classical models. Nevertheless, the problem of detecting outliers in new data remains. A straightforward way is to perform robust autoscaling in X (Equation (5)) for both the new data and the data used for model creation, and then calculate the robust weights wix (Equation (6)). Other, more sophisticated robust outlier detection methods are available; see for example Reference [20].
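A minimal sketch of that screening step is shown below, assuming a training matrix X_train and new observations X_new with identical columns; the function name and the 0.3 cutoff (echoing the PAC example in Section 4.2) are illustrative choices, not part of the published method.

# Flag suspicious new observations by their leverage weight, Equations (5) and (6).
flag_new_outliers <- function(X_train, X_new, cutoff = 0.3) {
  x_center <- apply(X_train, 2, median)                    # robust center from the training data
  d_train  <- sqrt(rowSums(sweep(X_train, 2, x_center)^2))
  d_new    <- sqrt(rowSums(sweep(X_new, 2, x_center)^2))
  g_new    <- d_new / median(d_train)                      # Equation (5), scaled by the training median
  wx_new   <- 1 / (1 + abs(g_new / 4))^2                   # 'Fair' weights, Equation (6)
  which(wx_new < cutoff)                                   # indices of low-weight (suspect) samples
}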
Acknowledgements

This work was partly funded by the Austrian Research Promotion Agency (FFG), BRIDGE program, project no. 812097/11126. We thank Anton Friedl (Vienna University of Technology, Institute of Chemical Engineering) for encouragement and support.

Figure 9. Total weight wi assigned to each NIR sample by the robust PRM method. The first 15 samples are outliers created intentionally; they are mostly unveiled and downweighted. In addition, the regular samples 16 to 52 appear as outliers, too. Indeed, all these samples are withdrawn from experiments with a particular feedstock.

REFERENCES

1. de Jong S. SIMPLS: an alternative approach to partial least squares regression. Chemom. Intell. Lab. Syst. 1993; 18(3): 251–253.
2. Martens H, Naes T. Multivariate Calibration. Wiley: Chichester, UK, 1989.
3. Varmuza K, Filzmoser P. Introduction to Multivariate Statistical Analysis in Chemometrics. CRC Press: Boca Raton, FL, 2009.
4. Serneels S, Croux C, Van Espen PJ. Influence properties of partial least squares regression. Chemom. Intell. Lab. Syst. 2004; 71: 13–20.
5. Wakeling IN, Macfie HJH. A robust PLS procedure. J. Chemom. 1992; 6(4): 189–198.
6. Cummins DJ, Andrews CW. Iteratively reweighted partial least squares: a performance analysis by Monte Carlo simulation. J. Chemom. 1995; 9(6): 489–507.
7. Hubert M, Vanden Branden K. Robust methods for partial least squares regression. J. Chemom. 2003; 17(10): 537–549.
8. Serneels S, Croux C, Filzmoser P, Van Espen PJ. Partial robust M-regression. Chemom. Intell. Lab. Syst. 2005; 79: 55–64.
9. Hössjer O, Croux C. Generalizing univariate signed rank statistics for testing and estimating a multivariate location parameter. J. Nonparametr. Stat. 1995; 4: 293–308.
10. Serneels S, Filzmoser P, Croux C, Van Espen PJ. Robust continuum regression. Chemom. Intell. Lab. Syst. 2005; 76(2): 197–204.
11. Filzmoser P, Liebmann B, Varmuza K. Repeated double cross validation. J. Chemom. 2009; 23(4): 160–171.
12. Liebmann B, Friedl A, Varmuza K. Determination of glucose and ethanol in bioethanol production by near infrared spectroscopy and chemometrics. Anal. Chim. Acta 2009; 642: 171–178.
13. Hastie T, Tibshirani RJ, Friedman J. The Elements of Statistical Learning. Springer: New York, NY, 2009.
14. R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing: Vienna, Austria, 2009. www.r-project.org
15. Mevik BH, Wehrens R. The pls package: principal component and partial least squares regression in R. J. Stat. Software 2007; 18(2): 1–24.
16. Filzmoser P, Varmuza K. chemometrics: multivariate statistical analysis in chemometrics. R package version 0.5. Vienna, Austria, 2009. http://cran.r-project.org
17. Todeschini R, Consonni V. Handbook of Molecular Descriptors. Wiley-VCH: Weinheim, Germany, 2000.
18. Todeschini R, Consonni V, Mauri A, Pavan M. Dragon: Software for Calculation of Molecular Descriptors. Talete srl: Milan, Italy, 2004. www.talete.mi.it
19. Lee M, Vassilaros D, White C, Novotny M. Retention indices for programmed-temperature capillary-column gas chromatography of polycyclic aromatic hydrocarbons. Anal. Chem. 1979; 51: 768–773.
20. Filzmoser P, Maronna R, Werner M. Outlier identification in high dimensions. Comput. Stat. Data Anal. 2008; 52(3): 1694–1711.