Received: 26 October 2018 Revised: 1 June 2019 Accepted: 10 June 2019

DOI: 10.1002/cem.3164

SPECIAL ISSUE ‐ RESEARCH ARTICLE

VSN: Variable sorting for normalization

Gilles Rabatel¹ | Federico Marini² | Beata Walczak³ | Jean‐Michel Roger¹

¹ ITAP, Irstea, Montpellier SupAgro, University of Montpellier, Montpellier, France
² Department of Chemistry, University of Rome “La Sapienza”, Rome, Italy
³ Institute of Chemistry, University of Silesia, Katowice, Poland

Correspondence
Jean‐Michel Roger, ITAP, Irstea, Montpellier SupAgro, University of Montpellier, Montpellier 34000, France.
Email: jean‐[email protected]

Funding information
French Ministry of Agriculture and the Association Nationale de la Recherche et de la Technologie ANRT, Grant/Award Number: 1412

Abstract
Spectrometric and analytical techniques in general collect multivariate signals from chemical or biological materials by means of specific measurement instrumentation, usually in order to characterize or classify them through the estimation of one or several compounds of interest. However, measurement conditions might induce various additive (baseline) or multiplicative effects on the collected signals, which may jeopardize the accuracy and generalizability of estimation models. A common way of dealing with such issues is signal normalization and in particular, when the baseline is constant, the standard normal variate (SNV) transform. Despite its efficiency, SNV has important drawbacks, in terms of physical interpretation and robustness of estimation models, because all the variables are considered equally, independently of their actual relationship with the response(s) of interest. In the present study, a novel algorithm is proposed, named variable sorting for normalization (VSN). This algorithm automatically produces, for a given set of multivariate signals, a weighting function favoring signal variables that are only impacted by additive and multiplicative effects, and not by the response(s) of interest. When introduced in SNV preprocessing, this weighting function significantly improves signal shape and model interpretation. Moreover, VSN can be successfully used not only with constant baselines but also with more complex ones, such as polynomial baselines. Together with the description of the theory behind VSN, its application to various synthetic multivariate data, as well as to real SWIR spectral data, is presented and discussed.
KEYWORDS
spectrometry, normalization, SNV, MSC, pretreatments, RANSAC

1 | INTRODUCTION

The techniques of spectrometry and, more generally, of analytical chemistry provide highly multivariate signals. These instrumental signals are often collected for the purpose of evaluating one or more properties of a product. They are then treated as vectors of numbers by chemometrics.1 Chemometric tools such as principal component analysis (PCA) or partial least squares (PLS) are essentially based on linear algebra.2 In the following, matrices will be written in boldface uppercase, vectors in boldface lowercase, and scalars in italic lowercase. The matrix X (n × p) contains n multivariate signals described by p variables.
A multivariate signal x (p dimensional vector) can be affected by multiplicative and additive effects unrelated to the
quantity of interest y. This means that, for a given value of y, ax + b is measured instead of x, with a ≠ 1 and b ≠ 0.

Journal of Chemometrics. 2020;34:e3164. wileyonlinelibrary.com/journal/cem © 2019 John Wiley & Sons, Ltd. 1 of 16
https://doi.org/10.1002/cem.3164
1099128x, 2020, 2, Downloaded from https://analyticalsciencejournals.onlinelibrary.wiley.com/doi/10.1002/cem.3164 by Novo Nordisk A/S, Wiley Online Library on [27/11/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License

In the field of chromatography, the multiplicative effect can be due to the quantity of product analyzed, whereas the information resides in the shape of the signal, in what is normally called the “shape effect.”3 In the field of nuclear magnetic resonance (NMR), the dilution of the analyzed product is responsible for the multiplicative effect. In the field of near infrared spectroscopy (NIRS), the multiplicative effect may be due to the geometry of the measuring device, when the collected signal is reported as reflectance, or to the particle size of the product, when the signal is expressed as absorbance.4,5 The additive effect is usually referred to as the baseline. In chromatography or NMR, the baselines are assigned to a background, that is to say a signal superimposed on the relevant part of the signal. In NIRS, the baselines are more closely intertwined with the useful portion of the signals, because they come from the interaction between light and matter.6 This article deals with the processing of any of the aforementioned instrumental profiles, called the signal. The factor along which the signal is described (wavelength, time, etc) will be named the variable. We will also call shape effect the useful signal and size effect the detrimental signal due to additive and multiplicative factors, to be removed.
The multiplicative effect is incompatible with the tools of linear algebra. Indeed, any processing of the signal x by linear
algebra results in one or more matrix operations as, eg, t = x′ P, where P is a loading matrix. If x is multiplied by a, the
result t will also be multiplied by a. In other words, the multiplicative effects go through linear models. The additive effect
is much less severe since it can be compensated by the tools of linear algebra. Indeed, even nonconstant lines (linear and
parabolic) all belong to a subspace of the signal space that can be down‐weighted or eliminated by chemometric methods.
The multiplicative effect can be treated by applying a logarithm to the measured signal, which turns the multiplicative effect into an additive effect. The advantages of this pretreatment have been clearly demonstrated in the context of discrimination based on NIR spectra with a pure multiplicative effect.7 However, in the regression framework, the application of a logarithmic transformation is problematic: Since a linear relationship between x and y is assumed, one should also apply the logarithm to y. In this case, the distribution of the y values is deformed and the regression model can no longer be calculated under optimal conditions. Moreover, the logarithmic transformation no longer works if an additive effect also affects the data. Using normalization is more common. The idea of normalization consists in dividing all the variables of a collected signal x by a divisor d depending on x, in such a way that d(ax) = a d(x). Doing so, every signal ax will lead to the same corrected profile z = ax/d(ax) = x/d(x), whatever the value of a. The same strategy is used to correct for the additive effect, by subtracting from the signal an offset o such that o(x + b) = o(x) + b.
Many methods have been proposed to define d and o. A first category consists in computing d and o as statistics calculated on each single x. Thus, d can be estimated by the maximum, the sum, or the norm (L1 and L2) of x. In NIRS, this statistic is usually calculated over the entire spectrum. In chromatography and NMR, it can be calculated on a peak of a known compound or a standard. For the offset o, it is common to use the mean or the minimum. Standard normal variate (SNV)8 is surely the most popular method of this category. It calculates the offset o as the mean of x, and the coefficient d as the standard deviation of x. A second category of methods relies on the comparison of x with a reference signal xref for the estimation of d and o. Multiplicative signal correction (MSC)9 assumes that x consists of a multiplicative part, an additive part, and a residue bearing the shape effect: x = d xref + o + r. MSC calculates the coefficients d and o by a least squares regression between x and xref. The reference signal should represent a kind of prototype of the chemical basis of x. In practice, the mean or median profile of the samples in X is used. The extended MSC10,11 makes it possible to treat additive effects that are more complex than a simple constant.
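The two categories can be illustrated with a minimal sketch in plain Python (standard library only). The function names `snv` and `msc` and the toy signal are illustrative assumptions, not code from the paper; the sample standard deviation is used for SNV, which is one common convention.

```python
from statistics import mean, stdev

def snv(x):
    """Standard normal variate: offset o = mean(x), divisor d = std(x)."""
    o = mean(x)
    d = stdev(x)  # sample standard deviation (n - 1 denominator)
    return [(v - o) / d for v in x]

def msc(x, x_ref):
    """Multiplicative signal correction: fit x = d * x_ref + o by ordinary
    least squares, then return (x - o) / d."""
    mx, mr = mean(x), mean(x_ref)
    sxy = sum((r - mr) * (v - mx) for r, v in zip(x_ref, x))
    sxx = sum((r - mr) ** 2 for r in x_ref)
    d = sxy / sxx
    o = mx - d * mr
    return [(v - o) / d for v in x]

# Toy signal (hypothetical values) and a copy distorted by a = 2.5, b = 0.7
x = [0.10, 0.40, 0.90, 0.30, 0.20]
distorted = [2.5 * v + 0.7 for v in x]

snv_plain = snv(x)
snv_distorted = snv(distorted)      # same profile as snv_plain
msc_distorted = msc(distorted, x)   # recovers x when x is the reference
```

Both corrections remove the factor a and the offset b here because the toy signals differ only by a size effect; the limitations discussed next arise when the shape effect also contributes to d and o.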
Normalization is very efficient, even essential to allow the use of the linear tools of chemometrics. In NIRS, it is a widely used pretreatment step and in particular, among the available tools, the SNV method is very popular (to the point that the original paper has been cited more than 2000 times). This is probably due to the fact that it is effective and does not require any parameter setting. Moreover, since, unlike for MSC, parameter estimation requires only the individual signal to be pretreated, SNV does not pose a problem when integrated in a cross‐validation procedure. This is the reason why SNV normalization is almost always used on NIR spectra. However, there are a number of problems with normalization. Its performance depends on the quality of the link between the calculated coefficients (d,o) and the actual multiplicative and additive effects (a,b). To satisfy the objective of insensitivity of the normalized signal to the multiplicative effect, it suffices that d is a multiple of a, ie, d = βa, with β independent of a. Analogously, for the additive effect, it suffices that o = γ + b, with γ independent of b. Now, in all common normalization methods, the coefficients d and o are estimated from x. If no precaution is taken, the values of the coefficients β and γ may also depend on the contribution of the informative part of the profile and, thus, so may the coefficients d and o. This results in a differential deformation of the signal. Indeed, normalization has been identified as a cause of spurious correlation problems due to data closure.3,12 It is clear that if a multivariate signal is normalized to a constant sum, the growth of a variable with a high magnitude will be automatically compensated by the decrease of the other variables. Also, normalization induces unnatural constraints on the signal that can distort the space of X. For example, if normalization is done by the standard deviation, all normalized signals are of
unitary length and lie on a sphere; if it is done by the area, all the signals lie on a hyperplane of equation x1 + x2 + … + xp = 1. Finally, normalization distorts the distribution of noise. Therefore, the choice of the normalization method should depend on the type of noise. For example, normalization by the norm is not suited to heteroscedastic noise.13 All these side effects of normalization do not necessarily pose a problem when analyzing the signals of a homogeneous training set. However, by deforming the signals, they lead to deformations in the loadings calculated during the analysis and thus to erroneous conclusions when interpreting these loadings. It is also likely that they pose robustness problems. Normalization therefore poses a paradox: It is very useful, even necessary, but it can have dangerous side effects. The ideal would therefore be to achieve normalization while minimizing the negative effects.
Some methods propose to select the variables of x before calculating d and o. The benefits of normalization over only a
portion of the variables are clearly presented in Johansson et al.14 If the chosen variables are mainly involved in size effect,
the geometrical distortion induced by SNV or MSC15 remains only in spectral (profile) regions less related to the chemical
information. This idea is taken up in Bylesjo et al12 by proposing to exclude from the computation of d the variables with
high variance. The robust normal variate (RNV)16 proposes a variant of SNV, in which only a percentile of the x values is
considered, eg, the 25% lowest values of x. Accordingly, o can be calculated as the highest value of this percentile, and d as
its standard deviation. This method was proposed to process NIR absorbance spectra, for which the shape effects are peaks
(high values) and the size effects are baselines (low values). Probabilistic quotient normalization (PQN)17 corrects for a
multiplicative effect. It calculates the quotients of all the variables of x by those of a reference signal xref, which produces
a series of p quotients. Then, d is calculated as the most probable of these values, ie, the mode of the distribution of the
quotients. In practice, since it is rather difficult to find the mode of a distribution, the median is taken as the estimate of the
most probable value. Using a similar idea, robust variants of MSC have been proposed by using robust regression for the
estimation of d and o. The underlying hypothesis of robust regression is that only a subset of regular individuals needs
to follow the regression model under construction, while the other ones (outliers), which anyway must remain a minority,
have to be excluded. In MSC, the regression is calculated between pairs of corresponding variables in the two signals:
Consequently, robust MSC selects the variables of x that best follow the relation x = dxref + o as regular individuals.
However, as a particular case of robust regression, it requires that a majority of the variables of x are impacted only by
the size effect.
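As an illustration of the selection idea, the PQN divisor described above can be sketched in plain Python using the median of the quotients, as the text indicates is done in practice. The function name `pqn` and the toy values are hypothetical; variables where the reference is zero are simply skipped, which is an assumption of this sketch.

```python
from statistics import median

def pqn(x, x_ref):
    """Probabilistic quotient normalization (sketch): the divisor d is the
    median of the variable-wise quotients x / x_ref."""
    quotients = [v / r for v, r in zip(x, x_ref) if r != 0]
    d = median(quotients)
    return [v / d for v in x], d

# Hypothetical reference, and a signal equal to 3 * reference except for one
# variable (index 2) that also carries a shape effect.
x_ref = [1.0, 2.0, 4.0, 3.0, 2.0, 1.0, 5.0]
x = [3.0, 6.0, 20.0, 9.0, 6.0, 3.0, 15.0]

corrected, d = pqn(x, x_ref)  # d is 3.0: the outlying quotient 5.0 is ignored
```

The median is insensitive to the single outlying quotient, but, like robust regression, it breaks down once more than half of the variables carry a shape effect.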
Some other methods propose to weight the variables of x before calculating d and o. This is more general than selecting variables, since selection is a particular case of weighting. In Rietjens,13 it is proposed to weight the variables in order to account for the noise distribution on the variables. In Martens and Stark,10 the authors clearly indicate: “To avoid wavelength regions where the chemical absorbance variations might influence the estimation of ai and bi, weighted least squares is used in practice ….” These weights must be set according to the knowledge of the contributions to the signal of both the compound(s) of interest and the undesired effects. In Shenk and Westerhaus,18 a modified version of MSC is proposed where each variable is weighted according to its standard deviation prior to computing o and d. In Gallagher et al,19 an iterative procedure is proposed to calculate the weights, in association with an MSC correction. The weights are initialized with a guess (eg, a pure spectrum), and then those corresponding to variables that have high residuals in MSC fitting are lowered. The procedure is repeated until convergence.
The present paper proposes a new method, which does not require any prior knowledge about the signal of the compound(s) of interest, called variable sorting for normalization (VSN), to calculate the weights to be applied to each variable before a normalization procedure (either SNV or MSC). The first part of the article details the theoretical principles of the algorithm. The second part presents the data used to test the algorithm and to compare it with other methods. Finally, the last part presents the results and proposes a discussion.

2 | THEORY

2.1 | Basic principle

The n signals contained in the matrix X are assumed to come from measurements on the same type of material, with various additive and multiplicative factors, due to various measurement conditions (as discussed above), and various shape effects associated with chemistry.
Let W be a p × p diagonal weighting matrix (wij = 0 for i ≠ j, 0 ≤ wii ≤ 1). For instance, W can be used in a weighted SNV, with o = mean(xW) and d = std(xW). The role of W is to control individually the influence of every variable in the computation of d and o. In the present case, the more the signal amplitude for a given variable is affected by shape effects, the lower the weight of this variable should be.
Let us imagine particular variables for which the signal amplitude only varies due to additive and multiplicative
effects (no shape effect). These variables can be grouped in a class called Sz, indicating a pure size effect. In practice,
depending on the subset of X considered, some variables can appear as belonging to Sz or not. Thus, the VSN method
proposes to calculate wii as the probability that the variable i belongs to Sz, when considering the whole matrix X.
To evaluate this probability, we propose the following:

• To make Ns random samplings of pairs of signals in the X matrix.

• For each of these pairs of profiles, to identify the variables belonging to Sz.

The probability wii is then estimated as the frequency of assignment of the variable i in the class Sz.
The reason for dealing with pairs of signals is that the additive and multiplicative factors between both signals can be numerically defined. Let us consider, for the sake of illustration, the following example (Figure 1) taken from real spectrometric measurements.
Figure 1A shows two spectra x1 and x2, which are nearly similar up to additive and multiplicative factors, except in
their left part (shape effect). In Figure 1B, it can be seen that a large set of wavelength amplitudes are linearly related:
They lie on a straight line, defining the actual additive and multiplicative factors between the two spectra. The other
wavelength amplitudes, which are not on this straight line, are the ones affected also by shape effects.
According to Figure 1, the assignment of a variable with respect to the Sz class for a couple of signals thus consists in
determining the variables following a common linear relationship. The way to do this is presented in the following.

2.2 | Variable assignments for a pair of signals

As illustrated in Figure 1, the assignment problem can be considered as a linear regression problem in the presence of nonregular individuals (outliers): Only individuals that fit this linear regression model will be retained in Sz.
This kind of problem has been widely covered in the literature, under the general term of robust regression.20 However, as mentioned earlier, robust regression techniques require a majority of regular individuals. Also, they address overdetermined regression problems with a significant amount of residuals: The goal is to define a regression model that is “approximately” applicable to a majority of data samples, even if it does not apply exactly to any of them.
The present case is slightly different. As the shape effect can affect large ranges of variables in the signals, depending on the components of interest, the assignment procedure should remain applicable when the size effect class Sz contains a small number of variables compared with the total number of variables. Conversely, Sz members are expected to fit the resulting model very accurately, ie, with very low residual values.
For these reasons, a random sample consensus (RANSAC) approach has been preferred.21 Similarly to robust regression, RANSAC seeks a subset of regular individuals in the studied data set. However, it is a model fitting approach
FIGURE 1 Example of amplitude relationship for a pair of spectra. A, Graphical display of the two spectra x1 and x2. B, x2 amplitude
versus x1 amplitude
that is not based on overdetermined estimation. Model candidates are computed from sample sets containing exactly the required number of samples to provide a unique solution. These sample sets are drawn randomly, and the computed model fitting the largest number of individuals is retained.
Let us consider two signals x1 and x2. According to our previous definitions, the Sz class contains variables where the signal amplitudes of x1 and x2 only differ according to additive and multiplicative factors (no shape effect). Formally, it means that one can find two scalar values a and b such that
$$m \in S_z \iff x_{2m} = a\,x_{1m} + b \tag{1}$$

Given the two signals x1 and x2, the Sz class is estimated by the RANSAC algorithm, which iterates the following
procedure:

• Draw two variables i and j, so that x1i − x1j ≠ 0. From the four values {x1i; x1j; x2i; x2j}, an (a,b) candidate is calculated from Equation (1) applied to i and j.
• Apply the following evaluation function to the (a,b) candidate:
$$E(a,b) = \sum_{k=1}^{p} \delta(a,b,k) \tag{2}$$
where
$$\delta(a,b,k) = \begin{cases} 1 & \text{if } |x_{2k} - a\,x_{1k} - b| < \varepsilon \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
ε being a threshold to be set.

ε acts as a tolerance value, indicating the error level that can be accepted for considering a sample as fitting the current (a,b) model. After Nw iterations, the (a,b) candidate associated with the highest value of E(a,b) is retained, and the corresponding variables are assigned to Sz.
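The pairwise procedure can be sketched as follows, in plain Python. This is an illustrative reading of Equations (1) to (3), not the authors' Matlab code: the function name `ransac_pair`, the default number of iterations, the fixed random seed, and the toy signals are all assumptions.

```python
import random

def ransac_pair(x1, x2, eps, n_iter=200, rng=None):
    """Return (a, b, sz) for the model x2[k] = a * x1[k] + b, where sz lists
    the variables fitting the best candidate within the tolerance eps."""
    rng = rng or random.Random(0)  # fixed seed, only for reproducibility
    p = len(x1)
    best = (None, None, [])
    for _ in range(n_iter):
        i, j = rng.sample(range(p), 2)
        if x1[i] == x1[j]:
            continue  # the two points do not define a unique (a, b)
        # Candidate from Equation (1) applied to variables i and j
        a = (x2[i] - x2[j]) / (x1[i] - x1[j])
        b = x2[i] - a * x1[i]
        # Equations (2) and (3): collect the variables within the tolerance
        members = [k for k in range(p) if abs(x2[k] - a * x1[k] - b) < eps]
        if len(members) > len(best[2]):
            best = (a, b, members)
    return best

# Toy pair: x2 = 2 * x1 + 0.5 everywhere, plus a "shape effect" bump on the
# first five variables (hypothetical values).
x1 = [0.1 * k for k in range(20)]
x2 = [2.0 * v + 0.5 for v in x1]
for k, bump in enumerate([0.3, 0.8, 1.0, 0.8, 0.3]):
    x2[k] += bump

a, b, sz = ransac_pair(x1, x2, eps=1e-6)  # sz holds variables 5 to 19
```

Because each candidate is computed from exactly two variables, the consensus set can be found even when it does not form a majority, which is the property that motivated the choice of RANSAC over robust regression.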

2.3 | Processing chart summary
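Following Sections 2.1 and 2.2, the overall processing can be sketched in plain Python as two nested loops: Ns random pairs of signals, and Nw RANSAC draws per pair. The names `sz_members` and `vsn_weights`, the default values of Ns and Nw, and the toy data are assumptions made for illustration only.

```python
import random

def sz_members(x1, x2, eps, n_w, rng):
    """Inner loop: variables fitting the best x2 = a * x1 + b candidate."""
    p = len(x1)
    best = []
    for _ in range(n_w):
        i, j = rng.sample(range(p), 2)
        if x1[i] == x1[j]:
            continue  # no unique (a, b) from these two variables
        a = (x2[i] - x2[j]) / (x1[i] - x1[j])
        b = x2[i] - a * x1[i]
        members = [k for k in range(p) if abs(x2[k] - a * x1[k] - b) < eps]
        if len(members) > len(best):
            best = members
    return best

def vsn_weights(X, eps, n_s=50, n_w=100, seed=0):
    """Outer loop: w[k] is the frequency with which variable k is assigned
    to Sz over n_s random pairs of signals (rows of X)."""
    rng = random.Random(seed)
    counts = [0] * len(X[0])
    for _ in range(n_s):
        r1, r2 = rng.sample(range(len(X)), 2)
        for k in sz_members(X[r1], X[r2], eps, n_w, rng):
            counts[k] += 1
    return [c / n_s for c in counts]

# Toy data: 6 signals, 12 variables; only the first 4 variables carry a
# concentration-dependent peak, all variables carry gain and offset.
base = [0.2 + 0.05 * k for k in range(12)]
peak = [1.0, 2.0, 2.0, 1.0] + [0.0] * 8
conc = [0.1, 0.3, 0.5, 0.7, 0.9, 1.1]
gain = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3]
offs = [0.0, 0.1, -0.1, 0.2, -0.2, 0.05]
X = [[g * (bs + c * pk) + o for bs, pk in zip(base, peak)]
     for c, g, o in zip(conc, gain, offs)]

weights = vsn_weights(X, eps=1e-6)  # low on variables 0-3, high elsewhere
```

The returned list holds the diagonal of W; on noisy real data the weights are expected to take intermediate values rather than the clear-cut pattern of this noise-free toy example.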

2.4 | Extension to more complex transformation models

Until now, only size effects involving simple multiplicative and additive factors (a,b) were considered. However, in the case of more complex size effects, such as linear or parabolic baselines, the extension of the present solution is straightforward. Let us consider a size effect that can be modeled using q parameters (a1, a2, …, aq), eg, a multiplicative effect and a polynomial baseline of order q − 2. For a given couple of signals, a unique solution can be computed for this model from a given sample set (i1, i2, …, iq) of q variables. The same processing chart can thus be applied, provided that the steps of the inside loop (random variable drawing, model computation, and Sz selection) are adapted to q variables. The resulting W matrix is then used to compute the relevant signal correction process (eg, EMSC with a polynomial of order q − 2).
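As a worked illustration, consider q = 4, ie, a multiplicative effect and a second‐order polynomial baseline. Assuming that the baseline is expressed as a polynomial in the variable index m (an assumption consistent with the EMSC model used later, but not stated explicitly here), Equation (1) and the candidate computation from four drawn variables i1, …, i4 would read:

```latex
% Size-effect model with q = 4 parameters (generalization of Equation (1))
m \in S_z \iff x_{2m} = a_1\,x_{1m} + a_2 + a_3\,m + a_4\,m^2

% Candidate (a_1, ..., a_4) from four drawn variables i_1, ..., i_4
\begin{pmatrix}
x_{1 i_1} & 1 & i_1 & i_1^2 \\
x_{1 i_2} & 1 & i_2 & i_2^2 \\
x_{1 i_3} & 1 & i_3 & i_3^2 \\
x_{1 i_4} & 1 & i_4 & i_4^2
\end{pmatrix}
\begin{pmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{pmatrix}
=
\begin{pmatrix} x_{2 i_1} \\ x_{2 i_2} \\ x_{2 i_3} \\ x_{2 i_4} \end{pmatrix}
```

The draw is kept only when the 4 × 4 matrix is invertible, which plays the role of the condition x1i − x1j ≠ 0 in the two‐parameter case.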

3 | MATERIAL AND METHODS

3.1 | Simulated spectra

The spectra of three pure products, described by p = 256 wavelengths, were generated by combining Gaussians, as shown in Figure 2A, and organized as the rows of a profile matrix S. A concentration matrix Y, made up of n = 30 rows (mixtures) and three columns (constituents), was also generated. The first column of Y was taken constant and equal to 1; the values of the second and third columns of Y were drawn randomly from a uniform distribution between 0 and 1. A set of n spectra by p wavelengths was then generated by X0 = YS. These spectra were then supplemented with a

FIGURE 2 Simulated data set. A, Spectra of the three pure components. B, Spectra obtained by linear combination of the three
components. C, Spectra affected by a multiplicative effect and a horizontal baseline. D, Spectra affected by a multiplicative effect and a
parabolic baseline
spectrally structured background and white noise. The background was simulated as the random combination of 20 Gaussian‐shaped spectral peaks. The order of magnitude of the background was 0.1% of the signal. The order of magnitude of the white noise was 0.001% of the signal. The resulting spectra are illustrated in Figure 2B. These spectra were then multiplied by a scalar, randomly drawn from a normal distribution of mean 1 and standard deviation 0.1, and a scalar drawn at random from a normal distribution of mean 0 and standard deviation 0.1 was added, giving the spectra X1, illustrated in Figure 2C. Finally, a term c l + d l² was added to these spectra, with c and d following normal distributions of zero mean and standard deviations 6 × 10⁻⁴ and 3 × 10⁻⁶, respectively, and l = [1 … p]. The result is the matrix X2 represented in Figure 2D.
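A simplified version of this simulation can be sketched in plain Python. The Gaussian centers and widths below are placeholders (the text does not give the actual values), the helper `gaussian` is hypothetical, and the structured background and white noise are omitted for brevity.

```python
import math
import random

rng = random.Random(0)  # fixed seed, only for reproducibility
p, n = 256, 30
grid = range(1, p + 1)  # l = 1 ... p

def gaussian(center, width):
    """Gaussian-shaped band sampled on the wavelength grid."""
    return [math.exp(-0.5 * ((l - center) / width) ** 2) for l in grid]

# Hypothetical pure spectra (placeholder centers and widths)
S = [gaussian(180, 40), gaussian(40, 8), gaussian(70, 8)]

# Concentrations: first constituent constant, the other two uniform in [0, 1)
Y = [[1.0, rng.random(), rng.random()] for _ in range(n)]

# X0 = Y S (background and noise omitted in this sketch)
X0 = [[sum(y[c] * S[c][k] for c in range(3)) for k in range(p)] for y in Y]

# X1: multiplicative factor ~ N(1, 0.1) and constant offset ~ N(0, 0.1)
X1 = []
for row in X0:
    a, b = rng.gauss(1.0, 0.1), rng.gauss(0.0, 0.1)
    X1.append([a * v + b for v in row])

# X2: additional baseline c*l + d*l^2, c ~ N(0, 6e-4), d ~ N(0, 3e-6)
X2 = []
for row in X1:
    c, d = rng.gauss(0.0, 6e-4), rng.gauss(0.0, 3e-6)
    X2.append([v + c * l + d * l * l for v, l in zip(row, grid)])
```

X1 corresponds to the two‐parameter size effect (constant baseline) and X2 to the four‐parameter one (parabolic baseline).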

3.2 | NIR spectra of leaves

Spectra of apple leaves were acquired by collecting images of healthy and scabbed leaves using a hyperspectral camera.
All images were converted to reflectance using a reference included in the image. For each infected leaf image, an average spectrum was calculated on ROIs corresponding to scab spots, whereas average spectra of healthy leaf images were
computed on ROIs having similar shape and width. This method produced 25 healthy and 25 infected spectra as shown
in Figure 3. This data set is described in detail in Nouri et al.22

3.3 | NIR spectra of musts

A second real data set was made up of NIR spectra of musts acquired in laboratory conditions during a wine making campaign, associated with values of alcohol by volume. A double beam NIR spectrometer JASCO V560 was used with a 1 mm cell filled with water placed in the reference compartment and a 1 mm cell containing the must placed along the sample beam. Spectra were recorded on 750 wavelengths equally spaced (2 nm) between 800 and 2298 nm. The data set, containing 621 spectra, is shown in Figure 4. A calibration set of 414 samples (about 2/3) and a test set of 207 samples (about 1/3) were drawn using the Duplex algorithm.23

3.4 | Algorithm

The algorithm described in Section 2 has been implemented with Matlab R2015b (The Mathworks, Natick, MA, USA). Weighting matrices have been determined using size effect models with two to four parameters. In each case, an optimal tolerance value ε was chosen manually (possible ways to automatically choose the optimal value will be discussed in Section 4).
Specific weighted SNV and EMSC algorithms have been defined. Let x (1 × p) be a signal to be processed, z (1 × p) the resulting signal, and W (p × p) the diagonal weighting matrix.

FIGURE 3 Reflectance spectra of healthy and infected apple tree leaves

FIGURE 4 NIR spectra of musts acquired at line during wine making process

The weighted SNV was carried out as follows:

1. Calculate the weighted mean of the signal: m = mean(xW)
2. Subtract this value from the signal: xC = x − m
3. Calculate the weighted standard deviation of the signal: s = std(xCW)
4. Divide each term of the signal by this value: z = (1/s) xC
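The four steps above can be sketched in plain Python. The sketch assumes that the weighted mean and the weighted standard deviation are normalized by the sum of the weights, so that zero‐weight variables are ignored entirely; this is one reading of mean(xW) and std(xCW) and may differ from the authors' Matlab implementation. The function name and the toy values are hypothetical.

```python
import math

def weighted_snv(x, w):
    """Weighted SNV sketch: w holds the diagonal of W (one weight per variable)."""
    sw = sum(w)
    # Step 1: weighted mean (assumption: normalized by the sum of the weights)
    m = sum(wi * xi for wi, xi in zip(w, x)) / sw
    # Step 2: subtract it from every variable
    xc = [xi - m for xi in x]
    # Step 3: weighted standard deviation of the centered signal
    s = math.sqrt(sum(wi * v * v for wi, v in zip(w, xc)) / sw)
    # Step 4: divide every variable by it
    return [v / s for v in xc]

# Two hypothetical signals sharing the same size-only region (last four
# variables) but with different peak heights on the first two variables.
x_a = [0.5, 0.9, 0.1, 0.2, 0.3, 0.2]
x_b = [0.5, 0.3, 0.1, 0.2, 0.3, 0.2]
w = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]  # zero weight where the shape effect lies

z_a = weighted_snv([2.0 * v + 1.0 for v in x_a], w)  # distorted copy of x_a
z_b = weighted_snv(x_b, w)
# z_a and z_b coincide on the last four variables, and the peak difference
# on variable 1 is preserved instead of being spread over the whole signal.
```

With all weights equal to 1, the function reduces to a standard SNV using the population standard deviation.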

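The weighted EMSC was carried out as follows (a second‐order polynomial is used as an example):

1. Define
$$\mathbf{M} = \begin{pmatrix} r_1 & 1 & 1 & 1 \\ r_2 & 1 & 2 & 4 \\ \vdots & \vdots & \vdots & \vdots \\ r_p & 1 & p & p^2 \end{pmatrix}$$
where (r1 r2 ⋯ rp) is the EMSC reference signal.

2. Calculate the EMSC coefficients:
$$(a \;\; b \;\; c \;\; d)^{T} = (\mathbf{M}^{T}\mathbf{W}\mathbf{M})^{-1}\,\mathbf{M}^{T}\mathbf{W}\,\mathbf{x}^{T}$$

3. Correct the signal using the coefficients:
$$\mathbf{z}^{T} = \frac{1}{a}\,\mathbf{x}^{T} - \begin{pmatrix} 1 & 1 & 1 \\ 1 & 2 & 4 \\ \vdots & \vdots & \vdots \\ 1 & p & p^2 \end{pmatrix} \begin{pmatrix} b/a \\ c/a \\ d/a \end{pmatrix}$$
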
4 | RESULTS AND DISCUSSION

4.1 | Simulated spectra

Figure 5 shows results of standard SNV (Figure 5A) and weighted SNV (Figure 5B,C) preprocessing on spectra affected
by a multiplicative factor and a horizontal baseline (data of Figure 2C). All the weights of Figure 5A (top) are equal to 1,
which corresponds to standard SNV. The weights represented in Figure 5B (top) have been set manually. They are zero
in a zone covering the two chemical peaks (from 0 to 125), and unitary for the other variables. The weights represented in Figure 5C (top) have been calculated by the VSN algorithm, with a tolerance value ε = 10⁻³. This value has been chosen manually to produce a smooth and contrasted weight spectrum.
Spectra processed by standard SNV are shown in Figure 5A middle. The spectra we should ideally find are shown in
Figure 2B. We can observe large differences between these two sets of spectra. On the ideal spectra, the two chemical

FIGURE 5 Results of using weighted SNV on the synthetic data of Figure 2C, for several weights. A, Unitary weights. B, Optimal binary weights. C, Weights calculated by VSN. Each plot from A to C represents, from top to bottom: the weights, the processed spectra, the b‐coefficients of a 2LV PLS calibrated between the spectra and the concentrations of the peak centered at 70. Panel D represents the relation between the concentration related to the peak centered on 70 and its height for classical SNV and VSN processed spectra. SNV, standard normal variate; VSN, variable sorting for normalization

peaks, at about 40 and 70, show similar variations, with similar maxima. On the spectra processed by SNV, the respective amplitudes of these two peaks have changed; the second peak varies less than the first one. This effect is clearly noticeable in Figure 5D, which shows that the intensity of the peak of the SNV processed spectra does not vary linearly with the original intensity. These artifacts are the consequence of the global normalization carried out by the SNV transform. Part of the variance, initially concentrated on certain areas of the spectrum, is spread over all the variables, as reported in Filzmoser and Walczak.3 All ideal spectra are close to 0 for the first variable (left of Figure 2B). On the contrary, the spectra processed by SNV have variable intensities in this zone. Similarly, in the right part of the spectra (between 100 and 256), while the ideal spectra are almost identical, the spectra processed by SNV show baseline variations and homothetic deformations. These effects are due to the combination of an error in identifying the additive effect and the global normalization seen previously. Figure 5A (bottom) shows the b‐coefficients of a PLS calibrated between the SNV processed spectra and the concentrations related to the peak centered at about 70. As a result of all the deformations previously described, these coefficients exhibit nonzero values in the right part of the figure, where there is actually no chemical variation.

Figure 5B,C relate to weighted SNV. In both panels, the weights are close to 0 in the zone where there are shape
effects (from 0 to 125) and nonzero in the zone where the variations are due to size effects only (Figure 5B,C [top]).
Using these weights, the weighted SNV algorithm calculates the mean and the standard deviation of the spectra
mainly in the right zone. In the extreme case of Figure 5B, the matrix W contains unitary values on the diagonal
between 125 and 256 and 0 elsewhere. From the algorithm given in the material and methods part, it is straightforward
to see that this matrix performs a variable selection, so that the mean and the standard deviation of
the spectra are calculated on these variables only. In the case of Figure 5C, the behavior is less binary, but the result is the same. Both
weightings give a good estimation of the multiplicative and additive factors, resulting in a quasi‐perfect correction, as
shown by Figure 5B,C (middle). Figure 5B,C (bottom) shows the b‐coefficients of a PLS model calibrated between the spectra
processed by weighted SNV and the y value related to the peak centered at 70. These coefficients behave
properly: the zone not related to chemical variations (from 125 to 256) exhibits null coefficients, and a peak centered on 70
clearly accounts for the y variations. Due to the correlation between the concentrations related to the two peaks

FIGURE 6 Results of applying the weighted EMSC algorithm (degree 2) on the synthetic data of Figure 2D, for various weights. A, Unitary
weights. B, Optimal binary weights. C, Weights calculated by VSN (tolerance = 10⁻⁴). Each plot from A to C represents, from top to bottom: the weights,
the processed spectra, the b‐coefficients of a 2LV PLS calibrated between the spectra and the concentrations of the peak centered at 70. Panel
D represents the relation between the concentration related to the peak centered on 70 and its height for classical EMSC and VSN processed
spectra. EMSC, extended multiplicative signal correction; VSN, variable sorting for normalization

(at 30 and 70), the b‐coefficients exhibit a small negative peak at about 30. Figure 5D clearly shows that weighted
SNV preserves the linear relationship between the peak intensities and the concentrations, unlike classical SNV.
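The weighted SNV step discussed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `weighted_snv`, the toy spectra, and the use of a population (divide-by-sum-of-weights) standard deviation are all assumptions made for the example. The vector `w` plays the role of the diagonal of W.

```python
import numpy as np

def weighted_snv(X, w):
    """Centre and scale each spectrum (row of X) with a weighted mean and a
    weighted standard deviation; w holds one non-negative weight per variable."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    if np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be non-negative and not all zero")
    p = w / w.sum()                       # normalised weights (sum to 1)
    mean = X @ p                          # weighted mean, one value per spectrum
    centred = X - mean[:, None]
    std = np.sqrt((centred ** 2) @ p)     # weighted (population) standard deviation
    return centred / std[:, None]

# Toy data (assumption): a peak at variable 70 with varying height on a fixed
# background, then a random gain and offset applied to each spectrum.
rng = np.random.default_rng(0)
k = np.arange(256)
heights = np.linspace(0.5, 2.0, 10)
background = 0.5 + 0.5 * np.sin(k / 20.0)
ideal = heights[:, None] * np.exp(-0.5 * ((k - 70) / 8.0) ** 2) + background
gain = rng.uniform(0.5, 2.0, size=(10, 1))
offset = rng.uniform(-1.0, 1.0, size=(10, 1))
X = gain * ideal + offset

w_binary = (k >= 125).astype(float)        # binary case: ones from 125 to 255, zero elsewhere
Z_weighted = weighted_snv(X, w_binary)     # statistics taken on variables 125..255 only
Z_classic = weighted_snv(X, np.ones(256))  # unit weights: classical SNV (up to the n vs n-1 convention)
```

On this toy set, the binary weights make all corrected spectra coincide in the right zone and keep the height of the peak at 70 linear in the simulated concentration, whereas unit weights leave residual variations in the right zone.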
Figure 6, whose panels are organized analogously to those in Figure 5, shows the results of the application of EMSC to
the synthetic spectra of Figure 2D. The spectra used are affected by a multiplicative effect and a parabolic baseline. As a
result, the VSN algorithm was used with four parameters, and EMSC was used with a degree 2 polynomial. As in
Figure 5, applying EMSC without precaution gives results far from the ideal spectra.
The observed distortions (Figure 6A) seem even more pronounced than for SNV (Figure 5A). This is probably due to EMSC
model resolution problems in borderline cases. The coefficients of the model built on the EMSC corrected spectra
(Figure 6A [bottom]) are deformed in a similar way to those built on the SNV corrected spectra (Figure 5A [bottom]). The
right part shows coefficients very far from 0, because of the variance transported from the left part to the right part of
the spectra. The “ideal” weights in Figure 6B and the calculated weights in Figure 6C give very satisfactory results when
used with the weighted EMSC algorithm. Both the corrected spectra and the b‐coefficients of PLS are in accordance with
the ground truth. As for SNV (Figure 5D), the relation between the peak height and the concentration is far from linear
for EMSC processed spectra, whereas it is perfectly linear for weighted EMSC processed spectra (Figure 6D).
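A weighted EMSC correction of the kind used for Figure 6 can be sketched as a weighted least-squares fit of each spectrum on a reference spectrum plus a polynomial baseline. The code below is an illustrative assumption, not the authors' code: the name `weighted_emsc`, the rescaling of the variable axis to [-1, 1], and the toy data are hypothetical, and the known background is passed as the reference only to keep the example exact (in practice a mean spectrum is commonly used). With degree 2, four parameters are estimated per spectrum.

```python
import numpy as np

def weighted_emsc(X, w, reference, degree=2):
    """Fit x = a * reference + polynomial baseline by weighted least squares,
    then return (x - baseline) / a for every spectrum (row of X)."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    t = np.linspace(-1.0, 1.0, X.shape[1])             # rescaled variable axis
    poly = np.vander(t, degree + 1, increasing=True)   # columns 1, t, ..., t^degree
    design = np.column_stack([reference, poly])        # degree + 2 parameters per spectrum
    sw = np.sqrt(w)[:, None]                           # row scaling implements the weights
    coef, *_ = np.linalg.lstsq(sw * design, sw * X.T, rcond=None)
    scale = coef[0]                                    # multiplicative factor a, per spectrum
    baseline = (poly @ coef[1:]).T                     # fitted baseline, same shape as X
    return (X - baseline) / scale[:, None]

# Toy data (assumption): varying peak at 70, fixed background, random gain and
# random parabolic baseline for each spectrum.
rng = np.random.default_rng(1)
k = np.arange(256)
t = np.linspace(-1.0, 1.0, 256)
heights = np.linspace(0.5, 2.0, 8)
background = 0.5 + 0.5 * np.sin(k / 20.0)
ideal = heights[:, None] * np.exp(-0.5 * ((k - 70) / 8.0) ** 2) + background
gain = rng.uniform(0.5, 2.0, size=(8, 1))
c = rng.uniform(-1.0, 1.0, size=(8, 3))
X = gain * ideal + c[:, [0]] + c[:, [1]] * t + c[:, [2]] * t ** 2

w_binary = (k >= 125).astype(float)                    # fit on the size-effect zone only
Z = weighted_emsc(X, w_binary, reference=background)   # recovers the ideal spectra on this toy set
```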
In the two examples above (Figures 5 and 6), the tolerance value ε has been chosen manually. Theoretically, the role
of the tolerance ε, according to the RANSAC algorithm, is to ensure the selection of the wavelength amplitudes for
which a given size effect model is applicable, despite the amplitude noise. Therefore, the tolerance value must be high
enough to accept amplitude deviations due to spectrum noise (say, at least three standard deviations for Gaussian
noise). On the other hand, it must not be so high that the main deviations due to shape effects are masked, which leads
to a trade‐off in its selection. Figure 7A shows the weights obtained with the data of Figure 5, for several values of the
tolerance threshold varying from 10⁻⁸ to 1. The selection of the value ε turns out not to be that critical:
similar weighting functions are obtained for a tolerance value ε between approximately 10⁻⁴ and 10⁻². This
range is in accordance with the noise that was added to the synthetic spectra (10⁻³). A rule of thumb for tuning the
tolerance value would therefore be to adopt the magnitude of the spectral noise. Nevertheless, evaluating the noise that affects
spectra is not always easy, so another method is presented below.
Figure 7A shows that the weighting function ranges between a constant value of 0 (ε too low: no wavelength
selected) and a constant value of 1 (ε too high: all wavelengths selected). Between these two states, this function takes on
a more or less noisy signal appearance. As shown in Figure 7B, we found on the synthetic data and real data that the
standard deviation of this function passes through a clear and unique maximum between the constant state at 0 and the
constant state at 1, both of which give a null standard deviation. This maximum standard deviation ideally corresponds to a
well‐contrasted curve, taking low values (close to 0) and high values (close to 1), which approaches the form of a function
that realizes the selection of variables impacted by the size effect. On the basis of Figure 7B, optimal values for the
tolerance can be chosen from 10⁻⁴ to 10⁻³, which is in accordance with the level of noise.
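The empirical rule described above (scan the tolerance and keep the value that maximizes the standard deviation of the weights) might be sketched as follows. The helper `select_tolerance` and the stand-in `toy_weights` are hypothetical; in particular, `toy_weights` is not the VSN weight computation, only a placeholder that goes from all zeros to all ones as ε grows.

```python
import numpy as np

def select_tolerance(weights_for, tolerances):
    """Return the tolerance whose weight vector has the largest standard
    deviation, together with the standard deviation at every tolerance."""
    tolerances = np.asarray(tolerances, dtype=float)
    spread = np.array([np.std(weights_for(eps)) for eps in tolerances])
    return tolerances[int(np.argmax(spread))], spread

# Placeholder for the weight computation (assumption): a variable is selected
# when a hypothetical deviation from the size-effect model is below eps.
deviation = np.concatenate([np.full(125, 2e-1),   # zone with shape effects
                            np.full(131, 2e-3)])  # zone with noise only

def toy_weights(eps):
    return (deviation < eps).astype(float)

grid = np.logspace(-8, 0, 17)                     # 10^-8 ... 1, two values per decade
best_eps, spread = select_tolerance(toy_weights, grid)
```

Both ends of the grid give constant weights and therefore a zero standard deviation; the selected value lies between the two deviation levels of the toy data. When several tolerances tie, `np.argmax` keeps the smallest one, which is a design choice of this sketch rather than something stated in the text.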

FIGURE 7 A, Evolution of the weights W calculated on synthetic data of Figure 2B, as a function of the tolerance ε used in RANSAC. B,
Evolution of the standard deviation of the weights W

4.2 | NIR spectra of leaves

Figure 8 shows the results of the processing of the apple tree leaf spectra. The original spectra, shown in Figure 3, are
clearly affected by baseline shifts and multiplicative effects, as expected for reflectance spectra. It has been established
elsewhere (publication in progress) that the most informative part of these spectra lies in the slope between 1000
and 1400 nm. This slope is due to the changes in light scattering caused by the fungus. The results of the nonweighted
EMSC are shown in Figure 8A (middle). The baselines and multiplicative effects are greatly reduced, and the two classes
appear separated in the left part of the spectra (1000‐1400 nm). Nevertheless, the two classes also separate in the
range 1600 to 1800 nm and, to a lesser extent, in the range 1800 to 2500 nm. EMSC processing dilutes the information
all over the spectral range. This results in model coefficients (Figure 8A [bottom]) that look like the mean spectrum
of the data set. Such a model is difficult to interpret. In Figure 8B, the weights have been manually chosen in
order to exclude the part expected to be informative from the EMSC calculation; weights are null from 1000 to 1400 nm
and unitary on the other wavelengths. The results of the weighted EMSC are shown in Figure 8B (middle). The side

FIGURE 8 Results of applying the weighted EMSC (degree 1) approach on the leaves data, for various weights. A, Unitary weights. B,
Optimal binary weights. C, Weights calculated by VSN. D, Weights calculated by VSN and squared. Each plot from A to D represents,
from top to bottom: the weights, the processed spectra, the b‐coefficients of a 2LV PLS‐DA calibrated between the spectra and the two classes
healthy and scabbed

effects of the EMSC processing have disappeared. The two classes separate only on the left part of the spectra. The
model shown in Figure 8B (bottom) exhibits a more interpretable shape: a slope between 1000 and 1350 nm and very
low coefficients elsewhere. Figure 8C shows the results with weights calculated by VSN. The tolerance value has been
chosen according to the maximum of the standard deviation, as explained above. The global shape of the weights is
similar to the one manually chosen in Figure 8B (top): low weights until 1400 nm, then an increase and high weights
above 1700 nm. With these weights, EMSC produces the spectra of Figure 8C (middle). The classes appear well separated
on the left part of the spectrum, while they are very close over the rest of the wavelength range. A small separation
remains in the middle part, between 1600 and 1800 nm. The resulting model is shown in Figure 8C (bottom). Regarding
the slope between 1000 and 1350 nm, it looks like the preceding model but differs in the central part, exhibiting some
positive coefficients between 1600 and 1800 nm.
Thus, the weights calculated by VSN do not seem to act optimally. In Figure 8C (top), it can be noticed that the low
weights, on the left, are about 0.2 while the highest weights, on the right, are close to 1. However, we know that the two
parts are distinct with regard to the observed phenomenon. The weights therefore lack contrast; those of the left part should be lower.
Because the weights lie between 0 and 1, their contrast can be increased by raising them to a power
greater than 1. This was done in Figure 8D, where the weights of Figure 8C (top) have been squared. The results are now very
close to those of Figure 8B, obtained with the optimally chosen weights.
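The contrast enhancement described above amounts to an element-wise power. A small sketch, using the approximate weight levels quoted in the text (0.2 and 1) plus one illustrative intermediate value; the function name `sharpen_weights` is an assumption:

```python
import numpy as np

def sharpen_weights(w, power=2.0):
    """Raise weights lying in [0, 1] to a power >= 1: values close to 1 barely
    move, while small values are pushed towards 0."""
    w = np.asarray(w, dtype=float)
    if np.any(w < 0) or np.any(w > 1):
        raise ValueError("weights are expected to lie in [0, 1]")
    if power < 1:
        raise ValueError("power must be >= 1 to increase the contrast")
    return w ** power

w = np.array([0.2, 0.2, 0.6, 1.0])   # low (left) weights, a transition value, high (right) weight
w_squared = sharpen_weights(w)       # 0.2 becomes 0.04, 0.6 becomes 0.36, 1.0 stays 1.0
```

Squaring lowers the ratio between the low and the high weights from 0.2 to 0.04, so the left zone contributes much less to the normalization parameters.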

4.3 | NIR spectra of musts

With the two previous examples, the PLSR and PLSDA models worked perfectly, in terms of fitting or prediction errors,
with both SNV and VSN. These examples mainly aimed at showing the effect of VSN on the spectra and its contribution
to a more reliable interpretation of the models. The results presented for this third example, on the other hand,
illustrate how VSN can also improve the performance of a model with respect to SNV.
The spectra of Figure 4 exhibit a baseline due to the turbidity of the musts, which varies from one sample to another.
Significant variability is visible between 1900 and 2300 nm, where the absorption bands of water and ethanol are
located. The negative parts of the spectra are due to the fact that pure water is used as the reference.
Figure 9 shows the results of classical SNV (A) and of VSN plus weighted SNV (B). The weights calculated by VSN
(Figure 9B [top]) clearly exhibit two contrasted zones: a zone of weights close to 1, between 800 and 1300 nm, meaning
that the corresponding variables are affected by the size effect only, and another region, between 1400 and 2300 nm, mostly
affected by shape effects. The difference between SNV and VSN is clearly visible on the spectra (Figure 9 [middle]). VSN

FIGURE 9 Results of applying weighted SNV on the real must data of Figure 4, for various weights. A, Unitary weights. B, Weights
calculated by VSN. Each plot represents, from top to bottom: the weights, the processed spectra, the b‐coefficients of a PLS calibrated
between the spectra and the alcohol by volume

TABLE 1 Results of PLS regression on the wine must data set, after preprocessing by classical SNV (first row) or VSN and weighted SNV
(second row)

Preprocessing          #LV   RMSECV   R²CV   RMSEP   R²TEST
SNV                    5     0.890    0.94   0.963   0.93
VSN + Weighted SNV     5     0.653    0.97   0.701   0.96

Note. RMSEP and R²TEST were calculated on the external TEST set, while RMSECV and R²CV are the average results of 10 runs of random
split‐half cross‐validation on training data. The number of latent variables corresponds to the minimum of RMSECV.
Abbreviations: PLS, partial least squares; SNV, standard normal variate; VSN, variable sorting for normalization.

manages to completely remove the baselines, unlike SNV. The effects on PLS coefficients (Figure 9 [bottom]) are more
subtle. In the case of VSN (Figure 9B [bottom]), the regression coefficients exhibit a sharper and more pronounced
variation in the spectral areas containing peaks, from 1400 to 1500 nm and from 1900 to 2300 nm, than those calculated
on SNV‐processed data (Figure 9A [bottom]).
The regression performances of the PLS models built on data preprocessed with either SNV or VSN plus weighted
SNV are summarized in Table 1, which shows that the proposed approach outperforms classical SNV.
Indeed, all the statistics improve when VSN is used; in particular, RMSEP is reduced
by about 27% (from 0.963 to 0.701).
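The size of the improvement can be checked directly from the values of Table 1. The short computation below is only an arithmetic illustration; the helper name `relative_reduction` is an assumption.

```python
def relative_reduction(before, after):
    """Relative decrease of an error measure, as a fraction of its initial value."""
    return (before - after) / before

rmsep_reduction = relative_reduction(0.963, 0.701)    # SNV vs VSN + weighted SNV, test set
rmsecv_reduction = relative_reduction(0.890, 0.653)   # same comparison, cross-validation

print(f"RMSEP reduction:  {rmsep_reduction:.1%}")     # prints "RMSEP reduction:  27.2%"
print(f"RMSECV reduction: {rmsecv_reduction:.1%}")    # prints "RMSECV reduction: 26.6%"
```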

5 | CONCLUSION

In this study, a novel algorithm aimed at removing additive and multiplicative effects in the framework of multivariate
signal processing has been presented.
Rather than proposing another completely new approach for signal normalization, VSN is intended to be used prior
to SNV or other standard normalization methods. Its role is to optimize the computation of normalization parameters
involved in such methods, by selecting relevant variables.
The efficiency of VSN in terms of interpretability of the preprocessed signals and of the corresponding PLS model
parameters has been clearly shown on all the synthetic and real data presented.
However, some particular issues connected to the use of VSN should be further highlighted.
First, VSN gives up one of the most popular advantages of SNV, which is that any multivariate signal can be processed
individually, ie, totally independently of other ones. In the present case, a kind of “learning set” must be defined
in order to compute the weighting matrix. This is not a great concern, however, as such a learning set is usually required
for the subsequent model building stage.
It must also be noted that the degree of the expected baseline must be carefully chosen. Though not reported in the
present study, it can be shown with our real data example that introducing a baseline correction term with a polynomial
degree higher than necessary (eg, 2 instead of 1) can decrease the quality of the results.
Several points would require further studies, in order to reach a fully automated algorithm implementation.
The first one is the choice of an optimal tolerance value, which should be related to the noise level of the profiles.
Indeed, although an empirical method based on the maximum standard deviation of the weighting function was proposed, this
approach is presently not fully theoretically supported, and it should be validated on other data sets. The same remark
can be made concerning the increase of the weighting function contrast by raising its values to a power, as was done for the
apple leaves real data set.
Other tunable parameters, whose effect could require additional investigation, are the number of times random pairs of
variables should be drawn for a given couple of signals, and the number of times the selection of random pairs of signals
should be iterated for a given data set. These must be chosen as a trade‐off between the total processing time and
a reasonable probability to “catch” the relevant Sz class (for a signal couple) or a correctly estimated weighting function
(for the data set). Theoretical studies could be conducted on this point.
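As a starting point for this trade-off, the classical RANSAC estimate of the number of random draws can be used. It is not taken from the present study and is given here only as a hedged sketch: if a fraction `inlier_fraction` of the variables follows the size-effect model and each draw picks `sample_size` variables (2 for a pair), the number of draws needed to obtain at least one all-inlier sample with probability `success_probability` is approximately log(1 - p) / log(1 - w^s). The function name and its default values are assumptions, and the formula treats draws as independent with replacement.

```python
import math

def ransac_draws(inlier_fraction, sample_size=2, success_probability=0.99):
    """Approximate number of random draws needed so that, with the requested
    probability, at least one sample contains only inlier variables."""
    if not 0 < inlier_fraction <= 1:
        raise ValueError("inlier_fraction must be in (0, 1]")
    if not 0 < success_probability < 1:
        raise ValueError("success_probability must be in (0, 1)")
    p_good = inlier_fraction ** sample_size   # chance that one draw is all-inlier
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1 - success_probability) / math.log(1 - p_good))

print(ransac_draws(0.5))   # 17 draws when half of the variables are inliers
print(ransac_draws(0.2))   # 113 draws when only 20% are inliers
```

The rapid growth of the count as the inlier fraction decreases illustrates why the number of draws has to be balanced against the total processing time.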
Finally, note that VSN addresses multivariate data in which variable ranges not affected by shape effects actually
exist. If the shape effect is spread all over the variables, VSN cannot bring any improvement compared with standard
normalization methods.

ACKNOWLEDGMENTS

Data about scabbed apple tree leaves have been provided by a project supported by the French Ministry of Agriculture
and the Association Nationale de la Recherche et de la Technologie ANRT (Casdar project Aventuria 1412).

DATA AVAILABILITY STATEMENT

Code is available on request to the corresponding author or at https://framagit.org/chemhouse_repository/chemometrics_toolbox/raw/master/chemhouse_toolbox.zip.

ORCID
Jean‐Michel Roger https://orcid.org/0000-0003-2123-5266

REFERENCES
1. Massart DL, Vandeginste BG, Buydens LMC, Lewi PJ, Smeyers‐Verbeke J. Handbook of Chemometrics and Qualimetrics: Part A. Amsterdam,
The Netherlands: Elsevier Science Inc; 1997.
2. Martens H, Naes T. Multivariate Calibration. United Kingdom: John Wiley & Sons; 1992.
3. Filzmoser P, Walczak B. What can go wrong at the data normalization step for identification of biomarkers? J Chromatogr A.
2014;1362:194‐205.
4. Geladi P, MacDougall D, Martens H. Linearization and scatter‐correction for near‐infrared reflectance spectra of meat. Appl Spectrosc.
1985;39(3):491‐500.
5. Isaksson T, Næs T. The effect of multiplicative scatter correction (MSC) and linearity improvement in NIR spectroscopy. Appl Spectrosc.
1988;42(7):1273‐1284.
6. Yang L, Miklavcic SJ. Revised Kubelka–Munk theory. III. A general theory of light propagation in scattering and absorptive media. JOSA
A. 2005;22(9):1866‐1873.
7. Hadoux X, Gorretta N, Roger JM, Bendoula R, Rabatel G. Comparison of the efficacy of spectral pre‐treatments for wheat and weed
discrimination in outdoor conditions. Comput Electron Agric. 2014;108:242‐249.
8. Barnes RJ, Dhanoa MS, Lister SJ. Standard normal variate transformation and de‐trending of near‐infrared diffuse reflectance spectra.
Appl Spectrosc. 1989;43(5):772‐777.
9. Martens H, Jensen S, Geladi P. Proc. Nordic Symp. on Applied Statistics. Stokkand Forlag: Stavanger; 1983:205.
10. Martens H, Stark E. Extended multiplicative signal correction and spectral interference subtraction: new preprocessing methods for near
infrared spectroscopy. J Pharm Biomed Anal. 1991;9(8):625‐635.
11. Kohler A, Kirschner C, Oust A, Martens H. Extended multiplicative signal correction as a tool for separation and characterization of phys-
ical and chemical information in Fourier transform infrared microscopy images of cryo‐sections of beef loin. Appl Spectrosc.
2005;59(6):707‐716.
12. Bylesjo M, Cloarec O, Rantalainen M. Normalization and closure. In: Brown S, Tauler R, Walczak B, eds. Comprehensive Chemometrics.
Vol 2. Amsterdam: Elsevier; 2009:109‐127.
13. Rietjens M. Reduction of error propagation due to normalization: effect of error propagation and closure on spurious correlations. Anal
Chim Acta. 1995;316(2):205‐215.
14. Johansson E, Wold S, Sjoedin K. Minimizing effects of closure on analytical data. Anal Chem. 1984;56(9):1685‐1688.
15. Fearn T, Riccioli C, Garrido‐Varo A, Guerrero‐Ginel JE. On the geometry of SNV and MSC. Chemom Intel Lab Syst. 2009;96(1):22‐26.
16. Guo Q, Wu W, Massart DL. The robust normal variate transform for pattern recognition with near‐infrared data. Anal Chim Acta.
1999;382(1‐2):87‐103.
17. Dieterle F, Ross A, Schlotterbeck G, Senn H. Probabilistic quotient normalization as robust method to account for dilution of complex
biological mixtures. Application in 1H NMR metabonomics. Anal Chem. 2006;78(13):4281‐4290.
18. Shenk JS, Westerhaus MO. Routine Operation, Calibration, Development and Network System Management Manual. Silver Spring, MD,
USA: NIRSystems, Inc; 1995:1995.
19. Gallagher NB, Blake TA, Gassman PL. Application of extended inverse scatter correction to mid‐infrared reflectance spectra of soil.
J Chemometr. 2005;19(5‐7):271‐281.
20. Andersen R. Modern Methods for Robust Regression. Sage university paper series on quantitative analysis; No 152; 2008.
21. Fischler MA, Bolles RC. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated
cartography. Commun ACM. 1981;24(6):381‐395.
22. Nouri M, Gorretta N, Vaysse P, et al. Near infrared hyperspectral dataset of healthy and infected apple tree leaves images for the early
detection of apple scab disease. Data Brief. 2018;16:967‐971.
23. Snee RD. Validation of regression models: methods and examples. Technometrics. 1977;19(4):415‐428.

How to cite this article: Rabatel G, Marini F, Walczak B, Roger J‐M. VSN: Variable sorting for normalization.
Journal of Chemometrics. 2020;34:e3164. https://doi.org/10.1002/cem.3164

Fouad Sabry
No ratings yet
K Nearest Neighbor Algorithm: Fundamentals and Applications
From Everand
K Nearest Neighbor Algorithm: Fundamentals and Applications
Fouad Sabry
No ratings yet

VSN Paper 2020

Uploaded by

VSN Paper 2020

Uploaded by


2.1 | Basic principle

• To make Ns random samplings of pairs of signals in the X matrix.
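The sampling step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_signal_pairs`, the `seed` argument, and the choice to forbid pairing a signal with itself are all assumptions made for the example. It assumes X stores one signal per row.

```python
import numpy as np


def sample_signal_pairs(X, n_samplings, seed=None):
    """Draw n_samplings random pairs of distinct row indices of X.

    Hypothetical helper: assumes one signal per row of X.
    """
    rng = np.random.default_rng(seed)
    n_signals = X.shape[0]
    if n_signals < 2:
        raise ValueError("need at least two signals to form a pair")
    # First index of each pair: uniform over all signals.
    first = rng.integers(0, n_signals, size=n_samplings)
    # Second index: shift the first by 1..n_signals-1 (modulo n_signals),
    # so it can never coincide with the first.
    offset = rng.integers(1, n_signals, size=n_samplings)
    second = (first + offset) % n_signals
    return np.column_stack((first, second))


# Usage on a synthetic matrix of 10 signals with 50 variables each.
X = np.random.default_rng(0).normal(size=(10, 50))
pairs = sample_signal_pairs(X, n_samplings=200, seed=1)
```

Each row of `pairs` holds the indices of the two signals to be compared in one sampling.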

2.2 | Variable assignments for a pair of signals

2.3 | Processing chart summary

2.4 | Extension to more complex transformation models

3.1 | Simulated spectra

3.2 | NIR spectra of leaves

3.3 | NIR spectra of musts

FIGURE 3 Reflectance spectra of healthy and infected apple tree leaves

The weighted SNV was carried out as follows:

1. Calculate the weighted mean of the signal: $m = \operatorname{mean}(xW)$

where $(r_1\; r_2 \cdots r_p)$ is the EMSC reference signal

2. Calculate the EMSC coefficients: $(a\; b\; c\; d) = (M W M^{T})^{-1} M W x^{T}$
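The expression for the EMSC coefficients is a weighted least-squares solution, which can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the authors' implementation: the function name `emsc_coefficients` is hypothetical, W is taken to be a diagonal matrix built from a weight vector `w`, and the rows of M (a constant, the wavelength, its square, and the reference signal) are an assumed layout chosen only to make the example concrete.

```python
import numpy as np


def emsc_coefficients(x, M, w):
    """Weighted least squares: (M W M^T)^{-1} M W x^T with W = diag(w).

    x: signal of length p; M: (k, p) basis matrix; w: p weights.
    """
    x = np.asarray(x, dtype=float).ravel()
    M = np.asarray(M, dtype=float)
    w = np.asarray(w, dtype=float).ravel()
    # Scaling column j of M by w[j] equals M @ diag(w),
    # without building the p x p diagonal matrix.
    MW = M * w
    # Solve the normal equations rather than forming the inverse explicitly.
    return np.linalg.solve(MW @ M.T, MW @ x)


# Synthetic check: build a signal from known coefficients and recover them.
p = 100
lam = np.linspace(-1.0, 1.0, p)           # assumed wavelength axis
r = np.exp(-((lam - 0.2) ** 2) / 0.05)    # assumed reference signal
M = np.vstack((np.ones(p), lam, lam ** 2, r))
true_coeffs = np.array([0.5, -0.3, 0.1, 2.0])
x = true_coeffs @ M
w = np.linspace(0.1, 1.0, p)              # arbitrary positive weights
coeffs = emsc_coefficients(x, M, w)
```

Because the synthetic signal is noise-free, the recovered coefficients match the true ones for any set of strictly positive weights; with real spectra the weights determine which variables drive the fit.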

4.1 | Simulated spectra

4.2 | NIR spectra of leaves

4.3 | NIR spectra of musts

Method | #LV | RMSECV | R²CV | RMSEP | R²TEST
SNV | 5 | 0.890 | 0.94 | 0.963 | 0.93

ACKNOWLEDGEMENTS

DATA AVAILABILITY STATEMENT

Code is available on request from the corresponding author or at https://framagit.org/chemhouse_repository/
