Asghari 2019
Comparison between partial least square and support vector regression with
a genetic algorithm wavelength selection method for the simultaneous
determination of some oxygenate compounds in gasoline by FTIR
spectroscopy
Ahmad Asghari⁎, Mohammadreza Khanmohammadi Khorrami, Amir Bagheri Garmarudi
Chemistry Department, Faculty of Science, Imam Khomeini International University, Qazvin 3414896818, Iran
ARTICLE INFO

Keywords: FTIR spectroscopy; Chemometrics; Multivariate calibration; Gasoline additives

ABSTRACT

In the current research, FTIR spectroscopy (mid-infrared, 600–4000 cm⁻¹) coupled with multivariate calibration is proposed as a powerful approach for the simultaneous determination of oxygenates in gasoline. To that end, partial least squares regression (PLS-R) combined with a genetic algorithm (GA) wavelength selection method was compared with GA-support vector regression (GA-SVR). The models were evaluated by the root mean square error of prediction (RMSEP), the leave-one-out cross-validation root mean square error, and the correlation coefficients for the calculated (R²cal) and predicted (R²pred) values. Based on the findings of this work, the GA-SVR model is the better predictor of the two, with higher R²pred values (0.971, 0.950, 0.955, 0.960, 0.970, and 0.969) and lower RMSEP values (0.185, 0.245, 0.218, 0.229, 0.218, and 0.227) for methyl t-butyl ether (MTBE), iso-butanol, n-butanol, propanol, ethanol, and methanol, respectively, in comparison to PLS (R²pred = 0.951, 0.940, 0.938, 0.940, 0.952, and 0.949; RMSEP = 0.32, 0.283, 0.303, 0.299, 0.300, and 0.311). The lowest detection limit was 0.06% w/w for the GA-SVR model and 0.2% w/w for the GA-PLS model. In addition, over a concentration range from 0.06 to 3.5% w/w the values were in accordance with gas chromatography analysis of the oxygenate compounds. Hence, together with GA-SVR, FTIR can be an efficient, real-time approach to the quantitative analysis of oxygenate compounds in gasoline.
⁎ Corresponding author.
E-mail address: [email protected] (A. Asghari).
https://doi.org/10.1016/j.infrared.2019.103177
Received 26 October 2019; Received in revised form 26 December 2019; Accepted 26 December 2019
Available online 28 December 2019
1350-4495/ © 2020 Elsevier B.V. All rights reserved.
A. Asghari, et al. Infrared Physics and Technology 105 (2020) 103177
Fig. 1. (a) FTIR spectra of the gasoline samples; (b) autoscaled data; (c) trend of the average fitness and the best fitness with generation number; (d) spectral variables selected by the genetic algorithm.
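Panels (c) and (d) of Fig. 1 summarize a genetic-algorithm search over wavelengths. The following is a minimal Python sketch of that kind of search, not the authors' implementation: the population size, mutation rate, number of generations, the toy data and the function names (`loo_rmsecv`, `ga_select`, `make_toy_data`) are all assumptions made for illustration, and the fitness uses a deliberately simple stand-in model (regression on the mean of the selected variables) where the paper uses PLS or SVR.

```python
import math
import random


def rmse(reference, predicted):
    """Root mean square error between reference and predicted values."""
    n = len(reference)
    return math.sqrt(sum((c - p) ** 2 for c, p in zip(reference, predicted)) / n)


def loo_rmsecv(X, y, mask):
    """Leave-one-out RMSECV of a stand-in model (assumption): y is regressed
    on the mean of the selected variables. A PLS or SVR model would go here."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return float("inf")  # an empty selection cannot be evaluated
    z = [sum(row[j] for j in cols) / len(cols) for row in X]
    preds = []
    for i in range(len(y)):
        zt, yt = z[:i] + z[i + 1:], y[:i] + y[i + 1:]
        mz, my = sum(zt) / len(zt), sum(yt) / len(yt)
        sxx = sum((a - mz) ** 2 for a in zt)
        sxy = sum((a - mz) * (b - my) for a, b in zip(zt, yt))
        slope = sxy / sxx if sxx else 0.0
        preds.append(my + slope * (z[i] - mz))
    return rmse(y, preds)


def ga_select(X, y, generations=30, pop_size=20, p_mut=0.05, seed=0):
    """Binary-mask GA: tournament selection, one-point crossover,
    bit-flip mutation and elitism. All settings are illustrative."""
    rng = random.Random(seed)
    n_vars = len(X[0])

    def fitness(mask):
        return loo_rmsecv(X, y, mask)

    # Include the all-variables mask so the search never ends up worse than it.
    population = [[1] * n_vars]
    population += [[rng.randint(0, 1) for _ in range(n_vars)]
                   for _ in range(pop_size - 1)]
    history = []  # best fitness at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness)
        history.append(fitness(ranked[0]))
        next_population = [ranked[0][:]]  # elitism: keep the best mask
        while len(next_population) < pop_size:
            parent_a = min(rng.sample(ranked, 2), key=fitness)
            parent_b = min(rng.sample(ranked, 2), key=fitness)
            cut = rng.randrange(1, n_vars)
            child = parent_a[:cut] + parent_b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            next_population.append(child)
        population = next_population
    best = min(population, key=fitness)
    return best, fitness(best), history


def make_toy_data(n_samples=12, n_vars=8, seed=1):
    """Hypothetical data: only variables 1 and 4 carry signal."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(n_vars)] for _ in range(n_samples)]
    y = [row[1] + row[4] + rng.gauss(0, 0.01) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    best, best_err, history = ga_select(X, y)
    print("selected variables:", [j for j, keep in enumerate(best) if keep])
    print("RMSECV (all variables):", loo_rmsecv(X, y, [1] * len(X[0])))
    print("RMSECV (selected):", best_err)
```

Because the best mask is carried over unchanged each generation, the best-fitness trace in `history` can only stay flat or decrease, which is the behaviour a best-fitness curve such as the one in Fig. 1c would be expected to show.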
data matrix (X) and the dependent variables (Y). As such, it maximizes the covariance by extracting latent variables that are correlated with the dependent variables while capturing a large amount of the variation in the data matrix. The greatest challenge faced by the PLS method is that the spectrum-property relationship is assumed to be linear [23,24]. However, this assumption is not valid in many industrial situations and is completely inadmissible for systems with strong intermolecular or intramolecular interactions, such as π-stacking and hydrogen bonding [25]. PLS is therefore not always the most suitable technique, especially when the underlying model is complex and nonlinear. In such instances, neural networks based on a multilayer perceptron have been suggested and have shown some degree of success; however, exhaustive external validation is necessary to prevent overfitting. More recently, numerous studies have focused on SVM-based techniques to solve chemical and industrial problems [26–29]. Support vector machines (SVMs) can be employed to tackle complicated nonlinear classification and regression problems [30,31]. As a learning algorithm, SVM offers good generalization power and focuses on the similarity between samples. It is a potent analysis tool for problems with small sample numbers and high dimensionality. SVMs seek the best separating hyperplane for classification problems and the best regression hyperplane for regression problems, the latter being support vector regression (SVR). In SVR, the principal idea is to map the original data into a higher dimensional feature space by employing kernels and then to carry out linear regression in that space [25,32,33]. By contrast, variable selection techniques yield favorable results by removing irrelevant and noisy variables and making a model less complicated [15,34]. The genetic algorithm, based on the principles of Darwin's theory of evolution and natural selection, or "survival of the fittest", has been successfully applied in many studies [28,35–38]. Up to now, most multivariate analyses for the measurement of oxygenated compounds in gasoline by FTIR spectroscopy have been based on the PLS method. So far, SVR combined with FTIR has not been used to measure these compounds. In addition, using the genetic algorithm to select the useful wavelengths makes the analysis accessible to non-specialists.

In the present study, petroleum systems were selected as representative examples of real-world samples. The FTIR spectra measured for several reference calibration specimens are linked to their concentration values, or to the values of any attribute of interest, so that a multivariate calibration can be built. The resulting calibration model is then applied to the spectra of unknown samples to estimate the concentration or the desired attribute value in those samples. A coupled genetic algorithm-support vector regression (GA-SVR) approach was employed to reach a suitable model. GA was employed to select variables, which could enhance the model's predictive abilities. Furthermore, widely-used approaches, such as partial least squares regression (PLSR)
Fig. 3. (a) RMSECV, RMSEC, RMSEP by PLS for each latent variable (b) Actual versus predicted concentrations of gasoline samples by PLS calibration model (c)
Parameter search summary plot for SVM-R model (d) Actual versus predicted concentrations of gasoline samples for MTBE by SVR calibration model.
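The error curves in Fig. 3a and the actual-versus-predicted plots in Fig. 3b and 3d rest on two quantities, a root mean square error and a correlation coefficient. The sketch below shows one way to compute them in Python. The concentration values are hypothetical, and R² is taken here to be the squared Pearson correlation between reference and predicted values, which is an assumption about how the reported coefficients were obtained.

```python
import math


def rmse(reference, predicted):
    """Root mean square error: sqrt(sum((C_i - C_hat_i)^2) / n).
    Gives RMSEC, RMSECV or RMSEP depending on which samples are passed in."""
    n = len(reference)
    return math.sqrt(sum((c - p) ** 2 for c, p in zip(reference, predicted)) / n)


def r_squared(reference, predicted):
    """Squared Pearson correlation between reference and predicted values
    (assumed definition of the reported R^2)."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_r * var_p)


# Hypothetical reference and predicted concentrations (% w/w) for a test set.
reference = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
predicted = [0.6, 0.9, 1.6, 1.9, 2.6, 2.9]

print("RMSEP:", rmse(reference, predicted))
print("R2pred:", r_squared(reference, predicted))
```

Applying `rmse` to calibration samples, to leave-one-out predictions, or to an external test set would give RMSEC, RMSECV and RMSEP respectively; only the choice of samples differs.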
Table 1
GA-PLS results for each oxygenate compound in the gasoline samples.

Oxygenates    R²cal    R²cv     R²pred   RMSECV   RMSEP
MTBE          0.982    0.960    0.951    0.21     0.32
Isobutanol    0.994    0.942    0.940    0.196    0.283
n-Butanol     0.979    0.951    0.938    0.210    0.303
n-Propanol    0.980    0.953    0.940    0.206    0.299
Ethanol       0.982    0.962    0.952    0.199    0.300
Methanol      0.982    0.964    0.949    0.210    0.311

Table 2
GA-SVR results for each oxygenate compound in the gasoline samples.

Oxygenates    R²cal    R²cv     R²pred   RMSECV   RMSEP
MTBE          0.999    0.980    0.971    0.173    0.185
Isobutanol    0.995    0.959    0.950    0.169    0.245
n-Butanol     0.994    0.981    0.955    0.185    0.218
n-Propanol    0.991    0.978    0.960    0.174    0.229
Ethanol       0.997    0.983    0.970    0.163    0.218
Methanol      0.992    0.983    0.969    0.168    0.227

indicated in Fig. 1c. The horizontal dashed line corresponds to the RMSECV obtained with all variables. The best fitness indicates how well the best model is performing. The best and average fitness lines merge as the population becomes more alike. At approximately generation 100, the line no longer changed much; hence, the GA had either reached agreement on the chosen variables or found equivalent solutions with other chosen variables. As demonstrated in Fig. 1d, 471 variables were selected from the 1771 wavelengths according to the association between the measured spectrum matrix X and the response vector y. To make sure that the prediction and real samples remain within the subspace of the training set, the score plot of the first leverage versus the second leverage was drawn, and all samples fell within the span of the training set scores (see Fig. 2a). Spectral outliers were tested for by applying a 95% confidence interval to the Hotelling T² and Q residual statistics (see Fig. 2b). As can be observed, there are no outliers because none of the data points lies outside the defined bounds. To evaluate the effectiveness of the model and of the wavenumber selection method, calibration models based on the PLS and SVR methods were built using all of the selected variables. PLS-R aims to establish a linear connection between the data matrix (X) and the dependent variables (Y) [26,30]. As such, it maximizes the covariance by extracting latent variables that are correlated with the dependent variables while capturing a large amount of the variation in the data matrix. An
effective and dependable way to define the optimal number of latent variables is cross-validation. The most appropriate number of latent variables to include in the calibration model was determined by computing the prediction error sum of squares (PRESS) for cross-validated models built with a large number of latent variables, so as to avoid overfitting. For the PLS model, Fig. 3a displays the RMSECV versus the number of LVs. As can be seen, RMSECV and RMSEP go through an obvious minimum at three LVs. To evaluate the proficiency of both the PLS and SVR models, the coefficients of correlation in calibration (R²cal), cross-validation (R²cv) and prediction (R²pred), together with the root mean square error of calibration (RMSEC) and the root mean square error of prediction (RMSEP), were applied:

$$\mathrm{RMSEP} = \sqrt{\frac{\sum_{i=1}^{n} \left(C_i - \hat{C}_i\right)^2}{n}}$$

where $C_i$ and $\hat{C}_i$ are the reference and predicted values for the ith observation, respectively, by the PLS model, and n is the number of validated objects. This value gives the average uncertainty that should be expected for predictions of future samples. When selecting the optimal calibration, the RMSEP value should be minimized. The calibration curve of the PLS model for MTBE in gasoline samples is shown in Fig. 3b. The assessment parameters obtained for the PLS model are outlined in Table 1.

As a learning algorithm, SVM offers good generalization power and focuses on the similarity between samples. It is a potent analysis instrument for problems with small sample numbers and high dimensionality. SVMs seek the best separating hyperplane for classification problems and the best regression hyperplane for regression problems, the latter being support vector regression (SVR). In SVR, the primary notion is to map the original data into a higher dimensional feature space by employing kernels and then to perform linear regression in that space [25,29,43]. Epsilon-SVR optimizes a model using the adjustable parameters epsilon (upper tolerance on prediction errors) and C (cost of prediction errors larger than epsilon). The SVR parameters (C, ε, and γ) were optimized by minimizing a cost function. For the SVR model, the optimal combination of ε and γ was found to be 0.01 and 0.05, respectively. The minimum value of the misclassification rate is marked by an "X" on the parameter search summary plot (see Fig. 3c); this indicates the values of the SVM cost and gamma parameters that yield the best performing model. Fig. 3d illustrates the predicted versus reference calibration curve obtained for MTBE in gasoline by the SVR regression model through MIR spectroscopy. As observed, in comparison to PLS, all the specimens are positioned around the diagonal line in the SVR model, which confirms better accuracy and predictivity for the SVR model. The summary of the statistical results for the SVR model is given in Table 2.

Hence, the SVR model is robust and may be applied to unseen samples with a certain accuracy. This superiority of the SVR model may be due to the inherent nonlinearity of the spectrum-property relationship in such systems, for example the nonlinearity associated with shifts in the positions of vibrational bands, which make the Beer-Lambert-Bouguer law inapplicable. Crude oil, black oil, oxygenate-gasoline fuel mixtures, and solutions of petroleum macromolecules are among these systems. The precision of a PLS model may be negatively influenced even by relatively weak van der Waals intermolecular forces like the ones found in gasoline, biodiesel, paraffin wax, or aromatic hydrocarbons. It is also worth noting that nonlinear relations can be modeled only in a limited way using the PLS approach, and even then only by including more latent variables. Accordingly, the findings of the current research indicate that GA can be a suitable method for attribute selection in spectral datasets. Based on the auto prediction results from Tables 1 and 2, compared to PLSR, the SVR model shows lower root mean square errors (RMSEC, RMSECV, and RMSEP), which indicates higher accuracy and a lower standard deviation. Consequently, employing the GA-SVR method results in a more accurate solution compared to GA-PLSR.

4. Conclusion

In conclusion, a fast and non-destructive method was offered in this research to determine the oxygenate content of gasoline samples. This approach is founded on FT-MIR spectroscopy coupled with multivariate regression. The GA method was applied to variable selection, and the PLS and SVR regression models were compared to construct the linear and nonlinear quantitative relationships for each dataset. A comparison of the statistical results of the two models, based on their correlation coefficients, RMSECV, and RMSEP, confirms the superiority of the GA-SVR model over the GA-PLSR model for predicting oxygenate compounds in gasoline samples. Our results showed that a concentration range of 0.06–3% w/w of oxygenates can be measured by FT-MIR spectroscopy with GA-SVR, compared with 0.2–3% w/w for GA-PLS. Hence, the approach to the reliable and fast determination of oxygenated compounds in gasoline by FT-MIR coupled with multivariate regression has a high potential for online utilization in petrochemical industrial units.

Acknowledgements

The second author of this paper gratefully acknowledges the support from the Department of Food Science and Agricultural Chemistry, Macdonald Campus, McGill University, and special thanks go to Dr. Ashraf Ismail for providing research facilities.

Declaration of Competing Interest

The authors declare no competing interests.

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Appendix A. Supplementary material

Supplementary data to this article can be found online at https://doi.org/10.1016/j.infrared.2019.103177.

References

[1] G. Mendes, H.G. Aleme, P.J.S. Barbeira, Determination of octane numbers in gasoline by distillation curves and partial least squares regression, Fuel 97 (2012) 131–136.
[2] L.S.M. Wiedemann, L.A. d'Avila, D.A. Azevedo, Adulteration detection of Brazilian gasoline samples by statistical analysis, Fuel 84 (2005) 467–473.
[3] E. Monroe, J. Gladden, K.O. Albrecht, J.T. Bays, R. McCormick, R.W. Davis, A. George, Discovery of novel octane hyperboosting phenomenon in prenol biofuel/gasoline blends, Fuel 239 (2019) 1143–1148.
[4] C.-S. Lim, J.-H. Lim, J.-S. Cha, J.-Y. Lim, Comparative effects of oxygenates-gasoline blended fuels on the exhaust emissions in gasoline-powered vehicles, J. Environ. Manage. 239 (2019) 103–113.
[5] T. Topgül, The effects of MTBE blends on engine performance and exhaust emissions in a spark ignition engine, Fuel. Process. Tech 138 (2015) 483–489.
[6] S.J. Choquette, S.N. Chesler, D.L. Duewer, S. Wang, T.C. O'Haver, Identification and quantitation of oxygenates in gasoline ampules using Fourier transform near-infrared and Fourier transform Raman spectroscopy, Anal. Chem. 68 (1996) 3525–3533.
[7] L. Zwank, T.C. Schmidt, S.B. Haderlein, M. Berg, Simultaneous determination of fuel oxygenates and BTEX using direct aqueous injection gas chromatography mass spectrometry (DAI-GC/MS), Environ. Sci. Technol. 36 (2002) 2054–2059.
[8] G.S. Frysinger, R.B. Gaines, Determination of Oxygenates in Gasoline by GC×GC, J. High Resolut. Chromatogr. 23 (2000) 197–201.
[9] L.C. Brazdil, Oxygenates in gasoline: a versatile experiment using gas chromatography, J. Chem. Educ. 73 (1996) 1056.
[10] L.M. Avila, A.P.F. dos Santos, D.I.M. de Mattos, C.G. de Souza, D.F. de Andrade, L.A. d'Avila, Determination of ethanol in gasoline by high-performance liquid chromatography, Fuel 212 (2018) 236–239.
[11] V.S. Pinto, F.F. Gambarra-Neto, I.S. Flores, M.R. Monteiro, L.M. Lião, Use of 1H
NMR and chemometrics to detect additives present in the Brazilian commercial gasoline, Fuel 182 (2016) 27–33.
[12] R. Meusinger, Gasoline analysis by 1H nuclear magnetic resonance spectroscopy, Fuel 75 (1996) 1235–1243.
[13] W.R. Kalsi, A.S. Sarpal, S.K. Jain, S.P. Srivastava, A.K. Bhatnagar, Determination of oxygenates in gasoline by 1H nuclear magnetic resonance spectroscopy, Energy Fuels 9 (1995) 574–579.
[14] A. Iob, R. Buenafe, N.M. Abbas, Determination of oxygenates in gasoline by FTIR, Fuel 77 (1998) 1861–1864.
[15] M. Khanmohammadi, A. Bagheri Garmarudi, M. de la Guardia, Feature selection strategies for quality screening of diesel samples by infrared spectrometry and linear discriminant analysis, Talanta 104 (2013) 128–134.
[16] N. Cavalcante da Silva, A.R. Caribé de Góes Massa, D. Domingos, J.M. Amigo, M. das Virgens Rebouças, C. Pasquini, M.F. Pimentel, NIR-based octane rating simulator for use in gasoline compounding processes, Fuel 243 (2019) 381–389.
[17] P. Noor, M. Khanmohammadi, B. Roozbehani, F. Yaripour, A. Bagheri Garmarudi, Determination of reaction parameters in methanol to gasoline (MTG) process using infrared spectroscopy and chemometrics, J. Clean Prod. 196 (2018) 1273–1281.
[18] T.Y. Inan, A. Al-Hajji, O.R. Koseoglu, Chemometrics-based analytical method using FTIR spectroscopic data to predict diesel and diesel/diesel blend properties, Energy Fuels 30 (2016) 5525–5536.
[19] M. Khanmohammadi, A.B. Garmarudi, A.B. Garmarudi, M. de la Guardia, Characterization of petroleum-based products by infrared spectroscopy and chemometrics, Trends Analyt. Chem. 35 (2012) 135–149.
[20] X. Zhang, B. Huang, Prediction of soil salinity with soil-reflected spectra: a comparison of two regression methods, Sci. Rep. 9 (2019) 5067.
[21] J. Hu, Z. Wang, Y. Wu, Y. Liu, J. Ouyang, Rapid determination of the texture properties of cooked cereals using near-infrared reflectance spectroscopy, Infrared Phys. Technol. 94 (2018) 165–172.
[22] M. Marchetti, V. Boucher, J. Dumoulin, M. Colomb, Retrieving visibility distance in fog combining infrared thermography, Principal Components Analysis and Partial Least-Square regression, Infrared Phys. Technol. 71 (2015) 289–297.
[23] P. Noor, M. Khanmohammadi, B. Roozbehani, A. Bagheri Garmarudi, Evaluation of ATR-FTIR spectrometry in the fingerprint region combined with chemometrics for simultaneous determination of benzene, toluene, and xylenes in complex hydrocarbon mixtures, Monatsh. Chem. 149 (2018) 1341–1347.
[24] M. Bassbasi, S. Platikanov, R. Tauler, A. Oussama, FTIR-ATR determination of solid non fat (SNF) in raw milk using PLS and SVM chemometric methods, Food Chem. 146 (2014) 250–254.
[25] R.M. Balabin, E.I. Lomakina, Support vector machine regression (SVR/LS-SVM) - an alternative to neural networks (ANN) for analytical chemistry? Comparison of nonlinear methods on near infrared (NIR) spectroscopy data, Analyst 136 (2011) 1703–1712.
[26] R.G. Brereton, Introduction to multivariate calibration in analytical chemistry, Analyst 125 (2000) 2125–2154.
[27] N. Hernández, I. Talavera, R.J. Biscay, D. Porro, M.M.C. Ferreira, Support vector regression for functional data in multivariate calibration problems, Anal. Chim. Acta 642 (2009) 110–116.
[28] A.S. Bangalore, R.E. Shaffer, G.W. Small, M.A. Arnold, Genetic algorithm-based method for selecting wavelengths and model size for use with partial least-squares regression: application to near-infrared spectroscopy, Anal. Chem. 68 (1996) 4200–4212.
[29] I. Barman, C.-R. Kong, N.C. Dingari, R.R. Dasari, M.S. Feld, Development of robust calibration models using support vector machines for spectroscopic monitoring of blood glucose, Anal. Chem. 82 (2010) 9719–9726.
[30] J.C.L. Alves, C.B. Henriques, R.J. Poppi, Determination of diesel quality parameters using support vector regression and near infrared spectroscopy for an in-line blending optimizer system, Fuel 97 (2012) 710–717.
[31] J.C.L. Alves, R.J. Poppi, Biodiesel content determination in diesel fuel blends using near infrared (NIR) spectroscopy and support vector machines (SVM), Talanta 104 (2013) 155–161.
[32] F. Zhang, J. Liu, J. Lin, Z. Wang, Detection of oil yield from oil shale based on near-infrared spectroscopy combined with wavelet transform and least squares support vector machines, Infrared Phys. Technol. 97 (2019) 224–228.
[33] J. Piri, S. Shamshirband, D. Petković, C.W. Tong, M.H.u. Rehman, Prediction of the solar radiation on the Earth using support vector regression technique, Infrared Phys. Technol. 68 (2015) 179–185.
[34] Y.-H. Yun, H.-D. Li, B.-C. Deng, D.-S. Cao, An overview of variable selection methods in multivariate analysis of near-infrared spectra, Trends Analyt. Chem. 113 (2019) 102–115.
[35] A. Niazi, A. Soufi, M. Mobarakabadi, Genetic algorithm applied to selection of wavelength in partial least squares for simultaneous spectrophotometric determination of nitrophenol isomers, Anal. Lett. 39 (2006) 2359–2372.
[36] N. Xin, X. Gu, H. Wu, Y. Hu, Z. Yang, Application of genetic algorithm-support vector regression (GA-SVR) for quantitative analysis of herbal medicines, J. Chemom. 26 (2012) 353–360.
[37] R. Hunter, H. Anis, Genetic support vector machines as powerful tools for the analysis of biomedical Raman spectra, J. Raman Spectrosc. 49 (2018) 1435–1444.
[38] J. Ghasemi, A. Niazi, R. Leardi, Genetic-algorithm-based wavelength selection in multicomponent spectrophotometric determination by PLS: application on copper and zinc mixture, Talanta 59 (2003) 311–317.
[39] D. Jouan-Rimbaud, D.-L. Massart, R. Leardi, O.E. De Noord, Genetic algorithms as a tool for wavelength selection in multivariate calibration, Anal. Chem. 67 (1995) 4295–4301.
[40] B. Üstün, W.J. Melssen, M. Oudenhuijzen, L.M.C. Buydens, Determination of optimal support vector regression parameters by genetic algorithms and simplex optimization, Anal. Chim. Acta 544 (2005) 292–305.
[41] R. Leardi, Genetic algorithms in chemometrics and chemistry: a review, J. Chemom. 15 (2001) 559–569.
[42] R. Leardi, Application of genetic algorithm–PLS for feature selection in spectral data sets, J. Chemom. 14 (2000) 643–655.
[43] R.G. Brereton, G.R. Lloyd, Support Vector Machines for classification and regression, Analyst 135 (2010) 230–267.