Linear Regression Models, Analysis, and Applications
No part of this digital document may be reproduced, stored in a retrieval system or transmitted in any form or
by any means. The publisher has taken reasonable care in the preparation of this digital document, but makes no
expressed or implied warranty of any kind and assumes no responsibility for any errors or omissions. No
liability is assumed for incidental or consequential damages in connection with or arising out of information
contained herein. This digital document is sold with the clear understanding that the publisher is not engaged in
rendering legal, medical or any other professional services.
MATHEMATICS RESEARCH
DEVELOPMENTS
ANALYTICAL CHEMISTRY
AND MICROCHEMISTRY
VERA L. BECK
EDITOR
Copyright © 2017 by Nova Science Publishers, Inc.
All rights reserved. No part of this book may be reproduced, stored in a retrieval system or transmitted
in any form or by any means: electronic, electrostatic, magnetic, tape, mechanical photocopying,
recording or otherwise without the written permission of the Publisher.
We have partnered with Copyright Clearance Center to make it easy for you to obtain permissions to
reuse content from this publication. Simply navigate to this publication’s page on Nova’s website and
locate the “Get Permission” button below the title description. This button is linked directly to the
title’s permission page on copyright.com. Alternatively, you can visit copyright.com and search by
title, ISBN, or ISSN.
For further questions about using the service on copyright.com, please contact:
Copyright Clearance Center
Phone: +1-(978) 750-8400 Fax: +1-(978) 750-4470 E-mail: [email protected].
Independent verification should be sought for any data, advice or recommendations contained in this
book. In addition, no responsibility is assumed by the publisher for any injury and/or damage to
persons or property arising from any methods, products, instructions, ideas or otherwise contained in
this publication.
This publication is designed to provide accurate and authoritative information with regard to the subject
matter covered herein. It is sold with the clear understanding that the Publisher is not engaged in
rendering legal or any other professional services. If legal or any other expert assistance is required, the
services of a competent person should be sought. FROM A DECLARATION OF PARTICIPANTS
JOINTLY ADOPTED BY A COMMITTEE OF THE AMERICAN BAR ASSOCIATION AND A
COMMITTEE OF PUBLISHERS.
Additional color graphics may be available in the e-book version of this book.
Preface
Chapter 1  Weighting and Transforming Data in Linear Regression
           Julia Martín, Alberto Romero Gracia and Agustín G. Asuero
Chapter 2  Regression through the Origin
           Julia Martín and Agustín G. Asuero
Chapter 3  Linear Regression for Interval-Valued Data in Kc (R)
           Yan Sun and Chunyang Li
Chapter 4  Linear Regression versus Non-Linear Regression in Mathematical
           Modeling of Adsorption Processes
           Gabriela-Nicoleta Moroi
Index
PREFACE
Chapter 1 - A number of issues are addressed concerning random error, noise
and variance modelling when precision varies as the values of x (e.g.,
concentration) increase. Data transformation and weighted least squares
regression are the two main solutions to the heteroscedasticity problem.
Non-linear terms may be introduced into the framework of linear regression by
transforming variables. Fitting is improved in this way, and necessary
assumptions of the least squares method, such as homoscedasticity (constant
variance), are thus satisfied. The following topics concerning
transformations are covered in this context: reasons for carrying them out,
simplification of relationships, model linearization, variance stabilization
and weighting of transformed data. The Box-Cox transformation also receives
special attention. Applications (weighting, transformation and the Box-Cox
method) from a variety of fields (analytical, biochemical, clinical,
environmental and pharmaceutical) are summarized in tabular form. The chapter
is based on two previous reviews published by the authors in Critical Reviews
in Analytical Chemistry (2007, 37(3), 143-172 and 2011, 41(1), 36-69).
Chapter 2 - Regression through the origin, a very interesting topic, has
received scarce attention in the literature. This model is also known as the
no-intercept model. It is applied when subject matter theory requires it, or
when other physical and material considerations must be taken into account.
An intensive bibliographical search has been carried out to gather the
literature on the subject, which is widely scattered. About one hundred and
thirty references have been compiled, comprising about twenty monographs and
fifty scientific journals from varying fields, e.g., analytical, biological,
clinical, chemometrical, educational, environmental, pharmaceutical,
physico-chemical, and statistical. The authors deal systematically with the
homoscedastic condition (variance of the y's independent of x), the case of
cumulative errors in the y's, the heteroscedastic case (variance or standard
deviation proportional to the x values), and orthogonal regression (error in
both axes). The chapter also covers topics such as prediction (using the
regression line in reverse), leverage, goodness of fit, comparison between
models with and without intercept, uncertainty, polynomial regression models
without intercept, and an overview of robust regression through the origin.
Chapter 1
ABSTRACT
* Corresponding Author address: Agustín G. Asuero, Department of Analytical
Chemistry, Faculty of Pharmacy, University of Seville, Seville, Spain.
INTRODUCTION
WEIGHTING DATA
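$$ w_f = \frac{\sigma_0^2}{\sigma_f^2} \qquad (1) $$

$$ \sigma_{\bar{y}_i}^2 = \frac{\sigma_0^2}{n_i} \qquad (2) $$

for the variance of $\bar{y}_i$, its weighting factor being, according to Eqn. (1),

$$ w_{\bar{y}_i} = \frac{\sigma_0^2}{\sigma_{\bar{y}_i}^2} = \frac{\sigma_0^2}{\sigma_0^2 / n_i} = n_i \qquad (3) $$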
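Equation: $\hat{y}_i = a_0 + a_1 x_i$
Mean responses: $\bar{y}_i = \sum_v y_{iv} / n_i$
Slope: $a_1 = S_{XY} / S_{XX}$
Intercept: $a_0 = \bar{y} - a_1 \bar{x}$
Weighted residuals: $w_{\bar{y}_i}^{1/2} (\bar{y}_i - \hat{y}_i)$
Residual sum of squares: $SSE = \sum w_{\bar{y}_i} (\bar{y}_i - \hat{y}_i)^2$
Correlation coefficient: $r = S_{XY} / \sqrt{S_{XX} S_{YY}}$
Mean: $\bar{x} = \sum w_{\bar{y}_i} x_i / \sum w_{\bar{y}_i}$; $\bar{y} = \sum w_{\bar{y}_i} \bar{y}_i / \sum w_{\bar{y}_i}$
Sum of squares about the mean:
  $S_{XX} = \sum w_{\bar{y}_i} (x_i - \bar{x})^2$
  $S_{YY} = \sum w_{\bar{y}_i} (\bar{y}_i - \bar{y})^2$
  $S_{XY} = \sum w_{\bar{y}_i} (x_i - \bar{x})(\bar{y}_i - \bar{y})$
Standard errors:
  $s_{y/x}^2 = \dfrac{SSE}{n-2} = \dfrac{S_{YY} - a_1^2 S_{XX}}{n-2}$
  $s_{a_0}^2 = s_{y/x}^2 \dfrac{\sum w_{\bar{y}_i} x_i^2}{S_{XX} \sum w_{\bar{y}_i}}$
  $s_{a_1}^2 = s_{y/x}^2 / S_{XX}$
  $\mathrm{cov}(a_0, a_1) = -\bar{x}\, s_{y/x}^2 / S_{XX}$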
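The weighted least-squares formulas listed above can be sketched in a few lines of Python. This is only an illustration: the function name `weighted_line_fit`, the dictionary keys and the calibration data are assumptions introduced here, and the weights are simply taken as given.

```python
import math

def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = a0 + a1*x.

    x : abscissas, y : (mean) responses, w : weighting factors.
    All names are illustrative; they do not come from the chapter.
    """
    n = len(x)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw      # weighted mean of x
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw      # weighted mean of y
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    a1 = sxy / sxx                                        # slope
    a0 = ybar - a1 * xbar                                 # intercept
    sse = sum(wi * (yi - a0 - a1 * xi) ** 2 for wi, xi, yi in zip(w, x, y))
    s2 = sse / (n - 2)                                    # regression variance
    return {
        "a0": a0, "a1": a1, "sse": sse, "sxx": sxx, "syy": syy,
        "r": sxy / math.sqrt(sxx * syy),
        "var_a0": s2 * sum(wi * xi ** 2 for wi, xi in zip(w, x)) / (sxx * sw),
        "var_a1": s2 / sxx,
        "cov_a0_a1": -xbar * s2 / sxx,
    }

# Hypothetical data; unit weights reduce the fit to ordinary least squares
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 8.0]
fit = weighted_line_fit(x, y, [1.0] * len(x))
print(fit["a0"], fit["a1"])
```

Multiplying all weights by a common constant leaves a0 and a1 unchanged, which is why only relative weights matter for the fitted line.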
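$$ w_{\bar{y}_i} = \frac{\sigma_0^2}{\sigma_0^2 / (n_i w_i)} = n_i w_i \qquad (4) $$

or

$$ \sigma_{\bar{y}_i}^2 = \frac{\sigma_0^2}{n_i w_i} \qquad (5) $$

$$ w_{\bar{y}_i} = \frac{\sigma_0^2}{\sigma_i^2 / n_i} = \frac{\sigma_0^2}{\sigma_i^2}\, n_i \qquad (6) $$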
(a) Absolute Weights. Equal weighting factors are assumed for all the
points; i.e., $w_i = 1$.
(b) Statistical Weights. Replication for each calibration data point is
required to estimate the reciprocal of the variance, which prevents its
application in routine practice (Mullins, 2003). For this reason, empirical
weights based on the x-variable (i.e., concentration) or the y-variable
(i.e., response) may be used as approximations, i.e., weights such as
$1/x^{0.5}$, $1/x$, $1/x^2$, $1/y^{0.5}$, $1/y^2$ (Almeida et al., 2003). In
those cases in which the variance of the residuals decreases with x, we may
also apply:
$$ w_i = \frac{1}{(x_{\max} - x_i)^2} \qquad (7) $$
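$$ w_i = \frac{1}{(\partial y / \partial z)^2} \qquad (8) $$

$$ \sigma_y^2 = \left( \frac{\partial y}{\partial z} \right)^2 \sigma_z^2 \qquad (9) $$

$$ w_y = w_z \left( \frac{\partial y}{\partial z} \right)^{-2} \qquad (10) $$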
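As a rough illustration of the propagation rule, the short Python sketch below computes the weights of a transformed variable y from the weights of the measured variable z and the derivative dy/dz. The function name `transformed_weights` and the example values are hypothetical, and the derivative is supplied analytically by the caller.

```python
def transformed_weights(z, w_z, dy_dz):
    """Weights of y = f(z): w_y = w_z / (dy/dz)**2 at each measured z.

    z     : measured values
    w_z   : weights of the measured values
    dy_dz : callable returning the derivative of the transformation
    (all names are illustrative assumptions)
    """
    return [wz / dy_dz(zi) ** 2 for zi, wz in zip(z, w_z)]

# Example: logarithmic transformation y = ln z with equal weights on z.
# Here dy/dz = 1/z, so the transformed weights are proportional to z**2.
z = [0.5, 1.0, 2.0, 4.0]
w_y = transformed_weights(z, [1.0] * len(z), lambda v: 1.0 / v)
print(w_y)
```

The example shows why an unweighted fit of ln z implicitly trusts small values as much as large ones, even when the original measurements were equally precise.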
yi
Assumption of constant percentage error | $1/y_i^2$ | Anderson and Snow, 1967; Smith and Mathews, 1967
Instrumental weights | $1/s_i^2$ | Jurs, 1986
Transformation-dependent weights | $1/(\partial y/\partial z)^2$ | de Levie, 1986; Meites, 1979
Mixed instrumental transformation-dependent weights | $1/[s_z^2 (\partial y/\partial z)^2]$ | de Levie, 1986; Meites, 1979
* $s_i^2$ is the estimate of $\sigma_i^2$
Rothman et al., 1975; Ingle, 1974; Pardue et al., 1974; Ingle and Crouch,
1972; Winefordner et al., 1970). The precision of intensity measurements
in spectrochemical analysis (Klockenkämper and Bubert, 1986; Bubert and
Klockenkämper, 1983) can be affected by three kinds of noise, namely,
shot noise, flicker noise and detector noise. The rate and amount of ions
reaching the detector are the origin of the shot noise, which follows
Poisson statistics. The process of nebulization, as well as fluctuations
related to the source, is the origin of flicker noise, which is proportional
to the signal magnitude. The detector and electronics give rise to the dark
count (detector) noise. The total error is thus given by
$$ s_t = \left( s_{shot}^2 + s_{flic}^2 + s_{det}^2 \right)^{1/2} \qquad (11) $$
$$ \log \mathrm{Var}(Y_{ij}) = \log \sigma_0^2 + 2\theta \log \mu_i \qquad (12) $$

The estimated weights would then be $\bar{y}^{-2\hat{\theta}}$.
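A minimal sketch of this kind of variance function estimation is given below. It assumes a power-of-the-mean model, Var = sigma0^2 * mean^(2*theta), which is an assumption chosen to be consistent with weights of the form mean^(-2*theta); the function name `fit_power_variance` and the replicate summary data are hypothetical.

```python
import math

def fit_power_variance(means, variances):
    """Fit ln(s_i^2) = ln(sigma0^2) + 2*theta*ln(mean_i) by ordinary least squares.

    means, variances : replicate means and variances at each level
    Returns theta, sigma0^2 and the estimated weights mean_i**(-2*theta).
    """
    lx = [math.log(m) for m in means]
    ly = [math.log(v) for v in variances]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    theta = slope / 2.0                       # slope of the log-log plot is 2*theta
    sigma0_sq = math.exp(my - slope * mx)     # intercept is ln(sigma0^2)
    weights = [m ** (-2.0 * theta) for m in means]
    return theta, sigma0_sq, weights

# Hypothetical replicate summary: variance grows as the square of the mean
means = [1.0, 2.0, 4.0, 8.0]
variances = [0.01 * m ** 2 for m in means]
theta, sigma0_sq, weights = fit_power_variance(means, variances)
print(theta, sigma0_sq, weights)
```

With real data the sample variances are themselves noisy, so the fitted exponent should be treated as an estimate rather than a fixed constant.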
In fact, variance function estimation is challenging. Outliers strongly
affect the estimation of variance (Baumann and Wätzig, 1995). Models
based on the addition of variances from independent sources are closer to
physical reality than those based on the addition of standard
deviations.
x2 A(x 1)b
The linear model, II, is the simplest. When the analytical errors
stem from two independent terms, the most satisfactory option is to
combine variances; models V and VI then describe the variation of
precision with concentration more correctly. The standard deviation usually
increases with concentration, whereas the relative standard deviation
(coefficient of variation) remains constant or decreases slightly. Some
empirical models, such as III and VIII, in which the standard deviation is
modelled as a function of concentration, have found use in radioligand
assays and other general situations.
The choice of weighting is an open subject: there is no universal
solution to this problem, and the choice is often subjective and somewhat
arbitrary (Modamio et al., 1996).
APPLICATIONS
Content | Reference
Theory of chromatographic detection and modern approaches to data acquisition and processing is given in the context of the calibration problem | Asnin, 2016
Characterizing nonconstant instrumental variance in emerging miniaturized analytical techniques | Noblitt et al., 2016
Simultaneous determination of 40 novel psychoactive stimulants in urine by liquid chromatography–high resolution mass spectrometry and library matching | Concheiro et al., 2015
Practical guidelines for reporting results in single- and multi-component analytical calibration | Olivieri, 2015
Method validation using weighted linear regression models for quantification of UV filters in water samples | Pereira da Silva et al., 2015
Using Least Squares for Error Propagation. Practical examples. | Tellinghuisen, 2015
Analysis and interpretation of enzyme kinetic data | Cornish-Bowden, 2014
Selecting the correct weighting factors for linear and quadratic calibration curves with least-squares regression algorithm in bioanalytical LC-MS/MS assays and impacts of using incorrect weighting factors on curve stability, data quality, and assay performance | Gu et al., 2014
Impact of calibrator concentrations and their distribution on accuracy of quadratic regression for liquid chromatography–mass spectrometry bioanalysis | Tan et al., 2014
Reducing the number of signals needed to perform LW calibrations by developing models of weighing factors robust to daily variations of instrument sensibility: Application to the identification of explosives by ion chromatography | Brasil et al., 2013
Comparative study of some robust statistical methods: weighted, parametric, and nonparametric linear regression of HPLC convoluted peak responses using internal standard method in drug bioavailability studies | Korany et al., 2013
The quality coefficient as performance assessment parameter of straight line calibration curves in relationship with the number of calibration points. | de Beer et al., 2012
The approaches for estimation of limit of detection for ICP-MS trace analysis of arsenic | Rajakovic et al., 2012
A comparison in the evaluation of measurement uncertainty in analytical chemistry testing between the use of quality control data and a regression analysis | Sousa et al., 2012
Application of a special in-house validation procedure for environmental–analytical schemes including a comparison of functions for modelling the repeatability standard deviation | Brüggemann and Wennrich, 2011
Overall calibration procedure via a statistically based matrix-comprehensive approach in the stir bar sorptive extraction–thermal desorption–gas chromatography–mass spectrometry analysis of pesticide residues in fruit-based soft drinks | Lavagnini et al., 2011
Using R² to compare least square fit models: when it must fail. | Tellinghuisen and Bolster, 2011
Comparison of three weighting schemes in weighted regression analysis for use in a chemistry laboratory | Jain, 2010
Method validation for the endocrine disruptors and pesticides in water by gas chromatography–tandem mass spectrometry using weighted linear regression schemes | Mansilha et al., 2010
Table 6. (Continued)
Calibration in atomic spectrometry: A tutorial review dealing with quality criteria, weighting procedures and possible curvatures | Mermet, 2010
Comparison between ordinary least squares regression and weighted regression in the calibration of metals present in human milk | Nascimento et al., 2010
Cochran’s test optimized “G test”: Expressions are derived to calculate upper limit as well as lower limit critical values for data sets of equal and unequal size at any significance level. | ’t Lam, 2010
Least-squares analysis of data with uncertainty in x and y: A Monte Carlo methods comparison | Tellinghuisen, 2010a
Least-Squares Analysis of Phosphorus Soil Sorption Data with Weighting from Variance Function Estimation: A Statistical Case for the Freundlich Isotherm | Tellinghuisen, 2010b
Least-squares analysis of phosphorus soil sorption data with weighting from variance function estimation | Tellinghuisen and Bolster, 2010
The guiding role of the assumptions for least-squares regression in practical problem solving: Calibration of ¹⁰⁹Cd KXRF systems | Brito et al., 2009
Weighted least-squares regression with different weighting functions: Calibration of ¹⁰⁹Cd KXRF systems | Brito and Chettle, 2009
Verifying if alternative approaches are available for getting acceptably approximate estimates of the limit of detection | Desimoni and Brunetti, 2009
Least squares in calibration: weights, nonlinearity, and other nuisances | Tellinghuisen, 2009a
The least-squares analysis of data from binding and enzyme kinetics studies: weights, bias, and confidence intervals in usual and unusual situations | Tellinghuisen, 2009b
Weighting Formulas for the Least-Squares Analysis of Binding Phenomena Data | Tellinghuisen, 2009c
Variance function estimation by replicate analysis and generalized least squares: A Monte Carlo comparison | Tellinghuisen, 2009d
Weighting formulas for the least-squares analysis of binding phenomena data | Tellinghuisen and Bolster, 2009
Analysis of Flavonoids in Oxytropis kansuensis Bunge by RP-LC–DAD with Weighted Least-Squares Linear Regression | Li et al., 2008
Least squares with non-normal data: estimating experimental variance functions | Tellinghuisen, 2008a
The problem with using “quality coefficients” to select weighting formulas | Tellinghuisen, 2008b
Least-squares variance component estimation. Various examples are given to illustrate the theory | Teunissen and Amiri-Simkooei, 2008
Weighted least squares in calibration: Estimating data variance functions in high-performance liquid chromatography | Zeng et al., 2008
A statistical overview on univariate calibration, inverse regression, and detection limits: application to gas chromatography/mass spectrometry technique | Lavagnini and Magno, 2007
A general approach to heteroscedastic linear regression: The methodology is applied to a number of simulated and real examples | Leslie et al., 2007
Weighted least-squares in calibration: The distinction between a priori and a posteriori parameter standard errors is emphasized | Tellinghuisen, 2007
Why are we weighting? Recommendations | Thompson, 2007
Reviews calibration-, uncertainty-, and recovery-related documents from 10 consensus-based organizations | Vanatta and Coleman, 2007
Determination of lanthanides in international geochemical reference materials by reversed-phase high-performance liquid chromatography using error propagation theory to estimate total analysis uncertainties | Santoyo et al., 2006
Understanding Least Squares through Monte Carlo Calculations | Tellinghuisen, 2005b
x
y (13)
x
Y X (14)
E + S ↔ ES ↔ E + P    (15)

$$ v = \frac{V_{\max}\, C}{C + K_m} \qquad (16) $$
where $K_m$ and $V_{\max}$ are the Michaelis parameters. $V_{\max}$ is the
reaction rate when the enzyme is completely saturated with the substrate and
the reaction proceeds at the maximum possible rate, and $K_m$ is the
substrate concentration at half the maximum rate. The Michaelis-Menten
equation can be rearranged to produce different linear forms:
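Lineweaver-Burk (LB); plotting 1/v versus 1/C:

$$ \frac{1}{v} = \frac{1}{V_{\max}} + \frac{K_m}{C\, V_{\max}} \qquad (17) $$

Hanes (H); plotting C/v versus C:

$$ \frac{C}{v} = \frac{C}{V_{\max}} + \frac{K_m}{V_{\max}} \qquad (18) $$

Eadie-Hofstee; plotting v versus v/C:

$$ v = V_{\max} - \frac{v K_m}{C} \qquad (19) $$

$$ w_i(LB) = \frac{1}{\left( \partial (1/v) / \partial v \right)^2} = v^4, \qquad w_i(H) = \frac{1}{\left( \partial (C/v) / \partial v \right)^2} = \frac{v^4}{C^2} \qquad (20a, b) $$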
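To make the role of the weights concrete, the following Python sketch fits the Lineweaver-Burk form by weighted least squares with weights v^4 and converts the intercept and slope back to Vmax and Km. The function name `lineweaver_burk_fit` and the noise-free data (generated with Vmax = 10 and Km = 2) are assumptions made only for illustration.

```python
def lineweaver_burk_fit(conc, rate):
    """Weighted straight-line fit of 1/v against 1/C with weights w_i = v_i**4.

    conc : substrate concentrations C
    rate : measured reaction rates v
    Returns (Vmax, Km) recovered from intercept = 1/Vmax and slope = Km/Vmax.
    """
    X = [1.0 / c for c in conc]
    Y = [1.0 / v for v in rate]
    w = [v ** 4 for v in rate]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, X)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, Y)) / sw
    slope = (sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, X, Y))
             / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, X)))
    intercept = ybar - slope * xbar
    v_max = 1.0 / intercept
    k_m = slope * v_max
    return v_max, k_m

# Hypothetical noise-free data generated with Vmax = 10 and Km = 2
conc = [0.5, 1.0, 2.0, 5.0, 10.0]
rate = [10.0 * c / (c + 2.0) for c in conc]
v_max, k_m = lineweaver_burk_fit(conc, rate)
print(v_max, k_m)
```

With noisy rates, the v^4 weights down-weight the low-rate points, whose reciprocals are the least precise; an unweighted reciprocal plot gives those points the largest influence instead.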
TRANSFORMING DATA
It may seem that the best way of calculating the coefficients of a
non-linear equation is the direct application of a non-linear regression
(NLR) program. However, NLR is not free from problems (Mager, 1991): i)
depending on the structure of the data and the starting values, one may
obtain different final solutions; ii) discrimination between rival models
is difficult; iii) NLR is relatively sensitive to deviations from
homoscedasticity; iv) substantial multicollinearity may appear, leading to
non-robust estimates. Asymptotic NLR estimates may also cause trouble
(Mager, 1991) because of the small number of observations in real
experiments.
Some advantages are derived from the application of mathematical
transformations to experimental data. Transformation may be successfully
applied to reach homoscedasticity (stabilize variances), to achieve
(approximate) normality, or to test in an approximate way the type of model
(Meloun and Militký, 2011; Meloun, 1992; Draper and Smith, 1998;
Weisberg, 2005).
Graphical or numerical examination of data (Lavagnini and Magno,
2007; Barnet, 2004) may be carried out in order to check (separately or
jointly) key assumptions such as linearity of the relationship, error
independence, constancy of the residual variance, normally distributed
errors, and the absence of outliers (Weisberg, 2005; Belloto and
Sokolovski, 1985). Informal plots may clearly reveal the need for a given
transformation such as ln x or 1/y, with a more formal analysis held in
reserve as a check (Draper and Smith, 1998). The log rule and the range rule
are two often-helpful empirical rules (Weisberg, 2005). The log rule applies
when the variable is strictly positive and ranges over more than one order
of magnitude. If the range is less than one order of magnitude, any
transformation is of little use. The greater the quotient ymax/ymin, the
greater the effect of the transformation considered. The more λ differs
from unity, the greater the effect of a power transformation of the kind
Y = y^λ (Box and Draper, 1987).
Logarithms and exponentials are involved in the most common
transformations (EPA, 2000; Daniel and Wood, 1999; Tomassone et al.,
“ 1
y 2
No Shape Change | --- | y | 1
Negative Skew | Mild | y² | 2
“ | | y³ | 3
“ | Stronger | exp y |
additive and normal disturbance terms for model functions are not valid for
the transformed data. Non-linear regression on the original data, or
weighted least squares on the transformed data, is then required. Fitting
the transformed model then provides some initial estimates (Meloun et al.,
1992; Mager, 1991).
Non-constant variance is related to non-normally distributed data
(Canavos, 1984; Rios, 1977), data transformation being the most appropriate
means of dealing with such situations (Asuero and Martín Bueno, 2011).
Variance heterogeneity usually appears when the errors corresponding to
some treatments are significantly higher (or lower) than others, given the
nature of the experiments. In a normal distribution, the variance
$\sigma_y^2$ and the mean $\mu_y$ are independent; a direct relationship
between the mean and the variance is typical of all other common
distributions. Theoretical considerations and/or a preliminary empirical
analysis may suggest the nature of the dependence between the variance and
the mean value (Box and Draper, 1987). If the functional relationship is
known, a transformation exists that makes the variance (approximately)
constant (Draper and Smith, 1998). With certain kinds of data,
heterogeneous (non-uniform) variance and non-normality are to be expected
from the outset. The same experimental situations that lead to non-normal
distributions usually also produce heterogeneous variances, since
$\sigma_y = f(\mu_y)$ in most non-normal distributions (Brownlee, 1984;
Natrella, 1963).
Table 11 summarizes a number of transformations (some from the
power family) used to correct for heterogeneity of variance and to
approximate normality. Note that, in stabilizing the variance, the
transformed variable usually also becomes more nearly normal (Gaussian).
that better meet the normality criteria. The most appropriate transformation
can also be calculated empirically (Box and Cox, 1964).
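Poisson (Count)*, $y \geq 0$ | $\bar{y}$ | $\sqrt{y}$
Small counts | ($\bar{y}$)** | $\sqrt{y + 1}$ or $\sqrt{y} + \sqrt{y + 1}$
Binomial ($0 \leq y \leq 1$) | $\bar{y}(1 - \bar{y})$ | $\arcsin \sqrt{y}$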
Negative 1 y
3
binomial
1 y 2 y
1 y 1 y
1
2
2
3 3
0 y 1 y
Variance = (mean)², $y > 0$ | $\bar{y}^2$ | $\ln y$
Correlation coefficient, $-1 \leq y \leq 1$ | $(1 - \bar{y}^2)^2$ | $0.5\,[\ln(1 + y) - \ln(1 - y)]$
* Modifications for the Poisson and binomial cases have been suggested by Freeman and Tukey
(105).
** It should be noted that the square root transformation overcorrects when very small values and
zero appear in the original data. In these cases, $\sqrt{y + 1}$ is often used as a transformation.
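$$ T = \begin{cases} y^{\lambda}, & \text{for } \lambda \neq 0 \\ \ln y, & \text{for } \lambda = 0 \end{cases} \qquad (21) $$

$$ W = \begin{cases} (y^{\lambda} - 1)/\lambda, & \text{for } \lambda \neq 0 \\ \ln y, & \text{for } \lambda = 0 \end{cases} \qquad (22) $$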
(23)
(24)
(25)
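$$ J(\lambda, y) = \prod_{i=1}^{k} \left| \frac{dy_i^{(\lambda)}}{dy_i} \right| = \prod_{i=1}^{k} y_i^{\lambda - 1} \qquad (26) $$

$$ y^{(\lambda)} = \begin{cases} \dfrac{(y + \lambda_2)^{\lambda_1} - 1}{\lambda_1} & \text{if } \lambda_1 \neq 0 \\ \log(y + \lambda_2) & \text{if } \lambda_1 = 0 \end{cases} \qquad (27) $$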
λ | T = y^λ | W = (y^λ − 1)/λ
1 | y | y − 1
0.5 | √y | 2(√y − 1)
0 | 1(?) | ln y
-0.5 | 1/√y | 2(1 − 1/√y)
-1 | 1/y | 1 − 1/y
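The one-parameter transformation W and the Jacobian term can be combined into a simple grid search for λ, sketched below in Python. This is a minimal single-sample illustration (no regressors), assuming the usual normal-theory profile log-likelihood; the function names, the grid and the data are hypothetical.

```python
import math

def box_cox(y, lam):
    """W = (y**lam - 1)/lam for lam != 0, and ln y for lam == 0 (y must be positive)."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]

def profile_log_likelihood(y, lam):
    """Profile log-likelihood (up to an additive constant) for a single sample."""
    w = box_cox(y, lam)
    n = len(w)
    mean = sum(w) / n
    var = sum((v - mean) ** 2 for v in w) / n             # ML variance of W
    log_jacobian = (lam - 1.0) * sum(math.log(v) for v in y)
    return -0.5 * n * math.log(var) + log_jacobian

def best_lambda(y, grid):
    """Return the grid value of lambda with the largest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_log_likelihood(y, lam))

# Hypothetical data: exponentials of evenly spaced values, so ln y is symmetric
y = [math.exp(-2.0 + 0.1 * i) for i in range(41)]
grid = [round(-2.0 + 0.25 * i, 2) for i in range(17)]     # -2.0, -1.75, ..., 2.0
lam_hat = best_lambda(y, grid)
print(lam_hat)
```

In a regression setting the variance term would be replaced by the residual variance of the fitted model for the transformed response; the Jacobian term stays the same.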
APPLICATIONS
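$$ \log K = \frac{\Delta S}{4.576} - \frac{\Delta H}{4.576} \frac{1}{T} \qquad (28) $$

$$ Y = a_0 + a_1 x \qquad (29) $$

$$ Y = \log K = a_0 + a_1 x = \frac{\Delta S}{4.576} - \frac{\Delta H}{4.576} \frac{1}{T} \qquad (30) $$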
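The linearized van't Hoff relation can be fitted with an ordinary straight line of log K against 1/T, as in the Python sketch below. The factor 4.576 is 2.303 R with R in cal/(mol K), so ΔH is obtained in cal/mol and ΔS in cal/(mol K). The function name `vant_hoff_fit` and the synthetic data (generated with ΔH = -5000 and ΔS = -10) are assumptions for illustration only.

```python
import math

def vant_hoff_fit(temps_k, k_values):
    """Unweighted fit of Y = log10(K) against x = 1/T.

    Returns (a0, a1, delta_h, delta_s) with
    delta_s = 4.576 * a0 and delta_h = -4.576 * a1.
    """
    x = [1.0 / t for t in temps_k]
    y = [math.log10(k) for k in k_values]
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    a1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    a0 = ybar - a1 * xbar
    return a0, a1, -4.576 * a1, 4.576 * a0

# Hypothetical equilibrium constants generated from delta_h = -5000, delta_s = -10
temps_k = [283.15, 293.15, 303.15, 313.15]
k_values = [10 ** (-10.0 / 4.576 + 5000.0 / (4.576 * t)) for t in temps_k]
a0, a1, delta_h, delta_s = vant_hoff_fit(temps_k, k_values)
print(delta_h, delta_s)
```

Because log K is a transformed response, a weighted fit with transformation-dependent weights may be preferable when the precision of K is known.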
Content | Reference
A bilogarithmic hyperbolic cosine method for the evaluation of overlapping formation constants at varying (or fixed) ionic strength | Beaumount et al., 2016
Evaluation of three isotherm models (Langmuir, Freundlich, and Dubinin-Radushkevich) to correlate four sets of experimental adsorption isotherm data, which were obtained by batch tests in lab | Chen, 2015
Kinetics of Carbaryl Hydrolysis: An Undergraduate Environmental Chemistry Laboratory | Hawker, 2015
Feasibility study of potentiometric multisensor system of 18 ion-selective and cross-sensitive sensors as an analytical tool for determination of urine ionic composition | Yaroshenko et al., 2015
A novel multiple headspace extraction gas chromatographic method for measuring the diffusion coefficient of methanol in water and in olive oil | Zhang and Chai, 2015
Adsorption Kinetics and Isotherms: A Safe, Simple, and Inexpensive Experiment for Three Levels of Students | Piergiovanni, 2014
Evaluation of Equilibrium Sorption Isotherm Equations: Datasets from the literature are selected and three two-parameter and three-parameter equations were used to evaluate adsorption systems | Chen, 2013
Statistical Analysis of Linear and Non-linear Regression for the Estimation of Adsorption Isotherm Parameters | Osmari et al., 2013
Equilibrium sorption of the phosphoric acid modified rice husk: Langmuir, Freundlich, Temkin and Dubinin–Radushkevich Isotherms Studies | Dada et al., 2012
Application of the van’t Hoff dependences in the characterization of molecularly imprinted polymers for some phenolic acids: Evaluation of the temperature effect on the sorption processes investigated analytes in methanol and acetonitrile (porogen) as mobile phases | Denderz and Lehotay, 2012
An alternative analytical method for measuring the kinetic parameters of the enzymes invertase and lactase | Heinzerling et al., 2012
Study of the pattern of formation of absorption signals for high concentrations of analyte atoms in the absorption volume and to employ the findings for High-resolution continuum source electrothermal atomic absorption spectrometry data quantification within a broad concentration range of the analyte | Katskov et al., 2012
A comprehensive treatment of experimental enzyme kinetics strongly coupled to electronic data acquisition and use of spreadsheets to organize data and perform linear and nonlinear least-squares analyses | Barton, 2011
Chemical Dosing and First-Order Kinetics: Examples of multiple-dose problems are presented that are appropriate for students taking introductory, general, and physical chemistry courses | Hladky, 2011
On the use of linearized pseudo-second-order kinetic equations for modeling adsorption systems | El-Khaiary et al., 2010
Insights into the modeling of adsorption isotherm systems: accuracy and consistency in parameters prediction or estimation | Foo and Hameed, 2010
Introduce and compare numerical approaches that involve different levels of knowledge about the noise structure of the analytical method used for initial and equilibrium concentration determination | Markovic et al., 2010
Polydimethylsiloxane-based permeation passive air sampler. Part II: Effect of temperature and humidity on the calibration constants | Seethapathy and Górecki, 2010
A simple competitive enzyme-linked immunosorbent assay (cELISA) was established for rapid measurement of secretory immunoglobulin A (sIgA) in saliva | Wang et al., 2010
An equation relating the absorbance of the solute to the acidity constants (pKa1 and pKa2) and pH is derived for weak diprotic acids (diprotic bases and zwitterions) | Asuero, 2009
Weighting Formulas for the Least-Squares Analysis of Binding Phenomena Data | Tellinghuisen, 2009c
A comprehensive study on the possibility of applying the nth-degree polynomial logistic regression model for fitting the kinetic conversion data of cellulose pyrolysis | Cai et al., 2008
The Hill equation: a review of its capabilities in pharmacological modelling | Goutelle et al., 2008
Least-squares regression of adsorption equilibrium data: comparing the options | El-Khaiary, 2008
Evaluation of logistic and polynomial models for calibration curves spanning the quantitative concentration range for seven different protein assays based on examination of residuals | Herman et al., 2008
Methods for studying reaction kinetics in gas chromatography, exemplified by using the 1-chloro-2,2-dimethylaziridine interconversion reaction | Krupčík et al., 2008
A bilogarithmic hyperbolic cosine method for the spectrophotometric evaluation of stability constants of 1:1 weak complexes is developed and applied to data found in the literature | Boccio et al., 2007
Examination of the limitations of using linearized Langmuir equations by fitting P sorption data collected on eight different soils with four linearized versions of the Langmuir equation and comparing goodness-of-fit measures and fitted parameter values with those obtained with the nonlinear Langmuir equation. | Bolster and Hornberger, 2007
A review of existence criteria for parameter estimation of the Michaelis–Menten regression model | Jukic et al., 2007
Highlights some common errors of data evaluation that are frequently found in the literature | Badertscher and Pretsch, 2006
The general equation resulting from the logistic transformation is discussed considering the stoichiometric factors for monovalent anions, and the linearization of the theoretical fit to experimental data was checked for two real cases | Capitán-Vallvey et al., 2006
Alternative method to the Arrhenius equation for thermogravimetric analysis based on a logistic mixture model | Naya et al., 2006
Log-log transformation without weighting is the simplest model to fit the calibration data for the determination of piperaquine (PC) in urine | Singtoroj et al., 2006
A bilogarithmic hyperbolic cosine method for the spectrophotometric evaluation of stability constants of 1:1 weak complexes from continuous variation data is devised and applied to literature data | Sayago and Asuero, 2006
Content | Reference
Estimating Box-Cox power transformation parameter via goodness of fit tests. An artificial covariate method is also included for comparative purposes | Asar et al., 2017
Two strategies are proposed to extend and unify residual error modeling: a dynamic transform-both-sides approach combined with a power error model capable of handling skewed and/or heteroscedastic residuals, and a t-distributed residual error model allowing for symmetric heavy tails | Dosne et al., 2016
Models with Transformed Variables. Interpretation and Software | Boef et al., 2015
Overview of state-of-the-art dose-response analysis, both in terms of general concepts that have evolved and matured over the years and by means of concrete examples | Ritz et al., 2015
Optimization of sonochemical degradation of tetracycline in aqueous solution using a central composite design | Safari et al., 2015
New methodology for estimating λ and an alternative method of determining plausible values for it | Vélez et al., 2015
Experimental design and multiple response optimization. Using the desirability function in analytical methods development | Candioti et al., 2014
Design-based development of a stability-indicating RP-HPLC method for the simultaneous determination of parabens in pharmaceutical formulation | Roy and Chakrabarty, 2014
Occurrence of pharmaceuticals in urban wastewater of north Indian cities and risk assessment | Singh et al., 2014
Statistical Evaluation and Validation of Quantitative Methods of Drug Analysis | Komsta, 2013
A calibration-free/minimum approach, iterative optimization technology, which is used to predict (without calibration standards) the composition of a mixture while maintaining a similar predictability to calibration standard models | Muteki et al., 2013
Gaussian Quadrature is an efficient method for the back-transformation in estimating the usual intake distribution when assessing dietary exposure | Dekkers and Slob, 2012
CALUX measurements: Statistical inferences for the dose–response curve. Use of linear calibration functions based on Box–Cox transformations to overcome the issue of uncertainty assessment | Elskens et al., 2011
Statistical Data Analysis in data transformation: A practical guide | Meloun and Militký, 2011
A general equation is presented for modeling retention, using the organic modifier content of the mobile phase. The equation is based on the Box-Cox transform of modifier concentration. | Komsta, 2010
Overview of traditional normalizing transformations and how Box-Cox incorporates, extends, and improves on these traditional approaches to normalizing data. Examples of applications are presented, and details of how to automate and use this technique are included | Osborne, 2010
Least-Squares Analysis of Phosphorus Soil Sorption Data with Weighting from Variance Function Estimation: A Statistical Case for the Freundlich Isotherm | Tellinghuisen, 2010
Evaluation of the environmental contamination at an abandoned mining site using multivariate statistical techniques. The Box–Cox transformation has been used to transform the data set in normal form in order to minimize the non-normal distribution of the geochemical data | Bagur et al., 2009
A method for identifying relevant proteins from SIMCA discriminating powers is proposed, based on the Box-Cox transformation coupled to probability papers | Marengo et al., 2008
Application of differential permeation and Box–Cox transformation in the analysis of di-n-octyl disulfide in a straight oil metalworking fluid | Xu and Que Hee, 2006
The Box-Cox transformation applied to soil data improves sample symmetry and stabilizes spread; the logarithmic plot of a profile likelihood function enables the optimum transformation parameter to be found | Meloun et al., 2005
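\[ B_{exp} = \frac{1}{C}\left[\ln \bar{s}^2 \sum_{i=1}^{k} f_i - \sum_{i=1}^{k} \left(f_i \ln s_i^2\right)\right] \tag{31} \]

\[ C = 1 + \frac{1}{3(k-1)}\left[\sum \frac{1}{f_i} - \frac{1}{\sum f_i}\right] \tag{32} \]

If the variances are equal, the magnitude $B_{exp}$ obeys the
Chi-square distribution ($\chi^2$) with k-1 degrees of freedom, provided
that each $f_i > 2$.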
s2y c
s
2
yi
i
wi (33)
ni s2y
i
where c (= $s_{PE}^2$) is an arbitrary constant that does not influence the
final results for a0 and a1, sa0 and sa1 (although it does affect the
magnitudes of sy/x and cov(a0, a1)). If we follow the criterion of
Spiridonov and Lopatkin (1973) and make the sum of the weights equal to 1,
we have
1 1 1 1
w 1 c c wy f
yi
s 2y s2y i
s 2y
1 w2y
i i
i
s2y i
fi
i i
(34)
N                        7           17          12          12          13
Mean                     0.5181      0.4826      0.4775      0.4574      0.4451
SD                       4.003E-03   3.794E-03   3.214E-03   7.912E-03   6.647E-03
Variance s2              1.603E-05   1.439E-05   1.033E-05   6.260E-05   4.419E-05
Degrees of freedom (f)   6           16          11          11          12
Sum f                    56
f * s2                   9.616E-05   2.303E-04   1.136E-04   6.886E-04   5.303E-04
s2 (mean)                2.962E-05
ln (s2 mean) * sum f     -583.9083
ln s2                    -11.0413    -11.1489    -11.4806    -9.6787     -10.0271
f * ln s2                -66.2475    -178.3821   -126.2870   -106.4655   -120.3247
Sum (f * ln s2)          -597.7068
1/f                      0.1667      0.0625      0.0909      0.0909      0.0833
Sum (1/f)                0.4943
B                        13.7985     B = ln s2(mean) * sum fi - sum (fi * ln si2)
C                        1.0397      C = 1 + [1/(3(k-1))] * [sum (1/f) - 1/(sum f)]
B/C                      13.272      Chi2(0.05, 4) = 9.488
B/C = 13.272 > 9.488: The hypothesis of homogeneity of s2 cannot be accepted at the 5%
level. With a significance level of 1% [Chi2(0.01,4) = 13.277]: The hypothesis cannot
be rejected with certainty.
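The arithmetic of the table above can be retraced with a short script. The following is a minimal sketch, assuming only the tabulated variances and degrees of freedom; the function name `bartlett_statistic` is a hypothetical helper, and the critical value 9.488 is copied from the table rather than computed.

```python
import math

def bartlett_statistic(variances, dof):
    """Return (B, C, B/C) following Eqns. (31) and (32)."""
    k = len(variances)
    f_total = sum(dof)
    # Pooled (mean) variance: sum(f_i * s_i^2) / sum(f_i)
    pooled = sum(f * s2 for f, s2 in zip(dof, variances)) / f_total
    b = math.log(pooled) * f_total - sum(
        f * math.log(s2) for f, s2 in zip(dof, variances)
    )
    c = 1.0 + (sum(1.0 / f for f in dof) - 1.0 / f_total) / (3.0 * (k - 1))
    return b, c, b / c

# Values taken from the table above (rounded as printed there)
variances = [1.603e-05, 1.439e-05, 1.033e-05, 6.260e-05, 4.419e-05]
dof = [6, 16, 11, 11, 12]
CHI2_CRIT_5PCT = 9.488  # Chi2(0.05, 4), as quoted in the table

b, c, b_exp = bartlett_statistic(variances, dof)
print(f"B = {b:.3f}, C = {c:.4f}, B/C = {b_exp:.3f}")
print("homogeneity rejected at 5%:", b_exp > CHI2_CRIT_5PCT)
```

Because the inputs are the rounded table entries, the last digits may differ slightly from the tabulated B and B/C.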
40 Julia Martín, Alberto Romero Gracia and Agustín G. Asuero
\[ F_{exp} = \frac{s_{LOF}^2}{s_{PE}^2} = \frac{1.2604 \times 10^{-6}}{3.0621 \times 10^{-7}} = 4.12; \quad F_{0.05}(3,42) = 2.83; \quad F_{0.01}(3,42) = 4.29 \tag{35} \]
Figure 4. Weighted residuals versus inverse of the temperature; log Keq=f(1/T ºK).
Y = A + B·X (39)
However, the residuals and the line resulting from the least-squares (Y
= A + BX) model fitted to the data can be combined in the same plot for
checking purposes. The results (Figure 6) lead to a correlation coefficient
of 0.99998876. This almost perfect fit is in fact very poor if attention is
paid to the pattern of residuals [+ + - - - - + + + + + + - -]. Systematic
deviations can indicate either a systematic error in the experiment (which
cannot be tested since the details of the measurements are not known) or,
as it turns out in this case, the use of an incorrect or inadequate model. The
Clausius-Clapeyron equation does not exactly represent the vapor
pressure data over a wide temperature range. Results similar to those
\[ w_i = \frac{1}{\left(\dfrac{\partial Y_i}{\partial y_i}\right)^2} = \frac{1}{\left(\dfrac{\partial \ln P_i}{\partial P_i}\right)^2} = P_i^2 \tag{40} \]
Table 18. CO2 vapor pressure data versus temperature using the Clausius-
Clapeyron equation (lnP = A + B/T) (Noggle, 1993).
The error lies not in the data, but in the model. We must try to improve
the latter. A more general form of the equation is:
ln P = A + B/T + C ln T + D T (41)
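The diagnosis that motivates Eqn. (41), namely the non-random sign pattern of the residuals of the two-parameter fit, can be made quantitative with a simple runs count. The sketch below uses only the sign pattern quoted in the text; the helper names are hypothetical, and the expected number of runs is the standard Wald-Wolfowitz expression for a random sequence, which is not part of the original discussion.

```python
def count_runs(signs):
    """Count maximal blocks of identical consecutive symbols."""
    runs = 1
    for prev, cur in zip(signs, signs[1:]):
        if cur != prev:
            runs += 1
    return runs

def expected_runs(n_pos, n_neg):
    """Expected number of runs if the signs were randomly ordered."""
    return 1.0 + 2.0 * n_pos * n_neg / (n_pos + n_neg)

# Residual sign pattern quoted for the lnP = A + B/T fit
pattern = "++----++++++--"
n_pos = pattern.count("+")
n_neg = pattern.count("-")
observed = count_runs(pattern)
expected = expected_runs(n_pos, n_neg)
print(observed, round(expected, 2))  # 4 observed runs against about 7.86 expected
```

Far fewer runs than expected point to systematic rather than random deviations, in line with the conclusion that the model, not the data, is at fault.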
Table 19. CO2 vapor pressure data versus temperature using equation
(lnP = A + B/T + C· lnT + D·T) (Noggle, 1993)
CONCLUSION
REFERENCES
Cai, J., Liu, R., Sun, C., (2008). Logistic Regression Model for
Isoconversional Kinetic Analysis of Cellulose Pyrolysis. Energy Fuels
22, 867-870.
Canavos, G.C., (1984). Applied Probability and Statistical Methods.
Toronto, Canada: Little, Brown and Company.
Candioti, L.V., De Zan, M.M., Cámara, M.S., Goicoechea, H.C., (2014).
Experimental design and multiple response optimization. Using the
desirability function in analytical methods development. Talanta 124,
123-138.
Capitán-Vallvey, L.F., Arroyo-Guerrero, E., Fernández-Ramos, M.D.,
Cuadros-Rodríguez L., (2006). Logit linearization of analytical
response curves in optical disposable sensors based on coextraction for
monovalent anions. Anal. Chim. Acta 561, 156-163.
Carroll, R.J., Ruppert, D., (1988). Transformation and Weighting in
Regression. London, England: Chapman & Hall.
Chang, H.S., (1977). A computer program for Box-Cox transformation and
estimation technique. Econometrica 45(7), 1741.
Chen, C., (2013). Evaluation of Equilibrium Sorption Isotherm Equations.
Open Chem. Eng. J. 7, 24-44.
Chen, X., (2015). Modeling of Experimental Adsorption Isotherm Data.
Information 6, 14-22.
Chinn, S., (1996). Choosing a transformation. J. Appl. Stat. 23(4), 395-404.
Chow S.-C., Liu, J.-P., (1995). In Statistical Design and Analysis in
Pharmaceutical Sciences. New York, USA: Marcel Dekker.
Concheiro, M., Castaneto, M., Kronstrand, R., Huestis, M.A., (2015).
Simultaneous determination of 40 novel psychoactive stimulants in
urine by liquid chromatography-high resolution mass spectrometry and
library matching. J. Chromatogr. A 1397, 32-42.
Connors, K.A., (1987). Binding Constants, the Measurement of Molecular
Complex Stability. New York, USA: Wiley, 115.
Cornish-Bowden, A., (2014). Analysis and interpretation of enzyme kinetic
data. Perspect. Sci. 1, 121-125.
Crabbe, M.J.C., (1982). An enzyme-kinetic program for desk-top
computers. Comput. Biol. Med. 12 (4), 263-283.
Dada, A.O., Olalekan, A.P., Olatunya, A.M., Dada, O., (2012). Langmuir,
Freundlich, Temkin and Dubinin–Radushkevich Isotherms Studies of
Equilibrium Sorption of Zn2+ Unto Phosphoric Acid Modified Rice
Husk. IOSR J. Appl. Chem. 3, 38-45.
Daniel, C., Wood, F.S., (1980). Fitting Equations to Data: Computer
Analysis of Multifactor Data. 2nd ed., New York, USA: Wiley.
Danzer, K., Currie, L.A., (1998). IUPAC Guidelines for calibration in
analytical chemistry. Part 1. Fundamentals and single component
calibration. Pure Appl. Chem. 70, 993-1014.
Davidian, M., Haaland, P.D., (1990). Regression and calibration with non
constant error variance. Chemometr. Intell. Lab. Systems, 9, 231-248.
de Beer, J.O., Naert, C., Deconinck, E., (2012). The quality coefficient as
performance assessment parameter of straight line calibration curves in
relationship with the number of calibration points. Accred. Qual.
Assur. 17 (3), 265-274.
de Brito J.A.A., Chettle, D.R., (2009). Calibration of 109Cd KXRF
systems for in vivo bone lead measurements: weighted least-squares
regression with different weighting functions. Phys. Med. Biol. 54,
L45-L50.
de Brito, J.A.A., de Carvalho, M.L., Chettle, D.R., (2009). Calibration of
109Cd KXRF systems for in vivo bone lead measurements: the
guiding role of the assumptions for least-squares regression in practical
problem solving. Phys. Med. Biol. 54, 919-934.
De Galan L., van Dalen, H.P.J., Kornblum, G.R., (1985). Determination of
strongly curved calibration graphs in flame atomic absorption
spectrometry: comparison of manually drawn and computer calculated
graphs. Analyst 110, 323-329.
de Levie, R., (1986). When, why, and how to use weighted least squares. J.
Chem. Educ. 63, 10-15.
de Levie, R., (2000). Curve fitting least squares. Crit. Rev. Anal. Chem. 30,
59-74.
de Levie, R., (2001). How to Use Excel in Analytical Chemistry and in
General Scientific Data Analysis. Cambridge, England: Cambridge
University Press.
Gu, H., Liu, G., Wang, J., Aubry, A., Arnold, M.E., (2014). Selecting the
correct weighting factors for linear and quadratic calibration curves
with least-squares regression algorithm in bioanalytical LC-MS/MS
assays and impacts of using incorrect weighting factors on curve
stability, data quality, and assay performance. Anal. Chem. 86 (18),
8959-8966.
Hawker, D., (2015). Kinetics of Carbaryl Hydrolysis: An Undergraduate
Environmental Chemistry Laboratory. J. Chem. Educ. 92, 1531-1535.
Heinzerling, P., Schrader, F., Schanze, S., (2012). Measurement of
Enzyme Kinetics by Use of a Blood Glucometer: Hydrolysis of
Sucrose and Lactose. J. Chem. Educ. 89, 1582-1586.
Herman, R.A., Scherer, P.N., Shan, G., (2008). Evaluation of logistic and
polynomial models for fitting sandwich-ELISA calibration curves. J.
Immunolog. Methods 339, 245-258.
Heydorn, K., Anglow, T., (2002). Calibration uncertainty. Accred. Qual.
Assur. 7, 153-158.
Hladky, P.W., (2011). Chemical Dosing and First-Order Kinetics. J. Chem.
Educ. 88, 776-781.
Howarth R., Thompson, M., (1976). Duplicate analysis in geochemical
practice. Part 2. Examination of the proposed method and examples of
its use. Analyst 101, 699-709.
Hoyle, M.H., (1973). Transformations—An introduction and a
bibliography. Int. Stat. Rev. 41(2), 203-223.
Huang, C.-L., Moon, L.C., Chang, H.S., (1978). A computer program
using the Box-Cox transformation technique for the specification of
functional form. Am. Stat. 32(4), 144.
Hughes, H., Hurley, P.W., (1987). Precision and accuracy of test methods
and the concept of K-factor in chemical analysis. Analyst 112, 1445-
1449.
Hwang, L.-J., (1994). Impact of variance function estimation in regression
and calibration. Methods Enzymol. 240, 150- 170.
Ingle, Jr. J.D., Crouch, S.R., (1972). Evaluation of precision of quantitative
absorption spectrometric measurements. Anal. Chem. 44, 1375-1386.
Lavagnini, I., Urbani, A., Magno, F., (2011). Overall calibration procedure
via a statistically based matrix-comprehensive approach in the stir bar
sorptive extraction–thermal desorption–gas chromatography–mass
spectrometry analysis of pesticide residues in fruit-based soft drinks.
Talanta 83, 1754-1762.
Lee, J.C., Chen, D-T., Hung, H-N., Chen, J.J., (1999). Analysis of drug
dissolution data. Stat. Med. 18(7), 799-814.
Lee, J.-C., Ramsey, M.H., (2001). Modeling measurement uncertainty as a
function of concentration: an example from a contaminated land
investigation. Analyst 126, 1784-1791.
Leslie, D.S., Kohn, R., Nott, D.J., (2007). A general approach to
heteroscedastic linear regression. Stat. Comp. 17, 131-146.
Li, B.B., Moor, B., (2002). The general Box-Cox transformation in
multiple regression analysis. Commun. Stat. Simul. Comput. 31(4),
673-687.
Li, C., Liu, J., Di, D., Jiang, S., (2008). Analysis of Three Flavonoids in
Oxytropis kansuensis Bunge by RP-LC–DAD Coupled with Weighted
Least-Squares Linear Regression. Chromatographia 68, 773-779.
Logothetis, N., (1990). Box-Cox transformations and the Taguchi methods.
Appl. Stat. 39(1), 31-48.
Mager, P.P., (1991). Design Statistics in Pharmacochemistry. New York,
USA: Wiley, pp. 20-44.
Malaeb, Z.A., (1997). A SAS code to correct for non-normality and
nonconstant variance in regression and ANOVA models using the
Box-Cox method of power transformation. Environ. Monit. Assess.
47(3), 255-273.
Mansilha, C., Melo, A., Rebelo, H., Ferreira, I.M.P.L.V.O., Pinho, O.,
Domingues, V., Pinho, C., Gameiro, P., (2010). Quantification of
endocrine disruptors and pesticides in water by gas chromatography-
tandem mass spectrometry. Method validation using weighted linear
regression schemes. J. Chromatogr. A 1217 (43), 6681- 6691.
Marengo, E., Robotti, E., Bobba, M., Righetti, P.G., (2008). Evaluation of
the Variables Characterized by Significant Discriminating Power in the
Statistical Case for the Freundlich Isotherm. Environ. Sci. Technol. 44,
5029-5034.
Tellinghuisen, J., Bolster, C.H., (2011). Using R2 to compare least square
fit models: when it must fail. Chemometr. Intell. Lab. Systems 105 (2),
220-222.
Tellinghuisen, J., (2015). Using Least Squares for Error Propagation. J.
Chem. Educ. 92, 864-870.
Teunissen, P.J.G., Amiri-Simkooei, A.R., (2008). Least-squares variance
component estimation. J. Geodesy, 82, 65-82.
Thompson, M., Howarth, R.J., (1973). Rapid estimation and control of
precision by duplicate determinations. Analyst 98, 153-160.
Thompson, M., (1976). Duplicate analysis in geochemical practice. Part 1.
Theoretical approach and estimation of analytical reproducibility.
Analyst 101, 690-698.
Thompson, M., (1978). Dupan 3, a subroutine for the interpretation of
analytical data in geochemical analysis. Comp. Geosci. 4, 333-340.
Thompson, M., Howarth, R., (1978). New approach to estimation of
analytical precision. J. Geochem. Explor. 9, 23-30.
Thompson, M., (1982). Regression methods in the comparison of accuracy.
Analyst 107, 1169-1180.
Thompson, M., (1988). Variation of precision with concentration in an
analytical system. Analyst 112, 1579-1587.
Thompson, M., (2007) Why are we weighting? Anal. Methods Committee.
AMC technical brief No 27.
Tomassone, R., Lesquoy, E., Miller, C., (1983). La Régression, nouveaux
regards sur une ancienne méthode statistique. Paris, France: Masson,
15, 38.
Tukey, J.W., (1977). Exploratory Data Analysis. Reading, MA: Addison-
Wesley.
van Loco, J., Hanot, V., Huysmans, G., Elskens, M., Degroodt, J.M.,
Beemaert, H., (2003). Estimation of the minimum detectable value for
the determination of PCBs in fatty food samples by GC-ECD: A
curvilinear calibration case. Anal. Chim. Acta 483(1-2), 413-418.
Chapter 2
ABSTRACT
*
Corresponding Author address: Agustín G. Asuero, Department of Analytical Chemistry,
Faculty of Pharmacy, University of Seville, Seville, Spain.
70 Julia Martín and Agustín G. Asuero
regression (error in both axes). The chapter also covers topics such as
prediction (using the regression line in reverse), leverage, goodness of fit,
comparison between models with and without intercept, uncertainty,
polynomial regression models without intercept, and an overview of
robust regression through the origin.
INTRODUCTION
Regression and related fitting methods have found wide use (Finney,
1996; Deming, 1968; Howard, 2001) in the natural and social
sciences. Though linear least squares regression is probably the most
widely used statistical modeling method, linear regression through the
origin, in spite of its importance, has not received much attention. There
are occasions when it appears appropriate for a regression line (Bissell,
1992; Brownlee, 1984; Freund et al., 2006; Myers, 1986; Noggle, 1993;
Ryan, 2008) to pass through the origin, i.e., for the true relation line to be
\[ \eta = \beta_1 \xi \tag{1} \]
\[ Y_i = \beta_1 x_i + \varepsilon_i \tag{2} \]

where ($Y_i$, $x_i$) is the ith pair of associated $x_i$ and $Y_i$ values, i is the
index of the data point, $\beta_1$ is the model parameter to be estimated, and
$\varepsilon_i$ is the error associated with the measurement $Y_i$. This model is also
called the no-intercept model (Chatterjee et al., 2012; Afifi and Azen, 1972).
When it is known in advance that the intercept term is zero, then one has to
impose this on the model (Rousseeuw, 2001). Regression through the origin is
Regression through the Origin 71
\[ Y = \beta_0 + \beta_1 x \tag{3} \]
An example may be the regression of dose against area under the curve
(AUC) in pharmacokinetic studies (Bonate, 2011).
The error term in Eqn (2) is assumed to be normally distributed with
mean zero and unknown variance

\[ \sigma_i^2 = \frac{\sigma^2}{w_i} \tag{4} \]

where $\sigma$ is a constant (which may be absorbed into the unknown $\sigma_i$) and
the weighting factors $w_i$ are known for all i, being inversely proportional
to the variances.
The aim of this contribution is to offer a primer on regression
through the origin to analytical chemists and other researchers
interested in this subject. A number of applications have been compiled in
tabular form in this respect. Figure 1 shows the number of papers
published per year. Some fifty journals are cited from the fields of
analytical and physical chemistry, chemometrics, clinical chemistry,
ecology, educational chemistry, environmental chemistry, industrial
hygiene, pharmacy, biology, and statistics. The authors apologize for any
papers they may have overlooked or inadvertently omitted. The most cited
journals are shown in Figure 2.
Figure 1. Number of publications cited per year.
\[ \hat{y}_i = b_1 x_i \tag{6} \]

\[ b_1 = \frac{\sum w_i x_i y_i}{\sum w_i x_i^2} \tag{7} \]

\[ w_i = 1 \tag{9} \]

and then

\[ b_1 = \frac{\sum x_i y_i}{\sum x_i^2} \tag{10} \]
Variance proportional to x
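and

\[ b_1 = \frac{\sum y_i}{\sum x_i} = \frac{\bar{y}}{\bar{x}} \tag{12} \]

\[ w_i = \frac{1}{x_i^2} \tag{13} \]

which leads to the slope value

\[ b_1 = \frac{\sum (y_i / x_i)}{n} \tag{14} \]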
Aston (1959) and Barlow (1989) follow a notation slightly different from
the one shown here.
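The three slope estimators above are all special cases of Eqn. (7) with different weights. A minimal sketch follows, assuming hypothetical calibration data (the x and y values are invented for illustration) and taking the weight for the "variance proportional to x" case as 1/x, which is what Eqn. (12) implies although that weight is not written out above.

```python
def slope_through_origin(x, y, w=None):
    """Weighted least squares slope for y = b1*x, Eqn. (7)."""
    if w is None:
        w = [1.0] * len(x)  # unweighted case, Eqn. (9)
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b_const_var = slope_through_origin(x, y)                              # Eqn. (10)
b_var_prop_x = slope_through_origin(x, y, [1.0 / xi for xi in x])     # Eqn. (12)
b_var_prop_x2 = slope_through_origin(x, y, [1.0 / xi**2 for xi in x])  # Eqn. (14)

print(b_const_var, b_var_prop_x, b_var_prop_x2)
```

With weights 1/x the estimator collapses to the ratio of the means, and with weights 1/x^2 to the mean of the individual ratios y/x, so the choice of weighting scheme determines which of these familiar estimators is appropriate.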
By assuming that the values of x are free from error, applying the
random error propagation law (Asuero et al., 1988) we get for the
estimated variance of the slope
\[ s_{b_1}^2 = \frac{s_{y/x}^2}{\sum w_i x_i^2} \tag{15} \]

The larger the term $\sum w_i x_i^2$, the greater the precision of the slope.
Note in addition that large values of x contribute substantially to this
sum; increasing the number of such values also increases the denominator
sum in Eqn. (15).
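For the estimate of variance provided by the weighted residuals we get

\[ s_{y/x}^2 = \frac{\sum w_i (y_i - \hat{y}_i)^2}{n-1} = \frac{\sum w_i y_i^2 - b_1^2 \sum w_i x_i^2}{n-1} = \frac{\sum w_i y_i^2 - \dfrac{\left(\sum w_i x_i y_i\right)^2}{\sum w_i x_i^2}}{n-1} = \frac{\sum w_i y_i^2 - b_1 \sum w_i x_i y_i}{n-1} \tag{16} \]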
In straight line regression through the origin $s_{y/x}^2$ has n-1 degrees of
freedom (since only one parameter is estimated), not n-2 as is the case for a
model with intercept. The rightmost expression in Eqn. (16) is the most
convenient way (Green and Margerison, 1977) for computation purposes.
From Eqn. (6) we have for the variance of a point on the true
regression line

\[ s_{\hat{y}_i}^2 = x_i^2 s_{b_1}^2 = \frac{x_i^2 s_{y/x}^2}{\sum w_i x_i^2} \tag{18} \]

\[ b_1 x_0 - t_{n-1,\alpha/2} \frac{x_0 s_{y/x}}{\sqrt{\sum w_i x_i^2}} \le \beta_1 x_0 \le b_1 x_0 + t_{n-1,\alpha/2} \frac{x_0 s_{y/x}}{\sqrt{\sum w_i x_i^2}} \tag{19} \]
The confidence band for the entire regression line is the region
between two straight lines passing through the origin (Figure 3), whereas
in a model with intercept the limits are curved (hyperbolic) lines. Thus,
the interval becomes wider as we move away from the origin.
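The quantities entering Eqns. (15), (16) and (19) can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the data are hypothetical, the function names are not from the original, and the Student t value (2.776 for 4 degrees of freedom at the two-sided 95% level) is taken from tables and passed in by hand rather than computed.

```python
import math

def fit_through_origin(x, y, w=None):
    """Return slope, residual variance and slope standard deviation."""
    n = len(x)
    if w is None:
        w = [1.0] * n
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    swyy = sum(wi * yi * yi for wi, yi in zip(w, y))
    b1 = swxy / swxx                       # Eqn. (7)
    s2_yx = (swyy - b1 * swxy) / (n - 1)   # rightmost form of Eqn. (16)
    s_b1 = math.sqrt(s2_yx / swxx)         # Eqn. (15)
    return b1, s2_yx, s_b1

def mean_response_limits(x0, b1, s_b1, t_crit):
    """Confidence limits for the line at x0, Eqn. (19)."""
    half_width = t_crit * abs(x0) * s_b1
    return b1 * x0 - half_width, b1 * x0 + half_width

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRIT = 2.776  # t(0.025, n-1 = 4), from tables

b1, s2_yx, s_b1 = fit_through_origin(x, y)
lo2, hi2 = mean_response_limits(2.0, b1, s_b1, T_CRIT)
lo4, hi4 = mean_response_limits(4.0, b1, s_b1, T_CRIT)
print((lo2, hi2), (lo4, hi4))
```

The half-width is proportional to x0, so doubling x0 doubles the width of the interval, which is the straight-line band described above.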
Figure 3. The solid line is the least squares line of slope b1 passing through the origin
and the set of points ($x_i$, $\hat{\eta}_i$). The co-ordinates of the point P are, therefore,
($x_0$, $b_1 x_0$). The dotted lines show the least squares estimates displaced vertically by ±
one standard error, s($\hat{\eta}$). The diagram shows clearly that the uncertainty associated
with a least squares estimate of $\eta$ increases rapidly with increasing displacement of x0
from the origin (Green and Margerison, 1977).
Hahn (1977) and Natrella (1963) deal with the confidence intervals to
contain either b1 or the true average response for a given x value, as well as
the prediction interval containing a future response at a given x value.
Hedayat (1970) and Hedayat et al. (1977) propose a test for detecting a
monotonic relationship between the mean and variance. Iwase (1989)
studies the case in which the y values follow an inverse Gaussian
distribution with a constant and unknown coefficient of variation.
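\[ x_0 = \frac{\bar{y}_0}{b_1} \tag{20} \]

the true value of x, for which the m observations y0 were made, with
(Bennett and Franklin, 1954; Lark et al., 1968), for the unweighted case
(wi=1)

\[ s_{x_0}^2 = \left(\frac{\partial x_0}{\partial \bar{y}_0}\right)^2 s_{\bar{y}_0}^2 + \left(\frac{\partial x_0}{\partial b_1}\right)^2 s_{b_1}^2 = \frac{s_{y/x}^2}{b_1^2}\left(\frac{1}{m} + \frac{x_0^2}{\sum x_i^2}\right) \tag{21} \]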
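\[ \frac{z^2}{s_z^2} = \frac{(\bar{y}_0 - b_1 x)^2}{s_z^2} = \frac{(\bar{y}_0 - b_1 x)^2}{s_{y/x}^2\left(\dfrac{1}{m} + \dfrac{x^2}{\sum x_i^2}\right)} = t_{n-1,\alpha/2}^2 \tag{22} \]

\[ \left(b_1^2 - \frac{t_{n-1,\alpha/2}^2 s_{y/x}^2}{\sum x_i^2}\right) x^2 - 2 b_1 \bar{y}_0 x + \left(\bar{y}_0^2 - \frac{t_{n-1,\alpha/2}^2 s_{y/x}^2}{m}\right) = 0 \tag{23} \]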
whose roots give the values of the lower and upper confidence limits
of x, xL and xU, respectively. The difference between the upper and lower
limits gives the confidence interval.
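As an illustration of how the limits follow from the quadratic in Eqn. (23), the sketch below solves it for its two roots. The function name and the numerical inputs are hypothetical; the Student t value is assumed to be supplied from tables, and the guard on the leading coefficient reflects the fact that finite limits only exist when the slope is well determined.

```python
import math

def inverse_prediction_limits(y0_mean, m, b1, s2_yx, sum_x2, t_crit):
    """Roots xL, xU of the quadratic in Eqn. (23), unweighted case."""
    g = t_crit**2 * s2_yx
    a = b1**2 - g / sum_x2        # coefficient of x^2
    b = -2.0 * b1 * y0_mean       # coefficient of x
    c = y0_mean**2 - g / m        # constant term
    disc = b * b - 4.0 * a * c
    if a <= 0 or disc < 0:
        raise ValueError("slope too poorly determined for finite limits")
    root = math.sqrt(disc)
    return (-b - root) / (2.0 * a), (-b + root) / (2.0 * a)

# Hypothetical inputs, for illustration only
b1, s2_yx, sum_x2 = 2.0, 0.01, 55.0
y0_mean, m, t_crit = 6.0, 3, 2.776

x_low, x_high = inverse_prediction_limits(y0_mean, m, b1, s2_yx, sum_x2, t_crit)
print(x_low, y0_mean / b1, x_high)
```

The point estimate from Eqn. (20) lies between the two roots, but in general not exactly midway, so the interval is not symmetric about it.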
We may use a pooled variance (Cox, 1971; Seber and Lee, 2003)
instead of $s_{y/x}^2$

\[ s_p^2 = \frac{\sum_{i=1}^{n} (y_i - b_1 x_i)^2 + \sum_{j=1}^{m} (y_{0j} - \bar{y}_0)^2}{n + m - 2} \tag{24} \]

with n+m-2 degrees of freedom for the Student t in this case.
GOODNESS OF FIT
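\[ \sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2 + 2 \sum (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \tag{26} \]

\[ R^2 = \frac{SSR}{SST_m} = 1 - \frac{SSE}{SST_m} \tag{28} \]

and then

\[ \sum y_i^2 = \sum (y_i - \hat{y}_i)^2 + \sum \hat{y}_i^2 + 2 \sum \hat{y}_i (y_i - \hat{y}_i) \tag{30} \]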
Taking into account that the cross product in Eqn. (30) is equal to zero,
the (redefined) total sum of squares is decomposed now as
\[ R^2 = 1 - \frac{SSE}{SST} \tag{32} \]
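The difference between the two definitions of R2 can be seen numerically. The following sketch assumes hypothetical data and takes the redefined total sum of squares in Eqn. (32) to be the sum of squares about the origin, as the decomposition in Eqn. (30) suggests; the function and variable names are illustrative only.

```python
def r2_through_origin(x, y):
    """Return slope, R2 about the origin and R2 about the mean."""
    n = len(x)
    b1 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    fitted = [b1 * xi for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ssr_origin = sum(fi * fi for fi in fitted)
    sst_origin = sum(yi * yi for yi in y)           # total about the origin
    y_mean = sum(y) / n
    sst_mean = sum((yi - y_mean) ** 2 for yi in y)  # total about the mean
    r2_origin = 1.0 - sse / sst_origin
    r2_mean = 1.0 - sse / sst_mean
    return b1, r2_origin, r2_mean, sse, ssr_origin, sst_origin

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b1, r2_origin, r2_mean, sse, ssr_origin, sst_origin = r2_through_origin(x, y)
print(r2_origin, r2_mean)
```

Since the sum of squares about the origin is never smaller than the sum about the mean, the origin-based R2 is never smaller than the mean-based one for the same fit, so the two values should not be compared across models with and without intercept.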
Source of variation    Sum of squares                             Degrees of freedom    E[M.S.]**
Model (Due to line)    $(\sum w_i x_i y_i)^2 / \sum w_i x_i^2$    1                     $\sigma^2 + \beta_1^2 \sum w_i x_i^2$
Residual               $\sum w_i (y_i - \hat{y}_i)^2$             n - 1                 $\sigma^2$
Total about origin     $\sum w_i y_i^2$                           n
* Adapted from Brownlee (1984); ** M.S. = mean squares.
\[ b_1 = \frac{S_{XY}}{S_{XX}} \tag{33} \]

\[ b_0 = \bar{y}_w - b_1 \bar{x}_w \tag{34} \]
where SXX and SYY are the sum of the squares of deviations from the mean
for the two variables x and y, respectively, and SXY is the corresponding
sum of the cross products (Asuero and Gonzalez, 2006; Asuero and
Gonzalez, 1989; Martin and Asuero, 2017; Sayago and Asuero, 2004)
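\[ S_{XY} = \sum w_i x_i y_i - \frac{\sum w_i x_i \sum w_i y_i}{\sum w_i} \tag{35} \]

SXX and SYY may be easily derived from Eqn. (35) by substituting yi by
xi, and xi by yi, respectively. The weighted (sample) mean values of x and y
are given by

\[ \bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} \tag{36} \]

and

\[ \bar{y}_w = \frac{\sum w_i y_i}{\sum w_i} \tag{37} \]

\[ b_1 = \frac{S_{XY}}{S_{XX}} = \frac{\sum w_i x_i y_i - \dfrac{\sum w_i x_i \sum w_i y_i}{\sum w_i}}{\sum w_i x_i^2 - \dfrac{\left(\sum w_i x_i\right)^2}{\sum w_i}} = \frac{\sum w_i x_i y_i - \bar{x}_w \bar{y}_w \sum w_i}{\sum w_i x_i^2 - \bar{x}_w^2 \sum w_i} \tag{38} \]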
Thus, the line will not in general pass through the centroid so that Eqn.
(7) and (38) are not equivalent.
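The statement about the centroid can be checked numerically. The sketch below, with hypothetical data and weights and an illustrative function name, fits both models and evaluates each line at the weighted centroid.

```python
def weighted_fits(x, y, w):
    """Fit y = b0 + b1*x (Eqns. 33-38) and y = b1*x (Eqn. 7) with weights w."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    x_w, y_w = swx / sw, swy / sw              # Eqns. (36), (37)
    s_xy = swxy - swx * swy / sw               # Eqn. (35)
    s_xx = swxx - swx**2 / sw
    b1 = s_xy / s_xx                           # Eqn. (38)
    b0 = y_w - b1 * x_w                        # Eqn. (34)
    b1_origin = swxy / swxx                    # Eqn. (7)
    return b0, b1, b1_origin, x_w, y_w

# Hypothetical data with a non-zero intercept, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1]
w = [1.0 / xi**2 for xi in x]

b0, b1, b1_origin, x_w, y_w = weighted_fits(x, y, w)
print("with intercept at centroid:", b0 + b1 * x_w, "centroid y:", y_w)
print("through origin at centroid:", b1_origin * x_w)
```

The line with intercept reproduces the weighted mean of y at the weighted mean of x by construction, whereas the line through the origin generally does not.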
In what follows from this section we use unweighted regression (wi=1).
Casella (1983) has shown that a new point (xn+1, yn+1) may be added to the
previous full n set, forcing the straight line to pass now through the origin.
Model based on Eqn. (2) is then applied, the slope being given by Eqn. (7).
The new point added satisfies the identity
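\[ (x_{n+1}, y_{n+1}) = (n^* \bar{x}, n^* \bar{y}) \tag{39} \]

where

\[ n^* = \frac{n}{\sqrt{n+1} - 1} \tag{40} \]

The coordinates of the new point (its position with respect to the
others) determine its leverage, that is, the amount of influence that it has on
each fitted value

\[ h_{n+1} = \frac{1}{n+1}\left(1 + \frac{n^2 \bar{x}^2}{\sum_{i=1}^{n} x_i^2}\right) = \frac{1}{n+1}\left[1 + n\left(1 + \frac{S_{XX}}{n \bar{x}^2}\right)^{-1}\right] \tag{41} \]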
Thus, the impact of the new (augmented) n+1 data point increases with
\[ r_{n+1}^* = \frac{b_0}{s_{y/x}\sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{XX}}}} \tag{42} \]

where $s_{y/x}^2$ is the residual mean square from the full fit on the original n
data points. Note that $r_{n+1}^*$ is identical to the t statistic that tests
H0: $\beta_0$ = 0. It can be shown that

\[ \left(r_{n+1}^*\right)^2 = (n-1)\frac{s_{0,y/x}^2}{s_{y/x}^2} - (n-2) \tag{43} \]

where $(s_{0,y/x})^2$ is the estimated residual mean square from the regression
through the origin and $(s_{y/x})^2$, as before, the estimated residual mean
square of the full regression. The $(r_{n+1}^*)^2$ statistic is an exact measure of
the relationship between the residual variances. The original paper of
Casella (1983) should be consulted for additional details not included here.
For the leverage of models with intercept refer to Meloun et al. (1994)
and Meloun and Militký (2012) for details. An excellent introduction to
the topic is found in Sheater (2006).
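The identity between the t statistic for the intercept, Eqn. (42), and the ratio of residual mean squares, Eqn. (43), can be verified directly for the unweighted case. A minimal sketch with hypothetical data and an illustrative function name:

```python
import math

def compare_models(x, y):
    """Return the intercept t statistic and the right-hand side of Eqn. (43)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))

    # Full model, y = b0 + b1*x, with n-2 degrees of freedom
    b1 = s_xy / s_xx
    b0 = y_mean - b1 * x_mean
    s2_full = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)

    # Model through the origin, y = b1*x, with n-1 degrees of freedom
    b1_0 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    s2_origin = sum((yi - b1_0 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)

    t_b0 = b0 / math.sqrt(s2_full * (1.0 / n + x_mean**2 / s_xx))  # Eqn. (42)
    t2_from_variances = (n - 1) * s2_origin / s2_full - (n - 2)    # Eqn. (43)
    return t_b0, t2_from_variances

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1]

t_b0, t2_from_variances = compare_models(x, y)
print(t_b0**2, t2_from_variances)
```

The squared t statistic would then be compared with the tabulated Student t value for n-2 degrees of freedom to decide whether the intercept can be dropped.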
Models through the origin may be used when consistency with the
underlying theory or other adequate prior (material and physical) reasons
are evident. There are cases, however, where it is not clear which model
should be used, and the choice between the no-intercept and intercept
models should be made with care. A comparison of the residual mean
y y y0
x x x0
(44a,b)
Then
yi yi1 1 xi xi1 i (46)
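\[ z_i = \beta_1 L_i + \varepsilon_i \tag{47} \]

\[ \frac{z_i}{L_i} = \beta_1 + \frac{\varepsilon_i}{L_i} \tag{48} \]

\[ b_1 = \frac{\sum z_i}{\sum L_i} \tag{49} \]

\[ d_i = z_i - b_1 L_i \tag{51} \]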
and then

\[ Q_{min} = \frac{1}{1 + b_1^2} \sum (y_i - b_1 x_i)^2 \tag{53} \]

\[ \frac{\partial Q_{min}}{\partial b_1} = -\frac{2}{1 + b_1^2} \sum x_i (y_i - b_1 x_i) - \frac{2 b_1}{\left(1 + b_1^2\right)^2} \sum (y_i - b_1 x_i)^2 = 0 \tag{54} \]

we get

\[ b_1^2 + b_1 \frac{\sum x_i^2 - \sum y_i^2}{\sum x_i y_i} - 1 = 0 \tag{55} \]

The minimization occurs when b1 has the sign of $\sum x_i y_i$. This kind of
regression is known as orthogonal regression (equal errors
in both axes).
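Solving the quadratic of Eqn. (55) and selecting the root by the sign rule can be sketched as follows. The data are hypothetical and the function names are illustrative; the two roots have product -1, so exactly one of them carries the sign of the cross-product sum.

```python
import math

def orthogonal_slope_through_origin(x, y):
    """Slope minimizing Eqn. (53), obtained from the quadratic in Eqn. (55)."""
    s_xx = sum(xi * xi for xi in x)
    s_yy = sum(yi * yi for yi in y)
    s_xy = sum(xi * yi for xi, yi in zip(x, y))
    if s_xy == 0:
        raise ValueError("sum of x*y is zero; the slope is not determined")
    p = (s_xx - s_yy) / s_xy            # b1^2 + p*b1 - 1 = 0
    root = math.sqrt(p * p + 4.0)
    b_positive = (-p + root) / 2.0
    b_negative = (-p - root) / 2.0
    return b_positive if s_xy > 0 else b_negative

def q_orthogonal(b1, x, y):
    """Sum of squared perpendicular distances, Eqn. (53)."""
    return sum((yi - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (1.0 + b1 * b1)

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b_orth = orthogonal_slope_through_origin(x, y)
print(b_orth, q_orthogonal(b_orth, x, y))
```

The discarded root, -1/b1, corresponds to the perpendicular direction and gives the maximum rather than the minimum of Q.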
Deming Regression
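and then we have the following general expression for the weights

\[ w_i = \frac{1}{\sigma_{r_i}^2} = \frac{1}{\sigma_{y_i}^2 + b_1^2 \sigma_{x_i}^2 - 2 b_1 \operatorname{cov}(x_i, y_i)} \tag{58} \]

\[ w_i = \frac{1}{\sigma_{y_i}^2 + b_1^2 \sigma_{x_i}^2} = \frac{1}{\left(C + b_1^2\right) \sigma_{x_i}^2} \tag{59} \]

where

\[ C = \frac{\sigma_{y_i}^2}{\sigma_{x_i}^2} \tag{60} \]

\[ Q_{min} = \frac{1}{C + b_1^2} \sum \frac{(y_i - b_1 x_i)^2}{\sigma_{x_i}^2} \tag{61} \]

and then

\[ \frac{\partial Q_{min}}{\partial b_1} = -\frac{2}{C + b_1^2} \sum \frac{x_i (y_i - b_1 x_i)}{\sigma_{x_i}^2} - \frac{2 b_1}{\left(C + b_1^2\right)^2} \sum \frac{(y_i - b_1 x_i)^2}{\sigma_{x_i}^2} = 0 \tag{62} \]

\[ b_1^2 + b_1 \frac{C \sum \dfrac{x_i^2}{\sigma_{x_i}^2} - \sum \dfrac{y_i^2}{\sigma_{x_i}^2}}{\sum \dfrac{x_i y_i}{\sigma_{x_i}^2}} - C = 0 \tag{63} \]

and, as in the case of Eqn. (55), the minimization occurs when b1 has the
sign of the denominator of the coefficient of b1 in Eqn. (63).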
Weighted regression with weights given by Eqn. (59) but applied to
models with intercept is known in the clinical literature (Linnet, 1993) as
Deming regression, and it is widely used in method comparison studies.
It is also a kind of orthogonal regression, also named oblique regression.
We have assumed the independence of all the xi and yi, but when this is
not the case we must include cross terms involving the covariance of
correlated variables. In addition, in those cases in which the ratio of
variances of y to x values is not a constant
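From Eqn. (5) in the most general case we get

\[ \frac{\partial Q_{min}}{\partial b_1} = \sum \left( w_i \frac{\partial r_i^2}{\partial b_1} + r_i^2 \frac{\partial w_i}{\partial b_1} \right) = 0 \tag{64} \]

and then

\[ \sum w_i \frac{\partial r_i^2}{\partial b_1} = -\sum r_i^2 \frac{\partial w_i}{\partial b_1} \tag{65} \]

\[ \frac{\partial r_i^2}{\partial b_1} = -2 (y_i - b_1 x_i) x_i = 2 b_1 x_i^2 - 2 x_i y_i \tag{66} \]

and

\[ \frac{\partial w_i}{\partial b_1} = -2 w_i^2 \left( b_1 \sigma_{x_i}^2 - \operatorname{cov}(x_i, y_i) \right) \tag{67} \]

\[ \sum w_i \left( 2 b_1 x_i^2 - 2 x_i y_i \right) = \sum r_i^2 w_i^2 \left( 2 b_1 \sigma_{x_i}^2 - 2 \operatorname{cov}(x_i, y_i) \right) \tag{68} \]

\[ b_1 \sum w_i x_i^2 - \sum r_i^2 w_i^2 \left( b_1 \sigma_{x_i}^2 - \operatorname{cov}(x_i, y_i) \right) = \sum w_i x_i y_i \tag{69} \]

\[ \left| \frac{b_{1,n+1}}{b_{1,n}} - 1 \right| < 10^{-k} \tag{70} \]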
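Because the weights in Eqn. (69) depend on the slope itself, the equation has to be solved iteratively, with Eqn. (70) as the stopping rule. The sketch below is one possible fixed-point scheme, not necessarily the procedure used by the authors: the function name, the starting value (the unweighted slope), the default k = 8 and the iteration cap are assumptions, and the data and variances are hypothetical.

```python
def iterate_slope(x, y, var_x, var_y, cov_xy=None, k=8, max_iter=100):
    """Fixed-point iteration on Eqn. (69), stopped by the criterion of Eqn. (70)."""
    n = len(x)
    if cov_xy is None:
        cov_xy = [0.0] * n  # independent errors assumed
    # Starting value: unweighted slope through the origin, Eqn. (10)
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    for _ in range(max_iter):
        # Weights from Eqn. (58), evaluated at the current slope
        w = [1.0 / (vy + b * b * vx - 2.0 * b * c)
             for vx, vy, c in zip(var_x, var_y, cov_xy)]
        r2 = [(yi - b * xi) ** 2 for xi, yi in zip(x, y)]
        # Eqn. (69) rearranged for b1 with weights and residuals held fixed
        num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y)) - sum(
            ri * wi * wi * c for ri, wi, c in zip(r2, w, cov_xy))
        den = sum(wi * xi * xi for wi, xi in zip(w, x)) - sum(
            ri * wi * wi * vx for ri, wi, vx in zip(r2, w, var_x))
        b_new = num / den
        if abs(b_new / b - 1.0) < 10.0 ** (-k):   # Eqn. (70)
            return b_new
        b = b_new
    raise RuntimeError("no convergence within max_iter iterations")

# Hypothetical data and variances, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
var_x = [0.01] * 5
var_y = [0.04] * 5

b_general = iterate_slope(x, y, var_x, var_y)
print(b_general)
```

When the x values are free from error the weights no longer depend on the slope and the scheme reduces to the weighted estimator of Eqn. (7) after one update.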
i
(71)
i1
y
ŷi median i (72)
xi
for a line through the origin, where observations with xi=0 are not
taken into account. Rousseeuw et al. (2001) showed a calibration data
example (peak area against concentration in ng/ml) for cadmium from
graphite furnace atomic absorption spectrometry. The least squares line
through the origin is displaced towards the outlier observed at the highest
concentration standard, whereas the deepest regression through the origin is
robust and fits the good data points.
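The robustness argument can be illustrated with a small numerical sketch. It assumes that the estimator referred to above reduces, for a line through the origin, to the median of the ratios y_i/x_i with the observations at x_i = 0 left out; the data are invented (they are not the cadmium data of Rousseeuw et al.) and contain one deliberate outlier at the highest x value.

```python
from statistics import median

def median_ratio_slope(x, y):
    """Median of the ratios y_i/x_i, skipping observations with x_i = 0."""
    ratios = [yi / xi for xi, yi in zip(x, y) if xi != 0]
    return median(ratios)

def least_squares_slope(x, y):
    """Unweighted least squares slope through the origin, Eqn. (10)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Hypothetical calibration data; the last point is an outlier
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0, 1.0, 2.1, 2.9, 4.0, 9.0]

b_median = median_ratio_slope(x, y)
b_ls = least_squares_slope(x, y)
print(b_median, b_ls)
```

The least squares slope is pulled towards the outlier because the point with the largest x carries the largest term in both sums, whereas the median of the ratios is unaffected by a single aberrant ratio.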
POLYNOMIAL REGRESSION
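No-intercept models more complex than those seen so far
may also be fitted to experimental data, e.g., a parabolic model passing
through the origin (Hahn, 1977; Karl and Huber, 1997)

\[ y = \beta_1 x + \beta_2 x^2 \tag{73} \]

\[ b_2 = \frac{\sum x_i y_i \sum x_i^3 - \sum x_i^2 y_i \sum x_i^2}{\left(\sum x_i^3\right)^2 - \sum x_i^2 \sum x_i^4} \tag{74} \]

and

\[ b_1 = \frac{\sum x_i y_i - b_2 \sum x_i^3}{\sum x_i^2} \tag{75} \]

\[ x_0 = -\frac{b_1}{2 b_2} \pm \sqrt{\left(\frac{b_1}{2 b_2}\right)^2 + \frac{y_0}{b_2}} \tag{76} \]

The sign to be taken in front of the square root for the root with physical
meaning coincides with the sign of the parameter b2; the other root lacks
physical meaning.
The variance of the regression will be given in this case by

\[ s_{y/x}^2 = \frac{\sum \left(y_i - b_1 x_i - b_2 x_i^2\right)^2}{n - 2} \tag{77} \]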
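Eqns. (74) to (76) translate directly into code. The following is a minimal sketch with illustrative function names; the data are generated from an assumed parabola (b1 = 2, b2 = 0.5) purely so that the recovered coefficients can be checked by eye.

```python
import math

def fit_parabola_through_origin(x, y):
    """Least squares estimates of b1 and b2 for y = b1*x + b2*x^2."""
    s_x2 = sum(xi**2 for xi in x)
    s_x3 = sum(xi**3 for xi in x)
    s_x4 = sum(xi**4 for xi in x)
    s_xy = sum(xi * yi for xi, yi in zip(x, y))
    s_x2y = sum(xi**2 * yi for xi, yi in zip(x, y))
    b2 = (s_xy * s_x3 - s_x2y * s_x2) / (s_x3**2 - s_x2 * s_x4)  # Eqn. (74)
    b1 = (s_xy - b2 * s_x3) / s_x2                               # Eqn. (75)
    return b1, b2

def inverse_parabola(y0, b1, b2):
    """Root of Eqn. (76); the sign of the radical follows the sign of b2."""
    centre = -b1 / (2.0 * b2)
    radicand = centre**2 + y0 / b2
    if radicand < 0:
        raise ValueError("y0 lies outside the range of the fitted parabola")
    return centre + math.copysign(math.sqrt(radicand), b2)

# Data generated from an assumed parabola, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0 * xi + 0.5 * xi**2 for xi in x]

b1, b2 = fit_parabola_through_origin(x, y)
x0 = inverse_parabola(10.5, b1, b2)
print(b1, b2, x0)
```

For a concave calibration curve (negative b2) the same rule selects the smaller of the two positive roots, which is the one lying on the rising branch of the parabola.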
Meites and Leary (1985) and Leary and Messick (1985) have treated
constrained calibration curves with parabolic examples such as the one
shown above. Dalebrou (1974) reports the analysis of variance of polynomial
regression with no intercept by means of the orthogonal coefficients method.
D-optimal designs for polynomial regression models with no intercept have
been the subject of statistical consideration (Fang, 2002).
Alexander et al., 2015 Francis and Kim and Burkart, Shayanfar and
Sobel, 1970 2008 Shayanfar, 2011
Bonate, 2011 Georgian, 2009 Leroy and Strong III, 1979
Messick, 1985
Bánfai and Kemény, 2012 Hubert, 1997 Raposo et al., 2015 Synek et al.,
2000
Bonate, 1992 Kemp, 1985 Ripley and Van Zoonen et
Thompson, 1987 al., 1999
Dolan, 2009 Kemp, 1984a Roy and Kas, 2014
Ellerton and Strong, 1980 Kemp, 1984b Schwartz, 1986
Content Reference
The usage of R2 as a measure of model fit and predictive power Alexander, Tropsha
in QSAR or QSPR modelling. Suggestion of how to use it and Winkler, 2015
appropriately as a measure of model fit.
Methodology for developing priors from individual or combined Hamel, 2015
meta-analyses which implicitly implies the assumption that there
is variation around the meta-analytical relationships themselves.
Examples of application to individual species are provided.
Comparison between models with and without intercept and Abdulsalam Othman,
statement of the best one. Application of the leverage point method 2014
when a new point is added to the original data.
The rm2 metrics and regression through origin approach: reliable Roy and Kar, 2014
and useful validation tools for predictive QSAR models
(Commentary on ‘Is regression through origin useful in external
validation of QSAR models?’).
Comparison study of the proposed criteria using the regression Shayanfar and
through origin method (calculation with SPSS and Excel) for Shayanfar, 2014
external validation and prediction capability for models
developed using literature data. Prediction capability was
evaluated using the statistically significant differences between
absolute error values of training and test sets.
Table 3. (Continued)
Content Reference
Iwao’s patchiness regression through the origin: exploration Waters et al., 2014
whether fixing Iwao’s m*– m relation to go through the origin is
theoretically justifiable, statistically advantageous given the
methods used to estimate its parameters, and reduces the sample
size required when used to design sequential sampling plans
with no loss of sampling precision. Both analytical methods and
resampling methods based on field data are employed.
Research on the suitability of interval hypotheses for a selection Bánfai, 2012
of analytical problems frequently occurring in the
pharmaceutical setting. Overview of the statistical intervals and
hypothesis tests used in the Dissertation. The interval hypothesis
testing is discussed for the following topics: the transfer of
analytical methods, the evaluation of the accuracy of analytical
methods, the applicability of single-point calibration, and the
content uniformity assessment.
Estimation of bias for single-point calibration using a proposed Bánfai and Kemény,
method based on the two one- sided tests (interval hypothesis). 2012
The test is performed by comparing a confidence interval for the
bias to an allowable limit, defined in concentration units.
Fieller’s theorem was used for the ratio of two normally
distributed random variables to construct the confidence interval
for the bias.
Survey of the development of different rm2 metrics followed by Roy and Mitra, 2012
their applications in modeling studies for selection of the best
QSAR models in different reports made by several workers.
Clarification of the statement “one often tends to use the origin Burkart and Kim,
point (0,0) in the data. However, whether that is best practice or 2009
not is entirely arguable.” The argument that a zeroed instrument
is expected to provide a point at (0,0) is specious and
misleading.
Calibration models: How to decide if a calibration curve goes Dolan, 2009
through zero and some problems that can occur if the wrong
choices are made.
Evaluating ‘goodness-of-fit’ for linear instrument calibrations Georgian, 2009
through the origin. A weighted regression coefficient is
subsequently defined to evaluate the ‘goodness-of-fit’ and is
expressed as function of the %RSD.
Properties of weighted least squares regression, particularly with Knaub, 2009
regard to regression through the origin for establishment survey
data, for use in periodic publications.
The statistical reasons why regression through the origin should be Legendre and
used to analyze comparative data, and supports the Desdevises, 2009
recommendation of Garland et al. (1992) through additional
geometric reasons.
Discussion of the visualization of statistical concepts and reply to the Kim and Burkart,
letter written by Levie (2008) about whether or not to include an 2008
origin point (0,0) in a regression analysis for building a standard
curve.
The correct use of visualizing statistical concepts. Fails in the Levie, 2008
attempt of this test in the example described by Kim and Burkart,
2006 about “Beer’s Law Plot.”
Fitting a curve passing through a designated point to data to improve the Sun et al., 2008
reproducibility of peripheral quantitative computed tomography.
An interactive and dynamic method of visual interactive regression Kim and Burkart,
minimizing the sum visible by allowing the individual to adjust 2006
heights in a bar graph. The interactive feature of Excel spreadsheet
programs is utilized; use of the spinner bar is particularly helpful.
Properties of the deepest regression and applications in analytical Rousseeuw et al.,
chemistry: Regression through the origin, polynomial regression, 2001
the Michaelis–Menten model, and censored responses.
Linear regression of calibration lines passing through the origin Synek, 2001
was investigated for three models of y-direction random errors:
normally distributed errors with an invariable standard deviation
(SD) and log normally and normally distributed errors with an
invariable relative standard deviation (RSD).
Uncertainties of mercury determinations in biological materials Synek, Subrt and
using an atomic absorption spectrometer. Study of potential Marecek, 2000
sources of uncertainties as possible in order to work out a general
model of determination of uncertainty in trace atomic absorption
measurements.
Critical overview of most conflicting points concerning linear Giordano, 1999
regression. Confidence bands and a discussion about the use of a
line through the origin are included. In addition, the simplest
expressions for expressing parameters to the appropriate significant
figures from built-in calculator programs are also provided.
Validation is put in the context of the process of producing Zoonen et al., 1999
chemical information. Two cases are presented in more detail: the
development of a European standard for chlorophenols and its
validation by a full scale collaborative trial, and the intralaboratory
validation of a method for ethylene-thiourea using alternative
analytical techniques.
Response to Comment of Cade and Terrell about cautions on Bourgeois et al.,
Forcing Regression Equations through the Origin: This paper 1997
strengthens the caution to anyone considering no-intercept models
for improving relations between fish density and weighted usable
area. The explanations given by Cade and Terrell (1997)
convincingly reinforce this warning.
Comment to Cautions on Forcing Regression Equations through Cade and Terrell,
the Origin (Bourgeois et al., 1997): Prediction of biological 1997
response still depends largely on a detailed understanding of local
biological conditions. The authors urge caution in forcing regression
of fish density on weighted usable area through the origin; when
such forcing is contemplated, one should verify the calculations
used by commercial statistical packages to generate summary
statistics.
Improved calibration for wide measuring ranges and low contents. Karl and Huber,
For some calibrations, a straight line through the origin instead of a 1997
general straight line should be determined by regression analysis:
advantages and restrictions.
A least-squares-based method for determining the ratio between Moreno, 1997
two measured quantities.
Relationship between principal components analysis and weighted Andrews et al.,
linear regression for bivariate data sets: Application to linear, two- 1996
dimensional data sets with a zero intercept.
The problem of fitting a straight line when both variables are Draper et al., 1991
subject to error. A brief review of the literature is undertaken, and
one fitting method, the geometric mean functional relationship, is
spotlighted and illustrated with two sets of example data.
Several methods of obtaining the best straight line from data in Tan and Jones,
which the two variables are subject to errors of measurement are 1989
proposed and discussed.
Statistical analysis techniques to compare pairs of dust samples: A Knight and Moore,
straight line through the origin, linear with intercepts, logarithmic, 1987
a logarithmic (weigh + constant), and a fifth forced
through the origin. It is suggested that the best estimate of the
relation between two dust samplers can be obtained by a least
squares determination of the straight line through the origin
using transformed variables.
A regression-like technique, maximum-likelihood fitting of a Ripley and
functional relationship (MLFR), is explained and is Thompson, 1987
demonstrated to work well. Under some conditions weighted
regression provides a good approximation to MLFR, and so can
be used if more convenient.
Suggestions that may be helpful to researchers in deciding Schwartz, 1986
whether or not to impose constraints.
The four main calibration methods (single separate or added Kemp, 1985
standard and multiple separate or added standards) and some
modifications are described mathematically and subjected to
error-propagation analysis, to examine the likely effects of errors
in the analytical signal on the overall accuracy and precision of
the concentration estimate.
Constrained Calibration Curves: How the use of Lagrange Leary and Messick,
multipliers can supplement the more traditional least-squares 1985
curve fitting procedures. The concept of degrees of freedom
when describing the variability of data around a calibration
curve is also discussed.
Simple ways are described of constraining a calibration or other Meites and Leary,
equation so that it will pass through one or more independently 1985
selected points and also give the “best” representation of any
number of experimental data in terms of the model selected.
Theoretical aspects of one-point calibration: causes and effects Kemp, 1984a
of some potential errors, and their dependence on
concentration.
New ways of using data from analytical-recovery studies to Kemp, 1984b
assess analytical nonlinearity, without access to samples of
known concentration. A recovery-based method of assessing
constant, proportional, and non-linear errors with use of as little
as one sample pool of known concentration is described. In each
case, the theoretical basis of the method and an outline of a
practical experimental protocol is presented.
Evaluation, both qualitative and quantitative, of the bias error Cardone and Palermo,
caused by single-point-ratio calculations from an assumed 1980
linear response curve through zero for the case where the true
response curve is a straight line with a significant intercept.
Comments on the correspondence about regression through the Ellerton, 1980
origin (Strong, 1979): Precision and accuracy should be
considered from a statistical viewpoint and discussed.
Response of Strong to comments of Ellerton (1980) on Strong III, 1980
regression through the origin: Strong agrees thoroughly with
Ellerton's definitions of precision and accuracy which apply to a
chemist's repetitions of determinations on the same sample.
Regarding the use of n-2 to calculate se, rather than n-1, as
recommended by Ellerton, Strong felt that requiring the best
straight line to pass through the origin was a constraint on the
system and therefore constituted a reduction in the number of
degrees of freedom.
Determination of the precision and accuracy of kinetic data. Cvetanović,
Suggestions for the presentation of kinetic results and their Singleton and
uncertainties due to random and systematic errors. Regarding Paraskevopoulos,
random errors, least-squares expressions are summarized, and 1979
confidence limits, propagation of errors, and change of variable
are discussed. Sources of systematic errors are outlined, along
with potential methods for their detection and estimation.
Practical examples of fitting regression models with no intercept Hahn, 1979
term. Caution in the use of the model is advised.
Demonstration of how, in a photometric experiment, if one Strong III, 1979
measures the absorbances, y, of solutions having solute
concentrations x, and if the solutions are expected to conform
with Beer's law, one should fit a straight line that passes through
the point y = 0 at x = 0. Strong proposes to accomplish this by
using a single-parameter model equation, y = b1x, rather than the
conventional two-parameter model y = a + b2x. The single-
parameter model effectively forces the straight line to pass
through (0, 0), but in either case, the slope, b1 or b2, represents the
absorptivity. This will unavoidably reduce the precision slightly,
but could increase the accuracy.
Weighting factors in least squares: When there is great variation Sands, 1974
among the variances, the assumption of constant weights can
produce gross errors. A practical pH example.
Interval Estimate of the Ratio of an Unknown to a Standard. Francis and Sobel,
Methods for testing the suitability of the models under 1970
discussion are given.
QSAR or QSPR: Quantitative Structure-Activity/Property Relationships (QSAR or
QSPR)
FINAL COMMENTS
REFERENCES
Acton, F.S., (1959). Analysis of Straight Line Data: The equation y=b x.
New York, USA: Wiley, pp. 16-17.
Afifi, A.F., Azen, S.P., (1972). Statistical Analysis, A Computer Oriented
Approach. 2nd ed., 1st ed., New York, USA: Academic Press, p.125; pp.
88-89.
Alexander, D.L., Tropsha, A., Winkler, D.A., (2015). Beware of R2: simple,
unambiguous assessment of the prediction accuracy of QSAR and
QSPR models. J. Chem. Inf. Model. 55(7), 1316-1322.
Andrews, D.T., Chen, L., Wentzell, P.D., Hamilton, D.C., (1996).
Comments on the relationship between principal components analysis
and weighted linear regression for bivariate data sets. Chemometr.
Intell. Lab. Systems 34, 231-244.
Asuero, A.G., Bueno, J., (2011). Fitting straight lines with replicated
observations by linear regression. IV. Transforming data. Crit. Rev.
Anal. Chem. 41(1), 36-69.
Asuero, A.G., Gonzalez, G., (1989). Some observations on fitting a straight
line to data. Microchem. J. 40(2), 216-225.
Asuero, A.G., Gonzalez, G., (2007). Fitting straight lines with replicated
observations by linear regression. III. Weighting data. Crit. Rev. Anal.
Chem. 37(3), 143-172.
Asuero, A.G., Gonzalez, G., de Pablos, F., Ariza, J.L.G., (1988).
Determination of the optimum working range in spectrophotometric
procedures. Talanta 35(7), 531-537.
Asuero, A.G., Sayago, A., Gonzalez, A.G., (2006). The correlation
coefficient: an overview. Crit. Rev. Anal. Chem. 36(1), 41-59.
Atilgan, Y.K., Gunay, S., (2011). Least median of squares solution of
multiple linear regression models through the origin. Commun. Stat.
Theory Methods 40(22), 4125-4137.
Austen, A.E.W., Pelzer, H., (1946). Linear curves of best fit. Nature 157,
693-694.
Hamel, O.S., (2015) A method for calculating a meta- analytical prior for
the natural mortality rate using multiple life history correlates. ICES J.
Mar. Sci. 72(1), 62-69.
Hawkins, D.M., (1980). A note on fitting a regression without an intercept
term. Am. Stat. 34(4), 233.
Haws, A.P., Gordon, H.A., (1981). Letter to Editor. Stat. 30(4), 304-308.
Hedayat, A., (1970). Examination and analysis of residuals, diagnostic
checking of residuals for detecting a special type of heteroscedasticity
in linear regression through the origin. Biometrics 26(3), 603. (Joint
Meeting of ENAR with IMS and ASA, Chapel Hill, North Carolina).
Hedayat, A., Raktoe, B.L., Talwar, P.P., (1977). Examination and analysis
of residuals: a test for detecting a monotonic relation between mean
and variance in regression through the origin. Commun. Stat. Theory
Methods 6(6), 497-506.
Howarth, R.J., (2001). A history of regression and related model-fitting in
the earth sciences. Nat. Resourc. Res. 10(4), 241-286.
Huber, M.K.W., (1997). Improved calibration for wide measuring ranges
and low contents. Accred. Qual. Assur. 2(8), 367-374.
Huber, P.J., (1964). Robust estimation of a location parameter. Ann. Math.
Stat. 35, 73-101.
Iwao, S., (1968). A new regression method for analyzing the aggregation
pattern of animal population. Res. Popul. Ecol. 10(1), 1-20.
Iwase, K., (1989). Linear regression through the origin with constant
coefficient of variation for the inverse Gaussian distribution. Commun.
Stat. Theory Methods 18(10), 3587-3593.
Kayhan, Y., Gunay, S.M., (2008). A new approach to least median of
squares and regression through the origin. Commun. Stat. Theory
Methods 37(5), 773-781.
Kemp, G.J., (1985). The susceptibility of calibration methods to errors in
the analytical signal. Anal. Chim. Acta 176, 229-247.
Kemp, G.J., (1984). Theoretical aspects of one-point calibration: causes
and effects of some potential errors, and their dependence on
concentration. Clin. Chem. 30(7), 1163-1167.
Kemp, G.J., (1984). Assessment of analytical bias: four new ways to use
recovery measurements. Clin. Chem. 30(7), 1168-1170.
Kerrich, J.E., (1966). Fitting the line y=ax when errors of observation are
present in both variables. Am. Stat. 20(1), 24.
Kim, M-H., Burkart, M., (2008). The author replies. Including or not
including an original point (0,0) in a regression analysis for building a
standard curve. J. Chem. Educ. 85(5), 635-636.
Kim, M-H., Burkart, M., Kim, M.H., (2006). A method of visual
interactive regression. J. Chem. Educ. 83(12), 1884.
Knaub, J.R., (2009) Properties of weighted least squares regression for
cutoff sampling in establishment surveys. Conference paper Cutoff
Sampling and Establishment Surveys. InterStat J. December.
Knight, G., Moore, E., (1987). Comparison of dust samplers: statistical
analysis techniques. Am. Ind. Hyg. Assoc. J. 48(4), 344-353.
Kozak, A., Kozak, R.A., (1995). Notes on regression through the origin.
Forest. Chron. 7(3), 326-330.
Kvalseth, T.O., (1985). Cautionary note about R2. Am. Stat. 39(4), 279-
285.
Lark, P.D., Craven, B.R., Bosworth, R.C.L., (1969). The Handling of
Chemical Data (pp 159-163). Oxford, England: Pergamon Press.
Leary, J.J., Messick, E.B., (1985). Constrained calibration curves: a novel
application of Lagrange multipliers in analytical chemistry. Anal.
Chem. 57(4), 956-957.
Legendre, P., Desdevises, Y., (2009). Independent contrasts and regression
through the origin. J. Theoret. Biol. 259(4), 727-743.
Linnet, K., (1993). Evaluation of regression procedures for method
comparison studies. Clin. Chem. 39(3), 424-432.
Lisy, J.M., (1990). Multiple straight-line least squares analysis with
uncertainties in all variables. Comp. Chem. 14, 189-192.
Liteanu, C., Rica, I., (1980). Statistical Theory and Methodology of Trace
Analysis. New York, USA: Ellis Horwood, pp. 161-162.
Mandel, J., (1957). Fitting a straight line to certain type of cumulative data.
J. Am. Stat. Assoc. 12(280), 552-566.
Noggle, J.H., (1993). Practical Curve Fitting and Data Analysis. Software
and Self-Instruction for Scientists and Engineers. Chichester, England:
Ellis Horwood.
Okunade, A.A., Chang, C.F., Evans, R.D., (1993). Comparative analysis of
regression output summary statistics in common statistical packages.
Am. Stat. 47(4), 298-303.
Othman, S.A., (2014). Comparison between models with and without
intercept. Gen. Math. Notes 21(1), 118-127.
Raposo, F., (2016). Evaluation of analytical calibration based on least-
squares linear regression for instrumental techniques: a tutorial review.
Trends Anal. Chem. 77, 167-185.
Rieder, H., (1989). A finite-sample minimax regression estimator. Stat.
20(2), 211-221.
Ripley, B.D., Thompson, M., (1987). Regression techniques for the
detection of analytical bias. Analyst 112(4), 377-383.
Rousseeuw, P.J., (1984). Least median of squares regression. J. Am. Stat.
Assoc. 79(12), 871-880.
Rousseeuw, P., (1988). PROGRESS: a program for robust regression.
Trends Anal. Chem. 7(9), 320-321.
Rousseeuw, P.J., Hubert, M., (1997). Recent development in PROGRESS.
In Lecture Notes-Monograph Series. Vol. 31, Institute of Mathematical
Statistics (IMS).
Rousseeuw, P.J., van Aelst, S., Rambali, B., Smeyers-Verbeke. J., (2001).
Deepest regression in analytical chemistry. Anal. Chim. Acta 446, 245-
256.
Rousseeuw, P.J., Leroy, A.M., (1987). Robust Regression & Outlier
Detection: Simple regression through the origin. New York, USA:
Wiley, pp. 62-65.
Roy, K., Kar, S., (2014). The r(m)(2) metrics and regression through origin
approach: reliable and useful validation tools for predictive QSAR
models. Commentary on ‘Is regression through the origin useful in
external validation of QSAR models?’. Eur. J. Pharm. Sci. 62, 111-114.
Roy, K., Mitra, I., (2012). On the use of the metric rm2 as an effective tool
for validation of QSAR models in computational drug design and
predictive toxicology. Mini Rev. Med. Chem. 12(6), 491-504.
Ryan, T.P., (2008). Modern Regression Methods. 2nd ed., New York, USA:
Wiley.
Sands, D.E., (1974). Weighting factors in least squares. J. Chem. Educ.
51(7), 473-474.
Sayago, A., Boccio, M., Asuero, A.G., (2004). Fitting straight lines with
replicated observations by linear regression: the least squares
postulates. Crit. Rev. Anal. Chem. 34(1), 39-50.
Sayago, A., Asuero, A.G., (2004). Fitting straight lines with replicated
observations by linear regression. Part II. Testing for homogeneity of
variances. Crit. Rev. Anal. Chem. 34(3-4), 133-146.
Schwartz, L.M., (1986). Effect of constraints on precision of calibration
analyses. Anal. Chem. 58(1), 246-250.
Schwartz, L.M., Gelb, R.I., (1984). Statistical uncertainties of end points at
intersecting straight lines. Anal. Chem. 56(8), 1487-1492.
Scott, A., Wild, C., (1991). Transformations and R2. Am. Stat. 45(2), 127-
129.
Seber, G.A.F., Lee, A.J., (2003). Linear Regression Analysis. 2nd ed., New
York, USA: Wiley, p. 149.
Shayanfar, A., Shayanfar, S., (2011). Is regression through origin useful in
external evaluation of QSAR models?. Eur. J. Pharm. Sci. 87, 271-
273.
Sheather, S.J., (2009). A Modern Approach to Regression with R. New
York: Springer, pp 51-70, pp 115-123.
Strong III, F.C., (1979). Regression line that starts at the origin. Anal.
Chem. 51(2), 298-299.
Sun, L., Xie, T., Fan Y.M., Zhang, C., (2008). Fitting curve passing
through designated point to data for promoting the reproducibility of
peripheral quantitative computed tomography (pQCT). IEEE
Computer Society 2008: Proceedings of the 2008 International
Conference on BioMedical Engineering and Informatics, Sanya,
Hainan, China, Vol. 2, pp. 867-871.
Synek, V., (2001). Calibration lines passing through the origin with errors
in both axes. Accred. Qual. Assur. 6(8), 360-367.
Synek, V., Subrt, P., Marecek, J., (2000). Uncertainties of mercury
determinations in biological materials using an atomic absorption
spectrometer – AMA 254. Accred. Qual. Assur. 5(2), 58-66.
Tan, H.S., Jones, W.E., (1989). Fitting of a straight line when both
variables contain errors. Application to the Beer-Lambert law. J.
Chem. Educ. 66(8), 650-651.
Turner, M.E., (1960). Straight line regression through the origin.
Biometrics 16(3), 483-485.
Uyar, B., Erdem, O., (1990). Regression procedures in SAS problems?.
Am. Stat. 44(4), 296-301.
Valentine, T.J., (1971). Regression through origin. Am. Stat. 25(5), 58-59.
van Zoonen, P., Hoogerbrugge, R., Gort, S.M., van de Wiel, H.J., van’t
Klooster, H.A., (1999). Some practical examples of method validation
in the analytical laboratory. Trends Anal. Chem. 18(9-10), 584-593.
Waters, E.K., Furlon, M.J., Benke, K.K., Grove, J.R., Hamilton, A.J.,
(2014). Iwao’s patchiness regression through the origin: biological
importance and efficiency of sampling applications. Popul. Ecol.
56(2), 393-399.
Willett, J.B., Singer, J.D., (1988). Another cautionary note about R2, its use
in weighted least squares regression. Am. Stat. 42(3), 236-238.
Williamson, J.H., (1968). Least squares fitting of a straight line. Can. J.
Phys. 46 (16), 1845-1847.
Winsor, C.P., (1946). Which regression. Biometr. Bull. 2(6), 101-109.
York, D., (1969). Least squares fitting of a straight line with correlated
errors. Earth Planet. Sci. Lett. 5, 320-324.
In: Linear Regression ISBN: 978-1-53611-992-3
Editor: Vera L. Beck
© 2017 Nova Science Publishers, Inc.
Chapter 3
Abstract
1. Introduction
Linear regression for interval-valued data has been attracting increasing
interest among researchers. See [10], [20], [12, 13], [23], [8], [5], [14],
[26, 27], [6], [9] for a partial list of references. However, issues such as
interpretability and computational feasibility still remain. In particular, a
commonly accepted mathematical foundation is still largely underdeveloped
relative to the demand from applications. By proposing our new model, we
continue to build up a theoretical framework that provides a deeper
understanding of the existing models and facilitates future developments.
In the statistics literature, interval-valued data analysis is most often
studied under the framework of random sets, which includes random intervals
as the special (one-dimensional) case. The probability-based theory for random
sets has developed since the publication of the seminal book of [24]. See [25] for
a relatively complete monograph. To facilitate the presentation of our results,
we briefly introduce the basic notations and definitions in the random set theory.
Let $(\Omega, \mathcal{L}, P)$ be a probability space. Denote by
$\mathcal{K}(\mathbb{R}^d)$ or $\mathcal{K}$ the collection of all non-empty
compact subsets of $\mathbb{R}^d$. In the space $\mathcal{K}$, a linear
structure is defined by Minkowski addition and scalar multiplication, i.e.,
$$A + B = \{a + b : a \in A,\ b \in B\}, \qquad \lambda A = \{\lambda a : a \in A\},$$
$\forall A, B \in \mathcal{K}$ and $\lambda \in \mathbb{R}$. A natural metric
for the space $\mathcal{K}$ is the Hausdorff metric $\rho_H$, which is defined as
$$\rho_H(A, B) = \max\left( \sup_{a \in A} \rho(a, B),\ \sup_{b \in B} \rho(b, A) \right), \quad \forall A, B \in \mathcal{K},$$
where $\rho$ denotes the Euclidean metric. A random compact set is a Borel
measurable function $A : \Omega \to \mathcal{K}$, $\mathcal{K}$ being equipped
with the Borel σ-algebra induced by the Hausdorff metric. For each
$X \in \mathcal{K}(\mathbb{R}^d)$, the function defined on the unit sphere
$S^{d-1}$:
$$s_X(u) = \sup_{x \in X} \langle u, x \rangle, \quad \forall u \in S^{d-1}$$
is called the support function of $X$. If $A(\omega)$ is convex almost surely,
then $A$ is called a random compact convex set. (See [25], p.21, p.102.) The
collection of all compact convex subsets of $\mathbb{R}^d$ is denoted by
$\mathcal{K}_C(\mathbb{R}^d)$ or $\mathcal{K}_C$. When $d = 1$, the
corresponding $\mathcal{K}_C$ contains all the non-empty bounded closed
intervals in $\mathbb{R}$. A measurable function
$X : \Omega \to \mathcal{K}_C(\mathbb{R})$ is called a random interval. Much of
the random sets theory has focused on compact convex sets. Let $\mathcal{S}$ be
the space of support functions of all non-empty compact convex subsets in
$\mathcal{K}_C$. Then, $\mathcal{S}$ is a Banach space equipped with the $L_2$
metric
$$\|s_X(u)\|_2 = \left[ d \int_{S^{d-1}} |s_X(u)|^2 \, \mu(du) \right]^{1/2},$$
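For the one-dimensional case, the operations defined above reduce to simple endpoint arithmetic. The following is a minimal Python sketch, not part of the chapter: the function names and the tuple representation `(lower, upper)` are illustrative assumptions. It uses the standard facts that the Minkowski sum of two closed intervals adds endpoints, and that the Hausdorff distance between two closed intervals is the larger of the two endpoint differences.

```python
from typing import Tuple

# Hypothetical representation: a closed interval is a (lower, upper) tuple.
Interval = Tuple[float, float]


def minkowski_add(a: Interval, b: Interval) -> Interval:
    """A + B = {x + y : x in A, y in B} for closed intervals."""
    return (a[0] + b[0], a[1] + b[1])


def scalar_mul(lam: float, a: Interval) -> Interval:
    """lam * A = {lam * x : x in A}; a negative lam swaps the endpoints."""
    lo, hi = lam * a[0], lam * a[1]
    return (min(lo, hi), max(lo, hi))


def hausdorff(a: Interval, b: Interval) -> float:
    """Hausdorff distance between two closed intervals on the real line."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def center_radius(a: Interval) -> Tuple[float, float]:
    """Center and radius (half-length) of an interval."""
    return ((a[0] + a[1]) / 2.0, (a[1] - a[0]) / 2.0)
```

The center and radius returned by `center_radius` correspond to the quantities written with superscripts c and r in the rest of the chapter.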
Some advances have been made regarding this model and the associated es-
timators. [13] derived least squares estimators for the model parameters and
examined them from a theoretical perspective. [14] established a test of linear
independence for interval-valued data. However, many problems still remain
open such as biases and asymptotic distributions, as anticipated in [13]. This
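where $E(\varepsilon_i^c) = 0$, $E(\varepsilon_i^r) = c > 0$, and the signs “±” correspond to the two cases
in (4). Define
$$\begin{cases}
\lambda_i = \varepsilon_i^c,\ \eta_i = \varepsilon_i^r, & \text{if } Y_i \ominus (aX_i + b) \text{ exists;} \\
\lambda_i = -\varepsilon_i^c,\ \eta_i = -\varepsilon_i^r, & \text{if otherwise } (aX_i + b) \ominus Y_i \text{ exists,}
\end{cases} \tag{5}$$
where $E(\lambda_i) = 0$, $E(\eta_i) = \mu \in [-c, c]$, $\mathrm{Var}(\lambda_i) = \sigma_\lambda^2 > 0$, and $\mathrm{Var}(\eta_i) = \sigma_\eta^2 > 0$.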
To model the outcome intervals $Y_i = [\underline{Y}_i, \overline{Y}_i]$ by $p$ interval-valued predictors
$X_{j,i} = [\underline{X}_{j,i}, \overline{X}_{j,i}]$, $i = 1, \cdots, n$; $j = 1, \cdots, p$, we consider the
multivariate extension of (3):
$$\delta\left( Y_i,\ b + \sum_{j=1}^{p} a_j X_{j,i} \right) = \|\varepsilon_i\|_2, \tag{8}$$
where $E(\lambda_i) = 0$, $E(\eta_i) = \mu \in [-c, c]$, $\mathrm{Var}(\lambda_i) = \sigma_\lambda^2$, and $\mathrm{Var}(\eta_i) = \sigma_\eta^2$. We have
assumed $\lambda_i$ and $\eta_i$ are independent in this chapter to simplify the presentation.
The model that includes a covariance between λi and ηi can be implemented
without much extra difficulty.
Therefore, the LSE of $\{\mu, b, a_j, j = 1, \cdots, p\}$ is defined as
$$\{\hat{\mu}, \hat{b}, \hat{a}_j, j = 1, \cdots, p\} = \arg\min \frac{1}{n} L(\mu, b, a_j, j = 1, \cdots, p). \tag{13}$$
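Let
$$S(X_j^c, X_k^c) = \frac{1}{n} \sum_{i=1}^{n} X_{j,i}^c X_{k,i}^c - \left( \frac{1}{n} \sum_{i=1}^{n} X_{j,i}^c \right) \left( \frac{1}{n} \sum_{i=1}^{n} X_{k,i}^c \right),$$
$$S(X_j^r, X_k^r) = \frac{1}{n} \sum_{i=1}^{n} X_{j,i}^r X_{k,i}^r - \left( \frac{1}{n} \sum_{i=1}^{n} X_{j,i}^r \right) \left( \frac{1}{n} \sum_{i=1}^{n} X_{k,i}^r \right),$$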
$$\hat{b} = \overline{Y^c} - \sum_{j=1}^{p} \hat{a}_j \overline{X_j^c}, \tag{15}$$
$$\hat{\mu} = \overline{Y^r} - \sum_{j=1}^{p} |\hat{a}_j| \overline{X_j^r}. \tag{16}$$
See [18, 19]. For the case d = 1, it is shown by straightforward calculations that
$$E X = [E\underline{X},\ E\overline{X}],$$
$$\mathrm{Var}(X) = \mathrm{Var}(X^c) + \mathrm{Var}(X^r).$$
This leads us to define the sums of squares in KC (R) to measure the variability
of interval-valued data. A definition of the coefficient of determination R2 in
KC (R) follows immediately, which produces a measure of goodness-of-fit.
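Definition 1. The total sum of squares (SST) in $\mathcal{K}_C$ is defined as
$$\mathrm{SST} = \sum_{i=1}^{n} \left[ \left( Y_i^c - \overline{Y^c} \right)^2 + \left( Y_i^r - \overline{Y^r} \right)^2 \right]. \tag{17}$$
$$R^2 = \mathrm{SSE} / \mathrm{SST}.$$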
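As an illustration of how these quantities could be computed from centers and radii, here is a short Python sketch. Only SST is defined explicitly in the text above; the forms used for SSE (explained) and SSR (residual) are assumptions chosen by analogy with (17) and with the relation R² = 1 − SSR/SST = SSE/SST used later in the chapter. The function names are hypothetical.

```python
import numpy as np


def sums_of_squares(yc, yr, yc_hat, yr_hat):
    """Return (SST, SSE, SSR) from observed and fitted centers and radii.

    SST follows definition (17). SSE (explained) and SSR (residual) are
    assumed analogues built from the same center and radius components.
    """
    yc, yr, yc_hat, yr_hat = (
        np.asarray(v, dtype=float) for v in (yc, yr, yc_hat, yr_hat)
    )
    sst = np.sum((yc - yc.mean()) ** 2 + (yr - yr.mean()) ** 2)
    sse = np.sum((yc_hat - yc.mean()) ** 2 + (yr_hat - yr.mean()) ** 2)
    ssr = np.sum((yc - yc_hat) ** 2 + (yr - yr_hat) ** 2)
    return float(sst), float(sse), float(ssr)


def r_squared(yc, yr, yc_hat, yr_hat):
    """R^2 = SSE / SST, with SSE taken as the explained sum of squares."""
    sst, sse, _ = sums_of_squares(yc, yr, yc_hat, yr_hat)
    return sse / sst
```

For least squares fits the partition SST = SSE + SSR should hold, so comparing `sse / sst` with `1 - ssr / sst` is a quick sanity check on a fitted model.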
3. Properties of LSE
In this section, we study the theoretical properties of the LSE for the univariate
model (6)-(7). Applying Proposition 1 to the case p = 1, we obtain the two
sets of half-space solutions, corresponding to a ≥ 0 and a < 0, respectively, as
follows:
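$$a^+ = \frac{S(X^c, Y^c) + S(X^r, Y^r)}{S^2(X^c) + S^2(X^r)}, \tag{21}$$
$$b^+ = \overline{Y^c} - a^+ \overline{X^c}, \tag{22}$$
$$\mu^+ = \overline{Y^r} - |a^+| \overline{X^r}; \tag{23}$$
and
$$a^- = \frac{S(X^c, Y^c) - S(X^r, Y^r)}{S^2(X^c) + S^2(X^r)}, \tag{24}$$
$$b^- = \overline{Y^c} - a^- \overline{X^c}, \tag{25}$$
$$\mu^- = \overline{Y^r} - |a^-| \overline{X^r}. \tag{26}$$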
The final formula for the LS estimates falls into three categories. In the first,
exactly one of the two sets of solutions exists, and it is defined as the LSE. In the
second, both sets of solutions exist, and the LSE is the one that minimizes L. In
the third, neither solution exists, but this only happens with probability
going to 0. We summarize these findings in the following theorem.
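The selection among the half-space solutions (21)-(26) can be sketched in Python as follows. This is a minimal illustration under two stated assumptions that are not spelled out in this section: a half-space solution is taken to "exist" when its sign is consistent with its half-space (a⁺ ≥ 0, a⁻ < 0), and the loss L is taken to be the sum of squared center and radius residuals, which is the form suggested by the derivative equations in the proof of Proposition 1. All function names are hypothetical.

```python
import numpy as np


def s_cov(u, v):
    """S(u, v) = mean(u * v) - mean(u) * mean(v); S^2(u) is s_cov(u, u)."""
    return float(np.mean(u * v) - np.mean(u) * np.mean(v))


def loss(a, b, mu, xc, xr, yc, yr):
    """Assumed loss: squared residuals of centers plus squared residuals of radii."""
    return float(np.sum((yc - a * xc - b) ** 2 + (yr - abs(a) * xr - mu) ** 2))


def univariate_lse(xc, xr, yc, yr):
    """Return (a, b, mu) chosen among the half-space solutions, or None."""
    xc, xr, yc, yr = (np.asarray(v, dtype=float) for v in (xc, xr, yc, yr))
    denom = s_cov(xc, xc) + s_cov(xr, xr)
    s_c, s_r = s_cov(xc, yc), s_cov(xr, yr)

    candidates = []
    a_plus = (s_c + s_r) / denom      # equation (21)
    if a_plus >= 0:                   # assumed existence condition
        candidates.append(a_plus)
    a_minus = (s_c - s_r) / denom     # equation (24)
    if a_minus < 0:                   # assumed existence condition
        candidates.append(a_minus)
    if not candidates:
        return None                   # third category: neither solution exists

    fits = []
    for a in candidates:
        b = yc.mean() - a * xc.mean()          # (22) / (25)
        mu = yr.mean() - abs(a) * xr.mean()    # (23) / (26)
        fits.append((loss(a, b, mu, xc, xr, yc, yr), a, b, mu))
    _, a, b, mu = min(fits, key=lambda t: t[0])
    return a, b, mu
```

With noise-free data generated from a known (a, b, µ), the function should recover those values up to floating-point error, which gives a simple check of the formulas.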
Theorem 3. Assume model (6)-(7). Let $\{\hat{a}, \hat{b}, \hat{\mu}\}$ be the least squares solution
defined in (13). If $|S(X^c, Y^c)| > |S(X^r, Y^r)|$, then there exists one and only one
half-space solution. More specifically,
Otherwise, $|S(X^c, Y^c)| < |S(X^r, Y^r)|$, and then either both of the half-space
solutions exist, or neither one exists. In particular,
$$E(|\hat{a}| - |a|) = -\frac{2|a| S^2(X^c)}{S^2(X^c) + S^2(X^r)} \left[ P(\hat{a} = a^-) I_{\{a \ge 0\}} + P(\hat{a} = a^+) I_{\{a < 0\}} \right].$$
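Theorem 4. Consider model (6)-(7). Assume $S^2(X^c) = O(1)$ and $S^2(X^r) = O(1)$.
Then, the least squares solution $\{\hat{a}, \hat{b}, \hat{\mu}\}$ in Theorem 3 is asymptotically
unbiased, i.e.
$$E \begin{pmatrix} \hat{a} \\ \hat{b} \\ \hat{\mu} \end{pmatrix} \to \begin{pmatrix} a \\ b \\ \mu \end{pmatrix},$$
as $n \to \infty$.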
4. Simulation
We carry out a systematic simulation study to examine the empirical perfor-
mance of the least squares method proposed in this chapter. First, we consider
the following three models:
Figure 1: Plots of simulated datasets (Y versus X) from models 1, 2, and 3, each
with sample size n = 50. The solid line denotes the regression line
$y = \hat{a}x + \hat{b}$, and the two dashed lines denote the two accompanying
lines $y = \hat{a}x + \hat{b} \pm \hat{\mu}$.
our model, with CCRM providing a baseline of convergence rate and prediction
accuracy. From Models 1, 2, and 3, respectively, we simulate 1000 independent
samples with size n = 20, 50, 100. Then, each sample is randomly split into a
training set (80%) and a validation set (20%). The two models are evaluated by
their sample-variance-adjusted mean squared errors (AMSEs) on the validation
set, which are defined as
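$$\mathrm{AMSE(center)} = \frac{\sum_{i=1}^{m} \left( Y_i^c - \hat{Y}_i^c \right)^2}{\sum_{i=1}^{m} \left( Y_i^c - \overline{Y^c} \right)^2},$$
$$\mathrm{AMSE(radius)} = \frac{\sum_{i=1}^{m} \left( Y_i^r - \hat{Y}_i^r \right)^2}{\sum_{i=1}^{m} \left( Y_i^r - \overline{Y^r} \right)^2},$$
and
$$\mathrm{AMSE(average)} = \frac{\mathrm{AMSE(center)} + \mathrm{AMSE(radius)}}{2},$$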
where m = n/5 is the size of the validation set. We use the R function ccrm in
the iRegression package to implement CCRM. The average results of the 1000
repetitions are summarized in Table 2. For Models 1 and 2, the two methods
perform comparably. Model 3 has a negative µ, so CCRM is slightly worse
than our model due to its positivity restriction on µ. To better show this, we
continue to consider the following two univariate models and one multivariate
model with a much smaller µ:
• Model 4: a = 3, b = 5, µ = −5, ση = 0.5, σλ = 5;
• Model 5: a = −3, b = 5, µ = −5, ση = 0.5, σλ = 5;
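The AMSE criteria used to compare the two methods on a validation set can be sketched in Python as below. This is an illustrative helper, not the authors' code: the function names are hypothetical, and the denominator is assumed to use the sample mean of the validation responses.

```python
import numpy as np


def amse(y, y_hat):
    """Sum of squared prediction errors divided by the sum of squared
    deviations of y from its (assumed validation-set) sample mean."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2))


def amse_summary(yc, yr, yc_hat, yr_hat):
    """AMSE for centers, radii, and their average."""
    center = amse(yc, yc_hat)
    radius = amse(yr, yr_hat)
    return {"center": center, "radius": radius, "average": (center + radius) / 2.0}
```

A value of 1 corresponds to predicting the sample mean for every validation point, so values well below 1 indicate that the fitted model explains a substantial part of the variability.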
Figure 2: Plots of simulated datasets (Y versus X) from models 4 and 5, each
with sample size n = 50.
Administration (NOAA) and are publicly available. The three data sets we ob-
tained specifically are average temperatures for 51 large US cities in January,
April, and July. Each observation contains the averages of minimum and max-
imum temperatures based on weather data collected from 1981 to 2010 by the
NOAA National Climatic Data Center of the United States. July in general is
the hottest month in the US. By this analysis, we aim to predict the summer
(July) temperatures by those in the winter (January) and spring (April). Figure
3 plots the July temperatures versus those in January and April, respectively.
The parameters are estimated according to (14)-(16) as
Denote by $T_{Jan}$, $T_{April}$, and $T_{July}$ the average temperatures in a US city in
January, April, and July, respectively. The prediction for $T_{July}$ based on $T_{Jan}$ and
$T_{April}$ is given by
$$\hat{T}_{July}^c = 10.2510 - 0.4831\, T_{Jan}^c + 1.1926\, T_{April}^c, \tag{27}$$
$$\hat{T}_{July}^r = -3.7071 + 0.4831\, T_{Jan}^r + 1.1926\, T_{April}^r. \tag{28}$$
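To make the use of (27)-(28) concrete, the following Python sketch converts January and April temperature intervals to centers and radii, applies the fitted equations, and converts back to an interval. The coefficients are those reported in (27)-(28); the function names and the example inputs are hypothetical and are not taken from the NOAA data.

```python
def center_radius(interval):
    """Center and radius of an interval given as (lower, upper)."""
    lower, upper = interval
    return (lower + upper) / 2.0, (upper - lower) / 2.0


def predict_july(jan, april):
    """Predicted July interval (lower, upper) in degrees Celsius."""
    jan_c, jan_r = center_radius(jan)
    apr_c, apr_r = center_radius(april)
    center = 10.2510 - 0.4831 * jan_c + 1.1926 * apr_c   # equation (27)
    radius = -3.7071 + 0.4831 * jan_r + 1.1926 * apr_r   # equation (28)
    # The predicted radius is not constrained to be nonnegative by the model.
    return center - radius, center + radius


# Hypothetical city: January in [-5, 5] and April in [5, 15] degrees Celsius.
lower, upper = predict_july((-5.0, 5.0), (5.0, 15.0))
```

Note that the radius equation uses the absolute values of the slopes, which is why the January coefficient appears with a positive sign in (28).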
Table 2. Mean results of AMSE on the validation set based on 1000 inde-
pendent repetitions
$$R^2 = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}} = \frac{\mathrm{SSE}}{\mathrm{SST}} = 0.7458.$$
[Two panels, each titled "Average Temperatures for Large US Cities": July (°C)
versus January (°C), and July (°C) versus April (°C).]
Figure 3: Left: plot of July versus January temperatures. Right: plot of July
versus April temperatures.
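$$\hat{\sigma}_\lambda^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( T_{July,i}^c - \hat{T}_{July,i}^c \right)^2 = 2.1708;$$
$$\hat{\sigma}_\eta^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( T_{July,i}^r - \hat{T}_{July,i}^r \right)^2 = 1.2047.$$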
Thus, by Theorem 2, an upper bound of $P(\hat{T}_{July,i}^r < 0)$ on average is estimated
to be
$$\frac{1}{n} \sum_{i=1}^{n} \frac{\hat{\sigma}_\eta^2}{\left( T_{July,i}^r \right)^2} = \frac{1.2047}{n} \sum_{i=1}^{n} \frac{1}{\left( T_{July,i}^r \right)^2} = 0.047,$$
which is very small and reasonably ignorable. We calculate $\hat{T}_{July,i}^r$ for the entire
sample and all of them are well above 0. So, for these data, although $\hat{\mu} < 0$ and
it is possible to get a negative predicted radius, this in fact never happens because
the model has captured most of the variability. The empirical distributions of
residuals are shown in Figure 4. Both distributions are centered at 0, with the
center residual having a slightly heavier tail.
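The variance estimate and the averaged bound on the probability of a negative predicted radius can be computed along the following lines. This is a sketch under the assumption that observed and fitted radii are available as arrays; the function names are hypothetical, and the bound is the averaged ratio of the estimated residual variance to the squared observed radius, as in the display above.

```python
import numpy as np


def residual_variance(y, y_hat):
    """Residual variance estimate with divisor n - 1, as used for the
    estimates of sigma_lambda^2 and sigma_eta^2 in the text."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.sum((y - y_hat) ** 2) / (len(y) - 1))


def negative_radius_bound(yr, yr_hat):
    """Average over the sample of sigma_eta^2 / (observed radius)^2."""
    yr = np.asarray(yr, dtype=float)
    sigma2_eta = residual_variance(yr, yr_hat)
    return float(np.mean(sigma2_eta / yr ** 2))
```

The bound is only informative when the observed radii are large relative to the residual standard deviation; radii close to zero make the individual terms large.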
[Plot titled "Probability Density Plots of Residuals": probability density
versus residuals, with one curve for $T_{July}^c - \hat{T}_{July}^c$ and one for
$T_{July}^r - \hat{T}_{July}^r$.]
Figure 4: Empirical probability density plots of the residuals for the center and
radius.
Conclusion
We have rigorously studied linear regression for interval-valued data in the
metric space $(\mathcal{K}_C, \delta)$. The new model we introduce generalizes previous models
in the literature so that the Hukuhara difference $Y_i \ominus (aX_i + b)$ need not exist.
Analogous to classical linear regression, our model together with the LS
estimation leads to a partition of the total sum of squares (SST) into the explained
sum of squares (SSE) and the residual sum of squares (SSR) in $(\mathcal{K}_C, \delta)$, which
implies that the residual is uncorrelated with the linear predictor in $(\mathcal{K}_C, \delta)$. In
addition, we have carried out theoretical investigations into the least squares
estimation for the univariate model. It is shown that the LS estimates in $(\mathcal{K}_C, \delta)$
are biased but the biases reduce to zero as the sample size tends to infinity.
Therefore, a bias-correction technique for small sample estimation could be a
good future topic. The simulation study confirms our theoretical findings and
shows that the least squares estimators perform satisfactorily for moderate
sample sizes.
Appendix: Proofs
Proof of Proposition 1
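Proof. Differentiating L with respect to $\mu$, $b$, and $a_j$, $j = 1, \cdots, p$, respectively,
and setting the derivatives to zero, we get
$$\frac{\partial L}{\partial \mu} \propto \sum_{i=1}^{n} \left( \hat{Y}_i^r - Y_i^r \right) = 0, \tag{29}$$
$$\frac{\partial L}{\partial b} \propto \sum_{i=1}^{n} \left( \hat{Y}_i^c - Y_i^c \right) = 0, \tag{30}$$
$$\frac{\partial L}{\partial a_k} \propto \sum_{i=1}^{n} \left( \hat{Y}_i^c - Y_i^c \right) X_{k,i}^c + \sum_{i=1}^{n} \left( \hat{Y}_i^r - Y_i^r \right) \mathrm{sgn}(a_k) X_{k,i}^r = 0, \tag{31}$$
$k = 1, \cdots, p$.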
Equations (14) are obtained by plugging (32)-(33) into (31), and equations (15)-
(16) follow from (32)-(33). This completes the proof.
The last equation is due to (29)-(30). Further in view of (11)-(12) and (31), we
have
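\[
\begin{aligned}
&\sum_{i=1}^{n} \left[ \left( Y_i^c - \hat{Y}_i^c \right) \hat{Y}_i^c + \left( Y_i^r - \hat{Y}_i^r \right) \hat{Y}_i^r \right] \\
&\quad = \sum_{i=1}^{n} \left[ \left( Y_i^c - \hat{Y}_i^c \right) \sum_{j=1}^{p} a_j X_{j,i}^c + \left( Y_i^r - \hat{Y}_i^r \right) \sum_{j=1}^{p} |a_j| X_{j,i}^r \right] \\
&\quad = \sum_{j=1}^{p} a_j \sum_{i=1}^{n} \left[ \left( Y_i^c - \hat{Y}_i^c \right) X_{j,i}^c + \left( Y_i^r - \hat{Y}_i^r \right) \mathrm{sgn}(a_j)\, X_{j,i}^r \right] \\
&\quad = 0.
\end{aligned}
\]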
\[
P\left( \hat{Y}_i^r < 0 \right) = P\left( \hat{Y}_i^r - Y_i^r < -Y_i^r \right) \le P\left( |\hat{Y}_i^r - Y_i^r| > Y_i^r \right).
\]
Case I: a ≥ 0.
\[
\begin{aligned}
a^+ - a
&= \frac{\sum_{i<j} (X_i^c - X_j^c)(Y_i^c - Y_j^c) + \sum_{i<j} (X_i^r - X_j^r)(Y_i^r - Y_j^r)}{\sum_{i<j} (X_i^c - X_j^c)^2 + \sum_{i<j} (X_i^r - X_j^r)^2} - a \\
&= \frac{\sum_{i<j} (X_i^c - X_j^c)\left[ (Y_i^c - Y_j^c) - a(X_i^c - X_j^c) \right] + \sum_{i<j} (X_i^r - X_j^r)\left[ (Y_i^r - Y_j^r) - a(X_i^r - X_j^r) \right]}{\sum_{i<j} (X_i^c - X_j^c)^2 + \sum_{i<j} (X_i^r - X_j^r)^2} \\
&= \frac{\sum_{i<j} (X_i^c - X_j^c)(\lambda_i - \lambda_j) + \sum_{i<j} (X_i^r - X_j^r)(\eta_i - \eta_j)}{\sum_{i<j} (X_i^c - X_j^c)^2 + \sum_{i<j} (X_i^r - X_j^r)^2}.
\end{aligned}
\]
\[
E\left( a^+ - a \right) = 0. \tag{35}
\]
Similarly,
and consequently,
\[
E\left( a^- - a \right) = -\frac{2a S^2(X^r)}{S^2(X^c) + S^2(X^r)}. \tag{36}
\]
Notice now
since a ≥ 0. Therefore,
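\[
\begin{aligned}
E(\hat{a} - a)
&= -E\left\{ \frac{2 \sum_{i<j} (X_i^r - X_j^r)\left[ a(X_i^r - X_j^r) + (\eta_i - \eta_j) \right]}{\sum_{i<j} (X_i^c - X_j^c)^2 + \sum_{i<j} (X_i^r - X_j^r)^2}\, I_{\{\hat{a} = a^-\}} \right\} \\
&= -\frac{2 \sum_{i<j} \left[ |a| (X_i^r - X_j^r)^2 P(\hat{a} = a^-) + (X_i^r - X_j^r)\, E\left( (\eta_i - \eta_j) I_{\{\hat{a} = a^-\}} \right) \right]}{\sum_{i<j} (X_i^c - X_j^c)^2 + \sum_{i<j} (X_i^r - X_j^r)^2} \\
&= -\frac{2a \sum_{i<j} (X_i^r - X_j^r)^2 P(\hat{a} = a^-)}{\sum_{i<j} (X_i^c - X_j^c)^2 + \sum_{i<j} (X_i^r - X_j^r)^2} \\
&= -\frac{2a S^2(X^r)}{S^2(X^c) + S^2(X^r)}\, P(\hat{a} = a^-).
\end{aligned} \tag{40}
\]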
These imply
\[
E(a^+ - a) = -\frac{2a S^2(X^r)}{S^2(X^c) + S^2(X^r)},
\qquad
E(a^- - a) = 0.
\]
\[
S(X^v, Y^v) = \frac{1}{n^2} \sum_{i<j} (X_i^v - X_j^v)(Y_i^v - Y_j^v), \tag{47}
\]
\[
S^2(X^v) = \frac{1}{n^2} \sum_{i<j} (X_i^v - X_j^v)^2. \tag{48}
\]
(48) follows by replacing $Y_i^v$ with $X_i^v$ and $Y_j^v$ with $X_j^v$ in the above calculations.
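The pairwise form of the sample covariance in (47) can be checked numerically against the usual mean-centered form. The sketch below relies on the standard identity that the sum over i < j of (x_i − x_j)(y_i − y_j) equals n times the sum of (x_i − x̄)(y_i − ȳ); the function names and the small data set are hypothetical.

```python
from itertools import combinations

def pairwise_cov(x, y):
    """S(X, Y) as in (47): (1/n^2) * sum over i<j of (x_i - x_j)(y_i - y_j)."""
    n = len(x)
    total = sum((x[i] - x[j]) * (y[i] - y[j])
                for i, j in combinations(range(n), 2))
    return total / n ** 2

def centered_cov(x, y):
    """(1/n) * sum of (x_i - mean(x)) * (y_i - mean(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / n

x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 1.0, 5.0, 6.0]
print(pairwise_cov(x, y), centered_cov(x, y))  # both equal 4.25
print(pairwise_cov(x, x), centered_cov(x, x))  # S^2(X) as in (48)
```

The same identity is what allows sums over pairs to be rewritten as sums of deviations from the mean later in the appendix.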
\[
P\left( \hat{a} = a^- \mid a \ge 0 \right) \to 0,
\qquad
P\left( \hat{a} = a^+ \mid a < 0 \right) \to 0,
\]
as $n \to \infty$.
Proof. We prove the case a ≥ 0 only. The case a < 0 can be proved similarly.
Under the assumption that $a \ge 0$,
\[
\mathrm{Cov}(X^c, Y^c) = a\,\mathrm{Var}(X^c) \ge 0,
\]
and consequently, $P\left( S(X^c, Y^c) < 0 \right) \to 0$. According to Theorem 3, the only
other circumstance under which $\hat{a} = a^-$ is when $S(X^r, Y^r) > S(X^c, Y^c) > 0$ and
$L(a^+, b^+, \mu^+) > L(a^-, b^-, \mu^-)$ simultaneously. It is therefore sufficient to show
that
\[
P\left( S(X^r, Y^r) > S(X^c, Y^c) > 0,\; L(a^+, b^+, \mu^+) > L(a^-, b^-, \mu^-) \right) \to 0. \tag{49}
\]
Notice
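\[
\begin{aligned}
&L(a^+, b^+, \mu^+) - L(a^-, b^-, \mu^-) \\
&\quad = \frac{1}{n} \sum_{i=1}^{n} \left[ \left( a^+ X_i^c + b^+ - Y_i^c \right)^2 - \left( a^- X_i^c + b^- - Y_i^c \right)^2 \right] \\
&\qquad + \frac{1}{n} \sum_{i=1}^{n} \left[ \left( a^+ X_i^r + \mu^+ - Y_i^r \right)^2 - \left( a^- X_i^r + \mu^- - Y_i^r \right)^2 \right] \\
&\quad := \frac{1}{n} \left( I + II \right).
\end{aligned}
\]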
The first term
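\[
\begin{aligned}
I &= \sum_{i=1}^{n} \left[ \left( a^+ X_i^c + b^+ - Y_i^c \right)^2 - \left( a^- X_i^c + b^- - Y_i^c \right)^2 \right] \\
&= \sum_{i=1}^{n} \left[ (a^+ - a)^2 \left( X_i^c - \bar{X}^c \right)^2 + \left( \lambda_i - \bar{\lambda} \right)^2 - 2 (a^+ - a) \left( X_i^c - \bar{X}^c \right) \left( \lambda_i - \bar{\lambda} \right) \right] \\
&\quad - \sum_{i=1}^{n} \left[ (a^- - a)^2 \left( X_i^c - \bar{X}^c \right)^2 + \left( \lambda_i - \bar{\lambda} \right)^2 - 2 (a^- - a) \left( X_i^c - \bar{X}^c \right) \left( \lambda_i - \bar{\lambda} \right) \right] \\
&= \left[ (a^+ - a)^2 - (a^- - a)^2 \right] \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right)^2
- 2 (a^+ - a^-) \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right) \left( \lambda_i - \bar{\lambda} \right) \\
&= (a^+ - a^-) \left[ (a^+ + a^- - 2a) \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right)^2 - 2 \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right) \left( \lambda_i - \bar{\lambda} \right) \right].
\end{aligned}
\]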
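From this, and the assumption that $S(X^r, Y^r) > S(X^c, Y^c) > 0$, we see that $I > 0$
is equivalent to
\[
\left( \frac{a^+ + a^-}{2} - a \right) \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right)^2 - \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right) \left( \lambda_i - \bar{\lambda} \right) > 0. \tag{50}
\]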
On the other hand,
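\[
\begin{aligned}
\text{(50)}
&= \left( \frac{S(X^c, Y^c)}{S^2(X^c) + S^2(X^r)} - a \right) \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right)^2 - \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right) \left( \lambda_i - \bar{\lambda} \right) \\
&= \left( \frac{\sum_{i<j} (X_i^c - X_j^c)(\lambda_i - \lambda_j)}{\sum_{i<j} (X_i^c - X_j^c)^2 + \sum_{i<j} (X_i^r - X_j^r)^2} - a \frac{S^2(X^r)}{S^2(X^c) + S^2(X^r)} \right) \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right)^2 \\
&\quad - \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right) \left( \lambda_i - \bar{\lambda} \right) \\
&= \frac{\sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right)^2}{\sum_{i<j} (X_i^c - X_j^c)^2 + \sum_{i<j} (X_i^r - X_j^r)^2} \sum_{i<j} (X_i^c - X_j^c)(\lambda_i - \lambda_j) \\
&\quad - \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right) \left( \lambda_i - \bar{\lambda} \right) - a \frac{S^2(X^r)}{S^2(X^c) + S^2(X^r)} \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right)^2 \\
&= \frac{\sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right)^2}{\sum_{i<j} (X_i^c - X_j^c)^2 + \sum_{i<j} (X_i^r - X_j^r)^2} \left[ n \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right) \left( \lambda_i - \bar{\lambda} \right) \right] \\
&\quad - \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right) \left( \lambda_i - \bar{\lambda} \right) - a \frac{S^2(X^r)}{S^2(X^c) + S^2(X^r)} \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right)^2 \\
&= \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right) \left( \lambda_i - \bar{\lambda} \right) \left[ \frac{S^2(X^c)}{S^2(X^c) + S^2(X^r)} - 1 \right]
- a \frac{S^2(X^r)}{S^2(X^c) + S^2(X^r)} \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right)^2 \\
&= -\frac{S^2(X^r)}{S^2(X^c) + S^2(X^r)}\, n \left[ a S^2(X^c) + S(X^c, \lambda) \right],
\end{aligned}
\]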
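where $S(X^c, \lambda) = \frac{1}{n} \sum_{i=1}^{n} \left( X_i^c - \bar{X}^c \right) \left( \lambda_i - \bar{\lambda} \right)$ denotes the sample covariance of
the random variables $X^c$ and $\lambda$, which converges to 0 almost surely by the
independence assumption. Therefore,
\[
\frac{1}{n} I = -2 \left( a^+ - a^- \right) \frac{S^2(X^r)}{S^2(X^c) + S^2(X^r)} \left[ a S^2(X^c) + S(X^c, \lambda) \right] \to C_1 < 0 \tag{51}
\]
almost surely, as $n \to \infty$.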
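The limit P(â = a− | a ≥ 0) → 0 can be probed with a small Monte Carlo sketch. It assumes a univariate model with true slope a > 0 and independent normal errors, and it picks between the non-negative and the negative branch by comparing the profiled least squares loss. The distributions, parameter values and function names are hypothetical choices for illustration, not the authors' simulation design.

```python
import random

def centered(v):
    """Deviations of v from its mean."""
    m = sum(v) / len(v)
    return [t - m for t in v]

def ls_slope(xc, xr, yc, yr):
    """Slope estimate: the branch candidate with the smaller profiled loss."""
    dxc, dxr, dyc, dyr = centered(xc), centered(xr), centered(yc), centered(yr)
    dot = lambda u, v: sum(p * q for p, q in zip(u, v))
    denom = dot(dxc, dxc) + dot(dxr, dxr)
    a_plus = max((dot(dxc, dyc) + dot(dxr, dyr)) / denom, 0.0)
    a_minus = min((dot(dxc, dyc) - dot(dxr, dyr)) / denom, 0.0)

    def loss(a):
        # Loss after the intercepts b and mu have been profiled out.
        return (sum((a * x - y) ** 2 for x, y in zip(dxc, dyc))
                + sum((abs(a) * x - y) ** 2 for x, y in zip(dxr, dyr)))

    return a_plus if loss(a_plus) <= loss(a_minus) else a_minus

def negative_branch_rate(n, a, reps, rng):
    """Fraction of simulated samples in which the estimated slope is negative."""
    count = 0
    for _ in range(reps):
        xc = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xr = [rng.uniform(0.5, 1.5) for _ in range(n)]
        yc = [a * c + 1.0 + rng.gauss(0.0, 1.0) for c in xc]
        yr = [a * r + 2.0 + rng.gauss(0.0, 0.3) for r in xr]
        if ls_slope(xc, xr, yc, yr) < 0:
            count += 1
    return count / reps

rng = random.Random(1)
rate_small = negative_branch_rate(n=4, a=0.3, reps=500, rng=rng)
rate_large = negative_branch_rate(n=200, a=0.3, reps=500, rng=rng)
print(rate_small, rate_large)
```

Under these assumptions the negative branch is selected in a sizeable share of the very small samples and rarely, if at all, in the large ones, in line with the limit stated in the proposition.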
References
[1] Artstein, Z, & Vitale, R.A. (1975). A strong law of large numbers for ran-
dom compact sets. Annals of Probability, 5, 879-882.
[2] Aumann, R.J. (1965). Integrals of set-valued functions. J. Math. Anal.
Appl., 12, 1-12.
[3] Billard, L., & Diday, E. (2000). Regression analysis for interval-valued
data. In: Data Analysis, Classification and Related Methods, Proceedings
of the Seventh Conference of the International Federation of Classification
Societies (IFCS’00). Springer, Berlin; 369-374.
[4] Billard, L., & Diday, E. (2002). Symbolic regression analysis. In: Classi-
fication, Clustering and Data Analysis, Proceedings of the Eighth Confer-
ence of the International Federation of Classification Societies (IFCS’02).
Springer, Berlin; 281-288.
[8] Carvalho, F.A.T., Lima Neto, E.A., & Tenorio, C.P. (2004). A new method
to fit a linear regression model for interval-valued data. Lecture Notes in
Computer Sciences, 3238, 295-306.
[11] Gil, M.A., Lopez, M.T., Lubiano, M.A., & Montenegro, M. (2001).
Regression and correlation analyses of a linear relation between random
intervals. Test, 10, 183-201.
[12] Gil, M.A., Lubiano, M.A., Montenegro, M., & Lopez, M.T. (2002). Least
squares fitting of an affine function and strength of association for interval-
valued data. Metrika, 56, 97-111.
[14] González-Rodríguez, G., Blanco, A., Corral, N., & Colubi, A. (2007).
Least squares estimation of linear regression models for convex compact
random sets. Advances in Data Analysis and Classification, 1, 67-81.
[17] Kendall, D.G. (1974). Foundations of a theory of random sets. In: Harding
EF, & Kendall DG (Eds), Stochastic Geometry. New York: John Wiley &
Sons.
[18] Körner, R. (1995). A variance of compact convex random sets. Institut für
Stochastik, Bernhard-von-Cotta-Str. 2 09599 Freiberg.
[19] Körner, R. (1997). On the variance of fuzzy random variables. Fuzzy Sets
and Systems, 92, 83-93.
[20] Körner, R., & Näther, W. (1998). Linear regression with random fuzzy
variables: extended classical estimates, best linear estimates, least squares
estimates. Information Sciences, 109, 95-118.
[21] Lyashenko, N.N. (1982). Limit theorem for sums of independent compact
random subsets of Euclidean space. Journal of Soviet Mathematics, 20,
2187-2196.
[23] Manski, C.F., & Tamer, T. (2002). Inference on regressions with interval
data on a regressor or outcome. Econometrica, 70, 519-546.
[24] Matheron, G. (1975). Random Sets and Integral Geometry. New York:
John Wiley & Sons.
[26] Lima Neto, E.A., & Carvalho, F.A.T. (2008). Centre and range method for
fitting a linear regression model to symbolic interval data. Computational
Statistics & Data Analysis, 52, 1500-1515.
[27] Lima Neto, E.A., & Carvalho, F.A.T. (2010). Constrained linear regression
models for symbolic interval-valued variables. Computational Statistics &
Data Analysis, 54, 333-347.
Chapter 4
ABSTRACT
*
Corresponding Author: [email protected].
150 Gabriela-Nicoleta Moroi
INTRODUCTION
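\[
q_e = \frac{(C_0 - C_e)\, V}{m} \quad (\text{mmol g}^{-1}) \tag{1}
\]
- Llin1:
\[
\frac{C_e}{q_e} = \frac{1}{q_m} C_e + \frac{1}{K_L q_m} \tag{2}
\]
- Llin2:
\[
\frac{1}{q_e} = \frac{1}{K_L q_m} \cdot \frac{1}{C_e} + \frac{1}{q_m} \tag{3}
\]
- Llin3:
\[
q_e = q_m - \frac{1}{K_L} \cdot \frac{q_e}{C_e} \tag{4}
\]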
- Llin4:
\[
\frac{q_e}{C_e} = -K_L q_e + K_L q_m \tag{5}
\]
- Lnonlin:
\[
q_e = \frac{q_m K_L C_e}{1 + K_L C_e} \tag{6}
\]
\[
R_L = \frac{1}{1 + K_L C_{0h}} \tag{7}
\]
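As a minimal sketch of how a linearized Langmuir form is used, the code below generates noise-free data from the non-linear form (6), fits the Llin1 line (2) by ordinary least squares, and recovers qm and KL from the slope and intercept; it then evaluates RL from (7). The qm and KL values are those quoted in the text for Cd(II); the Ce grid, the C0h value and all function names are hypothetical.

```python
def linear_fit(x, y):
    """Ordinary least squares line y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def langmuir(ce, qm, kl):
    """Non-linear Langmuir isotherm, equation (6)."""
    return qm * kl * ce / (1.0 + kl * ce)

qm_true, kl_true = 0.112, 3.60            # mmol/g and L/mmol, Cd(II) values from the text
ce = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]      # hypothetical equilibrium concentrations, mmol/L
qe = [langmuir(c, qm_true, kl_true) for c in ce]

# Llin1, equation (2): Ce/qe = (1/qm) * Ce + 1/(KL * qm)
slope, intercept = linear_fit(ce, [c / q for c, q in zip(ce, qe)])
qm_fit = 1.0 / slope
kl_fit = slope / intercept

c0h = 0.2                                 # hypothetical highest initial concentration, mmol/L
rl = 1.0 / (1.0 + kl_fit * c0h)           # separation factor, equation (7)
print(qm_fit, kl_fit, rl)
```

With noise-free data every linear form returns the same parameters; the forms differ on experimental data because each transformation weights the measurement errors differently.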
- Tlin:
\[
q_e = \frac{RT}{b_T} \ln C_e + \frac{RT}{b_T} \ln K_T \tag{9}
\]
- Tnonlin:
\[
q_e = \frac{RT}{b_T} \ln \left( K_T C_e \right) \tag{10}
\]
and McKay 1998). Linear form, whose plot is ln qe versus ln Ce (Flin),
and non-linear form (Fnonlin) of Freundlich isotherm are presented below:
- Flin:
\[
\ln q_e = \frac{1}{n} \ln C_e + \ln K_F \tag{11}
\]
- Fnonlin:
\[
q_e = K_F C_e^{1/n} \tag{12}
\]
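The Flin form (11) amounts to a straight-line fit in log-log coordinates. The sketch below generates noise-free data from (12) and recovers 1/n as the slope and KF as the exponential of the intercept. The KF and 1/n values are those quoted in the text for Cd(II); the Ce grid and the function name are hypothetical.

```python
import math

def linear_fit(x, y):
    """Ordinary least squares line y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

kf_true, inv_n_true = 0.081, 0.312        # Cd(II) values quoted in the text
ce = [0.05, 0.1, 0.2, 0.5, 1.0, 2.0]      # hypothetical equilibrium concentrations
qe = [kf_true * c ** inv_n_true for c in ce]   # Fnonlin, equation (12)

# Flin, equation (11): ln qe = (1/n) * ln Ce + ln KF
slope, intercept = linear_fit([math.log(c) for c in ce],
                              [math.log(q) for q in qe])
inv_n_fit = slope
kf_fit = math.exp(intercept)
print(inv_n_fit, kf_fit)
```

Because the fit minimizes squared errors in ln qe rather than in qe, estimates from Flin and Fnonlin generally differ once the data contain measurement error.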
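\[
SSE = \sum_{i=1}^{n} \left( q_e - \hat{q}_e \right)_i^2 \tag{13}
\]
\[
ARE = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{q_e - \hat{q}_e}{q_e} \right|_i \tag{14}
\]
\[
EABS = \frac{1}{n} \sum_{i=1}^{n} \left| q_e - \hat{q}_e \right|_i \tag{15}
\]
\[
SAE = \sum_{i=1}^{n} \left| q_e - \hat{q}_e \right|_i \tag{16}
\]
\[
RMSE = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( q_e - \hat{q}_e \right)_i^2 } \tag{17}
\]
\[
MPSD = 100 \sqrt{ \frac{1}{n - p} \sum_{i=1}^{n} \left( \frac{q_e - \hat{q}_e}{q_e} \right)_i^2 } \tag{18}
\]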
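\[
ARSE = \sqrt{ \frac{1}{n - 1} \sum_{i=1}^{n} \left( \frac{q_e - \hat{q}_e}{q_e} \right)_i^2 } \tag{19}
\]
\[
HYBRID = \frac{100}{n - p} \sum_{i=1}^{n} \left[ \frac{\left( q_e - \hat{q}_e \right)^2}{q_e} \right]_i \tag{20}
\]
\[
CST = \sum_{i=1}^{n} \left[ \frac{\left( q_e - \hat{q}_e \right)^2}{\hat{q}_e} \right]_i \tag{21}
\]
\[
ADRSQ = 1 - \left( 1 - R^2 \right) \frac{n - 1}{n - (k + 1)} \tag{22}
\]
\[
R^2 = \frac{ \sum_{i=1}^{n} \left( \hat{q}_e - \bar{q}_e \right)_i^2 }{ \sum_{i=1}^{n} \left( \hat{q}_e - \bar{q}_e \right)_i^2 + \sum_{i=1}^{n} \left( \hat{q}_e - q_e \right)_i^2 } \tag{23}
\]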
Linear Regression versus Non-Linear Regression … 161
For all error functions except ADRSQ, the lower the value, the closer
the match between q̂e and qe; for ADRSQ, whose values may vary from 0
to 1, a higher value indicates that q̂e more closely matches qe.
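The error functions (13)-(23) can be computed together from measured and predicted qe values, as in the sketch below. It follows the equations as reconstructed here, reading p as the number of isotherm parameters and k as the number of independent variables; those readings, the default values p = 2 and k = 1, the function name and the three data points are all assumptions made for illustration.

```python
import math

def error_functions(q, q_hat, p=2, k=1):
    """Error functions (13)-(23) for measured q and predicted q_hat."""
    n = len(q)
    res = [a - b for a, b in zip(q, q_hat)]      # q_e - q_hat_e
    rel = [r / a for r, a in zip(res, q)]        # relative residuals
    q_mean = sum(q) / n
    sse = sum(r * r for r in res)
    explained = sum((b - q_mean) ** 2 for b in q_hat)
    r2 = explained / (explained + sse)           # equation (23)
    return {
        "SSE": sse,
        "ARE": 100.0 / n * sum(abs(t) for t in rel),
        "EABS": sum(abs(r) for r in res) / n,
        "SAE": sum(abs(r) for r in res),
        "RMSE": math.sqrt(sse / n),
        "MPSD": 100.0 * math.sqrt(sum(t * t for t in rel) / (n - p)),
        "ARSE": math.sqrt(sum(t * t for t in rel) / (n - 1)),
        "HYBRID": 100.0 / (n - p) * sum(r * r / a for r, a in zip(res, q)),
        "CST": sum(r * r / b for r, b in zip(res, q_hat)),
        "R2": r2,
        "ADRSQ": 1.0 - (1.0 - r2) * (n - 1) / (n - (k + 1)),
    }

# Hypothetical measured and predicted values.
errors = error_functions([1.0, 2.0, 4.0], [1.5, 1.5, 4.0])
for name, value in errors.items():
    print(name, round(value, 4))
```

Note that MPSD and HYBRID require n > p and ADRSQ requires n > k + 1, so very small data sets cannot be scored with every function.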
After determining the values of all error functions for all linear and
non-linear isotherm model forms, the following calculations were
performed for comparison reasons:
- for every error function, percent deviation (EPD) of each of its values
(E) with respect to the best of these values (E0, which is the maximum
value for ADRSQ and the minimum value for the other error functions)
was determined:
\[
EPD = \frac{\left| E - E_0 \right|}{E_0} \cdot 100 \ (\%) \tag{24}
\]
- for each isotherm model form, the sum of EPD values of all error
functions (SEPD) was calculated:
\[
SEPD = \sum_{i=1}^{9} EPD_i \ (\%) \tag{25}
\]
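The two comparison steps, EPD per error function and SEPD per isotherm form, can be sketched as follows. The sketch assumes that EPD is taken as an absolute deviation relative to the best value E0 and that E0 is non-zero; the form names and error values are hypothetical and cover only two error functions instead of nine.

```python
def epd_table(errors, maximize=("ADRSQ",)):
    """errors: {form: {error_name: value}} -> {form: {error_name: EPD in %}}."""
    names = list(next(iter(errors.values())).keys())
    table = {form: {} for form in errors}
    for name in names:
        values = [errors[form][name] for form in errors]
        # Best value E0: the maximum for ADRSQ, the minimum otherwise.
        best = max(values) if name in maximize else min(values)
        for form in errors:
            # Assumes best != 0; a zero best value would need separate handling.
            table[form][name] = abs(errors[form][name] - best) / best * 100.0
    return table

def sepd(table):
    """Sum of EPD values over all error functions, per isotherm form."""
    return {form: sum(values.values()) for form, values in table.items()}

# Hypothetical error values for two model forms.
errors = {
    "form_A": {"SSE": 2.0, "ADRSQ": 0.90},
    "form_B": {"SSE": 1.0, "ADRSQ": 0.99},
}
table = epd_table(errors)
totals = sepd(table)
print(table)
print(totals)
```

A form that attains the best value for every error function gets SEPD = 0, so smaller SEPD values indicate a better overall fit.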
Figure 1. Linear Langmuir Llin1 (a), Langmuir Llin2 (b), Langmuir Llin3 (c),
Langmuir Llin4 (d), Temkin (e) and Freundlich (f) isotherms for Cd(II) adsorption.
Figure 2. Linear Langmuir Llin1 (a), Langmuir Llin2 (b), Langmuir Llin3 (c),
Langmuir Llin4 (d), Temkin (e) and Freundlich (f) isotherms for Pb(II) adsorption.
is better than Lnonlin as the former has six EFmin (ARE, EABS, SAE,
MPSD, ARSE and ADRSQ), whereas the latter has only three (RMSE,
HYBRID and CST), and SEPD value of the former (20.42%) is smaller
than that of the latter (36.54%); for Pb(II) adsorption, Llin4 is better than
Lnonlin as indicated by seven EFmin (ARE, EABS, SAE, MPSD, ARSE,
HYBRID and CST) versus two (RMSE and ADRSQ) and a smaller SEPD
value (1153 versus 1377%). Concerning Freundlich isotherm, for Cd(II)
adsorption, Fnonlin is better than Flin since it has six EFmin (ARE, EABS,
SAE, RMSE, CST and ADRSQ) versus three (MPSD, ARSE and
HYBRID) and a smaller SEPD value (1282 versus 1300%); for Pb(II)
adsorption, Fnonlin compared with Flin presents six EFmin (ARE, EABS,
SAE, RMSE, CST and ADRSQ) versus three (MPSD, ARSE and
HYBRID) and a slightly larger SEPD value (1737 versus 1731%). As
regards the Temkin isotherm, for Cd(II) adsorption, Tlin and Tnonlin have
equal error values and, consequently, identical EPD values and the same
SEPD value (100.9%); for Pb(II) adsorption, Tlin and Tnonlin have
practically the same error values and, as a consequence, EPD and SEPD
values that are very similar to each other (0 or very close to 0). It is
noteworthy that, for adsorption of the two Me species, modeling results
provided by linear regression are better than (Langmuir isotherm) or very
similar to (Temkin isotherm) those offered by non-linear regression, which
is in agreement with previously published data (Ho et al. 2002). The
regression type giving the best fit differs from one model to another,
being linear for the Langmuir isotherm, non-linear for the Freundlich
isotherm, and both linear and non-linear for the Temkin isotherm.
Then, a comparison is made among linear forms of all isotherms for
adsorption of each Me species. For Cd(II) adsorption, Llin1 is the best with
seven EFmin (ARE, EABS, SAE, MPSD, ARSE, HYBRID and ADRSQ),
while Tlin has two (RMSE and CST) and Flin none, SEPD values for
Llin1, Tlin and Flin being 20.42, 100.9 and 1300%, respectively (Figure
5). For Pb(II) adsorption, the best is Tlin, which has nine EFmin and the
smallest SEPD value (0 versus 1153% for Llin4 and 1731% for Flin)
(Figure 6).
Table 3. Values of error functions for Cd(II) adsorption
Table 6. Error percent deviations (EPD) and EPD sums (SEPD) of linear Langmuir isotherm forms
for Pb(II) adsorption
Figure 5. Error percent deviations (EPD) (a) and EPD sums (b) of linear and non-linear
Langmuir, Temkin and Freundlich isotherms for Cd(II) adsorption.
Figure 6. Error percent deviations (EPD) (a) and EPD sums (b) of linear and non-linear
Langmuir, Temkin and Freundlich isotherms for Pb(II) adsorption.
Lnonlin, Tlin and Tnonlin are below 30%, whereas most of those for Flin
and Fnonlin lie within the range of 135–330% (Figure 5). For Pb(II)
adsorption, the isotherm validity order revealed by linear regression is
the same as that indicated by non-linear regression, i.e., Temkin >
Langmuir > Freundlich; EPD (except ADRSQ) values for Tlin and Tnonlin
are equal or very close to 0, whereas those for Llin4, Lnonlin, Flin and
Fnonlin range mostly from 130 to 490% (Figure 6). It is worth
emphasizing that, for adsorption of both Me species, the descending order
of isotherm model validities established by linear regression is identical
to that determined via non-linear regression. Among all linear and
non-linear isotherm model forms considered, i.e., Llin1/Llin4, Lnonlin,
Flin, Fnonlin, Tlin and Tnonlin, the highest validity is presented, for
Cd(II) adsorption, by linear form Llin1 and, for Pb(II) adsorption, to
practically the same extent by linear form Tlin and non-linear form
Tnonlin.
The analysis of isotherm parameter values predicted by mathematical
modeling gives useful information on adsorption of Cd(II) and Pb(II)
(Tables 1 and 2, respectively). The values of qm for Cd(II) and Pb(II)
adsorption (0.112 and 0.074 mmol g−1, respectively) are close to the
highest corresponding qe values (0.100 and 0.075 mmol g−1, respectively);
the qm value for Cd(II) is larger than that for Pb(II), as expected
considering qe values. The binding energy towards Pb(II) is higher than
that towards Cd(II), as indicated by the larger KL value for the former Me
species (20.52 L mmol−1) compared with that for the latter (3.60 L
mmol−1). The Temkin isotherm fits the experimental data well (to similar
extents with linear and non-linear regression), confirming that Me
chemisorption takes place. Strong Me–adsorbent interactions consistent
with chemisorption are indicated by the high values of the Temkin
parameters bT and KT (those estimated by linear regression are very close
to the corresponding ones determined by non-linear regression for
adsorption of both Me species); parameter values for Pb(II) are larger
than the corresponding ones for Cd(II), pointing out that the forces
binding the former Me species are stronger than those holding the latter,
which is in agreement with what KL values indicate (Zafar et al. 2007). Of
the three models, the Freundlich isotherm gives the poorest fit to
experimental data for each Me species, excluding the possibility that
multilayer adsorption takes place and further confirming the occurrence of
chemisorption that results in monolayer coverage of the adsorbent surface
(McKay 1995). The values of 1/n (0.312 and 0.241 for Cd(II) and Pb(II),
respectively) lie within the range 0–1, showing favorable conditions for
Me adsorption and therefore easy Me removal from aqueous solutions
(Subramanyam and Das 2009; Hamdaoui and Naffrechoux 2007). The KF value
for Cd(II) is higher than that for Pb(II) (0.081 and 0.074 mmol^(1−1/n)
L^(1/n) g^(−1), respectively), which is in accordance with the larger qm
value of the former Me species compared with that of the latter. The
values of ΔG0 are negative for adsorption of both Me species (−19.96 and
−24.20 kJ mol−1 for Cd(II) and Pb(II), respectively), indicating the
feasibility and spontaneous nature of adsorption (Boparai et al. 2011).
The RL values (0.581 and 0.315 for Cd(II) and Pb(II), respectively) lie
within the 0–1 range, which points out that adsorption is favorable and
reveals that ILLF-SDVB beads constitute a good adsorbent for the two Me
species.
CONCLUSION
REFERENCES
Brdar, M. M.; Takači, A. A.; Šćiban, M. B.; Rakić, D. Z. Isotherms for the
adsorption of Cu(II) onto lignin – comparison of linear and non-linear
methods. Hemijska Industrija 2012, 66, 497–503.
Casas, J. S.; Sordo, J. Lead: chemistry, analytical aspects, environmental
impact and health effects (1st ed.); Elsevier: Amsterdam, 2006.
Foo, K. Y.; Hameed, B. H. Insights into the modeling of adsorption
isotherm systems. Chemical Engineering Journal 2010, 156, 2–10.
Freundlich, H. M. F. Über die adsorption in lösungen. Zeitschrift für
Physikalische Chemie 1906, 57A, 385–470. [Adsorption in solution.
Journal of Physical Chemistry 57A: 385–470].
Hamdaoui, O.; Naffrechoux, E. Modeling of adsorption isotherms of
phenol and chlorophenols onto granular activated carbon: Part I. Two-
parameter models and equations allowing determination of
thermodynamic parameters. Journal of Hazardous Materials 2007,
147, 381–394.
Han, R.; Zhang, J.; Han, P.; Wang, Y.; Zhao, Z.; Tang, M. Study of
equilibrium, kinetic and thermodynamic parameters about methylene
blue adsorption onto natural zeolite. Chemical Engineering Journal
2009, 145, 496–504.
He, J.; Hong, S.; Zhang, L.; Gan, F.; Ho, Y. S. Equilibrium and
thermodynamic parameters of adsorption of Methylene Blue onto
rectolite. Fresenius Environmental Bulletin 2010, 19, 2651–2656.
Ho, Y. S.; McKay, G. Sorption of dye from aqueous solution by peat.
Chemical Engineering Journal 1998, 70, 115–124.
Ho, Y. S.; Porter, J. F.; McKay, G. Equilibrium isotherm studies for the
sorption of divalent metal ions onto peat: copper, nickel and lead
single component systems. Water, Air, and Soil Pollution 2002, 141,
1–33.
Kumar, K. V. Comparative analysis of linear and non-linear method of
estimating the sorption isotherm parameters for malachite green onto
activated carbon. Journal of Hazardous Materials 2006, B136, 197–
202.
Kumar, K. V.; Porkodi, K.; Rocha, F. Isotherms and thermodynamics by
linear and non-linear regression analysis for the sorption of methylene