The Symmetrical Fitting Method For Model Identification
10, 2025-06
Hugo Hernandez
ForsChem Research, 050030 Medellin, Colombia
[email protected]
doi: 10.13140/RG.2.2.31220.26244
Abstract
In this report, the Symmetrical Fitting method for model
optimization and parameter estimation is described in detail.
The algorithm has been implemented in R language and is
freely available (doi: 10.13140/RG.2.2.30381.40160). The
algorithm comprises various functions for estimating model
parameters while minimizing the risk of over-fitting. The main
R function for executing the symmetrical fitting method
(sm.fit) works for a wide range of single-output models,
including single-input and multiple-input variables, as well as
linear and nonlinear functions of the input variables and
model parameters. In the case of models with nonlinear
functions of the parameters, the CheMO multi-algorithm
numerical optimization method (doi: 10.13140/RG.2.2.29472.90887) is employed. Different step-
by-step examples are included to illustrate the usage of the R functions included in the toolbox.
Keywords
Correlation, Error, Model Identification, Optimization, Over-fitting, Parameter Estimation,
Parsimony, Probability Distribution, R, Randomistics, Relevance, Symmetrical Model
1. Introduction
The idea of symmetrical fitting of model parameters was introduced in a previous report [1]
with the purpose of avoiding overfitting a model when the focus is placed on minimizing the
error on a single variable (the response variable). It was shown that classical least squares
minimization (or regression in general) is not symmetric and may yield inconsistent sets of
model parameters when the roles of response and input variable are interchanged.
Cite as: Hernandez, H. (2025). The Symmetrical Fitting Method for Model Identification. ForsChem
Research Reports, 10, 2025-06, 1 - 55. doi: 10.13140/RG.2.2.31220.26244. Publication Date: 08/04/2025.
Symmetrical fitting was also used to determine the relevance of a difference or effect in simple
linear models, working as an effective, practical substitute of statistical significance [2].
Symmetrical fitting was also found to be an efficient strategy for optimizing the structure of
mathematical models, even when they involve heteroscedastic model residuals [3].
During the development of this method, various improvements have been made to the
algorithm originally published. For that reason, the purpose of this report is to present the
rigorous derivation, formalization and robust implementation of those improvements in the
symmetrical fitting method for model identification.
Of course, several examples are included to illustrate the use and performance of the
algorithm.
2. The Symmetrical Fitting Method

The overall residual error can have any distribution of values (normal or non-normal) and can be either homoscedastic or heteroscedastic [2,3].

The most general representation of a symmetrical model, involving different functions (terms) of a set of variables x and a set of model parameters θ, is:
∑_i α_i F_i*(x, θ) = σ_ε ε    (2.1)

where F_i*(x, θ) is a standard transformation of F_i(x, θ):

F_i*(x, θ) = (F_i(x, θ) − E(F_i(x, θ))) / σ(F_i(x, θ))    (2.2)

where E(·) represents the expected value operator.
F_i(x, θ) is any arbitrary nonlinear function of the set of observed variables x, which also depends on the optional set of model parameters θ; E(F_i) is the mean of F_i(x, θ), and σ(F_i) is the standard deviation of F_i(x, θ). σ_ε represents the standard deviation of the model error, and ε is the standard random distribution of the model error.
The standard transformation F_i*(x, θ) has the following properties:

E(F_i*) = E((F_i − E(F_i)) / σ(F_i)) = (E(F_i) − E(F_i)) / σ(F_i) = 0    (2.3)

Var(F_i*) = Var((F_i − E(F_i)) / σ(F_i)) = Var(F_i) / σ²(F_i) = 1    (2.4)

where Var(·) is the variance operator.
The terms α_i represent model coefficients. In symmetrical fitting, these terms must satisfy certain conditions that will be derived in the following sections. For the moment, they will be considered as arbitrary real numbers (α_i ∈ ℝ).
Note that Eq. (2.1) describes the behavior of the model residual error (ε), which is a dimensionless random variable with the following properties:

E(ε) = E((1/σ_ε) ∑_i α_i F_i*) = (1/σ_ε) ∑_i α_i E(F_i*) = 0    (2.5)

(consistent with E(ε) = 0), and

Var(ε) = Var((1/σ_ε) ∑_i α_i F_i*) = (1/σ_ε²) ∑_i ∑_j α_i α_j Cov(F_i*, F_j*) = 1    (2.6)

where Cov(·,·) represents the covariance operator:

Cov(F_i*, F_j*) = E[(F_i* − E(F_i*))(F_j* − E(F_j*))] = E(F_i* F_j*) = E[(F_i − E(F_i))(F_j − E(F_j))] / (σ(F_i) σ(F_j)) = ρ(F_i, F_j)    (2.7)

ρ(F_i, F_j) is the linear correlation coefficient between functions F_i and F_j. Thus, from Eq. (2.6):

σ_ε² = ∑_i ∑_j α_i α_j ρ(F_i, F_j)    (2.8)
The goal of model identification is minimizing the residual error, so we would like to minimize σ_ε. However, this would lead to trivial results when all α_i = 0. In such a scenario, the resulting model does not involve any variable and thus is completely useless for modeling purposes. For this reason, the first constraint on the α_i is that at least one of the coefficients should be different from zero.

The function term whose coefficient is necessarily different from zero will be denoted as the function of interest (or output or response variable) and will be represented by the variable y. Furthermore, we will arbitrarily assign to the corresponding coefficient a value of −1. Thus, if we assume that function F_0(x, θ) is the function of interest, then we set:

F_0(x, θ) = y    (2.9)

α_0 = −1    (2.10)

and the general standardized model becomes (from Eq. 2.1, 2.9 and 2.10):

y* = ∑_{i≥1} α_i F_i*(x, θ) + σ_{ε(y)} ε    (2.11)

with

y* = (y − E(y)) / σ(y)    (2.12)

where the remaining α_i values are obtained by solving an optimization problem, and σ_{ε(y)} represents the standard error determined using y as the function of interest.
In the simplest case, the model contains only the function of interest (a constant model). Then, from Eq. (2.11):

σ_{ε(y)} = 1    (2.13)

y* = ε    (2.14)

or equivalently,

y = E(y) + σ(y) ε    (2.15)

Eq. (2.15) is the randomistic [5] representation of the constant unbiased model. In this case, the standard distribution of model residuals belongs to the same family as the probability distribution of the function of interest.

In addition, since σ²(y) is already the minimum variance of the model error, no further optimization is needed. In fact, there are no decision variables (α_i) available to perform the optimization.

Consider now a model with two function terms:

α_0 F_0*(x) + α_1 F_1*(x) = σ_ε ε    (2.16)
Here, we have two potential functions of interest. If we assume that F_0(x) = y is the function of interest, then α_0 = −1 and:

y* = α_1 F_1*(x) + σ_{ε(y)} ε    (2.17)

In this case,

σ²_{ε(y)} = Var(y* − α_1 F_1*) = 1 − 2 α_1 ρ(y, F_1) + α_1²    (2.18)

The minimization of σ²_{ε(y)} yields the following optimal coefficient:

α_1^{LS} = ρ(y, F_1)    (2.19)

This minimization corresponds to least squares minimization. The superscript LS indicates the least-squares minimization optimum.

However, if we divide Eq. (2.17) by α_1, the following expression is obtained:

F_1*(x) = (1/α_1) y* − (σ_{ε(y)}/α_1) ε    (2.20)

corresponding to the model obtained when F_1(x) is the function of interest:

F_1*(x) = β_1 y* + σ_{ε(F_1)} ε    (2.21)

with an optimal coefficient:

β_1^{LS} = ρ(y, F_1)    (2.22)

The results obtained in Eq. (2.19) and (2.22) are clearly different (1/α_1^{LS} ≠ β_1^{LS} unless |ρ| = 1), indicating the lack of symmetry of least squares minimization.
Symmetry requires the coefficient obtained when y is the function of interest to be the reciprocal of the coefficient obtained when F_1 is the function of interest:

α_1^S = 1 / β_1^S    (2.23)

with the following possible real results [1]:

α_1^S = ±1    (2.24)

Only one of those solutions represents a minimum error, corresponding to:

α_1^S = sign(ρ(y, F_1)) = ρ(y, F_1) / |ρ(y, F_1)|    (2.25)

σ_{ε(y)}^S = √(1 − 2 α_1^S ρ(y, F_1) + (α_1^S)²) = √(2 (1 − |ρ(y, F_1)|))    (2.26)
And finally, the symmetrical simple linear model becomes (considering F_0(x) = y as the variable of interest):

y* = sign(ρ(y, F_1)) F_1*(x) + √(2 (1 − |ρ(y, F_1)|)) ε    (2.27)

Or equivalently,

y = E(y) + sign(ρ(y, F_1)) (σ(y)/σ(F_1)) (F_1(x) − E(F_1)) + σ(y) √(2 (1 − |ρ(y, F_1)|)) ε    (2.28)

Eq. (2.28) shows the simple linear model obtained by symmetrical fitting, expressed in terms of the sign of the optimal slope coefficient (α_1^{LS} = ρ(y, F_1)) found by least-squares minimization for the standardized model.
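As a numerical illustration of Eqs. (2.27)-(2.28), the symmetrical slope can be computed in base R and compared against the classical least-squares slope. This is a minimal sketch using simulated data, not part of the smtools toolbox:

```r
# Minimal sketch of the symmetrical simple linear fit (Eqs. 2.27-2.28),
# using simulated data; not part of the smtools toolbox.
set.seed(123)
x <- rnorm(100)
y <- 2 + 0.8 * x + rnorm(100, sd = 0.5)

r <- cor(x, y)                          # least-squares slope of the standardized model
slope.sym <- sign(r) * sd(y) / sd(x)    # symmetrical (de-standardized) slope
bias.sym  <- mean(y) - slope.sym * mean(x)

slope.ls <- unname(coef(lm(y ~ x))[2])  # classical least-squares slope, for comparison
# |slope.ls| <= |slope.sym| always, since slope.ls = r * sd(y)/sd(x) and |r| <= 1
c(symmetrical = slope.sym, least.squares = slope.ls)
```

Note that the symmetrical slope sign(r)·sd(y)/sd(x) coincides with the slope of the geometric-mean (reduced major axis) regression, which is invariant to interchanging the roles of x and y.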
The same procedure applies when the model term F_1(x, θ) is a nonlinear function of additional parameters θ, with:

F_0(x) = y    (2.29)

where F_0(x) is the function of interest for the model.

Coefficient α_1^S is obtained similarly as in the previous case, that is, according to Eq. (2.25). The difference here is that the optimization problem for identifying the additional parameters (θ) becomes:

min_θ σ_{ε(y)}^S    (2.30)

which is equivalent to (from Eq. 2.30 and 2.26):

max_θ |ρ(y, F_1(x, θ))|    (2.31)
Consider now the general multiple linear model:

y* = ∑_{i=1}^m α_i F_i*(x) + σ_{ε(y)} ε    (2.32)

Eq. (2.32) can be alternatively expressed as a single nonlinear** model as follows (considering F_0(x) = y):

y* = α_G G*(x, w) + σ_{ε(y)} ε    (2.33)

where

G(x, w) = ∑_{i=1}^m w_i F_i*(x)    (2.34)

G*(x, w) = ∑_{i=1}^m w_i F_i*(x) / √(∑_i ∑_j w_i w_j ρ(F_i, F_j))    (2.35)

and

G*(x, w) = G(x, w) / √(Var(G(x, w)))    (2.36)

G(x, w) is a new function containing all terms except the function of interest for the model, w is a set of additional model parameters emerging from this transformation, and α_G is the symmetrical coefficient of the resulting single nonlinear model.

Now, the original coefficients of the multiple-linear model are related to those of the single linear model as follows:

α_i = α_G w_i / √(∑_i ∑_j w_i w_j ρ(F_i, F_j))    (2.37)

** Strictly, the model obtained is linear, but the parameters are treated as in the case of simple nonlinear models, that is, by solving an optimization problem with the additional parameters as decision variables.
And the model parameters w are obtained by solving the following minimization problem:

min_w σ_{ε(y)}^S    (2.38)

which is equivalent to:

max_w |ρ(y, G(x, w))|    (2.39)

Since the solution to this optimization problem is the result of least squares minimization, we may conclude that:

w_i = α_i^{LS}    (2.40)

and therefore (from Eq. 2.36, 2.37 and 2.40):

α_i^S = α_i^{LS} sign(ρ(y, G)) / √(∑_i ∑_j α_i^{LS} α_j^{LS} ρ(F_i, F_j))    (2.41)

The least-squares optimal coefficient values can be determined analytically using the following expression [3]:

α^{LS} = P⁻¹ q    (2.42)

where

P = [ ρ(F_i, F_j) ] =
    [ 1            ρ(F_1, F_2)  …  ρ(F_1, F_m) ]
    [ ρ(F_2, F_1)  1            …  ρ(F_2, F_m) ]
    [ …            …            …  …           ]
    [ ρ(F_m, F_1)  ρ(F_m, F_2)  …  1           ]    (2.43)

q = [ ρ(y, F_1)  ρ(y, F_2)  …  ρ(y, F_m) ]ᵀ    (2.44)

G(x) = ∑_{i=1}^m α_i^{LS} F_i*(x)    (2.45)
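The analytical least-squares solution of Eq. (2.42) can be sketched in base R by solving the linear system formed by the term correlation matrix and the response correlation vector. The standardized coefficients recovered this way match those of lm on the same data; this is a sketch with simulated data, not toolbox code:

```r
# Sketch of Eq. (2.42): standardized least-squares coefficients from correlations.
set.seed(1)
n  <- 200
F1 <- rnorm(n); F2 <- 0.5 * F1 + rnorm(n)        # two correlated model terms
y  <- 1 + 2 * F1 - 1.5 * F2 + rnorm(n, sd = 0.3)

X <- cbind(F1, F2)
P <- cor(X)                 # matrix of term-term correlations (Eq. 2.43)
q <- cor(X, y)              # vector of term-response correlations (Eq. 2.44)
alpha.ls <- solve(P, q)     # standardized least-squares coefficients (Eq. 2.42)

# De-standardize and compare with lm():
beta <- alpha.ls * sd(y) / apply(X, 2, sd)
cbind(from.correlations = beta, from.lm = coef(lm(y ~ F1 + F2))[-1])
```

Both columns should agree to numerical precision, since solving the correlation system is algebraically equivalent to ordinary least squares on standardized variables.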
Proceeding similarly as in the previous case, the symmetrical coefficients obtained are:

α_i^S = α_i^{LS}(θ) sign(ρ(y, G)) / √(∑_i ∑_j α_i^{LS}(θ) α_j^{LS}(θ) ρ(F_i(x, θ), F_j(x, θ)))    (2.46)

where

G(x, θ) = ∑_{i=1}^m α_i^{LS}(θ) F_i*(x, θ)    (2.47)

α^{LS}(θ) = P(θ)⁻¹ q(θ)    (2.48)

P(θ) = [ ρ(F_i(x, θ), F_j(x, θ)) ]    (2.49)

q(θ) = [ ρ(y, F_i(x, θ)) ]    (2.50)

and the optimal set of additional model parameters θ is obtained by solving the following optimization problem:

min_θ σ_{ε(y)}^S    (2.51)

or equivalently:

max_θ |ρ(y, G(x, θ))|    (2.52)

where both problems are constrained by Eq. (2.48).
Once the optimal coefficients and parameters of the general standardized model have been obtained, the function of interest can be transformed back into its de-standardized form as follows:

y = b + ∑_{i=1}^m β_i F_i(x, θ) + σ(y) σ_{ε(y)}^S ε    (2.53)

where

b = E(y) − ∑_{i=1}^m β_i E(F_i)    (2.54)

β_i = α_i^S σ(y) / σ(F_i)    (2.55)

ε = (y* − ∑_i α_i^S F_i*(x, θ)) / σ_{ε(y)}^S    (2.56)

σ_{ε(y)}^S = √(2 (1 − |ρ(y, G(x, θ))|))    (2.57)

In practice, the population statistics are replaced by their sample estimates, obtained from a sample of n observations:

m̂(F_i) = (1/n) ∑_{k=1}^n F_i(x_k)    (2.58)

ŝ(F_i) = √((1/(n−1)) ∑_{k=1}^n (F_i(x_k) − m̂(F_i))²)    (2.59)

F̂_i*(x_k) = (F_i(x_k) − m̂(F_i)) / ŝ(F_i),  ρ̂(F_i, F_j) = (1/(n−1)) ∑_{k=1}^n F̂_i*(x_k) F̂_j*(x_k)    (2.60)

Then, the general model obtained by symmetrical fitting using a sample of n observations becomes:

ŷ = b̂ + ∑_{i=1}^m β̂_i F_i(x, θ̂)    (2.61)
where

b̂ = m̂(y) − ∑_{i=1}^m β̂_i m̂(F_i)    (2.62)

β̂_i = α̂_i^S ŝ(y) / ŝ(F_i)    (2.63)

α̂_i^S = α̂_i^{LS}(θ̂) sign(ρ̂(y, Ĝ)) / √(∑_i ∑_j α̂_i^{LS}(θ̂) α̂_j^{LS}(θ̂) ρ̂(F_i, F_j))    (2.64)

α̂^{LS}(θ̂) = P̂(θ̂)⁻¹ q̂(θ̂)    (2.65)

P̂(θ̂) = [ ρ̂(F_i(x, θ̂), F_j(x, θ̂)) ]    (2.66)

q̂(θ̂) = [ ρ̂(y, F_i(x, θ̂)) ]    (2.67)

Ĝ(x, θ̂) = ∑_{i=1}^m α̂_i^{LS}(θ̂) F̂_i*(x, θ̂)    (2.68)

ε̂ = (ŷ* − ∑_i α̂_i^S F̂_i*(x, θ̂)) / σ̂_{ε(y)}^S    (2.69)

σ̂_{ε(y)}^S = √(2 (1 − |ρ̂(y, Ĝ)|))    (2.70)
Unfortunately, the use of sample statistics to estimate population parameters always introduces error, including uncertainty and eventually also bias. Such error may also depend on the probability distribution of the observed variables (x) and on the nonlinear functions F_i. Moreover, since nonlinear operations involving the sample statistics are present in this method, evaluating bias and uncertainty analytically is a complex task. Monte Carlo simulation methods [6], assuming a specific probability distribution of the observed variables, can be used to estimate the bias and uncertainty of the model parameters obtained by symmetrical fitting.
The overall effect of bias and uncertainty in parameter estimation, along with the error introduced during the measurement of the observed variables, is propagated through the model, resulting in an increased model error. Thus, the estimated overall residual error becomes:

σ̂_{ε(y)} = √(σ_{LOF}² + Û_S² + Û_E²)    (2.71)

where σ_{LOF} represents the error due to lack-of-fit of the model, given by:

σ_{LOF} = √(2 (1 − |ρ(y, G(x, θ))|))    (2.72)

Û_S represents the estimation uncertainty propagated through the standardized model due to sampling, and Û_E is the estimation uncertainty propagated through the standardized model due to experimental errors (including measurement errors).

Unfortunately, σ_{LOF} is unknown, since the true value of ρ(y, G(x, θ)) is needed (the true α_i coefficients must be accurately known, and the data must be free of experimental error).
As the number of observations decreases, sampling error eventually becomes the dominant error term. Similarly, as the degrees of freedom in the model increase, the sampling error term becomes less important. The exact analytical expression will depend on the nature of the function terms F_i(x, θ) and on the distribution of experimental values (x).
The uncertainty propagated due to experimental error can be estimated as:

Û_E = √(σ̂_E²(y) + ∑_{i=1}^m β̂_i² σ̂_E²(F_i)) / ŝ(y)    (2.73)

where σ̂_E(F_i) represents the experimental error in the determination of F_i(x). Also note that ŝ(F_i) ≥ σ̂_E(F_i), since experimental error is also included in the variability of the data.

The overall residual error estimated from the sample, corrected by the degrees of freedom of the model, is:

σ̂_{ε(y)} = √(2 (1 − |ρ̂(y, Ĝ)|)) √((n − 1) / ν)    (2.74)

where

ν = n − 1 − m − n_p    (2.75)

and n_p represents the number of additional parameters (θ) included in the model.
From the three sources of error, only experimental error (Û_E) can be easily determined from the data sample. Then, we would expect:

σ̂_{ε(y)} ≥ Û_E    (2.76)

where σ̂_{ε(y)} is determined from Eq. (2.74) and Û_E from Eq. (2.73).

If Eq. (2.76) is not valid, that is, when σ̂_{ε(y)} < Û_E, Eq. (2.71) becomes inconsistent (since imaginary uncertainties would be required). This situation occurs when the model has been over-fitted. Over-fitting can be avoided by introducing Eq. (2.76) as a constraint in the model error minimization problem.
As long as constraint Eq. (2.76) is satisfied, the model error can be minimized by maximizing the
model fit but also by maximizing the degrees of freedom of the model. Ideally, we might
maximize the degrees of freedom by increasing the number of observations until sampling
error becomes negligible. However, when the number of observations is fixed, the degrees of
freedom can only be increased by removing parameters from the model.
Two situations are possible: 1) The overall error decreases due to the increase in degrees of
freedom, or 2) the overall error increases due to the lack of fit of the simplified model. In the
first case, the parameter removed from the model was irrelevant, while for the second case the
parameter was relevant.
Thus, we may define a relevant parameter as any parameter whose presence in the model
allows decreasing the overall error. Then, by removing irrelevant parameters from the model,
the model performance will improve. This effect can be considered as a practical result of the
principle of Parsimony.
One important advantage of symmetrical fitting is that the model term coefficients α_i^S are dimensionless and directly comparable. Thus, we can evaluate first the relevance of the term with the least contribution to the model by choosing the term with the minimum absolute value of α̂_i^S. In this procedure, two models are obtained: one with the selected term and a second without it. If the second model achieves a lower value of σ̂_{ε(y)} (according to Eq. 2.74), then the term can be considered irrelevant and can be safely removed from the model (the corresponding α_i is set to zero). When all remaining terms are relevant, we can then proceed to check the relevance of additional model parameters.
The order for evaluating the additional model parameters is rather arbitrary. For each additional model parameter, a reference value is selected (not necessarily zero). The reference value can be determined as the closest integer, or in terms of special numbers, or it can be obtained theoretically, or simply defined by aesthetical considerations. Then, the residual error (σ̂_ε) is compared between the original model and the model obtained using the reference value. Notice that by arbitrarily setting a value for the additional model parameter, it is no longer estimated from the data and thus it cannot be counted among the fitted parameters. That is, the degrees of freedom increase by one when the parameter is arbitrarily assigned instead of fitted from the data. If the model error decreases, then the reference value can be incorporated in the model; otherwise, it must be fitted from the data. The procedure is then repeated for all other additional model parameters.
As a simple but illustrative example, let us consider the estimation of the simple linear model (F_1(x) = x):

ŷ = b̂ + β̂ x = m̂(y) + α̂^S (ŝ(y)/ŝ(x)) (x − m̂(x))    (2.77)

Assuming large samples with negligible experimental error, the overall residual error σ̂_{ε(y)} can be estimated as (from Eq. 2.71):

σ̂_{ε(y)} ≈ √(2 (1 − |ρ̂(y, x)|))    (2.78)

On the other hand, if the term F_1(x) = x is removed from the model, we obtain the constant model:

ŷ = b̂ = m̂(y)    (2.79)

with (see Section 2.2)

σ̂_{ε(y)} = 1    (2.80)

So, we can conclude that the term x is relevant when the residual error of the linear model is below that of the constant model, or equivalently, when:

|ρ̂(y, x)| > 1/2    (2.81)

This limiting correlation coefficient value can be used as an alternative, heuristic rule of thumb for approximately determining the relevance of each model term.
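This heuristic can be checked numerically: the residual error of the standardized linear model, √(2(1 − |ρ|)), drops below that of the constant model (equal to 1) exactly when |ρ| exceeds 1/2. A minimal base-R sketch, assuming large samples and negligible experimental error:

```r
# Heuristic relevance check for a single term (large-sample approximation):
# the linear term is worth keeping only when |cor(y, F1)| > 0.5.
term.relevant <- function(y, F1) {
  rho <- cor(y, F1)
  s.linear   <- sqrt(2 * (1 - abs(rho)))  # standardized residual error, linear model
  s.constant <- 1                         # standardized residual error, constant model
  s.linear < s.constant                   # equivalent to abs(rho) > 0.5
}
```

For finite samples, the degrees-of-freedom correction of Eq. (2.74) shifts this threshold slightly above 1/2, so the sketch should be read as an asymptotic approximation.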
(2.82)
where
( ̂ ) ( ̂ )
( ̂ )
( ̂ ) ( ̂ )
{
(2.83)
represents the cumulative probability function of the residuals, ̂ are the individual residual
values, is the ascending rank of the residual value ̂ , and n is the total number of residuals.
This metric can also be expressed in terms of the random coefficient of determination:
(2.84)
where
∑( )
(2.85)
Thus, the coefficient of determination of the random model becomes:
( )∑
(2.86)
Both models (deterministic and random) are important components of the randomistic model describing the experimental observations. For this reason, the randomistic goodness-of-fit can be expressed as follows [8,9]:
( )
(2.87)
3. Algorithm Implementation
The symmetrical fitting method for model identification has been implemented in R language
(v.4.2.1). R (https://www.r-project.org/) is a free software for statistical computing and
graphics. The symmetrical fitting algorithm employs different R functions, all of which have
been assembled in a single R file (smtools.R) and is freely available (doi:
10.13140/RG.2.2.30381.40160).
A simplified flow diagram providing an overview of the algorithm is presented in Figure 1.
Input arguments:
y: Vector containing the observed values of the response variable to be used in the
identification procedure.
x: Vector or matrix of observed values of the input variables to be used in the
identification procedure. For matrices, each input variable must be represented by a
single column. The number of rows of x must be identical to the number of elements in
the response variable y.
terms: Optional vector of function names (using quotation marks "") representing the
different terms considered in the model. If omitted, each input variable in x is
considered as a different term.
Term functions must follow the following structure:
termfn<-function(x,param) {...}
All term functions considered must share the same arguments x and param.
param0: Optional vector of initial values for the additional parameters used by term
functions. The vector length must correspond to the length and order of the param
argument used by the different term functions.
These initial values will be used as reference values for evaluating the relevance of the
parameters. This input can be omitted only when no additional parameters are used.
lower: Optional vector of lower bounds for the additional parameters to be used in the
optimization. If omitted, they are set by default to -Inf.
upper: Optional vector of upper bounds for the additional parameters to be used in the
optimization. If omitted, they are set by default to Inf.
config: Optional list describing the configuration of the CheMO optimization method
[10]. By default, only one Queen (maoptim [11]) is used.
display: Optional text indicating the type of display to be used by the CheMO method:
No display ('none'), display after each iteration ('iter'), or display of final results ('final').
The 'iter' display option allows following the results of the optimization procedure in
real time.
maxit: Optional value of the maximum number of iterations to be performed by the
CheMO method. By default it is set to . Early stop of the optimization procedure, with
unsatisfactory results, may occur when maxit is low. Depending on the complexity of
the optimization problem, larger maxit values might be needed. Use display='iter' to
observe the evolution of the optimization procedure and decide if maxit should be
increased or not.
Uexp: Optional vector of experimental standard error values for the response variable
(first element) and each model term (given in the same order as terms). If only a single
uncertainty value is given, it is assumed that it corresponds to the uncertainty in the
determination of the response variable whereas the uncertainties of model terms are
set to zero. If no value is given, all uncertainties are assumed to be zero. This also
implies that potential over-fitting cannot be evaluated by the algorithm. The
uncertainty values are propagated through the fitted model, and this result is employed
as a constraint in the optimization procedure. The uncertainty propagated through the
model is also employed to determine the fitness coefficient of the model [12]. The
experimental error values should include at least measurement error due to truncation
or instrument resolution.
heur: Optional logical value indicating if a heuristic over-fit control is used or not. The
heuristic over-fit control removes terms having | | values less than . By default, it is
set to FALSE (no heuristic over-fit control).
cr: Optional value or vector of the desired resolution (by truncation/rounding) for the
bias and each model term coefficient. By default it is set to zero (no rounding). If a
single value is given, the resolution will be the same for all coefficients (including bias).
Otherwise, the first value will represent the bias resolution, followed by the model term
coefficients (in the same order given by terms or by x).
ptol: Optional value or vector of the desired resolution (by truncation/rounding) for
the additional parameters. By default it is set to zero (no rounding). If a single value is
given, the resolution will be the same for all additional parameters. Otherwise, they will
be assigned to each parameter in the same order given by param0.
plots: Optional logical value indicating if the results are plotted or not. By default it is
set to FALSE (no plots).
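Putting the arguments above together, a typical call might look as follows. The data objects (rate, conc, temp) and the term function powerterm are hypothetical placeholders, used only to illustrate the documented argument structure:

```r
# Hypothetical usage sketch of sm.fit with one custom nonlinear term.
# 'rate', 'conc' and 'temp' are placeholder data objects, not from this report.
powerterm <- function(x, param) x[, 1]^param[1]   # term function: (x, param) signature

fit <- sm.fit(y       = rate,
              x       = cbind(conc, temp),
              terms   = c("powerterm"),
              param0  = c(1),        # reference/initial value of the exponent
              lower   = c(0),
              upper   = c(3),
              display = 'iter',      # follow the optimization in real time
              plots   = TRUE)

fit$coeff                 # term coefficients and inclusion flags
fit$model_performance     # standard error, R2, UE, CF
```

The named output components (coeff, model_performance, etc.) are those documented in the output list below.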
The output (smout) is a list containing the following information:
bias: Estimated value of the bias correction coefficient ( ) for the optimized model. It
is usually known as intercept, or as independent term.
coeff: Data frame showing the names of the model terms, estimated optimal
coefficient values ( ), estimated symmetrical model coefficients ( ), and a logical
variable indicating if the term was included or not in the optimal model structure.
par: Vector with the estimated optimal values of additional parameters. If no additional
parameters are considered the output is NULL.
partype: Vector with categorical types of additional parameters: 'param' for calculated
parameter, or 'const' for reference constant value. If no additional parameters are
considered the output is NULL.
dof: Degrees of freedom of the final model obtained (total number of observations of
the response variable minus the total number of parameters and coefficients estimated
from the data).
model_performance: Vector of model performance results, including: Standard error
(s), R2 coefficient (R2), Experimental uncertainty (UE), and Fitness coefficient (CF).
residual_model: Data frame indicating the best distribution model for the residuals,
with the corresponding random and randomistic R2 coefficients, Normality value
(Nvalue) [13], and Scedasticity value (Hvalue) [14] ††.
ypred: Vector containing the values of the response variable predicted by the optimal
model for the set of input variables x.
res: Vector containing the response variable residuals for the optimal model obtained.
If the problem does not involve additional parameters (other than the coefficients of each
term) then the optimization problem is solved analytically. When additional parameters are
involved, a numerical optimization is performed using param0 as initial estimations. By default,
the optimization is performed using the CheMO optimization method with a single “Queen”
(multi-objective optimization) and a maximum of iterations, without displaying the
optimization results. These options (config, display, and maxit) can be modified by the
user as input arguments. If for any reason, the CheMO function is not available, the optim
function using the Nelder-Mead (NM) method is employed‡‡. When the numerical optimization is
performed, the execution time of the algorithm increases compared to that of analytical
solutions. Use display='iter' to monitor the evolution of the optimization.
In addition to CheMO, the sm.fit function also requires the following additional functions
(included in smtools): sm.gof, N.norm.test, and H.sked.test.
If the plots option is set to TRUE, a graphical representation of the results obtained with the
optimized model is presented. This includes the following plots:
Scatterplot of predicted vs. observed response variable.
Scatterplot of predicted and observed response variable vs. each input variable and/or
model terms.
Scatterplot of model residuals vs. observed response variable.
Histogram of residual errors compared to best distribution model.
Scatterplot of cumulative relative frequency and cumulative probability vs. model
residuals.
Q-Q plot of residuals considering the best distribution model.
P-P plot of residuals considering the best distribution model.
††
The residual model performance, and the N-values and H-values, are only illustrative, as they have no effect on the validity of the symmetrical model obtained.
‡‡
Nelder-Mead is the default method used by the optim function. It is also equivalent to a CheMO
optimization considering a single “Knight”.
Output:
ypred: Vector containing the values of the response variable predicted by the optimal
model for the set of input values x.
§§
While the function also allows extrapolation, it is highly advisable to avoid extrapolating results from a
fitted model.
***
Warning: Trying to interpolate models with more than one input variable using sm.plot may lead to
unexpected errors or erroneous results.
Graphical output:
Histogram of residual errors compared to the selected distribution function.
Scatterplot of cumulative relative frequency and cumulative probability vs. model
residuals for the selected distribution function.
Q-Q plot of residuals considering the selected distribution function.
P-P plot of residuals considering the selected distribution function.
Output:
r: Relevance value (r-value) corresponding to the absolute linear correlation coefficient
between the data set and a binary variable representing each group.
relevant: Boolean variable indicating if the sample difference is relevant or not. The
difference is considered relevant when r> .
sample.diff: Average difference observed between the samples.
model.diff: Estimated difference in mean values obtained using a symmetrical model.
Graphical output:
Scatterplot of observations grouped according to each sample, compared to the linear
model obtained by symmetrical fitting.
Usage:
CheMO(par,fn,gr,config,lower,upper,control,hessian,adaptboard,display)
Input arguments:
par: Initial values for the parameters to be optimized.
fn: An R function to be minimized (or maximized), with first argument the vector of
parameters over which minimization is to take place. It should return a scalar result.
gr: Optional function used to return the gradient for the "BFGS" and "L-BFGS-B"
methods. If it is NULL, a finite-difference approximation will be used.
config: A list containing the number of chess pieces considered in the optimization: P
(Pawns), B (Bishops), N (Knights), R (Rooks) and Q (Queens). The default value for
each type is zero. If all pieces are set to zero, the default chess configuration is used
(P=8,B=2,N=2,R=2,Q=1).
lower, upper: Bounds on the variables for the "L-BFGS-B" or "OAT" methods.
control: Optional list of control parameters, including:
o maxit: Maximum number of iterations for each optimizer and for the
optimization cycle. By default, maxit=10.
o fnscale: Scale constant for the objective function. Negative values are used
for maximization, positive for minimization. By default, fnscale=1.
o step0: Vector representing the initial search step-size for each decision
variable. Used for OAT.
o stepmin: Vector representing the minimum search step-size for each decision
variable. For integer decision variables the minimum search step-size must be 1.
Used for OAT.
hessian: Optional logical value. Should a numerically differentiated Hessian matrix be
returned?
adaptboard: Optional logical argument indicating whether the Adaptive Board
strategy (adapting the search region) is used or not.
display: Indicates which type of display is used. Default: 'none' (Nothing is displayed).
Options: 'iter' displaying results at each iteration, and 'final' displaying only the final
iteration.
Output:
par: Optimal values found for the decision variables.
value: Best objective function value found.
counts: Number of function evaluations performed.
time: Elapsed computation time in seconds.
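Based on the usage signature above, a minimal CheMO call might look as follows; the quadratic objective is only a placeholder used to illustrate the documented arguments:

```r
# Minimal sketch of a CheMO call on a placeholder quadratic objective.
sphere <- function(p) sum((p - c(1, -2))^2)   # minimum at (1, -2)

opt <- CheMO(par     = c(0, 0),
             fn      = sphere,
             config  = list(Q = 1),           # a single Queen, as in sm.fit's default
             control = list(maxit = 50),
             display = 'final')

opt$par    # optimal decision variables found
opt$value  # best objective function value
```

With config = list(Q = 1), the search reduces to a single multi-objective "Queen" optimizer, matching the default configuration used internally by sm.fit.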
Input arguments:
fun: R function representing the objective function. It must be a function with a single
argument x. x represents the values of the decision variables and can be a scalar or a
vector.
x0: Optional argument indicating the starting point for the optimization. By default, it is
randomly chosen within the decision variable bounds.
lower: Optional vector with lower limits for the decision variables. By default, all lower
bounds are -Inf.
upper: Optional vector with upper limits for the decision variables. By default, all upper
bounds are Inf.
step0: Optional vector representing the initial search step-size for each decision
variable.
stepmin: Optional vector representing the minimum search step-size for each decision
variable. For integer decision variables the minimum search step-size must be 1.
ncycles: Optional argument indicating the maximum number of full cycles to be
performed by the optimization algorithm.
tol: Optional argument indicating the tolerance (or resolution) for the objective
function.
MCcheck: Optional argument indicating the number of Monte Carlo trials used as a test
of local optima.
display: Optional Boolean argument used to show the progress of the optimization.
By default, it is TRUE.
Input arguments:
mainlm: Either an object of class "lm" (e.g., generated by lm), a data frame, or a list of
two objects: a response vector (y) and a matrix of reference values (x). These objects
must be given in that order.
ttype: Optional argument representing the type of test performed. Types available:
General scedastic ("scedastic"), test for homoscedasticity ("homoscedastic"), and test
for heteroscedasticity ("heteroscedastic").
display: Optional Boolean variable indicating if the test results are displayed or not.
Output:
ref.var: Name of variable used as reference in the evaluation of scedasticity.
statistic: R2 test statistic.
crit.value: Critical value of the test statistic.
p.value: Probability value of the scedasticity test.
H.value: Homoscedasticity value (H-value) calculated from optimal significance levels.
decision: Test decision (homoscedastic, heteroscedastic, or inconclusive).
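For illustration, the test can be applied directly to a fitted lm object; the simulated model below is a placeholder, not data from this report:

```r
# Hypothetical sketch: testing residual scedasticity of a fitted linear model.
set.seed(7)
x <- runif(50)
y <- 1 + 2 * x + rnorm(50, sd = 0.2)          # homoscedastic residuals by construction

out <- H.sked.test(lm(y ~ x), ttype = "homoscedastic", display = TRUE)
out$H.value   # homoscedasticity value (H-value)
out$decision  # homoscedastic, heteroscedastic, or inconclusive
```

Note that with n = 50 observations the small-sample warning shown in the example of Section 4 (triggered when n < 15) does not apply.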
4. Illustrative Examples
This section includes a collection of models obtained using the symmetrical fitting algorithm
programmed in R, for different case studies. Some of these case studies have already been
considered in previous reports.
A wide selection of models can be evaluated in this example, considering different response
variables and different transformations of the experimental variables.
First, let us obtain the best multiple-linear model describing the elongation at break in terms of
all other mechanical properties.
Remember first to load the functions in smtools (saved in the current working directory) using:
source("smtools.R")
Warning message:
H.sked.test: The sample size is too small (n<15). Conclusions may be unreliable
$bias
[1] 166.7157
$coeff
terms coeff alphaS include
1 D 0.0000000 0.0000000 FALSE
2 H -0.3415063 1.9046756 TRUE
3 TS 0.1772275 -3.5630364 TRUE
4 YS -0.1173737 2.0669945 TRUE
5 EM -2.0987926 0.6619995 TRUE
$par
NULL
$partype
NULL
$dof
[1] 8
$model_performance
s R2 R2adj
5.8840205 0.6322108 0.4483162
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Normal 0.9913432 0.9968161 1.713448 2.970658
$ypred
 [1] 25.886777 19.742386 19.214314 17.630048 17.613711  8.096572 27.174493 25.393755 12.001310  3.463652 23.430645
[12]  6.711857  9.640479
$res
 [1]  9.113222591 -7.742386132  0.785686042  0.369951687  1.386289205 -3.096572028 -2.174493349 -3.393754703 -0.001310065
[10]  4.536348008 -7.430645245  4.288143440  3.359520550
The warning message is related to the evaluation of the scedasticity of the residuals using the function H.sked.test. Since this test is only informative here, we may simply ignore the warning.
Figure 2 shows the graphical output obtained for this model, achieving a coefficient of determination of R2 = 0.6322.
The relative relevance of each model term is determined by the absolute value of α̂S (the alphaS column in the output). So, the most important contribution to elongation at break appears to be that of tensile strength, followed by yield strength and hardness. Density had an irrelevant effect on the elongation at break.
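The relevance ordering can be cross-checked numerically. The short sketch below (written in Python rather than R, purely as an arithmetic check on the output above) ranks the retained terms by the absolute value of alphaS:

```python
# alphaS values of the retained terms, taken from the sm.fit output above
alphaS = {"H": 1.9046756, "TS": -3.5630364, "YS": 2.0669945, "EM": 0.6619995}

# Rank terms by decreasing |alphaS|: larger magnitude means higher relevance
ranking = sorted(alphaS, key=lambda t: -abs(alphaS[t]))
print(ranking)  # → ['TS', 'YS', 'H', 'EM']
```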
Using the same data set, let us obtain the best model for describing the natural logarithm of
yield strength in terms of the natural logarithms of all other variables †††. The following code in R
can be used (the data set was already loaded in the previous example)‡‡‡:
sm.fit(log(YS),data.frame(log(D),log(H),log(TS),log(EM),log(EB)))
$bias
[1] 16.55898
†††
Rigorously speaking, special functions (such as exponential, logarithm, etc.) should not be directly
applied to variables with dimensions, but only to dimensionless variables. Unfortunately, the type I
standard transformation cannot be used with logarithms as it will result in the logarithm of negative
values (undefined in the realm of real numbers). In those cases, a type II standard transformation can be
used [4], which yields only positive (or only negative) values. Other functions having arguments with
upper and lower bounds may require type III standard transformations [4]. Now, in the case of type II
transformations for the logarithm function we have:
x_II = x / √(E(x²))
Then, the standard transformation of the logarithm becomes:
ln(x_II) = ln(x / √(E(x²))) = ln(x) − ln(√(E(x²)))
Thus, the logarithm of the variables is used in the model, but formally the dimensionless type II standard
transformation has been considered.
‡‡‡
For simplicity, warnings, response variable predictions and model residuals will no longer be included
in the outputs.
$coeff
terms coeff alphaS include
1 log.D. 0.0000000 0.0000000 FALSE
2 log.H. -0.4664228 0.3937767 TRUE
3 log.TS. 1.6728987 -1.3200833 TRUE
4 log.EM. -4.1167859 0.1797774 TRUE
5 log.EB. -0.4242647 0.2727489 TRUE
$par
NULL
$partype
NULL
$dof
[1] 8
$model_performance
s R2 R2adj
0.2670242 0.9244147 0.8866220
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Normal 0.9799599 0.9984853 1.378534 2.208051
In this case, a model with R2 = 0.9244 was obtained with normal residuals, where the logarithm of tensile strength was the most relevant variable, followed by the logarithm of hardness. The logarithm of density was discarded for being irrelevant.
In the absence of tensile strength data, the following model for describing the logarithm of
yield strength is obtained:
sm.fit(log(YS),data.frame(log(D),log(H),log(EM),log(EB)))
$bias
[1] 1.36776
$coeff
terms coeff alphaS include
1 log.D. 0.0000000 0.000000 FALSE
2 log.H. 1.0558788 -0.891424 TRUE
3 log.EM. 0.0000000 0.000000 FALSE
4 log.EB. -0.2527525 0.162488 TRUE
$par
NULL
$partype
NULL
$dof
[1] 10
$model_performance
s R2 R2adj
0.2807916 0.8955245 0.8746294
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Normal 0.9929549 0.999264 1.403032 3.130867
In addition to the logarithm of density, the logarithm of elastic modulus was also irrelevant.
Let us now consider a nonlinear model of the yield strength in terms of individual powers of
hardness and elongation at break. Two functions must be defined before running the fitting
procedure:
fnH<-function(x,par) as.matrix(x)[,1]^par[1]
fnEB<-function(x,par) as.matrix(x)[,2]^par[2]
sm.fit(YS,data.frame(H,EB),param0=c(1,1),terms=c("fnH","fnEB"))
$bias
[1] 59.30159
$coeff
terms coeff alphaS include
1 fnH 0.01308405 -1 TRUE
2 fnEB 0.00000000 0 FALSE
$par
[1] 2.077278 1.000000
$partype
[1] "param" "const"
$dof
[1] 10
$model_performance
s R2 R2adj
32.4679575 0.9548629 0.9458355
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Normal 0.9979307 0.9999066 1.789461 3.714821
The resulting model considered only the effect of hardness on yield strength. Notice that the second additional parameter (the exponent of elongation at break) was kept at its initial (nominal) value and is treated as a constant rather than as an unknown parameter (freeing one degree of freedom).
The model plots (removing elongation at break) can be obtained as follows§§§:
sm.plot(sm.fit(YS,H,param0=1,terms="fnH"), yname="Yield Strength (MPa)", xname=
"Hardness (Brinell)",x=H)
§§§
Here, the same initial parameter value is used and not the optimal value found in the previous
optimization. If the optimal value is used, the parameter will be considered a constant, and an additional
degree of freedom will be gained, resulting in an artificially lower model error.
Figure 3. Nonlinear model of yield strength with respect to hardness, obtained by symmetrical fitting
The following R code can be used to fit a model of y as a linear function of x1 and x2:
x1=c(0.11,0.69,5.5,2.89,4.47,1.81,3.15,0,3.15,3.02,4.67,0.16,0.68,5.71,3.87)
x2=c(16.55,15.08,0,7.77,2.16,12.09,8.18,15.94,7.91,6.29,1.69,15.58,13.28,0,5.36)
x=data.frame(x1,x2)
y=c(12.37,12.66,12,11.93,11.06,13.03,13.13,11.44,12.86,10.84,11.2,11.56,10.83,12.63,12.46)
sm.fit(y,x,plots=TRUE)
$bias
[1] -4.534961
$coeff
terms coeff alphaS include
1 x1 3.005533 -7.437323 TRUE
2 x2 1.002219 -7.437391 TRUE
$dof
[1] 12
$model_performance
s R2 R2adj
0.0074640 0.9999258 0.9999134
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Normal 0.9953441 0.9999997 1.833684 3.267741
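As a sanity check, the fitted linear model can be evaluated by hand at the first observation. The sketch below does this in Python (an arithmetic check only; the coefficients are taken verbatim from the output above):

```python
# Fitted symmetrical model: y ≈ bias + c1*x1 + c2*x2 (values from the output above)
bias, c1, c2 = -4.534961, 3.005533, 1.002219

def yhat(x1, x2):
    return bias + c1 * x1 + c2 * x2

# First observation of the data set: x1 = 0.11, x2 = 16.55, y = 12.37
pred = yhat(0.11, 16.55)
print(round(pred, 3))  # ≈ 12.382, close to the observed 12.37 (s = 0.0075)
```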
$bias
[1] 9.877415
$coeff
terms coeff alphaS include
1 x1 2.81931 -1 TRUE
2 x2 0.00000 0 FALSE
$dof
[1] 13
$model_performance
s R2 R2adj
1.7525571 0.9089344 0.9019293
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Uniform 0.9784171 0.9980345 1.346825 2.206777
In this case, the effect of x2 is observed to be irrelevant, leaving only the effect of x1. In addition, the model residuals are best described by a uniform model.
Figure 4. Extreme multiple linear model (Table 2), obtained by symmetrical fitting
The model performance can be observed graphically in Figure 5, obtained by using the
following code:
sm.plot(sm.fit(y+0.5*x1^2,x1),x=x1)
Figure 5. Modified version of the extreme multiple linear model, obtained by symmetrical fitting
This data set can also be used to evaluate model over-fitting. Let us first consider a modified response variable given by y + 20·x1. The model is obtained using the following code:
sm.fit(y+20*x1,x)
$bias
[1] -4.534348
$coeff
terms coeff alphaS include
1 x1 23.005422 -1.1484796 TRUE
2 x2 1.002182 -0.1500384 TRUE
$dof
[1] 12
$model_performance
s R2 R2adj
0.007463931 0.999999970 0.999999965
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Normal 0.9951947 1 1.827104 3.266348
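Before examining over-fitting, note that the fit behaves consistently: adding 20·x1 to the response should simply shift the x1 coefficient by 20 while leaving the bias and the x2 coefficient essentially unchanged. A quick check in Python (values copied from the two outputs above):

```python
# Coefficients of the fit of y (first output for this data set)
orig = {"bias": -4.534961, "x1": 3.005533, "x2": 1.002219}
# Coefficients of the fit of y + 20*x1 (output above)
shifted = {"bias": -4.534348, "x1": 23.005422, "x2": 1.002182}

# The only structural change is +20 on the x1 coefficient
assert abs(shifted["x1"] - (orig["x1"] + 20)) < 1e-3
assert abs(shifted["bias"] - orig["bias"]) < 1e-3
assert abs(shifted["x2"] - orig["x2"]) < 1e-3
print("coefficient shift is consistent")
```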
First, the extremely high determination coefficient of the model raises doubts about over-fitting. The second point is that small |α̂S| values are obtained (the heuristic condition for over-fitting). However, these situations do not necessarily confirm over-fitting. So, let us compare the model error with the minimum experimental error propagated through the model. For this, we need the experimental error of the observed variables. Since we do not have more information, let us consider only the truncation error, estimated as 0.01/√12 for all variables (assuming a uniform error model). The evaluation of model over-fitting is performed simply by including the Uexp values in the sm.fit function as follows:
sm.fit(y+20*x1,x,Uexp=rep(0.01/sqrt(12),3))
$bias
[1] 11.91705
$coeff
terms coeff alphaS include
1 x1 20.0312 -1 TRUE
2 x2 0.0000 0 FALSE
$dof
[1] 13
$model_performance
s R2 R2adj UE CF
0.83061043 0.99959479 0.99956362 0.06428043 0.01190692
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Uniform 0.995956 0.9999984 0.6650795 2.786777
The minimum experimental error propagated through the model was UE = 0.0643, which is much larger than the residual error of the previous model (s = 0.0075), thus confirming model over-fitting when x2 is also included in the model. Despite the increase in residual error for the new model without over-fitting (s = 0.8306), the model performance remains highly satisfactory (R2 = 0.9996).
In the absence of experimental error information, the heuristic over-fit control strategy can be
used (yielding the same results), as follows:
sm.fit(y+20*x1,x,heur=TRUE)
$bias
[1] 11.91705
$coeff
terms coeff alphaS include
1 x1 20.0312 -1 TRUE
2 x2 0.0000 0 FALSE
$dof
[1] 13
$model_performance
s R2 R2adj
0.8306104 0.9995948 0.9995636
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Uniform 0.995956 0.9999984 0.6650795 2.786777
The goal here is to model the effect of the excitation wavelength on the integrated second harmonic spectrum, considering a polynomial model up to the 10th power. The following R code can be used to fit the model:
y=c(1.10,1.17,0.95,1,1.66,1.12,1.59,2.02,2.12,1.46,2.34,3.38,3.39,2.71,3.32,4.05,4.76,4.93,4.90,4.85,4.59,3.93)
x=c(0.43,0.435,0.44,0.45,0.455,0.46,0.465,0.47,0.475,0.48,0.485,0.49,0.495,0.5,0.505,0.51,0.515,0.52,0.525,0.53,0.535,0.54)
X=data.frame(x,x^2,x^3,x^4,x^5,x^6,x^7,x^8,x^9,x^10)
out=sm.fit(y,X)
out
$bias
[1] -130.4159
$coeff
terms coeff alphaS include
1 x 0.000 0.00000 FALSE
2 x.2 0.000 0.00000 FALSE
3 x.3 5437.589 -88.63674 TRUE
4 x.4 0.000 0.00000 FALSE
5 x.5 0.000 0.00000 FALSE
6 x.6 0.000 0.00000 FALSE
7 x.7 -722251.526 1609.43753 TRUE
8 x.8 2143333.521 -2708.72122 TRUE
9 x.9 -1677144.524 1187.12513 TRUE
10 x.10 0.000 0.00000 FALSE
$dof
[1] 17
$model_performance
s R2 R2adj
0.3753151 0.9466290 0.9340712
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Normal 0.9894352 0.9994361 1.984046 0.480491
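The retained polynomial terms can be evaluated directly. The Python sketch below (coefficients copied verbatim from the output above) reproduces an observation near the spectral peak within the reported standard error:

```python
# Polynomial retained by sm.fit: bias plus the x^3, x^7, x^8 and x^9 terms
bias = -130.4159
coef = {3: 5437.589, 7: -722251.526, 8: 2143333.521, 9: -1677144.524}

def yhat(x):
    return bias + sum(c * x**p for p, c in coef.items())

pred = yhat(0.52)      # excitation wavelength near the peak
print(round(pred, 2))  # close to the observed 4.93 (s = 0.375)
```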
A plot of the symmetrically fitted model, shown in Figure 6, can be obtained as follows:
xlim=c(min(x),max(x))
xmodel=xlim[1]+(xlim[2]-xlim[1])*(0:1000)/1000
Xmodel=data.frame(xmodel,xmodel^2,xmodel^3,xmodel^4,xmodel^5,xmodel^6,xmodel^7,xmodel^8,xmodel^9,xmodel^10)
ymodel=sm.fn(Xmodel,out)
s=out$model_performance[1]
plot(xmodel,ymodel,xlab="Excitation Wavelength (um)",ylab="Integrated Spectra (a.u.)",col="green",type="l",ylim=c(min(ymodel)-s,max(ymodel)+s))
lines(xmodel,ymodel-s,lty=2,col="red")
lines(xmodel,ymodel+s,lty=2,col="red")
points(x,y,pch=16,col="blue")
legend("top",inset=c(0,-0.22),legend=c("Experimental observations","Model predictions","Standard error"),lty=c(0,1,2),pch=c(16,NA,NA),col=c("blue","green","red"),bty="n",xpd=TRUE)
Figure 6. Polynomial model describing the second harmonic spectra data of TiN films, obtained by
symmetrical fitting.
ŷ(x) = b + c1·exp(−p1·x) + c2·exp(−p2·x)·cos(p3·x − p4)
(4.1)
Table 4. Stiffness vs. cantilever gap in AFM [19]
Gap (A)  Stiffness (N/m)  Gap (A)  Stiffness (N/m)  Gap (A)  Stiffness (N/m)  Gap (A)  Stiffness (N/m)
10.28 0.338 13.92 0.099 21.83 -0.004 35.96 -0.022
10.35 0.279 14.13 0.115 22.05 0.015 37.25 -0.014
10.42 0.236 14.34 0.131 22.20 0.029 37.89 -0.009
10.49 0.206 14.56 0.150 22.48 0.042 38.32 -0.004
10.56 0.198 14.77 0.163 22.69 0.053 39.17 0.004
10.63 0.069 14.98 0.181 22.91 0.061 39.39 0.010
10.70 0.139 15.63 0.187 23.33 0.069 40.24 0.004
10.77 0.109 15.78 0.228 23.76 0.074 41.10 -0.004
10.84 0.061 16.06 0.187 24.19 0.058 41.53 -0.004
10.91 0.050 16.27 0.204 24.62 0.048 41.96 -0.004
10.98 0.039 16.48 0.166 24.83 0.034 42.60 -0.017
11.06 0.010 16.70 0.150 25.26 0.010 43.46 -0.020
11.14 -0.004 16.91 0.139 25.61 -0.001 44.10 -0.020
11.22 -0.009 17.13 0.115 25.90 -0.017 44.95 -0.014
11.30 -0.020 17.28 0.128 26.33 -0.036 45.38 -0.012
11.38 -0.039 17.34 0.096 27.19 -0.044 45.60 -0.014
11.46 -0.030 17.49 0.080 27.40 -0.044 46.67 -0.006
11.54 -0.047 17.55 0.058 28.04 -0.036 47.74 0.002
11.62 -0.052 17.77 0.039 28.47 -0.028 48.59 -0.004
11.70 -0.060 17.98 0.031 28.69 -0.017 49.24 -0.009
11.78 -0.047 18.41 0.010 28.90 -0.012 49.88 -0.009
11.99 -0.041 18.56 -0.004 29.54 -0.001 50.73 -0.012
12.33 -0.030 18.62 -0.025 30.61 0.004 51.38 -0.017
12.63 -0.022 19.05 -0.044 31.04 0.013 52.23 -0.009
12.75 -0.009 19.27 -0.057 31.47 0.021 53.09 -0.009
12.87 0.002 19.69 -0.065 31.68 0.023 53.94 -0.012
12.99 0.018 20.12 -0.074 32.32 0.015 55.23 -0.012
13.11 0.029 20.55 -0.068 32.97 0.010 55.44 -0.006
13.23 0.045 20.76 -0.052 33.39 -0.001 56.30 -0.012
13.35 0.058 20.91 -0.044 34.46 -0.014 56.94 -0.009
13.47 0.069 21.19 -0.033 34.68 -0.020 58.01 -0.009
13.62 0.074 21.34 -0.025 35.11 -0.022 59.30 -0.012
13.77 0.091 21.62 -0.001 35.32 -0.036 59.72 -0.009
The starting values considered for the parameters, and used in previous reports [3,20], are the following: p1 = 0.35, p2 = 0.15, p3 = 0.8, p4 = 0.
The following R code is used to fit the model using symmetrical fitting****:
x=c(10.28,10.35,10.42,10.49,10.56,10.63,10.7,10.77,10.84,10.91,10.98,11.06,11.14,11.22,11.3,
11.38,11.46,11.54,11.62,11.7,11.78,11.99,12.33,12.63,12.75,12.87,12.99,13.11,13.23,13.35,
13.47,13.62,13.77,13.92,14.13,14.34,14.56,14.77,14.98,15.63,15.78,16.06,16.27,16.48,16.7,
16.91,17.13,17.28,17.34,17.49,17.55,17.77,17.98,18.41,18.56,18.62,19.05,19.27,19.69,20.12,
20.55,20.76,20.91,21.19,21.34,21.62,21.83,22.05,22.2,22.48,22.69,22.91,23.33,23.76,24.19,
24.62,24.83,25.26,25.61,25.9,26.33,27.19,27.4,28.04,28.47,28.69,28.9,29.54,30.61,31.04,
31.47,31.68,32.32,32.97,33.39,34.46,34.68,35.11,35.32,35.96,37.25,37.89,38.32,39.17,39.39,
40.24,41.1,41.53,41.96,42.6,43.46,44.1,44.95,45.38,45.6,46.67,47.74,48.59,49.24,49.88,
50.73,51.38,52.23,53.09,53.94,55.23,55.44,56.3,56.94,58.01,59.3,59.72)
****
Note that the data might be alternatively imported from a .csv file. Also note that the optimization results may differ between identical runs since stochastic optimization algorithms are used. However, if enough iterations in the optimization method are considered, the optimal values obtained should be similar.
y=c(0.338,0.279,0.236,0.206,0.198,0.069,0.139,0.109,0.061,0.05,0.039,0.01,-0.004,-0.009,-0.02,
-0.039,-0.03,-0.047,-0.052,-0.06,-0.047,-0.041,-0.03,-0.022,-0.009,0.002,0.018,0.029,0.045,
0.058,0.069,0.074,0.091,0.099,0.115,0.131,0.15,0.163,0.181,0.187,0.228,0.187,0.204,0.166,
0.15,0.139,0.115,0.128,0.096,0.08,0.058,0.039,0.031,0.01,-0.004,-0.025,-0.044,-0.057,-0.065,
-0.074,-0.068,-0.052,-0.044,-0.033,-0.025,-0.001,-0.004,0.015,0.029,0.042,0.053,0.061,0.069,
0.074,0.058,0.048,0.034,0.01,-0.001,-0.017,-0.036,-0.044,-0.044,-0.036,-0.028,-0.017,-0.012,
-0.001,0.004,0.013,0.021,0.023,0.015,0.01,-0.001,-0.014,-0.02,-0.022,-0.036,-0.022,-0.014,
-0.009,-0.004,0.004,0.01,0.004,-0.004,-0.004,-0.004,-0.017,-0.02,-0.02,-0.014,-0.012,-0.014,
-0.006,0.002,-0.004,-0.009,-0.009,-0.012,-0.017,-0.009,-0.009,-0.012,-0.012,-0.006,-0.012,
-0.009,-0.009,-0.012,-0.009)
nonoscillatoryexp<-function(x,param) exp(-param[1]*x)
oscillatoryexp<-function(x,param) exp(-param[2]*x)*cos(param[3]*x-param[4])
sm.fit(y,x,terms=c("nonoscillatoryexp","oscillatoryexp"),param0=c(0.35,0.15,0.8,0),display='iter',plots=TRUE)
[1] "Chess-Inspired Multi-Algorithm Optimization (CheMO)"
  iter# piece# piece_type counts       fn    par_1        par_2          par_3        par_4
1     0      1          Q      1 0.0215828756327954 0.35 0.15 0.8 0
1     1      1          Q    862 0.019196 0.374332 0.146445 0.79136 -0.069
1     2      1          Q   1716 0.018083 0.336898 0.146445 0.78661184 -0.0897
1     3      1          Q   2581 0.017816 0.321331 0.1491542325 0.785746566976 -0.06279
1     4      1          Q   3424 0.017809 0.315753 0.1491542325 0.785746566976 -0.056504721
1     5      1          Q   4330 0.017809 0.315753 0.1491542325 0.785746566976 -0.056504721
$bias
[1] -0.006105749
$coeff
terms coeff alphaS include
1 nonoscillatoryexp 9.738633 -1.460180 TRUE
2 oscillatoryexp 1.691305 -1.391505 TRUE
$par
[1] 0.31575300 0.15000000 0.78574657 -0.05650472
$partype
[1] "param" "const" "param" "param"
$dof
[1] 126
$model_performance
s R2 R2adj
0.01783783 0.94855812 0.94651677
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Normal 0.9824529 0.9990973 -6.924064 -12.10756
The graphical output for this example is shown in Figure 7 and Figure 8.
Figure 7. Nonlinear model (Eq. 4.1) describing the interaction stiffness as a function of cantilever gap in
AFM of liquid OMCTS, obtained by symmetrical fitting.
The optimization procedure stopped after the 5th iteration, and the model error decreased from 0.0216 to 0.0178. Both terms were found relevant for the model, but one of the additional parameters (p2) was considered a constant. That is, the initial parameter value can be satisfactorily used instead of the corresponding value obtained by optimization. The model fitted the data with a coefficient of determination of R2 = 0.9486. The model residuals were fitted using a normal distribution model with a coefficient of determination of 0.9825.
Figure 8. Residuals plots for the nonlinear model (Eq. 4.1) describing the interaction stiffness as a
function of cantilever gap in AFM of liquid OMCTS, obtained by symmetrical fitting.
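The fitted expression can also be evaluated directly from the reported output. The Python sketch below rebuilds the model from the bias, the two term coefficients, and the four nonlinear parameters (with p2 held at its constant value 0.15), using the same term functions defined in the R code above:

```python
import math

# Values taken verbatim from the sm.fit output above
b, c1, c2 = -0.006105749, 9.738633, 1.691305
p1, p2, p3, p4 = 0.31575300, 0.15, 0.78574657, -0.05650472

def stiffness(x):
    # non-oscillatory decay plus an exponentially damped oscillation
    return b + c1 * math.exp(-p1 * x) + c2 * math.exp(-p2 * x) * math.cos(p3 * x - p4)

pred = stiffness(10.28)  # first gap value in Table 4
print(round(pred, 3))    # the observed stiffness there is 0.338 N/m
```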
P̂v(T) = exp(A − B/(T + C))
(4.2)
where A, B, and C are model parameters.
This expression can be transformed using natural logarithms, resulting in:
ln P̂v(T) = A − B/(T + C)
(4.3)
where A represents the model bias, −B represents the term coefficient, and C is an additional model parameter.
Model (4.3) will be fitted for the vapor pressure of pure water at different temperatures in the range 255.85 K – 647.1 K [22-24]. The available experimental data is summarized in Table 5.
Table 5. Vapor pressure data for water between 255.85 K and 647.1 K [22-24].
T (K)   Pv (atm)   T (K)   Pv (atm)   T (K)   Pv (atm)
255.85 0.0013 333.15 0.197 473.15 15.35
273.16 0.0060 339.65 0.263 486.25 20.00
274.35 0.0066 343.15 0.308 493.15 22.89
275.15 0.0070 353.15 0.468 494.15 23.34
277.15 0.0080 356.15 0.526 498.15 25.17
283.15 0.0121 363.15 0.693 507.75 30.00
284.45 0.0132 369.15 0.866 513.15 33.03
287.15 0.0158 373.15 1.000 523.15 39.25
291.15 0.0204 379.15 1.230 524.25 40.00
293.15 0.0231 383.15 1.420 533.15 46.31
295.35 0.0263 393.15 1.960 537.85 50.00
298.15 0.0313 393.25 2.000 548.15 58.70
303.15 0.0419 398.15 2.290 549.65 60.00
307.15 0.0526 409.48 3.210 553.15 63.33
307.25 0.0526 413.15 3.570 573.15 84.76
313.15 0.0729 423.15 4.700 593.15 111.4
314.75 0.0789 425.55 5.000 613.15 144.1
317.15 0.0899 433.15 6.100 633.15 184.2
323.15 0.122 448.15 8.810 647.10 217.8
324.75 0.132 453.15 9.900
327.15 0.148 453.65 10.00
The following R code is used to fit the model, considering an initial parameter value of 0 for the additional parameter, and optimizing using CheMO:
T=c(255.85,273.16,274.35,275.15,277.15,283.15,284.45,287.15,291.15,293.15,295.35,298.15,
303.15,307.15,307.25,313.15,314.75,317.15,323.15,324.75,327.15,333.15,339.65,343.15,
353.15,356.15,363.15,369.15,373.15,379.15,383.15,393.15,393.25,398.15,409.48,413.15,
423.15,425.55,433.15,448.15,453.15,453.65,473.15,486.25,493.15,494.15,498.15,507.75,
513.15,523.15,524.25,533.15,537.85,548.15,549.65,553.15,573.15,593.15,613.15,633.15,
647.096)
Pv=c(0.0013,0.006,0.0066,0.007,0.008,0.0121,0.0132,0.0158,0.0204,0.0231,0.0263,0.0313,
0.0419,0.0526,0.0526,0.0729,0.0789,0.0899,0.122,0.132,0.148,0.197,0.263,0.308,0.468,
0.526,0.693,0.866,1,1.23,1.42,1.96,2,2.29,3.21,3.57,4.7,5,6.1,8.81,9.9,10,15.35,20,
22.89,23.34,25.17,30,33.03,39.25,40,46.31,50,58.7,60,63.33,84.76,111.4,144.1,184.2,
217.8)
Antfn<-function(T,par) (1/(T+par))
Antout=sm.fit(y=log(Pv),x=T,terms="Antfn",param0=0,display='iter')
Antout
$bias
[1] 11.71458
$coeff
terms coeff alphaS include
1 Antfn -3839.601 1 TRUE
$par
[1] -45.32275
$partype
[1] "param"
$dof
[1] 58
$model_performance
s R2 R2adj
0.02043926 0.99996256 0.99996127
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Normal 0.837445 0.9999939 -21.16899 -49.53544
Figure 9. Antoine model (Eq. 4.3) for the vapor pressure of water between 255.85 K and 647.1 K, obtained by symmetrical fitting.
The best residual distribution model was the normal distribution. However, the N-value obtained indicates that the residuals do not strictly follow a normal distribution. Nevertheless, since the magnitude of the residual error is so low, the lack-of-fit in the residual distribution model has a minimal impact on the model performance (randomistic R2).
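A quick way to confirm the fitted Antoine parameters is to evaluate the model at the normal boiling point of water, where the predicted vapor pressure should be very close to 1 atm. A Python sketch of this check (parameter values copied from the output above):

```python
import math

# Fitted Antoine model (Eq. 4.3): ln(Pv) = 11.71458 - 3839.601/(T - 45.32275)
bias, coeff, par = 11.71458, -3839.601, -45.32275

def pv(T):
    return math.exp(bias + coeff / (T + par))

print(round(pv(373.15), 3))  # close to 1 atm at the normal boiling point
```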
An alternative empirical model (although inspired by molecular mechanics) for describing
vapor pressure has been proposed [25]:
P̂v(T) = Pv,ref · ( T·erfc(θ/T) / (Tref·erfc(θ/Tref)) )^c
(4.4)
where Tref and Pv,ref represent a reference observation, and c and θ are the model parameters.
An unbiased logarithm transformation of this model is the following:
ln P̂v(T) = b + c·ln( T·erfc(θ/T) / (Tref·erfc(θ/Tref)) )
(4.5)
where b is the bias correction term, c is the coefficient of the only term in the model, and θ is an additional model parameter.
Using (Tref, Pv,ref) = (373.15 K, 1 atm), and considering an initial parameter value θ = 0, the following symmetrical model is obtained:
erfc<-function(x) 2*pnorm(x*sqrt(2),lower=FALSE)
mmfn<-function(T,par) log(T*erfc(par/T)/(373.15*erfc(par/373.15)))
mmout=sm.fit(y=log(Pv),x=T,terms="mmfn",param0=0,display='iter')
mmout
$bias
[1] 0.004348245
$coeff
terms coeff alphaS include
1 mmfn 3.036279 -1 TRUE
$par
[1] 431.2184
$partype
[1] "param"
$dof
[1] 58
$model_performance
s R2 R2adj
0.01606394 0.99997687 0.99997608
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Normal 0.911854 0.999998 -15.28745 -29.00121
$bias
[1] 0
$coeff
terms coeff alphaS include
1 mmfn 3.04 -1 TRUE
$par
[1] 430.8
$partype
[1] "param"
$dof
[1] 59
$model_performance
s R2 R2adj
0.01592956 0.99997687 0.99997647
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Normal 0.9052866 0.9999978 -15.51949 -28.14989
This allowed us to gain an additional degree of freedom for the estimation of model residuals.
Figure 10 shows the fitted model inspired by molecular mechanics, plotted using:
sm.plot(mmout,x=T,xname="Temperature (K)",yname="log(Vapor Pressure (atm))")
While both models have comparable performance, the model inspired by molecular mechanics
has a slightly lower model error.
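This model can be verified in the same way. At Tref = 373.15 K the model term vanishes by construction, so the prediction equals exp(b), and other temperatures can be checked against Table 5. A Python sketch of the check (using the first fit reported above):

```python
import math

# Fitted model (Eq. 4.5) with Tref = 373.15 K and Pv,ref = 1 atm
b, c, theta = 0.004348245, 3.036279, 431.2184
TREF = 373.15

def pv(T):
    ratio = T * math.erfc(theta / T) / (TREF * math.erfc(theta / TREF))
    return math.exp(b + c * math.log(ratio))

print(round(pv(373.15), 3))  # → 1.004 (the term is zero at Tref)
print(round(pv(298.15), 4))  # the observed value in Table 5 is 0.0313 atm
```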
Figure 10. Empirical model (Eq. 4.5) for the vapor pressure of water between 255.85 K and 647.1 K, obtained by symmetrical fitting.
k̂(T) = A·T^n·exp(−Ta/T)
(4.6)
where A, n, and Ta are model parameters.
Symmetrical fitting is used considering the experimental data reported by Baulch et al. [27]
corresponding to the rate coefficient of the reaction between hydrogen molecules and
monoatomic oxygen radicals determined at different temperatures. The data is loaded into the
R workspace using the following code‡‡‡‡:
x=c(3246.5,3178.1,2987.6,2987.1,2986.4,2818.4,2765.1,2764.7,2764,2665.3,2575.9,2493.6,
2447.3,2226.9,2225.8,2101.5,2101.3,1841.1,1775.3,1754.6,1694.6,1603.8,1602,1569.7,1537.5,
1491.5,1490.5,1490.3,1447.9,1433.7,1419.3,1393.6,1393.4,1393.3,1330.8,1330.6,1319.7,1319.1,
1284.7,1273.4,1251.6,1242.5,1241.7,1241.4,1201.9,1201.5,1191.5,1173,1103,1041.3,1034.3,
992.8,918.8,913.4,902.2,880.8,865.3,831.8,827,813.2,804.7,755.5,747.6,740.5,740.3,736.8,
733,715.5,676.4,673.3,622.4,612,607.1,590.3,590.3,572.1,572,550.7,544.7,521.8,519.8,516.3,
514.5,509.2,504,499,497.2,493.9,492.3,484.2,479.6,479.6,468.9,460.2,458.7,447.7,447.7,445,
441,437.2,425.9,425.9,425.8,422.2,415.1,412.9,412.9,408.3,405,396.3,396.3,391.1,384,378.1,
377.2,375.2,372.4,371.5,369.6,368.7,364.2,363.3,362.4,356.3,353.8,349.6,348,345.6,340,
336.1,328,322.3,321.6,318.8,317.4,315.4,300.7,297.7,297.7,295.9)
††††
Only one additional parameter is considered to correctly account for the degrees of freedom of the model, since the term coefficient represents the degree of freedom consumed by the second function.
‡‡‡‡
x represents temperature in Kelvin, and y represents the natural logarithm of the reaction rate coefficient in cm3/s.
y=c(-23.1658,-23.3932,-23.7342,-23.6489,-23.5067,-24.0467,-23.9045,-23.8192,-23.6487,
-23.8476,-24.7003,-25.8941,-24.4727,-24.643,-24.2735,-25.0124,-24.9555,-25.4667,-25.7224,
-25.8929,-26.0349,-26.6885,-25.523,-26.6316,-26.9157,-27.2852,-26.5176,-26.4039,-27.4271,
-27.3134,-26.7163,-27.6259,-27.4838,-27.3416,-27.4551,-27.3414,-28.2795,-27.6541,-27.5686,
-27.3127,-27.0568,-28.5634,-27.7674,-27.3978,-28.3642,-27.8241,-27.4829,-27.9945,-27.8804,
-28.3633,-28.9318,-29.1305,-29.0731,-29.5279,-29.3856,-29.6981,-29.4421,-30.6642,-30.4083,
-29.9248,-30.7491,-31.1466,-30.4642,-31.4022,-30.8337,-31.5727,-30.9757,-31.6293,-31.9983,
-32.1119,-32.1963,-32.1109,-32.5372,-33.3044,-33.0201,-33.6735,-32.8207,-33.0761,-33.4455,
-33.9282,-33.0753,-34.1555,-33.729,-34.07,-33.7856,-34.354,-33.7854,-34.1548,-33.9558,
-34.0124,-34.9219,-34.4102,-34.4099,-34.5802,-34.3243,-35.0062,-34.8356,-35.1198,-34.8638,
-35.3185,-35.6307,-35.3465,-35.2043,-35.5453,-35.4598,-36.1988,-35.9713,-35.7721,-36.0563,
-36.3117,-36.0843,-36.1978,-36.5954,-36.3677,-37.0784,-36.6234,-37.1634,-36.7369,-36.6516,
-37.1632,-37.4757,-37.1345,-37.2482,-37.4184,-37.6741,-37.418,-37.7306,-38.0716,-38.1849,
-37.872,-38.44,-38.5533,-38.3828,-38.6668,-38.4677,-38.7235,-39.0635,-39.2622,-39.12,
-39.0346)
$bias
[1] -49.94089
$coeff
terms coeff alphaS include
1 Tfn -2643.468 1 TRUE
$par
[1] -0.001289
$partype
[1] "param"
$dof
[1] 137
$model_performance
s R2 R2adj
0.3432904 0.9948727 0.9947978
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Normal 0.9936177 0.9999673 -2.496254 0.9542165
The results obtained with this model are graphically summarized in Figure 11, obtained with the
following code:
sm.plot(out,x,xname="Temperature [K]",yname="log(k [cm3/s])")
Figure 11. Graphical summary of model (4.6) identified using symmetrical fitting.
The purpose of this example is modeling the effect of sex on the change in HDL-C levels after 6
months of black tea consumption. Since sex is a categorical variable, it must first be
transformed into a suitable numerical variable (i.e. binary variable). This can be done as follows:
snum=scat=="Female"
snum is a Boolean variable, but it is interpreted by R as 0 for FALSE and 1 for TRUE.
Table 6. HDL-C levels difference of 28 adult individuals before and after 6 months of black tea
consumption [28]
ID   Sex   HDL-C Difference (mg/dL)
1 Female 10
2 Female 10
3 Female 6
4 Male 2
5 Female -2
6 Male 5
7 Male -3
8 Female 25
9 Female -11
10 Female -1
11 Female 13
12 Male 4
13 Female 1
14 Female 11
15 Female -13
16 Male -13
17 Male -4
18 Male 4
19 Female -18
20 Male -1
21 Male -7
22 Female 2
23 Male -5
24 Female 3
25 Female -5
26 Female 8
27 Female -25
28 Female -1
$bias
[1] -0.1785714
$coeff
terms coeff alphaS include
1 x 0 0 FALSE
$dof
[1] 26
$model_performance
s R2 R2adj
10.42944637 0.00000000 -0.03846154
$residual_model
model randomR2 randomisticR2 Nvalue Hvalue
1 Normal 0.9920027 0.9920027 2.16558 4.075575
That is, the effect of sex on the difference in HDL-C levels after 6 months of black tea
consumption is irrelevant.
This evaluation can be alternatively performed using the test of relevance (r.test), as follows:
r.test(diff[which(scat=="Female")],diff[which(scat=="Male")],plot=TRUE)
The resulting plot is shown in Figure 12. We can observe a higher dispersion of the results for females, but no evident difference in their means.
Figure 12. Relevance test for the effect of sex on the difference in HDL-C levels after 6 months of black
tea consumption.
This report provides data, information and conclusions obtained by the author(s) from original scientific
research, based on the best knowledge available to the author(s). The main purpose of this publication is
to openly share scientific knowledge. Any mistake, omission, error or inaccuracy published, if any, is
completely unintentional.
This research did not receive any specific grant from funding agencies in the public, commercial, or non-
profit sectors.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC
4.0). Anyone is free to share (copy and redistribute the material in any medium or format) or adapt
(remix, transform, and build upon the material) this work under the following terms:
Attribution: Appropriate credit must be given, providing a link to the license, and indicating if
changes are made. This can be done in any reasonable manner, but not in any way that suggests
endorsement by the licensor.
Non-Commercial: This material may not be used for commercial purposes.
References
[17] Hernandez, H. (2023). Optimal Model Structure Identification. 1. Multiple Linear Regression.
ForsChem Research Reports, 8, 2023-13, 1 - 53. doi: 10.13140/RG.2.2.31051.57121.
[18] Wen, X., Li, G., Gu, C., Zhao, J., Wang, S., Jiang, C., ... & Xiong, Q. (2018). Doubly enhanced second
harmonic generation through structural and epsilon-near-zero resonances in TiN nanostructures.
ACS Photonics, 5 (6), 2087-2093. doi: 10.1021/acsphotonics.8b00419.
[19] Maali, A., Cohen-Bouhacina, T., Couturier, G., & Aimé, J. P. (2006). Oscillatory dissipation of a
simple confined liquid. Physical Review Letters, 96 (8), 086105. doi:
10.1103/PhysRevLett.96.086105.
[20] Hernandez, H. (2023). Optimal Model Structure Identification. 2. Nonlinear Regression. ForsChem
Research Reports, 8, 2023-17, 1 - 55. doi: 10.13140/RG.2.2.25901.87527.
[21] Thomson, G. W. (1946). The Antoine equation for vapor-pressure data. Chemical Reviews, 38 (1), 1-
39. doi: 10.1021/cr60119a001.
[22] Stull, D. R. (1947). Vapor Pressure of Pure Substances. Inorganic Compounds. Industrial &
Engineering Chemistry, 39 (4), 540-550. doi: 10.1021/ie50448a023.
[23] Liu, C. T., & Lindsay Jr, W. T. (1970). Vapor pressure of deuterated water from 106 to 300 °C. Journal of Chemical and Engineering Data, 15 (4), 510-513. doi: 10.1021/je60047a015.
[24] The Engineering ToolBox (2010). Water - Heat of Vaporization vs. Temperature. Available at:
https://www.engineeringtoolbox.com/water-properties-d_1573.html. Last accessed: March 18,
2025.
[25] Hernandez, H. (2022). Molecular Modeling of Macroscopic Phase Changes 2: Vapor Pressure
Parameters. ForsChem Research Reports, 7, 2022-16, 1 - 43. doi: 10.13140/RG.2.2.10226.38086.
[26] Hernandez, H. (2019). Collision Energy between Maxwell-Boltzmann Molecules: An Alternative
Derivation of Arrhenius Equation. ForsChem Research Reports, 4, 2019-13, 1-27. doi:
10.13140/RG.2.2.21596.33926.
[27] Baulch, D. L., Bowman, C. T., Cobos, C. J., Cox, R. A., Just, T., Kerr, J. A., ... & Walker, R. W. (2005). Evaluated kinetic data for combustion modeling: Supplement II. Journal of Physical and Chemical Reference Data, 34 (3), 757-1397. doi: 10.1063/1.1748524.
[28] Davis, R. B., & Mukamal, K. J. (2006). Hypothesis Testing: Means. Circulation, 114 (10), 1078-1082.
doi: 10.1161/circulationaha.105.586461.