Predictive Approach To Model Selection and Validation in Statistical Learning
Abstract: The best model selection and the validation of the model are key issues in any
model-building process. The present paper summarizes the results of international research done
in Europe, Australia, and most recently in the United States. It discusses model selection and
validation in deep neural networks based on their prediction errors and provides some insights
into how to improve their accuracy in a very cost-effective way.
Keywords: models; machine learning; data mining; predictive analytics; accuracy; model
evaluation and selection; validation; artificial neural networks; deep learning networks; statistical
learning networks; group method of data handling; multi-layered networks of active neurons
1. INTRODUCTION
The best model selection and the validation of the model are key issues in any
model-building process. Procedures and protocols for model verification and validation are
an ongoing field of academic study, research, and development in both simulation
technology and practice.
Models are the basis and one of the most important pillars of Data Mining (DM),
Machine Learning (ML), and Predictive Analytics. ML and DM often employ the same
methods and overlap significantly, but ML focuses on prediction, based on known properties
learned from the training data, while DM focuses on the discovery of previously unknown
properties in the data, which is the analysis step of knowledge discovery in databases.
Predictive analytics encompasses a variety of techniques from statistics, ML and
DM, which analyze current and historical facts to make predictions about future or otherwise
unknown events. A common definition of predictive analytics is: “The use of statistics and
modeling to determine future performance based on current and historical data. Predictive
analytics look at patterns in data to determine if those patterns are likely to emerge again.
This allows businesses and investors to adjust where they use their resources to take
advantage of possible future events. Predictive analysis can also be used to improve
operational efficiencies and reduce risk.”2
1 Mihail Motzev – Retired professor at Walla Walla University School of Business, College Place WA, USA, [email protected]
2 See: https://www.investopedia.com/terms/p/predictive-analytics.asp (Accessed: 02 November 2023).
Models must be accurate enough to maximize the effectiveness of their use
in any of the above fields. Ever-increasing model accuracy helps researchers analyze
problems more precisely, which leads to a deeper and better understanding. Models with
higher accuracy simply produce better predictions and support managers in making better
decisions that are much closer to the real-life business case.
A common step in any model-building process is model selection. Model selection
is the task of choosing the best model from among various candidates on the basis of a
performance criterion. Model selection may also refer to the problem of selecting
a few representative models from a large set of computational models, most commonly in
Artificial Neural Networks (ANNs).
Model validation is a task closely related to model selection, but it does not
concern the conceptual design of the model so much as it evaluates the consistency
between a chosen model and its stated outputs. It is usually defined to mean
“substantiation that a computerized model within its domain of applicability
possesses a satisfactory range of accuracy consistent with the intended application of the
model” (Schlesinger et al., 1979).
The present paper discusses model selection and validation in deep neural
networks based on their prediction errors and provides some insights into how to improve
their accuracy in a very cost-effective way. It presents results from international research
done in Europe, Australia, and most recently in the United States, and summarizes the
previous research projects discussed in Motzev (2015; 2018a; 2018b; 2019; 2021).
3 See: https://www.mygreatlearning.com/blog/what-is-machine-learning/ (Accessed: 02 November 2023).
The famous “Turing Test,” which would ascertain whether computers had real
intelligence, was created in 1950 by Alan Turing (Turing, 1950). To get through the test, a
computer has to make a human believe that it is not a computer but a human. Arthur Samuel
developed the first computer program that could learn as it played the game of checkers in
1952. The first neural network, called the perceptron, was designed by Frank Rosenblatt in
1957, according to The New Yorker4.
IBM has a rich history with Machine Learning (ML), and according to its studies (IBM,
2023) Arthur Samuel is credited with coining the term “Machine Learning” through his
research (Samuel, 1959) around the game of checkers. Moreover, according to Sara Brown
at the MIT Sloan School of Management (Brown, 2023), ML was defined in the 1950s by AI
pioneer Arthur Samuel as “the field of study that gives computers the ability to learn without
explicitly being programmed.”
ML is a branch of AI and computer science which focuses on the use of data and
algorithms to imitate the way that humans learn, gradually improving in accuracy. In general,
ML is an umbrella term for solving problems for which the development of algorithms by
human programmers would be cost-prohibitive; instead, the problems are solved by helping
machines "discover" their "own" algorithms (Alpaydin, 2020), without their needing to be
explicitly told what to do by any human-developed algorithm.
The big shift happened in the 1990s when ML moved from being knowledge-driven
to a data-driven technique due to the availability of huge volumes of data. Businesses
recognized that the potential for complex calculations could be increased through ML.
ML is an application of AI that uses statistical techniques to enable computers to learn
and make decisions without being explicitly programmed. It is predicated on the notion that
computers can learn from data, spot patterns, and make judgments with little assistance
from humans. Good quality data is fed to the machines, and different algorithms are used to
build ML models to train the machines on this data. The choice of algorithm depends on the
type of data at hand and the type of activity that needs to be automated.
In general, there are seven steps of ML:
1. Gathering Data
2. Preparing that Data
3. Choosing a Model
4. Training
5. Evaluation
6. Hyperparameter Tuning
7. Prediction
The selection of the best model (i.e. choosing a model) and its validation (i.e.
evaluation) are key issues in any model-building process, as we mentioned above. A typical
ML process can be summarized with the three major steps of the model-building process
as shown in Fig. 1.
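Purely as an illustration, the seven steps can be sketched as a generic Python workflow. The scikit-learn library, the file name sales_history.csv, the target column, and the Ridge parameter grid below are all assumptions of this sketch, not part of the original paper.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_percentage_error

# 1. Gathering data (hypothetical file and column names)
data = pd.read_csv("sales_history.csv")

# 2. Preparing that data: split features/target and training/testing sets
X, y = data.drop(columns="target"), data["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 3. Choosing a model (scaling + ridge regression, a placeholder choice)
model = Pipeline([("scale", StandardScaler()), ("reg", Ridge())])

# 4.-6. Training, evaluation, and hyperparameter tuning in one grid search
search = GridSearchCV(model, {"reg__alpha": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# 5. Evaluation on the held-out testing dataset
print("test MAPE:",
      mean_absolute_percentage_error(y_test, search.predict(X_test)))

# 7. Prediction on new, unseen observations
print(search.predict(X_test.iloc[:5]))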
4 See: https://www.newyorker.com/tech/annals-of-technology/hyping-artificial-intelligence-yet-again (Accessed: 02 November 2023).
Fig. 2 Knowledge Discovery in Databases process (Source: Fayyad et al., 1996, p. 41)
There have been some efforts to define standards for the DM process. For
exchanging the extracted models, in particular for use in predictive analytics, the key
standard is the Predictive Model Markup Language (PMML), an XML-based
language developed by the Data Mining Group (DMG) and supported as an exchange
format by many DM applications. This standard covers only prediction models, a particular
DM task of high importance to business applications (see Günnemann et al., 2011).
We discussed KDD and DM in detail in Motzev and Lemke (2015) and Motzev (2021).
In this paper we concentrate on the model component of the DM process, its selection, and
its validation. In general, DM platforms assist and automate the process of building and
training highly sophisticated models and applying these models to larger datasets. A white
paper published by MicroStrategy (2005, pp. 162-173) describes the DM process in detail
and emphasizes three main steps, illustrated with a short sketch after the list:
1. Create a predictive model from a data sample – Advanced statistical and
mathematical techniques like regression analysis and ML algorithms are used to
identify the significant characteristics and trends in predicting responsiveness, and a
predictive model is created using these as inputs.
2. Train the model against datasets with known results – The new predictive
model is applied to additional data samples with known outcomes to validate
whether the model is reasonably successful at predicting the known results. This
gives a good indication of the accuracy of the model. It can then be further trained
using these samples to improve its accuracy.
3. Apply the model against a new dataset with an unknown outcome – Once the
predictive model is validated against the known data, it is used for scoring, which is
defined as the application of a DM model to forecast an outcome.
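As a hedged illustration only, the three steps might be expressed along the following lines; the logistic model and the synthetic datasets are placeholders for this sketch, not part of the cited white paper.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])           # hidden data-generating weights

# 1. Create a predictive model from a data sample
X_sample = rng.normal(size=(200, 3))
y_sample = (X_sample @ true_w + rng.normal(size=200) > 0).astype(int)
model = LogisticRegression().fit(X_sample, y_sample)

# 2. Train/validate the model against additional data with known outcomes
X_known = rng.normal(size=(100, 3))
y_known = (X_known @ true_w + rng.normal(size=100) > 0).astype(int)
print("validation accuracy:", accuracy_score(y_known, model.predict(X_known)))

# 3. Apply the model against a new dataset with an unknown outcome (scoring)
X_new = rng.normal(size=(10, 3))
scores = model.predict_proba(X_new)[:, 1]     # predicted probability of class 1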
As we mentioned above, ML and DM often employ the same methods, and ANNs are
one of the techniques commonly used by both of them. From a DM perspective, ANNs are
just another way of fitting a model to observed historical data in order to be able to make
classifications or predictions. In the following section we discuss the most advanced
ANNs used in ML and DM – Deep Learning Neural Networks.
Fig. 4 The scope and hierarchy of AI, ML and Deep Learning (Source:
https://en.wikipedia.org/wiki/Machine_learning#cite_note-journalimcms.org-22
Accessed: 02 November 2023)
Most modern deep learning models are based on multi-layered ANNs similar to the
one presented in Fig. 5. Deep learning uses multiple layers to progressively extract higher-
level features from the raw input and can refer to "computer-simulating" or "automating"
human learning processes from a source to a learned object. In this sense, notions such
as "deeper" or "deepest" learning make sense.
Technically, the “deep” in deep learning refers simply to the number of layers in an
ANN. A neural network that consists of more than three layers, inclusive of the input
and the output layers, can be considered a deep learning algorithm or a Deep Neural
Network (DNN). An ANN that has only three layers is just a basic neural network.
However, to obtain an extremely robust ANN (a DNN or any other type of ANN), the
model, the cost function, and the learning algorithm must be appropriately selected. Like
typical ANNs, DNNs can suffer from many issues if they are naively trained. As we
discussed in Motzev and Lemke (2015) and Motzev (2018a; 2018b; 2021), the most
common issues are overfitting and computation time, due to increased model and
algorithmic complexity, which results in very significant computational resource and time
requirements. Other important questions, such as “How can secondary data series be
inferred for the network generating process?” or “What should the number of input nodes
for the ANN be (i.e. what is the order of the model)?”, should be considered too.
Furthermore, concerning the ANN architecture, other questions need proper addressing,
such as “What should the number of hidden nodes be?” or “Which is the best activation
function in any given instance?”
It is worth pointing out that “The first general, working learning algorithm for
supervised, deep, feedforward, multilayered perceptron(s) (see Fig. 6) was published by
Ivakhnenko and Lapa in 1967” (https://en.wikipedia.org/wiki/Deep_learning, section “History”).
Alexey G. Ivakhnenko (1968) introduced the Group Method of Data Handling
(GMDH)5 as an inductive approach to model building based on self-organization principles.
It is also referred to as Polynomial Neural Networks, Abductive and Statistical Learning
Networks (SLNs) (see http://gmdh.net/). For many years, GMDH has proven to be one of
the most successful methods in SLNs (Müller and Lemke, 2003; Onwubolu, 2009).
Statistical Learning Theory deals with the problem of finding a predictive function
(model) based on a given data set (see Fig. 7). In general, it is a framework for ML drawing
from the fields of statistics and functional analysis (Hastie et al., 2017; Mohri et al., 2012).
In GMDH algorithms, models are generated adaptively from input data in the form of
an ANN of active neurons, in a repetitive generation of populations of competing partial
models of growing complexity. A limited number of models is selected from generation to
generation by cross-validation, until a model of optimal complexity is finalized.
5 GMDH is a method of inductive statistical learning. See: http://en.wikipedia.org/wiki/Group_Method_of_Data_Handling (Accessed: 02 November 2023).
Fig. 6 General scheme of GMDH Self-Organizing modeling algorithm (Source: Madala &
Ivakhnenko, 1994, p. 8)
This modeling approach (see Fig. 6) grows a tree-like network out of the data on the
input and output variables, through pair-wise combination and competitive selection, from
a single neuron to a final output – a model without predefined characteristics. Here, neither
the number of neurons, nor the number of layers in the network, nor the actual behavior of
each created neuron is predefined. The modeling is self-organizing because the number of
neurons, the number of layers, and the actual behavior of each created neuron are identified
during the learning process from layer to layer.
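As a rough, hedged illustration of this generation-and-selection loop (not Ivakhnenko's original algorithm, which has many variants), a minimal GMDH-style layer builder might look as follows. The quadratic partial-model form, the layer width, and all names are assumptions of the sketch.

import numpy as np
from itertools import combinations

def partial_design(a, b):
    """Design matrix of one partial model: quadratic in the pair (a, b)."""
    return np.column_stack([np.ones_like(a), a, b, a * b, a * a, b * b])

def fit_gmdh(X, y, val_frac=0.3, width=6, max_layers=5):
    n = len(y)
    split = int(n * (1 - val_frac))          # train part vs. external part
    best_err, layer_inputs = np.inf, X
    for _ in range(max_layers):
        candidates = []
        # generate a population of competing pairwise partial models
        for i, j in combinations(range(layer_inputs.shape[1]), 2):
            D = partial_design(layer_inputs[:, i], layer_inputs[:, j])
            coef, *_ = np.linalg.lstsq(D[:split], y[:split], rcond=None)
            out = D @ coef
            # external criterion: error on the held-out part only
            err = np.sqrt(np.mean((y[split:] - out[split:]) ** 2))
            candidates.append((err, out))
        candidates.sort(key=lambda c: c[0])
        if candidates[0][0] >= best_err:     # no further improvement: stop
            break
        best_err = candidates[0][0]
        # survivors of this generation become the inputs of the next layer
        layer_inputs = np.column_stack([out for _, out in candidates[:width]])
    return layer_inputs[:, 0], best_err      # fitted values of the best model

# toy usage with synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=120)
fitted, err = fit_gmdh(X, y)
print("external-criterion RMSE of selected model:", round(err, 4))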
SLNs such as GMDH can address the common problems of ANNs and the DNNs in
particular (see Table 1). “One of the great things about deep learning is that users can
essentially just feed data to a neural network, or some other type of learning model, and
the model eventually delivers an answer or recommendation. The user does not have to
understand how or why the model delivers its results; it just does. But some enterprises are
finding that the black box nature of some deep learning models -- where their functionality
isn't seen or understood by the user -- isn't quite good enough when it comes to their most
important business decisions” (Burns, 2017).
Fig. 7 Statistical Learning Theory (Source: unknown)
The lack of transparency into how deep learning models work keeps some
businesses from accepting them fully, and special techniques should be used to
address this problem. Müller and Lemke (1995) provided comparisons and pointed out that,
in contrast to typical ANNs, the results of GMDH algorithms are explicit mathematical
models that are generated in a relatively short time, even on the basis of small samples.
Another important problem, for example, is that deep learning and neural network
algorithms can be prone to overfitting. Following Gödel's (1931; 2001) incompleteness
theorems and Beer (1959, p. 280), only external criteria, calculated on new,
independent information (cross-validation in GMDH, see Fig. 8), can produce the minimum
of the prediction model error.
SLNs also make it possible to overcome the common problems in designing a DNN
topology, which is in general a trial-and-error process, with no rules for how to use the
theoretical a priori knowledge in DNN design, how to select the number of input nodes, and
so on (Müller and Lemke, 2003). In summary, GMDH algorithms combine, in a powerful
and cost-effective way, the best features of ANNs and statistical techniques.
One of the most useful methods in selection problems is cross-validation (also
called rotation estimation, out-of-sample testing, predictive sample reuse, or external
criteria). The main idea is simply to split the data into two parts, use one part to derive a
prediction rule, and then judge the goodness of the prediction by matching its outputs
against the rest of the data (hence the name cross-validation). In other words, cross-
validation is a method for model selection according to the predictive ability of the model.
One of the appealing characteristics of cross-validation is that it is applicable to a wide
variety of problems, thus giving rise to applications in many areas. Examples include, but
are not limited to, the choice of smoothing parameters in nonparametric smoothing and
variable selection in regression.
Cross-validation is also a model validation technique for assessing how the results
of a data analysis will generalize to an independent dataset (Fig. 9). It is mainly used in
studies where the goal is prediction, and we want to estimate how accurately a predictive
model will perform in practice. In the prediction problem, a model is usually given a dataset
of known data on which training is run (training dataset), and a dataset of unknown data
(or first seen data) against which the model is evaluated (testing or validating dataset).
The goal of cross-validation is to evaluate the model's ability to predict new data
that was not used in estimating it, in order to flag problems like overfitting or selection bias
and to give an insight into how the model will generalize to an independent (i.e., an
unknown) dataset.
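A minimal holdout sketch of that idea, assuming numpy and a deliberately over-flexible polynomial as the candidate model: the high-degree fit looks excellent on the training part but is flagged as overfitting by its error on the held-out part. The degrees, sample size, and noise level are made up for the illustration.

import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 40)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=40)

idx = rng.permutation(40)                 # random two-part split of the data
train, test = idx[:30], idx[30:]

for degree in (3, 9):
    coef = np.polyfit(x[train], y[train], degree)
    err = lambda s: np.sqrt(np.mean((y[s] - np.polyval(coef, x[s])) ** 2))
    # the degree-9 model typically wins on train but loses on test
    print(f"degree {degree}: train RMSE {err(train):.3f}, "
          f"test RMSE {err(test):.3f}")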
In statistics and econometrics, cross-validation was recognized and utilized with the
publications of M. Stone (1974; 1977a; 1977b), who championed this method in his works. It
was in his early paper (Stone, 1974) that he drew researchers' attention to the general
procedure of splitting a sample into two parts and formulated a characterization of this
procedure, so that researchers could attempt to place it in proper relation to the more
standard methods. This simple idea of splitting a sample in two, developing the hypothesis
on the basis of one part, and testing it on the remainder (i.e. cross-validation, as shown
in Fig. 9) was one of the most seriously neglected ideas in statistics for a long period of time,
until the 1990s when, as mentioned already, DM bridged the gap from applied statistics
and ML to database management.
To evaluate the accuracy, we have to use genuine data. Predictions are not perfect,
and their results usually differ from the real-life values. The difference between the actual
value and the predicted value for the corresponding period is referred to as a prediction
error. It is defined as the actual value of the outcome minus the value predicted by the
model (1):
$e_t = y_t - F_t$ (1)

The Mean Forecast Error (MFE) (2) is the average of these errors over the N observations:

$\mathrm{MFE} = \frac{1}{N}\sum_{t=1}^{N} e_t$ (2)
A good forecasting method will provide a fitted model with zero MFE calculated on
the training dataset. Since the selection and validation of the model are done using the
testing dataset, it is almost impossible to have an MFE with a zero value, and the goal is to
minimize the bias. The forecasts can then be improved by adjusting the forecasting model
by an additive constant that equals the MFE of the unadjusted errors.
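A small sketch of this bias adjustment, following equations (1) and (2); the numbers are made up, with the forecasts systematically low.

import numpy as np

actual   = np.array([102., 98., 105., 110., 96.])
forecast = np.array([ 99., 95., 101., 107., 93.])   # systematically low

errors = actual - forecast          # e_t = y_t - F_t          (1)
mfe = errors.mean()                 # MFE = (1/N) * sum(e_t)   (2)
adjusted = forecast + mfe           # shift the model by its measured bias
print("MFE before:", mfe,
      "after:", (actual - adjusted).mean())   # ~0 by construction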
Predictions of outcomes are rarely precise, and a researcher can only endeavor to
make the inevitable errors as small as possible. Each model represents a real-life process
or system with some accuracy, related to the particular size of the prediction error. It is
good to know some general facts, for instance, that prediction accuracy decreases as the
time horizon increases, i.e. short-range predictions usually contend with fewer uncertainties
than longer-range predictions, and thus they tend to be more accurate. However, it is more
important to know the degree of each particular model's accuracy.
$\mathrm{MPE}\,(\%) = \frac{1}{N}\sum_{t=1}^{N} (e_t / y_t) \times 100$ (3)
In terms of its computation (3), MPE is an average percentage error with the
following properties. It:
• represents the percentage of average error which occurred while forecasting.
• is independent of the scale of measurement but affected by data transformations.
• (like MFE) shows the direction of the errors that occurred.
• does not penalize extreme deviations.
• like MFE, lets opposite-signed errors affect each other and cancel out; i.e., from an MPE value close to zero, we cannot conclude that the model is perfect.
• should, for a good model with minimum bias, be as close to zero as possible.
3.3.2. Mean Absolute Percentage Error (MAPE)
MAPE (4) puts errors in perspective. It is useful when the size of the predicted
variable is important in the evaluation. It provides an indication of how large the model errors
are in comparison to the actual values. It is also useful for comparing the accuracy of
different models on the same or different data.
$\mathrm{MAPE}\,(\%) = \frac{1}{N}\sum_{t=1}^{N} (|e_t| / y_t) \times 100$ (4)
MAPE's important features are:
• as a measure, it represents the percentage of average absolute error that occurred.
• it is independent of the scale of measurement but affected by data transformations.
• unlike MFE, MAPE does not show the direction of the error.
• it does not penalize extreme deviations.
• opposite-signed errors do not offset each other (i.e., the effects of positive and negative errors do not cancel out).
• for a good model, the calculated MAPE should be as small as possible.
3.3.3. Mean Squared Error (MSE) & Mean Squared Prediction Error (MSPE)
MSE or MSPE (5) is the average of the squared errors for the testing dataset, i.e. the
differences between the actual and the predicted values at period t:
$\mathrm{MSPE} = \sum_{t=1}^{N} e_t^2 \,/\, (N-1)$ (5)
Technically, the MSPE is the second moment about the origin of the error, and thus
incorporates both the variance of the estimator and its bias. For an unbiased estimator, the
MSPE is the variance of the estimator. Like the variance, MSPE has the same units of
measurement as the square of the quantity being estimated. It has the following properties:
• it is a measure of average squared deviation of forecasted values.
• since the opposite signed errors do not offset one another, MSPE gives an
overall idea of the error.
• it penalizes extreme errors (it squares each) which occurred while forecasting.
• MSPE emphasizes the fact that the total model error is in fact greatly affected
by large individual errors, i.e. large errors are more expensive than small ones.
• MSPE does not provide any idea about the direction of overall error.
• it is sensitive to the change of scale and data transformations.
• although MSPE is a good measure of overall forecast error, it is not as intuitive
and easily interpretable as the other measures discussed above.
Because of these disadvantages, researchers mostly use the MSPE square root.
3.3.4. Root Mean Squared Prediction Error (RMSPE)
RMSPE (6) is the square root of the calculated MSPE. In analogy to the standard
deviation, taking the square root of MSPE yields the root mean squared prediction error,
which has the same units as the quantity being estimated:

$\mathrm{RMSPE} = \sqrt{\mathrm{MSPE}}$ (6)
Unlike MSPE, RMSPE measures the model's error in the same units as the original
data, and it has an easy and clear business interpretation. It is a measure of the average
squared deviation of the predicted values, and since opposite-signed errors do not offset
one another, RMSPE gives an overall idea of the error that occurs.
Most importantly, RMSPE (like MSPE) emphasizes the fact that the total error is much
affected by large individual errors. It is the square root of the average of squared errors. The
effect of each error on RMSPE is proportional to the size of the squared error, i.e. larger
errors have a disproportionately large effect on RMSPE.
The RMSPE serves to aggregate the magnitudes of the errors in predictions for
various data points into a single measure of predictive power. Thus, it is a good measure of
accuracy, but only to compare errors of different models for a particular variable (or data
set) and not between variables (datasets), as it is scale-dependent. Another important
application of RMSPE is that it can be used directly to construct prediction intervals (see
Section 3.4 below).
3.3.5. Coefficient of Variation of the RMSPE CV(RMSPE)
CV(RMSPE) (7) is the RMSPE normalized to the mean of the real values:

$\mathrm{CV(RMSPE)} = \mathrm{RMSPE}\,/\,\bar{y}$ (7)

where $\bar{y}$ is the mean of the actual values.
It is the same concept as the coefficient of variation (CV) in statistics except that
RMSPE replaces the standard deviation. The CV is useful because the standard deviation
of data must always be understood in the context of the mean of these data. In contrast, the
actual value of the CV is independent of the unit in which the measurement has been taken,
so it is a dimensionless number. For comparison between datasets with different units or
widely different means, we should use the CV instead of the standard deviation. The smaller
the CV(RMSPE) value, the better the model.
When the mean value is close to zero, the CV(RMSPE) will approach infinity and is
therefore sensitive to small changes in the mean. This is often the case if the values do not
originate from a ratio scale. Finally, unlike the RMSPE, it cannot be used directly to construct
prediction intervals.
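The measures above are straightforward to compute together. The following is a compact sketch of equations (1)-(7), assuming numpy; note that MSPE divides by N-1, following equation (5) exactly as given.

import numpy as np

def prediction_errors(actual, forecast):
    actual, forecast = np.asarray(actual, float), np.asarray(forecast, float)
    e = actual - forecast                               # (1)
    n = e.size
    mfe   = e.mean()                                    # (2) signed bias
    mpe   = 100 * np.mean(e / actual)                   # (3) signed, in %
    mape  = 100 * np.mean(np.abs(e) / actual)           # (4) unsigned, in %
    mspe  = np.sum(e ** 2) / (n - 1)                    # (5)
    rmspe = np.sqrt(mspe)                               # (6)
    cv    = rmspe / actual.mean()                       # (7) dimensionless
    return {"MFE": mfe, "MPE%": mpe, "MAPE%": mape,
            "MSPE": mspe, "RMSPE": rmspe, "CV(RMSPE)": cv}

# usage on a testing dataset (made-up numbers)
print(prediction_errors([100, 110, 120], [96, 113, 118]))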
Supposition One – inferences:
As we can see, each of the measures above has some unique properties that differ
from those of the other measures, and which metric should be used depends on the
particular case and its specific goals. Experienced researchers normally use the criteria
MPE, MAPE, RMSPE, and CV(RMSPE) together. The main benefit of this group is that it
provides good information about both the bias and the precision of the model.
RMSPE and MPE represent the trueness (systematic error, statistical bias) and
thus measure model usefulness and reliability. MPE shows the direction of the bias, while
RMSPE represents its absolute value and can be used directly to construct prediction
intervals for the error. When selecting a good model based on a testing dataset, it is
desirable that both criteria be as close to zero as possible.
To measure the model's precision (i.e. its random error), we should use MAPE and
CV(RMSPE) in tandem. Since CV(RMSPE) penalizes extreme errors (i.e. it is sensitive to
outliers) and MAPE does not, the researcher's first goal should be to select a model where
the calculated values of both criteria are very close (almost equal), i.e. where there are no
extreme error values. The second goal, as usual, is that these criteria values be as close to
zero as possible. Another advantage is that MAPE and CV(RMSPE) are the most common
measures used to compare the accuracy of two different models and select the best one
(we discuss more details in Motzev, 2021, pp. 65-96).
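A hedged sketch of this Supposition One rule, reusing the prediction_errors() function from the earlier sketch: prefer candidates whose MAPE and CV(RMSPE) are close (no extreme errors), then the smallest error. The candidate models and tolerance logic are made up for illustration.

candidates = {                    # made-up testing-dataset predictions
    "model A": [98, 112, 117],
    "model B": [101, 108, 123],
}
actual = [100, 110, 120]

def supposition_one_key(m):
    # CV(RMSPE) is a fraction and MAPE a percentage: put both on the % scale,
    # minimize their gap first (no extreme errors), then the error itself
    gap = abs(m["MAPE%"] - 100 * m["CV(RMSPE)"])
    return (gap, m["MAPE%"])

best = min(candidates, key=lambda name:
           supposition_one_key(prediction_errors(actual, candidates[name])))
print("selected:", best)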
6 In fact, the main goal of these experiments was to prove the advantages of utilizing SLNs in business simulations: SLNs such as MLNAN provide opportunities both to shorten the design time and to reduce the cost and effort of model building, as well as to develop reliably even complex models with a high level of accuracy.
In point estimation, researchers try to find a unique point (the average prediction) in
the parameter space which can reasonably be considered the true value of the parameter.
Alternatively, instead of a unique estimate of the parameter, we may be interested in
constructing a family of sets that contain the true (unknown) parameter value with a
specified probability. In many problems we want not only to calculate a single value (the
average prediction) of the parameter, but to get a lower and an upper bound for it as well.
For this purpose, researchers construct a prediction interval. We can calculate the
upper and lower limits of the interval from the given data using the RMSPE. This estimation
provides a range of values within which the parameter is expected to lie. It generally gives
more information than a point estimate and is preferred when making inferences. For easier
and better understanding, the upper limit of the interval is often called the optimistic (or
Maximum) prediction and the lower limit the pessimistic (or Minimum) prediction (see Fig. 11).
Supposition Two – inferences:
When selecting the best model, prediction intervals can improve the selection
process. The MPE, which shows the direction of the systematic error, can be used to make
more precise decisions. If its value is very close to zero, we should select the average
prediction model. When MPE is negative, we should select the minimum prediction model,
and finally, when it is positive, we should select the maximum prediction model.
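A minimal sketch of this decision rule: build the Min/Average/Max versions from the RMSPE and pick one by the sign of MPE. The z multiplier (1.96, roughly a 95% interval under normal errors) and the near-zero tolerance are assumptions of the sketch, not values from the paper.

import numpy as np

def select_version(forecast, mpe, rmspe, z=1.96, tol=0.5):
    forecast = np.asarray(forecast, float)
    versions = {"min": forecast - z * rmspe,    # pessimistic bound
                "avg": forecast,                # point (average) prediction
                "max": forecast + z * rmspe}    # optimistic bound
    if abs(mpe) <= tol:          # MPE (%) close to zero -> average prediction
        return "avg", versions["avg"]
    # negative MPE: the model over-predicts on average -> take the minimum
    # prediction; positive MPE -> take the maximum prediction
    return ("min", versions["min"]) if mpe < 0 else ("max", versions["max"])

choice, adjusted = select_version([100, 105, 110], mpe=-2.1, rmspe=3.0)
print(choice, adjusted)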
Fig. 10 General view of Point Estimate, Confidence and Prediction Intervals
The results presented in Tables 4 and 5 give more information about the selection
process. The good models were first evaluated using multiple criteria, and among their Max,
Min, and Average versions (i.e. the prediction interval calculated with RMSPE) the best
model was selected using the prediction bias (the systematic error, measured with MPE)
and the model's precision (i.e. its random error), represented by MAPE and CV(RMSPE).
As we can see, this approach helps to improve the selection of the best model. For
example, in Case 1, presented in Fig. 12, the best model was the Optimistic (Max) version
of MLNAN, and in Case 2 (see Fig. 13) the Pessimistic (Min) version of MLNAN.
It should be noted that the dataset in Case 2 was made up in such a way that the best
models which students can develop with traditional model-building approaches will have a
precision of about 10% random error (measured by MAPE and CV(RMSPE) on the testing
dataset). The goal of the case is to improve the model's accuracy with the help of the
selection procedure, using model errors measured by multiple criteria, and prediction intervals.
Fig. 12 Predictions Accuracy using different models in 2020 (Case 1) - Source: Own data
Fig. 13 Different Predictions Accuracy - same model (2020 Case 2) - Source: Own data
The typical case in cross-validation procedures is a model that is more or less biased
for the testing dataset, as measured with MPE. The prediction interval allows us to identify
the best version of the model, as presented in Fig. 13. These experimental results were
confirmed in 2021, as we can see in Tables 4 and 5.
The data in both tables show how the inferences from Supposition Two stated above
help to improve the selection procedure for all three models presented there. The MPE gives
the direction of the bias for the general average model, and then the corresponding version
(Min or Max) of the model is chosen. In this way, we reduce both the bias and the random error.
Another important issue is the computational process. Unfortunately, most of the
existing model-building software does not provide options for constructing prediction
intervals and obtaining Min and Max versions of the selected best models. For this reason,
we designed special templates for MS Excel which compute the most important error
measures, such as MPE, MAPE, RMSPE, and CV(RMSPE), and construct prediction
intervals for models created by other software. These templates are modules that can be
used in any MS Excel file and can even be modified with the MS VBA editor7.
7 These templates are part of another research project and will be presented in future editions of the VSIM journal.
Table 5. Experimental Results in 2021 – Case 2 Errors: Different versions of ARMAX model
Fig.15 Forecasts with a Composite model and actual data (2021 Case 2) - Source: Own data
This Supposition indeed requires many more experiments and, more importantly,
automation of the procedure, because all of the analyses, tests, and other experiments
were done manually.
4. CONCLUSIONS
SLNs provide processed data that are needed in the business context, and the
extracted information is useful to decision makers because it creates value or predicts
market behavior in a way that leads to competitive advantages. In order to maximize the
effectiveness of their use, SLN models must be accurate enough. The higher the accuracy,
the better the understanding and the predictions of the analyzed problem. This supports
managers in making better decisions that are much closer to the real-life business problem.
Model selection and validation procedures can help to find the best model in terms
of its accuracy, i.e. the minimum prediction bias (systematic error) and the maximum
precision (or minimum random error). The present paper discusses model selection
and validation in deep neural networks such as SLNs, based on their prediction errors
and intervals. To achieve this goal, three Suppositions were made and examined.
Supposition One: it is important to consider more than one evaluation criterion, which
helps to obtain reasonable knowledge about the amount, magnitude, and direction of
the overall model error.
Supposition Two: predictions (i.e. the values calculated by the model for the testing
dataset) are closer to intervals than to a single point, i.e. it is better to consider the
calculated values as intervals rather than as point estimates.
Supposition Three: the best model is a composite model because there is no single
technique/model that works in every situation.
For five years in a row (2017-2021), we developed many predictive models using
more or less complex techniques, including time series models, regression and
autoregression models, MLNANs, and composite models. The results obtained so far and
presented in the current paper confirmed all three Suppositions.
In practice, there should be more experiments and improvements, especially in the
automation of the procedures, because many of the analyses, tests, and other experiments
were done manually. This will be the goal of future research, which will be conducted in
the following years.
References
Alpaydin, E. (2020) Introduction to Machine Learning. 4th ed. MIT, pp. xix.
Beer, S. (1959) Cybernetics and Management. London: English University Press.
Brown, S. (2023) ‘Machine Learning, Explained’ Available at: https://mitsloan.mit.edu/
ideas-made-to-matter/machine-learning-explained (Accessed: 02 November 2023).
Burns, Ed. (2017) ‘Deep learning models hampered by black box functionality’.
Available at: http://searchbusinessanalytics.techtarget.com/feature/Deep-learning-models-
hampered-by-black-box-functionality (Accessed: 04 May 2017).
Cureton, E. (1950) ‘Validity, reliability and boloney’. Educ. & Psych. Meas., No. 10,
pp. 94-96.
Cureton, E. (1951) ‘Symposium: The need and means of cross-validation. II.
Approximate linear restraints and best predictor weights’. Educ. & Psychol. Measurement,
No. 11, pp. 12-15.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth P. (1996) ‘From Data Mining to
Knowledge Discovery in Databases’. American Association for Artificial Intelligence
Magazine, Fall, pp. 37-54. Available at: http://www.kdnuggets.com/gpspubs/aimag-kdd-
overview-1996-Fayyad.pdf (Accessed: 02 November 2023).
Günnemann, St., Kremer, H., and Seidl, Th. (2011) ‘An extension of the PMML
standard to subspace clustering models’. Proceedings of the 2011 workshop on Predictive
markup language modeling, p. 48. doi:10.1145/2023598.2023605.
Hastie, T., Tibshirani, R., and Friedman J. (2017) The Elements of Statistical
Learning: data mining, inference, and prediction. New York: Springer.
Herzberg, P. A. (1969) ‘The Parameters of Cross-Validation’ (Psychometric
Monograph No. 16, Supplement to Psychometrika, 34). Richmond, VA: Psychometric
Society. Available at: http://www.psychometrika.org/journal/online/MN16.pdf (Accessed: 02
November 2023).
Hornik, K., Stinchcombe, M. and White, H. (1989) ‘Multilayer feed-forward networks
are universal approximators’. Neural Networks 2, pp. 359–366.
Horst, P. (1941) ‘Prediction of Personal Adjustment’. New York: Social Science
Research Council (Bulletin No. 48)
IBM newsletter. (2023) ‘What is machine learning?’ Available at:
https://www.ibm.com/topics/machine-learning (Accessed: 02 November 2023).
ISO 5725-1. (1994) ‘Accuracy (trueness and precision) of measurement methods
and results - Part 1: General principles and definitions’, p. 1. Available at:
https://www.iso.org/obp/ui/#iso:std:iso:5725:-1:ed-1:v1:en (Accessed: 02 November 2023).
Ivakhnenko, A., and Müller, J-A. (1996) ‘Recent Developments of Self-Organizing
Modeling Prediction and Analysis of Stock Market’. http://www.gmdh.net/articles/index.html
(Accessed: 02 November 2023).
Ivakhnenko, A. G. (1968) ‘Group Method of Data Handling - A Rival of the Method of
Stochastic Approximation’. Soviet Automatic Control, Vol. 1, No. 3, pp. 43-55.