
VANGUARD SCIENTIFIC INSTRUMENTS IN MANAGEMENT, vol. 19, no. 1, 2023, ISSN 1314-0582

Predictive Approach to Model Selection and Validation in Statistical Learning Networks

Author: Mihail Motzev1

Abstract: The best model selection and the validation of the model are key issues in any model-building process. The present paper summarizes the results from international research done in Europe, Australia, and most recently in the United States. It discusses model selection and validation in deep neural networks based on their prediction errors and provides some insights into how to improve their accuracy in a very cost-effective way.
Keywords: models; machine learning; data mining; predictive analytics; accuracy; model
evaluation and selection; validation; artificial neural networks; deep learning networks; statistical
learning networks; group method of data handling; multi-layered networks of active neurons

JEL: C19, C45, C52, C53

1. INTRODUCTION
The best model selection and the validation of the model are key issues in any
model-building process. Procedures and protocols for model verification and validation are
an ongoing field of academic study, research, and development in both simulation technology and practice.
Models are the basis and one of the most important pillars in Data Mining (DM), Machine Learning (ML), and Predictive Analytics. ML and DM often employ the same
methods and overlap significantly, but ML focuses on prediction, based on known properties
learned from the training data, and DM focuses on the discovery of previously unknown
properties in the data, which is the analysis step of knowledge discovery in databases.
Predictive analytics encompasses a variety of techniques from statistics, ML and
DM, which analyze current and historical facts to make predictions about future or otherwise
unknown events. A common definition of predictive analytics is: “The use of statistics and
modeling to determine future performance based on current and historical data. Predictive
analytics look at patterns in data to determine if those patterns are likely to emerge again.
This allows businesses and investors to adjust where they use their resources to take
advantage of possible future events. Predictive analysis can also be used to improve
operational efficiencies and reduce risk.”2

1 Mihail Motzev – Retired professor at Walla Walla University School of Business, College Place WA, USA, [email protected]
2 See: https://www.investopedia.com/terms/p/predictive-analytics.asp (Accessed: 02 November 2023).


Models must be accurate enough in order to maximize the effectiveness of their use
in any of the above fields. Ever increasing model accuracy helps researchers analyze
problems more precisely, which leads to deeper and better understanding. Models with
higher accuracy simply produce better predictions and support managers in making better
decisions that are much closer to the real-life business case.
A common step in any model-building process is model selection. Model selection is the task of choosing the best model from among various candidates on the basis of a performance criterion. Model selection may also refer to the problem of selecting a few representative models from a large set of computational models, most commonly in Artificial Neural Networks (ANNs).
Model validation is a task closely related to model selection, but it does not concern the conceptual design of models so much as it evaluates the consistency between a chosen model and its stated outputs. It is usually defined to mean “substantiation that a computerized model within its domain of applicability possesses a satisfactory range of accuracy consistent with the intended application of the model” (Schlesinger et al., 1979).
The present paper discusses model selection and validation in deep neural networks based on their prediction errors and provides some insights into how to improve their accuracy in a very cost-effective way. It presents results from international research
done in Europe, Australia, and most recently in the United States. It summarizes the
previous research projects discussed in Motzev (2015; 2018a; 2018b; 2019; 2021).

2. ARTIFICIAL INTELLIGENCE, MACHINE LEARNING AND DATA MINING


2.1. Artificial Intelligence and Machine Learning
The term Artificial Intelligence (AI) was coined by John McCarthy at a summer school held at the mathematics department of Dartmouth College in 1956. He defined AI as “the science and engineering of making intelligent machines, especially intelligent computer programs” (McCarthy, 1956, p. 2) in order to distinguish the field from cybernetics and escape the influence of the cyberneticist Norbert Wiener. The term is now used both for the intelligent machines that are the goal and for the science and technology that are aiming at that goal (Lenox, 2020, p. 15).
In the 1990s and early 21st century, mainstream AI achieved great commercial success and academic respectability by focusing on specific sub-problems where it could produce verifiable results and commercial applications, such as Artificial Neural Networks (ANNs) and Statistical Machine Learning.
All this began in 1943, when Warren McCulloch, a neurophysiologist, along with a mathematician named Walter Pitts, authored a paper that threw light on neurons and how they work. They created a model with electrical circuits, and thus a neural network was born3.

3 See: https://www.mygreatlearning.com/blog/what-is-machine-learning/ (Accessed: 02 November 2023).


The famous “Turing Test” was created in 1950 by Alan Turing to ascertain whether computers have real intelligence (Turing, 1950). To pass the test, a computer has to make a human believe that it is not a computer but a human. Arthur Samuel developed the first computer program that could learn as it played the game of checkers in 1952. The first neural network, called the perceptron, was designed by Frank Rosenblatt in 1957, according to The New Yorker4.
IBM has a rich history with Machine Learning (ML) and according to its studies (IBM,
2023) Arthur Samuel is credited for coining the term “Machine Learning” with his research
(Samuel, 1959) around the game of checkers. Moreover, according to Sara Brown at MIT
Sloan School of Management (Brown, 2023), ML was defined in the 1950s by AI pioneer
Arthur Samuel as “the field of study that gives computers the ability to learn without explicitly
being programmed.”
ML is a branch of AI and computer science which focuses on the use of data and
algorithms to imitate the way that humans learn, gradually improving its accuracy. In general,
ML is an umbrella term for solving problems for which development of algorithms by human
programmers would be cost-prohibitive, and instead the problems are solved by helping
machines "discover" their "own" algorithms (Alpaydin, 2020), without needing to be explicitly
told what to do by any human-developed algorithms.
The big shift happened in the 1990s when ML moved from being knowledge-driven
to a data-driven technique due to the availability of huge volumes of data. Businesses
recognized that the potential for complex calculations could be increased through ML.
ML is an application of AI that uses statistical techniques to enable computers to learn
and make decisions without being explicitly programmed. It is predicated on the notion that
computers can learn from data, spot patterns, and make judgments with little assistance
from humans. Good quality data is fed to the machines, and different algorithms are used to
build ML models to train the machines on this data. The choice of algorithm depends on the
type of data at hand and the type of activity that needs to be automated.
In general, there are seven steps of ML:
1. Gathering Data
2. Preparing that Data
3. Choosing a Model
4. Training
5. Evaluation
6. Hyperparameter Tuning
7. Prediction
The best model selection (i.e. choosing a model) and the validation (i.e. evaluation) of the model are key issues in any model-building process, as we mentioned above. A typical ML process can be summarized with the three major steps of the model-building process as shown in Fig. 1.
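A minimal sketch of these seven steps in Python is given below; it uses scikit-learn and synthetic data, and all values and names in it are illustrative assumptions rather than part of the original research.

# Minimal sketch of the seven ML steps on synthetic data (illustrative only)
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# 1-2. Gathering and preparing data (a synthetic regression problem)
X = rng.normal(size=(200, 3))
y = 50.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# 3-4. Choosing a model and training it
model = Ridge(alpha=1.0).fit(X_train, y_train)

# 5. Evaluation on data that was not used for training
errors = y_test - model.predict(X_test)
print("MFE :", errors.mean())
print("MAPE:", np.mean(np.abs(errors / y_test)) * 100, "%")

# 6. Hyperparameter tuning: repeat steps 3-5 for several alphas and keep the best one
alphas = (1e-3, 1e-2, 1e-1, 1.0, 10.0)
mse = {a: np.mean((y_test - Ridge(alpha=a).fit(X_train, y_train).predict(X_test)) ** 2)
       for a in alphas}
best_alpha = min(mse, key=mse.get)

# 7. Prediction: apply the tuned model to new, unseen inputs
final = Ridge(alpha=best_alpha).fit(X_train, y_train)
print("Forecast for a new observation:", final.predict(rng.normal(size=(1, 3))))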

4 See https://www.newyorker.com/tech/annals-of-technology/hyping-artificial-intelligence-yet-again (Accessed: 02 November 2023).


Fig. 1 Typical Machine Learning process (Source: www.mygreatlearning.com/blog/what-is-machine-learning/ Accessed: 02 November 2023)

2.2. Knowledge Discovery in Databases and Data Mining


Gregory Piatetsky-Shapiro coined the term "Knowledge Discovery in Databases"
(KDD) for the first workshop on the KDD topic in 1989 and this term became more popular
in the AI and ML communities. However, the term Data Mining (DM) became more popular
in the business and press communities.
Though these terms are used interchangeably today, there are significant differences
between them. KDD refers to the overall process of discovering useful knowledge from data,
and DM refers to a particular step in this process. KDD applies to all activities and processes
associated with discovering useful knowledge from aggregate data. Using a combination of
techniques including statistical analysis, neural and fuzzy logic, multidimensional analysis,
data visualization, and intelligent agents, KDD can discover highly useful and informative
patterns within the data that can be used to develop predictive models in a wide variety of
knowledge domains.
In the 1960s, statisticians and economists used terms like data fishing or data dredging to refer to what they considered the bad practice of analyzing data without an a priori hypothesis. The term "data mining" was used in a similarly critical way by economist Michael Lovell (1983), who indicates that the practice "masquerades under a variety of aliases, ranging from 'experimentation' (positive) to 'fishing' or 'snooping' (negative)."
The term data mining appeared around 1990 in the database community, with
generally positive connotations. According to Fayyad et al. (1996, p. 39) “Data mining is a
step in the KDD process (see Fig. 2) that consists of applying data analysis and
discovery algorithms that produce a particular enumeration of patterns (or models)
over the data.” Technically, DM is the application of specific algorithms for extracting
patterns from data.
DM bridges the gap from applied statistics and AI (which usually provide the
mathematical background) to database management by exploiting the way data is stored
and indexed in databases to execute the actual learning and discovery algorithms more
efficiently, allowing such methods to be applied to ever-larger data sets.


Fig. 2 Knowledge Discovery in Databases process (Source: Fayyad et al., 1996, p. 41)

There have been some efforts to define standards for the DM process. For
exchanging the extracted models, in particular for use in predictive analytics, the key
standard is the Predictive Model Markup Language (PMML), which is an XML-based
language developed by the Data Mining Group (DMG) and supported as exchange format
by many DM applications. This standard only covers prediction models, a particular DM
task of high importance to business applications (see Günnemann et al., 2011).
We discussed KDD and DM in detail in Motzev and Lemke (2015) and Motzev (2021). In this paper we will concentrate on the model component of the DM process, its selection, and its validation. In general, DM platforms assist and automate the process of building and training
highly sophisticated models and applying these models to larger datasets. A white paper
published by MicroStrategy (2005, pp. 162-173) describes in detail the DM process and it
emphasizes three main steps:
1. Create a predictive model from a data sample – Advanced statistical and
mathematical techniques like regression analysis and ML algorithms are used to
identify the significant characteristics and trends in predicting responsiveness, and a
predictive model is created using these as inputs.
2. Train the model against datasets with known results – The new predictive
model is applied to additional data samples with known outcomes to validate
whether the model is reasonably successful at predicting the known results. This
gives a good indication of the accuracy of the model. It can then be further trained
using these samples to improve its accuracy.
3. Apply the model against a new dataset with an unknown outcome – Once the
predictive model is validated against the known data, it is used for scoring, which is
defined as the application of a DM model to forecast an outcome.
As we mentioned above, ML and DM often employ the same methods, and ANNs are one of the common techniques used by both. From a DM perspective, ANNs are
just another way of fitting a model to observed historical data in order to be able to make
classifications or predictions. In the following section we will discuss the most advanced
ANNs used in ML and DM – Deep Learning Neural Networks.


Fig. 3 General view of a typical ANN

2.3. Machine Learning, Deep Learning and Artificial Neural Networks


Artificial neural networks (ANNs) are a branch of ML models that are built using
principles of neuronal organization discovered by connectionism in the biological neural
networks constituting animal brains. An ANN is based on a collection of connected units or
nodes (called “artificial neurons”). Each connection, like the synapses in a biological brain,
can transmit a signal to other neurons. An artificial neuron receives signals then processes
them and can signal neurons connected to it.
ANNs are comprised of node layers containing an input layer, one or more hidden
layers, and an output layer (see Fig. 3). Each node has an associated weight and
threshold. If the output of any individual node is above the specified threshold value, that
node is activated, sending data to the next layer of the network. Otherwise, no data is passed
along to the next layer of the network by that node.
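A minimal sketch of this node behavior follows; the input signals, weights, and threshold value are illustrative assumptions.

# A single artificial neuron: weighted sum of inputs compared to a threshold
import numpy as np

def neuron_output(inputs, weights, threshold):
    """Return the node's activation: it passes data on to the next layer only
    when the weighted sum of its inputs exceeds the threshold."""
    weighted_sum = float(np.dot(inputs, weights))
    return weighted_sum if weighted_sum > threshold else 0.0

x = np.array([0.5, 1.2, 0.3])    # signals arriving from the previous layer
w = np.array([0.8, 0.4, 1.0])    # connection weights of this node
print(neuron_output(x, w, threshold=0.2))   # 1.18 > 0.2, so the node "fires"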
Different layers may perform different transformations on their inputs. Signals travel
from the first layer (the input layer) to the last layer (the output layer). There is usually one,
but sometimes more than one, additional layer of units between the input layer and the
output layer. These layers are called hidden layers and the units in them are hidden units.
Since deep learning and machine learning tend to be used interchangeably, it is
worth noting the nuances between the two. ML, deep learning, and ANNs are all sub-fields
of AI. However, as shown in Fig. 4, ANNs are actually a sub-field of ML, and deep learning is a sub-field of ANNs.
The way in which deep learning and ML differ is in how each algorithm learns.
"Deep" ML can use labeled datasets, also known as supervised learning, to inform its
algorithm, but it does not necessarily require a labeled dataset. Deep ML can ingest unstructured
data in its raw form (e.g., text or images), and it can automatically determine the set of
features which distinguish different categories of data from one another. This eliminates
some of the human intervention required and enables the use of larger data sets.
Classical, or "non-deep", ML is more dependent on human intervention to learn.
Human experts determine the set of features to understand the differences between data
inputs, usually requiring more structured data to learn.


Fig. 4 The scope and hierarchy of AI, ML and Deep Learning (Source: https://en.wikipedia.org/wiki/Machine_learning#cite_note-journalimcms.org-22 Accessed: 02 November 2023)

Most modern deep learning models are based on multi-layered ANNs similar to the
one presented in Fig. 5. Deep learning uses multiple layers to progressively extract higher-
level features from the raw input and can refer to "computer-simulate" or "automate"
human learning processes from a source to a learned object. In such a way, the notion
coined as "deeper" learning or "deepest" learning makes sense.
Technically, the “deep” in deep learning just refers to the number of layers in an ANN. A neural network that consists of more than three layers, which would be inclusive of the input and the output, can be considered a deep learning algorithm or a Deep Neural Network (DNN). An ANN that has only three layers is just a basic neural network.

2.4. Deep Learning and Statistical Learning Networks


Deep Neural Networks (DNNs) are ANNs with multiple hidden layers of units between the input and output layers, similar to Fig. 5. The main advantages of DNNs are that
they make it possible to build faster and more accurate predictive models but at the same
time DNNs are difficult to develop and hard to understand.
The neural network-based deep learning models contain a great number of layers
since they rely more on optimal model selection and optimization through model building
and tuning.
ANNs and the deep learning techniques are considered as some of the most
capable tools for solving extraordinarily complex problems. They are data-driven and self-
adaptive in nature, i.e. there is no need to specify a particular model form or to make any a
priori assumption about the statistical distribution of the data. Perhaps their greatest
advantage is the ability to be used as an arbitrary function approximation mechanism that
“learns” from observed data. According to Hornik et al. (1989) ANNs and DNNs are universal
functional approximators and can deal with situations where the input data are erroneous,
incomplete, or fuzzy.


Fig. 5 General scheme of Multi-Layered ANNs

However, to have an extremely robust ANN (DNN or other type of ANN), the model, the cost function, and the learning algorithm must be appropriately selected. Like typical ANNs,
many issues can arise with DNNs if they are naively trained. As we discussed in Motzev
and Lemke (2015) and Motzev (2018a; 2018b; 2021), the most common issues are
overfitting and computation time, due to increased model and algorithmic complexity which
results in very significant computational resource and time requirements. Other important
questions such as “How can secondary data series be inferred for the network generating
process?” or “What should the number of input nodes for the ANN be (i.e. what is the order
of the model)?” should be considered too. Furthermore, concerning the ANNs architecture,
other questions need proper addressing such as “What should the number of hidden nodes
be?” or “Which is the best activation function in any given instance?”
It is worth pointing out the fact that “The first general, working learning algorithm for
supervised, deep, feedforward, multilayered perceptron(s) (see Fig. 6) was published by
Ivakhnenko and Lapa in 1967” (https://en.wikipedia.org/wiki/Deep_learning > History).
Alexey G. Ivakhnenko (1968) introduced the Group Method of Data Handling
(GMDH)5 as an inductive approach to model building based on self-organization principles.
It is also referred to as Polynomial Neural Networks, Abductive and Statistical Learning
Networks (SLNs) (see http://gmdh.net/). For many years, GMDH has proven to be one of
the most successful methods in SLNs (Müller and Lemke, 2003; Onwubolu, 2009).
Statistical Learning Theory deals with the problem of finding a predictive function
(model) based on a given data set (see Fig 7). In general, it is a framework for ML drawing
from the fields of statistics and functional analysis (Hastie et al. 2017; Mohri et al. 2012).
In GMDH algorithms, models are generated adaptively from input data in the form of
an ANN of active neurons in a repetitive generation of populations of competing partial
models of growing complexity. A limited number is selected from generation to generation
by cross-validation, until an optimal complex model is finalized.

5 GMDH is a method of inductive statistical learning. See http://en.wikipedia.org/wiki/Group_Method_of_Data_Handling (Accessed: 02 November 2023).


Fig. 6 General scheme of GMDH Self-Organizing modeling algorithm (Source: Madala & Ivakhnenko, 1994, p. 8)

This modeling approach (see Fig. 6) grows a tree-like network out of data of input
and output variables in a pair-wise combination and competitive selection from a single
neuron to a final output – a model without predefined characteristics. Here, neither the
number of neurons and the number of layers in the network, nor the actual behavior of each
created neuron is predefined. The modeling is self-organizing because the number of
neurons, the number of layers, and the actual behavior of each created neuron are identified
during the learning process from layer to layer.
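The following is a minimal, illustrative GMDH-style sketch of this idea (not the implementation used in the cited research): pairs of inputs feed quadratic partial models; the best candidates, judged on a separate validation split (the external criterion), become inputs to the next layer; and growth stops when the validation error no longer improves. The data and parameter values are assumptions for demonstration only.

# Minimal GMDH-style self-organizing modeling sketch (illustrative only)
import itertools
import numpy as np

def fit_partial(x1, x2, y):
    """Least-squares fit of y = a0 + a1*x1 + a2*x2 + a3*x1*x2 + a4*x1^2 + a5*x2^2."""
    A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_partial(coef, x1, x2):
    A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
    return A @ coef

def gmdh(X_train, y_train, X_val, y_val, keep=4, max_layers=5):
    train, val = X_train, X_val
    best_err, best_pred = np.inf, None
    for _ in range(max_layers):
        candidates = []
        # Pair-wise combination of the current layer's inputs
        for i, j in itertools.combinations(range(train.shape[1]), 2):
            coef = fit_partial(train[:, i], train[:, j], y_train)
            out_tr = predict_partial(coef, train[:, i], train[:, j])
            out_va = predict_partial(coef, val[:, i], val[:, j])
            err = np.mean((y_val - out_va) ** 2)      # external criterion (validation MSE)
            candidates.append((err, out_tr, out_va))
        candidates.sort(key=lambda c: c[0])
        survivors = candidates[:keep]                  # competitive selection
        if survivors[0][0] >= best_err:                # stop when validation error stops improving
            break
        best_err, best_pred = survivors[0][0], survivors[0][2]
        # Outputs of the surviving partial models feed the next layer
        train = np.column_stack([c[1] for c in survivors])
        val = np.column_stack([c[2] for c in survivors])
    return best_err, best_pred

# Assumed data, for demonstration only
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 5))
y = 3 + X[:, 0] * X[:, 1] - 0.5 * X[:, 2] ** 2 + rng.normal(scale=0.1, size=120)
err, _ = gmdh(X[:80], y[:80], X[80:], y[80:])
print("Validation MSE of the selected model:", err)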
SLNs such as GMDH can address the common problems of ANNs and the DNNs in
particular (see Table 1). “One of the great things about deep learning is that users can
essentially just feed data to a neural network, or some other type of learning model, and
the model eventually delivers an answer or recommendation. The user does not have to
understand how or why the model delivers its results; it just does. But some enterprises are
finding that the black box nature of some deep learning models -- where their functionality
isn't seen or understood by the user -- isn't quite good enough when it comes to their most
important business decisions” (Burns, 2017).
Fig. 7 Statistical Learning Theory (Source: unknown)


The lack of transparency into how deep learning models work is keeping some businesses from accepting them fully, and special techniques should be used to
address this problem. Müller and Lemke (1995) provided comparisons and pointed out that
in distinction to typical ANNs, the results of GMDH algorithms are explicit mathematical
models that are generated in a relatively short time on the basis of even small samples.
Another important problem, for example, is that deep learning and neural network algorithms can be prone to overfitting. Following Gödel's (1931; 2001) incompleteness theorems and Beer (1959, p. 280), only external criteria, calculated on new, independent information (cross-validation in GMDH, see Fig. 8), can produce the minimum of the prediction model error.
SLNs also make it possible to overcome the common problems in designing the DNN topology, which is in general a trial-and-error process with no rules on how to use theoretical a priori knowledge in DNN design, selecting the number of input nodes, and so on (Müller and Lemke, 2003). In summary, GMDH algorithms combine the best features of ANNs and statistical techniques in a powerful and cost-effective way.

3. BEST MODEL SELECTION AND VALIDATION


3.1. Cross-Validation
As we have already mentioned, models must be accurate enough in order to
maximize the effectiveness of their use. To achieve high accuracy we must provide a procedure which selects the best model among various candidates on the basis of a performance criterion. Another important step in any model-building process is
the model validation, which does not concern so much the conceptual design of models
but evaluates the consistency between a chosen model and its stated outputs.
Fig. 8 Cross-Validation in GMDH (Source: GMDH.net Accessed: 02 November 2023)


Tab. 1 ANNs and SLNs - characteristics and comparisons

Source: Ivakhnenko & Müller, 1996

One of the most useful methods in selection problems is the cross-validation (also
called rotation estimation, out-of-sample testing, predictive sample reuse, external criteria)
method. The main idea is simply splitting the data into two parts, using one part to derive a
prediction rule and then judge the goodness of the prediction by matching its outputs with
the rest of the data (hence the name cross validation). In other words, cross-validation is
a method for model selection according to the predictive ability of the model. One of the
appealing characteristics of cross-validation is that it is applicable to a wide variety of
problems, thus giving rise to applications in many areas. Examples include, but are not
limited to, the choice of smoothing parameters in nonparametric smoothing and variable
selection in regression.
Cross-validation is also a model validation technique for assessing how the results
of a data analysis will generalize to an independent dataset (Fig. 9). It is mainly used in
studies where the goal is prediction, and we want to estimate how accurately a predictive
model will perform in practice. In the prediction problem, a model is usually given a dataset
of known data on which training is run (training dataset), and a dataset of unknown data
(or first seen data) against which the model is evaluated (testing or validating dataset).
The goal of cross-validation is to evaluate the model's ability to predict new data
that was not used in estimating it, in order to flag problems like overfitting or selection bias
and to give an insight on how the model will generalize to an independent (i.e., an unknown) dataset. In summary, cross-validation combines (averages) measures of fitness in prediction to derive a more accurate estimate of model prediction performance.
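A minimal k-fold sketch of this idea is given below (illustrative only): the data are split into k parts, each part in turn serves as the testing set while the rest is used for training, and the k fitness measures are averaged. The fitting rule and data are assumptions.

# Minimal k-fold cross-validation sketch (illustrative only)
import numpy as np

def k_fold_cv(X, y, fit, predict, k=5):
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for f in folds:
        test_mask = np.zeros(len(y), dtype=bool)
        test_mask[f] = True
        model = fit(X[~test_mask], y[~test_mask])              # derive the prediction rule
        errors = y[test_mask] - predict(model, X[test_mask])   # judge it on held-out data
        scores.append(np.sqrt(np.mean(errors ** 2)))           # fitness measure on this fold
    return float(np.mean(scores))                              # averaged fitness measure

# Example with a simple least-squares rule on assumed data
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.3, size=100)

fit = lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0]
predict = lambda coef, A: A @ coef
print("Cross-validated root mean squared error:", k_fold_cv(X, y, fit, predict))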
It should be noted that the concept of Cross Validation is an old one. Larson (1931)
employed random division of the sample in an educational multiple-regression study to
investigate the "shrinkage of the coefficient of multiple correlation between the fabrication
and trying out of predictors.” Horst (1941) in his research found a "drop in predictability"
between an "original" sample and a "check" sample that depended strongly on the method
of construction of the predictor. Nicholson (1960) addressed the shrinkage phenomenon
theoretically and proposed an alternative measure of prediction efficiency. Herzberg (1969)
made a detailed theoretical and numerical study of predictor construction methods, using
cross-validatory assessment.
In 1951 a symposium “The need and means of cross-validation” was held with a
few different sections dedicated to the “I. Problem and designs of cross-validation”, “II.
Approximate linear restraints and best predictor weights”, “III. Cross-validation of item
analyses”, and “IV. Comparison of cross-validation with statistical inference of betas and
multiple R from a single sample”. Mosier (1951), Cureton (1950), Katzell (1951) and Wherry
(1951) contributed separate papers to this symposium.
In fact, the fundamentals of cross-validation could be found in Gödel’s
Incompleteness Theorems (Gödel, 1931; 2006). These two theorems of mathematical
logic are concerned with the limits of provability in formal axiomatic theories.
First Incompleteness Theorem states: "Any consistent formal system F within
which a certain amount of elementary arithmetic can be carried out is incomplete, i.e., there
are statements of the language of F which can neither be proved nor disproved in F."
The unprovable statement G(F) referred to by the theorem is often referred to as "the
Gödel sentence" for the system F. The proof constructs a particular Gödel sentence for the
system F, but there are infinitely many statements in the language of the system that share
the same properties.
Each effectively generated system has its own Gödel sentence. It is possible to define
a larger system F' that contains the whole of F plus G(F) as an additional axiom. This will
not result in a complete system, because Gödel's theorem will also apply to F', and thus F'
also cannot be complete. In this case, G(F) is indeed a theorem in F', because it is an axiom.
Since G(F) states only that it is not provable in F, no contradiction is presented by its
provability within F'. However, because the incompleteness theorem applies to F', there will
be a new Gödel statement G(F') for F', showing that F' is also incomplete. G(F') will differ
from G(F) in that G(F') will refer to F', rather than F.
The first incompleteness theorem shows that the Gödel sentence G(F) of an
appropriate formal theory F is unprovable in F. Because this unprovability is exactly what
the sentence (indirectly) asserts, the Gödel sentence is, in fact, true. For this reason, the
sentence G(F) is considered to be "true but unprovable." However, since the Gödel sentence
cannot itself formally specify its intended interpretation, the truth of the sentence G(F) may
only be arrived at via a meta-analysis from outside the system (i.e. using external
criteria which is the cross-validation technique).


Compared to the theorems stated in Gödel's (1931) paper, many contemporary statements of the incompleteness theorems are more general in two ways. These
generalized statements are phrased to apply to a broader class of systems, and they are
phrased to incorporate weaker consistency assumptions.
Gödel demonstrated the incompleteness of the system of Principia Mathematica
(particular system of arithmetic), but a parallel demonstration could be given for any effective
system of a certain expressiveness. Gödel commented on this fact in the introduction to his
paper but restricted the proof to one system for concreteness. In modern statements of the
theorem, it is common to state the effectiveness and expressiveness conditions as
hypotheses for the incompleteness theorem, so that it is not limited to any particular formal
system.
The first incompleteness theorem states that no consistent system of axioms whose
theorems can be listed by an effective procedure (i.e., an algorithm) is capable of proving
all truths about the arithmetic of natural numbers. For any such consistent formal system,
there will always be statements about natural numbers that are true, but that are unprovable
within the system. The second incompleteness theorem, an extension of the first, shows
that the system cannot demonstrate its own consistency - a consistent theory is one that
does not lead to a logical contradiction.
The semantic definition states that a theory is consistent if it has a model, i.e., there
exists an interpretation under which all formulas in the theory are true. The syntactic
definition states that a theory T is consistent if there is no formula f such that both f and its negation ¬f are elements of the set of consequences of T.
For each formal system F containing basic arithmetic, it is possible to canonically
define a formula Cons(F) expressing the consistency of F. Gödel's second incompleteness
theorem shows that, under general assumptions, this canonical consistency statement
Cons(F) will not be provable in F.
It is important to note that the idea of external criteria and cross-validation in SLNs
and DNNs was implemented a long time ago. It was in his early publications where A.G.
Ivakhnenko (1971, p.369) clearly explained that all experimental data had to be divided into
training and testing sets.
Fig. 9 General Cross-Validation procedure in Data Analysis


In statistics and econometrics, cross-validation was recognized and utilized with the
publications of M. Stone (1974; 1977a; 1977b) who championed this method in his works. It
was in his very early paper (Stone, 1974) that he drew researchers’ attention to a general
procedure of splitting a sample into two parts and formulated a characterization of this
procedure so that researchers can attempt to place it in proper relation to the more standard
methods. This simple idea of splitting a sample into two and then developing the hypothesis
on the basis of one part and testing it on the remainder (i.e. the cross-validation as shown
in Fig. 9) was one of the most seriously neglected ideas in statistics for a long period of time
until the 1990s when, as mentioned already, DM bridged the gap from applied statistics
and ML to database management.

3.2. Accuracy as a Performance Criterion in Model’s Selection and Validation


In Motzev (2019) we discussed in detail different algorithms and steps when selecting
training and testing sets both for cross-sectional and time series data. Since the task is to
select a model among various candidates on the basis of a performance criterion and to
choose the best model (or a few representatives from a large set of computational models
as in ANNs), it is also important to identify this performance criterion (or criteria).
The purpose of publication ISO 5725-1 of the International Organization for
Standardization (ISO) 5725 series of standards in 1994 (last reviewed and confirmed in
2018) “is to outline the general principles to be understood when assessing accuracy
(trueness and precision) of measurement methods and results, and in applications, and to
establish practical estimations of the various measures by experiment” (ISO 5725-1, 1994,
p.1). This ISO defined accuracy as describing a combination of both types of observational
error (random and systematic); therefore, high accuracy requires both high precision and
high trueness (reality).
Accuracy and Cost are two of the most important factors used to evaluate a model’s
effectiveness. Of course there are other important factors such as the time to gather and
analyze the data, computational power and software, availability of historical data, and time
horizon of the analysis (prediction), but it is more important that the degree of accuracy
should be clearly stated. This will enable users to plan for possible errors and will provide a
basis for comparing alternative models.
It is also important to understand that there is no such thing as absolute accuracy
and raising the level of accuracy increases the cost but does not necessarily increase the
value of information provided by the model. Decision makers need information which can be
used and create value. That is, the information obtained from a model is valuable to a
business only when it is used and leads to actions which create value or market behavior
that gives a competitive advantage.
The information which managers expect to obtain from a model should be relevant
information to assist them in planning, controlling, and decision making. According to Lucey
(1991, p. 12), relevant information is information which increases knowledge, reduces
uncertainty, and is usable for the intended purpose. Reducing uncertainty is closely related
to accuracy, i.e. information must be sufficiently accurate for its purpose and to be relied
upon by the decision maker who will use it and for the purpose for which it is intended.

14
VANGUARD SCIENTIFIC INSTRUMENTS IN MANAGEMENT, vol. 19, no. 1, 2023, ISSN 1314-0582

To evaluate the accuracy we have to use genuine data. Predictions are not perfect,
and their results usually differ from the real-life values. The difference between the actual
value and the predicted value for the corresponding period is referred to as a prediction
error. It is defined as the actual value of the outcome minus the value predicted with the model (1):

et = yt – Ft (1)

where et is the error at period t (t={1, 2, 3...N}).


- N is the prediction interval (or the size of the dataset).
- yt is the actual value at period t.
- Ft is the prediction calculated with the model for period t.
The expected value of all prediction errors, i.e. the Mean Forecast Error (MFE) (2)
is the basic measure of model’s accuracy. It:
• is a measure of the average deviation of predicted values from the actual ones.
• shows the direction of the errors and, thus, it measures the Forecast Bias.
• is a measure in which the effects of positive and negative errors cancel out
and there is no way to know their exact amount.
• can have zero value but it does not mean that forecasts are perfect with no
error, rather, it only indicates that the forecasts are on proper target.
• does not penalize extreme errors.
• depends on the scale of measurement and it is also affected by data
transformations.
• is desirable to be as close to zero as possible for a good model, i.e. a model
with a minimum bias which means high accuracy.

MFE = (1/N) ∑(t=1..N) et    (2)

A good forecasting method will provide a fitted model with zero MFE calculated on
the training dataset. Since the selection and validation of the model are done using the
testing dataset it is almost impossible to have an MFE with a zero value and the goal is to
minimize the bias. Then, the forecasts can be improved by adjusting the forecasting model
by an additive constant that equals the MFE of the unadjusted errors.
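A tiny sketch of this additive adjustment is shown below; the actual and unadjusted forecast values are assumed for illustration.

# Additive bias adjustment using the MFE of the unadjusted errors (assumed values)
import numpy as np

actual     = np.array([50.0, 52.0, 49.0, 55.0])
unadjusted = np.array([48.0, 50.0, 47.0, 52.0])
mfe = np.mean(actual - unadjusted)        # average error = remaining bias
adjusted = unadjusted + mfe               # shift the forecasts by the bias
print("MFE:", mfe, "adjusted forecasts:", adjusted)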
Predictions of outcomes are rarely precise, and a researcher can only endeavor to
make the inevitable errors as small as possible. Each model represents a real-life process
or system with some accuracy related to the particular size of the prediction error. It is
good to know some general facts, for instance, the fact that the prediction accuracy
decreases as the time horizon increases, i.e. short-range predictions usually contend with
fewer uncertainties than longer-range predictions, and thus they tend to be more accurate.
However, it is more important to know the degree of each particular model’s accuracy.


3.3. Measures of Trueness (Bias) and Precision


In Motzev (2019; 2021) we discussed in detail the problem of models’ accuracy and
how to measure it using the prediction error. The prediction error should always be
calculated using actual data as a base. Traditionally, measures of fit are used to evaluate
how well the model matches the actual values.
There are different measures for judging accuracy of a fitted model. Each of these
measures has some unique properties which are different from the properties of the other
measures. Which metrics should be used depends on the particular case and its specific
goals.
Example: Given the predictions (Ft) for a variable (yt) and their errors (et) at period t
for a dataset t={1, 2, 3...N}.
Supposition One – it is important to consider more than one evaluation criterion – this
will help to obtain a reasonable knowledge about the amount, magnitude, and direction of
the overall model error. The most important criteria are described below.
3.3.1. Mean Percentage Error (MPE)
MPE (3) is useful when it is necessary to determine a model’s bias (i.e. the general
tendency of prediction to be too high or too low). In case of unbiased prediction, MPE will
produce a value that is close to zero. If the prediction overestimates the real-life system, MPE will have large negative values; on the contrary, large positive values indicate the system is consistently underestimated. A disadvantage of this measure is that it is undefined
whenever a single actual value is zero.

MPE (%) = (1/N) ∑(t=1..N) (et / yt) × 100    (3)

In terms of its computation (3), MPE is an average percentage error with the
following properties. It:
• represents the percentage of average error which occurred while forecasting.
• is independent of the scale of measurement but affected by data
transformations.
• (like MFE) shows the direction of errors that occurred.
• does not penalize extreme deviations.
• is, like MFE, affected by opposite-signed errors cancelling each other out, i.e., a value of MPE close to zero does not let us conclude that the model is perfect.
• should, for a good model with a minimum bias, be as close to zero as possible.
3.3.2. Mean Absolute Percentage Error (MAPE)
MAPE (4) puts errors in perspective. It is useful when the size of the predicted variable is important in evaluating the error. It provides an indication of how large the model errors are in comparison to the actual values. It is also useful for comparing the accuracy of different models on the same or different data.

MAPE (%) = (1/N) ∑(t=1..N) (|et| / yt) × 100    (4)
MAPE's important features are:
• As a measure, it represents the percentage of average absolute error that
occurred.
• it is independent of the scale of measurement but affected by data
transformations.
• unlike MFE, MAPE does not show the direction of error.
• it does not penalize extreme deviations.
• opposite signed errors do not offset each other (i.e., the effects of positive and
negative errors do not cancel out).
• for a good model, the calculated MAPE should be as small as possible.
3.3.3. Mean Squared Error (MSE) & Mean Squared Prediction Error (MSPE)
MSE or MSPE (5) is the average of the squared errors for the testing dataset, i.e. the
differences between the actual and the predicted values at period t:
MSPE = ∑(t=1..N) (et)² / (N − 1)    (5)

Technically, the MSPE is the second moment about the origin of the error, and thus
incorporates both the variance of the estimator and its bias. For an unbiased estimator, the
MSPE is the variance of the estimator. Like the variance, MSPE has the same units of
measurement as the square of the quantity being estimated. It has the following properties:
• it is a measure of average squared deviation of forecasted values.
• since the opposite signed errors do not offset one another, MSPE gives an
overall idea of the error.
• it penalizes extreme errors (it squares each) which occurred while forecasting.
• MSPE emphasizes the fact that the total model error is in fact greatly affected
by large individual errors, i.e. large errors are more expensive than small ones.
• MSPE does not provide any idea about the direction of overall error.
• it is sensitive to the change of scale and data transformations.
• although MSPE is a good measure of overall forecast error, it is not as intuitive
and easily interpretable as the other measures discussed above.
Because of these disadvantages, researchers mostly use the MSPE square root.
3.3.4. Root Mean Squared Prediction Error (RMSPE)
RMSPE (6) is the square root of calculated MSPE. In an analogy to the standard
deviation, taking the square root of MSPE yields the root-mean-squared-prediction error
(RMSPE), which has the same units as the quantity being estimated:

RMSPE = √MSPE (6)


Unlike MSPE, RMSPE measures the model’s error in the same units as the original
data and it has easy and clear business interpretation. It is a measure of average squared
deviation of predicted values and since the opposite signed errors do not offset one another,
RMSPE gives an overall idea of the occurring error.
Most importantly, RMSPE (like MSPE) emphasizes the fact that the total error is much
affected by large individual errors. It is the square root of the average of squared errors. The
effect of each error on RMSPE is proportional to the size of the squared error, i.e. larger
errors have a disproportionately large effect on RMSPE.
The RMSPE serves to aggregate the magnitudes of the errors in predictions for
various data points into a single measure of predictive power. Thus, it is a good measure of
accuracy, but only to compare errors of different models for a particular variable (or data
set) and not between variables (datasets), as it is scale-dependent. Another important
application of RMSPE is that it can be used directly to construct prediction intervals (see
Section 3.4 below).
3.3.5. Coefficient of Variation of the RMSPE CV(RMSPE)
CV(RMSPE) (7) is the RMSPE normalized to the mean of the real values:

CV(RMSPE) = RMSPE/ȳ (7)

It is the same concept as the coefficient of variation (CV) in statistics except that
RMSPE replaces the standard deviation. The CV is useful because the standard deviation
of data must always be understood in the context of the mean of these data. In contrast, the
actual value of the CV is independent of the unit in which the measurement has been taken,
so it is a dimensionless number. For comparison between datasets with different units or
widely different means, we should use the CV instead of the standard deviation. The smaller
the CV(RMSPE) value, the better the model.
When the mean value is close to zero, the CV(RMSPE) will approach infinity and is
therefore sensitive to small changes in the mean. This is often the case if the values do not
originate from a ratio scale. Finally, unlike the RMSPE, it cannot be used directly to construct
prediction intervals.
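The following sketch computes the measures (1)-(7) defined above for a testing dataset; the actual and predicted values are illustrative assumptions.

# Accuracy measures (1)-(7) for a testing dataset (assumed values)
import numpy as np

def accuracy_measures(actual, predicted):
    e = actual - predicted                                  # prediction errors (1)
    n = len(e)
    mfe   = e.mean()                                        # (2) bias / trueness
    mpe   = np.mean(e / actual) * 100                       # (3) signed percentage error
    mape  = np.mean(np.abs(e) / actual) * 100               # (4) unsigned percentage error
    mspe  = np.sum(e ** 2) / (n - 1)                        # (5) penalizes extreme errors
    rmspe = np.sqrt(mspe)                                   # (6) same units as the data
    cv    = rmspe / actual.mean()                           # (7) scale-free RMSPE
    return {"MFE": mfe, "MPE%": mpe, "MAPE%": mape,
            "MSPE": mspe, "RMSPE": rmspe, "CV(RMSPE)": cv}

y_actual    = np.array([102.0, 98.0, 110.0, 105.0, 99.0, 108.0])
y_predicted = np.array([100.0, 101.0, 107.0, 106.0, 97.0, 110.0])
for name, value in accuracy_measures(y_actual, y_predicted).items():
    print(f"{name:10s} {value: .3f}")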
Supposition One – inferences:
As we can see, each of the measures above has some unique properties different
from the other measure’s properties and which metrics should be used depends on the
particular case and its specific goals. Experienced researchers normally use the criteria
MPE, MAPE, RMSPE, and CV(RMSPE) together. The main benefit of this group is that it
provides good information about both the bias and the precision of the model.
RMSPE and MPE represent the Trueness (Systematic error, Statistical Bias) and
thus measure model usefulness and reliability. MPE shows the direction of the bias and
RMSPE represents its absolute value and can be used directly to construct error’s prediction
intervals. When selecting a good model based on a testing dataset, it is desirable that both
criteria should be as close to zero as possible.


To measure the model’s precision (i.e. its random error), we should use MAPE and
CV(RMSPE) in tandem. Since CV(RMSPE) penalizes extreme errors (i.e. it is sensitive to
outliers) and MAPE does not, the researchers’ first goal should be to select a model where
the calculated values of both criteria are very close (almost equal), i.e. there are no extreme
error values. The second goal, as usual, is that these criteria values are as close to zero as
possible. Another advantage is that MAPE and CV(RMSPE) are the most common measures used to compare the accuracy of two different models and select the best one (we discuss this in more detail in Motzev, 2021, pp. 65-96).

3.4. Model Evaluation and Selection


In Motzev (2018a; 2021) we presented a highly automated procedure for developing
SLNs for business simulations in the form of Multi-Layered Networks of Active Neurons
(MLNAN) using GMDH (which is in fact a DNN) and in Motzev and Pamukchieva (2019) we
discussed the problem of accuracy in business simulations providing some insights into how
to address it in a cost-effective way using the SLNs in the form of MLNAN.
The framework for developing MLNANs proposed in Motzev (2018a) was used in the Business Forecasting and Predictive Analytics class at the Walla Walla University School of Business. For five years in a row (2017-2021), we developed many predictive models using
more or less complex techniques including time series models, regression and
autoregression models, MLNANs and composite models.
In all cases, the models’ evaluation and selection was done using multiple criteria. As
presented in Tables 2 and 3, all the best models have very close values of MAPE and
CV(RMSPE). RMSPE values are not presented because they were not used directly, and
CV(RMSPE) also gives a good indication about the models’ accuracy6.
Supposition Two – predictions (i.e. the values calculated by the model for the testing dataset) are closer to intervals than to a single point. As mentioned above, predictions are not perfect, and their results usually differ from the real-life values. Consequently, it is better to consider the calculated values as intervals rather than point estimates (see Fig. 10).
In a forecasting process, we estimate the middle of the range of possible values the
predicted variable could take. Thus, it is better to present a forecast by a prediction interval
giving a range of values with relatively high probability for the variable.
Tab. 2. Sales Predictions Accuracy using different models in 2018
              Best Model     Second Best           Third Best
Model         MLNAN          Triple Exponential    Multiple Autoregression
MPE           1.42%          -0.57%                2.03%
MAPE          1.42%          1.76%                 2.58%
CV(RMSPE)     1.56%          2.45%                 3.17%
Source: Own data

6 In fact, the main goal of these experiments was to prove the advantages of utilizing SLNs in business simulations and that SLNs such as MLNAN provide opportunities in both shortening the design time and reducing the cost and efforts in model building, as well as developing reliably even complex models with high level of accuracy.


Tab. 3. Sales Predictions Accuracy using different models in 2019
              Best Model     Second Best                         Third Best
Model         MLNAN          Multiple Regression with Time       Triple Exponential
                             and Dummy Seasonal Variable
MPE           1.55%          -1.09%                              -0.57%
MAPE          1.55%          1.59%                               1.76%
CV(RMSPE)     1.56%          1.56%                               2.45%
Source: Own data

In the point estimate researchers try to find a unique point (average prediction) in the
parameter space which can reasonably be considered as the true value of the parameter.
On the other hand, instead of unique estimate of the parameter, we are interested in
constructing a family of sets that contain the true (unknown) parameter value with a specified
probability. In many problems we want not only to calculate a single value (the average
prediction) of the parameter, but to get a lower and upper bound for it as well.
For this purpose, researchers construct a prediction interval. We can calculate the
upper and lower limits of the intervals from the given data using the RMSPE. This estimation
provides a range of values where the parameter is expected to lie. It generally gives more
information than point estimates and is preferred when making inferences. For easy and
better understanding, often the upper limit of the interval is called optimistic (or Maximum)
prediction and the lower limit pessimistic (or Minimum) prediction (see Fig. 11).
Supposition Two – inferences:
When selecting the best model, the prediction intervals can improve the selection
process. The MPE which shows the direction of the Systematic error can be used to make
more precise decisions. If its value is very close to zero we should select the average
prediction model. When MPE is negative, we should select the minimum prediction model
and finally, when it is positive, we should select the maximum prediction model.
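A sketch of this selection rule is given below. The interval is built here as the predictions plus or minus one RMSPE, which is an assumption for illustration, and the tolerance for treating MPE as "close to zero" is likewise assumed.

# Choosing among the Average, Minimum, and Maximum versions (illustrative sketch)
import numpy as np

def choose_version(actual, predicted, tolerance=0.5):
    e = actual - predicted
    rmspe = np.sqrt(np.sum(e ** 2) / (len(e) - 1))
    mpe = np.mean(e / actual) * 100
    maximum = predicted + rmspe           # optimistic (upper) prediction
    minimum = predicted - rmspe           # pessimistic (lower) prediction
    if abs(mpe) <= tolerance:             # bias close to zero -> average prediction model
        return "Average", predicted
    if mpe < 0:                           # model overestimates -> minimum prediction model
        return "Minimum", minimum
    return "Maximum", maximum             # model underestimates -> maximum prediction model

# Assumed testing data where the model consistently underestimates
actual    = np.array([120.0, 118.0, 125.0, 130.0, 128.0])
predicted = np.array([116.0, 115.0, 121.0, 127.0, 124.0])
version, forecast = choose_version(actual, predicted)
print(version, forecast)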
Fig. 10 General view of Point Estimate, Confidence and Prediction Intervals


Fig. 11 Another view of the Prediction Intervals

The results presented in Tables 4 and 5 give more information about the selection
process. The good models were first evaluated using multiple criteria and among their Max,
Min, and Average versions (i.e. the prediction interval calculated with RMSPE) the best
model was selected using the prediction bias (the systematic error measured with MPE)
and the model’s precision (i.e. its random error) presented by MAPE and CV(RMSPE).
As we can see, this approach helps to improve the best model selection. For example,
in Case 1 presented in Fig. 12 the best model was the Optimistic (Max) version of MLNAN
and in Case 2 (see Fig 13) the Pessimistic (Min) version of MLNAN.
It should be noted that the dataset in Case 2 was made up in such a way that the best models which students can develop with traditional model-building approaches would have a precision of about 10% random error (measured by MAPE and CV(RMSPE)) on the testing dataset. The goal of the case is to improve the model's accuracy with the help of the selection procedure using model errors, measured by multiple criteria, and prediction intervals.
Fig. 12 Predictions Accuracy using different models in 2020 (Case 1) - Source: Own data


Fig. 13 Different Predictions Accuracy - same model (2020 Case 2) - Source: Own data

The typical case in cross-validation procedures is a (more or less) biased model for the testing dataset, as measured with MPE. The prediction interval allows us to identify the best
version of the model as presented in Fig. 13. These experimental results were confirmed in
2021 as we can see in Tables 4 and 5.
The data in both tables show how the inferences from Supposition Two stated above
help to improve the selection procedure in all three models presented there. The MPE gives
bias’s direction for the general average model and then, the corresponding version (Min or
Max) of the model is chosen. In this way, we reduce both the bias and the random error.
Another important point concerns the computational process.
Unfortunately, most of the existing model building software does not provide options for
constructing prediction intervals and obtaining Min&Max versions of the selected best
models. For this reason, we designed special templates for MS Excel which compute the
most important error measures like MPE, MAPE, RMSPE, and CV(RMSPE) and construct
prediction intervals for models created by other software. These templates are modules that
can be used in any MS Excel file and even modified with MS VBA editor7.

Table 4. Experimental results in 2021 - Case 2 Errors: Different versions of AR model

Source: Own data

7 These templates are part of another research project and will be presented in future editions of the VSIM journal.


Table 5. Experimental Results in 2021 – Case 2 Errors: Different versions of ARMAX model

Source: Own data


In Motzev and Lemke (2015) we presented the DM software “KnowledgeMiner”, which is a self-organizing tool for modeling and predictions that implements GMDH, Analog Complexing, and Fuzzy Rule Induction techniques (Müller and Lemke, 2003).
“KnowledgeMiner”, called “Insights” in its last version, can build linear & nonlinear,
static & dynamic time series models, multi-input/single-output and multi-input/multi-output
models as systems of equations (SE) even from small and noisy data samples. The model
outputs are presented both analytically (as equations with estimated coefficients) and
graphically, by a system graph reflecting the interdependent structure of the system.
“KnowledgeMiner/Insights” has an excellent module for complex evaluation of the synthesized MLNAN (which is in fact a DNN), its adequacy and reliability; however, it does not provide enough options for conducting simulation experiments and what-if analysis. The
above mentioned templates for MS Excel were very useful since Insights has a special
option for direct export of the developed composite models (including Minimum, Maximum,
and Average versions) to a spreadsheet (see Fig. 14).
Fig. 14 KnowledgeMiner software - Source: Own data
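Assuming that the exported Minimum, Maximum, and Average forecasts end up in ordinary spreadsheet columns, a rough sketch of reading such an export and recomputing the main error measures outside the templates could look as follows; the file name and column names below are illustrative assumptions, not the actual export layout of KnowledgeMiner/Insights.

import numpy as np
import pandas as pd

# Hypothetical export file and column names.
df = pd.read_excel("exported_composite_model.xlsx")

actual = df["Actual"].to_numpy(dtype=float)
for version in ["Average", "Minimum", "Maximum"]:
    predicted = df[version].to_numpy(dtype=float)
    pe = (actual - predicted) / actual * 100.0
    print(f"{version}: MPE = {pe.mean():+.2f}%, MAPE = {np.abs(pe).mean():.2f}%")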


Fig. 15 Forecasts with a Composite model and actual data (2021 Case 2) - Source: Own data

3.5. Further Improvements


Supposition Three – the best model is a composite model. There is no single
universal model that works in all possible cases.
As mentioned above, Case 2 was constructed so that the precision of the best model
(developed with traditional model-building approaches) would be about 10% random error (measured
by MAPE and CV(RMSPE) on the training dataset). Many different models were created
with the goal of improving the model's accuracy with the help of the selection procedure.
In Suppositions One and Two, we have seen many positive results from using model
errors measured by multiple criteria, together with prediction intervals. The goal in Supposition
Three is slightly different. Here, we try to improve prediction accuracy with a procedure
which creates composite models of a different type.
In the example presented in Fig. 15, one of my students used a Multiple Regression
model as a basis to predict the time series data. After analyzing the errors of the other models, he
realized that some models had higher accuracy in certain months. Thus, he
generated a better overall composite model which combined four different types of models:
Multiple Regression as a foundation, with substitutes for certain months (a Moving Average
model for June and November, MLNAN Maximum for January and October, and
MLNAN Minimum for the March forecasts). As can be seen, this composite model (see Table 6)
provided the most accurate forecast so far.
Table 6. Composite model errors summary (2021 Case 2)

Source: Own data


This Supposition indeed requires many more experiments and, more importantly,
automation of the procedure, because all of the analyses, tests, and other experiments were
done manually.
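As one possible direction for such automation, the sketch below compares the candidate models month by month and substitutes the foundation model's forecast wherever another model shows a smaller absolute percentage error; the data structures and names are illustrative assumptions, not the procedure that was actually applied by hand.

import numpy as np

def best_model_per_month(actual, candidate_forecasts):
    # For each month, pick the candidate with the smallest absolute percentage
    # error - an automated version of the manual comparison described above.
    # candidate_forecasts maps model names to arrays of monthly forecasts
    # aligned with the actual values.
    actual = np.asarray(actual, dtype=float)
    best = {}
    for m in range(len(actual)):
        errors = {name: abs((actual[m] - np.asarray(f, dtype=float)[m]) / actual[m])
                  for name, f in candidate_forecasts.items()}
        best[m] = min(errors, key=errors.get)
    return best

def build_composite(candidate_forecasts, foundation, best):
    # Start from the foundation model's forecasts and substitute the months
    # where another model was found more accurate.
    composite = np.asarray(candidate_forecasts[foundation], dtype=float).copy()
    for month, name in best.items():
        composite[month] = np.asarray(candidate_forecasts[name], dtype=float)[month]
    return composite

# Usage sketch (names are hypothetical):
# best = best_model_per_month(actual_2021, forecasts_by_model)
# composite = build_composite(forecasts_by_model, "Multiple Regression", best)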
4. CONCLUSIONS
SLNs provide processed data that are needed in the business context, and the
extracted information is useful to decision makers because it creates value or predicts market
behavior in a way which leads to competitive advantages. In order to maximize the
effectiveness of their use, SLN models must be accurate enough. The higher the accuracy,
the better the understanding and the predictions of the analyzed problem. This will support
managers in making better decisions that are much closer to the real-life business problem.
Model selection and validation procedures can help to find the best model in terms
of its accuracy, i.e. the minimum prediction bias (systematic error) and the maximum
precision (or minimum random error). The present paper discusses model selection
and validation in deep neural networks such as SLNs, based on their prediction errors
and intervals. To achieve this goal, three Suppositions were made and examined.
Supposition One: it is important to consider more than one evaluation criterion, which
helps to obtain reasonable knowledge about the amount, magnitude, and direction of
the overall model error.
Supposition Two: predictions (i.e. the values calculated by the model for the testing
dataset) are more likely to fall within intervals than to coincide with a single point, i.e. it is better to
consider the calculated values as interval estimates rather than point estimates.
Supposition Three: the best model is a composite model because there is no single
technique/model that works in every situation.
For five years in a row (2017-2021), we developed many predictive models using
more or less complex techniques including time series models, regression and
autoregression models, MLNANs and composite models. The results obtained so far and
presented in the current paper confirmed all three Suppositions made.
In practice, there should be more experiments and improvements, especially in the
automation of the procedures, because many of the analyses, tests, and other experiments
were done manually. This will be the goal of future research to be conducted in the
following years.

List of acronyms used:


AI - Artificial Intelligence
ANNs - Artificial Neural Networks
CV(RMSPE) - Coefficient of variation of RMSPE
DM - Data Mining
DMG - Data Mining Group
DNNs - Deep Neural Networks
GMDH - Group Method of Data Handling
KDD - Knowledge Discovery in Databases
MAPE - Mean Absolute Percentage Error
MFE - Mean Forecast Error
ML - Machine Learning
MLNAN - Multi-Layered Networks of Active Neurons
MPE - Mean Percentage Error
MSPE - Mean Squared Prediction Error
PMML - Predictive Model Markup Language
RMSPE - Root Mean Squared Prediction Error
SE - simultaneous equations
SLNs - Statistical Learning Networks


References

Alpaydin, E. (2020) Introduction to Machine Learning. 4th ed. MIT, pp. xix.
Beer, S. (1959) Cybernetics and Management. London: English University Press.
Brown, S. (2023) ‘Machine Learning, Explained’. Available at: https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained (Accessed: 02 November 2023).
Burns, Ed. (2017) ‘Deep learning models hampered by black box functionality’. Available at: http://searchbusinessanalytics.techtarget.com/feature/Deep-learning-models-hampered-by-black-box-functionality (Accessed: 04 May 2017).
Cureton, E. (1950) ‘Validity, reliability and boloney’. Educ. & Psych. Meas., No. 10,
pp. 94-96.
Cureton, E. (1951) ‘Symposium: The need and means of cross-validation. II.
Approximate linear restraints and best predictor weights’. Educ. & Psychol. Measurement,
No. 11, pp. 12-15.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996) ‘From Data Mining to Knowledge Discovery in Databases’. American Association for Artificial Intelligence Magazine, Fall, pp. 37-54. Available at: http://www.kdnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf (Accessed: 02 November 2023).
Günnemann, St., Kremer, H., and Seidl, Th. (2011) ‘An extension of the PMML
standard to subspace clustering models’. Proceedings of the 2011 workshop on Predictive
markup language modeling, p. 48. doi:10.1145/2023598.2023605.
Hastie, T., Tibshirani, R., and Friedman J. (2017) The Elements of Statistical
Learning: data mining, inference, and prediction. New York: Springer.
Herzberg, P. A. (1969) ‘The Parameters of Cross-Validation’ (Psychometric Monograph No. 16, Supplement to Psychometrika, 34). Richmond, VA: Psychometric Society. Available at: http://www.psychometrika.org/journal/online/MN16.pdf (Accessed: 02 November 2023).
Hornik, K., Stinchcombe, M. and White, H. (1989) ‘Multilayer feed-forward networks
are universal approximators’. Neural Networks 2, pp. 359–366.
Horst, P. (1941) ‘Prediction of Personal Adjustment’. New York: Social Science
Research Council (Bulletin No. 48)
IBM newsletter (2023) ‘What is machine learning?’ Available at: https://www.ibm.com/topics/machine-learning (Accessed: 02 November 2023).
ISO 5725-1 (1994) ‘Accuracy (trueness and precision) of measurement methods and results - Part 1: General principles and definitions’, p. 1. Available at: https://www.iso.org/obp/ui/#iso:std:iso:5725:-1:ed-1:v1:en (Accessed: 02 November 2023).
Ivakhnenko, A. and Müller, J-A. (1996) ‘Recent Developments of Self-Organizing Modeling Prediction and Analysis of Stock Market’. Available at: http://www.gmdh.net/articles/index.html (Accessed: 02 November 2023).
Ivakhnenko, A. G. (1968) ‘Group Method of Data Handling - A Rival of the Method of Stochastic Approximation’. Soviet Automatic Control, Vol. 1, No. 3, pp. 43-55.


Ivakhnenko, A. and Lapa, V. (1967) Cybernetics and forecasting techniques. American Elsevier Pub. Co.
Ivakhnenko, A. G. (1971) ‘Polynomial Theory of Complex Systems’. IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-1, No. 4, pp. 364-378.
Katzell, R. A. (1951) ‘Symposium: The need and means of cross-validation. III. Cross-
validation of item analyses’. Educ. & Psychol. Measurement, 11, pp. 16-22.
Gödel, K. (1931) ‘Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I’. Monatshefte für Mathematik und Physik, Vol. 38, No. 1, pp. 173-198. doi:10.1007/BF01700692.
Gödel, K. (2001) Collected Works, Vol. I: Publications 1929-1936. Oxford University Press, USA, pp. 144-195. ISBN 978-0195147209. The original German with a facing English translation, preceded by an introductory note by Stephen Cole Kleene.
Larson, S. (1931) ‘The shrinkage of the coefficient of multiple correlation’. Journal
Educ. Psychol., 22, pp. 45-55.
Lenox, J. (2020) 2084 (Artificial Intelligence and The Future of Humanity). Zondervan
Reflective, Grand Rapids, MI, p. 15.
Lovell, Michael C. (1983) ‘Data Mining’. The Review of Economics and Statistics, 65
(1): pp. 1–12. doi:10.2307/1924403. JSTOR 1924403
Lucey, T. (1991) Management Information Systems. DP Publications Lim.
Madala, H. and Ivakhnenko, A. G. (1994) Inductive Learning Algorithms for Complex
Systems Modelling. CRC Press Inc., Boca Raton.
McCarthy, J. (1956) ‘What Is Artificial Intelligence’, p. 2. Available at: http://www-formal.stanford.edu/jmc/whatisai.pdf (Accessed: 02 November 2023).
MicroStrategy (2005) ‘An Architecture for Enterprise Business Intelligence’. White Paper, pp. 162-173. Available at: http://www.microstrategy.com/Publications/Whitepapers (Accessed: 02 November 2023).
Mohri, M., Rostamizadeh, A. and Talwalkar, A. (2012) Foundations of Machine
Learning. The MIT Press
Mosier, C. I. (1951) ‘Problem and designs of cross-validation’. Symposium: The need and means of cross-validation. I. Educ. & Psychol. Measurement, 11, pp. 5-11.
Motzev, M. and Lemke, Fr. (2015) ‘Self-Organizing Data Mining Techniques in Model Based Simulation Games for Business Training and Education’. Vanguard Scientific Instruments in Management, Vol. 11.
Motzev, M. (2018a) ‘A Framework for Developing Multi-Layered Networks of Active
Neurons for Simulation Experiments and Model-Based Business Games Using Self-
Organizing Data Mining with the Group Method of Data Handling’. In: Lukosch H.,
Bekebrede G., Kortmann R. (eds) Simulation Gaming. Applications for Sustainable Cities
and Smart Infrastructures. Lecture Notes in Computer Science, vol 10825. Springer, pp.191-
199. DOI: 10.1007/978-3-319-91902-7_19.


Motzev, M. (2018b) ‘Statistical Learning Networks in Simulations for Business Training and Education’. In: Developments in Business Simulation and Experiential Learning, Vol. 45. Proceedings of the Annual ABSEL Conference, Seattle, WA, pp. 291-301.
Motzev, M. (2019) ‘Prediction Accuracy - A Measure of Simulation Reality’. Vanguard
Scientific Instruments in Management, Vol. 15.
Motzev, M. and Pamukchieva, O. (2021) ‘Accuracy in Business Simulations’. In:
Wardaszko M. (ed.) SIMULATION & GAMING Through Times and Across Disciplines.
Lecture Notes in Computer Science, vol 11988. Springer, pp. 115-126. DOI: 10.1007/978-
3-030-72132-9.
Motzev, M. (2021) Business Forecasting: A Contemporary Decision Making
Approach. 2nd edition. Amazon Kindle. ASIN B09438TP9M, ISBN 978-1-7370258-0-1
Müller, J-A. and Lemke, F. (1995) ‘Self-Organizing modelling and decision support in
economics’. In: Proceedings of the IMACS Symposium on Systems Analysis and Simulation.
Gordon and Breach Publ., pp. 135-138.
Müller, J-A. and Lemke, F. (2003) Self-Organizing Data Mining: An Intelligent
Approach To Extract Knowledge From Data. Trafford Publishing, Canada.
Nicholson, G. E. (1960) ‘Prediction in future samples’. In: Olkin I. et al. (eds)
Contributions to Probability and Statistics. Stanford University Press
Onwubolu, G. (ed) (2009) Hybrid Self-Organizing Modeling Systems. Springer-Verlag
Berlin Heidelberg.
Samuel, A. (1959) ‘Some Studies in Machine Learning Using the Game of Checkers’.
IBM Journal of Research and Development, Vol. 3, No. 3, July, pp. 210-229, DOI:
10.1147/rd.33.0210.
Schlesinger, S. et al. (1979) ‘Terminology for Model Credibility’. Simulation 32(3), pp.
103-104.
Stone, M. (1974) ‘Cross-Validatory Choice and Assessment of Statistical Predictions,
Cross-Validation and Multinomial Prediction’. Journal of the Royal Statistical Society, Vol.36,
No.2, pp. 111-147
Stone, M. (1977a) ‘An Asymptotic Equivalence of Choice of Model by Cross-
Validation and Akaike's Criterion’. Journal of the Royal Statistical Society, Vol. 39, No.1, pp.
44–47.
Stone, M. (1977b) ‘Asymptotics For and Against Cross-Validation’. Biometrika, Vol.
64, No.1, pp. 29-35.
Symposium (1951). ‘The need and means of cross-validation’. Educ. & Psychol.
Measurement, 11.
Turing, A. (1950) ‘I.—COMPUTING MACHINERY AND INTELLIGENCE’. Mind,
Volume LIX, Issue 236, October, pp. 433–460, DOI: 10.1093/mind/LIX.236.433. Available at: https://academic.oup.com/mind/article/LIX/236/433/986238 (Accessed: 02 November 2023).
Wherry, R. J. (1931) ‘A new formula for predicting the shrinkage of the multiple
correlation coefficient’. Ann. Math. Statist., 2, pp. 440-457.
